<img height="400" alt="image" src="https://github.com/user-
attachments/assets/bdd5c0a8-1bbb-4199-9026-57f0e5202d73" />
<img height="400" alt="image" src="https://github.com/user-
attachments/assets/7ea63e58-2ab7-4132-b29e-b20597c7093f" />
We were copying the schema preemptively on each `Database::connect`; now the schema is shared via a single `Arc` until a change needs to be made, at which point it is mutated via `Arc::make_mut`. This is faster and reduces memory usage.
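A minimal sketch of the copy-on-write pattern, with an illustrative `Schema` stand-in rather than the actual turso-db type:

```rust
use std::sync::Arc;

// Illustrative stand-in for the real schema type.
#[derive(Clone, Default)]
struct Schema {
    tables: Vec<String>,
}

fn main() {
    // Every connection clones the Arc, not the schema itself, so
    // connect() no longer pays for a deep copy.
    let shared = Arc::new(Schema::default());
    let conn_a = Arc::clone(&shared);
    let mut conn_b = Arc::clone(&shared);

    // Only when a connection actually alters the schema does make_mut
    // clone the underlying data; other holders keep the old version.
    Arc::make_mut(&mut conn_b).tables.push("t1".to_string());

    assert_eq!(conn_a.tables.len(), 0);
    assert_eq!(conn_b.tables.len(), 1);
}
```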
Closes#2022
### Async IO performance, part 0
Relatively small and focused PR that mainly does two things; it will also add a .md document of the proposed/planned improvements to the io_uring module to fully revamp our async IO.
1. **Registration of file descriptors.**
At startup, calling `io_uring_register_files_sparse` allocates an array in shared kernel/user space with each slot initialized to `-1`. Then, when we open a file, we call `io_uring_register_files_update`, providing an index into this array and the `fd`.
Then for the IO submission, we can reference the index into this array
instead of the fd, saving the kernel the work of looking up the fd in
the process file table, incrementing the reference count, doing the
operation, then finally decrementing the refcount. Instead the kernel
can just index into the array and do the operation.
This especially provides an improvement for cases like ours, where files are kept open for long periods of time and the kernel performs many operations on them.
The eventual goal of this is to use Fixed read/write operations, where both the file descriptor and the underlying buffer are registered with the kernel. There is another branch continuing this work that introduces a buffer pool, which memlocks one large 32MB mmap'd arena and tries to use it wherever possible.
These Fixed operations are essentially the "holy grail" of io_uring
performance (for file operations).
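A minimal sketch of the registration flow using the `io-uring` crate, assuming it exposes `register_files_sparse`/`register_files_update` and `types::Fixed` as shown; the path, table size, and error handling are illustrative only:

```rust
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;

    // At startup: allocate a sparse fixed-file table; every slot starts
    // out empty (-1).
    ring.submitter().register_files_sparse(4)?;

    // When a file is opened, publish its fd into slot 0 of the table.
    let file = std::fs::File::open("/etc/hostname")?;
    ring.submitter()
        .register_files_update(0, &[file.as_raw_fd()])?;

    // Submissions reference the slot index via types::Fixed instead of
    // the raw fd, so the kernel skips the per-operation lookup in the
    // process file table and the refcount bump/drop.
    let mut buf = vec![0u8; 64];
    let read = opcode::Read::new(types::Fixed(0), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(1);
    unsafe { ring.submission().push(&read).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("missing completion");
    println!("read {} bytes", cqe.result());
    Ok(())
}
```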
2. **!Vectored IO**
This is kind of backwards, because the goal is indeed to implement proper vectored IO and I'm removing some of the plumbing for it in this PR - but currently we have been using `Writev`/`Readv` while never submitting more than 1 iovec at a time.
Writes to the WAL, especially, would benefit immensely from vectored IO,
as it is append-only and therefore all writes are contiguous. Regular
checkpointing/cache flushing to disk can also be adapted to aggregate
these writes and submit many in a single system call/opcode.
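To make the eventual benefit concrete, here is a minimal sketch of aggregating contiguous append-only WAL frames into a single vectored write; it uses std's `write_vectored` for brevity, whereas the real implementation would submit one io_uring `Writev` SQE, and the path and frame sizes are illustrative:

```rust
use std::fs::OpenOptions;
use std::io::{IoSlice, Write};

fn main() -> std::io::Result<()> {
    // Three WAL frames that are logically contiguous because the WAL is
    // append-only.
    let frames: Vec<Vec<u8>> = vec![vec![1u8; 4096], vec![2u8; 4096], vec![3u8; 4096]];

    let mut wal = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/tmp/example.wal")?;

    // One writev(2)-style call for all frames instead of one write per
    // frame; partial writes would still need to be handled in real code.
    let iovecs: Vec<IoSlice<'_>> = frames.iter().map(|f| IoSlice::new(f)).collect();
    let written = wal.write_vectored(&iovecs)?;
    println!("wrote {written} bytes in one call");
    Ok(())
}
```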
Until this is implemented, the bookkeeping and iovecs are unnecessary noise/overhead, so let's temporarily remove them and revert to normal `read`/`write` until vectored IO is needed and can be designed from scratch.
3. **Flags**
`setup_single_issuer` hints to the kernel that `io_uring_enter` calls will all be made from a single thread, and `setup_coop_taskrun` removes some unnecessary kernel interrupts for delivering cqe's, which most single-threaded applications do not need. Both flags show a modest performance improvement.
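A minimal sketch of setting both flags, assuming the `io-uring` crate's builder exposes them as `setup_single_issuer`/`setup_coop_taskrun` (both also require reasonably recent kernels):

```rust
use io_uring::IoUring;

fn main() -> std::io::Result<()> {
    // SINGLE_ISSUER promises that only one thread submits, and
    // COOP_TASKRUN lets the kernel defer completion work instead of
    // interrupting the application to deliver cqe's.
    let _ring: IoUring = IoUring::builder()
        .setup_single_issuer()
        .setup_coop_taskrun()
        .build(256)?;
    Ok(())
}
```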
Closes#2127
Enables formatting `Expr::Column` by adding the context to `ToTokens`
instead of creating a new unparsing implementation for each node.
`ToTokens` implemented for:
- [x] `UpdatePlan`
- [x] `Plan`
- [x] `JoinedTable`
- [x] `SelectPlan`
- [x] `DeletePlan`
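A hypothetical sketch of the pattern - the type and field names are illustrative, not the actual turso-db API - showing how passing a context into `ToTokens` lets `Expr::Column` resolve names without a per-node unparsing implementation:

```rust
// Illustrative stand-ins for the planner types.
struct Table {
    name: String,
    columns: Vec<String>,
}

struct DisplayContext {
    tables: Vec<Table>,
}

enum Expr {
    Column { table: usize, column: usize },
    Literal(String),
}

trait ToTokens {
    fn to_tokens(&self, ctx: &DisplayContext, out: &mut String);
}

impl ToTokens for Expr {
    fn to_tokens(&self, ctx: &DisplayContext, out: &mut String) {
        match self {
            // The context is what allows a bare (table, column) index
            // pair to be formatted as "table.column".
            Expr::Column { table, column } => {
                let t = &ctx.tables[*table];
                out.push_str(&t.name);
                out.push('.');
                out.push_str(&t.columns[*column]);
            }
            Expr::Literal(s) => out.push_str(s),
        }
    }
}

fn main() {
    let ctx = DisplayContext {
        tables: vec![Table {
            name: "users".into(),
            columns: vec!["id".into(), "name".into()],
        }],
    };
    let expr = Expr::Column { table: 0, column: 1 };
    let mut sql = String::new();
    expr.to_tokens(&ctx, &mut sql);
    assert_eq!(sql, "users.name");
}
```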
Reviewed-by: Pedro Muniz (@pedrocarlo)
Closes#1949
This PR renames CDC table column names to use "change"-centric terminology and avoid `operation_xxx` column names.
Just a small refactoring to bring more consistency, as `turso-db` refers to the feature as capturing data **changes** - there is no word "operation" there.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#2120
Aftermath of seek-related refactor in #2065, which you can read for
background. The change in this PR is documented pretty well inline - if
we receive a `TryAdvance` seek result when seeking after balancing, we
need to - well - try to advance.
Closes#2116
Closes#2115
## What does this fix
This PR fixes an issue with BTree upwards traversal logic where we would
try to go up to a parent node in `next()` even though we are at the very
end of the btree. This behavior can leave the cursor incorrectly
positioned at an interior node when it should be at the right edge of
the rightmost leaf.
## Why doesn't it cause problems on main
This bug is masked on `main` by every table `insert()` (wastefully)
calling `find_cell()`:
- `op_new_rowid` called, let's say the current max rowid is `666`.
Cursor is left pointing at `666`.
- `insert()` is called with rowid `667`, cursor is currently pointing at
`666`, which is incorrect.
- `find_cell()` does a binary search every time, and hence somewhat accidentally positions the cursor correctly _after_ `666`, so that the insert goes to the correct place.
## Why was this issue found
In #1988, I am removing `find_cell()` entirely in favor of always
performing a seek to the correct location - and skipping `seek` when it
is not required, saving us from wasting a binary search on every insert
- but this change means that we need to call `next()` after
`op_new_rowid` to have the cursor positioned correctly at the new
insertion slot. Doing this surfaces this upwards traversal bug in that
PR branch.
## Details of solution
- Store `cell_count` together with `cell_idx` in the pagestack, so that children can know whether their parents have reached their end without doing IO (see the sketch below)
- To make this foolproof, pin pages on `PageStack` so the page cache
cannot evict them during tree traversal
- `cell_indices` renamed to `node_states` since it now carries more
information (cell index AND count, instead of just index)
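A hypothetical sketch of the node-state bookkeeping - the names are illustrative, not the actual `PageStack` types:

```rust
// Each entry on the page stack remembers both where the cursor is in the
// node and how many cells the node has, so a child can tell whether its
// parent is exhausted without re-reading the page (i.e. without IO).
#[derive(Clone, Copy)]
struct BTreeNodeState {
    cell_idx: usize,
    cell_count: usize,
}

impl BTreeNodeState {
    /// True when the cursor has moved past the last cell of this node,
    /// meaning upwards traversal should stop advancing into it.
    fn at_end(&self) -> bool {
        self.cell_idx >= self.cell_count
    }
}

fn main() {
    let parent = BTreeNodeState { cell_idx: 3, cell_count: 3 };
    // The child can now decide not to ascend: the parent has no more
    // cells, so the cursor should stay at the right edge of the
    // rightmost leaf instead of landing on an interior node.
    assert!(parent.at_end());
}
```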
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2005
## Background
When a divider cell is deleted from an index interior page, the
following algorithm is used:
1. Find predecessor: Move to largest key in left subtree of the current
page. This is always a leaf page.
2. Create replacement: Convert this predecessor leaf cell to interior
cell format, using original cell's left child page pointer
3. Replace: Drop original cell from parent page, insert replacement at
same position
4. Cleanup: Delete the taken predecessor cell from the leaf page
<img width="845" height="266" alt="Screenshot 2025-07-16 at 10 39 18"
src="https://github.com/user-
attachments/assets/30517da4-a4dc-471e-a8f5-c27ba0979c86" />
## The faulty code leading to the bug
The error in our logic was that we always expected to only traverse down
one level of the btree:
```rust
let parent_page = self.stack.parent_page().unwrap();
let leaf_page = self.stack.top();
```
This meant that when the deletion happened on, say, level 1, and the replacement cell was taken from level 3, we actually inserted the replacement cell into level 2 instead of level 1.
## Manifestation of the bug in issue 2106
In #2106, this manifested as the following chain of pages, going from
parent to children:
3 -> 111 -> 119
- Cell to be deleted was on page 3 (whose left pointer is 111)
- Going to the largest key in the left subtree meant traversing from 3
to 111 and then from 111 to 119
- a replacement cell was taken from 119
- incorrectly inserted into 111
- and its left child pointer also set as 111!
- now whenever page 111 wanted to go to its left child page, it would just traverse back to itself, eventually causing a crash because we have a hard limit on the number of pages on the page stack.
## The fix
The fix is quite trivial: store the page we are on before we start
traversing down.
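A hypothetical sketch of the corrected shape, mirroring the snippet above (method names are illustrative, not the actual cursor API):

```rust
// Remember the page that actually holds the divider cell *before*
// descending; the predecessor may live an arbitrary number of levels
// below it, so parent_page() of the leaf is not necessarily the page
// the replacement must be inserted into.
let target_page = self.stack.top();

// ... traverse down to the rightmost leaf of the left subtree ...
let leaf_page = self.stack.top();

// Insert the replacement divider back into `target_page`, not into
// `self.stack.parent_page()`.
```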
Closes#2106
Closes#2108
Instead of e.g.:
> Error: failed with error: 'InternalError("`SELECT * FROM sparkling_kuh
WHERE (sparkling_kuh.warmhearted_bowden > 'givhaqhwibn') ` should return
no values for table `sparkling_kuh`")'
Now you get:
> Error: failed with error: 'InternalError("Assertion '`SELECT * FROM
sparkling_kuh WHERE (sparkling_kuh.warmhearted_bowden > 'givhaqhwibn') `
should return no values for table `sparkling_kuh`' failed: expected no
rows but got 1 rows: \"-37703991.25525856, sleek_leeder,
passionate_deleuze, -2463056772.592847, shining_kuo, polite_mcbarron,
X'616D626974696F75735F647261676F6E6F776C', warmhearted_bekken\"")'
Closes#2111
`FaultyQuery` causes frequent false positives in the simulator due to the following chain of events:
- we write rows and flush the WAL to disk
- a fault is injected during fsync, which fails
- the error is returned to the caller, so the simulator thinks those rows don't exist because the query failed
- we reopen the database, i.e. read the WAL back into memory from disk, and it has those extra rows we think we didn't write
- the assertion fails because the table has more rows than the simulator expected
More discussion about fsync behavior in issue #2091.
Closes#2110