## What
Rowsets are used in SQLite for two purposes:
1. for membership tests on a set of `i64`s,
2. for in-order iteration of a set of `i64`s,
Both in cases where we can just use rowids (which are `i64`) instead of
building an entire ephemeral btree from a table's contents.
For example, in cases where a `DELETE FROM tbl WHERE ...` is performed
on a table that has any `BEFORE DELETE` triggers, SQLite collects the
table's rowids into a RowSet before actually performing the deletion.
This is similar to how an UPDATE that modifies rowids (or the index used
to iterate the UPDATE loop) will first collect the rows into an
ephemeral index, and same with `INSERT INTO ... SELECT`.
## Details
RowSet uses a "batch" concept: the caller must guarantee that the
insertions within a given batch contain no duplicates, so they can be
pushed onto a vector in O(1). When a new batch is started, the previous
batch is folded into a `BTreeSet` so that membership tests can be
performed in O(log n). As far as I can tell, the "in-order iteration"
use case doesn't use this batch logic at all.
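A minimal sketch of the batch idea for the membership-test mode, not the code in this PR. The names (`BatchedRowSet`, `test`, `insert`) are made up for illustration, and it assumes that a membership test only needs to see rowids from earlier batches:

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of a batched rowset used for membership tests.
#[derive(Default)]
pub struct BatchedRowSet {
    /// Rowids from all previous batches, searchable in O(log n).
    folded: BTreeSet<i64>,
    /// Rowids of the current batch; the caller guarantees no duplicates.
    pending: Vec<i64>,
    /// Id of the batch whose rowids currently sit in `pending`.
    current_batch: i32,
}

impl BatchedRowSet {
    /// Returns true if `rowid` was inserted during an earlier batch.
    /// Starting a new batch first folds the pending vector into the set.
    pub fn test(&mut self, rowid: i64, batch: i32) -> bool {
        if batch != self.current_batch {
            self.folded.extend(self.pending.drain(..));
            self.current_batch = batch;
        }
        self.folded.contains(&rowid)
    }

    /// O(1) amortized append to the current batch.
    pub fn insert(&mut self, rowid: i64) {
        self.pending.push(rowid);
    }
}
```

Within a batch, `test` deliberately does not look at `pending`, which is why the no-duplicates guarantee has to come from the caller.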
## AI disclosure
This entire PR description was written by me - no AIs were harmed in the
production of it. However, the code itself was mostly vibecoded using
two agents in Cursor:
- Composer 1: given the SQLite opcode documentation and rowset.c source
code, and asked to implement the VDBE instructions and the RowSet
module.
- GPT-5: given the same SQLite docs and source code, and asked to review
Composer 1's work and write feedback into a separate markdown file.
This loop was run for roughly 4-5 iterations, where each time GPT-5's
feedback was given to Composer 1, until GPT-5 found nothing to comment
anymore.
After this, I instructed Composer 1 to improve the documentation to be
less stupid.
After that, I made a manual editing pass over the runtime code to e.g.
change boolean flags to a `RowSetMode` enum to make clearer that the
rowset has two distinct mutually exclusive purposes (membership tests
and in-order iteration), plus cleaned up some other dumb shit and added
comments.
I am still not sure if this saved time or not.
Closes #3938
Depends on #3920
Moves some code around so it is easier to reuse and less cluttered in
`execute.rs`, and changes how `compare` works. Instead of mutating some
register, we now just return the possible `ValueRef` representation of
that affinity. This allows other parts of the codebase to reuse this
logic without needing to have an owned `Value` or a `&mut Register`.
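A rough sketch of the "return instead of mutate" shape, under loud assumptions: the `ValueRef` enum below is a simplified stand-in rather than the real type, `apply_numeric_affinity` is a hypothetical name, and Rust's `parse` is only a placeholder for SQLite's actual numeric parsing rules.

```rust
/// Simplified stand-in for a borrowed value; not the real `ValueRef`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ValueRef<'a> {
    Null,
    Integer(i64),
    Float(f64),
    Text(&'a str),
}

/// Returns the converted representation if numeric affinity would change
/// the value, or `None` if the value should be used as-is.
/// Nothing is mutated, so callers only need a borrowed value.
pub fn apply_numeric_affinity(value: ValueRef<'_>) -> Option<ValueRef<'static>> {
    match value {
        ValueRef::Text(text) => {
            let trimmed = text.trim();
            if let Ok(int) = trimmed.parse::<i64>() {
                Some(ValueRef::Integer(int))
            } else if let Ok(float) = trimmed.parse::<f64>() {
                Some(ValueRef::Float(float))
            } else {
                None
            }
        }
        // Null, Integer, and Float are left untouched in this sketch.
        _ => None,
    }
}
```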
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #3923
Depends on #3919
Also changes `op_compare` to reuse the same `compare_immutable` logic.
First step to finish #2304
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #3920
Makes it easier to visualize what is related to Value and what is
related to opcodes. This will also facilitate my next PR, which
generalizes certain functions over `Value` and `ValueRef` as listed in
#2304.
Closes #3919
Partial sync for the sync engine will need to implement its own version
of `DatabaseStorage`, which will load database pages on demand.
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3922
This PR completely integrates custom indices into the query planner.
To do that, a new `Cursor::IndexMethod` is introduced, along with a few
related changes in the VM implementation:
1. Added special `IndexMethod{Create,Destroy,Query}` opcodes to handle
index method creation, deletion, and querying
2. Updated the `Next`, `IdxRowid`, `IdxInsert`, and `IdxDelete` opcodes
to properly handle the new cursor case
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #3827
Turso incorrectly creates the first table of an auto-vacuumed database
on page 2.
(Note: this was done in collaboration with @LeMikaelF)
SQLite does not allow enabling or disabling auto-vacuum after the first
table has been created
(https://sqlite.org/pragma.html#pragma_auto_vacuum). This is because the
sequence of pages in the database is different when auto-vacuum is
enabled: the first b-tree page must be page 3 instead of 2, to
make room for the first [Pointer Map
page](https://sqlite.org/fileformat.html#pointer_map_or_ptrmap_pages).
But Turso doesn't currently consider this, which can lead to data loss.
The simplest way to reproduce this is to create an auto-vacuumed
database with `pragma auto_vacuum=full`, so that autovacuum runs
on each commit, and then create a table with some data. Turso will
incorrectly create the new table on page 2. After this, every time a new
page is created, either through a page split or because a new table is
created, Turso will write a 5-byte pointer in page 2, starting from the
top of the page, thereby overwriting existing data.
For example, let's start with a clean database and the first bytes of
page 2. It starts with `0d`, the page type of a table leaf page
([source](https://www.sqlite.org/fileformat.html#b_tree_pages)). The
next interesting number is the number of cells contained in this page
(`00 01`), a 2-byte big-endian integer at offset 3.
```
$ cargo run -- /tmp/a.db
turso> create table t(a);
turso> insert into t values ('myvalue');
$ dbtotxt /tmp/a.db
| size 8192 pagesize 4096 filename a.db
| page 1 offset 0
# ...snip...
| page 2 offset 4096
| 0: 0d 00 00 00 01 0f f5 00 0f f5 00 00 00 00 00 00 ................
| 4080: 00 00 00 00 00 09 01 02 1b 6d 79 76 61 6c 75 65 .........myvalue
| end a.db
```
Pointer map pages are located every N pages, starting from page 2, and
contain a list of 5-byte entries, each recording the type and parent
page of one page. So whenever Turso or SQLite adds a page, it writes
5 bytes into page 2. This means that for data loss to occur, it
is sufficient to add a single page to the database, for example by
creating a table. The first five bytes of page 2 are then overwritten:
the page type changes from `0d` to `01` and the cell count is zeroed
out:
```
$ cargo run -- /tmp/a.db
turso> create table t(a);
turso> insert into t values ('myvalue');
turso> pragma auto_vacuum=full;
turso> create table tt(a);
$ dbtotxt /tmp/a.db
| size 12288 pagesize 4096 filename a.db
| page 1 offset 0
# ...snip...
| page 2 offset 4096
| 0: 01 00 00 00 00 0f f5 00 0f f5 00 00 00 00 00 00 ................
| 4080: 00 00 00 00 00 09 01 02 1b 6d 79 76 61 6c 75 65 .........myvalue
```
Creating more tables, or adding more B-tree pages, will keep overwriting
the rest of the page, until the cells themselves are also overwritten.
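To see why the very first new page clobbers offset 0 of page 2, here is a sketch of how a pointer-map entry is located, following the layout described in the SQLite file format documentation. The function names are hypothetical, and the sketch ignores the pending-byte page, which a real implementation must skip:

```rust
/// One pointer-map entry: 1 type byte + 4-byte big-endian parent page number.
const PTRMAP_ENTRY_SIZE: u32 = 5;

/// Returns the pointer-map page covering `pgno` (page numbers are 1-based).
/// Simplification: the pending-byte page is not handled here.
pub fn ptrmap_page_for(pgno: u32, usable_size: u32) -> u32 {
    assert!(pgno >= 2, "page 1 is never covered by a pointer map");
    // Each ptrmap page is followed by as many pages as it has room for entries.
    let pages_per_group = usable_size / PTRMAP_ENTRY_SIZE + 1;
    let group = (pgno - 2) / pages_per_group;
    group * pages_per_group + 2
}

/// Returns the byte offset of `pgno`'s entry inside its pointer-map page,
/// or `None` if `pgno` is itself a pointer-map page.
pub fn ptrmap_entry_offset(pgno: u32, usable_size: u32) -> Option<u32> {
    let map_page = ptrmap_page_for(pgno, usable_size);
    if map_page == pgno {
        return None;
    }
    Some(PTRMAP_ENTRY_SIZE * (pgno - map_page - 1))
}
```

With a 4096-byte usable page size, page 3 maps to offset 0 of page 2 and page 4 to offset 5, which matches the overwrite pattern in the dumps above.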
## Reproducing the issue in the simulator
We have been unable to reproduce this exact corruption mode in the
simulator, but patching it shows many failure modes, none of which
occur with the unpatched simulator. The following seeds show the issue
when the patched simulator is run against `main`:
- `11522841279124073062`, with "Assertion 'table inquisitive_graham_159
should contain all of its expected values' failed: table
inquisitive_graham_159 does not contain the expected values, the
simulator model has more rows than the database"
- `7057400018220918989`, `16028085350691325843`, `7721542713659053944`,
and `203017821863546118`, with "Failed to read ptrmap key=XXX"
- `12533694709304969540`, `18357088553315413457`, `3108945730906932377`,
with "Integrity Check Failed: Cell N in page 2 is out of range."
- `4757352625344646473`, with "dirty pages should be empty for read
txn"
- `7083498604824302257`, with "header_size: 6272, header_len_bytes: 2,
payload.len(): 13"
- `17881876827470741581`, with "ParseError("no such table:
focused_historians_416")"
- `2092231500503735693`, with "range end index 4789 out of range for
slice of length 4096"
- `7555257419378470845`, with "malformed database schema
(imaginative_ontivero\u{1})"
- `12905270229511147245`, with "index out of bounds: the len is 4096 but
the index is 4096"
## Fixing the issue
- When the DB is opened, we read the `auto_vacuum` state instead of
assuming `auto_vacuum=none`.
- Don't allow `auto_vacuum` to be flipped on non-empty databases: if we
allowed this, the ptrmap could overwrite existing data.
- Modify the integrity check to avoid reporting that page 2 is orphaned
in auto-vacuumed databases.
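For the first bullet, a sketch of what "reading the `auto_vacuum` state" can look like, based on the header fields in the SQLite file format documentation (offset 52: largest root b-tree page, offset 64: incremental-vacuum flag). The enum and function names are hypothetical and this is not the code from this PR:

```rust
/// Hypothetical enum for the three auto-vacuum states.
#[derive(Debug, PartialEq)]
pub enum AutoVacuumMode {
    None,
    Full,
    Incremental,
}

/// Derives the auto-vacuum mode from the 100-byte database header.
pub fn auto_vacuum_mode(header: &[u8; 100]) -> AutoVacuumMode {
    // Header integers are stored big-endian.
    let be32 = |off: usize| {
        u32::from_be_bytes([header[off], header[off + 1], header[off + 2], header[off + 3]])
    };
    let largest_root_page = be32(52); // zero unless auto-vacuum is enabled
    let incremental_flag = be32(64); // non-zero means incremental vacuum
    if largest_root_page == 0 {
        AutoVacuumMode::None
    } else if incremental_flag != 0 {
        AutoVacuumMode::Incremental
    } else {
        AutoVacuumMode::Full
    }
}
```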
Fixes #3752
Closes #3830
Closes #1282
# Support for WHERE clause subqueries
This PR implements support for subqueries that appear in the WHERE
clause of SELECT statements.
## What are those lol
1. **EXISTS subqueries**: `WHERE EXISTS (SELECT ...)`
2. **Row value subqueries**: `WHERE x = (SELECT ...)` or `WHERE (x, y) =
(SELECT ...)`. The latter are not yet supported - only the single-column
("scalar subquery") case is.
3. **IN subqueries**: `WHERE x IN (SELECT ...)` or `WHERE (x, y) IN
(SELECT ...)`
## Correlated vs Uncorrelated Subqueries
- **Uncorrelated subqueries** reference only their own tables and can be
evaluated once.
- **Correlated subqueries** reference columns from the outer query
(e.g., `WHERE EXISTS (SELECT * FROM t2 WHERE t2.id = t1.id)`) and must
be re-evaluated for each row of the outer query.
## Implementation
### Planning
During query planning, the WHERE clause is walked to find subquery
expressions (`Expr::Exists`, `Expr::Subquery`, `Expr::InSelect`). Each
subquery is:
1. Assigned a unique internal ID
2. Compiled into its own `SelectPlan` with outer query tables provided
as available references
3. Replaced in the AST with an `Expr::SubqueryResult` node that
references the subquery with its internal ID
4. Stored in a `Vec<NonFromClauseSubquery>` on the `SelectPlan`
For IN subqueries, an ephemeral index is created to store the subquery
results; for other kinds, the results are stored in register(s).
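The extract-and-replace step can be sketched roughly like this. The toy `Expr` enum is a hypothetical stand-in for the real AST (it only models `EXISTS`, and a SQL string stands in for the compiled `SelectPlan`):

```rust
/// Toy AST, only for illustration; the real `Expr` has many more variants.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Column(String),
    Binary(Box<Expr>, &'static str, Box<Expr>),
    /// The string stands in for a parsed SELECT.
    Exists(String),
    /// Placeholder that refers to an extracted subquery by its internal id.
    SubqueryResult(usize),
}

/// Walks `expr`, moves every subquery into `out` under a fresh id, and
/// leaves a `SubqueryResult` node in its place.
pub fn extract_subqueries(expr: &mut Expr, out: &mut Vec<(usize, String)>) {
    match expr {
        Expr::Binary(lhs, _, rhs) => {
            extract_subqueries(lhs, out);
            extract_subqueries(rhs, out);
        }
        Expr::Exists(_) => {
            let id = out.len();
            // Swap the node out so the subquery can be moved without cloning.
            let old = std::mem::replace(expr, Expr::SubqueryResult(id));
            if let Expr::Exists(select) = old {
                out.push((id, select));
            }
        }
        Expr::Column(_) | Expr::SubqueryResult(_) => {}
    }
}
```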
### Translation
Before emitting bytecode, we need to determine when each subquery should
be evaluated:
- **Uncorrelated**: Evaluated once before opening any table cursors
- **Correlated**: Evaluated at the appropriate nested loop depth after
all referenced outer tables are in scope
This is calculated by examining which outer query tables the subquery
references and finding the right-most (innermost) loop that opens those
tables - using similar mechanisms that we use for figuring out when to
evaluate other `WhereTerm`s too.
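As a simplified illustration of that calculation (the function name and the use of plain `usize` table ids are assumptions, not the planner's real types):

```rust
/// `join_order[i]` is the id of the table opened by loop `i` (0 = outermost).
/// Returns `None` for an uncorrelated subquery (evaluate once, before any
/// loop), or `Some(depth)` of the innermost loop that opens a referenced
/// table. Referenced ids that are not in `join_order` are ignored here.
pub fn eval_loop_depth(join_order: &[usize], referenced_outer_tables: &[usize]) -> Option<usize> {
    referenced_outer_tables
        .iter()
        .filter_map(|table| join_order.iter().position(|opened| opened == table))
        .max()
}
```

For a join order of tables `[7, 3, 5]`, a subquery that references tables 7 and 5 has to wait until loop depth 2, while one that references nothing gets `None`.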
### Code Generation
- **EXISTS**: Sets a register to 1 if any row is produced, 0 otherwise.
Has new `QueryDestination::ExistsSubqueryResult` variant.
- **IN**: Results stored in an ephemeral index and the index is probed.
- **RowValue**: Results stored in a range of registers. Has new
`QueryDestination::RowValueSubqueryResult` variant.
## Annoying details
### Which cursor to read from in a subquery?
Sometimes a query will use a covering index, i.e. skip opening the table
cursor at all if the index contains All The Needed Stuff.
Correlated subqueries reading columns from outer tables is a bit
problematic in this regard: with our current translation code, the
subquery doesn't know whether the outer query opened a table cursor,
index cursor, or both. So, for now, we try to find a table cursor first,
then fall back to finding any index cursor for that table.
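The fallback order boils down to something like the following sketch, where `OpenCursor`, `CursorKind`, and `resolve_outer_cursor` are hypothetical names rather than the actual translation code:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CursorKind {
    Table,
    Index,
}

/// Hypothetical record of a cursor that the outer query has opened.
#[derive(Debug, PartialEq)]
pub struct OpenCursor {
    pub table_id: usize,
    pub kind: CursorKind,
    pub cursor_id: usize,
}

/// Prefers a table cursor for `table_id`; falls back to any index cursor
/// opened for that table (e.g. when a covering index was used).
pub fn resolve_outer_cursor(open: &[OpenCursor], table_id: usize) -> Option<&OpenCursor> {
    let find = |kind: CursorKind| {
        open.iter()
            .find(|cursor| cursor.table_id == table_id && cursor.kind == kind)
    };
    find(CursorKind::Table).or_else(|| find(CursorKind::Index))
}
```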
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3847