Turso incorrectly creates the first table of an auto-vacuumed database on
page 2.
(Note: this is in collaboration with @LeMikaelF.)
SQLite does not allow enabling or disabling auto-vacuum after the first
table has been created
(https://sqlite.org/pragma.html#pragma_auto_vacuum). This is because the
page layout of the database differs when auto-vacuum is enabled: the
first b-tree page must be page 3 instead of page 2, to make room for the
first [Pointer Map
page](https://sqlite.org/fileformat.html#pointer_map_or_ptrmap_pages).
Turso currently doesn't account for this, which can lead to data loss.
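As a minimal illustration of the layout difference (the function below is a hypothetical sketch, not Turso's actual code):
```rust
/// Hypothetical sketch: where the first b-tree root page must go.
/// With auto-vacuum enabled, page 2 is reserved for the first pointer
/// map page, so the first table's root page has to be page 3.
fn first_btree_root_page(auto_vacuum_enabled: bool) -> u32 {
    if auto_vacuum_enabled {
        3 // page 2 is the first ptrmap page
    } else {
        2 // page 1 holds the database header and schema; page 2 is free
    }
}

fn main() {
    assert_eq!(first_btree_root_page(false), 2);
    assert_eq!(first_btree_root_page(true), 3);
}
```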
The simplest way to reproduce this is to create an auto-vacuumed
database with `pragma auto_vacuum=full`, so that auto-vacuum runs on
each commit, and then create a table with some data. Turso will
incorrectly create the new table on page 2. After this, every time a new
page is created, either through a page split or because a new table is
created, Turso will write a 5-byte pointer map entry into page 2,
starting from the top of the page, thereby overwriting existing data.
For example, let's start with a clean database and look at the first
bytes of page 2. The page starts with `0d`, the discriminator for a
table leaf page
([source](https://www.sqlite.org/fileformat.html#b_tree_pages)). The
next interesting number is the cell count of the page, the two-byte
value `00 01` at offsets 3-4.
```
$ cargo run -- /tmp/a.db
turso> create table t(a);
turso> insert into t values ('myvalue');
$ dbtotxt /tmp/a.db
| size 8192 pagesize 4096 filename a.db
| page 1 offset 0
# ...snip...
| page 2 offset 4096
| 0: 0d 00 00 00 01 0f f5 00 0f f5 00 00 00 00 00 00 ................
| 4080: 00 00 00 00 00 09 01 02 1b 6d 79 76 61 6c 75 65 .........myvalue
| end a.db
```
Pointer map pages are located every N pages, starting from page 2, and
contain a list of 5-byte entries that record the parent page of each
page. So whenever Turso or SQLite needs to add a page, it will overwrite
5 bytes of page 2. This means that for data loss to occur, it is
sufficient to add a single page to the database, for example by creating
a table. The first five bytes of the page, including the page-type byte
and the cell count, are then overwritten:
```
$ cargo run -- /tmp/a.db
turso> create table t(a);
turso> insert into t values ('myvalue');
turso> pragma auto_vacuum=full;
turso> create table tt(a);
$ dbtotxt /tmp/a.db
| size 12288 pagesize 4096 filename a.db
| page 1 offset 0
# ...snip...
| page 2 offset 4096
| 0: 01 00 00 00 00 0f f5 00 0f f5 00 00 00 00 00 00 ................
| 4080: 00 00 00 00 00 09 01 02 1b 6d 79 76 61 6c 75 65 .........myvalue
```
Creating more tables, or adding more B-tree pages, will keep overwriting
the rest of the page, until the cells themselves are also overwritten.
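For reference, here is a small sketch of the pointer map layout this relies on, modeled on the layout described in the SQLite file format documentation (illustrative only; the pending-byte-page special case is omitted):
```rust
/// Number of 5-byte pointer map entries that fit on one ptrmap page.
fn entries_per_ptrmap_page(usable_size: u32) -> u32 {
    usable_size / 5
}

/// Which pointer map page holds the entry for `pgno`. Ptrmap pages start
/// at page 2 and repeat every (entries_per_page + 1) pages.
fn ptrmap_page_for(pgno: u32, usable_size: u32) -> u32 {
    let group = entries_per_ptrmap_page(usable_size) + 1;
    ((pgno - 2) / group) * group + 2
}

/// Byte offset of the 5-byte entry for `pgno` inside its ptrmap page.
fn ptrmap_entry_offset(pgno: u32, usable_size: u32) -> u32 {
    5 * (pgno - ptrmap_page_for(pgno, usable_size) - 1)
}

fn main() {
    // With a 4096-byte usable page size, page 2 maps pages 3..=821, and
    // the entry for page 3 starts at byte 0 of page 2 -- exactly the
    // bytes that hold the b-tree page header in the corrupted dump.
    assert_eq!(ptrmap_page_for(3, 4096), 2);
    assert_eq!(ptrmap_entry_offset(3, 4096), 0);
    assert_eq!(ptrmap_entry_offset(4, 4096), 5);
}
```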
## Reproducing the issue in the simulator
We have been unable to reproduce this exact corruption mode in the
simulator, but a patched simulator exposes many failure modes, none of
which occur with the unpatched simulator. The following seeds show the
issue when the patched simulator is run against `main`:
- `11522841279124073062`, with "Assertion 'table inquisitive_graham_159
should contain all of its expected values' failed: table
inquisitive_graham_159 does not contain the expected values, the
simulator model has more rows than the database"
- `7057400018220918989`, `16028085350691325843`, `7721542713659053944`,
and `203017821863546118`, with "Failed to read ptrmap key=XXX"
- `12533694709304969540`, `18357088553315413457`, `3108945730906932377`,
with "Integrity Check Failed: Cell N in page 2 is out of range."
- `4757352625344646473`, with "dirty pages should be empty for read
txn"
- `7083498604824302257`, with "header_size: 6272, header_len_bytes: 2,
payload.len(): 13"
- `17881876827470741581`, with "ParseError("no such table:
focused_historians_416")"
- `2092231500503735693`, with "range end index 4789 out of range for
slice of length 4096"
- `7555257419378470845`, with "malformed database schema
(imaginative_ontivero\u{1})"
- `12905270229511147245`, with "index out of bounds: the len is 4096 but
the index is 4096"
## Fixing the issue
- When the DB is opened, we read the `auto_vacuum` state instead of
assuming `auto_vacuum=none` (see the sketch after this list).
- Don't allow `auto_vacuum` to be flipped on non-empty databases: if we
allowed this, pointer map entries could overwrite existing data.
- Modify the integrity check so it doesn't report page 2 as orphaned in
auto-vacuumed databases.
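A minimal sketch of reading the auto-vacuum state from the database header on open, assuming the documented header layout (byte offset 52 holds the largest root b-tree page, non-zero iff auto-vacuum is on, and offset 64 holds the incremental-vacuum flag); the names are illustrative, not Turso's actual code:
```rust
/// Hedged sketch: derive the auto-vacuum mode from the 100-byte SQLite
/// database header instead of assuming `auto_vacuum=none`.
#[derive(Debug, PartialEq)]
enum AutoVacuumMode {
    None,
    Full,
    Incremental,
}

fn auto_vacuum_mode_from_header(header: &[u8; 100]) -> AutoVacuumMode {
    // Offset 52: largest root b-tree page (non-zero iff auto-vacuum).
    let largest_root = u32::from_be_bytes(header[52..56].try_into().unwrap());
    // Offset 64: incremental-vacuum flag.
    let incremental = u32::from_be_bytes(header[64..68].try_into().unwrap());
    if largest_root == 0 {
        AutoVacuumMode::None
    } else if incremental != 0 {
        AutoVacuumMode::Incremental
    } else {
        AutoVacuumMode::Full
    }
}
```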
Fixes #3752
Closes #3830
## Gist
This PR implements _statement subtransactions_, which means that a
single statement within an interactive transaction can individually be
rolled back.
## Background
The default constraint violation resolution strategy in SQLite is
`ABORT`, which rolls back only the statement that caused the conflict.
For example:
```sql
CREATE TABLE t(x UNIQUE);
INSERT INTO t VALUES (1);
BEGIN;
INSERT INTO t VALUES (2),(3); -- ok
INSERT INTO t VALUES (4),(1); -- conflict on 1, this statement should rollback
INSERT INTO t VALUES (5); -- ok
COMMIT; -- ok
SELECT * FROM t;
1
2
3
5
```
So far we haven't been able to support this due to the lack of
subtransactions, and have instead used the `ROLLBACK` strategy, which
rolls back the entire transaction on any constraint error.
## Problem
Although PRIMARY KEY and UNIQUE constraints allow defining the conflict
resolution strategy (e.g. `id INTEGER PRIMARY KEY ON CONFLICT
ROLLBACK`), FOREIGN KEY violations do not support this: they always use
`ABORT` i.e. statement subtransaction rollback. For this reason alone it
is important to implement this mechanism now rather than later, since we
already have FOREIGN KEY support implemented.
## Details
This PR implements statement subtransactions with _anonymous
savepoints_. This means that whenever a statement begins, it will open a
new savepoint which will write "page undo images" into a temporary file
called a _subjournal_. Whenever the statement marks a page as dirty, it
will write the before-image of the page into the subjournal so that its
modifications can be undone in the event of an ABORT (statement
rollback).
- Right now, only anonymous savepoints are supported, so the explicit
`SAVEPOINT` syntax is not.
- Due to the above, there can be only one savepoint open per pager, and
this is enforced with assertions.
- The subjournal file is currently kept entirely in memory. If it were
not, we would either have to block on IO or refactor many call sites to
account for potentially pending completions.
- Constraint errors no longer cause the transaction to abort, nor do
they cause the page cache to be cleared. Instead, subjournaled pages are
brought back into the page cache, which achieves the same behavior in a
more fine-grained way (a rough sketch of the flow follows below).
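As a rough illustration of that flow (the types and method names below are hypothetical, not Turso's actual pager API):
```rust
use std::collections::HashMap;

/// Hypothetical sketch of an anonymous statement savepoint backed by an
/// in-memory subjournal of page before-images.
struct Subjournal {
    /// page number -> page contents as they were when the statement began
    before_images: HashMap<u32, Vec<u8>>,
}

impl Subjournal {
    /// Opened when a statement begins (the anonymous savepoint).
    fn open() -> Self {
        Self { before_images: HashMap::new() }
    }

    /// Called the first time the statement dirties a page: record the
    /// before-image so the statement's changes can be undone on ABORT.
    fn record_before_image(&mut self, pgno: u32, contents: &[u8]) {
        self.before_images
            .entry(pgno)
            .or_insert_with(|| contents.to_vec());
    }

    /// Statement rollback (ABORT): bring the subjournaled pages back into
    /// the page cache instead of clearing the whole cache.
    fn rollback_statement(self, page_cache: &mut HashMap<u32, Vec<u8>>) {
        for (pgno, image) in self.before_images {
            page_cache.insert(pgno, image);
        }
    }

    /// Statement completed successfully: the before-images are dropped.
    fn release(self) {}
}
```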
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes#3792
Every transaction was reading page 1 from the WAL to check the schema
cookie in op_transaction, causing unnecessary WAL lookups.
This commit caches the schema_cookie in Pager as AtomicU64, similar to
how page_size and reserved_space are already cached. The cache is
updated when the header is read/modified and invalidated in
begin_read_tx() when WAL changes are detected from other connections.
This matches SQLite's approach of caching frequently accessed header
fields to avoid repeated page 1 reads. Improves write throughput by 5%
in our benchmarks.
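A minimal sketch of this caching pattern, assuming illustrative field and method names (not necessarily the ones used in the codebase):
```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sentinel meaning "cache invalid, re-read the header from page 1".
const SCHEMA_COOKIE_INVALID: u64 = u64::MAX;

struct Pager {
    schema_cookie: AtomicU64,
    // ... page_size, reserved_space, etc. are cached the same way
}

impl Pager {
    /// Called whenever the header is read or modified.
    fn cache_schema_cookie(&self, cookie: u32) {
        self.schema_cookie.store(cookie as u64, Ordering::Release);
    }

    /// Called from begin_read_tx() when the WAL has changed since the
    /// last read transaction (another connection wrote).
    fn invalidate_schema_cookie(&self) {
        self.schema_cookie.store(SCHEMA_COOKIE_INVALID, Ordering::Release);
    }

    /// op_transaction checks the cached value first and only falls back
    /// to reading page 1 when the cache has been invalidated.
    fn cached_schema_cookie(&self) -> Option<u32> {
        match self.schema_cookie.load(Ordering::Acquire) {
            SCHEMA_COOKIE_INVALID => None,
            v => Some(v as u32),
        }
    }
}
```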
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#3727
We don't want something like `BEGIN IMMEDIATE` to start a subtransaction,
so instead we open one only if the following holds (see the sketch below):
- the statement is a write, AND
  - a) the statement has >0 table_references, or
  - b) the statement is an INSERT (INSERT doesn't track table_references
    in the same way as other program types).
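For illustration only (the `Statement` struct and its fields below are hypothetical), the condition amounts to:
```rust
/// Hypothetical representation of the properties checked here.
struct Statement {
    is_write: bool,
    table_references: usize,
    is_insert: bool,
}

/// Open an anonymous statement savepoint only for statements that can
/// actually modify b-tree pages; plain `BEGIN IMMEDIATE` should not.
fn should_open_subtransaction(stmt: &Statement) -> bool {
    stmt.is_write && (stmt.table_references > 0 || stmt.is_insert)
}
```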
This patch pushes unsafe `Send` and `Sync` impls down to individual
components instead of declaring them at the Database level. This makes
it easier for us to fix thread-safety incrementally, while preventing
developers from adding more thread-unsafe code.
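A small sketch of the difference (the component types below are placeholders, not the actual Turso structs):
```rust
use std::cell::RefCell;

// Placeholder component types standing in for the real ones.
struct Pager { state: RefCell<u64> }
struct Wal { state: RefCell<u64> }

// Before: one blanket claim at the top level hides which parts are
// actually thread-safe:
//
//     unsafe impl Send for Database {}
//     unsafe impl Sync for Database {}
//
// After: each component makes (and documents) its own claim, so
// thread-safety can be fixed component by component, and new code
// doesn't silently inherit a blanket `unsafe impl`.
unsafe impl Send for Pager {}
unsafe impl Sync for Pager {}

unsafe impl Send for Wal {}
unsafe impl Sync for Wal {}
```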
If we don't clear the dirty pages, we will initiate a rollback. During
the rollback we attempt to clear the whole page cache, which then panics
because there are still dirty pages left over from the failed `writev`.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#3189
Yield is a completion that does not allocate any inner state. By design
it is completed from the start and can never produce an error. This
allows yielding cheaply, without taking any locks or heap-allocating any
inner state.
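A minimal sketch of the idea, assuming a simplified completion enum (illustrative only; the real type in the codebase differs):
```rust
/// Illustrative stand-in for a completion type: most variants carry
/// heap-allocated state for pending I/O, but `Yield` carries nothing.
enum Completion {
    /// Pending I/O with inner state behind an allocation.
    Io(Box<IoState>),
    /// Already complete, infallible, zero-allocation: just yield control.
    Yield,
}

struct IoState {
    // buffers, callbacks, etc.
}

impl Completion {
    fn is_completed(&self) -> bool {
        match self {
            Completion::Yield => true, // completed from the start
            Completion::Io(_) => false,
        }
    }
}
```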
This PR adds proper program abort in case of an unfinished statement
reset or interruption.
Also, this PR makes rollback methods non-failing, because otherwise it
is usually unclear to their callers what state the
statement/connection/transaction is in if a rollback fails.
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes#3591
MVCC is like the annoying younger cousin (I know because I was him) that
needs to be treated differently. MVCC requires us to use root_pages that
might not be allocated yet, and the plan is to use negative root_pages
for that case. Therefore, we need i64 in order to fit this change.
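As a hypothetical illustration of what that could look like (not the actual representation in the codebase):
```rust
/// Hypothetical sketch: an i64 root page lets MVCC refer to tables whose
/// b-tree has not been allocated on disk yet, using negative numbers as
/// logical placeholders.
#[derive(Clone, Copy)]
struct RootPage(i64);

impl RootPage {
    fn is_allocated(self) -> bool {
        self.0 > 0
    }

    /// Resolve a placeholder to a real page number once the b-tree is
    /// actually created (the allocation callback is illustrative).
    fn resolve(self, allocate: impl FnOnce() -> i64) -> i64 {
        if self.is_allocated() { self.0 } else { allocate() }
    }
}
```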