After a failed writev we initiate a rollback. If we don't clear the
dirty pages first, the rollback will attempt to clear the whole page
cache and then panic, because there will still be dirty pages from the
failed writev.
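A minimal sketch of the failure sequence (illustrative types, not the actual turso-core page cache API):
```rust
struct PageCache {
    dirty: Vec<u64>, // page numbers with unflushed changes
}

impl PageCache {
    /// Drop dirty state left behind by the failed writev.
    fn clear_dirty(&mut self) {
        self.dirty.clear();
    }

    /// Rollback clears the whole cache and asserts the invariant
    /// that no dirty pages remain; violating it panics.
    fn clear_all(&mut self) {
        assert!(self.dirty.is_empty(), "dirty page in cache during rollback");
    }
}

fn main() {
    let mut cache = PageCache { dirty: vec![3] };
    cache.clear_dirty(); // without this, clear_all() below panics
    cache.clear_all();
}
```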
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #3189
- add index root pages to the list of root pages to check
- check for dangling (unused) pages (sketched after the example output below)
```
$ cargo run wut.db
turso> .mode list
turso> pragma integrity_check;
Page 3: never used
Page 4: never used
Page 7: never used
Page 8: never used
```
```
$ sqlite3 wut.db 'pragma integrity_check;'
*** in database main ***
Page 3: never used
Page 4: never used
Page 7: never used
Page 8: never used
```
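The dangling-page check can be pictured like this (a minimal sketch assuming a `reachable` set collected by walking every table and index root page; names are illustrative):
```rust
use std::collections::HashSet;

/// Report pages that no B-tree, freelist, or overflow chain references,
/// in the same "Page N: never used" format SQLite uses.
fn dangling_pages(page_count: u64, reachable: &HashSet<u64>) -> Vec<String> {
    (1..=page_count)
        .filter(|p| !reachable.contains(p))
        .map(|p| format!("Page {p}: never used"))
        .collect()
}

fn main() {
    // Suppose walking all root pages reaches only pages 1, 2, 5 and 6.
    let reachable: HashSet<u64> = [1, 2, 5, 6].into_iter().collect();
    let report = dangling_pages(8, &reachable);
    assert_eq!(report[0], "Page 3: never used");
    assert_eq!(report.len(), 4); // pages 3, 4, 7, 8
}
```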
Closes #3613
Yield is a completion that does not allocate any inner state. By design
it is completed from the start and has no errors. This allows us to
yield cheaply, without taking any locks or heap-allocating any inner
state.
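A minimal sketch of the idea (illustrative types, not the actual turso-core `Completion`): the `Yield` variant carries no state, so constructing it allocates nothing and checking it takes no lock.
```rust
use std::sync::{Arc, Mutex};

#[allow(dead_code)]
enum Completion {
    /// Pre-completed from construction; exists only to yield control.
    Yield,
    /// Other completions carry heap-allocated, lock-protected state.
    Io(Arc<Mutex<IoState>>),
}

#[allow(dead_code)]
struct IoState {
    done: bool,
    error: Option<std::io::Error>,
}

impl Completion {
    fn is_completed(&self) -> bool {
        match self {
            // No allocation, no lock: completed by definition.
            Completion::Yield => true,
            Completion::Io(state) => state.lock().unwrap().done,
        }
    }
}

fn main() {
    let c = Completion::Yield;
    assert!(c.is_completed());
}
```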
This PR adds a proper program abort in case of an unfinished statement
reset or interruption.
Also, this PR makes rollback methods non-failing, because otherwise it
is usually unclear to their callers what state the
statement/connection/transaction is in if a rollback fails.
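In API terms the change is roughly this shape (a sketch with illustrative types, not the actual turso-core signatures):
```rust
struct Statement;

impl Statement {
    /// Before: `fn rollback(&mut self) -> Result<(), Error>` forced
    /// callers to handle a failure whose resulting state was unclear.
    /// After: rollback cannot fail; unrecoverable internal errors
    /// abort the program instead of leaking an ambiguous state.
    fn rollback(&mut self) {
        // ... undo in-flight changes; abort on unrecoverable errors ...
    }
}

fn main() {
    let mut stmt = Statement;
    stmt.rollback(); // no Result to inspect: the state is unambiguous
}
```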
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3591
Fixes a page cache staleness issue where connections could incorrectly
believe the database hasn't changed after checkpointing. This could
happen when writes following a checkpoint resulted in the same
`max_frame` value, causing connections to miss updates, since they only
checked `max_frame` to detect changes.
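One way to picture the fix (field names here are illustrative, not the actual turso-core types): track a checkpoint sequence alongside `max_frame`, so a checkpoint followed by writes that land on the same `max_frame` is still detected as a change.
```rust
/// Illustrative WAL snapshot: detect changes with more than `max_frame`.
#[derive(Clone, Copy, PartialEq, Eq)]
struct WalSnapshot {
    max_frame: u64,
    /// Bumped on every checkpoint (illustrative field).
    checkpoint_seq: u64,
}

impl WalSnapshot {
    fn changed_since(&self, earlier: WalSnapshot) -> bool {
        // Comparing only max_frame misses the case where a checkpoint
        // reset the WAL and later writes produced the same max_frame.
        *self != earlier
    }
}

fn main() {
    let before = WalSnapshot { max_frame: 42, checkpoint_seq: 1 };
    // After a checkpoint plus new writes, max_frame can be 42 again.
    let after = WalSnapshot { max_frame: 42, checkpoint_seq: 2 };
    assert!(after.changed_since(before));
}
```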
Closes #3502
This is a follow-up to PR #3457, which requires users to opt in to
enable encryption. This patch:
- makes appropriate changes to the Whopper and encryption throughput tests
- updates the Rust bindings to pass the encryption options properly
- adds a test for the Rust bindings
To use encryption in the Rust bindings, one needs to do:
```rust
let opts = EncryptionOpts {
    hexkey: "b1bbfda...02a5669fc76327".to_string(),
    cipher: "aegis256".to_string(),
};
let builder = Builder::new_local(&db_file)
    .experimental_encryption(true)
    .with_encryption(opts.clone());
let db = builder.build().await.unwrap();
```
We will remove the `experimental_encryption` flag once the feature is stable.
Closes #3532
The page cache implementation uses a pre-allocated vector (`entries`)
with fixed capacity, along with a custom hash map and freelist. This
design requires expensive upfront allocation when creating a new
connection, which severely impacted performance in workloads that open
many short-lived connections (e.g., our concurrent write benchmarks that
create a new connection per transaction).
Therefore, replace the pre-allocated vector with an intrusive doubly-
linked list. This not only eliminates the page cache initialization
overhead from connection establishment, but also reduces memory usage to
only the entries that are actually in use. Furthermore, the approach
allows us to grow the page cache with much less overhead.
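A minimal sketch of the intrusive idea (illustrative types, not the actual turso-core page cache): each entry embeds its own `prev`/`next` links, so the cache starts empty and grows one entry at a time instead of pre-allocating a fixed-capacity vector.
```rust
use std::collections::HashMap;
use std::ptr::NonNull;

/// The LRU links live inside the entry itself (intrusive list).
#[allow(dead_code)]
struct Entry {
    page_id: u64,
    prev: Option<NonNull<Entry>>,
    next: Option<NonNull<Entry>>,
}

#[allow(dead_code)]
struct PageCache {
    map: HashMap<u64, NonNull<Entry>>,
    head: Option<NonNull<Entry>>, // most recently used
    tail: Option<NonNull<Entry>>, // least recently used
}

impl PageCache {
    /// No upfront allocation proportional to capacity.
    fn new() -> Self {
        PageCache { map: HashMap::new(), head: None, tail: None }
    }

    /// Push a new entry at the MRU end. Entries are leaked here for
    /// brevity; a real cache frees them on eviction.
    fn insert(&mut self, page_id: u64) {
        let entry = Box::leak(Box::new(Entry { page_id, prev: None, next: None }));
        let mut ptr = NonNull::from(entry);
        unsafe {
            ptr.as_mut().next = self.head;
            if let Some(mut head) = self.head {
                head.as_mut().prev = Some(ptr);
            } else {
                self.tail = Some(ptr); // first entry is also the LRU tail
            }
        }
        self.head = Some(ptr);
        self.map.insert(page_id, ptr);
    }
}

fn main() {
    let mut cache = PageCache::new();
    cache.insert(1);
    cache.insert(2); // page 2 is now MRU, page 1 is LRU
    assert_eq!(cache.map.len(), 2);
}
```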
The patch improves the concurrent write throughput benchmark by 4x in
single-threaded runs.
Before:
```
$ write-throughput --threads 1 --batch-size 100 -i 1000 --mode concurrent
Running write throughput benchmark with 1 threads, 100 batch size, 1000 iterations, mode: Concurrent
Database created at: write_throughput_test.db
Thread 0: 100000 inserts in 3.82s (26173.63 inserts/sec)
```
After:
```
$ write-throughput --threads 1 --batch-size 100 -i 1000 --mode concurrent
Running write throughput benchmark with 1 threads, 100 batch size, 1000 iterations, mode: Concurrent
Database created at: write_throughput_test.db
Thread 0: 100000 inserts in 0.90s (110848.46 inserts/sec)
```
Closes #3456
SQLite has a crazy easter egg: at the 1 GiB file offset, it creates a
`PENDING_BYTE_PAGE` that is used only by the VFS layer and is never
read or written.
To properly test this, I took inspiration from the SQLite testing
framework and defined a helper method that is conditionally compiled
when the `test_helper` feature is enabled.
https://github.com/sqlite/sqlite/blob/7e38287da43ea3b661da3d8c1f431aa907d648c9/src/main.c#L4327
As the `PENDING_BYTE` is normally at the 1 GiB mark, I created a
function that modifies the static `PENDING_BYTE` atomic to whatever
value we want. This means we can test this unusual behaviour at any DB
file size we want.
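In spirit, the helper looks something like this (a minimal, self-contained sketch; the names and exact feature gating are illustrative, not the actual turso-core code):
```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// SQLite pins the pending byte at offset 0x40000000 (1 GiB); the page
/// containing it is reserved for VFS locking and never read or written.
static PENDING_BYTE: AtomicU64 = AtomicU64::new(0x4000_0000);

/// Test-only helper, in the spirit of SQLite's sqlite3_test_control():
/// move the pending byte so the skipped page appears at small,
/// testable file sizes.
#[cfg(feature = "test_helper")]
pub fn set_pending_byte(offset: u64) {
    PENDING_BYTE.store(offset, Ordering::SeqCst);
}

/// Page number that the pager must skip for a given page size.
fn pending_byte_page(page_size: u64) -> u64 {
    PENDING_BYTE.load(Ordering::SeqCst) / page_size + 1
}

fn main() {
    // With 4096-byte pages the default pending byte lands on page 262145.
    assert_eq!(pending_byte_page(4096), 262_145);
}
```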
`fuzz_pending_byte_database` is the test that fuzzes different pending
byte offsets and runs an integrity check at the end to confirm we are
compatible with SQLite.
Closes #2749
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #3431
**Handle table ID / root pages properly for both checkpointed and non-checkpointed tables**
Table ID is an opaque identifier that is only meaningful to the MV
store.
Each checkpointed MVCC table corresponds to a single B-tree on the
pager, which naturally has a root page.
**We cannot use root page as the MVCC table ID directly because:**
- We assign table IDs during MVCC commit, but
- we commit pages to the pager only during checkpoint
which means the root page is not easily knowable ahead of time.
**Hence:**
- MVCC table IDs are always negative
- sqlite_schema rows will have a negative rootpage column if the
table has not been checkpointed yet
- on checkpoint, when the table is allocated a real root page, we update
the row in sqlite_schema and in the MV store's internal mapping
**On recovery:**
- All sqlite_schema tables are read directly from disk and assigned
`table_id = -1 * root_page` -- root_page on disk must be positive
- Logical log is deserialized and inserted into MV store
- Schema changes from logical_log are captured into the DB's global
schema
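The ID convention can be pictured with a tiny sketch (function names here are illustrative, not the actual MV store API):
```rust
/// On recovery, sqlite_schema rows read from disk always carry a
/// positive root page; the MVCC table ID is its negation.
fn recovered_table_id(root_page_on_disk: i64) -> i64 {
    assert!(root_page_on_disk > 0, "root_page on disk must be positive");
    -root_page_on_disk
}

/// A negative rootpage column means the table has not been
/// checkpointed into a real B-tree yet.
fn is_checkpointed(rootpage: i64) -> bool {
    rootpage > 0
}

fn main() {
    assert_eq!(recovered_table_id(5), -5);
    assert!(!is_checkpointed(-5));
    assert!(is_checkpointed(5));
}
```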
**Note about recovery:**
I changed MVCC recovery to happen on DB initialization, which should
prevent any races, so there is no need for `recover_lock`, right @pereman2?
Closes #3419
Table ID is an opaque identifier that is only meaningful to the MV store.
Each checkpointed MVCC table corresponds to a single B-tree on the pager,
which naturally has a root page.
We cannot use root page as the MVCC table ID directly because:
- We assign table IDs during MVCC commit, but
- we commit pages to the pager only during checkpoint
which means the root page is not easily knowable ahead of time.
Hence, we:
- store the mapping between table ID and B-tree root page
- give sqlite_schema rows a negative rootpage column if the
table has not been checkpointed yet
MVCC is like the annoying younger cousin (I know because I was him) that
needs to be treated differently. MVCC requires us to use root pages that
might not be allocated yet, and the plan is to use negative root pages
for that case. Therefore, we need `i64` to fit this change.
This patch improves the encryption module:
1. Previously, we did not use the first 100 bytes in encryption. This
patch uses that portion as associated data, for protection against
tampering and corruption.
2. Once page 1 is encrypted, we store a special Turso header on disk
(the first 16 bytes). During decryption we replace it with the standard
SQLite header (`"SQLite format 3\000"`), so that the upper layers (B-tree
or the Sync APIs) operate on the existing SQLite page expectations.
The format is:
```
/// Turso Header (16 bytes)
/// ┌─────────┬───────┬────────┬──────────────────┐
/// │ │ │ │ │
/// │ Turso │Version│ Cipher │ Unused │
/// │ (5) │ (1) │ (1) │ (9 bytes) │
/// │ │ │ │ │
/// └─────────┴───────┴────────┴──────────────────┘
/// 0-4 5 6 7-15
///
/// Standard SQLite Header: "SQLite format 3\0" (16 bytes)
/// ↓
/// Turso Encrypted Header: "Turso" + Version + Cipher ID + Unused
```
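A minimal sketch of the header swap on the read path (the helper name is illustrative; only the 16-byte `"SQLite format 3\000"` constant comes from the description above):
```rust
const SQLITE_HEADER: &[u8; 16] = b"SQLite format 3\0";

/// After decrypting page 1, overwrite the 16-byte Turso header with the
/// standard SQLite header so upper layers see an ordinary SQLite page.
fn restore_sqlite_header(decrypted_page1: &mut [u8]) {
    decrypted_page1[..16].copy_from_slice(SQLITE_HEADER);
}

fn main() {
    // Pretend this is a freshly decrypted page 1, Turso header intact.
    let mut page = vec![0u8; 4096];
    page[..5].copy_from_slice(b"Turso");
    restore_sqlite_header(&mut page);
    assert_eq!(&page[..16], &SQLITE_HEADER[..]);
}
```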
Reviewed-by: Nikita Sivukhin (@sivukhin)
Reviewed-by: bit-aloo (@Shourya742)
Closes #3358