### General idea:
(outside of other optimizations made mostly around concurrency):
**When checkpointing, use pages from the PageCache if we can determine
that they are exactly the page/frame that we want.**
e.g. if the frame_cache has an entry:
`Page ID: 104 -> Frame ID's: [1001, 1002]`
and the OngoingCheckpoint has min_frame of 999 and max_frame of 1020, we
should be able to check the PageCache and see if it has page 104, and
only if it is tagged with frame_id = 1002, can we use that page to
backfill the DB file.
Since using a cached page during checkpoint is purely an optimization,
we can be conservative in terms of when we accept that a cached page is
valid to use. I came up with a `wal_tag` which is the frame_id +
checkpoint_seq, which is set only in the two following places:
1. When explicitly reading a frame from the WAL. (inside
Wall::read_frame)
- read_frame is perhaps the most obvious path of ensuring it's the
exact page + frame combination that we want.
2. When appending a frame to the log during the normal process of
writing (during `[Pager::cacheflush]`)
- cacheflush calls append_frame, and inside the Completion, the dirty
flag is cleared, and the wal_tag flag is set to the frame_id.
Inside `finish_read_page` (which is called for every page we read from
either the DB file or WAL.. the `wal_tag` is cleared along with the
`dirty` flag, so that any re-used `PageRef's` don't contain wal_tag's
from any previous or stale pages.
#### **Proposal**:
(In order to merge and simultaneously be able to sleep at night)
there is this debug assertion:
```rust
#[cfg(debug_assertions)]
{
let mut raw = vec![0u8; self.page_size() as usize + WAL_FRAME_HEADER_SIZE];
self.io.wait_for_completion(self.read_frame_raw(target_frame, &mut raw)?)?;
let (_, wal_page) = sqlite3_ondisk::parse_wal_frame_header(&raw);
let cached = cached_page.get_contents().buffer.as_slice();
// while being horrible for performance, we can ensure that the bytes are identical
// when using the cached page vs what we would otherwise have read from disk.
turso_assert!(wal_page == cached, "cache fast-path returned wrong content for page {page_id} frame {target_frame}");
}
```
Performance
=====================================
Average latency for a checkpoint on my local machine:
#### Before: `7-12ms`
#### After: `2-5ms`
Reviewed-by: Nikita Sivukhin (@sivukhin)
Closes#2568
We were storing `txid` in `ProgramState`, this meant it was impossible
to track interactive transactions. This was extracted to `Connection`
instead.
Moreover, transaction state for mvcc now is reset on commit.
Closes#2689
This PR tries to add simple support for delete, with limited testing for
now.
Moreover, there was an error with `forward`, which wasn't obvious
without delete, which didn't skip deleted rows.
Reviewed-by: Avinash Sajjanshetty (@avinassh)
Closes#2672
This gets rid of `InsertState` in `BTreeCursor` plus the `moved_before`
parameter to `BTreeCursor::insert` -- instead, seek logic is now in the
existing state machines for `op_insert` and `op_idx_insert`
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes#2639
This patch adds support for per page encryption. The code is of alpha
quality, was to test my hypothesis. All the encryption code is gated
behind a `encryption` flag. To play with it, you can do:
```sh
cargo run --features encryption -- database.db
turso> PRAGMA key='turso_test_encryption_key_123456';
turso> CREATE TABLE t(v);
```
Right now, most stuff is hard coded. We use AES GCM 256. This
information is not stored anywhere, but in future versions we will start
saving this info in the file. When writing to disk, we will generate a
cryptographically secure random salt, use that to encrypt the page. Then
we will store the authentication tag and the salt in the page itself. To
accommodate this encryption hardcodes reserved space of 28 bytes.
Once the key is set in the connection, we propagate that information to
pager and the WAL, to encrypt / decrypt when reading from disk.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#2567
Let's add an encryption module, hard coded to use AES 256 GCM.
Other required parameters are also hard coded and will be made
configurable in the future PRs.
The module is behind a `encryption` feature flag.
The completion callback can be invoked only once via `OnceLock`, let's not
crash if we e.g. call `Completion::abort()` on an already finished completion.
Closes#2673
Completions can now carry errors inside of them. This allows us to wait
for a completion to complete or to error. When it errors we can properly
tell the caller of `wait_for_completion` that we errored. This will also
allow us to abort completions.
Currently, this just creates the scaffold for us to store the error in
the completion. But to correctly achieve this, it will require some
refactor of our IO implementations to store the `run_once` error for a
particular completion inside of it instead of short circuiting. This
would also allow us to check for an error in `program.step` and properly
rollback.
Also, creates default impls for some common IO methods, this is
important specially for `wait_for_completion` as we want to check the
error in the `Completion` before returning `Ok`.
Maybe we could also accept a Result type in the completion callback so
that we can execute some sort of compensating action on error, like
unlocking a page so it can be evicted by the page cache later.
**EDIT:** actually implemented this in this PR. We store a `Result`
object inside `CompletionInner` behind a `OnceLock` for thread-safety.
We also pass a result object to Completion callbacks to execute
compensating actions.
Reviewed-by: Avinash Sajjanshetty (@avinassh)
Closes#2589