Commit Graph

320 Commits

Author SHA1 Message Date
Pekka Enberg
e03f6dbf94 core/storage: Reduce logging level 2025-10-17 20:09:00 +03:00
Levy A.
77a412f6af refactor: remove unsafe reference semantics from RefValue
also renames `RefValue` to `ValueRef`, to align with rusqlite and other
crates
2025-10-07 10:43:44 -03:00
Pere Diaz Bou
3e508a4b42 core/io: remove new_dummy in place of new_yield
Yield is a completion that does not allocate any inner state. By design
it is completed from the start and has no errors. This allows lightly
yield without allocating any locks nor heap allocate inner state.
2025-10-07 12:00:33 +02:00
pedrocarlo
f3dc0bef5d remove some explicit Arc<dyn File> references 2025-10-03 16:39:57 -03:00
pedrocarlo
e93add6c80 remove dyn DatabaseStorage and replace it with DatabaseFile 2025-10-03 14:14:15 -03:00
Pere Diaz Bou
8f103f7c35 core/wal: introduce transaction_count, same as iChange in sqlite 2025-10-03 13:02:47 +02:00
Pekka Enberg
aa95cb24ea core: Wrap Connection::page_size with AtomicU16 2025-09-24 09:12:46 +03:00
Pekka Enberg
372daef656 core: Wrap Pager::io_ctx in RwLock 2025-09-22 15:00:29 +03:00
Pekka Enberg
6c48cb7043 Merge 'core: Use sequential consistency for atomics by default' from Pekka Enberg
We use relaxed ordering in a lot of places where we really need to
ensure all CPUs see the write. Switch to sequential consistency, unless
acquire/release is explicitly used. If there are places that can be
optimized, we can switch to relaxed case-by-case, but have a comment
explaning *why* it is safe.

Closes #3193
2025-09-18 14:19:54 +03:00
Pekka Enberg
8337e86794 core: Use sequential consistency for atomics by default
We use relaxed ordering in a lot of places where we really need to
ensure all CPUs see the write. Switch to sequential consistency, unless
acquire/release is explicitly used. If there are places that can be
optimized, we can switch to relaxed case-by-case, but have a comment
explaning *why* it is safe.
2025-09-18 13:38:13 +03:00
Pekka Enberg
2a5284afb9 core/storage: Use AtomicU32 for Pager::page_size 2025-09-18 11:33:32 +03:00
Pekka Enberg
2215cccebb core/storage: Wrap Pager::syncing in Arc<AtomicBool> 2025-09-18 10:55:54 +03:00
Jussi Saurio
8bf52de94b Merge 'Remove serialization of normal write/commit path' from Preston Thorpe
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #3089
2025-09-17 17:30:45 +03:00
Nikita Sivukhin
3bcac441e4 reduce log level of some very frequent logs 2025-09-15 11:35:41 +04:00
PThorpe92
703cb4a70f Link all writes to the fsync barrier, not just the commit frame 2025-09-14 10:39:52 -04:00
Avinash Sajjanshetty
11030056c7 rename method to verify_checksum 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
e010c46552 use checksums when reading/writing from db file 2025-09-13 11:00:39 +05:30
PThorpe92
7a14c7394f Remove the header copy stored on the WalFile, fix fast_path 2025-09-12 11:29:43 -04:00
Denizhan Dakılır
70102f5f6e add explicit usize type annotation to range iterator in test 2025-09-12 02:18:49 +03:00
Pekka Enberg
0b91f8a715 Merge 'IO: handle errors properly in io_uring' from Preston Thorpe
Because `io_uring` may have many other I/O submission events queued
(that are relevant to the operation) when we experience an error,
marking our `Completion` objects as aborted is not sufficient, the
kernel will still execute queued I/O, which can mutate WAL or DB state
after we’ve declared failure and keep references (iovec arrays, buffers)
alive and stall reuse. We need to stop those in-flight SQEs at the
kernel and then drain the ring to a known-empty state before reusing any
resources.
The following methods were added to the `IO` trait:
`cancel`: which takes a slice of `Completion` objects and has a default
implementation that simply marks them as `aborted`.
`drain`: which has a default noop implementation, but the `io_uring`
backend implements this method to drain the ring.
CC @sivukhin

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #2787
2025-09-10 14:24:43 +03:00
PThorpe92
2f4f67efa8 Remove some unused attributes 2025-09-09 16:17:49 -04:00
PThorpe92
02bebf02a5 Remove read_entire_wal_dumb in favor of reading chunks 2025-09-09 16:06:27 -04:00
PThorpe92
37ec77eec2 Fix read_entire_wal_dumb to prefer streaming read if over 32mb wal file 2025-09-09 13:12:58 -04:00
PThorpe92
ccae3ab0f2 Change callsites to cancel any further IO when an error occurs and drain 2025-09-08 13:18:40 -04:00
Pere Diaz Bou
382a1e14ca Merge 'core: handle edge cases for read_varint' from Sonny
Add handling malformed inputs to function `read_varint` and test cases.
```
# 9 byte truncated to 8
read_varint(&[0x81, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80]) 
    before -> panic index out of bounds: the len is 8 but the index is 8
    after -> LimboError
    
# bits set without end
read_varint(&[0x80; 9]) 
    before -> Ok((128, 9))
    after -> LimboError
```

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #2904
2025-09-05 16:15:42 +02:00
Pekka Enberg
6d80d862ee Merge 'io_uring: prevent out of order operations that could interfere with durability' from Preston Thorpe
closes #1419
When submitting a `pwritev` for flushing dirty pages, in the case that
it's a commit frame, we use a new completion type which tells io_uring
to add a flag, which ensures the following:
1. If any operation in the chain fails, subsequent operations get
cancelled with -ECANCELED
2. All operations in the chain complete in order
If there is an ongoing chain of `IO_LINK`, it ends at the `fsync`
barrier, and ensures everything submitted before it has completed.
for 99% of the cases, the syscall that immediately proceeds the
`pwritev` is going to be the fsync, but just in case, this
implementation links everything that comes between the final commit
`pwritev` and the next `fsync`
In the event that we get a partial write, if it was linked, then we
submit an additional fsync after the partial write completes, with an
`IO_DRAIN` flag after forcing a `submit`, which will mean durability is
maintained, as that fsync will flush/drain everything in the squeue
before submission.
The other option in the event of partial writes on commit frames/linked
writes is to error.. not sure which is the right move here. I guess it's
possible that since the fsync completion fired, than the commit could be
over without us being durable ondisk. So maybe it's an assertion
instead? Thoughts?

Closes #2909
2025-09-05 08:34:35 +03:00
Pekka Enberg
5950003eaf core: Simplify WalFileShared life cycle
Create one WalFileShared for a Database and update its state
accordingly. Also support case where the WAL is disabled.
2025-09-04 21:09:12 +03:00
PThorpe92
e3f366963d Compute the final db page or make the commit frame submit a linked pwritev completion 2025-09-03 16:01:16 -04:00
sonhmai
2b6cb39c7e core: handle edge cases for read_varint 2025-09-03 15:43:34 +07:00
Pekka Enberg
d959319b42 Merge 'Use u64 for file offsets in I/O and calculate such offsets in u64' from Preston Thorpe
Using `usize` to compute file offsets caps us at ~16GB on 32-bit
systems. For example, with 4 KiB pages we can only address up to 1048576
pages; attempting the next page overflows a 32-bit usize and can wrap
the write offset, corrupting data. Switching our I/O APIs and offset
math to u64 avoids this overflow on 32-bit targets

Closes #2791
2025-09-02 09:06:49 +03:00
Pekka Enberg
0c16ca9ce9 Merge 'core/wal: cache file size' from Pere Diaz Bou
Closes #2829
2025-08-30 08:41:58 +03:00
Avinash Sajjanshetty
bb591ab7e1 Propagate decryption erorr when reading from WAL 2025-08-29 18:07:38 +05:30
Pere Diaz Bou
db5e2883ee core/wal: cache wal is initialized 2025-08-29 13:15:09 +02:00
Jussi Saurio
ae0ac189fa perf: avoid constructing PageType for helper methods 2025-08-28 22:56:44 +03:00
Jussi Saurio
9aae3fa859 refactor: remove BTreePageInner
it wasn't used for anything. no more `page.get().get().id`.
2025-08-28 21:44:54 +03:00
PThorpe92
0a56d23402 Use u64 for file offsets in IO and calculate such offsets in u64 2025-08-28 09:44:00 -04:00
Avinash Sajjanshetty
2c0842ff52 Set and propagate IOContext as required 2025-08-27 22:05:01 +05:30
Jussi Saurio
dc6bcd4d41 refactor/btree: rewrite find_free_cell() 2025-08-25 10:08:39 +03:00
Jussi Saurio
4ea8cd0007 refactor/btree: rewrite the free_cell_range() function
i had a rough time reading this function earlier and trying to understand it,
so rewrote it in a way that, to me, is much more readable.
2025-08-25 09:41:44 +03:00
Pekka Enberg
22c9cb6618 s/PerConnEncryptionContext/EncryptionContext/ 2025-08-24 08:17:20 +03:00
Avinash Sajjanshetty
3090545167 use encryption ctx instead of encryption key 2025-08-21 22:36:32 +05:30
Avinash Sajjanshetty
1f93e77828 Remove hardcoded flag usage in DBHeader for encryption
Previously, we just hardcoded the reserved space with encryption flag.
This patch removes that and sets the reserved space if a key was
specified during a creation of db
2025-08-21 16:21:35 +05:30
PThorpe92
e28a38abc5 Fix wal tag safety issues, and add debug assertion that we are reading the proper frames 2025-08-20 17:28:48 -04:00
PThorpe92
d2c3ba14c8 Remove inefficient vec in WAL for tracking pages present in frame cache 2025-08-20 17:28:18 -04:00
PThorpe92
00f2a0f216 Performance improvements to checkpointing. prevent serializing I/O 2025-08-20 17:26:54 -04:00
Pekka Enberg
c2208a542a Merge 'Initial pass to support per page encryption' from Avinash Sajjanshetty
This patch adds support for per page encryption. The code is of alpha
quality, was to test my hypothesis. All the encryption code is gated
behind a `encryption` flag. To play with it, you can do:
```sh
cargo run --features encryption -- database.db

turso> PRAGMA key='turso_test_encryption_key_123456';

turso> CREATE TABLE t(v);
```
Right now, most stuff is hard coded. We use AES GCM 256. This
information is not stored anywhere, but in future versions we will start
saving this info in the file. When writing to disk, we will generate a
cryptographically secure random salt, use that to encrypt the page. Then
we will store the authentication tag and the salt in the page itself. To
accommodate this encryption hardcodes reserved space of 28 bytes.
Once the key is set in the connection, we propagate that information to
pager and the WAL, to encrypt / decrypt when reading from disk.

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #2567
2025-08-20 11:11:24 +03:00
Avinash Sajjanshetty
40a209c000 simplify feature flag usage for encryption 2025-08-20 12:49:38 +05:30
Avinash Sajjanshetty
bd9b4bbfd2 encrypt/decrypt when writing/reading from DB 2025-08-20 11:47:23 +05:30
Avinash Sajjanshetty
94d38be1a2 Set reserved_space to 28 for encrypted databases
We will use this space to store nonce and tag
2025-08-20 11:39:09 +05:30
Avinash Sajjanshetty
a6e9237c94 Set encryption key in pager and WAL 2025-08-20 11:39:09 +05:30