Commit Graph

4866 Commits

Author SHA1 Message Date
Pekka Enberg
380b27f58a Merge 'Busy handler' from Pedro Muniz
I searched using deepwiki how SQLite implements their busy handler. They
use a callback system with exponential backoff, where it stores the
callback in the pager and in the database. I confess I found this
slightly confusing, so I just implemented a simple exponential backoff
directly in the `Statement` struct. I imagine SQLite does this in a more
convoluted manner, as they do not have a concept of yielding as we do.
https://deepwiki.com/search/where-is-the-code-for-the-
busy_4a5ed006-4eed-479f-80c3-dd038832831b
I also fixed the rust bindings so that it yields when we return
`StepResult::IO`, instead of just blocking the async function. To
achieve this I implemented the `Stream` trait for `Rows` struct, which
unfortunately came with a slight change to the function signature of
`rows.next()` to `rows.try_next()`.
EDIT:
~test `test_multiple_connections_fuzz` timeouts because now it has the
busy handler "slowing" things down (this test generates a lot of busy
transactions), so it takes a lot longer for the test to run. Not sure if
it is acceptable for us to reduce the number of operations so the test
is shorter.~
EDIT:
Adjusted the API to be more in line with
https://www.sqlite.org/c3ref/busy_timeout.html.
Sets maximum total accumulated timeout. If the duration is None or Zero,
we unset the busy handler for this Connection.
This api defers slightly from SQLite as instead of sleeping for linear
amount of time specified by the user, we will sleep in phases until the
the total amount of time requested is reached. This means we first sleep
of 1ms, then if we still return busy, we sleep for 2 ms, and repeat
until a maximum of 100 ms per phase or we reached the total timeout.
Example:
1. Set duration to 5ms
2. Step through query -> returns Busy -> sleep/yield for 1 ms
3. Step through query -> returns Busy -> sleep/yield for 2 ms
4. Step through query -> returns Busy -> sleep/yield for 2 ms (totaling
5 ms of sleep)
5. Step through query -> returns Busy -> return Busy to user
This slight api change demonstrated a better throughtput in
`perf/throughput/turso` benchmark
```sh
cargo run -p write-throughput --release -- -t 2

Running write throughput benchmark with 2 threads, 100 batch size, 10 iterations, mode: Legacy
Database created at: write_throughput_test.db
Thread 1: 1000 inserts in 0.04s (23438.42 inserts/sec)
Thread 0: 1000 inserts in 0.08s (12385.64 inserts/sec)

=== BENCHMARK RESULTS ===
Total inserts: 2000
Total time: 0.08s
Overall throughput: 24762.60 inserts/sec
Threads: 2
Batch size: 100
Iterations per thread: 10
Database file exists: true
Database file size: 4096 bytes
```
Depends on #3102
Closes #3067

Closes #3074
2025-09-15 13:52:49 +03:00
Jussi Saurio
aa7a853cd2 mvcc: fix hang when CONCURRENT tx tries to commit and non-CONCURRENT tx is active 2025-09-15 11:09:19 +03:00
Jussi Saurio
9234ef86ae mvcc: fix two sources of panic
1. commit state machine was assuming that begin_write_tx() cannot
fail, but it can fail if there is another tx that is not using
BEGIN CONCURRENT.

2. if a brand new non-CONCURRENT transaction attempts to start
exclusive transaction but fails with Busy, we must end the read
pager read tx it just started, because otherwise the next time
it attempts to do something it will panic with:

"cannot start a new read tx without ending an existing one"
2025-09-15 10:59:44 +03:00
Jussi Saurio
8f43741513 fix mvcc rollback
executing ROLLBACK did not rollback the mv-store transaction
2025-09-15 09:29:08 +03:00
pedrocarlo
3d265489dc modify semantics of busy_timeout to be more on par with sqlite 2025-09-15 02:20:32 -03:00
pedrocarlo
0586b75fbe expose function to set busy timeout duration 2025-09-15 02:20:32 -03:00
pedrocarlo
a56680f79e implement Busy Handler in Turso statements 2025-09-15 02:16:18 -03:00
Jussi Saurio
f4c15a37d3 add manual hack to mvcc test
we rollback the mvcc transaction in the VDBE, so manually roll it
back in the test
2025-09-14 23:46:38 +03:00
Jussi Saurio
db3428a7a9 remove unused pager parameter 2025-09-14 23:44:24 +03:00
Jussi Saurio
d598775e33 mvcc: properly remove mutations of rolled back tx
mvstore was not removing deletions made by a tx that rolled back.
deletions are removed by clearing the `end` mark from the row
version.
2025-09-14 23:29:14 +03:00
Jussi Saurio
dccf8b9472 mvcc: properly clear tx states when mvcc tx rolls back 2025-09-14 23:29:07 +03:00
Jussi Saurio
487b8710d9 mvcc: don't double-rollback on write-write-conflict
handle_program_error() already rolls back if this error happens.
double rollback causes a crash.
2025-09-14 23:28:21 +03:00
Jussi Saurio
2ca1640a2a not always write 2025-09-14 22:24:07 +03:00
Jussi Saurio
396091044e store tx_mode in conn.mv_tx
otherwise op_transaction works completely wrong because each separate
insert statement overrides the tx_mode to Write
2025-09-14 21:59:08 +03:00
Jussi Saurio
7fe25a1d0e mvcc: remove conn.mv_transactions
afaict this isn't needed for anything since there is already
conn.mv_tx_id
2025-09-14 21:26:58 +03:00
Jussi Saurio
5feb9ea2f0 mvcc: fix non-concurrent transaction semantics
on the main branch, mvcc allows concurrent inserts from multiple
txns even without BEGIN CONCURRENT, and then always hangs whenever
one of the txns tries to commit.

this commit fixes that issue.
2025-09-14 21:23:06 +03:00
Jussi Saurio
2ea1798d6e mvcc: end commit state machine early when write set is empty 2025-09-14 20:02:35 +03:00
Pekka Enberg
95660535da core/storage: Demote info logging to debug 2025-09-14 13:10:46 +03:00
PThorpe92
f6dd0bc4d6 Dont grab page cache write lock in a loop 2025-09-13 12:21:13 -04:00
Pekka Enberg
6a2f0d6061 Merge 'Add per page checksums' from Avinash Sajjanshetty
This patch adds checksums to Turso DB. You may check the design here in
the [RFC](https://github.com/tursodatabase/turso/issues/2178).
1. We use reserved bytes (8 bytes) to store the checksums. On every IO
read, we verify that the checksum matches.
2. We use twox hash for checksums.
3. Checksum works only on 4K pages now. It's a small change to enable
for all other sizes, I will send another PR.
4. Right now, it's not possible to switch to different algorithm or turn
off altogether. That will be added in the future PRs.
5. Checksums can be enabled only for new dbs. For existing DBs, we will
disable it.
6. To add checksums for existing DBs, we need vacuum since it would
require rewrite of whole db.

Closes #2840
2025-09-13 18:46:53 +03:00
Pekka Enberg
d8f07fe3da core: Panic on fsync() error by default
Retrying fsync() on error was historically not safe ("fsyncgate") and
Postgres still defaults to panicing on fsync(). Therefore, add a
"data_sync_retry" pragma (disabled by default) and use it to determine
whether to panic on fsync() error or not.
2025-09-13 10:21:12 +03:00
Pekka Enberg
a7e34f1551 Merge 'Handle partial writes in unix IO for pwrite and pwritev' from Preston Thorpe
currently, `io_uring` is setup to handle partial writes for `pwritev`
(will add `pwrite` in subsequent PR), but unix and other IO back-ends
were not correctly setup for this.

Closes #3073
2025-09-13 09:08:43 +03:00
Avinash Sajjanshetty
5256f29a9c Add checksums behind a feature flag 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
11030056c7 rename method to verify_checksum 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
e010c46552 use checksums when reading/writing from db file 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
4b59cf19e5 use checksums when reading/writing from wal 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
14a1307720 Set reserved space as required when allocating page1 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
3b410e4f79 set required reserved bytes while initialising the pager 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
2e6943bfdf Add helper to read reserved bytes value from disk 2025-09-13 11:00:39 +05:30
Avinash Sajjanshetty
c2c1ec2dba Pass use usable_space() instead of hardcoding the value 2025-09-13 11:00:38 +05:30
Avinash Sajjanshetty
15266105f7 Update IOContext to carry checksum ctx 2025-09-13 11:00:38 +05:30
Avinash Sajjanshetty
3f72de3623 Add checksum module 2025-09-13 11:00:37 +05:30
TcMits
48522c1cc0 remove Stmt clone 2025-09-13 12:08:29 +07:00
PThorpe92
6098bca211 Handle partial writes in unix IO for pwrite and pwritev 2025-09-12 18:13:02 -04:00
Preston Thorpe
b1420904bb Merge 'fix(btree): advance cursor after interior node replacement in delete' from Jussi Saurio
## Problem
When a delete replaces an index interior cell, the replacement key is LT
the deleted key. Currently on the main branch, after the deletion
happens, the following call to BTreeCursor::next() stops at the replaced
interior cell.
This is incorrect - imagine the following sequence:
- We are executing a query that deletes all keys WHERE key > 5
- We delete <key=6> from an interior node, and take a replacement
<key=5> from the left subtree of that interior page
- next() is called, and we land on the interior node again, which now
has <key=5>, and we incorrectly delete it even though our WHERE
condition is key > 5.
## Solution
This PR:
- Tracks `interior_node_was_replaced` in CheckNeedsBalancing
- If no balancing is needed and a replacement occurred, advances once so
the next invocation of next() will skip the replaced cell properly
i.e. we prevent next() from landing on the replaced content and ensures
iteration continues with the next logical record.
## Details
This problem only became apparent once we started using indexes as valid
iteration cursors for DELETE operations in #2981
Closes #3045

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Reviewed-by: Preston Thorpe <preston@turso.tech>

Closes #3049
2025-09-12 17:37:01 -04:00
Pekka Enberg
ad6157028e Merge 'core/vdbe: Fix BEGIN CONCURRENT transactions' from Pekka Enberg
The transaction upgrade logic in Transaction opcode is total nonsense
for concurrent transactions so just drop it.
Fixes #3061

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #3070
2025-09-12 23:11:12 +03:00
Pekka Enberg
a0921c4221 Merge 'core/storage: Remove unused import warning' from Pekka Enberg
Closes #3069
2025-09-12 23:11:05 +03:00
Pekka Enberg
5e2b1bc0d3 Merge 'Fix incompatible math functions' from Levy A.
Fixes #1817, #2068, #1326, #1397.
The solution is very much not ideal, but fixes all math function related
incompatibilities.

Reviewed-by: Preston Thorpe <preston@turso.tech>

Closes #3033
2025-09-12 21:28:08 +03:00
Pekka Enberg
86dcdad3d0 core/vdbe: Fix BEGIN CONCURRENT transactions
The transaction upgrade logic in Transaction opcode is total nonsense
for concurrent transactions so just drop it.

Fixes #3061
2025-09-12 21:19:34 +03:00
Pekka Enberg
2bc8c0c850 core/storage: Remove unused import warning 2025-09-12 21:09:38 +03:00
Pekka Enberg
dcd43ab8fc Merge 'Handle EXPLAIN QUERY PLAN like SQLite' from Lâm Hoàng Phúc
After this PR:
```
turso> EXPLAIN QUERY PLAN SELECT 1;
QUERY PLAN
`--SCAN CONSTANT ROW
turso> EXPLAIN QUERY PLAN SELECT 1 UNION SELECT 1;
QUERY PLAN
`--COMPOUND QUERY
   |--LEFT-MOST SUBQUERY
   |  `--SCAN CONSTANT ROW
   `--UNION USING TEMP B-TREE
      `--SCAN CONSTANT ROW
turso> CREATE TABLE x(y);
turso> CREATE TABLE z(y);
turso> EXPLAIN QUERY PLAN SELECT * from x,z;
QUERY PLAN
|--SCAN x
`--SCAN z
turso> EXPLAIN QUERY PLAN SELECT * from x,z ON x.y = z.y;
QUERY PLAN
|--SCAN x
`--SEARCH z USING INDEX ephemeral_z_t2
turso>
```

Closes #3057
2025-09-12 20:41:23 +03:00
PThorpe92
b04c364981 Fix clippy error 2025-09-12 11:43:38 -04:00
PThorpe92
7a14c7394f Remove the header copy stored on the WalFile, fix fast_path 2025-09-12 11:29:43 -04:00
PThorpe92
25e7c719f1 Update checkpoint_seq on each checkpoint, not just when log restarts
This was causing checkpoint_seq to be 0 when we had already successfully
ran a passive checkpoint, and causing us to use improper pages from the
cache.
2025-09-12 11:29:42 -04:00
Pekka Enberg
14da283e36 Merge 'MVCC: remove reliance on BTreeCursor::has_record()' from Jussi Saurio
Closes #3051
Closes #3032

Closes #3056
2025-09-12 17:31:15 +03:00
Pekka Enberg
54b4c9f30b Merge 'Implement the balance_quick algorithm' from Jussi Saurio
Fast balancing routine for the common special case where the rightmost
leaf page of a given subtree overflows such that the overflowing cell
would be the rightmost cell on the page -- i.e. an append. In this case
we just add a new leaf page as the right sibling of that page, put the
overflow cell there, and insert a new divider cell into the parent. The
high level steps are:
1. Allocate a new leaf page and insert the overflow cell payload in it.
2. Create a new divider cell in the parent - it contains the page number
of the old rightmost leaf, plus the largest rowid on that page.
3. Update the rightmost pointer of the parent to point to the new leaf
page.
4. Continue balance from the parent page (inserting the new divider cell
may have overflowed the parent

Closes #3041
2025-09-12 17:30:52 +03:00
Pekka Enberg
443720c74a Merge 'benchmark: introduce simple 1 thread concurrent benchmark for mvcc/sq…' from Pere Diaz Bou
…lite/wal
This is considerably simpler with 1 thread as we just try to yield
control when I/O happens and we only run io.run_once when all
connections tried to do some work. This allows connections to
cooperatively progress.

Closes #3060
2025-09-12 17:27:41 +03:00
Pekka Enberg
7fdb116d41 Merge 'core/mvcc: queue mvcc txns on pager's end_tx' from Pere Diaz Bou
Flushing mvcc changes to disk requires serialization. To do so we simply
introduce a lock for pager.end_tx, which will take ownership of flushing
to WAL. Once this is finished we can simply release lock.
When multiple tx writes happen concurrently in mvcc, max frame will be
updated. This new max_frame makes is the point of view of the other
transaction return busy because his current wal snapshot is outdated.

Closes #3059
2025-09-12 17:27:17 +03:00
Pere Diaz Bou
ec2cff2026 benchmark: introduce simple 1 thread concurrent benchmark for mvcc/sqlite/wal
This is considerably simpler with 1 thread as we just try to yield
control when I/O happens and we only run io.run_once when all
connections tried to do some work. This allows connections to
cooperatively progress.
2025-09-12 14:02:57 +00:00
Pere Diaz Bou
39fb5913e0 core/mvcc: queue write txn commits in mvcc on pager end_tx
Flushing mvcc changes to disk requires serialization. To do so we simply
introduce a lock for pager.end_tx, which will take ownership of flushing
to WAL. Once this is finished we can simply release lock.
2025-09-12 14:00:02 +00:00