This PR extends the raw WAL API with a few methods that will be helpful for
offline sync:
1. `try_wal_watermark_read_page` - try to read a page from the DB at a
given WAL watermark value
* Usually, the WAL max_frame is set automatically to the latest value
(`shared.max_frame`) when a transaction is started, and this
"watermark" is then preserved throughout the whole transaction
* The new method allows simulating a "read from the past" by controlling
the frame watermark explicitly
* An alternative would be to implement an API like
`start_read_session(frame_watermark: u64)`, but I decided to expose
just a single method, to simplify the logic and reduce the "surface" of
actions that can be executed in this "controllable" manner
* Also, for simplicity, `try_wal_watermark_read_page` currently always
reads data from disk and bypasses any cached values (and it does not
populate the cache either)
2. `wal_changed_pages_after` - return the set of unique pages changed
after the watermark WAL position in the current WAL session
With these two methods we can implement `REVERT frame_watermark` logic,
which first fetches all changed pages and then reverts each of them to
its previous value using the `try_wal_watermark_read_page` and
`wal_insert_frame` methods (see the `test_wal_api_revert_pages` test and
the sketch below).
Note that if there were schema changes, the `REVERT` logic described
above can bring the connection into an inconsistent state: it preserves
schema information in memory and will still think a table exists even
though it may have been reverted away. Any consumer of these new methods
should take this into account.
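As a rough illustration, the revert loop could look like the following pseudo-Rust sketch. Only the method names come from this PR; the exact signatures and the `Connection` receiver are assumptions, so check `test_wal_api_revert_pages` for the real usage:

```rs
// Hypothetical sketch of `REVERT frame_watermark`; signatures are assumed.
fn revert_to_watermark(conn: &Connection, watermark: u64) -> Result<()> {
    // Collect the unique pages changed after the watermark in the current WAL session.
    let changed_pages = conn.wal_changed_pages_after(watermark)?;
    for page_no in changed_pages {
        // Read the page as it existed at the watermark, straight from disk,
        // bypassing (and not populating) the page cache.
        let old_content = conn.try_wal_watermark_read_page(page_no, Some(watermark))?;
        // Append the old content as a new WAL frame, undoing the later change.
        conn.wal_insert_frame(page_no, &old_content)?;
    }
    Ok(())
}
```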
Closes#2433
Closes#2421
## Background
We have some kind of transaction-local hack (`start_pages_in_frames`)
for bookkeeping which pages are currently in the in-memory WAL frame
cache, I assume for performance reasons.
`wal.rollback()` clears all the frames from `shared.frame_cache` that
the tx being rolled back is allowed to clear, and then truncates
`shared.pages_in_frames` to whatever its local
`start_pages_in_frames` value was.
## Problem
In `complete_append_frame`, we check if `frame_cache` already has that
key (page), and if not, we add it to `pages_in_frames`.
However, `wal.rollback()` never _removes_ the key (page) if its value is
empty, so we can end up in a scenario where the `frame_cache` key for
`page P` exists but has no frames, and so `page P` does not get added to
`pages_in_frames` in `complete_append_frame` (see the sketch after the
scenario below).
This leads to a checkpoint data loss scenario:
- A transaction rolls back with `start_pages_in_frames=0`, so it truncates
the shared `pages_in_frames` to an empty vec. Let's say the `page P` key in
`frame_cache` still remains but it has no frames.
- The next time someone commits a frame for `page P`, it does NOT get
added to `pages_in_frames` because `frame_cache` has that key (although
the value vector is empty)
- At some point, a checkpoint checkpoints `n` frames, but since
`pages_in_frames` does not have `page P`, it doesn't actually checkpoint
it and all the "checkpointed" frames are simply thrown away
- very similar to the scenario in #2366
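A simplified, self-contained model of the buggy bookkeeping (the field and function shapes here are assumptions for illustration, not the actual code):

```rs
use std::collections::HashMap;

// Simplified model of the shared WAL bookkeeping state.
struct Shared {
    frame_cache: HashMap<u64, Vec<u64>>, // page -> frames in the WAL
    pages_in_frames: Vec<u64>,           // pages that have frames in the WAL
}

fn complete_append_frame(shared: &mut Shared, page: u64, frame: u64) {
    // BUG: an empty Vec left behind by rollback still counts as "present",
    // so `page` is never re-added to `pages_in_frames`.
    if !shared.frame_cache.contains_key(&page) {
        shared.pages_in_frames.push(page);
    }
    shared.frame_cache.entry(page).or_default().push(frame);
}
```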
## Fix
Remove the `start_pages_in_frames` hack entirely and just make
`pages_in_frames` effectively the same as `frame_cache.keys`. I think we
could also get rid of `pages_in_frames` altogether and use
`frame_cache.contains_key(p)`, but maybe Pere can chime in here.
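A minimal sketch of the intended invariant, using the same assumed data shapes as above (not the actual code):

```rs
use std::collections::HashMap;

// After any rollback, `pages_in_frames` is exactly the key set of `frame_cache`.
fn rollback_frames(
    frame_cache: &mut HashMap<u64, Vec<u64>>,
    pages_in_frames: &mut Vec<u64>,
    min_frame: u64,
) {
    // Drop the rolled-back frames, and drop the key entirely if nothing remains.
    frame_cache.retain(|_, frames| {
        frames.retain(|&f| f < min_frame);
        !frames.is_empty()
    });
    // Restore the invariant: pages_in_frames == frame_cache.keys().
    pages_in_frames.retain(|p| frame_cache.contains_key(p));
}
```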
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2422
- `try_wal_watermark_read_page` - try to read a page from the DB at a given WAL watermark value
- `wal_changed_pages_after` - return the set of unique pages changed after the watermark WAL position
The WAL API shouldn't be exposed by default: it is a relatively
dangerous API that we use internally, and ordinary users shouldn't need
to touch it.
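For illustration, the gating could look roughly like this; the feature name `wal_api` is an assumption here, not necessarily the crate's actual flag:

```rs
// Hedged sketch only: compile the raw WAL methods behind an opt-in cargo
// feature so they are invisible by default.
pub struct Connection;

#[cfg(feature = "wal_api")]
impl Connection {
    /// Returns the unique pages changed after `watermark` in the current WAL session.
    pub fn wal_changed_pages_after(&self, watermark: u64) -> Vec<u64> {
        // Placeholder body for the sketch; the real method inspects the WAL.
        let _ = watermark;
        Vec::new()
    }
}
```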
Reviewed-by: Pekka Enberg <penberg@iki.fi>
Closes#2424
This should be safe to do because:
1. page cache is private per connection
2. since this connection wrote the flushed pages/frames, they are up to
date from its perspective
3. multiple concurrent statements inside one connection are not
snapshot-transactional even in sqlite
Reviewed-by: Pekka Enberg <penberg@iki.fi>
Closes#2407
If the table is an intkey table, we can read the rowid directly without
deserializing the full cell; we also don't need to start deserializing
the record at all if only the rowid is requested.
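For context, an illustrative sketch (not the project's actual code) of why this is cheap: in a table (intkey) b-tree leaf cell, the rowid varint sits right after the payload-length varint, so no record decoding is needed:

```rs
// Decode a SQLite-style varint: up to 9 bytes, 7 bits per byte except the
// 9th byte, which contributes a full 8 bits. Returns (value, bytes_read).
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().take(9).enumerate() {
        if i == 8 {
            return ((value << 8) | byte as u64, 9);
        }
        value = (value << 7) | (byte & 0x7f) as u64;
        if byte & 0x80 == 0 {
            return (value, i + 1);
        }
    }
    (value, buf.len().min(9))
}

// Table b-tree leaf cell layout: [payload-len varint][rowid varint][payload...],
// so reading the rowid means skipping exactly one varint.
fn table_leaf_cell_rowid(cell: &[u8]) -> u64 {
    let (_payload_len, n) = read_varint(cell);
    let (rowid, _) = read_varint(&cell[n..]);
    rowid
}
```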
```
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/1: Collecting 100 samples in estimated 5.0007 s (11M i
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/1
time: [469.38 ns 470.77 ns 472.40 ns]
change: [-5.8959% -5.5232% -5.1840%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/10: Collecting 100 samples in estimated 5.0088 s (1.9M
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/10
time: [2.6523 µs 2.6596 µs 2.6685 µs]
change: [-8.7117% -8.4083% -8.0949%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/50: Collecting 100 samples in estimated 5.0197 s (399k
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/50
time: [12.514 µs 12.545 µs 12.578 µs]
change: [-9.5243% -9.0562% -8.6227%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/100: Collecting 100 samples in estimated 5.0600 s (202
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/100
time: [25.135 µs 25.291 µs 25.470 µs]
change: [-8.8822% -8.3943% -7.8854%] (p = 0.00 < 0.05)
Performance has improved.
```
"only" 4x slower than sqlite on `SELECT * FROM users LIMIT 100` after
this!
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2382
We need to load rowids into MVCC's store, in order, before doing any
read, in case there are rows.
For now this has the expected performance penalty, because ideally we
should scan for rowids lazily instead.
On MVCC `commit_txn` we need to persist changes to the database; for this we re-use the pager's transaction semantics:
1. If there are no conflicts, we start `pager.begin_write_txn`
2. `pager.end_txn`: we flush changes to the WAL
3. We finish the MVCC transaction by marking rows with the new timestamp.
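In pseudo-Rust, the sequence is roughly as follows; `begin_write_txn`/`end_txn` are named above, while all the types and the `mark_committed` helper are stand-ins for illustration:

```rs
// Hedged sketch of the commit sequence described above.
fn commit_txn(mvcc: &mut MvccStore, pager: &mut Pager, tx: TxId) -> Result<(), Error> {
    // 1. No conflicts detected: take the pager's write transaction.
    pager.begin_write_txn()?;
    // 2. Flush this transaction's changes to the WAL.
    pager.end_txn()?;
    // 3. Mark the transaction's rows with the new commit timestamp.
    mvcc.mark_committed(tx)?;
    Ok(())
}
```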
## What
The following sequence of actions is possible:
```sql
-- TRUNCATE checkpoint fails during WAL restart,
-- but OngoingCheckpoint.state is still left at Done for conn 0
Connection 0(op=23): PRAGMA wal_checkpoint(TRUNCATE)
Connection 0(op=23) Checkpoint TRUNCATE: OK: false, wal_page_count: NULL, checkpointed_count: NULL
-- TRUNCATE checkpoint succeeds for conn 1
Connection 1(op=26): PRAGMA wal_checkpoint(TRUNCATE)
Connection 1(op=26) Checkpoint TRUNCATE: OK: true, wal_page_count: 0, checkpointed_count: 0
-- Conn 0 now does a PASSIVE checkpoint, and immediately thinks
-- it's in the Done state, and thinks it checkpointed 17 frames.
-- since mode is PASSIVE, it now thinks both the WAL and the DB have those 17 frames
-- so the first 17 frames of the WAL can be ignored from now on.
Connection 0(op=27): PRAGMA wal_checkpoint(PASSIVE)
Connection 0(op=27) Checkpoint PASSIVE: OK: true, wal_page_count: 0, checkpointed_count: 17
-- Connection 0 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and deletes rowid=690, which becomes WAL frame number 1
Connection 0(op=28): DELETE FROM test_table WHERE id = 690
begin_read_tx(min=18, max=0, slot=1, max_frame_in_wal=0)
-- Connection 1 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and inserts rowid=1128, which becomes WAL frame number 2
Connection 1(op=28): INSERT INTO test_table (id, text) VALUES (1128, text_560)
begin_read_tx(min=18, max=1, slot=1, max_frame_in_wal=1)
-- Connection 0 again starts tx with min=18, and performs a read, and two wrong things happen:
-- 1. it doesn't see row 690 as deleted, because it's in WAL frame 1, which it ignores
-- 2. it doesn't see the new row 1128, because it's in WAL frame 2, which it ignores
Connection 0(op=29): SELECT * FROM test_table
begin_read_tx(min=18, max=2, slot=1, max_frame_in_wal=2)
```
## Fix
Reset `ongoing_checkpoint.state` to `Start` when a checkpoint fails.
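A minimal self-contained sketch of the fix (the state shape is assumed for illustration): a failed checkpoint must not leave this connection's state machine at `Done`.

```rs
enum CheckpointState {
    Start,
    Done,
}

fn finish_checkpoint(state: &mut CheckpointState, succeeded: bool) {
    if !succeeded {
        // Reset so the next checkpoint runs from scratch instead of
        // instantly reporting stale progress from the failed attempt.
        *state = CheckpointState::Start;
    }
}
```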
Issue found in #2364.
Reviewed-by: bit-aloo (@Shourya742)
Closes#2380
Closes#2363
## What
The following sequence of actions is possible:
```
Some committed frames already exist in the WAL. shared.pages_in_frames.len() > 0.
Brand new connection does this:
BEGIN
^-- deferred, no read tx started yet, so its `self.start_pages_in_frames` is `0`
because it's a brand new WalFile instance
ROLLBACK <-- calls `wal.rollback()` and truncates `shared.pages_in_frames` to length `0`
PRAGMA wal_checkpoint();
^-- because `pages_in_frames` is empty, it doesn't actually
checkpoint anything but still sets shared.max_frame to 0, effectively causing data loss
```
## Fix
- Only call `wal.rollback()` for write transactions
- Set `start_pages_in_frames` correctly so that this doesn't happen even
if a regression starts calling `wal.rollback()` again
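A hedged sketch of the intended behavior (all shapes simplified, not the actual code): undo shared WAL bookkeeping only when the finishing transaction actually held a write lock, and take the `start_pages_in_frames` snapshot when the write tx begins rather than when the `WalFile` is created.

```rs
struct WalFile {
    start_pages_in_frames: usize,
}

impl WalFile {
    fn begin_write_tx(&mut self, shared_pages_in_frames_len: usize) {
        // Snapshot taken at write-tx start, not at WalFile creation.
        self.start_pages_in_frames = shared_pages_in_frames_len;
    }

    fn end_tx(&mut self, was_write_tx: bool, shared_pages_in_frames: &mut Vec<u64>) {
        if was_write_tx {
            // Only a write tx may truncate shared state it actually grew.
            shared_pages_in_frames.truncate(self.start_pages_in_frames);
        }
        // A read-only tx appended nothing, so there is nothing to undo.
    }
}
```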
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2366
This PR introduces two methods on the pager, very much inspired by
`with_schema` and `with_schema_mut`. `Pager::with_header` and
`Pager::with_header_mut` give the closure a shared or unique
reference, respectively, transmuted from the `PageRef` buffer.
This PR also adds type-safe wrappers for `Version`, `PageSize`,
`CacheSize` and `TextEncoding`, as they have special in-memory
representations.
Writing the `DatabaseHeader` is just a single `memcpy` now.
```rs
pub fn write_database_header(&self, header: &DatabaseHeader) {
    let buf = self.as_ptr();
    buf[0..DatabaseHeader::SIZE].copy_from_slice(bytemuck::bytes_of(header));
}
```
`HeaderRef` and `HeaderRefMut` are used in the `with_header*` methods,
but they can also be used on their own when there are multiple reads and
writes to the header and putting everything in a closure would add too
much nesting.
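A toy model of the closure pattern; the fields and types here are simplified stand-ins, not the real `DatabaseHeader`:

```rs
// Simplified stand-in types; the real implementation hands out references
// transmuted from the page buffer.
struct DatabaseHeader {
    page_size: u16,
    schema_cookie: u32,
}

struct Pager {
    header: DatabaseHeader,
}

impl Pager {
    fn with_header<T>(&self, f: impl FnOnce(&DatabaseHeader) -> T) -> T {
        f(&self.header)
    }

    fn with_header_mut<T>(&mut self, f: impl FnOnce(&mut DatabaseHeader) -> T) -> T {
        f(&mut self.header)
    }
}

// Example: a read-modify-write confined to one closure.
fn bump_schema_cookie(pager: &mut Pager) -> u32 {
    pager.with_header_mut(|h| {
        h.schema_cookie += 1;
        h.schema_cookie
    })
}
```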
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2234
When traversing, we are only interested in the following things:
- Is the page a leaf or not
- Is the page an index or table page
- If not a leaf, what is the left child page
This means we don't have to read the entire cell, just the left child
page.
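An illustrative sketch (not the project's actual code) of why the left child is so cheap to read: in SQLite's file format, an interior cell starts with a 4-byte big-endian left-child page number, and the page type is the first byte of the page header.

```rs
// An interior b-tree cell begins with the 4-byte big-endian left-child
// page number, so traversal never needs to parse the rest of the cell.
fn interior_cell_left_child(cell: &[u8]) -> u32 {
    u32::from_be_bytes([cell[0], cell[1], cell[2], cell[3]])
}

// Page type byte from the page header:
// 0x02 = index interior, 0x05 = table interior,
// 0x0a = index leaf,     0x0d = table leaf.
fn is_leaf(page_type: u8) -> bool {
    matches!(page_type, 0x0a | 0x0d)
}
```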
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2317