This PR extends the raw WAL API with a few methods that will be helpful for
offline sync:
1. `try_wal_watermark_read_page` - try to read a page from the DB at a
given WAL watermark value
* Usually, the WAL max_frame is set automatically to the latest value
(`shared.max_frame`) when a transaction is started, and this
"watermark" is then preserved throughout the whole transaction
* The new method allows simulating a "read from the past" by controlling
the frame watermark explicitly
* An alternative would be to implement an API like
`start_read_session(frame_watermark: u64)`, but I decided to expose
just a single method, to simplify the logic and reduce the "surface" of
actions that can be executed in this "controllable" manner
* Also, for simplicity, `try_wal_watermark_read_page` currently always
reads data from disk and bypasses any cached values (and it does not
populate the cache either)
2. `wal_changed_pages_after` - return the set of unique pages changed
after the watermark WAL position in the current WAL session
With these two methods we can implement `REVERT frame_watermark` logic,
which first fetches all changed pages and then reverts each of them to
its previous value using the `try_wal_watermark_read_page` and
`wal_insert_frame` methods (see the `test_wal_api_revert_pages` test and
the sketch below).
Note that if there were schema changes, the `REVERT` logic described
above can bring the connection into an inconsistent state: it preserves
schema information in memory and will still think a table exists even
though it may have been reverted away. Any consumer of these new methods
should take this into account.
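As a rough illustration, the revert loop could look like the following pseudo-Rust sketch. Only the method names come from this PR; the exact signatures and the `Connection` receiver are assumptions, so check `test_wal_api_revert_pages` for the real usage:

```rs
// Hypothetical sketch of `REVERT frame_watermark`; signatures are assumed.
fn revert_to_watermark(conn: &Connection, watermark: u64) -> Result<()> {
    // Collect the unique pages changed after the watermark in the current WAL session.
    let changed_pages = conn.wal_changed_pages_after(watermark)?;
    for page_no in changed_pages {
        // Read the page as it existed at the watermark, straight from disk,
        // bypassing (and not populating) the page cache.
        let old_content = conn.try_wal_watermark_read_page(page_no, Some(watermark))?;
        // Append the old content as a new WAL frame, undoing the later change.
        conn.wal_insert_frame(page_no, &old_content)?;
    }
    Ok(())
}
```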
Closes#2433
Closes#2421
## Background
We have some kind of transaction-local hack (`start_pages_in_frames`)
for bookkeeping which pages are currently in the in-memory WAL frame
cache, I assume for performance reasons.
`wal.rollback()` clears all the frames from `shared.frame_cache` that
the tx being rolled back is allowed to clear, and then truncates
`shared.pages_in_frames` to whatever its local
`start_pages_in_frames` value was.
## Problem
In `complete_append_frame`, we check if `frame_cache` already has that
key (page), and if not, we add it to `pages_in_frames`.
However, `wal.rollback()` never _removes_ the key (page) if its value is
empty, so we can end up in a scenario where the `frame_cache` key for
`page P` exists but has no frames, and so `page P` does not get added to
`pages_in_frames` in `complete_append_frame` (see the sketch after the
scenario below).
This leads to a checkpoint data loss scenario:
- A transaction rolls back with `start_pages_in_frames=0`, so it truncates
the shared `pages_in_frames` to an empty vec. Let's say the `page P` key in
`frame_cache` still remains but it has no frames.
- The next time someone commits a frame for `page P`, it does NOT get
added to `pages_in_frames` because `frame_cache` has that key (although
the value vector is empty)
- At some point, a checkpoint checkpoints `n` frames, but since
`pages_in_frames` does not have `page P`, it doesn't actually checkpoint
it and all the "checkpointed" frames are simply thrown away
- very similar to the scenario in #2366
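A simplified, self-contained model of the buggy bookkeeping (the field and function shapes here are assumptions for illustration, not the actual code):

```rs
use std::collections::HashMap;

// Simplified model of the shared WAL bookkeeping state.
struct Shared {
    frame_cache: HashMap<u64, Vec<u64>>, // page -> frames in the WAL
    pages_in_frames: Vec<u64>,           // pages that have frames in the WAL
}

fn complete_append_frame(shared: &mut Shared, page: u64, frame: u64) {
    // BUG: an empty Vec left behind by rollback still counts as "present",
    // so `page` is never re-added to `pages_in_frames`.
    if !shared.frame_cache.contains_key(&page) {
        shared.pages_in_frames.push(page);
    }
    shared.frame_cache.entry(page).or_default().push(frame);
}
```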
## Fix
Remove the `start_pages_in_frames` hack entirely and just make
`pages_in_frames` effectively the same as `frame_cache.keys`. I think we
could also get rid of `pages_in_frames` altogether and use
`frame_cache.contains_key(p)`, but maybe Pere can chime in here.
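A minimal sketch of the intended invariant, using the same assumed data shapes as above (not the actual code):

```rs
use std::collections::HashMap;

// After any rollback, `pages_in_frames` is exactly the key set of `frame_cache`.
fn rollback_frames(
    frame_cache: &mut HashMap<u64, Vec<u64>>,
    pages_in_frames: &mut Vec<u64>,
    min_frame: u64,
) {
    // Drop the rolled-back frames, and drop the key entirely if nothing remains.
    frame_cache.retain(|_, frames| {
        frames.retain(|&f| f < min_frame);
        !frames.is_empty()
    });
    // Restore the invariant: pages_in_frames == frame_cache.keys().
    pages_in_frames.retain(|p| frame_cache.contains_key(p));
}
```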
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2422
- `try_wal_watermark_read_page` - try to read a page from the DB at a given WAL watermark value
- `wal_changed_pages_after` - return the set of unique pages changed after the watermark WAL position
The WAL API shouldn't be exposed by default: it is a relatively
dangerous API that we use internally, and ordinary users shouldn't need
to touch it.
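For illustration, the gating could look roughly like this; the feature name `wal_api` is an assumption here, not necessarily the crate's actual flag:

```rs
// Hedged sketch only: compile the raw WAL methods behind an opt-in cargo
// feature so they are invisible by default.
pub struct Connection;

#[cfg(feature = "wal_api")]
impl Connection {
    /// Returns the unique pages changed after `watermark` in the current WAL session.
    pub fn wal_changed_pages_after(&self, watermark: u64) -> Vec<u64> {
        // Placeholder body for the sketch; the real method inspects the WAL.
        let _ = watermark;
        Vec::new()
    }
}
```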
Reviewed-by: Pekka Enberg <penberg@iki.fi>
Closes#2424
This should be safe to do because:
1. page cache is private per connection
2. since this connection wrote the flushed pages/frames, they are up to
date from its perspective
3. multiple concurrent statements inside one connection are not
snapshot-transactional even in sqlite
Reviewed-by: Pekka Enberg <penberg@iki.fi>
Closes#2407
If the table is an intkey table, we can read the rowid directly without
deserializing the full cell; we also don't need to start deserializing
the record at all if only the rowid is requested.
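For context, an illustrative sketch (not the project's actual code) of why this is cheap: in a table (intkey) b-tree leaf cell, the rowid varint sits right after the payload-length varint, so no record decoding is needed:

```rs
// Decode a SQLite-style varint: up to 9 bytes, 7 bits per byte except the
// 9th byte, which contributes a full 8 bits. Returns (value, bytes_read).
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().take(9).enumerate() {
        if i == 8 {
            return ((value << 8) | byte as u64, 9);
        }
        value = (value << 7) | (byte & 0x7f) as u64;
        if byte & 0x80 == 0 {
            return (value, i + 1);
        }
    }
    (value, buf.len().min(9))
}

// Table b-tree leaf cell layout: [payload-len varint][rowid varint][payload...],
// so reading the rowid means skipping exactly one varint.
fn table_leaf_cell_rowid(cell: &[u8]) -> u64 {
    let (_payload_len, n) = read_varint(cell);
    let (rowid, _) = read_varint(&cell[n..]);
    rowid
}
```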
```
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/1: Collecting 100 samples in estimated 5.0007 s (11M i
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/1
time: [469.38 ns 470.77 ns 472.40 ns]
change: [-5.8959% -5.5232% -5.1840%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/10: Collecting 100 samples in estimated 5.0088 s (1.9M
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/10
time: [2.6523 µs 2.6596 µs 2.6685 µs]
change: [-8.7117% -8.4083% -8.0949%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low mild
3 (3.00%) high mild
3 (3.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/50: Collecting 100 samples in estimated 5.0197 s (399k
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/50
time: [12.514 µs 12.545 µs 12.578 µs]
change: [-9.5243% -9.0562% -8.6227%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Benchmarking Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/100: Collecting 100 samples in estimated 5.0600 s (202
Execute `SELECT * FROM users LIMIT ?`/limbo_execute_select_rows/100
time: [25.135 µs 25.291 µs 25.470 µs]
change: [-8.8822% -8.3943% -7.8854%] (p = 0.00 < 0.05)
Performance has improved.
```
"only" 4x slower than sqlite on `SELECT * FROM users LIMIT 100` after
this!
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2382
We need to load rowids into MVCC's store, in order, before doing any
read, in case there are rows.
For now this has the expected performance penalty, because ideally we
should scan for rowids lazily instead.
On MVCC `commit_txn` we need to persist changes to the database; for this we re-use the pager's transaction semantics:
1. If there are no conflicts, we start `pager.begin_write_txn`
2. `pager.end_txn`: we flush changes to the WAL
3. We finish the MVCC transaction by marking rows with the new timestamp.
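In pseudo-Rust, the sequence is roughly as follows; `begin_write_txn`/`end_txn` are named above, while all the types and the `mark_committed` helper are stand-ins for illustration:

```rs
// Hedged sketch of the commit sequence described above.
fn commit_txn(mvcc: &mut MvccStore, pager: &mut Pager, tx: TxId) -> Result<(), Error> {
    // 1. No conflicts detected: take the pager's write transaction.
    pager.begin_write_txn()?;
    // 2. Flush this transaction's changes to the WAL.
    pager.end_txn()?;
    // 3. Mark the transaction's rows with the new commit timestamp.
    mvcc.mark_committed(tx)?;
    Ok(())
}
```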
## What
The following sequence of actions is possible:
```sql
-- TRUNCATE checkpoint fails during WAL restart,
-- but OngoingCheckpoint.state is still left at Done for conn 0
Connection 0(op=23): PRAGMA wal_checkpoint(TRUNCATE)
Connection 0(op=23) Checkpoint TRUNCATE: OK: false, wal_page_count: NULL, checkpointed_count: NULL
-- TRUNCATE checkpoint succeeds for conn 1
Connection 1(op=26): PRAGMA wal_checkpoint(TRUNCATE)
Connection 1(op=26) Checkpoint TRUNCATE: OK: true, wal_page_count: 0, checkpointed_count: 0
-- Conn 0 now does a PASSIVE checkpoint, and immediately thinks
-- it's in the Done state, and thinks it checkpointed 17 frames.
-- since mode is PASSIVE, it now thinks both the WAL and the DB have those 17 frames
-- so the first 17 frames of the WAL can be ignored from now on.
Connection 0(op=27): PRAGMA wal_checkpoint(PASSIVE)
Connection 0(op=27) Checkpoint PASSIVE: OK: true, wal_page_count: 0, checkpointed_count: 17
-- Connection 0 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and deletes rowid=690, which becomes WAL frame number 1
Connection 0(op=28): DELETE FROM test_table WHERE id = 690
begin_read_tx(min=18, max=0, slot=1, max_frame_in_wal=0)
-- Connection 1 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and inserts rowid=1128, which becomes WAL frame number 2
Connection 1(op=28): INSERT INTO test_table (id, text) VALUES (1128, text_560)
begin_read_tx(min=18, max=1, slot=1, max_frame_in_wal=1)
-- Connection 0 again starts tx with min=18, and performs a read, and two wrong things happen:
-- 1. it doesn't see row 690 as deleted, because it's in WAL frame 1, which it ignores
-- 2. it doesn't see the new row 1128, because it's in WAL frame 2, which it ignores
Connection 0(op=29): SELECT * FROM test_table
begin_read_tx(min=18, max=2, slot=1, max_frame_in_wal=2)
```
## Fix
Reset `ongoing_checkpoint.state` to `Start` when a checkpoint fails.
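A minimal self-contained sketch of the fix (the state shape is assumed for illustration): a failed checkpoint must not leave this connection's state machine at `Done`.

```rs
enum CheckpointState {
    Start,
    Done,
}

fn finish_checkpoint(state: &mut CheckpointState, succeeded: bool) {
    if !succeeded {
        // Reset so the next checkpoint runs from scratch instead of
        // instantly reporting stale progress from the failed attempt.
        *state = CheckpointState::Start;
    }
}
```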
Issue found in #2364.
Reviewed-by: bit-aloo (@Shourya742)
Closes#2380
Closes#2363
## What
The following sequence of actions is possible:
```
Some committed frames already exist in the WAL. shared.pages_in_frames.len() > 0.
Brand new connection does this:
BEGIN
^-- deferred, no read tx started yet, so its `self.start_pages_in_frames` is `0`
because it's a brand new WalFile instance
ROLLBACK <-- calls `wal.rollback()` and truncates `shared.pages_in_frames` to length `0`
PRAGMA wal_checkpoint();
^-- because `pages_in_frames` is empty, it doesn't actually
checkpoint anything but still sets shared.max_frame to 0, effectively causing data loss
```
## Fix
- Only call `wal.rollback()` for write transactions
- Set `start_pages_in_frames` correctly so that this doesn't happen even
if a regression starts calling `wal.rollback()` again
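A hedged sketch of the intended behavior (all shapes simplified, not the actual code): undo shared WAL bookkeeping only when the finishing transaction actually held a write lock, and take the `start_pages_in_frames` snapshot when the write tx begins rather than when the `WalFile` is created.

```rs
struct WalFile {
    start_pages_in_frames: usize,
}

impl WalFile {
    fn begin_write_tx(&mut self, shared_pages_in_frames_len: usize) {
        // Snapshot taken at write-tx start, not at WalFile creation.
        self.start_pages_in_frames = shared_pages_in_frames_len;
    }

    fn end_tx(&mut self, was_write_tx: bool, shared_pages_in_frames: &mut Vec<u64>) {
        if was_write_tx {
            // Only a write tx may truncate shared state it actually grew.
            shared_pages_in_frames.truncate(self.start_pages_in_frames);
        }
        // A read-only tx appended nothing, so there is nothing to undo.
    }
}
```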
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2366
This PR introduces two methods on the pager, very much inspired by
`with_schema` and `with_schema_mut`. `Pager::with_header` and
`Pager::with_header_mut` give the closure a shared or unique
reference, respectively, transmuted from the `PageRef` buffer.
This PR also adds type-safe wrappers for `Version`, `PageSize`,
`CacheSize` and `TextEncoding`, as they have special in-memory
representations.
Writing the `DatabaseHeader` is just a single `memcpy` now.
```rs
pub fn write_database_header(&self, header: &DatabaseHeader) {
    let buf = self.as_ptr();
    buf[0..DatabaseHeader::SIZE].copy_from_slice(bytemuck::bytes_of(header));
}
```
`HeaderRef` and `HeaderRefMut` are used in the `with_header*` methods,
but they can also be used on their own when there are multiple reads and
writes to the header and putting everything in a closure would add too
much nesting.
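A toy model of the closure pattern; the fields and types here are simplified stand-ins, not the real `DatabaseHeader`:

```rs
// Simplified stand-in types; the real implementation hands out references
// transmuted from the page buffer.
struct DatabaseHeader {
    page_size: u16,
    schema_cookie: u32,
}

struct Pager {
    header: DatabaseHeader,
}

impl Pager {
    fn with_header<T>(&self, f: impl FnOnce(&DatabaseHeader) -> T) -> T {
        f(&self.header)
    }

    fn with_header_mut<T>(&mut self, f: impl FnOnce(&mut DatabaseHeader) -> T) -> T {
        f(&mut self.header)
    }
}

// Example: a read-modify-write confined to one closure.
fn bump_schema_cookie(pager: &mut Pager) -> u32 {
    pager.with_header_mut(|h| {
        h.schema_cookie += 1;
        h.schema_cookie
    })
}
```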
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2234
When traversing, we are only interested in the following things:
- Is the page a leaf or not
- Is the page an index or table page
- If not a leaf, what is the left child page
This means we don't have to read the entire cell, just the left child
page.
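An illustrative sketch (not the project's actual code) of why the left child is so cheap to read: in SQLite's file format, an interior cell starts with a 4-byte big-endian left-child page number, and the page type is the first byte of the page header.

```rs
// An interior b-tree cell begins with the 4-byte big-endian left-child
// page number, so traversal never needs to parse the rest of the cell.
fn interior_cell_left_child(cell: &[u8]) -> u32 {
    u32::from_be_bytes([cell[0], cell[1], cell[2], cell[3]])
}

// Page type byte from the page header:
// 0x02 = index interior, 0x05 = table interior,
// 0x0a = index leaf,     0x0d = table leaf.
fn is_leaf(page_type: u8) -> bool {
    matches!(page_type, 0x0a | 0x0d)
}
```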
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2317