In #2521, I messed up and introduced an improper calculation of the current
checkpoint's max safe frame (mostly due to incorrect comments that I had
left on the method).
The confusion partially stems from our lack of Busy handling at the
moment. Essentially, when determining the max safe frame across all
readers in passive mode, we cannot simply `break` out of the loop when
we find a reader with a lower read mark than ours: _another_ reader
might have an even _lower_ read mark, and breaking early would make us
proceed with the first mark < shared_max instead of the lowest one.
And for !passive modes, we still attempt to backfill up to the same lower
frame; we just return `Busy` at the end, after backfilling what we can
(and we don't reset the log for restart/truncate).
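The scan described above can be sketched roughly as follows. This is a simplified model, not the actual method: `SafeFrame`, `CheckpointOutcome`, and the `read_marks` slice are hypothetical names made up for illustration, and real reader-slot locking is left out entirely.

```rust
/// Result of scanning the reader marks (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct SafeFrame {
    /// Highest frame that can be backfilled without passing any reader.
    max_safe_frame: u64,
    /// True if at least one reader held us below `shared_max`.
    blocked_by_reader: bool,
}

/// Visit every active reader mark. There is deliberately no `break`:
/// a later reader may hold an even lower mark than the first one found.
fn determine_max_safe_frame(shared_max: u64, read_marks: &[Option<u64>]) -> SafeFrame {
    let mut max_safe_frame = shared_max;
    let mut blocked_by_reader = false;
    // `None` models an unused reader slot (an assumption of this sketch).
    for mark in read_marks.iter().flatten() {
        if *mark < max_safe_frame {
            max_safe_frame = *mark;
            blocked_by_reader = true;
        }
    }
    SafeFrame { max_safe_frame, blocked_by_reader }
}

/// Hypothetical outcome type: both variants carry how far we backfilled.
#[derive(Debug, PartialEq)]
enum CheckpointOutcome {
    Done { backfilled_up_to: u64 },
    Busy { backfilled_up_to: u64 },
}

/// Passive mode is satisfied with a partial backfill. Non-passive modes
/// backfill up to the same frame but report `Busy` at the end.
fn checkpoint_outcome(passive: bool, safe: &SafeFrame) -> CheckpointOutcome {
    if !passive && safe.blocked_by_reader {
        CheckpointOutcome::Busy { backfilled_up_to: safe.max_safe_frame }
    } else {
        CheckpointOutcome::Done { backfilled_up_to: safe.max_safe_frame }
    }
}
```

With marks of 50, 20, and 70 below a `shared_max` of 100, an early `break` would stop at 50, while the full scan settles on 20.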
Most of the changes in this PR are just renaming the fields of
Checkpoint Result, because the names were confusing.
Closes #2560
This PR extends the raw WAL API with a few methods that will be helpful for
offline sync:
1. `try_wal_watermark_read_page` - try to read a page from the DB with
a given WAL watermark value
* Usually, the WAL max_frame is set automatically to the latest value
(`shared.max_frame`) when a transaction starts, and this
"watermark" is then preserved throughout the whole transaction
* The new method allows simulating a "read from the past" by controlling
the frame watermark explicitly
* An alternative would be to implement an API like
`start_read_session(frame_watermark: u64)`, but I decided to expose
just a single method to simplify the logic and reduce the "surface" of actions
that can be executed in this "controllable" manner
* Also, for simplicity, `try_wal_watermark_read_page` currently always
reads data from disk and bypasses any cached values (and also does not
populate the cache)
2. `wal_changed_pages_after` - return the set of unique pages changed after
the watermark WAL position in the current WAL session
With these 2 methods we can implement `REVERT frame_watermark` logic,
which first fetches all changed pages and then reverts them to
their previous values by using the `try_wal_watermark_read_page` and
`wal_insert_frame` methods (see the `test_wal_api_revert_pages` test).
Note that if there were schema changes, then the `REVERT` logic described
above can bring the connection into an inconsistent state, as it will
preserve schema information in memory and will still think that a table
exists (while it may have been reverted). This should be considered by any
consumer of these new methods.
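To make the `REVERT` idea concrete, here is a toy in-memory sketch. It is not the real API: `ToyWal` and its method signatures are hypothetical stand-ins for `try_wal_watermark_read_page`, `wal_changed_pages_after`, and `wal_insert_frame`, whose actual signatures are not shown in this description.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy model of a DB file plus WAL (hypothetical, for illustration only).
struct ToyWal {
    /// Page images already in the main DB file: page number -> bytes.
    db: HashMap<u32, Vec<u8>>,
    /// WAL frames in append order; the frame number is index + 1.
    frames: Vec<(u32, Vec<u8>)>,
}

impl ToyWal {
    /// Stand-in for `try_wal_watermark_read_page`: read `page` as it looked
    /// when the WAL held only frames 1..=watermark, falling back to the DB.
    fn read_page_at(&self, page: u32, watermark: usize) -> Option<Vec<u8>> {
        let upto = watermark.min(self.frames.len());
        self.frames[..upto]
            .iter()
            .rev()
            .find(|(p, _)| *p == page)
            .map(|(_, data)| data.clone())
            .or_else(|| self.db.get(&page).cloned())
    }

    /// Stand-in for `wal_changed_pages_after`: unique pages written by
    /// frames appended after `watermark`.
    fn changed_pages_after(&self, watermark: usize) -> BTreeSet<u32> {
        self.frames.iter().skip(watermark).map(|(p, _)| *p).collect()
    }

    /// Stand-in for `wal_insert_frame`: append a new frame.
    fn insert_frame(&mut self, page: u32, data: Vec<u8>) {
        self.frames.push((page, data));
    }

    /// REVERT: re-append the old image of every page changed after
    /// `watermark`. Pages that did not exist at the watermark are skipped
    /// here, which is a simplification of this sketch.
    fn revert_to(&mut self, watermark: usize) {
        for page in self.changed_pages_after(watermark) {
            if let Some(old) = self.read_page_at(page, watermark) {
                self.insert_frame(page, old);
            }
        }
    }
}
```

Note that the revert in this model appends new frames rather than deleting old ones, which matches the description of reverting via `wal_insert_frame`.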
Closes #2433
- try_wal_watermark_read_page - try to read page from the DB with given WAL watermark value
- wal_changed_pages_after - return set of unique pages changed after watermark WAL position
We have some kind of transaction-local hack (`start_pages_in_frames`) for bookkeeping
how many pages are currently in the in-memory WAL frame cache,
I assume for performance reasons or whatever.
`wal.rollback()` clears all the frames from `shared.frame_cache` that the tx being rolled back is
allowed to clear, and then truncates `shared.pages_in_frames` to the length recorded in its local
`start_pages_in_frames` value.
In `complete_append_frame`, we check if `frame_cache` has that key (page) already, and if not,
we add it to `pages_in_frames`.
However, `wal.rollback()` never _removes_ the key (page) if its value is empty, so we can end
up in a scenario where the `frame_cache` key for `page P` exists but has no frames, and so `page P`
does not get added to `pages_in_frames` in `complete_append_frame`.
This leads to a checkpoint data loss scenario:
- A transaction rolls back with `start_pages_in_frames=0`, so it truncates the
shared `pages_in_frames` to an empty vec. Let's say the `page P` key in `frame_cache` still remains
but has no frames.
- The next time someone commits a frame for `page P`, it does NOT get added to `pages_in_frames`
because `frame_cache` has that key
- At some point, a PASSIVE checkpoint checkpoints `n` frames, but since `pages_in_frames` does not have
`page P`, it doesn't actually checkpoint it and all the "checkpointed" frames are simply thrown away
- very similar to the scenario in #2366
Remove the `start_pages_in_frames` hack entirely and just make `pages_in_frames` effectively
the same as `frame_cache.keys`. I think we could also just get rid of `pages_in_frames` and just use
`frame_cache.contains_key(p)` but maybe Pere can chime in here
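A small model of the stale-key problem, for anyone who wants to poke at it. The field names follow the text, but the types and both rollback functions are simplified assumptions and not the actual code in this commit.

```rust
use std::collections::HashMap;

/// Toy stand-in for the shared WAL state (simplified types).
#[derive(Default)]
struct Shared {
    /// Page number -> frame numbers that hold that page.
    frame_cache: HashMap<u64, Vec<u64>>,
    /// Pages that a checkpoint will consider.
    pages_in_frames: Vec<u64>,
}

impl Shared {
    /// Mirrors the check described for `complete_append_frame`: a page is
    /// recorded only when its key is new to `frame_cache`.
    fn complete_append_frame(&mut self, page: u64, frame: u64) {
        if !self.frame_cache.contains_key(&page) {
            self.pages_in_frames.push(page);
        }
        self.frame_cache.entry(page).or_default().push(frame);
    }

    /// Old behavior (sketch): drop frames above `max_frame` but keep the
    /// now-empty keys, then truncate to the tx-local length.
    fn rollback_buggy(&mut self, max_frame: u64, start_pages_in_frames: usize) {
        for frames in self.frame_cache.values_mut() {
            frames.retain(|f| *f <= max_frame);
        }
        self.pages_in_frames.truncate(start_pages_in_frames);
    }

    /// One possible shape of the fix: remove keys left without frames and
    /// keep `pages_in_frames` in step with the remaining keys.
    fn rollback_fixed(&mut self, max_frame: u64) {
        self.frame_cache.retain(|_, frames| {
            frames.retain(|f| *f <= max_frame);
            !frames.is_empty()
        });
        let cache = &self.frame_cache;
        self.pages_in_frames.retain(|p| cache.contains_key(p));
    }
}
```

In the buggy variant, appending a frame for `page P` after the rollback leaves `pages_in_frames` empty, so a checkpoint would never look at that page.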
## What
The following sequence of actions is possible:
```sql
-- TRUNCATE checkpoint fails during WAL restart,
-- but OngoingCheckpoint.state is still left at Done for conn 0
Connection 0(op=23): PRAGMA wal_checkpoint(TRUNCATE)
Connection 0(op=23) Checkpoint TRUNCATE: OK: false, wal_page_count: NULL, checkpointed_count: NULL
-- TRUNCATE checkpoint succeeds for conn 1
Connection 1(op=26): PRAGMA wal_checkpoint(TRUNCATE)
Connection 1(op=26) Checkpoint TRUNCATE: OK: true, wal_page_count: 0, checkpointed_count: 0
-- Conn 0 now does a PASSIVE checkpoint, and immediately thinks
-- it's in the Done state, and thinks it checkpointed 17 frames.
-- since mode is PASSIVE, it now thinks both the WAL and the DB have those 17 frames
-- so the first 17 frames of the WAL can be ignored from now on.
Connection 0(op=27): PRAGMA wal_checkpoint(PASSIVE)
Connection 0(op=27) Checkpoint PASSIVE: OK: true, wal_page_count: 0, checkpointed_count: 17
-- Connection 0 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and deletes rowid=690, which becomes WAL frame number 1
Connection 0(op=28): DELETE FROM test_table WHERE id = 690
begin_read_tx(min=18, max=0, slot=1, max_frame_in_wal=0)
-- Connection 1 starts a txn with min=18 (ignore first 17 frames in WAL),
-- and inserts rowid=1128, which becomes WAL frame number 2
Connection 1(op=28): INSERT INTO test_table (id, text) VALUES (1128, text_560)
begin_read_tx(min=18, max=1, slot=1, max_frame_in_wal=1)
-- Connection 0 again starts tx with min=18, and performs a read, and two wrong things happen:
-- 1. it doesn't see row 690 as deleted, because it's in WAL frame 1, which it ignores
-- 2. it doesn't see the new row 1128, because it's in WAL frame 2, which it ignores
Connection 0(op=29): SELECT * FROM test_table
begin_read_tx(min=18, max=2, slot=1, max_frame_in_wal=2)
```
## Fix
Reset `ongoing_checkpoint.state` to `Start` when checkpoint fails.
Issue found in #2364.
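The shape of the fix, as a minimal sketch. The types here are simplified, the `restart_wal` closure is a hypothetical stand-in for the WAL restart step, and returning to `Start` on the success path as well is an assumption of this sketch.

```rust
/// Simplified checkpoint states (the real state machine has more steps).
#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckpointState {
    Start,
    Done,
}

struct OngoingCheckpoint {
    state: CheckpointState,
}

/// Error returned when the WAL restart cannot proceed.
#[derive(Debug, PartialEq)]
struct Busy;

/// Run the last step of a checkpoint. `restart_wal` is a hypothetical
/// closure standing in for the restart/truncate work.
fn run_final_step(
    ongoing: &mut OngoingCheckpoint,
    restart_wal: impl FnOnce() -> Result<(), Busy>,
) -> Result<(), Busy> {
    // Backfilling is assumed to have finished at this point.
    ongoing.state = CheckpointState::Done;
    let result = restart_wal();
    // The important part: a failed restart must not leave `Done` behind,
    // otherwise the next checkpoint on this connection skips all the work
    // and reports stale counts.
    ongoing.state = CheckpointState::Start;
    result
}
```

Before the fix, the error path returned early, so the equivalent of the reset line was never reached when the restart failed.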
Reviewed-by: bit-aloo (@Shourya742)
Closes #2380
Closes #2363
## What
The following sequence of actions is possible:
```
Some committed frames already exist in the WAL. shared.pages_in_frames.len() > 0.
Brand new connection does this:
BEGIN
^-- deferred, no read tx started yet, so its `self.start_pages_in_frames` is `0`
because it's a brand new WalFile instance
ROLLBACK <-- calls `wal.rollback()` and truncates `shared.pages_in_frames` to length `0`
PRAGMA wal_checkpoint();
^-- because `pages_in_frames` is empty, it doesn't actually
checkpoint anything but still sets shared.max_frame to 0, effectively causing data loss
```
## Fix
- Only call `wal.rollback()` for write transactions
- Set `start_pages_in_frames` correctly so that this doesn't happen even
if a regression starts calling `wal.rollback()` again
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes #2366