Problem:
Currently `WriteState` "owns" the balancing state machine, even
though a separate `DeleteState` can also trigger balancing, which
results in awkward back-and-forth switching between `CursorState::Write`
and `CursorState::Delete` during balancing.
Fix:
1. Extract `balance_state` as a separate state machine, since its
state transitions are exactly the same regardless of whether an
insert or a delete triggered the balancing.
2. This allows to remove the different 'Balance-xxx' variants from
`WriteState`, as well as removing `WriteInfo` and `DeleteInfo`, as
those states become just simple enums now. Each of them now has a state
called `Balancing` which just delegates work to the balancing state
machine.
3. This further allows us to remove the awkward switching between
`CursorState::Delete` and `CursorState::Write` during a balance that
happens as a result of a deletion.
We currently clone WriteState on every iteration of `insert_into_page()`,
presumably for Borrow Checker Reasons (tm).
There was a bug in `WriteState::Insert` handling where if `fill_cell_payload()`
returned IO, the `fill_cell_payload_state` was not updated in
`write_info.state`, leading to an infinite loop of allocating new pages.
This bug was surfaced by, but not caused by, #2400.
In preparation for tracking IO Completions, we need to start to return
IO in places where completions are created. Doing some more plumbing now
to avoid bigger PRs for the future
Closes#2438
Implement sqlite's fast path defragment algorithm. This path is taken
when:
1. There are 1-2 freeblocks
2. There are at most `max_frag_bytes` fragmented free bytes (-1..=4)
Instead of reconstructing the entire page, it merges the two freeblocks
and then moves the merged freeblock to the left, effectively turning it
into free space in the unallocated region, instead of a freeblock.
`max_frag_bytes` is particularly important when jnserting a new cell,
because if the page contains (in total) ~just enough space for the new
cell, then there can be hardly any fragmented free space because
otherwise, merging the 1-2 freeblocks won't produce enough contiguous
free space to fit the cell.
## Benchmark
```sql
Insert rows in batches/limbo_insert_1_rows
time: [26.692 µs 27.153 µs 27.695 µs]
change: [-9.9033% -2.9097% +1.6336%] (p = 0.55 > 0.05)
No change in performance detected.
Insert rows in batches/limbo_insert_10_rows
time: [38.618 µs 40.022 µs 42.201 µs]
change: [-8.9137% -6.6405% -4.2299%] (p = 0.00 < 0.05)
Performance has improved.
Insert rows in batches/limbo_insert_100_rows
time: [168.94 µs 169.58 µs 170.31 µs]
change: [-22.520% -17.669% -12.790%] (p = 0.00 < 0.05)
Performance has improved.
```
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2411
Closes#1967
To support this I had to change how we did `epilogue` similarly to how
SQLite does it. SQLIte first declares a `beginWriteOperation` when some
statement is going to necessitate a Write Transaction. And as we now
need to pass the current schema cookie to `epilogue` it was easier to
call epilogue only in one location (like we do with prologue), and just
have each statement declare their intentions separately. This allows us
to not have to pass the Schema around just to do the epilogue. I believe
this is something that @jussisaurio would be interested in.
~Also had to disable the MVCC test, as it was extremely buggy for me.~
Just disabled reprepare statements for MVCC
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#2214
More changes. I want to avoid big PRs, so doing these changes in small
increments. I think in like 2 PRs after this one, I will be able make
the change effectively.
Closes#2400
This PR extends raw WAL API with few methods which will be helpful for
offline-sync:
1. `try_wal_watermark_read_page` - try to read page from the DB with
given WAL watermark value\
* Usually, WAL max_frame is set automatically to the latest value
(`shared.max_frame`) when transaction is started and then this
"watermark" is preserved throughout whole transaction
* New method allows to simulate "read from the past" by controlling
frame watermark explicitly
* There is an alternative to implement some API like
`start_read_session(frame_watermark: u64)` - but I decided to expose
just single method to simplify the logic and reduce "surface" of actions
which can be executed in this "controllable" manner
* Also, for simplicity, now `try_wal_watermark_read_page` always
read data from disk and bypass any cached values (and also do not
populate the cache)
2. `wal_changed_pages_after` - return set of unique pages changed after
watermark WAL position in the current WAL session
With these 2 methods we can implement `REVERT frame_watermark` logic
which will just fetch all changed pages first, and then revert them to
the previous value by using `try_wal_watermark_read_page` and
`wal_insert_frame` methods (see `test_wal_api_revert_pages` test).
Note, that if there were schema changes - than `REVERT` logic described
above can bring connection to the inconsistent state, as it will
preserve schema information in memory and will still think that table
exist (while it can be reverted). This should be considered by any
consumer of this new methods.
Closes#2433
Closes#2421
## Background
We have some kind of transaction-local hack (`start_pages_in_frames`)
for bookkeeping how many pages are currently in the in-memory WAL frame
cache, I assume for performance reasons or whatever.
`wal.rollback()` clears all the frames from `shared.frame_cache` that
the rollbacking tx is allowed to clear, and then truncates
`shared.pages_in_frames` to however much its local
`start_pages_in_frames` value was.
## Problem
In `complete_append_frame`, we check if `frame_cache` has that key
(page) already, and if not, we add it to `pages_in_frames`.
However, `wal.rollback()` never _removes_ the key (page) if its value is
empty, so we can end up in a scenario where the `frame_cache` key for
`page P` exists but has no frames, and so `page P` does not get added to
`pages_in_frames` in `complete_append_frame`.
This leads to a checkpoint data loss scenario:
- transaction rolls back, has start_pages_in_frames=0, so truncates
shared pages_in_frames to an empty vec. let's say `page P` key in
`frame_cache` still remains but it has no frames.
- The next time someone commits a frame for `page P`, it does NOT get
added to `pages_in_frames` because `frame_cache` has that key (although
the value vector is empty)
- At some point, a checkpoint checkpoints `n` frames, but since
`pages_in_frames` does not have `page P`, it doesn't actually checkpoint
it and all the "checkpointed" frames are simply thrown away
- very similar to the scenario in #2366
## Fix
Remove the `start_pages_in_frames` hack entirely and just make
`pages_in_frames` effectively the same as `frame_cache.keys`. I think we
could also just get rid of `pages_in_frames` and just use
`frame_cache.contains_key(p)` but maybe Pere can chime in here
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2422
- try_wal_watermark_read_page - try to read page from the DB with given WAL watermark value
- wal_changed_pages_after - return set of unique pages changed after watermark WAL position
WAL API shouldn't be exposed by default because this is relatively
dangerous API which we use internally and ordinary users shouldn't not
be interested in it.
Reviewed-by: Pekka Enberg <penberg@iki.fi>
Closes#2424