Based on #3126Closes#3029Closes#3030Closes#3065Closes#3083Closes#3084Closes#3085
simple reason why mvcc update didn't work: it didn't try to update.
Closes#3127
## Problem
When a delete replaces an index interior cell, the replacement key is LT
the deleted key. Currently on the main branch, after the deletion
happens, the following call to BTreeCursor::next() stops at the replaced
interior cell.
This is incorrect - imagine the following sequence:
- We are executing a query that deletes all keys WHERE key > 5
- We delete <key=6> from an interior node, and take a replacement
<key=5> from the left subtree of that interior page
- next() is called, and we land on the interior node again, which now
has <key=5>, and we incorrectly delete it even though our WHERE
condition is key > 5.
## Solution
This PR:
- Tracks `interior_node_was_replaced` in CheckNeedsBalancing
- If no balancing is needed and a replacement occurred, advances once so
the next invocation of next() will skip the replaced cell properly
i.e. we prevent next() from landing on the replaced content and ensures
iteration continues with the next logical record.
## Details
This problem only became apparent once we started using indexes as valid
iteration cursors for DELETE operations in #2981Closes#3045
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes#3049
When a delete replaces an interior cell, the replacement key is LT the
deleted key. Currently on the main branch, after the deletion happens,
the following call to BTreeCursor::next() stops at the replaced interior
cell.
This is incorrect - imagine the following sequence:
- We are executing a query that deletes all keys WHERE key > 5
- We delete <key=6> from an interior node, and take a replacement
<key=5> from the left subtree of that interior page
- next() is called, and we land on the interior node again, which
now has <key=5>, and we incorrectly delete it even though our
WHERE condition is key > 5.
This PR:
- Tracks `interior_node_was_replaced` in CheckNeedsBalancing
- If no balancing is needed and a replacement occurred, advances once
so the next invocation of next() will skip the replaced cell properly
i.e. we prevent next() from landing on the replaced content and ensures iteration continues with the next logical record.
Closes#3045
Fast balancing routine for the common special case where the rightmost leaf page of a given subtree overflows (= an append).
In this case we just add a new leaf page as the right sibling of that page, and insert a new divider cell into the parent.
The high level steps are:
1. Allocate a new leaf page and insert the overflow cell payload in it.
2. Create a new divider cell in the parent - it contains the page number of the old rightmost leaf, plus the largest rowid on that page.
3. Update the rightmost pointer of the parent to point to the new leaf page.
4. Continue balance from the parent page (inserting the new divider cell may have overflowedImplement the balance_quick algorithm
Closes#1714
This PR enables the use of an index as the iteration cursor for a point
or range deletion operation. Main changes:
- Use `Delete` opcode for the index that is iterating the rows - avoids
unnecessary seeking on that index, since it's already positioned
correctly
- Fix delete balancing; details below:
### current state
- a deletion may cause a btree rebalancing operation
- to get the cursor back to the right place after a rebalancing, we must
remember what the deleted key was and seek to it
- right now we are using `SeekOp::LT` to move to one slot BEFORE the
deleted key, so that if we delete rows in a loop, the following `Next()`
call will put us back into the right place
### problem
- When we delete multiple rows, we always iterate forwards. Using
`SeekOp::LT` implies backwards iteration, but it works OK for table
btrees since the cursor never remains on an internal node, because table
internal cells do not have payloads. However: this behavior is
problematic for indexes because we can effectively end up skipping
visiting a page entirely. Honestly: despite spending some debugging the
_old_ code, I still don't remember what exactly causes this to happen.
:) It's one of the `iter_dir` specific behaviors in `indexbtree_move_to`
or `get_prev_record()`, but I'm too tired to spend more time figuring it
out. I had the reason in my head before going on vacation, but it was
evicted from the cache it seems...
### solution
use `SeekOp::GE { eq_only: true }` instead and make the next call to
`Next()` a no-op instead. This has the same effect as SeekOp::LT +
next(), but without introducing bugs due to `LT` being implied backwards
iteration.
Reviewed-by: Nikita Sivukhin (@sivukhin)
Closes#2981
The `run_once()` name is just a historical accident. Furthermore, it now
started to appear elsewhere as well, so let's just call it IO::step() as we
should have from the beginning.
- check free list trunk and pages
- use shared hash map to check for duplicate references for pages
- properly check overflow pages
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#2816
Fixes TPC-H query 13 from returning an incorrect result. In this
specific case, we were returning non-null `IdxRowid` values for the
right-hand side table even when there was no match with the left-hand
side table, meaning the join produced matches even in cases where there
shouldn't have been any.
Closes#2794Closes#2795
Fixes TPC-H query 13 from returning an incorrect result. In this specific
case, we were returning non-null `IdxRowid` values for the right-hand side
table even when there was no match with the left-hand side table, meaning
the join produced matches even in cases where there shouldn't have been any.
Closes#2794
## Make fill_cell_payload() safe for async IO and cache spilling
### Problems:
1. fill_cell_payload() is not re-entrant because it can yield IO
on allocating a new overflow page, resulting in losing some of the
input data.
2. fill_cell_payload() in its current form is not safe for cache
spilling
because the previous overflow page in the chain of allocated overflow
pages
can be evicted by a spill caused by the next overflow page
allocation,
invalidating the page pointer and causing corruption.
3. fill_cell_payload() uses raw pointers and `unsafe` as a workaround
from a previous time when we used to clone `WriteState`, resulting in
hard-to-read code.
### Solutions:
1. Introduce a new substate to the fill_cell_payload state machine to
handle
re-entrancy wrt. allocating overflow pages.
2. Always pin the current overflow page so that it cannot be evicted
during the
overflow chain construction. Also pin the regular page the overflow
chain is
attached to, because it is immediately accessed after
fill_cell_payload is done.
3. Remove all explicit usages of `unsafe` from `fill_cell_payload`
(although our pager is ofc still extremely unsafe under the hood :] )
Note that solution 2 addresses a problem that arose in the development
of page cache
spilling, which is not yet implemented, but will be soon.
### Miscellania:
1. Renamed a bunch of variables to be clearer
2. Added more comments about what is happening in fill_cell_payload
Closes#2737
Things that were just wrong:
1. No pages other than the root page were checked, because no looping
was done. Add a loop.
2. Rightmost child page was never added to page stack. Add it.
New integrity check features:
- Add overflow pages to stack as well
- Check that no page is referenced more than once in the tree
No need to pass `disable` flag to the `end_tx` method as it has that
info from connection itself
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#2777