This fairly long commit implements persistence for materialized view.
It is hard to split because of all the interdependencies between components,
so it is a one big thing. This commit message will at least try to go into
details about the basic architecture.
Materialized Views as tables
============================
Materialized views are now a normal table - whereas before they were a virtual
table. By making a materialized view a table, we can reuse all the
infrastructure for dealing with tables (cursors, etc).
One of the advantages of doing this is that we can create indexes on view
columns. Later, we should also be able to write those views to separate files
with ATTACH write.
Materialized Views as Zsets
===========================
The contents of the table are a ZSet: rowid, values, weight. Readers will
notice that because of this, the usage of the ZSet data structure dwindles
throughout the codebase. The main difference between our materialized ZSet and
the standard DBSP ZSet, is that obviously ours is backed by a BTree, not a Hash
(since SQLite tables are BTrees)
Aggregator State
================
In DBSP, the aggregator nodes also have state. To store that state, there is a
second table. The table holds all aggregators in the view, and there is one
table per view. That is __turso_internal_dbsp_state_{view_name}. The format of
that table is similar to a ZSet: rowid, serialized_values, weight. We serialize
the values because there will be many aggregators in the table. We can't rely
on a particular format for the values.
The Materialized View Cursor
============================
Reading from a Materialized View essentially means reading from the persisted
ZSet, and enhancing that with data that exists within the transaction.
Transaction data is ephemeral, so we do not materialize this anywhere: we have
a carefully crafted implementation of seek that takes care of merging weights
and stitching the two sets together.
My goal with this patch is to be able to implement the ProjectOperator
for DBSP circuits using VDBE for expression evaluation.
*not* doing so is dangerous for the following reason: we will end up
with different, subtle, and incompatible behavior between SQLite
expressions if they are used in views versus outside of views.
In fact, even in our prototype had them: our projection tests, which
used to pass, were actually wrong =) (sqlite would return something
different if those functions were executed outside the view context)
For optimization reasons, we single out trivial expressions: they don't
have go through VDBE. Trivial expressions are expressions that only
involve Columns, Literals, and simple operators on elements of the same
type. Even type coercion takes this out of the realm of trivial.
Everything that is not trivial, is then translated with translate_expr -
in the same way SQLite will, and then compiled with VDBE.
We can, over time, make this process much better. There are essentially
infinite opportunities for optimization here. But for now, the main
warts are:
* VDBE execution needs a connection
* There is no good way in VDBE to pass parameters to a program.
* It is almost trivial to pollute the original connection. For example,
we need to issue HALT for the program to stop, but seeing that halt
will usually cause the program to try and halt the original program.
Subprograms, like the ones we use in triggers are a possible solution,
but they are much more expensive to execute, especially given that our
execution would essentially have to have a program with no other role
than to wrap the subprogram.
Therefore, what I am doing is:
* There is an in-memory database inside the projection operator (an
obvious optimization is to share it with *all* projection operators).
* We obtain a connection to that database when the operator is created
* We use that connection to execute our VDBE, which offers a clean, safe
and isolated way to execute the expression.
* We feed the values to the program manually by editing the registers
directly.
If a connection does e.g. CREATE TABLE, it will start a "child statement"
to reparse the schema. That statement does not start its own transaction,
and so should not try to end the existing one either.
We had a logic bug where these steps would happen:
- `CREATE TABLE` executed successfully
- pread fault happens inside `ParseSchema` child stmt
- `handle_program_error()` is called
- `pager.end_tx()` returns immediately because `is_nested_stmt` is true
and we correctly no-op it.
- however, crucially: `handle_program_error()` then sets tx state to None
- parent statement now catches error from nested stmt and calls
`handle_program_error()`, which calls `pager.end_tx()` again, and since
txn state is None, when it calls `rollback()` we panic on the assertion
`"dirty pages should be empty for read txn"`
Solution:
Do not do _any_ error processing in `handle_program_error()` inside a nested
stmt. This means that the parent write txn is still active when it processes
the error from the child and we avoid this panic.
We were storing `txid` in `ProgramState`, this meant it was impossible
to track interactive transactions. This was extracted to `Connection`
instead.
Moreover, transaction state for mvcc now is reset on commit.
Closes#2689
This gets rid of `InsertState` in `BTreeCursor` plus the `moved_before`
parameter to `BTreeCursor::insert` -- instead, seek logic is now in the
existing state machines for `op_insert` and `op_idx_insert`
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes#2639
A lot of the structures we have - like the ones under Schema, are
specific for materialized views. In preparation to adding normal views,
rename them, so things are less confusing.
We have to update the Transaction State before checking for the Schema
Cookie so that we can rollback the transaction later on correctly.
Closes#2535Closes#2549
- When the rowid is changed in UPDATE, it is handled as a combination of DELETE + INSERT,
so we dont need to delete the old values in that case
- We should only update the views after the operation on the btree is done
- A proper state machine is needed to handle IO yielding points
Currently, we have a borrow problem because parse_schema_rows() already
borrows `schema`, but then `apply_view_deltas` does the same:
```
thread 'main' panicked at core/vdbe/mod.rs:450:49:
already mutably borrowed: BorrowError
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:697:5
1: core::panicking::panic_fmt
at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:75:14
2: core::cell::panic_already_mutably_borrowed
at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/cell.rs:799:5
3: core::cell::RefCell<T>::borrow
at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/cell.rs:987:25
4: turso_core::vdbe::Program::apply_view_deltas
at ./core/vdbe/mod.rs:450:26
5: turso_core::vdbe::Program::commit_txn
at ./core/vdbe/mod.rs:468:9
6: turso_core::vdbe::execute::op_halt
at ./core/vdbe/execute.rs:1954:15
7: turso_core::vdbe::Program::step
at ./core/vdbe/mod.rs:430:19
8: turso_core::Statement::step
at ./core/lib.rs:1914:23
9: turso_core::util::parse_schema_rows
at ./core/util.rs:91:15
10: turso_core::Connection::parse_schema_rows::{{closure}}
at ./core/lib.rs:1518:17
11: turso_core::Connection::with_schema_mut
at ./core/lib.rs:1625:9
12: turso_core::Connection::parse_schema_rows
at ./core/lib.rs:1515:9
```
However, this is a read transaction and views are not even enabled,
let's just make `apply_view_deltas()` return early if there's no
processing needed, to skip the schema borrow altogether.
This is just the bare minimum that I needed to convince myself that this
approach will work. The only views that we support are slices of the
main table: no aggregations, no joins, no projections.
drop view is implemented.
view population is implemented.
deletes, inserts and updates are implemented.
much like indexes before, a flag must be passed to enable views.
SQLite generates those in aggregations like min / max with collation
information either in the table definition or in the column expression.
We currently generate the wrong result here, and properly generating the
bytecode instruction fixes it.
Unfortunately it seems we are never reaching the point to remove state
machines, so might as well make it easier to make.
There are two points that must be highlighted:
1. There is a `StateTransition` trait implemented like:
```rust
pub trait StateTransition {
type State;
type Context;
fn transition<'a>(&mut self, context: &Self::Context) ->
Result<TransitionResult>;
fn finalize<'a>(&mut self, context: &Self::Context) -> Result<()>;
fn is_finalized(&self) -> bool;
}
```
where there exists `transition` which tries to move state forward, and
`finalize` which marks the state machine as "finalized" so that **no
other call to finalize will forward the state and it will panic instead.
2. Before, we would store the state of a state machine inside the
callee's struct, but I'm proposing we do something different where the
callee will return the state machine and the caller will be responsible
of advancing it. This way we don't need to track many reset operations
in case of failures or rollbacks, and instead we could simply drop a
state machine and all other nested state machines will drop in a
cascade.