Currently, when MVCC is enabled, every transaction mode supports
concurrent reads and writes, which makes it hard to adopt for existing
applications that use `BEGIN DEFERRED` or `BEGIN IMMEDIATE`.
Therefore, add support for `BEGIN CONCURRENT` transactions when MVCC is
enabled. The transaction mode allows multiple concurrent read/write
transactions that don't block each other, with conflicts resolved at
commit time. Furthermore, implement the correct semantics for `BEGIN
DEFERRED` and `BEGIN IMMEDIATE` by taking advantage of the pager level
write lock when transaction upgrades to write. This means that now
concurrent MVCC transactions are serialized against the legacy ones when
needed.
The implementation includes:
- Parser support for CONCURRENT keyword in BEGIN statements
- New Concurrent variant in TransactionMode to distinguish from regular
read/write transactions
- MVCC store tracking of exclusive transactions to support IMMEDIATE and
EXCLUSIVE modes alongside CONCURRENT
- Proper transaction state management for all transaction types in MVCC
This enables better concurrency for applications that can handle
optimistic concurrency control, while still supporting traditional
SQLite transaction semantics via IMMEDIATE and EXCLUSIVE modes.
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#3021
Currently, when MVCC is enabled, every transaction mode supports
concurrent reads and writes, which makes it hard to adopt for existing
applications that use `BEGIN DEFERRED` or `BEGIN IMMEDIATE`.
Therefore, add support for `BEGIN CONCURRENT` transactions when MVCC is
enabled. The transaction mode allows multiple concurrent read/write
transactions that don't block each other, with conflicts resolved at
commit time. Furthermore, implement the correct semantics for `BEGIN
DEFERRED` and `BEGIN IMMEDIATE` by taking advantage of the pager level
write lock when transaction upgrades to write. This means that now
concurrent MVCC transactions are serialized against the legacy ones when
needed.
The implementation includes:
- Parser support for CONCURRENT keyword in BEGIN statements
- New Concurrent variant in TransactionMode to distinguish from regular
read/write transactions
- MVCC store tracking of exclusive transactions to support IMMEDIATE and
EXCLUSIVE modes alongside CONCURRENT
- Proper transaction state management for all transaction types in MVCC
This enables better concurrency for applications that can handle
optimistic concurrency control, while still supporting traditional
SQLite transaction semantics via IMMEDIATE and EXCLUSIVE modes.
This fairly long commit implements persistence for materialized view.
It is hard to split because of all the interdependencies between components,
so it is a one big thing. This commit message will at least try to go into
details about the basic architecture.
Materialized Views as tables
============================
Materialized views are now a normal table - whereas before they were a virtual
table. By making a materialized view a table, we can reuse all the
infrastructure for dealing with tables (cursors, etc).
One of the advantages of doing this is that we can create indexes on view
columns. Later, we should also be able to write those views to separate files
with ATTACH write.
Materialized Views as Zsets
===========================
The contents of the table are a ZSet: rowid, values, weight. Readers will
notice that because of this, the usage of the ZSet data structure dwindles
throughout the codebase. The main difference between our materialized ZSet and
the standard DBSP ZSet, is that obviously ours is backed by a BTree, not a Hash
(since SQLite tables are BTrees)
Aggregator State
================
In DBSP, the aggregator nodes also have state. To store that state, there is a
second table. The table holds all aggregators in the view, and there is one
table per view. That is __turso_internal_dbsp_state_{view_name}. The format of
that table is similar to a ZSet: rowid, serialized_values, weight. We serialize
the values because there will be many aggregators in the table. We can't rely
on a particular format for the values.
The Materialized View Cursor
============================
Reading from a Materialized View essentially means reading from the persisted
ZSet, and enhancing that with data that exists within the transaction.
Transaction data is ephemeral, so we do not materialize this anywhere: we have
a carefully crafted implementation of seek that takes care of merging weights
and stitching the two sets together.
SQLite does not store the rowid alias column in the record at all
when it is a rowid alias, because the rowid is always stored anyway
in the record header.
This commit replaces the `Name(pub String)` struct with a `Name` enum that
explicitly models how the name appeared in the source either as an
unquoted identifier (`Ident`) or a quoted string (`Quoted`).
In the process, the separate `Id` wrapper type has been coalesced into the
`Name` enum, simplifying the AST and reducing duplication in identifier
handling logic.
While this increases the size of some AST nodes (notably `yyStackEntry`),
it improves correctness and makes source structure more explicit for
later phases.
Two of the opcodes we implement (OpenRead and Transaction) should have
an opcode specifying the database to use, but they don't.
Add it, and for now always use 0 (the main database).
Was running the sim with I/O faults enabled and fixed some nasty bugs.
Now, there are some more nasty bugs to fix as well. This is the command
that I use to run the simulator `cargo run -p limbo_sim -- --minimum-
tests 10 --maximum-tests 1000`
This PR mainly fixes the following bugs:
- Not decrementing in flight write counter when `pwrite` fails
- not rolling back the transaction on `step` error
- not rolling back the transaction on `run_once` error
- some functions were just being unwrapped when they could suffer io
errors
- Only change max_frame after wal sync's
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1946
also added a `cursor_loop` helper on `ProgramBuilder` to avoid making
this mistake in the future. this is zero-cost, and will be optimized to
the same thing (hopefully).
Again found when fuzzing nested where clause subqueries:
Aggregate registers need to be NULLed at the start because the same
registers might be reused on another invocation of a subquery, and if
they are not NULLed, the 2nd invocation of the same subquery will have
values left over from the first invocation.
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#1614
Currently we have this:
program.alloc_cursor_id(Option<String>, CursorType)`
where the String is the table's name or alias ('users' or 'u' in
the query).
This is problematic because this can happen:
`SELECT * FROM t WHERE EXISTS (SELECT * FROM t)`
There are two cursors, both with identifier 't'. This causes a bug
where the program will use the same cursor for both the main query
and the subquery, since they are keyed by 't'.
Instead introduce `CursorKey`, which is a combination of:
1. `TableInternalId`, and
2. index name (Option<String> -- in case of index cursors.
This should provide key uniqueness for cursors:
`SELECT * FROM t WHERE EXISTS (SELECT * FROM t)`
here the first 't' will have a different `TableInternalId` than the
second `t`, so there is no clash.
Currently our "table id"/"table no"/"table idx" references always
use the direct index of the `TableReference` in the plan, e.g. in
`SelectPlan::table_references`. For example:
```rust
Expr::Column { table: 0, column: 3, .. }
```
refers to the 0'th table in the `table_references` list.
This is a fragile approach because it assumes the table_references
list is stable for the lifetime of the query processing. This has so
far been the case, but there exist certain query transformations,
e.g. subquery unnesting, that may fold new table references from
a subquery (which has its own table ref list) into the table reference
list of the parent.
If such a transformation is made, then potentially all of the Expr::Column
references to tables will become invalid. Consider this example:
```sql
-- Assume tables: users(id, age), orders(user_id, amount)
-- Get total amount spent per user on orders over $100
SELECT u.id, sub.total
FROM users u JOIN
(SELECT user_id, SUM(amount) as total
FROM orders o
WHERE o.amount > 100
GROUP BY o.user_id) sub
WHERE u.id = sub.user_id
-- Before subquery unnesting:
-- Main query table_references: [users, sub]
-- u.id refers to table 0, column 0
-- sub.total refers to table 1, column 1
--
-- Subquery table_references: [orders]
-- o.user_id refers to table 0, column 0
-- o.amount refers to table 0, column 1
--
-- After unnesting and folding subquery tables into main query,
-- the query might look like this:
SELECT u.id, SUM(o.amount) as total
FROM users u JOIN orders o ON u.id = o.user_id
WHERE o.amount > 100
GROUP BY u.id;
-- Main query table_references: [users, orders]
-- u.id refers to table index 0 (correct)
-- o.amount refers to table index 0 (incorrect, should be 1)
-- o.user_id refers to table index 0 (incorrect, should be 1)
```
We could ofc traverse every expression in the subquery and rewrite
the table indexes to be correct, but if we instead use stable identifiers
for each table reference, then all the column references will continue
to be correct.
Hence, this PR introduces a `TableInternalId` used in `TableReference`
as well as `Expr::Column` and `Expr::Rowid` so that this kind of query
transformations can happen with less pain.
Reviewable commit by commit. CI failures are not related.
Adds support for e.g. `select first_name, sum(distinct age),
count(distinct age), avg(distinct age) from users group by 1`
Implementation details:
- Creates an ephemeral index per distinct aggregate, and jumps over the
accumulation step if a duplicate is found
Closes#1507