This PR extends the existing encryption support to include the database
header page (page 1).
Reviewed-by: Avinash Sajjanshetty (@avinassh)
Closes #3040
This adds basic support for window functions. For now:
* Only existing aggregate functions can be used as window functions.
* Specialized window-specific functions (`rank`, `row_number`, etc.) are
not yet supported.
* Only the default frame definition is implemented (see the example below):
`RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE NO OTHERS`.
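For illustration (not code from this repo; the table and column names are made up), the kind of query that is now supported is an existing aggregate used as a window function with the default frame:
```rust
// Illustrative sketch only. SUM is an existing aggregate used as a window
// function; with ORDER BY and no explicit frame, the default frame applies:
//   RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW EXCLUDE NO OTHERS
fn main() {
    let running_total = "
        SELECT id,
               SUM(amount) OVER (ORDER BY id) AS running_total
        FROM payments;
    ";
    println!("{running_total}");
}
```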
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3079
We had not implemented min/max before because they require the raw
elements to be kept. It is easy to see why in the following example:
```
current_min = 3;
insert(2) => current_min = 2 // can be done without state
delete(2) => needs to look at the state to determine new min!
```
The aggregator state was a very simple key-value structure. To
accommodate min/max, we will turn it into a richer table where we can
encode more structure.
The key insight is that we can use a primary key composed of:
```
1) storage_id
2) zset_id
3) element
```
The storage_id and zset_id are our previous key, except they are now
exploded to support a larger range of storage_id. With more bits
available in the storage_id, we can encode information about which
column we are storing. For aggregations over multiple columns, we need
to keep a separate list of values for each column's min/max.
The element is just the values of the columns.
Because this is a primary key, the data will be sorted in the btree. We
can then just do a prefix search in the first two components of the key
and easily find the min/max when needed.
This new format is also adequate for joins. Joins will just have a new
storage_id which encodes two "columns" (left side, right side).
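To make the idea concrete, here is a standalone sketch (not the actual storage code) that uses a `BTreeMap` with a composite key as a stand-in for the btree; the integer types and the z-set weight handling are simplified:
```rust
use std::collections::BTreeMap;
use std::ops::Bound::Included;

// Stand-in for the btree: primary key = (storage_id, zset_id, element).
// Because tuples order lexicographically, every (storage_id, zset_id) group is
// stored contiguously and already sorted by element.
type Key = (u64 /* storage_id */, u64 /* zset_id */, i64 /* element */);
type Weight = i64;

fn min_for_group(state: &BTreeMap<Key, Weight>, storage_id: u64, zset_id: u64) -> Option<i64> {
    // Prefix search on the first two key components: the first entry with a
    // positive weight is the min (the last such entry would be the max).
    state
        .range((
            Included((storage_id, zset_id, i64::MIN)),
            Included((storage_id, zset_id, i64::MAX)),
        ))
        .find(|(_, weight)| **weight > 0)
        .map(|((_, _, element), _)| *element)
}

fn main() {
    let mut state = BTreeMap::new();
    state.insert((1, 7, 3), 1);
    state.insert((1, 7, 2), 1);
    assert_eq!(min_for_group(&state, 1, 7), Some(2));

    // delete(2): because the raw elements are kept, the new min is found with
    // the same prefix search instead of rescanning the input.
    state.insert((1, 7, 2), 0);
    assert_eq!(min_for_group(&state, 1, 7), Some(3));
    println!("ok");
}
```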
Closes #3143
Fixes panics with `must have a read transaction to start a write
transaction`. Previously we were simply ignoring these Busy errors and
assuming we had a read tx when we actually didn't.
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3148
We start a pager read transaction at the beginning of the MV transaction, because
any reads we do from the database file and WAL must uphold snapshot isolation.
However, we must end and immediately restart the read transaction before committing.
This is because other transactions may have committed writes to the DB file or WAL,
and our pager must read in those changes when applying our writes; otherwise we would overwrite
the changes from previously committed transactions.
Note that this would be incredibly unsafe in the regular transaction model, but in MVCC we trust
the MV-store to uphold the guarantee that no write-write conflicts happened.
We must iterate the row versions in reverse order because the versions
are ordered from oldest to newest, and we must commit the newest version
applied by the active transaction.
In insert_version_raw(), we correctly iterate the versions backwards
because we want to find the newest version that is still older than
the one we are inserting.
However, the order of `.enumerate()` and `.rev()` was wrong, so the
insertion position was calculated based on the position in the
_reversed_ iterator, not the original iterator.
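A standalone illustration of the difference (not the actual `insert_version_raw` code):
```rust
fn main() {
    let versions = ["v0", "v1", "v2", "v3"]; // ordered oldest to newest

    // Buggy order: the index is the position in the *reversed* iterator.
    let buggy: Vec<(usize, &&str)> = versions.iter().rev().enumerate().collect();
    assert_eq!(buggy[0], (0, &"v3")); // newest version reported at index 0

    // Fixed order: enumerate first, then reverse, so the index still refers to
    // the position in the original (oldest-to-newest) list.
    let fixed: Vec<(usize, &&str)> = versions.iter().enumerate().rev().collect();
    assert_eq!(fixed[0], (3, &"v3")); // newest version keeps its original index 3
    println!("ok");
}
```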
Blacksmith runners have a lot of variance in performance, making it hard
for Nyrkiö to do its job. Discussed on [Discord](https://discord.com/channels/1258658826257961020/1402269486752469085).
Reviewed-by: Henrik Ingo <henrik@nyrk.io>
Closes #2448
This removes 4 crates from the `cargo build` and tries to ensure that
in the future we avoid pulling in the same crates at different versions.
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3141
This also changes the schema of the main table. I have come to see the
current key-value schema as inadequate for non-aggregate operators.
Calculating Min/Max, for example, doesn't fit this schema because we
have to be able to track existing values and index them.
Another alternative is to keep one table per operator type, but this
quickly leads to an explosion of tables.
The current logic can lead to a situation where:
- we call read_page(trunk_page_id)
- we assign trunk_page in the FreePageState state machine
- the page read fails and the cache marks it as !locked && !loaded
- the next call to Pager::free_page() asserts that the page is loaded
and panics
Whopper takes so long to run that I wasn't patient enough, but I'm
pretty sure this closes #3101.
Reviewed-by: Preston Thorpe <preston@turso.tech>
Closes #3139
We now panic on fsync errors by default to be safe against fsyncgate.
However, there is no reason to do that in the stress tester, especially
since we test out-of-disk-space errors under Antithesis.
Closes #3131
Fixes `write-throughput` benchmark deadlocking on 2 threads or more. The
gist of the PR is in the big code comment:
```rust
// important not to hold shared lock beyond this point to avoid deadlock scenario where:
// thread 1: takes readlock here, passes reference to shared.file to begin_read_wal_frame
// thread 2: tries to acquire write lock elsewhere
// thread 1: tries to re-acquire read lock in the completion (see 'complete' above)
//
// this causes a deadlock due to the locking policy in parking_lot:
// from https://docs.rs/parking_lot/latest/parking_lot/type.RwLock.html:
// "This lock uses a task-fair locking policy which avoids both reader and writer starvation.
// This means that readers trying to acquire the lock will block even if the lock is unlocked
// when there are writers waiting to acquire the lock.
// Because of this, attempts to recursively acquire a read lock within a single thread may result in a deadlock."
```
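For illustration, here is a standalone repro of that parking_lot behavior (not code from this PR; the sleep is only there to make sure the writer is already queued):
```rust
use parking_lot::RwLock;
use std::{sync::Arc, thread, time::Duration};

fn main() {
    let lock = Arc::new(RwLock::new(()));

    let outer = lock.read(); // thread 1: holds the shared (read) lock

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _w = lock.write(); // thread 2: queues up for the write lock
        })
    };
    thread::sleep(Duration::from_millis(100)); // let the writer start waiting

    // A recursive lock.read() here would block behind the queued writer and
    // deadlock; try_read() shows the lock is not granted even though it is
    // currently only held by a reader.
    assert!(lock.try_read().is_none());

    drop(outer); // release the shared lock so the writer can finish
    writer.join().unwrap();
    println!("ok");
}
```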
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes #3132
Based on #3126
Closes #3029
Closes #3030
Closes #3065
Closes #3083
Closes #3084
Closes #3085
The simple reason why MVCC update didn't work: it didn't try to update.
Closes #3127
This PR fixes incorrect path registration for sync in the browser, adds
tests, and also exposes the revision string in the `stats()` method of
the synced database.
Closes #3124
I searched with DeepWiki for how SQLite implements its busy handler. It
uses a callback system with exponential backoff, storing the callback in
the pager and in the database. I confess I found this slightly
confusing, so I just implemented a simple exponential backoff directly
in the `Statement` struct. I imagine SQLite does this in a more
convoluted manner, as it does not have a concept of yielding as we do.
https://deepwiki.com/search/where-is-the-code-for-the-busy_4a5ed006-4eed-479f-80c3-dd038832831b
I also fixed the Rust bindings so that they yield when we return
`StepResult::IO`, instead of just blocking the async function. To
achieve this I implemented the `Stream` trait for the `Rows` struct,
which unfortunately came with a slight change to the function signature:
`rows.next()` becomes `rows.try_next()`.
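For illustration, a sketch of what that looks like for callers, assuming the fallible stream is consumed via `futures::TryStreamExt` (the actual `Rows` type and error type in the bindings may differ):
```rust
use futures::{executor::block_on, stream, TryStreamExt};

fn main() -> Result<(), String> {
    block_on(async {
        // Stand-in for `Rows`: a Stream whose items are Result<Row, Error>.
        let mut rows = stream::iter(vec![Ok::<u32, String>(1), Ok(2), Ok(3)]);
        // try_next() yields Result<Option<Row>, Error>, so the caller awaits
        // instead of blocking the async function, and errors still propagate
        // with `?`.
        while let Some(row) = rows.try_next().await? {
            println!("row: {row}");
        }
        Ok(())
    })
}
```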
EDIT:
~The `test_multiple_connections_fuzz` test times out because the busy
handler now "slows" things down (this test generates a lot of busy
transactions), so the test takes much longer to run. Not sure if it is
acceptable for us to reduce the number of operations so the test is
shorter.~
EDIT:
Adjusted the API to be more in line with
https://www.sqlite.org/c3ref/busy_timeout.html.
It sets the maximum total accumulated timeout. If the duration is None
or zero, we unset the busy handler for this Connection.
This API differs slightly from SQLite's: instead of sleeping for a
linear amount of time specified by the user, we sleep in phases until
the total amount of time requested is reached. We first sleep for 1 ms;
then, if we still get Busy, we sleep for 2 ms, and so on, up to a
maximum of 100 ms per phase, until we reach the total timeout.
Example:
1. Set duration to 5ms
2. Step through query -> returns Busy -> sleep/yield for 1 ms
3. Step through query -> returns Busy -> sleep/yield for 2 ms
4. Step through query -> returns Busy -> sleep/yield for 2 ms (totaling
5 ms of sleep)
5. Step through query -> returns Busy -> return Busy to user
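For illustration, a standalone sketch of this phased backoff (not the actual `Statement` implementation; it assumes the phases double, as in classic exponential backoff, capped at 100 ms per phase and at the total budget):
```rust
use std::time::Duration;

// Compute the next sleep phase, or None once the total budget is used up.
fn next_sleep(total: Duration, slept: Duration, last: Duration) -> Option<Duration> {
    let remaining = total.checked_sub(slept)?;
    if remaining.is_zero() {
        return None; // budget exhausted: surface Busy to the caller
    }
    let doubled = last.saturating_mul(2).max(Duration::from_millis(1));
    Some(doubled.min(Duration::from_millis(100)).min(remaining))
}

fn main() {
    // Replay the 5 ms example above: phases come out as 1 ms, 2 ms, 2 ms, Busy.
    let total = Duration::from_millis(5);
    let (mut slept, mut last) = (Duration::ZERO, Duration::ZERO);
    while let Some(phase) = next_sleep(total, slept, last) {
        println!("sleep/yield for {phase:?}");
        slept += phase;
        last = phase;
    }
    println!("still Busy after {slept:?}: return Busy to the user");
}
```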
This slight API change demonstrated better throughput in the
`perf/throughput/turso` benchmark:
```sh
cargo run -p write-throughput --release -- -t 2
Running write throughput benchmark with 2 threads, 100 batch size, 10 iterations, mode: Legacy
Database created at: write_throughput_test.db
Thread 1: 1000 inserts in 0.04s (23438.42 inserts/sec)
Thread 0: 1000 inserts in 0.08s (12385.64 inserts/sec)
=== BENCHMARK RESULTS ===
Total inserts: 2000
Total time: 0.08s
Overall throughput: 24762.60 inserts/sec
Threads: 2
Batch size: 100
Iterations per thread: 10
Database file exists: true
Database file size: 4096 bytes
```
Depends on #3102
Closes #3067
Closes #3074