Commit Graph

4668 Commits

Author SHA1 Message Date
PThorpe92
39a47d67e6 Apply PR suggestions 2025-09-05 16:13:29 -04:00
PThorpe92
f45a7538fe Use true sieve/gclock algo instead of lru,dont link pages circilarly 2025-09-05 16:13:29 -04:00
PThorpe92
e418a902e5 Fix scoping issues now that refcells are gone to prevent extra destructors 2025-09-05 16:13:28 -04:00
PThorpe92
c85a61442f Remove type alias in page cache 2025-09-05 16:13:28 -04:00
PThorpe92
5ba273eea5 remove unused impl for refbit 2025-09-05 16:13:28 -04:00
PThorpe92
246b62d513 Remove unnecessary refcells, as PageCacheEntry has interior mutability 2025-09-05 16:13:27 -04:00
PThorpe92
582e25241e Implement GClock algorithm to distinguish between hot pages and scan touches 2025-09-05 16:13:27 -04:00
PThorpe92
254a0a9342 Apply fix and rename ignore_existing to upsert 2025-09-05 16:13:27 -04:00
PThorpe92
3a0b9b360a Fix clippy warnings 2025-09-05 16:13:26 -04:00
PThorpe92
03d5598cfb Use sieve algorithm in page cache in place of full LRU 2025-09-05 16:13:26 -04:00
Jussi Saurio
a0613ef781 Avoid allocating and then immediately fallbacking errors in affinity
On the syscall IO backend, on TPC-H query 12, the _dominating_ part
of the stack trace is trying to construct affinities from a character,
failing, allocating an error&string, and then immediately falling back to
Blob affinity and dropping the error&string.

Since I'm on vacation I won't spend cycles on figuring out why we are passing
an incorrect affinity in `flags.get_affinity()` and instead make this lazy PR
just to improve performance and stop doing silly things :]
2025-09-05 18:34:23 +03:00
Pere Diaz Bou
382a1e14ca Merge 'core: handle edge cases for read_varint' from Sonny
Add handling malformed inputs to function `read_varint` and test cases.
```
# 9 byte truncated to 8
read_varint(&[0x81, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80]) 
    before -> panic index out of bounds: the len is 8 but the index is 8
    after -> LimboError
    
# bits set without end
read_varint(&[0x80; 9]) 
    before -> Ok((128, 9))
    after -> LimboError
```

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #2904
2025-09-05 16:15:42 +02:00
Pekka Enberg
811c5a7ce0 Merge 'Fix float formatting and comparison + Blob concat' from Levy A.
These changes can be verified with the expression fuzzer. Fixes
https://github.com/tursodatabase/turso/issues/2881.
- Compatible float formatting
- Compatible integer-float comparisons
- Blob concatenation

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #2929
2025-09-05 17:02:51 +03:00
Pekka Enberg
b2664e12c2 cargo fmt 2025-09-05 16:12:12 +03:00
Pekka Enberg
5dcffadad6 core/vdbe: Remove empty loop
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-05 16:03:25 +03:00
Levy A.
a7b60e6b00 fix: return NULL for negative base or input on exec_math_log 2025-09-05 10:00:59 -03:00
Pekka Enberg
832e0dee81 core/incremental: Fix typos in cursor.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-05 15:40:45 +03:00
Glauber Costa
08b2e685d5 Persistence for DBSP-based materialized views
This fairly long commit implements persistence for materialized view.
It is hard to split because of all the interdependencies between components,
so it is a one big thing. This commit message will at least try to go into
details about the basic architecture.

Materialized Views as tables
============================

Materialized views are now a normal table - whereas before they were a virtual
table.  By making a materialized view a table, we can reuse all the
infrastructure for dealing with tables (cursors, etc).

One of the advantages of doing this is that we can create indexes on view
columns.  Later, we should also be able to write those views to separate files
with ATTACH write.

Materialized Views as Zsets
===========================

The contents of the table are a ZSet: rowid, values, weight. Readers will
notice that because of this, the usage of the ZSet data structure dwindles
throughout the codebase. The main difference between our materialized ZSet and
the standard DBSP ZSet, is that obviously ours is backed by a BTree, not a Hash
(since SQLite tables are BTrees)

Aggregator State
================

In DBSP, the aggregator nodes also have state. To store that state, there is a
second table.  The table holds all aggregators in the view, and there is one
table per view. That is __turso_internal_dbsp_state_{view_name}. The format of
that table is similar to a ZSet: rowid, serialized_values, weight. We serialize
the values because there will be many aggregators in the table. We can't rely
on a particular format for the values.

The Materialized View Cursor
============================

Reading from a Materialized View essentially means reading from the persisted
ZSet, and enhancing that with data that exists within the transaction.
Transaction data is ephemeral, so we do not materialize this anywhere: we have
a carefully crafted implementation of seek that takes care of merging weights
and stitching the two sets together.
2025-09-05 07:04:33 -05:00
Levy A.
ef16853bf1 fix: clippy 2025-09-05 03:07:38 -03:00
Levy A.
49eff9a1ca chore: fmt 2025-09-05 03:01:27 -03:00
Levy A.
6d9b57b47e fix: return empty string on NaN 2025-09-05 03:01:27 -03:00
Levy A.
76c8894b1a small tweaks 2025-09-05 03:01:24 -03:00
Levy A.
73e901010c fix: float formating and float comparison 2025-09-05 02:35:03 -03:00
Pekka Enberg
6d80d862ee Merge 'io_uring: prevent out of order operations that could interfere with durability' from Preston Thorpe
closes #1419
When submitting a `pwritev` for flushing dirty pages, in the case that
it's a commit frame, we use a new completion type which tells io_uring
to add a flag, which ensures the following:
1. If any operation in the chain fails, subsequent operations get
cancelled with -ECANCELED
2. All operations in the chain complete in order
If there is an ongoing chain of `IO_LINK`, it ends at the `fsync`
barrier, and ensures everything submitted before it has completed.
for 99% of the cases, the syscall that immediately proceeds the
`pwritev` is going to be the fsync, but just in case, this
implementation links everything that comes between the final commit
`pwritev` and the next `fsync`
In the event that we get a partial write, if it was linked, then we
submit an additional fsync after the partial write completes, with an
`IO_DRAIN` flag after forcing a `submit`, which will mean durability is
maintained, as that fsync will flush/drain everything in the squeue
before submission.
The other option in the event of partial writes on commit frames/linked
writes is to error.. not sure which is the right move here. I guess it's
possible that since the fsync completion fired, than the commit could be
over without us being durable ondisk. So maybe it's an assertion
instead? Thoughts?

Closes #2909
2025-09-05 08:34:35 +03:00
Pekka Enberg
a711f2f137 Merge 'core: Simplify WalFileShared life cycle' from Pekka Enberg
Create one WalFileShared for a Database and update its state
accordingly. Also support case where the WAL is disabled.

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Reviewed-by: Preston Thorpe <preston@turso.tech>

Closes #2918
2025-09-05 08:29:56 +03:00
Preston Thorpe
2ea2be6f85 Merge 'prevent modification to system tables.' from Glauber Costa
SQLite does not allow us to modify system tables, but we do. Let's fix
it.

Reviewed-by: Preston Thorpe <preston@turso.tech>
Reviewed-by: Avinash Sajjanshetty (@avinassh)

Closes #2855
2025-09-04 19:57:04 -04:00
Preston Thorpe
2d55099854 Merge 'mark completion as done only after callback will be executed' from Nikita Sivukhin
- otherwise, in multi-threading environment, other thread can think that
completion is finished and start execution
- this can lead to violated assertions (for example, page must be
loaded, but as callback is not executed yet assert will be fired)
Failing scenario:
1. main thread wants to execute pread - so it schedule IO and return
control to the caller
2. IO thread read data from the disk
3. IO thread executes complete(result)
4. complete func set result of the completion to Ok
5. main thread enter into the step loop again and check completion
status
6. completion marked as finished/is_completed - so main thread continue
execution
7. main thread check that page is loaded and fails with assertion -
because it's not loaded yet
8. IO thread executed the callback and finished the completion

Reviewed-by: Preston Thorpe <preston@turso.tech>

Closes #2922
2025-09-04 19:44:17 -04:00
Glauber Costa
032eabb3a4 prevent modification to system tables.
SQLite does not allow us to modify system tables, but we do.
Let's fix it.
2025-09-04 17:34:47 -05:00
Nikita Sivukhin
4a3d3b3b8c mark completion as done only after callback will be executed
- otherwise, in multi-threading environment, other thread can think that completion is finished
  and start execution
- this can lead to violated assertions (for example, page must be loaded, but as callback is not executed yet
  assert will be fired)
2025-09-04 23:48:08 +04:00
Pekka Enberg
ecbcd1ecd3 Merge ' core/mvcc: make commit_txn return on I/O ' from Pere Diaz Bou
`commit_txn` in MVCC was hacking its way through I/O until now. After
adding this and the test for concurrent writers we now see `busy` errors
returning as expected because there is no `commit` queueing happening
yet until next PR I open.

Closes #2895
2025-09-04 21:24:10 +03:00
Pekka Enberg
5950003eaf core: Simplify WalFileShared life cycle
Create one WalFileShared for a Database and update its state
accordingly. Also support case where the WAL is disabled.
2025-09-04 21:09:12 +03:00
Pekka Enberg
efc105d99e Merge 'Fix column count in ImmutableRow' from Glauber Costa
When we create an ImmutableRow::from_value(), we are always adding a
null padding at the end. We didn't notice this before, because a SQLite
file with an extra column is as valid as any. But that column of course
should not be there.
I traced this to column_count(), which is off by one. My understanding
is that we should be returning based on serial_types, not offset.

Closes #2862
2025-09-04 15:04:13 +03:00
Pekka Enberg
98398a9fe6 Merge 'windows iterator returns no values for shorter slice' from Lâm Hoàng Phúc
Closes #2912
2025-09-04 13:10:18 +03:00
Pekka Enberg
7e2bfd8bc1 Merge 'Refactor LIMIT/OFFSET handling to support expressions' from bit-aloo
Add expression support for `LIMIT` and `OFFSET` by storing them as
`Expr` instead of fixed integers. Constant expressions are folded with
`try_fold_to_i64`, while dynamic ones emit runtime checks, including the
new `IfNeg` opcode to clamp negative or `NULL` values to zero. The
current `build_limit_offset_expr` implementation is still naive and will
be refined in future work.
Fixes #2913

Closes #2720
2025-09-04 11:43:50 +03:00
Pekka Enberg
ef0d10bf2f Merge 'Encryption: add support for other AEGIS and AES-GCM cipher variants' from Frank Denis
Now supported:
- AEGIS variants: 256, 256X2, 256X4, 128L, 128X2, 128X4
- AES-GCM variants: AES-128-GCM, AES-256-GCM
With minor changes in order to make it easy to add new ciphers later
regardless of their key size.

Reviewed-by: Avinash Sajjanshetty (@avinassh)

Closes #2899
2025-09-04 11:42:16 +03:00
Pekka Enberg
44357f93a2 Merge branch 'main' into 2025-08-21-make-limit-and-offset-expr 2025-09-04 09:54:45 +03:00
TcMits
ddbfb6cc16 make clippy happy 2025-09-04 13:04:53 +07:00
TcMits
ce6ff74cd6 add test 2025-09-04 13:02:10 +07:00
TcMits
94b1cf9ab5 windows iterator returns no values for shorter slice 2025-09-04 12:09:21 +07:00
Preston Thorpe
caaf60a7ea Merge 'Unify resolution of aggregate functions' from Piotr Rżysko
This PR unifies the logic for resolving aggregate functions. Previously,
bare aggregates (e.g. `SELECT max(a) FROM t1`) and aggregates wrapped in
expressions (e.g. `SELECT max(a) + 1 FROM t1`) were handled differently,
which led to duplicated code. Now both cases are resolved consistently.
The added benchmark shows a small improvement:
```
Prepare `SELECT first_name, last_name, state, city, age + 10, LENGTH(email), UPPER(first_name), LOWE...
                        time:   [59.791 µs 59.898 µs 60.006 µs]
                        change: [-7.7090% -7.2760% -6.8242%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe
```
For an existing benchmark, no change:
```
Prepare `SELECT first_name, count(1) FROM users GROUP BY first_name HAVING count(1) > 1 ORDER BY cou...
                        time:   [11.895 µs 11.913 µs 11.931 µs]
                        change: [-0.2545% +0.2426% +0.6960%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  5 (5.00%) high severe
```

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Reviewed-by: Preston Thorpe <preston@turso.tech>

Closes #2884
2025-09-03 19:46:04 -04:00
Preston Thorpe
c55c7d76c3 Merge 'replace some matches with match_ignore_ascii_case macro' from Lâm Hoàng Phúc
Reviewed-by: Nikita Sivukhin (@sivukhin)

Closes #2903
2025-09-03 17:03:19 -04:00
PThorpe92
d894a62132 Add plumbing in io_uring to handle linked writes to ensure consistency 2025-09-03 16:01:16 -04:00
PThorpe92
e3f366963d Compute the final db page or make the commit frame submit a linked pwritev completion 2025-09-03 16:01:16 -04:00
PThorpe92
3831218f6c Add linked to completion types in io/mod 2025-09-03 16:01:16 -04:00
PThorpe92
c5b6df4249 Use mutex in place of spinlock for io_uring 2025-09-03 11:12:33 -04:00
PThorpe92
30454336a6 Make io_uring sound for connections across multiple threads 2025-09-03 10:54:42 -04:00
Pere Diaz Bou
8db5cead07 core/mvcc: only commit if there is a txn 2025-09-03 14:12:48 +02:00
Pere Diaz Bou
b8f83e1fc0 clippy and fmt stuff because if not pekka will tweet 2025-09-03 12:47:55 +02:00
sonhmai
2b6cb39c7e core: handle edge cases for read_varint 2025-09-03 15:43:34 +07:00
TcMits
b6fca2718f fmt 2025-09-03 13:41:23 +07:00