Commit Graph

2340 Commits

Author SHA1 Message Date
Pekka Enberg
7257fb8aae Merge 'core: move pragma statement bytecode generator to its own file.' from Sonny
What?
- no logic change
- refactored and moved pragma statement bytecode generation to its own
package to better structure.

Closes #871
2025-02-03 09:10:33 +02:00
Pekka Enberg
6c34737240 Merge 'Fix rowid generation' from Nikita Sivukhin
Fix a panic when a table has a row with rowid equal to `-1`
(`= u64::MAX` when reinterpreted as unsigned)
```sql
limbo> CREATE TABLE t(x INTEGER PRIMARY KEY)
limbo> INSERT INTO t VALUES (-1)
limbo> INSERT INTO t VALUES (NULL);
thread 'main' panicked at core/vdbe/mod.rs:2499:21:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
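A minimal sketch of the failure mode, not the actual fix in `core/vdbe/mod.rs`: assuming the largest stored rowid is held as a `u64`, a rowid of `-1` reinterprets to `u64::MAX`, and a plain `+ 1` overflows. The helper names `next_rowid` and `as_unsigned` are hypothetical.

```rust
/// Hypothetical helper: compute the next auto-generated rowid from the
/// largest rowid currently stored. Returns None instead of panicking
/// when the addition would overflow.
fn next_rowid(current_max: u64) -> Option<u64> {
    current_max.checked_add(1)
}

/// Reinterpret a signed rowid as unsigned, as in the `-1` case above.
fn as_unsigned(rowid: i64) -> u64 {
    rowid as u64
}
```

With these, `next_rowid(as_unsigned(-1))` yields `None`, so the caller can choose a fallback strategy rather than crash.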

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #868
2025-02-03 09:09:12 +02:00
Pekka Enberg
662d629666 Rename JoinAwareConditionExpr to WhereTerm
We transform all JOIN conditions into WHERE clause terms in the query
planner. The JoinAwareConditionExpr name tries to make that point, but I
think it makes things more confusing. Let's call it WhereTerm (suggested
by Jussi).
2025-02-03 07:46:51 +02:00
Pekka Enberg
bbf73da28f Merge 'core/translate: refactor query planner again to be simpler' from Jussi Saurio
## Simplify bookkeeping of referenced tables in the query planner
This PR refactors the way we track referenced tables and associated
planner operations related to them (scans, index searches, etc).
## The problem with what we currently have:
- We have a tree data structure called `SourceOperator` which is either
a `Join`, `Scan`, `Search`, `Subquery` or `Nothing`.
```rust
/**
  A SourceOperator is a Node in the query plan that reads data from a table.
*/
#[derive(Clone, Debug)]
pub enum SourceOperator {
    // Join operator
    // This operator is used to join two source operators.
    // It takes a left and right source operator, a list of predicates to evaluate,
    // and a boolean indicating whether it is an outer join.
    Join {
        id: usize,
        left: Box<SourceOperator>,
        right: Box<SourceOperator>,
        predicates: Option<Vec<ast::Expr>>,
        outer: bool,
        using: Option<ast::DistinctNames>,
    },
    // Scan operator
    // This operator is used to scan a table.
    // It takes a table to scan and an optional list of predicates to evaluate.
    // The predicates are used to filter rows from the table.
    // e.g. SELECT * FROM t1 WHERE t1.foo = 5
    // The iter_dir is used to indicate the direction of the iterator.
    // The use of Option for iter_dir is aimed at implementing a conservative optimization strategy: it only pushes
    // iter_dir down to Scan when iter_dir is None, to prevent potential result set errors caused by multiple
    // assignments. for more detailed discussions, please refer to https://github.com/penberg/limbo/pull/376
    Scan {
        id: usize,
        table_reference: TableReference,
        predicates: Option<Vec<ast::Expr>>,
        iter_dir: Option<IterationDirection>,
    },
    // Search operator
    // This operator is used to search for a row in a table using an index
    // (i.e. a primary key or a secondary index)
    Search {
        id: usize,
        table_reference: TableReference,
        search: Search,
        predicates: Option<Vec<ast::Expr>>,
    },
    Subquery {
        id: usize,
        table_reference: TableReference,
        plan: Box<SelectPlan>,
        predicates: Option<Vec<ast::Expr>>,
    },
    // Nothing operator
    // This operator is used to represent an empty query.
    // e.g. SELECT * from foo WHERE 0 will eventually be optimized to Nothing.
    Nothing {
        id: usize,
    },
}
```
- Logically joins are a tree, but this is at least marginally bad for
performance because each `Join` has two boxed child operators. For a
3-table query, for example, you have 3 `Scan` nodes and then 2 `Join`
nodes.
- There are other bigger problems too, though, related to code
structure. We have been carrying around a separate vector of
`referenced_tables` that columns can refer to by index:
```rust
/// A query plan has a list of TableReference objects, each of which represents a table or subquery.
#[derive(Clone, Debug)]
pub struct TableReference {
    /// Table object, which contains metadata about the table, e.g. columns.
    pub table: Table,
    /// The name of the table as referred to in the query, either the literal name or an alias e.g. "users" or "u"
    pub table_identifier: String,
    /// The index of this reference in the list of TableReference objects in the query plan
    /// The reference at index 0 is the first table in the FROM clause, the reference at index 1 is the second table in the FROM clause, etc.
    /// So, the index is relevant for determining when predicates (WHERE, ON filters etc.) should be evaluated.
    pub table_index: usize,
    /// The type of the table reference, either BTreeTable or Subquery
    pub reference_type: TableReferenceType,
}
```
- `referenced_tables` is used because SQLite joins are an `n^tables_len`
nested loop, and we need to figure out during which loop to evaluate a
condition expression. A lot of plumbing in the current code exists for
this, e.g. "pushing predicates" in `optimizer` even though "predicate
pushdown" as a query planner concept is an _optimization_, but in our
current system the "pushdown" is really a _necessity_ to move the
condition expressions to the correct `SourceOperator::predicates` vector
so that they are evaluated at the right point.
- `referenced_tables` is also used to map identifiers in the query to
the correct table, e.g. 'foo' in `SELECT foo FROM users` becomes an
`ast::Expr::Column { table: 0, .. }` if `users` is the first table in
`referenced_tables`.
In addition to this, we ALSO had a `TableReferenceType` separately for
checking whether the upper-level query is reading from a BTree table or
a Subquery.
```rust
/// The type of the table reference, either BTreeTable or Subquery
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TableReferenceType {
    /// A BTreeTable is a table that is stored on disk in a B-tree index.
    BTreeTable,
    /// A subquery.
    Subquery {
        /// The index of the first register in the query plan that contains the result columns of the subquery.
        result_columns_start_reg: usize,
    },
}
```
...even though we already have a `SourceOperator::Subquery` that should be
able to encode this information, but doesn't, because it's a tree and we
refer to things by index in `referenced_tables`.
### Why this is especially stupid
`SourceOperator` and `TableReference` are basically just two
representations of the same thing, one in tree format and another in
vector format. `SourceOperator` even carries around its own copy of
`TableReference`, even though we ALSO have `referenced_tables:
Vec<TableReference>` 🤡
Note that I'm allowed to call the existing code stupid because I wrote
it.
## What we can do instead
Basically, we can just fold the concerns from `SourceOperator` into
`TableReference` and have a list of those in the query plan, one per
table, in loop order (outermost loop is 0, and so on).
Funnily enough, when Limbo had very very few features we used to have a
Vec of LoopInfos similarly, obviously with a lot less information than
now, but for SQLite it's probably the right abstraction. :)
```rust
pub struct SelectPlan {
    /// List of table references in loop order, outermost first.
    pub table_references: Vec<TableReference>,
    // ...etc...
}

/// A table reference in the query plan.
/// For example, SELECT * FROM users u JOIN products p JOIN (SELECT * FROM users) sub
/// has three table references:
/// 1. operation=Scan, table=users, table_identifier=u, reference_type=BTreeTable, join_info=None
/// 2. operation=Scan, table=products, table_identifier=p, reference_type=BTreeTable, join_info=Some(JoinInfo { outer: false, using: None }),
/// 3. operation=Subquery, table=users, table_identifier=sub, reference_type=Subquery, join_info=None
#[derive(Debug, Clone)]
pub struct TableReference {
    /// The operation that this table reference performs.
    pub op: Operation,
    /// Table object, which contains metadata about the table, e.g. columns.
    pub table: Table,
    /// The name of the table as referred to in the query, either the literal name or an alias e.g. "users" or "u"
    pub identifier: String,
    /// The join info for this table reference if it is the right side of a join (which all except the first table reference have)
    pub join_info: Option<JoinInfo>,
}
```
And we keep the "operation" part from the "operator", but in a simple
form:
```rust
pub enum Operation {
    Scan {
        iter_dir: Option<IterationDirection>,
    },
    Search(Search),
    Subquery {
        plan: Box<SelectPlan>,
        result_columns_start_reg: usize
    },
}
```
Now we don't need to carry around both the operator tree and
`Vec<TableReference>`, because they are the same thing. If something
refers to the `n'th table`, it is just `plan.table_references[n]`.
We also don't need to recurse through the operator tree and usually we
can just loop from outermost table to innermost table.
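To illustrate the point about indexing and looping, here is a small sketch with simplified stand-ins for the structs described above. The real `TableReference` carries more fields, and `describe_loops` and `example_plan` are hypothetical helpers invented for this example.

```rust
/// Simplified stand-in for the planner's operation enum.
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    Scan,
    Search,
}

/// Simplified stand-in; the real struct also holds table metadata and join info.
#[derive(Debug, Clone)]
struct TableReference {
    op: Operation,
    identifier: String,
}

struct SelectPlan {
    /// Table references in loop order, outermost first.
    table_references: Vec<TableReference>,
}

/// Describe the nested loops by walking the Vec once; no recursion is needed.
fn describe_loops(plan: &SelectPlan) -> Vec<String> {
    plan.table_references
        .iter()
        .enumerate()
        .map(|(depth, t)| format!("loop {}: {:?} {}", depth, t.op, t.identifier))
        .collect()
}

/// Hypothetical two-table plan used only for this example.
fn example_plan() -> SelectPlan {
    SelectPlan {
        table_references: vec![
            TableReference { op: Operation::Scan, identifier: "u".to_string() },
            TableReference { op: Operation::Search, identifier: "p".to_string() },
        ],
    }
}
```

Here "the second table" is simply `example_plan().table_references[1]`, and the loop order falls out of the Vec order.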
---
### Handling the "where to evaluate a condition expression" problem
You can see I've also removed the `predicates` vector from `Scan` and
friends. Previously each `SourceOperator` had a vector of `predicates`
so that we knew at which loop depth to evaluate a condition. Now we
align more with what SQLite does -- it puts all the conditions, even the
join conditions, in the `WHERE` clause and adds extra metadata to them:
```rust
/// In a query plan, WHERE clause conditions and JOIN conditions are all folded into a vector of JoinAwareConditionExpr.
/// This is done so that we can evaluate the conditions at the correct loop depth.
/// We also need to keep track of whether the condition came from an OUTER JOIN. Take this example:
/// SELECT * FROM users u LEFT JOIN products p ON u.id = 5.
/// Even though the condition only refers to 'u', we CANNOT evaluate it at the users loop, because we need to emit NULL
/// values for the columns of 'p', for EVERY row in 'u', instead of completely skipping any rows in 'u' where the condition is false.
#[derive(Debug, Clone)]
pub struct JoinAwareConditionExpr {
    /// The original condition expression.
    pub expr: ast::Expr,
    /// Is this condition originally from an OUTER JOIN?
    /// If so, we need to evaluate it at the loop of the right table in that JOIN,
    /// regardless of which tables it references.
    /// We also cannot e.g. short circuit the entire query in the optimizer if the condition is statically false.
    pub from_outer_join: bool,
    /// The loop index where to evaluate the condition.
    /// For example, in `SELECT * FROM u JOIN p WHERE u.id = 5`, the condition can already be evaluated at the first loop (idx 0),
    /// because that is the rightmost table that it references.
    pub eval_at_loop: usize,
}
```
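A rough sketch of how `eval_at_loop` could be derived from the rules in the doc comments above. The function name, its parameters, and the choice to evaluate table-free conditions at loop 0 are assumptions for illustration, not the planner's actual code.

```rust
/// Hypothetical helper: decide at which loop (0 = outermost) a condition
/// should be evaluated.
///
/// `referenced_tables` holds the loop indexes of the tables the expression
/// mentions. `outer_join_right_table` is Some(idx) when the condition came
/// from the ON clause of an OUTER JOIN whose right-hand table is at `idx`.
fn eval_at_loop(referenced_tables: &[usize], outer_join_right_table: Option<usize>) -> usize {
    match outer_join_right_table {
        // Outer join conditions are pinned to the right table's loop,
        // regardless of which tables they reference.
        Some(right_idx) => right_idx,
        // Otherwise use the innermost (rightmost) referenced table.
        // Assumption: a condition that references no tables is checked
        // at the outermost loop.
        None => referenced_tables.iter().copied().max().unwrap_or(0),
    }
}
```

For `SELECT * FROM u JOIN p WHERE u.id = 5` this gives `eval_at_loop(&[0], None) == 0`, while the `LEFT JOIN ... ON u.id = 5` case gives `eval_at_loop(&[0], Some(1)) == 1`.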
### Final notes
I've been wanting to make this refactor for a long time now, but the
last straw was when I was making a PR trying to reduce some of the
massive amount of allocations happening in the read path currently, and
I got stuck because of this Operator + referenced_tables shit getting in
the way constantly. So I decided I wanted to get it over with now.
The PR is again very big, but I've artificially split it up into commits
that don't individually compile but at least separate the changes a bit
for you, dear reader.

Closes #853
2025-02-03 07:45:58 +02:00
sonhmai
2d4bf2eb62 core: move pragma statement bytecode generator to its own file. 2025-02-03 09:21:14 +07:00
Nikita Sivukhin
2b9220992d fix attempt to add with overflow crash in case of rowid auto-generation 2025-02-02 20:10:58 +04:00
Nikita Sivukhin
e63d84ed50 refine assertions 2025-02-02 20:10:38 +04:00
Nikita Sivukhin
6cc1b778b4 add test with rowid=-1
- currently limbo attempts to add with overflow and panics in this case
2025-02-02 20:02:59 +04:00
Pekka Enberg
593febd9a4 Add Limbo internals doc 2025-02-02 11:42:56 +02:00
Jussi Saurio
c18c6ad64d Marginal changes to use new data structures and field names 2025-02-02 10:18:13 +02:00
Jussi Saurio
82a2850de9 subquery.rs: use iteration instead of recursion and simplify 2025-02-02 10:18:13 +02:00
Jussi Saurio
98439cd936 optimizer.rs: refactor to use new data structures and remove unnecessary stuff
We don't need `push_predicates()` because that never REALLY was a predicate
pushdown optimization -- it just pushed WHERE clause condition expressions
into the correct SourceOperator nodes in the tree.

Now that we don't have a SourceOperator tree anymore and we keep the conditions
in the WHERE clause instead, we don't need to "push" anything anymore. Leaves
room for ACTUAL predicate pushdown optimizations later :)

We also don't need any weird bitmask stuff anymore, and perhaps we never did,
to determine where conditions should be evaluated.
2025-02-02 10:18:13 +02:00
Jussi Saurio
89fba9305a main_loop.rs: use iteration instead of recursion
Now that we do not have a tree of SourceOperators but rather
a Vec of TableReferences, we can just use loops instead of
recursion for handling the main query loop.
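A minimal sketch of the iterative shape, assuming loops are opened outermost-first and closed in reverse. The function `emit_main_loop` and its string "instructions" are invented for illustration and do not mirror the real emitter.

```rust
/// Emit a textual trace of the nested loops for the given tables,
/// listed outermost first.
fn emit_main_loop(tables: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    // Open loops from the outermost table to the innermost.
    for t in tables {
        out.push(format!("open {}", t));
    }
    // The innermost body produces a result row.
    out.push("emit row".to_string());
    // Close loops in reverse order, innermost first.
    for t in tables.iter().rev() {
        out.push(format!("close {}", t));
    }
    out
}
```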
2025-02-02 10:18:13 +02:00
Jussi Saurio
09b6bad0af delete.rs: use new data structures when parsing delete 2025-02-02 10:18:13 +02:00
Jussi Saurio
2ddac4bf21 select.rs: use new data structures when parsing select 2025-02-02 10:18:13 +02:00
Jussi Saurio
16a97d3b98 planner.rs: refactor from/join + where parsing logic
- use new TableReference and JoinAwareConditionExpr
- add utilities for determining at which loop depth a
  WHERE condition should be evaluated, now that "operators"
  do not carry condition expressions inside them anymore.
2025-02-02 10:18:13 +02:00
Jussi Saurio
e63256f657 Change Display implementation of Plan to work with new data structures 2025-02-02 10:18:13 +02:00
Jussi Saurio
390d0e673f plan.rs: refactor data structures
- Get rid of SourceOperator tree
- Make plan have a Vec of TableReference, and TableReference now
  contains the information from the old SourceOperator.
- Remove `predicates` (conditions) from Table References -- put
  everything in the WHERE clause like SQLite, and attach metadata
  to the where clause expressions with JoinAwareConditionExpr struct.
- Refactor select_star() to be simpler now that we use a vec, not a tree
2025-02-02 10:18:13 +02:00
Pekka Enberg
dbb7d1a6ba Merge 'Pagecount' from Glauber Costa
This PR implements the Pagecount pragma, as well as its associated
bytecode opcode

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #819
2025-02-02 09:32:18 +02:00
Pekka Enberg
635c45a087 Merge 'Fix null expr codegen' from Nikita Sivukhin
This PR adjusts the emitted instructions for expressions that include the
`IS` / `IS NOT` operators (support for them in conditions was added in
#847)

Reviewed-by: Glauber Costa (@glommer)

Closes #857
2025-02-02 09:32:05 +02:00
Pekka Enberg
650b56e203 Merge 'Fix null cmp codegen' from Nikita Sivukhin
This PR removes the manual null-comparison optimization (introduced in #847)
which replaced `Eq`/`Ne` instructions with explicit `IsNull`/`NotNull`.
There are a few reasons for this change:
1. The manual optimization was incorrect because it ignored the
`jump_if_condition_is_true` flag, which is important for properly building
logical condition evaluation
2. The manual optimization handled every scenario in the test cases, so
scenarios where both sides are non-trivial expressions were not covered by tests
3. Limbo already marks literals in the initially emitted bytecode as
constants and evaluates and stores them only once, so the performance
difference from the manual optimization seems very minor to me (but I am
wrong with high probability)
4. I think (but again, I am wrong with high probability) that such a
replacement can be done in a generic optimizer layer instead of
manually encoding it in the first emit phase
Fixes #850

Reviewed-by: Glauber Costa (@glommer)

Closes #856
2025-02-02 09:32:00 +02:00
Glauber Costa
a3387cfd5f implement the pragma page_count
To do that, we also have to implement the vdbe opcode Pagecount.
2025-02-01 19:39:46 -05:00
Nikita Sivukhin
1bd8b4ef7a pass null_eq flag for instructions generated for expressions (not in the conditions) 2025-02-02 02:51:51 +04:00
Nikita Sivukhin
4a9292f657 add tests for previously broken case 2025-02-02 02:42:06 +04:00
Nikita Sivukhin
c7aed22e39 null_eq flag disable effect of jump_if_null flag - so it makes no sense to set them both 2025-02-02 02:29:02 +04:00
Nikita Sivukhin
478ee6be8d remove null optimization which didn't check for jump_if_condition_is_true flag
- limbo already stores constants only once, and cleverer optimizations
  are better done in a generic optimizer than manually
2025-02-02 02:28:07 +04:00
Pekka Enberg
20d3399c71 Merge 'implement is and is not where constraints' from Glauber Costa
The main difference between = / != and IS / IS NOT is how null values are handled.
SQLite passes a flag "NULLEQ" to Eq and Ne to disambiguate that.
In the presence of that flag, NULL = NULL.
Some prep work is done to make sure we can pass a flag instead of a
boolean to Eq and Ne. I looked into the bitflags crate but got a bit
scared with the list of dependencies.
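The NULL-handling difference can be sketched as follows. This is a simplified model with nullable integers, not the VDBE implementation, and `sql_eq` and its `null_eq` parameter are names chosen for the example.

```rust
/// SQL values reduced to nullable integers for this sketch.
type Value = Option<i64>;

/// Compare two values for equality.
/// Returns None for SQL NULL (unknown), Some(bool) otherwise.
/// With `null_eq` set (the IS operator), NULLs compare like ordinary values.
fn sql_eq(lhs: Value, rhs: Value, null_eq: bool) -> Option<bool> {
    match (lhs, rhs) {
        // Both sides are non-NULL: ordinary comparison.
        (Some(a), Some(b)) => Some(a == b),
        // At least one side is NULL and the flag is set: NULL equals NULL,
        // and NULL differs from any non-NULL value.
        _ if null_eq => Some(lhs == rhs),
        // At least one side is NULL without the flag: the result is NULL.
        _ => None,
    }
}
```

So `sql_eq(None, None, false)` is `None` (the `=` behaviour), while `sql_eq(None, None, true)` is `Some(true)` (the `IS` behaviour).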
Warning:
The following query produces a different result for Limbo:
```sql
select * from demo where value is null or id == 2;
```
I strongly suspect the issue is with the OR implementation, though. The
bytecode generated is quite different.

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #847
2025-02-01 17:24:11 +02:00
Pekka Enberg
83f9290394 Merge 'Remove labeler 😥' from Kim Seon Woo
Let's add it back once we figure out how to use GH_TOKEN properly

Closes #852
2025-02-01 17:23:41 +02:00
김선우
45e0e86516 Remove labeler 😥 2025-02-02 00:04:49 +09:00
Glauber Costa
3c77797811 also mark IS DISTINCT FROM as supported
This seems to really be just an alias for IS:

"The IS NOT DISTINCT FROM operator is an alternative spelling for the IS
operator. Likewise, the IS DISTINCT FROM operator means the same thing
as IS NOT. Standard SQL does not support the compact IS and IS NOT
notation. Those compact forms are an SQLite extension. You have to use
the prolix and much less readable IS NOT DISTINCT FROM and IS DISTINCT
FROM operators on other SQL database engines."
2025-02-01 09:30:06 -05:00
Glauber Costa
c04260ab54 rename Flags to a less ambiguous name
Those Flags in SQLite are global, but it doesn't mean it has to be
the case for us as well.
2025-02-01 08:09:06 -05:00
Pekka Enberg
51f0c9e8a3 Merge 'Full flake overhaul' from Levy A.
Improvements:
- Use [rust-overlay](https://github.com/oxalica/rust-overlay), better
maintained than fenix and allows for:
  - Use `rust-toolchain.toml` as the source of truth for the current rust
  version, instead of tracking with stable. Preventing conflicting
  versions with non-nix users.
- Add flake checks, could be useful for CI in the future, together with
crane and cachix.
- Add package, allow people to add limbo as a regular nix package. Now
we can `nix build .#`, `nix run .#` and `nix shell .#` (this one adds
`limbo` to the current `PATH`)
- Use [new `apple-sdk` pattern](https://discourse.nixos.org/t/the-darwin-sdks-have-been-updated/55295),
no need to declare each framework now.

Closes #835
2025-02-01 10:34:21 +02:00
Pekka Enberg
a450b5cd39 Update README.md 2025-02-01 09:46:21 +02:00
Pekka Enberg
8c4ef098ef Update README.md 2025-02-01 09:42:13 +02:00
Pekka Enberg
e7f18c4736 Merge 'bindings/go: Progress on Go driver, add sync primitives, prevent crashing on concurrent connections' from Preston Thorpe
This PR continues work on the Go bindings.
- Register all symbols from the library at load time to prevent any
repeated `dlsym` calls.
- Add locks to prevent multiple concurrent FFI calls to functions that
act on the same state.
- Adds documentation/example in the go module `README`.
- Fixes a memory access issue that caused a segfault when passing a pointer
to an array of strings, which is difficult to work with in Go without the
right primitives. Instead, simply return the number of result columns, and
Go can provide the index to receive the column name, similar to
`rowsGetValue`
On next limbo release, I'll add the example to the main `README` next to
the other language examples. Until then, `go get
github.com/tursodatabase/limbo` will not work so the example will remain
in the bindings readme.

Closes #845
2025-02-01 09:25:52 +02:00
Pekka Enberg
43d6c2760d Merge 'update compat list' from Glauber Costa
Those two expr seem to be supported

Closes #846
2025-02-01 09:24:27 +02:00
Pekka Enberg
db29f43d5c Merge 'Simplify bytecode emitters' from Glauber Costa
Instead of always having the caller specify all instructions, this
work introduces convenience functions into the program builder,
making the code a lot cleaner.
Draft for now, as this is done on top of #841

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #844
2025-02-01 09:24:11 +02:00
Pekka Enberg
76535d1224 Merge 'github: Configure labeler workflow environment' from Pekka Enberg
This fix was suggested by @seonWKim.

Closes #848
2025-02-01 09:23:53 +02:00
Pekka Enberg
a3ecc69bbb github: Configure labeler workflow environment
This fix was suggested by @seonWKim.
2025-02-01 09:22:17 +02:00
Glauber Costa
96987db6ca implement is and is not where constraints
The main difference between = / != and IS / IS NOT is how null values are handled.
SQLite passes a flag "NULLEQ" to Eq and Ne to disambiguate that.
In the presence of that flag, NULL = NULL.

Some prep work is done to make sure we can pass a flag instead of a
boolean to Eq and Ne. I looked into the bitflags crate but got a bit
scared with the list of dependencies.
2025-01-31 23:01:49 -05:00
PThorpe92
7ee52fca4d bindings/go: update readme with example, change module name 2025-01-31 19:22:21 -05:00
Glauber Costa
f300d2c8e8 rename register for IsNull opcode
Now it has the same name as NotNull, so it is easier to write macros
2025-01-31 19:09:01 -05:00
Glauber Costa
7e8b190b9a update compat list
Those two expr seem to be supported
2025-01-31 16:56:19 -05:00
PThorpe92
8d93130809 bindings/go: enable multiple connections, register all symbols at library load 2025-01-31 13:28:05 -05:00
PThorpe92
950f29daab bindings/go: Adjust tests for multiple concurrent connections 2025-01-31 13:28:05 -05:00
Pekka Enberg
98579ab2e4 Merge 'Implement Noop bytecode' from Pedro Muniz
This PR implements Noop. I really don't know what else to say. This
bytecode according to sqlite does: _Do nothing. Continue downward to the
next opcode._ I advanced the program counter to account for continuing
to the next instruction.
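As a toy illustration of that behaviour, here is a two-opcode interpreter loop. The `Insn` enum and `run` function are stand-ins invented for this sketch, not Limbo's VDBE types.

```rust
/// Toy instruction set for this sketch.
#[derive(Debug, Clone, Copy)]
enum Insn {
    Noop,
    Halt,
}

/// Run until Halt (or the end of the program), returning how many
/// instructions were executed.
fn run(program: &[Insn]) -> usize {
    let mut pc = 0;
    let mut executed = 0;
    while pc < program.len() {
        executed += 1;
        match program[pc] {
            // Do nothing; continue downward to the next opcode.
            Insn::Noop => pc += 1,
            // Stop execution.
            Insn::Halt => break,
        }
    }
    executed
}
```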

Closes #795
2025-01-31 18:49:54 +02:00
Pekka Enberg
44e5402464 Merge branch 'main' into feature/noop 2025-01-31 18:49:39 +02:00
Glauber Costa
7aa3cc26ad simplify the writing of bytecode programs
Instead of always having the caller specify all instructions, this
work introduces convenience functions into the program builder,
making the code a lot cleaner.
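A sketch of what such convenience functions might look like. The `ProgramBuilder`, `Insn`, and method names below are assumptions made for illustration and may not match the actual builder API.

```rust
/// Toy instruction set for this sketch.
#[derive(Debug, PartialEq)]
enum Insn {
    Integer { value: i64, dest: usize },
    Halt,
}

/// Hypothetical program builder that collects instructions.
#[derive(Default)]
struct ProgramBuilder {
    insns: Vec<Insn>,
}

impl ProgramBuilder {
    /// Low-level entry point: the caller spells out the whole instruction.
    fn emit_insn(&mut self, insn: Insn) {
        self.insns.push(insn);
    }

    /// Convenience wrapper: load an integer constant into a register.
    fn emit_int(&mut self, value: i64, dest: usize) {
        self.emit_insn(Insn::Integer { value, dest });
    }

    /// Convenience wrapper: emit a Halt instruction.
    fn emit_halt(&mut self) {
        self.emit_insn(Insn::Halt);
    }
}
```

Call sites then read `builder.emit_int(7, 1)` rather than constructing `Insn::Integer { value: 7, dest: 1 }` by hand each time.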
2025-01-31 11:35:51 -05:00
Glauber Costa
b37317f68b avoid allocations during pragma_list
If we keep the pragma list sorted when declaring it, we can avoid
a vector allocation when printing the pragma_list.
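The idea can be sketched like this. The pragma names listed are illustrative rather than Limbo's actual list, and `PRAGMAS`, `is_sorted`, and `pragma_list` are hypothetical names.

```rust
/// Illustrative pragma names, declared in sorted order.
const PRAGMAS: &[&str] = &["cache_size", "journal_mode", "page_count", "table_info"];

/// Check that the declaration order is already sorted.
fn is_sorted(names: &[&str]) -> bool {
    names.windows(2).all(|w| w[0] <= w[1])
}

/// Iterate the static slice directly; no Vec is allocated or sorted.
fn pragma_list() -> impl Iterator<Item = &'static str> {
    PRAGMAS.iter().copied()
}
```

A unit test asserting `is_sorted(PRAGMAS)` would keep the declaration order honest as new pragmas are added.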
2025-01-31 11:35:51 -05:00
Pekka Enberg
d8a9c57d3a Merge 'Fix table with single column PRIMARY KEY to not create extra btree' from Krishna Vishal
The error was caused by comparing the PRIMARY KEY column's type name to
`INTEGER` when it was written entirely in lowercase. This caused
`needs_auto_index` to be set to `true`.
After the fix:
```
/limbo /tmp/sc2-limbo.db
Limbo v0.0.13
Enter ".help" for usage hints.
limbo> CREATE TABLE temp (t1 integer, primary key (t1));

hexdump -s 28 -n 4 /tmp/sc2-limbo.db
000001c 0000 0200 -- matches SQLite
0000020
```
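One way to make such a comparison case-insensitive is sketched below. This is not the actual patch, and `is_integer_type` is a hypothetical helper name.

```rust
/// Hypothetical check: does a declared column type name mean INTEGER,
/// regardless of the case it was written in?
fn is_integer_type(type_name: &str) -> bool {
    type_name.eq_ignore_ascii_case("INTEGER")
}
```

With this, `integer`, `Integer`, and `INTEGER` are all treated the same way.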
Closes https://github.com/tursodatabase/limbo/issues/824

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #830
2025-01-31 18:33:28 +02:00