This change connects virtual tables with the query optimizer.
The optimizer now considers virtual tables during join order search
and invokes their best_index callbacks to determine feasible access
paths.
Currently, this is not a visible change, since none of the existing
extensions return information indicating that a plan is invalid.
Different scan parameters are required for different table types.
Currently, index and iteration direction are only used by B-tree tables,
while the remaining table types don’t require any parameters. Planning
access to virtual tables, however, will require passing additional
information from the planner, such as the virtual table index (distinct
from a B-tree index) and the constraints that must be forwarded to the
`filter` method.
Closes: #1947
This PR replaces the `Name(pub String)` struct with a `Name` enum that
explicitly models how the name appeared in the source either as an
unquoted identifier (`Ident`) or a quoted string (`Quoted`).
In the process, the separate `Id` wrapper type has been coalesced into
the `Name` enum, simplifying the AST and reducing duplication in
identifier handling logic.
While this increases the size of some AST nodes (notably
`yyStackEntry`).
cc: @levydsa
Reviewed-by: Levy A. (@levydsa)
Reviewed-by: Preston Thorpe (@PThorpe92)
Closes#2251
Support for attaching databases. The main difference from SQLite is that
we support an arbitrary number of attached databases, and we are not
bound to just 100ish.
We for now only support read-only databases. We open them as read-only,
but also, to keep things simple, we don't patch any of the insert
machinery to resolve foreign tables. So if an insert is tried on an
attached database, it will just fail with a "no such table" error - this
is perfect for now.
The code in core/translate/attach.rs is written by Claude, who also
played a key part in the boilerplate for stuff like the .databases
command and extending the pragma database_list, and also aided me in
the test cases.
This commit replaces the `Name(pub String)` struct with a `Name` enum that
explicitly models how the name appeared in the source either as an
unquoted identifier (`Ident`) or a quoted string (`Quoted`).
In the process, the separate `Id` wrapper type has been coalesced into the
`Name` enum, simplifying the AST and reducing duplication in identifier
handling logic.
While this increases the size of some AST nodes (notably `yyStackEntry`),
it improves correctness and makes source structure more explicit for
later phases.
Previously, the logic for collecting non-aggregate columns was duplicated
across multiple locations and implemented inconsistently. This caused a
bug that was revealed by the refactoring in this commit (see the added
test).
- TableReferences struct, which holds both:
- joined_tables, and
- outer_query_refs
- JoinedTable:
- this is just a rename of the previous TableReference struct
- OuterQueryReference
- this is to distinguish from JoinedTable those cases where
e.g. a subquery refers to an outer query's table, or a CTE
refers to a previous CTE.
Both JoinedTable and OuterQueryReference can be referred to by expressions,
but only JoinedTables are considered for join ordering optimization and so
forth.
This commit does not compile.
Currently we have this:
program.alloc_cursor_id(Option<String>, CursorType)`
where the String is the table's name or alias ('users' or 'u' in
the query).
This is problematic because this can happen:
`SELECT * FROM t WHERE EXISTS (SELECT * FROM t)`
There are two cursors, both with identifier 't'. This causes a bug
where the program will use the same cursor for both the main query
and the subquery, since they are keyed by 't'.
Instead introduce `CursorKey`, which is a combination of:
1. `TableInternalId`, and
2. index name (Option<String> -- in case of index cursors.
This should provide key uniqueness for cursors:
`SELECT * FROM t WHERE EXISTS (SELECT * FROM t)`
here the first 't' will have a different `TableInternalId` than the
second `t`, so there is no clash.
Currently in the main translation logic after planning and optimization,
we don't _really_ need to pass a &mut Vec<WhereTerm> around anymore, except
for the fact that virtual table constraint resolution is done ad-hoc in
`init_loop()`. Even there, the only thing we mutate is `WhereTerm::consumed`
which is a boolean indicating that the term has been "used up" by the optimizer
and shouldn't be evaluated as a normal where clause condition anymore.
In the upcoming branch for WHERE clause subqueries, I want to store immutable
references to WHERE clause expressions in `Resolver`, but this is unfortunately
not possible if we still use the aforementioned mutable references.
Hence, we can temporarily make `WhereTerm::consumed` a `Cell<bool>` which allows
us to pass an immutable reference to `init_loop()`, and the `Cell` can be removed
once the virtual table constraint resolution is moved to an earlier part of the
query processing pipeline.
Currently our "table id"/"table no"/"table idx" references always
use the direct index of the `TableReference` in the plan, e.g. in
`SelectPlan::table_references`. For example:
```rust
Expr::Column { table: 0, column: 3, .. }
```
refers to the 0'th table in the `table_references` list.
This is a fragile approach because it assumes the table_references
list is stable for the lifetime of the query processing. This has so
far been the case, but there exist certain query transformations,
e.g. subquery unnesting, that may fold new table references from
a subquery (which has its own table ref list) into the table reference
list of the parent.
If such a transformation is made, then potentially all of the Expr::Column
references to tables will become invalid. Consider this example:
```sql
-- Assume tables: users(id, age), orders(user_id, amount)
-- Get total amount spent per user on orders over $100
SELECT u.id, sub.total
FROM users u JOIN
(SELECT user_id, SUM(amount) as total
FROM orders o
WHERE o.amount > 100
GROUP BY o.user_id) sub
WHERE u.id = sub.user_id
-- Before subquery unnesting:
-- Main query table_references: [users, sub]
-- u.id refers to table 0, column 0
-- sub.total refers to table 1, column 1
--
-- Subquery table_references: [orders]
-- o.user_id refers to table 0, column 0
-- o.amount refers to table 0, column 1
--
-- After unnesting and folding subquery tables into main query,
-- the query might look like this:
SELECT u.id, SUM(o.amount) as total
FROM users u JOIN orders o ON u.id = o.user_id
WHERE o.amount > 100
GROUP BY u.id;
-- Main query table_references: [users, orders]
-- u.id refers to table index 0 (correct)
-- o.amount refers to table index 0 (incorrect, should be 1)
-- o.user_id refers to table index 0 (incorrect, should be 1)
```
We could ofc traverse every expression in the subquery and rewrite
the table indexes to be correct, but if we instead use stable identifiers
for each table reference, then all the column references will continue
to be correct.
Hence, this PR introduces a `TableInternalId` used in `TableReference`
as well as `Expr::Column` and `Expr::Rowid` so that this kind of query
transformations can happen with less pain.
Previously the Operation enum consisted of:
- Operation::Scan
- Operation::Search
- Operation::Subquery
Which was always a dumb hack because what we really are doing is an
Operation::Scan on a "virtual"/"pseudo" table (overloaded names...)
derived from a subquery appearing in the FROM clause.
Hence, refactor the relevant data structures so that the Table enum now
contains a new variant:
Table::FromClauseSubquery
And the Operation enum only consists of Scan and Search.
```
SELECT * FROM (SELECT ...) sub;
-- the subquery here was previously interpreted as Operation::Subquery on a Table::Pseudo,
-- with a lot of special handling for Operation::Subquery in different code paths
-- now it's an Operation::Scan on a Table::FromClauseSubquery
```
No functional changes (intended, at least!)
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1529
Reviewable commit by commit. CI failures are not related.
Adds support for e.g. `select first_name, sum(distinct age),
count(distinct age), avg(distinct age) from users group by 1`
Implementation details:
- Creates an ephemeral index per distinct aggregate, and jumps over the
accumulation step if a duplicate is found
Closes#1507
Previously the Operation enum consisted of:
- Operation::Scan
- Operation::Search
- Operation::Subquery
Which was always a dumb hack because what we really are doing is
an Operation::Scan on a "virtual"/"pseudo" table (overloaded names...)
derived from a subquery appearing in the FROM clause.
Hence, refactor the relevant data structures so that the Table enum
now contains a new variant:
Table::FromClauseSubquery
And the Operation enum only consists of Scan and Search.
No functional changes (intended, at least!)
We've run into trouble in multiple places due to the fact that
we delete terms from the where clause (e.g. when a constant condition
is removed, or the term becomes part of an index seek key).
A simpler solution is to add a flag indicating that the term is
consumed (used), so that it is not translated in the main loop
anymore when WHERE clause terms are evaluated.