## Simplify bookkeeping of referenced tables in the query planner
This PR refactors the way we track referenced tables and associated
planner operations related to them (scans, index searches, etc).
## The problem with what we currently have:
- We have a tree data structure called `SourceOperator` which is either
a `Join`, `Scan`, `Search`, `Subquery` or `Nothing`.
```rust
/**
A SourceOperator is a Node in the query plan that reads data from a table.
*/
#[derive(Clone, Debug)]
pub enum SourceOperator {
    // Join operator
    // This operator is used to join two source operators.
    // It takes a left and right source operator, a list of predicates to evaluate,
    // and a boolean indicating whether it is an outer join.
    Join {
        id: usize,
        left: Box<SourceOperator>,
        right: Box<SourceOperator>,
        predicates: Option<Vec<ast::Expr>>,
        outer: bool,
        using: Option<ast::DistinctNames>,
    },
    // Scan operator
    // This operator is used to scan a table.
    // It takes a table to scan and an optional list of predicates to evaluate.
    // The predicates are used to filter rows from the table.
    // e.g. SELECT * FROM t1 WHERE t1.foo = 5
    // iter_dir is used to indicate the direction of the iterator.
    // The use of Option for iter_dir implements a conservative optimization strategy: iter_dir is only
    // pushed down to Scan when it is currently None, to prevent potential result set errors caused by
    // multiple assignments. For more detailed discussion, please refer to https://github.com/penberg/limbo/pull/376
    Scan {
        id: usize,
        table_reference: TableReference,
        predicates: Option<Vec<ast::Expr>>,
        iter_dir: Option<IterationDirection>,
    },
    // Search operator
    // This operator is used to search for a row in a table using an index
    // (i.e. a primary key or a secondary index)
    Search {
        id: usize,
        table_reference: TableReference,
        search: Search,
        predicates: Option<Vec<ast::Expr>>,
    },
    Subquery {
        id: usize,
        table_reference: TableReference,
        plan: Box<SelectPlan>,
        predicates: Option<Vec<ast::Expr>>,
    },
    // Nothing operator
    // This operator is used to represent an empty query.
    // e.g. SELECT * FROM foo WHERE 0 will eventually be optimized to Nothing.
    Nothing {
        id: usize,
    },
}
```
- Logically, joins form a tree, but this is at least marginally bad for
performance because each `Join` has two boxed child operators; a 3-table
query, for example, has 3 `Scan` nodes plus 2 `Join` nodes scattered
across the heap.
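To make the shape concrete, here is a minimal stand-in for that boxed tree representation (the types are deliberately simplified illustrations, not the real planner enum):

```rust
// Minimal stand-in for the boxed join tree; this `SourceOperator` is a
// simplified illustration, not the real planner enum.
enum SourceOperator {
    Join {
        left: Box<SourceOperator>,
        right: Box<SourceOperator>,
    },
    Scan {
        table: &'static str,
    },
}

// Count every node in the tree by recursing through the boxed children.
fn count_nodes(op: &SourceOperator) -> usize {
    match op {
        SourceOperator::Join { left, right } => 1 + count_nodes(left) + count_nodes(right),
        SourceOperator::Scan { .. } => 1,
    }
}

fn main() {
    // SELECT * FROM t1, t2, t3 -> ((t1 JOIN t2) JOIN t3)
    let plan = SourceOperator::Join {
        left: Box::new(SourceOperator::Join {
            left: Box::new(SourceOperator::Scan { table: "t1" }),
            right: Box::new(SourceOperator::Scan { table: "t2" }),
        }),
        right: Box::new(SourceOperator::Scan { table: "t3" }),
    };
    // 3 scans + 2 joins = 5 separately boxed nodes for a 3-table query.
    assert_eq!(count_nodes(&plan), 5);
}
```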
- There are other bigger problems too, though, related to code
structure. We have been carrying around a separate vector of
`referenced_tables` that columns can refer to by index:
```rust
/// A query plan has a list of TableReference objects, each of which represents a table or subquery.
#[derive(Clone, Debug)]
pub struct TableReference {
    /// Table object, which contains metadata about the table, e.g. columns.
    pub table: Table,
    /// The name of the table as referred to in the query, either the literal name or an alias e.g. "users" or "u"
    pub table_identifier: String,
    /// The index of this reference in the list of TableReference objects in the query plan.
    /// The reference at index 0 is the first table in the FROM clause, the reference at index 1 is the second table in the FROM clause, etc.
    /// So, the index is relevant for determining when predicates (WHERE, ON filters etc.) should be evaluated.
    pub table_index: usize,
    /// The type of the table reference, either BTreeTable or Subquery
    pub reference_type: TableReferenceType,
}
```
- `referenced_tables` is used because SQLite executes joins as nested
loops, one loop per table (i.e. `n^tables_len` work in the worst case),
and we need to figure out during which loop to evaluate a given
condition expression. A lot of plumbing in the current code exists for
this, e.g. "pushing predicates" in `optimizer`, even though "predicate
pushdown" as a query planner concept is an _optimization_; in our
current system the "pushdown" is really a _necessity_ to move the
condition expressions into the correct `SourceOperator::predicates` vector
so that they are evaluated at the right point.
- `referenced_tables` is also used to map identifiers in the query to
the correct table; e.g. `foo` in `SELECT foo FROM users` becomes an
`ast::Expr::Column { table: 0, .. }` if `users` is the first table in
`referenced_tables`.
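A hedged sketch of what that index-based resolution amounts to (the struct and function below are illustrative stand-ins, not the actual binder code):

```rust
// Simplified stand-in for a table reference; the real struct carries much
// more metadata. Used only to illustrate index-based identifier resolution.
struct TableReference {
    identifier: String,
    columns: Vec<String>,
}

/// Resolve an (optionally table-qualified) column name to
/// (table_index, column_index) by scanning the references in order.
fn resolve_column(
    refs: &[TableReference],
    table: Option<&str>,
    column: &str,
) -> Option<(usize, usize)> {
    for (t_idx, r) in refs.iter().enumerate() {
        if let Some(t) = table {
            if t != r.identifier {
                continue;
            }
        }
        if let Some(c_idx) = r.columns.iter().position(|c| c == column) {
            return Some((t_idx, c_idx));
        }
    }
    None
}

fn main() {
    let refs = vec![
        TableReference {
            identifier: "u".to_string(),
            columns: vec!["id".into(), "foo".into()],
        },
        TableReference {
            identifier: "p".to_string(),
            columns: vec!["id".into(), "name".into()],
        },
    ];
    // `SELECT foo FROM ...` -> Column { table: 0, column: 1 }
    assert_eq!(resolve_column(&refs, None, "foo"), Some((0, 1)));
    // Qualified `p.id` resolves against the second reference.
    assert_eq!(resolve_column(&refs, Some("p"), "id"), Some((1, 0)));
}
```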
In addition to this, we ALSO have a separate `TableReferenceType` for
checking whether the upper-level query is reading from a BTree table or
a Subquery.
```rust
/// The type of the table reference, either BTreeTable or Subquery
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum TableReferenceType {
    /// A BTreeTable is a table that is stored on disk in a B-tree index.
    BTreeTable,
    /// A subquery.
    Subquery {
        /// The index of the first register in the query plan that contains the result columns of the subquery.
        result_columns_start_reg: usize,
    },
}
```
...even though we already have a `SourceOperator::Subquery` that should
be able to encode this information, but doesn't, because it's a tree and
we refer to things by index in `referenced_tables`.
### Why this is especially stupid
`SourceOperator` and `TableReference` are basically just two
representations of the same thing, one in tree format and another in
vector format. `SourceOperator` even carries around its own copy of
`TableReference`, even though we ALSO have `referenced_tables:
Vec<TableReference>` 🤡
Note that I'm allowed to call the existing code stupid because I wrote
it.
## What we can do instead
Basically, we can just fold the concerns from `SourceOperator` into
`TableReference` and have a list of those in the query plan, one per
table, in loop order (outermost loop is 0, and so on).
Funnily enough, back when Limbo had very few features we similarly had a
`Vec` of `LoopInfo`s, obviously with a lot less information than
now, but for SQLite it's probably the right abstraction. :)
```rust
pub struct SelectPlan {
    /// List of table references in loop order, outermost first.
    pub table_references: Vec<TableReference>,
    // ...etc...
}
/// A table reference in the query plan.
/// For example, SELECT * FROM users u JOIN products p JOIN (SELECT * FROM users) sub
/// has three table references:
/// 1. operation=Scan, table=users, table_identifier=u, reference_type=BTreeTable, join_info=None
/// 2. operation=Scan, table=products, table_identifier=p, reference_type=BTreeTable, join_info=Some(JoinInfo { outer: false, using: None }),
/// 3. operation=Subquery, table=users, table_identifier=sub, reference_type=Subquery, join_info=None
#[derive(Debug, Clone)]
pub struct TableReference {
    /// The operation that this table reference performs.
    pub op: Operation,
    /// Table object, which contains metadata about the table, e.g. columns.
    pub table: Table,
    /// The name of the table as referred to in the query, either the literal name or an alias e.g. "users" or "u"
    pub identifier: String,
    /// The join info for this table reference if it is the right side of a join (which all except the first table reference have)
    pub join_info: Option<JoinInfo>,
}
```
And we keep the "operation" part from the "operator", but in a simple
form:
```rust
pub enum Operation {
    Scan {
        iter_dir: Option<IterationDirection>,
    },
    Search(Search),
    Subquery {
        plan: Box<SelectPlan>,
        result_columns_start_reg: usize,
    },
}
```
Now we don't need to carry around both the operator tree and
`Vec<TableReference>`, because they are the same thing. If something
refers to the n'th table, that is just `plan.table_references[n]`.
We also don't need to recurse through the operator tree and usually we
can just loop from outermost table to innermost table.
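As a sketch of what that buys us, walking the join becomes a plain loop over the flat vector (names below are simplified stand-ins, not the actual planner types):

```rust
// Simplified stand-ins for the planner types, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    Scan,
}

#[derive(Debug, Clone)]
struct TableReference {
    op: Operation,
    identifier: String,
}

/// Describe each loop of the nested-loop join, outermost (index 0) first.
/// No tree recursion needed: the vector is already in loop order.
fn describe_loops(table_references: &[TableReference]) -> Vec<String> {
    table_references
        .iter()
        .enumerate()
        .map(|(depth, t)| format!("depth {}: {:?} on {}", depth, t.op, t.identifier))
        .collect()
}

fn main() {
    // SELECT * FROM users u JOIN products p
    let refs = vec![
        TableReference { op: Operation::Scan, identifier: "u".to_string() },
        TableReference { op: Operation::Scan, identifier: "p".to_string() },
    ];
    // Referring to "the n'th table" is just an index, no tree walk needed:
    assert_eq!(refs[1].identifier, "p");
    for line in describe_loops(&refs) {
        println!("{}", line);
    }
}
```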
---
### Handling the "where to evaluate a condition expression" problem
You can see I've also removed the `predicates` vector from `Scan` and
friends. Previously each `SourceOperator` had a vector of `predicates`
so that we knew at which loop depth to evaluate a condition. Now we
align more with what SQLite does -- it puts all the conditions, even the
join conditions, in the `WHERE` clause and adds extra metadata to them:
```rust
/// In a query plan, WHERE clause conditions and JOIN conditions are all folded into a vector of JoinAwareConditionExpr.
/// This is done so that we can evaluate the conditions at the correct loop depth.
/// We also need to keep track of whether the condition came from an OUTER JOIN. Take this example:
/// SELECT * FROM users u LEFT JOIN products p ON u.id = 5.
/// Even though the condition only refers to 'u', we CANNOT evaluate it at the users loop, because we need to emit NULL
/// values for the columns of 'p', for EVERY row in 'u', instead of completely skipping any rows in 'u' where the condition is false.
#[derive(Debug, Clone)]
pub struct JoinAwareConditionExpr {
    /// The original condition expression.
    pub expr: ast::Expr,
    /// Is this condition originally from an OUTER JOIN?
    /// If so, we need to evaluate it at the loop of the right table in that JOIN,
    /// regardless of which tables it references.
    /// We also cannot e.g. short circuit the entire query in the optimizer if the condition is statically false.
    pub from_outer_join: bool,
    /// The loop index where to evaluate the condition.
    /// For example, in `SELECT * FROM u JOIN p WHERE u.id = 5`, the condition can already be evaluated at the first loop (idx 0),
    /// because that is the rightmost table that it references.
    pub eval_at_loop: usize,
}
```
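One way `eval_at_loop` could be derived, sketched under the assumptions above (this function is illustrative, not the actual planner code): a condition can run once the innermost table it references is in scope, unless it comes from an OUTER JOIN, in which case it is pinned to the right table's loop.

```rust
/// Illustrative sketch: compute the loop index at which a condition can be
/// evaluated, given the indexes of the tables it references and, for OUTER
/// JOIN conditions, the index of the join's right table.
fn eval_at_loop(
    referenced_table_indexes: &[usize],
    outer_join_right_table: Option<usize>,
) -> usize {
    if let Some(right) = outer_join_right_table {
        // OUTER JOIN conditions must run at the right table's loop,
        // regardless of which tables they reference.
        return right;
    }
    // Otherwise: the deepest (highest-index) loop among the referenced
    // tables; a constant condition (no references) can run at loop 0.
    referenced_table_indexes.iter().copied().max().unwrap_or(0)
}

fn main() {
    // SELECT * FROM u JOIN p WHERE u.id = 5 -> only table 0 referenced.
    assert_eq!(eval_at_loop(&[0], None), 0);
    // A condition referencing both u (0) and p (1) waits for the inner loop.
    assert_eq!(eval_at_loop(&[0, 1], None), 1);
    // LEFT JOIN ... ON u.id = 5: evaluated at p's loop (index 1) anyway.
    assert_eq!(eval_at_loop(&[0], Some(1)), 1);
}
```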
### Final notes
I've been wanting to make this refactor for a long time now, but the
last straw was when I was making a PR trying to reduce some of the
massive amount of allocations happening in the read path currently, and
I got stuck because of this Operator + referenced_tables shit getting in
the way constantly. So I decided I wanted to get it over with now.
The PR is again very big, but I've artificially split it up into commits
that don't individually compile but at least separate the changes a bit
for you, dear reader.
Closes #853
# Limbo

Limbo is a work-in-progress, in-process OLTP database management system, compatible with SQLite.

## Features

Limbo is an in-process OLTP database engine library that has:

- Asynchronous I/O support on Linux with io_uring
- SQLite compatibility [doc] for SQL dialect, file formats, and the C API
- Language bindings for JavaScript/WebAssembly, Rust, Go, Python, and Java
- OS support for Linux, macOS, and Windows
## Getting Started

### 💻 Command Line

You can install the latest limbo release with:

```shell
curl --proto '=https' --tlsv1.2 -LsSf \
  https://github.com/tursodatabase/limbo/releases/latest/download/limbo-installer.sh | sh
```

Then launch the shell to execute SQL statements:

```
Limbo
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database
limbo> CREATE TABLE users (id INT PRIMARY KEY, username TEXT);
limbo> INSERT INTO users VALUES (1, 'alice');
limbo> INSERT INTO users VALUES (2, 'bob');
limbo> SELECT * FROM users;
1|alice
2|bob
```

You can also build and run the latest development version with:

```shell
cargo run
```
### ✨ JavaScript (wip)

Installation:

```shell
npm i limbo-wasm
```

Example usage:

```js
import { Database } from 'limbo-wasm';

const db = new Database('sqlite.db');
const stmt = db.prepare('SELECT * FROM users');
const users = stmt.all();
console.log(users);
```
### 🐍 Python (wip)

Installation:

```shell
pip install pylimbo
```

Example usage:

```python
import limbo

con = limbo.connect("sqlite.db")
cur = con.cursor()
res = cur.execute("SELECT * FROM users")
print(res.fetchone())
```
## Contributing

We'd love to have you contribute to Limbo! Please check out the contribution guide to get started.

## FAQ

### How is Limbo different from libSQL?

Limbo is a research project to build a SQLite compatible in-process database in Rust with native async support. The libSQL project, on the other hand, is an open source, open contribution fork of SQLite, with focus on production features such as replication, backups, encryption, and so on. There is no hard dependency between the two projects. Of course, if Limbo becomes widely successful, we might consider merging with libSQL, but that is something that will be decided in the future.
## Publications

- Pekka Enberg, Sasu Tarkoma, Jon Crowcroft, and Ashwin Rao (2024). Serverless Runtime / Database Co-Design With Asynchronous I/O. In EdgeSys '24. [PDF]
- Pekka Enberg, Sasu Tarkoma, and Ashwin Rao (2023). Towards Database and Serverless Runtime Co-Design. In CoNEXT-SW '23. [PDF] [Slides]

## License

This project is licensed under the MIT license.

## Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in Limbo by you, shall be licensed as MIT, without any additional terms or conditions.
