Commit Graph

4430 Commits

Author SHA1 Message Date
pedrocarlo
002acbb9dc add check for unique contraint in auto index creation 2025-05-14 11:34:11 -03:00
Pekka Enberg
275fd29057 Add clippy to rust-toolchain.toml... 2025-05-14 14:22:02 +03:00
Jussi Saurio
88501fc76f Merge 'Restructure optimizer to support join reordering' from Jussi Saurio
Closes #1283
In lieu of a better description, I will just copypaste the contents of
the new `OPTIMIZER.MD` here:
# Overview of the current state of the query optimizer in Limbo
Query optimization is obviously an important part of any SQL-based
database engine. This document is an overview of what we currently do.
## Structure of the optimizer directory
1. `mod.rs`
   - Provides the high-level optimization interface through
`optimize_plan()`
2. `access_method.rs`
   - Determines what is the best index to use when joining a table to a
set of preceding tables
3. `constraints.rs` - Manages query constraints:
   - Extracts constraints from the WHERE clause
   - Determines which constraints are usable with indexes
4. `cost.rs`
   - Calculates the cost of doing a seek vs a scan, for example
5. `join.rs`
   - Implements the System R style dynamic programming join ordering
algorithm
6. `order.rs`
   - Determines if sort operations can be eliminated based on the chosen
access methods and join order
## Join reordering and optimal index selection
**The goals of query optimization are at least the following:**
1. Do as little page I/O as possible
2. Do as little CPU work as possible
3. Retain query correctness.
**The most important ways to achieve no. 1 and no. 2 are:**
1. Choose the optimal access method for each table (e.g. an index or a
rowid-based seek, or a full table scan if all else fails).
2. Choose the best or near-best way to reorder the tables in the query
so that those optimal access methods can be used.
3. Also factor in whether the chosen join order and indexes allow
removal of any sort operations that are necessary for query correctness.
## Limbo's optimizer
Limbo's optimizer is an implementation of an extremely traditional
[IBM System R](https://www.cs.cmu.edu/~15721-f24/slides/02-Selinger-SystemR-opt.pdf)
style optimizer, i.e. straight from the 70s! The DP algorithm is explained below.
### Current high level flow of the optimizer
1. **SQL rewriting**
  - Rewrite certain SQL expressions to another form (not a lot
currently; e.g. rewrite BETWEEN as two comparisons)
  - Eliminate constant conditions: e.g. `WHERE 1` is removed, `WHERE 0`
short-circuits the whole query because it is trivially false.
2. **Check whether there is an "interesting order"** that we should
consider when evaluating indexes and join orders
    - Is there a GROUP BY? an ORDER BY? Both?
3. **Convert WHERE clause conjuncts to Constraints**
    - E.g. in `WHERE t.x = 5`, the expression `5` _constrains_ table
`t` to values of `x` that are exactly `5`.
    - E.g. in `WHERE t.x = u.x`, the expression `u.x` constrains `t`,
AND `t.x` constrains `u`.
    - Per table, each constraint has an estimated _selectivity_ (how
much it filters the result set); this affects join order calculations,
see the paragraph on _Estimation_  below.
    - Per table, constraints are also analyzed for whether one or
multiple of them can be used as an index seek key to avoid a full scan.
4. **Compute the best join order using a dynamic programming
algorithm:**
  - `n` = number of tables considered
  - `n=1`: find the lowest _cost_ way to access each single table, given
the constraints of the query. Memoize the result.
  - `n=2`: for each table found in the `n=1` step, find the best way to
join that table with each other table. Memoize the result.
  - `n=3`: for each 2-table subset found, find the best way to join that
result to each other table. Memoize the result.
  - `n=m`: for each `m-1` table subset found, find the best way to join
that result to the `m'th` table
  - **Use pruning to reduce search space:**
    - Compute the literal query order first, and store its _cost_  as an
upper threshold
    - If at any point a considered join order exceeds the upper
threshold, discard that search path since it cannot be better than the
current best.
      - For example, we have `SELECT * FROM a JOIN b JOIN c JOIN d`.
Compute `JOIN(a,b,c,d)` first. If `JOIN(b,a)` is already worse than
`JOIN(a,b,c,d)`, we don't have to even try `JOIN(b,a,c)`.
    - Also keep track of the best plan per _subset_:
      - If we find that `JOIN(b,a,c)` is better than any other
permutation of the same tables, e.g. `JOIN(a,b,c)`, then we can discard
_ALL_ of the other permutations for that subset. For example, we don't
need to consider `JOIN(a,b,c,d)` because we know it's worse than
`JOIN(b,a,c,d)`.
      - This is possible due to the associativity and commutativity of
INNER JOINs.
  - Also keep track of the best _ordered plan_ , i.e. one that provides
the "interesting order" mentioned above.
  - At the end, apply a sorting cost penalty to the best overall plan if
it does not already provide the interesting order
    - If it is now worse than the best sorted plan, then choose the
sorted plan as the best plan for the query.
      - This allows us to eliminate a sorting operation.
    - If the best overall plan is still best even with the sorting
penalty, then keep it. A sorting operation is later applied to sort the
rows according to the desired order.
5. **Mutate the plan's `join_order` and `Operation`s to match the
computed best plan.**
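The dynamic programming step above can be sketched roughly as follows. This is a minimal illustration and not Limbo's actual code: `Table`, `Plan`, `extend` and `best_join_order` are hypothetical names, every table is accessed with a full scan, subsets are keyed by a bitmask, and the "interesting order" tracking and sort penalty are left out. The cost model is the simple page I/O formula described in the estimation section, assuming 50 rows per page.

```rust
use std::collections::HashMap;

/// Hypothetical per-table estimates (not Limbo's real types).
#[derive(Clone, Copy)]
struct Table {
    rows: f64,        // estimated rows stored in the table
    selectivity: f64, // fraction of rows that survive the table's own filters
}

#[derive(Clone, Debug)]
struct Plan {
    order: Vec<usize>, // join order, as indexes into the `tables` slice
    cost: f64,         // accumulated estimated page I/O
    out_rows: f64,     // estimated output cardinality so far
}

const ROWS_PER_PAGE: f64 = 50.0; // assumed constant

fn page_io(rows: f64) -> f64 {
    (rows / ROWS_PER_PAGE).ceil()
}

/// Join `lhs` (or nothing, for the first table) with table `idx` as the new right-hand side.
fn extend(lhs: Option<&Plan>, idx: usize, t: Table) -> Plan {
    match lhs {
        None => Plan {
            order: vec![idx],
            cost: page_io(t.rows),
            out_rows: t.rows * t.selectivity,
        },
        Some(p) => {
            let mut order = p.order.clone();
            order.push(idx);
            Plan {
                order,
                // Each row produced so far triggers one scan of the new table.
                cost: p.cost + p.out_rows * page_io(t.rows),
                out_rows: p.out_rows * t.rows * t.selectivity,
            }
        }
    }
}

fn best_join_order(tables: &[Table]) -> Plan {
    let n = tables.len();
    assert!(n >= 1 && n < 32, "sketch supports 1..=31 tables");

    // Cost of the literal (as-written) order is the pruning threshold.
    let mut literal: Option<Plan> = None;
    for (i, t) in tables.iter().enumerate() {
        literal = Some(extend(literal.as_ref(), i, *t));
    }
    let literal = literal.unwrap();
    let threshold = literal.cost;

    // memo: bitmask of joined tables -> best plan found for that subset.
    let mut memo: HashMap<u32, Plan> = HashMap::new();
    for (i, t) in tables.iter().enumerate() {
        memo.insert(1 << i, extend(None, i, *t));
    }
    for size in 2..=n {
        // Snapshot the best plans for all subsets of the previous size.
        let prev: Vec<(u32, Plan)> = memo
            .iter()
            .filter(|(mask, _)| mask.count_ones() as usize == size - 1)
            .map(|(mask, plan)| (*mask, plan.clone()))
            .collect();
        for (mask, lhs) in prev {
            for (i, t) in tables.iter().enumerate() {
                if mask & (1 << i) != 0 {
                    continue; // table already part of this subset
                }
                let cand = extend(Some(&lhs), i, *t);
                if cand.cost > threshold {
                    continue; // cannot beat the literal order; prune
                }
                let key = mask | (1 << i);
                let better = memo.get(&key).map_or(true, |best| cand.cost < best.cost);
                if better {
                    memo.insert(key, cand);
                }
            }
        }
    }
    let full = (1u32 << n) - 1;
    memo.remove(&full).unwrap_or(literal)
}
```

Keeping only one plan per subset is what keeps the search tractable: the number of subsets grows as `2^n`, whereas the number of permutations grows as `n!`.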
### Estimation of cost and cardinalities + a note on table statistics
Currently, in the absence of `ANALYZE`, `sqlite_stat1` etc. we assume
the following:
1. Each table has `1,000,000` rows.
2. Each equality (`=`) filter will filter out some percentage of the
result set.
3. Each nonequality (e.g. `>`) will filter out some smaller percentage
of the result set.
4. Each `4096` byte database page holds `50` rows, i.e. roughly `80`
bytes per row
5. Sort operations have some CPU cost dependent on the number of input
rows to the sort operation.
From the above, we derive the following formula for estimating the cost
of joining `t1` with `t2`
```
JOIN_COST = PAGE_IO(t1.rows) + t1.rows * PAGE_IO(t2.rows)
```
For example, let's take the query `SELECT * FROM t1 JOIN t2 USING(foo)
WHERE t2.foo > 10`. Let's assume the following:
- `t1` has `6400` rows and `t2` has `8000` rows
- there are no indexes at all
- let's ignore the CPU cost from the equation for simplicity.
The best access method for both is a full table scan. The output
cardinality of `t1` is the full table, because nothing is filtering it.
At `50` rows per page, scanning `t1` reads `6400 / 50 = 128` pages and
scanning `t2` reads `8000 / 50 = 160` pages.
Hence, the cost of `t1 JOIN t2` becomes:
```
JOIN_COST = PAGE_IO(t1.input_rows) + t1.output_rows * PAGE_IO(t2.input_rows)

// plugging in the values:

JOIN_COST = PAGE_IO(6400) + 6400 * PAGE_IO(8000)
JOIN_COST = 128 + 6400 * 160 = 1024128
```
Now let's consider `t2 JOIN t1`. The best access method for both is
still a full scan, but since we can filter on `t2.foo > 10`, its output
cardinality decreases. Let's assume only 1/4 of the rows of `t2` match
the condition `t2.foo > 10`. Hence, the cost of `t2 JOIN t1` becomes:
```
JOIN_COST = PAGE_IO(t2.input_rows) + t2.output_rows * PAGE_IO(t1.input_rows)

// plugging in the values:

JOIN_COST = PAGE_IO(8000) + 1/4 * 8000 * PAGE_IO(6400)
JOIN_COST = 160 + 2000 * 128 = 256160
```
Even though `t2` is a larger table, filtering it first reduces the
number of rows fed into the join, which makes this order roughly four
times cheaper.
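The formula can also be written as a small function, which makes it easy to compare both join orders. This is only a sketch: `page_io` and `join_cost` are hypothetical helper names, `ROWS_PER_PAGE` is an adjustable constant set here to the 50 rows per page listed in the assumptions, and CPU cost is ignored as in the example.

```rust
/// Assumed number of rows stored on one database page.
const ROWS_PER_PAGE: u64 = 50;

/// Pages that must be read to scan `rows` rows (rounded up).
fn page_io(rows: u64) -> u64 {
    (rows + ROWS_PER_PAGE - 1) / ROWS_PER_PAGE
}

/// JOIN_COST = PAGE_IO(lhs.input_rows) + lhs.output_rows * PAGE_IO(rhs.input_rows)
fn join_cost(lhs_input_rows: u64, lhs_output_rows: u64, rhs_input_rows: u64) -> u64 {
    page_io(lhs_input_rows) + lhs_output_rows * page_io(rhs_input_rows)
}
```

Calling `join_cost(6400, 6400, 8000)` models `t1 JOIN t2`, and `join_cost(8000, 2000, 6400)` models `t2 JOIN t1` with the filter applied to `t2`; the second call returns the smaller value.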
#### Statistics
Since we don't support `ANALYZE`, nor can we assume that users will call
`ANALYZE` anyway, we use simple magic constants to estimate the
selectivity of join predicates, row count of tables, and so on. When we
have support for `ANALYZE`, we should plug the statistics from
`sqlite_stat1` and friends into the optimizer to make more informed
decisions.
### Estimating the output cardinality of a join
The output cardinality (output row count) of a join is estimated as
follows, where `OUTPUT_CARDINALITY_RHS` is the number of right-hand-side
rows matched per input row:
```
OUTPUT_CARDINALITY_JOIN = INPUT_CARDINALITY_RHS * OUTPUT_CARDINALITY_RHS

where

INPUT_CARDINALITY_RHS = OUTPUT_CARDINALITY_LHS
```
example:
```
SELECT * FROM products p JOIN order_lines o ON p.id = o.product_id
```
Assuming there are 100 products, i.e. just selecting all products would
yield 100 rows:
```
OUTPUT_CARDINALITY_LHS = 100
INPUT_CARDINALITY_RHS = 100
```
Assuming `p.id = o.product_id` will return three order lines for each product:
```
OUTPUT_CARDINALITY_RHS = 3

OUTPUT_CARDINALITY_JOIN = 100 * 3 = 300
```
i.e. the join is estimated to return 300 rows, 3 for each product.
Again, in the absence of statistics, we use magic constants to estimate
these cardinalities.
Estimating them is important because in multi-way joins the output
cardinality of the previous join becomes the input cardinality of the
next one.
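
To illustrate how the estimate propagates through a multi-way join, here is a minimal sketch. The function names are hypothetical, and the per-row fan-out values stand in for the magic constants (or future statistics) mentioned above.

```rust
/// OUTPUT_CARDINALITY_JOIN = INPUT_CARDINALITY_RHS * OUTPUT_CARDINALITY_RHS,
/// where the input cardinality of the right-hand side is the output of the left-hand side.
fn join_output_cardinality(lhs_output: f64, rhs_rows_per_input_row: f64) -> f64 {
    lhs_output * rhs_rows_per_input_row
}

/// Chain several joins: the output of each join is the input of the next one.
fn chain_cardinality(first_table_output: f64, fanouts: &[f64]) -> f64 {
    fanouts
        .iter()
        .fold(first_table_output, |acc, fanout| join_output_cardinality(acc, *fanout))
}
```

With 100 products and a fan-out of 3 order lines per product, `chain_cardinality(100.0, &[3.0])` reproduces the 300-row estimate from the example; appending further fan-out values models additional joined tables.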

Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>

Closes #1462
2025-05-14 14:02:33 +03:00
Pekka Enberg
1e48190bdc Merge 'Add rustfmt to rust-toolchain.toml' from Pekka Enberg
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #1479
2025-05-14 13:03:58 +03:00
Pekka Enberg
6365286a9e Fix CHANGELOG 2025-05-14 11:30:44 +03:00
Pekka Enberg
5739cd5f46 Add rustfmt to rust-toolchain.toml 2025-05-14 11:29:12 +03:00
Pekka Enberg
31ebbb190a Limbo 0.0.20 2025-05-14 09:49:05 +03:00
Jussi Saurio
176d9bd3c7 Prune bad plans earlier to avoid allocating useless JoinN structs 2025-05-14 09:42:26 +03:00
Jussi Saurio
eb983c88c6 reserve capacity for memo hashmap entries 2025-05-14 09:42:26 +03:00
Jussi Saurio
5e5788bdfe Reduce allocations 2025-05-14 09:42:26 +03:00
Jussi Saurio
d2fa91e984 avoid growing vec 2025-05-14 09:42:26 +03:00
Jussi Saurio
625cf005fd Add some utilities to constraint related structs 2025-05-14 09:42:26 +03:00
Jussi Saurio
71ab3d57d8 constraints.rs: more comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
5386859b44 as_binary-components: simplify 2025-05-14 09:42:26 +03:00
Jussi Saurio
1d465e6d94 Remove unnecessary method 2025-05-14 09:42:26 +03:00
Jussi Saurio
9d50446ffb AccessMethod: simplify - get rid of AccessMethodKind as it can be derived 2025-05-14 09:42:26 +03:00
Jussi Saurio
12a2c2b9ad Add more documentation to OPTIMIZER.MD 2025-05-14 09:42:26 +03:00
Jussi Saurio
fe628e221a plan_satisfies_order_target(): simplify 2025-05-14 09:42:26 +03:00
Jussi Saurio
4dde356d97 AccessMethod: simplify 2025-05-14 09:42:26 +03:00
Jussi Saurio
a90358f669 TableMask: comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
f12eb25962 cost.rs: simplify cost estimation 2025-05-14 09:42:26 +03:00
Jussi Saurio
4f07c808b2 Fix bug with constraint ordering introduced by refactor 2025-05-14 09:42:26 +03:00
Jussi Saurio
52b28d3099 rename use_indexes to optimize_table_access 2025-05-14 09:42:26 +03:00
Jussi Saurio
d8218483a2 use_indexes: comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
e53ab385d7 order.rs: comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
ff8e187eda find_best_access_method_for_join_order: comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
3442e4981d remove some unnecessary parameters 2025-05-14 09:42:26 +03:00
Jussi Saurio
c18bb3cd14 rename 2025-05-14 09:42:26 +03:00
Jussi Saurio
15b32f7e57 constraints.rs: more comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
c782616180 Refactor constraints so that WHERE clause is not needed in join reordering phase 2025-05-14 09:42:26 +03:00
Jussi Saurio
6aa5b01a7b Add note about optimizer directory structure 2025-05-14 09:42:26 +03:00
Jussi Saurio
bd875e3876 optimizer module split 2025-05-14 09:42:26 +03:00
Jussi Saurio
ec45a92bac move optimizer to its own directory 2025-05-14 09:42:26 +03:00
Jussi Saurio
c639a43676 fix parenthesized column edge case 2025-05-14 09:42:26 +03:00
Jussi Saurio
90de8791f5 comments 2025-05-14 09:42:26 +03:00
Jussi Saurio
c8f5bd3f4f rename 2025-05-14 09:42:26 +03:00
Jussi Saurio
630a6093aa refactor join_lhs_tables_to_rhs_table 2025-05-14 09:42:26 +03:00
Jussi Saurio
62d2ee8eb6 rename 2025-05-14 09:42:26 +03:00
Jussi Saurio
5f9ebe26a0 as_binary_components() helper 2025-05-14 09:42:26 +03:00
Jussi Saurio
a92d94270a Get rid of useless ScanCost struct 2025-05-14 09:42:26 +03:00
Jussi Saurio
de9e8442e8 fix ephemeral 2025-05-14 09:42:25 +03:00
Jussi Saurio
3b1aef4a9e Do Less Work (tm) - everything works except ephemeral 2025-05-14 09:42:01 +03:00
Jussi Saurio
87850e5706 simplify 2025-05-14 09:41:14 +03:00
Jussi Saurio
77f11ba004 simplify AccessMethodKind 2025-05-14 09:41:14 +03:00
Jussi Saurio
5f724d6b2e Add more comments to join ordering logic 2025-05-14 09:41:14 +03:00
Jussi Saurio
c02d3f8bcd Do groupby/orderby sort elimination based on optimizer decision 2025-05-14 09:41:13 +03:00
Jussi Saurio
1e46f1d9de Feature: join reordering optimizer 2025-05-14 09:40:48 +03:00
Jussi Saurio
c8c83fc6e6 OPTIMIZER.MD docs 2025-05-14 09:39:47 +03:00
Jussi Saurio
67a080bfa0 dont mutate where clause during individual index selection phase 2025-05-14 09:39:47 +03:00
Jussi Saurio
1b71f58bbf Merge 'Redesign parameter binding in query translator' from Preston Thorpe
closes #1467
## Example:
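Previously, as explained in #1449, our parameter binding wasn't working
properly because anonymous parameters were numbered in the order they
were translated rather than the order they appear in the statement, so
whichever parameter was translated first received index 1: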
```console
limbo> create table t (id integer primary key, name text, age integer);
limbo> explain select * from t where name = ? and id > ? and age between ? and ?;
addr  opcode             p1    p2    p3    p4             p5  comment
----  -----------------  ----  ----  ----  -------------  --  -------
0     Init               0     20    0                    0   Start at 20
1     OpenRead           0     2     0                    0   table=t, root=2
2     Variable           1     4     0                    0   r[4]=parameter(1) # always 1
3     IsNull             4     19    0                    0   if (r[4]==NULL) goto 19
4     SeekGT             0     19    4                    0   key=[4..4]
5       Column           0     1     5                    0   r[5]=t.name
6       Variable         2     6     0                    0   r[6]=parameter(2) # always 2
7       Ne               5     6     18                   0   if r[5]!=r[6] goto 18
8       Variable         3     7     0                    0   r[7]=parameter(3) # etc...
9       Column           0     2     8                    0   r[8]=t.age
10      Gt               7     8     18                   0   if r[7]>r[8] goto 18
11      Column           0     2     9                    0   r[9]=t.age
12      Variable         4     10    0                    0   r[10]=parameter(4)
13      Gt               9     10    18                   0   if r[9]>r[10] goto 18
14      RowId            0     1     0                    0   r[1]=t.rowid
15      Column           0     1     2                    0   r[2]=t.name
16      Column           0     2     3                    0   r[3]=t.age
17      ResultRow        1     3     0                    0   output=r[1..3]
18    Next               0     5     0                    0
19    Halt               0     0     0                    0
20    Transaction        0     0     0                    0   write=false
21    Goto               0     1     0                    0
```
## Solution:
`rewrite_expr` is currently used to transform `true|false` to `1|0`, so
it has been adapted to also transform anonymous `Expr::Variable`s into
named variables, inserting the appropriate parameter index by passing
in a counter.
```rust
        ast::Expr::Variable(var) => {
            if var.is_empty() {
                // rewrite anonymous variables only, ensure that the `param_idx` starts at 1 and
                // all the expressions are rewritten in the order they come in the statement
                *expr = ast::Expr::Variable(format!("{}{param_idx}", PARAM_PREFIX));
                *param_idx += 1;
            }
            Ok(())
        }
```
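The same idea can be shown on a toy AST. This is a self-contained sketch and not the real `ast::Expr` from the parser: the `Expr` enum, the recursive traversal, and the value of `PARAM_PREFIX` are all assumptions made for illustration.

```rust
/// Toy expression type (hypothetical; far smaller than the real AST).
#[derive(Debug, PartialEq)]
enum Expr {
    /// An empty name stands for an anonymous `?` parameter.
    Variable(String),
    Column(String),
    Binary(Box<Expr>, &'static str, Box<Expr>),
}

/// Assumed prefix for generated parameter names.
const PARAM_PREFIX: &str = "__param_";

/// Walk the expression left to right, naming anonymous parameters in statement order.
fn rewrite_expr(expr: &mut Expr, param_idx: &mut usize) {
    match expr {
        Expr::Variable(name) => {
            if name.is_empty() {
                *name = format!("{}{}", PARAM_PREFIX, param_idx);
                *param_idx += 1;
            }
        }
        Expr::Column(_) => {}
        Expr::Binary(lhs, _, rhs) => {
            rewrite_expr(lhs, param_idx);
            rewrite_expr(rhs, param_idx);
        }
    }
}
```

Because the rewrite runs over the statement before any index selection takes place, each parameter keeps its statement-order index even if the optimizer later translates `id > ?` before `name = ?`.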
## Corrected output (notice the seek):
```console
limbo> explain select * from t where name = ? and id > ? and age between ? and ?;
addr  opcode             p1    p2    p3    p4             p5  comment
----  -----------------  ----  ----  ----  -------------  --  -------
0     Init               0     20    0                    0   Start at 20
1     OpenRead           0     2     0                    0   table=t, root=2
2     Variable           2     4     0                    0   r[4]=parameter(2)
3     IsNull             4     19    0                    0   if (r[4]==NULL) goto 19
4     SeekGT             0     19    4                    0   key=[4..4]
5       Column           0     1     5                    0   r[5]=t.name
6       Variable         1     6     0                    0   r[6]=parameter(1)
7       Ne               5     6     18                   0   if r[5]!=r[6] goto 18
8       Variable         3     7     0                    0   r[7]=parameter(3)
9       Column           0     2     8                    0   r[8]=t.age
10      Gt               7     8     18                   0   if r[7]>r[8] goto 18
11      Column           0     2     9                    0   r[9]=t.age
12      Variable         4     10    0                    0   r[10]=parameter(4)
13      Gt               9     10    18                   0   if r[9]>r[10] goto 18
14      RowId            0     1     0                    0   r[1]=t.rowid
15      Column           0     1     2                    0   r[2]=t.name
16      Column           0     2     3                    0   r[3]=t.age
17      ResultRow        1     3     0                    0   output=r[1..3]
18    Next               0     5     0                    0
19    Halt               0     0     0                    0
20    Transaction        0     0     0                    0   write=false
21    Goto               0     1     0                    0
```
## And a `Delete`:
```console
limbo> explain delete from t where name = ? and age > ? and id > ?;
addr  opcode             p1    p2    p3    p4             p5  comment
----  -----------------  ----  ----  ----  -------------  --  -------
0     Init               0     15    0                    0   Start at 15
1     OpenWrite          0     2     0                    0
2     Variable           3     1     0                    0   r[1]=parameter(3)
3     IsNull             1     14    0                    0   if (r[1]==NULL) goto 14
4     SeekGT             0     14    1                    0   key=[1..1]
5       Column           0     1     2                    0   r[2]=t.name
6       Variable         1     3     0                    0   r[3]=parameter(1)
7       Ne               2     3     13                   0   if r[2]!=r[3] goto 13
8       Column           0     2     4                    0   r[4]=t.age
9       Variable         2     5     0                    0   r[5]=parameter(2)
10      Le               4     5     13                   0   if r[4]<=r[5] goto 13
11      RowId            0     6     0                    0   r[6]=t.rowid
12      Delete           0     0     0                    0
13    Next               0     5     0                    0
14    Halt               0     0     0                    0
15    Transaction        0     1     0                    0   write=true
16    Goto               0     1     0                    0
```

Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>

Closes #1475
2025-05-14 09:26:06 +03:00