Add more documentation to OPTIMIZER.MD

This commit is contained in:
Jussi Saurio
2025-05-12 12:06:41 +03:00
parent fe628e221a
commit 12a2c2b9ad

@@ -31,7 +31,7 @@ Query optimization is obviously an important part of any SQL-based database engi
2. Do as little CPU work as possible.
3. Retain query correctness.
**The most important ways to achieve no. 1 and no. 2 are:**
1. Choose the optimal access method for each table (e.g. an index seek or a rowid-based seek, or a full table scan if all else fails).
2. Choose the best or near-best way to reorder the tables in the query so that those optimal access methods can be used.
@@ -40,16 +40,39 @@ Query optimization is obviously an important part of any SQL-based database engi
## Limbo's optimizer
Limbo's optimizer is an implementation of an extremely traditional [IBM System R](https://www.cs.cmu.edu/~15721-f24/slides/02-Selinger-SystemR-opt.pdf) style optimizer,
i.e. straight from the 70s! The DP algorithm is explained below.
### Current high level flow of the optimizer
1. **SQL rewriting**
- Rewrite certain SQL expressions to another form (not a lot currently; e.g. rewrite BETWEEN as two comparisons)
- Eliminate constant conditions: e.g. `WHERE 1` is removed, `WHERE 0` short-circuits the whole query because it is trivially false.
2. **Check whether there is an "interesting order"** that we should consider when evaluating indexes and join orders
- Is there a GROUP BY? An ORDER BY? Both?
3. **Convert WHERE clause conjuncts to Constraints**
- E.g. in `WHERE t.x = 5`, the expression `5` _constrains_ table `t` to values of `x` that are exactly `5`.
- E.g. in `WHERE t.x = u.x`, the expression `u.x` constrains `t`, AND `t.x` constrains `u`.
- Per table, each constraint has an estimated _selectivity_ (how much it filters the result set); this affects join order calculations (see the section on _Estimation_ below).
- Per table, constraints are also analyzed for whether one or multiple of them can be used as an index seek key to avoid a full scan.
4. **Compute the best join order using a dynamic programming algorithm** (a simplified sketch follows this list):
- `n` = number of tables considered
- `n=1`: find the lowest _cost_ way to access each single table, given the constraints of the query. Memoize the result.
- `n=2`: for each table found in the `n=1` step, find the best way to join that table with each other table. Memoize the result.
- `n=3`: for each 2-table subset found, find the best way to join that result to each other table. Memoize the result.
- `n=m`: for each `m-1` table subset found, find the best way to join that result to the `m`'th table. Memoize the result.
- **Use pruning to reduce search space:**
- Compute the plan for the literal join order written in the query first, and store its _cost_ as an upper threshold
- If at any point a considered join order exceeds the upper threshold, discard that search path since it cannot be better than the current best.
- For example, say we have `SELECT * FROM a JOIN b JOIN c JOIN d`. Compute `JOIN(a,b,c,d)` first. If `JOIN(b,a)` is already worse than `JOIN(a,b,c,d)`, we don't even have to try `JOIN(b,a,c)`.
- Also keep track of the best plan per _subset_:
- If we find that `JOIN(b,a,c)` is better than any other permutation of the same tables, e.g. `JOIN(a,b,c)`, then we can discard _ALL_ of the other permutations for that subset. For example, we don't need to consider `JOIN(a,b,c,d)` because we know it's worse than `JOIN(b,a,c,d)`.
- This is possible due to the associativity and commutativity of INNER JOINs.
- Also keep track of the best _ordered plan_, i.e. one that provides the "interesting order" mentioned above.
- At the end, apply a sorting cost penalty to the best overall plan, since it does not produce rows in the desired order
- If it is now worse than the best ordered plan, then choose the ordered plan as the best plan for the query.
- This allows us to eliminate a sorting operation.
- If the best overall plan is still the best even with the sorting penalty, then keep it. A sorting operation is later applied to sort the rows according to the desired order.
5. **Mutate the plan's `join_order` and `Operation`s to match the computed best plan.**
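To make step 4's memoization and pruning concrete, here is a minimal Rust sketch. Everything in it is an illustrative assumption rather than Limbo's actual code: table statistics are reduced to a row count plus a single selectivity, the cost model is pure page I/O at an assumed 80 rows per page (matching the worked example below), and tracking of the best _ordered plan_ is omitted for brevity.
```
use std::collections::HashMap;

#[derive(Clone, Debug)]
struct TableStats {
    rows: f64,        // estimated input cardinality of the table
    selectivity: f64, // fraction of rows surviving this table's constraints
}

#[derive(Clone, Debug)]
struct Plan {
    cost: f64,              // cumulative cost of this (partial) join
    output_rows: f64,       // estimated output cardinality
    join_order: Vec<usize>, // table indexes, outermost first
}

fn page_io(rows: f64) -> f64 {
    rows / 80.0 // assumed rows-per-page constant
}

/// n=1: cheapest way to access a single table (this sketch only knows full scans).
fn single_table_plan(t: usize, stats: &[TableStats]) -> Plan {
    Plan {
        cost: page_io(stats[t].rows),
        output_rows: stats[t].rows * stats[t].selectivity,
        join_order: vec![t],
    }
}

/// n -> n+1: JOIN_COST = cost(lhs) + lhs.output_rows * PAGE_IO(rhs.rows).
fn extend(lhs: &Plan, t: usize, stats: &[TableStats]) -> Plan {
    let mut join_order = lhs.join_order.clone();
    join_order.push(t);
    Plan {
        cost: lhs.cost + lhs.output_rows * page_io(stats[t].rows),
        output_rows: lhs.output_rows * stats[t].rows * stats[t].selectivity,
        join_order,
    }
}

/// System R style dynamic programming over table subsets, keyed by bitmask.
fn best_join_order(stats: &[TableStats], upper_threshold: f64) -> Option<Plan> {
    // Keeping only the best permutation per subset is valid because
    // INNER JOIN is associative and commutative.
    let mut memo: HashMap<u64, Plan> = HashMap::new();
    for t in 0..stats.len() {
        memo.insert(1 << t, single_table_plan(t, stats));
    }
    for n in 2..=stats.len() {
        let subsets: Vec<u64> = memo
            .keys()
            .copied()
            .filter(|s| s.count_ones() as usize == n - 1)
            .collect();
        for subset in subsets {
            let lhs = memo[&subset].clone();
            for t in 0..stats.len() {
                if subset & (1 << t) != 0 {
                    continue; // table already part of this subset
                }
                let candidate = extend(&lhs, t, stats);
                // Prune: a partial plan already costlier than a known
                // complete plan (e.g. the literal query order) cannot win.
                if candidate.cost > upper_threshold {
                    continue;
                }
                let key = subset | (1 << t);
                if memo.get(&key).map_or(true, |best| candidate.cost < best.cost) {
                    memo.insert(key, candidate);
                }
            }
        }
    }
    memo.get(&((1u64 << stats.len()) - 1)).cloned()
}
```
Seeding `upper_threshold` with the cost of the plan for the literal query order enables the pruning described above; passing `f64::INFINITY` disables it.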
### Estimation of cost and cardinalities + a note on table statistics
@@ -64,12 +87,7 @@ Currently, in the absence of `ANALYZE`, `sqlite_stat1` etc. we assume the follow
From the above, we derive the following formula for estimating the cost of joining `t1` with `t2`:
```
JOIN_COST = PAGE_IO(t1.rows) + t1.rows * PAGE_IO(t2.rows)
```
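As a minimal sketch, the formula translates into a couple of small functions. The names and the fixed `rows_per_page` parameter are illustrative assumptions here, not Limbo's actual cost model:
```
/// Estimated page reads needed to scan `rows` rows, assuming a fixed
/// number of rows per page (a stand-in for the real PAGE_IO estimate).
fn page_io(rows: f64, rows_per_page: f64) -> f64 {
    rows / rows_per_page
}

/// JOIN_COST: scan the outer table once, then scan the inner table once
/// for every row the outer table emits after filtering.
fn join_cost(
    outer_input_rows: f64,
    outer_output_rows: f64,
    inner_input_rows: f64,
    rows_per_page: f64,
) -> f64 {
    page_io(outer_input_rows, rows_per_page)
        + outer_output_rows * page_io(inner_input_rows, rows_per_page)
}
```
Note the distinction between the outer table's _input_ rows (what must be scanned) and its _output_ rows (what drives the inner scans); the worked example below uses exactly this split.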
For example, let's take the query `SELECT * FROM t1 JOIN t2 USING(foo) WHERE t2.foo > 10`. Let's assume the following:
@@ -81,22 +99,22 @@ For example, let's take the query `SELECT * FROM t1 JOIN t2 USING(foo) WHERE t2.
The best access method for both is a full table scan. The output cardinality of `t1` is the full table, because nothing is filtering it. Hence, the cost of `t1 JOIN t2` becomes:
```
JOIN_COST = PAGE_IO(t1.input_rows) + t1.output_rows * PAGE_IO(t2.input_rows)
// plugging in the values:
JOIN_COST = PAGE_IO(6400) + 6400 * PAGE_IO(8000)
JOIN_COST = 80 + 6400 * 100 = 640080
```
Now let's consider `t2 JOIN t1`. The best access method for both is still a full scan, but since we can filter on `t2.foo > 10`, its output cardinality decreases. Let's assume only 1/4 of the rows of `t2` match the condition `t2.foo > 10`. Hence, the cost of `t2 JOIN t1` becomes:
```
JOIN_COST = PAGE_IO(t2.input_rows) + t2.output_rows * PAGE_IO(t1.input_rows)
// plugging in the values:
JOIN_COST = PAGE_IO(8000) + 1/4 * 8000 * PAGE_IO(6400)
JOIN_COST = 100 + 2000 * 80 = 160100
```
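Since `160100 < 640080`, the optimizer picks `t2 JOIN t1`: filtering the outer table early means far fewer scans of the inner table. Plugging the example's numbers into the hypothetical `join_cost` helper sketched earlier (the figures imply 80 rows per page, so `PAGE_IO(6400) = 80` and `PAGE_IO(8000) = 100`) reproduces both costs:
```
fn main() {
    // t1 JOIN t2: t1 is unfiltered, so all 6400 rows drive the inner scans.
    assert_eq!(join_cost(6400.0, 6400.0, 8000.0, 80.0), 640_080.0);
    // t2 JOIN t1: only 1/4 of t2's 8000 rows pass `t2.foo > 10`.
    assert_eq!(join_cost(8000.0, 2000.0, 6400.0, 80.0), 160_100.0);
}
```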