(Yes, I changed the name of the repo.)
Also, switch back to 'cargo' for the parser, which is the original
upstream code. I created 'criterion' because I didn't realize that
cargo bench spits half of its output to stdout and the other half to stderr.
This PR attempts to implement Primary Keys and Indexes. It supports
Update for Primary Keys used as a RowId alias, Composite Primary Keys, and
Indexes. I tried to match as closely as possible how SQLite emits the
opcodes.
~Additionally, to support this I had to fix a bug in how we searched
for the next record in the `Next` opcode, by introducing a set of seen
row ids. The problem was that, to update a row, you need to delete the
previous record and then insert the new one. When we did that in a `Rewind`
loop, the current cell index in the cursor always pointed to the
wrong place, because we searched for the next record without checking
whether we had already seen it. However, I am not sure how this
affects the Btree.~
EDIT: After seeing how bad my fix was, I tried a different approach that
is more in line with what SQLite does. When performing a `Delete` in the
btree, we can save the current `rowid` (`TableBtree`) or the current
`record` (`IndexBtree`), and then restore the correct position later
in the `next` function by seeking to the saved context. I'm just not
knowledgeable enough yet to know when we can avoid saving the context
and doing the seek later.
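A minimal sketch of the idea, with hypothetical types and method names (the real cursor code differs; this only illustrates the save-on-delete / restore-on-next flow):

```rust
/// Hypothetical sketch only; not the actual Limbo cursor code.
enum SavedContext {
    /// Table btree: remember the rowid we were positioned on.
    TableRowId(i64),
    /// Index btree: remember the serialized index record (the key).
    IndexRecord(Vec<u8>),
}

struct Cursor {
    saved: Option<SavedContext>,
    // ... page stack, cell index, etc.
}

impl Cursor {
    fn delete_current(&mut self, is_index: bool) {
        // Deleting a cell may rebalance pages and invalidate the current
        // position, so remember where we were before removing it.
        self.saved = Some(if is_index {
            SavedContext::IndexRecord(self.current_record())
        } else {
            SavedContext::TableRowId(self.current_rowid())
        });
        // ... remove the cell and rebalance ...
    }

    fn next(&mut self) {
        // If a delete invalidated the position, first seek back to the
        // saved key, then advance to the following entry as usual.
        if let Some(ctx) = self.saved.take() {
            match ctx {
                SavedContext::TableRowId(rowid) => self.seek_rowid(rowid),
                SavedContext::IndexRecord(rec) => self.seek_record(&rec),
            }
        }
        // ... advance to the next cell ...
    }

    // Stubs standing in for the real btree operations.
    fn current_rowid(&self) -> i64 { unimplemented!() }
    fn current_record(&self) -> Vec<u8> { unimplemented!() }
    fn seek_rowid(&mut self, _rowid: i64) { unimplemented!() }
    fn seek_record(&mut self, _record: &[u8]) { unimplemented!() }
}
```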
Closes #1429
Closes #1283
In lieu of a better description, I will just copy-paste the contents of
the new `OPTIMIZER.MD` here:
# Overview of the current state of the query optimizer in Limbo
Query optimization is obviously an important part of any SQL-based
database engine. This document is an overview of what we currently do.
## Structure of the optimizer directory
1. `mod.rs`
- Provides the high-level optimization interface through
`optimize_plan()`
2. `access_method.rs`
- Determines the best index to use when joining a table to a
set of preceding tables
3. `constraints.rs` - Manages query constraints:
- Extracts constraints from the WHERE clause
- Determines which constraints are usable with indexes
4. `cost.rs`
- Calculates the cost of doing a seek vs a scan, for example
5. `join.rs`
- Implements the System R style dynamic programming join ordering
algorithm
6. `order.rs`
- Determines if sort operations can be eliminated based on the chosen
access methods and join order
## Join reordering and optimal index selection
**The goals of query optimization are at least the following:**
1. Do as little page I/O as possible
2. Do as little CPU work as possible
3. Retain query correctness.
**The most important ways to achieve no. 1 and no. 2 are:**
1. Choose the optimal access method for each table (e.g. an index or a
rowid-based seek, or a full table scan if all else fails).
2. Choose the best or near-best way to reorder the tables in the query
so that those optimal access methods can be used.
3. Also factor in whether the chosen join order and indexes allow
removal of any sort operations that would otherwise be necessary for query correctness.
## Limbo's optimizer
Limbo's optimizer is an implementation of an extremely traditional [IBM System R](https://www.cs.cmu.edu/~15721-f24/slides/02-Selinger-SystemR-opt.pdf) style optimizer,
i.e. straight from the 70s! The DP algorithm is explained below.
### Current high level flow of the optimizer
1. **SQL rewriting**
- Rewrite certain SQL expressions to another form (not a lot
currently; e.g. rewrite BETWEEN as two comparisons)
- Eliminate constant conditions: e.g. `WHERE 1` is removed, `WHERE 0`
short-circuits the whole query because it is trivially false.
2. **Check whether there is an "interesting order"** that we should
consider when evaluating indexes and join orders
- Is there a GROUP BY? an ORDER BY? Both?
3. **Convert WHERE clause conjuncts to Constraints**
- E.g. in `WHERE t.x = 5`, the expression `5` _constrains_ table
`t` to values of `x` that are exactly `5`.
- E.g. in `WHERE t.x = u.x`, the expression `u.x` constrains `t`,
AND `t.x` constrains `u`.
- Per table, each constraint has an estimated _selectivity_ (how
much it filters the result set); this affects join order calculations,
see the paragraph on _Estimation_ below.
- Per table, constraints are also analyzed for whether one or
multiple of them can be used as an index seek key to avoid a full scan.
4. **Compute the best join order using a dynamic programming
algorithm** (a condensed code sketch follows this list):
- `n` = number of tables considered
- `n=1`: find the lowest _cost_ way to access each single table, given
the constraints of the query. Memoize the result.
- `n=2`: for each table found in the `n=1` step, find the best way to
join that table with each other table. Memoize the result.
- `n=3`: for each 2-table subset found, find the best way to join that
result to each other table. Memoize the result.
- `n=m`: for each `(m-1)`-table subset found, find the best way to join
that result to the `m`-th table. Memoize the result.
- **Use pruning to reduce search space:**
- Compute the literal query order first, and store its _cost_ as an
upper threshold
- If at any point a considered join order exceeds the upper
threshold, discard that search path since it cannot be better than the
current best.
- For example, we have `SELECT * FROM a JOIN b JOIN c JOIN d`.
Compute `JOIN(a,b,c,d)` first. If `JOIN(b,a)` is already worse than
`JOIN(a,b,c,d)`, we don't even have to try `JOIN(b,a,c)`.
- Also keep track of the best plan per _subset_:
- If we find that `JOIN(b,a,c)` is better than any other
permutation of the same tables, e.g. `JOIN(a,b,c)`, then we can discard
_ALL_ of the other permutations for that subset. For example, we don't
need to consider `JOIN(a,b,c,d)` because it cannot be better than
`JOIN(b,a,c,d)`.
- This is possible due to the associativity and commutativity of
INNER JOINs.
- Also keep track of the best _ordered plan_, i.e. one that provides
the "interesting order" mentioned above.
- At the end, apply a sorting cost penalty to the best overall plan
- If it is now worse than the best ordered plan, then choose the
ordered plan as the best plan for the query.
- This allows us to eliminate a sorting operation.
- If the best overall plan is still best even with the sorting
penalty, then keep it. A sorting operation is later applied to sort the
rows according to the desired order.
5. **Mutate the plan's `join_order` and `Operation`s to match the
computed best plan.**
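The dynamic programming loop from step 4 can be sketched roughly as follows. This is a simplified illustration with hypothetical names; the real implementation in `join.rs` also tracks per-table access methods, the best ordered plan, and the sort-penalty comparison described above:

```rust
use std::collections::HashMap;

/// Simplified, hypothetical sketch of the System R style enumeration.
/// Table subsets are encoded as bitmasks; `best` memoizes the cheapest
/// plan found for each subset.
#[derive(Clone)]
struct Plan {
    join_order: Vec<usize>,
    cost: f64,
}

/// `step_cost(prefix, t)` estimates the added cost of joining table `t`
/// to the intermediate result produced by joining `prefix` in that order.
fn best_join_order(n: usize, step_cost: &dyn Fn(&[usize], usize) -> f64) -> Plan {
    // Pruning threshold: the cost of the literal query order 0, 1, .., n-1.
    let mut naive = Plan { join_order: vec![], cost: 0.0 };
    for t in 0..n {
        naive.cost += step_cost(&naive.join_order, t);
        naive.join_order.push(t);
    }
    let threshold = naive.cost;

    // n = 1: cheapest way to access each single table.
    let mut best: HashMap<u64, Plan> = HashMap::new();
    for t in 0..n {
        best.insert(1u64 << t, Plan { join_order: vec![t], cost: step_cost(&[], t) });
    }

    // n = k: extend every memoized (k-1)-table subset by one more table.
    for k in 2..=n {
        for (subset, plan) in best.clone() {
            if subset.count_ones() as usize != k - 1 {
                continue;
            }
            for t in 0..n {
                if subset & (1u64 << t) != 0 {
                    continue; // table already part of this subset
                }
                let cost = plan.cost + step_cost(&plan.join_order, t);
                if cost > threshold {
                    continue; // prune: cannot beat the literal query order
                }
                // Keep only the cheapest plan per subset; valid because
                // INNER JOIN is associative and commutative.
                let key = subset | (1u64 << t);
                if best.get(&key).map_or(true, |b| cost < b.cost) {
                    let mut join_order = plan.join_order.clone();
                    join_order.push(t);
                    best.insert(key, Plan { join_order, cost });
                }
            }
        }
    }

    // Best plan covering all tables, falling back to the literal order.
    best.remove(&((1u64 << n) - 1)).unwrap_or(naive)
}
```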
### Estimation of cost and cardinalities + a note on table statistics
Currently, in the absence of `ANALYZE`, `sqlite_stat1` etc. we assume
the following:
1. Each table has `1,000,000` rows.
2. Each equality (`=`) filter will filter out some percentage of the
result set.
3. Each nonequality (e.g. `>`) will filter out some smaller percentage
of the result set.
4. Each `4096` byte database page holds `50` rows, i.e. roughly `80`
bytes per row
5. Sort operations have some CPU cost dependent on the number of input
rows to the sort operation.
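Expressed as code, these defaults might look roughly like this. The names and the two selectivity values are purely illustrative (the inequality value matches the 1/4 used in the example below); the actual constants live in the optimizer (see `cost.rs`) and may differ:

```rust
// Purely illustrative defaults mirroring the assumptions above; the real
// constants and their exact values live in the optimizer's cost module.
const DEFAULT_TABLE_ROWS: f64 = 1_000_000.0; // assumed row count per table
const ROWS_PER_PAGE: f64 = 50.0;             // 4096-byte page, ~80 bytes per row
const EQ_SELECTIVITY: f64 = 0.10;            // fraction kept by an `=` filter (illustrative)
const RANGE_SELECTIVITY: f64 = 0.25;         // fraction kept by e.g. `>` (illustrative; keeps more than `=`)

/// Estimated number of page reads needed to scan `rows` rows.
fn page_io(rows: f64) -> f64 {
    (rows / ROWS_PER_PAGE).ceil()
}
```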
From the above, we derive the following formula for estimating the cost
of joining `t1` with `t2`:
```
JOIN_COST = PAGE_IO(t1.input_rows) + t1.output_rows * PAGE_IO(t2.input_rows)
```
For example, let's take the query `SELECT * FROM t1 JOIN t2 USING(foo)
WHERE t2.foo > 10`. Let's assume the following:
- `t1` has `6400` rows and `t2` has `8000` rows
- there are no indexes at all
- let's ignore the CPU cost from the equation for simplicity.
The best access method for both is a full table scan. The output
cardinality of `t1` is the full table, because nothing is filtering it.
Hence, the cost of `t1 JOIN t2` becomes:
```
JOIN_COST = PAGE_IO(t1.input_rows) + t1.output_rows * PAGE_IO(t2.input_rows)
// plugging in the values (PAGE_IO(rows) = rows / 50 rows per page):
JOIN_COST = PAGE_IO(6400) + 6400 * PAGE_IO(8000)
JOIN_COST = 128 + 6400 * 160 = 1024128
```
Now let's consider `t2 JOIN t1`. The best access method for both is
still a full scan, but since we can filter on `t2.foo > 10`, its output
cardinality decreases. Let's assume only 1/4 of the rows of `t2` match
the condition `t2.foo > 10`. Hence, the cost of `t2 join t1` becomes:
```
JOIN_COST = PAGE_IO(t2.input_rows) + t2.output_rows * PAGE_IO(t1.input_rows)
// plugging in the values:
JOIN_COST = PAGE_IO(8000) + 1/4 * 8000 * PAGE_IO(6400)
JOIN_COST = 160 + 2000 * 128 = 256160
```
Even though `t2` is the larger table, this join order is dramatically
cheaper because the filter reduces the number of rows fed into the join.
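The two estimates can be reproduced with a few lines of throwaway code (hypothetical helpers, using the 50-rows-per-page assumption from above):

```rust
// Throwaway check of the two join-cost estimates above (hypothetical helpers).
fn page_io(rows: f64) -> f64 {
    rows / 50.0 // 50 rows per 4096-byte page, per the assumptions above
}

fn join_cost(outer_input: f64, outer_output: f64, inner_input: f64) -> f64 {
    page_io(outer_input) + outer_output * page_io(inner_input)
}

fn main() {
    // t1 JOIN t2: nothing filters t1, so its output equals its input.
    println!("{}", join_cost(6400.0, 6400.0, 8000.0)); // 128 + 6400 * 160 = 1024128
    // t2 JOIN t1: only 1/4 of t2 survives `t2.foo > 10`.
    println!("{}", join_cost(8000.0, 8000.0 * 0.25, 6400.0)); // 160 + 2000 * 128 = 256160
}
```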
#### Statistics
Since we don't support `ANALYZE`, nor can we assume that users will call
`ANALYZE` anyway, we use simple magic constants to estimate the
selectivity of join predicates, row count of tables, and so on. When we
have support for `ANALYZE`, we should plug the statistics from
`sqlite_stat1` and friends into the optimizer to make more informed
decisions.
### Estimating the output cardinality of a join
The output cardinality (output row count) of a join is estimated as follows:
```
OUTPUT_CARDINALITY_JOIN = INPUT_CARDINALITY_RHS * OUTPUT_CARDINALITY_RHS
where
INPUT_CARDINALITY_RHS = OUTPUT_CARDINALITY_LHS
```
example:
```
SELECT * FROM products p JOIN order_lines o ON p.id = o.product_id
```
Assuming there are 100 products, i.e. just selecting all products would
yield 100 rows:
```
OUTPUT_CARDINALITY_LHS = 100
INPUT_CARDINALITY_RHS = 100
```
Assuming `p.id = o.product_id` matches three order lines for each product:
```
OUTPUT_CARDINALITY_RHS = 3
OUTPUT_CARDINALITY_JOIN = 100 * 3 = 300
```
i.e. the join is estimated to return 300 rows, 3 for each product.
Again, in the absence of statistics, we use magic constants to estimate
these cardinalities.
Estimating them is important because in multi-way joins the output
cardinality of the previous join becomes the input cardinality of the
next one.
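A tiny sketch of that chaining in a multi-way join (hypothetical code; the fanout numbers stand in for the magic constants or, eventually, real statistics):

```rust
/// Hypothetical sketch: the output cardinality of each join becomes the
/// input cardinality of the next. `fanouts[i]` is the estimated number of
/// matching rows in the (i+1)-th joined table per row produced so far.
fn estimate_join_output(base_rows: f64, fanouts: &[f64]) -> f64 {
    fanouts
        .iter()
        .fold(base_rows, |rows_so_far, fanout| rows_so_far * fanout)
}

fn main() {
    // products (100 rows) JOIN order_lines (3 per product): 100 * 3 = 300 rows,
    // which would then be the input cardinality of any further join.
    println!("{}", estimate_join_output(100.0, &[3.0])); // 300
}
```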
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes #1462