This PR modifies the ProgramBuilder to support nested parsing. By moving
the emission of Init and the final Goto (the prologue and epilogue) into
the ProgramBuilder, we can simply skip emitting them when
parsing/translating in a nested context. For this PR, I only migrated
insert to use these functions, as I need them to support INSERT
statements that use `SELECT FROM` syntax. Nested parsing enables code
reuse overall and is arguably one of the only ways to parse deeply
nested queries without a lot of code duplication.
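A minimal sketch of the idea, under stated assumptions: the `Insn` variants, the `nested_depth` counter, and the `prologue`/`epilogue`/`nested` method names are hypothetical and only illustrate how a builder that owns Init/Goto emission can skip them in a nested context. This is not the actual ProgramBuilder API.

```rust
/// Hypothetical, simplified instruction set (not Limbo's real instruction enum).
#[derive(Debug, Clone, PartialEq)]
enum Insn {
    Init { target_pc: usize },
    Goto { target_pc: usize },
    Halt,
    Op(&'static str),
}

/// Hypothetical builder that owns prologue/epilogue emission.
struct ProgramBuilder {
    insns: Vec<Insn>,
    nested_depth: usize,
    init_pc: Option<usize>,
}

impl ProgramBuilder {
    fn new() -> Self {
        ProgramBuilder { insns: Vec::new(), nested_depth: 0, init_pc: None }
    }

    /// Emit Init only for the outermost statement.
    fn prologue(&mut self) {
        if self.nested_depth == 0 {
            self.init_pc = Some(self.insns.len());
            // Placeholder target, patched in `epilogue`.
            self.insns.push(Insn::Init { target_pc: 0 });
        }
    }

    /// Emit Halt and the final Goto only for the outermost statement.
    fn epilogue(&mut self) {
        if self.nested_depth > 0 {
            return;
        }
        self.insns.push(Insn::Halt);
        let goto_pc = self.insns.len();
        if let Some(pc) = self.init_pc {
            self.insns[pc] = Insn::Init { target_pc: goto_pc };
        }
        // Jump back to the first instruction after Init.
        let start = self.init_pc.map_or(0, |pc| pc + 1);
        self.insns.push(Insn::Goto { target_pc: start });
    }

    /// Run `f` in a nested context where prologue/epilogue are no-ops.
    fn nested<T>(&mut self, f: impl FnOnce(&mut Self) -> T) -> T {
        self.nested_depth += 1;
        let out = f(&mut *self);
        self.nested_depth -= 1;
        out
    }
}

/// A translator that can run standalone or nested without changes.
fn translate_select(b: &mut ProgramBuilder) {
    b.prologue();
    b.insns.push(Insn::Op("select-body"));
    b.epilogue();
}

/// INSERT ... SELECT reuses `translate_select` in a nested context.
fn translate_insert_select(b: &mut ProgramBuilder) {
    b.prologue();
    b.nested(|b| translate_select(b));
    b.insns.push(Insn::Op("insert-row"));
    b.epilogue();
}
```

With this shape, `translate_select` does not need to know whether it is the outermost statement; only the builder tracks that.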
#1528
Closes #1543
This PR builds on top of
https://github.com/tursodatabase/limbo/pull/1368 and adds a few things:
allowing pages with the same page key to be inserted, fixing fuzz tests
by adding transactions, and some minor improvements to cacheflush.
Closes #1523
This PR adds a port of [SQLite's CSV virtual table
extension](https://www.sqlite.org/csv.html).
Planned follow-ups:
* Pass detailed error messages from `VTabModule::create`, not just
`ResultCode`s.
* Address the TODO in `VTabModuleImpl::create_schema`.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes #1544
One problem with PageRef is that the referenced page can be unloaded.
If we then read the page again instead of loading it into the same
reference, we end up with a split brain of references.
To solve this, we wrap PageRef in `BTreePage` so that if a page is seen
as unloaded, we replace `BTreePage::page` with the newest version of
the page.
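The wrapper idea can be sketched roughly as follows. The `Page` fields, the `Rc`-based `PageRef` alias, and the `get` method taking a `read_page` callback are all assumptions made for illustration; the real types in the pager differ.

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

/// Hypothetical page with a "loaded" flag (stand-in for the real page state).
struct Page {
    id: usize,
    loaded: Cell<bool>,
}

/// Hypothetical alias; the real PageRef is defined elsewhere.
type PageRef = Rc<Page>;

/// Wrapper that always hands out the newest loaded version of a page.
struct BTreePage {
    page: RefCell<PageRef>,
}

impl BTreePage {
    fn new(page: PageRef) -> Self {
        BTreePage { page: RefCell::new(page) }
    }

    /// Return the wrapped page; if it was unloaded, re-read it through
    /// `read_page` (a stand-in for the pager) and swap the reference in place.
    fn get(&self, read_page: impl FnOnce(usize) -> PageRef) -> PageRef {
        let stale = !self.page.borrow().loaded.get();
        if stale {
            let id = self.page.borrow().id;
            let fresh = read_page(id);
            *self.page.borrow_mut() = fresh;
        }
        Rc::clone(&self.page.borrow())
    }
}
```

Because every holder goes through the same `BTreePage`, they all observe the replaced reference instead of keeping a stale one.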
Previously, fuzz tests grew the page cache indefinitely, so there was no
risk of reaching the page cache's capacity. By adding transactions to
fuzz tests, we allow pages to clear their dirty flags once an insert is
finished.
After inserting a page into the WAL, we dispose of the modified page.
This is unnecessary, as we can simply move the new page to the newest
snapshot where this page can be read.
Dirty pages can be deleted in `cacheflush`. Furthermore, there could be
multiple live references in the stack of a cursor so let's allow them to
exist while deleting.
insert() fails if the key already exists (there shouldn't be two
entries), panics if the existing entry is a different page, and also
fails if it can't make room for the page.
Replaced the limited pop_if_not_dirty() function with make_room_for().
It tries to evict as many pages as the requested spare capacity. It
should come in handy later for resize() and Pager. make_room_for()
either makes room or fails if it can't evict enough entries.
For make_room_for() I also tried an all-or-nothing approach, so that if,
say, a query requests much more room than can be made, it doesn't evict
a bunch of pages from the cache that might be useful. But implementing
this approach got very complicated since it needs to keep exclusive
PageRefs, and collecting these caused segfaults. Might be worth trying
again in the future. But beware the rabbit hole.
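A rough sketch of the insert() and make_room_for() rules described above, using a toy cache. The `CacheError` variants, the `VecDeque`-based LRU order, and the field names are assumptions for illustration. As a simplification, any existing key returns an error here, without the panic on a different page. Like the approach that landed, this sketch is not all-or-nothing: pages evicted before a failure stay evicted.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical error type for the toy cache.
#[derive(Debug, PartialEq)]
enum CacheError {
    KeyExists,
    Full,
}

/// Hypothetical page; only the dirty flag matters for eviction here.
struct Page {
    dirty: bool,
}

struct PageCache {
    capacity: usize,
    map: HashMap<usize, Page>,
    lru: VecDeque<usize>, // front = least recently used
}

impl PageCache {
    fn new(capacity: usize) -> Self {
        PageCache { capacity, map: HashMap::new(), lru: VecDeque::new() }
    }

    /// Try to free enough entries so that `n` more pages fit.
    /// Dirty pages are skipped; fails if not enough clean pages exist.
    fn make_room_for(&mut self, n: usize) -> Result<(), CacheError> {
        if n > self.capacity {
            return Err(CacheError::Full);
        }
        let mut need = (self.map.len() + n).saturating_sub(self.capacity);
        let mut i = 0;
        while need > 0 && i < self.lru.len() {
            let key = self.lru[i];
            if self.map[&key].dirty {
                i += 1; // not evictable, look at the next candidate
            } else {
                self.lru.remove(i);
                self.map.remove(&key);
                need -= 1;
            }
        }
        if need == 0 { Ok(()) } else { Err(CacheError::Full) }
    }

    /// Fails if the key is already cached or if no room can be made.
    fn insert(&mut self, key: usize, page: Page) -> Result<(), CacheError> {
        if self.map.contains_key(&key) {
            return Err(CacheError::KeyExists);
        }
        self.make_room_for(1)?;
        self.map.insert(key, page);
        self.lru.push_back(key);
        Ok(())
    }
}
```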
Updated page cache test logic for new insert rules.
Updated Pager.allocate_page() to handle failure logic but needs further
work. This is to show new cache insert handling. There are many places
to update.
Left comments on callers of pager and page cache needing to update
error handling, for now.
Add error handling and results for insert(), delete(), _delete(),
_detach(), pop_if_not_dirty(), and clear.
Now these functions fail if a page is dirty, locked, or has other
references.
insert() makes room with pop_if_not_dirty() beforehand to handle
cache full and un-evictable, else it would evict this page
silently.
_delete() returns Ok when the key is not present in the cache.
Otherwise it first tries to detach the cache entry and clean its
page *before* removing the entry from the map.
detach() first checks whether the page can be evicted and whether
there are no other references to the page before taking its
contents.
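The ordering described above (check, detach, then remove from the map) might look roughly like this. The `CacheError` variants, the `Page` fields, and the use of `Rc::strong_count` to detect other references are assumptions for this sketch, not the actual page cache code.

```rust
use std::cell::{Cell, RefCell};
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical reasons a page cannot be evicted.
#[derive(Debug, PartialEq)]
enum CacheError {
    Dirty,
    Locked,
    ActiveRefs,
}

/// Hypothetical page state.
struct Page {
    dirty: Cell<bool>,
    locked: Cell<bool>,
    contents: RefCell<Option<Vec<u8>>>,
}

struct PageCache {
    map: HashMap<usize, Rc<Page>>,
}

impl PageCache {
    /// Check evictability first, then take the page contents.
    fn detach(page: &Rc<Page>) -> Result<(), CacheError> {
        if page.dirty.get() {
            return Err(CacheError::Dirty);
        }
        if page.locked.get() {
            return Err(CacheError::Locked);
        }
        // The cache itself holds one reference; anything more is a live user.
        if Rc::strong_count(page) > 1 {
            return Err(CacheError::ActiveRefs);
        }
        page.contents.borrow_mut().take();
        Ok(())
    }

    /// Ok if the key is absent; otherwise detach *before* removing the entry,
    /// so a failed detach leaves the cache unchanged.
    fn delete(&mut self, key: usize) -> Result<(), CacheError> {
        let page = match self.map.get(&key) {
            Some(p) => p,
            None => return Ok(()),
        };
        Self::detach(page)?;
        self.map.remove(&key);
        Ok(())
    }
}
```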
test_detach_via_delete() and test_detach_via_insert() fixed by
properly checking before and after dropping the page reference.
test_page_cache_fuzz() fixed by reordering and moving reference to
the page into insert.
Other page cache tests fixed to check new function results.
All page cache tests pass.
Error handling and test fixes for Pager and BTree will be added in
a subsequent commit.
For example, implementing `SELECT DISTINCT` (#1517) and `UNION` (#1545)
requires that we are able to create indexes without a rowid column
present. Similarly, `WITHOUT ROWID` tables require this.
I implemented this by replacing the `rowid` and `empty_record`
properties in `BtreeCursor` with
```rust
/// Whether the cursor is currently pointing to a record.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CursorHasRecord {
    Yes {
        rowid: Option<u64>, // not all indexes and btrees have rowids, so this is optional.
    },
    No,
}
```
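As an illustration of how callers might consume this enum, here is a small sketch. The enum is repeated so the block is self-contained; the `rowid_of` and `is_empty` helpers are hypothetical and only stand in for what the old `rowid` and `empty_record` properties expressed.

```rust
/// Whether the cursor is currently pointing to a record.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CursorHasRecord {
    Yes {
        rowid: Option<u64>,
    },
    No,
}

/// Hypothetical helper: the rowid, if the cursor is on a record that has one.
fn rowid_of(state: CursorHasRecord) -> Option<u64> {
    match state {
        CursorHasRecord::Yes { rowid } => rowid,
        CursorHasRecord::No => None,
    }
}

/// Hypothetical helper: true when the cursor is not on any record.
fn is_empty(state: CursorHasRecord) -> bool {
    matches!(state, CursorHasRecord::No)
}
```

The point of the single enum is that "on a record without a rowid" and "not on a record" are now distinct states, which two separate fields could not express cleanly.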
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes #1518
I tried to stay as close to rusqlite as possible. The only thing that's
bothering me is `Vec<Vec<Value>>`, which I think can be improved, but
I'm not sure how. Any input on this is welcome.
Closes #1536
```sql
-- This PR does effectively this transformation:
select
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    lineitem,
    part
where
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#22'
        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
        and l_quantity >= 8 and l_quantity <= 8 + 10
        and p_size between 1 and 5
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#23'
        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
        and l_quantity >= 10 and l_quantity <= 10 + 10
        and p_size between 1 and 10
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    )
    or
    (
        p_partkey = l_partkey
        and p_brand = 'Brand#12'
        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
        and l_quantity >= 24 and l_quantity <= 24 + 10
        and p_size between 1 and 15
        and l_shipmode in ('AIR', 'AIR REG')
        and l_shipinstruct = 'DELIVER IN PERSON'
    );
-- Same query with common conjuncts (ANDs) extracted:
select
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    lineitem,
    part
where
    p_partkey = l_partkey
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON'
    and (
        (
            p_brand = 'Brand#22'
            and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
            and l_quantity >= 8 and l_quantity <= 8 + 10
            and p_size between 1 and 5
        )
        or
        (
            p_brand = 'Brand#23'
            and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
            and l_quantity >= 10 and l_quantity <= 10 + 10
            and p_size between 1 and 10
        )
        or
        (
            p_brand = 'Brand#12'
            and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
            and l_quantity >= 24 and l_quantity <= 24 + 10
            and p_size between 1 and 15
        )
    );
```
This allows Limbo's optimizer to 1. recognize `p_partkey=l_partkey` as
an index constraint on `part`, and 2. filter out `lineitem` rows before
joining. With this optimization, Limbo completes TPC-H `19.sql` nearly
as fast as SQLite on my machine. Without it, Limbo takes forever.
This branch: `939ms`
Main: `uh, i started running it a few minutes ago and it hasnt finished,
and i dont feel like waiting i guess`
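The transformation itself can be sketched in a few lines. This is a simplified illustration, not the optimizer's code: terms are compared as plain strings here, whereas a real implementation would compare expression trees, and the function name `extract_common_conjuncts` is an assumption.

```rust
/// Given OR-ed groups of AND-ed terms, split out the terms that appear in
/// every group. Returns (common terms, per-group remainders).
/// Terms are compared textually, which is a simplification.
fn extract_common_conjuncts<'a>(
    disjuncts: &[Vec<&'a str>],
) -> (Vec<&'a str>, Vec<Vec<&'a str>>) {
    let first = match disjuncts.first() {
        Some(f) => f,
        None => return (Vec::new(), Vec::new()),
    };
    // A term is common if every OR branch contains it.
    let common: Vec<&'a str> = first
        .iter()
        .copied()
        .filter(|t| disjuncts.iter().all(|d| d.contains(t)))
        .collect();
    // Each branch keeps only the terms that were not hoisted out.
    let remainders = disjuncts
        .iter()
        .map(|d| d.iter().copied().filter(|t| !common.contains(t)).collect())
        .collect();
    (common, remainders)
}
```

The result reads as `common[0] AND common[1] AND ... AND (rest[0] OR rest[1] OR ...)`. If any remainder ends up empty, that branch is trivially true and the whole OR can be dropped, leaving only the common terms.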
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes #1520
Previously the Operation enum consisted of:
- Operation::Scan
- Operation::Search
- Operation::Subquery
Which was always a dumb hack because what we really are doing is an
Operation::Scan on a "virtual"/"pseudo" table (overloaded names...)
derived from a subquery appearing in the FROM clause.
Hence, refactor the relevant data structures so that the Table enum now
contains a new variant:
Table::FromClauseSubquery
And the Operation enum only consists of Scan and Search.
```sql
SELECT * FROM (SELECT ...) sub;
-- the subquery here was previously interpreted as Operation::Subquery on a Table::Pseudo,
-- with a lot of special handling for Operation::Subquery in different code paths
-- now it's an Operation::Scan on a Table::FromClauseSubquery
```
No functional changes (intended, at least!)
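A rough sketch of the resulting shape, with assumed field names: the real `Table` and `Operation` enums carry much more data, and `TableReference`, `BTree`, and `describe` are hypothetical names used only to show that a FROM-clause subquery is now just another table kind being scanned.

```rust
/// Hypothetical, trimmed-down table kinds.
#[derive(Debug, PartialEq)]
enum Table {
    BTree { name: String },
    FromClauseSubquery { alias: String },
}

/// After the refactor, only two operations remain.
#[derive(Debug, PartialEq)]
enum Operation {
    Scan,
    Search,
}

/// Hypothetical pairing of a table with the operation applied to it.
struct TableReference {
    table: Table,
    op: Operation,
}

/// Code paths can match on the table kind instead of special-casing
/// a dedicated subquery operation.
fn describe(t: &TableReference) -> String {
    match (&t.op, &t.table) {
        (Operation::Scan, Table::BTree { name }) => format!("scan table {}", name),
        (Operation::Search, Table::BTree { name }) => format!("search table {}", name),
        (Operation::Scan, Table::FromClauseSubquery { alias }) => {
            format!("scan subquery {}", alias)
        }
        (Operation::Search, Table::FromClauseSubquery { alias }) => {
            format!("search subquery {}", alias)
        }
    }
}
```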
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes #1529