```sql
-- This PR does effectively this transformation:
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#22'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 8 and l_quantity <= 8 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 24 and l_quantity <= 24 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
-- Same query with common conjuncts (ANDs) extracted:
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
p_partkey = l_partkey
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
and (
(
p_brand = 'Brand#22'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 8 and l_quantity <= 8 + 10
and p_size between 1 and 5
)
or
(
p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
)
or
(
p_brand = 'Brand#12'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 24 and l_quantity <= 24 + 10
and p_size between 1 and 15
)
);
```
This allows Limbo's optimizer to 1. recognize `p_partkey=l_partkey` as
an index constraint on `part`, and 2. filter out `lineitem` rows before
joining. With this optimization, Limbo completes TPC-H `19.sql` nearly
as fast as SQLite on my machine. Without it, Limbo takes forever.
This branch: `939ms`
Main: `uh, i started running it a few minutes ago and it hasnt finished,
and i dont feel like waiting i guess`
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1520
Previously the Operation enum consisted of:
- Operation::Scan
- Operation::Search
- Operation::Subquery
Which was always a dumb hack because what we really are doing is an
Operation::Scan on a "virtual"/"pseudo" table (overloaded names...)
derived from a subquery appearing in the FROM clause.
Hence, refactor the relevant data structures so that the Table enum now
contains a new variant:
Table::FromClauseSubquery
And the Operation enum only consists of Scan and Search.
```
SELECT * FROM (SELECT ...) sub;
-- the subquery here was previously interpreted as Operation::Subquery on a Table::Pseudo,
-- with a lot of special handling for Operation::Subquery in different code paths
-- now it's an Operation::Scan on a Table::FromClauseSubquery
```
No functional changes (intended, at least!)
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1529
```sql
-- This PR does effectively this transformation:
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
p_partkey = l_partkey
and p_brand = 'Brand#22'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 8 and l_quantity <= 8 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
p_partkey = l_partkey
and p_brand = 'Brand#12'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 24 and l_quantity <= 24 + 10
and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
);
-- Same query with common conjuncts (ANDs) extracted:
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
p_partkey = l_partkey
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
and (
(
p_brand = 'Brand#22'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 8 and l_quantity <= 8 + 10
and p_size between 1 and 5
)
or
(
p_brand = 'Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
and l_quantity >= 10 and l_quantity <= 10 + 10
and p_size between 1 and 10
)
or
(
p_brand = 'Brand#12'
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
and l_quantity >= 24 and l_quantity <= 24 + 10
and p_size between 1 and 15
)
);
```
Reviewable commit by commit. CI failures are not related.
Adds support for e.g. `select first_name, sum(distinct age),
count(distinct age), avg(distinct age) from users group by 1`
Implementation details:
- Creates an ephemeral index per distinct aggregate, and jumps over the
accumulation step if a duplicate is found
Closes#1507
Previously the Operation enum consisted of:
- Operation::Scan
- Operation::Search
- Operation::Subquery
Which was always a dumb hack because what we really are doing is
an Operation::Scan on a "virtual"/"pseudo" table (overloaded names...)
derived from a subquery appearing in the FROM clause.
Hence, refactor the relevant data structures so that the Table enum
now contains a new variant:
Table::FromClauseSubquery
And the Operation enum only consists of Scan and Search.
No functional changes (intended, at least!)
Various things to improve speed of long fuzz test execution time:
* remove unnecessary debug_validate_cell calls
* Add SortedVec for keys in fuzz tests
* Validate btree's depth in fuzz test every 1K inserts to not overload
test with validations. We add `VALIDATE_BTREE` env variable to enable
validation on every insert in case it is needed.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#1521
1. `group_by_contains_all` was incorrect - it was not checking that all
order by columns are in group by; it was instead checking that all group
by columns are in order by, which is absolutely incorrect for the
intended purpose.
2. remove ORDER BY clause if GROUP BY clause can sort the rows in the
same way.
Test failures are not related
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1511
We've run into trouble in multiple places due to the fact that we delete
terms from the where clause (e.g. when a constant condition is removed,
or the term becomes part of an index seek key).
A simpler solution is to add a flag indicating that the term is consumed
(used), so that it is not translated in the main loop anymore when WHERE
clause terms are evaluated.
note: CI failures are unrelated
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1477
Various things:
* remove unnecessary debug_validate_cell calls
* Add SortedVec for keys in fuzz tests
* Validate btree's depth in fuzz test every 1K inserts to not overload
test with validations. We add `VALIDATE_BTREE` env variable to enable
validation on every insert in case it is needed.
This PR adds a new function `read_write_payload_with_offset` to support
reading and writing payload data at specific offsets, handling both
local content and overflow pages. This is a port of SQLite's
`accessPayload` function in `btree.c` and will be essential for
supporting incremental blob I/O in the coming PRs.
- Added a state machine called `PayloadOverflowWithOffset` to make the
procedure reentrant.
- Correctly processes both local payload data and payload stored in
overflow pages
Testing:
- Reading and writing to a column with no overflow pages.
- Reading and writing at an offset with overflow pages (spanning 10
pages)
Reviewed-by: Pere Diaz Bou <pere-altea@homail.com>
Closes#1476
Closes#1482. I needed to change the `key_exists_in_index` function
because it zips the values from the records it is comparing, but if one
of the records is empty or not of the same length, the `all` function
could return true incorrectly.
Reviewed-by: Jussi Saurio <jussi.saurio@gmail.com>
Closes#1514
We've run into trouble in multiple places due to the fact that
we delete terms from the where clause (e.g. when a constant condition
is removed, or the term becomes part of an index seek key).
A simpler solution is to add a flag indicating that the term is
consumed (used), so that it is not translated in the main loop
anymore when WHERE clause terms are evaluated.
Primary keys that are marked as unique constraint, do not need to have
separate indexes, one is enough. In the case primary key is integer,
therefore rowid alias, we still need an index to satisfy unique
constraint.
Closes#1494