Trying to return integer sometimes to match SQLite led to more problems
that I anticipated. The reason being, we can't *really* match SQLite's
behavior unless we know the type of *every* element in the sum. This is
not impossible, but it is very hard, for very little gain.
Fixes#3831
Implements COUNT/SUM/AVG(DISTINCT) and SELECT DISTINCT for materialized views.
To do this we have to keep a list of the actual distinct values
(similarly to how we do for min/max). We then update the operator (and
issue deltas) only when there is a state transition (for example, if we
already count the value x = 1, and we see an insert for x = 1, we do
nothing).
SELECT DISTINCT (with no aggregator) is similar. We already have to keep
a list of the values anyway to power the aggregates. So we just issue
new deltas based on the transition, without updating the aggregator.
MVCC is like the annoying younger cousin (I know because I was him) that
needs to be treated differently. MVCC requires us to use root_pages that
might not be allocated yet, and the plan is to use negative root_pages
for that case. Therefore, we need i64 in order to fit this change.
We have used i64 before because that is the size of an integer in
SQLite. However, I believe that for large enough databases, the chances
of collision here are just too high. The effect of a collision is the
database silently returning incorrect data in the materialized view.
So now that everything else is working, we should move to i128.
The Materialized View code had custom serialization written so we could
move this code forward. Now that we have many operators and the views
work, replace it with something saner.
The main insight is that if we transform the AggregateState into Values
before the serialization, we are able to just use standard SQLite
serialization for the values. We then just have to add sizes, codes for
the functions, etc (which are also represented as Values).
This PR improves the DBSP circuit so that it handles the JOIN operator.
The JOIN operator exposes a weakness of our current model: we usually
pass a list of columns between operators, and find the right column by
name when needed.
But with JOINs, many tables can have the same columns. The operators
will then find the wrong column (same name, different table), and
produce incorrect results.
To fix this, we must do two things:
1) Change the Logical Plan. It needs to track table provenance.
2) Fix the aggregators: it needs to operate on indexes, not names.
For the aggregators, note that table provenance is the wrong
abstraction. The aggregator is likely working with a logical table that
is the result of previous nodes in the circuit. So we just need to be
able to tell it which index in the column array it should use.