This patch pushes unsafe Send and Sync to individual components instead
of doing it at the Database level. This makes it easier for us to
incrementally fix thread-safety while preventing developers from adding
more thread-unsafe code.
The previous implementation of CompletionGroup would call the group's
callback function directly when the last completion finished:
    if prev == 1 {
        let group_result = group.result.get().and_then(|e| *e);
        (group.complete)(group_result.map_or(Ok(0), Err));
    }
This broke nested completion groups because parent groups track their
children via the Completion::callback() method. By calling the function
pointer directly, we bypassed the completion chain and parent groups
never received notification that their child had completed.
The fix stores a reference to the group's own Completion object in
self_completion during build(). When the last child finishes, we call
group_completion.callback() instead of invoking the function directly.
This properly propagates through the completion hierarchy, ensuring
parent groups decrement their outstanding count and eventually complete.
This matches the behavior of individual completions and maintains the
invariant that all completions notify their parents through the unified
callback() mechanism.
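A minimal sketch of the fixed path, assuming `self_completion` is a
once-initialized slot populated during build() and that callback()
consumes the stored result:

    if prev == 1 {
        // Notify through the group's own Completion rather than the raw
        // function pointer, so parent groups observe the child finishing
        // via the unified callback() path.
        let group_completion = group.self_completion.get().expect("set in build()");
        group_completion.callback();
    }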
The previous implementation of CompletionGroup::add() would filter out
successfully-finished completions:
    if !completion.finished() || completion.failed() {
        self.completions.push(completion.clone());
    }
This caused a problem when combined with drain() in the calling code.
Completions that were already finished would be removed from the source
vector by drain() but not added to the group, effectively losing track
of them.
This breaks the invariant that all completions passed to a group must
be tracked, regardless of their state. The build() method already
handles finished completions correctly by not including them in the
outstanding count.
The fix is to always add all completions and let build() handle their
state appropriately, matching the behavior of the old io_yield_many!()
macro.
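A sketch of the fixed add(), with the surrounding types assumed from
the snippet above:

    pub fn add(&mut self, completion: &Completion) {
        // Track every completion unconditionally; build() already skips
        // finished ones when computing the outstanding count.
        self.completions.push(completion.clone());
    }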
Yield is a completion that does not allocate any inner state. By design
it is completed from the start and has no errors. This allows a
lightweight yield without taking any locks or heap-allocating inner
state.
I added the `Once` before to fix a bug, but it was a bit hacky. We can
use `get_or_init` to achieve the same purpose, and the code becomes much
cleaner. `get_or_init` guarantees the init will happen only once as
well.
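For illustration only, with std types rather than our code:

    use std::sync::OnceLock;

    static STATE: OnceLock<Vec<u8>> = OnceLock::new();

    fn state() -> &'static Vec<u8> {
        // The closure runs at most once, even with concurrent callers;
        // everyone else blocks and then sees the initialized value.
        STATE.get_or_init(|| vec![0u8; 4096])
    }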
Reviewed-by: Preston Thorpe <preston@turso.tech>
Reviewed-by: bit-aloo (@Shourya742)
Closes #3578
Introduces a completion group abstraction that allows grouping multiple
I/O completions together for coordinated tracking and error handling.
This enables:
- Tracking completion status of multiple I/O operations as a group
- Detecting when all operations in a group have finished
- Aborting all operations in a group atomically
- Retrieving errors from any completion in the group
The implementation uses intrusive linked lists for efficient membership
tracking and atomic counters for outstanding operation counts. Each
completion can be linked to a group using the new .link() method.
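A hypothetical usage sketch; the constructor and callback shape are
assumptions, while add()/build()/finished()/abort() are the names used
elsewhere in this series:

    let mut group = CompletionGroup::new(|_result| {
        // Runs once, after every completion in the group has finished.
    });
    group.add(&read_completion);
    group.add(&write_completion);
    let completion = group.build();

    // The group behaves like a single Completion: poll it for overall
    // status, or abort the whole batch atomically.
    if !completion.finished() {
        completion.abort();
    }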
This lays the groundwork for batch I/O operations and coordinated
transaction handling in the storage layer.
We use relaxed ordering in a lot of places where we really need to
ensure all CPUs see the write. Switch to sequential consistency, unless
acquire/release is explicitly used. If there are places that can be
optimized, we can switch to relaxed case by case, but with a comment
explaining *why* it is safe.
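To illustrate the policy (not code from the patch):

    use std::sync::atomic::{AtomicUsize, Ordering};

    static OUTSTANDING: AtomicUsize = AtomicUsize::new(0);
    static STATS: AtomicUsize = AtomicUsize::new(0);

    fn child_finished() -> bool {
        // Default: SeqCst, so all CPUs agree on which thread saw the
        // counter hit zero.
        OUTSTANDING.fetch_sub(1, Ordering::SeqCst) == 1
    }

    fn bump_stats() {
        // Relaxed is safe here because the counter is only read for
        // diagnostics and never used for synchronization.
        STATS.fetch_add(1, Ordering::Relaxed);
    }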
The `run_once()` name is just a historical accident. Furthermore, it
has now started to appear elsewhere as well, so let's just call it
IO::step(), as we should have from the beginning.
Closes #1419
When submitting a `pwritev` for flushing dirty pages, in the case that
it's a commit frame, we use a new completion type that tells io_uring
to add a flag, ensuring the following:
1. If any operation in the chain fails, subsequent operations get
cancelled with -ECANCELED
2. All operations in the chain complete in order
If there is an ongoing chain of `IO_LINK`, it ends at the `fsync`
barrier, which ensures everything submitted before it has completed.
For 99% of cases, the syscall that immediately follows the `pwritev`
is going to be the fsync, but just in case, this implementation links
everything that comes between the final commit `pwritev` and the next
`fsync`.
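Roughly, using the `io-uring` crate; this is a sketch assuming `ring`,
`fd`, and `iovecs` are set up elsewhere, not the actual implementation:

    use io_uring::{opcode, squeue::Flags, types};

    fn queue_commit(ring: &mut io_uring::IoUring, fd: i32, iovecs: &[libc::iovec]) {
        let write = opcode::Writev::new(types::Fd(fd), iovecs.as_ptr(), iovecs.len() as u32)
            .build()
            // IO_LINK: if this write fails, the linked fsync is cancelled
            // with -ECANCELED; otherwise they complete in order.
            .flags(Flags::IO_LINK);
        let fsync = opcode::Fsync::new(types::Fd(fd)).build();
        unsafe {
            let mut sq = ring.submission();
            sq.push(&write).expect("submission queue full");
            sq.push(&fsync).expect("submission queue full");
        }
    }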
In the event that we get a partial write, if it was linked, then after
the partial write completes we force a `submit` and queue an additional
fsync with an `IO_DRAIN` flag. Durability is maintained, because that
fsync will not run until everything already in the submission queue has
completed.
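Continuing the same sketch, the partial-write fallback would look
something like:

    // Force everything queued so far into the kernel, then queue an
    // fsync that IO_DRAIN holds back until all of it has completed.
    ring.submit().expect("submit failed");
    let fsync = opcode::Fsync::new(types::Fd(fd))
        .build()
        .flags(Flags::IO_DRAIN);
    unsafe {
        ring.submission().push(&fsync).expect("submission queue full");
    }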
The other option in the event of partial writes on commit frames/linked
writes is to error... not sure which is the right move here. I guess
it's possible that since the fsync completion fired, the commit could be
over without us being durable on disk. So maybe it should be an
assertion instead? Thoughts?
Closes #2909
- otherwise, in a multi-threaded environment, another thread can think
that the completion is finished and start execution
- this can lead to violated assertions (for example, a page must be
loaded, but since the callback has not executed yet, the assertion will
fire)
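A sketch of the required ordering (the Completion shape here is
assumed):

    use std::sync::atomic::{AtomicBool, Ordering};

    struct Completion {
        callback: Box<dyn Fn() + Send + Sync>,
        finished: AtomicBool,
    }

    impl Completion {
        fn complete(&self) {
            // Run the callback first (e.g. mark the page loaded)...
            (self.callback)();
            // ...and only then publish "finished", so a concurrent
            // observer never sees finished == true while the page is
            // still unloaded.
            self.finished.store(true, Ordering::SeqCst);
        }
    }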
Because we can abort a read_page completion, this means a page can be in
the cache but be unloaded and unlocked. However, if we do not evict that
page from the page cache, we will later return an unloaded page, which
will trigger assertions. This is worsened by the fact that the page
cache is not per-`Statement`, so you can abort a completion in one
Statement and trigger an error in the next one if we don't evict the
page in these circumstances.
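Conceptually, with the names assumed, the fix amounts to:

    // After aborting a read_page completion, never leave an unloaded
    // page behind in the shared cache.
    if !page.is_loaded() {
        page_cache.evict(page.id());
    }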
Also, to propagate I/O errors we need to return the Error from
IOCompletions on step().
Closes #2785
Commit ebe6aa0d28 ("adjust cfg for unix
and linux IO") adjusted the I/O conditional compilation, but forgot that
Android and iOS are also part of the Unix target family.
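A hypothetical illustration of the distinction (not the actual cfgs in
the tree):

    // `unix` matches Android and iOS too, so a Linux-only backend must
    // be gated on target_os, not on the target family.
    #[cfg(target_os = "linux")]
    mod io_uring_io;

    // Every other Unix platform, including Android and iOS.
    #[cfg(all(unix, not(target_os = "linux")))]
    mod posix_io;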
Fixes #2500
The completion callback can be invoked only once via `OnceLock`; let's
not crash if we e.g. call `Completion::abort()` on an already finished
completion.
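Conceptually, with the field and error names assumed:

    // The result slot is a OnceLock; set() returns Err if the callback
    // already ran, so on abort we ignore it instead of unwrapping.
    let _ = self.result.set(Err(CompletionError::Aborted));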
Closes #2673
Problem
There are several problems with our current statically allocated
`BufferPool`.
1. You cannot open two databases in the same process with different
page sizes, because the `BufferPool`'s `Arena`s will be locked forever
into the page size of the first database. This is the case regardless
of whether the two `Database`s are open at the same time, or if the first
is closed before the second is opened.
2. It is impossible to even write Rust tests for different page sizes because
of this, assuming the test uses a single process.
Solution
Make `Database` own `BufferPool` instead of it being statically allocated, so this
problem goes away.
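A minimal sketch of the ownership change, with field and constructor
shapes assumed:

    use std::sync::Arc;

    struct BufferPool {
        page_size: usize,
    }

    impl BufferPool {
        fn new(page_size: usize) -> Self {
            Self { page_size }
        }
    }

    // Owned per-Database instead of a process-wide static, so two open
    // databases can use different page sizes in the same process.
    struct Database {
        buffer_pool: Arc<BufferPool>,
    }

    impl Database {
        fn open(page_size: usize) -> Self {
            Self { buffer_pool: Arc::new(BufferPool::new(page_size)) }
        }
    }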
Note that I didn't touch the still statically-allocated `TEMP_BUFFER_CACHE`, because
it should continue to work regardless of this change. It should only be a problem if
the user has two or more databases with different page sizes open simultaneously, because
`TEMP_BUFFER_CACHE` will only support one pool of a given page size at a time, so the rest
of the allocations will go through the global allocator instead.
Notes
I extracted this change from #2569, because I didn't want it to be smuggled in without
being reviewed as an individual piece.