adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363
Draft
aljoscha wants to merge 5 commits into
Draft
adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363aljoscha wants to merge 5 commits into
aljoscha wants to merge 5 commits into
Conversation
b4f219d to
3d5b0ca
Compare
…ESTAMP and SUBSCRIBE EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle read (`oracle.read_ts().await`) inline on the single-threaded coordinator loop. The oracle backing store can be slow, and an inline await parks the loop, so a slow oracle wedged every other session for the whole round-trip. The peek path already avoids this: it does the read in its own `LinearizeTimestamp` stage that spawns the round-trip off the loop. Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a `LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads the resulting `oracle_read_ts` forward as plain data. The subsequent stages (`determine_timestamp` / `sequence_peek_timestamp` and the read-hold bookkeeping they do) stay on the loop and consume that value unchanged, so timestamp selection is identical, just no longer blocking. This is timing-neutral with respect to strict serializability: the linearized read still happens during the query's real-time interval (after arrival, before response), against the same backing oracle. It is the same call the peek path already makes off the loop. No caching or shared-atomic shortcut is introduced. The "should we linearize, and against which oracle" decision that was duplicated across the inline `oracle_read_ts` helper and `peek_linearize_timestamp` is factored into a single `Coordinator::linearized_read_ts_oracle` helper that returns the oracle to read from, if any. All three spawn sites share it. Fixes SQL-425
When a SUBSCRIBE starts or stops we record it in the mz_subscriptions introspection table. We did this with a synchronous group commit (`execute`/`blocking`) on the coordinator loop, which acquires a write timestamp from the timestamp oracle inline. When the oracle backend is slow, that round trip blocks the loop and stalls every other session, once per SUBSCRIBE start and once per stop. Instead, defer both the insert and the retraction to a (batched) group commit and wait for the write to become durable off the loop, in a spawned task, before delivering the client-visible response. On start, `implement_subscribe` hands the write-notify future back to its callers, which await it before sending the SUBSCRIBE response. So mz_subscriptions stays synchronously consistent: another session sees the row immediately after the response. On stop, `remove_active_compute_sink` returns the retraction's notify, `drop_compute_sinks` threads it through, and `retire_compute_sinks` awaits it before calling `sink.retire`. So a client that observes a subscribe's retirement (a cancellation or dependency-dropped error) does not then read a stale row. The controller collections are still dropped on the loop. For sinks that write no introspection row (internal subscribes, COPY TO) the notify is already resolved. The cluster test reads mz_subscriptions from a separate session whose read timestamp can lag the commit, so adapt it to poll. `add_active_compute_sink` no longer commits inline, so it is no longer async. Also remove the now-unused `BuiltinTableAppend::blocking`.
The periodic group commit and the user-write path did their timestamp-oracle
round trips inline on the coordinator loop: `peek_write_ts` and `write_ts` when
allocating the write timestamp, and `read_ts` when downgrading read holds
afterwards. The keepalive commit fires every `default_timestamp_interval` (1s)
regardless of load, so under a slow oracle backend the loop spent most of its
time blocked on these round trips and stalled every other session, including
pure reads that never touch the oracle.
Move the oracle round trips and the table append off the loop into a single,
long-lived group committer task. The loop only does the cheap, state-touching
work: `stage_group_commit` drains pending writes, validates and acquires write
locks, builds the append batch, and hands a `GroupCommitRequest` to the
committer. The committer does the throttle peek, allocates the write timestamp,
appends to the tables, applies the write to the oracle, and reads the local read
ts. It then hands the parts that need coordinator state back to the loop via
`Message::GroupCommitApplied`: advance the catalog upper, record statement
execution timestamps, retire client responses, and downgrade read holds.
Processing one request at a time makes the committer the serialization point for
group commits, so appends reach the single table-write worker in timestamp
order. The committer holds only shareable handles, captured once at startup: the
oracle is never replaced in-process, and the table appender (a new
`StorageController::table_appender` returning a cloneable handle over the txns
worker) stays valid for the controller's lifetime. Promotion out of read-only
mode is a process restart, so a fresh committer is spawned then.
Two ordering constraints keep the finalization on the loop rather than in the
committer:
- The catalog upper must be advanced in step with the oracle read ts so reads
of `mz_catalog_raw` at that ts do not block. This has to be serialized with
on-loop catalog transactions, whose commit timestamp is fixed when the
transaction begins, so the committer cannot advance it off the loop without
tripping the single-writer assert. The loop advances it in
`GroupCommitApplied` using a new tolerant
`DurableCatalogState::try_advance_upper`, which is a no-op if an on-loop DDL
transaction already advanced the upper past this commit's timestamp. The
strict `advance_upper` (with its linearizability soft-panic) stays for the
on-loop DDL callers, whose allocate-and-advance is atomic on the loop.
- Statement execution timestamps must be recorded before the client responses
are retired, because retiring ends the statement execution and drops its
logging record. Both happen in `GroupCommitApplied`, in that order.
The txns shard is advanced by on-loop DDL (register / forget / alter of tables)
as well as by the off-loop committer, and both must reach it in timestamp order.
`shard_advance_lock` serializes `[allocate write ts + txns-shard write]` across
the committer and the on-loop DDL sites. Reads never take this lock, so they are
never blocked by it. On-loop DDL can briefly wait for the committer's lock hold
(one oracle round trip), which is acceptable: DDL is infrequent and already did
its own oracle round trips on the loop.
`execute` (DDL builtin-table writes) and bootstrap now go through the committer
too: `execute` stages the write and returns the notify, and bootstrap awaits it
rather than applying the timestamp itself.
Fixes SQL-447
Under an injected 500ms oracle latency with 8 concurrent workers, a victim
`SELECT 1` (no oracle write of its own) goes from multiple seconds to p50 6ms,
p95 9ms while the workers hammer INSERT. Read-your-writes, cross-session
read-after-write, statement-logging writes, and CREATE/DROP/INSERT DDL stress at
both 500ms and 0ms oracle latency all pass with no errors or panics.
The off-loop group committer and on-loop DDL (register / forget / alter of
tables) both write to the txns shard at oracle-allocated timestamps, and both
must reach it in timestamp order. The previous commit serialized them with a
`shard_advance_lock` held around `[allocate write ts + txns-shard write]`. That
couples the two paths: on-loop DDL can wait a full oracle round trip for the
committer's lock hold, and the lock is an all-or-nothing coupling that cannot be
relaxed on just one side.
Replace the lock with optimistic concurrency. Each side allocates a timestamp
from the oracle (which is monotonic) and attempts its txns-shard write. If the
other side advanced the txns upper past it, the write conflicts and the caller
re-allocates a fresh timestamp (necessarily beyond the txns upper) and retries.
The committer's append already did this on `InvalidUppers`. It is now the
primary mechanism rather than a defensive backstop, and the lock is gone.
To make DDL retryable:
- The table write worker returns the conflict as `InvalidUppers` instead of
panicking, rolling back its `write_handles` bookkeeping so a retry at a
fresh timestamp can re-register or re-forget cleanly.
- `create_collections` and `alter_table_desc` register tables in the txns
shard as their last step. On conflict they return `InvalidUppers` with all
other setup already done. The new `register_table_collections` re-opens
write handles from the stored collection metadata (the originals are
consumed by the txns register) and registers them, so the coordinator can
retry it in a loop with fresh timestamps. `drop_tables` forgets
synchronously and returns the conflict for the same retry treatment.
- The coordinator drives the retry loops for CREATE / ALTER / DROP TABLE,
re-allocating the write timestamp and re-advancing the catalog upper (which
also re-checks leadership, see materialize#28216) each attempt.
Reads never take part in the txns-shard advance, so they are never blocked by
either side, and a reader never observes a partially registered table. The
oracle read timestamp is only advanced past the register timestamp
(`apply_local_write`) after a successful register.
At bootstrap no group commits run concurrently, so the register cannot conflict
there and stays a single attempt.
Under an injected 500ms oracle latency with concurrent inserters and DDL, a
victim `SELECT 1` stays at p50 8ms / p95 10ms. A fast-oracle stress with three
inserters and three concurrent CREATE/DROP TABLE loops drove 112 register and
10 forget conflicts, all retried to success with read-your-writes and
cross-session reads intact and no panics.
`create_collections` used to register tables in the txns shard as its last step on the happy path, and return `InvalidUppers` with everything-but-the-register done when the off-loop group committer raced it. The caller then finished the registration through a different method, `register_table_collections`. So there were two registration code paths, and `create_collections` had a "returns half-done, call the other method to finish" contract. Make `register_table_collections` the sole path that registers tables in the txns shard, used identically for the first attempt and for retries, by both bootstrap and runtime DDL. `create_collections` and `alter_table_desc` now only set up storage and never touch the txns shard. The caller (coordinator or bootstrap) always registers as a separate, retryable step, re-allocating a fresh oracle timestamp on `InvalidUppers`. `register_ts` still flows into `create_collections`, but only for the read side: the initial since downgrade that makes a table "spring into existence" at that timestamp. That downgrade happens on the freshly-opened since handle before it is handed to the storage-collections background task, so it belongs in `create_collections`. The write side, txns registration, is what needs to be retryable, and that is now fully separated out. `register_table_collections` re-opens write handles from the stored collection metadata (a conflicting register consumes the originals) and opens them concurrently, since bootstrap registers many tables at once. It also folds in the read-only filtering that `register_table_writes` used to do: in read-only mode only migrated tables are registered. `register_table_writes` is gone. Bootstrap collects the table ids across all dependency layers and registers them once, after all layers are set up. Bootstrap runs no group commits concurrently, so this cannot conflict. Validated with the fast-oracle DDL stress: 110 register and 9 forget conflicts, all retried to success, read-your-writes and cross-session reads intact, no panics, and a clean read-write bootstrap.
70d6efa to
f49e081
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
This is a combined / stacked branch carrying three related fixes that take
timestamp-oracle round trips off the coordinator's main loop. It is pushed as a
single PR so CI exercises all three together. Each commit is intended to become
its own (stacked) PR later:
adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE(Fixes SQL-425, also open standalone as adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE #37355)EXPLAIN TIMESTAMP and SUBSCRIBE did the linearized
read_ts()oracle roundtrip inline on the loop. Move it off via a
LinearizeTimestampstage, mirroringthe peek path.
adapter: wait for SUBSCRIBE bookkeeping off the coordinator loopSUBSCRIBE start/stop wrote/retracted its
mz_subscriptionsrow with asynchronous on-loop group commit. Defer the write to a batched group commit and
wait for it off the loop (in a spawned task) before delivering the client-visible
response, so the loop stays free while keeping
mz_subscriptionsconsistent.adapter: allocate group-commit timestamps off the coordinator loop(Fixes SQL-447)The periodic group commit did
peek_write_ts/write_ts/read_tsinline onthe loop, every
default_timestamp_interval(1s) regardless of load. Allocate thewrite timestamp off the loop (
Message::GroupCommitApply) and downgrade read holdsoff the loop (
Message::DowngradeReadHolds). Agroup_commit_last_advance_toguard re-allocates on the loop in the rare case an inline commit (DDL via
BuiltinTableAppend::execute) raced ahead, keeping the catalog and table uppersmonotonic.
Why
When the timestamp-oracle backend (CRDB) is slow, every inline oracle round trip
blocks the single-threaded coordinator loop and stalls all other sessions,
including pure reads that never touch the oracle.
Verification (local repro)
Injected 500ms oracle latency (toxiproxy) with 8 concurrent workers running the
offending statement, measuring a victim
SELECT 1(no oracle write of its own)on a separate connection:
Functional checks at 0ms and 500ms oracle latency: cross-session
read-after-write, read-your-writes, indexed views reflecting new writes,
SUBSCRIBE snapshots, and a concurrent CREATE/INSERT/DROP DDL burst (exercises the
re-allocation guard) all pass with no panic.
Note on correctness
Latency probes demonstrate loop responsiveness, not linearizability. The
group-commit change (#3) is in the most correctness-sensitive part of the
coordinator, so this relies on CI (pgtest, sqllogictest, testdrive,
platform-checks, parallel-workload) for the correctness story. Reviewing the
group-commit timestamp ordering / re-allocation guard is the key ask.
Draft: for CI observation while the individual PRs are split out.