adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447) by aljoscha · Pull Request #37363 · MaterializeInc/materialize

aljoscha · 2026-06-30T16:02:04Z

What

This is a combined / stacked branch carrying three related fixes that take
timestamp-oracle round trips off the coordinator's main loop. It is pushed as a
single PR so CI exercises all three together. Each commit is intended to become
its own (stacked) PR later:

adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE (Fixes SQL-425, also open standalone as adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE #37355)
EXPLAIN TIMESTAMP and SUBSCRIBE did the linearized read_ts() oracle round
trip inline on the loop. Move it off via a LinearizeTimestamp stage, mirroring
the peek path.
adapter: wait for SUBSCRIBE bookkeeping off the coordinator loop
SUBSCRIBE start/stop wrote/retracted its mz_subscriptions row with a
synchronous on-loop group commit. Defer the write to a batched group commit and
wait for it off the loop (in a spawned task) before delivering the client-visible
response, so the loop stays free while keeping mz_subscriptions consistent.
adapter: allocate group-commit timestamps off the coordinator loop (Fixes SQL-447)
The periodic group commit did peek_write_ts / write_ts / read_ts inline on
the loop, every default_timestamp_interval (1s) regardless of load. Allocate the
write timestamp off the loop (Message::GroupCommitApply) and downgrade read holds
off the loop (Message::DowngradeReadHolds). A group_commit_last_advance_to
guard re-allocates on the loop in the rare case an inline commit (DDL via
BuiltinTableAppend::execute) raced ahead, keeping the catalog and table uppers
monotonic.

Why

When the timestamp-oracle backend (CRDB) is slow, every inline oracle round trip
blocks the single-threaded coordinator loop and stalls all other sessions,
including pure reads that never touch the oracle.

Verification (local repro)

Injected 500ms oracle latency (toxiproxy) with 8 concurrent workers running the
offending statement, measuring a victim SELECT 1 (no oracle write of its own)
on a separate connection:

workers run	before (main)	after (this branch)
SELECT (peek)	p50 ~3-5s	p50 6ms, p95 ~9ms
EXPLAIN TIMESTAMP	p50 ~3.5s	p50 6ms, p95 ~8ms
SUBSCRIBE	p50 ~24s	p50 6ms, p95 ~8ms

Functional checks at 0ms and 500ms oracle latency: cross-session
read-after-write, read-your-writes, indexed views reflecting new writes,
SUBSCRIBE snapshots, and a concurrent CREATE/INSERT/DROP DDL burst (exercises the
re-allocation guard) all pass with no panic.

Note on correctness

Latency probes demonstrate loop responsiveness, not linearizability. The
group-commit change (#3) is in the most correctness-sensitive part of the
coordinator, so this relies on CI (pgtest, sqllogictest, testdrive,
platform-checks, parallel-workload) for the correctness story. Reviewing the
group-commit timestamp ordering / re-allocation guard is the key ask.

Draft: for CI observation while the individual PRs are split out.

…ESTAMP and SUBSCRIBE EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle read (`oracle.read_ts().await`) inline on the single-threaded coordinator loop. The oracle backing store can be slow, and an inline await parks the loop, so a slow oracle wedged every other session for the whole round-trip. The peek path already avoids this: it does the read in its own `LinearizeTimestamp` stage that spawns the round-trip off the loop. Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a `LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads the resulting `oracle_read_ts` forward as plain data. The subsequent stages (`determine_timestamp` / `sequence_peek_timestamp` and the read-hold bookkeeping they do) stay on the loop and consume that value unchanged, so timestamp selection is identical, just no longer blocking. This is timing-neutral with respect to strict serializability: the linearized read still happens during the query's real-time interval (after arrival, before response), against the same backing oracle. It is the same call the peek path already makes off the loop. No caching or shared-atomic shortcut is introduced. The "should we linearize, and against which oracle" decision that was duplicated across the inline `oracle_read_ts` helper and `peek_linearize_timestamp` is factored into a single `Coordinator::linearized_read_ts_oracle` helper that returns the oracle to read from, if any. All three spawn sites share it. Fixes SQL-425

When a SUBSCRIBE starts or stops we record it in the mz_subscriptions introspection table. We did this with a synchronous group commit (`execute`/`blocking`) on the coordinator loop, which acquires a write timestamp from the timestamp oracle inline. When the oracle backend is slow, that round trip blocks the loop and stalls every other session, once per SUBSCRIBE start and once per stop. Instead, defer both the insert and the retraction to a (batched) group commit and wait for the write to become durable off the loop, in a spawned task, before delivering the client-visible response. On start, `implement_subscribe` hands the write-notify future back to its callers, which await it before sending the SUBSCRIBE response. So mz_subscriptions stays synchronously consistent: another session sees the row immediately after the response. On stop, `remove_active_compute_sink` returns the retraction's notify, `drop_compute_sinks` threads it through, and `retire_compute_sinks` awaits it before calling `sink.retire`. So a client that observes a subscribe's retirement (a cancellation or dependency-dropped error) does not then read a stale row. The controller collections are still dropped on the loop. For sinks that write no introspection row (internal subscribes, COPY TO) the notify is already resolved. The cluster test reads mz_subscriptions from a separate session whose read timestamp can lag the commit, so adapt it to poll. `add_active_compute_sink` no longer commits inline, so it is no longer async. Also remove the now-unused `BuiltinTableAppend::blocking`.

The periodic group commit and the user-write path did their timestamp-oracle round trips inline on the coordinator loop: `peek_write_ts` and `write_ts` when allocating the write timestamp, and `read_ts` when downgrading read holds afterwards. The keepalive commit fires every `default_timestamp_interval` (1s) regardless of load, so under a slow oracle backend the loop spent most of its time blocked on these round trips and stalled every other session, including pure reads that never touch the oracle. Move the oracle round trips and the table append off the loop into a single, long-lived group committer task. The loop only does the cheap, state-touching work: `stage_group_commit` drains pending writes, validates and acquires write locks, builds the append batch, and hands a `GroupCommitRequest` to the committer. The committer does the throttle peek, allocates the write timestamp, appends to the tables, applies the write to the oracle, and reads the local read ts. It then hands the parts that need coordinator state back to the loop via `Message::GroupCommitApplied`: advance the catalog upper, record statement execution timestamps, retire client responses, and downgrade read holds. Processing one request at a time makes the committer the serialization point for group commits, so appends reach the single table-write worker in timestamp order. The committer holds only shareable handles, captured once at startup: the oracle is never replaced in-process, and the table appender (a new `StorageController::table_appender` returning a cloneable handle over the txns worker) stays valid for the controller's lifetime. Promotion out of read-only mode is a process restart, so a fresh committer is spawned then. Two ordering constraints keep the finalization on the loop rather than in the committer: - The catalog upper must be advanced in step with the oracle read ts so reads of `mz_catalog_raw` at that ts do not block. This has to be serialized with on-loop catalog transactions, whose commit timestamp is fixed when the transaction begins, so the committer cannot advance it off the loop without tripping the single-writer assert. The loop advances it in `GroupCommitApplied` using a new tolerant `DurableCatalogState::try_advance_upper`, which is a no-op if an on-loop DDL transaction already advanced the upper past this commit's timestamp. The strict `advance_upper` (with its linearizability soft-panic) stays for the on-loop DDL callers, whose allocate-and-advance is atomic on the loop. - Statement execution timestamps must be recorded before the client responses are retired, because retiring ends the statement execution and drops its logging record. Both happen in `GroupCommitApplied`, in that order. The txns shard is advanced by on-loop DDL (register / forget / alter of tables) as well as by the off-loop committer, and both must reach it in timestamp order. `shard_advance_lock` serializes `[allocate write ts + txns-shard write]` across the committer and the on-loop DDL sites. Reads never take this lock, so they are never blocked by it. On-loop DDL can briefly wait for the committer's lock hold (one oracle round trip), which is acceptable: DDL is infrequent and already did its own oracle round trips on the loop. `execute` (DDL builtin-table writes) and bootstrap now go through the committer too: `execute` stages the write and returns the notify, and bootstrap awaits it rather than applying the timestamp itself. Fixes SQL-447 Under an injected 500ms oracle latency with 8 concurrent workers, a victim `SELECT 1` (no oracle write of its own) goes from multiple seconds to p50 6ms, p95 9ms while the workers hammer INSERT. Read-your-writes, cross-session read-after-write, statement-logging writes, and CREATE/DROP/INSERT DDL stress at both 500ms and 0ms oracle latency all pass with no errors or panics.

The off-loop group committer and on-loop DDL (register / forget / alter of tables) both write to the txns shard at oracle-allocated timestamps, and both must reach it in timestamp order. The previous commit serialized them with a `shard_advance_lock` held around `[allocate write ts + txns-shard write]`. That couples the two paths: on-loop DDL can wait a full oracle round trip for the committer's lock hold, and the lock is an all-or-nothing coupling that cannot be relaxed on just one side. Replace the lock with optimistic concurrency. Each side allocates a timestamp from the oracle (which is monotonic) and attempts its txns-shard write. If the other side advanced the txns upper past it, the write conflicts and the caller re-allocates a fresh timestamp (necessarily beyond the txns upper) and retries. The committer's append already did this on `InvalidUppers`. It is now the primary mechanism rather than a defensive backstop, and the lock is gone. To make DDL retryable: - The table write worker returns the conflict as `InvalidUppers` instead of panicking, rolling back its `write_handles` bookkeeping so a retry at a fresh timestamp can re-register or re-forget cleanly. - `create_collections` and `alter_table_desc` register tables in the txns shard as their last step. On conflict they return `InvalidUppers` with all other setup already done. The new `register_table_collections` re-opens write handles from the stored collection metadata (the originals are consumed by the txns register) and registers them, so the coordinator can retry it in a loop with fresh timestamps. `drop_tables` forgets synchronously and returns the conflict for the same retry treatment. - The coordinator drives the retry loops for CREATE / ALTER / DROP TABLE, re-allocating the write timestamp and re-advancing the catalog upper (which also re-checks leadership, see materialize#28216) each attempt. Reads never take part in the txns-shard advance, so they are never blocked by either side, and a reader never observes a partially registered table. The oracle read timestamp is only advanced past the register timestamp (`apply_local_write`) after a successful register. At bootstrap no group commits run concurrently, so the register cannot conflict there and stays a single attempt. Under an injected 500ms oracle latency with concurrent inserters and DDL, a victim `SELECT 1` stays at p50 8ms / p95 10ms. A fast-oracle stress with three inserters and three concurrent CREATE/DROP TABLE loops drove 112 register and 10 forget conflicts, all retried to success with read-your-writes and cross-session reads intact and no panics.

`create_collections` used to register tables in the txns shard as its last step on the happy path, and return `InvalidUppers` with everything-but-the-register done when the off-loop group committer raced it. The caller then finished the registration through a different method, `register_table_collections`. So there were two registration code paths, and `create_collections` had a "returns half-done, call the other method to finish" contract. Make `register_table_collections` the sole path that registers tables in the txns shard, used identically for the first attempt and for retries, by both bootstrap and runtime DDL. `create_collections` and `alter_table_desc` now only set up storage and never touch the txns shard. The caller (coordinator or bootstrap) always registers as a separate, retryable step, re-allocating a fresh oracle timestamp on `InvalidUppers`. `register_ts` still flows into `create_collections`, but only for the read side: the initial since downgrade that makes a table "spring into existence" at that timestamp. That downgrade happens on the freshly-opened since handle before it is handed to the storage-collections background task, so it belongs in `create_collections`. The write side, txns registration, is what needs to be retryable, and that is now fully separated out. `register_table_collections` re-opens write handles from the stored collection metadata (a conflicting register consumes the originals) and opens them concurrently, since bootstrap registers many tables at once. It also folds in the read-only filtering that `register_table_writes` used to do: in read-only mode only migrated tables are registered. `register_table_writes` is gone. Bootstrap collects the table ids across all dependency layers and registers them once, after all layers are set up. Bootstrap runs no group commits concurrently, so this cannot conflict. Validated with the fast-oracle DDL stress: 110 register and 9 forget conflicts, all retried to success, read-your-writes and cross-session reads intact, no panics, and a clean read-write bootstrap.

aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch from b4f219d to 3d5b0ca Compare June 30, 2026 16:43

aljoscha added 5 commits July 1, 2026 07:02

aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 7 times, most recently from 70d6efa to f49e081 Compare July 1, 2026 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363

adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363
aljoscha wants to merge 5 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop

aljoscha commented Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aljoscha commented Jun 30, 2026

What

Why

Verification (local repro)

Note on correctness

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant