Skip to content

adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363

Draft
aljoscha wants to merge 5 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop
Draft

adapter: take timestamp-oracle round trips off the coordinator loop (SQL-425, SQL-447)#37363
aljoscha wants to merge 5 commits into
MaterializeInc:mainfrom
aljoscha:sql-447-group-commit-oracle-off-loop

Conversation

@aljoscha

Copy link
Copy Markdown
Contributor

What

This is a combined / stacked branch carrying three related fixes that take
timestamp-oracle round trips off the coordinator's main loop. It is pushed as a
single PR so CI exercises all three together. Each commit is intended to become
its own (stacked) PR later:

  1. adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE (Fixes SQL-425, also open standalone as adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE #37355)
    EXPLAIN TIMESTAMP and SUBSCRIBE did the linearized read_ts() oracle round
    trip inline on the loop. Move it off via a LinearizeTimestamp stage, mirroring
    the peek path.

  2. adapter: wait for SUBSCRIBE bookkeeping off the coordinator loop
    SUBSCRIBE start/stop wrote/retracted its mz_subscriptions row with a
    synchronous on-loop group commit. Defer the write to a batched group commit and
    wait for it off the loop (in a spawned task) before delivering the client-visible
    response, so the loop stays free while keeping mz_subscriptions consistent.

  3. adapter: allocate group-commit timestamps off the coordinator loop (Fixes SQL-447)
    The periodic group commit did peek_write_ts / write_ts / read_ts inline on
    the loop, every default_timestamp_interval (1s) regardless of load. Allocate the
    write timestamp off the loop (Message::GroupCommitApply) and downgrade read holds
    off the loop (Message::DowngradeReadHolds). A group_commit_last_advance_to
    guard re-allocates on the loop in the rare case an inline commit (DDL via
    BuiltinTableAppend::execute) raced ahead, keeping the catalog and table uppers
    monotonic.

Why

When the timestamp-oracle backend (CRDB) is slow, every inline oracle round trip
blocks the single-threaded coordinator loop and stalls all other sessions,
including pure reads that never touch the oracle.

Verification (local repro)

Injected 500ms oracle latency (toxiproxy) with 8 concurrent workers running the
offending statement, measuring a victim SELECT 1 (no oracle write of its own)
on a separate connection:

workers run before (main) after (this branch)
SELECT (peek) p50 ~3-5s p50 6ms, p95 ~9ms
EXPLAIN TIMESTAMP p50 ~3.5s p50 6ms, p95 ~8ms
SUBSCRIBE p50 ~24s p50 6ms, p95 ~8ms

Functional checks at 0ms and 500ms oracle latency: cross-session
read-after-write, read-your-writes, indexed views reflecting new writes,
SUBSCRIBE snapshots, and a concurrent CREATE/INSERT/DROP DDL burst (exercises the
re-allocation guard) all pass with no panic.

Note on correctness

Latency probes demonstrate loop responsiveness, not linearizability. The
group-commit change (#3) is in the most correctness-sensitive part of the
coordinator, so this relies on CI (pgtest, sqllogictest, testdrive,
platform-checks, parallel-workload) for the correctness story. Reviewing the
group-commit timestamp ordering / re-allocation guard is the key ask.

Draft: for CI observation while the individual PRs are split out.

@aljoscha aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch from b4f219d to 3d5b0ca Compare June 30, 2026 16:43
aljoscha added 5 commits July 1, 2026 07:02
…ESTAMP and SUBSCRIBE

EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle
read (`oracle.read_ts().await`) inline on the single-threaded coordinator
loop. The oracle backing store can be slow, and an inline await parks the
loop, so a slow oracle wedged every other session for the whole round-trip.
The peek path already avoids this: it does the read in its own
`LinearizeTimestamp` stage that spawns the round-trip off the loop.

Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a
`LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads
the resulting `oracle_read_ts` forward as plain data. The subsequent stages
(`determine_timestamp` / `sequence_peek_timestamp` and the read-hold
bookkeeping they do) stay on the loop and consume that value unchanged, so
timestamp selection is identical, just no longer blocking.

This is timing-neutral with respect to strict serializability: the linearized
read still happens during the query's real-time interval (after arrival,
before response), against the same backing oracle. It is the same call the
peek path already makes off the loop. No caching or shared-atomic shortcut is
introduced.

The "should we linearize, and against which oracle" decision that was
duplicated across the inline `oracle_read_ts` helper and
`peek_linearize_timestamp` is factored into a single
`Coordinator::linearized_read_ts_oracle` helper that returns the oracle to
read from, if any. All three spawn sites share it.

Fixes SQL-425
When a SUBSCRIBE starts or stops we record it in the mz_subscriptions
introspection table. We did this with a synchronous group commit
(`execute`/`blocking`) on the coordinator loop, which acquires a write
timestamp from the timestamp oracle inline. When the oracle backend is
slow, that round trip blocks the loop and stalls every other session,
once per SUBSCRIBE start and once per stop.

Instead, defer both the insert and the retraction to a (batched) group
commit and wait for the write to become durable off the loop, in a
spawned task, before delivering the client-visible response.

On start, `implement_subscribe` hands the write-notify future back to its
callers, which await it before sending the SUBSCRIBE response. So
mz_subscriptions stays synchronously consistent: another session sees the
row immediately after the response.

On stop, `remove_active_compute_sink` returns the retraction's notify,
`drop_compute_sinks` threads it through, and `retire_compute_sinks`
awaits it before calling `sink.retire`. So a client that observes a
subscribe's retirement (a cancellation or dependency-dropped error) does
not then read a stale row. The controller collections are still dropped
on the loop. For sinks that write no introspection row (internal
subscribes, COPY TO) the notify is already resolved.

The cluster test reads mz_subscriptions from a separate session whose
read timestamp can lag the commit, so adapt it to poll.

`add_active_compute_sink` no longer commits inline, so it is no longer
async. Also remove the now-unused `BuiltinTableAppend::blocking`.
The periodic group commit and the user-write path did their timestamp-oracle
round trips inline on the coordinator loop: `peek_write_ts` and `write_ts` when
allocating the write timestamp, and `read_ts` when downgrading read holds
afterwards. The keepalive commit fires every `default_timestamp_interval` (1s)
regardless of load, so under a slow oracle backend the loop spent most of its
time blocked on these round trips and stalled every other session, including
pure reads that never touch the oracle.

Move the oracle round trips and the table append off the loop into a single,
long-lived group committer task. The loop only does the cheap, state-touching
work: `stage_group_commit` drains pending writes, validates and acquires write
locks, builds the append batch, and hands a `GroupCommitRequest` to the
committer. The committer does the throttle peek, allocates the write timestamp,
appends to the tables, applies the write to the oracle, and reads the local read
ts. It then hands the parts that need coordinator state back to the loop via
`Message::GroupCommitApplied`: advance the catalog upper, record statement
execution timestamps, retire client responses, and downgrade read holds.

Processing one request at a time makes the committer the serialization point for
group commits, so appends reach the single table-write worker in timestamp
order. The committer holds only shareable handles, captured once at startup: the
oracle is never replaced in-process, and the table appender (a new
`StorageController::table_appender` returning a cloneable handle over the txns
worker) stays valid for the controller's lifetime. Promotion out of read-only
mode is a process restart, so a fresh committer is spawned then.

Two ordering constraints keep the finalization on the loop rather than in the
committer:

  - The catalog upper must be advanced in step with the oracle read ts so reads
    of `mz_catalog_raw` at that ts do not block. This has to be serialized with
    on-loop catalog transactions, whose commit timestamp is fixed when the
    transaction begins, so the committer cannot advance it off the loop without
    tripping the single-writer assert. The loop advances it in
    `GroupCommitApplied` using a new tolerant
    `DurableCatalogState::try_advance_upper`, which is a no-op if an on-loop DDL
    transaction already advanced the upper past this commit's timestamp. The
    strict `advance_upper` (with its linearizability soft-panic) stays for the
    on-loop DDL callers, whose allocate-and-advance is atomic on the loop.

  - Statement execution timestamps must be recorded before the client responses
    are retired, because retiring ends the statement execution and drops its
    logging record. Both happen in `GroupCommitApplied`, in that order.

The txns shard is advanced by on-loop DDL (register / forget / alter of tables)
as well as by the off-loop committer, and both must reach it in timestamp order.
`shard_advance_lock` serializes `[allocate write ts + txns-shard write]` across
the committer and the on-loop DDL sites. Reads never take this lock, so they are
never blocked by it. On-loop DDL can briefly wait for the committer's lock hold
(one oracle round trip), which is acceptable: DDL is infrequent and already did
its own oracle round trips on the loop.

`execute` (DDL builtin-table writes) and bootstrap now go through the committer
too: `execute` stages the write and returns the notify, and bootstrap awaits it
rather than applying the timestamp itself.

Fixes SQL-447

Under an injected 500ms oracle latency with 8 concurrent workers, a victim
`SELECT 1` (no oracle write of its own) goes from multiple seconds to p50 6ms,
p95 9ms while the workers hammer INSERT. Read-your-writes, cross-session
read-after-write, statement-logging writes, and CREATE/DROP/INSERT DDL stress at
both 500ms and 0ms oracle latency all pass with no errors or panics.
The off-loop group committer and on-loop DDL (register / forget / alter of
tables) both write to the txns shard at oracle-allocated timestamps, and both
must reach it in timestamp order. The previous commit serialized them with a
`shard_advance_lock` held around `[allocate write ts + txns-shard write]`. That
couples the two paths: on-loop DDL can wait a full oracle round trip for the
committer's lock hold, and the lock is an all-or-nothing coupling that cannot be
relaxed on just one side.

Replace the lock with optimistic concurrency. Each side allocates a timestamp
from the oracle (which is monotonic) and attempts its txns-shard write. If the
other side advanced the txns upper past it, the write conflicts and the caller
re-allocates a fresh timestamp (necessarily beyond the txns upper) and retries.
The committer's append already did this on `InvalidUppers`. It is now the
primary mechanism rather than a defensive backstop, and the lock is gone.

To make DDL retryable:

  - The table write worker returns the conflict as `InvalidUppers` instead of
    panicking, rolling back its `write_handles` bookkeeping so a retry at a
    fresh timestamp can re-register or re-forget cleanly.

  - `create_collections` and `alter_table_desc` register tables in the txns
    shard as their last step. On conflict they return `InvalidUppers` with all
    other setup already done. The new `register_table_collections` re-opens
    write handles from the stored collection metadata (the originals are
    consumed by the txns register) and registers them, so the coordinator can
    retry it in a loop with fresh timestamps. `drop_tables` forgets
    synchronously and returns the conflict for the same retry treatment.

  - The coordinator drives the retry loops for CREATE / ALTER / DROP TABLE,
    re-allocating the write timestamp and re-advancing the catalog upper (which
    also re-checks leadership, see materialize#28216) each attempt.

Reads never take part in the txns-shard advance, so they are never blocked by
either side, and a reader never observes a partially registered table. The
oracle read timestamp is only advanced past the register timestamp
(`apply_local_write`) after a successful register.

At bootstrap no group commits run concurrently, so the register cannot conflict
there and stays a single attempt.

Under an injected 500ms oracle latency with concurrent inserters and DDL, a
victim `SELECT 1` stays at p50 8ms / p95 10ms. A fast-oracle stress with three
inserters and three concurrent CREATE/DROP TABLE loops drove 112 register and
10 forget conflicts, all retried to success with read-your-writes and
cross-session reads intact and no panics.
`create_collections` used to register tables in the txns shard as its last step
on the happy path, and return `InvalidUppers` with everything-but-the-register
done when the off-loop group committer raced it. The caller then finished the
registration through a different method, `register_table_collections`. So there
were two registration code paths, and `create_collections` had a "returns
half-done, call the other method to finish" contract.

Make `register_table_collections` the sole path that registers tables in the
txns shard, used identically for the first attempt and for retries, by both
bootstrap and runtime DDL. `create_collections` and `alter_table_desc` now only
set up storage and never touch the txns shard. The caller (coordinator or
bootstrap) always registers as a separate, retryable step, re-allocating a fresh
oracle timestamp on `InvalidUppers`.

`register_ts` still flows into `create_collections`, but only for the read side:
the initial since downgrade that makes a table "spring into existence" at that
timestamp. That downgrade happens on the freshly-opened since handle before it
is handed to the storage-collections background task, so it belongs in
`create_collections`. The write side, txns registration, is what needs to be
retryable, and that is now fully separated out.

`register_table_collections` re-opens write handles from the stored collection
metadata (a conflicting register consumes the originals) and opens them
concurrently, since bootstrap registers many tables at once. It also folds in
the read-only filtering that `register_table_writes` used to do: in read-only
mode only migrated tables are registered. `register_table_writes` is gone.

Bootstrap collects the table ids across all dependency layers and registers them
once, after all layers are set up. Bootstrap runs no group commits concurrently,
so this cannot conflict.

Validated with the fast-oracle DDL stress: 110 register and 9 forget conflicts,
all retried to success, read-your-writes and cross-session reads intact, no
panics, and a clean read-write bootstrap.
@aljoscha aljoscha force-pushed the sql-447-group-commit-oracle-off-loop branch 7 times, most recently from 70d6efa to f49e081 Compare July 1, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant