adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355
Open
aljoscha wants to merge 2 commits into
Open
adapter: linearize timestamp off the coordinator loop for EXPLAIN TIMESTAMP and SUBSCRIBE#37355aljoscha wants to merge 2 commits into
aljoscha wants to merge 2 commits into
Conversation
…ESTAMP and SUBSCRIBE EXPLAIN TIMESTAMP and SUBSCRIBE performed their linearized timestamp-oracle read (`oracle.read_ts().await`) inline on the single-threaded coordinator loop. The oracle backing store can be slow, and an inline await parks the loop, so a slow oracle wedged every other session for the whole round-trip. The peek path already avoids this: it does the read in its own `LinearizeTimestamp` stage that spawns the round-trip off the loop. Make EXPLAIN TIMESTAMP and SUBSCRIBE follow the same pattern. Each gains a `LinearizeTimestamp` stage that spawns the `read_ts()` round-trip and threads the resulting `oracle_read_ts` forward as plain data. The subsequent stages (`determine_timestamp` / `sequence_peek_timestamp` and the read-hold bookkeeping they do) stay on the loop and consume that value unchanged, so timestamp selection is identical, just no longer blocking. This is timing-neutral with respect to strict serializability: the linearized read still happens during the query's real-time interval (after arrival, before response), against the same backing oracle. It is the same call the peek path already makes off the loop. No caching or shared-atomic shortcut is introduced. The "should we linearize, and against which oracle" decision that was duplicated across the inline `oracle_read_ts` helper and `peek_linearize_timestamp` is factored into a single `Coordinator::linearized_read_ts_oracle` helper that returns the oracle to read from, if any. All three spawn sites share it. Fixes SQL-425
When a SUBSCRIBE starts or stops we record it in the mz_subscriptions introspection table. We did this with a synchronous group commit (`execute`/`blocking`) on the coordinator loop, which acquires a write timestamp from the timestamp oracle inline. When the oracle backend is slow, that round trip blocks the loop and stalls every other session, once per SUBSCRIBE start and once per stop. Instead, defer both the insert and the retraction to a (batched) group commit and wait for the write to become durable off the loop, in a spawned task, before delivering the client-visible response. On start, `implement_subscribe` hands the write-notify future back to its callers, which await it before sending the SUBSCRIBE response. So mz_subscriptions stays synchronously consistent: another session sees the row immediately after the response. On stop, `remove_active_compute_sink` returns the retraction's notify, `drop_compute_sinks` threads it through, and `retire_compute_sinks` awaits it before calling `sink.retire`. So a client that observes a subscribe's retirement (a cancellation or dependency-dropped error) does not then read a stale row. The controller collections are still dropped on the loop. For sinks that write no introspection row (internal subscribes, COPY TO) the notify is already resolved. The cluster test reads mz_subscriptions from a separate session whose read timestamp can lag the commit, so adapt it to poll. `add_active_compute_sink` no longer commits inline, so it is no longer async. Also remove the now-unused `BuiltinTableAppend::blocking`.
def-
approved these changes
Jul 1, 2026
def-
left a comment
Contributor
There was a problem hiding this comment.
No concerns from QA side, just a minor note on the test
| poll until the table reflects the expected state. | ||
| """ | ||
| output = c.sql_query(""" | ||
| query = """ |
Contributor
There was a problem hiding this comment.
c.testdrive does this implicitly
de19d4e to
cc6636e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Since we keep hitting
environmentdstalls, this is one of the proactivefixes from the SQL-42x batch.
EXPLAIN TIMESTAMPandSUBSCRIBEperformed their linearizedtimestamp-oracle read (
oracle.read_ts().await) inline on thesingle-threaded coordinator loop (
explain_timestamp.rsandsubscribe.rs). The oracle backing store can be slow, and an inline awaitparks the loop, so a slow oracle wedged every other session for the whole
round-trip. The peek path does not have this problem: it does the read in its
own
LinearizeTimestampstage that spawns the round-trip off the loop(
peek_linearize_timestamp).The issue includes a reproducer (
coord-oracle-latency) that routes only thetimestamp-oracle connection through toxiproxy and injects latency. With the
old code,
kconcurrentEXPLAIN TIMESTAMPs serialize on the loop and scale~
kx the single-call latency (measured: 1x ≈ 500ms, 8x ≈ 8.7s at 500msinjected latency). A normal peek reads off the loop and batches.
What this does
Make
EXPLAIN TIMESTAMPandSUBSCRIBEfollow the existing peek pattern.Each gains a
LinearizeTimestampstage that spawns theread_ts()round-trip off the coordinator loop and threads the resulting
oracle_read_tsforward as plain data. The subsequent stages(
determine_timestamp/sequence_peek_timestampand the read-holdbookkeeping) stay on the loop and consume that value unchanged, so timestamp
selection is identical, just no longer blocking the loop.
The "should we linearize, and against which oracle" decision that was
duplicated across the inline
oracle_read_tshelper andpeek_linearize_timestampis factored into a singleCoordinator::linearized_read_ts_oraclehelper that returns the oracle toread from, if any. All three spawn sites (peek, subscribe, explain timestamp)
now share it, so this removes duplication rather than adding a fourth copy.
Correctness
This is timing-neutral with respect to strict serializability: the
linearized read still happens during the query's real-time interval (after
arrival, before response), against the same backing oracle. It is the same
call the peek path already makes off the loop. No caching or shared-atomic
shortcut is introduced (see the rejected optimizations in
doc/developer/guide-adapter.md). Read holds are still acquired on the loopin the stage that runs after the spawned read returns, matching peek. The new
spawned reads are cancelable via the existing
cancel_enabled/handle_spawnwiring, also matching peek.
Tests
No new tests in this PR. Correctness of timestamp selection for
EXPLAIN TIMESTAMPandSUBSCRIBEis covered by the existing functional suites. Theissue's
coord-oracle-latencyreproducer is a manual latency diagnostic (itsassertion is written to detect the bug, i.e. on-loop serialization); I'd
like to discuss whether to add an inverted-assertion version as a benchmark,
since latency assertions in a multi-container mzcompose tend to be flaky in
CI.
Fixes SQL-425
Local verification (reproducer)
Ran the issue's setup locally:
environmentdwith catalog/persist direct on CockroachDB and the timestamp oracle routed through toxiproxy with 500ms injected latency. Builtmainand this branch from the same tree and compared.EXPLAIN TIMESTAMP (the issue's clean signal): clear improvement.
read_tsbacking batches for 8 readsSELECT 1latency while 8 sessions hammer itThe "victim" row is the direct test of the headline symptom ("every other session stalls"): a constant
SELECT 1(no oracle read) on a separate connection. Onmainthe coordinator loop is wedged by the inline oracle reads and the victim stalls ~26s; with the fix the read is off-loop and the victim recovers to ~3.5s.Scope note / separate issue. The reproducer also shows that statements which actually read data (
SELECTpeeks, andSUBSCRIBE) still stall the coordinator under a slow oracle through a different on-loop cost (frontier advancement / dataflow lifecycle), independent of the linearized read this PR moves off-loop:SELECTpeek victim stall ~8.5s (this path was already off-loop for the read and is unchanged by this PR).SUBSCRIBEvictim stall ~24.7s on bothmainand this branch.This matches the issue's own note that "peeks have their own on-loop oracle costs too." This PR fixes the specific defect it names (the inline linearized read for EXPLAIN TIMESTAMP and SUBSCRIBE) and is fully effective for EXPLAIN TIMESTAMP. The broader "data reads stall the coordinator under a slow oracle" cost affects peeks too and should be tracked as a separate follow-up rather than folded in here.