
docs(rfc): add message queue contract RFC #259

Merged
behinddwalls merged 1 commit into main from messagequeue-rfc
Jun 18, 2026

Conversation

@behinddwalls

@behinddwalls behinddwalls commented Jun 17, 2026

Collaborator

Summary

Why?

Queue payloads are Go structs serialized with encoding/json, so the wire shape is defined only by Go source. There is no language-neutral contract a non-Go client can compile against, no explicit topic-to-payload binding, and no distinction between a domain's private wiring and a published cross-domain contract.

What?

Doc-only — the design of record for message queue contracts (doc/rfc/messagequeue-contract.md):

  • Contract language: Protobuf, serialized as protobuf JSON (protojson) so payloads stay self-describing JSON on the wire. The .proto is the authority and the Go binding is generated from it, so it cannot drift (no hand-authored struct, no drift test to keep them in sync).
  • Topic binding: a custom topics proto option (defined in api/base/messagequeue) carries the wire topic names on the message itself, read back by reflection — not on the publish/consume hot path, which still resolves topics from a consumer.TopicKey via the registry.
  • Location by audience: external contracts (something outside the owning domain depends on them) live in api/{domain}/messagequeue/; internal ones in {domain}/core/messagequeue/, with the split enforced by Bazel visibility.
  • Documents the accepted protojson conventions (snake_case field names, UPPER_SNAKE enums, int64-as-string) and why JSON Schema, binary proto, and Avro were rejected.
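The wire conventions listed above can be illustrated with a small stdlib-only sketch. It is not the generated binding: `ExamplePayload`, its fields, and the `SQUASH_MERGE` value are hypothetical stand-ins, and `encoding/json` is used only to mimic the shape that protobuf JSON would produce. In real code the type would be generated from the `.proto` and serialized with `protojson` (where snake_case names usually correspond to the `UseProtoNames` option; whether the RFC relies on that option is not stated here).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Strategy mimics a proto enum, which protobuf JSON carries as its
// UPPER_SNAKE name rather than as a number. The value is hypothetical.
type Strategy string

const StrategySquashMerge Strategy = "SQUASH_MERGE"

// ExamplePayload is a hand-written stand-in used only to show the wire
// shape; the RFC's design generates the Go binding from the .proto.
type ExamplePayload struct {
	RequestID string   `json:"request_id"`     // snake_case field name
	Attempt   int64    `json:"attempt,string"` // int64 carried as a JSON string
	Strategy  Strategy `json:"strategy"`       // enum carried by name
}

// encode renders the payload as self-describing JSON.
func encode(p ExamplePayload) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decode parses the JSON form back into the payload.
func decode(s string) (ExamplePayload, error) {
	var p ExamplePayload
	err := json.Unmarshal([]byte(s), &p)
	return p, err
}

func main() {
	wire, err := encode(ExamplePayload{RequestID: "req-1", Attempt: 7, Strategy: StrategySquashMerge})
	if err != nil {
		panic(err)
	}
	fmt.Println(wire)
}
```

Carrying `int64` as a string keeps large values intact for JSON consumers whose number type is a 64-bit float, which is the usual motivation for that convention.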

Test Plan

  • make lint (doc-only)

Issues

Stack

  1. refactor(api/base): name the message queue option topic_keys #264
  2. @ docs(rfc): add message queue contract RFC #259
  3. feat(api/runway): add external merge queue contract #260
  4. feat(orchestrator): make merge-conflict check asynchronous via runway #245
  5. feat(merge): make merge asynchronous via runway #247

@behinddwalls behinddwalls marked this pull request as ready for review June 17, 2026 14:33
@behinddwalls behinddwalls requested review from a team and sbalabanov as code owners June 17, 2026 14:33
behinddwalls added a commit that referenced this pull request Jun 17, 2026
## Summary
Move the cross-domain shared trees out of the repo root into a single
`platform/` umbrella. Domain-scoped trees (`submitqueue/`, `stovepipe/`,
`runway/`) keep their own `core/`, `entity/`, `extension/`.

| From | To |
|------|----|
| `core/{errs,metrics,consumer}` | `platform/{errs,metrics,consumer}` |
| `core/httpclient` | `platform/http` |
| `entity/*` (`change`, `mergestrategy`, `messagequeue`) | `platform/base/*` |
| `extension/*` (`counter`, `messagequeue`) | `platform/extension/*` |

Details:
- `platform/http`: package renamed `httpclient` -> `http`; `net/http`
aliased as `nethttp` inside the package; callers that also import
`net/http` use a `phttp` alias.
- `platform/base`: root doc package renamed `entity` -> `base`; `change`
/ `mergestrategy` / `messagequeue` preserved as subpackages.
- Rewrites all Go import paths and Bazel labels; updates `Makefile`
schema/admin paths and `go:generate` roots.
- Docs refreshed to the `platform/` layout (`CLAUDE.md`, package
READMEs, RFCs), documenting current state only.

> Stacked on #256 (hermetic Go SDK). Review/merge that first.

## Test plan
- [x] `make gazelle` clean (BUILD files in sync)
- [x] `make build` — 201 targets build (hermetic)
- [x] `make test` — 54/54 unit tests pass

Made with [Cursor](https://cursor.com)

## Test Plan


## Issues


## Stack
1. @ #257
1. #258
1. #259
1. #260

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread doc/rfc/messagequeue-contract.md Outdated
Comment thread doc/rfc/messagequeue-contract.md Outdated
Comment thread doc/rfc/messagequeue-contract.md Outdated
Comment thread doc/rfc/messagequeue-contract.md Outdated
Comment thread doc/rfc/messagequeue-contract.md Outdated
Comment thread doc/rfc/messagequeue-contract.md Outdated
Base automatically changed from move-proto-to-api to main June 17, 2026 18:27
behinddwalls added a commit that referenced this pull request Jun 17, 2026
## Summary

### Why?

The "Rebase Stacked PRs" job silently no-ops when the repo's "Automatically delete head branches" setting is on. On merge, GitHub deletes the merged head branch, which auto-retargets every child PR's base from the merged head branch to the merged base (e.g. main) *before* the job runs. The job discovers children via `gh pr list --base <merged-head>`, which then returns nothing — so it rebases nothing, force-pushes nothing, and still reports success. Run #257 (and #256 before it) did exactly this, leaving the downstream stack (#258/#259/#260) with doubled diffs that had to be rebased by hand.

The fix is to keep "Automatically delete head branches" off (now done at the repo level) and let this workflow own head-branch cleanup. With the setting off, GitHub leaves child bases untouched, the lookup finds them, and the job rebases the stack and then deletes the merged branch itself — for both stacked and non-stacked PRs.

### What?

- Document the hard requirement that "Automatically delete head branches" stays off, with the retarget-race rationale, in the workflow header.
- Log a clear line when a merge has no child PRs instead of returning silently.
- Clarify the branch-deletion step: it runs on every merged PR (stacked or not) and is skipped only when a child rebase failed; fix the misleading "All stacked PRs rebased successfully" message that printed even on no-op runs.

No behavior change on the happy path — the job already deleted the merged branch on success; this makes the intent explicit and the logs legible.

## Test Plan

- ✅ `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/rebase-stack.yml'))"` — YAML parses.
- Confirmed `delete_branch_on_merge` is `false` via `gh api repos/uber/submitqueue`.
- Post-merge: the next Rebase Stack run should log child rebases or "No open child PRs … nothing to rebase", then "Deleting merged branch …" and actually remove it.

Co-authored-by: Cursor <cursoragent@cursor.com>
behinddwalls added a commit that referenced this pull request Jun 17, 2026
## Summary

### Why?

The "Rebase Stacked PRs" job was silently no-opping: with the repo's
"Automatically delete head branches" setting on, GitHub deleted the
merged head branch on merge, which auto-retargeted every child PR's base
from the merged head branch to the merged base (e.g. `main`) *before*
the job ran. Its child lookup (`gh pr list --base <merged-head>`) then
found nothing, so it rebased nothing and still went green. Run #257 (and
#256 before it) did exactly this, leaving #258/#259/#260 with doubled
diffs that had to be rebased by hand.

The repo setting is now off, making this workflow the sole owner of
head-branch cleanup. But that exposed a second gap: the job only deleted
the merged branch at the end of its own run. When a child rebase
conflicts, the run intentionally keeps the merged branch (the child
still bases on it) and never fires again — so after the author manually
rebases and retargets the child, nothing ever deletes the orphaned
merged branch.

### What?

- Document the hard requirement that "Automatically delete head
branches" stays **off**, with the retarget-race rationale, in the
workflow header.
- Replace the outcome-gated single-branch delete with an invariant-based
sweep, `cleanup_orphaned_merged_branches`, that runs on every
invocation: for each recently merged PR whose head branch still exists,
delete it **iff** no open PR references it as a base or head. This:
- deletes the just-merged branch on a clean run (children retargeted
away),
- **keeps** it when an immediate child rebase failed (that child still
bases on it — deleting it would retarget the child to `main` and
recreate the broken diff),
- reaps branches stranded by an earlier conflicted run on the next
merge, once their children were manually fixed.
- Branch existence is snapshotted in one `git ls-remote`; the merged
base (e.g. `main`) is never touched. Conflicted runs also emit a
`::warning::` for visibility while still exiting non-fatally, and
no-stack merges log a clear line.
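
The sweep invariant described above can be sketched as a pure function. This is an illustration in Go, not the workflow's shell implementation: `OpenPR`, `orphanedMergedBranches`, and the branch names are hypothetical, and the inputs stand in for the `git ls-remote` snapshot and the open-PR listing.

```go
package main

import "fmt"

// OpenPR is a hypothetical, minimal view of an open pull request.
type OpenPR struct {
	Base string
	Head string
}

// orphanedMergedBranches returns the merged head branches that are safe
// to delete: they still exist, are not the merged base, and no open PR
// references them as a base or a head.
func orphanedMergedBranches(mergedHeads []string, existing map[string]bool, open []OpenPR, mergedBase string) []string {
	referenced := make(map[string]bool)
	for _, pr := range open {
		referenced[pr.Base] = true
		referenced[pr.Head] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, b := range mergedHeads {
		if seen[b] {
			continue // dedup: the same branch may appear in several merged PRs
		}
		seen[b] = true
		if !existing[b] {
			continue // skip-gone: branch was already deleted
		}
		if b == mergedBase {
			continue // skip-base: never touch the merged base (e.g. main)
		}
		if referenced[b] {
			continue // an open PR still depends on it, so keep it
		}
		out = append(out, b)
	}
	return out
}

func main() {
	got := orphanedMergedBranches(
		[]string{"feat-a", "feat-b", "gone"},
		map[string]bool{"feat-a": true, "feat-b": true, "main": true},
		[]OpenPR{{Base: "feat-b", Head: "feat-b-child"}},
		"main",
	)
	fmt.Println(got)
}
```

In this mock run `feat-b` is kept because a child PR still bases on it, which matches the conflicted-rebase case: the branch becomes eligible only after the child is retargeted.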

## Test Plan

- ✅ `yaml.safe_load` parses the workflow.
- ✅ `bash -n` on the extracted `run:` script.
- ✅ Simulated the sweep loop with mock data (dedup, skip-gone,
skip-base, consider-delete all behave correctly).
- Post-merge: the next Rebase Stack run should log child rebases (or "No
open child PRs … nothing to rebase"), then a sweep that deletes the
merged branch when nothing depends on it.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
@behinddwalls behinddwalls changed the base branch from main to preetam/api-base-proto June 17, 2026 20:22
@behinddwalls behinddwalls requested a review from sbalabanov June 17, 2026 20:22
@behinddwalls behinddwalls marked this pull request as draft June 17, 2026 20:22
behinddwalls added a commit that referenced this pull request Jun 17, 2026
…y) (#261)

## Summary
### Why?

Each service's gateway proto redefines its own `Change` message (and
SubmitQueue its own `Strategy` enum), duplicating wire contracts that
are meant to stay identical across domains. This is the proto-level gap
mirroring what `platform/base/{change,mergestrategy}` already solved for
the Go entities, and Stovepipe's upcoming gateway API would otherwise
add a third copy of `Change`.

### What?

Introduces an `api/base/` proto tree as the shared-wire-contract analog
of `platform/base/`:

- Adds `api/base/change` (`uber.base.change.Change`) and
`api/base/mergestrategy` (`uber.base.mergestrategy.Strategy`), each a
message-only contract that domains import instead of redefining.
- Teaches the hermetic codegen rule (`tool/proto/proto_codegen.bzl`) to
resolve cross-package proto `import`s (exec-root proto_path + imported
sources as action inputs) and to skip the gRPC/YARPC generators for
service-less contracts via a new `gen_services` attribute.
- Migrates the SubmitQueue gateway proto to import the shared
`Change`/`Strategy` and drops its local definitions; updates Go
consumers (land controller + unit/integration/e2e tests) to the shared
`protopb` packages.
- Makefile now creates the `protopb/` output dir and marks copied stubs
writable, so net-new proto packages generate cleanly.

Known cosmetic gazelle warning: resolving the new cross-proto import
prints "multiple rules may be imported" for the `# keep` rules_go
go_proto_library alias. It does not change generated BUILD files, the
build, or the committed-files path that consumers resolve via the root
`# gazelle:resolve go` directives.

## Test Plan
- ✅ `make proto` (regenerates stubs, including the message-only base
protos)
- ✅ `./tool/bazel build //...` (213 targets)
- ✅ `make test` (54 unit tests pass)
- ✅ `make fmt`

## Issues


## Stack
1. @ #261
1. #259
1. #260

Co-authored-by: Cursor <cursoragent@cursor.com>
@behinddwalls behinddwalls marked this pull request as ready for review June 17, 2026 21:38
Base automatically changed from preetam/api-base-proto to main June 17, 2026 21:38
@behinddwalls behinddwalls marked this pull request as draft June 17, 2026 21:57
@behinddwalls behinddwalls marked this pull request as ready for review June 17, 2026 22:21
@behinddwalls behinddwalls force-pushed the messagequeue-rfc branch 2 times, most recently from b514f27 to 8ea502e Compare June 18, 2026 04:43
@behinddwalls behinddwalls changed the base branch from main to preetam/messagequeue-topic-keys June 18, 2026 04:43
behinddwalls added a commit that referenced this pull request Jun 18, 2026
## Summary
### Why?

The message queue topic-binding proto option was named `topics`, which
reads as a concrete wire topic name. It is not — it carries a stable
**logical topic key**. Each implementer maps the key to whatever topic
name its broker/queue requires (subject to that backend's naming
constraints); on our Go side the keys are `consumer.TopicKey` values,
mapped to concrete names through `TopicRegistry`. The name `topics`
invited the wrong mental model.
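
The key-versus-name distinction can be sketched as follows. The types below are simplified, hypothetical stand-ins: the real `consumer.TopicKey` and `TopicRegistry` signatures are not shown in this PR, and the concrete topic names are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// TopicKey is a stable logical key; it is not a wire topic name.
// Hypothetical stand-in for consumer.TopicKey.
type TopicKey string

// TopicRegistry maps logical keys to the concrete names one backend
// requires. Hypothetical stand-in for the real registry.
type TopicRegistry map[TopicKey]string

// ErrUnknownTopicKey is returned when a key has no mapping.
var ErrUnknownTopicKey = errors.New("unknown topic key")

// Resolve returns the backend-specific name for a logical key.
func (r TopicRegistry) Resolve(key TopicKey) (string, error) {
	name, ok := r[key]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownTopicKey, key)
	}
	return name, nil
}

func main() {
	const key TopicKey = "merge-conflict-checker"
	// Two implementers map the same key to different concrete names,
	// each following its own backend's naming constraints (names invented).
	backendA := TopicRegistry{key: "runway.merge_conflict_checker.v1"}
	backendB := TopicRegistry{key: "runway-merge-conflict-checker"}
	for _, reg := range []TopicRegistry{backendA, backendB} {
		name, err := reg.Resolve(key)
		if err != nil {
			panic(err)
		}
		fmt.Println(key, "->", name)
	}
}
```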

### What?

Rename the `google.protobuf.MessageOptions` extension `topics` →
`topic_keys` (field number `50001` unchanged, so the wire/extension
layout is identical) and reframe its doc comment to say it carries a
logical key, not a wire name. Regenerated `messagequeue.pb.go`
(`E_Topics` → `E_TopicKeys`).

This is the base of a stack; the runway contract and the contract RFC
that consume the option are updated in the branches stacked on top.

## Test Plan
✅ `make proto` — descriptor field renamed (`name=topics` →
`name=topic_keys`), field number `50001` unchanged
✅ `./tool/bazel build //...`

## Issues


## Stack
1. @ #264
1. #259
1. #260
1. #245
1. #247
Specify how queue payloads are defined, located by audience, and bound to
topic keys. Payloads are Protobuf serialized as protobuf JSON: the .proto is
the language-neutral authority, the Go binding is generated from it (so it
cannot drift), and topic keys bind to payloads via a custom proto option.
External contracts live in api/{domain}/messagequeue, internal in
{domain}/core/messagequeue, with the split enforced by Bazel visibility.
@behinddwalls behinddwalls changed the base branch from preetam/messagequeue-topic-keys to main June 18, 2026 18:21
@behinddwalls behinddwalls merged commit d584539 into main Jun 18, 2026
15 of 27 checks passed
@behinddwalls behinddwalls deleted the messagequeue-rfc branch June 18, 2026 18:37

### Location: audience decides

A contract is **external** when something outside its owning domain depends on it — another domain's service, or a client written in another language. It is **internal** when only the owning domain's own services use it. The test is concrete: *does anything outside this domain compile or deserialize against it?*

Contributor


It might be better to give a definition of a "domain".

behinddwalls added a commit that referenced this pull request Jun 18, 2026
## Summary
### Why?

Runway's merge queues are the first cross-domain message queue contract:
a client (potentially non-Go) publishes a merge request and consumes the
result without access to Runway's Go types or storage. Per the RFC
(#259) this needs a language-neutral, proto-defined contract.

### What?

Establishes the message queue contract pattern using Runway's merge
queues as the reference:

- Adds `api/runway/messagequeue/proto/merge.proto` defining
`MergeRequest`/`MergeResult`, reusing the shared `Change` and `Strategy`
types and the `topics` option from `api/base`; generated into `protopb`.
- Payloads are serialized as **protobuf JSON** via thin `protojson`
helpers (`MergeRequestToBytes`/`MergeRequestFromBytes` and the
`MergeResult` counterparts); a `Topics()` reflection helper exposes the
topic binding. Topic keys are co-located with the contract.
- Moves the merge bindings out of `runway/entity` into
`api/runway/messagequeue`.
- A drift test round-trips the payloads and asserts every Runway topic
key is bound to exactly one message via the `topics` option.
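
The exactly-one-binding assertion in the drift test can be sketched as a plain check. This is a simplified illustration, not the test in `api/runway/messagequeue`: the real test reads bindings from the proto option by reflection, whereas here they are passed in as a map, and `bindingProblems` plus the exact key strings are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// bindingProblems reports every topic key that is not bound to exactly
// one message. bindings maps a message name to the topic keys it
// declares (in the real test these would come from the proto option).
func bindingProblems(keys []string, bindings map[string][]string) []string {
	count := make(map[string]int)
	for _, declared := range bindings {
		seen := make(map[string]bool)
		for _, k := range declared {
			if !seen[k] { // count each message at most once per key
				seen[k] = true
				count[k]++
			}
		}
	}
	var problems []string
	for _, k := range keys {
		if n := count[k]; n != 1 {
			problems = append(problems, fmt.Sprintf("%s: bound to %d messages, want 1", k, n))
		}
	}
	sort.Strings(problems) // deterministic order for test output
	return problems
}

func main() {
	// Key strings are illustrative, taken from the queue names in this stack.
	keys := []string{"merge-conflict-checker", "merge-conflict-checker-signal", "merger", "merger-signal"}
	bindings := map[string][]string{
		"MergeRequest": {"merge-conflict-checker", "merger"},
		"MergeResult":  {"merge-conflict-checker-signal", "merger-signal"},
	}
	fmt.Println(len(bindingProblems(keys, bindings)), "problems")
}
```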

## Test Plan
- ✅ `./tool/bazel test //api/runway/messagequeue:messagequeue_test`
- ✅ `./tool/bazel build //...`

## Issues


## Stack
1. #264
1. #259
1. @ #260
1. #245
1. #247
behinddwalls added a commit that referenced this pull request Jun 18, 2026
…#245)

## Summary
### Why?

The merge-conflict check ran synchronously inside the `validate`
consumer by calling the `mergechecker` extension inline. A real merge
attempt is slow and I/O-heavy, so doing it on the partition lease blocks
the pipeline and couples SubmitQueue to the checker's latency. This
moves the check to an asynchronous round-trip with runway, modelled on
`build`/`buildsignal` but across a service boundary.

### What?

The pipeline gains `validate ⇢ (runway) ⇢ mergeconflictsignal → batch`:

- `validate` drops the inline `mergechecker` call and, after its
existing dedup + change-metadata + claim work, publishes the full
`MergeRequest` to the runway-owned `merge-conflict-checker` queue, keyed
by the request id as the client-owned correlation id. The id
round-trips, so the result correlates straight back to the request
(unlike `build`, whose server-generated id needs a mapping store). The
hand-off to runway is retryable.
- `mergeconflictsignal` (new) consumes runway's `MergeResult` off
`merge-conflict-checker-signal`, advances the request to `batch` when
mergeable, or fails it when conflicted.
- DLQ reconcilers drive the request to `Error` on dead-letter; the
signal DLQ reads the request id straight off the result.
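
The result handling in `mergeconflictsignal` can be sketched as a state transition keyed by the echoed request id. All names below (`MergeResult` fields, `State` values, `applyResult`) are hypothetical simplifications; the real consumer works against storage and the generated contract types.

```go
package main

import "fmt"

// MergeResult is a hypothetical, trimmed view of runway's result.
// ID is the client-owned correlation id echoed back from the request.
type MergeResult struct {
	ID        string
	Mergeable bool
}

// State is a simplified request state; the names are illustrative.
type State string

const (
	StateValidate State = "validate"
	StateBatch    State = "batch"
	StateFailed   State = "failed"
)

// applyResult correlates the result to its request by id and either
// advances the request to batch or fails it. No mapping store is
// needed because the id round-trips unchanged.
func applyResult(states map[string]State, res MergeResult) error {
	if _, ok := states[res.ID]; !ok {
		return fmt.Errorf("no request with id %q", res.ID)
	}
	if res.Mergeable {
		states[res.ID] = StateBatch
	} else {
		states[res.ID] = StateFailed
	}
	return nil
}

func main() {
	states := map[string]State{"req-1": StateValidate, "req-2": StateValidate}
	for _, res := range []MergeResult{{ID: "req-1", Mergeable: true}, {ID: "req-2", Mergeable: false}} {
		if err := applyResult(states, res); err != nil {
			panic(err)
		}
	}
	fmt.Println(states["req-1"], states["req-2"])
}
```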

The check is triggered directly from `validate`, not a standalone
`mergeconflict` trigger stage: that hop only forwarded the request id
before publishing to runway, so folding it into `validate` saves a queue
round-trip and removes a stage (its topic key, consumer, and DLQ entry)
with no change to the cross-boundary contract.

Crossing the runway boundary is why these payloads carry full data
rather than entity IDs; the queue-payload-boundary rule is documented in
CLAUDE.md, with the pipeline diagram and stage table updated in
workflow.md and the superseded `mergechecker` validate-path row noted in
extension-contract.md. The `mergechecker` package is left in-tree
(unused on the validate path); removing it is a follow-up. Runway's
service implementation is out of scope — only its contract (added in
#260) is consumed here.

## Test Plan
- ✅ `bazel build //...`
- ✅ `bazel test //... --test_tag_filters=-integration,-e2e` (56 tests
pass)
- ✅ `make gazelle` clean

## Issues


## Stack
1. #264
1. #259
1. #260
1. @ #245
1. #247
behinddwalls added a commit that referenced this pull request Jun 18, 2026
## Summary
### Why?

The committing merge ran synchronously inside the orchestrator, blocking
the partition lease on a slow, I/O-heavy operation and coupling
SubmitQueue to the merger's latency. This moves the merge to an
asynchronous round-trip with Runway — mirroring the merge-conflict check
(#245), but for the committing merge.

### What?

Adds `batch → merge ⇢ (runway) ⇢ mergesignal → conclude/speculate`:

- `merge` (new) consumes a batch ready to land, builds the full
`MergeRequest` from the batch's member requests (one `MergeStep` per
request in Contains order, carrying each request's change and land
strategy), and publishes it to Runway's `merger` queue keyed by the
batch id as the client-owned correlation id.
- `mergesignal` (new) consumes Runway's `MergeResult` off
`merger-signal`, correlates by the echoed id, and transitions the batch
to `Succeeded` (merged) or `Failed` (could not) — then fans out to
`conclude` (so member requests pick up the outcome) and `speculate` (so
dependents can re-plan). Purely result-driven; no poll loop.
- DLQ reconcilers drive the batch to a terminal state on dead-letter.
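
How `merge` assembles its payload can be sketched as below: one step per member request, in Contains order, keyed by the batch id. The struct fields and `buildMergeRequest` are hypothetical; the real `MergeRequest` and `MergeStep` are generated from the Runway contract and their exact fields are not shown in this PR.

```go
package main

import "fmt"

// Request is a hypothetical view of a member request in a batch.
type Request struct {
	ID       string
	Change   string // stand-in for the shared Change type
	Strategy string // stand-in for the shared Strategy enum
}

// MergeStep and MergeRequest are simplified stand-ins for the
// generated contract types.
type MergeStep struct {
	Change   string
	Strategy string
}

type MergeRequest struct {
	ID    string // batch id, used as the client-owned correlation id
	Steps []MergeStep
}

// buildMergeRequest emits one MergeStep per request, preserving the
// order of the batch's Contains list.
func buildMergeRequest(batchID string, contains []string, byID map[string]Request) (MergeRequest, error) {
	req := MergeRequest{ID: batchID}
	for _, id := range contains {
		r, ok := byID[id]
		if !ok {
			return MergeRequest{}, fmt.Errorf("batch %q references unknown request %q", batchID, id)
		}
		req.Steps = append(req.Steps, MergeStep{Change: r.Change, Strategy: r.Strategy})
	}
	return req, nil
}

func main() {
	byID := map[string]Request{
		"r1": {ID: "r1", Change: "change-1", Strategy: "REBASE"},
		"r2": {ID: "r2", Change: "change-2", Strategy: "SQUASH_MERGE"},
	}
	req, err := buildMergeRequest("batch-7", []string{"r2", "r1"}, byID)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.ID, len(req.Steps), req.Steps[0].Change)
}
```

Because the payload carries full change data rather than entity ids, the receiving side needs no access to the publisher's storage.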

Reuses the Runway merge contract from #260 — the same
`MergeRequest`/`MergeResult`, carried on the committing
`merger`/`merger-signal` topics rather than the dry-run check topics.
Runway's service implementation is out of scope.

## Test Plan
- ✅ `./tool/bazel build //...`
- ✅ `./tool/bazel test //... --test_tag_filters=-integration,-e2e` (57
tests pass)

## Issues


## Stack
1. #264
1. #259
1. #260
1. #245
1. @ #247

3 participants