
Shared-host JS channel fix + reliability hardening + regression tests #9

Merged
wenkaifan0720 merged 6 commits into main from dig/peer-host-channel
Jun 24, 2026

Conversation

@wenkaifan0720

Collaborator

Summary

Makes the shared-host (named-profile, multi-tile) path ready for the Campus pin bump. Motivated by a Campus bug where a peer's edit to an agent_ui tile never reached the owner; the root cause was a page→host JS channel being silently dropped on a shared cef_host. A pre-pin readiness audit then surfaced a cluster of shared-host reliability bugs, which are fixed here.

The channel bug + root fix

  • fix(cef_host) — on a shared host, kOpAddChannel arrived with browserId=0 (createBrowser is queued, so the session hasn't attach()ed yet) and was dropped → the window.<name> shim was never injected → page→host messages died.
  • refactor(cef) — fixed at the root: the Swift session now buffers addChannel until attach() and flushes with the real browserId (and re-sends on re-home), so the op always carries a valid session. cef_host simplified to inject the registering session's own frame.
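
The buffer-until-attach idea from the refactor bullet can be sketched as follows. This is a minimal C++ illustration, not the Swift CefWebSession code: the `Session` class, the `SendAddChannel` callback, and the convention that `browserId == 0` means "not attached yet" are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the wire: receives (browserId, channelName) ops.
using SendAddChannel =
    std::function<void(std::uint32_t browserId, const std::string& name)>;

// Hypothetical session that buffers channel names until a real id is known.
class Session {
 public:
  explicit Session(SendAddChannel send) : send_(std::move(send)) {}

  // Safe to call before or after attach().
  void addChannel(const std::string& name) {
    channels_.push_back(name);  // remembered for the flush and for re-home
    if (browserId_ != 0) send_(browserId_, name);  // already attached: send now
  }

  // Called when the queued createBrowser completes, and again on a re-home.
  void attach(std::uint32_t browserId) {
    assert(browserId != 0 && "attach() requires a real browser id");
    browserId_ = browserId;
    // Flush (or re-send) every registered channel with the real id.
    for (const auto& name : channels_) send_(browserId_, name);
  }

 private:
  SendAddChannel send_;
  std::uint32_t browserId_ = 0;  // 0 means "not attached yet"
  std::vector<std::string> channels_;
};
```

With this shape, an `addChannel` issued before `attach()` produces no op at all, so the host never sees an id of 0; the op is emitted once the id exists and again after each later `attach()`.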

Shared-host reliability (from the audit)

  • EINTR-resilient readAll/writeAll — a signal mid-IO no longer looks like a dead pipe and tears down the whole host.
  • handleHostDeath now stops live CDP relays + fails pending targetId waiters (no leaked listener ports / hung enableAgentControl).
  • CDP sendRawToClient off the shared reader thread (per-relay serial queue) — one stuck client can't starve sibling agent-controlled tiles (was up to 2s SO_SNDTIMEO per frame).
  • Never synthesize an empty browserContextId — crashed Playwright's CRBrowser assertion on reconnect.
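
For the EINTR bullet, a retry loop of the following shape is the usual POSIX pattern. This is a sketch under the assumption of blocking file descriptors; the function names mirror the bullet, but the bodies are illustrative and are not the project's code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Reads exactly n bytes. Returns false on EOF or on a real error.
bool readAll(int fd, void* buf, std::size_t n) {
  assert(buf != nullptr || n == 0);
  char* p = static_cast<char*>(buf);
  while (n > 0) {
    const ssize_t r = ::read(fd, p, n);
    if (r > 0) {
      p += r;
      n -= static_cast<std::size_t>(r);
    } else if (r < 0 && errno == EINTR) {
      continue;  // interrupted by a signal: retry, the pipe is still alive
    } else {
      return false;  // r == 0 is EOF; any other errno is a real failure
    }
  }
  return true;
}

// Writes exactly n bytes. Returns false on a real error.
bool writeAll(int fd, const void* buf, std::size_t n) {
  assert(buf != nullptr || n == 0);
  const char* p = static_cast<const char*>(buf);
  while (n > 0) {
    const ssize_t w = ::write(fd, p, n);
    if (w > 0) {
      p += w;
      n -= static_cast<std::size_t>(w);
    } else if (w < 0 && errno == EINTR) {
      continue;  // interrupted before any byte was written: retry
    } else {
      return false;
    }
  }
  return true;
}
```

The key point is that `EINTR` is handled separately from EOF and from other errors, so only the latter two are reported to the caller as a dead pipe.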

Docs + tests

  • README — the per-profile trust/isolation contract ("one trust domain per profile").
  • Unit test for channel call-order independence (a channel registered before create() must be re-sent on create()), plus test/run_channel_integration.sh which runs the channel / shared-host-channel / agent-control(CDP) probes against a real cef_host. The mocked integration_test can't catch native delivery — this is the layer that would have caught the regression.

Verification

  • flutter test green (52 controller tests incl. the new one).
  • Example builds clean; run_channel_integration.sh passes (shared-host channel routes per-session); multiview_probe passes all 13 agent-control isolation/lifecycle checks.

Scope notes

  • The audit's security findings (process-global channels, shared cookie jar) are a documented limitation, not changed here: the trust boundary is the integrator's, set via profiles. Per-session JS isolation is a deferred follow-up, needed only for mutually-distrusting authors on one shared profile.
  • Non-blocking follow-ups tracked: per-op IPC payload caps, reserved channel-name denylist, dpr in opResize, CDP epoch validation, a CI assert that release builds are CEF_HOST_ADHOC=OFF.

🤖 Generated with Claude Code

wenkaifan0720 and others added 6 commits June 24, 2026 13:19
…drop addChannel before browser create)

A page->host JS channel (window.<name>.postMessage) was never injected on a
SHARED-host session (a named profile where N tiles share one cef_host), so the
page's postMessage silently no-op'd. Root cause: on a shared host a session's
createBrowser is QUEUED (pendingCreates), so the addChannel op arrives with
browserId=0 (sent before the session's attach()), and
`case kOpAddChannel: if (!slot) break;` SILENTLY DROPPED it -> the name never
entered the process-global g_channels -> OnLoadStart injected no shim ->
window.<name> undefined.

This is the Campus "a peer's agent_ui edit never reaches the host" bug: the peer
tile rides a shared cef_host, so window.campusHost was never injected and
campus.emit died before the wire (host->page still worked, so the tile rendered
and received state, which masked it).

Fix: don't require `slot` for kOpAddChannel. DoAddChannel now registers the name
in g_channels regardless (OnLoadStart injects it on each future load) AND injects
the shim into every already-loaded browser's main frame (order-independent), and
is null-safe.
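
A toy model of this host-side behavior, for illustration only. `g_channels` is named in the text above; the `Browser` struct, the `doAddChannel` and `onLoadStart` signatures, and the use of a set of names to stand in for the injected shim are all assumptions, not the cef_host source.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one browser on the shared host.
struct Browser {
  bool loaded = false;          // main frame has started loading
  std::set<std::string> shims;  // channel names injected into the page
};

// Process-global channel registry, as described in the commit message.
std::set<std::string> g_channels;

// Registers the name without needing the sender's slot, then injects the
// shim into every already-loaded browser. Null entries are skipped.
void doAddChannel(const std::string& name,
                  const std::vector<Browser*>& browsers) {
  assert(!name.empty());
  g_channels.insert(name);
  for (Browser* b : browsers) {
    if (b != nullptr && b->loaded) b->shims.insert(name);
  }
}

// On each load, every registered channel is injected into the page.
void onLoadStart(Browser& b) {
  b.loaded = true;
  b.shims.insert(g_channels.begin(), g_channels.end());
}
```

Because registration no longer depends on a slot, a browser that loads later still picks the channel up in `onLoadStart`.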

Reproduced + verified in isolation via example/lib/channel_probe_shared.dart
(two controllers on one shared cef_host): before, both sessions host=N (shim
absent, postMessage dead); after, host=Y and the shim postMessage reaches each
session's own handler. The single-controller case (channel_probe.dart) always
passed, isolating the bug to the shared/multi-session path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… race at the root

Follow-up to 5aae358. Rather than relying on cef_host accepting a browserId=0
addChannel op, the Swift CefWebSession now BUFFERS channel names and flushes them
(with the real wire id) in attach() — and re-sends on a re-home. So kOpAddChannel
always carries a valid browserId + slot. The cef_host side is simplified back to
injecting the registering session's OWN frame (+ g_channels for the OnLoadStart
re-load path), dropping the broad inject-into-all-browsers loop (the O(N*M) /
cross-session-injection smell). The cef_host null-safe path stays as defense.

Verified on a real shared host (FLUTTER_CEF_ALLOW_INSECURE_PROFILE=1, two
controllers on one cef_host) via example/lib/channel_probe_shared.dart: host=Y on
both and the shim postMessage reaches each session's own handler.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ath CDP cleanup

Two confirmed shared-host correctness bugs from the pre-pin audit:
- readAll/writeAll treated an EINTR-interrupted syscall as a dead pipe and tore
  down the WHOLE shared host (all N browsers wedge). Now retry on EINTR, matching
  the C++ ReadAll/WriteAll on the host side.
- handleHostDeath (unexpected cef_host death) cleared the create queues but leaked
  live CdpRelay listeners and stranded in-flight resolveTargetId waiters
  (enableAgentControl callers hung forever). Now mirror shutdown()'s teardown:
  stop all relays + fail all pending targetId waiters with nil.
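
The second item amounts to sharing one teardown routine between the orderly and the unexpected exit paths. A minimal C++ sketch, with the caveat that the real code is Swift: `HostClient`, `Relay`, and the callback type are hypothetical, and `std::nullopt` stands in for Swift's `nil`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CDP relay that owns a listener.
struct Relay {
  bool running = true;
  void stop() { running = false; }
};

using TargetIdCallback = std::function<void(std::optional<std::string>)>;

class HostClient {
 public:
  std::shared_ptr<Relay> startRelay() {
    relays_.push_back(std::make_shared<Relay>());
    return relays_.back();
  }

  // The callback fires later with the id, or with nullopt on teardown.
  void resolveTargetId(TargetIdCallback cb) {
    assert(static_cast<bool>(cb));
    waiters_.push_back(std::move(cb));
  }

  void shutdown() { teardown(); }
  void handleHostDeath() { teardown(); }  // same cleanup as shutdown()

 private:
  void teardown() {
    for (auto& relay : relays_) relay->stop();  // no leaked listeners
    relays_.clear();
    std::vector<TargetIdCallback> pending;
    pending.swap(waiters_);  // detach first so callbacks may re-register
    for (auto& cb : pending) cb(std::nullopt);  // no hung callers
  }

  std::vector<std::shared_ptr<Relay>> relays_;
  std::vector<TargetIdCallback> waiters_;
};
```

Swapping the waiter list out before invoking callbacks also makes a second `handleHostDeath()` call a no-op.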

Verified: example builds clean; multiview_probe (agent-control on a shared host)
passes all 13 isolation/lifecycle checks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…; never emit empty browserContextId

Two confirmed shared-host CDP bugs from the pre-pin audit:
- sendRawToClient ran on the SHARED CDP reader thread, and writeFrameLocked blocks
  up to SO_SNDTIMEO (~2s) on a stuck client — so one wedged agent client stalled
  CDP delivery to EVERY sibling agent-controlled tile on the same host. Now hop
  onto a per-relay SERIAL queue (preserves this client's frame order; isolates the
  wedged client); clientFd is re-validated under clientLock inside, so a concurrent
  stop()/handler-close is a graceful no-op.
- synthesizeAttachedToTarget could emit browserContextId:"" on reconnect (when the
  real id was never captured for the relay), crashing Playwright's CRBrowser
  assertion. Now skip the synthesized event; the real attachedToTarget carries it.
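
The per-relay serial queue in the first item can be approximated in C++ with one worker thread per relay. This is a sketch of the general technique, not the Swift implementation; `SerialQueue` and `post` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// One worker per relay: jobs run in FIFO order, off the caller's thread, so a
// slow job only delays later jobs on the same queue.
class SerialQueue {
 public:
  SerialQueue() : worker_([this] { run(); }) {}

  ~SerialQueue() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_one();
    worker_.join();  // the worker drains queued jobs before exiting
  }

  SerialQueue(const SerialQueue&) = delete;
  SerialQueue& operator=(const SerialQueue&) = delete;

  void post(std::function<void()> job) {
    assert(static_cast<bool>(job));
    {
      std::lock_guard<std::mutex> lock(mu_);
      jobs_.push_back(std::move(job));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
        if (jobs_.empty()) return;  // stopping and fully drained
        job = std::move(jobs_.front());
        jobs_.pop_front();
      }
      job();  // runs outside the lock so post() never waits on a slow job
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> jobs_;
  bool stopping_ = false;
  std::thread worker_;  // declared last: starts after the members above exist
};
```

In this model the shared reader thread would only call `post()`, which returns immediately; a blocking write to one client then stalls that client's queue and nothing else.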

Verified: multiview_probe passes all 13 checks (CDP delivery + isolation intact).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A named (shared) profile is one cef_host process with one cookie jar, so sessions
sharing it are NOT mutually isolated: shared cookie jar + process-global JS
channels (per-message routing stays per-session). Spell out the rule — co-locate
only mutually-trusting content on one profile; give distrusting content its own
profile or the ephemeral default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Dart integration_test mocks the host method channel, so it cannot catch
native channel-delivery regressions — which is how the shared-host page->host
channel bug shipped. Add:
- a fast unit test (cef_web_controller_test.dart) for call-order independence:
  a channel registered before create() MUST be re-sent on create();
- test/run_channel_integration.sh: builds cef_host + the example and runs the
  channel / shared-host-channel / agent-control(CDP) probes against a REAL host,
  asserting each /tmp result. This is the layer that would have caught the B->A
  Campus regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wenkaifan0720 wenkaifan0720 merged commit a9aa30d into main Jun 24, 2026
1 check passed