ClawBench adapter (Kernel env + cua agent) by rgarcia · Pull Request #44 · kernel/cua

rgarcia · 2026-06-27T12:47:49Z

Summary

Harbor adapter for ClawBench (everyday online tasks on live sites) on the Kernel env + cua agent. Re-targets the upstream Docker-shaped adapter at the Kernel environment: reuses upstream's dataset mapping + the two-stage grader (CDP submission-interception + body-LLM-judge verify.py) verbatim, re-hosting the interceptor against the Kernel session's CDP. Stacked on #40.

Validated

The CDP-second-client spike PASSED — a raw-CDP Fetch.enable interceptor attaches to the live Kernel session alongside cua and works (the gate for the whole approach).
Interceptor wired as a host-side agent-setup sidecar via a thin ClawbenchCuaAgent wrapper (zero shared-core/harbor edits).
Bug found + fixed: Harbor's per-task [agent.env] isn't surfaced to the host agent, so the interceptor saw no eval-schema; the wrapper now reads it from the task dir and passes it inline.
AgentMail disposable-inbox integration behind an EmailProvider abstraction (replaces upstream's PurelyMail).
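
To make the second-client interception above concrete, here is a minimal sketch of the CDP messages such a sidecar might exchange. The `Fetch.enable`, `Fetch.requestPaused`, `Fetch.failRequest`, and `Fetch.continueRequest` names are standard Chrome DevTools Protocol. The schema fields (`url_pattern`, `method`) and the helper names are assumptions for illustration; the real `eval_schema.json` shape and `interceptor.py` internals are not shown in this PR.

```python
import json
from typing import Any, Optional

# Hypothetical schema shape; the real eval_schema.json fields may differ.
EVAL_SCHEMA = {"url_pattern": "/api/checkout", "method": "POST"}


def fetch_enable_message(msg_id: int) -> str:
    """CDP command that pauses every request at the Request stage."""
    return json.dumps({
        "id": msg_id,
        "method": "Fetch.enable",
        "params": {"patterns": [{"urlPattern": "*", "requestStage": "Request"}]},
    })


def matches_schema(request: dict[str, Any], schema: dict[str, str]) -> bool:
    """Assumed matching rule: same HTTP method and the pattern occurs in the URL."""
    return (
        request.get("method", "").upper() == schema["method"].upper()
        and schema["url_pattern"] in request.get("url", "")
    )


def respond_to_paused(
    event: dict[str, Any], schema: dict[str, str], msg_id: int
) -> tuple[str, Optional[dict[str, Any]]]:
    """Return the CDP reply for a Fetch.requestPaused event, plus a capture if blocked."""
    params = event["params"]
    request = params["request"]
    if matches_schema(request, schema):
        # Block the submission and keep its body for the Stage-2 judge.
        capture = {
            "url": request["url"],
            "method": request["method"],
            "body": request.get("postData"),
        }
        reply = {
            "id": msg_id,
            "method": "Fetch.failRequest",
            "params": {"requestId": params["requestId"], "errorReason": "BlockedByClient"},
        }
        return json.dumps(reply), capture
    # Everything else passes through untouched so the agent's browsing is unaffected.
    reply = {
        "id": msg_id,
        "method": "Fetch.continueRequest",
        "params": {"requestId": params["requestId"]},
    }
    return json.dumps(reply), None
```

The functions only build and interpret JSON messages, so they can be exercised without a live session; a real sidecar would send them over the session's CDP WebSocket.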

Live smoke — sonnet-4-6, 20 non-email tasks, browser pool

0/20 (mean 0.0), 1/20 interception fired end-to-end. Unsurprising — ClawBench is hard (frontier best ~33%), this is a v1 adapter on a mid model, and the email + login-walled cohorts were excluded/unscored. The value is that the pipeline + interceptor work end-to-end.

Deferred (noted, not blocking)

Email cohort: AGENTMAIL_API_KEY wasn't in the smoke session env; the AgentMail provider is wired but those tasks were excluded.
Login-walled tasks need per-task Kernel profiles to score.
Agentic 5-layer evaluator + MP4 = phase 2 (MP4 needs X11; not needed for parity).

Details in SMOKE.md. Unit tests + ruff green; CDP spike + interceptor validated live.

Note

Medium Risk
Large new benchmark surface (live sites, judge and AgentMail API keys, host CDP sidecar) but scoped to the adapter package without core Harbor/cua changes; grading contract changes (reward.json shape) affect ClawBench trials only.

Overview
Introduces benchmarks/adapters/clawbench, a Harbor adapter that maps ClawBench V1/V2 cases to Kernel + cua instead of upstream’s Docker/runtime-server stack. A CLI (clawbench_adapter.main) emits flat task.toml tasks with environment/kernel.json, copied verifier assets, and upstream-vendored instruction building.

ClawbenchCuaAgent wraps CuaHarborAgent to provision disposable email/persona (host prepare_task.py), inline identity into the prompt (no VM file tools), start a host-side interceptor.py CDP Fetch sidecar with eval_schema from <task>/tests/eval_schema.json, then upload /data/interception.json for Stage-2 grading. Email uses AgentMailProvider when AGENTMAIL_API_KEY is set.
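
The provider selection described above could look roughly like the sketch below. `AgentMailProvider` and `AGENTMAIL_API_KEY` come from this PR; the `Inbox` type, the method names, `NoInboxProvider`, and `select_provider` are hypothetical, and the AgentMail calls are deliberately left unimplemented rather than guessed.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional, Protocol


@dataclass
class Inbox:
    address: Optional[str]  # None means no real inbox was provisioned


class EmailProvider(Protocol):
    """Hypothetical interface; the adapter's actual method names are not shown in the PR."""

    def create_inbox(self) -> Inbox: ...

    def delete_inbox(self, inbox: Inbox) -> None: ...


class NoInboxProvider:
    """Fallback for the non-email subset: nothing to create, nothing to leak."""

    def create_inbox(self) -> Inbox:
        return Inbox(address=None)

    def delete_inbox(self, inbox: Inbox) -> None:
        return None


class AgentMailProvider:
    """Placeholder: the real AgentMail API calls are intentionally omitted here."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def create_inbox(self) -> Inbox:
        raise NotImplementedError("call the AgentMail API here")

    def delete_inbox(self, inbox: Inbox) -> None:
        raise NotImplementedError("call the AgentMail API here")


def select_provider(env: Mapping[str, str]) -> EmailProvider:
    """Use AgentMail only when a non-empty API key is present in the given env."""
    api_key = env.get("AGENTMAIL_API_KEY", "").strip()
    return AgentMailProvider(api_key) if api_key else NoInboxProvider()
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the selection testable and makes it explicit which env the host-side scripts actually see.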

Stage-2 verify.py ships lenient (default, leaderboard “Reward”) and strict rubrics via CLAWBENCH_JUDGE_RUBRIC, and writes numeric-only reward.json plus diagnostics in clawbench-result.json for Harbor’s float coercion. Docs (README, PARITY.md, SMOKE.md) and pytest coverage cover generator shape, agent lifecycle, interceptor matching, judge behavior, and email cleanup.

^{Reviewed by Cursor Bugbot for commit f185010. Bugbot is set up for automated code reviews on this repo. Configure here.}

Re-targets upstream ClawBench's Docker-shaped Harbor adapter at the Kernel environment: a Python generator turns test-cases/v2/*/task.json into flat, single-step Harbor task dirs (instruction.md + upstream block + Kernel footer, environment/kernel.json {stealth, viewport} with no start_url, schema_version 1.0 task.toml, tests/ grader bundle).

The dataset mapping and instruction builder are reused from upstream (vendored, with a runtime import shim), and the Stage-2 body-judge verify.py is reused verbatim. The Stage-1 request interceptor is ported from upstream's start_cdp_handler to run as a sidecar against the Kernel session's CDP socket; finalize_capture and cleanup_email wire the verifier flow. Email provisioning is abstracted behind an EmailProvider (AgentMail impl + a no-inbox fallback for the non-email subset).

Build, ruff, and mocked unit tests are green. The live smoke is deferred: it gates on confirming a second raw-CDP Fetch.enable client can attach to a Kernel session alongside cua (see SMOKE.md).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
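
As a rough illustration of the flat task layout this commit describes, the sketch below writes one task directory. The file names (`instruction.md`, `environment/kernel.json`, `task.toml`, `tests/`) come from the commit message. The function name, the viewport dimensions, the `viewport` object shape, and the exact `task.toml` key spelling are assumptions, and the real generator emits more than this.

```python
import json
import tempfile
from pathlib import Path


def write_task_dir(out_root: Path, task_id: str, instruction: str) -> Path:
    """Write a minimal flat task dir (hypothetical helper, not the adapter's API)."""
    task_dir = out_root / task_id
    (task_dir / "environment").mkdir(parents=True, exist_ok=True)
    (task_dir / "tests").mkdir(exist_ok=True)  # grader bundle would be copied here

    (task_dir / "instruction.md").write_text(instruction + "\n")

    # Only stealth + viewport; deliberately no start_url. Values are assumed.
    kernel_cfg = {"stealth": True, "viewport": {"width": 1280, "height": 800}}
    (task_dir / "environment" / "kernel.json").write_text(
        json.dumps(kernel_cfg, indent=2) + "\n"
    )

    # Assumed key spelling for the schema version mentioned in the commit.
    (task_dir / "task.toml").write_text('schema_version = "1.0"\n')
    return task_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = write_task_dir(Path(tmp), "demo-task", "Book a table for two.")
        print(sorted(p.relative_to(created).as_posix() for p in created.rglob("*")))
```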

Resolves the gating CDP question and makes the ClawBench Stage-1 grader actually fire on the Kernel env. A spike confirmed a second raw-CDP Fetch.enable client attaches to a live Kernel session (via cdp_ws_url) and can block requests while cua drives the same session over the control plane.

Adds ClawbenchCuaAgent, a thin CuaHarborAgent subclass that runs interceptor.py as a host-side sidecar around the cua drive (the Kernel base image has no pip/websocket-client; the CDP endpoint is reachable from the host), then uploads /data/interception.json into the VM for the verbatim Stage-2 verify.py. The shared core is untouched.

Smoke (20 V2 non-email tasks, sonnet-4-6 agent + Anthropic judge, pool_size=5): the sidecar attached on all 20, captured the four passive layers, and one block fired end-to-end (judge correctly rejected a telemetry false-positive). Pass rate 0/20 -- the non-login subset hits auth walls and the 450s learning cap timed out every run; no adapter bug. See SMOKE.md.

Fixes surfaced by the smoke:
- eval_schema now reaches the interceptor: read from the host task dir's tests/eval_schema.json (Harbor's [agent.env] is not surfaced on the host agent), passed inline as CLAWBENCH_EVAL_SCHEMA_JSON; the dead [agent.env] block is gone.
- test.sh drops the in-VM finalize (finalization is host-side now).
- interceptor.py accepts an inline schema, falling back to the file for in-VM use.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
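
The eval_schema hand-off in the first and third fixes can be sketched as two small functions, one per side. `CLAWBENCH_EVAL_SCHEMA_JSON` and `tests/eval_schema.json` are named in the commit; the function names and the behaviour when neither source exists (returning `None`) are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Mapping, Optional


def sidecar_env(task_dir: Path, base_env: Mapping[str, str]) -> dict[str, str]:
    """Host side: copy the task's eval schema into the sidecar's environment."""
    env = dict(base_env)
    schema_path = task_dir / "tests" / "eval_schema.json"
    if schema_path.is_file():
        env["CLAWBENCH_EVAL_SCHEMA_JSON"] = schema_path.read_text()
    return env


def load_eval_schema(env: Mapping[str, str], fallback_path: Path) -> Optional[Any]:
    """Interceptor side: prefer the inline schema, fall back to a file (in-VM use)."""
    inline = env.get("CLAWBENCH_EVAL_SCHEMA_JSON", "").strip()
    if inline:
        return json.loads(inline)
    if fallback_path.is_file():
        return json.loads(fallback_path.read_text())
    return None  # assumed: no schema means nothing to match against
```

The dict returned by `sidecar_env` would be passed as the `env=` argument when launching the interceptor subprocess, so the schema travels with the process instead of depending on Harbor's per-task env.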

Validate the ClawBench email cohort live against AgentMail. The provisioning code had never run; fix two bugs the live run surfaced:

- my-info was never provisioned and is unreadable anyway. prepare_task.py was copied into tests/ but nothing invoked it, and the cua harness on Kernel exposes only browser tools (no file/shell), so the agent cannot read ./my-info/ files at all. ClawbenchCuaAgent now provisions the AgentMail inbox + persona host-side before the drive, inlines the email credentials + persona into the instruction so the agent reads its identity from the prompt, and deletes the inbox in teardown. The resume PDF stays unreachable (no file-upload surface), bounding the resume-upload cohort.
- verify.py wrote reward.json as a rich object (judge_match: null, reason, task_id), but Harbor reads it as a flat {key: number} map and coerced every value to float, erroring every trial with a VerifierResult ValidationError. reward.json now emits only numeric reward keys; diagnostics move to clawbench-result.json + reward.txt.

Also: add fpdf2 (resume PDF) as a dependency; gitignore run logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
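
The reward.json fix amounts to splitting one rich result into a numeric-only map and a diagnostics remainder. A minimal sketch follows, using the `reward`, `intercepted`, and `judge_match` keys named elsewhere in this PR. The function name is hypothetical, and omitting a `None` verdict from the numeric map is an assumption; the adapter may encode it differently.

```python
from typing import Any

REWARD_KEYS = ("reward", "intercepted", "judge_match")


def split_result(result: dict[str, Any]) -> tuple[dict[str, float], dict[str, Any]]:
    """Return (numeric reward map, diagnostics) from a rich verifier result."""
    rewards: dict[str, float] = {}
    for key in REWARD_KEYS:
        value = result.get(key)
        # Keep only values that convert cleanly; None and strings stay out.
        if isinstance(value, (bool, int, float)):
            rewards[key] = float(value)
    # Everything that did not make it into the reward map is diagnostic.
    diagnostics = {k: v for k, v in result.items() if k not in rewards}
    return rewards, diagnostics


if __name__ == "__main__":
    rich = {
        "reward": 0,
        "intercepted": True,
        "judge_match": None,
        "reason": "no submission captured",
        "task_id": "demo-task",
    }
    rewards, diagnostics = split_result(rich)
    print(rewards)       # would be written to reward.json
    print(diagnostics)   # would be written to clawbench-result.json
```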

rgarcia · 2026-06-27T15:13:42Z

Update — ClawBench email cohort validated live (with AgentMail) (commit 2483e63):

AgentMail disposable-inbox provisioning works end-to-end (create + cleanup, 0 inboxes leaked).
Fixed a verifier-crashing bug: reward.json was written as a rich object, but Harbor reads it as a flat {key: number} map → every ClawBench trial was erroring at the verifier. Now emits {reward, intercepted, judge_match} numeric keys (diagnostics moved to clawbench-result.json).
The cua agent has no in-VM file tool, so my-info (email + persona) is now inlined into the instruction for the agent to type into forms.
AgentMail limitation: no per-inbox webmail, so the fill-a-real-address cohort works but the click-the-verification-link cohort can't complete in-browser (and resume upload needs a file-upload surface). See SMOKE.md → "Email cohort (live)".
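
Because the agent has no file tool, the identity has to travel in the prompt. One possible shape for that inlining is sketched below; the function name, the wording of the identity block, and the persona field names are all assumptions rather than the adapter's actual format.

```python
from typing import Mapping, Optional


def inline_identity(
    instruction: str, email: Optional[str], persona: Mapping[str, str]
) -> str:
    """Append an identity block to the task instruction (hypothetical format)."""
    lines = ["", "Your identity for this task (type these values into forms as needed):"]
    if email:
        lines.append(f"- email: {email}")
    for field, value in persona.items():
        lines.append(f"- {field}: {value}")
    return instruction.rstrip() + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        inline_identity(
            "Sign up for the newsletter.",
            "a@agentmail.to",
            {"first_name": "Ada", "last_name": "Example"},
        )
    )
```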

Ship both upstream judge rubrics in verify.py (lenient = judge_llm.py, strict = judge.py, both verbatim) and select via CLAWBENCH_JUDGE_RUBRIC (default lenient) so the emitted reward matches the published leaderboard "Reward" column instead of the strict-flavoured number (~half). Default the parse-failure verdict by rubric (lenient -> match, strict -> None), record the chosen rubric in clawbench-result.json, and correct the README's "leaderboard-reproducible" claim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
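
The rubric selection and its parse-failure default can be illustrated with a short sketch. `CLAWBENCH_JUDGE_RUBRIC`, the `lenient` default, and the "lenient -> match, strict -> None" rule come from the commit. The function names, the `{"match": ...}` reply format, and raising on an unknown rubric are assumptions; the upstream judges parse their own formats.

```python
import json
from typing import Mapping, Optional

VALID_RUBRICS = ("lenient", "strict")


def select_rubric(env: Mapping[str, str]) -> str:
    """Read the rubric from the env, defaulting to lenient when unset or empty."""
    rubric = (env.get("CLAWBENCH_JUDGE_RUBRIC") or "lenient").strip().lower()
    if rubric not in VALID_RUBRICS:
        # Assumed behaviour: fail loudly rather than silently pick a rubric.
        raise ValueError(f"unknown rubric {rubric!r}; expected one of {VALID_RUBRICS}")
    return rubric


def parse_verdict(raw_reply: str, rubric: str) -> Optional[bool]:
    """Parse a judge reply; on failure fall back to the rubric's default verdict."""
    try:
        match = json.loads(raw_reply)["match"]  # hypothetical reply format
        if isinstance(match, bool):
            return match
    except (ValueError, KeyError, TypeError):
        pass
    # Parse failure: lenient counts it as a match, strict leaves it undecided.
    return True if rubric == "lenient" else None
```

Under these assumptions an unparseable reply counts as a match with the lenient rubric and stays undecided with the strict one, so the same run can score differently depending on the rubric.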

rgarcia · 2026-06-27T15:52:30Z

Parity pass vs TIGER-AI-Lab/ClawBench (f185010): the published leaderboard uses the lenient rubric (runner/judge_llm.py), but we'd shipped a strict-flavoured variant — so our number wasn't leaderboard-reproducible. Now ships both upstream rubrics byte-for-byte and defaults to lenient (= the leaderboard "Reward" column), selectable via CLAWBENCH_JUDGE_RUBRIC. Intentional Kernel adaptations (numeric-only reward.json, host-side interceptor sidecar, AgentMail, inlined persona, no Docker/MP4) confirmed and kept. See PARITY.md.

cursor

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Host email env ignores Harbor
- Host-side prepare and cleanup subprocesses now use an env helper that forwards AGENTMAIL_API_KEY from agent extra_env, restoring live inbox provisioning and teardown.

Or push these changes by commenting:

@cursor push 6a97b228a9

Preview (6a97b228a9)

diff --git a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
--- a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
+++ b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
@@ -138,7 +138,7 @@
                     "--state-file",
                     str(state_file),
                 ],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=log,
                 stderr=subprocess.STDOUT,
             )
@@ -157,13 +157,27 @@
         try:
             subprocess.call(
                 [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
             )
         except Exception as exc:
             self.logger.warning(f"clawbench: inbox cleanup failed: {exc}")
 
+    def _host_script_env(self) -> dict[str, str]:
+        """Env for host-side setup/teardown helpers.
+
+        ``prepare_task.py``/``cleanup_email.py`` select AgentMail strictly from
+        ``AGENTMAIL_API_KEY`` in subprocess env. Keep host env defaults, but let
+        ``--ae AGENTMAIL_API_KEY=...`` override so these scripts can provision and
+        tear down real inboxes.
+        """
+        child_env = {**os.environ}
+        api_key = self.extra_env.get("AGENTMAIL_API_KEY", "").strip()
+        if api_key:
+            child_env["AGENTMAIL_API_KEY"] = api_key
+        return child_env
+
     def _start_interceptor(
         self, environment, data_dir: Path
     ) -> subprocess.Popen | None:

diff --git a/benchmarks/adapters/clawbench/tests/test_agent.py b/benchmarks/adapters/clawbench/tests/test_agent.py
--- a/benchmarks/adapters/clawbench/tests/test_agent.py
+++ b/benchmarks/adapters/clawbench/tests/test_agent.py
@@ -230,12 +230,14 @@
 
 def test_prepare_my_info_invokes_prepare_task(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     myinfo = tmp_path / "myinfo"
     state = tmp_path / "data" / "task-state.json"
     calls = {}
 
     def fake_call(cmd, env=None, **kw):
         calls["cmd"] = cmd
+        calls["env"] = env
         return 0
 
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -245,14 +247,18 @@
     assert calls["cmd"][1].endswith("prepare_task.py")
     assert str(myinfo / "my-info") in calls["cmd"]
     assert str(state) in calls["cmd"]
+    assert calls["env"]["AGENTMAIL_API_KEY"] == "am-key"
 
 
 def test_cleanup_my_info_noops_without_state(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     called = {"n": 0}
+    envs = []
 
     def fake_call(*a, **k):
         called["n"] += 1
+        envs.append(k.get("env"))
         return 0
 
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -263,3 +269,4 @@
     state.write_text('{"email": {"address": "a@agentmail.to"}}')
     agent._cleanup_my_info(state)
     assert called["n"] == 1  # state present -> cleanup_email.py invoked
+    assert envs[-1]["AGENTMAIL_API_KEY"] == "am-key"


cursor · 2026-06-27T22:37:59Z

+        try:
+            subprocess.call(
+                [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
+                env={**os.environ},


Host email env ignores Harbor

Medium Severity

The AGENTMAIL_API_KEY isn't propagated to the host-side prepare_task.py and cleanup_email.py subprocesses. These scripts, which provision disposable email inboxes, only receive os.environ. The key is incorrectly placed in task.toml's [verifier.env], which Harbor prevents from reaching agent host-side processes, causing email cohort tasks to lack real inboxes.

Additional Locations (1)

benchmarks/adapters/clawbench/src/clawbench_adapter/adapter.py#L114-L121


rgarcia and others added 3 commits June 27, 2026 09:13

rgarcia marked this pull request as ready for review June 27, 2026 22:33

cursor Bot reviewed Jun 27, 2026

rgarcia wants to merge 4 commits into hypeship/cua-bench-online-mind2web from hypeship/bench-clawbench
