ClawBench adapter (Kernel env + cua agent) #44

Open
rgarcia wants to merge 4 commits into hypeship/cua-bench-online-mind2web from hypeship/bench-clawbench

rgarcia commented Jun 27, 2026

Contributor

Summary

Harbor adapter for ClawBench (everyday online tasks on live sites) on the Kernel env + cua agent. Re-targets the upstream Docker-shaped adapter at the Kernel environment: reuses upstream's dataset mapping + the two-stage grader (CDP submission-interception + body-LLM-judge verify.py) verbatim, re-hosting the interceptor against the Kernel session's CDP. Stacked on #40.

Validated

  • The CDP-second-client spike PASSED — a raw-CDP Fetch.enable interceptor attaches to the live Kernel session alongside cua and works (the gate for the whole approach).
  • Interceptor wired as a host-side agent-setup sidecar via a thin ClawbenchCuaAgent wrapper (zero shared-core/harbor edits).
  • Bug found + fixed: Harbor's per-task [agent.env] isn't surfaced to the host agent, so the interceptor saw no eval-schema; the wrapper now reads it from the task dir and passes it inline.
  • AgentMail disposable-inbox integration behind an EmailProvider abstraction (replaces upstream's PurelyMail).
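The second-client interception validated by the spike can be sketched at the protocol level. This is a minimal illustration, not the adapter's interceptor.py: it only builds the JSON frames a raw CDP client would send. `Fetch.enable`, `Fetch.requestPaused`, `Fetch.failRequest`, and `Fetch.continueRequest` are standard Chrome DevTools Protocol names; the helper names and the `should_block` predicate are hypothetical, and the WebSocket transport and target attachment are omitted.

```python
from __future__ import annotations

import json
from itertools import count
from typing import Callable

_next_id = count(1)  # each CDP command carries a caller-chosen unique integer id


def cdp_command(method: str, params: dict | None = None) -> str:
    """Serialize one CDP command as the JSON text frame sent over the socket."""
    return json.dumps({"id": next(_next_id), "method": method, "params": params or {}})


def enable_fetch(url_pattern: str = "*") -> str:
    """Ask the browser to pause matching requests and emit Fetch.requestPaused."""
    return cdp_command(
        "Fetch.enable",
        {"patterns": [{"urlPattern": url_pattern, "requestStage": "Request"}]},
    )


def answer_paused_request(event: dict, should_block: Callable[[dict], bool]) -> str:
    """Resolve one Fetch.requestPaused event: block the request or let it through.

    `should_block` is a hypothetical predicate over the paused request object.
    """
    params = event["params"]
    if should_block(params["request"]):
        return cdp_command(
            "Fetch.failRequest",
            {"requestId": params["requestId"], "errorReason": "BlockedByClient"},
        )
    return cdp_command("Fetch.continueRequest", {"requestId": params["requestId"]})
```

Every paused request has to be answered one way or the other, otherwise the page stalls; that is why the sketch always returns either a fail or a continue command.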

Live smoke — sonnet-4-6, 20 non-email tasks, browser pool

0/20 passed (mean reward 0.0); the interception fired end-to-end on 1/20 tasks. This is unsurprising: ClawBench is hard (frontier best ~33%), this is a v1 adapter on a mid-tier model, and the email and login-walled cohorts were excluded or unscored. The value is that the pipeline and interceptor work end-to-end.

Deferred (noted, not blocking)

  • Email cohort: AGENTMAIL_API_KEY wasn't in the smoke session env; the AgentMail provider is wired but those tasks were excluded.
  • Login-walled tasks need per-task Kernel profiles to score.
  • Agentic 5-layer evaluator + MP4 = phase 2 (MP4 needs X11; not needed for parity).

Details in SMOKE.md. Unit tests + ruff green; CDP spike + interceptor validated live.


Note

Medium Risk
Large new benchmark surface (live sites, judge and AgentMail API keys, host CDP sidecar) but scoped to the adapter package without core Harbor/cua changes; grading contract changes (reward.json shape) affect ClawBench trials only.

Overview
Introduces benchmarks/adapters/clawbench, a Harbor adapter that maps ClawBench V1/V2 cases to Kernel + cua instead of upstream’s Docker/runtime-server stack. A CLI (clawbench_adapter.main) emits flat task.toml tasks with environment/kernel.json, copied verifier assets, and upstream-vendored instruction building.

ClawbenchCuaAgent wraps CuaHarborAgent to provision disposable email/persona (host prepare_task.py), inline identity into the prompt (no VM file tools), start a host-side interceptor.py CDP Fetch sidecar with eval_schema from <task>/tests/eval_schema.json, then upload /data/interception.json for Stage-2 grading. Email uses AgentMailProvider when AGENTMAIL_API_KEY is set.

Stage-2 verify.py ships lenient (default, leaderboard “Reward”) and strict rubrics via CLAWBENCH_JUDGE_RUBRIC, and writes numeric-only reward.json plus diagnostics in clawbench-result.json for Harbor’s float coercion. Docs (README, PARITY.md, SMOKE.md) and pytest coverage cover generator shape, agent lifecycle, interceptor matching, judge behavior, and email cleanup.

Reviewed by Cursor Bugbot for commit f185010. Bugbot is set up for automated code reviews on this repo.

rgarcia and others added 3 commits June 27, 2026 09:13
Re-targets upstream ClawBench's Docker-shaped Harbor adapter at the Kernel
environment: a Python generator turns test-cases/v2/*/task.json into flat,
single-step Harbor task dirs (instruction.md + upstream block + Kernel footer,
environment/kernel.json {stealth, viewport} with no start_url, schema_version
1.0 task.toml, tests/ grader bundle). The dataset mapping and instruction
builder are reused from upstream (vendored, with a runtime import shim), and the
Stage-2 body-judge verify.py is reused verbatim.

The Stage-1 request interceptor is ported from upstream's start_cdp_handler to
run as a sidecar against the Kernel session's CDP socket; finalize_capture and
cleanup_email wire the verifier flow. Email provisioning is abstracted behind an
EmailProvider (AgentMail impl + a no-inbox fallback for the non-email subset).

Build, ruff, and mocked unit tests are green. The live smoke is deferred: it
gates on confirming a second raw-CDP Fetch.enable client can attach to a Kernel
session alongside cua (see SMOKE.md).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Resolves the gating CDP question and makes the ClawBench Stage-1 grader actually
fire on the Kernel env. A spike confirmed a second raw-CDP Fetch.enable client
attaches to a live Kernel session (via cdp_ws_url) and can block requests while
cua drives the same session over the control plane.

Adds ClawbenchCuaAgent, a thin CuaHarborAgent subclass that runs interceptor.py
as a host-side sidecar around the cua drive (the Kernel base image has no
pip/websocket-client; the CDP endpoint is reachable from the host), then uploads
/data/interception.json into the VM for the verbatim Stage-2 verify.py. The
shared core is untouched.

Smoke (20 V2 non-email tasks, sonnet-4-6 agent + Anthropic judge, pool_size=5):
the sidecar attached on all 20, captured the four passive layers, and one block
fired end-to-end (judge correctly rejected a telemetry false-positive). Pass rate
0/20 -- the non-login subset hits auth walls and the 450s learning cap timed out
every run; no adapter bug. See SMOKE.md.

Fixes surfaced by the smoke:
- eval_schema now reaches the interceptor: read from the host task dir's
  tests/eval_schema.json (Harbor's [agent.env] is not surfaced on the host agent),
  passed inline as CLAWBENCH_EVAL_SCHEMA_JSON; the dead [agent.env] block is gone.
- test.sh drops the in-VM finalize (finalization is host-side now).
- interceptor.py accepts an inline schema, falling back to the file for in-VM use.
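The inline-then-file lookup described in these fixes can be sketched as follows. This is an illustration under assumptions rather than the code in interceptor.py: the environment variable name `CLAWBENCH_EVAL_SCHEMA_JSON` comes from the commit message, while the function name, its parameters, and the schema contents in the usage example are hypothetical.

```python
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Mapping


def load_eval_schema(
    env: Mapping[str, str] | None = None,
    fallback_path: str | Path | None = None,
) -> dict | None:
    """Return the eval schema, preferring the inline env var over a file.

    Hypothetical helper: host-side runs pass the schema inline, in-VM runs
    fall back to a schema file on disk.
    """
    env = os.environ if env is None else env
    inline = env.get("CLAWBENCH_EVAL_SCHEMA_JSON", "").strip()
    if inline:
        return json.loads(inline)
    if fallback_path is not None and Path(fallback_path).is_file():
        return json.loads(Path(fallback_path).read_text())
    return None  # no schema available; the caller decides how to degrade
```

A host-side caller would read `tests/eval_schema.json` from the task dir and place its text in the sidecar's environment, for example `env={**os.environ, "CLAWBENCH_EVAL_SCHEMA_JSON": schema_text}`.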

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Validate the ClawBench email cohort live against AgentMail. The provisioning
code had never run; fix two bugs the live run surfaced:

- my-info was never provisioned and is unreadable anyway. prepare_task.py was
  copied into tests/ but nothing invoked it, and the cua harness on Kernel
  exposes only browser tools (no file/shell), so the agent cannot read
  ./my-info/ files at all. ClawbenchCuaAgent now provisions the AgentMail inbox
  + persona host-side before the drive, inlines the email credentials + persona
  into the instruction so the agent reads its identity from the prompt, and
  deletes the inbox in teardown. The resume PDF stays unreachable (no
  file-upload surface), bounding the resume-upload cohort.
- verify.py wrote reward.json as a rich object (judge_match: null, reason,
  task_id), but Harbor reads it as a flat {key: number} map and coerced every
  value to float, erroring every trial with a VerifierResult ValidationError.
  reward.json now emits only numeric reward keys; diagnostics move to
  clawbench-result.json + reward.txt.
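The reward.json fix amounts to separating numeric reward keys from everything else. A minimal sketch of that split, assuming a result dict shaped like the one described above; the function names are hypothetical, and how the real verify.py encodes an absent judge verdict is not stated here, so this sketch simply routes any non-numeric value to the diagnostics file.

```python
from __future__ import annotations

import json
from pathlib import Path


def split_verifier_result(result: dict) -> tuple[dict[str, float], dict]:
    """Separate float-coercible reward keys from non-numeric diagnostics."""
    reward: dict[str, float] = {}
    diagnostics: dict = {}
    for key, value in result.items():
        # bool is a subclass of int, so True/False become 1.0/0.0
        if isinstance(value, (bool, int, float)):
            reward[key] = float(value)
        else:
            diagnostics[key] = value  # None, strings, nested objects
    return reward, diagnostics


def write_verifier_outputs(result: dict, out_dir: str | Path) -> dict[str, float]:
    """Write a flat numeric reward.json and the full record alongside it."""
    reward, diagnostics = split_verifier_result(result)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "reward.json").write_text(json.dumps(reward))
    (out / "clawbench-result.json").write_text(
        json.dumps({**reward, **diagnostics}, indent=2)
    )
    return reward
```

Keeping reward.json flat means a reader that coerces every value to float never sees a string or null.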

Also: add fpdf2 (resume PDF) as a dependency; gitignore run logs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia commented Jun 27, 2026

Contributor Author

Update — ClawBench email cohort validated live (with AgentMail) (commit 2483e63):

  • AgentMail disposable-inbox provisioning works end-to-end (create + cleanup, 0 inboxes leaked).
  • Fixed a verifier-crashing bug: reward.json was written as a rich object, but Harbor reads it as a flat {key: number} map → every ClawBench trial was erroring at the verifier. Now emits {reward, intercepted, judge_match} numeric keys (diagnostics moved to clawbench-result.json).
  • The cua agent has no in-VM file tool, so my-info (email + persona) is now inlined into the instruction for the agent to type into forms.
  • AgentMail limitation: no per-inbox webmail, so the fill-a-real-address cohort works but the click-the-verification-link cohort can't complete in-browser (and resume upload needs a file-upload surface). See SMOKE.md → "Email cohort (live)".
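Inlining my-info into the instruction can be sketched as a plain string transformation. This is a hypothetical illustration: the function name, the heading text, and the persona field names are assumptions, not the wording the adapter actually emits.

```python
from __future__ import annotations


def inline_identity(instruction: str, email: str, persona: dict[str, str]) -> str:
    """Append the agent's identity to the task instruction.

    Hypothetical format: with no file tool in the VM, the prompt is the only
    place the agent can read its email address and persona from.
    """
    lines = [
        instruction.rstrip(),
        "",
        "Your identity for this task (type these values into forms as needed):",
        f"- Email: {email}",
    ]
    lines += [f"- {field}: {value}" for field, value in persona.items()]
    return "\n".join(lines) + "\n"
```

For example, `inline_identity("Sign up for the newsletter.", "a@agentmail.to", {"Name": "Sam Doe"})` returns the original instruction followed by a short identity block.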

Ship both upstream judge rubrics in verify.py (lenient = judge_llm.py,
strict = judge.py, both verbatim) and select via CLAWBENCH_JUDGE_RUBRIC
(default lenient) so the emitted reward matches the published leaderboard
"Reward" column instead of the strict-flavoured number (~half). Default the
parse-failure verdict by rubric (lenient -> match, strict -> None), record the
chosen rubric in clawbench-result.json, and correct the README's
"leaderboard-reproducible" claim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia commented Jun 27, 2026

Contributor Author

Parity pass vs TIGER-AI-Lab/ClawBench (f185010): the published leaderboard uses the lenient rubric (runner/judge_llm.py), but we'd shipped a strict-flavoured variant — so our number wasn't leaderboard-reproducible. Now ships both upstream rubrics byte-for-byte and defaults to lenient (= the leaderboard "Reward" column), selectable via CLAWBENCH_JUDGE_RUBRIC. Intentional Kernel adaptations (numeric-only reward.json, host-side interceptor sidecar, AgentMail, inlined persona, no Docker/MP4) confirmed and kept. See PARITY.md.
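Rubric selection and the per-rubric parse-failure default can be sketched like this. The variable name `CLAWBENCH_JUDGE_RUBRIC`, the lenient default, and the fallbacks (lenient to match, strict to None) come from the commit message; the function names and the `{"match": ...}` reply shape are assumptions for illustration, not the judges' actual output format.

```python
from __future__ import annotations

import json
import os
from typing import Mapping

# Verdict used when the judge reply cannot be parsed, keyed by rubric.
PARSE_FAILURE_DEFAULT: dict[str, bool | None] = {"lenient": True, "strict": None}


def select_rubric(env: Mapping[str, str] | None = None) -> str:
    """Pick the judge rubric from the environment, defaulting to lenient."""
    env = os.environ if env is None else env
    rubric = env.get("CLAWBENCH_JUDGE_RUBRIC", "").strip().lower() or "lenient"
    if rubric not in PARSE_FAILURE_DEFAULT:
        raise ValueError(f"unknown rubric: {rubric!r}")
    return rubric


def parse_verdict(raw: str, rubric: str) -> bool | None:
    """Parse a judge reply, falling back to the rubric's default on failure.

    Assumes a hypothetical reply of the form {"match": true|false}.
    """
    try:
        verdict = json.loads(raw)["match"]
    except (ValueError, KeyError, TypeError):
        return PARSE_FAILURE_DEFAULT[rubric]
    return verdict if isinstance(verdict, bool) else PARSE_FAILURE_DEFAULT[rubric]
```

Recording the value returned by `select_rubric` next to the reward makes it possible to tell afterwards which rubric produced a given number.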

@rgarcia rgarcia marked this pull request as ready for review June 27, 2026 22:33

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Host email env ignores Harbor
    • Host-side prepare and cleanup subprocesses now use an env helper that forwards AGENTMAIL_API_KEY from agent extra_env, restoring live inbox provisioning and teardown.

Or push these changes by commenting:

@cursor push 6a97b228a9
Preview (6a97b228a9)
diff --git a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
--- a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
+++ b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
@@ -138,7 +138,7 @@
                     "--state-file",
                     str(state_file),
                 ],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=log,
                 stderr=subprocess.STDOUT,
             )
@@ -157,13 +157,27 @@
         try:
             subprocess.call(
                 [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
             )
         except Exception as exc:
             self.logger.warning(f"clawbench: inbox cleanup failed: {exc}")
 
+    def _host_script_env(self) -> dict[str, str]:
+        """Env for host-side setup/teardown helpers.
+
+        ``prepare_task.py``/``cleanup_email.py`` select AgentMail strictly from
+        ``AGENTMAIL_API_KEY`` in subprocess env. Keep host env defaults, but let
+        ``--ae AGENTMAIL_API_KEY=...`` override so these scripts can provision and
+        tear down real inboxes.
+        """
+        child_env = {**os.environ}
+        api_key = self.extra_env.get("AGENTMAIL_API_KEY", "").strip()
+        if api_key:
+            child_env["AGENTMAIL_API_KEY"] = api_key
+        return child_env
+
     def _start_interceptor(
         self, environment, data_dir: Path
     ) -> subprocess.Popen | None:

diff --git a/benchmarks/adapters/clawbench/tests/test_agent.py b/benchmarks/adapters/clawbench/tests/test_agent.py
--- a/benchmarks/adapters/clawbench/tests/test_agent.py
+++ b/benchmarks/adapters/clawbench/tests/test_agent.py
@@ -230,12 +230,14 @@
 
 def test_prepare_my_info_invokes_prepare_task(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     myinfo = tmp_path / "myinfo"
     state = tmp_path / "data" / "task-state.json"
     calls = {}
 
     def fake_call(cmd, env=None, **kw):
         calls["cmd"] = cmd
+        calls["env"] = env
         return 0
 
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -245,14 +247,18 @@
     assert calls["cmd"][1].endswith("prepare_task.py")
     assert str(myinfo / "my-info") in calls["cmd"]
     assert str(state) in calls["cmd"]
+    assert calls["env"]["AGENTMAIL_API_KEY"] == "am-key"
 
 
 def test_cleanup_my_info_noops_without_state(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     called = {"n": 0}
+    envs = []
 
     def fake_call(*a, **k):
         called["n"] += 1
+        envs.append(k.get("env"))
         return 0
 
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -263,3 +269,4 @@
     state.write_text('{"email": {"address": "a@agentmail.to"}}')
     agent._cleanup_my_info(state)
     assert called["n"] == 1  # state present -> cleanup_email.py invoked
+    assert envs[-1]["AGENTMAIL_API_KEY"] == "am-key"

        try:
            subprocess.call(
                [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
                env={**os.environ},

Host email env ignores Harbor

Medium Severity

The AGENTMAIL_API_KEY isn't propagated to the host-side prepare_task.py and cleanup_email.py subprocesses. These scripts, which provision disposable email inboxes, only receive os.environ. The key is incorrectly placed in task.toml's [verifier.env], which Harbor prevents from reaching agent host-side processes, causing email cohort tasks to lack real inboxes.
