ClawBench adapter (Kernel env + cua agent) #44
Re-targets upstream ClawBench's Docker-shaped Harbor adapter at the Kernel
environment: a Python generator turns test-cases/v2/*/task.json into flat,
single-step Harbor task dirs (instruction.md + upstream block + Kernel footer,
environment/kernel.json {stealth, viewport} with no start_url, schema_version
1.0 task.toml, tests/ grader bundle). The dataset mapping and instruction
builder are reused from upstream (vendored, with a runtime import shim), and the
Stage-2 body-judge verify.py is reused verbatim.
The Stage-1 request interceptor is ported from upstream's start_cdp_handler to
run as a sidecar against the Kernel session's CDP socket; finalize_capture and
cleanup_email wire the verifier flow. Email provisioning is abstracted behind an
EmailProvider (AgentMail impl + a no-inbox fallback for the non-email subset).
Build, ruff, and mocked unit tests are green. The live smoke is deferred: it
gates on confirming a second raw-CDP Fetch.enable client can attach to a Kernel
session alongside cua (see SMOKE.md).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
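The task-directory shape described in this commit can be sketched roughly as below. This is a minimal illustration, not the adapter's generator: `write_task_dir`, the `id` and `instruction` fields of the parsed task.json, the footer text, the viewport values, and the quoting of `schema_version` are all assumptions. Reading instruction.md as the upstream block followed by a Kernel footer is also an interpretation. Only the file layout comes from the description above.

```python
import json
import tempfile
from pathlib import Path

# Placeholder footer text (assumption; the real footer is not shown in this PR page).
KERNEL_FOOTER = "You are driving a Kernel browser session."


def write_task_dir(task: dict, out_root: Path) -> Path:
    """Emit one flat, single-step task dir from a parsed task.json (hypothetical fields)."""
    task_dir = out_root / task["id"]
    (task_dir / "environment").mkdir(parents=True, exist_ok=True)
    (task_dir / "tests").mkdir(exist_ok=True)  # grader bundle would be copied here

    # instruction.md: upstream instruction block followed by the Kernel footer.
    (task_dir / "instruction.md").write_text(
        f"{task['instruction']}\n\n{KERNEL_FOOTER}\n"
    )

    # kernel.json carries only stealth + viewport; there is deliberately no start_url.
    kernel = {"stealth": True, "viewport": {"width": 1280, "height": 800}}
    (task_dir / "environment" / "kernel.json").write_text(json.dumps(kernel, indent=2))

    # Only schema_version is taken from the description; other keys are omitted.
    (task_dir / "task.toml").write_text('schema_version = "1.0"\n')
    return task_dir


demo_root = Path(tempfile.mkdtemp())
demo_dir = write_task_dir({"id": "demo-task", "instruction": "Book a table."}, demo_root)
print(sorted(p.relative_to(demo_dir).as_posix() for p in demo_dir.rglob("*")))
```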
Resolves the gating CDP question and makes the ClawBench Stage-1 grader actually
fire on the Kernel env. A spike confirmed a second raw-CDP Fetch.enable client
attaches to a live Kernel session (via cdp_ws_url) and can block requests while
cua drives the same session over the control plane.
Adds ClawbenchCuaAgent, a thin CuaHarborAgent subclass that runs interceptor.py
as a host-side sidecar around the cua drive (the Kernel base image has no
pip/websocket-client; the CDP endpoint is reachable from the host), then uploads
/data/interception.json into the VM for the verbatim Stage-2 verify.py. The
shared core is untouched.
Smoke (20 V2 non-email tasks, sonnet-4-6 agent + Anthropic judge, pool_size=5):
the sidecar attached on all 20, captured the four passive layers, and one block
fired end-to-end (judge correctly rejected a telemetry false-positive). Pass
rate 0/20 -- the non-login subset hits auth walls and the 450s learning cap
timed out every run; no adapter bug. See SMOKE.md.
Fixes surfaced by the smoke:
- eval_schema now reaches the interceptor: read from the host task dir's
tests/eval_schema.json (Harbor's [agent.env] is not surfaced on the host
agent), passed inline as CLAWBENCH_EVAL_SCHEMA_JSON; the dead [agent.env] block
is gone.
- test.sh drops the in-VM finalize (finalization is host-side now).
- interceptor.py accepts an inline schema, falling back to the file for in-VM
use.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
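How a second CDP client might decide between blocking and passing a paused request can be sketched as follows. The CDP names used (`Fetch.enable`, `Fetch.requestPaused`, `Fetch.failRequest`, `Fetch.continueRequest`, the `BlockedByClient` error reason) are part of the Chrome DevTools Protocol. Everything else is an assumption for illustration: the function names, the `url_pattern`/`method` schema shape, and the captured record layout. The actual interceptor.py is ported from upstream and is not reproduced here, and no socket is opened in this sketch.

```python
import json
import re
from typing import Optional, Tuple


def enable_message(msg_id: int) -> str:
    """CDP command asking the browser to pause every request for inspection."""
    return json.dumps(
        {"id": msg_id, "method": "Fetch.enable",
         "params": {"patterns": [{"urlPattern": "*"}]}}
    )


def handle_paused(event: dict, schema: dict, msg_id: int) -> Tuple[dict, Optional[dict]]:
    """Return (CDP reply, captured record or None) for one Fetch.requestPaused event."""
    params = event["params"]
    request = params["request"]
    matches = (
        request["method"].upper() == schema["method"].upper()
        and re.search(schema["url_pattern"], request["url"]) is not None
    )
    if matches:
        # Block the submission and keep its body for a later judging stage.
        reply = {"id": msg_id, "method": "Fetch.failRequest",
                 "params": {"requestId": params["requestId"],
                            "errorReason": "BlockedByClient"}}
        record = {"url": request["url"], "method": request["method"],
                  "body": request.get("postData")}
        return reply, record
    # Anything else is released unchanged.
    reply = {"id": msg_id, "method": "Fetch.continueRequest",
             "params": {"requestId": params["requestId"]}}
    return reply, None


# Hypothetical schema and event, for illustration only.
schema = {"url_pattern": r"/api/orders$", "method": "POST"}
event = {"method": "Fetch.requestPaused",
         "params": {"requestId": "r1",
                    "request": {"url": "https://shop.example/api/orders",
                                "method": "POST", "postData": '{"qty": 1}'}}}
reply, record = handle_paused(event, schema, msg_id=2)
print(reply["method"], record)
```

Keeping the decision logic free of socket I/O means it can be unit-tested with plain dictionaries, which fits the mocked tests the commits mention.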
Validate the ClawBench email cohort live against AgentMail. The provisioning
code had never run; fix two bugs the live run surfaced:
- my-info was never provisioned and is unreadable anyway. prepare_task.py was
copied into tests/ but nothing invoked it, and the cua harness on Kernel
exposes only browser tools (no file/shell), so the agent cannot read
./my-info/ files at all. ClawbenchCuaAgent now provisions the AgentMail inbox
+ persona host-side before the drive, inlines the email credentials + persona
into the instruction so the agent reads its identity from the prompt, and
deletes the inbox in teardown. The resume PDF stays unreachable (no
file-upload surface), bounding the resume-upload cohort.
- verify.py wrote reward.json as a rich object (judge_match: null, reason,
task_id), but Harbor reads it as a flat {key: number} map and coerces every
value to float, so every trial errored with a VerifierResult ValidationError.
reward.json now emits only numeric reward keys; diagnostics move to
clawbench-result.json + reward.txt.
Also: add fpdf2 (resume PDF) as a dependency; gitignore run logs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
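The reward.json fix above amounts to splitting a rich verdict into a flat numeric map plus a separate diagnostics file. A minimal sketch, assuming a hypothetical `write_results` helper and treating every non-boolean numeric field as a reward key (the real verify.py may select keys differently, and reward.txt is omitted here because its contents are not described):

```python
import json
import tempfile
from pathlib import Path


def write_results(result: dict, out_dir: Path) -> dict:
    """Write a flat {key: number} reward.json and a rich clawbench-result.json."""
    # Keep only real numbers; bool is excluded even though it subclasses int.
    numeric = {
        key: float(value)
        for key, value in result.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    (out_dir / "reward.json").write_text(json.dumps(numeric))
    # Diagnostics (nulls, strings, ids) go to a separate file.
    (out_dir / "clawbench-result.json").write_text(json.dumps(result, indent=2))
    return numeric


out_dir = Path(tempfile.mkdtemp())
rich = {"reward": 0, "judge_match": None,
        "reason": "no interception captured", "task_id": "demo-task"}
numeric = write_results(rich, out_dir)
print(numeric)
```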
Update — ClawBench email cohort validated live (with AgentMail) (commit 2483e63):
Ship both upstream judge rubrics in verify.py (lenient = judge_llm.py, strict =
judge.py, both verbatim) and select via CLAWBENCH_JUDGE_RUBRIC (default lenient)
so the emitted reward matches the published leaderboard "Reward" column instead
of the strict-flavoured number (~half). Default the parse-failure verdict by
rubric (lenient -> match, strict -> None), record the chosen rubric in
clawbench-result.json, and correct the README's "leaderboard-reproducible"
claim.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
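The rubric selection and its parse-failure defaults could look roughly like this. The environment variable name and the two defaults (lenient -> match, strict -> None) come from the commit message. The helper names and the assumed judge reply format (a JSON object with a boolean `match` field) are illustrative assumptions, not the upstream judge's actual output.

```python
import json
from typing import Mapping, Optional

# Verdict to fall back on when the judge reply cannot be parsed.
PARSE_FAILURE_DEFAULT = {"lenient": True, "strict": None}


def chosen_rubric(env: Mapping[str, str]) -> str:
    """Read CLAWBENCH_JUDGE_RUBRIC, defaulting to the lenient rubric."""
    rubric = env.get("CLAWBENCH_JUDGE_RUBRIC", "lenient").strip().lower()
    if rubric not in PARSE_FAILURE_DEFAULT:
        raise ValueError(f"unknown rubric: {rubric!r}")
    return rubric


def parse_verdict(raw: str, rubric: str) -> Optional[bool]:
    """Parse an assumed '{"match": bool}' reply; fall back by rubric on failure."""
    try:
        value = json.loads(raw)["match"]
    except (ValueError, KeyError, TypeError):
        return PARSE_FAILURE_DEFAULT[rubric]
    return value if isinstance(value, bool) else PARSE_FAILURE_DEFAULT[rubric]


rubric = chosen_rubric({})
print(rubric, parse_verdict("not json", rubric), parse_verdict("not json", "strict"))
```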
Parity pass vs |
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Host email env ignores Harbor
- Host-side prepare and cleanup subprocesses now use an env helper that forwards AGENTMAIL_API_KEY from agent extra_env, restoring live inbox provisioning and teardown.
Or push these changes by commenting:
@cursor push 6a97b228a9
Preview (6a97b228a9)
diff --git a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
--- a/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
+++ b/benchmarks/adapters/clawbench/src/clawbench_adapter/agent.py
@@ -138,7 +138,7 @@
                     "--state-file",
                     str(state_file),
                 ],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=log,
                 stderr=subprocess.STDOUT,
             )
@@ -157,13 +157,27 @@
         try:
             subprocess.call(
                 [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
-                env={**os.environ},
+                env=self._host_script_env(),
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
             )
         except Exception as exc:
             self.logger.warning(f"clawbench: inbox cleanup failed: {exc}")
+    def _host_script_env(self) -> dict[str, str]:
+        """Env for host-side setup/teardown helpers.
+
+        ``prepare_task.py``/``cleanup_email.py`` select AgentMail strictly from
+        ``AGENTMAIL_API_KEY`` in subprocess env. Keep host env defaults, but let
+        ``--ae AGENTMAIL_API_KEY=...`` override so these scripts can provision and
+        tear down real inboxes.
+        """
+        child_env = {**os.environ}
+        api_key = self.extra_env.get("AGENTMAIL_API_KEY", "").strip()
+        if api_key:
+            child_env["AGENTMAIL_API_KEY"] = api_key
+        return child_env
+
     def _start_interceptor(
         self, environment, data_dir: Path
     ) -> subprocess.Popen | None:
diff --git a/benchmarks/adapters/clawbench/tests/test_agent.py b/benchmarks/adapters/clawbench/tests/test_agent.py
--- a/benchmarks/adapters/clawbench/tests/test_agent.py
+++ b/benchmarks/adapters/clawbench/tests/test_agent.py
@@ -230,12 +230,14 @@
 def test_prepare_my_info_invokes_prepare_task(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     myinfo = tmp_path / "myinfo"
     state = tmp_path / "data" / "task-state.json"
     calls = {}
     def fake_call(cmd, env=None, **kw):
         calls["cmd"] = cmd
+        calls["env"] = env
         return 0
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -245,14 +247,18 @@
     assert calls["cmd"][1].endswith("prepare_task.py")
     assert str(myinfo / "my-info") in calls["cmd"]
     assert str(state) in calls["cmd"]
+    assert calls["env"]["AGENTMAIL_API_KEY"] == "am-key"
 def test_cleanup_my_info_noops_without_state(tmp_path, monkeypatch):
     agent = _make_agent(tmp_path)
+    agent._extra_env = {"AGENTMAIL_API_KEY": "am-key"}
     called = {"n": 0}
+    envs = []
     def fake_call(*a, **k):
         called["n"] += 1
+        envs.append(k.get("env"))
         return 0
     monkeypatch.setattr("clawbench_adapter.agent.subprocess.call", fake_call)
@@ -263,3 +269,4 @@
     state.write_text('{"email": {"address": "a@agentmail.to"}}')
     agent._cleanup_my_info(state)
     assert called["n"] == 1 # state present -> cleanup_email.py invoked
+    assert envs[-1]["AGENTMAIL_API_KEY"] == "am-key"
Reviewed by Cursor Bugbot for commit f185010. Configure here.
        try:
            subprocess.call(
                [sys.executable, str(_CLEANUP_EMAIL), str(state_file)],
                env={**os.environ},
Host email env ignores Harbor
Medium Severity
The AGENTMAIL_API_KEY isn't propagated to the host-side prepare_task.py and cleanup_email.py subprocesses. These scripts, which provision disposable email inboxes, only receive os.environ. The key is incorrectly placed in task.toml's [verifier.env], which Harbor prevents from reaching agent host-side processes, causing email cohort tasks to lack real inboxes.



Summary
Harbor adapter for ClawBench (everyday online tasks on live sites) on the Kernel env + cua agent. Re-targets the upstream Docker-shaped adapter at the Kernel environment: reuses upstream's dataset mapping + the two-stage grader (CDP submission-interception + body-LLM-judge verify.py) verbatim, re-hosting the interceptor against the Kernel session's CDP. Stacked on #40.
Validated
- Fetch.enable interceptor attaches to the live Kernel session alongside cua and works (the gate for the whole approach).
- ClawbenchCuaAgent wrapper (zero shared-core/harbor edits).
- [agent.env] isn't surfaced to the host agent, so the interceptor saw no eval-schema; the wrapper now reads it from the task dir and passes it inline.
- EmailProvider abstraction (replaces upstream's PurelyMail).
Live smoke: sonnet-4-6, 20 non-email tasks, browser pool
0/20 (mean 0.0), 1/20 interception fired end-to-end. Unsurprising — ClawBench is hard (frontier best ~33%), this is a v1 adapter on a mid model, and the email + login-walled cohorts were excluded/unscored. The value is that the pipeline + interceptor work end-to-end.
Deferred (noted, not blocking)
- AGENTMAIL_API_KEY wasn't in the smoke session env; the AgentMail provider is wired but those tasks were excluded.
Details in SMOKE.md. Unit tests + ruff green; CDP spike + interceptor validated live.
Note
Medium Risk
Large new benchmark surface (live sites, judge and AgentMail API keys, host CDP sidecar) but scoped to the adapter package without core Harbor/cua changes; grading contract changes (reward.json shape) affect ClawBench trials only.
Overview
Introduces benchmarks/adapters/clawbench, a Harbor adapter that maps ClawBench V1/V2 cases to Kernel + cua instead of upstream’s Docker/runtime-server stack. A CLI (clawbench_adapter.main) emits flat task.toml tasks with environment/kernel.json, copied verifier assets, and upstream-vendored instruction building.
ClawbenchCuaAgent wraps CuaHarborAgent to provision disposable email/persona (host prepare_task.py), inline identity into the prompt (no VM file tools), start a host-side interceptor.py CDP Fetch sidecar with eval_schema from <task>/tests/eval_schema.json, then upload /data/interception.json for Stage-2 grading. Email uses AgentMailProvider when AGENTMAIL_API_KEY is set.
Stage-2 verify.py ships lenient (default, leaderboard “Reward”) and strict rubrics via CLAWBENCH_JUDGE_RUBRIC, and writes numeric-only reward.json plus diagnostics in clawbench-result.json for Harbor’s float coercion. Docs (README, PARITY.md, SMOKE.md) and pytest coverage cover generator shape, agent lifecycle, interceptor matching, judge behavior, and email cleanup.
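The eval_schema hand-off described in the Overview (host reads <task>/tests/eval_schema.json, the interceptor receives it inline and falls back to a file when run in-VM) can be sketched as below. The variable name CLAWBENCH_EVAL_SCHEMA_JSON appears in the commit messages on this page. The helper names `sidecar_env` and `load_schema` and the sample schema contents are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def sidecar_env(task_dir: Path, base_env: Mapping[str, str]) -> dict:
    """Host side: copy the task's eval schema into the sidecar's environment."""
    env = dict(base_env)
    schema_file = task_dir / "tests" / "eval_schema.json"
    if schema_file.is_file():
        env["CLAWBENCH_EVAL_SCHEMA_JSON"] = schema_file.read_text()
    return env


def load_schema(env: Mapping[str, str], fallback_file: Path) -> Optional[dict]:
    """Interceptor side: prefer the inline schema, else read the fallback file."""
    inline = env.get("CLAWBENCH_EVAL_SCHEMA_JSON", "").strip()
    if inline:
        return json.loads(inline)
    if fallback_file.is_file():
        return json.loads(fallback_file.read_text())
    return None  # no schema available: nothing to match against


# Hypothetical task dir with a made-up schema.
task_dir = Path(tempfile.mkdtemp())
(task_dir / "tests").mkdir()
schema_path = task_dir / "tests" / "eval_schema.json"
schema_path.write_text('{"url_pattern": "/api/orders$", "method": "POST"}')

env = sidecar_env(task_dir, base_env={})
inline_schema = load_schema(env, fallback_file=task_dir / "missing.json")
print(inline_schema)
```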