Online-Mind2Web adapter (Kernel env + cua agent)#43
Conversation
Generates Harbor task dirs from the gated osunlp/Online-Mind2Web dataset (300 live web tasks) and grades them with WebJudge. Builds on the shared cua_harbor agent + cua-bench-task entrypoint; adds only the dataset->task generator and the verifier. - adapter.py maps each row to a task dir: kernel.json (start_url normalized to https://, stealth, 1280x1024 viewport), instruction.md, task.toml, tests/task.json. Mirrors the cua loader's field mapping + skip-malformed. - WebJudge verifier is a self-contained Node bundle (recovered webjudge/ prompts placed verbatim; Anthropic judge over fetch) that runs in-VM, reconstructs the trajectory from the shared core's run.jsonl + shots/ + answer.txt, and writes the reward. - Mocked unit tests (pytest, vitest), ruff + tsc clean. Generated tasks, build artifacts, and caches are gitignored. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Finalize the adapter against an opus agent + opus WebJudge after a 20-task live smoke (16/20, mean 0.800; pools worked, no 403). Fixes the smoke surfaced: - WebJudge: retry the Anthropic call without `temperature` on the 400 newer models (e.g. opus-4-8) return for the deprecated param, instead of fail-closing every grade to reward 0. Keep temperature=0 where supported. - Instruction: tell the browser-only agent to state its answer in its final message (captured into answer.txt) rather than to write a file it has no tools for. - Default JUDGE_MODEL to anthropic:claude-opus-4-8 (task.toml, test.sh, judge). - Gitignore .validate/, _smoke_logs/, and the standalone uv.lock. Adds a mocked anthropicJudgeModel test for the temperature retry. SMOKE.md records the run, the two fixes, and the failure taxonomy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thread each row's level (easy/medium/hard) into task.toml difficulty and tests/task.json instead of hardcoding "hard", which mislabeled most tasks. Default to hard when absent. Document the 1024 max_tokens headroom and add PARITY.md recording the parity pass (applied vs skipped). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Parity pass vs |
Switch the judge backbone default from anthropic:claude-opus-4-8 to openai:o4-mini, the published WebJudge model, for leaderboard parity. Add a dependency-free OpenAI Chat Completions client to model.ts (stdlib fetch, no SDK) routed by the JUDGE_MODEL provider prefix. It supports vision (image_url data-URLs with detail:high) and handles the o-series reasoning quirks: omit temperature and use max_completion_tokens for o\d-prefixed names. The Anthropic Messages path stays wired as a configurable, non-canonical cheaper alternative. task.toml passes OPENAI_API_KEY (host-resolved) alongside ANTHROPIC_API_KEY and defaults JUDGE_MODEL to openai:o4-mini. README + PARITY.md document the new default, the configurable Anthropic path, and WebJudge-7B as a future GPU-hosted option. Mocked tests cover provider routing, the no-temperature / max_completion_tokens construction, and the vision payload shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Default WebJudge switched to OpenAI o4-mini (60afab6) — the published WebJudge backbone (~85.7% human agreement), for leaderboard parity. Added an OpenAI client to the judge bin (provider routing on |
Replace the hand-rolled raw-fetch OpenAI/Anthropic judge clients with @earendil-works/pi-ai's completeSimple, which owns provider routing, env-key resolution, the o-series reasoning quirks, and vision encoding. Resolve the JUDGE_MODEL ref (default openai:o4-mini) to a pi-ai model and surface pi-ai's stopReason "error" as a throw so a grading failure is logged rather than scored as a task failure. Bundle pi-ai via tsdown (inlineDynamicImports) so the verifier stays a single self-contained judge.js. Prompts, parsing, scoring, and the model default are unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…the pi-ai judge o4-mini (and o3) reject `temperature` and reject a reasoning effort of `none` — pi-ai's default when the level is unset — on the responses API. Pass `reasoning: medium` (o4-mini's own default, the calibration the published WebJudge numbers use) and omit `temperature` for OpenAI reasoning models; every other backbone keeps deterministic `temperature: 0`. Verified live: o4-mini grades a real trajectory end-to-end (key points, per-screenshot vision scores, verdict) with no API errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Stale test expects anthropic judge
- Updated the adapter test to assert the generated JUDGE_MODEL matches the current template default of openai:o4-mini.
- ✅ Fixed: Anthropic judge sends temperature zero
- Changed judge option selection to omit temperature for Anthropic models and added a regression assertion that Anthropic calls do not include temperature.
Or push these changes by commenting:
@cursor push de5434b3bd
Preview (de5434b3bd)
diff --git a/benchmarks/adapters/online-mind2web/judge/src/model.ts b/benchmarks/adapters/online-mind2web/judge/src/model.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/model.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/model.ts
@@ -50,13 +50,15 @@
// reject a reasoning effort of "none" (pi-ai's default when the level is
// unset) — they require low/medium/high. "medium" is OpenAI's own o4-mini
// default, the setting the published WebJudge agreement numbers are
- // calibrated to. Every other backbone keeps deterministic scoring
- // (temperature 0) with no forced thinking.
+ // calibrated to. Anthropic backbones reject `temperature: 0`, so we omit it
+ // there and only force deterministic sampling on the remaining providers.
const baseOptions = { apiKey, maxTokens: MAX_OUTPUT_TOKENS };
const options =
model.reasoning && provider === "openai"
? { ...baseOptions, reasoning: "medium" as const }
- : { ...baseOptions, temperature: 0 };
+ : provider === "anthropic"
+ ? baseOptions
+ : { ...baseOptions, temperature: 0 };
return {
async complete(systemPrompt, content) {
diff --git a/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts b/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
--- a/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
+++ b/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
@@ -203,7 +203,9 @@
expect(getModel).toHaveBeenCalledWith("anthropic", "claude-opus-4-8");
expect(getEnvApiKey).toHaveBeenCalledWith("anthropic");
- expect(completeSimple.mock.calls[0][2]).toMatchObject({ apiKey: "sk-ant" });
+ const options = completeSimple.mock.calls[0][2];
+ expect(options).toMatchObject({ apiKey: "sk-ant", maxTokens: 1024 });
+ expect(options).not.toHaveProperty("temperature");
});
it("concatenates the text blocks of the assistant message", async () => {
diff --git a/benchmarks/adapters/online-mind2web/tests/test_adapter.py b/benchmarks/adapters/online-mind2web/tests/test_adapter.py
--- a/benchmarks/adapters/online-mind2web/tests/test_adapter.py
+++ b/benchmarks/adapters/online-mind2web/tests/test_adapter.py
@@ -112,7 +112,7 @@
assert toml["task"]["name"] == "osunlp/online-mind2web-abc123-110325"
assert toml["metadata"]["reference_length"] == "6"
assert toml["metadata"]["difficulty"] == "medium"
- assert toml["verifier"]["env"]["JUDGE_MODEL"].startswith("anthropic:")
+ assert toml["verifier"]["env"]["JUDGE_MODEL"] == "openai:o4-mini"
instruction = (task_dir / "instruction.md").read_text()
assert "https://www.traderjoes.com/" in instructionYou can send follow-ups to the cloud agent here.
| const options = | ||
| model.reasoning && provider === "openai" | ||
| ? { ...baseOptions, reasoning: "medium" as const } | ||
| : { ...baseOptions, temperature: 0 }; |
There was a problem hiding this comment.
Anthropic judge sends temperature zero
High Severity
The judgeModel unconditionally passes temperature: 0 to pi-ai for non-OpenAI models. Anthropic Opus models reject this parameter with an HTTP 400, leading to grading failures and a reward of 0. This regression removes a prior retry mechanism that handled such rejections.
Reviewed by Cursor Bugbot for commit e8d1820. Configure here.
The judge backbone now defaults to openai:o4-mini (leaderboard parity), but test_generates_full_task_dir still asserted JUDGE_MODEL started with "anthropic:", so it failed against the current task.toml template. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.
There are 4 total unresolved issues (including 1 from previous review).
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: Action history drops screenshotless tools
loadTrajectorynow appends a step with the resolved action even when atool_resulthas no existing screenshots, preserving complete action history.
- ✅ Fixed: Details failure overwrites reward
judge.tsnow tracks whether reward output was already written and only writes fallback0on errors before that point, so details-write failures no longer clobber successful rewards.
- ✅ Fixed: Empty thoughts desync snapshot reasons
gradeWithWebJudgenow keeps one thought entry per kept snapshot without filtering empty strings, maintaining prompt reason/image alignment.
Or push these changes by commenting:
@cursor push 8c543d2432
Preview (8c543d2432)
diff --git a/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts b/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
@@ -50,9 +50,10 @@
* `assistant` lines carrying `tool_calls: [{id, name, arguments}]` and
* `tool_result` lines carrying `{call_id, shots: ["shots/shot-N.png"]}`. Each
* spilled screenshot becomes one step whose action is the tool call that
- * produced it; the bytes are read from disk into base64 for the judge. The
- * final answer comes from `answer.txt` (the grading channel), falling back to
- * the `final` record.
+ * produced it; when no screenshot exists, a step is still kept so action
+ * history remains complete. Screenshot bytes are read from disk into base64 for
+ * the judge. The final answer comes from `answer.txt` (the grading channel),
+ * falling back to the `final` record.
*/
export function loadTrajectory(opts: {
runJsonlPath: string;
@@ -83,9 +84,11 @@
}
} else if (rec.t === "tool_result") {
const action = callActions.get(String(rec.call_id)) ?? "tool_result";
+ let pushedStep = false;
for (const rel of (rec.shots as string[]) ?? []) {
const abs = isAbsolute(rel) ? rel : join(baseDir, rel);
if (!existsSync(abs)) continue;
+ pushedStep = true;
steps.push({
index: steps.length,
action,
@@ -93,6 +96,12 @@
screenshotMimeType: mimeForPath(rel),
});
}
+ if (!pushedStep) {
+ steps.push({
+ index: steps.length,
+ action,
+ });
+ }
} else if (rec.t === "final" && !finalAnswer) {
finalAnswer = String(rec.answer ?? "");
}
diff --git a/benchmarks/adapters/online-mind2web/judge/src/judge.ts b/benchmarks/adapters/online-mind2web/judge/src/judge.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/judge.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/judge.ts
@@ -61,6 +61,7 @@
async function main(): Promise<void> {
const args = parseArgs(process.argv.slice(2));
+ let rewardWritten = false;
try {
const task = loadTask(args.task);
const trajectory = loadTrajectory({
@@ -76,6 +77,7 @@
scoreThreshold: args.scoreThreshold,
});
writeReward(args.rewardOut, result.success ? 1 : 0);
+ rewardWritten = true;
if (args.detailsOut) {
mkdirSync(dirname(args.detailsOut), { recursive: true });
writeFileSync(
@@ -91,7 +93,7 @@
// Never leave the reward file empty: a grading failure is a 0, and the
// error is surfaced on stderr for the verifier log.
console.error(err instanceof Error ? (err.stack ?? err.message) : String(err));
- writeReward(args.rewardOut, 0);
+ if (!rewardWritten) writeReward(args.rewardOut, 0);
}
}
diff --git a/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts b/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
@@ -33,7 +33,7 @@
);
const kept = scored.filter((r) => r.score >= scoreThreshold).slice(0, P.MAX_IMAGE);
- const keptThoughts = kept.map((r) => r.thought).filter((t) => t !== "").slice(0, P.MAX_IMAGE);
+ const keptThoughts = kept.map((r) => r.thought);
const actions = trajectory.steps.map((s) => s.action);
const finalContent: JudgeContent = [You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit fe9ae2f. Configure here.
| screenshotBase64: readFileSync(abs).toString("base64"), | ||
| screenshotMimeType: mimeForPath(rel), | ||
| }); | ||
| } |
There was a problem hiding this comment.
Action history drops screenshotless tools
Medium Severity
The loadTrajectory function only adds steps to the trajectory for tool_result records that have corresponding screenshot files on disk. Tool calls without existing screenshots are omitted, resulting in an incomplete WebJudge action history and potentially incorrect grading compared to the full last_actions from upstream.
Reviewed by Cursor Bugbot for commit fe9ae2f. Configure here.
| // Never leave the reward file empty: a grading failure is a 0, and the | ||
| // error is surfaced on stderr for the verifier log. | ||
| console.error(err instanceof Error ? (err.stack ?? err.message) : String(err)); | ||
| writeReward(args.rewardOut, 0); |
There was a problem hiding this comment.
Details failure overwrites reward
High Severity
After a successful grade, writeReward writes 1 to reward.txt, but any later error in the same try block (for example writing grading_details.json) falls into the shared catch, which always writes reward 0. A passing run can be recorded as failure when only the details artifact fails.
Reviewed by Cursor Bugbot for commit fe9ae2f. Configure here.
| type: "image" as const, | ||
| data: r.step.screenshotBase64!, | ||
| mimeType: r.step.screenshotMimeType ?? "image/png", | ||
| })), |
There was a problem hiding this comment.
Empty thoughts desync snapshot reasons
Medium Severity
Kept screenshots sent to the final judge include every image above the score threshold, but the prompt’s numbered “reasons” list drops entries whose per-image thought parsed empty. The text block and trailing images no longer stay paired the way upstream WebJudge builds whole_thoughts alongside whole_content_img.
Reviewed by Cursor Bugbot for commit fe9ae2f. Configure here.



Summary
Harbor adapter for Online-Mind2Web (300 live web tasks) on the Kernel environment with the cua agent. Generates Harbor task dirs from the dataset; graded by the ported WebJudge (key-points → per-screenshot → verdict), bundled as a self-contained in-VM Node judge. Stacked on #40 (the cua-harbor shared core).
Live smoke — opus-4-8 agent + judge, 20 tasks, browser pool
16/20 passed (mean reward 0.80); pools provisioned with no fallback. The 4 misses were agent/task/judge effects (a 1800s timeout, an empty-answer stall, judge strictness, one genuinely hard multi-constraint task) — no adapter bugs in the final run.
Bug found + fixed by the smoke: opus rejects the
temperatureparam → the WebJudge HTTP call 400'd and fail-closed every grade to 0; now retries withouttemperatureon that 400 (validation 0/2 → 2/2). Also reworked the instruction so the browser-only agent states its answer in its final message.Full taxonomy in
benchmarks/adapters/online-mind2web/SMOKE.md. Unit tests + lint + judge build green.Note
Medium Risk
Large additive benchmark surface with verifier calls to external LLM APIs and gated HuggingFace dataset fetch at generation time; isolated under
benchmarks/adaptersand does not change core app auth or data paths.Overview
Adds a new
benchmarks/adapters/online-mind2webpackage that turns the gated Online-Mind2Web dataset into Harbor task directories for the Kernel browser env and cua agent, with trajectory-based WebJudge grading instead of a gold answer.The Python adapter (
OnlineMind2WebAdapter) loads/caches dataset JSON, maps rows to slugged task dirs (instruction,kernel.jsonwithstart_url/stealth/viewport,task.json), threadslevel→[metadata].difficulty, and copies a prebuiltjudge.jsinto each task’stests/. A CLI (online_mind2web.main) supports--limit,--overwrite, and--task-ids.The in-VM judge is a TypeScript port of upstream WebJudge: key-point extraction → per-screenshot 1–5 scoring → final verdict, with prompts/parsers aligned to OSU-NLP-Group.
gradeWithWebJudgerebuilds trajectories from/logs/agent(run.jsonl,shots/*.png,answer.txt).judgeModeldefaults toopenai:o4-minivia@earendil-works/pi-ai(Anthropic configurable throughJUDGE_MODEL); failures fail-closed to reward 0.tsdownbundles a dependency-freedist/judge.jsfor the VM.Docs
README.md,PARITY.md, andSMOKE.mdcover usage, upstream parity (o4-mini default, PNG vs upstream JPEG noted), and a 20-task live smoke. Vitest and pytest cover the judge pipeline and adapter generation.Reviewed by Cursor Bugbot for commit fe9ae2f. Bugbot is set up for automated code reviews on this repo. Configure here.