Online-Mind2Web adapter (Kernel env + cua agent) by rgarcia · Pull Request #43 · kernel/cua

rgarcia · 2026-06-27T12:47:37Z

Summary

Harbor adapter for Online-Mind2Web (300 live web tasks) on the Kernel environment with the cua agent. Generates Harbor task dirs from the dataset; graded by the ported WebJudge (key-points → per-screenshot → verdict), bundled as a self-contained in-VM Node judge. Stacked on #40 (the cua-harbor shared core).

Live smoke — opus-4-8 agent + judge, 20 tasks, browser pool

16/20 passed (mean reward 0.80); pools provisioned with no fallback. The 4 misses were agent/task/judge effects (an 1800s timeout, an empty-answer stall, judge strictness, one genuinely hard multi-constraint task); there were no adapter bugs in the final run.

Bug found + fixed by the smoke: opus rejects the temperature param → the WebJudge HTTP call 400'd and fail-closed every grade to 0; now retries without temperature on that 400 (validation 0/2 → 2/2). Also reworked the instruction so the browser-only agent states its answer in its final message.

Full taxonomy in benchmarks/adapters/online-mind2web/SMOKE.md. Unit tests + lint + judge build green.

Note

Medium Risk
Large additive benchmark surface with verifier calls to external LLM APIs and gated HuggingFace dataset fetch at generation time; isolated under benchmarks/adapters and does not change core app auth or data paths.

Overview
Adds a new benchmarks/adapters/online-mind2web package that turns the gated Online-Mind2Web dataset into Harbor task directories for the Kernel browser env and cua agent, with trajectory-based WebJudge grading instead of a gold answer.

The Python adapter (OnlineMind2WebAdapter) loads/caches dataset JSON, maps rows to slugged task dirs (instruction, kernel.json with start_url/stealth/viewport, task.json), threads level → [metadata].difficulty, and copies a prebuilt judge.js into each task’s tests/. A CLI (online_mind2web.main) supports --limit, --overwrite, and --task-ids.

The in-VM judge is a TypeScript port of upstream WebJudge: key-point extraction → per-screenshot 1–5 scoring → final verdict, with prompts/parsers aligned to OSU-NLP-Group. gradeWithWebJudge rebuilds trajectories from /logs/agent (run.jsonl, shots/*.png, answer.txt). judgeModel defaults to openai:o4-mini via @earendil-works/pi-ai (Anthropic configurable through JUDGE_MODEL); failures fail-closed to reward 0. tsdown bundles a dependency-free dist/judge.js for the VM.

Docs README.md, PARITY.md, and SMOKE.md cover usage, upstream parity (o4-mini default, PNG vs upstream JPEG noted), and a 20-task live smoke. Vitest and pytest cover the judge pipeline and adapter generation.

Reviewed by Cursor Bugbot for commit fe9ae2f. Bugbot is set up for automated code reviews on this repo.

Generates Harbor task dirs from the gated osunlp/Online-Mind2Web dataset (300 live web tasks) and grades them with WebJudge. Builds on the shared cua_harbor agent + cua-bench-task entrypoint; adds only the dataset->task generator and the verifier.

- adapter.py maps each row to a task dir: kernel.json (start_url normalized to https://, stealth, 1280x1024 viewport), instruction.md, task.toml, tests/task.json. Mirrors the cua loader's field mapping + skip-malformed.
- WebJudge verifier is a self-contained Node bundle (recovered webjudge/ prompts placed verbatim; Anthropic judge over fetch) that runs in-VM, reconstructs the trajectory from the shared core's run.jsonl + shots/ + answer.txt, and writes the reward.
- Mocked unit tests (pytest, vitest), ruff + tsc clean. Generated tasks, build artifacts, and caches are gitignored.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
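The start_url handling in the first bullet can be pictured with a small sketch. The real adapter is Python (adapter.py) and its exact normalization rules are not shown in this PR, so `normalizeStartUrl`, the http-to-https upgrade, and the `viewport` object shape below are assumptions made for illustration only.

```typescript
// Hypothetical helper: force a dataset URL onto the https:// scheme.
// Assumption: bare hosts get an https:// prefix and http:// is upgraded.
function normalizeStartUrl(raw: string): string {
  const trimmed = raw.trim();
  if (/^https:\/\//i.test(trimmed)) return trimmed;
  if (/^http:\/\//i.test(trimmed)) {
    return "https://" + trimmed.slice("http://".length);
  }
  // No scheme at all: drop any leading slashes and prefix https://.
  return "https://" + trimmed.replace(/^\/+/, "");
}

// Assumed shape of the per-task kernel.json; only the three field names
// (start_url, stealth, viewport) and the 1280x1024 size come from the PR.
const kernelConfig = {
  start_url: normalizeStartUrl("www.traderjoes.com/"),
  stealth: true,
  viewport: { width: 1280, height: 1024 },
};

console.log(JSON.stringify(kernelConfig, null, 2));
```

Keeping the normalization in one pure function makes it easy to cover with the mocked unit tests the commit mentions.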

Finalize the adapter against an opus agent + opus WebJudge after a 20-task live smoke (16/20, mean 0.800; pools worked, no 403). Fixes the smoke surfaced:

- WebJudge: retry the Anthropic call without `temperature` on the 400 newer models (e.g. opus-4-8) return for the deprecated param, instead of fail-closing every grade to reward 0. Keep temperature=0 where supported.
- Instruction: tell the browser-only agent to state its answer in its final message (captured into answer.txt) rather than to write a file it has no tools for.
- Default JUDGE_MODEL to anthropic:claude-opus-4-8 (task.toml, test.sh, judge).
- Gitignore .validate/, _smoke_logs/, and the standalone uv.lock.

Adds a mocked anthropicJudgeModel test for the temperature retry. SMOKE.md records the run, the two fixes, and the failure taxonomy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Thread each row's level (easy/medium/hard) into task.toml difficulty and tests/task.json instead of hardcoding "hard", which mislabeled most tasks. Default to hard when absent. Document the 1024 max_tokens headroom and add PARITY.md recording the parity pass (applied vs skipped).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia · 2026-06-27T15:52:25Z

Parity pass vs OSU-NLP-Group/Online-Mind2Web (f226b01): carries the dataset level into [metadata].difficulty — the hardcoded "hard" mislabeled 221/300 rows (real split: medium 141 / easy 80 / hard 79). Metadata-only, no grading impact. WebJudge prompts confirmed verbatim; the Anthropic judge + max_tokens=1024 are kept as intentional Kernel adaptations (documented in PARITY.md).
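As a quick check of the numbers above, here is a minimal sketch of the level-to-difficulty mapping and the mislabel count. The adapter itself is Python; `difficultyFor` is a hypothetical name, and treating unknown or differently cased values the same as a missing level is an assumption, since the PR only states "default to hard when absent".

```typescript
type Difficulty = "easy" | "medium" | "hard";

const KNOWN: readonly Difficulty[] = ["easy", "medium", "hard"];

// Hypothetical helper: carry the dataset row's `level` through, falling back
// to "hard" when it is absent (assumption: unrecognized values also fall back).
function difficultyFor(level: string | null | undefined): Difficulty {
  const normalized = (level ?? "").trim().toLowerCase();
  return (KNOWN as readonly string[]).includes(normalized)
    ? (normalized as Difficulty)
    : "hard";
}

// Split reported in the parity pass: medium 141 / easy 80 / hard 79.
const split: Record<Difficulty, number> = { medium: 141, easy: 80, hard: 79 };
const total = split.easy + split.medium + split.hard;

// With a hardcoded "hard", every non-hard row carried the wrong label.
const mislabeledByHardcodedHard = total - split.hard;

console.log(`${mislabeledByHardcodedHard}/${total} rows mislabeled`);
```

The count follows directly from the split: 141 medium plus 80 easy rows were labeled "hard", which matches the 221/300 figure quoted in the comment.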

Switch the judge backbone default from anthropic:claude-opus-4-8 to openai:o4-mini, the published WebJudge model, for leaderboard parity.

Add a dependency-free OpenAI Chat Completions client to model.ts (stdlib fetch, no SDK) routed by the JUDGE_MODEL provider prefix. It supports vision (image_url data-URLs with detail:high) and handles the o-series reasoning quirks: omit temperature and use max_completion_tokens for o\d-prefixed names. The Anthropic Messages path stays wired as a configurable, non-canonical cheaper alternative.

task.toml passes OPENAI_API_KEY (host-resolved) alongside ANTHROPIC_API_KEY and defaults JUDGE_MODEL to openai:o4-mini. README + PARITY.md document the new default, the configurable Anthropic path, and WebJudge-7B as a future GPU-hosted option. Mocked tests cover provider routing, the no-temperature / max_completion_tokens construction, and the vision payload shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
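The provider routing and o-series handling described in this commit could look roughly like the sketch below. It is not the code from model.ts: `parseJudgeModel`, `isOSeries`, and `openAITokenParams` are hypothetical names, and sending `max_tokens` with `temperature: 0` for non-o-series models is an assumption. Only the request field names (`max_completion_tokens`, `max_tokens`, `temperature`) are standard Chat Completions parameters.

```typescript
type JudgeRef = { provider: string; name: string };

// Split a "provider:model" ref on the first colon only, so model names that
// themselves contain colons survive intact.
function parseJudgeModel(ref: string): JudgeRef {
  const idx = ref.indexOf(":");
  if (idx <= 0 || idx === ref.length - 1) {
    throw new Error(`JUDGE_MODEL must look like "provider:model", got "${ref}"`);
  }
  return { provider: ref.slice(0, idx), name: ref.slice(idx + 1) };
}

// o-series names (o3, o4-mini, ...) are detected by an o\d prefix.
function isOSeries(name: string): boolean {
  return /^o\d/.test(name);
}

type TokenParams =
  | { max_completion_tokens: number }
  | { max_tokens: number; temperature: number };

// o-series: no temperature, token budget under max_completion_tokens.
// Everything else (assumed): classic max_tokens plus deterministic sampling.
function openAITokenParams(name: string, maxTokens: number): TokenParams {
  return isOSeries(name)
    ? { max_completion_tokens: maxTokens }
    : { max_tokens: maxTokens, temperature: 0 };
}

const ref = parseJudgeModel("openai:o4-mini");
console.log(ref.provider, openAITokenParams(ref.name, 1024));
```

Keeping the request construction in a pure function is what lets mocked tests assert the no-temperature shape without any network call.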

rgarcia · 2026-06-27T19:31:31Z

Default WebJudge switched to OpenAI o4-mini (60afab6) — the published WebJudge backbone (~85.7% human agreement), for leaderboard parity. Added an OpenAI client to the judge bin (provider routing on JUDGE_MODEL; o-series handling: omit temperature, use max_completion_tokens, vision via image_url); JUDGE_MODEL defaults to openai:o4-mini, with sonnet/opus still configurable (clearly non-canonical), and WebJudge-7B noted as a future GPU-hosted option. Live-validated: o4-mini graded a real 20-screenshot OM2W trajectory end-to-end — 20 vision scoring calls, coherent key-point verdict, no API errors. 25/25 unit tests pass.

Replace the hand-rolled raw-fetch OpenAI/Anthropic judge clients with @earendil-works/pi-ai's completeSimple, which owns provider routing, env-key resolution, the o-series reasoning quirks, and vision encoding. Resolve the JUDGE_MODEL ref (default openai:o4-mini) to a pi-ai model and surface pi-ai's stopReason "error" as a throw so a grading failure is logged rather than scored as a task failure. Bundle pi-ai via tsdown (inlineDynamicImports) so the verifier stays a single self-contained judge.js. Prompts, parsing, scoring, and the model default are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…the pi-ai judge

o4-mini (and o3) reject `temperature` and reject a reasoning effort of `none` (pi-ai's default when the level is unset) on the responses API. Pass `reasoning: medium` (o4-mini's own default, the calibration the published WebJudge numbers use) and omit `temperature` for OpenAI reasoning models; every other backbone keeps deterministic `temperature: 0`. Verified live: o4-mini grades a real trajectory end-to-end (key points, per-screenshot vision scores, verdict) with no API errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Stale test expects anthropic judge
- Updated the adapter test to assert the generated JUDGE_MODEL matches the current template default of openai:o4-mini.
✅ Fixed: Anthropic judge sends temperature zero
- Changed judge option selection to omit temperature for Anthropic models and added a regression assertion that Anthropic calls do not include temperature.

Or push these changes by commenting:

@cursor push de5434b3bd

Preview (de5434b3bd)

diff --git a/benchmarks/adapters/online-mind2web/judge/src/model.ts b/benchmarks/adapters/online-mind2web/judge/src/model.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/model.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/model.ts
@@ -50,13 +50,15 @@
   // reject a reasoning effort of "none" (pi-ai's default when the level is
   // unset) — they require low/medium/high. "medium" is OpenAI's own o4-mini
   // default, the setting the published WebJudge agreement numbers are
-  // calibrated to. Every other backbone keeps deterministic scoring
-  // (temperature 0) with no forced thinking.
+  // calibrated to. Anthropic backbones reject `temperature: 0`, so we omit it
+  // there and only force deterministic sampling on the remaining providers.
   const baseOptions = { apiKey, maxTokens: MAX_OUTPUT_TOKENS };
   const options =
     model.reasoning && provider === "openai"
       ? { ...baseOptions, reasoning: "medium" as const }
-      : { ...baseOptions, temperature: 0 };
+      : provider === "anthropic"
+        ? baseOptions
+        : { ...baseOptions, temperature: 0 };
 
   return {
     async complete(systemPrompt, content) {

diff --git a/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts b/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
--- a/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
+++ b/benchmarks/adapters/online-mind2web/judge/test/artifacts.test.ts
@@ -203,7 +203,9 @@
 
     expect(getModel).toHaveBeenCalledWith("anthropic", "claude-opus-4-8");
     expect(getEnvApiKey).toHaveBeenCalledWith("anthropic");
-    expect(completeSimple.mock.calls[0][2]).toMatchObject({ apiKey: "sk-ant" });
+    const options = completeSimple.mock.calls[0][2];
+    expect(options).toMatchObject({ apiKey: "sk-ant", maxTokens: 1024 });
+    expect(options).not.toHaveProperty("temperature");
   });
 
   it("concatenates the text blocks of the assistant message", async () => {

diff --git a/benchmarks/adapters/online-mind2web/tests/test_adapter.py b/benchmarks/adapters/online-mind2web/tests/test_adapter.py
--- a/benchmarks/adapters/online-mind2web/tests/test_adapter.py
+++ b/benchmarks/adapters/online-mind2web/tests/test_adapter.py
@@ -112,7 +112,7 @@
     assert toml["task"]["name"] == "osunlp/online-mind2web-abc123-110325"
     assert toml["metadata"]["reference_length"] == "6"
     assert toml["metadata"]["difficulty"] == "medium"
-    assert toml["verifier"]["env"]["JUDGE_MODEL"].startswith("anthropic:")
+    assert toml["verifier"]["env"]["JUDGE_MODEL"] == "openai:o4-mini"
 
     instruction = (task_dir / "instruction.md").read_text()
     assert "https://www.traderjoes.com/" in instruction

cursor · 2026-06-27T22:47:39Z

+  const options =
+    model.reasoning && provider === "openai"
+      ? { ...baseOptions, reasoning: "medium" as const }
+      : { ...baseOptions, temperature: 0 };


Anthropic judge sends temperature zero

High Severity

The judgeModel unconditionally passes temperature: 0 to pi-ai for non-OpenAI models. Anthropic Opus models reject this parameter with an HTTP 400, leading to grading failures and a reward of 0. This regression removes a prior retry mechanism that handled such rejections.
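A standalone sketch of the three-way option selection that the Autofix preview applies is shown here, so the branches can be exercised without pi-ai. `selectJudgeOptions` and `JudgeOptions` are hypothetical names; the option keys (`apiKey`, `maxTokens`, `reasoning`, `temperature`) are taken from the diff, and no pi-ai function is called.

```typescript
type JudgeOptions = {
  apiKey: string;
  maxTokens: number;
  reasoning?: "low" | "medium" | "high";
  temperature?: number;
};

// Hypothetical pure helper mirroring the ternary in the Autofix preview.
function selectJudgeOptions(
  provider: string,
  isReasoningModel: boolean,
  apiKey: string,
  maxTokens = 1024,
): JudgeOptions {
  const base: JudgeOptions = { apiKey, maxTokens };
  // OpenAI reasoning models: force "medium" effort and send no temperature.
  if (isReasoningModel && provider === "openai") {
    return { ...base, reasoning: "medium" };
  }
  // Anthropic: omit temperature entirely so newer models do not return a 400.
  if (provider === "anthropic") return base;
  // Remaining providers keep deterministic sampling.
  return { ...base, temperature: 0 };
}

console.log(selectJudgeOptions("anthropic", false, "sk-ant"));
```

Because the helper returns a plain object, a regression test can assert the absence of the `temperature` key directly, which is the same property the updated artifacts.test.ts checks.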

Reviewed by Cursor Bugbot for commit e8d1820.

The judge backbone now defaults to openai:o4-mini (leaderboard parity), but test_generates_full_task_dir still asserted JUDGE_MODEL started with "anthropic:", so it failed against the current task.toml template.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.

There are 4 total unresolved issues (including 1 from previous review).

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

✅ Fixed: Action history drops screenshotless tools
- loadTrajectory now appends a step with the resolved action even when a tool_result has no existing screenshots, preserving complete action history.
✅ Fixed: Details failure overwrites reward
- judge.ts now tracks whether reward output was already written and only writes fallback 0 on errors before that point, so details-write failures no longer clobber successful rewards.
✅ Fixed: Empty thoughts desync snapshot reasons
- gradeWithWebJudge now keeps one thought entry per kept snapshot without filtering empty strings, maintaining prompt reason/image alignment.

Or push these changes by commenting:

@cursor push 8c543d2432

Preview (8c543d2432)

diff --git a/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts b/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/artifacts.ts
@@ -50,9 +50,10 @@
  * `assistant` lines carrying `tool_calls: [{id, name, arguments}]` and
  * `tool_result` lines carrying `{call_id, shots: ["shots/shot-N.png"]}`. Each
  * spilled screenshot becomes one step whose action is the tool call that
- * produced it; the bytes are read from disk into base64 for the judge. The
- * final answer comes from `answer.txt` (the grading channel), falling back to
- * the `final` record.
+ * produced it; when no screenshot exists, a step is still kept so action
+ * history remains complete. Screenshot bytes are read from disk into base64 for
+ * the judge. The final answer comes from `answer.txt` (the grading channel),
+ * falling back to the `final` record.
  */
 export function loadTrajectory(opts: {
   runJsonlPath: string;
@@ -83,9 +84,11 @@
         }
       } else if (rec.t === "tool_result") {
         const action = callActions.get(String(rec.call_id)) ?? "tool_result";
+        let pushedStep = false;
         for (const rel of (rec.shots as string[]) ?? []) {
           const abs = isAbsolute(rel) ? rel : join(baseDir, rel);
           if (!existsSync(abs)) continue;
+          pushedStep = true;
           steps.push({
             index: steps.length,
             action,
@@ -93,6 +96,12 @@
             screenshotMimeType: mimeForPath(rel),
           });
         }
+        if (!pushedStep) {
+          steps.push({
+            index: steps.length,
+            action,
+          });
+        }
       } else if (rec.t === "final" && !finalAnswer) {
         finalAnswer = String(rec.answer ?? "");
       }

diff --git a/benchmarks/adapters/online-mind2web/judge/src/judge.ts b/benchmarks/adapters/online-mind2web/judge/src/judge.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/judge.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/judge.ts
@@ -61,6 +61,7 @@
 
 async function main(): Promise<void> {
   const args = parseArgs(process.argv.slice(2));
+  let rewardWritten = false;
   try {
     const task = loadTask(args.task);
     const trajectory = loadTrajectory({
@@ -76,6 +77,7 @@
       scoreThreshold: args.scoreThreshold,
     });
     writeReward(args.rewardOut, result.success ? 1 : 0);
+    rewardWritten = true;
     if (args.detailsOut) {
       mkdirSync(dirname(args.detailsOut), { recursive: true });
       writeFileSync(
@@ -91,7 +93,7 @@
     // Never leave the reward file empty: a grading failure is a 0, and the
     // error is surfaced on stderr for the verifier log.
     console.error(err instanceof Error ? (err.stack ?? err.message) : String(err));
-    writeReward(args.rewardOut, 0);
+    if (!rewardWritten) writeReward(args.rewardOut, 0);
   }
 }
 

diff --git a/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts b/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
--- a/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
+++ b/benchmarks/adapters/online-mind2web/judge/src/webjudge.ts
@@ -33,7 +33,7 @@
   );
 
   const kept = scored.filter((r) => r.score >= scoreThreshold).slice(0, P.MAX_IMAGE);
-  const keptThoughts = kept.map((r) => r.thought).filter((t) => t !== "").slice(0, P.MAX_IMAGE);
+  const keptThoughts = kept.map((r) => r.thought);
   const actions = trajectory.steps.map((s) => s.action);
 
   const finalContent: JudgeContent = [

Reviewed by Cursor Bugbot for commit fe9ae2f.

cursor · 2026-06-27T22:56:00Z

+            screenshotBase64: readFileSync(abs).toString("base64"),
+            screenshotMimeType: mimeForPath(rel),
+          });
+        }


Action history drops screenshotless tools

Medium Severity

The loadTrajectory function only adds steps to the trajectory for tool_result records that have corresponding screenshot files on disk. Tool calls without existing screenshots are omitted, resulting in an incomplete WebJudge action history and potentially incorrect grading compared to the full last_actions from upstream.
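The behavior this comment asks for can be sketched without touching the filesystem. The record shapes follow the doc comment in artifacts.ts (`tool_calls: [{id, name, arguments}]`, `{call_id, shots}`), but `buildSteps`, the injected `shotExists` predicate, the string type of `arguments`, and the `"name arguments"` action format are assumptions for illustration.

```typescript
type RunRecord =
  | { t: "assistant"; tool_calls: { id: string; name: string; arguments: string }[] }
  | { t: "tool_result"; call_id: string; shots?: string[] };

type Step = { index: number; action: string; screenshotPath?: string };

// Hypothetical variant of loadTrajectory's loop: every tool_result yields at
// least one step, so the action history stays complete without a screenshot.
function buildSteps(
  records: RunRecord[],
  shotExists: (path: string) => boolean,
): Step[] {
  const callActions = new Map<string, string>();
  const steps: Step[] = [];
  for (const rec of records) {
    if (rec.t === "assistant") {
      for (const call of rec.tool_calls) {
        callActions.set(call.id, `${call.name} ${call.arguments}`);
      }
    } else {
      const action = callActions.get(rec.call_id) ?? "tool_result";
      const present = (rec.shots ?? []).filter(shotExists);
      if (present.length === 0) {
        // No screenshot on disk: keep the action, leave the image unset.
        steps.push({ index: steps.length, action });
      }
      for (const path of present) {
        steps.push({ index: steps.length, action, screenshotPath: path });
      }
    }
  }
  return steps;
}

const demoRecords: RunRecord[] = [
  {
    t: "assistant",
    tool_calls: [
      { id: "c1", name: "click", arguments: "{\"x\":10,\"y\":20}" },
      { id: "c2", name: "type", arguments: "{\"text\":\"milk\"}" },
    ],
  },
  { t: "tool_result", call_id: "c1", shots: ["shots/shot-1.png"] },
  { t: "tool_result", call_id: "c2", shots: [] },
];

const demoSteps = buildSteps(demoRecords, (p) => p === "shots/shot-1.png");
console.log(demoSteps.map((s) => s.action));
```

Any consumer that sends images to the judge then has to skip steps whose screenshot is unset, which is why the scoring code filters on screenshot presence separately from the action list.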

Reviewed by Cursor Bugbot for commit fe9ae2f.

cursor · 2026-06-27T22:56:00Z

+    // Never leave the reward file empty: a grading failure is a 0, and the
+    // error is surfaced on stderr for the verifier log.
+    console.error(err instanceof Error ? (err.stack ?? err.message) : String(err));
+    writeReward(args.rewardOut, 0);


Details failure overwrites reward

High Severity

After a successful grade, writeReward writes 1 to reward.txt, but any later error in the same try block (for example writing grading_details.json) falls into the shared catch, which always writes reward 0. A passing run can be recorded as failure when only the details artifact fails.
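The failure mode, and the guard that avoids it, can be reproduced in a few lines. This is a simplified sketch rather than judge.ts itself: `finishGrade` is a hypothetical name, file writes are replaced by an injected writer over an in-memory map, and the details write is forced to throw.

```typescript
type Writer = (path: string, body: string) => void;

// Hypothetical reduction of main(): write the reward, then the optional
// details artifact. The fallback 0 is only written if no reward exists yet.
function finishGrade(
  success: boolean,
  writeFile: Writer,
  writeDetails: () => void,
): void {
  let rewardWritten = false;
  try {
    writeFile("reward.txt", success ? "1" : "0");
    rewardWritten = true;
    writeDetails(); // may throw, e.g. on a full disk
  } catch (err) {
    console.error(err instanceof Error ? err.message : String(err));
    // Without this guard the passing reward would be clobbered with 0.
    if (!rewardWritten) writeFile("reward.txt", "0");
  }
}

const files = new Map<string, string>();
finishGrade(
  true,
  (path, body) => { files.set(path, body); },
  () => { throw new Error("details write failed"); },
);
console.log(files.get("reward.txt"));
```

An alternative design would move the details write into its own try block; the flag keeps the change to a single line in the shared catch.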

Reviewed by Cursor Bugbot for commit fe9ae2f.

cursor · 2026-06-27T22:56:00Z

+      type: "image" as const,
+      data: r.step.screenshotBase64!,
+      mimeType: r.step.screenshotMimeType ?? "image/png",
+    })),


Empty thoughts desync snapshot reasons

Medium Severity

Kept screenshots sent to the final judge include every image above the score threshold, but the prompt’s numbered “reasons” list drops entries whose per-image thought parsed empty. The text block and trailing images no longer stay paired the way upstream WebJudge builds whole_thoughts alongside whole_content_img.
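A small worked example shows how the pairing breaks. The data is made up, and the numbered-reason format is an assumption about how the prompt lists thoughts; only the two expressions for the thought list mirror the before and after lines in webjudge.ts.

```typescript
type Scored = { score: number; thought: string; image: string };

// Made-up per-screenshot results; the second kept image parsed an empty thought.
const scored: Scored[] = [
  { score: 4, thought: "shows the cart", image: "shot-1" },
  { score: 5, thought: "", image: "shot-2" },
  { score: 3, thought: "confirmation page", image: "shot-3" },
  { score: 1, thought: "unrelated", image: "shot-4" },
];

const scoreThreshold = 3;
const kept = scored.filter((r) => r.score >= scoreThreshold);

// Before: empty thoughts are dropped, so later reasons shift up by one slot.
const filteredThoughts = kept.map((r) => r.thought).filter((t) => t !== "");

// After: exactly one thought per kept image, empty or not.
const alignedThoughts = kept.map((r) => r.thought);

// Assumed prompt format: reason i describes kept image i.
const reasons = alignedThoughts.map((t, i) => `${i + 1}. ${t}`);

console.log(filteredThoughts.length, alignedThoughts.length, reasons);
```

With the filtered list, "confirmation page" lands in slot 2 even though it describes the third kept image (shot-3); keeping the empty entry preserves the one-to-one pairing between reasons and trailing images.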

Reviewed by Cursor Bugbot for commit fe9ae2f.

rgarcia and others added 3 commits June 27, 2026 09:14

rgarcia and others added 2 commits June 27, 2026 21:35

rgarcia marked this pull request as ready for review June 27, 2026 22:33

cursor Bot reviewed Jun 27, 2026

rgarcia wants to merge 7 commits into hypeship/cua-bench-online-mind2web from hypeship/bench-online-mind2web
