WebVoyager Harbor adapter (Kernel env + cua agent) by rgarcia · Pull Request #42 · kernel/cua

rgarcia · 2026-06-27T11:26:16Z

Summary

Adds the WebVoyager benchmark as a Harbor adapter that runs on the Kernel
environment with cua as the agent. Stacks on the cua-harbor shared core (#40).

benchmarks/adapters/webvoyager/ — Python task generator (adapter.py +
main.py) that emits Harbor task dirs from the vendored WebVoyager dataset
(643 tasks, pinned under src/webvoyager/data/), plus a stdlib-only verifier
(webjudge.py) that ports WebVoyager's single multimodal judge to the
Anthropic Messages API (no pip needed in the verifier VM).
The judge is hardened for cross-model + transient-failure resilience: it
retries once without temperature when a model rejects it with a 400, and
fails closed to reward 0 with the error recorded in grading_details.json
rather than crashing a trial into a missing reward.

Live smoke (recorded in `SMOKE.md`)

Ran the full pipeline live — 20 tasks across 13 sites, -n 8 with browser
pools, claude-sonnet-4-6 for both the agent and the judge, 900s/task:

Pass rate 10/20 (10/17 of graded tasks). 7 graded NOT SUCCESS, 3
AgentTimeoutError on heavy/anti-bot sites (Amazon, Apple, Booking,
Allrecipes). No adapter bugs — browser provision, agent drive,
answer/screenshot spill, in-VM judge, and reward write all fired on every
task.
Failure modes: agent timeouts on the heaviest sites (recommend full 1800s
budget + a residential proxy for a parity run), and judge strictness on
textually-correct but visually-unconfirmed answers.

Test plan

uv run pytest adapters/webvoyager/tests — 25 passed (generation +
judge parse/retry/error-handling, network stubbed)
uv run ruff check clean
Live 20-task smoke on Kernel (above)
Parity run with full 1800s budget + residential proxy (follow-up)

🤖 Generated with Claude Code

Note

Medium Risk
New benchmark path affects scoring semantics (LLM judge, last-k screenshots, fail-closed rewards) and requires ANTHROPIC_API_KEY at verify time; large surface area but mostly additive with tests and parity docs.

Overview
Introduces a new WebVoyager benchmark adapter under benchmarks/adapters/webvoyager/ so Harbor can run the 643 live-web tasks on the Kernel environment with cua_harbor:CuaHarborAgent.

A stdlib Python generator (adapter.py / main.py) turns the vendored, pinned dataset into Harbor task dirs: instructions, kernel.json start URLs, ground_truth.json, oracle solve.sh, and task.toml with verifier env (JUDGE_MODEL, MAX_IMAGES). IDs are slugified for Harbor naming; --task-ids accepts upstream or normalized ids.

Grading is a bundled node judge.js (TypeScript → single ESM via tsdown + pi-ai) that ports upstream auto_eval.py: verbatim SYSTEM_PROMPT, last-k screenshots, SUCCESS / NOT SUCCESS parse, and default MAX_IMAGES=15 (parity fix vs canonical --max_attached_imgs 15). The verifier reads /logs/agent/answer.txt and spilled shots, writes reward.txt + grading_details.json, and fails closed to 0 on judge/API errors.

Docs include README.md, PARITY.md, SMOKE.md, adapter_metadata.json, and run_webvoyager.yaml. Judge and adapter behavior are covered by Vitest and pytest (per PR description).

^{Reviewed by Cursor Bugbot for commit d2cff75. Bugbot is set up for automated code reviews on this repo. Configure here.}

Generates WebVoyager's 643 live-web tasks (15 sites) as Harbor task dirs that run on the Kernel environment via the shared cua_harbor agent. Each record becomes instruction.md + environment/kernel.json (start_url + stealth + 1280x1024) + a per-task ground_truth.json; the dataset is vendored and pinned to upstream commit 0915445 for hermetic generation. The verifier ports WebVoyager's single multimodal judge (SYSTEM_PROMPT verbatim from upstream auto_eval.py) to the Anthropic Messages API: it reads /logs/agent/answer.txt + the last-k /logs/agent/shots/shot-<n>.png the agent spilled and writes a 0/1 reward (SUCCESS/NOT SUCCESS, ambiguous fails closed). Site names with spaces are slugified so [task].name matches ORG_NAME_PATTERN, and reference answers with stray control chars are escaped for valid TOML. Generated task dirs and caches are gitignored. Mocked unit tests + ruff green.

The Kernel verifier VM has Python 3 but no pip/ensurepip, so the judge cannot install the anthropic SDK at grade time. Call the Messages API directly with urllib.request instead; drop the install step from test.sh and point docs at bare python3 for generation. Also gitignore _smoke_logs/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Run the WebVoyager adapter end to end on Kernel browsers with cua as the agent and the Anthropic WebJudge as the verifier: 20 tasks, pass rate 10/20 (10/17 of graded tasks), 3 agent timeouts on heavy/anti-bot sites, no adapter bugs. SMOKE.md captures the per-task table and the observed failure taxonomy. Make the judge resilient across model generations and transient API failures: retry once without `temperature` when a model rejects it with a 400 (newer models do), and fail closed to reward 0 with the error recorded in grading_details.json instead of crashing a trial into a missing reward.

The mid-run snapshot under-counted exceptions; the final summary is 5 (4 AgentTimeoutError + 1 AddTestsDirError). Headline Mean 0.500 (10/20) unchanged.

Replace the SMOKE notes with the claude-opus-4-8 agent + opus-4-8 judge run: 14/20 pass over 20 curated tasks across 12 sites, 0 judge/adapter exceptions. Failure taxonomy: 1 anti-bot (Cloudflare), 3 screenshot-coverage false-negatives (the MAX_IMAGES tension), 1 agent timeout (multi-constraint faceted search), and 1 env/session-lifetime error (session deleted before the shared-session verifier could attach). The judge hardening this run validated (temperature-drop retry + fail-closed on HTTP error) is already on the branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… auto-eval The canonical WebVoyager auto-eval invocation (evaluation/run_eval.sh + README) runs the GPT-4V judge with --max_attached_imgs 15; our default of 3 was read from auto_eval.py's argparse default, which is never what produces the published numbers. With one screenshot spilled per agent step, the last-k window is the only place the deciding frame can land, so k=3 left correct answers unverifiable and produced screenshot-coverage false-negatives. Set the default to 15 in task.toml and webjudge.py (env override preserved) and fix the README/run-config notes that quoted the old default. A live re-run at k=15 recovers the SMOKE false-negatives (apple--2, huggingface--2 both 0 -> 1). Adds PARITY.md documenting the applied fix vs the deliberate Kernel adaptations left intact. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

rgarcia · 2026-06-27T15:52:27Z

Parity pass vs MinorJerry/WebVoyager @ 0915445 (9ca1f57): fixed a real fidelity bug — the judge's last-k screenshot count was 3 vs the canonical --max_attached_imgs 15. Raised MAX_IMAGES 3→15 (still env-overridable). This is the cause of the smoke's judge screenshot-coverage false-negatives. The GPT-4V SYSTEM_PROMPT + verdict parsing were confirmed verbatim; Anthropic-vs-OpenAI judge kept as an intentional adaptation. See PARITY.md.

Replace the hand-rolled stdlib-urllib Anthropic judge (webjudge.py) with a self-contained TypeScript bin under judge/ that calls the model through @earendil-works/pi-ai's completeSimple. pi-ai owns provider routing, env-var keys, o-series temperature/max_completion_tokens quirks, vision, and retries, so the manual provider client and temperature-drop retry are deleted rather than ported. Transport-only change: the SYSTEM_PROMPT, the last-k (MAX_IMAGES=15) screenshot selection, the SUCCESS/NOT SUCCESS verdict parse, and the claude-sonnet-4-5 default are carried over byte-identically. JUDGE_MODEL is now a pi-ai provider:name ref (bare name defaults to anthropic). pi-ai is bundled into a single judge.js via tsdown (inlineDynamicImports) so the verifier runs with no install on the Kernel VM. test.sh shells `node judge.js` with the same inputs; the adapter copies the built bundle into each task's tests/ and build-checks it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…i-ai judge Same o-series handling as the online-mind2web judge: o4-mini/o3 reject `temperature` and a `none` reasoning effort. Gate `reasoning: medium` + no temperature to OpenAI reasoning models; the claude-sonnet-4-5 default keeps `temperature: 0`. Also drops the docstring's incorrect claim that pi-ai omits temperature for o-series. Verified live with both claude-sonnet-4-5 and openai:o4-mini. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Invalid max-images breaks last-k
- The judge CLI now validates --max-images as a positive integer and throws on invalid values, preventing slice(-0/-NaN) from attaching all screenshots.
✅ Fixed: Negative limit drops wrong tasks
- Task limit handling now rejects negative values in both CLI parsing and adapter initialization so slicing cannot silently select the wrong task set.

Or push these changes by commenting:

@cursor push 8070bc1ebc

Preview (8070bc1ebc)

diff --git a/benchmarks/adapters/webvoyager/judge/src/judge.ts b/benchmarks/adapters/webvoyager/judge/src/judge.ts
--- a/benchmarks/adapters/webvoyager/judge/src/judge.ts
+++ b/benchmarks/adapters/webvoyager/judge/src/judge.ts
@@ -32,7 +32,15 @@
   detailsOut?: string;
 }
 
-function parseArgs(argv: string[]): Args {
+function parsePositiveIntFlag(flag: string, value: string): number {
+  const parsed = Number(value);
+  if (!Number.isInteger(parsed) || parsed <= 0) {
+    throw new Error(`--${flag} must be a positive integer`);
+  }
+  return parsed;
+}
+
+export function parseArgs(argv: string[]): Args {
   const flags = new Map<string, string>();
   for (let i = 0; i < argv.length; i += 1) {
     const arg = argv[i];
@@ -51,7 +59,7 @@
     answer: flags.get("answer") ?? "/logs/agent/answer.txt",
     shots: flags.get("shots") ?? "/logs/agent/shots",
     judgeModel: flags.get("judge-model") ?? "claude-sonnet-4-5",
-    maxImages: Number(flags.get("max-images") ?? "15"),
+    maxImages: parsePositiveIntFlag("max-images", flags.get("max-images") ?? "15"),
     rewardOut: required("reward-out"),
     detailsOut: flags.get("details-out"),
   };

diff --git a/benchmarks/adapters/webvoyager/judge/test/judge.test.ts b/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
--- a/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
+++ b/benchmarks/adapters/webvoyager/judge/test/judge.test.ts
@@ -3,7 +3,7 @@
 import { join } from "node:path";
 import { describe, expect, it } from "vitest";
 import type { Args } from "../src/judge.ts";
-import { run } from "../src/judge.ts";
+import { parseArgs, run } from "../src/judge.ts";
 import type { GradingDetails, JudgeContent, JudgeModel } from "../src/types.ts";
 
 /** A /logs/agent + /tests layout, plus the verifier output paths run() writes. */
@@ -46,6 +46,22 @@
   return JSON.parse(readFileSync(args.detailsOut!, "utf8")) as GradingDetails;
 }
 
+describe("parseArgs", () => {
+  it("defaults max-images to 15", () => {
+    const args = parseArgs(["--reward-out", "/tmp/reward.txt"]);
+    expect(args.maxImages).toBe(15);
+  });
+
+  it.each(["0", "-1", "abc", "2.5", ""])(
+    "rejects invalid max-images value %s",
+    (raw) => {
+      expect(() =>
+        parseArgs(["--reward-out", "/tmp/reward.txt", "--max-images", raw])
+      ).toThrow("--max-images must be a positive integer");
+    }
+  );
+});
+
 describe("run", () => {
   it.each([
     ["The agent did it. SUCCESS", "1"],

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
@@ -100,6 +100,8 @@
         **kwargs: object,
     ):
         self.output_dir = Path(output_dir)
+        if limit is not None and limit < 0:
+            raise ValueError("limit must be >= 0")
         self.limit = limit
         self.overwrite = overwrite
         self.task_ids = task_ids

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/main.py b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/main.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
@@ -21,6 +21,13 @@
     return Path(__file__).resolve().parents[3] / ".tasks"
 
 
+def _non_negative_int(value: str) -> int:
+    parsed = int(value)
+    if parsed < 0:
+        raise argparse.ArgumentTypeError("--limit/--num-tasks must be >= 0")
+    return parsed
+
+
 def _parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description="Generate Harbor tasks for the WebVoyager benchmark",
@@ -34,7 +41,7 @@
     parser.add_argument(
         "--limit",
         "--num-tasks",
-        type=int,
+        type=_non_negative_int,
         dest="limit",
         default=None,
         help="Generate only the first N tasks",

diff --git a/benchmarks/adapters/webvoyager/tests/test_adapter.py b/benchmarks/adapters/webvoyager/tests/test_adapter.py
--- a/benchmarks/adapters/webvoyager/tests/test_adapter.py
+++ b/benchmarks/adapters/webvoyager/tests/test_adapter.py
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import argparse
 import json
 import re
 import sys
@@ -14,6 +15,7 @@
 sys.path.insert(0, str(SRC))
 
 from webvoyager.adapter import WebVoyagerAdapter, _index_reference, _toml_escape  # noqa: E402
+from webvoyager.main import _non_negative_int  # noqa: E402
 
 
 @pytest.fixture
@@ -132,6 +134,18 @@
     assert ids == {"Amazon--3", "Apple--1"}
 
 
+def test_negative_limit_rejected(tmp_path: Path) -> None:
+    with pytest.raises(ValueError, match="limit must be >= 0"):
+        WebVoyagerAdapter(output_dir=tmp_path / "out", limit=-1)
+
+
+def test_limit_cli_parser_rejects_negative_values() -> None:
+    assert _non_negative_int("0") == 0
+    assert _non_negative_int("3") == 3
+    with pytest.raises(argparse.ArgumentTypeError, match="--limit/--num-tasks must be >= 0"):
+        _non_negative_int("-1")
+
+
 def test_overwrite_false_skips_existing(adapter: WebVoyagerAdapter) -> None:
     adapter.run()
     target = adapter.output_dir / "webvoyager-allrecipes--0" / "instruction.md"

_{You can send follow-ups to the cloud agent here.}

A 0/negative/non-numeric --max-images made lastShots' slice(-k) attach every screenshot instead of the last k, risking judge token blowups and spurious 0 rewards; parse it to a positive integer, falling back to the 15 default otherwise. A negative --limit likewise made tasks[:limit] drop tasks off the end instead of taking the first N, so reject it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

✅ Fixed: Wrong default tasks output path
- Changed _default_output_dir() to use parents[2] so the default resolves to adapters/webvoyager/.tasks as documented.
✅ Fixed: Task IDs omit slug alias
- Extended task selection to also match normalize_id(task.source_id) so normalized IDs like apple--1 are accepted.
✅ Fixed: Refresh pulls unpinned upstream data
- Updated refresh URLs to read the pinned upstream_commit from adapter_metadata.json instead of tracking the upstream main branch.

Or push these changes by commenting:

@cursor push d62ef80cff

Preview (d62ef80cff)

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/adapter.py
@@ -29,8 +29,11 @@
 # copied into each task's tests/ so the verifier runs `node judge.js` with no install.
 ADAPTER_ROOT = PACKAGE_DIR.parents[1]
 JUDGE_BUNDLE = ADAPTER_ROOT / "judge" / "dist" / "judge.js"
+UPSTREAM_COMMIT = json.loads((ADAPTER_ROOT / "adapter_metadata.json").read_text())["dataset"][
+    "upstream_commit"
+]
 
-RAW_BASE = "https://raw.githubusercontent.com/MinorJerry/WebVoyager/main/data"
+RAW_BASE = f"https://raw.githubusercontent.com/MinorJerry/WebVoyager/{UPSTREAM_COMMIT}/data"
 DATASET_URL = f"{RAW_BASE}/WebVoyager_data.jsonl"
 REFERENCE_URL = f"{RAW_BASE}/reference_answer.json"
 
@@ -136,6 +139,7 @@
                 task
                 for task in tasks
                 if task.source_id in requested
+                or self.normalize_id(task.source_id) in requested
                 or self.make_local_task_id(task.source_id) in requested
             ]
         if self.limit is not None:

diff --git a/benchmarks/adapters/webvoyager/src/webvoyager/main.py b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
--- a/benchmarks/adapters/webvoyager/src/webvoyager/main.py
+++ b/benchmarks/adapters/webvoyager/src/webvoyager/main.py
@@ -18,7 +18,7 @@
 
 
 def _default_output_dir() -> Path:
-    return Path(__file__).resolve().parents[3] / ".tasks"
+    return Path(__file__).resolve().parents[2] / ".tasks"
 
 
 def _parse_args() -> argparse.Namespace:

diff --git a/benchmarks/adapters/webvoyager/tests/test_adapter.py b/benchmarks/adapters/webvoyager/tests/test_adapter.py
--- a/benchmarks/adapters/webvoyager/tests/test_adapter.py
+++ b/benchmarks/adapters/webvoyager/tests/test_adapter.py
@@ -13,7 +13,14 @@
 SRC = Path(__file__).resolve().parents[1] / "src"
 sys.path.insert(0, str(SRC))
 
-from webvoyager.adapter import WebVoyagerAdapter, _index_reference, _toml_escape  # noqa: E402
+from webvoyager.adapter import (  # noqa: E402
+    DATASET_URL,
+    REFERENCE_URL,
+    WebVoyagerAdapter,
+    _index_reference,
+    _toml_escape,
+)
+from webvoyager.main import _default_output_dir  # noqa: E402
 
 
 @pytest.fixture
@@ -132,6 +139,23 @@
     assert ids == {"Amazon--3", "Apple--1"}
 
 
+def test_task_ids_accept_normalized_alias(tmp_path: Path) -> None:
+    adapter = WebVoyagerAdapter(output_dir=tmp_path / "out", task_ids=["apple--1"])
+    selected = adapter._select()
+    assert {t.source_id for t in selected} == {"Apple--1"}
+
+
+def test_refresh_urls_pin_upstream_commit() -> None:
+    metadata = json.loads((Path(__file__).resolve().parents[1] / "adapter_metadata.json").read_text())
+    upstream_commit = metadata["dataset"]["upstream_commit"]
+    assert f"/{upstream_commit}/" in DATASET_URL
+    assert f"/{upstream_commit}/" in REFERENCE_URL
+
+
+def test_default_output_dir_points_to_adapter_root() -> None:
+    assert _default_output_dir() == Path(__file__).resolve().parents[1] / ".tasks"
+
+
 def test_negative_limit_rejected(tmp_path: Path) -> None:
     # tasks[:limit] with a negative limit would drop tasks off the end, so it must error.
     with pytest.raises(ValueError, match="non-negative"):

_{You can send follow-ups to the cloud agent here.}

^{Reviewed by Cursor Bugbot for commit d2cff75. Configure here.}

cursor · 2026-06-27T22:46:24Z

+        "--output-dir",
+        type=Path,
+        default=_default_output_dir(),
+        help="Directory to write generated tasks (default: adapters/webvoyager/.tasks)",


Wrong default tasks output path

Medium Severity

The CLI's default --output-dir is incorrectly calculated using parents[3], causing it to resolve to benchmarks/adapters/.tasks. This differs from the documented adapters/webvoyager/.tasks and results in generated tasks being written to the wrong location, making them difficult to locate or use.

^{Reviewed by Cursor Bugbot for commit d2cff75. Configure here.}

cursor · 2026-06-27T22:46:24Z

+                for task in tasks
+                if task.source_id in requested
+                or self.make_local_task_id(task.source_id) in requested
+            ]


Task IDs omit slug alias

Medium Severity

--task-ids matching only checks the upstream source_id (e.g. Google Flights--7) and the full local dir id (webvoyager-google-flights--7). It never checks the slug from normalize_id alone (e.g. google-flights--7), so those ids select zero tasks despite PARITY/README implying normalized ids work.

^{Reviewed by Cursor Bugbot for commit d2cff75. Configure here.}

cursor · 2026-06-27T22:46:24Z

+
+RAW_BASE = "https://raw.githubusercontent.com/MinorJerry/WebVoyager/main/data"
+DATASET_URL = f"{RAW_BASE}/WebVoyager_data.jsonl"
+REFERENCE_URL = f"{RAW_BASE}/reference_answer.json"


Refresh pulls unpinned upstream data

Medium Severity

--refresh downloads dataset files from MinorJerry/WebVoyager on the main branch, while adapter_metadata.json and the docs pin commit 0915445 with checksums. Refresh can replace vendored data with a newer upstream revision without updating the recorded pin or hashes.

^{Reviewed by Cursor Bugbot for commit d2cff75. Configure here.}

rgarcia and others added 6 commits June 27, 2026 09:18

webvoyager: reconcile SMOKE.md with harbor's final summary

b5d211e

The mid-run snapshot under-counted exceptions; the final summary is 5 (4 AgentTimeoutError + 1 AddTestsDirError). Headline Mean 0.500 (10/20) unchanged.

rgarcia and others added 2 commits June 27, 2026 21:42

rgarcia marked this pull request as ready for review June 27, 2026 22:33

cursor Bot reviewed Jun 27, 2026

View reviewed changes

Comment thread benchmarks/adapters/webvoyager/judge/src/judge.ts

Comment thread benchmarks/adapters/webvoyager/src/webvoyager/adapter.py

cursor Bot reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WebVoyager Harbor adapter (Kernel env + cua agent)#42

WebVoyager Harbor adapter (Kernel env + cua agent)#42
rgarcia wants to merge 9 commits into
hypeship/cua-bench-online-mind2webfrom
hypeship/bench-webvoyager

rgarcia commented Jun 27, 2026 •

edited by cursor Bot

Loading

Uh oh!

rgarcia commented Jun 27, 2026

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment •

edited

Loading

Uh oh!

cursor Bot Jun 27, 2026

Uh oh!

cursor Bot Jun 27, 2026

Uh oh!

cursor Bot Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

rgarcia commented Jun 27, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Live smoke (recorded in SMOKE.md)

Test plan

Uh oh!

rgarcia commented Jun 27, 2026

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 27, 2026

Choose a reason for hiding this comment

Wrong default tasks output path

Uh oh!

cursor Bot Jun 27, 2026

Choose a reason for hiding this comment

Task IDs omit slug alias

Uh oh!

cursor Bot Jun 27, 2026

Choose a reason for hiding this comment

Refresh pulls unpinned upstream data

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rgarcia commented Jun 27, 2026 •

edited by cursor Bot

Loading

Live smoke (recorded in `SMOKE.md`)

cursor Bot left a comment •

edited

Loading

cursor Bot left a comment •

edited

Loading