19 changes: 19 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -165,6 +165,25 @@ python assets/task_showcase/app.py \

---

## 🧠 Skill Library (reuse solved tasks across tasks)

[`webwright.skills`](src/webwright/skills/) turns solved tasks into **reusable, executable code
skills**, retrieves and judges them at solve time, gates what enters the library, and grows the
library incrementally — a self-evolving *store → retrieve → use/adapt → gate → evolve* loop on top
of Webwright's code-as-action solves. Plugs in with **no change to the agent loop**:

- **Reuse** — the agent calls `python -m webwright.tools.skill_use --task "..." --library ...`
(like `self_reflection`/`image_qa`); it returns `{verdict: use|adapt|skip, source_path}`.
- **Grow** — `python -m webwright.skills.update --manifest batch.json --library ./library`
distills a batch of gate-passed solves into a parameterized, primitive-decomposed skill.
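
As a rough illustration of the reuse step, the sketch below shows how a caller might act on the JSON that `skill_use` returns. It relies only on the `verdict` and `source_path` fields named above; the `plan_for` helper, the plan strings, and the sample payloads are hypothetical and not part of Webwright.

```python
import json

VALID_VERDICTS = {"use", "adapt", "skip"}


def plan_for(raw: str) -> str:
    """Turn a skill_use JSON reply into a one-line plan (illustrative helper only)."""
    payload = json.loads(raw)
    verdict = payload.get("verdict", "skip")
    if verdict not in VALID_VERDICTS:
        verdict = "skip"  # treat anything unexpected as "no reusable skill"
    path = payload.get("source_path")
    if verdict == "skip" or not path:
        return "solve from scratch"
    if verdict == "use":
        return f"run {path} with new parameter values"
    # Remaining case is "adapt": keep the expensive core, change the last step.
    return f"reuse the core of {path} and rewrite the final step"


# Hypothetical payloads; real replies come from the skill_use tool.
print(plan_for('{"verdict": "use", "source_path": "library/repo_stats/skill.py"}'))
print(plan_for('{"verdict": "skip", "source_path": null}'))
```

Falling back to "solve from scratch" on any unexpected verdict keeps a malformed reply from blocking the solve.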

Validated end-to-end on a real public website (read-only GitHub): solve two repos from scratch →
`update` builds a parameterized skill → a held-out repo is solved by reusing it (agent calls
`skill_use`, verdict `use`, answer correct); a wrong solve is kept out by the gate; a second batch
improves the existing skill in place. See [`src/webwright/skills/README.md`](src/webwright/skills/README.md).

---

## 🚀 Quick Start

### Prerequisites
13 changes: 13 additions & 0 deletions src/webwright/config/skill_mode.yaml
@@ -0,0 +1,13 @@
# Skill-library mode (optional overlay): raise the step budget headroom for skill reuse runs.
# Enable by stacking: webwright run ... -c base.yaml -c model_openai.yaml -c skill_mode.yaml
#
# How the agent is told to reuse skills: the SKILL-LIBRARY block is prepended to the TASK prompt
# by the caller (see webwright.skills.prompt.with_skill_hint), NOT injected via system_template —
# webwright merges system_template by replacement, so a prompt-level hint is the clean, non-invasive
# way and keeps webwright's default behavior unchanged when this mode is off.
#
# Requires env:
# SKILL_LIBRARY_ROOT path to the skill library (read by the skill_use tool)
# SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT (optional) backend for skill_use; defaults to OPENAI_*
agent:
  step_limit: 100
76 changes: 76 additions & 0 deletions src/webwright/skills/README.md
@@ -0,0 +1,76 @@
# `webwright.skills` — a memory / skill-library module for Webwright

Turn solved tasks into **reusable, executable code skills**, retrieve and judge them at solve
time, gate what enters the library, and grow the library incrementally. A self-evolving loop:

```
solve --(gate: gold | self_verify)--> admit --(evolve: refine + parameterize + primitives)--> library
  ^                                                                                              |
  |   skill_use tool: retrieve + decide (use / adapt / skip) <-----------------------------------+
```

This is the missing **reuse + accumulation** layer: Webwright already turns a task into a
parameterized script (`crafted_cli`); this module accumulates those across tasks, judges when a
prior skill applies, and improves skills as more solves arrive — with a gate so wrong solves don't
pollute the library.

## How it plugs into Webwright

Two touch points, **no change to the agent loop or default config**:

1. **Reuse at solve time: the `skill_use` tool.** The agent invokes it from bash, exactly like
   `self_reflection` / `image_qa`:
   ```bash
   python -m webwright.tools.skill_use --task "<the task>" --library "$SKILL_LIBRARY_ROOT"
   ```
   It returns JSON `{verdict: use|adapt|skip, skill_id, source_path, how_to_reuse}`. The agent
   reads `source_path` and reuses the skill (use = as-is, adapt = reuse core + change last step,
   skip = solve from scratch). `webwright.skills.with_skill_hint(prompt, ...)` prepends a one-line
   usage hint to the task prompt so the agent remembers to query the library first.

2. **Growth after solving: the `update` CLI.** Distill a batch of gate-passed solves into a
   library skill (offline, not in the solve loop):
   ```bash
   python -m webwright.skills.update --manifest batch.json --library ./library
   ```
   `batch.json = {"template": "...", "runs": [{"dir","admit","params", ...}, ...]}`.
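
To make the manifest shape concrete, here is a minimal sketch that assembles and writes a `batch.json` with the keys named above. The value types (a boolean `admit`, a dict `params`), the template text, and the run directories are assumptions made for illustration; the fields elided by `...` in the description are deliberately left out rather than guessed.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_RUN_KEYS = {"dir", "admit", "params"}


def build_manifest(template: str, runs: list[dict]) -> dict:
    """Assemble a manifest dict, checking only the run keys the README names."""
    for run in runs:
        missing = REQUIRED_RUN_KEYS - set(run)
        if missing:
            raise ValueError(f"run is missing keys: {sorted(missing)}")
    return {"template": template, "runs": runs}


manifest = build_manifest(
    "Report the star count of repository {repo}",  # hypothetical template text
    [
        {"dir": "runs/repo_a", "admit": True, "params": {"repo": "org/a"}},
        {"dir": "runs/repo_b", "admit": False, "params": {"repo": "org/b"}},
    ],
)

# Round-trip through a temporary file to confirm the manifest is valid JSON.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "batch.json"
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

admitted = sum(1 for run in loaded["runs"] if run["admit"])
print(f"{len(loaded['runs'])} runs, {admitted} admitted")
```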

## Components

| file | role |
|---|---|
| `library.py` | `Skill` + `Library(root)`: on-disk skills (`<id>/skill.py` + `meta.json`) |
| `retrieve.py` | `retrieve(task, library)` → ranked `Candidate`s (relevance) |
| `decide.py` | `decide(task, candidates)` → `Decision(verdict, skill_id, reason)` (utility: use/adapt/skip) |
| `gate.py` | `gate(result, method=gold\|self_verify\|none)` → admit? (keeps wrong solves out) |
| `update.py` | `evolve(traces, library)`: grow on the existing library — add / adapt-refine / keep; `_refine` parameterizes + decomposes into primitives, incrementally improving an existing skill |
| `llm.py` | `configure_llm(model)` + `llm()`: **backend-agnostic** via Webwright's `Model` abstraction; a bare CLI builds the model from `SKILL_MODEL_NAME`/`SKILL_MODEL_ENDPOINT` (or `OPENAI_*`) env — no hardcoded endpoint/key |
| `prompt.py` | `with_skill_hint(prompt, task, library)`: non-invasive task-prompt hint |

## Gate

`gate(method=...)` is the **admission** check. It is independent of the solving agent and is not
the same as `self_reflection`, which is the agent's own completion condition.
- `gold`: compare against a known answer (benchmarks); strongest.
- `self_verify`: invariant only (non-empty + shape); a weak placeholder when no gold exists. It does
  not check *correctness*. Note: `self_reflection` cannot serve as the gate, because
  `require_self_reflection_success` makes it always `predicted_label==1`, so it would admit
  everything (the agent grading itself).
- `none`: admit all (demos).
- `auto` (the default): use `gold` when a gold answer is supplied, otherwise `self_verify`.
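
The difference between the two real checks can be seen in a small standalone sketch. It is condensed from the logic in `gate.py` in this PR (the shape check is reduced to a plain `isinstance` test), and the sample answers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    admit: bool
    reason: str


def self_verify(result, expected_type=None) -> GateResult:
    """Invariant-only check: non-empty and, optionally, of the expected Python type."""
    if result is None or (isinstance(result, (list, dict, str)) and len(result) == 0):
        return GateResult(False, "result is empty")
    if expected_type is not None and not isinstance(result, expected_type):
        return GateResult(False, "wrong shape")
    return GateResult(True, "non-empty, shape ok")


def gold_check(result, gold) -> GateResult:
    """Equality against a known answer."""
    ok = result == gold
    return GateResult(ok, "matches gold" if ok else "differs from gold")


wrong_answer = "1,024 stars"  # hypothetical mis-extracted answer
gold_answer = "2,048 stars"   # hypothetical known answer

# self_verify admits the wrong answer because it is a non-empty string.
print(self_verify(wrong_answer, str))
# The gold check rejects it because it does not equal the known answer.
print(gold_check(wrong_answer, gold_answer))
```

This is why `self_verify` is described as a placeholder: it filters out empty or mis-shaped results but cannot tell a wrong answer from a right one.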

## Backend

Backend-agnostic. Either `configure_llm(model_config_or_Model)` once in-process, or set
`SKILL_MODEL_NAME` / `SKILL_MODEL_ENDPOINT` (falling back to `OPENAI_*`) so a bare tool invocation
uses the same backend as the running agent. No gateway or key is hardcoded.
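
The environment fallback can be sketched as a pure function over a mapping. The variable names and config keys below follow `_model()` in `llm.py` from this PR (including `SKILL_MODEL_CLASS`, which defaults to `openai`); the `resolve_backend` name and the sample values are illustrative only.

```python
from typing import Mapping


def resolve_backend(env: Mapping[str, str]) -> dict:
    """Build a model config from env, preferring SKILL_* over OPENAI_* values."""
    cfg = {"model_class": env.get("SKILL_MODEL_CLASS", "openai")}
    name = env.get("SKILL_MODEL_NAME") or env.get("OPENAI_MODEL")
    endpoint = env.get("SKILL_MODEL_ENDPOINT") or env.get("OPENAI_ENDPOINT")
    if name:
        cfg["model_name"] = name
    if endpoint:
        cfg["openai_endpoint"] = endpoint
    return cfg


# Hypothetical environments; pass os.environ to use the real one.
print(resolve_backend({}))
print(resolve_backend({"OPENAI_MODEL": "agent-model", "SKILL_MODEL_NAME": "skill-model"}))
```

Keys that are unset are simply omitted from the config, so the model class's own defaults apply.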

## Results (summary)

Validated with this module (full data + analysis live in the companion research repo, not here):
- **Real website (public GitHub, read-only):** end-to-end loop works — two repos solved from
scratch → `update` distilled a parameterized skill → a held-out repo solved by reusing it
(agent called `skill_use`, verdict `use`, answer correct).
- **Incremental growth:** a second batch improves the existing skill in place (keeps the working
functions, adds robustness) rather than rewriting it.
- **Gate prevents pollution:** a wrong solve is dropped and never enters the library.
- **Reuse value is task-dependent:** step savings are modest on easy tasks (query overhead ≈ the
exploration it saves) and larger on harder tasks with more exploration to skip.
22 changes: 22 additions & 0 deletions src/webwright/skills/__init__.py
@@ -0,0 +1,22 @@
"""webwright.skills — a memory/skill library module for webwright.

Store solved tasks as reusable, executable code skills; retrieve + judge (use/adapt/skip) at
solve time; admit via a gate; and grow the library incrementally (evolve). Plugs into webwright
as a built-in submodule:
- solve-time reuse : the `skill_use` tool (agent invokes it like self_reflection / image_qa)
- offline growth : `update.evolve` (run after solves to distill gate-passed solves into skills)

Backend-agnostic: configure_llm(model) wires it to any webwright Model.
"""
from .library import Library, Skill
from .retrieve import retrieve, Candidate
from .decide import decide, Decision
from .gate import gate, GateResult
from .update import evolve, Trace
from .llm import configure_llm
from .prompt import with_skill_hint

__all__ = [
    "Library", "Skill", "retrieve", "Candidate", "decide", "Decision",
    "gate", "GateResult", "evolve", "Trace", "configure_llm", "with_skill_hint",
]
50 changes: 50 additions & 0 deletions src/webwright/skills/decide.py
@@ -0,0 +1,50 @@
"""Decide whether to use: candidates + task -> use / adapt / skip (utility).

Stable interface (swappable implementation):
decide(task, candidates, *, method="llm") -> Decision
Relevant != useful: retrieve gives "how similar", decide gives "whether and how to use it".
"""
from __future__ import annotations
from dataclasses import dataclass

from .llm import llm_json


@dataclass
class Decision:
    verdict: str  # "use" | "adapt" | "skip"
    skill_id: str | None
    reason: str


def _decide_llm(task: str, candidates) -> Decision:
    if not candidates:
        return Decision("skip", None, "no candidate skills")
    # One catalog line per candidate, most relevant first.
    cat = "\n".join(
        f"- skill_id: {c.skill.skill_id} | template: {c.skill.meta.get('template','')} | "
        f"summary: {c.skill.summary} | params: {c.skill.signature.get('params', [])}"
        for c in candidates
    )
    system = (
        "Decide whether a library skill is worth using for THIS task. Output STRICT JSON: "
        '{"verdict":"use|adapt|skip","skill_id":"...","reason":"..."}.\n'
        "- use = the skill fits the task as-is (just different parameter values).\n"
        "- adapt = the skill's expensive core (login / navigation / extraction) is reusable, but the "
        "FINAL step differs; the agent should reuse the front and add/adapt only the last step.\n"
        "- skip = no candidate is worth it; solve from scratch (skill_id = null).\n"
        "Relevance is not enough; only 'use'/'adapt' if it genuinely saves work."
    )
    user = f"## Task\n{task}\n\n## Candidate skills (most relevant first)\n{cat}"
    out = llm_json(system, user)
    verdict = out.get("verdict", "skip")
    if verdict not in ("use", "adapt", "skip"):
        verdict = "skip"  # unknown verdicts fall back to solving from scratch
    skill_id = out.get("skill_id") if verdict != "skip" else None
    # Guard against a skill_id that was never offered to the model.
    if verdict != "skip" and skill_id not in {c.skill.skill_id for c in candidates}:
        return Decision("skip", None, f"model returned unknown skill_id {skill_id!r}")
    return Decision(verdict=verdict, skill_id=skill_id, reason=out.get("reason", ""))


_DECIDERS = {"llm": _decide_llm}


def decide(task: str, candidates, *, method: str = "llm") -> Decision:
    return _DECIDERS[method](task, candidates)
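
To show what the model actually sees, the sketch below rebuilds the candidate catalog string from `_decide_llm` using stand-in dataclasses. Only the `.skill` attribute of `Candidate` is confirmed by `decide.py`; the `score` field, the sample skill, and the `catalog` helper name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:  # stand-in holding only the fields the catalog reads
    skill_id: str
    meta: dict = field(default_factory=dict)


@dataclass
class Candidate:  # assumed shape: decide.py only confirms the `.skill` attribute
    skill: Skill
    score: float = 0.0  # hypothetical relevance score


def catalog(candidates: list[Candidate]) -> str:
    """Render one line per candidate, in the order given (most relevant first)."""
    lines = []
    for c in candidates:
        meta = c.skill.meta
        lines.append(
            f"- skill_id: {c.skill.skill_id} | template: {meta.get('template', '')} | "
            f"summary: {meta.get('summary', '')} | "
            f"params: {meta.get('signature', {}).get('params', [])}"
        )
    return "\n".join(lines)


demo = Candidate(
    Skill(
        "github_repo_stats",  # hypothetical skill
        {
            "template": "Get {field} of {repo}",
            "summary": "Read a field from a repo page",
            "signature": {"params": ["repo", "field"]},
        },
    ),
    score=0.9,
)
print(catalog([demo]))
```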
68 changes: 68 additions & 0 deletions src/webwright/skills/gate.py
@@ -0,0 +1,68 @@
"""Admission gate: only "correct" solves/skills enter the library, preventing correct-but-narrow /
regression pollution. The gate is an INDEPENDENT second eye — distinct from the solving agent's own
self_reflection (which is a solve-completion condition, not an admission check).

Stable interface (swappable implementation), configurable method:
gate(result, *, gold=None, output_schema=None, method="auto") -> GateResult

- method="gold" : compare against gold (benchmarks like WebArena; truly independent, catches
mis-extracted solves). Recommended.
- method="self_verify" : invariant only (result non-empty + shape matches output_schema). A weak
placeholder when no gold exists.
Limitation: only checks "present / right shape", not "correct" — a wrong but
non-empty answer is admitted anyway.
(Note: webwright's self_reflection is always predicted_label==1 due to
require_self_reflection_success, so it cannot serve as the gate — that is a
solve-completion condition, not independent admission.)
- method="none" : no gate (demo reuse only, no pollution protection).
- method="auto" : use gold if available, else self_verify.
Upgrade path (next step): for real websites use WebJudge (OM2W's official judge) or cross-source
consistency checks for a truly independent gate.
"""
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class GateResult:
    admit: bool
    reason: str


def _shape_ok(result, output_schema) -> bool:
    """Check the result's Python type against output_schema["type"], if one is given."""
    if not output_schema:
        return True
    t = output_schema.get("type")
    if t == "array":
        return isinstance(result, list)
    if t == "object":
        return isinstance(result, dict)
    if t == "string":
        return isinstance(result, str)
    if t in ("number", "integer"):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(result, (int, float)) and not isinstance(result, bool)
    return True


def _self_verify(result, output_schema) -> GateResult:
    if result is None:
        return GateResult(False, "result is null")
    if isinstance(result, (list, dict, str)) and len(result) == 0:
        return GateResult(False, "result is empty")
    if not _shape_ok(result, output_schema):
        return GateResult(False, f"shape != output_schema ({output_schema.get('type')})")
    return GateResult(True, "self-verify passed (non-empty, shape ok)")


def _gold(result, gold) -> GateResult:
    if result == gold:
        return GateResult(True, "matches gold")
    return GateResult(False, "differs from gold")


def gate(result, *, gold=None, output_schema=None, method: str = "auto") -> GateResult:
    if method == "none":
        return GateResult(True, "no gate (admit all)")
    if method == "gold" and gold is None:
        # Without this guard a null result would compare equal to the missing gold.
        return GateResult(False, "method='gold' requires a gold answer")
    if method == "gold" or (method == "auto" and gold is not None):
        return _gold(result, gold)
    return _self_verify(result, output_schema)
59 changes: 59 additions & 0 deletions src/webwright/skills/library.py
@@ -0,0 +1,59 @@
"""Skill store. A skill = a directory under the library root holding skill.py + meta.json.

Interface (stable — implementations behind it may change):
Library(root).list() -> [Skill]
Library(root).get(skill_id) -> Skill | None
Library(root).add(skill) # write skill.py + meta.json
"""
from __future__ import annotations
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Skill:
    skill_id: str
    code: str  # source of skill.py
    meta: dict = field(default_factory=dict)  # {template, site, signature, summary, ...}

    @property
    def summary(self) -> str:
        return self.meta.get("summary", "")

    @property
    def signature(self) -> dict:
        return self.meta.get("signature", {})


class Library:
    def __init__(self, root: str | Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _dir(self, skill_id: str) -> Path:
        return self.root / skill_id

    def list(self) -> list[Skill]:
        out = []
        for d in sorted(self.root.iterdir()):
            # Only directories that contain meta.json count as skills.
            if (d / "meta.json").exists():
                out.append(self.get(d.name))
        return [s for s in out if s]

    def get(self, skill_id: str) -> Skill | None:
        d = self._dir(skill_id)
        if not (d / "meta.json").exists():
            return None
        # Read as UTF-8 explicitly: add() writes non-ASCII characters unescaped.
        meta = json.loads((d / "meta.json").read_text(encoding="utf-8"))
        code = (d / "skill.py").read_text(encoding="utf-8") if (d / "skill.py").exists() else ""
        return Skill(skill_id=skill_id, code=code, meta=meta)

    def add(self, skill: Skill) -> None:
        d = self._dir(skill.skill_id)
        d.mkdir(parents=True, exist_ok=True)
        (d / "skill.py").write_text(skill.code, encoding="utf-8")
        (d / "meta.json").write_text(
            json.dumps(skill.meta, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    def path(self, skill_id: str) -> Path:
        return self._dir(skill_id) / "skill.py"
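
The on-disk layout described in the module docstring (one directory per skill holding `skill.py` and `meta.json`) can be exercised with a standalone round trip. This is a sketch using only the standard library; the helper names, the skill id, and the file contents are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_skill(root: Path, skill_id: str, code: str, meta: dict) -> Path:
    """Write <root>/<skill_id>/skill.py and meta.json, returning the skill directory."""
    d = root / skill_id
    d.mkdir(parents=True, exist_ok=True)
    (d / "skill.py").write_text(code, encoding="utf-8")
    (d / "meta.json").write_text(
        json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return d


def list_skill_ids(root: Path) -> list[str]:
    """A directory counts as a skill only if it contains meta.json."""
    return [d.name for d in sorted(root.iterdir()) if (d / "meta.json").exists()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_skill(root, "github_repo_stats", "def run(repo):\n    ...\n", {"summary": "demo"})
    (root / "scratch").mkdir()  # no meta.json, so it is not listed as a skill
    ids = list_skill_ids(root)
    meta_path = root / "github_repo_stats" / "meta.json"
    summary = json.loads(meta_path.read_text(encoding="utf-8"))["summary"]

print(ids, summary)
```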
67 changes: 67 additions & 0 deletions src/webwright/skills/llm.py
@@ -0,0 +1,67 @@
"""LLM helper for the skills module — backend-agnostic, via webwright's own model abstraction.

No hardcoded gateway/endpoint/key: the caller passes a webwright Model (or a model config dict),
so this works with any backend webwright supports (openai / anthropic / openrouter / custom).
"""
from __future__ import annotations

import json
import re
from typing import Any, Optional

from webwright.models import get_model

# Process-wide default model, set once via configure_llm() so retrieve/decide/update can call
# llm() without each caller threading a Model through. Falls back to env-configured openai.
_DEFAULT_MODEL: Optional[Any] = None


def configure_llm(model: Any) -> None:
    """Register the Model (or model-config dict) the skills module should use."""
    global _DEFAULT_MODEL
    _DEFAULT_MODEL = get_model(model) if isinstance(model, dict) else model


def _model() -> Any:
    if _DEFAULT_MODEL is not None:
        return _DEFAULT_MODEL
    # default: build an openai-style model from env so a bare CLI invocation
    # (e.g. `python -m webwright.tools.skill_use`) uses the SAME backend as the running agent.
    # Honors SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT (or OPENAI_* fallbacks); no hardcoded gateway.
    import os
    cfg = {"model_class": os.environ.get("SKILL_MODEL_CLASS", "openai")}
    name = os.environ.get("SKILL_MODEL_NAME") or os.environ.get("OPENAI_MODEL")
    endpoint = os.environ.get("SKILL_MODEL_ENDPOINT") or os.environ.get("OPENAI_ENDPOINT")
    if name:
        cfg["model_name"] = name
    if endpoint:
        cfg["openai_endpoint"] = endpoint
    return get_model(cfg)


def llm(system: str, user: str, *, model: Any = None, **_: Any) -> str:
    """Single-turn call. Returns raw text. `model` overrides the configured default."""
    m = model if model is not None else _model()
    messages = [
        m.format_message(role="system", content=system),
        m.format_message(role="user", content=user),
    ]
    return m(messages)


def llm_json(system: str, user: str, **kw: Any) -> dict:
    """Call the model and parse a JSON object out of the reply.

    The candidate span runs from the first "{" to the last "}". If that span does not parse,
    trailing characters are trimmed until a prefix parses; {} is returned if nothing does.
    """
    txt = llm(system, user, **kw)
    match = re.search(r"\{.*\}", txt, re.S)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except Exception:
        s = match.group(0)
        # Drop trailing characters one at a time until a prefix is valid JSON.
        for end in range(len(s), 0, -1):
            try:
                return json.loads(s[:end])
            except Exception:
                continue
        return {}
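
The parsing strategy in `llm_json` can be tried without a model. The sketch below copies the regex and the trim-from-the-end fallback into a standalone `extract_json` helper (a hypothetical name) and runs it on hand-written replies. It catches `json.JSONDecodeError` rather than the broader `Exception` used above.

```python
import json
import re


def extract_json(txt: str) -> dict:
    """Parse a JSON object spanning from the first '{' towards the last '}' in txt."""
    match = re.search(r"\{.*\}", txt, re.S)
    if not match:
        return {}
    s = match.group(0)
    # Try the full span first, then progressively shorter prefixes.
    for end in range(len(s), 0, -1):
        try:
            return json.loads(s[:end])
        except json.JSONDecodeError:
            continue
    return {}


# Prose around a single object: the span is exactly the object.
print(extract_json('Sure! {"verdict": "use", "skill_id": "a"} Hope that helps.'))
# Trailing junk containing braces: trimming from the end recovers the object.
print(extract_json('First {"verdict": "skip"} and also {not json}'))
# Junk braces BEFORE the object: every prefix starts with the junk, so nothing parses.
print(extract_json('see {curly} then {"verdict": "use"}'))
```

The last case shows the limit of this approach: the first `{` in the reply must begin the JSON object, so a caller should be prepared for an empty dict.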