microsoft · DEM1TASSE · Jun 30, 2026
diff --git a/README.md b/README.md
@@ -165,6 +165,25 @@ python assets/task_showcase/app.py \
 
 ---
 
+## 🧠 Skill Library (reuse solutions across tasks)
+
+[`webwright.skills`](src/webwright/skills/) turns solved tasks into **reusable, executable code
+skills**, retrieves and judges them at solve time, gates what enters the library, and grows the
+library incrementally — a self-evolving *store → retrieve → use/adapt → gate → evolve* loop on top
+of Webwright's code-as-action solves. Plugs in with **no change to the agent loop**:
+
+- **Reuse** — the agent calls `python -m webwright.tools.skill_use --task "..." --library ...`
+  (like `self_reflection`/`image_qa`); it returns `{verdict: use|adapt|skip, source_path}`.
+- **Grow** — `python -m webwright.skills.update --manifest batch.json --library ./library`
+  distills a batch of gate-passed solves into a parameterized, primitive-decomposed skill.
+
+Validated end-to-end on a real public website (read-only GitHub): solve two repos from scratch →
+`update` builds a parameterized skill → a held-out repo is solved by reusing it (agent calls
+`skill_use`, verdict `use`, answer correct); a wrong solve is kept out by the gate; a second batch
+improves the existing skill in place. See [`src/webwright/skills/README.md`](src/webwright/skills/README.md).
+
+---
+
 ## 🚀 Quick Start
 
 ### Prerequisites

diff --git a/src/webwright/config/skill_mode.yaml b/src/webwright/config/skill_mode.yaml
@@ -0,0 +1,13 @@
+# Skill-library mode (optional overlay): raises the step limit to leave headroom for skill-reuse runs.
+# Enable by stacking:  webwright run ... -c base.yaml -c model_openai.yaml -c skill_mode.yaml
+#
+# How the agent is told to reuse skills: the SKILL-LIBRARY block is prepended to the TASK prompt
+# by the caller (see webwright.skills.prompt.with_skill_hint), NOT injected via system_template —
+# webwright merges system_template by replacement, so a prompt-level hint is the clean, non-invasive
+# way and keeps webwright's default behavior unchanged when this mode is off.
+#
+# Requires env:
+#   SKILL_LIBRARY_ROOT          path to the skill library (read by the skill_use tool)
+#   SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT  (optional) backend for skill_use; defaults to OPENAI_*
+agent:
+  step_limit: 100
diff --git a/src/webwright/skills/README.md b/src/webwright/skills/README.md
@@ -0,0 +1,76 @@
+# `webwright.skills` — a memory / skill-library module for Webwright
+
+Turn solved tasks into **reusable, executable code skills**, retrieve and judge them at solve
+time, gate what enters the library, and grow the library incrementally. A self-evolving loop:
+
+```
+solve  --(gate: gold | self_verify)-->  admit  --(evolve: refine + parameterize + primitives)-->  library
+  ^                                                                                                   |
+  |  skill_use tool: retrieve + decide (use / adapt / skip)  <--------------------------------------- +
+```
+
+This is the missing **reuse + accumulation** layer: Webwright already turns a task into a
+parameterized script (`crafted_cli`); this module accumulates those across tasks, judges when a
+prior skill applies, and improves skills as more solves arrive — with a gate so wrong solves don't
+pollute the library.
+
+## How it plugs into Webwright
+
+Two touch points, **no change to the agent loop or default config**:
+
+1. **Reuse at solve time — the `skill_use` tool.** The agent invokes it from bash, exactly like
+   `self_reflection` / `image_qa`:
+   ```bash
+   python -m webwright.tools.skill_use --task "<the task>" --library "$SKILL_LIBRARY_ROOT"
+   ```
+   It returns JSON `{verdict: use|adapt|skip, skill_id, source_path, how_to_reuse}`. The agent
+   reads `source_path` and reuses the skill (use = as-is, adapt = reuse core + change last step,
+   skip = solve from scratch). `webwright.skills.with_skill_hint(prompt, ...)` prepends a one-line
+   usage hint to the task prompt so the agent remembers to query the library first.
+
+2. **Growth after solving — the `update` CLI.** Distill a batch of gate-passed solves into a
+   library skill (offline, not in the solve loop):
+   ```bash
+   python -m webwright.skills.update --manifest batch.json --library ./library
+   ```
+   `batch.json = {"template": "...", "runs": [{"dir","admit","params", ...}, ...]}`.
+
+## Components
+
+| file | role |
+|---|---|
+| `library.py`  | `Skill` + `Library(root)`: on-disk skills (`<id>/skill.py` + `meta.json`) |
+| `retrieve.py` | `retrieve(task, library)` → ranked `Candidate`s (relevance) |
+| `decide.py`   | `decide(task, candidates)` → `Decision(verdict, skill_id, reason)` (utility: use/adapt/skip) |
+| `gate.py`     | `gate(result, method=auto\|gold\|self_verify\|none)` → admit? (keeps wrong solves out) |
+| `update.py`   | `evolve(traces, library)`: grow on the existing library — add / adapt-refine / keep; `_refine` parameterizes + decomposes into primitives, incrementally improving an existing skill |
+| `llm.py`      | `configure_llm(model)` + `llm()`: **backend-agnostic** via Webwright's `Model` abstraction; a bare CLI builds the model from `SKILL_MODEL_NAME`/`SKILL_MODEL_ENDPOINT` (or `OPENAI_*`) env — no hardcoded endpoint/key |
+| `prompt.py`   | `with_skill_hint(prompt, task, library)`: non-invasive task-prompt hint |
+
+## Gate
+
+`gate(method=...)` is the **admission** check. It is independent of the solving agent, unlike
+`self_reflection`, which is the agent's own completion condition. The default, `auto`, uses `gold` when a gold answer is passed and `self_verify` otherwise:
+- `gold` — compare against a known answer (benchmarks); strongest.
+- `self_verify` — invariant only (non-empty + shape); weak placeholder when no gold exists. It does
+  not check *correctness*. Note: `self_reflection` cannot serve as the gate — `require_self_reflection_success`
+  makes it always `predicted_label==1`, so it would admit everything (agent grading itself).
+- `none` — admit all (demos).
+
+## Backend
+
+Backend-agnostic. Either `configure_llm(model_config_or_Model)` once in-process, or set
+`SKILL_MODEL_NAME` / `SKILL_MODEL_ENDPOINT` (falling back to `OPENAI_*`) so a bare tool invocation
+uses the same backend as the running agent. No gateway or key is hardcoded.
+
+## Results (summary)
+
+Validated with this module (full data + analysis live in the companion research repo, not here):
+- **Real website (public GitHub, read-only):** end-to-end loop works — two repos solved from
+  scratch → `update` distilled a parameterized skill → a held-out repo solved by reusing it
+  (agent called `skill_use`, verdict `use`, answer correct).
+- **Incremental growth:** a second batch improves the existing skill in place (keeps the working
+  functions, adds robustness) rather than rewriting it.
+- **Gate prevents pollution:** a wrong solve is dropped and never enters the library.
+- **Reuse value is task-dependent:** step savings are modest on easy tasks (query overhead ≈ the
+  exploration it saves) and larger on harder tasks with more exploration to skip.
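
Note on the `skill_use` contract described in the README above: the reply fields (`verdict`, `source_path`) and the meaning of each verdict come from that text, and the sketch below shows one way a caller might turn such a reply into a next action. It is not code from this PR. `plan_from_skill_use` and the action strings are hypothetical names, and the payload is assumed to be the JSON object printed by the tool.

```python
import json


def plan_from_skill_use(payload: str) -> dict:
    """Map a skill_use JSON reply to a next action (hypothetical helper, not part of the PR)."""
    try:
        reply = json.loads(payload)
    except json.JSONDecodeError:
        reply = {}
    if not isinstance(reply, dict):
        reply = {}

    verdict = reply.get("verdict")
    source_path = reply.get("source_path")

    # Anything other than a well-formed use/adapt reply falls back to solving from scratch.
    if verdict not in ("use", "adapt") or not source_path:
        return {"action": "solve_from_scratch", "source_path": None}

    # use = run the skill as-is; adapt = reuse the core and change only the last step.
    action = "run_as_is" if verdict == "use" else "reuse_core_change_last_step"
    return {"action": action, "source_path": source_path}
```

Treating malformed output as `skip` keeps the default behavior (solve from scratch) whenever the library query fails, which matches the README's claim that the agent loop itself is unchanged.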
diff --git a/src/webwright/skills/__init__.py b/src/webwright/skills/__init__.py
@@ -0,0 +1,22 @@
+"""webwright.skills — a memory/skill library module for webwright.
+
+Store solved tasks as reusable, executable code skills; retrieve + judge (use/adapt/skip) at
+solve time; admit via a gate; and grow the library incrementally (evolve). Plugs into webwright
+as a built-in submodule:
+  - solve-time reuse  : the `skill_use` tool (agent invokes it like self_reflection / image_qa)
+  - offline growth    : `update.evolve` (run after solves to distill gate-passed solves into skills)
+
+Backend-agnostic: configure_llm(model) wires it to any webwright Model.
+"""
+from .library import Library, Skill
+from .retrieve import retrieve, Candidate
+from .decide import decide, Decision
+from .gate import gate, GateResult
+from .update import evolve, Trace
+from .llm import configure_llm
+from .prompt import with_skill_hint
+
+__all__ = [
+    "Library", "Skill", "retrieve", "Candidate", "decide", "Decision",
+    "gate", "GateResult", "evolve", "Trace", "configure_llm", "with_skill_hint",
+]
diff --git a/src/webwright/skills/decide.py b/src/webwright/skills/decide.py
@@ -0,0 +1,50 @@
+"""Decide whether to use: candidates + task -> use / adapt / skip (utility).
+
+Stable interface (swappable implementation):
+    decide(task, candidates, *, method="llm") -> Decision
+Relevant != useful: retrieve gives "how similar", decide gives "whether and how to use it".
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+
+from .llm import llm_json
+
+
+@dataclass
+class Decision:
+    verdict: str            # "use" | "adapt" | "skip"
+    skill_id: str | None
+    reason: str
+
+
+def _decide_llm(task: str, candidates) -> Decision:
+    if not candidates:
+        return Decision("skip", None, "no candidate skills")
+    cat = "\n".join(
+        f"- skill_id: {c.skill.skill_id} | template: {c.skill.meta.get('template','')} | "
+        f"summary: {c.skill.summary} | params: {c.skill.signature.get('params', [])}"
+        for c in candidates
+    )
+    system = (
+        "Decide whether a library skill is worth using for THIS task. Output STRICT JSON: "
+        '{"verdict":"use|adapt|skip","skill_id":"...","reason":"..."}.\n'
+        "- use   = the skill fits the task as-is (just different parameter values).\n"
+        "- adapt = the skill's expensive core (login / navigation / extraction) is reusable, but the "
+        "FINAL step differs; the agent should reuse the front and add/adapt only the last step.\n"
+        "- skip  = no candidate is worth it; solve from scratch (skill_id = null).\n"
+        "Relevance is not enough: choose 'use'/'adapt' only if it genuinely saves work."
+    )
+    user = f"## Task\n{task}\n\n## Candidate skills (most relevant first)\n{cat}"
+    out = llm_json(system, user)
+    verdict = out.get("verdict", "skip")
+    if verdict not in ("use", "adapt", "skip"):
+        verdict = "skip"
+    skill_id = out.get("skill_id") if verdict != "skip" else None
+    return Decision(verdict=verdict, skill_id=skill_id, reason=out.get("reason", ""))
+
+
+_DECIDERS = {"llm": _decide_llm}
+
+
+def decide(task: str, candidates, *, method: str = "llm") -> Decision:
+    return _DECIDERS[method](task, candidates)
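
Note on `_decide_llm` above: the verdict is validated against the three allowed values, but the returned `skill_id` is taken from the model as-is, so a `use` verdict could name a skill that was never among the candidates. A possible hardening is sketched below. `normalize_decision` and `known_ids` are hypothetical names, and `Decision` is redefined here only so the example runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Decision:
    verdict: str            # "use" | "adapt" | "skip"
    skill_id: str | None
    reason: str


def normalize_decision(out: dict, known_ids: set[str]) -> Decision:
    """Hypothetical post-processing of the raw LLM JSON into a safe Decision."""
    verdict = out.get("verdict", "skip")
    skill_id = out.get("skill_id")
    reason = str(out.get("reason", ""))

    # Unknown or missing verdicts degrade to "skip", as in the PR.
    if verdict not in ("use", "adapt"):
        return Decision("skip", None, reason)

    # Extra check (an assumption, not in the PR): the id must be one of the candidates.
    if not isinstance(skill_id, str) or skill_id not in known_ids:
        return Decision("skip", None, f"unknown skill_id {skill_id!r}; solving from scratch")

    return Decision(verdict, skill_id, reason)
```

In `_decide_llm` the set of known ids would be `{c.skill.skill_id for c in candidates}`.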
diff --git a/src/webwright/skills/gate.py b/src/webwright/skills/gate.py
@@ -0,0 +1,68 @@
+"""Admission gate: only "correct" solves/skills enter the library, preventing correct-but-narrow /
+regression pollution. The gate is an INDEPENDENT second eye — distinct from the solving agent's own
+self_reflection (which is a solve-completion condition, not an admission check).
+
+Stable interface (swappable implementation), configurable method:
+    gate(result, *, gold=None, output_schema=None, method="auto") -> GateResult
+
+- method="gold"        : compare against gold (benchmarks like WebArena; truly independent, catches
+                         mis-extracted solves). Recommended.
+- method="self_verify" : invariant only (result non-empty + shape matches output_schema). A weak
+                         placeholder when no gold exists.
+                         Limitation: only checks "present / right shape", not "correct" — a wrong but
+                         non-empty answer is admitted anyway.
+                         (Note: webwright's self_reflection is always predicted_label==1 due to
+                         require_self_reflection_success, so it cannot serve as the gate — that is a
+                         solve-completion condition, not independent admission.)
+- method="none"        : no gate (demo reuse only, no pollution protection).
+- method="auto"        : use gold if available, else self_verify.
+Upgrade path (next step): for real websites use WebJudge (OM2W's official judge) or cross-source
+consistency checks for a truly independent gate.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+
+
+@dataclass
+class GateResult:
+    admit: bool
+    reason: str
+
+
+def _shape_ok(result, output_schema) -> bool:
+    if not output_schema:
+        return True
+    t = output_schema.get("type")
+    if t == "array":
+        return isinstance(result, list)
+    if t == "object":
+        return isinstance(result, dict)
+    if t == "string":
+        return isinstance(result, str)
+    if t in ("number", "integer"):
+        return isinstance(result, (int, float)) and not isinstance(result, bool)
+    return True
+
+
+def _self_verify(result, output_schema) -> GateResult:
+    if result is None:
+        return GateResult(False, "result is null")
+    if isinstance(result, (list, dict, str)) and len(result) == 0:
+        return GateResult(False, "result is empty")
+    if not _shape_ok(result, output_schema):
+        return GateResult(False, f"shape != output_schema ({output_schema.get('type')})")
+    return GateResult(True, "self-verify passed (non-empty, shape ok)")
+
+
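+def _gold(result, gold) -> GateResult:
+    if gold is None:  # explicit method="gold" without a gold answer: reject rather than compare to None
+        return GateResult(False, "no gold answer provided")
+    return GateResult(True, "matches gold") if result == gold else GateResult(False, "differs from gold")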
+
+
+def gate(result, *, gold=None, output_schema=None, method: str = "auto") -> GateResult:
+    if method == "none":
+        return GateResult(True, "no gate (admit all)")
+    if method == "gold" or (method == "auto" and gold is not None):
+        return _gold(result, gold)
+    return _self_verify(result, output_schema)
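
Note on the `gold` method above: `_gold` uses exact equality, so an answer that differs from gold only in case or whitespace is rejected. If that turns out to be too strict for scraped text, a normalizing comparison could look like the sketch below. This is an assumption about what "equal" should mean, not behavior from the PR, and `normalize_answer` and `matches_gold` are hypothetical names.

```python
def normalize_answer(value):
    """Hypothetical canonical form: collapse whitespace and case-fold strings, recursively."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    if isinstance(value, list):
        return [normalize_answer(v) for v in value]
    if isinstance(value, dict):
        return {key: normalize_answer(v) for key, v in value.items()}
    return value  # numbers, booleans, None are compared unchanged


def matches_gold(result, gold) -> bool:
    """Compare after normalization; types and list order must still match."""
    return normalize_answer(result) == normalize_answer(gold)
```

The comparison stays strict about types (`"42"` does not match `42`) and about list order, so it only forgives formatting noise rather than loosening the gate in general.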
diff --git a/src/webwright/skills/library.py b/src/webwright/skills/library.py
@@ -0,0 +1,59 @@
+"""Skill store. A skill = a directory under the library root holding skill.py + meta.json.
+
+Interface (stable — implementations behind it may change):
+    Library(root).list() -> [Skill]
+    Library(root).get(skill_id) -> Skill | None
+    Library(root).add(skill)            # write skill.py + meta.json
+"""
+from __future__ import annotations
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+
+
+@dataclass
+class Skill:
+    skill_id: str
+    code: str                    # source of skill.py
+    meta: dict = field(default_factory=dict)   # {template, site, signature, summary, ...}
+
+    @property
+    def summary(self) -> str:
+        return self.meta.get("summary", "")
+
+    @property
+    def signature(self) -> dict:
+        return self.meta.get("signature", {})
+
+
+class Library:
+    def __init__(self, root: str | Path):
+        self.root = Path(root)
+        self.root.mkdir(parents=True, exist_ok=True)
+
+    def _dir(self, skill_id: str) -> Path:
+        return self.root / skill_id
+
+    def list(self) -> list[Skill]:
+        out = []
+        for d in sorted(self.root.iterdir()):
+            if (d / "meta.json").exists():
+                out.append(self.get(d.name))
+        return [s for s in out if s]
+
+    def get(self, skill_id: str) -> Skill | None:
+        d = self._dir(skill_id)
+        if not (d / "meta.json").exists():
+            return None
+        meta = json.loads((d / "meta.json").read_text(encoding="utf-8"))
+        code = (d / "skill.py").read_text(encoding="utf-8") if (d / "skill.py").exists() else ""
+        return Skill(skill_id=skill_id, code=code, meta=meta)
+
+    def add(self, skill: Skill) -> None:
+        d = self._dir(skill.skill_id)
+        d.mkdir(parents=True, exist_ok=True)
+        (d / "skill.py").write_text(skill.code, encoding="utf-8")
+        (d / "meta.json").write_text(json.dumps(skill.meta, ensure_ascii=False, indent=2), encoding="utf-8")
+
+    def path(self, skill_id: str) -> Path:
+        return self._dir(skill_id) / "skill.py"
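
Note on `Library._dir` above: it joins `root / skill_id` without checking the id, so an id such as `../other` would resolve outside the library root. Whether ids can ever be untrusted depends on how `update.py` generates them, which this part of the diff does not show. Assuming they might be, a guard could look like the sketch below; `safe_skill_dir` is a hypothetical helper, not part of the PR.

```python
from __future__ import annotations

from pathlib import Path


def safe_skill_dir(root: str | Path, skill_id: str) -> Path:
    """Hypothetical guard: return <root>/<skill_id> only if it is a direct child of root."""
    base = Path(root).resolve()
    target = (base / skill_id).resolve()
    # Rejects "", ".", "..", nested ids like "a/b", and absolute paths in one check.
    if target.parent != base:
        raise ValueError(f"skill_id must name a direct child of the library root: {skill_id!r}")
    return target
```

Resolving both paths before comparing also rejects a symlinked skill directory that points outside the root, which may or may not be desirable for shared libraries.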
diff --git a/src/webwright/skills/llm.py b/src/webwright/skills/llm.py
@@ -0,0 +1,67 @@
+"""LLM helper for the skills module — backend-agnostic, via webwright's own model abstraction.
+
+No hardcoded gateway/endpoint/key: the caller passes a webwright Model (or a model config dict),
+so this works with any backend webwright supports (openai / anthropic / openrouter / custom).
+"""
+from __future__ import annotations
+
+import json
+import re
+from typing import Any, Optional
+
+from webwright.models import get_model
+
+# Process-wide default model, set once via configure_llm() so retrieve/decide/update can call
+# llm() without each caller threading a Model through. Falls back to env-configured openai.
+_DEFAULT_MODEL: Optional[Any] = None
+
+
+def configure_llm(model: Any) -> None:
+    """Register the Model (or model-config dict) the skills module should use."""
+    global _DEFAULT_MODEL
+    _DEFAULT_MODEL = get_model(model) if isinstance(model, dict) else model
+
+
+def _model() -> Any:
+    if _DEFAULT_MODEL is not None:
+        return _DEFAULT_MODEL
+    # default: build an openai-style model from env so a bare CLI invocation
+    # (e.g. `python -m webwright.tools.skill_use`) uses the SAME backend as the running agent.
+    # Honors SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT (or OPENAI_* fallbacks); no hardcoded gateway.
+    import os
+    cfg = {"model_class": os.environ.get("SKILL_MODEL_CLASS", "openai")}
+    name = os.environ.get("SKILL_MODEL_NAME") or os.environ.get("OPENAI_MODEL")
+    endpoint = os.environ.get("SKILL_MODEL_ENDPOINT") or os.environ.get("OPENAI_ENDPOINT")
+    if name:
+        cfg["model_name"] = name
+    if endpoint:
+        cfg["openai_endpoint"] = endpoint
+    return get_model(cfg)
+
+
+def llm(system: str, user: str, *, model: Any = None, **_: Any) -> str:
+    """Single-turn call. Returns raw text. `model` overrides the configured default."""
+    m = model if model is not None else _model()
+    messages = [
+        m.format_message(role="system", content=system),
+        m.format_message(role="user", content=user),
+    ]
+    return m(messages)
+
+
+def llm_json(system: str, user: str, **kw: Any) -> dict:
+    """Call the LLM and parse a JSON object from the reply (first "{" to last "}", then shorter prefixes)."""
+    txt = llm(system, user, **kw)
+    match = re.search(r"\{.*\}", txt, re.S)
+    if not match:
+        return {}
+    try:
+        return json.loads(match.group(0))
+    except Exception:
+        s = match.group(0)
+        for end in range(len(s), 0, -1):
+            try:
+                return json.loads(s[:end])
+            except Exception:
+                continue
+    return {}
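
Note on `llm_json` above: the regex is greedy, so it spans from the first `{` to the last `}` in the reply, and the fallback only tries shorter prefixes of that span. A reply whose first brace does not open valid JSON (for example prose containing `{placeholder}` before the real object) therefore parses to `{}`. One alternative, sketched below with the standard library's `json.JSONDecoder.raw_decode`, tries each `{` in turn. `first_json_object` is a hypothetical name, not a function in this PR.

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first decodable JSON object found in text, or {} if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return {}
```

Because `raw_decode` stops at the end of the first complete value, trailing prose after the object needs no special handling, and each candidate start is parsed once instead of once per prefix.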