Evaluate and improve AI agents from the runs they already produce.
agent-eval turns agent outputs, traces, judge scores, and production feedback into a decision packet: did this change help, what failed, what should ship, and what needs more data?
Use it when you need to:
- compare a candidate agent/prompt/model against a baseline,
- turn production traces or human feedback into eval results,
- run a gated self-improvement loop,
- explain failures by cluster, cost, judge disagreement, and release risk.
It is a library, not a SaaS requirement. TypeScript is first-class; Python can call the same wire protocol through agent-eval-rpc.
pnpm add @tangle-network/agent-evalPython clients can use the RPC package:
pip install agent-eval-rpcStart here if you already have production logs, benchmark rows, human ratings, or agent run records.
import { analyzeRuns } from '@tangle-network/agent-eval/contract'
const report = await analyzeRuns({
runs, // RunRecord[]
baselineRuns,
})
console.log(report.recommendations)
console.log(report.lift)
console.log(report.failureClusters)The output includes score distributions, lift confidence intervals, failure modes, cost-quality tradeoffs, judge agreement, contamination checks, and release recommendations when the input supports them.
Use this when you have scenarios, a runnable agent, and judges.
import { defineAgentEval } from '@tangle-network/agent-eval/contract'
const evalKit = defineAgentEval({
scenarios,
agent: async (surface, scenario, ctx) =>
myAgent.run({ scenario, systemPrompt: String(surface), signal: ctx.signal }),
judge: myJudge,
baselineSurface: currentPrompt,
})
const baseline = await evalKit.evaluate()
const result = await evalKit.improve({ budget: { generations: 2 } })
console.log(baseline.aggregates.byJudge)
console.log(result.gateDecision)
console.log(result.winner.surface)
console.log(result.insight.recommendations)defineAgentEval() is a small wrapper over runEval() and selfImprove(). It lets you define scenarios, agent, judge, and baseline once, then either score one surface with .evaluate() or run the gated loop with .improve().
import { analyzeRuns, fromFeedbackTable, fromOtelSpans } from '@tangle-network/agent-eval/contract'
const { runs, raterScores } = fromFeedbackTable({
ratings: parseYourFeedbackTable(),
})
const traceRuns = fromOtelSpans({ spans: yourOtelSpans })
await analyzeRuns({ runs: [...runs, ...traceRuns], raterScores })- RunRecord: the durable row for one agent run: model, prompt/config hashes, split, cost, tokens, outcome.
- Scenario: one task or case the agent attempts.
- Judge: a scoring function, rule-based or model-based.
- Surface: the thing being changed, usually a prompt string or config object.
- InsightReport: the decision packet returned by
analyzeRuns()and embedded inselfImprove(). - Gate: the policy that decides
ship,hold, orneed_more_work.
| Journey | Example | Who it's for |
|---|---|---|
| Closed loop — improve a prompt under statistical confidence | examples/selfimprove-quickstart/ |
Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus — turn Obsidian/Sheets/CSV ratings into actionable insights | examples/customer-feedback-loop/ |
Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces — analyze logs you already have, no closed loop required | examples/customer-otel-traces/ |
Teams running agents in prod with observability, no eval discipline yet |
Each example: README.md + a single index.ts runnable via pnpm tsx. Prints the resulting InsightReport to stdout.
| Subpath | What it gives you |
|---|---|
…/contract |
The headline, frozen surface — new code starts here. defineAgentEval, selfImprove, analyzeRuns, runEval, runCampaign, runImprovementLoop, diffRuns; intake adapters (fromFeedbackTable, fromOtelSpans); proposers (gepaProposer, evolutionaryProposer); gates (defaultProductionGate, heldOutGate, paretoSignificanceGate, composeGate); the deployment-outcome store; storage; and the five core types Scenario / Dispatch / JudgeConfig / SurfaceProposer / Gate. |
…/hosted |
createHostedClient / hostedClientFromEnv + the wire types to ship eval-run events + trace spans to a hosted orchestrator (ours or your own implementation of the spec) |
…/adapters/otel |
createOtelBridge — forwards OpenTelemetry-shape spans into the hosted-tier ingest, no @opentelemetry/* dependency |
…/adapters/langchain |
Wrap any LangChain Runnable as a Dispatch (or JudgeConfig), no @langchain/core peer dep |
…/adapters/http |
httpDispatch + runDispatchServer — run a campaign's worker on another machine (multi-region, remote worker execution) |
…/campaign |
The measurement + improvement engine: runProfileMatrix, compareProposers, every surface proposer (gepaProposer, fapoProposer, parameterSweepProposer, haloProposer, skillOptProposer, aceProposer, memoryCurationProposer, …), the gates, storage backends, and loop provenance. /contract re-exports the app-facing subset. |
…/rl |
Bridge from eval artifacts to training signal: verifiable rewards, preferences, OPE, tournaments, contamination, compute curves, trainer-format exporters, process rewards, plus the durable corpus + buildRlDataset / datasheet bundle |
…/reporting |
Release-decision statistics: pairedBootstrap, benjaminiHochberg, anytime-valid sequential e-values, evaluateReleaseConfidence, and the report renderers |
…/analyst |
The trace-analyst surface: AnalystRegistry + buildDefaultAnalystRegistry (run the failure-clustering panel), FindingsStore, and the LLM chat transports |
…/traces |
Trace stores + emitters, OTLP-JSONL deterministic replay, analyzeTraces, and the traceAnalystOnRunComplete hook |
…/control |
Agent control loop: runAgentControlLoop (observe → validate → decide → act), action policy, propose/review |
…/matrix |
runAgentMatrix — an N-axis cartesian over caller-supplied substrate values, per-axis pass/score/cost/duration |
…/multishot |
N-shot persona × shot matrix runner (runMultishot / runMultishotMatrix) |
…/wire |
The cross-language HTTP/RPC server + Zod schemas (the source-of-truth protocol the Python client speaks) + the built-in rubric registry |
…/benchmarks |
BenchmarkAdapter contract + deterministicSplit + the bundled routing reference benchmark |
Specialized surfaces (subpath-only): …/prm (process-reward grading + best-of-N), …/meta-eval (judge calibration + the deployment-outcome store), …/belief-state (decision-point extraction + selective-policy reports), …/pipelines (trace-diagnostic views: budget breach, failure cluster, stuck loop, …), …/governance (EU AI Act / NIST AI RMF / SOC2 reports), …/knowledge (knowledge-readiness gating before a run), …/builder-eval (code-generator three-layer eval), …/storyboard (trace → watchable replay), …/authenticity (anti-Goodhart "real or convincing BS" scorer over produced files), …/workflow (workflow-trace eval + partner export), …/telemetry (Workers-safe telemetry client), …/testing (test-only reset helpers).
The root export remains broad for compatibility; new code should prefer the focused subpaths above — /contract first.
agent-eval is the bottom of the layering: consumers depend on it, it depends on none of them.
agent-runtime Runs agents (chat turns, one-shot tasks, multi-attempt loops), captures every
run as a trace, and calls optimizePrompt / runImprovementLoop. Produces the
RunRecords + traces agent-eval scores. Depends on agent-eval.
agent-eval selfImprove, analyzeRuns, runCampaign + surface proposers (GEPA proposer, …), the gates
(this repo) (heldOutGate, defaultProductionGate, paretoSignificanceGate), the InsightReport
decision packet, the RL bridge, the wire protocol. Depends on neither consumer.
agent-knowledge proposeKnowledgeWrites / applyKnowledgeWriteBlocks. agent-eval's analyst findings
feed it; the knowledge gate consumes them. Depends on agent-eval.
sandbox Sandbox.create, streamPrompt. One execution surface the runtime's
loops run on; agent-eval scores what comes back.
The rule: agent-eval has zero upward dependencies on a consumer. A concept that makes sense without a running agent loop — a verdict, a run record, a scenario, a judge score — is substrate and lives here. Runtime execution details (a validation context with an abort signal, a concrete sandbox session) live in agent-runtime or sandbox. Agent profile shape is the shared @tangle-network/agent-interface contract.
docs/concepts.md— the top-level entry points, the layering rule, and the wire-protocol contract (the five core contract types are documented in the/contractbarrel itself)docs/campaign-proposers.md— ELI5 proposer inputs/outputs, when to use each proposer, and the FAPO escalation policydocs/insight-report.md— annotated walkthrough of every section of the decision packetdocs/customer-journeys.md— three end-to-end journeys with code + expected outputdocs/adapters-observability.md— composing agent-eval with LangSmith, Langfuse, Phoenix, OpenLLMetry, TraceAIdocs/wire-protocol.md— the HTTP/RPC contract Python (and any future language) speaksdocs/hosted-ingest-spec.md— the hosted-tier wire format, frozen at2026-05-26.v1docs/design/loop-taxonomy.md— plain-language vocabulary for execution drivers, workers, measurements, and proposers
The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.
Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:
import { defineAgentEval } from '@tangle-network/agent-eval/contract'
const evalKit = defineAgentEval({
scenarios,
agent,
judge,
baselineSurface,
})
await evalKit.improve({
hostedTenant: {
endpoint: 'https://intelligence.tangle.tools',
apiKey: process.env.TANGLE_API_KEY!,
tenantId: 'your-tenant',
},
})The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.
Run an example:
pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.tsRun the test suite:
pnpm install
pnpm build
pnpm testThe /contract surface is the stability contract: its barrel freezes the API — a 0.x minor only adds; nothing there changes shape or disappears. Start there for app code.
| Surface | Meaning |
|---|---|
/contract |
Frozen app-facing API. Prefer this first. |
| Named subpaths | Public capability areas such as /campaign, /rl, /prm, /meta-eval, /belief-state, /wire, /reporting, /traces, and /analyst. |
/testing |
Test-only helpers. Do not import from production code. |
| Unexported source paths | Not public API. Open an issue if you need one promoted. |
CHANGELOG.md tracks every release with what's new / additive / breaking.
MIT. See LICENSE.