test(bench): careful model bake-off for adjudication (perf + judgement) by thejefflarson · Pull Request #30 · thejefflarson/protector

thejefflarson · 2026-06-22T00:18:19Z

scripts/judge_bakeoff.py — benchmarks candidate judge models on the protector adjudication prompt, answering two questions separately:

Performance — resident RAM, load time, gen/prompt tok/s, strict-JSON validity.
Judgement — calibrated verdict on own_app (refute), cross_tenant_breach (exploitable), argo_cluster_admin (refute).
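A minimal sketch of how the two questions could be scored separately, assuming each model reply is expected to be a JSON object with a `verdict` field set to `exploitable` or `refuted`. The field name, the verdict strings, and the function names are illustrative assumptions; only the three case names and their expected outcomes come from the description above.

```python
import json

# Expected verdict per case, taken from the PR description.
EXPECTED = {
    "own_app": "refuted",
    "cross_tenant_breach": "exploitable",
    "argo_cluster_admin": "refuted",
}
# Assumed verdict vocabulary (hypothetical; not confirmed by the PR).
VALID_VERDICTS = {"exploitable", "refuted"}


def parse_strict(reply):
    """Return the verdict if the reply is strict JSON with a valid verdict, else None."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    verdict = obj.get("verdict")
    if isinstance(verdict, str) and verdict in VALID_VERDICTS:
        return verdict
    return None


def score(replies):
    """Return (json_valid_count, correct_count) for a {case: raw_reply} mapping."""
    valid = 0
    correct = 0
    for case, expected in EXPECTED.items():
        verdict = parse_strict(replies.get(case, ""))
        if verdict is not None:
            valid += 1
            if verdict == expected:
                correct += 1
    return valid, correct
```

Keeping the two counts apart means a model that answers correctly but wraps its JSON in chatter loses on validity without being mistaken for a model with poor judgement.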

OOM-safe (naive concurrent runs exhausted the box's memory): one model is resident at a time with verified eviction (polls ollama ps), a free-RAM floor skips a model rather than loading it when memory is low, num_ctx is capped, and pulling runs as a separate idle --pull phase that never overlaps the bench. The prompt is faithful to the engine's build_judgment_prompt.
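The eviction check and the free-RAM floor could look roughly like the sketch below. It is not the script's actual code: the function names are hypothetical, it assumes `ollama ps` prints a header row followed by one row per resident model with the model name in the first column, and it assumes a Linux host where `/proc/meminfo` exposes `MemAvailable`.

```python
import time


def resident_models(ps_output):
    """Parse `ollama ps` text into a list of resident model names."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    # Assumption: the first non-empty line is the header row.
    return [ln.split()[0] for ln in lines[1:]]


def mem_available_gib(meminfo_text):
    """Read MemAvailable from /proc/meminfo-style text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def wait_for_eviction(read_ps, timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """Poll until no model is resident; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not resident_models(read_ps()):
            return True
        sleep(poll_s)
    return False


def should_load(meminfo_text, floor_gib):
    """Return False (skip the model) when available RAM is below the floor."""
    return mem_available_gib(meminfo_text) >= floor_gib
```

In a real run, `read_ps` could be a small wrapper around `subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout`, and `meminfo_text` the contents of `/proc/meminfo`. Passing both in as arguments keeps the guard logic testable without a running Ollama server.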

Run: `python3 scripts/judge_bakeoff.py --pull`, then `python3 scripts/judge_bakeoff.py`.

scripts/judge_bakeoff.py: benchmarks candidate judge models on the protector adjudication prompt, answering two questions separately: (1) PERFORMANCE (resident RAM, load time, tok/s, JSON validity) and (2) JUDGEMENT (calibrated verdict on the own-app, cross-tenant-breach, and argo-cluster-admin cases).

OOM-safe: one model resident at a time with VERIFIED eviction (polls `ollama ps`), a free-RAM floor that skips rather than loads when low, capped num_ctx, and pulling as a separate idle phase (`--pull`) that never overlaps the bench. Naive concurrent runs OOM'd the box. Faithful to engine build_judgment_prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP

…fast field

Replaces the over-flagging prompt with an explicit decision procedure (CVE-loaded-at-runtime or alert => exploitable; else [NO-GRANT] cross-app secret => exploitable; else refuted) plus worked examples and per-objective [RBAC-GRANTED]/[NO-GRANT] tagging (the JEF-79 authorization signal). Fixes the qwen3 tag, model-name :latest normalization, and adds a regex verdict salvage so a messy reason field no longer scores as UNPARSEABLE. Broadens DEFAULT_MODELS to the fast field to find the fastest model that scores 3/3.

Result: granite4 (1b-h, 3b-h) and qwen3 (1.7b, 4b-instruct) score 3/3; qwen2.5/gemma2/llama3.2/phi3.5/exaone/lfm2.5 collapse to flag-everything (1/3) regardless of prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
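The decision procedure and the two parsing fixes described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the function names, the boolean inputs, the `verdict` field name, and the verdict strings are hypothetical, and the `:latest` handling is one plausible reading of "model-name :latest normalization" (Ollama treats an untagged model name as `:latest`).

```python
import re

# Matches a verdict value even when the surrounding JSON does not parse.
# The "verdict" key and its two values are assumed, not taken from the script.
VERDICT_RE = re.compile(r'"verdict"\s*:\s*"(exploitable|refuted)"', re.IGNORECASE)


def salvage_verdict(reply):
    """Recover the verdict from a malformed reply, or return None."""
    match = VERDICT_RE.search(reply)
    return match.group(1).lower() if match else None


def normalize_model(name):
    """Treat 'model' and 'model:latest' as the same model name."""
    return name if ":" in name else name + ":latest"


def decide(cve_loaded_at_runtime, has_alert, cross_app_secret, rbac_granted):
    """Apply the three-step decision procedure from the commit message."""
    if cve_loaded_at_runtime or has_alert:
        return "exploitable"
    if cross_app_secret and not rbac_granted:  # the [NO-GRANT] case
        return "exploitable"
    return "refuted"
```

The salvage step only looks at the verdict pair, so an unescaped quote inside the reason field no longer turns an otherwise clear answer into UNPARSEABLE.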

…granite4:1b-h 5/5)

Real 3-tag reach vocabulary matching the engine's proven-edge relations: [RBAC-GRANTED] (can-do/*), [MOUNTED] (can-read, same-namespace by k8s rule), [NETWORK] (reaches). Adds the genuine cross-tenant *network* lateral-movement case and a privilege-escalation (escape-to-host) case, plus a high-severity-outcome branch so escape/exec/impact objectives flag regardless of ownership.

granite4:1b-h and 3b-h both score 5/5; granite4:350m-h 0/4 (too small: emits invalid verdicts). Generic judgement table over CASES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
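The 3-tag reach vocabulary amounts to a mapping from relation names to tags, which might be sketched like this. The relation strings are read off the commit message (`can-do/*`, `can-read`, `reaches`); the function names and the tagged-line format are assumptions made for illustration.

```python
def reach_tag(relation):
    """Map a proven-edge relation name to its reach tag, or None if unknown."""
    if relation.startswith("can-do/"):
        return "[RBAC-GRANTED]"
    if relation == "can-read":
        return "[MOUNTED]"
    if relation == "reaches":
        return "[NETWORK]"
    return None


def tag_objective(objective, relation):
    """Prefix an objective with its reach tag; leave it untagged if unknown."""
    tag = reach_tag(relation)
    return f"{tag} {objective}" if tag else objective
```

For example, `tag_objective("read secret db-creds", "can-read")` yields `[MOUNTED] read secret db-creds`, where the objective text is made up for this example.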

thejefflarson · 2026-06-22T01:11:40Z

Superseded by #31, which squash-merged the bench script (it was branched into the JEF-79 PR) along with the engine change. Content is on main.

thejefflarson and others added 3 commits June 21, 2026 17:17

thejefflarson closed this Jun 22, 2026

thejefflarson deleted the feat/judge-bakeoff-script branch June 22, 2026 01:11
