test(bench): careful model bake-off for adjudication (perf + judgement) #30

Closed
thejefflarson wants to merge 3 commits into main from feat/judge-bakeoff-script
@thejefflarson

Owner

scripts/judge_bakeoff.py — benchmarks candidate judge models on the protector adjudication prompt, answering two questions separately:

  1. Performance — resident RAM, load time, gen/prompt tok/s, strict-JSON validity.
  2. Judgement — calibrated verdict on own_app (refute), cross_tenant_breach (exploitable), argo_cluster_admin (refute).
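For reference, the throughput figures in item 1 can be derived from the timing fields Ollama returns with a non-streaming generate response (`load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, all durations in nanoseconds). This is a minimal sketch, not the script's actual code: the name `perf_from_response` is hypothetical, and "strict" is assumed here to mean that the entire response text parses as a JSON object.

```python
import json

NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def perf_from_response(resp: dict) -> dict:
    """Derive load time, tok/s and JSON validity from one generate response."""

    def rate(count_key: str, dur_key: str) -> float:
        # Tokens per second; 0.0 when the duration is missing or zero.
        dur = resp.get(dur_key, 0)
        return resp.get(count_key, 0) / (dur / NS_PER_S) if dur else 0.0

    # Assumed definition of "strict": the whole text is one JSON object.
    try:
        strict_json = isinstance(json.loads(resp.get("response", "")), dict)
    except json.JSONDecodeError:
        strict_json = False

    return {
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "prompt_tok_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "gen_tok_s": rate("eval_count", "eval_duration"),
        "strict_json": strict_json,
    }
```

Resident RAM is not part of that response and would have to be read separately, for example from the SIZE column of `ollama ps`.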

OOM-safe, because naive concurrent runs OOM'd the box: only one model is resident at a time, with verified eviction (polls ollama ps); a free-RAM floor skips a model rather than loading it when memory is low; num_ctx is capped; and pulling runs as a separate idle --pull phase that never overlaps the bench. The prompt is faithful to the engine's build_judgment_prompt.
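The eviction check and the free-RAM floor described above might look roughly like this. It is a sketch under assumptions, not the code in scripts/judge_bakeoff.py: every function name, the 6 GiB default floor, and the Linux-only `/proc/meminfo` source are illustrative choices. The only external call is the real `ollama ps` subcommand.

```python
import subprocess
import time


def resident_models() -> list[str]:
    """Names in the first column of `ollama ps` (header row skipped)."""
    out = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    ).stdout
    return [line.split()[0] for line in out.splitlines()[1:] if line.strip()]


def wait_for_eviction(model, list_resident=resident_models,
                      timeout_s=60.0, poll_s=1.0) -> bool:
    """Poll until `model` is no longer resident; False if it never leaves."""
    deadline = time.monotonic() + timeout_s
    while True:
        if model not in list_resident():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def mem_available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def should_skip(meminfo_text: str, floor_gib: float = 6.0) -> bool:
    """Skip the model rather than load it when free RAM is under the floor."""
    return mem_available_gib(meminfo_text) < floor_gib
```

Passing `list_resident` as a parameter keeps the polling loop testable without a running Ollama daemon.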

Run: python3 scripts/judge_bakeoff.py --pull, then python3 scripts/judge_bakeoff.py.

thejefflarson and others added 3 commits June 21, 2026 17:17
scripts/judge_bakeoff.py: benchmarks candidate judge models on the protector adjudication
prompt, answering two questions separately — (1) PERFORMANCE (resident RAM, load time,
tok/s, JSON validity) and (2) JUDGEMENT (calibrated verdict on own-app/cross-tenant-breach/
argo-cluster-admin cases). OOM-safe: one model resident at a time with VERIFIED eviction
(polls `ollama ps`), a free-RAM floor that skips rather than loads when low, capped num_ctx,
and pulling as a separate idle phase (`--pull`) that never overlaps the bench — naive
concurrent runs OOM'd the box. Faithful to engine build_judgment_prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
…fast field

Replaces the over-flagging prompt with an explicit decision procedure (CVE-loaded-at-runtime
or alert => exploitable; else [NO-GRANT] cross-app secret => exploitable; else refuted) plus
worked examples and per-objective [RBAC-GRANTED]/[NO-GRANT] tagging (the JEF-79 authorization
signal). Fixes the qwen3 tag, model-name :latest normalization, and adds a regex verdict
salvage so a messy reason field no longer scores as UNPARSEABLE. Broadens DEFAULT_MODELS to
the fast field to find the fastest model that scores 3/3.
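Written as code rather than prompt text, the decision procedure above reads as a first-match-wins rule chain. The sketch below only restates the three rules from this commit message; the `Evidence` class and its field names are hypothetical and do not come from the engine or the script.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # All field names are hypothetical, chosen to mirror the commit message.
    cve_loaded_at_runtime: bool = False
    alert: bool = False
    tag: str = "[RBAC-GRANTED]"  # or "[NO-GRANT]"
    cross_app_secret: bool = False


def expected_verdict(e: Evidence) -> str:
    """Apply the three rules in order; the first match wins."""
    if e.cve_loaded_at_runtime or e.alert:
        return "exploitable"
    if e.tag == "[NO-GRANT]" and e.cross_app_secret:
        return "exploitable"
    return "refuted"
```

Under these rules a cross-app secret that is reachable through a granted RBAC path is refuted, which is the over-flagging the new prompt is meant to stop.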

Result: granite4 (1b-h, 3b-h) and qwen3 (1.7b, 4b-instruct) score 3/3; qwen2.5/gemma2/
llama3.2/phi3.5/exaone/lfm2.5 collapse to flag-everything (1/3) regardless of prompt.
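A regex verdict salvage and `:latest` normalization of the kind this commit describes could be sketched as below. The `"verdict"` key, the two verdict labels, and both function names are assumptions; only the UNPARSEABLE outcome and the fact that Ollama treats an untagged model name as `:latest` are taken as given.

```python
import json
import re

# Assumed output shape: {"verdict": "exploitable" | "refuted", "reason": "..."}
VERDICT_RE = re.compile(r'"verdict"\s*:\s*"(exploitable|refuted)"', re.IGNORECASE)


def parse_verdict(text: str) -> str:
    """Try strict JSON first, then salvage the verdict with a regex."""
    try:
        obj = json.loads(text)
        if isinstance(obj, dict):
            verdict = str(obj.get("verdict", "")).lower()
            if verdict in ("exploitable", "refuted"):
                return verdict
    except json.JSONDecodeError:
        pass  # e.g. unescaped quotes inside the reason field
    m = VERDICT_RE.search(text)
    return m.group(1).lower() if m else "UNPARSEABLE"


def normalize_model(name: str) -> str:
    """Append `:latest` when the last path segment carries no tag."""
    return name if ":" in name.rsplit("/", 1)[-1] else f"{name}:latest"
```

Normalizing names this way lets a requested `qwen3` compare equal to the `qwen3:latest` entry that Ollama lists.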

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
…granite4:1b-h 5/5)

Real 3-tag reach vocabulary matching the engine's proven-edge relations: [RBAC-GRANTED]
(can-do/*), [MOUNTED] (can-read, same-namespace by k8s rule), [NETWORK] (reaches). Adds
the genuine cross-tenant *network* lateral-movement case and a privilege-escalation
(escape-to-host) case, plus a high-severity-outcome branch so escape/exec/impact objectives
flag regardless of ownership. granite4:1b-h and 3b-h both score 5/5; granite4:350m-h 0/4
(too small — emits invalid verdicts). Generic judgement table over CASES.
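The three-tag vocabulary above amounts to a small mapping from proven-edge relation to prompt tag. A sketch, assuming relations arrive as plain strings such as `can-do/create`, `can-read`, and `reaches`; the function name, the string format, and the `None` fallback for unknown relations are hypothetical.

```python
from typing import Optional


def reach_tag(relation: str) -> Optional[str]:
    """Map an engine relation to the tag shown to the judge model."""
    if relation.startswith("can-do/"):
        return "[RBAC-GRANTED]"
    if relation == "can-read":
        return "[MOUNTED]"  # same-namespace by k8s rule
    if relation == "reaches":
        return "[NETWORK]"
    return None  # assumption: unknown relations get no tag
```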

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
@thejefflarson

Owner Author

Superseded by #31, which squash-merged the bench script (it was branched into the JEF-79 PR) along with the engine change. Content is on main.

@thejefflarson deleted the feat/judge-bakeoff-script branch June 22, 2026 01:11