test(bench): careful model bake-off for adjudication (perf + judgement) #30
Closed
thejefflarson wants to merge 3 commits into
scripts/judge_bakeoff.py: benchmarks candidate judge models on the protector adjudication prompt, answering two questions separately: (1) PERFORMANCE (resident RAM, load time, tok/s, JSON validity) and (2) JUDGEMENT (calibrated verdict on own-app/cross-tenant-breach/argo-cluster-admin cases).

OOM-safe: one model resident at a time with VERIFIED eviction (polls `ollama ps`), a free-RAM floor that skips rather than loads when low, capped num_ctx, and pulling as a separate idle phase (`--pull`) that never overlaps the bench; naive concurrent runs OOM'd the box. Faithful to engine build_judgment_prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
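The two OOM guards described in this commit (verified eviction and the free-RAM floor) might look roughly like the sketch below. It is not the code from `scripts/judge_bakeoff.py`: the function names, the timeout values, and the injected `list_resident` callable are all assumptions made for illustration. The only thing taken from the commit message is the idea of polling `ollama ps` until the model is gone and skipping a load when free RAM is under a floor.

```python
import time
from typing import Callable, List


def parse_ollama_ps(output: str) -> List[str]:
    """Return model names from `ollama ps` text output.

    Assumes the first line is a header row and the first
    whitespace-separated column of each later row is the model name.
    """
    names = []
    for line in output.splitlines()[1:]:
        parts = line.split()
        if parts:
            names.append(parts[0])
    return names


def wait_for_eviction(
    model: str,
    list_resident: Callable[[], List[str]],  # hypothetical: wraps `ollama ps`
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `model` is no longer resident; False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if model not in list_resident():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def should_load(free_bytes: int, floor_bytes: int) -> bool:
    """Free-RAM floor: skip the model rather than load it when memory is low."""
    return free_bytes >= floor_bytes
```

In a real run, `list_resident` would presumably shell out to `ollama ps` and feed the text through `parse_ollama_ps`; injecting it keeps the polling loop testable without a running Ollama server.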
…fast field

Replaces the over-flagging prompt with an explicit decision procedure (CVE-loaded-at-runtime or alert => exploitable; else [NO-GRANT] cross-app secret => exploitable; else refuted) plus worked examples and per-objective [RBAC-GRANTED]/[NO-GRANT] tagging (the JEF-79 authorization signal). Fixes the qwen3 tag, model-name :latest normalization, and adds a regex verdict salvage so a messy reason field no longer scores as UNPARSEABLE. Broadens DEFAULT_MODELS to the fast field to find the fastest model that scores 3/3.

Result: granite4 (1b-h, 3b-h) and qwen3 (1.7b, 4b-instruct) score 3/3; qwen2.5/gemma2/llama3.2/phi3.5/exaone/lfm2.5 collapse to flag-everything (1/3) regardless of prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
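For readers who want the decision procedure and the two parsing fixes in concrete form, here is a minimal sketch. The procedure itself lives in the prompt text, so `expected_verdict` is only a Python restatement of the rule quoted above. The JSON field name `verdict`, the exact regex, and the choice to normalize by appending `:latest` (rather than stripping it) are assumptions, not details confirmed by the commit.

```python
import json
import re
from typing import Optional

VERDICTS = ("exploitable", "refuted")


def normalize_model(name: str) -> str:
    """Treat an untagged name as ':latest' so 'qwen3' and 'qwen3:latest' compare equal."""
    return name if ":" in name else f"{name}:latest"


def expected_verdict(cve_loaded_at_runtime: bool, alert: bool,
                     no_grant_cross_app_secret: bool) -> str:
    """Restate the prompt's decision procedure, checked top to bottom."""
    if cve_loaded_at_runtime or alert:
        return "exploitable"
    if no_grant_cross_app_secret:
        return "exploitable"
    return "refuted"


# Assumed shape of the model's reply: {"verdict": "...", "reason": "..."}
_VERDICT_RE = re.compile(r'"verdict"\s*:\s*"(exploitable|refuted)"', re.IGNORECASE)


def salvage_verdict(raw: str) -> Optional[str]:
    """Parse strict JSON first; fall back to a regex over the raw text."""
    try:
        value = str(json.loads(raw).get("verdict", "")).lower()
        if value in VERDICTS:
            return value
    except (json.JSONDecodeError, AttributeError):
        pass  # malformed JSON or a non-object reply: try the regex instead
    match = _VERDICT_RE.search(raw)
    return match.group(1).lower() if match else None
```

The fallback matters because a reply whose `reason` string contains unescaped quotes fails strict JSON parsing even though the verdict itself is perfectly readable.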
…granite4:1b-h 5/5)

Real 3-tag reach vocabulary matching the engine's proven-edge relations: [RBAC-GRANTED] (can-do/*), [MOUNTED] (can-read, same-namespace by k8s rule), [NETWORK] (reaches). Adds the genuine cross-tenant *network* lateral-movement case and a privilege-escalation (escape-to-host) case, plus a high-severity-outcome branch so escape/exec/impact objectives flag regardless of ownership.

granite4:1b-h and 3b-h both score 5/5; granite4:350m-h 0/4 (too small: emits invalid verdicts). Generic judgement table over CASES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
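The relation-to-tag mapping named in this commit can be sketched as a small lookup. Only the three tags and their relation patterns (`can-do/*`, `can-read`, `reaches`) come from the commit message; the function names, the example relation `can-do/get-secrets`, and the behaviour for unrecognized relations (left untagged) are hypothetical.

```python
from typing import Optional


def reach_tag(relation: str) -> Optional[str]:
    """Map a proven-edge relation to its reach tag, or None if unrecognized."""
    if relation.startswith("can-do/"):
        return "[RBAC-GRANTED]"  # an RBAC rule grants the action
    if relation == "can-read":
        return "[MOUNTED]"       # volume mount, same-namespace by k8s rule
    if relation == "reaches":
        return "[NETWORK]"       # network reachability
    return None


def tag_objective(objective: str, relation: str) -> str:
    """Prefix an objective line with its reach tag when one applies."""
    tag = reach_tag(relation)
    return f"{tag} {objective}" if tag else objective
```

For example, `tag_objective("read payments secret", "reaches")` yields `[NETWORK] read payments secret`, which is the kind of per-objective annotation the judge prompt would then see.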
Superseded by #31, which squash-merged the bench script (it was branched into the JEF-79 PR) along with the engine change. Content is on main.
`scripts/judge_bakeoff.py` benchmarks candidate judge models on the protector adjudication prompt, answering two questions separately:

- Performance: resident RAM, load time, tok/s, JSON validity.
- Judgement: calibrated verdict on `own_app` (refute), `cross_tenant_breach` (exploitable), `argo_cluster_admin` (refute).

OOM-safe (naive concurrent runs smoked the box): one model resident at a time with verified eviction (polls `ollama ps`), a free-RAM floor that skips rather than loads when low, capped `num_ctx`, and pulling as a separate idle `--pull` phase that never overlaps the bench. Faithful to the engine's `build_judgment_prompt`.

Run: `python3 scripts/judge_bakeoff.py --pull`, then `python3 scripts/judge_bakeoff.py`.