[BREAKING] FEAT: Standardize Jailbreak scenario defaults #2045
varunj-msft wants to merge 1 commit into
I think this is easier to reason about if we start from what we want it to do. I think the goal is to see how effective various jailbreaks are (or to find jailbreaks that work). To do this, let's look at the end parameters a user actually sets on the CLI, since I think that's the clearest way to see whether the design hangs together. Roughly the run I'd want to work:
pyrit_scan airt.jailbreak \
  --initializers target load_default_datasets \
  --target openai_chat \
  --dataset-names harmbench \
  --jailbreak-names dan aim \
  --strategies system user variation \
  --num-attempts 1
Here --dataset-names is the WHAT (harmful objectives), --jailbreak-names is the WHICH (which jailbreaks; alternatively --num-templates N), and --strategies is the HOW (delivery methods).
The items below are over-simplifications. I could see us needing to think about each one, and some may need a prerequisite PR, but I think they should be close to working.
1.
Description
The Jailbreak scenario previously ran all 162 jailbreak templates by default, which made a default scan take roughly an hour and overshoot the intended ~30 minute budget. This PR makes the default fast and predictable.
What changed:
The default now randomly samples 10 templates instead of all 162. Power users can still run the full catalog with num_templates=None, or pick specific ones with jailbreak_names=[...].
num_templates (and num_attempts) are now exposed as runtime parameters, so they can be set from the CLI via --num-templates N rather than only through the Python constructor.
The sampled template names are persisted to ScenarioResult.metadata and replayed on --resume, so resuming a run uses the same templates instead of drawing a fresh random set.
The chosen templates are now visible to the user. Scenario output includes a new "Scenario Inputs" section that lists which templates were tried, driven by a generic metadata["summary"] convention in the pretty printer (internal keys like objective_hashes are never shown).
Fixed an edge case in get_jailbreak_templates: num_templates=0 previously fell through and silently returned all templates; it (and negatives) now raise a clear error.
The scanner doc fast path is now --strategies simple --max-dataset-size 1 (no longer relies on --num-templates 1).
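The selection rules listed above (default of 10, None for the full catalog, explicit names, and an error for 0 or negatives) can be sketched roughly as follows. This is an illustration only, not the PR's actual get_jailbreak_templates: the function name select_templates, the rng parameter, the precedence of jailbreak_names over num_templates, and the clamping when num_templates exceeds the catalog size are all assumptions.

```python
from __future__ import annotations

import random
from typing import Optional, Sequence

DEFAULT_NUM_TEMPLATES = 10  # default sample size stated in the PR description


def select_templates(
    catalog: Sequence[str],
    num_templates: Optional[int] = DEFAULT_NUM_TEMPLATES,
    jailbreak_names: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Hypothetical sketch of the template-selection rules."""
    # Assumption: explicitly named templates win over random sampling.
    if jailbreak_names is not None:
        missing = [name for name in jailbreak_names if name not in catalog]
        if missing:
            raise ValueError(f"Unknown jailbreak templates: {missing}")
        return list(jailbreak_names)

    # None is the explicit opt-out: run the full catalog.
    if num_templates is None:
        return list(catalog)

    # 0 and negatives must fail loudly instead of falling through to "all".
    if num_templates <= 0:
        raise ValueError(f"num_templates must be positive or None, got {num_templates}")

    # Assumption: asking for more than the catalog holds is clamped, not an error.
    rng = rng or random.Random()
    return rng.sample(list(catalog), min(num_templates, len(catalog)))
```

The key design point is that 0 is checked separately from None, so a falsy value can no longer be mistaken for "no limit".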
This is BREAKING because the default scan output changes (same call now produces different atomic attacks). The scenario VERSION is bumped from 1 to 2 so stored results from the old default error out cleanly on --resume instead of silently mixing. Constructor signatures and the public API are unchanged.
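A minimal sketch of how the resume behavior and the version check could fit together, under stated assumptions: StoredResult is a hypothetical stand-in for ScenarioResult, the metadata key "jailbreak_names" is invented for illustration, and templates_for_run is not a function from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

SCENARIO_VERSION = 2  # bumped from 1 to 2 per the PR description


@dataclass
class StoredResult:
    """Hypothetical stand-in for a persisted ScenarioResult."""

    version: int
    metadata: dict = field(default_factory=dict)


def templates_for_run(
    stored: Optional[StoredResult],
    sample: Callable[[], list[str]],
) -> tuple[StoredResult, list[str]]:
    """Return the result record and the templates to use for this run."""
    if stored is None:
        # Fresh run: draw once and persist the draw alongside the result.
        names = sample()
        result = StoredResult(
            version=SCENARIO_VERSION,
            metadata={"jailbreak_names": list(names)},  # assumed key name
        )
        return result, names

    # Resume: refuse results written by a different scenario version
    # rather than silently mixing old and new defaults.
    if stored.version != SCENARIO_VERSION:
        raise ValueError(
            f"Stored result has version {stored.version}, expected {SCENARIO_VERSION}"
        )

    # Replay the persisted names; the sampler is deliberately not called.
    return stored, list(stored.metadata["jailbreak_names"])
```

On resume the sampler is never invoked, which is what keeps a resumed run from drawing a fresh random set.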
Timing: ~48s at max_concurrency=10 on a mini-class Azure OpenAI model (11 atomic attacks, 44 objectives) — well under the ~30 min budget.
Tests and Documentation
Added/updated unit tests for the new default (10), the explicit None opt-out, jailbreak_names, metadata persistence, resume replaying the same templates, the "Scenario Inputs" rendering (including that it never leaks internal metadata), and the 0/negative edge cases.
Updated the AIRT scanner doc (doc/scanner/airt.py + .ipynb) with the new default and fast-path command.
Ran the targeted suites and the broader scenario + output suites — all green. Ran ruff, ruff-format, and ty clean on the changed source. JupyText .py/.ipynb pair is in sync.
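For reviewers who want a picture of the "Scenario Inputs" behavior those tests cover, here is a rough sketch of a printer driven by a metadata["summary"] convention. The shape of the summary (a dict of label to value), the function name render_scenario_inputs, and the output layout are assumptions, not the PR's actual pretty printer.

```python
from __future__ import annotations


def render_scenario_inputs(metadata: dict) -> str:
    """Render only metadata["summary"]; every other key is treated as internal."""
    # Assumption: "summary" maps a display label to a value or list of values.
    summary = metadata.get("summary") or {}
    if not summary:
        return ""

    lines = ["Scenario Inputs"]
    for label, value in summary.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        lines.append(f"  {label}: {value}")
    return "\n".join(lines)
```

Because the printer reads only the summary key, internal entries such as objective_hashes cannot appear in the output unless someone copies them into the summary.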