[BREAKING] FEAT: Standardize Jailbreak scenario defaults by varunj-msft · Pull Request #2045 · microsoft/PyRIT

varunj-msft · 2026-06-18T17:07:15Z

Description

The Jailbreak scenario previously ran all 162 jailbreak templates by default, which made a default scan take roughly an hour and overshoot the intended ~30 minute budget. This PR makes the default fast and predictable.

What changed:

The default now randomly samples 10 templates instead of all 162. Power users can still run the full catalog with num_templates=None, or pick specific ones with jailbreak_names=[...].
num_templates (and num_attempts) are now exposed as runtime parameters, so they can be set from the CLI via --num-templates N rather than only through the Python constructor.
The sampled template names are persisted to ScenarioResult.metadata and replayed on --resume, so resuming a run uses the same templates instead of drawing a fresh random set.
The chosen templates are now visible to the user. Scenario output includes a new "Scenario Inputs" section that lists which templates were tried, driven by a generic metadata["summary"] convention in the pretty printer (internal keys like objective_hashes are never shown).
Fixed an edge case in get_jailbreak_templates: num_templates=0 previously fell through and silently returned all templates; it (and negatives) now raise a clear error.
The scanner doc fast path is now --strategies simple --max-dataset-size 1 (no longer relies on --num-templates 1).
This is BREAKING because the default scan output changes (same call now produces different atomic attacks). The scenario VERSION is bumped from 1 to 2 so stored results from the old default error out cleanly on --resume instead of silently mixing. Constructor signatures and the public API are unchanged.
Timing: ~48s at max_concurrency=10 on a mini-class Azure OpenAI model (11 atomic attacks, 44 objectives) — well under the ~30 min budget.
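
The selection and resume rules above can be sketched roughly as follows. This is a minimal illustration, not PyRIT's code: `select_templates`, `resolve_templates`, and the `"jailbreak_templates"` metadata key are hypothetical names, and the precedence of explicit names over sampling and the capping at the catalog size are assumptions made for the sketch.

```python
import random
from typing import Optional, Sequence


def select_templates(
    all_names: Sequence[str],
    num_templates: Optional[int] = 10,
    jailbreak_names: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Hypothetical stand-in for the template selection rules in this PR."""
    if jailbreak_names is not None:
        # Assumption: explicit names take precedence over random sampling.
        unknown = [n for n in jailbreak_names if n not in all_names]
        if unknown:
            raise ValueError(f"Unknown jailbreak templates: {unknown}")
        return list(jailbreak_names)
    if num_templates is None:
        # Explicit opt-out: run the full catalog.
        return list(all_names)
    if num_templates <= 0:
        # Zero and negatives raise instead of silently returning everything.
        raise ValueError("num_templates must be a positive integer or None")
    rng = rng or random.Random()
    # Assumption: a request larger than the catalog is capped at its size.
    return rng.sample(list(all_names), min(num_templates, len(all_names)))


def resolve_templates(metadata: dict, all_names: Sequence[str], **kwargs) -> list[str]:
    """Replay persisted template names on resume; otherwise sample and persist."""
    key = "jailbreak_templates"  # hypothetical metadata key
    if key not in metadata:
        metadata[key] = select_templates(all_names, **kwargs)
    return list(metadata[key])
```

Calling `resolve_templates` twice with the same `metadata` dict returns the same names, which is the property the resume path relies on.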

Tests and Documentation

Added/updated unit tests for the new default (10), the explicit None opt-out, jailbreak_names, metadata persistence, resume replaying the same templates, the "Scenario Inputs" rendering (including that it never leaks internal metadata), and the 0/negative edge cases.
Updated the AIRT scanner doc (doc/scanner/airt.py + .ipynb) with the new default and fast-path command.
Ran the targeted suites and the broader scenario + output suites — all green. Ran ruff, ruff-format, and ty clean on the changed source. JupyText .py/.ipynb pair is in sync.
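
For readers unfamiliar with the "Scenario Inputs" behavior those tests cover, here is a rough sketch of a printer that shows only `metadata["summary"]`. The function name `render_scenario_inputs` and the assumption that `summary` is a flat label-to-value mapping are illustrative; the actual pretty printer may be shaped differently.

```python
def render_scenario_inputs(metadata: dict) -> str:
    """Render only metadata["summary"]; every other key is treated as internal."""
    # Assumption: "summary" maps display labels to values or lists of values.
    summary = metadata.get("summary") or {}
    if not summary:
        return ""
    lines = ["Scenario Inputs"]
    for label, value in summary.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"  {label}: {value}")
    return "\n".join(lines)
```

Because the function reads nothing outside `summary`, keys such as `objective_hashes` cannot appear in the output.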

rlundeen2 · 2026-06-20T22:30:48Z

I think this is easier to reason about if we start from what we want it to do, and I think the goal is to see how effective various jailbreaks are (or to find jailbreaks that work).

To do this, let's look at the end parameters a user actually sets on the CLI, since I think that's the clearest way to see whether the design hangs together. Roughly the run I'd want to work:

# WHAT:  --dataset-names   sets the harmful objectives
# WHICH: --jailbreak-names picks the jailbreaks (or use --num-templates N)
# HOW:   --strategies      picks the delivery methods
pyrit_scan airt.jailbreak \
  --initializers target load_default_datasets \
  --target openai_chat \
  --dataset-names harmbench \
  --jailbreak-names dan aim \
  --strategies system user variation \
  --num-attempts 1

The notes below on getting this to work are over-simplifications. I could see us needing to think through each one, and some may need a prerequisite PR. But I think they should be close to working.

1. `--dataset-names` — the harmful objectives (the what)

Standard DatasetConfiguration. I'd switch the default off the small airt_harms sample to something effectiveness-oriented like HarmBench. Each entry is a SeedObjective; nothing exotic here.

2. `--jailbreak-names` / `--num-templates` — which jailbreaks

A selector with a default set that can be overridden. The jailbreaks themselves should be SeedAttackTechniqueGroups (is_general_technique=True) so they merge into every objective via with_technique() without touching the SeedObjective.

On how to build the selector: we don't need a heavy new class. DatasetConfiguration's generic core is already objective-agnostic — get_seed_groups() returns plain SeedGroups, and the select-by-name-or-explicit + max_dataset_size sampling + _load_seed_groups_for_dataset hook is exactly the "default set, or override" behavior we want. Only the terminal *_attack_groups() wrappers assume one objective. So: keep that selection core and add a technique-flavored terminal (get_technique_groups() → SeedAttackTechniqueGroups) plus a load hook that pulls the templates. The knobs then map straight across: names → jailbreak names, explicit seed_groups → override, max_dataset_size → --num-templates.

(Alternative if we'd rather not subclass: register the templates into memory as is_general_technique=True seed datasets and load them by name. Either works — just pick one.)

Note --jailbreak-names would be a new runtime parameter — today only --num-templates and --num-attempts are exposed in supported_parameters().
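
A rough sketch of the "keep the selection core, add a technique-flavored terminal" idea follows. All names here (`Group`, `SelectionCore`, `JailbreakSelector`, `KNOWN_TEMPLATES`) are hypothetical stand-ins rather than PyRIT classes, and applying the size cap to explicit groups is an assumption.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical catalog; "example_c" is a made-up name for illustration.
KNOWN_TEMPLATES = ("dan", "aim", "example_c")


@dataclass
class Group:
    """Stand-in for a seed group; not PyRIT's SeedGroup."""
    name: str
    is_general_technique: bool = False


@dataclass
class SelectionCore:
    """Objective-agnostic selection: names or explicit groups, then a size cap."""
    names: list[str] = field(default_factory=list)
    explicit_groups: Optional[list[Group]] = None
    max_size: Optional[int] = None
    seed: Optional[int] = None

    def _load_groups_for_name(self, name: str) -> list[Group]:
        # Load hook: subclasses decide where groups come from.
        raise NotImplementedError

    def get_groups(self) -> list[Group]:
        if self.explicit_groups is not None:
            groups = list(self.explicit_groups)  # explicit override
        else:
            groups = [g for n in self.names for g in self._load_groups_for_name(n)]
        if self.max_size is not None and len(groups) > self.max_size:
            groups = random.Random(self.seed).sample(groups, self.max_size)
        return groups


class JailbreakSelector(SelectionCore):
    """Technique-flavored selector: names are jailbreak names, max_size is --num-templates."""

    def _load_groups_for_name(self, name: str) -> list[Group]:
        if name not in KNOWN_TEMPLATES:
            raise KeyError(f"Unknown jailbreak template: {name}")
        return [Group(name=name, is_general_technique=True)]

    def get_technique_groups(self) -> list[Group]:
        groups = self.get_groups()
        # The terminal only hands back general techniques.
        bad = [g.name for g in groups if not g.is_general_technique]
        if bad:
            raise ValueError(f"Not general techniques: {bad}")
        return groups
```

The point of the split is that `get_groups()` never mentions objectives, so only the terminal method and the load hook differ between the dataset and jailbreak selectors.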

3. `--strategies` — the delivery methods (the how)

These would be registered as defaults, with more selectable once they are registered:

baseline (objective only — already auto-prepended)
jailbreak on the system prompt — technique seed role="system"
jailbreak as the first user prompt — technique seed role="user"
variation of the jailbreak, on system + user
translation (random language), on system + user

system-vs-user is just the seed's role — no special converter. Each method is an AttackTechniqueFactory / AttackTechnique carrying the jailbreak seed_technique at the right role + converter config, instead of the current match strategy block with TextJailbreakConverter. That gets the strategy enum back to pure techniques.

For variation/translation: attach the converter to the attack (don't pre-convert offline) and scope it with PrependedConversationConfig(apply_converters_to_roles=[...]). request_converters also hit the live objective turn, so the variation/translation lands on both the jailbreak and the objective — which I think is what we want, and it avoids a build-time LLM call so resume stays cheap.
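
To make the role scoping concrete, here is a small sketch of how a jailbreak seed placed at a given role and a converter scoped by role might interact. `Message` and `build_turns` are hypothetical helpers, and `str.upper` merely stands in for a variation or translation converter; the real `PrependedConversationConfig` behavior should be checked against PyRIT itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Converter = Callable[[str], str]


@dataclass(frozen=True)
class Message:
    role: str  # "system" or "user"
    text: str


def build_turns(
    jailbreak: str,
    objective: str,
    role: str,
    converters: Sequence[Converter] = (),
    apply_converters_to_roles: Sequence[str] = (),
) -> list[Message]:
    """Place the jailbreak at `role`, then the objective as the live user turn."""

    def convert(text: str) -> str:
        for converter in converters:
            text = converter(text)
        return text

    # The prepended jailbreak is converted only if its role is in scope.
    prepended = convert(jailbreak) if role in apply_converters_to_roles else jailbreak
    # The live objective turn always passes through the request converters.
    return [Message(role, prepended), Message("user", convert(objective))]
```

With `role="system"` and `apply_converters_to_roles=["system"]`, both the jailbreak and the objective are converted; with an empty scope, only the objective is.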

4. `--num-attempts` — repeats per (jailbreak × method × objective)

Keep this knob, like it. Free from the base class.
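
As a quick sanity check on run size, the repeat count multiplies the full cross product. The helper below is a hypothetical illustration that ignores the auto-prepended baseline.

```python
from itertools import product
from typing import Sequence


def plan_runs(
    jailbreaks: Sequence[str],
    methods: Sequence[str],
    objectives: Sequence[str],
    num_attempts: int = 1,
) -> list[tuple[str, str, str, int]]:
    """Enumerate (jailbreak, method, objective, attempt) tuples."""
    if num_attempts < 1:
        raise ValueError("num_attempts must be at least 1")
    return [
        (jailbreak, method, objective, attempt)
        for jailbreak, method, objective in product(jailbreaks, methods, objectives)
        for attempt in range(1, num_attempts + 1)
    ]
```

For example, 2 jailbreaks, 3 methods, and 2 objectives give 12 runs at `num_attempts=1` and 36 at `num_attempts=3`.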

Cross-cutting: output grouping

Group by jailbreak template — that's the axis people compare across. Heads-up: the default _build_display_group hook only gets technique_name + seed_group_name; the template is a third axis, so since we're keeping a thin _get_atomic_attacks_async, just set display_group=<template_name> directly when constructing each AtomicAttack.
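
A small sketch of setting the display group at construction time, so output can be grouped by template. `AtomicAttackStub` is a hypothetical stand-in with only the three fields discussed here, not PyRIT's `AtomicAttack`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AtomicAttackStub:
    """Stand-in carrying only the fields relevant to grouping."""
    technique_name: str
    seed_group_name: str
    display_group: str


def build_attacks(
    templates: Sequence[str], methods: Sequence[str], dataset_name: str
) -> list[AtomicAttackStub]:
    # The template is the third axis, so it is set directly as display_group.
    return [
        AtomicAttackStub(technique_name=method, seed_group_name=dataset_name, display_group=template)
        for template in templates
        for method in methods
    ]


def group_for_display(attacks: Sequence[AtomicAttackStub]) -> dict[str, list[str]]:
    """Group technique names under the template each attack belongs to."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for attack in attacks:
        grouped[attack.display_group].append(attack.technique_name)
    return dict(grouped)
```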

Follow-up param: skip objectives the baseline already cracked

I like this (don't spend jailbreak attempts on objectives that fail open with no jailbreak), but it's more than a one-liner: today baseline is just another prepended AtomicAttack and we build + run all atomic attacks together, so there's no feed-forward. It needs a two-phase run (baseline → collect objectives that scored true → build/filter the rest). I'd land it as an opt-in flag (default off) in a follow-up rather than blocking this PR.
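
The two-phase shape could look roughly like this. It is a sketch under stated assumptions: `run_two_phase`, `run_baseline`, `run_jailbreaks`, and the `skip_baseline_successes` flag are hypothetical names, and a baseline result is modeled as a plain boolean (true meaning the objective succeeded with no jailbreak).

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_two_phase(
    objectives: Sequence[str],
    run_baseline: Callable[[str], Awaitable[bool]],
    run_jailbreaks: Callable[[list[str]], Awaitable[dict]],
    skip_baseline_successes: bool = False,  # opt-in, default off
) -> tuple[dict, dict]:
    """Phase 1: baseline on every objective. Phase 2: jailbreaks on what is left."""
    results = await asyncio.gather(*(run_baseline(o) for o in objectives))
    baseline = dict(zip(objectives, results))

    if skip_baseline_successes:
        # Feed-forward: drop objectives the baseline already cracked.
        remaining = [o for o in objectives if not baseline[o]]
    else:
        remaining = list(objectives)

    jailbreak_results = await run_jailbreaks(remaining)
    return baseline, jailbreak_results
```

With the flag off, the behavior matches today's single-pass run, except that the baseline completes before the rest is built.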

Net: keep a thin _get_atomic_attacks_async (the template axis is a genuine third dimension), but every delivery method becomes a factory carrying a role-tagged seed_technique + converter config. Strategy enum = pure techniques, jailbreak axis = a selector, objectives = --dataset-names.

Standardizing Jailbreak

337ba17

varunj-msft wants to merge 1 commit into microsoft:main from varunj-msft:varunj-msft/8380-Standardizing-Scenarios-Jailbreak
