[BREAKING] FEAT: Standardize Jailbreak scenario defaults #2045

Open
varunj-msft wants to merge 1 commit into microsoft:main from varunj-msft:varunj-msft/8380-Standardizing-Scenarios-Jailbreak

@varunj-msft varunj-msft commented Jun 18, 2026

Description

The Jailbreak scenario previously ran all 162 jailbreak templates by default, so a default scan took roughly an hour and overshot the intended ~30-minute budget. This PR makes the default fast and predictable.

What changed:

  • The default now randomly samples 10 templates instead of all 162. Power users can still run the full catalog with num_templates=None, or pick specific ones with jailbreak_names=[...].
  • num_templates (and num_attempts) are now exposed as runtime parameters, so they can be set from the CLI via --num-templates N rather than only through the Python constructor.
  • The sampled template names are persisted to ScenarioResult.metadata and replayed on --resume, so resuming a run uses the same templates instead of drawing a fresh random set.
  • The chosen templates are now visible to the user. Scenario output includes a new "Scenario Inputs" section that lists which templates were tried, driven by a generic metadata["summary"] convention in the pretty printer (internal keys like objective_hashes are never shown).
  • Fixed an edge case in get_jailbreak_templates: num_templates=0 previously fell through and silently returned all templates; it (and negatives) now raise a clear error.
  • The scanner doc fast path is now --strategies simple --max-dataset-size 1 (it no longer relies on --num-templates 1).
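To make the selection rules in the list above concrete, here is a minimal sketch of the precedence they imply: replay on resume, then explicit names, then the full catalog for None, then a validated random sample. The function `select_templates`, its `persisted` argument, and the `template_N` names are hypothetical stand-ins, not the PR's actual get_jailbreak_templates code. Rejecting a request larger than the catalog is also an assumption.

```python
import random
from typing import Optional, Sequence


def select_templates(
    catalog: Sequence[str],
    num_templates: Optional[int] = 10,
    jailbreak_names: Optional[Sequence[str]] = None,
    persisted: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Pick jailbreak template names (hypothetical helper for illustration)."""
    # On resume, replay exactly the names the original run stored.
    if persisted is not None:
        return list(persisted)

    # Explicitly named templates take precedence over random sampling.
    if jailbreak_names is not None:
        unknown = [name for name in jailbreak_names if name not in catalog]
        if unknown:
            raise ValueError(f"Unknown jailbreak templates: {unknown}")
        return list(jailbreak_names)

    # None is the explicit opt-out: run the full catalog.
    if num_templates is None:
        return list(catalog)

    # Zero and negatives are errors instead of silently meaning "all".
    if num_templates <= 0:
        raise ValueError(
            f"num_templates must be a positive integer or None, got {num_templates}"
        )
    # Assumption: asking for more templates than exist is also an error.
    if num_templates > len(catalog):
        raise ValueError(
            f"num_templates={num_templates} exceeds catalog size {len(catalog)}"
        )

    rng = rng or random.Random()
    return rng.sample(list(catalog), num_templates)


# Usage: a first run samples 10 names; a resumed run replays the stored list.
catalog = [f"template_{i}" for i in range(162)]
first_run = select_templates(catalog, rng=random.Random(0))
stored_metadata = {"jailbreak_template_names": first_run}  # hypothetical key
resumed_run = select_templates(
    catalog, persisted=stored_metadata["jailbreak_template_names"]
)
```

The point of checking `persisted` first is that a resumed run never touches the random number generator, so it cannot drift from the original sample.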
This is BREAKING because the default scan output changes (same call now produces different atomic attacks). The scenario VERSION is bumped from 1 to 2 so stored results from the old default error out cleanly on --resume instead of silently mixing. Constructor signatures and the public API are unchanged.
Timing: ~48s at max_concurrency=10 on a mini-class Azure OpenAI model (11 atomic attacks, 44 objectives) — well under the ~30 min budget.
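The metadata["summary"] convention behind the "Scenario Inputs" section can be pictured roughly as follows. This is a sketch under assumptions: the function name `render_scenario_inputs`, the shape of the summary (a mapping from display label to value), and the output layout are all hypothetical. It only illustrates the idea that the printer reads the summary key and nothing else.

```python
from typing import Any, Mapping


def render_scenario_inputs(metadata: Mapping[str, Any]) -> str:
    """Render only metadata["summary"]; every other key stays internal."""
    summary = metadata.get("summary") or {}
    if not summary:
        return ""

    lines = ["Scenario Inputs"]
    for label, value in summary.items():
        # Lists such as template names are shown comma-separated.
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        lines.append(f"  {label}: {value}")
    return "\n".join(lines)


# Hypothetical metadata: one user-facing summary entry, one internal key.
example_metadata = {
    "summary": {"Jailbreak templates": ["dan", "aim"]},
    "objective_hashes": ["abc123"],
}
rendered = render_scenario_inputs(example_metadata)
```

Because the renderer allow-lists a single key rather than filtering out known internal ones, a newly added internal key cannot leak by accident.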

Tests and Documentation

  • Added/updated unit tests for the new default (10), the explicit None opt-out, jailbreak_names, metadata persistence, resume replaying the same templates, the "Scenario Inputs" rendering (including that it never leaks internal metadata), and the 0/negative edge cases.
  • Updated the AIRT scanner doc (doc/scanner/airt.py + .ipynb) with the new default and fast-path command.
  • Ran the targeted suites and the broader scenario + output suites; all are green. Ran ruff, ruff-format, and ty clean on the changed source. The JupyText .py/.ipynb pair is in sync.

@rlundeen2

I think this is easier to reason about if we start from what we want the scenario to do. I think the goal is to see how effective various jailbreaks are (or to find jailbreaks that work).

To do this, let's look at the end parameters a user actually sets on the CLI, since I think that's the clearest way to see whether the design hangs together. Roughly, the run I'd want to work:

# WHAT:  --dataset-names selects the harmful objectives
# WHICH: --jailbreak-names selects the jailbreaks (or use --num-templates N)
# HOW:   --strategies selects the delivery methods
pyrit_scan airt.jailbreak \
  --initializers target load_default_datasets \
  --target openai_chat \
  --dataset-names harmbench \
  --jailbreak-names dan aim \
  --strategies system user variation \
  --num-attempts 1

The notes below on getting this to work are over-simplifications. I could see us needing to think through each one, and some may need a prerequisite PR. But I think they should be close to working.


1. --dataset-names — the harmful objectives (the what)

Standard DatasetConfiguration. I'd switch the default off the small airt_harms sample to something effectiveness-oriented like HarmBench. Each entry is a SeedObjective; nothing exotic here.

2. --jailbreak-names / --num-templates — which jailbreaks

A selector with a default set that can be overridden. The jailbreaks themselves should be SeedAttackTechniqueGroups (is_general_technique=True) so they merge into every objective via with_technique() without touching the SeedObjective.

On how to build the selector: we don't need a heavy new class. DatasetConfiguration's generic core is already objective-agnostic: get_seed_groups() returns plain SeedGroups, and the select-by-name-or-explicit + max_dataset_size sampling + _load_seed_groups_for_dataset hook is exactly the "default set, or override" behavior we want. Only the terminal *_attack_groups() wrappers assume one objective. So: keep that selection core and add a technique-flavored terminal (get_technique_groups() → SeedAttackTechniqueGroups) plus a load hook that pulls the templates. The knobs then map straight across: names → jailbreak names, explicit seed_groups → override, max_dataset_size → --num-templates.

(Alternative if we'd rather not subclass: register the templates into memory as is_general_technique=True seed datasets and load them by name. Either works — just pick one.)

Note --jailbreak-names would be a new runtime parameter — today only --num-templates and --num-attempts are exposed in supported_parameters().
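A rough sketch of the subclassing option, to show how little sits on top of the generic selection core. Every class here (`Group`, `TechniqueGroup`, `SelectionCore`, `JailbreakSelector`) and the `TEMPLATE_CATALOG` dict are hypothetical stand-ins written for this comment. They mirror the shape described above but are not PyRIT's DatasetConfiguration or seed classes, and the template texts are placeholders.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Group:
    """Stand-in for a plain seed group: a name plus its seed texts."""
    name: str
    seeds: tuple[str, ...]


@dataclass(frozen=True)
class TechniqueGroup(Group):
    """Stand-in for a technique group that merges into every objective."""
    is_general_technique: bool = True


@dataclass
class SelectionCore:
    """Objective-agnostic core: select by name or explicit override, then sample."""
    names: list[str] = field(default_factory=list)
    explicit_groups: Optional[list[Group]] = None
    max_size: Optional[int] = None
    rng: random.Random = field(default_factory=random.Random)

    def _load_groups_for_name(self, name: str) -> list[Group]:
        raise NotImplementedError  # load hook supplied by subclasses

    def get_groups(self) -> list[Group]:
        if self.explicit_groups is not None:
            groups = list(self.explicit_groups)  # explicit override wins
        else:
            groups = [g for n in self.names for g in self._load_groups_for_name(n)]
        if self.max_size is not None and len(groups) > self.max_size:
            groups = self.rng.sample(groups, self.max_size)
        return groups


# Placeholder catalog; real template text is intentionally omitted.
TEMPLATE_CATALOG = {"dan": "<dan text>", "aim": "<aim text>", "other": "<other text>"}


class JailbreakSelector(SelectionCore):
    """Adds only a load hook and a technique-flavored terminal."""

    def _load_groups_for_name(self, name: str) -> list[Group]:
        return [Group(name=name, seeds=(TEMPLATE_CATALOG[name],))]

    def get_technique_groups(self) -> list[TechniqueGroup]:
        return [TechniqueGroup(name=g.name, seeds=g.seeds) for g in self.get_groups()]


# names -> jailbreak names, max_size -> --num-templates
selector = JailbreakSelector(names=list(TEMPLATE_CATALOG), max_size=2, rng=random.Random(0))
sampled = selector.get_technique_groups()
```

In this sketch the subclass adds two short methods and inherits all of the selection and sampling behavior, which is the argument for not writing a heavy new class.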

3. --strategies — the delivery methods (the how)

These are registered as defaults; more become selectable once they are registered:

  • baseline (objective only — already auto-prepended)
  • jailbreak on the system prompt — technique seed role="system"
  • jailbreak as the first user prompt — technique seed role="user"
  • variation of the jailbreak, on system + user
  • translation (random language), on system + user

system-vs-user is just the seed's role — no special converter. Each method is an AttackTechniqueFactory / AttackTechnique carrying the jailbreak seed_technique at the right role + converter config, instead of the current match strategy block with TextJailbreakConverter. That gets the strategy enum back to pure techniques.

For variation/translation: attach the converter to the attack (don't pre-convert offline) and scope it with PrependedConversationConfig(apply_converters_to_roles=[...]). request_converters also hit the live objective turn, so the variation/translation lands on both the jailbreak and the objective — which I think is what we want, and it avoids a build-time LLM call so resume stays cheap.
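To illustrate "the delivery method is just a role plus converter config", here is a small model of how each method would shape the conversation. `Message` and `DeliveryMethod` are hypothetical types, `str.upper` is a stand-in for an LLM-backed variation or translation converter, and modeling "variation" as a system-role seed is an assumption. This sketch does not use AttackTechniqueFactory or PrependedConversationConfig; it only mimics the scoping behavior described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Message:
    role: str
    text: str


@dataclass(frozen=True)
class DeliveryMethod:
    """A delivery method: where the jailbreak seed goes and what gets converted."""
    name: str
    role: Optional[str]  # None means baseline (objective only)
    converter: Optional[Callable[[str], str]] = None
    convert_roles: tuple[str, ...] = ()  # prepended roles the converter applies to

    def build_conversation(self, jailbreak_text: str, objective: str) -> list[Message]:
        convert = self.converter or (lambda text: text)
        prepended = []
        if self.role is not None:
            # The prepended jailbreak is converted only if its role is in scope.
            in_scope = self.role in self.convert_roles
            text = convert(jailbreak_text) if in_scope else jailbreak_text
            prepended.append(Message(self.role, text))
        # The live objective turn always passes through the request converter.
        return prepended + [Message("user", convert(objective))]


BASELINE = DeliveryMethod("baseline", role=None)
SYSTEM = DeliveryMethod("system", role="system")
USER = DeliveryMethod("user", role="user")
VARIATION = DeliveryMethod(
    "variation", role="system", converter=str.upper, convert_roles=("system", "user")
)

conversation = VARIATION.build_conversation("<jailbreak text>", "<objective>")
```

Because the converter runs when the conversation is built for sending, nothing has to be converted ahead of time, which matches the point about keeping resume cheap.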

4. --num-attempts — repeats per (jailbreak × method × objective)

Keep this knob; I like it. It comes for free from the base class.


Cross-cutting: output grouping

Group by jailbreak template — that's the axis people compare across. Heads-up: the default _build_display_group hook only gets technique_name + seed_group_name; the template is a third axis, so since we're keeping a thin _get_atomic_attacks_async, just set display_group=<template_name> directly when constructing each AtomicAttack.
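A minimal sketch of that grouping, assuming one atomic attack per (template, method) pair plus a single baseline. `AtomicAttackSketch`, `build_atomic_attacks`, and `group_for_display` are hypothetical names and not the real AtomicAttack constructor. Putting the baseline in its own "baseline" group is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AtomicAttackSketch:
    """Stand-in for an atomic attack; only the fields needed for grouping."""
    name: str
    display_group: str


def build_atomic_attacks(templates: list[str], methods: list[str]) -> list[AtomicAttackSketch]:
    # Assumption: the baseline is reported under its own group.
    attacks = [AtomicAttackSketch(name="baseline", display_group="baseline")]
    for template, method in product(templates, methods):
        # The template is the third axis, so set it as the display group directly.
        attacks.append(
            AtomicAttackSketch(name=f"{template}_{method}", display_group=template)
        )
    return attacks


def group_for_display(attacks: list[AtomicAttackSketch]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for attack in attacks:
        groups[attack.display_group].append(attack.name)
    return dict(groups)


attacks = build_atomic_attacks(["dan", "aim"], ["system", "user"])
grouped = group_for_display(attacks)
```

With two templates and two methods this yields five atomic attacks in three display groups, so results for one template across delivery methods sit side by side.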

Follow-up param: skip objectives the baseline already cracked

I like this (don't spend jailbreak attempts on objectives that fail open with no jailbreak), but it's more than a one-liner: today baseline is just another prepended AtomicAttack and we build + run all atomic attacks together, so there's no feed-forward. It needs a two-phase run (baseline → collect objectives that scored true → build/filter the rest). I'd land it as an opt-in flag (default off) in a follow-up rather than blocking this PR.
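The two-phase shape could look roughly like this. It is a synchronous sketch for readability (the real scenario code is async), and `two_phase_run`, `run_baseline`, `run_jailbreaks`, and the `skip_baseline_successes` flag are hypothetical names. The callables stand in for whatever actually executes and scores the attacks.

```python
from typing import Callable, Sequence


def two_phase_run(
    objectives: Sequence[str],
    run_baseline: Callable[[str], bool],
    run_jailbreaks: Callable[[list[str]], dict[str, bool]],
    skip_baseline_successes: bool = False,  # opt-in, default off
) -> tuple[dict[str, bool], dict[str, bool]]:
    # Phase 1: run the baseline and record which objectives scored true.
    baseline_results = {objective: run_baseline(objective) for objective in objectives}

    # Phase 2: optionally drop objectives the baseline already cracked.
    if skip_baseline_successes:
        remaining = [o for o in objectives if not baseline_results[o]]
    else:
        remaining = list(objectives)

    jailbreak_results = run_jailbreaks(remaining)
    return baseline_results, jailbreak_results


# Usage with toy stand-ins: objective "b" succeeds with no jailbreak at all.
objectives = ["a", "b", "c"]
baseline, jailbreaks = two_phase_run(
    objectives,
    run_baseline=lambda objective: objective == "b",
    run_jailbreaks=lambda remaining: {o: True for o in remaining},
    skip_baseline_successes=True,
)
```

The feed-forward step is the filter between the two phases, which is exactly what the current single-pass build cannot express.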


Net: keep a thin _get_atomic_attacks_async (the template axis is a genuine third dimension), but every delivery method becomes a factory carrying a role-tagged seed_technique + converter config. Strategy enum = pure techniques, jailbreak axis = a selector, objectives = --dataset-names.
