Draft: AdversarialConversationManager #2053
Draft
rlundeen2 wants to merge 6 commits into
Add a shared `adversarial_chat` JSON schema (next_message, rationale, last_response_summary) and wire the Crescendo, TAP, and PAIR adversarial-chat prompts onto it so their prompts are interchangeable. Parsers now read/return next_message and forward the schema to schema-aware targets via prompt metadata. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The message normalizer already appends the response JSON schema when it is forwarded via prompt metadata, so the hand-written schema block in the Crescendo system prompts duplicated that instruction. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Routes Crescendo and TAP adversarial sends through a real PromptNormalizer and a MockPromptTarget (which lacks native JSON_SCHEMA support) to verify the shared adversarial_chat schema is forwarded via prompt metadata and rendered into the prompt the adversarial chat receives. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eam-attack-schema # Conflicts: # pyrit/datasets/executors/crescendo/crescendo_variant_1.yaml # pyrit/datasets/executors/crescendo/escalation_crisis.yaml # pyrit/datasets/executors/crescendo/therapist.yaml # pyrit/datasets/executors/pair/attacker_system_prompt.yaml # pyrit/datasets/executors/tree_of_attacks/adversarial_system_prompt.yaml # pyrit/datasets/executors/tree_of_attacks/image_generation.yaml
`AdversarialConversationManager` simplifies managing adversarial conversations by giving us one central place to supply context to the adversarial chat.
Today each multi-turn attack (Red Teaming, Crescendo, TAP, PAIR) hand-rolls the same mechanics: holding the adversarial system prompt, building the per-turn message, sending it on a stable conversation id, and parsing the reply. This manager owns one adversarial conversation and centralizes those mechanics.
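To make the ownership idea concrete, here is a minimal sketch of an object that holds the system prompt and a stable conversation id and routes every turn through it. It is not the class in this PR: `AdversarialConversationSketch`, `SendFn`, and `next_turn` are hypothetical names, and the real manager talks to a PyRIT prompt target rather than a plain callable.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for an adversarial chat target:
# takes (conversation_id, message_text) and returns the reply text.
SendFn = Callable[[str, str], str]


@dataclass
class AdversarialConversationSketch:
    """Owns one adversarial conversation (illustrative only)."""

    system_prompt: str
    send: SendFn
    # One id is generated at construction and reused for every turn.
    conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    history: List[Tuple[str, str]] = field(default_factory=list)

    def next_turn(self, turn_message: str) -> str:
        # Every send goes out on the same conversation id.
        reply = self.send(self.conversation_id, turn_message)
        self.history.append((turn_message, reply))
        return reply
```

An attack would construct one such object per adversarial conversation and call `next_turn` each round instead of tracking the id and history itself.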
Because everything lives in one place, it's also where we decide what context the adversarial chat gets each turn: the objective, the latest score, and the objective target's last response (bucketed by data type, e.g. `{{ message.text.converted_value }}`). All of it is supplied through a single `adversarial_prompt_template`. Adding a new signal (a new score field, multimodal pieces, etc.) becomes a template change rather than per-attack code.
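As a rough illustration of the template-driven context, the sketch below fills a per-turn template from an objective, a score, and a bucketed message. The `render` helper is a tiny stdlib stand-in for a real template engine, and the `objective` and `score` placeholder names are assumptions; only `message.text.converted_value` comes from the description above.

```python
import re
from types import SimpleNamespace

# Matches placeholders such as {{ objective }} or {{ message.text.converted_value }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, **context) -> str:
    """Resolve dotted placeholders against keyword arguments (illustrative only)."""

    def resolve(match):
        parts = match.group(1).split(".")
        value = context[parts[0]]
        for attr in parts[1:]:
            value = getattr(value, attr)
        return str(value)

    return _PLACEHOLDER.sub(resolve, template)


# Hypothetical template; the placeholder names other than
# message.text.converted_value are assumptions.
adversarial_prompt_template = (
    "Objective: {{ objective }}\n"
    "Latest score: {{ score }}\n"
    "Target said: {{ message.text.converted_value }}"
)

# The last response, bucketed by data type (text only in this sketch).
message = SimpleNamespace(text=SimpleNamespace(converted_value="I can't help with that."))
prompt = render(
    adversarial_prompt_template,
    objective="example objective",
    score=0.2,
    message=message,
)
```

Under this shape, exposing a new signal means adding a placeholder to the template and a key to the context, with no change to the individual attacks.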
It also unifies the shared `adversarial_chat` JSON schema and is the single home for JSON retry, so schema-aware targets can natively constrain the reply and every attack gets consistent parsing and retry behavior.
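A minimal sketch of what shared parsing plus retry could look like, assuming the reply is a JSON object carrying the three schema fields named in the commits (`next_message`, `rationale`, `last_response_summary`). The function names, the `send` callable, and the attempt limit are hypothetical, not the PR's API.

```python
import json
from typing import Callable

# Field names taken from the shared adversarial_chat schema described in this PR.
REQUIRED_KEYS = ("next_message", "rationale", "last_response_summary")


def parse_adversarial_reply(raw: str) -> dict:
    """Parse a reply and check that the schema fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


def send_with_json_retry(send: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call `send` until it yields a valid reply or attempts run out (illustrative)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_adversarial_reply(send())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error
```

Keeping this in one place means Crescendo, TAP, and PAIR would all fail and retry the same way on a malformed reply.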