
FEAT: Add word-game option to DecompositionConverter #2051

Open
Raulster24 wants to merge 3 commits into microsoft:main from Raulster24:raulster24/add-decomposition-word-game

Raulster24 commented Jun 18, 2026

Contributor

Description

This adds an optional word-game mode to DecompositionConverter (the DrAttack decompose-and-reconstruct converter from #2003), via use_word_game: bool = False. When enabled, each harmful noun phrase is replaced by an innocuous codeword in the reconstruction questions, and a mapping preamble (for example 'apple' means 'a bomb') is established in the same prompt. This is the second half of DrAttack: it further conceals the harmful nouns by keeping them out of the request itself and referring to them only through codewords.

Off by default, so the merged converter behaviour is unchanged.

Two design choices worth flagging up front:

Inline, not a separate prepended conversation. We had discussed the word-game as a prepended/simulated conversation; I went with inline (preamble and reconstruction in one prompt) for two reasons. First, coupling: the codewords must match the reconstruction the converter builds, and a separate conversation generates its turns independently, so the two cannot share the mapping without a stateful component (an attack class), which we wanted to avoid. Inline keeps it a pure converter. Second, the numbers: inline is within 2 points of the two-turn version on gpt-4o and 10 points behind it on gpt-4o-mini, and both are far above no word-game (core):

  • n=50, GPT-judge ASR: gpt-4o inline 44% / two-turn 46% / core 16%; gpt-4o-mini inline 52% / two-turn 62% / core 22%.
  • n=15 with the actual converter (not the harness), gpt-4o-mini: word-game off 20%, on 73%.

So inline retains nearly all of the effect on the frontier model (gpt-4o: 44% vs. 46%), with no new attack class. Open to the prepended-conversation route if you prefer it.

A toggle on the converter, not a separate converter. The codewords have to stay in sync with the reconstruction this converter produces, so a separate converter cannot do it; it has to be a mode of this converter.

Note on the mechanism: the harmful phrase still appears once, in the mapping line; the concealment is that the question uses the codeword, splitting the harmful term from the request. This is the paper's word-game, and the numbers above show the lift.

(All numbers are GPT-judge refusal-bypass, not operational harm, consistent with the #2003 assessment.)

Tests and Documentation

  • Added unit tests: codeword substitution, off-mode unchanged, custom codewords, and a clear error when there are more noun phrases than codewords.
  • Documented the new use_word_game parameter in doc/code/converters/1_text_to_text_converters.py; ran jupytext --sync.
  • ruff check and format clean; ty reports no errors; full converter and docs test suites pass.

cc @rlundeen2 @romanlutz

adrian-gavrila self-assigned this Jun 19, 2026

adrian-gavrila left a comment

Contributor

Thanks for the contribution! A few small things worth attention, but overall this looks great.

word_game_prompt (SeedPrompt | None): Template for the word-game mapping preamble. Defaults
    to the bundled ``decomposition/word_game_preamble.yaml``. Only used when
    ``use_word_game`` is True.
codewords (tuple[str, ...]): Innocuous codewords substituted for harmful noun phrases when

Contributor

Nit: Rationale's already on the _CODEWORDS comment and the Raises: block; arg docstrings here stay terse (cf. _MIN_RECALL). Could trim.

Contributor Author

trimmed to one line; the bound is already covered by the overflow ValueError

self._word_game_prompt = word_game_prompt or SeedPrompt.from_yaml_file(
    _DECOMPOSITION_DIR / "word_game_preamble.yaml"
)
self._codewords = codewords

Contributor

codewords isn't validated for uniqueness; duplicates silently yield an ambiguous mapping ('apple' means 'bomb'; 'apple' means 'gun'). Worth a fail-fast len(set(codewords)) != len(codewords) check in __init__?

Contributor Author

Added a fail-fast len(set(codewords)) != len(codewords) check in __init__, plus a Raises doc entry and test_duplicate_codewords_raise.
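
For readers following the thread, the uniqueness check discussed above can be sketched in isolation as follows. This is a minimal illustration, not the code from the PR: the function name validate_codewords and the exact error wording are hypothetical, and only the len(set(codewords)) != len(codewords) condition comes from the review comments.

```python
from collections import Counter


def validate_codewords(codewords: tuple[str, ...]) -> None:
    """Fail fast if any codeword appears more than once.

    Hypothetical helper: the name and message are illustrative only.
    The condition mirrors the check proposed in the review.
    """
    if len(set(codewords)) != len(codewords):
        # Collect the offending entries so the error is actionable.
        counts = Counter(codewords)
        duplicates = sorted(word for word, n in counts.items() if n > 1)
        raise ValueError(f"codewords must be unique; duplicated: {duplicates}")


# A unique tuple passes silently; a tuple with a repeat raises ValueError.
validate_codewords(("apple", "banana", "cherry"))
```

Running the check at construction time means an ambiguous mapping is rejected before any prompt is converted, rather than surfacing later as a confusing result.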

if self._use_word_game:
    if noun_index > len(self._codewords):
        raise ValueError(
            f"word-game supports at most {len(self._codewords)} noun phrases, got {noun_index}"

Contributor

Nit: noun_index is the first overflowing index (len+1), not the total noun count, so 25 nouns with 20 codewords reports "got 21". Maybe reword to a threshold breach.

Contributor Author

Reworded to state the threshold breach, no misleading count
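
A sketch of what a threshold-style overflow check could look like is shown below. It is an assumption-laden illustration, not the reworded code from the PR: the helper name codeword_for, the 1-based indexing (inferred from the reviewer's "len+1" remark), and the message text are all hypothetical.

```python
def codeword_for(noun_index: int, codewords: tuple[str, ...]) -> str:
    """Return the codeword for a 1-based noun index.

    Hypothetical helper. The error states only that the limit was
    exceeded, since the index at the point of failure is not the
    total number of noun phrases in the prompt.
    """
    if noun_index < 1:
        raise ValueError("noun_index must be 1 or greater")
    if noun_index > len(codewords):
        raise ValueError(
            f"word-game supports at most {len(codewords)} noun phrases; "
            "the prompt contains more than that"
        )
    return codewords[noun_index - 1]
```

With this wording, a prompt with 25 noun phrases and 20 codewords reports that the limit of 20 was exceeded, without implying that exactly 21 were found.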

@Raulster24

Contributor Author

@adrian-gavrila Thanks for the review. Addressed all three: codeword uniqueness is now validated in __init__ with a test, the arg docstring is trimmed, and the overflow message states the threshold breach instead of a count.
