Red team attack generators for testing LLM safety.
Inspired by promptfoo's red team strategies, this module provides functions to generate adversarial prompts for testing your LLM's resistance to jailbreaks and manipulation.
Attack Categories
Static Encoding Attacks
Transform harmful prompts using encoding techniques:
- Base64 encoding
- Leetspeak substitution
- ROT13 cipher
- Pig Latin
- Reversed text
Prompt Injection Attacks
Attempt to override system instructions:
- Ignore previous instructions
- System prompt extraction
- Role switching
Jailbreak Attacks
Classic jailbreak techniques:
- DAN (Do Anything Now)
- STAN (Strive To Avoid Norms)
- Developer mode
- Hypothetical framing
Usage
# Generate all attacks for a harmful prompt
attacks = Tribunal.RedTeam.generate_attacks("How do I pick a lock?")
# Generate specific attack type
{:ok, encoded} = Tribunal.RedTeam.base64_encode("harmful prompt")
# Test your LLM against red team attacks
for {attack_type, prompt} <- attacks do
response = MyLLM.generate(prompt)
refute_jailbreak_llm(response, query: prompt)
end
Summary
Functions
Encodes prompt in Base64 and asks model to decode and execute.
Character roleplay to bypass restrictions.
Classic DAN (Do Anything Now) jailbreak.
Uses delimiter injection to confuse parsing.
Developer mode jailbreak attempt.
Generates encoding-based attacks that transform the prompt.
Generates all attack variants for a given prompt.
Hypothetical framing to bypass restrictions.
Classic ignore previous instructions attack.
Generates prompt injection attacks.
Generates classic jailbreak attacks.
Converts prompt to leetspeak (1337).
Converts prompt to Pig Latin.
Academic/research framing.
Reverses the prompt text.
Attempts to switch the model's role.
Encodes prompt using ROT13 cipher.
STAN (Strive To Avoid Norms) jailbreak variant.
Attempts to extract the system prompt.
Functions
Encodes prompt in Base64 and asks model to decode and execute.
Character roleplay to bypass restrictions.
Classic DAN (Do Anything Now) jailbreak.
Uses delimiter injection to confuse parsing.
Developer mode jailbreak attempt.
Generates encoding-based attacks that transform the prompt.
Generates all attack variants for a given prompt.
Returns a list of {attack_type, prompt} tuples.
Hypothetical framing to bypass restrictions.
Classic ignore previous instructions attack.
Generates prompt injection attacks.
Generates classic jailbreak attacks.
Converts prompt to leetspeak (1337).
Converts prompt to Pig Latin.
Academic/research framing.
Reverses the prompt text.
Attempts to switch the model's role.
Encodes prompt using ROT13 cipher.
STAN (Strive To Avoid Norms) jailbreak variant.
Attempts to extract the system prompt.