Tribunal integrates with ExUnit through the Tribunal.EvalCase module, providing assertion macros for LLM output evaluation.
Test Mode: ExUnit assertions fail immediately on any violation. This is intentional: use Test Mode for critical checks that must pass (safety, compliance, CI gates). For threshold-based evaluation with reporting, use Evaluation Mode instead.
Setup
Add `use Tribunal.EvalCase` to your test module:
```elixir
defmodule MyApp.LLMTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  test "response quality" do
    response = MyApp.generate("What is Elixir?")

    assert_contains response, "programming language"
    assert_min_length response, 50
  end
end
```

Deterministic Assertions
These run instantly without external API calls.
String Matching
```elixir
# Substring presence
assert_contains response, "expected text"
assert_contains response, ["text1", "text2"]  # all must be present
refute_contains response, "unwanted text"

# At least one must match
assert_contains_any response, ["option1", "option2", "option3"]

# All must match
assert_contains_all response, ["required1", "required2"]

# Exact match
assert_equals response, "exact expected output"

# Prefix and suffix
assert_starts_with response, "Hello"
assert_ends_with response, "Thank you."
```

Pattern Matching
```elixir
# Regex matching
assert_regex response, ~r/\d{3}-\d{4}/

# Valid JSON
assert_json response

# Valid URL
assert_url response

# Valid email
assert_email response
```

Length Constraints
```elixir
# Character length
assert_min_length response, 100
assert_max_length response, 500

# Word count
assert_word_count response, min: 10, max: 100

# Token limit (approximate)
assert_max_tokens response, 150
```

Edit Distance
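As a refresher on the metric behind `max_distance:`, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal pure-Elixir sketch of the classic dynamic-programming algorithm (illustrative only, not Tribunal's internal implementation):

```elixir
defmodule EditDistance do
  # Row-by-row dynamic-programming Levenshtein distance.
  def levenshtein(a, b) do
    bg = String.graphemes(b)
    first_row = Enum.to_list(0..length(bg))

    String.graphemes(a)
    |> Enum.with_index(1)
    |> Enum.reduce(first_row, fn {ca, i}, prev_row ->
      # Build the next row in reverse; head of the accumulator is the cell to the left.
      Enum.zip(bg, prev_row)
      |> Enum.reduce([i], fn {cb, diag}, [left | _] = row ->
        up = Enum.at(prev_row, length(row))
        sub = if ca == cb, do: diag, else: diag + 1
        [Enum.min([left + 1, up + 1, sub]) | row]
      end)
      |> Enum.reverse()
    end)
    |> List.last()
  end
end

EditDistance.levenshtein("kitten", "sitting")  # => 3
```

So `max_distance: 5` passes as long as at most five such edits separate the response from the expected output.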
```elixir
# Levenshtein distance for fuzzy matching
assert_levenshtein response, "expected output", max_distance: 5
```

Safety Checks
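For intuition, deterministic PII detection of this kind is typically pattern-based. A generic sketch (these regexes are illustrative examples, not Tribunal's actual rule set):

```elixir
defmodule PiiCheck do
  # Illustrative PII patterns (examples only, not Tribunal's actual rules).
  @patterns [
    email: ~r/[\w.+-]+@[\w-]+\.[\w.]+/,
    ssn: ~r/\b\d{3}-\d{2}-\d{4}\b/
  ]

  # Returns true if any pattern matches the text.
  def has_pii?(text) do
    Enum.any?(@patterns, fn {_name, re} -> Regex.match?(re, text) end)
  end
end

PiiCheck.has_pii?("Contact: jane@example.com")  # => true
PiiCheck.has_pii?("No identifiers here.")       # => false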
```elixir
# Detect refusal patterns
assert_refusal response  # passes if response is a refusal

# No PII (emails, phones, SSNs, credit cards)
refute_pii response

# No toxic language
refute_toxic response
```

LLM-as-Judge Assertions
These use an LLM to evaluate outputs. Requires the `req_llm` dependency.
Faithfulness and Relevancy
```elixir
@context ["The store is open Monday-Friday 9am-5pm."]

test "response is grounded in context" do
  response = MyApp.query("When is the store open?")

  # Output should be faithful to provided context
  assert_faithful response, context: @context

  # Output should address the query
  assert_relevant response, query: "When is the store open?"

  # No claims beyond the context
  refute_hallucination response, context: @context
end
```

Correctness
```elixir
test "answer is correct" do
  response = MyApp.solve("What is 2 + 2?")

  assert_correctness response,
    query: "What is 2 + 2?",
    expected: "4"
end
```

Safety Evaluations
```elixir
test "response is safe" do
  response = MyApp.chat(user_input)

  # No bias or stereotypes
  refute_bias response, query: user_input

  # No toxic language
  refute_toxicity response, query: user_input

  # No harmful advice
  refute_harmful response, query: user_input

  # No jailbreak indicators
  refute_jailbreak response, query: user_input

  # No PII (LLM detection for subtle cases)
  refute_pii response, query: user_input, llm: true
end
```

Custom Judges
Use `assert_judge` for custom domain-specific evaluations:
```elixir
test "response matches brand voice" do
  response = MyApp.chat(user_input)
  assert_judge :brand_voice, response, query: user_input
end
```

See the LLM-as-Judge guide for creating custom judges.
Embedding-Based Assertions
These use semantic similarity. Requires the `alike` dependency.
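For context, the score compared against `threshold:` is conventionally the cosine similarity of the two texts' embedding vectors: 1.0 means identical direction, 0.0 means unrelated. A generic sketch of the metric (not `alike`'s actual implementation):

```elixir
defmodule Semantic do
  # Cosine similarity between two embedding vectors.
  def cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  # Euclidean (L2) norm of a vector.
  defp norm(v), do: :math.sqrt(Enum.sum(Enum.map(v, fn x -> x * x end)))
end

Semantic.cosine([1.0, 0.0], [1.0, 0.0])  # => 1.0 (identical direction)
Semantic.cosine([1.0, 0.0], [0.0, 1.0])  # => 0.0 (orthogonal)
```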
```elixir
test "semantically similar to expected" do
  response = MyApp.summarize(article)

  assert_similar response,
    expected: "The article discusses climate change impacts.",
    threshold: 0.8
end
```

Dataset-Driven Testing
Generate tests automatically from JSON or YAML datasets.
Basic Usage
```elixir
defmodule MyApp.EvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json"
end
```

With Provider Function
The provider function receives each input and returns the actual output:
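For example, a minimal provider could look like this (`EchoProvider` is a hypothetical module for illustration, assuming the function is called with each item's input as its single argument):

```elixir
defmodule EchoProvider do
  # Receives each dataset item's input and returns the
  # actual output that the item's assertions run against.
  def query(input), do: "You asked: " <> input
end

EchoProvider.query("What is the return policy?")
# => "You asked: What is the return policy?"
```

It would then be wired in as `provider: {EchoProvider, :query}`.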
```elixir
tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query}
```

With Default Options
```elixir
tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query},
  defaults: [threshold: 0.9]
```

Dataset Format
```json
[
  {
    "input": "What is the return policy?",
    "context": "Returns accepted within 30 days with receipt.",
    "expected": {
      "contains": ["30 days"],
      "faithful": {"threshold": 0.8}
    }
  }
]
```

Each item generates a test that:
- Calls the provider with `input`
- Runs all assertions from `expected`
- Fails if any assertion fails
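Since datasets can also be written in YAML, the same item could be expressed as (assuming the YAML form mirrors the JSON schema above):

```yaml
- input: "What is the return policy?"
  context: "Returns accepted within 30 days with receipt."
  expected:
    contains: ["30 days"]
    faithful:
      threshold: 0.8
```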
Options for LLM Assertions
All LLM-as-judge assertions accept these options:
```elixir
assert_faithful response,
  context: @context,
  model: "anthropic:claude-3-5-sonnet-latest",  # override default model
  threshold: 0.9,    # pass/fail threshold
  temperature: 0.0,  # LLM temperature
  max_tokens: 500    # max response tokens
```

Default model: anthropic:claude-3-5-haiku-latest
Default threshold: 0.8
Test Organization
Recommended structure:
```
test/
  evals/
    datasets/
      questions.json
      safety.yaml
    my_app/
      rag_test.exs
      safety_test.exs
```

Example test file:
```elixir
# test/evals/my_app/rag_test.exs
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @moduletag :eval

  # Dataset-driven tests
  tribunal_eval "test/evals/datasets/questions.json",
    provider: {MyApp.RAG, :query}

  # Manual tests for edge cases
  describe "edge cases" do
    test "handles empty context" do
      response = MyApp.RAG.query("Unknown topic", context: [])
      assert_refusal response
    end
  end
end
```

Run just the evals:

```shell
mix test --only eval
```