Tribunal integrates with ExUnit through the Tribunal.EvalCase module, providing assertion macros for LLM output evaluation.

Test Mode: ExUnit assertions fail immediately on any violation. This is intentional: use Test Mode for critical checks that must pass (safety, compliance, CI gates). For threshold-based evaluation with reporting, use Evaluation Mode instead.

Setup

Add use Tribunal.EvalCase to your test module:

defmodule MyApp.LLMTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  test "response quality" do
    response = MyApp.generate("What is Elixir?")

    assert_contains response, "programming language"
    assert_min_length response, 50
  end
end

Deterministic Assertions

These run instantly without external API calls.

String Matching

# Substring presence
assert_contains response, "expected text"
assert_contains response, ["text1", "text2"]  # all must be present
refute_contains response, "unwanted text"

# At least one must match
assert_contains_any response, ["option1", "option2", "option3"]

# All must match
assert_contains_all response, ["required1", "required2"]

# Exact match
assert_equals response, "exact expected output"

# Prefix and suffix
assert_starts_with response, "Hello"
assert_ends_with response, "Thank you."

Pattern Matching

# Regex matching
assert_regex response, ~r/\d{3}-\d{4}/

# Valid JSON
assert_json response

# Valid URL
assert_url response

# Valid email
assert_email response

Length Constraints

# Character length
assert_min_length response, 100
assert_max_length response, 500

# Word count
assert_word_count response, min: 10, max: 100

# Token limit (approximate)
assert_max_tokens response, 150

Edit Distance

# Levenshtein distance for fuzzy matching
assert_levenshtein response, "expected output", max_distance: 5

Safety Checks

# Detect refusal patterns
assert_refusal response  # passes if response is a refusal

# No PII (emails, phones, SSN, credit cards)
refute_pii response

# No toxic language
refute_toxic response

LLM-as-Judge Assertions

These use an LLM to evaluate outputs. Requires req_llm dependency.

Faithfulness and Relevancy

@context ["The store is open Monday-Friday 9am-5pm."]

test "response is grounded in context" do
  response = MyApp.query("When is the store open?")

  # Output should be faithful to provided context
  assert_faithful response, context: @context

  # Output should address the query
  assert_relevant response, query: "When is the store open?"

  # No claims beyond the context
  refute_hallucination response, context: @context
end

Correctness

test "answer is correct" do
  response = MyApp.solve("What is 2 + 2?")

  assert_correctness response,
    query: "What is 2 + 2?",
    expected: "4"
end

Safety Evaluations

test "response is safe" do
  response = MyApp.chat(user_input)

  # No bias or stereotypes
  refute_bias response, query: user_input

  # No toxic language
  refute_toxicity response, query: user_input

  # No harmful advice
  refute_harmful response, query: user_input

  # No jailbreak indicators
  refute_jailbreak response, query: user_input

  # No PII (LLM detection for subtle cases)
  refute_pii response, query: user_input, llm: true
end

Custom Judges

Use assert_judge for custom domain-specific evaluations:

test "response matches brand voice" do
  response = MyApp.chat(user_input)

  assert_judge :brand_voice, response, query: user_input
end

See LLM-as-Judge guide for creating custom judges.

Embedding-Based Assertions

These use semantic similarity. Requires alike dependency.

test "semantically similar to expected" do
  response = MyApp.summarize(article)

  assert_similar response,
    expected: "The article discusses climate change impacts.",
    threshold: 0.8
end

Dataset-Driven Testing

Generate tests automatically from JSON or YAML datasets.

Basic Usage

defmodule MyApp.EvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json"
end

With Provider Function

The provider function receives each input and returns the actual output:

tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query}

With Default Options

tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query},
  defaults: [threshold: 0.9]

Dataset Format

[
  {
    "input": "What is the return policy?",
    "context": "Returns accepted within 30 days with receipt.",
    "expected": {
      "contains": ["30 days"],
      "faithful": {"threshold": 0.8}
    }
  }
]

Each item generates a test that:

  1. Calls the provider with input
  2. Runs all assertions from expected
  3. Fails if any assertion fails

Options for LLM Assertions

All LLM-as-judge assertions accept these options:

assert_faithful response,
  context: @context,
  model: "anthropic:claude-3-5-sonnet-latest",  # override default model
  threshold: 0.9,                                # pass/fail threshold
  temperature: 0.0,                              # LLM temperature
  max_tokens: 500                                # max response tokens

Default model: anthropic:claude-3-5-haiku-latest Default threshold: 0.8

Test Organization

Recommended structure:

test/
  evals/
    datasets/
      questions.json
      safety.yaml
    my_app/
      rag_test.exs
      safety_test.exs

Example test file:

# test/evals/my_app/rag_test.exs
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @moduletag :eval

  # Dataset-driven tests
  tribunal_eval "test/evals/datasets/questions.json",
    provider: {MyApp.RAG, :query}

  # Manual tests for edge cases
  describe "edge cases" do
    test "handles empty context" do
      response = MyApp.RAG.query("Unknown topic", context: [])

      assert_refusal response
    end
  end
end

Run just evals:

mix test --only eval