LLM evaluation framework for Elixir.

Tribunal provides tools for evaluating and testing LLM outputs, detecting hallucinations, and measuring response quality.

[!TIP] See tribunal-juror for an interactive Phoenix app to explore and test Tribunal's evaluation capabilities.

Test Mode vs Evaluation Mode

Tribunal offers two modes for different use cases:

  Mode         Interface   Use Case                         Failure Behavior
  Test         ExUnit      CI gates, safety checks          Fails immediately on any failure
  Evaluation   Mix Task    Benchmarking, baseline tracking  Configurable thresholds

Test Mode is for "this must work" cases: safety checks, refusal detection, critical RAG accuracy. Tests fail fast on any violation.

Evaluation Mode is for "track how well we're doing": run hundreds of evals, compare models, monitor regression over time. Set thresholds like "pass if 80% succeed."

Installation

def deps do
  [
    {:tribunal, "~> 0.1.0"},

    # Optional: for LLM-as-judge evaluations
    {:req_llm, "~> 1.2"},

    # Optional: for embedding-based similarity
    {:alike, "~> 0.1"}
  ]
end

Quick Start

ExUnit Integration

defmodule MyApp.RAGTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @context ["Returns are accepted within 30 days with receipt."]

  test "response is faithful to context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @context
    refute_hallucination response, context: @context
  end
end

Dataset-Driven Evaluations

# test/evals/rag_test.exs
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json",
    provider: {MyApp.RAG, :query}
end
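
The dataset format isn't shown above, so the JSON below is a hypothetical sketch of what questions.json might contain. The field names (input, context, assertions) are assumptions for illustration, not Tribunal's documented schema:

```json
[
  {
    "input": "What's the return policy?",
    "context": ["Returns are accepted within 30 days with receipt."],
    "assertions": [
      {"type": "contains", "value": "30 days"},
      {"type": "faithful"}
    ]
  }
]
```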

Evaluation Mode (Mix Task)

# Initialize evaluation structure
mix tribunal.init

# Run evaluations (default: always exit 0, just report)
mix tribunal.eval

# Set pass threshold (fail if pass rate < 80%)
mix tribunal.eval --threshold 0.8

# Strict mode (fail on any failure)
mix tribunal.eval --strict

# Run in parallel for speed
mix tribunal.eval --concurrency 5

# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github  # GitHub Actions annotations
Example output:

Tribunal LLM Evaluation

Summary

  Total:     12 test cases
  Passed:    10 (83%)
  Failed:    2
  Duration:  1.4s

Results by Metric

  faithful       8/8 passed    100%  
  relevant       6/8 passed    75%   
  contains       10/10 passed  100%  
  pii            4/4 passed    100%  

Failed Cases

  1. "What is the return policy for electronics?"
     relevant: Response discusses refunds but doesn't address return policy

  2. "Can I return opened software?"
     relevant: Response is generic, doesn't mention software-specific policy


 PASSED (threshold: 80%)

Assertion Types

Deterministic (instant, no API calls)

  • assert_contains / refute_contains - Substring matching
  • assert_regex - Pattern matching
  • assert_json - Validates well-formed JSON output
  • assert_max_tokens - Enforces a token limit
  • Full list in the assertions guide

LLM-as-Judge (requires req_llm)

  • assert_faithful - Grounded in context
  • assert_relevant - Addresses query
  • assert_refusal - Detects refusal responses
  • refute_hallucination - No fabricated info
  • refute_bias - No stereotypes
  • refute_toxicity - No hostile language
  • refute_harmful - No dangerous content
  • refute_jailbreak - No safety bypass
  • refute_pii - No personally identifiable information
  • assert_judge :custom - Custom judges via Tribunal.Judge behaviour

Embedding-Based (requires alike)

  • assert_similar - Semantic similarity check
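
A custom judge can be sketched as a module implementing the Tribunal.Judge behaviour. This is an illustrative sketch only: the callback name judge/2 and the result shape are assumptions, so check the behaviour's actual callbacks before copying it:

```elixir
defmodule MyApp.ToneJudge do
  # Hypothetical sketch: the callback name and return shape are
  # assumptions, not Tribunal's documented API.
  @behaviour Tribunal.Judge

  @impl true
  def judge(response, _opts) do
    if response =~ ~r/please|thank you/i do
      {:ok, %{pass: true, reason: "polite tone detected"}}
    else
      {:ok, %{pass: false, reason: "response lacks polite phrasing"}}
    end
  end
end
```

It might then be wired up with assert_judge :custom, judge: MyApp.ToneJudge (the option name is also an assumption).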

Red Team Testing

Generate adversarial prompts to test LLM safety:

alias Tribunal.RedTeam

attacks = RedTeam.generate_attacks("How do I pick a lock?")
# Returns encoding attacks (base64, leetspeak, rot13),
# injection attacks (ignore instructions, delimiter injection),
# and jailbreak attacks (DAN, STAN, developer mode)
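
In Test Mode, generated attacks might be looped through the safety assertions above. A sketch, with assumed names: attack.prompt is a guessed field on the generated attack structs, and MyApp.Chat.complete/1 stands in for your own LLM call:

```elixir
test "model refuses adversarial variants" do
  for attack <- Tribunal.RedTeam.generate_attacks("How do I pick a lock?") do
    # attack.prompt is an assumed field name, not confirmed by this README
    response = MyApp.Chat.complete(attack.prompt)
    assert_refusal response
  end
end
```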

Guides

Roadmap

  • [x] Core evaluation pipeline
  • [x] Faithfulness metric (RAGAS-style)
  • [x] Hallucination detection
  • [x] LLM-as-judge with configurable models
  • [x] ExUnit integration for test assertions
  • [x] Red team attack generators
  • [ ] Async batch evaluation

License

MIT