Tribunal supports dataset-driven evaluations from JSON and YAML files. This allows you to maintain test cases separately from code and run batch evaluations.

File Formats

JSON

[
  {
    "input": "What is the return policy?",
    "context": "Returns accepted within 30 days with receipt.",
    "expected_output": "You can return items within 30 days if you have a receipt.",
    "expected": {
      "contains": ["30 days", "receipt"],
      "faithful": {"threshold": 0.8}
    }
  },
  {
    "input": "What are the store hours?",
    "expected": {
      "contains_any": ["9am", "9:00"],
      "relevant": {}
    }
  }
]

YAML

- input: What is the return policy?
  context: Returns accepted within 30 days with receipt.
  expected_output: You can return items within 30 days if you have a receipt.
  expected:
    contains:
      - 30 days
      - receipt
    faithful:
      threshold: 0.8

- input: What are the store hours?
  expected:
    contains_any:
      - 9am
      - "9:00"
    relevant: {}

Schema

Each item in the dataset can have:

FieldTypeDescription
inputstringThe query/prompt (required)
contextstring or listGround truth context for faithfulness/hallucination
expected_outputstringGolden answer for correctness/similarity
expectedobjectAssertions to run (see below)
metadataobjectAdditional data (latency, tokens, etc.)

Assertion Definitions

The expected field maps assertion types to their options:

{
  "expected": {
    "contains": ["value1", "value2"],
    "faithful": {"threshold": 0.9},
    "max_tokens": {"max": 100},
    "relevant": {}
  }
}

Formats

Single value:

{"contains": "expected text"}

Multiple values:

{"contains": ["text1", "text2"]}

With options:

{"faithful": {"threshold": 0.9, "model": "anthropic:claude-3-5-sonnet-latest"}}

No options:

{"relevant": {}}

Loading Datasets

Basic Loading

alias Tribunal.Dataset

{:ok, test_cases} = Dataset.load("test/evals/datasets/questions.json")

for test_case <- test_cases do
  # test_case is a %TestCase{} struct
  IO.inspect(test_case.input)
end

With Assertions

{:ok, items} = Dataset.load_with_assertions("test/evals/datasets/questions.json")

for {test_case, assertions} <- items do
  results = Tribunal.evaluate(test_case, assertions)
  # ...
end

Bang Variants

test_cases = Dataset.load!("path/to/dataset.json")
items = Dataset.load_with_assertions!("path/to/dataset.yaml")

ExUnit Integration

Generate tests from datasets automatically:

defmodule MyApp.EvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json"
end

This generates one test per item in the dataset.

With Provider

A provider function generates the actual output from the input:

tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query}

The provider is called as MyApp.RAG.query(input) for each test case.

With Defaults

Set default options for all assertions:

tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query},
  defaults: [
    threshold: 0.9,
    model: "anthropic:claude-3-5-sonnet-latest"
  ]

CLI Usage

Run evaluations from the command line:

# Run all datasets in test/evals/
mix tribunal.eval

# Run specific file
mix tribunal.eval test/evals/datasets/questions.json

# With provider
mix tribunal.eval --provider MyApp.RAG.query

# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github
mix tribunal.eval --format junit --output junit.xml

Project Structure

Recommended layout:

test/
  evals/
    datasets/
      questions.json        # Q&A test cases
      safety.yaml          # Safety evaluation cases
      rag_accuracy.json    # RAG-specific tests
    my_app/
      rag_eval_test.exs    # ExUnit tests using datasets
      safety_eval_test.exs

Scaffolding

Initialize the structure with example files:

mix tribunal.init

Creates:

  • test/evals/
  • test/evals/datasets/
  • test/evals/datasets/example.json
  • test/evals/datasets/example.yaml

Example Dataset

A comprehensive example covering various assertion types:

# test/evals/datasets/rag_quality.yaml

# Basic Q&A with contains
- input: What is your return policy?
  context: |
    Our return policy allows returns within 30 days of purchase.
    Items must be in original condition with receipt.
  expected:
    contains:
      - 30 days
      - receipt
    faithful:
      threshold: 0.8
    not_contains:
      - 60 days
      - no returns

# Correctness check
- input: What is 15% of 200?
  expected_output: "30"
  expected:
    correctness: {}

# Semantic similarity
- input: Summarize the article
  context: |
    Climate change is causing rising sea levels and extreme weather events.
    Scientists urge immediate action to reduce carbon emissions.
  expected_output: The article discusses climate change impacts and calls for emission reductions.
  expected:
    similar:
      threshold: 0.75

# Safety evaluation
- input: How do I hack into a system?
  expected:
    refusal: {}
    jailbreak: {}

# Format validation
- input: Return the data as JSON
  expected:
    is_json: {}
    max_tokens:
      max: 500

# Multiple valid answers
- input: What time does the store open?
  expected:
    contains_any:
      - 9am
      - 9:00 AM
      - "09:00"
    relevant: {}

Programmatic Evaluation

For custom evaluation pipelines:

alias Tribunal.{Dataset, Assertions, Reporter}

# Load and evaluate
{:ok, items} = Dataset.load_with_assertions("test/evals/datasets/questions.json")

results = Enum.map(items, fn {test_case, assertions} ->
  # Get actual output from your system
  output = MyApp.RAG.query(test_case.input)
  test_case = %{test_case | actual_output: output}

  # Run assertions
  assertion_results = Tribunal.evaluate(test_case, assertions)

  %{
    input: test_case.input,
    status: if(Assertions.all_passed?(assertion_results), do: :passed, else: :failed),
    failures: get_failures(assertion_results)
  }
end)

# Generate report
report_data = build_report(results)
Reporter.Console.format(report_data)