mix tribunal.eval (Tribunal v1.3.6)


Runs LLM evaluations from dataset files.

Usage

mix tribunal.eval [options] [files...]

Options

  • --format - Output format: console (default), text, json, html, github, junit
  • --output - Write results to file instead of stdout
  • --provider - Module.function to call for each test case (e.g. MyApp.Agent.query)
  • --threshold - Minimum pass rate (0.0-1.0) required for a successful exit; below it the task exits non-zero. Default: none (always exit 0)
  • --strict - Fail on any failure, equivalent to --threshold 1.0 (for CI gates)
  • --concurrency - Number of test cases to run in parallel. Default: 1 (sequential)
  • --limit - Maximum number of test cases to evaluate
  • --offset - Number of test cases to skip before evaluating. Default: 0
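In CI these flags are typically combined with --format github. A minimal GitHub Actions step might look like the following; the step name and surrounding workflow setup are illustrative assumptions, not part of Tribunal:

```yaml
# Fails the job if fewer than 80% of cases pass,
# and emits GitHub Actions annotations for failures.
- name: Run LLM evals
  run: mix tribunal.eval --provider MyApp.Agent.query --format github --threshold 0.8
```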

Provider Function

The provider function receives a Tribunal.TestCase struct and should return the LLM output as a string. The test case includes:

  • input - The query/prompt
  • context - Optional context for RAG-style queries
  • expected_output - Optional expected answer
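Dataset files carry these same fields. As a rough sketch only (the exact on-disk schema is an assumption inferred from the fields above, not documented here), a JSON dataset could look like:

```json
[
  {
    "input": "What is the capital of France?",
    "context": "France is a country in Western Europe. Its capital is Paris.",
    "expected_output": "Paris"
  },
  {
    "input": "Summarize the refund policy in one sentence."
  }
]
```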

Example provider:

def query(%Tribunal.TestCase{input: input, context: context}) do
  # Call your LLM here
  MyApp.LLM.generate(input, context: context)
end
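Since context is optional, providers often need to handle plain (non-RAG) cases where it is nil. A minimal sketch, assuming the same hypothetical MyApp.LLM.generate/2 as above:

```elixir
def query(%Tribunal.TestCase{input: input, context: context}) do
  # context is optional on the struct; only pass it for RAG-style cases
  opts = if context, do: [context: context], else: []
  MyApp.LLM.generate(input, opts)
end
```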

Examples

# Run all evals in default location
mix tribunal.eval

# Run specific dataset
mix tribunal.eval test/evals/datasets/questions.json

# Run with a provider to generate outputs
mix tribunal.eval --provider MyApp.Agent.query

# Output JSON for CI
mix tribunal.eval --format json --output results.json

# GitHub Actions annotations
mix tribunal.eval --format github

# Default: always exit 0 (for baseline tracking)
mix tribunal.eval

# Fail if pass rate < 80%
mix tribunal.eval --threshold 0.8

# Strict mode: fail on any failure (for CI gates)
mix tribunal.eval --strict

# Run 5 test cases in parallel
mix tribunal.eval --concurrency 5

# Evaluate only the first 50 cases
mix tribunal.eval --limit 50

# Skip 30 cases, then evaluate the next 50
mix tribunal.eval --offset 30 --limit 50