Runs LLM evaluations from dataset files.
## Usage

```bash
mix tribunal.eval [options] [files...]
```

## Options

  * `--format` - Output format: console (default), text, json, html, github, junit
  * `--output` - Write results to file instead of stdout
  * `--provider` - Module.function to call for each test case (e.g. `MyApp.Agent.query`)
  * `--threshold` - Minimum pass rate (0.0-1.0) required. Default: none (always exit 0)
  * `--strict` - Fail on any failure, equivalent to `--threshold 1.0` (for CI gates)
  * `--concurrency` - Number of test cases to run in parallel. Default: 1 (sequential)
  * `--limit` - Maximum number of test cases to evaluate
  * `--offset` - Number of test cases to skip before evaluating. Default: 0
## Provider Function

The provider function receives a `Tribunal.TestCase` struct and should return
the LLM output as a string. The test case includes:

  * `input` - The query/prompt
  * `context` - Optional context for RAG-style queries
  * `expected_output` - Optional expected answer
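For illustration, a test case carrying these fields might look like the sketch below. The concrete values, and the exact shape of `context`, are assumptions about your dataset rather than anything Tribunal prescribes:

```elixir
# Illustrative values only; field names come from the list above, but the
# shape of :context depends on how your dataset is written.
%Tribunal.TestCase{
  input: "What is the capital of France?",
  context: ["France is a country in Western Europe. Its capital is Paris."],
  expected_output: "Paris"
}
```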
Example provider:
```elixir
def query(%Tribunal.TestCase{input: input, context: context}) do
  # Call your LLM here
  MyApp.LLM.generate(input, context: context)
end
```
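If your datasets mix RAG-style and plain cases, the provider can branch on a missing context. A minimal sketch, assuming a missing context arrives as `nil` and that `MyApp.LLM.generate/2` is your own application code rather than part of Tribunal:

```elixir
# Sketch: handle cases without context separately (assumes nil means "no context").
def query(%Tribunal.TestCase{input: input, context: nil}) do
  MyApp.LLM.generate(input, [])
end

def query(%Tribunal.TestCase{input: input, context: context}) do
  MyApp.LLM.generate(input, context: context)
end
```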
## Examples

```bash
# Run all evals in default location
mix tribunal.eval
# Run specific dataset
mix tribunal.eval test/evals/datasets/questions.json
# Run with a provider to generate outputs
mix tribunal.eval --provider MyApp.Agent.query
# Output JSON for CI
mix tribunal.eval --format json --output results.json
# GitHub Actions annotations
mix tribunal.eval --format github
# Default: always exit 0 (for baseline tracking)
mix tribunal.eval
# Fail if pass rate < 80%
mix tribunal.eval --threshold 0.8
# Strict mode: fail on any failure (for CI gates)
mix tribunal.eval --strict
# Run 5 test cases in parallel
mix tribunal.eval --concurrency 5
# Evaluate only the first 50 cases
mix tribunal.eval --limit 50
# Skip 30 cases, then evaluate the next 50
mix tribunal.eval --offset 30 --limit 50
```
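One way to wire the CI gate into a project is a Mix alias; the alias name `eval_gate` and the flag combination below are illustrative, not part of Tribunal:

```elixir
# In mix.exs (sketch): `mix eval_gate` runs the strict gate with GitHub
# Actions annotations. Reference this from project/0 via `aliases: aliases()`.
defp aliases do
  [
    eval_gate: "tribunal.eval --strict --format github"
  ]
end
```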