# `mix tribunal.eval`
[🔗](https://github.com/georgeguimaraes/tribunal/blob/v1.3.6/lib/mix/tasks/tribunal.ex#L1)

Runs LLM evaluations from dataset files.

## Usage

    mix tribunal.eval [options] [files...]

## Options

  * `--format` - Output format: `console` (default), `text`, `json`, `html`, `github`, or `junit`
  * `--output` - Write results to a file instead of stdout
  * `--provider` - Module.function to call for each test case (e.g. `MyApp.Agent.query`)
  * `--threshold` - Minimum pass rate (0.0-1.0) required. Default: none (always exit 0)
  * `--strict` - Fail on any failure; equivalent to `--threshold 1.0` (for CI gates)
  * `--concurrency` - Number of test cases to run in parallel. Default: 1 (sequential)
  * `--limit` - Maximum number of test cases to evaluate
  * `--offset` - Number of test cases to skip before evaluating. Default: 0

## Provider Function

The provider function receives a `Tribunal.TestCase` struct and should return
the LLM output as a string. The test case includes:

  * `input` - The query/prompt
  * `context` - Optional context for RAG-style queries
  * `expected_output` - Optional expected answer
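
As a point of reference, a dataset entry carrying these fields might look like the sketch below. The exact file schema is an assumption here; consult your own dataset files for the authoritative format.

    [
      {
        "input": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "expected_output": "Paris"
      }
    ]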

Example provider:

    def query(%Tribunal.TestCase{input: input, context: context}) do
      # Call your LLM here
      MyApp.LLM.generate(input, context: context)
    end
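
Since `context` is optional, a provider may want to branch on its presence. A minimal sketch, reusing the hypothetical `MyApp.LLM.generate` from above:

    # Clause for test cases without context
    def query(%Tribunal.TestCase{input: input, context: nil}) do
      MyApp.LLM.generate(input)
    end

    # Clause for RAG-style test cases that carry context
    def query(%Tribunal.TestCase{input: input, context: context}) do
      MyApp.LLM.generate(input, context: context)
    end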

## Examples

    # Run all evals in default location
    mix tribunal.eval

    # Run specific dataset
    mix tribunal.eval test/evals/datasets/questions.json

    # Run with a provider to generate outputs
    mix tribunal.eval --provider MyApp.Agent.query

    # Output JSON for CI
    mix tribunal.eval --format json --output results.json

    # GitHub Actions annotations
    mix tribunal.eval --format github

    # Default: always exit 0 (for baseline tracking)
    mix tribunal.eval

    # Fail if pass rate < 80%
    mix tribunal.eval --threshold 0.8

    # Strict mode: fail on any failure (for CI gates)
    mix tribunal.eval --strict

    # Run 5 test cases in parallel
    mix tribunal.eval --concurrency 5

    # Evaluate only the first 50 cases
    mix tribunal.eval --limit 50

    # Skip 30 cases, then evaluate the next 50
    mix tribunal.eval --offset 30 --limit 50
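
For CI, the `--strict` and `--format github` flags above can be combined in a workflow step. An illustrative GitHub Actions fragment (step name and file path are placeholders):

    # .github/workflows/evals.yml (illustrative fragment)
    - name: Run LLM evals
      run: mix tribunal.eval --strict --format github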

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
