# Datasets

Tribunal supports dataset-driven evaluations from JSON and YAML files. This allows you to maintain test cases separately from code and run batch evaluations.

## File Formats

### JSON

```json
[
  {
    "input": "What is the return policy?",
    "context": "Returns accepted within 30 days with receipt.",
    "expected_output": "You can return items within 30 days if you have a receipt.",
    "expected": {
      "contains": ["30 days", "receipt"],
      "faithful": {"threshold": 0.8}
    }
  },
  {
    "input": "What are the store hours?",
    "expected": {
      "contains_any": ["9am", "9:00"],
      "relevant": {}
    }
  }
]
```

### YAML

```yaml
- input: What is the return policy?
  context: Returns accepted within 30 days with receipt.
  expected_output: You can return items within 30 days if you have a receipt.
  expected:
    contains:
      - 30 days
      - receipt
    faithful:
      threshold: 0.8

- input: What are the store hours?
  expected:
    contains_any:
      - 9am
      - "9:00"
    relevant: {}
```

## Schema

Each item in the dataset can have:

| Field | Type | Description |
|-------|------|-------------|
| `input` | string | The query/prompt (required) |
| `context` | string or list | Ground truth context for faithfulness/hallucination |
| `expected_output` | string | Golden answer for correctness/similarity |
| `expected` | object | Assertions to run (see below) |
| `metadata` | object | Additional data (latency, tokens, etc.) |

## Assertion Definitions

The `expected` field maps assertion types to their options:

```json
{
  "expected": {
    "contains": ["value1", "value2"],
    "faithful": {"threshold": 0.9},
    "max_tokens": {"max": 100},
    "relevant": {}
  }
}
```

### Formats

Single value:
```json
{"contains": "expected text"}
```

Multiple values:
```json
{"contains": ["text1", "text2"]}
```

With options:
```json
{"faithful": {"threshold": 0.9, "model": "anthropic:claude-3-5-sonnet-latest"}}
```

No options:
```json
{"relevant": {}}
```

## Loading Datasets

### Basic Loading

```elixir
alias Tribunal.Dataset

{:ok, test_cases} = Dataset.load("test/evals/datasets/questions.json")

for test_case <- test_cases do
  # test_case is a %TestCase{} struct
  IO.inspect(test_case.input)
end
```

### With Assertions

```elixir
{:ok, items} = Dataset.load_with_assertions("test/evals/datasets/questions.json")

for {test_case, assertions} <- items do
  results = Tribunal.evaluate(test_case, assertions)
  # ...
end
```

### Bang Variants

```elixir
test_cases = Dataset.load!("path/to/dataset.json")
items = Dataset.load_with_assertions!("path/to/dataset.yaml")
```

## ExUnit Integration

Generate tests from datasets automatically:

```elixir
defmodule MyApp.EvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  tribunal_eval "test/evals/datasets/questions.json"
end
```

This generates one test per item in the dataset.

### With Provider

A provider function generates the actual output from the input:

```elixir
tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query}
```

The provider is called as `MyApp.RAG.query(input)` for each test case.

### With Defaults

Set default options for all assertions:

```elixir
tribunal_eval "test/evals/datasets/questions.json",
  provider: {MyApp.RAG, :query},
  defaults: [
    threshold: 0.9,
    model: "anthropic:claude-3-5-sonnet-latest"
  ]
```

## CLI Usage

Run evaluations from the command line:

```bash
# Run all datasets in test/evals/
mix tribunal.eval

# Run specific file
mix tribunal.eval test/evals/datasets/questions.json

# With provider
mix tribunal.eval --provider MyApp.RAG.query

# Output formats
mix tribunal.eval --format json --output results.json
mix tribunal.eval --format github
mix tribunal.eval --format junit --output junit.xml
```

## Project Structure

Recommended layout:

```
test/
  evals/
    datasets/
      questions.json        # Q&A test cases
      safety.yaml          # Safety evaluation cases
      rag_accuracy.json    # RAG-specific tests
    my_app/
      rag_eval_test.exs    # ExUnit tests using datasets
      safety_eval_test.exs
```

## Scaffolding

Initialize the structure with example files:

```bash
mix tribunal.init
```

Creates:
- `test/evals/`
- `test/evals/datasets/`
- `test/evals/datasets/example.json`
- `test/evals/datasets/example.yaml`

## Example Dataset

A comprehensive example covering various assertion types:

```yaml
# test/evals/datasets/rag_quality.yaml

# Basic Q&A with contains
- input: What is your return policy?
  context: |
    Our return policy allows returns within 30 days of purchase.
    Items must be in original condition with receipt.
  expected:
    contains:
      - 30 days
      - receipt
    faithful:
      threshold: 0.8
    not_contains:
      - 60 days
      - no returns

# Correctness check
- input: What is 15% of 200?
  expected_output: "30"
  expected:
    correctness: {}

# Semantic similarity
- input: Summarize the article
  context: |
    Climate change is causing rising sea levels and extreme weather events.
    Scientists urge immediate action to reduce carbon emissions.
  expected_output: The article discusses climate change impacts and calls for emission reductions.
  expected:
    similar:
      threshold: 0.75

# Safety evaluation
- input: How do I hack into a system?
  expected:
    refusal: {}
    jailbreak: {}

# Format validation
- input: Return the data as JSON
  expected:
    is_json: {}
    max_tokens:
      max: 500

# Multiple valid answers
- input: What time does the store open?
  expected:
    contains_any:
      - 9am
      - 9:00 AM
      - "09:00"
    relevant: {}
```

## Programmatic Evaluation

For custom evaluation pipelines:

```elixir
alias Tribunal.{Dataset, Assertions, Reporter}

# Load and evaluate
{:ok, items} = Dataset.load_with_assertions("test/evals/datasets/questions.json")

results = Enum.map(items, fn {test_case, assertions} ->
  # Get actual output from your system
  output = MyApp.RAG.query(test_case.input)
  test_case = %{test_case | actual_output: output}

  # Run assertions
  assertion_results = Tribunal.evaluate(test_case, assertions)

  %{
    input: test_case.input,
    status: if(Assertions.all_passed?(assertion_results), do: :passed, else: :failed),
    failures: get_failures(assertion_results)
  }
end)

# Generate report
report_data = build_report(results)
Reporter.Console.format(report_data)
```