# `Puck.Eval.Trial`
[🔗](https://github.com/bradleygolden/puck/blob/v0.2.23/lib/puck/eval/trial.ex#L1)

Multi-trial execution for measuring agent reliability and consistency.

Executes an agent function multiple times and computes:
- `pass@k`: At least one success in k trials (reliability)
- `pass^k`: All k trials succeed (consistency)
- `pass_rate`: Fraction of successful trials

## Example

    alias Puck.Eval.{Trial, Graders}

    Trial.run_trials(
      fn -> MyAgent.run("Find contact") end,
      [Graders.contains("john@example.com")],
      k: 5
    )
    # => %Puck.Eval.Trial.Summary{
    #   k: 5,
    #   results: [...],
    #   pass_at_k: true,
    #   pass_carrot_k: false,
    #   pass_rate: 0.6
    # }

For an agent with a 75% per-trial success rate:
- pass@3 ≈ 98% (1 - 0.25³)
- pass^3 ≈ 42% (0.75³)

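These figures follow directly from treating trials as independent. As a quick check, they can be reproduced in plain Elixir (a hedged sketch assuming independent trials with a fixed per-trial success rate; this is standalone arithmetic, not a Puck API call):

```elixir
# Probability identities for k independent trials with success rate p.
p = 0.75
k = 3

pass_at_k = 1.0 - :math.pow(1.0 - p, k)  # at least one success in k trials
pass_hat_k = :math.pow(p, k)             # all k trials succeed

IO.puts("pass@#{k} = #{Float.round(pass_at_k, 3)}")  # 0.984
IO.puts("pass^#{k} = #{Float.round(pass_hat_k, 3)}") # 0.422
```
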
## Process Isolation

Each trial runs in a separate process via `Task.async_stream/3`, so each
execution starts with clean state. The BEAM's process isolation fills the
role Docker containers often play in providing clean per-run environments.
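
The wiring can be sketched in a few lines (an illustrative sketch, not Puck's actual implementation; `agent_fun` is a hypothetical stand-in for the function you would pass to `run_trials`, and the option values mirror the defaults documented below):

```elixir
# Each element of 1..k is handled in its own BEAM process, so no state
# leaks between trials.
agent_fun = fn -> {:ok, "ran in #{inspect(self())}"} end
k = 3

results =
  1..k
  |> Task.async_stream(fn _trial -> agent_fun.() end,
    max_concurrency: System.schedulers_online(),
    timeout: 30_000
  )
  |> Enum.map(fn {:ok, result} -> result end)

Enum.each(results, &IO.inspect/1)
```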

# `run_trials`

Runs the agent function k times and grades each execution.

## Options

  * `:k` - Number of trials (default: 3)
  * `:concurrency` - Max concurrent trials (default: `System.schedulers_online()`)
  * `:timeout` - Timeout per trial in milliseconds (default: `30_000`)

## Returns

A `Puck.Eval.Trial.Summary` struct with:
  * `:k` - Number of trials run
  * `:results` - List of `Puck.Eval.Result` structs
  * `:pass_at_k` - Boolean, true if ≥1 trial passed
  * `:pass_carrot_k` - Boolean, true if all trials passed
  * `:pass_rate` - Float between 0.0 and 1.0
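
The metric fields can be derived from per-trial pass/fail outcomes as follows (an illustrative sketch using a plain map with the field names documented above; this is not Puck's source, which grades `Puck.Eval.Result` structs rather than raw booleans):

```elixir
# Derive the three summary metrics from a list of pass/fail booleans.
passed? = [true, false, true, true, false]
k = length(passed?)
pass_count = Enum.count(passed?, & &1)

summary = %{
  k: k,
  pass_at_k: pass_count >= 1,      # at least one trial passed
  pass_carrot_k: pass_count == k,  # every trial passed
  pass_rate: pass_count / k        # fraction of passing trials
}

IO.inspect(summary)
# => %{k: 5, pass_at_k: true, pass_carrot_k: false, pass_rate: 0.6}
```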

## Example

    Trial.run_trials(
      fn -> MyAgent.run("task") end,
      [Graders.contains("success")],
      k: 5,
      concurrency: 3
    )

---

*Consult [api-reference.md](api-reference.md) for complete listing*
