Multi-trial execution for measuring agent reliability and consistency.
Executes an agent function multiple times and computes:
pass@k: At least one success in k trials (reliability)pass^k: All k trials succeed (consistency)pass_rate: Fraction of successful trials
Example
alias Puck.Eval.{Trial, Graders}
Trial.run_trials(
fn -> MyAgent.run("Find contact") end,
[Graders.contains("john@example.com")],
k: 5
)
# => %Puck.Eval.Trial.Summary{
# k: 5,
# results: [...],
# pass_at_k: true,
# pass_carrot_k: false,
# pass_rate: 0.6
# }Agent with 75% per-trial success:
- pass@3 ≈ 98% (1 - 0.25³)
- pass^3 ≈ 42% (0.75³)
Process Isolation
Each trial runs in a separate process via Task.async_stream/3, providing
clean state per execution. BEAM's process isolation replaces Docker containers
for clean environments.
Summary
Functions
Runs agent function k times and grades each execution.
Functions
Runs agent function k times and grades each execution.
Options
:k- Number of trials (default: 3):concurrency- Max concurrent trials (default:System.schedulers_online()):timeout- Timeout per trial in milliseconds (default: 30000)
Returns
A Puck.Eval.Trial.Summary struct with:
:k- Number of trials run:results- List ofPuck.Eval.Resultstructs:pass_at_k- Boolean, true if ≥1 trial passed:pass_carrot_k- Boolean, true if all trials passed:pass_rate- Float between 0.0 and 1.0
Example
Trial.run_trials(
fn -> MyAgent.run("task") end,
[Graders.contains("success")],
k: 5,
concurrency: 3
)