Evaluation primitives for testing agents built on Puck.
Puck.Eval provides minimal building blocks for evaluating LLM agents. These primitives can be composed however you need - with ExUnit, custom runners, or production monitoring.
Core Primitives
Puck.Eval.Trajectory - Captures what happened during execution
Puck.Eval.Step - A single LLM call within a trajectory
Puck.Eval.Collector - Captures trajectory via telemetry
Puck.Eval.Grader - Behaviour for scoring
Puck.Eval.Graders - Built-in graders
Puck.Eval.Result - Aggregates grader results
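For orientation, the grading contract can be illustrated without Puck at all: graders ultimately yield :pass or {:fail, reason} (see run_grader/3 below). A hypothetical hand-rolled check, written as a plain two-arity function of output and trajectory (whether Puck accepts bare functions as graders is an assumption here):

```elixir
# Hypothetical grader-style check: passes when the output is a
# non-empty string, fails with a reason otherwise. Mirrors the
# :pass / {:fail, reason} shape documented for run_grader/3.
non_empty_output = fn output, _trajectory ->
  if is_binary(output) and String.trim(output) != "" do
    :pass
  else
    {:fail, "expected a non-empty string output"}
  end
end

non_empty_output.("john@example.com", nil)
# => :pass
```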
Helpers
Puck.Eval.Trial - Multi-trial execution with pass@k metrics
Puck.Eval.Graders.LLM - LLM-as-judge for subjective criteria
Puck.Eval.Inspector - Debug tools for trajectories and failures
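The pass@k-style metrics mentioned above reduce to simple aggregation over per-trial outcomes. A standalone sketch in plain Elixir (not Puck code; Trial.run_trials computes these for you):

```elixir
# Per-trial outcomes: true means all graders passed for that trial.
trials = [true, true, false, true, false]

pass_at_k = Enum.any?(trials)                          # at least one trial passed
pass_all_k = Enum.all?(trials)                         # every trial passed
pass_rate = Enum.count(trials, & &1) / length(trials)  # fraction that passed

{pass_at_k, pass_all_k, pass_rate}
# => {true, false, 0.6}
```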
Quick Example
alias Puck.Eval.{Collector, Graders, Result}
# Capture trajectory from your agent
{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John's email")
end)
# Apply graders
result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5),
  Graders.output_produced(LookupContact)
])
# Check result
result.passed? # => true or false

Multi-Trial Evaluation
alias Puck.Eval.Trial
# Run 5 trials, compute reliability metrics
results = Trial.run_trials(
  fn -> MyAgent.run("Find contact") end,
  [Graders.contains("john@example.com")],
  k: 5
)
results.pass_at_k # => true (≥1 success)
results.pass_hat_k # => false (not all succeeded)
results.pass_rate # => 0.6 (60% success rate)

LLM-as-Judge
alias Puck.Eval.Graders.LLM
judge_client = Puck.Client.new(
  {Puck.Backends.ReqLLM, "anthropic:claude-haiku-4-5"}
)
result = Result.from_graders(output, trajectory, [
  LLM.rubric(judge_client, """
  - Response is polite
  - Response is helpful
  - Response is concise
  """)
])

Debugging
alias Puck.Eval.Inspector
# Print human-readable trajectory
Inspector.print_trajectory(trajectory)
# Format grader failures
if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

In ExUnit
test "agent finds contact" do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run("Find John's email")
  end)

  assert trajectory.total_steps <= 3
  assert output =~ "john@example.com"
end

In Production Monitoring
def monitor_agent_call(input) do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run(input)
  end)

  :telemetry.execute([:my_app, :agent, :metrics], %{
    tokens: trajectory.total_tokens,
    steps: trajectory.total_steps,
    duration_ms: trajectory.total_duration_ms
  })

  output
end
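The measurements emitted above can be consumed with the standard :telemetry API. A sketch, assuming the [:my_app, :agent, :metrics] event name from the example; the handler id and log format are illustrative:

```elixir
# Attach a handler for the event emitted by monitor_agent_call/1.
# Requires the :telemetry package ({:telemetry, "~> 1.0"} in deps).
:ok =
  :telemetry.attach(
    "agent-metrics-logger",  # handler id, must be unique (illustrative name)
    [:my_app, :agent, :metrics],
    fn _event, measurements, _metadata, _config ->
      IO.puts(
        "agent run: #{measurements.steps} steps, " <>
          "#{measurements.tokens} tokens, #{measurements.duration_ms}ms"
      )
    end,
    nil
  )
```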
Functions
collect(fun)

Collects trajectory from the provided function.
Convenience delegate to Puck.Eval.Collector.collect/1.
Example
{output, trajectory} = Puck.Eval.collect(fn ->
  MyAgent.run("Find John's email")
end)
collect(fun, opts)

Collects trajectory with options.
Convenience delegate to Puck.Eval.Collector.collect/2.
Options
:timeout - Time to wait for telemetry events (default: 100ms)
empty_trajectory()

Creates an empty trajectory.
Example
trajectory = Puck.Eval.empty_trajectory()
grade(output, trajectory, graders)

Creates a Result by applying graders to output and trajectory.
Convenience delegate to Puck.Eval.Result.from_graders/3.
Example
result = Puck.Eval.grade(output, trajectory, [
  Graders.contains("hello"),
  Graders.max_steps(3)
])
run_grader(grader, output, trajectory)

Runs a single grader on output and trajectory.
Convenience delegate to Puck.Eval.Grader.run/3.
Example
Puck.Eval.run_grader(Graders.contains("hello"), output, trajectory)
# => :pass or {:fail, reason}
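Since the return shape is just :pass or {:fail, reason}, callers can branch with an ordinary case. A sketch using a stubbed value in place of a real run_grader/3 call:

```elixir
# Stubbed grader result (stands in for Puck.Eval.run_grader/3's return).
grader_result = {:fail, ~s(output does not contain "hello")}

message =
  case grader_result do
    :pass -> "grader passed"
    {:fail, reason} -> "grader failed: " <> reason
  end

message
# => "grader failed: output does not contain \"hello\""
```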
trajectory(steps)

Creates a trajectory from a list of steps.
Example
steps = [
  Puck.Eval.Step.new(input: "hi", output: "hello", tokens: %{total: 10})
]
trajectory = Puck.Eval.trajectory(steps)
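As a closing note, the trajectory totals used throughout these docs (total_steps, total_tokens) amount to simple aggregation over steps. A standalone sketch with plain maps standing in for Puck's Step structs (field layout here is an assumption):

```elixir
# Plain maps standing in for steps; only the :tokens field is used.
steps = [
  %{input: "hi", output: "hello", tokens: %{total: 10}},
  %{input: "next", output: "done", tokens: %{total: 25}}
]

total_steps = length(steps)
total_tokens = steps |> Enum.map(& &1.tokens.total) |> Enum.sum()

{total_steps, total_tokens}
# => {2, 35}
```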