Puck.Eval (Puck v0.2.11)


Evaluation primitives for testing agents built on Puck.

Puck.Eval provides minimal building blocks for evaluating LLM agents. These primitives can be composed however you need: in ExUnit tests, custom runners, or production monitoring.

Core Primitives

Helpers

Quick Example

alias Puck.Eval.{Collector, Graders, Result}

# Capture trajectory from your agent
{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John's email")
end)

# Apply graders
result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5),
  Graders.output_produced(LookupContact)
])

# Check result
result.passed?  # => true or false

Multi-Trial Evaluation

alias Puck.Eval.Trial

# Run 5 trials, compute reliability metrics
results = Trial.run_trials(
  fn -> MyAgent.run("Find contact") end,
  [Graders.contains("john@example.com")],
  k: 5
)

results.pass_at_k      # => true (at least 1 trial succeeded)
results.pass_carrot_k  # => false (pass^k: not all trials succeeded)
results.pass_rate      # => 0.6 (60% of trials succeeded)
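To make the three metrics concrete, here is a minimal sketch in plain Elixir of how they relate to a list of per-trial pass/fail outcomes. `TrialMetricsSketch` is a hypothetical module written for illustration. It is not part of Puck, and Puck's actual implementation may differ. It only follows the meanings given in the comments above (at least one success, all trials succeeded, fraction of successes).

```elixir
defmodule TrialMetricsSketch do
  @moduledoc false
  # Hypothetical helper, NOT part of Puck: summarizes a non-empty list of
  # per-trial booleans (true = every grader passed in that trial).

  def summarize([_ | _] = outcomes) do
    k = length(outcomes)
    passes = Enum.count(outcomes, & &1)

    %{
      # true when at least one trial passed
      pass_at_k: passes >= 1,
      # true only when every one of the k trials passed
      pass_carrot_k: passes == k,
      # fraction of trials that passed, as a float
      pass_rate: passes / k
    }
  end
end

# Five trials, three of which passed
TrialMetricsSketch.summarize([true, false, true, true, false])
# => %{pass_at_k: true, pass_carrot_k: false, pass_rate: 0.6}
```

The two boolean metrics answer different questions: `pass_at_k` asks whether the agent can succeed at all, while `pass_carrot_k` asks whether it succeeds every time.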

LLM-as-Judge

alias Puck.Eval.Graders.LLM

judge_client = Puck.Client.new(
  {Puck.Backends.ReqLLM, "anthropic:claude-haiku-4-5"}
)

result = Result.from_graders(output, trajectory, [
  LLM.rubric(judge_client, """
  - Response is polite
  - Response is helpful
  - Response is concise
  """)
])

Debugging

alias Puck.Eval.Inspector

# Print human-readable trajectory
Inspector.print_trajectory(trajectory)

# Format grader failures
if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

In ExUnit

test "agent finds contact" do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run("Find John's email")
  end)

  assert trajectory.total_steps <= 3
  assert output =~ "john@example.com"
end

In Production Monitoring

def monitor_agent_call(input) do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run(input)
  end)

  :telemetry.execute([:my_app, :agent, :metrics], %{
    tokens: trajectory.total_tokens,
    steps: trajectory.total_steps,
    duration_ms: trajectory.total_duration_ms
  })

  output
end
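The event emitted above is only useful once something listens for it. The following is a minimal sketch of a handler that logs each measurement, assuming the `:telemetry` package is a dependency of your application. The module name `MyApp.AgentMetricsLogger` and the handler id are hypothetical. The event name and measurement keys are the ones used in `monitor_agent_call/1`.

```elixir
defmodule MyApp.AgentMetricsLogger do
  @moduledoc false
  # Hypothetical module for illustration; not part of Puck.
  require Logger

  # Must match the event name passed to :telemetry.execute/2
  @event [:my_app, :agent, :metrics]

  # Call once at application start, for example from Application.start/2.
  def attach do
    :telemetry.attach(
      "agent-metrics-logger",
      @event,
      # An external capture avoids telemetry's local-function warning
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Telemetry invokes handlers with the event name, measurements,
  # metadata, and the config given to attach/4.
  def handle_event(@event, measurements, _metadata, _config) do
    Logger.info(
      "agent call: steps=#{measurements.steps} " <>
        "tokens=#{measurements.tokens} " <>
        "duration_ms=#{measurements.duration_ms}"
    )
  end
end
```

In a real deployment the handler would more likely forward the measurements to a metrics reporter than to the log. The logging body is a placeholder.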

Summary

Functions

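  • collect(fun) - Collects trajectory from the provided function.

  • collect(fun, opts) - Collects trajectory with options.

  • empty_trajectory() - Creates an empty trajectory.

  • grade(output, trajectory, graders) - Creates a Result by applying graders to output and trajectory.

  • run_grader(grader, output, trajectory) - Runs a single grader on output and trajectory.

  • trajectory(steps) - Creates a trajectory from a list of steps.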

Functions

collect(fun)

Collects trajectory from the provided function.

Convenience delegate to Puck.Eval.Collector.collect/1.

Example

{output, trajectory} = Puck.Eval.collect(fn ->
  MyAgent.run("Find John's email")
end)

collect(fun, opts)

Collects trajectory with options.

Convenience delegate to Puck.Eval.Collector.collect/2.

Options

  • :timeout - Time to wait for telemetry events (default: 100ms)

empty_trajectory()

Creates an empty trajectory.

Example

trajectory = Puck.Eval.empty_trajectory()

grade(output, trajectory, graders)

Creates a Result by applying graders to output and trajectory.

Convenience delegate to Puck.Eval.Result.from_graders/3.

Example

result = Puck.Eval.grade(output, trajectory, [
  Graders.contains("hello"),
  Graders.max_steps(3)
])

run_grader(grader, output, trajectory)

Runs a single grader on output and trajectory.

Convenience delegate to Puck.Eval.Grader.run/3.

Example

Puck.Eval.run_grader(Graders.contains("hello"), output, trajectory)
# => :pass or {:fail, reason}

trajectory(steps)

Creates a trajectory from a list of steps.

Example

steps = [
  Puck.Eval.Step.new(input: "hi", output: "hello", tokens: %{total: 10})
]
trajectory = Puck.Eval.trajectory(steps)