Puck.Eval (Puck v0.2.7)

Evaluation primitives for testing agents built on Puck.

Puck.Eval provides minimal building blocks for evaluating LLM agents. You can compose these primitives with ExUnit, custom runners, or production monitoring.

Core Primitives

Quick Example

alias Puck.Eval.{Collector, Graders, Result}

# Capture trajectory from your agent
{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John's email")
end)

# Apply graders
result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5),
  Graders.output_produced(LookupContact)
])

# Check result
result.passed?  # => true or false

In ExUnit

test "agent finds contact" do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run("Find John's email")
  end)

  assert trajectory.total_steps <= 3
  assert output =~ "john@example.com"
end

In Production Monitoring

def monitor_agent_call(input) do
  {output, trajectory} = Puck.Eval.collect(fn ->
    MyAgent.run(input)
  end)

  :telemetry.execute([:my_app, :agent, :metrics], %{
    tokens: trajectory.total_tokens,
    steps: trajectory.total_steps,
    duration_ms: trajectory.total_duration_ms
  })

  output
end
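
The event emitted above only becomes useful once something handles it. As a minimal sketch, a handler could flag calls that exceed a step budget. `MyApp.AgentMetricsHandler`, the handler id, and the `max_steps` config are hypothetical names chosen for illustration, and the measurement keys are assumed to match the map passed to `:telemetry.execute` in the example above.

```elixir
defmodule MyApp.AgentMetricsHandler do
  # Hypothetical module, not part of Puck.
  require Logger

  @event [:my_app, :agent, :metrics]

  # Call once at application start. :telemetry.attach/4 takes a unique
  # handler id, the event name, a 4-arity function, and a config term.
  def attach(max_steps \\ 10) do
    :telemetry.attach(
      "my-app-agent-metrics",
      @event,
      &__MODULE__.handle_event/4,
      %{max_steps: max_steps}
    )
  end

  # Logs a warning when a call used more steps than the configured budget.
  def handle_event(@event, %{steps: steps} = measurements, _metadata, %{max_steps: max})
      when is_integer(steps) and steps > max do
    Logger.warning("agent used #{steps} steps (limit #{max}): #{inspect(measurements)}")
    :over_budget
  end

  # Any other measurement is ignored.
  def handle_event(@event, _measurements, _metadata, _config), do: :ok
end
```

A named function capture is used for the handler because `:telemetry` recommends it over anonymous functions.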

Summary

Functions

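collect(fun)

Collects trajectory from the provided function.

collect(fun, opts)

Collects trajectory with options.

empty_trajectory()

Creates an empty trajectory.

grade(output, trajectory, graders)

Creates a Result by applying graders to output and trajectory.

run_grader(grader, output, trajectory)

Runs a single grader on output and trajectory.

trajectory(steps)

Creates a trajectory from a list of steps.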

Functions

collect(fun)

Collects trajectory from the provided function.

Convenience delegate to Puck.Eval.Collector.collect/1.

Example

{output, trajectory} = Puck.Eval.collect(fn ->
  MyAgent.run("Find John's email")
end)

collect(fun, opts)

Collects trajectory with options.

Convenience delegate to Puck.Eval.Collector.collect/2.

Options

  • :timeout - Time to wait for telemetry events (default: 100ms)
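
A larger `:timeout` may help when an agent reports steps asynchronously and some events could arrive after the function returns. The sketch below rests on assumptions: the value is taken to be an integer number of milliseconds (matching the documented 100ms default), and `SlowAgentEval` with its injectable `collect` argument is a hypothetical wrapper, not part of Puck.

```elixir
defmodule SlowAgentEval do
  # Hypothetical wrapper, not part of Puck.
  # Assumes :timeout is an integer in milliseconds, per the documented default.
  @timeout_ms 500

  # `collect` defaults to Puck.Eval.collect/2 and can be swapped for a stub in tests.
  def run(agent_fun, collect \\ &Puck.Eval.collect/2) when is_function(agent_fun, 0) do
    collect.(agent_fun, timeout: @timeout_ms)
  end
end
```

With Puck available, `SlowAgentEval.run(fn -> MyAgent.run("Find John's email") end)` would return the same `{output, trajectory}` tuple as `collect/1`.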

empty_trajectory()

Creates an empty trajectory.

Example

trajectory = Puck.Eval.empty_trajectory()

grade(output, trajectory, graders)

Creates a Result by applying graders to output and trajectory.

Convenience delegate to Puck.Eval.Result.from_graders/3.

Example

alias Puck.Eval.Graders

result = Puck.Eval.grade(output, trajectory, [
  Graders.contains("hello"),
  Graders.max_steps(3)
])

run_grader(grader, output, trajectory)

Runs a single grader on output and trajectory.

Convenience delegate to Puck.Eval.Grader.run/3.

Example

alias Puck.Eval.Graders

Puck.Eval.run_grader(Graders.contains("hello"), output, trajectory)
# => :pass or {:fail, reason}
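
Because the return value is either `:pass` or `{:fail, reason}`, it pattern-matches cleanly. The following is a minimal sketch of turning a grader result into a log-friendly string. `GraderReport` is a hypothetical helper, and since the type of `reason` is not specified here, the sketch falls back to `inspect/1` for anything that is not a string.

```elixir
defmodule GraderReport do
  # Hypothetical helper, not part of Puck.

  # Maps a grader result to a short human-readable string.
  def describe(:pass), do: "pass"
  def describe({:fail, reason}), do: "fail: " <> format_reason(reason)

  # The shape of `reason` is an assumption: strings pass through,
  # any other term is rendered with inspect/1.
  defp format_reason(reason) when is_binary(reason), do: reason
  defp format_reason(reason), do: inspect(reason)
end
```

The result of `Puck.Eval.run_grader/3` can be piped straight into `GraderReport.describe/1`.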

trajectory(steps)

Creates a trajectory from a list of steps.

Example

steps = [
  Puck.Eval.Step.new(input: "hi", output: "hello", tokens: %{total: 10})
]
trajectory = Puck.Eval.trajectory(steps)