Puck.Eval.Inspector (Puck v0.2.11)

Debugging tools for trajectories and evaluation results.

When an eval fails, developers need human-readable output to determine whether the eval is broken or the agent is. Anthropic emphasizes reading transcripts to ensure evals "feel fair".

Example

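# Assumes Collector and Result live under Puck.Eval, alongside Inspector.
alias Puck.Eval.{Collector, Inspector, Result}

{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John")
end)

Inspector.print_trajectory(trajectory)
# => Prints formatted trajectory to console

# `graders` is a list of graders defined elsewhere.
result = Result.from_graders(output, trajectory, graders)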

if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

Functions

format_failures(result)

Formats grader failures into a readable string.

Returns a string listing every failed grader and its reason. Suitable as an ExUnit assertion message or for logging.

Example

result = Result.from_graders(output, trajectory, graders)

if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

# Or in tests:
assert result.passed?, Inspector.format_failures(result)

Output Format

2 failures:
  - Output does not contain "john@example.com"
  - 7 steps exceeds max of 5
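
The layout above can be reproduced in a few lines of Elixir. The following is a minimal sketch, not Puck's implementation: the module name `FailureFormatSketch` and the idea that failures arrive as a plain list of reason strings are hypothetical, and the singular "failure" wording for a single reason is a guess, since only the two-failure case is shown here.

```elixir
defmodule FailureFormatSketch do
  # Hypothetical helper: takes a list of failure reason strings and
  # renders them in the "N failures:" layout shown above.
  def format(reasons) when is_list(reasons) do
    count = length(reasons)
    # Assumption: pluralize by count; the docs only show the plural form.
    noun = if count == 1, do: "failure", else: "failures"
    header = "#{count} #{noun}:"
    lines = Enum.map(reasons, fn reason -> "  - " <> reason end)
    Enum.join([header | lines], "\n")
  end
end

reasons = [
  ~s(Output does not contain "john@example.com"),
  "7 steps exceeds max of 5"
]

IO.puts(FailureFormatSketch.format(reasons))
```

Because the result is a single string, it can be passed directly as the message argument of an ExUnit `assert`, as in the test example above.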

print_trajectory(trajectory, opts \\ [])

Prints a human-readable trajectory to the console.

Options

:device - IO device to print to (default: :stdio)
:max_length - Maximum number of characters displayed for output (default: 200)
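
To illustrate how options like these typically behave, here is a self-contained sketch. It is not Puck's code: `PrintOptionsSketch` is a hypothetical stand-in, and the truncation style (cutting at `:max_length` and appending "...") is an assumption. It does show one practical consequence of a `:device` option, namely that any IO device, such as a `StringIO` process, can receive the output instead of the console.

```elixir
defmodule PrintOptionsSketch do
  # Hypothetical stand-in for a printing function that accepts the two
  # options documented above, with the same defaults.
  def print(text, opts \\ []) do
    device = Keyword.get(opts, :device, :stdio)
    max_length = Keyword.get(opts, :max_length, 200)
    IO.puts(device, truncate(text, max_length))
  end

  # Assumption: text longer than max_length is cut and marked with "...".
  defp truncate(text, max_length) do
    if String.length(text) > max_length do
      String.slice(text, 0, max_length) <> "..."
    else
      text
    end
  end
end

# Capture output in memory by passing a StringIO process as the device.
{:ok, device} = StringIO.open("")
PrintOptionsSketch.print("Find John's email", device: device, max_length: 9)
{_input, captured} = StringIO.contents(device)
IO.inspect(captured)
```

Routing output to an in-memory device this way is handy in tests, where the printed text can be asserted on rather than written to the console.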

Example

Inspector.print_trajectory(trajectory)
# Trajectory (3 steps, 425 tokens, 1250ms)
#
# Step 1:
#   Input: "Find John's email"
#   Output: %LookupContact{name: "John"}
#   Tokens: 150 in, 30 out (180 total)
#   Duration: 450ms
# ...
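
The header line in the example above aggregates per-step values. A rough sketch of that arithmetic follows, under stated assumptions: the step map shape (`input_tokens`, `output_tokens`, `duration_ms`) and the module name `TrajectoryHeaderSketch` are hypothetical, and only the first step's numbers come from the example; the other two steps are invented so that the totals match the header.

```elixir
defmodule TrajectoryHeaderSketch do
  # Hypothetical step shape: a map with token counts and a duration.
  def header(steps) do
    tokens =
      steps
      |> Enum.map(fn step -> step.input_tokens + step.output_tokens end)
      |> Enum.sum()

    duration =
      steps
      |> Enum.map(fn step -> step.duration_ms end)
      |> Enum.sum()

    "Trajectory (#{length(steps)} steps, #{tokens} tokens, #{duration}ms)"
  end
end

steps = [
  # Step 1 mirrors the example: 150 in, 30 out, 450ms.
  %{input_tokens: 150, output_tokens: 30, duration_ms: 450},
  # Steps 2 and 3 are made-up values chosen to sum to the header totals.
  %{input_tokens: 120, output_tokens: 25, duration_ms: 500},
  %{input_tokens: 80, output_tokens: 20, duration_ms: 300}
]

IO.puts(TrajectoryHeaderSketch.header(steps))
```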