Puck.Eval.Inspector (Puck v0.2.11)

Debugging tools for trajectories and evaluation results.

When an eval fails, developers need human-readable output to tell whether the eval itself is broken or the agent is. Anthropic emphasizes reading transcripts to make sure evals "feel fair".

Example

{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John")
end)

Inspector.print_trajectory(trajectory)
# => Prints formatted trajectory to console

result = Result.from_graders(output, trajectory, graders)

if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

Summary

Functions

format_failures(result)

Formats grader failures into a readable string.

print_trajectory(trajectory)

Prints a human-readable trajectory to the console.

Functions

format_failures(result)

Formats grader failures into a readable string.

Returns a string listing all failed graders and their reasons. Suitable for ExUnit assertions or logging.

Example

result = Result.from_graders(output, trajectory, graders)

if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end

# Or in tests:
assert result.passed?, Inspector.format_failures(result)

Output Format

2 failures:
  - Output does not contain "john@example.com"
  - 7 steps exceeds max of 5
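
To make the layout above concrete, here is a minimal, self-contained sketch that produces the same shape from a plain list of reason strings. It is not Puck's implementation: the module `FailureFormatSketch` and its `format/1` function are hypothetical names used only for illustration, and the singular "1 failure:" header is an assumption, since the documentation shows only the plural case.

```elixir
defmodule FailureFormatSketch do
  # Hypothetical helper, not part of Puck. It takes a list of failure
  # reason strings and renders them in the layout shown above.
  def format(reasons) when is_list(reasons) do
    count = length(reasons)

    # Assumption: a single failure uses the singular noun.
    noun = if count == 1, do: "failure", else: "failures"

    header = "#{count} #{noun}:"
    lines = Enum.map(reasons, fn reason -> "  - " <> reason end)

    Enum.join([header | lines], "\n")
  end
end

reasons = [
  ~s(Output does not contain "john@example.com"),
  "7 steps exceeds max of 5"
]

IO.puts(FailureFormatSketch.format(reasons))
```

Because the result is a single string with one reason per line, it can be passed directly as the message argument of an ExUnit `assert`, or written to a log without further processing.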