Debugging tools for trajectories and evaluation results.
When an eval fails, developers need human-readable output to determine whether the eval or the agent is broken. Anthropic emphasizes reading transcripts to ensure evals "feel fair".
Example
{output, trajectory} = Collector.collect(fn ->
  MyAgent.run("Find John")
end)
Inspector.print_trajectory(trajectory)
# => Prints formatted trajectory to console
result = Result.from_graders(output, trajectory, graders)
if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end
Summary
Functions
format_failures(result)
Formats grader failures into a readable string.
print_trajectory(trajectory, opts \\ [])
Prints a human-readable trajectory to the console.
Functions
format_failures(result)
Formats grader failures into a readable string.
Returns a string listing all failed graders and their reasons. Suitable for ExUnit assertions or logging.
Example
result = Result.from_graders(output, trajectory, graders)
if not result.passed? do
  IO.puts(Inspector.format_failures(result))
end
# Or in tests:
assert result.passed?, Inspector.format_failures(result)
Output Format
2 failures:
- Output does not contain "john@example.com"
- 7 steps exceeds max of 5
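The failure listing shown above can be reproduced with a few lines of plain Elixir. The sketch below is a hypothetical stand-in, not the Inspector implementation: the module name FailureFormatSketch, the choice to take a bare list of reason strings, and the singular "1 failure:" header are all assumptions made for illustration.

```elixir
defmodule FailureFormatSketch do
  # Hypothetical helper: builds a "N failures:" header followed by one
  # "- reason" line per failure. Not the library's actual code.
  def format(reasons) when is_list(reasons) do
    count = length(reasons)
    # Assumption: a single failure uses the singular noun.
    noun = if count == 1, do: "failure", else: "failures"
    header = "#{count} #{noun}:"
    lines = Enum.map(reasons, fn reason -> "- " <> reason end)
    Enum.join([header | lines], "\n")
  end
end

IO.puts(
  FailureFormatSketch.format([
    ~s(Output does not contain "john@example.com"),
    "7 steps exceeds max of 5"
  ])
)
```

Because the result is a single string, it can be passed directly as the message argument of an ExUnit assertion or written to a log.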
print_trajectory(trajectory, opts \\ [])
Prints a human-readable trajectory to the console.
Options
:device - IO device to print to (default: :stdio)
:max_length - Max characters for output display (default: 200)
Example
Inspector.print_trajectory(trajectory)
# Trajectory (3 steps, 425 tokens, 1250ms)
#
# Step 1:
# Input: "Find John's email"
# Output: %LookupContact{name: "John"}
# Tokens: 150 in, 30 out (180 total)
# Duration: 450ms
# ...
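To illustrate how the :device and :max_length options might behave, here is a minimal sketch that truncates a line and writes it to a chosen IO device. It is not the Inspector source: the module TruncateSketch, its function names, and the "..." truncation marker are assumptions. Only the option names and their defaults come from the documentation above.

```elixir
defmodule TruncateSketch do
  # Hypothetical helper: prints one line, honoring :device and :max_length.
  def print_line(text, opts \\ []) do
    device = Keyword.get(opts, :device, :stdio)
    max_length = Keyword.get(opts, :max_length, 200)
    IO.puts(device, truncate(text, max_length))
  end

  # Assumption: overly long text is cut and marked with "...".
  def truncate(text, max_length) do
    if String.length(text) > max_length do
      String.slice(text, 0, max_length) <> "..."
    else
      text
    end
  end
end

# Capture output in memory instead of printing to the console.
{:ok, io} = StringIO.open("")
TruncateSketch.print_line("hello", device: io, max_length: 2)
{_input, captured} = StringIO.contents(io)
IO.inspect(captured)
```

Accepting an IO device as an option is what makes this kind of printing testable: a StringIO process can stand in for the console, and the captured text can then be asserted on.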