Aggregates grader results from an evaluation.
The Result struct captures the output, trajectory, and results of applying multiple graders to an agent execution.
Fields
:passed?- Whether all graders passed:output- The agent's final output:trajectory- The capturedPuck.Eval.Trajectory:grader_results- List of individual grader results
Example
alias Puck.Eval.{Collector, Graders, Result}
{output, trajectory} = Collector.collect(fn -> MyAgent.run(input) end)
result = Result.from_graders(output, trajectory, [
Graders.contains("john@example.com"),
Graders.max_steps(5)
])
if result.passed? do
IO.puts("All graders passed!")
else
IO.puts("Failed graders:")
for gr <- result.grader_results, !gr.passed? do
IO.puts(" - #{gr.reason}")
end
end
Summary
Functions
Returns only the failed grader results.
Creates a Result by applying graders to an output and trajectory.
Returns only the passed grader results.
Returns a summary map of the result.
Types
@type grader_result() :: %{ grader: module() | (term(), Puck.Eval.Trajectory.t() -> Puck.Eval.Grader.result()), result: Puck.Eval.Grader.result(), passed?: boolean(), reason: String.t() | nil }
@type t() :: %Puck.Eval.Result{ grader_results: [grader_result()], output: term(), passed?: boolean(), trajectory: Puck.Eval.Trajectory.t() }
Functions
Returns only the failed grader results.
Example
failed = Result.failures(result)
for f <- failed do
IO.puts("Failed: #{f.reason}")
end
Creates a Result by applying graders to an output and trajectory.
Returns a Result struct with all grader results aggregated.
The passed? field is true only if all graders pass.
Example
result = Result.from_graders(output, trajectory, [
Graders.contains("hello"),
Graders.max_steps(3)
])
result.passed? # => true if all passed
result.grader_results # => list of individual results
Returns only the passed grader results.
Returns a summary map of the result.
Useful for logging or serialization.
Example
summary = Result.summary(result)
# => %{
# passed?: true,
# total_graders: 3,
# passed_count: 3,
# failed_count: 0,
# total_steps: 2,
# total_tokens: 385,
# total_duration_ms: 1200
# }