Puck.Eval.Result (Puck v0.2.9)

View Source

Aggregates grader results from an evaluation.

The Result struct captures the output, trajectory, and results of applying multiple graders to an agent execution.

Fields

  • :passed? - Whether all graders passed
  • :output - The agent's final output
  • :trajectory - The captured Puck.Eval.Trajectory
  • :grader_results - List of individual grader results

Example

alias Puck.Eval.{Collector, Graders, Result}

{output, trajectory} = Collector.collect(fn -> MyAgent.run(input) end)

result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5)
])

if result.passed? do
  IO.puts("All graders passed!")
else
  IO.puts("Failed graders:")
  for gr <- result.grader_results, !gr.passed? do
    IO.puts("  - #{gr.reason}")
  end
end

Summary

Functions

Returns only the failed grader results.

Creates a Result by applying graders to an output and trajectory.

Returns only the passed grader results.

Returns a summary map of the result.

Types

grader_result()

@type grader_result() :: %{
  grader:
    module() | (term(), Puck.Eval.Trajectory.t() -> Puck.Eval.Grader.result()),
  result: Puck.Eval.Grader.result(),
  passed?: boolean(),
  reason: String.t() | nil
}

t()

@type t() :: %Puck.Eval.Result{
  grader_results: [grader_result()],
  output: term(),
  passed?: boolean(),
  trajectory: Puck.Eval.Trajectory.t()
}

Functions

failures(result)

Returns only the failed grader results.

Example

failed = Result.failures(result)
for f <- failed do
  IO.puts("Failed: #{f.reason}")
end

from_graders(output, trajectory, graders)

Creates a Result by applying graders to an output and trajectory.

Returns a Result struct with all grader results aggregated. The passed? field is true only if all graders pass.

Example

result = Result.from_graders(output, trajectory, [
  Graders.contains("hello"),
  Graders.max_steps(3)
])

result.passed?         # => true if all passed
result.grader_results  # => list of individual results

passes(result)

Returns only the passed grader results.

summary(result)

Returns a summary map of the result.

Useful for logging or serialization.

Example

summary = Result.summary(result)
# => %{
#   passed?: true,
#   total_graders: 3,
#   passed_count: 3,
#   failed_count: 0,
#   total_steps: 2,
#   total_tokens: 385,
#   total_duration_ms: 1200
# }