Built-in graders for common evaluation patterns.
All graders return functions that can be used with Puck.Eval.Grader.run/3
or Puck.Eval.Result.from_graders/3.
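The examples on this page suggest a simple contract: a grader is a two-argument function of the output and the trajectory that returns either :pass or {:fail, reason}. The sketch below illustrates that contract with a hand-written grader. GraderSketch, non_empty/0, and failures/3 are hypothetical names made up for illustration and are not part of Puck; the trajectory is passed as nil only because this grader ignores it.

```elixir
# Hypothetical sketch, not Puck's implementation. It assumes only the
# contract visible in the examples: (output, trajectory) -> :pass | {:fail, reason}.
defmodule GraderSketch do
  # Builds a custom grader that passes when the output is a non-empty string.
  def non_empty do
    fn output, _trajectory ->
      if is_binary(output) and output != "" do
        :pass
      else
        {:fail, "Output is empty or not a string"}
      end
    end
  end

  # Applies every grader and collects the failure reasons; :pass results
  # do not match the {:fail, reason} pattern and are skipped.
  def failures(output, trajectory, graders) do
    for grader <- graders,
        {:fail, reason} <- [grader.(output, trajectory)],
        do: reason
  end
end

grader = GraderSketch.non_empty()

GraderSketch.failures("hello", nil, [grader])
# => []

GraderSketch.failures("", nil, [grader])
# => ["Output is empty or not a string"]
```

Because a grader under this assumption is just a closure, a custom one like this could presumably sit in the same list as the built-in graders.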
Output Graders
Check the agent's final output:
contains/1 - Output contains a substring
matches/1 - Output matches a regex
equals/1 - Output equals expected value
satisfies/1 - Output passes a predicate function
Trajectory Graders
Check the agent's execution trajectory:
max_steps/1 - Trajectory has at most N steps
max_tokens/1 - Trajectory used at most N tokens
max_duration_ms/1 - Trajectory took at most N milliseconds
Step Output Graders
Check what was produced during execution:
output_produced/1,2 - A specific struct type was produced
output_matches/1,2 - A step output matches a predicate (for value checks)
output_not_produced/1 - A specific struct type was not produced
output_sequence/1 - Struct types were produced in a specific order
Example
alias Puck.Eval.{Collector, Graders, Result}
{output, trajectory} = Collector.collect(fn -> MyAgent.run(input) end)
result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5),
  Graders.output_produced(LookupContact)
])
result.passed? # => true or false
Functions
contains/1
Checks if the output contains the given substring.
Example
grader = Graders.contains("hello")
grader.("hello world", trajectory)
# => :pass
grader.("goodbye", trajectory)
# => {:fail, "Output does not contain \"hello\""}
equals/1
Checks if the output equals the expected value.
Example
grader = Graders.equals("success")
grader.("success", trajectory)
# => :pass
matches/1
Checks if the output matches the given regex.
Example
grader = Graders.matches(~r/\d{3}-\d{4}/)
grader.("Call 555-1234", trajectory)
# => :pass
max_duration_ms/1
Checks that the trajectory took at most N milliseconds.
Example
grader = Graders.max_duration_ms(5000)
grader.(output, %Trajectory{total_duration_ms: 3000})
# => :pass
max_steps/1
Checks that the trajectory has at most N steps.
Example
grader = Graders.max_steps(3)
grader.(output, %Trajectory{total_steps: 2})
# => :pass
grader.(output, %Trajectory{total_steps: 5})
# => {:fail, "5 steps exceeds max of 3"}
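A limit grader of this shape can be pictured as a closure over the limit that compares it with a field of the trajectory. The following is a minimal sketch under that assumption, not the library's code: LimitSketch is a hypothetical module, and a plain map with a total_steps key stands in for the Trajectory struct.

```elixir
# Hypothetical sketch, not Puck's implementation.
defmodule LimitSketch do
  # Returns a grader that closes over `limit` and compares it with the
  # trajectory's step count. A plain map stands in for %Trajectory{}.
  def max_steps(limit) do
    fn _output, %{total_steps: steps} ->
      if steps <= limit do
        :pass
      else
        {:fail, "#{steps} steps exceeds max of #{limit}"}
      end
    end
  end
end

grader = LimitSketch.max_steps(3)

grader.("ignored", %{total_steps: 2})
# => :pass

grader.("ignored", %{total_steps: 5})
# => {:fail, "5 steps exceeds max of 3"}
```

The token and duration limits would presumably follow the same pattern with a different field.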
max_tokens/1
Checks that the trajectory used at most N tokens.
Example
grader = Graders.max_tokens(1000)
grader.(output, %Trajectory{total_tokens: 500})
# => :pass
output_matches/1,2
Checks that any step output matches a predicate function.
Use this to assert on specific struct values, not just types.
Options
:times - Exact number of matches required (default: at least once)
Example
# Check if any step produced a LookupContact for "John"
grader = Graders.output_matches(fn
  %LookupContact{name: "John"} -> true
  _ -> false
end)
# Check exact count of matching outputs
grader = Graders.output_matches(
  fn %LookupContact{} -> true; _ -> false end,
  times: 2
)
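The difference between the default behaviour (at least one match) and the :times option (an exact count) can be sketched as below. This is an illustration of the documented semantics only: TimesSketch and check/3 are hypothetical, the failure messages are invented, and the function works on a plain list rather than on a trajectory.

```elixir
# Hypothetical sketch, not Puck's implementation.
defmodule TimesSketch do
  # Counts the items that satisfy `predicate`. With `times: n` the count
  # must equal n exactly; without it, one or more matches is enough.
  def check(items, predicate, opts \\ []) do
    count = Enum.count(items, predicate)

    case Keyword.fetch(opts, :times) do
      {:ok, n} when count == n -> :pass
      {:ok, n} -> {:fail, "Expected #{n} matches, got #{count}"}
      :error when count > 0 -> :pass
      :error -> {:fail, "No matching output"}
    end
  end
end

is_even = fn n -> rem(n, 2) == 0 end

TimesSketch.check([1, 2, 4], is_even)
# => :pass

TimesSketch.check([1, 2, 4], is_even, times: 2)
# => :pass

TimesSketch.check([1, 2, 4], is_even, times: 1)
# => {:fail, "Expected 1 matches, got 2"}
```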
output_not_produced/1
Checks that a specific struct type was NOT produced during execution.
Example
grader = Graders.output_not_produced(DeleteContact)
grader.(output, trajectory)
# => :pass if no step.output was %DeleteContact{}
output_produced/1,2
Checks that a specific struct type was produced during execution.
Matches directly on struct module types; no extractor is needed.
Options
:times - Exact number of times this struct should appear (default: at least once)
Example
# Check if any step produced a LookupContact struct
grader = Graders.output_produced(LookupContact)
grader.(output, trajectory)
# => :pass if any step.output was %LookupContact{}
# Check exact count
grader = Graders.output_produced(LookupContact, times: 2)
output_sequence/1
Checks that struct types were produced in a specific order.
The sequence must appear somewhere in the trajectory, but other struct types can appear between the expected ones.
Example
# Check investigation pattern: snapshot -> execute -> alert
grader = Graders.output_sequence([TakeSnapshot, Execute, FireAlert])
grader.(output, trajectory)
# => :pass if structs appeared in that order
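The in-order rule described above, where other struct types may appear between the expected ones, amounts to a subsequence check. A minimal sketch follows, assuming the produced struct types have already been collected into a list of module names; how Puck extracts them from trajectory steps is not shown here. SequenceSketch and in_order?/2 are hypothetical names, and Log is a made-up module name standing in for an unrelated step.

```elixir
# Hypothetical sketch, not Puck's implementation.
defmodule SequenceSketch do
  # Returns true when every module in `expected` appears in `produced`
  # in the same relative order, with any other entries allowed in between.
  def in_order?(_produced, []), do: true
  def in_order?([], _expected), do: false
  # The head of `produced` is the next expected module: consume both.
  def in_order?([mod | rest], [mod | expected]), do: in_order?(rest, expected)
  # Otherwise skip the unrelated entry and keep looking.
  def in_order?([_other | rest], expected), do: in_order?(rest, expected)
end

SequenceSketch.in_order?(
  [TakeSnapshot, Log, Execute, FireAlert],
  [TakeSnapshot, Execute, FireAlert]
)
# => true

SequenceSketch.in_order?(
  [Execute, TakeSnapshot, FireAlert],
  [TakeSnapshot, Execute, FireAlert]
)
# => false
```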
satisfies/1
Checks if the output satisfies a predicate function.
The predicate receives the output and should return a boolean.
Example
grader = Graders.satisfies(fn output -> String.length(output) > 10 end)
grader.("hello world!", trajectory)
# => :pass