Puck.Eval.Graders (Puck v0.2.11)

Copy Markdown View Source

Built-in graders for common evaluation patterns.

All graders return functions that can be used with Puck.Eval.Grader.run/3 or Puck.Eval.Result.from_graders/3.

Output Graders

Check the agent's final output:

Trajectory Graders

Check the agent's execution trajectory:

Step Output Graders

Check what was produced during execution:

  • output_produced/1,2 - A specific struct type was produced
  • output_matches/1,2 - A step output matches a predicate (for value checks)
  • output_not_produced/1 - A specific struct type was not produced
  • output_sequence/1 - Struct types were produced in a specific order

Example

alias Puck.Eval.{Collector, Graders, Result}

{output, trajectory} = Collector.collect(fn -> MyAgent.run(input) end)

result = Result.from_graders(output, trajectory, [
  Graders.contains("john@example.com"),
  Graders.max_steps(5),
  Graders.output_produced(LookupContact)
])

result.passed?  # => true or false

Summary

Functions

Checks if the output contains the given substring.

Checks if the output equals the expected value.

Checks if the output matches the given regex.

Checks that the trajectory took at most N milliseconds.

Checks that the trajectory has at most N steps.

Checks that the trajectory used at most N tokens.

Checks that any step output matches a predicate function.

Checks that a specific struct type was NOT produced during execution.

Checks that a specific struct type was produced during execution.

Checks that struct types were produced in a specific order.

Checks if the output satisfies a predicate function.

Functions

contains(substring)

Checks if the output contains the given substring.

Example

grader = Graders.contains("hello")
grader.("hello world", trajectory)
# => :pass

grader.("goodbye", trajectory)
# => {:fail, "Output does not contain \"hello\""}

equals(expected)

Checks if the output equals the expected value.

Example

grader = Graders.equals("success")
grader.("success", trajectory)
# => :pass

matches(regex)

Checks if the output matches the given regex.

Example

grader = Graders.matches(~r/\d{3}-\d{4}/)
grader.("Call 555-1234", trajectory)
# => :pass

max_duration_ms(n)

Checks that the trajectory took at most N milliseconds.

Example

grader = Graders.max_duration_ms(5000)
grader.(output, %Trajectory{total_duration_ms: 3000})
# => :pass

max_steps(n)

Checks that the trajectory has at most N steps.

Example

grader = Graders.max_steps(3)
grader.(output, %Trajectory{total_steps: 2})
# => :pass

grader.(output, %Trajectory{total_steps: 5})
# => {:fail, "5 steps exceeds max of 3"}

max_tokens(n)

Checks that the trajectory used at most N tokens.

Example

grader = Graders.max_tokens(1000)
grader.(output, %Trajectory{total_tokens: 500})
# => :pass

output_matches(predicate, opts \\ [])

Checks that any step output matches a predicate function.

Use this to assert on specific struct values, not just types.

Options

  • :times - Exact number of matches required (default: at least once)

Example

# Check if any step produced a LookupContact for "John"
grader = Graders.output_matches(fn
  %LookupContact{name: "John"} -> true
  _ -> false
end)

# Check exact count of matching outputs
grader = Graders.output_matches(
  fn %LookupContact{} -> true; _ -> false end,
  times: 2
)

output_not_produced(struct_module)

Checks that a specific struct type was NOT produced during execution.

Example

grader = Graders.output_not_produced(DeleteContact)
grader.(output, trajectory)
# => :pass if no step.output was %DeleteContact{}

output_produced(struct_module, opts \\ [])

Checks that a specific struct type was produced during execution.

Matches directly on struct module types - no extractor needed.

Options

  • :times - Exact number of times this struct should appear (default: at least once)

Example

# Check if any step produced a LookupContact struct
grader = Graders.output_produced(LookupContact)
grader.(output, trajectory)
# => :pass if any step.output was %LookupContact{}

# Check exact count
grader = Graders.output_produced(LookupContact, times: 2)

output_sequence(struct_modules)

Checks that struct types were produced in a specific order.

The sequence must appear somewhere in the trajectory, but other struct types can appear between the expected ones.

Example

# Check investigation pattern: snapshot -> execute -> alert
grader = Graders.output_sequence([TakeSnapshot, Execute, FireAlert])
grader.(output, trajectory)
# => :pass if structs appeared in that order

satisfies(predicate)

Checks if the output satisfies a predicate function.

The predicate receives the output and should return a boolean.

Example

grader = Graders.satisfies(fn output -> String.length(output) > 10 end)
grader.("hello world!", trajectory)
# => :pass