Puck.Eval.Graders.LLM (Puck v0.2.11)

LLM-as-judge graders for subjective evaluation.

Use these graders when code-based graders can't capture nuanced criteria such as tone, empathy, or code quality. LLM graders are non-deterministic: the same output may receive a different verdict on a retry.

Example

alias Puck.Eval.{Collector, Result}
alias Puck.Eval.Graders.LLM

judge_client = Puck.Client.new(
  {Puck.Backends.ReqLLM, "anthropic:claude-haiku-4-5"}
)

{output, trajectory} = Collector.collect(fn ->
  CustomerAgent.respond("How do I return an item?")
end)

result = Result.from_graders(output, trajectory, [
  LLM.rubric(judge_client, """
  - Response is polite
  - Response explains return process
  - Response asks for order number
  """)
])

Use a fast, inexpensive model (such as Haiku) as the judge to minimize cost and latency.

Rubric Format

Write the rubric as simple bullet points, one criterion per bullet. The judge decides pass/fail based on whether all criteria are met, so a single unmet criterion fails the output.
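
When criteria are kept in a list (for example, to share them across several evals), the rubric string can be assembled programmatically. The following is a minimal sketch; `RubricBuilder` and `build/1` are hypothetical names used for illustration and are not part of Puck.

```elixir
defmodule RubricBuilder do
  # Hypothetical helper, not part of Puck: renders a list of criteria
  # as a bullet-point string, one "- criterion" line per entry.
  def build(criteria) when is_list(criteria) do
    Enum.map_join(criteria, "\n", fn criterion ->
      "- " <> String.trim(criterion)
    end)
  end
end

rubric =
  RubricBuilder.build([
    "Response is polite",
    "Response explains return process",
    "Response asks for order number"
  ])

# Prints the three criteria as bullet lines, ready to pass as the
# second argument of LLM.rubric/2.
IO.puts(rubric)
```

Because every criterion must hold for a pass, each added bullet makes the grader stricter; keeping the list short and specific tends to make failures easier to interpret.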

Non-Determinism

LLM judges are probabilistic. For reliability testing, run multiple trials with Puck.Eval.Trial.run_trials/3 and measure pass@k metrics.
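
To make the metric concrete, here is a sketch of the commonly used pass@k estimator, 1 - C(n - c, k) / C(n, k), where n is the number of trials and c the number that passed. This page does not show how Puck.Eval.Trial.run_trials/3 reports its results or computes its metrics, so the `PassAtK` module and the `results` list below are assumptions made for illustration only.

```elixir
defmodule PassAtK do
  # Illustrative helper, not part of Puck.
  # n = total trials, c = trials that passed, k = sample size.
  # Estimates the probability that at least one of k trials drawn
  # without replacement from the n recorded trials passed.
  def estimate(n, c, k) when k >= 1 and n >= k and c >= 0 and c <= n do
    if n - c < k do
      # Fewer than k failures exist, so any k-sample contains a pass.
      1.0
    else
      # Product form of C(n - c, k) / C(n, k); avoids large factorials.
      # The range is empty when c == 0, leaving all_fail at 1.0.
      all_fail =
        Enum.reduce((n - c + 1)..n//1, 1.0, fn i, acc ->
          acc * (1.0 - k / i)
        end)

      1.0 - all_fail
    end
  end
end

# Assumed shape: one grader verdict per trial.
results = [:pass, {:fail, "too terse"}, :pass, :pass, {:fail, "no order number"}]

n = length(results)
c = Enum.count(results, &(&1 == :pass))

IO.inspect(PassAtK.estimate(n, c, 1), label: "pass@1")
IO.inspect(PassAtK.estimate(n, c, 3), label: "pass@3")
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k answers the question "would at least one of k attempts have passed?".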

Functions

rubric(client, rubric)

Creates an LLM-as-judge grader using a rubric.

Returns a grader function compatible with Puck.Eval.Result.from_graders/3.

Parameters

client - Puck.Client for the judge LLM (a fast model such as Haiku is recommended)
rubric - String of bullet points describing the evaluation criteria

Returns

Grader function that returns :pass or {:fail, reason}.
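
Callers typically pattern match on the two return shapes. The sketch below uses a stub grader in place of one built by LLM.rubric/2 so that it runs without a judge model; the stub, the `describe` function, and the assumption that `reason` is a string are all illustrative and not taken from Puck.

```elixir
# Stub standing in for a grader returned by LLM.rubric/2. It has the
# same call shape: (output, trajectory) in, :pass or {:fail, reason} out.
stub_grader = fn output, _trajectory ->
  if String.contains?(output, "order number") do
    :pass
  else
    {:fail, "Response does not ask for the order number"}
  end
end

# Turn a verdict into a log line. Assumes `reason` is a string.
describe = fn
  :pass -> "passed"
  {:fail, reason} -> "failed: " <> reason
end

# The stub ignores the trajectory, so nil is passed here.
IO.puts(describe.(stub_grader.("What is your order number?", nil)))
IO.puts(describe.(stub_grader.("Returns are easy!", nil)))
```

A stub like this is also a convenient way to unit test the surrounding eval plumbing deterministically before swapping in a real judge.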

Example

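alias Puck.Eval.Graders.LLM

judge = Puck.Client.new({Puck.Backends.ReqLLM, "anthropic:claude-haiku-4-5"})

grader = LLM.rubric(judge, """
- Response is polite
- Response is concise
- Response answers the question
""")

# `trajectory` is the value returned by Collector.collect/1, as in the module example
grader.("Thanks! Your order is confirmed.", trajectory)
# => :pass or {:fail, reason}; the judge's verdict can vary between runs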