Tribunal.Judge behaviour (Tribunal v1.3.6)

Behaviour for LLM-as-judge assertions.

All judges, built-in and custom, implement this behaviour, which gives every evaluation criterion the same interface.

Example

defmodule MyApp.Judges.BrandVoice do
  @behaviour Tribunal.Judge

  @impl true
  def name, do: :brand_voice

  @impl true
  def prompt(test_case, _opts) do
    """
    Evaluate if the response matches our brand voice guidelines:

    - Friendly but professional tone
    - No jargon or technical terms
    - Empathetic and helpful

    Response to evaluate:
    #{test_case.actual_output}

    Query: #{test_case.input}
    """
  end
end

Configuration

Register your custom judges in config:

config :tribunal, :custom_judges, [
  MyApp.Judges.BrandVoice,
  MyApp.Judges.Compliance
]

Then use them like built-in assertions:

assert_judge :brand_voice, response, query: input

Callbacks

evaluate_result(result, opts)

(optional)
@callback evaluate_result(result :: map(), opts :: keyword()) ::
  {:pass, map()} | {:fail, map()}

Optional: customize how the LLM result is interpreted.

By default, the result is interpreted using the verdict and threshold logic. Override this callback to apply custom pass/fail logic.

The callback should return {:pass, details} or {:fail, details}.
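
As an illustration, here is a minimal sketch of a judge that overrides evaluate_result/2 to pass only on a confident "yes". It rests on assumptions that are not confirmed by this page: that the result map exposes the parsed verdict and score under the atom keys :verdict and :score, and that a :min_score option (a hypothetical name) reaches the callback through opts. Check the actual shape of the result map before relying on it.

```elixir
defmodule MyApp.Judges.StrictBrandVoice do
  @behaviour Tribunal.Judge

  @impl true
  def name, do: :strict_brand_voice

  @impl true
  def prompt(test_case, _opts) do
    """
    Evaluate if the response matches our brand voice guidelines.

    Response to evaluate:
    #{test_case.actual_output}
    """
  end

  # Assumption: `result` holds the verdict under :verdict and a numeric
  # score under :score. :min_score is a hypothetical option name.
  @impl true
  def evaluate_result(result, opts) do
    min_score = Keyword.get(opts, :min_score, 0.9)
    score = result[:score]

    if result[:verdict] == "yes" and is_number(score) and score >= min_score do
      {:pass, result}
    else
      {:fail, result}
    end
  end
end
```

Returning the untouched result map as the details keeps the verdict, reason, and score available to whatever reports the failure.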

name()

@callback name() :: atom()

Returns the atom name for this judge.

This name is used to invoke the judge in assertions:

assert_judge :my_judge_name, response, opts

negative_metric?()

(optional)
@callback negative_metric?() :: boolean()

Optional: whether "no" verdict means pass (for negative metrics like toxicity).

When true, verdict "no" = pass and "yes" = fail. When false (default), verdict "yes" = pass and "no" = fail.
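
A negative metric can be sketched as follows. The judge name :condescension and the prompt wording are made up for illustration; the point is only that the prompt asks whether the unwanted property is present, so a "no" verdict is the passing outcome.

```elixir
defmodule MyApp.Judges.Condescension do
  @behaviour Tribunal.Judge

  @impl true
  def name, do: :condescension

  # "yes" means the response IS condescending, so "no" should pass.
  @impl true
  def negative_metric?, do: true

  @impl true
  def prompt(test_case, _opts) do
    """
    Does the following response talk down to the user?
    Answer "yes" if it is condescending and "no" if it is not.

    Response to evaluate:
    #{test_case.actual_output}
    """
  end
end
```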

prompt(test_case, opts)

@callback prompt(test_case :: Tribunal.TestCase.t(), opts :: keyword()) :: String.t()

Builds the evaluation prompt for the LLM judge.

Receives the test case and any options passed to the assertion. Should return a prompt string that asks the LLM to evaluate the response and return a JSON verdict.

The prompt should instruct the LLM to return JSON with:

  • verdict: "yes", "no", or "partial"
  • reason: explanation for the verdict
  • score: confidence score 0.0-1.0
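
The sketch below shows one way a prompt callback might spell out that JSON contract and read an option from opts. The judge itself and the :max_sentences option are hypothetical; only the three JSON fields come from the description above.

```elixir
defmodule MyApp.Judges.Conciseness do
  @behaviour Tribunal.Judge

  @impl true
  def name, do: :conciseness

  @impl true
  def prompt(test_case, opts) do
    # :max_sentences is a hypothetical option, defaulting to 3.
    max_sentences = Keyword.get(opts, :max_sentences, 3)

    """
    Evaluate whether the response answers the query in at most #{max_sentences} sentences.

    Query: #{test_case.input}

    Response to evaluate:
    #{test_case.actual_output}

    Return JSON with exactly these fields:
    - "verdict": "yes", "no", or "partial"
    - "reason": a short explanation for the verdict
    - "score": a confidence score between 0.0 and 1.0
    """
  end
end
```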

validate(test_case)

(optional)
@callback validate(test_case :: Tribunal.TestCase.t()) :: :ok | {:error, String.t()}

Optional: validate that the test case has required fields.

Return :ok if the test case is valid, or {:error, reason} if it is not. The default implementation always returns :ok.
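
For example, a judge whose prompt depends on the query could reject test cases that lack one. This is a sketch under the assumption that input is a string when present; the judge name and error message are illustrative.

```elixir
defmodule MyApp.Judges.AnswersQuery do
  @behaviour Tribunal.Judge

  @impl true
  def name, do: :answers_query

  # The prompt interpolates `input`, so refuse test cases without one.
  # Assumption: `input` is a string when it is set.
  @impl true
  def validate(test_case) do
    case test_case.input do
      input when is_binary(input) and input != "" -> :ok
      _ -> {:error, "answers_query requires a non-empty input"}
    end
  end

  @impl true
  def prompt(test_case, _opts) do
    """
    Does the response directly answer the query?

    Query: #{test_case.input}

    Response to evaluate:
    #{test_case.actual_output}
    """
  end
end
```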

Functions

all_judge_names()

Returns list of all judge names (built-in + custom).

all_judges()

Returns all judge modules (built-in + custom).

builtin_judge?(name)

Checks if a name is a built-in judge.

builtin_judge_names()

Returns list of built-in judge names.

builtin_judges()

Returns all built-in judge modules.

custom_judge?(name)

Checks if a name is a registered custom judge.

custom_judge_names()

Returns list of custom judge names.

custom_judges()

Returns all configured custom judge modules.

find(name)

Finds a judge module by name.

Returns {:ok, module} or :error.
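
A short sketch of handling both return shapes. MyApp.JudgeLookup and describe/2 are hypothetical helpers; the lookup function is passed in as an argument (defaulting to Tribunal.Judge.find/1) purely so the pattern match can be exercised without a configured judge registry.

```elixir
defmodule MyApp.JudgeLookup do
  # `finder` defaults to Tribunal.Judge.find/1 and is expected to return
  # {:ok, module} or :error, as documented above.
  def describe(name, finder \\ &Tribunal.Judge.find/1) do
    case finder.(name) do
      {:ok, module} -> "#{inspect(name)} is handled by #{inspect(module)}"
      :error -> "no judge registered as #{inspect(name)}"
    end
  end
end
```

In application code the call would simply be MyApp.JudgeLookup.describe(:brand_voice), which falls through to Tribunal.Judge.find/1.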