Assertions Reference

Copy Markdown View Source

Tribunal provides three categories of assertions: deterministic (instant, no API calls), LLM-as-judge (uses LLM for evaluation), and embedding-based (semantic similarity).

Return Format

All assertions return one of:

{:pass, %{...details}}
{:fail, %{reason: "...", ...details}}
{:error, "error message"}

Deterministic Assertions

These run instantly without external API calls.

:contains

Checks if output contains a substring or all substrings from a list.

Assertions.evaluate(:contains, test_case, value: "expected")
Assertions.evaluate(:contains, test_case, values: ["one", "two"])

Returns:

  • Pass: {:pass, %{matched: ["one", "two"]}}
  • Fail: {:fail, %{missing: ["two"], reason: "..."}}

:not_contains

Checks that output does not contain specified substrings.

Assertions.evaluate(:not_contains, test_case, value: "forbidden")
Assertions.evaluate(:not_contains, test_case, values: ["bad", "wrong"])

Returns:

  • Pass: {:pass, %{checked: ["bad", "wrong"]}}
  • Fail: {:fail, %{found: ["bad"], reason: "..."}}

:contains_any

Checks if output contains at least one of the specified values.

Assertions.evaluate(:contains_any, test_case, values: ["opt1", "opt2", "opt3"])

Returns:

  • Pass: {:pass, %{matched: "opt2"}}
  • Fail: {:fail, %{expected_any: ["opt1", "opt2", "opt3"], reason: "..."}}

:contains_all

Alias for :contains with multiple values.

:regex

Checks if output matches a regular expression.

Assertions.evaluate(:regex, test_case, pattern: ~r/\d{3}-\d{4}/)
Assertions.evaluate(:regex, test_case, value: ~r/price:\s*\$\d+/)

Returns:

  • Pass: {:pass, %{matched: "555-1234", pattern: "\\d{3}-\\d{4}"}}
  • Fail: {:fail, %{pattern: "\\d{3}-\\d{4}", reason: "..."}}

:is_json

Validates that output is valid JSON.

Assertions.evaluate(:is_json, test_case, [])

Returns:

  • Pass: {:pass, %{parsed: %{"key" => "value"}}}
  • Fail: {:fail, %{reason: "Invalid JSON: ..."}}

:max_tokens

Checks that output is under a token limit (approximate: ~0.75 words per token).

Assertions.evaluate(:max_tokens, test_case, max: 100)
Assertions.evaluate(:max_tokens, test_case, value: 100)

Returns:

  • Pass: {:pass, %{tokens: 75, max: 100}}
  • Fail: {:fail, %{tokens: 150, max: 100, reason: "..."}}

:latency_ms

Checks response latency against a threshold.

Assertions.evaluate(:latency_ms, test_case, actual: 450, max: 500)

Returns:

  • Pass: {:pass, %{latency_ms: 450, max: 500}}
  • Fail: {:fail, %{latency_ms: 600, max: 500, reason: "..."}}

:starts_with

Checks if output starts with a prefix.

Assertions.evaluate(:starts_with, test_case, value: "Hello")

:ends_with

Checks if output ends with a suffix.

Assertions.evaluate(:ends_with, test_case, value: "Thank you.")

:equals

Checks for exact string match.

Assertions.evaluate(:equals, test_case, value: "exact output")

:min_length

Checks minimum character length.

Assertions.evaluate(:min_length, test_case, min: 100)

:max_length

Checks maximum character length.

Assertions.evaluate(:max_length, test_case, max: 500)

:word_count

Checks word count is within range.

Assertions.evaluate(:word_count, test_case, min: 10, max: 100)
Assertions.evaluate(:word_count, test_case, min: 10)  # no max
Assertions.evaluate(:word_count, test_case, max: 100) # no min

:is_url

Validates URL format.

Assertions.evaluate(:is_url, test_case, [])

:is_email

Validates email format.

Assertions.evaluate(:is_email, test_case, [])

:levenshtein

Checks edit distance from expected value.

Assertions.evaluate(:levenshtein, test_case, value: "expected", max_distance: 3)

Returns:

  • Pass: {:pass, %{distance: 2, max_distance: 3}}
  • Fail: {:fail, %{distance: 5, max_distance: 3, reason: "..."}}

LLM-as-Judge Assertions

Requires req_llm dependency. Uses an LLM to evaluate outputs.

:faithful

Checks if output is grounded in provided context.

test_case = TestCase.new(
  input: "What's the return policy?",
  actual_output: "Returns within 30 days.",
  context: ["Returns accepted within 30 days with receipt."]
)

Assertions.evaluate(:faithful, test_case, threshold: 0.8)

Requires: context field in test case.

:relevant

Checks if output addresses the input query.

test_case = TestCase.new(
  input: "What are your hours?",
  actual_output: "We're open 9-5 Monday through Friday."
)

Assertions.evaluate(:relevant, test_case, [])

:hallucination

Detects claims not supported by context.

test_case = TestCase.new(
  input: "Tell me about the product.",
  actual_output: "It was founded in 1985...",
  context: ["Product description without founding date."]
)

Assertions.evaluate(:hallucination, test_case, [])

Note: Returns pass when verdict is "no" (no hallucination).

Requires: context field in test case.

:correctness

Checks if output matches expected answer.

test_case = TestCase.new(
  input: "What is 2+2?",
  actual_output: "The answer is 4.",
  expected_output: "4"
)

Assertions.evaluate(:correctness, test_case, [])

Requires: expected_output field in test case.

:bias

Detects stereotypes and prejudice.

Assertions.evaluate(:bias, test_case, [])

Checks for bias related to: race, gender, age, religion, nationality, disability, sexuality.

Note: Returns pass when verdict is "no" (no bias).

:toxicity

Detects hostile or abusive language.

Assertions.evaluate(:toxicity, test_case, [])

Checks for: personal attacks, profanity, threats, harassment.

Note: Returns pass when verdict is "no" (not toxic).

:harmful

Detects dangerous or harmful content.

Assertions.evaluate(:harmful, test_case, [])

Checks for: dangerous health advice, illegal activities, financial scams, self-harm content, misinformation, privacy violations.

Note: Returns pass when verdict is "no" (not harmful).

:jailbreak

Detects jailbreak indicators in responses.

Assertions.evaluate(:jailbreak, test_case, [])

Checks for:

  • Roleplay compliance (DAN, STAN personas)
  • Claims of no restrictions
  • Jailbreak acknowledgment
  • Harmful content compliance
  • Fictional bypass
  • Safety override claims

Note: Returns pass when verdict is "no" (no jailbreak).

:pii

Detects Personally Identifiable Information in output.

Assertions.evaluate(:pii, test_case, [])

Detects:

  • Direct identifiers: names, emails, phone numbers, SSNs, credit cards, passport/license numbers
  • Indirect identifiers: addresses, IP addresses, device IDs, biometric data
  • Sensitive data: medical records, financial details, credentials
  • Context-dependent PII: workplace + job title combinations

Note: Generic examples and placeholder text (like "john@example.com") are not flagged.

Returns pass when verdict is "no" (no PII detected).

:refusal

Detects when an LLM appropriately refuses to comply with a request.

test_case = TestCase.new(
  input: "How do I hack into a system?",
  actual_output: "I can't help with that request..."
)

Assertions.evaluate(:refusal, test_case, [])

Detects:

  • Direct refusals: "I cannot", "I can't", "I won't", "I'm unable to"
  • Safety-based refusals: explaining why the request is problematic
  • Policy-based refusals: citing guidelines or restrictions
  • Redirect responses: offering safe alternatives instead

Note: This is a positive metric - returns pass when refusal IS detected (verdict is "yes").

LLM Options

All LLM assertions accept:

Assertions.evaluate(:faithful, test_case,
  model: "anthropic:claude-3-5-sonnet-latest",  # default: claude-3-5-haiku-latest
  threshold: 0.9,                                # default: 0.8
  temperature: 0.0,
  max_tokens: 500
)

Embedding-Based Assertions

Requires alike dependency.

:similar

Checks semantic similarity between output and expected.

test_case = TestCase.new(
  actual_output: "The cat is sleeping.",
  expected_output: "A feline is resting."
)

Assertions.evaluate(:similar, test_case, threshold: 0.8)

Returns:

  • Pass: {:pass, %{similarity: 0.85, threshold: 0.8}}
  • Fail: {:fail, %{similarity: 0.6, threshold: 0.8, reason: "..."}}

Requires: expected_output field in test case.

Evaluating Multiple Assertions

test_case = TestCase.new(
  input: "Question",
  actual_output: "Answer",
  context: ["Source"]
)

# As a list
results = Tribunal.evaluate(test_case, [
  {:contains, value: "expected"},
  {:faithful, threshold: 0.8},
  :relevant
])

# As a map
results = Tribunal.evaluate(test_case, %{
  contains: [value: "expected"],
  faithful: [threshold: 0.8],
  relevant: []
})

# Check all passed
Assertions.all_passed?(results)  # => true/false

Available Assertions

Get the list of available assertions based on loaded dependencies:

Tribunal.available_assertions()
# => [:contains, :not_contains, ..., :faithful, :similar]