Tribunal (Tribunal v1.3.6)

LLM evaluation framework for Elixir.

Tribunal provides tools for evaluating LLM outputs, detecting hallucinations, and measuring response quality.

Quick Start

In Tests (ExUnit)

defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @moduletag :eval

  test "response is grounded in context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @docs, threshold: 0.8
  end
end

Dataset-Driven Evals

# test/evals/datasets/questions.json
[
  {
    "input": "What's the return policy?",
    "context": "Returns within 30 days with receipt.",
    "expected": {
      "contains": ["30 days"],
      "faithful": {"threshold": 0.8}
    }
  }
]

Then run: mix tribunal.eval

Assertion Types

Deterministic (no LLM, instant)

contains - Output includes substring(s)
not_contains - Output excludes substring(s)
contains_any - Output includes at least one
contains_all - Output includes all
regex - Output matches pattern
is_json - Output is valid JSON
max_tokens - Output under token limit
latency_ms - Response within time limit

LLM-as-Judge (requires `req_llm`)

faithful - Response grounded in context
relevant - Response addresses query
hallucination - Response contains fabricated info
correctness - Response matches expected answer
refusal - Output is a refusal
bias - Response contains bias or stereotypes
toxicity - Response contains harmful content
harmful - Response contains dangerous content
jailbreak - Response indicates safety bypass
pii - Response contains personal information

Embedding (requires `alike`)

similar - Semantic similarity to golden answer

Installation

def deps do
  [
    {:tribunal, "~> 0.1"},

    # Optional: LLM-as-judge metrics
    {:req_llm, "~> 1.2"},

    # Optional: embedding similarity
    {:alike, "~> 0.4"}
  ]
end

Summary

Functions

available_assertions()

Returns available assertion types based on loaded dependencies.

evaluate(test_case, assertions)

Evaluates a test case against assertions.

test_case(attrs)

Creates a new test case.

Functions

available_assertions()

Returns available assertion types based on loaded dependencies.

evaluate(test_case, assertions)

Evaluates a test case against assertions.

Examples

test_case = %Tribunal.TestCase{
  input: "What's the return policy?",
  actual_output: "Returns within 30 days.",
  context: ["Return policy: 30 days with receipt."]
}

assertions = [
  {:contains, [value: "30 days"]},
  {:faithful, [threshold: 0.8]}
]

Tribunal.evaluate(test_case, assertions)
#=> %{contains: {:pass, ...}, faithful: {:pass, ...}}

test_case(attrs)

Creates a new test case.

Examples

Tribunal.test_case(
  input: "What's the price?",
  actual_output: "The price is $29.99.",
  context: ["Product costs $29.99"]
)

Tribunal (Tribunal v1.3.6)

Quick Start

In Tests (ExUnit)

Dataset-Driven Evals

Assertion Types

Deterministic (no LLM, instant)

LLM-as-Judge (requires req_llm)

Embedding (requires alike)

Installation

Summary

Functions

Functions

available_assertions()

evaluate(test_case, assertions)

Examples

test_case(attrs)

Examples

LLM-as-Judge (requires `req_llm`)

Embedding (requires `alike`)