Tribunal (Tribunal v1.3.6)

LLM evaluation framework for Elixir.

Tribunal provides tools for evaluating LLM outputs, detecting hallucinations, and measuring response quality.

Quick Start

In Tests (ExUnit)

defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

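  @moduletag :eval

  # Documents the response must be grounded in (illustrative content)
  @docs ["Returns within 30 days with receipt."]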

  test "response is grounded in context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @docs, threshold: 0.8
  end
end
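Because the module above is tagged with `:eval`, standard ExUnit tag filters can keep slow or LLM-backed evals out of the default test run. The fragment below is a minimal sketch using only ExUnit's own options; excluding evals by default is an assumption about your workflow, not something Tribunal requires.

```elixir
# test/test_helper.exs
# Assumption: evals are skipped unless explicitly requested.
ExUnit.start(exclude: [:eval])

# Run only the evals:             mix test --only eval
# Run evals alongside other tests: mix test --include eval
```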

Dataset-Driven Evals

# test/evals/datasets/questions.json
[
  {
    "input": "What's the return policy?",
    "context": "Returns within 30 days with receipt.",
    "expected": {
      "contains": ["30 days"],
      "faithful": {"threshold": 0.8}
    }
  }
]

Then run: mix tribunal.eval

Assertion Types

Deterministic (no LLM, instant)

  • contains - Output includes substring(s)
  • not_contains - Output excludes substring(s)
  • contains_any - Output includes at least one of the given substrings
  • contains_all - Output includes all of the given substrings
  • regex - Output matches pattern
  • is_json - Output is valid JSON
  • max_tokens - Output under token limit
  • latency_ms - Response within time limit
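
To make the substring and pattern checks above concrete, here is a minimal sketch in plain Elixir. The `DeterministicSketch` module and its function names are hypothetical and illustrate the semantics only; they are not Tribunal's implementation.

```elixir
defmodule DeterministicSketch do
  # Hypothetical helpers mirroring the intent of the deterministic checks.

  # True when at least one substring occurs in the output.
  def contains_any?(output, substrings), do: String.contains?(output, substrings)

  # True when every substring occurs in the output.
  def contains_all?(output, substrings) do
    Enum.all?(substrings, &String.contains?(output, &1))
  end

  # True when none of the substrings occur in the output.
  def not_contains?(output, substrings), do: not String.contains?(output, substrings)

  # True when the output matches the regular expression.
  def matches?(output, pattern), do: Regex.match?(pattern, output)
end

output = "Returns are accepted within 30 days with a receipt."

DeterministicSketch.contains_any?(output, ["30 days", "60 days"]) #=> true
DeterministicSketch.contains_all?(output, ["30 days", "60 days"]) #=> false
DeterministicSketch.not_contains?(output, ["no refunds"])         #=> true
DeterministicSketch.matches?(output, ~r/\d+ days/)                #=> true
```

Checks like these need no model call, which is why they run instantly and give the same result on every run.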

LLM-as-Judge (requires req_llm)

  • faithful - Response grounded in context
  • relevant - Response addresses query
  • hallucination - Response contains fabricated info
  • correctness - Response matches expected answer
  • refusal - Output is a refusal
  • bias - Response contains bias or stereotypes
  • toxicity - Response contains harmful content
  • harmful - Response contains dangerous content
  • jailbreak - Response indicates safety bypass
  • pii - Response contains personal information

Embedding (requires alike)

  • similar - Semantic similarity to golden answer

Installation

def deps do
  [
    {:tribunal, "~> 1.3"},

    # Optional: LLM-as-judge metrics
    {:req_llm, "~> 1.2"},

    # Optional: embedding similarity
    {:alike, "~> 0.4"}
  ]
end

Summary

Functions

  • available_assertions() - Returns available assertion types based on loaded dependencies.
  • evaluate(test_case, assertions) - Evaluates a test case against assertions.
  • test_case(attrs) - Creates a new test case.

Functions

available_assertions()

Returns available assertion types based on loaded dependencies.
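
One way to picture "based on loaded dependencies" is a check for whether the optional libraries' modules can be loaded. The sketch below is an assumption about the general technique, not Tribunal's actual code: `AvailabilitySketch` is hypothetical, the module names `ReqLLM` and `Alike` are assumed entry points for the optional packages, and the list-of-atoms return shape is assumed as well.

```elixir
defmodule AvailabilitySketch do
  # Assertion names taken from the lists in this document.
  @deterministic [:contains, :not_contains, :contains_any, :contains_all,
                  :regex, :is_json, :max_tokens, :latency_ms]
  @llm_judge [:faithful, :relevant, :hallucination, :correctness, :refusal,
              :bias, :toxicity, :harmful, :jailbreak, :pii]
  @embedding [:similar]

  # `loaded?` is injectable so the logic can be exercised without the
  # optional packages installed; it defaults to Code.ensure_loaded?/1.
  def available(loaded? \\ &Code.ensure_loaded?/1) do
    llm = if loaded?.(ReqLLM), do: @llm_judge, else: []
    embedding = if loaded?.(Alike), do: @embedding, else: []
    @deterministic ++ llm ++ embedding
  end
end

# With neither optional package available, only deterministic checks remain.
AvailabilitySketch.available(fn _module -> false end)
```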

evaluate(test_case, assertions)

Evaluates a test case against assertions.

Examples

test_case = %Tribunal.TestCase{
  input: "What's the return policy?",
  actual_output: "Returns within 30 days.",
  context: ["Return policy: 30 days with receipt."]
}

assertions = [
  {:contains, [value: "30 days"]},
  {:faithful, [threshold: 0.8]}
]

Tribunal.evaluate(test_case, assertions)
#=> %{contains: {:pass, ...}, faithful: {:pass, ...}}
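
A result map of this shape can be reduced to a list of failures with a comprehension. This is a minimal sketch: the `ResultSummary` module is hypothetical, and both the `{:fail, details}` tuple and the contents of `details` are assumptions, since the example above only shows `{:pass, ...}`.

```elixir
defmodule ResultSummary do
  # Collect {assertion_name, details} for every failed assertion.
  # Entries that do not match {:fail, _} are skipped by the pattern.
  def failures(results) do
    for {name, {:fail, details}} <- results, do: {name, details}
  end

  # True when no assertion failed.
  def passed?(results), do: failures(results) == []
end

# Hand-written stand-in for the map returned by an evaluation.
results = %{
  contains: {:pass, %{}},
  faithful: {:fail, %{reason: "claim not supported by context"}}
}

ResultSummary.failures(results)
#=> [faithful: %{reason: "claim not supported by context"}]

ResultSummary.passed?(results)
#=> false
```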

test_case(attrs)

Creates a new test case.

Examples

Tribunal.test_case(
  input: "What's the price?",
  actual_output: "The price is $29.99.",
  context: ["Product costs $29.99"]
)