LLM evaluation framework for Elixir.
Tribunal provides tools for evaluating LLM outputs, detecting hallucinations, and measuring response quality.
Quick Start
In Tests (ExUnit)
defmodule MyApp.RAGEvalTest do
  use ExUnit.Case
  use Tribunal.EvalCase

  @moduletag :eval

  # Context the answer should be grounded in (illustrative value).
  @docs ["Returns within 30 days with receipt."]

  test "response is grounded in context" do
    response = MyApp.RAG.query("What's the return policy?")

    assert_contains response, "30 days"
    assert_faithful response, context: @docs, threshold: 0.8
  end
end
Dataset-Driven Evals
# test/evals/datasets/questions.json
[
  {
    "input": "What's the return policy?",
    "context": "Returns within 30 days with receipt.",
    "expected": {
      "contains": ["30 days"],
      "faithful": {"threshold": 0.8}
    }
  }
]
Then run: mix tribunal.eval
Assertion Types
Deterministic (no LLM, instant)
contains - Output includes substring(s)
not_contains - Output excludes substring(s)
contains_any - Output includes at least one
contains_all - Output includes all
regex - Output matches pattern
is_json - Output is valid JSON
max_tokens - Output under token limit
latency_ms - Response within time limit
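To make the substring and pattern checks above concrete, here is a minimal sketch in plain Elixir. It is not Tribunal's implementation: the module name `DeterministicSketch`, the `check/3` function, and the `:pass` / `{:fail, reason}` return shape are all hypothetical and exist only for illustration.

```elixir
defmodule DeterministicSketch do
  @moduledoc "Hypothetical stand-in for a few deterministic checks; not Tribunal's code."

  # Each clause returns :pass or {:fail, reason}.
  def check(:contains, output, value) when is_binary(value) do
    verdict(String.contains?(output, value), "missing #{inspect(value)}")
  end

  def check(:not_contains, output, value) when is_binary(value) do
    verdict(not String.contains?(output, value), "found #{inspect(value)}")
  end

  def check(:contains_any, output, values) when is_list(values) do
    found? = Enum.any?(values, &String.contains?(output, &1))
    verdict(found?, "none of #{inspect(values)} found")
  end

  def check(:contains_all, output, values) when is_list(values) do
    missing = Enum.reject(values, &String.contains?(output, &1))
    verdict(missing == [], "missing #{inspect(missing)}")
  end

  def check(:regex, output, %Regex{} = pattern) do
    verdict(Regex.match?(pattern, output), "no match for #{inspect(pattern)}")
  end

  defp verdict(true, _reason), do: :pass
  defp verdict(false, reason), do: {:fail, reason}
end

output = "Returns are accepted within 30 days with a receipt."
IO.inspect(DeterministicSketch.check(:contains, output, "30 days"))
IO.inspect(DeterministicSketch.check(:contains_all, output, ["30 days", "refund"]))
```

Checks of this kind need no model call, which is why they can run instantly on every test run.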
LLM-as-Judge (requires req_llm)
faithful - Response grounded in context
relevant - Response addresses query
hallucination - Response contains fabricated info
correctness - Response matches expected answer
refusal - Output is a refusal
bias - Response contains bias or stereotypes
toxicity - Response contains harmful content
harmful - Response contains dangerous content
jailbreak - Response indicates safety bypass
pii - Response contains personal information
Embedding (requires alike)
similar - Semantic similarity to golden answer
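Embedding-based similarity is commonly scored as the cosine of the angle between two embedding vectors and then compared with a threshold. The sketch below shows that idea in plain Elixir. Whether alike or Tribunal score similarity this way is an assumption, and `SimilaritySketch`, `cosine/2`, and `similar?/3` are hypothetical names; the vectors are made-up stand-ins for real embeddings.

```elixir
defmodule SimilaritySketch do
  # Cosine similarity between two equal-length, non-empty vectors (lists of numbers).
  def cosine(a, b) when length(a) == length(b) and a != [] do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm_a = :math.sqrt(Enum.reduce(a, 0.0, fn x, acc -> acc + x * x end))
    norm_b = :math.sqrt(Enum.reduce(b, 0.0, fn x, acc -> acc + x * x end))

    # A zero vector has no direction; treat it as having no similarity.
    if norm_a == 0.0 or norm_b == 0.0, do: 0.0, else: dot / (norm_a * norm_b)
  end

  # True when the similarity reaches the chosen threshold.
  def similar?(a, b, threshold), do: cosine(a, b) >= threshold
end

golden = [1.0, 0.0]
answer = [1.0, 1.0]
IO.inspect(SimilaritySketch.similar?(golden, answer, 0.8))
```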
Installation
def deps do
  [
    {:tribunal, "~> 0.1"},
    # Optional: LLM-as-judge metrics
    {:req_llm, "~> 1.2"},
    # Optional: embedding similarity
    {:alike, "~> 0.4"}
  ]
end
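Because the ExUnit example in Quick Start tags its module with :eval, one common ExUnit setup is to exclude that tag by default so that LLM-backed evals do not run on every mix test. This is a suggestion based on standard ExUnit behaviour, not something the Tribunal documentation prescribes:

```elixir
# test/test_helper.exs
# Skip tests tagged :eval unless they are requested explicitly.
ExUnit.start(exclude: [:eval])
```

With this in place, mix test --only eval runs just the tagged evals, and mix test --include eval runs them alongside the rest of the suite.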
Functions
Returns available assertion types based on loaded dependencies.
Evaluates a test case against assertions.
Examples
test_case = %Tribunal.TestCase{
  input: "What's the return policy?",
  actual_output: "Returns within 30 days.",
  context: ["Return policy: 30 days with receipt."]
}

assertions = [
  {:contains, [value: "30 days"]},
  {:faithful, [threshold: 0.8]}
]

Tribunal.evaluate(test_case, assertions)
#=> %{contains: {:pass, ...}, faithful: {:pass, ...}}
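The result above is a map keyed by assertion name, so it can be summarized with ordinary Enum functions. The sketch below assumes each value is a tuple whose first element is :pass or :fail. Only the :pass shape appears in the example, so the :fail shape is a guess. `ResultSketch` is a hypothetical helper, and the results map is hand-written rather than real Tribunal output.

```elixir
defmodule ResultSketch do
  # Splits a result map into passing and failing assertion names.
  # Assumes every value is a tuple tagged with :pass or :fail in first position.
  def summarize(results) when is_map(results) do
    {passed, failed} =
      Enum.split_with(results, fn {_name, verdict} -> elem(verdict, 0) == :pass end)

    %{
      passed: passed |> Enum.map(fn {name, _} -> name end) |> Enum.sort(),
      failed: failed |> Enum.map(fn {name, _} -> name end) |> Enum.sort(),
      all_passed?: failed == []
    }
  end
end

# Hand-written stand-in for a result map.
results = %{contains: {:pass, %{}}, faithful: {:fail, %{reason: "unsupported claim"}}}
IO.inspect(ResultSketch.summarize(results))
```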
Creates a new test case.
Examples
Tribunal.test_case(
  input: "What's the price?",
  actual_output: "The price is $29.99.",
  context: ["Product costs $29.99"]
)
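A constructor that takes a keyword list, like the call above, is often a thin wrapper around a struct. The following sketch shows that pattern with a hypothetical `SketchCase` struct that mirrors the three fields used in these examples. It is not Tribunal.TestCase, and how Tribunal.test_case builds its value internally is not shown in this documentation.

```elixir
defmodule SketchCase do
  # Hypothetical struct mirroring the fields used in the examples above.
  defstruct [:input, :actual_output, context: []]

  # struct!/2 raises KeyError on unknown keys, so typos surface early.
  def new(opts) when is_list(opts), do: struct!(__MODULE__, opts)
end

tc =
  SketchCase.new(
    input: "What's the price?",
    actual_output: "The price is $29.99.",
    context: ["Product costs $29.99"]
  )

IO.inspect(tc.context)
```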