Evaluation

Measure and improve your retrieval quality with Arcana's evaluation system.

Overview

Arcana provides tools to evaluate how well your RAG pipeline retrieves relevant information:

  1. Test Cases - Questions paired with their known relevant chunks
  2. Evaluation Runs - Execute searches and measure performance
  3. Metrics - Standard IR metrics (MRR, Precision, Recall, Hit Rate)
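
A typical workflow ties these together: generate test cases, run an evaluation, and inspect the metrics. The sketch below uses only the functions documented in the sections that follow.

# Generate test cases, evaluate retrieval, and check the headline metric
{:ok, _test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50
)

{:ok, run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :semantic)
IO.puts("MRR: #{run.metrics.mrr}")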

Creating Test Cases

Manual Test Cases

Create test cases when you know which chunks should be retrieved for a question:

# First, find the chunk you want to use as ground truth
chunks = Arcana.search("GenServer state", repo: MyApp.Repo, limit: 1)
chunk = hd(chunks)

# Create a test case linking question to relevant chunk
{:ok, test_case} = Arcana.Evaluation.create_test_case(
  repo: MyApp.Repo,
  question: "How do you manage state in Elixir?",
  relevant_chunk_ids: [chunk.id]
)

Synthetic Test Cases

Generate test cases automatically using an LLM:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50
)

The generator samples random chunks and asks the LLM to create questions that should retrieve those chunks.
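
To spot-check what the generator produced, you can print the question/chunk pairs. The field names below (question, relevant_chunk_ids) are assumed to mirror the manual example above; adjust if your version exposes different fields.

# Spot-check generated test cases (field names assumed from the manual example)
Enum.each(test_cases, fn tc ->
  IO.puts("#{tc.question} -> #{inspect(tc.relevant_chunk_ids)}")
end)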

Filtering by Collection

Generate test cases from a specific collection:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50,
  collection: "elixir-docs"
)

Using the Mix Task

# Generate 50 test cases (default)
mix arcana.eval.generate

# Custom sample size
mix arcana.eval.generate --sample-size 100

# From a specific collection
mix arcana.eval.generate --collection elixir-docs

# From a specific source
mix arcana.eval.generate --source-id my-source

Running Evaluations

Run an evaluation against all test cases:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic  # or :fulltext, :hybrid
)

Evaluating Answer Quality

For end-to-end RAG evaluation, you can also evaluate the quality of generated answers:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic,
  evaluate_answers: true,
  llm: Application.get_env(:arcana, :llm)
)

# Includes faithfulness metric
run.metrics.faithfulness  # => 7.8 (0-10 scale)

When evaluate_answers: true is set, the evaluation:

  1. Generates an answer for each test case using the retrieved chunks
  2. Uses LLM-as-judge to score how faithful the answer is to the context
  3. Aggregates scores into an overall faithfulness metric

Faithfulness measures whether the generated answer is grounded in the retrieved chunks (0 = hallucinated, 10 = fully faithful).
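
The aggregation in step 3 can be pictured as averaging the per-case judge scores. This is an illustrative sketch only, not Arcana's internal code:

# Illustrative only: average per-case judge scores (0-10) into one run-level value
scores = [8, 9, 6, 8]
faithfulness = Enum.sum(scores) / length(scores)
# => 7.75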

Using the Mix Task

# Run with semantic search (default)
mix arcana.eval.run

# Run with hybrid search
mix arcana.eval.run --mode hybrid

# Run with full-text search
mix arcana.eval.run --mode fulltext

Understanding Results

# Overall metrics
run.metrics
# => %{
#   recall_at_1: 0.62,
#   recall_at_3: 0.78,
#   recall_at_5: 0.84,
#   recall_at_10: 0.91,
#   precision_at_1: 0.62,
#   precision_at_3: 0.52,
#   precision_at_5: 0.34,
#   precision_at_10: 0.18,
#   mrr: 0.76,
#   hit_rate_at_1: 0.62,
#   hit_rate_at_3: 0.78,
#   hit_rate_at_5: 0.84,
#   hit_rate_at_10: 0.91
# }

# Per-case results
run.results
# => %{"case-id" => %{hit: true, rank: 2, ...}, ...}

# Configuration used
run.config
# => %{mode: :semantic, embedding: %{model: "...", dimensions: 384}}
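
The per-case results are handy for debugging: filtering for misses shows exactly which questions retrieval failed on. A minimal sketch, assuming the hit field shown above:

# Collect the IDs of test cases where no relevant chunk was retrieved
missed_case_ids =
  run.results
  |> Enum.filter(fn {_case_id, result} -> not result.hit end)
  |> Enum.map(fn {case_id, _result} -> case_id end)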

Comparing Configurations

Run evaluations with different settings to find the best configuration:

# Test semantic search
{:ok, semantic_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :semantic)

# Test hybrid search
{:ok, hybrid_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :hybrid)

# Compare
IO.puts("Semantic MRR: #{semantic_run.metrics.mrr}")
IO.puts("Hybrid MRR: #{hybrid_run.metrics.mrr}")

Managing Test Cases and Runs

# List all test cases
test_cases = Arcana.Evaluation.list_test_cases(repo: MyApp.Repo)

# Get a specific test case
test_case = Arcana.Evaluation.get_test_case(id, repo: MyApp.Repo)

# Delete a test case
{:ok, _} = Arcana.Evaluation.delete_test_case(id, repo: MyApp.Repo)

# List past runs
runs = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)

# Delete a run
{:ok, _} = Arcana.Evaluation.delete_run(run_id, repo: MyApp.Repo)

Dashboard

The Arcana Dashboard provides a visual interface for evaluation:

  • Test Cases tab - View, generate, and delete test cases
  • Run Evaluation tab - Execute evaluations with different search modes
  • History tab - View past runs with metrics

See the Dashboard Guide for setup instructions.

Metrics Explained

Retrieval Metrics

Metric | Description | Good Value
MRR (Mean Reciprocal Rank) | Average of 1/rank for first relevant result | > 0.7
Recall@K | Fraction of relevant chunks found in top K | > 0.8
Precision@K | Fraction of top K results that are relevant | > 0.6
Hit Rate@K | Fraction of queries with at least one relevant result in top K | > 0.9
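
As a quick worked example of MRR: if the first relevant result lands at ranks 1, 2, and 4 across three queries, the reciprocal ranks are 1, 0.5, and 0.25, and MRR is their mean:

# MRR for first-relevant ranks of 1, 2, and 4
reciprocal_ranks = [1 / 1, 1 / 2, 1 / 4]
mrr = Enum.sum(reciprocal_ranks) / length(reciprocal_ranks)
# => 0.5833333333333334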

Answer Quality Metrics

Metric | Description | Good Value
Faithfulness | How well the answer is grounded in retrieved context (0-10) | > 7.0

Which Metric to Focus On?

  • MRR - Best for single-answer scenarios where you need the relevant chunk first
  • Recall@K - Important when you need to find all relevant information
  • Precision@K - Matters when you want to minimize irrelevant context
  • Hit Rate@K - Good baseline to ensure retrieval is working at all
  • Faithfulness - Essential for preventing hallucinations in generated answers

Best Practices

  1. Diverse test cases - Cover different topics and question types
  2. Sufficient sample size - Aim for 50+ test cases for reliable metrics
  3. Regular evaluation - Re-run after changing embeddings, chunking, or search settings
  4. Track over time - Compare runs to ensure changes improve quality (see the sketch after this list)
  5. Use collection filtering - Evaluate specific document collections separately
  6. Test all search modes - Compare semantic, fulltext, and hybrid to find what works best
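
For tracking quality over time (best practice 4), past runs can be compared directly, since each run stores its metrics and configuration. A minimal sketch using list_runs; the fields on each run are assumed to match the ones shown in Understanding Results:

# Print the search mode and MRR of recent runs to watch for regressions
Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)
|> Enum.each(fn run ->
  IO.puts("#{run.config.mode} -> MRR #{run.metrics.mrr}")
end)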