Evaluation

Measure and improve your retrieval quality with Arcana's evaluation system.

Overview

Arcana provides tools to evaluate how well your RAG pipeline retrieves relevant information:

  1. Test Cases - Questions paired with their known relevant chunks
  2. Evaluation Runs - Execute searches and measure performance
  3. Metrics - Standard IR metrics (MRR, Precision, Recall, Hit Rate)
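
A typical workflow ties these together: generate test cases, run an evaluation, and inspect the metrics. The sketch below uses only the functions documented in the sections that follow.

# Generate test cases, evaluate retrieval, and check the headline metric
{:ok, _test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50
)

{:ok, run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :semantic)
IO.puts("MRR: #{run.metrics.mrr}")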

Creating Test Cases

Manual Test Cases

Create test cases when you know which chunks should be retrieved for a question:

# First, find the chunk you want to use as ground truth
chunks = Arcana.search("GenServer state", repo: MyApp.Repo, limit: 1)
chunk = hd(chunks)

# Create a test case linking question to relevant chunk
{:ok, test_case} = Arcana.Evaluation.create_test_case(
  repo: MyApp.Repo,
  question: "How do you manage state in Elixir?",
  relevant_chunk_ids: [chunk.id]
)

Synthetic Test Cases

Generate test cases automatically using an LLM:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50
)

The generator samples random chunks and asks the LLM to create questions that should retrieve those chunks.
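
To spot-check what the generator produced, you can print the question/chunk pairs. The field names below (question, relevant_chunk_ids) are assumed to mirror the manual example above; adjust if your version exposes different fields.

# Spot-check generated test cases (field names assumed from the manual example)
Enum.each(test_cases, fn tc ->
  IO.puts("#{tc.question} -> #{inspect(tc.relevant_chunk_ids)}")
end)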

Filtering by Collection

Generate test cases from a specific collection:

{:ok, test_cases} = Arcana.Evaluation.generate_test_cases(
  repo: MyApp.Repo,
  llm: Application.get_env(:arcana, :llm),
  sample_size: 50,
  collection: "elixir-docs"
)

Using the Mix Task

# Generate 50 test cases (default)
mix arcana.eval.generate

# Custom sample size
mix arcana.eval.generate --sample-size 100

# From a specific collection
mix arcana.eval.generate --collection elixir-docs

# From a specific source
mix arcana.eval.generate --source-id my-source

Running Evaluations

Run an evaluation against all test cases:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic  # or :fulltext, :hybrid
)

Evaluating Answer Quality

For end-to-end RAG evaluation, you can also evaluate the quality of generated answers:

{:ok, run} = Arcana.Evaluation.run(
  repo: MyApp.Repo,
  mode: :semantic,
  evaluate_answers: true,
  llm: Application.get_env(:arcana, :llm)
)

# Includes faithfulness metric
run.metrics.faithfulness  # => 7.8 (0-10 scale)

When evaluate_answers: true is set, the evaluation:

  1. Generates an answer for each test case using the retrieved chunks
  2. Uses LLM-as-judge to score how faithful the answer is to the context
  3. Aggregates scores into an overall faithfulness metric

Faithfulness measures whether the generated answer is grounded in the retrieved chunks (0 = hallucinated, 10 = fully faithful).
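
The aggregation in step 3 can be pictured as averaging the per-case judge scores. This is an illustrative sketch only, not Arcana's internal code:

# Illustrative only: average per-case judge scores (0-10) into one run-level value
scores = [8, 9, 6, 8]
faithfulness = Enum.sum(scores) / length(scores)
# => 7.75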

Using the Mix Task

# Run with semantic search (default)
mix arcana.eval.run

# Run with hybrid search
mix arcana.eval.run --mode hybrid

# Run with full-text search
mix arcana.eval.run --mode fulltext

Understanding Results

# Overall metrics
run.metrics
# => %{
#   recall_at_1: 0.62,
#   recall_at_3: 0.78,
#   recall_at_5: 0.84,
#   recall_at_10: 0.91,
#   precision_at_1: 0.62,
#   precision_at_3: 0.52,
#   precision_at_5: 0.34,
#   precision_at_10: 0.18,
#   mrr: 0.76,
#   hit_rate_at_1: 0.62,
#   hit_rate_at_3: 0.78,
#   hit_rate_at_5: 0.84,
#   hit_rate_at_10: 0.91
# }

# Per-case results
run.results
# => %{"case-id" => %{hit: true, rank: 2, ...}, ...}

# Configuration used
run.config
# => %{mode: :semantic, embedding: %{model: "...", dimensions: 384}}
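
The per-case results are handy for debugging: filtering for misses shows exactly which questions retrieval failed on. A minimal sketch, assuming the hit field shown above:

# Collect the IDs of test cases where no relevant chunk was retrieved
missed_case_ids =
  run.results
  |> Enum.filter(fn {_case_id, result} -> not result.hit end)
  |> Enum.map(fn {case_id, _result} -> case_id end)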

Comparing Configurations

Run evaluations with different settings to find the best configuration:

# Test semantic search
{:ok, semantic_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :semantic)

# Test hybrid search
{:ok, hybrid_run} = Arcana.Evaluation.run(repo: MyApp.Repo, mode: :hybrid)

# Compare
IO.puts("Semantic MRR: #{semantic_run.metrics.mrr}")
IO.puts("Hybrid MRR: #{hybrid_run.metrics.mrr}")

Managing Test Cases and Runs

# List all test cases
test_cases = Arcana.Evaluation.list_test_cases(repo: MyApp.Repo)

# Get a specific test case
test_case = Arcana.Evaluation.get_test_case(id, repo: MyApp.Repo)

# Delete a test case
{:ok, _} = Arcana.Evaluation.delete_test_case(id, repo: MyApp.Repo)

# List past runs
runs = Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)

# Delete a run
{:ok, _} = Arcana.Evaluation.delete_run(run_id, repo: MyApp.Repo)

Dashboard

The Arcana Dashboard provides a visual interface for evaluation:

  • Test Cases tab - View, generate, and delete test cases
  • Run Evaluation tab - Execute evaluations with different search modes
  • History tab - View past runs with metrics

See the Dashboard Guide for setup instructions.

Metrics Explained

Retrieval Metrics

Metric | Description | Good Value
MRR (Mean Reciprocal Rank) | Average of 1/rank for first relevant result | > 0.7
Recall@K | Fraction of relevant chunks found in top K | > 0.8
Precision@K | Fraction of top K results that are relevant | > 0.6
Hit Rate@K | Fraction of queries with at least one relevant result in top K | > 0.9
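
As a quick worked example of MRR: if the first relevant result lands at ranks 1, 2, and 4 across three queries, the reciprocal ranks are 1, 0.5, and 0.25, and MRR is their mean:

# MRR for first-relevant ranks of 1, 2, and 4
reciprocal_ranks = [1 / 1, 1 / 2, 1 / 4]
mrr = Enum.sum(reciprocal_ranks) / length(reciprocal_ranks)
# => 0.5833333333333334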

Answer Quality Metrics

Metric | Description | Good Value
Faithfulness | How well the answer is grounded in retrieved context (0-10) | > 7.0

Which Metric to Focus On?

  • MRR - Best for single-answer scenarios where you need the relevant chunk first
  • Recall@K - Important when you need to find all relevant information
  • Precision@K - Matters when you want to minimize irrelevant context
  • Hit Rate@K - Good baseline to ensure retrieval is working at all
  • Faithfulness - Essential for preventing hallucinations in generated answers

Best Practices

  1. Diverse test cases - Cover different topics and question types
  2. Sufficient sample size - Aim for 50+ test cases for reliable metrics
  3. Regular evaluation - Re-run after changing embeddings, chunking, or search settings
  4. Track over time - Compare runs to ensure changes improve quality (see the sketch after this list)
  5. Use collection filtering - Evaluate specific document collections separately
  6. Test all search modes - Compare semantic, fulltext, and hybrid to find what works best
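
For tracking quality over time (best practice 4), past runs can be compared directly, since each run stores its metrics and configuration. A minimal sketch using list_runs; the fields on each run are assumed to match the ones shown in Understanding Results:

# Print the search mode and MRR of recent runs to watch for regressions
Arcana.Evaluation.list_runs(repo: MyApp.Repo, limit: 10)
|> Enum.each(fn run ->
  IO.puts("#{run.config.mode} -> MRR #{run.metrics.mrr}")
end)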