DeepEvalEx

An LLM evaluation framework for Elixir: an idiomatic, compatible Elixir port of DeepEval.

Attribution: This project is a derivative work of DeepEval by Confident AI, licensed under Apache 2.0. The core evaluation algorithms, metrics, and prompt templates are derived from the original Python implementation.

Installation

Add deep_eval_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:deep_eval_ex, "~> 0.1.0"}
  ]
end

Quick Start

# Create a test case
test_case = DeepEvalEx.TestCase.new!(
  input: "What is the capital of France?",
  actual_output: "The capital of France is Paris.",
  expected_output: "Paris"
)

# Evaluate with ExactMatch metric
{:ok, result} = DeepEvalEx.Metrics.ExactMatch.measure(test_case)

# Check result
result.score      # => 0.0 (not an exact match)
result.success    # => false
result.reason     # => "The actual and expected outputs are different."

Configuration

Configure your LLM provider in config/config.exs:

config :deep_eval_ex,
  default_model: {:openai, "gpt-4o-mini"},
  openai_api_key: System.get_env("OPENAI_API_KEY"),
  default_threshold: 0.5
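
Note that config/config.exs is evaluated when the project is built, so a System.get_env/1 call there captures the API key at build time. For releases, a common alternative is to read the key in config/runtime.exs, which is evaluated at boot. The following is a sketch only; it assumes DeepEvalEx reads :openai_api_key from the application environment at call time, which this README does not confirm:

```elixir
# config/runtime.exs
import Config

# Assumption: DeepEvalEx looks up :openai_api_key at runtime, so setting it
# here (rather than in config/config.exs) is enough for releases.
config :deep_eval_ex,
  openai_api_key: System.get_env("OPENAI_API_KEY")
```

The remaining options (default_model, default_threshold) do not depend on the environment and can stay in config/config.exs.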

Available Metrics

Metric               Purpose
ExactMatch           Simple string comparison
GEval                Flexible criteria-based evaluation using LLM-as-judge
Faithfulness         RAG: claims supported by retrieval context
Hallucination        Detects unsupported statements
AnswerRelevancy      Response relevance to input question
ContextualPrecision  RAG retrieval ranking quality
ContextualRecall     RAG coverage of ground truth

See the Metrics Overview for detailed documentation on each metric.

Documentation

Guide               Description
Quick Start         Get up and running in 5 minutes
Configuration       LLM provider setup and options
Metrics Overview    All available metrics explained
ExUnit Integration  Test assertions for CI/CD
Custom Metrics      Build your own evaluation metrics
Telemetry           Observability and monitoring

API Reference

Architecture

LLM Adapters

DeepEvalEx supports multiple LLM providers:

  • OpenAI - GPT-4o, GPT-4o-mini, GPT-3.5-turbo
  • Anthropic - Claude 3 family (planned)
  • Ollama - Local models (planned)

See LLM Adapters and Custom LLM Adapters for details.

Usage with ExUnit

defmodule MyApp.LLMTest do
  use ExUnit.Case

  alias DeepEvalEx.{TestCase, Metrics}

  test "LLM generates accurate responses" do
    # get_llm_response/1 is a placeholder for your application's own LLM call.
    test_case = TestCase.new!(
      input: "What is 2 + 2?",
      actual_output: get_llm_response("What is 2 + 2?"),
      expected_output: "4"
    )

    {:ok, result} = Metrics.ExactMatch.measure(test_case)
    assert result.success, result.reason
  end
end

Concurrent Evaluation

Evaluate multiple test cases concurrently:

alias DeepEvalEx.{TestCase, Metrics}

test_cases = [
  TestCase.new!(input: "Q1", actual_output: "A1", expected_output: "A1"),
  TestCase.new!(input: "Q2", actual_output: "A2", expected_output: "A2")
]

# Run every listed metric against every test case, up to 20 at a time.
results = DeepEvalEx.evaluate_batch(test_cases, [Metrics.ExactMatch],
  concurrency: 20
)
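
The concurrency option bounds how many evaluations are in flight at once. How DeepEvalEx implements this internally is not shown in this README; the following is a minimal sketch of the general technique using Task.async_stream from the Elixir standard library. The ConcurrencySketch module and its exact_match/1 function are hypothetical stand-ins, not part of DeepEvalEx:

```elixir
defmodule ConcurrencySketch do
  # Hypothetical stand-in for a metric: 1.0 on an exact string match, else 0.0.
  # This is not DeepEvalEx.Metrics.ExactMatch.
  def exact_match(%{actual_output: actual, expected_output: expected}) do
    if actual == expected, do: 1.0, else: 0.0
  end

  # Scores every case with at most `max_concurrency` tasks running at once.
  # Task.async_stream/3 returns results in input order by default.
  def score_all(cases, max_concurrency) do
    cases
    |> Task.async_stream(&exact_match/1,
      max_concurrency: max_concurrency,
      timeout: 30_000
    )
    |> Enum.map(fn {:ok, score} -> score end)
  end
end

cases = [
  %{actual_output: "A1", expected_output: "A1"},
  %{actual_output: "A2", expected_output: "Paris"}
]

scores = ConcurrencySketch.score_all(cases, 2)
IO.inspect(scores)
```

With LLM-backed metrics most of the time is spent waiting on network calls, so a bounded pool like this keeps throughput high without exceeding provider rate limits. The right limit depends on your provider and plan.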

Telemetry

DeepEvalEx emits telemetry events for observability:

:telemetry.attach(
  "my-handler",
  [:deep_eval_ex, :metric, :stop],
  fn _event, measurements, metadata, _config ->
    IO.puts("Metric #{metadata.metric} completed with score #{measurements.score}")
  end,
  nil
)

See the Telemetry Guide for all events and integration patterns.
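
The :telemetry library logs a notice when a handler is an anonymous function, because local functions can carry a performance penalty. A module function capture avoids this. The sketch below assumes the same event name, measurements (score) and metadata (metric) as the example in this section; the MyApp.EvalTelemetry module and the handler ID are hypothetical, and the final call only simulates an event locally rather than running a real metric:

```elixir
# Fetches :telemetry for a standalone script; in a Mix project it would
# already be available as a dependency.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule MyApp.EvalTelemetry do
  # Builds the log line; kept separate so it can be checked without I/O.
  def format(measurements, metadata) do
    "Metric #{inspect(metadata.metric)} completed with score #{measurements.score}"
  end

  # Handler with the four arguments :telemetry passes to every handler.
  def handle_event([:deep_eval_ex, :metric, :stop], measurements, metadata, _config) do
    IO.puts(format(measurements, metadata))
  end
end

attach_result =
  :telemetry.attach(
    "my-app-metric-stop",
    [:deep_eval_ex, :metric, :stop],
    &MyApp.EvalTelemetry.handle_event/4,
    nil
  )

# Simulated event for illustration only; DeepEvalEx would emit the real one.
:telemetry.execute([:deep_eval_ex, :metric, :stop], %{score: 1.0}, %{metric: :exact_match})
```

Handler IDs must be unique, so attaching a second handler under the same ID returns {:error, :already_exists}.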

License

Apache 2.0 - See LICENSE and NOTICE for details.

This project is a derivative work of DeepEval by Confident AI, also licensed under Apache 2.0.