GEval Metric

G-Eval is a flexible LLM-as-a-judge evaluation framework that scores outputs against custom criteria you describe in natural language.

Based on the paper: G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

When to Use

  • Custom evaluation criteria that don't fit predefined metrics
  • Subjective quality assessment (helpfulness, tone, clarity)
  • Domain-specific evaluation requirements
  • When you need explainable scoring with reasoning

How It Works

  1. Define criteria - Describe what you want to evaluate
  2. Generate evaluation steps - LLM creates concrete steps from your criteria
  3. Evaluate - LLM scores the test case using the steps
  4. Get result - Normalized score (0-1) with detailed reasoning

Evaluation Parameters

Select which test case fields the judge sees via evaluation_params:

| Parameter | Description |
| --- | --- |
| :input | The input prompt |
| :actual_output | The LLM's response |
| :expected_output | Ground truth (optional) |
| :retrieval_context | Retrieved context (optional) |

Basic Usage

alias DeepEvalEx.{TestCase, Metrics.GEval}

# Create a metric
metric = GEval.new(
  name: "Helpfulness",
  criteria: "Determine if the response is helpful and addresses the user's question completely",
  evaluation_params: [:input, :actual_output]
)

# Create a test case
test_case = TestCase.new!(
  input: "How do I make pasta?",
  actual_output: "Boil water, add pasta, cook for 8-10 minutes, drain and serve with sauce."
)

# Evaluate
{:ok, result} = GEval.evaluate(metric, test_case)

result.score   # => 0.85 (normalized 0-1)
result.reason  # => "The response provides clear, actionable steps..."
result.success # => true (score >= threshold)

Options

Basic Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| :name | string | required | Name for this metric |
| :criteria | string | nil | Evaluation criteria description |
| :evaluation_params | list | required | Test case parameters to evaluate |
| :threshold | float | 0.5 | Pass/fail threshold |

Advanced Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| :evaluation_steps | list | nil | Pre-defined evaluation steps |
| :rubric | list | nil | Scoring rubric with descriptions |
| :score_range | tuple | {0, 10} | Min/max score range |
| :strict_mode | boolean | false | Binary 0/1 scoring |
| :model | tuple/string | default | LLM model to use |

Pre-defined Evaluation Steps

Skip automatic step generation by supplying your own steps:

metric = GEval.new(
  name: "Code Quality",
  evaluation_params: [:input, :actual_output],
  evaluation_steps: [
    "Check if the code is syntactically correct",
    "Verify the code addresses the requirements",
    "Assess code readability and style",
    "Check for potential bugs or edge cases"
  ]
)

Using a Rubric

Provide explicit scoring guidance:

metric = GEval.new(
  name: "Response Quality",
  criteria: "Evaluate the overall quality of the response",
  evaluation_params: [:input, :actual_output],
  rubric: [
    {10, "Perfect response - comprehensive, accurate, well-structured"},
    {8, "Excellent response - minor improvements possible"},
    {6, "Good response - addresses main points but lacks detail"},
    {4, "Acceptable response - partially addresses the question"},
    {2, "Poor response - mostly irrelevant or incorrect"},
    {0, "Unacceptable - does not address the question at all"}
  ]
)

Strict Mode

For binary pass/fail evaluation:

metric = GEval.new(
  name: "Factual Accuracy",
  criteria: "The response must be 100% factually accurate",
  evaluation_params: [:input, :actual_output, :expected_output],
  strict_mode: true  # Returns only 0 or 1
)

Custom Score Range

metric = GEval.new(
  name: "Rating",
  criteria: "Rate the response on a 1-5 scale",
  evaluation_params: [:input, :actual_output],
  score_range: {1, 5}
)

# Score is still normalized to 0-1 in results
# Raw score available in result.metadata.raw_score
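Concretely, the normalization is a min-max rescale. The helper below is an illustrative sketch (not the library's actual internal function) that reproduces the behavior shown above:

```elixir
# Min-max normalization from a {min, max} score range to 0-1.
# Illustrative only -- the library's internal implementation may differ.
normalize = fn raw, {min, max} -> (raw - min) / (max - min) end

normalize.(8.5, {0, 10})  # => 0.85
normalize.(3, {1, 5})     # => 0.5
```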

Multiple Evaluation Parameters

Include context for more accurate evaluation:

metric = GEval.new(
  name: "RAG Accuracy",
  criteria: "Evaluate if the response accurately uses the provided context",
  evaluation_params: [:input, :actual_output, :retrieval_context]
)

test_case = TestCase.new!(
  input: "What are the company's vacation policies?",
  actual_output: "Employees receive 20 days of PTO per year.",
  retrieval_context: [
    "Section 3.2: All full-time employees receive 20 days of paid time off annually.",
    "Section 3.3: PTO can be carried over up to 5 days per year."
  ]
)

Specifying LLM Model

# Use a specific model
metric = GEval.new(
  name: "Test",
  criteria: "...",
  evaluation_params: [:input, :actual_output],
  model: {:openai, "gpt-4o"}  # More capable model
)

# Or override at evaluation time
{:ok, result} = GEval.evaluate(metric, test_case,
  adapter: :anthropic,
  model: "claude-3-opus-20240229"
)

Result Structure

%DeepEvalEx.Result{
  metric: "Helpfulness [GEval]",
  score: 0.85,
  success: true,
  threshold: 0.5,
  reason: "The response provides clear, step-by-step instructions...",
  latency_ms: 1250,
  metadata: %{
    raw_score: 8.5,
    score_range: {0, 10},
    evaluation_steps: ["Step 1: ...", "Step 2: ..."],
    criteria: "Determine if the response is helpful..."
  }
}

Best Practices

Writing Good Criteria

# Good - specific and measurable
criteria: "The response should: 1) directly answer the question, 2) be factually accurate, 3) be concise (under 100 words)"

# Avoid - vague and subjective
criteria: "The response should be good"

Choosing Evaluation Parameters

  • Include :expected_output when you have ground truth
  • Include :retrieval_context for RAG evaluation
  • Fewer parameters = faster evaluation (fewer tokens)
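For example, a correctness-style metric with ground truth might select all three fields (an illustrative sketch; the name and criteria text are placeholders):

```elixir
metric = GEval.new(
  name: "Correctness",
  criteria: "Compare the actual output to the expected output and penalize factual disagreements",
  evaluation_params: [:input, :actual_output, :expected_output]
)
```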

Performance Tips

  • Use gpt-4o-mini for faster, cheaper evaluations
  • Pre-define :evaluation_steps to skip the step-generation call
  • Run evaluations concurrently with evaluate_batch/3
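A concurrent batch run might look like the following (a sketch: evaluate_batch/3 is named above, but its option names, including max_concurrency, are assumptions):

```elixir
test_cases = [
  TestCase.new!(input: "How do I make pasta?", actual_output: "Boil water, add pasta, cook 8-10 minutes."),
  TestCase.new!(input: "How do I cook rice?", actual_output: "Rinse, use a 2:1 water ratio, simmer covered 18 minutes.")
]

# max_concurrency is a hypothetical option -- check the evaluate_batch/3 docs.
{:ok, results} = GEval.evaluate_batch(metric, test_cases, max_concurrency: 4)
```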

Error Handling

case GEval.evaluate(metric, test_case) do
  {:ok, result} ->
    IO.puts("Score: #{result.score}")

  {:error, {:missing_params, params}} ->
    IO.puts("Missing: #{inspect(params)}")

  {:error, {:api_error, status, body}} ->
    IO.puts("API error: #{status}")
end

Source

Ported from deepeval/metrics/g_eval/g_eval.py

Reference paper: https://arxiv.org/pdf/2303.16634.pdf