Nous.Eval (nous v0.9.0)

Evaluation framework for Nous AI agents.

Nous.Eval provides comprehensive testing, evaluation, and optimization capabilities for AI agents. It supports running test suites against real LLM APIs and collecting detailed metrics.

Quick Start

# Define a test suite
suite = Nous.Eval.Suite.new(
  name: "my_agent_tests",
  default_model: "lmstudio:ministral-3-14b-reasoning",
  test_cases: [
    Nous.Eval.TestCase.new(
      id: "greeting",
      input: "Say hello",
      expected: %{contains: ["hello", "hi"]},
      eval_type: :contains
    )
  ]
)

# Run the evaluation
{:ok, result} = Nous.Eval.run(suite)

# Print results
Nous.Eval.Reporter.print(result)

Loading from YAML

{:ok, suite} = Nous.Eval.Suite.from_yaml("test/eval/suites/basic.yaml")
{:ok, result} = Nous.Eval.run(suite)

A/B Testing

{:ok, comparison} = Nous.Eval.run_ab(suite,
  config_a: [model_settings: %{temperature: 0.3}],
  config_b: [model_settings: %{temperature: 0.7}]
)

Mix Tasks

# Run all suites
mix nous.eval

# Run specific suite
mix nous.eval --suite basic

# Filter by tags
mix nous.eval --tags tool,streaming

Metrics

The evaluation framework tracks:

Correctness: Pass/fail rates and scores
Token usage: Input, output, and total tokens
Latency: Total duration, first token time, per-iteration timing
Tool usage: Call counts, errors, timing per tool
Cost estimation: Based on provider pricing

Evaluators

Built-in evaluators:

:exact_match - Output must exactly match expected
:fuzzy_match - String similarity above threshold
:contains - Output must contain expected substrings
:tool_usage - Verify correct tools were called
:schema - Validate structured output against Ecto schema
:llm_judge - Use an LLM to judge output quality

Custom Evaluators

defmodule MyEvaluator do
  @behaviour Nous.Eval.Evaluator

  @impl true
  def evaluate(actual, expected, config) do
    # Your evaluation logic
    %{score: 1.0, passed: true, reason: nil, details: %{}}
  end
end

test_case = TestCase.new(
  id: "custom",
  input: "...",
  expected: "...",
  eval_type: :custom,
  eval_config: %{evaluator: MyEvaluator}
)

Summary

Functions

run(suite, opts \\ [])

Run an evaluation suite.

run!(suite, opts \\ [])

Run an evaluation suite, raising on error.

run_ab(suite, opts \\ [])

Run A/B comparison between two configurations.

run_case(test_case, opts \\ [])

Run a single test case.

Functions

run(suite, opts \\ [])

@spec run(
  Nous.Eval.Suite.t(),
  keyword()
) :: {:ok, map()} | {:error, term()}

Run an evaluation suite.

Options

:model - Override the default model for all test cases
:agent_config - Additional agent configuration
:parallelism - Number of concurrent test cases (default: 1)
:timeout - Timeout per test case in ms (default: from suite or 30_000)
:retry_failed - Number of retries for failed tests (default: 0)
:tags - Only run test cases with these tags
:exclude_tags - Skip test cases with these tags

Returns

{:ok, result} - Run completed with results
{:error, reason} - Run failed to start

Example

{:ok, result} = Nous.Eval.run(suite)
IO.puts("Pass rate: #{result.pass_rate * 100}%")

run!(suite, opts \\ [])

@spec run!(
  Nous.Eval.Suite.t(),
  keyword()
) :: map()

Run an evaluation suite, raising on error.

run_ab(suite, opts \\ [])

@spec run_ab(
  Nous.Eval.Suite.t(),
  keyword()
) :: {:ok, map()} | {:error, term()}

Run A/B comparison between two configurations.

Runs the same suite with two different configurations and compares results.

Options

:config_a - Configuration for variant A (keyword list)
:config_b - Configuration for variant B (keyword list)
All other options from run/2

Returns

%{
  a: %{...result for config A...},
  b: %{...result for config B...},
  comparison: %{
    winner: :a | :b | :tie,
    score_diff: float(),
    token_diff: float(),
    latency_diff: float()
  }
}

Example

{:ok, comparison} = Nous.Eval.run_ab(suite,
  config_a: [model_settings: %{temperature: 0.3}],
  config_b: [model_settings: %{temperature: 0.7}]
)

IO.puts("Winner: #{comparison.comparison.winner}")

run_case(test_case, opts \\ [])

@spec run_case(
  Nous.Eval.TestCase.t(),
  keyword()
) :: {:ok, Nous.Eval.Result.t()} | {:error, term()}

Run a single test case.

Useful for debugging individual test cases.

Example

test_case = TestCase.new(id: "test", input: "Hello", expected: "hi")
{:ok, result} = Nous.Eval.run_case(test_case, model: "lmstudio:model")