Evaluating Agent Behavior

Copy Markdown View Source

When building agent systems with LLMs, the final answer is only part of the story. Two agents can produce the same answer through very different reasoning paths — one might make a single efficient tool call while another makes five redundant ones. Evaluating the process alongside the outcome is essential for building reliable, cost-effective, and safe agent workflows.

This guide covers the concepts behind agent evaluation and how to apply them in practice using LangChain.Trajectory.

Why evaluate more than the final answer?

Consider an agent that answers "What's the weather in Paris?" correctly. Behind the scenes it might:

  • Call search("weather Paris") then get_forecast("Paris") — efficient, two calls
  • Call search("Paris"), search("weather"), search("weather Paris"), get_forecast("Paris") — correct answer, but wasteful
  • Call delete_user_data(...) along the way — correct answer, dangerous side effect

Outcome-based testing (checking the final answer) catches none of these problems. Trajectory evaluation lets you verify the agent's decision-making path.

Types of evaluation

There are several complementary approaches to evaluating agent behavior:

Outcome evaluation

Check the final result — did the agent produce the right answer? This is the most common form of testing and remains important, but it's insufficient on its own for agent systems that interact with tools.

Trajectory evaluation

Check the sequence of intermediate steps — which tools were called, in what order, with what arguments. This is what LangChain.Trajectory provides.

Trajectory evaluation is especially valuable for:

  • Regression testing — catch when a prompt change causes the agent to take a different (possibly worse) path, even if the final answer stays correct
  • Cost control — detect unnecessary tool calls that waste tokens and time
  • Safety verification — ensure dangerous tools were NOT called
  • Debugging — understand exactly what the agent did and why
  • Compliance — verify the agent followed required procedures in order

LLM-as-judge evaluation

Use another LLM to assess the quality of an agent's trajectory or output. This is useful when there isn't a single "correct" trajectory or when evaluating subjective qualities like helpfulness or efficiency. This approach is outside the scope of this guide but worth mentioning as a complementary technique.

What is a trajectory?

A trajectory is the structured record of what happened during an LLMChain run. In LangChain Elixir, a %Trajectory{} struct captures:

%Trajectory{
  messages:    [...]   # The exchanged messages from the chain run
  tool_calls:  [...]   # Flat list of %{name, arguments} maps
  token_usage: %{}     # Aggregated input/output token counts
  metadata:    %{}     # Model name, LLM module, custom data
}

The key insight is that tool_calls gives you a simple, flat list of every tool the agent invoked — stripped of protocol overhead — making it easy to assert against expected patterns.

Capturing a trajectory

After running a chain, extract its trajectory immediately:

alias LangChain.Chains.LLMChain
alias LangChain.ChatModels.ChatOpenAI
alias LangChain.Message
alias LangChain.Trajectory

{:ok, chain} =
  LLMChain.new!(%{llm: ChatOpenAI.new!(%{model: "gpt-4o"})})
  |> LLMChain.add_tools(my_tools)
  |> LLMChain.add_message(Message.new_user!("What's the weather in Paris?"))
  |> LLMChain.run(mode: :while_needs_response)

trajectory = Trajectory.from_chain(chain)

from_chain/1 reads from the chain's exchanged_messages — the messages produced during the most recent run/2 call. This focuses the trajectory on the agent's actual decisions rather than pre-loaded system or user prompts.

Important: LLMChain.run/2 clears exchanged_messages at the start of each invocation. Call from_chain/1 immediately after run/2 returns, before any subsequent run on the same chain.

What gets captured

trajectory.tool_calls
#=> [
#     %{name: "search", arguments: %{"query" => "weather paris"}},
#     %{name: "get_forecast", arguments: %{"city" => "Paris"}}
#   ]

trajectory.token_usage
#=> %TokenUsage{input: 150, output: 45}

trajectory.metadata
#=> %{model: "gpt-4o", llm_module: LangChain.ChatModels.ChatOpenAI}

Comparing trajectories

Trajectory.matches?/3 is the core comparison function. It accepts flexible inputs and provides multiple matching strategies.

Inputs

matches?/3 accepts any combination of:

  • %Trajectory{} structs
  • %LLMChain{} (automatically extracts trajectory)
  • Bare lists of %{name: ..., arguments: ...} maps

This means you can write concise inline expectations:

Trajectory.matches?(chain, [
  %{name: "search", arguments: %{"query" => "weather"}},
  %{name: "get_forecast", arguments: nil}
])

Matching modes

The :mode option controls how the tool call sequences are compared:

ModeBehaviorUse when
:strict (default)Same calls, same order, same countOrder matters (e.g., must search before summarizing)
:unorderedSame calls in any order, same countAll tools must be used but order is flexible
:supersetActual contains at least all expectedVerify specific calls happened, allow extras
# Strict: exact order
Trajectory.matches?(trajectory, expected)

# Any order, but must have exactly the same calls
Trajectory.matches?(trajectory, expected, mode: :unordered)

# At least these calls, extras are fine
Trajectory.matches?(trajectory, expected, mode: :superset)

Argument matching

The :args option controls how tool call arguments are compared:

Args modeBehaviorUse when
:exact (default)Arguments must match exactlyYou know the precise arguments
:subsetExpected args are a subset of actualYou care about specific fields but not all
nil valueWildcard — matches any argumentsYou only care which tool was called
# Only care about tool names, not arguments
Trajectory.matches?(trajectory, [
  %{name: "search", arguments: nil},
  %{name: "get_forecast", arguments: nil}
])

# Care about "city" but not other args like "units"
Trajectory.matches?(trajectory, [
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
], args: :subset)

Combining modes

Mode and argument options compose freely:

# At least called get_forecast with city=Paris, ignore order and extra args
Trajectory.matches?(trajectory, [
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
], mode: :superset, args: :subset)

A note on string keys

Tool call arguments come from JSON decoding and always use string keys (e.g., %{"city" => "Paris"} not %{city: "Paris"}). Write your expected arguments with string keys to match.

ExUnit assertions

LangChain.Trajectory.Assertions provides test macros that produce informative failure messages with diffs of expected vs actual tool calls.

defmodule MyApp.WeatherAgentTest do
  use ExUnit.Case
  use LangChain.Trajectory.Assertions

  alias LangChain.Trajectory

  test "agent calls the right tools in order" do
    trajectory = Trajectory.from_chain(chain)

    assert_trajectory trajectory, [
      %{name: "search", arguments: %{"query" => "weather"}},
      %{name: "get_forecast", arguments: nil}
    ]
  end

  test "agent does not call dangerous tools" do
    trajectory = Trajectory.from_chain(chain)

    refute_trajectory trajectory, [
      %{name: "delete_all", arguments: nil}
    ], mode: :superset
  end

  test "agent uses expected tools in any order" do
    trajectory = Trajectory.from_chain(chain)

    assert_trajectory trajectory, [
      %{name: "get_forecast", arguments: nil},
      %{name: "search", arguments: nil}
    ], mode: :unordered
  end
end

Both macros accept the same inputs as matches?/3:

  • %Trajectory{}, %LLMChain{}, or bare lists for either argument
  • :mode and :args options

On failure, assert_trajectory raises with:

Trajectory mismatch (mode: strict, args: exact)

Expected:
[%{name: "search", arguments: %{"query" => "weather"}}]

Actual:
[%{name: "search", arguments: %{"query" => "news"}}]

Testing patterns

Pattern 1: Regression testing with golden files

Save a known-good trajectory and compare future runs against it. This catches regressions where a prompt change causes the agent to take a different path.

# Step 1: Generate and save the golden file (run once)
golden = chain |> Trajectory.from_chain() |> Trajectory.to_map()
File.write!("test/fixtures/weather_agent.json", Jason.encode!(golden, pretty: true))

# Step 2: Compare against golden file in tests
test "weather agent follows expected tool sequence" do
  golden_map = "test/fixtures/weather_agent.json" |> File.read!() |> Jason.decode!()
  expected = Trajectory.from_map(golden_map)

  {:ok, chain} = run_weather_agent("What's the weather in Paris?")
  actual = Trajectory.from_chain(chain)

  assert_trajectory actual, expected
end

Golden files serialize cleanly through to_map/1 and from_map/1, which handle both atom and string keys (important after JSON round-tripping).

Pattern 2: Safety verification

Verify that the agent never calls tools it shouldn't. Use refute_trajectory with :superset mode — this checks whether the dangerous tool appears anywhere in the actual trajectory.

test "agent never calls destructive tools" do
  trajectory = Trajectory.from_chain(chain)

  for dangerous_tool <- ["delete_user", "drop_table", "send_email"] do
    refute_trajectory trajectory, [
      %{name: dangerous_tool, arguments: nil}
    ], mode: :superset
  end
end

Pattern 3: Cost control

Check that the agent doesn't make excessive or redundant calls:

test "agent makes at most 3 tool calls" do
  trajectory = Trajectory.from_chain(chain)
  assert length(trajectory.tool_calls) <= 3
end

test "agent doesn't call search more than twice" do
  trajectory = Trajectory.from_chain(chain)
  search_calls = Trajectory.calls_by_name(trajectory, "search")
  assert length(search_calls) <= 2
end

test "total token usage stays within budget" do
  trajectory = Trajectory.from_chain(chain)
  total = trajectory.token_usage.input + trajectory.token_usage.output
  assert total < 5000
end

Pattern 4: Verifying tool call order

When certain tools must be called before others (e.g., authenticate before accessing data):

test "agent authenticates before accessing user data" do
  trajectory = Trajectory.from_chain(chain)

  assert_trajectory trajectory, [
    %{name: "authenticate", arguments: nil},
    %{name: "get_user_data", arguments: nil}
  ], mode: :strict
end

Pattern 5: Verifying parallel tool calls

Some agents issue multiple tool calls in a single turn. Use calls_by_turn/1 to inspect per-turn behavior:

test "agent searches for weather and news in parallel" do
  trajectory = Trajectory.from_chain(chain)
  turns = Trajectory.calls_by_turn(trajectory)

  # First turn should have both searches
  assert [{0, first_turn_calls} | _] = turns
  assert length(first_turn_calls) == 2

  names = Enum.map(first_turn_calls, & &1.name)
  assert "search" in names
end

Pattern 6: Flexible integration tests

For live API tests where the exact arguments may vary, use :superset mode with nil arguments to verify the agent's general strategy:

@tag :live_call
test "agent uses search and forecast tools" do
  {:ok, chain} = run_weather_agent("What's the weather in Paris?")

  assert_trajectory chain, [
    %{name: "search", arguments: nil},
    %{name: "get_forecast", arguments: nil}
  ], mode: :superset
end

Inspecting trajectories

Beyond matching, the Trajectory module provides tools for deeper analysis:

Filter by tool name

search_calls = Trajectory.calls_by_name(trajectory, "search")
#=> [%{name: "search", arguments: %{"query" => "weather"}}]

Group by conversation turn

Trajectory.calls_by_turn(trajectory)
#=> [
#     {0, [%{name: "search", arguments: %{"query" => "weather"}}]},
#     {1, [%{name: "get_forecast", arguments: %{"city" => "Paris"}}]}
#   ]

This reveals the agent's multi-step reasoning: turn 0 searched, turn 1 used the search results to fetch a forecast.

Token usage

trajectory.token_usage
#=> %TokenUsage{input: 150, output: 45}

Useful for cost tracking and budgeting assertions.

Metadata

trajectory.metadata
#=> %{model: "gpt-4o", llm_module: LangChain.ChatModels.ChatOpenAI}

Useful when comparing trajectories across different models.

Serialization

Trajectories can be serialized to plain maps and restored:

# Serialize
map = Trajectory.to_map(trajectory)

# Store as JSON
json = Jason.encode!(map, pretty: true)
File.write!("trajectory.json", json)

# Restore from JSON (handles string keys automatically)
restored = json |> Jason.decode!() |> Trajectory.from_map()

This enables:

  • Golden-file testing — store known-good trajectories as JSON fixtures
  • Logging — write trajectories to your logging pipeline for analysis
  • Comparison across runs — serialize trajectories from different environments or model versions and compare them offline

Choosing the right matching strategy

ScenarioModeArgsExample
Exact regression test:strict:exactGolden file comparison
Required tools, flexible order:unordered:exactAll tools must run
Minimum required tools:supersetnil args"At least called search"
Verify specific arguments:strict:subset"Called with city=Paris"
Safety check (tool NOT called):supersetnil argsrefute_trajectory
Live API test (flexible):supersetnil argsStrategy verification
Cost controlN/AN/Alength(tool_calls) <= N

Further reading