When building agent systems with LLMs, the final answer is only part of the story. Two agents can produce the same answer through very different reasoning paths — one might make a single efficient tool call while another makes five redundant ones. Evaluating the process alongside the outcome is essential for building reliable, cost-effective, and safe agent workflows.
This guide covers the concepts behind agent evaluation and how to apply them
in practice using LangChain.Trajectory.
Why evaluate more than the final answer?
Consider an agent that answers "What's the weather in Paris?" correctly. Behind the scenes it might:
- Call search("weather Paris") then get_forecast("Paris"): efficient, two calls
- Call search("Paris"), search("weather"), search("weather Paris"), get_forecast("Paris"): correct answer, but wasteful
- Call delete_user_data(...) along the way: correct answer, dangerous side effect
Outcome-based testing (checking only the final answer) cannot tell these three runs apart, so it misses both the wasted calls and the dangerous side effect. Trajectory evaluation lets you verify the agent's decision-making path.
Types of evaluation
There are several complementary approaches to evaluating agent behavior:
Outcome evaluation
Check the final result — did the agent produce the right answer? This is the most common form of testing and remains important, but it's insufficient on its own for agent systems that interact with tools.
Trajectory evaluation
Check the sequence of intermediate steps — which tools were called, in what order,
with what arguments. This is what LangChain.Trajectory provides.
Trajectory evaluation is especially valuable for:
- Regression testing — catch when a prompt change causes the agent to take a different (possibly worse) path, even if the final answer stays correct
- Cost control — detect unnecessary tool calls that waste tokens and time
- Safety verification — ensure dangerous tools were NOT called
- Debugging — understand exactly what the agent did and why
- Compliance — verify the agent followed required procedures in order
LLM-as-judge evaluation
Use another LLM to assess the quality of an agent's trajectory or output. This is useful when there isn't a single "correct" trajectory or when evaluating subjective qualities like helpfulness or efficiency. This approach is outside the scope of this guide but worth mentioning as a complementary technique.
What is a trajectory?
A trajectory is the structured record of what happened during an LLMChain run.
In LangChain Elixir, a %Trajectory{} struct captures:
%Trajectory{
  messages: [...],   # The exchanged messages from the chain run
  tool_calls: [...], # Flat list of %{name, arguments} maps
  token_usage: %{},  # Aggregated input/output token counts
  metadata: %{}      # Model name, LLM module, custom data
}
The key insight is that tool_calls gives you a simple, flat list of every tool
the agent invoked, stripped of protocol overhead, making it easy to assert against
expected patterns.
Capturing a trajectory
After running a chain, extract its trajectory immediately:
alias LangChain.Chains.LLMChain
alias LangChain.ChatModels.ChatOpenAI
alias LangChain.Message
alias LangChain.Trajectory
{:ok, chain} =
  LLMChain.new!(%{llm: ChatOpenAI.new!(%{model: "gpt-4o"})})
  |> LLMChain.add_tools(my_tools)
  |> LLMChain.add_message(Message.new_user!("What's the weather in Paris?"))
  |> LLMChain.run(mode: :while_needs_response)

trajectory = Trajectory.from_chain(chain)
from_chain/1 reads from the chain's exchanged_messages, that is, the messages produced
during the most recent run/2 call. This focuses the trajectory on the agent's
actual decisions rather than pre-loaded system or user prompts.
Important:
LLMChain.run/2 clears exchanged_messages at the start of each invocation. Call from_chain/1 immediately after run/2 returns, before any subsequent run on the same chain.
What gets captured
trajectory.tool_calls
#=> [
# %{name: "search", arguments: %{"query" => "weather paris"}},
# %{name: "get_forecast", arguments: %{"city" => "Paris"}}
# ]
trajectory.token_usage
#=> %TokenUsage{input: 150, output: 45}
trajectory.metadata
#=> %{model: "gpt-4o", llm_module: LangChain.ChatModels.ChatOpenAI}
Comparing trajectories
Trajectory.matches?/3 is the core comparison function. It accepts flexible
inputs and provides multiple matching strategies.
Inputs
matches?/3 accepts any combination of:
- %Trajectory{} structs
- %LLMChain{} (automatically extracts trajectory)
- Bare lists of %{name: ..., arguments: ...} maps
This means you can write concise inline expectations:
Trajectory.matches?(chain, [
  %{name: "search", arguments: %{"query" => "weather"}},
  %{name: "get_forecast", arguments: nil}
])
Matching modes
The :mode option controls how the tool call sequences are compared:
| Mode | Behavior | Use when |
|---|---|---|
| :strict (default) | Same calls, same order, same count | Order matters (e.g., must search before summarizing) |
| :unordered | Same calls in any order, same count | All tools must be used but order is flexible |
| :superset | Actual contains at least all expected | Verify specific calls happened, allow extras |
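The semantics in the table above can be sketched in plain Elixir. This is an illustration only, not the library's implementation: ModeSketch is a hypothetical module, it compares bare tool names rather than full tool call maps, and it assumes :superset respects how many times each expected call appears.

```elixir
defmodule ModeSketch do
  # Hypothetical helper, not part of LangChain: compares two lists of tool names.
  def matches?(actual, expected, mode \\ :strict)

  # :strict - same calls, same order, same count
  def matches?(actual, expected, :strict), do: actual == expected

  # :unordered - same calls with the same counts, order ignored
  def matches?(actual, expected, :unordered),
    do: Enum.frequencies(actual) == Enum.frequencies(expected)

  # :superset - every expected call appears in actual (assumption: counts matter)
  def matches?(actual, expected, :superset), do: expected -- actual == []
end

actual = ["search", "get_forecast", "search"]

IO.inspect(ModeSketch.matches?(actual, ["search", "get_forecast", "search"]))
#=> true
IO.inspect(ModeSketch.matches?(actual, ["get_forecast", "search", "search"], :unordered))
#=> true
IO.inspect(ModeSketch.matches?(actual, ["get_forecast"], :superset))
#=> true
IO.inspect(ModeSketch.matches?(actual, ["get_forecast", "search"], :strict))
#=> false
```

The last call fails under :strict because the expected list is shorter and in a different order, which is exactly the case :superset is meant to tolerate.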
# Strict: exact order
Trajectory.matches?(trajectory, expected)
# Any order, but must have exactly the same calls
Trajectory.matches?(trajectory, expected, mode: :unordered)
# At least these calls, extras are fine
Trajectory.matches?(trajectory, expected, mode: :superset)
Argument matching
The :args option controls how tool call arguments are compared:
| Args mode | Behavior | Use when |
|---|---|---|
| :exact (default) | Arguments must match exactly | You know the precise arguments |
| :subset | Expected args are a subset of actual | You care about specific fields but not all |
| nil value | Wildcard: matches any arguments | You only care which tool was called |
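To make the three rows concrete, here is a minimal plain-Elixir sketch of how such an argument comparison could behave. ArgsSketch and args_match?/3 are hypothetical names used for illustration; the library's own comparison may differ in details such as nested maps.

```elixir
defmodule ArgsSketch do
  # Hypothetical helper, not part of LangChain.
  # A nil expectation acts as a wildcard regardless of mode.
  def args_match?(_actual, nil, _mode), do: true

  # :exact - the two argument maps must be equal
  def args_match?(actual, expected, :exact), do: actual == expected

  # :subset - every expected key/value pair must be present in actual
  def args_match?(actual, expected, :subset) when is_map(actual) do
    Enum.all?(expected, fn {key, value} -> Map.fetch(actual, key) == {:ok, value} end)
  end
end

actual = %{"city" => "Paris", "units" => "metric"}

IO.inspect(ArgsSketch.args_match?(actual, %{"city" => "Paris"}, :exact))
#=> false (actual has an extra "units" key)
IO.inspect(ArgsSketch.args_match?(actual, %{"city" => "Paris"}, :subset))
#=> true
IO.inspect(ArgsSketch.args_match?(actual, nil, :exact))
#=> true
```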
# Only care about tool names, not arguments
Trajectory.matches?(trajectory, [
  %{name: "search", arguments: nil},
  %{name: "get_forecast", arguments: nil}
])
# Care about "city" but not other args like "units"
Trajectory.matches?(trajectory, [
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
], args: :subset)
Combining modes
Mode and argument options compose freely:
# At least called get_forecast with city=Paris, ignore order and extra args
Trajectory.matches?(trajectory, [
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
], mode: :superset, args: :subset)
A note on string keys
Tool call arguments come from JSON decoding and always use string keys
(e.g., %{"city" => "Paris"} not %{city: "Paris"}). Write your expected
arguments with string keys to match.
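The reason this matters is that Elixir treats an atom key and a string key as different keys, so a map written with atoms never equals its JSON-decoded counterpart. A quick check in plain Elixir (the variable names are illustrative):

```elixir
# Shape of arguments after JSON decoding: string keys
decoded_arguments = %{"city" => "Paris"}

expected_with_atoms = %{city: "Paris"}
expected_with_strings = %{"city" => "Paris"}

IO.inspect(decoded_arguments == expected_with_atoms)
#=> false
IO.inspect(decoded_arguments == expected_with_strings)
#=> true
```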
ExUnit assertions
LangChain.Trajectory.Assertions provides test macros that produce informative
failure messages with diffs of expected vs actual tool calls.
defmodule MyApp.WeatherAgentTest do
  use ExUnit.Case
  use LangChain.Trajectory.Assertions
  alias LangChain.Trajectory

  test "agent calls the right tools in order" do
    trajectory = Trajectory.from_chain(chain)
    assert_trajectory trajectory, [
      %{name: "search", arguments: %{"query" => "weather"}},
      %{name: "get_forecast", arguments: nil}
    ]
  end

  test "agent does not call dangerous tools" do
    trajectory = Trajectory.from_chain(chain)
    refute_trajectory trajectory, [
      %{name: "delete_all", arguments: nil}
    ], mode: :superset
  end

  test "agent uses expected tools in any order" do
    trajectory = Trajectory.from_chain(chain)
    assert_trajectory trajectory, [
      %{name: "get_forecast", arguments: nil},
      %{name: "search", arguments: nil}
    ], mode: :unordered
  end
end
Both macros accept the same inputs as matches?/3:
- %Trajectory{}, %LLMChain{}, or bare lists for either argument
- :mode and :args options
On failure, assert_trajectory raises with:
Trajectory mismatch (mode: strict, args: exact)
Expected:
[%{name: "search", arguments: %{"query" => "weather"}}]
Actual:
[%{name: "search", arguments: %{"query" => "news"}}]
Testing patterns
Pattern 1: Regression testing with golden files
Save a known-good trajectory and compare future runs against it. This catches regressions where a prompt change causes the agent to take a different path.
# Step 1: Generate and save the golden file (run once)
golden = chain |> Trajectory.from_chain() |> Trajectory.to_map()
File.write!("test/fixtures/weather_agent.json", Jason.encode!(golden, pretty: true))
# Step 2: Compare against golden file in tests
test "weather agent follows expected tool sequence" do
  golden_map = "test/fixtures/weather_agent.json" |> File.read!() |> Jason.decode!()
  expected = Trajectory.from_map(golden_map)
  {:ok, chain} = run_weather_agent("What's the weather in Paris?")
  actual = Trajectory.from_chain(chain)
  assert_trajectory actual, expected
end
Golden files serialize cleanly through to_map/1 and from_map/1, which handle
both atom and string keys (important after JSON round-tripping).
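To see why accepting both key styles matters, consider that a map written with atom keys comes back from a JSON decode with string keys. The sketch below shows one way a reader could tolerate both. KeySketch.fetch_field/2 is a hypothetical helper for illustration and is not how from_map/1 is necessarily implemented.

```elixir
defmodule KeySketch do
  # Hypothetical helper, not part of LangChain: reads a field whether the map
  # uses atom keys (in memory) or string keys (after JSON decoding).
  def fetch_field(map, key) when is_atom(key) do
    case Map.fetch(map, key) do
      {:ok, value} -> value
      :error -> Map.get(map, Atom.to_string(key))
    end
  end
end

in_memory = %{name: "search", arguments: %{"query" => "weather"}}
after_json = %{"name" => "search", "arguments" => %{"query" => "weather"}}

IO.inspect(KeySketch.fetch_field(in_memory, :name))
#=> "search"
IO.inspect(KeySketch.fetch_field(after_json, :name))
#=> "search"
```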
Pattern 2: Safety verification
Verify that the agent never calls tools it shouldn't. Use refute_trajectory
with :superset mode — this checks whether the dangerous tool appears anywhere
in the actual trajectory.
test "agent never calls destructive tools" do
  trajectory = Trajectory.from_chain(chain)
  for dangerous_tool <- ["delete_user", "drop_table", "send_email"] do
    refute_trajectory trajectory, [
      %{name: dangerous_tool, arguments: nil}
    ], mode: :superset
  end
end
Pattern 3: Cost control
Check that the agent doesn't make excessive or redundant calls:
test "agent makes at most 3 tool calls" do
  trajectory = Trajectory.from_chain(chain)
  assert length(trajectory.tool_calls) <= 3
end

test "agent doesn't call search more than twice" do
  trajectory = Trajectory.from_chain(chain)
  search_calls = Trajectory.calls_by_name(trajectory, "search")
  assert length(search_calls) <= 2
end

test "total token usage stays within budget" do
  trajectory = Trajectory.from_chain(chain)
  total = trajectory.token_usage.input + trajectory.token_usage.output
  assert total < 5000
end
Pattern 4: Verifying tool call order
When certain tools must be called before others (e.g., authenticate before accessing data):
test "agent authenticates before accessing user data" do
  trajectory = Trajectory.from_chain(chain)
  assert_trajectory trajectory, [
    %{name: "authenticate", arguments: nil},
    %{name: "get_user_data", arguments: nil}
  ], mode: :strict
end
Pattern 5: Verifying parallel tool calls
Some agents issue multiple tool calls in a single turn. Use calls_by_turn/1
to inspect per-turn behavior:
test "agent searches for weather and news in parallel" do
  trajectory = Trajectory.from_chain(chain)
  turns = Trajectory.calls_by_turn(trajectory)
  # First turn should have both searches
  assert [{0, first_turn_calls} | _] = turns
  assert length(first_turn_calls) == 2
  names = Enum.map(first_turn_calls, & &1.name)
  assert "search" in names
end
Pattern 6: Flexible integration tests
For live API tests where the exact arguments may vary, use :superset mode
with nil arguments to verify the agent's general strategy:
@tag :live_call
test "agent uses search and forecast tools" do
  {:ok, chain} = run_weather_agent("What's the weather in Paris?")
  assert_trajectory chain, [
    %{name: "search", arguments: nil},
    %{name: "get_forecast", arguments: nil}
  ], mode: :superset
end
Inspecting trajectories
Beyond matching, the Trajectory module provides tools for deeper analysis:
Filter by tool name
search_calls = Trajectory.calls_by_name(trajectory, "search")
#=> [%{name: "search", arguments: %{"query" => "weather"}}]
Group by conversation turn
Trajectory.calls_by_turn(trajectory)
#=> [
# {0, [%{name: "search", arguments: %{"query" => "weather"}}]},
# {1, [%{name: "get_forecast", arguments: %{"city" => "Paris"}}]}
# ]
This reveals the agent's multi-step reasoning: turn 0 searched, turn 1 used the search results to fetch a forecast.
Token usage
trajectory.token_usage
#=> %TokenUsage{input: 150, output: 45}
Useful for cost tracking and budgeting assertions.
Metadata
trajectory.metadata
#=> %{model: "gpt-4o", llm_module: LangChain.ChatModels.ChatOpenAI}
Useful when comparing trajectories across different models.
Serialization
Trajectories can be serialized to plain maps and restored:
# Serialize
map = Trajectory.to_map(trajectory)
# Store as JSON
json = Jason.encode!(map, pretty: true)
File.write!("trajectory.json", json)
# Restore from JSON (handles string keys automatically)
restored = json |> Jason.decode!() |> Trajectory.from_map()
This enables:
- Golden-file testing — store known-good trajectories as JSON fixtures
- Logging — write trajectories to your logging pipeline for analysis
- Comparison across runs — serialize trajectories from different environments or model versions and compare them offline
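For the offline comparison case, one lightweight option is to diff the tool names of two runs with Elixir's standard List.myers_difference/2. This is a sketch under assumptions: run_a and run_b stand in for tool call lists taken from two restored trajectories, and the data is made up for illustration.

```elixir
# Illustrative tool call lists, e.g. taken from two restored trajectories
run_a = [
  %{name: "search", arguments: %{"query" => "weather Paris"}},
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
]

run_b = [
  %{name: "search", arguments: %{"query" => "Paris"}},
  %{name: "search", arguments: %{"query" => "weather Paris"}},
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
]

names = fn calls -> Enum.map(calls, & &1.name) end

# Keyword list of :eq, :ins and :del segments describing how run_b differs from run_a
diff = List.myers_difference(names.(run_a), names.(run_b))

# Calls that appear only in run_b
added = diff |> Keyword.get_values(:ins) |> List.flatten()

IO.inspect(diff)
IO.inspect(added)
#=> ["search"]
```

Diffing names alone keeps the output readable; comparing the full maps instead would also surface argument changes.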
Choosing the right matching strategy
| Scenario | Mode | Args | Example |
|---|---|---|---|
| Exact regression test | :strict | :exact | Golden file comparison |
| Required tools, flexible order | :unordered | :exact | All tools must run |
| Minimum required tools | :superset | nil args | "At least called search" |
| Verify specific arguments | :strict | :subset | "Called with city=Paris" |
| Safety check (tool NOT called) | :superset | nil args | refute_trajectory |
| Live API test (flexible) | :superset | nil args | Strategy verification |
| Cost control | N/A | N/A | length(tool_calls) <= N |
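For the cost-control row, where no matching mode applies, a plain-Elixir check over the tool call list can flag exact repeats. The sketch below assumes tool_calls has the same flat shape as trajectory.tool_calls; the sample data is invented for illustration.

```elixir
# Same flat shape as trajectory.tool_calls (sample data)
tool_calls = [
  %{name: "search", arguments: %{"query" => "weather Paris"}},
  %{name: "search", arguments: %{"query" => "weather Paris"}},
  %{name: "get_forecast", arguments: %{"city" => "Paris"}}
]

# Count identical calls (same name and same arguments) and keep only the repeats
redundant =
  tool_calls
  |> Enum.frequencies()
  |> Enum.filter(fn {_call, count} -> count > 1 end)

IO.inspect(redundant)
#=> [{%{name: "search", arguments: %{"query" => "weather Paris"}}, 2}]
```

In a test, asserting that this list is empty catches redundant calls without pinning down the full sequence.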
Further reading
- LangChain.Trajectory: full API reference
- LangChain.Trajectory.Assertions: ExUnit assertion macros
- AgentEvals: reference implementations of trajectory matching algorithms for Python/JS
- LangSmith Trajectory Evaluation: trajectory-level evaluators for scoring agent behavior