Telemetry and Observability


AgentSessionManager emits telemetry events and supports audit logging for production observability. Both systems can be enabled or disabled at runtime.

Telemetry Events

All telemetry events are emitted through the :telemetry library, and every event name begins with the [:agent_session_manager, ...] prefix.

Run Lifecycle Events

[:agent_session_manager, :run, :start]

Emitted when a run begins.

| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time in native units |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| agent_id | string | Agent identifier |
| run | Run.t() | Full run struct |
| session | Session.t() | Full session struct |

[:agent_session_manager, :run, :stop]

Emitted when a run completes successfully.

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Duration in nanoseconds |
| system_time | integer | System time |
| input_tokens | integer | Input token count (if available) |
| output_tokens | integer | Output token count (if available) |
| total_tokens | integer | Total token count (if available) |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| agent_id | string | Agent identifier |
| status | atom | Final run status |
| run | Run.t() | Full run struct |
| session | Session.t() | Full session struct |

[:agent_session_manager, :run, :exception]

Emitted when a run fails.

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Duration in nanoseconds |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| agent_id | string | Agent identifier |
| error_code | atom | Error code |
| error_message | string | Error message |
| run | Run.t() | Full run struct |
| session | Session.t() | Full session struct |
| error | map | Full error details |

Usage Events

[:agent_session_manager, :usage, :report]

Emitted with token usage metrics.

| Measurement | Type | Description |
| --- | --- | --- |
| (varies) | number | All keys from the metrics map |

Common measurement keys: input_tokens, output_tokens, total_tokens, cost_usd.
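Because the measurement keys vary from report to report, a handler should not assume that any particular key is present. The sketch below shows one way to fold usage measurements into running totals. UsageTotals and its add/2 function are hypothetical names introduced for illustration; they are not part of AgentSessionManager.

```elixir
defmodule UsageTotals do
  @moduledoc """
  Hypothetical helper (not part of AgentSessionManager) that folds
  usage measurements into running totals.
  """

  # Adds every numeric measurement to the matching key in `totals`.
  # Non-numeric values are skipped so an unexpected key cannot crash a handler.
  def add(totals, measurements) when is_map(totals) and is_map(measurements) do
    Enum.reduce(measurements, totals, fn
      {key, value}, acc when is_number(value) -> Map.update(acc, key, value, &(&1 + value))
      {_key, _value}, acc -> acc
    end)
  end
end

# Two usage reports with different key sets.
totals =
  %{}
  |> UsageTotals.add(%{input_tokens: 100, output_tokens: 40})
  |> UsageTotals.add(%{input_tokens: 20, cost_usd: 0.002})
```

Inside a real handler, the totals would need to live somewhere that outlasts a single call, such as an Agent or an ETS table.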

Adapter Events

[:agent_session_manager, :adapter, <event_type>]

Emitted for each adapter event (:run_started, :message_streamed, :tool_call_started, etc.).

| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time |
| (numeric data) | number | Any numeric values from event data |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| agent_id | string | Agent identifier |
| provider | atom | Provider name (:claude, :codex) |
| tool_name | string | Tool name (for tool events) |
| event_data | map | Full event data |
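Handlers in :telemetry are attached to exact event names rather than to a prefix, so subscribing to adapter events means listing each event type you care about. The sketch below builds those names from a list of adapter event types. AdapterEventNames is a hypothetical helper, and the three types shown are only the ones named above; the full set of adapter event types is not listed in this guide.

```elixir
defmodule AdapterEventNames do
  @moduledoc "Hypothetical helper; not part of AgentSessionManager."

  @prefix [:agent_session_manager, :adapter]

  # Builds the full telemetry event name for each adapter event type.
  def for_types(types) when is_list(types) do
    Enum.map(types, fn type when is_atom(type) -> @prefix ++ [type] end)
  end
end

events = AdapterEventNames.for_types([:run_started, :message_streamed, :tool_call_started])
# `events` can be passed as the second argument to :telemetry.attach_many/4.
```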

Attaching Handlers

:telemetry.attach_many(
  "my-metrics",
  [
    [:agent_session_manager, :run, :start],
    [:agent_session_manager, :run, :stop],
    [:agent_session_manager, :run, :exception],
    [:agent_session_manager, :usage, :report]
  ],
  &MyMetricsHandler.handle_event/4,
  nil
)

Example: Logging Handler

defmodule MyMetricsHandler do
  require Logger

  def handle_event([:agent_session_manager, :run, :start], _measurements, metadata, _config) do
    Logger.info("Run started: #{metadata.run_id} (agent: #{metadata.agent_id})")
  end

  def handle_event([:agent_session_manager, :run, :stop], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :nanosecond, :millisecond)
    Logger.info("Run completed: #{metadata.run_id} in #{duration_ms}ms")
  end

  def handle_event([:agent_session_manager, :run, :exception], _measurements, metadata, _config) do
    Logger.error("Run failed: #{metadata.run_id} - #{metadata.error_message}")
  end

  def handle_event([:agent_session_manager, :usage, :report], measurements, metadata, _config) do
    Logger.info("Usage for #{metadata.session_id}: #{inspect(measurements)}")
  end
end

The Span Helper

When you execute a run manually, the span/3 function wraps your function and emits the start, stop, and exception events for you:

alias AgentSessionManager.Telemetry

result = Telemetry.span(run, session, fn ->
  # Your execution logic
  {:ok, %{output: output, token_usage: usage}}
end)

This automatically emits:

  • :start before the function runs
  • :stop if the function returns {:ok, ...}
  • :exception if the function returns {:error, ...}
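To make the three outcomes concrete, here is a minimal sketch of how a span-style wrapper can behave. It is illustrative only and is not the AgentSessionManager.Telemetry implementation: SpanSketch is a hypothetical module, and the emit callback stands in for real event emission so the example runs without any dependencies.

```elixir
defmodule SpanSketch do
  @moduledoc "Illustrative only; not the AgentSessionManager.Telemetry implementation."

  # `emit` is a hypothetical 2-arity callback standing in for real event emission.
  def span(emit, fun) when is_function(emit, 2) and is_function(fun, 0) do
    started_at = System.monotonic_time()
    emit.(:start, %{system_time: System.system_time()})

    result = fun.()
    duration = System.monotonic_time() - started_at

    case result do
      {:ok, _value} -> emit.(:stop, %{duration: duration})
      {:error, _reason} -> emit.(:exception, %{duration: duration})
      # Other return shapes are not described in this guide, so the sketch emits nothing.
      _other -> :ok
    end

    # The wrapped function's result is returned unchanged.
    result
  end
end

# Collect emitted events as messages to the current process.
parent = self()
emit = fn event, measurements -> send(parent, {:event, event, measurements}) end

ok_result = SpanSketch.span(emit, fn -> {:ok, %{output: "done"}} end)
error_result = SpanSketch.span(emit, fn -> {:error, :timeout} end)
```

The sketch measures duration with the monotonic clock, which is the usual choice for elapsed time because it is not affected by wall-clock adjustments.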

Enabling/Disabling Telemetry

Telemetry uses the AgentSessionManager.Config layered configuration system. set_enabled/1 sets a process-local override, so disabling telemetry in one process (e.g., a test) does not affect other processes.

# Check if enabled (default: true)
Telemetry.enabled?()

# Disable for the current process only
Telemetry.set_enabled(false)

# Or via application config, e.g. in config/config.exs (affects all processes without a local override)
config :agent_session_manager, telemetry_enabled: false

When disabled, telemetry functions return :ok without emitting events. See Configuration for details on the layered resolution order.
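The process-local behavior can be pictured with a small sketch of a layered flag. LayeredFlag is a hypothetical module, :layered_flag_demo is a made-up application name, and the resolution order shown (process override, then application config, then a default of true) is an assumption made for illustration; the Configuration guide is the authority on the actual order.

```elixir
defmodule LayeredFlag do
  @moduledoc "Hypothetical sketch of a layered flag; not AgentSessionManager.Config itself."

  @key {__MODULE__, :telemetry_enabled}

  # Process-local override, stored in the calling process's dictionary.
  def set_enabled(value) when is_boolean(value) do
    Process.put(@key, value)
    :ok
  end

  # Assumed resolution order: process override, then application env, then default.
  def enabled? do
    case Process.get(@key) do
      nil -> Application.get_env(:layered_flag_demo, :telemetry_enabled, true)
      value -> value
    end
  end
end

# Disable in the current process only.
LayeredFlag.set_enabled(false)
local = LayeredFlag.enabled?()

# A different process has no override and falls through to the default.
other = Task.await(Task.async(fn -> LayeredFlag.enabled?() end))
```

Because the override lives in the process dictionary, a test that disables the flag does not need to restore it for other concurrently running tests.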

Audit Logging

The AuditLogger module persists audit events to the SessionStore, providing a queryable history of all run lifecycle events.

What Gets Logged

  • :run_started -- when a run begins
  • :run_completed -- when a run finishes successfully (includes token usage)
  • :run_failed -- when a run fails (includes error code and message)
  • :error_occurred -- when an error happens during execution
  • :token_usage_updated -- when usage metrics are reported

Manual Logging

alias AgentSessionManager.AuditLogger

AuditLogger.log_run_started(store, run, session)
AuditLogger.log_run_completed(store, run, session, %{token_usage: usage})
AuditLogger.log_run_failed(store, run, session, %{code: :timeout, message: "Timed out"})
AuditLogger.log_error(store, run, session, %{code: :provider_error, message: "Rate limited"})
AuditLogger.log_usage_metrics(store, session, %{input_tokens: 100}, run_id: run.id)

Querying the Audit Log

{:ok, events} = AuditLogger.get_audit_log(store, session.id)
{:ok, events} = AuditLogger.get_audit_log(store, session.id, run_id: run.id)
{:ok, events} = AuditLogger.get_audit_log(store, session.id, type: :run_failed)

Automatic Logging via Telemetry

The AuditLogger can automatically create audit entries from telemetry events:

# Attach -- audit events are now created automatically for all runs
AuditLogger.attach_telemetry_handlers(store)

# Detach when no longer needed
AuditLogger.detach_telemetry_handlers()

This is the recommended approach for production: telemetry handles real-time metrics, and the audit logger ensures every event is durably stored.

Enabling/Disabling Audit Logging

Audit logging also uses the AgentSessionManager.Config layered system. set_enabled/1 sets a process-local override.

AuditLogger.enabled?()           # default: true
AuditLogger.set_enabled(false)   # disable for the current process

# Or via application config, e.g. in config/config.exs (affects all processes without a local override)
config :agent_session_manager, audit_logging_enabled: false

See Configuration for the full resolution order.

Routing Telemetry Events

The ProviderRouter emits telemetry events for each routing attempt, enabling observability into provider selection, failover, and latency.

[:agent_session_manager, :router, :attempt, :start]

Emitted before adapter execution begins for a routing attempt.

| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time in native units |

| Metadata | Type | Description |
| --- | --- | --- |
| adapter_id | string | Selected adapter identifier |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| attempt | integer | Attempt number (1-based) |
| strategy | atom | Routing strategy (:prefer or :weighted) |
| candidates | list | Candidate adapter IDs considered |

[:agent_session_manager, :router, :attempt, :stop]

Emitted after a successful adapter execution.

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Duration in nanoseconds |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| adapter_id | string | Adapter that handled the run |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| attempt | integer | Attempt number |

[:agent_session_manager, :router, :attempt, :exception]

Emitted when an adapter execution fails during routing.

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Duration in nanoseconds |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| adapter_id | string | Adapter that failed |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| attempt | integer | Attempt number |
| error_code | atom | Error code from the failure |
| retryable | boolean | Whether the error is retryable |
| will_retry | boolean | Whether the router will try another adapter |

Attaching Router Handlers

:telemetry.attach_many(
  "my-routing-metrics",
  [
    [:agent_session_manager, :router, :attempt, :start],
    [:agent_session_manager, :router, :attempt, :stop],
    [:agent_session_manager, :router, :attempt, :exception]
  ],
  &MyRoutingHandler.handle_event/4,
  nil
)
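The guide does not define MyRoutingHandler, so the following is one possible shape for it, written against the measurement and metadata keys documented for the router events. The describe/3 function and the log wording are hypothetical choices made for this sketch.

```elixir
defmodule MyRoutingHandler do
  @moduledoc "Sketch of a routing handler; the log format is an arbitrary choice."
  require Logger

  # Entry point referenced by the attach_many call.
  def handle_event([:agent_session_manager, :router, :attempt, stage], measurements, metadata, _config) do
    Logger.info(describe(stage, measurements, metadata))
  end

  # Builds a log line per stage; kept separate from logging so it is easy to test.
  def describe(:start, _measurements, meta) do
    "attempt #{meta.attempt} -> #{meta.adapter_id} (strategy: #{meta.strategy})"
  end

  def describe(:stop, measurements, meta) do
    ms = System.convert_time_unit(measurements.duration, :nanosecond, :millisecond)
    "attempt #{meta.attempt} on #{meta.adapter_id} succeeded in #{ms}ms"
  end

  def describe(:exception, _measurements, meta) do
    next = if meta.will_retry, do: "trying another adapter", else: "giving up"
    "attempt #{meta.attempt} on #{meta.adapter_id} failed (#{meta.error_code}); #{next}"
  end
end

# Example metadata shaped like the :exception table above.
line =
  MyRoutingHandler.describe(:exception, %{duration: 5_000_000}, %{
    attempt: 1,
    adapter_id: "claude",
    error_code: :timeout,
    retryable: true,
    will_retry: true
  })
```

Logging will_retry alongside the error code makes it possible to tell a failover apart from a terminal routing failure when reading the logs.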

Runtime Telemetry Events

The SessionServer runtime emits telemetry events under the [:agent_session_manager, :runtime, ...] namespace for queue and lifecycle observability.

[:agent_session_manager, :runtime, :run, :enqueued]

Emitted when a run is submitted to the queue.

| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time |
| queue_depth | integer | Queue size after enqueue |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| in_flight_count | integer | Current in-flight runs |
| max_concurrent_runs | integer | Configured slot limit |

[:agent_session_manager, :runtime, :run, :started]

Emitted when a run is dequeued and adapter execution begins.

| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time |
| wait_time | integer | Time in queue (nanoseconds) |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| in_flight_count | integer | In-flight runs after start |

[:agent_session_manager, :runtime, :run, :completed]

Emitted when a run finishes (success, failure, or cancellation).

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Total run duration (nanoseconds) |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| status | atom | Final status (:completed, :failed, :cancelled) |
| in_flight_count | integer | In-flight runs after completion |
| queued_count | integer | Remaining queued runs |

[:agent_session_manager, :runtime, :run, :crashed]

Emitted when a run task crashes unexpectedly.

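| Measurement | Type | Description |
| --- | --- | --- |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| run_id | string | Run identifier |
| session_id | string | Session identifier |
| reason | term | Crash reason from the DOWN message |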

[:agent_session_manager, :runtime, :drain, :complete]

Emitted when drain/2 finishes successfully.

| Measurement | Type | Description |
| --- | --- | --- |
| duration | integer | Total drain wait time (nanoseconds) |
| system_time | integer | System time |

| Metadata | Type | Description |
| --- | --- | --- |
| session_id | string | Session identifier |
| runs_drained | integer | Total runs that completed during drain |

Attaching Runtime Handlers

:telemetry.attach_many(
  "my-runtime-metrics",
  [
    [:agent_session_manager, :runtime, :run, :enqueued],
    [:agent_session_manager, :runtime, :run, :started],
    [:agent_session_manager, :runtime, :run, :completed],
    [:agent_session_manager, :runtime, :run, :crashed],
    [:agent_session_manager, :runtime, :drain, :complete]
  ],
  &MyRuntimeHandler.handle_event/4,
  nil
)
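MyRuntimeHandler is likewise not defined in this guide. The sketch below is one way to write it, using the documented keys to flag saturated sessions and to report queue wait time. The saturated?/1 and wait_ms/1 helpers and the choice of log levels are hypothetical.

```elixir
defmodule MyRuntimeHandler do
  @moduledoc "Sketch of a runtime handler; helper names and log levels are arbitrary choices."
  require Logger

  def handle_event([:agent_session_manager, :runtime, :run, :enqueued], measurements, metadata, _config) do
    if saturated?(metadata) do
      Logger.warning("Session #{metadata.session_id} is saturated; queue depth #{measurements.queue_depth}")
    end

    :ok
  end

  def handle_event([:agent_session_manager, :runtime, :run, :started], measurements, metadata, _config) do
    Logger.info("Run #{metadata.run_id} waited #{wait_ms(measurements)}ms in queue")
  end

  def handle_event([:agent_session_manager, :runtime, :run, :crashed], _measurements, metadata, _config) do
    Logger.error("Run #{metadata.run_id} crashed: #{inspect(metadata.reason)}")
  end

  # The remaining attached events (:completed and drain :complete) are ignored in this sketch.
  def handle_event(_event, _measurements, _metadata, _config), do: :ok

  # True when every configured slot is in use, so new runs must wait.
  def saturated?(%{in_flight_count: in_flight, max_concurrent_runs: max}), do: in_flight >= max

  # Converts the documented nanosecond wait time to milliseconds.
  def wait_ms(%{wait_time: wait}), do: System.convert_time_unit(wait, :nanosecond, :millisecond)
end

busy = MyRuntimeHandler.saturated?(%{in_flight_count: 2, max_concurrent_runs: 2})
waited = MyRuntimeHandler.wait_ms(%{wait_time: 250_000_000})
```

The catch-all clause matters here: a handler attached to five events but matching only three would otherwise raise on the other two.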