Observability

After working through this guide, you can monitor Jido agents in production with metrics, traces, and structured logging.

This guide covers production-grade observability for Jido agents. For development debugging, see Seeing What Happened.

Production Logger Configuration

Configure structured JSON logging for production:

# config/prod.exs
config :logger, :default_handler,
  formatter: {
    :logger_formatter_json,
    %{
      template: [:time, :level, :msg, :metadata],
      single_line: true
    }
  }

config :logger,
  level: :info,
  metadata: [:agent_id, :jido_trace_id, :jido_span_id, :jido_instance, :signal_type, :duration_ms]

config :jido, :observability,
  log_level: :info,
  debug_events: :off,
  redact_sensitive: true

Per-instance observability tuning allows different Jido instances to use different verbosity and redaction settings:

config :my_app, MyApp.PublicJido,
  telemetry: [log_level: :info],
  observability: [debug_events: :off, redact_sensitive: true]

config :my_app, MyApp.InternalJido,
  telemetry: [log_level: :debug, log_args: :full],
  observability: [debug_events: :all, redact_sensitive: false, tracer: MyApp.OtelTracer]

Settings resolve in this order: Debug override → instance config → global config → default. Invalid values are ignored and fall back to Jido defaults. The default telemetry log level is :info, so compact debug signal/directive logs are opt-in through config or MyApp.Jido.debug(:on).
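The precedence chain above can be sketched as a first-valid-wins lookup. This is an illustration of the documented behavior, not Jido's actual resolution code:

```elixir
# Illustrative sketch of the documented precedence:
# debug override -> instance config -> global config -> default.
# Invalid values are skipped, mirroring the documented fallback behavior.
defmodule SettingResolution do
  @valid_levels [:debug, :info, :warning, :error]
  @default :info

  def resolve_log_level(debug_override, instance_level, global_level) do
    [debug_override, instance_level, global_level]
    |> Enum.find(@default, &(&1 in @valid_levels))
  end
end

SettingResolution.resolve_log_level(nil, :bogus, :debug)
# => :debug (the invalid instance value is ignored)
```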

The :redact_sensitive option replaces sensitive data with [REDACTED] in logs and telemetry.
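Conceptually, redaction replaces values for sensitive keys before metadata reaches logs or telemetry. A minimal sketch; the key list below is an assumption for illustration, not Jido's internal list:

```elixir
# Hypothetical redaction sketch. The @sensitive_keys list is an assumption;
# Jido's actual set of redacted keys is internal to the library.
defmodule RedactionSketch do
  @sensitive_keys [:password, :api_key, :token]

  def redact(metadata) when is_map(metadata) do
    Map.new(metadata, fn
      {key, _value} when key in @sensitive_keys -> {key, "[REDACTED]"}
      pair -> pair
    end)
  end
end

RedactionSketch.redact(%{user: "ann", api_key: "secret"})
# => %{user: "ann", api_key: "[REDACTED]"}
```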

For Jido-managed Jido.Exec calls, telemetry.log_args also controls how much observability data is allowed through the underlying jido_action layer:

  • log_args: :full keeps verbose action logs and emits [:jido, :action, ...] spans
  • log_args: :keys_only and :none suppress the noisy full action start log
  • log_args: :keys_only and :none also silence dependency action spans, because jido_action only supports :full or :silent telemetry payload behavior
  • MyApp.Jido.debug(:verbose) re-enables full action logs and action spans for that instance

Telemetry Event Reference

Jido emits telemetry events for all core operations. Use these for metrics collection and alerting.

Agent Events

| Event | Description | Measurements | Metadata |
|---|---|---|---|
| [:jido, :agent, :cmd, :start] | Command execution started | system_time | agent_id, agent_module, action, jido_instance |
| [:jido, :agent, :cmd, :stop] | Command completed | duration, directive_count | agent_id, agent_module, directive_count, jido_instance |
| [:jido, :agent, :cmd, :exception] | Command failed | duration | agent_id, agent_module, error, stacktrace, jido_instance |

AgentServer Events

| Event | Description | Measurements | Metadata |
|---|---|---|---|
| [:jido, :agent_server, :signal, :start] | Signal processing started | system_time | agent_id, signal_type, jido_instance |
| [:jido, :agent_server, :signal, :stop] | Signal processing completed | duration | agent_id, signal_type, directive_count, directive_types, jido_instance |
| [:jido, :agent_server, :signal, :exception] | Signal processing failed | duration | agent_id, signal_type, error, jido_instance |
| [:jido, :agent_server, :directive, :start] | Directive execution started | system_time | agent_id, directive_type, directive, jido_instance |
| [:jido, :agent_server, :directive, :stop] | Directive execution completed | duration | agent_id, directive_type, directive, result, jido_instance |
| [:jido, :agent_server, :directive, :exception] | Directive execution failed | duration | agent_id, directive_type, directive, error, jido_instance |
| [:jido, :agent_server, :queue, :overflow] | Directive queue overflow | queue_size | agent_id, signal_type, jido_instance |

When debug or trace logging is enabled, the structured [signal] log line includes the directive-type summary from directive_types, for example directives=2 Emit=1 Schedule=1.
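That summary string can be reconstructed from a frequency map of directive types (the %{type_name => count} map shape here is an assumption for illustration):

```elixir
# Builds a summary like "directives=2 Emit=1 Schedule=1" from an assumed
# %{type_name => count} map; entries are sorted for stable output.
format_directives = fn directive_types ->
  total = directive_types |> Map.values() |> Enum.sum()

  parts =
    directive_types
    |> Enum.sort()
    |> Enum.map(fn {type, count} -> "#{type}=#{count}" end)

  Enum.join(["directives=#{total}" | parts], " ")
end

format_directives.(%{"Emit" => 1, "Schedule" => 1})
# => "directives=2 Emit=1 Schedule=1"
```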

Strategy Events

| Event | Description | Measurements | Metadata |
|---|---|---|---|
| [:jido, :agent, :strategy, :init, :start] | Strategy initialization started | system_time | agent_id, strategy, jido_instance |
| [:jido, :agent, :strategy, :init, :stop] | Strategy initialization completed | duration | agent_id, strategy, jido_instance |
| [:jido, :agent, :strategy, :init, :exception] | Strategy initialization failed | duration | agent_id, strategy, error, jido_instance |
| [:jido, :agent, :strategy, :cmd, :start] | Strategy command started | system_time | agent_id, strategy, jido_instance |
| [:jido, :agent, :strategy, :cmd, :stop] | Strategy command completed | duration | agent_id, strategy, directive_count, jido_instance |
| [:jido, :agent, :strategy, :cmd, :exception] | Strategy command failed | duration | agent_id, strategy, error, jido_instance |
| [:jido, :agent, :strategy, :tick, :start] | Strategy tick started | system_time | agent_id, strategy, jido_instance |
| [:jido, :agent, :strategy, :tick, :stop] | Strategy tick completed | duration | agent_id, strategy, jido_instance |
| [:jido, :agent, :strategy, :tick, :exception] | Strategy tick failed | duration | agent_id, strategy, error, jido_instance |

Pod Events

| Event | Description | Measurements | Metadata |
|---|---|---|---|
| [:jido, :pod, :reconcile, :start] | Pod eager reconciliation started | system_time | pod_id, pod_module, jido_instance |
| [:jido, :pod, :reconcile, :stop] | Pod eager reconciliation completed | duration, node_count, requested_count, failure_count, pending_count, wave_count | pod_id, pod_module, jido_instance |
| [:jido, :pod, :reconcile, :exception] | Pod eager reconciliation failed | duration | pod_id, pod_module, error, jido_instance |
| [:jido, :pod, :node, :ensure, :start] | Pod node acquisition/adoption started | system_time | pod_id, pod_module, node_name, node_manager, node_kind, source, owner, jido_instance |
| [:jido, :pod, :node, :ensure, :stop] | Pod node acquisition/adoption completed | duration, source, parent | pod_id, pod_module, node_name, node_manager, node_kind, source, owner, jido_instance |
| [:jido, :pod, :node, :ensure, :exception] | Pod node acquisition/adoption failed | duration | pod_id, pod_module, node_name, node_manager, node_kind, source, owner, error, jido_instance |

source explains what ensure_node/3 did:

  • :started — the node was started through its InstanceManager
  • :running — the node was already alive and was re-adopted
  • :adopted — the node was already attached to its expected runtime parent

owner is the logical :owns parent when the node is not a pod root. Root nodes emit owner: nil.

node_name and owner can be either atoms or strings, depending on how the pod topology names its nodes.

For kind: :pod nodes, a node-level exception on the outer pod can wrap a nested reconcile report from the inner pod manager.
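A handler for these ensure events might branch on source to track each outcome separately. The module below is a sketch; the outcome label strings are hypothetical:

```elixir
# Sketch of a handler for [:jido, :pod, :node, :ensure, :stop] that
# classifies outcomes by the `source` measurement described above.
# The label names are hypothetical; wire `handle_event/4` to your own
# metrics backend in a real handler.
defmodule PodEnsureCounter do
  def attach do
    :telemetry.attach(
      "pod-ensure-counter",
      [:jido, :pod, :node, :ensure, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, _metadata, _config) do
    # `source` arrives in the stop measurements (see the table above).
    outcome_label(measurements.source)
  end

  # :started — started through its InstanceManager
  def outcome_label(:started), do: "pod.node.started"
  # :running — already alive and re-adopted
  def outcome_label(:running), do: "pod.node.readopted"
  # :adopted — already attached to its expected runtime parent
  def outcome_label(:adopted), do: "pod.node.adopted"
end
```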

Correlation Metadata

When trace context is active, all events include:

  • :jido_trace_id — shared across the entire call chain
  • :jido_span_id — unique to the current operation
  • :jido_parent_span_id — the parent operation
  • :jido_causation_id — the signal ID that caused this signal

Metrics Collection

Prometheus with TelemetryMetricsPrometheus

# mix.exs
defp deps do
  [
    {:telemetry_metrics_prometheus, "~> 1.1"}
  ]
end

# lib/my_app/telemetry.ex
defmodule MyApp.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {TelemetryMetricsPrometheus, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      # Agent command latency histogram. The short metric name is kept for
      # stable Prometheus series names; event_name binds it to the
      # [:jido, :agent, :cmd, :stop] event that actually carries :duration.
      distribution("jido.agent.cmd.duration",
        event_name: [:jido, :agent, :cmd, :stop],
        measurement: :duration,
        unit: {:native, :millisecond},
        reporter_options: [buckets: [10, 50, 100, 250, 500, 1000, 2500]],
        tags: [:agent_module]
      ),

      # Command throughput
      counter("jido.agent.cmd.stop.count",
        tags: [:agent_module]
      ),

      # Error rate
      counter("jido.agent.cmd.exception.count",
        tags: [:agent_module]
      ),

      # Signal processing latency
      distribution("jido.agent_server.signal.duration",
        event_name: [:jido, :agent_server, :signal, :stop],
        measurement: :duration,
        unit: {:native, :millisecond},
        reporter_options: [buckets: [10, 50, 100, 250, 500, 1000]],
        tags: [:signal_type]
      ),

      # Directive execution
      counter("jido.agent_server.directive.stop.count",
        tags: [:directive_type]
      ),

      # Queue overflow events
      counter("jido.agent_server.queue.overflow.count"),

      # Directives per command. TelemetryMetricsPrometheus does not support
      # summary metrics, so expose this as a distribution.
      distribution("jido.agent.cmd.directive_count",
        event_name: [:jido, :agent, :cmd, :stop],
        measurement: :directive_count,
        reporter_options: [buckets: [1, 2, 5, 10, 20]],
        tags: [:agent_module]
      )
    ]
  end
end

Alternatively, use Jido's built-in metric definitions which include automatic per-instance scoping:

children = [
  {TelemetryMetricsPrometheus, metrics: Jido.Telemetry.metrics()}
]

Add to your application supervision tree:

# lib/my_app/application.ex
def start(_type, _args) do
  children = [
    MyApp.Telemetry,
    # ... other children
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end

StatsD with TelemetryMetricsStatsd

# mix.exs
defp deps do
  [
    {:telemetry_metrics_statsd, "~> 0.7"}
  ]
end

defmodule MyApp.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {TelemetryMetricsStatsd,
       metrics: metrics(),
       host: System.get_env("STATSD_HOST", "localhost"),
       port: String.to_integer(System.get_env("STATSD_PORT", "8125"))}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      # Metric names map onto emitted events: everything before the last
      # dot segment is the event name, the last segment is the measurement.
      counter("jido.agent.cmd.stop.count"),
      counter("jido.agent.cmd.exception.count"),
      summary("jido.agent.cmd.stop.duration", unit: {:native, :millisecond}),
      counter("jido.agent_server.signal.stop.count"),
      summary("jido.agent_server.signal.stop.duration", unit: {:native, :millisecond}),
      counter("jido.agent_server.directive.stop.count"),
      counter("jido.agent_server.queue.overflow.count")
    ]
  end
end

Custom Telemetry Handler

For custom metrics backends or specialized logging:

defmodule MyApp.JidoTelemetryHandler do
  require Logger

  @events [
    [:jido, :agent, :cmd, :start],
    [:jido, :agent, :cmd, :stop],
    [:jido, :agent, :cmd, :exception],
    [:jido, :agent_server, :signal, :start],
    [:jido, :agent_server, :signal, :stop],
    [:jido, :agent_server, :signal, :exception],
    [:jido, :agent_server, :directive, :start],
    [:jido, :agent_server, :directive, :stop],
    [:jido, :agent_server, :directive, :exception],
    [:jido, :agent_server, :queue, :overflow],
    [:jido, :agent, :strategy, :cmd, :start],
    [:jido, :agent, :strategy, :cmd, :stop],
    [:jido, :agent, :strategy, :cmd, :exception]
  ]

  def attach do
    :telemetry.attach_many(
      "my-jido-handler",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:jido, :agent, :cmd, :stop], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

    if duration_ms > 1000 do
      Logger.warning("Slow agent command",
        agent_id: metadata.agent_id,
        agent_module: metadata.agent_module,
        jido_instance: metadata[:jido_instance],
        jido_trace_id: metadata[:jido_trace_id],
        jido_span_id: metadata[:jido_span_id],
        duration_ms: duration_ms
      )
    end

    MyMetrics.histogram("jido.cmd.duration", duration_ms, %{
      agent_module: to_string(metadata.agent_module),
      jido_instance: to_string(metadata[:jido_instance])
    })
  end

  def handle_event([:jido, :agent, :cmd, :exception], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

    Logger.error("Agent command failed",
      agent_id: metadata.agent_id,
      agent_module: metadata.agent_module,
      jido_instance: metadata[:jido_instance],
      jido_trace_id: metadata[:jido_trace_id],
      error: inspect(metadata.error),
      duration_ms: duration_ms
    )

    MyMetrics.increment("jido.cmd.errors", %{
      agent_module: to_string(metadata.agent_module),
      jido_instance: to_string(metadata[:jido_instance])
    })
  end

  def handle_event([:jido, :agent_server, :queue, :overflow], measurements, metadata, _config) do
    Logger.error("Agent queue overflow",
      agent_id: metadata.agent_id,
      jido_instance: metadata[:jido_instance],
      queue_size: measurements.queue_size
    )

    MyMetrics.increment("jido.queue.overflow")
  end

  def handle_event(_event, _measurements, _metadata, _config), do: :ok
end

Call MyApp.JidoTelemetryHandler.attach() in your application startup.

OpenTelemetry Integration

Core jido intentionally stays free of direct OpenTelemetry dependencies. Use a separate package (for example, jido_otel) to provide a concrete tracer implementation, then configure Jido to use that module.

Quick Start (External jido_otel Package)

# mix.exs
defp deps do
  [
    {:jido_otel, "~> 0.1"},
    {:opentelemetry_exporter, "~> 1.7"}
  ]
end

# config/prod.exs
config :jido, :observability,
  tracer: JidoOtel.Tracer,
  log_level: :info,
  redact_sensitive: true,
  tracer_failure_mode: :warn

# OpenTelemetry exporter configuration
config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :grpc,
  otlp_endpoint: System.get_env("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

Custom Tracer Extensions

For custom backend behavior, you can still implement Jido.Observe.Tracer directly:

defmodule MyApp.CustomTracer do
  @behaviour Jido.Observe.Tracer

  @impl true
  def span_start(event_prefix, metadata) do
    # Start the external span and return a tracer context for the finish
    # callbacks. The map below is a placeholder.
    %{event: event_prefix, metadata: metadata}
  end

  @impl true
  def span_stop(_tracer_ctx, _measurements) do
    # Complete the external span.
    :ok
  end

  @impl true
  def span_exception(_tracer_ctx, _kind, _reason, _stacktrace) do
    # Record the exception in the external tracer.
    :ok
  end

  @impl true
  def with_span_scope(event_prefix, metadata, fun) do
    # Optional sync span-scoping callback.
    # Must call `fun.()` exactly once and preserve result/error semantics.
    _ = {event_prefix, metadata}
    fun.()
  end
end

Sync vs Async Tracer Semantics

  • with_span/3 uses optional with_span_scope/3 when implemented.
  • start_span/2 + finish callbacks remain explicit lifecycle APIs for async work.
  • Existing tracers that implement only span_start/2, span_stop/2, and span_exception/4 continue to work unchanged.

Tracer Failure Modes

Use :tracer_failure_mode to control callback failures:

  • :warn (default): isolate tracer failures and continue app flow
  • :strict: raise immediately when tracer callbacks fail

config :jido, :observability,
  tracer: MyApp.CustomTracer,
  tracer_failure_mode: :strict

Correlation IDs and Distributed Tracing

Jido automatically propagates trace context through signal chains via Jido.Tracing.Context.

Extracting Trace IDs

defmodule MyApp.RequestHandler do
  require Logger

  # Assumes `Signal` and `AgentServer` aliases are in scope and that
  # `agent_pid` identifies the target agent server.
  def handle_request(conn, agent_pid) do
    # Reuse the caller's correlation ID when present; Plug returns header
    # values as a list.
    trace_id =
      case Plug.Conn.get_req_header(conn, "x-trace-id") do
        [id | _] -> id
        [] -> generate_trace_id()
      end

    Logger.metadata(trace_id: trace_id)

    signal =
      Signal.new!("process_request", conn.params,
        source: "/api/request",
        extensions: %{
          "jido_ext_trace" => %{
            "trace_id" => trace_id,
            "span_id" => generate_span_id(),
            "parent_span_id" => nil
          }
        }
      )

    {:ok, result} = AgentServer.call(agent_pid, signal)
    result
  end

  defp generate_trace_id, do: Base.encode16(:crypto.strong_rand_bytes(16), case: :lower)
  defp generate_span_id, do: Base.encode16(:crypto.strong_rand_bytes(8), case: :lower)
end

Cross-Agent Tracing

When agents spawn child agents or emit signals to other agents, trace context propagates automatically:

def handle_event([:jido, :agent_server, :signal, :stop], _measurements, metadata, _config) do
  Logger.info("Signal processed",
    agent_id: metadata.agent_id,
    signal_type: metadata.signal_type,
    jido_instance: metadata[:jido_instance],
    jido_trace_id: metadata[:jido_trace_id],
    jido_span_id: metadata[:jido_span_id],
    jido_parent_span_id: metadata[:jido_parent_span_id],
    jido_causation_id: metadata[:jido_causation_id]
  )
end

Custom Namespace Events and Spans

Non-Gated Domain Events (emit_event/3)

Use Jido.Observe.emit_event/3 for production domain lifecycle events that must always emit telemetry regardless of :debug_events config.

Jido.Observe.EventContract provides lightweight key validation helpers so downstream namespaces keep stable metadata/measurement contracts.

Namespace ownership matters here:

  • jido owns the generic runtime and execution surfaces such as [:jido, :agent, ...], [:jido, :agent_server, ...], and [:jido, :action, ...].
  • Domain packages should own their own higher-level namespaces such as [:jido, :ai, ...] in jido_ai.
  • Keep AI- or domain-specific event contracts out of core jido; use emit_event/3 plus package-local docs/tests in the owning package instead.

alias Jido.Observe
alias Jido.Observe.EventContract

with {:ok, validated} <-
       EventContract.validate_event(
         [:jido, :ai, :request, :completed],
         %{duration_ms: 42},
         %{request_id: "req-1", terminal_state: :completed, model: "gpt-4.1"},
         required_metadata: [:request_id, :terminal_state],
         required_measurements: [:duration_ms]
       ) do
  Observe.emit_event(validated.event, validated.measurements, validated.metadata)
end

Async Request Lifecycle Pattern

For long-lived request workflows, use a request-root span plus async child spans. Propagate correlation metadata across async boundaries, then emit terminal domain events (completed, failed, cancelled, rejected).

Important: OpenTelemetry current-span context is process-local. Across Task boundaries, explicitly propagate only correlation metadata and start/finish spans in the owning process.

alias Jido.Observe
alias Jido.Observe.EventContract
alias Jido.Tracing.Context, as: TraceContext

# Root request span
request_span = Observe.start_span([:jido, :ai, :request], %{request_id: request_id})
trace_metadata = TraceContext.to_telemetry_metadata()

# Async tool span
task =
  Task.async(fn ->
    child_metadata = Map.merge(trace_metadata, %{request_id: request_id, tool: "search"})
    tool_span = Observe.start_span([:jido, :ai, :request, :tool], child_metadata)
    result = ExternalSearch.run(query)
    Observe.finish_span(tool_span, %{result_count: length(result.items)})
  end)

Task.await(task)

# Terminal lifecycle event
{:ok, validated} =
  EventContract.validate_event(
    [:jido, :ai, :request, :completed],
    %{duration_ms: 125},
    %{request_id: request_id, terminal_state: :completed},
    required_metadata: [:request_id, :terminal_state],
    required_measurements: [:duration_ms]
  )

Observe.emit_event(
  validated.event,
  validated.measurements,
  Map.merge(validated.metadata, trace_metadata)
)

Observe.finish_span(request_span, %{directive_count: 1})

| Event | Purpose | Required Metadata | Required Measurements |
|---|---|---|---|
| [:jido, :ai, :request, :start] | Request lifecycle start | request_id | — |
| [:jido, :ai, :request, :tool, :start] | Async tool lifecycle start | request_id, tool | — |
| [:jido, :ai, :request, :tool, :stop] | Async tool lifecycle stop | request_id, tool | duration or duration_ms |
| [:jido, :ai, :request, :completed] | Request completed | request_id, terminal_state | duration_ms |
| [:jido, :ai, :request, :failed] | Request failed | request_id, terminal_state | duration_ms |
| [:jido, :ai, :request, :cancelled] | Request cancelled | request_id, terminal_state | duration_ms |
| [:jido, :ai, :request, :rejected] | Request rejected | request_id, terminal_state | duration_ms |

| Category | Keys | Notes |
|---|---|---|
| Metadata | request_id, terminal_state, tool, model | Use IDs and categorical values; avoid payloads. |
| Measurements | duration_ms, token_count, result_count, cost_usd | Numeric measurements only. |
| Metric Tags | jido_instance, terminal_state, tool, model | Keep tag cardinality bounded. |

Dashboard Metrics Recommendations

Key Metrics to Track

| Metric | Type | Alert Threshold |
|---|---|---|
| jido.agent.cmd.duration p99 | Histogram | > 1s |
| jido.agent.cmd.exception.count rate | Counter | > 1/min per agent |
| jido.agent_server.signal.duration p95 | Histogram | > 500ms |
| jido.agent_server.queue.overflow.count | Counter | > 0 |
| jido.agent_server.directive.exception.count | Counter | > 0 |

Grafana Dashboard Panels

Command Latency Distribution:

histogram_quantile(0.99, 
  sum(rate(jido_agent_cmd_duration_bucket[5m])) by (le, jido_instance, agent_module)
)

Error Rate by Agent:

sum(rate(jido_agent_cmd_exception_count[5m])) by (jido_instance, agent_module)
/ 
sum(rate(jido_agent_cmd_stop_count[5m])) by (jido_instance, agent_module)

Signal Throughput:

sum(rate(jido_agent_server_signal_stop_count[1m])) by (jido_instance, signal_type)

Directive Execution Rate:

sum(rate(jido_agent_server_directive_stop_count[1m])) by (jido_instance, directive_type)

Alerting Patterns

Prometheus Alerting Rules

groups:
  - name: jido
    rules:
      - alert: JidoHighCommandLatency
        expr: histogram_quantile(0.99, sum(rate(jido_agent_cmd_duration_bucket[5m])) by (le)) > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jido command latency p99 > 2s"

      - alert: JidoHighErrorRate
        expr: |
          sum(rate(jido_agent_cmd_exception_count[5m])) 
          / sum(rate(jido_agent_cmd_stop_count[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Jido agent error rate > 5%"

      - alert: JidoQueueOverflow
        expr: increase(jido_agent_server_queue_overflow_count[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Jido directive queue overflow detected"

      - alert: JidoDirectiveFailures
        expr: increase(jido_agent_server_directive_exception_count[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Jido directive execution failures"

SLO Definitions

| SLI | Target | Calculation |
|---|---|---|
| Command Success Rate | 99.9% | 1 - (cmd.exception.count / cmd.stop.count) |
| Signal Latency p99 | < 500ms | histogram_quantile(0.99, signal.duration) |
| Directive Success Rate | 99.99% | 1 - (directive.exception.count / directive.stop.count) |
| Queue Overflow Rate | 0 | queue.overflow.count == 0 |
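As a worked example of the success-rate SLIs: with 100,000 completed commands and 50 exceptions, the Command Success Rate is 1 - 50/100,000 = 0.9995, just above the 99.9% target.

```elixir
# Command Success Rate = 1 - (exception count / stop count).
# Pure arithmetic illustration of the SLI formula in the table above.
success_rate = fn exception_count, stop_count ->
  1 - exception_count / stop_count
end

# ~0.9995, above the 0.999 target
success_rate.(50, 100_000)
```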

Debug Events in Development

Enable verbose telemetry for debugging:

# config/dev.exs
config :jido, :observability,
  log_level: :debug,
  debug_events: :all,
  redact_sensitive: false

Toggle debug mode at runtime for a specific instance:

MyApp.Jido.debug(:on)       # sets debug_events to :minimal
MyApp.Jido.debug(:verbose)  # sets debug_events to :all
MyApp.Jido.debug(:off)      # back to configured defaults

Emit debug events from custom code:

Jido.Observe.emit_debug_event(
  [:my_app, :custom, :step],
  %{duration: 1234},
  %{agent_id: agent.id, jido_instance: MyApp.Jido, step: 3, status: :processing}
)

Debug events are no-ops when :debug_events is :off (production default).
