# Telemetry and Observability
Arcana emits telemetry events for all operations, giving you visibility into performance, errors, and usage patterns. This guide covers setup options from quick debugging to full production monitoring.
## Quick Start
For immediate visibility, attach the built-in logger in your application startup:
```elixir
# lib/my_app/application.ex
def start(_type, _args) do
  # Attach the telemetry logger before starting the supervision tree
  Arcana.Telemetry.Logger.attach()

  children = [
    MyApp.Repo,
    Arcana.TaskSupervisor,
    Arcana.Embedder.Local
  ]

  Supervisor.start_link(children, strategy: :one_for_one)
end
```

This logs all Arcana operations with timing:
```
[info] [Arcana] search completed in 42ms (15 results)
[info] [Arcana] llm.complete completed in 1.23s [openai:gpt-4o-mini] ok (156 chars) prompt=892chars
[info] [Arcana] agent.gate completed in 180ms (skip_retrieval: false)
[info] [Arcana] agent.rewrite completed in 235ms
[info] [Arcana] agent.expand completed in 2.15s ("machine learning ML models...")
[info] [Arcana] agent.search completed in 156ms (25 chunks)
[info] [Arcana] agent.reason completed in 1.2s (1 iteration)
[info] [Arcana] agent.rerank completed in 312ms (10/25 kept)
[info] [Arcana] agent.answer completed in 3.25s
[info] [Arcana] ask completed in 6.12s
```

With `graph: true` enabled, you'll also see:
```
[info] [Arcana] graph.ner completed in 45ms (3 entities)
[info] [Arcana] graph.relationship_extraction completed in 1.2s (2 relationships)
[info] [Arcana] graph.build completed in 1.5s (5 entities, 3 relationships)
[info] [Arcana] graph.search completed in 28ms (8 graph results, 10 combined)
```

## Logger Options
```elixir
Arcana.Telemetry.Logger.attach(
  level: :debug,           # Log level (default: :info)
  handler_id: "my-logger"  # Custom handler ID
)

# To stop logging
Arcana.Telemetry.Logger.detach()
```

## Event Reference
All events use `:telemetry.span/3`, which automatically emits `:start`, `:stop`, and `:exception` variants.
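For example, a single handler can subscribe to all three variants of an event and compute durations from the measurements. This is a minimal sketch using only the standard `:telemetry` API; the event names come from the tables below:

```elixir
defmodule MyApp.SearchSpanHandler do
  require Logger

  def attach do
    :telemetry.attach_many(
      "search-span-handler",
      [
        [:arcana, :search, :start],
        [:arcana, :search, :stop],
        [:arcana, :search, :exception]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:arcana, :search, :start], _measurements, _metadata, _config) do
    Logger.debug("search started")
  end

  def handle_event([:arcana, :search, variant], measurements, _metadata, _config)
      when variant in [:stop, :exception] do
    # :duration is in native time units; convert before logging
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    Logger.debug("search #{variant} after #{ms}ms")
  end
end
```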
### Core Events
| Event | Measurements | Metadata |
|---|---|---|
| `[:arcana, :ingest, :*]` | `system_time`, `duration` | `text`, `repo`, `collection`, `document`, `chunk_count` |
| `[:arcana, :search, :*]` | `system_time`, `duration` | `query`, `repo`, `mode`, `limit`, `results`, `result_count` |
| `[:arcana, :ask, :*]` | `system_time`, `duration` | `question`, `repo`, `answer`, `context_count` |
| `[:arcana, :embed, :*]` | `system_time`, `duration` | `text`, `dimensions` |
| `[:arcana, :llm, :complete, :*]` | `system_time`, `duration` | `model`, `prompt_length`, `success`, `response_length`, `error` |
### Agent Pipeline Events
Each step in the Agent pipeline emits its own events:
| Event | Metadata |
|---|---|
| `[:arcana, :agent, :gate, :*]` | `question`, `skip_retrieval` |
| `[:arcana, :agent, :rewrite, :*]` | `question`, `rewritten_query` |
| `[:arcana, :agent, :select, :*]` | `selected` (collections) |
| `[:arcana, :agent, :expand, :*]` | `question`, `expanded_query` |
| `[:arcana, :agent, :decompose, :*]` | `question`, `sub_question_count` |
| `[:arcana, :agent, :search, :*]` | `question`, `total_chunks` |
| `[:arcana, :agent, :reason, :*]` | `question`, `iterations` |
| `[:arcana, :agent, :rerank, :*]` | `question`, `chunks_before`, `chunks_after` |
| `[:arcana, :agent, :answer, :*]` | `question`, `context_chunk_count` |
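As one example, the rerank metadata above can be used to track how aggressively chunks are being filtered. This sketch assumes only the metadata keys documented in the table; `[:my_app, :arcana, :rerank_ratio]` is a hypothetical metric name for your own system:

```elixir
def handle_event([:arcana, :agent, :rerank, :stop], _measurements, metadata, _config) do
  # kept/original ratio, computed from chunks_before and chunks_after
  ratio =
    if metadata.chunks_before > 0 do
      metadata.chunks_after / metadata.chunks_before
    else
      0.0
    end

  :telemetry.execute([:my_app, :arcana, :rerank_ratio], %{ratio: ratio})
end
```

A consistently low ratio suggests the initial search is returning many irrelevant chunks; a ratio near 1.0 suggests reranking is adding latency without filtering much.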
### VectorStore Events
Storage layer events for vector operations:
| Event | Metadata |
|---|---|
| `[:arcana, :vector_store, :store, :*]` | `collection`, `id`, `backend` |
| `[:arcana, :vector_store, :search, :*]` | `collection`, `limit`, `backend`, `result_count` |
| `[:arcana, :vector_store, :search_text, :*]` | `collection`, `query`, `limit`, `backend`, `result_count` |
| `[:arcana, :vector_store, :delete, :*]` | `collection`, `id`, `backend` |
| `[:arcana, :vector_store, :clear, :*]` | `collection`, `backend` |
### GraphRAG Events
When using `graph: true`, these events track knowledge graph operations:
| Event | Metadata |
|---|---|
| `[:arcana, :graph, :build, :*]` | `chunk_count`, `collection`, `entity_count`, `relationship_count` |
| `[:arcana, :graph, :search, :*]` | `query`, `entity_count`, `graph_result_count`, `combined_count` |
| `[:arcana, :graph, :ner, :*]` | `text`, `entity_count` |
| `[:arcana, :graph, :relationship_extraction, :*]` | `text`, `relationship_count` |
| `[:arcana, :graph, :community_detection, :*]` | `entity_count`, `community_count` |
| `[:arcana, :graph, :community_summary, :*]` | `entity_count`, `summary_length` |
### GraphStore Events
Storage layer events for graph operations:
| Event | Metadata |
|---|---|
| `[:arcana, :graph_store, :persist_entities, :*]` | `collection_id`, `entity_count`, `backend` |
| `[:arcana, :graph_store, :persist_relationships, :*]` | `relationship_count`, `backend` |
| `[:arcana, :graph_store, :persist_mentions, :*]` | `mention_count`, `backend` |
| `[:arcana, :graph_store, :search, :*]` | `entity_count`, `backend`, `result_count` |
| `[:arcana, :graph_store, :delete_by_chunks, :*]` | `chunk_count`, `backend` |
| `[:arcana, :graph_store, :delete_by_collection, :*]` | `collection_id`, `backend` |
### Exception Events
All `:exception` events include:

- `kind` - The exception type (`:error`, `:exit`, `:throw`)
- `reason` - The exception or error term
- `stacktrace` - Full stacktrace
## Custom Handlers
For more control, attach your own handlers:
```elixir
defmodule MyApp.ArcanaMetrics do
  require Logger

  def setup do
    events = [
      [:arcana, :ingest, :stop],
      [:arcana, :search, :stop],
      [:arcana, :ask, :stop],
      [:arcana, :embed, :stop],
      [:arcana, :llm, :complete, :stop],
      # Agent pipeline
      [:arcana, :agent, :gate, :stop],
      [:arcana, :agent, :rerank, :stop],
      [:arcana, :agent, :reason, :stop],
      [:arcana, :agent, :answer, :stop],
      # VectorStore
      [:arcana, :vector_store, :store, :stop],
      [:arcana, :vector_store, :search, :stop],
      # GraphRAG
      [:arcana, :graph, :build, :stop],
      [:arcana, :graph, :search, :stop],
      # GraphStore
      [:arcana, :graph_store, :persist_entities, :stop],
      [:arcana, :graph_store, :search, :stop]
    ]

    :telemetry.attach_many("my-arcana-metrics", events, &handle_event/4, nil)
  end

  def handle_event([:arcana, :search, :stop], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

    Logger.info("Search: #{metadata.result_count} results in #{duration_ms}ms",
      query: metadata.query,
      mode: metadata.mode
    )

    # Send to your metrics system
    :telemetry.execute([:my_app, :arcana, :search], %{
      duration_ms: duration_ms,
      result_count: metadata.result_count
    })
  end

  def handle_event([:arcana, :llm, :complete, :stop], measurements, metadata, _config) do
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

    if metadata.success do
      Logger.debug("LLM call to #{metadata.model}: #{duration_ms}ms")
    else
      Logger.warning("LLM call failed: #{metadata.error}")
    end
  end

  # ... handle other events
end
```

Call `MyApp.ArcanaMetrics.setup()` in your application startup.
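For example, a sketch of the startup call, reusing the supervision tree from the Quick Start (your children list will differ):

```elixir
# lib/my_app/application.ex
def start(_type, _args) do
  # Attach custom handlers before any Arcana operations run
  MyApp.ArcanaMetrics.setup()

  children = [
    MyApp.Repo,
    Arcana.TaskSupervisor,
    Arcana.Embedder.Local
  ]

  Supervisor.start_link(children, strategy: :one_for_one)
end
```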
## Phoenix LiveDashboard Integration
Add Arcana metrics to your LiveDashboard:
```elixir
# lib/my_app/telemetry.ex
defmodule MyApp.Telemetry do
  import Telemetry.Metrics

  def metrics do
    [
      # Arcana core operations
      summary("arcana.ingest.stop.duration",
        unit: {:native, :millisecond},
        tags: [:collection]
      ),
      summary("arcana.search.stop.duration",
        unit: {:native, :millisecond},
        tags: [:mode]
      ),
      counter("arcana.search.stop.result_count"),
      summary("arcana.ask.stop.duration",
        unit: {:native, :millisecond}
      ),
      # Embedding performance
      summary("arcana.embed.stop.duration",
        unit: {:native, :millisecond}
      ),
      # LLM calls (often the slowest part)
      summary("arcana.llm.complete.stop.duration",
        unit: {:native, :millisecond},
        tags: [:model]
      ),
      counter("arcana.llm.complete.stop.prompt_length"),
      # Agent pipeline steps
      summary("arcana.agent.rerank.stop.duration",
        unit: {:native, :millisecond}
      ),
      last_value("arcana.agent.rerank.stop.kept"),
      summary("arcana.agent.answer.stop.duration",
        unit: {:native, :millisecond}
      ),
      # GraphRAG metrics
      summary("arcana.graph.build.stop.duration",
        unit: {:native, :millisecond}
      ),
      last_value("arcana.graph.build.stop.entity_count"),
      summary("arcana.graph.search.stop.duration",
        unit: {:native, :millisecond}
      ),
      last_value("arcana.graph.search.stop.graph_result_count")
    ]
  end
end
```

Configure LiveDashboard to use these metrics:
```elixir
# lib/my_app_web/router.ex
live_dashboard "/dashboard",
  metrics: MyApp.Telemetry
```

## Prometheus Integration
For production monitoring with Prometheus, use `prom_ex`:

```elixir
# mix.exs
{:prom_ex, "~> 1.9"}
```

```elixir
# lib/my_app/prom_ex.ex
defmodule MyApp.PromEx do
  use PromEx, otp_app: :my_app

  @impl true
  def plugins do
    [
      # Default plugins
      PromEx.Plugins.Application,
      PromEx.Plugins.Beam,
      # Add custom Arcana metrics
      MyApp.PromEx.ArcanaPlugin
    ]
  end

  @impl true
  def dashboards do
    [
      {:prom_ex, "application.json"},
      {:prom_ex, "beam.json"}
    ]
  end
end
```

```elixir
# lib/my_app/prom_ex/arcana_plugin.ex
defmodule MyApp.PromEx.ArcanaPlugin do
  use PromEx.Plugin

  @impl true
  def event_metrics(_opts) do
    Event.build(
      :arcana_event_metrics,
      [
        distribution(
          [:arcana, :search, :duration, :milliseconds],
          event_name: [:arcana, :search, :stop],
          measurement: :duration,
          unit: {:native, :millisecond},
          tags: [:mode],
          tag_values: fn metadata -> %{mode: metadata[:mode] || :semantic} end
        ),
        distribution(
          [:arcana, :llm, :complete, :duration, :milliseconds],
          event_name: [:arcana, :llm, :complete, :stop],
          measurement: :duration,
          unit: {:native, :millisecond},
          tags: [:model, :success],
          tag_values: fn metadata ->
            %{model: metadata[:model] || "unknown", success: metadata[:success]}
          end
        ),
        counter(
          [:arcana, :ingest, :chunks, :total],
          event_name: [:arcana, :ingest, :stop],
          measurement: fn _measurements, metadata -> metadata[:chunk_count] || 0 end
        )
      ]
    )
  end
end
```

## Debugging Performance Issues
### Identify Slow Operations
The built-in logger makes it easy to spot bottlenecks:
```
[info] [Arcana] embed completed in 45ms (384 dims)
[info] [Arcana] search completed in 12ms (10 results)
[info] [Arcana] llm.complete completed in 3.2s [openai:gpt-4o] ok (1892 chars)
[info] [Arcana] ask completed in 3.3s
```

In this example, the LLM call dominates the total time (3.2s of 3.3s).
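To surface slow requests automatically instead of scanning logs, you can attach a handler that warns when an operation exceeds a threshold. A sketch for a custom handler module; the 5-second cutoff is an arbitrary example value, not an Arcana default:

```elixir
def handle_event([:arcana, :ask, :stop], measurements, metadata, _config) do
  ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

  # 5_000ms is an example threshold; tune it for your workload
  if ms > 5_000 do
    Logger.warning("Slow ask: #{ms}ms", question: metadata.question)
  end
end
```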
### Track Agent Pipeline Steps
For agentic RAG, each pipeline step is instrumented:
```
[info] [Arcana] agent.gate completed in 150ms (skip_retrieval: false)
[info] [Arcana] agent.rewrite completed in 180ms ("what are elixir macros")
[info] [Arcana] agent.expand completed in 220ms ("elixir macros metaprogramming...")
[info] [Arcana] agent.search completed in 35ms (25 chunks)
[info] [Arcana] agent.reason completed in 850ms (1 iteration)
[info] [Arcana] agent.rerank completed in 890ms (8/25 kept)
[info] [Arcana] agent.answer completed in 2.1s
```

Here, reranking takes 890ms. If this is too slow, consider:
- Reducing chunks before reranking (lower search limit)
- Using a faster reranking threshold
- Implementing a custom reranker
If `reason/2` is taking too long due to multiple iterations, consider:

- Lowering `max_iterations` (default: 2)
- Improving initial search quality with query expansion
### Monitor LLM Costs
Track prompt sizes to estimate API costs:
```elixir
def handle_event([:arcana, :llm, :complete, :stop], _measurements, metadata, _config) do
  # Rough token estimate (~4 chars per token)
  prompt_tokens = div(metadata.prompt_length || 0, 4)
  response_tokens = div(metadata.response_length || 0, 4)

  Logger.info("LLM usage",
    model: metadata.model,
    prompt_tokens: prompt_tokens,
    response_tokens: response_tokens,
    success: metadata.success
  )
end
```

## Error Tracking
Handle exceptions to send to your error tracking service:
```elixir
def setup do
  exception_events = [
    [:arcana, :ingest, :exception],
    [:arcana, :search, :exception],
    [:arcana, :ask, :exception],
    [:arcana, :llm, :complete, :exception]
  ]

  :telemetry.attach_many("arcana-errors", exception_events, &handle_exception/4, nil)
end

def handle_exception(event, _measurements, metadata, _config) do
  # Send to Sentry, Honeybadger, etc.
  Sentry.capture_message("Arcana error",
    extra: %{
      event: inspect(event),
      kind: metadata.kind,
      reason: inspect(metadata.reason)
    }
  )
end
```

## Best Practices
- **Start with the built-in logger** - It's zero-config and helps you understand what's happening
- **Focus on LLM latency** - This is usually the bottleneck; track it closely
- **Monitor reranking** - If using `Agent.rerank/2`, watch the kept/original ratio
- **Track by collection** - Tag metrics with collection names to identify slow document sets
- **Set up alerts** - Alert on LLM failures and unusually slow operations
- **Log in production** - Keep at least `:info`-level logging for Arcana to debug issues
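As a starting point for the alerting recommendation, a handler can forward each LLM failure as a counter metric for your alerting system to watch. This is a sketch: `metadata.success` and `metadata.error` come from the LLM event table earlier in this guide, and `[:my_app, :arcana, :llm_failure]` is a hypothetical metric name:

```elixir
def handle_event([:arcana, :llm, :complete, :stop], _measurements, metadata, _config) do
  unless metadata.success do
    # Emit a counter event; alert when the rate exceeds your tolerance
    :telemetry.execute([:my_app, :arcana, :llm_failure], %{count: 1}, %{
      model: metadata.model,
      error: inspect(metadata.error)
    })
  end
end
```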