Design and Internals

View Source

This guide describes the architecture, design decisions, and internal workings of the instrument library for developers who want to understand or extend the system.

Architecture Overview

+------------------+     +------------------+     +------------------+
|   Application    |     |    Tracing       |     |    Metrics       |
|------------------|     |------------------|     |------------------|
| instrument_app   |---->| instrument_tracer|     | instrument_counter|
| instrument_sup   |     | instrument_sampler     | instrument_gauge  |
| instrument_config|     | instrument_span_*|     | instrument_hist...|
+------------------+     +------------------+     +------------------+
         |                       |                       |
         v                       v                       v
+------------------+     +------------------+     +------------------+
|    Context       |     |   Processing     |     |    Registry      |
|------------------|     |------------------|     |------------------|
| instrument_ctx   |<--->| span_processor   |     | instrument_reg...|
| instrument_prop..      | instrument_exp...|<--->| instrument_lib   |
+------------------+     +------------------+     +------------------+
                                |
                                v
                         +------------------+
                         | Flight Recorder  |
                         |------------------|
                         | flight_recorder  |
                         | tracer_pool      |
                         | tracer_nif (C)   |
                         +------------------+

Startup Sequence

  1. instrument_app:start/2 - Initializes configuration via instrument_config:init()
  2. Creates the instrument_span_exporters ETS table for span export hooks
  3. instrument_sup:start_link/0 - Starts the supervisor tree

The supervisor (instrument_sup) starts these children with one_for_all strategy:

OrderChild IDModulePurpose
1registryinstrument_registryMetric registration and lookup
2exporterinstrument_exporterSpan export batching
3metrics_exporterinstrument_metrics_exporterMetrics export
4log_exporterinstrument_log_exporterLog export
5span_processorinstrument_span_processorSpan processing chain
6flight_recorderinstrument_flight_recorderMessage tracing

Design Principles

Lock-free Operations: Metrics use NIF-based C11 atomics for updates, avoiding locks entirely.

Scheduler-aware Storage: ETS tables are partitioned per-scheduler (instrument_registry_1 through instrument_registry_N) to eliminate cross-scheduler contention.

Persistent Term Caching: Metric lookups use persistent_term for O(1) read access with zero copying overhead.

Process Dictionary Context: Span context is stored in the process dictionary for zero-cost reads within a process.

Module Organization

By Subsystem

SubsystemModulesPurpose
Applicationinstrument_app, instrument_sup, instrument_configLifecycle and configuration
Tracinginstrument_tracer, instrument_sampler, instrument_idSpan creation and sampling
Metricsinstrument_counter, instrument_gauge, instrument_histogramMetric types
Contextinstrument_context, instrument_propagator, instrument_propagator_*Context propagation
Processinginstrument_span_processor, instrument_span_processor_*, instrument_exporterSpan processing pipeline
Flight Recorderinstrument_flight_recorder, instrument_tracer_pool, instrument_tracer_nifMessage capture
Registryinstrument_registry, instrument_libMetric storage

Module Dependencies

instrument_tracer
    |
    +---> instrument_context (span storage)
    +---> instrument_sampler (sampling decisions)
    +---> instrument_id (ID generation)
    +---> instrument_span_processor (on_start/on_end hooks)
    +---> instrument_flight_recorder (message tracing)

Key Modules

instrument_tracer: Core span lifecycle management. Handles span creation, context attachment, and export hooks. Uses the process dictionary via instrument_context to maintain the current span stack.

instrument_registry: Central metric storage using a gen_server with scheduler-partitioned ETS tables and persistent_term caching. The gen_server serializes writes while reads bypass it entirely.

instrument_span_processor: Manages a chain of span processors, invoking on_start/2 and on_end/1 callbacks. Supports both simple (immediate) and batch processing.

instrument_exporter: Batches completed spans and distributes them to registered exporters. Default batch size is 512 spans with a 5-second timeout.

instrument_flight_recorder: Low-overhead message tracing using erlang:trace with a custom erl_tracer NIF. Distributes trace events across a worker pool to avoid single-process bottlenecks.

Key Data Structures

Tracing Records (instrument_otel.hrl)

#span_ctx{} - W3C TraceContext identifier

-record(span_ctx, {
  trace_id :: <<_:128>>,           % 16 bytes (128 bits)
  span_id :: <<_:64>>,             % 8 bytes (64 bits)
  trace_flags = 1 :: 0 | 1,        % sampled flag
  trace_state = [] :: [{binary(), binary()}],  % vendor key-value pairs
  is_remote = false :: boolean()   % extracted from remote?
}).

#span{} - Complete span record

-record(span, {
  name :: binary(),
  ctx :: #span_ctx{},
  parent_ctx :: #span_ctx{} | undefined,
  kind = internal :: client | server | producer | consumer | internal,
  start_time :: integer(),         % monotonic nanoseconds
  end_time :: integer() | undefined,
  attributes = #{} :: map(),
  events = [] :: [#span_event{}],
  links = [] :: [#span_link{}],
  status = unset :: unset | ok | {error, binary()},
  is_recording = true :: boolean()
}).

#span_event{} - Timestamped annotation

-record(span_event, {
  name :: binary(),
  timestamp :: integer(),          % monotonic nanoseconds
  attributes = #{} :: map()
}).

#sampling_result{} - Sampler output

-record(sampling_result, {
  decision :: drop | record_only | record_and_sample,
  attributes = #{} :: map(),
  trace_state = [] :: [{binary(), binary()}]
}).

Metrics Records (instrument.hrl)

#metric{} - Core metric wrapper

-record(metric, {
  name,
  handle :: term(),                % NIF resource or #vector{}
  collect :: tuple(),              % {Module, Function, Args}
  description :: binary() | undefined,
  unit :: binary() | undefined,
  meter :: binary() | undefined,
  attributes = #{} :: map()
}).

#vector{} - Labeled/dimensional metrics

-record(vector, {
  name,
  help,
  metric,                          % counter | gauge | histogram
  buckets = [],                    % histogram boundaries
  labels = [],                     % label keys
  labels_map = #{}                 % {LabelValues} => #metric{}
}).

Internal Workflows

Span Lifecycle

start_span(Name, Opts)
    |
    v
[Check tracing enabled] --no--> [Create noop span] --> [Attach] --> Return
    |
    yes
    v
[Get parent context from opts or current span]
    |
    v
[Generate trace_id (new trace) or inherit from parent]
[Generate span_id]
    |
    v
[Call sampler: should_sample/6]
    |
    +---> decision = drop         --> is_recording = false, trace_flags = 0
    +---> decision = record_only  --> is_recording = true, trace_flags = 0
    +---> decision = record_and_sample --> is_recording = true, trace_flags = 1
    |
    v
[Create #span{} with merged attributes]
    |
    v
[If is_recording: call span_processor:on_start/2]
    |
    v
[Attach span to context (process dictionary)]
    |
    v
[If flight recorder enabled: enable_flight_tracing/1]
    |
    v
Return #span{}

Span End Flow

end_span(Span)
    |
    v
[If is_recording = false] --> [Detach from context] --> Return ok
    |
    v
[Set end_time = now()]
    |
    v
[Call span_processor:on_end/1]
    |
    v
[Mark is_recording = false]
    |
    v
[Call all registered exporters via ETS lookup]
    |
    v
[If flight recorder owner: disable tracing]
    |
    v
[Detach span from context, restore parent if any]
    |
    v
Return ok

Metrics Recording Flow

instrument_counter:inc(Name, Value)
    |
    v
[Lookup metric via persistent_term]
    |
    +---> #metric{handle = NIF_Resource}
    |         |
    |         v
    |     [Call NIF: atomic increment]
    |
    +---> #metric{handle = #vector{}}
              |
              v
          [Lookup labeled metric from labels_map]
              |
              v
          [Call NIF on resolved metric]

Context Propagation Flow

Outbound (inject):
    instrument_propagator:inject(Carrier)
        |
        v
    [Get current context from process dictionary]
        |
        v
    [For each registered propagator:]
        +---> propagator:inject(Ctx, Carrier) --> Updated Carrier
        |
        v
    Return final Carrier with all headers


Inbound (extract):
    instrument_propagator:extract(Carrier)
        |
        v
    [Create new empty context]
        |
        v
    [For each registered propagator:]
        +---> propagator:extract(Carrier, Ctx) --> Updated Ctx
        |
        v
    Return context with span_ctx, baggage, etc.

Flight Recorder Message Flow

[Root span starts]
    |
    v
[enable_flight_tracing(TraceId)]
    |
    v
[Store label in process dictionary]
    |
    v
[Get tracer state from instrument_tracer_pool]
    |
    v
[erlang:trace(self(), true, [send, 'receive', set_on_spawn,
                             {tracer, instrument_tracer_nif, State}])]
    |
    v
[All send/receive events captured by NIF]
    |
    +---> [NIF hashes tracee PID to select worker]
    |         |
    |         v
    |     [Send {trace, ...} to worker]
    |         |
    |         v
    |     [Worker inserts into ETS ring buffer]
    |
    v
[On span end: erlang:trace(self(), false, ...)]

Extension Points

Custom Sampler

Implement the instrument_sampler behavior:

-module(my_sampler).
-behaviour(instrument_sampler).

-export([should_sample/7, get_description/1]).

-include_lib("instrument/include/instrument_otel.hrl").

%% Rate limit to N spans per second
should_sample(Config, TraceId, SpanName, SpanKind, Attributes, Links, ParentCtx) ->
  RateLimit = maps:get(rate_limit, Config, 100),
  case check_rate_limit(RateLimit) of
    allow ->
      #sampling_result{
        decision = record_and_sample,
        attributes = #{},
        trace_state = []
      };
    deny ->
      #sampling_result{decision = drop}
  end.

get_description(Config) ->
  RateLimit = maps:get(rate_limit, Config, 100),
  iolist_to_binary(io_lib:format("RateLimitSampler{~p/s}", [RateLimit])).

%% Register:
%% instrument_sampler:set_sampler(my_sampler, #{rate_limit => 50}).

Custom Span Processor

Implement the instrument_span_processor behavior:

-module(my_span_processor).
-behaviour(instrument_span_processor).

-export([init/1, on_start/2, on_end/1, shutdown/1, force_flush/1]).

-include_lib("instrument/include/instrument_otel.hrl").

init(Config) ->
  %% Initialize state (e.g., open connections)
  {ok, #{filter => maps:get(filter, Config, fun(_) -> true end)}}.

on_start(Span, _ParentCtx) ->
  %% Called synchronously at span start
  %% Add attributes, modify span, etc.
  Span#span{attributes = maps:put(<<"processor">>, <<"custom">>,
                                  Span#span.attributes)}.

on_end(Span) ->
  %% Called asynchronously at span end
  %% Filter, transform, forward to external systems
  case should_export(Span) of
    true -> send_to_analytics(Span);
    false -> ok
  end.

shutdown(_State) ->
  ok.

force_flush(_State) ->
  ok.

%% Register:
%% instrument_span_processor:register(my_span_processor, #{}).

Custom Exporter

Implement the exporter callbacks:

-module(my_exporter).

-export([init/1, export/2, shutdown/1, force_flush/1]).

-include_lib("instrument/include/instrument_otel.hrl").

init(Config) ->
  Endpoint = maps:get(endpoint, Config),
  {ok, #{endpoint => Endpoint, conn => connect(Endpoint)}}.

export(Spans, State) ->
  %% Spans is a list of #span{} records
  #{endpoint := Endpoint, conn := Conn} = State,
  Payload = encode_spans(Spans),
  case send_request(Conn, Endpoint, Payload) of
    ok ->
      {ok, State};
    {error, Reason} ->
      logger:warning("Export failed: ~p", [Reason]),
      {error, Reason, State}
  end.

shutdown(#{conn := Conn}) ->
  close(Conn),
  ok.

force_flush(State) ->
  {ok, State}.

%% Register:
%% instrument_exporter:register(#{module => my_exporter,
%%                                config => #{endpoint => "..."}}).

Custom Propagator

Implement the instrument_propagator behavior:

-module(my_propagator).
-behaviour(instrument_propagator).

-export([inject/2, extract/2, fields/0]).

-define(HEADER, <<"x-my-trace">>).

inject(Ctx, Carrier) ->
  case maps:get(span_ctx, Ctx, undefined) of
    undefined ->
      Carrier;
    #span_ctx{trace_id = TraceId, span_id = SpanId} ->
      Value = encode(TraceId, SpanId),
      maps:put(?HEADER, Value, Carrier)
  end.

extract(Carrier, Ctx) ->
  case maps:get(?HEADER, Carrier, undefined) of
    undefined ->
      Ctx;
    Value ->
      {TraceId, SpanId} = decode(Value),
      SpanCtx = #span_ctx{
        trace_id = TraceId,
        span_id = SpanId,
        is_remote = true
      },
      maps:put(span_ctx, SpanCtx, Ctx)
  end.

fields() ->
  [?HEADER].

%% Register:
%% instrument_propagator:register(my_propagator).

Performance Design Choices

NIF-based Atomic Metrics

Metrics use C11 atomics via NIFs for lock-free updates:

// From c_src/gauge.c
static void instrument_gauge_change(instrument_gauge_t *g, double delta) {
    double current = atomic_load(&g->value);
    while (!atomic_compare_exchange_weak(&g->value, &current, current + delta))
        ;
}

This provides:

  • Lock-free updates using CAS (compare-and-swap)
  • No contention between concurrent writers
  • Sub-microsecond update latency

Scheduler-aware ETS Tables

The registry creates one ETS table per scheduler:

% From instrument_lib.erl
tables() ->
  [table(S) || S <- lists:seq(1,erlang:system_info(schedulers))].

table() ->
  table(erlang:system_info(scheduler_id)).

table(1) -> instrument_registry_1;
table(2) -> instrument_registry_2;
% ... up to instrument_registry_64

Benefits:

  • Each scheduler accesses its own table
  • Eliminates cross-scheduler ETS lock contention
  • Linear scaling with core count

Persistent Term Caching

Metric lookups use persistent_term for O(1) access:

% From instrument_registry.erl
lookup(Name) ->
  persistent_term:get({instrument_metric, Name}, undefined).

cache_label(Name, LabelValues, Metric) ->
  persistent_term:put({instrument_label, Name, LabelValues}, Metric).

Characteristics:

  • Reads are essentially free (no copying)
  • Writes trigger global GC but are rare
  • Perfect for read-heavy metric lookups

Process Dictionary Context

Context is stored in the process dictionary:

% From instrument_context.erl
current() ->
  case erlang:get(?CONTEXT_KEY) of
    undefined -> new();
    Ctx -> Ctx
  end.

set_current(Ctx) when is_map(Ctx) ->
  erlang:put(?CONTEXT_KEY, Ctx),
  ok.

Benefits:

  • Zero-cost reads (direct memory access)
  • No message passing or function calls
  • Natural process isolation

Worker Pool Distribution

The flight recorder distributes trace events across workers:

// From c_src/instrument_tracer_nif.c
ErlNifUInt64 hash = enif_hash(ERL_NIF_INTERNAL_HASH, tracee, 0);
unsigned int worker_idx = hash % pool_size;

This avoids the single-process bottleneck that would occur if all trace events went to one handler.

Batched Export

Spans are batched before export:

SettingDefaultPurpose
max_export_batch_size512Spans per batch
schedule_delay_millis5000Time trigger
max_queue_size2048Queue limit
export_timeout_millis30000Export timeout

Performance Summary

OperationMechanismTypical LatencyConcurrency
Metric incrementNIF atomic CAS<100nsLock-free
Metric lookuppersistent_term<10nsCopy-free
Context readProcess dictionary<10nsProcess-local
Span startErlang + sampling~1-5usPer-process
Span endAsync cast<1usNon-blocking
Flight trace eventNIF + hash~500nsPool-distributed
Batch exportAsync processBackgroundConfigurable