Telemetry & Observability

Copy Markdown View Source

Accrue emits :telemetry events for every public entry point. This guide documents the conventions, the high-signal [:accrue, :ops, :*] namespace, the OpenTelemetry span naming rules, and the default Telemetry.Metrics recipe.

If you only read one section: jump to Using the default metrics recipe below — that's the short host wiring snippet that appends Accrue.Telemetry.Metrics.defaults/0 (~20 metric definitions, including every documented ops counter) with minimal glue code.

Doc contract: Published Hex docs for accrue describe the telemetry contracts for that package version. The main branch may add or rename events before they appear in a Hex release; operators should prefer the guides/telemetry.md revision that matches the accrue version pinned in their host mix.lock (see Hex for the published guide snapshot).

Namespace split

Accrue divides telemetry into two namespaces:

  • [:accrue, :*] — the firehose. Every public Accrue entry point emits :start / :stop / :exception here via Accrue.Telemetry.span/3. High cardinality. Use for tracing, debug, and traffic shape — NOT for paging on-call.
  • [:accrue, :ops, :*] — SRE-actionable signals. Low cardinality, high value. Every event in this namespace is an alertable condition that should wake somebody up (or at least file a ticket). Subscribe handlers here, set thresholds, sleep well.

The split exists because Accrue's firehose is too noisy for alerting — a busy SaaS dispatching webhooks, finalizing invoices, and reporting usage events emits hundreds of [:accrue, :*] events per second. The ops namespace is curated: every event is a real ops signal, not a heartbeat.

Firehose and diagnostic events (not in [:accrue, :ops, :*])

These are useful for tracing, dashboards, and anomaly detection, not typically for paging:

  • Billing spansAccrue.Telemetry.span/3 on every public Accrue.Billing entry point: [:accrue, :billing, :<resource>, :<action>, :start | :stop | :exception] (see Accrue.Telemetry module doc).
  • Webhooks[:accrue, :webhook, :receive], handler exceptions, orphan reducers under [:accrue, :webhooks, :orphan_*], [:accrue, :webhooks, :stale_event], etc.
  • Mail / PDF[:accrue, :mailer, :deliver, …], [:accrue, :pdf, :render, …], email fallbacks [:accrue, :email, :locale_fallback | :timezone_fallback | :format_money_failed].

Subscribe in your host Telemetry or OpenTelemetry pipeline when you need latency percentiles, diagnostics, or high-cardinality dashboards — keep on-call paging on [:accrue, :ops, :*] above. The firehose stays useful for tracing and anomaly detection, but its volume is intentionally unsuitable for wake-the-oncall thresholds.

Ops event catalog ([:accrue, :ops, :*])

All ops events fire inside the same Repo.transact/2 as the state write they correspond to — they are idempotent under webhook replay via the accrue_webhook_events unique-index short-circuit.

EventMeasurementsMetadataPrimary owner
[:accrue, :ops, :revenue_loss]count, amount_minor, currencysubject_type, subject_id, reasonAccrue.Telemetry.Ops
[:accrue, :ops, :dunning_exhaustion]countsubscription_id, from_status, to_status, source (:accrue_sweeper | :stripe_native | :manual)Accrue.Webhook.DefaultHandler
[:accrue, :ops, :incomplete_expired]countsubscription_idAccrue.Telemetry.Ops
[:accrue, :ops, :charge_failed]countcharge_id, customer_id, failure_codeAccrue.Telemetry.Ops
[:accrue, :ops, :meter_reporting_failed]countmeter_event_id, event_name, source (:reconciler | :webhook | :sync)Accrue.Webhook.DefaultHandler / Accrue.Billing.MeterEventActions / Accrue.Jobs.MeterEventsReconciler
[:accrue, :ops, :webhook_dlq, :dead_lettered]countevent_id, processor_event_id, type, attemptAccrue.Webhook.DispatchWorker
[:accrue, :ops, :webhook_dlq, :replay]count, duration, requeued_count, skipped_countactor, filter, dry_run?Accrue.Webhooks.DLQ
[:accrue, :ops, :webhook_dlq, :prune]dead_deleted, succeeded_deleted, durationretention_daysAccrue.Webhook.Pruner
[:accrue, :ops, :pdf_adapter_unavailable]counttype (email template key), operation_id when setAccrue.Workers.Mailer
[:accrue, :ops, :events_upcast_failed]countevent_id, type, schema_versionAccrue.Events
[:accrue, :ops, :connect_account_deauthorized]countstripe_account_id, deauthorized_at or unresolved: trueAccrue.Webhook.ConnectHandler
[:accrue, :ops, :connect_capability_lost]countstripe_account_id, capability, from, toAccrue.Webhook.ConnectHandler
[:accrue, :ops, :connect_payout_failed]countstripe_account_id, payout_id, amount, currency, failure_codeAccrue.Webhook.ConnectHandler

Meter reporting failures: semantics & sources

[:accrue, :ops, :meter_reporting_failed] fires on the first durable transition of a meter row into terminal failed (as enforced by Accrue.Billing.MeterEvents guarded updates)—not once per HTTP retry, Stripe redelivery, webhook replay, or idempotent report_usage replay. Treat it as a terminal-epoch signal, not retry noise.

Read this block before tuning Grafana annotations—alert links should point here (tuple + semantics), then operator-runbooks.md for ordered triage.

Connect ops rows above are emitted via Accrue.Telemetry.Ops.emit/3 from Accrue.Webhook.ConnectHandler. PDF and ledger rows use :telemetry.execute/3 directly with the same [:accrue, :ops] prefix — treat them as first-class ops signals for paging and dashboards.

Note: [:accrue, :ops, :revenue_loss], :incomplete_expired, and :charge_failed are part of the supported host + Accrue ops vocabulary (Ops.emit/3 and metrics defaults). Prefer Ops.emit/3 from host billing code so operation_id merges consistently; search the codebase for concrete emit sites when wiring alerts.

Every ops event also carries an automatically-merged operation_id field in metadata, sourced from Accrue.Actor.current_operation_id/0 (the same seed used for processor idempotency keys). This lets you correlate ops events with the originating webhook, Oban job, or admin action across service boundaries.

Last reconciled with v1.9 gap audit §1: 2026-04-21 — PR #14. The §1 ops inventory in .planning/research/v1.9-TELEMETRY-GAP-AUDIT.md is reflected in the ops catalog table above (including Connect, PDF, ledger, and DLQ rows).

Span naming conventions (OpenTelemetry)

Accrue's OpenTelemetry span helpers (gated on the :opentelemetry optional dep) wrap every Billing context function with consistent naming:

accrue.<domain>.<resource>.<action>

Domains emitted through Accrue.Telemetry.span/3 today are :billing, :connect, :mailer, :pdf, :processor, and :storage (see Accrue.Telemetry module doc). Illustrative, non-exhaustive billing examples below — for the enforced billing span inventory, see test/accrue/telemetry/billing_span_coverage_test.exs.

  • accrue.billing.subscription.create
  • accrue.billing.subscription.cancel
  • accrue.billing.invoice.finalize
  • accrue.billing.charge.refund
  • accrue.billing.meter_event.report_usage — verified billing span from Accrue.Billing.report_usage/3; failures surface as the ops signal [:accrue, :ops, :meter_reporting_failed] (see ops catalog table).
  • accrue.billing.payment_method.list[:accrue, :billing, :payment_method, :list] from Accrue.Billing.list_payment_methods/2 (processor-backed read; no extra ops tuple).
  • <a id="billing-billing-portal-create"></a> accrue.billing.billing_portal.create[:accrue, :billing, :billing_portal, :create] from Accrue.Billing.create_billing_portal_session/2 (Phase 78 / BIL-04).
  • <a id="billing-checkout-session-create"></a> accrue.billing.checkout_session.create[:accrue, :billing, :checkout_session, :create] from Accrue.Billing.create_checkout_session/2 (BIL-06 / Phase 80 emission; BIL-07 / Phase 81 catalog). Checkout-only span metadata merged from validated attrs is exactly checkout_mode, checkout_ui_mode, and checkout_line_items_count — no checkout URLs, client_secret, or raw attrs blob (behavioral SSOT: checkout_session_facade_test.exs + merge_checkout_session_create_metadata/4). Prefer these on spans; do not promote them unmodified to Telemetry.Metrics tags (see Cardinality discipline below).
  • accrue.billing.payment_method.attach[:accrue, :billing, :payment_method, :attach] from Accrue.Billing.attach_payment_method/3.
  • accrue.billing.payment_method.detach[:accrue, :billing, :payment_method, :detach] from Accrue.Billing.detach_payment_method/2.
  • accrue.billing.payment_method.set_default[:accrue, :billing, :payment_method, :set_default] from Accrue.Billing.set_default_payment_method/3.
  • NOT an OTel span nameaccrue.webhooks.dlq.replay is the dotted OpenTelemetry span name only when a host maps the ops event [:accrue, :ops, :webhook_dlq, :replay] (via Ops.emit/3 / :telemetry.execute) into OTel separately. It is not produced by Accrue.Telemetry.span/3.

This mirrors the :telemetry event naming ([:accrue, :billing, :subscription, :create]) so a single name maps cleanly to both the telemetry event and the OTel span — no translation table.

Last reconciled (billing span examples): 2026-04-24 — Phase 81 BIL-07 (checkout_session catalog row; billing_span_coverage_test.exs unchanged).

Span kind:

  • INTERNAL for Accrue context functions
  • CLIENT for outbound lattice_stripe calls (the underlying HTTP layer emits these via its own instrumentation)

Attribute conventions

Accrue spans attach a small, fixed set of business-meaningful attributes. Hosts adding their own spans on top of Accrue should follow the same discipline — both for grep-ability across services and for the PII contract below.

Allowed attributes (host-queryable, audit-useful, PII-free):

  • accrue.subscription.id — Accrue's internal UUID
  • accrue.customer.id — Accrue's internal UUID
  • accrue.invoice.id
  • accrue.charge.id
  • accrue.event_type — webhook event type string (e.g. "invoice.paid")
  • accrue.processor:stripe | :fake

  • stripe.subscription.id — upstream Stripe ID (for support bridging)
  • stripe.customer.id
  • stripe.charge.id
  • stripe.invoice.id
  • stripe.payment_intent.id

PROHIBITED attributes (NEVER attach to spans, telemetry events, or metric tags):

  • Any customer email, name, phone, postal address
  • Any card PAN, CVC, expiry, fingerprint metadata
  • Any Stripe PaymentMethod raw response data
  • Any webhook raw body or signature
  • Any Accrue.Money amounts that identify a specific customer's purchase history (aggregate amounts at the metric level are fine; per-customer amounts at the span/event level are not)
  • Any free-text reason fields supplied by end users (refund notes, support tickets, etc.)

The rule of thumb: attach PII-free identifiers ONLY. If an attribute could be reversed into a person, don't attach it. This is a host responsibility too — Accrue cannot inspect your custom span attributes for PII at runtime, so code review and the grep-able allowlist above are the enforcement mechanism.

Using the default metrics recipe

Accrue.Telemetry.Metrics.defaults/0 returns a list of ready-to-use Telemetry.Metrics definitions covering the billing context, webhook pipeline, and ops namespace. It is conditionally compiled on the optional :telemetry_metrics dep.

Add the dep to your host app:

# mix.exs
{:telemetry_metrics, "~> 1.1"},
{:telemetry_metrics_prometheus, "~> 1.1"}  # or your reporter of choice

Then wire it in:

defmodule MyApp.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      {TelemetryMetricsPrometheus, [metrics: metrics()]}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      counter("my_app.request.count")
      # ... your other host metrics ...
    ] ++ Accrue.Telemetry.Metrics.defaults()
  end
end

This wires in the default metric set covering the billing context, webhook pipeline, and full ops namespace (including Connect and PDF ops signals). Distributions and percentile summaries beyond these are host choice — Accrue doesn't prescribe binning strategies because appropriate buckets depend heavily on your traffic shape and SLO targets.

The default recipe includes counters for every [:accrue, :ops, :*] event documented in the table above (including Connect, PDF fallback, and ledger upcast failures), so Prometheus-style scrapers stay aligned with the ops catalog.

Cardinality discipline

The default metric definitions only attach low-cardinality tags (:status, :source, :type, :stripe_status). Customer IDs, subscription IDs, and other unbounded identifiers are never promoted to metric tags — they belong on spans, not metrics. If you add custom Accrue-derived metrics in your host app, follow the same rule: anything with more than ~50 distinct values per day is a span attribute, not a metric tag.

Emitting custom ops events

If your host app fires billing-adjacent ops events (e.g. a custom revenue recognition reconciler), prefer Accrue.Telemetry.Ops.emit/3 over raw :telemetry.execute/3 — it enforces the namespace prefix and auto-merges operation_id from the process dict:

Accrue.Telemetry.Ops.emit(
  :revenue_loss,
  %{count: 1, amount_minor: 9900, currency: "usd"},
  %{subject_type: "Subscription", subject_id: sub.id, reason: :fraud_refund}
)
# → emits [:accrue, :ops, :revenue_loss] with operation_id auto-merged

For multi-segment events (sub-namespaces), pass a list:

Accrue.Telemetry.Ops.emit(
  [:webhook_dlq, :replay],
  %{count: 12, requeued_count: 12, skipped_count: 0, duration: 142_000},
  %{actor: :admin, filter: %{type: "invoice.paid"}, dry_run?: false}
)
# → emits [:accrue, :ops, :webhook_dlq, :replay]

The [:accrue, :ops] prefix is hardcoded — callers cannot inject events outside the namespace via this helper. If you need to emit under [:accrue, :*] for the firehose, use Accrue.Telemetry.span/3 instead.

Cross-domain host subscription

Phoenix controllers, LiveViews, Channels, and plain processes often need a small, high-signal Accrue hook without subscribing to the full billing firehose. Use the public modules only — Accrue.Telemetry (span naming), Accrue.Telemetry.Metrics (default counters), and Accrue.Telemetry.Ops (emit/3 contract) — and the same :telemetry APIs you already use elsewhere.

Append Accrue.Telemetry.Metrics.defaults/0 to your host metric list (see Using the default metrics recipe above) so scrapers stay aligned with the ops catalog. The authoritative tuple list lives in the Ops event catalog table earlier in this guide — do not fork that table into a second inventory here.

Ops attach (webhook DLQ dead-lettered)

This pattern mirrors the checked-in examples/accrue_host app: start a tiny process once from supervision, call :telemetry.attach/4 with a stable handler id, and detach on shutdown so dev hot reload does not stack duplicate handlers.

:telemetry.attach(
  "accrue-host-ops-dlq-dead-lettered",
  [:accrue, :ops, :webhook_dlq, :dead_lettered],
  fn _event, measurements, _metadata, _config ->
    require Logger
    # Low-cardinality only — never log full metadata maps from billing.
    Logger.info("accrue ops webhook_dlq dead_lettered count=#{measurements.count}")
  end,
  nil
)

Optional: billing spans without metric-tag explosions

Accrue.Telemetry.span/3 on billing entry points emits [:accrue, :billing, :*, :start | :stop | :exception]high cardinality. If you add a handler, subscribe to :stop and :exception only for dashboards or logs. Do not copy customer IDs, subscription IDs, or other unbounded fields into metric tags; identifiers belong in traces or scrubbed log lines, not Prometheus labels.

OpenTelemetry–first hosts

Teams standardizing on OpenTelemetry may skip Telemetry.Metrics entirely and attach handlers or OTel bridges directly to :telemetry events instead — the ops catalog tuples still apply. Prefer spans for per-customer detail; keep paging on [:accrue, :ops, :*].

For ordered triage, default Oban queue placement (anchor #oban-queue-topology in operator-runbooks.md), and expanded Stripe verification, use Operator runbooks. This section keeps a compact signal → first-action table as your starting point — adjust for your support model and Stripe objects. Prefer Stripe Dashboard / Sigma for finance reporting; Accrue focuses on state + webhooks + replay in your app.

Operator runbooks (first actions)

Ops eventSuggested first actions
[:accrue, :ops, :webhook_dlq, :dead_lettered]Inspect accrue_webhook_events row; fix handler bug or data; use admin Replay or DLQ tools; watch replay telemetry; (Oban defaults: queue topology).
[:accrue, :ops, :webhook_dlq, :replay]Validate requeued_count vs expectation; if dry-run, follow up with real replay.
[:accrue, :ops, :meter_reporting_failed]Check source (:sync, :webhook, :reconciler); inspect accrue_meter_events; verify Stripe meter + API keys; retry after fix; (Oban defaults: queue topology).
[:accrue, :ops, :dunning_exhaustion]Confirm subscription status transition; notify customer success; verify payment method in Stripe.
[:accrue, :ops, :revenue_loss]Triage reason + subject_*; fraud vs refund policy; reconcile with Stripe balance transactions; (Oban defaults: queue topology).
[:accrue, :ops, :charge_failed]Map failure_code; prompt card update or alternative PM; check Radar rules in Stripe if unexpected.
[:accrue, :ops, :incomplete_expired]Incomplete checkout/subscription expired; clean up local rows; marketing follow-up if abandoned cart.
[:accrue, :ops, :pdf_adapter_unavailable]Start ChromicPDF (or switch PDF adapter); emails still send with hosted invoice link fallback.
[:accrue, :ops, :events_upcast_failed]Data migration issue — unknown schema_version for type; deploy compatible upcaster before replaying events; (Oban defaults: queue topology).
[:accrue, :ops, :connect_account_deauthorized]Disconnect Connect account in product UI; stop destination charges; audit open Connect transfers.
[:accrue, :ops, :connect_capability_lost]Read capability + to status; Stripe Connect onboarding / requirements.
[:accrue, :ops, :connect_payout_failed]Use payout_id + failure_code in Stripe; update bank account or resolve restriction.

See also