phi_accrual
A source-agnostic φ-accrual failure detector for Elixir/OTP, built on Hayashibara et al. 2004 with a dual-α EWMA estimator, head-of-line and local-pause awareness, and a telemetry-first API.
Stable — v1.x. Public API and telemetry event schema are committed under SemVer. Per-node estimator option defaults may be tuned within v1.x; breaking changes only in v2.0.0. See Versioning for the full contract.
Observability-grade, not decision-grade. Designed for dashboards, alerting, and operator intuition — not for automated routing, quorum, or correctness decisions. See limitations for why.
Quick start
```elixir
# mix.exs
def deps do
  [{:phi_accrual, "~> 1.0"}]
end
```

The application auto-starts. Feed in heartbeat arrivals from anywhere your code already receives cross-node traffic, and read out φ on demand:
```elixir
# Call this whenever you receive evidence that a peer is alive —
# a GenServer reply, a :pg broadcast, an :rpc response, a custom ping.
# First call for an unknown node auto-tracks it with defaults.
PhiAccrual.observe(:"peer@host")

# Query φ at any time.
PhiAccrual.phi(:"peer@host")
#=> {:ok, 0.42, :steady}
```

That's the whole core loop: feed in arrivals, read out φ. Everything below is about making it useful in production — reference heartbeat sources if you have none of your own, telemetry wiring for Prometheus, thresholding with hysteresis, and honest limitations.
What it does
Given a stream of heartbeat arrivals from a remote node, the detector
maintains an EWMA estimate of the inter-arrival distribution (mean and
variance, independently smoothed) and emits a continuous suspicion
value φ, calibrated so that φ ≈ -log₁₀(P(the overdue heartbeat will still arrive)):
| φ value | Rough meaning |
|---|---|
| 1 | 1-in-10 chance the node is dead |
| 3 | 1-in-1000 |
| 8 | 1-in-100 000 000 — very likely down |
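The core math is small enough to sketch. The following is not this library's implementation; it is a minimal illustration, under the stated defaults (α = 0.125, σ floor 50 ms), of a dual-α EWMA update plus the φ formula, using a common logistic approximation to the Gaussian CDF rather than whatever CDF the library actually uses:

```elixir
defmodule PhiSketch do
  # Dual-alpha EWMA update: mean and variance are smoothed independently,
  # each with its own smoothing factor.
  def update({mean, var}, interval_ms, alpha_mean \\ 0.125, alpha_var \\ 0.125) do
    new_mean = (1 - alpha_mean) * mean + alpha_mean * interval_ms
    dev = interval_ms - new_mean
    new_var = (1 - alpha_var) * var + alpha_var * dev * dev
    {new_mean, new_var}
  end

  # phi = -log10(P(a later arrival)), with the Gaussian CDF approximated
  # by a logistic curve (an approximation also used by other accrual
  # detector implementations).
  def phi({mean, var}, elapsed_ms, min_std_dev_ms \\ 50.0) do
    std = max(:math.sqrt(var), min_std_dev_ms)
    y = (elapsed_ms - mean) / std
    cdf = 1.0 / (1.0 + :math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -:math.log10(max(1.0 - cdf, 1.0e-16))
  end
end
```

With mean 1 000 ms and an exactly on-schedule arrival (elapsed = mean), this yields φ = -log₁₀(0.5) ≈ 0.30, matching the "hovers around 0.3" figure used in the failure timeline later in this document.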
Thresholding is a consumer concern. The detector does not decide
whether a node is up or down; it publishes φ, and you (or the optional
PhiAccrual.Threshold module) decide what crosses what line.
Why another failure detector?
The Elixir/OTP ecosystem has plenty of cluster-management libraries
(libcluster, swarm, horde, partisan), but all of them use
binary up/down detectors or entangle detection with membership.
phi_accrual is the thing that goes alongside them: a pure detector,
unopinionated about who sends heartbeats, what the topology looks like,
or what to do when φ gets high.
Usage — bring your own signal
Anything that arrives from a remote node is evidence of liveness. If
your app already has cross-node traffic, call observe/2 from the
receive path — no extra network cost:
```elixir
defmodule MyApp.Chatter do
  use GenServer

  def handle_info({:reply_from, node}, state) do
    PhiAccrual.observe(node)
    {:noreply, state}
  end
end
```

Then pattern-match on phi/1 to handle every result state:
```elixir
case PhiAccrual.phi(:"node_a@host") do
  {:ok, phi, :steady}     -> # warm estimator, normal
  {:ok, phi, :recovering} -> # warm estimator, absorbing a recent gap
  {:insufficient_data, n} -> # still in bootstrap, `n` samples remaining
  {:stale, elapsed_ms}    -> # no arrival for > stale_after_ms
  {:error, :not_tracked}  -> # never observed
end
```

Call PhiAccrual.track(node, opts) before your first observe if
you need custom per-node estimator options; otherwise the first
observe auto-tracks with defaults.
Usage — reference source
If you have no existing cross-node chatter, enable the bundled
DistributionPing source in config:
```elixir
# config/runtime.exs
config :phi_accrual,
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```

Each node then pings every peer every interval_ms over BEAM
distribution. Cheap per-ping, but cluster cost is O(N²): at 50 nodes
and a 1 s interval that's 50 × 49 = 2,450 pings/second of distribution
traffic.
This source inherits HoL blocking — see
limitations. The planned phi_accrual_udp companion
package escapes it by using a dedicated socket.
What happens when a node fails
Suppose :node_a@host has been heartbeating every ~1 s for a few
minutes. Its estimator has mean ≈ 1 000 ms, σ ≈ 50 ms, and φ hovers
around 0.3 (the median for an on-schedule arrival).
Then the node goes dark. Here is the timeline, using the default
options and a threshold instance configured at suspect_at: 4.0,
recover_at: 3.0:
```
t=0s   last heartbeat arrives. φ ≈ 0.3.
       → [:phi_accrual, :sample, :observed] (interval_ms: ~1000)
t=1s   no new heartbeat. φ ≈ 0.3 (still on-schedule).
       → [:phi_accrual, :phi, :computed] (periodic gauge tick)
t=2s   φ ≈ 3.5; starting to get suspicious.
       → [:phi_accrual, :phi, :computed]
t=3s   φ crosses 4.0.
       → [:phi_accrual, :phi, :computed]
       → [:phi_accrual, :threshold, :suspected]
t=10s  φ very high. state still :steady (stale_after_ms default 60 s).
       → [:phi_accrual, :phi, :computed]
t=60s  elapsed > stale_after_ms.
       → [:phi_accrual, :phi, :computed] (state: :stale)
```

If, instead, :node_a@host comes back at t=15s and resumes heartbeating, the
first-arrival interval of 15 000 ms exceeds recovering_threshold_ms
(default 10 000). The state transitions to :recovering for the next
3 samples while the EWMA absorbs the outlier. Once φ drops below 3.0:
```
t=15s  first heartbeat after outage. interval = 15 000 ms.
       → [:phi_accrual, :sample, :observed]
       state becomes :recovering.
t=16s  next heartbeat. φ has fallen sharply (elapsed is small).
       → [:phi_accrual, :phi, :computed] (state: :recovering)
       → [:phi_accrual, :threshold, :recovered] (φ crossed 3.0 downward)
t=19s  three samples since the outlier.
       → state returns to :steady.
```

Nowhere in this flow does the library decide the node is "down."
It just publishes φ and state labels; the Threshold module (or your
own consumer) decides what to do. That separation is why the detector
can be wired to a dashboard, an alert, and an automated-routing policy
simultaneously with different thresholds.
Telemetry schema (v1.x stable)
Event names, measurement keys, and metadata keys are a contract. Breaking changes only in v2.
```elixir
[:phi_accrual, :sample, :observed]
  measurements: %{interval_ms}
  metadata: %{node, local_pause?}

[:phi_accrual, :phi, :computed]          # periodic gauge stream
  measurements: %{phi, elapsed_ms}
  metadata: %{node, state, local_pause?, confidence}
  # state ∈ [:steady, :recovering, :insufficient_data, :stale]
  # phi is 0.0 when state is :insufficient_data or :stale; consumers
  # should filter on state if they want to graph only meaningful values.

[:phi_accrual, :local_pause, :start]     # rising edge
  metadata: %{kind}                      # :long_gc | :long_schedule | :busy_dist_port
[:phi_accrual, :local_pause, :stop]      # falling edge

[:phi_accrual, :overload, :shed]
  measurements: %{mailbox_len}
  metadata: %{node}

[:phi_accrual, :source, :started]
  metadata: %{source, interval_ms}

[:phi_accrual, :threshold, :suspected]   # emitted by Threshold module
[:phi_accrual, :threshold, :recovered]
  measurements: %{phi}
  metadata: %{node, instance, threshold, confidence, detector_state}
```

Pipe these to Prometheus via telemetry_metrics_prometheus, to logs,
or to your own alerting (see next section).
Wiring telemetry to Prometheus
Pull in telemetry_metrics_prometheus
(or your preferred telemetry_metrics reporter) and declare the
metrics you care about:
```elixir
# mix.exs — add the dependency
{:telemetry_metrics_prometheus, "~> 1.1"}
```

```elixir
# In your supervision tree
children = [
  {TelemetryMetricsPrometheus,
   metrics: [
     # φ as a gauge — one series per (node, state, confidence) combination.
     Telemetry.Metrics.last_value(
       "phi_accrual.phi.computed.phi",
       event_name: [:phi_accrual, :phi, :computed],
       measurement: :phi,
       tags: [:node, :state, :confidence]
     ),

     # Counter of every heartbeat observed.
     Telemetry.Metrics.counter(
       "phi_accrual.sample.observed.count",
       event_name: [:phi_accrual, :sample, :observed],
       tags: [:node]
     ),

     # Local-pause events — correlate noise in φ with GC / HoL blocking.
     Telemetry.Metrics.counter(
       "phi_accrual.local_pause.start.count",
       event_name: [:phi_accrual, :local_pause, :start],
       tags: [:kind]
     ),

     # Overload shedding — if this is ever non-zero in steady state,
     # tune α instead of raising :shed_threshold.
     Telemetry.Metrics.counter(
       "phi_accrual.overload.shed.count",
       event_name: [:phi_accrual, :overload, :shed],
       tags: [:node]
     ),

     # Discrete alert events from the Threshold module.
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.suspected.count",
       event_name: [:phi_accrual, :threshold, :suspected],
       tags: [:node, :instance]
     ),
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.recovered.count",
       event_name: [:phi_accrual, :threshold, :recovered],
       tags: [:node, :instance]
     )
   ]}
]
```

For ad-hoc logging, attach a handler directly:
```elixir
:telemetry.attach_many(
  "phi-accrual-logger",
  [
    [:phi_accrual, :threshold, :suspected],
    [:phi_accrual, :threshold, :recovered]
  ],
  &MyApp.PhiLogger.handle/4,
  nil
)

defmodule MyApp.PhiLogger do
  require Logger

  def handle([:phi_accrual, :threshold, kind], %{phi: phi}, %{node: node}, _config) do
    Logger.warning("node=#{node} #{kind} phi=#{Float.round(phi, 2)}")
  end
end
```

Thresholding (optional)
PhiAccrual.Threshold converts the φ gauge stream into discrete
:suspected / :recovered events with hysteresis:
```elixir
# In your supervision tree
children = [
  {PhiAccrual.Threshold, name: :dash, suspect_at: 4.0, recover_at: 3.0},
  {PhiAccrual.Threshold, name: :route, suspect_at: 8.0, recover_at: 7.0}
]
```

Multiple instances coexist — one for dashboards at φ = 4, another for automated routing at φ = 8. Skip the module entirely if you want to roll your own.
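The hysteresis logic itself is a tiny state machine. A hedged sketch (not the Threshold module's actual code) of one evaluation step, using the two thresholds from the dashboard instance above:

```elixir
defmodule HysteresisSketch do
  # One evaluation step: current state + new φ → {new_state, event_to_emit}.
  # Two separate thresholds prevent flapping when φ oscillates around one line.
  def step(:ok, phi, suspect_at, _recover_at) when phi >= suspect_at,
    do: {:suspected, :suspected}

  def step(:suspected, phi, _suspect_at, recover_at) when phi <= recover_at,
    do: {:ok, :recovered}

  def step(state, _phi, _suspect_at, _recover_at), do: {state, nil}
end

HysteresisSketch.step(:ok, 4.5, 4.0, 3.0)        #=> {:suspected, :suspected}
HysteresisSketch.step(:suspected, 3.5, 4.0, 3.0) #=> {:suspected, nil} (inside the band)
HysteresisSketch.step(:suspected, 2.9, 4.0, 3.0) #=> {:ok, :recovered}
```

Note that a φ of 3.5 emits nothing in either state: once suspected, the node must drop all the way below recover_at before a :recovered event fires.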
Configuration
```elixir
# config/runtime.exs
config :phi_accrual,
  # Enable the node-global :erlang.system_monitor hook (default: true).
  # Disable if another library already subscribes.
  pause_monitor: true,

  # Back-pressure threshold — observe/2 sheds samples when the mailbox
  # exceeds this count and emits [:overload, :shed] telemetry.
  shed_threshold: 10_000,

  # Bundled reference source — off by default, opt in:
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```

Per-node estimator options (passed to PhiAccrual.track/2):
| Option | Default | Notes |
|---|---|---|
| `:alpha_mean` | 0.125 | EWMA smoothing for the mean |
| `:alpha_var` | 0.125 | EWMA smoothing for the variance (tune lower) |
| `:min_std_dev_ms` | 50.0 | Floor on σ — prevents a singular distribution |
| `:min_samples` | 8 | Bootstrap gate before φ is reported |
| `:stale_after_ms` | 60_000 | Elapsed time past which state becomes :stale |
| `:recovering_threshold_ms` | 10_000 | Large-gap detection for the :recovering tag |
| `:recovering_grace_samples` | 3 | Samples the :recovering tag persists for |
| `:initial_interval_ms` | 1_000 | Prior mean before any observation |
| `:initial_std_dev_ms` | 500 | Prior σ (variance = σ²) |
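For example, a peer known to heartbeat slowly can be tracked with a wider prior before its first observe. The node name and values here are illustrative; the option names come from the table above:

```elixir
# Hypothetical slow-heartbeat peer: ~30 s period, so widen the prior
# and push the stale cutoff well past several missed beats.
PhiAccrual.track(:"batch_worker@host",
  initial_interval_ms: 30_000,
  initial_std_dev_ms: 5_000,
  stale_after_ms: 300_000
)
```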
Debugging in production
When φ for a node looks wrong (stuck low, stuck high, or noisy in a way that doesn't match the heartbeat source), the fastest first step is to look at what the estimator currently believes:
```elixir
iex> PhiAccrual.inspect_state(:"node_a@host")
%PhiAccrual.Core{
  mean: 1023.4,
  variance: 2_847.1,
  last_arrival_ts: 1_234_567_890,
  last_interval_ms: 1_017.0,
  samples_seen: 42,
  recovering_remaining: 0,
  ...
}
```

If samples_seen is below min_samples, you're still in bootstrap.
If mean is far from the heartbeat period you expect, the source is
either dropping samples or arriving in bursts. If variance is huge
but intervals look stable, you've hit a single outlier — alpha_var
is too high. inspect_state/1 returns {:error, :not_tracked} for
unknown nodes; do not use it on the hot path.
Limitations
Read these before wiring φ to anything that takes irreversible action.
Head-of-line blocking (primary v1 caveat). DistributionPing and
any source that travels over BEAM distribution shares a TCP socket
with user traffic. A large GenServer reply or :pg broadcast can
delay heartbeats for arbitrary periods. PauseMonitor subscribes to
:busy_dist_port so you can observe this (pause telemetry +
confidence: false on φ events), but the underlying problem cannot be
fixed by this library while the source is distribution-based. The
planned phi_accrual_udp companion package solves it by using a
dedicated socket.
Local-pause suppression is best-effort. :erlang.system_monitor
fires on :long_gc, :long_schedule, and :busy_dist_port. The
monitor marks φ output with local_pause?: true and
confidence: false for a short lockout window after any event. It
does not freeze φ or widen the variance — we decided the silent-
detector failure mode is worse than noisy φ. Consumers are expected to
filter on the confidence flag (the Threshold module passes it
through in metadata).
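In practice that filtering can be a single guard in your event handler. A sketch, assuming the metadata shape from the telemetry schema section:

```elixir
defmodule MyApp.PhiGate do
  # Decide whether a φ reading is trustworthy enough to act on:
  # require a warm estimator, a confident sample, and no local pause.
  def actionable?(%{state: :steady, confidence: true, local_pause?: false}), do: true
  def actionable?(_metadata), do: false
end
```

Drop (or merely log) any event for which actionable?/1 returns false rather than feeding it to alerting.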
Gaussian assumption misbehaves under bimodal distributions. BEAM
GC produces intermittent large pauses that, combined with normal
intervals, yield a bimodal inter-arrival distribution. A Gaussian EWMA
is a poor fit and will over-alert. Correlate φ with
:erlang.statistics(:garbage_collection) before acting on high φ. A
non-parametric estimator (Satzger or a two-component mixture) is a v2
consideration once we have real traces from deployments.
One :erlang.system_monitor per node. Only one subscription can
exist; if another library installs its own, whichever subscribes last
silently replaces the other. Disable pause_monitor in config and feed pause
state to PhiAccrual.PauseMonitor.put_state/1 yourself if you need
coexistence.
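A coexistence sketch: own the system_monitor subscription yourself, translate its messages, and forward the pause kinds. The tuple shapes below are the documented :erlang.system_monitor messages; the exact argument put_state/1 accepts is an assumption here (check the PauseMonitor docs):

```elixir
defmodule MyApp.SysMonRelay do
  # Map raw system_monitor messages to the pause kinds the detector knows.
  def pause_kind({:monitor, _pid, :long_gc, _info}), do: :long_gc
  def pause_kind({:monitor, _pid, :long_schedule, _info}), do: :long_schedule
  def pause_kind({:monitor, _port, :busy_dist_port, _info}), do: :busy_dist_port
  def pause_kind(_other), do: nil

  # In the process that owns the system_monitor subscription, forward
  # matches to the detector (put_state/1 call shape assumed):
  #
  #   def handle_info(msg, state) do
  #     case MyApp.SysMonRelay.pause_kind(msg) do
  #       nil  -> :ok
  #       kind -> PhiAccrual.PauseMonitor.put_state(kind)
  #     end
  #     {:noreply, state}
  #   end
end
```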
Testing strategy
Failure detectors are hard to test against wall-clock. This project:
- Uses StreamData for property-based tests of the estimator math (test/phi_accrual/core_test.exs).
- Injects clocks into PhiAccrual.Estimator via the :clock_fn option — no Process.sleep in unit tests.
- Integration tests against live distribution (:peer-based, multi-node) are planned alongside the phi_accrual_udp companion package work.
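The same clock-injection trick works in your own tests. A minimal controllable clock, assuming :clock_fn is a zero-arity function returning milliseconds (check the PhiAccrual.Estimator docs for the exact contract):

```elixir
# An Agent-backed fake clock: advance time explicitly, never sleep.
{:ok, clock} = Agent.start_link(fn -> 0 end)
clock_fn = fn -> Agent.get(clock, & &1) end
advance = fn ms -> Agent.update(clock, &(&1 + ms)) end

advance.(1_000)
clock_fn.()  #=> 1000

# Hand the fake clock to the estimator under test, e.g.:
#   PhiAccrual.track(:"peer@test", clock_fn: clock_fn)
```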
Versioning
v1.x is telemetry-schema-stable: event names, measurement keys, and metadata keys will not change until v2. Per-node option defaults may be tuned within v1.x.
Roadmap
v1 (shipped)
- Dual-α EWMA estimator with bootstrap / stale / recovering states
- PauseMonitor with :busy_dist_port tracking
- Per-node estimator GenServer + DynamicSupervisor + Registry
- Overload shedding with telemetry
- Bring-your-own-signal API + DistributionPing reference source
- Optional Threshold module with hysteresis
- Committed telemetry event schema
Companion packages (separate repos)
The core detector is intentionally transport-agnostic. Heartbeat
transports and integrations live in their own packages, depending on
phi_accrual as a normal Hex dependency:
- phi_accrual_udp — dedicated UDP socket source; escapes BEAM distribution head-of-line blocking. Required to make the detector decision-grade for any consumer that needs it. (Planned, not yet published.)
- phi_accrual_libcluster — libcluster strategy adapter for apps that want one-line membership + detection wiring. (Planned.)
Anyone is free to write additional companion packages — the public
API is stable, the contract is PhiAccrual.observe/2.
Future core work
- Evidence-based evaluation of non-parametric / mixture estimators for bimodal BEAM-GC inter-arrival distributions, once we have real traces from deployments.
- :peer-based multi-node integration tests for DistributionPing.
Related ideas
This library is the first of three composable primitives: φ-accrual → HLC + causal broadcast → SWIM-Lifeguard standalone.
License
Apache-2.0. See LICENSE.