phi_accrual
A source-agnostic φ-accrual failure detector for Elixir/OTP, built on Hayashibara et al. 2004 with a dual-α EWMA estimator, head-of-line and local-pause awareness, and a telemetry-first API.
Stable — v1.x. Public API and telemetry event schema are committed under SemVer. Per-node estimator option defaults may be tuned within v1.x; breaking changes only in v2.0.0. See Versioning for the full contract.
Observability-grade, not decision-grade. Designed for dashboards, alerting, and operator intuition — not for automated routing, quorum, or correctness decisions. See limitations for why.
Quick start
```elixir
# mix.exs
def deps do
  [{:phi_accrual, "~> 1.0"}]
end
```

The application auto-starts. Feed in heartbeat arrivals from anywhere your code already receives cross-node traffic, and read out φ on demand:
```elixir
# Call this whenever you receive evidence that a peer is alive —
# a GenServer reply, a :pg broadcast, an :rpc response, a custom ping.
# First call for an unknown node auto-tracks it with defaults.
PhiAccrual.observe(:"peer@host")

# Query φ at any time.
PhiAccrual.phi(:"peer@host")
#=> {:ok, 0.42, :steady}
```

That's the whole core loop: feed in arrivals, read out φ. Everything below is about making it useful in production — reference heartbeat sources if you have none of your own, telemetry wiring for Prometheus, thresholding with hysteresis, and honest limitations.
What it does
Given a stream of heartbeat arrivals from a remote node, the detector
maintains an EWMA estimate of the inter-arrival distribution (mean and
variance, independently smoothed) and emits a continuous suspicion
value φ, calibrated so that φ ≈ -log₁₀(P(the overdue heartbeat will still arrive)):
| φ value | Rough meaning |
|---|---|
| 1 | 1-in-10 chance the node is dead |
| 3 | 1-in-1000 |
| 8 | 1-in-100 000 000 — very likely down |
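The core math is small enough to sketch. The following is not this library's implementation; it is a minimal illustration, under the stated defaults (α = 0.125, σ floor 50 ms), of a dual-α EWMA update plus the φ formula, using a common logistic approximation to the Gaussian CDF rather than whatever CDF the library actually uses:

```elixir
defmodule PhiSketch do
  # Dual-alpha EWMA update: mean and variance are smoothed independently,
  # each with its own smoothing factor.
  def update({mean, var}, interval_ms, alpha_mean \\ 0.125, alpha_var \\ 0.125) do
    new_mean = (1 - alpha_mean) * mean + alpha_mean * interval_ms
    dev = interval_ms - new_mean
    new_var = (1 - alpha_var) * var + alpha_var * dev * dev
    {new_mean, new_var}
  end

  # phi = -log10(P(a later arrival)), with the Gaussian CDF approximated
  # by a logistic curve (an approximation also used by other accrual
  # detector implementations).
  def phi({mean, var}, elapsed_ms, min_std_dev_ms \\ 50.0) do
    std = max(:math.sqrt(var), min_std_dev_ms)
    y = (elapsed_ms - mean) / std
    cdf = 1.0 / (1.0 + :math.exp(-y * (1.5976 + 0.070566 * y * y)))
    -:math.log10(max(1.0 - cdf, 1.0e-16))
  end
end
```

With mean 1 000 ms and an exactly on-schedule arrival (elapsed = mean), this yields φ = -log₁₀(0.5) ≈ 0.30, matching the "hovers around 0.3" figure used in the failure timeline later in this document.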
Thresholding is a consumer concern. The detector does not decide
whether a node is up or down; it publishes φ, and you (or the optional
PhiAccrual.Threshold module) decide what crosses what line.
Why another failure detector?
The Elixir/OTP ecosystem has plenty of cluster-management libraries
(libcluster, swarm, horde, partisan), but all of them use
binary up/down detectors or entangle detection with membership.
phi_accrual is the thing that goes alongside them: a pure detector,
unopinionated about who sends heartbeats, what the topology looks like,
or what to do when φ gets high.
Usage — bring your own signal
Anything that arrives from a remote node is evidence of liveness. If
your app already has cross-node traffic, call observe/2 from the
receive path — no extra network cost:
```elixir
defmodule MyApp.Chatter do
  use GenServer

  def handle_info({:reply_from, node}, state) do
    PhiAccrual.observe(node)
    {:noreply, state}
  end
end
```

Then pattern-match on phi/1 to handle every result state:
```elixir
case PhiAccrual.phi(:"node_a@host") do
  {:ok, phi, :steady}     -> # warm estimator, normal
  {:ok, phi, :recovering} -> # warm estimator, absorbing a recent gap
  {:insufficient_data, n} -> # still in bootstrap, `n` samples remaining
  {:stale, elapsed_ms}    -> # no arrival for > stale_after_ms
  {:error, :not_tracked}  -> # never observed
end
```

Call PhiAccrual.track(node, opts) before your first observe if
you need custom per-node estimator options; otherwise the first
observe auto-tracks with defaults.
Usage — reference source
If you have no existing cross-node chatter, enable the bundled
DistributionPing source in config:
```elixir
# config/runtime.exs
config :phi_accrual,
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```

Each node then pings every peer every interval_ms over BEAM
distribution. Cheap per-ping, but cluster cost is O(N²): at 50 nodes
and a 1 s interval that's 50 × 49 = 2,450 pings/second of distribution
traffic.
This source inherits HoL blocking — see
limitations. The planned phi_accrual_udp companion
package escapes it by using a dedicated socket.
What happens when a node fails
Suppose :node_a@host has been heartbeating every ~1 s for a few
minutes. Its estimator has mean ≈ 1 000 ms, σ ≈ 50 ms, and φ hovers
around 0.3 (the median for an on-schedule arrival).
Then the node goes dark. Here is the timeline, using the default
options and a threshold instance configured at suspect_at: 4.0,
recover_at: 3.0:
```
t=0s   last heartbeat arrives. φ ≈ 0.3.
       → [:phi_accrual, :sample, :observed] (interval_ms: ~1000)
t=1s   no new heartbeat. φ ≈ 0.3 (still on-schedule).
       → [:phi_accrual, :phi, :computed] (periodic gauge tick)
t=2s   φ ≈ 3.5; starting to get suspicious.
       → [:phi_accrual, :phi, :computed]
t=3s   φ crosses 4.0.
       → [:phi_accrual, :phi, :computed]
       → [:phi_accrual, :threshold, :suspected]
t=10s  φ very high. state still :steady (stale_after_ms default 60 s).
       → [:phi_accrual, :phi, :computed]
t=60s  elapsed > stale_after_ms.
       → [:phi_accrual, :phi, :computed] (state: :stale)
```

If, instead, :node_a@host comes back at t=15s and resumes heartbeating, the
first-arrival interval of 15 000 ms exceeds recovering_threshold_ms
(default 10 000). The state transitions to :recovering for the next
3 samples while the EWMA absorbs the outlier. Once φ drops below 3.0:
```
t=15s  first heartbeat after outage. interval = 15 000 ms.
       → [:phi_accrual, :sample, :observed]
       state becomes :recovering.
t=16s  next heartbeat. φ has fallen sharply (elapsed is small).
       → [:phi_accrual, :phi, :computed] (state: :recovering)
       → [:phi_accrual, :threshold, :recovered] (φ crossed 3.0 downward)
t=19s  three samples since the outlier.
       → state returns to :steady.
```

Nowhere in this flow does the library decide the node is "down."
It just publishes φ and state labels; the Threshold module (or your
own consumer) decides what to do. That separation is why the detector
can be wired to a dashboard, an alert, and an automated-routing policy
simultaneously with different thresholds.
Telemetry schema (v1.x stable)
Event names, measurement keys, and metadata keys are a contract. Breaking changes only in v2.
```elixir
[:phi_accrual, :sample, :observed]
  measurements: %{interval_ms}
  metadata: %{node, local_pause?}

[:phi_accrual, :phi, :computed]          # periodic gauge stream
  measurements: %{phi, elapsed_ms}
  metadata: %{node, state, local_pause?, confidence}
  # state ∈ [:steady, :recovering, :insufficient_data, :stale]
  # phi is 0.0 when state is :insufficient_data or :stale; consumers
  # should filter on state if they want to graph only meaningful values.

[:phi_accrual, :local_pause, :start]     # rising edge
  metadata: %{kind}                      # :long_gc | :long_schedule | :busy_dist_port
[:phi_accrual, :local_pause, :stop]      # falling edge

[:phi_accrual, :overload, :shed]
  measurements: %{mailbox_len}
  metadata: %{node}

[:phi_accrual, :source, :started]
  metadata: %{source, interval_ms}

[:phi_accrual, :threshold, :suspected]   # emitted by Threshold module
[:phi_accrual, :threshold, :recovered]
  measurements: %{phi}
  metadata: %{node, instance, threshold, confidence, detector_state}
```

Pipe these to Prometheus via telemetry_metrics_prometheus, to logs,
or to your own alerting (see next section).
Wiring telemetry to Prometheus
Pull in telemetry_metrics_prometheus
(or your preferred telemetry_metrics reporter) and declare the
metrics you care about:
```elixir
# mix.exs — add the dependency
{:telemetry_metrics_prometheus, "~> 1.1"}
```

```elixir
# In your supervision tree
children = [
  {TelemetryMetricsPrometheus,
   metrics: [
     # φ as a gauge — one series per (node, state, confidence) combination.
     Telemetry.Metrics.last_value(
       "phi_accrual.phi.computed.phi",
       event_name: [:phi_accrual, :phi, :computed],
       measurement: :phi,
       tags: [:node, :state, :confidence]
     ),

     # Counter of every heartbeat observed.
     Telemetry.Metrics.counter(
       "phi_accrual.sample.observed.count",
       event_name: [:phi_accrual, :sample, :observed],
       tags: [:node]
     ),

     # Local-pause events — correlate noise in φ with GC / HoL blocking.
     Telemetry.Metrics.counter(
       "phi_accrual.local_pause.start.count",
       event_name: [:phi_accrual, :local_pause, :start],
       tags: [:kind]
     ),

     # Overload shedding — if this is ever non-zero in steady state,
     # tune α instead of raising :shed_threshold.
     Telemetry.Metrics.counter(
       "phi_accrual.overload.shed.count",
       event_name: [:phi_accrual, :overload, :shed],
       tags: [:node]
     ),

     # Discrete alert events from the Threshold module.
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.suspected.count",
       event_name: [:phi_accrual, :threshold, :suspected],
       tags: [:node, :instance]
     ),
     Telemetry.Metrics.counter(
       "phi_accrual.threshold.recovered.count",
       event_name: [:phi_accrual, :threshold, :recovered],
       tags: [:node, :instance]
     )
   ]}
]
```

For ad-hoc logging, attach a handler directly:
```elixir
:telemetry.attach_many(
  "phi-accrual-logger",
  [
    [:phi_accrual, :threshold, :suspected],
    [:phi_accrual, :threshold, :recovered]
  ],
  &MyApp.PhiLogger.handle/4,
  nil
)

defmodule MyApp.PhiLogger do
  require Logger

  def handle([:phi_accrual, :threshold, kind], %{phi: phi}, %{node: node}, _config) do
    Logger.warning("node=#{node} #{kind} phi=#{Float.round(phi, 2)}")
  end
end
```

Thresholding (optional)
PhiAccrual.Threshold converts the φ gauge stream into discrete
:suspected / :recovered events with hysteresis:
```elixir
# In your supervision tree
children = [
  {PhiAccrual.Threshold, name: :dash, suspect_at: 4.0, recover_at: 3.0},
  {PhiAccrual.Threshold, name: :route, suspect_at: 8.0, recover_at: 7.0}
]
```

Multiple instances coexist — one for dashboards at φ = 4, another for automated routing at φ = 8. Skip the module entirely if you want to roll your own.
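The hysteresis logic itself is a tiny state machine. A hedged sketch (not the Threshold module's actual code) of one evaluation step, using the two thresholds from the dashboard instance above:

```elixir
defmodule HysteresisSketch do
  # One evaluation step: current state + new φ → {new_state, event_to_emit}.
  # Two separate thresholds prevent flapping when φ oscillates around one line.
  def step(:ok, phi, suspect_at, _recover_at) when phi >= suspect_at,
    do: {:suspected, :suspected}

  def step(:suspected, phi, _suspect_at, recover_at) when phi <= recover_at,
    do: {:ok, :recovered}

  def step(state, _phi, _suspect_at, _recover_at), do: {state, nil}
end

HysteresisSketch.step(:ok, 4.5, 4.0, 3.0)        #=> {:suspected, :suspected}
HysteresisSketch.step(:suspected, 3.5, 4.0, 3.0) #=> {:suspected, nil} (inside the band)
HysteresisSketch.step(:suspected, 2.9, 4.0, 3.0) #=> {:ok, :recovered}
```

Note that a φ of 3.5 emits nothing in either state: once suspected, the node must drop all the way below recover_at before a :recovered event fires.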
Configuration
```elixir
# config/runtime.exs
config :phi_accrual,
  # Enable the node-global :erlang.system_monitor hook (default: true).
  # Disable if another library already subscribes.
  pause_monitor: true,

  # Back-pressure threshold — observe/2 sheds samples when the mailbox
  # exceeds this count and emits [:overload, :shed] telemetry.
  shed_threshold: 10_000,

  # Bundled reference source — off by default, opt in:
  distribution_ping: [interval_ms: 1_000, auto_track: true]
```

Per-node estimator options (passed to PhiAccrual.track/2):
| Option | Default | Notes |
|---|---|---|
| `:alpha_mean` | 0.125 | EWMA smoothing for the mean |
| `:alpha_var` | 0.125 | EWMA smoothing for the variance (tune lower) |
| `:min_std_dev_ms` | 50.0 | Floor on σ — prevents a singular distribution |
| `:min_samples` | 8 | Bootstrap gate before φ is reported |
| `:stale_after_ms` | 60_000 | Elapsed time past which state becomes :stale |
| `:recovering_threshold_ms` | 10_000 | Large-gap detection for the :recovering tag |
| `:recovering_grace_samples` | 3 | Samples the :recovering tag persists for |
| `:initial_interval_ms` | 1_000 | Prior mean before any observation |
| `:initial_std_dev_ms` | 500 | Prior σ (variance = σ²) |
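For example, a peer known to heartbeat slowly can be tracked with a wider prior before its first observe. The node name and values here are illustrative; the option names come from the table above:

```elixir
# Hypothetical slow-heartbeat peer: ~30 s period, so widen the prior
# and push the stale cutoff well past several missed beats.
PhiAccrual.track(:"batch_worker@host",
  initial_interval_ms: 30_000,
  initial_std_dev_ms: 5_000,
  stale_after_ms: 300_000
)
```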
Debugging in production
When φ for a node looks wrong (stuck low, stuck high, or noisy in a way that doesn't match the heartbeat source), the fastest first step is to look at what the estimator currently believes:
```elixir
iex> PhiAccrual.inspect_state(:"node_a@host")
%PhiAccrual.Core{
  mean: 1023.4,
  variance: 2_847.1,
  last_arrival_ts: 1_234_567_890,
  last_interval_ms: 1_017.0,
  samples_seen: 42,
  recovering_remaining: 0,
  ...
}
```

If samples_seen is below min_samples, you're still in bootstrap.
If mean is far from the heartbeat period you expect, the source is
either dropping samples or arriving in bursts. If variance is huge
but intervals look stable, you've hit a single outlier — alpha_var
is too high. inspect_state/1 returns {:error, :not_tracked} for
unknown nodes; do not use it on the hot path.
Limitations
Read these before wiring φ to anything that takes irreversible action.
Head-of-line blocking (primary v1 caveat). DistributionPing and
any source that travels over BEAM distribution shares a TCP socket
with user traffic. A large GenServer reply or :pg broadcast can
delay heartbeats for arbitrary periods. PauseMonitor subscribes to
:busy_dist_port so you can observe this (pause telemetry +
confidence: false on φ events), but the underlying problem cannot be
fixed by this library while the source is distribution-based. The
planned phi_accrual_udp companion package solves it by using a
dedicated socket.
Local-pause suppression is best-effort. :erlang.system_monitor
fires on :long_gc, :long_schedule, and :busy_dist_port. The
monitor marks φ output with local_pause?: true and
confidence: false for a short lockout window after any event. It
does not freeze φ or widen the variance — we decided the silent-
detector failure mode is worse than noisy φ. Consumers are expected to
filter on the confidence flag (the Threshold module passes it
through in metadata).
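In practice that filtering can be a single guard in your event handler. A sketch, assuming the metadata shape from the telemetry schema section:

```elixir
defmodule MyApp.PhiGate do
  # Decide whether a φ reading is trustworthy enough to act on:
  # require a warm estimator, a confident sample, and no local pause.
  def actionable?(%{state: :steady, confidence: true, local_pause?: false}), do: true
  def actionable?(_metadata), do: false
end
```

Drop (or merely log) any event for which actionable?/1 returns false rather than feeding it to alerting.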
Gaussian assumption misbehaves under bimodal distributions. BEAM
GC produces intermittent large pauses that, combined with normal
intervals, yield a bimodal inter-arrival distribution. A Gaussian EWMA
is a poor fit and will over-alert. Correlate φ with
:erlang.statistics(:garbage_collection) before acting on high φ. A
non-parametric estimator (Satzger or a two-component mixture) is a v2
consideration once we have real traces from deployments.
One :erlang.system_monitor per node. Only one subscription can
exist; if another library installs its own, whichever subscribes last
silently replaces the other. Disable pause_monitor in config and feed pause
state to PhiAccrual.PauseMonitor.put_state/1 yourself if you need
coexistence.
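A coexistence sketch: own the system_monitor subscription yourself, translate its messages, and forward the pause kinds. The tuple shapes below are the documented :erlang.system_monitor messages; the exact argument put_state/1 accepts is an assumption here (check the PauseMonitor docs):

```elixir
defmodule MyApp.SysMonRelay do
  # Map raw system_monitor messages to the pause kinds the detector knows.
  def pause_kind({:monitor, _pid, :long_gc, _info}), do: :long_gc
  def pause_kind({:monitor, _pid, :long_schedule, _info}), do: :long_schedule
  def pause_kind({:monitor, _port, :busy_dist_port, _info}), do: :busy_dist_port
  def pause_kind(_other), do: nil

  # In the process that owns the system_monitor subscription, forward
  # matches to the detector (put_state/1 call shape assumed):
  #
  #   def handle_info(msg, state) do
  #     case MyApp.SysMonRelay.pause_kind(msg) do
  #       nil  -> :ok
  #       kind -> PhiAccrual.PauseMonitor.put_state(kind)
  #     end
  #     {:noreply, state}
  #   end
end
```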
Testing strategy
Failure detectors are hard to test against wall-clock. This project:
- Uses StreamData for property-based tests of the estimator math (test/phi_accrual/core_test.exs).
- Injects clocks into PhiAccrual.Estimator via the :clock_fn option — no Process.sleep in unit tests.
- Integration tests against live distribution (:peer-based, multi-node) are planned alongside the phi_accrual_udp companion package work.
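The same clock-injection trick works in your own tests. A minimal controllable clock, assuming :clock_fn is a zero-arity function returning milliseconds (check the PhiAccrual.Estimator docs for the exact contract):

```elixir
# An Agent-backed fake clock: advance time explicitly, never sleep.
{:ok, clock} = Agent.start_link(fn -> 0 end)
clock_fn = fn -> Agent.get(clock, & &1) end
advance = fn ms -> Agent.update(clock, &(&1 + ms)) end

advance.(1_000)
clock_fn.()  #=> 1000

# Hand the fake clock to the estimator under test, e.g.:
#   PhiAccrual.track(:"peer@test", clock_fn: clock_fn)
```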
Versioning
v1.x is telemetry-schema-stable: event names, measurement keys, and metadata keys will not change until v2. Per-node option defaults may be tuned within v1.x.
Roadmap
v1 (shipped)
- Dual-α EWMA estimator with bootstrap / stale / recovering states
- PauseMonitor with :busy_dist_port tracking
- Per-node estimator GenServer + DynamicSupervisor + Registry
- Overload shedding with telemetry
- Bring-your-own-signal API + DistributionPing reference source
- Optional Threshold module with hysteresis
- Committed telemetry event schema
Companion packages (separate repos)
The core detector is intentionally transport-agnostic. Heartbeat
transports and integrations live in their own packages, depending on
phi_accrual as a normal Hex dependency:
- phi_accrual_udp — dedicated UDP socket source; escapes BEAM distribution head-of-line blocking. Required to make the detector decision-grade for any consumer that needs it. (Planned, not yet published.)
- phi_accrual_libcluster — libcluster strategy adapter for apps that want one-line membership + detection wiring. (Planned.)
Anyone is free to write additional companion packages — the public
API is stable, the contract is PhiAccrual.observe/2.
Future core work
- Evidence-based evaluation of non-parametric / mixture estimators for bimodal BEAM-GC inter-arrival distributions, once we have real traces from deployments.
- :peer-based multi-node integration tests for DistributionPing.
Related ideas
This library is the first of three composable primitives: φ-accrual → HLC + causal broadcast → SWIM-Lifeguard standalone.
License
Apache-2.0. See LICENSE.