Gossip broadcast
View SourceThe protocol that moves registry updates and other cluster gossip across the network is Plumtree (Push-Lazy-Push Multicast Tree). It produces an efficient broadcast tree that self-heals under churn.
This page explains the algorithm and the invariants it maintains.
The problem
Naive epidemic gossip floods. Every node retransmits every message it sees to every neighbour, generating O(n²) wire traffic. The cluster converges, but the cost grows quadratically with size.
Plumtree replaces the flood with a spanning tree that self-organises out of the active-view links. Once the tree is stable, each broadcast costs O(n) messages: one per node, not one per pair.
The clever bit is that the tree heals automatically when peers fail: there is no separate maintenance pass.
The algorithm
Each node classifies its active-view peers into two sets:
- Eager peers receive the full body of every broadcast.
- Lazy peers receive only an
IHAVEannouncement (essentially the message ID and the sender).
The split starts trivially: when a node first joins, all of its active peers are eager. From then on, the protocol re-shapes the sets as broadcasts flow:
- A duplicate arrival (you already have the body, somebody
sent it again) triggers a
PRUNE: the sender becomes lazy for you. The next broadcast they originate, they will announce to you viaIHAVErather than push directly. - A missing message (you received an
IHAVEbut never the body) triggers aGRAFT: the lazy peer is promoted back to eager, and they retransmit the message.
In the steady state, each broadcast flows down a single
spanning tree of eager links. Lazy links act as a backup index;
when the tree breaks, GRAFT repairs it.
Protocol messages
GOSSIP(MsgId, Payload, Round) %% full body, push
IHAVE(MsgId, Round) %% announcement, lazy
GRAFT(MsgId) %% "promote me to eager"
PRUNE %% "demote me to lazy"Round is a counter on broadcasts; it lets a receiver detect
out-of-order delivery and decide when to give up waiting for a
graft.
Visualising a broadcast
A single broadcast through a six-node cluster, after the tree has stabilised:
origin
/ \
v v
peerA peerB
/ \ \
v v v
peerC peerD peerE
|
v
peerFSolid edges (->) are eager. Every peer receives the body
exactly once.
A peer that fails (say, peerD disconnects) removes its slice
of the tree. The next broadcast from peerA propagates to
peerC only. peerE and peerF still receive via the
right-hand branch through peerB. The lazy backups will
re-graft peerC to one of peerE's parents if needed.
Why this is efficient
For a cluster of n nodes:
- A naïve flood sends
O(n²)messages per broadcast. - Plumtree sends
O(n)GOSSIPmessages (one per node) plusO(n)IHAVEmessages (one per lazy edge), in the steady state. - Repair messages (
GRAFT/PRUNE) are bounded by the rate of churn; on a healthy cluster they are rare.
The protocol scales gracefully with cluster size and tolerates peer failures without a separate repair pass.
When the tree breaks
Two failure modes are interesting:
- A peer crashes mid-broadcast. The next-hop peers receive
the message via the eager edge only if the crash happened
after the body was sent. If not, they will see the
IHAVEfrom a lazy peer andGRAFTto recover. Total recovery time is bounded by one round-trip plus the retransmit. - A network partition splits the active view. Plumtree
keeps broadcasting on each side; when the partition heals,
the next broadcast picks up missed messages through
GRAFT. Older messages (pastMESSAGE_TTL, default 5 minutes) are dropped: the system is eventually consistent, not infinitely retentive.
Message deduplication
Each peer maintains a small ETS-backed cache of recently-seen
message IDs. On each GOSSIP arrival, the cache is checked
first; duplicates trigger PRUNE and are dropped from the
broadcast path.
The cache has a TTL (MESSAGE_TTL, 5 minutes); entries past
that age are discarded. Past the TTL, a re-broadcast of an old
message would be re-flooded as if new. This is intentional: it
bounds memory.
Observability
Plumtree exposes a small set of metrics. The interesting ratios:
| Metric | Healthy ratio |
|---|---|
barrel_p2p.plumtree.graft.sent / gossip.received | Should be small. A high ratio means lots of self-healing, which is a symptom of churn in the active view. |
barrel_p2p.plumtree.prune.sent / gossip.received | A non-trivial steady-state value is normal (the tree settling). A spike means the tree is reshaping (peer failures, joins). |
barrel_p2p.plumtree.ihave.sent | Roughly tracks active-view size × broadcast rate. |
See observe a cluster for the full catalogue.
Configuration
There are no operator-tunable Plumtree knobs in barrel_p2p today.
The defaults match the HyParView paper. If you want to tweak the
deduplication TTL or the graft timeout, the constants live in
src/barrel_p2p_plumtree.erl.
API
Plumtree is internal; you do not call it directly. The subsystems that use it (the service registry, service-event broadcasts) are the consumers.
If you want to broadcast your own message:
barrel_p2p_plumtree:broadcast(Tag, Payload).
%% Subscribe to receive broadcasts.
barrel_p2p_plumtree:subscribe(self()).
%% Receives: {plumtree_broadcast, {Tag, Payload}}
barrel_p2p_plumtree:unsubscribe(self()).The API is beta — the calling shape may change across minor
bumps. For application-level pub/sub, prefer building on
top of the service registry's events; for large blobs, use the
tagged-stream multiplex (streams concept).
Related
- Cluster membership is what produces the active-view links Plumtree builds the tree on top of.
- Service registry is the main consumer of Plumtree broadcasts inside barrel_p2p.
- Observe a cluster lists the metrics emitted by the gossip layer.