Leader election and singletons

View Source

Many applications need "exactly one node runs this job": a cron driver, a queue compactor, a shard owner. Barrel P2P provides this as a small campaign-and-notify API. A process calls barrel_p2p:lead/2, the cluster elects one leader, and the leader is re-elected automatically when membership changes.

Because barrel_p2p is an AP/gossip system with no consensus layer, election is deterministic rather than coordinated, and each leadership term carries a fencing token so a stale leader cannot corrupt shared state. This page explains both.

The campaign-and-notify model

The calling process is the candidate. It is monitored, so if it dies it stops being a candidate. lead/2 returns the caller's initial role, and every later transition arrives as a message:

%% The worker process campaigns, then runs only while it leads.
run() ->
    case barrel_p2p:lead(report_roller) of
        {ok, {leader, Fence}}  -> start_job(Fence);
        {ok, follower}         -> wait()
    end,
    loop().

loop() ->
    receive
        {barrel_p2p_leader, report_roller, {elected, Fence}} ->
            start_job(Fence), loop();
        {barrel_p2p_leader, report_roller, revoked} ->
            stop_job(), loop()
    end.

Supporting calls:

barrel_p2p:leader(report_roller).     %% {ok, Node, Pid} | {error, no_leader}
barrel_p2p:is_leader(report_roller).  %% boolean()
barrel_p2p:fence(report_roller).      %% {ok, Fence} | {error, not_leader}
barrel_p2p:resign(report_roller).     %% step down (no `revoked' is sent)

The caller owns its own job lifecycle: barrel_p2p tells it when it holds leadership and when it loses it, and the caller decides what to start and stop.

How the winner is chosen

Every node holds a replicated set of candidates (which node is campaigning for which name), gossiped exactly like the service registry: an OR-Map keyed by {Name, node()}, broadcast over Plumtree, full-synced on peer_up, and pruned on peer_down.

Given that set, each node computes the leader independently and identically:

  1. highest priority (a lead/2 option, default 0),
  2. ties broken by the lowest node atom.

No votes, no quorum, no coordination round. Two nodes with the same candidate set always agree. During membership flux they may disagree briefly and then converge, which is the same trade the bare whereis_service + node-atom approach already makes.

priority lets you pin a preference (for example a node with more resources) without giving up the deterministic tiebreaker:

barrel_p2p:lead(shard_owner, #{priority => 1}).  %% beats priority 0

Re-election

Re-election is driven by the same membership events the rest of barrel_p2p uses:

  • A new candidate appears (someone calls lead/2): its candidacy gossips out, every node recomputes, and a node that loses gets revoked.
  • The leader resigns or its process dies: the candidacy is tombstoned and the next-best candidate is elected.
  • The leader's node leaves or fails (peer_down): each surviving node drops that node's candidacies and recomputes. The next-best survivor is elected.

Fencing

A leader that is paused (long GC), partitioned, or simply slow can keep believing it leads after the cluster has moved on. If it then writes to a shared resource, it corrupts state. This is the classic split-brain hazard, and election alone does not solve it.

The fix is a fencing token. Each leadership term mints one, and the leader stamps it on every write to the protected resource. The resource records the highest token it has accepted and rejects any operation whose token is not strictly greater:

{ok, {leader, Fence}} = barrel_p2p:lead(ledger_writer),
ok = ledger:append(Entry, #{fence => Fence}).

%% Inside the resource:
append(Entry, #{fence := F}) when F > LastAcceptedFence ->
    do_append(Entry, F);
append(_Entry, _) ->
    {error, fenced_out}.   %% a newer leader has taken over

A revoked leader's token is now stale, so its late writes are refused. That is what turns "we elected one leader" into "exactly one leader can actually mutate state".

How the token is built

The token is a non_neg_integer(), minted from barrel_p2p's hybrid logical clock. When a node takes a term it advances its HLC past a replicated per-name high-water mark and then takes a fresh timestamp, so the new token is strictly greater than every token observed in its connected component. The high-water mark is gossiped, so the next leader (on any node) mints above it.

What the guarantee is, and is not

  • Within a connected partition, tokens are strictly monotonic: each term's token exceeds every earlier term's. This is the property the resource check relies on, and it holds across a real leader death (proven in barrel_p2p_leader_e2e_SUITE: kill the leader, the survivor takes over with a strictly greater token).
  • Across a network partition, barrel_p2p cannot guarantee monotonicity without a consensus layer it deliberately does not have. Each side may elect its own leader. Safety then rests entirely on the resource's reject-if-not-greater check; the HLC wall-clock component keeps cross-partition tokens approximately ordered, but you must not assume strict ordering there.

If you need cross-partition exclusivity guarantees, you need a consensus system; that is outside barrel_p2p's AP design.

API

%% Campaign. Returns the initial role; transitions arrive as messages.
lead(Name) -> {ok, {leader, Fence}} | {ok, follower} | {error, term()}.
lead(Name, #{priority => integer()}) -> same.

%% Messages delivered to the candidate process:
%%   {barrel_p2p_leader, Name, {elected, Fence}}
%%   {barrel_p2p_leader, Name, revoked}

resign(Name)    -> ok.
leader(Name)    -> {ok, node(), pid()} | {error, no_leader}.
is_leader(Name) -> boolean().
fence(Name)     -> {ok, Fence} | {error, not_leader}.

%% Name :: term().  Fence :: non_neg_integer().

The API is beta: the message and return shapes may change across a 0.x minor bump.