Troubleshooting

View Source

A table per category, ordered by how often we see each symptom in practice. Every row gives a likely cause and a place to look.

The structure of every cell: what you see, what is probably happening, and the code or configuration to inspect. Where the fix is simple, it is included; where it requires a runbook, the relevant doc is linked.

A node fails to boot

SymptomLikely causeWhat to do
Protocol 'quic': register/listen error: {credentials,no_credentials}TLS material missing, listener cannot find a cert/keyCheck data/quic/node.crt and data/quic/node.key exist and are readable. Run priv/bin/barrel_p2p_gen_cert.sh to generate by hand.
barrel_p2p_dist: cert ensure failed: <reason>Cert directory is not writableConfirm auth_key_dir / quic_cert_dir point to a writable path. Check disk quota.
Boot fails with {barrel_p2p_dist, auth_enabled_without_callback}barrel_p2p.auth_enabled = true but the projected auth_callback is undefined (you set it explicitly)This is fail-safe behaviour. Either set auth_enabled = false or do not override the auth callback.
Failed to initialize keypair: <reason>Cannot write to auth_key_dirVerify filesystem permissions and free space.
{error, keypair_mismatch} on loadIdentity rotation crashed mid-flight; node.pub does not derive from node.keyRestore from the most recent data/keys/backups/<ts>/ or regenerate with barrel_p2p_rotate:rotate_identity/0.

A peer cannot connect

SymptomLikely causeWhat to do
Key mismatch for node <peer> - existing key differs from presented keyA peer's identity rotated, or someone is impersonating itIf you just rotated keys, delete the stale pin and let TOFU re-pin (barrel_p2p_dist_keys:delete_key/1). Otherwise treat as a security event.
{auth_rejected, untrusted_key}Strict mode, no pin for this peerProvision the peer's public key (barrel_p2p_dist_keys:store_key/2) or switch to TOFU.
auth_stream_timeoutThe handshake did not complete within auth_handshake_timeout (default 10s)Inspect the network between peers. Raise the timeout if the link is genuinely slow.
{auth_rejected, signature_invalid}Wrong private key in use, or clock skew beyond the wall-time windowVerify the keypair on disk derives correctly (barrel_p2p_dist_auth:load_keypair/1); check auth_timestamp_window and NTP.
{error, unexpected_auth_ok}A server sent AUTH_OK but the client did not have the target in its own cookie_only_nodesCookie-only is symmetric: both ends must list the peer. Either provision the whitelist on both sides or remove cookie_only_nodes.
Pid ! Msg returns immediately but the peer never receivesDist channel silently closed (reaped, or remote nodedown)Check erlang:nodes/0 on both sides; check the barrel_p2p.dist_gc.reap rate; resend (auto-connect will reopen).

A new node cannot find the cluster

SymptomLikely causeWhat to do
barrel_p2p:active_view/0 always emptyDiscovery chain returns nothing for the contact nodesInspect each backend in barrel_p2p.discovery_backends. The file backend needs a shared directory; the DNS backend needs records.
barrel_p2p_discovery_dns: no records for <host> in the logDNS records missingConfirm the zone publishes records for the host portion of every node atom.
File-based discovery: peer not visible to other hostsDiscovery directory is local to each hostMount a shared volume; use rsync/NFS; or switch to the DNS backend.
[barrel_p2p_app] no listen port found; skipping discovery publishThe QUIC listener did not bind, or the node is not running -proto_dist barrel_p2pConfirm the listener bound (netstat -unlp | grep beam). Confirm -proto_dist barrel_p2p in your boot args.

Cluster membership instability

SymptomLikely causeWhat to do
Nodes oscillate in and out of the active viewNetwork instability or max_fail_count too lowInspect barrel_p2p.hyparview.peer_down by reason. Tune max_fail_count, base_backoff_ms.
Active view grows past active_sizeShould not happen post-decoupling; HyParView caps the active view itselfConfirm the running build. barrel_p2p:active_view/0 should never exceed active_size; erlang:nodes/0 may legitimately be larger.
peer_up/peer_down events arrive in unexpected ordersMultiple shuffles interleavingOrder is best-effort; consume both events and treat them as state transitions, not strict sequences.

Discovery and ports

SymptomLikely causeWhat to do
Connection refused on the dist portFirewall, or listen_port not what you thinknetstat -unlp | grep beam shows the actual port; check container or host firewall rules.
A new container cannot reach existing peersdocker-compose discovery shares no state between containersProvide a static topology (see docker/cluster-topology.config in the project), or share a volume for the file backend.

Idle dist-channel GC

SymptomLikely causeWhat to do
barrel_p2p.dist_gc.reap rate is highApplication opens dist channels faster than GC's min_ageRaise dist_gc_min_age_ms so transient sends do not get reaped mid-conversation.
Streams on a non-active peer keep disappearingStream owners are short-lived pids; GC sees no live ownerPin a long-lived owner pid for any stream you expect to outlive the call that opened it.
GC never reaps anythingEvery peer is in active view or carries a live streamThis is correct behaviour; no action needed.

Cookies and Erlang dist handshake

SymptomLikely causeWhat to do
{exit, normal} immediately after dist handshake startCookie mismatchSet barrel_p2p.dist_cookie to the same value on every node; barrel_p2p_app applies it at start.
Peer "connects" but nodes/0 shows only the local nodeProto_dist mismatch or cookie mismatchConfirm both sides have -proto_dist barrel_p2p and the same cookie atom.

Metrics absent or zero

SymptomLikely causeWhat to do
No barrel_p2p.* instruments in the exporterinstrument not started, or no exporter wiredapplication:which_applications/0 must show instrument. Confirm the exporter (Prometheus/OTLP/live) is started.
Counters present but always zeroThe code path never fired on this runCross-check the seam in the relevant module (barrel_p2p_hyparview_events, barrel_p2p_dist_auth_stream, barrel_p2p_plumtree, barrel_p2p_dist_gc).
dist.auth.attempts{outcome=fail} risingSomething rejected the handshake (key, cookie, clock, mode)Skim the recent log for key_mismatch, untrusted_key, signature_invalid, auth_stream_timeout.

When to look at logs vs metrics

  • Metrics are good for "is something happening at the wrong rate". Use them for paging, dashboards, capacity planning.
  • Logs are good for "what was the specific reason". The auth rejection log lines carry the node atom and (for key mismatch) the fingerprints; the discovery log lines carry the host that failed to resolve.

Both layers share a vocabulary; if you see key_mismatch in a log, expect the dist.auth.attempts{outcome=fail} counter to have moved at the same time.