reckon_db_health_prober (reckon_db v1.6.0)


Active health prober for reckon-db cluster nodes

Implements active health probing to detect node failures faster than passive net_kernel:monitor_nodes/1 events. This is critical for timely split-brain detection and quorum management.

Design Philosophy:

Passive monitoring (nodeup/nodedown) can take 60+ seconds to detect failures depending on net_ticktime configuration. Active probing provides sub-second detection with configurable thresholds.
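For contrast, the passive approach looks like this. With the default net_ticktime of 60 seconds, a dead peer is typically reported only after 60-75 seconds of silence; handle_failure/1 below is a hypothetical handler, not part of reckon_db:

```erlang
%% Passive detection: subscribe to nodeup/nodedown events from net_kernel.
%% Detection latency is bounded by net_ticktime, not by this code.
ok = net_kernel:monitor_nodes(true),
receive
    {nodedown, Node} ->
        %% Only fires once the distribution layer gives up on the peer.
        handle_failure(Node)
end.
```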

Probe Types:

1. Ping Probe - net_adm:ping/1; fast but shallow
2. RPC Probe - rpc:call with actual work; a deeper health check
3. Khepri Probe - khepri_cluster:members/1; verifies store health
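The three probe types might be dispatched roughly as follows. This is a sketch, not the module's actual internals; health_check/1 and khepri_cluster:members/1 are the documented calls, the probe/4 shape is an assumption:

```erlang
%% Sketch of one probe attempt per type; returns ok | {error, Reason}.
probe(ping, Node, _StoreId, _Timeout) ->
    %% Distribution-level ping: fast, but only proves the VM is reachable.
    case net_adm:ping(Node) of
        pong -> ok;
        pang -> {error, pang}
    end;
probe(rpc, Node, _StoreId, Timeout) ->
    %% Runs actual code on the peer, so it also detects a wedged node.
    case rpc:call(Node, reckon_db_health_prober, health_check, [node()], Timeout) of
        {ok, _Info} -> ok;
        Other       -> {error, Other}
    end;
probe(khepri, Node, StoreId, Timeout) ->
    %% Deepest check: verifies the Khepri store itself answers on the peer.
    case rpc:call(Node, khepri_cluster, members, [StoreId], Timeout) of
        {ok, _Members} -> ok;
        Error          -> {error, Error}
    end.
```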

Failure Threshold:

A node is only declared failed after consecutive probe failures (default: 3). This prevents transient network issues from triggering false positives.
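The consecutive-failure rule can be sketched as a small state transition, matching the node_status() values below; note_result/3 is hypothetical bookkeeping, not an exported function:

```erlang
%% A probe success resets the counter; a node becomes 'failed' only once
%% the consecutive-failure count reaches the threshold, 'suspect' before.
note_result(ok, _Threshold, _Prev) ->
    {healthy, 0};
note_result({error, _}, Threshold, {_Status, Fails}) when Fails + 1 >= Threshold ->
    {failed, Fails + 1};
note_result({error, _}, _Threshold, {_Status, Fails}) ->
    {suspect, Fails + 1}.
```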

Recovery Detection:

Once a node is marked failed, probing continues. When probes succeed again, the node is marked recovered and callbacks are notified.

See also: reckon_db_consistency_checker.

Summary

Functions

configure(StoreId, Config)

Update prober configuration

get_all_status(StoreId)

Get health status of all monitored nodes

get_node_status(StoreId, Node)

Get health status of a specific node

health_check(CallerNode)

Health check function called via RPC. Returns basic node health information

on_node_failed(StoreId, Callback)

Register callback for node failure events

on_node_recovered(StoreId, Callback)

Register callback for node recovery events

probe_now(StoreId)

Force immediate probe cycle

remove_callback(StoreId, Ref)

Remove a previously registered callback

start_link(Store_config)

Start the health prober

Types

node_status/0

-type node_status() :: healthy | suspect | failed | unknown.

probe_config/0

-type probe_config() ::
          #{probe_interval => pos_integer(),
            probe_timeout => pos_integer(),
            failure_threshold => pos_integer(),
            probe_type => ping | rpc | khepri}.
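An illustrative probe_config() using the keys above; the values are examples and the time units are assumed to be milliseconds, which the type itself does not state:

```erlang
%% Tighten probing for faster failure detection on store 'my_store'.
Config = #{probe_interval    => 1000,  %% assumed ms between probe cycles
           probe_timeout     => 500,   %% assumed ms per individual probe
           failure_threshold => 3,     %% consecutive failures before 'failed'
           probe_type        => rpc},
ok = reckon_db_health_prober:configure(my_store, Config).
```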

store_config/0

-type store_config() ::
          #store_config{store_id :: atom(),
                        data_dir :: string(),
                        mode :: single | cluster,
                        timeout :: pos_integer(),
                        writer_pool_size :: pos_integer(),
                        reader_pool_size :: pos_integer(),
                        gateway_pool_size :: pos_integer(),
                        options :: map()}.

Functions

configure(StoreId, Config)

-spec configure(atom(), probe_config()) -> ok.

Update prober configuration

get_all_status(StoreId)

-spec get_all_status(atom()) -> #{node() => node_status()}.

Get health status of all monitored nodes

get_node_status(StoreId, Node)

-spec get_node_status(atom(), node()) -> {ok, node_status()} | {error, unknown_node}.

Get health status of a specific node

handle_call(Request, From, State)

handle_cast(Msg, State)

handle_info(Info, State)

health_check(CallerNode)

-spec health_check(node()) -> {ok, map()}.

Health check function called via RPC. Returns basic node health information

init(Store_config)

on_node_failed(StoreId, Callback)

-spec on_node_failed(atom(), fun((node()) -> any())) -> reference().

Register callback for node failure events

on_node_recovered(StoreId, Callback)

-spec on_node_recovered(atom(), fun((node()) -> any())) -> reference().

Register callback for node recovery events
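Putting the callback API together, using only the functions documented on this page (the store id my_store is illustrative):

```erlang
%% Register failure and recovery callbacks, then remove one by reference.
FailRef = reckon_db_health_prober:on_node_failed(my_store,
              fun(Node) -> logger:warning("node down: ~p", [Node]) end),
_RecRef = reckon_db_health_prober:on_node_recovered(my_store,
              fun(Node) -> logger:notice("node recovered: ~p", [Node]) end),
%% Later, unregister the failure callback.
ok = reckon_db_health_prober:remove_callback(my_store, FailRef).
```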

probe_now(StoreId)

-spec probe_now(atom()) -> ok.

Force immediate probe cycle

remove_callback(StoreId, Ref)

-spec remove_callback(atom(), reference()) -> ok.

Remove a previously registered callback

start_link(Store_config)

-spec start_link(store_config()) -> {ok, pid()} | {error, term()}.

Start the health prober
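A sketch of starting the prober from a store_config() record. The record fields follow the type above, but the include path and field values are assumptions for illustration:

```erlang
%% The #store_config{} record is assumed to be defined in a reckon_db
%% header; the exact include path may differ in your release.
-include_lib("reckon_db/include/reckon_db.hrl").

start_prober() ->
    Cfg = #store_config{store_id          = my_store,
                        data_dir          = "/var/lib/reckon_db",
                        mode              = cluster,
                        timeout           = 5000,
                        writer_pool_size  = 4,
                        reader_pool_size  = 8,
                        gateway_pool_size = 2,
                        options           = #{}},
    {ok, _Pid} = reckon_db_health_prober:start_link(Cfg).
```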

terminate(Reason, State)