reckon_db_health_prober (reckon_db v1.6.0)
View SourceActive health prober for reckon-db cluster nodes
Implements active health probing to detect node failures faster than passive net_kernel:monitor_nodes/1 events. This is critical for timely split-brain detection and quorum management.
Design Philosophy:
Passive monitoring (nodeup/nodedown) can take 60+ seconds to detect failures depending on net_ticktime configuration. Active probing provides sub-second detection with configurable thresholds.
Probe Types:
1. Ping Probe - net_adm:ping/1, fast but shallow 2. RPC Probe - rpc:call with actual work, deeper health check 3. Khepri Probe - khepri_cluster:members/1, verifies store health
Failure Threshold:
A node is only declared failed after consecutive probe failures (default: 3). This prevents transient network issues from triggering false positives.
Recovery Detection:
Once a node is marked failed, probing continues. When probes succeed again, the node is marked recovered and callbacks are notified.
See also: reckon_db_consistency_checker.
Summary
Functions
Update prober configuration
Get health status of all monitored nodes
Get health status of a specific node
Health check function called via RPC Returns basic node health information
Register callback for node failure events
Register callback for node recovery events
Force immediate probe cycle
Remove a previously registered callback
Start the health prober
Types
-type node_status() :: healthy | suspect | failed | unknown.
-type probe_config() :: #{probe_interval => pos_integer(), probe_timeout => pos_integer(), failure_threshold => pos_integer(), probe_type => ping | rpc | khepri}.
-type store_config() :: #store_config{store_id :: atom(), data_dir :: string(), mode :: single | cluster, timeout :: pos_integer(), writer_pool_size :: pos_integer(), reader_pool_size :: pos_integer(), gateway_pool_size :: pos_integer(), options :: map()}.
Functions
-spec configure(atom(), probe_config()) -> ok.
Update prober configuration
-spec get_all_status(atom()) -> #{node() => node_status()}.
Get health status of all monitored nodes
-spec get_node_status(atom(), node()) -> {ok, node_status()} | {error, unknown_node}.
Get health status of a specific node
Health check function called via RPC Returns basic node health information
Register callback for node failure events
Register callback for node recovery events
-spec probe_now(atom()) -> ok.
Force immediate probe cycle
Remove a previously registered callback
-spec start_link(store_config()) -> {ok, pid()} | {error, term()}.
Start the health prober