reckon_db_health_prober (reckon_db v1.6.0)


Active health prober for reckon-db cluster nodes

Implements active health probing to detect node failures faster than passive net_kernel:monitor_nodes/1 events. This is critical for timely split-brain detection and quorum management.

Design Philosophy:

Passive monitoring (nodeup/nodedown) can take 60+ seconds to detect failures depending on net_ticktime configuration. Active probing provides sub-second detection with configurable thresholds.
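For contrast, the passive approach looks like this. With the default net_ticktime of 60 seconds, a dead peer is typically reported only after 60-75 seconds of silence; handle_failure/1 below is a hypothetical handler, not part of reckon_db:

```erlang
%% Passive detection: subscribe to nodeup/nodedown events from net_kernel.
%% Detection latency is bounded by net_ticktime, not by this code.
ok = net_kernel:monitor_nodes(true),
receive
    {nodedown, Node} ->
        %% Only fires once the distribution layer gives up on the peer.
        handle_failure(Node)
end.
```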

Probe Types:

1. Ping Probe - net_adm:ping/1; fast but shallow
2. RPC Probe - rpc:call with actual work; a deeper health check
3. Khepri Probe - khepri_cluster:members/1; verifies store health
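The three probe types might be dispatched roughly as follows. This is a sketch, not the module's actual internals; health_check/1 and khepri_cluster:members/1 are the documented calls, the probe/4 shape is an assumption:

```erlang
%% Sketch of one probe attempt per type; returns ok | {error, Reason}.
probe(ping, Node, _StoreId, _Timeout) ->
    %% Distribution-level ping: fast, but only proves the VM is reachable.
    case net_adm:ping(Node) of
        pong -> ok;
        pang -> {error, pang}
    end;
probe(rpc, Node, _StoreId, Timeout) ->
    %% Runs actual code on the peer, so it also detects a wedged node.
    case rpc:call(Node, reckon_db_health_prober, health_check, [node()], Timeout) of
        {ok, _Info} -> ok;
        Other       -> {error, Other}
    end;
probe(khepri, Node, StoreId, Timeout) ->
    %% Deepest check: verifies the Khepri store itself answers on the peer.
    case rpc:call(Node, khepri_cluster, members, [StoreId], Timeout) of
        {ok, _Members} -> ok;
        Error          -> {error, Error}
    end.
```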

Failure Threshold:

A node is only declared failed after consecutive probe failures (default: 3). This prevents transient network issues from triggering false positives.
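The consecutive-failure rule can be sketched as a small state transition, matching the node_status() values below; note_result/3 is hypothetical bookkeeping, not an exported function:

```erlang
%% A probe success resets the counter; a node becomes 'failed' only once
%% the consecutive-failure count reaches the threshold, 'suspect' before.
note_result(ok, _Threshold, _Prev) ->
    {healthy, 0};
note_result({error, _}, Threshold, {_Status, Fails}) when Fails + 1 >= Threshold ->
    {failed, Fails + 1};
note_result({error, _}, _Threshold, {_Status, Fails}) ->
    {suspect, Fails + 1}.
```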

Recovery Detection:

Once a node is marked failed, probing continues. When probes succeed again, the node is marked recovered and callbacks are notified.

See also: reckon_db_consistency_checker.

Summary

Functions

configure(StoreId, Config)

Update prober configuration

get_all_status(StoreId)

Get health status of all monitored nodes

get_node_status(StoreId, Node)

Get health status of a specific node

health_check(CallerNode)

Health check function called via RPC. Returns basic node health information

on_node_failed(StoreId, Callback)

Register callback for node failure events

on_node_recovered(StoreId, Callback)

Register callback for node recovery events

probe_now(StoreId)

Force immediate probe cycle

remove_callback(StoreId, Ref)

Remove a previously registered callback

start_link(Store_config)

Start the health prober

Types

node_status/0

-type node_status() :: healthy | suspect | failed | unknown.

probe_config/0

-type probe_config() ::
          #{probe_interval => pos_integer(),
            probe_timeout => pos_integer(),
            failure_threshold => pos_integer(),
            probe_type => ping | rpc | khepri}.
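An illustrative probe_config() using the keys above; the values are examples and the time units are assumed to be milliseconds, which the type itself does not state:

```erlang
%% Tighten probing for faster failure detection on store 'my_store'.
Config = #{probe_interval    => 1000,  %% assumed ms between probe cycles
           probe_timeout     => 500,   %% assumed ms per individual probe
           failure_threshold => 3,     %% consecutive failures before 'failed'
           probe_type        => rpc},
ok = reckon_db_health_prober:configure(my_store, Config).
```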

store_config/0

-type store_config() ::
          #store_config{store_id :: atom(),
                        data_dir :: string(),
                        mode :: single | cluster,
                        timeout :: pos_integer(),
                        writer_pool_size :: pos_integer(),
                        reader_pool_size :: pos_integer(),
                        gateway_pool_size :: pos_integer(),
                        options :: map()}.

Functions

configure(StoreId, Config)

-spec configure(atom(), probe_config()) -> ok.

Update prober configuration

get_all_status(StoreId)

-spec get_all_status(atom()) -> #{node() => node_status()}.

Get health status of all monitored nodes

get_node_status(StoreId, Node)

-spec get_node_status(atom(), node()) -> {ok, node_status()} | {error, unknown_node}.

Get health status of a specific node

handle_call(Request, From, State)

handle_cast(Msg, State)

handle_info(Info, State)

health_check(CallerNode)

-spec health_check(node()) -> {ok, map()}.

Health check function called via RPC. Returns basic node health information

init(Store_config)

on_node_failed(StoreId, Callback)

-spec on_node_failed(atom(), fun((node()) -> any())) -> reference().

Register callback for node failure events

on_node_recovered(StoreId, Callback)

-spec on_node_recovered(atom(), fun((node()) -> any())) -> reference().

Register callback for node recovery events
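Putting the callback API together, using only the functions documented on this page (the store id my_store is illustrative):

```erlang
%% Register failure and recovery callbacks, then remove one by reference.
FailRef = reckon_db_health_prober:on_node_failed(my_store,
              fun(Node) -> logger:warning("node down: ~p", [Node]) end),
_RecRef = reckon_db_health_prober:on_node_recovered(my_store,
              fun(Node) -> logger:notice("node recovered: ~p", [Node]) end),
%% Later, unregister the failure callback.
ok = reckon_db_health_prober:remove_callback(my_store, FailRef).
```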

probe_now(StoreId)

-spec probe_now(atom()) -> ok.

Force immediate probe cycle

remove_callback(StoreId, Ref)

-spec remove_callback(atom(), reference()) -> ok.

Remove a previously registered callback

start_link(Store_config)

-spec start_link(store_config()) -> {ok, pid()} | {error, term()}.

Start the health prober
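A sketch of starting the prober from a store_config() record. The record fields follow the type above, but the include path and field values are assumptions for illustration:

```erlang
%% The #store_config{} record is assumed to be defined in a reckon_db
%% header; the exact include path may differ in your release.
-include_lib("reckon_db/include/reckon_db.hrl").

start_prober() ->
    Cfg = #store_config{store_id          = my_store,
                        data_dir          = "/var/lib/reckon_db",
                        mode              = cluster,
                        timeout           = 5000,
                        writer_pool_size  = 4,
                        reader_pool_size  = 8,
                        gateway_pool_size = 2,
                        options           = #{}},
    {ok, _Pid} = reckon_db_health_prober:start_link(Cfg).
```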

terminate(Reason, State)