Distributed cluster health monitor for SuperCache.
Continuously monitors node connectivity, replication lag, partition health, and operation latency across the cluster. Provides real-time health status and anomaly detection.
Health Checks
The monitor performs these checks on a configurable interval:
- Node connectivity: pings each live node via :erpc and measures RTT
- Replication lag: writes a probe record and measures time until it appears on all replicas
- Partition balance: checks that record counts are evenly distributed
- Error rate: tracks the ratio of failed operations per node
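As an illustration of the connectivity check, the sketch below times a round trip through :erpc. The module name `PingSketch`, the choice of `:erlang.node/0` as the remote probe, and the 1-second default timeout are assumptions made for this example; the monitor's actual probe may differ.

```elixir
defmodule PingSketch do
  # Hypothetical helper, not part of SuperCache.
  # Returns {:ok, rtt_ms} when the node answers, {:error, reason} otherwise.
  def ping(node, timeout_ms \\ 1_000) do
    started = System.monotonic_time(:microsecond)

    try do
      # Ask the remote node for its own name; any cheap call would do.
      ^node = :erpc.call(node, :erlang, :node, [], timeout_ms)
      elapsed_us = System.monotonic_time(:microsecond) - started
      {:ok, div(elapsed_us, 1_000)}
    catch
      # :erpc signals connection and timeout failures as {:erpc, reason} errors.
      :error, {:erpc, reason} -> {:error, reason}
    end
  end
end
```

An `{:error, :noconnection}` or `{:error, :timeout}` result would correspond to a failed connectivity check for that node.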
Health Status
Each node reports a health status:
| Status | Meaning |
|---|---|
| :healthy | All checks passing, latency within thresholds |
| :degraded | Some checks failing or latency elevated |
| :unhealthy | Node unreachable or critical checks failing |
| :unknown | No health data available yet |
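One way to read the table is as a mapping from per-check results to a single status. The sketch below is a guess at such a mapping, not the monitor's actual logic: it assumes connectivity is the only critical check and that each check is a map with a `:status` key, as in the `node_health` return value. `StatusSketch` is a hypothetical name.

```elixir
defmodule StatusSketch do
  # Hypothetical classifier, not part of SuperCache.

  # No data collected yet.
  def classify(nil), do: :unknown

  # Assumption: a failed connectivity check is the "critical" failure.
  def classify(%{connectivity: %{status: :fail}}), do: :unhealthy

  # Any other failing check degrades the node; otherwise it is healthy.
  def classify(checks) when is_map(checks) do
    any_failed? = Enum.any?(checks, fn {_name, check} -> check.status == :fail end)
    if any_failed?, do: :degraded, else: :healthy
  end
end
```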
Configuration
| Option | Default | Description |
|---|---|---|
| :interval_ms | 5_000 | Health check interval in milliseconds |
| :latency_threshold_ms | 100 | Max acceptable node RTT before :degraded |
| :replication_lag_threshold_ms | 500 | Max acceptable replication lag in milliseconds |
| :error_rate_threshold | 0.05 | Max acceptable error rate (5%) |
| :partition_imbalance_threshold | 0.3 | Max acceptable partition size variance (30%) |
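A configuration fragment using these options might look like the following. It assumes the options are passed as a keyword list and that the module's child spec forwards them to `start_link`; the values are illustrative, not recommendations.

```elixir
# Sketch: run the monitor under a supervisor with relaxed thresholds.
children = [
  {SuperCache.Cluster.HealthMonitor,
   interval_ms: 10_000,
   latency_threshold_ms: 250,
   replication_lag_threshold_ms: 1_000,
   error_rate_threshold: 0.1,
   partition_imbalance_threshold: 0.5}
]

Supervisor.start_link(children, strategy: :one_for_one)
```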
Usage
# Start with defaults
SuperCache.Cluster.HealthMonitor.start_link([])
# Get cluster health
SuperCache.Cluster.HealthMonitor.cluster_health()
# => %{status: :healthy, timestamp: ..., nodes: [...], summary: %{...}}
# Get node-specific health
SuperCache.Cluster.HealthMonitor.node_health(:"node1@127.0.0.1")
# => %{status: :healthy, latency_ms: 12, checks: %{...}}
Telemetry Events
The monitor emits events via :telemetry (if available):
- [:super_cache, :health, :check]: after each health check cycle
- [:super_cache, :health, :alert]: when a health threshold is breached
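A handler for these events could be attached roughly as follows. The event names come from the list above; the contents of the measurements and metadata maps are not documented here, so the handler only inspects them. `HealthEventLogger` and the handler id are hypothetical names, and the sketch assumes the :telemetry package is available when `attach/0` is called.

```elixir
defmodule HealthEventLogger do
  # Hypothetical handler module, not part of SuperCache.
  require Logger

  @events [
    [:super_cache, :health, :check],
    [:super_cache, :health, :alert]
  ]

  # Registers handle_event/4 for both health events.
  def attach do
    :telemetry.attach_many("health-event-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Threshold breaches are logged at warning level.
  def handle_event([:super_cache, :health, :alert], measurements, metadata, _config) do
    Logger.warning("health alert: #{inspect(measurements)} #{inspect(metadata)}")
  end

  # Routine check cycles are logged at debug level.
  def handle_event([:super_cache, :health, :check], measurements, _metadata, _config) do
    Logger.debug("health check cycle: #{inspect(measurements)}")
  end
end
```

Calling `HealthEventLogger.attach()` once at application start would be enough; :telemetry rejects a second attach with the same handler id.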
Functions
child_spec/1
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec cluster_health() :: map()
Returns the overall cluster health status.
Returns
%{
  status: :healthy | :degraded | :unhealthy,
  timestamp: integer,
  nodes: [
    %{
      node: node(),
      status: :healthy | :degraded | :unhealthy | :unknown,
      latency_ms: non_neg_integer | nil,
      last_check: integer,
      checks: %{...}
    }
  ],
  summary: %{
    total_nodes: non_neg_integer,
    healthy: non_neg_integer,
    degraded: non_neg_integer,
    unhealthy: non_neg_integer
  }
}
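As a sketch of how a caller might consume this map, the helper below lists the nodes that are not healthy. It relies only on the `nodes`, `node`, and `status` keys shown above; `ClusterReportSketch` is a hypothetical name.

```elixir
defmodule ClusterReportSketch do
  # Hypothetical helper, not part of SuperCache.
  # Returns {node, status} pairs for every degraded or unhealthy node.
  def problem_nodes(%{nodes: nodes}) do
    for %{node: node, status: status} <- nodes,
        status in [:degraded, :unhealthy],
        do: {node, status}
  end
end
```

In practice the argument would be the result of `SuperCache.Cluster.HealthMonitor.cluster_health()`.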
@spec force_check() :: :ok
Forces an immediate health check cycle.
Useful for testing or when an external monitoring system needs fresh data.
node_health/1
Returns the health status for a specific node.
Returns
%{
  node: node(),
  status: :healthy | :degraded | :unhealthy | :unknown,
  latency_ms: non_neg_integer | nil,
  last_check: integer,
  checks: %{
    connectivity: %{status: :pass | :fail, latency_ms: non_neg_integer | nil},
    replication: %{status: :pass | :fail | :unknown, lag_ms: non_neg_integer | nil},
    partitions: %{status: :pass | :fail | :unknown, imbalance: float | nil},
    error_rate: %{status: :pass | :fail | :unknown, rate: float | nil}
  }
}
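To see which checks are responsible for a non-healthy status, a caller could filter the `checks` map, for example as sketched here. `NodeChecksSketch` is a hypothetical name; only the documented `checks` shape is assumed.

```elixir
defmodule NodeChecksSketch do
  # Hypothetical helper, not part of SuperCache.
  # Returns the names of all checks whose status is :fail, sorted by name.
  def failing_checks(%{checks: checks}) do
    names = for {name, %{status: :fail}} <- checks, do: name
    Enum.sort(names)
  end
end
```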
@spec partition_balance() :: map()
Returns partition balance statistics across all nodes.
Returns
%{
  total_records: non_neg_integer,
  partition_count: non_neg_integer,
  avg_records_per_partition: float,
  max_imbalance: float,
  partitions: [
    %{idx: non_neg_integer, record_count: non_neg_integer, primary: node()}
  ]
}
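The exact formula behind `max_imbalance` is not stated here. One plausible definition, used purely as an assumption in the sketch below, is the largest relative deviation of a partition's record count from the average. `BalanceSketch` is a hypothetical name.

```elixir
defmodule BalanceSketch do
  # Hypothetical helper, not part of SuperCache.
  # Assumed definition: max over partitions of |count - avg| / avg.
  def max_imbalance([]), do: 0.0

  def max_imbalance(counts) do
    avg = Enum.sum(counts) / length(counts)

    if avg == 0 do
      # All partitions are empty, so there is nothing to be imbalanced.
      0.0
    else
      counts
      |> Enum.map(fn count -> abs(count - avg) / avg end)
      |> Enum.max()
    end
  end
end
```

Under this definition, record counts of 100, 100, and 160 average to 120, and the largest deviation is 40 / 120, about 0.33, which would exceed the default :partition_imbalance_threshold of 0.3.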
@spec replication_lag(non_neg_integer()) :: map()
Returns the replication lag for a specific partition across all replicas.
Returns
%{
  partition_idx: non_neg_integer,
  primary: node(),
  replicas: [
    %{node: node(), lag_ms: non_neg_integer | nil, status: :synced | :lagging | :unknown}
  ]
}
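A caller interested only in the slowest replica could reduce this map as sketched below. `LagSketch` is a hypothetical name; replicas with a `nil` lag (no measurement yet) are skipped.

```elixir
defmodule LagSketch do
  # Hypothetical helper, not part of SuperCache.
  # Returns the largest known lag in milliseconds, or nil if none is known.
  def worst_lag(%{replicas: replicas}) do
    replicas
    |> Enum.map(fn replica -> replica.lag_ms end)
    |> Enum.reject(&is_nil/1)
    |> Enum.max(fn -> nil end)
  end

  # Returns the nodes currently reported as :lagging.
  def lagging_nodes(%{replicas: replicas}) do
    for %{node: node, status: :lagging} <- replicas, do: node
  end
end
```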
start_link/1
Starts the health monitor GenServer.
Options
See module documentation for configuration options.