SuperCache.Cluster.HealthMonitor (SuperCache v1.3.0)

Distributed cluster health monitor for SuperCache.

Continuously monitors node connectivity, replication lag, partition health, and operation latency across the cluster. Provides real-time health status and anomaly detection.

Health Checks

The monitor performs these checks on a configurable interval:

Node connectivity — pings each live node via :erpc and measures RTT
Replication lag — writes a probe record and measures time until it appears on all replicas
Partition balance — checks that record counts are evenly distributed
Error rate — tracks the ratio of failed operations per node
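
The node connectivity check can be illustrated with a small sketch. This is not the monitor's actual implementation: the module name `ConnectivityProbeSketch`, the `ping/2` function, and the choice of `:erlang.node/0` as the remote call are assumptions made for illustration. It only shows how an `:erpc` round trip can be timed.

```elixir
defmodule ConnectivityProbeSketch do
  # Hypothetical helper, not part of SuperCache.
  # Calls :erlang.node/0 on the target via :erpc and times the round trip.
  # Returns {:ok, rtt_ms} on success or {:error, reason} on failure.
  def ping(target, timeout_ms \\ 1_000) do
    {elapsed_us, result} =
      :timer.tc(fn ->
        try do
          {:ok, :erpc.call(target, :erlang, :node, [], timeout_ms)}
        rescue
          # :erpc raises errors such as {:erpc, :noconnection} or {:erpc, :timeout}
          e in ErlangError -> {:error, e.original}
        catch
          :exit, reason -> {:error, reason}
        end
      end)

    case result do
      # The remote node answered with its own name: report RTT in milliseconds.
      {:ok, ^target} -> {:ok, div(elapsed_us, 1_000)}
      {:ok, other} -> {:error, {:unexpected_reply, other}}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

A caller could compare the returned RTT against `:latency_threshold_ms` to decide whether the connectivity check passes.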

Health Status

Each node reports a health status:

Status	Meaning
`:healthy`	All checks passing, latency within thresholds
`:degraded`	Some checks failing or latency elevated
`:unhealthy`	Node unreachable or critical checks failing
`:unknown`	No health data available yet
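
One way to read the table is as a function from individual check results to a node status. The sketch below assumes a specific rule (a failed connectivity check means `:unhealthy`, any other failed check means `:degraded`); the library's real rule is not documented here, and `NodeStatusSketch.derive/1` is a hypothetical name.

```elixir
defmodule NodeStatusSketch do
  # Hypothetical rule, not taken from the library:
  #   no checks recorded         -> :unknown
  #   connectivity check failed  -> :unhealthy
  #   any other check failed     -> :degraded
  #   otherwise                  -> :healthy
  def derive(checks) when map_size(checks) == 0, do: :unknown

  def derive(checks) do
    # Collect the names of all checks whose status is :fail.
    failed = for {name, %{status: :fail}} <- checks, do: name

    cond do
      :connectivity in failed -> :unhealthy
      failed != [] -> :degraded
      true -> :healthy
    end
  end
end
```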

Configuration

Option	Default	Description
`:interval_ms`	`5_000`	Health check interval in milliseconds
`:latency_threshold_ms`	`100`	Max acceptable node RTT before `:degraded`
`:replication_lag_threshold_ms`	`500`	Max acceptable replication lag
`:error_rate_threshold`	`0.05`	Max acceptable error rate (5%)
`:partition_imbalance_threshold`	`0.3`	Max acceptable partition size variance (30%)
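
As a rough illustration of overriding these options, the sketch below merges caller overrides into the defaults from the table and builds a child spec tuple for a supervisor. The helper module `HealthMonitorConfigSketch` is hypothetical, and the rejection of unknown keys is this sketch's own behavior, not something the library is known to do.

```elixir
defmodule HealthMonitorConfigSketch do
  # Defaults copied from the configuration table above.
  @defaults [
    interval_ms: 5_000,
    latency_threshold_ms: 100,
    replication_lag_threshold_ms: 500,
    error_rate_threshold: 0.05,
    partition_imbalance_threshold: 0.3
  ]

  # Merges caller overrides into the defaults.
  # Rejecting unknown keys is a choice of this sketch only.
  def options(overrides \\ []) do
    unknown = Keyword.keys(overrides) -- Keyword.keys(@defaults)

    if unknown != [] do
      raise ArgumentError, "unknown options: #{inspect(unknown)}"
    end

    Keyword.merge(@defaults, overrides)
  end

  # Child spec tuple suitable for a Supervisor children list.
  def child(overrides \\ []) do
    {SuperCache.Cluster.HealthMonitor, options(overrides)}
  end
end

# In an application supervisor (requires SuperCache to be available):
# children = [HealthMonitorConfigSketch.child(interval_ms: 10_000)]
# Supervisor.start_link(children, strategy: :one_for_one)
```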

Usage

# Start with defaults
SuperCache.Cluster.HealthMonitor.start_link([])

# Get cluster health
SuperCache.Cluster.HealthMonitor.cluster_health()
# => %{status: :healthy, nodes: [...], checks: %{...}}

# Get node-specific health
SuperCache.Cluster.HealthMonitor.node_health(:"node1@127.0.0.1")
# => %{status: :healthy, latency_ms: 12, checks: %{...}}

Telemetry Events

The monitor emits the following events via :telemetry when the :telemetry library is available:

[:super_cache, :health, :check] — after each health check cycle
[:super_cache, :health, :alert] — when a health threshold is breached
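
A handler for these events might be attached as in the sketch below. The handler id, the module name `HealthTelemetrySketch`, and the idea of forwarding events to a process are assumptions for illustration. The shape of the measurements and metadata maps is not documented here, so the handler passes them through untouched.

```elixir
defmodule HealthTelemetrySketch do
  # Event names as listed in the documentation above.
  @events [
    [:super_cache, :health, :check],
    [:super_cache, :health, :alert]
  ]

  # Attaches one handler to both events and forwards them to notify_pid.
  # The handler id "health-telemetry-sketch" is arbitrary.
  def attach(notify_pid \\ self()) do
    if Code.ensure_loaded?(:telemetry) do
      :telemetry.attach_many(
        "health-telemetry-sketch",
        @events,
        &__MODULE__.handle_event/4,
        notify_pid
      )
    else
      {:error, :telemetry_not_available}
    end
  end

  # Standard :telemetry handler signature: event, measurements, metadata, config.
  def handle_event([:super_cache, :health, kind], measurements, metadata, notify_pid) do
    send(notify_pid, {:health_event, kind, measurements, metadata})
    :ok
  end
end
```

Forwarding to a process keeps the handler fast, which matters because :telemetry handlers run synchronously in the process that emits the event.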

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

cluster_health()

@spec cluster_health() :: map()

Returns the overall cluster health status.

Returns

%{
  status: :healthy | :degraded | :unhealthy,
  timestamp: integer,
  nodes: [
    %{
      node: node(),
      status: :healthy | :degraded | :unhealthy | :unknown,
      latency_ms: non_neg_integer | nil,
      last_check: integer,
      checks: %{...}
    }
  ],
  summary: %{
    total_nodes: non_neg_integer,
    healthy: non_neg_integer,
    degraded: non_neg_integer,
    unhealthy: non_neg_integer
  }
}
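
The `summary` map can be understood as a tally over the `nodes` list. The following sketch shows one way such a tally could be computed; `ClusterSummarySketch` is a hypothetical module, and how the library treats `:unknown` nodes in the summary is not documented, so here they count toward `total_nodes` only.

```elixir
defmodule ClusterSummarySketch do
  # Builds a summary map from a list of per-node health maps.
  # Nodes with status :unknown are included in :total_nodes only
  # (an assumption of this sketch).
  def summarize(nodes) do
    counts = Enum.frequencies_by(nodes, & &1.status)

    %{
      total_nodes: length(nodes),
      healthy: Map.get(counts, :healthy, 0),
      degraded: Map.get(counts, :degraded, 0),
      unhealthy: Map.get(counts, :unhealthy, 0)
    }
  end
end
```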

force_check()

@spec force_check() :: :ok

Forces an immediate health check cycle.

Useful for testing or when an external monitoring system needs fresh data.

node_health(node)

@spec node_health(node()) :: map()

Returns the health status for a specific node.

Returns

%{
  node: node(),
  status: :healthy | :degraded | :unhealthy | :unknown,
  latency_ms: non_neg_integer | nil,
  last_check: integer,
  checks: %{
    connectivity: %{status: :pass | :fail, latency_ms: non_neg_integer | nil},
    replication: %{status: :pass | :fail | :unknown, lag_ms: non_neg_integer | nil},
    partitions: %{status: :pass | :fail | :unknown, imbalance: float | nil},
    error_rate: %{status: :pass | :fail | :unknown, rate: float | nil}
  }
}
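
Given a map of this shape, a caller may want the names of the checks that failed. A minimal sketch, using a hypothetical helper module `FailingChecksSketch` and a hand-written sample map rather than real monitor output:

```elixir
defmodule FailingChecksSketch do
  # Returns the sorted names of all checks whose status is :fail.
  def failing(%{checks: checks}) do
    checks
    |> Enum.filter(fn {_name, result} -> result.status == :fail end)
    |> Enum.map(fn {name, _result} -> name end)
    |> Enum.sort()
  end
end

# Sample data in the documented shape (values are made up).
sample = %{
  node: :"node1@127.0.0.1",
  status: :degraded,
  latency_ms: 12,
  last_check: 0,
  checks: %{
    connectivity: %{status: :pass, latency_ms: 12},
    replication: %{status: :fail, lag_ms: 900},
    partitions: %{status: :fail, imbalance: 0.45},
    error_rate: %{status: :unknown, rate: nil}
  }
}

FailingChecksSketch.failing(sample)
# => [:partitions, :replication]
```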

partition_balance()

@spec partition_balance() :: map()

Returns partition balance statistics across all nodes.

Returns

%{
  total_records: non_neg_integer,
  partition_count: non_neg_integer,
  avg_records_per_partition: float,
  max_imbalance: float,
  partitions: [
    %{idx: non_neg_integer, record_count: non_neg_integer, primary: node()}
  ]
}
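
The statistics above can be sketched from a list of per-partition record counts. The definition of `max_imbalance` used here, the largest relative deviation of any partition from the mean, is an assumption; the library may compute it differently. `PartitionBalanceSketch` is a hypothetical module.

```elixir
defmodule PartitionBalanceSketch do
  # Computes balance statistics from a list of partition maps.
  # max_imbalance is assumed to be max(|count - avg| / avg) over all partitions.
  def stats(partitions) do
    counts = Enum.map(partitions, & &1.record_count)
    total = Enum.sum(counts)
    n = length(counts)
    avg = if n == 0, do: 0.0, else: total / n

    max_imbalance =
      if avg == 0.0 do
        # No partitions or no records: treat as perfectly balanced.
        0.0
      else
        counts
        |> Enum.map(fn count -> abs(count - avg) / avg end)
        |> Enum.max()
      end

    %{
      total_records: total,
      partition_count: n,
      avg_records_per_partition: avg,
      max_imbalance: max_imbalance,
      partitions: partitions
    }
  end
end
```

With counts of 100, 100, and 130, the mean is 110 and the largest deviation is 20/110, roughly 0.18, which would sit under the default `:partition_imbalance_threshold` of 0.3 under this definition.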

replication_lag(partition_idx)

@spec replication_lag(non_neg_integer()) :: map()

Returns the replication lag for a specific partition across all replicas.

Returns

%{
  partition_idx: non_neg_integer,
  primary: node(),
  replicas: [
    %{node: node(), lag_ms: non_neg_integer | nil, status: :synced | :lagging | :unknown}
  ]
}
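
The replica `status` values suggest a threshold comparison on `lag_ms`. The sketch below assumes that a replica is `:lagging` when its lag exceeds `:replication_lag_threshold_ms` and `:unknown` when no lag has been measured; this rule and the module `ReplicaLagSketch` are illustrative assumptions, not the library's documented behavior.

```elixir
defmodule ReplicaLagSketch do
  # Assumed rule: nil lag -> :unknown, lag within threshold -> :synced,
  # anything above the threshold -> :lagging.
  def classify(nil, _threshold_ms), do: :unknown
  def classify(lag_ms, threshold_ms) when lag_ms <= threshold_ms, do: :synced
  def classify(_lag_ms, _threshold_ms), do: :lagging

  # Adds a :status key to each replica map. The default of 500 mirrors
  # the documented :replication_lag_threshold_ms default.
  def annotate(replicas, threshold_ms \\ 500) do
    Enum.map(replicas, fn %{lag_ms: lag_ms} = replica ->
      Map.put(replica, :status, classify(lag_ms, threshold_ms))
    end)
  end
end
```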

start_link(opts \\ [])

@spec start_link(keyword()) :: :ignore | {:error, any()} | {:ok, pid()}

Starts the health monitor GenServer.

Options

Accepts the options listed under Configuration in the module documentation. Any option that is omitted uses its default.