SuperCache.Cluster.HealthMonitor (SuperCache v1.3.0)


Distributed cluster health monitor for SuperCache.

Continuously monitors node connectivity, replication lag, partition health, and operation latency across the cluster. Provides real-time health status and anomaly detection.

Health Checks

The monitor performs these checks on a configurable interval:

  • Node connectivity — pings each live node via :erpc and measures RTT
  • Replication lag — writes a probe record and measures time until it appears on all replicas
  • Partition balance — checks that record counts are evenly distributed
  • Error rate — tracks the ratio of failed operations per node
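The connectivity check can be pictured roughly as follows. This is a minimal sketch and not SuperCache's actual implementation: the module name `ConnectivityProbeSketch` and the one-second default timeout are assumptions, and the probe simply times one `:erpc.call/5` round trip to the target node.

```elixir
defmodule ConnectivityProbeSketch do
  @moduledoc false

  # Hypothetical helper: times one :erpc round trip to `target`.
  # Returns {:ok, rtt_ms} or {:error, reason}.
  def ping(target, timeout_ms \\ 1_000) do
    started = System.monotonic_time(:millisecond)

    try do
      # :erlang.node/0 is a cheap remote call that echoes the node name back.
      ^target = :erpc.call(target, :erlang, :node, [], timeout_ms)
      {:ok, System.monotonic_time(:millisecond) - started}
    catch
      # :erpc reports timeouts and lost connections as {:erpc, reason} errors.
      :error, {:erpc, reason} -> {:error, reason}
    end
  end
end
```

A caller could compare the returned RTT against `:latency_threshold_ms` to decide whether the node should be reported as `:degraded`.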

Health Status

Each node reports a health status:

Status      Meaning
:healthy    All checks passing, latency within thresholds
:degraded   Some checks failing or latency elevated
:unhealthy  Node unreachable or critical checks failing
:unknown    No health data available yet
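One way to read these statuses is as a function of the individual check results. The sketch below is illustrative only: `HealthStatusSketch` is a hypothetical module, and it assumes that a failed connectivity check is the only "critical" failure and that checks still reporting `:unknown` do not count as failures. The real monitor may weigh checks differently.

```elixir
defmodule HealthStatusSketch do
  @moduledoc false

  # No check data collected yet.
  def classify(checks) when checks == %{}, do: :unknown

  # Assumption: an unreachable node is the critical case.
  def classify(%{connectivity: %{status: :fail}}), do: :unhealthy

  # Any other failing check downgrades the node to :degraded.
  def classify(checks) when is_map(checks) do
    failed? = Enum.any?(checks, fn {_name, result} -> result.status == :fail end)
    if failed?, do: :degraded, else: :healthy
  end
end
```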

Configuration

Option                          Default  Description
:interval_ms                    5_000    Health check interval in milliseconds
:latency_threshold_ms           100      Max acceptable node RTT before :degraded
:replication_lag_threshold_ms   500      Max acceptable replication lag
:error_rate_threshold           0.05     Max acceptable error rate (5%)
:partition_imbalance_threshold  0.3      Max acceptable partition size variance (30%)
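As an illustration, these options could be passed through a child spec when the monitor runs under a supervisor. `MyApp.HealthSupervisionSketch` and the chosen values are hypothetical; only the option names come from the table above, and the sketch assumes the child spec forwards its argument to `start_link/1` unchanged.

```elixir
defmodule MyApp.HealthSupervisionSketch do
  @moduledoc false

  # Child list with a slower check interval and stricter error-rate threshold.
  # Options that are left out keep their documented defaults.
  def children do
    [
      {SuperCache.Cluster.HealthMonitor,
       interval_ms: 10_000,
       latency_threshold_ms: 250,
       error_rate_threshold: 0.02}
    ]
  end

  # Starts a supervisor that owns the health monitor.
  def start_link do
    Supervisor.start_link(children(), strategy: :one_for_one)
  end
end
```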

Usage

# Start with defaults
SuperCache.Cluster.HealthMonitor.start_link([])

# Get cluster health
SuperCache.Cluster.HealthMonitor.cluster_health()
# => %{status: :healthy, timestamp: ..., nodes: [...], summary: %{...}}

# Get node-specific health
SuperCache.Cluster.HealthMonitor.node_health(:"node1@127.0.0.1")
# => %{status: :healthy, latency_ms: 12, checks: %{...}}

Telemetry Events

The monitor emits events via :telemetry (if available):

  • [:super_cache, :health, :check] — after each health check cycle
  • [:super_cache, :health, :alert] — when a health threshold is breached
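A handler for these events might be attached as sketched below. `HealthTelemetrySketch` and the handler id are hypothetical, and because the measurements and metadata carried by each event are not documented here, the handler only inspects and logs them. `:telemetry.attach_many/4` comes from the `:telemetry` package, which must be present at runtime.

```elixir
defmodule HealthTelemetrySketch do
  @moduledoc false
  require Logger

  @events [
    [:super_cache, :health, :check],
    [:super_cache, :health, :alert]
  ]

  # Registers one handler for both health events.
  def attach do
    :telemetry.attach_many("health-monitor-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Threshold breaches are logged at warning level.
  def handle_event([:super_cache, :health, :alert], measurements, metadata, _config) do
    Logger.warning("health alert: #{inspect(measurements)} #{inspect(metadata)}")
  end

  # Routine check cycles are logged at debug level.
  def handle_event([:super_cache, :health, :check], measurements, _metadata, _config) do
    Logger.debug("health check cycle: #{inspect(measurements)}")
  end
end
```

Passing the handler as a remote capture (`&__MODULE__.handle_event/4`) rather than an anonymous function follows the `:telemetry` recommendation for handler functions.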

Summary

Functions

  • child_spec/1: Returns a specification to start this module under a supervisor.
  • cluster_health/0: Returns the overall cluster health status.
  • force_check/0: Forces an immediate health check cycle.
  • node_health/1: Returns the health status for a specific node.
  • partition_balance/0: Returns partition balance statistics across all nodes.
  • replication_lag/1: Returns the replication lag for a specific partition across all replicas.
  • start_link/1: Starts the health monitor GenServer.

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

cluster_health()

@spec cluster_health() :: map()

Returns the overall cluster health status.

Returns

%{
  status: :healthy | :degraded | :unhealthy,
  timestamp: integer,
  nodes: [
    %{
      node: node(),
      status: :healthy | :degraded | :unhealthy | :unknown,
      latency_ms: non_neg_integer | nil,
      last_check: integer,
      checks: %{...}
    }
  ],
  summary: %{
    total_nodes: non_neg_integer,
    healthy: non_neg_integer,
    degraded: non_neg_integer,
    unhealthy: non_neg_integer
  }
}
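A caller might reduce this map to the nodes that need attention. The sketch below relies only on the documented shape of the return value; `ClusterHealthReportSketch` is a hypothetical helper, not part of SuperCache.

```elixir
defmodule ClusterHealthReportSketch do
  @moduledoc false

  # Returns {node, status} pairs for every node that is not :healthy.
  def problem_nodes(%{nodes: nodes}) do
    for %{node: name, status: status} <- nodes, status != :healthy, do: {name, status}
  end
end
```

In a running cluster the argument would be the result of `SuperCache.Cluster.HealthMonitor.cluster_health()`.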

force_check()

@spec force_check() :: :ok

Forces an immediate health check cycle.

Useful for testing or when an external monitoring system needs fresh data.
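In a test suite, a forced check could be combined with polling until the cluster reports `:healthy`. The helper below is a hypothetical sketch: it takes the two functions as arguments so it makes no assumption about whether `force_check/0` completes the cycle before returning.

```elixir
defmodule HealthAwaitSketch do
  @moduledoc false

  # `check_fun` triggers a cycle and `health_fun` reads the result. In a real
  # cluster, pass &SuperCache.Cluster.HealthMonitor.force_check/0 and
  # &SuperCache.Cluster.HealthMonitor.cluster_health/0.
  def await_healthy(check_fun, health_fun, attempts \\ 5, delay_ms \\ 100)

  # Out of attempts: give up.
  def await_healthy(_check_fun, _health_fun, 0, _delay_ms), do: {:error, :timeout}

  def await_healthy(check_fun, health_fun, attempts, delay_ms) do
    :ok = check_fun.()

    case health_fun.() do
      %{status: :healthy} ->
        :ok

      _other ->
        # Not healthy yet: wait, then force another cycle.
        Process.sleep(delay_ms)
        await_healthy(check_fun, health_fun, attempts - 1, delay_ms)
    end
  end
end
```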

node_health(node)

@spec node_health(node()) :: map()

Returns the health status for a specific node.

Returns

%{
  node: node(),
  status: :healthy | :degraded | :unhealthy | :unknown,
  latency_ms: non_neg_integer | nil,
  last_check: integer,
  checks: %{
    connectivity: %{status: :pass | :fail, latency_ms: non_neg_integer | nil},
    replication: %{status: :pass | :fail | :unknown, lag_ms: non_neg_integer | nil},
    partitions: %{status: :pass | :fail | :unknown, imbalance: float | nil},
    error_rate: %{status: :pass | :fail | :unknown, rate: float | nil}
  }
}
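The `checks` map makes it straightforward to list which checks are failing for a node. The following sketch depends only on the documented shape; `NodeHealthReportSketch` is a hypothetical helper.

```elixir
defmodule NodeHealthReportSketch do
  @moduledoc false

  # Returns the sorted names of all checks whose status is :fail.
  def failing_checks(%{checks: checks}) do
    checks
    |> Enum.filter(fn {_name, result} -> result.status == :fail end)
    |> Enum.map(fn {name, _result} -> name end)
    |> Enum.sort()
  end
end
```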

partition_balance()

@spec partition_balance() :: map()

Returns partition balance statistics across all nodes.

Returns

%{
  total_records: non_neg_integer,
  partition_count: non_neg_integer,
  avg_records_per_partition: float,
  max_imbalance: float,
  partitions: [
    %{idx: non_neg_integer, record_count: non_neg_integer, primary: node()}
  ]
}
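The exact formula behind `max_imbalance` is not stated here. One plausible reading, shown below purely as an assumption, is the largest relative deviation of any partition's record count from the average. `PartitionBalanceSketch` is a hypothetical module.

```elixir
defmodule PartitionBalanceSketch do
  @moduledoc false

  # Builds a map shaped like the documented return value from a non-empty
  # list of %{idx: ..., record_count: ..., primary: ...} entries.
  def summarize(partitions) when partitions != [] do
    counts = Enum.map(partitions, & &1.record_count)
    total = Enum.sum(counts)
    avg = total / length(counts)

    # Assumed definition: max |count - avg| / avg, or 0.0 for an empty cache.
    max_imbalance =
      if avg == 0 do
        0.0
      else
        counts |> Enum.map(&(abs(&1 - avg) / avg)) |> Enum.max()
      end

    %{
      total_records: total,
      partition_count: length(counts),
      avg_records_per_partition: avg,
      max_imbalance: max_imbalance,
      partitions: partitions
    }
  end
end
```

Under this reading, partitions holding 100, 100, 130, and 70 records sit exactly at the default `:partition_imbalance_threshold` of 0.3.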

replication_lag(partition_idx)

@spec replication_lag(non_neg_integer()) :: map()

Returns the replication lag for a specific partition across all replicas.

Returns

%{
  partition_idx: non_neg_integer,
  primary: node(),
  replicas: [
    %{node: node(), lag_ms: non_neg_integer | nil, status: :synced | :lagging | :unknown}
  ]
}
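A caller could pick out replicas whose lag exceeds a threshold, as in the sketch below. `ReplicationLagSketch` is a hypothetical helper; its default of 500 mirrors the documented default for `:replication_lag_threshold_ms`, and replicas with a `nil` lag are skipped rather than treated as lagging.

```elixir
defmodule ReplicationLagSketch do
  @moduledoc false

  # Returns {node, lag_ms} pairs for replicas above the threshold.
  def lagging(%{replicas: replicas}, threshold_ms \\ 500) do
    for %{node: name, lag_ms: lag} <- replicas,
        is_integer(lag),
        lag > threshold_ms,
        do: {name, lag}
  end
end
```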

start_link(opts \\ [])

@spec start_link(keyword()) :: :ignore | {:error, any()} | {:ok, pid()}

Starts the health monitor GenServer.

Options

See the Configuration section of the module documentation for the available options and their defaults.