# `SuperCache.Cluster.HealthMonitor`
[🔗](https://github.com/ohhi-vn/super_cache/blob/main/lib/cluster/health_monitor.ex#L1)

Distributed cluster health monitor for SuperCache.

Continuously monitors node connectivity, replication lag, partition health,
and operation latency across the cluster. Provides real-time health status
and anomaly detection.

## Health Checks

The monitor performs these checks on a configurable interval:

- **Node connectivity** — pings each live node via `:erpc` and measures RTT
- **Replication lag** — writes a probe record and measures time until it
  appears on all replicas
- **Partition balance** — checks that record counts are evenly distributed
- **Error rate** — tracks the ratio of failed operations per node
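
The connectivity check can be sketched as a timed `:erpc` call. This is a minimal illustration under assumptions, not the monitor's actual code: `RttProbe` and `ping/2` are hypothetical names, and the probed function (`:erlang.node/0`) is chosen only because it is cheap and always available.

```elixir
defmodule RttProbe do
  # Hypothetical helper, not part of SuperCache: calls a trivial function
  # on `node` over :erpc and times the round trip.
  @spec ping(node(), timeout()) :: {:ok, float()} | {:error, term()}
  def ping(node, timeout_ms \\ 1_000) do
    {micros, result} =
      :timer.tc(fn ->
        try do
          {:ok, :erpc.call(node, :erlang, :node, [], timeout_ms)}
        catch
          # :erpc signals timeouts and lost connections as {:erpc, reason} errors
          :error, {:erpc, reason} -> {:error, reason}
        end
      end)

    case result do
      # The remote side should answer with its own node name
      {:ok, ^node} -> {:ok, micros / 1000}
      {:ok, other} -> {:error, {:unexpected_reply, other}}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Comparing the returned RTT (in milliseconds) against `:latency_threshold_ms` would then decide whether the connectivity check passes.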

## Health Status

Each node reports a health status:

| Status | Meaning |
|--------|---------|
| `:healthy` | All checks passing, latency within thresholds |
| `:degraded` | Some checks failing or latency elevated |
| `:unhealthy` | Node unreachable or critical checks failing |
| `:unknown` | No health data available yet |
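
One plausible way to roll individual check results up into a node status is sketched below. The rule itself is an assumption made for illustration (a failed connectivity check counts as critical, any other failure as degradation); `StatusSketch.derive/1` is a hypothetical name and the real monitor may weigh checks differently, for example by also considering elevated latency.

```elixir
defmodule StatusSketch do
  # Hypothetical roll-up rule, not taken from the SuperCache source.
  @spec derive(map()) :: :healthy | :degraded | :unhealthy | :unknown
  def derive(checks) when map_size(checks) == 0, do: :unknown

  def derive(checks) do
    statuses = Map.new(checks, fn {name, %{status: status}} -> {name, status} end)

    cond do
      # An unreachable node is treated as the critical case
      statuses[:connectivity] == :fail -> :unhealthy
      Enum.any?(statuses, fn {_name, status} -> status == :fail end) -> :degraded
      Enum.all?(statuses, fn {_name, status} -> status == :unknown end) -> :unknown
      true -> :healthy
    end
  end
end
```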

## Configuration

| Option | Default | Description |
|--------|---------|-------------|
| `:interval_ms` | `5_000` | Health check interval in milliseconds |
| `:latency_threshold_ms` | `100` | Max acceptable node RTT before `:degraded` |
| `:replication_lag_threshold_ms` | `500` | Max acceptable replication lag |
| `:error_rate_threshold` | `0.05` | Max acceptable error rate (5%) |
| `:partition_imbalance_threshold` | `0.3` | Max acceptable partition size imbalance (30%) |
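
As an illustration of how overrides might be combined with the defaults in the table, the sketch below validates a keyword list with `Keyword.validate!/2` (Elixir 1.13+). `HealthConfigSketch` is a hypothetical helper; whether the monitor itself rejects unknown keys is not stated in this documentation.

```elixir
defmodule HealthConfigSketch do
  # Defaults copied from the configuration table above.
  @defaults [
    interval_ms: 5_000,
    latency_threshold_ms: 100,
    replication_lag_threshold_ms: 500,
    error_rate_threshold: 0.05,
    partition_imbalance_threshold: 0.3
  ]

  # Fills in defaults and raises ArgumentError on unknown keys.
  def resolve(opts), do: Keyword.validate!(opts, @defaults)
end

# Check less often and tolerate slower links than the defaults allow.
opts = HealthConfigSketch.resolve(interval_ms: 10_000, latency_threshold_ms: 250)
```

The resulting keyword list could then be passed to `start_link/1`, or supplied in a supervisor child spec such as `{SuperCache.Cluster.HealthMonitor, opts}`.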

## Usage

    # Start with defaults
    SuperCache.Cluster.HealthMonitor.start_link([])

    # Get cluster health
    SuperCache.Cluster.HealthMonitor.cluster_health()
    # => %{status: :healthy, timestamp: ..., nodes: [...], summary: %{...}}

    # Get node-specific health
    SuperCache.Cluster.HealthMonitor.node_health(:"node1@127.0.0.1")
    # => %{status: :healthy, latency_ms: 12, checks: %{...}}

## Telemetry Events

The monitor emits events via `:telemetry` (if available):

- `[:super_cache, :health, :check]` — after each health check cycle
- `[:super_cache, :health, :alert]` — when a health threshold is breached
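
A handler for these events could be attached as sketched below. This assumes the `:telemetry` package is a dependency of the host application; `HealthTelemetrySketch` and the handler id are hypothetical, and because the measurement and metadata keys are not documented here, the handler only inspects them.

```elixir
defmodule HealthTelemetrySketch do
  require Logger

  @events [
    [:super_cache, :health, :check],
    [:super_cache, :health, :alert]
  ]

  # Attaches one handler to both events. Requires the :telemetry package.
  def attach do
    :telemetry.attach_many("health-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Alerts are logged as warnings, routine check cycles at debug level.
  def handle_event([:super_cache, :health, :alert], measurements, metadata, _config) do
    Logger.warning("health alert: #{inspect(measurements)} #{inspect(metadata)}")
  end

  def handle_event([:super_cache, :health, :check], measurements, _metadata, _config) do
    Logger.debug("health check: #{inspect(measurements)}")
  end
end
```

Calling `HealthTelemetrySketch.attach()` once during application start would be enough; the external function capture avoids the performance warning that `:telemetry` emits for anonymous handlers.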

# `child_spec`

Returns a specification to start this module under a supervisor.

See `Supervisor`.

# `cluster_health`

```elixir
@spec cluster_health() :: map()
```

Returns the overall cluster health status.

## Returns

    %{
      status: :healthy | :degraded | :unhealthy,
      timestamp: integer,
      nodes: [
        %{
          node: node(),
          status: :healthy | :degraded | :unhealthy | :unknown,
          latency_ms: non_neg_integer | nil,
          last_check: integer,
          checks: %{...}
        }
      ],
      summary: %{
        total_nodes: non_neg_integer,
        healthy: non_neg_integer,
        degraded: non_neg_integer,
        unhealthy: non_neg_integer
      }
    }
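
A caller might use this map to list the nodes that need attention. The sketch below relies only on the `:nodes` shape shown above; `ClusterReportSketch` and the sample data are illustrative, not output from a real cluster.

```elixir
defmodule ClusterReportSketch do
  # Returns {node, status} pairs for every node that is not :healthy.
  def problem_nodes(%{nodes: nodes}) do
    for %{node: node, status: status} <- nodes, status != :healthy, do: {node, status}
  end
end

# Hand-written sample in the documented shape (timestamp is a placeholder).
sample = %{
  status: :degraded,
  timestamp: 0,
  nodes: [
    %{node: :"node1@127.0.0.1", status: :healthy},
    %{node: :"node2@127.0.0.1", status: :degraded},
    %{node: :"node3@127.0.0.1", status: :unknown}
  ],
  summary: %{total_nodes: 3, healthy: 1, degraded: 1, unhealthy: 0}
}

problems = ClusterReportSketch.problem_nodes(sample)
```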

# `force_check`

```elixir
@spec force_check() :: :ok
```

Forces an immediate health check cycle.

Useful for testing or when an external monitoring system needs fresh data.

# `node_health`

```elixir
@spec node_health(node()) :: map()
```

Returns the health status for a specific node.

## Returns

    %{
      node: node(),
      status: :healthy | :degraded | :unhealthy | :unknown,
      latency_ms: non_neg_integer | nil,
      last_check: integer,
      checks: %{
        connectivity: %{status: :pass | :fail, latency_ms: non_neg_integer | nil},
        replication: %{status: :pass | :fail | :unknown, lag_ms: non_neg_integer | nil},
        partitions: %{status: :pass | :fail | :unknown, imbalance: float | nil},
        error_rate: %{status: :pass | :fail | :unknown, rate: float | nil}
      }
    }
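
To see at a glance which checks are failing for a node, the `:checks` map can be filtered as in the following sketch. `CheckReportSketch` is a hypothetical helper and the sample values are made up for illustration.

```elixir
defmodule CheckReportSketch do
  # Returns the sorted names of all checks whose status is :fail.
  def failed_checks(%{checks: checks}) do
    checks
    |> Enum.filter(fn {_name, result} -> result.status == :fail end)
    |> Enum.map(fn {name, _result} -> name end)
    |> Enum.sort()
  end
end

# Hand-written sample in the documented shape.
sample = %{
  node: :"node1@127.0.0.1",
  status: :degraded,
  checks: %{
    connectivity: %{status: :pass, latency_ms: 12},
    replication: %{status: :fail, lag_ms: 900},
    partitions: %{status: :unknown, imbalance: nil},
    error_rate: %{status: :fail, rate: 0.08}
  }
}

failed = CheckReportSketch.failed_checks(sample)
```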

# `partition_balance`

```elixir
@spec partition_balance() :: map()
```

Returns partition balance statistics across all nodes.

## Returns

    %{
      total_records: non_neg_integer,
      partition_count: non_neg_integer,
      avg_records_per_partition: float,
      max_imbalance: float,
      partitions: [
        %{idx: non_neg_integer, record_count: non_neg_integer, primary: node()}
      ]
    }
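
The exact formula behind `max_imbalance` is not specified in this documentation. One reasonable reading, assumed here for illustration only, is the largest relative deviation of any partition's record count from the average; `BalanceSketch` is a hypothetical module.

```elixir
defmodule BalanceSketch do
  # Assumed definition: max over partitions of |count - avg| / avg.
  @spec max_imbalance([non_neg_integer()]) :: float()
  def max_imbalance([]), do: 0.0

  def max_imbalance(counts) do
    avg = Enum.sum(counts) / length(counts)

    if avg == 0 do
      # An empty cache is treated as perfectly balanced
      0.0
    else
      counts
      |> Enum.map(fn count -> abs(count - avg) / avg end)
      |> Enum.max()
    end
  end
end
```

Under this reading, counts of 90, 100 and 110 give an imbalance of about 0.1, well inside the default `:partition_imbalance_threshold` of `0.3`, while counts of 100, 100 and 160 give about 0.33 and would breach it.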

# `replication_lag`

```elixir
@spec replication_lag(non_neg_integer()) :: map()
```

Returns the replication lag for a specific partition across all replicas.

## Returns

    %{
      partition_idx: non_neg_integer,
      primary: node(),
      replicas: [
        %{node: node(), lag_ms: non_neg_integer | nil, status: :synced | :lagging | :unknown}
      ]
    }
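
The replica `status` values suggest a simple threshold comparison. The sketch below is an assumption about how `lag_ms` might map to a status, using `:replication_lag_threshold_ms` as the cut-off; `LagSketch.classify/2` is a hypothetical name.

```elixir
defmodule LagSketch do
  # A missing measurement cannot be judged either way.
  def classify(nil, _threshold_ms), do: :unknown
  def classify(lag_ms, threshold_ms) when lag_ms <= threshold_ms, do: :synced
  def classify(_lag_ms, _threshold_ms), do: :lagging
end
```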

# `start_link`

```elixir
@spec start_link(keyword()) :: :ignore | {:error, any()} | {:ok, pid()}
```

Starts the health monitor GenServer.

## Options

See module documentation for configuration options.

