Macula Monitoring Guide

Complete guide to monitoring Macula mesh deployments

Audience: Operators | Last Updated: 2025-11-28


Table of Contents

  1. Overview
  2. Key Metrics
  3. Log Monitoring
  4. Health Checks
  5. Alerting
  6. Dashboards
  7. Capacity Planning

Overview

Macula exposes metrics and log messages for production monitoring. This guide covers what to monitor, how to interpret the metrics, and when to alert.

Monitoring Philosophy

                    
Observability Layers:

| Metrics (Numerical) | Logs (Textual)  | Traces (Distributed) |
|---------------------|-----------------|----------------------|
| Pool sizes          | Cleanup events  | RPC call flow        |
| Latencies           | Rejections      | PubSub fanout        |
| Throughput          | Errors/Warns    | DHT queries          |
| Memory              | State changes   |                      |

Key Metrics

Memory Management Metrics

These metrics indicate the health of Macula's bounded resource pools.

| Metric | Module | Threshold | Action |
|--------|--------|-----------|--------|
| Connection Pool Size | macula_gateway_mesh | > 800 (of 1,000) | Scale horizontally |
| Client Count | macula_gateway_client_manager | > 8,000 (of 10,000) | Scale or rate-limit |
| Service Registry Size | macula_service_registry | Increasing trend | Check for stale services |
| Pending RPC Calls | macula_rpc_handler | > 100 sustained | Check handler latency |
| Pending DHT Queries | macula_rpc_handler | > 50 sustained | Check DHT health |

Querying Metrics

Connection Pool (via Erlang shell)

%% Get connection pool stats
macula_gateway_mesh:get_stats().
%% Returns: #{connections => 245, max => 1000, lru_evictions => 12}

%% Check if pool is near capacity
case macula_gateway_mesh:get_stats() of
    #{connections := C, max := Max} when C > Max * 0.8 ->
        io:format("WARNING: Pool at ~p% capacity~n", [C * 100 div Max]);
    _ ->
        ok
end.

Client Count

%% Get client manager stats
macula_gateway_client_manager:get_stats().
%% Returns: #{clients => 1234, max => 10000, streams => 3456}

%% Check rejection rate (if tracked)
macula_gateway_client_manager:get_rejection_count().

Service Registry

%% Get service count
macula_service_registry:count_services().
%% Returns: 45

%% List all registered services (debug only)
macula_service_registry:list_services().
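
The stats calls above can be combined into a lightweight poller. The sketch below is illustrative, not part of Macula: it assumes only the macula_gateway_mesh:get_stats/0 API shown above and logs a warning when the connection pool crosses 80% of capacity.

%% Minimal polling sketch (illustrative module name):
-module(macula_metrics_poller).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(INTERVAL_MS, 30000).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    erlang:send_after(?INTERVAL_MS, self(), poll),
    {ok, #{}}.

%% Check the pool on each tick, then reschedule.
handle_info(poll, State) ->
    case macula_gateway_mesh:get_stats() of
        #{connections := C, max := Max} when C > Max * 0.8 ->
            logger:warning("Connection pool at ~p/~p (~p%)",
                           [C, Max, C * 100 div Max]);
        _ ->
            ok
    end,
    erlang:send_after(?INTERVAL_MS, self(), poll),
    {noreply, State};
handle_info(_Msg, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.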

Performance Metrics

| Metric | Target | Warning | Critical |
|--------|--------|---------|----------|
| RPC Latency (p50) | < 10ms | > 50ms | > 200ms |
| RPC Latency (p99) | < 50ms | > 200ms | > 1,000ms |
| PubSub Throughput | > 1,000 msg/s | < 500 msg/s | < 100 msg/s |
| DHT Query Time | < 100ms | > 200ms | > 500ms |
| Cache Hit Rate | > 90% | < 80% | < 50% |
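
To spot-check these targets from a shell, the standard timer:tc/3 can time a single call. The macula_rpc:call/2 entry point below is assumed for illustration; substitute the real RPC API.

%% Time one RPC round trip and compare against the warning thresholds.
probe_rpc_latency(Procedure, Args) ->
    {Micros, Result} = timer:tc(macula_rpc, call, [Procedure, Args]),
    Millis = Micros div 1000,
    if
        Millis > 200 -> logger:warning("RPC ~s: ~pms (p99 warning)", [Procedure, Millis]);
        Millis > 50  -> logger:notice("RPC ~s: ~pms (p50 warning)", [Procedure, Millis]);
        true         -> ok
    end,
    Result.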

Log Monitoring

Log Levels

Macula uses standard OTP log levels:

| Level | Usage | Action Required |
|-------|-------|-----------------|
| debug | Detailed operational info | None (high volume) |
| info | Normal operations | None |
| notice | Significant events | Review if unusual |
| warning | Potential issues | Investigate |
| error | Operation failures | Fix required |
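
Verbosity can be adjusted at runtime with the standard OTP logger API (stock OTP, nothing Macula-specific):

%% Raise or lower the primary log level at runtime
logger:set_primary_config(level, info).

%% Or persistently in sys.config:
%% [{kernel, [{logger_level, info}]}].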

Critical Log Patterns

Memory Management (Normal Operation)

%% Service cleanup - runs every 60 seconds
[info] Service cleanup: removed 3 expired service(s)
[debug] Service cleanup: no expired services

%% Connection pool LRU eviction
[debug] Evicted LRU connection: NodeId=abc123

%% Stream cleanup on disconnect
[debug] Cleaned up streams for disconnected client: NodeId=xyz789

Warning Signs

%% Client rejection - monitor frequency
[warning] Client connection rejected: max_clients_reached

%% RPC timeout
[warning] RPC call timed out: Procedure=energy.home.get, CallId=call-123

%% DHT query failure
[warning] DHT query failed: Key=energy.home.get, Reason=timeout

Errors Requiring Action

%% Gateway crash
[error] Gateway process crashed: Reason={badmatch, undefined}

%% QUIC connection failure
[error] QUIC handshake failed: Endpoint=192.168.1.100:4433, Reason=tls_alert

%% Memory pressure
[error] Memory threshold exceeded: Current=85%, Threshold=80%

Log Aggregation Queries

Grafana Loki (LogQL)

# Client rejections in the last hour
count_over_time({app="macula"} |= "max_clients_reached" [1h])

# RPC timeouts by procedure (5-minute windows)
sum by (proc) (count_over_time({app="macula"} |= "RPC call timed out" | regexp "Procedure=(?P<proc>[^,]+)" [5m]))

# Service cleanup activity
rate({app="macula"} |= "Service cleanup" [5m])
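
These queries parse fields out of free-form message text. If you control the emitting code, attaching the fields as logger metadata keeps them machine-readable. This uses the standard OTP logger API; the field names are illustrative:

%% The metadata map travels with the event; a structured formatter can
%% expose it as labels, avoiding regex extraction downstream.
logger:warning("RPC call timed out",
               #{procedure => <<"energy.home.get">>, call_id => <<"call-123">>}).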

Health Checks

HTTP Health Endpoint

If using macula_gateway with HTTP enabled:

# Basic health check
curl http://localhost:4433/health

# Detailed status
curl http://localhost:4433/status

Erlang Health Functions

%% Check gateway is alive
is_pid(whereis(macula_gateway)).

%% Check all supervisors
[{Name, is_pid(whereis(Name))} || Name <- [
    macula_sup,
    macula_gateway_sup,
    macula_connection_sup
]].

%% Check DHT connectivity
macula_dht:ping().
%% Returns: pong | {error, Reason}
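
The individual checks compose into a single snapshot, convenient for a remote shell or a /status handler. A minimal sketch using only the calls above (the helper name is illustrative):

%% Aggregate health snapshot (illustrative helper, not a Macula API).
health_report() ->
    Supervisors = [macula_sup, macula_gateway_sup, macula_connection_sup],
    #{gateway_alive => is_pid(whereis(macula_gateway)),
      supervisors   => [{Name, is_pid(whereis(Name))} || Name <- Supervisors],
      dht           => macula_dht:ping()}.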

Kubernetes Probes

# Liveness probe - is the process running?
# halt(1) forces a non-zero exit so the probe fails when the check is
# false; a bare boolean expression would exit 0 either way.
livenessProbe:
  exec:
    command:
      - /opt/macula/bin/macula
      - eval
      - "case is_pid(whereis(macula_gateway)) of true -> ok; false -> halt(1) end."
  initialDelaySeconds: 30
  periodSeconds: 10

# Readiness probe - is it accepting traffic?
readinessProbe:
  exec:
    command:
      - /opt/macula/bin/macula
      - eval
      - "case macula_gateway:is_ready() of true -> ok; _ -> halt(1) end."
  initialDelaySeconds: 10
  periodSeconds: 5

Alerting

Alert Priority Matrix

| Severity | Response Time | Examples |
|----------|---------------|----------|
| P1 (Critical) | < 15 min | Gateway down, OOM, all clients disconnected |
| P2 (High) | < 1 hour | 80%+ capacity, sustained errors |
| P3 (Medium) | < 4 hours | Elevated latency, cache miss rate |
| P4 (Low) | Next business day | Warnings, cleanup anomalies |

Critical (P1)

- alert: MaculaGatewayDown
  expr: up{job="macula"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Macula gateway is down"

- alert: MaculaOOMRisk
  expr: process_resident_memory_bytes{job="macula"} > 8e9  # 8GB
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Macula memory exceeds 8GB - OOM risk"

High (P2)

- alert: MaculaClientPoolNearCapacity
  expr: macula_peers_current / macula_peers_max > 0.8
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Client pool at {{ $value | humanizePercentage }} capacity"

- alert: MaculaHighRejectionRate
  expr: rate(macula_peer_rejections_total[5m]) > 10
  for: 5m
  labels:
    severity: high
  annotations:
    summary: "High client rejection rate: {{ $value }}/sec"

Medium (P3)

- alert: MaculaElevatedLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(macula_rpc_latency_bucket[5m]))) > 0.5
  for: 15m
  labels:
    severity: medium
  annotations:
    summary: "RPC p99 latency elevated: {{ $value }}s"

- alert: MaculaLowCacheHitRate
  expr: macula_cache_hits / (macula_cache_hits + macula_cache_misses) < 0.8
  for: 30m
  labels:
    severity: medium
  annotations:
    summary: "Cache hit rate below 80%"

Dashboards

Essential Dashboard Panels

1. Resource Utilization


Gauge panels for Connection Pool (e.g. 245/1,000 = 24%), Client Pool (e.g. 6,234/10,000), and Memory (e.g. 2.1 GB).

2. Throughput


Time-series panels for RPC Calls/sec (y-axis roughly 500-2,500) and PubSub Events/sec (y-axis roughly 2,000-5,000) over a 24-hour window.

3. Latency Distribution


Histogram panel for RPC latency in ms (example values):

  p50:  8 ms
  p90:  25 ms
  p99:  45 ms
  max:  120 ms

Grafana Dashboard JSON

A basic dashboard template:

{
  "title": "Macula Overview",
  "panels": [
    {
      "title": "Client Pool Utilization",
      "type": "gauge",
      "targets": [{"expr": "macula_peers_current / macula_peers_max * 100"}],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 70},
              {"color": "red", "value": 90}
            ]
          }
        }
      }
    },
    {
      "title": "RPC Throughput",
      "type": "graph",
      "targets": [{"expr": "rate(macula_rpc_calls_total[5m])"}]
    },
    {
      "title": "Memory Usage",
      "type": "graph",
      "targets": [{"expr": "process_resident_memory_bytes{job=\"macula\"}"}]
    }
  ]
}

Capacity Planning

Resource Sizing

| Deployment Size | Clients | Memory | CPU | Notes |
|-----------------|---------|--------|-----|-------|
| Small (Dev) | < 100 | 512MB | 1 core | Single node |
| Medium | 100-1,000 | 2GB | 2 cores | Typical production |
| Large | 1,000-10,000 | 8GB | 4 cores | High availability |
| XL | 10,000+ | 16GB+ | 8+ cores | Multi-gateway |
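
To see where a node sits relative to these budgets, compare BEAM memory against the tier's allowance. erlang:memory/1 is standard OTP; the 2 GB budget below assumes the Medium tier:

%% Percentage of a 2 GB budget currently used by the VM (shell session).
Budget = 2 * 1024 * 1024 * 1024.
Used = erlang:memory(total).
io:format("Memory: ~.1f% of budget~n", [Used / Budget * 100]).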

Scaling Triggers

| Metric | Threshold | Action |
|--------|-----------|--------|
| Client pool | > 80% for 1 hour | Add gateway node |
| Memory | > 70% sustained | Increase memory or add node |
| CPU | > 80% sustained | Add CPU or optimize handlers |
| RPC latency p99 | > 200ms sustained | Profile handlers, check DHT |
| Connection churn | > 1,000/min | Check client stability |
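
The first two triggers can be evaluated directly from the stats calls shown earlier. A sketch mirroring the table's thresholds (the helper and action names are illustrative):

%% Return the scaling actions whose trigger currently fires.
scaling_advice() ->
    #{connections := Conns, max := MaxConns} = macula_gateway_mesh:get_stats(),
    #{clients := Clients, max := MaxClients} = macula_gateway_client_manager:get_stats(),
    Triggers = [{add_gateway_node,  Clients > MaxClients * 0.8},
                {scale_connections, Conns > MaxConns * 0.8}],
    [Action || {Action, Fired} <- Triggers, Fired].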

Horizontal Scaling

Macula supports horizontal scaling via multiple gateway nodes:

                    
                 Load Balancer
                 (DNS/HAProxy)
                       |
        +--------------+--------------+
        |              |              |
   Gateway 1      Gateway 2      Gateway 3
  (10k clients)  (10k clients)  (10k clients)
        |              |              |
        +--------------+--------------+
                       |
                  DHT Network
                 (Shared State)

Each gateway operates independently, using the shared DHT for service discovery.


See Also