Macula Troubleshooting Guide

Diagnosing and resolving common issues in Macula deployments

Audience: Operators, Developers
Last Updated: 2025-11-28


Table of Contents

  1. Quick Diagnostics
  2. Connection Issues
  3. RPC Problems
  4. PubSub Problems
  5. Memory Issues
  6. DHT Problems
  7. Performance Issues
  8. TLS/Certificate Issues
  9. Gateway Issues
  10. Debug Tools

Quick Diagnostics

First Steps Checklist

1. [ ] Is the gateway process running?
2. [ ] Are all supervisors alive?
3. [ ] Can you reach the bootstrap node?
4. [ ] Is TLS configured correctly?
5. [ ] Are there errors in the logs?

Erlang Shell Health Check

%% Quick health check script
QuickCheck = fun() ->
    io:format("Gateway: ~p~n", [is_pid(whereis(macula_gateway))]),
    io:format("Sup: ~p~n", [is_pid(whereis(macula_sup))]),
    io:format("Clients: ~p~n", [macula_gateway_client_manager:count()]),
    io:format("Services: ~p~n", [macula_service_registry:count_services()]),
    ok
end,
QuickCheck().

Connection Issues

Problem: Clients Can't Connect

Symptoms:

  • Connection timeouts
  • TLS handshake failures
  • "Connection refused" errors

Diagnostic Steps:

%% 1. Check listener is running
is_pid(whereis(macula_quic_listener)).

%% 2. Check port is bound (QUIC runs over UDP, so list UDP sockets)
%% From shell:
%% netstat -ulnp | grep 4433

%% 3. Check TLS certificates
ssl:peercert(Socket).
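
QUIC runs over UDP, so the port check in step 2 has to look at UDP sockets rather than TCP listeners. The following is a minimal sketch, assuming `ss` (from iproute2) is installed; `udp_port_bound` is a hypothetical helper name, not part of Macula, and 4433 is the port used throughout this guide.

```shell
# Hypothetical helper: reads `ss -uln` output on stdin and succeeds
# if any local address is bound to the given port.
udp_port_bound() {
    awk -v port="$1" '
        {
            # The local address is field 4 or 5, depending on whether
            # ss prints a Netid column; the peer shows "*" as its port.
            for (i = 4; i <= 5; i++) {
                n = split($i, parts, ":")
                if (n > 1 && parts[n] == port) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (not run here):
# ss -uln | udp_port_bound 4433 && echo "UDP 4433 is bound"
```

Matching on the last colon-separated component keeps the check working for IPv4 (`0.0.0.0:4433`) and IPv6 (`[::]:4433`) listeners alike.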

Common Causes & Solutions:

Cause | Solution
Firewall blocking UDP | Open UDP port 4433 (or configured port)
TLS cert expired | Renew certificates
Wrong cert path in config | Verify certfile and keyfile paths
Listener crashed | Check supervisor, restart gateway
Max clients reached | Scale horizontally or increase limit

Fix: TLS Certificate Issues

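%% Verify certificate is readable
{ok, Pem} = file:read_file("/path/to/cert.pem").

%% Decode the first PEM entry to DER
[{'Certificate', CertDer, not_encrypted} | _] = public_key:pem_decode(Pem).

%% Check certificate validity against a trusted issuer certificate (DER).
%% Returns {ok, _} or {error, {bad_cert, Reason}}.
%% Intermediate certificates, if any, go before CertDer in the chain list.
public_key:pkix_path_validation(TrustedCertDer, [CertDer], []).

%% Check expiration date
%% From shell:
%% openssl x509 -in cert.pem -noout -dates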
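
To turn the expiration check into something a cron job or monitoring script can act on, `openssl x509 -checkend` reports whether a certificate expires within a given number of seconds. The following is a minimal sketch; `cert_expires_within` is a hypothetical helper name, and the 30-day threshold in the usage line is an arbitrary choice.

```shell
# Hypothetical helper: succeeds if CERT expires within DAYS days.
# Note: it also succeeds when openssl cannot read the file, so check
# readability separately.
cert_expires_within() {
    cert="$1"
    days="$2"
    if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
        return 1   # still valid past the window
    fi
    return 0       # expires within the window (or could not be read)
}

# Usage (not run here):
# cert_expires_within /path/to/cert.pem 30 && echo "renew soon"
```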

Problem: Connections Drop Unexpectedly

Symptoms:

  • Clients disconnect randomly
  • "connection_closed" errors
  • High reconnection rate

Diagnostic Steps:

%% Check for QUIC transport errors in logs
%% grep -i "quic\|transport\|closed" /var/log/macula.log

%% Check connection state
sys:get_state(ClientPid).

Common Causes & Solutions:

Cause | Solution
Network instability | Check network path, MTU settings
Idle timeout | Adjust idle_timeout_ms in QUIC config
NAT timeout | Enable keepalives
Resource limits | Check ulimit, file descriptor limits
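
For the resource-limit case, comparing a process's open file descriptors with its soft limit shows how close it is to exhaustion. The following is a minimal, Linux-only sketch that reads /proc; `fd_usage` is a hypothetical helper, and locating the BEAM process with `pgrep -f beam.smp` is an assumption about how the node is run.

```shell
# Hypothetical helper: prints "<open fds> <soft limit>" for a PID (Linux /proc).
fd_usage() {
    pid="$1"
    used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$used $soft"
}

# Usage (not run here):
# fd_usage "$(pgrep -f beam.smp | head -n 1)"
```

If the first number approaches the second, raise the limit for the service (for example via the unit file or container runtime) before tuning anything else.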

Fix: Idle Timeout

%% In sys.config
{macula, [
    {quic_options, [
        {idle_timeout_ms, 300000}  %% 5 minutes
    ]}
]}

RPC Problems

Problem: RPC Calls Timeout

Symptoms:

  • {error, timeout} returned from calls
  • Slow response times
  • High pending call count

Diagnostic Steps:

%% 1. Check pending calls
sys:get_state(macula_rpc_handler).
%% Look at pending_calls map size

%% 2. Check if service is registered
macula_service_registry:lookup(<<"energy.home.get">>).

%% 3. Check DHT for providers
macula_dht:get(crypto:hash(sha256, <<"energy.home.get">>)).

Common Causes & Solutions:

Cause | Solution
Service not registered | Verify provider called register/2
Provider unreachable | Check provider node connectivity
Handler too slow | Profile handler, add async processing
DHT not propagated | Wait for DHT sync (up to 30s)
Network partition | Check mesh connectivity

Fix: Increase Timeout

%% For specific calls
macula:call(Client, <<"slow.procedure">>, Args, #{timeout => 30000}).

%% Global default (sys.config)
{macula, [
    {rpc_timeout_ms, 10000}  %% 10 seconds
]}

Problem: RPC Returns "No Provider"

Symptoms:

  • {error, no_provider} returned
  • Service works on some nodes but not others

Diagnostic Steps:

%% 1. Check local registry
macula_service_registry:list_services().

%% 2. Check DHT directly
Key = crypto:hash(sha256, <<"my.procedure">>),
macula_dht:get(Key).

%% 3. Check if advertised recently
macula_advertisement_manager:get_last_advertised(<<"my.procedure">>).

Common Causes & Solutions:

Cause | Solution
Service not advertised | Call macula:advertise/3
TTL expired | Re-advertise (auto every 60s)
DHT not synced | Wait, then query bootstrap node
Wrong procedure name | Check for typos in procedure name

PubSub Problems

Problem: Subscribers Not Receiving Events

Symptoms:

  • Publisher succeeds but subscribers get nothing
  • Works locally but not across mesh

Diagnostic Steps:

%% 1. Check subscription is active
macula_pubsub_handler:list_subscriptions().

%% 2. Check DHT for subscribers
Topic = <<"sensor.temperature">>,
Key = crypto:hash(sha256, Topic),
macula_dht:get(Key).

%% 3. Verify subscriber endpoint is reachable
macula_peer_connector:connect(Endpoint).

Common Causes & Solutions:

Cause | Solution
Subscription not in DHT | Re-subscribe, wait for propagation
Wildcard mismatch | Verify wildcard pattern syntax
Subscriber crashed | Check subscriber process, restart
Endpoint unreachable | Fix network/firewall
Cache stale | Wait for cache refresh (60s TTL)

Fix: Force DHT Refresh

%% Clear subscriber cache for topic
macula_subscriber_cache:invalidate(<<"sensor.temperature">>).

%% Re-subscribe
macula:subscribe(Client, <<"sensor.temperature">>, Callback).

Problem: Duplicate Events

Symptoms:

  • Same event delivered multiple times
  • Subscribers overwhelmed

Diagnostic Steps:

%% Check for multiple subscriptions
macula_pubsub_handler:list_subscriptions().
%% Should see only one entry per topic

Common Causes & Solutions:

Cause | Solution
Multiple subscribe calls | Track subscription refs, unsubscribe first
Stale DHT entries | Wait for TTL expiration
Gateway restart during publish | Implement idempotency in subscriber

Memory Issues

Problem: Memory Usage Keeps Growing

Symptoms:

  • Memory climbs over hours/days
  • Eventually OOM crash

Diagnostic Steps:

%% 1. Check process memory
erlang:memory().

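%% 2. Find top 10 memory consumers
%% (the generator filter skips processes that exit mid-scan,
%% for which process_info/2 returns undefined)
lists:sublist(
    lists:sort(
        fun({_, A}, {_, B}) -> A >= B end,
        [{Pid, Mem}
         || Pid <- processes(),
            {memory, Mem} <- [process_info(Pid, memory)]]
    ),
    10
).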

%% 3. Check ETS tables
[{Tab, ets:info(Tab, size), ets:info(Tab, memory)}
 || Tab <- ets:all()].
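
Growth over hours or days is easier to confirm from outside the VM by sampling the node's resident set size at intervals. The following is a minimal, Linux-only sketch; `rss_kb` is a hypothetical helper, and finding the node's PID is left to the operator.

```shell
# Hypothetical helper: prints the resident set size of a PID in kB (Linux /proc).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Usage (not run here): log one sample per minute for later comparison
# while sleep 60; do echo "$(date +%s) $(rss_kb "$PID")"; done >> rss.log
```

A steadily rising series points to a leak; a sawtooth pattern usually means normal garbage collection.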

Common Causes & Solutions:

Cause | Solution
Unbounded message queue | Add backpressure, check slow handlers
ETS table growth | Verify TTL cleanup is running
Process leak | Check spawn/exit patterns
Binary leak | Force GC on the process holding the binaries: erlang:garbage_collect(Pid)

Fix: Force Cleanup

%% Trigger service cleanup manually
macula_advertisement_manager:cleanup_expired().

%% Force garbage collection on specific process
erlang:garbage_collect(whereis(macula_gateway)).

Problem: "max_clients_reached" Errors

Symptoms:

  • New clients rejected
  • Warning logs: Client connection rejected: max_clients_reached

Diagnostic Steps:

%% Check current client count
macula_gateway_client_manager:count().

%% Check max limit
application:get_env(macula, max_clients).

Solutions:

  1. Scale horizontally - Add more gateway nodes
  2. Increase limit (if resources allow):
    %% In sys.config
    {macula, [
        {max_clients, 20000}  %% Double the default
    ]}
  3. Investigate client churn - Why are clients not disconnecting?

DHT Problems

Problem: DHT Queries Timeout

Symptoms:

  • Service discovery fails
  • {error, timeout} from DHT operations

Diagnostic Steps:

%% 1. Check DHT process
is_pid(whereis(macula_dht)).

%% 2. Check bootstrap connectivity
macula_dht:ping().

%% 3. Check DHT routing table
macula_dht:get_routing_table().

Common Causes & Solutions:

Cause | Solution
Bootstrap unreachable | Check network to bootstrap node
DHT not initialized | Wait for startup, check logs
Network partition | Restore connectivity
High DHT load | Scale bootstrap nodes

Problem: Services Not Propagating

Symptoms:

  • Service works on registering node
  • Other nodes can't discover it

Diagnostic Steps:

%% On provider node
macula_service_registry:list_services().

%% On consumer node
macula_dht:get(crypto:hash(sha256, <<"service.name">>)).

Common Causes & Solutions:

Cause | Solution
DHT replication delay | Wait up to 30 seconds
Partition during advertisement | Re-advertise service
TTL too short | Increase service_ttl_ms

Performance Issues

Problem: High Latency

Symptoms:

  • RPC calls take > 100ms
  • User-perceived slowness

Diagnostic Steps:

%% 1. Check cache hit rate
macula_subscriber_cache:stats().
%% Should see high hit_rate

%% 2. Profile a call
{Time, Result} = timer:tc(fun() ->
    macula:call(Client, Proc, Args)
end),
io:format("Call took ~p ms~n", [Time / 1000]).

Common Causes & Solutions:

Cause | Solution
Cache miss | Warm up cache, check TTL settings
Slow handler | Profile handler code
Network latency | Check network path
DHT overloaded | Add more DHT nodes
QUIC handshake overhead | Enable connection reuse

Fix: Enable Caching

%% Verify caching is enabled (should be by default)
application:get_env(macula, enable_subscriber_cache).
%% Should return {ok, true}

Problem: Low Throughput

Symptoms:

  • PubSub < 1000 msg/sec
  • System seems slow under load

Diagnostic Steps:

%% Check for backpressure
sys:get_state(macula_gateway_pubsub).
%% Look at queue sizes

%% Check scheduler utilization, sampled over 1 second
%% (the argument is in seconds; the scheduler module ships with runtime_tools)
scheduler:utilization(1).

Solutions:

  1. Enable caching (see Performance Guide)
  2. Batch messages - Send in groups
  3. Reduce DHT queries - Increase cache TTL
  4. Profile handlers - Find bottlenecks

TLS/Certificate Issues

Problem: ECDSA Certificate Not Supported

Symptoms:

  • Gateway fails to start with config_error tls_error
  • Log shows certificate loading errors
  • Works with self-signed certs but fails with Let's Encrypt

Root Cause: MsQuic (the QUIC implementation used by Macula) does NOT support ECDSA certificates. Let's Encrypt switched to ECDSA by default in late 2024.

Diagnostic Steps:

# Check certificate key type
openssl x509 -in /path/to/cert.pem -noout -text | grep "Public Key Algorithm"

# If it shows "id-ecPublicKey" - that's the problem!
# Must show "rsaEncryption"
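
The same check can be scripted so that a deployment step fails early on an ECDSA certificate. The following is a minimal sketch; `cert_key_type` is a hypothetical helper name that only classifies the two algorithm strings mentioned above.

```shell
# Hypothetical helper: reads `openssl x509 -noout -text` output on stdin
# and prints rsa, ecdsa, or unknown.
cert_key_type() {
    algo=$(sed -n 's/.*Public Key Algorithm:[[:space:]]*//p' | head -n 1)
    case "$algo" in
        rsaEncryption)  echo rsa ;;
        id-ecPublicKey) echo ecdsa ;;
        *)              echo unknown ;;
    esac
}

# Usage (not run here):
# openssl x509 -in /path/to/cert.pem -noout -text | cert_key_type
```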

Solution:

Re-issue the certificate with RSA:

# For Let's Encrypt
certbot certonly --standalone -d your-domain.com \
  --key-type rsa --rsa-key-size 2048 --force-renewal

# Then restart the service
docker restart your-macula-container

Prevention: Always specify --key-type rsa when using certbot with Macula nodes.


Problem: Certificate Permission Denied

Symptoms:

  • Gateway fails to start with config_error tls_error
  • Certificate files exist but container can't read them

Root Cause: Certbot's /archive/ directory often has 700 permissions (root only). If your container runs as non-root, it can't read the certs through symlinks.

Diagnostic Steps:

# Check archive directory permissions
ls -la /etc/letsencrypt/archive/

# If permissions are drwx------ (700), non-root can't read
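
The permission check can also be scripted. The following is a minimal sketch, assuming the container user is neither the owner nor in the group of the directory, so the "other" permission bits apply; `others_can_traverse` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the "other" class of an ls-style mode
# string (e.g. drwxr-xr-x) has search (execute) permission, which is what
# resolving a path through the directory requires.
others_can_traverse() {
    other=$(printf '%s' "$1" | cut -c8-10)
    case "$other" in
        ??[xt]) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage (not run here):
# others_can_traverse "$(ls -ld /etc/letsencrypt/archive/ | awk '{ print $1 }')" \
#     || echo "archive/ blocks non-root users"
```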

Solution:

# Fix archive directory permissions
chmod 755 /etc/letsencrypt/archive/
chmod 755 /etc/letsencrypt/archive/your-domain.com/

# Restart container
docker restart your-macula-container

Gateway Issues

Problem: Gateway Crashes on Startup

Symptoms:

  • Gateway fails to start
  • Supervisor keeps restarting

Diagnostic Steps:

%% Check crash logs
%% grep "CRASH\|EXIT\|error" /var/log/macula.log

%% Try manual start to see error
macula_gateway:start_link(Config).

Common Causes & Solutions:

Cause | Solution
Missing TLS certs | Provide valid cert/key paths
Port already in use | Change port or stop conflicting service
Invalid config | Verify sys.config syntax
Missing dependency | Check all deps started

Problem: Gateway Becomes Unresponsive

Symptoms:

  • Gateway process alive but not handling requests
  • Message queue growing

Diagnostic Steps:

%% 1. Check message queue
process_info(whereis(macula_gateway), message_queue_len).

%% 2. Check if processing
sys:get_status(macula_gateway).

%% 3. Check for locks
erlang:process_info(whereis(macula_gateway), [status, current_stacktrace]).

Solutions:

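  1. Restart gateway - Last resort. supervisor:restart_child/2 returns {error, running} for a live child, so terminate it first: supervisor:terminate_child(macula_sup, macula_gateway), supervisor:restart_child(macula_sup, macula_gateway).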
  2. Find slow handler - Profile message handling
  3. Add flow control - Implement backpressure

Debug Tools

Enabling Debug Logging

%% Temporarily enable debug logs
logger:set_primary_config(level, debug).

%% For specific module
logger:set_module_level(macula_gateway, debug).

%% Reset to normal
logger:set_primary_config(level, info).

Tracing

%% Trace function calls
dbg:tracer().
dbg:p(all, c).
dbg:tpl(macula_gateway, handle_call, '_', []).

%% Stop tracing
dbg:stop().

State Inspection

%% Get process state (gen_server)
sys:get_state(macula_gateway).

%% Get ETS table contents
ets:tab2list(macula_peers).

%% Process info
process_info(whereis(macula_gateway)).

Remote Shell

# Connect to running node
/opt/macula/bin/macula remote_console

# Or via remsh
erl -name debug@localhost -setcookie macula -remsh macula@hostname

Log Analysis Commands

# Find errors in last hour
journalctl -u macula --since "1 hour ago" | grep -i error

# Count warnings by type
grep -oP '\[warning\] \K[^:]+' /var/log/macula.log | sort | uniq -c | sort -rn

# Watch logs in real-time
tail -f /var/log/macula.log | grep -E "(error|warning|CRASH)"
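
Where `grep -P` is unavailable (for example with BusyBox or BSD grep), a per-level tally can be built with `grep -E` alone. The following is a minimal sketch, assuming log lines carry a bracketed level tag such as [warning], as the commands above imply; `count_by_level` is a hypothetical helper.

```shell
# Hypothetical helper: reads log lines on stdin and prints "<count> <level>"
# for each bracketed level tag, most frequent first.
count_by_level() {
    grep -oE '\[(debug|info|notice|warning|error|critical)\]' \
        | tr -d '[]' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (not run here):
# count_by_level < /var/log/macula.log
```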

See Also