Memory Management Architecture
Status: ✅ PRODUCTION-READY (Completed 2025-11-14)
Project: Macula HTTP/3 Mesh Platform
Overview
Macula implements comprehensive memory management to prevent out-of-memory (OOM) crashes: five critical fixes bound memory usage and enable automatic cleanup.
Problem Solved: Platform experienced OOM crashes after 30-60 minutes of operation due to unbounded data structure growth.
Solution: Bounded pools, backpressure mechanisms, TTL-based cleanup, coordinated map management, and process monitoring.
Result: Stable memory usage, no crashes, production-ready platform.
5 Critical Memory Leak Fixes
1. Bounded Connection Pool (macula_gateway_mesh)
Problem: Unbounded mesh connection pool
Solution: LRU eviction, max 1,000 connections
Module: src/macula_gateway_mesh.erl
Tests: 22 tests passing
Documentation: 02_service_ttl_cleanup.md
Key Implementation:
- Track last access time for each connection
- Evict Least Recently Used when pool is full
- O(1) connection lookup and update
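A minimal sketch of this bookkeeping in plain Erlang is shown below. The function and field names are illustrative, not the actual macula_gateway_mesh API, and the eviction scan here is linear for brevity (the real module keeps access order so lookup and update stay O(1)):

```erlang
-define(MAX_CONNECTIONS, 1000).

%% Refresh the last-access time whenever a connection is used.
touch(PeerId, Conn, Pool) ->
    Pool#{PeerId => {Conn, erlang:monotonic_time(millisecond)}}.

insert(PeerId, Conn, Pool) when map_size(Pool) < ?MAX_CONNECTIONS ->
    touch(PeerId, Conn, Pool);
insert(PeerId, Conn, Pool) ->
    %% Full pool: evict the least recently used entry first.
    %% (Linear scan for brevity; the real module tracks access
    %% order so eviction does not rescan the map.)
    {Oldest, _T} = maps:fold(
                     fun(Id, {_C, T}, {_, Best}) when T < Best -> {Id, T};
                        (_Id, _V, Acc) -> Acc
                     end,
                     {undefined, infinity}, Pool),
    touch(PeerId, Conn, maps:remove(Oldest, Pool)).
```

Storing the timestamp alongside the connection keeps eviction decisions local to the pool map, with no separate index to fall out of sync.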
2. Client Connection Limits (macula_gateway_client_manager)
Problem: Unbounded client connections
Solution: Backpressure mechanism, max 10,000 clients (configurable)
Module: src/macula_gateway_client_manager.erl
Tests: 30 tests passing
Documentation: 03_stream_cleanup.md
Key Implementation:
- Check pool size before accepting a new client
- Return `{error, max_clients_reached}` when full
- Graceful degradation under load
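A hedged sketch of the backpressure check, assuming a gen_server whose state carries a clients map and the configured max_clients (default 10,000); the names are illustrative rather than the real macula_gateway_client_manager internals:

```erlang
handle_call({register_client, _NodeId, _Pid}, _From,
            #{clients := Clients, max_clients := Max} = State)
  when map_size(Clients) >= Max ->
    %% At capacity: refuse the connection instead of growing
    %% the map without bound.
    {reply, {error, max_clients_reached}, State};
handle_call({register_client, NodeId, Pid}, _From,
            #{clients := Clients} = State) ->
    {reply, ok, State#{clients := Clients#{NodeId => Pid}}}.
```

Callers that receive `{error, max_clients_reached}` can retry with backoff or fail fast, which is the graceful degradation described above.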
3. Service TTL/Cleanup (macula_service_registry)
Problem: Unbounded local_services map
Solution: 300-second TTL, periodic cleanup
Modules:
- src/macula_service_registry.erl (cleanup function)
- src/macula_advertisement_manager.erl (periodic timer)
Tests: 27 tests passing
Documentation:
- 02_service_ttl_cleanup.md
- 06_periodic_cleanup.md
Key Implementation:
- Track `advertised_at` timestamp for each service
- Automatic cleanup every 60 seconds
- Remove services older than 300 seconds
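A minimal sketch of the TTL sweep, assuming services live in a local_services map whose values carry an advertised_at timestamp in seconds. In Macula the periodic timer lives in macula_advertisement_manager; the sketch keeps both halves in one gen_server for brevity:

```erlang
-define(SERVICE_TTL_SECONDS, 300).
-define(CLEANUP_INTERVAL_MS, 60000).

init(_Args) ->
    erlang:send_after(?CLEANUP_INTERVAL_MS, self(), cleanup_expired),
    {ok, #{local_services => #{}}}.

handle_info(cleanup_expired, #{local_services := Services} = State) ->
    Now = erlang:system_time(second),
    %% Keep only services advertised within the TTL window.
    Live = maps:filter(
             fun(_Key, #{advertised_at := At}) ->
                     Now - At =< ?SERVICE_TTL_SECONDS
             end, Services),
    %% Re-arm the timer for the next 60-second sweep.
    erlang:send_after(?CLEANUP_INTERVAL_MS, self(), cleanup_expired),
    {noreply, State#{local_services := Live}}.
```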
4. Stream Cleanup (macula_gateway_client_manager)
Problem: client_streams map leaked on disconnect
Solution: Coordinated cleanup of both clients and client_streams maps
Module: src/macula_gateway_client_manager.erl
Tests: 32 tests passing (includes 2 new stream tests)
Documentation: 03_stream_cleanup.md
Key Implementation:
- Extract `node_id` from client info before removal
- Atomic cleanup of both maps
- Works for both explicit disconnect and crashes
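A sketch of the coordinated removal, assuming client entries are keyed by pid and carry the node_id that also keys client_streams; the names are illustrative:

```erlang
remove_client(ClientPid, #{clients := Clients,
                           client_streams := Streams} = State) ->
    case maps:take(ClientPid, Clients) of
        {#{node_id := NodeId}, Clients1} ->
            %% node_id is extracted before removal so the matching
            %% stream entry can be dropped in the same step.
            State#{clients := Clients1,
                   client_streams := maps:remove(NodeId, Streams)};
        error ->
            %% Unknown pid (already removed): nothing to clean up.
            State
    end.
```

Because both maps change in one state transition, the explicit disconnect path and the crash path can share this function without leaving orphans.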
5. Caller Process Monitoring (macula_rpc_handler)
Problem: Dead caller processes left pending-call entries behind until the 5-second timeout fired
Solution: Monitor caller processes, immediate cleanup via DOWN messages
Module: src/macula_rpc_handler.erl
Tests: 27 tests passing (includes 2 new monitoring tests)
Documentation: see the Caller Monitoring entries in the Documentation Index below
Key Implementation:
- Two-way mapping: `MonitorRef ↔ CallId/ServiceKey`
- Handle `'DOWN'` messages for immediate cleanup
- Cancel timers to prevent leaks
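A sketch of the two-way monitor bookkeeping, assuming by_ref and by_call maps in the handler state (the actual macula_rpc_handler field names may differ):

```erlang
track_caller(CallId, CallerPid, TimerRef,
             #{by_ref := ByRef, by_call := ByCall} = State) ->
    %% Monitor the caller so its death is reported as a 'DOWN' message.
    MRef = erlang:monitor(process, CallerPid),
    State#{by_ref := ByRef#{MRef => CallId},
           by_call := ByCall#{CallId => {MRef, TimerRef}}}.

handle_info({'DOWN', MRef, process, _Pid, _Reason},
            #{by_ref := ByRef, by_call := ByCall} = State) ->
    case maps:take(MRef, ByRef) of
        {CallId, ByRef1} ->
            {{_MRef, TimerRef}, ByCall1} = maps:take(CallId, ByCall),
            %% Cancel the pending-call timeout so it cannot leak.
            _ = erlang:cancel_timer(TimerRef),
            {noreply, State#{by_ref := ByRef1, by_call := ByCall1}};
        error ->
            {noreply, State}
    end.
```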
Documentation Index
Implementation Details
- Overview - Complete implementation summary (all 5 fixes)
- Service TTL Cleanup - Fix #3 details
- Stream Cleanup - Fix #4 details
- Caller Monitoring - Fix #5 details
- Caller Monitoring Tests - Fix #5 test coverage
- Periodic Cleanup - Automation (Task B)
Testing & Validation
- Load Testing - Load test script & results
- Complete Summary - Comprehensive final report
Maintenance & Operations
- Housekeeping Report - Architecture review, code quality analysis, future improvements
Visual Documentation
- Diagrams - Mermaid diagrams for all memory management mechanisms
Architecture Diagram
The platform implements memory management at 3 layers:
Gateway Layer (Infrastructure)
├── mesh: Bounded Pool (LRU, max 1,000)
├── client_manager: Client Limits (backpressure, max 10,000)
└── client_manager: Stream Cleanup (coordinated maps)
Service Layer
├── service_registry: TTL Cleanup (300s expiry)
└── advertisement_manager: Periodic Cleanup (60s interval)
Application Layer
└── rpc_handler: Caller Monitoring (immediate cleanup)
Test Coverage
All memory leak fixes are comprehensively tested:
| Fix | Module | Tests | Status |
|---|---|---|---|
| #1 Bounded Pool | macula_gateway_mesh | 22 | ✅ PASS |
| #2 Client Limits | macula_gateway_client_manager | 30 | ✅ PASS |
| #3 Service TTL | macula_service_registry | 27 | ✅ PASS |
| #4 Stream Cleanup | macula_gateway_client_manager | 32 | ✅ PASS |
| #5 Caller Monitoring | macula_rpc_handler | 27 | ✅ PASS |
Total: 138 tests (7 new tests added for memory leak fixes)
All tests passing: ✅
Production Monitoring
Key Metrics to Monitor
Connection Pool Size
- Should stay ≤ 1,000
- Alert if consistently at max
Client Count
- Should stay ≤ 10,000
- Track rejection rate (`max_clients_reached` errors)
Service Registry Size
- Should remain stable over time
- Monitor periodic cleanup logs
Stream Map Size
- Should match client count
- No orphaned entries
Pending Calls/Queries
- Should trend toward 0
- Spikes are OK; sustained high values indicate issues
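Assuming state shapes like the sketches above, a hypothetical snapshot function such as the one below could feed a periodic telemetry poll; the real modules may expose their own introspection APIs instead:

```erlang
%% Hypothetical metrics snapshot over the state maps sketched above.
memory_snapshot(MeshState, ClientState, RegistryState, RpcState) ->
    #{connection_pool_size =>
          map_size(maps:get(connections, MeshState)),
      client_count =>
          map_size(maps:get(clients, ClientState)),
      stream_map_size =>
          map_size(maps:get(client_streams, ClientState)),
      service_registry_size =>
          map_size(maps:get(local_services, RegistryState)),
      pending_calls =>
          map_size(maps:get(by_call, RpcState)),
      %% Total BEAM memory as a coarse sanity check.
      total_memory_bytes => erlang:memory(total)}.
```

Comparing stream_map_size against client_count directly exposes the orphaned-entry condition described above.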
Log Monitoring
Service Cleanup (runs every 60s):
[info] Service cleanup: removed 3 expired service(s) % Normal
[debug] Service cleanup: no expired services % Also normal
Client Rejections:
[warn] Client connection rejected: max_clients_reached % Monitor frequency
Caller Cleanup:
[debug] Cleaned up pending call due to caller death % Expected behavior
Quick Start
Understanding the Fixes
- Start here: Complete Summary
- Deep dive: Individual fix documentation (02-06)
- Visual learners: Diagrams
- Troubleshooting: Housekeeping Report (Section 3: Documentation)
Implementation References
All fixes follow idiomatic Erlang patterns:
- ✅ Pattern matching on function heads
- ✅ Guards instead of `if`/`case`
- ✅ Atomic state updates
- ✅ OTP best practices (process monitoring, timers)
- ✅ No deep nesting
See Housekeeping Report Section 2 for code quality analysis.
Performance Impact
Before Fixes:
- OOM crashes after 30-60 minutes
- Unbounded memory growth
- No cleanup mechanisms
After Fixes:
- Stable memory usage
- Bounded pools prevent growth
- Automatic cleanup maintains stability
- No OOM crashes observed
Overhead:
- LRU tracking: O(1) per operation
- Periodic cleanup: Runs every 60s, negligible CPU
- Process monitoring: native BEAM primitive, negligible overhead
Future Improvements
See Housekeeping Report Section 5 for detailed recommendations:
High Priority
- Memory metrics/observability (telemetry integration)
- Troubleshooting guide for production
Medium Priority
- Refactor nested case statements for clarity
- Memory pressure handling (dynamic limits)
Low Priority
- Memory manager behavior abstraction
- Unified memory management interface
Related Documentation
- Gateway Refactoring - Context for client_manager extraction
- Code Review Report - Overall code quality assessment
- CLAUDE.md - Development guidelines and memory management summary
Contributors
Implementation: Completed 2025-11-14
Documentation: ~2,500 lines across 9 documents
Diagrams: 5 Mermaid diagrams
Time Investment: ~6 hours (fixes + tests + docs)
Status: Production Ready ✅
All 5 critical memory leak fixes are:
- ✅ Implemented and tested
- ✅ Following idiomatic Erlang patterns
- ✅ Comprehensively documented
- ✅ Production-ready
Deployment Recommendation: Ready for staging → production with monitoring in place.
Last Updated: 2025-11-14
Next Review: After first production deployment