Architecture Decision Records (ADR)
ADR-001: Multi-Instance PubSub Architecture
Date: 2025-07-27
Status: Accepted
Context
ExESDB Gater originally used a single PubSub instance (:ex_esdb_system) for all event communication. As the system evolved, several challenges emerged:
Event Type Confusion: Different types of events (health, metrics, security, audit) were mixed together, making it difficult for consumers to subscribe selectively.
Scalability Concerns: High-volume events (like metrics) could overwhelm low-volume but critical events (like security alerts).
Fault Isolation: Issues with one type of event processing could affect all other event types.
Observability Challenges: Monitoring and debugging were complicated by the lack of separation between different event concerns.
Consumer Complexity: Applications had to filter and route messages manually, increasing complexity and potential for errors.
Decision
We will implement a multi-instance PubSub architecture with 10 dedicated instances, each serving a specific purpose:
- :ex_esdb_events - Core business events and domain data
- :ex_esdb_system - General system-level events (retained for compatibility)
- :ex_esdb_logging - Log aggregation and distribution
- :ex_esdb_health - Health monitoring and status events
- :ex_esdb_metrics - Performance metrics and statistics
- :ex_esdb_security - Security events and threat detection
- :ex_esdb_audit - Audit trail for compliance requirements
- :ex_esdb_alerts - Critical system alerts requiring immediate attention
- :ex_esdb_diagnostics - Deep diagnostic information for debugging
- :ex_esdb_lifecycle - Process lifecycle events (starts, stops, crashes)
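To make the choice of instance explicit in code, a small lookup helper could map an event category to one of the ten instance names above. The following is a minimal sketch only: the module name PubSubInstancesSketch and its functions are hypothetical and are not part of ExESDB Gater; only the instance atoms come from this ADR.

```elixir
defmodule PubSubInstancesSketch do
  # Hypothetical helper (not an ExESDB Gater module): maps an event
  # category to the PubSub instance name listed in this ADR.
  @instances %{
    events: :ex_esdb_events,
    system: :ex_esdb_system,
    logging: :ex_esdb_logging,
    health: :ex_esdb_health,
    metrics: :ex_esdb_metrics,
    security: :ex_esdb_security,
    audit: :ex_esdb_audit,
    alerts: :ex_esdb_alerts,
    diagnostics: :ex_esdb_diagnostics,
    lifecycle: :ex_esdb_lifecycle
  }

  # All instance names, in no particular order.
  def all, do: Map.values(@instances)

  # Returns {:ok, instance} for a known category, or a tagged error
  # so that a typo in the category fails loudly instead of silently.
  def fetch(category) when is_atom(category) do
    case Map.fetch(@instances, category) do
      {:ok, name} -> {:ok, name}
      :error -> {:error, {:unknown_category, category}}
    end
  end
end
```

A caller would resolve the instance first, for example `{:ok, pubsub} = PubSubInstancesSketch.fetch(:metrics)`, and then pass `pubsub` to the PubSub functions.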
Consequences
Positive
Clear Separation of Concerns: Each PubSub instance has a well-defined purpose, making the system easier to understand and maintain.
Independent Scaling: Each instance can be tuned and scaled based on its specific volume and latency requirements.
Selective Subscription: Consumers can subscribe only to the event types they need, reducing unnecessary processing.
Fault Isolation: Problems with one event type won't cascade to others, improving overall system resilience.
Enhanced Observability: Each instance can be monitored independently, providing better insights into system behavior.
Compliance Support: Dedicated audit and security instances support regulatory and compliance requirements.
Performance Optimization: High-volume streams (metrics, logs) won't interfere with critical, low-volume streams (alerts, security).
Negative
Increased Complexity: More instances to manage, monitor, and maintain.
Resource Overhead: Each PubSub instance consumes some memory and processing resources.
Configuration Complexity: More instances to configure and tune appropriately.
Learning Curve: Developers need to understand which instance to use for which type of event.
Mitigations
Comprehensive Documentation: Created detailed architecture documentation explaining each instance's purpose and usage patterns.
Consistent Naming: Used clear, descriptive names for each instance that indicate their purpose.
Extensive Testing: Implemented a comprehensive test suite to ensure proper isolation and functionality.
Backward Compatibility: Maintained the existing :ex_esdb_system instance to ensure no breaking changes.
Implementation Details
- All instances are managed by the ExESDBGater.PubSubSystem supervisor
- Each instance uses the same underlying Phoenix PubSub technology
- Instances are created using ExESDBGater.PubSubManager to ensure singleton behavior
- Comprehensive test coverage ensures proper isolation and functionality
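As an illustration of how several named Phoenix PubSub instances can live under one supervisor, the script below starts the ten instances and checks that a subscriber on :ex_esdb_security does not receive a broadcast sent on :ex_esdb_metrics. It is a sketch under stated assumptions, not the actual ExESDBGater.PubSubSystem: the module name PubSubSystemSketch, the :one_for_one strategy, the "demo" topic, and the message shapes are all invented for the example, and it does not model ExESDBGater.PubSubManager.

```elixir
# Standalone script (e.g. `elixir sketch.exs`); fetches phoenix_pubsub on first run.
Mix.install([{:phoenix_pubsub, "~> 2.1"}])

defmodule PubSubSystemSketch do
  # Hypothetical stand-in for ExESDBGater.PubSubSystem; the real module may differ.
  use Supervisor

  @instances [
    :ex_esdb_events,
    :ex_esdb_system,
    :ex_esdb_logging,
    :ex_esdb_health,
    :ex_esdb_metrics,
    :ex_esdb_security,
    :ex_esdb_audit,
    :ex_esdb_alerts,
    :ex_esdb_diagnostics,
    :ex_esdb_lifecycle
  ]

  def instances, do: @instances

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children =
      for name <- @instances do
        # Override the child id so ten Phoenix.PubSub children can coexist.
        Supervisor.child_spec({Phoenix.PubSub, name: name}, id: name)
      end

    # Assumption: :one_for_one, so a crash in one instance restarts only that one.
    Supervisor.init(children, strategy: :one_for_one)
  end
end

{:ok, _sup} = PubSubSystemSketch.start_link()

# Subscribe this process to a made-up topic on the security instance only.
:ok = Phoenix.PubSub.subscribe(:ex_esdb_security, "demo")

# Same topic name on a different instance: the subscriber should not see it.
:ok = Phoenix.PubSub.broadcast(:ex_esdb_metrics, "demo", {:metric, :cpu, 0.42})
:ok = Phoenix.PubSub.broadcast(:ex_esdb_security, "demo", {:security, :login_failed})

# First message that arrives, or :timeout if none does.
received =
  receive do
    msg -> msg
  after
    500 -> :timeout
  end

# Anything else still in the mailbox would indicate cross-instance leakage.
leftover =
  receive do
    msg -> msg
  after
    100 -> :nothing
  end

IO.inspect(received, label: "received")
IO.inspect(leftover, label: "leftover")
```

Because each instance keeps its own subscriber registry, the topic name "demo" can be reused across instances without messages crossing over, which is the isolation property this ADR relies on.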
Alternatives Considered
Alternative 1: Topic-Based Routing on Single Instance
Rejected because: While simpler to implement, this approach doesn't provide the fault isolation, independent scaling, or selective subscription benefits of separate instances.
Alternative 2: External Message Broker (RabbitMQ, Kafka)
Rejected because: Adds external dependencies and operational complexity. Phoenix PubSub provides sufficient functionality for our current needs with better integration into the Elixir ecosystem.
Alternative 3: Fewer, Broader Categories
Rejected because: Testing with 3-5 broader categories showed that we still needed to subdivide them logically, so we chose to be explicit upfront.
Monitoring and Success Criteria
- Instance Isolation: Each instance operates independently without cross-contamination
- Performance: No degradation in message delivery performance
- Resource Usage: Total resource usage remains within acceptable bounds
- Developer Experience: Clear guidelines and examples for choosing appropriate instances
- Operational Excellence: Monitoring and alerting work effectively for each instance
Related Documents
- PUBSUB_ARCHITECTURE.md - Detailed technical documentation
- CHANGELOG.md - Implementation history and changes