Architecture Decision Records (ADR)

ADR-001: Multi-Instance PubSub Architecture

Date: 2025-07-27

Status: Accepted

Context

ExESDB Gater originally used a single PubSub instance (:ex_esdb_system) for all event communication. As the system evolved, several challenges emerged:

Event Type Confusion: Different types of events (health, metrics, security, audit) were mixed together, making it difficult for consumers to subscribe selectively.
Scalability Concerns: High-volume events (like metrics) could overwhelm low-volume but critical events (like security alerts).
Fault Isolation: Issues with one type of event processing could affect all other event types.
Observability Challenges: Monitoring and debugging were complicated by the lack of separation between different event concerns.
Consumer Complexity: Applications had to filter and route messages manually, increasing complexity and potential for errors.

Decision

We will implement a multi-instance PubSub architecture with 10 dedicated instances, each serving a specific purpose:

:ex_esdb_events - Core business events and domain data
:ex_esdb_system - General system-level events (retained for compatibility)
:ex_esdb_logging - Log aggregation and distribution
:ex_esdb_health - Health monitoring and status events
:ex_esdb_metrics - Performance metrics and statistics
:ex_esdb_security - Security events and threat detection
:ex_esdb_audit - Audit trail for compliance requirements
:ex_esdb_alerts - Critical system alerts requiring immediate attention
:ex_esdb_diagnostics - Deep diagnostic information for debugging
:ex_esdb_lifecycle - Process lifecycle events (starts, stops, crashes)
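
As a rough illustration of how these ten instances could be started together, the sketch below supervises one Phoenix.PubSub child per name. The module name PubSubSketch.Supervisor is hypothetical and is not the actual ExESDBGater.PubSubSystem implementation; the sketch also assumes the phoenix_pubsub dependency is available at runtime.

```elixir
defmodule PubSubSketch.Supervisor do
  @moduledoc """
  Hypothetical sketch: supervises one Phoenix.PubSub instance per name.
  Not the actual ExESDBGater.PubSubSystem implementation.
  """
  use Supervisor

  # The ten instance names listed in this ADR.
  @instances [
    :ex_esdb_events,
    :ex_esdb_system,
    :ex_esdb_logging,
    :ex_esdb_health,
    :ex_esdb_metrics,
    :ex_esdb_security,
    :ex_esdb_audit,
    :ex_esdb_alerts,
    :ex_esdb_diagnostics,
    :ex_esdb_lifecycle
  ]

  def instances, do: @instances

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Give every child an explicit :id so the ten specs cannot clash.
    children =
      for name <- @instances do
        Supervisor.child_spec({Phoenix.PubSub, name: name}, id: name)
      end

    # :one_for_one restarts only the instance that crashed.
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

With a :one_for_one strategy, a crash in one instance (for example :ex_esdb_metrics) leads to a restart of that child only, which matches the fault-isolation goal described in this record.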

Consequences

Positive

Clear Separation of Concerns: Each PubSub instance has a well-defined purpose, making the system easier to understand and maintain.
Independent Scaling: Each instance can be tuned and scaled based on its specific volume and latency requirements.
Selective Subscription: Consumers can subscribe only to the event types they need, reducing unnecessary processing.
Fault Isolation: Problems with one event type won't cascade to others, improving overall system resilience.
Enhanced Observability: Each instance can be monitored independently, providing better insights into system behavior.
Compliance Support: Dedicated audit and security instances support regulatory and compliance requirements.
Performance Optimization: High-volume streams (metrics, logs) won't interfere with critical, low-volume streams (alerts, security).
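
To make selective subscription concrete, here is a minimal sketch of a consumer that listens only to the security instance. The module name, the topic string "security:alerts", and the message shape {:security_alert, alert} are all assumptions made for illustration; the ADR does not define topics or payloads. Phoenix.PubSub.subscribe/2 and Phoenix.PubSub.broadcast/3 are the standard Phoenix.PubSub calls, and the :ex_esdb_security instance is assumed to be running already.

```elixir
defmodule PubSubSketch.SecurityWatcher do
  @moduledoc """
  Hypothetical consumer that listens only to the security instance.
  Assumes the :ex_esdb_security instance is already running.
  """

  # Hypothetical topic name; the ADR does not define topics.
  @topic "security:alerts"

  # Subscribes the calling process to the security instance only.
  def subscribe do
    Phoenix.PubSub.subscribe(:ex_esdb_security, @topic)
  end

  # Publishes a message to every subscriber of the security topic.
  def publish(alert) do
    Phoenix.PubSub.broadcast(:ex_esdb_security, @topic, {:security_alert, alert})
  end

  # Waits for the next alert, returning :timeout if none arrives in time.
  def next_alert(timeout_ms \\ 1_000) do
    receive do
      {:security_alert, alert} -> {:ok, alert}
    after
      timeout_ms -> :timeout
    end
  end
end
```

A process that calls subscribe/0 here receives nothing that is published on :ex_esdb_metrics, however busy that instance is, because it never subscribed there.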

Negative

Increased Complexity: More instances to manage, monitor, and maintain.
Resource Overhead: Each PubSub instance consumes some memory and processing resources.
Configuration Complexity: More instances to configure and tune appropriately.
Learning Curve: Developers need to understand which instance to use for which type of event.
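
One way to soften the learning curve is a small lookup from event category to instance. The sketch below is purely illustrative: the module PubSubSketch.Router and its category atoms are hypothetical and are not part of ExESDB Gater; only the instance names come from this record.

```elixir
defmodule PubSubSketch.Router do
  @moduledoc "Hypothetical lookup from an event category to a PubSub instance."

  # Category atoms are invented for this sketch; instance names are from the ADR.
  @instance_for %{
    events: :ex_esdb_events,
    system: :ex_esdb_system,
    logging: :ex_esdb_logging,
    health: :ex_esdb_health,
    metrics: :ex_esdb_metrics,
    security: :ex_esdb_security,
    audit: :ex_esdb_audit,
    alerts: :ex_esdb_alerts,
    diagnostics: :ex_esdb_diagnostics,
    lifecycle: :ex_esdb_lifecycle
  }

  # Returns {:ok, instance} for a known category and :error otherwise,
  # so callers must handle unknown categories explicitly.
  def instance_for(category) when is_atom(category) do
    Map.fetch(@instance_for, category)
  end
end
```

Returning :error for an unknown category, rather than falling back to :ex_esdb_system, is a deliberate choice in this sketch: it surfaces misrouted events early instead of hiding them in the general-purpose instance.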

Mitigations

Comprehensive Documentation: Created detailed architecture documentation explaining each instance's purpose and usage patterns.
Consistent Naming: Used clear, descriptive names for each instance that indicate their purpose.
Extensive Testing: Implemented a comprehensive test suite to ensure proper isolation and functionality.
Backward Compatibility: Maintained existing :ex_esdb_system instance to ensure no breaking changes.

Implementation Details

All instances are managed by the ExESDBGater.PubSubSystem supervisor.
Each instance uses the same underlying Phoenix PubSub technology.
Instances are created through ExESDBGater.PubSubManager to ensure singleton behavior.
The test suite verifies that instances are isolated from one another and function correctly.
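
The singleton idea can be sketched as a check for an already-registered process before a child spec is produced. This is a hypothetical illustration, not the real ExESDBGater.PubSubManager API, which may differ. It assumes that a running Phoenix.PubSub instance has a process registered under its own name, so Process.whereis/1 can detect it.

```elixir
defmodule PubSubSketch.Manager do
  @moduledoc """
  Hypothetical sketch of singleton-style startup; the real
  ExESDBGater.PubSubManager API may differ.
  """

  # Returns a child spec tuple for `name`, or nil when a process is already
  # registered under that name (assumed to mean the instance is running).
  def maybe_child_spec(name) when is_atom(name) do
    case Process.whereis(name) do
      nil -> {Phoenix.PubSub, name: name}
      _pid -> nil
    end
  end

  # Builds child specs only for instances that are not yet running.
  def child_specs(names) do
    names
    |> Enum.map(&maybe_child_spec/1)
    |> Enum.reject(&is_nil/1)
  end
end
```

A supervisor could pass the result of child_specs/1 to Supervisor.init/2, so that an instance started elsewhere in the same VM is not started a second time.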

Alternatives Considered

Alternative 1: Topic-Based Routing on Single Instance

Rejected because: While simpler to implement, this approach doesn't provide the fault isolation, independent scaling, or selective subscription benefits of separate instances.

Alternative 2: External Message Broker (RabbitMQ, Kafka)

Rejected because: Adds external dependencies and operational complexity. Phoenix PubSub provides sufficient functionality for our current needs with better integration into the Elixir ecosystem.

Alternative 3: Fewer, Broader Categories

Rejected because: Trials with three to five broader categories showed that each category still had to be subdivided logically, so we chose to define the finer-grained instances explicitly up front.

Monitoring and Success Criteria

Instance Isolation: Each instance operates independently without cross-contamination.
Performance: No degradation in message delivery performance.
Resource Usage: Total resource usage remains within acceptable bounds.
Developer Experience: Clear guidelines and examples for choosing appropriate instances.
Operational Excellence: Monitoring and alerting work effectively for each instance.

Related Documents

PUBSUB_ARCHITECTURE.md - Detailed technical documentation
CHANGELOG.md - Implementation history and changes
