Broadway's architecture is built on top of GenStage. That means we structure
our processing units as independent stages, each responsible for one
individual task in the pipeline. By implementing the Broadway behaviour,
we define a GenServer process that wraps a Supervisor to manage and
own our pipeline.
                                 [producers]   <- pulls data from SQS, RabbitMQ, etc.
                                      |
                                      |   (demand dispatcher)
                                      |
    handle_message/3 runs here -> [processors]
                                     /  \
                                    /    \   (partition dispatcher)
                                   /      \
                             [batcher]  [batcher]   <- one for each batcher key
                                 |          |
                                 |          |   (demand dispatcher)
                                 |          |
    handle_batch/4 runs here -> [consumers] [consumers]
Broadway.Producer - A wrapper around the actual producer defined by the user. It serves as the source of the pipeline.
Broadway.Processor - This is where messages are processed, e.g. performing calculations or converting data into a custom JSON format. This is where the code from handle_message/3 runs.
Broadway.Batcher - Creates batches of messages based on the batcher's key. One batcher is created for each key.
Broadway.Consumer - This is where the code from handle_batch/4 runs.
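To make the mapping between these stages and user code concrete, here is a minimal sketch of a module that implements the behaviour. The module name `Sketch.Pipeline`, the `:even` and `:odd` batcher keys, and the doubling logic are hypothetical and chosen only for illustration. `Broadway.DummyProducer` stands in for a real source such as SQS or RabbitMQ, and the `Mix.install/1` line assumes the file is run as a standalone script rather than inside a Mix project.

```elixir
# Standalone-script setup (assumption); in a Mix project, list :broadway in deps instead.
Mix.install([{:broadway, "~> 1.0"}])

defmodule Sketch.Pipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      # Producer stage: DummyProducer emits nothing by itself. A real
      # pipeline would plug in an SQS, RabbitMQ, etc. producer here.
      producer: [module: {Broadway.DummyProducer, []}, concurrency: 1],
      # Processor stages: handle_message/3 runs in these processes.
      processors: [default: [concurrency: 2]],
      # One batcher (with its own consumers) is started per key listed here.
      batchers: [
        even: [concurrency: 1, batch_size: 10, batch_timeout: 1_000],
        odd: [concurrency: 1, batch_size: 10, batch_timeout: 1_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: data} = message, _context) do
    # Pick the batcher key from the original value, then transform the data.
    batcher = if rem(data, 2) == 0, do: :even, else: :odd

    message
    |> Message.update_data(&(&1 * 2))
    |> Message.put_batcher(batcher)
  end

  @impl true
  def handle_batch(_batcher, messages, _batch_info, _context) do
    # Consumer stage: a real pipeline would publish or persist the batch here.
    messages
  end
end
```

Once started, for example under an application supervisor, `Broadway.test_message(Sketch.Pipeline, 3)` pushes a single message through every stage and sends an acknowledgement back to the calling process.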
Broadway was designed to always go back to a working state in case of failures thanks to the use of supervisors. Our supervision tree is designed as follows:
                              [Broadway GenServer]
                                       |
                                       |
                                       |
                         [Broadway Pipeline Supervisor]
                            /   /  (:rest_for_one)  \   \
                           /    |                   |    \
                          /     |                   |     \
                         /      |                   |      \
                        /       |                   |       \
                       /        |                   |        \
    [ProducerSupervisor] [ProcessorSupervisor] [BatcherPartitionSupervisor] [Terminator]
       (:one_for_one)       (:one_for_all)           (:one_for_one)
           /  \                 /  \                    /      \
          /    \               /    \                  /        \
         /      \             /      \                /          \
        /        \           /        \              /            \
    [Producer_1]  ...  [Processor_1]  ...  [BatcherConsumerSuperv_1]  ...
                                                (:rest_for_one)
                                                    /  \
                                                   /    \
                                                  /      \
                                          [Batcher] [ConsumerSupervisor]
                                                       (:one_for_all)
                                                           /  \
                                                          /    \
                                                         /      \
                                                 [Consumer_1]   ...
Both ProcessorSupervisor and ConsumerSupervisor are set with
max_restarts to 0. The idea is that if any process fails, we want
to restart the rest of the tree. Since Broadway callbacks are
stateless, we can handle errors and provide reports without crashing
processes. This means that the supervision tree will only shut down
in case of unforeseen errors in Broadway's implementation.
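The effect of `max_restarts: 0` can be seen with plain OTP supervisors. The following is a minimal sketch, not Broadway's actual implementation: `Sketch.Stage`, `Sketch.Tree`, and the registered names are hypothetical stand-ins. Because the inner supervisor tolerates zero restarts, a single crashing child makes the supervisor itself exit, and its parent then restarts the whole subtree from a clean state.

```elixir
defmodule Sketch.Stage do
  # Hypothetical stand-in for a processor or consumer stage.
  use Agent

  def start_link(name), do: Agent.start_link(fn -> :ok end, name: name)
end

defmodule Sketch.Tree do
  def start_link do
    stages = [
      Supervisor.child_spec({Sketch.Stage, :stage_1}, id: :stage_1),
      Supervisor.child_spec({Sketch.Stage, :stage_2}, id: :stage_2)
    ]

    # With max_restarts: 0, the first child failure exceeds the restart
    # limit, so this supervisor shuts down its children and exits.
    inner = %{
      id: :inner_sup,
      type: :supervisor,
      start:
        {Supervisor, :start_link,
         [stages, [strategy: :one_for_all, max_restarts: 0, name: :inner_sup]]}
    }

    # The parent uses the default restart limits, so it restarts the
    # failed subtree (and, with :rest_for_one, anything started after it).
    Supervisor.start_link([inner], strategy: :rest_for_one, name: :outer_sup)
  end
end
```

After `Sketch.Tree.start_link()`, killing `:stage_1` with `Process.exit(Process.whereis(:stage_1), :kill)` results in new pids for `:inner_sup`, `:stage_1`, and `:stage_2`, since the failure is escalated to the parent rather than handled locally.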
The only exception is the producers, which contain external code and are expected to fail. If a producer crashes, its supervisor restarts it without cascading failures until its max restarts limit is reached. Broadway handles those failures by making processors automatically resubscribe to producers after a crash.
The cascading failures aspect also provides safe semantics for graceful shutdown. We know that either all processes are running OR they are all being shut down. Therefore, to gracefully shut down the supervision tree, a terminator process is activated, which starts the following steps:
It notifies the first layer of processors that they should not resubscribe to producers once they exit
It tells all producers to no longer accept demand, flush all current events, and then shut down
It then monitors and waits for a confirmation message from consumers. At this point, the terminator is effectively blocking the supervisor until all events have been processed
This triggers a cascade effect where processors notice that all of their producers have been cancelled, causing them to flush their own events and cancel their own consumers, and so on down the pipeline. This continues until consumers notice that all of their producers have been cancelled, effectively notifying the terminator that it can shut down. The outermost supervisor can then go on and fully terminate all stages, which at this point have flushed all of their events.
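The blocking role of the terminator can be sketched with plain OTP. This is a heavily simplified illustration and not Broadway's actual Terminator: it uses a single hypothetical stage (`Sketch.Drainable`) in place of the producer, processor, and consumer cascade, and the `:drain` message is an invented protocol. It relies on the fact that a supervisor stops its children in reverse start order, so a child started last has its `terminate/2` callback run first and can hold up the shutdown until the other stages have flushed.

```elixir
defmodule Sketch.Drainable do
  # Hypothetical stage holding buffered events. It is :transient so that
  # exiting normally after a drain is not treated as a failure.
  use GenServer, restart: :transient

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    {:ok, %{events: Keyword.fetch!(opts, :events), sink: Keyword.fetch!(opts, :sink)}}
  end

  @impl true
  def handle_cast(:drain, state) do
    # Flush everything that is buffered, then exit on our own.
    Enum.each(state.events, &send(state.sink, {:flushed, &1}))
    {:stop, :normal, %{state | events: []}}
  end
end

defmodule Sketch.Terminator do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Trapping exits makes the supervisor's shutdown signal run terminate/2.
    Process.flag(:trap_exit, true)
    {:ok, opts}
  end

  @impl true
  def terminate(_reason, opts) do
    pid = Process.whereis(Keyword.fetch!(opts, :stage))

    if pid do
      ref = Process.monitor(pid)
      GenServer.cast(pid, :drain)

      # Block here, and therefore block the supervisor, until the stage exits.
      receive do
        {:DOWN, ^ref, :process, _pid, _reason} -> :ok
      end
    end

    :ok
  end
end

defmodule Sketch.Shutdown do
  def run do
    children = [
      {Sketch.Drainable, events: [1, 2, 3], sink: self()},
      # Started last, so it is the first child the supervisor stops.
      {Sketch.Terminator, stage: Sketch.Drainable}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: :rest_for_one)
    :ok = Supervisor.stop(sup)

    # Every event was flushed to the caller before stop/1 returned.
    for n <- 1..3 do
      receive do
        {:flushed, ^n} -> n
      after
        0 -> :missing
      end
    end
  end
end
```

Calling `Sketch.Shutdown.run()` starts the small tree, stops it, and returns the events that were flushed during shutdown. In this sketch the terminator waits at most the default child shutdown timeout before the supervisor kills it; a production design would choose that timeout deliberately.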