ADR 001: Worker.Starter Supervision Pattern
Status: Accepted
Date: 2025-10-07
Deciders: nshkrdotcom
Context: Issue #2 feedback questioning supervision layer complexity
Context and Problem Statement
Snakepit workers manage external OS processes (Python gRPC servers). We need to handle:
- Worker crashes and automatic restarts
- Clean resource cleanup (ports, file descriptors, child processes)
- Future potential for per-worker resource pooling (connections, caches)
The Question (from Issue #2):
"What does having a :temporary worker under a Supervisor, which itself is under a DynamicSupervisor bring us (that it's worth the extra layer of complexity)?"
Decision Drivers
- External Process Management - Workers spawn and manage OS processes
- Automatic Recovery - Workers should restart without Pool intervention
- Clean Lifecycle - Worker termination must clean up all associated resources
- Future Extensibility - May need per-worker connection pools, caches, etc.
- Separation of Concerns - Pool manages availability, not lifecycle details
Considered Options
Option 1: Direct DynamicSupervisor (Standard Pattern)
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer, :transient)
Implementation:
def start_worker(worker_id) do
  child_spec = {GRPCWorker, [id: worker_id]}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end
Pros:
- ✅ Simple, standard OTP pattern
- ✅ One less process per worker
- ✅ Familiar to Elixir developers
Cons:
- ❌ Pool must track and handle worker restarts itself (see the sketch after this list)
- ❌ Harder to group worker + resources in future
- ❌ Less encapsulation
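To make the first con concrete, here is a hedged sketch of the restart handling that Option 1 pushes into the Pool. The module name, the workers map, and lookup details are illustrative assumptions, not the project's actual Pool API.
defmodule PoolWithoutStarter do
  # Hypothetical sketch: under direct supervision, the Pool must monitor
  # every worker and restart crashed ones on its own.
  use GenServer
  require Logger

  @impl true
  def init(_opts), do: {:ok, %{workers: %{}}}

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, reason}, state) do
    {worker_id, workers} = Map.pop(state.workers, pid)
    Logger.warning("Worker #{worker_id} exited (#{inspect(reason)}), restarting")

    # The Pool has to re-create the worker and re-monitor it itself.
    # With Worker.Starter (Option 2) this clause shrinks to availability bookkeeping.
    {:ok, new_pid} = WorkerSupervisor.start_worker(worker_id)
    Process.monitor(new_pid)

    {:noreply, %{state | workers: Map.put(workers, new_pid, worker_id)}}
  end
end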
Option 2: Worker.Starter Wrapper (Current Choice)
DynamicSupervisor (WorkerSupervisor)
└── Worker.Starter (Supervisor, :permanent)
    └── GRPCWorker (GenServer, :transient)
Implementation:
def start_worker(worker_id) do
  child_spec = {Worker.Starter, {worker_id, GRPCWorker}}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end
Pros:
- ✅ Automatic restarts without Pool intervention
- ✅ Worker.Starter can supervise multiple related processes
- ✅ Clean encapsulation (terminate Starter = terminate all)
- ✅ Extensible for future per-worker resources
Cons:
- ❌ Extra process per worker (~1KB memory)
- ❌ More complex process tree
- ❌ Non-standard pattern (requires explanation; a sketch of the wrapper follows below)
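For concreteness, a minimal sketch of what the wrapper can look like. This is not the exact code in lib/snakepit/pool/worker_starter.ex; the module layout and options here are assumptions for illustration only.
defmodule Worker.Starter do
  # Permanent wrapper: if it dies, the DynamicSupervisor restarts it,
  # which in turn restarts its worker.
  use Supervisor, restart: :permanent

  def start_link({worker_id, worker_module}) do
    Supervisor.start_link(__MODULE__, {worker_id, worker_module})
  end

  @impl true
  def init({worker_id, worker_module}) do
    children = [
      # :transient: restarted after crashes, left alone on normal shutdown
      Supervisor.child_spec({worker_module, [id: worker_id]}, restart: :transient)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end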
Option 3: Worker as Supervisor
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (Supervisor + GenServer hybrid)
    └── Python Process (Port)
Implementation:
defmodule GRPCWorker do
  use Supervisor
  # Also implements GenServer-like callbacks
end
Pros:
- ✅ One fewer module
- ✅ Worker directly supervises its resources
Cons:
- ❌ Violates single responsibility (mixing Supervisor + GenServer)
- ❌ Complex GenServer.call routing
- ❌ Harder to reason about behavior
Option 4: erlexec Integration
DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer)
    └── erlexec Port (C++ middleware)
        └── Python Process (guaranteed cleanup)
Implementation:
# Use erlexec library
{:ok, pid, os_pid} = :exec.run(
  "python3 grpc_server.py",
  [monitor: true, kill_timeout: 5000, kill_group: true]
)
Pros:
- ✅ Guaranteed cleanup (C++ port enforces)
- ✅ No orphans possible
- ✅ Simpler than Worker.Starter
Cons:
- ❌ External dependency
- ❌ Tight coupling (Python dies with BEAM)
- ❌ No independence for long-running jobs
Decision Outcome
Chosen Option: Option 2 (Worker.Starter Wrapper)
Justification
- External processes are special - Not just Elixir GenServers
- Automatic restarts proven - Works in production without Pool logic
- Future-proof - Can add per-worker resources without refactoring
- Clean abstraction - Terminating Starter atomically cleans up everything
Trade-offs Accepted
Memory: +1KB per worker
- For 100 workers: +100KB (negligible on modern systems)
- Acceptable cost for cleaner architecture
Complexity: Non-standard pattern
- Requires documentation (this ADR)
- Benefits outweigh learning curve
Process tree depth: 3 levels instead of 2
- Observer shows deeper tree
- But clearer ownership of resources
Consequences
Positive
Automatic Restart:
# Worker crashes
# Worker.Starter detects :DOWN
# Worker.Starter restarts worker automatically
# Pool gets :DOWN but doesn't need to act
Atomic Cleanup:
# Stop a worker
DynamicSupervisor.terminate_child(WorkerSupervisor, starter_pid)
# → Starter stops
# → Worker stops
# → Python process stops
# → All cleaned up atomically
Future Extensibility:
# v0.5: Add per-worker connection pool
children = [
  {GRPCWorker, [id: worker_id]},
  {ConnectionPool, [worker_id: worker_id]},    # Future
  {MetricsCollector, [worker_id: worker_id]}   # Future
]
Supervisor.init(children, strategy: :one_for_one)
Negative
Memory Overhead:
- Each Worker.Starter process: ~1KB
- 100 workers: 100KB total
- Monitored but acceptable
Conceptual Overhead:
- Developers must understand pattern
- Not in typical Phoenix/Elixir apps
- Requires this ADR for explanation
Debugging Complexity:
- :observer shows 3-level tree
- Must understand which PID is which (see the inspection snippet after this list)
- More processes to track
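One hedged way to navigate the deeper tree from an IEx session, assuming the DynamicSupervisor is registered under the WorkerSupervisor name used in the trees above (the real module name may be fully qualified):
# List each Starter and the worker(s) it currently supervises.
for {_id, starter_pid, :supervisor, _mods} <-
      DynamicSupervisor.which_children(WorkerSupervisor) do
  IO.inspect(Supervisor.which_children(starter_pid), label: inspect(starter_pid))
end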
Validation
Tested Scenarios
✅ Normal operation (139 tests passing):
- Workers start under Starter
- Restart on crash works
- Clean shutdown works
✅ High concurrency (100 workers):
- Initialization: 3 seconds
- No resource leaks
- Clean shutdown
✅ Crash recovery:
- Worker crashes → Starter restarts
- Starter crashes → DynamicSupervisor restarts Starter+Worker
- Pool handles both gracefully (see the test sketch below)
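A hedged ExUnit sketch of the first recovery scenario, written inside an ExUnit.Case module. start_worker/1 mirrors the implementation shown earlier; lookup_worker_pid/1 is a hypothetical helper that resolves a worker id to its current pid.
test "Worker.Starter restarts a crashed worker without Pool involvement" do
  {:ok, _starter} = WorkerSupervisor.start_worker("worker_1")
  old_pid = lookup_worker_pid("worker_1")

  ref = Process.monitor(old_pid)
  Process.exit(old_pid, :kill)
  assert_receive {:DOWN, ^ref, :process, ^old_pid, :killed}

  # Give the Starter a moment to complete its one_for_one restart.
  Process.sleep(100)

  new_pid = lookup_worker_pid("worker_1")
  assert is_pid(new_pid) and new_pid != old_pid
end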
Performance Impact
Startup:
- Direct: ~20ms for 4 workers
- With Starter: ~30ms for 4 workers
- Overhead: ~2.5ms per worker (negligible)
Memory:
- Per worker: +1KB (Worker.Starter process)
- 100 workers: +100KB total
- Acceptable on modern systems
Runtime:
- No performance difference
- Message routing: extra hop (microseconds)
Alternatives for Future
If Complexity Becomes Issue
Option A: Simplify to direct supervision
- Benchmark the performance difference (see the sketch after this list)
- If negligible overhead, remove pattern
- Keep in v0.4.x, reconsider in v0.5
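If that benchmark is run, a Benchee sketch along these lines would do. start_worker_direct/1 is a hypothetical variant that bypasses Worker.Starter, and cleanup of the spawned workers is omitted for brevity.
Benchee.run(%{
  "direct DynamicSupervisor" => fn ->
    WorkerSupervisor.start_worker_direct("bench_#{System.unique_integer([:positive])}")
  end,
  "via Worker.Starter" => fn ->
    WorkerSupervisor.start_worker("bench_#{System.unique_integer([:positive])}")
  end
})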
Option B: Adopt erlexec for coupled mode
- Guaranteed cleanup with less code
- Trade independence for simplicity
- See multi-mode architecture design
If Resources Are Added
Validation: If per-worker resources materialize, the pattern was correct.
Re-evaluation: If no per-worker resources exist by v0.5, reconsider whether the pattern is necessary.
Related Decisions
Multi-Mode Architecture (Future):
docs/20251007_external_process_supervision_design.md
- Coupled mode (current): Keep Worker.Starter
- Supervised mode (systemd): Connect to existing pool
- Independent mode (ML): Heartbeat-based
- Distributed mode (k8s): Service mesh
Process Cleanup:
- DETS tracking (ProcessRegistry)
- ApplicationCleanup (emergency handler)
- beam_run_id for safe cleanup
Review Schedule
Initial Review: 2025-10-07 (this ADR)
Next Review: 2026-Q2 (6 months)
Criteria: Check if per-worker resources were added
Questions for Review:
- Did we add per-worker resources? (validates pattern)
- Is memory overhead acceptable? (re-benchmark)
- Is team comfortable with pattern? (survey)
- Would simpler approach work? (prototype comparison)
References
- Issue #2: ElixirForum feedback questioning pattern
- Code: lib/snakepit/pool/worker_starter.ex
- Tests: test/unit/pool/worker_supervisor_test.exs
- Poolboy: Uses similar wrapper patterns
- Our analysis: docs/20251007_slop_cleanup_analysis/
Notes
This pattern is intentional, not accidental. It solves real problems with managing external OS processes from Erlang. The complexity is justified by the requirements, but requires documentation (this ADR) to communicate the reasoning.
For new team members: Read this ADR first before questioning the pattern. If still unclear, discuss in team review.
For future refactoring: Benchmark direct supervision before removing. The pattern serves a purpose.