ADR 001: Worker.Starter Supervision Pattern

Status: Accepted
Date: 2025-10-07
Deciders: nshkrdotcom
Context: Issue #2 feedback questioning supervision layer complexity


Context and Problem Statement

Snakepit workers manage external OS processes (Python gRPC servers). We need to handle:

  • Worker crashes and automatic restarts
  • Clean resource cleanup (ports, file descriptors, child processes)
  • Future potential for per-worker resource pooling (connections, caches)

The Question (from Issue #2):

"What does having a :temporary worker under a Supervisor, which itself is under a DynamicSupervisor bring us (that it's worth the extra layer of complexity)?"


Decision Drivers

  1. External Process Management - Workers spawn and manage OS processes
  2. Automatic Recovery - Workers should restart without Pool intervention
  3. Clean Lifecycle - Worker termination must clean up all associated resources
  4. Future Extensibility - May need per-worker connection pools, caches, etc.
  5. Separation of Concerns - Pool manages availability, not lifecycle details

Considered Options

Option 1: Direct DynamicSupervisor (Standard Pattern)

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer, :transient)

Implementation:

def start_worker(worker_id) do
  child_spec = {GRPCWorker, [id: worker_id]}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end

Pros:

  • ✅ Simple, standard OTP pattern
  • ✅ One less process per worker
  • ✅ Familiar to Elixir developers

Cons:

  • ❌ Pool must track and handle worker restarts
  • ❌ Harder to group worker + resources in future
  • ❌ Less encapsulation

Option 2: Worker.Starter Wrapper (Current Choice)

DynamicSupervisor (WorkerSupervisor)
└── Worker.Starter (Supervisor, :permanent)
    └── GRPCWorker (GenServer, :transient)

Implementation:

def start_worker(worker_id) do
  child_spec = {Worker.Starter, {worker_id, GRPCWorker}}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end

Pros:

  • ✅ Automatic restarts without Pool intervention
  • ✅ Worker.Starter can supervise multiple related processes
  • ✅ Clean encapsulation (terminate Starter = terminate all)
  • ✅ Extensible for future per-worker resources

Cons:

  • ❌ Extra process per worker (~1KB memory)
  • ❌ More complex process tree
  • ❌ Non-standard pattern (requires explanation)
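
The three-level tree of Option 2 can be sketched as follows. This is a minimal illustration, not the code in lib/snakepit/pool/worker_starter.ex: the modules `StarterSketch.Worker` and `StarterSketch.Starter` are hypothetical stand-ins, and the worker only stores its id instead of launching a Python process.

```elixir
defmodule StarterSketch.Worker do
  # Hypothetical stand-in for GRPCWorker: a GenServer that only keeps its id.
  use GenServer, restart: :transient

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, Keyword.fetch!(opts, :id)}
end

defmodule StarterSketch.Starter do
  # Hypothetical stand-in for Worker.Starter: a Supervisor owning one worker.
  # `use Supervisor` gives it a default child spec with restart: :permanent.
  use Supervisor

  def start_link({worker_id, worker_module}) do
    Supervisor.start_link(__MODULE__, {worker_id, worker_module})
  end

  @impl true
  def init({worker_id, worker_module}) do
    children = [{worker_module, [id: worker_id]}]
    Supervisor.init(children, strategy: :one_for_one)
  end
end

# Level 1: the DynamicSupervisor; level 2: the wrapper; level 3: the worker.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{:ok, starter} =
  DynamicSupervisor.start_child(
    sup,
    {StarterSketch.Starter, {"worker_1", StarterSketch.Worker}}
  )

[{_id, worker, :worker, _modules}] = Supervisor.which_children(starter)
IO.inspect({sup, starter, worker}, label: "three-level tree")
```

Adding a second child to `children` in `init/1` is all that per-worker resources would require under this shape.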

Option 3: Worker as Supervisor

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (Supervisor + GenServer hybrid)
    └── Python Process (Port)

Implementation:

defmodule GRPCWorker do
  use Supervisor
  # Also implements GenServer-like callbacks
end

Pros:

  • ✅ One fewer module
  • ✅ Worker directly supervises its resources

Cons:

  • ❌ Violates single responsibility (mixing Supervisor + GenServer)
  • ❌ Complex GenServer.call routing
  • ❌ Harder to reason about behavior

Option 4: erlexec Integration

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer)
    └── erlexec Port (C++ middleware)
        └── Python Process (guaranteed cleanup)

Implementation:

# Use the erlexec library
{:ok, pid, os_pid} = :exec.run(
  "python3 grpc_server.py",
  # :monitor and :kill_group are bare atoms in erlexec; kill_timeout is in seconds
  [:monitor, :kill_group, kill_timeout: 5]
)

Pros:

  • ✅ Guaranteed cleanup (C++ port enforces)
  • ✅ No orphans possible
  • ✅ Simpler than Worker.Starter

Cons:

  • ❌ External dependency
  • ❌ Tight coupling (Python dies with BEAM)
  • ❌ No independence for long-running jobs

Decision Outcome

Chosen Option: Option 2 (Worker.Starter Wrapper)

Justification

  1. External processes are special - Not just Elixir GenServers
  2. Automatic restarts proven - Works in production without Pool logic
  3. Future-proof - Can add per-worker resources without refactoring
  4. Clean abstraction - Terminating Starter atomically cleans up everything

Trade-offs Accepted

Memory: +1KB per worker

  • For 100 workers: +100KB (negligible on modern systems)
  • Acceptable cost for cleaner architecture

Complexity: Non-standard pattern

  • Requires documentation (this ADR)
  • Benefits outweigh learning curve

Process tree depth: 3 levels instead of 2

  • Observer shows deeper tree
  • But clearer ownership of resources

Consequences

Positive

Automatic Restart:

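# Worker crashes (abnormal exit)
# Worker.Starter is linked to the worker and receives its exit signal
# Worker.Starter restarts the worker automatically (:transient restarts only after abnormal exits)
# Pool gets :DOWN but doesn't need to act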
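
The restart sequence can be reproduced in a few lines. This is a sketch under stated assumptions: an `Agent` stands in for GRPCWorker, a module-less `Supervisor` stands in for Worker.Starter, and the calling process plays the Pool by holding a monitor on the worker. The polling helper `await_worker` is hypothetical and exists only because the restart happens asynchronously.

```elixir
# Stand-in worker: an Agent with the same :transient restart policy.
worker_spec =
  Supervisor.child_spec({Agent, fn -> "worker_1" end}, restart: :transient)

# Stand-in for Worker.Starter: a one-child supervisor.
{:ok, starter} = Supervisor.start_link([worker_spec], strategy: :one_for_one)
[{_id, old_worker, :worker, _modules}] = Supervisor.which_children(starter)

# The Pool's view: it only monitors the worker.
ref = Process.monitor(old_worker)

# Simulate a crash. :kill is an abnormal exit, so :transient triggers a restart.
Process.exit(old_worker, :kill)

receive do
  {:DOWN, ^ref, :process, ^old_worker, reason} ->
    IO.inspect(reason, label: "pool saw :DOWN with reason")
after
  1_000 -> raise "no :DOWN received"
end

# Poll until the supervisor reports a replacement pid.
await_worker = fn await ->
  case Supervisor.which_children(starter) do
    [{_, pid, :worker, _}] when is_pid(pid) and pid != old_worker ->
      pid

    _ ->
      Process.sleep(10)
      await.(await)
  end
end

new_worker = await_worker.(await_worker)
IO.inspect({old_worker, new_worker}, label: "old and new worker")
```

The monitoring process never calls a restart function; it only learns that the old pid is gone.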

Atomic Cleanup:

# Stop a worker
DynamicSupervisor.terminate_child(WorkerSupervisor, starter_pid)
# → Starter stops
# → Worker stops
# → Python process stops
# → All cleaned up atomically
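
A runnable sketch of the BEAM side of this teardown, assuming an `Agent` as a stand-in for GRPCWorker and a hand-written child spec (`starter_spec`, hypothetical) in place of Worker.Starter. Stopping the Python process depends on the worker's own termination logic, which is not modelled here.

```elixir
# Stand-in worker and wrapper specs (both hypothetical).
worker_spec =
  Supervisor.child_spec({Agent, fn -> "worker_1" end}, restart: :transient)

starter_spec = %{
  id: :starter,
  start: {Supervisor, :start_link, [[worker_spec], [strategy: :one_for_one]]},
  type: :supervisor,
  restart: :permanent
}

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, starter} = DynamicSupervisor.start_child(sup, starter_spec)
[{_id, worker, :worker, _modules}] = Supervisor.which_children(starter)

# One call stops the wrapper; the wrapper shuts down its children first.
# terminate_child/2 returns only after the wrapper process has exited.
:ok = DynamicSupervisor.terminate_child(sup, starter)

IO.inspect(
  %{starter_alive: Process.alive?(starter), worker_alive: Process.alive?(worker)},
  label: "after terminate_child"
)
```

Because the child was removed with `terminate_child/2`, the DynamicSupervisor does not restart it, even though its restart policy is `:permanent`.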

Future Extensibility:

# v0.5: Add per-worker connection pool
children = [
  {GRPCWorker, [id: worker_id]},
  {ConnectionPool, [worker_id: worker_id]},  # Future
  {MetricsCollector, [worker_id: worker_id]} # Future
]
Supervisor.init(children, strategy: :one_for_one)

Negative

Memory Overhead:

  • Each Worker.Starter process: ~1KB
  • 100 workers: 100KB total
  • Monitored but acceptable

Conceptual Overhead:

  • Developers must understand pattern
  • Not in typical Phoenix/Elixir apps
  • Requires this ADR for explanation

Debugging Complexity:

  • :observer shows 3-level tree
  • Must understand which PID is which
  • More processes to track

Validation

Tested Scenarios

Normal operation (139 tests passing):

  • Workers start under Starter
  • Restart on crash works
  • Clean shutdown works

High concurrency (100 workers):

  • Initialization: 3 seconds
  • No resource leaks
  • Clean shutdown

Crash recovery:

  • Worker crashes → Starter restarts
  • Starter crashes → DynamicSupervisor restarts Starter+Worker
  • Pool handles both gracefully
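
The second scenario (the wrapper itself crashing) can be checked with a similar sketch. The assumptions are the same as before: an `Agent` stands in for GRPCWorker, `starter_spec` is a hypothetical stand-in for Worker.Starter, and `await_starter` is a helper introduced only for this example.

```elixir
worker_spec =
  Supervisor.child_spec({Agent, fn -> "worker_1" end}, restart: :transient)

starter_spec = %{
  id: :starter,
  start: {Supervisor, :start_link, [[worker_spec], [strategy: :one_for_one]]},
  type: :supervisor,
  restart: :permanent
}

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, old_starter} = DynamicSupervisor.start_child(sup, starter_spec)
[{_, old_worker, :worker, _}] = Supervisor.which_children(old_starter)

# Kill the wrapper. Its linked worker goes down with it.
Process.exit(old_starter, :kill)

# Poll until the DynamicSupervisor reports a replacement wrapper.
await_starter = fn await ->
  case DynamicSupervisor.which_children(sup) do
    [{_, pid, :supervisor, _}] when is_pid(pid) and pid != old_starter ->
      pid

    _ ->
      Process.sleep(10)
      await.(await)
  end
end

new_starter = await_starter.(await_starter)
[{_, new_worker, :worker, _}] = Supervisor.which_children(new_starter)

IO.inspect({old_starter, new_starter}, label: "starter replaced")
IO.inspect({old_worker, new_worker}, label: "worker replaced")
```

The replacement wrapper starts its own fresh worker, so the pair is always restarted together.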

Performance Impact

Startup:

  • Direct: ~20ms for 4 workers
  • With Starter: ~30ms for 4 workers
  • Overhead: ~2.5ms per worker (negligible)

Memory:

  • Per worker: +1KB (Worker.Starter process)
  • 100 workers: +100KB total
  • Acceptable on modern systems
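
The per-wrapper memory figure can be re-measured with `Process.info/2`. The following is a rough sketch, assuming a module-less one-child `Supervisor` as a stand-in for Worker.Starter and an `Agent` as the worker; the numbers it prints depend on the OTP version and on what the real Starter holds in its state, so they are not expected to match the figures above exactly.

```elixir
# Build a stand-in worker spec for a given id (hypothetical helper).
worker_spec = fn id ->
  Supervisor.child_spec({Agent, fn -> id end}, restart: :transient)
end

# Start 100 wrapper supervisors, each owning one worker.
starters =
  for i <- 1..100 do
    {:ok, pid} =
      Supervisor.start_link([worker_spec.("worker_#{i}")], strategy: :one_for_one)

    pid
  end

# Sum the memory (in bytes) reported for the wrapper processes only.
total_bytes =
  starters
  |> Enum.map(fn pid ->
    {:memory, bytes} = Process.info(pid, :memory)
    bytes
  end)
  |> Enum.sum()

IO.puts("100 wrappers: #{total_bytes} bytes total, #{div(total_bytes, 100)} bytes each")
```

Running this against the real Worker.Starter module rather than the stand-in would give the figure relevant to the review questions.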

Runtime:

  • No performance difference
  • Message routing: extra hop (microseconds)

Alternatives for Future

If Complexity Becomes an Issue

Option A: Simplify to direct supervision

  • Benchmark performance difference
  • If negligible overhead, remove pattern
  • Keep in v0.4.x, reconsider in v0.5

Option B: Adopt erlexec for coupled mode

  • Guaranteed cleanup with less code
  • Trade independence for simplicity
  • See multi-mode architecture design

If Resources Are Added

Validation: If per-worker resources materialize, the pattern was correct.
Re-evaluation: If no resources are added by v0.5, reconsider its necessity.


Multi-Mode Architecture (Future):

  • docs/20251007_external_process_supervision_design.md
  • Coupled mode (current): Keep Worker.Starter
  • Supervised mode (systemd): Connect to existing pool
  • Independent mode (ML): Heartbeat-based
  • Distributed mode (k8s): Service mesh

Process Cleanup:

  • DETS tracking (ProcessRegistry)
  • ApplicationCleanup (emergency handler)
  • beam_run_id for safe cleanup

Review Schedule

Initial Review: 2025-10-07 (this ADR)
Next Review: 2026-Q2 (6 months)
Criteria: Check if per-worker resources were added

Questions for Review:

  1. Did we add per-worker resources? (validates pattern)
  2. Is memory overhead acceptable? (re-benchmark)
  3. Is team comfortable with pattern? (survey)
  4. Would simpler approach work? (prototype comparison)

References

  • Issue #2: ElixirForum feedback questioning pattern
  • Code: lib/snakepit/pool/worker_starter.ex
  • Tests: test/unit/pool/worker_supervisor_test.exs
  • Poolboy: Uses similar wrapper patterns
  • Our analysis: docs/20251007_slop_cleanup_analysis/

Notes

This pattern is intentional, not accidental. It solves real problems with managing external OS processes from Erlang. The complexity is justified by the requirements, but requires documentation (this ADR) to communicate the reasoning.

For new team members: Read this ADR before questioning the pattern. If it is still unclear, discuss it in team review.

For future refactoring: Benchmark direct supervision before removing. The pattern serves a purpose.