ADR 001: Worker.Starter Supervision Pattern

Status: Accepted
Date: 2025-10-07
Deciders: nshkrdotcom
Context: Issue #2 feedback questioning supervision layer complexity

Context and Problem Statement

Snakepit workers manage external OS processes (Python gRPC servers). We need to handle:

Worker crashes and automatic restarts
Clean resource cleanup (ports, file descriptors, child processes)
Future potential for per-worker resource pooling (connections, caches)

The Question (from Issue #2):

"What does having a :temporary worker under a Supervisor, which itself is under a DynamicSupervisor bring us (that it's worth the extra layer of complexity)?"

Decision Drivers

External Process Management - Workers spawn and manage OS processes
Automatic Recovery - Workers should restart without Pool intervention
Clean Lifecycle - Worker termination must clean up all associated resources
Future Extensibility - May need per-worker connection pools, caches, etc.
Separation of Concerns - Pool manages availability, not lifecycle details

Considered Options

Option 1: Direct DynamicSupervisor (Standard Pattern)

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer, :transient)

Implementation:

def start_worker(worker_id) do
  child_spec = {GRPCWorker, [id: worker_id]}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end

Pros:

✅ Simple, standard OTP pattern
✅ One less process per worker
✅ Familiar to Elixir developers

Cons:

❌ Pool must track and handle worker restarts
❌ Harder to group worker + resources in future
❌ Less encapsulation

Option 2: Worker.Starter Wrapper (Current Choice)

DynamicSupervisor (WorkerSupervisor)
└── Worker.Starter (Supervisor, :permanent)
    └── GRPCWorker (GenServer, :transient)

Implementation:

def start_worker(worker_id) do
  child_spec = {Worker.Starter, {worker_id, GRPCWorker}}
  DynamicSupervisor.start_child(__MODULE__, child_spec)
end

Pros:

✅ Automatic restarts without Pool intervention
✅ Worker.Starter can supervise multiple related processes
✅ Clean encapsulation (terminate Starter = terminate all)
✅ Extensible for future per-worker resources

Cons:

❌ Extra process per worker (~1KB memory)
❌ More complex process tree
❌ Non-standard pattern (requires explanation)
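
As a rough illustration of the shape Option 2 implies, the sketch below defines a per-worker supervisor that starts a single :transient child. The module names Sketch.Worker and Sketch.Starter are hypothetical stand-ins, not the code in lib/snakepit/pool/worker_starter.ex, and the stand-in worker only stores its id instead of spawning a Python process.

```elixir
defmodule Sketch.Worker do
  # Hypothetical stand-in for GRPCWorker: it only keeps its id as state.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, Keyword.fetch!(opts, :id)}
end

defmodule Sketch.Starter do
  # Hypothetical per-worker supervisor, modelled on the Worker.Starter idea.
  use Supervisor

  def start_link({worker_id, worker_module}) do
    Supervisor.start_link(__MODULE__, {worker_id, worker_module})
  end

  @impl true
  def init({worker_id, worker_module}) do
    children = [
      # :transient means the worker is restarted only after an abnormal exit.
      Supervisor.child_spec({worker_module, [id: worker_id]},
        id: worker_id,
        restart: :transient
      )
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

# Anonymous DynamicSupervisor playing the role of WorkerSupervisor.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{:ok, starter} =
  DynamicSupervisor.start_child(sup, {Sketch.Starter, {"worker_1", Sketch.Worker}})

# The starter owns exactly one child: the worker.
[{id, worker, kind, _modules}] = Supervisor.which_children(starter)
IO.inspect({id, is_pid(worker), kind}, label: "starter child")
```

Adding a per-worker resource later would mean appending another entry to the children list in init/1, which is the extensibility argument made above.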

Option 3: Worker as Supervisor

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (Supervisor + GenServer hybrid)
    └── Python Process (Port)

Implementation:

defmodule GRPCWorker do
  use Supervisor
  # Also implements GenServer-like callbacks
end

Pros:

✅ One fewer module
✅ Worker directly supervises its resources

Cons:

❌ Violates single responsibility (mixing Supervisor + GenServer)
❌ Complex GenServer.call routing
❌ Harder to reason about behavior

Option 4: erlexec Integration

DynamicSupervisor (WorkerSupervisor)
└── GRPCWorker (GenServer)
    └── erlexec Port (C++ middleware)
        └── Python Process (guaranteed cleanup)

Implementation:

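# Use the erlexec library (the :exec application must be started first)
{:ok, pid, os_pid} = :exec.run(
  "python3 grpc_server.py",
  # :monitor and :kill_group are bare atoms; :kill_timeout is given in seconds
  [:monitor, :kill_group, kill_timeout: 5]
)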

Pros:

✅ Guaranteed cleanup (C++ port enforces)
✅ No orphans possible
✅ Simpler than Worker.Starter

Cons:

❌ External dependency
❌ Tight coupling (Python dies with BEAM)
❌ No independence for long-running jobs

Decision Outcome

Chosen Option: Option 2 (Worker.Starter Wrapper)

Justification

External processes are special - Not just Elixir GenServers
Automatic restarts proven - Works in production without Pool logic
Future-proof - Can add per-worker resources without refactoring
Clean abstraction - Terminating Starter atomically cleans up everything

Trade-offs Accepted

Memory: +1KB per worker

For 100 workers: +100KB (negligible on modern systems)
Acceptable cost for cleaner architecture

Complexity: Non-standard pattern

Requires documentation (this ADR)
Benefits outweigh learning curve

Process tree depth: 3 levels instead of 2

Observer shows deeper tree
But clearer ownership of resources

Consequences

Positive

Automatic Restart:

# Worker crashes
# Worker.Starter, linked to the worker as its supervisor, receives the exit signal
# Worker.Starter restarts the worker automatically
# Pool receives :DOWN from its monitor but doesn't need to act
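
The restart sequence can be reproduced with standard library pieces only. In this sketch an Agent stands in for GRPCWorker and an inline one_for_one Supervisor stands in for Worker.Starter; both substitutions, and the helper module Sketch.Await, are assumptions made to keep the example self-contained.

```elixir
defmodule Sketch.Await do
  # Polls the supervisor until its single child is a live pid other than `old`.
  # Polling is needed because the restart happens asynchronously.
  def new_child(sup, old, attempts \\ 100)

  def new_child(_sup, _old, 0), do: raise("worker was not restarted")

  def new_child(sup, old, attempts) do
    case Supervisor.which_children(sup) do
      [{_id, pid, _type, _mods}] when is_pid(pid) and pid != old ->
        pid

      _ ->
        Process.sleep(10)
        new_child(sup, old, attempts - 1)
    end
  end
end

children = [
  Supervisor.child_spec({Agent, fn -> :ready end}, id: :worker, restart: :transient)
]

{:ok, starter} = Supervisor.start_link(children, strategy: :one_for_one)
[{:worker, old_pid, :worker, _}] = Supervisor.which_children(starter)

# The "Pool" side: monitor the worker, then simulate a crash.
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

receive do
  {:DOWN, ^ref, :process, ^old_pid, reason} ->
    IO.inspect(reason, label: "pool saw :DOWN")
after
  1_000 -> raise "worker did not exit"
end

# No restart logic runs on the pool side; the supervisor replaces the worker.
new_pid = Sketch.Await.new_child(starter, old_pid)
IO.inspect(new_pid != old_pid, label: "worker replaced")
```

The monitoring process only observes the crash. The replacement pid comes from the supervisor, which is the division of labour this ADR argues for.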

Atomic Cleanup:

# Stop a worker
DynamicSupervisor.terminate_child(WorkerSupervisor, starter_pid)
# → Starter stops
# → Worker stops
# → Python process stops
# → All cleaned up atomically
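
A minimal sketch of the same teardown on the BEAM side, again with an Agent standing in for GRPCWorker and an inline Supervisor standing in for Worker.Starter (both are assumptions). Stopping the Python process is not modelled here; in the real worker that depends on its own termination logic.

```elixir
# Worker stand-in: an Agent with the same :transient restart value.
worker_spec =
  Supervisor.child_spec({Agent, fn -> :ready end}, id: :worker, restart: :transient)

# Starter stand-in: a one-child supervisor described as a plain child spec map.
starter_spec = %{
  id: :starter,
  start: {Supervisor, :start_link, [[worker_spec], [strategy: :one_for_one]]},
  type: :supervisor
}

{:ok, worker_sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
{:ok, starter_pid} = DynamicSupervisor.start_child(worker_sup, starter_spec)
[{:worker, worker_pid, :worker, _}] = Supervisor.which_children(starter_pid)

# One call against the top-level supervisor: the starter shuts down its
# children first, then exits itself, before terminate_child/2 returns.
:ok = DynamicSupervisor.terminate_child(worker_sup, starter_pid)

IO.inspect({Process.alive?(starter_pid), Process.alive?(worker_pid)},
  label: "starter/worker alive after terminate"
)
```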

Future Extensibility:

# v0.5: Add per-worker connection pool
children = [
  {GRPCWorker, [id: worker_id]},
  {ConnectionPool, [worker_id: worker_id]},  # Future
  {MetricsCollector, [worker_id: worker_id]} # Future
]
Supervisor.init(children, strategy: :one_for_one)

Negative

Memory Overhead:

Each Worker.Starter process: ~1KB
100 workers: 100KB total
Monitored but acceptable

Conceptual Overhead:

Developers must understand pattern
Not in typical Phoenix/Elixir apps
Requires this ADR for explanation

Debugging Complexity:

:observer shows 3-level tree
Must understand which PID is which
More processes to track

Validation

Tested Scenarios

✅ Normal operation (139 tests passing):

Workers start under Starter
Restart on crash works
Clean shutdown works

✅ High concurrency (100 workers):

Initialization: 3 seconds
No resource leaks
Clean shutdown

✅ Crash recovery:

Worker crashes → Starter restarts
Starter crashes → DynamicSupervisor restarts Starter+Worker
Pool handles both gracefully

Performance Impact

Startup:

Direct: ~20ms for 4 workers
With Starter: ~30ms for 4 workers
Overhead: ~2.5ms per worker (negligible)

Memory:

Per worker: +1KB (Worker.Starter process)
100 workers: +100KB total
Acceptable on modern systems

Runtime:

No performance difference
Message routing: extra hop (microseconds)
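
One way to re-check the startup and memory figures is sketched below. It measures only the BEAM-side cost of the extra supervisor, with an Agent standing in for GRPCWorker, so absolute numbers will differ from those above, where Python startup is part of the measurement. Sketch.Measure and its functions are hypothetical names.

```elixir
defmodule Sketch.Measure do
  # Agent stands in for GRPCWorker (assumption); no OS process is spawned.
  def worker_spec(id) do
    Supervisor.child_spec({Agent, fn -> id end}, id: id, restart: :transient)
  end

  # Wraps the worker in a one-child supervisor, playing the Worker.Starter role.
  def starter_spec(id) do
    %{
      id: {:starter, id},
      start: {Supervisor, :start_link, [[worker_spec(id)], [strategy: :one_for_one]]},
      type: :supervisor
    }
  end

  # Starts `count` children built by `spec_fun`.
  # Returns {elapsed_microseconds, pids_of_direct_children}.
  def timed_start(count, spec_fun) do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

    :timer.tc(fn ->
      for id <- 1..count do
        {:ok, pid} = DynamicSupervisor.start_child(sup, spec_fun.(id))
        pid
      end
    end)
  end
end

{direct_us, _workers} = Sketch.Measure.timed_start(100, &Sketch.Measure.worker_spec/1)
{wrapped_us, starters} = Sketch.Measure.timed_start(100, &Sketch.Measure.starter_spec/1)

# Process.info/2 reports the bytes held by each wrapper process.
starter_bytes = for pid <- starters, do: elem(Process.info(pid, :memory), 1)
avg_bytes = div(Enum.sum(starter_bytes), length(starter_bytes))

IO.puts("direct start:  #{direct_us} us for 100 workers")
IO.puts("wrapped start: #{wrapped_us} us for 100 workers")
IO.puts("average wrapper memory: #{avg_bytes} bytes")
```

Running it several times and comparing medians gives a steadier picture than a single run, since scheduler noise can dominate at this scale.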

Alternatives for Future

If Complexity Becomes an Issue

Option A: Simplify to direct supervision

Benchmark performance difference
If negligible overhead, remove pattern
Keep in v0.4.x, reconsider in v0.5

Option B: Adopt erlexec for coupled mode

Guaranteed cleanup with less code
Trade independence for simplicity
See multi-mode architecture design

If Resources Are Added

Validation: If per-worker resources materialize, pattern was correct
Re-evaluation: If no resources by v0.5, reconsider necessity

Related Decisions

Multi-Mode Architecture (Future):

docs/20251007_external_process_supervision_design.md
Coupled mode (current): Keep Worker.Starter
Supervised mode (systemd): Connect to existing pool
Independent mode (ML): Heartbeat-based
Distributed mode (k8s): Service mesh

Process Cleanup:

DETS tracking (ProcessRegistry)
ApplicationCleanup (emergency handler)
beam_run_id for safe cleanup

Review Schedule

Initial Review: 2025-10-07 (this ADR)
Next Review: 2026-Q2 (6 months)
Criteria: Check if per-worker resources were added

Questions for Review:

Did we add per-worker resources? (validates pattern)
Is memory overhead acceptable? (re-benchmark)
Is team comfortable with pattern? (survey)
Would simpler approach work? (prototype comparison)

References

Issue #2: ElixirForum feedback questioning pattern
Code: lib/snakepit/pool/worker_starter.ex
Tests: test/unit/pool/worker_supervisor_test.exs
Poolboy: Uses similar wrapper patterns
Our analysis: docs/20251007_slop_cleanup_analysis/

Notes

This pattern is intentional, not accidental. It solves real problems with managing external OS processes from Erlang. The complexity is justified by the requirements, but requires documentation (this ADR) to communicate the reasoning.

For new team members: Read this ADR before questioning the pattern. If it is still unclear, discuss it in team review.

For future refactoring: Benchmark direct supervision before removing. The pattern serves a purpose.
