Snakepit Performance Architecture Diagrams (v0.6.0)
High-performance behaviour in Snakepit is anchored in constant-time routing, concurrent worker startup, and proactive health management. The diagrams below highlight the control-plane mechanics that keep latency low while tolerating heavy churn in external Python processes.
Key Performance Features
- Dual worker profiles (process or thread) surfaced through the same pool API
- Non-blocking pool backed by `Task.Supervisor.async_nolink/2`
- ETS-backed registries with O(1) worker lookup and session affinity
- Heartbeat-driven failure detection feeding back into OTP supervision
- Lifecycle-driven recycling to cap memory growth and tail latency
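These features surface through a small public API and a handful of pool settings. The snippet below is a minimal sketch assuming the configuration keys documented in the project README (`pooling_enabled`, `adapter_module`, `pool_config`); treat the exact key names and pool size as illustrative for the version you have installed.

```elixir
# config/config.exs — illustrative pool setup; key names follow the Snakepit
# README, so verify them against the version you run.
import Config

config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_config: %{pool_size: 8}
```

A stateless call such as `{:ok, result} = Snakepit.execute("ping", %{test: true})` then takes the non-blocking path: the Pool dispatches the work via `Task.Supervisor.async_nolink/2` and is immediately free for the next caller.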
1. Control Plane & Worker Capsule (performance focus)
```mermaid
graph TD
subgraph ControlPlane["Elixir Control Plane"]
Pool["Pool<br>GenServer"]
TaskSup["Task Supervisor<br>Async execution"]
Registries["Registries<br>Worker / Starter / Process"]
Lifecycle["Worker Lifecycle<br>Manager"]
GRPCServer["GRPC Endpoint<br>Cowboy + gRPC"]
WorkerSup["Worker Supervisor<br>DynamicSupervisor"]
end
subgraph WorkerCapsule["Worker Capsule (per worker)"]
Starter["Worker.Starter<br>Permanent supervisor"]
Profile["WorkerProfile<br>process/thread"]
Worker["GRPCWorker<br>GenServer"]
Heartbeat["HeartbeatMonitor<br>GenServer"]
end
subgraph External["Python Runtime"]
PythonProc["grpc_server.py<br>Stateless worker"]
end
Pool --> Registries
Pool --> TaskSup
WorkerSup --> Starter
Starter --> Profile
Profile --> Worker
Worker --> Heartbeat
Worker -->|Spawn & Port| PythonProc
PythonProc -->|gRPC state ops| GRPCServer
TaskSup -->|Execute| Worker
Lifecycle --> Worker
Heartbeat --> Lifecycle
Registries --> WorkerSup
style ControlPlane fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#000
style WorkerCapsule fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
style External fill:#fff3e0,stroke:#ef6c00,stroke-width:2px,color:#000
style Pool fill:#c8e6c9,color:#000
style TaskSup fill:#c8e6c9,color:#000
style Registries fill:#c8e6c9,color:#000
style Lifecycle fill:#bbdefb,color:#000
style Starter fill:#bbdefb,color:#000
style Profile fill:#bbdefb,color:#000
style Worker fill:#bbdefb,color:#000
style Heartbeat fill:#bbdefb,color:#000
style PythonProc fill:#ffe0b2,color:#000
style GRPCServer fill:#c8e6c9,color:#000
```

Highlights
- Worker capsules contain all per-worker processes, so supervisor restarts are local.
- LifecycleManager tracks request budgets and TTLs to trigger proactive replacement without blocking the pool.
- HeartbeatMonitor feeds latency and timeout metrics back to LifecycleManager, enabling fast detection of wedged workers.
- OpenTelemetry spans/metrics originate in the Pool and Worker capsule layers; enable them via `config :snakepit, opentelemetry: %{enabled: true}` (Elixir) and install the Python requirements listed in `priv/python/requirements.txt` for cross-language correlation (a configuration sketch follows below).
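A minimal configuration sketch for the telemetry hook above. Only the `opentelemetry: %{enabled: true}` key comes from this document; the Python-side step is a standard `pip install -r priv/python/requirements.txt`, and exporter wiring depends on your own OpenTelemetry collector setup.

```elixir
# config/config.exs — enable Snakepit's OpenTelemetry instrumentation.
# This key is taken from this document; exporter/collector settings live in
# your OpenTelemetry configuration and are not shown here.
config :snakepit, opentelemetry: %{enabled: true}
```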
2. Request Flow Sequence (with session affinity)
```mermaid
sequenceDiagram
participant Client as Client
participant Pool as Pool (GenServer)
participant Registry as Worker Registry
participant TaskSup as Task Supervisor
participant Worker as GRPCWorker
participant Python as Python Process
participant Store as SessionStore (ETS)
Client->>Pool: execute(command, args, session_id)
Pool->>Registry: checkout(session_id)
Registry-->>Pool: worker_id / nil
alt worker available
Pool->>TaskSup: async_nolink(request)
TaskSup->>Worker: execute(command, args, timeout)
else need spin-up
Pool->>WorkerSup: start_worker()
WorkerSup-->>Pool: {:ok, pid}
Pool->>TaskSup: async_nolink(request)
TaskSup->>Worker: execute(command, args, timeout)
end
Worker->>Python: gRPC ExecuteTool
alt variable fetch
Python->>Worker: gRPC GetVariable
Worker->>Store: read(session_id, name)
Store-->>Worker: value
Worker-->>Python: value
end
Python-->>Worker: result
Worker-->>TaskSup: {:ok, result}
TaskSup-->>Client: reply
TaskSup->>Pool: checkin(worker_id)
Pool->>Lifecycle: increment_request(worker_id)
```

Observations
- Pool is never blocked by work execution; it immediately returns after scheduling via the Task Supervisor.
- Registry lookups and ETS-backed session reads stay O(1), keeping queue times predictable even with 100+ workers.
- LifecycleManager is notified about completed requests so TTL/request budgets stay accurate.
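The affinity branch in the sequence above maps onto the session variants of the public API. A minimal sketch, assuming `Snakepit.execute_in_session/4` as described in the project README; the command names are placeholders for whatever tools your adapter exposes.

```elixir
# Requests sharing a session_id are routed to the same worker when possible,
# so Python-side GetVariable calls hit warm, ETS-backed session state.
session_id = "user-42"

# "initialize" and "predict" are hypothetical tool names used for illustration.
{:ok, _} = Snakepit.execute_in_session(session_id, "initialize", %{model: "small"})
{:ok, result} = Snakepit.execute_in_session(session_id, "predict", %{input: "hello"})
```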
3. Health, Recycling, and Restart Loop
```mermaid
stateDiagram-v2
[*] --> Booting
Booting --> Ready: Worker registered
Ready --> Executing: Request dispatched
Executing --> Ready: Completion
Ready --> HeartbeatPing: Monitor ping tick
HeartbeatPing --> Ready: Pong in time
HeartbeatPing --> MissedHeartbeat: Pong timeout
MissedHeartbeat --> Ready: Pong before limit
MissedHeartbeat --> Recycling: Max missed reached
Ready --> Recycling: TTL reached or max requests
Recycling --> Stopping: LifecycleManager requests stop
Stopping --> Restarting: WorkerStarter restarts capsule
Restarting --> Ready: Replacement live
```

Notes
- Heartbeat failures and lifecycle thresholds converge on the same recycling path, ensuring consistent restart semantics.
- When Recycling triggers, `Snakepit.ProcessKiller` cleans up the OS processes before the supervisor brings the capsule back online.
- Restart intensity is governed by `WorkerSupervisor` limits, keeping the cluster stable under heavy churn.
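The recycling thresholds in the state diagram are typically expressed as pool-level settings. The sketch below uses hypothetical key names (`max_requests`, `ttl`, `heartbeat`) purely to illustrate how request budgets, TTLs, and missed-heartbeat limits fit together; they are not confirmed Snakepit option names.

```elixir
# config/config.exs — hypothetical configuration shape; key names are
# illustrative only and do not reflect documented Snakepit options.
config :snakepit,
  pool_config: %{
    pool_size: 8,
    # Request budget: recycle a worker after this many completed requests.
    max_requests: 10_000,
    # TTL: recycle a worker after it has been alive this long (milliseconds).
    ttl: :timer.hours(4),
    # Heartbeat policy driving the MissedHeartbeat --> Recycling transition.
    heartbeat: %{interval_ms: 5_000, max_missed: 3}
  }
```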