Snakepit Architecture
Captured for v0.6.0 (worker profiles & heartbeat workstream)
Overview
Snakepit is an OTP application that brokers Python execution over gRPC. The BEAM side owns lifecycle, state, and health; Python workers stay stateless, disposable, and protocol-focused. Version 0.6.0 introduces dual worker profiles (process vs. thread), proactive recycling, and heartbeat-driven supervision, so the documentation below reflects the current tree.
Top-Level Layout
┌─────────────────────────────────────────────────────────────────────────────┐
│ Snakepit.Supervisor │
├─────────────────────────────────────────────────────────────────────────────┤
│ Base services (always on) │
│ - Snakepit.Bridge.SessionStore - GenServer + ETS backing │
│ - Snakepit.Bridge.ToolRegistry - GenServer for adapter metadata │
│ │
│ Pooling branch (when :pooling_enabled is true) │
│ - GRPC.Server.Supervisor (Snakepit.GRPC.Endpoint / Cowboy) │
│ - Task.Supervisor (Snakepit.TaskSupervisor) │
│ - Snakepit.Pool.Registry / Worker.StarterRegistry / ProcessRegistry │
│ - Snakepit.Pool.WorkerSupervisor (DynamicSupervisor) │
│ - Snakepit.Worker.LifecycleManager (GenServer) │
│ - Snakepit.Pool (GenServer, request router) │
│ - Snakepit.Pool.ApplicationCleanup (GenServer, last child) │
└─────────────────────────────────────────────────────────────────────────────┘
Worker capsule (dynamic child of WorkerSupervisor):
┌────────────────────────────────────────────────────────────┐
│ Snakepit.Pool.WorkerStarter (:permanent Supervisor) │
│ - Worker profile module (Snakepit.WorkerProfile.*) │
│ - Snakepit.GRPCWorker (:transient GenServer) │
│ - Snakepit.HeartbeatMonitor (GenServer, optional) │
└────────────────────────────────────────────────────────────┘

Python processes are launched by each GRPCWorker and connect back to the BEAM gRPC endpoint for stateful operations (sessions, variables, telemetry).
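For orientation, the pooling branch can be approximated with ordinary OTP primitives. The sketch below is illustrative only: child order follows the diagram above, but the supervision strategies and child options are assumptions; the authoritative tree lives in `Snakepit.Application`.

```elixir
# Minimal sketch of the pooling branch (strategies and options are assumptions).
children = [
  Snakepit.Bridge.SessionStore,
  Snakepit.Bridge.ToolRegistry,
  {Task.Supervisor, name: Snakepit.TaskSupervisor},
  {Registry, keys: :unique, name: Snakepit.Pool.Registry},
  {DynamicSupervisor, name: Snakepit.Pool.WorkerSupervisor, strategy: :one_for_one},
  Snakepit.Worker.LifecycleManager,
  Snakepit.Pool,
  # Last child: first to terminate on shutdown, so it can reap stragglers.
  Snakepit.Pool.ApplicationCleanup
]

Supervisor.start_link(children, strategy: :one_for_one, name: Snakepit.Supervisor)
```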
Control Plane Components (Elixir)
Snakepit.Application
- Applies Python thread limits before the tree boots (prevents fork bombs with large pools).
- Starts base services, optionally the pooling branch, and records start/stop timestamps for diagnostics.
Snakepit.Bridge.SessionStore (lib/snakepit/bridge/session_store.ex)
- GenServer + ETS pair that owns all stateful session data.
- ETS tables (`:snakepit_sessions`, `:snakepit_sessions_global_programs`) are created with read/write concurrency and decentralized counters (sketched below).
- Provides TTL expiration, atomic operations, and selective cleanup for high-churn pools.
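A minimal sketch of the table options described above; the table type and exact flags used in `session_store.ex` may differ.

```elixir
# Named, public table with concurrent reads/writes and decentralized counters
# (OTP 23+) so high-churn pools do not contend on a single lock or counter.
:ets.new(:snakepit_sessions, [
  :set,
  :named_table,
  :public,
  read_concurrency: true,
  write_concurrency: true,
  decentralized_counters: true
])
```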
Snakepit.Bridge.ToolRegistry
- Tracks registered tools/adapters exposed to Python workers.
- Keeps metadata so GRPCWorker can answer `Describe` calls without touching disk.
Snakepit.GRPC.Endpoint
- Cowboy-based endpoint that Python workers call for session/variable RPCs.
- Runs under `GRPC.Server.Supervisor` with tuned acceptor/backlog values to survive 200+ concurrent worker startups.
Registries (Snakepit.Pool.Registry, Snakepit.Pool.ProcessRegistry, Snakepit.Pool.Worker.StarterRegistry)
- Provide O(1) lookup for worker routing, track external OS PIDs/run IDs, and ensure cleanup routines can map resources back to the current BEAM instance.
- `ProcessRegistry` uses ETS to store run IDs so `Snakepit.ProcessKiller` and `Snakepit.Pool.ApplicationCleanup` can reap orphans deterministically.
- `Pool.Registry` now keeps authoritative metadata (`worker_module`, `pool_identifier`, `pool_name`) for every worker. `Snakepit.GRPCWorker` updates the registry as soon as it boots so callers such as `Pool.extract_pool_name_from_worker_id/1` and reverse lookups never have to guess based on worker ID formats.
- Helper APIs like `Pool.Registry.fetch_worker/1` centralize pid + metadata lookups so higher layers (pool, bridge server, worker profiles) no longer reach into the raw `Registry` tables. This keeps metadata normalized and ready for diagnostics.
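The sketch below shows what such a helper can look like when built on Elixir's `Registry`; the function body and return shape are hypothetical, not the actual implementation of `Pool.Registry.fetch_worker/1`.

```elixir
defmodule Example.RegistryLookup do
  # Return the worker pid plus the metadata map stored at registration time,
  # so callers never pattern-match on raw Registry tuples themselves.
  def fetch_worker(worker_id) do
    case Registry.lookup(Snakepit.Pool.Registry, worker_id) do
      [{pid, meta}] -> {:ok, pid, meta}
      [] -> {:error, :not_found}
    end
  end
end
```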
Snakepit.Pool
- GenServer request router with queueing, session affinity, and Task.Supervisor integration.
- Starts workers concurrently on boot using async streams; runtime execution uses `Task.Supervisor.async_nolink/2` so the pool remains non-blocking (see the dispatch sketch below).
- Emits telemetry per request lifecycle; integrates with `Snakepit.Worker.LifecycleManager` for request-count tracking.
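A simplified sketch of the non-blocking dispatch pattern. The `{:execute, ...}` message shape and the module/function names are illustrative assumptions; the real router also handles queueing, session affinity, and worker checkin/checkout.

```elixir
defmodule Example.Dispatch do
  # Run the worker call under the task supervisor so a crash or timeout in the
  # worker does not take the pool process down with it.
  def dispatch(worker_pid, command, args, timeout \\ 5_000) do
    Task.Supervisor.async_nolink(Snakepit.TaskSupervisor, fn ->
      GenServer.call(worker_pid, {:execute, command, args}, timeout)
    end)
  end
end
```

The pool stays responsive: the task result (or its failure) arrives later as a message and can be collected with `Task.yield/2` or a `handle_info/2` clause.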
Snakepit.Worker.LifecycleManager
- Tracks worker TTLs, request counts, and optional memory thresholds.
- Builds a `%Snakepit.Worker.LifecycleConfig{}` for every worker so adapter modules, env overrides, and profile selection are explicit. Replacement workers inherit the exact same spec (minus `worker_id`), which prevents subtle drift during recycling.
- Monitors worker pids and triggers graceful recycling via the pool when budgets are exceeded. Memory recycling samples the BEAM `Snakepit.GRPCWorker` process via `:get_memory_usage` (not the Python child) and compares the result to `memory_threshold_mb`, emitting `[:snakepit, :worker, :recycled]` telemetry with the measured MB (see the sketch below).
- Coordinates periodic health checks across the pool and emits telemetry (`[:snakepit, :worker, :recycled]`, etc.) for dashboards.
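The memory budget check can be pictured as below. This is a simplified sketch that reads the BEAM process directly with `Process.info/2`, whereas the real manager asks `Snakepit.GRPCWorker` through its `:get_memory_usage` call; the measurement and metadata keys are assumptions.

```elixir
defmodule Example.MemoryCheck do
  # Compare a BEAM process's memory footprint against a budget in MB and emit
  # the recycle telemetry event when the budget is exceeded.
  def over_budget?(worker_pid, memory_threshold_mb) do
    case Process.info(worker_pid, :memory) do
      {:memory, bytes} when bytes > memory_threshold_mb * 1_048_576 ->
        :telemetry.execute(
          [:snakepit, :worker, :recycled],
          %{memory_mb: bytes / 1_048_576},
          %{reason: :memory_threshold}
        )

        true

      _ ->
        false
    end
  end
end
```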
Snakepit.Pool.ApplicationCleanup
- Last child in the pool branch; terminates first on shutdown.
- Uses `Snakepit.ProcessKiller` to terminate lingering Python processes based on recorded run IDs.
- Ensures the BEAM exits cleanly even if upstream supervisors crash.
Execution Capsule (Per Worker)
Snakepit.Pool.WorkerStarter
- Permanent supervisor that owns a transient worker.
- Provides a hook to extend the capsule with additional children (profilers, monitors, etc.).
- Makes restarts local: killing the starter tears down the worker, heartbeat monitor, and any profile-specific state atomically.
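The capsule pattern boils down to a permanent supervisor wrapping a transient worker. The module below is a generic sketch under that assumption, not the real `Snakepit.Pool.WorkerStarter`; child names and options are illustrative.

```elixir
defmodule Example.WorkerStarter do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    children = [
      # Transient: a normal exit is final, an abnormal exit restarts the worker.
      Supervisor.child_spec({Snakepit.GRPCWorker, opts}, restart: :transient)
      # A HeartbeatMonitor (or profiler) child could be appended here.
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Because the starter itself is `:permanent` under `WorkerSupervisor`, killing it tears down and rebuilds the entire capsule in one step.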
Worker profiles (Snakepit.WorkerProfile.*)
- Abstract the strategy for request capacity:
  - `Process` profile: one request per Python process (capacity = 1).
  - `Thread` profile: multi-request concurrency inside a single Python process.
- Profiles decide how to spawn/stop workers and how to report health/capacity back to the pool.
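A behaviour is the natural way to express that contract. The callbacks below are a hypothetical sketch of what a profile must answer; the real callback names in `Snakepit.WorkerProfile.*` may differ.

```elixir
defmodule Example.WorkerProfile do
  # Spawn and stop the underlying Python runtime for one worker capsule.
  @callback start_worker(opts :: keyword()) :: {:ok, pid()} | {:error, term()}
  @callback stop_worker(worker :: pid()) :: :ok

  # Report how many in-flight requests the worker can take (1 for Process,
  # N for Thread) and whether it is currently healthy.
  @callback capacity(worker :: pid()) :: pos_integer()
  @callback healthy?(worker :: pid()) :: boolean()
end
```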
Snakepit.GRPCWorker
- GenServer that launches the Python gRPC server, manages the Port, and bridges execute/stream calls.
- Maintains worker metadata (run ID, OS PID, adapter options) and cooperates with LifecycleManager/HeartbeatMonitor for health.
- Interacts with registries for lookup and with `Snakepit.ProcessKiller` for escalation on shutdown.
Snakepit.HeartbeatMonitor
- Optional GenServer started per worker based on pool configuration.
- Periodically invokes a ping callback (usually a gRPC health RPC) and tracks missed responses.
- Signals the WorkerStarter (by exiting the worker) when thresholds are breached, allowing the supervisor to restart the capsule.
- `dependent: true` (default) means heartbeat failures terminate the worker; `dependent: false` keeps the worker alive and simply logs/retries so you can debug without killing the Python process.
- The BEAM heartbeat monitor is authoritative. Python's heartbeat client only exits when it cannot reach the BEAM control plane for an extended period, so always change the Elixir `dependent` flag rather than relying on Python behaviour to keep a capsule running.
- Heartbeat settings are shipped to Python via the `SNAKEPIT_HEARTBEAT_CONFIG` environment variable. Python workers treat it as hints (ping interval, timeout, dependent flag) so both sides agree on policy. Today this is a push-style ping from Elixir into Python; future control-plane work will layer richer distributed health signals on top.
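Heartbeat tuning lives in a config map. The snippet below is illustrative: only the `dependent` flag is documented here, and the remaining key names and values are assumptions to check against the schema in `Snakepit.Config` before use.

```elixir
# config/config.exs — illustrative heartbeat tuning (key names are assumptions).
import Config

config :snakepit,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 2_000,
    timeout_ms: 5_000,
    max_missed_pings: 3,
    dependent: true
  }
```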
Snakepit.ProcessKiller
- Utility module for POSIX-compliant termination of external processes.
- Provides `kill_with_escalation/1` semantics (SIGTERM → SIGKILL, sketched below) and discovery helpers (`kill_by_run_id/1`).
- Used by ApplicationCleanup, LifecycleManager, and GRPCWorker terminate callbacks.
- Rogue cleanup is intentionally scoped: only commands containing `grpc_server.py` or `grpc_server_threaded.py` and a `--snakepit-run-id`/`--run-id` marker are eligible. Operators can disable the startup sweep via `config :snakepit, :rogue_cleanup, enabled: false` when sharing hosts with third-party services.
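A simplified version of the escalation pattern is shown below, using plain POSIX `kill`; the real `Snakepit.ProcessKiller` additionally resolves OS PIDs by run ID and applies the command-line scoping described above.

```elixir
defmodule Example.KillWithEscalation do
  # Send SIGTERM, give the process a grace period, then force-kill if it is
  # still alive (signal 0 only probes for existence, it sends nothing).
  def kill_with_escalation(os_pid, grace_ms \\ 2_000) do
    pid_arg = Integer.to_string(os_pid)
    System.cmd("kill", ["-TERM", pid_arg])
    Process.sleep(grace_ms)

    case System.cmd("kill", ["-0", pid_arg], stderr_to_stdout: true) do
      {_out, 0} -> System.cmd("kill", ["-KILL", pid_arg])
      _ -> :ok
    end

    :ok
  end
end
```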
Kill Chain Summary
- `Snakepit.Pool` enqueues the client request and picks a worker.
- `Snakepit.GRPCWorker` executes the adapter call, keeps registry metadata fresh, and reports lifecycle stats.
- `Snakepit.Worker.LifecycleManager` watches TTL/request/memory budgets and asks the appropriate profile to recycle workers when necessary.
- `Snakepit.Pool.ProcessRegistry` persists run IDs + OS PIDs so BEAM restarts can see orphaned Python processes.
- `Snakepit.ProcessKiller` and `Snakepit.Pool.ApplicationCleanup` reap any remaining processes that belong to older run IDs before the VM exits.
Python Bridge
- `priv/python/grpc_server.py`: Stateless gRPC service that exposes Snakepit functionality to user adapters. It forwards session operations to Elixir and defers telemetry to the BEAM.
- `priv/python/snakepit_bridge/session_context.py`: Client helper that caches variables locally with TTL invalidation. Heartbeat configuration is passed via `SNAKEPIT_HEARTBEAT_CONFIG`, so any schema changes must be made in both Elixir (`Snakepit.Config`) and Python (`snakepit_bridge.heartbeat.HeartbeatConfig`).
- `priv/python/snakepit_bridge/adapters/*`: User-defined logic; receives a `SessionContext` and uses generated stubs from `snakepit_bridge.proto`.
- Python code is bundled with the release; regeneration happens via `make proto-python` or `priv/python/generate_grpc.sh`.
Observability & Telemetry
- `Snakepit.Telemetry` defines events for worker lifecycle, pool throughput, queue depth, and heartbeat outcomes (a minimal consumer is sketched after this list).
- `Snakepit.TelemetryMetrics.metrics/0` feeds Prometheus via `TelemetryMetricsPrometheus.Core` once `config :snakepit, telemetry_metrics: %{prometheus: %{enabled: true}}` is set. By default metrics remain disabled so unit tests can run without a reporter.
- `Snakepit.Telemetry.OpenTelemetry` boots spans when `config :snakepit, opentelemetry: %{enabled: true}`. Local developers typically enable the console exporter (`export SNAKEPIT_OTEL_CONSOLE=true`) while CI and production point `SNAKEPIT_OTEL_ENDPOINT` at an OTLP collector. The Elixir configuration honours exporter toggles (`:otlp`, `:console`) and resource attributes (`service.name`, etc.).
- Correlation IDs flow through `lib/snakepit/telemetry/correlation.ex` and are injected into gRPC metadata. The Python bridge reads the `x-snakepit-correlation-id` header, sets the active span when `opentelemetry-sdk` is available, and mirrors the ID into structured logs so traces and logs line up across languages.
- Python workers inherit the interpreter configured by `Snakepit.Adapters.GRPCPython`. Set `SNAKEPIT_PYTHON=$PWD/.venv/bin/python3` (or add the path to `config/test.exs`) so CI/dev shells use the virtualenv that contains `grpcio`, `opentelemetry-*`, and `pytest`. `PYTHONPATH=priv/python` keeps the bridge packages importable for pytest and for the runtime shims.
- Validation recipe:
  - `mix test --color --trace` – asserts Elixir spans/metrics handlers attach and that worker start-up succeeds with OTEL deps present.
  - `PYTHONPATH=priv/python .venv/bin/pytest priv/python/tests -q` – exercises span helpers, correlation filters, and the threaded bridge parity tests.
  - (Optional) `curl http://localhost:9568/metrics` – once Prometheus is enabled, confirms heartbeat counters and pool gauges surface to `/metrics`.
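An ad-hoc consumer for the events above can be attached with plain `:telemetry`; the handler id and log format below are illustrative, and production setups usually forward to `TelemetryMetrics` reporters instead.

```elixir
require Logger

# Log every worker recycle with its measurements and metadata.
:telemetry.attach(
  "snakepit-recycle-logger",
  [:snakepit, :worker, :recycled],
  fn _event, measurements, metadata, _config ->
    Logger.info("worker recycled: #{inspect(measurements)} #{inspect(metadata)}")
  end,
  nil
)
```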
Design Principles (Current)
- Stateless Python Workers – All durable state lives in `Snakepit.Bridge.SessionStore`; Python can be restarted at will.
- Layered Supervision – Control-plane supervisors manage pools and registries, while WorkerStarter encapsulates per-worker processes.
- Config-Driven Behaviour – Pooling can be disabled (tests), heartbeat tuning lives in config maps, and worker profiles are pluggable.
- Proactive Health Management – Heartbeats, lifecycle recycling, and deterministic kill routines keep long-running clusters stable.
- O(1) Routing & Concurrency – Registries and ETS deliver constant-time lookup so the pool stays responsive at high request volume.
Worker Lifecycle (Happy Path)
- Pool boot requests `size` workers from `WorkerSupervisor`.
- `WorkerStarter` launches the configured profile and creates a GRPCWorker.
- GRPCWorker spawns the Python process, registers with registries, and (optionally) starts a HeartbeatMonitor.
- LifecycleManager begins tracking TTL/request budgets.
- Requests flow: `Snakepit.Pool` → `Task.Supervisor` → GRPCWorker → Python (see the usage sketch below).
- Completion events increment lifecycle counters; heartbeats maintain health status.
- Recycling or heartbeat failures call back into the pool to spin up a fresh capsule.
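From client code the whole path collapses into one call. The snippet below is a usage sketch that assumes the top-level `Snakepit.execute/2` helper and an adapter exposing a `"ping"` command; substitute your adapter's command names and argument map.

```elixir
# The call is queued by Snakepit.Pool, dispatched under Task.Supervisor,
# executed by a GRPCWorker's Python process, and the result flows back here.
case Snakepit.execute("ping", %{}) do
  {:ok, result} -> IO.inspect(result, label: "python result")
  {:error, reason} -> IO.inspect(reason, label: "request failed")
end
```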
Protocol Overview
- gRPC definitions live in `priv/proto/snakepit_bridge.proto`.
- Core RPCs: `Execute`, `ExecuteStream`, `RegisterTool`, `GetVariable`, `SetVariable`, `InitializeSession`, `CleanupSession`, `GetWorkerInfo`, `Heartbeat`.
- Binary fields are used for large payloads; Elixir encodes via ETF and Python falls back to pickle (see the ETF sketch below).
- Generated code is versioned in `lib/snakepit/grpc/generated/` (Elixir) and `priv/python/generated/` (Python).
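On the Elixir side, packing a large payload into a binary field is just External Term Format; the round trip below is standard library usage, while the field name and payload shape are illustrative.

```elixir
# Encode a large term to ETF for a binary protobuf field and decode it back.
payload = %{rows: Enum.to_list(1..10_000)}
binary = :erlang.term_to_binary(payload)

# ETF round-trips losslessly on the BEAM; Python treats the bytes as opaque
# unless it explicitly decodes them.
^payload = :erlang.binary_to_term(binary)
```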
Operational Notes
- Toggle `:pooling_enabled` for integration tests or single-worker scenarios (see the config sketch below).
- Dialyzer PLTs live in `priv/plts`; run `mix dialyzer` after changing protocol or worker APIs.
- When introducing new supervisor children, update `lib/snakepit/supervisor_tree.md` and corresponding docs in `docs/20251018/`.
- Always regenerate gRPC stubs in lockstep (`mix grpc.gen`, `make proto-python`) before cutting a release.
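A minimal configuration sketch: `:pooling_enabled` and `Snakepit.Adapters.GRPCPython` come from this document, while the `pool_config`/`pool_size` keys and values are assumptions to check against your installed Snakepit version.

```elixir
# config/config.exs — illustrative pool settings.
import Config

config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_config: %{pool_size: 4}
```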