Production Deployment
This guide covers deploying Snakepit in production environments, including setup, process management, troubleshooting, and performance tuning.
Pre-Deployment Checklist
Before deploying Snakepit to production:
- [ ] Python 3.10+ installed (3.13+ for thread-safe adapters)
- [ ] Virtual environment created with dependencies installed
- [ ] SNAKEPIT_PYTHON environment variable set (if not using system Python)
- [ ] gRPC proto files compiled (mix snakepit.setup)
- [ ] Pool size appropriate for workload
- [ ] Logging level set (:warning or :error for production)
- [ ] Telemetry handlers attached
- [ ] DETS storage directory writable (priv/data/)
Mix Tasks
mix snakepit.setup
Bootstrap the environment, installing Python dependencies and compiling gRPC protos:
mix snakepit.setup
mix snakepit.doctor
Run environment diagnostics:
mix snakepit.doctor
Checks Python version, gRPC tools, proto files, and virtual environment configuration.
mix snakepit.status
Check pool status and worker health:
mix snakepit.status
Output:
Pool: default (process)
Workers: 8
Queued: 0
Requests: 1523
Errors: 2
mix snakepit.gen.adapter
Generate a Python adapter skeleton:
mix snakepit.gen.adapter my_adapter
Creates priv/python/my_adapter/ with adapter.py. Configure with:
adapter_args: ["--adapter", "my_adapter.adapter.MyAdapter"]
Process Management
Run IDs and Orphan Detection
Each BEAM instance gets a unique run ID on startup, enabling identification of workers belonging to the current process and detection of orphaned workers from crashed instances.
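The run-ID idea can be illustrated with a small sketch. This is not Snakepit's internal implementation: the module name RunIdSketch, the ID format, and the parsing regex are all assumptions made for illustration. Only the grpc_server.py script name and the --snakepit-run-id flag come from this guide.

```elixir
defmodule RunIdSketch do
  # Hypothetical helper: generate a random 12-character hex ID once per BEAM start.
  def new_run_id do
    :crypto.strong_rand_bytes(6) |> Base.encode16(case: :lower)
  end

  # Extract the value that follows --snakepit-run-id in a process command line.
  # Accepts both "--snakepit-run-id abc" and "--snakepit-run-id=abc" (assumed forms).
  def run_id_from_cmdline(cmdline) do
    case Regex.run(~r/--snakepit-run-id[= ](\S+)/, cmdline) do
      [_, id] -> {:ok, id}
      nil -> :error
    end
  end

  # A process counts as an orphan only if it looks like a Snakepit worker
  # and carries a run ID different from the current one.
  def orphan?(cmdline, current_run_id) do
    with true <- String.contains?(cmdline, "grpc_server.py"),
         {:ok, id} <- run_id_from_cmdline(cmdline) do
      id != current_run_id
    else
      _ -> false
    end
  end
end
```

Requiring both the script name and the flag keeps unrelated Python processes out of scope, and comparing against the current run ID leaves live workers alone.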
Automatic Cleanup on Restart
When Snakepit starts, it automatically:
- Identifies processes from previous BEAM runs using run IDs
- Sends SIGTERM for graceful shutdown
- Sends SIGKILL to unresponsive processes
- Cleans up stale registry entries
Only processes matching Snakepit's command-line patterns (grpc_server.py with --snakepit-run-id) are considered. To disable:
config :snakepit, :rogue_cleanup, enabled: false
Graceful Shutdown
During application shutdown:
- Workers receive SIGTERM and have 2 seconds to exit
- Unresponsive workers receive SIGKILL
- A final pkill pass acts as a safety net for any missed processes
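The SIGTERM-then-SIGKILL escalation above can be sketched as follows. This is a minimal illustration and not Snakepit's shutdown code: ShutdownSketch, its arguments, and the polling interval are hypothetical. The signal and liveness checks are passed in as functions so the flow can be tried without real OS processes.

```elixir
defmodule ShutdownSketch do
  # Sends TERM, waits up to timeout_ms for the process to exit, then sends KILL.
  # signal_fun.(os_pid, name) delivers a signal; alive_fun.(os_pid) reports liveness.
  def terminate(os_pid, signal_fun, alive_fun, timeout_ms \\ 2_000, poll_ms \\ 100) do
    signal_fun.(os_pid, "TERM")
    deadline = System.monotonic_time(:millisecond) + timeout_ms

    if wait_for_exit(os_pid, alive_fun, deadline, poll_ms) do
      :terminated
    else
      signal_fun.(os_pid, "KILL")
      :killed
    end
  end

  # Polls until the process is gone (true) or the deadline passes (false).
  defp wait_for_exit(os_pid, alive_fun, deadline, poll_ms) do
    cond do
      not alive_fun.(os_pid) ->
        true

      System.monotonic_time(:millisecond) >= deadline ->
        false

      true ->
        Process.sleep(poll_ms)
        wait_for_exit(os_pid, alive_fun, deadline, poll_ms)
    end
  end
end
```

On a Unix host, signal_fun could wrap System.cmd("kill", ["-TERM", Integer.to_string(os_pid)]) and alive_fun could check the exit status of kill -0. Both are assumptions about the deployment environment.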
Manual Cleanup
require Logger
case Snakepit.cleanup() do
  :ok -> Logger.info("Cleanup completed")
  # Log the returned pids so stragglers can be inspected
  {:timeout, pids} -> Logger.warning("Some processes did not terminate: #{inspect(pids)}")
end
Script Mode (run_as_script/2)
For short-lived scripts and Mix tasks:
defmodule Mix.Tasks.MyApp.ProcessData do
  use Mix.Task
  def run(args) do
    Snakepit.run_as_script(fn ->
      {:ok, result} = Snakepit.execute("process_batch", %{input: args})
      IO.puts("Complete: #{inspect(result)}")
    end, timeout: 30_000, cleanup_timeout: 10_000)
  end
end
Options: :timeout, :shutdown_timeout, :cleanup_timeout, :halt
Common Troubleshooting
Python Process Will Not Start
mix snakepit.doctor # Check environment
echo $SNAKEPIT_PYTHON # Verify Python path
python3 -c "import grpc" # Test gRPC import
Solutions: Set SNAKEPIT_PYTHON, verify dependencies, check adapter module path.
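The same checks can be run from Elixir, for example in a release health check. The sketch below is an assumption-laden illustration: PreflightSketch and its function names are hypothetical, and falling back to python3 when SNAKEPIT_PYTHON is unset is an assumed policy, not documented Snakepit behavior.

```elixir
defmodule PreflightSketch do
  # Resolve the interpreter: an explicit SNAKEPIT_PYTHON value wins,
  # otherwise fall back to python3 on PATH (assumed fallback).
  def resolve_python(configured \\ System.get_env("SNAKEPIT_PYTHON")) do
    candidate = if configured in [nil, ""], do: "python3", else: configured

    case System.find_executable(candidate) do
      nil -> {:error, {:python_not_found, candidate}}
      path -> {:ok, path}
    end
  end

  # Runs the same import test as: python3 -c "import grpc"
  def grpc_importable?(python_path) do
    {_output, status} =
      System.cmd(python_path, ["-c", "import grpc"], stderr_to_stdout: true)

    status == 0
  end
end
```

Calling PreflightSketch.resolve_python() and then PreflightSketch.grpc_importable?/1 on the returned path separates a missing interpreter from a missing dependency.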
gRPC Connection Failures
Snakepit.list_workers() # Check running workers
Snakepit.get_stats() # Check pool stats
Solutions: Check port conflicts, firewall rules, compile proto files.
Memory Issues
:telemetry.attach("mem", [:snakepit, :worker, :recycled],
  fn _, _, meta, _ -> IO.inspect(meta) end, nil)
Solutions: Increase memory_threshold_mb, reduce pool_size, enable worker TTL.
Orphaned Processes
ps aux | grep grpc_server.py
pkill -f "grpc_server.py.*--snakepit-run-id" # Manual cleanup
Orphans are cleaned automatically on next startup.
Performance Tuning
Pool Size Selection
%{name: :default, pool_size: System.schedulers_online() * 2}
| Workload | Pool Size |
|---|---|
| CPU-bound | schedulers * 1-2 |
| I/O-bound | schedulers * 4-8 |
| Mixed | schedulers * 2-4 |
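As a worked example of the table above, the helper below turns a workload type into a pool size range. PoolSizingSketch and the workload atoms are hypothetical names. Only the multipliers come from the table, and choosing the low end as a starting point is a suggestion in line with starting conservatively.

```elixir
defmodule PoolSizingSketch do
  # Multiplier ranges from the table, as {low, high}.
  @multipliers %{cpu_bound: {1, 2}, io_bound: {4, 8}, mixed: {2, 4}}

  # Returns {min, max} pool sizes for a workload type.
  def range(workload, schedulers \\ System.schedulers_online()) do
    {low, high} = Map.fetch!(@multipliers, workload)
    {schedulers * low, schedulers * high}
  end

  # Conservative starting point: the low end of the range.
  def starting_size(workload, schedulers \\ System.schedulers_online()) do
    workload |> range(schedulers) |> elem(0)
  end
end
```

On an 8-scheduler host, PoolSizingSketch.range(:io_bound, 8) gives {32, 64}, so a first deployment might start at 32 and grow only if telemetry shows sustained queueing.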
For thread-profile workers:
%{name: :hpc, worker_profile: :thread, pool_size: 4, threads_per_worker: 16}
Batch Configuration
%{
  pool_size: 100,
  startup_batch_size: 8,
  startup_batch_delay_ms: 750
}
Heartbeat Tuning
heartbeat: %{
  enabled: true,
  ping_interval_ms: 2000,
  timeout_ms: 10000,
  max_missed_heartbeats: 3
}
| Environment | Interval | Timeout | Max Missed |
|---|---|---|---|
| Development | 5000ms | 30000ms | 5 |
| Production | 2000ms | 10000ms | 3 |
| Critical | 1000ms | 5000ms | 2 |
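To compare the profiles, it can help to estimate how long a hung worker might go unnoticed. The sketch below assumes a worker is flagged after max_missed_heartbeats consecutive unanswered pings, one ping per interval. That reading of the settings is an assumption, and the module name HeartbeatSketch is hypothetical. How timeout_ms interacts with the count is not specified here, so it is left out of the estimate.

```elixir
defmodule HeartbeatSketch do
  # Assumption: detection takes roughly interval * max_missed milliseconds.
  def detection_window_ms(%{ping_interval_ms: interval, max_missed_heartbeats: missed}) do
    interval * missed
  end
end

# Values copied from the table above.
profiles = %{
  development: %{ping_interval_ms: 5000, timeout_ms: 30_000, max_missed_heartbeats: 5},
  production: %{ping_interval_ms: 2000, timeout_ms: 10_000, max_missed_heartbeats: 3},
  critical: %{ping_interval_ms: 1000, timeout_ms: 5000, max_missed_heartbeats: 2}
}

for {name, cfg} <- profiles do
  IO.puts("#{name}: about #{HeartbeatSketch.detection_window_ms(cfg)} ms")
end
```

Under this assumption the production profile detects a hung worker in roughly 6 seconds, versus 25 seconds in development and 2 seconds in the critical profile.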
Complete Production Configuration
# config/prod.exs
config :snakepit,
  pooling_enabled: true,
  log_level: :warning,
  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: System.schedulers_online() * 2,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapter.MainAdapter"],
      startup_batch_size: 8,
      startup_batch_delay_ms: 750,
      worker_ttl: {1, :hours},
      worker_max_requests: 10_000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 2000,
        timeout_ms: 10000,
        max_missed_heartbeats: 3
      }
    },
    %{
      name: :hpc,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_args: ["--adapter", "myapp.adapter.MLAdapter"],
      heartbeat: %{enabled: true, ping_interval_ms: 5000}
    }
  ],
  pool_queue_timeout: 5000,
  pool_max_queue_size: 1000,
  pool_startup_timeout: 30000,
  crash_barrier: %{
    enabled: true,
    retry: :idempotent,
    max_retries: 1,
    taint_ms: 5000
  },
  rogue_cleanup: %{enabled: true},
  telemetry_metrics: %{prometheus: %{enabled: true}},
  opentelemetry: %{
    enabled: true,
    exporters: %{otlp: %{endpoint: "http://collector:4318"}}
  }
Environment Variables
| Variable | Description |
|---|---|
| SNAKEPIT_PYTHON | Path to Python binary |
| SNAKEPIT_SCRIPT_HALT | Force halt after script completion |
| SNAKEPIT_OTEL_ENDPOINT | OpenTelemetry collector endpoint |
Deployment Recommendations
- Use Releases - Build OTP releases for production
- Separate Python Env - Use a dedicated virtual environment
- Monitor Early - Attach telemetry handlers before starting pools
- Start Conservative - Begin with smaller pool sizes
- Test Failure Modes - Verify orphan cleanup and crash recovery