Production Deployment


This guide covers deploying Snakepit in production environments, including setup, process management, troubleshooting, and performance tuning.

Pre-Deployment Checklist

Before deploying Snakepit to production:

  • [ ] Python 3.10+ installed (3.13+ for thread-safe adapters)
  • [ ] Virtual environment created with dependencies installed
  • [ ] SNAKEPIT_PYTHON environment variable set (if not using system Python)
  • [ ] gRPC proto files compiled (mix snakepit.setup)
  • [ ] Pool size appropriate for workload
  • [ ] Logging level set (:warning or :error for production; see the minimal sketch after this list)
  • [ ] Telemetry handlers attached
  • [ ] DETS storage directory writable (priv/data/)
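
The pool size and logging items above map to a few configuration keys. A minimal sketch (key names follow the full production example later in this guide; adjust values for your workload):

config :snakepit,
  pooling_enabled: true,
  log_level: :warning,
  pools: [
    %{name: :default, pool_size: System.schedulers_online() * 2}
  ]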

Mix Tasks

mix snakepit.setup

Bootstrap the environment, installing Python dependencies and compiling gRPC protos:

mix snakepit.setup

mix snakepit.doctor

Run environment diagnostics:

mix snakepit.doctor

Checks Python version, gRPC tools, proto files, and virtual environment configuration.

mix snakepit.status

Check pool status and worker health:

mix snakepit.status

Output:

Pool: default (process)
  Workers: 8
  Queued: 0
  Requests: 1523
  Errors: 2

mix snakepit.gen.adapter

Generate a Python adapter skeleton:

mix snakepit.gen.adapter my_adapter

Creates priv/python/my_adapter/ with adapter.py. Configure with:

adapter_args: ["--adapter", "my_adapter.adapter.MyAdapter"]

Process Management

Run IDs and Orphan Detection

Each BEAM instance gets a unique run ID at startup. The run ID identifies which workers belong to the current instance and allows detection of orphaned workers left behind by crashed instances.

Automatic Cleanup on Restart

When Snakepit starts, it automatically:

  1. Identifies processes from previous BEAM runs using run IDs
  2. Sends SIGTERM for graceful shutdown
  3. Sends SIGKILL to unresponsive processes
  4. Cleans up stale registry entries

Only processes matching Snakepit's command-line patterns (grpc_server.py with --snakepit-run-id) are considered. To disable this cleanup:

config :snakepit, :rogue_cleanup, enabled: false

Graceful Shutdown

During application shutdown:

  1. Workers receive SIGTERM (2 second timeout)
  2. Unresponsive workers receive SIGKILL
  3. Final pkill safety net for missed processes

Manual Cleanup

case Snakepit.cleanup() do
  :ok -> Logger.info("Cleanup completed")
  {:timeout, pids} -> Logger.warning("Some processes did not terminate: #{inspect(pids)}")
end

Script Mode (run_as_script/2)

For short-lived scripts and Mix tasks:

defmodule Mix.Tasks.MyApp.ProcessData do
  use Mix.Task

  def run(args) do
    Snakepit.run_as_script(fn ->
      {:ok, result} = Snakepit.execute("process_batch", %{input: args})
      IO.puts("Complete: #{inspect(result)}")
    end, timeout: 30_000, cleanup_timeout: 10_000)
  end
end

Options: :timeout, :shutdown_timeout, :cleanup_timeout, :halt
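
A hedged sketch of the :halt option, which forces the VM to halt once the script body returns (mirroring the SNAKEPIT_SCRIPT_HALT environment variable listed later in this guide):

Snakepit.run_as_script(fn ->
  {:ok, _result} = Snakepit.execute("process_batch", %{input: []})
end, timeout: 30_000, halt: true)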

Common Troubleshooting

Python Process Will Not Start

mix snakepit.doctor      # Check environment
echo $SNAKEPIT_PYTHON    # Verify Python path
python3 -c "import grpc" # Test gRPC import

Solutions: Set SNAKEPIT_PYTHON, verify dependencies, check adapter module path.

gRPC Connection Failures

Snakepit.list_workers()  # Check running workers
Snakepit.get_stats()     # Check pool stats

Solutions: Check for port conflicts, verify firewall rules, ensure proto files are compiled.

Memory Issues

:telemetry.attach("mem", [:snakepit, :worker, :recycled],
  fn _, _, meta, _ -> IO.inspect(meta) end, nil)

Solutions: Increase memory_threshold_mb, reduce pool_size, enable worker TTL.
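
A hedged sketch of the pool-level knobs named in those solutions (:memory_threshold_mb is taken from the text above, :worker_ttl follows the form used in the full production example; verify both keys against your Snakepit version):

%{
  name: :default,
  pool_size: 4,                  # fewer concurrent Python processes
  memory_threshold_mb: 2048,     # recycle a worker once it exceeds this RSS
  worker_ttl: {30, :minutes}     # recycle long-lived workers periodically
}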

Orphaned Processes

ps aux | grep grpc_server.py
pkill -f "grpc_server.py.*--snakepit-run-id"  # Manual cleanup

Orphans are cleaned automatically on next startup.

Performance Tuning

Pool Size Selection

%{name: :default, pool_size: System.schedulers_online() * 2}

Workload    Pool Size
CPU-bound   schedulers * 1-2
I/O-bound   schedulers * 4-8
Mixed       schedulers * 2-4

For thread-profile workers:

%{name: :hpc, worker_profile: :thread, pool_size: 4, threads_per_worker: 16}

Batch Configuration

Large pools can be started in batches so that boot does not launch every Python process at once:

%{
  pool_size: 100,
  startup_batch_size: 8,         # workers started per batch
  startup_batch_delay_ms: 750    # delay between batches, in milliseconds
}

Heartbeat Tuning

heartbeat: %{
  enabled: true,
  ping_interval_ms: 2000,
  timeout_ms: 10000,
  max_missed_heartbeats: 3
}

Environment   Interval   Timeout    Max Missed
Development   5000ms     30000ms    5
Production    2000ms     10000ms    3
Critical      1000ms     5000ms     2

Complete Production Configuration

# config/prod.exs
config :snakepit,
  pooling_enabled: true,
  log_level: :warning,

  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: System.schedulers_online() * 2,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapter.MainAdapter"],
      startup_batch_size: 8,
      startup_batch_delay_ms: 750,
      worker_ttl: {1, :hours},
      worker_max_requests: 10_000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 2000,
        timeout_ms: 10000,
        max_missed_heartbeats: 3
      }
    },
    %{
      name: :hpc,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_args: ["--adapter", "myapp.adapter.MLAdapter"],
      heartbeat: %{enabled: true, ping_interval_ms: 5000}
    }
  ],

  pool_queue_timeout: 5000,
  pool_max_queue_size: 1000,
  pool_startup_timeout: 30000,

  crash_barrier: %{
    enabled: true,
    retry: :idempotent,
    max_retries: 1,
    taint_ms: 5000
  },

  rogue_cleanup: %{enabled: true},

  telemetry_metrics: %{prometheus: %{enabled: true}},

  opentelemetry: %{
    enabled: true,
    exporters: %{otlp: %{endpoint: "http://collector:4318"}}
  }

Environment Variables

Variable                  Description
SNAKEPIT_PYTHON           Path to Python binary
SNAKEPIT_SCRIPT_HALT      Force halt after script completion
SNAKEPIT_OTEL_ENDPOINT    OpenTelemetry collector endpoint

Deployment Recommendations

  1. Use Releases - Build OTP releases for production
  2. Separate Python Env - Use a dedicated virtual environment
  3. Monitor Early - Attach telemetry handlers before starting pools (see the sketch after this list)
  4. Start Conservative - Begin with smaller pool sizes
  5. Test Failure Modes - Verify orphan cleanup and crash recovery
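
For recommendation 3, a minimal sketch of attaching a handler before the pools boot. The [:snakepit, :worker, :recycled] event name comes from the memory troubleshooting example above; treat any other event names as assumptions to verify against the telemetry documentation.

defmodule MyApp.SnakepitTelemetry do
  require Logger

  # Call from Application.start/2 before the Snakepit pools are started.
  def attach do
    :telemetry.attach(
      "myapp-snakepit-recycled",
      [:snakepit, :worker, :recycled],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:snakepit, :worker, :recycled], _measurements, metadata, _config) do
    Logger.warning("Snakepit worker recycled: #{inspect(metadata)}")
  end
end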