Production Deployment

This guide covers deploying Snakepit in production environments, including setup, process management, troubleshooting, and performance tuning.

Pre-Deployment Checklist

Before deploying Snakepit to production:

[ ] Python 3.10+ installed (3.13+ for thread-safe adapters)
[ ] Virtual environment created with dependencies installed
[ ] SNAKEPIT_PYTHON environment variable set (if not using system Python)
[ ] gRPC proto files compiled (mix snakepit.setup)
[ ] Pool size appropriate for workload
[ ] Logging level set (:warning or :error for production)
[ ] Telemetry handlers attached
[ ] DETS storage directory writable (priv/data/)
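
Two of the checklist items (the Python path and the writable DETS directory) can be verified mechanically before boot. The following is a minimal sketch, not part of Snakepit: the `PreflightCheck` module and its function names are hypothetical, and it only checks that the configured Python path points at a regular file and that the data directory is writable.

```elixir
defmodule PreflightCheck do
  @moduledoc "Hypothetical pre-deployment checks; not part of Snakepit."

  # Returns :ok, or {:error, reasons} listing every failed check.
  def run(python_path, data_dir) do
    checks = [
      {python_present?(python_path), "Python binary not found: #{inspect(python_path)}"},
      {writable_dir?(data_dir), "directory not writable: #{data_dir}"}
    ]

    # Keep only the reasons whose check returned false.
    case for({false, reason} <- checks, do: reason) do
      [] -> :ok
      reasons -> {:error, reasons}
    end
  end

  # With no explicit path, fall back to looking for python3 on PATH.
  defp python_present?(nil), do: System.find_executable("python3") != nil
  defp python_present?(path), do: File.regular?(path)

  defp writable_dir?(dir) do
    case File.stat(dir) do
      {:ok, %File.Stat{type: :directory, access: access}} -> access in [:write, :read_write]
      _ -> false
    end
  end
end
```

A typical call would be `PreflightCheck.run(System.get_env("SNAKEPIT_PYTHON"), "priv/data")`. `mix snakepit.doctor` remains the authoritative diagnostic; this sketch only covers what can be checked from plain Elixir.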

Mix Tasks

mix snakepit.setup

Bootstrap the environment, installing Python dependencies and compiling gRPC protos:

mix snakepit.setup

mix snakepit.doctor

Run environment diagnostics:

mix snakepit.doctor

Checks Python version, gRPC tools, proto files, and virtual environment configuration.

mix snakepit.status

Check pool status and worker health:

mix snakepit.status

Output:

Pool: default (process)
  Workers: 8
  Queued: 0
  Requests: 1523
  Errors: 2

mix snakepit.gen.adapter

Generate a Python adapter skeleton:

mix snakepit.gen.adapter my_adapter

Creates priv/python/my_adapter/ with adapter.py. Configure with:

adapter_args: ["--adapter", "my_adapter.adapter.MyAdapter"]

Process Management

Run IDs and Orphan Detection

Each BEAM instance gets a unique run ID on startup, enabling identification of workers belonging to the current process and detection of orphaned workers from crashed instances.

When several Snakepit instances run concurrently from the same codebase on the same host, configure both:

instance_name to identify the shared environment.
instance_token with a unique value per running VM.

This prevents one live instance from classifying another live instance's workers as rogue/orphan during cleanup scans.

config :snakepit,
  instance_name: "my-app",
  instance_token: "deploy-slot-a"

For scripts, use environment variables:

SNAKEPIT_INSTANCE_NAME=my-app SNAKEPIT_INSTANCE_TOKEN=script_1 mix run --no-start script_a.exs
SNAKEPIT_INSTANCE_NAME=my-app SNAKEPIT_INSTANCE_TOKEN=script_2 mix run --no-start script_b.exs
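
When no deploy slot name is available, the token still has to differ between VMs that are alive at the same time. One possible approach, sketched below, is to fall back to the operating-system PID of the BEAM, which is unique among running processes on a host. `InstanceToken` is a hypothetical helper, not a Snakepit API.

```elixir
defmodule InstanceToken do
  @moduledoc "Hypothetical helper for choosing an instance_token; not part of Snakepit."

  # Prefer an explicitly provided token (for example from SNAKEPIT_INSTANCE_TOKEN).
  def resolve(value) when is_binary(value) and value != "", do: value

  # Otherwise derive one from the OS PID of this VM, which no other live VM shares.
  def resolve(_missing), do: "vm-" <> System.pid()
end
```

The result of `InstanceToken.resolve(System.get_env("SNAKEPIT_INSTANCE_TOKEN"))` would then be supplied as `instance_token`. Application modules are not available while config files are evaluated, so in `config/runtime.exs` the same two-branch logic would need to be written inline.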

Automatic Cleanup on Restart

When Snakepit starts, it automatically:

Identifies processes from previous BEAM runs using run IDs
Sends SIGTERM for graceful shutdown
Sends SIGKILL to unresponsive processes
Cleans up stale registry entries

Only processes matching Snakepit's command-line patterns (grpc_server.py with --snakepit-run-id) are considered. To disable:

config :snakepit, :rogue_cleanup, enabled: false

Setting enabled: false disables the cleanup explicitly. The map form, rogue_cleanup: %{enabled: false}, is equivalent.
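
To make the matching rule concrete, the sketch below filters a list of process command lines down to those that look like Snakepit workers from a different run. It is an illustration only: Snakepit's actual scan is internal, `OrphanScan` is a hypothetical module, and the exact flag syntax (space or `=` between `--snakepit-run-id` and its value) is an assumption.

```elixir
defmodule OrphanScan do
  @moduledoc "Illustrative command-line matching; not Snakepit's internal implementation."

  # Returns the command lines that mention grpc_server.py and carry a
  # --snakepit-run-id different from the current run's ID.
  def orphan_candidates(command_lines, current_run_id) do
    # Assumed flag syntax: "--snakepit-run-id VALUE" or "--snakepit-run-id=VALUE".
    run_id_flag = ~r/--snakepit-run-id[= ](\S+)/

    Enum.filter(command_lines, fn line ->
      String.contains?(line, "grpc_server.py") and
        case Regex.run(run_id_flag, line) do
          [_, ^current_run_id] -> false
          [_, _other_run_id] -> true
          nil -> false
        end
    end)
  end
end
```

Processes without the run ID flag, or without `grpc_server.py` in their command line, are never selected, which mirrors the guarantee that unrelated Python processes are left alone.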

Graceful Shutdown

During application shutdown:

Workers receive SIGTERM (2 second timeout)
Unresponsive workers receive SIGKILL
A final pkill pass acts as a safety net for any processes that were missed
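
The escalation described above (SIGTERM, a grace period, then SIGKILL) can be sketched for a single OS process as follows. This is not Snakepit's shutdown code: `GracefulKill` is a hypothetical module, it relies on a POSIX `sh`, and it treats any failure of `kill -0` as "process gone".

```elixir
defmodule GracefulKill do
  @moduledoc "Illustrative TERM-then-KILL escalation; not Snakepit's internal code."

  @poll_ms 100

  # Sends SIGTERM, waits up to grace_ms for the process to exit,
  # then falls back to SIGKILL if it is still running.
  def terminate(os_pid, grace_ms \\ 2_000) when is_integer(os_pid) do
    signal("TERM", os_pid)

    if exited_within?(os_pid, grace_ms) do
      :terminated
    else
      signal("KILL", os_pid)
      :killed
    end
  end

  # Uses the shell's built-in kill so no separate kill binary is required.
  defp signal(name, os_pid) do
    {_output, status} =
      System.cmd("sh", ["-c", "kill -#{name} #{os_pid}"], stderr_to_stdout: true)

    status
  end

  # Signal 0 delivers nothing; it only reports whether the process can be signalled.
  defp alive?(os_pid), do: signal("0", os_pid) == 0

  # Polls until the process is gone or the grace period is used up.
  defp exited_within?(os_pid, remaining_ms) do
    cond do
      not alive?(os_pid) ->
        true

      remaining_ms <= 0 ->
        false

      true ->
        Process.sleep(@poll_ms)
        exited_within?(os_pid, remaining_ms - @poll_ms)
    end
  end
end
```

The default `grace_ms` of 2000 matches the 2 second timeout listed above.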

Manual Cleanup

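require Logger

case Snakepit.cleanup() do
  :ok -> Logger.info("Cleanup completed")
  # Include the surviving PIDs so they can be inspected or killed manually.
  {:timeout, pids} -> Logger.warning("Processes did not terminate: #{inspect(pids)}")
end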

Script Mode (run_as_script/2)

For short-lived scripts and Mix tasks:

defmodule Mix.Tasks.MyApp.ProcessData do
  use Mix.Task

  def run(args) do
    Snakepit.run_as_script(fn ->
      {:ok, result} = Snakepit.execute("process_batch", %{input: args})
      IO.puts("Complete: #{inspect(result)}")
    end, timeout: 30_000, cleanup_timeout: 10_000, exit_mode: :auto)
  end
end

Defaults are exit_mode: :none and stop_mode: :if_started. Use exit_mode: :auto for scripts that may run under --no-halt, and set stop_mode: :never for embedded usage where the host VM must stay alive.

Warning: exit_mode: :halt or :stop terminates the entire VM regardless of stop_mode.

Cleanup runs whenever cleanup_timeout is greater than zero, which it is by default, even if Snakepit was already started before the script ran. For embedded usage where you do not own the pool, set cleanup_timeout: 0 to skip cleanup.

Options: :timeout, :shutdown_timeout, :cleanup_timeout, :exit_mode, :stop_mode. The legacy :halt option is deprecated.
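
The option combinations described above can be collected into presets so that call sites do not repeat them. This is a sketch under the assumption that only two contexts matter; `ScriptOpts` and the context names `:standalone` and `:embedded` are hypothetical, while the option keys and values are the ones named in this section.

```elixir
defmodule ScriptOpts do
  @moduledoc "Hypothetical presets for Snakepit.run_as_script/2 options."

  # Standalone scripts and Mix tasks: exit appropriately, even under --no-halt.
  def for_context(:standalone), do: [exit_mode: :auto, stop_mode: :if_started]

  # Embedded usage: never stop or exit the host VM, and skip cleanup
  # because the pool is owned by someone else.
  def for_context(:embedded), do: [exit_mode: :none, stop_mode: :never, cleanup_timeout: 0]
end
```

A call would then look like `Snakepit.run_as_script(fun, ScriptOpts.for_context(:embedded))`.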

Common Troubleshooting

Python Process Will Not Start

mix snakepit.doctor      # Check environment
echo $SNAKEPIT_PYTHON    # Verify Python path
python3 -c "import grpc" # Test gRPC import

Solutions: Set SNAKEPIT_PYTHON, verify dependencies, check adapter module path.

gRPC Connection Failures

Snakepit.list_workers()  # Check running workers
Snakepit.get_stats()     # Check pool stats

Solutions: Check port conflicts, firewall rules, compile proto files.

Memory Issues

:telemetry.attach("mem", [:snakepit, :worker, :recycled],
  fn _, _, meta, _ -> IO.inspect(meta) end, nil)

Solutions: Increase memory_threshold_mb, reduce pool_size, enable worker TTL.
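
For anything longer-lived than a debugging session, a handler defined in a module is preferable, because :telemetry warns about a performance penalty when the handler is an anonymous function. A minimal sketch follows; `RecycleLogger` and the handler ID are hypothetical, and since the metadata keys of the event are not documented here, the handler simply inspects whatever it receives.

```elixir
defmodule RecycleLogger do
  @moduledoc "Hypothetical handler that logs worker recycle events."
  require Logger

  @event [:snakepit, :worker, :recycled]

  # Attaches using a module function capture rather than an anonymous function.
  def attach do
    :telemetry.attach("snakepit-recycle-logger", @event, &__MODULE__.handle_event/4, nil)
  end

  # Invoked by :telemetry each time a worker is recycled.
  def handle_event(@event, measurements, metadata, _config) do
    Logger.warning("worker recycled: " <> format(measurements, metadata))
  end

  # Kept separate so the message format can be checked without :telemetry.
  def format(measurements, metadata) do
    "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
  end
end
```

Call `RecycleLogger.attach()` before the pools start so that early recycle events are not missed.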

Orphaned Processes

ps aux | grep grpc_server.py
pkill -f "grpc_server.py.*--snakepit-run-id"  # Manual cleanup

Orphans are cleaned up automatically on the next startup, unless rogue_cleanup is disabled.

Performance Tuning

Pool Size Selection

%{name: :default, pool_size: System.schedulers_online() * 2}

Workload	Suggested Pool Size
CPU-bound	`schedulers * 1` to `schedulers * 2`
I/O-bound	`schedulers * 4` to `schedulers * 8`
Mixed	`schedulers * 2` to `schedulers * 4`
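
The table translates directly into a small lookup. The sketch below returns the suggested lower and upper pool size for a workload type; `PoolSizing` and the workload atoms are hypothetical names, and the multipliers are copied from the table.

```elixir
defmodule PoolSizing do
  @moduledoc "Hypothetical helper that applies the pool size guidelines."

  # {low, high} multipliers per workload type, taken from the table above.
  @multipliers %{cpu_bound: {1, 2}, io_bound: {4, 8}, mixed: {2, 4}}

  # Returns {min_pool_size, max_pool_size} for the given scheduler count.
  def range(workload, schedulers \\ System.schedulers_online()) do
    {low, high} = Map.fetch!(@multipliers, workload)
    {schedulers * low, schedulers * high}
  end
end
```

On an 8-scheduler machine, `PoolSizing.range(:io_bound, 8)` gives `{32, 64}`. Starting at the lower bound and raising it while watching queue depth is consistent with the "start conservative" recommendation.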

For thread-profile workers:

%{name: :hpc, worker_profile: :thread, pool_size: 4, threads_per_worker: 16}

Batch Configuration

%{
  pool_size: 100,
  startup_batch_size: 8,
  startup_batch_delay_ms: 750
}
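
Batching trades startup speed for a gentler load spike, so it helps to estimate how much delay a given configuration adds. The sketch below assumes one delay is inserted between consecutive batches and none after the last; that is an assumption about the batching behavior, and `StartupEstimate` is a hypothetical module. Worker boot time itself is not included.

```elixir
defmodule StartupEstimate do
  @moduledoc "Hypothetical arithmetic for batched pool startup."

  # Number of batches needed to start pool_size workers (ceiling division).
  def batches(pool_size, batch_size) when batch_size > 0 do
    div(pool_size + batch_size - 1, batch_size)
  end

  # Total inter-batch delay, assuming a delay between batches but not after the last.
  def min_delay_ms(pool_size, batch_size, delay_ms) do
    max(batches(pool_size, batch_size) - 1, 0) * delay_ms
  end
end
```

For the configuration above, `StartupEstimate.batches(100, 8)` is 13 and `StartupEstimate.min_delay_ms(100, 8, 750)` is 9000, so under these assumptions at least 9 seconds of any startup timeout would be spent waiting between batches.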

Heartbeat Tuning

heartbeat: %{
  enabled: true,
  ping_interval_ms: 2000,
  timeout_ms: 10000,
  max_missed_heartbeats: 3
}

Environment	Ping Interval	Timeout	Max Missed Heartbeats
Development	5000 ms	30000 ms	5
Production	2000 ms	10000 ms	3
Critical	1000 ms	5000 ms	2
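
The rows of the table can be kept in one place and selected per environment. This is a sketch only: `HeartbeatPresets` and the environment atoms are hypothetical, and the values are copied from the table.

```elixir
defmodule HeartbeatPresets do
  @moduledoc "Hypothetical heartbeat presets mirroring the table above."

  def for_env(:development), do: build(5_000, 30_000, 5)
  def for_env(:production), do: build(2_000, 10_000, 3)
  def for_env(:critical), do: build(1_000, 5_000, 2)

  # Produces a map in the shape used by the heartbeat pool option.
  defp build(ping_interval_ms, timeout_ms, max_missed) do
    %{
      enabled: true,
      ping_interval_ms: ping_interval_ms,
      timeout_ms: timeout_ms,
      max_missed_heartbeats: max_missed
    }
  end
end
```

A pool definition could then use `heartbeat: HeartbeatPresets.for_env(:production)` wherever application modules are available.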

Complete Production Configuration

# config/prod.exs
config :snakepit,
  pooling_enabled: true,
  log_level: :warning,

  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: System.schedulers_online() * 2,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapter.MainAdapter"],
      startup_batch_size: 8,
      startup_batch_delay_ms: 750,
      worker_ttl: {1, :hours},
      worker_max_requests: 10_000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 2000,
        timeout_ms: 10000,
        max_missed_heartbeats: 3
      }
    },
    %{
      name: :hpc,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_args: ["--adapter", "myapp.adapter.MLAdapter"],
      heartbeat: %{enabled: true, ping_interval_ms: 5000}
    }
  ],

  pool_queue_timeout: 5000,
  pool_max_queue_size: 1000,
  pool_startup_timeout: 30000,

  crash_barrier: %{
    enabled: true,
    retry: :idempotent,
    max_retries: 1,
    taint_ms: 5000
  },

  rogue_cleanup: %{enabled: true},

  telemetry_metrics: %{prometheus: %{enabled: true}},

  opentelemetry: %{
    enabled: true,
    exporters: %{otlp: %{endpoint: "http://collector:4318"}}
  }

Environment Variables

Variable	Description
`SNAKEPIT_PYTHON`	Path to Python binary
`SNAKEPIT_SCRIPT_EXIT`	Exit behavior for scripts (`none`, `halt`, `stop`, `auto`)
`SNAKEPIT_SCRIPT_HALT`	Deprecated; use `SNAKEPIT_SCRIPT_EXIT=halt`
`SNAKEPIT_OTEL_ENDPOINT`	OpenTelemetry collector endpoint

Deployment Recommendations

Use Releases - Build OTP releases for production
Separate Python Env - Use a dedicated virtual environment
Monitor Early - Attach telemetry handlers before starting pools
Start Conservative - Begin with smaller pool sizes
Test Failure Modes - Verify orphan cleanup and crash recovery
