Production Deployment


This guide covers deploying Snakepit in production environments, including setup, process management, troubleshooting, and performance tuning.

Pre-Deployment Checklist

Before deploying Snakepit to production:

  • [ ] Python 3.10+ installed (3.13+ for thread-safe adapters)
  • [ ] Virtual environment created with dependencies installed
  • [ ] SNAKEPIT_PYTHON environment variable set (if not using system Python)
  • [ ] gRPC proto files compiled (mix snakepit.setup)
  • [ ] Pool size appropriate for workload
  • [ ] Logging level set (:warning or :error for production; see the minimal sketch after this list)
  • [ ] Telemetry handlers attached
  • [ ] DETS storage directory writable (priv/data/)
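
The pool size and logging items above map to a few configuration keys. A minimal sketch (key names follow the full production example later in this guide; adjust values for your workload):

config :snakepit,
  pooling_enabled: true,
  log_level: :warning,
  pools: [
    %{name: :default, pool_size: System.schedulers_online() * 2}
  ]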

Mix Tasks

mix snakepit.setup

Bootstrap the environment, installing Python dependencies and compiling gRPC protos:

mix snakepit.setup

mix snakepit.doctor

Run environment diagnostics:

mix snakepit.doctor

Checks Python version, gRPC tools, proto files, and virtual environment configuration.

mix snakepit.status

Check pool status and worker health:

mix snakepit.status

Output:

Pool: default (process)
  Workers: 8
  Queued: 0
  Requests: 1523
  Errors: 2

mix snakepit.gen.adapter

Generate a Python adapter skeleton:

mix snakepit.gen.adapter my_adapter

Creates priv/python/my_adapter/ with adapter.py. Configure with:

adapter_args: ["--adapter", "my_adapter.adapter.MyAdapter"]

Process Management

Run IDs and Orphan Detection

Each BEAM instance gets a unique run ID at startup. The run ID identifies which workers belong to the current instance and allows detection of orphaned workers left behind by crashed instances.

Automatic Cleanup on Restart

When Snakepit starts, it automatically:

  1. Identifies processes from previous BEAM runs using run IDs
  2. Sends SIGTERM for graceful shutdown
  3. Sends SIGKILL to unresponsive processes
  4. Cleans up stale registry entries

Only processes matching Snakepit's command-line patterns (grpc_server.py with --snakepit-run-id) are considered. To disable this cleanup:

config :snakepit, :rogue_cleanup, enabled: false

Graceful Shutdown

During application shutdown:

  1. Workers receive SIGTERM (2 second timeout)
  2. Unresponsive workers receive SIGKILL
  3. Final pkill safety net for missed processes

Manual Cleanup

case Snakepit.cleanup() do
  :ok -> Logger.info("Cleanup completed")
  {:timeout, pids} -> Logger.warning("Some processes did not terminate: #{inspect(pids)}")
end

Script Mode (run_as_script/2)

For short-lived scripts and Mix tasks:

defmodule Mix.Tasks.MyApp.ProcessData do
  use Mix.Task

  def run(args) do
    Snakepit.run_as_script(fn ->
      {:ok, result} = Snakepit.execute("process_batch", %{input: args})
      IO.puts("Complete: #{inspect(result)}")
    end, timeout: 30_000, cleanup_timeout: 10_000)
  end
end

Options: :timeout, :shutdown_timeout, :cleanup_timeout, :halt
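
A hedged sketch of the :halt option, which forces the VM to halt once the script body returns (mirroring the SNAKEPIT_SCRIPT_HALT environment variable listed later in this guide):

Snakepit.run_as_script(fn ->
  {:ok, _result} = Snakepit.execute("process_batch", %{input: []})
end, timeout: 30_000, halt: true)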

Common Troubleshooting

Python Process Will Not Start

mix snakepit.doctor      # Check environment
echo $SNAKEPIT_PYTHON    # Verify Python path
python3 -c "import grpc" # Test gRPC import

Solutions: Set SNAKEPIT_PYTHON, verify dependencies, check adapter module path.

gRPC Connection Failures

Snakepit.list_workers()  # Check running workers
Snakepit.get_stats()     # Check pool stats

Solutions: Check for port conflicts, verify firewall rules, ensure proto files are compiled.

Memory Issues

:telemetry.attach("mem", [:snakepit, :worker, :recycled],
  fn _, _, meta, _ -> IO.inspect(meta) end, nil)

Solutions: Increase memory_threshold_mb, reduce pool_size, enable worker TTL.
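
A hedged sketch of the pool-level knobs named in those solutions (:memory_threshold_mb is taken from the text above, :worker_ttl follows the form used in the full production example; verify both keys against your Snakepit version):

%{
  name: :default,
  pool_size: 4,                  # fewer concurrent Python processes
  memory_threshold_mb: 2048,     # recycle a worker once it exceeds this RSS
  worker_ttl: {30, :minutes}     # recycle long-lived workers periodically
}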

Orphaned Processes

ps aux | grep grpc_server.py
pkill -f "grpc_server.py.*--snakepit-run-id"  # Manual cleanup

Orphans are cleaned automatically on next startup.

Performance Tuning

Pool Size Selection

%{name: :default, pool_size: System.schedulers_online() * 2}

Workload    Pool Size
CPU-bound   schedulers * 1-2
I/O-bound   schedulers * 4-8
Mixed       schedulers * 2-4

For thread-profile workers:

%{name: :hpc, worker_profile: :thread, pool_size: 4, threads_per_worker: 16}

Batch Configuration

Large pools can be started in batches so that boot does not launch every Python process at once:

%{
  pool_size: 100,
  startup_batch_size: 8,         # workers started per batch
  startup_batch_delay_ms: 750    # delay between batches, in milliseconds
}

Heartbeat Tuning

heartbeat: %{
  enabled: true,
  ping_interval_ms: 2000,
  timeout_ms: 10000,
  max_missed_heartbeats: 3
}

Environment   Interval   Timeout    Max Missed
Development   5000ms     30000ms    5
Production    2000ms     10000ms    3
Critical      1000ms     5000ms     2

Complete Production Configuration

# config/prod.exs
config :snakepit,
  pooling_enabled: true,
  log_level: :warning,

  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: System.schedulers_online() * 2,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapter.MainAdapter"],
      startup_batch_size: 8,
      startup_batch_delay_ms: 750,
      worker_ttl: {1, :hours},
      worker_max_requests: 10_000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 2000,
        timeout_ms: 10000,
        max_missed_heartbeats: 3
      }
    },
    %{
      name: :hpc,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_args: ["--adapter", "myapp.adapter.MLAdapter"],
      heartbeat: %{enabled: true, ping_interval_ms: 5000}
    }
  ],

  pool_queue_timeout: 5000,
  pool_max_queue_size: 1000,
  pool_startup_timeout: 30000,

  crash_barrier: %{
    enabled: true,
    retry: :idempotent,
    max_retries: 1,
    taint_ms: 5000
  },

  rogue_cleanup: %{enabled: true},

  telemetry_metrics: %{prometheus: %{enabled: true}},

  opentelemetry: %{
    enabled: true,
    exporters: %{otlp: %{endpoint: "http://collector:4318"}}
  }

Environment Variables

Variable                  Description
SNAKEPIT_PYTHON           Path to Python binary
SNAKEPIT_SCRIPT_HALT      Force halt after script completion
SNAKEPIT_OTEL_ENDPOINT    OpenTelemetry collector endpoint

Deployment Recommendations

  1. Use Releases - Build OTP releases for production
  2. Separate Python Env - Use a dedicated virtual environment
  3. Monitor Early - Attach telemetry handlers before starting pools (see the sketch after this list)
  4. Start Conservative - Begin with smaller pool sizes
  5. Test Failure Modes - Verify orphan cleanup and crash recovery
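
For recommendation 3, a minimal sketch of attaching a handler before the pools boot. The [:snakepit, :worker, :recycled] event name comes from the memory troubleshooting example above; treat any other event names as assumptions to verify against the telemetry documentation.

defmodule MyApp.SnakepitTelemetry do
  require Logger

  # Call from Application.start/2 before the Snakepit pools are started.
  def attach do
    :telemetry.attach(
      "myapp-snakepit-recycled",
      [:snakepit, :worker, :recycled],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:snakepit, :worker, :recycled], _measurements, metadata, _config) do
    Logger.warning("Snakepit worker recycled: #{inspect(metadata)}")
  end
end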