Snakepit Configuration Guide


This guide covers all configuration options for Snakepit, from simple single-pool setups to advanced multi-pool deployments with different worker profiles.


Table of Contents

  1. Configuration Formats
  2. Global Options
  3. Pool Configuration
  4. Heartbeat Configuration
  5. Logging Configuration
  6. Python Runtime Configuration
  7. Optional Features
  8. Complete Configuration Example

Configuration Formats

Snakepit supports two configuration formats: legacy (single-pool) and multi-pool (v0.6+).

Simple (Legacy) Configuration

For backward compatibility with v0.5.x and single-pool deployments:

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100,
  pool_config: %{
    startup_batch_size: 8,
    startup_batch_delay_ms: 750,
    max_workers: 1000
  }

This format creates a single pool named :default with the specified settings. If both top-level :pool_size and pool_config.pool_size are set, Snakepit uses the top-level :pool_size.
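
For instance, in this illustrative snippet the effective pool size is 100, because the top-level key wins:

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100,        # used
  pool_config: %{
    pool_size: 50        # ignored in favor of the top-level value
  }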

Multi-Pool Configuration (v0.6+)

For advanced deployments with multiple pools, each with different profiles:

# config/config.exs
config :snakepit,
  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 100,
      adapter_module: Snakepit.Adapters.GRPCPython
    },
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"]
    }
  ]

This creates two pools: :default for general tasks and :ml_inference for CPU-bound ML workloads.


Global Options

These options apply to all pools or the Snakepit application as a whole.

| Option | Type | Default | Description |
|---|---|---|---|
| pooling_enabled | boolean() | false | Enable or disable worker pooling. Set to true for normal operation. |
| adapter_module | module() | nil | Default adapter module for pools that do not specify one (including adapter timeout fallback). |
| pool_size | pos_integer() | System.schedulers_online() * 2 | Default pool size. Typically 2x CPU cores. |
| capacity_strategy | :pool \| :profile \| :hybrid | :pool | How worker capacity is managed across pools. |
| affinity | :hint \| :strict_queue \| :strict_fail_fast | :hint | Default session affinity mode for pools. |
| pool_startup_timeout | pos_integer() | 10000 | Maximum time (ms) to wait for a worker to start. |
| pool_queue_timeout | pos_integer() | 5000 | Maximum time (ms) a request waits in queue. |
| pool_max_queue_size | pos_integer() | 1000 | Maximum queued requests before rejecting new ones. |
| pool_reconcile_interval_ms | non_neg_integer() | 1000 | Interval (ms) for pool reconciliation to restore worker count (0 disables). |
| pool_reconcile_batch_size | pos_integer() | 2 | Max workers respawned per reconciliation tick (ignored if reconcile is disabled). |
| grpc_worker_health_check_timeout_ms | pos_integer() | 5000 | Timeout (ms) for periodic worker health-check RPCs. |
| worker_starter_max_restarts | non_neg_integer() | 3 | Restart intensity: max restarts for the worker starter supervisor. |
| worker_starter_max_seconds | pos_integer() | 5 | Restart intensity window (seconds) for the worker starter supervisor. |
| worker_supervisor_max_restarts | non_neg_integer() | 3 | Restart intensity: max restarts for the worker supervisor. |
| worker_supervisor_max_seconds | pos_integer() | 5 | Restart intensity window (seconds) for the worker supervisor. |
| grpc_listener | map() | %{mode: :internal} | gRPC listener configuration (mode/host/port). |
| grpc_internal_host | String.t() | "127.0.0.1" | Default host for internal-only gRPC listeners. |
| grpc_port_pool_size | pos_integer() | 32 | Default pool size for :external_pool listeners. |
| grpc_listener_ready_timeout_ms | pos_integer() | 5000 | Time (ms) to wait for the gRPC listener to publish its port before pool startup. |
| grpc_listener_port_check_interval_ms | pos_integer() | 25 | Interval (ms) between port readiness checks when reusing an existing listener. |
| grpc_listener_reuse_attempts | pos_integer() | 3 | Number of attempts to reuse or rebind a listener before failing. |
| grpc_listener_reuse_wait_timeout_ms | pos_integer() | 500 | Max wait (ms) for an already-started listener to publish its port before retrying. |
| grpc_listener_reuse_retry_delay_ms | pos_integer() | 100 | Delay (ms) between listener reuse retries. |
| instance_name | String.t() | nil | Instance identifier for isolating runtime state. |
| instance_token | String.t() | nil | Unique per-running-instance token for strong process cleanup isolation. |
| data_dir | String.t() | priv/data | Directory for runtime persistence (DETS, cleanup state). |
| graceful_shutdown_timeout_ms | pos_integer() | 6000 | Time (ms) to wait for Python to terminate gracefully before SIGKILL. |
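
As a minimal sketch, the queueing, reconciliation, and shutdown options above are set as plain keys under config :snakepit; the values here are illustrative, not recommendations:

config :snakepit,
  pool_queue_timeout: 10_000,           # requests wait up to 10 s in queue
  pool_max_queue_size: 2_000,           # reject new requests beyond this backlog
  pool_reconcile_interval_ms: 2_000,    # restore missing workers every 2 s
  pool_reconcile_batch_size: 4,         # respawn at most 4 workers per tick
  graceful_shutdown_timeout_ms: 10_000  # give Python 10 s before SIGKILL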

grpc_port and grpc_host remain supported for legacy configurations, but new deployments should use grpc_listener.

gRPC Listener Modes

Internal-only mode binds to an ephemeral port and advertises localhost to workers:

config :snakepit,
  grpc_listener: %{
    mode: :internal
  }

External bindings require explicit host/port configuration:

config :snakepit,
  grpc_listener: %{
    mode: :external,
    host: "localhost",
    bind_host: "0.0.0.0",
    port: 50051
  }

To run multiple instances on the same host, use pooled external ports:

config :snakepit,
  grpc_listener: %{
    mode: :external_pool,
    host: "localhost",
    bind_host: "0.0.0.0",
    base_port: 50051,
    pool_size: 32
  }

Use instance_name, instance_token, and data_dir to isolate registry state when sharing a deployment directory.

instance_name is for environment-level grouping (for example prod-us-east-1). instance_token must be unique for each concurrently running VM (for example a deploy slot, CI job, or terminal session). Without unique tokens, concurrent instances from the same codebase can treat each other as rogue/orphan processes during cleanup.

Example:

config :snakepit,
  instance_name: "my-app",
  instance_token: "job-1234",
  data_dir: "/var/lib/snakepit"

Capacity Strategies

| Strategy | Description |
|---|---|
| :pool | Each pool manages its own capacity independently. Default and simplest option. |
| :profile | Workers of the same profile share capacity across pools. |
| :hybrid | Combination of pool and profile strategies for complex deployments. |
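
As a sketch, sharing capacity across pools of the same profile only requires setting the strategy globally; the pool names and sizes below are illustrative:

config :snakepit,
  capacity_strategy: :profile,
  pools: [
    %{name: :api,   worker_profile: :process, pool_size: 20},
    %{name: :batch, worker_profile: :process, pool_size: 20}
  ]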

Pool Configuration

Each pool can be configured independently with these options.

Required Fields

| Option | Type | Description |
|---|---|---|
| name | atom() | Unique pool identifier. Use :default for the primary pool. |

Profile Selection

| Option | Type | Default | Description |
|---|---|---|---|
| worker_profile | :process \| :thread | :process | Worker execution model. See the Worker Profiles Guide. |

Common Pool Options

| Option | Type | Default | Description |
|---|---|---|---|
| pool_size | pos_integer() | Global setting | Number of workers in this pool. |
| adapter_module | module() | Global setting | Adapter module for this pool. |
| adapter_args | list(String.t()) | [] | CLI arguments passed to the Python server. |
| adapter_env | list({String.t(), String.t()}) | [] | Environment variables for Python processes. |
| adapter_spec | String.t() | nil | Python adapter module path (e.g., "myapp.adapters.MyAdapter"). |
| affinity | :hint \| :strict_queue \| :strict_fail_fast | Global setting | Session affinity behavior for this pool. |

When adapters implement command_timeout/2, Snakepit resolves command timeout from the selected worker's pool adapter_module first. The global config :snakepit, adapter_module: ... value is used only if a pool does not declare its own adapter.
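
As an illustrative sketch only (assuming the callback receives the command name and its arguments and returns a timeout in milliseconds; the module name is hypothetical), an adapter can give specific commands a longer timeout:

defmodule MyApp.Adapters.SlowJobAdapter do
  # Hypothetical adapter; only the timeout callback is shown here.
  # Assumption: command_timeout/2 receives the command name and its args.
  def command_timeout("train_model", _args), do: 600_000  # 10 minutes
  def command_timeout(_command, _args), do: 30_000         # everything else
end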

Session Affinity Modes

config :snakepit,
  affinity: :hint  # global default

config :snakepit,
  pools: [
    %{name: :default, affinity: :strict_queue, pool_size: 4}
  ]

  • :hint (default) — Prefer the last worker if available; otherwise fall back.
  • :strict_queue — Queue when the preferred worker is busy; guarantees same-worker routing but can increase latency and queue timeouts.
  • :strict_fail_fast — Return {:error, %Snakepit.Error{category: :pool, details: %{reason: :worker_busy}}} when the preferred worker is busy.

If the preferred worker is tainted or missing, strict modes return {:error, %Snakepit.Error{category: :pool, details: %{reason: :session_worker_unavailable}}}.
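
A sketch of handling these errors in calling code, assuming a session-based call along the lines of Snakepit.execute_in_session/4 (the call shape and arguments here are illustrative):

case Snakepit.execute_in_session(session_id, "predict", %{input: data}, pool: :default) do
  {:ok, result} ->
    result

  {:error, %Snakepit.Error{category: :pool, details: %{reason: :worker_busy}}} ->
    # :strict_fail_fast returned immediately; retry later or route elsewhere
    {:retry_later, session_id}

  {:error, %Snakepit.Error{category: :pool, details: %{reason: :session_worker_unavailable}}} ->
    # preferred worker is tainted or gone; re-establish the session
    {:new_session_needed, session_id}
end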

Process Profile Options

These options apply when worker_profile: :process:

| Option | Type | Default | Description |
|---|---|---|---|
| startup_batch_size | pos_integer() | 8 | Workers started per batch during pool initialization. |
| startup_batch_delay_ms | non_neg_integer() | 750 | Delay between startup batches (ms). Reduces system load during startup. |
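
For example, a process-profile pool that ramps up gently during startup (values are illustrative):

%{
  name: :default,
  worker_profile: :process,
  pool_size: 100,
  startup_batch_size: 4,         # start 4 workers at a time
  startup_batch_delay_ms: 1_000  # pause 1 s between batches
}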

Thread Profile Options

These options apply when worker_profile: :thread:

| Option | Type | Default | Description |
|---|---|---|---|
| threads_per_worker | pos_integer() | 10 | Thread pool size per Python process. Total capacity = pool_size * threads_per_worker. |
| thread_safety_checks | boolean() | false | Enable runtime thread safety validation. Useful for development. |
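
For example, a thread-profile pool with safety checks enabled during development (values are illustrative):

%{
  name: :ml_inference,
  worker_profile: :thread,
  pool_size: 2,
  threads_per_worker: 16,      # 32 total capacity
  thread_safety_checks: true   # extra runtime validation; disable in production
}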

Worker Lifecycle Options

| Option | Type | Default | Description |
|---|---|---|---|
| worker_ttl | :infinity \| {value, unit} | :infinity | Maximum worker lifetime before recycling. |
| worker_max_requests | :infinity \| pos_integer() | :infinity | Maximum requests before recycling a worker. |

TTL Units:

| Unit | Example |
|---|---|
| :seconds | {3600, :seconds} - 1 hour |
| :minutes | {60, :minutes} - 1 hour |
| :hours | {1, :hours} - 1 hour |

Worker recycling helps prevent memory leaks and ensures fresh worker state.
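
For example, a pool that recycles workers after one hour of lifetime or after 5,000 handled requests (illustrative values; the two limits are independent options):

%{
  name: :background,
  worker_profile: :process,
  pool_size: 10,
  worker_ttl: {1, :hours},     # recycle after one hour of lifetime
  worker_max_requests: 5_000   # recycle after 5,000 handled requests
}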


Heartbeat Configuration

Heartbeats detect unresponsive workers and trigger automatic restarts.

Global Heartbeat Config

config :snakepit,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 2000,
    timeout_ms: 10000,
    max_missed_heartbeats: 3,
    initial_delay_ms: 0,
    dependent: true
  }

Per-Pool Heartbeat Config

%{
  name: :ml_pool,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 10000,
    timeout_ms: 30000,
    max_missed_heartbeats: 2
  }
}

Heartbeat Options

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean() | true | Enable heartbeat monitoring. |
| ping_interval_ms | pos_integer() | 2000 | Interval between heartbeat pings. |
| timeout_ms | pos_integer() | 10000 | Maximum time to wait for a heartbeat response. |
| max_missed_heartbeats | pos_integer() | 3 | Missed heartbeats before declaring a worker dead. |
| initial_delay_ms | non_neg_integer() | 0 | Delay before the first heartbeat ping. |
| dependent | boolean() | true | Whether the worker terminates if the heartbeat monitor dies. |

Tuning Guidelines

  • Fast detection: Lower ping_interval_ms and max_missed_heartbeats
  • Reduce overhead: Higher ping_interval_ms for stable workloads
  • Long operations: Increase timeout_ms if workers run long computations
  • ML workloads: Use ping_interval_ms: 10000 or higher since inference can block (see the sketch below)
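
Applying these guidelines, a relaxed per-pool heartbeat for long-running inference might look like this (values are illustrative):

%{
  name: :ml_inference,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 15_000,   # ping less often; inference can block
    timeout_ms: 60_000,         # allow up to a minute for a response
    max_missed_heartbeats: 2
  }
}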

Logging Configuration

Snakepit uses its own logger for internal operations.

Log Level

config :snakepit,
  log_level: :info  # :debug | :info | :warning | :error | :none

| Level | Description |
|---|---|
| :debug | Verbose output including worker lifecycle, gRPC calls, heartbeats |
| :info | Normal operation messages |
| :warning | Potential issues that do not stop operation |
| :error | Errors that affect functionality |
| :none | Disable all Snakepit logging |

Log Categories

Fine-grained control over logging categories:

config :snakepit,
  log_level: :info,
  log_categories: %{
    pool: :debug,      # Pool operations
    worker: :debug,    # Worker lifecycle
    heartbeat: :info,  # Heartbeat monitoring
    grpc: :warning     # gRPC communication
  }

Python-Side Logging

The Python bridge respects the SNAKEPIT_LOG_LEVEL environment variable:

%{
  name: :default,
  adapter_env: [{"SNAKEPIT_LOG_LEVEL", "info"}]
}

Python Runtime Configuration

Configure how Python interpreters are discovered and managed.

Interpreter Selection

config :snakepit,
  python_executable: "/path/to/python3"

Or set the SNAKEPIT_PYTHON environment variable, which takes precedence over python_executable:

export SNAKEPIT_PYTHON="/path/to/python3"

Runtime Strategy

config :snakepit,
  python_runtime: %{
    strategy: :venv,  # :system | :venv | :managed
    managed: false,
    version: "3.12"
  }

| Strategy | Description |
|---|---|
| :system | Use the system Python interpreter |
| :venv | Use the project virtual environment (.venv/bin/python3) |
| :managed | Let Snakepit manage the Python version (experimental) |

Environment Variables per Pool

%{
  name: :ml_pool,
  adapter_env: [
    # Control threading in numerical libraries
    {"OPENBLAS_NUM_THREADS", "1"},
    {"MKL_NUM_THREADS", "1"},
    {"OMP_NUM_THREADS", "1"},
    {"NUMEXPR_NUM_THREADS", "1"},

    # GPU configuration
    {"CUDA_VISIBLE_DEVICES", "0"},

    # Python settings
    {"PYTHONUNBUFFERED", "1"},
    {"SNAKEPIT_LOG_LEVEL", "warning"}
  ]
}

Optional Features

Zero-Copy Data Transfer

Enable zero-copy for large binary data:

config :snakepit,
  zero_copy: %{
    enabled: true,
    threshold_bytes: 1_048_576  # 1 MB
  }

Zero-copy is beneficial for ML workloads with large tensors.

Crash Barrier

Limit restart attempts for frequently crashing workers:

config :snakepit,
  crash_barrier: %{
    enabled: true,
    max_restarts: 5,
    window_seconds: 60
  }

If a worker restarts more than max_restarts times within window_seconds, it is permanently removed from the pool.

Circuit Breaker

Prevent cascading failures:

config :snakepit,
  circuit_breaker: %{
    enabled: true,
    failure_threshold: 5,
    reset_timeout_ms: 30000
  }

After failure_threshold consecutive failures, the circuit opens and requests fail fast for reset_timeout_ms.

Rogue Cleanup

Control startup orphan-process cleanup:

config :snakepit, :rogue_cleanup, enabled: false
# equivalent map form:
config :snakepit, rogue_cleanup: %{enabled: false}

enabled: false is treated as an explicit disable and is not replaced by defaults.


Complete Configuration Example

Here is a production-ready configuration demonstrating all major options:

# config/config.exs
config :snakepit,
  # Global settings
  pooling_enabled: true,
  pool_startup_timeout: 30_000,
  pool_queue_timeout: 10_000,
  pool_max_queue_size: 5000,
  grpc_listener: %{
    mode: :external,
    host: "snakepit.internal",
    bind_host: "0.0.0.0",
    port: 50051
  },

  # Logging
  log_level: :info,
  log_categories: %{
    pool: :info,
    worker: :warning,
    heartbeat: :warning,
    grpc: :warning
  },

  # Global heartbeat defaults
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 5000,
    timeout_ms: 15000,
    max_missed_heartbeats: 3
  },

  # Multiple pools
  pools: [
    # Default pool for I/O-bound tasks
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 50,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.GeneralAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "1"},
        {"OMP_NUM_THREADS", "1"}
      ],
      startup_batch_size: 10,
      startup_batch_delay_ms: 500
    },

    # ML inference pool (CPU-bound, thread profile)
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 8,  # 32 total capacity
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "8"},
        {"OMP_NUM_THREADS", "8"},
        {"CUDA_VISIBLE_DEVICES", "0"},
        {"PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512"}
      ],
      thread_safety_checks: false,
      worker_ttl: {1800, :seconds},
      worker_max_requests: 10000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 10000,
        timeout_ms: 60000,
        max_missed_heartbeats: 2
      }
    },

    # Background processing pool
    %{
      name: :background,
      worker_profile: :process,
      pool_size: 10,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.BackgroundAdapter"],
      adapter_env: [
        {"SNAKEPIT_LOG_LEVEL", "warning"}
      ],
      worker_ttl: {3600, :seconds}
    }
  ],

  # Optional features
  crash_barrier: %{
    enabled: true,
    max_restarts: 10,
    window_seconds: 300
  }

Environment-Specific Overrides

# config/prod.exs
config :snakepit,
  log_level: :warning,
  pool_max_queue_size: 10000

# config/dev.exs
config :snakepit,
  log_level: :debug,
  pool_size: 4

# config/test.exs
config :snakepit,
  pooling_enabled: false

Validation

Verify your configuration with the doctor task:

mix snakepit.doctor

At runtime, check pool status:

iex> Snakepit.get_stats()
%{
  requests: 15432,
  queued: 5,
  errors: 12,
  queue_timeouts: 3,
  pool_saturated: 0,
  workers: 54,
  available: 49,
  busy: 5
}