Snakepit Configuration Guide


This guide covers all configuration options for Snakepit, from simple single-pool setups to advanced multi-pool deployments with different worker profiles.


Table of Contents

  1. Configuration Formats
  2. Global Options
  3. Pool Configuration
  4. Heartbeat Configuration
  5. Logging Configuration
  6. Python Runtime Configuration
  7. Optional Features
  8. Complete Configuration Example

Configuration Formats

Snakepit supports two configuration formats: legacy (single-pool) and multi-pool (v0.6+).

Simple (Legacy) Configuration

For backward compatibility with v0.5.x and single-pool deployments:

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100,
  pool_config: %{
    startup_batch_size: 8,
    startup_batch_delay_ms: 750,
    max_workers: 1000
  }

This format creates a single pool named :default with the specified settings. If both top-level :pool_size and pool_config.pool_size are set, Snakepit uses the top-level :pool_size.
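
For instance, in this illustrative snippet the effective pool size is 100, because the top-level key wins:

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100,        # used
  pool_config: %{
    pool_size: 50        # ignored in favor of the top-level value
  }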

Multi-Pool Configuration (v0.6+)

For advanced deployments with multiple pools, each with different profiles:

# config/config.exs
config :snakepit,
  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 100,
      adapter_module: Snakepit.Adapters.GRPCPython
    },
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"]
    }
  ]

This creates two pools: :default for general tasks and :ml_inference for CPU-bound ML workloads.


Global Options

These options apply to all pools or the Snakepit application as a whole.

| Option | Type | Default | Description |
|---|---|---|---|
| pooling_enabled | boolean() | false | Enable or disable worker pooling. Set to true for normal operation. |
| adapter_module | module() | nil | Default adapter module for pools that do not specify one (including adapter timeout fallback). |
| pool_size | pos_integer() | System.schedulers_online() * 2 | Default pool size. Typically 2x CPU cores. |
| capacity_strategy | :pool \| :profile \| :hybrid | :pool | How worker capacity is managed across pools. |
| affinity | :hint \| :strict_queue \| :strict_fail_fast | :hint | Default session affinity mode for pools. |
| pool_startup_timeout | pos_integer() | 10000 | Maximum time (ms) to wait for a worker to start. |
| pool_queue_timeout | pos_integer() | 5000 | Maximum time (ms) a request waits in queue. |
| pool_max_queue_size | pos_integer() | 1000 | Maximum queued requests before rejecting new ones. |
| pool_reconcile_interval_ms | non_neg_integer() | 1000 | Interval (ms) for pool reconciliation to restore worker count (0 disables). |
| pool_reconcile_batch_size | pos_integer() | 2 | Max workers respawned per reconciliation tick (ignored if reconcile is disabled). |
| grpc_worker_health_check_timeout_ms | pos_integer() | 5000 | Timeout (ms) for periodic worker health-check RPCs. |
| worker_starter_max_restarts | non_neg_integer() | 3 | Restart intensity: max restarts for the worker starter supervisor. |
| worker_starter_max_seconds | pos_integer() | 5 | Restart intensity window (seconds) for the worker starter supervisor. |
| worker_supervisor_max_restarts | non_neg_integer() | 3 | Restart intensity: max restarts for the worker supervisor. |
| worker_supervisor_max_seconds | pos_integer() | 5 | Restart intensity window (seconds) for the worker supervisor. |
| grpc_listener | map() | %{mode: :internal} | gRPC listener configuration (mode/host/port). |
| grpc_internal_host | String.t() | "127.0.0.1" | Default host for internal-only gRPC listeners. |
| grpc_port_pool_size | pos_integer() | 32 | Default pool size for :external_pool listeners. |
| grpc_listener_ready_timeout_ms | pos_integer() | 5000 | Time (ms) to wait for the gRPC listener to publish its port before pool startup. |
| grpc_listener_port_check_interval_ms | pos_integer() | 25 | Interval (ms) between port readiness checks when reusing an existing listener. |
| grpc_listener_reuse_attempts | pos_integer() | 3 | Number of attempts to reuse or rebind a listener before failing. |
| grpc_listener_reuse_wait_timeout_ms | pos_integer() | 500 | Max wait (ms) for an already-started listener to publish its port before retrying. |
| grpc_listener_reuse_retry_delay_ms | pos_integer() | 100 | Delay (ms) between listener reuse retries. |
| instance_name | String.t() | nil | Instance identifier for isolating runtime state. |
| instance_token | String.t() | nil | Unique per-running-instance token for strong process cleanup isolation. |
| data_dir | String.t() | priv/data | Directory for runtime persistence (DETS, cleanup state). |
| graceful_shutdown_timeout_ms | pos_integer() | 6000 | Time (ms) to wait for Python to terminate gracefully before SIGKILL. |
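
As a minimal sketch, the queueing, reconciliation, and shutdown options above are set as plain keys under config :snakepit; the values here are illustrative, not recommendations:

config :snakepit,
  pool_queue_timeout: 10_000,           # requests wait up to 10 s in queue
  pool_max_queue_size: 2_000,           # reject new requests beyond this backlog
  pool_reconcile_interval_ms: 2_000,    # restore missing workers every 2 s
  pool_reconcile_batch_size: 4,         # respawn at most 4 workers per tick
  graceful_shutdown_timeout_ms: 10_000  # give Python 10 s before SIGKILL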

grpc_port and grpc_host remain supported for legacy configurations, but new deployments should use grpc_listener.

gRPC Listener Modes

Internal-only mode binds to an ephemeral port and advertises localhost to workers:

config :snakepit,
  grpc_listener: %{
    mode: :internal
  }

External bindings require explicit host/port configuration:

config :snakepit,
  grpc_listener: %{
    mode: :external,
    host: "localhost",
    bind_host: "0.0.0.0",
    port: 50051
  }

To run multiple instances on the same host, use pooled external ports:

config :snakepit,
  grpc_listener: %{
    mode: :external_pool,
    host: "localhost",
    bind_host: "0.0.0.0",
    base_port: 50051,
    pool_size: 32
  }

Use instance_name, instance_token, and data_dir to isolate registry state when sharing a deployment directory.

instance_name is for environment-level grouping (for example prod-us-east-1). instance_token must be unique for each concurrently running VM (for example a deploy slot, CI job, or terminal session). Without unique tokens, concurrent instances from the same codebase can treat each other as rogue/orphan processes during cleanup.

Example:

config :snakepit,
  instance_name: "my-app",
  instance_token: "job-1234",
  data_dir: "/var/lib/snakepit"

Capacity Strategies

| Strategy | Description |
|---|---|
| :pool | Each pool manages its own capacity independently. Default and simplest option. |
| :profile | Workers of the same profile share capacity across pools. |
| :hybrid | Combination of pool and profile strategies for complex deployments. |
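
As a sketch, sharing capacity across pools of the same profile only requires setting the strategy globally; the pool names and sizes below are illustrative:

config :snakepit,
  capacity_strategy: :profile,
  pools: [
    %{name: :api,   worker_profile: :process, pool_size: 20},
    %{name: :batch, worker_profile: :process, pool_size: 20}
  ]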

Pool Configuration

Each pool can be configured independently with these options.

Required Fields

| Option | Type | Description |
|---|---|---|
| name | atom() | Unique pool identifier. Use :default for the primary pool. |

Profile Selection

| Option | Type | Default | Description |
|---|---|---|---|
| worker_profile | :process \| :thread | :process | Worker execution model. See the Worker Profiles Guide. |

Common Pool Options

| Option | Type | Default | Description |
|---|---|---|---|
| pool_size | pos_integer() | Global setting | Number of workers in this pool. |
| adapter_module | module() | Global setting | Adapter module for this pool. |
| adapter_args | list(String.t()) | [] | CLI arguments passed to the Python server. |
| adapter_env | list({String.t(), String.t()}) | [] | Environment variables for Python processes. |
| adapter_spec | String.t() | nil | Python adapter module path (e.g., "myapp.adapters.MyAdapter"). |
| affinity | :hint \| :strict_queue \| :strict_fail_fast | Global setting | Session affinity behavior for this pool. |

When adapters implement command_timeout/2, Snakepit resolves command timeout from the selected worker's pool adapter_module first. The global config :snakepit, adapter_module: ... value is used only if a pool does not declare its own adapter.
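
As an illustrative sketch only (assuming the callback receives the command name and its arguments and returns a timeout in milliseconds; the module name is hypothetical), an adapter can give specific commands a longer timeout:

defmodule MyApp.Adapters.SlowJobAdapter do
  # Hypothetical adapter; only the timeout callback is shown here.
  # Assumption: command_timeout/2 receives the command name and its args.
  def command_timeout("train_model", _args), do: 600_000  # 10 minutes
  def command_timeout(_command, _args), do: 30_000         # everything else
end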

Session Affinity Modes

config :snakepit,
  affinity: :hint  # global default

config :snakepit,
  pools: [
    %{name: :default, affinity: :strict_queue, pool_size: 4}
  ]

  • :hint (default) — Prefer the last worker if available; otherwise fall back.
  • :strict_queue — Queue when the preferred worker is busy; guarantees same-worker routing but can increase latency and queue timeouts.
  • :strict_fail_fast — Return {:error, %Snakepit.Error{category: :pool, details: %{reason: :worker_busy}}} when the preferred worker is busy.

If the preferred worker is tainted or missing, strict modes return {:error, %Snakepit.Error{category: :pool, details: %{reason: :session_worker_unavailable}}}.
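
A sketch of handling these errors in calling code, assuming a session-based call along the lines of Snakepit.execute_in_session/4 (the call shape and arguments here are illustrative):

case Snakepit.execute_in_session(session_id, "predict", %{input: data}, pool: :default) do
  {:ok, result} ->
    result

  {:error, %Snakepit.Error{category: :pool, details: %{reason: :worker_busy}}} ->
    # :strict_fail_fast returned immediately; retry later or route elsewhere
    {:retry_later, session_id}

  {:error, %Snakepit.Error{category: :pool, details: %{reason: :session_worker_unavailable}}} ->
    # preferred worker is tainted or gone; re-establish the session
    {:new_session_needed, session_id}
end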

Process Profile Options

These options apply when worker_profile: :process:

| Option | Type | Default | Description |
|---|---|---|---|
| startup_batch_size | pos_integer() | 8 | Workers started per batch during pool initialization. |
| startup_batch_delay_ms | non_neg_integer() | 750 | Delay between startup batches (ms). Reduces system load during startup. |
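
For example, a process-profile pool that ramps up gently during startup (values are illustrative):

%{
  name: :default,
  worker_profile: :process,
  pool_size: 100,
  startup_batch_size: 4,         # start 4 workers at a time
  startup_batch_delay_ms: 1_000  # pause 1 s between batches
}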

Thread Profile Options

These options apply when worker_profile: :thread:

| Option | Type | Default | Description |
|---|---|---|---|
| threads_per_worker | pos_integer() | 10 | Thread pool size per Python process. Total capacity = pool_size * threads_per_worker. |
| thread_safety_checks | boolean() | false | Enable runtime thread safety validation. Useful for development. |
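
For example, a thread-profile pool with safety checks enabled during development (values are illustrative):

%{
  name: :ml_inference,
  worker_profile: :thread,
  pool_size: 2,
  threads_per_worker: 16,      # 32 total capacity
  thread_safety_checks: true   # extra runtime validation; disable in production
}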

Worker Lifecycle Options

| Option | Type | Default | Description |
|---|---|---|---|
| worker_ttl | :infinity \| {value, unit} | :infinity | Maximum worker lifetime before recycling. |
| worker_max_requests | :infinity \| pos_integer() | :infinity | Maximum requests before recycling a worker. |

TTL Units:

| Unit | Example |
|---|---|
| :seconds | {3600, :seconds} - 1 hour |
| :minutes | {60, :minutes} - 1 hour |
| :hours | {1, :hours} - 1 hour |

Worker recycling helps prevent memory leaks and ensures fresh worker state.
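
For example, a pool that recycles workers after one hour of lifetime or after 5,000 handled requests (illustrative values; the two limits are independent options):

%{
  name: :background,
  worker_profile: :process,
  pool_size: 10,
  worker_ttl: {1, :hours},     # recycle after one hour of lifetime
  worker_max_requests: 5_000   # recycle after 5,000 handled requests
}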


Heartbeat Configuration

Heartbeats detect unresponsive workers and trigger automatic restarts.

Global Heartbeat Config

config :snakepit,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 2000,
    timeout_ms: 10000,
    max_missed_heartbeats: 3,
    initial_delay_ms: 0,
    dependent: true
  }

Per-Pool Heartbeat Config

%{
  name: :ml_pool,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 10000,
    timeout_ms: 30000,
    max_missed_heartbeats: 2
  }
}

Heartbeat Options

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean() | true | Enable heartbeat monitoring. |
| ping_interval_ms | pos_integer() | 2000 | Interval between heartbeat pings. |
| timeout_ms | pos_integer() | 10000 | Maximum time to wait for a heartbeat response. |
| max_missed_heartbeats | pos_integer() | 3 | Missed heartbeats before declaring a worker dead. |
| initial_delay_ms | non_neg_integer() | 0 | Delay before the first heartbeat ping. |
| dependent | boolean() | true | Whether the worker terminates if the heartbeat monitor dies. |

Tuning Guidelines

  • Fast detection: Lower ping_interval_ms and max_missed_heartbeats
  • Reduce overhead: Higher ping_interval_ms for stable workloads
  • Long operations: Increase timeout_ms if workers run long computations
  • ML workloads: Use ping_interval_ms: 10000 or higher since inference can block (see the sketch below)
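
Applying these guidelines, a relaxed per-pool heartbeat for long-running inference might look like this (values are illustrative):

%{
  name: :ml_inference,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 15_000,   # ping less often; inference can block
    timeout_ms: 60_000,         # allow up to a minute for a response
    max_missed_heartbeats: 2
  }
}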

Logging Configuration

Snakepit uses its own logger for internal operations.

Log Level

config :snakepit,
  log_level: :info  # :debug | :info | :warning | :error | :none

| Level | Description |
|---|---|
| :debug | Verbose output including worker lifecycle, gRPC calls, heartbeats |
| :info | Normal operation messages |
| :warning | Potential issues that do not stop operation |
| :error | Errors that affect functionality |
| :none | Disable all Snakepit logging |

Log Categories

Fine-grained control over logging categories:

config :snakepit,
  log_level: :info,
  log_categories: %{
    pool: :debug,      # Pool operations
    worker: :debug,    # Worker lifecycle
    heartbeat: :info,  # Heartbeat monitoring
    grpc: :warning     # gRPC communication
  }

Python-Side Logging

The Python bridge respects the SNAKEPIT_LOG_LEVEL environment variable:

%{
  name: :default,
  adapter_env: [{"SNAKEPIT_LOG_LEVEL", "info"}]
}

Python Runtime Configuration

Configure how Python interpreters are discovered and managed.

Interpreter Selection

config :snakepit,
  python_executable: "/path/to/python3"

Or set the SNAKEPIT_PYTHON environment variable, which takes precedence over python_executable:

export SNAKEPIT_PYTHON="/path/to/python3"

Runtime Strategy

config :snakepit,
  python_runtime: %{
    strategy: :venv,  # :system | :venv | :managed
    managed: false,
    version: "3.12"
  }

| Strategy | Description |
|---|---|
| :system | Use the system Python interpreter |
| :venv | Use the project virtual environment (.venv/bin/python3) |
| :managed | Let Snakepit manage the Python version (experimental) |

Environment Variables per Pool

%{
  name: :ml_pool,
  adapter_env: [
    # Control threading in numerical libraries
    {"OPENBLAS_NUM_THREADS", "1"},
    {"MKL_NUM_THREADS", "1"},
    {"OMP_NUM_THREADS", "1"},
    {"NUMEXPR_NUM_THREADS", "1"},

    # GPU configuration
    {"CUDA_VISIBLE_DEVICES", "0"},

    # Python settings
    {"PYTHONUNBUFFERED", "1"},
    {"SNAKEPIT_LOG_LEVEL", "warning"}
  ]
}

Optional Features

Zero-Copy Data Transfer

Enable zero-copy for large binary data:

config :snakepit,
  zero_copy: %{
    enabled: true,
    threshold_bytes: 1_048_576  # 1 MB
  }

Zero-copy is beneficial for ML workloads with large tensors.

Crash Barrier

Limit restart attempts for frequently crashing workers:

config :snakepit,
  crash_barrier: %{
    enabled: true,
    max_restarts: 5,
    window_seconds: 60
  }

If a worker restarts more than max_restarts times within window_seconds, it is permanently removed from the pool.

Circuit Breaker

Prevent cascading failures:

config :snakepit,
  circuit_breaker: %{
    enabled: true,
    failure_threshold: 5,
    reset_timeout_ms: 30000
  }

After failure_threshold consecutive failures, the circuit opens and requests fail fast for reset_timeout_ms.

Rogue Cleanup

Control startup orphan-process cleanup:

config :snakepit, :rogue_cleanup, enabled: false
# equivalent map form:
config :snakepit, rogue_cleanup: %{enabled: false}

enabled: false is treated as an explicit disable and is not replaced by defaults.


Complete Configuration Example

Here is a production-ready configuration demonstrating all major options:

# config/config.exs
config :snakepit,
  # Global settings
  pooling_enabled: true,
  pool_startup_timeout: 30_000,
  pool_queue_timeout: 10_000,
  pool_max_queue_size: 5000,
  grpc_listener: %{
    mode: :external,
    host: "snakepit.internal",
    bind_host: "0.0.0.0",
    port: 50051
  },

  # Logging
  log_level: :info,
  log_categories: %{
    pool: :info,
    worker: :warning,
    heartbeat: :warning,
    grpc: :warning
  },

  # Global heartbeat defaults
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 5000,
    timeout_ms: 15000,
    max_missed_heartbeats: 3
  },

  # Multiple pools
  pools: [
    # Default pool for I/O-bound tasks
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 50,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.GeneralAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "1"},
        {"OMP_NUM_THREADS", "1"}
      ],
      startup_batch_size: 10,
      startup_batch_delay_ms: 500
    },

    # ML inference pool (CPU-bound, thread profile)
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 8,  # 32 total capacity
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "8"},
        {"OMP_NUM_THREADS", "8"},
        {"CUDA_VISIBLE_DEVICES", "0"},
        {"PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512"}
      ],
      thread_safety_checks: false,
      worker_ttl: {1800, :seconds},
      worker_max_requests: 10000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 10000,
        timeout_ms: 60000,
        max_missed_heartbeats: 2
      }
    },

    # Background processing pool
    %{
      name: :background,
      worker_profile: :process,
      pool_size: 10,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.BackgroundAdapter"],
      adapter_env: [
        {"SNAKEPIT_LOG_LEVEL", "warning"}
      ],
      worker_ttl: {3600, :seconds}
    }
  ],

  # Optional features
  crash_barrier: %{
    enabled: true,
    max_restarts: 10,
    window_seconds: 300
  }

Environment-Specific Overrides

# config/prod.exs
config :snakepit,
  log_level: :warning,
  pool_max_queue_size: 10000

# config/dev.exs
config :snakepit,
  log_level: :debug,
  pool_size: 4

# config/test.exs
config :snakepit,
  pooling_enabled: false

Validation

Verify your configuration with the doctor task:

mix snakepit.doctor

At runtime, check pool status:

iex> Snakepit.get_stats()
%{
  requests: 15432,
  queued: 5,
  errors: 12,
  queue_timeouts: 3,
  pool_saturated: 0,
  workers: 54,
  available: 49,
  busy: 5
}