Snakepit Configuration Guide

This guide covers all configuration options for Snakepit, from simple single-pool setups to advanced multi-pool deployments with different worker profiles.


Table of Contents

  1. Configuration Formats
  2. Global Options
  3. Pool Configuration
  4. Heartbeat Configuration
  5. Logging Configuration
  6. Python Runtime Configuration
  7. Optional Features
  8. Complete Configuration Example

Configuration Formats

Snakepit supports two configuration formats: legacy (single-pool) and multi-pool (v0.6+).

Simple (Legacy) Configuration

For backward compatibility with v0.5.x and single-pool deployments:

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: 100,
  pool_config: %{
    startup_batch_size: 8,
    startup_batch_delay_ms: 750,
    max_workers: 1000
  }

This format creates a single pool named :default with the specified settings.

Multi-Pool Configuration (v0.6+)

For advanced deployments with multiple pools, each with different profiles:

# config/config.exs
config :snakepit,
  pools: [
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 100,
      adapter_module: Snakepit.Adapters.GRPCPython
    },
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 16,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"]
    }
  ]

This creates two pools: :default for general tasks and :ml_inference for CPU-bound ML workloads.


Global Options

These options apply to all pools or the Snakepit application as a whole.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| pooling_enabled | boolean() | false | Enable or disable worker pooling. Set to true for normal operation. |
| adapter_module | module() | nil | Default adapter module for pools that do not specify one. |
| pool_size | pos_integer() | System.schedulers_online() * 2 | Default pool size. Typically 2x CPU cores. |
| capacity_strategy | :pool \| :profile \| :hybrid | :pool | How worker capacity is managed across pools. |
| pool_startup_timeout | pos_integer() | 10000 | Maximum time (ms) to wait for a worker to start. |
| pool_queue_timeout | pos_integer() | 5000 | Maximum time (ms) a request waits in queue. |
| pool_max_queue_size | pos_integer() | 1000 | Maximum queued requests before rejecting new ones. |
| grpc_port | pos_integer() | 50051 | Port for the Elixir gRPC server (Python-to-Elixir calls). |
| grpc_host | String.t() | "localhost" | Host for gRPC connections. |
| graceful_shutdown_timeout_ms | pos_integer() | 6000 | Time (ms) to wait for Python to terminate gracefully before SIGKILL. |
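
Putting several of these together, a minimal top-level configuration might look like the following sketch (the timeout and queue values are illustrative, not recommendations):

# config/config.exs
config :snakepit,
  pooling_enabled: true,
  adapter_module: Snakepit.Adapters.GRPCPython,
  pool_size: System.schedulers_online() * 2,
  pool_startup_timeout: 15_000,      # allow slower Python startups
  pool_queue_timeout: 5_000,
  pool_max_queue_size: 2_000,
  grpc_port: 50051,
  graceful_shutdown_timeout_ms: 6_000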

Capacity Strategies

| Strategy | Description |
| --- | --- |
| :pool | Each pool manages its own capacity independently. Default and simplest option. |
| :profile | Workers of the same profile share capacity across pools. |
| :hybrid | Combination of pool and profile strategies for complex deployments. |
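
For example, to let workers of the same profile share capacity across two pools, the strategy is set at the top level next to the pool list. A minimal sketch (pool names and sizes are illustrative):

config :snakepit,
  capacity_strategy: :profile,
  pools: [
    # adapter_module falls back to the global default for both pools
    %{name: :default, worker_profile: :process, pool_size: 50},
    %{name: :batch, worker_profile: :process, pool_size: 50}
  ]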

Pool Configuration

Each pool can be configured independently with these options.

Required Fields

| Option | Type | Description |
| --- | --- | --- |
| name | atom() | Unique pool identifier. Use :default for the primary pool. |

Profile Selection

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| worker_profile | :process \| :thread | :process | Worker execution model. See Worker Profiles Guide. |

Common Pool Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| pool_size | pos_integer() | Global setting | Number of workers in this pool. |
| adapter_module | module() | Global setting | Adapter module for this pool. |
| adapter_args | list(String.t()) | [] | CLI arguments passed to the Python server. |
| adapter_env | list({String.t(), String.t()}) | [] | Environment variables for Python processes. |
| adapter_spec | String.t() | nil | Python adapter module path (e.g., "myapp.adapters.MyAdapter"). |
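
A sketch of a single pool entry combining these options; the :reports pool name and the myapp.adapters.MyAdapter path are illustrative placeholders:

%{
  name: :reports,
  pool_size: 8,
  adapter_module: Snakepit.Adapters.GRPCPython,
  adapter_spec: "myapp.adapters.MyAdapter",
  adapter_env: [{"PYTHONUNBUFFERED", "1"}]
}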

Process Profile Options

These options apply when worker_profile: :process:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| startup_batch_size | pos_integer() | 8 | Workers started per batch during pool initialization. |
| startup_batch_delay_ms | non_neg_integer() | 750 | Delay between startup batches (ms). Reduces system load during startup. |
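
In the multi-pool format these options sit directly in the pool map, as in the complete example later in this guide. A sketch of a large process pool that starts its workers in smaller, slower batches (batch values are illustrative):

%{
  name: :default,
  worker_profile: :process,
  pool_size: 200,
  adapter_module: Snakepit.Adapters.GRPCPython,
  startup_batch_size: 4,         # start 4 workers at a time
  startup_batch_delay_ms: 1_000  # wait 1 second between batches
}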

Thread Profile Options

These options apply when worker_profile: :thread:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| threads_per_worker | pos_integer() | 10 | Thread pool size per Python process. Total capacity = pool_size * threads_per_worker. |
| thread_safety_checks | boolean() | false | Enable runtime thread safety validation. Useful for development. |
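
As a concrete illustration of the capacity formula, the sketch below yields 4 * 16 = 64 concurrent request slots (pool name and sizes are illustrative):

%{
  name: :ml_inference,
  worker_profile: :thread,
  pool_size: 4,
  threads_per_worker: 16,       # total capacity = 4 * 16 = 64 concurrent requests
  thread_safety_checks: true    # extra validation, useful during development
}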

Worker Lifecycle Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| worker_ttl | :infinity \| {value, unit} | :infinity | Maximum worker lifetime before recycling. |
| worker_max_requests | :infinity \| pos_integer() | :infinity | Maximum requests before recycling a worker. |

TTL Units:

| Unit | Example |
| --- | --- |
| :seconds | {3600, :seconds} - 1 hour |
| :minutes | {60, :minutes} - 1 hour |
| :hours | {1, :hours} - 1 hour |

Worker recycling helps prevent memory leaks and ensures fresh worker state.
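
For example, a pool that recycles each worker after one hour of lifetime or after 5,000 served requests could be sketched as follows (limits are illustrative):

%{
  name: :background,
  worker_profile: :process,
  pool_size: 10,
  adapter_module: Snakepit.Adapters.GRPCPython,
  worker_ttl: {60, :minutes},     # equivalent to {3600, :seconds}
  worker_max_requests: 5_000
}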


Heartbeat Configuration

Heartbeats detect unresponsive workers and trigger automatic restarts.

Global Heartbeat Config

config :snakepit,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 2000,
    timeout_ms: 10000,
    max_missed_heartbeats: 3,
    initial_delay_ms: 0,
    dependent: true
  }

Per-Pool Heartbeat Config

%{
  name: :ml_pool,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 10000,
    timeout_ms: 30000,
    max_missed_heartbeats: 2
  }
}

Heartbeat Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean() | true | Enable heartbeat monitoring. |
| ping_interval_ms | pos_integer() | 2000 | Interval between heartbeat pings. |
| timeout_ms | pos_integer() | 10000 | Maximum time to wait for heartbeat response. |
| max_missed_heartbeats | pos_integer() | 3 | Missed heartbeats before declaring worker dead. |
| initial_delay_ms | non_neg_integer() | 0 | Delay before first heartbeat ping. |
| dependent | boolean() | true | Whether worker terminates if heartbeat monitor dies. |

Tuning Guidelines

  • Fast detection: Lower ping_interval_ms and max_missed_heartbeats (see the sketch after this list)
  • Reduce overhead: Higher ping_interval_ms for stable workloads
  • Long operations: Increase timeout_ms if workers run long computations
  • ML workloads: Use ping_interval_ms: 10000 or higher since inference can block
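
As an illustration of these trade-offs, a per-pool heartbeat tuned for fast failure detection might look like the sketch below (values are illustrative, not recommendations):

%{
  name: :latency_sensitive,
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 1_000,     # ping frequently for quick detection
    timeout_ms: 3_000,
    max_missed_heartbeats: 2     # tolerate fewer misses before restart
  }
}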

Logging Configuration

Snakepit uses its own logger for internal operations.

Log Level

config :snakepit,
  log_level: :info  # :debug | :info | :warning | :error | :none

| Level | Description |
| --- | --- |
| :debug | Verbose output including worker lifecycle, gRPC calls, heartbeats |
| :info | Normal operation messages |
| :warning | Potential issues that do not stop operation |
| :error | Errors that affect functionality |
| :none | Disable all Snakepit logging |

Log Categories

Fine-grained control over logging categories:

config :snakepit,
  log_level: :info,
  log_categories: %{
    pool: :debug,      # Pool operations
    worker: :debug,    # Worker lifecycle
    heartbeat: :info,  # Heartbeat monitoring
    grpc: :warning     # gRPC communication
  }

Python-Side Logging

The Python bridge respects the SNAKEPIT_LOG_LEVEL environment variable:

%{
  name: :default,
  adapter_env: [{"SNAKEPIT_LOG_LEVEL", "info"}]
}

Python Runtime Configuration

Configure how Python interpreters are discovered and managed.

Interpreter Selection

config :snakepit,
  python_executable: "/path/to/python3"

Or set the SNAKEPIT_PYTHON environment variable, which takes precedence:

export SNAKEPIT_PYTHON="/path/to/python3"

Runtime Strategy

config :snakepit,
  python_runtime: %{
    strategy: :venv,  # :system | :venv | :managed
    managed: false,
    version: "3.12"
  }

| Strategy | Description |
| --- | --- |
| :system | Use system Python interpreter |
| :venv | Use project virtual environment (.venv/bin/python3) |
| :managed | Let Snakepit manage Python version (experimental) |

Environment Variables per Pool

%{
  name: :ml_pool,
  adapter_env: [
    # Control threading in numerical libraries
    {"OPENBLAS_NUM_THREADS", "1"},
    {"MKL_NUM_THREADS", "1"},
    {"OMP_NUM_THREADS", "1"},
    {"NUMEXPR_NUM_THREADS", "1"},

    # GPU configuration
    {"CUDA_VISIBLE_DEVICES", "0"},

    # Python settings
    {"PYTHONUNBUFFERED", "1"},
    {"SNAKEPIT_LOG_LEVEL", "warning"}
  ]
}

Optional Features

Zero-Copy Data Transfer

Enable zero-copy for large binary data:

config :snakepit,
  zero_copy: %{
    enabled: true,
    threshold_bytes: 1_048_576  # 1 MB
  }

Zero-copy is beneficial for ML workloads with large tensors.

Crash Barrier

Limit restart attempts for frequently crashing workers:

config :snakepit,
  crash_barrier: %{
    enabled: true,
    max_restarts: 5,
    window_seconds: 60
  }

If a worker restarts more than max_restarts times within window_seconds, it is permanently removed from the pool.

Circuit Breaker

Prevent cascading failures:

config :snakepit,
  circuit_breaker: %{
    enabled: true,
    failure_threshold: 5,
    reset_timeout_ms: 30000
  }

After failure_threshold consecutive failures, the circuit opens and requests fail fast for reset_timeout_ms.


Complete Configuration Example

Here is a production-ready configuration demonstrating all major options:

# config/config.exs
config :snakepit,
  # Global settings
  pooling_enabled: true,
  pool_startup_timeout: 30_000,
  pool_queue_timeout: 10_000,
  pool_max_queue_size: 5000,
  grpc_port: 50051,

  # Logging
  log_level: :info,
  log_categories: %{
    pool: :info,
    worker: :warning,
    heartbeat: :warning,
    grpc: :warning
  },

  # Global heartbeat defaults
  heartbeat: %{
    enabled: true,
    ping_interval_ms: 5000,
    timeout_ms: 15000,
    max_missed_heartbeats: 3
  },

  # Multiple pools
  pools: [
    # Default pool for I/O-bound tasks
    %{
      name: :default,
      worker_profile: :process,
      pool_size: 50,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.GeneralAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "1"},
        {"OMP_NUM_THREADS", "1"}
      ],
      startup_batch_size: 10,
      startup_batch_delay_ms: 500
    },

    # ML inference pool (CPU-bound, thread profile)
    %{
      name: :ml_inference,
      worker_profile: :thread,
      pool_size: 4,
      threads_per_worker: 8,  # 32 total capacity
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.ml.InferenceAdapter"],
      adapter_env: [
        {"OPENBLAS_NUM_THREADS", "8"},
        {"OMP_NUM_THREADS", "8"},
        {"CUDA_VISIBLE_DEVICES", "0"},
        {"PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512"}
      ],
      thread_safety_checks: false,
      worker_ttl: {1800, :seconds},
      worker_max_requests: 10000,
      heartbeat: %{
        enabled: true,
        ping_interval_ms: 10000,
        timeout_ms: 60000,
        max_missed_heartbeats: 2
      }
    },

    # Background processing pool
    %{
      name: :background,
      worker_profile: :process,
      pool_size: 10,
      adapter_module: Snakepit.Adapters.GRPCPython,
      adapter_args: ["--adapter", "myapp.adapters.BackgroundAdapter"],
      adapter_env: [
        {"SNAKEPIT_LOG_LEVEL", "warning"}
      ],
      worker_ttl: {3600, :seconds}
    }
  ],

  # Optional features
  crash_barrier: %{
    enabled: true,
    max_restarts: 10,
    window_seconds: 300
  }

Environment-Specific Overrides

# config/prod.exs
config :snakepit,
  log_level: :warning,
  pool_max_queue_size: 10000

# config/dev.exs
config :snakepit,
  log_level: :debug,
  pool_size: 4

# config/test.exs
config :snakepit,
  pooling_enabled: false

Validation

Verify your configuration with the doctor task:

mix snakepit.doctor

At runtime, check pool status:

iex> Snakepit.get_stats()
%{
  requests: 15432,
  queued: 5,
  errors: 12,
  queue_timeouts: 3,
  pool_saturated: 0,
  workers: 54,
  available: 49,
  busy: 5
}