Timeout Configuration Guide


Snakepit includes a unified timeout architecture designed for reliability in production deployments. This guide covers timeout profiles, deadline propagation, and configuration strategies for different workloads.


Table of Contents

  1. Overview
  2. Timeout Profiles
  3. How Timeouts Work
  4. Configuration Reference
  5. Common Scenarios
  6. Debugging Timeout Issues
  7. Migration Guide
  8. API Reference

Overview

The Problem

Earlier releases had fragmented timeout configuration with 7+ independent timeout keys that didn't coordinate:

| Issue | Symptom |
|---|---|
| `pool_request_timeout` vs `grpc_command_timeout` confusion | Unclear which is outer, which is inner |
| Queue wait consumed budget invisibly | Inner timeouts didn't account for queue time |
| `GenServer.call` timeouts fired before inner timeouts | Unhandled exits instead of structured errors |

The Solution

Snakepit now uses a single-budget, derived deadlines architecture:

  1. One top-level timeout budget set at request entry
  2. Deadline propagation tracks remaining time through the stack
  3. Inner timeouts derived from remaining budget minus safety margins
  4. Profile-based defaults for different deployment scenarios

Timeout Profiles

Profiles provide sensible defaults for common deployment scenarios. Configure via:

config :snakepit, timeout_profile: :production

Profile Comparison

| Profile | default_timeout | stream_timeout | queue_timeout | Use Case |
|---|---|---|---|---|
| `:balanced` | 300s (5m) | 900s (15m) | 10s | General purpose, default |
| `:production` | 300s (5m) | 900s (15m) | 10s | Production deployments |
| `:production_strict` | 60s | 300s (5m) | 5s | Latency-sensitive APIs |
| `:development` | 900s (15m) | 3600s (60m) | 60s | Local development, debugging |
| `:ml_inference` | 900s (15m) | 3600s (60m) | 60s | ML model inference |
| `:batch` | 3600s (60m) | `:infinity` | 300s (5m) | Batch processing jobs |

Profile Selection Guidelines

| Workload Type | Recommended Profile | Rationale |
|---|---|---|
| Web API backends | `:production_strict` | Fast failure for user-facing requests |
| Background jobs | `:batch` | Long-running operations need patience |
| ML inference | `:ml_inference` | Model loading and inference are slow |
| Development | `:development` | Generous timeouts for debugging |
| Mixed workloads | `:balanced` | Good defaults for most cases |

Using Profiles

# config/runtime.exs

# Production API server
config :snakepit, timeout_profile: :production_strict

# ML inference service
config :snakepit, timeout_profile: :ml_inference

# Batch processing worker
config :snakepit, timeout_profile: :batch
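If the same release is deployed in several roles, the profile can be chosen at runtime. A minimal sketch, assuming you key it off an environment variable (`SNAKEPIT_TIMEOUT_PROFILE` is a hypothetical name, not something Snakepit reads itself):

```elixir
# config/runtime.exs -- sketch: select a profile per deployment role.
# SNAKEPIT_TIMEOUT_PROFILE is an illustrative env var name (assumption).
profile =
  case System.get_env("SNAKEPIT_TIMEOUT_PROFILE", "balanced") do
    "production_strict" -> :production_strict
    "ml_inference" -> :ml_inference
    "batch" -> :batch
    _ -> :balanced
  end

config :snakepit, timeout_profile: profile
```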

How Timeouts Work

The Timeout Stack

Requests flow through multiple layers, each with its own timeout:


  User Code
    Snakepit.execute("cmd", args, timeout: 60_000)
          |
          v
  Pool Layer
    - GenServer.call timeout: 60_000
    - Queue wait (if workers busy): up to queue_timeout
    - Deadline stored: now + 60_000
          |
          v
  Worker Layer
    - GenServer.call timeout: remaining - 1000ms margin
    - Forwards to gRPC adapter
          |
          v
  gRPC Layer
    - gRPC call timeout: remaining - 1200ms total margins
    - Actual Python execution

Margin Formula

Inner timeouts are derived from the total budget minus safety margins:

rpc_timeout = total_timeout - worker_call_margin_ms - pool_reply_margin_ms

| Margin | Default | Purpose |
|---|---|---|
| `worker_call_margin_ms` | 1000ms | GenServer.call overhead to worker |
| `pool_reply_margin_ms` | 200ms | Pool reply processing overhead |

Example: With a 60-second total budget:

  • Total: 60,000ms
  • Worker margin: -1,000ms
  • Pool margin: -200ms
  • RPC timeout: 58,800ms

This ensures inner timeouts expire before outer GenServer.call timeouts, reducing timeout-related exits and returning structured Snakepit.Error tuples at public API boundaries.
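The derivation can be sketched as a pure function. This is a simplified model of `Defaults.rpc_timeout/1` (shown later in this guide), not the actual implementation; the clamping floor is an assumption:

```elixir
# Simplified sketch of the margin formula. Default margins of 1_000ms
# (worker) and 200ms (pool) come from the table above; clamping the
# result to at least 1ms is an illustrative assumption.
def rpc_timeout(total_timeout, worker_margin \\ 1_000, pool_margin \\ 200) do
  max(total_timeout - worker_margin - pool_margin, 1)
end

# rpc_timeout(60_000) => 58_800, matching the example above
```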

Deadline Propagation

When a request enters the pool, a deadline is computed and stored:

# Inside Pool.execute/3
deadline_ms = System.monotonic_time(:millisecond) + timeout
opts_with_deadline = Keyword.put(opts, :deadline_ms, deadline_ms)

As the request moves through the stack:

  1. Queue handler uses effective_queue_timeout_ms/2 to respect deadline
  2. Worker execution uses derive_rpc_timeout_from_opts/2 to compute remaining budget
  3. All layers return structured errors instead of crashing on timeout

Queue-Aware Timeouts

If a request waits in queue, that time is subtracted from the budget:

# Request with 60s budget waits 5s in queue
# Remaining budget for execution: 55s (minus margins)

This prevents the common bug where queue wait + execution time exceeds the user's expected timeout.
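The budget accounting above can be sketched with the stored deadline. `remaining_budget/1` is a hypothetical helper for illustration, not part of the public API:

```elixir
# Sketch: queue-aware budget accounting, assuming deadline_ms is the
# monotonic timestamp stored by Pool.execute/3 (see above).
def remaining_budget(deadline_ms) do
  max(deadline_ms - System.monotonic_time(:millisecond), 0)
end

# A 60s budget that spent 5s in queue leaves roughly 55_000ms, from
# which the safety margins are then subtracted to derive the gRPC timeout.
```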


Configuration Reference

# config/runtime.exs
config :snakepit,
  timeout_profile: :production,
  
  # Optional: customize margins
  worker_call_margin_ms: 1000,
  pool_reply_margin_ms: 200

Explicit Timeout Configuration

Override profile defaults with explicit values:

config :snakepit,
  timeout_profile: :production,
  
  # These override profile-derived values
  pool_request_timeout: 120_000,      # 2 minutes
  pool_streaming_timeout: 600_000,    # 10 minutes
  pool_queue_timeout: 15_000,         # 15 seconds
  grpc_command_timeout: 90_000,       # 90 seconds
  grpc_worker_execute_timeout: 95_000 # 95 seconds

Complete Timeout Options

| Option | Default | Layer | Description |
|---|---|---|---|
| `timeout_profile` | `:balanced` | Global | Profile to use for defaults |
| `pool_request_timeout` | Profile-derived | Pool | GenServer.call timeout for execute |
| `pool_streaming_timeout` | Profile-derived | Pool | GenServer.call timeout for streaming |
| `pool_queue_timeout` | Profile-derived | Pool | Max time request waits in queue |
| `checkout_timeout` | Profile-derived | Pool | Worker checkout for streaming |
| `pool_startup_timeout` | 10,000ms | Pool | Worker startup timeout |
| `pool_await_ready_timeout` | 15,000ms | Pool | Wait for pool initialization |
| `grpc_worker_execute_timeout` | Profile-derived | Worker | GenServer.call to GRPCWorker |
| `grpc_worker_stream_timeout` | 300,000ms | Worker | Streaming GenServer.call |
| `grpc_worker_health_check_timeout_ms` | 5,000ms | Worker | Periodic health-check gRPC timeout |
| `grpc_command_timeout` | Profile-derived | Adapter | gRPC call timeout |
| `grpc_batch_inference_timeout` | 300,000ms | Adapter | Batch inference operations |
| `grpc_large_dataset_timeout` | 600,000ms | Adapter | Large dataset processing |
| `grpc_server_ready_timeout` | 30,000ms | Worker | Python server readiness |
| `worker_ready_timeout` | 30,000ms | Worker | Worker ready notification |
| `graceful_shutdown_timeout_ms` | 6,000ms | Worker | Python process shutdown |
| `worker_call_margin_ms` | 1,000ms | Margin | Worker GenServer.call overhead |
| `pool_reply_margin_ms` | 200ms | Margin | Pool reply overhead |

Per-Call Timeout Override

Override timeouts for individual calls:

# Use default from profile
Snakepit.execute("fast_command", %{})

# Override for slow operation
Snakepit.execute("slow_inference", %{model: "large"}, timeout: 300_000)

# Streaming with custom timeout
Snakepit.execute_stream("generate", %{}, callback, timeout: 600_000)
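When an override still expires, the guide notes that timeouts surface as structured `Snakepit.Error` tuples rather than exits. A handling sketch (the exact struct fields are not documented here, so no fields are matched beyond the struct itself):

```elixir
# Sketch: handling a timeout from a per-call override. The
# {:error, %Snakepit.Error{}} shape follows the guide's description of
# structured errors at public API boundaries.
require Logger

case Snakepit.execute("slow_inference", %{model: "large"}, timeout: 300_000) do
  {:ok, result} ->
    result

  {:error, %Snakepit.Error{} = error} ->
    Logger.warning("Inference failed: #{inspect(error)}")
    {:error, :inference_failed}
end
```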

Common Scenarios

Scenario 1: LLM API Calls (60+ seconds)

Problem: Default timeouts are too short for LLM inference.

Solution: Use :ml_inference profile or explicit config:

# Option A: Profile-based
config :snakepit, timeout_profile: :ml_inference

# Option B: Explicit timeouts
config :snakepit,
  pool_request_timeout: 300_000,
  grpc_command_timeout: 280_000,
  grpc_worker_execute_timeout: 290_000

Per-call override:

Snakepit.execute("llm_generate", %{prompt: prompt}, timeout: 120_000)

Scenario 2: Fast API with Strict SLAs

Problem: Need fast failure for user-facing requests.

Solution: Use :production_strict profile:

config :snakepit, timeout_profile: :production_strict

This gives you:

  • 60-second default timeout
  • 5-second queue timeout (fail fast if pool is saturated)
  • Quick feedback to users

Scenario 3: Batch Processing Jobs

Problem: Jobs run for hours, need infinite streaming timeout.

Solution: Use :batch profile:

config :snakepit, timeout_profile: :batch

This gives you:

  • 60-minute default timeout
  • Infinite streaming timeout
  • 5-minute queue tolerance

Scenario 4: Mixed Workloads

Problem: Same pool handles fast and slow operations.

Solution: Use :balanced profile with per-call overrides:

config :snakepit, timeout_profile: :balanced

# Fast operations use default
Snakepit.execute("lookup", %{id: id})

# Slow operations override
Snakepit.execute("batch_process", %{data: data}, timeout: 600_000)

Scenario 5: Pool Initialization Takes Too Long

Problem: Starting 50+ workers with heavy model loading.

Solution: Increase startup timeouts:

config :snakepit,
  pool_startup_timeout: 120_000,       # 2 min per worker
  pool_await_ready_timeout: 600_000,   # 10 min total
  grpc_server_ready_timeout: 120_000   # 2 min for Python ready

Scenario 6: Workers Killed During Shutdown

Problem: Python cleanup takes longer than 6 seconds.

Solution: Increase graceful shutdown timeout:

config :snakepit,
  graceful_shutdown_timeout_ms: 15_000  # 15 seconds

Note: This must be >= Python's shutdown envelope: server.stop(2s) + wait_for_termination(3s) = 5s.


Debugging Timeout Issues

Enable Debug Logging

config :snakepit,
  log_level: :debug,
  log_categories: %{
    pool: :debug,
    grpc: :debug,
    worker: :debug
  }

Identify Which Timeout Fired

| Log Pattern | Timeout Type |
|---|---|
| `** (exit) {:timeout, {GenServer, :call, ...}` | `GenServer.call` timeout |
| `gRPC error: %GRPC.RPCError{status: 4...}` | gRPC `DEADLINE_EXCEEDED` |
| `Request timed out after Xms` | Pool queue timeout |
| `Timeout waiting for Python gRPC server` | Server ready timeout |
| `Pool execute timed out` | Pool-level structured timeout |

Use Telemetry

require Logger

:telemetry.attach("timeout-debug", [:snakepit, :request, :executed],
  fn _name, measurements, metadata, _config ->
    if measurements[:duration_us] > 30_000_000 do  # > 30s
      Logger.warning("Slow request: #{metadata.command} took #{measurements[:duration_us] / 1_000}ms")
    end
  end, nil)

Check Pool Stats

iex> Snakepit.get_stats()
%{
  requests: 15432,
  queued: 5,           # Requests waiting in queue
  queue_timeouts: 12,  # Queue timeout count
  pool_saturated: 3,   # Times pool was at capacity
  ...
}

A high queue_timeouts count indicates you need one of the following:

  • More workers (pool_size)
  • Higher pool_queue_timeout
  • Faster Python operations
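The first two remedies can be combined in config. A sketch (the `pool_size` value of 16 is an illustrative assumption, not a recommendation):

```elixir
# Sketch: addressing frequent queue timeouts.
config :snakepit,
  pool_size: 16,              # more workers so the queue drains faster
  pool_queue_timeout: 30_000  # or tolerate longer waits (30s)
```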

Verify Timeout Derivation

iex> alias Snakepit.Defaults

# Check current profile
iex> Defaults.timeout_profile()
:balanced

# Check derived values
iex> Defaults.default_timeout()
300_000

iex> Defaults.rpc_timeout(60_000)
58_800  # 60_000 - 1000 - 200

Migration Guide

Legacy Configuration (Pre-Unified Timeouts)

The timeout architecture is fully backward compatible. Existing configurations continue to work:

# This still works in current releases
config :snakepit,
  pool_request_timeout: 60_000,
  grpc_command_timeout: 30_000

Behavior changes:

  • When explicit timeouts are set, they take precedence over profile-derived values
  • When not set, values now derive from the active profile (default: :balanced)
  • Default values are similar to previous versions for :balanced profile
Recommended migration steps:

  1. Test with defaults: Remove explicit timeout config, use profile defaults
  2. Select appropriate profile: Choose based on workload type
  3. Fine-tune if needed: Override specific values that don't fit

# Before (legacy timeouts)
config :snakepit,
  pool_request_timeout: 300_000,
  pool_streaming_timeout: 900_000,
  pool_queue_timeout: 10_000,
  grpc_command_timeout: 280_000

# After (unified timeouts) - equivalent behavior
config :snakepit, timeout_profile: :balanced

Breaking Changes

None. All existing configuration keys are honored and take precedence over profile-derived values.


API Reference

Snakepit.Defaults Functions

| Function | Returns | Description |
|---|---|---|
| `timeout_profiles/0` | `map()` | All available timeout profiles |
| `timeout_profile/0` | `atom()` | Currently configured profile |
| `default_timeout/0` | `timeout()` | Profile's default timeout |
| `stream_timeout/0` | `timeout()` | Profile's streaming timeout |
| `queue_timeout/0` | `timeout()` | Profile's queue timeout |
| `rpc_timeout/1` | `timeout()` | Derive RPC timeout from total budget |
| `worker_call_margin_ms/0` | `integer()` | Worker GenServer.call margin |
| `pool_reply_margin_ms/0` | `integer()` | Pool reply margin |

Snakepit.Pool Functions

| Function | Returns | Description |
|---|---|---|
| `get_default_timeout_for_call/3` | `timeout()` | Get timeout for call type |
| `derive_rpc_timeout_from_opts/2` | `timeout()` | Derive RPC timeout from opts with deadline |
| `effective_queue_timeout_ms/2` | `integer()` | Queue timeout respecting deadline |