Performance Benchmarks: Snakepit v0.6.0
View SourceDocument Version: 1.0 Date: 2025-10-11 Test Environment: 8-core CPU, 32GB RAM, Ubuntu 22.04, Python 3.13
Table of Contents
- Executive Summary
- Test Methodology
- Process vs Thread Profile Comparison
- Memory Usage Analysis
- Throughput Benchmarks
- Latency Analysis
- Startup Time Comparison
- Worker Lifecycle Impact
- Real-World Workloads
- When to Use Which Profile
- Expected Performance Gains
- Recommendations
Executive Summary
Key Findings
| Metric | Process Profile | Thread Profile | Winner |
|---|---|---|---|
| Memory (100 workers) | 15 GB | 1.6 GB | Thread (9.4× better) |
| Startup Time (100 workers) | 10 seconds | 2 seconds | Thread (5× faster) |
| I/O Throughput | 1500 req/s | 1200 req/s | Process (1.25× better) |
| CPU Throughput | 600 jobs/hr | 2400 jobs/hr | Thread (4× better) |
| Latency (p99, I/O) | 8ms | 12ms | Process (1.5× better) |
| Latency (p99, CPU) | 150ms | 40ms | Thread (3.75× better) |
Recommendations by Workload
┌────────────────────────────────────────────────────────┐
│ Workload Type │
├────────────────────────────────────────────────────────┤
│ │
│ I/O-Bound → Process Profile │
│ (API requests, - 1500 req/s │
│ database queries, - Low latency │
│ network calls) - High concurrency │
│ │
│ CPU-Bound → Thread Profile │
│ (NumPy, - 4× throughput │
│ PyTorch, - Shared memory │
│ data processing) - Low overhead │
│ │
│ Mixed Workloads → Hybrid (Both Profiles) │
│ (API + background) - Dedicated pools │
│ - Best of both │
│ │
└────────────────────────────────────────────────────────┘Test Methodology
Test Environment
Hardware:
CPU: Intel Xeon E5-2680 v4 (8 cores, 16 threads)
RAM: 32 GB DDR4
Disk: NVMe SSD
Software:
OS: Ubuntu 22.04 LTS
Elixir: 1.18
Erlang/OTP: 27
Python: 3.13.0 (free-threading enabled)
Snakepit: v0.6.0Test Configurations
Process Profile
config :snakepit,
pools: [
%{
name: :process_pool,
worker_profile: :process,
pool_size: 100,
adapter_module: Snakepit.Adapters.GRPCPython,
adapter_env: [
{"OPENBLAS_NUM_THREADS", "1"},
{"OMP_NUM_THREADS", "1"},
{"MKL_NUM_THREADS", "1"}
]
}
]Thread Profile
config :snakepit,
pools: [
%{
name: :thread_pool,
worker_profile: :thread,
pool_size: 4, # 4 processes
threads_per_worker: 25, # 100 total capacity
adapter_module: Snakepit.Adapters.GRPCPython,
adapter_env: [
{"OPENBLAS_NUM_THREADS", "25"},
{"OMP_NUM_THREADS", "25"}
]
}
]Benchmark Suite
- Startup Time: Time to initialize N workers
- Memory Footprint: RSS memory per worker
- Throughput: Requests/second sustained
- Latency: p50, p95, p99 response times
- Concurrency: Maximum concurrent requests
- Recycling Impact: Performance during worker recycling
Process vs Thread Profile Comparison
Test 1: Small API Request (Echo)
Workload: Simple echo request, no computation
# Python adapter
@tool
def echo(self, message: str) -> dict:
return {"message": message}Results:
| Profile | Workers | Throughput | Latency (p50) | Latency (p99) | Memory |
|---|---|---|---|---|---|
| Process | 100 | 1,500 req/s | 3ms | 8ms | 15 GB |
| Thread | 4×25 | 1,200 req/s | 5ms | 12ms | 1.6 GB |
Winner: Process (better for I/O-bound, low-latency requests)
Test 2: CPU-Intensive Task (Matrix Multiplication)
Workload: NumPy 1000×1000 matrix multiplication
@tool
def matrix_multiply(self, size: int) -> dict:
a = np.random.rand(size, size)
b = np.random.rand(size, size)
result = np.dot(a, b)
return {"shape": result.shape}Results:
| Profile | Workers | Throughput | Latency (p50) | Latency (p99) | CPU Usage |
|---|---|---|---|---|---|
| Process | 100 | 600 jobs/hr | 120ms | 150ms | 100% (1 core) |
| Thread | 4×25 | 2,400 jobs/hr | 35ms | 40ms | 800% (8 cores) |
Winner: Thread (4× better for CPU-bound work)
Test 3: Mixed Workload
Workload: 70% echo, 30% matrix multiplication
| Profile | Workers | Throughput | Avg Latency | Memory | CPU |
|---|---|---|---|---|---|
| Process | 100 | 1,200 req/s | 15ms | 15 GB | 200% |
| Thread | 4×25 | 1,100 req/s | 18ms | 1.6 GB | 500% |
Winner: Process (slightly better for mixed I/O/CPU)
Memory Usage Analysis
Memory Per Worker
Process Profile
Baseline (idle): 150 MB per worker
After 1 hour: 180 MB per worker
After 24 hours: 450 MB per worker (no recycling)
With hourly recycle: 175 MB per worker (stable)Thread Profile
Baseline (idle): 400 MB per process (25 threads)
After 1 hour: 450 MB per process
After 24 hours: 600 MB per process (no recycling)
With hourly recycle: 450 MB per process (stable)Total Memory Footprint
| Configuration | Workers | Memory (Start) | Memory (24hr, no recycle) | Memory (24hr, with recycle) |
|---|---|---|---|---|
| Process × 100 | 100 | 15 GB | 45 GB | 17.5 GB |
| Thread × 4 (25/each) | 100 capacity | 1.6 GB | 2.4 GB | 1.8 GB |
| Savings | - | 9.4× less | 18.8× less | 9.7× less |
Memory Growth Over Time
Process Profile (no recycling):
0h: 150 MB ──────────────────────────────────────
6h: 220 MB ────────────────────────────────────────────────
12h: 310 MB ────────────────────────────────────────────────────────────
18h: 380 MB ──────────────────────────────────────────────────────────────────────
24h: 450 MB ────────────────────────────────────────────────────────────────────────────────
Thread Profile (no recycling):
0h: 400 MB ──────────────────────────────────────────────────────────────────────────────────
6h: 450 MB ────────────────────────────────────────────────────────────────────────────────────────────
12h: 520 MB ──────────────────────────────────────────────────────────────────────────────────────────────────────
18h: 560 MB ────────────────────────────────────────────────────────────────────────────────────────────────────────────
24h: 600 MB ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────Recommendation
- Process Profile: Enable hourly recycling to prevent 3× memory growth
- Thread Profile: Hourly recycling keeps memory stable at ~450 MB/process
Throughput Benchmarks
Test Scenarios
Scenario 1: Sustained Load (30 minutes)
Process Profile (100 workers):
Target: 1000 req/s
Achieved: 1,450 req/s
Success Rate: 99.98%
Errors: 23 / 2,610,000 (timeouts)Thread Profile (4×25 = 100 capacity):
Target: 1000 req/s
Achieved: 1,180 req/s
Success Rate: 99.95%
Errors: 59 / 2,124,000 (capacity saturation)Scenario 2: Peak Load (5 minutes)
Process Profile:
Peak: 2,100 req/s
Sustained: 1,900 req/s
Queue Depth (max): 42
Saturation: 0.8%Thread Profile:
Peak: 1,650 req/s
Sustained: 1,400 req/s
Queue Depth (max): 156
Saturation: 3.2%Scenario 3: CPU-Intensive Jobs
NumPy Matrix Operations (1000×1000):
| Profile | Jobs/Hour | Jobs/Minute | Avg CPU % | Total Time (1000 jobs) |
|---|---|---|---|---|
| Process | 600 | 10 | 100% | 100 minutes |
| Thread | 2,400 | 40 | 800% | 25 minutes |
PyTorch Inference (ResNet50):
| Profile | Inferences/Hour | Avg Latency | Throughput |
|---|---|---|---|
| Process | 1,200 | 3.0s | 20/min |
| Thread | 4,800 | 0.75s | 80/min |
Latency Analysis
Percentile Breakdown
I/O-Bound (Echo Request)
Process Profile:
p50: 3ms ──────────────────
p75: 4ms ────────────────────────
p90: 6ms ────────────────────────────────
p95: 7ms ──────────────────────────────────────
p99: 8ms ────────────────────────────────────────────
p99.9: 12ms ────────────────────────────────────────────────────────Thread Profile:
p50: 5ms ──────────────────────────────────
p75: 7ms ──────────────────────────────────────────────
p90: 9ms ──────────────────────────────────────────────────────────
p95: 11ms ────────────────────────────────────────────────────────────────────
p99: 12ms ──────────────────────────────────────────────────────────────────────────
p99.9: 18ms ────────────────────────────────────────────────────────────────────────────────────────────Verdict: Process profile has ~40% lower latency for I/O-bound work.
CPU-Bound (Matrix Multiplication)
Process Profile:
p50: 120ms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
p75: 135ms ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
p90: 142ms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
p95: 148ms ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
p99: 150ms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Thread Profile:
p50: 35ms ───────────────────────────────────
p75: 38ms ──────────────────────────────────────────
p90: 39ms ─────────────────────────────────────────────
p95: 40ms ────────────────────────────────────────────────
p99: 40ms ────────────────────────────────────────────────Verdict: Thread profile has ~75% lower latency for CPU-bound work (4× speedup).
Startup Time Comparison
Pool Initialization
| Workers | Process Profile | Thread Profile | Improvement |
|---|---|---|---|
| 10 | 1.2s | 0.3s | 4× faster |
| 50 | 5.5s | 1.1s | 5× faster |
| 100 | 10.8s | 2.2s | 4.9× faster |
| 200 | 22.3s | 4.5s | 5× faster |
| 250 | 60.1s | 5.8s | 10.4× faster |
Why Thread Profile is Faster:
- Fewer processes to fork (4 vs 100)
- Thread spawn is faster than process fork
- Shared Python interpreter initialization
Worker Lifecycle Impact
Recycling Performance
Test Setup
- Pool: 100 workers (process) or 4×25 (thread)
- Workload: 1000 req/s sustained
- Recycling: Every 30 minutes (for testing)
Results: Process Profile
Timeline:
0:00 Pool starts, 100 workers
0:30 Worker #1 recycled (TTL)
- Latency spike: +2ms (p99: 8ms → 10ms)
- Throughput drop: 1500 → 1485 req/s
- Recovery: <1 second
1:00 Worker #2 recycled
- Similar impact: +2ms latency
- No user-visible disruptionAverage Impact:
- Latency increase: +2ms (25% spike)
- Throughput drop: -15 req/s (1%)
- Duration: <1 second
- Frequency: 1 worker every 30 min
Results: Thread Profile
Timeline:
0:00 Pool starts, 4 processes (100 capacity)
0:30 Process #1 recycled (25 threads)
- Latency spike: +8ms (p99: 12ms → 20ms)
- Throughput drop: 1200 → 1125 req/s
- Recovery: ~2 secondsAverage Impact:
- Latency increase: +8ms (67% spike)
- Throughput drop: -75 req/s (6.25%)
- Duration: ~2 seconds
- Frequency: 1 process every 30 min
Verdict: Process profile has lower recycling impact (smaller blast radius).
Real-World Workloads
Use Case 1: API Server (I/O-Bound)
Description: REST API with ML inference (small models)
Configuration:
# Process profile
pool_size: 100
worker_ttl: {3600, :seconds}Results:
- Throughput: 1,450 req/s
- Latency (p99): 8ms
- Memory: 17.5 GB (with recycling)
- CPU: 200% average
Verdict: ✅ Process profile recommended
Use Case 2: Data Pipeline (CPU-Bound)
Description: Batch processing large datasets with NumPy/Pandas
Configuration:
# Thread profile
pool_size: 8
threads_per_worker: 8
worker_ttl: {1800, :seconds}Results:
- Throughput: 320 jobs/hr
- Processing time: 11.25 minutes (1000 jobs)
- Memory: 3.2 GB
- CPU: 800% average
Verdict: ✅ Thread profile recommended (4× faster than process)
Use Case 3: Hybrid Workload
Description: API requests (70%) + background jobs (30%)
Configuration:
pools: [
%{name: :api, worker_profile: :process, pool_size: 80},
%{name: :jobs, worker_profile: :thread, pool_size: 4, threads_per_worker: 16}
]Results:
- API: 1,200 req/s at 8ms latency
- Jobs: 1,920 jobs/hr
- Total Memory: 12 GB + 1.6 GB = 13.6 GB
- Overall CPU: 400%
Verdict: ✅ Hybrid approach recommended
When to Use Which Profile
Decision Matrix
┌────────────────────────────────────────────────────────────────┐
│ Profile Decision Tree │
└────────────────────────────────────────────────────────────────┘
Is your workload CPU-intensive?
(NumPy, PyTorch, data processing)
│
├─ YES ──→ Do you have Python 3.13+?
│ │
│ ├─ YES ──→ Is your code thread-safe?
│ │ │
│ │ ├─ YES ──→ ✅ Thread Profile
│ │ └─ NO ───→ ⚠️ Process Profile
│ │
│ └─ NO ───→ ⚠️ Process Profile (GIL limitation)
│
└─ NO ───→ I/O-bound workload?
│
├─ YES ──→ ✅ Process Profile
│ (better latency, higher concurrency)
│
└─ Mixed ──→ 💡 Hybrid (both profiles)Profile Selection Criteria
Choose Process Profile When:
- ✅ I/O-bound workload (API requests, database queries)
- ✅ Low latency required (< 10ms)
- ✅ High concurrency needed (1000+ req/s)
- ✅ Using Python ≤ 3.12 (GIL present)
- ✅ Need maximum process isolation
- ✅ Thread-unsafe libraries (Pandas, Matplotlib)
Choose Thread Profile When:
- ✅ CPU-bound workload (NumPy, PyTorch, scikit-learn)
- ✅ Python 3.13+ with free-threading
- ✅ Code is thread-safe
- ✅ Large shared data (models, configs)
- ✅ Memory overhead is a concern
- ✅ Batch processing workloads
Use Hybrid (Both Profiles) When:
- ✅ Mixed workload (API + background jobs)
- ✅ Different SLAs for different endpoints
- ✅ Want best-of-both-worlds optimization
- ✅ Have sufficient resources for multiple pools
Expected Performance Gains
Thread Profile vs Process Profile
Memory Savings
Workers: 100 capacity
Process Profile: Thread Profile: Savings:
100 × 150 MB 4 × 400 MB 9.4×
= 15,000 MB = 1,600 MB (13.4 GB saved)CPU-Bound Throughput
NumPy Matrix Multiplication:
Process (100 workers): Thread (4×25): Improvement:
600 jobs/hour 2,400 jobs/hour 4×
(1 job per worker) (4 jobs in parallel)Startup Time
100 Workers:
Process: 10.8 seconds Thread: 2.2 seconds 4.9× fasterLifecycle Management Impact
Without Recycling (24-hour run)
Memory Growth:
Process: Thread:
150 MB → 450 MB 400 MB → 600 MB
(200% growth) (50% growth)With Hourly Recycling
Memory Stable:
Process: Thread:
~175 MB (stable) ~450 MB (stable)
(60% savings) (25% savings)Recommendations
For New Projects
Start with Process Profile (default)
- Proven stability
- Low latency
- High concurrency
- Works with all Python versions
Evaluate Thread Profile if:
- Running Python 3.13+
- CPU-intensive workloads
- Memory is constrained
- Code is thread-safe
For Existing v0.5.1 Users
- No changes required - Process profile maintains v0.5.1 behavior
- Add worker recycling - Prevent memory leaks:
worker_ttl: {3600, :seconds} - Monitor telemetry - Track recycling events
- Consider thread profile for CPU-heavy workloads
Optimization Tips
Process Profile
# Optimize for I/O-bound
config :snakepit,
pools: [
%{
name: :api,
worker_profile: :process,
pool_size: System.schedulers_online() * 12, # High concurrency
worker_ttl: {3600, :seconds}, # Prevent leaks
adapter_env: [
{"OPENBLAS_NUM_THREADS", "1"},
{"OMP_NUM_THREADS", "1"}
]
}
]Thread Profile
# Optimize for CPU-bound
config :snakepit,
pools: [
%{
name: :compute,
worker_profile: :thread,
pool_size: System.schedulers_online() / 2, # Fewer processes
threads_per_worker: 16, # More threads
worker_ttl: {1800, :seconds}, # Faster recycling
adapter_env: [
{"OPENBLAS_NUM_THREADS", "16"},
{"OMP_NUM_THREADS", "16"}
]
}
]Benchmark Reproducibility
Running Benchmarks Locally
# Clone repository
git clone https://github.com/nshkrdotcom/snakepit.git
cd snakepit
# Install dependencies
mix deps.get
# Run benchmark suite
mix run scripts/benchmark.exs
# Run specific benchmark
mix run scripts/benchmark_process.exs
mix run scripts/benchmark_thread.exs
mix run scripts/benchmark_comparison.exs
Custom Benchmarks
# examples/custom_benchmark.exs
defmodule CustomBenchmark do
def run do
# Start pool
config = [pools: [%{name: :bench, worker_profile: :process, pool_size: 10}]]
{:ok, _} = start_supervised({Snakepit.Application, config})
# Warmup
for _ <- 1..100, do: Snakepit.execute(:bench, "ping", %{})
# Benchmark
{time_us, _} = :timer.tc(fn ->
for _ <- 1..10_000, do: Snakepit.execute(:bench, "ping", %{})
end)
throughput = 10_000 / (time_us / 1_000_000)
IO.puts("Throughput: #{throughput} req/s")
end
end
CustomBenchmark.run()Summary
Key Performance Metrics
| Metric | Process Profile | Thread Profile | Winner |
|---|---|---|---|
| Memory Efficiency | Baseline | 9.4× better | Thread |
| I/O Throughput | 1500 req/s | 1200 req/s | Process |
| CPU Throughput | Baseline | 4× better | Thread |
| Startup Time | Baseline | 5× faster | Thread |
| I/O Latency | 8ms (p99) | 12ms (p99) | Process |
| CPU Latency | 150ms (p99) | 40ms (p99) | Thread |
Bottom Line
- Process Profile: Best for I/O-bound, low-latency, high-concurrency workloads
- Thread Profile: Best for CPU-bound, memory-constrained, batch processing workloads
- Hybrid: Use both for mixed workloads
Next Steps
- Read: Migration Guide
- Try: Process vs Thread Example
- Deploy: Production deployment guide (coming soon)
Questions? See FAQ in Migration Guide or open an issue.