All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
0.13.0 - 2026-02-06
This release is primarily internal hardening: non-blocking GenServer callbacks, structured public API errors, centralized configuration, and a deprecation framework for legacy optional modules. No new user-facing features are introduced beyond instance isolation tokens and the structured error contract.
Added
- Instance isolation tokens - instance_token configuration (and SNAKEPIT_INSTANCE_TOKEN env var) provides per-VM isolation when multiple Snakepit instances share a host or deployment directory. Each concurrent VM must use a unique token so cleanup logic never targets another live instance's workers.
- Structured public API errors - Error.normalize_public_result/2 converts internal atom/tuple error codes (:queue_timeout, :pool_saturated, :worker_busy, :session_worker_unavailable, :pool_not_initialized, :pool_not_found, :worker_exit) into categorized %Snakepit.Error{} structs. All public error returns from Snakepit.execute/3, Pool.execute/3, and Pool.execute_stream/3 are now {:error, %Snakepit.Error{}}.
- Legacy module deprecation framework - Snakepit.Internal.Deprecation provides telemetry-based once-per-VM deprecation events ([:snakepit, :deprecated, :module_used]) for legacy optional modules: Snakepit.Compatibility, Snakepit.Executor, Snakepit.HealthMonitor, Snakepit.PythonVersion, Snakepit.Telemetry, Snakepit.Telemetry.GPUProfiler, Snakepit.Telemetry.Handlers.Logger, and Snakepit.Telemetry.Handlers.Metrics.
- Centralized timeout runner - Snakepit.Internal.TimeoutRunner standardizes execution with timeouts across the executor, Python package runner, and shutdown modules using spawn_monitor + receive instead of Task.async/yield/shutdown.
- Async fallback helpers - Snakepit.Internal.AsyncFallback consolidates duplicated supervisor-unavailable fallback logic (start_nolink_with_fallback/3, start_child_with_fallback/3, start_monitored/1, start_monitored_fire_and_forget/1).
- Pool RuntimeSupervisor - pool-dependent children are grouped under a rest_for_one supervisor ensuring dependency-order restarts.
- Enriched pool_not_found errors - Dispatcher.get_pool now returns {:error, {:pool_not_found, pool_name}} carrying the missing pool name for diagnostics.
- Pre-stream telemetry buffering - Python TelemetryStream buffers events emitted before the gRPC stream is attached and flushes them once the async loop is initialized, preventing dropped startup events.
- Session quota enforcement in SessionStore to protect against resource exhaustion during high-volume worker assignments.
- Supervisor fallbacks for heartbeat and lifecycle tasks - when TaskSupervisor is unavailable, heartbeat pings and lifecycle checks fall back to manually monitored processes instead of crashing.
- Dispatch telemetry event - [:snakepit, :pool, :call, :dispatched] is emitted when a request is assigned to a worker and execution begins. Metadata includes pool, worker_id, command, and queued (boolean indicating whether the request waited in the queue). Enables deterministic synchronization for contention-aware consumers.
- New runtime-configurable defaults: process_registry_dets_flush_interval_ms, grpc_stream_open_timeout_ms, grpc_stream_control_timeout_ms, lifecycle_check_max_concurrency, lifecycle_worker_action_timeout_ms, grpc_worker_health_check_timeout_ms.
- @enforce_keys on Pool, Pool.State, ProcessRegistry, HeartbeatMonitor, and LifecycleManager structs.
- Generated getters in Defaults for pool_reconcile_interval_ms, pool_reconcile_batch_size, and supervisor restart intensity values.
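
The pre-stream buffering entry above can be pictured with a minimal Python sketch. It is not Snakepit's TelemetryStream: the class name, the `attach` method, and the bounded buffer size are all assumptions made for illustration. It only shows the idea of holding events until a transport exists and then flushing them in emission order.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Optional

Event = Dict[str, Any]


class BufferedTelemetry:
    """Hypothetical emitter used for illustration; not Snakepit's TelemetryStream."""

    def __init__(self, max_buffered: int = 1000) -> None:
        # Assumption: the buffer is bounded so a stream that never attaches
        # cannot grow memory without limit (oldest events are dropped first).
        self._pending: Deque[Event] = deque(maxlen=max_buffered)
        self._send: Optional[Callable[[Event], None]] = None

    def emit(self, event: Event) -> None:
        if self._send is None:
            self._pending.append(event)  # no stream yet: hold the event
        else:
            self._send(event)

    def attach(self, send: Callable[[Event], None]) -> None:
        """Attach the transport and flush buffered events in emission order."""
        self._send = send
        while self._pending:
            send(self._pending.popleft())


sent = []
telemetry = BufferedTelemetry()
telemetry.emit({"name": "worker.boot"})   # emitted before the stream exists
telemetry.attach(sent.append)             # flushes the buffered event
telemetry.emit({"name": "worker.ready"})  # sent directly once attached
```

After the run, `sent` holds both events in the order they were emitted, which is the property the changelog entry describes.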
Changed
- GRPCWorker is fully non-blocking - long-running gRPC calls execute in async tasks with an internal request queue, keeping workers responsive to health checks and state queries during active calls. get_health and get_info calls also use non-blocking async mechanisms.
- Periodic health checks route through async RPC queue instead of running synchronously in handle_info.
- ProcessRegistry DETS persistence is now batched - sync operations are deferred behind a configurable flush interval instead of running directly inside GenServer callbacks. Startup cleanup deferred to handle_continue to avoid blocking the supervisor.
- Pool initialization uses supervised async tasks - initialization is launched with Task.Supervisor.async_nolink/2 with explicit crash attribution instead of spawn_link.
- Lifecycle checks run off the GenServer callback path with bounded per-worker concurrency via async_stream_nolink. Worker recycle operations run in supervised tasks tracked via recycle_task_refs. LifecycleManager.terminate/2 cancels timers and kills tracked tasks.
- GRPCWorker terminate cleans up pending calls - iterates pending_rpc_calls and rpc_request_queue, killing in-flight task PIDs, demonitoring refs, and replying to waiting callers with structured shutdown errors.
- Telemetry stream operations are asynchronous - gRPC stream open and control operations execute in supervised tasks with explicit operation timeouts. Connection lifecycle driven by a dedicated task with stream_ready/timeout messages.
- Heartbeat pings run in supervised tasks instead of blocking the HeartbeatMonitor GenServer.
- GPU profiling moved to asynchronous model to prevent slow hardware queries from stalling telemetry collection.
- Config resolution centralized - Snakepit.Config.adapter_module/2, Snakepit.Config.capacity_strategy/1, Snakepit.Config.adapter_args/1 resolve with explicit precedence (override -> pool -> legacy -> global -> default). All consumers delegate to these helpers.
- Shutdown module consolidation - Shutdown.shutdown_reason?/1 replaces duplicated private implementations across GRPCWorker and GrpcStream. Shutdown.stop_supervisor/2 extracted for reusable supervisor stop logic.
- Application compile-time env replaced with runtime function to prevent stale environment values.
- Legacy pool_size precedence fixed - top-level :pool_size now wins over pool_config.pool_size when both are set.
- GRPC Client mock channel dispatch tightened - mock response logic extracted into ClientMock; Client.mock_channel?/1 no longer silently treats non-map channels as mocks.
- ToolRegistry errors use tagged tuples instead of string messages; BridgeServer formats them at the API boundary.
- SessionStore default arguments consolidated using Elixir default argument syntax.
- ClientSupervisor startup race normalization - {:error, {:already_started, pid}} normalized to :ignore.
- TaintRegistry consume_restart atomicity - uses :ets.take/2 instead of lookup-then-delete for single-consumer semantics.
- ProcessRegistry DETS access indirection - direct :dets calls replaced with persist_put/3, persist_delete/2, persist_sync/1 wrappers.
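
The precedence chain in the config resolution entry (override -> pool -> legacy -> global -> default) amounts to a first-match lookup across ordered layers. The sketch below shows that idea in Python purely as an illustration; the real helpers are Elixir functions in Snakepit.Config. The `resolve` function, the layer dictionaries, and their values are hypothetical, and treating an explicitly set falsy value as "set" is an assumption.

```python
from typing import Any, Mapping, Sequence

_MISSING = object()  # sentinel so explicit False/None values still count as set


def resolve(key: str, layers: Sequence[Mapping[str, Any]], default: Any = None) -> Any:
    """Return the value from the first layer that defines key, else default."""
    for layer in layers:
        value = layer.get(key, _MISSING)
        if value is not _MISSING:
            return value
    return default


# Hypothetical layers, listed from highest to lowest precedence.
override = {}
pool = {"adapter_args": ["--mode", "fast"]}
legacy = {"capacity_strategy": "pool"}
global_config = {"adapter_args": ["--mode", "safe"], "capacity_strategy": "hybrid"}
layers = [override, pool, legacy, global_config]

adapter_args = resolve("adapter_args", layers)                     # pool beats global
capacity_strategy = resolve("capacity_strategy", layers)           # legacy beats global
adapter_module = resolve("adapter_module", layers, "DefaultAdapter")  # falls to default
```

Keeping the lookup in one function is what lets every consumer share the same ordering instead of re-implementing it.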
Deprecated
- Snakepit.HealthMonitor - use worker lifecycle telemetry and host-managed health policy. Emits [:snakepit, :deprecated, :module_used] once per VM.
- Snakepit.Executor - use Snakepit.RetryPolicy, Snakepit.CircuitBreaker, and timeout helpers directly. Emits deprecation event once per VM.
- Snakepit.Compatibility, Snakepit.PythonVersion, Snakepit.Telemetry (legacy module), Snakepit.Telemetry.GPUProfiler, Snakepit.Telemetry.Handlers.Logger, Snakepit.Telemetry.Handlers.Metrics - all emit deprecation events on first use; see event metadata for replacement guidance.
Fixed
- Shutdown flag stickiness - mark_in_progress now stores a {pid, ref} marker with owner monitoring. Stale flags from crashed processes are automatically cleared and no longer block worker startup.
- Process.alive? TOCTOU races removed from HeartbeatMonitor, GRPCWorker, Application, WorkerSupervisor, Initializer, Listener, Shutdown, and LifecycleManager in favor of monitor-based or catch-based patterns.
- CapacityStore :noproc crashes during shutdown - all public APIs catch exits and return typed fallback values.
- GRPCWorker orphaned monitor growth - orphaned RPC task monitors are now cleaned from pending_rpc_monitors when no matching pending call exists.
- HeartbeatMonitor stale timeout messages - ignores :heartbeat_timeout when timeout_timer is nil. Demonitors ping task refs on timeout to prevent stale :DOWN delivery.
- ApplicationCleanup bounded termination - cleanup runs in a spawned process with a configurable timeout budget, preventing blocked supervision tree shutdown.
- Listener process liveness detection - replaced Process.alive?/1 with monitor-and-receive to correctly detect remote node processes.
- ProcessRegistry cleanup task lifecycle - catches TaskSupervisor :noproc, falls back to spawn_monitor, drains in-flight cleanup on terminate with configurable timeout.
- Telemetry stream callback blocking - gRPC stream open and control operations now execute asynchronously with explicit operation timeouts.
- Heartbeat ping callback blocking - pings run in supervised tasks with bounded timeout handling and cleanup.
- Heartbeat pong routing under async execution - notify_pong remains backward-compatible when ping_fun executes in a task by routing self-targeted pongs back to the owning monitor process.
- SessionStore callback containment - update_session now catches throw and exit in addition to rescued exceptions.
- Dynamic atom creation from telemetry config keys - config normalization uses template-driven key matching instead of String.to_atom.
- Async task monitor hygiene in Pool - tracked async task refs are demonitored and flushed, and are no longer misrouted through worker :DOWN handling.
- WorkerSupervisor shutdown race handling - APIs return structured errors when the supervisor is unavailable instead of raising :noproc.
- Pool initialization shutdown cleanup - in-flight async initialization tasks are cancelled when Pool terminates.
- Initialization resource delta telemetry - baseline captured at start instead of sampling both values at completion.
- Python telemetry events dropped during startup - pre-stream buffering ensures events emitted before gRPC connection are preserved.
- Thread-safe Python telemetry emission - uses loop.call_soon_threadsafe with loop state checks.
- Rogue cleanup configuration - correctly handles explicit false values and string-key variations.
- GrpcStream and Snakepit.cleanup :noproc tolerance - catch exits instead of pre-checking Process.whereis.
- Port reservation race in tests - test helper table reservation tolerates ETS owner races during concurrent execution.
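
The thread-safe emission entry names loop.call_soon_threadsafe with loop state checks. The following is a minimal sketch of that pattern, assuming events are handed to an asyncio.Queue owned by the loop; the function name `emit_threadsafe` and the event shape are hypothetical and not taken from Snakepit's Python code.

```python
import asyncio
import threading
from typing import Optional


def emit_threadsafe(loop: Optional[asyncio.AbstractEventLoop],
                    queue: asyncio.Queue, event: dict) -> bool:
    """Schedule queue.put_nowait on the loop's thread; False if the loop is unusable."""
    if loop is None or loop.is_closed() or not loop.is_running():
        return False
    try:
        loop.call_soon_threadsafe(queue.put_nowait, event)
    except RuntimeError:
        # The loop was closed between the state check and the call.
        return False
    return True


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    results = []
    # Emit from a non-loop thread, as a tool running in a worker thread would.
    worker = threading.Thread(
        target=lambda: results.append(emit_threadsafe(loop, queue, {"name": "tool.done"}))
    )
    worker.start()
    event = await asyncio.wait_for(queue.get(), timeout=5)
    worker.join()

    # A closed loop is rejected by the state check instead of raising.
    dead_loop = asyncio.new_event_loop()
    dead_loop.close()
    rejected = emit_threadsafe(dead_loop, queue, {"name": "late"})
    return results[0], event["name"], rejected


outcome = asyncio.run(main())
```

The try/except around the call covers the window in which the loop closes after the state check has already passed.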
0.12.0 - 2026-01-25
Added
- Post-readiness process group resolution - Workers re-check process group membership after Python signals readiness, handling cases where os.setsid() is called after initial spawn. Uses exponential backoff (up to 250ms) to accommodate delayed OS-level bookkeeping.
- ProcessRegistry.update_process_group/3 to update :pgid and :process_group? metadata after worker startup, with PID mismatch protection to prevent corrupting restarted worker entries.
- ready_workers tracking in pool state to distinguish workers that have completed the gRPC handshake from those merely spawned.
- init_failed flag on pool state to mark pools that failed to start any workers.
- Global await_ready_waiters list for coordinating callers waiting on all pools.
- Python executable validation for :python_executable and SNAKEPIT_PYTHON overrides, checking both existence and execute permissions before use.
- Snakepit.Pool.await_init_complete/2 - waits for asynchronous pool initialization to complete, separate from await_ready/2 which returns as soon as each pool has at least one ready worker. Useful for tests and scripts that need to wait for all workers to be spawned before proceeding.
- Pool initialization telemetry events:
  - [:snakepit, :pool, :init_started] - emitted when pool initialization begins, with total_workers measurement.
  - [:snakepit, :pool, :init_complete] - emitted when initialization finishes, with duration_ms, total_workers, and pool_workers metadata.
  - [:snakepit, :pool, :worker_ready] - emitted when a worker completes the gRPC handshake, with worker_count, pool_name, and worker_id metadata.
Changed
- Pool readiness semantics - await_ready/2 now waits for at least one worker per pool to complete the gRPC handshake, not just for workers to be spawned. Pools report ready only when ready_workers is non-empty.
- Worker availability now requires both capacity headroom AND ready status. Workers are no longer marked available until they signal readiness.
- Snakepit.execute/3 returns {:error, :pool_not_initialized} immediately for pools with init_failed: true instead of queueing requests that would eventually time out.
- PythonRuntime.python_version/1 returns {:ok, version} or {:error, reason} tuples instead of raw strings or "unknown".
- PythonRuntime.build_identity/1 propagates errors from version detection instead of silently returning partial identity maps.
- PythonRuntime.runtime_identity/0 now refreshes the cached identity when the resolved Python path changes, supporting dynamic reconfiguration.
- Waiter reply logic refactored to stagger replies (2ms apart) to avoid thundering herd on pool initialization.
- State.ensure_worker_available/2, State.increment_load/2, and State.decrement_load/2 now gate availability on worker readiness.
- EventHandler.remove_worker_from_pool/4 cleans up ready_workers set when removing workers.
Fixed
- Startup race for process group detection - Previously, if Python called os.setsid() after Snakepit captured the initial process group ID, the worker would remain in PID-only kill mode, leaving orphaned grandchildren after termination. The bootstrap phase now retries process group resolution after readiness.
- Pool readiness gating - Early calls to Snakepit.execute/3 no longer hit workers with half-closed gRPC streams. Workers must complete the handshake before receiving work.
- Broken pool signaling - Pools that fail to start any workers are now flagged immediately. await_ready/2 returns {:error, %Snakepit.Error{}} promptly instead of blocking until timeout.
- Python runtime override robustness - Invalid :python_executable or SNAKEPIT_PYTHON paths now return {:error, {:invalid_python_executable, path}} instead of crashing the VM on first use. runtime_env/0 returns an empty list for invalid configurations instead of raising.
- Legacy pool_config now preserves all user overrides - Previously, only startup_batch_size, startup_batch_delay_ms, and max_workers were extracted from legacy pool_config maps, silently dropping other fields like adapter_env and adapter_args. The config is now fully merged before applying defaults.
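
The Python executable validation described in this release runs on the Elixir side. As a rough illustration of the two conditions it checks (the path exists, and it is executable), here is a Python sketch. The function name and the tuple return shape are hypothetical and merely echo the {:error, {:invalid_python_executable, path}} form quoted above.

```python
import os
import sys
from typing import Tuple, Union

Result = Union[Tuple[str, str], Tuple[str, Tuple[str, str]]]


def validate_python_executable(path: str) -> Result:
    """Return ("ok", path) when path is an existing, executable file."""
    # Existence check: must be a regular file (or a symlink to one).
    if not os.path.isfile(path):
        return ("error", ("invalid_python_executable", path))
    # Permission check: the current user must be allowed to execute it.
    if not os.access(path, os.X_OK):
        return ("error", ("invalid_python_executable", path))
    return ("ok", path)


current = validate_python_executable(sys.executable)
missing = validate_python_executable("/nonexistent/bin/python3")
```

Returning a tagged result instead of raising is what lets the caller report a bad override as an ordinary error.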
0.11.1 - 2026-01-23
Changed
- ETSOwner.ensure_table/2 is now ensure_table/1 - table options are centralized in ETSOwner as the single source of truth for known tables.
- ETSOwner raises ArgumentError for unknown table names, preventing accidental table creation outside the managed set.
- ETSOwner raises a clear error when called before the Snakepit application is started.
- WorkerSupervisor.start_worker/5 now returns the GRPCWorker PID instead of the starter PID. The function waits up to 1 second for the worker to register, making the return value immediately usable for operations.
- CapacityStore.ensure_started/0 no longer auto-starts the GenServer; returns {:error, :not_started} if the process isn't running. This prevents unsupervised process spawning during shutdown.
- GRPC.Listener init now uses handle_continue instead of spawning a Task for listener startup, simplifying the initialization flow.
Fixed
- ETS table ownership for taint registry and zero-copy handles is now supervised to avoid short-lived processes becoming table owners.
- Race condition in ETSOwner.create_table/2 - now properly re-raises if the table still doesn't exist after catching ArgumentError (distinguishes real errors from concurrent creation).
- Shutdown race in ProcessManager.wait_for_server_ready/3 - Now detects {:EXIT, _, :shutdown} messages and checks the shutdown flag to exit early instead of timing out during application shutdown.
- Telemetry stream task lifecycle - GrpcStream now traps exits and properly cleans up stream state when tasks complete or crash, preventing orphaned entries in the streams map.
- Thread profile resilience during shutdown - Thread.start_worker/5, stop_worker/1, acquire_slot/1, get_capacity/1, and get_load/1 now handle CapacityStore being unavailable gracefully instead of crashing.
- Pool capacity tracking - track_capacity_increment/1 and track_capacity_decrement/1 now check if CapacityStore is available before attempting operations, preventing crashes during shutdown.
0.11.0 - 2026-01-11
Added
- Graceful serialization fallback for non-JSON-serializable Python objects. Instead of failing, Snakepit now:
  - Tries common conversion methods (model_dump, to_dict, _asdict, tolist, isoformat)
  - Falls back to a marker dict with type info for truly non-serializable objects (safe by default, repr excluded)
- Snakepit.Serialization Elixir module with helpers for detecting and inspecting unserializable markers:
  - unserializable?/1 - checks if a value is an unserializable marker
  - unserializable_info/1 - extracts type and repr info from markers
- Configurable marker detail policy via environment variables on Python workers:
  - SNAKEPIT_UNSERIALIZABLE_DETAIL - controls what info is included (none (default), type, repr_truncated, repr_redacted_truncated)
  - SNAKEPIT_UNSERIALIZABLE_REPR_MAXLEN - maximum repr length (default 500, max 2000)
- Secret redaction in repr_redacted_truncated mode - redacts common patterns (API keys, bearer tokens, passwords) from repr output.
- GracefulJSONEncoder class and _orjson_default function in serialization.py for both stdlib json and orjson paths.
- Tolist size guard (SNAKEPIT_TOLIST_MAX_ELEMENTS env var, default 1M) to prevent explosive sparse→dense array conversions:
  - Pre-checks numpy arrays via isinstance() before calling tolist() to avoid allocation
  - Best-effort heuristics for scipy sparse matrices and pandas DataFrames
  - Post-checks unknown types after tolist() with fallback to marker if oversized
- Telemetry for marker creation - Emits [:snakepit, :serialization, :unserializable_marker] events with type metadata (never repr). Deduplicated per-type-per-process with a 10K type cap to bound cardinality.
- serialization_demo tool in the showcase adapter demonstrating datetime, custom class, and convertible object handling.
- graceful_serialization.exs example showing the feature in action.
- guides/graceful-serialization.md comprehensive guide covering configuration, helpers, telemetry, and best practices.
- Unit tests for graceful serialization (Python: 24 tests, Elixir: 14 tests) plus policy behavior tests.
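
The fallback order described above (try known conversion methods, then emit a type-only marker) can be sketched with the standard library's json.JSONEncoder. This is not the shipped GracefulJSONEncoder: the class name `FallbackEncoder` and the marker keys `__unserializable__` and `type` are assumptions for illustration, and the size guard, detail policy, and redaction are left out.

```python
import json
from datetime import datetime
from typing import Any

# Conversion methods named in the changelog, tried in this order.
CONVERSION_METHODS = ("model_dump", "to_dict", "_asdict", "tolist", "isoformat")


class FallbackEncoder(json.JSONEncoder):
    """Illustrative encoder; the marker layout below is hypothetical."""

    def default(self, o: Any) -> Any:
        for name in CONVERSION_METHODS:
            method = getattr(o, name, None)
            if callable(method):
                try:
                    return method()
                except Exception:
                    continue  # this conversion failed, try the next one
        # Marker carries only the type name; repr is excluded by default.
        return {
            "__unserializable__": True,
            "type": f"{type(o).__module__}.{type(o).__qualname__}",
        }


class Settings:
    def to_dict(self) -> dict:
        return {"retries": 3}


class Opaque:
    pass


payload = {
    "when": datetime(2026, 1, 11, 12, 0),
    "settings": Settings(),
    "handle": Opaque(),
}
decoded = json.loads(json.dumps(payload, cls=FallbackEncoder))
```

The datetime goes through isoformat, Settings through to_dict, and Opaque ends up as a marker that names its type without exposing its contents.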
0.10.1 - 2026-01-11
Fixed
- Pool.handle_call/3 now resolves string pool_name options to configured pool atoms via resolve_pool_name_opt/2, fixing routing when callers pass pool names as strings.
0.10.0 - 2026-01-10
Changed
- gRPC listener defaults to internal-only mode (port 0) and now publishes its assigned port to workers via the grpc_listener config.
- Added explicit external binding modes (:external, :external_pool) with required host/port configuration and pooled port selection for multi-instance deployments.
- ProcessRegistry DETS paths are now namespaced by instance_name and data_dir to prevent shared-deployment collisions.
Fixed
- Session affinity now supports strict routing - Requests with session_id can be guaranteed to route to the same worker where refs exist by enabling strict affinity modes, preventing "Unknown reference" errors for in-memory Python refs.
  - Added affinity: :strict_queue to queue on the preferred worker when busy.
  - Added affinity: :strict_fail_fast to return {:error, :worker_busy} when the preferred worker is busy.
  - Kept affinity: :hint as the default for legacy behavior (falls back to any available worker).
- Documentation now clarifies hint vs strict affinity behavior, and the new grpc_session_affinity_modes.exs example demonstrates both modes in practice.
- Examples now restart Snakepit when run via mix run so example configs are applied consistently; README recommends mix run --no-start for predictable startup.
0.9.1 - 2026-01-09
Added
- ClientSupervisor wrapper for safe gRPC client supervision across gRPC variants.
- gRPC server request logging interceptor with optional :grpc_request_logging and category-aware debug output.
- mix snakepit.python_test task to bootstrap and run the Python test suite (supports --no-bootstrap).
- Pool reconciliation loop to restore minimum worker counts after crash storms (configurable via pool_reconcile_interval_ms and pool_reconcile_batch_size).
- Configurable restart intensity for worker starters and worker supervisors (worker_starter_* and worker_supervisor_* defaults).
Changed
- gRPC client and worker stream defaults now derive from grpc_command_timeout/0 and stream_timeout/0.
- Pool and worker execution now handle :infinity timeouts without deadline bookkeeping.
- Python gRPC server now runs sync adapter calls in worker threads by default; use thread_sensitive metadata or SNAKEPIT_THREAD_SENSITIVE to keep execution on the main thread.
- Snakepit.Pool metadata validation now accepts Snakepit.Pool as the default pool identifier.
- gRPC is pinned to 0.11.5 and protobuf is pinned to 0.16.0 (override).
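
The worker-thread change for sync adapter calls can be illustrated with plain asyncio. This sketch assumes a boolean `thread_sensitive` flag and does not model how Snakepit reads the metadata or the SNAKEPIT_THREAD_SENSITIVE variable; `slow_sync_tool` and `call_tool` are hypothetical names.

```python
import asyncio
import threading


def slow_sync_tool(x: int) -> dict:
    """A blocking adapter call; records which thread executed it."""
    return {"result": x * 2, "thread": threading.current_thread().name}


async def call_tool(x: int, thread_sensitive: bool = False) -> dict:
    if thread_sensitive:
        # Stay on the event loop's thread (blocks the loop while running).
        return slow_sync_tool(x)
    # Default: hand the blocking call to the loop's default thread pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_sync_tool, x)


async def main() -> tuple:
    offloaded = await call_tool(21)
    pinned = await call_tool(21, thread_sensitive=True)
    return offloaded, pinned


caller_thread = threading.current_thread().name
offloaded, pinned = asyncio.run(main())
```

Offloading keeps the event loop free to serve other requests, while the pinned path suits code that must run on one specific thread.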
Fixed
- gRPC status code 4 (DEADLINE_EXCEEDED) now maps to {:error, :timeout} in the client.
- Process group shutdown waits for group exit using /proc or ps, avoiding zombie false positives.
- Test suite now tracks and terminates leaked external Python processes after runs.
0.9.0 - 2026-01-02
Added
- run_as_script/2 :exit_mode option and SNAKEPIT_SCRIPT_EXIT env var for explicit exit semantics.
- Integration tests for external VM exit behavior and broken-pipe safety.
- run_as_script/2 :stop_mode option for ownership-aware application shutdown.
- Shutdown orchestrator for script shutdown sequencing.
- Script shutdown telemetry events ([:snakepit, :script, :shutdown, ...]) with required metadata.
- CI docs build gate (mix docs) to catch documentation build errors.
Changed
- Exit selection precedence now favors :exit_mode over legacy :halt and env vars.
- Snakepit.Examples.Bootstrap.run_example/2 now defaults to exit_mode: :auto and respects stop_mode.
- run_as_script/2 now captures cleanup targets before stopping and routes shutdown through the orchestrator.
- Documentation now aligns README/guides with exit_mode/stop_mode semantics and the Script Lifecycle tables.
- Tests now avoid timing sleeps, using deterministic polling, receive timeouts, and Logger.flush/0 for async-safe synchronization.
- Test timing constants were tightened (heartbeat, circuit breaker, queue churn, gRPC slow-operation paths) to reduce suite runtime.
- Long-running integration and randomized flow tests are tagged :slow, and random worker flow iterations were trimmed.
- Pool size isolation checks now wait on pool stats instead of fixed delays.
- gRPC errors during shutdown now log at debug level to reduce noise during expected teardown.
- Refactored Snakepit.Pool and Snakepit.GRPCWorker internals into focused helpers (dispatcher/scheduler/event handler, bootstrap/instrumentation) without behavior changes.
- Snakepit.TaskSupervisor now starts even when pooling is disabled so queue dispatch paths can spawn tasks safely.
Fixed
- Removed direct IO from the script exit path to avoid hangs on closed pipes.
- run_as_script/2 no longer stops Snakepit in embedded usage unless explicitly requested.
- Shape mismatch telemetry test now filters events by operation to avoid cross-test telemetry bleed.
- Worker lifecycle memory-probe warning test now synchronizes probe failures and log capture to prevent flakes.
- BEAM run IDs now use second-resolution timestamps plus a monotonic counter to avoid collisions during rapid restarts.
- ProcessRegistry rebuilds DETS metadata when index corruption is detected, preventing stale entries after crash/restart cycles.
0.8.9 - 2026-01-01
Breaking Changes
- uv is now required - pip support has been removed. Snakepit now requires uv for Python package management.
  - Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
  - Or via Homebrew: brew install uv
  - The :installer config option has been removed (was :auto, :uv, or :pip)
  - uv provides 10-100x faster package operations and more reliable version resolution
Fixed
- Version checking now validates constraints - PythonPackages.check_installed/2 now properly verifies that installed package versions satisfy the version constraints in requirements (e.g., grpcio>=1.76.0).
  - Previously, only package existence was checked, not version satisfaction
  - This caused runtime errors when outdated packages were installed (e.g., grpcio 1.67.1 when >=1.76.0 was required)
  - Now uses uv pip install --dry-run for accurate PEP-440 version checking
  - Packages that need upgrading are correctly identified as "missing" and reinstalled
- Bootstrap now uses quiet pip install - Reduced noise from "Requirement already satisfied" messages during mix test --include python_integration
- Added startup feedback - Shows "🐍 Checking Python package requirements..." during app startup in dev/test when checking packages (once per BEAM session)
Changed
- Removed unused configuration keys from config/config.exs, config/test.exs, and config/grpc_test.exs to trim dead config surface (legacy worker timeouts and unused grpc_test flags)
- Virtual environments are now created using uv venv for consistency with package management
- Simplified PythonPackages module by removing all pip-specific code paths
0.8.8 - 2025-12-31
Added
- Centralized configurable defaults - New Snakepit.Defaults module provides runtime-configurable defaults for all hardcoded values
  - All 68 previously hardcoded timeout, sizing, and threshold values are now configurable via Application.get_env/3
  - Values are read at runtime, allowing configuration changes in config/runtime.exs without recompilation
  - Defaults remain unchanged from previous versions for backward compatibility
  - See Snakepit.Defaults module documentation for complete list of configurable keys
- Timeout profile architecture - New single-budget, derived deadlines, profile-based timeout system
  - Six predefined profiles: :balanced, :production, :production_strict, :development, :ml_inference, :batch
  - New user-facing API: default_timeout/0, stream_timeout/0, queue_timeout/0
  - Margin configuration: worker_call_margin_ms/0 (default 1000), pool_reply_margin_ms/0 (default 200)
  - RPC timeout derivation: rpc_timeout/1 computes inner timeout from total budget
  - Legacy getters (pool_request_timeout, grpc_command_timeout, etc.) now derive from profile when not explicitly configured
  - Configure via: config :snakepit, timeout_profile: :production
- Pool deadline-aware execution - Pool.execute/3 now stores deadline_ms for queue-aware timeout handling
  - New helper: Pool.get_default_timeout_for_call/3 for call-type-aware timeout lookup
  - New helper: Pool.derive_rpc_timeout_from_opts/2 for deadline-aware RPC timeout derivation
  - New helper: Pool.effective_queue_timeout_ms/2 for budget-aware queue timeout
  - GenServer.call timeout caught and returned as structured {:error, %Snakepit.Error{}}
Changed
- Pool module - Timeout and sizing defaults now read from Snakepit.Defaults:
  - pool_request_timeout, pool_streaming_timeout, pool_startup_timeout, pool_queue_timeout
  - checkout_timeout, default_command_timeout, pool_await_ready_timeout
  - pool_max_queue_size, pool_max_workers, pool_max_cancelled_entries
  - pool_startup_batch_size, pool_startup_batch_delay_ms
- GRPCWorker - Execute and streaming timeouts now configurable:
  - grpc_worker_execute_timeout, grpc_worker_stream_timeout
  - grpc_server_ready_timeout, worker_ready_timeout
  - grpc_worker_health_check_interval
  - Heartbeat configuration: heartbeat_ping_interval_ms, heartbeat_timeout_ms, heartbeat_max_missed, heartbeat_initial_delay_ms
- Fault tolerance modules - Circuit breaker, retry policy, crash barrier, and health monitor defaults now configurable:
  - circuit_breaker_failure_threshold, circuit_breaker_reset_timeout_ms, circuit_breaker_half_open_max_calls
  - retry_max_attempts, retry_backoff_sequence, retry_max_backoff_ms, retry_jitter_factor
  - crash_barrier_taint_duration_ms, crash_barrier_max_restarts, crash_barrier_backoff_ms
  - health_monitor_check_interval, health_monitor_crash_window_ms, health_monitor_max_crashes
- Session store - Session management defaults now configurable:
  - session_cleanup_interval, session_default_ttl, session_max_sessions, session_warning_threshold
- Process registry - Cleanup intervals now configurable:
  - process_registry_cleanup_interval, process_registry_unregister_cleanup_delay, process_registry_unregister_cleanup_attempts
- Application and gRPC - Server configuration now configurable:
  - grpc_port, grpc_num_acceptors, grpc_max_connections, grpc_socket_backlog
  - cleanup_on_stop_timeout_ms, cleanup_poll_interval_ms
- Config module - Pool and worker profile defaults now configurable:
  - default_pool_size, default_worker_profile, default_capacity_strategy
  - config_default_batch_size, config_default_batch_delay, config_default_threads_per_worker
Timeout Architecture Proposal
The following documents the design rationale for the timeout architecture implemented in this release.
Problem Statement
Snakepit's timeout configuration was fragmented with 7+ independent timeout keys that didn't coordinate:
- pool_request_timeout vs grpc_command_timeout - Which is outer? Which is inner?
- GenServer.call timeouts firing before inner timeouts produced unhandled exits instead of structured errors
Solution: Single-Budget, Derived Deadlines
Core principle: One top-level timeout budget, all inner timeouts derived from remaining time.
Profile-based defaults provide sensible starting points for different deployment scenarios:
| Profile | default_timeout | stream_timeout | queue_timeout |
|---|---|---|---|
| :balanced | 300_000 (5m) | 900_000 (15m) | 10_000 (10s) |
| :production | 300_000 (5m) | 900_000 (15m) | 10_000 (10s) |
| :production_strict | 60_000 (60s) | 300_000 (5m) | 5_000 (5s) |
| :development | 900_000 (15m) | 3_600_000 (60m) | 60_000 (60s) |
| :ml_inference | 900_000 (15m) | 3_600_000 (60m) | 60_000 (60s) |
| :batch | 3_600_000 (60m) | :infinity | 300_000 (5m) |
Margin formula ensures inner timeouts fire before outer:

rpc_timeout = total_timeout - worker_call_margin_ms (1000) - pool_reply_margin_ms (200)

Deadline propagation tracks remaining budget:
- Pool.execute stores deadline_ms = now + timeout in opts
- Queue handler uses effective_queue_timeout_ms/2 to respect deadline
- Worker execution uses derive_rpc_timeout_from_opts/2 to compute remaining budget
- All GenServer.call timeouts are caught and returned as structured errors
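
As a worked example of the margin formula and deadline propagation, the arithmetic can be sketched in a few lines of Python. The margins are the defaults quoted above. The function names, the clamp to zero, and the choice to apply the margins to the remaining budget after queueing are assumptions for illustration, not a description of the Elixir helpers.

```python
WORKER_CALL_MARGIN_MS = 1000  # default quoted in this changelog
POOL_REPLY_MARGIN_MS = 200    # default quoted in this changelog


def derive_rpc_timeout(total_ms: int) -> int:
    """Inner RPC timeout: total budget minus both margins (clamped at zero)."""
    return max(total_ms - WORKER_CALL_MARGIN_MS - POOL_REPLY_MARGIN_MS, 0)


def remaining_budget(deadline_ms: int, now_ms: int) -> int:
    """Budget left until the stored deadline (clamped at zero)."""
    return max(deadline_ms - now_ms, 0)


# A :production_strict call with a 60_000 ms budget, started at t = 0.
start_ms = 0
deadline_ms = start_ms + 60_000

# No queueing: the RPC gets the full budget minus margins.
immediate_rpc_timeout = derive_rpc_timeout(60_000)

# The request waited 5_000 ms in the queue before a worker picked it up,
# so the RPC timeout is derived from what is left of the budget.
left_after_queue = remaining_budget(deadline_ms, now_ms=5_000)
queued_rpc_timeout = derive_rpc_timeout(left_after_queue)
```

With these numbers the immediate call gets 58_800 ms and the queued call gets 53_800 ms, so the inner timeout always expires before the caller's outer one.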
Backward Compatibility
- All legacy config keys (pool_request_timeout, grpc_command_timeout, etc.) still work
- When explicitly set, they take precedence over profile-derived values
- When not set, they derive from the active profile
- Default profile is :balanced which provides similar values to previous defaults
0.8.7 - 2025-12-31
Fixed
- Python Any encoding performance - Avoided extra UTF-8 decode/encode round-trips in TypeSerializer
  - JSON payloads now stay as bytes for google.protobuf.Any.value
  - Stabilizes orjson benchmark expectations on large payloads
- Test isolation - Prevented telemetry/logging state bleed across tests
  - OOM telemetry assertions now scoped by operation ID
  - Logging tests reset global logging disable state
- Python integration test bootstrap - Ensure --include python_integration reliably provisions deps
  - CLI tag detection now triggers bootstrap and real env doctor checks
  - Test helper validates .venv exists after bootstrap and skips redundant deps fetches
- HealthMonitor cleanup - Ignore benign shutdown races in test teardown
- Ready file race condition on CI - Fixed flaky gRPC server startup on slow/loaded systems
  - read_ready_file/1 now returns :not_ready instead of error when file is empty
  - Polling loop continues retrying instead of failing immediately
  - Resolves {:invalid_ready_file, ""} errors on GitHub Actions runners
  - Python already uses atomic rename (os.replace), but edge cases on slow filesystems could still produce empty reads
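The ready-file behavior described above can be sketched in Python. This only illustrates the pattern; the reader side lives in Elixir. The `NOT_READY` marker, the function names, and the assumption that the file holds a port number are all hypothetical.

```python
import os
import tempfile
import time

NOT_READY = object()  # stand-in for the :not_ready atom


def write_ready_file(path, port):
    """Write to a temp file in the same directory, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(port))
    os.replace(tmp_path, path)


def read_ready_file(path):
    """Return the port, or NOT_READY when the file is missing or still empty."""
    try:
        with open(path) as handle:
            content = handle.read().strip()
    except FileNotFoundError:
        return NOT_READY
    if not content:
        return NOT_READY  # an empty read is retried, not treated as an error
    return int(content)


def wait_for_ready(path, timeout_s=5.0, interval_s=0.05):
    """Poll until the file has content or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = read_ready_file(path)
        if result is not NOT_READY:
            return result
        time.sleep(interval_s)
    raise TimeoutError(f"ready file {path} not populated within {timeout_s}s")
```

Treating an empty file as "not ready yet" keeps the polling loop alive through the brief window in which a slow filesystem exposes the file before its content.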
0.8.6 - 2025-12-31
Added
- Session cleanup telemetry - Emit telemetry events for session lifecycle monitoring
  - [:snakepit, :bridge, :session, :pruned] - Emitted when sessions expire via TTL
  - [:snakepit, :bridge, :session, :accumulation_warning] - Emitted when session count exceeds thresholds
- Strict mode for session store - New strict_mode: true option for dev/test environments
  - Logs loud warnings when session count exceeds 80% of max_sessions
  - Helps detect session leaks during development
- BaseAdapter session context - Added session_id property and set_session_context() to BaseAdapter
  - Ensures consistent session_id handling across all adapters
  - Backward compatible with existing adapter implementations
- Session Scoping Guide - New documentation at guides/session-scoping-rules.md
  - Explains session lifecycle, reference scoping, and recommended patterns
  - Documents telemetry events and strict mode configuration
0.8.5 - 2025-12-31
Fixed
- GRPCWorker graceful shutdown - Eliminated spurious crash logs during application shutdown
  - Added shutting_down flag to distinguish expected exits from unexpected crashes
  - Handle supervisor EXIT signals (:shutdown, {:shutdown, _}) explicitly
  - Detect shutdown via mailbox peek and pool liveness checks to handle message race conditions
  - Shutdown exit codes (0, 137/SIGKILL, 143/SIGTERM) logged at debug level during shutdown
  - Non-zero exits only logged as errors when not in shutdown context
- Configurable shutdown timeouts - Graceful shutdown timeout now configurable via :graceful_shutdown_timeout_ms
  - Default increased from 2s to 6s to accommodate Python's async shutdown envelope
  - child_spec and Worker.Starter derive supervisor shutdown timeout from this config
  - New Snakepit.GRPCWorker.supervisor_shutdown_timeout/0 for custom supervision trees
- Python server shutdown - Improved graceful termination sequence
  - Server stop grace period increased to 2 seconds
  - wait_for_termination now awaited with 3s timeout before force-cancel
  - Sequential shutdown: close servicer → stop server → await termination task
- Python dependency version mismatch - Updated requirements.txt to match generated protobuf/grpc stubs
  - grpcio: >=1.60.0 → >=1.76.0
  - protobuf: >=4.25.0 → >=6.31.1
  - Previously, users installing minimum versions would get runtime import errors
- Proto README documentation drift - Rewrote priv/proto/README.md to match actual implementation
  - Fixed service name: SnakepitBridge → BridgeService
  - Removed non-existent methods (GetVariable, SetVariable, WatchVariables, optimization APIs)
  - Documented only implemented RPC methods
  - Added Any encoding convention documentation
  - Clarified binary payload format (opaque bytes, not pickle/ETF specific)
  - Moved aspirational features to "Roadmap" section
- Streaming backpressure - Added bounded queue (maxsize=100) to ExecuteStreamingTool
  - Prevents unbounded memory growth when producer outpaces consumer
  - drain_sync now blocks on enqueue with proper exception handling
- Streaming cancellation handling - Producer now stops when client disconnects
  - Added cancellation event propagation to drain loops
  - Added disconnect watcher task that polls context.is_active()
  - Producer task explicitly cancelled on cleanup
  - Iterator/generator properly closed via aclose()/close()
- Adapter lifecycle cleanup - Added cleanup() calls to adapter lifecycle
  - ExecuteTool: Calls adapter.cleanup() in finally block (always runs)
  - ExecuteStreamingTool: Calls adapter.cleanup() in finally block
  - Uses inspect.isawaitable() pattern for robust sync/async handling
  - Added _maybe_cleanup() and _close_iterator() helper functions
- Threaded server parity - Applied all streaming/cleanup fixes to grpc_server_threaded.py
  - Bounded queue, cancellation handling, iterator closing, adapter cleanup
- CancelledError handling - Producer now properly re-raises CancelledError
  - Prevents task from blocking on queue.put() when consumer is gone
  - On cancellation, task terminates immediately without sentinel (consumer is already gone)
- Sentinel delivery under backpressure - Fixed potential hang when queue is full
  - Sentinel is now await queue.put(sentinel) (guaranteed delivery) on normal completion
  - Previous put_nowait could silently drop sentinel, causing consumer to hang forever
- Sentinel delivery on disconnect - Fixed hang when watch_disconnect() sets cancelled flag
  - watch_disconnect() now injects sentinel directly into queue when disconnect detected
  - Drops buffered chunks if needed to make room for sentinel (consumer is gone anyway)
  - Prevents hang when producer exits normally (not via CancelledError) with cancelled flag set
- Binary parameters handling - Fixed unconditional pickle.loads security issue
  - binary_parameters now treated as opaque bytes by default (per proto docs)
  - Pickle only used if metadata["binary_format:<param>"] == "pickle"
  - Enables safe handling of images, audio, and other binary data
- Loadtest demo formatting - Fixed format_number/1 crash on nil values and spacing in output
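The opt-in pickle rule from the binary parameters fix can be sketched as follows. The function name `decode_binary_parameters` is hypothetical and the real server code may be structured differently; only the metadata key format comes from the entry above.

```python
import pickle


def decode_binary_parameters(binary_parameters, metadata):
    """Return parameters as opaque bytes unless pickle is explicitly requested.

    Hypothetical helper: the metadata key "binary_format:<param>" follows the
    changelog entry; everything else is illustrative.
    """
    decoded = {}
    for name, payload in binary_parameters.items():
        if metadata.get(f"binary_format:{name}") == "pickle":
            # Unpickling executes arbitrary code; only do it when asked to.
            decoded[name] = pickle.loads(payload)
        else:
            decoded[name] = payload  # images, audio, etc. pass through untouched
    return decoded


params = {"image": b"\x89PNG", "config": pickle.dumps({"depth": 3})}
result = decode_binary_parameters(params, {"binary_format:config": "pickle"})
```

With empty metadata, both parameters would be returned as raw bytes, which is the safe default.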
Added
- CI version guard - New scripts/check_stub_versions.py validates that requirements.txt versions match generated protobuf/grpc stubs
  - Integrated into GitHub Actions CI workflow
  - Checks protobuf, grpcio, and grpcio-tools versions
  - Prevents "works for us, breaks for users" dependency issues
- Streaming cancellation tests - New tests for streaming cleanup behavior
  - test_streaming_cleanup_called_on_normal_completion
  - test_streaming_producer_stops_on_client_disconnect
  - test_async_streaming_cleanup_called
  - test_streaming_completes_under_backpressure - verifies sentinel delivery with >maxsize chunks
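The backpressure scenario these tests cover (more chunks than the queue can hold, with the sentinel still delivered) can be sketched with plain asyncio. This is a minimal illustration rather than the server code; `produce`, `consume`, and `stream` are hypothetical names.

```python
import asyncio

SENTINEL = object()  # marks the end of the stream


async def produce(chunks, queue):
    """Push chunks into a bounded queue, then deliver the sentinel."""
    try:
        for chunk in chunks:
            await queue.put(chunk)  # suspends while the queue is full
        await queue.put(SENTINEL)   # awaited, so it cannot be dropped when full
    except asyncio.CancelledError:
        raise                       # consumer is gone: stop without a sentinel


async def consume(queue):
    """Collect chunks until the sentinel arrives."""
    received = []
    while True:
        item = await queue.get()
        if item is SENTINEL:
            return received
        received.append(item)


async def stream(chunks, maxsize=100):
    queue = asyncio.Queue(maxsize=maxsize)
    producer = asyncio.create_task(produce(chunks, queue))
    try:
        return await consume(queue)
    finally:
        if not producer.done():
            producer.cancel()  # explicit cancellation on cleanup
        await asyncio.gather(producer, return_exceptions=True)


# 250 chunks through a queue of 10 exercises the full-queue path.
chunks_out = asyncio.run(stream(list(range(250)), maxsize=10))
```

Using `put_nowait` for the sentinel instead would raise `asyncio.QueueFull` on a full queue, which is how a sentinel can be lost if that error is swallowed.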
Changed
- Adapter lifecycle documentation - Clarified per-request adapter lifecycle in base_adapter.py
  - Documented that adapters are instantiated per-request
  - Added example showing module-level caching pattern for expensive resources
  - Explained initialize()/cleanup() semantics
- Streaming demo modernization - Updated execute_streaming_tool_demo.exs to use standard bootstrap pattern
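The per-request lifecycle and module-level caching pattern mentioned above might look roughly like this. It is a sketch under assumptions: `ExampleAdapter`, `AsyncExampleAdapter`, `load_model`, `maybe_cleanup`, and `handle_request` are hypothetical names, and the example in base_adapter.py may differ.

```python
import asyncio
import inspect

_MODEL_CACHE = {}  # module-level: survives across per-request adapter instances


def load_model(name):
    """Hypothetical expensive loader, cached so each request does not reload."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = {"name": name}  # placeholder for a real model object
    return _MODEL_CACHE[name]


class ExampleAdapter:
    """Hypothetical adapter, instantiated once per request."""

    def initialize(self):
        self.model = load_model("demo")
        self.cleaned_up = False

    def cleanup(self):
        self.cleaned_up = True


class AsyncExampleAdapter(ExampleAdapter):
    async def cleanup(self):
        self.cleaned_up = True


async def maybe_cleanup(adapter):
    """Call cleanup() whether it is sync or async."""
    cleanup = getattr(adapter, "cleanup", None)
    if cleanup is None:
        return
    result = cleanup()
    if inspect.isawaitable(result):
        await result


async def handle_request(adapter_cls):
    adapter = adapter_cls()  # fresh instance per request
    adapter.initialize()
    try:
        return adapter       # a real handler would run the tool here
    finally:
        await maybe_cleanup(adapter)  # always runs, even on error


sync_adapter = asyncio.run(handle_request(ExampleAdapter))
async_adapter = asyncio.run(handle_request(AsyncExampleAdapter))
```

Both adapters end up sharing the same cached model object, while per-request state is released in `cleanup()`.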
0.8.4 - 2025-12-30
Added
- ExecuteStreamingTool Implementation - Full gRPC streaming support in BridgeServer
  - End-to-end streaming from clients through to Python workers
  - Automatic final chunk injection if worker doesn't send one
  - Execution time metadata on final chunks
  - Proper error handling for streaming failures
Fixed
- Timeout Parsing Bug - Fixed precedence issue in tool_call_options/1 that caused string timeout values to bypass parsing
- Binary Parameter Encoding - Fixed remote tool execution to properly handle binary parameters without attempting JSON encoding of tuples
0.8.3 - 2025-12-29
Fixed
- Hardware Detector Cache - Replaced ETS cache creation with :persistent_term to eliminate race conditions and table ownership hazards under concurrent access.
Removed
- Deprecated/Unused APIs - Removed RetryPolicy.exponential_backoff/2, RetryPolicy.with_circuit_breaker/2, HeartbeatMonitor.get_status/1, RunID.valid?/1, and deprecated ProcessRegistry.register_worker/4.
0.8.2 - 2025-12-29
Added
- Process-Level Log Isolation - New Snakepit.Logger functions for per-process log level control
  - set_process_level/1 - Set log level for current process only
  - get_process_level/0 - Get effective log level for current process
  - clear_process_level/0 - Clear process-level override
  - with_level/2 - Execute function with temporary log level
- Test Helper Module - Snakepit.Logger.TestHelper for test isolation
  - setup_log_isolation/0 - Set up per-test log level isolation
  - capture_at_level/2 - Capture logs at specific level without affecting other tests
  - capture_at_level_with_result/2 - Capture logs and return function result
  - suppress_logs/1 - Suppress all logs for duration of function
Fixed
- Flaky Test Race Condition - Tests that modify log levels no longer interfere with each other when running concurrently
  - Root cause: Multiple async tests modifying global Application.get_env(:snakepit, :log_level) caused race conditions
  - Solution: Logger now checks process-local override first, then Elixir Logger process level, then global config
Changed
- Log level resolution now uses priority order:
  - Process-level override (via set_process_level/1) - highest priority
  - Elixir Logger process level (via Logger.put_process_level/2)
  - Application config (via config :snakepit, log_level: ...) - lowest priority
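The priority order above is a first-match lookup. A Python sketch of the idea follows; the real logic lives in the Elixir Snakepit.Logger module, the function and parameter names are hypothetical, and the "error" fallback is assumed from the 0.8.1 default.

```python
def resolve_log_level(process_override=None, logger_process_level=None,
                      app_config_level=None, default="error"):
    """Return the first level that is set, from highest to lowest priority."""
    for candidate in (process_override, logger_process_level, app_config_level):
        if candidate is not None:
            return candidate
    return default  # assumed fallback when nothing is configured


# A per-process override wins even when the application config says otherwise.
level = resolve_log_level(process_override="debug", app_config_level="warning")
```

Since the highest-priority source is process-local, concurrent tests can each set their own level without touching shared state.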
0.8.1 - 2025-12-27
Changed
- BREAKING: Default log level changed from :warning to :error for silent-by-default behavior
- Centralized all logging through Snakepit.Logger module
- Python logging now respects SNAKEPIT_LOG_LEVEL environment variable
- Replaced stdout GRPC_READY signaling with a non-console control channel
- Removed all hardcoded IO.puts and Python print() statements
Added
- Category-based logging: :lifecycle, :pool, :grpc, :bridge, :worker, :startup, :shutdown, :telemetry, :general
- config :snakepit, log_categories: [...] to enable specific categories
- priv/python/snakepit_bridge/logging_config.py for centralized Python logging
Fixed
- Noisy startup messages no longer pollute console output
- Health-check messages suppressed by default
- gRPC server startup messages suppressed by default
Migration Guide
If you relied on seeing startup logs, add to your config:
config :snakepit, log_level: :info
0.8.0 - 2025-12-27
Added
Hardware Abstraction Layer
- Hardware Detection - New Snakepit.Hardware module providing automatic detection of CPU, NVIDIA CUDA, Apple MPS, and AMD ROCm accelerators.
- Hardware Detector - Snakepit.Hardware.Detector with unified detection API and caching.
- CPU Detection - Snakepit.Hardware.CPUDetector with cores, threads, model, and feature detection (AVX, AVX2, SSE4.2).
- CUDA Detection - Snakepit.Hardware.CUDADetector for NVIDIA GPUs via nvidia-smi with version, driver, and memory info.
- MPS Detection - Snakepit.Hardware.MPSDetector for Apple Metal Performance Shaders on macOS.
- ROCm Detection - Snakepit.Hardware.ROCmDetector for AMD GPUs via rocm-smi.
- Device Selection - Snakepit.Hardware.Selector with automatic selection and fallback strategies.
Enhanced ML Telemetry
- Telemetry Events - Snakepit.Telemetry.Events defining ML-specific telemetry events for hardware, errors, circuit breaker, and GPU profiling.
- Logger Handler - Snakepit.Telemetry.Handlers.Logger for automatic logging of all ML telemetry events.
- Metrics Handler - Snakepit.Telemetry.Handlers.Metrics with Prometheus-compatible metric definitions.
- GPU Profiler - Snakepit.Telemetry.GPUProfiler GenServer for periodic GPU memory, utilization, temperature, and power sampling.
- Span Helper - Snakepit.Telemetry.Span for convenient timing of operations with automatic start/stop telemetry.
Structured Exception Protocol
- Shape Errors - Snakepit.Error.Shape with ShapeMismatch and DTypeMismatch exceptions with dimension detection.
- Device Errors - Snakepit.Error.Device with DeviceMismatch and OutOfMemory exceptions with recovery suggestions.
- Error Parser - Snakepit.Error.Parser for automatic parsing of Python errors with pattern detection for shape, device, and OOM errors.
Crash Barrier Supervision
- Circuit Breaker - Snakepit.CircuitBreaker GenServer with closed/open/half-open states for fault tolerance.
- Health Monitor - Snakepit.HealthMonitor for tracking crash patterns with rolling windows and health status.
- Retry Policy - Snakepit.RetryPolicy with configurable exponential backoff, jitter, and retriable error filtering.
- Executor - Snakepit.Executor with execute_with_retry/2, execute_with_timeout/2, execute_with_circuit_breaker/3, and batch execution.
Documentation
- New guide: guides/hardware-detection.md - Hardware detection usage and device selection.
- New guide: guides/crash-recovery.md - Circuit breaker, health monitoring, and retry patterns.
- New guide: guides/error-handling.md - ML-specific error types and parsing.
- New guide: guides/ml-telemetry.md - ML telemetry events, GPU profiling, and metrics.
Changed
- ExDoc Configuration - Added new module groups for Hardware, Reliability, ML Errors, and enhanced Telemetry.
- Telemetry Module Groups - Expanded to include Events, GPUProfiler, Span, and Handlers submodules.
0.7.7 - 2025-12-26
Changed
- Pool GenServer initialization redesigned for OTP compliance. Worker startup now uses an async spawn_link pattern instead of blocking receive in handle_continue, keeping the GenServer responsive to shutdown signals during batch initialization.
- Multi-pool configuration now correctly isolates pool_size per pool. Each pool in :pools config uses its own pool_size value; the global pool_config[:pool_size] is only used in legacy single-pool mode.
- Test harness improvements: after_suite now monitors the supervisor and waits for actual termination before returning, preventing orphaned process warnings between test runs.
- ProcessRegistry defers unregistration when external OS processes are still alive, with automatic retry cleanup after process termination.
Fixed
- Pool no longer crashes during application shutdown when WorkerSupervisor terminates before batch initialization completes. Added supervisor health checks before starting each worker batch.
- ProcessKiller process_alive?/1 on Linux now detects zombie processes by reading /proc/{pid}/stat state, preventing false positives for terminated-but-not-reaped processes.
- Test configuration pollution fixed: tests that modify :pools config now properly save and restore :pool_config to prevent pool_size leakage between tests.
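The zombie check described above depends on the state field of /proc/{pid}/stat. A Python sketch of the parsing follows; the actual check is in the Elixir ProcessKiller module, the function names are hypothetical, and treating both Z (zombie) and X (dead) as "not alive" is an assumption.

```python
def state_from_stat(stat_line):
    """Extract the single-letter state field from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' rather than on whitespace.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def process_alive(pid):
    """Linux-only sketch: a zombie still has a /proc entry but is not alive."""
    try:
        with open(f"/proc/{pid}/stat") as handle:
            state = state_from_stat(handle.read())
    except (FileNotFoundError, ProcessLookupError):
        return False
    return state not in ("Z", "X")


zombie_state = state_from_stat("1234 (python3) Z 1 1234 1234 0 -1 4227084")
```

A check that only tests whether the PID exists would report such a process as alive until its parent reaps it.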
Added
- README_TESTING.md updated with test isolation patterns, application lifecycle documentation, and multi-pool configuration examples for integration tests.
- REMEDIATION_PLAN.md documenting the root cause analysis and fixes for test harness race conditions.
0.7.6 - 2025-12-26
Added
- Deterministic shutdown cleanup via Snakepit.RuntimeCleanup and manual cleanup via Snakepit.cleanup/0, with cleanup telemetry events.
- Process group lifecycle support with process_group_kill, pgid tracking in ProcessRegistry, and new ProcessKiller helpers for group kill/pgid lookup.
- Python gRPC servers can create their own process group when SNAKEPIT_PROCESS_GROUP is set.
- Python package management supports isolated virtualenvs via :python_packages env_dir, auto-creating venvs and honoring command timeouts.
- Documentation suites for FFI ergonomics, Python process cleanup, and runtime hygiene (docs/20251226/*).
- New tests for runtime cleanup, logger defaults, process group kill, process registry cleanup deferrals, and uv venv integration.
Changed
- Quiet-by-default library config: library_mode: true, log_level: :warning, grpc_log_level: :error, log_python_output: false, plus new cleanup defaults (cleanup_on_stop, cleanup_on_stop_timeout_ms, cleanup_poll_interval_ms, cleanup_retry_interval_ms, cleanup_max_retries).
- Application supervision always starts Snakepit.Pool.ProcessRegistry and Snakepit.Pool.ApplicationCleanup even without pooling; Application.stop/1 now runs a cleanup pass when enabled.
- gRPC worker startup/shutdown now tracks pgid/process_group, can kill process groups, buffers startup output, suppresses Python stdout unless enabled, and passes SNAKEPIT_PROCESS_GROUP while extending PYTHONPATH with SnakeBridge priv Python.
- Snakepit.EnvDoctor now locates grpc_server.py from the project or installed app root and expands PYTHONPATH to include Snakepit/SnakeBridge priv Python when running checks.
- Python runtime selection now prefers explicit overrides, then :python_packages venv Python, then managed/system fallback; package operations resolve Python from the configured venv.
- Cleanup retry timing for worker supervisor is now read from runtime config with _ms suffix.
- Version references updated to 0.7.6 in mix.exs and README dependency docs. Updated supertester to v0.4.0.
Fixed
- Taint registry ETS initialization now tolerates a pre-existing table.
- Process registry cleanup no longer drops entries while external OS processes remain alive, and DETS is synced on cleanup/unregister.
- Startup failure diagnostics now include buffered Python output to aid gRPC server troubleshooting.
0.7.5 - 2025-12-25
Added
- Snakepit.PythonPackages module for uv/pip package management.
- Snakepit.PackageError structured error type for package operations.
- :python_packages application config for installer, timeout, and env settings.
- Snakepit.PythonPackages.ensure!/2 for provisioning required packages.
- Snakepit.PythonPackages.check_installed/2 for verifying package presence.
- Snakepit.PythonPackages.lock_metadata/2 for lockfile package metadata.
- Snakepit.PythonPackages.install!/2 for direct requirement installs.
0.7.4 - 2025-12-25
Added
- Zero-copy interop – Snakepit.ZeroCopy + Snakepit.ZeroCopyRef handle DLPack/Arrow exports/imports with explicit close/1 and telemetry for export/import/fallback flows.
- Crash barrier – Worker crash classification, taint tracking, and idempotent retry policy with new crash/taint/restart telemetry events.
- Hermetic Python runtime support – uv-managed interpreter selection, bootstrap integration, and runtime identity metadata propagation.
- Exception translation – Structured Python error payloads mapped into Snakepit.Error.* exception structs with telemetry for mapped/unmapped translations.
- Runtime contract coverage – Integration test coverage for kwargs, call_type, and payload version fields.
Changed
- gRPC bridge error payloads – Python gRPC servers now return JSON-structured error payloads for tooling failures.
- Telemetry catalog – Added runtime event listings for zero-copy, crash barrier, and exception translation.
Fixed
- Queue resiliency – Tainted workers no longer drive queued requests; queue dispatch selects non-tainted workers when available.
0.7.3 - 2025-12-25
Fixed
- CI test infrastructure – Fixed python_integration test failures in CI by starting GRPC.Client.Supervisor in PythonIntegrationCase setup and enabling pooling in StreamingRegressionTest setup.
- EnvDoctor port check race condition – Fixed intermittent env_doctor_test failures caused by :grpc_port check reading from global Application env instead of opts. The check now accepts grpc_port via opts (consistent with other state values), eliminating conflicts when tests or the application bind to overlapping port ranges.
0.7.2 - 2025-12-25
Changed
- Codebase cleanup – Removed dead code, unused modules, and obsolete files across the Elixir and Python codebases.
- Static analysis compliance – Resolved Dialyzer warnings and Credo issues for cleaner, more maintainable code.
- Documentation overhaul – Rewrote README.md and ARCHITECTURE.md for v0.7.2; consolidated DIAGS.md and DIAGS2.md into a single DIAGRAMS.md with mermaid diagrams; updated all README_* guides with version markers; removed obsolete test_bidirectional.py and remaining_handlers.txt.
0.7.1 - 2025-12-24
Added
- Script ergonomics – Snakepit.run_as_script/2 now supports restart, await_pool, and halt options plus configurable shutdown/cleanup timeouts.
- Example runner controls – examples/run_all.sh honors SNAKEPIT_EXAMPLE_DURATION_MS and SNAKEPIT_RUN_TIMEOUT_MS.
- Examples bootstrap helper – Snakepit.Examples.Bootstrap.run_example/2 centralizes pool readiness and script exit behavior.
Changed
- Pooling defaults to opt-in – pooling_enabled now defaults to false to avoid auto-start surprises in scripts.
- Examples cleanup – bidirectional and documentation-only examples now shut down cleanly under both mix run and run_all.sh.
Fixed
- Mix-run config drift – examples now restart Snakepit to apply script-level env overrides, preventing port mismatches and orphaned workers.
0.7.0 - 2025-12-22
Added
- Capacity-aware scheduling – Pool tracks per-worker load and threads_per_worker, with capacity_strategy (:pool default, :profile, :hybrid) configurable globally or per pool.
- Request metadata exposure – Python SessionContext now carries request_metadata for adapters; grpc_server.py wraps ExecuteTool/ExecuteStreamingTool in telemetry spans.
Changed
- Correlation propagation – gRPC calls now set x-snakepit-correlation-id headers and ExecuteToolRequest.metadata on execute + streaming paths; streaming calls ensure a correlation ID exists.
- Process profile env merge – Worker env defaults merge system thread limits with user overrides instead of replacing them.
Fixed
- ToolRegistry cleanup logging – Cleanup logs now report the correct count of removed tools.
0.6.11 - 2025-12-20
Added
- Pool status CLI – mix snakepit.status reports pool size, queue depth, and error counts without requiring a full dashboard stack.
- Adapter generator – mix snakepit.gen.adapter scaffolds a minimal Python adapter under priv/python with a ready-to-copy adapter_args snippet.
- Binary gRPC results – Bridge responses now include binary_result support so tools can return {:binary, payload[, metadata]} tuples for large outputs.
- Examples runner – examples/run_all.sh executes every example (including showcase/loadtest) via mix run, with auto-stop and configurable loadtest sizes.
Changed
- Doctor checks – Snakepit.EnvDoctor validates the Elixir grpc_port and runs per-pool adapter import health checks via grpc_server.py --health-check --adapter ....
- Bootstrap consolidation – scripts/docs/examples now standardize on mix snakepit.setup + mix snakepit.doctor, and examples prefer mix run with the shared bootstrap helper.
- Python env defaults – gRPC workers merge default PYTHONPATH and SNAKEPIT_PYTHON into adapter environments to keep imports predictable.
- Docs organization – legacy unified-bridge and unified-example design docs are archived, and install guidance now differentiates repo bootstrap from app usage.
Fixed
- Threaded server loop – grpc_server_threaded.py now ensures a running asyncio event loop to avoid deprecation warnings.
- Worker spawn telemetry – gRPC worker spawn/terminate durations now use consistent monotonic units, preventing negative duration values in telemetry handlers.
- Elixir tool decoding in Python – SessionContext.call_elixir_tool/2 decodes JSON/binary payloads via TypeSerializer instead of returning raw protobuf Any values.
- Python ML workflow serialization – showcase ML handlers coerce NumPy-derived stats into JSON-safe floats to avoid orjson errors.
- Tool registration noise – Python bridge caches tool registration per session and treats duplicate registrations as info, avoiding false error reports.
0.6.10 - 2025-11-13
Added
- Canonical worker metadata – Snakepit.Pool.Registry.metadata_keys/0 exposes the authoritative metadata keys (:worker_module, :pool_name, :pool_identifier, :adapter_module) and the surrounding docs call out how pool helpers, diagnostics, and worker profiles should treat that map as the single source of truth.
- Telemetry catalog + filters – Snakepit.Telemetry.Naming.python_event_catalog/0 now documents the full event/measurement schema emitted by snakepit_bridge, while the Python telemetry stream implements glob-style allow/deny filters pushed from Elixir so noisy adapters can be muted without redeploying workers.
- Async adapter registration – snakepit_bridge.base_adapter.BaseAdapter adds register_with_session_async/2 (plus regression coverage) so asyncio/aio stubs can advertise tool surfaces without blocking while the synchronous helper stays intact for classic stubs.
- Self-managing Python tests – test_python.sh now creates/updates .venv, fingerprints priv/python/requirements.txt, installs deps, regenerates protobuf stubs, and exports quiet OTEL defaults so ./test_python.sh is a one-command pytest runner on any Linux/WSL host.
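Glob-style allow/deny filtering of the kind mentioned under "Telemetry catalog + filters" can be sketched with the standard library. The function name, the event names, and the precedence (deny wins, empty allow list admits everything) are assumptions for illustration, not the bridge's documented behavior.

```python
from fnmatch import fnmatchcase


def event_allowed(event_name, allow=None, deny=None):
    """Decide whether an event passes the filters (hypothetical helper).

    Assumed semantics: a deny match always wins, and an empty or missing
    allow list admits everything that was not denied.
    """
    if deny and any(fnmatchcase(event_name, pattern) for pattern in deny):
        return False
    if allow:
        return any(fnmatchcase(event_name, pattern) for pattern in allow)
    return True


# Mute one noisy family of events while keeping the rest of the namespace.
muted = event_allowed(
    "snakepit.python.tool.start",
    allow=["snakepit.python.*"],
    deny=["*.tool.*"],
)
```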
Changed
- Queue timeout enforcement – Queued requests now carry their timer reference, the pool cancels those timers as soon as the request is dequeued or dropped, and statistics/logging happen in one place, preventing runaway timers when pools churn.
- Threaded adapter guardrails – priv/python/grpc_server_threaded.py refuses to boot adapters that don’t set __thread_safe__ = True, logging a clear remediation path and forcing unsafe adapters back to process mode.
- Tool registration resilience – snakepit_bridge.base_adapter.BaseAdapter wraps gRPC stub responses in _coerce_stub_response/1, unwrapping awaitables, UnaryUnaryCall structs, or lazy callables before checking response.success, which stabilizes adapters that mix sync and async gRPC stubs.
- Heartbeat/schema documentation – Snakepit.Config now ships typedocs for the normalized pool/heartbeat map shared with Python, and the architecture plus gRPC guides emphasize that BEAM is the authoritative heartbeat monitor with SNAKEPIT_HEARTBEAT_CONFIG kept in sync across languages.
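The threaded adapter guardrail amounts to an explicit opt-in check on the adapter class. A minimal sketch follows, assuming a hypothetical `ensure_thread_safe` helper and error type; only the `__thread_safe__ = True` marker comes from the entry above.

```python
class ThreadUnsafeAdapterError(RuntimeError):
    """Hypothetical error raised when an adapter has not opted in."""


def ensure_thread_safe(adapter_cls):
    """Refuse adapters that do not explicitly declare __thread_safe__ = True."""
    if getattr(adapter_cls, "__thread_safe__", False) is not True:
        raise ThreadUnsafeAdapterError(
            f"{adapter_cls.__name__} does not set __thread_safe__ = True; "
            "declare it after auditing shared state, or run in process mode"
        )
    return adapter_cls


class SafeAdapter:
    __thread_safe__ = True  # explicit opt-in


class LegacyAdapter:
    pass  # no declaration, so it is rejected


accepted = ensure_thread_safe(SafeAdapter)
```

Requiring the literal `True` means a missing or falsy attribute is treated as unsafe by default.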
Fixed
- Stale queue timeouts – Queue timeout messages that arrive after a request has already been serviced are ignored, and clients now receive {:error, :queue_timeout} exactly once when their request is actually dropped.
0.6.9 - 2025-11-13
Added
- Registry helpers: Introduced Snakepit.Pool.Registry.fetch_worker/1 plus metadata helpers used throughout the pool, bridge server, worker profiles, and diagnostics so worker_module, pool_identifier, and pool_name are always looked up in a single, tested place.
- Binary parameter validation: Snakepit.GRPC.BridgeServer now rejects non-binary entries in ExecuteToolRequest.binary_parameters, guaranteeing local tools only ever see {:binary, payload} tuples while remote workers still receive the untouched proto map.
- Slow-test workflow: Tagged the long-running suites with @tag :slow, defaulted mix test to skip them, and documented the opt-in commands plus the 2025-11-13 slow-test inventory in README_TESTING and docs/20251113/slow-test-report.md.
- Lifecycle observability: Memory-based recycling now logs a warning whenever a worker cannot answer the :get_memory_usage probe, preventing silent configuration drift.
- Rogue cleanup controls: Operators can configure the exact script names and run-id markers that qualify Python processes for startup cleanup, with defaults matching grpc_server.py/grpc_server_threaded.py.
- Memory recycle telemetry & diagnostics: [:snakepit, :worker, :recycled] now emits memory_mb/memory_threshold_mb, Prometheus metrics expose snakepit.worker.recycled counters, and both Snakepit.Diagnostics.ProfileInspector plus mix snakepit.profile_inspector show per-pool “Memory Recycles” totals for operators.
Changed
- GRPC worker lookups: GRPCWorker, ToolRegistry clients, pool helpers, and worker profiles call the new Registry helpers instead of Registry.lookup/2, ensuring metadata stays normalized and reverse lookups never crash when metadata is missing.
- Bridge test coverage: Added binary-parameter regression tests that prove malformed payloads are rejected before reaching Elixir tools, plus lifecycle tests that simulate failing memory probes.
- Process killer tests: Rogue cleanup unit tests now cover the customizable scripts/markers path so changes to the configuration surface immediately.
- Heartbeat contract clarity: Documented what dependent: true|false means, exported SNAKEPIT_HEARTBEAT_CONFIG expectations, and added both HeartbeatMonitor- and GRPCWorker-level regression tests so fail-fast vs independent behavior stays well defined.
- Telemetry stream shutdown noise: gRPC telemetry stream shutdowns that report :normal or :shutdown now log at debug level, eliminating the warning spam that buried actionable failures during slow-test runs.
Fixed
- Registry metadata race: Pool.Registry.put_metadata/2 now reports {:error, :not_registered} when clients attempt to attach metadata before the worker is registered and downgrades those expected attempts to debug logs, eliminating silent successes that previously returned :ok.
- Heartbeat metrics stability: The snakepit.worker.memory_mb summary now pulls values via Map.get/2 and non-dependent monitors retain timeout/missed-heartbeat counters, so Telemetry/Prometheus exporters stop crashing when measurements arrive as maps and status checks reflect the real failure budget.
- Docs parity: README, README_GRPC, README_PROCESS_MANAGEMENT, and ARCHITECTURE now describe the binary parameter contract, registry helper usage, lifecycle behavior, and rogue cleanup assumptions introduced in this release.
0.6.8 - 2025-11-12
This release also rolls up the previously undocumented fail-fast docs/tests work from 074f2260f703d16ccfecf937c10af905165419f0 (heartbeat fail-fast suites, orphan cleanup stress tests, queue probe adapter, and config fail-fast coverage).
Added
- Bootstrap automation: Introduced Snakepit.Bootstrap, mix snakepit.setup, and a make bootstrap target to install Mix deps, provision .venv/.venv-py313, install Python requirements, run scripts/setup_test_pythons.sh, and regenerate gRPC stubs with fully instrumented logging.
- Environment doctor: New Snakepit.EnvDoctor module plus mix snakepit.doctor task verify interpreter availability, grpc import, .venv/.venv-py313, priv/python/grpc_server.py --health-check, and worker port availability with actionable remediation messages.
- Runtime guardrails: Snakepit.Application now invokes Snakepit.EnvDoctor.ensure_python!/0 before pools start, failing fast when Python prerequisites are missing. Test helpers (test/support/fake_doctor.ex, test/support/bootstrap_runner.ex, test/support/command_runner.ex) enable deterministic unit coverage for the bootstrap/doctor path.
- Python-aware CI: GitHub Actions workflow now runs bootstrap, doctor, the default suite, and mix test --only python_integration so bridge coverage is validated when the doctor passes.
- New documentation: README + README_TESTING describe the make bootstrap → mix snakepit.doctor → mix test workflow, explain how to run python integration tests, and highlight the new Mix tasks.
- Lifecycle config & memory recycling: Added %Snakepit.Worker.LifecycleConfig{} to capture adapter/profile/env data for every worker, wired Snakepit.GRPCWorker to answer :get_memory_usage, and extended lifecycle tests so TTL/request/memory recycling use the same canonical config.
- Binary tool parameters: Snakepit.GRPC.BridgeServer, Snakepit.GRPC.Client, and Snakepit.GRPC.ClientImpl now decode/forward ExecuteToolRequest.binary_parameters, exposing binaries to local tools as {:binary, payload} while sending the untouched map to Python workers. README.md and README_GRPC.md document the contract.
- Worker-flow integration test: New Snakepit.Pool.WorkerFlowIntegrationTest exercises the WorkerSupervisor → MockGRPCWorker path, ensuring registry/process tracking stays consistent after execution and crash/restart flows.
- Randomized worker stress test: Snakepit.Pool.RandomWorkerFlowTest throws randomized execute/kill sequences at pools to ensure Registry ↔ ProcessRegistry invariants hold under churn.
Changed
- Test gating: Default `mix test` excludes `:python_integration` while Python-heavy suites (thread profile, session affinity, streaming regression, etc.) carry the tag; `test/unit/exunit_configuration_test.exs` locks the config in place.
- Thread-profile test harness: `Snakepit.ThreadProfilePython313Test` now uses `Snakepit.Test.PythonEnv.skip_unless_python_313/1` to skip cleanly when `.venv-py313` is unavailable.
- Process killer regression: Ports spawned during `kill_by_run_id/1` tests close via `safe_close_port/1`, eliminating `:port_close` race exceptions.
- Queue saturation regression: `Snakepit.Pool.QueueSaturationRuntimeTest` focuses on stats + agent tracking instead of brittle global ETS assertions, removing a common source of flaky failures.
- gRPC generation script: `priv/python/generate_grpc.sh` now prefers `.venv/bin/python3`, falling back to system `python3`/`python` only when the virtualenv is missing, and emits helpful logs when no interpreter is found.
- Registry metadata semantics: `Snakepit.GRPCWorker` now writes canonical metadata (`worker_module`, `pool_name`, `pool_identifier`) via `Snakepit.Pool.Registry.put_metadata/2`, unblocking pool-name extraction and worker-module discovery without parsing IDs. Tests cover PID→worker lookups.
- LifecycleManager internals: Tracking records store lifecycle structs instead of ad-hoc maps so replacement workers inherit adapter args/env, and memory thresholds now exercise the worker call path in tests.
- Process cleanup safety: Rogue process cleanup only targets commands containing `grpc_server.py`/`grpc_server_threaded.py` with `--snakepit-run-id`/`--run-id` flags, and operators can disable the sweep with `config :snakepit, :rogue_cleanup, enabled: false`. Docs explain the ownership contract.
- Pool integration coverage: Replaced the unstable `test/snakepit/pool/high_risk_flow_test.exs` harness with targeted unit-level integration coverage (WorkerSupervisor + MockGRPCWorker), keeping the suite reliable while still covering the critical registry/ProcessRegistry chain.
- Worker profile metadata lookup: Process/thread profiles now resolve worker modules via `Pool.Registry.get_worker_id_by_pid/1` + metadata lookup, so non-GRPC workers can be supported and Dialyzer warnings are gone.
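
The "Process cleanup safety" rule above can be pictured as a simple predicate over a process command line. The following is a minimal Python sketch of that idea, not the Elixir implementation; the function name `is_cleanup_candidate` and the decision to accept both `--flag value` and `--flag=value` spellings are assumptions made for illustration.

```python
import shlex

# Script and flag names are the ones quoted in the changelog entry above.
SERVER_SCRIPTS = ("grpc_server.py", "grpc_server_threaded.py")
RUN_ID_FLAGS = ("--snakepit-run-id", "--run-id")


def is_cleanup_candidate(command_line: str) -> bool:
    """Return True only for commands that look like Snakepit-owned gRPC servers.

    Hypothetical helper: both conditions must hold, so an unrelated Python
    process is never selected for cleanup.
    """
    args = shlex.split(command_line)
    runs_server = any(arg.endswith(SERVER_SCRIPTS) for arg in args)
    # Accept both "--run-id abc" and "--run-id=abc" spellings (an assumption).
    has_run_id = any(arg.split("=", 1)[0] in RUN_ID_FLAGS for arg in args)
    return runs_server and has_run_id
```

Requiring both the script name and a run-id flag is what keeps the sweep away from other Python workloads on a shared host.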
Fixed
- Shell instrumentation around bootstrap (reporting command start/finish and verbose pip output) prevents "silent hangs" and surfaced the root causes of previous provisioning confusion.
- `scripts/setup_test_pythons.sh` now runs under `set -x`, streaming its progress during bootstrap.
- Rogue cleanup tests verify we no longer kill unrelated Python processes, and docs call out the run-id requirements so multi-tenant hosts stay safe.
0.6.7 - 2025-10-28
Added
Phase 1: Type System MVP + Performance
- Up to 6x JSON performance boost: Integrated `orjson` for Python serialization, delivering 4-6x speedup for raw JSON operations and 1.5x improvement for large payloads (`priv/python/snakepit_bridge/serialization.py`, `priv/python/tests/test_orjson_integration.py`).
- Structured error type: New `Snakepit.Error` struct provides detailed context for debugging with fields including `category`, `message`, `details`, `python_traceback`, and `grpc_status` (`lib/snakepit/error.ex`, `test/unit/error_test.exs`).
- Complete type specifications: All public API functions in the `Snakepit` module now have `@spec` annotations with structured error return types for better IDE support and Dialyzer analysis.
- Performance benchmarks: Comprehensive benchmark suite validates 4-6x raw JSON speedup and verifies no regression on small payloads (`priv/python/tests/test_orjson_integration.py`).
Phase 2: Distributed Telemetry System
- Bidirectional telemetry streaming: Python workers can now emit telemetry events via gRPC that are re-emitted as Elixir `:telemetry` events for unified observability (`lib/snakepit/telemetry/grpc_stream.ex`, `priv/python/snakepit_bridge/telemetry/`).
- Complete event catalog: 43 telemetry events across 3 layers (Infrastructure, Python Execution, gRPC Bridge) with atom-safe event names to prevent atom table exhaustion (`lib/snakepit/telemetry/naming.ex`, `docs/20251028/telemetry/01_EVENT_CATALOG.md`).
- Python telemetry API: High-level Python API with `telemetry.emit()` for events and `telemetry.span()` for automatic timing, plus correlation ID propagation across the Elixir/Python boundary (`priv/python/snakepit_bridge/telemetry/__init__.py`).
- Runtime telemetry control: Adjust sampling rates, enable/disable telemetry, and filter events for individual workers without restarts (`lib/snakepit/telemetry/control.ex`).
- Metadata safety: Automatic sanitization of Python metadata to prevent atom table exhaustion from untrusted string keys (`lib/snakepit/telemetry/safe_metadata.ex`).
- Multiple backend support: Python telemetry supports gRPC streaming (default) and stderr backends, with extensible backend architecture (`priv/python/snakepit_bridge/telemetry/backends/`).
- Worker lifecycle hooks: Automatic telemetry stream registration/unregistration integrated into worker lifecycle (`lib/snakepit/grpc_worker.ex:479`, `lib/snakepit/grpc_worker.ex:783`).
- Integration tests: Comprehensive test suite covering event catalog, validation, sanitization, and control messages (`test/integration/telemetry_flow_test.exs`).
Changed
- Python serialization now uses `orjson` with graceful fallback to stdlib `json` if orjson is unavailable, maintaining full backward compatibility.
- Error returns in `Snakepit.Pool` and `Snakepit` modules now use structured `Snakepit.Error` types with detailed context instead of atoms.
- `Snakepit.Pool.await_ready/2` now returns `{:error, %Snakepit.Error{category: :timeout}}` instead of `{:error, :timeout}`.
- Streaming validation errors now include adapter context in error details.
- Old `telemetry.span()` (OpenTelemetry) renamed to `telemetry.otel_span()` to avoid naming conflict with new telemetry streaming span.
- `Snakepit.Application` supervision tree now includes `Snakepit.Telemetry.GrpcStream` for managing bidirectional telemetry streams.
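
The `orjson` fallback mentioned in this list follows a common optional-import pattern. Here is a minimal sketch of that pattern, assuming nothing about the actual contents of `serialization.py`; the `dumps`/`loads` wrapper names are illustrative.

```python
import json

try:
    import orjson  # optional accelerator; may not be installed
except ImportError:
    orjson = None


def dumps(obj) -> bytes:
    """Serialize to UTF-8 JSON bytes, using orjson when it is importable."""
    if orjson is not None:
        return orjson.dumps(obj)
    # stdlib fallback: compact separators keep the output close to orjson's
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def loads(data):
    """Parse JSON from bytes or str with whichever backend is available."""
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)
```

Callers see the same bytes-in, object-out contract either way, which is what keeps the fallback backward compatible.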
Fixed
- Updated Dialyzer type specifications to match new structured error returns, reducing type warnings.
- Corrected `grpc_worker.ex` metadata fields for telemetry events (`state.stats.start_time`, `state.stats.requests`).
Documentation
- New `TELEMETRY.md`: Complete user guide for the distributed telemetry system with usage examples, integration patterns for Prometheus/StatsD/OpenTelemetry, and troubleshooting guidance (320 lines).
- Telemetry design docs: 9 comprehensive design documents covering architecture, event catalog, Python integration, client guide, gRPC implementation, and backend architecture (`docs/20251028/telemetry/`).
- New examples: 5 comprehensive examples demonstrating v0.6.7 features with ~50KB of production-ready code:
  - `examples/telemetry_basic.exs` - Introduction to telemetry handlers and Python telemetry API
  - `examples/telemetry_advanced.exs` - Correlation tracking, performance monitoring, runtime control
  - `examples/telemetry_monitoring.exs` - Production monitoring patterns with real-time dashboard
  - `examples/telemetry_metrics_integration.exs` - Prometheus/StatsD integration patterns
  - `examples/structured_errors.exs` - New `Snakepit.Error` struct usage and pattern matching
- Updated `examples/README.md`: Comprehensive guide to all examples with clear learning paths and troubleshooting.
- Updated README.md with v0.6.7 release notes highlighting type system improvements, performance gains, and telemetry system.
- Updated mix.exs version to 0.6.7 with `TELEMETRY.md` in package files and docs extras.
- Added comprehensive test coverage for structured error types (12 new tests in `test/unit/error_test.exs`).
Performance
- Telemetry overhead: <10μs per event, <1% CPU impact at 100% sampling, <0.1% CPU at 10% sampling.
- Bounded resources: Python telemetry queue limited to 1024 events (~100KB), with graceful degradation (drops events vs blocking).
- Zero regression: All 235+ existing tests pass with full backward compatibility maintained.
Zero breaking changes: All existing code continues to work. Telemetry is fully opt-in via standard :telemetry.attach() patterns.
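
The "bounded resources" behaviour above (a fixed-size queue that drops events instead of blocking) can be sketched with the standard library alone. This is an illustration of the technique, not the bridge's actual telemetry backend; `BoundedEventBuffer` and its methods are hypothetical names.

```python
import queue


class BoundedEventBuffer:
    """Hold at most `maxsize` events; drop new ones instead of blocking."""

    def __init__(self, maxsize: int = 1024):
        self._queue = queue.Queue(maxsize=maxsize)
        # Drop counter; approximate if several threads emit at once.
        self.dropped = 0

    def emit(self, event: dict) -> bool:
        """Enqueue an event, returning False if it had to be dropped."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # degrade gracefully: the caller never waits
            return False

    def drain(self) -> list:
        """Remove and return everything currently buffered."""
        events = []
        while True:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                return events
```

Dropping on overflow trades completeness for latency: a slow consumer can lose events, but it can never stall the code path being measured.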
0.6.6 - 2025-10-27
Added
- Configurable session/program quotas now surface tagged errors when limits are exceeded, with regression coverage in `test/unit/bridge/session_store_test.exs`.
- Introduced a logger redaction helper so adapters and bridge code can log sensitive inputs safely (`test/unit/logger/redaction_test.exs`).
Changed
- `Snakepit.GRPC.BridgeServer` reuses worker-owned gRPC channels and only dials a disposable connection when the worker has not yet published one; fallbacks are closed after each invocation.
- gRPC streaming helpers document and enforce the JSON-plus-metadata chunk envelope, clarifying `_metadata` and `raw_data_base64` handling.
- Worker startup handshake waits for the negotiated gRPC port before publishing worker metadata, eliminating transient routing failures during boot.
- `Snakepit.GRPC.ClientImpl` now returns structured `{:error, {:invalid_parameter, :json_encode_failed, message}}` tuples when parameters cannot be JSON-encoded, preventing calling processes from crashing (`test/unit/grpc/client_impl_test.exs`).
- `Snakepit.GRPC.BridgeServer.execute_streaming_tool/2` raises `UNIMPLEMENTED` with remediation guidance so callers can fall back gracefully when streaming is disabled (`test/snakepit/grpc/bridge_server_test.exs`).
Fixed
- `Snakepit.GRPCWorker` persists the OS-assigned port discovered during startup so BridgeServer never receives `0` when routing requests (`test/unit/grpc/grpc_worker_ephemeral_port_test.exs`).
- Parameter decoding now rejects malformed protobuf payloads with descriptive `{:invalid_parameter, key, reason}` errors, preventing unexpected crashes (`test/snakepit/grpc/bridge_server_test.exs`).
- Process registry ETS tables are `:protected` and DETS handles remain private, guarding against external mutation attempts (`test/unit/pool/process_registry_security_test.exs`).
- Pool name inference prefers registry metadata and logs once when falling back to worker-id parsing, eliminating silent misroutes (`test/unit/pool/pool_registry_lookup_test.exs`).
Documentation
- Refreshed README, gRPC guides (including the streaming and quick reference docs), and testing notes to cover port persistence, channel reuse, quota enforcement, DETS/ETS protections, streaming payload envelopes and fallbacks, metadata-driven pool routing, logging redaction guardrails, and the expanded regression suite.
0.6.5 - 2025-10-26
Added
- Regression suites covering worker supervisor stop/restart flows and profile-level shutdown helpers (`test/unit/pool/worker_supervisor_test.exs`, `test/unit/worker_profile/worker_profile_stop_worker_test.exs`).
Changed
- `Snakepit.Application` now reads the current environment from compile-time configuration instead of calling `Mix.env/0`, keeping OTP releases Mix-free.
- Introduced `Snakepit.PythonThreadLimits.resolve/1` to merge partial thread-limit overrides with defaults before applying environment variables.
Fixed
- `Snakepit.Pool.WorkerSupervisor.stop_worker/1` targets worker starter supervisors and accepts either worker ids or pids, ensuring restarts actually decommission the old worker.
- `Snakepit.WorkerProfile.Process` and `Snakepit.WorkerProfile.Thread` resolve worker ids through the pool registry so lifecycle manager shutdowns succeed for pid handles.
0.6.4 - 2025-10-30
Added
- Streaming regression guard in `test/snakepit/streaming_regression_test.exs` covering both success and adapter capability failures
- `examples/stream_progress_demo.exs` showcasing five timed streaming updates with rich progress output
- `test_python.sh` helper that regenerates protobuf stubs, activates the project virtualenv, wires `PYTHONPATH`, and forwards arguments to `pytest`
Changed
- Python gRPC servers now bridge streaming iterators through an `asyncio.Queue`, yielding chunks as soon as they are produced and removing ad-hoc log files
- `Snakepit.Adapters.GRPCPython` consumes streaming chunks incrementally, decoding JSON payloads, surfacing metadata, and safeguarding callback failures
- Showcase `stream_progress` tool accepts `delay_ms` and reports elapsed timing so demos and diagnostics show meaningful pacing
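
Bridging a blocking iterator through an `asyncio.Queue`, as the first entry above describes, might look roughly like the sketch below. It is a generic illustration of the technique rather than the server's code; `bridge_iterator`, `collect`, and the `_DONE` sentinel are names invented here.

```python
import asyncio
import threading

_DONE = object()  # sentinel marking the end of the stream


async def bridge_iterator(blocking_iter):
    """Yield items from a blocking iterator as soon as each one is produced."""
    loop = asyncio.get_running_loop()
    chunks: asyncio.Queue = asyncio.Queue()

    def producer():
        # Runs in a worker thread; hands each item to the event loop safely.
        try:
            for item in blocking_iter:
                loop.call_soon_threadsafe(chunks.put_nowait, item)
        finally:
            # Always signal completion, even if the iterator raised.
            loop.call_soon_threadsafe(chunks.put_nowait, _DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = await chunks.get()
        if item is _DONE:
            break
        yield item


async def collect(blocking_iter):
    """Convenience wrapper that gathers the bridged stream into a list."""
    return [chunk async for chunk in bridge_iterator(blocking_iter)]
```

Because each item is queued the moment the producer yields it, the consumer is not held back until the whole iterator finishes, which is the burst-delivery problem this release addresses. In this sketch an exception raised by the iterator simply ends the stream; a real server would likely propagate it.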
Fixed
- Eliminated burst delivery of streaming responses by ensuring each chunk is forwarded to Elixir immediately, restoring real-time feedback for `execute_stream/4`
0.6.3 - 2025-10-19
Added
- Dependent/Independent Heartbeat Mode - New `dependent` configuration flag allows workers to optionally continue running when Elixir heartbeats fail, enabling debugging scenarios where Python workers should remain alive
- Environment variable-based heartbeat configuration via `SNAKEPIT_HEARTBEAT_CONFIG` for passing settings from Elixir to Python workers
- Python unit test coverage for dependent heartbeat termination behavior (`priv/python/tests/test_heartbeat_client.py`)
- CLI flags `--heartbeat-dependent` and `--heartbeat-independent` for Python gRPC server configuration
Changed
- Default heartbeat enabled state changed from `false` to `true` for better production reliability
- `HeartbeatMonitor` now suppresses worker termination when `dependent: false` is configured, logging warnings instead
- Python `HeartbeatClient` includes default shutdown handler for dependent mode
- `Snakepit.GRPCWorker` passes heartbeat configuration to Python via environment variables
- Updated configuration tests to reflect new heartbeat defaults
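
The difference between dependent and independent mode comes down to what happens once the miss threshold is reached. The sketch below models only that decision; `HeartbeatPolicy`, its field names, the default threshold of 3, and the returned action strings are all hypothetical and do not reflect the real `HeartbeatMonitor` or `HeartbeatClient` interfaces.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatPolicy:
    """Hypothetical policy object; field names and defaults are illustrative."""

    max_missed: int = 3     # assumed threshold, not a documented default
    dependent: bool = True
    missed: int = 0

    def record_success(self) -> None:
        """A successful heartbeat clears the miss counter."""
        self.missed = 0

    def record_miss(self) -> str:
        """Return the action to take after one missed heartbeat."""
        self.missed += 1
        if self.missed < self.max_missed:
            return "ok"
        # Threshold reached: dependent workers stop, independent ones only warn.
        return "terminate" if self.dependent else "warn"
```

With `dependent=False` the same missed heartbeats produce warnings only, which matches the debugging scenario described above where the Python worker should stay alive.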
Fixed
- Heartbeat configuration now properly propagates from Elixir to Python across all code paths
0.6.2 - 2025-10-26
Added
- End-to-end heartbeat regression suite covering monitor boot, timeout handling, and OS-level process cleanup (`test/snakepit/grpc/heartbeat_end_to_end_test.exs`)
- Long-running heartbeat stability test to guard against drift and missed ping accumulation (`test/snakepit/heartbeat_monitor_test.exs`)
- Python-side telemetry regression ensuring outbound metadata preserves correlation identifiers (`priv/python/tests/test_telemetry.py`)
- Deep-dive documentation for the heartbeat and observability stack plus consolidated testing command guide (`docs/20251019/*.md`)
Changed
- `Snakepit.GRPCWorker` now terminates itself whenever the heartbeat monitor exits, preventing pools from keeping unhealthy workers alive
- `make test` preferentially uses the repository’s virtualenv interpreter, exports `PYTHONPATH`, and runs `mix test --color` for consistent local runs
Fixed
- Guard against leaking heartbeat monitors by stopping the worker when the monitor crashes, ensuring registry entries and OS ports are released
0.6.1 - 2025-10-19
Added
- Proactive worker heartbeat monitoring via `Snakepit.HeartbeatMonitor` with configurable cadence, miss thresholds, and per-pool overrides
- Comprehensive telemetry stack: `Snakepit.Telemetry.OpenTelemetry` boot hook, `Snakepit.TelemetryMetrics` Prometheus exporter, and correlation helpers for tracing spans
- Rich gRPC client utilities (`Snakepit.GRPC.ClientImpl`) covering ping, session lifecycle, heartbeats, and streaming tooling
- Python bridge instrumentation (`snakepit_bridge.heartbeat`, `snakepit_bridge.telemetry`) plus new unit tests for telemetry and threaded servers
- Default telemetry/heartbeat configuration shipped in `config/config.exs`, including OTLP environment toggles and Prometheus port selection
- Configurable logging system via the new `Snakepit.Logger` module with centralized control over verbosity (`:debug`, `:info`, `:warning`, `:error`, `:none`)
Changed
- `Snakepit.GRPCWorker` now emits detailed telemetry, manages heartbeats, and wires correlation IDs through tracing spans
- `Snakepit.Application` activates OTLP exporters based on environment variables, registers telemetry reporters alongside pool supervisors, and routes logs through `Snakepit.Logger`
- Python gRPC servers (`grpc_server.py`, `grpc_server_threaded.py`) updated with structured logging, execution metrics, and heartbeat responses
- Examples refreshed with observability storylines, dual-mode telemetry demos, and cleaner default output through `Snakepit.Logger`
- GitHub workflows tightened to reflect new test layout and planning artifacts
- 25+ Elixir modules migrated to `Snakepit.Logger` for consistent log suppression in demos and production
Configuration
- New `:log_level` option under the `:snakepit` application config to control internal logging:

      # config/config.exs
      config :snakepit, log_level: :warning # Options: :debug, :info, :warning, :error, :none
Fixed
- Hardened CI skips for `ApplicationCleanupTest` to avoid nondeterministic BEAM run IDs
- Addressed flaky test ordering through targeted cleanup helpers and telemetry-aware assertions
Documentation
- Major rewrite of `ARCHITECTURE.md`, new `AGENTS.md`, and comprehensive design dossiers for v0.7/v0.8 feature tracks
- Added heartbeat, telemetry, and OTLP upgrade plans under `docs/2025101x/`
- README refreshed with v0.6.1 highlights, logging guidance, installation tips, and observability walkthroughs
Notes
- Existing configurations continue to work with the default `:info` log level
- Log suppression is optional: set `log_level: :debug` to restore verbose output
- Provides cleaner logs for production deployments and demos while retaining full visibility for debugging
0.6.0 - 2025-10-11
Added - Phase 1: Dual-Mode Architecture Foundation
Worker Profile System
- New `Snakepit.WorkerProfile` behaviour for pluggable parallelism strategies
- `Snakepit.WorkerProfile.Process` - Multi-process profile (default, backward compatible)
- `Snakepit.WorkerProfile.Thread` - Multi-threaded profile stub (Phase 2-3 implementation)
- Profile abstraction enables switching between process and thread execution modes
Python Environment Detection
- New `Snakepit.PythonVersion` module for Python version detection
- Automatic detection of Python 3.13+ free-threading support (PEP 703)
- Profile recommendation based on Python capabilities
- Version validation and compatibility warnings
Library Compatibility Matrix
- New `Snakepit.Compatibility` module with thread-safety database
- Compatibility tracking for 20+ popular Python libraries (NumPy, PyTorch, Pandas, etc.)
- Per-library thread safety status, recommendations, and workarounds
- Automatic compatibility checking for thread profile configurations
Configuration System Enhancements
- New `Snakepit.Config` module for multi-pool configuration management
- Support for named pools with different worker profiles
- Backward-compatible legacy configuration conversion
- Comprehensive configuration validation and normalization
- Profile-specific defaults (process vs thread)
Documentation
- Comprehensive v0.6.0 technical plan (8,000+ words)
- GIL removal research and dual-mode architecture design
- Phase-by-phase implementation roadmap (10 weeks)
- Performance benchmarks and migration strategies
Changed
- Architecture Evolution
- Foundation laid for Python 3.13+ free-threading support
- Worker management abstracted to support multiple parallelism models
- Configuration system generalized for multi-pool scenarios
Added - Phase 2: Multi-Threaded Python Worker
Threaded gRPC Server
- New `grpc_server_threaded.py` - Multi-threaded server with ThreadPoolExecutor
- Concurrent request handling via HTTP/2 multiplexing
- Thread safety monitoring with `ThreadSafetyMonitor` class
- Request tracking per thread with performance metrics
- Automatic adapter thread safety validation on startup
- Configurable thread pool size (--max-workers parameter)
Thread-Safe Adapter Infrastructure
- New `base_adapter_threaded.py` - Base class for thread-safe adapters
- `ThreadSafeAdapter` with built-in locking primitives
- `ThreadLocalStorage` manager for per-thread state
- `RequestTracker` for monitoring concurrent requests
- `@thread_safe_method` decorator for automatic tracking
- Context managers for safe lock acquisition
- Built-in statistics and performance monitoring
Example Implementations
- `threaded_showcase.py` - Comprehensive thread-safe adapter example
- Pattern 1: Shared read-only resources (models, configurations)
- Pattern 2: Thread-local storage (caches, buffers)
- Pattern 3: Locked shared mutable state (counters, logs)
- CPU-intensive workloads with NumPy integration
- Stress testing and performance monitoring tools
- Example tools: compute_intensive, matrix_multiply, batch_process, stress_test
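
The three patterns listed above can be combined in one small class. This is a stdlib-only sketch for illustration and is not taken from `threaded_showcase.py`; `SketchAdapter` and `run_concurrently` are hypothetical names.

```python
import threading


class SketchAdapter:
    """Illustrative adapter; not the API of threaded_showcase.py."""

    def __init__(self):
        self.config = {"scale": 2}       # Pattern 1: shared, read-only after init
        self._local = threading.local()  # Pattern 2: per-thread storage
        self._lock = threading.Lock()    # Pattern 3: guards shared mutable state
        self.request_count = 0

    def handle(self, value: int) -> int:
        # Each thread lazily creates its own cache, so no locking is needed here.
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = {}
        if value not in cache:
            cache[value] = value * self.config["scale"]
        # The counter is shared across threads, so updates happen under the lock.
        with self._lock:
            self.request_count += 1
        return cache[value]


def run_concurrently(adapter, threads: int = 8, calls: int = 1000) -> int:
    """Hammer the adapter from several threads and return the final count."""
    def worker():
        for i in range(calls):
            adapter.handle(i % 10)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return adapter.request_count
```

Only the counter needs the lock; the read-only config and the thread-local cache are safe without one, which keeps contention low.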
Thread Safety Validation
- New `thread_safety_checker.py` - Runtime validation toolkit
- Concurrent access detection with detailed warnings
- Known unsafe library detection (Pandas, Matplotlib, SQLite3)
- Thread contention monitoring and analysis
- Performance profiling per thread
- Automatic recommendations for detected issues
- Global checker with strict mode option
Documentation
- New `README_THREADING.md` - Comprehensive threading guide
- Thread safety patterns and best practices
- Writing thread-safe adapters tutorial
- Testing strategies for concurrent code
- Performance optimization techniques
- Library compatibility matrix (20+ libraries)
- Common pitfalls and solutions
- Advanced topics: worker recycling, monitoring, debugging
Added - Phase 3: Elixir Thread Profile Integration
Complete ThreadProfile Implementation
- Full implementation of `Snakepit.WorkerProfile.Thread`
- Worker capacity tracking via ETS table (`:snakepit_worker_capacity`)
- Atomic load increment/decrement for thread-safe capacity management
- Support for concurrent requests to same worker (HTTP/2 multiplexing)
- Automatic script selection (threaded vs standard gRPC server)
Worker Capacity Management
- ETS-based capacity tracking: `{worker_pid, capacity, current_load}`
- Atomic operations for thread-safe load updates
- Capacity checking before request execution
- Automatic load decrement after request completion (even on error)
- Real-time capacity monitoring via `get_capacity/1` and `get_load/1`
Adapter Configuration Enhancement
- Updated `GRPCPython.script_path/0` to select correct server variant
- Automatic detection of threaded mode from adapter args
- Seamless switching between process and thread servers
- Enhanced argument merging for user customization
Load Balancing
- Capacity-aware worker selection
- Prevents over-subscription of workers
- Returns `:worker_at_capacity` when no slots available
- Automatic queueing handled by pool layer
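
Capacity-aware selection can be illustrated with a small in-memory table. The sketch below is a Python analogue of the idea, with a lock standing in for ETS atomic updates; the `CapacityTable` class, its method names, and the least-loaded selection rule are assumptions, not the Elixir implementation.

```python
import threading


class CapacityTable:
    """Track [capacity, current_load] per worker, like the ETS rows above."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for ETS atomic counter updates

    def register(self, worker_id: str, capacity: int) -> None:
        with self._lock:
            self._rows[worker_id] = [capacity, 0]

    def checkout(self):
        """Reserve a slot on the least-loaded worker with spare capacity."""
        with self._lock:
            free = [(load, wid) for wid, (cap, load) in self._rows.items() if load < cap]
            if not free:
                return "worker_at_capacity"
            _, worker_id = min(free)
            self._rows[worker_id][1] += 1
            return worker_id

    def checkin(self, worker_id: str) -> None:
        """Release a slot; call this in a finally block so errors release too."""
        with self._lock:
            row = self._rows[worker_id]
            row[1] = max(0, row[1] - 1)

    def get_load(self, worker_id: str) -> int:
        with self._lock:
            return self._rows[worker_id][1]
```

Checking and incrementing under the same lock is what prevents two callers from both taking the last free slot.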
Example Demonstration
- New `examples/threaded_profile_demo.exs` - Interactive demo script
- Shows configuration patterns for threaded mode
- Explains concurrent request handling
- Demonstrates capacity management
- Performance monitoring examples
Added - Phase 4: Worker Lifecycle Management
LifecycleManager GenServer
- New `Snakepit.Worker.LifecycleManager` - Automatic worker recycling
- TTL-based recycling (configurable: seconds/minutes/hours/days)
- Request-count based recycling (recycle after N requests)
- Memory threshold recycling (optional, requires worker support)
- Periodic health checks (every 5 minutes)
- Graceful worker replacement with zero downtime
Worker Tracking Infrastructure
- Automatic worker registration on startup
- Per-worker metadata tracking (start time, request count, config)
- Process monitoring for crash detection
- Lifecycle statistics and reporting
Recycling Logic
- Configurable TTL: `{3600, :seconds}`, `{1, :hours}`, etc.
- Max requests: `worker_max_requests: 1000`
- Memory threshold: `memory_threshold_mb: 2048` (optional)
- Manual recycling: `LifecycleManager.recycle_worker(pool, worker_id)`
- Automatic replacement after recycling
Request Counting
- Automatic increment after successful request
- Per-worker request tracking
- Triggers recycling at configured threshold
- Integrated with Pool's execute path
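
The recycling rules above (TTL, request count, optional memory threshold) reduce to a small decision function. The following Python sketch mirrors the `{value, unit}` TTL shape as a tuple; the function names and the returned reason strings are hypothetical and are not the values the LifecycleManager emits.

```python
# Seconds per unit, covering the units named in the changelog.
_UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def ttl_seconds(ttl) -> int:
    """Convert a (value, unit) pair such as (1, "hours") into seconds."""
    value, unit = ttl
    return value * _UNIT_SECONDS[unit]


def recycle_reason(uptime_s, request_count, memory_mb=None, *,
                   ttl=None, max_requests=None, memory_threshold_mb=None):
    """Return why a worker should be recycled, or None if it can keep running.

    Each limit is optional; a limit left as None is simply not checked.
    """
    if ttl is not None and uptime_s >= ttl_seconds(ttl):
        return "ttl_expired"
    if max_requests is not None and request_count >= max_requests:
        return "max_requests"
    if (memory_threshold_mb is not None and memory_mb is not None
            and memory_mb >= memory_threshold_mb):
        return "memory_threshold"
    return None
```

For example, `recycle_reason(10, 1000, max_requests=1000)` reports the request limit, while a worker under every configured limit yields `None`.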
Telemetry Events
- `[:snakepit, :worker, :recycled]` - Worker recycled with reason
- `[:snakepit, :worker, :health_check_failed]` - Health check failure
- Rich metadata (worker_id, pool, reason, uptime, request_count)
- Integration with Prometheus, LiveDashboard, custom monitors
Documentation
- New `docs/telemetry_events.md` - Complete telemetry reference
- Event schemas and metadata descriptions
- Usage examples for monitoring systems
- Prometheus and LiveDashboard integration patterns
- Best practices and debugging tips
Supervisor Integration
- LifecycleManager added to application supervision tree
- Positioned after WorkerSupervisor, before Pool
- Automatic startup with pooling enabled
- Clean shutdown handling
Changed - Phase 4
GRPCWorker Enhanced
- Workers now register with LifecycleManager on startup
- Lifecycle config passed during initialization
- Untracking on worker shutdown
Pool Enhanced
- Request counting integrated into execute path
- Automatic notification to LifecycleManager on success
- Supports lifecycle management without modifications to existing flow
Added - Phase 5: Enhanced Diagnostics and Monitoring
ProfileInspector Module
- New `Snakepit.Diagnostics.ProfileInspector` - Programmatic pool inspection
- Functions for pool statistics, capacity analysis, and memory usage
- Profile-aware metrics for both process and thread pools
- `get_pool_stats/1` - Comprehensive pool statistics
- `get_capacity_stats/1` - Capacity utilization and thread info
- `get_memory_stats/1` - Memory usage breakdown per worker
- `get_comprehensive_report/0` - All pools analysis
- `check_saturation/2` - Capacity warning system
- `get_recommendations/1` - Intelligent optimization suggestions
Mix Task: Profile Inspector
- New `mix snakepit.profile_inspector` - Interactive pool inspection tool
- Text and JSON output formats
- Detailed per-worker statistics with `--detailed` flag
- Pool-specific inspection with `--pool` option
- Optimization recommendations with `--recommendations` flag
- Color-coded utilization indicators (🔴🟡🟢⚪)
- Profile-specific insights (process vs thread)
Enhanced Scaling Diagnostics
- Extended `mix diagnose.scaling` with profile-aware analysis
- New TEST 0: Pool Profile Analysis
- Thread pool vs process pool comparison
- Capacity utilization monitoring
- Profile-specific recommendations
- System-wide optimization opportunities
- Real-time pool statistics integration
Telemetry Events
- `[:snakepit, :pool, :saturated]` - Pool queue at max capacity
  - Measurements: `queue_size`, `max_queue_size`
  - Metadata: `pool`, `available_workers`, `busy_workers`
- `[:snakepit, :pool, :capacity_reached]` - Worker reached capacity (thread profile)
  - Measurements: `capacity`, `load`
  - Metadata: `worker_pid`, `profile`, `rejected` (optional)
- `[:snakepit, :request, :executed]` - Request completed with duration
  - Measurements: `duration_us` (microseconds)
  - Metadata: `pool`, `worker_id`, `command`, `success`
Diagnostic Features
- Worker memory usage tracking per process
- Thread pool utilization analysis
- Capacity saturation warnings
- Profile-appropriate recommendations
- Performance duration tracking
- Queue depth monitoring
Status
- Phase 1 ✅ Complete - Foundation modules and behaviors defined
- Phase 2 ✅ Complete - Multi-threaded Python worker implementation
- Phase 3 ✅ Complete - Elixir thread profile integration
- Phase 4 ✅ Complete - Worker lifecycle management and recycling
- Phase 5 ✅ Complete - Enhanced diagnostics and monitoring
- Phase 6 🔄 Pending - Documentation and examples
Notes
- No Breaking Changes: All v0.5.1 configurations remain fully compatible
- Thread Profile: Stub implementation (returns `:not_implemented`) until Phase 2-3
- Default Behavior: Process profile remains default for maximum stability
- Python 3.13+: Free-threading support enables true multi-threaded workers
- Migration: Existing code requires zero changes to continue working
0.5.1 - 2025-10-11
Added
Diagnostic Tools
- New `mix diagnose.scaling` task for comprehensive bottleneck analysis
- Captures resource metrics (ports, processes, TCP connections, memory usage)
- Enhanced error logging with port buffer drainage
Configuration Enhancements
- Explicit gRPC port range constraint documentation and validation
- Batched worker startup configuration (`startup_batch_size: 8`, `startup_batch_delay_ms: 750`)
- Resource limit safeguards with `max_workers: 1000` hard limit
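
Batched startup with the settings above amounts to chunking the worker list and pausing between chunks. The sketch below shows that shape in Python; it starts workers sequentially within a batch for simplicity, whereas the pool itself may start each batch concurrently, and the function names are illustrative.

```python
import time


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def start_in_batches(worker_ids, start_worker, batch_size=8, delay_ms=750,
                     sleep=time.sleep):
    """Start workers batch by batch, pausing between batches.

    `start_worker` is a caller-supplied callable; `sleep` is injectable so the
    pacing can be observed in tests without waiting.
    """
    started = []
    chunks = batches(list(worker_ids), batch_size)
    for index, chunk in enumerate(chunks):
        started.extend(start_worker(wid) for wid in chunk)
        if index < len(chunks) - 1:  # no pause after the final batch
            sleep(delay_ms / 1000)
    return started
```

Spacing the batches out limits how many interpreters initialize at once, which is the resource spike this release is guarding against.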
Changed
Worker Pool Scaling Improvements
- Pool now reliably scales to 250+ workers (previously limited to ~105)
- Resolved thread explosion during concurrent startup (fixed "fork bomb" issue)
- Dynamic port allocation using OS-assigned ports (port=0) eliminates port collision races
- Batched worker startup prevents system resource exhaustion during concurrent initialization
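
OS-assigned ports work by binding to port 0 and then asking the socket which port it received. A minimal sketch with the standard `socket` module (the helper name `bind_ephemeral` is made up for this example):

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1"):
    """Bind to port 0 and return the socket plus the port the OS picked."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the kernel for any free port
    sock.listen()
    return sock, sock.getsockname()[1]
```

Because the kernel hands out the port at bind time, two workers starting at the same moment cannot be given the same one, which removes the collision race that a pre-computed port range allows.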
Performance Optimizations
- Aggressive thread limiting via environment variables for optimal pool-level parallelism:
  - `OPENBLAS_NUM_THREADS=1` (numpy/scipy)
  - `OMP_NUM_THREADS=1` (OpenMP)
  - `MKL_NUM_THREADS=1` (Intel MKL)
  - `NUMEXPR_NUM_THREADS=1` (NumExpr)
  - `GRPC_POLL_STRATEGY=poll` (single-threaded)
- Increased GRPC server connection backlog to 512
- Extended worker ready timeout to 30s for large pools
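
One way to picture the thread-limiting setup is as an environment dictionary built before each worker is spawned. The sketch below uses the variable values listed above; the `worker_env` helper and the rule that explicit overrides win are assumptions for illustration.

```python
import os

# Values copied from the changelog entry above.
THREAD_LIMIT_DEFAULTS = {
    "OPENBLAS_NUM_THREADS": "1",
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "NUMEXPR_NUM_THREADS": "1",
    "GRPC_POLL_STRATEGY": "poll",
}


def worker_env(overrides=None, base=None):
    """Build a child-process environment with the thread limits applied."""
    env = dict(os.environ if base is None else base)
    env.update(THREAD_LIMIT_DEFAULTS)
    env.update(overrides or {})  # explicit per-pool overrides win (assumed)
    return env
```

The result can be passed as `env=` to `subprocess.Popen`. A CPU-heavy pool could pass something like `{"OPENBLAS_NUM_THREADS": "4"}` and run fewer workers, in line with the trade-off described in the Notes for this release.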
Configuration Updates
- Increased `port_range` to 1000 (accommodates `max_workers`)
- Enhanced configuration comments explaining each tuning parameter
- Resource usage tracking during pool initialization
Fixed
Concurrent Startup Issues
- Fixed "Cannot fork" / EAGAIN errors from thread explosion during worker spawn
- Eliminated port collision races with dynamic port allocation
- Resolved fork bomb caused by Python scientific libraries spawning excessive threads (6,000+ threads from OpenBLAS, gRPC, MKL)
Resource Management
- Better port binding error handling in Python gRPC server
- Improved error diagnostics during pool initialization
- Enhanced connection management in GRPC server
Performance
- Successfully tested with 250 workers (2.5x previous limit)
- Startup time increases with pool size (~60s for 250 workers vs ~10s for 100 workers)
- Eliminated port collision races and fork resource exhaustion
- Dynamic port allocation provides reliable scaling
Notes
- Thread limiting optimizes for high concurrency with many small tasks
- CPU-intensive workloads that perform heavy numerical computation within a single task may need different threading configuration
- For computationally intensive per-task workloads, consider:
- Workload-specific environment variables passed per task
- Separate worker pools with different threading profiles
- Dynamic thread limit adjustment based on task type
- Allowing higher OpenBLAS threads but reducing max_workers accordingly
- See commit dc67572 for detailed technical analysis and future considerations
0.5.0 - 2025-10-10
Added
Process Management & Lifecycle
- New `Snakepit.RunId` module for unique process run identification with nanosecond precision
- New `Snakepit.ProcessKiller` module for robust OS-level process cleanup with SIGTERM/SIGKILL escalation
- Enhanced `ProcessRegistry` with run_id tracking and improved cleanup logic
- Added `scripts/setup_python.sh` for automated Python environment setup
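
SIGTERM/SIGKILL escalation means asking a process to exit, waiting for a grace period, and forcing it only if it is still alive. The sketch below shows the pattern for a child started with `subprocess`; it is a Python analogue, not `Snakepit.ProcessKiller`, and the `stop_process` name and 2-second default are assumptions.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_s: float = 2.0) -> int:
    """Ask a child to exit, then force-kill it if it ignores the request."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()
```

The grace period gives well-behaved workers a chance to flush state and close ports before the hard kill.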
Test Infrastructure Improvements
- Added comprehensive Supertester refactoring plan (SUPERTESTER_REFACTOR_PLAN.md)
- Phase 1 foundation updates complete with TestableGenServer support
- New `assert_eventually` helper for polling conditions without Process.sleep
- Enhanced test documentation and baseline establishment
- New worker lifecycle tests for process management validation
- New application cleanup tests with run_id integration
Python Cleanup & Testing
- Created Python test infrastructure with `test_python.sh` script
- Added comprehensive SessionContext test suite (15 tests)
- Created Elixir integration tests for Python SessionContext (9 tests)
- Python cleanup summary documentation (PYTHON_CLEANUP_SUMMARY.md)
- Enhanced Python gRPC server with improved process management and signal handling
Documentation
- Phase 1 completion report with detailed test results
- Python cleanup and testing infrastructure summary
- Enhanced test planning and refactoring documentation
- Added comprehensive process management design documents (robust_process_cleanup_with_run_id.md)
- Added implementation summaries and debugging session reports
- New production deployment checklist (PRODUCTION_DEPLOYMENT_CHECKLIST.md)
- New example status documentation (EXAMPLE_STATUS_FINAL.md)
- Enhanced README with new icons and improved organization
- Added README_GRPC.md and README_BIDIRECTIONAL_TOOL_BRIDGE.md
- Created docs/archive/ structure for historical analysis and design documents
Assets & Branding
- Added 29 new SVG icons for documentation (architecture, binary, book, bug, chart, etc.)
- New snakepit-icon.svg for branding
- Enhanced visual documentation throughout
Changed
Process Management Improvements
- `ApplicationCleanup` rewritten with run_id-based cleanup strategy
- `GRPCWorker` enhanced with run_id tracking and improved termination handling
- `ProcessRegistry` optimized cleanup from O(n) to O(1) operations using run_id
- Enhanced `GRPCPython` adapter with run_id support
Code Cleanup
- Removed dead Python code
- Deleted obsolete backup files and unused modules
- Streamlined Python SessionContext
- Cleaned up test infrastructure and removed duplicate code
- Archived ~60 historical documentation files to docs/archive/
Examples Refactoring
- Simplified grpc_streaming_demo.exs
- Refactored grpc_advanced.exs for better clarity
- Enhanced grpc_sessions.exs with improved structure
- Streamlined grpc_streaming.exs
- Improved grpc_concurrent.exs with better patterns
Test Coverage
- Increased total test coverage from 27 to 51 tests (+89%)
- 37 Elixir tests passing (27 + 9 new integration tests + 1 new helper test)
- 15 Python SessionContext tests passing
- Enhanced test helpers with improved synchronization and cleanup
Build Configuration
- Enhanced mix.exs with expanded documentation and package metadata
- Updated dependencies and build configurations
Removed
DSPy Integration (as announced in v0.4.3)
- Removed deprecated `dspy_integration.py` module
- Removed deprecated `types.py` with VariableType enum
- Removed `session_context.py.backup`
- Removed obsolete `test_server.py`
- Removed unused CLI directory referencing non-existent modules
- All `__pycache__/` directories cleaned up
Variables Feature (Temporary Removal)
- Removed incomplete variables implementation pending future redesign:
  - lib/snakepit/bridge/variables.ex
  - lib/snakepit/bridge/variables/variable.ex
  - lib/snakepit/bridge/variables/types.ex
  - All variable type modules (boolean, choice, embedding, float, integer, module, string, tensor)
  - examples/grpc_variables.exs
  - lib/snakepit_showcase/demos/variables_demo.ex
  - Related test files and Python code
Deprecated Components
- Removed lib/snakepit/bridge/serialization.ex
- Removed lib/snakepit/grpc/stream_handler.ex
- Removed integration test infrastructure (test/integration/ directory)
- Removed property-based tests pending refactor
- Removed session and serialization tests pending redesign
Fixed
Process Cleanup & Lifecycle
- Fixed race conditions in worker cleanup and termination
- Improved OS-level process cleanup with proper signal handling
- Enhanced DETS cleanup with run_id-based identification
- Fixed test flakiness with improved synchronization
gRPC & Session Management
- Improved session initialization and cleanup in Python gRPC server
- Enhanced error handling in bidirectional tool bridge
- Better isolation between test runs
Test Infrastructure
- Isolation level configuration documented (staying with :basic until test refactoring)
- Resolved test infrastructure conflicts between manual cleanup and Supertester's automatic cleanup
- Enhanced debugging capabilities for test failures
Notes
- Breaking Changes:
- DSPy integration fully removed (deprecated in v0.4.3)
- Variables feature temporarily removed pending redesign
- Users must migrate to DSPex for DSPy functionality (see v0.4.3 migration guide)
- Test suite reliability improved with better synchronization patterns
- Foundation laid for full Supertester conformance in future releases
- Process management significantly improved with run_id tracking system
- Documentation reorganized with archive structure for historical content
0.4.3 - 2025-10-07
Deprecated
- DSPy Integration (snakepit_bridge.dspy_integration)
  - Deprecated in favor of DSPex-native integration
  - Will be removed in v0.5.0
  - Deprecation warnings added to all DSPy-specific classes:
    - VariableAwarePredict
    - VariableAwareChainOfThought
    - VariableAwareReAct
    - VariableAwareProgramOfThought
    - ModuleVariableResolver
    - create_variable_aware_program()
  - See migration guide: https://github.com/nshkrdotcom/dspex/blob/main/docs/architecture_review_20251007/04_DECOUPLING_PLAN.md
Changed
- VariableAwareMixin docstring updated to emphasize generic applicability
- Clarified it's generic, not DSPy-specific
- Can be used with any Python library (scikit-learn, PyTorch, Pandas, etc.)
Documentation
- Added prominent deprecation notice to README
- Added migration guide for DSPex users
- Clarified architectural boundaries (Snakepit = infrastructure, DSPex = domain)
- Added comprehensive architecture review documents
Notes
- No breaking changes - existing code continues to work with deprecation warnings
- Core Snakepit functionality unaffected
- Non-DSPy users unaffected
- Deprecation period: 3-6 months before removal in v0.5.0
0.4.2 - 2025-10-07
Fixed
- DETS accumulation bug - Fixed ProcessRegistry indefinite growth (1994+ stale entries cleaned up)
- Session creation race condition - Implemented atomic session creation with :ets.insert_new to eliminate concurrent initialization errors
- Resource cleanup race condition - Fixed wait_for_worker_cleanup to check actual resources (port availability + registry cleanup) instead of dead Elixir PID
- Test cleanup race condition - Added proper error handling in test teardown for already-stopped workers
- ExDoc warnings - Fixed documentation references by moving INSTALLATION.md to guides/ and adding to ExDoc extras
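
The session creation fix above depends on insert-if-absent semantics: :ets.insert_new writes a row only when the key is not already present and reports whether it did so. The sketch below illustrates that pattern in Python. It is an analogy only; the SessionTable class and create_session function are hypothetical names and do not reflect Snakepit's actual implementation.

```python
import threading


class SessionTable:
    """Hypothetical in-memory table with insert-if-absent semantics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}

    def insert_new(self, session_id, session):
        """Store session only if session_id is unused; return True when stored."""
        with self._lock:
            if session_id in self._rows:
                return False
            self._rows[session_id] = session
            return True

    def lookup(self, session_id):
        """Return the stored session, or None if there is none."""
        with self._lock:
            return self._rows.get(session_id)


def create_session(table, session_id):
    """Create the session, or return the one another caller created first."""
    candidate = {"id": session_id, "owner": threading.get_ident()}
    if table.insert_new(session_id, candidate):
        return candidate
    # Lost the race: reuse the existing session instead of failing.
    return table.lookup(session_id)
```

Because the existence check and the write happen as one step, two callers racing to create the same session both end up with the same record rather than one of them overwriting the other.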
Changed
- ApplicationCleanup simplified - Reduced to an emergency-only handler with telemetry
- Worker.Starter documentation - Added comprehensive moduledoc with ADR-001 link explaining external process management rationale
- DETS cleanup optimization - Changed from O(n) per-PID syscalls to O(1) beam_run_id-based cleanup
- Process.alive? filter removed - Eliminated redundant check (Supervisor.which_children already returns alive children only)
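
One way to read the DETS cleanup change is that entries are tagged with the run that created them, so stale entries can be identified by comparing run IDs rather than probing each OS PID. The Python sketch below illustrates that idea under stated assumptions: the RunIndexedRegistry class, its methods, and the sample run IDs are hypothetical, and the real ProcessRegistry is an Elixir module whose internals are not described here.

```python
from collections import defaultdict


class RunIndexedRegistry:
    """Hypothetical registry that groups worker entries by the run that started them."""

    def __init__(self):
        self._by_run = defaultdict(dict)

    def register(self, run_id, worker_id, os_pid):
        """Record a worker under the run that launched it."""
        self._by_run[run_id][worker_id] = os_pid

    def stale_runs(self, current_run_id):
        """Every run other than the current one is stale; no per-PID check needed."""
        return [run_id for run_id in self._by_run if run_id != current_run_id]

    def drop_run(self, run_id):
        """Remove every entry for run_id in one step and return what was removed."""
        return self._by_run.pop(run_id, {})
```

The design choice is that staleness becomes a property of the run, not of each process, so the cost of deciding what to clean no longer grows with the number of recorded PIDs.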
Added
- ADR-001 - Architecture Decision Record documenting Worker.Starter supervision pattern rationale
- External Process Supervision Design - Comprehensive 1074-line design document covering multi-mode architecture
- Issue #2 critical review - Detailed analysis addressing all community feedback concerns
- Performance benchmarks - Added baseline benchmarks showing 1400-1500 ops/sec sustained throughput
- Telemetry in ApplicationCleanup - Added events for tracking orphan detection and emergency cleanup
Removed
- Dead code cleanup - Removed unused/aspirational code:
- Snakepit.Python module (referenced non-existent adapter)
- GRPCBridge adapter (never used)
- Dead Python adapters (dspy_streaming.py, enhanced.py, grpc_streaming.py)
- Redundant helper functions in ApplicationCleanup
- Catch-all rescue clauses (follows "let it crash" philosophy)
Performance
- 100 workers initialize in ~3 seconds (unchanged)
- 1400-1500 operations/second sustained (maintained)
- DETS cleanup now O(1) vs O(n) (significant improvement for large process counts)
Documentation
- Complete installation guide with platform-specific instructions (Ubuntu, macOS, WSL, Docker)
- Marked working vs WIP examples clearly (3 working, 6 aspirational)
- Added comprehensive analysis documents (150KB total)
Testing
- All 139/139 tests passing ✅
- No orphaned processes ✅
- Clean shutdown behavior validated ✅
0.4.1 - 2025-07-24
Added
- New process_text tool - Text processing capabilities with upper, lower, reverse, and length operations
- New get_stats tool - Real-time adapter and system monitoring with memory usage, CPU usage, and system information
- Enhanced ShowcaseAdapter - Added missing tools (adapter_info, echo, process_text, get_stats) for complete tool bridge demonstration
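
As a rough illustration of the four process_text operations listed above, here is a standalone Python sketch. The function signature, the default operation, and the shape of the returned dictionary are assumptions made for the example; the actual tool is a decorated adapter method whose exact interface is not given in this changelog.

```python
def process_text(text, operation="upper"):
    """Apply one of the four operations named in the changelog to text."""
    operations = {
        "upper": str.upper,
        "lower": str.lower,
        "reverse": lambda s: s[::-1],
        "length": len,
    }
    if operation not in operations:
        raise ValueError(f"Unsupported operation: {operation}")
    # Return shape is an assumption chosen for readability.
    return {"operation": operation, "result": operations[operation](text)}
```

Calling process_text("Snake", "reverse") in this sketch yields a dictionary whose result field is "ekanS".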
Fixed
- gRPC tool registration issues - Resolved async/sync mismatch causing UnaryUnaryCall objects to be returned instead of actual responses
- Missing tool errors - Fixed "Unknown tool: adapter_info" and "Unknown tool: echo" errors by implementing missing @tool decorated methods
- Automatic session initialization - Fixed "Failed to register tools: not_found" error by automatically creating sessions before tool registration
- Remote tool dispatch - Implemented complete bidirectional tool execution between Elixir BridgeServer and Python workers
- Async/sync compatibility - Added proper handling for both sync and async gRPC stubs with fallback logic for UnaryUnaryCall objects
Changed
- BridgeServer enhancement - Added remote tool execution capabilities with worker port lookup and gRPC forwarding
- Python gRPC server - Enhanced with automatic session initialization before tool registration
- ShowcaseAdapter refactoring - Expanded tool set to demonstrate full bidirectional tool bridge capabilities
0.4.0 - 2025-07-23
Added
- Complete gRPC bridge implementation with full bidirectional tool execution
- Tool bridge streaming support for efficient real-time communication
- Variables feature with type system (string, integer, float, boolean, choice, tensor, embedding)
- Comprehensive process management and cleanup system
- Process registry with enhanced tracking and orphan detection
- SessionStore with TTL support and automatic expiration
- BridgeServer implementation for gRPC protocol
- StreamHandler for managing gRPC streaming responses
- Telemetry module for comprehensive metrics and monitoring
- MockGRPCWorker and test infrastructure improvements
- Showcase application with multiple demo scenarios
- Binary serialization support for large data (>10KB) with 5-10x performance improvement
- Automatic binary encoding with threshold detection
- Protobuf schema updates with binary fields support
- Tool registration and discovery system
- Elixir tool exposure to Python workers
- Batch variable operations for performance
- Variable watching/reactive updates support
- Heartbeat mechanism for session health monitoring
Changed
- Major refactoring from legacy bridge system to gRPC-only architecture
- Removed all legacy bridge implementations (V1, V2, MessagePack)
- Unified all adapters to use gRPC protocol exclusively
- Worker module completely rewritten for gRPC support
- Pool module enhanced with configurable adapter support
- ProcessRegistry rewritten with improved tracking and cleanup
- Test framework upgraded with Supertester integration
- Examples reorganized and updated for gRPC usage
- Python client library restructured as snakepit_bridge package
- Serialization module now returns 3-tuple {:ok, any_map, binary_data}
- Large tensors and embeddings automatically use binary encoding
- Integration tests updated to use new infrastructure
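
The threshold-based binary encoding described in this release can be sketched as follows. This is a minimal Python illustration, not the project's serializer: the function names, the metadata fields, the little-endian float64 layout, and the choice to measure the threshold on the JSON-encoded size are all assumptions. The pair it returns loosely mirrors the metadata and binary_data parts of the 3-tuple mentioned above.

```python
import json
import struct

# Assumption: "large data (>10KB)" is read here as more than 10 KiB of JSON.
BINARY_THRESHOLD_BYTES = 10 * 1024


def encode_floats(values):
    """Return (metadata, binary_data); binary_data is None for small payloads."""
    as_json = json.dumps(values).encode("utf-8")
    if len(as_json) <= BINARY_THRESHOLD_BYTES:
        return {"encoding": "json", "value": values}, None
    # Large payload: pack as little-endian float64 and keep only a description.
    packed = struct.pack(f"<{len(values)}d", *values)
    return {"encoding": "binary", "dtype": "float64", "count": len(values)}, packed


def decode_floats(metadata, binary_data):
    """Invert encode_floats for either encoding."""
    if metadata["encoding"] == "json":
        return metadata["value"]
    return list(struct.unpack(f"<{metadata['count']}d", binary_data))
```

Small values stay human-readable, while large tensors and embeddings skip text encoding entirely, which is where a speedup of the kind the changelog reports would come from.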
Fixed
- Process cleanup and orphan detection issues
- Worker termination and registry cleanup
- Module redefinition warnings in test environment
- SessionStore TTL validation and expiration timing
- Mock adapter message handling
- Integration test pool timeouts and shutdown
- GitHub Actions deprecation warnings
- Elixir version compatibility in integration tests
Removed
- All legacy bridge implementations (generic_python.ex, generic_python_v2.ex, etc.)
- MessagePack protocol support (moved to gRPC exclusively)
- Old Python bridge scripts (generic_bridge.py, enhanced_bridge.py)
- Legacy session_context.py implementation
- V1/V2 adapter pattern in favor of unified gRPC approach
0.3.3 - 2025-07-20
Added
- Support for custom adapter arguments in gRPC adapter via pool configuration
- Enhanced Python API commands (call, store, retrieve, list_stored, delete_stored) in gRPC adapter
- Dynamic command validation based on adapter type in gRPC adapter
Changed
- GRPCPython adapter now accepts custom adapter arguments through pool_config.adapter_args
- Improved supported_commands/0 to dynamically include commands based on the adapter in use
Fixed
- gRPC adapter now properly supports third-party Python adapters like DSPy integration
0.3.2 - 2025-07-20
Fixed
- Added missing files to the repository
0.3.1 - 2025-07-20
Changed
- Merged MessagePack optimizations into main codebase
- Unified documentation for gRPC and MessagePack features
- Set GenericPythonV2 as default adapter with auto-negotiation
0.3.0 - 2025-07-20
Added
- Complete gRPC bridge implementation with streaming support
- MessagePack serialization protocol support
- Comprehensive gRPC integration documentation and setup guides
- Enhanced bridge documentation and examples
Changed
- Deprecated V1 Python bridge in favor of V2 architecture
- Updated demo implementations to use V2 Python bridge
- Improved gRPC streaming bridge implementation
- Enhanced debugging capabilities and cleanup
Fixed
- Resolved init/1 blocking issues in V2 Python bridge
- General debugging improvements and code cleanup
0.2.1 - 2025-07-20
Fixed
- Eliminated "unexpected message" logs in Pool module by properly handling Task completion messages from Task.Supervisor.async_nolink
0.2.0 - 2025-07-19
Added
- Complete Enhanced Python Bridge V2 Extension implementation
- Built-in type support for Python Bridge V2
- Test rework specifications and improved testing infrastructure
- Commercial refactoring recommendations documentation
Changed
- Enhanced Python Bridge V2 with improved architecture and session management
- Improved debugging capabilities for V2 examples
- Better error handling and robustness in Python Bridge
Fixed
- Bug fixes in Enhanced Python Bridge examples
- Data science example debugging improvements
- General cleanup and code improvements
0.1.2 - 2025-07-18
Added
- Python Bridge V2 with improved architecture and session management
- Generalized Python bridge implementation
- Enhanced session management capabilities
Changed
- Major architectural improvements to Python bridge
- Better integration with external Python processes
0.1.1 - 2025-07-18
Added
- DIAGS.md with comprehensive Mermaid architecture diagrams
- Elixir-themed styling and proper subgraph format for diagrams
- Logo support to ExDoc and hex package
- Mermaid diagram support in documentation
Changed
- Updated configuration to include assets and documentation
- Improved documentation structure and visual presentation
Fixed
- README logo path for hex docs
- Asset organization (moved img/ to assets/)
0.1.0 - 2025-07-18
Added
- Initial release of Snakepit
- High-performance pooling system for external processes
- Session-based execution with worker affinity
- Built-in adapters for Python and JavaScript/Node.js
- Comprehensive session management with ETS storage
- Telemetry and monitoring support
- Graceful shutdown and process cleanup
- Extensive documentation and examples
Features
- Lightning-fast concurrent worker initialization (1000x faster than sequential)
- Session affinity for stateful operations
- Built on OTP primitives (DynamicSupervisor, Registry, GenServer)
- Adapter pattern for any external language/runtime
- Production-ready with health checks and error handling
- Configurable pool sizes and timeouts
- Built-in bridge scripts for Python and JavaScript