Snakepit Architecture Diagrams (v0.4.0+)
This document provides comprehensive Mermaid diagrams that illustrate Snakepit's system architecture, focusing on component relationships, data flow, and operational patterns. These diagrams complement the performance-focused diagrams in DIAGS.md by showing the complete system design.
Diagram Overview
- High-Level System Architecture - Component relationships and communication patterns
- Request Flow Sequence - Step-by-step execution flow with session and variable management
- Supervision Tree - OTP supervision hierarchy for fault tolerance
- State Management Flow - How state is centralized in Elixir while Python workers remain stateless
- Worker Lifecycle - State transitions during worker lifetime
- Variable System Architecture - Class structure for type-safe variable management
- Protocol Message Flow - gRPC message flow between components
- Error Handling & Recovery - How errors are detected and recovered from
These diagrams are essential for understanding how Snakepit achieves its design goals of high performance, fault tolerance, and clean separation of concerns.
1. High-Level System Architecture
```mermaid
graph TB
    subgraph "Elixir Application"
        APP[Application Code]
        POOL[Pool<br/>GenServer]
        WS[WorkerSupervisor<br/>DynamicSupervisor]
        STARTER[Worker.Starter<br/>Supervisor]
        WORKER[GRPCWorker<br/>GenServer]
        SS[SessionStore<br/>GenServer + ETS]
        BS[BridgeServer<br/>gRPC Service]
    end

    subgraph "Python Worker Process"
        GRPC[grpc_server.py<br/>gRPC Service]
        CTX[SessionContext<br/>Cache + Client]
        ADAPTER[User Adapter]
        TOOLS[User Tools]
    end

    APP -->|execute| POOL
    POOL -->|manages| WS
    WS -->|supervises| STARTER
    STARTER -->|monitors| WORKER
    WORKER -->|spawns & gRPC| GRPC
    POOL -->|session ops| SS
    SS -->|stores state| ETS[(ETS Table)]
    GRPC -->|variable ops| BS
    BS -->|reads/writes| SS
    GRPC -->|creates| CTX
    CTX -->|gRPC callbacks| BS
    ADAPTER -->|uses| CTX
    TOOLS -->|registered in| ADAPTER

    style APP fill:#e1f5fe
    style POOL fill:#b3e5fc
    style SS fill:#81d4fa
    style GRPC fill:#fff9c4
    style CTX fill:#fff59d
```

This diagram shows the overall system components and their relationships. Key insight: Python workers are completely stateless - all state management happens in the Elixir SessionStore, enabling easy scaling and crash recovery.
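For orientation, here is a minimal usage sketch of the entry point at the top of the diagram. The call shapes follow Snakepit's public API (`Snakepit.execute` and `Snakepit.execute_in_session`); the command names and arguments are illustrative.

```elixir
# Stateless call: the Pool hands the request to any available worker.
{:ok, result} = Snakepit.execute("ping", %{message: "hello"})

# Session call: the Pool prefers the worker that last served this
# session (session affinity); the session state itself lives in the
# Elixir SessionStore, never in the Python process.
{:ok, result} =
  Snakepit.execute_in_session("user_123", "predict", %{input: [1, 2, 3]})
```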
2. Request Flow Sequence
```mermaid
sequenceDiagram
    participant App as Elixir App
    participant Pool
    participant Worker as GRPCWorker
    participant Python as Python Process
    participant Store as SessionStore

    App->>Pool: execute(command, args, session_id)
    Pool->>Pool: Find worker with session affinity
    Pool->>Worker: Forward request
    Worker->>Python: gRPC ExecuteTool

    alt Variable Access Needed
        Python->>Worker: gRPC GetVariable
        Worker->>Store: get_variable(session_id, name)
        Store-->>Worker: Variable data
        Worker-->>Python: Variable value
    end

    Python->>Python: Execute tool/adapter logic

    alt Variable Update Needed
        Python->>Worker: gRPC SetVariable
        Worker->>Store: update_variable(session_id, name, value)
        Store-->>Worker: Success
        Worker-->>Python: Confirmation
    end

    Python-->>Worker: Execution result
    Worker-->>Pool: Result
    Pool-->>App: Result
```

This sequence shows how variable access patterns work. Python workers call back to Elixir for variable operations, maintaining the stateless design while enabling session-based workflows.
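The Elixir side of this sequence reduces to a handful of SessionStore calls. A sketch, assuming the function names from the class diagram in section 6 (the `Snakepit.Bridge.SessionStore` module path, return shapes, and exact signatures may differ in your version):

```elixir
alias Snakepit.Bridge.SessionStore

# Created before the first request arrives for this session.
{:ok, _session} = SessionStore.create_session("user_123", ttl: 3600)

# The Python worker's "gRPC GetVariable" callback lands here...
{:ok, variable} = SessionStore.get_variable("user_123", "temperature")

# ...and its "gRPC SetVariable" callback lands here.
{:ok, _updated} = SessionStore.update_variable("user_123", "temperature", 0.9)
```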
3. Supervision Tree
```mermaid
graph TD
    APP[Application]
    SUP[Main Supervisor]
    REG[Registries<br/>Pool.Registry<br/>ProcessRegistry<br/>StarterRegistry]
    SS[SessionStore]
    BS[BridgeServer]
    WS[WorkerSupervisor<br/>:one_for_one]
    POOL[Pool]
    CLEANUP[ApplicationCleanup]

    APP -->|starts| SUP
    SUP -->|permanent| REG
    SUP -->|permanent| SS
    SUP -->|permanent| BS
    SUP -->|permanent| WS
    SUP -->|permanent| POOL
    SUP -->|permanent| CLEANUP

    WS -->|dynamic| S1[Worker.Starter 1<br/>:permanent]
    WS -->|dynamic| S2[Worker.Starter 2<br/>:permanent]
    WS -->|dynamic| SN[Worker.Starter N<br/>:permanent]

    S1 -->|transient| W1[GRPCWorker 1]
    S2 -->|transient| W2[GRPCWorker 2]
    SN -->|transient| WN[GRPCWorker N]

    style SUP fill:#ff9800
    style WS fill:#ff9800
    style S1 fill:#ffc107
    style S2 fill:#ffc107
    style SN fill:#ffc107
```

The supervision tree implements the "Permanent Wrapper" pattern, where each Worker.Starter supervises an individual worker. This decouples the Pool from worker restart logic and provides automatic recovery.
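A minimal sketch of the pattern using plain OTP primitives (module names here are illustrative, not Snakepit's exact internals): a `:permanent` Supervisor wraps a `:transient` worker, so crashes are restarted locally without involving the Pool.

```elixir
defmodule MyApp.Worker.Starter do
  # The wrapper itself is :permanent - its parent always restarts it.
  use Supervisor, restart: :permanent

  def start_link(worker_id) do
    Supervisor.start_link(__MODULE__, worker_id)
  end

  @impl true
  def init(worker_id) do
    children = [
      # :transient - restart the worker on crashes, not on clean shutdown.
      %{
        id: {:worker, worker_id},
        start: {MyApp.GRPCWorker, :start_link, [worker_id]},
        restart: :transient
      }
    ]

    # Restarts happen inside this wrapper; the Pool never sees them.
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```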
4. State Management Flow
```mermaid
graph LR
    subgraph "Python Side (Stateless)"
        PY[Python Worker]
        CACHE[SessionContext<br/>Local Cache]
    end

    subgraph "Elixir Side (Stateful)"
        BS[BridgeServer]
        SS[SessionStore]
        ETS[(ETS Table<br/>:read_concurrency)]
    end

    PY -->|"get_variable<br/>set_variable<br/>register_variable"| BS
    BS -->|GenServer calls| SS
    SS -->|atomic ops| ETS
    CACHE -.->|"TTL-based<br/>invalidation"| CACHE
    PY -->|"cache miss"| BS
    BS -->|"variable data"| PY
    PY -->|"cache hit"| CACHE

    style PY fill:#fff9c4
    style CACHE fill:#fff59d
    style SS fill:#81d4fa
    style ETS fill:#4fc3f7
```

This diagram illustrates the key architectural principle: stateless Python workers with centralized Elixir state. The SessionContext provides local caching to reduce gRPC round-trips while maintaining consistency.
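A condensed sketch of this split, assuming a GenServer that owns a public ETS table: reads hit ETS concurrently, while writes are serialized through the owner process. Module and table names are illustrative.

```elixir
defmodule MyApp.SessionStore do
  use GenServer

  @table :snakepit_sessions

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # :read_concurrency optimizes the table for concurrent readers;
    # :public lets callers read without a GenServer round-trip.
    table = :ets.new(@table, [:named_table, :public, :set, read_concurrency: true])
    {:ok, table}
  end

  # Fast path: concurrent reads go straight to ETS.
  def get_variable(session_id, name) do
    case :ets.lookup(@table, {session_id, name}) do
      [{_key, value}] -> {:ok, value}
      [] -> {:error, :not_found}
    end
  end

  # Writes funnel through the owner process for atomicity.
  def update_variable(session_id, name, value) do
    GenServer.call(__MODULE__, {:put, {session_id, name}, value})
  end

  @impl true
  def handle_call({:put, key, value}, _from, table) do
    :ets.insert(@table, {key, value})
    {:reply, :ok, table}
  end
end
```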
5. Worker Lifecycle
```mermaid
stateDiagram-v2
    [*] --> Starting: Pool requests worker
    Starting --> Spawning: WorkerSupervisor starts Starter
    Spawning --> Launching: Starter starts GRPCWorker
    Launching --> Connecting: GRPCWorker spawns Python
    Connecting --> Ready: gRPC connection established

    Ready --> Executing: Receive request
    Executing --> Ready: Complete request

    Ready --> HealthCheck: Periodic check
    HealthCheck --> Ready: Healthy
    HealthCheck --> Reconnecting: Unhealthy
    Reconnecting --> Ready: Reconnected
    Reconnecting --> Crashed: Failed

    Ready --> Stopping: Shutdown request
    Executing --> Crashed: Error
    Crashed --> Restarting: Starter detects
    Restarting --> Launching: Automatic restart
    Stopping --> [*]: Clean shutdown
```

The worker lifecycle emphasizes fault tolerance and automatic recovery. Workers transition through well-defined states with automatic restart via the supervision tree when failures occur.
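The `Ready --> HealthCheck` loop can be driven by a simple timer inside the worker. A sketch, with the interval, message name, and `ping_python/1` helper all hypothetical:

```elixir
defmodule MyApp.GRPCWorker do
  use GenServer

  @health_interval :timer.seconds(30)

  def start_link(state), do: GenServer.start_link(__MODULE__, state)

  @impl true
  def init(state) do
    Process.send_after(self(), :health_check, @health_interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:health_check, state) do
    case ping_python(state) do
      :ok ->
        # Healthy: schedule the next check and return to Ready.
        Process.send_after(self(), :health_check, @health_interval)
        {:noreply, state}

      {:error, _reason} ->
        # Unrecoverable: crash so Worker.Starter restarts us, which
        # re-enters the Launching state in the diagram above.
        {:stop, :health_check_failed, state}
    end
  end

  # Placeholder for a real gRPC ping to the Python process.
  defp ping_python(_state), do: :ok
end
```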
6. Variable System Architecture
```mermaid
classDiagram
    class SessionStore {
        +create_session(id, opts)
        +get_session(id)
        +delete_session(id)
        +register_variable(session_id, name, type, value)
        +get_variable(session_id, name)
        +update_variable(session_id, name, value)
        +list_variables(session_id)
        -cleanup_expired_sessions()
    }

    class Session {
        +id: String
        +variables: Map
        +variable_index: Map
        +programs: Map
        +metadata: Map
        +ttl: Integer
        +created_at: Integer
        +last_accessed: Integer
    }

    class Variable {
        +id: String
        +name: String
        +type: atom
        +value: Any
        +constraints: Map
        +metadata: Map
        +created_at: Integer
        +updated_at: Integer
    }

    class SessionContext {
        +session_id: String
        +stub: BridgeServiceStub
        +strict_mode: bool
        -_cache: Dict
        +register_variable(name, type, value)
        +get_variable(name)
        +update_variable(name, value)
        +__getitem__(name)
        +__setitem__(name, value)
    }

    class CachedVariable {
        +variable: Variable
        +cached_at: datetime
        +ttl: timedelta
        +expired: bool
    }

    SessionStore "1" --> "*" Session : manages
    Session "1" --> "*" Variable : contains
    SessionContext "1" --> "*" CachedVariable : caches
    SessionContext --> SessionStore : gRPC calls
```

The variable system class diagram shows the relationship between Elixir-side storage (SessionStore) and Python-side caching (SessionContext). This design enables type-safe variable management across language boundaries.
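Putting the Elixir half of this diagram to work might look like the sketch below. The `constraints` option and the error shape are assumptions: the class diagram only shows the four-argument `register_variable`, but the `Variable` struct carries a `constraints` map, so validation on write is the natural reading.

```elixir
# Hypothetical example of type-safe registration with constraints.
{:ok, _var} =
  SessionStore.register_variable("user_123", "temperature", :float, 0.7,
    constraints: %{min: 0.0, max: 2.0}
  )

# An out-of-range write is rejected on the Elixir side before any
# Python worker observes it.
{:error, _reason} = SessionStore.update_variable("user_123", "temperature", 5.0)
```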
7. Protocol Message Flow
```mermaid
graph TB
    subgraph "Client Request"
        REQ[ExecuteToolRequest<br/>tool_name, args, session_id]
    end

    subgraph "Python Processing"
        TOOL[Tool Execution]
        VAR_GET[GetVariableRequest]
        VAR_SET[SetVariableRequest]
    end

    subgraph "Elixir Processing"
        STORE[SessionStore Operations]
        SER[Serialization Module]
    end

    subgraph "Response"
        RESP[ExecuteToolResponse<br/>success, result, error]
    end

    REQ -->|protobuf| TOOL
    TOOL -->|need variable| VAR_GET
    VAR_GET -->|protobuf| STORE
    STORE -->|Variable| SER
    SER -->|GetVariableResponse| TOOL
    TOOL -->|update variable| VAR_SET
    VAR_SET -->|protobuf| STORE
    STORE -->|update| SER
    SER -->|SetVariableResponse| TOOL
    TOOL -->|complete| RESP

    style REQ fill:#e8f5e9
    style RESP fill:#e8f5e9
    style TOOL fill:#fff9c4
    style STORE fill:#81d4fa
```

This diagram shows how protobuf messages flow through the system over gRPC. The protocol handles both tool execution and variable management through a unified interface.
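On the Elixir side, each variable message maps to one handler in the BridgeServer. A sketch, assuming the elixir-grpc library and protoc-generated `Snakepit.Bridge.*` modules; the service and message names here are illustrative, not confirmed from the `.proto` file.

```elixir
defmodule MyApp.BridgeServer do
  # The service module is assumed to be generated from bridge.proto.
  use GRPC.Server, service: Snakepit.Bridge.BridgeService.Service

  # Handles the GetVariableRequest shown in the diagram.
  def get_variable(request, _stream) do
    case SessionStore.get_variable(request.session_id, request.name) do
      {:ok, variable} ->
        %Snakepit.Bridge.GetVariableResponse{variable: serialize(variable)}

      {:error, :not_found} ->
        raise GRPC.RPCError, status: :not_found, message: "unknown variable"
    end
  end

  # Stand-in for the Serialization module in the diagram.
  defp serialize(variable), do: variable
end
```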
8. Error Handling & Recovery
```mermaid
graph TD
    subgraph "Error Sources"
        E1[Python Crash]
        E2[gRPC Timeout]
        E3[Network Error]
        E4[Tool Exception]
    end

    subgraph "Detection"
        MON[Process Monitor]
        HC[Health Check]
        TO[Timeout Handler]
        EH[Error Handler]
    end

    subgraph "Recovery"
        RS[Restart Worker]
        RQ[Requeue Request]
        CB[Circuit Breaker]
        LOG[Error Logging]
    end

    E1 -->|:DOWN message| MON
    E2 -->|catch :exit| TO
    E3 -->|gRPC error| EH
    E4 -->|try/except| EH

    MON -->|Worker.Starter| RS
    HC -->|failed check| RS
    TO -->|Pool handler| RQ
    EH -->|grpc_error_handler| LOG

    RS -->|automatic| OK[Healthy Worker]
    RQ -->|retry logic| OK
    CB -->|threshold| FAIL[Mark Unavailable]

    style E1 fill:#ffcdd2
    style E2 fill:#ffcdd2
    style E3 fill:#ffcdd2
    style E4 fill:#ffcdd2
    style OK fill:#c8e6c9
```

The error handling diagram shows Snakepit's multi-layered approach to fault tolerance. Various detection mechanisms feed into recovery strategies, ensuring system resilience.
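The ":DOWN message" detection path is standard OTP monitoring. A minimal sketch (module name illustrative): detection logs the failure, then gets out of the way so the supervision tree performs the actual restart.

```elixir
defmodule MyApp.WorkerMonitor do
  use GenServer
  require Logger

  def start_link(worker_pid), do: GenServer.start_link(__MODULE__, worker_pid)

  @impl true
  def init(worker_pid) do
    # Monitoring (not linking) means a worker crash arrives as a
    # message instead of taking this process down with it.
    ref = Process.monitor(worker_pid)
    {:ok, %{ref: ref, worker: worker_pid}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, %{ref: ref} = state) do
    # Detection feeds recovery: record the failure, then leave the
    # restart to Worker.Starter in the supervision tree.
    Logger.warning("worker down: #{inspect(reason)}")
    {:stop, :normal, state}
  end
end
```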
Architecture Principles Illustrated
These diagrams demonstrate key architectural principles that make Snakepit production-ready:
- Stateless Workers: Python processes hold no persistent state, enabling easy scaling
- Centralized State: All session data managed in Elixir SessionStore with ETS backing
- Fault Tolerance: Multi-level supervision with automatic recovery
- Performance: Non-blocking operations and concurrent execution throughout
- Type Safety: Structured variable system with validation and constraints
- Protocol Efficiency: Modern gRPC with protobuf for reliable communication
Rendering These Diagrams
To render these diagrams, use any tool that supports Mermaid syntax:
- GitHub/GitLab: Renders automatically in markdown
- Mermaid Live Editor: https://mermaid.live
- VS Code: Install Mermaid extension
- Documentation Tools: MkDocs, Docusaurus, GitBook, etc.
- ExDoc: These diagrams are included in the generated documentation
The diagrams provide visual understanding of how Snakepit achieves its design goals of high performance, fault tolerance, and clean separation between Elixir orchestration and Python execution.