SnmpKit.SnmpLib.ErrorHandler (snmpkit v0.6.4)
Intelligent error handling with retry logic, circuit breakers, and adaptive recovery.
This module provides error handling capabilities designed to improve reliability and performance in production SNMP environments. It is based on patterns proven in high-scale network monitoring systems that handle thousands of devices.
Features
- Exponential Backoff: Intelligent retry timing to avoid overwhelming failing devices
- Circuit Breakers: Automatic failure detection and recovery for unhealthy devices
- Error Classification: Smart categorization of errors for appropriate handling
- Adaptive Timeouts: Dynamic timeout adjustment based on device performance
- Quarantine Management: Temporary isolation of problematic devices
- Recovery Strategies: Multiple approaches for bringing devices back online
Error Classification
Transient Errors (Retryable)
- Network timeouts
- Temporary device overload
- UDP packet loss
- DNS resolution delays
Permanent Errors (Non-retryable)
- Authentication failures
- Unsupported SNMP versions
- Invalid OIDs
- Device configuration errors
Degraded Performance
- Slow response times
- Partial failures
- High error rates
- Resource exhaustion
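The categories above can be pictured as a lookup from error reasons to classes. The following is a minimal sketch only: the module name `ClassifierSketch` and most of the reason atoms are hypothetical, chosen to mirror the lists above, and the library's real classification table may differ. Only `:timeout`, `:authentication_failed` and `:slow_response` appear in this module's own examples.

```elixir
defmodule ClassifierSketch do
  # Hypothetical reason atoms, grouped to mirror the three documented classes.
  @transient [:timeout, :packet_loss, :dns_delay, :device_overloaded]
  @permanent [:authentication_failed, :unsupported_version, :invalid_oid]
  @degraded [:slow_response, :partial_failure, :resource_exhausted]

  def classify(reason) when reason in @transient, do: :transient
  def classify(reason) when reason in @permanent, do: :permanent
  def classify(reason) when reason in @degraded, do: :degraded
  # Unwrap {:error, reason} so a failed result can be passed in directly.
  def classify({:error, reason}), do: classify(reason)
  # Anything unrecognised falls through to :unknown.
  def classify(_other), do: :unknown
end
```

The catch-all clause comes last so that unrecognised reasons are reported as `:unknown` rather than raising.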
Circuit Breaker States
Closed (Normal Operation)
Device is healthy; all operations proceed normally.
Open (Failing)
Device has exceeded the failure threshold; operations are blocked.
Half-Open (Testing)
Limited operations allowed to test device recovery.
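The three states form a small state machine. The sketch below models the transitions as pure functions, assuming a failure counter, a `failure_threshold` and a `recovery_timeout` in milliseconds. The module `BreakerSketch`, its default values and the choice to pass the current time in explicitly are all illustrative assumptions, not the library's implementation, and the `half_open_max_calls` limit is not modelled.

```elixir
defmodule BreakerSketch do
  # Defaults here are placeholders, not the library's documented defaults.
  defstruct state: :closed,
            failures: 0,
            opened_at: nil,
            failure_threshold: 5,
            recovery_timeout: 30_000

  # Decide whether a call may proceed at time `now` (milliseconds).
  # An open breaker moves to half-open once the recovery timeout has elapsed.
  def allow?(%__MODULE__{state: :open} = b, now) do
    if now - b.opened_at >= b.recovery_timeout do
      {true, %{b | state: :half_open}}
    else
      {false, b}
    end
  end

  def allow?(%__MODULE__{} = b, _now), do: {true, b}

  # A success closes the circuit and clears the failure count.
  def record(%__MODULE__{} = b, :success, _now),
    do: %{b | state: :closed, failures: 0, opened_at: nil}

  # A failure while half-open reopens the circuit immediately.
  def record(%__MODULE__{state: :half_open} = b, :failure, now),
    do: %{b | state: :open, opened_at: now}

  # Otherwise count the failure and open once the threshold is reached.
  def record(%__MODULE__{} = b, :failure, now) do
    failures = b.failures + 1

    if failures >= b.failure_threshold do
      %{b | state: :open, failures: failures, opened_at: now}
    else
      %{b | failures: failures}
    end
  end
end
```

Passing `now` as an argument keeps the transitions deterministic, which makes them easy to test without sleeping.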
Usage Examples
# Basic retry with exponential backoff
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end, max_attempts: 3)
# Circuit breaker for device management
{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("192.168.1.1")
result = SnmpKit.SnmpLib.ErrorHandler.call_through_breaker(breaker, fn ->
  SnmpKit.SnmpLib.Manager.get_bulk("192.168.1.1", [1,3,6,1,2,1,2,2])
end)
# Adaptive timeout based on device history
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("192.168.1.1", base_timeout: 5000)
Summary
Functions
- adaptive_timeout/2: Calculates an adaptive timeout based on device performance history.
- call_through_breaker/3: Executes a function through a circuit breaker.
- child_spec/1: Returns a specification to start this module under a supervisor.
- classify_error/1: Classifies an error to determine appropriate handling strategy.
- get_device_stats/1: Gets comprehensive error statistics for a device.
- quarantine_device/2: Puts a device into quarantine for a specified duration.
- quarantined?/1: Checks if a device is currently quarantined.
- start_circuit_breaker/2: Starts a circuit breaker for a specific device.
- with_retry/2: Executes a function with intelligent retry logic and exponential backoff.
Types
@type circuit_breaker_opts() :: [
  failure_threshold: pos_integer(),
  recovery_timeout: pos_integer(),
  half_open_max_calls: pos_integer(),
  timeout_threshold: pos_integer(),
  slow_call_threshold: pos_integer()
]
@type circuit_state() :: :closed | :open | :half_open
@type device_id() :: binary()
@type device_stats() :: %{
  device_id: device_id(),
  success_count: non_neg_integer(),
  failure_count: non_neg_integer(),
  avg_response_time: float(),
  last_success: integer() | nil,
  last_failure: integer() | nil,
  circuit_state: circuit_state(),
  quarantine_until: integer() | nil
}
@type error_class() :: :transient | :permanent | :degraded | :unknown
@type retry_opts() :: [
  max_attempts: pos_integer(),
  strategy: retry_strategy(),
  base_delay: pos_integer(),
  max_delay: pos_integer(),
  jitter_factor: float(),
  retry_condition: function()
]
@type retry_strategy() :: :exponential | :linear | :fixed
Functions
@spec adaptive_timeout(device_id(), keyword()) :: pos_integer()
Calculates an adaptive timeout based on device performance history.
Dynamically adjusts timeouts based on historical response times, device health, and current network conditions.
Parameters
- device_id: Device identifier
- opts: Timeout calculation options
Options
- base_timeout: Minimum timeout value (default: 5000ms)
- max_timeout: Maximum timeout value (default: 60000ms)
- percentile: Response time percentile to use (default: 95)
- safety_factor: Multiplier for calculated timeout (default: 2.0)
Returns
Calculated timeout in milliseconds
Examples
# Basic adaptive timeout
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("192.168.1.1")
# Custom timeout parameters
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("slow.device.local",
  base_timeout: 10_000,
  max_timeout: 120_000,
  percentile: 99,
  safety_factor: 3.0
)
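One plausible way to combine the four options is to take the chosen percentile of past response times, multiply it by the safety factor, and clamp the result between the base and maximum timeouts. The sketch below does exactly that. It is an assumption about how such a calculation could work, not the library's actual formula; `TimeoutSketch` and the explicit `samples` list are hypothetical, and `percentile` is assumed to be an integer from 1 to 100.

```elixir
defmodule TimeoutSketch do
  # `samples` is a list of past response times in milliseconds.
  def adaptive(samples, opts \\ []) do
    floor_ms = Keyword.get(opts, :base_timeout, 5_000)
    ceiling_ms = Keyword.get(opts, :max_timeout, 60_000)
    percentile = Keyword.get(opts, :percentile, 95)
    safety_factor = Keyword.get(opts, :safety_factor, 2.0)

    case Enum.sort(samples) do
      # No history yet: fall back to the minimum timeout.
      [] ->
        floor_ms

      sorted ->
        # Nearest-rank percentile: the smallest sample that has at least
        # `percentile` percent of all samples at or below it.
        rank = max(ceil(percentile * length(sorted) / 100), 1)
        observed = Enum.at(sorted, rank - 1)

        # Scale by the safety factor, then clamp to [base, max].
        round(observed * safety_factor) |> max(floor_ms) |> min(ceiling_ms)
    end
  end
end
```

For twenty samples of 100 ms to 2000 ms, the 95th percentile is 1900 ms. With a safety factor of 2.0 that gives 3800 ms, which the default `base_timeout` then raises to 5000 ms.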
@spec call_through_breaker(pid(), function(), pos_integer()) :: {:ok, any()} | {:error, any()}
Executes a function through a circuit breaker.
The circuit breaker monitors the operation and may block future calls if the device is experiencing failures.
Parameters
- breaker_pid: PID of the circuit breaker process
- fun: Function to execute
- timeout: Maximum execution time (optional)
Returns
- {:ok, result}: Operation succeeded
- {:error, reason}: Operation failed
- {:error, :circuit_open}: Circuit breaker is open (device unhealthy)
Examples
result = SnmpKit.SnmpLib.ErrorHandler.call_through_breaker(breaker, fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end)
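Callers usually need to tell a blocked call apart from a call that ran and failed. The sketch below shows one way to pattern match on the three documented return shapes. The module `BreakerResultSketch` and the action atoms it returns are hypothetical placeholders for whatever the caller does next.

```elixir
defmodule BreakerResultSketch do
  # The call ran and succeeded.
  def handle({:ok, value}), do: {:use, value}
  # The breaker refused the call; the device was never contacted.
  # This clause must come before the generic error clause below.
  def handle({:error, :circuit_open}), do: :skip_device
  # The call ran and failed for some other reason.
  def handle({:error, reason}), do: {:log_failure, reason}
end
```

Clause order matters: `{:error, :circuit_open}` also matches `{:error, reason}`, so the more specific clause is listed first.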
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec classify_error(any()) :: error_class()
Classifies an error to determine appropriate handling strategy.
Parameters
- error: The error to classify
Returns
- :transient: Error is likely temporary, retry recommended
- :permanent: Error is permanent, retry not recommended
- :degraded: Performance issue, may benefit from backoff
- :unknown: Unable to classify, use conservative approach
Examples
:transient = SnmpKit.SnmpLib.ErrorHandler.classify_error(:timeout)
:permanent = SnmpKit.SnmpLib.ErrorHandler.classify_error(:authentication_failed)
:degraded = SnmpKit.SnmpLib.ErrorHandler.classify_error(:slow_response)
@spec get_device_stats(device_id()) :: {:ok, device_stats()} | {:error, :not_found}
Gets comprehensive error statistics for a device.
Examples
{:ok, stats} = SnmpKit.SnmpLib.ErrorHandler.get_device_stats("192.168.1.1")
IO.inspect(stats.failure_count)
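The counters in `device_stats()` can be combined into derived metrics. As a small illustration, the sketch below computes a failure rate from `success_count` and `failure_count`. The helper module `StatsSketch` is hypothetical and not part of the library.

```elixir
defmodule StatsSketch do
  # Fraction of recorded operations that failed, as a float in 0.0..1.0.
  def failure_rate(%{success_count: s, failure_count: f}) when s + f > 0,
    do: f / (s + f)

  # No operations recorded yet: report 0.0 instead of dividing by zero.
  def failure_rate(%{success_count: 0, failure_count: 0}), do: 0.0
end
```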
@spec quarantine_device(device_id(), pos_integer()) :: :ok
Puts a device into quarantine for a specified duration in milliseconds.
Operations against a quarantined device are blocked to give it time to recover.
Examples
:ok = SnmpKit.SnmpLib.ErrorHandler.quarantine_device("192.168.1.1", 300_000) # 5 minutes
Checks if a device is currently quarantined.
Examples
false = SnmpKit.SnmpLib.ErrorHandler.quarantined?("192.168.1.1")
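Quarantine can be thought of as a per-device expiry time. The sketch below keeps that as a plain map from device ID to an expiry timestamp in milliseconds. It is illustrative only: `QuarantineSketch`, the explicit state map and the `now` argument are assumptions, and the library manages this state internally.

```elixir
defmodule QuarantineSketch do
  # State is a map of device_id => expiry time in milliseconds.
  def quarantine(state, device_id, duration_ms, now),
    do: Map.put(state, device_id, now + duration_ms)

  # A device is quarantined only while `now` is before its expiry time.
  def quarantined?(state, device_id, now) do
    case Map.fetch(state, device_id) do
      {:ok, expires_at} -> now < expires_at
      :error -> false
    end
  end
end
```

Because expiry is checked on read, no timer is needed to release a device; the entry simply stops matching once its time has passed.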
@spec start_circuit_breaker(device_id(), circuit_breaker_opts()) :: {:ok, pid()} | {:error, any()}
Starts a circuit breaker for a specific device.
Circuit breakers automatically detect failing devices and prevent cascading failures by temporarily blocking operations.
Parameters
- device_id: Unique identifier for the device
- opts: Circuit breaker configuration options
Returns
- {:ok, pid}: Circuit breaker started successfully
- {:error, reason}: Failed to start circuit breaker
Examples
{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("192.168.1.1")
{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("core-switch-01",
  failure_threshold: 10,
  recovery_timeout: 120_000
)
@spec with_retry(function(), retry_opts()) :: {:ok, any()} | {:error, any()}
Executes a function with intelligent retry logic and exponential backoff.
Automatically retries transient failures but does not retry permanent errors. Uses exponential backoff with jitter to prevent thundering herd problems.
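To make the backoff behaviour concrete, the sketch below computes a delay for each of the three strategies and then applies jitter. It is an assumption about how such a schedule is typically calculated, not the library's code: `BackoffSketch` is hypothetical, attempts are numbered from 1, and jitter is modelled as a uniform offset of plus or minus `factor` times the delay.

```elixir
defmodule BackoffSketch do
  # Delay in milliseconds before retry number `attempt`, capped at `cap`.
  def delay(:exponential, attempt, base, cap),
    do: min(base * Integer.pow(2, attempt - 1), cap)

  def delay(:linear, attempt, base, cap), do: min(base * attempt, cap)
  def delay(:fixed, _attempt, base, cap), do: min(base, cap)

  # Shift the delay by a random offset so that many clients failing at the
  # same moment do not all retry at the same moment.
  def jitter(delay_ms, factor) do
    offset = (:rand.uniform() * 2 - 1) * delay_ms * factor
    round(delay_ms + offset)
  end
end
```

With a base of 1000 ms and a cap of 30000 ms, the exponential schedule for attempts 1 to 6 is 1000, 2000, 4000, 8000, 16000 and 30000 ms. A jitter factor of 0.1 then moves each value by up to 10 percent in either direction.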
Parameters
- fun: Function to execute (should return {:ok, result} or {:error, reason})
- opts: Retry configuration options
Options
- max_attempts: Maximum retry attempts (default: 3)
- strategy: Backoff strategy (:exponential, :linear, :fixed)
- base_delay: Initial delay in milliseconds (default: 1000)
- max_delay: Maximum delay between retries (default: 30000)
- jitter_factor: Random variation factor (default: 0.1)
- retry_condition: Custom function to determine if error is retryable
Returns
- {:ok, result}: Operation succeeded (possibly after retries)
- {:error, reason}: Operation failed after all attempts
- {:error, {:max_retries_exceeded, last_error}}: All retries exhausted
Examples
# Basic retry with defaults
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end)
# Custom retry configuration
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(fn ->
  SnmpKit.SnmpLib.Manager.get_bulk("slow.device.local", [1,3,6,1,2,1,2,2])
end,
  max_attempts: 5,
  base_delay: 2000,
  max_delay: 60000,
  strategy: :exponential
)
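The overall control flow of a retry wrapper can be sketched in a few lines. The version below is a simplified illustration, not the library's implementation: `RetrySketch` is hypothetical, the retry condition and delay function are passed in explicitly, and the mapping of outcomes to return values (non-retryable errors returned as-is, exhausted attempts wrapped in `:max_retries_exceeded`) is an assumption based on the return shapes listed above.

```elixir
defmodule RetrySketch do
  # Runs `fun` up to `max_attempts` times.
  # `retry_condition` decides whether an error reason is worth retrying.
  # `delay_fun` maps the attempt number to a sleep time in milliseconds.
  def run(fun, max_attempts, retry_condition, delay_fun, attempt \\ 1) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, reason} = error ->
        cond do
          # Not retryable: return the error unchanged.
          not retry_condition.(reason) ->
            error

          # Retryable, but no attempts left.
          attempt >= max_attempts ->
            {:error, {:max_retries_exceeded, reason}}

          # Retryable: wait, then try again.
          true ->
            Process.sleep(delay_fun.(attempt))
            run(fun, max_attempts, retry_condition, delay_fun, attempt + 1)
        end
    end
  end
end
```

A caller might pass `fn reason -> reason == :timeout end` as the retry condition and a backoff function as `delay_fun`.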