SnmpKit.SnmpLib.ErrorHandler (snmpkit v0.6.3)

Intelligent error handling with retry logic, circuit breakers, and adaptive recovery.

This module provides error handling capabilities designed to improve reliability and performance in production SNMP environments. It is based on patterns proven in high-scale network monitoring systems that handle thousands of devices.

Features

  • Exponential Backoff: Intelligent retry timing to avoid overwhelming failing devices
  • Circuit Breakers: Automatic failure detection and recovery for unhealthy devices
  • Error Classification: Smart categorization of errors for appropriate handling
  • Adaptive Timeouts: Dynamic timeout adjustment based on device performance
  • Quarantine Management: Temporary isolation of problematic devices
  • Recovery Strategies: Multiple approaches for bringing devices back online

Error Classification

Transient Errors (Retryable)

  • Network timeouts
  • Temporary device overload
  • UDP packet loss
  • DNS resolution delays

Permanent Errors (Non-retryable)

  • Authentication failures
  • Unsupported SNMP versions
  • Invalid OIDs
  • Device configuration errors

Degraded Performance

  • Slow response times
  • Partial failures
  • High error rates
  • Resource exhaustion

Circuit Breaker States

Closed (Normal Operation)

The device is healthy and all operations proceed normally.

Open (Failing)

The device has exceeded its failure threshold, so operations are blocked.

Half-Open (Testing)

A limited number of operations are allowed through to test whether the device has recovered.
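
The three states above can be read as a small state machine. The following is a minimal sketch of such a machine, not SnmpKit's implementation: the module name, the event names (:success, :failure, :recovery_timeout), and the default threshold of 5 are all assumptions made for illustration.

```elixir
defmodule BreakerStateSketch do
  # Illustrative pure state machine; NOT SnmpKit's implementation.
  # `failures` counts consecutive failures while the circuit is closed.
  # The default failure_threshold of 5 is a made-up value.
  defstruct state: :closed, failures: 0, failure_threshold: 5

  # A success during a half-open probe closes the circuit again.
  def on_event(%__MODULE__{state: :half_open} = b, :success),
    do: %{b | state: :closed, failures: 0}

  # A success while closed resets the consecutive-failure count.
  def on_event(%__MODULE__{state: :closed} = b, :success),
    do: %{b | failures: 0}

  # A failure during a half-open probe reopens the circuit immediately.
  def on_event(%__MODULE__{state: :half_open} = b, :failure),
    do: %{b | state: :open}

  # While closed, failures accumulate until the threshold trips the circuit.
  def on_event(%__MODULE__{state: :closed} = b, :failure) do
    failures = b.failures + 1
    state = if failures >= b.failure_threshold, do: :open, else: :closed
    %{b | state: state, failures: failures}
  end

  # Once the recovery timeout elapses, an open circuit starts probing.
  def on_event(%__MODULE__{state: :open} = b, :recovery_timeout),
    do: %{b | state: :half_open}

  # Any other event leaves the breaker unchanged (for example, calls
  # rejected while the circuit is open).
  def on_event(%__MODULE__{} = b, _event), do: b
end
```

Keeping the transitions in a pure function makes them easy to test in isolation; a real breaker would wrap something like this in a process that also tracks timers.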

Usage Examples

# Basic retry with exponential backoff
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end, max_attempts: 3)

# Circuit breaker for device management
{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("192.168.1.1")

result = SnmpKit.SnmpLib.ErrorHandler.call_through_breaker(breaker, fn ->
  SnmpKit.SnmpLib.Manager.get_bulk("192.168.1.1", [1,3,6,1,2,1,2,2])
end)

# Adaptive timeout based on device history
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("192.168.1.1", base_timeout: 5000)

Summary

Functions

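adaptive_timeout(device_id, opts \\ [])

Calculates an adaptive timeout based on device performance history.

call_through_breaker(breaker_pid, fun, timeout \\ 5000)

Executes a function through a circuit breaker.

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

classify_error(error)

Classifies an error to determine appropriate handling strategy.

get_device_stats(device_id)

Gets comprehensive error statistics for a device.

quarantine_device(device_id, duration_ms)

Puts a device into quarantine for a specified duration.

quarantined?(device_id)

Checks if a device is currently quarantined.

start_circuit_breaker(device_id, opts \\ [])

Starts a circuit breaker for a specific device.

with_retry(fun, opts \\ [])

Executes a function with intelligent retry logic and exponential backoff.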

Types

circuit_breaker_opts()

@type circuit_breaker_opts() :: [
  failure_threshold: pos_integer(),
  recovery_timeout: pos_integer(),
  half_open_max_calls: pos_integer(),
  timeout_threshold: pos_integer(),
  slow_call_threshold: pos_integer()
]

circuit_state()

@type circuit_state() :: :closed | :open | :half_open

device_id()

@type device_id() :: binary()

device_stats()

@type device_stats() :: %{
  device_id: device_id(),
  success_count: non_neg_integer(),
  failure_count: non_neg_integer(),
  avg_response_time: float(),
  last_success: integer() | nil,
  last_failure: integer() | nil,
  circuit_state: circuit_state(),
  quarantine_until: integer() | nil
}

error_class()

@type error_class() :: :transient | :permanent | :degraded | :unknown

retry_opts()

@type retry_opts() :: [
  max_attempts: pos_integer(),
  strategy: retry_strategy(),
  base_delay: pos_integer(),
  max_delay: pos_integer(),
  jitter_factor: float(),
  retry_condition: function()
]

retry_strategy()

@type retry_strategy() :: :exponential | :linear | :fixed
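
To make the difference between the three strategies concrete, here is a rough sketch of how a delay could be derived from each one. The formulas, the module name, and the function names are assumptions for illustration and may not match what SnmpKit computes internally; only the option names and their defaults (base_delay 1000, max_delay 30000, jitter_factor 0.1) come from the with_retry documentation.

```elixir
defmodule BackoffSketch do
  # Hypothetical formulas for illustration; SnmpKit's schedule may differ.
  # `attempt` is 1 for the first retry, 2 for the second, and so on.
  def base_delay_for(:fixed, base, _attempt), do: base
  def base_delay_for(:linear, base, attempt), do: base * attempt
  def base_delay_for(:exponential, base, attempt), do: base * Integer.pow(2, attempt - 1)

  # Caps the delay at max_delay, then spreads it by +/- jitter_factor.
  def delay(strategy, attempt, opts \\ []) do
    base = Keyword.get(opts, :base_delay, 1000)
    cap = Keyword.get(opts, :max_delay, 30_000)
    jitter = Keyword.get(opts, :jitter_factor, 0.1)

    capped = min(base_delay_for(strategy, base, attempt), cap)
    # :rand.uniform/0 returns a float in [0.0, 1.0); map it to [-1.0, 1.0).
    spread = capped * jitter * (2 * :rand.uniform() - 1)
    max(round(capped + spread), 0)
  end
end
```

With a base delay of 1000 ms, the third retry would wait about 1000 ms under :fixed, 3000 ms under :linear, and 4000 ms under :exponential, before jitter is applied.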

Functions

adaptive_timeout(device_id, opts \\ [])

@spec adaptive_timeout(
  device_id(),
  keyword()
) :: pos_integer()

Calculates an adaptive timeout based on device performance history.

Dynamically adjusts timeouts based on historical response times, device health, and current network conditions.

Parameters

  • device_id: Device identifier
  • opts: Timeout calculation options

Options

  • base_timeout: Minimum timeout value (default: 5000ms)
  • max_timeout: Maximum timeout value (default: 60000ms)
  • percentile: Response time percentile to use (default: 95)
  • safety_factor: Multiplier for calculated timeout (default: 2.0)

Returns

Calculated timeout in milliseconds

Examples

# Basic adaptive timeout
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("192.168.1.1")

# Custom timeout parameters
timeout = SnmpKit.SnmpLib.ErrorHandler.adaptive_timeout("slow.device.local",
  base_timeout: 10_000,
  max_timeout: 120_000,
  percentile: 99,
  safety_factor: 3.0
)
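
One plausible reading of how the options interact is: take the chosen percentile of recent response times, multiply it by safety_factor, and clamp the result between base_timeout and max_timeout. The sketch below implements that reading on an explicit list of samples. It is an assumption about the calculation, not the library's code, and the module and function names are hypothetical.

```elixir
defmodule AdaptiveTimeoutSketch do
  # Illustrative only: assumes timeout = percentile(samples) * safety_factor,
  # clamped to [base_timeout, max_timeout]. Not SnmpKit's implementation.

  # With no history, fall back to the base timeout.
  def timeout([], opts), do: Keyword.get(opts, :base_timeout, 5000)

  def timeout(samples_ms, opts) do
    base = Keyword.get(opts, :base_timeout, 5000)
    cap = Keyword.get(opts, :max_timeout, 60_000)
    pct = Keyword.get(opts, :percentile, 95)
    factor = Keyword.get(opts, :safety_factor, 2.0)

    sorted = Enum.sort(samples_ms)
    # Nearest-rank percentile using integer arithmetic (pct is assumed to be
    # an integer from 1 to 100): rank = ceil(pct * n / 100), at least 1.
    rank = max(div(pct * length(sorted) + 99, 100), 1)
    value = Enum.at(sorted, rank - 1)

    (value * factor) |> round() |> max(base) |> min(cap)
  end
end
```

Under these assumptions a fast device never drops below base_timeout, and a very slow one never exceeds max_timeout.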

call_through_breaker(breaker_pid, fun, timeout \\ 5000)

@spec call_through_breaker(pid(), function(), pos_integer()) ::
  {:ok, any()} | {:error, any()}

Executes a function through a circuit breaker.

The circuit breaker monitors the operation and may block future calls if the device is experiencing failures.

Parameters

  • breaker_pid: PID of the circuit breaker process
  • fun: Function to execute
  • timeout: Maximum execution time (optional)

Returns

  • {:ok, result}: Operation succeeded
  • {:error, reason}: Operation failed
  • {:error, :circuit_open}: Circuit breaker is open (device unhealthy)

Examples

result = SnmpKit.SnmpLib.ErrorHandler.call_through_breaker(breaker, fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end)
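
Because {:error, :circuit_open} means the function was never run, callers will usually want to treat it differently from an ordinary failure. A small sketch of matching on the three documented return shapes follows; the module name and the returned action atoms are hypothetical.

```elixir
defmodule BreakerResultSketch do
  # Hypothetical helper: maps the documented return shapes of
  # call_through_breaker/3 to a follow-up decision.
  def handle({:ok, value}), do: {:use, value}

  # The breaker rejected the call, so the device was not contacted.
  # This clause must come before the generic error clause.
  def handle({:error, :circuit_open}), do: :skip_device

  # The call ran and failed for some other reason.
  def handle({:error, reason}), do: {:log_failure, reason}
end
```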

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

classify_error(error)

@spec classify_error(any()) :: error_class()

Classifies an error to determine appropriate handling strategy.

Parameters

  • error: The error to classify

Returns

  • :transient: Error is likely temporary, retry recommended
  • :permanent: Error is permanent, retry not recommended
  • :degraded: Performance issue, may benefit from backoff
  • :unknown: Unable to classify, use conservative approach

Examples

:transient = SnmpKit.SnmpLib.ErrorHandler.classify_error(:timeout)
:permanent = SnmpKit.SnmpLib.ErrorHandler.classify_error(:authentication_failed)
:degraded = SnmpKit.SnmpLib.ErrorHandler.classify_error(:slow_response)
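
The returned class is typically used to pick a follow-up action. The sketch below shows one way to express that mapping. The module name and the action atoms are hypothetical, and treating :unknown as "retry once" is only one interpretation of a conservative approach.

```elixir
defmodule ErrorPolicySketch do
  # Hypothetical helper, not part of SnmpKit: maps an error class
  # (as returned by classify_error/1) to a follow-up action.
  @spec action_for(:transient | :permanent | :degraded | :unknown) ::
          :retry | :give_up | :back_off | :retry_once
  def action_for(:transient), do: :retry
  def action_for(:permanent), do: :give_up
  def action_for(:degraded), do: :back_off
  # Assumption: "conservative" is read here as a single cautious retry.
  def action_for(:unknown), do: :retry_once
end
```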

get_device_stats(device_id)

@spec get_device_stats(device_id()) :: {:ok, device_stats()} | {:error, :not_found}

Gets comprehensive error statistics for a device.

Examples

{:ok, stats} = SnmpKit.SnmpLib.ErrorHandler.get_device_stats("192.168.1.1")
IO.inspect(stats.failure_count)
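
The counters in device_stats() can be combined into derived metrics. As a small sketch, the hypothetical helper below computes a failure rate from any map that has the success_count and failure_count fields.

```elixir
defmodule StatsSketch do
  # Hypothetical helper: works on a map shaped like device_stats() and
  # returns the fraction of failed operations as a float in 0.0..1.0.
  def failure_rate(%{success_count: s, failure_count: f}) when s + f > 0,
    do: f / (s + f)

  # No operations recorded yet: report a rate of zero.
  def failure_rate(%{success_count: 0, failure_count: 0}), do: 0.0
end
```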

quarantine_device(device_id, duration_ms)

@spec quarantine_device(device_id(), pos_integer()) :: :ok

Puts a device into quarantine for a specified duration.

Operations against a quarantined device are blocked to give it time to recover.

Examples

:ok = SnmpKit.SnmpLib.ErrorHandler.quarantine_device("192.168.1.1", 300_000)  # 5 minutes

quarantined?(device_id)

@spec quarantined?(device_id()) :: boolean()

Checks if a device is currently quarantined.

Examples

false = SnmpKit.SnmpLib.ErrorHandler.quarantined?("192.168.1.1")
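
The quarantine_until field of device_stats() suggests that quarantine is tracked as an expiry timestamp. The sketch below models that idea on a plain map. It assumes quarantine_until is a millisecond timestamp on the same clock as the now_ms argument, which the documentation does not state, and the module is hypothetical.

```elixir
defmodule QuarantineSketch do
  # Illustrative model, not SnmpKit's implementation. Assumes
  # quarantine_until is a millisecond timestamp comparable with now_ms.

  # Marks the device as quarantined until now_ms + duration_ms.
  def quarantine(stats, duration_ms, now_ms),
    do: Map.put(stats, :quarantine_until, now_ms + duration_ms)

  # A device with no expiry recorded is not quarantined.
  def quarantined?(%{quarantine_until: nil}, _now_ms), do: false

  # Otherwise it is quarantined until the expiry time passes.
  def quarantined?(%{quarantine_until: until}, now_ms), do: now_ms < until
end
```

Passing the current time in explicitly keeps the sketch deterministic; a real caller might supply System.monotonic_time(:millisecond).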

start_circuit_breaker(device_id, opts \\ [])

@spec start_circuit_breaker(device_id(), circuit_breaker_opts()) ::
  {:ok, pid()} | {:error, any()}

Starts a circuit breaker for a specific device.

Circuit breakers automatically detect failing devices and prevent cascading failures by temporarily blocking operations.

Parameters

  • device_id: Unique identifier for the device
  • opts: Circuit breaker configuration options

Returns

  • {:ok, pid}: Circuit breaker started successfully
  • {:error, reason}: Failed to start circuit breaker

Examples

{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("192.168.1.1")

{:ok, breaker} = SnmpKit.SnmpLib.ErrorHandler.start_circuit_breaker("core-switch-01",
  failure_threshold: 10,
  recovery_timeout: 120_000
)

with_retry(fun, opts \\ [])

@spec with_retry(function(), retry_opts()) :: {:ok, any()} | {:error, any()}

Executes a function with intelligent retry logic and exponential backoff.

Automatically retries transient failures and does not retry permanent errors. Exponential backoff with jitter spreads retries out to prevent thundering herd problems.

Parameters

  • fun: Function to execute (should return {:ok, result} or {:error, reason})
  • opts: Retry configuration options

Options

  • max_attempts: Maximum retry attempts (default: 3)
  • strategy: Backoff strategy (:exponential, :linear, :fixed)
  • base_delay: Initial delay in milliseconds (default: 1000)
  • max_delay: Maximum delay between retries (default: 30000)
  • jitter_factor: Random variation factor (default: 0.1)
  • retry_condition: Custom function to determine if error is retryable

Returns

  • {:ok, result}: Operation succeeded (possibly after retries)
  • {:error, reason}: Operation failed after all attempts
  • {:error, {:max_retries_exceeded, last_error}}: All retries exhausted

Examples

# Basic retry with defaults
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(fn ->
  SnmpKit.SnmpLib.Manager.get("192.168.1.1", [1,3,6,1,2,1,1,1,0])
end)

# Custom retry configuration
result = SnmpKit.SnmpLib.ErrorHandler.with_retry(
  fn ->
    SnmpKit.SnmpLib.Manager.get_bulk("slow.device.local", [1,3,6,1,2,1,2,2])
  end,
  max_attempts: 5,
  base_delay: 2000,
  max_delay: 60_000,
  strategy: :exponential
)
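
For readers who want to see the control flow behind a call like this, the following is a simplified stand-in for a retry loop, not SnmpKit's implementation. It assumes that max_attempts counts total attempts, that retry_condition receives the error reason and returns a boolean, and that non-retryable errors are returned unwrapped. The :sleep option is a hypothetical addition so the loop can run without waiting, and jitter is omitted.

```elixir
defmodule RetryLoopSketch do
  # Simplified stand-in for illustration; NOT SnmpKit's implementation.
  # The :sleep option is hypothetical and exists only to make the loop testable.
  def run(fun, opts \\ []) do
    max_attempts = Keyword.get(opts, :max_attempts, 3)
    base_delay = Keyword.get(opts, :base_delay, 1000)
    # Assumption: the retry condition takes the error reason.
    retry_cond = Keyword.get(opts, :retry_condition, fn _reason -> true end)
    sleep = Keyword.get(opts, :sleep, &Process.sleep/1)
    attempt(fun, 1, max_attempts, base_delay, retry_cond, sleep)
  end

  defp attempt(fun, n, max_attempts, base_delay, retry_cond, sleep) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      {:error, reason} ->
        cond do
          # Non-retryable errors are returned immediately.
          not retry_cond.(reason) ->
            {:error, reason}

          # Out of attempts: wrap the last error.
          n >= max_attempts ->
            {:error, {:max_retries_exceeded, reason}}

          true ->
            # Exponential backoff without jitter: base, 2*base, 4*base, ...
            sleep.(base_delay * Integer.pow(2, n - 1))
            attempt(fun, n + 1, max_attempts, base_delay, retry_cond, sleep)
        end
    end
  end
end
```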