Retry and Error Handling
Tinkex provides sophisticated retry mechanisms and error categorization to build resilient applications that gracefully handle transient failures, rate limits, and network issues. This guide covers the error model, retry strategies, telemetry integration, and production best practices.
Overview
Tinkex's error handling system distinguishes between user errors (permanent failures that should not be retried) and transient errors (temporary failures that may succeed on retry). The retry mechanism uses exponential backoff with jitter and integrates with telemetry for observability.
Key components:
- Tinkex.Error - Structured error type with categorization
- Tinkex.Retry - Retry orchestration with telemetry
- Tinkex.RetryHandler - Configurable retry policy and timing
- Tinkex.RateLimiter - Shared backoff state for rate limit coordination
- Tinkex.RetryConfig - User-facing sampling retry configuration (max retries, jitter, progress timeout, connection limit)
Sampling retry configuration
SamplingClient now accepts an optional :retry_config, letting you tune high-level retries and bound concurrent attempts per client. Defaults match the Python SDK: base delay 500ms, max delay 10s, ±25% jitter, 120-minute progress timeout, unbounded retries until the progress timeout elapses, and 1000 max_connections.
# Custom retry profile
retry_config =
Tinkex.RetryConfig.new(
max_retries: :infinity,
base_delay_ms: 750,
max_delay_ms: 15_000,
jitter_pct: 0.25,
max_connections: 20
)
{:ok, sampler} =
Tinkex.ServiceClient.create_sampling_client(service,
base_model: "meta-llama/Llama-3.1-8B",
retry_config: retry_config
)
# Disable retries for a specific client
{:ok, sampler} =
Tinkex.ServiceClient.create_sampling_client(service,
base_model: "meta-llama/Llama-3.1-8B",
retry_config: [enable_retry_logic: false]
)

Notes:
- Retry logic runs at the SamplingClient layer; HTTP sampling calls still default to 0 low-level retries.
- max_connections gates concurrent sampling attempts using a semaphore to protect pools under load.
- progress_timeout_ms aborts a retry loop if no forward progress is observed within the window.
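To illustrate how a max_connections gate bounds concurrency, here is a minimal counting-semaphore sketch built on Agent. The module name Sem and the polling interval are illustrative; Tinkex's internal gating may be implemented differently.

```elixir
defmodule Sem do
  # Counting semaphore: the Agent holds the number of free slots.
  def start_link(limit), do: Agent.start_link(fn -> limit end)

  # Take a slot, polling until one becomes free.
  def acquire(sem) do
    granted =
      Agent.get_and_update(sem, fn
        0 -> {false, 0}
        free -> {true, free - 1}
      end)

    if granted do
      :ok
    else
      Process.sleep(10)
      acquire(sem)
    end
  end

  # Return a slot to the pool.
  def release(sem), do: Agent.update(sem, &(&1 + 1))
end
```

Each sampling attempt would acquire a slot before issuing the request and release it afterwards, capping in-flight attempts at the configured limit.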
The Error Model
Tinkex.Error Structure
All Tinkex operations return {:ok, result} or {:error, %Tinkex.Error{}} tuples:
%Tinkex.Error{
message: "Rate limit exceeded",
type: :api_status,
status: 429,
category: :transient,
data: %{"retry_after_ms" => 5000},
retry_after_ms: 5000
}

Fields:
- message - Human-readable error description
- type - Error classification (:api_connection, :api_timeout, :api_status, :request_failed, :validation)
- status - HTTP status code (if applicable)
- category - Request error category (:user, :transient, :system)
- data - Additional error context from the API
- retry_after_ms - Server-requested retry delay
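Handlers typically branch on these fields. The sketch below uses plain maps in place of %Tinkex.Error{} and illustrative return atoms; it is not part of the Tinkex API:

```elixir
decide = fn
  # Permanent failure: surface it to the caller.
  %{category: :user} = error -> {:halt, error}
  # Server asked us to wait: honor retry_after_ms.
  %{retry_after_ms: ms} when is_integer(ms) and ms > 0 -> {:backoff, ms}
  # Other transient failures: retry with computed backoff.
  %{category: :transient} -> :retry
  # Unknown shape: treat as permanent.
  error -> {:halt, error}
end

decide.(%{category: :transient, retry_after_ms: 5000})
# => {:backoff, 5000}
```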
Error Types
:api_connection
Network connectivity failures (DNS resolution, connection refused, TLS errors). Always retryable.
:api_timeout
Request exceeded configured timeout or progress timeout. Retryable unless caused by user input.
:api_status
HTTP error status codes. Retryability depends on the status code and category.
:request_failed
General request failures including exceptions during execution. Retryability depends on context.
:validation
Client-side validation errors (invalid parameters, missing required fields). Never retryable.
User Errors vs Transient Errors
The Error.user_error?/1 function determines if an error is permanent:
# User errors (NOT retryable)
Error.user_error?(%Error{category: :user}) # => true
Error.user_error?(%Error{status: 400}) # => true (bad request)
Error.user_error?(%Error{status: 401}) # => true (unauthorized)
Error.user_error?(%Error{status: 403}) # => true (forbidden)
Error.user_error?(%Error{status: 404}) # => true (not found)
# Transient errors (retryable)
Error.user_error?(%Error{status: 408}) # => false (request timeout)
Error.user_error?(%Error{status: 429}) # => false (rate limit)
Error.user_error?(%Error{status: 500}) # => false (server error)
Error.user_error?(%Error{status: 502}) # => false (bad gateway)
Error.user_error?(%Error{status: 503}) # => false (service unavailable)

Truth table:
- category == :user → User error (never retry)
- Status 4xx (except 408, 429) → User error (never retry)
- Everything else → Transient error (retry allowed)
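The same rules can be written as a standalone predicate. This is a sketch that mirrors the documented truth table, not Tinkex's actual source:

```elixir
defmodule UserErrorCheck do
  # A :user category is always a permanent failure.
  def user_error?(%{category: :user}), do: true
  # 408 (timeout) and 429 (rate limit) are transient despite being 4xx.
  def user_error?(%{status: status}) when status in [408, 429], do: false
  # Any other 4xx is a permanent user error.
  def user_error?(%{status: status}) when is_integer(status) and status in 400..499, do: true
  # Everything else is transient and may be retried.
  def user_error?(_), do: false
end
```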
Basic Retry Usage
Simple Retry
Wrap any operation returning {:ok, value} or {:error, error}:
alias Tinkex.Retry
result = Retry.with_retry(fn ->
# Your operation here
SamplingClient.sample(sampler, prompt, params)
end)
case result do
{:ok, response} ->
IO.puts("Success: #{inspect(response)}")
{:error, error} ->
IO.puts("Failed after retries: #{Error.format(error)}")
end

Custom Retry Configuration
Configure retry behavior with RetryHandler:
alias Tinkex.{Retry, RetryHandler}
handler = RetryHandler.new(
max_retries: :infinity, # Time-bounded by progress_timeout_ms (default)
base_delay_ms: 1000, # Start with 1s delay (default: 500ms)
max_delay_ms: 30_000, # Cap at 30s (default: 10s)
jitter_pct: 1.0, # Full jitter (default: 0.25)
progress_timeout_ms: 60_000 # Abort if no progress for 60s (default: 120m)
)
result = Retry.with_retry(&my_operation/0, handler: handler)

Exponential Backoff with Jitter
Delays grow exponentially: base_delay_ms * 2^attempt, capped at max_delay_ms, with randomized jitter to prevent thundering herd:
# With base_delay_ms: 500, max_delay_ms: 8000, jitter_pct: 1.0
# Attempt 0: 0-500ms (500 * 2^0 * random)
# Attempt 1: 0-1000ms (500 * 2^1 * random)
# Attempt 2: 0-2000ms (500 * 2^2 * random)
# Attempt 3: 0-4000ms (500 * 2^3 * random)
# Attempt 4: 0-8000ms (capped at max_delay_ms)

Disable jitter (for testing or deterministic delays):
handler = RetryHandler.new(jitter_pct: 0.0)

Progress Timeout
The progress timeout prevents infinite retry loops when operations make no forward progress:
handler = RetryHandler.new(
max_retries: :infinity, # Time-bounded instead of attempt-bounded
progress_timeout_ms: 30_000 # Abort if stuck for 30s
)

Progress is measured from the last recorded progress event (initially the start of the retry loop). Because retries no longer reset this timer, loops are bounded by progress_timeout_ms (default: 120 minutes) unless you explicitly call RetryHandler.record_progress/1 when real work is happening between attempts. If the elapsed time exceeds progress_timeout_ms, the retry loop aborts with:
{:error, %Error{type: :api_timeout, message: "Progress timeout exceeded"}}

This is critical for long-running operations where individual attempts may hang or repeatedly fail without making progress.
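The exponential schedule from the backoff section above can be sketched as a pure function. This is illustrative only; RetryHandler computes delays internally, and the rand argument is exposed here just to make the arithmetic deterministic:

```elixir
defmodule BackoffMath do
  # delay = min(base * 2^attempt, max), then scaled by a random
  # factor in [0, 1] when full jitter (jitter_pct: 1.0) is in effect.
  def delay_ms(attempt, base_ms, max_ms, rand \\ 1.0) do
    capped = min(base_ms * Integer.pow(2, attempt), max_ms)
    round(capped * rand)
  end
end

BackoffMath.delay_ms(3, 500, 8_000) # => 4000
BackoffMath.delay_ms(4, 500, 8_000) # => 8000 (capped at max_delay_ms)
```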
Rate Limiting and Backoff Windows
Tinkex.RateLimiter provides shared backoff state per {base_url, api_key} combination. When the server returns 429 Too Many Requests with retry_after_ms, the limiter blocks all requests to that endpoint until the backoff window expires.
How It Works
alias Tinkex.RateLimiter
# Get limiter for an API key + base URL
limiter = RateLimiter.for_key({"https://api.example.com", "key-123"})
# Check if currently in backoff
RateLimiter.should_backoff?(limiter) # => false
# Set a 5-second backoff window
RateLimiter.set_backoff(limiter, 5000)
# Now backoff is active
RateLimiter.should_backoff?(limiter) # => true
# Block until backoff expires
RateLimiter.wait_for_backoff(limiter) # Sleeps in 100ms intervals
# Or clear it manually
RateLimiter.clear_backoff(limiter)

Automatic Integration
When SamplingClient, TrainingClient, or RestClient receives a 429 error with retry_after_ms, the client automatically:
- Extract the limiter for the current endpoint
- Set the backoff window
- Wait before the next retry
This ensures all concurrent requests to the same API key respect rate limits, not just the individual operation being retried.
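The shared-window behavior can be sketched with an Agent that holds the expiry time for one endpoint. This is illustrative only; Tinkex.RateLimiter is the real implementation, and the 50ms polling interval here is an arbitrary choice:

```elixir
defmodule Window do
  # The Agent stores the monotonic time (ms) at which backoff expires.
  def start_link, do: Agent.start_link(fn -> 0 end)

  def set_backoff(win, ms),
    do: Agent.update(win, fn _ -> now_ms() + ms end)

  def should_backoff?(win), do: Agent.get(win, & &1) > now_ms()

  # Poll until the window expires, as wait_for_backoff/1 does.
  def wait(win) do
    if should_backoff?(win) do
      Process.sleep(50)
      wait(win)
    else
      :ok
    end
  end

  defp now_ms, do: System.monotonic_time(:millisecond)
end
```

Because every process checks the same Agent, one 429 response throttles all concurrent callers for that endpoint, not just the request that observed it.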
Telemetry Events
The retry mechanism emits four telemetry events for observability:
Event Reference
[:tinkex, :retry, :attempt, :start]
Fired when an attempt begins.
# Measurements
%{system_time: integer()}
# Metadata
%{attempt: 0, operation: "sample"}

[:tinkex, :retry, :attempt, :stop]
Fired when an attempt succeeds.
# Measurements
%{duration: integer()} # Nanoseconds
# Metadata
%{attempt: 2, result: :ok}

[:tinkex, :retry, :attempt, :retry]
Fired when an attempt fails with a retryable error.
# Measurements
%{duration: integer(), delay_ms: integer()}
# Metadata
%{
attempt: 1,
error: %Error{type: :api_status, status: 500},
operation: "sample"
}

[:tinkex, :retry, :attempt, :failed]
Fired when all retries are exhausted or a non-retryable error occurs.
# Measurements
%{duration: integer()}
# Metadata
%{
attempt: 3,
result: :failed,
error: %Error{type: :validation, message: "Invalid parameter"}
}

Attaching Telemetry Handlers
Log retry events to understand application behavior:
:telemetry.attach_many(
"tinkex-retry-logger",
[
[:tinkex, :retry, :attempt, :start],
[:tinkex, :retry, :attempt, :retry],
[:tinkex, :retry, :attempt, :stop],
[:tinkex, :retry, :attempt, :failed]
],
&handle_retry_event/4,
nil
)
def handle_retry_event(event, measurements, metadata, _config) do
case event do
[:tinkex, :retry, :attempt, :start] ->
Logger.info("Retry attempt #{metadata.attempt} starting")
[:tinkex, :retry, :attempt, :retry] ->
Logger.warning(
"Retry attempt #{metadata.attempt} failed, " <>
"retrying after #{measurements.delay_ms}ms: #{Error.format(metadata.error)}"
)
[:tinkex, :retry, :attempt, :stop] ->
duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
Logger.info("Retry attempt #{metadata.attempt} succeeded in #{duration_ms}ms")
[:tinkex, :retry, :attempt, :failed] ->
Logger.error(
"All retries exhausted at attempt #{metadata.attempt}: #{Error.format(metadata.error)}"
)
end
end

Complete Example
From examples/retry_and_capture.exs:
alias Tinkex.{Error, Retry, RetryHandler}
# Simulate a flaky operation that fails twice then succeeds
defp flaky_operation do
attempt = Process.get(:retry_demo_attempt, 0) + 1
Process.put(:retry_demo_attempt, attempt)
case attempt do
n when n < 3 ->
{:error, Error.new(:api_status, "synthetic 500 for retry demo", status: 500)}
_ ->
{:ok, "succeeded on attempt #{attempt}"}
end
end
# Configure retry handler
handler = RetryHandler.new(
base_delay_ms: 200,
jitter_pct: 0.0, # Deterministic delays for demo
max_retries: 2 # Override the default (:infinity) to keep the demo short
)
# Execute with retry
result = Retry.with_retry(
&flaky_operation/0,
handler: handler,
telemetry_metadata: %{operation: "retry_demo"}
)
case result do
{:ok, value} ->
IO.puts("Success: #{value}")
{:error, error} ->
IO.puts("Failed: #{Error.format(error)}")
end

Output with telemetry attached:
[retry start] attempt=0
[retry retry] attempt=0 delay=200ms duration=1ms error=[api_status (500)] synthetic 500 for retry demo
[retry start] attempt=1
[retry retry] attempt=1 delay=400ms duration=0ms error=[api_status (500)] synthetic 500 for retry demo
[retry start] attempt=2
[retry stop] attempt=2 duration=0ms result=ok
Success: succeeded on attempt 3

Best Practices for Production
1. Configure Timeouts Appropriately
Balance responsiveness with operation duration:
# For fast APIs (< 5s expected)
handler = RetryHandler.new(
base_delay_ms: 500,
max_delay_ms: 8_000,
progress_timeout_ms: 30_000
)
# For slow operations (training, large downloads)
handler = RetryHandler.new(
base_delay_ms: 2000,
max_delay_ms: 60_000,
progress_timeout_ms: 300_000 # 5 minutes
)

2. Use Full Jitter in Production
Prevent thundering herd when many clients retry simultaneously:
# Full jitter for fleets with many concurrent clients
handler = RetryHandler.new(jitter_pct: 1.0)
# The default is 0.25; disable jitter entirely only for testing
handler = RetryHandler.new(jitter_pct: 0.0)

3. Respect Server-Requested Delays
The retry mechanism automatically honors retry_after_ms from API responses. Don't override this with shorter delays.
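Concretely, the server-requested delay acts as a floor on whatever the backoff formula produced. A sketch, with illustrative variable names:

```elixir
# Sleep at least retry_after_ms whenever the server supplies one;
# a nil retry_after_ms falls back to the computed backoff alone.
effective_delay = fn computed_ms, retry_after_ms ->
  max(computed_ms, retry_after_ms || 0)
end

effective_delay.(1_200, 5_000) # => 5000
effective_delay.(1_200, nil)   # => 1200
```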
4. Monitor Retry Metrics
Track retry rates to identify systemic issues:
defmodule MyApp.RetryMetrics do
def init do
:telemetry.attach_many(
"retry-metrics",
[
[:tinkex, :retry, :attempt, :retry],
[:tinkex, :retry, :attempt, :failed]
],
&handle_event/4,
nil
)
end
def handle_event([:tinkex, :retry, :attempt, :retry], _measurements, metadata, _config) do
:telemetry.execute(
[:my_app, :retry, :count],
%{count: 1},
%{status: metadata.error.status}
)
end
def handle_event([:tinkex, :retry, :attempt, :failed], _measurements, metadata, _config) do
:telemetry.execute(
[:my_app, :retry, :exhausted],
%{count: 1},
%{error_type: metadata.error.type}
)
end
end

5. Tune Retry Bounds
Balance reliability with latency by adjusting progress_timeout_ms and optionally capping attempts:
# For interactive applications (< 30s total)
handler = RetryHandler.new(max_retries: 3, progress_timeout_ms: 30_000)
# For background jobs (tolerate longer delays)
handler = RetryHandler.new(max_retries: 8, progress_timeout_ms: 120_000)
# For critical operations (time-bounded, no attempt cap)
handler = RetryHandler.new(max_retries: :infinity, progress_timeout_ms: 600_000)

6. Handle Non-Retryable Errors Explicitly
Don't retry user errors:
case Retry.with_retry(&my_operation/0) do
{:ok, result} ->
{:ok, result}
{:error, %Error{} = error} ->
if Error.user_error?(error) do
# Permanent failure - log and notify user
Logger.error("User error: #{Error.format(error)}")
{:error, :invalid_request}
else
# Transient failure exhausted retries - maybe queue for later
Logger.warning("Transient error exhausted retries: #{Error.format(error)}")
{:error, :service_unavailable}
end
end

7. Use Telemetry Capture for Exception Safety
Wrap retry operations in Telemetry.Capture to ensure exceptions are logged:
alias Tinkex.Telemetry.Capture
require Capture
Capture.capture_exceptions reporter: my_reporter do
Retry.with_retry(&potentially_raising_operation/0)
|> case do
{:ok, value} -> value
{:error, error} -> raise "Operation failed: #{Error.format(error)}"
end
end

Error Formatting
Format errors for logs or user display:
error = Error.new(:api_status, "Rate limit exceeded", status: 429)
# Via Error.format/1
Error.format(error)
# => "[api_status (429)] Rate limit exceeded"
# Via String.Chars protocol
to_string(error)
# => "[api_status (429)] Rate limit exceeded"
# Direct interpolation
"Error occurred: #{error}"
# => "Error occurred: [api_status (429)] Rate limit exceeded"Advanced: Custom Retry Logic
For specialized retry needs, implement your own retry loop:
defmodule MyApp.CustomRetry do
alias Tinkex.{Error, RetryHandler}
def retry_with_circuit_breaker(fun, opts \\ []) do
handler = Keyword.get(opts, :handler, RetryHandler.new())
circuit = Keyword.get(opts, :circuit_breaker)
if circuit_open?(circuit) do
{:error, Error.new(:api_connection, "Circuit breaker open")}
else
do_retry(fun, handler, circuit)
end
end
defp do_retry(fun, handler, circuit) do
case fun.() do
{:ok, value} ->
record_success(circuit)
{:ok, value}
{:error, error} ->
record_failure(circuit)
if RetryHandler.retry?(handler, error) do
Process.sleep(RetryHandler.next_delay(handler))
do_retry(fun, RetryHandler.increment_attempt(handler), circuit)
else
{:error, error}
end
end
end
# Circuit breaker implementation...
end

Related Resources
- API reference: docs/guides/api_reference.md
- Troubleshooting: docs/guides/troubleshooting.md
- Telemetry guide: docs/guides/telemetry.md (if available)
- Examples: examples/retry_and_capture.exs