Each module in Resiliency addresses a distinct failure mode -- retries smooth over transient errors, hedged requests cut tail latency, single-flight deduplication collapses thundering herds, and weighted semaphores bound downstream pressure. In isolation, each is useful. Combined, they form a defense-in-depth strategy that handles the full spectrum of production failures.
This guide walks through concrete composition patterns, building from simple two-module combinations to a full resilience stack. Every example is a complete, runnable module -- copy it into your project and adapt the function bodies.
## Why Combine?
A single resilience primitive covers one failure mode. Real systems face several at once:
| Failure mode | Primitive | What it does |
|---|---|---|
| Transient errors (503s, timeouts) | BackoffRetry | Retries with backoff until success or budget exhaustion |
| Tail latency (p99 spikes) | Hedged | Fires a backup request after a delay, takes whichever finishes first |
| Thundering herd (cache stampede) | SingleFlight | Deduplicates concurrent calls so the function executes once per key |
| Downstream overload | WeightedSemaphore | Bounds concurrency to protect the downstream service |
| Workload isolation | Bulkhead | Limits per-partition concurrency with rejection semantics |
| Request frequency | RateLimiter | Rejects calls when tokens are exhausted; returns a retry-after hint |
When you call an external payment API, a single retry loop is not enough. The payment service might be slow (hedging helps), your retries might fan out across hundreds of pods (single-flight collapses them), and your retry storms might overwhelm the service entirely (semaphore caps concurrency). Combining patterns gives you defense at every layer.
The key insight: compose from the outside in. The outermost wrapper controls the broadest concern (deduplication, concurrency limits), and the innermost wrapper handles the narrowest (individual request hedging, retry logic).
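In sketch form, the nesting looks like this (a schematic only -- `flight`, `semaphore`, `tracker`, and `do_request/0` are placeholder names and most options are elided; the patterns below show real configurations):

```elixir
# Outermost layers handle the broad concerns; innermost handle per-request concerns.
Resiliency.SingleFlight.flight(flight, key, fn ->            # collapse duplicate callers
  Resiliency.WeightedSemaphore.acquire(semaphore, 1, fn ->   # bound downstream concurrency
    Resiliency.BackoffRetry.retry(
      fn ->
        # hedge each individual attempt
        Resiliency.Hedged.run(tracker, fn -> do_request() end, timeout: 3_000)
      end,
      max_attempts: 3
    )
  end)
end)
```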
## Pattern: Retry + Hedged Requests
Scenario -- You are calling a slow search service. Individual requests sometimes time out (transient failure), and tail latency is high (p99 is 3x the median). You want to hedge each attempt and retry the entire hedged call if both the primary and hedge fail.
The retry loop wraps the hedged call. Each "attempt" from BackoffRetry's perspective is a full hedged execution -- primary plus backup.
```elixir
defmodule MyApp.Search do
@moduledoc """
Search client with retry-wrapped hedged requests.
Each retry attempt fires a hedged request (primary + backup).
If both the primary and hedge fail, BackoffRetry sleeps and
tries again.
"""
require Logger
@doc """
Queries the search service with hedged requests and retry.
Returns `{:ok, results}` or `{:error, reason}` after all
retries are exhausted.
"""
@spec search(String.t(), keyword()) :: {:ok, map()} | {:error, any()}
def search(query, opts \\ []) do
tracker = Keyword.get(opts, :tracker, MyApp.Search.Tracker)
Resiliency.BackoffRetry.retry(
      fn ->
        Resiliency.Hedged.run(tracker, fn -> do_search(query) end, timeout: 3_000)
      end,
max_attempts: 3,
backoff: :exponential,
base_delay: 200,
max_delay: 2_000,
budget: 10_000,
retry_if: fn
{:error, :timeout} -> true
{:error, :service_unavailable} -> true
{:error, _} -> false
end,
on_retry: fn attempt, delay, error ->
Logger.warning(
"Search retry attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
)
end
)
end
defp do_search(query) do
case HttpClient.post("https://search.internal/query", %{q: query}) do
{:ok, %{status: 200, body: body}} -> {:ok, body}
{:ok, %{status: 503}} -> {:error, :service_unavailable}
{:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
{:error, :timeout} -> {:error, :timeout}
{:error, reason} -> {:error, reason}
end
end
end
```

### Supervision tree
The `Hedged.Tracker` is a GenServer that must be started before any calls to `Resiliency.Hedged.run/3`. Place it in your application's supervision tree:
```elixir
defmodule MyApp.Application do
use Application
@impl true
def start(_type, _args) do
children = [
{Resiliency.Hedged,
name: MyApp.Search.Tracker,
percentile: 95,
min_delay: 10,
max_delay: 2_000,
initial_delay: 150}
]
Supervisor.start_link(children, strategy: :one_for_one)
end
end
```

### How the layers interact
1. `BackoffRetry.retry/2` calls the anonymous function -- attempt 1.
2. Inside, `Hedged.run/3` fires the primary `do_search/1`. If the primary is slower than p95, a backup fires after the adaptive delay.
3. Whichever finishes first wins. If both fail, `Hedged.run/3` returns `{:error, reason}`.
4. `BackoffRetry` checks `retry_if` -- if the error is retryable, it sleeps with exponential backoff and loops back to step 1.
5. The `:budget` option ensures the entire retry+hedge sequence completes within 10 seconds, regardless of how many attempts remain.
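From the caller's point of view all of this is a single function call. A usage sketch (`render_results/1` and `render_fallback/1` are hypothetical callers, not part of the module above):

```elixir
case MyApp.Search.search("wireless headphones") do
  {:ok, results} -> render_results(results)
  # Reached only after every retry -- and its hedge -- has failed.
  {:error, reason} -> render_fallback(reason)
end
```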
## Pattern: Hedged Requests + WeightedSemaphore
Scenario -- You are calling an external payment API with hedged requests to reduce tail latency. But hedging doubles your outbound request rate in the worst case, and the payment API has a strict rate limit. You need to throttle the total number of in-flight requests -- including hedges.
The semaphore wraps the hedged call. Each hedged execution (which may spawn up to `max_requests` concurrent calls) consumes a weight from the semaphore. This bounds the total downstream pressure.
```elixir
defmodule MyApp.PaymentGateway do
@moduledoc """
Payment API client with hedged requests throttled by a
weighted semaphore.
The semaphore ensures at most 5 hedged executions run
concurrently. Since each hedged execution may fire up to
2 requests (primary + hedge), the downstream API sees at
most 10 in-flight requests from this node.
"""
require Logger
@semaphore MyApp.PaymentGateway.Semaphore
@tracker MyApp.PaymentGateway.Tracker
@doc """
Charges a payment method. Returns `{:ok, transaction}` or
`{:error, reason}`.
If the semaphore is at capacity, the caller blocks (FIFO)
until a slot opens. Use `charge/3` with a timeout to fail
fast under sustained load.
"""
@spec charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()}
def charge(payment_method_id, amount_cents) do
charge(payment_method_id, amount_cents, :infinity)
end
@spec charge(String.t(), pos_integer(), timeout()) :: {:ok, map()} | {:error, any()}
def charge(payment_method_id, amount_cents, timeout) do
# Weight of 2: each hedged execution may fire 2 downstream requests.
case Resiliency.WeightedSemaphore.acquire(@semaphore, 2, fn ->
Resiliency.Hedged.run(@tracker, fn ->
do_charge(payment_method_id, amount_cents)
end, timeout: 5_000)
end, timeout) do
{:ok, {:ok, transaction}} -> {:ok, transaction}
{:ok, {:error, reason}} -> {:error, reason}
{:error, :timeout} -> {:error, :throttled}
{:error, reason} -> {:error, reason}
end
end
@doc """
Non-blocking variant. Returns `:rejected` immediately if the
semaphore has no capacity -- useful for shedding load at the
edge.
"""
@spec try_charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()} | :rejected
def try_charge(payment_method_id, amount_cents) do
case Resiliency.WeightedSemaphore.try_acquire(@semaphore, 2, fn ->
Resiliency.Hedged.run(@tracker, fn ->
do_charge(payment_method_id, amount_cents)
end, timeout: 5_000)
end) do
{:ok, {:ok, transaction}} -> {:ok, transaction}
{:ok, {:error, reason}} -> {:error, reason}
:rejected -> :rejected
{:error, reason} -> {:error, reason}
end
end
defp do_charge(payment_method_id, amount_cents) do
payload = %{payment_method: payment_method_id, amount: amount_cents, currency: "usd"}
case HttpClient.post("https://payments.example.com/v1/charges", payload) do
{:ok, %{status: 200, body: body}} -> {:ok, body}
{:ok, %{status: 402, body: body}} -> {:error, {:payment_declined, body}}
{:ok, %{status: 429}} -> {:error, :rate_limited}
{:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
{:error, reason} -> {:error, reason}
end
end
end
```

### Supervision tree
```elixir
defmodule MyApp.Application do
use Application
@impl true
def start(_type, _args) do
children = [
# Hedged tracker -- adapts delay based on observed payment API latency
{Resiliency.Hedged,
name: MyApp.PaymentGateway.Tracker,
percentile: 99,
min_delay: 50,
max_delay: 3_000,
initial_delay: 500},
# Semaphore -- max weight of 10 permits.
# Each hedged call acquires weight 2, so at most 5 hedged
# executions (= 10 downstream requests) run concurrently.
{Resiliency.WeightedSemaphore,
name: MyApp.PaymentGateway.Semaphore,
max: 10}
]
Supervisor.start_link(children, strategy: :one_for_one)
end
end
```

### Why weight 2?
Each hedged execution may fire up to 2 requests (the primary, then the backup
after the adaptive delay). By acquiring weight 2 from a semaphore with max 10,
you guarantee at most 10 concurrent HTTP requests hit the payment API -- even
when every call triggers a hedge. If you set `max_requests: 3` on the hedged call, you would acquire weight 3 instead.
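As a usage sketch, a checkout flow that prefers shedding load over queueing can pass a short timeout to `charge/3` (`fulfill/1` and `enqueue_retry/2` are hypothetical helpers):

```elixir
# Wait at most 2 seconds for semaphore capacity, then give up.
case MyApp.PaymentGateway.charge("pm_abc123", 4_999, 2_000) do
  {:ok, transaction} -> fulfill(transaction)
  # The semaphore timed out -- too many payments already in flight.
  {:error, :throttled} -> enqueue_retry("pm_abc123", 4_999)
  {:error, reason} -> {:error, reason}
end
```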
## Pattern: SingleFlight + Retry
Scenario -- Your application caches user profiles in Redis. On a cache miss, you fetch from the database. Under load, hundreds of concurrent requests for the same user all miss the cache simultaneously and stampede the database. You want to deduplicate those concurrent fetches and retry transient database errors.
SingleFlight wraps the retry. All concurrent callers for the same key share a single retry loop. If the retry succeeds, everyone gets the result. If it fails, everyone gets the error.
```elixir
defmodule MyApp.UserProfile do
@moduledoc """
User profile loader with single-flight deduplication and retry.
Concurrent requests for the same user ID are collapsed into a
single database fetch. The fetch itself retries transient errors
with exponential backoff.
"""
require Logger
@flight MyApp.UserProfile.Flight
@doc """
Loads a user profile by ID.
On a cache hit, returns immediately. On a cache miss, a single
database fetch runs -- even if 100 callers request the same user
concurrently. The fetch makes up to 3 attempts on transient errors.
"""
@spec get(pos_integer()) :: {:ok, map()} | {:error, any()}
def get(user_id) do
case Cache.get("user:#{user_id}") do
{:ok, profile} ->
{:ok, profile}
:miss ->
Resiliency.SingleFlight.flight(@flight, "user:#{user_id}", fn ->
fetch_with_retry(user_id)
end)
end
end
defp fetch_with_retry(user_id) do
case Resiliency.BackoffRetry.retry(
fn -> fetch_from_db(user_id) end,
max_attempts: 3,
backoff: :exponential,
base_delay: 50,
max_delay: 1_000,
retry_if: fn
{:error, :timeout} -> true
{:error, :connection_lost} -> true
{:error, _} -> false
end,
on_retry: fn attempt, delay, error ->
Logger.warning(
"UserProfile.fetch retry user_id=#{user_id} " <>
"attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
)
end
) do
{:ok, profile} ->
Cache.put("user:#{user_id}", profile, ttl: :timer.minutes(5))
        {:ok, profile}

      {:error, reason} ->
        {:error, reason}
end
end
defp fetch_from_db(user_id) do
case Repo.get(User, user_id) do
nil -> {:error, :not_found}
user -> {:ok, Map.from_struct(user)}
end
end
end
```

### Supervision tree
```elixir
children = [
{Resiliency.SingleFlight, name: MyApp.UserProfile.Flight}
]
Supervisor.start_link(children, strategy: :one_for_one)
```

### Ordering matters
Note that SingleFlight is the outer layer and retry is the inner layer. This is intentional:
- SingleFlight outside, retry inside -- 100 concurrent callers produce 1 retry loop with at most 3 database queries. This is what you want.
- Retry outside, SingleFlight inside -- 100 concurrent callers each start their own retry loop. Each attempt deduplicates, but you still have 100 independent retry loops burning CPU and memory. Avoid this.
Always place deduplication outside retry.
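You can observe the collapse directly by fanning out concurrent callers against the module above -- a test-style sketch, assuming the cache is empty for this user:

```elixir
# 100 concurrent callers racing on the same cache miss...
tasks = for _ <- 1..100, do: Task.async(fn -> MyApp.UserProfile.get(42) end)
results = Task.await_many(tasks, 10_000)

# ...all share the result of one flight: at most 3 DB queries in total.
true = Enum.all?(results, &match?({:ok, _}, &1))
```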
## Pattern: CircuitBreaker + Bulkhead
Scenario -- You are calling a payment API that occasionally has sustained outages. You want the circuit breaker to stop calling the service when it is down, and the bulkhead to limit concurrent calls when it is up -- preventing your application from overwhelming the API with too many requests.
The circuit breaker wraps the bulkhead. When the circuit is open, calls are rejected immediately without consuming a bulkhead permit. When the circuit is closed, the bulkhead limits concurrency.
```elixir
defmodule MyApp.PaymentClient do
@moduledoc """
Payment client with circuit breaker and bulkhead.
The circuit breaker rejects calls when the service is known to be down.
The bulkhead limits concurrent calls to 10 when the service is up.
"""
@breaker MyApp.Payment.Breaker
@bulkhead MyApp.Payment.Bulkhead
@spec charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()}
def charge(payment_method_id, amount_cents) do
case Resiliency.CircuitBreaker.call(@breaker, fn ->
case Resiliency.Bulkhead.call(@bulkhead, fn ->
do_charge(payment_method_id, amount_cents)
end) do
{:ok, result} -> result
{:error, :bulkhead_full} -> {:error, :bulkhead_full}
{:error, reason} -> {:error, reason}
end
end) do
{:ok, {:ok, transaction}} -> {:ok, transaction}
{:ok, {:error, reason}} -> {:error, reason}
{:error, :circuit_open} -> {:error, :service_degraded}
{:error, reason} -> {:error, reason}
end
end
defp do_charge(payment_method_id, amount_cents) do
case HttpClient.post("https://payments.example.com/v1/charges", %{
payment_method: payment_method_id,
amount: amount_cents
}) do
{:ok, %{status: 200, body: body}} -> {:ok, body}
{:ok, %{status: 402, body: body}} -> {:error, {:payment_declined, body}}
{:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
{:error, reason} -> {:error, reason}
end
end
end
```

### Supervision tree
```elixir
children = [
{Resiliency.CircuitBreaker,
name: MyApp.Payment.Breaker,
failure_rate_threshold: 0.5,
open_timeout: 30_000},
{Resiliency.Bulkhead,
name: MyApp.Payment.Bulkhead,
max_concurrent: 10,
max_wait: 5_000}
]
Supervisor.start_link(children, strategy: :one_for_one)
```

## Pattern: Full Resilience Stack
Scenario -- You are building a product catalog service that queries a slow, occasionally unreliable upstream inventory API. The API has rate limits, high tail latency, and transient 503 errors. Multiple pods may request the same product concurrently after a cache miss. You need five primitives working together.
The composition order, from outermost to innermost:
1. CircuitBreaker -- Reject calls when the downstream is known to be down.
2. SingleFlight -- Deduplicate concurrent callers for the same product.
3. WeightedSemaphore -- Bound total concurrent outbound requests.
4. BackoffRetry -- Retry transient failures with exponential backoff.
5. Hedged -- Cut tail latency on each individual attempt.
```elixir
defmodule MyApp.Inventory do
@moduledoc """
  Inventory client combining five resilience patterns.

  Layer order (outside to inside):

    1. CircuitBreaker -- reject calls while the downstream is down
    2. SingleFlight -- collapse concurrent callers per product
    3. WeightedSemaphore -- bound outbound concurrency
    4. BackoffRetry -- retry transient errors
    5. Hedged -- cut tail latency per attempt
"""
require Logger
@breaker MyApp.Inventory.Breaker
@flight MyApp.Inventory.Flight
@semaphore MyApp.Inventory.Semaphore
@tracker MyApp.Inventory.Tracker
@doc """
Fetches inventory for a product by SKU.
Returns `{:ok, inventory}` or `{:error, reason}`.
"""
@spec get_inventory(String.t()) :: {:ok, map()} | {:error, any()}
def get_inventory(sku) do
case Cache.get("inventory:#{sku}") do
{:ok, data} ->
{:ok, data}
:miss ->
fetch_inventory(sku)
end
end
defp fetch_inventory(sku) do
    # Layer 1: CircuitBreaker -- reject when the downstream is known-down
case Resiliency.CircuitBreaker.call(@breaker, fn ->
      # Layer 2: SingleFlight -- deduplicate concurrent callers
Resiliency.SingleFlight.flight(@flight, "inventory:#{sku}", fn ->
        # Layer 3: WeightedSemaphore -- bound concurrency
case Resiliency.WeightedSemaphore.acquire(@semaphore, 2, fn ->
          # Layer 4: BackoffRetry -- retry transient errors
Resiliency.BackoffRetry.retry(
fn ->
              # Layer 5: Hedged -- cut tail latency per attempt
              Resiliency.Hedged.run(@tracker, fn -> fetch_from_api(sku) end, timeout: 4_000)
end,
max_attempts: 3,
backoff: :exponential,
base_delay: 100,
max_delay: 2_000,
budget: 8_000,
retry_if: fn
{:error, :timeout} -> true
{:error, :service_unavailable} -> true
{:error, :rate_limited} -> true
{:error, _} -> false
end,
on_retry: fn attempt, delay, error ->
Logger.warning(
"Inventory.fetch retry sku=#{sku} " <>
"attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
)
end
)
end) do
{:ok, {:ok, data}} ->
Cache.put("inventory:#{sku}", data, ttl: :timer.seconds(30))
{:ok, data}
{:ok, {:error, reason}} ->
{:error, reason}
{:error, reason} ->
{:error, reason}
end
end)
end) do
{:ok, {:ok, data}} -> {:ok, data}
{:ok, {:error, reason}} -> {:error, reason}
{:error, :circuit_open} -> {:error, :service_degraded}
{:error, reason} -> {:error, reason}
end
end
defp fetch_from_api(sku) do
case HttpClient.get("https://inventory.internal/v2/products/#{sku}") do
{:ok, %{status: 200, body: body}} -> {:ok, body}
{:ok, %{status: 404}} -> {:error, :not_found}
{:ok, %{status: 429}} -> {:error, :rate_limited}
{:ok, %{status: 503}} -> {:error, :service_unavailable}
{:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
{:error, :timeout} -> {:error, :timeout}
{:error, reason} -> {:error, reason}
end
end
end
```

### Full supervision tree
```elixir
defmodule MyApp.Application do
use Application
@impl true
def start(_type, _args) do
children = [
# -- Inventory resilience stack --
{Resiliency.CircuitBreaker,
name: MyApp.Inventory.Breaker,
failure_rate_threshold: 0.5,
open_timeout: 30_000},
{Resiliency.SingleFlight, name: MyApp.Inventory.Flight},
{Resiliency.WeightedSemaphore,
name: MyApp.Inventory.Semaphore,
max: 20},
{Resiliency.Hedged,
name: MyApp.Inventory.Tracker,
percentile: 95,
min_delay: 10,
max_delay: 2_000,
initial_delay: 200,
min_samples: 20}
]
Supervisor.start_link(children, strategy: :one_for_one)
end
end
```

### Request flow
Here is what happens when 50 concurrent requests arrive for the same SKU, all missing the cache:
1. CircuitBreaker -- The breaker checks its state. If `:open`, all 50 callers immediately receive `{:error, :service_degraded}` with zero downstream load. If `:closed` or `:half_open`, the call proceeds.
2. SingleFlight -- All 50 callers call `flight/3` with key `"inventory:sku-123"`. Only the first caller executes the function. The other 49 block, waiting for the result.
3. WeightedSemaphore -- The single executing caller acquires weight 2 from the semaphore. If the semaphore is already at capacity from other SKU fetches, this caller blocks in FIFO order.
4. BackoffRetry -- The caller enters the retry loop. Attempt 1 begins.
5. Hedged -- The primary request fires. If it is slower than p95 (tracked adaptively), a backup fires after the adaptive delay. Whichever responds first wins.
6. If the hedged call returns `{:error, :service_unavailable}`, BackoffRetry checks `retry_if`, sleeps with exponential backoff, and loops to step 4.
7. On success, the result propagates back up: the circuit breaker records a success, the semaphore releases its permits, SingleFlight broadcasts the result to all 49 waiting callers, and the cache is populated.
Total downstream impact from 50 concurrent callers: at most 3 attempts x 2 hedged requests = 6 HTTP requests, all bounded by the semaphore. If the circuit is open, downstream impact is zero.
## Supervision Tree Design
When combining multiple patterns, all stateful components -- `Bulkhead`, `CircuitBreaker`, `Hedged.Tracker`, `SingleFlight`, and `WeightedSemaphore` -- need to be started under a supervisor. `BackoffRetry`, `Race`, `AllSettled`, `Map`, and `FirstOk` are stateless and require no supervision.
### Grouping by domain
Group resilience infrastructure by the domain it serves. This makes it clear which components belong together and simplifies restarts:
```elixir
defmodule MyApp.Application do
use Application
@impl true
def start(_type, _args) do
children = [
# Application-level services
MyApp.Repo,
MyApp.Cache,
# Resilience infrastructure -- grouped by domain
      %{
        id: MyApp.ResiliencySupervisor,
        start:
          {Supervisor, :start_link,
           [resilience_children(), [name: MyApp.ResiliencySupervisor, strategy: :one_for_one]]},
        type: :supervisor
      }
]
Supervisor.start_link(children, strategy: :one_for_one)
end
defp resilience_children do
[
# -- Payment gateway stack --
{Resiliency.CircuitBreaker,
name: MyApp.PaymentGateway.Breaker,
failure_rate_threshold: 0.5,
open_timeout: 60_000},
{Resiliency.Bulkhead,
name: MyApp.PaymentGateway.Bulkhead,
max_concurrent: 10,
max_wait: 5_000},
{Resiliency.Hedged,
name: MyApp.PaymentGateway.Tracker,
percentile: 99,
min_delay: 50,
max_delay: 3_000,
initial_delay: 500},
{Resiliency.WeightedSemaphore,
name: MyApp.PaymentGateway.Semaphore,
max: 10},
# -- Search stack --
{Resiliency.Hedged,
name: MyApp.Search.Tracker,
percentile: 95,
min_delay: 10,
max_delay: 2_000,
initial_delay: 150},
# -- User profile stack --
{Resiliency.SingleFlight,
name: MyApp.UserProfile.Flight},
# -- Inventory stack --
{Resiliency.CircuitBreaker,
name: MyApp.Inventory.Breaker,
failure_rate_threshold: 0.5,
open_timeout: 30_000},
{Resiliency.SingleFlight,
name: MyApp.Inventory.Flight},
{Resiliency.Bulkhead,
name: MyApp.Inventory.Bulkhead,
max_concurrent: 15,
max_wait: 3_000},
{Resiliency.WeightedSemaphore,
name: MyApp.Inventory.Semaphore,
max: 20},
{Resiliency.Hedged,
name: MyApp.Inventory.Tracker,
percentile: 95,
min_delay: 10,
max_delay: 2_000,
initial_delay: 200}
]
end
end
```

### Child spec reference
Each stateful module provides a `child_spec/1` that works with standard `Supervisor` syntax:
```elixir
# CircuitBreaker -- failure-rate circuit breaker
{Resiliency.CircuitBreaker, name: MyApp.Breaker, failure_rate_threshold: 0.5}
# Bulkhead -- workload isolation with rejection semantics
{Resiliency.Bulkhead, name: MyApp.Bulkhead, max_concurrent: 10}
# Hedged.Tracker -- adaptive delay + token-bucket throttling
{Resiliency.Hedged, name: MyApp.HedgeTracker, percentile: 95}
# SingleFlight -- concurrent call deduplication
{Resiliency.SingleFlight, name: MyApp.Flights}
# WeightedSemaphore -- bounded concurrency
{Resiliency.WeightedSemaphore, name: MyApp.Semaphore, max: 10}
# RateLimiter -- token-bucket rate limiting
{Resiliency.RateLimiter, name: MyApp.ApiRateLimiter, rate: 100.0, burst_size: 10}
```

Each spec uses the `:name` as its child ID, so you can run multiple instances
without conflict:
```elixir
children = [
{Resiliency.Hedged, name: MyApp.SearchTracker, percentile: 95},
{Resiliency.Hedged, name: MyApp.PaymentTracker, percentile: 99}
]
```

### Restart strategy
Use :one_for_one for resilience components. They are independent of each
other -- a crashed semaphore should not take down the hedge tracker. If a
component crashes and restarts, callers that were blocked on it will receive an
exit signal, which propagates naturally through the retry/hedging layers and
triggers a retry on the next attempt.
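A caller that would rather convert such an exit into an error value can catch it explicitly. A sketch (whether this is appropriate depends on your supervision strategy; `semaphore` and `do_work/0` are placeholders):

```elixir
try do
  Resiliency.WeightedSemaphore.acquire(semaphore, 1, fn -> do_work() end, 5_000)
catch
  # The component crashed or restarted while we were blocked on it.
  :exit, reason -> {:error, {:resilience_unavailable, reason}}
end
```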
## Production Considerations
### Telemetry hooks
Use the `:on_retry` callback in `BackoffRetry` and `:on_hedge` in `Hedged` to emit telemetry events for observability:
```elixir
defmodule MyApp.ResiliencyTelemetry do
@moduledoc """
Telemetry callbacks for resilience patterns.
"""
def on_retry(attempt, delay, error) do
:telemetry.execute(
[:my_app, :resilience, :retry],
%{attempt: attempt, delay_ms: delay},
%{error: error}
)
end
def on_hedge(attempt) do
:telemetry.execute(
[:my_app, :resilience, :hedge],
%{attempt: attempt},
%{}
)
end
end
```

Wire them into your calls:
```elixir
Resiliency.BackoffRetry.retry(fun,
on_retry: &MyApp.ResiliencyTelemetry.on_retry/3
)
Resiliency.Hedged.run(tracker, fun,
on_hedge: &MyApp.ResiliencyTelemetry.on_hedge/1
)
```

The `Hedged.Tracker` also exposes `stats/1` for periodic scraping:
```elixir
# In a periodic reporter or health check
stats = Resiliency.Hedged.Tracker.stats(MyApp.Search.Tracker)
# => %{total_requests: 12450, hedged_requests: 623, hedge_won: 198,
#      p50: 12, p95: 87, p99: 340, current_delay: 87, tokens: 7.3}
```

### Circuit breakers
`Resiliency.CircuitBreaker` sits as the outermost wrapper, before `SingleFlight`, making a binary decision -- call or reject -- before any other work happens. This prevents retries and hedges from running against a service that is known to be down:
```elixir
case Resiliency.CircuitBreaker.call(MyApp.PaymentBreaker, fn ->
Resiliency.SingleFlight.flight(@flight, key, fn ->
# ... semaphore + retry + hedge ...
end)
end) do
{:ok, result} -> result
{:error, :circuit_open} -> {:error, :service_degraded}
{:error, reason} -> {:error, reason}
end
```

### Graceful degradation
Design your callers to degrade gracefully when the resilience stack fails:
```elixir
defmodule MyApp.ProductPage do
@moduledoc """
Product page assembly with graceful degradation.
"""
def render(product_id) do
product = MyApp.Catalog.get!(product_id)
inventory =
case MyApp.Inventory.get_inventory(product.sku) do
{:ok, data} -> data
{:error, _reason} -> %{available: nil, message: "Check back shortly"}
end
reviews =
case MyApp.Reviews.get_reviews(product_id) do
{:ok, data} -> data
{:error, _reason} -> []
end
%{product: product, inventory: inventory, reviews: reviews}
end
end
```

The core product data is required -- if it fails, the page fails. But inventory and reviews are optional. When the inventory API is down and all retries are exhausted, the page still renders with a placeholder message instead of a 500 error.
### Timeout budgets
When stacking multiple patterns, be deliberate about timeout budgets. A common mistake is setting generous timeouts at every layer, causing requests to hang for far too long:
| Layer | Timeout | Rationale |
|---|---|---|
| HTTP client | 3 s | Single request deadline |
| Hedged | 4 s | Slightly above HTTP timeout to allow the hedge to complete |
| BackoffRetry `:budget` | 8 s | Total time for all retry attempts combined |
| WeightedSemaphore | 10 s | Includes queuing time waiting for a permit |
| Caller-facing deadline | 12 s | Outermost deadline the user experiences |
The outer timeout must always be larger than the inner timeout. If your semaphore timeout is shorter than your retry budget, the semaphore will kill the request while retries are still in progress.
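One way to keep that invariant visible is to derive every layer's timeout from a single set of module attributes instead of scattering literals -- a sketch with illustrative names:

```elixir
defmodule MyApp.Inventory.Budgets do
  @http_timeout 3_000                       # single HTTP request deadline
  @hedge_timeout @http_timeout + 1_000      # room for a late hedge to finish
  @retry_budget 8_000                       # all retry attempts combined
  @semaphore_timeout @retry_budget + 2_000  # includes permit queuing time

  # Invariant: http < hedge < retry budget < semaphore timeout.
  def http_timeout, do: @http_timeout
  def hedge_timeout, do: @hedge_timeout
  def retry_budget, do: @retry_budget
  def semaphore_timeout, do: @semaphore_timeout
end
```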
### Abort on non-retryable errors
Use `BackoffRetry.abort/1` to short-circuit the retry loop when you know the error is permanent:
```elixir
Resiliency.BackoffRetry.retry(fn ->
case HttpClient.post(url, payload) do
{:ok, %{status: 200, body: body}} ->
{:ok, body}
{:ok, %{status: 400, body: body}} ->
# Client error -- retrying won't help
{:error, Resiliency.BackoffRetry.abort({:bad_request, body})}
{:ok, %{status: 503}} ->
{:error, :service_unavailable}
{:error, reason} ->
{:error, reason}
end
end)
```

This prevents wasting retry budget on errors that will never succeed, and it propagates the abort through SingleFlight to all waiting callers immediately.