Each module in Resiliency addresses a distinct failure mode -- retries smooth over transient errors, hedged requests cut tail latency, single-flight deduplication collapses thundering herds, and weighted semaphores bound downstream pressure. In isolation, each is useful. Combined, they form a defense-in-depth strategy that handles the full spectrum of production failures.

This guide walks through concrete composition patterns, building from simple two-module combinations to a full resilience stack. Every example is a complete, runnable module -- copy it into your project and adapt the function bodies.


Why Combine?

A single resilience primitive covers one failure mode. Real systems face several at once:

Failure mode                       | Primitive          | What it does
Transient errors (503s, timeouts)  | BackoffRetry       | Retries with backoff until success or budget exhaustion
Tail latency (p99 spikes)          | Hedged             | Fires a backup request after a delay, takes whichever finishes first
Thundering herd (cache stampede)   | SingleFlight       | Deduplicates concurrent calls so the function executes once per key
Downstream overload                | WeightedSemaphore  | Bounds concurrency to protect the downstream service
Workload isolation                 | Bulkhead           | Limits per-partition concurrency with rejection semantics
Request frequency                  | RateLimiter        | Rejects calls when tokens are exhausted; returns a retry-after hint

When you call an external payment API, a single retry loop is not enough. The payment service might be slow (hedging helps), your retries might fan out across hundreds of pods (single-flight collapses them), and your retry storms might overwhelm the service entirely (semaphore caps concurrency). Combining patterns gives you defense at every layer.

The key insight: compose from the outside in. The outermost wrapper controls the broadest concern (deduplication, concurrency limits), and the innermost wrapper handles the narrowest (individual request hedging, retry logic).
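The nesting can be seen with plain anonymous functions standing in for the real wrappers -- a toy sketch, not the library API; the layer names are illustrative:

```elixir
# Toy model: each "layer" is a function that wraps another function.
# The outermost wrapper runs first on the way in and last on the way out.
{:ok, log} = Agent.start_link(fn -> [] end)

wrap = fn name, fun ->
  fn ->
    Agent.update(log, &[{:enter, name} | &1])
    result = fun.()
    Agent.update(log, &[{:exit, name} | &1])
    result
  end
end

work = fn -> :result end

# Compose outside-in: single_flight > semaphore > retry > hedged > work
stack =
  work
  |> then(&wrap.(:hedged, &1))
  |> then(&wrap.(:retry, &1))
  |> then(&wrap.(:semaphore, &1))
  |> then(&wrap.(:single_flight, &1))

:result = stack.()

Agent.get(log, &Enum.reverse/1)
# Broadest concern entered first, exited last:
# [enter: :single_flight, enter: :semaphore, enter: :retry, enter: :hedged,
#  exit: :hedged, exit: :retry, exit: :semaphore, exit: :single_flight]
```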


Pattern: Retry + Hedged Requests

Scenario -- You are calling a slow search service. Individual requests sometimes time out (transient failure), and tail latency is high (p99 is 3x the median). You want to hedge each attempt and retry the entire hedged call if both the primary and hedge fail.

The retry loop wraps the hedged call. Each "attempt" from BackoffRetry's perspective is a full hedged execution -- primary plus backup.

defmodule MyApp.Search do
  @moduledoc """
  Search client with retry-wrapped hedged requests.

  Each retry attempt fires a hedged request (primary + backup).
  If both the primary and hedge fail, BackoffRetry sleeps and
  tries again.
  """

  require Logger

  @doc """
  Queries the search service with hedged requests and retry.

  Returns `{:ok, results}` or `{:error, reason}` after all
  retries are exhausted.
  """
  @spec search(String.t(), keyword()) :: {:ok, map()} | {:error, any()}
  def search(query, opts \\ []) do
    tracker = Keyword.get(opts, :tracker, MyApp.Search.Tracker)

    Resiliency.BackoffRetry.retry(
      fn ->
        Resiliency.Hedged.run(tracker, fn -> do_search(query) end, timeout: 3_000)
      end,
      max_attempts: 3,
      backoff: :exponential,
      base_delay: 200,
      max_delay: 2_000,
      budget: 10_000,
      retry_if: fn
        {:error, :timeout} -> true
        {:error, :service_unavailable} -> true
        {:error, _} -> false
      end,
      on_retry: fn attempt, delay, error ->
        Logger.warning(
          "Search retry attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
        )
      end
    )
  end

  defp do_search(query) do
    case HttpClient.post("https://search.internal/query", %{q: query}) do
      {:ok, %{status: 200, body: body}} -> {:ok, body}
      {:ok, %{status: 503}} -> {:error, :service_unavailable}
      {:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
      {:error, :timeout} -> {:error, :timeout}
      {:error, reason} -> {:error, reason}
    end
  end
end

Supervision tree

The Hedged.Tracker is a GenServer that must be started before any calls to Resiliency.Hedged.run/3. Place it in your application's supervision tree:

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {Resiliency.Hedged,
       name: MyApp.Search.Tracker,
       percentile: 95,
       min_delay: 10,
       max_delay: 2_000,
       initial_delay: 150}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

How the layers interact

  1. BackoffRetry.retry/2 calls the anonymous function -- attempt 1.
  2. Inside, Hedged.run/3 fires the primary do_search/1. If the primary is slower than p95, a backup fires after the adaptive delay.
  3. Whichever finishes first wins. If both fail, Hedged.run/3 returns {:error, reason}.
  4. BackoffRetry checks retry_if -- if the error is retryable, it sleeps with exponential backoff and repeats steps 2-3 as the next attempt.
  5. The :budget option ensures the entire retry+hedge sequence completes within 10 seconds, regardless of how many attempts remain.

Pattern: Hedged Requests + WeightedSemaphore

Scenario -- You are calling an external payment API with hedged requests to reduce tail latency. But hedging doubles your outbound request rate in the worst case, and the payment API has a strict rate limit. You need to throttle the total number of in-flight requests -- including hedges.

The semaphore wraps the hedged call. Each hedged execution (which may spawn up to max_requests concurrent calls) consumes a weight from the semaphore. This bounds the total downstream pressure.

defmodule MyApp.PaymentGateway do
  @moduledoc """
  Payment API client with hedged requests throttled by a
  weighted semaphore.

  The semaphore ensures at most 5 hedged executions run
  concurrently. Since each hedged execution may fire up to
  2 requests (primary + hedge), the downstream API sees at
  most 10 in-flight requests from this node.
  """

  require Logger

  @semaphore MyApp.PaymentGateway.Semaphore
  @tracker MyApp.PaymentGateway.Tracker

  @doc """
  Charges a payment method. Returns `{:ok, transaction}` or
  `{:error, reason}`.

  If the semaphore is at capacity, the caller blocks (FIFO)
  until a slot opens. Use `charge/3` with a timeout to fail
  fast under sustained load.
  """
  @spec charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()}
  def charge(payment_method_id, amount_cents) do
    charge(payment_method_id, amount_cents, :infinity)
  end

  @spec charge(String.t(), pos_integer(), timeout()) :: {:ok, map()} | {:error, any()}
  def charge(payment_method_id, amount_cents, timeout) do
    # Weight of 2: each hedged execution may fire 2 downstream requests.
    case Resiliency.WeightedSemaphore.acquire(@semaphore, 2, fn ->
           Resiliency.Hedged.run(@tracker, fn ->
             do_charge(payment_method_id, amount_cents)
           end, timeout: 5_000)
         end, timeout) do
      {:ok, {:ok, transaction}} -> {:ok, transaction}
      {:ok, {:error, reason}} -> {:error, reason}
      {:error, :timeout} -> {:error, :throttled}
      {:error, reason} -> {:error, reason}
    end
  end

  @doc """
  Non-blocking variant. Returns `:rejected` immediately if the
  semaphore has no capacity -- useful for shedding load at the
  edge.
  """
  @spec try_charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()} | :rejected
  def try_charge(payment_method_id, amount_cents) do
    case Resiliency.WeightedSemaphore.try_acquire(@semaphore, 2, fn ->
           Resiliency.Hedged.run(@tracker, fn ->
             do_charge(payment_method_id, amount_cents)
           end, timeout: 5_000)
         end) do
      {:ok, {:ok, transaction}} -> {:ok, transaction}
      {:ok, {:error, reason}} -> {:error, reason}
      :rejected -> :rejected
      {:error, reason} -> {:error, reason}
    end
  end

  defp do_charge(payment_method_id, amount_cents) do
    payload = %{payment_method: payment_method_id, amount: amount_cents, currency: "usd"}

    case HttpClient.post("https://payments.example.com/v1/charges", payload) do
      {:ok, %{status: 200, body: body}} -> {:ok, body}
      {:ok, %{status: 402, body: body}} -> {:error, {:payment_declined, body}}
      {:ok, %{status: 429}} -> {:error, :rate_limited}
      {:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
      {:error, reason} -> {:error, reason}
    end
  end
end

Supervision tree

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Hedged tracker -- adapts delay based on observed payment API latency
      {Resiliency.Hedged,
       name: MyApp.PaymentGateway.Tracker,
       percentile: 99,
       min_delay: 50,
       max_delay: 3_000,
       initial_delay: 500},

      # Semaphore -- max weight of 10 permits.
      # Each hedged call acquires weight 2, so at most 5 hedged
      # executions (= 10 downstream requests) run concurrently.
      {Resiliency.WeightedSemaphore,
       name: MyApp.PaymentGateway.Semaphore,
       max: 10}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

Why weight 2?

Each hedged execution may fire up to 2 requests (the primary, then the backup after the adaptive delay). By acquiring weight 2 from a semaphore with max 10, you guarantee at most 10 concurrent HTTP requests hit the payment API -- even when every call triggers a hedge. If you set max_requests: 3 on the hedged call, you would acquire weight 3 instead.
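The arithmetic, spelled out with the illustrative numbers from this section:

```elixir
# Capacity math for hedged calls behind a weighted semaphore.
semaphore_max = 10       # total permits configured on the semaphore
requests_per_call = 2    # primary + one backup per hedged execution
weight = requests_per_call

# How many hedged executions can hold permits at once:
concurrent_executions = div(semaphore_max, weight)       # 5
# Worst-case in-flight HTTP requests when every call hedges:
max_inflight = concurrent_executions * requests_per_call # 10

# With max_requests: 3 you would acquire weight 3 instead:
div(semaphore_max, 3) * 3                                # 9 in-flight requests
```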


Pattern: SingleFlight + Retry

Scenario -- Your application caches user profiles in Redis. On a cache miss, you fetch from the database. Under load, hundreds of concurrent requests for the same user all miss the cache simultaneously and stampede the database. You want to deduplicate those concurrent fetches and retry transient database errors.

SingleFlight wraps the retry. All concurrent callers for the same key share a single retry loop. If the retry succeeds, everyone gets the result. If it fails, everyone gets the error.

defmodule MyApp.UserProfile do
  @moduledoc """
  User profile loader with single-flight deduplication and retry.

  Concurrent requests for the same user ID are collapsed into a
  single database fetch. The fetch itself retries transient errors
  with exponential backoff.
  """

  require Logger

  @flight MyApp.UserProfile.Flight

  @doc """
  Loads a user profile by ID.

  On a cache hit, returns immediately. On a cache miss, a single
  database fetch runs -- even if 100 callers request the same user
  concurrently. The fetch makes at most 3 attempts on transient errors.
  """
  @spec get(pos_integer()) :: {:ok, map()} | {:error, any()}
  def get(user_id) do
    case Cache.get("user:#{user_id}") do
      {:ok, profile} ->
        {:ok, profile}

      :miss ->
        Resiliency.SingleFlight.flight(@flight, "user:#{user_id}", fn ->
          fetch_with_retry(user_id)
        end)
    end
  end

  defp fetch_with_retry(user_id) do
    case Resiliency.BackoffRetry.retry(
           fn -> fetch_from_db(user_id) end,
           max_attempts: 3,
           backoff: :exponential,
           base_delay: 50,
           max_delay: 1_000,
           retry_if: fn
             {:error, :timeout} -> true
             {:error, :connection_lost} -> true
             {:error, _} -> false
           end,
           on_retry: fn attempt, delay, error ->
             Logger.warning(
               "UserProfile.fetch retry user_id=#{user_id} " <>
                 "attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
             )
           end
         ) do
      {:ok, profile} ->
        Cache.put("user:#{user_id}", profile, ttl: :timer.minutes(5))
        {:ok, profile}

      {:error, reason} ->
        {:error, reason}
    end
  end

  defp fetch_from_db(user_id) do
    case Repo.get(User, user_id) do
      nil -> {:error, :not_found}
      user -> {:ok, Map.from_struct(user)}
    end
  end
end

Supervision tree

children = [
  {Resiliency.SingleFlight, name: MyApp.UserProfile.Flight}
]

Supervisor.start_link(children, strategy: :one_for_one)

Ordering matters

Note that SingleFlight is the outer layer and retry is the inner layer. This is intentional:

  • SingleFlight outside, retry inside -- 100 concurrent callers produce 1 retry loop with at most 3 database queries. This is what you want.
  • Retry outside, SingleFlight inside -- 100 concurrent callers each start their own retry loop. Each attempt deduplicates, but you still have 100 independent retry loops burning CPU and memory. Avoid this.

Always place deduplication outside retry.
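A back-of-envelope comparison of the two orderings under a worst-case outage (assumed numbers; the point is the shape, not the exact figures):

```elixir
callers = 100
max_attempts = 3

# SingleFlight outside, retry inside: one shared loop for all callers.
shared_loops = 1
shared_db_queries = max_attempts   # at most 3 queries total

# Retry outside, SingleFlight inside: every caller owns a loop.
# Individual attempts may still deduplicate, but the loops do not.
independent_loops = callers        # 100 loops sleeping and waking
```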


Pattern: CircuitBreaker + Bulkhead

Scenario -- You are calling a payment API that occasionally has sustained outages. You want the circuit breaker to stop calling the service when it is down, and the bulkhead to limit concurrent calls when it is up -- preventing your application from overwhelming the API with too many requests.

The circuit breaker wraps the bulkhead. When the circuit is open, calls are rejected immediately without consuming a bulkhead permit. When the circuit is closed, the bulkhead limits concurrency.

defmodule MyApp.PaymentClient do
  @moduledoc """
  Payment client with circuit breaker and bulkhead.

  The circuit breaker rejects calls when the service is known to be down.
  The bulkhead limits concurrent calls to 10 when the service is up.
  """

  @breaker MyApp.Payment.Breaker
  @bulkhead MyApp.Payment.Bulkhead

  @spec charge(String.t(), pos_integer()) :: {:ok, map()} | {:error, any()}
  def charge(payment_method_id, amount_cents) do
    case Resiliency.CircuitBreaker.call(@breaker, fn ->
           case Resiliency.Bulkhead.call(@bulkhead, fn ->
                  do_charge(payment_method_id, amount_cents)
                end) do
             {:ok, result} -> result
             {:error, :bulkhead_full} -> {:error, :bulkhead_full}
             {:error, reason} -> {:error, reason}
           end
         end) do
      {:ok, {:ok, transaction}} -> {:ok, transaction}
      {:ok, {:error, reason}} -> {:error, reason}
      {:error, :circuit_open} -> {:error, :service_degraded}
      {:error, reason} -> {:error, reason}
    end
  end

  defp do_charge(payment_method_id, amount_cents) do
    case HttpClient.post("https://payments.example.com/v1/charges", %{
           payment_method: payment_method_id,
           amount: amount_cents
         }) do
      {:ok, %{status: 200, body: body}} -> {:ok, body}
      {:ok, %{status: 402, body: body}} -> {:error, {:payment_declined, body}}
      {:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
      {:error, reason} -> {:error, reason}
    end
  end
end

Supervision tree

children = [
  {Resiliency.CircuitBreaker,
   name: MyApp.Payment.Breaker,
   failure_rate_threshold: 0.5,
   open_timeout: 30_000},

  {Resiliency.Bulkhead,
   name: MyApp.Payment.Bulkhead,
   max_concurrent: 10,
   max_wait: 5_000}
]

Supervisor.start_link(children, strategy: :one_for_one)

Pattern: Full Resilience Stack

Scenario -- You are building a product catalog service that queries a slow, occasionally unreliable upstream inventory API. The API has rate limits, high tail latency, and transient 503 errors. Multiple pods may request the same product concurrently after a cache miss. You need five primitives working together.

The composition order, from outermost to innermost:

  1. CircuitBreaker -- Reject calls when the downstream is known to be down.
  2. SingleFlight -- Deduplicate concurrent callers for the same product.
  3. WeightedSemaphore -- Bound total concurrent outbound requests.
  4. BackoffRetry -- Retry transient failures with exponential backoff.
  5. Hedged -- Cut tail latency on each individual attempt.

defmodule MyApp.Inventory do
  @moduledoc """
  Inventory client combining five resilience patterns.

  Layer order (outside to inside):
    1. CircuitBreaker -- reject calls while the downstream is down
    2. SingleFlight   -- collapse concurrent callers per product
    3. Semaphore      -- bound outbound concurrency
    4. BackoffRetry   -- retry transient errors
    5. Hedged         -- cut tail latency per attempt
  """

  require Logger

  @breaker MyApp.Inventory.Breaker
  @flight MyApp.Inventory.Flight
  @semaphore MyApp.Inventory.Semaphore
  @tracker MyApp.Inventory.Tracker

  @doc """
  Fetches inventory for a product by SKU.

  Returns `{:ok, inventory}` or `{:error, reason}`.
  """
  @spec get_inventory(String.t()) :: {:ok, map()} | {:error, any()}
  def get_inventory(sku) do
    case Cache.get("inventory:#{sku}") do
      {:ok, data} ->
        {:ok, data}

      :miss ->
        fetch_inventory(sku)
    end
  end

  defp fetch_inventory(sku) do
    # Layer 0: CircuitBreaker -- reject when downstream is known-down
    case Resiliency.CircuitBreaker.call(@breaker, fn ->
           # Layer 1: SingleFlight -- deduplicate concurrent callers
           Resiliency.SingleFlight.flight(@flight, "inventory:#{sku}", fn ->
             # Layer 2: WeightedSemaphore -- bound concurrency
             case Resiliency.WeightedSemaphore.acquire(@semaphore, 2, fn ->
                    # Layer 3: BackoffRetry -- retry transient errors
                    Resiliency.BackoffRetry.retry(
                       fn ->
                         # Layer 4: Hedged -- cut tail latency
                         Resiliency.Hedged.run(@tracker, fn ->
                           fetch_from_api(sku)
                         end, timeout: 4_000)
                       end,
                      max_attempts: 3,
                      backoff: :exponential,
                      base_delay: 100,
                      max_delay: 2_000,
                      budget: 8_000,
                      retry_if: fn
                        {:error, :timeout} -> true
                        {:error, :service_unavailable} -> true
                        {:error, :rate_limited} -> true
                        {:error, _} -> false
                      end,
                      on_retry: fn attempt, delay, error ->
                        Logger.warning(
                          "Inventory.fetch retry sku=#{sku} " <>
                            "attempt=#{attempt} delay=#{delay}ms error=#{inspect(error)}"
                        )
                      end
                    )
                  end) do
               {:ok, {:ok, data}} ->
                 Cache.put("inventory:#{sku}", data, ttl: :timer.seconds(30))
                 {:ok, data}

               {:ok, {:error, reason}} ->
                 {:error, reason}

               {:error, reason} ->
                 {:error, reason}
             end
           end)
         end) do
      {:ok, {:ok, data}} -> {:ok, data}
      {:ok, {:error, reason}} -> {:error, reason}
      {:error, :circuit_open} -> {:error, :service_degraded}
      {:error, reason} -> {:error, reason}
    end
  end

  defp fetch_from_api(sku) do
    case HttpClient.get("https://inventory.internal/v2/products/#{sku}") do
      {:ok, %{status: 200, body: body}} -> {:ok, body}
      {:ok, %{status: 404}} -> {:error, :not_found}
      {:ok, %{status: 429}} -> {:error, :rate_limited}
      {:ok, %{status: 503}} -> {:error, :service_unavailable}
      {:ok, %{status: status}} -> {:error, {:unexpected_status, status}}
      {:error, :timeout} -> {:error, :timeout}
      {:error, reason} -> {:error, reason}
    end
  end
end

Full supervision tree

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # -- Inventory resilience stack --
      {Resiliency.CircuitBreaker,
       name: MyApp.Inventory.Breaker,
       failure_rate_threshold: 0.5,
       open_timeout: 30_000},

      {Resiliency.SingleFlight, name: MyApp.Inventory.Flight},

      {Resiliency.WeightedSemaphore,
       name: MyApp.Inventory.Semaphore,
       max: 20},

      {Resiliency.Hedged,
       name: MyApp.Inventory.Tracker,
       percentile: 95,
       min_delay: 10,
       max_delay: 2_000,
       initial_delay: 200,
       min_samples: 20}
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

Request flow

Here is what happens when 50 concurrent requests arrive for the same SKU, all missing the cache:

  1. CircuitBreaker -- The breaker checks its state. If :open, all 50 callers immediately receive {:error, :service_degraded} with zero downstream load. If :closed or :half_open, the call proceeds.
  2. SingleFlight -- All 50 callers call flight/3 with key "inventory:sku-123". Only the first caller executes the function. The other 49 block, waiting for the result.
  3. WeightedSemaphore -- The single executing caller acquires weight 2 from the semaphore. If the semaphore is already at capacity from other SKU fetches, this caller blocks in FIFO order.
  4. BackoffRetry -- The caller enters the retry loop. Attempt 1 begins.
  5. Hedged -- The primary request fires. If it is slower than p95 (tracked adaptively), a backup fires after the adaptive delay. Whichever responds first wins.
  6. If the hedged call returns {:error, :service_unavailable}, BackoffRetry checks retry_if, sleeps with exponential backoff, and re-runs the hedged call (step 5) as the next attempt.
  7. On success, the result propagates back up: the circuit breaker records a success, the semaphore releases its permits, SingleFlight broadcasts the result to all 49 waiting callers, and the cache is populated.

Total downstream impact from 50 concurrent callers: at most 3 attempts x 2 hedged requests = 6 HTTP requests, all bounded by the semaphore. If the circuit is open, downstream impact is zero.


Supervision Tree Design

When combining multiple patterns, all stateful components -- Bulkhead, CircuitBreaker, Hedged.Tracker, SingleFlight, and WeightedSemaphore -- need to be started under a supervisor. BackoffRetry, Race, AllSettled, Map, and FirstOk are stateless and require no supervision.

Grouping by domain

Group resilience infrastructure by the domain it serves. This makes it clear which components belong together and simplifies restarts:

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Application-level services
      MyApp.Repo,
      MyApp.Cache,

      # Resilience infrastructure -- grouped by domain
      %{
        id: MyApp.ResiliencySupervisor,
        type: :supervisor,
        start:
          {Supervisor, :start_link,
           [resilience_children(), [name: MyApp.ResiliencySupervisor, strategy: :one_for_one]]}
      }
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp resilience_children do
    [
      # -- Payment gateway stack --
      {Resiliency.CircuitBreaker,
       name: MyApp.PaymentGateway.Breaker,
       failure_rate_threshold: 0.5,
       open_timeout: 60_000},

      {Resiliency.Bulkhead,
       name: MyApp.PaymentGateway.Bulkhead,
       max_concurrent: 10,
       max_wait: 5_000},

      {Resiliency.Hedged,
       name: MyApp.PaymentGateway.Tracker,
       percentile: 99,
       min_delay: 50,
       max_delay: 3_000,
       initial_delay: 500},

      {Resiliency.WeightedSemaphore,
       name: MyApp.PaymentGateway.Semaphore,
       max: 10},

      # -- Search stack --
      {Resiliency.Hedged,
       name: MyApp.Search.Tracker,
       percentile: 95,
       min_delay: 10,
       max_delay: 2_000,
       initial_delay: 150},

      # -- User profile stack --
      {Resiliency.SingleFlight,
       name: MyApp.UserProfile.Flight},

      # -- Inventory stack --
      {Resiliency.CircuitBreaker,
       name: MyApp.Inventory.Breaker,
       failure_rate_threshold: 0.5,
       open_timeout: 30_000},

      {Resiliency.SingleFlight,
       name: MyApp.Inventory.Flight},

      {Resiliency.Bulkhead,
       name: MyApp.Inventory.Bulkhead,
       max_concurrent: 15,
       max_wait: 3_000},

      {Resiliency.WeightedSemaphore,
       name: MyApp.Inventory.Semaphore,
       max: 20},

      {Resiliency.Hedged,
       name: MyApp.Inventory.Tracker,
       percentile: 95,
       min_delay: 10,
       max_delay: 2_000,
       initial_delay: 200}
    ]
  end
end

Child spec reference

Each stateful module provides a child_spec/1 that works with standard Supervisor syntax:

# CircuitBreaker -- failure-rate circuit breaker
{Resiliency.CircuitBreaker, name: MyApp.Breaker, failure_rate_threshold: 0.5}

# Bulkhead -- workload isolation with rejection semantics
{Resiliency.Bulkhead, name: MyApp.Bulkhead, max_concurrent: 10}

# Hedged.Tracker -- adaptive delay + token-bucket throttling
{Resiliency.Hedged, name: MyApp.HedgeTracker, percentile: 95}

# SingleFlight -- concurrent call deduplication
{Resiliency.SingleFlight, name: MyApp.Flights}

# WeightedSemaphore -- bounded concurrency
{Resiliency.WeightedSemaphore, name: MyApp.Semaphore, max: 10}

# RateLimiter -- token-bucket rate limiting
{Resiliency.RateLimiter, name: MyApp.ApiRateLimiter, rate: 100.0, burst_size: 10}

Each spec uses the :name as its child ID, so you can run multiple instances without conflict:

children = [
  {Resiliency.Hedged, name: MyApp.SearchTracker, percentile: 95},
  {Resiliency.Hedged, name: MyApp.PaymentTracker, percentile: 99}
]

Restart strategy

Use :one_for_one for resilience components. They are independent of each other -- a crashed semaphore should not take down the hedge tracker. If a component crashes and restarts, callers that were blocked on it will receive an exit signal, which propagates naturally through the retry/hedging layers and triggers a retry on the next attempt.


Production Considerations

Telemetry hooks

Use the :on_retry callback in BackoffRetry and :on_hedge in Hedged to emit telemetry events for observability:

defmodule MyApp.ResiliencyTelemetry do
  @moduledoc """
  Telemetry callbacks for resilience patterns.
  """

  def on_retry(attempt, delay, error) do
    :telemetry.execute(
      [:my_app, :resilience, :retry],
      %{attempt: attempt, delay_ms: delay},
      %{error: error}
    )
  end

  def on_hedge(attempt) do
    :telemetry.execute(
      [:my_app, :resilience, :hedge],
      %{attempt: attempt},
      %{}
    )
  end
end

Wire them into your calls:

Resiliency.BackoffRetry.retry(fun,
  on_retry: &MyApp.ResiliencyTelemetry.on_retry/3
)

Resiliency.Hedged.run(tracker, fun,
  on_hedge: &MyApp.ResiliencyTelemetry.on_hedge/1
)

The Hedged.Tracker also exposes stats/1 for periodic scraping:

# In a periodic reporter or health check
stats = Resiliency.Hedged.Tracker.stats(MyApp.Search.Tracker)
# => %{total_requests: 12450, hedged_requests: 623, hedge_won: 198,
#       p50: 12, p95: 87, p99: 340, current_delay: 87, tokens: 7.3}

Circuit breakers

Resiliency.CircuitBreaker sits as the outermost wrapper, before SingleFlight, making a binary decision -- call or reject -- before any other work happens. This prevents retries and hedges from running against a service that is known to be down:

case Resiliency.CircuitBreaker.call(MyApp.PaymentBreaker, fn ->
       Resiliency.SingleFlight.flight(@flight, key, fn ->
         # ... semaphore + retry + hedge ...
       end)
     end) do
  {:ok, result} -> result
  {:error, :circuit_open} -> {:error, :service_degraded}
  {:error, reason} -> {:error, reason}
end

Graceful degradation

Design your callers to degrade gracefully when the resilience stack fails:

defmodule MyApp.ProductPage do
  @moduledoc """
  Product page assembly with graceful degradation.
  """

  def render(product_id) do
    product = MyApp.Catalog.get!(product_id)

    inventory =
      case MyApp.Inventory.get_inventory(product.sku) do
        {:ok, data} -> data
        {:error, _reason} -> %{available: nil, message: "Check back shortly"}
      end

    reviews =
      case MyApp.Reviews.get_reviews(product_id) do
        {:ok, data} -> data
        {:error, _reason} -> []
      end

    %{product: product, inventory: inventory, reviews: reviews}
  end
end

The core product data is required -- if it fails, the page fails. But inventory and reviews are optional. When the inventory API is down and all retries are exhausted, the page still renders with a placeholder message instead of a 500 error.

Timeout budgets

When stacking multiple patterns, be deliberate about timeout budgets. A common mistake is setting generous timeouts at every layer, causing requests to hang for far too long:

Layer                  | Timeout | Rationale
HTTP client            | 3 s     | Single request deadline
Hedged                 | 4 s     | Slightly above the HTTP timeout so the hedge can complete
BackoffRetry :budget   | 8 s     | Total time for all retry attempts combined
WeightedSemaphore      | 10 s    | Includes queuing time waiting for a permit
Caller-facing deadline | 12 s    | Outermost deadline the user experiences

The outer timeout must always be larger than the inner timeout. If your semaphore timeout is shorter than your retry budget, the semaphore will kill the request while retries are still in progress.
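A quick sanity check of these budgets, using the retry settings from the inventory example (base_delay 100 ms, doubling, capped at 2 s). This is illustrative arithmetic, not a library API:

```elixir
# Deadlines from the table above, in milliseconds.
http_timeout = 3_000
hedged_timeout = 4_000
retry_budget = 8_000
semaphore_timeout = 10_000
caller_deadline = 12_000

# Worst-case sleep between 3 attempts with exponential backoff:
backoff_sleeps =
  1..2
  |> Enum.map(fn n -> min(100 * Integer.pow(2, n - 1), 2_000) end)
  |> Enum.sum()
# 100 + 200 = 300 ms of sleeping between attempts

# Every layer's deadline must exceed the layer it wraps:
true = hedged_timeout > http_timeout
true = retry_budget > hedged_timeout + backoff_sleeps
true = semaphore_timeout > retry_budget
true = caller_deadline > semaphore_timeout
```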

Abort on non-retryable errors

Use BackoffRetry.abort/1 to short-circuit the retry loop when you know the error is permanent:

Resiliency.BackoffRetry.retry(fn ->
  case HttpClient.post(url, payload) do
    {:ok, %{status: 200, body: body}} ->
      {:ok, body}

    {:ok, %{status: 400, body: body}} ->
      # Client error -- retrying won't help
      {:error, Resiliency.BackoffRetry.abort({:bad_request, body})}

    {:ok, %{status: 503}} ->
      {:error, :service_unavailable}

    {:error, reason} ->
      {:error, reason}
  end
end)

This prevents wasting retry budget on errors that will never succeed, and it propagates the abort through SingleFlight to all waiting callers immediately.