This guide walks you through the core modules in Resiliency with real-world examples you can paste into an iex session or a Mix project. By the end you will know how to retry flaky calls, hedge slow requests, deduplicate concurrent work, race tasks, and rate-limit access to shared resources.

Installation

Add resiliency to your dependencies in mix.exs:

def deps do
  [
    {:resiliency, "~> 0.6.0"}
  ]
end

Then fetch and compile:

mix deps.get
mix compile

Resiliency has zero runtime dependencies -- every module is self-contained and ready to use without starting any extra applications.


Your First Retry

Resiliency.BackoffRetry retries a function on failure with configurable backoff. No macros, no processes -- just a function call.

1. Retry with defaults

The simplest form retries up to 3 times with exponential backoff (100 ms, 200 ms, 400 ms):

result =
  Resiliency.BackoffRetry.retry(fn ->
    # File.read/1 already returns {:ok, contents} or {:error, reason} -- no wrapping needed
    File.read("/tmp/config.json")
  end)

# After 3 failed attempts:
# => {:error, :enoent}

2. Customize the strategy

Imagine you are calling a flaky third-party API that occasionally fails with :service_unavailable. You want 5 attempts, linear backoff starting at 200 ms, and a cap of 2 seconds per delay:

Resiliency.BackoffRetry.retry(
  fn ->
    # Simulate an HTTP call -- replace with your real client
    case :rand.uniform(3) do
      1 -> {:ok, %{status: 200, body: "OK"}}
      _ -> {:error, :service_unavailable}
    end
  end,
  max_attempts: 5,
  backoff: :linear,
  base_delay: 200,
  max_delay: 2_000
)

3. Filter which errors are retryable

Not every error deserves a retry. Use :retry_if to short-circuit on permanent failures:

Resiliency.BackoffRetry.retry(
  fn ->
    case :rand.uniform(4) do
      1 -> {:ok, "success"}
      2 -> {:error, :timeout}
      3 -> {:error, :econnrefused}
      4 -> {:error, :not_found}
    end
  end,
  max_attempts: 5,
  retry_if: fn
    {:error, :timeout} -> true
    {:error, :econnrefused} -> true
    _other -> false
  end
)

Here :not_found is treated as a permanent failure and returned immediately without consuming additional attempts.

4. Abort early from inside the function

If the function itself discovers that retrying is pointless, wrap the reason in BackoffRetry.abort/1:

Resiliency.BackoffRetry.retry(
  fn ->
    case :rand.uniform(3) do
      1 -> {:ok, "payload"}
      2 -> {:error, :timeout}
      3 -> {:error, Resiliency.BackoffRetry.abort(:invalid_api_key)}
    end
  end,
  max_attempts: 10
)

# When abort is hit:
# => {:error, :invalid_api_key}

The abort stops retries immediately, regardless of remaining attempts or the :retry_if predicate.
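
Based on that behavior, here is a minimal sketch of abort overriding :retry_if -- the predicate would have allowed a retry, but the abort wins:

Resiliency.BackoffRetry.retry(
  fn -> {:error, Resiliency.BackoffRetry.abort(:timeout)} end,
  max_attempts: 10,
  retry_if: fn
    {:error, :timeout} -> true
    _other -> false
  end
)

# => {:error, :timeout} after a single attempt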

5. Add observability with :on_retry

Log every retry so you can correlate failures with your monitoring stack:

require Logger

Resiliency.BackoffRetry.retry(
  fn ->
    case :rand.uniform(2) do
      1 -> {:ok, "data"}
      2 -> {:error, :timeout}
    end
  end,
  max_attempts: 4,
  backoff: :exponential,
  on_retry: fn attempt, delay_ms, error ->
    Logger.warning(
      "Retry attempt #{attempt}, sleeping #{delay_ms}ms after #{inspect(error)}"
    )
  end
)

6. Set a time budget

When you have a hard deadline -- say, an HTTP request timeout of 3 seconds -- use :budget to stop retrying once the budget is exhausted, even if you have attempts left:

Resiliency.BackoffRetry.retry(
  fn -> {:error, :timeout} end,
  max_attempts: 20,
  backoff: :constant,
  base_delay: 1_000,
  budget: 3_000
)
# Stops after ~3 seconds, not after 20 attempts

Your First Hedged Request

Resiliency.Hedged fires a backup request after a delay and returns whichever finishes first -- a technique from Google's "Tail at Scale" paper for cutting tail latency. It supports two modes: stateless (fixed delay) and stateful (adaptive, percentile-based delay).

Stateless mode -- fixed delay

Use this when you know a reasonable delay up front, or when you do not need adaptive tuning.

1. Basic hedged call

Send a hedge after 150 ms. If the first request has not finished by then, a second copy fires. The first success wins and the loser is cancelled:

{:ok, body} =
  Resiliency.Hedged.run(
    fn ->
      # Simulate a database query with variable latency
      Process.sleep(Enum.random(50..300))
      {:ok, %{rows: [%{id: 1, name: "Alice"}]}}
    end,
    delay: 150,
    timeout: 5_000
  )

2. Increase the fan-out

By default, at most 2 requests fly concurrently (the original plus one hedge). Bump :max_requests to fan out further:

{:ok, result} =
  Resiliency.Hedged.run(
    fn ->
      Process.sleep(Enum.random(10..500))
      {:ok, "response from replica"}
    end,
    delay: 100,
    max_requests: 3,
    timeout: 2_000
  )

Stateful mode -- adaptive delay

In production, the right delay shifts as latency changes. Start a Resiliency.Hedged tracker -- it records latencies and computes the hedge delay as a percentile of observed values.

1. Start the tracker

Add it to your supervision tree (or start it manually for experimentation):

{:ok, _pid} =
  Resiliency.Hedged.start_link(
    name: MyApp.SearchHedge,
    percentile: 95,
    initial_delay: 100
  )

The tracker begins with an initial_delay of 100 ms and switches to the observed p95 once it has collected enough samples.

2. Run hedged calls through the tracker

Pass the tracker name as the first argument instead of options:

{:ok, result} =
  Resiliency.Hedged.run(MyApp.SearchHedge, fn ->
    # Imagine this hits a search service with variable latency
    Process.sleep(Enum.random(20..200))
    {:ok, [%{title: "Elixir in Action"}]}
  end)

Each call records its latency. Over time the tracker learns the latency distribution and adjusts the hedge delay automatically. A built-in token bucket prevents hedge storms under sustained load.
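
To watch the adaptation happen, a quick iex experiment -- the latency distribution below is simulated, and the ~48 ms figure is just its p95, not a library guarantee:

for _ <- 1..200 do
  Resiliency.Hedged.run(MyApp.SearchHedge, fn ->
    Process.sleep(Enum.random(5..50))
    {:ok, :hit}
  end)
end

# After enough samples, the hedge delay tracks the observed p95 (~48 ms for
# this distribution) instead of the initial 100 ms, so hedges fire only for
# genuinely slow calls.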

3. Supervision tree integration

For production use, add the tracker as a child:

# In your Application module
children = [
  {Resiliency.Hedged, name: MyApp.SearchHedge, percentile: 95}
  # ... other children
]

Supervisor.start_link(children, strategy: :one_for_one)

Deduplicating with SingleFlight

Resiliency.SingleFlight ensures that when many processes request the same expensive computation concurrently, the function executes only once. All callers receive the same result. This is invaluable for cache stampede prevention.

1. Start the server

{:ok, _pid} = Resiliency.SingleFlight.start_link(name: MyApp.Flights)

2. Deduplicate a database lookup

Suppose 50 requests arrive simultaneously for user 42. Without SingleFlight, you hit the database 50 times. With it, you hit it once:

{:ok, user} =
  Resiliency.SingleFlight.flight(MyApp.Flights, "user:42", fn ->
    # Only one process executes this, even if 50 call concurrently
    Process.sleep(100)
    %{id: 42, name: "Bob", email: "bob@example.com"}
  end)

All 50 callers skip the I/O -- they wait for the single execution to complete, then receive {:ok, %{id: 42, ...}} without each doing their own round-trip.
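
You can watch the deduplication happen in iex. A minimal sketch with 50 concurrent callers -- the IO.puts is there only to prove the function body runs once:

tasks =
  for _ <- 1..50 do
    Task.async(fn ->
      Resiliency.SingleFlight.flight(MyApp.Flights, "user:42", fn ->
        IO.puts("executing once")
        Process.sleep(100)
        %{id: 42, name: "Bob", email: "bob@example.com"}
      end)
    end)
  end

Task.await_many(tasks)
# "executing once" prints a single time; all 50 tasks get the same {:ok, user}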

3. Forget a key

After a write, you may want the next read to bypass the in-flight deduplication and fetch fresh data:

:ok = Resiliency.SingleFlight.forget(MyApp.Flights, "user:42")

Existing waiters still receive the original result. Only new callers after forget/2 trigger a fresh execution.
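
A typical write path, sketched -- update_user_email/2 and fetch_user_from_db/1 stand in for your own persistence code:

:ok = update_user_email(42, "bob@new.example.com")   # hypothetical write
:ok = Resiliency.SingleFlight.forget(MyApp.Flights, "user:42")

# The next caller starts a fresh execution instead of joining a stale flight
{:ok, fresh_user} =
  Resiliency.SingleFlight.flight(MyApp.Flights, "user:42", fn ->
    fetch_user_from_db(42)   # hypothetical read
  end)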

4. Caller-side timeout

If you cannot afford to wait forever for a slow in-flight call, pass a timeout. The calling process exits, but the in-flight function continues so other waiters still get their result:

try do
  Resiliency.SingleFlight.flight(MyApp.Flights, "slow-key", fn ->
    Process.sleep(10_000)
    :result
  end, 1_000)
rescue
  _ -> :timed_out
catch
  :exit, {:timeout, _} -> :timed_out
end

5. Supervision tree integration

children = [
  {Resiliency.SingleFlight, name: MyApp.Flights}
  # ... other children
]

Supervisor.start_link(children, strategy: :one_for_one)

Racing Tasks

Resiliency.Race, Resiliency.AllSettled, Resiliency.Map, and Resiliency.FirstOk provide higher-level concurrency combinators that are stateless -- no GenServer, no supervision tree entry.

Race.run/1 -- first success wins

Fire multiple strategies in parallel and take whichever succeeds first. Losers are killed automatically:

{:ok, data} =
  Resiliency.Race.run([
    fn ->
      # Try the local cache
      Process.sleep(5)
      :cached_value
    end,
    fn ->
      # Fall back to the database
      Process.sleep(50)
      :db_value
    end
  ])

# => {:ok, :cached_value}

If a task crashes, the race continues with the remaining tasks:

{:ok, :backup} =
  Resiliency.Race.run([
    fn -> raise "primary is down" end,
    fn -> :backup end
  ])

AllSettled.run/1 -- collect everything

Run tasks in parallel and wait for all of them. Crashes do not propagate to the caller -- each slot gets {:ok, value} or {:error, reason}:

results =
  Resiliency.AllSettled.run([
    fn -> {:ok, "service_a response"} end,
    fn -> raise "service_b is broken" end,
    fn -> {:ok, "service_c response"} end
  ])

# => [{:ok, {:ok, "service_a response"}},
#     {:error, {%RuntimeError{message: "service_b is broken"}, _stacktrace}},
#     {:ok, {:ok, "service_c response"}}]

Resiliency.Map.run/3 -- bounded-concurrency parallel map

Process a list of items in parallel with a concurrency cap. On the first failure, all remaining work is cancelled:

urls = [
  "https://api.example.com/users/1",
  "https://api.example.com/users/2",
  "https://api.example.com/users/3",
  "https://api.example.com/users/4",
  "https://api.example.com/users/5"
]

{:ok, responses} =
  Resiliency.Map.run(
    urls,
    fn url ->
      # Simulate fetching each URL
      Process.sleep(Enum.random(10..50))
      %{url: url, status: 200}
    end,
    max_concurrency: 3
  )

# responses is in the same order as urls
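
The failure path, sketched from the cancel-on-first-error behavior described above (the exact shape of the error tuple is not shown here):

Resiliency.Map.run(
  [1, 2, 3, 4, 5],
  fn
    3 -> raise "boom"
    n -> n * 10
  end,
  max_concurrency: 2
)

# Returns an error tuple; in-flight and queued items are cancelled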

FirstOk.run/1 -- sequential fallback chain

Try data sources one at a time. Stop at the first success. Later sources are never called if an earlier one succeeds:

{:ok, value} =
  Resiliency.FirstOk.run([
    fn ->
      # L1 cache miss
      {:error, :not_found}
    end,
    fn ->
      # L2 cache miss
      {:error, :not_found}
    end,
    fn ->
      # Database hit
      {:ok, %{id: 1, name: "Alice"}}
    end,
    fn ->
      # Remote API -- never called because the DB succeeded
      {:ok, %{id: 1, name: "Alice (stale)"}}
    end
  ])

# => {:ok, %{id: 1, name: "Alice"}}

Rate Limiting with RateLimiter

Resiliency.RateLimiter controls how many calls can execute per second using a token-bucket algorithm. When the bucket is empty, callers are rejected immediately with a retry_after_ms hint.

1. Start the rate limiter

{:ok, _pid} =
  Resiliency.RateLimiter.start_link(
    name: MyApp.ApiRateLimiter,
    rate: 100.0,
    burst_size: 10
  )

rate is the refill rate in tokens per second; burst_size is the initial and maximum bucket size. With the settings above, you can burst 10 calls at once, then sustain 100 calls per second (one new token roughly every 10 ms).

2. Basic call

case Resiliency.RateLimiter.call(MyApp.ApiRateLimiter, fn ->
  HttpClient.get("https://api.example.com/data")
end) do
  {:ok, response} -> handle_response(response)
  {:error, {:rate_limited, retry_after_ms}} -> {:error, {:overloaded, retry_after_ms}}
  {:error, reason} -> {:error, reason}
end

When rate limited, retry_after_ms tells the caller how many milliseconds to wait before retrying.
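
One way to honor the hint, sketched below; fetch_data/0 stands in for your real work function:

case Resiliency.RateLimiter.call(MyApp.ApiRateLimiter, fn -> fetch_data() end) do
  {:error, {:rate_limited, retry_after_ms}} ->
    # Wait exactly as long as the limiter suggests, then try once more
    Process.sleep(retry_after_ms)
    Resiliency.RateLimiter.call(MyApp.ApiRateLimiter, fn -> fetch_data() end)

  other ->
    other
end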

3. Weighted calls

More expensive operations can consume more tokens:

# 1 token for a lightweight read (default)
Resiliency.RateLimiter.call(MyApp.ApiRateLimiter, fn -> get_user(id) end)

# 5 tokens for a bulk operation
Resiliency.RateLimiter.call(MyApp.ApiRateLimiter, fn -> bulk_fetch(ids) end, weight: 5)

4. Reset and inspect

# Reset to a full bucket (e.g., in tests or after manual intervention)
:ok = Resiliency.RateLimiter.reset(MyApp.ApiRateLimiter)

# Inspect current token count without consuming tokens
%{tokens: _, rate: 100.0, burst_size: 10} =
  Resiliency.RateLimiter.get_stats(MyApp.ApiRateLimiter)

5. Rejection callback

Fire a callback whenever a call is rate limited:

require Logger

Resiliency.RateLimiter.start_link(
  name: MyApp.ApiRateLimiter,
  rate: 100.0,
  burst_size: 10,
  on_reject: fn name ->
    Logger.warning("#{inspect(name)}: rate limit hit")
  end
)

6. Supervision tree integration

children = [
  {Resiliency.RateLimiter,
   name: MyApp.ApiRateLimiter,
   rate: 100.0,
   burst_size: 10}
]

Supervisor.start_link(children, strategy: :one_for_one)

Rate Limiting with WeightedSemaphore

Resiliency.WeightedSemaphore bounds concurrent access to a shared resource. Unlike a standard semaphore, each acquisition can specify a weight -- useful when different operations have different costs (a bulk insert costs more than a single read).

1. Start the semaphore

{:ok, _pid} =
  Resiliency.WeightedSemaphore.start_link(name: MyApp.DbPool, max: 10)

2. Acquire with default weight (1 permit)

{:ok, user} =
  Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, fn ->
    # Runs inside a managed process -- permits auto-release on completion
    Process.sleep(10)
    %{id: 1, name: "Alice"}
  end)

3. Acquire with a heavier weight

A bulk import might consume 5 of your 10 permits, leaving room for only 5 lightweight reads:

{:ok, :imported} =
  Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 5, fn ->
    # Holds 5 permits for the duration
    Process.sleep(100)
    :imported
  end)

4. Non-blocking try

When the system is under load, skip optional work instead of queuing:

case Resiliency.WeightedSemaphore.try_acquire(MyApp.DbPool, 3, fn ->
  :analytics_write
end) do
  {:ok, :analytics_write} ->
    :ok

  :rejected ->
    # Semaphore is full or a larger waiter is ahead -- drop this work
    :skipped
end

5. Acquire with a timeout

Block for at most 1 second. If permits are not available by then, give up:

case Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 3, fn ->
  :result
end, 1_000) do
  {:ok, :result} -> :ok
  {:error, :timeout} -> :gave_up
end

6. Supervision tree integration

In production, start the semaphore under your application supervisor so it restarts automatically on failure:

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      {Resiliency.WeightedSemaphore, name: MyApp.DbPool, max: 10},
      {Resiliency.WeightedSemaphore, name: MyApp.ExternalApi, max: 5}
      # ... your other children
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

FIFO fairness

Waiters are served in strict FIFO order. If a weight-8 request is at the front of the queue but only 5 permits are free, a weight-1 request behind it also blocks -- even though it would fit. This prevents starvation of large requests.
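
A sketch of that scenario, assuming the MyApp.DbPool semaphore from above (max: 10) is running:

# Occupy 6 permits for half a second, leaving 4 free
holder =
  Task.async(fn ->
    Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 6, fn ->
      Process.sleep(500)
      :held
    end)
  end)

Process.sleep(10)

# Needs 8 permits but only 4 are free -- queues at the front
big =
  Task.async(fn ->
    Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 8, fn -> :big end)
  end)

Process.sleep(10)

# Would fit in the 4 free permits, but FIFO order makes it wait behind big
small =
  Task.async(fn ->
    Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn -> :small end)
  end)

Task.await_many([holder, big, small], 2_000)
# small is granted only after big -- strict FIFO, even though small would have fit earlier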


Your First Bulkhead

Resiliency.Bulkhead isolates workloads by limiting how many concurrent calls can execute against a downstream service. When the limit is reached, callers are either rejected immediately or queued for a configurable wait time. This prevents one slow service from consuming all available resources.

1. Start the bulkhead

{:ok, _pid} =
  Resiliency.Bulkhead.start_link(
    name: MyApp.PaymentBulkhead,
    max_concurrent: 10
  )

2. Basic call

case Resiliency.Bulkhead.call(MyApp.PaymentBulkhead, fn ->
  HttpClient.post("https://payments.example.com/charge", payload)
end) do
  {:ok, response} -> handle_response(response)
  {:error, :bulkhead_full} -> {:error, :overloaded}
  {:error, reason} -> {:error, reason}
end

By default, max_wait is 0 -- callers are rejected immediately when the bulkhead is full. This is the safest default for latency-sensitive services.

3. Wait for a permit

If you prefer callers to queue instead of failing immediately, set max_wait:

{:ok, _pid} =
  Resiliency.Bulkhead.start_link(
    name: MyApp.PaymentBulkhead,
    max_concurrent: 10,
    max_wait: 5_000
  )

Now callers wait up to 5 seconds for a permit; if none frees up in time, they still receive {:error, :bulkhead_full}.

4. Per-call override

You can override the server's default max_wait on a per-call basis:

# This call waits up to 1 second, regardless of server default
Resiliency.Bulkhead.call(MyApp.PaymentBulkhead, fn ->
  HttpClient.post(url, payload)
end, max_wait: 1_000)

5. Monitor with callbacks

require Logger

Resiliency.Bulkhead.start_link(
  name: MyApp.PaymentBulkhead,
  max_concurrent: 10,
  on_call_permitted: fn name ->
    Logger.info("#{inspect(name)}: call permitted")
  end,
  on_call_rejected: fn name ->
    Logger.warning("#{inspect(name)}: call rejected -- bulkhead full")
  end,
  on_call_finished: fn name ->
    Logger.info("#{inspect(name)}: call finished")
  end
)

6. Supervision tree integration

children = [
  {Resiliency.Bulkhead,
   name: MyApp.PaymentBulkhead,
   max_concurrent: 10,
   max_wait: 5_000}
  # ... other children
]

Supervisor.start_link(children, strategy: :one_for_one)

Your First Circuit Breaker

Resiliency.CircuitBreaker monitors call outcomes and "trips" when the failure rate exceeds a threshold. While tripped, calls are rejected immediately -- giving the downstream service time to recover. After a cool-down period, probe calls verify the service is healthy before resuming full traffic.

1. Start the breaker

{:ok, _pid} =
  Resiliency.CircuitBreaker.start_link(
    name: MyApp.Breaker,
    failure_rate_threshold: 0.5,
    minimum_calls: 10,
    open_timeout: 30_000
  )

2. Basic call

case Resiliency.CircuitBreaker.call(MyApp.Breaker, fn ->
  HttpClient.get("https://api.example.com/data")
end) do
  {:ok, response} -> handle_response(response)
  {:error, :circuit_open} -> {:error, :service_degraded}
  {:error, reason} -> {:error, reason}
end

3. Custom failure classification

By default, {:ok, _} is a success and {:error, _} is a failure. Override with :should_record to ignore expected errors or treat specific successes as failures:

Resiliency.CircuitBreaker.start_link(
  name: MyApp.Breaker,
  should_record: fn
    {:ok, %{status: 200}} -> :success
    {:ok, %{status: 404}} -> :ignore   # not counted
    {:ok, %{status: 503}} -> :failure
    {:error, _}           -> :failure
    _                     -> :success
  end
)

4. Two-step API

When you cannot wrap the operation in a single function:

case Resiliency.CircuitBreaker.allow(MyApp.Breaker) do
  {:ok, record} ->
    result = do_work()
    record.(:success)   # or :failure or :ignore (one-shot, duplicates are no-ops)
    {:ok, result}

  {:error, :circuit_open} ->
    {:error, :service_degraded}
end

5. Manual control

# Force the circuit open (stays open until reset/force_close)
Resiliency.CircuitBreaker.force_open(MyApp.Breaker)

# Force it back to closed
Resiliency.CircuitBreaker.force_close(MyApp.Breaker)

# Reset to initial state
Resiliency.CircuitBreaker.reset(MyApp.Breaker)

# Inspect current state and statistics
Resiliency.CircuitBreaker.get_state(MyApp.Breaker)  # => :closed
Resiliency.CircuitBreaker.get_stats(MyApp.Breaker)   # => %{state: :closed, ...}

6. Supervision tree integration

children = [
  {Resiliency.CircuitBreaker,
   name: MyApp.Breaker,
   failure_rate_threshold: 0.5,
   open_timeout: 30_000}
  # ... other children
]

Supervisor.start_link(children, strategy: :one_for_one)

Combining Patterns

The real power of Resiliency comes from composing primitives. Here is a function that retries a flaky API call with exponential backoff, and on each attempt uses hedging to cut tail latency:

# Start the hedge tracker (do this once, typically in your supervision tree)
{:ok, _pid} =
  Resiliency.Hedged.start_link(name: MyApp.ApiHedge, percentile: 95)

defmodule MyApp.ResilientClient do
  def fetch_user(user_id) do
    Resiliency.BackoffRetry.retry(
      fn ->
        Resiliency.Hedged.run(MyApp.ApiHedge, fn ->
          # Replace with your real HTTP client
          case :rand.uniform(4) do
            1 -> {:ok, %{id: user_id, name: "Alice"}}
            2 -> {:error, :timeout}
            3 -> {:error, :service_unavailable}
            4 ->
              Process.sleep(500)
              {:ok, %{id: user_id, name: "Alice"}}
          end
        end)
      end,
      max_attempts: 3,
      backoff: :exponential,
      base_delay: 100,
      retry_if: fn
        {:error, :timeout} -> true
        {:error, :service_unavailable} -> true
        _ -> false
      end
    )
  end
end

MyApp.ResilientClient.fetch_user(42)

Each attempt fires a hedged request, fanning out to a second copy only if the first is slow. If the attempt fails, backoff kicks in before the next one. The result is a function that tolerates both transient errors (via retry) and slow responses (via hedging).
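
The same idea extends to the other primitives. For example, wrapping the retried call in the circuit breaker from earlier stops a persistently failing service from being hammered at all -- assuming MyApp.Breaker is running:

Resiliency.CircuitBreaker.call(MyApp.Breaker, fn ->
  MyApp.ResilientClient.fetch_user(42)
end)

# => fetch_user's result, or {:error, :circuit_open} while the breaker is tripped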


Next Steps

Now that you have the fundamentals, explore further: