Resiliency ships a focused set of modules that each solve a distinct reliability or concurrency problem. This guide helps you pick the right one -- or the right combination -- for your situation.

Decision Tree

Start from the problem you are trying to solve and follow the branch.


"My downstream service is failing and I want to stop calling it"

Use Resiliency.CircuitBreaker -- monitor failure rates and trip the circuit when the downstream is unhealthy. Probe calls automatically test recovery.

children = [{Resiliency.CircuitBreaker, name: MyBreaker, failure_rate_threshold: 0.5}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.CircuitBreaker.call(MyBreaker, fn ->
  HttpClient.get!(url)
end) do
  {:ok, body} -> body
  {:error, :circuit_open} -> {:error, :service_degraded}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Reduces load -- stops calling a failing service entirely.
  • Automatic recovery -- probes the service after a cool-down period.
  • Failure-rate-based (not just consecutive failures) with a sliding window.
  • Stateful -- requires a GenServer.

"My requests sometimes fail"

Use Resiliency.BackoffRetry -- retry the operation with configurable backoff so you do not hammer the downstream service.

Resiliency.BackoffRetry.retry(
  fn ->
    case HttpClient.get(url) do
      {:ok, %{status: 200}} = ok -> ok
      {:ok, %{status: 503}}      -> {:error, :unavailable}
      {:error, reason}           -> {:error, reason}
    end
  end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  retry_if: fn
    {:error, :unavailable} -> true
    {:error, :timeout}     -> true
    _                      -> false
  end
)

Key traits:

  • Adds latency (each retry waits for the backoff delay).
  • Does not add concurrent load -- attempts are sequential.
  • Stateless -- no process to start.

"My requests are sometimes slow"

Use Resiliency.Hedged -- send a backup request after a delay, take whichever finishes first, cancel the loser. This is a tail-latency optimization, not a retry strategy.

# Adaptive mode -- delay auto-tunes from observed latency
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge, percentile: 95)

{:ok, body} = Resiliency.Hedged.run(MyHedge, fn ->
  HttpClient.get!(url)
end)

# Stateless mode -- fixed delay, no process needed
{:ok, body} = Resiliency.Hedged.run(fn -> HttpClient.get!(url) end, delay: 50)

Key traits:

  • Reduces tail latency at the cost of extra requests.
  • Does add load -- a hedge fires a second (or Nth) request.
  • Adaptive mode is stateful (a GenServer tracker); stateless mode is not.

"Multiple callers request the same thing at the same time"

Use Resiliency.SingleFlight -- deduplicate concurrent calls so the function executes only once per key, and all waiters share the result.

{:ok, _} = Resiliency.SingleFlight.start_link(name: MyFlights)

{:ok, user} = Resiliency.SingleFlight.flight(MyFlights, "user:123", fn ->
  Repo.get!(User, 123)
end)

Key traits:

  • Reduces load -- N concurrent callers produce exactly 1 execution.
  • Saves latency for callers that arrive after the first -- they skip the I/O entirely and receive the result as soon as the in-flight call completes.
  • Stateful -- requires a GenServer.

"I need to isolate workloads with per-partition concurrency limits"

Use Resiliency.Bulkhead -- named concurrency limiter that isolates workloads into separate partitions with rejection semantics.

children = [{Resiliency.Bulkhead, name: MyApp.PaymentBulkhead, max_concurrent: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.Bulkhead.call(MyApp.PaymentBulkhead, fn ->
  PaymentGateway.charge(amount)
end) do
  {:ok, result} -> result
  {:error, :bulkhead_full} -> {:error, :overloaded}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Runs the function in the caller's process -- the GenServer is never blocked.
  • Server-managed wait queue with configurable max_wait and FIFO fairness.
  • Immediate rejection when full (max_wait: 0) or queued waiting (max_wait: N).
  • Per-call max_wait overrides.
  • Stateful -- requires a GenServer.

"I need to limit concurrent access to a resource"

Use Resiliency.WeightedSemaphore -- bound concurrency with per-operation weights and FIFO fairness.

children = [{Resiliency.WeightedSemaphore, name: MyApp.DbPool, max: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

# Lightweight read -- 1 permit
{:ok, row} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
  Repo.get(User, id)
end)

# Heavy bulk insert -- 5 permits
{:ok, _} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 5, fn ->
  Repo.insert_all(Event, batch)
end)

Key traits:

  • Adds latency only when the semaphore is saturated.
  • Reduces load on the protected resource.
  • Stateful -- requires a GenServer.
  • Permits are auto-released on normal return, raise, exit, or throw -- no leaks.

"I need to limit how many calls execute per second"

Use Resiliency.RateLimiter -- token-bucket rate limiter that rejects calls immediately when the bucket is empty, returning a retry_after_ms hint.

children = [{Resiliency.RateLimiter, name: MyRateLimiter, rate: 100.0, burst_size: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.RateLimiter.call(MyRateLimiter, fn ->
  HttpClient.get(url)
end) do
  {:ok, response} -> handle_response(response)
  {:error, {:rate_limited, retry_after_ms}} -> {:error, {:overloaded, retry_after_ms}}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Rejects immediately -- no queuing, no blocking.
  • Returns a retry_after_ms hint so callers know when to try again.
  • Weighted calls -- expensive operations consume more tokens (:weight option).
  • Lock-free ETS hot path -- no GenServer message on the grant or reject path.
  • Stateful -- requires a GenServer (table owner + reset).

"I need to run tasks in parallel with richer semantics than Task"

Use the stateless task combinators -- Resiliency.Race, Resiliency.AllSettled, Resiliency.Map, and Resiliency.FirstOk.

Race -- first success wins, losers are killed:

{:ok, fastest} = Resiliency.Race.run([
  fn -> fetch_from_region(:us_east) end,
  fn -> fetch_from_region(:eu_west) end
])

Parallel map -- bounded concurrency, cancels on first error:

{:ok, pages} = Resiliency.Map.run(urls, &fetch/1, max_concurrency: 10)

All settled -- never short-circuits, collects every result:

results = Resiliency.AllSettled.run([
  fn -> risky_a() end,
  fn -> risky_b() end
])
# => [{:ok, _}, {:error, _}]

First ok -- sequential fallback chain:

{:ok, value} = Resiliency.FirstOk.run([
  fn -> check_l1_cache(key) end,
  fn -> check_l2_cache(key) end,
  fn -> query_database(key) end
])

Key traits:

  • Stateless -- no process to start.
  • Task crashes never crash the caller.
  • Results are always in input order (for map and all_settled).

Full Comparison Table

| Pattern | Problem | Adds Latency? | Adds Load? | Stateful? | Best For |
|---|---|---|---|---|---|
| CircuitBreaker | Sustained downstream failures | No -- rejects immediately when open | No -- stops calling the downstream | Yes | Protecting against cascading failures, failing fast when a service is down |
| BackoffRetry | Transient failures | Yes -- backoff delays between attempts | No -- sequential attempts | No | HTTP calls, database queries, anything with intermittent errors |
| Hedged | Tail latency | No -- reduces p99 | Yes -- fires extra requests | Adaptive: yes; Stateless: no | Latency-sensitive RPCs, fan-out queries, cache lookups |
| SingleFlight | Thundering herd / duplicate work | No -- late arrivals skip the I/O and share the result | No -- reduces load by deduplication | Yes | Cache population, config reloads, expensive computations with shared keys |
| Bulkhead | Workload isolation | When waiting -- callers queue or reject | No -- caps it | Yes | Per-service concurrency limits, workload isolation, load shedding |
| WeightedSemaphore | Unbounded concurrency | When saturated -- callers queue | No -- caps it | Yes | Database pools, disk I/O, GPU access |
| RateLimiter | Too many calls per second | No -- rejects immediately | No -- caps it | Yes | External API rate limits, smoothing bursty traffic |
| Race | Need the fastest result from N sources | No -- returns the first success | Yes -- runs all concurrently | No | Multi-region fetch, redundant providers |
| Map | Parallel processing with a concurrency cap | No (unless saturated) | Bounded by max_concurrency | No | Bulk HTTP fetches, batch processing |
| AllSettled | Run all, tolerate individual failures | No | Yes -- runs all concurrently | No | Health checks, non-critical side effects, audit logging |
| FirstOk | Sequential fallback chain | Yes -- tries one at a time | No -- sequential | No | Cache/DB/API tiered lookups |

"Why not just use..."

fuse / ex_break

fuse is an Erlang library last released in 2021. It lacks a half-open state, sliding window failure rates, slow call detection, and its API is non-idiomatic for Elixir. ex_break provides basic circuit breaking but no sliding window or percentage-based thresholds. Resiliency.CircuitBreaker provides Resilience4j-quality features: count-based sliding window with O(1) failure-rate computation, configurable slow call thresholds, half-open probing, custom failure classification via should_record, and callback-based observability -- all with an API that matches the rest of this library.
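As a hedged sketch of those features combined -- option names come from the parameter quick-reference later in this guide, while CatalogBreaker is a hypothetical name and the response shapes passed to should_record are illustrative assumptions:

```elixir
# Sketch only: a breaker that ignores expected 404s, counts slow calls
# toward tripping, and logs state transitions.
children = [
  {Resiliency.CircuitBreaker,
   name: CatalogBreaker,
   window_size: 200,
   failure_rate_threshold: 0.5,
   slow_call_threshold: 800,        # ms -- roughly our p99
   slow_call_rate_threshold: 0.3,   # trip when 30% of calls are slow
   should_record: fn
     {:ok, %{status: 404}} -> :ignore    # expected misses are not failures
     {:ok, _} -> :success
     {:error, _} -> :failure
   end,
   on_state_change: fn name, from, to ->
     IO.puts("breaker #{inspect(name)}: #{from} -> #{to}")
   end}
]

Supervisor.start_link(children, strategy: :one_for_one)
```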

Task.async + Task.await

Task.await crashes the caller on timeout or task failure. Resiliency.Race.run and Resiliency.AllSettled.run handle failures gracefully -- crashed tasks are skipped or returned as {:error, reason}, and the caller never crashes. Resiliency.Map.run also cancels remaining work on first error, which Task.async_stream does not do.

GenServer.call with a timeout

A GenServer.call timeout exits the caller but does not stop the server from processing the request. Resiliency.WeightedSemaphore.acquire/4 supports a caller-side timeout that returns {:error, :timeout} cleanly, and permits are auto-released regardless of outcome. Resiliency.SingleFlight.flight/4 similarly supports a timeout -- the in-flight function continues for other waiters, but your caller gets an exit it can catch.
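A minimal sketch of the clean-timeout path, assuming the timeout is passed as the final argument of acquire/4 (per the parameter quick-reference) and HeavyQuery is a hypothetical query module:

```elixir
# Sketch: wait at most 2s for a permit; a timeout returns a tuple instead
# of exiting the caller, and no permit is leaked.
case Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
       Repo.all(HeavyQuery)
     end, 2_000) do
  {:ok, rows} -> rows
  {:error, :timeout} -> {:error, :resource_busy}
end
```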

Process.send_after / :timer.sleep for retry

Rolling your own retry loop with Process.send_after or :timer.sleep means reimplementing backoff strategies, attempt counting, budgets, abort semantics, and on_retry callbacks. Resiliency.BackoffRetry.retry/2 handles all of this in a single function call with composable, stream-based backoff. It also supports reraise: true to preserve the original stacktrace -- something a hand-rolled loop typically drops.
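For example, a sketch combining a retry budget with reraise: true (ExternalApi is a hypothetical module that raises on failure):

```elixir
# Sketch: retry a raising function; if all attempts fail, the original
# exception propagates with its original stacktrace.
Resiliency.BackoffRetry.retry(
  fn -> ExternalApi.fetch!(id) end,
  max_attempts: 3,
  backoff: :exponential,
  base_delay: 100,
  budget: 2_000,      # stop retrying once 2s have elapsed overall
  reraise: true
)
```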

Raw Task.async_stream for parallel work

Task.async_stream returns a lazy stream, requires you to handle :exit tuples yourself, and does not cancel remaining work on failure. Resiliency.Map.run/3 returns {:ok, results} or {:error, reason}, cancels all in-flight tasks on the first failure, and preserves input order. If you need all results regardless of failure, use Resiliency.AllSettled.run/1 instead.

Spawning two tasks manually for hedging

Manually spawning a primary and a backup task, selecting the first result, and killing the loser is roughly what Resiliency.Hedged does -- but adaptive hedging also tracks latency percentiles and uses a token bucket to avoid stampeding the backend when it is already slow. The stateless mode is a drop-in replacement for the manual approach with cleaner semantics.


When to Combine Patterns

Patterns in this library compose naturally. Here are common combinations.

Retry + Hedge

Retry handles total failures; hedging handles slow responses. Use retry as the outer wrapper when the entire hedged call might fail:

Resiliency.BackoffRetry.retry(
  fn ->
    Resiliency.Hedged.run(MyHedge, fn -> HttpClient.get!(url) end)
  end,
  max_attempts: 3,
  backoff: :exponential
)

The hedge reduces tail latency on each individual attempt, and the retry recovers from complete failures across attempts.

Hedge + Semaphore

Hedging adds load. If the downstream service has limited capacity, wrap the hedged function body in a semaphore to cap total concurrent requests:

Resiliency.Hedged.run(MyHedge, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.ApiLimit, 1, fn ->
    HttpClient.get!(url)
  end)
end)

This way, even if multiple hedges fire concurrently, total concurrency against the backend stays bounded.

SingleFlight + Retry

Deduplicate first, then retry inside the flight function. This way N callers still collapse into one execution, and that one execution gets retry semantics:

Resiliency.SingleFlight.flight(MyFlights, cache_key, fn ->
  Resiliency.BackoffRetry.retry(fn ->
    ExpensiveService.fetch(cache_key)
  end, max_attempts: 3)
end)

Placing retry outside SingleFlight would defeat deduplication -- each retry attempt would be a separate flight.

SingleFlight + Semaphore

When you have many distinct keys but still want to bound total concurrency across all of them:

Resiliency.SingleFlight.flight(MyFlights, key, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
    Repo.get!(Resource, key)
  end)
end)

Parameter Quick-Reference

Resiliency.Bulkhead

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One bulkhead per downstream service or workload |
| :max_concurrent | non-negative integer | required | Set to the downstream's actual concurrency limit; 0 rejects all calls (kill-switch) |
| :max_wait | ms, 0, or :infinity | 0 | 0 for fail-fast; set a timeout for queue-based load leveling |
| :on_call_permitted | fn name -> any, or nil | nil | Use for telemetry -- track permitted calls |
| :on_call_rejected | fn name -> any, or nil | nil | Use for telemetry -- track rejected calls |
| :on_call_finished | fn name -> any, or nil | nil | Use for telemetry -- track completed calls |
| max_wait (per-call) | ms, 0, or :infinity | server default | Override the server default for specific calls |
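As a sketch of the per-call override -- the exact shape of the per-call option (a trailing keyword list on call) is an assumption, and render_report is a hypothetical function:

```elixir
# Sketch: fail-fast by default, but let a batch job queue for a slot.
children = [{Resiliency.Bulkhead, name: MyApp.ReportBulkhead, max_concurrent: 5, max_wait: 0}]
Supervisor.start_link(children, strategy: :one_for_one)

# Interactive path: reject immediately when the partition is full.
Resiliency.Bulkhead.call(MyApp.ReportBulkhead, fn -> render_report(id) end)

# Nightly batch: willing to wait up to 30s for a slot.
Resiliency.Bulkhead.call(MyApp.ReportBulkhead, fn -> render_report(id) end, max_wait: 30_000)
```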

Resiliency.CircuitBreaker

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One breaker per downstream service |
| :window_size | positive integer | 100 | Match to your traffic volume -- larger for high-throughput services |
| :failure_rate_threshold | float 0.0–1.0 | 0.5 | 0.5 is a good default; lower for critical services |
| :slow_call_threshold | ms or :infinity | :infinity | Set to your p99 latency to detect slow calls |
| :slow_call_rate_threshold | float 0.0–1.0 | 1.0 | Effectively disabled at 1.0; lower to trip on slow calls |
| :open_timeout | ms | 60_000 | Time before probing -- longer for services with slow recovery |
| :permitted_calls_in_half_open | positive integer | 1 | More probes give higher confidence but delay recovery |
| :minimum_calls | positive integer | 10 | Prevents tripping on small sample sizes |
| :should_record | fn returning :success, :failure, or :ignore | default | Classify results; :ignore is not counted |
| :on_state_change | fn name, from, to -> any, or nil | nil | Use for logging or telemetry |

Resiliency.BackoffRetry.retry/2

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :backoff | :exponential, :linear, :constant, or Enumerable | :exponential | Use :exponential for most network calls; :constant for polling loops |
| :base_delay | ms | 100 | Match to the downstream service's typical recovery time |
| :max_delay | ms | 5_000 | Cap at a value that keeps total retry time within your SLA |
| :max_attempts | positive integer | 3 | 3–5 for transient errors; 1 to disable retries |
| :budget | ms or :infinity | :infinity | Set to your overall timeout to avoid retrying past a deadline |
| :retry_if | fn {:error, reason} -> boolean | retries all errors | Always set this -- retry only transient/retriable errors |
| :on_retry | fn attempt, delay, error -> any | nil | Use for logging or metrics |
| :sleep_fn | fn ms -> any | Process.sleep/1 | Inject a no-op for tests |
| :reraise | boolean | false | Set true if you want exceptions to propagate with their original stacktrace |
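A sketch of the :sleep_fn escape hatch inside an ExUnit test (the surrounding test case module is omitted):

```elixir
# Sketch: inject a no-op sleep so a multi-attempt retry runs instantly.
test "retries transient failures without real delays" do
  {:ok, counter} = Agent.start_link(fn -> 0 end)

  result =
    Resiliency.BackoffRetry.retry(
      fn ->
        n = Agent.get_and_update(counter, fn n -> {n, n + 1} end)
        if n < 2, do: {:error, :unavailable}, else: {:ok, :done}
      end,
      max_attempts: 5,
      sleep_fn: fn _ms -> :ok end   # no actual sleeping in tests
    )

  assert result == {:ok, :done}
end
```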

Resiliency.Hedged.run/2 (stateless)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :delay | ms | 100 | Set to your p50--p95 latency; too low wastes requests, too high defeats the purpose |
| :max_requests | positive integer | 2 | 2 is usually enough; 3 for very high-value calls |
| :timeout | ms | 5_000 | Set to your overall deadline |
| :non_fatal | fn reason -> boolean | fn _ -> false end | Return true for errors that should immediately trigger the next hedge |
| :on_hedge | fn attempt -> any | nil | Use for metrics -- track how often hedges fire |

Resiliency.Hedged.start_link/1 (adaptive)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One tracker per logical operation type |
| :percentile | number | 95 | 95 is a good default; use 99 for ultra-low-latency paths |
| :buffer_size | positive integer | 1_000 | Increase if your traffic is very bursty |
| :min_delay | ms | 1 | Raise if you want a floor to avoid sub-millisecond hedges |
| :max_delay | ms | 5_000 | Match to your timeout |
| :initial_delay | ms | 100 | Used before enough samples are collected |
| :min_samples | non-negative integer | 10 | Lower for faster adaptation; higher for more stability |
| :token_max | number | 10 | Controls hedge budget -- lower values hedge less often |
| :token_success_credit | number | 0.1 | Each successful request adds this many tokens |
| :token_hedge_cost | number | 1.0 | Each hedge spends this many tokens |
| :token_threshold | number | 1.0 | Hedging is suppressed below this token level |

Resiliency.SingleFlight.flight/3,4

| Option | Type | Default | Recommendation |
|---|---|---|---|
| server | name or PID | required | One server per deduplication domain |
| key | any term | required | Use a string or tuple that uniquely identifies the work |
| timeout (4-arity) | ms or :infinity | :infinity | Set to your caller's deadline -- the in-flight function keeps running for other waiters |

Resiliency.SingleFlight.forget/2 evicts a key so the next call triggers a fresh execution -- useful after a known data change.
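For instance, a sketch of eviction after a write (update_user, Users.update, and MyFlights are illustrative names):

```elixir
# Sketch: after a known data change, forget the key so the next
# flight re-executes instead of sharing a now-stale result.
def update_user(id, attrs) do
  {:ok, user} = Users.update(id, attrs)
  Resiliency.SingleFlight.forget(MyFlights, "user:#{id}")
  {:ok, user}
end
```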

Resiliency.WeightedSemaphore

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One semaphore per protected resource |
| :max | positive integer | required | Set to the resource's actual concurrency limit |
| weight (in acquire/3,4) | positive integer | 1 | Model the relative cost of each operation |
| timeout (in acquire/4) | ms or :infinity | :infinity | Set to avoid unbounded queue waits under load |

try_acquire/2,3 returns :rejected immediately if permits are unavailable -- use it for best-effort work that can be dropped.
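A sketch of best-effort usage, assuming try_acquire/3 takes the same weight-and-function arguments as acquire/3 (Cache.warm is a hypothetical function):

```elixir
# Sketch: opportunistic cache warming -- skip rather than queue
# when the semaphore has no free permits.
case Resiliency.WeightedSemaphore.try_acquire(MyApp.DbPool, 1, fn ->
       Cache.warm(key)
     end) do
  {:ok, _} -> :warmed
  :rejected -> :skipped
end
```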

Resiliency.RateLimiter

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom | required | One rate limiter per upstream rate limit boundary |
| :rate | positive number | required | Tokens per second; match your upstream's stated rate limit |
| :burst_size | positive integer | required | Maximum burst; set to the upstream's burst allowance or a small multiple of rate |
| :on_reject | fn name -> any, or nil | nil | Use for logging or metrics -- fires in the caller's process |
| weight (per-call) | positive integer | 1 | Model the relative cost of each operation |

get_stats/1 returns %{tokens: float, rate: float, burst_size: integer} -- a read-only snapshot of the current token count that neither consumes tokens nor updates the refill timestamp.

reset/1 refills the bucket to burst_size -- useful in tests and after manual intervention.
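A sketch of both helpers in a test setup, using the MyRateLimiter instance started earlier in this guide:

```elixir
# Sketch: start each test from a full bucket, then read the stats
# without consuming any tokens.
Resiliency.RateLimiter.reset(MyRateLimiter)

%{tokens: tokens, rate: rate, burst_size: burst} =
  Resiliency.RateLimiter.get_stats(MyRateLimiter)

IO.inspect({tokens, rate, burst}, label: "bucket state")
```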

Task Combinators

| Module | Key Options | Defaults | Notes |
|---|---|---|---|
| Resiliency.Race.run/2 | timeout | :infinity | First success wins; all failures yield {:error, :all_failed} |
| Resiliency.AllSettled.run/2 | timeout | :infinity | Timed-out tasks get {:error, :timeout} in the result list |
| Resiliency.Map.run/3 | max_concurrency, timeout | System.schedulers_online(), :infinity | Cancels everything on first failure |
| Resiliency.FirstOk.run/2 | timeout | :infinity | Sequential -- total timeout spans all attempts |