Resiliency ships a focused set of modules that each solve a distinct reliability or concurrency problem. This guide helps you pick the right one -- or the right combination -- for your situation.

Decision Tree

Start from the problem you are trying to solve and follow the branch.


"My downstream service is failing and I want to stop calling it"

Use Resiliency.CircuitBreaker -- monitor failure rates and trip the circuit when the downstream is unhealthy. Probe calls automatically test recovery.

children = [{Resiliency.CircuitBreaker, name: MyBreaker, failure_rate_threshold: 0.5}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.CircuitBreaker.call(MyBreaker, fn ->
  HttpClient.get!(url)
end) do
  {:ok, body} -> body
  {:error, :circuit_open} -> {:error, :service_degraded}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Reduces load -- stops calling a failing service entirely.
  • Automatic recovery -- probes the service after a cool-down period.
  • Failure-rate-based (not just consecutive failures) with a sliding window.
  • Stateful -- requires a GenServer.

"My requests sometimes fail"

Use Resiliency.BackoffRetry -- retry the operation with configurable backoff so you do not hammer the downstream service.

Resiliency.BackoffRetry.retry(
  fn ->
    case HttpClient.get(url) do
      {:ok, %{status: 200}} = ok -> ok
      {:ok, %{status: 503}}      -> {:error, :unavailable}
      {:error, reason}           -> {:error, reason}
    end
  end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  retry_if: fn
    {:error, :unavailable} -> true
    {:error, :timeout}     -> true
    _                      -> false
  end
)

Key traits:

  • Adds latency (each retry waits for the backoff delay).
  • Does not add concurrent load -- attempts are sequential.
  • Stateless -- no process to start.

"My requests are sometimes slow"

Use Resiliency.Hedged -- send a backup request after a delay, take whichever finishes first, cancel the loser. This is a tail-latency optimization, not a retry strategy.

# Adaptive mode -- delay auto-tunes from observed latency
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge, percentile: 95)

{:ok, body} = Resiliency.Hedged.run(MyHedge, fn ->
  HttpClient.get!(url)
end)

# Stateless mode -- fixed delay, no process needed
{:ok, body} = Resiliency.Hedged.run(fn -> HttpClient.get!(url) end, delay: 50)

Key traits:

  • Reduces tail latency at the cost of extra requests.
  • Does add load -- a hedge fires a second (or Nth) request.
  • Adaptive mode is stateful (a GenServer tracker); stateless mode is not.

"Multiple callers request the same thing at the same time"

Use Resiliency.SingleFlight -- deduplicate concurrent calls so the function executes only once per key, and all waiters share the result.

{:ok, _} = Resiliency.SingleFlight.start_link(name: MyFlights)

{:ok, user} = Resiliency.SingleFlight.flight(MyFlights, "user:123", fn ->
  Repo.get!(User, 123)
end)

Key traits:

  • Reduces load -- N concurrent callers produce exactly 1 execution.
  • Saves latency for callers that arrive after the first -- they skip the I/O entirely and receive the result as soon as the in-flight call completes.
  • Stateful -- requires a GenServer.

"I need to isolate workloads with per-partition concurrency limits"

Use Resiliency.Bulkhead -- named concurrency limiter that isolates workloads into separate partitions with rejection semantics.

children = [{Resiliency.Bulkhead, name: MyApp.PaymentBulkhead, max_concurrent: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.Bulkhead.call(MyApp.PaymentBulkhead, fn ->
  PaymentGateway.charge(amount)
end) do
  {:ok, result} -> result
  {:error, :bulkhead_full} -> {:error, :overloaded}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Runs the function in the caller's process -- the GenServer is never blocked.
  • Server-managed wait queue with configurable max_wait and FIFO fairness.
  • Immediate rejection when full (max_wait: 0) or queued waiting (max_wait: N).
  • Per-call max_wait overrides.
  • Stateful -- requires a GenServer.

"I need to limit concurrent access to a resource"

Use Resiliency.WeightedSemaphore -- bound concurrency with per-operation weights and FIFO fairness.

children = [{Resiliency.WeightedSemaphore, name: MyApp.DbPool, max: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

# Lightweight read -- 1 permit
{:ok, row} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
  Repo.get(User, id)
end)

# Heavy bulk insert -- 5 permits
{:ok, _} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 5, fn ->
  Repo.insert_all(Event, batch)
end)

Key traits:

  • Adds latency only when the semaphore is saturated.
  • Reduces load on the protected resource.
  • Stateful -- requires a GenServer.
  • Permits are auto-released on normal return, raise, exit, or throw -- no leaks.

"I need to limit how many calls execute per second"

Use Resiliency.RateLimiter -- token-bucket rate limiter that rejects calls immediately when the bucket is empty, returning a retry_after_ms hint.

children = [{Resiliency.RateLimiter, name: MyRateLimiter, rate: 100.0, burst_size: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.RateLimiter.call(MyRateLimiter, fn ->
  HttpClient.get(url)
end) do
  {:ok, response} -> handle_response(response)
  {:error, {:rate_limited, retry_after_ms}} -> {:error, {:overloaded, retry_after_ms}}
  {:error, reason} -> {:error, reason}
end

Key traits:

  • Rejects immediately -- no queuing, no blocking.
  • Returns a retry_after_ms hint so callers know when to try again.
  • Weighted calls -- expensive operations consume more tokens (:weight option).
  • Lock-free ETS hot path -- no GenServer message on the grant or reject path.
  • Stateful -- requires a GenServer (table owner + reset).

"I need to run tasks in parallel with richer semantics than Task"

Use the stateless task combinators -- Resiliency.Race, Resiliency.AllSettled, Resiliency.Map, and Resiliency.FirstOk.

Race -- first success wins, losers are killed:

{:ok, fastest} = Resiliency.Race.run([
  fn -> fetch_from_region(:us_east) end,
  fn -> fetch_from_region(:eu_west) end
])

Parallel map -- bounded concurrency, cancels on first error:

{:ok, pages} = Resiliency.Map.run(urls, &fetch/1, max_concurrency: 10)

All settled -- never short-circuits, collects every result:

results = Resiliency.AllSettled.run([
  fn -> risky_a() end,
  fn -> risky_b() end
])
# => [{:ok, _}, {:error, _}]

First ok -- sequential fallback chain:

{:ok, value} = Resiliency.FirstOk.run([
  fn -> check_l1_cache(key) end,
  fn -> check_l2_cache(key) end,
  fn -> query_database(key) end
])

Key traits:

  • Stateless -- no process to start.
  • Task crashes never crash the caller.
  • Results are always in input order (for map and all_settled).

Full Comparison Table

| Pattern | Problem | Adds Latency? | Adds Load? | Stateful? | Best For |
|---|---|---|---|---|---|
| CircuitBreaker | Sustained downstream failures | No -- rejects immediately when open | No -- stops calling the downstream | Yes | Protecting against cascading failures, failing fast when a service is down |
| BackoffRetry | Transient failures | Yes -- backoff delays between attempts | No -- sequential attempts | No | HTTP calls, database queries, anything with intermittent errors |
| Hedged | Tail latency | No -- reduces p99 | Yes -- fires extra requests | Adaptive: yes; Stateless: no | Latency-sensitive RPCs, fan-out queries, cache lookups |
| SingleFlight | Thundering herd / duplicate work | No -- late arrivals skip the I/O and share the result | No -- reduces load by deduplication | Yes | Cache population, config reloads, expensive computations with shared keys |
| Bulkhead | Workload isolation | When waiting -- callers queue or reject | No -- caps it | Yes | Per-service concurrency limits, workload isolation, load shedding |
| WeightedSemaphore | Unbounded concurrency | When saturated -- callers queue | No -- caps it | Yes | Database pools, disk I/O, GPU access |
| RateLimiter | Too many calls per second | No -- rejects immediately | No -- caps it | Yes | External API rate limits, smoothing bursty traffic |
| Race | Need the fastest result from N sources | No -- returns the first success | Yes -- runs all concurrently | No | Multi-region fetch, redundant providers |
| Map | Parallel processing with a concurrency cap | No (unless saturated) | Bounded by max_concurrency | No | Bulk HTTP fetches, batch processing |
| AllSettled | Run all, tolerate individual failures | No | Yes -- runs all concurrently | No | Health checks, non-critical side effects, audit logging |
| FirstOk | Sequential fallback chain | Yes -- tries one at a time | No -- sequential | No | Cache/DB/API tiered lookups |

"Why not just use..."

fuse / ex_break

fuse is an Erlang library last released in 2021. It lacks a half-open state, sliding window failure rates, slow call detection, and its API is non-idiomatic for Elixir. ex_break provides basic circuit breaking but no sliding window or percentage-based thresholds. Resiliency.CircuitBreaker provides Resilience4j-quality features: count-based sliding window with O(1) failure-rate computation, configurable slow call thresholds, half-open probing, custom failure classification via should_record, and callback-based observability -- all with an API that matches the rest of this library.
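As a hedged sketch of those features combined -- option names come from the parameter quick-reference later in this guide, while CatalogBreaker is a hypothetical name and the response shapes passed to should_record are illustrative assumptions:

```elixir
# Sketch only: a breaker that ignores expected 404s, counts slow calls
# toward tripping, and logs state transitions.
children = [
  {Resiliency.CircuitBreaker,
   name: CatalogBreaker,
   window_size: 200,
   failure_rate_threshold: 0.5,
   slow_call_threshold: 800,        # ms -- roughly our p99
   slow_call_rate_threshold: 0.3,   # trip when 30% of calls are slow
   should_record: fn
     {:ok, %{status: 404}} -> :ignore    # expected misses are not failures
     {:ok, _} -> :success
     {:error, _} -> :failure
   end,
   on_state_change: fn name, from, to ->
     IO.puts("breaker #{inspect(name)}: #{from} -> #{to}")
   end}
]

Supervisor.start_link(children, strategy: :one_for_one)
```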

Task.async + Task.await

Task.await crashes the caller on timeout or task failure. Resiliency.Race.run and Resiliency.AllSettled.run handle failures gracefully -- crashed tasks are skipped or returned as {:error, reason}, and the caller never crashes. Resiliency.Map.run also cancels remaining work on first error, which Task.async_stream does not do.

GenServer.call with a timeout

A GenServer.call timeout exits the caller but does not stop the server from processing the request. Resiliency.WeightedSemaphore.acquire/4 supports a caller-side timeout that returns {:error, :timeout} cleanly, and permits are auto-released regardless of outcome. Resiliency.SingleFlight.flight/4 similarly supports a timeout -- the in-flight function continues for other waiters, but your caller gets an exit it can catch.
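A minimal sketch of the clean-timeout path, assuming the timeout is passed as the final argument of acquire/4 (per the parameter quick-reference) and HeavyQuery is a hypothetical query module:

```elixir
# Sketch: wait at most 2s for a permit; a timeout returns a tuple instead
# of exiting the caller, and no permit is leaked.
case Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
       Repo.all(HeavyQuery)
     end, 2_000) do
  {:ok, rows} -> rows
  {:error, :timeout} -> {:error, :resource_busy}
end
```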

Process.send_after / :timer.sleep for retry

Rolling your own retry loop with Process.send_after or :timer.sleep means reimplementing backoff strategies, attempt counting, budgets, abort semantics, and on_retry callbacks. Resiliency.BackoffRetry.retry/2 handles all of this in a single function call with composable, stream-based backoff. It also supports reraise: true to preserve the original stacktrace -- something a hand-rolled loop typically drops.
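For example, a sketch combining a retry budget with reraise: true (ExternalApi is a hypothetical module that raises on failure):

```elixir
# Sketch: retry a raising function; if all attempts fail, the original
# exception propagates with its original stacktrace.
Resiliency.BackoffRetry.retry(
  fn -> ExternalApi.fetch!(id) end,
  max_attempts: 3,
  backoff: :exponential,
  base_delay: 100,
  budget: 2_000,      # stop retrying once 2s have elapsed overall
  reraise: true
)
```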

Raw Task.async_stream for parallel work

Task.async_stream returns a lazy stream, requires you to handle :exit tuples yourself, and does not cancel remaining work on failure. Resiliency.Map.run/3 returns {:ok, results} or {:error, reason}, cancels all in-flight tasks on the first failure, and preserves input order. If you need all results regardless of failure, use Resiliency.AllSettled.run/1 instead.

Spawning two tasks manually for hedging

Manually spawning a primary and a backup task, selecting the first result, and killing the loser is roughly what Resiliency.Hedged does -- but adaptive hedging also tracks latency percentiles and uses a token bucket to avoid stampeding the backend when it is already slow. The stateless mode is a drop-in replacement for the manual approach with cleaner semantics.


When to Combine Patterns

Patterns in this library compose naturally. Here are common combinations.

Retry + Hedge

Retry handles total failures; hedging handles slow responses. Use retry as the outer wrapper when the entire hedged call might fail:

Resiliency.BackoffRetry.retry(
  fn ->
    Resiliency.Hedged.run(MyHedge, fn -> HttpClient.get!(url) end)
  end,
  max_attempts: 3,
  backoff: :exponential
)

The hedge reduces tail latency on each individual attempt, and the retry recovers from complete failures across attempts.

Hedge + Semaphore

Hedging adds load. If the downstream service has limited capacity, wrap the hedged function body in a semaphore to cap total concurrent requests:

Resiliency.Hedged.run(MyHedge, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.ApiLimit, 1, fn ->
    HttpClient.get!(url)
  end)
end)

This way, even if multiple hedges fire concurrently, total concurrency against the backend stays bounded.

SingleFlight + Retry

Deduplicate first, then retry inside the flight function. This way N callers still collapse into one execution, and that one execution gets retry semantics:

Resiliency.SingleFlight.flight(MyFlights, cache_key, fn ->
  Resiliency.BackoffRetry.retry(fn ->
    ExpensiveService.fetch(cache_key)
  end, max_attempts: 3)
end)

Placing retry outside SingleFlight would defeat deduplication -- each retry attempt would be a separate flight.

SingleFlight + Semaphore

When you have many distinct keys but still want to bound total concurrency across all of them:

Resiliency.SingleFlight.flight(MyFlights, key, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
    Repo.get!(Resource, key)
  end)
end)

Parameter Quick-Reference

Resiliency.Bulkhead

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One bulkhead per downstream service or workload |
| :max_concurrent | non-negative integer | required | Set to the downstream's actual concurrency limit; 0 rejects all calls (kill-switch) |
| :max_wait | ms, 0, or :infinity | 0 | 0 for fail-fast; set a timeout for queue-based load leveling |
| :on_call_permitted | fn name -> any, or nil | nil | Use for telemetry -- track permitted calls |
| :on_call_rejected | fn name -> any, or nil | nil | Use for telemetry -- track rejected calls |
| :on_call_finished | fn name -> any, or nil | nil | Use for telemetry -- track completed calls |
| max_wait (per-call) | ms, 0, or :infinity | server default | Override the server default for specific calls |
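As a sketch of the per-call override -- the exact shape of the per-call option (a trailing keyword list on call) is an assumption, and render_report is a hypothetical function:

```elixir
# Sketch: fail-fast by default, but let a batch job queue for a slot.
children = [{Resiliency.Bulkhead, name: MyApp.ReportBulkhead, max_concurrent: 5, max_wait: 0}]
Supervisor.start_link(children, strategy: :one_for_one)

# Interactive path: reject immediately when the partition is full.
Resiliency.Bulkhead.call(MyApp.ReportBulkhead, fn -> render_report(id) end)

# Nightly batch: willing to wait up to 30s for a slot.
Resiliency.Bulkhead.call(MyApp.ReportBulkhead, fn -> render_report(id) end, max_wait: 30_000)
```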

Resiliency.CircuitBreaker

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One breaker per downstream service |
| :window_size | positive integer | 100 | Match to your traffic volume -- larger for high-throughput services |
| :failure_rate_threshold | float 0.0–1.0 | 0.5 | 0.5 is a good default; lower for critical services |
| :slow_call_threshold | ms or :infinity | :infinity | Set to your p99 latency to detect slow calls |
| :slow_call_rate_threshold | float 0.0–1.0 | 1.0 | Effectively disabled at 1.0; lower to trip on slow calls |
| :open_timeout | ms | 60_000 | Time before probing -- longer for services with slow recovery |
| :permitted_calls_in_half_open | positive integer | 1 | More probes give higher confidence but delay recovery |
| :minimum_calls | positive integer | 10 | Prevents tripping on small sample sizes |
| :should_record | fn returning :success, :failure, or :ignore | default | Classify results; :ignore is not counted |
| :on_state_change | fn name, from, to -> any, or nil | nil | Use for logging or telemetry |

Resiliency.BackoffRetry.retry/2

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :backoff | :exponential, :linear, :constant, or Enumerable | :exponential | Use :exponential for most network calls; :constant for polling loops |
| :base_delay | ms | 100 | Match to the downstream service's typical recovery time |
| :max_delay | ms | 5_000 | Cap at a value that keeps total retry time within your SLA |
| :max_attempts | positive integer | 3 | 3–5 for transient errors; 1 to disable retries |
| :budget | ms or :infinity | :infinity | Set to your overall timeout to avoid retrying past a deadline |
| :retry_if | fn {:error, reason} -> boolean | retries all errors | Always set this -- retry only transient/retriable errors |
| :on_retry | fn attempt, delay, error -> any | nil | Use for logging or metrics |
| :sleep_fn | fn ms -> any | Process.sleep/1 | Inject a no-op for tests |
| :reraise | boolean | false | Set true if you want exceptions to propagate with their original stacktrace |
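A sketch of the :sleep_fn escape hatch inside an ExUnit test (the surrounding test case module is omitted):

```elixir
# Sketch: inject a no-op sleep so a multi-attempt retry runs instantly.
test "retries transient failures without real delays" do
  {:ok, counter} = Agent.start_link(fn -> 0 end)

  result =
    Resiliency.BackoffRetry.retry(
      fn ->
        n = Agent.get_and_update(counter, fn n -> {n, n + 1} end)
        if n < 2, do: {:error, :unavailable}, else: {:ok, :done}
      end,
      max_attempts: 5,
      sleep_fn: fn _ms -> :ok end   # no actual sleeping in tests
    )

  assert result == {:ok, :done}
end
```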

Resiliency.Hedged.run/2 (stateless)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :delay | ms | 100 | Set to your p50--p95 latency; too low wastes requests, too high defeats the purpose |
| :max_requests | positive integer | 2 | 2 is usually enough; 3 for very high-value calls |
| :timeout | ms | 5_000 | Set to your overall deadline |
| :non_fatal | fn reason -> boolean | fn _ -> false end | Return true for errors that should immediately trigger the next hedge |
| :on_hedge | fn attempt -> any | nil | Use for metrics -- track how often hedges fire |

Resiliency.Hedged.start_link/1 (adaptive)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One tracker per logical operation type |
| :percentile | number | 95 | 95 is a good default; use 99 for ultra-low-latency paths |
| :buffer_size | positive integer | 1_000 | Increase if your traffic is very bursty |
| :min_delay | ms | 1 | Raise if you want a floor to avoid sub-millisecond hedges |
| :max_delay | ms | 5_000 | Match to your timeout |
| :initial_delay | ms | 100 | Used before enough samples are collected |
| :min_samples | non-negative integer | 10 | Lower for faster adaptation; higher for more stability |
| :token_max | number | 10 | Controls hedge budget -- lower values hedge less often |
| :token_success_credit | number | 0.1 | Each successful request adds this many tokens |
| :token_hedge_cost | number | 1.0 | Each hedge spends this many tokens |
| :token_threshold | number | 1.0 | Hedging is suppressed below this token level |

Resiliency.SingleFlight.flight/3,4

| Option | Type | Default | Recommendation |
|---|---|---|---|
| server | name or PID | required | One server per deduplication domain |
| key | any term | required | Use a string or tuple that uniquely identifies the work |
| timeout (4-arity) | ms or :infinity | :infinity | Set to your caller's deadline -- the in-flight function keeps running for other waiters |

Resiliency.SingleFlight.forget/2 evicts a key so the next call triggers a fresh execution -- useful after a known data change.
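For instance, a sketch of eviction after a write (update_user, Users.update, and MyFlights are illustrative names):

```elixir
# Sketch: after a known data change, forget the key so the next
# flight re-executes instead of sharing a now-stale result.
def update_user(id, attrs) do
  {:ok, user} = Users.update(id, attrs)
  Resiliency.SingleFlight.forget(MyFlights, "user:#{id}")
  {:ok, user}
end
```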

Resiliency.WeightedSemaphore

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom or {:via, ...} | required | One semaphore per protected resource |
| :max | positive integer | required | Set to the resource's actual concurrency limit |
| weight (in acquire/3,4) | positive integer | 1 | Model the relative cost of each operation |
| timeout (in acquire/4) | ms or :infinity | :infinity | Set to avoid unbounded queue waits under load |

try_acquire/2,3 returns :rejected immediately if permits are unavailable -- use it for best-effort work that can be dropped.
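A sketch of best-effort usage, assuming try_acquire/3 takes the same weight-and-function arguments as acquire/3 (Cache.warm is a hypothetical function):

```elixir
# Sketch: opportunistic cache warming -- skip rather than queue
# when the semaphore has no free permits.
case Resiliency.WeightedSemaphore.try_acquire(MyApp.DbPool, 1, fn ->
       Cache.warm(key)
     end) do
  {:ok, _} -> :warmed
  :rejected -> :skipped
end
```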

Resiliency.RateLimiter

| Option | Type | Default | Recommendation |
|---|---|---|---|
| :name | atom | required | One rate limiter per upstream rate limit boundary |
| :rate | positive number | required | Tokens per second; match your upstream's stated rate limit |
| :burst_size | positive integer | required | Maximum burst; set to the upstream's burst allowance or a small multiple of rate |
| :on_reject | fn name -> any, or nil | nil | Use for logging or metrics -- fires in the caller's process |
| weight (per-call) | positive integer | 1 | Model the relative cost of each operation |

get_stats/1 returns %{tokens: float, rate: float, burst_size: integer} -- a read-only snapshot of the current token count that neither consumes tokens nor updates the refill timestamp.

reset/1 refills the bucket to burst_size -- useful in tests and after manual intervention.
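A sketch of both helpers in a test setup, using the MyRateLimiter instance started earlier in this guide:

```elixir
# Sketch: start each test from a full bucket, then read the stats
# without consuming any tokens.
Resiliency.RateLimiter.reset(MyRateLimiter)

%{tokens: tokens, rate: rate, burst_size: burst} =
  Resiliency.RateLimiter.get_stats(MyRateLimiter)

IO.inspect({tokens, rate, burst}, label: "bucket state")
```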

Task Combinators

| Module | Key Options | Defaults | Notes |
|---|---|---|---|
| Resiliency.Race.run/2 | timeout | :infinity | First success wins; all failures yield {:error, :all_failed} |
| Resiliency.AllSettled.run/2 | timeout | :infinity | Timed-out tasks get {:error, :timeout} in the result list |
| Resiliency.Map.run/3 | max_concurrency, timeout | System.schedulers_online(), :infinity | Cancels everything on first failure |
| Resiliency.FirstOk.run/2 | timeout | :infinity | Sequential -- total timeout spans all attempts |