# Choosing a Pattern

Resiliency ships eleven modules that each solve a distinct reliability or
concurrency problem.  This guide helps you pick the right one -- or the right
combination -- for your situation.

## Decision Tree

Start from the problem you are trying to solve and follow the branch.

---

### "My downstream service is failing and I want to stop calling it"

Use **`Resiliency.CircuitBreaker`** -- monitor failure rates and trip the circuit
when the downstream is unhealthy.  Probe calls automatically test recovery.

```elixir
children = [{Resiliency.CircuitBreaker, name: MyBreaker, failure_rate_threshold: 0.5}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.CircuitBreaker.call(MyBreaker, fn ->
  HttpClient.get!(url)
end) do
  {:ok, body} -> body
  {:error, :circuit_open} -> {:error, :service_degraded}
  {:error, reason} -> {:error, reason}
end
```

Key traits:

- Reduces load -- stops calling a failing service entirely.
- Automatic recovery -- probes the service after a cool-down period.
- Failure-rate-based (not just consecutive failures) with a sliding window.
- Stateful -- requires a `GenServer`.

---

### "My requests sometimes fail"

Use **`Resiliency.BackoffRetry`** -- retry the operation with configurable
backoff so you do not hammer the downstream service.

```elixir
Resiliency.BackoffRetry.retry(
  fn ->
    case HttpClient.get(url) do
      {:ok, %{status: 200}} = ok -> ok
      {:ok, %{status: 503}}      -> {:error, :unavailable}
      {:error, reason}           -> {:error, reason}
    end
  end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  retry_if: fn
    {:error, :unavailable} -> true
    {:error, :timeout}     -> true
    _                      -> false
  end
)
```

Key traits:

- Adds latency (each retry waits for the backoff delay).
- Does **not** add concurrent load -- attempts are sequential.
- Stateless -- no process to start.

---

### "My requests are sometimes slow"

Use **`Resiliency.Hedged`** -- send a backup request after a delay, take
whichever finishes first, cancel the loser.  This is a tail-latency
optimization, not a retry strategy.

```elixir
# Adaptive mode -- delay auto-tunes from observed latency
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge, percentile: 95)

{:ok, body} = Resiliency.Hedged.run(MyHedge, fn ->
  HttpClient.get!(url)
end)
```

```elixir
# Stateless mode -- fixed delay, no process needed
{:ok, body} = Resiliency.Hedged.run(fn -> HttpClient.get!(url) end, delay: 50)
```

Key traits:

- Reduces tail latency at the cost of extra requests.
- **Does** add load -- a hedge fires a second (or Nth) request.
- Adaptive mode is stateful (a `GenServer` tracker); stateless mode is not.

---

### "Multiple callers request the same thing at the same time"

Use **`Resiliency.SingleFlight`** -- deduplicate concurrent calls so the
function executes only once per key, and all waiters share the result.

```elixir
{:ok, _} = Resiliency.SingleFlight.start_link(name: MyFlights)

{:ok, user} = Resiliency.SingleFlight.flight(MyFlights, "user:123", fn ->
  Repo.get!(User, 123)
end)
```

Key traits:

- Reduces load -- N concurrent callers produce exactly 1 execution.
- Saves latency for callers that arrive after the first -- they skip the I/O
  entirely and receive the result as soon as the in-flight call completes.
- Stateful -- requires a `GenServer`.

---

### "I need to isolate workloads with per-partition concurrency limits"

Use **`Resiliency.Bulkhead`** -- named concurrency limiter that isolates
workloads into separate partitions with rejection semantics.

```elixir
children = [{Resiliency.Bulkhead, name: MyApp.PaymentBulkhead, max_concurrent: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.Bulkhead.call(MyApp.PaymentBulkhead, fn ->
  PaymentGateway.charge(amount)
end) do
  {:ok, result} -> result
  {:error, :bulkhead_full} -> {:error, :overloaded}
  {:error, reason} -> {:error, reason}
end
```

Key traits:

- Runs the function in the **caller's process** -- the GenServer is never blocked.
- Server-managed wait queue with configurable `max_wait` and FIFO fairness.
- Immediate rejection when full (`max_wait: 0`) or queued waiting (`max_wait: N`).
- Per-call `max_wait` overrides.
- Stateful -- requires a `GenServer`.

---

### "I need to limit concurrent access to a resource"

Use **`Resiliency.WeightedSemaphore`** -- bound concurrency with per-operation
weights and FIFO fairness.

```elixir
children = [{Resiliency.WeightedSemaphore, name: MyApp.DbPool, max: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

# Lightweight read -- 1 permit
{:ok, row} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
  Repo.get(User, id)
end)

# Heavy bulk insert -- 5 permits
{:ok, _} = Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 5, fn ->
  Repo.insert_all(Event, batch)
end)
```

Key traits:

- Adds latency only when the semaphore is saturated.
- Reduces load on the protected resource.
- Stateful -- requires a `GenServer`.
- Permits are auto-released on normal return, raise, exit, or throw -- no
  leaks.

---

### "I need to limit how many calls execute per second"

Use **`Resiliency.RateLimiter`** -- token-bucket rate limiter that rejects calls
immediately when the bucket is empty, returning a `retry_after_ms` hint.

```elixir
children = [{Resiliency.RateLimiter, name: MyRateLimiter, rate: 100.0, burst_size: 10}]
Supervisor.start_link(children, strategy: :one_for_one)

case Resiliency.RateLimiter.call(MyRateLimiter, fn ->
  HttpClient.get(url)
end) do
  {:ok, response} -> handle_response(response)
  {:error, {:rate_limited, retry_after_ms}} -> {:error, {:overloaded, retry_after_ms}}
  {:error, reason} -> {:error, reason}
end
```

Key traits:

- Rejects immediately -- no queuing, no blocking.
- Returns a `retry_after_ms` hint so callers know when to try again.
- Weighted calls -- expensive operations consume more tokens (`:weight` option).
- Lock-free ETS hot path -- no GenServer message on the grant or reject path.
- Stateful -- requires a `GenServer` (table owner + reset).

---

### "I need to run tasks in parallel with richer semantics than Task"

Use the stateless task combinators -- `Resiliency.Race`, `Resiliency.AllSettled`,
`Resiliency.Map`, and `Resiliency.FirstOk`.

**Race** -- first success wins, losers are killed:

```elixir
{:ok, fastest} = Resiliency.Race.run([
  fn -> fetch_from_region(:us_east) end,
  fn -> fetch_from_region(:eu_west) end
])
```

**Parallel map** -- bounded concurrency, cancels on first error:

```elixir
{:ok, pages} = Resiliency.Map.run(urls, &fetch/1, max_concurrency: 10)
```

**All settled** -- never short-circuits, collects every result:

```elixir
results = Resiliency.AllSettled.run([
  fn -> risky_a() end,
  fn -> risky_b() end
])
# => [{:ok, _}, {:error, _}]
```

**First ok** -- sequential fallback chain:

```elixir
{:ok, value} = Resiliency.FirstOk.run([
  fn -> check_l1_cache(key) end,
  fn -> check_l2_cache(key) end,
  fn -> query_database(key) end
])
```

Key traits:

- Stateless -- no process to start.
- Task crashes never crash the caller.
- Results are always in input order (for `map` and `all_settled`).

---

## Full Comparison Table

| Pattern | Problem | Adds Latency? | Adds Load? | Stateful? | Best For |
|---|---|---|---|---|---|
| `CircuitBreaker` | Sustained downstream failures | No -- rejects immediately when open | No -- stops calling the downstream | Yes | Protecting against cascading failures, failing fast when a service is down |
| `BackoffRetry` | Transient failures | Yes -- backoff delays between attempts | No -- sequential attempts | No | HTTP calls, database queries, anything with intermittent errors |
| `Hedged` | Tail latency | No -- reduces p99 | Yes -- fires extra requests | Adaptive: yes; Stateless: no | Latency-sensitive RPCs, fan-out queries, cache lookups |
| `SingleFlight` | Thundering herd / duplicate work | No -- late arrivals skip the I/O and share the result | No -- reduces load by deduplication | Yes | Cache population, config reloads, expensive computations with shared keys |
| `Bulkhead` | Workload isolation | When waiting -- callers queue or reject | No -- caps it | Yes | Per-service concurrency limits, workload isolation, load shedding |
| `WeightedSemaphore` | Unbounded concurrency | When saturated -- callers queue | No -- caps it | Yes | Database pools, disk I/O, GPU access |
| `RateLimiter` | Too many calls per second | No -- rejects immediately | No -- caps it | Yes | External API rate limits, smoothing bursty traffic |
| `Race` | Need the fastest result from N sources | No -- returns the first success | Yes -- runs all concurrently | No | Multi-region fetch, redundant providers |
| `Map` | Parallel processing with a concurrency cap | No (unless saturated) | Bounded by `max_concurrency` | No | Bulk HTTP fetches, batch processing |
| `AllSettled` | Run all, tolerate individual failures | No | Yes -- runs all concurrently | No | Health checks, non-critical side effects, audit logging |
| `FirstOk` | Sequential fallback chain | Yes -- tries one at a time | No -- sequential | No | Cache/DB/API tiered lookups |

---

## "Why not just use..."

### `fuse` / `ex_break`

`fuse` is an Erlang library last released in 2021. It lacks a half-open state,
sliding window failure rates, slow call detection, and its API is non-idiomatic
for Elixir. `ex_break` provides basic circuit breaking but no sliding window or
percentage-based thresholds. `Resiliency.CircuitBreaker` provides
Resilience4j-quality features: count-based sliding window with O(1) failure-rate
computation, configurable slow call thresholds, half-open probing, custom failure
classification via `should_record`, and callback-based observability -- all with
an API that matches the rest of this library.

### `Task.async` + `Task.await`

`Task.await` crashes the caller on timeout or task failure.
`Resiliency.Race.run` and `Resiliency.AllSettled.run` handle failures gracefully --
crashed tasks are skipped or returned as `{:error, reason}`, and the caller
never crashes.  `Resiliency.Map.run` also cancels remaining work on first error,
which `Task.async_stream` does not do.
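
As an illustration of the crash tolerance, a sketch -- `fetch_from_backup/0` is
a hypothetical function, and treating a raised exception as a failed branch is
an assumption consistent with the behavior described above:

```elixir
# The first task crashes, but the caller is not taken down with it --
# Race.run records the crash as a failure and returns the survivor's result.
{:ok, result} =
  Resiliency.Race.run([
    fn -> raise "primary exploded" end,
    fn -> fetch_from_backup() end
  ])
```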

### `GenServer.call` with a timeout

A `GenServer.call` timeout exits the caller but **does not** stop the server
from processing the request.  `Resiliency.WeightedSemaphore.acquire/4` supports
a caller-side timeout that returns `{:error, :timeout}` cleanly, and permits
are auto-released regardless of outcome.  `Resiliency.SingleFlight.flight/4`
similarly supports a timeout -- the in-flight function continues for other
waiters, but your caller gets an exit it can catch.
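
A sketch of the caller-side timeout -- this assumes the timeout is the fourth
positional argument of `acquire/4`, matching the parameter table later in this
guide:

```elixir
# Wait at most 500 ms for a permit, then give up cleanly instead of exiting.
case Resiliency.WeightedSemaphore.acquire(
       MyApp.DbPool,
       1,
       fn -> Repo.get(User, id) end,
       500
     ) do
  {:ok, row}         -> {:ok, row}
  {:error, :timeout} -> {:error, :busy}
end
```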

### `Process.send_after` / `:timer.sleep` for retry

Rolling your own retry loop with `Process.send_after` or `:timer.sleep` means
reimplementing backoff strategies, attempt counting, budgets, abort semantics,
and `on_retry` callbacks.  `Resiliency.BackoffRetry.retry/2` handles all of
this in a single function call with composable, stream-based backoff.  It also
supports `reraise: true` to preserve the original stacktrace -- something a
hand-rolled loop typically drops.
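
A sketch of what the hand-rolled loop would have to replicate, using only the
options documented below -- `ExternalApi.fetch!/1` is a hypothetical function:

```elixir
require Logger

Resiliency.BackoffRetry.retry(
  fn -> ExternalApi.fetch!(id) end,
  max_attempts: 3,
  backoff: :exponential,
  base_delay: 100,
  # Propagate exceptions with their original stacktrace on final failure.
  reraise: true,
  # Observe each retry without wiring up your own bookkeeping.
  on_retry: fn attempt, delay, error ->
    Logger.warning("retry ##{attempt} in #{delay}ms: #{inspect(error)}")
  end
)
```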

### Raw `Task.async_stream` for parallel work

`Task.async_stream` returns a lazy stream, requires you to handle `:exit`
tuples yourself, and does not cancel remaining work on failure.
`Resiliency.Map.run/3` returns `{:ok, results}` or
`{:error, reason}`, cancels all in-flight tasks on the first failure, and
preserves input order.  If you need all results regardless of failure, use
`Resiliency.AllSettled.run/1` instead.
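
A sketch of the return shapes side by side -- `process/1`, `items`, and
`bad_items` are hypothetical:

```elixir
# Success: every result, in input order, wrapped in a single {:ok, list}.
{:ok, results} = Resiliency.Map.run(items, &process/1, max_concurrency: 10)

# First failure: in-flight tasks are cancelled and one error comes back.
{:error, _reason} = Resiliency.Map.run(bad_items, &process/1)

# Need every outcome anyway?  AllSettled never short-circuits.
outcomes = Resiliency.AllSettled.run(Enum.map(items, fn i -> fn -> process(i) end end))
```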

### Spawning two tasks manually for hedging

Manually spawning a primary and a backup task, selecting the first result, and
killing the loser is roughly what `Resiliency.Hedged` does -- but adaptive
hedging also tracks latency percentiles and uses a token bucket to avoid
stampeding the backend when it is already slow.  The stateless mode is a
drop-in replacement for the manual approach with cleaner semantics.

---

## When to Combine Patterns

Patterns in this library compose naturally.  Here are common combinations.

### Retry + Hedge

Retry handles **total failures**; hedging handles **slow responses**.  Use
retry as the outer wrapper when the entire hedged call might fail:

```elixir
Resiliency.BackoffRetry.retry(
  fn ->
    Resiliency.Hedged.run(MyHedge, fn -> HttpClient.get!(url) end)
  end,
  max_attempts: 3,
  backoff: :exponential
)
```

The hedge reduces tail latency on each individual attempt, and the retry
recovers from complete failures across attempts.

### Hedge + Semaphore

Hedging adds load.  If the downstream service has limited capacity, wrap the
hedged function body in a semaphore to cap total concurrent requests:

```elixir
Resiliency.Hedged.run(MyHedge, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.ApiLimit, 1, fn ->
    HttpClient.get!(url)
  end)
end)
```

This way, even if multiple hedges fire concurrently, total concurrency against
the backend stays bounded.

### SingleFlight + Retry

Deduplicate first, then retry inside the flight function.  This way N callers
still collapse into one execution, and that one execution gets retry semantics:

```elixir
Resiliency.SingleFlight.flight(MyFlights, cache_key, fn ->
  Resiliency.BackoffRetry.retry(fn ->
    ExpensiveService.fetch(cache_key)
  end, max_attempts: 3)
end)
```

Placing retry **outside** SingleFlight would defeat deduplication -- each retry
attempt would be a separate flight.

### SingleFlight + Semaphore

When you have many distinct keys but still want to bound total concurrency
across all of them:

```elixir
Resiliency.SingleFlight.flight(MyFlights, key, fn ->
  Resiliency.WeightedSemaphore.acquire(MyApp.DbPool, 1, fn ->
    Repo.get!(Resource, key)
  end)
end)
```

---

## Parameter Quick-Reference

### `Resiliency.Bulkhead`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:name` | atom or `{:via, ...}` | required | One bulkhead per downstream service or workload |
| `:max_concurrent` | non-negative integer | required | Set to the downstream's actual concurrency limit; `0` rejects all calls (kill-switch) |
| `:max_wait` | ms, `0`, or `:infinity` | `0` | `0` for fail-fast; set a timeout for queue-based load leveling |
| `:on_call_permitted` | `fn name -> any` or `nil` | `nil` | Use for telemetry -- track permitted calls |
| `:on_call_rejected` | `fn name -> any` or `nil` | `nil` | Use for telemetry -- track rejected calls |
| `:on_call_finished` | `fn name -> any` or `nil` | `nil` | Use for telemetry -- track completed calls |
| `max_wait` (per-call) | ms, `0`, or `:infinity` | server default | Override the server default for specific calls |

### `Resiliency.CircuitBreaker`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:name` | atom or `{:via, ...}` | required | One breaker per downstream service |
| `:window_size` | positive integer | `100` | Match to your traffic volume -- larger for high-throughput services |
| `:failure_rate_threshold` | float 0.0–1.0 | `0.5` | 0.5 is a good default; lower for critical services |
| `:slow_call_threshold` | ms or `:infinity` | `:infinity` | Set to your p99 latency to detect slow calls |
| `:slow_call_rate_threshold` | float 0.0–1.0 | `1.0` | Effectively disabled at 1.0; lower to trip on slow calls |
| `:open_timeout` | ms | `60_000` | Time before probing -- longer for services with slow recovery |
| `:permitted_calls_in_half_open` | positive integer | `1` | More probes give higher confidence but delay recovery |
| `:minimum_calls` | positive integer | `10` | Prevents tripping on small sample sizes |
| `:should_record` | `fn result -> :success \| :failure \| :ignore` | built-in classifier | Classify results; `:ignore` is not counted |
| `:on_state_change` | `fn name, from, to -> any` or `nil` | `nil` | Use for logging or telemetry |
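
As an illustration of `:should_record`, a hypothetical classifier in the style
of the HTTP examples above.  The result shape (a response map with a `:status`
field) and the policy (5xx and transport errors trip the breaker, 4xx is the
caller's fault) are assumptions about your domain, not library defaults:

```elixir
should_record = fn
  %{status: status} when status >= 500 -> :failure
  %{status: status} when status >= 400 -> :ignore   # client errors don't trip the breaker
  {:error, _}                          -> :failure
  _other                               -> :success
end

children = [
  {Resiliency.CircuitBreaker, name: MyBreaker, should_record: should_record}
]
```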

### `Resiliency.BackoffRetry.retry/2`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:backoff` | `:exponential`, `:linear`, `:constant`, or `Enumerable` | `:exponential` | Use `:exponential` for most network calls; `:constant` for polling loops |
| `:base_delay` | ms | `100` | Match to the downstream service's typical recovery time |
| `:max_delay` | ms | `5_000` | Cap at a value that keeps total retry time within your SLA |
| `:max_attempts` | positive integer | `3` | 3--5 for transient errors; 1 to disable retries |
| `:budget` | ms or `:infinity` | `:infinity` | Set to your overall timeout to avoid retrying past a deadline |
| `:retry_if` | `fn {:error, reason} -> boolean` | retries all errors | Always set this -- retry only transient/retriable errors |
| `:on_retry` | `fn attempt, delay, error -> any` | `nil` | Use for logging or metrics |
| `:sleep_fn` | `fn ms -> any` | `Process.sleep/1` | Inject a no-op for tests |
| `:reraise` | boolean | `false` | Set `true` if you want exceptions to propagate with their original stacktrace |
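
To see how `:base_delay` and `:max_delay` interact, here is the delay schedule
under the conventional doubling formula (`base_delay * 2^(attempt - 1)`, capped
at `max_delay`).  The exact formula and any jitter are implementation details,
so treat this as an approximation:

```elixir
base_delay = 200
max_delay = 5_000

delays =
  for attempt <- 1..6 do
    min(base_delay * Integer.pow(2, attempt - 1), max_delay)
  end

IO.inspect(delays)
# [200, 400, 800, 1600, 3200, 5000] -- the cap kicks in at attempt 6
```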

### `Resiliency.Hedged.run/2` (stateless)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:delay` | ms | `100` | Set to your p50--p95 latency; too low wastes requests, too high defeats the purpose |
| `:max_requests` | positive integer | `2` | 2 is usually enough; 3 for very high-value calls |
| `:timeout` | ms | `5_000` | Set to your overall deadline |
| `:non_fatal` | `fn reason -> boolean` | `fn _ -> false end` | Return `true` for errors that should immediately trigger the next hedge |
| `:on_hedge` | `fn attempt -> any` | `nil` | Use for metrics -- track how often hedges fire |

### `Resiliency.Hedged.start_link/1` (adaptive tracker)

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:name` | atom or `{:via, ...}` | required | One tracker per logical operation type |
| `:percentile` | number | `95` | 95 is a good default; use 99 for ultra-low-latency paths |
| `:buffer_size` | positive integer | `1_000` | Increase if your traffic is very bursty |
| `:min_delay` | ms | `1` | Raise if you want a floor to avoid sub-millisecond hedges |
| `:max_delay` | ms | `5_000` | Match to your timeout |
| `:initial_delay` | ms | `100` | Used before enough samples are collected |
| `:min_samples` | non-negative integer | `10` | Lower for faster adaptation; higher for more stability |
| `:token_max` | number | `10` | Controls hedge budget -- lower values hedge less often |
| `:token_success_credit` | number | `0.1` | Each successful request adds this many tokens |
| `:token_hedge_cost` | number | `1.0` | Each hedge spends this many tokens |
| `:token_threshold` | number | `1.0` | Hedging is suppressed below this token level |
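
The four token options form a simple budget.  A hypothetical walkthrough with
the defaults above -- this is the arithmetic implied by the table, not library
code:

```elixir
token_max = 10.0
hedge_cost = 1.0
success_credit = 0.1

# A burst of slow responses fires ten hedges, draining the bucket...
tokens = Enum.reduce(1..10, token_max, fn _, t -> t - hedge_cost end)
# tokens == 0.0, below token_threshold: hedging is now suppressed.

# ...and it takes roughly ten ordinary successes to earn one hedge back.
tokens = Enum.reduce(1..10, tokens, fn _, t -> min(t + success_credit, token_max) end)
# tokens is back to ~1.0 (modulo floating point), so hedging can resume.
```

In steady state this caps hedges at roughly one per ten successful requests;
lower `:token_success_credit` or raise `:token_hedge_cost` to hedge more
sparingly.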

### `Resiliency.SingleFlight.flight/3,4`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `server` | name or PID | required | One server per deduplication domain |
| `key` | any term | required | Use a string or tuple that uniquely identifies the work |
| `timeout` (4-arity) | ms or `:infinity` | `:infinity` | Set to your caller's deadline -- the in-flight function keeps running for other waiters |

`Resiliency.SingleFlight.forget/2` evicts a key so the next call triggers a
fresh execution -- useful after a known data change.
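
For example -- `update_user!/1` is hypothetical:

```elixir
update_user!(attrs)
# Evict the cached flight so the next caller re-reads fresh data.
Resiliency.SingleFlight.forget(MyFlights, "user:123")
```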

### `Resiliency.WeightedSemaphore`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:name` | atom or `{:via, ...}` | required | One semaphore per protected resource |
| `:max` | positive integer | required | Set to the resource's actual concurrency limit |
| `weight` (in `acquire/3,4`) | positive integer | `1` | Model the relative cost of each operation |
| `timeout` (in `acquire/4`) | ms or `:infinity` | `:infinity` | Set to avoid unbounded queue waits under load |

`try_acquire/2,3` returns `:rejected` immediately if permits are unavailable --
use it for best-effort work that can be dropped.
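
A sketch of best-effort usage.  Both the success shape and the assumption that
`try_acquire/3` takes the same name/weight/fun arguments as `acquire/3` are
guesses from the signature above, not confirmed API details:

```elixir
# Flush metrics only if a permit is free right now; otherwise drop the batch.
case Resiliency.WeightedSemaphore.try_acquire(MyApp.DbPool, 1, fn ->
       Metrics.flush(batch)
     end) do
  :rejected     -> :dropped
  {:ok, _count} -> :flushed
end
```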

### `Resiliency.RateLimiter`

| Option | Type | Default | Recommendation |
|---|---|---|---|
| `:name` | atom | required | One rate limiter per upstream rate limit boundary |
| `:rate` | positive number | required | Tokens per second; match your upstream's stated rate limit |
| `:burst_size` | positive integer | required | Maximum burst; set to the upstream's burst allowance or a small multiple of `rate` |
| `:on_reject` | `fn name -> any` or `nil` | `nil` | Use for logging or metrics -- fires in the caller's process |
| `weight` (per-call) | positive integer | `1` | Model the relative cost of each operation |

`get_stats/1` returns `%{tokens: float, rate: float, burst_size: integer}` -- a
read-only snapshot that consumes no tokens and does not update the refill
timestamp.

`reset/1` refills the bucket to `burst_size` -- useful in tests and after manual
intervention.
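
The numbers in `get_stats/1` follow standard token-bucket arithmetic: tokens
accrue at `:rate` per second up to `:burst_size`, and a weight-`w` call needs
`w` tokens.  A hypothetical walkthrough (the exact refill formula is an
implementation detail):

```elixir
rate = 100.0      # tokens per second
burst_size = 10
tokens = 0.0      # the bucket was just drained

# 40 ms later the bucket has partially refilled:
elapsed_ms = 40
tokens = min(tokens + rate * elapsed_ms / 1000, burst_size)
# tokens == 4.0 -- enough for four weight-1 calls

# With an empty bucket, the retry hint for a weight-1 call works out to:
retry_after_ms = ceil(1000 / rate)
# => 10
```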

### Task Combinators

| Module | Key Options | Defaults | Notes |
|---|---|---|---|
| `Resiliency.Race.run/2` | `timeout` | `:infinity` | First success wins; all failures yield `{:error, :all_failed}` |
| `Resiliency.AllSettled.run/2` | `timeout` | `:infinity` | Timed-out tasks get `{:error, :timeout}` in the result list |
| `Resiliency.Map.run/3` | `max_concurrency`, `timeout` | `System.schedulers_online()`, `:infinity` | Cancels everything on first failure |
| `Resiliency.FirstOk.run/2` | `timeout` | `:infinity` | Sequential -- total timeout spans all attempts |
