Tuning and Observability


This guide covers parameter selection, tuning strategies, and observability patterns for every module in the Resiliency library. Each section provides a parameter reference table, explains how parameters interact, and gives concrete recommendations for common workloads.


Table of Contents

  1. CircuitBreaker Tuning
  2. BackoffRetry Tuning
  3. Hedged Requests Tuning
  4. SingleFlight Tuning
  5. Task Combinator Tuning
  6. WeightedSemaphore Tuning
  7. RateLimiter Tuning
  8. Observability
  9. Common Pitfalls

CircuitBreaker Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the GenServer. |
| :window_size | 100 | 1.. | Number of call outcomes in the count-based sliding window. Larger windows smooth out short bursts but react slower to genuine shifts. |
| :failure_rate_threshold | 0.5 | 0.0..1.0 | Failure rate that trips the circuit. 0.5 means the circuit trips when half or more of the recorded calls fail. |
| :slow_call_threshold | :infinity | :infinity or 1.. ms | Duration above which a call is classified as "slow". :infinity disables slow call detection entirely. |
| :slow_call_rate_threshold | 1.0 | 0.0..1.0 | Slow call rate that trips the circuit. 1.0 effectively disables slow-rate tripping. |
| :open_timeout | 60_000 ms | 1.. | Time the circuit stays :open before transitioning to :half_open for probing. |
| :permitted_calls_in_half_open | 1 | 1.. | Number of probe calls allowed through in :half_open state before deciding to close or reopen. |
| :minimum_calls | 10 | 1.. | Minimum recorded calls in the window before the failure rate is evaluated. Prevents tripping on small sample sizes. |
| :should_record | default predicate | fn returning :success, :failure, or :ignore | Custom classification function. :ignore results are not counted in the window. |
| :on_state_change | nil | fn name, from, to -> any, or nil | Callback fired on every state transition. Use for logging, metrics, or telemetry. |

How Parameters Interact

The circuit evaluates after each recorded call:

if window.total >= minimum_calls do
  if failure_rate >= failure_rate_threshold or slow_rate >= slow_call_rate_threshold do
    trip to :open
  end
end

The :minimum_calls parameter acts as a warm-up guard -- the circuit will not trip until enough calls have been observed. This prevents a single failure from tripping a freshly started breaker.

The sliding window is count-based (not time-based). Old outcomes are evicted when new ones push them out of the fixed-size buffer. This means the window naturally adapts to traffic volume without requiring time-based expiry.

Tuning for Common Workloads

| Scenario | failure_rate_threshold | minimum_calls | window_size | open_timeout | Notes |
| --- | --- | --- | --- | --- | --- |
| High-throughput API | 0.5 | 20 | 200 | 30_000 | Larger window for stable signal. |
| Critical payment service | 0.3 | 10 | 100 | 60_000 | Trip earlier to protect revenue. |
| Background job queue | 0.8 | 50 | 500 | 120_000 | Tolerate more failures, longer recovery. |
| Health-check probe | 0.5 | 3 | 10 | 10_000 | Small window, fast recovery for probes. |
| Slow-call-sensitive API | 0.5 / 0.3 (slow) | 10 | 100 | 30_000 | Set slow_call_threshold to p99 latency. |
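
For instance, the "Critical payment service" row above maps onto a child spec like this (the breaker name MyApp.PaymentBreaker is an illustrative placeholder):

```elixir
# Supervision tree entry -- values taken from the tuning table above.
children = [
  {Resiliency.CircuitBreaker,
   name: MyApp.PaymentBreaker,
   failure_rate_threshold: 0.3,
   minimum_calls: 10,
   window_size: 100,
   open_timeout: 60_000}
]

Supervisor.start_link(children, strategy: :one_for_one)
```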

Interpreting get_stats

Call Resiliency.CircuitBreaker.get_stats/1 to inspect the breaker at runtime:

%{
  state: :closed,
  total: 87,
  failures: 12,
  slow_calls: 3,
  failure_rate: 0.1379,
  slow_call_rate: 0.0345
}

| Stat | Healthy range | What it means |
| --- | --- | --- |
| failure_rate | < failure_rate_threshold | Current failure rate in the sliding window. |
| slow_call_rate | < slow_call_rate_threshold | Current slow call rate. High values suggest downstream latency issues. |
| total | >= minimum_calls | Total calls in the window. Below minimum_calls, the circuit will not trip. |
| state | :closed | Current state. :open means rejecting calls; :half_open means probing. |

BackoffRetry Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :max_attempts | 3 | 1.. | Total attempts including the first call. 1 means no retries. |
| :backoff | :exponential | :exponential, :linear, :constant, or any Enumerable of ms | Determines the shape of the delay curve between retries. |
| :base_delay | 100 ms | 0.. | Seed value for delay computation -- first retry waits this long (before cap/jitter). |
| :max_delay | 5_000 ms | 0.. | Hard ceiling on any single retry delay. Applied via Backoff.cap/2. |
| :budget | :infinity | :infinity or 0.. ms | Total wall-clock budget for all attempts. When the next delay would push past the deadline, retries stop. |
| :retry_if | fn {:error, _} -> true end | fn result -> boolean end | Predicate that decides whether a given failure is retryable. Non-matching errors are returned immediately. |
| :on_retry | nil | fn attempt, delay, error -> any | Callback fired before each sleep. Use for logging, metrics, telemetry. |
| :sleep_fn | &Process.sleep/1 | fn ms -> any | Injectable sleep -- replace with a no-op or test double in tests. |
| :reraise | false | true, false | When true, re-raises rescued exceptions (with original stacktrace) instead of returning {:error, exception} after exhausting retries. |

How Parameters Interact

The effective retry sequence is computed as:

delays = backoff_strategy(base_delay)
         |> Backoff.cap(max_delay)
         |> Enum.take(max_attempts - 1)

For exponential backoff with the defaults, the delay sequence before cap is:

attempt 1: immediate (no delay -- first call)
attempt 2: 100 ms
attempt 3: 200 ms

With max_attempts: 5 and base_delay: 100:

attempt 1: immediate
attempt 2: 100 ms
attempt 3: 200 ms
attempt 4: 400 ms
attempt 5: 800 ms

The :budget option acts as a secondary stop condition. Even if max_attempts has not been reached, the retry loop aborts when the next sleep would exceed the total time budget. This makes :budget the right knob for SLA-sensitive callers -- set it to your upstream timeout minus a safety margin.

The Backoff.jitter/2 modifier (usable when passing a custom stream) spreads each delay d uniformly over [d * (1 - proportion), d * (1 + proportion)]. Jitter is critical in multi-instance deployments to avoid thundering herd on the downstream service after a shared failure.

Backoff Strategy Formulas

| Strategy | Formula (n-th retry, 0-indexed) | Sequence (base=100) |
| --- | --- | --- |
| :exponential | base * multiplier^n | 100, 200, 400, 800, 1600, ... |
| :linear | base + increment * n | 100, 200, 300, 400, 500, ... |
| :constant | base | 100, 100, 100, 100, 100, ... |
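
The three shapes are easy to reproduce with plain streams. This is a sketch of the formulas above, not the library's implementation; it assumes a multiplier of 2 and a linear increment equal to the base, matching the sequences in the table:

```elixir
base = 100

exponential = Stream.iterate(base, &(&1 * 2))     # base * 2^n
linear      = Stream.iterate(base, &(&1 + base))  # base + base * n
constant    = Stream.repeatedly(fn -> base end)   # base, forever

Enum.take(exponential, 5)  # [100, 200, 400, 800, 1600]
Enum.take(linear, 5)       # [100, 200, 300, 400, 500]
Enum.take(constant, 5)     # [100, 100, 100, 100, 100]
```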

Tuning for Common Workloads

| Scenario | max_attempts | backoff | base_delay | max_delay | budget | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Low-latency API | 2--3 | :exponential | 50 | 500 | 1_000 | Short budget prevents blocking the request path. |
| Batch / ETL job | 5--8 | :exponential | 500 | 30_000 | :infinity | Generous delays -- throughput matters more than latency. |
| Database reconnect | 10+ | :exponential + jitter | 200 | 60_000 | :infinity | Long max_delay with jitter to avoid connection storms. |
| Idempotent webhook | 4 | :linear | 1_000 | 10_000 | 30_000 | Linear ramp gives the receiver steady recovery time. |
| Circuit-breaker probe | 1 | :constant | 0 | 0 | 500 | Single attempt with tight budget -- just checking if the service is back. |

Custom Backoff Streams

For advanced scenarios, compose your own stream with jitter and cap:

custom_backoff =
  Resiliency.BackoffRetry.Backoff.exponential(base: 200, multiplier: 3)
  |> Resiliency.BackoffRetry.Backoff.jitter(0.25)
  |> Resiliency.BackoffRetry.Backoff.cap(15_000)

Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: custom_backoff,
  max_attempts: 6
)

Or pass a literal list when you want fully deterministic delays:

Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: [100, 500, 2_000, 5_000]
)

When a list is provided, max_attempts is implicitly length(list) + 1 (one initial call plus one retry per delay entry).

Rule of Thumb

Total worst-case latency = sum(delays) + max_attempts * max_call_duration. If this exceeds your caller's timeout, either reduce max_attempts, lower max_delay, or set a :budget.
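
A worked instance of that formula, assuming max_attempts: 4, base_delay: 200, exponential backoff below the cap, and calls that take at most 300 ms each:

```elixir
delays = [200, 400, 800]  # the three retry delays (none hit max_delay)
max_call_duration = 300   # worst-case duration of one attempt, in ms

worst_case = Enum.sum(delays) + 4 * max_call_duration
# 1_400 + 1_200 = 2_600 ms -- fits under a 3_000 ms caller timeout,
# so no :budget is strictly required in this configuration.
```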


Hedged Requests Tuning

Stateless Mode Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :delay | 100 ms | 0.. | Time to wait before firing the backup (hedge) request. |
| :max_requests | 2 | 1.. | Total concurrent attempts. 1 disables hedging entirely. |
| :timeout | 5_000 ms | 1.. | Overall deadline -- all in-flight tasks are killed at this point. |
| :non_fatal | fn _ -> false end | fn reason -> boolean | When true for a failure reason, the next hedge fires immediately instead of waiting for the delay. |
| :on_hedge | nil | fn attempt -> any | Callback invoked before each hedge fires. Use for metrics and logging. |

Adaptive Mode (Tracker) Parameter Reference

When using Resiliency.Hedged.start_link/1, the delay auto-tunes based on observed latency. A token bucket throttles the hedge rate under load.

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the tracker GenServer. |
| :percentile | 95 | 0..100 | Target latency percentile used as the adaptive delay. Higher values hedge less aggressively. |
| :buffer_size | 1_000 | 1.. | Number of latency samples in the rolling window. Larger buffers smooth out spikes but react slower to shifts. |
| :min_delay | 1 ms | 0.. | Floor for the adaptive delay. Prevents hedging on every request even when p95 is near zero. |
| :max_delay | 5_000 ms | 1.. | Ceiling for the adaptive delay. Ensures hedges still fire even during latency spikes. |
| :initial_delay | 100 ms | 0.. | Delay used during cold-start before :min_samples observations are collected. |
| :min_samples | 10 | 0.. | Number of observations required before switching from :initial_delay to the adaptive percentile. |
| :token_max | 10 | > 0 | Token bucket capacity. Determines burst budget for hedging. |
| :token_success_credit | 0.1 | > 0 | Tokens earned per completed request (hedged or not). |
| :token_hedge_cost | 1.0 | > 0 | Tokens consumed when a hedge fires. |
| :token_threshold | 1.0 | >= 0 | Minimum token balance required to allow hedging. Below this, max_requests is forced to 1. |

How Adaptive Delay Works

  1. Cold start -- Before min_samples observations, the tracker returns initial_delay as the hedge delay. Pick this conservatively -- too low and you double load during startup.

  2. Steady state -- The tracker maintains a circular buffer of the last buffer_size latency samples. On each get_config/1 call, it computes the configured percentile of the buffer and clamps it to [min_delay, max_delay]. This becomes the hedge delay.

  3. Token bucket -- Every completed request (success or failure) adds token_success_credit tokens. Every hedge that fires costs token_hedge_cost tokens. When tokens drop below token_threshold, hedging is disabled until enough successful requests replenish the bucket. This creates a natural feedback loop: under sustained load where hedges are not winning, the system backs off automatically.

Effective hedge rate (steady state):

max_hedge_fraction = token_success_credit / token_hedge_cost

With defaults (0.1 / 1.0), at most 10% of requests will be hedged in steady state. To allow up to 20%, set token_success_credit: 0.2. To allow up to 5%, set token_success_credit: 0.05 or token_hedge_cost: 2.0.
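
A minimal sketch of the bucket arithmetic (illustrative only, not the tracker's internal code):

```elixir
defmodule BucketSketch do
  # One completed request: credit tokens, then pay for the hedge if one fired.
  def step(tokens, hedged?, opts) do
    tokens = min(opts.token_max, tokens + opts.token_success_credit)
    if hedged?, do: tokens - opts.token_hedge_cost, else: tokens
  end
end

opts = %{token_max: 10, token_success_credit: 0.1, token_hedge_cost: 1.0}

# Hedging every request drains the bucket at roughly -0.9 tokens per request,
# while hedging one request in ten holds the balance steady -- the 10%
# steady-state fraction from the formula above. (The real tracker stops
# hedging once the balance falls below token_threshold.)
Enum.reduce(1..20, 10.0, fn _, t -> BucketSketch.step(t, true, opts) end)
```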

Tuning for Different Scenarios

| Scenario | :percentile | :buffer_size | Token settings | Notes |
| --- | --- | --- | --- | --- |
| Fast API (p50 ~5ms) | 90 | 500 | defaults | Aggressive hedging -- low cost per extra request. |
| Slow DB query (p50 ~200ms) | 99 | 2_000 | token_success_credit: 0.05 | Conservative -- extra DB queries are expensive. |
| Multi-region fanout | 95 | 1_000 | defaults | Classic tail-latency use case. |
| Startup / cold cache | -- | -- | initial_delay: 500 | High initial delay avoids flooding a cold service. |
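
The "Slow DB query" row, for example, translates into a tracker definition like this (MyApp.DbHedger is a placeholder name):

```elixir
# Supervision tree entry for the adaptive tracker, using the row above.
children = [
  {Resiliency.Hedged,
   name: MyApp.DbHedger,
   percentile: 99,
   buffer_size: 2_000,
   token_success_credit: 0.05}
]
```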

Interpreting Tracker Stats

Call Resiliency.Hedged.Tracker.stats/1 to inspect the tracker at runtime:

%{
  total_requests: 15234,
  hedged_requests: 1412,
  hedge_won: 987,
  p50: 12,
  p95: 45,
  p99: 210,
  current_delay: 45,
  tokens: 7.2
}

| Stat | Healthy range | What it means |
| --- | --- | --- |
| hedged_requests / total_requests | 5--15% | Hedge rate. Below 5% suggests the delay is too high or the service is healthy (good). Above 20% means you are generating significant extra load. |
| hedge_won / hedged_requests | 30--70% | Win rate. Below 30% means hedges rarely help -- consider raising the percentile. Above 70% means the primary is consistently slow -- investigate the downstream. |
| tokens | > token_threshold | Token balance. If consistently near zero, hedging is being throttled. Raise token_success_credit or lower token_hedge_cost if you want more hedging. |
| p99 / p50 ratio | < 10x | Tail-to-median ratio. A high ratio (> 20x) indicates severe tail latency -- hedging is valuable here. |
| current_delay | between min_delay and max_delay | The adaptive delay. If pinned at max_delay, latency has spiked and the tracker is being conservative. |

SingleFlight Tuning

Resiliency.SingleFlight has no tunable numeric parameters -- its behavior is determined entirely by key design and usage patterns.

When It Helps

  • Cache stampede -- Many processes request the same cache key after expiry. Without SingleFlight, all of them hit the database. With it, one process fetches while the rest share the result.

  • Expensive computation -- Deduplicating concurrent calls to a heavy aggregation or report-generation function.

  • External API rate limits -- Preventing duplicate requests to a rate-limited third-party service.

When It Hurts

  • Non-idempotent operations -- If the function has side effects that must execute per-caller (e.g., incrementing a counter, sending a notification), SingleFlight will suppress those side effects for coalesced callers.

  • Caller-specific context -- If each caller needs a slightly different variant of the result (different query parameters, different auth tokens), deduplication by a shared key will return the wrong result for some callers.

  • Very short functions -- If the function completes in microseconds, the overhead of the GenServer round-trip (message passing, ETS or map lookup) may exceed the savings from deduplication.

Key Design Considerations

| Consideration | Guidance |
| --- | --- |
| Key granularity | Too broad (e.g., "users") coalesces unrelated calls. Too narrow (e.g., "user:#{id}:#{timestamp}") defeats deduplication. Use the natural cache key. |
| Key type | Any Erlang term works. Atoms and short strings are fastest for map lookups. |
| Error propagation | If the executing function fails, all waiting callers receive the same {:error, reason}. This is usually correct for cache-fill scenarios but may not be appropriate if different callers should retry independently. |
| Timeout | Use flight/4 with a timeout when the function may be slow. Timed-out callers exit, but the in-flight function continues and serves other waiters. |
| forget/2 | Call forget/2 to force a fresh execution for the next caller. Useful when you know cached data is stale (e.g., after a write). Existing waiters still receive the original result. |
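
A cache-fill sketch tying these points together. The flight(server, key, fun) shape, the MyApp.Flights server name, and the Repo calls are assumptions -- adapt them to the actual function signatures and your application:

```elixir
defmodule MyApp.Users do
  # One concurrent fetch per user id; coalesced callers share the result.
  def fetch(id) do
    Resiliency.SingleFlight.flight(MyApp.Flights, {:user, id}, fn ->
      Repo.get(User, id)
    end)
  end

  # After a write, force the next caller to execute fresh rather than
  # share a result computed from pre-write data.
  def update(id, attrs) do
    result = do_update(id, attrs)
    Resiliency.SingleFlight.forget(MyApp.Flights, {:user, id})
    result
  end
end
```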

Task Combinator Tuning

Choosing the Right Combinator

| Module | Use when | Concurrency | Failure behavior |
| --- | --- | --- | --- |
| Resiliency.Race | You need the fastest result from N alternatives | All functions run concurrently | First success wins; crashed tasks are skipped; returns {:error, :all_failed} if all fail |
| Resiliency.AllSettled | You need every result regardless of failures | All functions run concurrently | Each result is {:ok, _} or {:error, _} independently |
| Resiliency.Map | You are processing a collection with bounded parallelism | Up to max_concurrency at a time | Cancels all remaining work on first failure |
| Resiliency.FirstOk | You have a fallback chain (cache -> DB -> API) | Sequential -- one at a time | Tries the next function only after the previous one fails |

Timeout Selection

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :timeout (Race) | :infinity | :infinity or 1.. ms | Overall deadline for the race. Remaining tasks are killed when it expires. |
| :timeout (AllSettled) | :infinity | :infinity or 1.. ms | Completed tasks keep their results; tasks still running get {:error, :timeout}. |
| :timeout (Map) | :infinity | :infinity or 1.. ms | Returns {:error, :timeout} and kills all active tasks. |
| :timeout (FirstOk) | :infinity | :infinity or 1.. ms | Total budget across all sequential attempts. |
| :max_concurrency (Map) | System.schedulers_online() | 1.. | Limits how many items are processed in parallel. |

Timeout Rules of Thumb

  • For Race.run/1 -- set the timeout to your SLA ceiling. If no backend responds in time, you want a clear timeout rather than an indefinite hang.

  • For AllSettled.run/1 -- set the timeout to the slowest acceptable task duration. Tasks that finish within the deadline keep their results; the rest are marked as timed out.

  • For Resiliency.Map.run/2 -- multiply your per-item budget by the number of items divided by max_concurrency, then add a margin. Or use :infinity and rely on per-item timeouts inside the function.

  • For FirstOk.run/1 -- set the timeout to the total latency budget for the entire fallback chain. Each attempt subtracts from the remaining budget.

Race vs FirstOk Decision

Do you want concurrent execution?
  Yes -> Race.run/1 (all functions run at once, first success wins)
  No  -> FirstOk.run/1 (sequential, stops at first success)

Use Race.run/1 when all backends can handle the load of concurrent requests. Use FirstOk.run/1 when you want to avoid unnecessary calls to slower/more expensive backends.
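
The cache -> DB -> API chain can be sketched with FirstOk like this. It assumes the run function accepts a list of zero-arity functions plus the :timeout option from the table above (the exact arity may differ), and the backend module names are placeholders:

```elixir
Resiliency.FirstOk.run(
  [
    fn -> Cache.get(key) end,        # cheapest, may miss
    fn -> Database.fetch(key) end,   # slower but authoritative
    fn -> RemoteApi.fetch(key) end   # last resort, most expensive
  ],
  timeout: 2_000                     # budget for the whole chain
)
```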


WeightedSemaphore Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the semaphore GenServer. |
| :max | (required) | 1.. | Total permit capacity. The sum of all concurrently held weights must not exceed this value. |

max Selection

The :max value represents your concurrency budget. Choose it based on the downstream resource's capacity:

| Resource | Suggested max | Rationale |
| --- | --- | --- |
| Database connection pool (size N) | N or N - 1 | Match the pool size. Reserve one connection for health checks if needed. |
| External API (rate limit R req/s) | R / avg_requests_per_second | Keep in-flight requests below the rate limit. |
| CPU-bound work | System.schedulers_online() | One permit per scheduler avoids over-subscription. |
| Memory-bound work | available_mb / per_task_mb | Weight by memory cost per task. |

Weight Assignment Strategies

| Strategy | When to use | Example |
| --- | --- | --- |
| Uniform (weight=1) | All operations have equal cost | acquire(sem, fn -> read_row() end) |
| Cost-proportional | Operations vary in resource consumption | acquire(sem, row_count, fn -> bulk_insert(rows) end) |
| Tiered | Two or three operation classes | Reads = 1, writes = 3, bulk = 10 |
| Estimated | Cost is data-dependent | acquire(sem, estimate_cost(query), fn -> run(query) end) |
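
A tiered strategy can be centralised in a small wrapper. The tier costs and module names here are illustrative -- tune them to your workload:

```elixir
defmodule MyApp.DbGate do
  # Reads are cheap, writes heavier, bulk operations heaviest.
  @weights %{read: 1, write: 3, bulk: 10}

  def run(sem, kind, fun) do
    Resiliency.WeightedSemaphore.acquire(sem, Map.fetch!(@weights, kind), fun)
  end
end

# MyApp.DbGate.run(MyApp.DbSemaphore, :bulk, fn -> bulk_insert(rows) end)
```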

Backpressure Behavior

The semaphore's FIFO queue provides natural backpressure:

  • Blocking -- acquire/3 blocks the caller until permits are available. This is the default and simplest mode. Callers queue up and are served in order.

  • Non-blocking -- try_acquire/3 returns :rejected immediately if permits are not available or if there are waiters in the queue. Use this for "best effort" work that can be dropped under load.

  • Timeout -- acquire/4 accepts a timeout in milliseconds. If permits are not available within the deadline, returns {:error, :timeout}. The caller is removed from the queue.

Fairness guarantee -- Waiters are served in strict FIFO order. A large waiter at the head of the queue blocks smaller waiters behind it, which prevents starvation of large requests: a weight-8 request waiting for permits will not be bypassed by a weight-1 request, even if capacity exists for the smaller one.

Sizing Guidelines

Utilization = avg_concurrent_weight / max

  • < 50% -- The semaphore is rarely contended. You may be over-provisioned, or the workload is bursty. Consider lowering max to catch genuine overload earlier.

  • 50--80% -- Healthy range. Some queuing occurs during bursts but callers are not waiting long.

  • > 90% -- The semaphore is a bottleneck. Callers are frequently blocked. Either increase max (if the downstream can handle it) or reduce the arrival rate.


RateLimiter Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom | Registered name for the GenServer and persistent_term config key. |
| :rate | (required) | positive number | Refill rate in tokens per second. |
| :burst_size | (required) | positive integer | Bucket capacity and initial fill. |
| :on_reject | nil | fn name -> any, or nil | Callback fired in the caller's process on every rejection. |
| weight (per-call) | 1 | positive integer | Tokens consumed per call. |

How It Works

The rate limiter uses a lazy token-bucket — there is no background timer. Tokens refill on each call based on elapsed time since the last operation:

new_tokens = min(burst_size, old_tokens + elapsed_ms * rate_per_ms)

If new_tokens >= weight, the call is granted and weight tokens are deducted. Otherwise the call is rejected immediately. The hot path (grant and reject) runs entirely in the caller's process via lock-free ETS operations — no GenServer message is sent.

Choosing rate and burst_size

| Goal | rate | burst_size | Notes |
| --- | --- | --- | --- |
| Match an external API limit of 100 req/s | 100.0 | 100 or 10 | burst_size = rate gives a one-second burst. Lower it to smooth traffic more aggressively. |
| Allow occasional bursts, then throttle | 50.0 | 500 | Bucket fills slowly but callers can burst to 500 before being throttled. |
| Strict per-second limit, no burst | 100.0 | 1 | Only 1 token ever available; at most 1 call per ~10ms. |
| Expensive operation (weight 5) at 20/s | 100.0 | 100 | Each call costs 5 tokens; effective rate = 100 / 5 = 20 op/s. |
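
Putting the first row into practice as a child spec (MyApp.ApiRateLimiter is a placeholder name; the :on_reject callback here just emits a telemetry event):

```elixir
children = [
  {Resiliency.RateLimiter,
   name: MyApp.ApiRateLimiter,
   rate: 100.0,      # refill 100 tokens per second
   burst_size: 100,  # allow a one-second burst
   on_reject: fn name ->
     :telemetry.execute([:my_app, :rate_limiter, :rejected], %{}, %{name: name})
   end}
]
```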

rate vs burst_size Interaction

# Steady-state throughput (tokens/s) = rate
# Burst capacity (tokens) = burst_size
# Time to refill from empty = burst_size / rate seconds

A full empty bucket refills in burst_size / rate seconds. If burst_size equals rate, the bucket refills in exactly one second. If burst_size is much larger than rate, callers can absorb large traffic spikes before the rate limit kicks in.

retry_after_ms Formula

When a call is rejected, the hint is:

retry_after_ms = ceil((weight - current_tokens) / rate * 1000)

This tells the caller exactly how long to wait for enough tokens to refill for their specific weight. Callers should treat this as a minimum — token counts are shared across concurrent callers.
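
A worked instance of the formula, with a rate of 100 tokens/s, 2.5 tokens currently in the bucket, and a caller asking for weight 5:

```elixir
weight = 5
current_tokens = 2.5
rate = 100.0

retry_after_ms = ceil((weight - current_tokens) / rate * 1000)
# (5 - 2.5) / 100 * 1000 = 25 ms until enough tokens have refilled
```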

Weighted Calls

Use :weight when different operations have different costs relative to your upstream rate limit. For example, if an API counts bulk requests as equivalent to N single requests, pass weight: N:

# Single item lookup: 1 token
Resiliency.RateLimiter.call(rl, fn -> get_one(id) end)

# Bulk fetch of 50 items: 50 tokens
Resiliency.RateLimiter.call(rl, fn -> get_many(ids) end, weight: 50)

A weight larger than burst_size is always rejected immediately.

get_stats/1 Usage

get_stats/1 computes the projected token count using the current timestamp, without writing to ETS or consuming any tokens. Use it for health checks and dashboards — it does not interfere with the hot path:

%{tokens: tokens, rate: rate, burst_size: burst_size} =
  Resiliency.RateLimiter.get_stats(MyApp.ApiRateLimiter)

utilization = (burst_size - tokens) / burst_size
# 0.0 = full bucket (no recent calls)
# 1.0 = empty bucket (fully rate limited)

Performance Characteristics

The hot path avoids GenServer messages entirely:

  • Grant path: persistent_term.get + ETS lookup + float refill math + ETS CAS (select_replace)
  • Reject path: persistent_term.get + ETS lookup + float refill math + ETS update_element

Observed on M-series hardware: ~3µs/call for grants, ~2µs/call for rejects. Under 8 concurrent processes, reductions per acquire stay flat (< 100) — no serialisation.

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| burst_size too small | Legitimate bursts are rejected; traffic is over-smoothed. | Set burst_size to match the upstream's burst allowance. |
| burst_size too large | Bucket takes minutes to refill after a burst; callers see retry_after_ms in the thousands. | Cap burst_size at rate * acceptable_burst_seconds. |
| weight > burst_size | Calls with that weight are always rejected. | Ensure max weight is <= burst_size. Validate at startup. |
| Treating retry_after_ms as exact | Token counts are shared; another caller may consume tokens before you retry. | Add a small jitter to the retry_after_ms before sleeping. |
| Using RateLimiter for concurrency | Bucket drains after a burst even if calls are all in-flight simultaneously. | Use Bulkhead or WeightedSemaphore for concurrency limits. |
| Starting without a supervisor | GenServer crash leaves persistent_term stale until process exits. | Always start under a supervisor using child_spec/1. |

Observability

Emitting Telemetry from CircuitBreaker

Use the :on_state_change callback to emit :telemetry events on every state transition:

{Resiliency.CircuitBreaker,
 name: MyApp.Breaker,
 failure_rate_threshold: 0.5,
 on_state_change: fn name, from, to ->
   :telemetry.execute(
     [:my_app, :circuit_breaker, :state_change],
     %{},
     %{name: name, from: from, to: to}
   )
 end}

Poll Resiliency.CircuitBreaker.get_stats/1 periodically for dashboard metrics:

stats = Resiliency.CircuitBreaker.get_stats(MyApp.Breaker)

:telemetry.execute(
  [:my_app, :circuit_breaker, :stats],
  %{
    failure_rate: stats.failure_rate,
    slow_call_rate: stats.slow_call_rate,
    total: stats.total,
    failures: stats.failures
  },
  %{name: MyApp.Breaker, state: stats.state}
)

Logging Retry Attempts

Use the :on_retry callback to emit structured log lines on every retry:

require Logger

Resiliency.BackoffRetry.retry(
  fn -> MyService.call(params) end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  on_retry: fn attempt, delay, error ->
    Logger.warning(
      "Retry attempt",
      attempt: attempt,
      delay_ms: delay,
      error: inspect(error),
      service: "my_service"
    )
  end
)

Emitting Telemetry from BackoffRetry

Wrap your retry call to emit :telemetry events for each attempt and for the final outcome:

defmodule MyApp.Resilient do
  def call_with_telemetry(fun, opts \\ []) do
    start_time = System.monotonic_time()
    meta = %{service: Keyword.get(opts, :service, :unknown)}

    result =
      Resiliency.BackoffRetry.retry(fun,
        max_attempts: Keyword.get(opts, :max_attempts, 3),
        on_retry: fn attempt, delay, error ->
          :telemetry.execute(
            [:my_app, :retry, :attempt],
            %{delay_ms: delay},
            Map.merge(meta, %{attempt: attempt, error: inspect(error)})
          )
        end,
        retry_if: Keyword.get(opts, :retry_if, fn {:error, _} -> true end)
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :retry, :success],
          %{duration: duration},
          meta
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :retry, :failure],
          %{duration: duration},
          meta
        )
    end

    result
  end
end

Emitting Telemetry from Hedged Requests

Use :on_hedge for per-hedge telemetry, and wrap the call for overall metrics:

defmodule MyApp.HedgedCall do
  def run(fun, opts \\ []) do
    service = Keyword.get(opts, :service, :unknown)
    start_time = System.monotonic_time()

    result =
      Resiliency.Hedged.run(fun,
        delay: Keyword.get(opts, :delay, 100),
        timeout: Keyword.get(opts, :timeout, 5_000),
        on_hedge: fn attempt ->
          :telemetry.execute(
            [:my_app, :hedged, :hedge_fired],
            %{},
            %{service: service, attempt: attempt}
          )
        end
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :success],
          %{duration: duration},
          %{service: service}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :failure],
          %{duration: duration},
          %{service: service}
        )
    end

    result
  end
end

Monitoring Adaptive Hedging with Tracker.stats/1

Poll Resiliency.Hedged.Tracker.stats/1 periodically to feed dashboards:

defmodule MyApp.HedgeReporter do
  use GenServer

  def start_link(opts) do
    tracker = Keyword.fetch!(opts, :tracker)
    interval = Keyword.get(opts, :interval, 10_000)
    GenServer.start_link(__MODULE__, %{tracker: tracker, interval: interval})
  end

  @impl true
  def init(state) do
    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:report, state) do
    stats = Resiliency.Hedged.Tracker.stats(state.tracker)

    :telemetry.execute(
      [:my_app, :hedged, :tracker_stats],
      %{
        p50: stats.p50 || 0,
        p95: stats.p95 || 0,
        p99: stats.p99 || 0,
        current_delay: stats.current_delay,
        tokens: stats.tokens,
        total_requests: stats.total_requests,
        hedged_requests: stats.hedged_requests,
        hedge_won: stats.hedge_won
      },
      %{tracker: state.tracker}
    )

    hedge_rate =
      if stats.total_requests > 0,
        do: stats.hedged_requests / stats.total_requests,
        else: 0.0

    win_rate =
      if stats.hedged_requests > 0,
        do: stats.hedge_won / stats.hedged_requests,
        else: 0.0

    :telemetry.execute(
      [:my_app, :hedged, :tracker_rates],
      %{hedge_rate: hedge_rate, win_rate: win_rate},
      %{tracker: state.tracker}
    )

    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :report, interval)
end

Monitoring WeightedSemaphore

The semaphore does not expose internal stats directly. Instrument it by wrapping calls:

defmodule MyApp.InstrumentedSemaphore do
  def acquire(sem, weight, fun) do
    start_time = System.monotonic_time()

    result = Resiliency.WeightedSemaphore.acquire(sem, weight, fn ->
      wait_duration = System.monotonic_time() - start_time

      :telemetry.execute(
        [:my_app, :semaphore, :acquired],
        %{wait_duration: wait_duration, weight: weight},
        %{semaphore: sem}
      )

      fun.()
    end)

    total_duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :ok}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :error}
        )
    end

    result
  end

  def try_acquire(sem, weight, fun) do
    case Resiliency.WeightedSemaphore.try_acquire(sem, weight, fun) do
      :rejected ->
        :telemetry.execute(
          [:my_app, :semaphore, :rejected],
          %{weight: weight},
          %{semaphore: sem}
        )

        :rejected

      other ->
        other
    end
  end
end

Suggested Telemetry Event Names

| Module | Event | Measurements | Metadata |
| --- | --- | --- | --- |
| CircuitBreaker | [:app, :circuit_breaker, :state_change] | %{} | %{name: atom, from: atom, to: atom} |
| CircuitBreaker | [:app, :circuit_breaker, :stats] | %{failure_rate: float, total: integer, ...} | %{name: atom, state: atom} |
| BackoffRetry | [:app, :retry, :attempt] | %{delay_ms: integer} | %{attempt: integer, error: string, service: atom} |
| BackoffRetry | [:app, :retry, :success] | %{duration: native_time} | %{service: atom} |
| BackoffRetry | [:app, :retry, :failure] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :hedge_fired] | %{} | %{service: atom, attempt: integer} |
| Hedged | [:app, :hedged, :success] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :tracker_stats] | %{p50: num, p95: num, p99: num, ...} | %{tracker: atom} |
| Semaphore | [:app, :semaphore, :acquired] | %{wait_duration: native_time, weight: integer} | %{semaphore: atom} |
| Semaphore | [:app, :semaphore, :rejected] | %{weight: integer} | %{semaphore: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :start] | %{system_time: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :rejected] | %{retry_after: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :stop] | %{duration: native_time} | %{name: atom, result: :ok or :error, error: term or nil} |

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| minimum_calls too low | Circuit trips on normal variance -- a few early failures trip the breaker. | Increase minimum_calls to at least 10. Higher for high-throughput services. |
| failure_rate_threshold too low | Circuit trips too aggressively; service appears degraded when it is merely imperfect. | Start with 0.5 and lower only if the downstream is critical and failures are costly. |
| open_timeout too short | Circuit keeps probing a still-broken service, consuming resources. | Set open_timeout to at least the downstream's expected recovery time. |
| open_timeout too long | Service has recovered but callers are still being rejected. | Balance between recovery time and responsiveness. Use force_close/1 for manual intervention. |
| window_size too small | A few bad calls dominate the rate; circuit trips on transient spikes. | Use a window large enough to smooth out normal variance (e.g., 100+). |
| permitted_calls_in_half_open too low | A single unlucky probe reopens the circuit; recovery takes multiple open-timeout cycles. | Increase to 3--5 for more confident half-open evaluation. |
| Not handling {:error, :circuit_open} | Caller crashes or returns unexpected error shape. | Always pattern-match on :circuit_open and degrade gracefully. |
| Retry delay too short | Floods downstream during outage; downstream never recovers. | Increase base_delay, use exponential backoff, add jitter via Backoff.jitter/2. |
| No jitter on retries | Thundering herd -- all clients retry at the same instant. | Compose Backoff.jitter(0.25) into your backoff stream. |
| Retrying non-idempotent calls | Duplicate side effects (double charges, duplicate messages). | Use :retry_if to only retry safe errors (timeouts, connection refused). Return Resiliency.BackoffRetry.abort(reason) for fatal errors. |
| No :budget with high max_attempts | Callers block for minutes during sustained outages. | Set :budget to your SLA ceiling. |
| Hedge percentile too low (e.g., p50) | Every other request spawns a hedge -- doubles load on downstream. | Use p90--p99. Start with p95 and lower only if tail latency is severe and the downstream can handle it. |
| Hedge initial_delay too low | During cold start, hedges fire on nearly every request before samples accumulate. | Set initial_delay to your expected p95 or higher. |
| Token bucket too generous | Hedge rate exceeds expectations. | Lower token_success_credit or raise token_hedge_cost. The steady-state hedge fraction is token_success_credit / token_hedge_cost. |
| Token bucket too restrictive | Hedging is effectively disabled; tail latency suffers. | Raise token_success_credit or increase token_max for burst capacity. |
| SingleFlight key too broad | Unrelated requests share a result. | Use fine-grained keys that include all parameters affecting the result. |
| SingleFlight on non-idempotent work | Side effects (writes, increments) execute only once instead of per-caller. | Do not use SingleFlight for write operations. |
| Semaphore max too high | Downstream overloaded despite semaphore. | Lower max to match the downstream's actual capacity. |
| Semaphore max too low | Healthy throughput is artificially limited; callers queue unnecessarily. | Profile the downstream and raise max to its tested concurrency limit. |
| try_acquire without fallback | :rejected silently drops work. | Always handle the :rejected case -- return an error, queue the work, or degrade gracefully. |
| Weight exceeds max | {:error, :weight_exceeds_max} returned immediately. | Ensure no single operation's weight can exceed the semaphore's :max. Validate weights at the call site. |
| Race.run/1 without timeout | If all backends hang, the caller hangs forever. | Always pass a :timeout to Race.run/1 in production. |
| Resiliency.Map.run/3 with max_concurrency: 1 | Effectively sequential -- no parallelism benefit. | Use max_concurrency >= 2. If you need sequential execution, use Enum.map/2 directly. |
| Forgetting to supervise stateful modules | Tracker, SingleFlight, or CircuitBreaker crashes and is not restarted. | Always start Resiliency.CircuitBreaker, Resiliency.Hedged, and Resiliency.SingleFlight under a supervisor using their child_spec/1. |