This guide covers parameter selection, tuning strategies, and observability patterns for every module in the Resiliency library. Each section provides a parameter reference table, explains how parameters interact, and gives concrete recommendations for common workloads.
Table of Contents
- CircuitBreaker Tuning
- BackoffRetry Tuning
- Hedged Requests Tuning
- SingleFlight Tuning
- Task Combinator Tuning
- WeightedSemaphore Tuning
- RateLimiter Tuning
- Observability
- Common Pitfalls
CircuitBreaker Tuning
Parameter Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:name` | -- (required) | atom or `{:via, ...}` | Registered name for the GenServer. |
| `:window_size` | 100 | 1.. | Number of call outcomes in the count-based sliding window. Larger windows smooth out short bursts but react slower to genuine shifts. |
| `:failure_rate_threshold` | 0.5 | 0.0..1.0 | Failure rate that trips the circuit. 0.5 means the circuit trips when half or more of the recorded calls fail. |
| `:slow_call_threshold` | :infinity | :infinity or 1.. ms | Duration above which a call is classified as "slow". :infinity disables slow call detection entirely. |
| `:slow_call_rate_threshold` | 1.0 | 0.0..1.0 | Slow call rate that trips the circuit. 1.0 effectively disables slow-rate tripping. |
| `:open_timeout` | 60_000 ms | 1.. | Time the circuit stays :open before transitioning to :half_open for probing. |
| `:permitted_calls_in_half_open` | 1 | 1.. | Number of probe calls allowed through in :half_open state before deciding to close or reopen. |
| `:minimum_calls` | 10 | 1.. | Minimum recorded calls in the window before the failure rate is evaluated. Prevents tripping on small sample sizes. |
| `:should_record` | default predicate | `fn result -> :success \| :failure \| :ignore` | Custom classification function. :ignore results are not counted in the window. |
| `:on_state_change` | nil | `fn name, from, to -> any` or nil | Callback fired on every state transition. Use for logging, metrics, or telemetry. |
How Parameters Interact
The circuit evaluates after each recorded call:
```
if window.total >= minimum_calls do
  if failure_rate >= failure_rate_threshold or slow_rate >= slow_call_rate_threshold do
    trip to :open
  end
end
```

The `:minimum_calls` parameter acts as a warm-up guard -- the circuit will not trip until enough calls have been observed. This prevents a single failure from tripping a freshly started breaker.
The sliding window is count-based (not time-based). Old outcomes are evicted when new ones push them out of the fixed-size buffer. This means the window naturally adapts to traffic volume without requiring time-based expiry.
Tuning for Common Workloads
| Scenario | failure_rate_threshold | minimum_calls | window_size | open_timeout | Notes |
|---|---|---|---|---|---|
| High-throughput API | 0.5 | 20 | 200 | 30_000 | Larger window for stable signal. |
| Critical payment service | 0.3 | 10 | 100 | 60_000 | Trip earlier to protect revenue. |
| Background job queue | 0.8 | 50 | 500 | 120_000 | Tolerate more failures, longer recovery. |
| Health-check probe | 0.5 | 3 | 10 | 10_000 | Small window, fast recovery for probes. |
| Slow-call-sensitive API | 0.5 / 0.3 (slow) | 10 | 100 | 30_000 | Set slow_call_threshold to p99 latency. |
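As a concrete starting point, the slow-call-sensitive profile from the table above could be wired into a supervision tree like this (a sketch; `MyApp.PaymentsBreaker` and the 800 ms threshold are illustrative -- set `slow_call_threshold` to your own p99):

```elixir
children = [
  {Resiliency.CircuitBreaker,
   name: MyApp.PaymentsBreaker,
   window_size: 100,
   minimum_calls: 10,
   failure_rate_threshold: 0.5,
   slow_call_threshold: 800,        # ~p99 latency of the downstream, in ms
   slow_call_rate_threshold: 0.3,
   open_timeout: 30_000}
]

Supervisor.start_link(children, strategy: :one_for_one)
```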
Interpreting get_stats
Call Resiliency.CircuitBreaker.get_stats/1 to inspect the breaker at runtime:
```elixir
%{
  state: :closed,
  total: 87,
  failures: 12,
  slow_calls: 3,
  failure_rate: 0.1379,
  slow_call_rate: 0.0345
}
```

| Stat | Healthy range | What it means |
|---|---|---|
| `failure_rate` | < failure_rate_threshold | Current failure rate in the sliding window. |
| `slow_call_rate` | < slow_call_rate_threshold | Current slow call rate. High values suggest downstream latency issues. |
| `total` | >= minimum_calls | Total calls in the window. Below minimum_calls, the circuit will not trip. |
| `state` | :closed | Current state. :open means rejecting calls; :half_open means probing. |
BackoffRetry Tuning
Parameter Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:max_attempts` | 3 | 1.. | Total attempts including the first call. 1 means no retries. |
| `:backoff` | :exponential | :exponential, :linear, :constant, or any Enumerable of ms | Determines the shape of the delay curve between retries. |
| `:base_delay` | 100 ms | 0.. | Seed value for delay computation -- the first retry waits this long (before cap/jitter). |
| `:max_delay` | 5_000 ms | 0.. | Hard ceiling on any single retry delay. Applied via `Backoff.cap/2`. |
| `:budget` | :infinity | :infinity or 0.. ms | Total wall-clock budget for all attempts. When the next delay would push past the deadline, retries stop. |
| `:retry_if` | `fn {:error, _} -> true end` | `fn result -> boolean end` | Predicate that decides whether a given failure is retryable. Non-matching errors are returned immediately. |
| `:on_retry` | nil | `fn attempt, delay, error -> any` | Callback fired before each sleep. Use for logging, metrics, telemetry. |
| `:sleep_fn` | `&Process.sleep/1` | `fn ms -> any` | Injectable sleep -- replace with a no-op or test double in tests. |
| `:reraise` | false | true, false | When true, re-raises rescued exceptions (with the original stacktrace) instead of returning `{:error, exception}` after exhausting retries. |
How Parameters Interact
The effective retry sequence is computed as:
```elixir
delays =
  backoff_strategy(base_delay)
  |> Backoff.cap(max_delay)
  |> Enum.take(max_attempts - 1)
```

For exponential backoff with the defaults, the delay sequence before cap is:

```
attempt 1: immediate (no delay -- first call)
attempt 2: 100 ms
attempt 3: 200 ms
```

With `max_attempts: 5` and `base_delay: 100`:

```
attempt 1: immediate
attempt 2: 100 ms
attempt 3: 200 ms
attempt 4: 400 ms
attempt 5: 800 ms
```

The `:budget` option acts as a secondary stop condition. Even if `max_attempts` has not been reached, the retry loop aborts when the next sleep would exceed the total time budget. This makes `:budget` the right knob for SLA-sensitive callers -- set it to your upstream timeout minus a safety margin.
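For example, a caller with a 2-second upstream timeout might cap the whole retry loop well inside it (a sketch using only the options documented above):

```elixir
# Stop retrying once the next sleep would push total elapsed time past
# 1_500 ms, leaving ~500 ms of headroom under a 2 s upstream timeout.
Resiliency.BackoffRetry.retry(fn -> call_service() end,
  max_attempts: 5,
  backoff: :exponential,
  base_delay: 100,
  max_delay: 500,
  budget: 1_500
)
```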
The `Backoff.jitter/2` modifier (usable when passing a custom stream) spreads each delay `d` uniformly over `[d * (1 - proportion), d * (1 + proportion)]`. Jitter is critical in multi-instance deployments to avoid a thundering herd on the downstream service after a shared failure.
Backoff Strategy Formulas
| Strategy | Formula (n-th retry, 0-indexed) | Sequence (base=100) |
|---|---|---|
| `:exponential` | base * multiplier^n | 100, 200, 400, 800, 1600, ... |
| `:linear` | base + increment * n | 100, 200, 300, 400, 500, ... |
| `:constant` | base | 100, 100, 100, 100, 100, ... |
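Since `:backoff` also accepts any Enumerable, you can inspect a strategy's raw sequence directly in IEx (a sketch assuming the `Backoff` constructors shown in the custom-stream example below):

```elixir
alias Resiliency.BackoffRetry.Backoff

Backoff.exponential(base: 100, multiplier: 2) |> Enum.take(5)
#=> [100, 200, 400, 800, 1600]

Backoff.exponential(base: 100, multiplier: 2) |> Backoff.cap(500) |> Enum.take(5)
#=> [100, 200, 400, 500, 500]
```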
Tuning for Common Workloads
| Scenario | max_attempts | backoff | base_delay | max_delay | budget | Notes |
|---|---|---|---|---|---|---|
| Low-latency API | 2--3 | :exponential | 50 | 500 | 1_000 | Short budget prevents blocking the request path. |
| Batch / ETL job | 5--8 | :exponential | 500 | 30_000 | :infinity | Generous delays -- throughput matters more than latency. |
| Database reconnect | 10+ | :exponential + jitter | 200 | 60_000 | :infinity | Long max_delay with jitter to avoid connection storms. |
| Idempotent webhook | 4 | :linear | 1_000 | 10_000 | 30_000 | Linear ramp gives the receiver steady recovery time. |
| Circuit-breaker probe | 1 | :constant | 0 | 0 | 500 | Single attempt with tight budget -- just checking if the service is back. |
Custom Backoff Streams
For advanced scenarios, compose your own stream with jitter and cap:
```elixir
custom_backoff =
  Resiliency.BackoffRetry.Backoff.exponential(base: 200, multiplier: 3)
  |> Resiliency.BackoffRetry.Backoff.jitter(0.25)
  |> Resiliency.BackoffRetry.Backoff.cap(15_000)

Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: custom_backoff,
  max_attempts: 6
)
```

Or pass a literal list when you want fully deterministic delays:

```elixir
Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: [100, 500, 2_000, 5_000]
)
```

When a list is provided, `max_attempts` is implicitly `length(list) + 1` (one initial call plus one retry per delay entry).
Rule of Thumb
Total worst-case latency = `sum(delays) + max_attempts * max_call_duration`. With the defaults (`max_attempts: 3`, `base_delay: 100`) and a 1-second call timeout, that is 100 + 200 + 3 * 1000 = 3300 ms. If this exceeds your caller's timeout, either reduce `max_attempts`, lower `max_delay`, or set a `:budget`.
Hedged Requests Tuning
Stateless Mode Parameter Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:delay` | 100 ms | 0.. | Time to wait before firing the backup (hedge) request. |
| `:max_requests` | 2 | 1.. | Total concurrent attempts. 1 disables hedging entirely. |
| `:timeout` | 5_000 ms | 1.. | Overall deadline -- all in-flight tasks are killed at this point. |
| `:non_fatal` | `fn _ -> false end` | `fn reason -> boolean` | When true for a failure reason, the next hedge fires immediately instead of waiting for the delay. |
| `:on_hedge` | nil | `fn attempt -> any` | Callback invoked before each hedge fires. Use for metrics and logging. |
Adaptive Mode (Tracker) Parameter Reference
When using Resiliency.Hedged.start_link/1, the delay auto-tunes based on
observed latency. A token bucket throttles the hedge rate under load.
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:name` | -- (required) | atom or `{:via, ...}` | Registered name for the tracker GenServer. |
| `:percentile` | 95 | 0..100 | Target latency percentile used as the adaptive delay. Higher values hedge less aggressively. |
| `:buffer_size` | 1_000 | 1.. | Number of latency samples in the rolling window. Larger buffers smooth out spikes but react slower to shifts. |
| `:min_delay` | 1 ms | 0.. | Floor for the adaptive delay. Prevents hedging on every request even when p95 is near zero. |
| `:max_delay` | 5_000 ms | 1.. | Ceiling for the adaptive delay. Ensures hedges still fire even during latency spikes. |
| `:initial_delay` | 100 ms | 0.. | Delay used during cold start, before :min_samples observations are collected. |
| `:min_samples` | 10 | 0.. | Number of observations required before switching from :initial_delay to the adaptive percentile. |
| `:token_max` | 10 | > 0 | Token bucket capacity. Determines the burst budget for hedging. |
| `:token_success_credit` | 0.1 | > 0 | Tokens earned per completed request (hedged or not). |
| `:token_hedge_cost` | 1.0 | > 0 | Tokens consumed when a hedge fires. |
| `:token_threshold` | 1.0 | >= 0 | Minimum token balance required to allow hedging. Below this, max_requests is forced to 1. |
How Adaptive Delay Works
- **Cold start** -- Before `min_samples` observations, the tracker returns `initial_delay` as the hedge delay. Pick this conservatively -- too low and you double the load during startup.
- **Steady state** -- The tracker maintains a circular buffer of the last `buffer_size` latency samples. On each `get_config/1` call, it computes the configured percentile of the buffer and clamps it to `[min_delay, max_delay]`. This becomes the hedge delay.
- **Token bucket** -- Every completed request (success or failure) adds `token_success_credit` tokens. Every hedge that fires costs `token_hedge_cost` tokens. When tokens drop below `token_threshold`, hedging is disabled until enough successful requests replenish the bucket. This creates a natural feedback loop: under sustained load where hedges are not winning, the system backs off automatically.
Effective hedge rate (steady state):

```
max_hedge_fraction = token_success_credit / token_hedge_cost
```

With the defaults (0.1 / 1.0), at most 10% of requests will be hedged in steady state. To allow up to 20%, set `token_success_credit: 0.2`. To allow up to 5%, set `token_success_credit: 0.05` or `token_hedge_cost: 2.0`.
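Putting the token parameters together, a tracker tuned for roughly a 5% hedge ceiling might be configured like this (a sketch; the child-spec tuple shape and the `MyApp.DbHedge` name are assumptions for illustration):

```elixir
# Steady-state hedge fraction = token_success_credit / token_hedge_cost = 0.05
{Resiliency.Hedged,
 name: MyApp.DbHedge,
 percentile: 99,              # hedge only past the observed p99
 buffer_size: 2_000,          # larger window for a slow, steady workload
 initial_delay: 500,          # conservative cold-start delay
 min_samples: 50,
 token_success_credit: 0.05,
 token_hedge_cost: 1.0}
```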
Tuning for Different Scenarios
| Scenario | :percentile | :buffer_size | Token settings | Notes |
|---|---|---|---|---|
| Fast API (p50 ~5ms) | 90 | 500 | defaults | Aggressive hedging -- low cost per extra request. |
| Slow DB query (p50 ~200ms) | 99 | 2_000 | token_success_credit: 0.05 | Conservative -- extra DB queries are expensive. |
| Multi-region fanout | 95 | 1_000 | defaults | Classic tail-latency use case. |
| Startup / cold cache | -- | -- | initial_delay: 500 | High initial delay avoids flooding a cold service. |
Interpreting Tracker Stats
Call Resiliency.Hedged.Tracker.stats/1 to inspect the tracker at runtime:
```elixir
%{
  total_requests: 15234,
  hedged_requests: 1412,
  hedge_won: 987,
  p50: 12,
  p95: 45,
  p99: 210,
  current_delay: 45,
  tokens: 7.2
}
```

| Stat | Healthy range | What it means |
|---|---|---|
| `hedged_requests / total_requests` | 5--15% | Hedge rate. Below 5% suggests the delay is too high or the service is healthy (good). Above 20% means you are generating significant extra load. |
| `hedge_won / hedged_requests` | 30--70% | Win rate. Below 30% means hedges rarely help -- consider raising the percentile. Above 70% means the primary is consistently slow -- investigate the downstream. |
| `tokens` | > token_threshold | Token balance. If consistently near zero, hedging is being throttled. Raise token_success_credit or lower token_hedge_cost if you want more hedging. |
| `p99 / p50` ratio | < 10x | Tail-to-median ratio. A high ratio (> 20x) indicates severe tail latency -- hedging is valuable here. |
| `current_delay` | between min_delay and max_delay | The adaptive delay. If pinned at max_delay, latency has spiked and the tracker is being conservative. |
SingleFlight Tuning
Resiliency.SingleFlight has no tunable numeric parameters -- its behavior is
determined entirely by key design and usage patterns.
When It Helps
- **Cache stampede** -- Many processes request the same cache key after expiry. Without SingleFlight, all of them hit the database. With it, one process fetches while the rest share the result.
- **Expensive computation** -- Deduplicating concurrent calls to a heavy aggregation or report-generation function.
- **External API rate limits** -- Preventing duplicate requests to a rate-limited third-party service.
When It Hurts
- **Non-idempotent operations** -- If the function has side effects that must execute per-caller (e.g., incrementing a counter, sending a notification), SingleFlight will suppress those side effects for coalesced callers.
- **Caller-specific context** -- If each caller needs a slightly different variant of the result (different query parameters, different auth tokens), deduplication by a shared key will return the wrong result for some callers.
- **Very short functions** -- If the function completes in microseconds, the overhead of the GenServer round-trip (message passing, ETS or map lookup) may exceed the savings from deduplication.
Key Design Considerations
| Consideration | Guidance |
|---|---|
| Key granularity | Too broad (e.g., "users") coalesces unrelated calls. Too narrow (e.g., "user:#{id}:#{timestamp}") defeats deduplication. Use the natural cache key. |
| Key type | Any Erlang term works. Atoms and short strings are fastest for map lookups. |
| Error propagation | If the executing function fails, all waiting callers receive the same {:error, reason}. This is usually correct for cache-fill scenarios but may not be appropriate if different callers should retry independently. |
| Timeout | Use flight/4 with a timeout when the function may be slow. Timed-out callers exit, but the in-flight function continues and serves other waiters. |
| `forget/2` | Call `forget/2` to force a fresh execution for the next caller. Useful when you know cached data is stale (e.g., after a write). Existing waiters still receive the original result. |
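A cache-fill sketch tying these guidelines together (the `flight(server, key, fun)` call shape is an assumption consistent with the `flight/4`-with-timeout form above; `MyApp.Flights`, `Repo`, and `User` are illustrative names):

```elixir
# Concurrent callers asking for the same user id share one DB query.
# The key is the natural cache key: fine-grained, but not over-specific.
Resiliency.SingleFlight.flight(MyApp.Flights, {:user, id}, fn ->
  Repo.get!(User, id)
end)
```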
Task Combinator Tuning
Choosing the Right Combinator
| Module | Use when | Concurrency | Failure behavior |
|---|---|---|---|
| `Resiliency.Race` | You need the fastest result from N alternatives | All functions run concurrently | First success wins; crashed tasks are skipped; returns `{:error, :all_failed}` if all fail |
| `Resiliency.AllSettled` | You need every result regardless of failures | All functions run concurrently | Each result is `{:ok, _}` or `{:error, _}` independently |
| `Resiliency.Map` | You are processing a collection with bounded parallelism | Up to `max_concurrency` at a time | Cancels all remaining work on first failure |
| `Resiliency.FirstOk` | You have a fallback chain (cache -> DB -> API) | Sequential -- one at a time | Tries the next function only after the previous one fails |
Timeout Selection
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:timeout` (Race) | :infinity | :infinity or 1.. ms | Overall deadline for the race. Remaining tasks are killed when it expires. |
| `:timeout` (AllSettled) | :infinity | :infinity or 1.. ms | Completed tasks keep their results; tasks still running get `{:error, :timeout}`. |
| `:timeout` (Map) | :infinity | :infinity or 1.. ms | Returns `{:error, :timeout}` and kills all active tasks. |
| `:timeout` (FirstOk) | :infinity | :infinity or 1.. ms | Total budget across all sequential attempts. |
| `:max_concurrency` (Map) | System.schedulers_online() | 1.. | Limits how many items are processed in parallel. |
Timeout Rules of Thumb
- For `Race.run/1` -- set the timeout to your SLA ceiling. If no backend responds in time, you want a clear timeout rather than an indefinite hang.
- For `AllSettled.run/1` -- set the timeout to the slowest acceptable task duration. Tasks that finish within the deadline keep their results; the rest are marked as timed out.
- For `Resiliency.Map.run/2` -- multiply your per-item budget by the number of items divided by `max_concurrency`, then add a margin (see the sketch after this list). Or use `:infinity` and rely on per-item timeouts inside the function.
- For `FirstOk.run/1` -- set the timeout to the total latency budget for the entire fallback chain. Each attempt subtracts from the remaining budget.
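For instance, bounded batch processing with the budget arithmetic above might look like this sketch (this guide references both `Resiliency.Map.run/2` and `run/3`, so check your version for where the options go; `items` and `process_item` are illustrative):

```elixir
# 100 items, 8 at a time. A ~2 s per-item budget gives
# 2_000 * (100 / 8) = 25_000 ms of work; 30 s adds a margin.
Resiliency.Map.run(items, fn item -> process_item(item) end,
  max_concurrency: 8,
  timeout: 30_000
)
```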
Race vs FirstOk Decision
Do you want concurrent execution?

```
Yes -> Race.run/1    (all functions run at once, first success wins)
No  -> FirstOk.run/1 (sequential, stops at first success)
```

Use `Race.run/1` when all backends can handle the load of concurrent requests. Use `FirstOk.run/1` when you want to avoid unnecessary calls to slower or more expensive backends.
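A fallback chain might look like the following sketch (the list-of-functions shape and the options argument are assumptions based on `FirstOk.run/1` and its `:timeout` parameter above; `Cache`, `Database`, and `ExternalApi` are illustrative):

```elixir
# Cache -> DB -> API: each function runs only after the previous
# one returns an error; :timeout bounds the whole chain.
Resiliency.FirstOk.run(
  [
    fn -> Cache.fetch(key) end,
    fn -> Database.fetch(key) end,
    fn -> ExternalApi.fetch(key) end
  ],
  timeout: 2_000
)
```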
WeightedSemaphore Tuning
Parameter Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:name` | -- (required) | atom or `{:via, ...}` | Registered name for the semaphore GenServer. |
| `:max` | -- (required) | 1.. | Total permit capacity. The sum of all concurrently held weights must not exceed this value. |
max Selection
The :max value represents your concurrency budget. Choose it based on the
downstream resource's capacity:
| Resource | Suggested max | Rationale |
|---|---|---|
| Database connection pool (size N) | N or N - 1 | Match the pool size. Reserve one connection for health checks if needed. |
| External API (rate limit R req/s) | R / avg_requests_per_second | Keep in-flight requests below the rate limit. |
| CPU-bound work | System.schedulers_online() | One permit per scheduler avoids over-subscription. |
| Memory-bound work | available_mb / per_task_mb | Weight by memory cost per task. |
Weight Assignment Strategies
| Strategy | When to use | Example |
|---|---|---|
| Uniform (weight=1) | All operations have equal cost | acquire(sem, fn -> read_row() end) |
| Cost-proportional | Operations vary in resource consumption | acquire(sem, row_count, fn -> bulk_insert(rows) end) |
| Tiered | Two or three operation classes | Reads = 1, writes = 3, bulk = 10 |
| Estimated | Cost is data-dependent | acquire(sem, estimate_cost(query), fn -> run(query) end) |
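As a sketch of the tiered strategy (the semaphore name, weights, and started-under-a-supervisor shape are illustrative; `acquire/3` matches the usage shown in the Observability section below):

```elixir
# Started elsewhere as {Resiliency.WeightedSemaphore, name: MyApp.DbSem, max: 20}.
# Reads are cheap, writes heavier, bulk operations heaviest.
Resiliency.WeightedSemaphore.acquire(MyApp.DbSem, 1, fn -> read_row(id) end)
Resiliency.WeightedSemaphore.acquire(MyApp.DbSem, 3, fn -> update_row(id, attrs) end)
Resiliency.WeightedSemaphore.acquire(MyApp.DbSem, 10, fn -> bulk_insert(rows) end)
```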
Backpressure Behavior
The semaphore's FIFO queue provides natural backpressure:
- **Blocking** -- `acquire/3` blocks the caller until permits are available. This is the default and simplest mode. Callers queue up and are served in order.
- **Non-blocking** -- `try_acquire/3` returns `:rejected` immediately if permits are not available or if there are waiters in the queue. Use this for "best effort" work that can be dropped under load.
- **Timeout** -- `acquire/4` accepts a timeout in milliseconds. If permits are not available within the deadline, it returns `{:error, :timeout}` and the caller is removed from the queue.
**Fairness guarantee** -- Waiters are served in strict FIFO order. A large waiter at the head of the queue blocks smaller waiters behind it, which prevents large requests from being starved. This means a weight-8 request waiting for permits will not be bypassed by a weight-1 request, even if capacity exists for the smaller request.
Sizing Guidelines
Utilization = `avg_concurrent_weight / max`

- **< 50%** -- The semaphore is rarely contended. You may be over-provisioned, or the workload is bursty. Consider lowering `max` to catch genuine overload earlier.
- **50--80%** -- Healthy range. Some queuing occurs during bursts but callers are not waiting long.
- **> 90%** -- The semaphore is a bottleneck. Callers are frequently blocked. Either increase `max` (if the downstream can handle it) or reduce the arrival rate.
RateLimiter Tuning
Parameter Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `:name` | -- (required) | atom | Registered name for the GenServer and persistent_term config key. |
| `:rate` | -- (required) | positive number | Refill rate in tokens per second. |
| `:burst_size` | -- (required) | positive integer | Bucket capacity and initial fill. |
| `:on_reject` | nil | `fn name -> any` or nil | Callback fired in the caller's process on every rejection. |
| `weight` (per-call) | 1 | positive integer | Tokens consumed per call. |
How It Works
The rate limiter uses a lazy token bucket -- there is no background timer. Tokens refill on each call based on elapsed time since the last operation:

```
new_tokens = min(burst_size, old_tokens + elapsed_ms * rate_per_ms)
```

If `new_tokens >= weight`, the call is granted and `weight` tokens are deducted. Otherwise the call is rejected immediately. The hot path (grant and reject) runs entirely in the caller's process via lock-free ETS operations -- no GenServer message is sent.
Choosing rate and burst_size
| Goal | rate | burst_size | Notes |
|---|---|---|---|
| Match an external API limit of 100 req/s | 100.0 | 100 or 10 | burst_size = rate gives a one-second burst. Lower it to smooth traffic more aggressively. |
| Allow occasional bursts, then throttle | 50.0 | 500 | Bucket fills slowly but callers can burst to 500 before being throttled. |
| Strict per-second limit, no burst | 100.0 | 1 | Only 1 token ever available; at most 1 call per ~10ms. |
| Expensive operation (weight 5) at 20/s | 100.0 | 100 | Each call costs 5 tokens; effective rate = 100 / 5 = 20 op/s. |
rate vs burst_size Interaction
```
# Steady-state throughput (tokens/s) = rate
# Burst capacity (tokens)            = burst_size
# Time to refill from empty          = burst_size / rate seconds
```

An empty bucket refills completely in `burst_size / rate` seconds (e.g., with `rate: 50.0` and `burst_size: 500`, a full refill takes 10 seconds). If `burst_size` equals `rate`, the bucket refills in exactly one second. If `burst_size` is much larger than `rate`, callers can absorb large traffic spikes before the rate limit kicks in.
retry_after_ms Formula
When a call is rejected, the hint is:

```
retry_after_ms = ceil((weight - current_tokens) / rate * 1000)
```

This tells the caller how long to wait until enough tokens have refilled for their specific weight. Treat it as a minimum -- token counts are shared across concurrent callers, so another caller may drain the bucket first.
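A caller-side sketch of honoring the hint with jitter, as the pitfalls table below recommends (the rejection shape `{:error, {:rejected, retry_after_ms}}` is an assumption for illustration -- match on whatever your version actually returns):

```elixir
# One retry after the hinted wait, plus up to ~20% jitter so that
# concurrent callers do not all wake at the same instant.
case Resiliency.RateLimiter.call(rl, fun) do
  {:error, {:rejected, retry_after_ms}} ->
    # Assumed rejection shape -- verify against your version.
    jitter = :rand.uniform(max(div(retry_after_ms, 5), 1))
    Process.sleep(retry_after_ms + jitter)
    Resiliency.RateLimiter.call(rl, fun)

  other ->
    other
end
```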
Weighted Calls
Use :weight when different operations have different costs relative to your
upstream rate limit. For example, if an API counts bulk requests as equivalent to
N single requests, pass weight: N:
```elixir
# Single item lookup: 1 token
Resiliency.RateLimiter.call(rl, fn -> get_one(id) end)

# Bulk fetch of 50 items: 50 tokens
Resiliency.RateLimiter.call(rl, fn -> get_many(ids) end, weight: 50)
```

A weight larger than `burst_size` is always rejected immediately.
get_stats/1 Usage
`get_stats/1` computes the projected token count using the current timestamp, without writing to ETS or consuming any tokens. Use it for health checks and dashboards -- it does not interfere with the hot path:

```elixir
%{tokens: tokens, rate: rate, burst_size: burst_size} =
  Resiliency.RateLimiter.get_stats(MyApp.ApiRateLimiter)

utilization = (burst_size - tokens) / burst_size
# 0.0 = full bucket (no recent calls)
# 1.0 = empty bucket (fully rate limited)
```

Performance Characteristics
The hot path avoids GenServer messages entirely:

- Grant path: `persistent_term.get` + ETS lookup + float refill math + ETS CAS (`select_replace`)
- Reject path: `persistent_term.get` + ETS lookup + float refill math + ETS `update_element`

Observed on M-series hardware: ~3µs/call for grants, ~2µs/call for rejects. Under 8 concurrent processes, reductions per acquire stay flat (< 100) -- no serialisation.
Common Pitfalls
| Mistake | Symptom | Fix |
|---|---|---|
| `burst_size` too small | Legitimate bursts are rejected; traffic is over-smoothed. | Set `burst_size` to match the upstream's burst allowance. |
| `burst_size` too large | Bucket takes minutes to refill after a burst; callers see retry_after_ms in the thousands. | Cap `burst_size` at `rate * acceptable_burst_seconds`. |
| `weight > burst_size` | Calls with that weight are always rejected. | Ensure the max weight is <= `burst_size`. Validate at startup. |
| Treating retry_after_ms as exact | Token counts are shared; another caller may consume tokens before you retry. | Add a small jitter to `retry_after_ms` before sleeping. |
| Using RateLimiter for concurrency | Bucket drains after a burst even if calls are all in-flight simultaneously. | Use Bulkhead or WeightedSemaphore for concurrency limits. |
| Starting without a supervisor | GenServer crash leaves persistent_term stale until process exits. | Always start under a supervisor using child_spec/1. |
Observability
Emitting Telemetry from CircuitBreaker
Use the :on_state_change callback to emit :telemetry events on every
state transition:
```elixir
{Resiliency.CircuitBreaker,
 name: MyApp.Breaker,
 failure_rate_threshold: 0.5,
 on_state_change: fn name, from, to ->
   :telemetry.execute(
     [:my_app, :circuit_breaker, :state_change],
     %{},
     %{name: name, from: from, to: to}
   )
 end}
```

Poll `Resiliency.CircuitBreaker.get_stats/1` periodically for dashboard metrics:

```elixir
stats = Resiliency.CircuitBreaker.get_stats(MyApp.Breaker)

:telemetry.execute(
  [:my_app, :circuit_breaker, :stats],
  %{
    failure_rate: stats.failure_rate,
    slow_call_rate: stats.slow_call_rate,
    total: stats.total,
    failures: stats.failures
  },
  %{name: MyApp.Breaker, state: stats.state}
)
```

Logging Retry Attempts
Use the :on_retry callback to emit structured log lines on every retry:
```elixir
require Logger

Resiliency.BackoffRetry.retry(
  fn -> MyService.call(params) end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  on_retry: fn attempt, delay, error ->
    Logger.warning(
      "Retry attempt",
      attempt: attempt,
      delay_ms: delay,
      error: inspect(error),
      service: "my_service"
    )
  end
)
```

Emitting Telemetry from BackoffRetry
Wrap your retry call to emit :telemetry events for each attempt and for the
final outcome:
```elixir
defmodule MyApp.Resilient do
  def call_with_telemetry(fun, opts \\ []) do
    start_time = System.monotonic_time()
    meta = %{service: Keyword.get(opts, :service, :unknown)}

    result =
      Resiliency.BackoffRetry.retry(fun,
        max_attempts: Keyword.get(opts, :max_attempts, 3),
        on_retry: fn attempt, delay, error ->
          :telemetry.execute(
            [:my_app, :retry, :attempt],
            %{delay_ms: delay},
            Map.merge(meta, %{attempt: attempt, error: inspect(error)})
          )
        end,
        retry_if: Keyword.get(opts, :retry_if, fn {:error, _} -> true end)
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :retry, :success],
          %{duration: duration},
          meta
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :retry, :failure],
          %{duration: duration},
          meta
        )
    end

    result
  end
end
```

Emitting Telemetry from Hedged Requests
Use :on_hedge for per-hedge telemetry, and wrap the call for overall metrics:
```elixir
defmodule MyApp.HedgedCall do
  def run(fun, opts \\ []) do
    service = Keyword.get(opts, :service, :unknown)
    start_time = System.monotonic_time()

    result =
      Resiliency.Hedged.run(fun,
        delay: Keyword.get(opts, :delay, 100),
        timeout: Keyword.get(opts, :timeout, 5_000),
        on_hedge: fn attempt ->
          :telemetry.execute(
            [:my_app, :hedged, :hedge_fired],
            %{},
            %{service: service, attempt: attempt}
          )
        end
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :success],
          %{duration: duration},
          %{service: service}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :failure],
          %{duration: duration},
          %{service: service}
        )
    end

    result
  end
end
```

Monitoring Adaptive Hedging with Tracker.stats/1
Poll Resiliency.Hedged.Tracker.stats/1 periodically to feed dashboards:
```elixir
defmodule MyApp.HedgeReporter do
  use GenServer

  def start_link(opts) do
    tracker = Keyword.fetch!(opts, :tracker)
    interval = Keyword.get(opts, :interval, 10_000)
    GenServer.start_link(__MODULE__, %{tracker: tracker, interval: interval})
  end

  @impl true
  def init(state) do
    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:report, state) do
    stats = Resiliency.Hedged.Tracker.stats(state.tracker)

    :telemetry.execute(
      [:my_app, :hedged, :tracker_stats],
      %{
        p50: stats.p50 || 0,
        p95: stats.p95 || 0,
        p99: stats.p99 || 0,
        current_delay: stats.current_delay,
        tokens: stats.tokens,
        total_requests: stats.total_requests,
        hedged_requests: stats.hedged_requests,
        hedge_won: stats.hedge_won
      },
      %{tracker: state.tracker}
    )

    hedge_rate =
      if stats.total_requests > 0,
        do: stats.hedged_requests / stats.total_requests,
        else: 0.0

    win_rate =
      if stats.hedged_requests > 0,
        do: stats.hedge_won / stats.hedged_requests,
        else: 0.0

    :telemetry.execute(
      [:my_app, :hedged, :tracker_rates],
      %{hedge_rate: hedge_rate, win_rate: win_rate},
      %{tracker: state.tracker}
    )

    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :report, interval)
end
```

Monitoring WeightedSemaphore
The semaphore does not expose internal stats directly. Instrument it by wrapping calls:
```elixir
defmodule MyApp.InstrumentedSemaphore do
  def acquire(sem, weight, fun) do
    start_time = System.monotonic_time()

    result =
      Resiliency.WeightedSemaphore.acquire(sem, weight, fn ->
        wait_duration = System.monotonic_time() - start_time

        :telemetry.execute(
          [:my_app, :semaphore, :acquired],
          %{wait_duration: wait_duration, weight: weight},
          %{semaphore: sem}
        )

        fun.()
      end)

    total_duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :ok}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :error}
        )
    end

    result
  end

  def try_acquire(sem, weight, fun) do
    case Resiliency.WeightedSemaphore.try_acquire(sem, weight, fun) do
      :rejected ->
        :telemetry.execute(
          [:my_app, :semaphore, :rejected],
          %{weight: weight},
          %{semaphore: sem}
        )

        :rejected

      other ->
        other
    end
  end
end
```

Suggested Telemetry Event Names
| Module | Event | Measurements | Metadata |
|---|---|---|---|
| CircuitBreaker | [:app, :circuit_breaker, :state_change] | %{} | %{name: atom, from: atom, to: atom} |
| CircuitBreaker | [:app, :circuit_breaker, :stats] | %{failure_rate: float, total: integer, ...} | %{name: atom, state: atom} |
| BackoffRetry | [:app, :retry, :attempt] | %{delay_ms: integer} | %{attempt: integer, error: string, service: atom} |
| BackoffRetry | [:app, :retry, :success] | %{duration: native_time} | %{service: atom} |
| BackoffRetry | [:app, :retry, :failure] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :hedge_fired] | %{} | %{service: atom, attempt: integer} |
| Hedged | [:app, :hedged, :success] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :tracker_stats] | %{p50: num, p95: num, p99: num, ...} | %{tracker: atom} |
| Semaphore | [:app, :semaphore, :acquired] | %{wait_duration: native_time, weight: integer} | %{semaphore: atom} |
| Semaphore | [:app, :semaphore, :rejected] | %{weight: integer} | %{semaphore: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :start] | %{system_time: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :rejected] | %{retry_after: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :stop] | %{duration: native_time} | %{name: atom, result: :ok \| :error, error: term \| nil} |
Common Pitfalls
| Mistake | Symptom | Fix |
|---|---|---|
| `minimum_calls` too low | Circuit trips on normal variance -- a few early failures trip the breaker. | Increase `minimum_calls` to at least 10. Higher for high-throughput services. |
| `failure_rate_threshold` too low | Circuit trips too aggressively; service appears degraded when it is merely imperfect. | Start with 0.5 and lower only if the downstream is critical and failures are costly. |
| `open_timeout` too short | Circuit keeps probing a still-broken service, consuming resources. | Set `open_timeout` to at least the downstream's expected recovery time. |
| `open_timeout` too long | Service has recovered but callers are still being rejected. | Balance between recovery time and responsiveness. Use `force_close/1` for manual intervention. |
| `window_size` too small | A few bad calls dominate the rate; circuit trips on transient spikes. | Use a window large enough to smooth out normal variance (e.g., 100+). |
| `permitted_calls_in_half_open` too low | A single unlucky probe reopens the circuit; recovery takes multiple open-timeout cycles. | Increase to 3--5 for more confident half-open evaluation. |
| Not handling `{:error, :circuit_open}` | Caller crashes or returns an unexpected error shape. | Always pattern-match on `:circuit_open` and degrade gracefully. |
| Retry delay too short | Floods downstream during an outage; downstream never recovers. | Increase `base_delay`, use exponential backoff, add jitter via `Backoff.jitter/2`. |
| No jitter on retries | Thundering herd -- all clients retry at the same instant. | Compose `Backoff.jitter(0.25)` into your backoff stream. |
| Retrying non-idempotent calls | Duplicate side effects (double charges, duplicate messages). | Use `:retry_if` to only retry safe errors (timeouts, connection refused). Return `Resiliency.BackoffRetry.abort(reason)` for fatal errors. |
| No `:budget` with high `max_attempts` | Callers block for minutes during sustained outages. | Set `:budget` to your SLA ceiling. |
| Hedge percentile too low (e.g., p50) | Every other request spawns a hedge -- doubles load on downstream. | Use p90--p99. Start with p95 and lower only if tail latency is severe and the downstream can handle it. |
| Hedge `initial_delay` too low | During cold start, hedges fire on nearly every request before samples accumulate. | Set `initial_delay` to your expected p95 or higher. |
| Token bucket too generous | Hedge rate exceeds expectations. | Lower `token_success_credit` or raise `token_hedge_cost`. The steady-state hedge fraction is `token_success_credit / token_hedge_cost`. |
| Token bucket too restrictive | Hedging is effectively disabled; tail latency suffers. | Raise `token_success_credit` or increase `token_max` for burst capacity. |
| SingleFlight key too broad | Unrelated requests share a result. | Use fine-grained keys that include all parameters affecting the result. |
| SingleFlight on non-idempotent work | Side effects (writes, increments) execute only once instead of per-caller. | Do not use SingleFlight for write operations. |
| Semaphore `max` too high | Downstream overloaded despite the semaphore. | Lower `max` to match the downstream's actual capacity. |
| Semaphore `max` too low | Healthy throughput is artificially limited; callers queue unnecessarily. | Profile the downstream and raise `max` to its tested concurrency limit. |
| `try_acquire` without fallback | `:rejected` silently drops work. | Always handle the `:rejected` case -- return an error, queue the work, or degrade gracefully. |
| Weight exceeds `max` | `{:error, :weight_exceeds_max}` returned immediately. | Ensure no single operation's weight can exceed the semaphore's `:max`. Validate weights at the call site. |
| `Race.run/1` without timeout | If all backends hang, the caller hangs forever. | Always pass a `:timeout` to `Race.run/1` in production. |
| `Resiliency.Map.run/3` with `max_concurrency: 1` | Effectively sequential -- no parallelism benefit. | Use `max_concurrency >= 2`. If you need sequential execution, use `Enum.map/2` directly. |
| Forgetting to supervise stateful modules | Tracker, SingleFlight, or CircuitBreaker crashes and is not restarted. | Always start `Resiliency.CircuitBreaker`, `Resiliency.Hedged`, and `Resiliency.SingleFlight` under a supervisor using their `child_spec/1`. |