Tuning and Observability


This guide covers parameter selection, tuning strategies, and observability patterns for every module in the Resiliency library. Each section provides a parameter reference table, explains how parameters interact, and gives concrete recommendations for common workloads.


Table of Contents

  1. CircuitBreaker Tuning
  2. BackoffRetry Tuning
  3. Hedged Requests Tuning
  4. SingleFlight Tuning
  5. Task Combinator Tuning
  6. WeightedSemaphore Tuning
  7. RateLimiter Tuning
  8. Observability
  9. Common Pitfalls

CircuitBreaker Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the GenServer. |
| :window_size | 100 | 1.. | Number of call outcomes in the count-based sliding window. Larger windows smooth out short bursts but react slower to genuine shifts. |
| :failure_rate_threshold | 0.5 | 0.0..1.0 | Failure rate that trips the circuit. 0.5 means the circuit trips when half or more of the recorded calls fail. |
| :slow_call_threshold | :infinity | :infinity or 1.. ms | Duration above which a call is classified as "slow". :infinity disables slow call detection entirely. |
| :slow_call_rate_threshold | 1.0 | 0.0..1.0 | Slow call rate that trips the circuit. 1.0 effectively disables slow-rate tripping. |
| :open_timeout | 60_000 ms | 1.. | Time the circuit stays :open before transitioning to :half_open for probing. |
| :permitted_calls_in_half_open | 1 | 1.. | Number of probe calls allowed through in :half_open state before deciding to close or reopen. |
| :minimum_calls | 10 | 1.. | Minimum recorded calls in the window before the failure rate is evaluated. Prevents tripping on small sample sizes. |
| :should_record | default predicate | fn returning :success, :failure, or :ignore | Custom classification function. :ignore results are not counted in the window. |
| :on_state_change | nil | fn name, from, to -> any, or nil | Callback fired on every state transition. Use for logging, metrics, or telemetry. |

How Parameters Interact

The circuit evaluates after each recorded call:

if window.total >= minimum_calls do
  if failure_rate >= failure_rate_threshold or slow_rate >= slow_call_rate_threshold do
    trip to :open
  end
end

The :minimum_calls parameter acts as a warm-up guard -- the circuit will not trip until enough calls have been observed. This prevents a single failure from tripping a freshly started breaker.

The sliding window is count-based (not time-based). Old outcomes are evicted when new ones push them out of the fixed-size buffer. This means the window naturally adapts to traffic volume without requiring time-based expiry.

Tuning for Common Workloads

| Scenario | failure_rate_threshold | minimum_calls | window_size | open_timeout | Notes |
| --- | --- | --- | --- | --- | --- |
| High-throughput API | 0.5 | 20 | 200 | 30_000 | Larger window for stable signal. |
| Critical payment service | 0.3 | 10 | 100 | 60_000 | Trip earlier to protect revenue. |
| Background job queue | 0.8 | 50 | 500 | 120_000 | Tolerate more failures, longer recovery. |
| Health-check probe | 0.5 | 3 | 10 | 10_000 | Small window, fast recovery for probes. |
| Slow-call-sensitive API | 0.5 / 0.3 (slow) | 10 | 100 | 30_000 | Set slow_call_threshold to p99 latency. |
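
For instance, the "Critical payment service" row above maps onto a child spec like this (the breaker name MyApp.PaymentBreaker is an illustrative placeholder):

```elixir
# Supervision tree entry -- values taken from the tuning table above.
children = [
  {Resiliency.CircuitBreaker,
   name: MyApp.PaymentBreaker,
   failure_rate_threshold: 0.3,
   minimum_calls: 10,
   window_size: 100,
   open_timeout: 60_000}
]

Supervisor.start_link(children, strategy: :one_for_one)
```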

Interpreting get_stats

Call Resiliency.CircuitBreaker.get_stats/1 to inspect the breaker at runtime:

%{
  state: :closed,
  total: 87,
  failures: 12,
  slow_calls: 3,
  failure_rate: 0.1379,
  slow_call_rate: 0.0345
}

| Stat | Healthy range | What it means |
| --- | --- | --- |
| failure_rate | < failure_rate_threshold | Current failure rate in the sliding window. |
| slow_call_rate | < slow_call_rate_threshold | Current slow call rate. High values suggest downstream latency issues. |
| total | >= minimum_calls | Total calls in the window. Below minimum_calls, the circuit will not trip. |
| state | :closed | Current state. :open means rejecting calls; :half_open means probing. |

BackoffRetry Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :max_attempts | 3 | 1.. | Total attempts including the first call. 1 means no retries. |
| :backoff | :exponential | :exponential, :linear, :constant, or any Enumerable of ms | Determines the shape of the delay curve between retries. |
| :base_delay | 100 ms | 0.. | Seed value for delay computation -- first retry waits this long (before cap/jitter). |
| :max_delay | 5_000 ms | 0.. | Hard ceiling on any single retry delay. Applied via Backoff.cap/2. |
| :budget | :infinity | :infinity or 0.. ms | Total wall-clock budget for all attempts. When the next delay would push past the deadline, retries stop. |
| :retry_if | fn {:error, _} -> true end | fn result -> boolean end | Predicate that decides whether a given failure is retryable. Non-matching errors are returned immediately. |
| :on_retry | nil | fn attempt, delay, error -> any | Callback fired before each sleep. Use for logging, metrics, telemetry. |
| :sleep_fn | &Process.sleep/1 | fn ms -> any | Injectable sleep -- replace with a no-op or test double in tests. |
| :reraise | false | true, false | When true, re-raises rescued exceptions (with original stacktrace) instead of returning {:error, exception} after exhausting retries. |

How Parameters Interact

The effective retry sequence is computed as:

delays = backoff_strategy(base_delay)
         |> Backoff.cap(max_delay)
         |> Enum.take(max_attempts - 1)

For exponential backoff with the defaults, the delay sequence before cap is:

attempt 1: immediate (no delay -- first call)
attempt 2: 100 ms
attempt 3: 200 ms

With max_attempts: 5 and base_delay: 100:

attempt 1: immediate
attempt 2: 100 ms
attempt 3: 200 ms
attempt 4: 400 ms
attempt 5: 800 ms

The :budget option acts as a secondary stop condition. Even if max_attempts has not been reached, the retry loop aborts when the next sleep would exceed the total time budget. This makes :budget the right knob for SLA-sensitive callers -- set it to your upstream timeout minus a safety margin.

The Backoff.jitter/2 modifier (usable when passing a custom stream) spreads each delay d uniformly over [d * (1 - proportion), d * (1 + proportion)]. Jitter is critical in multi-instance deployments to avoid thundering herd on the downstream service after a shared failure.

Backoff Strategy Formulas

| Strategy | Formula (n-th retry, 0-indexed) | Sequence (base=100) |
| --- | --- | --- |
| :exponential | base * multiplier^n | 100, 200, 400, 800, 1600, ... |
| :linear | base + increment * n | 100, 200, 300, 400, 500, ... |
| :constant | base | 100, 100, 100, 100, 100, ... |
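
The three shapes are easy to reproduce with plain streams. This is a sketch of the formulas above, not the library's implementation; it assumes a multiplier of 2 and a linear increment equal to the base, matching the sequences in the table:

```elixir
base = 100

exponential = Stream.iterate(base, &(&1 * 2))     # base * 2^n
linear      = Stream.iterate(base, &(&1 + base))  # base + base * n
constant    = Stream.repeatedly(fn -> base end)   # base, forever

Enum.take(exponential, 5)  # [100, 200, 400, 800, 1600]
Enum.take(linear, 5)       # [100, 200, 300, 400, 500]
Enum.take(constant, 5)     # [100, 100, 100, 100, 100]
```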

Tuning for Common Workloads

| Scenario | max_attempts | backoff | base_delay | max_delay | budget | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Low-latency API | 2--3 | :exponential | 50 | 500 | 1_000 | Short budget prevents blocking the request path. |
| Batch / ETL job | 5--8 | :exponential | 500 | 30_000 | :infinity | Generous delays -- throughput matters more than latency. |
| Database reconnect | 10+ | :exponential + jitter | 200 | 60_000 | :infinity | Long max_delay with jitter to avoid connection storms. |
| Idempotent webhook | 4 | :linear | 1_000 | 10_000 | 30_000 | Linear ramp gives the receiver steady recovery time. |
| Circuit-breaker probe | 1 | :constant | 0 | 0 | 500 | Single attempt with tight budget -- just checking if the service is back. |

Custom Backoff Streams

For advanced scenarios, compose your own stream with jitter and cap:

custom_backoff =
  Resiliency.BackoffRetry.Backoff.exponential(base: 200, multiplier: 3)
  |> Resiliency.BackoffRetry.Backoff.jitter(0.25)
  |> Resiliency.BackoffRetry.Backoff.cap(15_000)

Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: custom_backoff,
  max_attempts: 6
)

Or pass a literal list when you want fully deterministic delays:

Resiliency.BackoffRetry.retry(fn -> call_service() end,
  backoff: [100, 500, 2_000, 5_000]
)

When a list is provided, max_attempts is implicitly length(list) + 1 (one initial call plus one retry per delay entry).

Rule of Thumb

Total worst-case latency = sum(delays) + max_attempts * max_call_duration. If this exceeds your caller's timeout, either reduce max_attempts, lower max_delay, or set a :budget.
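
A worked instance of that formula, assuming max_attempts: 4, base_delay: 200, exponential backoff below the cap, and calls that take at most 300 ms each:

```elixir
delays = [200, 400, 800]  # the three retry delays (none hit max_delay)
max_call_duration = 300   # worst-case duration of one attempt, in ms

worst_case = Enum.sum(delays) + 4 * max_call_duration
# 1_400 + 1_200 = 2_600 ms -- fits under a 3_000 ms caller timeout,
# so no :budget is strictly required in this configuration.
```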


Hedged Requests Tuning

Stateless Mode Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :delay | 100 ms | 0.. | Time to wait before firing the backup (hedge) request. |
| :max_requests | 2 | 1.. | Total concurrent attempts. 1 disables hedging entirely. |
| :timeout | 5_000 ms | 1.. | Overall deadline -- all in-flight tasks are killed at this point. |
| :non_fatal | fn _ -> false end | fn reason -> boolean | When true for a failure reason, the next hedge fires immediately instead of waiting for the delay. |
| :on_hedge | nil | fn attempt -> any | Callback invoked before each hedge fires. Use for metrics and logging. |

Adaptive Mode (Tracker) Parameter Reference

When using Resiliency.Hedged.start_link/1, the delay auto-tunes based on observed latency. A token bucket throttles the hedge rate under load.

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the tracker GenServer. |
| :percentile | 95 | 0..100 | Target latency percentile used as the adaptive delay. Higher values hedge less aggressively. |
| :buffer_size | 1_000 | 1.. | Number of latency samples in the rolling window. Larger buffers smooth out spikes but react slower to shifts. |
| :min_delay | 1 ms | 0.. | Floor for the adaptive delay. Prevents hedging on every request even when p95 is near zero. |
| :max_delay | 5_000 ms | 1.. | Ceiling for the adaptive delay. Ensures hedges still fire even during latency spikes. |
| :initial_delay | 100 ms | 0.. | Delay used during cold-start before :min_samples observations are collected. |
| :min_samples | 10 | 0.. | Number of observations required before switching from :initial_delay to the adaptive percentile. |
| :token_max | 10 | > 0 | Token bucket capacity. Determines burst budget for hedging. |
| :token_success_credit | 0.1 | > 0 | Tokens earned per completed request (hedged or not). |
| :token_hedge_cost | 1.0 | > 0 | Tokens consumed when a hedge fires. |
| :token_threshold | 1.0 | >= 0 | Minimum token balance required to allow hedging. Below this, max_requests is forced to 1. |

How Adaptive Delay Works

  1. Cold start -- Before min_samples observations, the tracker returns initial_delay as the hedge delay. Pick this conservatively -- too low and you double load during startup.

  2. Steady state -- The tracker maintains a circular buffer of the last buffer_size latency samples. On each get_config/1 call, it computes the configured percentile of the buffer and clamps it to [min_delay, max_delay]. This becomes the hedge delay.

  3. Token bucket -- Every completed request (success or failure) adds token_success_credit tokens. Every hedge that fires costs token_hedge_cost tokens. When tokens drop below token_threshold, hedging is disabled until enough successful requests replenish the bucket. This creates a natural feedback loop: under sustained load where hedges are not winning, the system backs off automatically.

Effective hedge rate (steady state):

max_hedge_fraction = token_success_credit / token_hedge_cost

With defaults (0.1 / 1.0), at most 10% of requests will be hedged in steady state. To allow up to 20%, set token_success_credit: 0.2. To allow up to 5%, set token_success_credit: 0.05 or token_hedge_cost: 2.0.
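
A minimal sketch of the bucket arithmetic (illustrative only, not the tracker's internal code):

```elixir
defmodule BucketSketch do
  # One completed request: credit tokens, then pay for the hedge if one fired.
  def step(tokens, hedged?, opts) do
    tokens = min(opts.token_max, tokens + opts.token_success_credit)
    if hedged?, do: tokens - opts.token_hedge_cost, else: tokens
  end
end

opts = %{token_max: 10, token_success_credit: 0.1, token_hedge_cost: 1.0}

# Hedging every request drains the bucket at roughly -0.9 tokens per request,
# while hedging one request in ten holds the balance steady -- the 10%
# steady-state fraction from the formula above. (The real tracker stops
# hedging once the balance falls below token_threshold.)
Enum.reduce(1..20, 10.0, fn _, t -> BucketSketch.step(t, true, opts) end)
```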

Tuning for Different Scenarios

| Scenario | :percentile | :buffer_size | Token settings | Notes |
| --- | --- | --- | --- | --- |
| Fast API (p50 ~5ms) | 90 | 500 | defaults | Aggressive hedging -- low cost per extra request. |
| Slow DB query (p50 ~200ms) | 99 | 2_000 | token_success_credit: 0.05 | Conservative -- extra DB queries are expensive. |
| Multi-region fanout | 95 | 1_000 | defaults | Classic tail-latency use case. |
| Startup / cold cache | -- | -- | initial_delay: 500 | High initial delay avoids flooding a cold service. |
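
The "Slow DB query" row, for example, translates into a tracker definition like this (MyApp.DbHedger is a placeholder name):

```elixir
# Supervision tree entry for the adaptive tracker, using the row above.
children = [
  {Resiliency.Hedged,
   name: MyApp.DbHedger,
   percentile: 99,
   buffer_size: 2_000,
   token_success_credit: 0.05}
]
```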

Interpreting Tracker Stats

Call Resiliency.Hedged.Tracker.stats/1 to inspect the tracker at runtime:

%{
  total_requests: 15234,
  hedged_requests: 1412,
  hedge_won: 987,
  p50: 12,
  p95: 45,
  p99: 210,
  current_delay: 45,
  tokens: 7.2
}

| Stat | Healthy range | What it means |
| --- | --- | --- |
| hedged_requests / total_requests | 5--15% | Hedge rate. Below 5% suggests the delay is too high or the service is healthy (good). Above 20% means you are generating significant extra load. |
| hedge_won / hedged_requests | 30--70% | Win rate. Below 30% means hedges rarely help -- consider raising the percentile. Above 70% means the primary is consistently slow -- investigate the downstream. |
| tokens | > token_threshold | Token balance. If consistently near zero, hedging is being throttled. Raise token_success_credit or lower token_hedge_cost if you want more hedging. |
| p99 / p50 ratio | < 10x | Tail-to-median ratio. A high ratio (> 20x) indicates severe tail latency -- hedging is valuable here. |
| current_delay | between min_delay and max_delay | The adaptive delay. If pinned at max_delay, latency has spiked and the tracker is being conservative. |

SingleFlight Tuning

Resiliency.SingleFlight has no tunable numeric parameters -- its behavior is determined entirely by key design and usage patterns.

When It Helps

  • Cache stampede -- Many processes request the same cache key after expiry. Without SingleFlight, all of them hit the database. With it, one process fetches while the rest share the result.

  • Expensive computation -- Deduplicating concurrent calls to a heavy aggregation or report-generation function.

  • External API rate limits -- Preventing duplicate requests to a rate-limited third-party service.

When It Hurts

  • Non-idempotent operations -- If the function has side effects that must execute per-caller (e.g., incrementing a counter, sending a notification), SingleFlight will suppress those side effects for coalesced callers.

  • Caller-specific context -- If each caller needs a slightly different variant of the result (different query parameters, different auth tokens), deduplication by a shared key will return the wrong result for some callers.

  • Very short functions -- If the function completes in microseconds, the overhead of the GenServer round-trip (message passing, ETS or map lookup) may exceed the savings from deduplication.

Key Design Considerations

| Consideration | Guidance |
| --- | --- |
| Key granularity | Too broad (e.g., "users") coalesces unrelated calls. Too narrow (e.g., "user:#{id}:#{timestamp}") defeats deduplication. Use the natural cache key. |
| Key type | Any Erlang term works. Atoms and short strings are fastest for map lookups. |
| Error propagation | If the executing function fails, all waiting callers receive the same {:error, reason}. This is usually correct for cache-fill scenarios but may not be appropriate if different callers should retry independently. |
| Timeout | Use flight/4 with a timeout when the function may be slow. Timed-out callers exit, but the in-flight function continues and serves other waiters. |
| forget/2 | Call forget/2 to force a fresh execution for the next caller. Useful when you know cached data is stale (e.g., after a write). Existing waiters still receive the original result. |
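
A cache-fill sketch tying these points together. The flight(server, key, fun) shape, the MyApp.Flights server name, and the Repo calls are assumptions -- adapt them to the actual function signatures and your application:

```elixir
defmodule MyApp.Users do
  # One concurrent fetch per user id; coalesced callers share the result.
  def fetch(id) do
    Resiliency.SingleFlight.flight(MyApp.Flights, {:user, id}, fn ->
      Repo.get(User, id)
    end)
  end

  # After a write, force the next caller to execute fresh rather than
  # share a result computed from pre-write data.
  def update(id, attrs) do
    result = do_update(id, attrs)
    Resiliency.SingleFlight.forget(MyApp.Flights, {:user, id})
    result
  end
end
```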

Task Combinator Tuning

Choosing the Right Combinator

| Module | Use when | Concurrency | Failure behavior |
| --- | --- | --- | --- |
| Resiliency.Race | You need the fastest result from N alternatives | All functions run concurrently | First success wins; crashed tasks are skipped; returns {:error, :all_failed} if all fail |
| Resiliency.AllSettled | You need every result regardless of failures | All functions run concurrently | Each result is {:ok, _} or {:error, _} independently |
| Resiliency.Map | You are processing a collection with bounded parallelism | Up to max_concurrency at a time | Cancels all remaining work on first failure |
| Resiliency.FirstOk | You have a fallback chain (cache -> DB -> API) | Sequential -- one at a time | Tries the next function only after the previous one fails |

Timeout Selection

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :timeout (Race) | :infinity | :infinity or 1.. ms | Overall deadline for the race. Remaining tasks are killed when it expires. |
| :timeout (AllSettled) | :infinity | :infinity or 1.. ms | Completed tasks keep their results; tasks still running get {:error, :timeout}. |
| :timeout (Map) | :infinity | :infinity or 1.. ms | Returns {:error, :timeout} and kills all active tasks. |
| :timeout (FirstOk) | :infinity | :infinity or 1.. ms | Total budget across all sequential attempts. |
| :max_concurrency (Map) | System.schedulers_online() | 1.. | Limits how many items are processed in parallel. |

Timeout Rules of Thumb

  • For Race.run/1 -- set the timeout to your SLA ceiling. If no backend responds in time, you want a clear timeout rather than an indefinite hang.

  • For AllSettled.run/1 -- set the timeout to the slowest acceptable task duration. Tasks that finish within the deadline keep their results; the rest are marked as timed out.

  • For Resiliency.Map.run/2 -- multiply your per-item budget by the number of items divided by max_concurrency, then add a margin. Or use :infinity and rely on per-item timeouts inside the function.

  • For FirstOk.run/1 -- set the timeout to the total latency budget for the entire fallback chain. Each attempt subtracts from the remaining budget.

Race vs FirstOk Decision

Do you want concurrent execution?
  Yes -> Race.run/1 (all functions run at once, first success wins)
  No  -> FirstOk.run/1 (sequential, stops at first success)

Use Race.run/1 when all backends can handle the load of concurrent requests. Use FirstOk.run/1 when you want to avoid unnecessary calls to slower/more expensive backends.
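
The cache -> DB -> API chain can be sketched with FirstOk like this. It assumes the run function accepts a list of zero-arity functions plus the :timeout option from the table above (the exact arity may differ), and the backend module names are placeholders:

```elixir
Resiliency.FirstOk.run(
  [
    fn -> Cache.get(key) end,        # cheapest, may miss
    fn -> Database.fetch(key) end,   # slower but authoritative
    fn -> RemoteApi.fetch(key) end   # last resort, most expensive
  ],
  timeout: 2_000                     # budget for the whole chain
)
```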


WeightedSemaphore Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom or {:via, ...} | Registered name for the semaphore GenServer. |
| :max | (required) | 1.. | Total permit capacity. The sum of all concurrently held weights must not exceed this value. |

max Selection

The :max value represents your concurrency budget. Choose it based on the downstream resource's capacity:

| Resource | Suggested max | Rationale |
| --- | --- | --- |
| Database connection pool (size N) | N or N - 1 | Match the pool size. Reserve one connection for health checks if needed. |
| External API (rate limit R req/s) | R / avg_requests_per_second | Keep in-flight requests below the rate limit. |
| CPU-bound work | System.schedulers_online() | One permit per scheduler avoids over-subscription. |
| Memory-bound work | available_mb / per_task_mb | Weight by memory cost per task. |

Weight Assignment Strategies

| Strategy | When to use | Example |
| --- | --- | --- |
| Uniform (weight=1) | All operations have equal cost | acquire(sem, fn -> read_row() end) |
| Cost-proportional | Operations vary in resource consumption | acquire(sem, row_count, fn -> bulk_insert(rows) end) |
| Tiered | Two or three operation classes | Reads = 1, writes = 3, bulk = 10 |
| Estimated | Cost is data-dependent | acquire(sem, estimate_cost(query), fn -> run(query) end) |
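
A tiered strategy can be centralised in a small wrapper. The tier costs and module names here are illustrative -- tune them to your workload:

```elixir
defmodule MyApp.DbGate do
  # Reads are cheap, writes heavier, bulk operations heaviest.
  @weights %{read: 1, write: 3, bulk: 10}

  def run(sem, kind, fun) do
    Resiliency.WeightedSemaphore.acquire(sem, Map.fetch!(@weights, kind), fun)
  end
end

# MyApp.DbGate.run(MyApp.DbSemaphore, :bulk, fn -> bulk_insert(rows) end)
```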

Backpressure Behavior

The semaphore's FIFO queue provides natural backpressure:

  • Blocking -- acquire/3 blocks the caller until permits are available. This is the default and simplest mode. Callers queue up and are served in order.

  • Non-blocking -- try_acquire/3 returns :rejected immediately if permits are not available or if there are waiters in the queue. Use this for "best effort" work that can be dropped under load.

  • Timeout -- acquire/4 accepts a timeout in milliseconds. If permits are not available within the deadline, returns {:error, :timeout}. The caller is removed from the queue.

Fairness guarantee -- Waiters are served in strict FIFO order. A large waiter at the head of the queue blocks smaller waiters behind it, which prevents starvation of large requests: a weight-8 request waiting for permits will not be bypassed by a weight-1 request, even if capacity exists for the smaller one.

Sizing Guidelines

Utilization = avg_concurrent_weight / max

  • < 50% -- The semaphore is rarely contended. You may be over-provisioned, or the workload is bursty. Consider lowering max to catch genuine overload earlier.

  • 50--80% -- Healthy range. Some queuing occurs during bursts but callers are not waiting long.

  • > 90% -- The semaphore is a bottleneck. Callers are frequently blocked. Either increase max (if the downstream can handle it) or reduce the arrival rate.


RateLimiter Tuning

Parameter Reference

| Parameter | Default | Range | Effect |
| --- | --- | --- | --- |
| :name | (required) | atom | Registered name for the GenServer and persistent_term config key. |
| :rate | (required) | positive number | Refill rate in tokens per second. |
| :burst_size | (required) | positive integer | Bucket capacity and initial fill. |
| :on_reject | nil | fn name -> any, or nil | Callback fired in the caller's process on every rejection. |
| weight (per-call) | 1 | positive integer | Tokens consumed per call. |

How It Works

The rate limiter uses a lazy token-bucket — there is no background timer. Tokens refill on each call based on elapsed time since the last operation:

new_tokens = min(burst_size, old_tokens + elapsed_ms * rate_per_ms)

If new_tokens >= weight, the call is granted and weight tokens are deducted. Otherwise the call is rejected immediately. The hot path (grant and reject) runs entirely in the caller's process via lock-free ETS operations — no GenServer message is sent.

Choosing rate and burst_size

| Goal | rate | burst_size | Notes |
| --- | --- | --- | --- |
| Match an external API limit of 100 req/s | 100.0 | 100 or 10 | burst_size = rate gives a one-second burst. Lower it to smooth traffic more aggressively. |
| Allow occasional bursts, then throttle | 50.0 | 500 | Bucket fills slowly but callers can burst to 500 before being throttled. |
| Strict per-second limit, no burst | 100.0 | 1 | Only 1 token ever available; at most 1 call per ~10ms. |
| Expensive operation (weight 5) at 20/s | 100.0 | 100 | Each call costs 5 tokens; effective rate = 100 / 5 = 20 op/s. |
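
Putting the first row into practice as a child spec (MyApp.ApiRateLimiter is a placeholder name; the :on_reject callback here just emits a telemetry event):

```elixir
children = [
  {Resiliency.RateLimiter,
   name: MyApp.ApiRateLimiter,
   rate: 100.0,      # refill 100 tokens per second
   burst_size: 100,  # allow a one-second burst
   on_reject: fn name ->
     :telemetry.execute([:my_app, :rate_limiter, :rejected], %{}, %{name: name})
   end}
]
```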

rate vs burst_size Interaction

# Steady-state throughput (tokens/s) = rate
# Burst capacity (tokens) = burst_size
# Time to refill from empty = burst_size / rate seconds

A full empty bucket refills in burst_size / rate seconds. If burst_size equals rate, the bucket refills in exactly one second. If burst_size is much larger than rate, callers can absorb large traffic spikes before the rate limit kicks in.

retry_after_ms Formula

When a call is rejected, the hint is:

retry_after_ms = ceil((weight - current_tokens) / rate * 1000)

This tells the caller exactly how long to wait for enough tokens to refill for their specific weight. Callers should treat this as a minimum — token counts are shared across concurrent callers.
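
A worked instance of the formula, with a rate of 100 tokens/s, 2.5 tokens currently in the bucket, and a caller asking for weight 5:

```elixir
weight = 5
current_tokens = 2.5
rate = 100.0

retry_after_ms = ceil((weight - current_tokens) / rate * 1000)
# (5 - 2.5) / 100 * 1000 = 25 ms until enough tokens have refilled
```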

Weighted Calls

Use :weight when different operations have different costs relative to your upstream rate limit. For example, if an API counts bulk requests as equivalent to N single requests, pass weight: N:

# Single item lookup: 1 token
Resiliency.RateLimiter.call(rl, fn -> get_one(id) end)

# Bulk fetch of 50 items: 50 tokens
Resiliency.RateLimiter.call(rl, fn -> get_many(ids) end, weight: 50)

A weight larger than burst_size is always rejected immediately.

get_stats/1 Usage

get_stats/1 computes the projected token count using the current timestamp, without writing to ETS or consuming any tokens. Use it for health checks and dashboards — it does not interfere with the hot path:

%{tokens: tokens, rate: rate, burst_size: burst_size} =
  Resiliency.RateLimiter.get_stats(MyApp.ApiRateLimiter)

utilization = (burst_size - tokens) / burst_size
# 0.0 = full bucket (no recent calls)
# 1.0 = empty bucket (fully rate limited)

Performance Characteristics

The hot path avoids GenServer messages entirely:

  • Grant path: persistent_term.get + ETS lookup + float refill math + ETS CAS (select_replace)
  • Reject path: persistent_term.get + ETS lookup + float refill math + ETS update_element

Observed on M-series hardware: ~3µs/call for grants, ~2µs/call for rejects. Under 8 concurrent processes, reductions per acquire stay flat (< 100) — no serialisation.

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| burst_size too small | Legitimate bursts are rejected; traffic is over-smoothed. | Set burst_size to match the upstream's burst allowance. |
| burst_size too large | Bucket takes minutes to refill after a burst; callers see retry_after_ms in the thousands. | Cap burst_size at rate * acceptable_burst_seconds. |
| weight > burst_size | Calls with that weight are always rejected. | Ensure max weight is <= burst_size. Validate at startup. |
| Treating retry_after_ms as exact | Token counts are shared; another caller may consume tokens before you retry. | Add a small jitter to the retry_after_ms before sleeping. |
| Using RateLimiter for concurrency | Bucket drains after a burst even if calls are all in-flight simultaneously. | Use Bulkhead or WeightedSemaphore for concurrency limits. |
| Starting without a supervisor | GenServer crash leaves persistent_term stale until process exits. | Always start under a supervisor using child_spec/1. |

Observability

Emitting Telemetry from CircuitBreaker

Use the :on_state_change callback to emit :telemetry events on every state transition:

{Resiliency.CircuitBreaker,
 name: MyApp.Breaker,
 failure_rate_threshold: 0.5,
 on_state_change: fn name, from, to ->
   :telemetry.execute(
     [:my_app, :circuit_breaker, :state_change],
     %{},
     %{name: name, from: from, to: to}
   )
 end}

Poll Resiliency.CircuitBreaker.get_stats/1 periodically for dashboard metrics:

stats = Resiliency.CircuitBreaker.get_stats(MyApp.Breaker)

:telemetry.execute(
  [:my_app, :circuit_breaker, :stats],
  %{
    failure_rate: stats.failure_rate,
    slow_call_rate: stats.slow_call_rate,
    total: stats.total,
    failures: stats.failures
  },
  %{name: MyApp.Breaker, state: stats.state}
)

Logging Retry Attempts

Use the :on_retry callback to emit structured log lines on every retry:

require Logger

Resiliency.BackoffRetry.retry(
  fn -> MyService.call(params) end,
  max_attempts: 4,
  backoff: :exponential,
  base_delay: 200,
  on_retry: fn attempt, delay, error ->
    Logger.warning(
      "Retry attempt",
      attempt: attempt,
      delay_ms: delay,
      error: inspect(error),
      service: "my_service"
    )
  end
)

Emitting Telemetry from BackoffRetry

Wrap your retry call to emit :telemetry events for each attempt and for the final outcome:

defmodule MyApp.Resilient do
  def call_with_telemetry(fun, opts \\ []) do
    start_time = System.monotonic_time()
    meta = %{service: Keyword.get(opts, :service, :unknown)}

    result =
      Resiliency.BackoffRetry.retry(fun,
        max_attempts: Keyword.get(opts, :max_attempts, 3),
        on_retry: fn attempt, delay, error ->
          :telemetry.execute(
            [:my_app, :retry, :attempt],
            %{delay_ms: delay},
            Map.merge(meta, %{attempt: attempt, error: inspect(error)})
          )
        end,
        retry_if: Keyword.get(opts, :retry_if, fn {:error, _} -> true end)
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :retry, :success],
          %{duration: duration},
          meta
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :retry, :failure],
          %{duration: duration},
          meta
        )
    end

    result
  end
end

Emitting Telemetry from Hedged Requests

Use :on_hedge for per-hedge telemetry, and wrap the call for overall metrics:

defmodule MyApp.HedgedCall do
  def run(fun, opts \\ []) do
    service = Keyword.get(opts, :service, :unknown)
    start_time = System.monotonic_time()

    result =
      Resiliency.Hedged.run(fun,
        delay: Keyword.get(opts, :delay, 100),
        timeout: Keyword.get(opts, :timeout, 5_000),
        on_hedge: fn attempt ->
          :telemetry.execute(
            [:my_app, :hedged, :hedge_fired],
            %{},
            %{service: service, attempt: attempt}
          )
        end
      )

    duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :success],
          %{duration: duration},
          %{service: service}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :hedged, :failure],
          %{duration: duration},
          %{service: service}
        )
    end

    result
  end
end

Monitoring Adaptive Hedging with Tracker.stats/1

Poll Resiliency.Hedged.Tracker.stats/1 periodically to feed dashboards:

defmodule MyApp.HedgeReporter do
  use GenServer

  def start_link(opts) do
    tracker = Keyword.fetch!(opts, :tracker)
    interval = Keyword.get(opts, :interval, 10_000)
    GenServer.start_link(__MODULE__, %{tracker: tracker, interval: interval})
  end

  @impl true
  def init(state) do
    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:report, state) do
    stats = Resiliency.Hedged.Tracker.stats(state.tracker)

    :telemetry.execute(
      [:my_app, :hedged, :tracker_stats],
      %{
        p50: stats.p50 || 0,
        p95: stats.p95 || 0,
        p99: stats.p99 || 0,
        current_delay: stats.current_delay,
        tokens: stats.tokens,
        total_requests: stats.total_requests,
        hedged_requests: stats.hedged_requests,
        hedge_won: stats.hedge_won
      },
      %{tracker: state.tracker}
    )

    hedge_rate =
      if stats.total_requests > 0,
        do: stats.hedged_requests / stats.total_requests,
        else: 0.0

    win_rate =
      if stats.hedged_requests > 0,
        do: stats.hedge_won / stats.hedged_requests,
        else: 0.0

    :telemetry.execute(
      [:my_app, :hedged, :tracker_rates],
      %{hedge_rate: hedge_rate, win_rate: win_rate},
      %{tracker: state.tracker}
    )

    schedule(state.interval)
    {:noreply, state}
  end

  defp schedule(interval), do: Process.send_after(self(), :report, interval)
end

Monitoring WeightedSemaphore

The semaphore does not expose internal stats directly. Instrument it by wrapping calls:

defmodule MyApp.InstrumentedSemaphore do
  def acquire(sem, weight, fun) do
    start_time = System.monotonic_time()

    result = Resiliency.WeightedSemaphore.acquire(sem, weight, fn ->
      wait_duration = System.monotonic_time() - start_time

      :telemetry.execute(
        [:my_app, :semaphore, :acquired],
        %{wait_duration: wait_duration, weight: weight},
        %{semaphore: sem}
      )

      fun.()
    end)

    total_duration = System.monotonic_time() - start_time

    case result do
      {:ok, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :ok}
        )

      {:error, _} ->
        :telemetry.execute(
          [:my_app, :semaphore, :complete],
          %{duration: total_duration, weight: weight},
          %{semaphore: sem, outcome: :error}
        )
    end

    result
  end

  def try_acquire(sem, weight, fun) do
    case Resiliency.WeightedSemaphore.try_acquire(sem, weight, fun) do
      :rejected ->
        :telemetry.execute(
          [:my_app, :semaphore, :rejected],
          %{weight: weight},
          %{semaphore: sem}
        )

        :rejected

      other ->
        other
    end
  end
end

Suggested Telemetry Event Names

| Module | Event | Measurements | Metadata |
| --- | --- | --- | --- |
| CircuitBreaker | [:app, :circuit_breaker, :state_change] | %{} | %{name: atom, from: atom, to: atom} |
| CircuitBreaker | [:app, :circuit_breaker, :stats] | %{failure_rate: float, total: integer, ...} | %{name: atom, state: atom} |
| BackoffRetry | [:app, :retry, :attempt] | %{delay_ms: integer} | %{attempt: integer, error: string, service: atom} |
| BackoffRetry | [:app, :retry, :success] | %{duration: native_time} | %{service: atom} |
| BackoffRetry | [:app, :retry, :failure] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :hedge_fired] | %{} | %{service: atom, attempt: integer} |
| Hedged | [:app, :hedged, :success] | %{duration: native_time} | %{service: atom} |
| Hedged | [:app, :hedged, :tracker_stats] | %{p50: num, p95: num, p99: num, ...} | %{tracker: atom} |
| Semaphore | [:app, :semaphore, :acquired] | %{wait_duration: native_time, weight: integer} | %{semaphore: atom} |
| Semaphore | [:app, :semaphore, :rejected] | %{weight: integer} | %{semaphore: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :start] | %{system_time: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :rejected] | %{retry_after: integer} | %{name: atom} |
| RateLimiter | [:resiliency, :rate_limiter, :call, :stop] | %{duration: native_time} | %{name: atom, result: :ok or :error, error: term or nil} |

Common Pitfalls

| Mistake | Symptom | Fix |
| --- | --- | --- |
| minimum_calls too low | Circuit trips on normal variance -- a few early failures trip the breaker. | Increase minimum_calls to at least 10. Higher for high-throughput services. |
| failure_rate_threshold too low | Circuit trips too aggressively; service appears degraded when it is merely imperfect. | Start with 0.5 and lower only if the downstream is critical and failures are costly. |
| open_timeout too short | Circuit keeps probing a still-broken service, consuming resources. | Set open_timeout to at least the downstream's expected recovery time. |
| open_timeout too long | Service has recovered but callers are still being rejected. | Balance between recovery time and responsiveness. Use force_close/1 for manual intervention. |
| window_size too small | A few bad calls dominate the rate; circuit trips on transient spikes. | Use a window large enough to smooth out normal variance (e.g., 100+). |
| permitted_calls_in_half_open too low | A single unlucky probe reopens the circuit; recovery takes multiple open-timeout cycles. | Increase to 3--5 for more confident half-open evaluation. |
| Not handling {:error, :circuit_open} | Caller crashes or returns unexpected error shape. | Always pattern-match on :circuit_open and degrade gracefully. |
| Retry delay too short | Floods downstream during outage; downstream never recovers. | Increase base_delay, use exponential backoff, add jitter via Backoff.jitter/2. |
| No jitter on retries | Thundering herd -- all clients retry at the same instant. | Compose Backoff.jitter(0.25) into your backoff stream. |
| Retrying non-idempotent calls | Duplicate side effects (double charges, duplicate messages). | Use :retry_if to only retry safe errors (timeouts, connection refused). Return Resiliency.BackoffRetry.abort(reason) for fatal errors. |
| No :budget with high max_attempts | Callers block for minutes during sustained outages. | Set :budget to your SLA ceiling. |
| Hedge percentile too low (e.g., p50) | Every other request spawns a hedge -- doubles load on downstream. | Use p90--p99. Start with p95 and lower only if tail latency is severe and the downstream can handle it. |
| Hedge initial_delay too low | During cold start, hedges fire on nearly every request before samples accumulate. | Set initial_delay to your expected p95 or higher. |
| Token bucket too generous | Hedge rate exceeds expectations. | Lower token_success_credit or raise token_hedge_cost. The steady-state hedge fraction is token_success_credit / token_hedge_cost. |
| Token bucket too restrictive | Hedging is effectively disabled; tail latency suffers. | Raise token_success_credit or increase token_max for burst capacity. |
| SingleFlight key too broad | Unrelated requests share a result. | Use fine-grained keys that include all parameters affecting the result. |
| SingleFlight on non-idempotent work | Side effects (writes, increments) execute only once instead of per-caller. | Do not use SingleFlight for write operations. |
| Semaphore max too high | Downstream overloaded despite semaphore. | Lower max to match the downstream's actual capacity. |
| Semaphore max too low | Healthy throughput is artificially limited; callers queue unnecessarily. | Profile the downstream and raise max to its tested concurrency limit. |
| try_acquire without fallback | :rejected silently drops work. | Always handle the :rejected case -- return an error, queue the work, or degrade gracefully. |
| Weight exceeds max | {:error, :weight_exceeds_max} returned immediately. | Ensure no single operation's weight can exceed the semaphore's :max. Validate weights at the call site. |
| Race.run/1 without timeout | If all backends hang, the caller hangs forever. | Always pass a :timeout to Race.run/1 in production. |
| Resiliency.Map.run/3 with max_concurrency: 1 | Effectively sequential -- no parallelism benefit. | Use max_concurrency >= 2. If you need sequential execution, use Enum.map/2 directly. |
| Forgetting to supervise stateful modules | Tracker, SingleFlight, or CircuitBreaker crashes and is not restarted. | Always start Resiliency.CircuitBreaker, Resiliency.Hedged, and Resiliency.SingleFlight under a supervisor using their child_spec/1. |