Hedged requests for Elixir.
Fire a backup request after a delay, take whichever finishes first, cancel the rest. A tail-latency optimization inspired by Google's "Tail at Scale" paper.
When to use
- Querying a latency-sensitive datastore (e.g., a key-value cache or search index) where occasional slow responses are caused by GC pauses, network jitter, or cold replicas — a hedge cuts tail latency dramatically.
- Calling a replicated service behind a load balancer where individual instances sometimes stall — firing a second request to a different instance lets the fast replica win.
- Performing DNS or geo-routing lookups where the first resolver may be slow but a redundant resolver responds quickly — hedge to keep p99 tight.
- Any read-only or idempotent operation where issuing a duplicate request is cheap relative to the cost of waiting for a slow primary.
How it works
When you call run/2 (stateless mode), the engine spawns the first request
immediately. After :delay milliseconds — or sooner if the first request
fails and the failure matches the :non_fatal predicate — a second copy of
the same function is spawned. This continues until either a request succeeds,
:max_requests copies have been launched, or the overall :timeout expires.
The first successful response is returned and all remaining in-flight tasks
are killed, freeing resources.
The idea comes from Google's 2013 paper "The Tail at Scale" by Jeffrey Dean and Luiz Andre Barroso. The paper observes that even when individual request latency is fast at the median, high-percentile latency can be orders of magnitude worse. By sending a redundant request after a short delay — chosen to be around the expected latency percentile — you effectively "race" two independent samples of the latency distribution. The probability that both are slow is the product of two small probabilities, so tail latency drops significantly at the cost of a modest increase in total load.
In adaptive mode (run/3 with a tracker), the delay is not fixed — it is
computed as a configurable percentile of recently observed latencies (see
Resiliency.Hedged.Tracker). A token-bucket mechanism throttles the hedge
rate: each completed request earns a small credit, and each hedge spends a
larger cost. When the bucket runs dry, hedging is temporarily disabled,
preventing runaway fan-out under sustained load.
Algorithm Complexity
| Function | Time | Space |
|---|---|---|
run/2 (stateless) | O(m) where m = max_requests — each hedge spawn is O(1) | O(m) — one monitored process per hedge |
run/3 (adaptive) | O(m) plus O(1) GenServer call to the tracker | O(m) |
start_link/1 | O(1) | O(1) |
child_spec/1 | O(1) | O(1) |
Quick start
# Stateless — fixed delay
{:ok, body} = Resiliency.Hedged.run(fn -> fetch(url) end, delay: 100)
# Adaptive — delay auto-tunes from observed latency
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge)
{:ok, body} = Resiliency.Hedged.run(MyHedge, fn -> fetch(url) end)Stateless options
:delay— ms before firing the next hedge (default:100):max_requests— total concurrent attempts (default:2):timeout— overall deadline in ms (default:5_000):non_fatal—fn reason -> booleanpredicate; when true, fires the next hedge immediately instead of waiting for the delay (default:fn _ -> false end):on_hedge—fn attempt -> anycallback invoked before each hedge:now_fn— injectable clockfn :millisecond -> integerfor testing
Result normalization
{:ok, value}— success{:error, reason}— failure- bare value — wrapped as
{:ok, value} :ok—{:ok, :ok}:error—{:error, :error}- raise / exit / throw — captured, treated as failure
Telemetry
All events are emitted in the caller's process. See Resiliency.Telemetry for the
complete event catalogue.
[:resiliency, :hedged, :run, :start]
Emitted at the beginning of every run/2,3 invocation.
Measurements
| Key | Type | Description |
|---|---|---|
system_time | integer | System.system_time() at emission time |
Metadata
| Key | Type | Description |
|---|
| mode | :stateless | :adaptive | :stateless for run/2 (fun/opts), :adaptive for run/3 (server/fun/opts) |
[:resiliency, :hedged, :hedge]
Emitted each time a hedge is dispatched (2nd, 3rd, … request). Not emitted for the original request.
Measurements
| Key | Type | Description |
|---|---|---|
| (none) |
Metadata
| Key | Type | Description |
|---|---|---|
attempt | integer | Hedge attempt number (2 for first hedge, 3 for second, etc.) |
[:resiliency, :hedged, :run, :stop]
Emitted after the first successful response (or after all attempts fail).
Measurements
| Key | Type | Description |
|---|---|---|
duration | integer | Elapsed native time units (System.monotonic_time/0 delta) |
Metadata
| Key | Type | Description |
|---|
mode | `:stateless | :adaptive` | Matches the :start event |
result | `:ok | :error` | :ok if any attempt succeeded, :error if all failed |
dispatched | integer | Total number of attempts dispatched (including original) |
hedged | boolean | true if at least one hedge was dispatched (dispatched > 1) |
Summary
Types
Options for start_link/1 and child_spec/1.
Functions
Returns a child specification for use in a supervision tree.
Executes fun with hedging using a fixed delay.
Executes fun with adaptive hedging controlled by a Resiliency.Hedged.Tracker.
Starts a Resiliency.Hedged.Tracker process linked to the current process.
Types
@type option() :: {:delay, non_neg_integer()} | {:max_requests, pos_integer()} | {:timeout, pos_integer()} | {:non_fatal, (any() -> boolean())} | {:on_hedge, (pos_integer() -> any()) | nil} | {:now_fn, (:millisecond -> integer())}
Options for stateless run/2.
@type tracker_option() :: {:name, GenServer.name()} | {:percentile, number()} | {:buffer_size, pos_integer()} | {:min_delay, non_neg_integer()} | {:max_delay, pos_integer()} | {:initial_delay, non_neg_integer()} | {:min_samples, non_neg_integer()} | {:token_max, number()} | {:token_success_credit, number()} | {:token_hedge_cost, number()} | {:token_threshold, number()}
Options for start_link/1 and child_spec/1.
Functions
@spec child_spec([tracker_option()]) :: Supervisor.child_spec()
Returns a child specification for use in a supervision tree.
Parameters
opts-- keyword list of tracker options (same asstart_link/1). The:nameoption is required.
Returns
A Supervisor.child_spec() map suitable for inclusion in a supervision tree.
Example
children = [
{Resiliency.Hedged, name: MyHedge, percentile: 99}
]
Supervisor.start_link(children, strategy: :one_for_one)
Executes fun with hedging using a fixed delay.
Returns {:ok, result} or {:error, reason}.
Parameters
fun-- a zero-arity function to execute. Results are normalized (see "Result normalization" in the module docs).opts-- keyword list of options. Defaults to[].:delay-- milliseconds before firing the next hedge. Defaults to100.:max_requests-- total concurrent attempts. Defaults to2.:timeout-- overall deadline in milliseconds. Defaults to5_000.:non_fatal--fn reason -> booleanpredicate; whentrue, fires the next hedge immediately instead of waiting for the delay. Defaults tofn _ -> false end.:on_hedge--fn attempt -> anycallback invoked before each hedge. Defaults tonil.:now_fn-- injectable clockfn :millisecond -> integerfor testing. Defaults to&System.monotonic_time/1.
Returns
{:ok, result} from the first function invocation that completes successfully, or {:error, reason} if all attempts fail.
Examples
iex> Resiliency.Hedged.run(fn -> {:ok, 42} end)
{:ok, 42}
iex> Resiliency.Hedged.run(fn -> :hello end)
{:ok, :hello}
iex> Resiliency.Hedged.run(fn -> {:error, :boom} end, max_requests: 1)
{:error, :boom}
@spec run(GenServer.server(), (-> any()), [option()]) :: {:ok, any()} | {:error, any()}
Executes fun with adaptive hedging controlled by a Resiliency.Hedged.Tracker.
The tracker determines the delay based on observed latency percentiles and controls hedge rate via a token bucket.
Returns {:ok, result} or {:error, reason}.
Parameters
server-- the name or PID of a runningResiliency.Hedged.Trackerprocess.fun-- a zero-arity function to execute. Results are normalized (see "Result normalization" in the module docs).opts-- keyword list of options. Defaults to[].:max_requests-- total concurrent attempts. Defaults to2. Automatically set to1when the tracker's token bucket disallows hedging.:timeout-- overall deadline in milliseconds. Defaults to5_000.:non_fatal--fn reason -> booleanpredicate; whentrue, fires the next hedge immediately. Defaults tofn _ -> false end.:on_hedge--fn attempt -> anycallback invoked before each hedge. Defaults tonil.:now_fn-- injectable clockfn :millisecond -> integerfor testing. Defaults to&System.monotonic_time/1.
Returns
{:ok, result} from the first function invocation that completes successfully, or {:error, reason} if all attempts fail.
Examples
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge)
{:ok, body} = Resiliency.Hedged.run(MyHedge, fn -> fetch(url) end)
@spec start_link([tracker_option()]) :: GenServer.on_start()
Starts a Resiliency.Hedged.Tracker process linked to the current process.
Parameters
opts-- keyword list of options.:name-- (required) the registered name for the tracker process.:percentile-- target percentile for adaptive delay. Defaults to95.:buffer_size-- max latency samples to keep. Defaults to1000.:min_delay-- floor for adaptive delay in milliseconds. Defaults to1.:max_delay-- ceiling for adaptive delay in milliseconds. Defaults to5_000.:initial_delay-- delay used before enough samples are collected. Defaults to100.:min_samples-- samples needed before switching to adaptive delay. Defaults to10.:token_max-- token bucket capacity. Defaults to10.:token_success_credit-- tokens earned per completed request. Defaults to0.1.:token_hedge_cost-- tokens spent when a hedge fires. Defaults to1.0.:token_threshold-- minimum tokens required to allow hedging. Defaults to1.0.
Returns
{:ok, pid} on success, or {:error, reason} if the process cannot be started.
Raises
Raises KeyError if the required :name option is not provided.