Resiliency.Hedged (Resiliency v0.6.0)

Copy Markdown View Source

Hedged requests for Elixir.

Fire a backup request after a delay, take whichever finishes first, cancel the rest. A tail-latency optimization inspired by Google's "Tail at Scale" paper.

When to use

  • Querying a latency-sensitive datastore (e.g., a key-value cache or search index) where occasional slow responses are caused by GC pauses, network jitter, or cold replicas — a hedge cuts tail latency dramatically.
  • Calling a replicated service behind a load balancer where individual instances sometimes stall — firing a second request to a different instance lets the fast replica win.
  • Performing DNS or geo-routing lookups where the first resolver may be slow but a redundant resolver responds quickly — hedge to keep p99 tight.
  • Any read-only or idempotent operation where issuing a duplicate request is cheap relative to the cost of waiting for a slow primary.

How it works

When you call run/2 (stateless mode), the engine spawns the first request immediately. After :delay milliseconds — or sooner if the first request fails and the failure matches the :non_fatal predicate — a second copy of the same function is spawned. This continues until either a request succeeds, :max_requests copies have been launched, or the overall :timeout expires. The first successful response is returned and all remaining in-flight tasks are killed, freeing resources.

The idea comes from Google's 2013 paper "The Tail at Scale" by Jeffrey Dean and Luiz Andre Barroso. The paper observes that even when individual request latency is fast at the median, high-percentile latency can be orders of magnitude worse. By sending a redundant request after a short delay — chosen to be around the expected latency percentile — you effectively "race" two independent samples of the latency distribution. The probability that both are slow is the product of two small probabilities, so tail latency drops significantly at the cost of a modest increase in total load.

In adaptive mode (run/3 with a tracker), the delay is not fixed — it is computed as a configurable percentile of recently observed latencies (see Resiliency.Hedged.Tracker). A token-bucket mechanism throttles the hedge rate: each completed request earns a small credit, and each hedge spends a larger cost. When the bucket runs dry, hedging is temporarily disabled, preventing runaway fan-out under sustained load.

Algorithm Complexity

FunctionTimeSpace
run/2 (stateless)O(m) where m = max_requests — each hedge spawn is O(1)O(m) — one monitored process per hedge
run/3 (adaptive)O(m) plus O(1) GenServer call to the trackerO(m)
start_link/1O(1)O(1)
child_spec/1O(1)O(1)

Quick start

# Stateless — fixed delay
{:ok, body} = Resiliency.Hedged.run(fn -> fetch(url) end, delay: 100)

# Adaptive — delay auto-tunes from observed latency
{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge)
{:ok, body} = Resiliency.Hedged.run(MyHedge, fn -> fetch(url) end)

Stateless options

  • :delay — ms before firing the next hedge (default: 100)
  • :max_requests — total concurrent attempts (default: 2)
  • :timeout — overall deadline in ms (default: 5_000)
  • :non_fatalfn reason -> boolean predicate; when true, fires the next hedge immediately instead of waiting for the delay (default: fn _ -> false end)
  • :on_hedgefn attempt -> any callback invoked before each hedge
  • :now_fn — injectable clock fn :millisecond -> integer for testing

Result normalization

  • {:ok, value} — success
  • {:error, reason} — failure
  • bare value — wrapped as {:ok, value}
  • :ok{:ok, :ok}
  • :error{:error, :error}
  • raise / exit / throw — captured, treated as failure

Telemetry

All events are emitted in the caller's process. See Resiliency.Telemetry for the complete event catalogue.

[:resiliency, :hedged, :run, :start]

Emitted at the beginning of every run/2,3 invocation.

Measurements

KeyTypeDescription
system_timeintegerSystem.system_time() at emission time

Metadata

KeyTypeDescription

| mode | :stateless | :adaptive | :stateless for run/2 (fun/opts), :adaptive for run/3 (server/fun/opts) |

[:resiliency, :hedged, :hedge]

Emitted each time a hedge is dispatched (2nd, 3rd, … request). Not emitted for the original request.

Measurements

KeyTypeDescription
(none)

Metadata

KeyTypeDescription
attemptintegerHedge attempt number (2 for first hedge, 3 for second, etc.)

[:resiliency, :hedged, :run, :stop]

Emitted after the first successful response (or after all attempts fail).

Measurements

KeyTypeDescription
durationintegerElapsed native time units (System.monotonic_time/0 delta)

Metadata

KeyTypeDescription
mode`:stateless:adaptive`Matches the :start event
result`:ok:error`:ok if any attempt succeeded, :error if all failed
dispatchedintegerTotal number of attempts dispatched (including original)
hedgedbooleantrue if at least one hedge was dispatched (dispatched > 1)

Summary

Types

Options for stateless run/2.

Functions

Returns a child specification for use in a supervision tree.

Executes fun with hedging using a fixed delay.

Executes fun with adaptive hedging controlled by a Resiliency.Hedged.Tracker.

Starts a Resiliency.Hedged.Tracker process linked to the current process.

Types

option()

@type option() ::
  {:delay, non_neg_integer()}
  | {:max_requests, pos_integer()}
  | {:timeout, pos_integer()}
  | {:non_fatal, (any() -> boolean())}
  | {:on_hedge, (pos_integer() -> any()) | nil}
  | {:now_fn, (:millisecond -> integer())}

Options for stateless run/2.

tracker_option()

@type tracker_option() ::
  {:name, GenServer.name()}
  | {:percentile, number()}
  | {:buffer_size, pos_integer()}
  | {:min_delay, non_neg_integer()}
  | {:max_delay, pos_integer()}
  | {:initial_delay, non_neg_integer()}
  | {:min_samples, non_neg_integer()}
  | {:token_max, number()}
  | {:token_success_credit, number()}
  | {:token_hedge_cost, number()}
  | {:token_threshold, number()}

Options for start_link/1 and child_spec/1.

Functions

child_spec(opts)

@spec child_spec([tracker_option()]) :: Supervisor.child_spec()

Returns a child specification for use in a supervision tree.

Parameters

  • opts -- keyword list of tracker options (same as start_link/1). The :name option is required.

Returns

A Supervisor.child_spec() map suitable for inclusion in a supervision tree.

Example

children = [
  {Resiliency.Hedged, name: MyHedge, percentile: 99}
]

Supervisor.start_link(children, strategy: :one_for_one)

run(fun)

Executes fun with hedging using a fixed delay.

Returns {:ok, result} or {:error, reason}.

Parameters

  • fun -- a zero-arity function to execute. Results are normalized (see "Result normalization" in the module docs).
  • opts -- keyword list of options. Defaults to [].
    • :delay -- milliseconds before firing the next hedge. Defaults to 100.
    • :max_requests -- total concurrent attempts. Defaults to 2.
    • :timeout -- overall deadline in milliseconds. Defaults to 5_000.
    • :non_fatal -- fn reason -> boolean predicate; when true, fires the next hedge immediately instead of waiting for the delay. Defaults to fn _ -> false end.
    • :on_hedge -- fn attempt -> any callback invoked before each hedge. Defaults to nil.
    • :now_fn -- injectable clock fn :millisecond -> integer for testing. Defaults to &System.monotonic_time/1.

Returns

{:ok, result} from the first function invocation that completes successfully, or {:error, reason} if all attempts fail.

Examples

iex> Resiliency.Hedged.run(fn -> {:ok, 42} end)
{:ok, 42}

iex> Resiliency.Hedged.run(fn -> :hello end)
{:ok, :hello}

iex> Resiliency.Hedged.run(fn -> {:error, :boom} end, max_requests: 1)
{:error, :boom}

run(fun, opts)

@spec run((-> any()), [option()]) :: {:ok, any()} | {:error, any()}

run(server, fun, opts)

@spec run(GenServer.server(), (-> any()), [option()]) ::
  {:ok, any()} | {:error, any()}

Executes fun with adaptive hedging controlled by a Resiliency.Hedged.Tracker.

The tracker determines the delay based on observed latency percentiles and controls hedge rate via a token bucket.

Returns {:ok, result} or {:error, reason}.

Parameters

  • server -- the name or PID of a running Resiliency.Hedged.Tracker process.
  • fun -- a zero-arity function to execute. Results are normalized (see "Result normalization" in the module docs).
  • opts -- keyword list of options. Defaults to [].
    • :max_requests -- total concurrent attempts. Defaults to 2. Automatically set to 1 when the tracker's token bucket disallows hedging.
    • :timeout -- overall deadline in milliseconds. Defaults to 5_000.
    • :non_fatal -- fn reason -> boolean predicate; when true, fires the next hedge immediately. Defaults to fn _ -> false end.
    • :on_hedge -- fn attempt -> any callback invoked before each hedge. Defaults to nil.
    • :now_fn -- injectable clock fn :millisecond -> integer for testing. Defaults to &System.monotonic_time/1.

Returns

{:ok, result} from the first function invocation that completes successfully, or {:error, reason} if all attempts fail.

Examples

{:ok, _} = Resiliency.Hedged.start_link(name: MyHedge)
{:ok, body} = Resiliency.Hedged.run(MyHedge, fn -> fetch(url) end)

start_link(opts)

@spec start_link([tracker_option()]) :: GenServer.on_start()

Starts a Resiliency.Hedged.Tracker process linked to the current process.

Parameters

  • opts -- keyword list of options.
    • :name -- (required) the registered name for the tracker process.
    • :percentile -- target percentile for adaptive delay. Defaults to 95.
    • :buffer_size -- max latency samples to keep. Defaults to 1000.
    • :min_delay -- floor for adaptive delay in milliseconds. Defaults to 1.
    • :max_delay -- ceiling for adaptive delay in milliseconds. Defaults to 5_000.
    • :initial_delay -- delay used before enough samples are collected. Defaults to 100.
    • :min_samples -- samples needed before switching to adaptive delay. Defaults to 10.
    • :token_max -- token bucket capacity. Defaults to 10.
    • :token_success_credit -- tokens earned per completed request. Defaults to 0.1.
    • :token_hedge_cost -- tokens spent when a hedge fires. Defaults to 1.0.
    • :token_threshold -- minimum tokens required to allow hedging. Defaults to 1.0.

Returns

{:ok, pid} on success, or {:error, reason} if the process cannot be started.

Raises

Raises KeyError if the required :name option is not provided.