Internal — Layer B retry helper consumed by adapters.
Wraps a non-streaming adapter call in a bounded retry loop with
exponential backoff and additive jitter, per spec §6.1. Adapters
call run/3 with their own per-attempt closure; the closure returns
one of {:ok, term}, {:retry, delay_ms, error}, or {:error, term}
(Phase 9 design Non-obvious Decision #4) — the helper handles the
loop, the sleep, and the per-attempt [:allm, :adapter, :retry]
telemetry emission (Decision #16).
Streaming adapters do not call run/3: spec §6.1 is explicit
that "Streaming calls are not retried automatically (partial output
has already been delivered to the consumer)." The
ALLM.StreamAdapter behaviour-doc surfaces this contract;
enforcement is by code review, not by the helper.
Tested end-to-end via ALLM.Providers.Fake's
adapter_opts: [retry_until_call: n] plumbing; the same helper is
consumed by real-provider adapters in Phase 10/11.
Phase 14.3 added a second caller class: the image-side façade
(ALLM.generate_image/3 · edit_image/4 · image_variations/3) wraps
the adapter dispatch in Retry.run/3. Backoff timing reuses the
chat-side default_policy/0 unchanged AT ITS SOURCE; the image
façade augments retry_on AT THE CALL SITE via
ALLM.augment_image_retry_policy/1 to add the four image-error
atoms (:rate_limited, :provider_unavailable, :timeout,
:network_error) — chat-side default_policy/0.retry_on is
HTTP-status-coded ([429, 500, 502, 503, 504, :timeout]) and the
image side surfaces closed-enum atoms only, so without the
augmentation only :timeout would coincidentally retry. See PHASE 14
Decision #9 for the full design rationale. Image-side closures emit
{:retry, delay_ms, %ALLM.Error.ImageAdapterError{}} for the four
retry-engaging reasons; other reasons surface verbatim with no retry
attempt.
v0.2 surface caveat — public-API retry visibility
In v0.2 the public Layer-C entry points (ALLM.generate/3,
ALLM.step/3, ALLM.chat/3) all route through ALLM.StreamRunner,
which calls the adapter's ALLM.StreamAdapter.stream/2 — the
streaming path. Per spec §6.1, streaming calls are not retried,
so [:allm, :adapter, :retry] events do NOT fire from any public
Layer-C call in v0.2. The retry round-trip is exercised in v0.2 only
by direct adapter calls
(ALLM.Providers.Fake.generate(req, adapter_opts: [retry_until_call: n]), mirroring how real-provider adapters call Retry.run/3 from
their non-streaming c:ALLM.Adapter.generate/3 callbacks). Per the
Phase 9 design's Decision #14, real-provider Phase 10/11 adapters'
non-streaming generate/2 callbacks reuse this same helper; until a
non-streaming Layer-C path lands (or a real-provider integration
ships and is invoked through one), retry telemetry observed via the
public façade requires the adapter to be reachable through a path
that exercises c:ALLM.Adapter.generate/3.
:request_id on retry events
:request_id appears on [:allm, :adapter, :retry] metadata only
when the adapter call's opts carry a :request_id (typically
threaded from a wrapping Runner / StreamRunner span). Direct
adapter calls (e.g., the Fake retry round-trip) emit retry events
without :request_id because no wrapping span generated one. This
is a soft-promise rather than a hard invariant — see review Finding
#3 for the v0.2 surface notes.
Default policy
See default_policy/0 for the spec §6.1 closed map. Materialised
via materialize/1 from the engine's :retry field
(:default | false | keyword()).
Closure contract
The closure passed to run/3 is invoked up to policy.max_attempts
times. It must return one of:
{:ok, value}— success; loop returns{:ok, value}.{:retry, delay_ms, error}— retryable failure. Whendelay_ms > 0, that value (plus jitter) is the delay; otherwise the computed exponential backoff (with jitter) is used. Theerrorterm is checked againstpolicy.retry_onfor membership; a non-matching error returns{:error, error}immediately.{:error, error}— non-retryable failure; loop returns{:error, error}immediately, no telemetry, no sleep.
Closure-raised exceptions propagate to the caller of run/3
unchanged (no rescue, no telemetry — spec §6.1's "exception is not
retryable" rule).
Telemetry
[:allm, :adapter, :retry] is emitted per retry attempt, before
sleeping, with measurements %{system_time: System.system_time()}
and metadata Map.merge(telemetry_metadata, %{attempt: attempt, delay_ms: actual_delay, reason: error}). The final attempt (when
attempt == max_attempts) emits no retry event — the surrounding
[:allm, :adapter, :stop] (or :exception) span fires instead.
Summary
Types
Closure return: success, retry-with-delay, or non-retryable error.
Engine-side retry shapes accepted by materialize/1.
Materialised retry policy after merging :default | false | keyword().
Functions
Return the spec §6.1 default retry policy.
Return true if error is a member of retry_on.
Materialise an engine :retry field into a policy() or :no_retry.
Run fun under the given retry policy.
Types
@type closure_result(ok) :: {:ok, ok} | {:retry, non_neg_integer(), term()} | {:error, term()}
Closure return: success, retry-with-delay, or non-retryable error.
@type engine_retry() :: :default | false | keyword()
Engine-side retry shapes accepted by materialize/1.
@type policy() :: %{ max_attempts: non_neg_integer(), base_delay_ms: pos_integer(), max_delay_ms: pos_integer(), retry_on: [pos_integer() | atom()], jitter_ms: non_neg_integer(), respect_retry_after: boolean() }
Materialised retry policy after merging :default | false | keyword().
Functions
@spec default_policy() :: policy()
Return the spec §6.1 default retry policy.
Field-by-field cited against
steering/allm_engine_session_streaming_spec_v0_2.md §6.1
(max_attempts: 3, base_delay_ms: 500, max_delay_ms: 30_000,
retry_on: [429, 500, 502, 503, 504, :timeout], jitter_ms: 250,
respect_retry_after: true).
Examples
iex> p = ALLM.Retry.default_policy()
iex> p.max_attempts
3
iex> p.retry_on
[429, 500, 502, 503, 504, :timeout]
@spec error_matches?(term(), [pos_integer() | atom()]) :: boolean()
Return true if error is a member of retry_on.
Integers match HTTP status codes (429 ∈ [429, 500, ...]); atoms
match error atoms (:timeout). For opaque error structs carrying a
:reason field, the reason is extracted before membership check.
Examples
iex> retry_on = [429, 500, :timeout]
iex> ALLM.Retry.error_matches?(429, retry_on)
true
iex> ALLM.Retry.error_matches?(400, retry_on)
false
iex> ALLM.Retry.error_matches?(%{reason: :timeout}, retry_on)
true
@spec materialize(engine_retry()) :: policy() | :no_retry
Materialise an engine :retry field into a policy() or :no_retry.
false→:no_retry.:default→default_policy().[]→default_policy().keyword()→Map.merge(default_policy(), Map.new(kw)).[max_attempts: 0]→:no_retry(zero attempts is indistinguishable from "no retry").- Unknown keys raise
ArgumentError(a typo likemax_atempts:fails loudly).
Examples
iex> ALLM.Retry.materialize(:default).max_attempts
3
iex> ALLM.Retry.materialize(false)
:no_retry
iex> ALLM.Retry.materialize(max_attempts: 5).max_attempts
5
iex> ALLM.Retry.materialize(max_attempts: 0)
:no_retry
@spec run(policy() | :no_retry | engine_retry(), map(), (-> closure_result(ok))) :: {:ok, ok} | {:error, term()} when ok: var
Run fun under the given retry policy.
Accepts a materialised policy(), :no_retry, or any of the engine
shapes (:default | false | keyword()); the engine shape is
materialised internally via materialize/1.
Behaviour:
:no_retry—funis invoked once.{:ok, _}is returned verbatim;{:error, error}and{:retry, _, error}collapse to{:error, error}so the caller doesn't have to handle the third shape.policy()—funis invoked up topolicy.max_attemptstimes.telemetry_metadatais shallow-merged with%{attempt: attempt, delay_ms: actual_delay, reason: error}per attempt and emitted under[:allm, :adapter, :retry]before sleeping. The final attempt (whenattempt == max_attempts) emits no retry event because the caller's surrounding[:allm, :adapter, :stop]/:exceptionspan fires instead.actual_delay = max(closure_delay_ms, computed_backoff)wherecomputed_backoff = min(max_delay_ms, base_delay_ms * (2 ** (attempt - 1))) + jitter. Whenrespect_retry_after: trueANDclosure_delay_ms > 0,actual_delay = closure_delay_ms + jitter(the closure-suppliedRetry-Aftervalue wins).
Jitter bounds: jitter is additive in [0, jitter_ms]
inclusive, never subtractive — the spec §6.1 + jitter(0..250ms)
notation is load-bearing. Implementation:
:rand.uniform(jitter_ms + 1) - 1. Verified on OTP 27 in IEx
2026-04-25: :rand.uniform/1 returns 1..N inclusive, and
:rand.uniform(1) - 1 == 0 when jitter_ms == 0 (no jitter,
deterministic delay).
Closure-raised exceptions propagate to the caller unchanged — no rescue, no telemetry, no retry (spec §6.1 "exception is not retryable").
Examples
iex> {:ok, v} = ALLM.Retry.run(:no_retry, %{}, fn -> {:ok, 42} end)
iex> v
42
iex> ALLM.Retry.run(:no_retry, %{}, fn -> {:retry, 0, :transient} end)
{:error, :transient}
iex> {:ok, _pid} = Agent.start(fn -> 0 end, name: :doctest_retry_counter)
iex> {:ok, v} =
...> ALLM.Retry.run(
...> [base_delay_ms: 1, jitter_ms: 0],
...> %{},
...> fn ->
...> n = Agent.get_and_update(:doctest_retry_counter, &{&1 + 1, &1 + 1})
...> if n < 2, do: {:retry, 0, 429}, else: {:ok, n}
...> end
...> )
iex> Agent.stop(:doctest_retry_counter)
iex> v
2