# `ALLM.Retry`
[🔗](https://github.com/cykod/ALLM/blob/v0.3.0/lib/allm/retry.ex#L1)

Internal — Layer B retry helper consumed by adapters.

Wraps a non-streaming adapter call in a bounded retry loop with
exponential backoff and additive jitter, per spec §6.1. Adapters
call `run/3` with their own per-attempt closure; the closure returns
one of `{:ok, term}`, `{:retry, delay_ms, error}`, or `{:error, term}`
(Phase 9 design Non-obvious Decision #4) — the helper handles the
loop, the sleep, and the per-attempt `[:allm, :adapter, :retry]`
telemetry emission (Decision #16).

Streaming adapters do **not** call `run/3`: spec §6.1 is explicit
that "Streaming calls are not retried automatically (partial output
has already been delivered to the consumer)." The
`ALLM.StreamAdapter` behaviour-doc surfaces this contract;
enforcement is by code review, not by the helper.

Tested end-to-end via `ALLM.Providers.Fake`'s
`adapter_opts: [retry_until_call: n]` plumbing; the same helper is
consumed by real-provider adapters in Phase 10/11.

Phase 14.3 added a second caller class: the image-side façade
(`ALLM.generate_image/3`, `ALLM.edit_image/4`,
`ALLM.image_variations/3`) wraps the adapter dispatch in
`Retry.run/3`. Backoff timing reuses the chat-side `default_policy/0`,
which itself is left unchanged; instead, the image façade augments
`retry_on` at the call site via `ALLM.augment_image_retry_policy/1`,
adding the four image-error atoms (`:rate_limited`,
`:provider_unavailable`, `:timeout`, `:network_error`). The
augmentation is needed because the chat-side `retry_on` is
HTTP-status-coded (`[429, 500, 502, 503, 504, :timeout]`) while the
image side surfaces closed-enum atoms only, so without it only
`:timeout` would coincidentally retry. See PHASE 14 Decision #9 for
the full design rationale. Image-side closures emit
`{:retry, delay_ms, %ALLM.Error.ImageAdapterError{}}` for the four
retry-engaging reasons; other reasons surface verbatim with no retry
attempt.
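A minimal sketch of the call-site augmentation described above, assuming it amounts to appending the image-error atoms to `retry_on` and dropping duplicates. The module `ImageRetrySketch` and its `augment/1` function are hypothetical stand-ins; the real `ALLM.augment_image_retry_policy/1` may be implemented differently.

```elixir
defmodule ImageRetrySketch do
  # Hypothetical stand-in for ALLM.augment_image_retry_policy/1.
  @image_retry_atoms [:rate_limited, :provider_unavailable, :timeout, :network_error]

  # Append the image-error atoms to the policy's retry_on list,
  # keeping the first occurrence of any duplicate (here, :timeout).
  def augment(%{retry_on: retry_on} = policy) do
    %{policy | retry_on: Enum.uniq(retry_on ++ @image_retry_atoms)}
  end
end

# Chat-side retry_on, copied from the default policy documented in this module.
chat_policy = %{retry_on: [429, 500, 502, 503, 504, :timeout]}
image_policy = ImageRetrySketch.augment(chat_policy)

IO.inspect(image_policy.retry_on)
```

Because Elixir maps are immutable, `chat_policy` is untouched after the call; only the copy handed to the image-side `run/3` call carries the extra atoms.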

## v0.2 surface caveat — public-API retry visibility

In v0.2 the public Layer-C entry points (`ALLM.generate/3`,
`ALLM.step/3`, `ALLM.chat/3`) all route through `ALLM.StreamRunner`,
which calls the adapter's `c:ALLM.StreamAdapter.stream/2` — the
streaming path. Per spec §6.1, **streaming calls are not retried**,
so `[:allm, :adapter, :retry]` events do NOT fire from any public
Layer-C call in v0.2. In v0.2 the retry round-trip is exercised only
by direct adapter calls such as
`ALLM.Providers.Fake.generate(req, adapter_opts: [retry_until_call: n])`,
which mirrors how real-provider adapters call `Retry.run/3` from
their non-streaming `c:ALLM.Adapter.generate/3` callbacks. Per the
Phase 9 design's Decision #14, the non-streaming `generate/2`
callbacks of the real-provider Phase 10/11 adapters reuse this same
helper. Until a non-streaming Layer-C path lands (or a real-provider
integration ships and is invoked through one), observing retry
telemetry through the public façade requires a path that actually
reaches the adapter's `c:ALLM.Adapter.generate/3` callback.

## `:request_id` on retry events

`:request_id` appears on `[:allm, :adapter, :retry]` metadata only
when the adapter call's `opts` carry a `:request_id` (typically
threaded from a wrapping `Runner` / `StreamRunner` span). Direct
adapter calls (e.g., the Fake retry round-trip) emit retry events
without `:request_id` because no wrapping span generated one. This
is a soft promise rather than a hard invariant; see review
Finding #3 for the v0.2 surface notes.

## Default policy

See `default_policy/0` for the spec §6.1 closed map. Materialised
via `materialize/1` from the engine's `:retry` field
(`:default | false | keyword()`).

## Closure contract

The closure passed to `run/3` is invoked up to `policy.max_attempts`
times. It must return one of:

  * `{:ok, value}` — success; loop returns `{:ok, value}`.
  * `{:retry, delay_ms, error}` — retryable failure. When
    `delay_ms > 0` and `policy.respect_retry_after` is `true` (the
    default), that value plus jitter is the delay; otherwise the
    delay is the larger of `delay_ms` and the computed exponential
    backoff (with jitter), as detailed under `run/3`. The `error`
    term is checked against `policy.retry_on` for membership; a
    non-matching error returns `{:error, error}` immediately.
  * `{:error, error}` — non-retryable failure; loop returns
    `{:error, error}` immediately, no telemetry, no sleep.

Closure-raised exceptions propagate to the caller of `run/3`
unchanged (no rescue, no telemetry — spec §6.1's "exception is not
retryable" rule).
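To make the three return shapes concrete, here is a sketch of a per-attempt classifier an adapter might use inside its closure. The response shape (`%{status: ..., body: ..., retry_after_ms: ...}`) and the module name `ClosureSketch` are assumptions made for illustration and are not part of ALLM.

```elixir
defmodule ClosureSketch do
  # Hypothetical response shape:
  #   %{status: integer, body: term, retry_after_ms: integer | nil}
  # Maps one attempt's outcome onto the closure_result shapes listed above.
  def classify({:ok, %{status: 200, body: body}}), do: {:ok, body}

  # The server asked us to wait: pass the hint through as delay_ms.
  def classify({:ok, %{status: status, retry_after_ms: ms}})
      when status in [429, 503] and is_integer(ms) and ms > 0,
      do: {:retry, ms, status}

  # Other 429/5xx responses: delay_ms 0 lets the helper compute the backoff.
  def classify({:ok, %{status: status}}) when status == 429 or status >= 500,
    do: {:retry, 0, status}

  # Remaining statuses (for example 400 or 401) are not worth retrying.
  def classify({:ok, %{status: status}}), do: {:error, status}

  # Transport timeout: retryable, and the helper picks the delay.
  def classify({:error, :timeout}), do: {:retry, 0, :timeout}
  def classify({:error, reason}), do: {:error, reason}
end
```

An adapter could then pass something like `fn -> ClosureSketch.classify(do_request()) end` to `run/3`, where `do_request/0` is a hypothetical HTTP call. Returning `{:retry, ...}` only proposes a retry: the helper still checks the error against `policy.retry_on`, so a status such as 501 would come back as `{:error, 501}` under the default policy.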

## Telemetry

`[:allm, :adapter, :retry]` is emitted per retry attempt, **before**
sleeping, with measurements `%{system_time: System.system_time()}`
and metadata `Map.merge(telemetry_metadata, %{attempt: attempt,
delay_ms: actual_delay, reason: error})`. The final attempt (when
`attempt == max_attempts`) emits no retry event — the surrounding
`[:allm, :adapter, :stop]` (or `:exception`) span fires instead.
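As an illustration of consuming this event, the sketch below attaches a logging handler with `:telemetry.attach/4`. The module name, handler id, and log format are hypothetical; only the event name and the metadata keys (`:attempt`, `:delay_ms`, `:reason`, and the optional `:request_id`) come from the description above. It assumes the `:telemetry` package is available in the host application.

```elixir
defmodule RetryTelemetrySketch do
  require Logger

  @event [:allm, :adapter, :retry]

  # Attach once, for example at application start.
  def attach do
    :telemetry.attach("retry-telemetry-sketch", @event, &__MODULE__.handle_event/4, nil)
  end

  # Standard :telemetry handler signature: event, measurements, metadata, config.
  def handle_event(@event, _measurements, metadata, _config) do
    Logger.warning(format(metadata))
  end

  # :request_id is optional on retry events, so fall back to a placeholder.
  def format(%{attempt: attempt, delay_ms: delay_ms, reason: reason} = metadata) do
    request_id = Map.get(metadata, :request_id, "n/a")

    "retry attempt=#{attempt} delay_ms=#{delay_ms} " <>
      "reason=#{inspect(reason)} request_id=#{request_id}"
  end
end
```

Since the event fires before the sleep, a handler like this logs the delay that is about to be waited, not one that has already elapsed.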

# `closure_result`

```elixir
@type closure_result(ok) ::
  {:ok, ok} | {:retry, non_neg_integer(), term()} | {:error, term()}
```

Closure return: success, retry-with-delay, or non-retryable error.

# `engine_retry`

```elixir
@type engine_retry() :: :default | false | keyword()
```

Engine-side retry shapes accepted by `materialize/1`.

# `policy`

```elixir
@type policy() :: %{
  max_attempts: non_neg_integer(),
  base_delay_ms: pos_integer(),
  max_delay_ms: pos_integer(),
  retry_on: [pos_integer() | atom()],
  jitter_ms: non_neg_integer(),
  respect_retry_after: boolean()
}
```

Materialised retry policy after merging `:default | false | keyword()`.

# `default_policy`

```elixir
@spec default_policy() :: policy()
```

Return the spec §6.1 default retry policy.

Each field is cited against
`steering/allm_engine_session_streaming_spec_v0_2.md` §6.1
(`max_attempts: 3`, `base_delay_ms: 500`, `max_delay_ms: 30_000`,
`retry_on: [429, 500, 502, 503, 504, :timeout]`, `jitter_ms: 250`,
`respect_retry_after: true`).

## Examples

    iex> p = ALLM.Retry.default_policy()
    iex> p.max_attempts
    3
    iex> p.retry_on
    [429, 500, 502, 503, 504, :timeout]

# `error_matches?`

```elixir
@spec error_matches?(term(), [pos_integer() | atom()]) :: boolean()
```

Return `true` if `error` is a member of `retry_on`.

Integers match HTTP status codes (`429 ∈ [429, 500, ...]`); atoms
match error atoms (`:timeout`). For opaque error structs carrying a
`:reason` field, the reason is extracted before membership check.

## Examples

    iex> retry_on = [429, 500, :timeout]
    iex> ALLM.Retry.error_matches?(429, retry_on)
    true
    iex> ALLM.Retry.error_matches?(400, retry_on)
    false
    iex> ALLM.Retry.error_matches?(%{reason: :timeout}, retry_on)
    true

# `materialize`

```elixir
@spec materialize(engine_retry()) :: policy() | :no_retry
```

Materialise an engine `:retry` field into a `policy()` or `:no_retry`.

  * `false` → `:no_retry`.
  * `:default` → `default_policy()`.
  * `[]` → `default_policy()`.
  * `keyword()` → `Map.merge(default_policy(), Map.new(kw))`.
  * `[max_attempts: 0]` → `:no_retry` (zero attempts is indistinguishable
    from "no retry").
  * Unknown keys raise `ArgumentError` (a typo like `max_atempts:` fails
    loudly).
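The rules above can be sketched as a small standalone function. This is an illustration of the documented behaviour, not the module's source: `MaterializeSketch` is a hypothetical name, the default map is copied from `default_policy/0`, and applying the zero-attempts check after the merge is an assumption.

```elixir
defmodule MaterializeSketch do
  # Default values as documented for default_policy/0.
  @default %{
    max_attempts: 3,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
    retry_on: [429, 500, 502, 503, 504, :timeout],
    jitter_ms: 250,
    respect_retry_after: true
  }

  def materialize(false), do: :no_retry
  def materialize(:default), do: @default

  def materialize(kw) when is_list(kw) do
    # Reject unknown keys so that typos fail loudly.
    case Keyword.keys(kw) -- Map.keys(@default) do
      [] -> :ok
      unknown -> raise ArgumentError, "unknown retry option(s): #{inspect(unknown)}"
    end

    # Overrides win over defaults; zero attempts collapses to :no_retry.
    case Map.merge(@default, Map.new(kw)) do
      %{max_attempts: 0} -> :no_retry
      policy -> policy
    end
  end
end
```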

## Examples

    iex> ALLM.Retry.materialize(:default).max_attempts
    3

    iex> ALLM.Retry.materialize(false)
    :no_retry

    iex> ALLM.Retry.materialize(max_attempts: 5).max_attempts
    5

    iex> ALLM.Retry.materialize(max_attempts: 0)
    :no_retry

# `run`

```elixir
@spec run(policy() | :no_retry | engine_retry(), map(), (-> closure_result(ok))) ::
  {:ok, ok} | {:error, term()}
when ok: var
```

Run `fun` under the given retry policy.

Accepts a materialised `policy()`, `:no_retry`, or any of the engine
shapes (`:default | false | keyword()`); the engine shape is
materialised internally via `materialize/1`.

Behaviour:

  * `:no_retry` — `fun` is invoked once. `{:ok, _}` is returned
    verbatim; `{:error, error}` and `{:retry, _, error}` collapse to
    `{:error, error}` so the caller doesn't have to handle the third
    shape.

  * `policy()` — `fun` is invoked up to `policy.max_attempts` times.
    `telemetry_metadata` is shallow-merged with `%{attempt: attempt,
    delay_ms: actual_delay, reason: error}` per attempt and emitted
    under `[:allm, :adapter, :retry]` **before** sleeping. The final
    attempt (when `attempt == max_attempts`) emits no retry event
    because the caller's surrounding `[:allm, :adapter, :stop]` /
    `:exception` span fires instead.

    `actual_delay = max(closure_delay_ms, computed_backoff)` where
    `computed_backoff = min(max_delay_ms,
    base_delay_ms * (2 ** (attempt - 1))) + jitter`. When
    `respect_retry_after: true` AND `closure_delay_ms > 0`,
    `actual_delay = closure_delay_ms + jitter` (the closure-supplied
    `Retry-After` value wins).

Jitter bounds: jitter is **additive** in `[0, jitter_ms]`
inclusive, never subtractive — the spec §6.1 `+ jitter(0..250ms)`
notation is load-bearing. Implementation:
`:rand.uniform(jitter_ms + 1) - 1`. Verified on OTP 27 in IEx
2026-04-25: `:rand.uniform/1` returns `1..N` inclusive, and
`:rand.uniform(1) - 1 == 0` when `jitter_ms == 0` (no jitter,
deterministic delay).
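The delay rules and jitter bounds above can be worked through with a short sketch. `BackoffSketch` and its function names are hypothetical; the formulas are transcribed from the text, and jitter is passed in as an argument so the arithmetic is reproducible.

```elixir
defmodule BackoffSketch do
  # Additive jitter in 0..jitter_ms inclusive, as described above.
  def jitter(jitter_ms), do: :rand.uniform(jitter_ms + 1) - 1

  # Capped exponential backoff plus jitter; attempt is 1-based.
  def computed_backoff(policy, attempt, jitter) do
    min(policy.max_delay_ms, policy.base_delay_ms * Integer.pow(2, attempt - 1)) + jitter
  end

  # The closure-supplied delay wins when respect_retry_after is set and the
  # delay is positive; otherwise take the larger of the closure delay and
  # the computed backoff.
  def actual_delay(%{respect_retry_after: true}, _attempt, closure_delay_ms, jitter)
      when closure_delay_ms > 0,
      do: closure_delay_ms + jitter

  def actual_delay(policy, attempt, closure_delay_ms, jitter),
    do: max(closure_delay_ms, computed_backoff(policy, attempt, jitter))
end

policy = %{base_delay_ms: 500, max_delay_ms: 30_000, jitter_ms: 250, respect_retry_after: true}

# Zero jitter and no closure-supplied delay: pure capped doubling.
IO.inspect(for attempt <- 1..7, do: BackoffSketch.actual_delay(policy, attempt, 0, 0))
```

With the default values the base delays are 500, 1000, 2000, 4000, 8000, 16000 ms, and the seventh would be 32000 ms but is capped at 30000 ms. Under the default `max_attempts: 3` only the first two delays are ever slept, since the final attempt does not retry.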

Closure-raised exceptions propagate to the caller unchanged — no
rescue, no telemetry, no retry (spec §6.1 "exception is not
retryable").
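Putting the pieces together, the control flow described in this section might look roughly like the sketch below. It is illustrative only and not the module's source: `RetryLoopSketch` and the `:delay_fun`, `:on_retry`, and `:sleep` options are hypothetical injection points standing in for the backoff computation, the telemetry emit, and `Process.sleep/1`. Returning `{:error, error}` once attempts are exhausted is an assumption based on the return type, and membership is checked with a plain `in` rather than `error_matches?/2`.

```elixir
defmodule RetryLoopSketch do
  def run(policy, fun, opts \\ []) do
    hooks = %{
      delay_fun: Keyword.get(opts, :delay_fun, fn _attempt, delay_ms -> delay_ms end),
      on_retry: Keyword.get(opts, :on_retry, fn _metadata -> :ok end),
      sleep: Keyword.get(opts, :sleep, &Process.sleep/1)
    }

    attempt(policy, fun, 1, hooks)
  end

  # No try/rescue anywhere: an exception raised by fun propagates unchanged.
  defp attempt(policy, fun, n, hooks) do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      {:error, error} ->
        # Non-retryable: no event, no sleep.
        {:error, error}

      {:retry, delay_ms, error} ->
        if n < policy.max_attempts and error in policy.retry_on do
          delay = hooks.delay_fun.(n, delay_ms)
          # Event first, then sleep, matching the documented ordering.
          hooks.on_retry.(%{attempt: n, delay_ms: delay, reason: error})
          hooks.sleep.(delay)
          attempt(policy, fun, n + 1, hooks)
        else
          # Final attempt, or error not in retry_on: give up without an event.
          {:error, error}
        end
    end
  end
end
```

Injecting the sleep and the event callback keeps the sketch testable without real delays; with `max_attempts: 3`, a closure that always returns `{:retry, 0, 429}` is invoked three times and produces two retry events.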

## Examples

    iex> {:ok, v} = ALLM.Retry.run(:no_retry, %{}, fn -> {:ok, 42} end)
    iex> v
    42

    iex> ALLM.Retry.run(:no_retry, %{}, fn -> {:retry, 0, :transient} end)
    {:error, :transient}

    iex> {:ok, _pid} = Agent.start(fn -> 0 end, name: :doctest_retry_counter)
    iex> {:ok, v} =
    ...>   ALLM.Retry.run(
    ...>     [base_delay_ms: 1, jitter_ms: 0],
    ...>     %{},
    ...>     fn ->
    ...>       n = Agent.get_and_update(:doctest_retry_counter, &{&1 + 1, &1 + 1})
    ...>       if n < 2, do: {:retry, 0, 429}, else: {:ok, n}
    ...>     end
    ...>   )
    iex> Agent.stop(:doctest_retry_counter)
    iex> v
    2

