ALLM.Retry (allm v0.3.0)

Copy Markdown View Source

Internal — Layer B retry helper consumed by adapters.

Wraps a non-streaming adapter call in a bounded retry loop with exponential backoff and additive jitter, per spec §6.1. Adapters call run/3 with their own per-attempt closure; the closure returns one of {:ok, term}, {:retry, delay_ms, error}, or {:error, term} (Phase 9 design Non-obvious Decision #4) — the helper handles the loop, the sleep, and the per-attempt [:allm, :adapter, :retry] telemetry emission (Decision #16).

Streaming adapters do not call run/3: spec §6.1 is explicit that "Streaming calls are not retried automatically (partial output has already been delivered to the consumer)." The ALLM.StreamAdapter behaviour-doc surfaces this contract; enforcement is by code review, not by the helper.

Tested end-to-end via ALLM.Providers.Fake's adapter_opts: [retry_until_call: n] plumbing; the same helper is consumed by real-provider adapters in Phase 10/11.

Phase 14.3 added a second caller class: the image-side façade (ALLM.generate_image/3 · edit_image/4 · image_variations/3) wraps the adapter dispatch in Retry.run/3. Backoff timing reuses the chat-side default_policy/0 unchanged AT ITS SOURCE; the image façade augments retry_on AT THE CALL SITE via ALLM.augment_image_retry_policy/1 to add the four image-error atoms (:rate_limited, :provider_unavailable, :timeout, :network_error) — chat-side default_policy/0.retry_on is HTTP-status-coded ([429, 500, 502, 503, 504, :timeout]) and the image side surfaces closed-enum atoms only, so without the augmentation only :timeout would coincidentally retry. See PHASE 14 Decision #9 for the full design rationale. Image-side closures emit {:retry, delay_ms, %ALLM.Error.ImageAdapterError{}} for the four retry-engaging reasons; other reasons surface verbatim with no retry attempt.

v0.2 surface caveat — public-API retry visibility

In v0.2 the public Layer-C entry points (ALLM.generate/3, ALLM.step/3, ALLM.chat/3) all route through ALLM.StreamRunner, which calls the adapter's ALLM.StreamAdapter.stream/2 — the streaming path. Per spec §6.1, streaming calls are not retried, so [:allm, :adapter, :retry] events do NOT fire from any public Layer-C call in v0.2. The retry round-trip is exercised in v0.2 only by direct adapter calls (ALLM.Providers.Fake.generate(req, adapter_opts: [retry_until_call: n]), mirroring how real-provider adapters call Retry.run/3 from their non-streaming c:ALLM.Adapter.generate/3 callbacks). Per the Phase 9 design's Decision #14, real-provider Phase 10/11 adapters' non-streaming generate/2 callbacks reuse this same helper; until a non-streaming Layer-C path lands (or a real-provider integration ships and is invoked through one), retry telemetry observed via the public façade requires the adapter to be reachable through a path that exercises c:ALLM.Adapter.generate/3.

:request_id on retry events

:request_id appears on [:allm, :adapter, :retry] metadata only when the adapter call's opts carry a :request_id (typically threaded from a wrapping Runner / StreamRunner span). Direct adapter calls (e.g., the Fake retry round-trip) emit retry events without :request_id because no wrapping span generated one. This is a soft-promise rather than a hard invariant — see review Finding #3 for the v0.2 surface notes.

Default policy

See default_policy/0 for the spec §6.1 closed map. Materialised via materialize/1 from the engine's :retry field (:default | false | keyword()).

Closure contract

The closure passed to run/3 is invoked up to policy.max_attempts times. It must return one of:

  • {:ok, value} — success; loop returns {:ok, value}.
  • {:retry, delay_ms, error} — retryable failure. When delay_ms > 0, that value (plus jitter) is the delay; otherwise the computed exponential backoff (with jitter) is used. The error term is checked against policy.retry_on for membership; a non-matching error returns {:error, error} immediately.
  • {:error, error} — non-retryable failure; loop returns {:error, error} immediately, no telemetry, no sleep.

Closure-raised exceptions propagate to the caller of run/3 unchanged (no rescue, no telemetry — spec §6.1's "exception is not retryable" rule).

Telemetry

[:allm, :adapter, :retry] is emitted per retry attempt, before sleeping, with measurements %{system_time: System.system_time()} and metadata Map.merge(telemetry_metadata, %{attempt: attempt, delay_ms: actual_delay, reason: error}). The final attempt (when attempt == max_attempts) emits no retry event — the surrounding [:allm, :adapter, :stop] (or :exception) span fires instead.

Summary

Types

Closure return: success, retry-with-delay, or non-retryable error.

Engine-side retry shapes accepted by materialize/1.

Materialised retry policy after merging :default | false | keyword().

Functions

Return the spec §6.1 default retry policy.

Return true if error is a member of retry_on.

Materialise an engine :retry field into a policy() or :no_retry.

Run fun under the given retry policy.

Types

closure_result(ok)

@type closure_result(ok) ::
  {:ok, ok} | {:retry, non_neg_integer(), term()} | {:error, term()}

Closure return: success, retry-with-delay, or non-retryable error.

engine_retry()

@type engine_retry() :: :default | false | keyword()

Engine-side retry shapes accepted by materialize/1.

policy()

@type policy() :: %{
  max_attempts: non_neg_integer(),
  base_delay_ms: pos_integer(),
  max_delay_ms: pos_integer(),
  retry_on: [pos_integer() | atom()],
  jitter_ms: non_neg_integer(),
  respect_retry_after: boolean()
}

Materialised retry policy after merging :default | false | keyword().

Functions

default_policy()

@spec default_policy() :: policy()

Return the spec §6.1 default retry policy.

Field-by-field cited against steering/allm_engine_session_streaming_spec_v0_2.md §6.1 (max_attempts: 3, base_delay_ms: 500, max_delay_ms: 30_000, retry_on: [429, 500, 502, 503, 504, :timeout], jitter_ms: 250, respect_retry_after: true).

Examples

iex> p = ALLM.Retry.default_policy()
iex> p.max_attempts
3
iex> p.retry_on
[429, 500, 502, 503, 504, :timeout]

error_matches?(error, retry_on)

@spec error_matches?(term(), [pos_integer() | atom()]) :: boolean()

Return true if error is a member of retry_on.

Integers match HTTP status codes (429 ∈ [429, 500, ...]); atoms match error atoms (:timeout). For opaque error structs carrying a :reason field, the reason is extracted before membership check.

Examples

iex> retry_on = [429, 500, :timeout]
iex> ALLM.Retry.error_matches?(429, retry_on)
true
iex> ALLM.Retry.error_matches?(400, retry_on)
false
iex> ALLM.Retry.error_matches?(%{reason: :timeout}, retry_on)
true

materialize(opts)

@spec materialize(engine_retry()) :: policy() | :no_retry

Materialise an engine :retry field into a policy() or :no_retry.

  • false:no_retry.
  • :defaultdefault_policy().
  • []default_policy().
  • keyword()Map.merge(default_policy(), Map.new(kw)).
  • [max_attempts: 0]:no_retry (zero attempts is indistinguishable from "no retry").
  • Unknown keys raise ArgumentError (a typo like max_atempts: fails loudly).

Examples

iex> ALLM.Retry.materialize(:default).max_attempts
3

iex> ALLM.Retry.materialize(false)
:no_retry

iex> ALLM.Retry.materialize(max_attempts: 5).max_attempts
5

iex> ALLM.Retry.materialize(max_attempts: 0)
:no_retry

run(policy_or_retry, telemetry_metadata, fun)

@spec run(policy() | :no_retry | engine_retry(), map(), (-> closure_result(ok))) ::
  {:ok, ok} | {:error, term()}
when ok: var

Run fun under the given retry policy.

Accepts a materialised policy(), :no_retry, or any of the engine shapes (:default | false | keyword()); the engine shape is materialised internally via materialize/1.

Behaviour:

  • :no_retryfun is invoked once. {:ok, _} is returned verbatim; {:error, error} and {:retry, _, error} collapse to {:error, error} so the caller doesn't have to handle the third shape.

  • policy()fun is invoked up to policy.max_attempts times. telemetry_metadata is shallow-merged with %{attempt: attempt, delay_ms: actual_delay, reason: error} per attempt and emitted under [:allm, :adapter, :retry] before sleeping. The final attempt (when attempt == max_attempts) emits no retry event because the caller's surrounding [:allm, :adapter, :stop] / :exception span fires instead.

    actual_delay = max(closure_delay_ms, computed_backoff) where computed_backoff = min(max_delay_ms, base_delay_ms * (2 ** (attempt - 1))) + jitter. When respect_retry_after: true AND closure_delay_ms > 0, actual_delay = closure_delay_ms + jitter (the closure-supplied Retry-After value wins).

Jitter bounds: jitter is additive in [0, jitter_ms] inclusive, never subtractive — the spec §6.1 + jitter(0..250ms) notation is load-bearing. Implementation: :rand.uniform(jitter_ms + 1) - 1. Verified on OTP 27 in IEx 2026-04-25: :rand.uniform/1 returns 1..N inclusive, and :rand.uniform(1) - 1 == 0 when jitter_ms == 0 (no jitter, deterministic delay).

Closure-raised exceptions propagate to the caller unchanged — no rescue, no telemetry, no retry (spec §6.1 "exception is not retryable").

Examples

iex> {:ok, v} = ALLM.Retry.run(:no_retry, %{}, fn -> {:ok, 42} end)
iex> v
42

iex> ALLM.Retry.run(:no_retry, %{}, fn -> {:retry, 0, :transient} end)
{:error, :transient}

iex> {:ok, _pid} = Agent.start(fn -> 0 end, name: :doctest_retry_counter)
iex> {:ok, v} =
...>   ALLM.Retry.run(
...>     [base_delay_ms: 1, jitter_ms: 0],
...>     %{},
...>     fn ->
...>       n = Agent.get_and_update(:doctest_retry_counter, &{&1 + 1, &1 + 1})
...>       if n < 2, do: {:retry, 0, 429}, else: {:ok, n}
...>     end
...>   )
iex> Agent.stop(:doctest_retry_counter)
iex> v
2