A failure-and-retry agent: handle_error/3 returns {:prompt, retry_text, state} to self-chain a new attempt. The retry decision lives on agent state: attempt count, accumulated errors, and a configurable cap.

When to reach for this

You expect transient failures (rate limits, network blips, flaky backends) and want the agent to absorb them without the manager having to notice and retry. The retry logic benefits from being stateful -- you want to count attempts, track the sequence of errors, apply backoff, and give up after a cap.

The pattern hinges on one gen_agent primitive: handle_error/3 has the same return shape as handle_response/3, so returning {:prompt, text, state} from handle_error self-chains a retry turn just as cleanly as self-chaining after a success.

What it exercises in gen_agent

  • handle_error/3 returning {:prompt, ..., state} as a retry primitive -- the whole pattern is this one return shape.
  • Attempt counting and error accumulation on agent state.
  • Giving-up semantics via {:halt, %{state | phase: :failed}} when max_attempts is reached.
  • Backoff via Process.sleep/1 inside handle_error/3 before returning the retry prompt (the agent's own process is the one doing the sleep, so it's naturally rate-limited).

The pattern

One callback module. No facade needed -- the manager just starts the agent and polls status.

defmodule Retry.Agent do
  use GenAgent

  defmodule State do
    defstruct [
      :task,
      :max_attempts,
      :result,
      phase: :running,
      attempts: 0,
      errors: []
    ]
  end

  @impl true
  def init_agent(opts) do
    state = %State{
      task: Keyword.fetch!(opts, :task),
      max_attempts: Keyword.get(opts, :max_attempts, 3)
    }

    system = "You are a persistent assistant. Answer concisely in 1-2 sentences."

    backend_opts = [
      system: system,
      max_tokens: Keyword.get(opts, :max_tokens, 100)
    ]

    {:ok, backend_opts, state}
  end

  @impl true
  def handle_response(_ref, response, %State{} = state) do
    new_attempts = state.attempts + 1
    text = String.trim(response.text)

    {:halt,
     %{
       state
       | result: text,
         attempts: new_attempts,
         phase: :succeeded
     }}
  end

  @impl true
  def handle_error(_ref, reason, %State{} = state) do
    new_attempts = state.attempts + 1
    new_errors = state.errors ++ [reason]
    new_state = %{state | attempts: new_attempts, errors: new_errors}

    if new_attempts < state.max_attempts do
      # Optional: exponential backoff before retrying.
      backoff_ms = :math.pow(2, new_attempts - 1) |> round() |> Kernel.*(1000)
      Process.sleep(backoff_ms)

      retry_prompt = "The previous attempt failed. Retry the task: #{state.task}"
      {:prompt, retry_prompt, new_state}
    else
      {:halt, %{new_state | phase: :failed}}
    end
  end
end

Using it

{:ok, _pid} = GenAgent.start_agent(Retry.Agent,
  name: "retry-haiku",
  backend: GenAgent.Backends.Anthropic,
  task: "write a haiku about persistence",
  max_attempts: 5
)

# Kick off the first attempt.
{:ok, _ref} = GenAgent.tell("retry-haiku",
  "write a haiku about persistence")

# Poll status until state.phase is :succeeded or :failed, then read the result.
%{agent_state: state} = GenAgent.status("retry-haiku")
IO.inspect(%{
  phase: state.phase,
  attempts: state.attempts,
  errors: state.errors,
  result: state.result
})

GenAgent.stop("retry-haiku")
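The "poll until the phase is terminal" step above can be wrapped in a small helper. A sketch, assuming only the GenAgent.status/1 call shown above; the module name, function name, and 50 ms interval are illustrative, not part of gen_agent:

```elixir
defmodule RetryHelpers do
  # Poll GenAgent.status/1 until the agent reaches a terminal phase
  # or the time budget runs out.
  def wait_for_terminal(name, timeout_ms \\ 10_000) do
    %{agent_state: state} = GenAgent.status(name)

    cond do
      state.phase in [:succeeded, :failed] -> {:ok, state}
      timeout_ms <= 0 -> {:error, :timeout}
      true ->
        Process.sleep(50)
        wait_for_terminal(name, timeout_ms - 50)
    end
  end
end
```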

Testing without burning tokens

In the playground we test this pattern by injecting a stateful http_fn into the Anthropic backend. The function counts calls and fails the first N with a synthetic {:http_error, 429, ...}, then succeeds. No real backend calls, no tokens spent.

defp failing_http_fn(fail_count) do
  {:ok, counter} = Elixir.Agent.start_link(fn -> 0 end)

  fn _request ->
    n = Elixir.Agent.get_and_update(counter, fn n -> {n, n + 1} end)

    if n < fail_count do
      {:error,
       {:http_error, 429,
        %{"error" => %{"type" => "rate_limit_error",
                       "message" => "simulated (call #{n + 1})"}}}}
    else
      {:ok, canned_success_response(n + 1)}
    end
  end
end

Pass it as http_fn: failing_http_fn(2) in your start opts when testing. Worth cribbing for any unit/integration test of retry logic.
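Wiring it into the earlier usage snippet might look like the following (a sketch; it assumes http_fn is accepted in start opts as described above):

```elixir
# Start the agent against the stub: the first two calls fail with a
# synthetic 429, the third succeeds, so the agent should halt with
# phase: :succeeded after three attempts.
{:ok, _pid} = GenAgent.start_agent(Retry.Agent,
  name: "retry-test",
  backend: GenAgent.Backends.Anthropic,
  http_fn: failing_http_fn(2),
  task: "write a haiku about persistence",
  max_attempts: 5
)
```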

Variations

  • Exponential backoff with jitter. Instead of a fixed 2^(n-1) * 1000, add random jitter: sleep(backoff + :rand.uniform(500)). Avoids thundering-herd when many agents retry simultaneously.
  • Error-class-aware retry. Not every error should be retried. Pattern-match on the reason in handle_error/3: :interrupted and :timeout should probably not retry, {:http_error, 429, _} and {:http_error, 503, _} should. Treat non-retryable errors as immediate halt.
  • Retry budget. Instead of a count cap, a time budget: halt if System.monotonic_time(:second) - state.started_at exceeds N seconds (record started_at in the same unit in init_agent/1). More natural for "best effort within a deadline."
  • Different prompt on retry. The retry prompt could incorporate the error, e.g. "The previous attempt failed with: #{inspect(reason)}. Try a different approach." -- useful when the failure is prompt-shaped rather than transport-shaped.
  • Retry with a different backend. If the first backend errors three times, swap to a fallback backend. Requires the callback module to track which backend it started with and to restart the session mid-run, which is more work than the minimal pattern shown here.
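The first two variations compose naturally in a single handle_error/3. A sketch against the Retry.Agent state above; the retryable?/1 predicate and the 500 ms jitter bound are illustrative choices, not gen_agent conventions:

```elixir
@impl true
def handle_error(_ref, reason, %State{} = state) do
  new_attempts = state.attempts + 1
  new_state = %{state | attempts: new_attempts, errors: state.errors ++ [reason]}

  if retryable?(reason) and new_attempts < state.max_attempts do
    # Exponential backoff with up to 500 ms of random jitter, so many
    # agents retrying at once don't all wake up together.
    backoff_ms = round(:math.pow(2, new_attempts - 1)) * 1000 + :rand.uniform(500)
    Process.sleep(backoff_ms)
    {:prompt, "The previous attempt failed. Retry the task: #{state.task}", new_state}
  else
    # Non-retryable error, or attempts exhausted: give up immediately.
    {:halt, %{new_state | phase: :failed}}
  end
end

# Only transport-level throttling/overload errors are worth retrying.
defp retryable?({:http_error, status, _body}) when status in [429, 503], do: true
defp retryable?(_other), do: false
```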