Tinkex.Recovery.Executor (Tinkex v0.3.4)

GenServer that performs recovery attempts for corrupted training runs.

Users must start and drive this module explicitly (typically alongside Tinkex.Recovery.Monitor). Concurrency is capped (default: 1) to avoid unbounded restarts; adjust via :max_concurrent in start_link/1.

Telemetry events:

  • [:tinkex, :recovery, :started] - attempt began (measurements: %{attempt: n})
  • [:tinkex, :recovery, :checkpoint_selected] - checkpoint chosen
  • [:tinkex, :recovery, :client_created] - training client successfully created
  • [:tinkex, :recovery, :completed] - recovery finished successfully
  • [:tinkex, :recovery, :failed] - attempt failed (metadata includes :error)
  • [:tinkex, :recovery, :exhausted] - max attempts reached, no recovery
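
The events above can be observed with the standard `:telemetry` handler API. The following is a minimal sketch, not part of Tinkex: the module name `RecoveryTelemetryLogger` and the handler id are hypothetical, and it assumes the `:telemetry` package is available as a dependency. Only the `:attempt` measurement and the `:error` metadata key are documented above, so the handler reads them defensively and ignores everything else.

```elixir
defmodule RecoveryTelemetryLogger do
  @moduledoc false
  require Logger

  # Event names copied from the list above.
  @events [
    [:tinkex, :recovery, :started],
    [:tinkex, :recovery, :checkpoint_selected],
    [:tinkex, :recovery, :client_created],
    [:tinkex, :recovery, :completed],
    [:tinkex, :recovery, :failed],
    [:tinkex, :recovery, :exhausted]
  ]

  def events, do: @events

  # Returns :ok, or {:error, :already_exists} if the handler id is already taken.
  def attach do
    :telemetry.attach_many(
      "recovery-telemetry-logger",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # :started carries the attempt number in its measurements.
  def handle_event([:tinkex, :recovery, :started], measurements, _metadata, _config) do
    Logger.info("recovery attempt #{inspect(Map.get(measurements, :attempt))} started")
  end

  # :failed carries the error in its metadata.
  def handle_event([:tinkex, :recovery, :failed], _measurements, metadata, _config) do
    Logger.warning("recovery attempt failed: #{inspect(Map.get(metadata, :error))}")
  end

  # Remaining events are logged by name only.
  def handle_event([:tinkex, :recovery, stage], _measurements, _metadata, _config) do
    Logger.info("recovery event: #{stage}")
  end
end
```

Calling `RecoveryTelemetryLogger.attach()` once at application start, before the executor receives its first request, should be enough to capture every attempt.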

Types

option()

@type option() ::
  {:rest_module, module()}
  | {:service_client_module, module()}
  | {:max_concurrent, pos_integer()}
  | {:send_after, (term(), non_neg_integer() -> reference())}
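
The `:send_after` type has the shape of `Process.send_after/3` without the destination argument, which suggests it exists so that tests can replace the timer. The sketch below rests on that assumption, and on the further assumption that the executor invokes the function from its own process so that `self()` is the intended recipient; neither is stated in this documentation. `FastTimer` is a hypothetical name.

```elixir
defmodule FastTimer do
  # Matches the :send_after type: (term(), non_neg_integer() -> reference()).
  # Assumption: the caller is the process that should receive the message.
  @spec send_after(term(), non_neg_integer()) :: reference()
  def send_after(message, _delay_ms) do
    # Ignore the requested delay so tests do not wait on real timers.
    Process.send_after(self(), message, 0)
  end
end

# Hypothetical usage, assuming the option is honored as described above:
# {:ok, executor} =
#   Tinkex.Recovery.Executor.start_link(send_after: &FastTimer.send_after/2)
```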

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.
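
As a sketch of supervised startup, the executor can be listed as a child using the standard `{module, arg}` tuple form, which makes the supervisor call `child_spec/1` with the given options. `MyApp.RecoverySupervisor` is a hypothetical module, and the value `2` for `:max_concurrent` is only illustrative. Tinkex.Recovery.Monitor is left out because its start arguments are not described here.

```elixir
defmodule MyApp.RecoverySupervisor do
  use Supervisor

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Expands to Tinkex.Recovery.Executor.child_spec(max_concurrent: 2).
      {Tinkex.Recovery.Executor, max_concurrent: 2}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Note that the executor is started without a registered name in this sketch, so callers need another way to obtain its pid, for example `Supervisor.which_children/1`.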

recover(executor, run_id, service_pid, policy, opts \\ [])

@spec recover(pid(), String.t(), pid(), Tinkex.Recovery.Policy.t() | map(), keyword()) ::
  :ok | {:error, term()}

Enqueue a recovery request.

Options:

  • :config - Tinkex.Config.t() used for REST lookups when checkpoint is not provided
  • :metadata - map propagated to telemetry/callbacks (e.g., %{training_pid: pid})
  • :last_checkpoint - a Tinkex.Types.Checkpoint.t(), a map, or a string path; when given, the checkpoint is not refetched
  • :run - Tinkex.Types.TrainingRun.t() to reuse an already fetched run
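
A call might look like the sketch below. `MyApp.RecoveryTrigger` and its argument names are hypothetical; the policy and config values are assumed to be built elsewhere, since their fields are not documented on this page. Because `recover/5` only enqueues the request, an `:ok` return should be read as "accepted", with the actual outcome reported through the telemetry events.

```elixir
defmodule MyApp.RecoveryTrigger do
  alias Tinkex.Recovery.Executor

  # executor     - pid of a running Tinkex.Recovery.Executor
  # service_pid  - pid passed as the third argument of recover/5
  # policy       - Tinkex.Recovery.Policy.t() or map, built elsewhere
  # config       - Tinkex.Config.t(), used when no checkpoint is supplied
  # training_pid - carried in :metadata, as in the example above
  def request(executor, run_id, service_pid, policy, config, training_pid) do
    opts = [config: config, metadata: %{training_pid: training_pid}]

    case Executor.recover(executor, run_id, service_pid, policy, opts) do
      # The request was enqueued; recovery itself has not finished yet.
      :ok -> :ok
      {:error, reason} -> {:error, {:recovery_not_enqueued, reason}}
    end
  end
end
```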

start_link(opts \\ [])

@spec start_link([option()]) :: GenServer.on_start()

Start the executor. Accepts the option() values listed under Types; :max_concurrent defaults to 1.