Tinkex.Recovery.Executor (Tinkex v0.4.0)


GenServer that performs recovery attempts for corrupted training runs.

The executor is not started automatically: start it yourself and submit requests with recover/5 (typically alongside Tinkex.Recovery.Monitor). Concurrent recoveries are capped (default: 1) to avoid unbounded restarts; adjust the cap via :max_concurrent in start_link/1.

Telemetry events:

  • [:tinkex, :recovery, :started] - attempt began (measurements: %{attempt: n})
  • [:tinkex, :recovery, :checkpoint_selected] - checkpoint chosen
  • [:tinkex, :recovery, :client_created] - training client successfully created
  • [:tinkex, :recovery, :completed] - recovery finished successfully
  • [:tinkex, :recovery, :failed] - attempt failed (metadata includes :error)
  • [:tinkex, :recovery, :exhausted] - max attempts reached without a successful recovery
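
One way to observe these events is to attach a single handler with :telemetry.attach_many/4. The following is a minimal sketch, not part of Tinkex: the module name RecoveryTelemetryLogger, the handler id "recovery-logger", and the log wording are assumptions chosen for illustration. It relies only on the measurement and metadata keys documented above (:attempt on :started, :error on :failed) and treats everything else generically.

```elixir
defmodule RecoveryTelemetryLogger do
  # Hypothetical helper module; not part of Tinkex.
  require Logger

  # Event names copied from the list above.
  @events [
    [:tinkex, :recovery, :started],
    [:tinkex, :recovery, :checkpoint_selected],
    [:tinkex, :recovery, :client_created],
    [:tinkex, :recovery, :completed],
    [:tinkex, :recovery, :failed],
    [:tinkex, :recovery, :exhausted]
  ]

  def events, do: @events

  # Attaches one handler to all six events (requires the :telemetry package).
  def attach do
    :telemetry.attach_many("recovery-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Standard :telemetry handler signature: event, measurements, metadata, config.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(describe(event, measurements, metadata))
  end

  # Only the documented keys are pattern matched; other events fall through.
  def describe([:tinkex, :recovery, :started], %{attempt: n}, _metadata),
    do: "recovery attempt #{n} started"

  def describe([:tinkex, :recovery, :failed], _measurements, %{error: error}),
    do: "recovery attempt failed: #{inspect(error)}"

  def describe([:tinkex, :recovery, stage], _measurements, _metadata),
    do: "recovery event: #{stage}"
end
```

Calling RecoveryTelemetryLogger.attach() once at application start would be enough; keeping describe/3 separate from the handler makes the formatting easy to test without emitting real events.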

Types

option()

@type option() ::
  {:rest_module, module()}
  | {:service_client_module, module()}
  | {:max_concurrent, pos_integer()}
  | {:send_after, (term(), non_neg_integer() -> reference())}
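
The :send_after option has the same shape as a two-argument wrapper around Process.send_after/3, which suggests it exists so that timers can be replaced in tests. The sketch below rests on that reading: it assumes the executor invokes the function from its own process to schedule a message to itself, which the type alone does not confirm. The variable names are illustrative.

```elixir
# Hypothetical test helper with the shape of the :send_after option type,
# (term(), non_neg_integer() -> reference()). It ignores the requested delay.
immediate_send_after = fn message, _delay_ms ->
  # Process.send_after/3 returns a timer reference, matching the declared
  # return type. self() is whichever process calls this function.
  Process.send_after(self(), message, 0)
end

# Options as they might be passed to start_link/1 in a test.
executor_opts = [max_concurrent: 1, send_after: immediate_send_after]
```

With a function like this, a test would not have to wait for real delays, provided the assumption about the calling process holds.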

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.
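
Because child_spec/1 is defined, the executor can be listed as a {module, arg} child in a supervision tree. The sketch below assumes the child spec forwards its argument to start_link/1, which is the usual GenServer default but is not stated here. MyApp.RecoverySupervisor is a hypothetical module name.

```elixir
defmodule MyApp.RecoverySupervisor do
  # Hypothetical supervisor; the name and layout are illustrative only.
  use Supervisor

  def start_link(init_arg \\ []) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    Supervisor.init(children(), strategy: :one_for_one)
  end

  # The {module, arg} tuple makes the supervisor call
  # Tinkex.Recovery.Executor.child_spec([max_concurrent: 2]).
  def children do
    [{Tinkex.Recovery.Executor, [max_concurrent: 2]}]
  end
end
```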

recover(executor, run_id, service_pid, policy, opts \\ [])

@spec recover(pid(), String.t(), pid(), Tinkex.Recovery.Policy.t() | map(), keyword()) ::
  :ok | {:error, term()}

Enqueue a recovery request.

Options:

  • :config - Tinkex.Config.t() used for REST lookups when checkpoint is not provided
  • :metadata - map propagated to telemetry/callbacks (e.g., %{training_pid: pid})
  • :last_checkpoint - Tinkex.Types.Checkpoint.t()/map/string path to skip refetch
  • :run - Tinkex.Types.TrainingRun.t() to reuse an already fetched run
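
A call to recover/5 might be wrapped as in the sketch below. RecoveryRequest and both of its functions are hypothetical helpers, not part of Tinkex, and the policy is passed through untouched because its fields are not described here. Since the function enqueues a request, :ok is read as "accepted for processing" rather than "recovered"; that reading is an assumption.

```elixir
defmodule RecoveryRequest do
  # Hypothetical helper module; not part of Tinkex.

  # Builds the opts keyword list for recover/5. :last_checkpoint is added
  # only when a checkpoint is already known, so the refetch can be skipped.
  def build_opts(training_pid, last_checkpoint \\ nil) do
    base = [metadata: %{training_pid: training_pid}]

    if last_checkpoint do
      Keyword.put(base, :last_checkpoint, last_checkpoint)
    else
      base
    end
  end

  # Enqueues the request and tags errors so callers can tell an enqueue
  # failure apart from a failure of the recovery itself.
  def enqueue(executor, run_id, service_pid, policy, opts) do
    case Tinkex.Recovery.Executor.recover(executor, run_id, service_pid, policy, opts) do
      :ok -> :ok
      {:error, reason} -> {:error, {:recovery_not_enqueued, reason}}
    end
  end
end
```

The outcome of an accepted request would then be observed through the telemetry events listed for this module.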

start_link(opts \\ [])

@spec start_link([option()]) :: GenServer.on_start()

Start the executor. Accepts the options described by option(), including :max_concurrent (default: 1).