Tinkex.Recovery.Executor (Tinkex v0.3.4)
GenServer that performs recovery attempts for corrupted training runs.
Users must start and drive this module explicitly (typically alongside
Tinkex.Recovery.Monitor). Concurrency is capped (default: 1) to avoid
unbounded restarts; adjust via :max_concurrent in start_link/1.
Telemetry events:
[:tinkex, :recovery, :started] - attempt began (measurements: %{attempt: n})
[:tinkex, :recovery, :checkpoint_selected] - checkpoint chosen
[:tinkex, :recovery, :client_created] - training client successfully created
[:tinkex, :recovery, :completed] - recovery finished successfully
[:tinkex, :recovery, :failed] - attempt failed (metadata includes :error)
[:tinkex, :recovery, :exhausted] - max attempts reached, no recovery
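The events listed above can be observed with a standard :telemetry handler. The following is a minimal sketch, not part of Tinkex: the module name MyApp.RecoveryTelemetry and the handler id are hypothetical, and it assumes the :telemetry package is available in the application. Only the documented :attempt measurement and :error metadata key are pattern-matched; every other event falls through to a generic message.

```elixir
defmodule MyApp.RecoveryTelemetry do
  # Hypothetical helper module; not part of Tinkex.
  require Logger

  # Event names copied from the list above.
  @events [
    [:tinkex, :recovery, :started],
    [:tinkex, :recovery, :checkpoint_selected],
    [:tinkex, :recovery, :client_created],
    [:tinkex, :recovery, :completed],
    [:tinkex, :recovery, :failed],
    [:tinkex, :recovery, :exhausted]
  ]

  # Attach one handler for all recovery events (handler id is an arbitrary choice).
  def attach do
    :telemetry.attach_many("myapp-tinkex-recovery", @events, &__MODULE__.handle_event/4, nil)
  end

  # Standard 4-arity :telemetry handler callback.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(describe(event, measurements, metadata))
  end

  # Pure formatting, kept separate so it can be exercised without :telemetry.
  def describe([:tinkex, :recovery, :started], %{attempt: n}, _metadata),
    do: "recovery attempt #{n} started"

  def describe([:tinkex, :recovery, :failed], _measurements, %{error: error}),
    do: "recovery attempt failed: #{inspect(error)}"

  def describe([:tinkex, :recovery, stage], _measurements, _metadata),
    do: "recovery #{stage}"
end
```

Calling MyApp.RecoveryTelemetry.attach() once at application start would be enough; the handler stays attached until it is detached or the VM stops.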
Summary
Functions
Returns a specification to start this module under a supervisor.
Enqueue a recovery request.
Start the executor.
Types
@type option() ::
        {:rest_module, module()}
        | {:service_client_module, module()}
        | {:max_concurrent, pos_integer()}
        | {:send_after, (term(), non_neg_integer() -> reference())}
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
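As a sketch of how the child specification might be used, the executor can be listed as a child in an application supervisor. This assumes the generated child spec forwards its argument to start_link/1 (the usual GenServer behaviour, not confirmed by this page); MyApp.Application and MyApp.Supervisor are hypothetical names, and the value 2 for :max_concurrent is only illustrative.

```elixir
defmodule MyApp.Application do
  # Hypothetical application module; not part of Tinkex.
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Assumption: the options list is passed through to Executor.start_link/1.
      {Tinkex.Recovery.Executor, max_concurrent: 2}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```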
@spec recover(pid(), String.t(), pid(), Tinkex.Recovery.Policy.t() | map(), keyword()) :: :ok | {:error, term()}
Enqueue a recovery request.
Options:
:config - Tinkex.Config.t() used for REST lookups when checkpoint is not provided
:metadata - map propagated to telemetry/callbacks (e.g., %{training_pid: pid})
:last_checkpoint - Tinkex.Types.Checkpoint.t()/map/string path to skip refetch
:run - Tinkex.Types.TrainingRun.t() to reuse an already fetched run
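A hedged sketch of enqueueing a request with these options follows. The spec gives only the argument types, so the parameter names run_id (the String.t()) and service_client (the second pid()) are assumptions about their meaning; MyApp.RecoveryRequest is a hypothetical wrapper, and the policy is passed through unchanged because its fields are not described here.

```elixir
defmodule MyApp.RecoveryRequest do
  # Hypothetical wrapper module; not part of Tinkex.
  alias Tinkex.Recovery.Executor

  # Build the keyword options for recover/5 using only the documented keys.
  # :last_checkpoint is added only when a checkpoint is already known.
  def build_opts(config, training_pid, last_checkpoint \\ nil) do
    opts = [config: config, metadata: %{training_pid: training_pid}]

    if last_checkpoint do
      Keyword.put(opts, :last_checkpoint, last_checkpoint)
    else
      opts
    end
  end

  # Argument names are assumptions; the spec only states
  # pid(), String.t(), pid(), policy, keyword().
  def enqueue(executor, run_id, service_client, policy, opts) do
    case Executor.recover(executor, run_id, service_client, policy, opts) do
      :ok -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Since recover/5 returns :ok or {:error, term()}, callers can match on the result to decide whether the request was accepted into the queue.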
@spec start_link([option()]) :: GenServer.on_start()
Start the executor.