Tinkex.Recovery.Policy (Tinkex v0.3.4)
Configuration for opt-in training run recovery.
Recovery is disabled by default, and the remaining defaults are conservative:
three attempts, a 5s base backoff (capped at 60s), a 30s polling interval, and
optimizer state restore enabled.
The checkpoint strategy defaults to :latest; :best is reserved for future
support, and {:specific, path} targets an explicit checkpoint path.
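The defaults described above can be pictured with a small stand-in struct. This is only a sketch: `PolicySketch` and `delay_ms/2` are hypothetical names, not part of Tinkex, the checkpoint path is a made-up placeholder, and the doubling schedule is an assumption, since the text states only the 5s base and the 60s cap, not how the delay grows between attempts.

```elixir
defmodule PolicySketch do
  # Hypothetical stand-in that mirrors the documented defaults.
  # It is NOT Tinkex.Recovery.Policy and omits the callback fields.
  defstruct enabled: false,
            max_attempts: 3,
            backoff_ms: 5_000,
            max_backoff_ms: 60_000,
            poll_interval_ms: 30_000,
            checkpoint_strategy: :latest,
            restore_optimizer: true

  # Assumption: the delay doubles on each attempt and is clamped to
  # max_backoff_ms. The real growth rule is not stated in the docs.
  def delay_ms(%__MODULE__{} = policy, attempt)
      when is_integer(attempt) and attempt >= 1 do
    min(policy.backoff_ms * Integer.pow(2, attempt - 1), policy.max_backoff_ms)
  end
end

# Opt in and pin an explicit checkpoint (the path is a placeholder).
policy =
  struct!(PolicySketch,
    enabled: true,
    checkpoint_strategy: {:specific, "checkpoints/example"}
  )

# Delays for the three default attempts under the doubling assumption.
delays = for attempt <- 1..policy.max_attempts, do: PolicySketch.delay_ms(policy, attempt)
IO.inspect(delays)
# → [5000, 10000, 20000]

# A hypothetical fifth attempt would hit the 60s cap (5s * 16 = 80s).
IO.inspect(PolicySketch.delay_ms(policy, 5))
# → 60000
```

With only three attempts and a 5s base, the cap would never be reached under this assumed schedule; it matters only if `max_attempts` or `backoff_ms` is raised.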
Telemetry events emitted by the recovery pipeline:
[:tinkex, :recovery, :detected] - monitor observed corrupted: true
[:tinkex, :recovery, :started] - executor attempt began
[:tinkex, :recovery, :checkpoint_selected] - checkpoint picked
[:tinkex, :recovery, :client_created] - training client created
[:tinkex, :recovery, :completed] - recovery finished successfully
[:tinkex, :recovery, :failed] - attempt failed (before exhaustion)
[:tinkex, :recovery, :exhausted] - max attempts reached, no recovery
[:tinkex, :recovery, :poll_error] - monitor failed to poll run status (keeps state)
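One way to observe these events is to attach a single handler with `:telemetry.attach_many/4`. The sketch below is illustrative: `RecoveryLogger` and the handler id `"recovery-logger"` are hypothetical names, and because the measurements and metadata carried by each event are not documented here, the handler only inspects them rather than relying on any particular key.

```elixir
# Assumes the script runs standalone; inside a Mix project, list
# :telemetry as a dependency instead of calling Mix.install/1.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule RecoveryLogger do
  require Logger

  # The eight event names listed in the module documentation.
  @events for stage <- [
                :detected,
                :started,
                :checkpoint_selected,
                :client_created,
                :completed,
                :failed,
                :exhausted,
                :poll_error
              ],
              do: [:tinkex, :recovery, stage]

  def events, do: @events

  # Attaches one handler for every recovery event.
  # Returns :ok, or {:error, :already_exists} if the id is already in use.
  def attach do
    :telemetry.attach_many("recovery-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # The payload shape is an unknown here, so it is logged as-is.
  def handle_event([:tinkex, :recovery, stage], measurements, metadata, _config) do
    Logger.info("recovery #{stage}: #{inspect(measurements)} #{inspect(metadata)}")
  end
end

:ok = RecoveryLogger.attach()
```

Passing the handler as a remote capture (`&__MODULE__.handle_event/4`) rather than an anonymous function follows the recommendation in the `:telemetry` documentation, which warns about the performance cost of local function handlers.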
Summary
Functions
Build a recovery policy from a struct, keyword list, or map.
Types
@type checkpoint_strategy() :: :latest | :best | {:specific, String.t()}
@type recovery_callback() :: (pid() | nil, pid(), Tinkex.Types.Checkpoint.t() -> :ok) | nil
@type t() :: %Tinkex.Recovery.Policy{
  backoff_ms: pos_integer(),
  checkpoint_strategy: checkpoint_strategy(),
  enabled: boolean(),
  max_attempts: pos_integer(),
  max_backoff_ms: pos_integer(),
  on_failure: failure_callback(),
  on_recovery: recovery_callback(),
  poll_interval_ms: pos_integer(),
  restore_optimizer: boolean()
}