Tinkex.Recovery.Policy (Tinkex v0.3.4)

Configuration for opt-in training run recovery.

Recovery is disabled by default, and the remaining defaults are conservative: three attempts, a 5s base backoff (capped at 60s), a 30s polling interval, and optimizer state restore enabled. The checkpoint strategy defaults to :latest; :best is reserved for future support, and {:specific, path} targets an explicit checkpoint path.
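
As an illustration of how the backoff defaults could play out, the sketch below computes a capped delay schedule. It assumes the delay doubles after each failed attempt, which this page does not state. BackoffSketch and its functions are hypothetical and not part of Tinkex.

```elixir
defmodule BackoffSketch do
  @moduledoc """
  Hypothetical helper, not part of Tinkex. Assumes exponential (doubling)
  backoff; the real executor may compute delays differently.
  """

  # Delay before the given attempt (1-based), capped at max_ms.
  # Defaults mirror the documented 5s base and 60s cap.
  def delay_ms(attempt, base_ms \\ 5_000, max_ms \\ 60_000)
      when is_integer(attempt) and attempt >= 1 do
    min(base_ms * Integer.pow(2, attempt - 1), max_ms)
  end

  # Full list of delays for a run of max_attempts attempts.
  def schedule(max_attempts, base_ms \\ 5_000, max_ms \\ 60_000)
      when is_integer(max_attempts) and max_attempts >= 1 do
    Enum.map(1..max_attempts, &delay_ms(&1, base_ms, max_ms))
  end
end
```

Under this doubling assumption, the default three attempts give delays of 5s, 10s, and 20s; the 60s cap would only come into play from the fifth attempt onward.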

Telemetry events emitted by the recovery pipeline:

  • [:tinkex, :recovery, :detected] - monitor observed corrupted: true
  • [:tinkex, :recovery, :started] - executor attempt began
  • [:tinkex, :recovery, :checkpoint_selected] - checkpoint picked
  • [:tinkex, :recovery, :client_created] - training client created
  • [:tinkex, :recovery, :completed] - recovery finished successfully
  • [:tinkex, :recovery, :failed] - attempt failed (before exhaustion)
  • [:tinkex, :recovery, :exhausted] - max attempts reached, no recovery
  • [:tinkex, :recovery, :poll_error] - monitor failed to poll run status (keeps state)
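
The events above can be observed with a standard :telemetry handler. The following is a minimal sketch that logs every recovery event. RecoveryTelemetrySketch and the handler id are hypothetical names, and because this page does not document the measurements or metadata for each event, the handler only inspects them rather than assuming any keys.

```elixir
defmodule RecoveryTelemetrySketch do
  @moduledoc """
  Hypothetical logging handler for the recovery events listed on this page.
  Requires the :telemetry library at runtime.
  """

  require Logger

  @events [
    [:tinkex, :recovery, :detected],
    [:tinkex, :recovery, :started],
    [:tinkex, :recovery, :checkpoint_selected],
    [:tinkex, :recovery, :client_created],
    [:tinkex, :recovery, :completed],
    [:tinkex, :recovery, :failed],
    [:tinkex, :recovery, :exhausted],
    [:tinkex, :recovery, :poll_error]
  ]

  def events, do: @events

  # Attach one handler to all recovery events.
  def attach(handler_id \\ "recovery-telemetry-sketch") do
    :telemetry.attach_many(handler_id, @events, &__MODULE__.handle_event/4, nil)
  end

  # Payload shapes are not documented here, so they are logged as-is.
  def handle_event([:tinkex, :recovery, stage], measurements, metadata, _config) do
    Logger.info(describe(stage, measurements, metadata))
  end

  def describe(stage, measurements, metadata) do
    "tinkex recovery #{stage}: " <>
      "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
  end
end
```

Calling RecoveryTelemetrySketch.attach() once during application start would be enough; a captured module function is used instead of an anonymous function, which is what :telemetry recommends for handlers.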

Types

checkpoint_strategy()

@type checkpoint_strategy() :: :latest | :best | {:specific, String.t()}

failure_callback()

@type failure_callback() :: (String.t(), term() -> :ok) | nil

recovery_callback()

@type recovery_callback() ::
  (pid() | nil, pid(), Tinkex.Types.Checkpoint.t() -> :ok) | nil

t()

@type t() :: %Tinkex.Recovery.Policy{
  backoff_ms: pos_integer(),
  checkpoint_strategy: checkpoint_strategy(),
  enabled: boolean(),
  max_attempts: pos_integer(),
  max_backoff_ms: pos_integer(),
  on_failure: failure_callback(),
  on_recovery: recovery_callback(),
  poll_interval_ms: pos_integer(),
  restore_optimizer: boolean()
}

Functions

new(policy)

@spec new(t() | keyword() | map() | nil) :: t()

Build a recovery policy from a struct, keyword list, or map.

Unknown keys are ignored; invalid values fall back to defaults to keep the policy conservative and opt-in.
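
A short usage sketch, assuming Tinkex is available as a dependency: the keyword list below uses only fields from t(), while the chosen values, the RecoveryPolicySketch module, and the reading of the failure callback's string argument as a run identifier are illustrative assumptions.

```elixir
defmodule RecoveryPolicySketch do
  @moduledoc """
  Hypothetical wrapper showing how an opt-in policy might be built.
  """

  require Logger

  # Options keyed by the fields of Tinkex.Recovery.Policy.t().
  # The values are examples, not recommendations.
  def options do
    [
      enabled: true,
      max_attempts: 5,
      backoff_ms: 2_000,
      max_backoff_ms: 30_000,
      poll_interval_ms: 15_000,
      checkpoint_strategy: :latest,
      restore_optimizer: true,
      on_failure: &__MODULE__.log_failure/2
    ]
  end

  # Matches failure_callback(): (String.t(), term() -> :ok).
  # Treating the string as a run identifier is an assumption.
  def log_failure(run_id, reason) do
    Logger.error("recovery failed for #{run_id}: #{inspect(reason)}")
    :ok
  end

  # Builds the policy struct; needs Tinkex loaded at runtime.
  def build do
    Tinkex.Recovery.Policy.new(options())
  end
end
```

Since recovery is opt-in, enabled: true has to be set explicitly; any field left out keeps its default.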