Recovery and Checkpoint Restart (Opt-in)

The recovery layer is disabled by default. Once enabled, it detects corrupted training runs and restarts them from checkpoints automatically, using processes that you start and control explicitly.

Components

  • Tinkex.Recovery.Policy: declarative recovery settings (checkpoint strategy, retries, callbacks).
  • Tinkex.Recovery.Monitor: polls runs via REST and hands corrupted runs to the executor.
  • Tinkex.Recovery.Executor: performs the restarts, bounded by max_concurrent.
  • Telemetry events: :detected, :started, :checkpoint_selected, :client_created, :completed, :failed, :exhausted, :poll_error (all under [:tinkex, :recovery, ...]).

Quickstart

require Logger

api_key = System.fetch_env!("TINKER_API_KEY")
config = Tinkex.Config.new(api_key: api_key, recovery: %{enabled: true})

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    checkpoint_strategy: :latest,        # or {:specific, "tinker://..."}
    restore_optimizer: true,             # false for weights-only
    poll_interval_ms: 15_000,
    max_attempts: 3,
    backoff_ms: 5_000,
    max_backoff_ms: 60_000,
    on_recovery: fn old_pid, new_pid, cp ->
      Logger.info("Recovered #{cp.tinker_path} -> #{inspect(new_pid)} (old=#{inspect(old_pid)})")
      :ok
    end,
    on_failure: fn run_id, reason ->
      Logger.warning("Recovery failed for #{run_id}: #{inspect(reason)}")
      :ok
    end
  )

{:ok, service} = Tinkex.ServiceClient.start_link(config: config)

{:ok, executor} = Tinkex.Recovery.Executor.start_link(max_concurrent: 2)

{:ok, monitor} =
  Tinkex.Recovery.Monitor.start_link(
    executor: executor,
    policy: policy
  )

# training_pid is the pid of your existing training process.
:ok = Tinkex.Recovery.Monitor.monitor_run(monitor, "run-123", service, %{training_pid: training_pid})

Checkpoint Strategy

  • :latest (default): uses TrainingRun.last_checkpoint
  • {:specific, path}: explicit checkpoint path
  • :best: reserved for future support

If last_checkpoint is missing, the executor fetches the run via REST, which requires that config be available to it (see Safety Notes).
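For example, a minimal policy sketch that pins recovery to one known-good checkpoint; the tinker:// URI below is a placeholder, not a real path:

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    # Restore from an explicit checkpoint instead of TrainingRun.last_checkpoint
    checkpoint_strategy: {:specific, "tinker://..."}
  )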

Optimizer Restore

restore_optimizer: true uses ServiceClient.create_training_client_from_state_with_optimizer/3. Set to false for weights-only restarts (fresh optimizer).
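A weights-only sketch: model weights come from the checkpoint, while optimizer state (for example, Adam moments) starts from scratch:

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    # Weights are restored from the checkpoint; the optimizer starts fresh
    restore_optimizer: false
  )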

Concurrency and Backoff

  • max_concurrent (executor option) caps simultaneous recoveries (default: 1).
  • max_attempts, backoff_ms, and max_backoff_ms control retry behavior; :exhausted is emitted once all attempts are spent. See the sketch below.
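
A minimal configuration sketch; the comments restate the documented semantics, not implementation details:

{:ok, executor} = Tinkex.Recovery.Executor.start_link(max_concurrent: 2)

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    max_attempts: 5,        # emit :exhausted after the fifth failed attempt
    backoff_ms: 2_000,      # initial delay between attempts
    max_backoff_ms: 30_000  # cap on the delay, however the backoff grows
  )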

Telemetry and Observability

Attach handlers to [:tinkex, :recovery, *] for tracing. :poll_error indicates a REST polling issue; the monitor keeps its state and continues polling.
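
A minimal handler sketch using :telemetry.attach_many/4 (MyApp.RecoveryTelemetry is a name chosen for this example):

defmodule MyApp.RecoveryTelemetry do
  require Logger

  # One clause covers every [:tinkex, :recovery, step] event.
  def handle_event([:tinkex, :recovery, step], _measurements, metadata, _config) do
    Logger.info("recovery #{step}: #{inspect(metadata)}")
  end
end

events =
  for step <- [:detected, :started, :checkpoint_selected, :client_created,
               :completed, :failed, :exhausted, :poll_error],
      do: [:tinkex, :recovery, step]

:telemetry.attach_many("recovery-logger", events, &MyApp.RecoveryTelemetry.handle_event/4, nil)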

Safety Notes

  • Nothing runs automatically; you must start the monitor and executor yourself and set enabled: true on the policy.
  • Callbacks should be cheap and resilient; errors raised inside them are swallowed to avoid cascading failures. See the sketch after this list.
  • Ensure config is available for REST lookups when the monitor and executor are used outside the original process.
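
A defensive callback sketch that contains failures locally, even though the library also swallows them:

require Logger

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    on_recovery: fn _old_pid, new_pid, cp ->
      try do
        # Your side effect here, e.g. alerting or pushing a metric.
        Logger.info("recovered #{cp.tinker_path} as #{inspect(new_pid)}")
      rescue
        _ -> :ok
      end

      :ok
    end
  )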

Offline Example

Run mix run examples/recovery_simulated.exs to see a fully offline flow that seeds a checkpoint, flips a run to corrupted: true, and exercises the monitor + executor with stubbed REST/service modules. No network or API key is required.

Live Example with Injected Corruption

Run mix run examples/recovery_live_injected.exs to create a training run, save a real checkpoint, inject a single corrupted: true poll response (no server change), and let the monitor + executor restore from that checkpoint. The example then saves a second checkpoint from the recovered client to prove the restart path. Requires TINKER_API_KEY (and optional TINKER_BASE_URL/TINKER_BASE_MODEL).