Recovery and Checkpoint Restart (Opt-in)
The recovery layer is disabled by default. It lets you detect corrupted training runs and restart them from checkpoints automatically, using explicit, user-controlled processes.
Components
- `Tinkex.Recovery.Policy` – configuration (defaults off, conservative backoff, optimizer restore on)
- `Tinkex.Recovery.Monitor` – polls `Rest.get_training_run/2` for `corrupted: true`
- `Tinkex.Recovery.Executor` – bounded worker that restarts runs from checkpoints
Telemetry events: `:detected`, `:started`, `:checkpoint_selected`, `:client_created`, `:completed`, `:failed`, `:exhausted`, `:poll_error` (all under `[:tinkex, :recovery, ...]`).
Quickstart
```elixir
require Logger

# api_key and training_pid are assumed to be bound in your application.
config = Tinkex.Config.new(api_key: api_key, recovery: %{enabled: true})

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    checkpoint_strategy: :latest,  # or {:specific, "tinker://..."}
    restore_optimizer: true,       # false for weights-only
    poll_interval_ms: 15_000,
    max_attempts: 3,
    backoff_ms: 5_000,
    max_backoff_ms: 60_000,
    on_recovery: fn old_pid, new_pid, cp ->
      Logger.info("Recovered #{cp.tinker_path} -> #{inspect(new_pid)} (old=#{inspect(old_pid)})")
      :ok
    end,
    on_failure: fn run_id, reason ->
      Logger.warning("Recovery failed for #{run_id}: #{inspect(reason)}")
      :ok
    end
  )

{:ok, service} = Tinkex.ServiceClient.start_link(config: config)

{:ok, executor} = Tinkex.Recovery.Executor.start_link(max_concurrent: 2)

{:ok, monitor} = Tinkex.Recovery.Monitor.start_link(executor: executor, policy: policy)

:ok = Tinkex.Recovery.Monitor.monitor_run(monitor, "run-123", service, %{training_pid: training_pid})
```

Checkpoint Strategy
- `:latest` (default): uses `TrainingRun.last_checkpoint`
- `{:specific, path}`: explicit checkpoint path
- `:best`: reserved for future support

If `last_checkpoint` is missing, the executor fetches the run via REST (this requires passing `config`).
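For example, pinning recovery to a known-good checkpoint is just a policy change. A minimal sketch (the `tinker://` path below is a placeholder, not a real checkpoint):

```elixir
# Pin recovery to an explicit checkpoint instead of the latest one.
# The path is hypothetical; substitute a real tinker:// checkpoint path.
pinned_policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    checkpoint_strategy: {:specific, "tinker://my-run/checkpoints/0042"}
  )
```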
Optimizer Restore
`restore_optimizer: true` uses `ServiceClient.create_training_client_from_state_with_optimizer/3`. Set it to `false` for weights-only restarts (fresh optimizer).
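A weights-only restart is the same policy with optimizer restore switched off. A minimal sketch:

```elixir
# Weights-only restart: model weights are restored from the checkpoint,
# while the optimizer starts fresh.
weights_only_policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    restore_optimizer: false
  )
```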
Concurrency and Backoff
`max_concurrent` (an executor option) caps simultaneous recoveries (default: 1). `max_attempts`, `backoff_ms`, and `max_backoff_ms` control retry behavior; `:exhausted` is emitted when all attempts are spent.
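As a rough illustration, assuming the delay doubles per attempt and is capped at `max_backoff_ms` (the exact schedule is an implementation detail of the executor):

```elixir
# Illustrative only: with backoff_ms: 5_000 and max_attempts: 3, a doubling
# schedule would wait ~5s and ~10s between attempts before :exhausted fires.
backoff_ms = 5_000
max_backoff_ms = 60_000

delays =
  for attempt <- 0..1 do
    min(backoff_ms * Integer.pow(2, attempt), max_backoff_ms)
  end

# delays == [5_000, 10_000]
```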
Telemetry and Observability
Attach handlers to `[:tinkex, :recovery, *]` for tracing. `:poll_error` indicates REST polling issues (the monitor keeps its state).
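For example, a minimal handler that logs every recovery event via the standard `:telemetry` API:

```elixir
require Logger

# Build the full event names from the suffixes listed above.
events =
  for event <- [:detected, :started, :checkpoint_selected, :client_created,
                :completed, :failed, :exhausted, :poll_error],
      do: [:tinkex, :recovery, event]

:telemetry.attach_many(
  "tinkex-recovery-logger",
  events,
  fn event, _measurements, metadata, _config ->
    Logger.debug("recovery event #{inspect(event)}: #{inspect(metadata)}")
  end,
  nil
)
```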
Safety Notes
- Nothing runs automatically; you must start the monitor/executor and set `policy.enabled: true`.
- Callbacks should be cheap and resilient; errors are swallowed to avoid cascading failures (see the sketch after this list).
- Ensure `config` is available for REST lookups when the monitor/executor are used outside the original process.
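One way to keep callbacks cheap is to hand the real work to another process and return immediately. A sketch (here `notifier` is simply the calling process):

```elixir
# The callback only forwards a message; heavy work (alerting, metrics,
# state updates) happens in whatever process receives it.
notifier = self()

policy =
  Tinkex.Recovery.Policy.new(
    enabled: true,
    on_recovery: fn _old_pid, new_pid, cp ->
      send(notifier, {:recovered, cp.tinker_path, new_pid})
      :ok
    end
  )
```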
Offline Example
Run `mix run examples/recovery_simulated.exs` to see a fully offline flow that seeds a checkpoint, flips a run to `corrupted: true`, and exercises the monitor + executor with stubbed REST/service modules. No network or API key is required.
Live Example with Injected Corruption
Run `mix run examples/recovery_live_injected.exs` to create a training run, save a real checkpoint, inject a single `corrupted: true` poll response (no server-side change), and let the monitor + executor restore from that checkpoint. The example then saves a second checkpoint from the recovered client to prove the restart path. Requires `TINKER_API_KEY` (and optionally `TINKER_BASE_URL`/`TINKER_BASE_MODEL`).