Tinkex.Recovery.Policy (Tinkex v0.4.0)

Configuration for opt-in training run recovery.

Recovery is disabled by default, and the remaining defaults are conservative: three attempts, a 5s base backoff (capped at 60s), a 30s polling interval, and optimizer state restore enabled. The checkpoint strategy defaults to :latest; :best is reserved for future support, and {:specific, path} targets an explicit checkpoint path.
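
To make the backoff defaults concrete, here is a minimal sketch of how a 5s base delay capped at 60s could play out across attempts. The BackoffSketch module is hypothetical and not part of Tinkex, and the doubling rule is an assumption: the documentation above states only the base and the cap, not the growth curve.

```elixir
defmodule BackoffSketch do
  # Hypothetical helper for illustration only; not part of Tinkex.
  # Assumption: the delay doubles on each attempt (the docs state only base and cap).
  @base_ms 5_000
  @cap_ms 60_000

  # Delay before the given 1-based attempt, clamped to the cap.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 1 do
    min(@base_ms * Integer.pow(2, attempt - 1), @cap_ms)
  end

  # Delays for attempts 1..max_attempts (three attempts by default).
  def schedule(max_attempts \\ 3) when is_integer(max_attempts) and max_attempts >= 1 do
    Enum.map(1..max_attempts, &delay_ms/1)
  end
end
```

Under these assumptions, BackoffSketch.schedule(3) gives delays of 5s, 10s, and 20s, so the 60s cap only comes into play if the attempt count is raised to five or more.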

Telemetry events emitted by the recovery pipeline:

[:tinkex, :recovery, :detected] - monitor observed corrupted: true
[:tinkex, :recovery, :started] - executor attempt began
[:tinkex, :recovery, :checkpoint_selected] - checkpoint picked
[:tinkex, :recovery, :client_created] - training client created
[:tinkex, :recovery, :completed] - recovery finished successfully
[:tinkex, :recovery, :failed] - attempt failed (before exhaustion)
[:tinkex, :recovery, :exhausted] - max attempts reached, no recovery
[:tinkex, :recovery, :poll_error] - monitor failed to poll run status (keeps state)
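
The events above can be observed with a standard :telemetry handler. The following is a sketch, assuming the :telemetry dependency is available in the host application. The RecoveryTelemetrySketch module, the handler id, and the choice of log levels are illustrative, and because the measurements and metadata attached to each event are not documented here, the handler only inspects them.

```elixir
defmodule RecoveryTelemetrySketch do
  # Illustrative module; not part of Tinkex.
  require Logger

  # Event names copied from the list above.
  @events [
    [:tinkex, :recovery, :detected],
    [:tinkex, :recovery, :started],
    [:tinkex, :recovery, :checkpoint_selected],
    [:tinkex, :recovery, :client_created],
    [:tinkex, :recovery, :completed],
    [:tinkex, :recovery, :failed],
    [:tinkex, :recovery, :exhausted],
    [:tinkex, :recovery, :poll_error]
  ]

  def events, do: @events

  # Call once, e.g. at application start. Requires the :telemetry package.
  def attach do
    :telemetry.attach_many(
      "recovery-logger-sketch",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Logs every recovery event; payload shapes are assumed unknown, so they are inspected as-is.
  def handle_event([:tinkex, :recovery, stage], measurements, metadata, _config) do
    Logger.log(
      level(stage),
      "recovery #{stage}: #{inspect(measurements)} #{inspect(metadata)}"
    )
  end

  # Assumed severity mapping: failure-like stages log as warnings, the rest as info.
  def level(stage) when stage in [:failed, :exhausted, :poll_error], do: :warning
  def level(_stage), do: :info
end
```

Passing the handler as a remote capture (&__MODULE__.handle_event/4) rather than an anonymous function follows the :telemetry recommendation, since local functions as handlers incur a performance penalty.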

Summary

Types

checkpoint_strategy()

failure_callback()

recovery_callback()

t()

Functions

new(policy)

Build a recovery policy from a struct, keyword list, or map.
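
As an illustration of accepting a struct, keyword list, or map, the sketch below normalizes all three inputs into one struct. PolicySketch is a hypothetical stand-in, not the Tinkex implementation. Its field names are assumptions chosen to mirror the defaults described above, so consult the t() type for the actual fields.

```elixir
defmodule PolicySketch do
  # Hypothetical stand-in for illustration; field names are assumptions,
  # with default values taken from the module description above.
  defstruct enabled: false,
            max_attempts: 3,
            backoff_ms: 5_000,
            max_backoff_ms: 60_000,
            poll_interval_ms: 30_000,
            checkpoint_strategy: :latest,
            restore_optimizer: true

  # An existing struct passes through unchanged. This clause must come first,
  # because a struct also satisfies is_map/1.
  def new(%__MODULE__{} = policy), do: policy

  # Keyword lists and plain maps are merged over the defaults;
  # struct!/2 raises KeyError on unknown keys.
  def new(opts) when is_list(opts) or is_map(opts), do: struct!(__MODULE__, opts)
end
```

With this shape, PolicySketch.new([]) returns the disabled defaults, while PolicySketch.new(enabled: true, checkpoint_strategy: {:specific, "some/path"}) overrides only the named fields. The path here is a placeholder.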