Tinkex.Recovery.Monitor (Tinkex v0.3.4)

Polls training runs for corruption flags and dispatches recovery work.

This GenServer must be started explicitly and configured with a recovery policy (disabled by default), a REST module for polling, and an executor pid.

Telemetry events:

  • [:tinkex, :recovery, :detected] - observed corrupted: true on a run
  • [:tinkex, :recovery, :poll_error] - REST poll failed (metadata includes :error)
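
As a rough sketch of how these events might be consumed, the module below attaches one handler to both event names using `:telemetry.attach_many/4`. It assumes the `:telemetry` package is available in the project. The module name `RecoveryTelemetryLogger` and the handler id are hypothetical, and only the `:error` metadata key on `:poll_error` is taken from the list above; the `:detected` metadata is logged whole because its keys are not documented here.

```elixir
defmodule RecoveryTelemetryLogger do
  # Hypothetical helper module, not part of Tinkex.
  require Logger

  # Event names copied from the module documentation.
  @events [
    [:tinkex, :recovery, :detected],
    [:tinkex, :recovery, :poll_error]
  ]

  def events, do: @events

  # Attaches handle_event/4 to both events. The handler id is an arbitrary
  # string chosen for this sketch; it must be unique per attached handler.
  def attach do
    :telemetry.attach_many(
      "recovery-telemetry-logger",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:tinkex, :recovery, :poll_error], _measurements, metadata, _config) do
    # The docs state that :poll_error metadata includes :error.
    Logger.warning("recovery poll failed: #{inspect(Map.get(metadata, :error))}")
  end

  def handle_event([:tinkex, :recovery, :detected], _measurements, metadata, _config) do
    # Metadata keys for :detected are not documented, so log the whole map.
    Logger.info("corrupted run detected: #{inspect(metadata)}")
  end
end
```

Calling `RecoveryTelemetryLogger.attach()` once at application start is usually enough; `:telemetry.detach/1` with the same handler id removes the handler again.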

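Summary

Functions

child_spec(init_arg)
Returns a specification to start this module under a supervisor.

monitor_run(monitor, run_id, service_pid, metadata \\ %{})
Begin monitoring a training run.

start_link(opts \\ [])
Start the monitor.

stop_monitoring(monitor, run_id)
Stop monitoring a training run.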

Types

option()

@type option() ::
  {:policy, Tinkex.Recovery.Policy.t() | map() | nil}
  | {:config, Tinkex.Config.t()}
  | {:rest_module, module()}
  | {:rest_client_fun,
     (pid() -> {:ok, %{config: Tinkex.Config.t()}} | {:error, term()})}
  | {:service_client_module, module()}
  | {:executor, pid()}
  | {:send_after, (term(), non_neg_integer() -> reference())}
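
A minimal sketch of an option list built from a few of these keys is shown below. `MonitorOptionsSketch` is a hypothetical helper, and the caller supplies the policy and executor pid, since how they are constructed is not covered here. The `:send_after` function matches the `(term(), non_neg_integer() -> reference())` shape; that the monitor calls it from its own process to schedule a poll message to itself is an assumption, not something the type spec states.

```elixir
defmodule MonitorOptionsSketch do
  # Hypothetical helper, not part of Tinkex. Uses only keys from option().
  def build(policy, executor) when is_pid(executor) do
    [
      # A Tinkex.Recovery.Policy.t(), a plain map, or nil, per the type above.
      policy: policy,
      executor: executor,
      # Assumption: invoked inside the monitor process, so self() is the
      # monitor and the message is delivered back to it after delay_ms.
      send_after: fn message, delay_ms ->
        Process.send_after(self(), message, delay_ms)
      end
    ]
  end
end
```

The result could then be passed to `Tinkex.Recovery.Monitor.start_link/1`. In tests, replacing the `:send_after` function with one that records the request instead of scheduling it is one way to keep polling deterministic.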

state()

@type state() :: %{
  policy: Tinkex.Recovery.Policy.t(),
  rest_module: module(),
  rest_client_fun: (pid() ->
                      {:ok, %{config: Tinkex.Config.t()}} | {:error, term()}),
  service_module: module(),
  executor: pid() | nil,
  runs: %{
    optional(String.t()) => %{
      service_pid: pid(),
      config: Tinkex.Config.t(),
      metadata: map()
    }
  },
  poll_ref: reference() | nil,
  send_after: (term(), non_neg_integer() -> reference())
}

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.
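
For illustration, a supervisor could list the monitor as a child with the standard `{module, arg}` tuple, which `Supervisor.init/2` expands by calling `child_spec/1`. The supervisor module name below is hypothetical, and `monitor_opts` stands for whatever `option()` list the application builds.

```elixir
defmodule RecoverySupervisorSketch do
  # Hypothetical supervisor, not part of Tinkex.
  use Supervisor

  def start_link(monitor_opts) do
    Supervisor.start_link(__MODULE__, monitor_opts, name: __MODULE__)
  end

  @impl true
  def init(monitor_opts) do
    children = [
      # Expands to Tinkex.Recovery.Monitor.child_spec(monitor_opts).
      {Tinkex.Recovery.Monitor, monitor_opts}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```

Because `monitor_run/4` and `stop_monitoring/2` take the monitor pid, callers of a supervised monitor would need to look the pid up, for example with `Supervisor.which_children/1`.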

monitor_run(monitor, run_id, service_pid, metadata \\ %{})

@spec monitor_run(pid(), String.t(), pid(), map()) :: :ok | {:error, term()}

Begin monitoring a training run.

service_pid is the Tinkex.ServiceClient pid used to create recovery clients.
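
The sketch below shows one way to handle both return shapes from the spec, `:ok` and `{:error, term()}`. `MonitorRunSketch` and the metadata map are hypothetical; the `monitor_module` argument exists only so the wrapper can be exercised with a stand-in module and defaults to `Tinkex.Recovery.Monitor`.

```elixir
defmodule MonitorRunSketch do
  # Hypothetical wrapper, not part of Tinkex.
  require Logger

  # `monitor` is the pid returned by start_link/1 and `service_pid` is a
  # Tinkex.ServiceClient pid; both are obtained elsewhere.
  def watch(monitor, run_id, service_pid, monitor_module \\ Tinkex.Recovery.Monitor) do
    # Metadata keys are made up for illustration; the spec only requires a map.
    metadata = %{source: "example"}

    case monitor_module.monitor_run(monitor, run_id, service_pid, metadata) do
      :ok ->
        :ok

      {:error, reason} = error ->
        # The possible error reasons are not documented, so log and pass through.
        Logger.warning("could not monitor run #{run_id}: #{inspect(reason)}")
        error
    end
  end
end
```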

start_link(opts \\ [])

@spec start_link([option()]) :: GenServer.on_start()

Start the monitor. opts is a list of option() values.

stop_monitoring(monitor, run_id)

@spec stop_monitoring(pid(), String.t()) :: :ok

Stop monitoring a training run.
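
One possible pattern, sketched under assumptions, is to pair `monitor_run/4` with `stop_monitoring/2` around a unit of work so the run is always deregistered, even if the work raises. `MonitoredWorkSketch` is a hypothetical helper; the `monitor_module` argument defaults to `Tinkex.Recovery.Monitor` and exists only so the helper can be tried with a stand-in.

```elixir
defmodule MonitoredWorkSketch do
  # Hypothetical helper, not part of Tinkex.
  def with_monitoring(monitor, run_id, service_pid, fun, monitor_module \\ Tinkex.Recovery.Monitor)
      when is_function(fun, 0) do
    # If registration fails, the {:error, reason} tuple is returned as-is
    # and fun is never called.
    with :ok <- monitor_module.monitor_run(monitor, run_id, service_pid, %{}) do
      try do
        fun.()
      after
        # stop_monitoring/2 is specified to return :ok.
        :ok = monitor_module.stop_monitoring(monitor, run_id)
      end
    end
  end
end
```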