Lifeline Plugin

🌟 This plugin is available through Oban.Pro

The Lifeline plugin records queue activity as heartbeats and periodically rescues orphaned jobs, i.e. jobs stuck in the executing state because their node shut down before they could finish. Without the Lifeline plugin, you need to rescue jobs stuck in the executing state manually.

Lifeline must be included for Oban.Web to run properly.

Using and Configuring

To use the Lifeline plugin, add the module to your list of Oban plugins in config.exs:

config :my_app, Oban,
  plugins: [Oban.Pro.Plugins.Lifeline]
  # ...

No configuration is necessary. By default, the plugin records heartbeats every second, prunes old heartbeats every 5 minutes, and rescues orphaned jobs every minute. If necessary, you can configure any or all of those intervals (in milliseconds):

plugins: [{
  Oban.Pro.Plugins.Lifeline,
  delete_interval: :timer.minutes(10),
  record_interval: :timer.seconds(10),
  rescue_interval: :timer.minutes(5)
}]

This configuration records 10x fewer heartbeats per minute, retains them twice as long, and attempts rescues 5x less often. It reduces database activity at the expense of fidelity and recovery speed.
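The ratios above follow directly from dividing each configured interval by its default. The sketch below reproduces that arithmetic; LifelineIntervals, ratios/1, and record_cycles_per_minute/1 are hypothetical helpers for illustration, not part of Oban.Pro, and the defaults are the ones stated in this section.

```elixir
defmodule LifelineIntervals do
  # Hypothetical helper, not part of Oban.Pro. Defaults (in ms) are the ones
  # documented for Lifeline: record every second, delete every 5 minutes,
  # rescue every minute.
  @defaults [
    record_interval: :timer.seconds(1),
    delete_interval: :timer.minutes(5),
    rescue_interval: :timer.minutes(1)
  ]

  # How many times longer each configured interval is than its default
  # (integer division, so non-multiples are truncated).
  def ratios(opts) do
    Map.new(@defaults, fn {key, default} ->
      {key, div(Keyword.get(opts, key, default), default)}
    end)
  end

  # Number of heartbeat recording rounds per minute for a record interval in ms.
  def record_cycles_per_minute(record_interval) do
    div(:timer.minutes(1), record_interval)
  end
end
```

For the configuration above, ratios/1 yields 10 for record_interval, 2 for delete_interval, and 5 for rescue_interval, and recording drops from 60 to 6 rounds per minute.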

Note that rescuing orphans relies on recent heartbeats. Be sure that delete_interval is always longer than rescue_interval; otherwise, every executing job will look orphaned.
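A small startup check can enforce that rule before the plugin options reach Oban. LifelineConfigCheck and validate/1 are hypothetical names, not Oban.Pro API; the sketch assumes the intervals are passed as the same keyword options shown earlier, in milliseconds, and falls back to the documented defaults when an option is omitted.

```elixir
defmodule LifelineConfigCheck do
  # Hypothetical helper, not part of Oban.Pro. Falls back to the documented
  # defaults: delete every 5 minutes, rescue every minute (values in ms).
  @default_delete :timer.minutes(5)
  @default_rescue :timer.minutes(1)

  # Returns :ok when heartbeats outlive the rescue interval, otherwise an error tuple.
  def validate(opts) when is_list(opts) do
    delete_ms = Keyword.get(opts, :delete_interval, @default_delete)
    rescue_ms = Keyword.get(opts, :rescue_interval, @default_rescue)

    if delete_ms > rescue_ms do
      :ok
    else
      {:error,
       "delete_interval (#{delete_ms}ms) must be longer than rescue_interval (#{rescue_ms}ms)"}
    end
  end
end
```

Calling it with the options from the earlier example returns :ok, while something like validate(delete_interval: :timer.seconds(30)) returns an error because heartbeats would be pruned before the one-minute rescue runs.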

Implementation Notes

Some additional notes about heartbeat records and how you can expect Lifeline to operate:

  • Orphan rescuing is guaranteed to only rescue jobs that belong to dead queue processes or nodes.

  • Heartbeat records are written to the oban_beats table very efficiently, as a single batch.

  • By default, heartbeat records are retained for only five minutes. This prevents table bloat and avoids exhausting the row limit on free-tier databases.

  • Only a single node will delete heartbeats or rescue orphans at any given time, which prevents potential deadlocks and churn.
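One way to picture heartbeat-based orphan detection is the simplified model below. It is a hypothetical sketch, not Lifeline's implementation or the oban_beats schema: jobs and heartbeats are plain maps with assumed keys (:node, :queue, and an :at timestamp in ms), and window_ms stands in for how recent a heartbeat must be for its node and queue to count as alive. A job is treated as orphaned when no recent heartbeat matches its node and queue.

```elixir
defmodule OrphanSketch do
  # Simplified, hypothetical model of heartbeat-based orphan detection.
  # Jobs and heartbeats are plain maps; Lifeline itself works against the database.

  # Returns the executing jobs whose {node, queue} has no heartbeat
  # recorded within `window_ms` of `now_ms`.
  def orphans(executing_jobs, heartbeats, now_ms, window_ms) do
    live =
      heartbeats
      |> Enum.filter(fn beat -> now_ms - beat.at <= window_ms end)
      |> MapSet.new(fn beat -> {beat.node, beat.queue} end)

    Enum.reject(executing_jobs, fn job -> MapSet.member?(live, {job.node, job.queue}) end)
  end
end
```

Passing an empty heartbeat list, as would happen in this model if heartbeats were pruned before a rescue ran, marks every executing job as orphaned, which is the failure mode the delete_interval note warns about.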

Instrumenting with Telemetry

The Lifeline module uses Oban.Telemetry.span/3 to emit the following telemetry events:

  • [:oban, :lifeline, :start]: emitted when Lifeline starts recording or pruning heartbeats, or attempts to rescue orphans
  • [:oban, :lifeline, :stop]: emitted when any activity completes successfully
  • [:oban, :lifeline, :exception]: emitted when an exception occurs during any activity

Each Lifeline event includes an :action key in its metadata, with a value of :delete, :record, or :rescue.
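A handler can attach to all three events and branch on :action. The sketch below logs each event; the module name and the "lifeline-logger" handler id are hypothetical, and it assumes stop and exception events carry a :duration measurement in native time units, which is the usual convention for span-style events. :telemetry.attach_many/4 comes from the :telemetry library, which Oban already depends on.

```elixir
defmodule LifelineTelemetryLogger do
  # Hypothetical handler module; the name and handler id are illustrative.
  require Logger

  @events [
    [:oban, :lifeline, :start],
    [:oban, :lifeline, :stop],
    [:oban, :lifeline, :exception]
  ]

  # Attach once at application start.
  def attach do
    :telemetry.attach_many("lifeline-logger", @events, &__MODULE__.handle_event/4, nil)
  end

  # Called by :telemetry for every Lifeline event.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(format(event, measurements, metadata))
  end

  # Builds a log line from the event stage and the :action metadata.
  def format([:oban, :lifeline, stage], measurements, metadata) do
    action = Map.get(metadata, :action, :unknown)

    case stage do
      :start -> "lifeline #{action} started"
      :stop -> "lifeline #{action} finished in #{duration_ms(measurements)}ms"
      :exception -> "lifeline #{action} failed after #{duration_ms(measurements)}ms"
    end
  end

  # Assumes :duration is reported in native time units.
  defp duration_ms(measurements) do
    measurements
    |> Map.get(:duration, 0)
    |> System.convert_time_unit(:native, :millisecond)
  end
end
```

Calling LifelineTelemetryLogger.attach() from your application's start/2 callback is one reasonable place to wire it up; keeping format/3 separate from handle_event/4 makes the log text easy to test without emitting real events.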