Troubleshooting

Jobs Stuck Executing Forever

During deployment or unexpected node restarts, jobs may be left in an executing state indefinitely. We call these jobs "orphans", but orphaning isn't a bad thing. It means that the job wasn't lost and can be retried when the system comes back online.

There are two mechanisms to mitigate orphans:

  1. Increase the shutdown_grace_period to give the system more time to finish executing jobs before shutdown. During shutdown each queue stops fetching new jobs, but jobs that are already executing have up to the grace period to complete. The default value is 15000ms, or 15 seconds.

  2. Use the Lifeline plugin to automatically move orphaned jobs back to the available state so they can run again, as in the configuration below.

config :my_app, Oban,
    plugins: [Oban.Plugins.Lifeline],
    shutdown_grace_period: :timer.seconds(60),
    ...
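
If your jobs legitimately run for a long time, you can also tune Lifeline's rescue_after option so the rescue window exceeds your longest-running jobs and healthy work isn't rescued prematurely. A sketch with an arbitrary 30 minute window:

config :my_app, Oban,
  plugins: [{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(30)}],
  shutdown_grace_period: :timer.seconds(60),
  ...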

Jobs or Plugins aren't Running

Sometimes Cron or Pruner plugins appear to stop working unexpectedly. Typically, this happens in multi-node setups where "web" nodes only enqueue jobs while "worker" nodes are configured to run queues and plugins. Most plugins require leadership to function, so when a "web" node becomes leader the plugins go dormant.

The solution is to disable leadership with peer: false on any node that doesn't run plugins:

config :my_app, Oban, peer: false, ...
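
For instance, a split setup might disable queues, plugins, and leadership entirely on "web" nodes while leaving "worker" nodes with the defaults. A sketch, assuming a repo named MyApp.Repo and per-node config handling of your own:

# "web" nodes: insert jobs only; no queues, plugins, or leadership
config :my_app, Oban,
  repo: MyApp.Repo,
  peer: false,
  queues: false,
  plugins: false

# "worker" nodes: run queues and plugins with the default peer
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [Oban.Plugins.Pruner]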

Cron @reboot Not Running in Development

The @reboot cron expression depends on leadership to prevent duplicate job insertion across nodes. In development, when you shut down your application (e.g., by exiting IEx), the node may not cleanly relinquish leadership in the database. This creates a delay before the node can become leader again on the next startup, making it appear as though @reboot jobs aren't working.
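
For reference, @reboot entries are declared through the Cron plugin like any other crontab entry; in this sketch MyApp.BootWorker is a placeholder worker:

config :my_app, Oban,
  plugins: [
    {Oban.Plugins.Cron, crontab: [{"@reboot", MyApp.BootWorker}]}
  ],
  ...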

Solutions

  1. Wait for leadership - The default peer will eventually assume leadership, typically within 30 seconds.

  2. Use the Global peer in development - The Global peer handles restarts more gracefully:

    # In config/dev.exs
    config :my_app, Oban,
      peer: Oban.Peers.Global,
      ...

  3. Clear leadership manually - If needed, you can clear the oban_peers table in your database to force immediate leadership.
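
A minimal way to clear leadership from IEx, assuming your Ecto repo is MyApp.Repo:

# Development only: remove stale leadership records so a new leader
# can be elected immediately on the next startup
MyApp.Repo.query!("DELETE FROM oban_peers")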

Keep the default peer in production for better reliability and persistence across restarts.

No Notifications with PgBouncer

Using PgBouncer's "Transaction Pooling" setup disables all of PostgreSQL's LISTEN and NOTIFY activity. Some functionality, such as triggering job execution, scaling queues, and canceling jobs, relies on those notifications.

There are several options available to ensure functional notifications:

  1. Switch to the Oban.Notifiers.PG notifier (see the example after this list). This alternative notifier relies on Distributed Erlang and exchanges messages within a cluster. The only drawback to the PG notifier is that it doesn't trigger job insertion events.

  2. Switch PgBouncer to "Session Pooling". Session pooling isn't as resource efficient as transaction pooling, but it retains all Postgres functionality.

  3. Use a dedicated Repo that connects directly to the database, bypassing PgBouncer.
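
For option 1, switching notifiers is a single configuration change. A sketch, assuming your nodes are already connected via Distributed Erlang:

config :my_app, Oban,
  notifier: Oban.Notifiers.PG,
  ...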

If none of those options work, Oban's job staging will switch to local polling mode to ensure that queues keep processing jobs.

Unexpectedly Re-running All Migrations

Without a version comment on the oban_jobs table, all of Oban's migrations will rerun from the beginning. This can happen when comments are stripped while restoring from a backup, most commonly during a transition from one database to another.

The fix is to set the latest migrated version as a comment. To start, search through your previous migrations and find the last time you ran an Oban migration. Once you've found the latest version, e.g. version: 10, set it as a comment on the oban_jobs table:

COMMENT ON TABLE public.oban_jobs IS '10';

Once the comment is in place only the migrations from that version onward will run.
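
If you'd rather keep the fix in your migration history than run raw SQL by hand, the same statement can be wrapped in an Ecto migration. A sketch, assuming version 10 as above; the module name is hypothetical:

defmodule MyApp.Repo.Migrations.RestoreObanVersionComment do
  use Ecto.Migration

  def up do
    # Record the latest previously-applied Oban migration version
    execute "COMMENT ON TABLE public.oban_jobs IS '10'"
  end

  def down do
    # Removing the comment reverts to the pre-fix state
    execute "COMMENT ON TABLE public.oban_jobs IS NULL"
  end
end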