BullMQ.StalledChecker (BullMQ v1.2.7)

Detects and handles stalled jobs.

A job is considered "stalled" when a worker takes it but then does neither of the following:

Complete it (move it to completed or failed)
Renew its lock before the lock expires

This typically happens when:

The worker process crashes
The machine running the worker loses power
Network issues prevent lock renewal
The job processor blocks without yielding

Detection Algorithm

BullMQ detects stalled jobs in two phases:

Mark Phase: Jobs without valid locks are moved to a "stalled" set
Recover Phase: On the next check, jobs still in the stalled set are either requeued or moved to failed (based on max_stalled_count)

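This two-phase approach prevents false positives from timing issues: a job whose lock renewal is merely delayed has until the next check to renew its lock or finish before it is treated as stalled.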
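The mark and recover phases can be illustrated with a small in-memory model. This is only a sketch of the idea, not BullMQ's implementation: the StalledSketch module, its check/1 function, and the map keys active, locked, and stalled are hypothetical names, and the real checker works against Redis rather than against Elixir sets.

```elixir
defmodule StalledSketch do
  # Hypothetical toy model of two-phase stalled detection.
  # State is a map with three MapSets of job ids:
  #   :active  - jobs currently taken by a worker
  #   :locked  - jobs whose lock is still valid
  #   :stalled - jobs marked as candidates by the previous check

  # Runs one check and returns {recovered_ids, new_state}.
  def check(state) do
    # Recover phase: jobs marked on the previous check that are
    # still active and still have no valid lock.
    recovered =
      state.stalled
      |> MapSet.intersection(state.active)
      |> MapSet.difference(state.locked)

    remaining = MapSet.difference(state.active, recovered)

    # Mark phase: remaining active jobs without a valid lock
    # become candidates for the next check.
    marked = MapSet.difference(remaining, state.locked)

    {Enum.sort(recovered), %{state | active: remaining, stalled: marked}}
  end
end

state = %{
  active: MapSet.new(["a", "b"]),
  locked: MapSet.new(["a"]),
  stalled: MapSet.new()
}

# First check: "b" has no lock, so it is only marked.
{[], state} = StalledSketch.check(state)

# Second check: "b" is still unlocked, so it is recovered.
{["b"], _state} = StalledSketch.check(state)
```

In this model, a job that regains a valid lock between the two checks is never recovered, which is the false positive the two-phase design avoids.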

Configuration

The stalled checker is configured on the worker. The defaults are sensible and should normally not be changed:

{BullMQ.Worker,
  queue: "emails",
  connection: :redis,
  processor: &MyApp.send_email/1,
  lock_duration: 30_000,      # Default: 30s - normally don't change
  stalled_interval: 30_000,   # Default: 30s - normally don't change
  max_stalled_count: 1        # Default: 1 - see note below
}

About max_stalled_count

The default max_stalled_count is 1 because stalled jobs are considered a rare occurrence. If a job stalls more than once, it typically indicates a more serious issue such as:

Repeated worker crashes on specific job data
Resource exhaustion (memory, CPU)
External service failures
Bugs in job processing logic

Increasing this value is generally not recommended. Instead, investigate why jobs are stalling and fix the underlying issue.
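The effect of max_stalled_count can be sketched as a simple decision. This assumes a job is moved to failed once its stall count exceeds max_stalled_count, which matches the "stalls more than once" wording above; the exact comparison used internally is not stated here. StallPolicy and on_stall/2 are hypothetical names used only for illustration.

```elixir
defmodule StallPolicy do
  # Hypothetical helper, not part of BullMQ.
  # stalled_count is how many times the job has stalled,
  # including the stall that was just detected.
  def on_stall(stalled_count, max_stalled_count \\ 1)
      when is_integer(stalled_count) and stalled_count >= 1 do
    # Assumption: the job fails only when the count exceeds the limit.
    if stalled_count > max_stalled_count, do: :fail, else: :requeue
  end
end

StallPolicy.on_stall(1)    # => :requeue (first stall, default limit of 1)
StallPolicy.on_stall(2)    # => :fail (second stall exceeds the limit)
StallPolicy.on_stall(2, 2) # => :requeue (a higher limit tolerates it)
```

Under this reading, the default of 1 gives each job a single recovery attempt before it is treated as failed.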

About lock_duration

Increase lock_duration only if a job can legitimately go longer than 30 seconds between lock renewals. Renewals happen automatically, so jobs that process quickly don't need a longer lock duration.

Manual Checking

You can also run the stalled check manually:

BullMQ.StalledChecker.check(:redis, "emails")

Summary

Types

opts()

Functions

check(connection, queue, opts \\ [])

Manually triggers a stalled jobs check.

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

job_stalled?(connection, queue, job_id, opts \\ [])

Checks if a specific job is stalled.

start_link(opts)

Starts the stalled job checker.