BullMQ.StalledChecker (BullMQ v1.0.1)

Detects and handles stalled jobs.

A job is considered "stalled" when a worker takes a job but fails to:

  • Complete it (move to completed/failed)
  • Renew its lock before the lock expires

This typically happens when:

  • The worker process crashes
  • The machine running the worker loses power
  • Network issues prevent lock renewal
  • The job processor blocks without yielding

Detection Algorithm

BullMQ detects stalled jobs in two phases:

  1. Mark Phase: Jobs without valid locks are moved to a "stalled" set
  2. Recover Phase: On the next check, jobs still in the stalled set are either requeued or moved to failed (based on max_stalled_count)

This two-phase approach prevents false positives caused by timing issues, such as a lock renewal that arrives slightly late.
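The two phases can be illustrated with a small in-memory model. This is only a sketch: the StalledSketch module, its state shape, and the rule "fail once the stall count exceeds max_stalled_count" are assumptions made for illustration, and the real checker operates on Redis data rather than on Elixir sets.

```elixir
defmodule StalledSketch do
  @moduledoc false
  # Hypothetical in-memory model of the mark/recover cycle.
  # This is not BullMQ's implementation.

  # active: ids of jobs currently taken by workers
  # locked: ids of jobs that currently hold a valid lock
  def new(active, locked) do
    %{
      active: MapSet.new(active),
      locked: MapSet.new(locked),
      stalled: MapSet.new(),
      counts: %{}
    }
  end

  # Runs one check and returns {new_state, %{recovered: ids, failed: ids}}.
  def check(state, max_stalled_count) do
    # Recover phase: jobs marked on the previous check that still lack a lock.
    still_stalled =
      state.stalled
      |> Enum.reject(&MapSet.member?(state.locked, &1))
      |> Enum.sort()

    # Count this stall for every job being recovered.
    counts =
      Enum.reduce(still_stalled, state.counts, fn id, acc ->
        Map.update(acc, id, 1, &(&1 + 1))
      end)

    # Assumption: a job fails once it has stalled more than max_stalled_count times.
    {failed, requeued} =
      Enum.split_with(still_stalled, fn id -> counts[id] > max_stalled_count end)

    # Recovered jobs leave the active set either way.
    active = MapSet.difference(state.active, MapSet.new(still_stalled))

    # Mark phase: remaining active jobs without a valid lock become candidates.
    marked = MapSet.difference(active, state.locked)

    new_state = %{state | active: active, stalled: marked, counts: counts}
    {new_state, %{recovered: requeued, failed: failed}}
  end
end
```

In this model, a job that loses its lock is only marked on the first call to check/2 and is requeued on the second. If it is picked up again and stalls a second time with max_stalled_count set to 1, it is reported under failed instead.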

Configuration

The stalled checker is configured on the worker. The defaults are sensible and should normally not be changed:

{BullMQ.Worker,
  queue: "emails",
  connection: :redis,
  processor: &MyApp.send_email/1,
  lock_duration: 30_000,      # Default: 30s - normally don't change
  stalled_interval: 30_000,   # Default: 30s - normally don't change
  max_stalled_count: 1        # Default: 1 - see note below
}

About max_stalled_count

The default max_stalled_count is 1 because stalled jobs are considered a rare occurrence. If a job stalls more than once, it typically indicates a more serious issue such as:

  • Repeated worker crashes on specific job data
  • Resource exhaustion (memory, CPU)
  • External service failures
  • Bugs in job processing logic

Increasing this value is generally not recommended. Instead, investigate why jobs are stalling and fix the underlying issue.

About lock_duration

Increase lock_duration only if you have jobs that legitimately go longer than 30 seconds between lock renewals (renewals happen automatically). Jobs that process quickly don't need a longer lock duration.

Manual Checking

You can also run the stalled check manually:

BullMQ.StalledChecker.check(:redis, "emails")

Summary

Functions

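check(connection, queue, opts \\ [])

Manually triggers a stalled jobs check.

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

job_stalled?(connection, queue, job_id, opts \\ [])

Checks if a specific job is stalled.

start_link(opts)

Starts the stalled job checker.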

Types

opts()

@type opts() :: [
  connection: atom(),
  queue: String.t(),
  prefix: String.t(),
  stalled_interval: pos_integer(),
  max_stalled_count: pos_integer()
]

Functions

check(connection, queue, opts \\ [])

@spec check(atom(), String.t(), Keyword.t()) :: {:ok, map()} | {:error, term()}

Manually triggers a stalled jobs check.

Returns {:ok, %{recovered: count, failed: count}} on success, or {:error, reason} if the check could not be completed.
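Both result shapes from the spec can be handled with pattern matching. The helper below is a minimal sketch; MyApp.StalledReport and its describe/1 function are hypothetical names, not part of BullMQ.

```elixir
defmodule MyApp.StalledReport do
  @moduledoc false
  # Hypothetical helper: turns the result of a stalled check into a log-friendly string.

  def describe({:ok, %{recovered: recovered, failed: failed}}) do
    "recovered=#{recovered} failed=#{failed}"
  end

  def describe({:error, reason}) do
    "stalled check failed: #{inspect(reason)}"
  end
end
```

With a running connection, the result of BullMQ.StalledChecker.check(:redis, "emails") can be passed straight to MyApp.StalledReport.describe/1.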

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

job_stalled?(connection, queue, job_id, opts \\ [])

@spec job_stalled?(atom(), String.t(), String.t(), Keyword.t()) ::
  {:ok, boolean()} | {:error, term()}

Checks if a specific job is stalled.
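One possible use is filtering a list of job ids down to those currently reported as stalled. The sketch below assumes a hypothetical MyApp.JobProbe module. The checker argument defaults to the documented job_stalled?/3 arity and can be replaced with a stub, so the function can be exercised without Redis.

```elixir
defmodule MyApp.JobProbe do
  @moduledoc false
  # Hypothetical helper, not part of BullMQ.

  # Returns the subset of job_ids that the checker reports as stalled.
  def stalled_ids(
        connection,
        queue,
        job_ids,
        checker \\ &BullMQ.StalledChecker.job_stalled?/3
      ) do
    Enum.filter(job_ids, fn job_id ->
      case checker.(connection, queue, job_id) do
        {:ok, true} -> true
        # In this sketch, "not stalled" and lookup errors are both skipped.
        _other -> false
      end
    end)
  end
end
```

Treating errors as "not stalled" keeps the sketch short. Real code would probably want to surface {:error, reason} to the caller instead of dropping it.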

start_link(opts)

@spec start_link(opts()) :: GenServer.on_start()

Starts the stalled job checker.

Options

  • :connection - Required. Redis connection name
  • :queue - Required. Queue name
  • :prefix - Key prefix (default: "bull")
  • :stalled_interval - Check interval in ms (default: 30_000)
  • :max_stalled_count - Max stall count before failing (default: 1)
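Since the module exposes child_spec/1, the checker can presumably be placed in a supervision tree using these options. The fragment below is a sketch that assumes a Redis connection registered as :redis is started elsewhere. The values shown are the documented defaults.

```elixir
# Sketch only: assumes the :redis connection is already running.
children = [
  {BullMQ.StalledChecker,
   connection: :redis,
   queue: "emails",
   prefix: "bull",
   stalled_interval: 30_000,
   max_stalled_count: 1}
]

Supervisor.start_link(children, strategy: :one_for_one)
```

Because the stalled checker is normally configured on the worker, a standalone entry like this is only worth considering when the checker needs to run on its own.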