BullMQ.StalledChecker (BullMQ v1.0.1)
Detects and handles stalled jobs.
A job is considered "stalled" when a worker takes it but then neither:
- Completes it (moves it to completed/failed), nor
- Renews its lock before the lock expires
This typically happens when:
- The worker process crashes
- The machine running the worker loses power
- Network issues prevent lock renewal
- The job processor blocks without yielding
Detection Algorithm
BullMQ detects stalled jobs in two phases:
- Mark Phase: Jobs without valid locks are moved to a "stalled" set
- Recover Phase: On the next check, jobs still in the stalled set are either requeued or moved to failed (based on max_stalled_count)
This two-phase approach prevents false positives from timing issues.
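The two phases can be modelled in memory to see why a job must miss two consecutive checks before it is touched. The following is only a sketch: the module name `StalledSketch`, its fields, and the rule "fail once the stall count exceeds max_stalled_count" are assumptions made for illustration, and the real checker works against Redis rather than on Elixir data structures.

```elixir
defmodule StalledSketch do
  # Hypothetical in-memory model, not the library's implementation.
  defstruct active: MapSet.new(),
            locked: MapSet.new(),
            stalled: MapSet.new(),
            stall_counts: %{}

  @doc "Builds a state from lists of active job ids and ids holding a valid lock."
  def new(active, locked) do
    %__MODULE__{active: MapSet.new(active), locked: MapSet.new(locked)}
  end

  @doc "Runs one check: recover jobs marked on the previous check, then mark new ones."
  def check(%__MODULE__{} = state, max_stalled_count \\ 1) do
    # Recover phase: jobs marked last time that are still active and still unlocked.
    to_recover =
      state.stalled
      |> MapSet.intersection(state.active)
      |> MapSet.difference(state.locked)
      |> Enum.sort()

    counts =
      Enum.reduce(to_recover, state.stall_counts, fn id, acc ->
        Map.update(acc, id, 1, &(&1 + 1))
      end)

    # Assumed rule: a job is failed once its stall count exceeds the limit.
    {failed, recovered} = Enum.split_with(to_recover, &(counts[&1] > max_stalled_count))

    # Requeued and failed jobs both leave the active set.
    active = MapSet.difference(state.active, MapSet.new(to_recover))

    # Mark phase: remaining active jobs without a valid lock become candidates.
    marked = MapSet.difference(active, state.locked)

    new_state = %{state | active: active, stalled: marked, stall_counts: counts}
    {new_state, %{recovered: recovered, failed: failed}}
  end
end
```

In this model, a job that renews its lock between two checks is marked on the first check but is left alone on the second, which is the false positive the two-phase design avoids.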
Configuration
The stalled checker is configured on the worker. The defaults are sensible and should normally not be changed:
{BullMQ.Worker,
 queue: "emails",
 connection: :redis,
 processor: &MyApp.send_email/1,
 lock_duration: 30_000,      # Default: 30s - normally don't change
 stalled_interval: 30_000,   # Default: 30s - normally don't change
 max_stalled_count: 1        # Default: 1 - see note below
}
About max_stalled_count
The default max_stalled_count is 1 because stalled jobs are considered a rare
occurrence. If a job stalls more than once, it typically indicates a more serious
issue such as:
- Repeated worker crashes on specific job data
- Resource exhaustion (memory, CPU)
- External service failures
- Bugs in job processing logic
Increasing this value is generally not recommended. Instead, investigate why jobs are stalling and fix the underlying issue.
About lock_duration
The lock_duration should only be increased if you have jobs that legitimately
take longer than 30 seconds between lock renewals (which happen automatically).
Jobs that process quickly don't need longer lock durations.
Manual Checking
You can also run the stalled check manually:
BullMQ.StalledChecker.check(:redis, "emails")
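A manual check is most useful when its result is reported somewhere. The sketch below formats the documented return value, `{:ok, %{recovered: count, failed: count}}`, into a log line. The module `StalledReport` and its `summarize/1` function are hypothetical helpers, and the fallback clause exists only because no other return shape is documented here.

```elixir
defmodule StalledReport do
  @doc "Turns the documented check result into a log-friendly string."
  def summarize({:ok, %{recovered: recovered, failed: failed}}) do
    "stalled check: #{recovered} recovered, #{failed} failed"
  end

  # Only the {:ok, ...} shape is documented; anything else is surfaced as-is.
  def summarize(other) do
    "stalled check returned unexpected result: #{inspect(other)}"
  end
end

# Usage against a live queue (needs a running Redis connection named :redis):
#
#   :redis
#   |> BullMQ.StalledChecker.check("emails")
#   |> StalledReport.summarize()
#   |> IO.puts()
```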
Types
@type opts() :: [
  connection: atom(),
  queue: String.t(),
  prefix: String.t(),
  stalled_interval: pos_integer(),
  max_stalled_count: pos_integer()
]
Functions
check
Manually triggers a stalled jobs check.
Returns {:ok, %{recovered: count, failed: count}}.
child_spec
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec job_stalled?(atom(), String.t(), String.t(), Keyword.t()) :: {:ok, boolean()} | {:error, term()}
Checks if a specific job is stalled.
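The spec gives the argument types but not their meaning. The sketch below assumes the order is connection name, queue name, job ID, and options, which is an inference from the types and not confirmed by this page. `StalledProbe` and the atoms it returns are hypothetical; the checker function is injectable so the result handling can be exercised without Redis.

```elixir
defmodule StalledProbe do
  @doc """
  Maps the documented {:ok, boolean()} | {:error, term()} result to an atom.
  The argument order passed to `checker` is an assumption.
  """
  def status(job_id, checker \\ &BullMQ.StalledChecker.job_stalled?/4) do
    case checker.(:redis, "emails", job_id, []) do
      {:ok, true} -> :stalled
      {:ok, false} -> :not_stalled
      {:error, reason} -> {:unknown, reason}
    end
  end
end
```

Handling `{:error, reason}` explicitly matters here: a failed lookup says nothing about whether the job is stalled, so it should not be folded into either boolean outcome.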
@spec start_link(opts()) :: GenServer.on_start()
Starts the stalled job checker.
Options
- :connection - Required. Redis connection name
- :queue - Required. Queue name
- :prefix - Key prefix (default: "bull")
- :stalled_interval - Check interval in ms (default: 30_000)
- :max_stalled_count - Max stall count before failing (default: 1)
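Because the module provides a child specification, it can in principle be placed under a supervisor with these options. The following is a minimal sketch for running the checker as a standalone process; `MyApp.StalledSupervisor` is a hypothetical module name, and since the checker is normally configured through the worker, a separate instance like this may not be needed in a typical setup.

```elixir
defmodule MyApp.StalledSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  # Kept as its own function so the child list can be inspected without starting it.
  def children do
    [
      {BullMQ.StalledChecker,
       connection: :redis,
       queue: "emails",
       prefix: "bull",
       stalled_interval: 30_000,
       max_stalled_count: 1}
    ]
  end

  @impl true
  def init(_arg) do
    Supervisor.init(children(), strategy: :one_for_one)
  end
end
```

The values shown for :prefix, :stalled_interval, and :max_stalled_count are the documented defaults, written out only to make the option names visible.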