BullMQ in Production: Lessons Learned
What breaks when your queue actually gets used. Idempotency, poison pills, Redis memory blowups, and the graceful-shutdown dance no tutorial mentions.
BullMQ is the nicest background job library I've used in Node. But "nicest" isn't "safe." Here's what went wrong when we put it in front of real traffic, and what I'd configure differently next time.
1. Jobs must be idempotent, for real
"Exactly-once" does not exist. A worker can crash after doing the work and before acking. Redis can fail over. You will re-run jobs. Design every job so that running it twice is indistinguishable from running it once. This usually means writing a deterministic jobId derived from the input and checking state before you do the side effect.
2. Poison pills eat your queue
One job that always fails will retry, retry, retry, stealing worker slots from everything else. Always set attempts plus a backoff, and route terminal failures to a dead-letter queue you actually monitor.
```javascript
await queue.add('email', payload, {
  attempts: 5,
  backoff: { type: 'exponential', delay: 5000 },
  removeOnFail: 1000,
});
```

3. Completed jobs will eat your Redis
By default, BullMQ keeps finished jobs forever. Your Redis memory climbs, your BGSAVE gets slow, and eventually you get paged on a weekend. Set removeOnComplete to a small rolling window; keeping the last 1000 is usually plenty for debugging.
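One way to enforce this once, instead of on every add() call, is defaultJobOptions on the Queue itself. A sketch, assuming a local Redis; tune the counts to your own debugging needs:

```javascript
const { Queue } = require('bullmq');

const emailQueue = new Queue('email', {
  connection: { host: 'localhost', port: 6379 },
  defaultJobOptions: {
    removeOnComplete: 1000, // keep only the last 1000 finished jobs
    removeOnFail: 5000,     // keep more failures; they are what you debug
  },
});
```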
4. Graceful shutdown is a dance
On SIGTERM you want workers to finish their current job but accept no new ones. worker.close() does this, but your shutdown handler must await it so you exit only after in-flight jobs settle, not merely after the call is issued. Wire this into your orchestrator's grace period.
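The shape of that dance, sketched with a stand-in for a BullMQ Worker (makeWorker is a mock, not the library API; the important part is that close() refuses new work, then resolves only after the in-flight job settles):

```javascript
// Mock worker: close() stops intake, then waits for the current job.
function makeWorker() {
  let inflight = Promise.resolve();
  let closing = false;
  return {
    // Simulate picking up a job that takes `ms` to finish.
    run(ms) {
      if (closing) throw new Error('worker is closing, no new jobs');
      inflight = new Promise((resolve) => setTimeout(resolve, ms));
      return inflight;
    },
    // Like worker.close(): refuse new jobs, resolve after in-flight settles.
    async close() {
      closing = true;
      await inflight;
    },
  };
}

// Wiring it into the orchestrator's grace period (sketch):
// process.once('SIGTERM', async () => {
//   await worker.close(); // resolves only after the current job settles
//   process.exit(0);
// });
```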
5. Observe, or fly blind
Install bullmq-prometheus-exporter or roll a small exporter. Watch queue depth, active workers, failure rate, and the age of the oldest waiting job. One of them will flatline or spike before an outage does.
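If you roll your own, the formatting half is trivial. A sketch: the metric names are made up, and `snapshot` stands in for what you'd assemble from queue.getJobCounts() plus the oldest waiting job's timestamp:

```javascript
// Format a queue snapshot as Prometheus exposition text.
// `snapshot` is a plain object here, not a BullMQ type.
function toPrometheus(queueName, snapshot, now = Date.now()) {
  const ageSec = ((now - snapshot.oldestWaitingTs) / 1000).toFixed(0);
  return [
    `bullmq_waiting{queue="${queueName}"} ${snapshot.waiting}`,
    `bullmq_active{queue="${queueName}"} ${snapshot.active}`,
    `bullmq_failed{queue="${queueName}"} ${snapshot.failed}`,
    `bullmq_oldest_waiting_seconds{queue="${queueName}"} ${ageSec}`,
  ].join('\n');
}
```

Serve that from a /metrics endpoint on a timer and you have the four signals above with no extra dependencies.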
Closing
BullMQ is great. Just treat the happy path as the *least* important thing to design for.