Handling Database Failures in a Distributed System with RabbitMQ Workers

  • Posted 1 week ago by yakirbitan
  • 11 points
I have a worker that processes tasks from RabbitMQ and inserts data into a database. The system operates at high scale, handling thousands of messages per second, which makes proper failure handling crucial to avoid overwhelming the system.

The problem arises when the database goes down for a period of time. If the worker has already pulled a message from the queue and acknowledgement isn't handled properly, that message can be lost.

Proposed Solution:

1. Use the ACK mechanism properly
   • If the database is available, insert the data and send an ACK to remove the message from the queue.
   • If the database is down, send a NACK with requeue so that RabbitMQ returns the message to the queue for another attempt.

2. Prevent infinite retry loops
   • A bare NACK puts the message straight back on the queue, creating a tight retry loop that hammers both the worker and the DB.
   • To avoid this, apply a progressive retry delay: the worker sleeps before retrying the same task, e.g. 1s → 3s → 5s → 30s → 60s (capped at a maximum delay).

3. Limit retry attempts
   • Track a retry counter per message (e.g., 5 attempts).
   • After 5 failures, move the message to a Dead Letter Queue (DLQ) instead of retrying indefinitely.
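The retry policy above (ACK on success, delayed requeue on failure, DLQ after five attempts) can be sketched in plain Python, independent of any RabbitMQ client library. `insert` is a stand-in for the real DB write, and `attempt` stands in for a retry count you would carry in a message header (e.g. an `x-retry-count` header, which is an assumption here, not a RabbitMQ built-in):

```python
import time

# Progressive delay schedule (seconds); the last entry is the cap.
RETRY_DELAYS = [1, 3, 5, 30, 60]
MAX_ATTEMPTS = 5


def delay_for(attempt):
    """Return the sleep interval for a given (0-based) attempt, capped."""
    return RETRY_DELAYS[min(attempt, len(RETRY_DELAYS) - 1)]


def handle_message(body, attempt, insert, sleep=time.sleep):
    """Decide what to do with one delivery.

    insert  -- callable that writes `body` to the DB, raising on failure
    attempt -- number of prior attempts for this message
    Returns "ack", "requeue", or "dead-letter".
    """
    try:
        insert(body)
        return "ack"                 # DB write succeeded: remove from queue
    except Exception:
        if attempt + 1 >= MAX_ATTEMPTS:
            return "dead-letter"     # give up, route to the DLQ
        sleep(delay_for(attempt))    # back off before NACK/requeue
        return "requeue"


if __name__ == "__main__":
    # Simulate a DB that stays down: every attempt fails.
    def broken_insert(_body):
        raise ConnectionError("db down")

    outcomes = [handle_message(b"task", n, broken_insert, sleep=lambda s: None)
                for n in range(MAX_ATTEMPTS)]
    print(outcomes)  # ['requeue', 'requeue', 'requeue', 'requeue', 'dead-letter']
```

Note the caveat baked into this design: sleeping inside the consumer callback blocks that channel, so with thousands of messages per second one slow message stalls its whole prefetch window.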

Alternative Approach:

Instead of relying on RabbitMQ's NACK/requeue cycle, an alternative is to keep the message in memory and perform the retries inside the worker itself:

1. The worker tries to insert the data into the DB.
2. If the DB is down, it retries up to 5 times, sleeping between attempts.
3. If all retries fail, it publishes the message to the DLQ and only then ACKs the original.
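The in-worker variant reduces to a small helper. `insert` is again a hypothetical stand-in for the real DB write, and the sleep function is injectable so the loop can be exercised without actually waiting:

```python
import time


def insert_with_retries(body, insert, max_attempts=5, base_delay=1.0,
                        sleep=time.sleep):
    """Try `insert(body)` up to max_attempts times, sleeping between tries.

    Returns True on success, False when every attempt failed (the caller
    would then publish the message to the DLQ and ACK the original).
    """
    for attempt in range(max_attempts):
        try:
            insert(body)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * (attempt + 1))  # simple linear backoff
    return False


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_insert(_body):
        # Fails twice, then the DB "comes back".
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("db down")

    ok = insert_with_retries(b"task", flaky_insert, sleep=lambda s: None)
    print(ok)  # True: succeeded on the third attempt
```

The trade-off versus the requeue approach: the message survives only in process memory during the retry window, so a worker crash mid-retry loses it unless the original delivery stays unacked until the loop finishes.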

Questions:

• Which approach is preferable? Should I rely on RabbitMQ to handle retries, or manage them within the worker itself?
• Are there better practices for handling failures in a high-scale distributed system with RabbitMQ and a database backend?

6 comments
