Delayed event processing and backlogged queue


I run Sentry 9.1.2 and recently noticed that the events.preprocess_event queue was growing, with items expiring and failing. As a remediation, I added more worker instances dedicated to that queue. It helped, but the events.process_event queue has been backlogged ever since, even after I added dedicated workers for that queue as well.

Scaling out workers for that queue doesn’t help. Whether I run 1 or 10 EC2 instances with 4 workers each, the processing speed is the same and the queue keeps growing (currently there are about 100k items, and most of them fail due to an expired Redis TTL).

Workers run on machines with 4 vCPUs and 8GB RAM. I tried both 2 workers with -c 2 and 4 workers with -c 1. CPU utilisation on the EC2 instances running the workers is rather low (<20% most of the time), and the same goes for PostgreSQL and Redis, so I assume there are plenty of spare resources.

When processing does succeed, it happens 1h after the event was preprocessed. The logs I can observe:
10:13 - Task store.preprocess[Id123] succeeded in 0.1s
11:13 - Task store.process[Id123] succeeded in 3s
11:13 - Task[Id123] succeeded in 1s

If the processing is delayed by even a few seconds beyond that, it fails due to TTL expiration.
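For anyone who wants to check their own logs for the same pattern, here is a minimal sketch of how the gap between the two stages could be measured. It assumes log lines in the simplified `HH:MM - Task name[id] succeeded in ...` shape shown above (real worker logs will look different), and `ASSUMED_TTL` is only my guess of a 1h TTL based on the behaviour described, not a confirmed Sentry setting. The helper name `stage_delays` is made up for this illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the simplified log shape quoted in this post (an assumption,
# not the real Celery log format): "10:13 - Task store.preprocess[Id123] succeeded in 0.1s"
LOG_LINE = re.compile(r"^(\d{2}:\d{2}) - Task ([\w.]+)\[(\w+)\] succeeded")

# Assumed TTL of 1 hour, inferred from the observed behaviour only.
ASSUMED_TTL = timedelta(hours=1)


def stage_delays(lines, first="store.preprocess", second="store.process"):
    """Return {task_id: timedelta} between the first and second stage.

    Timestamps carry no date, so a gap that crosses midnight is not handled.
    """
    started = {}
    delays = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed shape
        stamp, name, task_id = match.groups()
        when = datetime.strptime(stamp, "%H:%M")
        if name == first:
            started[task_id] = when
        elif name == second and task_id in started:
            delays[task_id] = when - started[task_id]
    return delays


sample = [
    "10:13 - Task store.preprocess[Id123] succeeded in 0.1s",
    "11:13 - Task store.process[Id123] succeeded in 3s",
]

for task_id, delay in stage_delays(sample).items():
    verdict = "at or past assumed TTL" if delay >= ASSUMED_TTL else "within assumed TTL"
    print(task_id, delay, verdict)
```

With minute-resolution timestamps this only shows that the gap sits right at the assumed TTL boundary; second-resolution logs would be needed to see how close each event actually is to expiring.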

Also, I purged the processing queue twice, and both times the number of messages returned to the same level within ~30min.
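As a rough back-of-the-envelope estimate of what that refill implies, the sketch below computes the net growth rate of the queue (arrivals minus whatever the workers manage to consume). It assumes "the same level" means roughly the 100k items mentioned earlier and that growth is linear over the 30 minutes; both are simplifications, and the function name is only illustrative.

```python
def net_growth_per_second(backlog_items, refill_seconds):
    """Net queue growth (arrivals minus consumption) in items per second,
    assuming the backlog builds up linearly over the refill period."""
    return backlog_items / refill_seconds


# Assumed figures taken from this post: ~100k items after ~30 minutes.
rate = net_growth_per_second(100_000, 30 * 60)
print(f"{rate:.1f} items/s net growth")
```

If that estimate is anywhere near right, the workers together are falling behind by dozens of items per second, regardless of how many instances are running.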

That leads to a few questions:

  1. Why does scaling out workers for a specific queue have no effect on that queue’s processing speed?

  2. What is causing the 1h delay between preprocessing and processing?

  3. Is there any known bug regarding locking, synchronization, etc. that keeps a queue growing and prevents it from being processed in a timely manner?

I’m happy to provide more info about our setup and the things I tried during troubleshooting.

PS I found a few questions in this forum regarding a similar issue, but none of them has an answer, for example:

Hopefully we can get more info about the issue.