Delayed event processing and backlogged queue

sentoryu · September 16, 2020, 2:10pm

Hi,

I run Sentry 9.1.2 and recently noticed that the events.preprocess_event queue is growing and items are expiring and failing. As a remediation, I added more worker instances dedicated to that queue. It helped, but the events.process_event queue has been backlogged since then, even after adding dedicated workers for that queue as well.

Scaling out workers for that queue doesn’t help. Whether I run 1 or 10 EC2 instances with 4 workers each, the processing speed is the same and the queue keeps growing (currently there are 100k items, and most of them fail because their Redis TTL has expired).

Workers run on machines with 4 vCPUs and 8 GB RAM. I tried 2 workers with -c 2 and 4 workers with concurrency 1. CPU utilisation on the worker EC2 instances is rather low (<20% most of the time), and the same goes for PostgreSQL and Redis, so I assume there are plenty of resources.

Successful processing happens 1h after an event is preprocessed. The logs I observe:
10:13 - Task store.preprocess[Id123] succeeded in 0.1s
11:13 - Task store.process[Id123] succeeded in 3s
11:13 - Task store.save[Id123] succeeded in 1s

If processing is delayed by a few more seconds, it fails due to TTL expiration.
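
The gap between stages can be measured across many events rather than read off by hand. Below is a minimal Python sketch, assuming log lines in the simplified `HH:MM - Task name[id] succeeded in Ns` shape quoted above (real Celery log lines look different, so the regex would need adjusting) and assuming the same identifier appears in every stage of one event, as `Id123` does here. The function name `stage_delays` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the simplified lines quoted above, e.g.
# "10:13 - Task store.preprocess[Id123] succeeded in 0.1s"
# (assumed format; adapt to the real worker log layout).
LINE_RE = re.compile(
    r"^(?P<hh>\d{1,2}):(?P<mm>\d{2}) - Task (?P<task>[\w.]+)\[(?P<id>[^\]]+)\] succeeded"
)


def stage_delays(lines, first="store.preprocess", second="store.process"):
    """Return {task_id: minutes between `first` and `second` succeeding}."""
    seen = defaultdict(dict)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the assumed format
        minutes = int(m["hh"]) * 60 + int(m["mm"])
        seen[m["id"]][m["task"]] = minutes

    delays = {}
    for task_id, stages in seen.items():
        if first in stages and second in stages:
            # Modulo handles a pair of lines that straddles midnight.
            delays[task_id] = (stages[second] - stages[first]) % (24 * 60)
    return delays


sample = [
    "10:13 - Task store.preprocess[Id123] succeeded in 0.1s",
    "11:13 - Task store.process[Id123] succeeded in 3s",
    "11:13 - Task store.save[Id123] succeeded in 1s",
]
print(stage_delays(sample))  # {'Id123': 60}
```

A delay that clusters tightly around 60 minutes for every event, instead of varying with queue depth, would point at a fixed timer or retry interval rather than at worker capacity.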

Also, I purged the processing queue twice, and the number of messages returned to the same level in ~30min.
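
The refill time gives a rough figure for how fast the backlog grows. The Python sketch below assumes that "the same level" means roughly the 100k items mentioned earlier and that growth was steady over the 30 minutes. Both are assumptions, and the helper name `net_growth_per_second` is hypothetical.

```python
def net_growth_per_second(backlog_items: int, refill_minutes: float) -> float:
    """Average net queue growth (enqueued minus consumed) in items per second."""
    if refill_minutes <= 0:
        raise ValueError("refill_minutes must be positive")
    return backlog_items / (refill_minutes * 60)


# Assumed figures: ~100k items (the backlog quoted earlier) regrowing in ~30 min.
rate = net_growth_per_second(100_000, 30)
print(f"{rate:.1f} items/s")  # 55.6 items/s
```

Under those assumptions, the workers would have to consume roughly 56 more items per second than they do now just to keep the queue from growing.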

That leads to a few questions:

Why does scaling out workers for a specific queue have no effect on that queue’s processing speed?
What is causing the 1h delay between preprocessing and processing?
Is there any known bug regarding locking, synchronization, etc. that keeps a queue growing and prevents it from being processed in a timely manner?

I’m happy to provide more info about our setup and the things I tried during troubleshooting.

PS: I found a few questions on this forum about a similar issue, but none has an answer, for example:

Accepted events not showing up in project
Long queue processing time
forum .sentry.io/t/the-job-accumulated-in-the-queue-has-been-consumed-late-so-it-is-delayed-for-more-than-an-hour-on-the-dashboard/10451 (sorry, can’t put more than 2 links as a new user)

Hopefully we can get more info about the issue.

Thanks!

chadwhitacre · December 8, 2020, 1:18pm

Sorry this fell through the cracks, @sentoryu. Are you still having this issue?
