For a few days we’ve had the problem that all events in our on-premise (k8s/helm) Sentry deployment seem to be stuck. There are no errors in the logs as far as I can tell. I can log in to one of the worker pods and see that there are a number of events in the preprocessing queue; I purged it, but that didn’t help. I then restarted all of the services and eventually a few more events showed up, but it seems to be stuck again. Any clue how to debug this?
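For reference, this is roughly how I’ve been poking at the queues. The deployment, service, and queue names below are from my setup, and the exact `sentry queues` subcommands may vary by version, so treat this as a sketch rather than a recipe:

```bash
# Exec into one of the Sentry worker pods (deployment name depends on your chart/release)
kubectl exec -it deploy/sentry-worker -- bash

# Inside the pod: list the Celery queues and their current depth
sentry queues list

# Purge a specific queue, e.g. the preprocessing one
sentry queues purge events.preprocess_event

# Or check the backlog directly on the Redis broker;
# Celery queues are plain Redis lists named after the queue
redis-cli -h sentry-redis-master llen events.preprocess_event
```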
So I was mistaken; I thought it was a different problem. Now I see that it is actually processing events. It looked stuck, but it is working through them, it’s just that it has a huge backlog.
I used to have something like 1 event/minute, but now it’s above 40-ish events a minute and the worker can’t keep up during daytime, so it has a queue of 8000 or so events. Since peak time is over it is slowly processing through them, and at around 23:00 it seems to be on top of it again. So sorry about the noise. I had just updated and thought this was related, but it’s just a regular capacity issue.
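As a rough back-of-envelope (illustrative numbers, I haven’t measured the actual throughput): if the worker can sustain ~50 events/minute and off-peak intake drops to ~10 events/minute, an 8000-event backlog drains in about 8000 / (50 − 10) = 200 minutes, which lines up with it catching up a few hours after peak.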
The server is maxed out on CPU, so I won’t add more workers, but I could maybe give it some more CPUs and then add additional workers. Or I could just start rate limiting events.
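If I go the scaling route, it should mostly be a matter of bumping the worker replica count and CPU allocation in the chart values and letting helm roll it out. The keys below are what I’d expect for the sentry-kubernetes chart; they differ between chart versions, so double-check against your values.yaml:

```bash
# Scale out the Sentry workers and give them more CPU
# (values keys depend on the chart version; verify against its values.yaml)
helm upgrade sentry sentry/sentry \
  --reuse-values \
  --set sentry.worker.replicas=4 \
  --set sentry.worker.resources.requests.cpu=1 \
  --set sentry.worker.resources.limits.cpu=2
```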
Of course the real fix is to fix my apps that are spamming issues at the Sentry server! Thanks @untitaker!
Sure, I understand that, but the question is not about the Helm chart, it’s about resource requirements for Sentry itself. Are there any guidelines for sizing an on-premise Sentry deployment?