We have a Sentry setup in our infrastructure that was set up by someone who didn't document anything and then left the company (as happens everywhere, I guess).
We’ve recently experienced some performance issues (workers having a hard time processing events as fast as they come in). Here’s our current situation:
The boxes where the workers run seem fine: CPU, memory, etc. usage is high, but nothing crazy.
Same for the Redis cluster: pretty high, but not necessarily worrying.
Nothing in particular in the sentry logs.
The Postgres database is huge (~400 GB, ~320 GB just for the sentry_eventmapping table…) and often hits 100% CPU utilization.
This table has a bit more than 1 billion rows (1,100,832,546 exactly).
So I have two questions:
Do you guys agree that the performance issue seems to be coming from the current state of the Postgres DB?
The fact that you have 1 billion rows isn't, in itself, necessarily problematic. It definitely depends on the specs of the machine.
With that said, you can freely TRUNCATE that entire table without losing much functionality. If you upgrade to a newer version of Sentry (I forget in which version I added this), we write significantly less into that table, since it's not always needed.
This table purely facilitates looking up an event ID to its group, which is only used when you do a search with an explicit event ID.
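If you do go the truncate route, it's a one-liner; a sketch below (note it drops every existing event-ID-to-group mapping and takes a brief exclusive lock on the table, so run it when that's acceptable):

-- removes all rows from the mapping table; only searches by an explicit event ID are affected
TRUNCATE TABLE sentry_eventmapping;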
Also, I've been trying to run cleanup, but the command hangs forever. The process on the box is waiting (state S+ in ps) and the DB takes forever to execute queries like:
delete from sentry_eventmapping where id = any(array(select id from sentry_eventmapping where "date_added" < now() - interval '7 days' limit 10000));
I assume the cleanup is hanging because of this: the table is huge, so the queries take forever, and so does the cleanup?
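Is there a way to tell whether that delete is actually making progress or just blocked on a lock? I was thinking of checking pg_stat_activity with something like this (just a sketch, assuming our PostgreSQL is 9.6+ so the wait_event columns exist):

-- show state/wait info for anything touching the mapping table, excluding this query itself
select pid, state, wait_event_type, wait_event, now() - query_start as runtime, query
from pg_stat_activity
where query ilike '%sentry_eventmapping%'
  and pid <> pg_backend_pid();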
Oh yeah, I don't think we ship with the right index on that table. I'm not sure why, but looking at the code now, there's no index. I think we applied the index manually on sentry.io years ago and never brought it into the code.
You can either add an index on the date_added column manually, or just truncate the table for now because, like I said, it's of very limited value.
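If you go the index route, something along these lines should do it (a sketch; the index name is arbitrary, and CONCURRENTLY avoids blocking writes but has to run outside a transaction and will take a while on a table this size):

-- lets the cleanup's date_added < now() - interval '7 days' filter use an index scan
create index concurrently sentry_eventmapping_date_added
    on sentry_eventmapping (date_added);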
I'm still getting some performance issues and I don't get where they're coming from.
No error in the sentry-web / redis / sentry-cron logs.
From time to time I get the following line in the sentry-worker logs:
[WARNING] sentry.tasks.process_buffer: Failed to process pending buffers due to error: Unable to acquire <Lock: 'buffer:process_pending'> due to error: Could not set key: u'l:buffer:process_pending'
but that’s all.
Here are the symptoms:
super-fast-growing events queue
high CPU usage on the DB side
sentry worker processes seem to be sitting there doing nothing, just waiting.