We have a running installation in our Kubernetes cluster, using the unofficial Helm chart.
We are seeing very high memory usage from the sentry.web workers and are wondering if there is something we can do to troubleshoot. Memory spikes to 10-11GB at times.
Clicking around in the interface is slow and things time out when these spikes occur.
Just for some context, you have over 10x the rows that we have in our production system. I’m wondering if you’re using environments in a way we didn’t intend. We mostly expect environment to be something like prod, dev, etc., so for a given org/project we’d expect a low number of these, 0-20 on average.
I wouldn’t be surprised if this is the cause of your memory issues, since you’re probably bringing all those rows into memory frequently.
Yes, we have production, staging, dev, and a couple of others, but for some reason they are not unique in the database, so we have ~27 million ‘production’ environments. Would love to figure out what the schema needs to look like for those to be unique, similar to what we had to do for the other table.
There should be a unique constraint on your Environment table across (organization_id, name). If that is not in place, then you’ll likely have similar issues to the previous thread.
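If you want to check what’s actually there, you can list the constraints on sentry_environment from a sentry shell, something like this (a rough sketch using Django’s schema introspection; constraint names will differ per install):

```python
# Run inside `sentry shell` (a Django shell against your install).
from django.db import connection

with connection.cursor() as cursor:
    # get_constraints() returns every index/constraint on the table
    constraints = connection.introspection.get_constraints(cursor, "sentry_environment")

for name, info in constraints.items():
    if info["unique"]:
        print(name, info["columns"])

# You'd expect one of the printed entries to cover ['organization_id', 'name'];
# if nothing does, you're hitting the same problem as in the previous thread.
```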
Fixing this is a lot more complex. For each (organization_id, name) combo, you’ll need to select the min(id) as the Environment you want to keep. Then, for any related tables, you’ll need to repoint the environment_id of rows referencing the environments you want to remove at the one you’re keeping (and where that creates duplicate rows, remove the duplicates). There’s a rough sketch of this below the table list.
There are a lot of tables to update:
sentry_environment
sentry_environmentproject
sentry_groupenvironment (if you have a lot of sentry_groupedmessage rows this will be huge).
sentry_releaseenvironment
sentry_releaseprojectenvironment
There are probably more. This isn’t something we’ve ever really had to do, and I’m not sure how your database ended up in this state. It’s possible there will be other side effects to repairing this data; it’s hard to predict, since we don’t typically do this kind of fixing.
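To make that concrete, here’s a rough sketch of the repointing, run from sentry shell (or lift the SQL out and run it in psql). It only covers sentry_environmentproject; the same pattern applies to the other related tables, and the unique columns I’m assuming for it, (project_id, environment_id), should be double-checked against your schema. Definitely test this against a copy of your data first.

```python
# Rough sketch only -- table/column names are from this thread, the child-table
# unique columns are assumptions. Run against a copy of the data before prod.
from django.db import connection, transaction

with transaction.atomic(), connection.cursor() as cursor:
    # Map every duplicate environment id to the min(id) we want to keep.
    cursor.execute("""
        CREATE TEMP TABLE env_dupes AS
        SELECT e.id AS dupe_id, k.keep_id
        FROM sentry_environment e
        JOIN (
            SELECT organization_id, name, MIN(id) AS keep_id
            FROM sentry_environment
            GROUP BY organization_id, name
        ) k ON k.organization_id = e.organization_id AND k.name = e.name
        WHERE e.id <> k.keep_id
    """)

    # Repoint child rows at the keeper, unless an identical row already exists.
    cursor.execute("""
        UPDATE sentry_environmentproject ep
        SET environment_id = d.keep_id
        FROM env_dupes d
        WHERE ep.environment_id = d.dupe_id
          AND NOT EXISTS (
              SELECT 1 FROM sentry_environmentproject x
              WHERE x.project_id = ep.project_id
                AND x.environment_id = d.keep_id
          )
    """)

    # Whatever still points at a duplicate is itself a duplicate row -- drop it.
    cursor.execute("""
        DELETE FROM sentry_environmentproject ep
        USING env_dupes d
        WHERE ep.environment_id = d.dupe_id
    """)

    # ...repeat the two statements above for sentry_groupenvironment,
    # sentry_releaseenvironment, sentry_releaseprojectenvironment, etc. ...

    # Once nothing references the duplicates, remove them.
    cursor.execute("""
        DELETE FROM sentry_environment e
        USING env_dupes d
        WHERE e.id = d.dupe_id
    """)
```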
An alternative here is to just remove the duplicate Environment rows via sentry shell, and let Django cascade delete all of the duplicates. This will result in data loss on the related tables, but if events are coming in then this data should refill over time. This is a lot easier, but again, hard to predict the side effects of doing something like this.
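Roughly, that alternative would look something like this from sentry shell (a sketch only; the import path and model layout may differ between Sentry versions, and with ~27 million duplicates you’d want to batch it):

```python
# Run inside `sentry shell`. Lossy approach: delete the duplicate Environment
# rows and let Django cascade to the related tables.
from django.db.models import Min
from sentry.models import Environment  # assumption: import path may vary by version

# For each (organization_id, name), find the lowest id -- that's the keeper.
keepers = (
    Environment.objects
    .values("organization_id", "name")
    .annotate(keep_id=Min("id"))
)

for row in keepers.iterator():
    (
        Environment.objects
        .filter(organization_id=row["organization_id"], name=row["name"])
        .exclude(id=row["keep_id"])
        .delete()  # Django cascades, so related rows for the duplicates are lost
    )
```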
That’s about all I can offer in terms of advice here. If you have any other questions about this process let me know.
Also, as a separate follow-up once you finish solving these immediate issues, I’d recommend you make a fresh install of a separate Sentry instance, dump the schema, and compare it to the schema of your existing Sentry install.
The one thing I will reiterate in this post is that I am highly curious how the compound keys for the tables sentry_organizationonboardingtask and sentry_environmentproject got wiped out. From my understanding, these should have always existed from a fresh install, even going back to earlier versions of on-premise.
@mleclerc001 Unfortunately I don’t have time to look at a dump of your schema. I think your best option here is to use a schema comparison tool: dump the schema of your production database and compare it to the schema from a fresh Sentry install. Then at least you can determine if any other indexes are missing. If you do this, you might find that some indexes have different names but are on the same columns - that’s fine if it’s the case.
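If it helps, the comparison itself can be as simple as diffing two pg_dump --schema-only outputs, something like this (the hosts, database names, and user below are placeholders):

```python
# Sketch: dump the DDL of both databases and print a unified diff.
# pg_dump picks up credentials from ~/.pgpass or PGPASSWORD.
import difflib
import subprocess

def dump_schema(host, db, user):
    # --schema-only emits DDL only (tables, indexes, constraints), no data
    out = subprocess.run(
        ["pg_dump", "--schema-only", "--no-owner", "-h", host, "-U", user, db],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.splitlines()

fresh = dump_schema("fresh-db.internal", "sentry", "sentry")  # placeholder
prod = dump_schema("prod-db.internal", "sentry", "sentry")    # placeholder

for line in difflib.unified_diff(fresh, prod, "fresh", "prod", lineterm=""):
    print(line)
```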
I’m also curious about how those keys got wiped out. I can’t think of a way it would have happened as part of our regular upgrade process, and I couldn’t reproduce it. I wonder if at some point someone was attempting to migrate data across instances, hit an integrity error, and removed them?
No worries @wedamija, figured it couldn’t hurt to ask. I was able to fix our environment issue using a similar method to the other thread we had open. Fortunately it was only our environmentproject table that was missing the index and not the environment table itself. Cleaning out the duplicates and re-adding the index seems to have fixed our performance problems.