Cleanup stuck for several days on "Removing old NodeStore values"

I have a mid-sized Sentry system that has been receiving events, with a 30-day retention.

I have a cleanup job that runs daily, but it wasn't introduced until the Sentry system had been running for a month or so. Over that period the DB grew to about 350GB, which may have something to do with this.
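For reference, the job is just the standard cleanup command on a cron, something along these lines (the compose path, schedule, and log path here are illustrative, adjust to your setup):

# illustrative crontab entry: run cleanup nightly at 03:00 with 30-day retention
0 3 * * * cd /opt/sentry/onpremise && docker-compose run --rm -T web cleanup --days 30 >> /var/log/sentry-cleanup.log 2>&1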

It keeps getting stuck on:

Removing expired values for LostPasswordHash
Removing expired values for OrganizationMember
Removing expired values for ApiGrant
Removing expired values for ApiToken
Removing expired files associated with ExportedData
Removing old NodeStore values <----

I've had one run going for 4 full days now, and it is still stuck at this point.

I am running a VACUUM FULL; on Postgres to try to clean things up (I've stopped the cleanup job in the meantime). Edit: the VACUUM didn't clear anything :frowning:
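For anyone wanting to reproduce this, I ran it roughly like so (assuming the stock onpremise docker-compose, where the service, superuser, and database are all named postgres; note that VACUUM FULL takes an exclusive lock on the table while it rewrites it):

# rewrite the nodestore table to reclaim space; blocks writes to it until finished
docker-compose exec postgres psql -U postgres -d postgres -c 'VACUUM FULL nodestore_node;'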

Is this expected behaviour? Is it possible that something is getting stuck?

Bumping for visibility.

Can you inspect that process with, say, Wireshark to see whether it is hanging on network activity, or whether there's really just a lot to clean up? Does it use any CPU at all? Does Postgres receive any queries? (You'd probably have to bump the log verbosity in the Postgres container.)
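Something along these lines should turn on statement logging (assuming the stock compose file, where both the Postgres service and superuser are named postgres):

# log every statement the server receives, then reload the config
docker-compose exec postgres psql -U postgres -c "ALTER SYSTEM SET log_statement = 'all';"
docker-compose exec postgres psql -U postgres -c "SELECT pg_reload_conf();"
# then watch whether the cleanup job is actually issuing queries
docker-compose logs -f postgres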

Looks like, after some vacuuming (vacuumdb), it eventually all cleaned up, now that I have the daily cleanup cron running. Problem is hopefully solved. Thank you for your input @untitaker
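For the record, the vacuum step was roughly this (service, user, and database names assume the stock docker-compose defaults):

# vacuum and re-analyze the whole database from inside the postgres container
docker-compose exec postgres vacuumdb -U postgres -d postgres --analyze --verbose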

@DandyDeveloper I have a similar problem: our nodestore_node table has grown to 600GB, and we realised the only way to reclaim the space is to run a separate cleanup for that table and then use pg_repack.
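Roughly what the pg_repack step looks like, assuming the extension and client binary are installed (the stock postgres image doesn't ship pg_repack, so this is only a sketch):

# one-time: enable the extension in the target database
docker-compose exec postgres psql -U postgres -d postgres -c 'CREATE EXTENSION IF NOT EXISTS pg_repack;'
# rebuild nodestore_node with minimal locking to return the dead space to the OS
docker-compose exec postgres pg_repack -U postgres -d postgres --table nodestore_node
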
I ran the cleanup script to clear everything in the node store down to 0 days, and it has been running for hours now. @untitaker we had 90 days' worth of data in that table, and in the past I had run cleanup with a 60-day retention, so it only removed 30 days of data; that completed in an hour or less. However, when I run the script to clear everything, like below:

docker-compose run -T web cleanup --days 0 -m nodestore -l debug

It seems to run for hours, possibly days. I know --days 0 works as expected, since I was able to clear 4GB in our dev environment in around 15-20 minutes, but this is a very long time for the script to run on a small data set. Once I am running this regularly as a nightly cron job, it may well complete faster. Since this is not documented as a requirement in the self-hosted repo, it is easy to miss.