We have an on-premise Sentry deployment (installed with chart v8.1.2) which has been working great for the past few months, except for the growing PostgreSQL (nodestore_node) disk usage, which is a known issue with no perfect solution.
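(For context, the only partial mitigation we know of for the nodestore growth is Sentry's built-in cleanup job. Below is a sketch of how it can be run manually; the worker pod name and the 30-day retention window are assumptions specific to our setup, not a recommendation.)

```bash
# Sketch: prune old nodestore/event data with Sentry's built-in cleanup.
# "deploy/sentry-worker" and the 30-day window are placeholders for our
# own deployment; adjust to your release names and retention policy.
kubectl exec -it deploy/sentry-worker -- sentry cleanup --days 30

# Note: PostgreSQL only reclaims the freed space after a vacuum, so the
# on-disk size may not shrink until autovacuum (or a manual VACUUM) runs.
```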
Last week we had a burst of events from one of our projects, which triggered the issues described below.
#1. Issues with Redis: The Redis pods were in a CrashLoopBackOff state with broken replication, and we had to increase the disk space (it had reached 100% usage) and perform a hard recovery: the AOF file was corrupted, so we disabled AOF so Redis would load the dump.rdb, then created a new AOF using BGREWRITEAOF and restarted the pods.
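For anyone hitting the same AOF corruption, this is roughly the sequence we used (a sketch only; the pod name is a placeholder, and in practice the appendonly setting has to survive the pod restart, e.g. via the chart's Redis configuration values):

```bash
# Disable AOF so Redis falls back to loading dump.rdb on the next start
# (persist this via chart values, since a pod restart discards runtime config):
kubectl exec sentry-redis-master-0 -- redis-cli CONFIG SET appendonly no

# After restarting the pod and confirming the dataset loaded from dump.rdb,
# rebuild a fresh AOF from the in-memory data:
kubectl exec sentry-redis-master-0 -- redis-cli BGREWRITEAOF

# Once the rewrite finishes, re-enable AOF:
kubectl exec sentry-redis-master-0 -- redis-cli CONFIG SET appendonly yes
```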
#2. Issues with Kafka and ClickHouse: Sentry events were still not being processed even after #1, and the ClickHouse pods were throwing many errors. We went through the logs of every Sentry component and noticed Offset Out of Range errors similar to what is explained in issues/478, and I had to run the recovery procedure to fix it.
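For reference, the recovery boiled down to resetting the offsets of the failing consumer group (a sketch; the Kafka pod, group, and topic names below are from our logs and may differ in your deployment):

```bash
# Reset the snuba consumer group to the latest offset on the events topic.
# Names are assumptions from our setup; check your own logs / issues/478
# for the exact group and topic that are out of range.
kubectl exec -it sentry-kafka-0 -- kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group snuba-consumers \
  --topic events \
  --reset-offsets --to-latest --execute

# Note: --to-latest skips whatever was still unconsumed in the topic.
```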
Sentry started to work fine and accept events after the above fixes, but we noticed that many old issues/events had disappeared from the Sentry portal, and we don't know what exactly caused their removal.
I know the root cause of the failure was the burst of events, and that our Sentry components could not consume events as fast as they were produced, as explained here. But I really don't know how the old events/issues disappeared. Could this be due to the consumer offset reset performed as part of the recovery (#2)? I was under the impression that events are stored in PostgreSQL, and I'm not sure how a Kafka consumer offset reset could have triggered the deletion of older issues/events. Thanks!