Sentry data loss incident


We have an on-premise Sentry deployment (installed with chart v8.1.2) that has been working great for the past few months (apart from the growing PostgreSQL (nodestore_node) disk usage, which is a known issue with no perfect solution :slightly_frowning_face:).

Last week we had a burst of events from one of our projects, which triggered several issues, as described below.

#1. Issues with Redis: The Redis pods were in a CrashLoopBackOff state with broken replication, and we had to increase the disk space (usage had reached 100%) and also perform a hard recovery. (The AOF file was corrupted, so we disabled AOF so Redis would load the dump.rdb instead, then created a fresh AOF using BGREWRITEAOF and restarted the pods.)
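For reference, the hard recovery roughly followed these steps (a sketch only; how you disable AOF depends on whether your chart mounts a static redis.conf, and these commands assume `redis-cli` can reach the instance from inside the pod):

```shell
# 1. Disable AOF so Redis falls back to loading dump.rdb on the next start
#    (if the config is mounted read-only, edit the ConfigMap instead):
redis-cli CONFIG SET appendonly no

# 2. After restarting the pod and confirming the RDB snapshot loaded,
#    re-enable AOF and rewrite it from the in-memory dataset:
redis-cli CONFIG SET appendonly yes
redis-cli BGREWRITEAOF

# 3. Wait for the rewrite to finish before restarting the replica pods:
redis-cli INFO persistence | grep aof_rewrite_in_progress
```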

#2. Issues with Kafka and ClickHouse: Sentry events were still not being processed even after #1, and the ClickHouse pods were throwing many errors. We went through the logs of each and every Sentry component and noticed Offset Out of Range errors similar to what is described in issues/478, and I had to run the recovery procedure to fix it.
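The usual recovery for Offset Out of Range is to reset the stuck consumer group to the latest offsets, skipping the backlog. A sketch of that, run from inside the Kafka pod (the group and topic names here are the common self-hosted defaults and may differ in your deployment; list the groups first to find the stuck one):

```shell
# Find the consumer group that is stuck:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Reset it to the end of the topic, discarding the out-of-range backlog:
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group snuba-consumers \
  --topic events \
  --reset-offsets --to-latest --execute
```

Note that everything between the old and the new offset is dropped, which matters for the "lost in-flight events" question below.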

Sentry started to work fine and accept events after the above fixes, but we noticed that many old issues/events had disappeared from the Sentry portal, and we don’t know what exactly caused their removal.

I know that the root cause of the failure was the burst of events: our Sentry components were not able to consume events as fast as they were produced, as explained here. But I really don’t know how the old events/issues disappeared. Could this be due to the consumer offset reset done as part of the recovery (#2)? I was under the impression that events are stored in PostgreSQL, and I’m not sure how a Kafka consumer offset reset could have triggered the deletion of older issues/events. Thanks!

Hi! First of all, sorry for all the trouble, and glad you were able to bring the system back on its feet. This might be due to two reasons that I can think of:

  1. For some reason your cleanup tasks were not running properly and this incident triggered the old issue cleanup. Our default retention period is 90 days.
  2. As you suggested, some of the in-flight data is stored in Redis and Kafka, and resetting those offsets may have caused that data to be lost. The raw event data should still be there in nodestore in this case, but since the events were never processed, they wouldn’t show up in the UI.

@fpacifici @tkaemming anything I’m missing?

@BYK, I think you summarized it properly.
There are only two ways the system can drop old events:

  1. ClickHouse has a 90-day TTL by default; after that it drops events, and it applies this every day. This would apply only to events; the issues would remain intact.
  2. There is a background task that removes issues older than 90 days with no activity and deletes their events from Postgres. This could make the entire issue disappear from Sentry.

@prasad, were any of the old issues older than 90 days?

Regarding the in-flight events that were lost: you may still see the issues being created, but they will be missing events inside, as those events were never stored in ClickHouse.
Skipping offsets per se does not delete existing events; it only skips some new ones.


Thanks so much for the response @BYK & @fpacifici :pray:

I had disabled the global cleanup cronjob and added a few project-based cronjobs with custom retentions, as seen below:

              - "cleanup"
              - "--days"
              - "30"
              - "--project"
              - "6" 
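Those args correspond to invoking the per-project cleanup command directly, i.e. (using project ID 6 from the snippet above, run inside a Sentry worker/web pod):

```shell
# Equivalent of the cronjob args above: delete data older than 30 days,
# scoped to project 6 only.
sentry cleanup --days 30 --project 6
```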

The issues removed were older than 60 days. So does the default cleanup policy still apply, even though we have disabled the global cleanup cronjob and created custom ones like the above?

There are certain pieces of the code that may assume a 90-day retention period, and I’m not sure the code is architected in a way that deals with per-project retention settings. So in summary: yes, this sounds possible, but it’s hard to give a deterministic answer with my limited knowledge.
