Sentry data loss incident

prasad · July 26, 2021, 10:47am

Greetings,

We have an on-premise Sentry deployment (deployed using chart v8.1.2) that has been working well for the past few months, apart from the growing PostgreSQL (nodestore_node) disk usage, which is a known issue with no perfect solution.

Last week we had a burst of events from one of our projects, which triggered several issues, described below.

#1. Issues with Redis: The Redis pods were in a CrashLoopBackOff state with broken replication. We had to increase the disk space (it had reached 100% usage) and perform a hard recovery. (The AOF file was corrupted, so we disabled AOF so that Redis would load dump.rdb, then created a new AOF using BGREWRITEAOF and restarted the pods.)

#2. Issues with Kafka and ClickHouse: Sentry events were still not being processed after #1, and the ClickHouse pods were throwing many errors. We went through the logs of every Sentry component and noticed Offset Out of Range errors similar to those described in issues/478, so I ran the recovery procedure to fix it.

Sentry started to work fine and accept events after the above fixes, but we noticed that many old issues/events had disappeared from the Sentry portal, and we don't know what exactly caused their removal.

I know that the root cause of the failure was the burst of events, and that our Sentry components were not able to consume the events as fast as they were produced, as explained here. But I really don't know how the old events/issues disappeared. Could this be due to the consumer offset reset done as part of the recovery (#2)? I was under the impression that events are stored in PostgreSQL, so I'm not sure how a Kafka consumer offset reset could have triggered the deletion of older issues/events. Thanks!

BYK · July 26, 2021, 1:06pm

Hi! First of all, sorry for all the trouble and glad you were able to bring the system back onto its feet. This might be due to 2 reasons that I can think of:

1. For some reason your cleanup tasks were not running properly and this incident triggered the old issue cleanup. Our default retention period is 90 days.
2. As you suggested, some of the in-flight data is stored on Redis and Kafka and resetting these offsets may have caused that data to be lost. The raw event data should still be there in nodestore in this case but since they are not processed, they wouldn’t show up on the UI.

@fpacifici @tkaemming anything I’m missing?

fpacifici · July 26, 2021, 9:51pm

@BYK, I think you summarized it properly.
There are only two ways the system can drop old events. First, ClickHouse has a 90-day TTL by default: it applies this every day and drops events past that age. This affects only events; the issues remain intact. Second, there is a background task that removes issues that are 90 days old and have no activity, and deletes their events from Postgres. This could make an entire issue disappear from Sentry. @prasad, were any of the old issues older than 90 days?
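
To make the two paths concrete, here is a minimal sketch in Python. It is not Sentry's code: the function name, its parameters, and the simplification that both paths share a single retention window are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 90  # default retention period mentioned above


def retention_outcome(event_time, issue_last_activity, now, retention_days=RETENTION_DAYS):
    """Hypothetical model of the two removal paths (not Sentry's implementation)."""
    cutoff = now - timedelta(days=retention_days)
    return {
        # Path 1: the ClickHouse TTL drops the event; the issue itself is untouched.
        "event_dropped": event_time < cutoff,
        # Path 2: the cleanup task removes an issue with no activity since the cutoff.
        "issue_removed": issue_last_activity < cutoff,
    }


now = datetime(2021, 7, 26)
sixty_days_ago = now - timedelta(days=60)

# Data that is 60 days old, checked against the default 90-day window.
print(retention_outcome(sixty_days_ago, sixty_days_ago, now))
# The same data checked against a hypothetical 30-day window.
print(retention_outcome(sixty_days_ago, sixty_days_ago, now, retention_days=30))
```

In this model, 60-day-old data survives a 90-day window on both paths, while a shorter window flags both the event and the issue. Which window a real deployment applies depends on how its cleanup is configured.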

Regarding the in-flight events that were lost, you may still see the issues being created, but they will be missing events inside, as those events would not be stored in ClickHouse.
Skipping offsets per se does not delete existing events; it only skips some new ones.
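
The distinction can be sketched with a toy model of a single Kafka partition. This is only an illustration in Python: it uses no Kafka client library, the helper name and its parameters are hypothetical, and it assumes the consumer is reset to the earliest offset the broker still retains.

```python
def reset_out_of_range_offset(committed, log_start, log_end):
    """Return (new_offset, skipped) for a consumer on one partition.

    Hypothetical helper: `log_start` is the earliest offset the broker still
    retains, `log_end` is the offset the next produced message will receive.
    """
    if log_start <= committed <= log_end:
        # The committed offset is still valid, so nothing is skipped.
        return committed, 0
    # Out of range: jump to the earliest retained offset. Messages in
    # [committed, log_start) were already deleted by the broker and can
    # never be consumed; these are the lost in-flight events.
    skipped = max(0, log_start - committed)
    return log_start, skipped


# Events that were already written to storage are not touched by the reset.
stored_events = ["evt-1", "evt-2", "evt-3"]
new_offset, skipped = reset_out_of_range_offset(committed=1000, log_start=1500, log_end=2000)
print(new_offset, skipped, len(stored_events))
```

The reset only moves the consumer's position forward in the log. It has no code path that reaches into data that was already stored, which is why it can explain missing new events but not the removal of old ones.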

prasad · July 27, 2021, 2:29am

Thanks much for the response @BYK & @fpacifici

I had disabled the global cleanup cronjob and added a few per-project cronjobs with custom retention periods, as seen below:

    args:
      - "cleanup"
      - "--days"
      - "30"
      - "--project"
      - "6"

The removed issues were older than 60 days. Does the default cleanup policy still apply even though we disabled the global cleanup cronjob and created custom ones like the one above?

BYK · July 27, 2021, 12:00pm

There are certain pieces in the code that may assume a 90-day retention period, and I’m not sure if the code is architected in a way to deal with per-project retention period settings. So in summary, yes, this sounds possible, but it’s hard to give a definitive answer with my limited knowledge.
