Yesterday, I was in Sentry looking at a specific issue, and it had “1.2k” events.
Today, I’m looking at the same issue, and it only has 76 events.
Overnight (literally), almost all of the events from this issue disappeared. Interestingly, the missing events are the RECENT ones (we had a production incident, so most events are from the last day); the older events (the 76) are still there.
What are all the possible reasons that this could happen?
I hope to get an exhaustive list. We’ve already looked at the obvious (to us) things: checked our Sentry configuration, looked at app logs, Postgres logs, and Redis logs, checked VM disk usage, etc. Our best guess/speculation is that Redis collects new events and then flushes them to Postgres, and that this flush failed for some reason, but we can find no evidence that it happened (unless we’re looking at the wrong logs).
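In case it helps to compare what the UI reports against what is actually stored, below is a rough sketch of querying the issue counters directly in Postgres. The table and column names (sentry_groupedmessage, times_seen, first_seen, last_seen, message) are my best recollection of the 9.x schema, and the connection details are placeholders, so please verify against your own database before relying on it:

```python
# Rough sketch: compare the per-issue counters stored in Postgres with what the
# UI shows. Assumes the Sentry 9.x schema; table/column names are recalled from
# memory, so double-check them against your own database.
import psycopg2

# Placeholder connection details -- substitute your own.
conn = psycopg2.connect(
    host="localhost", dbname="sentry", user="sentry", password="secret"
)

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, times_seen, first_seen, last_seen, message
        FROM sentry_groupedmessage
        WHERE message ILIKE %s
        ORDER BY last_seen DESC
        """,
        ("%HTTPError%",),
    )
    for issue_id, times_seen, first_seen, last_seen, message in cur.fetchall():
        # times_seen is the counter behind the "1.2k events" figure in the UI.
        print(issue_id, times_seen, first_seen, last_seen, (message or "")[:60])

conn.close()
```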
I’ve attached a screenshot that demonstrates (hopefully, at least a little bit) that I’m not mistaken about seeing the event count decrease.
Another thing to notice about this screenshot: the oldest event(s) disappeared too; notice the issue’s age changes from “2 months old” to “a month old”. So it’s mostly the newer events that are missing, but at least one older one as well.
Since you haven’t stated the version you are using, I’ll assume you are on 9.1.2, which uses Redis as a temporary buffer before flushing events into the db. That said, once you can see events in the UI, I can’t really think of anything other than a cleanup operation removing them, or a Postgres issue on your side. I’m sure @matt can come up with more, and more accurate, ways this could happen.
If this happened with Sentry 10, then there are a lot more moving pieces, so I won’t go there unless you confirm the version.
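For reference, on a 9.x install that write buffer is configured in sentry.conf.py, roughly like the sketch below (the host/port are placeholders; check your own config):

```python
# sentry.conf.py (Sentry 9.x) -- rough sketch of the Redis write buffer setup.
# The buffer batches counter updates (such as an issue's event count) in Redis
# and flushes them to Postgres asynchronously, which is why a failed flush is a
# plausible suspect when counts look wrong.

SENTRY_BUFFER = "sentry.buffer.redis.RedisBuffer"
SENTRY_BUFFER_OPTIONS = {
    "hosts": {
        0: {
            "host": "127.0.0.1",  # placeholder -- point at your Redis instance
            "port": 6379,
        }
    }
}
```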
I think I figured out what is happening. It’s my fault (user confusion) fueled by (arguably) a UX issue?
My screenshots above in the original question show two different issues (one with 1.2k events, another with 76 events), not the same issue as I had stated. In fact, I’ve discovered there are actually three distinct issues where these “too many requests” events are grouped.
These three issues are ostensibly about 404s, 429s, and 404s respectively; however, when I click through to the events list, I can see that each of these issues contains events for three different HTTP errors (404s, 429s, and 500s):
OpenURI::HTTPError: 404 Not Found
OpenURI::HTTPError: 429 Too Many Requests
OpenURI::HTTPError: 500 Internal Server Error
I had assumed that these different HTTP errors were split out into their own issues, but they’re all being grouped together into the same OpenURI::HTTPError issue. The fact that there are three different OpenURI::HTTPError issues further confused me: two of these issues are for the same function and one is for a separate function, so at the very least there are two distinct errors here.
Right now, if I search “too many requests”, I still only get one result, because (I think) Sentry is searching the issue’s current summary, which can change every time a new event comes through (speculating here).
I don’t recall any merging having occurred, and there doesn’t appear to have been any. I’m not sure why there would be three separate issues that each contain events for three different HTTP errors; I would expect these to be split out, but I bet this is a hard problem to solve (how aggressively to group).
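For what it’s worth, the usual lever for splitting these seems to be client-side fingerprinting. Our app is Ruby, but here is the idea sketched with the Python SDK just to illustrate (the hook name and the message parsing are mine, not anything Sentry ships); presumably our Ruby client has an equivalent hook:

```python
# Illustration only: split HTTP errors into separate issues by status code
# using client-side fingerprinting. The hook name and the message parsing are
# assumptions for the sketch, not something Sentry provides out of the box.
import re

import sentry_sdk


def split_http_errors(event, hint):
    # Look at the exception message (e.g. "404 Not Found", "429 Too Many
    # Requests") and append the status code to the default fingerprint, so each
    # status becomes its own issue instead of all landing in one
    # OpenURI::HTTPError group.
    values = (event.get("exception") or {}).get("values") or []
    message = (values[-1].get("value") or "") if values else ""
    match = re.match(r"\s*(\d{3})\b", message)
    if match:
        event["fingerprint"] = ["{{ default }}", match.group(1)]
    return event


sentry_sdk.init(
    dsn="https://public@sentry.example.com/1",  # placeholder DSN
    before_send=split_http_errors,
)
```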
Not to advertise, but v10 includes significant improvements to search and an experimental feature that provides more control and transparency over event grouping, so this might be a very good reason to give it a try.