Hi, we’re interested in scaling up our install of Sentry, focusing on increasing the throughput of events/min our cluster can handle. This post is split into two sections: the first half explains our current scaling problem, and the second half asks some misc. questions we couldn’t find answers to in the docs or on this forum. Sorry for the wall of text, and thanks for being responsive to the onpremise community - we’ve read many of these threads.
This forum won’t let me attach more than one image or paste more than two links in this thread, so see the Google Doc “Debugging Snuba consumers lag” for the pictures and external links referenced in this post.
We don’t have a great working hypothesis for why our consumers - both Sentry and Snuba - have fallen over in the past. The ingest consumer appears to be very lightweight and we’ve never experienced any backup issues with it. We have a recurring issue where the following consumer groups, freshly caught up by resetting to the latest offset, will fall behind after 2-3 hours:
- snuba_consumers
- transaction_groups
- snuba-post-processor
Once lag has built up, the consumers begin to crash loop (see gdoc for stacktrace). We get a bunch of “Error submitting packet, dropping the packet and closing the socket” warnings, which we think are unrelated since they appear even during normal operation; then we get the “Offset out of Range” error, followed by continuous crashlooping until we manually intervene at the Kafka level.
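For reference, our manual intervention at the Kafka level is roughly the following (a sketch; the broker address and group/topic names are placeholders for our deployment’s values):

```
# Check how far behind one of the affected groups is.
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group snuba_consumers

# With the consumer stopped, skip the group past the expired messages by
# resetting its offsets to latest, then restart the consumer.
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group snuba_consumers --topic events \
  --reset-offsets --to-latest --execute
```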
Our topic retention is set to 2G per partition, so we think it’s plausible that the crashlooping is roughly time-correlated with the point where the cumulative size of the lagged messages exceeds 2G. At 3:30, when the consumers start crashlooping, the most lagged events partition is about 5k messages behind, which could plausibly be about 2G in size.
- see gdoc for image, which shows consumer lag by message offset -
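For completeness, this is how we verify the retention override (again a sketch with placeholder broker/topic names); note that retention.bytes is enforced per partition, not per topic:

```
# Show topic-level config overrides; we expect retention.bytes=2147483648 (2G),
# applied to each of the topic's partitions.
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name events --describe
```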
That being said, we don’t think the crashlooping is correlated with a spike in events; we process about 200-300k non-rate-limited events per hour, and lag is close to zero in the steady state. There is no spike in events shortly before lag starts to build up indefinitely.
- see gdoc for image, which shows the lagginess of consumers -
- see gdoc for image, which shows event submission traffic is fairly constant -
We’ve allocated the post-process-forwarder 6 cores, 14G of memory, and two instances, although it’s implied in [gdoc link] that the extra cores might be wasted because the post-process-forwarder consumer is not yet multithreaded.
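Each of the two instances runs the forwarder roughly as below (reconstructed; the commit-batch-size value is illustrative rather than exact):

```
# One single-threaded forwarder process per instance; the extra allocated cores
# appear to go unused.
sentry run post-process-forwarder --commit-batch-size 1
```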
We’ve allocated the Snuba consumers 1 core, 1G of memory, and 2 instances each. We haven’t tested more cores yet because the instructions given in [see gdoc for Sentry thread] to enable multiprocessing didn’t work for us. We scaled all Sentry-related topics up to 20 partitions, but running the consumers with --processes=6 --input-block-size=6000000000 --output-block-size=6000000000 caused the following error: (see gdoc for stacktrace). We also tried decreasing the block sizes by a factor of 1000 without success. This was particularly strange because there was no shortage of unused disk space; we were running on a 500G volume, which should be plenty.
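Concretely, the invocation we tried looked roughly like this (reconstructed from memory; the storage name and exact flag spellings may differ slightly by Snuba version):

```
# Multiprocessing attempt that crashed for us; block sizes are in bytes
# (~6G each), and dividing them by 1000 did not help either.
snuba consumer --storage errors \
  --auto-offset-reset=latest \
  --processes 6 \
  --input-block-size 6000000000 \
  --output-block-size 6000000000
```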
I’m testing out Snuba with more instances and memory as we speak, but I’ll note that Snuba did not seem particularly resource-constrained on CPU or memory even at the 1 core / 1G scale.
We’ve also got these misc. questions, which are somewhat unrelated:
In the same sentry.io thread linked above, “subscriptions” consumers are mentioned. Upon reading this, we realized that we neglected to run four consumers:
- snuba-events-subscriptions-consumers
- snuba-transactions-subscriptions-consumers
- events-subscription-results
- transactions-subscription-results
We’re curious: what do these consumers do? It seems like our cluster does ~everything in the Sentry event ingestion critical path without needing these consumers at all.
Finally: Does Snuba emit any metrics related to the rate of events it processes, or is the recommended approach here to observe the lag on the consumer groups directly?