Debugging Snuba consumers lag

Hi, we’re interested in scaling up our install of Sentry, focusing on increasing the throughput of events/min our cluster can handle. This post is split into two sections - the first half explaining our current scaling problem and the second half asking some misc. questions that we didn’t find the answers to in docs or this forum. Sorry for the wall of text, and thanks for being responsive to the onpremise community - we’ve read many of these threads.

This forum won’t let me attach more than one image or paste more than 2 links in this thread, so see Debugging Snuba consumers lag - Google Docs for the pictures and external links for this post.

We don’t have a great working hypothesis for why consumers - both Sentry and Snuba - have fallen over for us in the past. Ingest consumer appears to be very lightweight and we’ve never experienced any backup issues with it. We have a recurring issue where these consumer groups, freshly caught up by resetting to latest message index, will fall behind after 2-3 hours:

  • snuba_consumers
  • transaction_groups
  • snuba-post-processor

Once lag has built up, the consumers will begin to crash loop: see gdoc for stacktrace. We’ll get a bunch of “Error submitting packet, dropping the packet and closing the socket” which we think is unrelated since the warning appears even during normal operation; then we’ll get the “Offset out of Range” error followed by continuous crashlooping until we manually intervene at the Kafka level.

Our topic retention is set to 2G per partition, so we think it’s plausible that consumers crashlooping is roughly time correlated with the point where the cumulative size of the lagged messages is >2G. At 3:30, when the consumers start crashlooping, the most lagged events partition is about 5k messages behind, which could plausibly be about 2G in size.

- see gdoc for image, which shows consumer lag by message offset -

That being said, we don’t think the crashlooping is correlated with a spike in events; we process about 200-300k non-rate limited events per hour, and lag is close to zero in the steady state. There is no spike in events shortly before lag starts to build up indefinitely.

- see gdoc for image, which shows the lagginess of consumers -
- see gdoc for image, which shows event submission traffic is fairly constant -

We’ve allocated post-process-forwarder 6 cores, 14G memory, and two instances. Although, it’s implied in [gdoc link] that the extra cores might be a waste because the post-process-forwarder consumer is not yet multithreaded.

We’ve allocated Snuba consumers 1 core, 1G memory, and 2 instances each. We haven’t tested out more cores yet because the instructions given in [see gdoc for sentry thread] to enable multiprocessing didn’t work for us. We scaled all Sentry related topics up to 20 partitions but the input-block-size and output-block-size flags, set to --processes=6 --input-block-size=6000000000 --output-block-size=6000000000, caused the following error to occur: (see gdoc for stacktrace). We also tried decreasing this number by a factor of 1000x without success. This was particularly strange because there was no shortage of unutilized disk space — we were running on a volume 500G in size, which should be plenty.

I’m testing out Snuba with more instances and memory as we speak, but will note that Snuba did not seem particularly resource constrained on CPU or memory even at the 1 core/1G memory scale.


We’ve also got these misc. questions, which are somewhat unrelated:

In the same sentry.io thread above “subscriptions” consumers are mentioned. We realized upon reading this that we neglected to run four consumers:

  • snuba-events-subscriptions-consumers
  • snuba-transactions-subscriptions-consumers
  • events-subscription-results
  • transactions-subscription-results

We’re curious: what do these consumers do? It seems like our cluster does ~everything in the Sentry event ingestion critical path without needing these consumers at all.

Finally: Does Snuba emit any metrics related to the rate of events it processes, or is the recommended approach here to observe the lag on the consumer groups directly?

Some updates:

  • By monkey patching these two lines [0][1] we were able to work around the crashes at the cost of dropping legitimate events that have been enqueued faster than they could be consumed.
  • We were able to test out the above behavior by spamming Sentry with a lot of events, watching Kafka lag build up on the “events” topic, and confirming that when the lag grew to be too large, that the offsets would auto-reset to latest
  • We tracked down the OSError “disk out of space” errors to insufficient space on /dev/shm. We allocated more memory to that mount point and found that the multiprocessing args no longer cause Snuba to crash. We’re still looking into if any throughput benefit is observed.

With the settings:

      "consumer",
      "--storage=events",
      "--auto-offset-reset=latest",
      "--processes=5",
      "--input-block-size=64000000000",
      "--output-block-size=64000000000",
      "--log-level=debug",
      "--queued-max-messages-kbytes=2097151",
TOPIC_PARTITION_COUNTS = {
    "errors-replacements": 20,
    "event-replacements": 20,
    "events": 20,
    "ingest-attachments": 20,
    "ingest-events": 20,
    "ingest-sessions": 20,
    "ingest-transactions": 20,
    "outcomes": 20,
    "snuba-commit-log": 20,
}

and more than enough memory allocated (queued-max-messages + input block size + output block size, and some headroom), we’re still unable to make a single snuba consumer perform any better than when it was given 1 CPU core and 1G of memory. The way I’ve been testing this is to raise the rate limit for a particularly spammy DSN (so we can identify which snuba-consumer instance to look at) from 1000/min (a rate where consumers are able to keep up with lag) to 3000/min (where lag quickly grows faster than events can be consumed), and to monitor the rate of progress for the consumer, which is constant for both the multiprocessing and non-multiprocessing setups.

I took the time to play around with the other parameters available to us, with no avail.

Raising the batch sizing/interval does appear to impact the batching behavior itself, but does not have an effect on throughput.

Neither did:

  • Increasing the number of cores and processes
  • Increasing --processes > # of logical cpu cores