Performance metric alarm related to Snuba-events-subscriptions-consumers

The Snuba-events-subscriptions-consumers service in our internally built on-premise Sentry does not work properly. As a result, alerts based on metric changes in the Performance tab do not fire correctly. I’ve already asked a couple of questions about this issue before, but we have not managed to get it back to normal.

Here is a summary of the situations in which I have seen Kafka offset problems:

  1. When the inflow is much higher than the throughput
  2. During post-processing, when ClickHouse does not operate normally

What about Snuba-events-subscriptions-consumers? I wonder how this consumer works.

@BYK previously mentioned two people, @markstory and @fpacifici, regarding this issue. If you know anything, please let me know.

Will move the conversation here:
re: how the snuba-events-subscriptions-consumer works.
It consumes two topics: the main events topic (all events, both errors and transactions, produced by Sentry) and the commit_log topic. The Snuba consumer (the one that writes to ClickHouse) writes a mark on the commit log recording the offset it committed for each partition.
The Snuba events subscriptions consumer gets an offset from the commit log, then consumes the events topic up to that offset and, for every time tick (let’s say every minute, to simplify), executes all the ClickHouse queries that provide metric alert data to Sentry. It then pauses until the main consumer has made some progress.
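
To make that flow concrete, here is a minimal sketch in Python. It is a toy model, not Snuba's actual code: the class name, its fields, and the convention that the commit log offset is an exclusive upper bound are all assumptions made for illustration. It only shows the three steps described above: learn an offset from the commit log, read events up to it, and run queries each time a tick boundary is crossed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubscriptionTickSketch:
    """Toy model of the flow described above; not Snuba's real implementation."""

    tick_seconds: int = 60            # assumed tick length ("every minute")
    committed_offset: int = 0         # hypothetical: events below this offset are in ClickHouse
    next_offset: int = 0              # next events-topic offset this consumer will read
    last_tick: Optional[int] = None   # index of the most recent tick seen
    executed_ticks: List[int] = field(default_factory=list)

    def on_commit_log(self, offset: int) -> None:
        # The main consumer reports progress; offsets never move backwards here.
        self.committed_offset = max(self.committed_offset, offset)

    def consume(self, event_timestamps: List[int]) -> None:
        # event_timestamps[i] is the timestamp (in seconds) of the event at offset i.
        limit = min(self.committed_offset, len(event_timestamps))
        while self.next_offset < limit:
            tick = event_timestamps[self.next_offset] // self.tick_seconds
            if self.last_tick is None:
                self.last_tick = tick
            elif tick > self.last_tick:
                # One entry per tick boundary crossed; the real consumer would
                # run the metric alert queries against ClickHouse at this point.
                self.executed_ticks.extend(range(self.last_tick + 1, tick + 1))
                self.last_tick = tick
            self.next_offset += 1
        # Leaving the loop models the pause: nothing more happens until
        # on_commit_log() raises committed_offset again.
```

In this model, calling `consume()` repeatedly without a new commit log entry does nothing, which is the "pauses until the main consumer has made some progress" behaviour. It also shows why a stalled main consumer stalls metric alerts: no new commit log offsets means no new ticks.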

There are multiple reasons why the subscriptions consumer may be rebalancing continuously, depending on how many replicas of that consumer you are running. I believe you have 10 partitions.

  • Does one replica crash for some reason? That would immediately cause a rebalance. We saw OOM issues at times.
  • Is one (or more) of the replicas so overwhelmed that it does not manage to poll from Kafka within the timeout? In that case the broker may exclude the consumer and trigger a rebalance.
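
For the second case, one rough way to check is to compare the gaps between successive polls against the poll timeout. The sketch below assumes you can extract poll timestamps (in seconds) from your own logs or instrumentation; the function name is hypothetical. The 300 second default mirrors the usual default of Kafka's `max.poll.interval.ms` setting, but whether that is the timeout in effect in your deployment is an assumption you should verify against your consumer configuration.

```python
from typing import List, Tuple


def find_poll_gaps(
    poll_times_s: List[float], max_poll_interval_s: float = 300.0
) -> List[Tuple[float, float, float]]:
    """Return (previous_poll, next_poll, gap) for every gap above the limit.

    poll_times_s must be sorted in ascending order. Each returned entry marks
    a window in which the consumer could have been excluded from the group.
    """
    gaps = []
    for prev, cur in zip(poll_times_s, poll_times_s[1:]):
        gap = cur - prev
        if gap > max_poll_interval_s:
            gaps.append((prev, cur, gap))
    return gaps
```

For example, `find_poll_gaps([0, 60, 120, 500, 560])` flags the window between 120 and 500, a 380 second gap. If such windows line up with the rebalances in the consumer log, slow processing between polls is a more likely cause than a crash.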

To nail down the cause we would need more information than what has been provided in this issue. Specifically:

  • How many replicas of the consumer are you running?
  • How often does the rebalancing happen? The consumer log would show partition assignments. Would you mind providing logs covering a longer period of time? What was provided in the other issue only contains an exception, which is hard to put into context.
  • Also, any chance you could run the consumer in debug mode (--log-level DEBUG) and provide the full log from the start to the crash?
  • Do you have a lot of metric alerts configured? The bottleneck may be running those queries against ClickHouse. We recently made a change to even out the load on the DB, which should help in those cases. Here and here

Hope this helps.
Filippo

Thanks so much for the detailed explanation! I’m sure it will be a great help to us in resolving this issue. We are now setting up a separate RC environment for various tests, including tests for this issue. The log collection and testing you mentioned will be done once the RC environment is built. I will build the RC environment next week, check this out, and reply again. Thank you @fpacifici
