Will move the conversation here:
re: how the snuba-events-subscriptions-consumer works.
It consumes two topics: the main events topic (all events, errors and transactions, produced by Sentry) and the commit_log topic. The snuba consumer (the one that writes to ClickHouse) writes a mark on the commit log recording the offset it committed for each partition.
The snuba events subscriptions consumer reads an offset from the commit log, then consumes the events topic up to that offset. At every time tick (let's say every minute, to simplify) it executes all the ClickHouse queries that provide metric alerts data to Sentry. Then it pauses until the main consumer has made more progress.
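To make the pacing concrete, here is a simplified sketch (hypothetical function and variable names, not the actual Snuba code) of the core rule: the subscriptions consumer never processes past the offset the main consumer has already committed to the commit log.

```python
def advance(events, committed_offset, last_processed):
    """Consume (offset, event) pairs from the main topic, but never past
    the offset recorded in the commit log by the main snuba consumer."""
    processed = []
    for offset, event in events:
        if offset <= last_processed:
            continue  # already handled on a previous pass
        if offset > committed_offset:
            break  # pause here: wait for the main consumer to make progress
        processed.append(event)
        last_processed = offset
    return processed, last_processed
```

For example, if the commit log says offset 3 was committed, events at offsets 4 and beyond are left untouched until the next commit-log update.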
There are multiple reasons why the subscriptions consumer may be rebalancing continuously, depending on how many replicas of that consumer you are running. I believe you have 10 partitions.
- Does one replica crash for some reason? That would immediately cause a rebalance. We have seen OOM issues at times.
- Is one (or more) of the replicas so overwhelmed that it does not manage to poll from Kafka within the timeout? In that case the broker may exclude the consumer from the group and trigger a rebalance.
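The second case is governed by standard Kafka consumer settings. As a sketch (placeholder broker address; the group id and timeout values here are illustrative, not necessarily what Snuba ships with), these are the two settings involved, in librdkafka/confluent-kafka style:

```python
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "snuba-events-subscriptions-consumer",
    # The broker considers the member dead if no heartbeat arrives
    # within this window.
    "session.timeout.ms": 10000,
    # Maximum time between poll() calls before the member is removed
    # from the group. A consumer stuck on slow ClickHouse queries can
    # exceed this and trigger a rebalance.
    "max.poll.interval.ms": 300000,
}
```

If the logs show members being fenced for exceeding the poll interval, that points at slow query execution rather than crashes.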
To nail down the cause we would need more information than what was provided in this issue. Specifically:
- How many replicas of the consumer are you running?
- How often does the rebalancing happen? The consumer log would show partition assignments. Would you mind providing logs over a longer period of time? What was provided in the other issue only contains an exception, which is hard to contextualize.
- Also, any chance you could run the consumer in debug mode (`--log-level DEBUG`) and provide the full log from the start to the crash?
- Do you have a lot of metric alerts configured? Maybe the bottleneck is running those queries against ClickHouse. We recently made a change to even out the load on the DB, which should help in those cases. Here and here
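I don't know the exact shape of those changes, but the general technique for evening out this kind of load is to jitter each subscription's query deterministically across the tick window instead of firing everything at the top of the minute. A minimal sketch (hypothetical function, assuming a 60-second tick):

```python
import hashlib

def scheduled_offset_seconds(subscription_id: str, tick_seconds: int = 60) -> int:
    """Spread subscription queries evenly across the tick window by
    hashing the subscription id: each subscription always fires at the
    same second within the window, but the whole set is spread out
    instead of hitting ClickHouse simultaneously."""
    digest = hashlib.md5(subscription_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % tick_seconds
```

With many alerts configured, this kind of spreading turns one query spike per minute into a roughly uniform trickle.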
Hope this helps.