Performance metric alarm related to Snuba-events-subscriptions-consumers, Snuba-transactions-subscriptions-consumers

This question is related to this link: Performance metric alarm related to Snuba-events-subscriptions-consumers - #4

@fpacifici, it’s been a long time since my last inquiry. In the meantime, I have been operating Sentry with this problem unresolved, due to other tasks.
As mentioned in the previous inquiry, I tried to collect more detailed logs, but to no avail. The log looks like this:

2021-09-09 12:03:26,986 Initialized processing strategy: <snuba.utils.streams.processing.strategies.batching.BatchProcessingStrategy object at 0x7fc3b087f430>
2021-09-09 12:03:28,951 Caught KeyError(1), shutting down...
2021-09-09 12:03:28,951 Terminating <snuba.utils.streams.processing.strategies.batching.BatchProcessingStrategy object at 0x7fc3b087f430>...
2021-09-09 12:03:28,951 Closing <snuba.subscriptions.consumer.TickConsumer object at 0x7fc3b082af10>...
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions.py", line 224, in subscriptions
    batching_consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 147, in _run_once
    self.__processing_strategy.submit(self.__message)
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/batching.py", line 152, in submit
    result = self.__worker.process_message(message)
  File "/usr/src/snuba/snuba/subscriptions/worker.py", line 105, in process_message
    for task in self.__schedulers[message.partition.index].find(tick.timestamps)
KeyError: 1
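For illustration, a `KeyError` like this is exactly what a scheduler-per-partition lookup produces when the consumer was configured for fewer partitions than the topic actually has. A minimal sketch in plain Python (hypothetical names, not the actual Snuba classes):

```python
# Hypothetical sketch: the worker keeps one scheduler per expected partition
# and looks schedulers up by the partition index of each incoming message.
expected_partitions = 1   # what the consumer was configured with
schedulers = {i: f"scheduler-{i}" for i in range(expected_partitions)}  # {0: ...}

incoming_partition = 1    # the topic actually has 20 partitions (0..19)
try:
    scheduler = schedulers[incoming_partition]
except KeyError as exc:
    # Matches the log line above.
    print(f"Caught KeyError({exc}), shutting down...")  # -> Caught KeyError(1), shutting down...
```

Any message from partition 1 or above would hit the same lookup failure, which is why the consumer dies immediately after initialization.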

Here is the status of our Sentry deployment, answering the list of questions you asked.

  • Does one replica crash for some reason? That would immediately cause a rebalance. We saw OOM issues at times.
    => Rather than crashing, our understanding is that after being assigned a partition within the topic, the consumer cannot consume at all. This applies to both the snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers groups. The current state is that the offsets are not advancing because the partitions are simply not being consumed, rather than constant rebalancing.

  • If one (or more) of the replicas is so overwhelmed that it does not manage to poll from Kafka within the timeout, the broker may exclude the consumer and trigger a rebalance.
    => Currently we run a lot of Snuba (one node contains all Snuba-related containers). For parallel processing, some Kafka topics have more than 20 partitions. My guess is that this isn’t a Kafka problem, but a problem with how snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers consume their topics. We increased the partition counts of the Kafka topics beyond the configuration in the existing repository; could that have caused a problem in the containers above, which operate on message “order”? As you said in a previous article, snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers consume two topics, events and commit_log, but we only increased the partitions of the events topic. The other topics still have only one partition.

  • How many replicas of the consumer are you running?
    => The topics “ingest-events”, “ingest-transactions”, “events”, and “outcomes” use 20 partitions; the rest of the topics use 1 partition. I don’t think the “order” of events is guaranteed when there are 2 or more partitions; would this affect this issue?
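As a side note on ordering: Kafka only guarantees order within a single partition, but with keyed production the same key always lands in the same partition, so per-key order survives even with 20 partitions. A minimal sketch of that behavior (plain Python, illustrative only; `partition_for` is a stand-in for Kafka’s default keyed partitioner, not its actual implementation):

```python
from collections import defaultdict

NUM_PARTITIONS = 20

def partition_for(key: bytes) -> int:
    # Stand-in for Kafka's default keyed partitioner: the same key always
    # maps to the same partition, so per-key ordering is preserved.
    return hash(key) % NUM_PARTITIONS

# Append each message's send-order sequence number to its partition.
partitions = defaultdict(list)
for seq, key in enumerate([b"proj-1", b"proj-2", b"proj-1", b"proj-1"]):
    partitions[partition_for(key)].append(seq)

# Every message keyed b"proj-1" landed in the same partition, in send order.
p1 = partition_for(b"proj-1")
assert all(seq in partitions[p1] for seq in (0, 2, 3))
```

Messages with different keys may end up interleaved across partitions, which is why only per-key (not global) ordering holds once a topic has more than one partition.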

  • How often does the rebalancing happen? The consumer log would show partition assignments. Would you mind providing logs over a longer period of time? What was provided in the other issue only contains an exception, which is hard to contextualize.
    => I will collect more logs about the rebalancing situation. For now, the biggest problem is that the topic itself is not consumed and remains stuck.

  • Also, any chance you could run the consumer in debug mode (--log-level DEBUG) and provide the full log from the start to the crash?
    => The log is as above: it stops immediately after “Initialized processing strategy”. The debug-level log above is from our RC environment; in our production environment, the following log is produced.

2021-07-24 06:37:36,124 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...

  • Do you have a lot of metric alerts configured? Maybe the bottleneck is running those queries against ClickHouse. We recently made a change to even out the load on the DB which should help in those cases. Here and here
    => Not right now, but I think there will be more in the future, so we are planning a version upgrade as you suggested.

This frame in your stack trace could explain the whole issue.
The subscriptions consumer creates a scheduler per partition to execute the subscriptions. Unfortunately, the subscriptions consumer is currently not able to infer the number of partitions of your topic (we are working on that).
It seems the subscriptions consumer started believing the topic has only 1 partition (while you have 20).
There are two ways to set the number of partitions the subscriptions consumer should work with:

That would explain the consumer crash and the continuous rebalancing. If the system keeps crashing that way, then after 24 hours (or whatever retention period you set in Kafka) you will see the OFFSET_OUT_OF_RANGE message, because the offset the consumer expects has been dropped from the log.
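The retention mechanics can be pictured with a toy model (plain Python, not the broker’s actual logic): while the consumer stays stuck, retention keeps advancing the log start offset, and once the stuck offset falls below it, the next fetch fails with OFFSET_OUT_OF_RANGE:

```python
class OffsetOutOfRange(Exception):
    pass

def fetch(next_offset: int, log_start: int, log_end: int) -> int:
    # The broker can only serve offsets still present in the log; retention
    # keeps advancing log_start as old segments are deleted.
    if not (log_start <= next_offset <= log_end):
        raise OffsetOutOfRange(
            f"offset {next_offset} outside retained range [{log_start}, {log_end}]"
        )
    return next_offset

fetch(100, 50, 500)  # OK: the consumer's offset is still retained
try:
    # After the retention period, log_start has moved past the stuck offset.
    fetch(100, 300, 500)
except OffsetOutOfRange as exc:
    print(f"Caught OffsetOutOfRange({exc!r}), shutting down...")
```

This is why fixing the partition-count mismatch matters quickly: every day the consumer stays stuck, more of the unconsumed data ages out of Kafka.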