This question is a follow-up to this topic: Perfomance metric alarm related to Snuba-events-subscriptions-consumers - #4
@fpacifici, it has been a while since my last inquiry. In the meantime, I kept operating Sentry while leaving this problem unaddressed because of other tasks.
As mentioned in the previous inquiry, I tried to collect more detailed logs, but without much success. The log looks like this:
```
2021-09-09 12:03:26,986 Initialized processing strategy: <snuba.utils.streams.processing.strategies.batching.BatchProcessingStrategy object at 0x7fc3b087f430>
2021-09-09 12:03:28,951 Caught KeyError(1), shutting down...
2021-09-09 12:03:28,951 Terminating <snuba.utils.streams.processing.strategies.batching.BatchProcessingStrategy object at 0x7fc3b087f430>...
2021-09-09 12:03:28,951 Closing <snuba.subscriptions.consumer.TickConsumer object at 0x7fc3b082af10>...
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions.py", line 224, in subscriptions
    batching_consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 147, in _run_once
    self.__processing_strategy.submit(self.__message)
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/batching.py", line 152, in submit
    result = self.__worker.process_message(message)
  File "/usr/src/snuba/snuba/subscriptions/worker.py", line 105, in process_message
    for task in self.__schedulers[message.partition.index].find(tick.timestamps)
KeyError: 1
```
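My reading of the traceback (an assumption on my part, not a confirmed diagnosis) is that `process_message` looks up a scheduler by `message.partition.index`, and the schedulers mapping only covers the partitions the command was configured for. A minimal sketch of that failure mode, with all names invented for illustration:

```python
# A minimal sketch (not Snuba's actual code) of the failure mode suggested
# by the traceback: schedulers are keyed by partition index, so a message
# from a partition that no scheduler was built for raises KeyError.
from typing import Mapping


class Worker:
    def __init__(self, schedulers: Mapping[int, str]) -> None:
        self.__schedulers = schedulers

    def process_message(self, partition_index: int) -> None:
        # Raises KeyError when a message arrives from a partition
        # that no scheduler was built for.
        scheduler = self.__schedulers[partition_index]
        print(f"dispatching partition {partition_index} to {scheduler}")


worker = Worker({0: "scheduler-0"})  # built for a single partition
worker.process_message(0)            # works
worker.process_message(1)            # KeyError: 1, as in the traceback
```

If that reading is right, it would tie this crash to the partition-count changes described below.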
Here is the status of our Sentry deployment against the list of questions you asked.

- Does one replica crash for some reason? That would immediately cause a rebalance. We saw OOM issues at times.
=> Rather than crashing, our understanding is that after being assigned partitions of the topic, the consumers cannot consume at all. This applies to both the snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers groups. The current state is that the offsets are not advancing because the partitions are never consumed, rather than a rebalance happening.
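To make "the offsets are not advancing" concrete, here is a minimal sketch (not part of Snuba; the broker address and partition count are assumptions for our setup) that compares each partition's committed offset with the broker's high watermark using confluent-kafka. A committed offset that never moves while the watermark keeps growing confirms the group is stuck:

```python
# Inspect committed offsets vs. high watermarks for the stuck group.
# Assumptions: broker address "kafka:9092", topic "events", 20 partitions.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",  # assumed broker address
    "group.id": "snuba-events-subscriptions-consumers",
    "enable.auto.commit": False,        # read-only inspection
})

topic = "events"
partitions = [TopicPartition(topic, p) for p in range(20)]

for committed in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(committed, timeout=10)
    lag = high - committed.offset if committed.offset >= 0 else high - low
    print(f"partition={committed.partition} committed={committed.offset} "
          f"high_watermark={high} lag={lag}")

consumer.close()
```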
- Is one (or multiple) of the replicas so overwhelmed that it does not manage to poll from Kafka within the timeout? Then the broker may be excluding the consumer and triggering a rebalance.
=> We currently run a lot of Snuba components: some Kafka topics have more than 20 partitions for parallel processing, and one node runs all the Snuba-related containers. My guess is that this is not a Kafka problem but a problem with how snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers consume their topics. We increased the partition counts of the Kafka topics beyond what the stock repository configuration uses; did that break the "ordered" consumption these containers rely on? As you said in a previous post, snuba-transactions-subscriptions-consumers and snuba-events-subscriptions-consumers consume two topics, the events topic and the commit log, but we only increased the partitions of the events topic; every other topic still has a single partition.
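Since the mismatch between the events topic and the commit log seems central here, this is a small sketch (assumptions: the broker address, and that the commit log topic is named "snuba-commit-log" in our deployment) to print the partition count of each topic the subscriptions consumer reads:

```python
# Compare partition counts of the two topics the subscriptions consumer
# reads; a mismatch is exactly the asymmetry described above.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka:9092"})  # assumed address
metadata = admin.list_topics(timeout=10)

for topic in ("events", "snuba-commit-log"):  # commit log name assumed
    topic_metadata = metadata.topics.get(topic)
    count = len(topic_metadata.partitions) if topic_metadata else None
    print(f"{topic}: {count} partition(s)")
```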
- How many replicas of the consumer are you running?
=> The "ingest-events", "ingest-transactions", "events", and "outcomes" topics use 20 partitions; the rest of the topics use 1 partition. I don't think the "order" of events is guaranteed once a topic has 2 or more partitions; could that be affecting this issue? (A sketch of the ordering point follows below.)
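On the ordering point, a minimal sketch (not Sentry code; broker address assumed) of the general Kafka behavior: ordering is only guaranteed within a single partition, so messages produced with the same key stay ordered relative to each other, while unkeyed messages spread across all 20 partitions and can interleave arbitrarily:

```python
# Kafka preserves order only within one partition. Producing with a
# consistent key hashes related messages to the same partition, so they
# are consumed in production order; unkeyed messages are not.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})  # assumed address

for i in range(5):
    # Same key -> same partition -> consumed in production order.
    producer.produce("events", key=b"project-42", value=f"event-{i}".encode())

producer.flush()
```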
- How often does the rebalancing happen? The consumer log would show partition assignments. Would you mind providing logs over a longer period of time? What was provided in the other issue only contains an exception, which is hard to contextualize.
=> I will collect more logs about the rebalancing situation. For now, the topics themselves are simply not consumed and remain stuck; this is the biggest problem.
- Also, any chance you could run the consumer in debug mode (--log-level DEBUG) and provide the full log from the start to the crash?
=> The log is the one shown above: after "Initialized processing strategy", the consumer stops almost immediately. That debug-level log is from our RC environment; in our production environment, the following log is produced instead:
```
2021-07-24 06:37:36,124 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
```
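For context on this error (my understanding of librdkafka, not something Snuba-specific): OffsetOutOfRange typically means the committed offset points at data the broker has already deleted, which is plausible if the group stayed stuck past the topic's retention window. The `auto.offset.reset` consumer setting decides what happens in that case; how (or whether) a given Snuba version exposes it is something I would need to check. A minimal sketch, with the broker address assumed:

```python
# librdkafka setting that governs this error: when the committed offset
# points at data the broker has already deleted, "auto.offset.reset"
# decides where the consumer resumes instead of shutting down.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",  # assumed broker address
    "group.id": "snuba-events-subscriptions-consumers",
    # "latest" skips to new messages; "earliest" replays what is retained;
    # "error" surfaces OFFSET_OUT_OF_RANGE, as in the log above.
    "auto.offset.reset": "latest",
})
consumer.subscribe(["events"])
```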
- Do you have a lot of metric alerts configured? Maybe the bottleneck is running those queries against ClickHouse. We recently made a change to even out the load on the DB, which should help in those cases. Here and here
=> Not right now, but I expect more alerts in the future, so we are planning a version upgrade as you suggested.