I’m trying to pin down an issue in my on-premise Sentry instance, deployed via Helm on Kubernetes.
After running for a while, a number of the Snuba consumers (and the consumers in general, except the workers, which are fine) just error out with:
cimpl.KafkaException: KafkaError{code=NOT_COORDINATOR,val=16,str="Commit failed: Broker: Not coordinator"}
Frustratingly, the Kafka instances all look healthy and the partitions and topics also seem OK.
After a restart the components function normally again, but within 1-5 minutes they crash and restart once more.
Any ideas on what could be causing this? ClickHouse, Kafka and ZooKeeper all seem healthy and their logs don’t mention any leadership changes.
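In case it helps anyone else digging into this: NOT_COORDINATOR means the consumer sent its commit to a broker that isn’t (or is no longer) the group coordinator, i.e. the leader of the __consumer_offsets partition that the group id hashes to. The command below is how I’ve been checking whether that leadership is actually moving between brokers; it’s run from inside a Kafka pod and assumes a Kafka version that accepts --bootstrap-server here (older ones want --zookeeper):

# The Leader column shows which broker currently coordinates the consumer
# groups that hash to each __consumer_offsets partition.
./kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets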
On further investigation, it appears that after a day of this I’m now also getting consumer offsets that are out of range:
snuba.utils.streams.backends.abstract.ConsumerError: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
This effectively kills the affected services completely until I reset the group’s offsets on the topic in Kafka, which just isn’t scalable.
Why does this happen?
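For reference, the manual reset I end up running looks roughly like this (the group and topic names are my best guess at the affected ones, and the consumers have to be stopped first, otherwise the tool refuses because the group still has active members):

# Inside a Kafka pod: jump the group's committed offset on 'events' to the
# current end of the log, skipping everything in between.
./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute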
Bumping for visibility here. It’s a weird issue.
If I reset the Kafka partitions completely, it gets fixed, but the problem comes back again after an undetermined period of time.
It looks like my consumer is timing out:
%4|1605225797.460|SESSTMOUT|rdkafka#consumer-2| [thrd:main]: Consumer group session timed out (in join-state started) after 10223 ms without a successful response from the group coordinator (broker 2, last error was Success): revoking assignment and rejoining group
2020-11-13 00:03:21,057 Completed processing <Batch: 107 messages, open for 23.32 seconds>.
2020-11-13 00:03:21,058 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
2020-11-13 00:03:21,059 Error submitting packet, dropping the packet and closing the socket
2020-11-13 00:03:21,355 Dropping staged offset for revoked partition (Partition(topic=Topic(name='events'), index=0))!
2020-11-13 00:03:22,058 Caught AttributeError("type object 'cimpl.KafkaError' has no attribute 'NOT_COORDINATOR_FOR_GROUP'"), shutting down...
But I don’t see a Snuba consumer option for this timeout?
I guess we’re probably hitting Sentry too hard and Kafka seems to be struggling with response times. I’ll see if it’s a resource allocation issue.
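For what it’s worth, this is how I’m checking whether the brokers are starved (the label selector and pod name are just what the Bitnami chart uses in my install, so adjust as needed, and kubectl top requires metrics-server):

# Actual CPU/memory usage of the Kafka pods vs. their requests/limits
kubectl top pods -l app.kubernetes.io/name=kafka
kubectl describe pod sentry-kafka-0 | grep -A 4 -E "Requests|Limits"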
Anyone from the Sentry team experienced this?
(I’m not with the team.) I’m sorry to be a party pooper, but Helm is not supported here, and neither Kafka nor Snuba is a “Sentry product”.
Have you tried this: “Sentry no more catch errors”?
Yes, I know how to resolve it by resetting offsets, but the fundamental issue unfortunately remains. I’m aware the Helm chart isn’t official, but this isn’t related to the chart in any way, except maybe the Kafka configuration.
It looks as though this might be caused by Kafka trying to rebalance the group/topic:
I have no name!@sentry-kafka-0:/opt/bitnami/kafka/bin$ ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group snuba-post-processor --describe
Warning: Consumer group 'snuba-post-processor' is rebalancing.
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
snuba-post-processor events 0 6305995 6729200 423205 - - -
The rebalance seems to be exactly when the specific consumers fail.
However, I don’t understand what triggers the rebalance: the consumer or Kafka? And why is it rebalancing at all? Still trying to work this out.
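In case it helps the next person, these are the two things I’m watching to try to answer that (the group name comes from the describe output above; on recent Kafka versions the broker-side GroupCoordinator log line states a reason, e.g. a new member joining or a heartbeat expiring):

# 1. Group state plus which broker is currently acting as coordinator
./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group snuba-post-processor --describe --state

# 2. Broker-side view of why the group went into PreparingRebalance
kubectl logs sentry-kafka-0 | grep -i "rebalance group snuba-post-processor"

If the logged reason is a heartbeat/session expiring, that would line up with the SESSTMOUT line earlier and point at the consumer (or a coordinator that is too slow to answer it) rather than Kafka deciding to rebalance on its own.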