Snuba Consumers - Kafka Issues (NOT_COORDINATOR_FOR_GROUP) causing process restart

I’m trying to pinpoint this issue in my on-premise Sentry instance deployed via Helm on Kubernetes.

After running for a while, a number of the Snuba consumers (and consumers in general, except the workers, which are fine) just error out with:

cimpl.KafkaException: KafkaError{code=NOT_COORDINATOR,val=16,str="Commit failed: Broker: Not coordinator"}

Frustratingly, the Kafka instances all look healthy and the partitions and topics also seem OK.

After restarting, the components function as normal, but after 1-5 minutes they crash and restart again.

Any ideas on what could be causing this? ClickHouse, Kafka and ZooKeeper all seem healthy and their logs don’t mention any leadership changes.

Looking further, after a day of this I’m now getting offsets that are out of range:

snuba.utils.streams.backends.abstract.ConsumerError: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

This effectively kills the services completely until I reset the consumer group’s offset on the topic in Kafka, which just isn’t scalable.
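For anyone else hitting this, the reset-to-latest workaround looks roughly like the sketch below, done with the confluent-kafka Python client instead of kafka-consumer-groups.sh. The broker address, group id and topic name are assumptions from my setup, and every consumer in the group has to be stopped first.

from confluent_kafka import Consumer, TopicPartition

# Sketch of the offset-reset workaround. Broker address, group id and topic
# name are placeholders; stop every consumer in the group before running this,
# otherwise the commit can be rejected.
conf = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "snuba-consumers",
    "enable.auto.commit": False,
}
consumer = Consumer(conf)

topic = "events"
metadata = consumer.list_topics(topic, timeout=10)

offsets = []
for partition_id in metadata.topics[topic].partitions:
    tp = TopicPartition(topic, partition_id)
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    # Commit the high watermark so the group resumes at the newest message,
    # skipping the range the broker no longer has.
    offsets.append(TopicPartition(topic, partition_id, high))

consumer.commit(offsets=offsets, asynchronous=False)
consumer.close()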

Why does this happen?

Bumping for visibility here. It’s a weird issue.

If I reset Kafka partitions completely, it gets fixed but comes back again after a period of time (undetermined).

It looks like my consumer is timing out:

%4|1605225797.460|SESSTMOUT|rdkafka#consumer-2| [thrd:main]: Consumer group session timed out (in join-state started) after 10223 ms without a successful response from the group coordinator (broker 2, last error was Success): revoking assignment and rejoining group
2020-11-13 00:03:21,057 Completed processing <Batch: 107 messages, open for 23.32 seconds>.
2020-11-13 00:03:21,058 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
2020-11-13 00:03:21,059 Error submitting packet, dropping the packet and closing the socket
2020-11-13 00:03:21,355 Dropping staged offset for revoked partition (Partition(topic=Topic(name='events'), index=0))!
2020-11-13 00:03:22,058 Caught AttributeError("type object 'cimpl.KafkaError' has no attribute 'NOT_COORDINATOR_FOR_GROUP'"), shutting down...

But I don’t see a Snuba consumer option for the session timeout?
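For what it’s worth, that timeout is a librdkafka client property rather than a Snuba CLI flag. As an illustration only (whether and how these can be injected into the Snuba consumers depends on the deployment, and the values below are placeholders, not tuned recommendations), the relevant properties on a confluent-kafka consumer look like this:

from confluent_kafka import Consumer

# Illustration of the librdkafka properties behind the SESSTMOUT error above.
# These are client-level settings, not documented Snuba options.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "snuba-consumers",
    # Time the coordinator waits for heartbeats before evicting the member and
    # triggering a rebalance; the ~10 s in the log above suggests the default
    # was in effect.
    "session.timeout.ms": 30000,
    # Heartbeats are sent from a background thread; keep this well below the
    # session timeout.
    "heartbeat.interval.ms": 10000,
    # Maximum gap between poll() calls before the member is kicked out;
    # long-running batch flushes can exceed this.
    "max.poll.interval.ms": 300000,
})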

I guess we’re probably hitting Sentry too hard and Kafka seems to be struggling with response times. I’ll see if it’s a resource allocation issue.

Anyone from the Sentry team experienced this?

(I’m not with the team.) I’m sorry to be a party pooper, but Helm is not supported here, and neither Kafka nor Snuba is a “Sentry product”.
Have you tried this: Sentry no more catch errors?

Yes, I know how to resolve it by resetting offsets, but the fundamental issue unfortunately stands. I’m aware the Helm chart isn’t official, but this isn’t related to the chart in any way, except maybe the Kafka configuration.

It looks as though this might be caused by Kafka trying to rebalance the group/topic:

I have no name!@sentry-kafka-0:/opt/bitnami/kafka/bin$ ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group snuba-post-processor --describe

Warning: Consumer group 'snuba-post-processor' is rebalancing.

GROUP                TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
snuba-post-processor events          0          6305995         6729200         423205          -               -               -

The rebalance is exactly when the specific consumers seem to be failing.

However, I don’t understand what triggers the rebalance: the consumer or Kafka? And why is it rebalancing? Still trying to work this out.
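To at least see when the group enters a rebalance and correlate that with the consumer crashes, a small watcher like the sketch below can help. It only observes the group from outside (it doesn’t join it), but it assumes a recent confluent-kafka that exposes describe_consumer_groups on the admin client; the broker address and group id are taken from my output above.

import time
from confluent_kafka.admin import AdminClient

# Print the consumer group's state every few seconds so rebalances can be
# lined up against consumer restarts and broker logs.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
group = "snuba-post-processor"

while True:
    future = admin.describe_consumer_groups([group])[group]
    description = future.result(timeout=10)
    print(time.strftime("%H:%M:%S"), description.state, len(description.members), "members")
    time.sleep(5)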