Kafka is failing to change state to Online

We had a machine failure and had to restore from a backup. The site seems to work, except it's rejecting all the error events we send to it. My best guess is that the Kafka server isn't working; we're getting a lot of messages like this in the logs:

[2020-08-04 20:38:00,535] ERROR [Controller id=1002 epoch=212] Controller 1002 epoch 212 failed to change state for partition __consumer_offsets-40 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-40 under strategy OfflinePartitionLeaderElectionStrategy(false)
	at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:427)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:424)
	at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:335)
	at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:236)
	at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
	at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:73)
	at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:58)
	at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:41)
	at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:306)
	at kafka.controller.KafkaController.elect(KafkaController.scala:1404)
	at kafka.controller.KafkaController.processStartup(KafkaController.scala:1291)
	at kafka.controller.KafkaController.process(KafkaController.scala:1924)
	at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
	at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
	at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
	at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

I actually tried removing the sentry-kafka Docker volume and recreating it, but the error message seems to be the same.

We're currently running the 20.0.8 development version. (Unfortunately, we accidentally upgraded to that stream and it ran some migrations, so we weren't sure of the best way to downgrade to 20.0.7.)

Thoughts on what can be done to fix this?

I'm here with similar issues. I've tried a bunch of things, such as removing volumes and pruning Kafka topics, but the issue still persists.

I haven't found clear documentation that explains how I can export/import my basic settings to preserve my DSNs, so that I can wipe the Docker containers completely and start from scratch.


I found a fix. I'm not sure what kind of data loss it incurs, but I was willing to take the risk.

It seems that the Kafka cluster depends on ZooKeeper to bootstrap, so I assumed my ZooKeeper data was in a bad/unrecoverable state. What I did was shut down the services and delete both Docker volumes related to them.

docker volume rm sentry-kafka
docker volume rm sentry-zookeeper

You may have to remove the kafka_1 and other dependent containers before Docker lets you remove the volumes. Then re-create the volumes.
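If Docker refuses to remove a volume because a container still references it, removing those containers first might look something like this (the exact container names depend on your compose project name, so `sentry_onpremise_kafka_1` and `sentry_onpremise_zookeeper_1` here are assumptions; check yours with `docker ps -a`):

```shell
# Stop the stack first so nothing is writing to the volumes.
docker-compose down

# Force-remove the containers that hold references to the volumes.
# Container names are assumptions -- list yours with `docker ps -a`.
docker rm -f sentry_onpremise_kafka_1 sentry_onpremise_zookeeper_1

# Now the volumes can be removed and re-created.
docker volume rm sentry-kafka sentry-zookeeper
docker volume create sentry-kafka
docker volume create sentry-zookeeper
```

`docker rm -f` stops and removes the containers in one step; after that, `docker volume rm` should no longer complain that the volumes are in use.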

docker volume create sentry-kafka
docker volume create sentry-zookeeper

Then boot everything back up. Once I did, the Kafka server loaded up fine and errors posted to the server started processing again.
