Kafka is failing to change state to Online

We had a machine failure and had to restore from a backup. The site seems to work, except it's rejecting all the error events we send to it. My best guess is that the Kafka server isn't working; we're getting a lot of messages like this in the logs:

[2020-08-04 20:38:00,535] ERROR [Controller id=1002 epoch=212] Controller 1002 epoch 212 failed to change state for partition __consumer_offsets-40 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-40 under strategy OfflinePartitionLeaderElectionStrategy(false)
	at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:427)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:424)
	at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:335)
	at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:236)
	at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
	at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:73)
	at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:58)
	at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:41)
	at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:306)
	at kafka.controller.KafkaController.elect(KafkaController.scala:1404)
	at kafka.controller.KafkaController.processStartup(KafkaController.scala:1291)
	at kafka.controller.KafkaController.process(KafkaController.scala:1924)
	at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
	at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
	at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
	at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

I actually tried removing the sentry-kafka Docker volume and recreating it, but the error message seems to be the same.

We're currently running the 20.0.8 development version. (Unfortunately, we accidentally upgraded to that stream and it ran some migrations, so we weren't sure of the best way to downgrade to 20.0.7.)

Thoughts on what can be done to fix this?

I'm here with similar issues. I've tried a bunch of things, such as removing volumes and pruning Kafka topics, but the issue still persists.

I haven't found clear documentation that explains how I can export/import my basic settings to preserve my DSNs, so that I can wipe the Docker containers completely and start from scratch.


I found a fix. I'm not sure what kind of data loss it incurs, but I was willing to take the risk.

It seems that the Kafka cluster depends on ZooKeeper to bootstrap, so I assumed my ZooKeeper data was in a bad/unrecoverable state. What I did was shut down the services and delete both Docker volumes related to them.

docker volume rm sentry-kafka
docker volume rm sentry-zookeeper

You may have to remove the kafka_1 and other dependent containers before Docker lets you remove the volumes. Then re-create the volumes.
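If Docker refuses to remove a volume because a container still references it, removing those containers first might look something like this (the exact container names depend on your compose project name, so `sentry_onpremise_kafka_1` and `sentry_onpremise_zookeeper_1` here are assumptions; check yours with `docker ps -a`):

```shell
# Stop the stack first so nothing is writing to the volumes.
docker-compose down

# Force-remove the containers that hold references to the volumes.
# Container names are assumptions -- list yours with `docker ps -a`.
docker rm -f sentry_onpremise_kafka_1 sentry_onpremise_zookeeper_1

# Now the volumes can be removed and re-created.
docker volume rm sentry-kafka sentry-zookeeper
docker volume create sentry-kafka
docker volume create sentry-zookeeper
```

`docker rm -f` stops and removes the containers in one step; after that, `docker volume rm` should no longer complain that the volumes are in use.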

docker volume create sentry-kafka
docker volume create sentry-zookeeper

Then boot everything back up. Once I did, the Kafka server loaded up fine and errors posted to the server started processing again.
