We had a machine failure and had to restore from backup. The site seems to work, except that it rejects every error message we send to it. My best guess is that the Kafka server is not working, because we’re getting a lot of messages like this in the logs:
[2020-08-04 20:38:00,535] ERROR [Controller id=1002 epoch=212] Controller 1002 epoch 212 failed to change state for partition __consumer_offsets-40 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-40 under strategy OfflinePartitionLeaderElectionStrategy(false)
at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:427)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:424)
at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:335)
at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:236)
at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:73)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:58)
at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:41)
at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:306)
at kafka.controller.KafkaController.elect(KafkaController.scala:1404)
at kafka.controller.KafkaController.processStartup(KafkaController.scala:1291)
at kafka.controller.KafkaController.process(KafkaController.scala:1924)
at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
I actually tried to remove the sentry-kafka docker volume and recreate it, but the error message seems to be the same.
We’re currently running the development 20.0.8 version. Unfortunately, we accidentally upgraded to that stream and it ran some migrations, so we weren’t sure of the best way to downgrade to 20.0.7.
Thoughts on what can be done to fix this?