Sentry stops processing events after upgrade from 10.0 to 20.8.0.dev0ba2aa70

Standard Docker setup.

After the upgrade, events were accepted for a few hours and then stopped. Processing started again after a server restart, then stopped again. A server restart no longer helps.

I have Kafka-related errors in the consumer logs:

+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 19:05:49,717 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 19:05:54,980 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
%3|1597086403.040|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1597086403.051|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 12ms in state CONNECT)
%3|1597086404.037|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%3|1597086404.037|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 146, in consumer
    consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 132, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 138, in _run_once
    msg = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 674, in poll
    return super().poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 412, in poll
    raise ConsumerError(str(error))
snuba.utils.streams.consumer.ConsumerError: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="JoinGroup failed: Broker: Coordinator load in progress"}
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 19:07:12,593 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 20:17:24,210 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
%3|1597090682.504|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 1ms in state CONNECT)
%3|1597090682.510|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 2ms in state CONNECT)
%3|1597090690.505|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 8 identical error(s) suppressed)
%3|1597090690.508|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 8 identical error(s) suppressed)
%3|1597090720.508|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 29 identical error(s) suppressed)
%3|1597090720.511|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 29 identical error(s) suppressed)
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 146, in consumer
    consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 132, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 138, in _run_once
    msg = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 674, in poll
    return super().poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 412, in poll
    raise ConsumerError(str(error))
snuba.utils.streams.consumer.ConsumerError: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="JoinGroup failed: Broker: Coordinator load in progress"}
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 20:19:00,698 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 22:05:42,556 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 22:05:46,575 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}

But Kafka itself looks to be working fine.


This looks like an issue with Kafka, but I really am not a Kafka expert. Someone else reported a similar issue earlier and my suggestion was to basically nuke Kafka: https://github.com/getsentry/sentry/issues/16534#issuecomment-576899686

Thanks for the help!
Last night I tried to nuke Kafka and ZooKeeper, like this: https://github.com/getsentry/onpremise/issues/502#issuecomment-644590939

Now I have errors in the Kafka log:

[2020-08-10 23:17:50,220] ERROR [Controller id=1002 epoch=60] Controller 1002 epoch 60 failed to change state for partition __consumer_offsets-1 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-1 under strategy OfflinePartitionLeaderElectionStrategy(false)
at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:427)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:424)
at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:335)
at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:236)
at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:73)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:58)
at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:41)
at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:306)
at kafka.controller.KafkaController.elect(KafkaController.scala:1404)
at kafka.controller.KafkaController.processStartup(KafkaController.scala:1291)
at kafka.controller.KafkaController.process(KafkaController.scala:1924)
at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2020-08-10 23:17:50,220] ERROR [Controller id=1002 epoch=60] Controller 1002 epoch 60 failed to change state for partition __consumer_offsets-5 from OfflinePartition to OnlinePartition (state.change.logger)

That comment seems a bit… lacking :smiley:

For a full Kafka nuke, I think this is what you need:
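As a rough sketch of what a full reset involves (the exact volume names depend on your onpremise checkout; the names below match the stock docker-compose setup mentioned later in this thread, so verify against `docker volume ls` first):

```shell
# Stop everything so no container holds the volumes open.
docker-compose down

# Remove the Kafka and ZooKeeper data volumes AND the -log volumes,
# so no stale broker/coordinator state survives the reset.
docker volume rm sentry-kafka sentry-zookeeper \
    sentry_onpremise_sentry-kafka-log sentry_onpremise_sentry-zookeeper-log

# Recreate the volumes and topics, then bring the stack back up.
./install.sh
docker-compose up -d
```

The key point (which also comes up below) is that removing only the data volumes while keeping the log volumes leaves ZooKeeper and Kafka with inconsistent state.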

Sorry for the multiple hops for this. We are going to document all this soon.

Well, the code I quoted and executed deleted the sentry-kafka and sentry-zookeeper volumes.
It did not delete the -log volumes, but those are just logs, right?

The log volumes are also important to reset the state, especially the ZooKeeper one.

Well… I do not have those volumes:

hex@sentry:~/onpremise$ sudo docker volume ls
DRIVER VOLUME NAME
local 4ca71fa41928a170b796f8b8f6040ef7732d5c7cb3ad4c95ee0b1927b6b09ec5
local 76218279a2a8f4a3a20fa85b943a87b83ea74fc2ac846e7b7d3050a7db9520e8
local bff2a942d51061331f8a3c65cee9cd4c9a95724ad23cc262da92b2e10f4a8181
local ea9ed9a7df7f3c342b3af67d08c128492b630ebfa620e3468c201085402d9dc8
local f7b96e53dd5ac75cd01d295d74fa17e93287d2277b5b55f01e05bdbbdbc0828f
local f84be0b7b4ecfe9822a6f20396142876a7842e822c2a68b1653324072d1de046
local sentry-clickhouse
local sentry-data
local sentry-postgres
local sentry-redis
local sentry-symbolicator

I have added KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: '2' to the Kafka config. No more errors in Kafka.
ingest-consumer now seems to be working without problems.

But I still don’t receive new events :frowning:
I have errors in Relay:

2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.19.0.10:9092 failed: Connection refused (after 28ms in state CONNECT)
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.19.0.10:9092 failed: Connection refused (after 0ms in state CONNECT)
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2020-08-12T20:24:12Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Timeout while waiting for response
2020-08-12T20:24:12Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: No route to host (os error 113)
caused by: No route to host (os error 113)
caused by: No route to host (os error 113)
2020-08-12T20:24:13Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:14Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:15Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:15Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:17Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:17Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:19Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:21Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:23Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:26Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:28Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated

This is all still very much an issue, so I’d appreciate some guidance. After deleting whatever this Kafka stuff was, I’m now getting a bunch of “Broker: Unknown topic or partition” errors in Docker logs, with various containers constantly restarting because whatever this Kafka nonsense is has broken everything.

I strongly encourage you to understand what Kafka is and what it is used for before calling it “whatever this nonsense is”. Sounds like you need to run ./install.sh again to get those missing topics created.
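To see whether the expected topics actually exist before re-running the installer, something like the following should work (a sketch: the `kafka` service name matches the stock compose file, and the `kafka-topics` tool ships in the Confluent image onpremise uses, but check your own setup):

```shell
# List the topics known to the broker; a healthy install should show
# "events", "ingest-events", "snuba-commit-log", etc.
docker-compose exec kafka kafka-topics --list --zookeeper zookeeper:2181

# If topics are missing, re-running the installer recreates them.
./install.sh
docker-compose up -d
```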

These errors indicate a networking issue between Relay and other services. Are you using docker-compose from our official repo or using something else?

Yes, everything is default.
The only change I made in docker-compose is that I switched the nginx port from 9000 to 80.

I face the same issue: after upgrading from 10 to 20.7.2, it stopped processing events after a few hours.

Then I upgraded to 20.8*-dev (master) to resolve issues with Relay authentication.
The Relay errors are gone.

I’ve reset my installation by running:

cd /var/www/equadrat/sentry.e-2.at
docker volume rm sentry-zookeeper sentry_onpremise_sentry-zookeeper-log sentry_onpremise_sentry-kafka-log sentry_onpremise_sentry-kafka
rm /mnt/nfs/docker-volumes/sentry-kafka/* -rf # custom NFS volume for Kafka queue
bash install.sh
docker-compose up -d

Then the same thing happened: it processed events for about an hour, then stopped. I’m not sure where to look:

  • relay looks fine
  • got some errors here; restarting only these two containers did not make it recover:
snuba-replacer_1               | %3|1597477851.492|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 1ms in state CONNECT)
snuba-replacer_1               | %3|1597477852.491|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
snuba-sessions-consumer_1      | %3|1597477851.373|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 3ms in state CONNECT)
snuba-sessions-consumer_1      | %3|1597477851.385|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 2ms in state CONNECT)
snuba-sessions-consumer_1      | %3|1597477852.370|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
snuba-sessions-consumer_1      | %3|1597477852.381|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
  • a full restart via docker-compose stop && docker-compose up -d did not resolve it either

How do I go forward? Where should I look?
Or is it possible to go back to v10 again? I mean, can the Postgres/ClickHouse migrations be run in the opposite direction as well?

EDIT: docker-compose stop && docker-compose rm && docker-compose up -d made it process events again for now, but I’m not sure for how long…

If this is not a networking issue (and it doesn’t seem like it), then I’d suspect Kafka is overloaded or throttled by something and unable to accept new consumer connections. Is your event volume high? Or maybe your NFS volume is slow?

Sentry 9 and Sentry 10 ran fine with the same load. This time I can see no errors/warnings from Kafka, but:

worker_1                       | 14:16:51 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...

Then I restarted kafka, worker & relay, and it started processing again. How should I proceed? How do I debug Kafka?

EDIT: I have moved Kafka’s volume from NFS to local disk. The behavior did not change: either worker or relay loses its connection.

Should we post this in the Sentry GitHub issues?

I’m not sure yet what to report. Kafka does not print any warning, and worker & relay do not reconnect automatically, but why? Are there metrics on Kafka that might indicate some overload?
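One way to look for overload, as a sketch (service name and group name are assumptions here; `snuba-consumers` was the default consumer group for the Snuba consumer at the time, but yours may differ), is to check consumer-group lag inside the Kafka container:

```shell
# List all consumer groups the broker knows about.
docker-compose exec kafka kafka-consumer-groups \
    --bootstrap-server kafka:9092 --list

# Describe one group: a large, steadily growing LAG column means the
# consumers cannot keep up with what the broker is receiving.
docker-compose exec kafka kafka-consumer-groups \
    --bootstrap-server kafka:9092 --describe --group snuba-consumers
```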

I understand.
I can provide access to my server to Sentry engineers if needed.

I honestly don’t know much about how to run Kafka, @e2_robert, so I cannot really comment here until I learn more.

Maybe @matt would have some ideas.

I deleted clickhouse-data together with kafka-data and zookeeper-data.
I lost all events, but it looks like new ones are arriving, for now…