Sentry stops processing events after upgrade 10.0 => 20.8.0.dev0ba2aa70

Standard Docker setup.

After the upgrade, events were accepted for a few hours and then stopped. Processing started again after a server restart and then stopped again. Restarting the server does not help anymore.

I have Kafka-related errors in the consumer logs:

+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 19:05:49,717 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 19:05:54,980 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
%3|1597086403.040|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1597086403.051|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 12ms in state CONNECT)
%3|1597086404.037|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%3|1597086404.037|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 146, in consumer
    consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 132, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 138, in _run_once
    msg = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 674, in poll
    return super().poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 412, in poll
    raise ConsumerError(str(error))
snuba.utils.streams.consumer.ConsumerError: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="JoinGroup failed: Broker: Coordinator load in progress"}
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 19:07:12,593 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 20:17:24,210 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
%3|1597090682.504|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 1ms in state CONNECT)
%3|1597090682.510|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 2ms in state CONNECT)
%3|1597090690.505|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 8 identical error(s) suppressed)
%3|1597090690.508|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 8 identical error(s) suppressed)
%3|1597090720.508|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 29 identical error(s) suppressed)
%3|1597090720.511|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.7:9092 failed: Connection refused (after 0ms in state CONNECT, 29 identical error(s) suppressed)
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 146, in consumer
    consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 132, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing.py", line 138, in _run_once
    msg = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 674, in poll
    return super().poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/kafka.py", line 412, in poll
    raise ConsumerError(str(error))
snuba.utils.streams.consumer.ConsumerError: KafkaError{code=COORDINATOR_LOAD_IN_PROGRESS,val=14,str="JoinGroup failed: Broker: Coordinator load in progress"}
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 20:19:00,698 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}
2020-08-10 22:05:42,556 Partitions revoked: [Partition(topic=Topic(name='events'), index=0)]
+ '[' c = - ']'
+ snuba consumer --help
+ set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
+ exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
2020-08-10 22:05:46,575 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 16227028}

But Kafka itself looks to be working fine.
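(A rough way to double-check that from the host, assuming the default kafka service name from the onpremise docker-compose setup:)

docker-compose exec kafka kafka-broker-api-versions --bootstrap-server localhost:9092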


Looks like an issue with Kafka, but I really am not a Kafka expert. Someone else reported a similar issue earlier, and my suggestion was to basically nuke Kafka: https://github.com/getsentry/sentry/issues/16534#issuecomment-576899686

Thanks for the help!
I tried last night to nuke Kafka and ZooKeeper, following this: Kafka starts with 50 partitions bogging down system resources · Issue #502 · getsentry/self-hosted · GitHub

Now I have errors in the Kafka log:

[2020-08-10 23:17:50,220] ERROR [Controller id=1002 epoch=60] Controller 1002 epoch 60 failed to change state for partition __consumer_offsets-1 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-1 under strategy OfflinePartitionLeaderElectionStrategy(false)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:427)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:424)
    at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:335)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:236)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:157)
    at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:73)
    at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:58)
    at kafka.controller.PartitionStateMachine.startup(PartitionStateMachine.scala:41)
    at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:306)
    at kafka.controller.KafkaController.elect(KafkaController.scala:1404)
    at kafka.controller.KafkaController.processStartup(KafkaController.scala:1291)
    at kafka.controller.KafkaController.process(KafkaController.scala:1924)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:136)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:139)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:139)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
[2020-08-10 23:17:50,220] ERROR [Controller id=1002 epoch=60] Controller 1002 epoch 60 failed to change state for partition __consumer_offsets-5 from OfflinePartition to OnlinePartition (state.change.logger)

That comment seems a bit… lacking :smiley:

For a full Kafka nuke, I think this is what you need:
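(Roughly, and assuming the default onpremise volume names — the prefix of the -log volumes depends on your compose project name — something along these lines:)

docker-compose down
docker volume rm sentry-kafka sentry-zookeeper
docker volume rm sentry_onpremise_sentry-kafka-log sentry_onpremise_sentry-zookeeper-log
./install.sh
docker-compose up -d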

Sorry for the multiple hops for this. We are going to document all this soon.

Well, the code I quoted and executed deleted the sentry-kafka and sentry-zookeeper volumes.
It did not delete the -log volumes, but it's just logs, right?

The logs are also important to reset the state, especially the ZooKeeper one.

Well… I do not have those volumes:

hex@sentry:~/onpremise$ sudo docker volume ls
DRIVER VOLUME NAME
local 4ca71fa41928a170b796f8b8f6040ef7732d5c7cb3ad4c95ee0b1927b6b09ec5
local 76218279a2a8f4a3a20fa85b943a87b83ea74fc2ac846e7b7d3050a7db9520e8
local bff2a942d51061331f8a3c65cee9cd4c9a95724ad23cc262da92b2e10f4a8181
local ea9ed9a7df7f3c342b3af67d08c128492b630ebfa620e3468c201085402d9dc8
local f7b96e53dd5ac75cd01d295d74fa17e93287d2277b5b55f01e05bdbbdbc0828f
local f84be0b7b4ecfe9822a6f20396142876a7842e822c2a68b1653324072d1de046
local sentry-clickhouse
local sentry-data
local sentry-postgres
local sentry-redis
local sentry-symbolicator

I have added KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: '2' to the Kafka config. No more errors in Kafka.
ingest-consumer now seems to be working without problems.
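(Concretely, that means something like this under the kafka service's environment block in docker-compose.yml — just a sketch, the rest of the service definition is omitted and may look different in your version:)

  kafka:
    environment:
      # ...existing settings...
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: '2'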

But I still don't receive new events :frowning:
I have errors in Relay:

2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.19.0.10:9092 failed: Connection refused (after 28ms in state CONNECT)
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.19.0.10:9092 failed: Connection refused (after 0ms in state CONNECT)
2020-08-12T20:24:09Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2020-08-12T20:24:12Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Timeout while waiting for response
2020-08-12T20:24:12Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: No route to host (os error 113)
caused by: No route to host (os error 113)
caused by: No route to host (os error 113)
2020-08-12T20:24:13Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:14Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:15Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:15Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:17Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:17Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:19Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:21Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:23Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated
2020-08-12T20:24:26Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
caused by: Failed to connect to host: Connection refused (os error 111)
caused by: Connection refused (os error 111)
caused by: Connection refused (os error 111)
2020-08-12T20:24:28Z [relay_server::actors::project_upstream] ERROR: error fetching project states: attempted to send request while not yet authenticated

This is all still very much an issue, so I'd appreciate some guidance. After deleting whatever this Kafka stuff was, I'm now getting a bunch of "Broker: Unknown topic or partition" errors in Docker logs, with various containers constantly restarting because whatever this Kafka nonsense is has broken everything.

I strongly encourage you to understand what Kafka is and what it is used for before calling it "whatever this nonsense is". Sounds like you need to run ./install.sh again to get those missing topics created.
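(After re-running it, you can verify the topics are back with something along these lines; the events topic, among others, should show up:)

docker-compose exec kafka kafka-topics --list --bootstrap-server localhost:9092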

These errors indicate a networking issue between Relay and other services. Are you using docker-compose from our official repo or using something else?
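(A quick sanity check from the onpremise repo directory, just a rough sketch:)

docker-compose ps # everything should be Up, nothing restarting
docker-compose logs --tail=50 relay kafka web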

Yes, everything is default.
The only change that I made in docker-compose is that I switched the nginx port from 9000 to 80.

I face the same issue: after upgrading from 10 to 20.7.2, it stopped processing events after a few hours.

Then I went to upgrade to 20.8*-dev (master) to resolve issues with Relay authentication.
The Relay errors are gone.

I've reset my installation by calling:

cd /var/www/equadrat/sentry.e-2.at
docker volume rm sentry-zookeeper sentry_onpremise_sentry-zookeeper-log sentry_onpremise_sentry-kafka-log sentry_onpremise_sentry-kafka # drop all Kafka & ZooKeeper state
rm /mnt/nfs/docker-volumes/sentry-kafka/* -rf # custom NFS volume for Kafka queue
bash install.sh # recreate topics and re-run migrations
docker-compose up -d

Then the same thing: it processed events for about one hour, then it stopped. I'm not too sure where to look:

  • relay looks fine
  • got some errors here (see below); after restarting only those two containers, it did not recover:
snuba-replacer_1               | %3|1597477851.492|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 1ms in state CONNECT)
snuba-replacer_1               | %3|1597477852.491|FAIL|rdkafka#consumer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
snuba-sessions-consumer_1      | %3|1597477851.373|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 3ms in state CONNECT)
snuba-sessions-consumer_1      | %3|1597477851.385|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 2ms in state CONNECT)
snuba-sessions-consumer_1      | %3|1597477852.370|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
snuba-sessions-consumer_1      | %3|1597477852.381|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#10.17.8.9:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
  • a full restart via docker-compose stop && docker-compose up -d did not resolve it either

How do I go forward? Where should I look?
Or is it possible to go back to v10 again? I mean, do the Postgres/ClickHouse migrations also run in the opposite direction?

EDIT: docker-compose stop && docker-compose rm && docker-compose up -d made it process events again for now, but I'm not sure for how long…

If this is not a networking issue (and it doesn't seem like it), then I'd suspect Kafka is being overloaded or throttled by something and cannot accept new consumer connections. Is your event volume high? Or maybe your NFS volume is slow?
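(One thing worth checking is consumer group lag: if the Sentry/Snuba consumer groups keep falling further behind, that points at the broker or its disk being too slow. A rough sketch — the group name snuba-consumers is from memory and may differ on your install:)

docker-compose exec kafka kafka-consumer-groups --bootstrap-server localhost:9092 --list
docker-compose exec kafka kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group snuba-consumers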

Sentry 9 & Sentry 10 were running fine with the same load. Now I see no errors/warnings from Kafka, but this time:

worker_1                       | 14:16:51 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...

Then I restarted kafka, worker & relay. It started to process again. How should I proceed? Where do I start debugging Kafka?

EDIT: I have moved Kafka's volume from NFS to local disk. The behavior did not change. Either worker or relay loses its connection.

Should we post this in Sentry's GitHub issues?

I'm not sure yet what to report. Kafka does not print any warnings, and worker & relay do not reconnect automatically, but why? Are there metrics on Kafka that might indicate some overload?

I understand.
I can provide access to my server to Sentry engineers if needed.

I honestly don't know much about how to run Kafka, @e2_robert, so I cannot really comment here until I learn more.

Maybe @matt would have some ideas.

I deleted clickhouse-data together with kafka-data and zookeeper-data.
I lost all events, but it looks like new ones are arriving for now…