Sentry stops processing events after upgrade 10.0 => 20.8.0.dev0ba2aa70

Sentry 9 and Sentry 10 were running fine under the same load. Now I see no errors or warnings from Kafka, but this time I get:

worker_1                       | 14:16:51 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...

Then I restarted kafka, worker & relay and it started processing again. How should I proceed? Where do I start debugging Kafka?

EDIT: I have moved Kafka’s volume from NFS to local disk. The behavior did not change: either worker or relay loses its connection.

Should we post this on Sentry’s GitHub issues?

I’m not sure yet what to report. Kafka does not print any warnings, and worker & relay do not reconnect automatically, but why? Are there any Kafka metrics that might indicate some kind of overload?
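
The only overload signal I know how to check is consumer lag. A minimal sketch, assuming the stock docker-compose setup where the broker service is called kafka and the Confluent CLI tools are on its PATH (snuba-consumers is just the group name I’d expect from a default install):

# list the consumer groups the broker knows about
docker-compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
# describe one group; the LAG column shows how far behind it is per partition
docker-compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-consumers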

I understand.
I can provide access to my server to Sentry engineers if needed.

I honestly don’t know much about how to run Kafka, @e2_robert, so I cannot really comment here until I learn more.

Maybe @matt would have some ideas.

I deleted clickhouse-data together with kafka-data and zookeeper-data.
I lost all events but it looks like new ones are arriving for now…
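
For anyone who wants to try the same reset, this is roughly what it looks like; a sketch, assuming the default volume names from the on-premise docker-compose.yml (check docker volume ls first, and note that it throws away all stored events):

docker-compose down
# default named volumes from the on-premise repo; your names may differ (docker volume ls),
# and there may also be matching *-log volumes for kafka and zookeeper
docker volume rm sentry-kafka sentry-zookeeper sentry-clickhouse
docker-compose up -d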

It worked for 12 hours and stopped again :sob:

I had the same result. But it stopped after 8 hours.

I’m having the same issue… I already removed the kafka/zookeeper volumes, but the problem is still happening.

I’m encountering a similar problem. My installation is standard (apart from the nginx port).

Sometimes the backlog error pops up in the web interface.


The size of the queue increases when I send new errors, so I guess my server receives the errors and they at least reach the backlog. As far as I can tell, all required containers are up and running.

@e2_robert has the docker stop/start kept it running since your post 6 days ago?

Any updates on this issue? The same problem persists for our on-prem instance running off the repo clone.

@e2_robert @BYK

I ended up with a dirty workaround: a cronjob that restarts kafka, worker & relay every hour. It processed events for 3 days before a full restart was required, since some snuba containers stopped working.
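
For reference, the cronjob is roughly this; the checkout path and docker-compose location are placeholders, so adjust them to your setup:

# crontab entry (sketch): restart the flaky services at the top of every hour
0 * * * * cd /opt/onpremise && /usr/local/bin/docker-compose restart kafka worker relay >> /var/log/sentry-restart.log 2>&1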

Did anyone successfully downgrade? If so, which versions are compatible (I assume going back to Sentry 10 won’t work)?

I am having issues as well, and I noticed processes are getting killed because the server is running out of memory frequently:

# dmesg | grep "Out of memory"
[56625.167756] Out of memory: Kill process 9431 (clickhouse-serv) score 202 or sacrifice child
[57970.751846] Out of memory: Kill process 10723 (clickhouse-serv) score 210 or sacrifice child
[58370.176550] Out of memory: Kill process 13908 (clickhouse-serv) score 216 or sacrifice child
[62206.017300] Out of memory: Kill process 24844 (clickhouse-serv) score 348 or sacrifice child
[68831.860803] Out of memory: Kill process 27860 (clickhouse-serv) score 224 or sacrifice child

Maybe this is also happening to you?

Here ClickHouse is killed multiple times, but Kafka also uses quite a lot of memory and was killed too. I suspect this might corrupt their files and sometimes prevent them from restarting.
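
If the OOM killer turns out to be the culprit, one thing that might help (just a sketch, not an official recommendation) is capping Kafka’s JVM heap in a docker-compose override so the broker leaves more headroom for ClickHouse; KAFKA_HEAP_OPTS is read by Kafka’s startup scripts:

# docker-compose.override.yml (sketch; match the version field to your docker-compose.yml)
version: "3.4"
services:
  kafka:
    environment:
      # cap the broker heap; tune to what your host can spare
      KAFKA_HEAP_OPTS: "-Xms512m -Xmx1g"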

I don’t have out-of-memory issues.

No out-of-memory messages here either.
Since the restart I have not encountered a new failure, but there has only been one new entry, 5 minutes ago.

I’m having the same issue after upgrading from 10.1.0 to 20.8.0, although it did not happen immediately. I upgraded via ./install.sh and it all seemed to go fine.

After upgrading, it performed almost identically to the previous version, but then on 8/18 I stopped getting events without having changed anything. The upgrade was done on 8/7, so it worked fine for 11 days before this happened.

After starting with docker-compose up to watch the logs, I can see this error repeatedly:

worker_1                   | 20:33:00 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:6ce4d190b4244e4aa956e0690eeda3ff:6')
worker_1                   | 20:33:00 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:8128b458011849a189e4c03e82184e9c:6')

It seems to run like this for a very long time, then drops to 0% CPU after a while and never processes any new events.

I was very fortunate to have saved a snapshot of the instance prior to upgrading. I’ve reverted back to 10.1.0 and it’s working just fine again.

I don’t have OOM errors either; the VM has 16 GB.

  • set -- snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • set gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • exec gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

2020-08-27 10:47:22,943 New partitions assigned: {Partition(topic=Topic(name='ingest-sessions'), index=0): 0}

2020-08-27 10:47:23,109 Partitions revoked: [Partition(topic=Topic(name='ingest-sessions'), index=0)]

  • '[' c = - ']'

  • snuba consumer --help

  • set -- snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • set gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • exec gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

%3|1598525451.691|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 5ms in state CONNECT)

%3|1598525451.695|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 3ms in state CONNECT)

%3|1598525452.686|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

%3|1598525452.691|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

2020-08-27 10:51:03,877 New partitions assigned: {Partition(topic=Topic(name='ingest-sessions'), index=0): 0}

What does 'New partitions assigned' mean? Does it manage to connect in the end or not?
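
For what it’s worth, the 'Connection refused' lines mean the broker wasn’t accepting connections at that moment, and the later 'New partitions assigned' message suggests the consumer did rejoin the group and get its partition back, so the connection eventually succeeded. To see whether the broker is actually up and what it logged around those timestamps, something along these lines (assuming the default service names from the on-premise compose file):

docker-compose ps kafka                  # is the broker container running, or restarting in a loop?
docker-compose logs --tail=200 kafka     # look for crashes or restarts around the timestamps above
# does the broker answer on 9092? (newer Kafka tooling; older versions may still need --zookeeper)
docker-compose exec kafka kafka-topics --bootstrap-server kafka:9092 --list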

Still having the same issue…
I do a git pull && ./install.sh every day to see if it gets fixed, but no luck so far.

Same problem here :c