Sentry stops processing events after upgrade 10.0 => 20.8.0.dev0ba2aa70

It worked for 12 hours and stopped again :sob:

I had the same result. But it stopped after 8 hours.

I'm having the same issue… I already removed the kafka/zookeeper volumes, but the problem is still happening.

I'm encountering a similar problem. My installation is standard (apart from the nginx port).

Sometimes the backlog error pops up in the web interface.

The size of the queue increases when I send new errors, so I assume my server receives them and they at least reach the backlog. As far as I can tell, all the required containers are up and running.
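
A quick way to confirm the backlog is real is to look at the containers and the worker queues directly. A rough sketch, assuming the stock service names from the onpremise docker-compose.yml; the sentry queues list helper is an assumption about your Sentry version, so check sentry help inside the container first:

# are the ingestion containers up?
docker-compose ps worker kafka redis
# any errors from the worker while events arrive?
docker-compose logs --tail=50 worker
# per-queue sizes, if your Sentry image ships the queues helper (assumption)
docker-compose exec worker sentry queues list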

@e2_robert has the docker stop/start kept it running since your post 6 days ago?

Any updates on this issue? The same issue persists for our on-prem instance running off the repo clone.

@e2_robert @BYK

I ended up with a dirty workaround: a cronjob that restarts kafka, worker & relay every full hour. It processed events for 3 days, until a full restart was required because some snuba containers stopped working.
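
For reference, such an hourly restart can be wired up with a plain cron entry. A minimal sketch, assuming the checkout lives in /opt/onpremise and docker-compose is in /usr/local/bin; adjust both paths to your setup:

# restart kafka, worker and relay at the top of every hour (paths are hypothetical)
0 * * * * cd /opt/onpremise && /usr/local/bin/docker-compose restart kafka worker relay >> /var/log/sentry-restart.log 2>&1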

Did anyone successfully downgrade? If so, which versions are compatible (I assume going to Sentry 10 won't work)?

I am having issues as well, and I noticed processes are getting killed because the server is running out of memory frequently:

# dmesg | grep "Out of memory"
[56625.167756] Out of memory: Kill process 9431 (clickhouse-serv) score 202 or sacrifice child
[57970.751846] Out of memory: Kill process 10723 (clickhouse-serv) score 210 or sacrifice child
[58370.176550] Out of memory: Kill process 13908 (clickhouse-serv) score 216 or sacrifice child
[62206.017300] Out of memory: Kill process 24844 (clickhouse-serv) score 348 or sacrifice child
[68831.860803] Out of memory: Kill process 27860 (clickhouse-serv) score 224 or sacrifice child

Maybe this is also happening to you?

Here ClickHouse is killed multiple times, but Kafka also uses quite a lot of memory and was killed as well. I suspect this might corrupt their files and sometimes prevent them from restarting.
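
If memory pressure is the trigger, one hedged mitigation is to cap Kafka's JVM heap in docker-compose.yml. This assumes the Confluent cp-kafka image used by the onpremise setup, which reads KAFKA_HEAP_OPTS; the values below are only an example and need tuning for your host. Merge it into the existing kafka service, keeping its other settings:

  kafka:
    environment:
      # example heap cap (assumption: the image honors KAFKA_HEAP_OPTS);
      # setting it too low just moves the failure into the JVM
      KAFKA_HEAP_OPTS: '-Xms512m -Xmx1g'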

I don't have out-of-memory issues.

No out-of-memory messages show up here either.
But after the restart I have not encountered a new failure, although there has only been one new event, 5 minutes ago.

I'm having the same issue after upgrading from 10.1.0 to 20.8.0, although it did not happen immediately. I upgraded via ./install.sh and everything seemed to go fine.

After upgrading, it performed almost identically to the previous version, but then on 8/18 I stopped getting events without changing anything. The upgrade was done on 8/7, so it worked fine for 11 days before this happened.

After starting with docker-compose up to watch the logs, I can see this error repeatedly:

worker_1                   | 20:33:00 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:6ce4d190b4244e4aa956e0690eeda3ff:6')
worker_1                   | 20:33:00 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:8128b458011849a189e4c03e82184e9c:6')

It seems to run like this for a very long time, then drops to 0% CPU after a while, and never processes any new events.

I was very fortunate to have saved a snapshot of the instance prior to upgrading. I've reverted back to 10.1.0 and it's working just fine again.
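
When the worker logs process.failed.empty and then goes idle, it can help to check whether the Kafka consumer groups are keeping up before reverting. A sketch, assuming the stock service names and a Kafka image new enough for --all-groups (Kafka >= 2.4; otherwise use --list plus --describe --group):

# watch the worker while events are being sent
docker-compose logs -f --tail=100 worker
# show consumer group offsets and lag from inside the kafka container
docker-compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --describe --all-groups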

I don't have OOM errors either; the VM has 16 GB.

set -- snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • set gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • exec gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

2020-08-27 10:47:22,943 New partitions assigned: {Partition(topic=Topic(name='ingest-sessions'), index=0): 0}

2020-08-27 10:47:23,109 Partitions revoked: [Partition(topic=Topic(name='ingest-sessions'), index=0)]

  • '[' c = - ']'

  • snuba consumer --help

  • set -- snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • set gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

  • exec gosu snuba snuba consumer --storage sessions_raw --auto-offset-reset=latest --max-batch-time-ms 750

%3|1598525451.691|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 5ms in state CONNECT)

%3|1598525451.695|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 3ms in state CONNECT)

%3|1598525452.686|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

%3|1598525452.691|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.10:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

2020-08-27 10:51:03,877 New partitions assigned: {Partition(topic=Topic(name='ingest-sessions'), index=0): 0}

What does 'New partitions assigned' mean? Does it manage to connect in the end or not?

Still having the same issue…
I do a git pull && ./install.sh every day to see if it gets fixed, but still no luck.


Same problem here :c

Same problem here, but restarting only the worker solves the problem for a few hours.
We didn't have any problems with 20.7.2; the problems started with 20.8.

After restarting the worker, I have tons of these messages in the worker log:

05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:50e084e07290492ca85fe87a269f3a4f:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:e2620e41ca894f47ba547595dc3f3284:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:d64d2ab997a1414385259e0a1762aa3b:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:66d775ad13ef4bb68b63c59b82b2851f:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:ad3b2445331947769e4da8d8d340c146:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:d441cebf14724d169ab84f4b49d8d039:3')
05:42:42 [ERROR] sentry.errors.events: process.failed.empty (cache_key=u'e:9c3a9528afd14db6b1837e2e5ad448e2:3')

Same problem here.

I now restart the worker every 3-4 hours.

I also have a strange problem: some errors (from a few days ago) keep showing up from time to time, even though that project has been stopped.

I don't know if this is a Kafka error or a configuration error on my side.

BTW, there were some connection errors in sentry_install_log:

%3|1598057437.870|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.21.0.5:9092 failed: Connection refused (after 5ms in state CONNECT)
%3|1598057438.865|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.21.0.5:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
Connection to Kafka failed (attempt 0)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
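
Those rdkafka 'Connection refused' lines usually just mean a consumer came up before the broker finished starting; they only matter if they never stop. A sketch for checking whether the broker is reachable afterwards, again assuming the stock service names (--bootstrap-server needs Kafka >= 2.2):

docker-compose ps kafka zookeeper
docker-compose logs --tail=100 kafka
# listing topics from inside the broker container should succeed once Kafka is healthy
docker-compose exec kafka kafka-topics --bootstrap-server kafka:9092 --list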

I had upgraded Sentry on another project with an identical configuration and upgrade process.

That instance is not having the same issue. The only real difference between the two is that the second one has a much lower event volume, roughly 1% of the events of the instance that is having the issue, so I suspect it may be related to load/event volume.

Try my trick: add web as a dependency of relay in docker-compose.yml:

  relay:
    << : *restart_policy
    image: '$RELAY_IMAGE'
    volumes:
      - type: bind
        read_only: true
        source: ./relay
        target: /work/.relay
    depends_on:
      - kafka
      - redis
      - web

This trick ensures that the relay service only starts after web is up, so relay's upstream destination is reachable.

Did not work for me… but thanks for sharing!
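
One likely reason the plain depends_on trick is not always enough: depends_on only waits for the web container to be created, not for the web service to be ready. A hedged sketch using a healthcheck plus the long-form depends_on, assuming a docker-compose version that supports condition: and that curl and the /_health/ endpoint are available in the web image; note that short-form and long-form depends_on cannot be mixed in one service, so kafka and redis need the long form too:

  web:
    healthcheck:
      # assumption: curl exists in the image and /_health/ responds on the internal port 9000
      test: ['CMD', 'curl', '-f', 'http://localhost:9000/_health/']
      interval: 30s
      timeout: 5s
      retries: 5

  relay:
    depends_on:
      web:
        condition: service_healthy
      kafka:
        condition: service_started
      redis:
        condition: service_started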

Restarting the worker container every hour makes the following error disappear:

Background workers haven't checked in recently. It seems that you have a backlog of 80 tasks. Either your workers aren't running or you need more capacity.

But that's all… For more than 15 days now, not a single error event has come in…

I keep running git pull && ./install.sh frequently, hoping for a fix.