Sentry worker stops working (RabbitMQ connection issue?)

hairongGao · March 24, 2021, 1:22am

Hey guys,

We noticed that one of our Sentry clusters stopped working last night. The worker pods keep throwing error logs like:

NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'events.save_event' in vhost '/' due to timeout
01:18:43 [CRITICAL] celery.worker: Unrecoverable error: NotFound(404, u"NOT_FOUND - failed to perform operation on queue 'events.save_event' in vhost '/' due to timeout", (50, 10), u'Queue.declare')
01:18:45 [INFO] sentry.bgtasks: bgtask.stop (task_name=u'sentry.bgtasks.clean_dsymcache:clean_dsymcache')
01:18:45 [INFO] sentry.bgtasks: bgtask.stop (task_name=u'sentry.bgtasks.clean_releasefilecache:clean_releasefilecache')


The ingest-consumer pods show the same errors. The RabbitMQ StatefulSet looks fine, and we tried restarting and recreating it, but still no luck. Any ideas why this is happening?

Thanks,
Ray

hairongGao · March 24, 2021, 3:56am

It looks like several queues in RabbitMQ are somehow corrupted and show NaN stats in the management console. I tried to delete those queues, but it failed:
bash-5.0$ rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"events.save_event">>}).'
Error: {:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "events.save_event"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
And rabbitmqctl doesn't show those corrupted queues:
bash-5.0$ rabbitmqctl --node rabbit@sentry-rabbitmq-1.sentry-rabbitmq-discovery.sentry.svc.cluster.local list_queues --vhost / | grep events
events.reprocessing.symbolicate_event 0
events.reprocess_events 0
events.reprocessing.preprocess_event 0
events.reprocessing.process_event 0
events.preprocess_event 0
events.symbolicate_event 0
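
For anyone checking for the same symptom, here is a minimal Python sketch that diffs the queue names the workers expect against what `rabbitmqctl list_queues` reports. It assumes the plain "name count" rows seen in the output above. The `EXPECTED` list and both function names are illustrative assumptions, not part of Sentry or RabbitMQ.

```python
def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues` text into {queue_name: message_count}."""
    queues = {}
    for line in output.splitlines():
        parts = line.split()
        # Keep only "<name> <integer>" rows; banner and header lines are skipped.
        if len(parts) == 2 and parts[1].isdigit():
            queues[parts[0]] = int(parts[1])
    return queues


def missing_queues(expected, output):
    """Return the expected queue names that the CLI output does not list."""
    return sorted(set(expected) - set(parse_list_queues(output)))


# Assumption for illustration: the queue named in the worker error,
# plus one queue that the CLI does list.
EXPECTED = ["events.save_event", "events.preprocess_event"]

# The rows reported by list_queues in this thread.
SAMPLE_OUTPUT = """\
events.reprocessing.symbolicate_event 0
events.reprocess_events 0
events.reprocessing.preprocess_event 0
events.reprocessing.process_event 0
events.preprocess_event 0
events.symbolicate_event 0
"""

if __name__ == "__main__":
    print(missing_queues(EXPECTED, SAMPLE_OUTPUT))  # ['events.save_event']
```

A non-empty result means a queue the workers try to declare is not visible to the CLI, which matches the state described here.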

hairongGao · March 25, 2021, 8:40am

So after two days of troubleshooting, here is the story.
Infrastructure details:
We use AWS EBS volumes for persistent storage, which means we have to run sentry-cron/ingest-consumer/relay/web/worker on the same node (because an AWS EBS volume can't be attached to multiple nodes).

When the cron job fires at 0:00 every day, CPU usage spikes on that node and somehow corrupts some of the RabbitMQ queues. I don't know the full mechanism, but the result is that several queues show NaN in the web console yet do not appear in the CLI output (rabbitmqctl -p / list_queues).
When I tried to delete those queues manually, I got undef errors:
bash-5.0$ rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"events.save_event">>}).'
Error: {:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "events.save_event"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
The fix was to bring down the RabbitMQ StatefulSet pods in sequence and restart them to rebuild the cluster, then restart the worker and ingest-consumer pods.
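
One possible reading of that restart sequence, sketched in Python as a list of kubectl commands that is printed rather than executed. The StatefulSet name and namespace are inferred from the node name earlier in the thread. The replica count, the deployment names, and the `restart_plan` helper are assumptions for illustration only, so adjust them to the actual release.

```python
import shlex


def restart_plan(sts, namespace, replicas, dependents):
    """Build ordered kubectl commands for a full stop/start of a StatefulSet."""
    target = f"statefulset/{sts}"
    ns = ["-n", namespace]
    plan = [
        # Scale to zero. With the default OrderedReady policy, pods are
        # terminated from the highest ordinal down, so pod 0 stops last.
        ["kubectl", "scale", target, "--replicas=0", *ns],
        # Wait for pod 0 to disappear (may error if it is already gone).
        ["kubectl", "wait", "--for=delete", f"pod/{sts}-0", "--timeout=300s", *ns],
        # Scale back up. Pod 0, the last one stopped, is started first.
        ["kubectl", "scale", target, f"--replicas={replicas}", *ns],
        ["kubectl", "rollout", "status", target, *ns],
    ]
    # Restart the consumers once the broker cluster is back.
    for name in dependents:
        plan.append(["kubectl", "rollout", "restart", f"deployment/{name}", *ns])
    return plan


if __name__ == "__main__":
    # Assumed values: 3 replicas and these two deployment names.
    for cmd in restart_plan("sentry-rabbitmq", "sentry", 3,
                            ["sentry-worker", "sentry-ingest-consumer"]):
        print(shlex.join(cmd))
```

Printing the plan first makes it easy to review each step before running anything against a live cluster.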

My questions are:

Is there any specific reason why the RabbitMQ queues broke while the cluster was still running? We could not observe the issue until the worker and ingest-consumer pods crashed.
We observe huge memory usage in the Redis cluster: our worker nodes have 8 vCPUs and 32 GB of memory, and a single Redis pod can consume 16 GB. I don't think that's expected, right?
In general, if incoming traffic spikes and, say, the workers go down, then the queues, and then the Redis cluster, the whole Sentry installation goes down. Is there any workaround to improve that?
@Mokto mentioned not using volumes for persistent storage and trying S3 instead. Is there any documentation for this part, and is the S3 backend already supported?

Thanks in advance, any suggestions or comments are much appreciated!
