Sentry workers stopped working (RabbitMQ connection issue?)

Hey guys,

We noticed that one of our Sentry clusters stopped working last night. The worker pods keep throwing errors like:

NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'events.save_event' in vhost '/' due to timeout
01:18:43 [CRITICAL] celery.worker: Unrecoverable error: NotFound(404, u"NOT_FOUND - failed to perform operation on queue 'events.save_event' in vhost '/' due to timeout", (50, 10), u'Queue.declare')
01:18:45 [INFO] sentry.bgtasks: bgtask.stop (task_name=u'sentry.bgtasks.clean_dsymcache:clean_dsymcache')
01:18:45 [INFO] sentry.bgtasks: bgtask.stop (task_name=u'sentry.bgtasks.clean_releasefilecache:clean_releasefilecache')

The ingest-consumer pods fail the same way. The RabbitMQ StatefulSet (sts) looks fine, and we tried restarting and recreating it, but still no luck. Any ideas why this is happening?

Thanks,
Ray

It looks like several queues in RabbitMQ are corrupted somehow and show NaN stats in the management console. I tried to delete those queues, but it failed:
bash-5.0$ rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"events.save_event">>}).'
Error: {:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "events.save_event"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
And rabbitmqctl doesn't list those corrupted queues:
bash-5.0$ rabbitmqctl --node rabbit@sentry-rabbitmq-1.sentry-rabbitmq-discovery.sentry.svc.cluster.local list_queues --vhost / | grep events
events.reprocessing.symbolicate_event 0
events.reprocess_events 0
events.reprocessing.preprocess_event 0
events.reprocessing.process_event 0
events.preprocess_event 0
events.symbolicate_event 0
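
One way to confirm the mismatch is to compare the queues the workers expect against what rabbitmqctl actually lists. A minimal sketch, assuming the listing has been saved to a file; the `missing_queues` helper and the file names are hypothetical, and the sample data is copied from the output above:

```shell
# Print every expected queue that is absent from a saved
# `rabbitmqctl list_queues` listing.
#   $1: file with one expected queue name per line (hypothetical input)
#   $2: file with the saved listing (queue name first, then message count)
missing_queues() {
  expected_sorted="$(mktemp)"
  listed_sorted="$(mktemp)"
  sort -u "$1" > "$expected_sorted"
  # Keep only the first column; any header lines in the listing are harmless
  # because only names unique to the expected list are reported.
  awk '{print $1}' "$2" | sort -u > "$listed_sorted"
  # comm -23 keeps the lines that appear only in the first sorted file.
  comm -23 "$expected_sorted" "$listed_sorted"
  rm -f "$expected_sorted" "$listed_sorted"
}

# Sample data: events.save_event is the queue the workers fail to declare.
tmp="$(mktemp -d)"
printf '%s\n' events.save_event events.preprocess_event > "$tmp/expected.txt"
printf '%s\t0\n' \
  events.reprocessing.symbolicate_event \
  events.reprocess_events \
  events.reprocessing.preprocess_event \
  events.reprocessing.process_event \
  events.preprocess_event \
  events.symbolicate_event > "$tmp/listed.txt"

missing_queues "$tmp/expected.txt" "$tmp/listed.txt"   # prints: events.save_event
rm -rf "$tmp"
```

Running a check like this on a schedule would surface a missing queue before the workers crash on it, though it only detects the symptom and says nothing about the cause.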

So, after two days of troubleshooting, here is the story.
Infrastructure details:
We use AWS EBS volumes for persistent storage, which means we have to run sentry-cron, ingest-consumer, relay, web, and worker on the same node (because an AWS EBS volume can't be attached to multiple nodes).

When the cronjob triggers at 0:00 every day, CPU usage spikes on that node and somehow corrupts some of the RabbitMQ queues. I don't know the whole process, but the result is that several queues show NaN in the web console yet do not appear in the CLI output (rabbitmqctl -p / list_queues).
When I tried to delete those queues manually, I got undef errors:
bash-5.0$ rabbitmqctl eval 'rabbit_amqqueue:internal_delete({resource,<<"/">>,queue,<<"events.save_event">>}).'
Error: {:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "events.save_event"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
So the solution was to bring down the RabbitMQ sts pods in sequence and restart them to rebuild the cluster, then restart the worker and ingest-consumer pods.
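
The recovery could be scripted roughly as below. This is a sketch under several assumptions, not the exact commands used: the namespace and StatefulSet name are inferred from the pod DNS name earlier in the thread, while the replica count, the Deployment names, and the timeouts are guesses. It also assumes the StatefulSet uses ordered pod management, so scaling to zero stops pods from the highest ordinal down and scaling up starts them from ordinal 0. By default it only prints the commands.

```shell
# Dry run by default: commands are printed, not executed.
# Set KUBECTL=kubectl to run them for real.
KUBECTL="${KUBECTL:-echo kubectl}"
NAMESPACE="${NAMESPACE:-sentry}"             # inferred from sentry.svc.cluster.local
RABBIT_STS="${RABBIT_STS:-sentry-rabbitmq}"  # inferred from pod sentry-rabbitmq-1
REPLICAS="${REPLICAS:-3}"                    # assumption: adjust to the real cluster size
CONSUMERS="${CONSUMERS:-sentry-worker sentry-ingest-consumer}"  # assumed Deployment names

rebuild_rabbitmq() {
  # 1. Bring the RabbitMQ pods down. With ordered pod management, pod 0 is
  #    terminated last, so waiting for it means every pod is gone.
  $KUBECTL -n "$NAMESPACE" scale statefulset "$RABBIT_STS" --replicas=0 || return 1
  $KUBECTL -n "$NAMESPACE" wait --for=delete "pod/${RABBIT_STS}-0" --timeout=300s || return 1

  # 2. Bring them back up and wait until the StatefulSet reports ready.
  $KUBECTL -n "$NAMESPACE" scale statefulset "$RABBIT_STS" --replicas="$REPLICAS" || return 1
  $KUBECTL -n "$NAMESPACE" rollout status "statefulset/$RABBIT_STS" --timeout=600s || return 1

  # 3. Restart the consumers so they redeclare their queues on the rebuilt cluster.
  for deployment in $CONSUMERS; do
    $KUBECTL -n "$NAMESPACE" rollout restart "deployment/$deployment" || return 1
    $KUBECTL -n "$NAMESPACE" rollout status "deployment/$deployment" --timeout=600s || return 1
  done
}

rebuild_rabbitmq
```

Reviewing the printed commands first, then rerunning with `KUBECTL=kubectl`, keeps the destructive scale-down step deliberate.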

My questions are:

  1. Is there any specific reason why the RabbitMQ queues break while the cluster is still running? We couldn't observe the issue until the worker and ingest-consumer pods crashed.
  2. We observe huge memory usage in the Redis cluster. Our worker nodes have 8 vCPUs and 32 GB of memory, and a single Redis pod can consume 16 GB. I don't think that's expected, right?
  3. In general, if incoming traffic spikes and, say, the workers go down, then the queues, and then the Redis cluster, the whole Sentry installation goes down. Is there any workaround to improve that?
  4. @Mokto mentioned not using volumes for persistent storage and trying S3 instead. Is there any documentation for this, and is the S3 backend already supported?

Thanks in advance. Any suggestions or comments are much appreciated!