Large backlog of events.process_event and events.save_event

Hiya! We recently started running Sentry on-premise and are seeing a large backlog in the events.process_event and events.save_event queues.

Setup
Sentry 20.9.0 (bb3d590)
Azure machine with 4 CPUs and 16 GB of memory
Traffic ranges from 2 to 150 incoming events per minute
All services run on this machine except Postgres, which is managed separately
5 workers configured like so:

    worker:
      << : *sentry_defaults
      command: run worker
    worker2:
      << : *sentry_defaults
      command: run worker -Q events.process_event
    worker3:
      << : *sentry_defaults
      command: run worker -Q events.process_event
    worker4:
      << : *sentry_defaults
      command: run worker -Q events.process_event
    worker5:
      << : *sentry_defaults
      command: run worker -Q events.process_event
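
Side note: since worker2 through worker5 are identical, an equivalent setup, assuming plain docker-compose, would be to define the queue-specific worker once and scale it (worker_process_event below is just an illustrative service name):

    worker_process_event:
      << : *sentry_defaults
      command: run worker -Q events.process_event

    # then bring up 4 copies with: docker-compose up -d --scale worker_process_event=4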

Adding workers and upgrading the machine (previously 2 CPUs / 8 GB) seems to have helped, but we're still seeing the queues build up a lot. Here is the output of sentry queues list:

    activity.notify 0
    alerts 0
    app_platform 0
    assemble 0
    auth 0
    buffers.process_pending 0
    cleanup 0
    commits 0
    counters-0 0
    data_export 0
    default 0
    digests.delivery 0
    digests.scheduling 0
    email 0
    events.preprocess_event 0
    events.process_event 34032
    events.reprocess_events 0
    events.reprocessing.preprocess_event 0
    events.reprocessing.process_event 0
    events.reprocessing.symbolicate_event 0
    events.save_event 18708
    events.symbolicate_event 0
    files.delete 0
    incident_snapshots 0
    incidents 0
    integrations 0
    merge 0
    options 0
    relay_config 0
    reports.deliver 0
    reports.prepare 0
    search 0
    sleep 0
    stats 0
    subscriptions 0
    triggers-0 809
    unmerge 0
    update 0
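
A quick way to watch how this listing changes over time (a sketch, assuming the stock docker-compose layout where the Sentry CLI is available inside the worker container):

    # refresh the queue sizes every 30 seconds from the host
    watch -n 30 "docker-compose exec -T worker sentry queues list"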

What steps can we take to investigate and diagnose this issue? Can anyone point me to other threads or resources?

I think this thread can help you: How to clear backlog and monitor it

Thanks for the response @BYK! I already added the workers based on that thread (you can see the config in the initial post), and I also added one for save_event. This doesn't really seem to be making a dent in the queue. The Clickhouse max memory flag is also set to 0.3.

Did you mean something else from that thread?

Ooops, sorry for zombie responding :smiley:

This may actually be limiting you, so you may want to try increasing it a bit.
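
For example, a minimal sketch of what raising it could look like in docker-compose.yml, assuming your setup wires the ratio through the MAX_MEMORY_USAGE_RATIO environment variable on the clickhouse service the way recent on-premise versions do:

    clickhouse:
      environment:
        # example value only: raise Clickhouse's share of host memory from 0.3 to 0.5
        MAX_MEMORY_USAGE_RATIO: "0.5"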

Also, what do your worker logs look like? Maybe there are hints we are missing?

Re: Clickhouse, I'll try that! All the worker logs look like this; I don't see any errors or warnings other than these:

worker5_1                      | 2020-10-19T20:22:38.193458842Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:23:12.644595224Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:23:12.644628325Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:23:18.442984386Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:23:18.443033987Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:23:25.576246320Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:23:25.576301321Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:23:26.634927924Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:23:26.634984525Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:24:10.207719860Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:24:10.207768761Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:24:14.737120000Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:24:14.737182201Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:24:19.714759327Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:24:19.714798828Z   InsecureRequestWarning)
worker5_1                      | 2020-10-19T20:24:23.803244720Z /usr/local/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
worker5_1                      | 2020-10-19T20:24:23.803331422Z   InsecureRequestWarning)

Those errors might be the reason for your issues. Are you using Sentry with SSL? Do you have any custom config? If so, can you share it with us?

We are using Sentry with SSL through an nginx ingress (managed by Kubernetes) that sits in front of Sentry. When I purge the queues, events do get processed for a few hours before they get backlogged again. How should I address these errors?
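
For reference, the purge is done with commands along these lines (assuming sentry queues purge run inside the worker container is the intended mechanism):

    docker-compose exec worker sentry queues purge events.process_event
    docker-compose exec worker sentry queues purge events.save_event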

One other thing: setting up a worker to consume events.save_event tasks doesn't seem to work. Here is the config; I don't see any events being picked up in its logs, whereas the workers consuming events.process_event do show processing logs.

    worker6:
      << : *sentry_defaults
      command: run worker -Q events.save_event
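
If it helps narrow things down, one alternative shape we could try, assuming -Q accepts a comma-separated list of queues like the underlying Celery option does, is a single worker consuming both backlogged queues:

    worker6:
      << : *sentry_defaults
      # hypothetical: one worker pulling from both backlogged queues
      command: run worker -Q events.save_event,events.process_event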

I don’t have any pointers or clues right now. Were you able to solve this? If not, sharing full logs may help with finding a solution.