How to clear the backlog and monitor it

Can you share your worker logs?

I removed the volumes
sentry-kafka
sentry-zookeeper
and reran ./install.sh

During the install and docker-compose up I get:
snuba-outcomes-consumer_1 | %3|1597171318.012|FAIL|rdkafka#producer-1| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.9:9092 failed: Connection refused (after 59ms in state CONNECT)
snuba-outcomes-consumer_1 | %3|1597171318.012|FAIL|rdkafka#consumer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.9:9092 failed: Connection refused (after 48ms in state CONNECT)

Here are my worker logs. But it seems like the problem is with the Kafka connections:

sentry.utils.geo: settings.GEOIP_PATH_MMDB not configured.
/usr/local/lib/python2.7/site-packages/cryptography/__init__.py:39: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
CryptographyDeprecationWarning,
18:42:15 [INFO] sentry.plugins.github: apps-not-configured
18:42:16 [INFO] sentry.bgtasks: bgtask.spawn (task_name=u'sentry.bgtasks.clean_dsymcache:clean_dsymcache')
18:42:16 [INFO] sentry.bgtasks: bgtask.spawn (task_name=u'sentry.bgtasks.clean_releasefilecache:clean_releasefilecache')

 -------------- celery@9e74d72ecd16 v4.1.1 (latentcall)
---- **** -----
--- * ***  * -- Linux-3.10.0-957.10.1.el7.x86_64-x86_64-with-debian-10.1 2020-08-11 18:42:20
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         sentry:0x7fc2a527acd0
- ** ---------- .> transport:   redis://redis:6379/0
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 2 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
    .> activity.notify exchange=(direct) key=activity.notify
    .> alerts exchange=(direct) key=alerts
    .> app_platform exchange=(direct) key=app_platform
    .> assemble exchange=(direct) key=assemble
    .> auth exchange=(direct) key=auth
    .> buffers.process_pending exchange=(direct) key=buffers.process_pending
    .> cleanup exchange=(direct) key=cleanup
    .> commits exchange=(direct) key=commits
    .> counters-0 exchange=counters(direct) key=default
    .> data_export exchange=(direct) key=data_export
    .> default exchange=(direct) key=default
    .> digests.delivery exchange=(direct) key=digests.delivery
    .> digests.scheduling exchange=(direct) key=digests.scheduling
    .> email exchange=(direct) key=email
    .> events.preprocess_event exchange=(direct) key=events.preprocess_event
    .> events.process_event exchange=(direct) key=events.process_event
    .> events.reprocess_events exchange=(direct) key=events.reprocess_events
    .> events.reprocessing.preprocess_event exchange=(direct) key=events.reprocessing.preprocess_event
    .> events.reprocessing.process_event exchange=(direct) key=events.reprocessing.process_event
    .> events.reprocessing.symbolicate_event exchange=(direct) key=events.reprocessing.symbolicate_event
    .> events.save_event exchange=(direct) key=events.save_event
    .> events.symbolicate_event exchange=(direct) key=events.symbolicate_event
    .> files.delete exchange=(direct) key=files.delete
    .> incident_snapshots exchange=(direct) key=incident_snapshots
    .> incidents exchange=(direct) key=incidents
    .> integrations exchange=(direct) key=integrations
    .> merge exchange=(direct) key=merge
    .> options exchange=(direct) key=options
    .> relay_config exchange=(direct) key=relay_config
    .> reports.deliver exchange=(direct) key=reports.deliver
    .> reports.prepare exchange=(direct) key=reports.prepare
    .> search exchange=(direct) key=search
    .> sleep exchange=(direct) key=sleep
    .> stats exchange=(direct) key=stats
    .> subscriptions exchange=(direct) key=subscriptions
    .> triggers-0 exchange=triggers(direct) key=default
    .> unmerge exchange=(direct) key=unmerge
    .> update exchange=(direct) key=update

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
    **options)
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
    response = connection.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
    (e.args,))
ConnectionError: Error while reading from socket: ('Connection closed by server.',)
18:47:28 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection…
Restoring 7 unacknowledged message(s)

worker: Warm shutdown (MainProcess)
19:43:04 [WARNING] sentry.utils.geo: settings.GEOIP_PATH_MMDB not configured.
/usr/local/lib/python2.7/site-packages/cryptography/__init__.py:39: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
CryptographyDeprecationWarning,

@BYK, the issue on the worker node still exists. The error I am getting on the worker is below.

I checked networking from the worker to Redis, and that looks good, so I'm not sure. The UI robot / sample event creation works, but sending an event to a DSN externally does not.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1047, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 344, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 721, in _brpop_read
    **options)
  File "/usr/local/lib/python2.7/site-packages/redis/client.py", line 680, in parse_response
    response = connection.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 403, in read_response
    (e.args,))
ConnectionError: Error while reading from socket: ('Connection closed by server.',)
00:06:55 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection…
Restoring 7 unacknowledged message(s)

Maybe your Redis port or credentials are not set correctly?

Running nmap from the worker to Redis shows a good connection.
I am not sure about any credentials on Redis. What might it be? I retried the install and still hit the same issue.

nmap -p 6379 redis
Starting Nmap 7.70 ( https://nmap.org ) at 2020-08-13 04:14 UTC
Nmap scan report for redis (172.22.0.8)
Host is up (0.000085s latency).
rDNS record for 172.22.0.8: sentry_onpremise_redis_1.sentry_onpremise_default
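
As a sanity check beyond port reachability (a sketch, assuming the stock service names), you can also test the Redis protocol itself rather than just the TCP port:

    docker-compose exec redis redis-cli ping          # expect PONG
    docker-compose exec redis redis-cli info clients  # shows connected/blocked clients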

Redis configs

In sentry.conf.py:

SENTRY_OPTIONS["redis.clusters"] = {
    "default": {
        "hosts": {0: {"host": "redis", "password": "", "port": "6379", "db": "0"}}
    }
}

In docker-compose.yml:

  redis:
    << : *restart_policy
    image: 'redis:5.0-alpine'
    volumes:
      - 'sentry-redis:/data'

Are you using the on-premise repo without any modifications, or do you have a custom setup? If you have some customizations, can you make sure you mount the Sentry config volume to the worker service too, and that the worker image and the Sentry images are at the same version?

@BYK, I am using the on-premise repo directly. No custom setup.
How do I check the versions of the worker image and the Sentry images?
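
(For reference, a quick way to check — a sketch, assuming the stock on-premise docker-compose project; docker-compose images is a standard Compose subcommand:

    docker-compose images              # lists the image and tag each service is running
    docker-compose images worker web   # or limit the listing to specific services
)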

Based on the on-prem docker-compose, there is no volume mounted for the worker service:

  worker:
    << : *sentry_defaults
    command: run worker

Our production setup ingests 100-200 tasks every hour. I stood the application up from scratch using the new repo. It works initially, and then, as traffic increases, the workers stop processing tasks. Now there are about 20K messages waiting to be processed.

I am wondering whether the Kafka configurations are simply not optimized for a production setup. My issue is similar to the topic "Sentry stops processing events after upgrade 10.0 => 20.8.0.dev0ba2aa70".

None of the on-premise setup is optimized for heavy use, as you can guess from our use of docker-compose and having everything on a single node.

The run worker command has some options for production optimization and fine-tuning:

You may want to leverage those, such as having multiple, dedicated workers for specific queues.
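
For reference, a minimal way to list those options yourself (a sketch, assuming the stock image, where the sentry CLI is on the container's PATH):

    docker-compose exec worker sentry run worker --help   # shows flags such as -c (concurrency), -Q (queues), -X (exclude queues)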

@BYK, are you suggesting something like this in the docker-compose?

  worker:
    << : *sentry_defaults
    command: run worker -c -Q ingest-consumer, snuba-consumers,…

Basically concurrent worker processes for each topic?

Similar: having multiple separate workers (such as worker-1, worker-2, etc.) dedicated to specific queues. I think you can see which queues get the highest load, and you can have dedicated workers for those queues only.

@BYK, I looked up the queues using the command below. Are the workers processing messages in each of these topics?
kafka-topics --list --zookeeper zookeeper:2181
__consumer_offsets
cdc
errors-replacements
event-replacements
events
ingest-attachments
ingest-events
ingest-sessions
ingest-transactions
outcomes
snuba-commit-log

I am not sure how to see the highest load. Any pointers you can provide will help.

@amit1 - oh, those are Kafka topics, which are not used by the workers. Worker queues are in Redis. I think these are all the queues we have:
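
One way to see the per-queue backlog (a sketch, assuming the default Redis broker and stock service names: Celery's Redis transport stores each queue as a Redis list named after the queue, so LLEN shows the number of pending tasks):

    docker-compose exec redis redis-cli llen events.save_event      # pending tasks in this queue
    docker-compose exec redis redis-cli llen events.process_event

The queues with consistently large LLEN values are the ones worth giving dedicated workers.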

@BYK, I was able to fix this issue by adding additional workers, as below.

I have a 4 CPU, 8 GB RAM setup and see very high resource utilization. Is this normal? Is there any way to optimize resource allocation? The Sentry app uses about 7.2 GB of memory out of the 8 GB allocated.

  worker1:
    << : *sentry_defaults
    command: run worker -Q events.process_event
  worker2:
    << : *sentry_defaults
    command: run worker -Q events.reprocessing.process_event
  worker3:
    << : *sentry_defaults
    command: run worker -Q events.reprocess_events
  worker4:
    << : *sentry_defaults
    command: run worker -Q events.save_event
  worker5:
    << : *sentry_defaults
    command: run worker -Q subscriptions
  worker:
    << : *sentry_defaults
    command: run worker
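
To confirm that each dedicated worker bound only to its queue (a sketch, assuming the service names above), check the [queues] section of each worker's startup banner:

    docker-compose logs worker1 | grep -A 2 '\[queues\]'   # should list only events.process_event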

This might be related to ClickHouse. I strongly recommend you keep an eye on feat(clickhouse): Reduce max memory usage to 30% of RAM by BYK · Pull Request #662 · getsentry/self-hosted · GitHub

@BYK, I see the PR was merged. Do you suggest a reinstall of the app to pull down new images?

Responded on the PR (no need to double post :wink: ):

@amitseth7, this should resolve the high memory usage of ClickHouse, yes. To be able to use this fix, you don't need a reinstall with new images.

You can just apply this PR manually and do docker-compose restart clickhouse. Make sure to check the log output to see sentry.xml getting applied in the init phase of ClickHouse, and you should be good.
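
A minimal sketch of one way to do that (assumes your local checkout has the GitHub repo as origin; you can equally copy the changed files from the PR by hand):

    git fetch origin pull/662/head              # fetch the PR branch from GitHub
    git cherry-pick FETCH_HEAD                  # apply its tip commit (assumes a single-commit PR)
    docker-compose restart clickhouse
    docker-compose logs clickhouse | grep -i 'sentry.xml'   # confirm the config got applied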

My apologies. :innocent:
I will reapply the PR.


I have a question.
Will the last worker understand by itself that the other queues are already handled by their own dedicated workers, or must I use the exclude (-X) option for it, as below?

  worker:
    << : *sentry_defaults
    command: run worker -X events.process_event,events.reprocessing.process_event,events.reprocess_events,events.save_event,subscriptions

I don’t think you need the exclude part.