Ingest-consumer workload sharing

Hi Experts,

I am setting up on-premise Sentry in Kubernetes. I see a huge lag in processing of the "ingest-events" Kafka topic by the ingest-consumer. To make it more efficient, I increased the "ingest-events" topic's partition count to 5 and run 5 replicas of sentry-ingest-consumer, one polling each partition.
Below are the sentry-ingest-consumer arguments passed to the container at runtime.
["–config", “/shared-config/”, “run”,“ingest-consumer”,"–all-consumer-types", “–max-batch-size”, “1000”]

Although --max-batch-size is set to 1000, the ingest-consumer does not process 1000 messages per poll, even though there is a huge lag to work through (see partition 0 below).

14:00:12 [WARNING] sentry.utils.geo: Error opening GeoIP database: /geoip/GeoLite2-City.mmdb
14:00:13 [WARNING] sentry.utils.geo: Error opening GeoIP database in Rust: /geoip/GeoLite2-City.mmdb
14:03:34 [INFO] sentry.plugins.github: apps-not-configured
14:03:50 [DEBUG] batching-kafka-consumer: Topic 'ingest-events' is ready
14:03:50 [DEBUG] batching-kafka-consumer: Topic 'ingest-transactions' is ready
14:03:50 [DEBUG] batching-kafka-consumer: Topic 'ingest-attachments' is ready
14:03:50 [DEBUG] batching-kafka-consumer: Starting
14:04:13 [INFO] batching-kafka-consumer: New partitions assigned: [TopicPartition{topic=ingest-events,partition=1,offset=-1001,error=None}]
14:04:15 [INFO] batching-kafka-consumer: Flushing 495 items (from {('ingest-events', 1): [26206, 26700]}): forced:False size:False time:True
14:04:15 [DEBUG] batching-kafka-consumer: Flushing batch via worker
14:07:35 [INFO] batching-kafka-consumer: Worker flush took 200238ms
14:07:35 [DEBUG] batching-kafka-consumer: Committing Kafka offsets
14:07:35 [DEBUG] batching-kafka-consumer: Committed offsets: [TopicPartition{topic=ingest-events,partition=1,offset=26701,error=None}]
14:07:35 [DEBUG] batching-kafka-consumer: Kafka offset commit took 70ms
14:07:35 [DEBUG] batching-kafka-consumer: Resetting in-memory batch
14:07:37 [INFO] batching-kafka-consumer: Flushing 678 items (from {('ingest-events', 1): [26701, 27378]}): forced:False size:False time:True
14:07:37 [DEBUG] batching-kafka-consumer: Flushing batch via worker
14:12:03 [INFO] batching-kafka-consumer: Worker flush took 266006ms 



./kafka-consumer-groups.sh --bootstrap-server xxxx:9092,xxxx:9092,xxxx:9092 --group ingest-consumer --describe

GROUP            TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG      CONSUMER-ID                                   HOST              CLIENT-ID
ingest-consumer  ingest-events        0          16036942        17279611        1242669  rdkafka-034219b6-62e6-4347-847e-65cc73126275  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-transactions 0          -               0               -        rdkafka-034219b6-62e6-4347-847e-65cc73126275  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-attachments   0          -               0               -        rdkafka-034219b6-62e6-4347-847e-65cc73126275  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-events        2          27856           27863           7        rdkafka-4cce6dd9-5ef2-4283-8c73-9b8d1c4b6dad  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-events        3          26336           27908           1572     rdkafka-bc571d01-e18f-4d1b-b1c0-97788c28ee86  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-events        4          27361           28119           758      rdkafka-e9af7b70-31d6-4539-8408-3c98cd043ee9  /xxx.xxx.xxx.xxx  rdkafka
ingest-consumer  ingest-events        1          26701           27901           1200     rdkafka-1d2101c1-7407-4b51-999c-a90bc5cf569a  /xxx.xxx.xxx.xxx  rdkafka

How can I fine-tune and properly balance the workload across the multiple ingest-consumers and Kafka so that ingest-events are processed efficiently under a production workload?
Please let me know if there are better Kafka consumer tuning options available.
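
For context, the kind of tuning I have in mind looks roughly like this (a sketch; I am assuming --max-batch-time-ms exists alongside --max-batch-size in this Sentry version, `sentry run ingest-consumer --help` should confirm the actual options):

sentry run ingest-consumer --all-consumer-types \
  --max-batch-size 1000 \
  --max-batch-time-ms 5000   # assumed option: how long a batch may accumulate before a time-based flush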

Hi @BYK,
Wondering if you can guide me here. Sorry for the follow-up.

Thanks.

You may want to try increasing the number of workers you have: https://develop.sentry.dev/self-hosted/troubleshooting/#workers
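
As a rough sketch (the queue name here is only illustrative; `sentry queues list` shows the actual queues and `sentry run worker --help` the supported options), a dedicated worker per busy queue can look like:

sentry queues list
sentry run worker -Q events.save_event --concurrency 4   # one such deployment per queue you want to isolate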

Hi @BYK,

Thank you for the guidance. I deployed individual workers for every queue and increased the replicas to work through the lag. For the ingest-consumer I also increased the partitions further and deployed dedicated replicas to process the queues, and the lag cleared after a cool-off period.
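
Roughly, the dedicated ingest-consumer replicas now run with arguments along these lines (a sketch; I am assuming --consumer-type is available for splitting the events/transactions/attachments consumers, check `sentry run ingest-consumer --help` on your version):

["--config", "/shared-config/", "run", "ingest-consumer", "--consumer-type", "events", "--max-batch-size", "1000"]
["--config", "/shared-config/", "run", "ingest-consumer", "--consumer-type", "transactions"]
["--config", "/shared-config/", "run", "ingest-consumer", "--consumer-type", "attachments"]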

