Snuba consumer: Exception: Broker: Not enough in-sync replicas

Hi There,

I have set up my on-premise environment in the following way: an EKS cluster, with Redis, Kafka and PostgreSQL as AWS managed services (ElastiCache, MSK and RDS respectively).
ClickHouse and Symbolicator are deployed as StatefulSets.

While load testing the setup to verify scale-in and scale-out,
the Snuba component snuba-event-consumer goes into a CrashLoopBackOff.

Logs:

```
❯ kubectl logs snuba-event-consumer-xxxxxxxx -n sentry
2020-12-17 09:46:47,231 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 13713}
2020-12-17 09:46:49,335 Completed processing <Batch: 368 messages, open for 2.09 seconds>.
2020-12-17 09:46:50,635 Caught Exception('Broker: Not enough in-sync replicas'), shutting down...
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/consumer.py", line 161, in consumer
    consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 144, in _run_once
    self.__processing_strategy.poll()
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/streaming/transform.py", line 55, in poll
    self.__next_step.poll()
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/streaming/collect.py", line 122, in poll
    self.__close_and_reset_batch()
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/streaming/collect.py", line 105, in __close_and_reset_batch
    self.__batch.join()
  File "/usr/src/snuba/snuba/utils/streams/processing/strategies/streaming/collect.py", line 73, in join
    self.__step.join(timeout)
  File "/usr/src/snuba/snuba/consumer.py", line 238, in join
    self.__replacement_batch_writer.join(timeout)
  File "/usr/src/snuba/snuba/consumer.py", line 163, in join
    self.__producer.flush(*args)
  File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 755, in __commit_message_delivery_callback
    raise Exception(error.str())
Exception: Broker: Not enough in-sync replicas
```

The snuba-event-consumer deployment file:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: snuba
  name: snuba-event-consumer
  namespace: sentry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snuba-event-consumer
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: snuba-event-consumer
    spec:
      containers:
        - image: getsentry/snuba:77a6bbfc892c442e3a2230ca20cc6bcc5e2620ce
          imagePullPolicy: Always
          name: snuba-event-consumer
          resources:
            requests:
              cpu: "0.125"
              memory: "350Mi"
            limits:
              cpu: "0.25"
              memory: "700Mi"
          command: ["snuba"]
          args: ["consumer", "--storage", "events", "--auto-offset-reset=latest", "--max-batch-time-ms", "750"]
          envFrom:
            - configMapRef:
                name: snuba-config
```

The snuba-config ConfigMap is as follows:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: snuba-config
  namespace: sentry
data:
  SNUBA_SETTINGS: docker
  CLICKHOUSE_HOST: clickhousedb
  DEFAULT_BROKERS: 'xxx:9092,xxxx:9092,xxx:9092'
  REDIS_HOST: redis-service
  UWSGI_MAX_REQUESTS: '10000'
  UWSGI_DISABLE_LOGGING: 'true'
```

The default Kafka options provided in sentry.conf.py are as follows:

```
DEFAULT_KAFKA_OPTIONS = {
    # librdkafka expects bootstrap.servers as a single comma-separated string.
    "bootstrap.servers": "xxx:9092,xxxx:9092,xxx:9092",
    "message.max.bytes": 50000000,
    "socket.timeout.ms": 10000,
    "acks": 1,
}
```

Are there any recommendations for the following Kafka configuration settings? Also, are there any Snuba-specific Kafka settings to be used? I can't find any in the documentation.

min.insync.replicas
replication.factor
acks

At present I have the following Kafka configuration set in MSK:

```
auto.create.topics.enable = true
delete.topic.enable = true
default.replication.factor = 3
min.insync.replicas = 2
```

Hello,

About setting the Kafka connection configuration in Snuba, this is the place: https://github.com/getsentry/snuba/blob/d03cba8618a75f57f316d542785b9a4cb5fde239/snuba/settings.py#L78-L81
There is a list of parameters supported for modification here: https://github.com/getsentry/snuba/blob/d03cba8618a75f57f316d542785b9a4cb5fde239/snuba/utils/streams/backends/kafka.py#L624-L632
Sorry, it is not documented yet; it is fairly new.
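
As a rough illustration of what an override there could look like (the BROKER_CONFIG name and the keys shown below are assumptions taken from the linked lines, not something verified against every Snuba version):

```
# Hypothetical sketch only: the BROKER_CONFIG name and which keys Snuba will
# actually accept are assumptions based on the settings.py lines linked above;
# check the file at your Snuba revision before relying on this.
import os

BROKER_CONFIG = {
    # Comma-separated broker list, e.g. the three MSK brokers from your ConfigMap.
    "bootstrap.servers": os.environ.get("DEFAULT_BROKERS", "localhost:9092"),
    # Any other librdkafka parameter, provided it appears in the
    # supported-parameters list linked above.
    "socket.timeout.ms": 10000,
}
```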

Regarding the broker configuration, that depends on what guarantees of durability and consistency you are looking for in your deployment, so it is hard to provide general best practices. One of the reasons why you may have run into the issue is that the default value for acks on the producer (https://docs.confluent.io/5.5.0/clients/librdkafka/md_CONFIGURATION.html) is all, and you may just not have had enough Kafka broker replicas up, running and in sync at the time the producer tried to send its message.
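
To make the mechanism concrete, here is a minimal sketch (not Snuba code; broker addresses and the topic are placeholders) of how acks="all" combined with too few in-sync replicas surfaces this exact error in the delivery callback:

```
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",  # placeholders
    "acks": "all",  # librdkafka producer default; waits for all in-sync replicas
})

def delivery_callback(error, message):
    # Snuba's producers raise inside their delivery callback in the same way,
    # which is what produces the traceback above.
    if error is not None:
        raise Exception(error.str())

# With min.insync.replicas=2 on the broker and a topic whose replication factor
# is 1, the broker rejects the write and the callback fires during flush() with
# "Not enough in-sync replicas".
producer.produce("event-replacements", b"payload", on_delivery=delivery_callback)
producer.flush()
```
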
Another possibility is that the replication factor of your specific topic is lower than min.insync.replicas (this is set when the topic is created, and it may not have taken the default value if another value was specified). Could you check what replication factor your topic has? In this case it should be either snuba-commit-log or event-replacements; I cannot tell from the error message which one failed. You can use the describe command for this: bin/kafka-topics.sh --describe --topic
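
If it is easier from where the consumers run, the same check can be done from Python with confluent-kafka's admin client (broker addresses below are placeholders); a sketch:

```
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092"})  # placeholders
metadata = admin.list_topics(timeout=10)

for topic in ("snuba-commit-log", "event-replacements"):
    for partition_id, partition in metadata.topics[topic].partitions.items():
        # len(partition.replicas) is the topic's replication factor; compare it
        # (and the live in-sync replicas) against the broker's min.insync.replicas.
        print(topic, partition_id, "replicas:", len(partition.replicas), "in-sync:", len(partition.isrs))
```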

Hope this helps
Filippo

Hi Filippo,

Thank you for the quick response.
Yes, you are right: the topics "event-replacements" and "snuba-commit-log" had a replication factor of 1 while min.insync.replicas was 2.
I had to delete the topics and bootstrap them again, which then picked up the MSK configuration values of replication factor 3 and min.insync.replicas 2. On load testing, the pods sustained the load and scaled up as expected.
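
For anyone hitting the same thing who prefers to re-create the topics explicitly rather than relying on auto-creation, a rough sketch with confluent-kafka's admin client (broker addresses and partition counts here are illustrative, not the exact values Snuba expects):

```
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092"})  # placeholders

# Delete the under-replicated topics and wait for the operations to complete.
for future in admin.delete_topics(["event-replacements", "snuba-commit-log"]).values():
    future.result()

# Re-create them with a replication factor that satisfies min.insync.replicas=2.
# (Deletion can take a moment to propagate on the brokers before this succeeds.)
new_topics = [
    NewTopic("event-replacements", num_partitions=1, replication_factor=3),
    NewTopic("snuba-commit-log", num_partitions=1, replication_factor=3),
]
for future in admin.create_topics(new_topics).values():
    future.result()
```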

Thanks,
Vinotha
