Events have stopped appearing

Thanks in advance for your help.

My setup:

  • Running getsentry/onpremise at 20.12.1, installed using the install.sh script on Ubuntu
  • 8GB RAM / 2CPU@2.5GHz VM
  • Using external Postgres server
  • Using external Redis instance
  • We have a pretty high volume of events
  • Migrated from a v9 setup.

Everything was running smoothly on 20.12.1 until events suddenly stopped appearing. There was no update or migration happening at the time.

I’ve tried restarting and recreating containers and rebooting the host VM, but it appears that events are no longer being processed. CPU usage has been pegged at 100% for a few days.

Most of the CPU seems to be consumed by Snuba and Celery.

Snuba keeps restarting over and over.

docker logs sentry_onpremise_snuba-consumer_1 shows the logs below, which suggest the issue is somewhere in Kafka.

I’ve had a look through topics in this forum and on GitHub and haven’t found anything which has helped.

    + '[' c = - ']'
    + snuba consumer --help
    + set -- snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
    + set gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
    + exec gosu snuba snuba consumer --storage events --auto-offset-reset=latest --max-batch-time-ms 750
    2020-12-21 00:28:38,313 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 8412616}
    2020-12-21 00:28:38,315 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
    Traceback (most recent call last):
      File "/usr/local/bin/snuba", line 33, in <module>
        sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
      File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
      File "/usr/src/snuba/snuba/cli/consumer.py", line 161, in consumer
        consumer.run()
      File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
        self._run_once()
      File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 139, in _run_once
        self.__message = self.__consumer.poll(timeout=1.0)
      File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 749, in poll
        return super().poll(timeout)
      File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll
        raise OffsetOutOfRange(str(error))
    snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
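
For anyone reading the trace: OFFSET_OUT_OF_RANGE means the offset the consumer wants to resume from (8412616 for partition 0 of the events topic here) lies outside the range of offsets the broker currently holds for that partition. Below is a minimal sketch of that check, assuming you have looked up the earliest and latest offsets yourself. The function name and the example bounds are made up for illustration and are not taken from this setup.

```shell
# Hypothetical helper: classify a committed offset against the range
# [earliest, latest] that the broker still holds for one partition.
classify_offset() {
  committed="$1"
  earliest="$2"
  latest="$3"
  if [ "$committed" -lt "$earliest" ]; then
    # Older messages are gone, e.g. deleted by retention.
    echo "below-earliest"
  elif [ "$committed" -gt "$latest" ]; then
    # Offset points past the end of the log, e.g. the topic was recreated.
    echo "beyond-latest"
  else
    echo "in-range"
  fi
}

# 8412616 is the offset from the log; the two bounds are invented examples.
classify_offset 8412616 9000000 9500000
```

Either of the two out-of-range results would produce the error in the log; only the in-range case lets the consumer resume where it left off.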

Have you tried the steps in this topic: "Sentry no more catch errors"?

Thanks for helping Alexander,

No, I hadn’t seen that one. I gave it a try today (ran all the commands, none failed), but it doesn’t seem to have fixed the issue. The Snuba containers are still restarting over and over with Kafka offset-out-of-range errors.

Is there a way to reset the Kafka storage entirely? Will that affect the historical saved events?

Here’s what I use to reset offsets in my Kafka cluster. Replace the bootstrap address with your own. Without --execute the command only previews the new offsets; add it to actually apply them:

unset JMX_PORT; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --group snuba-post-processor --all-topics --to-latest [--execute]

To list the Kafka consumer groups:

unset JMX_PORT; bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
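
The two commands above can be wrapped up roughly like this. It is a sketch only: it assumes the same script path and bootstrap address as above, and the function names and the KAFKA_GROUPS and BOOTSTRAP variables are my own. Kafka only resets offsets for a group that has no active members, so stop the consumers in that group first.

```shell
# Assumed defaults, taken from the commands above; override as needed.
KAFKA_GROUPS="${KAFKA_GROUPS:-bin/kafka-consumer-groups.sh}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"

# List all consumer groups known to the cluster.
list_groups() {
  (unset JMX_PORT
   "$KAFKA_GROUPS" --bootstrap-server "$BOOTSTRAP" --list)
}

# Reset every topic of one group to the latest offset.
# With no second argument this is only a preview; pass --execute to apply.
reset_to_latest() {
  group="$1"
  mode="${2:-}"
  (unset JMX_PORT
   "$KAFKA_GROUPS" --bootstrap-server "$BOOTSTRAP" \
     --reset-offsets --group "$group" --all-topics --to-latest $mode)
}

# Usage:
#   list_groups
#   reset_to_latest snuba-post-processor             # preview only
#   reset_to_latest snuba-post-processor --execute   # apply
```

Running the preview first and checking the offsets it prints before repeating with --execute is the cautious order. Resetting to latest skips whatever unprocessed messages are still in the topic.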

I found that this only happens when some other error keeps the service from running for long enough that the messages Kafka retains expire. You can buy yourself more time to recover by extending how long Kafka retains messages, using these broker settings:

offsets.retention.minutes
log.retention.hours
log.retention.bytes
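
To get a feel for the numbers, here is a small sketch that prints values for those three settings from a retention window in days. The helper name is made up, and 7 days is only an example. Where the lines go depends on how your broker is configured (server.properties on a plain Kafka install, for instance).

```shell
# Hypothetical helper: print the three retention settings for a window
# given in whole days.
retention_props() {
  days="$1"
  # How long message log segments are kept.
  echo "log.retention.hours=$((days * 24))"
  # How long committed consumer-group offsets are kept.
  echo "offsets.retention.minutes=$((days * 24 * 60))"
  # -1 means no size-based limit, so only the time limit applies.
  echo "log.retention.bytes=-1"
}

retention_props 7
```

A longer window means Kafka keeps more data on disk, so check free space before raising it on a high-volume instance.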

Thanks @caseyduquettesc. That command is pretty similar to the one in the issue @Alexander linked, but it has the additional --all-topics flag. Is that the key here?

I’ve just come back from holidays, and it seems that while I was away the Kafka server recovered on its own and events are appearing again. I’m not sure whether the command @Alexander sent earlier did work and just took a few days to take effect. Either way, I’ll keep your advice in mind if it happens again, and I’ll look into adjusting those retention settings.

Thanks again for your help both of you, happy new year!
