Can’t get email alerts anymore

Hi there,

As per the subject, we no longer receive new email alerts since the Sentry host ran out of disk space.
The test email works correctly.

I’m still trying to figure out which container should log the error…

Please help :slight_smile:

UPDATE: containers are all up & running

snuba-subscription-consumer-transactions_1 errors:
snuba-subscription-consumer-transactions_1  | 2020-11-25 15:34:56,240 Could not construct valid time interval between MessageDetails(offset=1991, timestamp=datetime.datetime(2020, 11, 25, 15, 34, 54, 166000)) and Message(partition=Partition(topic=Topic(name='events'), index=0), offset=1992)!
snuba-subscription-consumer-transactions_1  | Traceback (most recent call last):
snuba-subscription-consumer-transactions_1  |   File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 129, in poll
snuba-subscription-consumer-transactions_1  |     time_interval = Interval(previous_message.timestamp, message.timestamp)
snuba-subscription-consumer-transactions_1  |   File "<string>", line 5, in __init__
snuba-subscription-consumer-transactions_1  |   File "/usr/src/snuba/snuba/utils/types.py", line 67, in __post_init__
snuba-subscription-consumer-transactions_1  |     raise InvalidRangeError(self.lower, self.upper)
snuba-subscription-consumer-transactions_1  | snuba.utils.types.InvalidRangeError: (datetime.datetime(2020, 11, 25, 15, 34, 54, 166000), datetime.datetime(2020, 11, 25, 15, 34, 54, 165000))
snuba-subscription-consumer-events_1 errors:
snuba-subscription-consumer-events_1        | 2020-11-25 15:43:05,698 Could not construct valid time interval between MessageDetails(offset=2296, timestamp=datetime.datetime(2020, 11, 25, 15, 43, 3, 303000)) and Message(partition=Partition(topic=Topic(name='events'), index=0), offset=2297)!
snuba-subscription-consumer-events_1        | Traceback (most recent call last):
snuba-subscription-consumer-events_1        |   File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 129, in poll
snuba-subscription-consumer-events_1        |     time_interval = Interval(previous_message.timestamp, message.timestamp)
snuba-subscription-consumer-events_1        |   File "<string>", line 5, in __init__
snuba-subscription-consumer-events_1        |   File "/usr/src/snuba/snuba/utils/types.py", line 67, in __post_init__
snuba-subscription-consumer-events_1        |     raise InvalidRangeError(self.lower, self.upper)
snuba-subscription-consumer-events_1        | snuba.utils.types.InvalidRangeError: (datetime.datetime(2020, 11, 25, 15, 43, 3, 303000), datetime.datetime(2020, 11, 25, 15, 43, 3, 300000))
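Both tracebacks fail the same way: in each InvalidRangeError the second timestamp is a few milliseconds earlier than the first, so the interval would run backwards in time. A minimal sketch of that kind of check, assuming a simplified dataclass modeled on the traceback (an illustration only, not the actual Snuba source):

```python
from dataclasses import dataclass
from datetime import datetime


class InvalidRangeError(ValueError):
    """Raised when an interval's lower bound is after its upper bound."""


@dataclass(frozen=True)
class Interval:
    lower: datetime
    upper: datetime

    def __post_init__(self) -> None:
        # Assumed check: reject intervals that run backwards in time.
        if self.lower > self.upper:
            raise InvalidRangeError(self.lower, self.upper)


# Timestamps copied from the snuba-subscription-consumer-events_1 error above.
previous_ts = datetime(2020, 11, 25, 15, 43, 3, 303000)
current_ts = datetime(2020, 11, 25, 15, 43, 3, 300000)

try:
    Interval(previous_ts, current_ts)
    went_backwards = False
except InvalidRangeError:
    went_backwards = True

# Show whether the check fired and how far the newer message lags behind.
print(went_backwards, previous_ts - current_ts)
```

So the consumer is seeing a message whose timestamp is slightly older than that of the message before it on the same partition.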

UPDATE 2: the Kafka events and ingest-events topics keep growing. Is that normal?

3.0M	/var/lib/kafka/data/__consumer_offsets-0
4.0K	/var/lib/kafka/data/cdc-0
0		/var/lib/kafka/data/cleaner-offset-checkpoint
4.0K	/var/lib/kafka/data/errors-replacements-0
8.0K	/var/lib/kafka/data/event-replacements-0
436M	/var/lib/kafka/data/events-0
4.0K	/var/lib/kafka/data/events-subscription-results-0
328K	/var/lib/kafka/data/ingest-attachments-0
436M	/var/lib/kafka/data/ingest-events-0
8.0K	/var/lib/kafka/data/ingest-sessions-0
4.0K	/var/lib/kafka/data/ingest-transactions-0
4.0K	/var/lib/kafka/data/log-start-offset-checkpoint
4.0K	/var/lib/kafka/data/meta.properties
2.3M	/var/lib/kafka/data/outcomes-0
4.0K	/var/lib/kafka/data/recovery-point-offset-checkpoint
4.0K	/var/lib/kafka/data/replication-offset-checkpoint
396K	/var/lib/kafka/data/snuba-commit-log-0
4.0K	/var/lib/kafka/data/transactions-subscription-results-0
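To see at a glance which topics dominate, a listing like the one above can be sorted by size. A rough sketch, assuming the output has one human-readable size and one path per line (as `du -sh` would print); the helper names are made up for illustration:

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a human-readable size such as '436M' or '4.0K' to bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(float(text))  # plain number, e.g. '0'


def largest_entries(du_output: str, top: int = 3) -> list:
    """Return the `top` largest (name, bytes) pairs from the listing."""
    entries = []
    for line in du_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        size, path = parts
        entries.append((path.rsplit("/", 1)[-1], parse_size(size)))
    return sorted(entries, key=lambda e: e[1], reverse=True)[:top]


# A few lines taken from the listing above.
sample = """\
3.0M\t/var/lib/kafka/data/__consumer_offsets-0
436M\t/var/lib/kafka/data/events-0
436M\t/var/lib/kafka/data/ingest-events-0
2.3M\t/var/lib/kafka/data/outcomes-0
"""

for name, size in largest_entries(sample):
    print(f"{name}: {size / 1024**2:.1f} MiB")
```

Running this against two listings taken some time apart would show whether events-0 and ingest-events-0 are still growing or have levelled off.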

Sorry to be dense … are you expecting an email telling you that you’ve run out of disk space? Or an email alert about something else, which disk space seems to prevent?

No, I think they are referring to the notification emails. The emails not being sent, together with the growing Kafka topics, points to stuck or broken workers. I’d check the worker logs.
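One way to narrow down which service is failing is to count tracebacks per container in the combined log output. A small sketch, assuming the `service_1 | message` line format shown in the logs earlier in this thread; the function name and the marker string are illustrative choices:

```python
from collections import Counter

TRACEBACK_MARKER = "Traceback (most recent call last)"


def count_tracebacks_by_service(log_text: str) -> Counter:
    """Count Python tracebacks per service in docker-compose style logs."""
    counts = Counter()
    for line in log_text.splitlines():
        service, sep, message = line.partition("|")
        if not sep:
            continue  # not a "service | message" line
        if TRACEBACK_MARKER in message:
            counts[service.strip()] += 1
    return counts


# Lines in the same shape as the errors posted above.
sample = """\
snuba-subscription-consumer-events_1        | Traceback (most recent call last):
snuba-subscription-consumer-events_1        |   File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 129, in poll
snuba-subscription-consumer-transactions_1  | Traceback (most recent call last):
"""

for service, n in count_tracebacks_by_service(sample).most_common():
    print(f"{service}: {n} traceback(s)")
```

The input could be the saved output of `docker-compose logs --no-color`. Services that never appear in the result may still be stuck without logging anything, so this is only a starting point.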

Hi guys,

thanks a lot for your answers.

@chadwhitacre an issue about that in the ‘internal’ project would be great :smiley:

@BYK yes! you are right!
When the host ran out of disk space, I freed some space and restarted the cluster.
Notification emails did not start working again as expected.

I had to fix it by removing all volumes except:

  • sentry-data
  • sentry-postgres
  • sentry-clickhouse

Is this the right approach in this scenario?
Could you please suggest some hints or integrity checks we can run to make sure the cluster is healthy?

PRs accepted :grin:

Not ideal, as I think you just removed all in-flight events (which may be okay to lose, but we’d rather avoid it). I think the correct recovery would be to follow these instructions: https://github.com/getsentry/onpremise/issues/478#issuecomment-666254392 (official docs on this are coming soon), as I’m assuming it was Kafka that was having issues.

If you can find your logs from that time (especially for the worker service), we can make more educated guesses.

You can try sending a test event and make sure it shows up. If you already have a project set up for this you can try that or do what we do here in onpremise tests: https://github.com/getsentry/onpremise/blob/19f4561a9e2abe32dc5eb5a03a332b50f2265b4b/test.sh

Hope these help!

Thanks a lot!

