I noticed the checks on https://status.sentry.io/, but I couldn't find out how they work or what they check. Is there a wiki or repo that contains the checks running for the status page? Is there any documentation on how to verify that Sentry is running correctly? In Docker we can add healthchecks, but I don't think they cover whether Sentry is actually processing data.
Which URLs need to be monitored? I found /_health/, but it's not clear what it checks. Is there a script that can check the status of all the Kafka consumers and Sentry workers? How can I make sure that Snuba is working correctly? The architecture is very complex, and I can't find documentation that covers monitoring all of the data flows (besides the Datadog integration, which is not open source).
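As a starting point, a poller that hits a list of HTTP health URLs and fails if any of them does not return HTTP 200 might look like the sketch below. The endpoint list is an assumption for a self-hosted setup: `http://localhost:9000/_health/` for the Sentry web service and `http://snuba-api:1218/health` for the Snuba API are hypothetical values to adjust, not a documented list of what must be monitored. A 200 here only shows that the HTTP layer answers, which is exactly the limitation raised above.

```python
import urllib.error
import urllib.request
from typing import Dict, Tuple

# Hypothetical endpoint list: hosts, ports and the Snuba path are assumptions
# for a self-hosted deployment and must be adjusted to yours.
ENDPOINTS: Dict[str, str] = {
    "sentry-web": "http://localhost:9000/_health/",
    "snuba-api": "http://snuba-api:1218/health",
}


def check_endpoint(url: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Return (healthy, detail) for one URL; healthy means HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # Non-2xx responses raise HTTPError; keep the status code for the report.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, etc.
        return False, f"unreachable: {exc}"


def check_all(endpoints: Dict[str, str]) -> Dict[str, Tuple[bool, str]]:
    """Check every endpoint and return the results keyed by name."""
    return {name: check_endpoint(url) for name, url in endpoints.items()}


def main() -> int:
    """Print one line per endpoint; return a non-zero exit code on any failure."""
    results = check_all(ENDPOINTS)
    for name, (ok, detail) in results.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
    return 0 if all(ok for ok, _ in results.values()) else 1
```

Calling `sys.exit(main())` from a cron job or monitoring agent turns this into a simple alert source; it does not say anything about Kafka consumers or workers.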
The Docker healthchecks are useful, but I'm also interested in the queue workers, which don't have Docker healthchecks.
For example, in our dev environment no data was entering Sentry even though all the Docker containers were up and running. I noticed that the Kafka consumer group ingest-consumer was Empty, and the logs showed that the ingest-consumer container had hit an error and was doing nothing (it did not crash, so it was blocked).
This showed me that we need to monitor Kafka consumer groups and possibly their lag (I'm not sure how much lag is too much). It would be helpful to have this information in a monitoring recommendations wiki for Sentry administrators.
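Consumer lag per partition is the log-end offset minus the group's committed offset; Kafka's `kafka-consumer-groups --describe --group <group>` tool reports the same numbers as CURRENT-OFFSET, LOG-END-OFFSET and LAG. The sketch below flags both symptoms from the example above: a group with no active members (state `Empty`) and any partition whose lag exceeds a threshold. `MAX_LAG` is a hypothetical value, since the right number depends on event volume.

```python
from typing import Dict, List, Optional, Tuple

TopicPartition = Tuple[str, int]

# Hypothetical threshold: tune to your traffic; there is no universal value.
MAX_LAG = 10_000


def partition_lag(committed: Optional[int], end: int) -> int:
    """Messages not yet consumed: log-end offset minus committed offset.

    A partition with no committed offset is counted conservatively as the
    full log-end offset (an overestimate if retention deleted old messages).
    """
    if committed is None:
        return end
    return max(end - committed, 0)


def assess_consumer_group(
    group: str,
    state: str,
    member_count: int,
    committed: Dict[TopicPartition, Optional[int]],
    end_offsets: Dict[TopicPartition, int],
    max_lag: int = MAX_LAG,
) -> List[str]:
    """Return human-readable problems for one consumer group (empty list = OK)."""
    problems: List[str] = []
    # An "Empty" group has no live members, so nothing consumes its topics.
    if state == "Empty" or member_count == 0:
        problems.append(f"{group}: no active members (state={state})")
    for (topic, partition), end in sorted(end_offsets.items()):
        lag = partition_lag(committed.get((topic, partition)), end)
        if lag > max_lag:
            problems.append(f"{group}: {topic}[{partition}] lag {lag} > {max_lag}")
    return problems
```

The offsets can come from any Kafka client; with kafka-python, for instance, `KafkaAdminClient.list_consumer_group_offsets()` returns committed offsets and `KafkaConsumer.end_offsets()` returns log-end offsets. A lag that keeps growing between runs is often a better signal than any single absolute value.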
It's also not clear to me what the /_health/ healthcheck actually checks: does it only confirm that the HTTP interface is up, or does it also test the connections to all dependencies, such as the database, Redis, etc.?
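A monitor for /_health/ should at least distinguish "the web process answers" from "the deeper checks found problems". The sketch below assumes two response shapes that should be verified against your Sentry version: the plain `/_health/` returning HTTP 200 with the text `ok`, and a `/_health/?full` variant returning JSON with a `problems` list. Whether that full variant covers the database or Redis is exactly the open question above; the code only classifies whatever comes back.

```python
import json
from typing import List, Tuple


def interpret_health(status: int, body: str) -> Tuple[bool, List[str]]:
    """Classify a /_health/ response as (healthy, problems).

    Assumed shapes (verify against your Sentry version):
    - plain /_health/: HTTP 200 with the text "ok" (HTTP layer is up)
    - /_health/?full: JSON whose "problems" list describes failed checks
    """
    problems: List[str] = [] if status == 200 else [f"HTTP {status}"]
    text = body.strip()
    if text == "ok":
        return not problems, problems
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False, problems + ["unrecognised response body"]
    raw = payload.get("problems", []) if isinstance(payload, dict) else []
    for problem in raw:
        # Each problem is assumed to be a dict carrying a "message" field.
        if isinstance(problem, dict):
            problems.append(str(problem.get("message", problem)))
        else:
            problems.append(str(problem))
    return not problems, problems
```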
Also, how can I monitor the Sentry workers? Sentry shows a message when no workers are running, but what check is used to produce that message?
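For queue workers without Docker healthchecks, one generic pattern is a heartbeat: a task sent through the same queue the workers consume records a timestamp, and a monitor alerts when that timestamp gets too old. Whether Sentry's "no workers" message works this way is what is being asked here; the sketch only illustrates the pattern, with a plain dict standing in for a shared store such as Redis and a hypothetical key name and threshold.

```python
import time
from typing import Dict, Optional

# Hypothetical values: pick a key and an age limit that suit your deployment.
HEARTBEAT_KEY = "worker-heartbeat"
MAX_HEARTBEAT_AGE_SECONDS = 300.0


def record_heartbeat(store: Dict[str, float], key: str = HEARTBEAT_KEY) -> None:
    """Meant to run *as a queued task*: if no worker is consuming, it never runs."""
    store[key] = time.time()


def heartbeat_is_stale(
    last_beat: Optional[float],
    now: Optional[float] = None,
    max_age: float = MAX_HEARTBEAT_AGE_SECONDS,
) -> bool:
    """True if no heartbeat was ever recorded or the newest one is too old.

    Timestamps are Unix seconds.
    """
    if last_beat is None:
        return True
    if now is None:
        now = time.time()
    return now - last_beat > max_age
```

Routing the heartbeat through the real queue is the important design choice: it proves that messages are actually being consumed, which a container-level "is the process alive" check cannot show, as the blocked ingest-consumer example illustrates.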