Single point of failure on On-Premise

I’m working on a POC of an on-premise Sentry server on AWS. Currently, I’m running the default on-premise Sentry server on a single EC2 instance with the default configuration.

I have a question regarding single points of failure with the default setup. What is the general plan for avoiding them? As far as I understand, there are a couple of areas:

  • Webserver - we can create an autoscaling group and put a load balancer in front of it.
  • Datastore - ClickHouse and Postgres in Sentry 10; what would be the best way to avoid a single point of failure here?
  • Cache and event processing - Kafka and Redis; do we need to think more about single points of failure here?

In general, we don’t expect our data to be very large, so I believe we don’t need to worry much about scaling. However, if we don’t scale horizontally, is there a way to schedule a cleanup so the database doesn’t grow too big?

I’m very new to Sentry; thanks for any help :slight_smile: I will definitely return the favor to the community by sharing my experience throughout the journey.

Thanks,
Martin

We already have built-in cleanup tasks. You can play with the event retention period: https://github.com/getsentry/onpremise/blob/f4c309624538ca0ebce1ca5b0ab714f0b22d9921/.env#L2
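For reference, here is a minimal sketch of how that knob flows through, assuming the stock self-hosted layout where .env feeds sentry.conf.py:

```python
import os

# SENTRY_EVENT_RETENTION_DAYS is the variable on line 2 of the linked .env
# (it defaults to 90 days there). sentry.conf.py reads it roughly like this,
# and the built-in cleanup task then prunes events older than that window.
SENTRY_EVENT_RETENTION_DAYS = int(os.environ.get("SENTRY_EVENT_RETENTION_DAYS", "90"))
```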

I’ll be honest: if you want your Sentry instance to scale without SPoFs and you are asking around, you should probably consider using https://sentry.io, where all of this is taken care of for you :slight_smile:

Now, after the shameless plug, here are some thoughts:

  1. Your initial failure point will very likely be Redis, which holds everything in memory and is used both as a key-value store and as a poor man’s task queue. The first thing I’d do is add RabbitMQ for the task queues (see the sketch after this list).
  2. We only ship with a single worker service which tries to do everything. If you check https://develop.sentry.dev/self-hosted/troubleshooting/#workers (along with the other good advice there), you’ll see that you can create multiple dedicated workers to increase ingestion speed; the sketch below also notes the worker command.
  3. Around the same time, you’d likely need to scale out Kafka. Even if not for scale, you should probably have some redundancy there.
  4. Right around there, you’ll likely notice that your nodestore, which defaults to Postgres on self-hosted, starts to get too big. We use Bigtable for this, if that option is available to you. Someone on the forum or GitHub issues (Google is your friend) talked about writing an S3 backend as a nodestore, which may work for you.
  5. After all of these, ClickHouse may become a problem. It already runs in single-instance mode, so you have no redundancy there. It would make sense to switch it to cluster mode (not easy!).
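To make points 1 and 2 concrete, here is a hedged sketch of routing the Celery task queues through RabbitMQ via sentry.conf.py. The hostname, vhost, and credentials are placeholders for whatever RabbitMQ deployment you stand up next to Sentry, and the setting name is worth verifying against the config that ships with your Sentry version:

```python
# sentry.conf.py -- route the Celery task queues through RabbitMQ instead
# of Redis. Host, port, vhost, and credentials below are placeholders.
BROKER_URL = "amqp://guest:guest@rabbitmq:5672/sentry"

# For dedicated workers (point 2), the troubleshooting doc linked above
# shows running queue-specific workers, e.g.:
#   sentry run worker -Q events.save_event
```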

There are more things you can do, like adding more webservers (or better, Relays). As you can see, it is a complex beast and almost a never-ending job, hence my suggestion to use the hosted service.

Port your Sentry from Docker Compose to AWS ECS. A single EC2 instance is not a solution you should pursue further if you want serious scalability and security for your data.

But in general, to help you out:

  • Use an AWS MSK Kafka cluster to replace the Confluent Kafka stack that ships with Sentry.
  • Use AWS ElastiCache Redis to replace the Redis container.
  • Replace the Nginx with an ALB and add the proper rules for HTTPS-based traffic so requests are forwarded properly.
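As a rough illustration (not an authoritative recipe), those replacements boil down to pointing Sentry’s config at the managed endpoints. The endpoints below are placeholders, and the exact option layout varies between Sentry versions, so check each key against the sentry.conf.py that ships with your release:

```python
# sentry.conf.py -- assumes the stock self-hosted config, where
# SENTRY_OPTIONS and KAFKA_CLUSTERS already exist via
# `from sentry.conf.server import *`. All endpoints are placeholders.

# Redis: point the default cluster at an ElastiCache endpoint.
SENTRY_OPTIONS["redis.clusters"] = {
    "default": {
        "hosts": {
            0: {"host": "my-cache.abc123.use1.cache.amazonaws.com", "port": 6379},
        },
    },
}

# Kafka: point the default cluster at the MSK bootstrap brokers.
KAFKA_CLUSTERS["default"]["bootstrap.servers"] = (
    "b-1.my-msk.abc123.c2.kafka.us-east-1.amazonaws.com:9092"
)
```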

Once you gain further knowledge of the underlying components, proceed to port to ECS :grinning_face_with_smiling_eyes:

If that is too much for you, https://sentry.io is a good alternative that is priced fairly.

Thanks @BYK, this is really helpful. I expected hosting it ourselves to be a really large effort; the upside is getting a better understanding of the components. I’m currently working on a POC, so I’m evaluating both sentry.io and on-premise.

To clarify, does Postgres only store metadata for Sentry, or does it have the complete set of raw events? I’m trying to see whether ClickHouse has everything or is only for boosting indexing and searching.

Hey @OpCode, thanks so much. Do we see a significant performance gain using K8s vs. ECS?

One of the concerns with sentry.io is data privacy/security; is there an enterprise solution?

ClickHouse is only for boosting indexing and searching. That said, I’m not so sure about performance data; @fpacifici can give a more authoritative answer on this.

The nodestore has the “raw” data, and on on-premise we use Postgres as the nodestore. That said, in production we use Google Bigtable, and there are other people in the community using S3 for this.
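For context, the nodestore backend is a single setting in sentry.conf.py. A sketch, assuming the backend paths below match your Sentry version (the Django one is the self-hosted default; the Bigtable path and its options are illustrative, not exhaustive):

```python
# sentry.conf.py -- the self-hosted default keeps raw event payloads in
# Postgres through the Django-backed nodestore:
SENTRY_NODESTORE = "sentry.nodestore.django.DjangoNodeStorage"

# Swapping backends is a one-line change plus backend-specific options,
# e.g. Bigtable (option names here are illustrative assumptions):
# SENTRY_NODESTORE = "sentry.nodestore.bigtable.BigtableNodeStorage"
# SENTRY_NODESTORE_OPTIONS = {"project": "my-gcp-project", "instance": "sentry"}
```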

Got it, thanks @BYK

If K8s is your cup of tea, then sure, go for it.
The problem you have to solve is not where, or on top of what, you want Sentry up and running; any major platform should be sufficient.

It is the how you have to solve: lifecycle management, updates, and migrations.
And I think K8s and automation can provide that.

Thanks for the note, @OpCode!

ClickHouse is what powers the time-series features and search.
So it does not have the full payload of the events (that is only in the nodestore). Still, if ClickHouse is down, Sentry is down; the nodestore is not a replacement in terms of functionality even though it contains all the data.
ClickHouse can be run in distributed and replicated mode, though we are working right now on how to support that for on-premise users.
