Single point of failure on On-Premise

I’m working on a POC of an on-premise Sentry server on AWS. Currently, I’m running the default on-premise Sentry server on a single EC2 instance with the default configuration.

I have a question regarding single points of failure with the default setup. What is the general plan for avoiding them? As far as I understand, there are a couple of areas:

  • Webserver - we can create an autoscaling group and put a load balancer in front of it.
  • Datastore - ClickHouse and Postgres in Sentry 10; what would be the best way to avoid a single point of failure here?
  • Cache and event processing - Kafka and Redis; do we need to think more about single points of failure here?

In general, we don’t expect our data to be very large, so I believe we don’t need to worry much about scaling. However, if we don’t scale horizontally, is there a way to schedule a cleanup so the database doesn’t grow too big?

I’m very new to Sentry; thanks for any help :slight_smile: I will definitely return the favor to the community by sharing my experience throughout the journey.

Thanks,
Martin

We already have built-in cleanup tasks. You can play with the event retention period: https://github.com/getsentry/onpremise/blob/f4c309624538ca0ebce1ca5b0ab714f0b22d9921/.env#L2
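For reference, here is a minimal sketch of how that knob flows through, assuming the stock self-hosted layout where .env feeds sentry.conf.py:

```python
import os

# SENTRY_EVENT_RETENTION_DAYS is the variable on line 2 of the linked .env
# (it defaults to 90 days there). sentry.conf.py reads it roughly like this,
# and the built-in cleanup task then prunes events older than that window.
SENTRY_EVENT_RETENTION_DAYS = int(os.environ.get("SENTRY_EVENT_RETENTION_DAYS", "90"))
```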

I’ll be honest: if you want your Sentry instance to scale without SPoFs and you are asking around, you should probably consider using https://sentry.io, where all of this is taken care of for you :slight_smile:

Now, after the shameless plug, here are some thoughts:

  1. Your initial failure point will very likely be Redis, which holds everything in memory and is used both as a key-value store and as a poor man’s task queue. The first thing I’d do is add RabbitMQ for the task queues (see the sketch after this list).
  2. We only ship with a single worker service which tries to do everything. If you check https://develop.sentry.dev/self-hosted/troubleshooting/#workers (along with the other good advice there), you’ll see that you can create multiple dedicated workers to increase ingestion speed; the sketch below also notes the worker command.
  3. Around the same time, you’d likely need to scale out Kafka. Even if not for scale, you should probably have some redundancy there.
  4. Right around there, you’ll likely notice that your nodestore, which defaults to Postgres on self-hosted, starts to get too big. We use Bigtable for this, if that option is available to you. Someone on the forum or GitHub issues (Google is your friend) talked about writing an S3 backend as a nodestore, which may work for you.
  5. After all of these, ClickHouse may become a problem. It already runs in single-instance mode, so you have no redundancy there. It would make sense to switch it to cluster mode (not easy!).
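To make points 1 and 2 concrete, here is a hedged sketch of routing the Celery task queues through RabbitMQ via sentry.conf.py. The hostname, vhost, and credentials are placeholders for whatever RabbitMQ deployment you stand up next to Sentry, and the setting name is worth verifying against the config that ships with your Sentry version:

```python
# sentry.conf.py -- route the Celery task queues through RabbitMQ instead
# of Redis. Host, port, vhost, and credentials below are placeholders.
BROKER_URL = "amqp://guest:guest@rabbitmq:5672/sentry"

# For dedicated workers (point 2), the troubleshooting doc linked above
# shows running queue-specific workers, e.g.:
#   sentry run worker -Q events.save_event
```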

There are more things you can do, like adding more webservers (or better, Relays). As you can see, it is a complex beast and almost a never-ending job, hence my suggestion to use the hosted service.

Port your Sentry from Docker Compose to AWS ECS. A single EC2 instance is not a solution you should pursue further if you want serious scalability and security for your data.

But in general, to help you out:

  • Use an AWS MSK Kafka cluster to replace the Confluent Kafka stack that ships with Sentry.
  • Use AWS ElastiCache Redis to replace the Redis container.
  • Replace the Nginx with an ALB and add the proper rules for HTTPS-based traffic so requests are forwarded properly.
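As a rough illustration (not an authoritative recipe), those replacements boil down to pointing Sentry’s config at the managed endpoints. The endpoints below are placeholders, and the exact option layout varies between Sentry versions, so check each key against the sentry.conf.py that ships with your release:

```python
# sentry.conf.py -- assumes the stock self-hosted config, where
# SENTRY_OPTIONS and KAFKA_CLUSTERS already exist via
# `from sentry.conf.server import *`. All endpoints are placeholders.

# Redis: point the default cluster at an ElastiCache endpoint.
SENTRY_OPTIONS["redis.clusters"] = {
    "default": {
        "hosts": {
            0: {"host": "my-cache.abc123.use1.cache.amazonaws.com", "port": 6379},
        },
    },
}

# Kafka: point the default cluster at the MSK bootstrap brokers.
KAFKA_CLUSTERS["default"]["bootstrap.servers"] = (
    "b-1.my-msk.abc123.c2.kafka.us-east-1.amazonaws.com:9092"
)
```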

Once you gain further knowledge of the underlying components, proceed to port to ECS :grinning_face_with_smiling_eyes:

If that is too much for you, https://sentry.io is a good alternative that is priced fairly.

Thanks @BYK, this is really helpful. I expected hosting it ourselves to be a really large effort; the upside is getting a better understanding of the components. I’m currently working on a POC, so I’m evaluating both sentry.io and on-premise.

To clarify, does Postgres only store metadata for Sentry, or does it have the complete set of raw events? I’m trying to see whether ClickHouse has everything or is only for boosting indexing and searching.

Hey @OpCode, thanks so much. Do we see a significant performance gain using K8s vs. ECS?

One of the concerns with sentry.io is data privacy/security; is there an enterprise solution?

ClickHouse is only for boosting indexing and searching. That said, I’m not so sure about performance data; @fpacifici can give a more authoritative answer on this.

The nodestore has the “raw” data, and on on-premise we use Postgres as the nodestore. That said, in production we use Google Bigtable, and there are other people in the community using S3 for this.
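For context, the nodestore backend is a single setting in sentry.conf.py. A sketch, assuming the backend paths below match your Sentry version (the Django one is the self-hosted default; the Bigtable path and its options are illustrative, not exhaustive):

```python
# sentry.conf.py -- the self-hosted default keeps raw event payloads in
# Postgres through the Django-backed nodestore:
SENTRY_NODESTORE = "sentry.nodestore.django.DjangoNodeStorage"

# Swapping backends is a one-line change plus backend-specific options,
# e.g. Bigtable (option names here are illustrative assumptions):
# SENTRY_NODESTORE = "sentry.nodestore.bigtable.BigtableNodeStorage"
# SENTRY_NODESTORE_OPTIONS = {"project": "my-gcp-project", "instance": "sentry"}
```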

Got it, thanks @BYK

If K8s is your cup of tea, then sure, go for it.
The problem you have to solve is not where, or on top of what, you want Sentry up and running; any major platform should be sufficient.

It is the how you have to solve: lifecycle management, updates, and migrations.
And I think K8s and automation can provide that.

Thanks for the note, @OpCode!

ClickHouse is what powers the time-series features and search.
So it does not have the full payload of the events (that is only in the nodestore). Still, if ClickHouse is down, Sentry is down; the nodestore is not a replacement in terms of functionality even though it contains all the data.
ClickHouse can be run in distributed and replicated mode, though we are working right now on how to support that for on-premise users.
