Scaling strategy

asthana4 · June 12, 2018, 4:06pm

Hi, I am experimenting with Sentry stack and I am in process of developing a design which can achieve high throughout (ideally 1k rps during spikes). I am fairly new to developing applications at scale and I am skeptical about my approach. So far I plan to use ECS on AWS with Elasticache (for Redis), RDS and a load balanced fleet of EC2 instances on top of which the sentry server and the worker processes would reside on. My question are:

What should be ratio of the webservers to the worker processes if I am planning to handle that throughput?
How many connections do I need to be able to establish with my datastore and database for allowing that level of concurrency? How do I detect if they are the bottleneck?
Is RDS/Elasticache worth using over managing them my self, with performance being the only concern?
How big a fleet would I be needing? Is scaling vertically more beneficial over scaling horizontally?
Should I opt for a caching mechanism like Cyclops over achieving 1k rps just by infrastructure?
Would Kubernetes provide any performance incentive over ECS?

The only endpoint I am going to be using on scale is POST @ /api//store
Sorry if these questions are too broad/simple, I have done a lot of research but I am not able to find answers on my own and it would be great if someone who had create such a system can speak from their experience. Any insights would be very appreciated. Thanks!

Edit: Changed 5k Rps to 1k Rps

lloydwatkin · June 14, 2018, 7:02pm

Very basic responses below, some of them you can’t answer without data or knowing more, is this a rails/PHP/java/python stack? Also I use hosted sentry not on-premise so can’t comment on those things.

Is RDS/Elasticache worth using over managing them my self, with performance being the only concern?
He’ll no. Until you have experts in each service at massive scale use RDS (aurora, mysql, postgres). We use RDS Aurora (in mysql mode). The performance is great, scaling is simple (usually no change to the app since it’s a read and a write endpoint), backups are done for you (restores are easy), as is master-master cross AZ replication, upgrades, and failovers. Why do the hard work yourself, Amazon use these services themselves for a reason. Same for redis via elasticache, essentially we use it both as a shared rails cache and a shared job queue.

How big a fleet would I be needing? Is scaling vertically more beneficial over scaling horizontally?
Horizontally is almost always better, harder to achieve but definitely better overall. Theres only really souch vertical, there’s lots of horizontal and even multiple ways of going horizontal once you reach each limit.

Should I opt for a caching mechanism like Cyclops over achieving 5k rps just by infrastructure?
Not aware of what Cyclops is, but use what is appropriate/quick to start with and make sure however you code in your solution its easy to switch out.

Would Kubernetes provide any performance incentive over ECS?
Can’t comment I’m afraid. I’m really interested in fargate but that means never having to manage servers myself again. Currently using rancher.io because kubes was overkill and ECS didn’t have features we needed (like a start up grace period).

Good luck!

lloydwatkin · June 14, 2018, 7:04pm

Also never do anything immediately that you can out off to an non-webfacing server. Store/quickly process data, give your user a response. Process the results of that request off public facing servers (obviously only if it makes sense)

asthana4 · June 15, 2018, 4:03am

Thanks a lot for the insights! I have been able to achieve about 180 RPS off a t2.xlarge, a 1 instance t2.medium RDS, and a 1 instance t2.micro Elasticache which seems satisfactory for my use case (should I be getting more off this configuration?, it costs about 170 USD/month :/). During my time with testing different instance types, one thing I always noticed was that I got the best throughput when I had 1 web server container along with only 1 worker. If I added more, it would just slow down. If I understand correctly when the web server receives the request, it immediately caches it within Redis, and then the Cron instructs the workers to stream this over to the database from the cache. My cache usage seems very nominal on the console, the payloads I am delivering are about 100-200 KBs, and I do seem to hit the IOPS limit on my database. But since the web server simply puts the blob in the datastore, why do I lose throughput when I add more workers?

Also, this 1:1 ratio makes me think that if I am to scale this horizontally, I would be better off 1) Creating a load balancer and attach them to an autoscaling group of instances, each of which run 1 Web server, 1 Worker, and maybe 1 Redis container, but share an RDS cluster, rather than 2) Using container orchestration by EKS/ECS and scale on 1:1 ratio for servers and workers. Is this the correct way of going about this? Thanks again!

lloydwatkin · June 15, 2018, 8:14am

You’ve gone beyond me with on-premise I’m afraid (as I don’t do it) but glad the little bits of insight helped. Good luck, report your findings to the public

zeeg · June 15, 2018, 5:17pm

Won’t comment on AWS specifics but:

How many connections do I need to be able to establish with my datastore and database for allowing that level of concurrency? How do I detect if they are the bottleneck?

A lot. You will need pgbouncer in front of Postgres (i dont believe RDS bakes this in). You can Google what good tuning is here, but even having 20 connections mapped to 500 in pgbouncer goes a long way.

asthana4 · June 15, 2018, 5:29pm

Thank you for that (and helping creating this amazing project!). Aws seems to have pgbouncer-rr offering to support this and I will try this out. Can you comment on the webserver to worker ratio scaling and why would I keep losing throughput if the ratio is not 1:1. Ideally I would just like to be able to support 1k RPS, but I can’t seem to isolate the upper bounds imposed by this ratio. Thanks again!

asthana4 · June 15, 2018, 5:30pm

Thanks! I’ll post anything I can find. Documentation regarding scaling is still sparse or perhaps all this is just too new for me.

zeeg · June 15, 2018, 6:24pm

Workers do a lot more than web machines. The web basically just needs to pass the event into worker, and does very minimal work. You probably need something like 4:1 ratio.

@jtcunning might be able to comment on what it looks like for sentry.io, but it wont be entirely representative of your own needs

Topic		Replies	Views
Autoscaling and data location On-Premise	2	1692	June 3, 2021
Deploy Sentry through CloudFormation using only AWS services On-Premise	1	3129	April 20, 2021
Sentry on premise scalability & architecture On-Premise	1	1364	March 12, 2021
Performance issue On-Premise	0	936	October 20, 2018
AWS Multi-AZ Sentry On-Premise	3	3096	August 22, 2017

Scaling strategy

Related topics