Hi, I am experimenting with Sentry stack and I am in process of developing a design which can achieve high throughout (ideally 1k rps during spikes). I am fairly new to developing applications at scale and I am skeptical about my approach. So far I plan to use ECS on AWS with Elasticache (for Redis), RDS and a load balanced fleet of EC2 instances on top of which the sentry server and the worker processes would reside on. My question are:
What should be ratio of the webservers to the worker processes if I am planning to handle that throughput?
How many connections do I need to be able to establish with my datastore and database for allowing that level of concurrency? How do I detect if they are the bottleneck?
Is RDS/Elasticache worth using over managing them my self, with performance being the only concern?
How big a fleet would I be needing? Is scaling vertically more beneficial over scaling horizontally?
Should I opt for a caching mechanism like Cyclops over achieving 1k rps just by infrastructure?
Would Kubernetes provide any performance incentive over ECS?
The only endpoint I am going to be using on scale is POST @ /api//store
Sorry if these questions are too broad/simple, I have done a lot of research but I am not able to find answers on my own and it would be great if someone who had create such a system can speak from their experience. Any insights would be very appreciated. Thanks!
Very basic responses below, some of them you can’t answer without data or knowing more, is this a rails/PHP/java/python stack? Also I use hosted sentry not on-premise so can’t comment on those things.
Is RDS/Elasticache worth using over managing them my self, with performance being the only concern?
He’ll no. Until you have experts in each service at massive scale use RDS (aurora, mysql, postgres). We use RDS Aurora (in mysql mode). The performance is great, scaling is simple (usually no change to the app since it’s a read and a write endpoint), backups are done for you (restores are easy), as is master-master cross AZ replication, upgrades, and failovers. Why do the hard work yourself, Amazon use these services themselves for a reason. Same for redis via elasticache, essentially we use it both as a shared rails cache and a shared job queue.
How big a fleet would I be needing? Is scaling vertically more beneficial over scaling horizontally?
Horizontally is almost always better, harder to achieve but definitely better overall. Theres only really souch vertical, there’s lots of horizontal and even multiple ways of going horizontal once you reach each limit.
Should I opt for a caching mechanism like Cyclops over achieving 5k rps just by infrastructure?
Not aware of what Cyclops is, but use what is appropriate/quick to start with and make sure however you code in your solution its easy to switch out.
Would Kubernetes provide any performance incentive over ECS?
Can’t comment I’m afraid. I’m really interested in fargate but that means never having to manage servers myself again. Currently using rancher.io because kubes was overkill and ECS didn’t have features we needed (like a start up grace period).
Also never do anything immediately that you can out off to an non-webfacing server. Store/quickly process data, give your user a response. Process the results of that request off public facing servers (obviously only if it makes sense)
Thanks a lot for the insights! I have been able to achieve about 180 RPS off a t2.xlarge, a 1 instance t2.medium RDS, and a 1 instance t2.micro Elasticache which seems satisfactory for my use case (should I be getting more off this configuration?, it costs about 170 USD/month :/). During my time with testing different instance types, one thing I always noticed was that I got the best throughput when I had 1 web server container along with only 1 worker. If I added more, it would just slow down. If I understand correctly when the web server receives the request, it immediately caches it within Redis, and then the Cron instructs the workers to stream this over to the database from the cache. My cache usage seems very nominal on the console, the payloads I am delivering are about 100-200 KBs, and I do seem to hit the IOPS limit on my database. But since the web server simply puts the blob in the datastore, why do I lose throughput when I add more workers?
Also, this 1:1 ratio makes me think that if I am to scale this horizontally, I would be better off 1) Creating a load balancer and attach them to an autoscaling group of instances, each of which run 1 Web server, 1 Worker, and maybe 1 Redis container, but share an RDS cluster, rather than 2) Using container orchestration by EKS/ECS and scale on 1:1 ratio for servers and workers. Is this the correct way of going about this? Thanks again!
You’ve gone beyond me with on-premise I’m afraid (as I don’t do it) but glad the little bits of insight helped. Good luck, report your findings to the public
How many connections do I need to be able to establish with my datastore and database for allowing that level of concurrency? How do I detect if they are the bottleneck?
A lot. You will need pgbouncer in front of Postgres (i dont believe RDS bakes this in). You can Google what good tuning is here, but even having 20 connections mapped to 500 in pgbouncer goes a long way.
Thank you for that (and helping creating this amazing project!). Aws seems to have pgbouncer-rr offering to support this and I will try this out. Can you comment on the webserver to worker ratio scaling and why would I keep losing throughput if the ratio is not 1:1. Ideally I would just like to be able to support 1k RPS, but I can’t seem to isolate the upper bounds imposed by this ratio. Thanks again!
Workers do a lot more than web machines. The web basically just needs to pass the event into worker, and does very minimal work. You probably need something like 4:1 ratio.
@jtcunning might be able to comment on what it looks like for sentry.io, but it wont be entirely representative of your own needs