
How to scale your infrastructure


How to scale to millions of requests

Published in: Engineering

  1. mAdme: How to scale your app/infrastructure to millions of requests per second. Jorge Dionisio, Senior DevOps Team Lead
  2. Edge traffic
     • Assuming a request/response size of 25 KB per request
     • We actively use a CDN for media, so we can disregard image requests
     • Assuming half a million requests per second, how much bandwidth will we need?
     • Roughly 12.5 GB/s (25 KB x 500,000 req/s). That's huge!
     • Client retries help to balance the load
     • Good inter-network connectivity
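The arithmetic above can be sketched as a quick back-of-the-envelope calculation (assuming "25 Kb" means 25 kilobytes and using decimal units):

```python
# Back-of-the-envelope edge bandwidth: average payload size x request rate.
REQUEST_SIZE_KB = 25          # assumed average request/response size (KB)
REQUESTS_PER_SEC = 500_000    # half a million requests per second

def edge_bandwidth_gbytes(size_kb: float, rps: int) -> float:
    """Return required edge bandwidth in GB/s (decimal units)."""
    return size_kb * 1000 * rps / 1e9

print(f"{edge_bandwidth_gbytes(REQUEST_SIZE_KB, REQUESTS_PER_SEC):.1f} GB/s")
# -> 12.5 GB/s, i.e. roughly 100 Gbit/s at the edge
```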
  3. Load balancers
     • AWS ELB/ALB
     • F5
     • Citrix NetScaler
  4. Correctly sized instances
     • One thing we need is good network connectivity, especially to the load balancer
     • If we are on AWS, which instance types should we choose for our application nodes?
     • Starting with c5.large for the ASGs could be the best option (they support up to 10 Gbit)
     • Never use t2/t3 for production: they don't really scale, and even with many of them the network performance is not guaranteed
  5. Internal traffic
     • Orchestration tool
     • Chef/Puppet/Salt/Ansible for service discovery
     • HAProxy for internal communication
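As a rough illustration of the HAProxy-for-internal-communication idea, a minimal config might look like the sketch below. The backend name and server addresses are purely illustrative; in practice the server list would be rendered by the orchestration tool from service discovery data.

```
# haproxy.cfg sketch: internal HTTP load balancing (illustrative values)
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend internal_api
    bind 127.0.0.1:8080
    default_backend app_nodes

backend app_nodes
    balance roundrobin
    # server lines would normally be templated by Chef/Puppet/Salt/Ansible
    server app1 10.0.1.10:8000 check
    server app2 10.0.1.11:8000 check
```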
  6. Load test your infrastructure
  7. Kernel parameters
     • net.ipv4.tcp_tw_recycle = 1
     • Be careful with this setting in virtualized or NAT'd environments: it can break connections to load balancers (and it was removed entirely in Linux 4.12)
     • Always test these settings
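A hedged sketch of what such a sysctl fragment might look like, using the safer net.ipv4.tcp_tw_reuse instead of the problematic (and since-removed) tcp_tw_recycle. The values are illustrative, not recommendations; as the slide says, always load-test before rolling out.

```
# /etc/sysctl.d/99-tuning.conf (illustrative values, load-test first)
net.ipv4.tcp_tw_reuse = 1             # reuse TIME_WAIT sockets for outgoing
                                      # connections; safer than tcp_tw_recycle
net.core.somaxconn = 65535            # larger listen/accept queue
net.ipv4.ip_local_port_range = 1024 65535   # more ephemeral ports
fs.file-max = 2097152                 # raise the system-wide fd limit
```

Apply with `sysctl --system` (or `sysctl -p <file>`) and verify individual values with `sysctl net.ipv4.tcp_tw_reuse`.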
  8. Database instances
     • If you need a high-performance database, choose the i3 class on AWS
     • Run your DBs in RAID-10 wherever possible; RAID-50 could also be a solution
     • Tune your DB instances
     • Try to fit all your data in memory if possible
  9. SSD storage
     • SSDs have the advantage of very low (virtually zero) random seek times, since they are not mechanical
     • Run all the nodes that require fast disk access on local SSDs
  10. Helpful scripts
     • If your DB is PostgreSQL, run postgresqltuner
     • If your DB is MySQL, run mysqltuner
     • Always apply the URGENT-level settings these scripts recommend; you can get a huge performance boost just by applying a few settings
  11. Huge pages
     • Turn on huge pages in your Linux kernel, whatever database you are running
     • They save time in the way the CPU accesses memory (fewer page-table entries and TLB misses)
     • They can give a huge, as the name says, performance boost
     • I have some graphs at the end to show you all
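Enabling huge pages is typically a matter of reserving them via sysctl and letting the database use them; the fragment below is a sketch for a hypothetical PostgreSQL host, and the page count must be sized to your own shared memory area.

```
# /etc/sysctl.d/99-hugepages.conf (size to your DB's shared memory area)
# 4500 x 2 MB pages is roughly 9 GB: enough for an 8 GB shared_buffers
# plus some headroom. This figure is illustrative only.
vm.nr_hugepages = 4500
```

On the PostgreSQL side, `huge_pages = try` (the default) will use them when available; check allocation afterwards with `grep Huge /proc/meminfo`.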
  12. Shared memory area
     • Always try to make your database consume 85 to 90 percent of the instance's memory
     • In PostgreSQL, tune shared_buffers and effective_cache_size to hit these values; again, postgresqltuner will help with this
     • Remember that each connection you allow into your database also consumes a bit of memory, so take this into account
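A toy sizing helper illustrating the slide's rule of thumb. The function name and the 10 MB-per-connection figure are illustrative assumptions (actual per-connection overhead depends on the database and its settings), not measured values:

```python
# Rule-of-thumb sizing: give the DB ~85% of instance memory, minus an
# assumed per-connection overhead. All figures are illustrative.
def db_memory_budget_gb(instance_gb: float, max_connections: int,
                        per_conn_mb: float = 10.0, target: float = 0.85) -> float:
    """Memory (GB) left for shared buffers/cache after connection overhead."""
    budget = instance_gb * target                    # 85% of instance memory
    conn_overhead = max_connections * per_conn_mb / 1024  # MB -> GB
    return round(budget - conn_overhead, 1)

print(db_memory_budget_gb(64, 500))   # 64 GB instance, 500 connections
```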
  13. Auto scaling groups
     • One of the benefits of most cloud providers is some kind of scaling capability, like ASGs on AWS
     • Start small, then grow big
     • Do proper testing in a QA environment first
     • Once you think you are ready, use one of the links shown previously to test your cloud deployment
     • Don't worry if at first all hell breaks loose
  14. Monitoring
     • I've used several monitoring tools in the past
     • Currently we use Grafana and Prometheus; we've found this pair to be very scalable and stable
     • I've also used Nagios, but Nagios can become a nightmare in terms of managing its configuration files, especially if you need to automate it all
     • If you are going to use Nagios, take a look at NagiosQL: it gives you a graphical interface to your configuration files
  15. Monitoring, continued
     • For your monitoring needs, if you are using Prometheus, use a proper time-series database
     • Tune its granularity by decreasing the number of events you store over time:
     • Past 3 days: all events
     • 3 to 15 days: half of the events
     • More than 15 days: one third of the events
     • This way the amount of storage required decreases, and you still have some data to look at if you need to go back in time
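The retention schedule above can be expressed as a tiny helper. This is only a sketch of the rule; in practice the downsampling itself would be configured in the time-series database, not in application code:

```python
# Retention rule from the slide: keep every event for the first 3 days,
# half of them between 3 and 15 days, and one third beyond 15 days.
def retained_fraction(age_days: float) -> float:
    """Fraction of events kept for data of a given age (days)."""
    if age_days <= 3:
        return 1.0
    if age_days <= 15:
        return 0.5
    return 1 / 3

# e.g. storage estimate for 30 days of uniform ingest:
total = sum(retained_fraction(d) for d in range(1, 31))
```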
  16. Automation
     • Use Lambda functions in AWS to clean up your registered instances
     • Automate all, or almost all, of your work from the start; this will save you time in the future
     • My favorite orchestration tool is Chef, but you can choose others
     • Automate tests against your infrastructure with BlazeMeter and trigger PagerDuty calls when something is wrong
  17. Some graphs
  18. Q/A?