The AWS Cloud: Leveraging the State of the Art


Keynote at the SAP Cloud Conference, February 2012

Published in: Technology

  1. 1. The AWS Cloud: Leveraging the State of the Art. Sid Anand (@r39132). SAP Cloud Inside Track 2012. Thursday, February 16, 2012
  2. 2. What is the AWS Cloud? A Real World Scenario
  3. 3. A Real World Scenario. Question: If you were to build your own website today, what would you need? Answer: You need a machine! For simplicity, we will assume that your web server and application server code run on the same box. AWS offers EC2 instances (i.e. virtual instances) to host your code - Various sizes (e.g. IOPS, # of spindles, CPUs, memory, network bandwidth) - Various configurations (e.g. Virtual Private Cloud, High Performance Cluster) - Various pricing schemes (e.g. on-demand, reserved, spot, etc.)
  4. 4. A Real World Scenario. Question: Is one machine enough to handle traffic from all of your users? What if that machine were to fall over or need maintenance (i.e. a restart)? Answer: Add many machines!
  5. 5. A Real World Scenario. Question: This handles more traffic, but what if your servers were to fall over or need maintenance? Answer: AWS offers AutoScaleGroups (a.k.a. ASGs)! You can deploy your servers under the protection of an ASG with a min and max pool size set. The ASG ensures that machines are replaced when they die, to guarantee your “min” pool size. ASGs monitor the health of your machines by polling an HTTP port on each machine.
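The ASG behaviour described on this slide (a min and max pool size, health checks, replacement of dead machines) can be sketched as a toy model. This is a minimal simulation, not the AWS API: the `AutoScaleGroup` class, its `launch` and `reconcile` methods, and the instance ids are all hypothetical names invented for illustration, and a failed HTTP health check is represented by simply flipping a flag.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class AutoScaleGroup:
    """Toy model of an ASG (hypothetical, not the AWS API)."""
    min_size: int
    max_size: int
    instances: dict = field(default_factory=dict)  # instance id -> healthy?
    _ids: count = field(default_factory=count, repr=False)

    def launch(self):
        """Start one new instance, refusing to exceed the max pool size."""
        if len(self.instances) >= self.max_size:
            raise RuntimeError("pool is already at max size")
        instance_id = f"i-{next(self._ids)}"
        self.instances[instance_id] = True
        return instance_id

    def reconcile(self):
        """One health-check pass: drop dead instances, refill to min_size."""
        dead = [i for i, healthy in self.instances.items() if not healthy]
        for instance_id in dead:
            del self.instances[instance_id]
        while len(self.instances) < self.min_size:
            self.launch()
        return dead


asg = AutoScaleGroup(min_size=3, max_size=6)
asg.reconcile()                     # initial fill up to the min pool size
victim = next(iter(asg.instances))
asg.instances[victim] = False       # simulate a failed health check
replaced = asg.reconcile()          # the dead instance is swapped out
```

After the second `reconcile()` call the pool is back at three instances and the failed one is gone, which is the guarantee the slide attributes to the “min” pool size.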
  6. 6. A Real World Scenario. Question: How do you distribute traffic to all of your machines evenly? Answer: Deploy your favorite software load balancer, and write some custom code to register/deregister your machine instances with the load balancer.
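As a rough illustration of the "custom code" this slide mentions, here is a minimal round-robin balancer with explicit register/deregister calls. It is a sketch under assumptions: the `RoundRobinBalancer` class and the host names are hypothetical, and a real software load balancer would add health checks and connection handling.

```python
class RoundRobinBalancer:
    """Hypothetical in-memory load balancer that rotates over backends."""

    def __init__(self):
        self._backends = []
        self._next = 0

    def register(self, host):
        """Add a backend; registering the same host twice is a no-op."""
        if host not in self._backends:
            self._backends.append(host)

    def deregister(self, host):
        """Remove a backend, e.g. when its instance is terminated."""
        if host in self._backends:
            self._backends.remove(host)

    def pick(self):
        """Return the next backend in rotation."""
        if not self._backends:
            raise LookupError("no backends registered")
        host = self._backends[self._next % len(self._backends)]
        self._next += 1
        return host


lb = RoundRobinBalancer()
for host in ("web-a", "web-b", "web-c"):   # hypothetical host names
    lb.register(host)
first_round = [lb.pick() for _ in range(3)]

lb.deregister("web-b")                     # instance taken out of service
second_round = [lb.pick() for _ in range(2)]
```

Once `web-b` is deregistered, traffic only reaches the two remaining hosts. Keeping this registry in sync with what is actually alive is the bookkeeping the following slides hand over to ELB.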
  7. 7. A Real World Scenario. Question: What if the load balancer were to fall over, need maintenance, or become a traffic choke point? Answer: Add multiple load balancer servers and deploy them under an ASG! This is not ideal for a few reasons - Need to register/deregister your load balancer instances with DNS - Need to sync with the ASG's view of what is alive and dead, being added or removed, etc.
  8. 8. A Real World Scenario. Answer: AWS offers Elastic Load Balancers (i.e. ELB) - Conceptually similar to having many LBs in an ASG, with some additional features: - Provides DNS hostname (e.g. - Maps all of the load balancer instances to this hostname - Takes care of maintenance of the load balancer machines and the requisite DNS registrations/deregistrations - Syncs with the ASG: if the ASG replaces one of your instances, the ELB will also remove that instance - Let's see how it works in action!
  10. 10. A Real World Scenario. Question: What about a DB to persist my data? Answer: Multiple AWS hosted/managed options! - DynamoDB (the new SimpleDB replacement) offers key-value semantics. Netflix replaced Oracle with SimpleDB and ran on it 2010-2011 - 4.5 billion user-facing requests a day - S3 offers key-value semantics for very large files (e.g. 5 TB). Typically for Map-Reduce files, media files, or Oracle BLOBs/CLOBs - RDS - hosted Oracle or MySQL if you need relations and complex queries
  11. 11. A Real World Scenario. Question: What if I have high-volume writes, but don't care when they are written (e.g. event streams)? Answer: Simple Queue Service - Think Enterprise Message Bus - Highly available, infinitely scalable - Handles application/system monitoring event traffic and social graph events at Netflix
  12. 12. A Real World Scenario. Question: What if the whole data center goes down? How do I keep my service available? Answer: Amazon Data Center = Availability Zone
  13. 13. A Real World Scenario. Answer: Always deploy your code in multiple Availability Zones! - Netflix deploys in 3 AZs in Virginia - Best Practice: Always deploy enough capacity in each AZ to handle losing one AZ during peak - Netflix follows this best practice!
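The best practice on this slide comes down to a little arithmetic: if peak traffic must still be served after one AZ is lost, the surviving AZs have to carry the whole peak between them. The sketch below works that out; the function name and the figure of 90 instances at peak are made-up illustrations, not Netflix numbers.

```python
import math


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that peak load survives losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to tolerate losing one")
    # With one AZ gone, the remaining (az_count - 1) AZs must cover the peak.
    return math.ceil(peak_instances / (az_count - 1))


# Hypothetical example: 90 instances needed at peak, spread over 3 AZs.
per_az = instances_per_az(peak_instances=90, az_count=3)
total = per_az * 3
overhead = total / 90
```

With 3 AZs this means running 45 instances per AZ, 135 in total, or 1.5 times the bare peak requirement. The overhead shrinks as the number of AZs grows.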
  14. 14. A Real World Scenario. Question: What if your Asian and European customers complain of slow response times? Recall: higher response times, lower scalability. Answer: AWS has 8 global regions! Each region has between 3 and 4 AZs - Netflix's launch in the UK and Ireland was out of the AWS EU-West Region
  15. 15. A Real World Scenario
  16. 16. A Real World Scenario. Other AWS Services: - Elastic Map Reduce: Map-Reduce as a Service for analytics. Supports Pig and Hive - ElastiCache: A hosted cache service (think Memcached as a Service). What's Missing (or coming soon)?: - Discovery & Load Balancing for N-tier applications! - In effect, we'd like ELB for internal traffic - Crypto as a Service - Currently, none of the services are cross-region! It's left to the user to transfer data or proxy requests between regions
  17. 17. Who Uses AWS? Netflix’s Cloud Architecture
  18. 18. Netflix’s Cloud Architecture. Components: Many (~100) applications, organized in clusters (a.k.a. ASGs). Clusters can be at different levels in the call stack. Clusters can call each other. (Diagram, top to bottom: ELB, NES, NMTS, NBES, IAAS tiers, with Discovery alongside)
  19. 19. Netflix’s Cloud Architecture. Levels: NES: Netflix Edge Services; NMTS: Netflix Mid-tier Services; NBES: Netflix Back-end Services; IAAS: AWS IAAS Services; Discovery: helps services discover NMTS and NBES services
  20. 20. Netflix’s Cloud Architecture. Components (NES). Overview: Any service that browsers and streaming devices connect to over the internet. They sit behind AWS Elastic Load Balancers (a.k.a. ELB). They call clusters at lower levels.
  21. 21. Netflix’s Cloud Architecture. Components (NES). Examples: API Servers - Support the video browsing experience - Also allow users to modify their queue - Serve 1.4 billion calls/day. Streaming Control Servers - Support streaming video playback - Authenticate your Wii, PS3, etc. - Download DRM to the Wii, PS3, etc. - Return a list of CDN URLs to the Wii, PS3, etc.
  22. 22. Netflix’s Cloud Architecture. Components (NMTS). Overview: Can call services at the same or lower levels - Other NMTS - NBES, IAAS - Not NES. Exposed through our Discovery service.
  23. 23. Netflix’s Cloud Architecture. Components (NMTS). Examples: Netflix Queue Servers - Modify items in the users' movie queue. Viewing History Servers - Record and track all streaming movie watching. SIMS Servers - Compute and serve user-to-user and movie-to-movie similarities.
  24. 24. Netflix’s Cloud Architecture. Components (NBES). Overview: A back-end, usually 3rd party, open-source service. Leaf in the call tree. Cannot call anything else.
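The calling rules spread over the NES, NMTS and NBES overview slides can be written down as a small rule check. This is only a sketch of the layering as stated in the slides: the `may_call` function and the numeric levels are hypothetical, and treating IAAS as a leaf that calls nothing is an assumption, since the slides do not say.

```python
# Position of each tier in the call stack, top (0) to bottom (3).
LEVEL = {"NES": 0, "NMTS": 1, "NBES": 2, "IAAS": 3}


def may_call(caller: str, callee: str) -> bool:
    """Return True if the layering rules allow `caller` to call `callee`."""
    if caller == "NBES":
        return False                      # leaf in the call tree
    if caller == "IAAS":
        return False                      # assumption: AWS services call nothing
    if caller == "NES":
        return LEVEL[callee] > LEVEL["NES"]    # only lower levels
    # NMTS: same or lower levels (other NMTS, NBES, IAAS), never NES.
    return LEVEL[callee] >= LEVEL["NMTS"]


allowed = sorted(
    (a, b) for a in LEVEL for b in LEVEL if may_call(a, b)
)
```

Listing `allowed` gives the full set of permitted edges: NES can reach the three tiers below it, NMTS can reach itself and the two tiers below, and nothing calls back up into NES.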
  25. 25. Netflix’s Cloud Architecture. Components (NBES). Examples: Cassandra Clusters - Cassandra is our new cloud database and stores all sorts of data to support application needs. ZooKeeper Clusters - Our distributed lock service and sequence generator. Memcached Clusters - Typically cache things that we store in S3 but need to access quickly or often.
  26. 26. Netflix’s Cloud Architecture. Components (IAAS). Examples: AWS S3 - Large-sized data (e.g. video encodes, application logs, etc.) is stored here, not in Cassandra. AWS SQS - Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS).
  27. 27. Netflix’s Cloud Architecture. Architecture Pros: - Horizontally scalable at every level - Should give us maximum availability. Architecture Cons: - A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation - Latency can be a concern - EC2 instances in AWS can die at any time! - A lot of moving parts
  28. 28. Dealing with the Cons! We have a little help
  29. 29. Simian Army. Prevention (& Early Detection) is the best medicine
  30. 30. Simian Army • Chaos Monkey • Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group) • Similar to how EC2 instances can be killed by AWS with little warning • Tests Netflix's ability to gracefully deal with broken connections, interrupted calls, etc. • Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances • If not, the Chaos Monkey will win!
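The core move of the Chaos Monkey as described here, picking a few instances per ASG at random and killing them, can be sketched in a few lines. This is not Netflix's implementation: the `unleash_chaos_monkey` function, the group names and the instance ids are hypothetical, and "killing" is simulated by removing an id from an in-memory set.

```python
import random


def unleash_chaos_monkey(groups, kills_per_group=1, rng=None):
    """Remove up to `kills_per_group` random instances from every group.

    `groups` maps a (hypothetical) ASG name to a set of instance ids.
    Returns a dict of the instances killed in each group.
    """
    if rng is None:
        rng = random.Random()
    killed = {}
    for name, instances in groups.items():
        how_many = min(kills_per_group, len(instances))
        victims = rng.sample(sorted(instances), how_many)
        for victim in victims:
            instances.remove(victim)      # simulated hard failure
        killed[name] = victims
    return killed


groups = {
    "api": {"i-1", "i-2", "i-3"},
    "queue": {"i-4", "i-5"},
}
killed = unleash_chaos_monkey(groups, kills_per_group=1, rng=random.Random(42))
```

Each group loses exactly one instance. In the setup the slide describes, the surrounding ASG would then be expected to replace it; a service running outside an ASG simply stays one instance short.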
  31. 31. Simian Army • Latency Monkey • Simulates soft failures, i.e. a service gets slower • Injects random delays in servers! • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays • Delays cause Thundering Herds (outside of the scope of this talk!)
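To make the idea of injected delays and graceful degradation concrete, here is a small simulation. It is a sketch under heavy assumptions: delays are modelled as numbers rather than real sleeps so the example runs instantly, and `slow_service`, `call_with_timeout` and the payload strings are all hypothetical.

```python
import random


def slow_service(rng, max_delay_s):
    """Hypothetical dependency with an injected random delay.

    Returns (simulated latency in seconds, payload).
    """
    delay = rng.uniform(0.0, max_delay_s)
    return delay, "fresh recommendations"


def call_with_timeout(rng, timeout_s, max_delay_s,
                      fallback="cached recommendations"):
    """Degrade gracefully: use the fallback when the call is too slow."""
    delay, payload = slow_service(rng, max_delay_s)
    return payload if delay <= timeout_s else fallback


rng = random.Random(7)
# Inject delays of up to 2 s against a caller that only waits 0.5 s.
results = [call_with_timeout(rng, timeout_s=0.5, max_delay_s=2.0)
           for _ in range(200)]
degraded = results.count("cached recommendations")

# With no injected delay, every call returns the fresh payload.
healthy = [call_with_timeout(rng, timeout_s=0.5, max_delay_s=0.0)
           for _ in range(10)]
```

The point of the exercise is that every call still returns something usable: slow calls fall back to a cached answer instead of hanging or failing.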
  32. 32. Simian Army. Does this solve all of our issues?
  33. 33. Simian Army. The infinite cloud is infinite when your needs are moderate! To ensure fairness among tenants, AWS meters or limits every resource. Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit: a few hours!
  34. 34. Simian Army • Limits Monkey • Checks once an hour whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS! • Conformity & Janitor Monkeys • Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc.) to increase head-room • Buy us more time before we run out of resources and also save us $$$$
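The Limits Monkey check described above amounts to comparing current usage against each account limit and alerting before the limit is reached. A minimal sketch follows; the `approaching_limits` function, the 80% threshold, the resource names and all the numbers are assumptions made for illustration, not values from the talk.

```python
def approaching_limits(usage, limits, threshold=0.8):
    """Return (resource, used, limit) for every resource near its limit.

    A resource is flagged once usage reaches `threshold` of its limit.
    """
    alerts = []
    for resource, limit in limits.items():
        used = usage.get(resource, 0)
        if used >= threshold * limit:
            alerts.append((resource, used, limit))
    return alerts


# Hypothetical account limits and current usage.
limits = {"ec2_instances": 1000, "elbs": 50, "security_groups": 50}
usage = {"ec2_instances": 950, "elbs": 12, "security_groups": 45}

alerts = approaching_limits(usage, limits)
for resource, used, limit in alerts:
    print(f"{resource}: {used}/{limit} used, ask AWS to raise the limit")
```

Run once an hour, a check like this leaves time to request an increase before the limit is actually hit, which matters given the turnaround of a few hours mentioned on the previous slide.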
  35. 35. Questions? Sid Anand @r39132