The AWS Cloud : Leveraging the State of the Art

Keynote at the SAP Cloud Conference, February 2012

Transcript

  • 1. The AWS Cloud: Leveraging the State of the Art. Sid Anand (@r39132), SAP Cloud Inside Track 2012. Thursday, February 16, 2012
  • 2. What is the AWS Cloud? A Real World Scenario
  • 3. A Real World Scenario. Question: If you were to build your own website today, what would you need? Answer: You need a machine! For simplicity, we will assume that your web server and application server code run on the same box. AWS offers EC2 instances (i.e. virtual machine instances) to host your code: - Various sizes (e.g. IOPS, number of spindles, CPUs, memory, network bandwidth) - Various configurations (e.g. Virtual Private Cloud, High Performance Cluster) - Various pricing schemes (e.g. on-demand, reserved, spot, etc.)
  • 4. A Real World Scenario. Question: Is one machine enough to handle traffic from all of your users? What if that machine were to fall over or need maintenance (i.e. a restart)? Answer: Add many machines!
  • 5. A Real World Scenario. Question: This handles more traffic, but what if your servers were to fall over or need maintenance? Answer: AWS offers AutoScaleGroups (a.k.a. ASGs)! You can deploy your servers under the protection of an ASG with a min and max pool size set. The ASG replaces machines when they die, to guarantee your "min" pool size. ASGs monitor the health of your machines by polling an HTTP port on each machine.
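The replace-on-death behavior described on this slide can be sketched as a small reconciliation loop. This is a toy model, not the AWS API: the class name `AutoScaleGroupSketch`, the instance id format, and the `is_healthy` callback (a stand-in for polling an HTTP port) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Set


@dataclass
class AutoScaleGroupSketch:
    """Toy model of an ASG that keeps at least `min_size` healthy instances."""
    min_size: int
    max_size: int
    is_healthy: Callable[[str], bool]  # hypothetical stand-in for an HTTP health poll
    instances: Set[str] = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> str:
        """Start one new instance, refusing to grow past `max_size`."""
        if len(self.instances) >= self.max_size:
            raise RuntimeError("pool is already at max size")
        instance_id = f"i-{next(self._ids):04d}"  # made-up id format
        self.instances.add(instance_id)
        return instance_id

    def reconcile(self) -> None:
        """Drop unhealthy instances, then refill the pool up to `min_size`."""
        self.instances = {i for i in self.instances if self.is_healthy(i)}
        while len(self.instances) < self.min_size:
            self.launch()


dead = set()
asg = AutoScaleGroupSketch(min_size=3, max_size=6,
                           is_healthy=lambda i: i not in dead)
asg.reconcile()        # initial fill: three instances
dead.add("i-0002")     # one machine falls over
asg.reconcile()        # the dead instance is replaced by a fresh one
print(sorted(asg.instances))
```

In this model a real service would call `reconcile` on a timer; the point is only that the pool converges back to its minimum size without anyone intervening.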
  • 6. A Real World Scenario. Question: How do you distribute traffic to all of your machines evenly? Answer: Deploy your favorite software load balancer! And write some custom code to register/deregister your machine instances with the load balancer.
  • 7. A Real World Scenario. Question: What if the load balancer were to fall over, need maintenance, or become a traffic choke point? Answer: Add multiple load balancer servers and deploy them under an ASG! This is not ideal for a few reasons: - Need to register/deregister your load balancer instances with DNS - Need to sync with the ASG's view of what is alive and dead, being added or removed, etc.
  • 8. A Real World Scenario. Answer: AWS offers Elastic Load Balancers (i.e. ELB). Conceptually similar to having many LBs in an ASG, with some additional features: - Provides a DNS hostname (e.g. mysite-11111111.us-east-1.elb.amazonaws.com) - Maps all of the load balancer instances to this hostname - Takes care of maintenance of the load balancer machines and the requisite DNS registrations/deregistrations - Syncs with the ASG -- if the ASG replaces one of your instances, the ELB will also remove that instance - Let's see how it works in action!
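The "syncs with the ASG" point can be illustrated with a minimal sketch. The class `LoadBalancerSketch` and its `sync`/`route` methods are hypothetical names, and the round-robin policy is an assumption chosen for simplicity; the slides do not say how an ELB picks a target.

```python
from typing import Iterable, List


class LoadBalancerSketch:
    """Toy load balancer that only routes to instances the ASG reports as live."""

    def __init__(self) -> None:
        self._targets: List[str] = []
        self._next = 0

    def sync(self, live_instances: Iterable[str]) -> None:
        """Adopt the ASG's current view; replaced instances drop out here."""
        self._targets = sorted(live_instances)
        self._next = 0

    def route(self) -> str:
        """Return the next target in round-robin order (assumed policy)."""
        if not self._targets:
            raise RuntimeError("no live instances registered")
        target = self._targets[self._next % len(self._targets)]
        self._next += 1
        return target


lb = LoadBalancerSketch()
lb.sync({"i-0001", "i-0002", "i-0003"})
first_pass = [lb.route() for _ in range(3)]
lb.sync({"i-0001", "i-0003"})   # the ASG replaced or removed i-0002
second_pass = [lb.route() for _ in range(3)]
print(first_pass, second_pass)
```

The design point is that registration is driven by the group's membership, so no custom register/deregister code is needed on the instances themselves.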
  • 9. [Slide with no transcribed text]
  • 10. A Real World Scenario. Question: What about a DB to persist my data? Answer: Multiple AWS hosted/managed options! - DynamoDB (the new SimpleDB replacement) offers key-value semantics. Netflix replaced Oracle with SimpleDB and ran on it in 2010-2011, serving 4.5 billion user-facing requests a day - S3 offers key-value semantics for very large files (e.g. 5TB). Typically used for Map-Reduce files, media files, or Oracle BLOBs/CLOBs - RDS: hosted Oracle or MySQL, if you need relations and complex queries
  • 11. A Real World Scenario. Question: What if I have high-volume writes, but don't care when they are written -- e.g. event streams? Answer: Simple Queue Service - Think Enterprise Message Bus - Highly available, infinitely scalable - Handles application/system monitoring event traffic and social graph events at Netflix
  • 12. A Real World Scenario. Question: What if the whole Data Center goes down? How do I keep my service available? Answer: Amazon Data Center = Availability Zone
  • 13. A Real World Scenario. Answer: Always deploy your code in multiple Availability Zones! - Netflix deploys in 3 AZs in Virginia - Best Practice: Always deploy enough capacity in each AZ to handle losing one AZ during peak - Netflix follows this best practice!
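The best practice above reduces to simple arithmetic: if peak traffic needs P instances and one of N zones can disappear, the remaining N - 1 zones must carry P between them. A small sketch follows; the function name and the figure of 90 instances are hypothetical, not Netflix numbers.

```python
import math


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that any single AZ can be lost at peak."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    # The surviving (az_count - 1) zones must cover the whole peak load.
    return math.ceil(peak_instances / (az_count - 1))


peak = 90   # hypothetical number of instances needed at peak
zones = 3   # three AZs, as in the Netflix deployment mentioned above
per_az = instances_per_az(peak, zones)
print(per_az, per_az * zones)  # per-AZ size and total fleet size
```

With three zones each zone carries half of peak, so the fleet is provisioned at 150% of peak; adding zones lowers that overhead.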
  • 14. A Real World Scenario. Question: What if your Asian and European customers complain of slow response times? Recall: higher response times, lower scalability. Answer: AWS has 8 global regions! Each region has between 3 and 4 AZs - Netflix's launch in the UK and Ireland was out of the AWS EU-West Region
  • 15. A Real World Scenario
  • 16. A Real World Scenario. Other AWS Services: - Elastic Map Reduce: Map-Reduce as a Service for analytics. Supports Pig and Hive - ElastiCache: A hosted cache service (think Memcached as a Service). What's Missing (or coming soon)? - Discovery & Load Balancing for N-tier applications! In effect, we'd like ELB for internal traffic - Crypto as a Service - Currently, none of the services are cross-region! It's left to the user to transfer data or proxy requests between regions
  • 17. Who Uses AWS? Netflix's Cloud Architecture
  • 18. Netflix's Cloud Architecture. Components: Many (~100) applications, organized in clusters (a.k.a. ASGs). Clusters can be at different levels in the call stack. Clusters can call each other. [Diagram: ELB, NES, Discovery, NMTS, NBES, and IAAS tiers]
  • 19. Netflix's Cloud Architecture. Levels: NES: Netflix Edge Services. NMTS: Netflix Mid-tier Services. NBES: Netflix Back-end Services. IAAS: AWS IAAS Services. Discovery: Helps services discover NMTS and NBES services.
  • 20. Netflix's Cloud Architecture. Components (NES). Overview: Any service that browsers and streaming devices connect to over the internet. They sit behind AWS Elastic Load Balancers (a.k.a. ELB). They call clusters at lower levels.
  • 21. Netflix's Cloud Architecture. Components (NES). Examples: API Servers: Support the video browsing experience. Also allow users to modify their queue. Serve 1.4 billion calls/day. Streaming Control Servers: Support streaming video playback. Authenticate your Wii, PS3, etc. Download DRM to the Wii, PS3, etc. Return a list of CDN URLs to the Wii, PS3, etc.
  • 22. Netflix's Cloud Architecture. Components (NMTS). Overview: Can call services at the same or lower levels: other NMTS, NBES, IAAS; not NES. Exposed through our Discovery service.
  • 23. Netflix's Cloud Architecture. Components (NMTS). Examples: Netflix Queue Servers: Modify items in the users' movie queue. Viewing History Servers: Record and track all streaming movie watching. SIMS Servers: Compute and serve user-to-user and movie-to-movie similarities.
  • 24. Netflix's Cloud Architecture. Components (NBES). Overview: A back-end, usually 3rd party, open-source service. Leaf in the call tree. Cannot call anything else.
  • 25. Netflix's Cloud Architecture. Components (NBES). Examples: Cassandra Clusters: Our new cloud database is Cassandra; it stores all sorts of data to support application needs. Zookeeper Clusters: Our distributed lock service and sequence generator. Memcached Clusters: Typically cache things that we store in S3 but need to access quickly or often.
  • 26. Netflix's Cloud Architecture. Components (IAAS). Examples: AWS S3: Large-sized data (e.g. video encodes, application logs, etc.) is stored here, not in Cassandra. AWS SQS: Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS).
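The call rules stated across the component slides (NES calls lower levels, NMTS calls the same or lower levels but never NES, NBES is a leaf) can be written down as a small lookup table. This is only a sketch: the table name, the function name, the reading that NES may call any lower level directly, and the treatment of IAAS as a leaf are assumptions, not something the slides spell out.

```python
from typing import Dict, Set

# Which tier may call which, as read from the slides. Entries marked
# "assumed" are interpretations rather than statements from the talk.
ALLOWED_CALLS: Dict[str, Set[str]] = {
    "NES": {"NMTS", "NBES", "IAAS"},   # "lower levels"; direct NBES/IAAS calls assumed
    "NMTS": {"NMTS", "NBES", "IAAS"},  # same or lower levels, not NES
    "NBES": set(),                      # leaf in the call tree
    "IAAS": set(),                      # assumed leaf: AWS services do not call back
}


def can_call(caller: str, callee: str) -> bool:
    """Return True if a cluster at tier `caller` may call tier `callee`."""
    return callee in ALLOWED_CALLS[caller]


print(can_call("NMTS", "NMTS"))  # mid-tier to mid-tier
print(can_call("NMTS", "NES"))   # mid-tier back up to the edge
```

A table like this could back a lint check on service dependencies, which keeps the call graph free of upward calls.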
  • 27. Netflix's Cloud Architecture. Architecture Pros: Horizontally scalable at every level. Should give us maximum availability. Architecture Cons: A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation. Latency can be a concern. EC2 instances in AWS can die at any time! A lot of moving parts.
  • 28. Dealing with the Cons! We have a little help
  • 29. Simian Army. Prevention (& Early Detection) is the best medicine
  • 30. Simian Army • Chaos Monkey • Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group) • Similar to how EC2 instances can be killed by AWS with little warning • Tests Netflix's ability to gracefully deal with broken connections, interrupted calls, etc. • Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances • If not, the Chaos Monkey will win!
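The core of the Chaos Monkey idea can be sketched in a few lines. This is not Netflix's implementation: the function name, the in-memory `groups` mapping, and the instance names are hypothetical, and "killing" here just removes an id from a set.

```python
import random
from typing import Dict, List, Optional, Set


def chaos_monkey(groups: Dict[str, Set[str]],
                 kills_per_group: int = 1,
                 rng: Optional[random.Random] = None) -> Dict[str, List[str]]:
    """Remove a few randomly chosen instances from every group.

    `groups` maps a (hypothetical) ASG name to its live instance ids and is
    modified in place. Returns the victims chosen per group.
    """
    rng = rng or random.Random()
    killed: Dict[str, List[str]] = {}
    for name, instances in groups.items():
        how_many = min(kills_per_group, len(instances))
        victims = rng.sample(sorted(instances), how_many)
        instances.difference_update(victims)  # simulate a hard failure
        killed[name] = victims
    return killed


fleet = {"api": {"a1", "a2", "a3"}, "queue": {"q1", "q2"}}
victims = chaos_monkey(fleet, kills_per_group=1, rng=random.Random(0))
print(victims, fleet)
```

A service that sits inside an ASG would see its victims replaced on the next health check; one that does not simply stays smaller, which is the condition the monkey is meant to expose.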
  • 31. Simian Army • Latency Monkey • Simulates soft failures -- i.e. a service gets slower • Injects random delays in servers! • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays • Delays cause Thundering Herds (outside of the scope of this talk!)
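Delay injection and graceful degradation can be sketched together: one helper slows a call by a random amount, and another gives up after a timeout and returns a fallback. Both helper names are hypothetical, and using a thread pool with a timeout is just one simple way to model a caller that refuses to wait; it is not a claim about how Netflix does it.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional


def with_injected_latency(func: Callable[..., Any], max_delay_s: float,
                          rng: Optional[random.Random] = None) -> Callable[..., Any]:
    """Wrap `func` so each call first sleeps a random time up to `max_delay_s`."""
    rng = rng or random.Random()

    def slowed(*args: Any, **kwargs: Any) -> Any:
        time.sleep(rng.uniform(0, max_delay_s))  # the injected soft failure
        return func(*args, **kwargs)

    return slowed


def call_with_fallback(func: Callable[[], Any], timeout_s: float,
                       fallback: Any) -> Any:
    """Run `func`, but return `fallback` if it takes longer than `timeout_s`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # degrade gracefully instead of hanging
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


slow_lookup = with_injected_latency(lambda: "fresh", max_delay_s=0.2)
print(call_with_fallback(slow_lookup, timeout_s=0.1, fallback="cached"))
```

Whether this prints "fresh" or "cached" depends on the random delay, which is the point: the caller stays responsive either way.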
  • 32. Simian Army. Does this solve all of our issues?
  • 33. Simian Army. The infinite cloud is infinite when your needs are moderate! To ensure fairness among tenants, AWS meters or limits every resource. Hence, we hit limits quite often. Our "velocity" is limited by how long it takes for AWS to turn around and raise the limit -- a few hours!
  • 34. Simian Army • Limits Monkey • Checks once an hour whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS! • Conformity & Janitor Monkeys • Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc.) to increase head-room • Buy us more time before we run out of resources and also save us $$$$
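Both checks on this slide come down to comparing inventories. A minimal sketch follows, assuming usage and limits are available as plain dictionaries; the function names, the resource names, and the 80% alert threshold are hypothetical choices for illustration.

```python
from typing import Dict, List, Set


def limits_near_exhaustion(usage: Dict[str, int], limits: Dict[str, int],
                           threshold: float = 0.8) -> List[str]:
    """Return resources whose usage has reached `threshold` of their limit."""
    return sorted(name for name, used in usage.items()
                  if used >= threshold * limits[name])


def orphaned_instances(all_instances: Set[str],
                       asg_members: Dict[str, Set[str]]) -> Set[str]:
    """Return instances that do not belong to any ASG."""
    grouped: Set[str] = set().union(*asg_members.values())
    return all_instances - grouped


usage = {"ec2_instances": 95, "elbs": 3}     # made-up numbers
limits = {"ec2_instances": 100, "elbs": 20}  # made-up numbers
print(limits_near_exhaustion(usage, limits))
print(orphaned_instances({"i-1", "i-2", "i-3"},
                         {"api": {"i-1"}, "queue": {"i-3"}}))
```

The first check would feed an alert so that a limit increase can be requested before it is hit; the second yields candidates for cleanup, which frees head-room under the same limits.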
  • 35. Questions? Sid Anand @r39132 http://www.linkedin.com/in/siddharthanand