ENT101 Embracing the Cloud - AWS re: Invent 2012

Join the product and cloud computing leaders of Netflix to discuss why and how the company moved to Amazon Web Services. From early experiments with media transcoding, to building the operational skills to optimize costs, to the creation of the Simian Army, this session guides business leaders through real-world examples of evaluating and adopting cloud computing.

Transcript of "ENT101 Embracing the Cloud - AWS re: Invent 2012"

  1. Netflix: Embracing the Cloud (Neil Hunt, CPO / Yury Izrailevsky, VP Engineering)
  2. Netflix – Service Unavailable – Database Crashed. Rest assured that the right people are losing sleep to fix this problem! We expect to resume service in approximately 72h. 12 Aug 2008 03:12am
  3. Availability: 4 x nines. Scale: unlimited horizontal scaling. Performance: unconstrained compute.
  4. • Experimented with both • Ended up with NoSQL for almost everything important
  5. Transitional Infrastructure: “Roman Riding”
  6. Phase | Components | Data & Prerequisites
     Trial (2009) | Streaming Player | Content keys (RO), Membership status (RO)
     Development (2010-11) | Member product pages and APIs & recs algorithms | Content catalog (RW), Personalization data (RW), AB Test data (RW)
     Followthrough (2011-12) | Account and membership | Membership data (RW)
     Final (2013) | Payments | PCI and SOX data
  7. Availability: 4 x nines. Scale: unlimited horizontal scaling. Performance: unconstrained compute.
  8. Scalability / Performance / Availability
  9. Scalability / Performance / Availability
  10. [Chart] Scaling Netflix Streaming Service: Weekly Streaming Starts (x-axis: 1/4/2009 to 8/4/2012)
  11. Netflix Cross-Regional Cloud Architecture
  12. Goal: Regional Failover
  13. Building Global Netflix Streaming Product
  14. Scalability / Performance / Availability
  15. [Chart] Weekly Cloud Cost Per Streaming Start (last 12 months)
  16. Simian Army: Cloud Efficiency Automation
      • Janitor Monkey: regularly scrape unused capacity; clean up instances, ASGs, ELBs, SGs, etc.
      • Efficiency Monkey: AI-based resource under-usage detection (CPU, memory, etc.)
      • Automated Deletion of Old Data: TTL for S3 (using ObjectExpiration)
  17. [Chart] Cyclical Streaming Usage Pattern
  18. Load-Based Auto Scaling: 50%+ Cost Saving. Scale up/down by 70%+. Move to Load-Based Scaling.
  19. Scalability / Performance / Availability
  20. A Truly Great Service… Has To Just Work! Availability Goal: 99.99% (30 secs/week at peak traffic)
  21. [Chart] Historical Streaming Availability (13wkMA), weekly from 7/17/2011 to 11/11/2012. Annotations: Other AWS Outages; AWS / Netflix Outage, June 29th, 2012; Using Redundancy in AWS Infrastructure to Survive Failures
  22. Cascading Failures (diagram labels: API, Instant Queue, SimpleDB)
  23. Netflix Cloud Architecture
  24. Cascading Failures: 99% Availability x 99% Availability x … x 99% Availability. (99%)^300 = 4.90%
  25. Strategies to Improve Availability: Graceful Degradation, Redundancy
  26. Graceful Degradation
  27. Redundancy (diagram labels: Zone A, Zone B, Zone C; Cassandra A, B, C; S3 Backup)
      • Redundancy Across Availability Zones
      • Secure Cloud Backup Storage
      • Redundancy Across Regions, Vendors
  28. Testing Fault Tolerance: Simian Army (Chaos Monkey, Latency Monkey, Chaos Gorilla)
  29. Open Source Portal at http://netflix.github.com
  30. Superstorm Sandy: AWS Infrastructure Held Up. >2x Netflix Streaming Usage in East Coast Markets: Boston, New York, Philadelphia, Baltimore, D.C.
  31. Focus on Building a Great Streaming Product
  32. Netflix at 2012 re:Invent
      Date/Time | Presenter | Topic
      Wed 8:30-10:00 | Reed Hastings | Keynote with Andy Jassy
      Wed 1:00-1:45 | Coburn Watson | Optimizing Costs with AWS
      Wed 2:05-2:55 | Kevin McEntee | Netflix’s Transcoding Transformation
      Wed 3:25-4:15 | Neil Hunt / Yury I. | Netflix: Embracing the Cloud
      Wed 4:30-5:20 | Adrian Cockcroft | High Availability Architecture at Netflix
      Thu 10:30-11:20 | Jeremy Edberg | Rainmakers – Operating Clouds
      Thu 11:35-12:25 | Kurt Brown | Data Science with Elastic Map Reduce (EMR)
      Thu 11:35-12:25 | Jason Chan | Security Panel: Learn from CISOs working with AWS
      Thu 3:00-3:50 | Adrian Cockcroft | Compute & Networking Masters Customer Panel
      Thu 3:00-3:50 | Ruslan M./Gregg U. | Optimizing Your Cassandra Database on AWS
      Thu 4:05-4:55 | Ariel Tseitlin | Intro to Chaos Monkey and the Simian Army
  33. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
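
The arithmetic on the Cascading Failures slide (300 components at 99% availability each giving 4.90% overall) can be checked with a short calculation. This is a minimal sketch that assumes every request depends on all components in series and that failures are independent; the slide gives only the figures, and the function name is hypothetical.

```python
def compound_availability(per_service: float, n_services: int) -> float:
    """Probability that n independent services are all up at the same time."""
    return per_service ** n_services


if __name__ == "__main__":
    # 300 dependencies, each available 99% of the time (figures from the slide).
    overall = compound_availability(0.99, 300)
    print(f"{overall:.2%}")  # prints 4.90%
```

Under these assumptions, adding more serial dependencies lowers overall availability quickly, which is the motivation the deck gives for graceful degradation and redundancy.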

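
Graceful degradation, named on the slides as one of the two availability strategies, is commonly built as a fallback path: when a dependency fails, the caller serves a reduced but usable response instead of passing the error on. The sketch below illustrates that general pattern only. It is not Netflix's implementation, and every name in it (`with_fallback`, `POPULAR_TITLES`, the two stand-in functions) is hypothetical.

```python
from typing import Callable, List

# Hypothetical generic content to serve when personalization is unavailable.
POPULAR_TITLES: List[str] = ["Title A", "Title B", "Title C"]


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or the fallback if the primary call raises."""
    try:
        return primary()
    except Exception:
        # Degrade: show generic content rather than failing the whole page.
        return fallback


def broken_recommendations() -> List[str]:
    """Stand-in for a dependency that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def healthy_recommendations() -> List[str]:
    """Stand-in for the same dependency when it is working."""
    return ["Personalized pick 1", "Personalized pick 2"]


if __name__ == "__main__":
    print(with_fallback(healthy_recommendations, POPULAR_TITLES))
    print(with_fallback(broken_recommendations, POPULAR_TITLES))
```

With a fallback in place, a failing dependency no longer takes the caller down with it, so it stops contributing to the serial availability product shown on the Cascading Failures slide.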