Netflix: Embracing the Cloud
Neil Hunt, CPO / Yury Izrailevsky, VP Engineering
Netflix – Service Unavailable – Database Crashed

Rest assured that the right people
are losing sleep to fix this problem!

We expect to resume service in approximately 72h


12 Aug 2008 03:12am
Availability: 4 x nines
Scale: Unconstrained horizontal scaling
Performance: Unlimited compute
• Experimented with both
• Ended up with NoSQL for almost everything important
Transitional Infrastructure: “Roman Riding”
Phase                     Components                      Data & Prerequisites
Trial (2009)              Streaming Player                Content keys (RO)
                                                          Membership status (RO)
Development (2010-11)     Member product pages and APIs   Content catalog (RW)
                                                          Personalization data (RW) & recs algorithms
                                                          AB Test data (RW)
Followthrough (2011-12)   Account and membership          Membership data (RW)
Final (2013)              Payments                        PCI and SOX data
Scalability   Performance   Availability
Scaling Netflix Streaming Service: Weekly Streaming Starts
(chart; x-axis: monthly ticks from 1/4/2009 to 8/4/2012)
Netflix Cross-Regional Cloud Architecture
Goal: Regional Failover
Building Global Netflix Streaming Product
Scalability   Performance   Availability
Weekly Cloud Cost Per Streaming Start (last 12 months)
Simian Army: Cloud Efficiency Automation
• Janitor Monkey
   • Regularly scrape unused capacity
   • Clean up instances, ASGs, ELBs, SGs, etc.
• Efficiency Monkey
   • AI-based resource under-usage detection (CPU, memory, etc.)
• Automated Deletion of Old Data
   • TTL for S3 (using ObjectExpiration)
Cyclical Streaming Usage Pattern
Load-Based Auto Scaling
• Move to Load-Based Scaling
• Scale up/down by 70%+
• 50%+ Cost Saving
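A minimal sketch of why load-based scaling saves money on a cyclical load: size the fleet to each hour's demand instead of provisioning for the peak all day. The load figures, the per-instance capacity, the headroom factor, and the function name are all made-up assumptions for illustration; they are not Netflix data, and the resulting saving is not meant to reproduce the numbers on the slide.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=2, headroom=0.25):
    """Hypothetical sizing rule: enough instances for the load plus headroom."""
    needed = requests_per_sec * (1 + headroom) / capacity_per_instance
    return max(min_instances, math.ceil(needed))


# Made-up cyclical load (requests/sec), one sample per period.
load_samples = [3000, 1500, 900, 1200, 4000, 9000, 10000, 7000]

# Fleet size per period when scaling follows load.
fleet = [desired_instances(load, 100.0) for load in load_samples]

# Compare instance-periods used against a fleet fixed at peak size.
static_total = max(fleet) * len(fleet)
scaled_total = sum(fleet)
saving = 1 - scaled_total / static_total

print(fleet)
print(f"saving vs. peak-sized fleet: {saving:.0%}")
```

The deeper the trough relative to the peak, the larger the gap between the two totals, which is the effect the slide attributes to the move to load-based scaling.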
Scalability   Performance   Availability
A Truly Great Service… Has To Just Work!
Availability Goal: 99.99%
(30 secs/week at peak traffic)
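As a rough aid for reading the 99.99% goal, the arithmetic below converts an availability target into a weekly downtime budget. It assumes downtime is counted uniformly over calendar time, which the slide does not state; the slide's 30-second figure is qualified "at peak traffic", so it presumably reflects a weighting that this sketch does not model. The function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def downtime_budget_seconds(availability, period_seconds=SECONDS_PER_WEEK):
    """Allowed downtime for a given availability, counted over calendar time."""
    return (1.0 - availability) * period_seconds


budget = downtime_budget_seconds(0.9999)
print(f"{budget:.2f} seconds per week")
```

Under the uniform assumption, four nines leaves roughly one minute per week.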
Using Redundancy in AWS Infrastructure to Survive Failures
(chart: Historical Streaming Availability (13wkMA); x-axis: weekly, 7/17/2011 to 11/11/2012; annotations: "Other AWS Outages" and "AWS / Netflix Outage, June 29th, 2012")
Cascading Failures
(diagram labels: API, Instant Queue, SimpleDB)
Netflix Cloud Architecture
Cascading Failures
99% Availability × 99% Availability × … × 99% Availability
(99%)^300 = 4.90%
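The figure on this slide can be checked with a few lines of arithmetic. The sketch assumes, as the slide's multiplication implies, that a request needs every service in the chain and that the services fail independently; the function name is hypothetical.

```python
def chain_availability(per_service, n_services):
    """Availability of a request that needs all n services to be up."""
    return per_service ** n_services


combined = chain_availability(0.99, 300)
print(f"{combined:.2%}")  # 4.90%
```

Even a short chain erodes quickly under this model: ten such services in series are already down to about 90%.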
Strategies to Improve Availability
• Graceful Degradation
• Redundancy
Graceful Degradation
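The slide gives only the title, so the following is a generic sketch of graceful degradation rather than Netflix's implementation: when a dependency fails, serve a cheaper non-personalized response instead of failing the whole request. Every name here (`get_recommendations`, `failing_service`, `POPULAR_TITLES`) is hypothetical.

```python
def get_recommendations(member_id, fetch_personalized, fallback):
    """Return personalized rows, or a static fallback if the dependency fails."""
    try:
        return fetch_personalized(member_id)
    except Exception:
        # Degrade: the member still sees something useful.
        return list(fallback)


def failing_service(member_id):
    """Stand-in for a dependency that is currently down."""
    raise TimeoutError("personalization service unavailable")


POPULAR_TITLES = ["title-a", "title-b", "title-c"]

rows = get_recommendations(42, failing_service, POPULAR_TITLES)
print(rows)
```

The design choice is that the caller decides what "good enough" looks like, so a failure in one dependency does not cascade to the whole page.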
Redundancy
• Redundancy Across Availability Zones: Zone A, Zone B, Zone C
• Storage Redundancy Across Regions, Vendors: Cassandra (A, B, C), S3 Backup, Secure Cloud Backup
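To see why redundancy helps, the sketch below uses the standard model for replicas in parallel: the service is down only if every replica is down at once. It assumes replica failures are independent, which the slide does not claim and which real zones only approximate. The 99% figure is borrowed from the cascading-failures slide purely for illustration, and the function name is hypothetical.

```python
def redundant_availability(per_replica, n_replicas):
    """Availability when any one of n independent replicas is sufficient."""
    return 1.0 - (1.0 - per_replica) ** n_replicas


three_zones = redundant_availability(0.99, 3)
print(f"{three_zones:.6f}")
```

Serial dependencies multiply availabilities together, while parallel replicas multiply unavailabilities together, so adding zones works against the erosion caused by long dependency chains.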
Testing Fault Tolerance: Simian Army
• Chaos Monkey
• Latency Monkey
• Chaos Gorilla
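The slide only names the tools. As a toy illustration of the idea behind Chaos Monkey (randomly terminating instances to confirm the service survives), the sketch below removes one instance from an in-memory fleet and checks a health condition. It is not the actual Chaos Monkey code; the function names, instance ids, and health rule are hypothetical, and Latency Monkey and Chaos Gorilla are not modeled.

```python
import random


def chaos_monkey(instances, rng):
    """Pick one instance at random and return it with the surviving fleet."""
    victim = rng.choice(sorted(instances))
    return victim, set(instances) - {victim}


def service_is_up(instances, min_healthy=1):
    """Hypothetical health rule: enough instances are still running."""
    return len(instances) >= min_healthy


rng = random.Random(0)  # seeded so the run is repeatable
fleet = {"i-a1", "i-b2", "i-c3"}

victim, fleet_after = chaos_monkey(fleet, rng)
print(f"terminated {victim}; service up: {service_is_up(fleet_after, 2)}")
```

A fleet sized with one spare instance passes this check; one sized with none fails it, which is the kind of gap such testing is meant to surface before a real failure does.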
Open Source Portal at http://netflix.github.com
Superstorm Sandy
AWS Infrastructure Held Up
>2x Netflix Streaming Usage in East Coast Markets
   • Boston
   • New York
   • Philadelphia
   • Baltimore
   • D.C.
Focus on Building a Great Streaming Product
Netflix at 2012 re:Invent

Date/Time         Presenter             Topic
Wed 8:30-10:00    Reed Hastings         Keynote with Andy Jassy
Wed 1:00-1:45     Coburn Watson         Optimizing Costs with AWS
Wed 2:05-2:55     Kevin McEntee         Netflix’s Transcoding Transformation
Wed 3:25-4:15     Neil Hunt / Yury I.   Netflix: Embracing the Cloud
Wed 4:30-5:20     Adrian Cockcroft      High Availability Architecture at Netflix
Thu 10:30-11:20   Jeremy Edberg         Rainmakers – Operating Clouds
Thu 11:35-12:25   Kurt Brown            Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25   Jason Chan            Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50     Adrian Cockcroft      Compute & Networking Masters Customer Panel
Thu 3:00-3:50     Ruslan M./Gregg U.    Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55     Ariel Tseitlin        Intro to Chaos Monkey and the Simian Army
We are sincerely eager to hear your feedback on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.


Editor's Notes

  • Slide 26 (Building Global Netflix Streaming Product): Make clear it’s still tentative, not a committed project – longer term…