• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013
 

Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

on

  • 671 views

This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and ...

This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.

Statistics

Views

Total Views
671
Views on SlideShare
671
Embed Views
0

Actions

Likes
3
Downloads
36
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013 Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013 Presentation Transcript

    • Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability Neil Hunt, Netflix November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
    • Are You Designing Systems That Are: • • • • Web-scale Global Highly-available Consumer-facing • Cloud Native
    • Cloud Native • • • • • Service oriented architecture Redundancy Statelessness NoSQL Eventual consistency
    • Assumptions Everything is Broken Hardware will fail Scale Slowly Changing Large Scale Rapid Change Large Scale Telcos Web-Scale Enterprise IT Startups Slowly Changing Small Scale Rapid Change Small Scale Everything works Software will fail Speed
    • Netflix Cloud Goals: Availability, Scale, Performance
    • Performance • Reduce session start by 1s Save 1 human lifetime per day! Win more moments of truth • Suggest choices 1% better 500k hours/day additional value delivered
    • Scale • • • • • 50% y/y traffic growth 50 Countries, 3 continents Tens of thousands of instances at peak 4 AWS regions, 12 datacenters ~$.001 per start
    • Availability • Aspire to 4 x nines (99.99% of starts successful) • Per Quarter: – Downtime: < 3 mins (peak time) – Successful starts: 9.999B – Failures: 1M  frustration, calls, lost business
    • Availabilities Compound N Service Dependencies 2 … N dependencies 99.99% .99 1000 99.99% .999 100 99.99% .9998 10 99.99N% Availability .9
    • Availabilities Compound To achieve 99.99% availability with 1000 components requires: or 99.9999% availability for each dependency Isolation for independence Component failure leads to system failure Component failure leads to degradation rather than system failure
    • Availability, Scale, Performance Are Not Enough!
    • Rapid Iteration – Rate of Change • Running tests • Rolling out tests – Engineering the winning test experience for scale • Adding features • Scaling up • Removing features, simplifying, minimizing
    • Testing • Up to 1,000 changes per day!
    • Rate of Change • Change leads to bugs – – – – New features New configurations New types of inputs Scaling up • Availability is in tension with rate of change
    • Availability / Rate of Change Tradeoff Availability 99.999% 99.99% Frontier of availability/change 99.9% 99% 1 10 100 Rate of Change 1000
    • Availability / Rate of Change Tradeoff Availability 99.999% 99.99% Frontier of availability/change 99.9% 99% 1 10 100 Rate of Change 1000
    • Shifting the Curve… Availability 99.999% 99.99% 99.9% 99% 1 10 100 Rate of Change 1000
    • Shifting the Curve • Must break the chained dependencies that compound in cascading system failure • Subsystem isolation: – Failure in one component should never result in cascading system failure
    • Isolating Subsystems Redundant systems with timeout & failover • Failure of instance • Failure of network • Latency monkey to test Dependent System Timeout Dependence
    • Isolating Subsystems Redundant systems with timeout & failover • Failure of instance • Failure of network Higher Tier System Longer timeout Dependent System Short timeout • Latency monkey to test Dependence
    • Isolating Subsystems Timeout with fallback default response • Network failure • Software bug { status=mem, plan=4, device=true } Dependent System Timeout & Default response Dependence
    • Isolating Subsystems Canary Push • Network failure • Software bug Dependent System Timeout Canary instance new code Dependence
    • Isolating Subsystems Red/Black deployment • Software bugs Dependent System Fail back to old code Bad code pushed Dependence V2.3 Dependence V2.2
    • Isolating Subsystems Standby Blue system • Independent implementation • Simplified logic Dependent System Fail to static version Static reference implementation Dependence V2.3
    • Isolating Subsystems Load Balancer Zone isolation • Infrastructure failure (e.g. power outage) Zone A Zone B Dependent System Dependent System • Chaos Gorilla Dependence Dependence
    • Isolating Subsystems Region isolation DNS • Infrastructure software bugs (e.g. load balancer fail) • Chaos Kong Region E Region W Load Balancer Load Balancer Zone A Zone B Zone A Zone B Dependen t System Dependen t System Dependen t System Dependen t System Dependence Dependence Dependence Dependence
    • Isolating Subsystems Dependency Mode Isolating Technique Instance Failure Network failure Redundant systems with failover and timeout Timeout with default response Network failure Software bug Canary push Red-black deployment Blue systems Infrastructure failure Zone isolation Cross-zone software bugs Region isolation
    • Trying Harder Won’t Cut It • Trying harder gets a linear return on an exponential problem • Need to be great at execution AND Have the right architecture • What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?
    • Please give us your feedback on this presentation DMG206 As a thank you, we will select prize winners daily for completed surveys!