Resiliency through Failure @ OSCON 2013

Resiliency through failure talk from OSCON, updated with a few details on application architecture and resiliency.

More details at http://queue.acm.org/detail.cfm?id=2499552

Slide note: The genre box shots were chosen because we have rights to use them; we are starting to make specific logos for each project going forward.
Transcript of "Resiliency through Failure @ OSCON 2013"

1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud
   Ariel Tseitlin
   http://www.linkedin.com/in/atseitlin
   @atseitlin

2. About Netflix
   Netflix is the world's leading Internet television network, with more than 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1].
   [1] http://ir.netflix.com/

3. A complex distributed system

4. How Netflix Streaming Works
   (architecture diagram: a customer device (PC, PS3, TV…) talks to the web site or discovery API to browse, to AWS cloud services for user data, personalization, the streaming API, DRM, and QoS logging, and to Open Connect CDN boxes at CDN edge locations to watch; AWS also hosts CDN management and steering and content encoding)

5. Highly Available Architecture
   Micro-services, redundancy, resiliency

6. Web Server Dependencies Flow
   (home page business transaction as seen by AppDynamics: starting from the web server, requests fan out to memcached, Cassandra, web services, an S3 bucket, and the personalization movie group chooser)
   Each icon is three to a few hundred instances across three AWS zones

7. Component Micro-Services
   Test with Chaos Monkey, Latency Monkey

8. Three Balanced Availability Zones
   Test with Chaos Gorilla
   (diagram: load balancers in front of Cassandra and EVCache replicas in Zones A, B, and C)

9. Triple Replicated Persistence
   Cassandra maintenance affects individual replicas
   (same diagram: load balancers in front of Cassandra and EVCache replicas in Zones A, B, and C)

10. Isolated Regions
    Will someday test with Chaos Kong
    (diagram: US-East and EU-West each run their own load balancers and Cassandra replicas across Zones A, B, and C)

11. Failure Modes and Effects

    Failure Mode        | Probability | Current Mitigation Plan
    --------------------|-------------|------------------------------------
    Application failure | High        | Automatic degraded response
    AWS region failure  | Low         | Wait for region to recover
    AWS zone failure    | Medium      | Continue to run on 2 out of 3 zones
    Datacenter failure  | Medium      | Migrate more functions to cloud
    Data store failure  | Low         | Restore from S3 backups
    S3 failure          | Low         | Restore from remote archive

    Until we got really good at mitigating high- and medium-probability failures, the ROI for mitigating regional failures didn't make sense. Getting there…

12. Application Resilience
    Run what you wrote
    Rapid detection
    Rapid response
    Fail often

13. Run What You Wrote
    • Make developers responsible for failures
      – Then they learn and write code that doesn't fail
    • Use incident reviews to find gaps to fix
      – Make sure it's not about finding "who to blame"
    • Keep timeouts short, fail fast (see the sketch after this slide)
      – Don't let cascading timeouts stack up
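The "fail fast" bullet is worth making concrete. Below is a minimal sketch in Java; it is not Netflix's actual code, and the service call, the 300 ms budget, and the fallback value are invented for illustration. The point is that every dependency call gets a hard, short deadline and an immediate degraded answer, so slow calls cannot stack up behind one another:

```java
import java.util.concurrent.*;

public class FailFastClient {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public String fetchRecommendations(String memberId) {
        Future<String> call = pool.submit(() -> remoteCall(memberId));
        try {
            // Short, explicit deadline: a slow dependency must not be
            // allowed to stack up behind timeouts further up the chain.
            return call.get(300, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            call.cancel(true);          // fail fast: abandon the slow call
            return fallbackResponse();  // degraded but immediate answer
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallbackResponse();
        }
    }

    // Stand-ins for a real HTTP/RPC call and a cached default response.
    private String remoteCall(String memberId) { return "personalized-row"; }
    private String fallbackResponse() { return "default-row"; }
}
```

Hystrix, covered later in the deck, packages this same pattern with thread-pool isolation and circuit breakers.
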
14. Rapid Detection
    • If your pilot had no instrument panel, would you ever fly on a plane?
      – Never run your service blind
    • Monitor services, not instances
      – Make instance failure a non-event
    • Don't pay people to watch screens
      – Instead pay them to build alerting

15. Edda – Configuration History
    (diagram: Edda collects instance, ASG, etc. state from AWS, service metadata from Eureka, and request-flow data from AppDynamics)
    http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html

16. Edda Query Examples

    Find any instances that have ever had a specific public IP address:

    $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
    ["i-0123456789","i-012345678a","i-012345678b"]

    Show the most recent change to a security group:

    $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
    --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
    +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
    @@ -1,33 +1,33 @@
      {
        …
        "ipRanges" : [
          "10.10.1.1/32",
          "10.10.1.2/32",
    +     "10.10.1.3/32",
    -     "10.10.1.4/32"
        …
      }

17. Rapid Rollback
    • Use a new autoscale group (ASG) to push code
    • Leave the existing ASG in place, switch traffic (see the sketch after this slide)
    • If OK, auto-delete the old ASG a few hours later
    • If "whoops", switch traffic back in seconds
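A hedged sketch of the slide above against the current AWS SDK for Java v1; the group and load balancer names are hypothetical, and in practice Netflix drives this through Asgard rather than hand-written calls. Traffic switches by attaching the new autoscale group to the load balancer and detaching the old one, which keeps its instances running so rollback is instant:

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.AttachLoadBalancersRequest;
import com.amazonaws.services.autoscaling.model.DetachLoadBalancersRequest;

public class RedBlackPush {
    public static void main(String[] args) {
        AmazonAutoScaling asg = AmazonAutoScalingClientBuilder.defaultClient();

        // New code is already running in the freshly pushed group (api-v124).
        asg.attachLoadBalancers(new AttachLoadBalancersRequest()
                .withAutoScalingGroupName("api-v124")
                .withLoadBalancerNames("api-elb"));

        // The old group stops taking traffic but keeps its instances warm.
        asg.detachLoadBalancers(new DetachLoadBalancersRequest()
                .withAutoScalingGroupName("api-v123")
                .withLoadBalancerNames("api-elb"));

        // If "whoops": run the two calls with the group names swapped to
        // switch traffic back in seconds. If OK: delete api-v123 hours later.
    }
}
```
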
18. Asgard
    http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

19. (image-only slide)

20. (image-only slide)

21. Our goal is availability
    • Members can stream Netflix whenever they want
    • New users can explore and sign up for the service
    • New members can activate their service and add new devices

22. Failure is all around us
    • Disks fail
    • Power goes out. And your generator fails.
    • Software bugs get introduced
    • People make mistakes
    Failure is unavoidable

23. We design around failure
    • Exception handling
    • Clusters
    • Redundancy
    • Fault tolerance
    • Fall-back or degraded experience (Hystrix)
    • All to insulate our users from failure
    Is that enough?

24. It's not enough
    • How do we know if we've succeeded?
    • Does the system work as designed?
    • Is it as resilient as we believe?
    • How do we prevent drifting into failure?
    The typical answer is…

25. More testing!
    • Unit testing
    • Integration testing
    • Stress testing
    • Exhaustive test suites to simulate and test all failure modes
    Can we effectively simulate a large-scale distributed system?

26. Building distributed systems is hard. Testing them exhaustively is even harder.
    • Massive data sets and changing shape
    • Internet-scale traffic
    • Complex interaction and information flow
    • Asynchronous nature
    • 3rd-party services
    • All while innovating and building features
    Prohibitively expensive, if not impossible, for most large-scale systems

27. What if we could reduce the variability of failures?

28. There is another way
    • Cause failure to validate resiliency
    • Test design assumptions by stressing them
    • Don't wait for random failure. Remove its uncertainty by forcing it periodically.

29. And that's exactly what we did

30. Instances fail

31. (image-only slide)

32. Chaos Monkey taught us…
    • State is bad
    • Clusters are good
    • Surviving single-instance failure is not enough
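The real Chaos Monkey, open sourced with the rest of the Simian Army at http://netflix.github.com, is scheduled, opt-out aware, and configurable, but its core action is small enough to sketch. A minimal version with the AWS SDK for Java v1, using a hypothetical group name:

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import java.util.List;
import java.util.Random;

public class MiniChaosMonkey {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // List the members of the cluster under test (hypothetical name).
        List<Instance> members = autoScaling
                .describeAutoScalingGroups(new DescribeAutoScalingGroupsRequest()
                        .withAutoScalingGroupNames("api-v124"))
                .getAutoScalingGroups().get(0).getInstances();

        // Kill one member at random. The ASG should replace it and customers
        // should never notice; if they do, the exercise found a real bug.
        String victim = members.get(new Random().nextInt(members.size()))
                .getInstanceId();
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(victim));
        System.out.println("Chaos Monkey terminated " + victim);
    }
}
```
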
33. Lots of instances fail

34. Chaos Gorilla

35. Chaos Gorilla taught us…
    • Hidden assumptions on deployment topology
    • Infrastructure control plane can be a bottleneck
    • Large-scale events are hard to simulate
    • Rapidly shifting traffic is error-prone
    • Smooth recovery is a challenge
    • Cassandra works as expected

36. What about larger catastrophes?
    Anyone remember Sandy?

37. Chaos Kong (*some day soon*)

38. The Sick and Wounded

39. Latency Monkey
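Where Chaos Monkey kills instances outright, Latency Monkey degrades them: it injects artificial delays and faults into service calls so that every client's timeout and fallback handling is exercised continuously. A minimal sketch of the idea as a Java servlet filter; the injection rate and delay bound are invented, and the Simian Army implementation differs:

```java
import javax.servlet.*;
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

// Delays a small fraction of incoming requests so that callers'
// timeout and fallback paths are exercised in production.
public class LatencyInjectionFilter implements Filter {
    private static final double INJECT_RATE = 0.01;  // 1% of requests
    private static final long MAX_DELAY_MS = 2_000;  // up to 2s extra latency

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (ThreadLocalRandom.current().nextDouble() < INJECT_RATE) {
            try {
                Thread.sleep(ThreadLocalRandom.current().nextLong(MAX_DELAY_MS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, res);
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}
```
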
40. (image-only slide)

41. Resilient Design – Hystrix, RxJava
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
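Hystrix, linked above, is the client-side half of surviving the sick and wounded: every dependency call is wrapped in a command that runs on an isolated thread pool with a timeout and a circuit breaker, and a fallback is served when anything fails. A minimal example against the real Hystrix API; the movie lookup and fallback value are invented:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps a call to a (hypothetical) movie metadata service. Hystrix runs
// run() on a bounded thread pool with a timeout and circuit breaker; on
// timeout, error, or open circuit it serves getFallback() instead.
public class GetMovieTitleCommand extends HystrixCommand<String> {
    private final String movieId;

    public GetMovieTitleCommand(String movieId) {
        super(HystrixCommandGroupKey.Factory.asKey("MovieMetadata"));
        this.movieId = movieId;
    }

    @Override
    protected String run() throws Exception {
        // The real dependency call (HTTP, Cassandra, etc.) would go here.
        return "Title for " + movieId;
    }

    @Override
    protected String getFallback() {
        // Degraded experience instead of an error page.
        return "Unknown title";
    }
}

// Usage: String title = new GetMovieTitleCommand("70143836").execute();
```
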
42. Latency Monkey taught us
    • Startup resiliency is often missed
    • An ongoing, unified approach to runtime dependency management is important (visibility & transparency get missed otherwise)
    • Know thy neighbor (unknown dependencies)
    • Fallbacks can fail too

43. Entropy

44. Clutter accumulates
    • Complexity
    • Cruft
    • Vulnerabilities
    • Cost

45. Janitor Monkey

46. Janitor Monkey taught us…
    • Label everything
    • Clutter builds up
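Janitor Monkey automates the cleanup: it finds unused resources, notifies their owners, and deletes them after a grace period. A hedged sketch of one such rule with the AWS SDK for Java v1, here only reporting unattached EBS volumes that carry no owner label (the "owner" tag name is hypothetical):

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeVolumesRequest;
import com.amazonaws.services.ec2.model.Filter;
import com.amazonaws.services.ec2.model.Volume;

public class MiniJanitorMonkey {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Status "available" means the EBS volume is attached to nothing.
        for (Volume v : ec2.describeVolumes(new DescribeVolumesRequest()
                        .withFilters(new Filter("status").withValues("available")))
                .getVolumes()) {
            boolean labeled = v.getTags().stream()
                    .anyMatch(t -> "owner".equals(t.getKey()));
            if (!labeled) {
                // The real Janitor Monkey marks the resource and notifies its
                // owner before deleting; this sketch only reports it.
                System.out.println("Unlabeled orphan volume: " + v.getVolumeId());
            }
        }
    }
}
```

This is exactly why "label everything" matters: without an owner tag there is no one to notify before cleanup.
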
47. Ranks of the Simian Army
    • Chaos Monkey
    • Chaos Gorilla
    • Latency Monkey
    • Janitor Monkey
    • Conformity Monkey
    • Circus Monkey
    • Doctor Monkey
    • Howler Monkey
    • Security Monkey
    • Chaos Kong
    • Efficiency Monkey

48. Observability is key
    • Don't exacerbate real customer issues with failure exercises
    • Deep system visibility is key to root-causing failures and understanding the system

49. Organizational elements
    • Every engineer is an operator of the service
    • Each failure is an opportunity to learn
    • Blameless culture
    The goal is to create a learning organization

50. Assembling the Puzzle

51. Netflix Highly Available Platform, now open
    @NetflixOSS

52. Open Source Projects
    (status legend in the original slide: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon)
    • Priam – Cassandra as a Service
    • Astyanax – Cassandra client for Java
    • CassJMeter – Cassandra test suite
    • Cassandra – multi-region EC2 datastore support
    • Aegisthus – Hadoop ETL for Cassandra
    • Ice – spend analytics
    • Governator – library lifecycle and dependency injection
    • Odin – cloud orchestration
    • Blitz4j – async logging
    • Exhibitor – Zookeeper as a Service
    • Curator – Zookeeper patterns
    • EVCache – Memcached as a Service
    • Eureka / Discovery – service directory
    • Archaius – dynamic properties service
    • Edda – config state with history
    • Denominator
    • Ribbon – REST client + mid-tier load balancing
    • Karyon – instrumented REST base server
    • Servo and autoscaling scripts
    • Genie – Hadoop PaaS
    • Hystrix – robust service pattern
    • RxJava – reactive patterns
    • Asgard – AutoScaleGroup-based AWS console
    • Chaos Monkey – robustness verification
    • Latency Monkey
    • Janitor Monkey
    • Bakeries / Aminator

53. How does it all fit together?

54. (image-only slide)

55. Our Current Catalog of Releases
    Free code available at http://netflix.github.com

56. We're hiring!
    • Simian Army
    • Cloud Tools
    • NetflixOSS
    • Cloud Operations
    • Reliability Engineering
    • Edge Services
    • Many, many more
    jobs.netflix.com

57. Takeaways
    Create fine-grained micro-services. Don't trust your dependencies.
    Regularly inducing failure in your production environment validates resiliency and increases availability.
    Netflix has built, deployed, and open sourced a scalable, global, highly available Platform as a Service (NetflixOSS).
    http://netflix.github.com
    http://techblog.netflix.com
    http://slideshare.net/Netflix
    http://www.linkedin.com/in/atseitlin
    @atseitlin @NetflixOSS

58. Thank you! Any questions?
    Ariel Tseitlin
    http://www.linkedin.com/in/atseitlin
    @atseitlin
