(PFC305) Embracing Failure: Fault-Injection and Service Reliability | AWS re:Invent 2014
Complex distributed systems fail. They fail more frequently, and in different ways, as they scale and evolve over time. In this session, you learn how Netflix embraces failure to provide high service availability. Netflix discusses their motivations for inducing failure in production, the mechanics of how Netflix does this, and the lessons they learned along the way. Come hear about the Failure Injection Testing (FIT) framework and suite of tools that Netflix created and currently uses to induce controlled system failures in an effort to help discover vulnerabilities, resolve them, and improve the resiliency of their cloud environment.

1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 12th, 2014 | Las Vegas, NV
PFC305 Embracing Failure: Fault Injection and Service Resilience at Netflix
Josh Evans and Naresh Gopalani, Netflix
2. Netflix Ecosystem
•~50 million members, ~50 countries
•> 1 billion hours per month
•> 1,000 device types
•3 AWS regions, hundreds of services
•Hundreds of thousands of requests/second
•CDN serves petabytes of data at terabits/second
[Diagram: service partners, static content via Akamai and the Netflix CDN, the AWS/Netflix control plane, and the Internet]
3. Availability means that members can
●sign up
●activate a device
●browse
●watch
4. What keeps us up at night
5. Failures can happen any time
•Disks fail
•Power outages
•Natural disasters
•Software bugs
•Human error
6. We design for failure
•Exception handling
•Fault tolerance and isolation
•Fallbacks and degraded experiences
•Auto-scaling clusters
•Redundancy
7. Testing for failure is hard
•Web-scale traffic
•Massive, changing data sets
•Complex interactions and request patterns
•Asynchronous, concurrent requests
•Complete and partial failure modes
Constant innovation and change
8. What if we regularly inject failures into our systems under controlled circumstances?
9. Blast Radius
•Unit of isolation
•Scope of an outage
•Scope of a chaos exercise
[Diagram: nested blast radii: Instance → Zone → Region → Global]
10. An Instance Fails
[Diagram: an edge cluster calling clusters A, B, C, and D, with a single instance failing]
11. Chaos Monkey
•Monkey loose in your DC
•Run during business hours
•What we learned
–Auto-replacement works
–State is problematic
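The mechanics are simple enough to sketch. Below is a minimal, hypothetical version of the core loop; the CloudApi interface and its methods are invented stand-ins for a real cloud SDK, not the actual Simian Army code. The essence is random victim selection plus a business-hours gate so engineers are on hand to respond.

```java
import java.time.DayOfWeek;
import java.time.LocalTime;
import java.time.ZonedDateTime;
import java.util.List;
import java.util.Random;

public class ChaosMonkeySketch {
    // Hypothetical stand-in for a cloud SDK.
    interface CloudApi {
        List<String> listInstances(String cluster); // instance IDs in a cluster
        void terminate(String instanceId);          // hypothetical kill call
    }

    private final CloudApi cloud;
    private final Random random = new Random();

    ChaosMonkeySketch(CloudApi cloud) { this.cloud = cloud; }

    // Only act on weekdays, 9am-5pm, so people are around to respond.
    static boolean withinBusinessHours(ZonedDateTime now) {
        DayOfWeek day = now.getDayOfWeek();
        if (day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY) return false;
        LocalTime t = now.toLocalTime();
        return !t.isBefore(LocalTime.of(9, 0)) && t.isBefore(LocalTime.of(17, 0));
    }

    // Pick one random instance in the cluster and terminate it.
    void unleashOn(String cluster, ZonedDateTime now) {
        if (!withinBusinessHours(now)) return;
        List<String> instances = cloud.listInstances(cluster);
        if (instances.isEmpty()) return;
        String victim = instances.get(random.nextInt(instances.size()));
        System.out.println("Chaos Monkey terminating " + victim);
        cloud.terminate(victim);
    }
}
```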
12. A State of Xen: Chaos Monkey & Cassandra
Out of our 2,700+ Cassandra nodes:
•218 rebooted
•22 did not reboot successfully
•Automation replaced failed nodes
•0 downtime due to reboot
13. An Availability Zone Fails
[Diagram: EU-West, US-East, and US-West regions, with one of a region's availability zones (AZ1, AZ2) failing]
14. Chaos Gorilla
Simulate an Availability Zone outage
•3-zone configuration
•Eliminate one zone
•Ensure that others can handle the load and nothing breaks
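A quick back-of-envelope check (not from the talk) of why this drill is demanding: evacuating one of N zones pushes each survivor's share of traffic from 1/N to 1/(N-1), so in a 3-zone setup every remaining zone must absorb roughly 50% more load.

```java
// Back-of-envelope capacity math for an N-zone evacuation drill:
// after losing one zone, each survivor's share grows from 1/N to 1/(N-1).
public class ZoneEvacuationMath {
    public static void main(String[] args) {
        int zones = 3;
        double before = 1.0 / zones;          // per-zone share, all zones up
        double after = 1.0 / (zones - 1);     // per-zone share, one zone out
        double increase = after / before - 1; // fractional jump per survivor
        System.out.printf("Per-zone load: %.0f%% -> %.0f%% (+%.0f%%)%n",
                100 * before, 100 * after, 100 * increase);
        // Prints: Per-zone load: 33% -> 50% (+50%)
    }
}
```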
15. Challenges
•Rapidly shifting traffic
–LBs must expire connections quickly
–Lingering connections to caches must be addressed
•Service configuration
–Not all clusters auto-scaled or pinned
–Services not configured for cross-zone calls
–Mismatched timeouts: fallbacks prevented fail-over
16. A Region Fails
[Diagram: EU-West, US-East, and US-West regions, with one region failing]
17. Chaos Kong
[Diagram: geo-located customer devices are routed to a region's load balancers, through Zuul (traffic shaping/routing), to data in AZ1, AZ2, and AZ3; Chaos Kong shifts traffic from one region's stack to an identical stack in another region]
18. Challenges
●Rapidly shifting traffic
○Auto-scaling configuration
○Static configuration/pinning
○Instance start time
○Cache fill time
19. Challenges
●Service configuration
○Timeout configurations
○Fallbacks fail or don't provide the desired experience
●No minimal (critical) stack
○Any service may be critical!
20. A Service Fails
[Diagram: the blast-radius rings (Zone, Region, Global) with a single service highlighted]
21. Services Slow Down and Fail
Simulate latent/failed service calls
•Inject arbitrary latency and errors at the service level
•Observe the effects
Latency Monkey
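The injection itself can be sketched as a thin wrapper around an outbound call: with some probability add delay, with some probability fail outright. This is an illustrative reconstruction, not Netflix's implementation; the probabilities and delay bounds are made-up knobs.

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Latency Monkey-style injection at a client call site.
public class LatencyInjector {
    private final double latencyProbability; // chance of added delay
    private final double errorProbability;   // chance of outright failure
    private final long minDelayMs, maxDelayMs;

    public LatencyInjector(double latencyProb, double errorProb,
                           long minDelayMs, long maxDelayMs) {
        this.latencyProbability = latencyProb;
        this.errorProbability = errorProb;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    public <T> T call(Supplier<T> serviceCall) throws IOException {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        if (rnd.nextDouble() < latencyProbability) {
            try { // hold the request to simulate a slow dependency
                Thread.sleep(rnd.nextLong(minDelayMs, maxDelayMs + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        if (rnd.nextDouble() < errorProbability) {
            throw new IOException("injected failure"); // simulated failed call
        }
        return serviceCall.get(); // normal path: make the real call
    }
}
```

A caller would wrap each dependency call, e.g. injector.call(() -> socialClient.fetchFriends(id)), where socialClient is whatever client the service already uses (names here are hypothetical).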
22. Latency Monkey
[Diagram: request path from a device over the Internet through ELB, Zuul, and the edge to services A, B, and C]
23. Challenges
•Startup resiliency is an issue
•Service owners don't know all dependencies
•Fallbacks can fail too
•Second-order effects not easily tested
•Dependencies are in constant flux
•Latency Monkey tests function and scale
–Not a staged approach
–Lots of opt-outs
24. More Precise and Continuous
25. Service Failure Testing: FIT
26. Distributed Systems Fail
●Complex interactions at scale
●Variability across services
●Byzantine failures
●Combinatorial complexity
27. Any service can cause cascading failures
[Diagram: a fan-out of services behind an ELB]
28. Fault Injection Testing (FIT)
Request-level simulations
[Diagram: a device or account override enters with the request, which flows from the device over the Internet through ELB and Zuul to the edge and services A, B, and C]
29. Failure Injection Points
●IPC
●Cassandra client
●Memcached client
●Service container
●Fault tolerance
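One way to picture these injection points: each instrumented client (IPC, Cassandra, Memcached, ...) asks a shared hook whether its named point should fail for the current request. The sketch below is hypothetical, with invented names; the set of points to fail would come from the request's FIT context shown on the next slides.

```java
import java.util.Set;

// Hypothetical shared hook consulted by every instrumented client
// before doing real work.
public final class InjectionPoint {
    private final String name; // e.g. "cassandra", "memcached", "ipc:ServiceB"

    public InjectionPoint(String name) { this.name = name; }

    // pointsToFail would be extracted from the request's FIT context;
    // here it is simply passed in.
    public void beforeCall(Set<String> pointsToFail) {
        if (pointsToFail.contains(name)) {
            throw new RuntimeException("FIT: injected failure at " + name);
        }
    }
}
```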
30. FIT Details
●Common simulation syntax
●Single simulation interface
●Transported via an HTTP request header
31. Integrating Failure
[Diagram: a request travels from Service A through its filter and Ribbon client (ClientSend) to Service B's filter (ServerRcv), and the response returns; the FIT context rides along]
On [sendRequestHeader], the simulation travels in the fit.failure header:
fit.failure: 1|fit.Serializer|2|[[{"name":"failSocial","whitelist":false,"injectionPoints":["SocialService"]},{}]],{"Id":"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
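The propagation half of that header can be sketched as two small hooks: capture fit.failure when a request arrives and re-attach it to every downstream call, so the simulation follows the request across services. This is an assumption about the mechanics, not FIT's actual code; the serialized scenario is treated as an opaque string.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Simplified sketch of carrying a FIT simulation along with a request.
public class FitContext {
    public static final String HEADER = "fit.failure";

    // Raw header value, scoped to the current request's handling thread.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Server receive: capture the simulation, if any, from request headers.
    public static void onServerReceive(Map<String, String> headers) {
        CURRENT.set(headers.get(HEADER));
    }

    // Client send: re-attach the simulation to downstream calls.
    public static Map<String, String> outboundHeaders() {
        String scenario = CURRENT.get();
        return scenario == null
                ? Collections.emptyMap()
                : Collections.singletonMap(HEADER, scenario);
    }

    // Lets injection points inspect the active scenario, if any.
    public static Optional<String> currentScenario() {
        return Optional.ofNullable(CURRENT.get());
    }

    public static void clear() { CURRENT.remove(); } // end of request
}
```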
32. Failure Scenarios
●Set of injection points to fail
●Defined based on
○Past outages
○Specific dependency interactions
○A whitelist of critical services
○Dynamic tracing of dependencies
33. FIT Insights: Salp
●Distributed tracing inspired by the Dapper paper
●Provides insight into dependencies
●Helps define & visualize scenarios
34. Dialing Up Failure
Functional Validation
●Isolated synthetic transactions
○Set of devices
Validation at Scale
●Dial up customer traffic (% based)
●Simulation of full service failure
Chaos!
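Dialing traffic up by percentage works best when membership in the failure cohort is sticky, so raising the dial only adds customers rather than reshuffling them. A common technique (assumed here; the deck does not say how Netflix does it) is hashing the customer ID into fixed buckets:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sticky percentage dial: hash the customer ID into one of 100 buckets
// and fail the request only for buckets below the current dial setting.
// Raising the dial from 1% to 5% keeps the original 1% in the cohort.
public class FailureDial {
    private volatile int percent; // 0..100, adjustable at runtime

    public FailureDial(int percent) { this.percent = percent; }

    public void setPercent(int percent) { this.percent = percent; }

    public boolean inFailureCohort(String customerId) {
        CRC32 crc = new CRC32();
        crc.update(customerId.getBytes(StandardCharsets.UTF_8));
        int bucket = (int) (crc.getValue() % 100);
        return bucket < percent;
    }

    public static void main(String[] args) {
        FailureDial dial = new FailureDial(5); // start at 5% of traffic
        System.out.println(dial.inFailureCohort("member-12345"));
    }
}
```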
35. Continuous Validation
[Diagram: synthetic transactions continuously exercising critical and non-critical services]
36. Don't Fear The Monkeys
37. Take-aways
•Don't wait for random failures
–Cause failure to validate resiliency
–Remove uncertainty by forcing failures regularly
–Better to fail at 2pm than 2am
•Test design assumptions by stressing them
Embrace Failure
38. The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com
39. Netflix talks at re:Invent
Talk | Time | Title
BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix
PFC-306 | Wednesday, 3:30pm | Performance Tuning Amazon EC2
DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix's Proven Open Source Tools Can Accelerate and Scale Your Services
ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale
PFC-304 | Wednesday, 4:30pm | Effective Interprocess Communications in the Cloud: The Pros and Cons of Microservices Architectures
ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems
APP-310 | Friday, 9:00am | Scheduling Using Apache Mesos in the Cloud
40. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
Josh Evans, jevans@netflix.com, @josh_evans_nflx
Naresh Gopalani, ngopalani@netflix.com
