Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud


Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it. Filmed at

Ariel Tseitlin manages the Netflix Cloud and is interested in all things cloudy. At Netflix, he is Director of Cloud Solutions, helping Netflix be successful in the Cloud, including cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering.

Published in: Technology, Business

Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud

  1. Resiliency through failure
     Netflix's Approach to Extreme Availability in the Cloud
     Ariel Tseitlin
     http://
     @atseitlin
  3. Presented at QCon New York
     Purpose of QCon
     • to empower software development by facilitating the spread of knowledge and innovation
     Strategy
     • practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     • speakers and topics driving the evolution and innovation
     • connecting and catalyzing the influencers and innovators
     Highlights
     • attended by more than 12,000 delegates since 2007
     • held in 9 cities worldwide
  4. About Netflix
     Netflix is the world's leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]
     [1] http://…
  5. A complex distributed system
  6. How Netflix Streaming Works
     Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch
  9. Our goal is availability
     • Members can stream Netflix whenever they want
     • New users can explore and sign up for the service
     • New members can activate their service and add new devices
  10. Failure is all around us
     • Disks fail
     • Power goes out. And your generator fails.
     • Software bugs introduced
     • People make mistakes
     Failure is unavoidable
  11. We design around failure
     • Exception handling
     • Clusters
     • Redundancy
     • Fault tolerance
     • Fall-back or degraded experience (Hystrix)
     • All to insulate our users from failure
     Is that enough?
  12. It's not enough
     • How do we know if we've succeeded?
     • Does the system work as designed?
     • Is it as resilient as we believe?
     • How do we prevent drifting into failure?
     The typical answer is…
  13. More testing!
     • Unit testing
     • Integration testing
     • Stress testing
     • Exhaustive test suites to simulate and test all failure modes
     Can we effectively simulate a large-scale distributed system?
  14. Building distributed systems is hard
     Testing them exhaustively is even harder
     • Massive data sets and changing shape
     • Internet-scale traffic
     • Complex interaction and information flow
     • Asynchronous nature
     • 3rd party services
     • All while innovating and building features
     Prohibitively expensive, if not impossible, for most large-scale systems
  15. What if we could reduce variability of failures?
  16. There is another way
     • Cause failure to validate resiliency
     • Test design assumptions by stressing them
     • Don't wait for random failure. Remove its uncertainty by forcing it periodically
  17. And that's exactly what we did
  18. Instances fail
  20. Chaos Monkey taught us…
     • State is bad
     • Clusters are good
     • Surviving single instance failure is not enough
  21. Lots of instances fail
  22. Chaos Gorilla
  23. Chaos Gorilla taught us…
     • Hidden assumptions on deployment topology
     • Infrastructure control plane can be a bottleneck
     • Large scale events are hard to simulate
     • Rapidly shifting traffic is error prone
     • Smooth recovery is a challenge
     • Cassandra works as expected
  24. What about larger catastrophes?
     Anyone remember Sandy?
  25. Chaos Kong (*some day soon*)
  26. The Sick and Wounded
  27. Latency Monkey
  29. Hystrix, RxJava
     http://…-tolerance-in-high-volume.html
  30. Latency Monkey taught us
     • Startup resiliency is often missed
     • An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise)
     • Know thy neighbor (unknown dependencies)
     • Fall backs can fail too
  31. Entropy
  32. … accumulates
     • Complexity
     • Cruft
     • Vulnerabilities
     • Cost
  33. Janitor Monkey
  34. Janitor Monkey taught us…
     • Label everything
     • … builds up
  35. Ranks of the Simian Army
     • Chaos Monkey
     • Chaos Gorilla
     • Latency Monkey
     • Janitor Monkey
     • Conformity Monkey
     • Circus Monkey
     • Doctor Monkey
     • Howler Monkey
     • Security Monkey
     • Chaos Kong
     • Efficiency Monkey
  36. Observability is key
     • Don't exacerbate real customer issues with failure exercises
     • Deep system visibility is key to root-cause failures and understand the system
  37. Organizational elements
     • Every engineer is an operator of the service
     • Each failure is an opportunity to learn
     • Blameless culture
     Goal is to create a learning organization
  38. Assembling the Puzzle
  39. Open Source Projects
     Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
     • Priam: Cassandra as a Service
     • Astyanax: Cassandra client for Java
     • CassJMeter: Cassandra test suite
     • Cassandra: Multi-region EC2 datastore support
     • Aegisthus: Hadoop ETL for Cassandra
     • AWS Usage: Spend analytics
     • Governator: Library lifecycle and dependency injection
     • Odin: Cloud orchestration
     • Blitz4j: Async logging
     • Exhibitor: Zookeeper as a Service
     • Curator: Zookeeper Patterns
     • EVCache: Memcached as a Service
     • Eureka / Discovery: Service Directory
     • Archaius: Dynamics Properties Service
     • Edda: Config state with history
     • Denominator
     • Ribbon: REST Client + mid-tier LB
     • Karyon: Instrumented REST Base Server
     • Servo and Autoscaling Scripts
     • Genie: Hadoop PaaS
     • Hystrix: Robust service pattern
     • RxJava: Reactive Patterns
     • Asgard: AutoScaleGroup based AWS console
     • Chaos Monkey: Robustness verification
     • Latency Monkey
     • Janitor Monkey
     • Bakeries / Aminator
  40. How does it all fit together?
  42. Our Current Catalog of Releases
     Free code available at http://ne…
  43. Takeaways
     • Regularly inducing failure in your production environment validates resiliency and increases availability
     • Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications
  44. Thank you!
     Any questions?
     Ariel Tseitlin
     http://
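
The idea on slides 16 to 20 (cause failure on purpose, then check that clusters and fall-backs absorb it) can be sketched in a few lines of Python. This is a toy in-memory simulation under stated assumptions, not Netflix's Chaos Monkey or Hystrix: the `Instance`, `Cluster`, `chaos_monkey`, and `degraded` names are hypothetical and exist only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one server in a cluster."""
    name: str
    alive: bool = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class Cluster:
    """Hypothetical cluster that retries across instances."""
    instances: list

    def call(self, request, fallback):
        # Try each instance in turn; use the fall-back only if all fail.
        for inst in self.instances:
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        return fallback(request)


def chaos_monkey(cluster, rng):
    """Kill one randomly chosen live instance; return it, or None if none are left."""
    live = [inst for inst in cluster.instances if inst.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.alive = False
    return victim


def degraded(request):
    """Degraded experience returned when no instance can serve the request."""
    return f"fallback response for {request}"


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a run can be repeated
    cluster = Cluster([Instance(f"i-{n}") for n in range(3)])
    for round_no in range(4):
        victim = chaos_monkey(cluster, rng)
        killed = victim.name if victim else "nobody left"
        print(round_no, "killed:", killed, "->", cluster.call("play", degraded))
```

Running the module kills one instance per round. Requests keep being served while any instance is alive, and the fall-back response appears only once the last one is gone, which is the "surviving single instance failure is not enough" point in miniature.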