Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud

Video and slides synchronized, mp3 and slide download available at URL …

Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it. Filmed at

Ariel Tseitlin manages the Netflix Cloud and is interested in all things cloudy. At Netflix, he is Director of Cloud Solutions, helping Netflix be successful in the Cloud, including cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering.



  • 1. Resiliency through failure: Netflix's Approach to Extreme Availability in the Cloud
    Ariel Tseitlin, http://…, @atseitlin
  • 2. News & Community Site
    • 750,000 unique visitors/month
    • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
    • Post content from our QCon conferences
    • News 15-20 / week
    • Articles 3-4 / week
    • Presentations (videos) 12-15 / week
    • Interviews 2-3 / week
    • Books 1 / month
    Watch the video with slide synchronization on! /netflix-resiliency-failure-cloud
  • 3. Presented at QCon New York
    Purpose of QCon
    - to empower software development by facilitating the spread of knowledge and innovation
    Strategy
    - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
    - speakers and topics driving the evolution and innovation
    - connecting and catalyzing the influencers and innovators
    Highlights
    - attended by more than 12,000 delegates since 2007
    - held in 9 cities worldwide
  • 4. About Netflix
    Netflix is the world’s leading Internet television network with more than 36 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series[1]
    [1] http://…
  • 5. A complex distributed system
  • 6. How Netflix Streaming Works
    Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch
  • 9. Our goal is availability
    • Members can stream Netflix whenever they want
    • New users can explore and sign up for the service
    • New members can activate their service and add new devices
  • 10. Failure is all around us
    • Disks fail
    • Power goes out. And your generator fails.
    • Software bugs introduced
    • People make mistakes
    Failure is unavoidable
  • 11. We design around failure
    • Exception handling
    • Clusters
    • Redundancy
    • Fault tolerance
    • Fall-back or degraded experience (Hystrix)
    • All to insulate our users from failure
    Is that enough?
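
The "fall-back or degraded experience" idea on slide 11 can be illustrated with a small sketch. This is not Hystrix's API (Hystrix is a Java library); the class name `FallbackCommand`, the `max_failures` threshold, and the example functions are all hypothetical, chosen only to show a primary call being replaced by a degraded result once it fails or once too many consecutive failures have been seen.

```python
from typing import Callable, List


class FallbackCommand:
    """Hypothetical sketch: run a primary call, degrade to a fallback on failure.

    After `max_failures` consecutive failures the circuit is considered open
    and the primary is skipped until reset() is called.
    """

    def __init__(self, primary: Callable[[], object],
                 fallback: Callable[[], object],
                 max_failures: int = 3) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def circuit_open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def execute(self) -> object:
        if self.circuit_open:
            # Stop hammering a dependency that keeps failing.
            return self.fallback()
        try:
            result = self.primary()
        except Exception:
            self.consecutive_failures += 1
            return self.fallback()
        self.consecutive_failures = 0
        return result

    def reset(self) -> None:
        self.consecutive_failures = 0


def personalized_rows() -> List[str]:
    # Stand-in for a call to a personalization service that is down.
    raise TimeoutError("personalization service unavailable")


def popular_rows() -> List[str]:
    # Degraded experience: a non-personalized list.
    return ["Popular titles"]


if __name__ == "__main__":
    command = FallbackCommand(personalized_rows, popular_rows, max_failures=2)
    print(command.execute())
```

The design choice worth noting is that the caller always gets an answer: the user sees a less personalized page rather than an error.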
  • 12. It’s not enough
    • How do we know if we’ve succeeded?
    • Does the system work as designed?
    • Is it as resilient as we believe?
    • How do we prevent drifting into failure?
    The typical answer is…
  • 13. More testing!
    • Unit testing
    • Integration testing
    • Stress testing
    • Exhaustive test suites to simulate and test all failure modes
    Can we effectively simulate a large-scale distributed system?
  • 14. Building distributed systems is hard
    Testing them exhaustively is even harder
    • Massive data sets and changing shape
    • Internet-scale traffic
    • Complex interaction and information flow
    • Asynchronous nature
    • 3rd party services
    • All while innovating and building features
    Prohibitively expensive, if not impossible, for most large-scale systems
  • 15. What if we could reduce variability of failures?
  • 16. There is another way
    • Cause failure to validate resiliency
    • Test design assumptions by stressing them
    • Don’t wait for random failure. Remove its uncertainty by forcing it periodically
  • 17. And that’s exactly what we did
  • 18. Instances fail
  • 20. Chaos Monkey taught us…
    • State is bad
    • Clusters are good
    • Surviving single instance failure is not enough
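
A minimal sketch of the idea behind slides 16 to 20, forcing instance failure on purpose, might look like the following. It is not the actual Chaos Monkey code. The function names, the `probability` parameter, and the injected `terminate` callback are assumptions made for illustration; in a real setup `terminate` would call the cloud provider's API.

```python
import random
from typing import Callable, Dict, List, Optional


def pick_victims(groups: Dict[str, List[str]], probability: float,
                 rng: random.Random) -> Dict[str, str]:
    """Choose at most one instance per group; each group is hit with `probability`."""
    victims: Dict[str, str] = {}
    for name, instances in sorted(groups.items()):
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def unleash(groups: Dict[str, List[str]], probability: float,
            terminate: Callable[[str], None],
            rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Terminate the chosen instances and report which ones were killed."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(groups, probability, rng)
    for instance_id in victims.values():
        terminate(instance_id)  # hypothetical hook; prints or calls a cloud API
    return victims


if __name__ == "__main__":
    clusters = {"api": ["i-001", "i-002", "i-003"], "web": ["i-101", "i-102"]}
    killed = unleash(clusters, probability=0.5,
                     terminate=lambda i: print("terminating", i))
    print(killed)
```

Limiting the damage to one instance per group matches the lesson that clusters should survive the loss of any single member; it says nothing about larger failures, which is where the next slides pick up.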
  • 21. Lots of instances fail
  • 22. Chaos Gorilla
  • 23. Chaos Gorilla taught us…
    • Hidden assumptions on deployment topology
    • Infrastructure control plane can be a bottleneck
    • Large scale events are hard to simulate
    • Rapidly shifting traffic is error prone
    • Smooth recovery is a challenge
    • Cassandra works as expected
  • 24. What about larger catastrophes?
    Anyone remember Sandy?
  • 25. Chaos Kong (*some day soon*)
  • 26. The Sick and Wounded
  • 27. Latency Monkey
  • 29. Hystrix, RxJava
    http://…-tolerance-in-high-volume.html
  • 30. Latency Monkey taught us
    • Startup resiliency is often missed
    • An ongoing unified approach to runtime dependency management is important (visibility & transparency gets missed otherwise)
    • Know thy neighbor (unknown dependencies)
    • Fall backs can fail too
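
The latency-injection idea from slides 27 to 30 can be sketched as a wrapper that slows a dependency down, paired with a caller that enforces a timeout and degrades to a fallback. This is an illustration only, not Latency Monkey or Hystrix code; `with_injected_latency`, `call_with_timeout`, and the delay and timeout values are hypothetical.

```python
import concurrent.futures
import time
from typing import Callable


def with_injected_latency(call: Callable[[], object],
                          delay_seconds: float) -> Callable[[], object]:
    """Wrap a dependency call so that it responds only after an artificial delay."""
    def delayed() -> object:
        time.sleep(delay_seconds)
        return call()
    return delayed


def call_with_timeout(call: Callable[[], object], timeout_seconds: float,
                      fallback: Callable[[], object]) -> object:
    """Run `call` in a worker thread; use `fallback` if it exceeds the timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # "Fall backs can fail too": an exception raised here propagates.
        return fallback()
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    slow_ratings = with_injected_latency(lambda: "fresh ratings", delay_seconds=0.2)
    print(call_with_timeout(slow_ratings, timeout_seconds=0.05,
                            fallback=lambda: "cached ratings"))
```

Running the caller against the slowed dependency shows whether the timeout and fallback path actually work, which is the kind of assumption the slide says tends to go untested.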
  • 31. Entropy
  • 32. accumulates
    • Complexity
    • Cruft
    • Vulnerabilities
    • Cost
  • 33. Janitor Monkey
  • 34. Janitor Monkey taught us…
    • Label everything
    • builds up
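
As a rough illustration of "label everything" and of cleaning up what accumulates, the sketch below audits a list of resources. It is not Janitor Monkey's rule set: the `Resource` fields, the `owner` tag name, and the 30-day threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Resource:
    resource_id: str
    created_at: datetime
    tags: Dict[str, str] = field(default_factory=dict)
    in_use: bool = False


def audit(resources: List[Resource], now: datetime,
          max_age: timedelta = timedelta(days=30),
          owner_tag: str = "owner") -> Dict[str, List[str]]:
    """Report resources that lack an owner label and unused ones older than max_age."""
    report: Dict[str, List[str]] = {"unlabeled": [], "stale": []}
    for resource in resources:
        if owner_tag not in resource.tags:
            report["unlabeled"].append(resource.resource_id)
        if not resource.in_use and now - resource.created_at > max_age:
            report["stale"].append(resource.resource_id)
    return report


if __name__ == "__main__":
    inventory = [
        Resource("vol-1", datetime(2013, 1, 1), {"owner": "team-a"}),
        Resource("vol-2", datetime(2013, 5, 25), {}, in_use=True),
    ]
    print(audit(inventory, now=datetime(2013, 6, 1)))
```

Without an owner label there is nobody to ask before deleting a stale resource, which is why the two checks are reported separately.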
  • 35. Ranks of the Simian Army
    • Chaos Monkey
    • Chaos Gorilla
    • Latency Monkey
    • Janitor Monkey
    • Conformity Monkey
    • Circus Monkey
    • Doctor Monkey
    • Howler Monkey
    • Security Monkey
    • Chaos Kong
    • Efficiency Monkey
  • 36. Observability is key
    • Don’t exacerbate real customer issues with failure exercises
    • Deep system visibility is key to root-cause failures and understand the system
  • 37. Organizational elements
    • Every engineer is an operator of the service
    • Each failure is an opportunity to learn
    • Blameless culture
    Goal is to create a learning organization
  • 38. Assembling the Puzzle
  • 39. Open Source Projects
    Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
    • Priam: Cassandra as a Service
    • Astyanax: Cassandra client for Java
    • CassJMeter: Cassandra test suite
    • Cassandra: Multi-region EC2 datastore support
    • Aegisthus: Hadoop ETL for Cassandra
    • AWS Usage: Spend analytics
    • Governator: Library lifecycle and dependency injection
    • Odin: Cloud orchestration
    • Blitz4j: Async logging
    • Exhibitor: Zookeeper as a Service
    • Curator: Zookeeper Patterns
    • EVCache: Memcached as a Service
    • Eureka / Discovery: Service Directory
    • Archaius: Dynamics Properties Service
    • Edda: Config state with history
    • Denominator
    • Ribbon: REST Client + mid-tier LB
    • Karyon: Instrumented REST Base Serve
    • Servo and Autoscaling Scripts
    • Genie: Hadoop PaaS
    • Hystrix: Robust service pattern
    • RxJava: Reactive Patterns
    • Asgard: AutoScaleGroup based AWS console
    • Chaos Monkey: Robustness verification
    • Latency Monkey
    • Janitor Monkey
    • Bakeries / Aminator
  • 40. How does it all fit together?
  • 42. Our Current Catalog of Releases
    Free code available at http://netf…
  • 43. Takeaways
    Regularly inducing failure in your production environment validates resiliency and increases availability
    Use the NetflixOSS platform to handle the heavy lifting for building large-scale distributed cloud-native applications
  • 44. Thank you!
    Any questions?
    Ariel Tseitlin, http://…, @atseitlin