
Netflix presents at MassTLC Cloud Summit 2013


Ariel Tseitlin, Director of the Netflix Cloud, presented on the elasticity and redundancy of Netflix's cloud service.

Published in: Self Improvement, Technology

  1. Netflix Cloud Platform: Netflix's evolution in the cloud. Ariel Tseitlin, http://, @atseitlin
  2. About Netflix: Netflix is the world's leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1]. [1] http://<
  3. Original Content
  4. Critical Acclaim
  5. A complex distributed system
  6. How Netflix Streaming Works. Diagram labels: Customer Device (PC, PS3, TV…); Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding; Consumer Electronics; AWS Cloud Services; CDN Edge Locations; Browse; Play; Watch
  7. Highly Available Architecture: micro-services, redundancy, resiliency
  8. Web Server Dependencies Flow: home page business transaction. Diagram labels: Start Here; memcached; Cassandra; Web service; S3 bucket; Personalization movie group chooser. Each icon is three to a few hundred instances across three AWS zones.
  9. Component Micro-Services: test with Chaos Monkey, Latency Monkey
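
The idea behind testing micro-services with Chaos Monkey can be sketched as a random instance picker. This is a minimal illustration and not Netflix's implementation: the function name `pick_victims`, the `fleet` mapping, and the `probability` parameter are all assumptions made for the example.

```python
import random


def pick_victims(fleet, rng, probability=0.1):
    """Choose at most one instance per service group to terminate.

    `fleet` is a hypothetical mapping of group name -> list of instance ids.
    `rng` is a random.Random so that runs can be made repeatable.
    """
    victims = {}
    for group, instances in fleet.items():
        # Skip groups that have no redundancy to lose.
        if len(instances) < 2:
            continue
        # Each eligible group is hit with the configured probability.
        if rng.random() < probability:
            victims[group] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    fleet = {"api": ["i-1", "i-2", "i-3"], "solo": ["i-9"]}
    print(pick_victims(fleet, random.Random(), probability=0.5))
```

A real tool would call the cloud provider's terminate API for each chosen instance; the sketch stops at the selection step so it stays runnable anywhere.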
  10. Three Balanced Availability Zones: test with Chaos Gorilla. Diagram labels: Load Balancers; Cassandra and Evcache Replicas in each of Zone A, Zone B, Zone C
  11. Triple Replicated Persistence: Cassandra maintenance affects individual replicas. Diagram labels: Load Balancers; Cassandra and Evcache Replicas in each of Zone A, Zone B, Zone C
  12. Isolated Regions: will someday test with Chaos Kong. Diagram labels: US-East Load Balancers; Cassandra Replicas in each of Zone A, Zone B, Zone C; EU-West Load Balancers; Cassandra Replicas in each of Zone A, Zone B, Zone C
  13. Failure Modes and Effects
      Failure Mode          Probability   Current Mitigation Plan
      Application Failure   High          Automatic degraded response
      AWS Region Failure    Low           Wait for region to recover
      AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
      Datacenter Failure    Medium        Migrate more functions to cloud
      Data store failure    Low           Restore from S3 backups
      S3 failure            Low           Restore from remote archive
      Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn't make sense. Getting there…
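
Running on 2 out of 3 zones only works if each zone carries spare capacity. The arithmetic can be sketched as follows; the function name and the example numbers are assumptions, not figures from the talk.

```python
import math


def instances_per_zone(peak_instances_needed, zones=3, zones_lost=1):
    """Instances each zone must run so the survivors still cover peak load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    # The surviving zones split the whole peak load between them.
    return math.ceil(peak_instances_needed / surviving)


if __name__ == "__main__":
    per_zone = instances_per_zone(60)
    # 30 per zone, 90 in total: 1.5x the 60 instances that peak load needs.
    print(per_zone, per_zone * 3)
```

Under these assumptions, tolerating the loss of one zone out of three costs about 50% extra capacity across the region.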
  14. Application Resilience: run what you wrote; rapid detection; rapid response; fail often
  15. Run What You Wrote
     • Make developers responsible for failures
       – Then they learn and write code that doesn't fail
     • Use Incident Reviews to find gaps to fix
       – Make sure it's not about finding "who to blame"
     • Keep timeouts short, fail fast
       – Don't let cascading timeouts stack up
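
The "keep timeouts short, fail fast" advice can be sketched with a caller that stops waiting after a short deadline and returns a fallback. This is an illustration using Python's standard library only; `call_with_timeout` and `slow_personalization` are hypothetical names, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outbound dependency calls (size is an assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in the pool; return fallback if it is slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running; the caller simply stops waiting for it.
        return fallback
    except Exception:
        # A failing dependency degrades the response instead of breaking it.
        return fallback


def slow_personalization():
    """Hypothetical dependency that answers too slowly."""
    time.sleep(0.3)
    return ["personalized row"]


if __name__ == "__main__":
    print(call_with_timeout(slow_personalization, 0.05, ["popular row"]))
```

Because every caller gives up after its own short deadline, a slow dependency costs each layer a bounded wait rather than letting timeouts add up along the call chain.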
  16. Rapid Detection
     • If your pilot had no instrument panel, would you ever fly on a plane?
       – Never run your service blind
     • Monitor services, not instances
       – Make instance failure a non-event
     • Don't pay people to watch screens
       – Instead pay them to build alerting
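
"Monitor services, not instances" can be illustrated by alerting on an aggregate error rate, so that one bad instance in a large group does not page anyone. This is a sketch under stated assumptions: the `(requests, errors)` tuple shape and the 5% threshold are invented for the example.

```python
def service_error_rate(instance_stats):
    """Aggregate error rate over (requests, errors) pairs, one per instance."""
    stats = list(instance_stats)
    total_requests = sum(requests for requests, _ in stats)
    total_errors = sum(errors for _, errors in stats)
    return total_errors / total_requests if total_requests else 0.0


def should_alert(instance_stats, threshold=0.05):
    """Alert only when the service as a whole is unhealthy."""
    return service_error_rate(instance_stats) > threshold


if __name__ == "__main__":
    # 99 healthy instances and one that fails every request it receives.
    stats = [(1000, 0)] * 99 + [(10, 10)]
    print(should_alert(stats))
```

In this example the failing instance barely moves the service-level rate, which is the sense in which instance failure becomes a non-event.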
  17. Rapid Rollback
     • Use a new Autoscale Group to push code
     • Leave existing ASG in place, switch traffic
     • If OK, auto-delete old ASG a few hours later
     • If "whoops", switch traffic back in seconds
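
The rollback steps above can be modeled as a small state holder that tracks which group takes traffic and which is kept in reserve. This is an in-memory illustration, not Asgard or any AWS API; the class, method, and group names are hypothetical.

```python
class RedBlackDeployment:
    """Tracks the group taking traffic and the previous one kept on standby."""

    def __init__(self, live_group):
        self.live = live_group
        self.standby = None

    def push(self, new_group):
        # The new group takes traffic; the old group stays up, untouched.
        self.standby, self.live = self.live, new_group

    def rollback(self):
        # "Whoops": swap traffic back to the group that was kept in place.
        if self.standby is None:
            raise RuntimeError("no previous group to roll back to")
        self.live, self.standby = self.standby, self.live

    def finalize(self):
        # All good: release the group that is no longer taking traffic.
        retired, self.standby = self.standby, None
        return retired


if __name__ == "__main__":
    deploy = RedBlackDeployment("api-v041")
    deploy.push("api-v042")
    deploy.rollback()
    print(deploy.live, deploy.finalize())
```

Rollback is fast in this model because the old group is never torn down during the push; switching back is only a pointer swap.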
  18. Asgard: http://<-web-based-cloud-management-and.html
  19. Made possible in the cloud: APIs, elasticity, efficiency
  20. APIs
     • Control everything (start, terminate, scale)
     • Inject failure
     • Monitor & audit
     • Automate operations
  21. Elasticity
     • Capacity planning replaced with forecasting
     • Dynamic load-based auto-scaling
     • New data centers at the click of a button
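
Load-based auto-scaling can be sketched as a proportional rule that resizes a group so average utilization returns to a target. The rule, the 50% CPU target, and the size limits below are assumptions for illustration and do not describe Netflix's actual scaling policies.

```python
import math


def desired_capacity(current_instances, avg_cpu, target_cpu=0.5,
                     min_size=3, max_size=300):
    """Instance count that would bring average CPU back to target_cpu."""
    # Total work is roughly current_instances * avg_cpu; spread it at target.
    raw = math.ceil(current_instances * avg_cpu / target_cpu)
    # Clamp so the group never shrinks below or grows beyond its limits.
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    print(desired_capacity(10, 0.75))  # load above target: scale out
    print(desired_capacity(10, 0.10))  # load far below target: scale in
```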
  22. Efficiency
     • ~10x trough to peak ratio. Fill trough with batch workloads
     • Optimize machine class for each service
     • Highly available red/black deployments
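
The capacity available for batch work in the trough can be estimated with simple arithmetic, assuming a pool sized for peak demand. The function name and the hourly figures are invented for the example.

```python
def spare_capacity(hourly_demand):
    """Instances left idle each hour if the pool is sized for peak demand."""
    peak = max(hourly_demand)
    return [peak - demand for demand in hourly_demand]


if __name__ == "__main__":
    # A 10x trough-to-peak day, sampled at three points (hypothetical).
    print(spare_capacity([10, 50, 100]))
```

With a 10x ratio, 90% of a peak-sized pool sits idle at the trough, which is the room the slide proposes to fill with batch workloads.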
  23. Coming soon to a cloud near you: Billing & Payments, Big Data & Analytics, SaaS
  24. Billing & Payments
     • PCI compliance
     • Privacy & security
     • Intermediate step of cache in the cloud
  25. Big Data & Analytics
     • On deck for cloud migration
     • ETL already in cloud with EMR (Hadoop)
     • Many cloud alternatives but not yet as mature as the old guard
  26. Corporate system moving to SaaS
     • Email (Exchange->Google Apps)
     • Expense Management (Concur->Workday)
     • Document sharing (File Servers->Box)
     • Goal is 100% SaaS
  28. Open Source Projects
     Legend: Github / Techblog; Apache Contributions; Techblog Post; Coming Soon
     • Priam: Cassandra as a Service
     • Astyanax: Cassandra client for Java
     • CassJMeter: Cassandra test suite
     • Cassandra: Multi-region EC2 datastore support
     • Aegisthus: Hadoop ETL for Cassandra
     • Ice: Spend analytics
     • Governator: Library lifecycle and dependency injection
     • Odin: Cloud orchestration
     • Blitz4j: Async logging
     • Exhibitor: Zookeeper as a Service
     • Curator: Zookeeper Patterns
     • EVCache: Memcached as a Service
     • Eureka / Discovery: Service Directory
     • Archaius: Dynamics Properties Service
     • Edda: Config state with history
     • Denominator
     • Ribbon: REST Client + mid-tier LB
     • Karyon: Instrumented REST Base Serve
     • Servo and Autoscaling Scripts
     • Genie: Hadoop PaaS
     • Hystrix: Robust service pattern
     • RxJava: Reactive Patterns
     • Asgard: AutoScaleGroup based AWS console
     • Chaos Monkey: Robustness verification
     • Latency Monkey
     • Janitor Monkey
     • Bakeries / Aminator
  30. Our Current Catalog of Releases: free code available at http://ne<
  31. We're hiring!
     • Simian Army
     • Cloud Tools
     • NetflixOSS
     • Cloud Operations
     • Reliability Engineering
     • Many, many more
  32. Takeaways
     Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS).
     The Cloud enables elasticity, efficiency, and fine-grained control via APIs.
     Credit cards, Big Data, and the rest of corporate systems are next to move to the Cloud.
     http://ne<  http://<  http://<lix  http://  @atseitlin  @NetflixOSS
  33. Thank you! Any questions? Ariel Tseitlin, http://, @atseitlin