How Netflix thinks of DevOps. Spoiler: we don’t.

"You Build It, You Run It". Driven by Scale. Empowered by Culture. Supported by Tools.

Published in: Technology

  1. 1. Dianne Marsh Director of Engineering @dmarsh
  2. 2. DevOps Photo Photo Credit: https://www.facebook.com/theprincessbride/photos_stream
  3. 3. DevOps in Three Acts
  4. 4. Driven by Scale
  5. 5. Empowered by Culture
  6. 6. Supported by Tools
  7. 7. Approaching Global Reach October - Spain, Portugal, Italy Early 2016 - Korea, Taiwan, Singapore, Hong Kong 65m members → 100m ~60 countries → 200
  8. 8. Netflix ecosystem • 100s of microservices • 1000s of daily production changes • 10,000s of instances • 100,000s of customer interactions/minute • 1,000,000s of customers • 1,000,000,000s of metrics • 10,000,000,000 hours of streamed
  9. 9. Yet … • 10s of Operations Engineers • No NOC
  10. 10. You Build It, You Run It
  11. 11. Outages 24/7
  12. 12. • Developers • Critical Operations/Reliability Engineering team (CORE) • Crisis Response Manager
  13. 13. “Get rid of the safeguards. Enable the most knowledgeable people to do their job effectively.”
  14. 14. Blameless Culture
  15. 15. Production Ready • Identify critical services • Provide context, assistance • Keep number small
  16. 16. Conformity Monkey • Identify best practices • Notify service owners
  17. 17. Automation and Tools
  18. 18. It’s Complicated …
  19. 19. Common Runtime Services and Libraries • Eureka • Ribbon • Hystrix • Zuul
  20. 20. Hystrix: Automate Recovery
  21. 21. Delivery Tools • Aminator • Spinnaker
  22. 22. • Cloud Management • Delivery Engine • Automation Platform
  23. 23. Global Cloud Management
  24. 24. Delivery Pipelines
  25. 25. Automated Global Delivery
  26. 26. Insight • Atlas • Edda • Vector
  27. 27. Atlas: Telemetry Platform
  28. 28. Insight
  29. 29. Insight (Dashboards)
  30. 30. What did you expect?
  31. 31. Been Throttled?
  32. 32. Performance Monitoring
  33. 33. Vector
  34. 34. • DES on time series data • Predict the future based on history • Favor recent history • Threshold-based alerts • 6-8 minute delay Anomaly Detection Alert!
  35. 35. Finer Granularity, Shorter Time Windows
  36. 36. Ensemble Learning
  37. 37. Median Absolute Deviation IQR Least Squares HDI Voting
  38. 38. Alert Sooner Alert! From 6-8 minutes to < 1 minute
  39. 39. Action was an Alert
  40. 40. Getting the Humans Out of the Equation is BETTER
  41. 41. Outlier Detection & Remediation
  42. 42. Kepler • Unsupervised machine learning • Density-based clustering algorithm • Actions – Email, page – OOS, detach, terminate
  43. 43. An ounce of prevention…
  44. 44. Old Version (v1.0) New Version (v1.1) Load Balancer Customers 100 Servers 5 Servers 95% 5% Metrics Canary Release Process
  45. 45. Old Version (v1.0) New Version (v1.1) Load Balancer Customers 0 Servers 100 Servers 100% Metrics Canary Release Process
  46. 46. Automated Canary Analysis Define • Metrics • A threshold Every n minutes • Classify metrics • Compute score • Make a decision
  47. 47. Chaos Engineering: the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
  48. 48. Cluster A Cluster D Edge Cluster Cluster B Cluster C Imagine a monkey loose in your data center…
  49. 49. Xen Hypervisor vulnerability – 9/25/14 • 218 out of 2700+ Cassandra nodes rebooted • 22 did not reboot successfully • Automation recovered those A State of Xen – Chaos Monkey & Cassandra
  50. 50. Device Service B Service C Internet Edge Zuul Service A ELB FIT Fault-Injection Testing (FIT) • Simulate service failures • Override by device or account • % of member traffic
  51. 51. Device Service B Service C Internet Edge Zuul Service A ELB FIT Fault-Injection Testing (FIT) • Simulate service failures • Override by device or account • % of member traffic
  52. 52. Monkey – Single Instance Gorilla – Availability Zone Kong – Region More Chaos
  53. 53. US-East US-West AZ1 EU-West Global Traffic Management
  54. 54. Exercise Regularly
  55. 55. DevOps at Netflix
  56. 56. How do you think about DevOps?
  57. 57. Roll the Credits Netflix.github.io Dianne Marsh, Director of Engineering dmarsh@netflix.com @dmarsh
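
Slide 34 describes alerting built on "DES on time series data" that predicts the future from history and favors recent history. Assuming DES here means double exponential smoothing (Holt's linear method), a minimal sketch of that idea might look like the following. The function names, the smoothing factors `alpha` and `beta`, and the alert threshold are illustrative assumptions, not details taken from the deck or from Atlas.

```python
def des_forecast(series, alpha=0.5, beta=0.5):
    """One-step-ahead forecast using Holt's double exponential smoothing.

    alpha weights the newest observation when updating the level;
    beta weights the newest level change when updating the trend.
    Both are hypothetical defaults chosen for illustration.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        # Blend the new observation with the previous prediction.
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        # Blend the newest level change with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend


def should_alert(observed, forecast, threshold):
    """Threshold-based alert: fire when the observation strays too far."""
    return abs(observed - forecast) > threshold
```

Because older observations are discounted geometrically, recent history dominates the forecast, which matches the "favor recent history" bullet. For a perfectly linear series such as `[1, 2, 3, 4, 5]` the forecast continues the line.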
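
Slides 36 and 37 mention ensemble learning with several detectors (Median Absolute Deviation, IQR, Least Squares, HDI) combined by voting. The sketch below shows one way such a vote could work, using only the first two detectors. The cutoffs (`3.5` for the modified z-score, `1.5` for the IQR fence) are conventional textbook values, and every name here is a hypothetical stand-in rather than Netflix's implementation.

```python
from statistics import median, quantiles


def mad_outlier(history, value, cutoff=3.5):
    """Flag value if its modified z-score (based on MAD) exceeds cutoff."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        # All history identical to the median: any deviation is unusual.
        return value != med
    # 0.6745 makes MAD comparable to a standard deviation for normal data.
    return abs(0.6745 * (value - med) / mad) > cutoff


def iqr_outlier(history, value, k=1.5):
    """Flag value if it falls outside the Tukey fences around the IQR."""
    q1, _, q3 = quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def vote(history, value, detectors=(mad_outlier, iqr_outlier), min_votes=2):
    """Raise an alert only when at least min_votes detectors agree."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes >= min_votes
```

Requiring agreement between detectors trades a little sensitivity for fewer false alarms: a value that only one detector considers unusual does not raise an alert.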
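
Slide 46 outlines Automated Canary Analysis: define metrics and a threshold, then every n minutes classify the metrics, compute a score, and make a decision. A rough sketch of that loop body follows. The relative-difference classifier, the 10% tolerance, the percentage score, and the names `classify_metric`, `canary_score`, and `decide` are all assumptions made for illustration; the deck does not say how Netflix classifies or scores metrics.

```python
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.10):
    """Return 'pass' if the canary mean is within tolerance of the baseline mean."""
    base = mean(baseline)
    cand = mean(canary)
    if base == 0:
        return "pass" if cand == 0 else "fail"
    return "pass" if abs(cand - base) / abs(base) <= tolerance else "fail"


def canary_score(metrics, tolerance=0.10):
    """Score is the percentage of metrics classified as 'pass'.

    metrics maps a metric name to a (baseline_samples, canary_samples) pair.
    """
    results = {
        name: classify_metric(baseline, canary, tolerance)
        for name, (baseline, canary) in metrics.items()
    }
    passed = sum(1 for outcome in results.values() if outcome == "pass")
    return 100.0 * passed / len(results), results


def decide(score, threshold=80.0):
    """Continue the rollout only if the score meets the threshold."""
    return "continue" if score >= threshold else "rollback"
```

In use, a scheduler would call `canary_score` on fresh samples from the old and new clusters every n minutes and act on `decide`, for example shifting more traffic to the new version or rolling back.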
