20140708 - Jeremy Edberg: How Netflix Delivers Software

Published in: Technology

  1. 1. How Netflix Delivers Software • July 8th, 2014 • Email: jedberg@{gmail,netflix}.com Twitter: @jedberg Web: www.jedberg.net Facebook: facebook.com/jedberg LinkedIn: www.linkedin.com/in/jedberg
  2. 2. When your software fails...
  3. 3. will your system survive?
  4. 4. The Netflix way • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment
  5. 5. The Netflix way • Everything is “built for three” • Independent teams responsible for both Dev and Ops • Redundancy through multi-region deployment
  6. 6. Philosophy
  7. 7. Freedom and Responsibility • We hire responsible adults and keep rules and policies to a minimum • Developers can change any code in production at any time • And things don’t break (usually)
  8. 8. Automate all the things! http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
  9. 9. Automate all the things! • Application startup • Configuration • Code deployment • System deployment
  10. 10. Automation • Standard base image • Tools to manage all the systems • Reduce errors through reproducibility
  11. 11. Shared state should be stored in a shared service. Data on an instance should be replicated to other instances.
  12. 12. “Build for three” We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  13. 13. “Build for three” We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  14. 14. Diagram: 2B requests per day into the Netflix API • 12B outbound requests per day to API dependencies • Movie Ratings • Personalization Engine • User Info • Movie Metadata • Similar Movies • Reviews • A/B Test Engine • Discovery API • Streaming API
  15. 15. Diagram: Movie Ratings • Personalization Engine • User Info • Movie Metadata • Similar Movies • Reviews • A/B Test Engine • Discovery API • Streaming API • Content Encoding • CDN Management • QOS Logging • DRM • OpenConnect • Edge Locations • Browse • Play • Watch
  16. 16. Highly aligned, loosely coupled • Services are built by different teams who work together to figure out what each service will provide. • The service owner publishes an API that anyone can use.
  17. 17. Advantages to a Service Oriented Architecture • Easier auto-scaling • Easier capacity planning • Identify problematic code-paths more easily • Narrow down the effects of a change • More efficient local caching
  18. 18. Freedom and Responsibility • Developers deploy when they want • They also manage their own capacity and autoscaling • And fix anything that breaks at 4am!
  19. 19. All system choices assume some part will fail at some point.
  20. 20. The Monkey Theory • Simulate things that go wrong • Find things that are different
  21. 21. Execution
  22. 22. AWS Netflix OSS Netflix Application Code
  23. 23. AWS Netflix OSS YOUR Application Code
  24. 24. What AWS Provides (AWS) • Instances • Machine Images • Elastic IPs • Load Balancers • Security groups / Autoscaling
  25. 25. AWS Netflix OSS YOUR Application Code
  26. 26. Netflix built a global PaaS (Netflix OSS) • Service Oriented Architecture • HTTP/REST interfaces between services
  27. 27. Netflix PaaS features (Netflix OSS) • Supports all regions and zones • Multiple accounts • Cross region/account replication • Internationalized, localized and GeoIP routed • Advanced key management • Autoscaling with 1000s of instances • Monitoring and alerting on millions of metrics
  28. 28. Open Source at Netflix
  29. 29. Netflix OSS
  30. 30. Circuit Breakers (Hystrix) • Be liberal in what you accept, strict in what you send
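The circuit breaker idea on this slide can be illustrated with a small, generic sketch. This is not Hystrix's API. The class name `CircuitBreaker`, its parameters, and the default thresholds are assumptions made for illustration. After a few consecutive failures the breaker "opens" and returns a fallback immediately, then lets one trial call through once a cooldown has passed.

```python
import time


class CircuitBreaker:
    """Toy breaker: closed, then open after repeated failures, then a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical default
        self.reset_after = reset_after    # cooldown in seconds, hypothetical default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling the dependency
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency request, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function, so that a failing dependency degrades the response instead of stalling it.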
  31. 31. The Monkey Theory • Simulate things that go wrong • Find things that are different
  32. 32. The simian army • Chaos -- Kills random instances • Chaos Gorilla -- Kills zones • Chaos Kong -- Kills regions • Latency -- Degrades network and injects faults • Conformity -- Looks for outliers • Circus -- Kills and launches instances to maintain zone balance • Doctor -- Fixes unhealthy resources • Janitor -- Cleans up unused resources • Howler -- Yells about bad things like Amazon limit violations • Security -- Finds security issues and expiring certificates
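As a rough illustration of the "kills random instances" idea, the sketch below picks one victim per group. It is not the Chaos Monkey implementation. The function name `pick_victims`, the group layout, and the rule of sparing single-instance groups are all assumptions. Actually terminating an instance is left out on purpose.

```python
import random


def pick_victims(groups, rng=random):
    """Return {group_name: instance_id} with one randomly chosen instance per group.

    Assumption for this sketch: a group with a single instance is skipped,
    so the exercise never takes a service to zero capacity.
    """
    victims = {}
    for name, instances in groups.items():
        if len(instances) > 1:
            victims[name] = rng.choice(sorted(instances))
    return victims
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when replaying a failure exercise.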
  33. 33. Netflix OSS
  34. 34. • Blueprint for the rest of the platform libraries • Pluggable architecture
  35. 35. • On instance software load balancer • Zone aware / Zone affinity • Handles retry logic
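The three bullets on this slide (an on-instance balancer, zone affinity, retry logic) can be sketched as follows. The slide does not name a library API, so everything here is hypothetical: the class `ZoneAwareBalancer`, the `send` callback, and the default of three attempts. Servers in the caller's own zone are tried first, in round-robin order, and other zones are used only as a fallback.

```python
class ZoneAwareBalancer:
    """Client-side balancer sketch: prefer the local zone, retry on the next server."""

    def __init__(self, servers_by_zone, local_zone):
        self.local = list(servers_by_zone.get(local_zone, []))
        self.remote = [
            server
            for zone in sorted(servers_by_zone)
            if zone != local_zone
            for server in servers_by_zone[zone]
        ]
        self._turn = 0

    def candidates(self):
        """Local-zone servers first (rotated round-robin), then remote servers."""
        rotated = []
        if self.local:
            k = self._turn % len(self.local)
            self._turn += 1
            rotated = self.local[k:] + self.local[:k]
        return rotated + self.remote

    def execute(self, send, max_attempts=3):
        """Call send(server) on successive candidates until one succeeds."""
        last_error = None
        for server in self.candidates()[:max_attempts]:
            try:
                return send(server)
            except ConnectionError as err:
                last_error = err  # try the next candidate
        raise RuntimeError("all attempts failed") from last_error
```

Keeping the balancer in the client process avoids an extra network hop, and the zone preference keeps most traffic inside one zone.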
  36. 36. • Global variables • Support for staged rollout • Feature flags
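One way to picture global variables, staged rollout, and feature flags working together is a shared property store plus a stable per-user bucket. This is a sketch under assumptions, not the configuration library the slide refers to. `DynamicProperties`, `in_rollout`, and the `.percent` naming convention are invented for illustration.

```python
import hashlib


class DynamicProperties:
    """In-memory stand-in for a shared property store that can change at runtime."""

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        self._values[name] = value

    def get(self, name, default=None):
        return self._values.get(name, default)


def in_rollout(props, flag, user_id):
    """True if user_id falls inside the rollout percentage configured for flag."""
    percent = props.get(flag + ".percent", 0)
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because the bucket depends only on the flag and the user, raising the percentage adds users without removing anyone who already had the feature.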
  37. 37. Netflix OSS
  38. 38. • Application to instance mapping • Heartbeat to keep track of health
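The two bullets here, application-to-instance mapping and heartbeats, amount to a registry that forgets instances which stop reporting in. The sketch below is a toy version under assumptions. The class `Registry`, the 90 second expiry, and the method names are illustrative and are not taken from the slide.

```python
import time


class Registry:
    """Maps application names to the instances that have sent a recent heartbeat."""

    def __init__(self, ttl=90.0, clock=time.monotonic):
        self.ttl = ttl      # seconds without a heartbeat before expiry (assumed value)
        self.clock = clock
        self._last_seen = {}  # (app, instance_id) -> time of last heartbeat

    def heartbeat(self, app, instance_id):
        """Register the instance, or renew it if it is already known."""
        self._last_seen[(app, instance_id)] = self.clock()

    def instances(self, app):
        """Return the instance ids for app whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            instance
            for (name, instance), seen in self._last_seen.items()
            if name == app and now - seen <= self.ttl
        )
```

Clients would call `instances("api")` to find healthy targets instead of relying on fixed host names.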
  39. 39. DQ Transport Routing Suro etc Eventbus Druid
  40. 40. Netflix OSS
  41. 41. Why Bake? • Traditional (Generic AMI, Instance): launch OS, install packages, install app • Netflix (App AMI, Instance): launch OS + app
  42. 42. Getting Baked • Perforce / Git: libraries, source • Ant targets, Ivy, Groovy all over • Jenkins: sync, resolve, build, compile, report, publish, test • Artifactory: snapshot / release, libraries / apps, app bundles
  43. 43. Base Image Baking Yum / Apt Linux: CentOS, Fedora, Ubuntu RPMs: Apache, Java... ec2 slave instances S3 / EBS foundation AMI base AMI Bakery mount install Ready for app bake snapshot AWS
  44. 44. App Image Baking Jenkins /Yum / Artifactory Linux, Apache, Java, Tomcat AWS app bundle ec2 slave instances S3 / EBS base AMI app AMI Bakery mount install Ready to launch! snapshot
  45. 45. app AMI Linux Base AMI (CentOS or Ubuntu) Java Tomcat Optional Apache Monitoring Log Rotation to S3 monitoring GC and thread dump logging Application war file, base servlet, platform, interface jars for dependent services Healthcheck, status servlets, JMX interface, Servo autoscale
  46. 46. Linux Base AMI (CentOS or Ubuntu) Java Tomcat Optional Apache Monitoring Log Rotation to S3 monitoring GC and thread dump logging Application war file, base servlet, platform, interface jars for dependent services Healthcheck, status servlets, JMX interface, Servo autoscale app AMI Application war file
  47. 47. Linux Base AMI (CentOS or Ubuntu) Java JBoss Optional Apache Monitoring Log Rotation to S3 monitoring GC and thread dump logging Application war file, base servlet, platform, interface jars for dependent services Healthcheck, status servlets, JMX interface, Servo autoscale app AMI
  48. 48. Linux Base AMI (CentOS or Ubuntu) Python Bottle Optional Apache Monitoring Log Rotation to S3 monitoring logging Application file, base server, platform, interface libs for dependent services app AMI
  49. 49. Netflix OSS
  50. 50. Deploying Code; Step 1
  51. 51. Auto Scaling Group Launch Configuration Security Group Amazon Machine Image Instances Load Balancer
  52. 52. Netflix has moved the granularity from the instance to the cluster
  53. 53. Data is the most important asset Netflix has. It’s what differentiates us from our competitors.
  54. 54. Netflix OSS
  55. 55. EVCache • Wrapper on top of memcached • Automatically replicates writes to multiple regions • Pulls cache data intelligently via zone affinity
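The write replication and zone-affinity read described on this slide can be mimicked with plain dictionaries. This is only a sketch of the behavior, not EVCache's client API. `ReplicatedCache` and its methods are hypothetical names, and each in-memory dict stands in for a memcached cluster in one zone.

```python
class ReplicatedCache:
    """Write to every zone's store; read from the local zone first."""

    def __init__(self, zones, local_zone):
        self.stores = {zone: {} for zone in zones}  # one dict per zone
        self.local_zone = local_zone

    def set(self, key, value):
        # Replicate the write to every zone so any zone can serve reads.
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        local = self.stores[self.local_zone]
        if key in local:
            return local[key]  # zone affinity: cheapest, closest copy
        for zone, store in self.stores.items():
            if zone != self.local_zone and key in store:
                return store[key]  # fall back to another zone's copy
        return None
```

Reading locally keeps latency low, while the replicated writes mean losing one zone does not empty the cache.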
  56. 56. Cassandra
  57. 57. Why Cassandra? • Availability over consistency • Writes over reads • We know Java • Open source + support
  58. 58. Using Cassandra at Netflix • Priam: zero touch auto-config, state management, token assignment, node replacement, backup/restore to/from S3 • Astyanax: OO abstraction to Cassandra, multi-region support
  59. 59. Cassandra Architecture
  60. 60. Going Multi-region
  61. 61. Leveraging Multi-region • 100% uptime is theoretically possible. • You have to replicate your data • This will cost money
  62. 62. us-east-1 us-west-2 etc eu-west-1
  63. 63. us-east-1 us-west-2 etc eu-west-1
  64. 64. us-east-1 us-west-2 etc eu-west-1
  65. 65. What’s going on?!
  66. 66. Alert Systems: Atlas, alerting api, api, Central Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, CORE Agent
  67. 67. Central Event Gateway • Parse raw alerts, match application to owner • Add image captures and links to related graphs for easy mobile use • Send to the right service based on priority • Register the event in Chronos, the timeline application • Correlate low priority alerts and generate new high priority alerts
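Three of the gateway steps (matching an application to its owner, sending by priority, and correlating low priority alerts into a high priority one) can be sketched like this. The alert shape, the `OWNERS` table, the channel names, and the threshold of three are all assumptions for illustration. The real gateway's data model is not shown in the slides.

```python
from collections import Counter

# Hypothetical application-to-owner table.
OWNERS = {"api": "edge-team", "playback": "streaming-team"}


def route(alert):
    """Attach an owner and pick a delivery channel based on priority."""
    owner = OWNERS.get(alert["app"], "unowned")
    channel = "page" if alert["priority"] == "high" else "email"
    return {**alert, "owner": owner, "channel": channel}


def correlate(alerts, threshold=3):
    """Emit one high priority alert per app with at least `threshold` low priority alerts."""
    low = Counter(a["app"] for a in alerts if a["priority"] == "low")
    return [
        {"app": app, "priority": "high", "reason": f"{count} low priority alerts"}
        for app, count in sorted(low.items())
        if count >= threshold
    ]
```

The output of `correlate` can be fed back through `route`, so an escalated alert reaches the owner on the paging channel.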
  68. 68. Metrics in Production • 796B Daily metric points • Peaks at 1.4B / min • 50% daily metric churn
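For a sense of scale, the daily total can be turned into average rates and compared with the quoted peak. The only inputs are the two figures from the slide. The variable names are made up for this sketch.

```python
DAILY_POINTS = 796e9   # metric points per day, from the slide
PEAK_PER_MIN = 1.4e9   # peak points per minute, from the slide

avg_per_min = DAILY_POINTS / (24 * 60)
avg_per_sec = DAILY_POINTS / (24 * 60 * 60)
peak_ratio = PEAK_PER_MIN / avg_per_min

print(f"average per minute: {avg_per_min / 1e6:.0f}M")
print(f"average per second: {avg_per_sec / 1e6:.1f}M")
print(f"peak / average:     {peak_ratio:.1f}x")
```

The average works out to roughly 553 million points per minute (about 9.2 million per second), so the quoted peak is about 2.5 times the average rate.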
  69. 69. What is a metric? com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
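The example name appears to combine a dotted base name with `key-value` segments such as `geo-US`. Assuming that reading is right (the slide does not spell out the format), a small parser can split it into a base name and tags. `parse_metric` is a hypothetical helper, not part of any Netflix tool.

```python
def parse_metric(name):
    """Split a dotted metric name into (base_name, tags).

    Assumption: a segment containing '-' is a 'key-value' tag, split on the
    first '-'; every other segment belongs to the base name.
    """
    base, tags = [], {}
    for part in name.split("."):
        key, sep, value = part.partition("-")
        if sep:
            tags[key] = value
        else:
            base.append(part)
    return ".".join(base), tags


EXAMPLE = ("com.netflix.eds.nccp.successful.requests.uiversion."
           "nccprt-authorization.devtypid-101.clver-PHL_0AB."
           "uiver-UI_169_mid.geo-US")
base_name, tags = parse_metric(EXAMPLE)
```

Under that assumption the example yields five tags (`nccprt`, `devtypid`, `clver`, `uiver`, `geo`), which shows how one counter fans out into many distinct time series.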
  70. 70. How we built it • Built our own big data system • Based on S3 and EMR • Fewer copies, lower resolution, and slower retrieval based on age of data
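"Lower resolution based on age of data" usually means rolling fine-grained points up into coarser buckets as they get older. The sketch below shows one such rollup by averaging. The function name `downsample`, the use of the mean, and the bucket size are assumptions, since the slides do not describe the actual rollup rules.

```python
def downsample(points, step):
    """Average (timestamp, value) points into buckets of `step` seconds.

    Each bucket is labeled with its start time. The mean is one possible
    rollup; max or sum would follow the same structure.
    """
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % step, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

For example, one-minute points could be rolled into five-minute buckets once they pass some age, which cuts the stored volume for that range to a fifth.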
  71. 71. Self Serve is the Key • Developers choose what metrics to submit • What graphs they put on their dashboards • What to alert on
  72. 72. Example Alert Config
  73. 73. Atlas
  74. 74. When something breaks...
  75. 75. Breakdown of an outage Is something wrong? Alerting Where is the problem? Telemetry and Dashboards What changed? ???
  76. 76. Breakdown of an outage Is something wrong? Alerting Where is the problem? Telemetry and Dashboards What changed? Change control?
  77. 77. Change control, the good • Tells you what changed • Tells you what’s about to change • Great for coordination when one change gates another change
  78. 78. Change control, the bad • It’s manual • It expresses intent, not reality • It forces you to serialize your changes to an extent
  79. 79. Breakdown of an outage Is something wrong? Alerting Where is the problem? Telemetry and Dashboards What changed? Chronos
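Answering "What changed?" from a timeline comes down to listing the events recorded shortly before the alert fired. The sketch below shows that query over a plain list. The event shape, the `what_changed` name, and the one hour default window are assumptions, not Chronos's actual interface.

```python
def what_changed(events, alert_time, window=3600):
    """Return events from the `window` seconds before alert_time, newest first."""
    recent = [e for e in events if alert_time - window <= e["time"] <= alert_time]
    return sorted(recent, key=lambda e: e["time"], reverse=True)
```

Because the events are recorded automatically as changes happen, the answer reflects what actually occurred rather than what a change ticket said was planned.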
  80. 80. Just a quick reminder... (Some of) Netflix is open source: https://netflix.github.io
  81. 81. Netflix is hiring! If you like what you see here, feel free to reach out!
  82. 82. Questions?
  83. 83. Getting in touch • Email: jedberg@{gmail,netflix}.com Twitter: @jedberg Web: www.jedberg.net Facebook: facebook.com/jedberg LinkedIn: www.linkedin.com/in/jedberg
