20140708 - Jeremy Edberg: How Netflix Delivers Software

Transcript

  • 1. How Netflix Delivers Software. July 8th, 2014. Email: jedberg@{gmail,netflix}.com; Twitter: @jedberg; Web: www.jedberg.net; Facebook: facebook.com/jedberg; LinkedIn: www.linkedin.com/in/jedberg
  • 2. When your software fails...
  • 3. will your system survive?
  • 4. The Netflix way • Fully automated build tools to test and make packages • Fully automated machine image bakery • Fully automated image deployment
  • 5. The Netflix way • Everything is “built for three” • Independent teams responsible for both Dev and Ops • Redundancy through multi-region deployment
  • 6. Philosophy
  • 7. Freedom and Responsibility • We hire responsible adults and keep rules and policies to a minimum • Developers can change any code in production at any time • And things don’t break (usually)
  • 8. Automate all the things! http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
  • 9. Automate all the things! • Application startup • Configuration • Code deployment • System deployment
  • 10. Automation • Standard base image • Tools to manage all the systems • Reduce errors through reproducibility
  • 11. Shared state should be stored in a shared service. Data on an instance should be replicated to other instances.
  • 12. “Build for three” We hold a boot camp for new engineers to teach them how to build for a highly distributed environment.
  • 14. 12B outbound requests per day to API dependencies: Movie Ratings, Personalization Engine, User Info, Movie Metadata, Similar Movies, Reviews, A/B Test Engine. 2B requests per day into the Netflix API. Discovery API, Streaming API.
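The two figures on this slide imply a fan-out ratio between inbound API calls and calls to backing services. A minimal sketch of that arithmetic follows; the constant and function names are illustrative, and only the two daily totals come from the slide.

```python
# Daily totals quoted on the slide.
INBOUND_API_REQUESTS_PER_DAY = 2_000_000_000
OUTBOUND_DEPENDENCY_REQUESTS_PER_DAY = 12_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def fan_out(inbound: int, outbound: int) -> float:
    """Average number of dependency calls made per inbound API request."""
    return outbound / inbound


def average_per_second(requests_per_day: int) -> float:
    """Average request rate, assuming traffic were spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    ratio = fan_out(INBOUND_API_REQUESTS_PER_DAY, OUTBOUND_DEPENDENCY_REQUESTS_PER_DAY)
    print(f"fan-out: {ratio:.1f} dependency calls per API request")
    print(f"inbound average: {average_per_second(INBOUND_API_REQUESTS_PER_DAY):,.0f} req/s")
```

Real traffic is not evenly spread, so the per-second figure is only a lower bound on what peak capacity has to handle.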
  • 15. Movie Ratings, Personalization Engine, User Info, Movie Metadata, Similar Movies, Reviews, A/B Test Engine, Discovery API, Streaming API, Content Encoding, CDN Management, QOS Logging, DRM, OpenConnect Edge Locations, Browse, Play, Watch
  • 16. Highly aligned, loosely coupled • Services are built by different teams who work together to figure out what each service will provide. • The service owner publishes an API that anyone can use.
  • 17. Advantages to a Service Oriented Architecture • Easier auto-scaling • Easier capacity planning • Identify problematic code-paths more easily • Narrow down the effects of a change • More efficient local caching
  • 18. Freedom and Responsibility • Developers deploy when they want • They also manage their own capacity and autoscaling • And fix anything that breaks at 4am!
  • 19. All system choices assume some part will fail at some point.
  • 20. The Monkey Theory • Simulate things that go wrong • Find things that are different
  • 21. Execution
  • 22. AWS Netflix OSS Netflix Application Code
  • 23. AWS Netflix OSS YOUR Application Code
  • 24. What AWS Provides (AWS) • Instances • Machine Images • Elastic IPs • Load Balancers • Security groups / Autoscaling
  • 25. AWS Netflix OSS YOUR Application Code
  • 26. Netflix built a global PaaS (Netflix OSS) • Service Oriented Architecture • HTTP/REST interfaces between services
  • 27. Netflix PaaS features (Netflix OSS) • Supports all regions and zones • Multiple accounts • Cross region/account replication • Internationalized, localized and GeoIP routed • Advanced key management • Autoscaling with 1000s of instances • Monitoring and alerting on millions of metrics
  • 28. Open Source at Netflix
  • 29. Netflix OSS
  • 30. Circuit Breakers (Hystrix) • Be liberal in what you accept, strict in what you send
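The circuit breaker pattern named on this slide can be sketched in a few lines. This is a generic illustration, not the Hystrix API: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not Hystrix's interface)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """True while calls should be short-circuited to the fallback."""
        if self._opened_at is None:
            return False
        # After the timeout, let a trial call through ("half-open").
        return self._clock() - self._opened_at < self._reset_timeout

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Trip (or re-trip) the breaker and restart the timeout.
                self._opened_at = self._clock()
            return fallback()
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, the failing dependency is not called at all, which gives it room to recover and keeps the caller's latency bounded by the fallback.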
  • 31. The Monkey Theory • Simulate things that go wrong • Find things that are different
  • 32. The simian army • Chaos -- Kills random instances • Chaos Gorilla -- Kills zones • Chaos Kong -- Kills regions • Latency -- Degrades network and injects faults • Conformity -- Looks for outliers • Circus -- Kills and launches instances to maintain zone balance • Doctor -- Fixes unhealthy resources • Janitor -- Cleans up unused resources • Howler -- Yells about bad things like Amazon limit violations • Security -- Finds security issues and expiring certificates
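The "Chaos -- Kills random instances" idea reduces to picking one victim per group at random. The sketch below only does the selection and performs no termination; the function name, the group-to-instances mapping, and the `min_size` guard are assumptions for illustration and are not taken from the real Chaos Monkey code.

```python
import random
from typing import Dict, List


def pick_victims(
    groups: Dict[str, List[str]],
    rng: random.Random,
    min_size: int = 2,
) -> Dict[str, str]:
    """Choose one random instance per group to terminate.

    Groups smaller than min_size are skipped (an assumed safety rule):
    killing the only instance would be an outage rather than a test.
    """
    victims: Dict[str, str] = {}
    for name, instances in sorted(groups.items()):
        if len(instances) < min_size:
            continue
        victims[name] = rng.choice(instances)
    return victims


if __name__ == "__main__":
    groups = {"api": ["i-1", "i-2", "i-3"], "singleton": ["i-9"]}
    print(pick_victims(groups, random.Random()))
```

Passing the random generator in makes a run reproducible with a seed, which helps when replaying a failure scenario.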
  • 33. Netflix OSS
  • 34. • Blueprint for the rest of the platform libraries • Pluggable architecture
  • 35. • On instance software load balancer • Zone aware / Zone affinity • Handles retry logic
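The three properties on this slide (client-side balancing, zone affinity, retry) can be sketched together. The slide does not show an API, so `Server`, `ZoneAwareBalancer`, and `max_attempts` are hypothetical names, and the policy shown (round-robin within the local zone, then fall back to other zones) is one plausible reading.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Server:
    host: str
    zone: str


class ZoneAwareBalancer:
    """Client-side balancer sketch: prefer same-zone servers, retry on failure."""

    def __init__(self, servers: Iterable[Server], local_zone: str) -> None:
        self._servers = list(servers)
        self._local_zone = local_zone
        self._counter = itertools.count()

    def candidates(self) -> List[Server]:
        """Servers in preference order: rotated local zone first, then the rest."""
        local = [s for s in self._servers if s.zone == self._local_zone]
        remote = [s for s in self._servers if s.zone != self._local_zone]
        if local:
            start = next(self._counter) % len(local)
            local = local[start:] + local[:start]
        return local + remote

    def execute(self, request: Callable[[Server], T], max_attempts: int = 3) -> T:
        """Try up to max_attempts servers, moving on after a ConnectionError."""
        last_error: Optional[Exception] = None
        for server in self.candidates()[:max_attempts]:
            try:
                return request(server)
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("all attempts failed") from last_error
```

Keeping traffic inside the caller's zone avoids cross-zone latency in the normal case, while the fallback order still lets requests succeed when the local zone is unhealthy.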
  • 36. • Global variables • Support for staged rollout • Feature flags
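A staged rollout behind a feature flag is often implemented by hashing each user into a stable bucket and enabling the flag for buckets below a percentage. The sketch below shows that technique under assumed names (`rollout_bucket`, `is_enabled`); it is not the configuration library the slide refers to.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return rollout_bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    enabled = sum(is_enabled("new-ui", str(user), 10) for user in range(10_000))
    print(f"{enabled} of 10000 users see the flag at 10%")
```

Because the bucket depends only on the flag and the user, raising the percentage adds users without removing anyone who already had the feature.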
  • 37. Netflix OSS
  • 38. • Application to instance mapping • Heartbeat to keep track of health
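Application-to-instance mapping with heartbeats can be sketched as a registry that forgets instances whose last heartbeat is too old. The `Registry` class, the 90-second default lease, and the injectable clock are assumptions for the example, not the interface of the service on the slide.

```python
import time
from typing import Callable, Dict, List, Tuple


class Registry:
    """In-memory service registry sketch keyed by (application, instance id)."""

    def __init__(
        self,
        lease_seconds: float = 90.0,  # assumed value for illustration
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, app: str, instance_id: str) -> None:
        """Register the instance or renew its lease."""
        self._last_seen[(app, instance_id)] = self._clock()

    def instances(self, app: str) -> List[str]:
        """Instance ids for the app whose lease has not expired."""
        now = self._clock()
        return sorted(
            instance_id
            for (name, instance_id), seen in self._last_seen.items()
            if name == app and now - seen <= self._lease_seconds
        )
```

An instance that dies simply stops sending heartbeats and drops out of the mapping once its lease runs out, so no explicit deregistration is needed for the failure case.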
  • 39. DQ Transport Routing Suro etc Eventbus Druid
  • 40. Netflix OSS
  • 41. Why Bake? • Traditional (Generic AMI, Instance): launch OS, install packages, install app • Netflix (App AMI, Instance): launch OS + app
  • 42. Getting Baked Perforce / Git libraries source Ant targets Ivy Groovy all over app bundles Jenkins sync resolve build compile report publish test Artifactory snapshot / release libraries / apps
  • 43. Base Image Baking Yum / Apt Linux: CentOS, Fedora, Ubuntu RPMs: Apache, Java... ec2 slave instances S3 / EBS foundation AMI base AMI Bakery mount install Ready for app bake snapshot AWS
  • 44. App Image Baking Jenkins / Yum / Artifactory Linux, Apache, Java, Tomcat AWS app bundle ec2 slave instances S3 / EBS base AMI app AMI Bakery mount install Ready to launch! snapshot
  • 45. app AMI • Linux Base AMI (CentOS or Ubuntu) • Java • Tomcat • Optional Apache • Monitoring • Log Rotation to S3 • monitoring • GC and thread dump logging • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale
  • 46. app AMI • Linux Base AMI (CentOS or Ubuntu) • Java • Tomcat • Optional Apache • Monitoring • Log Rotation to S3 • monitoring • GC and thread dump logging • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale • Application war file
  • 47. app AMI • Linux Base AMI (CentOS or Ubuntu) • Java • JBoss • Optional Apache • Monitoring • Log Rotation to S3 • monitoring • GC and thread dump logging • Application war file, base servlet, platform, interface jars for dependent services • Healthcheck, status servlets, JMX interface, Servo autoscale
  • 48. app AMI • Linux Base AMI (CentOS or Ubuntu) • Python • Bottle • Optional Apache • Monitoring • Log Rotation to S3 • monitoring • logging • Application file, base server, platform, interface libs for dependent services
  • 49. Netflix OSS
  • 50. Deploying Code; Step 1
  • 51. Auto Scaling Group Launch Configuration Security Group Amazon Machine Image Instances Load Balancer
  • 52. Netflix has moved the granularity from the instance to the cluster
  • 53. Data is the most important asset Netflix has. It’s what differentiates us from our competitors.
  • 54. Netflix OSS
  • 55. EVCache • Wrapper on top of memcached • Automatically replicates writes to multiple regions • Pulls cache data intelligently via zone affinity
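The two behaviours listed for EVCache (replicated writes, reads that prefer the local zone) can be sketched with plain dictionaries standing in for memcached nodes. `ReplicatedCache`, `flush_zone`, and the zone names are hypothetical; this is not the EVCache client API.

```python
from typing import Dict, Iterable, Optional


class ReplicatedCache:
    """Cache sketch: write to every zone, read from the local zone first."""

    def __init__(self, zones: Iterable[str], local_zone: str) -> None:
        # One dict per zone stands in for that zone's memcached nodes.
        self._stores: Dict[str, Dict[str, str]] = {zone: {} for zone in zones}
        if local_zone not in self._stores:
            raise ValueError(f"unknown local zone: {local_zone}")
        self._local_zone = local_zone

    def set(self, key: str, value: str) -> None:
        """Replicate the write to every zone's copy."""
        for store in self._stores.values():
            store[key] = value

    def get(self, key: str) -> Optional[str]:
        """Read with zone affinity: local copy first, other zones on a miss."""
        others = [zone for zone in self._stores if zone != self._local_zone]
        for zone in [self._local_zone] + others:
            if key in self._stores[zone]:
                return self._stores[zone][key]
        return None

    def flush_zone(self, zone: str) -> None:
        """Simulate losing one zone's cache contents."""
        self._stores[zone].clear()
```

With writes replicated everywhere, losing one zone's cache costs a slower read from another copy instead of a cold cache.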
  • 56. Cassandra
  • 57. Why Cassandra? • Availability over consistency • Writes over reads • We know Java • Open source + support
  • 58. Using Cassandra at Netflix • Priam: zero touch auto-config, state management, token assignment, node replacement, backup/restore to/from S3 • Astyanax: OO abstraction to Cassandra, multi-region support
  • 59. Cassandra Architecture
  • 60. Going Multi-region
  • 61. Leveraging Multi-region • 100% uptime is theoretically possible. • You have to replicate your data • This will cost money
  • 62. us-east-1 us-west-2 etc eu-west-1
  • 65. What’s going on?!
  • 66. Alert Systems: Atlas alerting, api, api, Central Event Gateway, Paging Service, Amazon SES, CORE Agent, Other Team’s Agent, CORE Agent
  • 67. Central Event Gateway • Parse raw alerts, match application to owner • Add image captures and links to related graphs for easy mobile use • Send to the right service based on priority • Register the event in Chronos, the timeline application • Correlate low priority alerts and generate new high priority alerts
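Two of the gateway's steps, routing by priority and correlating low priority alerts into a high priority one, can be sketched as below. The `Alert` fields, the "page"/"email" channel names, the owner lookup table, and the threshold of three are all assumptions for illustration; the slides do not describe the real rules.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    app: str
    priority: str  # "high" or "low" (assumed levels)
    message: str


def route(alert: Alert, owners: Dict[str, str]) -> Tuple[str, str]:
    """Match the application to its owner and pick a channel by priority."""
    owner = owners.get(alert.app, "unowned")
    channel = "page" if alert.priority == "high" else "email"
    return owner, channel


def correlate(alerts: Iterable[Alert], threshold: int = 3) -> List[Alert]:
    """Emit one high priority alert per app with enough low priority alerts."""
    counts = Counter(a.app for a in alerts if a.priority == "low")
    return [
        Alert(app, "high", f"{count} correlated low priority alerts")
        for app, count in sorted(counts.items())
        if count >= threshold
    ]
```

Correlation is what lets many individually ignorable signals wake someone up when they all point at the same application.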
  • 68. Metrics in Production • 796B Daily metric points • Peaks at 1.4B / min • 50% daily metric churn
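The daily total and the peak rate on this slide can be compared directly. The sketch below derives the average per-minute rate implied by 796B points per day and how far the quoted 1.4B/min peak sits above it; only those two figures come from the slide, the names are illustrative.

```python
# Figures quoted on the slide.
DAILY_METRIC_POINTS = 796_000_000_000
PEAK_POINTS_PER_MINUTE = 1_400_000_000
MINUTES_PER_DAY = 24 * 60


def average_points_per_minute(daily_points: int) -> float:
    """Average ingest rate if the daily total were spread evenly."""
    return daily_points / MINUTES_PER_DAY


def peak_to_average(daily_points: int, peak_per_minute: int) -> float:
    """How many times larger the peak minute is than the average minute."""
    return peak_per_minute / average_points_per_minute(daily_points)


if __name__ == "__main__":
    avg = average_points_per_minute(DAILY_METRIC_POINTS)
    ratio = peak_to_average(DAILY_METRIC_POINTS, PEAK_POINTS_PER_MINUTE)
    print(f"average: {avg:,.0f} points/min, peak is {ratio:.2f}x the average")
```

The average works out to roughly 553M points per minute, so the ingest path has to be sized for about two and a half times that.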
  • 69. What is a metric? com.netflix.eds.nccp.successful.requests.uiversion.nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US
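The example name appears to combine a dotted base name with several `key-value` segments. Assuming that reading (the slide does not state the format), a sketch of splitting such a name into a base name and tags could look like this; `parse_metric` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def parse_metric(metric: str) -> Tuple[str, Dict[str, str]]:
    """Split a dotted metric name into (base name, tags).

    Assumption: any dot-separated segment containing "-" is a key-value tag,
    split at the first "-"; every other segment belongs to the base name.
    """
    name_parts: List[str] = []
    tags: Dict[str, str] = {}
    for segment in metric.split("."):
        key, sep, value = segment.partition("-")
        if sep:
            tags[key] = value
        else:
            name_parts.append(segment)
    return ".".join(name_parts), tags


if __name__ == "__main__":
    example = (
        "com.netflix.eds.nccp.successful.requests.uiversion."
        "nccprt-authorization.devtypid-101.clver-PHL_0AB.uiver-UI_169_mid.geo-US"
    )
    print(parse_metric(example))
```

Each distinct combination of tag values is a separate time series, which is one way a single logical metric multiplies into very large point counts.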
  • 70. How we built it • Built our own big data system • Based on S3 and EMR • Fewer copies, lower resolution, and slower retrieval as data ages
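Lowering resolution with age usually means rolling fine-grained points up into coarser buckets. The sketch below averages points per bucket and picks a bucket size from the data's age; the tier boundaries (7 and 30 days) and bucket sizes are invented for the example, since the slide gives no numbers.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (unix timestamp in seconds, value)


def downsample(points: List[Point], bucket_seconds: int) -> List[Point]:
    """Average all points that fall into the same fixed-width time bucket."""
    buckets: Dict[int, List[float]] = {}
    for timestamp, value in points:
        start = timestamp - (timestamp % bucket_seconds)
        buckets.setdefault(start, []).append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


def resolution_for_age(age_days: int) -> int:
    """Bucket width in seconds for data of a given age (hypothetical tiers)."""
    if age_days <= 7:
        return 60
    if age_days <= 30:
        return 300
    return 3600
```

For example, `downsample(points, resolution_for_age(45))` would keep one averaged point per hour for data that is 45 days old.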
  • 71. Self Serve is the Key • Developers choose what metrics to submit • What graphs they put on their dashboards • What to alert on
  • 72. Example Alert Config
  • 73. Atlas
  • 74. When something breaks...
  • 75. Breakdown of an outage • Is something wrong? Alerting • Where is the problem? Telemetry and Dashboards • What changed? ???
  • 76. Breakdown of an outage • Is something wrong? Alerting • Where is the problem? Telemetry and Dashboards • What changed? Change control?
  • 77. Change control, the good • Tells you what changed • Tells you what’s about to change • Great for coordination when one change gates another change
  • 78. Change control, the bad • It’s manual • It expresses intent, not reality • It forces you to serialize your changes to an extent
  • 79. Breakdown of an outage • Is something wrong? Alerting • Where is the problem? Telemetry and Dashboards • What changed? Chronos
  • 80. Just a quick reminder... (Some of) Netflix is open source: https://netflix.github.io
  • 81. Netflix is hiring! If you like what you see here, feel free to reach out!
  • 82. Questions?
  • 83. Getting in touch • Email: jedberg@{gmail,netflix}.com • Twitter: @jedberg • Web: www.jedberg.net • Facebook: facebook.com/jedberg • LinkedIn: www.linkedin.com/in/jedberg