AppJam 2012: How Netflix Operates and Monitors in the Cloud


Ariel Tseitlin, Director of Cloud Solutions, Netflix

As an early adopter of many cloud technologies and the record-holder of the largest public cloud deployment, Netflix is often seen as a leading innovator in cloud computing. In this session Ariel will discuss the key drivers, challenges and design principles behind Netflix’s journey to the cloud. Ariel will also cover how Netflix operates in the cloud, including which tools and mechanisms they use to manage the performance and availability of their highly distributed applications.

Published in: Technology

  1. How Netflix Operates & Monitors in the Cloud
      Ariel Tseitlin, Director of Cloud Solutions
      Tweet @atseitlin with feedback!
  2. With more than 27 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix is the world’s leading internet subscription service for enjoying movies and TV programs. Source: [link missing]
  3. Why the public cloud?
      • Better agility
      • Couldn’t build data centers fast enough
  4. API: 7x growth since moving out of the datacenter. Source: [link missing]
  5. We want to use clouds
      • We just don’t have the time or expertise to develop them.
      • We don’t build our own electric plants
      • We didn’t build CDNs until recently!
  6. Netflix on AWS. Source: @adrianco
  7. Netflix on AWS (diagram; labels: 2012, IPv6)
  8. Why the Cloud?
      • Faster
          • 95th percentile latencies for both 1st page and subsequent pages went down
      • Scalable
          • We couldn’t grow our datacenter fast enough
      • Available
          • They already have datacenters around the world, with more to come
  9. Netflix built a global PaaS
      • Service Oriented Architecture
      • HTTP/REST interfaces between services
  10. Netflix PaaS features
      • Supports all regions and zones
      • Multiple accounts
      • Cross region/account replication
      • Internationalized, localized and GeoIP routed
      • Advanced key management
      • Autoscaling with 1000s of instances
      • Monitoring and alerting on millions of metrics
  11. What AWS Provides
      • Instances
      • Machine Images
      • Elastic IPs
      • Load Balancers
      • Security groups / Autoscaling groups
      • Availability zones and regions
  12. What Netflix provides
      • Applications
      • Clusters
      • Discovery services
      • Application routing
      • Monitoring
  13. Build Pipeline (diagram)
      • Tools: Perforce, Jenkins, Ivy, Artifactory
      • Step labels: resolve, test, publish, sync, compile, build, report, source
      • Artifacts: snapshot / release libraries, libraries / apps, app bundles
      • Notes: Groovy all over, Ant targets, Gradle coming
  14. App Image Baking (diagram)
      • base AMI (Linux, Apache, Java, Tomcat)
      • app AMI
      • S3 / EBS
      • Jenkins / Artifactory, Yum, Bakery
      • Step labels: snapshot, mount, app bundle install, ready to launch
      • AWS EC2 slave instances
  15. Instance Architecture (diagram)
      • Linux Base AMI (CentOS or Ubuntu)
      • Optional Apache
      • Java (JDK 6 or 7)
      • AppDynamics App Agent monitoring
      • GC and thread dump logging
      • Tomcat
      • Application war file, base servlet, platform, interface jars for dependent services
      • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring
      • Log rotation to S3
      • AppDynamics Machine Agent
  16. Freedom and Responsibility
      • Developers deploy when they want to
      • They also manage their own capacity and autoscaling
      • And fix anything that breaks at 4am!
  17. How do we maintain reliability in this type of environment?
  18. All systems choices assume some part will fail at some point.
  19. Netflix autoscaling (chart; label: Traffic Peak)
  20. Going multi-zone
  21. Benefits of Amazon’s Zones
      • Loosely connected
      • Low latency between zones
      • 99.95% uptime guarantee per region
  22. Going Multi-region
  23. Leveraging Multi-region
      • 100% uptime is theoretically possible.
      • You have to replicate your data
      • This will cost money
  24. Reliability and $$
  25. Automate all the things!
  26. Netflix has moved the granularity from the instance to the cluster
  27. Automate The Netflix Way
      • Redundancy where possible and practical
      • Fully automated build tools to test and make packages
      • Fully automated machine image bakery
      • Fully automated image deployment
  28. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  29. The Simian Army
      • Chaos -- Kills random instances
      • Latency -- Slows the network down
      • Security -- Finds security vulnerabilities
      • Conformity -- Promotes best-practices
      • Janitor -- Cleans up unused resources
      • Gorilla -- Takes out availability zones
  30. Incident Reviews
      Ask the key questions:
      • What went wrong?
      • How could we have detected it sooner?
      • How could we have prevented it?
      • How can we prevent this class of problem in the future?
      • How can we improve our behavior for next time?
  31. Monitoring
      • We believe in redundancy
      • Internally-developed monitoring & instrumentation
      • AppDynamics
      • A few others
  32. Application Lifecycle
      10 DEVELOP
         Start paying attention to monitoring
      20 TEST
         Live and die by monitoring
      30 RUN
      40 GOTO 10
  33. Monitoring Use Cases
      • Discovery
          • (Typically) interactive
          • Performance management, debugging, etc
          • What’s the problem?
      • Alerting
          • Automated
          • There’s a problem
  34. Monitoring Application Lifecycle
      • During testing
          • Did we reduce latency?
          • Did we increase CPU usage?
          • Did we maintain memory footprint?
      • In Production
          • Alert when latency > 50ms
          • Warn when CPU > 80%
          • Terminate when ISS memory > 30Mb
  35. Why legacy tools don’t work in the Cloud
      • Lack of distributed view across services
      • Auto-Scaling/Elasticity causes Node churn
      • Scalability to monitor thousands of servers
  36. AppDynamics @ Netflix
  37. AppDynamics @ Netflix
      • >1 million metrics / minute
      • >400 BTs (business transactions)
      • >300 Tiers
      • 5 controllers (3 hosted, 1 in AWS, 1 in DC)
  38. AppDynamics @ Netflix
      • Automatic Monitoring
          • AppDynamics agent baked into AMI
          • Instances self-identify into Tiers
          • 10% instance spike during deployment
      • Incident Alerts
          • Business Transaction Latency & Error rate
          • Thresholds discover own baselines
          • Alert contains URL link to AppDynamics
  39. Just a quick reminder... (Some of) Netflix is open source: [link missing]
  40. Netflix is hiring
  41. Questions?
  42. Getting in touch
      • Email: atseitlin@{gmail,netflix}.com
      • Twitter: @atseitlin
      • LinkedIn: [link missing]
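
The production rules on slide 34 (alert on latency, warn on CPU, terminate on memory) can be sketched as a small threshold table. This is only an illustration in Python, not Netflix's or AppDynamics' implementation: the names Sample, Rule, RULES and evaluate are hypothetical, and the slide's "ISS memory > 30Mb" figure is read here as megabytes, which is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One hypothetical metrics reading for a single instance."""
    latency_ms: float
    cpu_percent: float
    memory_mb: float  # the slide's "ISS memory" value, unit assumed to be MB


@dataclass(frozen=True)
class Rule:
    """A named action plus the predicate that triggers it."""
    action: str
    description: str
    breached: Callable[[Sample], bool]


# Thresholds copied from slide 34; comparisons are strict ("greater than").
RULES = [
    Rule("alert", "latency > 50ms", lambda s: s.latency_ms > 50),
    Rule("warn", "CPU > 80%", lambda s: s.cpu_percent > 80),
    Rule("terminate", "memory > 30Mb", lambda s: s.memory_mb > 30),
]


def evaluate(sample: Sample) -> List[str]:
    """Return the actions whose threshold the sample exceeds, in rule order."""
    return [rule.action for rule in RULES if rule.breached(sample)]


if __name__ == "__main__":
    reading = Sample(latency_ms=72.0, cpu_percent=45.0, memory_mb=12.0)
    print(evaluate(reading))
```

Keeping the thresholds in one table rather than scattered through code makes it easy to compare a test run against the same limits used in production, which is the point the slide makes about watching the same metrics in both phases.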