AppJam 2012: How Netflix Operates and Monitors in the Cloud

Ariel Tseitlin, Director of Cloud Solutions, Netflix

As an early adopter of many cloud technologies and the record-holder of the largest public cloud deployment, Netflix is often seen as a leading innovator in cloud computing. In this session Ariel will discuss the key drivers, challenges and design principles behind Netflix’s journey to the cloud. Ariel will also cover how Netflix operates in the cloud, including which tools and mechanisms they use to manage the performance and availability of their highly distributed applications.

Published in: Technology

AppJam 2012: How Netflix Operates and Monitors in the Cloud

  1. How Netflix Operates & Monitors in the Cloud
     Ariel Tseitlin, Director of Cloud Solutions
     Tweet @atseitlin with feedback!
  2. With more than 27 million streaming members in the United States, Canada, Latin America, the United Kingdom and Ireland, Netflix is the world's leading internet subscription service for enjoying movies and TV programs.
     Source: http://ir.netflix.com
  3. Why the public cloud?
     • Better agility
     • Couldn’t build data centers fast enough
  4. API: 7x growth since moving out of the datacenter
     Source: http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
  5. We want to use clouds
     • We just don’t have the time or expertise to develop them.
     • We don’t build our own electric plants
     • We didn’t build CDNs until recently!
  6. Netflix on AWS (diagram)
     Source: @adrianco
  7. Netflix on AWS (diagram; labels: 2012, IPv6)
  8. Why the Cloud?
     • Faster
       • 95th percentile latencies for both 1st page and subsequent pages went down
     • Scalable
       • We couldn’t grow our datacenter fast enough
     • Available
       • They already have datacenters around the world, with more to come
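Slide 8 quotes 95th percentile latencies. As a reminder of what that figure measures, here is a minimal sketch using the nearest-rank definition. The `percentile` helper and the sample latencies are illustrative assumptions, not Netflix data or tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical page-load latencies in milliseconds (made up for illustration).
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 17,
                14, 13, 15, 16, 12, 19, 14, 13, 15, 900]

p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # slow tail that most users still see
```

The median hides the two slow requests in this sample, while the 95th percentile exposes them, which is why tail percentiles are the usual yardstick for user-facing latency.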
  9. Netflix built a global PaaS
     • Service Oriented Architecture
     • HTTP/REST interfaces between services
  10. Netflix PaaS features
      • Supports all regions and zones
      • Multiple accounts
      • Cross region/account replication
      • Internationalized, localized and GeoIP routed
      • Advanced key management
      • Autoscaling with 1000s of instances
      • Monitoring and alerting on millions of metrics
  11. What AWS Provides
      • Instances
      • Machine Images
      • Elastic IPs
      • Load Balancers
      • Security groups / Autoscaling groups
      • Availability zones and regions
  12. What Netflix provides
      • Applications
      • Clusters
      • Discovery services
      • Application routing
      • Monitoring
  13. Build Pipeline (diagram)
      • Components: Perforce, Jenkins, Ivy, Artifactory
      • Steps: sync, resolve, compile, build, test, report, publish
      • Artifacts: source, snapshot / release libraries, libraries / apps, app bundles
      • Notes: Groovy all over, Ant targets, Gradle coming
  14. App Image Baking (diagram)
      • Base AMI: Linux, Apache, Java, Tomcat
      • Inputs: app bundle from Jenkins / Yum / Artifactory
      • Bakery: AWS EC2 slave instances (mount, install, snapshot)
      • Storage: S3 / EBS
      • Output: app AMI, ready to launch
  15. Instance Architecture (diagram)
      • Linux Base AMI (CentOS or Ubuntu)
      • Optional Apache
      • Java (JDK 6 or 7)
      • Tomcat
        • Application war file, base servlet, platform, interface jars for dependent services
        • Healthcheck, status servlets, JMX interface, Servo autoscale
      • Monitoring
        • AppDynamics App Agent monitoring
        • AppDynamics Machine Agent
        • GC and thread dump logging
        • Log rotation to S3
  16. Freedom and Responsibility
      • Developers deploy when they want to
      • They also manage their own capacity and autoscaling
      • And fix anything that breaks at 4am!
  17. How do we maintain reliability in this type of environment?
  18. All system choices assume some part will fail at some point.
  19. Netflix autoscaling (chart; labels: Traffic, Peak)
  20. Going multi-zone
  21. Benefits of Amazon’s Zones
      • Loosely connected
      • Low latency between zones
      • 99.95% uptime guarantee per region
  22. Going Multi-region
  23. Leveraging Multi-region
      • 100% uptime is theoretically possible.
      • You have to replicate your data
      • This will cost money
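Slides 21 and 23 pair a 99.95% per-region uptime guarantee with the claim that multi-region makes 100% theoretically possible. The arithmetic behind that claim can be sketched as follows. This assumes region failures are independent and that traffic fails over instantly, which is a strong simplification; both function names are hypothetical.

```python
def combined_availability(per_region, regions):
    """Availability of a service that stays up as long as at least one of
    `regions` regions is up, ASSUMING independent region failures."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be a fraction between 0 and 1")
    if regions < 1:
        raise ValueError("need at least one region")
    return 1.0 - (1.0 - per_region) ** regions


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


single = combined_availability(0.9995, 1)  # the per-region figure from slide 21
double = combined_availability(0.9995, 2)  # two regions, independence assumed
```

Under these assumptions one region allows about 263 minutes of downtime a year, and two regions about 0.13 minutes. The result approaches 100% but never reaches it, and real failures are often correlated, so the gain in practice is smaller than the formula suggests.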
  24. Reliability and $$
  25. Automate all the things!
  26. Netflix has moved the granularity from the instance to the cluster
  27. Automate The Netflix Way
      • Redundancy where possible and practical
      • Fully automated build tools to test and make packages
      • Fully automated machine image bakery
      • Fully automated image deployment
  28. The Monkey Theory
      • Simulate things that go wrong
      • Find things that are different
  29. The Simian Army
      • Chaos -- Kills random instances
      • Latency -- Slows the network down
      • Security -- Finds security vulnerabilities
      • Conformity -- Promotes best-practices
      • Janitor -- Cleans up unused resources
      • Gorilla -- Takes out availability zones
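The "Chaos" entry on slide 29 (kill random instances) can be illustrated with a toy simulation. This is not Netflix's Chaos Monkey code and makes no AWS calls. The cluster map, the `terminate` callback, and the rule of sparing single-instance clusters are all assumptions made for this sketch.

```python
import random


def pick_victims(clusters, rng, probability=1.0):
    """Choose at most one instance per cluster to terminate.

    clusters: dict mapping cluster name -> list of instance ids.
    Clusters with fewer than two instances are skipped (an assumption of
    this sketch, so the experiment never removes the last instance).
    """
    victims = {}
    for name, instances in clusters.items():
        if len(instances) < 2:
            continue
        if rng.random() < probability:
            victims[name] = rng.choice(sorted(instances))
    return victims


def unleash(clusters, terminate, rng=None, probability=1.0):
    """Terminate the chosen instances via the caller-supplied callback."""
    rng = rng or random.Random()
    victims = pick_victims(clusters, rng, probability)
    for instance_id in victims.values():
        terminate(instance_id)  # hypothetical hook; a real one would call the cloud API
    return victims


# Hypothetical clusters; instance ids are made up.
clusters = {"api": ["i-1", "i-2", "i-3"], "solo": ["i-9"]}
killed = []
victims = unleash(clusters, killed.append, rng=random.Random(0))
```

Working at the cluster level matches slide 26: the unit that must survive is the cluster, so losing any single instance should be a non-event.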
  30. Incident Reviews
      Ask the key questions:
      • What went wrong?
      • How could we have detected it sooner?
      • How could we have prevented it?
      • How can we prevent this class of problem in the future?
      • How can we improve our behavior for next time?
  31. Monitoring
      • We believe in redundancy
      • Internally-developed monitoring & instrumentation
      • AppDynamics
      • A few others
  32. Application Lifecycle
      10 DEVELOP
      20 TEST
      30 RUN
      40 GOTO 10
      Annotations: Start paying attention to monitoring; Live and die by monitoring
  33. Monitoring Use Cases
      • Discovery
        • (Typically) interactive
        • Performance management, debugging, etc
        • What’s the problem?
      • Alerting
        • Automated
        • There’s a problem
  34. Monitoring Application Lifecycle
      • During testing
        • Did we reduce latency?
        • Did we increase CPU usage?
        • Did we maintain memory footprint?
      • In Production
        • Alert when latency > 50ms
        • Warn when CPU > 80%
        • Terminate when ISS memory > 30Mb
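The production rules on slide 34 amount to a small table of metric, limit and action. A minimal sketch of evaluating such a table against one metrics sample follows. The metric key names and the `evaluate` function are assumptions for illustration; the limits are the ones quoted on the slide, and the slide's memory metric is taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str   # key in the metrics sample (hypothetical names)
    limit: float  # threshold from slide 34
    action: str   # what to do when the value exceeds the limit


RULES = [
    Rule("latency_ms", 50, "alert"),
    Rule("cpu_percent", 80, "warn"),
    Rule("memory_mb", 30, "terminate"),
]


def evaluate(sample, rules=RULES):
    """Return the actions triggered by one sample (dict of metric -> value).

    A rule fires only when its metric is present and strictly above the limit.
    """
    return [
        rule.action
        for rule in rules
        if rule.metric in sample and sample[rule.metric] > rule.limit
    ]


# Hypothetical sample from one instance.
actions = evaluate({"latency_ms": 72, "cpu_percent": 40, "memory_mb": 31})
```

Keeping the rules as data means that the same checks can run as pass/fail gates during testing and as alerts in production, which is the lifecycle the slide describes.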
  35. Why legacy tools don’t work in the Cloud
      • Lack of distributed view across services
      • Auto-Scaling/Elasticity causes Node churn
      • Scalability to monitor thousands of servers
  36. AppDynamics @ Netflix
  37. AppDynamics @ Netflix
      • >1 million metrics / minute
      • >400 BTs
      • >300 Tiers
      • 5 controllers (3 hosted, 1 in AWS, 1 in DC)
  38. AppDynamics @ Netflix
      • Automatic Monitoring
        • AppDynamics agent baked into AMI
        • Instances self-identify into Tiers
        • 10% instance spike during deployment
      • Incident Alerts
        • Business Transaction Latency & Error rate
        • Thresholds discover own baselines
        • Alert contains URL link to AppDynamics
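"Thresholds discover own baselines" on slide 38 refers to alerting against learned behavior rather than fixed limits. The slides do not describe how AppDynamics computes its baselines, so the sketch below swaps in a generic technique: a rolling mean plus a multiple of the standard deviation. The class name and all parameters are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class BaselineDetector:
    """Flag values that sit far above a rolling baseline.

    Generic mean + k * stddev rule, NOT AppDynamics' actual algorithm.
    """

    def __init__(self, window=60, sigmas=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # most recent observations
        self.sigmas = sigmas
        self.min_samples = min_samples       # learn before alerting

    def observe(self, value):
        """Return True if value is anomalous, then record it."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            anomalous = value > baseline + self.sigmas * spread
        self.history.append(value)
        return anomalous


# Hypothetical business-transaction latencies in milliseconds.
detector = BaselineDetector(window=20, sigmas=3.0, min_samples=5)
flags = [detector.observe(v) for v in [20, 21, 19, 20, 22, 21, 20, 19, 90]]
```

Only the final value is flagged here. This sketch records anomalous values into the history as well, which raises the baseline after an incident; a production system would likely treat those samples differently.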
  39. Just a quick reminder...
      (Some of) Netflix is open source: https://github.com/netflix
  40. Netflix is hiring
      http://jobs.netflix.com/jobs.html
  41. Questions?
  42. Getting in touch
      Email: atseitlin@{gmail,netflix}.com
      Twitter: @atseitlin
      Linkedin: www.linkedin.com/in/atseitlin