Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix

1,971 views
1,766 views

Published on

At Netflix, we realize that there’s a tension between the availability of our service and our speed of innovation. If we move slowly, we can be very available -- but that’s not a good business proposition. If we move super fast, we risk downtime -- and that might annoy our customers. But
what if we could increase our velocity without significantly impacting availability? How can we shift that curve so that we’re moving faster without dropping any of those coveted 9’s?
How can we engineer velocity by weaving together tooling and culture with software development to expose and elevate highly effective practices? This talk describes various
components of Netflix’s continuous delivery platform -- much of which is available in open source. I’ll show how these pieces fit together and allow us to build scaffolding so that we’re comfortable with software developers making the decision to push the button for prod deployment -- and helps them to recover if necessary. As a result, we can run fast, trusting our tooling and our culture. I’ll also describe how we test our resiliency through simulating failure, unleashing the monkeys (Simian Army) on our production environment. Because if you’re afraid of cute little monkeys,
imagine how afraid you’ll be of a production environment that offers those same risks but doesn’t give you an opportunity to test your response to those dangers.

Throughout this talk, I hope that you will challenge yourself to consider how your company can "shift the curve" through tooling and to achieve a high velocity environment without negatively impacting reliability.

Published in: Software, Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,971
On SlideShare
0
From Embeds
0
Number of Embeds
88
Actions
Shares
0
Downloads
16
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix

  1. 1. Engineering Velocity: Continuous Delivery at Netflix Dianne Marsh SATURN 2014
  2. 2. en-gi-neer-ing + ve-loc-i-ty ! applying science and technology to designing and building speed into a system
  3. 3. Availability vs. Rate of ChangeAvailablity(in9’s) 0 1 2 3 4 5 6 Rate of Change 0 10 100 1000
  4. 4. Shift the CurveAvailablity(in9’s) 0 1 2 3 4 5 6 Rate of Change 0 10 100 1000 10000
  5. 5. http://www.slideshare.net/reed2001/culture-1798664
  6. 6. Manager’s Role Context, not Control Loosely coupled, Tightly aligned And hire well!
  7. 7. Get out of the Way Freedom to Innovate
  8. 8. Support Experimentation ! How We Built a Predictive Autoscaling Engine http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
  9. 9. Support Independent Paths of Exploration Don’t Prematurely Optimize!
  10. 10. Blameless Culture
  11. 11. Developers Deploy Their Code Run What You Wrote ! • Rapid Innovation • Rapid Detection • Rapid Response ! = Freedom + Responsibility
  12. 12. Support with Tools
  13. 13. Jenkins Job DSL Configuration as Code Groovy Script Scripts go in Version Control http://www.slideshare.net/quidryan/configuration-as-code
  14. 14. Aminator Create AMI from Base AMI Image contains service and everything needed to run it Unit of Deployment for Test and Prod Abstracts Cloud Details http://techblog.netflix.com/2013/03/ami-creation-with-aminator.html
  15. 15. Asgard Deploys Netflix to the Cloud Red/Black push Developed to address delays in rollback http://www.infoq.com/presentations/asgard
  16. 16. Red/Black Push ! • Scale up new instances • Run canary analysis • Turn on traffic to new ASG • Turn off traffic to old ASG • Wait … analyze … continue
  17. 17. Workflow Continuous Delivery Engine Judges between Stages Represent Best Practices http://techblog.netflix.com/2013/09/glisten-groovy-way-to-use-amazons.html
  18. 18. One Click Deployment?
  19. 19. Regional Isolation Limit Impact of Human Error ! • Stagger Deployments? • Canary Testing per Region? ! Know your Service!
  20. 20. Multi-Region Consistency Build Tooling to: ! • Schedule Deployments • Prefer Off-Peak • Choose Next Available Region • Provide Visibility by Region
  21. 21. Simian Army • Chaos Monkey • Latency Monkey • Conformity Monkey • Janitor Monkey 
 (and more) http://www.infoq.com/presentations/netflix-resiliency-failure-cloud
  22. 22. Chaos Monkey Kills Running Instances • Simulates failures inherent to running in the cloud • In Production
  23. 23. Latency Monkey Introduces Latency between services
  24. 24. Conformity Monkey Have Deployments Diverged? • Balance Regional Consistency with Regional Isolation • Build Best Practices into Tooling and Reporting
  25. 25. Janitor Monkey Reduce Cognitive Load and Cost • Remove unused instances • Uniform way to clean up
  26. 26. Shifting the Curve with Tooling • Value Self-Service • Test Everywhere • Awareness of Multiple Regions • Best Practices Represented in Tooling • Recover Quickly and Easily • Be Cloud Native
  27. 27. Shifting the Curve with Culture • Context not Control • Freedom to Experiment • Blameless Culture
  28. 28. ArsTechnica, November 2012 “As the number of applications and the scale of the campaign's AWS infrastructure use climbed, the DevOps team shifted to using Asgard—an open-source tool developed by Netflix to manage cloud deployments.”
  29. 29. Thanks! Dianne Marsh (@dmarsh) dmarsh@netflix.com

×