
Accelerating Innovation and Time-to-Market @ Camp Devops Houston 2015


My talk at Camp DevOps in Houston on how DevOps can accelerate innovation and time-to-market.


  1. Accelerating Innovation and Time-to-Market (@atseitlin), October 28, 2014
  2. Who cares about DevOps?
  3. (image-only slide)
  4. Everything-as-a-Service alphabet soup: IaaS, PaaS, DaaS, mBaaS, SBS, … It all started with SaaS. With great power comes great responsibility. (Consumer web companies figured this out long ago.)
  5. Competition: Software is at the core of every business, not just software companies. The ability to deliver software to market quickly is a competitive advantage.
  6. Accelerating the pace of innovation (diagram: cloud and DevOps driving both the rate of innovation and quality)
  7. DevOps core components: 1. Self-service, 2. Automation, 3. Collaboration.
  8. Software lifecycle: Develop, Test, Deploy, Operate.
  9. Develop: Self-provisioning & self-service. An operational platform that provides: load balancing; service discovery; logging, metrics, alerting, and notifications; resiliency design patterns (bulkheads & fallbacks); security; data access (caching, persistence).
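
The fallback pattern from slide 9 fits in a few lines. A minimal sketch, assuming a hypothetical recommendations service: the bounded thread pool is the bulkhead, and the static default list is the fallback.

```python
import concurrent.futures

# Bulkhead: a dedicated, bounded pool per dependency, so a slow dependency
# exhausts only its own pool, not the whole service.
recommendations_pool = concurrent.futures.ThreadPoolExecutor(max_workers=10)

def fetch_recommendations(user_id):
    # Stand-in for a remote call that may hang or fail.
    raise TimeoutError("dependency timed out")

def default_recommendations():
    # Fallback: a static, degraded-but-usable response.
    return ["popular-title-1", "popular-title-2"]

def get_recommendations(user_id, timeout=0.2):
    future = recommendations_pool.submit(fetch_recommendations, user_id)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # Degrade gracefully instead of propagating the failure to the user.
        return default_recommendations()

print(get_recommendations(42))  # -> ['popular-title-1', 'popular-title-2']
```
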
  10. Test: Unit / functional / regression / integration; chaos (failure injection / simulation); performance / squeeze; security. Automate as much as possible (ideally all of it!).
  11. Deploy: Continuous delivery; automated configuration; zero-downtime deployment (red/black or rolling); canary analysis; redundancy.
  12. Deploy: Continuous delivery. Replace the manual release process with automation: an assembly line.
  13. The Netflix API continuous delivery pipeline. * Sangeeta Narayanan, Move Fast; Stay Safe: Developing & Deploying the Netflix API (the-netflix-api)
  14. Deploy: Automated configuration. "Servers aren't pets; they're cattle." The old way: curated, hand-created infrastructure. The new way: disposable, elastic, self-configuring.
  15. Deploy: Automated configuration. Two approaches: 1. Bake virtual machine images or containers. 2. Bootstrap at launch.
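
The two approaches trade build-time work for boot-time work. A toy sketch of the contrast; the image and artifact structures are invented for illustration:

```python
# Illustrative only: contrast bake-time vs boot-time configuration.

def bake_image(base_image, app_artifact):
    """Build-time: produce an immutable image with the app already installed.
    Instances launched from it need no install steps, so boot is fast and
    deterministic."""
    image = dict(base_image)
    image["installed"] = app_artifact["name"]
    image["version"] = app_artifact["version"]
    return image

def bootstrap_at_launch(instance, artifact_url):
    """Boot-time: a generic instance fetches and configures itself on first
    boot (e.g. via a startup/user-data script). No image rebuild per release,
    but boots are slower and can fail in more ways."""
    instance["startup_script"] = f"curl -fsSLO {artifact_url} && ./install.sh"
    return instance

baked = bake_image({"os": "linux"}, {"name": "api", "version": "1.42"})
booted = bootstrap_at_launch({"os": "linux"}, "https://artifacts.example.com/api-1.42.tgz")
print(baked, booted, sep="\n")
```
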
  16. Deploy: Zero downtime. Two options: 1. Rolling upgrade. 2. Red/black deployment. You must be able to roll back quickly!
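
Red/black (also called blue/green) keeps the old cluster alive, so rollback is just re-pointing traffic. A minimal sketch with a toy load balancer; nothing here is the actual Netflix tooling:

```python
class LoadBalancer:
    def __init__(self):
        self.active_cluster = None

    def point_to(self, cluster):
        self.active_cluster = cluster   # atomic traffic switch

def red_black_deploy(lb, old_cluster, new_cluster, healthy):
    """Bring up the new cluster alongside the old, switch traffic all at
    once, and keep the old cluster running for instant rollback."""
    if not healthy(new_cluster):
        return "aborted: new cluster failed health checks"
    lb.point_to(new_cluster)
    if not healthy(new_cluster):
        lb.point_to(old_cluster)        # rollback = re-point the LB
        return "rolled back"
    # The old cluster is kept (disabled, not terminated) until confidence is high.
    return "deployed"

lb = LoadBalancer()
print(red_black_deploy(lb, "api-v41", "api-v42", healthy=lambda c: True))
```
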
  17. Deploy: Canary analysis. Always, always stage deployments and compare them to a baseline before fully deploying: launch two instances, one with the new code and one with the old, and compare them; if the canary looks good, continue deploying the new code. Lots (183 slides!) of detailed tips & tricks from QConNY at: canary-analysis
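
The comparison step can be automated. A toy scoring function in the spirit of automated canary analysis; the metrics and thresholds are made up for illustration:

```python
# Compare a canary's key metrics to the baseline and decide whether to
# keep rolling out.

def canary_score(baseline, canary):
    """Fraction of metrics where the canary is within 20% of the baseline."""
    ok = 0
    for metric, base in baseline.items():
        cand = canary[metric]
        if base == 0:
            ok += int(cand == 0)
        elif abs(cand - base) / base <= 0.20:
            ok += 1
    return ok / len(baseline)

baseline = {"error_rate": 0.010, "p99_latency_ms": 120, "cpu_pct": 55}
canary   = {"error_rate": 0.012, "p99_latency_ms": 130, "cpu_pct": 58}

score = canary_score(baseline, canary)
print("continue rollout" if score >= 0.95 else "roll back canary")
```
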
  18. Deploy: Redundancy. Think in 3s: multiple data centers (AZs), multiple regions, active-active.
  19. Operate: Monitoring; alerting; notifications; dynamic auto-scaling; failure.
  20. Operate: Monitoring. Collect everything. Prefer structured time series over log data. Build base health metrics into the platform (e.g. 2xx/3xx/4xx/5xx counts, latency percentiles).
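
Deriving those base health metrics from structured events is straightforward. A minimal sketch with invented event records:

```python
from collections import Counter
import math

events = [  # structured time-series points, not parsed log lines
    {"status": 200, "latency_ms": 35},
    {"status": 200, "latency_ms": 48},
    {"status": 404, "latency_ms": 12},
    {"status": 500, "latency_ms": 950},
    {"status": 200, "latency_ms": 41},
]

def status_class_counts(events):
    # Bucket HTTP statuses into 2xx/3xx/4xx/5xx classes.
    return Counter(f"{e['status'] // 100}xx" for e in events)

def percentile(values, p):
    """Nearest-rank percentile, good enough for a sketch."""
    ordered = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

latencies = [e["latency_ms"] for e in events]
print(status_class_counts(events))  # Counter({'2xx': 3, '4xx': 1, '5xx': 1})
print(percentile(latencies, 50), percentile(latencies, 99))  # 41 950
```
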
  21. Operate: Alerting. Self-service, based on meaningful service metrics. Error rates are often a better signal than success rates.
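
An error-rate alert rule might look like the following sketch; the threshold and window shape are invented:

```python
def error_rate(window):
    # window: status-class counts over some evaluation interval.
    total = sum(window.values())
    return window.get("5xx", 0) / total if total else 0.0

def evaluate_alert(window, threshold=0.02, notify=print):
    rate = error_rate(window)
    if rate > threshold:
        # An alert ends in a notification, never a dashboard someone
        # has to be watching (see the next slide).
        notify(f"PAGE: error rate {rate:.1%} exceeds {threshold:.1%}")

evaluate_alert({"2xx": 940, "4xx": 30, "5xx": 30})  # 3% -> pages on-call
```
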
  22. Operate: Notifications. Don't rely on people to watch screens; alerts should trigger notifications. On-call rotations should include the engineers who built the service. Alerts should be granular down to the individual service.
  23. Operate: Dynamic auto-scaling. Saves cost and increases availability.
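
One common form is target tracking: size the cluster to hold per-instance load near a target. A sketch with invented numbers; the minimum of three instances echoes the think-in-3s redundancy rule from slide 18:

```python
import math

def desired_instances(requests_per_sec, target_rps_per_instance=1000,
                      min_instances=3, max_instances=100):
    # Never scale below the redundancy floor or above the safety ceiling.
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    return max(min_instances, min(max_instances, needed))

print(desired_instances(14500))  # -> 15 (scale up for availability)
print(desired_instances(800))    # -> 3  (scale down to save cost)
```
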
  24. Operate: Resiliency through failure.
  25. Failure is all around us: disks fail; power goes out (and your generator fails too); software bugs get introduced; people make mistakes. Failure is unavoidable.
  26. We design around failure: exception handling, clusters, redundancy, fault tolerance, fallbacks or degraded experiences. All to insulate our users from failure. Is that enough?
  27. It's not enough. How do we know if we've succeeded? Does the system work as designed? Is it as resilient as we believe? How do we prevent drifting into failure? The typical answer is…
  28. More testing! Unit testing, integration testing, stress testing, exhaustive test suites to simulate and test all failure modes. But can we effectively simulate a large-scale distributed system?
  29. Building distributed systems is hard; testing them exhaustively is even harder: massive data sets of changing shape, Internet-scale traffic, complex interactions and information flows, asynchronous behavior, 3rd-party services, all while innovating and building features. Exhaustive testing is prohibitively expensive, if not impossible, for most large-scale systems.
  30. What if we could reduce the variability of failures?
  31. There is another way: cause failure to validate resiliency; test design assumptions by stressing them; don't wait for random failure, remove its uncertainty by forcing it periodically.
  32. And that's exactly what we did.
  33. Instances fail.
  34. (image slide: Chaos Monkey)
  35. Chaos Monkey taught us… State is bad; clusters are good; surviving single-instance failure is not enough.
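
The core of the Chaos Monkey idea fits in a few lines: randomly terminate instances, during business hours, so teams must build services that survive instance loss. A toy sketch; the inventory and terminate call are placeholders, not the real Simian Army implementation:

```python
import random

clusters = {
    "api":     ["api-001", "api-002", "api-003"],
    "billing": ["billing-001", "billing-002"],
}

def terminate(instance_id):
    print(f"terminating {instance_id}")  # stand-in for a cloud API call

def unleash_chaos_monkey(clusters, probability=0.5):
    # For each cluster, maybe kill one randomly chosen instance.
    for name, instances in clusters.items():
        if instances and random.random() < probability:
            terminate(random.choice(instances))

random.seed(7)  # deterministic for the example
unleash_chaos_monkey(clusters)
```
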
  36. Lots of instances fail.
  37. Chaos Gorilla.
  38. Chaos Gorilla taught us… There are hidden assumptions about deployment topology; the infrastructure control plane can be a bottleneck; large-scale events are hard to simulate; rapidly shifting traffic is error-prone; smooth recovery is a challenge; Cassandra works as expected.
  39. What about larger catastrophes? Anyone remember Sandy?
  40. Chaos Kong.
  41. The sick and wounded.
  42. Latency Monkey.
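
Latency Monkey's trick is injecting artificial delay into calls between services to verify that callers degrade gracefully. A minimal decorator-based sketch, purely illustrative:

```python
import functools
import random
import time

def inject_latency(min_s=0.1, max_s=2.0, probability=0.2):
    """Wrap a call and, with some probability, delay it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(min_s, max_s))  # simulated slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(probability=1.0, min_s=0.1, max_s=0.3)  # always inject for the demo
def get_user_profile(user_id):
    return {"id": user_id, "plan": "standard"}

start = time.time()
get_user_profile(42)
print(f"call took {time.time() - start:.2f}s with injected latency")
```
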
  43. (image-only slide)
  44. Hystrix, RxJava.
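
Hystrix (a Java library) popularized the circuit-breaker-with-fallback pattern referenced here. A much-reduced Python sketch of the idea, omitting Hystrix details such as the half-open state:

```python
class CircuitBreaker:
    def __init__(self, fn, fallback, failure_threshold=3):
        self.fn, self.fallback = fn, fallback
        self.failure_threshold = failure_threshold
        self.failures = 0

    def call(self, *args):
        if self.failures >= self.failure_threshold:
            return self.fallback(*args)        # circuit open: fail fast
        try:
            result = self.fn(*args)
            self.failures = 0                  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return self.fallback(*args)

def flaky_dependency(x):
    raise ConnectionError("dependency down")

breaker = CircuitBreaker(flaky_dependency, fallback=lambda x: "cached-value")
print([breaker.call(1) for _ in range(5)])     # fallbacks, then fail-fast
```
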
  45. Latency Monkey taught us: startup resiliency is often missed; an ongoing, unified approach to runtime dependency management is important (visibility & transparency get missed otherwise); know thy neighbor (unknown dependencies); fallbacks can fail too; it's hard to be ready for Latency Monkey, especially with no customer impact.
  46. (image-only slide)
  47. Entropy.
  48. Clutter accumulates: complexity, cruft, vulnerabilities, cost.
  49. Janitor Monkey.
  50. Janitor Monkey taught us… Label everything; clutter builds up.
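
In the same spirit, a toy cleanup pass that flags unlabeled or long-idle resources; the records and fields are invented for illustration:

```python
from datetime import datetime, timedelta

resources = [
    {"id": "vol-1", "owner": "team-api", "last_used": datetime(2014, 10, 1)},
    {"id": "vol-2", "owner": None,       "last_used": datetime(2014, 3, 1)},
    {"id": "ami-9", "owner": "team-api", "last_used": datetime(2014, 1, 15)},
]

def janitor_candidates(resources, now, max_idle=timedelta(days=90)):
    # Yield (resource id, reason) pairs for anything worth cleaning up.
    for r in resources:
        if r["owner"] is None:
            yield r["id"], "no owner label"          # label everything!
        elif now - r["last_used"] > max_idle:
            yield r["id"], "idle for more than 90 days"

now = datetime(2014, 10, 28)
for rid, reason in janitor_candidates(resources, now):
    print(f"mark {rid} for cleanup: {reason}")
```
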
  51. Ranks of the Simian Army: Chaos Monkey, Chaos Gorilla, Latency Monkey, Janitor Monkey, Conformity Monkey, Circus Monkey, Doctor Monkey, Howler Monkey, Security Monkey, Chaos Kong, Efficiency Monkey.
  52. Observability is key: don't exacerbate real customer issues with failure exercises; deep system visibility is key to root-causing failures and understanding the system.
  53. Organizational elements: every engineer is an operator of the service; each failure is an opportunity to learn; blameless culture. The goal is to create a learning organization.
  54. Summary: Software is becoming ubiquitous, is delivered as an always-on service, and is at the heart of enabling business innovation. Every organization is looking for ways to accelerate innovation. DevOps accelerates innovation.
  55. Q & A. Questions? You can reach me at @atseitlin.