The Next Wave of Reliability Engineering

In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google coined the term, companies across the world have adopted a new operations mindset along with automation, deployment and monitoring principles. Most of what SRE does is now well established throughout the industry, so what is the next wave of reliability principles and automation frameworks?

This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.

Published in: Engineering

  1. 1. The Next Wave of Reliability Engineering. Michael Kehoe, Staff Site Reliability Engineer
  2. 2. Today’s agenda: (1) Introductions; (2) Where have we come from; (3) What is Reliability Engineering; (4) Where are we going; (5) The Future of Reliability Engineering; (6) Key Takeaways; (7) Q&A
  3. 3. Introduction
  4. 4. $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  5. 5. $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery: Planning & Automation • Incident Response: Process & Automation • Visibility Engineering: Making use of operational data • Reliability Principles: Defining best practice & automating it
  6. 6. Where have we come from
  7. 7. Where have we come from: Traditional Development/Operations Bottlenecks • Department silos • Slow release cycles • High-toil workloads • Poor operational visibility
  8. 8. What is Reliability Engineering
  9. 9. “What happens when a software engineer is tasked with what used to be called operations” (Ben Treynor Sloss)
  10. 10. “Helping Product and Engineering deliver the best experience possible for the end user from an operations perspective”
  11. 11. What is Reliability Engineering
  12. 12. DevOps Concepts • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual changes • Leverage tooling and automation
  13. 13. DevOps Concepts: Reduce Operational Silos • Shared ownership of code & infrastructure • Sharing of tools • Expectation of collaboration
  14. 14. DevOps Concepts: Accept Failure as Normal • Expect & embrace risk • Quantify failure via SLOs • Blameless postmortems
  15. 15. DevOps Concepts: Implement Gradual Change • Encourage the organization to move quickly • Lower the cost of failure • Manage risk
  16. 16. DevOps Concepts: Leverage Tooling and Automation • Automate toil away • Reduce ‘human touch’
  17. 17. DevOps Concepts: Measure Everything • Measure all aspects of systems • Availability • Errors • Incident statistics
  18. 18. Where are we going?
  19. 19. Where are we going? • Increased agility • Measure everything • Failure is the new normal • Automation is ubiquitous • Observe in depth
  20. 20. The Next Wave of Reliability Engineering
  21. 21. The Future of Reliability Engineering • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation as a Service • Cloud is king
  22. 22. Dawn of the Network Reliability Engineer: Making the network follow SRE practices. https://forums.juniper.net/t5/SDN-and-NFV-Era/2018-and-the-Dawn-of-Network-Reliability-Engineering-NRE/ba-p/316915
  23. 23. Evolution of Network Automation: (1) Manual Operations; (2) Automation; (3) Visibility & Visualization; (4) Data Analysis & Realization; (5) Reactive, Predictive Self-Operation. Credit: Greg Ferro (Packet Pushers), http://packetpushers.net/taxonomy-five-levels-intent-based-networking-beta/
  24. 24. Failure is the New Normal: Downgrade failures from exceptional to expected. https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/
  25. 25. Failure is the New Normal • Accept failure as normal • Test for failure at the application, local infrastructure, and global infrastructure levels • Continuous experimentation
  26. 26. Automation as a Service: Automation & orchestration will be a part of all systems
  27. 27. Automation is Ubiquitous • Automation is expected • Automation is unified • No more one-off scripts • Automation extends to monitoring, triage & automation • Automation drives down Time to Detect and Time to Resolve
  28. 28. Cloud is King: Applications are built for the cloud. https://woodby.com/pricing-plans
  29. 29. Cloud is King • Adoption of private & public clouds will continue • Most infrastructure will be ephemeral • Applications will be engineered to be ‘Cloud Native’ • Engineering agility will continue to increase
  30. 30. Observe & Measure: Making the most of operational data. https://www.acronis.com/en-us/blog/posts/web-application-monitoring-basic-framework
  31. 31. Observe & Measure • Machine-driven triaging using tracing and advanced learning • Advanced analytics on performance to drive infrastructure optimization • Use of incident data to drive feedback loops
  32. 32. Key Takeaways
  33. 33. Key Takeaways: DevOps Concepts • Reduce operational silos • Measure everything • Accept failure as normal • Implement gradual change • Leverage tooling and automation
  34. 34. Key Takeaways: The Future of Reliability Engineering • Evolution of the Network Engineer • Observe and measure • Failure is the new normal • Automation is ubiquitous • Cloud is king
  35. 35. Q&A
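
Two of the measurement ideas in the slides, quantifying failure via SLOs (slide 14) and driving down Time to Detect and Time to Resolve (slide 27), come down to simple arithmetic over request counts and incident timestamps. The following is a minimal sketch and is not taken from the talk: the `Incident` record, its field names, the function names, the 99.9% target, and the sample data are all assumptions made for illustration. Time to Resolve is assumed here to run from the start of the fault, not from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget left for a request-based SLO.

    The budget is the number of failed requests the SLO tolerates,
    (1 - slo_target) * total. A negative result means the SLO was missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average delay between a fault starting and it being detected."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    """Average delay between a fault starting and service being restored."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    # Made-up sample data: two incidents on the same day.
    day = datetime(2018, 3, 1)
    sample = [
        Incident(day.replace(hour=10),
                 day.replace(hour=10, minute=5),
                 day.replace(hour=10, minute=35)),
        Incident(day.replace(hour=14),
                 day.replace(hour=14, minute=15),
                 day.replace(hour=15)),
    ]
    print("Budget left:", error_budget_remaining(0.999, 1_000_000, 250))
    print("Mean time to detect:", mean_time_to_detect(sample))
    print("Mean time to resolve:", mean_time_to_resolve(sample))
```

With a 99.9% target over one million requests, the budget is roughly 1,000 failed requests, so 250 failures leave about three quarters of it unspent. Tracking the two incident averages over time is one way to check whether automation is actually reducing detection and resolution delays.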
