
A Crash Course in Building Site Reliability


<p>From <a href="https://en.wikipedia.org/wiki/Site_reliability_engineering" target="_blank">Wikipedia</a>: Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.</p>

<p>Over the past year, Acquia has built its own SRE team to help our products and services scale with the demand of our growing number of customers. We want to share our experience so that others can do the same and reap the rewards.</p>

<p>This presentation will discuss how the SRE team came about at Acquia, what we have achieved so far, and the lessons we have learned along the way. We will then walk through the steps for introducing SRE to your workplace so you can deliver more reliable and scalable services to your customers! We will specifically cover:</p>
<ul>
<li>SRE's basic concepts and history from Google</li>
<li>The management support you will need to get started</li>
<li>Introducing the idea of service level objectives and error budgets</li>
<li>Operational Responsibility Assessments as a tool to measure risk</li>
<li>Creating a Launch Readiness Checklist to standardize and improve product launches</li>
<li>Finding ideal candidates for your SRE team</li></ul>

<p>The intended audience is software engineers, system administrators, and managers who want to improve how they do their work and how their products and services perform.</p>

Published in: Technology

  1. 1. Building Site Reliability Engineering: A Crash Course Amin Astaneh, Acquia Inc.
  2. 2. Who am I? ● Senior Manager, SRE at Acquia ● Was in Operations Team from Dec 2010 - Nov 2015 ● Built and Led the Site Reliability Engineering Team
  3. 3. Agenda ● What is SRE? ● Why Do SRE? ● Acquia, Pre-SRE ● How Acquia Does SRE ● Building an SRE Competency ● How to Hire SREs? ● 1-Year Retrospective
  4. 4. What is SRE?
  5. 5. What is SRE? “What happens when a software engineer is tasked with what used to be called operations.” - Ben Treynor, Google
  6. 6. What is SRE? SRE takes the manual processes associated with Operations..
  7. 7. What is SRE? ..and replaces them with automation using software engineering.
  8. 8. What is SRE? They also use a set of methodologies and best practices that help engineering teams create a mature and sustainable process for service ownership.
  9. 9. How Does This Relate to DevOps? DevOps is a set of values, tools, and processes that allow teams to best deliver value to the customer. Therefore, SRE can be considered a specific implementation of DevOps.
  10. 10. SRE Practices (according to Google)
  11. 11. 1) Hire only coders.
  12. 12. 2) Have SLO(s) for your service.
  13. 13. What are SLOs? ● SLIs: Service Level Indicators (What to Measure) ● SLOs: Service Level Objectives (Targets for Measurements) ● SLAs: Service Level Agreements (Consequences for Missing Targets)
  14. 14. 3) Measure and report performance against the SLO(s).
  15. 15. 4) Use Error Budgets and gate launches on them.
  16. 16. 5) Have a common staffing pool for SRE and developers.
  17. 17. 6) Cap SRE operational load at 50%.
  18. 18. 7) Have excess Ops work overflow to the Dev Team.
  19. 19. 8) Share 5% of Ops work with the Dev Team.
  20. 20. 9) Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
  21. 21. 10) Aim for a maximum of two events per oncall shift.
  22. 22. 11) Do a postmortem for every event.
  23. 23. 12) Postmortems are blameless and focus on process and technology, not people.
  24. 24. Why Do SRE?
  25. 25. Scale
  26. 26. Improve Employees’ Quality of Life
  27. 27. REDUCE COST
  28. 28. Acquia, Pre-SRE
  29. 29. Things We Tried First ● Implemented Kanban for Ops to make work visible and maximize throughput ● Did ‘Tier 2 Sprints’ to build automation for the team ● Generated team metrics to influence decision-making “People Metrics: How to Use Team Data to Produce Positive Change” https://events.drupal.org/dublin2016/sessions/people-metrics
  30. 30. How Acquia Does SRE
  31. 31. How Acquia Does SRE Acquia SRE was commissioned as the driving force of our DevOps Initiative, which has the following core values: ● Eliminate Toil ● No Capes ● Deliver With Empathy ● Own Your Service ● Own Your Business ● Own Customer Success
  32. 32. Acquia SRE vs Google SRE ● We embed engineers on teams, rather than build teams that run services on behalf of engineers ● The entire engineering team (plus the SRE) is expected to ‘own their service’, with the SRE providing leadership on how to best handle those responsibilities ● The SRE identifies risk as part of their day-to-day and brings improvement opportunities directly to the Product Manager for prioritization
  33. 33. Acquia SRE vs Google SRE ● We evaluate with Engineering and Product what the most critical projects are on a quarterly basis, and allocate the team to best meet the present need ● We still reserve the right to remove engineers if an engagement becomes untenable, though it has not yet been necessary ● We have a heavy focus on time tracking to aid in toil reduction
  34. 34. 8) Share 5% of Ops work with the Dev Team.
  35. 35. 8) Share 5% of Ops work with the Dev Team.
  36. 36. 8) Ops work IS the responsibility of the Dev Team.
  37. 37. Building An SRE Competency
  38. 38. Get Management Buy-In
  39. 39. SRE Won’t Work Without Two Things ● Authority to stop releases when the error budget has been exhausted ● Authority to overflow operational work to the dev team when operational load > 50% This authority must be granted by the leaders of the engineering and product efforts. DO NOT CONTINUE UNLESS YOU HAVE THESE!
  40. 40. How Do You Get Buy-In?
  41. 41. Establish a Sense of Urgency! https://events.drupal.org/baltimore2017/sessions/%C2%A1viva-la-revoluci%C3%B3n-how-start-devops-transformation-your-workplace
  42. 42. Automatically Measure Toil
  43. 43. SRE Operational Load Dashboard
  44. 44. Operational Responsibility Assessment
  45. 45. Operational Responsibility Assessment ● Based on the Capability Maturity Model (https://en.wikipedia.org/wiki/Capability_Maturity_Model) ● Evaluates the following responsibilities: ○ Routine Tasks ○ Emergency Response ○ Monitoring and Metrics ○ Capacity Planning ○ Change Management ○ New Product Introduction and Removal ○ Service Deploy and Decommissioning ○ Performance and Efficiency ○ Information Security
  46. 46. Operational Responsibility Assessment Each responsibility is scored from 1-5: 1. Initial: Chaotic. Undocumented, ad-hoc, and require individual heroics. 2. Repeatable: Documented sufficiently so they can be repeated with the same results. 3. Defined: Roles and responsibilities for the process are defined and confirmed. 4. Managed: The process is quantitatively managed in accordance with agreed-upon metrics. 5. Optimizing: Process management includes deliberate process
  47. 47. Operational Responsibility Assessment ● Assess your services often! (we suggest quarterly) ● Take findings/risks and create tasks for improvement ● Publish your results and share them with your organization ● Do not tie ORA results to KPIs, incentives, etc
  48. 48. READ APPENDIX A!
  49. 49. Blameless Post Mortems
  50. 50. Blameless Post Mortems ● Document timeline of the incident ● With the team, determine: ○ What went well ○ What didn’t go well (process failures, technical root cause) ○ What was lucky (or circumstantial) ● For each thing that didn’t go well or was circumstantial: ○ File an action item to address it ○ Make sure they have clear acceptance criteria/requirements (grooming) ○ Make sure they have a clear level of effort (sizing) ○ Prioritize in the backlog based on relative risk ● Openly share the post-mortem with the rest of the company ● Review with the team periodically
  51. 51. Launch Readiness Criteria
  52. 52. What is Launch Readiness Criteria? ● A set of guidelines that represent the minimum standard of what a new product launch requires from an operational standpoint ● Expressed in terms of the Operational Responsibility Assessment ● Intended to address the major forms of risk without introducing needless roadblocks into the product launch process ● A living document that is continuously maintained and kept relevant ● Inspired by: https://landing.google.com/sre/book/chapters/reliable-product-launches.html
  53. 53. Example LRC Checklist Items
  54. 54. LRC Enablement
  55. 55. Example Service Pages
  56. 56. Example Service Dashboard
  57. 57. Example Code
  58. 58. Example Operational Runbooks
  59. 59. Example Post Mortem/RCA Template
  60. 60. Create an Onboarding Process
  61. 61. Create an Onboarding Process ● Implement an Incident Response Process ○ On-Call Rotation ○ Documentation for stakeholders on how to get help ○ Fundamentals: production access credentials, runbooks ● Perform/Publish an Operational Responsibility Assessment ● Define/Publish Service Level Objectives ● Create Monitoring/Alerting against SLOs ● Create Dashboards For SLO performance and remaining error budget
  62. 62. Weekly Office Hours
  63. 63. How To Hire SREs?
  64. 64. Hire Software Developers
  65. 65. Hire Software Developers
  66. 66. Hire Operations People
  67. 67. Hire Operations People
  68. 68. What Makes a Good SRE? ● It’s complicated ● You want someone with the ability to contribute to a software engineering project.. ● Yet is motivated by operational concerns and understands the subject matter (Linux, TCP/IP, monitoring, performance, config management..) ● Is willing to be on-call ● Knowledge of agile practices as a method to suggest improvements ● ‘SRE Temperament’: can communicate their opinions on something in a way that is persuasive and data-driven
  69. 69. Selling Points for Prospective SREs ● Toil is capped at 50%, which means 50%+ project work at all times! ● Authority to stop the flow of releases when the service is too unreliable ● There is oncall, but responsibility is shared with the whole team ● Root causes of outages are tracked, prioritized, and addressed These Create A Work Environment That Respects The SRE
  70. 70. 1 Year Retrospective
  71. 71. What Went Well
  72. 72. What Went Well ● Launch Readiness Criteria is now a corporate standard ● Teams are independently performing their own blameless post mortems ● Teams are independently performing their own ORAs ● SRE influenced a grassroots reorg of Cloud Engineering around SOA ● More and more teams are taking an active role in on-call responsibilities ● Weekly Office Hours has been an effective tool for sharing ideas
  73. 73. What Didn’t Go Well
  74. 74. What Didn’t Go Well ● We struggled with getting SLOs and error budgets established for all services ● We didn’t get Launch Readiness out the door fast enough for new services
  75. 75. Current Improvements
  76. 76. Current Improvements ● SRE engagements now require the onboarding process before any other work can take place: ○ Establish Incident Response Process ○ Perform Operational Responsibility Assessment ○ Define Service Level Objectives ○ Establish Monitoring and Alerting Against SLOs ○ Create Dashboards Displaying SLOs and Error Budgets ● Operational Stories are required to be prioritized in proportion to the SRE presence on an engineering team.
  77. 77. “When we were in Ops, it was simple, because our purpose was to simply address the incident. Our purpose now is to address the problems of the business. We are the vehicle of change. That’s hard work, but we can do it.”
  78. 78. Questions?
  79. 79. Amin Astaneh T: @aastaneh M: amin.astaneh@acquia.com
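
Slides 13 to 15 and slide 39 describe SLOs, error budgets, and the authority to stop releases once the budget is exhausted. The following is a minimal Python sketch of that arithmetic, assuming a request-based availability SLI. The names `SloWindow`, `error_budget`, `budget_remaining`, and `launch_allowed`, as well as the 99.9% target and the request counts, are illustrative assumptions and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget(slo_target: float, window: SloWindow) -> int:
    """Number of failed requests the SLO tolerates in this window."""
    return int(round((1.0 - slo_target) * window.total_requests))


def budget_remaining(slo_target: float, window: SloWindow) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    budget = error_budget(slo_target, window)
    if budget == 0:
        # A zero budget tolerates no failures at all.
        return 0.0 if window.failed_requests else 1.0
    return 1.0 - window.failed_requests / budget


def launch_allowed(slo_target: float, window: SloWindow) -> bool:
    """Gate a release: allow it only while some error budget is left."""
    return budget_remaining(slo_target, window) > 0.0


if __name__ == "__main__":
    # Assumed example: a 99.9% availability target over one million requests.
    healthy = SloWindow(total_requests=1_000_000, failed_requests=400)
    exhausted = SloWindow(total_requests=1_000_000, failed_requests=1_200)
    for window in (healthy, exhausted):
        print(window, budget_remaining(0.999, window), launch_allowed(0.999, window))
```

Returning the remaining budget as a fraction rather than a boolean keeps the same function usable for the dashboards mentioned on slide 61, while `launch_allowed` is the piece a release pipeline would consult.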

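
Slides 45 to 47 describe scoring each operational responsibility from 1 to 5 and turning the findings into improvement tasks. One way to record an assessment and pick out the weakest areas is sketched below in Python. The responsibility names and level names are taken from the slides; the `improvement_candidates` helper and the default target level of "Defined" are assumptions made for illustration, not part of Acquia's process.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The five maturity levels listed on slide 46."""
    INITIAL = 1
    REPEATABLE = 2
    DEFINED = 3
    MANAGED = 4
    OPTIMIZING = 5


# The responsibilities evaluated by the assessment, as listed on slide 45.
RESPONSIBILITIES = (
    "Routine Tasks",
    "Emergency Response",
    "Monitoring and Metrics",
    "Capacity Planning",
    "Change Management",
    "New Product Introduction and Removal",
    "Service Deploy and Decommissioning",
    "Performance and Efficiency",
    "Information Security",
)


def improvement_candidates(
    scores: dict[str, Maturity],
    target: Maturity = Maturity.DEFINED,  # assumed threshold, not from the talk
) -> list[str]:
    """Return responsibilities scored below the target, weakest first."""
    missing = [name for name in RESPONSIBILITIES if name not in scores]
    if missing:
        raise ValueError(f"unscored responsibilities: {missing}")
    below = [name for name in RESPONSIBILITIES if scores[name] < target]
    # sorted() is stable, so ties keep the order used on the slide.
    return sorted(below, key=lambda name: scores[name])


if __name__ == "__main__":
    # Assumed example scores for a single service.
    scores = {name: Maturity.DEFINED for name in RESPONSIBILITIES}
    scores["Capacity Planning"] = Maturity.INITIAL
    scores["Emergency Response"] = Maturity.REPEATABLE
    for name in improvement_candidates(scores):
        print(f"{name}: {scores[name].name}")
```

The helper refuses partial assessments on purpose, so that a quarterly result always covers every responsibility before it is published.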