Love DevOps? Wait 'Til You Meet SRE

A crucial transition is taking place at Atlassian... we can feel our DNA evolving a little each day. Our focus is always on the future, and that future means rallying behind a cloud-first strategy. In doing so, we have a unique opportunity to re-imagine the way we run our services and adopt a modern approach that distributes our operations function and optimizes for scale. This talk covers the steps we've taken on that journey as we build site reliability engineering, an operations approach pioneered by industry champs like Netflix. We'll talk about the concept, how it applies at Atlassian, the wins we have achieved, and lessons you can bring back to your team.

Published in: Software

  1. Love DevOps? Wait ‘til you meet SRE! NICK WRIGHT • SRE MANAGER • ATLASSIAN
  2. Agenda: SETTING THE SCENE • SRE AND HOW IT CAN HELP • GETTING STARTED • OPS TOOLCHAIN
  3. 10+ incidents per month
  4. 100+ incidents per month
  5. 400+ incidents per month
  6. 900+ incidents per month
  7. Too much firefighting (image credit: Caters News Agency)
  8. Fixing the same thing repeatedly (image credit: America’s Funniest Home Videos)
  9. Job Satisfaction (image credit: NASA)
  10. Service Ops Application Development
  11. Agenda: SETTING THE SCENE • SRE AND HOW IT CAN HELP • GETTING STARTED • OPS TOOLCHAIN
  12. Site Reliability Engineering • Preventative: primary focus is to get away from break-fix and do work that prevents outages. • Decentralised: multiple distinct operations teams, or a You-Build-It, You-Run-It model. • Specialised: engineers focus on a single service or group of related services.
  13. SRE vs DevOps? SRE: • Operations • Incident response • Post Mortems • Monitoring, Events, Alerting • Capacity planning • Primary focus: Reliability. DevOps: • Delivery • Release automation • Environment builds • Config management • Infrastructure as code • Primary focus: Delivery Speed
  14. Solutions
  15. Balance Interrupt vs Preventative work (image credit: GravityGlue.com)
  16. Hire Devs! And have a common hiring pool
  17. Always do Post-Mortems
  18. Scrap the release meeting!
  19. Agenda: SETTING THE SCENE • SRE AND HOW IT CAN HELP • GETTING STARTED • OPS TOOLCHAIN
  20. ? ? ? ? ? ??
  21. The journey to SRE • Vision: Define how the team will work and how we measure success. • Build: Get the team up and running! • Improve: Revisit regularly - if it's not working, tweak, change, refine.
  22. Vision. Goals and Metrics: In 6 months we will: • Replace monitoring • DR Plan and Test. How we measure success? • Number of Incidents • PIR Coverage. Responsibilities: • Service list • Service Owners • Team Duties. Team Structure: Size and structure of team
  23. Team Structure: SRE, Developer Teams
  24. Build. Hiring: Get the team in place • Start Early! • Promotion Opportunities • Existing hiring pipeline. Tools: Set things up so they can work! • Last part of the talk! Training: • Bootcamps • Wheel of Misfortune!
  25. Improve: • Regular check-ins • Review decisions • Change where needed • Blog success stories!
  26. Does it work?!
  27. 100% Post Incident Review Completion Rate
  28. DR Compliance
  29. “The SRE team runs ahead of the rest of the team on reliability and encourages everyone to lift their game” ANDRE SERNA, DEV MANAGER
  30. “In the past the separate ops and dev teams would often pick the solution they were best positioned to implement. I like that our SRE team is able to pick the best solution to the problem instead.” JAMES BUNTON, DEV-ON-ROTATION
  31. Agenda: SETTING THE SCENE • SRE AND HOW IT CAN HELP • GETTING STARTED • OPS TOOLCHAIN
  32. Incident
  33. Alerts Dashboard Incident Ticket HOT room Ops room SREs Atlassians Ops JIRA Confluence Run Book
  34. Ops room Ops JIRA JQL Select Action
  35. JIRA HipChat Discussions Incident Ticket HOT room
  36. Incident Ticket Pending Fixing Reviewing Closed
  37. Incident Ticket ALL MOST FEW ONE Minor Impact Moderate Impact Severe Impact Outage
  38. Incident Ticket Detect Fail Fix Close Respond JIRA ticket
  39. Post Mortem
  40. Incident Ticket HOT room Ops room SREs Ops JIRA Confluence JIRA Actions!
  41. Confluence
  42. Confluence Actions Linked Here
  43. Incident Ticket HOT room Ops room SREs Ops JIRA Confluence JIRA Actions!
  44. Pending Fixing Reviewing Closed Draft Approval Published Completed JIRA
  45. JIRA Team 1 JIRA Team 2 Team 3 Reporting
  46. JIRA
  47. Summary atlassian.com/careers
  48. atlassian.com/help-desk
  49. Pedro Canahuati, “Scaling the Operations Organisation at Facebook”; Ben Treynor, “Keys to SRE”
  50. Thank you! NICK WRIGHT • SRE MANAGER • ATLASSIAN
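
Slides 36 and 44 list four incident ticket states: Pending, Fixing, Reviewing, Closed. As a rough illustration only, the sketch below models them as a small state machine in Python. The slides give the state names but not the transition rules, so the strictly linear order, the class names (IncidentState, IncidentTicket), and the example summary are all assumptions, not a description of Atlassian's actual JIRA workflow.

```python
from enum import Enum


class IncidentState(Enum):
    """Incident ticket states, as named on the slides."""
    PENDING = "Pending"
    FIXING = "Fixing"
    REVIEWING = "Reviewing"
    CLOSED = "Closed"


# Assumption: tickets move through the states in the order listed.
# The slides only name the states; they do not define the transitions.
_ORDER = list(IncidentState)


class IncidentTicket:
    """Hypothetical ticket that can only move forward one state at a time."""

    def __init__(self, summary: str) -> None:
        self.summary = summary
        self.state = IncidentState.PENDING

    def advance(self) -> IncidentState:
        """Move to the next state; raise if the ticket is already closed."""
        if self.state is IncidentState.CLOSED:
            raise ValueError("ticket is already closed")
        self.state = _ORDER[_ORDER.index(self.state) + 1]
        return self.state


if __name__ == "__main__":
    ticket = IncidentTicket("hypothetical: elevated login errors")
    ticket.advance()
    print(ticket.state.value)  # prints "Fixing"
```

A real workflow would probably also allow backward moves (for example, Reviewing back to Fixing when a fix does not hold); they are left out here because the slides do not mention them.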