SRE in Apiary

SRE/Devops meetup in CZJUG and panel discussion.

Published in: Technology

  1. SRE in Apiary, CZJUG, 21.5.2018, Ladislav Prskavec (@abtris)
  2. What is SRE?
  3. "What happens when a software engineer is tasked with what used to be called operations." » Ben Treynor Sloss, Vice President, Google Engineering, founder of Google SRE
  4. SRE implements DevOps » Google Cloud
  5. Apiary in numbers » Apiary users: 336,786 » Apiary API projects: 440,178 » Apiary engineers: 19 » Apiary platform engineers: 10 + 4 » Apiary SREs: 4 » App deploys: 15 (per week) » Parsing service invocations: ~200k (per day) » CI build: ~19 min (8 parallel workers)
  6. How we started with the SRE team
  7. 2014 - 2 people: a software developer and an ops guy
  8. 2015-2017 - 3 people: software developers
  9. 2018 - 4 people: 2 seniors, 2 juniors
  10. Culture Process > Tools
  11. No team separation » Bounded context, but ... » Shared ownership of platform - shared responsibility » Shared tooling (debug, deploy, monitor) » Shared codebase » Brainstorm » Motivation for good design (monitoring, future debugging)
  12. Things break » They do - better be ready » Knowing when there's a problem (logs, metrics, alerting) » Having someone there - being oncall » Responding (mitigation, resolution) » Learning from it (postmortems)
  13. Measure everything » No gut feeling when we have the data (app metrics, runtime metrics) » Both production and non-production systems (e.g. our CI test time) » Thresholds, automated alerting » Visualize the data (oncall dashboard, happiness dashboard)
  14. (no text on slide)
  15. (no text on slide)
  16. Gradual changes » Delivery vs deploy » Continuous Integration / Continuous Delivery (CI/CD) » Automated testing within CI » Testing environments (similar to production) » Short iterations, fast rollbacks » No-downtime deploy & immutable » Rolling delivery
  17. Tooling & automation » oncall logistics » schedules » escalations » alerting » conflicts » documentation » runbooks » internal processes » domain dictionary
  18. Reason 1: Decreasing chances of errors » Source and great post: http://www.devops.ch/2017/05/10/devops-explained/
  19. Reason 2: Eliminating toil, work that is: » Repetitive » Automatable » Doesn't provide enduring value » Scales linearly with service » Compounds significantly and surprisingly
  20. Reason 3: Focusing on creative engineering work that: » Improves reliability » Improves performance & stability of systems » Ensures scalability » Reduces toil » Is fun: improves morale, speeds up progress, allows skill development
  21. Incident types: » Low-priority incident » High-priority incident » Security incident » Both production and non-production systems
  22. Being oncall » Shared among developers (roles, not individuals, increase bus factor) » Responsible for the platform » Safety net - you know who to call » Runbooks - you know what to do » Early alerting - proactively investigate
  23. Incident response » If critical: incident commander role » Separate roles, if necessary: outbound and inbound communication, root cause analysis, issue mitigation » Tracking time (incident ack expiration) and keeping track » Tooling (alerts, paging, postmortem reminders)
  24. Postmortems » Root cause » Lessons learned » Actionable items » Prevent future issues » Create runbooks » Blameless » Generated reminders
  25. Incident reviews » Weekly, team lead sync » Reviewing past incidents - types, occurrence, actionability » Discuss improvements » Incident fatigue prevention
  26. Summary
  27. Summary » Culture is more important than process! » Start early and work on improvements! » A product owner for SRE work is a useful role!
  28. References » SRE vs. DevOps: competing standards or close friends? » SRE Weekly » Awesome Site Reliability Engineering Books » SRE book » Seeking SRE
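
The "Measure everything" slide calls for thresholds and automated alerting, on non-production metrics such as CI test time as well as production ones. Below is a minimal sketch of that idea in Python, not Apiary's actual tooling. The metric name `ci_build_minutes`, the 25-minute limit, and the three-sample window are all assumptions made for illustration; the deck itself only reports a CI build of about 19 minutes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Threshold:
    """Alert when the recent average of a metric exceeds `limit`."""
    metric: str       # hypothetical metric name
    limit: float      # hypothetical limit, in the metric's own unit
    window: int = 3   # number of most recent samples to average


def check(threshold: Threshold, samples: Sequence[float]) -> Optional[str]:
    """Return an alert message if the threshold is breached, else None."""
    recent = samples[-threshold.window:]
    if len(recent) < threshold.window:
        return None  # not enough data yet; stay quiet rather than guess
    average = mean(recent)
    if average > threshold.limit:
        return (f"{threshold.metric}: average {average:.1f} over the last "
                f"{threshold.window} samples exceeds {threshold.limit:.1f}")
    return None


# Assumed values: a 25-minute limit on CI build time.
ci_build = Threshold(metric="ci_build_minutes", limit=25.0)

print(check(ci_build, [19.0, 18.5, 19.5]))        # healthy builds, no alert
print(check(ci_build, [19.0, 27.0, 28.0, 29.0]))  # sustained slowdown, alert
```

Averaging over a short window rather than alerting on a single sample is one way to avoid paging on a one-off slow build, which fits the deck's concern with incident fatigue. The window size is a tuning choice, not something the talk prescribes.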