SRE in Apiary

CZJUG 21.5.2018
Ladislav Prskavec
@abtris

"What happens when a software engineer is tasked with what used to
be called operations."
» Ben Treynor Sloss, Vice President, Google Engineering,
founder of Google SRE

class SRE implements DevOps
— Google Cloud

Apiary in numbers
» Apiary users: 336,786
» Apiary API projects: 440,178
» Apiary engineers: 19
» Apiary platform engineers: 10 + 4
» Apiary SREs: 4
» App deploys: 15 per week
» Parsing service invocations: ~200k per day
» CI build: ~19 min (8 parallel workers)

How we started with
SRE team

2014 - 2 people
software developer and ops guy

2015-2017 - 3 people
software developers

2018 - 4 people
2 seniors, 2 juniors

No team separation
» Bounded context, but ...
» Shared ownership of platform - shared responsibility
» Shared tooling (debug, deploy, monitor)
» Shared codebase
» Brainstorm
» Motivation for good design (monitoring, future debugging)

Things break
» They do - better be ready
» Knowing when there's a problem (logs, metrics, alerting)
» Having someone there - being oncall
» Responding (mitigation, resolution)
» Learning from it (postmortems)

Measure everything
» No gut feeling when we have the data (app metrics, runtime
metrics)
» Both production and non-production systems (e.g. our CI test
time)
» Thresholds, automated alerting
» Visualize the data (oncall dashboard, happiness dashboard)
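
The "thresholds, automated alerting" bullet can be illustrated with a minimal sketch. The rule type, the metric name `ci.build.minutes`, the 25-minute limit, and the sample values are all hypothetical, chosen only to echo the CI test time example; nothing here is Apiary's actual tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert rule: fire when the recent average exceeds `limit`."""
    metric: str
    limit: float
    window: int  # number of most recent samples to average (must be >= 1)


def breached(samples: list[float], rule: Threshold) -> bool:
    """Return True when the mean of the last `rule.window` samples is above the limit."""
    recent = samples[-rule.window:]
    if not recent:
        return False  # in this sketch, missing data is not treated as a breach
    return mean(recent) > rule.limit


# Illustrative data only: CI build durations in minutes.
ci_rule = Threshold(metric="ci.build.minutes", limit=25.0, window=3)
ci_samples = [18.5, 19.0, 19.4, 27.0, 31.0]
```

Averaging over a small window rather than alerting on a single sample is one way to avoid paging on a one-off slow build; a real setup would likely delegate this to the monitoring system.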

Gradual changes
» Delivery vs deploy
» Continuous Integration / Continuous Delivery (CI/CD)
» Automated testing within CI
» Testing environments (similar to production)
» Short iterations, fast rollbacks
» No-downtime deploy & immutable
» Rolling delivery
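
Rolling delivery combined with fast rollbacks could look roughly like the sketch below. It assumes the caller supplies `deploy`, `healthy`, and `rollback` callables; these are hypothetical hooks, not a real deployment API, and the slides do not describe how Apiary implements this.

```python
from typing import Callable, Sequence


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy host by host; on the first failed health check, roll back every touched host."""
    touched: list[str] = []
    for host in hosts:
        deploy(host)
        touched.append(host)
        if not healthy(host):
            # Undo in reverse order so the most recently changed host is restored first.
            for done in reversed(touched):
                rollback(done)
            return False
    return True
```

Because the loop stops at the first unhealthy host, a bad release only ever reaches part of the fleet, which is the point of making changes gradually.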

Tooling & automation
» Oncall logistics
  » schedules
  » escalations
  » alerting
  » conflicts
» Documentation
  » runbooks
  » internal processes
  » domain dictionary

Reason 1: Decreasing chances of errors
» Source and great post: http://www.devops.ch/2017/05/10/devops-explained/

Reason 2: Eliminating toil, work that is:
» Repetitive
» Automatable
» Doesn't provide enduring value
» Scales linearly with service
» Compounds significantly and surprisingly

Reason 3: Focusing on creative
engineering work that:
» Improves reliability
» Improves performance & stability of systems
» Ensures scalability
» Reduces toil
» Is fun: improves morale, speeds up progress, allows skill
development

Incidents
Types:
» Low-priority incident
» High-priority incident
» Security incident
Both production and non-production systems

Being oncall
» Shared among developers (roles, not individuals, increase bus factor)
» Responsible for the platform
» Safety net - you know who to call
» Runbooks - you know what to do
» Early alerting - proactively investigate

Incident response
If critical: Incident commander role
Separate roles, if necessary:
» outbound and inbound communication
» root cause analysis
» issue mitigation
Tracking time (incident ack expiration) and keeping track
Tooling (alerts, paging, postmortem reminders)
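
One possible reading of "incident ack expiration" is that an acknowledgement is only valid for a limited time, after which a still-unresolved incident pages again. The sketch below assumes that interpretation; the function name and the 30-minute default are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta


def ack_expired(
    acked_at: datetime,
    now: datetime,
    resolved: bool,
    ack_ttl: timedelta = timedelta(minutes=30),  # illustrative default
) -> bool:
    """True when an acknowledged, still-unresolved incident should page again."""
    return (not resolved) and (now - acked_at >= ack_ttl)
```

A check like this keeps an acknowledged incident from being silently forgotten while the responder is busy elsewhere.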

Postmortems
» Root cause
» Lessons learned
» Actionable items
» Prevent future issues
» Create runbooks
» Blameless
» Generated reminders
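
"Generated reminders" could be driven by a check like the following sketch, which lists resolved incidents that still have no postmortem after a grace period. The `Incident` record, its field names, and the 7-day grace period are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    ident: str
    resolved_on: date
    postmortem_on: Optional[date] = None  # None until a postmortem is written


def overdue_postmortems(
    incidents: Iterable[Incident],
    today: date,
    grace: timedelta = timedelta(days=7),  # illustrative grace period
) -> list[str]:
    """Return ids of resolved incidents still lacking a postmortem after `grace`."""
    return [
        incident.ident
        for incident in incidents
        if incident.postmortem_on is None and today - incident.resolved_on > grace
    ]
```

The returned ids could then feed whatever channel the team already uses for reminders.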

Incident reviews
» Weekly, team lead sync
» Reviewing past incidents - types, occurrence, actionability
» Discuss improvements
» Incident fatigue prevention

Summary
» Culture is more important than process!
» Start early and work on improvements!
» A product owner for SRE work is a useful role!

References
» SRE vs. DevOps: competing standards or close friends?
» SRE Weekly
» Awesome Site Reliability Engineering
Books
» SRE book
» Seeking SRE
