SRE in Apiary
CZJUG 21.5.2018
Ladislav Prskavec
@abtris
1
What is SRE?
2
"What happens when a software engineer is tasked with what used to
be called operations."
» Ben Treynor Sloss, Vice President, Google Engineering,
founder of Google SRE
3
SRE implements DevOps
— Google Cloud
4
Apiary in numbers
» Apiary users: 336,786
» Apiary API projects: 440,178
» Apiary engineers: 19
» Apiary platform engineers: 10 + 4
» Apiary SREs: 4
» App deploys: 15 (per week)
» Parsing service invocations [1 day]: ~200k
» CI build: ~19 min (8 parallel workers)
5
How we started with
SRE team
6
2014 - 2 people
software developer and ops guy
7
2015-2017 - 3 people
software developers
8
2018 - 4 people
2 seniors, 2 juniors
9
Culture
Process > Tools
10
No team separation
» bounded context, but ...
» Shared ownership of platform - shared responsibility
» Shared tooling (debug, deploy, monitor)
» Shared codebase
» Brainstorm
» Motivation for good design (monitoring, future debugging)
11
Things break
» They do - better be ready
» Knowing when there's a problem (logs, metrics, alerting)
» Having someone there - being oncall
» Responding (mitigation, resolution)
» Learning from it (postmortems)
12
Measure everything
» No gut feeling when we have the data (app metrics, runtime
metrics)
» Both production and non-production systems (e.g. our CI test
time)
» Thresholds, automated alerting
» Visualize the data (oncall dashboard, happiness dashboard)
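The "thresholds, automated alerting" idea can be sketched roughly as follows. This is a minimal illustration, not Apiary's actual setup: the metric name `ci_build_minutes`, the 25-minute limit, and the "three consecutive samples" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Threshold:
    metric: str      # hypothetical metric name
    limit: float     # alert when values go above this
    sustained: int   # consecutive breaching samples required before alerting


def check(threshold: Threshold, samples: Sequence[float]) -> Optional[str]:
    """Return an alert message if the last `sustained` samples all exceed the limit."""
    recent = samples[-threshold.sustained:]
    if len(recent) < threshold.sustained:
        return None  # not enough data yet
    if all(value > threshold.limit for value in recent):
        return f"ALERT {threshold.metric}: last {len(recent)} samples above {threshold.limit}"
    return None


# Assumed example: CI build time in minutes (non-production systems count too).
ci_build = Threshold(metric="ci_build_minutes", limit=25.0, sustained=3)
print(check(ci_build, [19.2, 18.7, 26.1, 27.4, 28.0]))  # sustained breach
print(check(ci_build, [19.2, 26.1, 18.9, 27.4, 28.0]))  # one recent sample is below the limit
```

Requiring several consecutive breaches is one simple way to avoid paging on a single noisy sample.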
13
14
15
Gradual changes
» Delivery vs deploy
» Continuous Integration / Continuous Delivery (CI/CD)
» Automated testing within CI
» Testing environments (similar to production)
» Short iterations, fast rollbacks
» No-downtime deploy & immutable
» Rolling delivery
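A rolling delivery with a fast rollback might look like the sketch below. It is a simulation only: the instance names, version strings, and the `healthy` callback are hypothetical, and a real deploy would call the platform's own tooling instead of mutating a dictionary.

```python
from typing import Callable, Dict


def rolling_deploy(
    instances: Dict[str, str],
    new_version: str,
    healthy: Callable[[str, str], bool],
) -> bool:
    """Update instances one at a time; on the first failed health check,
    restore every instance touched so far and report failure."""
    previous: Dict[str, str] = {}
    for name in list(instances):
        previous[name] = instances[name]
        instances[name] = new_version
        if not healthy(name, new_version):
            # Fast rollback: put the old version back on everything we changed.
            for done, old_version in previous.items():
                instances[done] = old_version
            return False
    return True


# Hypothetical fleet where the new version fails its health check on web-2.
fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
ok = rolling_deploy(fleet, "v2", lambda name, version: name != "web-2")
print(ok, fleet)
```

Because only one instance changes at a time, the rest of the fleet keeps serving traffic while the change is verified.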
16
Tooling & automation
» oncall logistics
» schedules
» escalations
» alerting
» conflicts
» documentation
» runbooks
» internal processes
» domain dictionary
17
Reason 1: Decreasing chances of errors
» Source and great post: http://www.devops.ch/2017/05/10/devops-explained/
18
Reason 2: Eliminating toil, work that is:
» Repetitive
» Automatable
» Doesn't provide enduring value
» Scales linearly with service
» Compounds significantly and surprisingly
19
Reason 3: Focusing on creative
engineering work that:
» Improves reliability
» Improves performance & stability of systems
» Ensures scalability
» Reduces toil
» Is fun: improves morale, speeds up progress, allows skill
development
20
Incidents
Types:
» Low-priority incident
» High-priority incident
» Security incident
Both production and non-production systems
21
Being oncall
» Shared among developers (roles, not individuals; increases the bus factor)
» Responsible for the platform
» Safety net - you know who to call
» Runbooks - you know what to do
» Early alerting - proactively investigate
22
Incident response
If critical: Incident commander role
Separate roles, if necessary:
» outbound and inbound communication
» root cause analysis
» issue mitigation
Tracking time (incident ack expiration) and keeping track
Tooling (alerts, paging, postmortem reminders)
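One possible reading of "incident ack expiration" is that an acknowledgement only silences paging for a limited time, after which an unresolved incident pages again. The sketch below illustrates that reading; the `Incident` fields, the `needs_page` helper, and the 30-minute window are assumptions, not a description of the tooling used at Apiary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    opened_at: datetime
    acked_at: Optional[datetime] = None
    resolved: bool = False


def needs_page(incident: Incident, now: datetime, ack_ttl: timedelta) -> bool:
    """Page when the incident is open and either unacknowledged
    or its acknowledgement is older than `ack_ttl`."""
    if incident.resolved:
        return False
    if incident.acked_at is None:
        return True
    return now - incident.acked_at >= ack_ttl


opened = datetime(2018, 5, 21, 10, 0)
ttl = timedelta(minutes=30)  # assumed expiration window
incident = Incident(opened_at=opened, acked_at=opened + timedelta(minutes=2))
print(needs_page(incident, opened + timedelta(minutes=10), ttl))  # ack still valid
print(needs_page(incident, opened + timedelta(minutes=40), ttl))  # ack has expired
```

An expiring acknowledgement keeps a long-running incident from being silently forgotten after the first responder acks it.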
23
Postmortems
» Root cause
» Lessons learned
» Actionable items
» Prevent future issues
» Create runbooks
» Blameless
» Generated reminders
24
Incident reviews
» Weekly, team lead sync
» Reviewing past incidents - types, occurrence, actionability
» Discuss improvements
» Incident fatigue prevention
25
Summary
26
Summary
» Culture is more important than process!
» Start early and work on improvements!
» Product owner for SRE work is a useful role!
27
References
» SRE vs. DevOps: competing standards or close friends?
» SRE Weekly
» Awesome Site Reliability Engineering
Books
» SRE book
» Seeking SRE
28
