SRE in startup
Zonky 17.1.2017
Ladislav Prskavec, Apiary
ladislav@apiary.io
@abtris
1
What is SRE?
2
"What happens when a software engineer is tasked with what used to
be called operations."
» Ben Treynor Sloss, Vice President, Google Engineering,
founder of Google SRE
3
"Our work is like being part of the world's most intense pit crew. We
change the tires of a race car as it's going 100 mph."
» Andrew Widdowson, Site Reliability Engineer, Mountain View
4
In general, an SRE team is responsible for:
» availability
» latency
» performance
» efficiency
» change management
» monitoring
» emergency response
» capacity planning
5
6
If the team agrees on a 99.9% SLA,
that gives them an error budget of
0.1%.
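To make the arithmetic concrete, here is a minimal sketch that converts an availability target into an error budget and into allowed downtime. The function name `error_budget` and the 30-day window are assumptions for illustration; the slides only state the 99.9% / 0.1% relationship.

```python
# Hypothetical helper (not from the talk): availability target -> error budget.
def error_budget(slo_percent: float, window_days: int = 30) -> tuple[float, float]:
    """Return (budget as a fraction, allowed downtime in minutes) for the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    window_minutes = window_days * 24 * 60  # assumed 30-day measurement window
    return budget_fraction, budget_fraction * window_minutes


fraction, minutes = error_budget(99.9)
print(f"budget: {fraction:.3%}, downtime allowed in 30 days: {minutes:.1f} min")
# prints: budget: 0.100%, downtime allowed in 30 days: 43.2 min
```

With a 99.9% target, the 0.1% budget works out to roughly 43 minutes of full downtime per 30 days under these assumptions.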
7
8
Rule
If the service is within SLA, launch away
- clearly the Dev team is doing a good job
If the service is not within SLA, launch freeze
- until you earn back enough error budget
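The rule above can be sketched as a simple launch gate. This is only an illustration: the function name `can_launch` and the request-count way of measuring availability are assumptions, not something the talk prescribes.

```python
# Hypothetical launch gate (illustrative only): launches are allowed while
# the number of failed requests stays inside the error budget.
def can_launch(slo_percent: float, good_requests: int, total_requests: int) -> bool:
    """Return True while measured failures are within the agreed budget."""
    if total_requests == 0:
        return True  # assumption: no traffic means no budget has been spent
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    actual_failures = total_requests - good_requests
    return actual_failures <= allowed_failures


# 500 failures against a budget of about 1000: within SLA, launch away
print(can_launch(99.9, good_requests=999_500, total_requests=1_000_000))  # True
# 2000 failures against a budget of about 1000: launch freeze
print(can_launch(99.9, good_requests=998_000, total_requests=1_000_000))  # False
```

The point of the design is that the gate is a number both teams agreed on in advance, so the decision to freeze is not a negotiation.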
9
Error budget
» removes the SRE vs. Dev conflict
» Dev teams self-police
10
Common staffing pool
» one more SRE = one less Dev
11
SRE hires only coders
» they get bored easily
» they speak the same language as Dev
12
50% cap on ops work
» if you succeed, ops work scales with traffic
» coding reduces the work / traffic ratio
13
Keep Dev in rotation
» 5% of ops work handled by Devs
14
Speaking of Dev and Ops work
» excess operations load (tickets, oncall, etc.)
15
SRE portability
» no requirement to stick with a project or with SRE
16
Outages
» minimize impact
» prevent recurrence
17
Minimize damage
» no NOC
» good diagnostic information
» practice, practice, practice
18
Prevent recurrence
1. Handle event
2. Write post-mortems
3. Reset
19
Post-mortem philosophy
» blameless, focus on process and technology
» create a timeline
» get all the facts
» create bugs for all follow-up work
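As one possible way to keep these points in front of the author, a post-mortem can be modeled as a small record with a timeline, a list of facts, and follow-up bugs. The class and field names below are hypothetical, and the sample data is made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    what: str  # what the system or process did; blameless, so no names


@dataclass
class PostMortem:
    title: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)
    followup_bugs: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True once there is a timeline, at least one fact, and at least one bug."""
        return bool(self.timeline and self.facts and self.followup_bugs)


pm = PostMortem(title="Example outage (made-up data)")
pm.timeline.append(TimelineEntry(datetime(2017, 1, 17, 10, 0), "error rate alert fired"))
pm.facts.append("deploy changed a connection pool setting")
print(pm.is_complete())  # False: no follow-up bug filed yet
pm.followup_bugs.append("BUG-1 (hypothetical): add canary check for pool settings")
print(pm.is_complete())  # True
```

The check is deliberately weak: it only confirms that each section is non-empty, not that every piece of follow-up work has a bug.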
20
What is specific about SRE
in a startup?
21
1:10
22
Horizontal team
23
SaaS oriented
24
Oncall culture
25
It's cool work
26
SRE book
27
"May the Queries Flow,
And the Pagers Remain Silent"
SRE Benediction
28
