Site reliability engineering

SITE RELIABILITY
ENGINEERING
FOR GROWING ORGANIZATIONS

My company in 20 seconds
• End-to-end payments platform
• API-First
• Docker, C#, ASP.NET, Java,
PowerShell, SQL
• #31 Nilson top merchant acquirer
• Inc. 5000 fastest growing
company
• STL Top Place to Work
• Sound fun? It is. Come see me.

WHAT IS SRE?
• “Ops, if everything is treated as a
software problem”
• Typically experienced software
devs with a passion for
automation & infrastructure
• Sort of like devops, but with more
of a focus on production
automation, resiliency and
scalability
• Google wrote this book – It is
being adopted and explored by
many companies
• SRE for Google won't be SRE for
your team!

GROWING COMPANY PROBLEMS
EXPECTATIONS OFTEN OUTPACE CAPACITY
• If you don't set a service level expectation, expectations will form around 100% uptime
• Keeping everything running as-is gets treated as a sunk cost
• Code atrophies or gets frozen, but the business keeps changing

FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
• How many more users can we support at our current growth rate?
• If you buy X, will that let us scale?
• If we agree to buy X, can we wait until next year's budget?
These are very hard questions to answer without data and documented expectations for performance & uptime.
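One way to put a number behind the first question is a compound-growth projection. This is a minimal sketch, assuming you already know a capacity ceiling and that growth stays at a steady monthly rate; the function name and the figures in the usage line are hypothetical.

```python
import math


def months_until_capacity(current_users: int, capacity_users: int,
                          monthly_growth: float) -> float:
    """Months until current_users reaches capacity_users, assuming a
    constant compound monthly growth rate (0.05 means 5% per month)."""
    if current_users <= 0 or monthly_growth <= 0:
        raise ValueError("current_users and monthly_growth must be positive")
    if current_users >= capacity_users:
        return 0.0  # already at or over the ceiling
    # Solve current * (1 + g)^m = capacity for m.
    return math.log(capacity_users / current_users) / math.log(1 + monthly_growth)


# Hypothetical figures: 50k users today, load tests suggest an 80k ceiling.
print(f"{months_until_capacity(50_000, 80_000, 0.05):.1f} months of headroom")
```

The capacity ceiling is the hard part: it has to come from measurement (load tests, production metrics), which is the data the slide says you need.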

COMPLEXITY IS EVER INCREASING
You will need more automation to keep it running, not more people
• People are an ongoing cost, automation is a capitalizable investment
• With a bigger customer base, five minute outages become damaging experiences. Machines can react faster.
• 4 nines (99.99%) is < 5 minutes downtime per month. How quickly can you triage an alert?
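The "4 nines" figure can be checked with simple arithmetic. A small sketch, assuming a 30-day month; the function name is illustrative.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - availability_pct) / 100


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/month")
```

At 99.99% the budget is about 4.32 minutes per 30-day month, so a single alert that takes five minutes to triage by hand already exceeds it.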

WHY YOU WANT A DEDICATED SRE TEAM
• Improve reaction time to incidents
‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and improving RTO
• Learn from mistakes, turn them into opportunities
‒ SRE teams can focus on blameless postmortems, extracting as much marrow as possible from incidents, then championing change
• Raise awareness of system behavior, weaknesses & strengths
‒ SRE can be an independent consulting agency or PR firm for dev teams
‒ SRE will create and/or publicize metrics to show facts
• Bandwidth dedicated to forward-looking system behaviors
‒ Usually this is done as time permits (which is limited when companies grow fast).

SOUNDS GOOD! HOW IS IT DONE?
AUTOMATED MONITORING
‒ Monitor externally, the way your customers see you, AND internally, the way you see yourself
‒ There will be false alarms, so not everyone should receive these alerts

LOG INDEXING AND AGGREGATION

• Build self-healing systems when we can
‒ Service health checks & automated recovery actions
‒ Desired state configuration
‒ Service Orchestration
• Document procedures/playbooks/runbooks when we can't
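The two bullets above fit together: automate recovery where possible, and hand off to a documented runbook when automation runs out. A rough sketch of that loop follows, with the health check, recovery action and escalation passed in as hypothetical callables; the thresholds are assumptions, not recommendations.

```python
from typing import Callable


def supervise(check: Callable[[], bool],
              recover: Callable[[], None],
              escalate: Callable[[str], None],
              rounds: int,
              failure_threshold: int = 3,
              max_recoveries: int = 2) -> None:
    """Run `rounds` health checks; recover after repeated failures and
    escalate to a human once automated recovery has been used up."""
    consecutive_failures = 0
    recoveries = 0
    for _ in range(rounds):
        # A real loop would sleep between rounds; omitted to keep this short.
        if check():
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures < failure_threshold:
            continue  # tolerate brief blips before acting
        if recoveries >= max_recoveries:
            escalate("automated recovery exhausted; follow the runbook")
            return
        recover()  # e.g. restart the service or re-apply desired state
        recoveries += 1
        consecutive_failures = 0
```

Requiring several consecutive failures before acting keeps one slow response from triggering a restart, and the cap on recoveries keeps the automation from restarting a service forever while nobody is told.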

HEALTH CHECKS
• More than just a socket connection
‒ Does a typical request return a 200 OK?
‒ How many 2xx/3xx responses vs. 4xx/5xx?
‒ Can you connect to your downstream dependencies?
‒ How long have you been up?
• Provide rich info, but quickly
‒ Other endpoints can give more expensive info
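A health check along these lines can be sketched as a function that builds the response for whatever web framework serves it. This is an illustration only: the field names, the dependency probes and the response counters are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple

START_TIME = time.monotonic()  # captured once, at process start


def health_report(dependencies: Dict[str, Callable[[], bool]],
                  status_counts: Dict[int, int]) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serializable body) for a health endpoint."""
    deps = {}
    for name, check in dependencies.items():
        try:
            deps[name] = bool(check())  # each probe should be cheap and fast
        except Exception:
            deps[name] = False  # a probe that raises counts as unreachable
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    body = {
        "uptime_seconds": round(time.monotonic() - START_TIME, 1),
        "dependencies": deps,
        "requests": total,
        "error_ratio": errors / total if total else 0.0,
    }
    # 503 signals a load balancer or orchestrator to stop routing traffic here.
    return (200 if all(deps.values()) else 503), body


status, body = health_report(
    {"database": lambda: True, "queue": lambda: True},  # hypothetical probes
    {200: 950, 302: 20, 404: 25, 500: 5},               # hypothetical counters
)
print(status, body)
```

Anything slow, such as deep dependency diagnostics, belongs on a separate endpoint so this one stays fast enough to poll often.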

SERVICE LEVELS
• SLOs – Service Level Objectives
‒ Where you’d like to be
• SLAs – Service Level Agreements
‒ Where you tell your customers you’ll be
‒ Penalties
‒ More liberal than your SLOs
• Error Budgets
‒ Based on your SLO, how much risk can you tolerate?
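An error budget can be made concrete by counting requests. A minimal sketch, assuming a request-based SLO measured over some window; the function name and the numbers in the usage line are hypothetical.

```python
def error_budget(slo_pct: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize how much of a request-based error budget has been spent."""
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be between 0 and 100, exclusive")
    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = total_requests * (100 - slo_pct) / 100
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": consumed,
    }


# Hypothetical window: 99.9% SLO, one million requests, 250 failures.
budget = error_budget(99.9, 1_000_000, 250)
print(f"{budget['consumed_fraction']:.0%} of the error budget used")
```

When the consumed fraction approaches 1, the remaining risk tolerance is gone; teams commonly respond by slowing risky changes until the window rolls over.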

SIGNAL VS NOISE
• Alert fatigue is real. Keep your alerts actionable.
• Rare errors can be the most interesting, but error velocity is an indicator.
• Strengthen the signal-noise ratio to combat fatigue.
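Error velocity can be approximated by comparing the latest error count with a recent baseline, and alerting only when the jump is both large and above a noise floor. A sketch under those assumptions; the class name, window size and thresholds are illustrative, not tuned values.

```python
from collections import deque


class ErrorVelocity:
    """Flag a spike when errors per interval jump well above the recent average."""

    def __init__(self, window: int = 5, factor: float = 3.0, min_errors: int = 10):
        self.window = window
        self.factor = factor          # how many times the baseline counts as a spike
        self.min_errors = min_errors  # noise floor: ignore spikes this small
        self.history = deque(maxlen=window)  # counts from previous intervals

    def observe(self, errors: int) -> bool:
        """Record one interval's error count; return True if it should alert."""
        have_baseline = len(self.history) == self.window
        baseline = sum(self.history) / self.window if have_baseline else 0.0
        self.history.append(errors)
        if not have_baseline:
            return False  # not enough data yet to judge
        return errors >= self.min_errors and errors > self.factor * max(baseline, 1.0)


detector = ErrorVelocity(window=3)
print([detector.observe(n) for n in (2, 3, 2, 4, 40)])
```

The noise floor is what keeps this actionable: a jump from one error to four is a fourfold increase, but it is rarely worth paging anyone for.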

MY EXPERIENCES
• Team was formed from various departments
• Carried forward some SRE-related projects from dev
• Matured & documented processes
‒ Playbooks
‒ Dependencies, metrics, app catalog
• Sharing responsibility for prod incidents with operations and
dev teams
• Finding ways to consult on app design & rollout
• We are first responders, but the dev & ops teams are on call

5,124 HOURS
AKA CISCO FIELD NOTICE FN-64291

AUTO-IMMUNE DISORDER
AGGRESSIVE HEALTH CHECKING

STORIES FROM THE FIELD
WHAT’S THE STRANGEST PLACE YOU’VE
WORKED A PRODUCTION INCIDENT?

THANK YOU!
• Twitter: jmloeffler
• G+: jmloeffler
• Github: jmloeffler
