SITE RELIABILITY
ENGINEERING
FOR GROWING ORGANIZATIONS
My company in 20 seconds
• End-to-end payments platform
• API-First
• Docker, C#, ASP.NET, Java, PowerShell, SQL
• #31 Nilson top merchant acquirer
• Inc. 5000 fastest growing company
• STL Top Place to Work
• Sound fun? It is. Come see me.
WHAT IS SRE?
• “Ops, if everything is treated as a software problem”
• Typically experienced software devs with a passion for automation & infrastructure
• Sort of like DevOps, but with more of a focus on production automation, resiliency and scalability
• Google wrote the book on it (Site Reliability Engineering), and the approach is being adopted and explored by many companies
• SRE for Google won't be SRE for your team!
GROWING COMPANY PROBLEMS
EXPECTATIONS OFTEN OUTPACE CAPACITY
• If you don't set a service level expectation, expectations will form around 100% uptime
• Keeping everything running as-is gets treated as a sunk cost
• Code atrophies or gets frozen, but the business keeps changing
GROWING COMPANY PROBLEMS
FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
• How many more users can we support at our current growth rate?
• If you buy X, will that let us scale?
• If we agree to buy X, can we wait until next year's budget?
These are very hard questions to answer without data and documented expectations for performance & uptime.
GROWING COMPANY PROBLEMS
COMPLEXITY IS EVER INCREASING
You will need more automation to keep it running, not more people
• People are an ongoing cost; automation is a capitalizable investment
• With a bigger customer base, five-minute outages become damaging experiences. Machines can react faster.
• 4 nines (99.99%) is < 5 minutes of downtime per month. How quickly can you triage an alert?
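The "4 nines" arithmetic is easy to check. The sketch below is illustrative only: the function name `allowed_downtime_minutes` and the 30-day month are assumptions, not something from the talk.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted in the period at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60  # assumes a 30-day month by default
    return period_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    # Compare two, three and four nines over the same 30-day period.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

At 99.99% the 30-day allowance works out to roughly 4.3 minutes, which leaves very little room for a human to notice, triage and fix an alert.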
WHY YOU WANT A DEDICATED SRE TEAM
• Improve reaction time to incidents
‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and improving RTO (recovery time objective)
• Learn from mistakes, turn them into opportunities
‒ SRE teams can focus on blameless postmortems, extracting as much marrow as possible from incidents, then being a champion for change
• Raise awareness of system behavior, weaknesses & strengths
‒ SRE can be an independent consulting agency or PR firm for dev teams
‒ SRE will create and/or publicize metrics to show facts
• Bandwidth dedicated to forward-looking system behaviors
‒ Without a dedicated team, this is usually done only as time permits (which is limited when companies grow fast)
SOUNDS GOOD! HOW IS IT DONE?
AUTOMATED MONITORING
‒ Monitor externally, the way your customers see you, AND internally, the way you see yourself
‒ There will be false alarms, so not everyone should see these alerts
SOUNDS GOOD! HOW IS IT DONE?
LOG INDEXING AND AGGREGATION
SOUNDS GOOD! HOW IS IT DONE?
• Build self-healing systems when we can
‒ Service health checks & automated recovery actions
‒ Desired state configuration
‒ Service orchestration
• Document procedures/playbooks/runbooks when we can't
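One way to picture "health check plus automated recovery action, runbook when that fails" is the loop below. It is a minimal sketch under assumptions: `heal`, `FakeService` and the three return values are hypothetical names invented for illustration, and a real recovery action would restart a service or re-apply configuration rather than decrement a counter.

```python
from typing import Callable


def heal(check: Callable[[], bool], recover: Callable[[], None], max_attempts: int = 3) -> str:
    """Run a health check; on failure, retry a recovery action a bounded number of times.

    Returns "healthy", "recovered", or "escalate" (hand off to a human with a runbook).
    """
    if check():
        return "healthy"
    for _ in range(max_attempts):
        recover()
        if check():
            return "recovered"
    return "escalate"


class FakeService:
    """Hypothetical stand-in for a real service: healthy after N restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed

    def is_healthy(self) -> bool:
        return self.restarts_needed <= 0

    def restart(self) -> None:
        self.restarts_needed -= 1


if __name__ == "__main__":
    svc = FakeService(restarts_needed=2)
    print(heal(svc.is_healthy, svc.restart))
```

Bounding the number of attempts matters: an unbounded restart loop can hide a real problem, so the sketch escalates to people once automation has had its chance.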
HEALTH CHECKS
• More than just a socket connection
‒ Does a typical request return a 200 OK?
‒ How many 2xx/3xx responses vs. 4xx/5xx?
‒ Can you connect to your downstream dependencies?
‒ How long have you been up?
• Provide rich info, but quickly
‒ Other endpoints can give more expensive info
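The questions above map naturally onto a small health report. The sketch below is one possible shape, not a prescribed one: `HealthTracker`, its field names, the 5% error threshold and the 200/503 status choice are all assumptions made for illustration. Dependency probes are passed in as plain callables so the example stays free of any web framework.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class HealthTracker:
    started_at: float = field(default_factory=time.monotonic)
    ok_responses: int = 0     # 2xx and 3xx
    error_responses: int = 0  # 4xx and 5xx

    def record(self, status_code: int) -> None:
        """Count a served response by status class."""
        if 200 <= status_code < 400:
            self.ok_responses += 1
        elif 400 <= status_code < 600:
            self.error_responses += 1

    def report(self, dependencies: Dict[str, Callable[[], bool]],
               max_error_ratio: float = 0.05) -> dict:
        """Build a cheap health summary; a failing or raising probe counts as down."""
        total = self.ok_responses + self.error_responses
        error_ratio = self.error_responses / total if total else 0.0
        deps = {}
        for name, probe in dependencies.items():
            try:
                deps[name] = bool(probe())
            except Exception:
                deps[name] = False
        healthy = error_ratio <= max_error_ratio and all(deps.values())
        return {
            "status": 200 if healthy else 503,
            "uptime_seconds": time.monotonic() - self.started_at,
            "error_ratio": error_ratio,
            "dependencies": deps,
        }
```

Keeping the probes cheap is the point of "rich info, but quickly"; anything slow belongs on a separate, more detailed endpoint.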
SERVICE LEVELS
• SLOs – Service Level Objectives
‒ Where you’d like to be
• SLAs – Service Level Agreements
‒ Where you tell your customers you’ll be
‒ Carry penalties when missed
‒ Looser than your SLOs
• Error Budgets
‒ Based on your SLO, how much risk can you tolerate?
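An error budget can be expressed as the number of failures the SLO permits, and tracked as a fraction still unspent. The sketch below measures the budget in requests; the function name and the request-based framing are assumptions chosen for illustration, and a time-based budget would work the same way.

```python
def error_budget_remaining(slo_pct: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1.0 - slo_pct / 100.0)
    if allowed_failures <= 0:
        # A 100% SLO or an empty measurement window leaves nothing to spend.
        raise ValueError("no error budget: SLO is 100% or there were no requests")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over one million requests allows about 1,000 failures.
    remaining = error_budget_remaining(99.9, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
```

A team might use the remaining fraction to decide how much risk to take: plenty of budget left supports riskier releases, while a nearly spent budget argues for slowing down.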
SIGNAL VS NOISE
• Alert fatigue is real. Keep your alerts actionable.
• Rare errors can be the most interesting, but error velocity is an indicator.
• Strengthen the signal-to-noise ratio to combat fatigue.
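One way to act on error velocity is to alert on the rate of errors in a sliding window rather than on each individual error. The sketch below assumes a hypothetical `ErrorVelocityAlert` class with caller-supplied timestamps; the window size and threshold are illustrative values, not recommendations.

```python
from collections import deque


class ErrorVelocityAlert:
    """Fire only when more than max_errors occur within window_seconds."""

    def __init__(self, window_seconds: float, max_errors: int) -> None:
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now` (seconds); return True if an alert should fire."""
        self._timestamps.append(now)
        cutoff = now - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


if __name__ == "__main__":
    alert = ErrorVelocityAlert(window_seconds=60, max_errors=3)
    for t in (0, 10, 20, 30):
        print(t, alert.record_error(t))
```

Isolated errors still get recorded and can be reviewed later, but nobody is paged until the rate suggests something actionable is happening.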
MY EXPERIENCES
• Team was formed from various departments
• Carried forward some SRE-related projects from dev
• Matured & documented processes
‒ Playbooks
‒ Dependencies, metrics, app catalog
• Sharing responsibility for prod incidents with operations and
dev teams
• Finding ways to consult on app design & rollout
• We are first responders, but the dev & ops teams are on call
STORIES FROM THE FIELD
5,124 HOURS
AKA CISCO FIELD NOTICE FN-64291
AUTO-IMMUNE DISORDER
AGGRESSIVE HEALTH CHECKING
STORIES FROM THE FIELD
WHAT’S THE STRANGEST PLACE YOU’VE
WORKED A PRODUCTION INCIDENT?
THANK YOU!
• Twitter: jmloeffler
• G+: jmloeffler
• GitHub: jmloeffler
