Proprietary
Today’s speakers
Thomas Voß, Staff SRE @ Google
Measuring Reliability in Production
Step By Step SLO Creation in Cloud Operations
stackconf ’23, 2023-09-13
The Most Important Feature of Any System is its Reliability
SRE is what you get when you treat operations as a software problem.
What level of reliability do we need?
Terminology
User interacts with Service to achieve Goal
● CUJs (Critical User Journeys): Your most important user journeys
● SLIs: Metrics that describe users' experiences
● SLOs: Targets for the overall health of a service
● SLAs: Contractual obligations
Alignment throughout the Product Life Cycle
● Stages: Concept, Business, Development, Operations, Market
● Alignment through SLOs across the business process
Creating SLIs and SLOs Step By Step
Cloud Operations Sandbox
A click-to-deploy, open-source learning experience that helps practitioners understand how to use Cloud Operations tools and apply SRE practices in an isolated cloud environment with synthetic traffic that resembles real production.
● A “playground environment” to evaluate Cloud Operations as close as possible to real production
● Includes: Demo Service, One-click deployment script, Interactive walkthrough, Synthetic Load Generator, SRE Recipes, etc.
● Start here: github.com/GoogleCloudPlatform/cloud-ops-sandbox/
Online Boutique
*github.com/GoogleCloudPlatform/microservices-demo#architecture
1. SLO Process - CUJ
List out critical user journeys and order them by business impact:
Browse products, Check out, Add to cart
Ordered by business impact:
1. Check out
2. Add to cart
3. Browse products
Critical User Journeys
As a shopper, I want to purchase (check out) items in the store.
SLO Process - SLI Creation
Determine which metrics to use as service-level indicators (SLIs) to most accurately track
the user experience.
SLO Process - SLI Creation
1. SLI Type:
○ Request/response (an interaction in a user journey): availability, latency, and quality.
○ Data processing: freshness, coverage, correctness, and throughput.
○ Storage: throughput and latency.
2. SLI Specification: an assessment of the service outcome that you think matters to users
○ For availability: the proportion of valid events served successfully
○ For latency: the proportion of valid events served faster than a threshold
3. SLI Implementation: a way to measure the SLI specification
○ Includes: event + success criteria + where/how you record the SLI.
○ Measurement strategies: application-level metrics, logs processing, front-end infrastructure metrics, synthetic clients/data, client-side instrumentation
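The latency specification above ("the proportion of valid events served faster than a threshold") can be sketched in a few lines of Python. This is a minimal illustration, not the Cloud Operations implementation: the function name, the sample latencies, and the 500 ms threshold are all hypothetical.

```python
from typing import Iterable


def latency_sli(latencies_ms: Iterable[float], threshold_ms: float) -> float:
    """Proportion of valid events served faster than threshold_ms.

    Every value passed in is assumed to belong to a valid event.
    """
    values = list(latencies_ms)
    if not values:
        raise ValueError("no valid events in the measurement window")
    good = sum(1 for v in values if v < threshold_ms)
    return good / len(values)


# Hypothetical sample: request latencies in milliseconds.
sample = [120.0, 80.0, 450.0, 95.0, 600.0]
# 4 of the 5 events are faster than the hypothetical 500 ms threshold.
print(f"latency SLI: {latency_sli(sample, threshold_ms=500.0):.3f}")
```

The same good-events-over-valid-events ratio applies to an availability SLI; only the success criterion changes.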
SLO Process - Availability SLI Creation
SLI Type: availability
SLI Specification: The proportion of valid checkout events served successfully.
● Requests to the CheckoutService that return HTTP response code 2xx, 3xx, or 4xx (excl. 429)
SLI Implementation: The proportion of HTTP GET requests for /checkout_service/response_counts that do not have 5XX status (3XX and 4XX excluded), measured at the Istio service mesh.
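As a rough sketch, the availability SLI can be computed from per-status-code request counts. The code below follows the specification bullet (2xx, 3xx, and 4xx except 429 count as successful) and treats every recorded request as a valid event; both choices are assumptions, as are the function names and the sample counts. It does not query Istio or Cloud Monitoring.

```python
from typing import Mapping


def is_good(status: int) -> bool:
    """Success criterion assumed from the specification:
    2xx, 3xx, or 4xx responses, excluding 429."""
    return 200 <= status < 500 and status != 429


def availability_sli(counts: Mapping[int, int]) -> float:
    """Proportion of valid events served successfully.

    counts maps an HTTP status code to the number of requests that
    returned it. All requests are assumed to be valid events.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests in the measurement window")
    good = sum(n for status, n in counts.items() if is_good(status))
    return good / total


# Hypothetical counts for the checkout endpoint over one window.
counts = {200: 9950, 404: 30, 429: 5, 500: 10, 503: 5}
print(f"availability SLI: {availability_sli(counts):.4f}")
```

With these hypothetical counts, 9,980 of 10,000 requests are good, so the SLI is 99.8%.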
SLO Process - SLO
1. Determine SLO target goals
2. Determine SLO measurement period
An SLO should include a target and a measurement window:
● 99.9% of Checkout requests in the past 28 days are successful
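A target and a window together imply an error budget: the share of requests that may fail before the SLO is missed. The sketch below works through the 99.9% / 28-day example; the request volume and failure count are hypothetical, and the function names are chosen for illustration only.

```python
def allowed_bad_events(target: float, total_events: int) -> float:
    """Number of events that may fail in the window without missing the SLO."""
    return (1.0 - target) * total_events


def budget_remaining(target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = allowed_bad_events(target, total_events)
    if allowed == 0:
        raise ValueError("a 100% target or an empty window leaves no error budget")
    return (allowed - bad_events) / allowed


TARGET = 0.999          # 99.9% of Checkout requests are successful
TOTAL = 1_000_000       # hypothetical Checkout requests in the past 28 days
BAD = 250               # hypothetical failed requests in the same window

print(f"allowed failures: {allowed_bad_events(TARGET, TOTAL):.0f}")
print(f"budget remaining: {budget_remaining(TARGET, TOTAL, BAD):.0%}")
```

For one million requests, a 99.9% target allows roughly 1,000 failures in the window, so 250 failures leave about three quarters of the budget.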
SLO Process
1. List out critical user journeys and order them by business impact.
2. Determine which metrics to use as service-level indicators (SLIs) to most
accurately track the user experience.
3. Determine SLO target goals and the SLO measurement period.
4. Configure SLI, SLO, and error budget consoles.
5. Configure SLO alerts.
Measuring Reliability on GCP
Setup Guide in 4 easy steps
1. Define Service: Select or define a service to monitor
2. Define SLI: Identify a behaviour for your service to observe
3. Define SLO: Set a target for the service in a time window
4. Define Alert: Configure alerts on the service health & burn rate
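Burn rate, mentioned in the alerting step, is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the full window; higher values spend it proportionally faster. The sketch below is a plain calculation, not a Cloud Monitoring alert policy, and the alert threshold and sample numbers are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - target)


def days_to_exhaustion(rate: float, window_days: float = 28.0) -> float:
    """Days until the full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


TARGET = 0.999
ALERT_THRESHOLD = 2.0   # hypothetical: alert when burning twice as fast as allowed

# Hypothetical lookback window: 50 failures out of 10,000 requests.
rate = burn_rate(bad_events=50, total_events=10_000, target=TARGET)
print(f"burn rate: {rate:.1f}")
print(f"budget gone in about {days_to_exhaustion(rate):.1f} days")
if rate > ALERT_THRESHOLD:
    print("alert: error budget is burning too fast")
```

In this example a 0.5% error rate against a 0.1% allowance gives a burn rate of about 5, which would use up a 28-day budget in under six days.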
Demo
1. Services Overview
2. Service Definition
3. SLI Creation
4. SLO Creation
5. SLO Alerts Creation
How can you get started?
Resources
● Cloud Operations Sandbox one-click cluster: github.com/GoogleCloudPlatform/cloud-ops-sandbox/
● Collection of public resources: bit.ly/Public_SRE_Resources
● Detailed step-by-step guide: Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
● [Qwiklabs] Cloud operations for GKE
Google's Public Resources
● Coursera: for leaders, Developing a Google SRE Culture; for engineers, Site Reliability Engineering: Measuring and Managing Reliability
● Art of SLOs classroom: The Art Of SLOs
● Blogs: DevOps & SRE
● Google Professional Services SRE packages
● The books*
*Cover images used with permission. These books can be found on shop.oreilly.com.
Follow us on Twitter: @googlesre. Find Google SRE publications—including the SRE
Books, articles, trainings, and more—for free at sre.google/resources.
Book covers copyright O’Reilly Media. Used with permission.
Q&A?
Thank you!

stackconf 2023 | Measuring Reliability in Production by Thomas Voss.pdf
