SlideShare a Scribd company logo
Those Other SLOs
@stevemcghee
Reliability Advocate, SRE
Google Cloud
% whoami
SRE on: Ads, Search, Gmail, Android, Fiber, YouTube, Cloud
Steve McGhee
Reliability Advocate
@stevemcghee
SLOs in one slide
A ratio-rate of good/total, measured over a time duration.
If too much non-good, for too long, tell a human.
SLI is the squiggly line
SLO is the straight one
Area is consumed Error Budget
https://cloud.google.com/architecture/defining-SLOs
Tired: Availability & Latency
Everyone talks about Availability: is it up?
Sometimes we talk about Latency: is it fast?
Sometimes we even combine them. Ooh, Fancy!
https://cloud.google.com/architecture/adopting-slos
Wired: Freshness, Coverage, Skew, Duration
Those are: Request Driven Services
What about:
- Data Processing
- Scheduled Execution
- Using ML models ?!
How can SLOs help here?
Need to define good and total
Data Processing: Freshness
The proportion of valid data updated more recently than a threshold.
Examples:
- Generating map tiles in a game
- "X items in stock" in an ecommerce store
- "The percentage of views that used stock information that was refreshed within the last
minute."
Generally:
good = datetime_served - datetime_built < threshold
Data Processing: Coverage
The proportion of valid data processed successfully.
If your system processes many inputs, but drops some for various reasons
(malformed, empty, unpaid, resource constrained), Coverage can help you
assess the state of the whole system.
good = valid_records - num_processed < threshold
Measure this per "bucket" time period to see how coverage changes over time.
The time difference between when the job should have started, and when it did
Skew tells you if a job (like cron) runs early, late, or on-time. It can have a
negative or positive value.
good = time_started - time_scheduled < max_threshold
and
time_started - time_scheduled > min_threshold
Setting expected upper/lower boundaries provides a method for knowing what is
considered "good" and making decisions from that.
Scheduled Execution: Skew
The time difference between when the job should have completed and when it
was expected to complete by.
Duration helps us understand if a large system is "fast enough" as a whole. This
can even catch never-ending jobs, if done correctly.
good = (time_ended or NOW) - time_started < max_expected
and
time_ended - time_started > min_expected
Scheduled Execution: Duration
What about Durability?
Durability is confusing, as it tends to have SO MANY NINES (11!).
This is because it measures a predicted distribution of potential physical failure
modes over time. By adding physical replicas and encoding schemes for
storage, you can model, understand, and improve the durability.
This is very different. Try not to compare this directly.
See: pg 203-208 of Implementing Service Level Objectives by Alex Hidalgo for
more math.
https://cloud.google.com/blog/products/storage-data-transfer/understanding-cloud-storage-11-9s-durability-target
More SLO Terminology
latency < 150ms, >95% per quarter:
● The Service is usually some sort of RPC request,
● The Goal is 95%,
● The Criteria is "latency < 150ms",
● The Period is "quarter".
● The Performance over the Period of the last quarter would be the "number of requests in the last quarter
with latency < 150ms" divided by "the number of requests in the last quarter".
● The Error Budget is 100% - 95% = 5%.
● This SLO will be Met for the Period of 2016-Q1 if the Performance for that Period is greater than the Goal
of 95%.
Parting Shots
SLOs are a measure of a system, not components. SLOs are an abstraction.
Not a replacement for the deep understanding needed for diagnosis.
Abstractions provide consistent understanding of behavior through change.
This is good.
Implementation can change, side-effects can come and go.
SLOs persist.

More Related Content

Similar to Those Other SLOs.pdf

Take testing-to-cloud
Take testing-to-cloudTake testing-to-cloud
Take testing-to-cloud
Vipin Jain
 
My sql
My sqlMy sql
Site-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdfSite-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdf
DeepakGupta747774
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
 
GCP On Prem Buyers Guide - White-paper | Qubole
GCP On Prem Buyers Guide - White-paper | Qubole GCP On Prem Buyers Guide - White-paper | Qubole
GCP On Prem Buyers Guide - White-paper | Qubole
Vasu S
 
Dynamic Three Stages Task Scheduling Algorithm on Cloud Computing
Dynamic Three Stages Task Scheduling Algorithm on Cloud ComputingDynamic Three Stages Task Scheduling Algorithm on Cloud Computing
Dynamic Three Stages Task Scheduling Algorithm on Cloud Computing
IJCSIS Research Publications
 
Big data
Big dataBig data
Big data
Big dataBig data
EuroSTAR 2013 Albert Witteveen Final
EuroSTAR 2013 Albert Witteveen FinalEuroSTAR 2013 Albert Witteveen Final
EuroSTAR 2013 Albert Witteveen Final
Albert Witteveen
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
dhruv_chaudhari
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
ITSM Academy, Inc.
 
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
IRJET Journal
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
AzarulIkhwan
 
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
Aditya Taneja
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Keet Sugathadasa
 
Agility in an AI / DS / ML Project
Agility in an AI / DS / ML ProjectAgility in an AI / DS / ML Project
Agility in an AI / DS / ML Project
Tathagat Varma
 
Workload design[1]
Workload design[1]Workload design[1]
Workload design[1]
Benyam Ayalew
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Big Data Aplications Meetup
 
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
Vasu S
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
VMware Tanzu
 

Similar to Those Other SLOs.pdf (20)

Take testing-to-cloud
Take testing-to-cloudTake testing-to-cloud
Take testing-to-cloud
 
My sql
My sqlMy sql
My sql
 
Site-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdfSite-Reliability-Engineering-v2[6241].pdf
Site-Reliability-Engineering-v2[6241].pdf
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
GCP On Prem Buyers Guide - White-paper | Qubole
GCP On Prem Buyers Guide - White-paper | Qubole GCP On Prem Buyers Guide - White-paper | Qubole
GCP On Prem Buyers Guide - White-paper | Qubole
 
Dynamic Three Stages Task Scheduling Algorithm on Cloud Computing
Dynamic Three Stages Task Scheduling Algorithm on Cloud ComputingDynamic Three Stages Task Scheduling Algorithm on Cloud Computing
Dynamic Three Stages Task Scheduling Algorithm on Cloud Computing
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
EuroSTAR 2013 Albert Witteveen Final
EuroSTAR 2013 Albert Witteveen FinalEuroSTAR 2013 Albert Witteveen Final
EuroSTAR 2013 Albert Witteveen Final
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
 
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
A Survey on A Secure Anti-Collusion Data Sharing Scheme for Dynamic Groups in...
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
 
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
Modern Software Methodologies(Agile ,Scrum & Lean) + CASE STUDY(Google)
 
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet SugathadasaSite Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa
 
Agility in an AI / DS / ML Project
Agility in an AI / DS / ML ProjectAgility in an AI / DS / ML Project
Agility in an AI / DS / ML Project
 
Workload design[1]
Workload design[1]Workload design[1]
Workload design[1]
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
2020 Cloud Data Lake Platforms Buyers Guide - White paper | Qubole
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 

Recently uploaded

Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
Nada Hikmah
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
KrishnaveniKrishnara1
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
UReason
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
VICTOR MAESTRE RAMIREZ
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
SakkaravarthiShanmug
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
shadow0702a
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
Mahmoud Morsy
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
AI assisted telemedicine KIOSK for Rural India.pptx
AI assisted telemedicine KIOSK for Rural India.pptxAI assisted telemedicine KIOSK for Rural India.pptx
AI assisted telemedicine KIOSK for Rural India.pptx
architagupta876
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURSCompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
RamonNovais6
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
TaghreedAltamimi
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 

Recently uploaded (20)

Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
 
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student MemberIEEE Aerospace and Electronic Systems Society as a Graduate Student Member
IEEE Aerospace and Electronic Systems Society as a Graduate Student Member
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
AI assisted telemedicine KIOSK for Rural India.pptx
AI assisted telemedicine KIOSK for Rural India.pptxAI assisted telemedicine KIOSK for Rural India.pptx
AI assisted telemedicine KIOSK for Rural India.pptx
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURSCompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 

Those Other SLOs.pdf

  • 1. Those Other SLOs @stevemcghee Reliability Advocate, SRE Google Cloud
  • 2. % whoami SRE on: Ads, Search, Gmail, Android, Fiber, YouTube, Cloud Steve McGhee Reliability Advocate @stevemcghee
  • 3. SLOs in one slide A ratio-rate of good/total, measured over a time duration. If too much non-good, for too long, tell a human. SLI is the squiggly line SLO is the straight one Area is consumed Error Budget https://cloud.google.com/architecture/defining-SLOs
  • 4. Tired: Availability & Latency Everyone talks about Availability: is it up? Sometimes we talk about Latency: is it fast? Sometimes we even combine them. Ooh, Fancy! https://cloud.google.com/architecture/adopting-slos
  • 5. Wired: Freshness, Coverage, Skew, Duration Those are: Request Driven Services What about: - Data Processing - Scheduled Execution - Using ML models ?! How can SLOs help here? Need to define good and total
  • 6. Data Processing: Freshness The proportion of valid data updated more recently than a threshold. Examples: - Generating map tiles in a game - "X items in stock" in an ecommerce store - "The percentage of views that used stock information that was refreshed within the last minute." Generally: good = datetime_served - datetime_built < threshold
  • 7. Data Processing: Coverage The proportion of valid data processed successfully. If your system processes many inputs, but drops some for various reasons (malformed, empty, unpaid, resource constrained), Coverage can help you assess the state of the whole system. good = valid_records - num_processed < threshold Measure this per "bucket" time period to see how coverage changes over time.
  • 8. The time difference between when the job should have started, and when it did Skew tells you if a job (like cron) runs early, late, or on-time. It can have a negative or positive value. good = time_started - time_scheduled < max_threshold and time_started - time_scheduled > min_threshold Setting expected upper/lower boundaries provides a method for knowing what is considered "good" and making decisions from that. Scheduled Execution: Skew
  • 9. The time difference between when the job should have completed and when it was expected to complete by. Duration helps us understand if a large system is "fast enough" as a whole. This can even catch never-ending jobs, if done correctly. good = (time_ended or NOW) - time_started < max_expected and time_ended - time_started > min_expected Scheduled Execution: Duration
  • 10. What about Durability? Durability is confusing, as it tends to have SO MANY NINES (11!). This is because it measures a predicted distribution of potential physical failure modes over time. By adding physical replicas and encoding schemes for storage, you can model, understand, and improve the durability. This is very different. Try not to compare this directly. See: pg 203-208 of Implementing Service Level Objectives by Alex Hidalgo for more math. https://cloud.google.com/blog/products/storage-data-transfer/understanding-cloud-storage-11-9s-durability-target
  • 11. More SLO Terminology latency < 150ms, >95% per quarter: ● The Service is usually some sort of RPC request, ● The Goal is 95%, ● The Criteria is "latency < 150ms", ● The Period is "quarter". ● The Performance over the Period of the last quarter would be the "number of requests in the last quarter with latency < 150ms" divided by "the number of requests in the last quarter". ● The Error Budget is 100% - 95% = 5%. ● This SLO will be Met for the Period of 2016-Q1 if the Performance for that Period is greater than the Goal of 95%.
  • 12. Parting Shots SLOs are a measure of a system, not components. SLOs are an abstraction. Not a replacement for the deep understanding needed for diagnosis. Abstractions provide consistent understanding of behavior through change. This is good. Implementation can change, side-effects can come and go. SLOs persist.