Scaling SLOs
With Kubernetes and cloud-native observability
Open Source Monitoring Conference – Nuremberg, Nov. 2022
Hello!
I am George Hantzaras
● Director, Cloud Platform Engineering @Citrix
● Organizer, Athens Cloud Computing Meetup Group
● Organizer, Athens Hashicorp User Group
● Creator of Ploigos (ploigos.co)
● Find me: @iamhantzo
● Based in Athens, Greece
Agenda
● SLO primer
● How SRE and Customer Experience are related
● Issues with implementing and scaling SLOs
● Cloud-native observability and SLO-as-code
1. Defining SL{A,O,I}s
Why do reliability metrics matter?
“You can’t manage what you don’t measure.”
How did we come up with this?
How do we measure reliability?
How SLOs work
How do we define, measure, and analyze our SLI, SLO, and SLA?
Defining SLOs
● Define a metric that captures what’s important for the business (SLI)
● Calculate what our target for this metric should be (SLO)
● Make a promise to our customers about the service we’ll be offering in relation to this metric (SLA)
● These can be time-based or event-based
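To make the time-based versus event-based distinction concrete, here is a minimal Python sketch. The function names and sample numbers are illustrative assumptions, not taken from the talk: an event-based SLI counts good events over total events, while a time-based SLI counts good time slices (for example, minutes in which the service was healthy) over total slices.

```python
def event_based_sli(good_events: int, total_events: int) -> float:
    """Fraction of events (e.g. requests) that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def time_based_sli(slice_is_good: list[bool]) -> float:
    """Fraction of time slices (e.g. minutes) in which the service was healthy."""
    if not slice_is_good:
        return 1.0  # assumption: no observations counts as meeting the objective
    return sum(slice_is_good) / len(slice_is_good)


# Hypothetical incident: 25 failed requests out of 10,000, all within 2 of 60 minutes.
requests_sli = event_based_sli(good_events=9_975, total_events=10_000)
minutes_sli = time_based_sli([True] * 58 + [False] * 2)

print(f"event-based SLI: {requests_sli:.4f}")
print(f"time-based SLI:  {minutes_sli:.4f}")
```

The same incident produces different numbers under the two methods, which is one reason to state the measurement method alongside the target.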
But what happens with this information?
● Product development teams get an error budget
● The error budget is the allowed amount of bad behavior of our service for a specific SLO
● Error Budget = 1 – SLO
● Budget Burn = 1 – SLI
● We want Budget > Burn
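The two formulas can be turned into a small worked example. This is a sketch under stated assumptions: SLO and SLI are expressed as ratios between 0 and 1, the helper names are hypothetical, and the 30-day window is chosen only for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(slo: float) -> float:
    """Error Budget = 1 - SLO (both expressed as ratios, e.g. 0.999)."""
    return 1.0 - slo


def budget_burn(sli: float) -> float:
    """Budget Burn = 1 - SLI, the share of bad behavior actually observed."""
    return 1.0 - sli


def budget_remaining(slo: float, sli: float) -> float:
    """Fraction of the error budget still unspent; negative means Burn > Budget."""
    budget = error_budget(slo)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - budget_burn(sli)) / budget


def allowed_bad_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of bad behavior for a time-based SLO."""
    return error_budget(slo) * window_days * MINUTES_PER_DAY


slo, sli = 0.999, 0.9995  # hypothetical target and measurement
remaining = budget_remaining(slo, sli)
minutes = allowed_bad_minutes(slo)

print(f"budget remaining: {remaining:.0%}")
print(f"allowed bad minutes in 30 days: {minutes:.1f}")
```

With these sample values the burn is half of the budget, so "Budget > Burn" holds and half of the budget is still available.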
2. How are these related to Customer Experience?
What really is reliability?
● Reliability is the probability of failure-free operation of a computer program for a specified period in a specified environment
● A customer-facing view of the quality of our cloud delivery
Reliability is CX
SLOs should focus on representing user impact and experience
Reliability engineering is a part of CX engineering in SaaS
Reliability drives cloud adoption
Defining SLOs is not a purely technical task
Customer-centric approach to SLOs
● Customers are different
● Availability should focus on what users perceive
● It’s not about uptime anymore but availability of functionality
● Combine SLO data with business and CRM data
What does that mean for the SLI/SLO process?
● Define the critical user journeys in your service and prioritize
● Determine which metrics are important for each journey
● For those metrics, look at historical data and set targets
● Operationalize those metrics and insights
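The step "look at historical data and set targets" could be sketched as follows. This is one possible heuristic, not the method from the talk: it takes a low nearest-rank percentile of past daily SLI values and rounds it down, so the suggested target is no higher than the value at that percentile. The function name, the percentile, and the sample history are all assumptions.

```python
import math


def candidate_target(daily_sli: list[float], percentile: float = 5.0,
                     decimals: int = 3) -> float:
    """Suggest an SLO target from historical daily SLI values (ratios in 0..1)."""
    if not daily_sli:
        raise ValueError("need at least one data point")
    ordered = sorted(daily_sli)
    # Nearest-rank percentile: 1-based rank of the value at the given percentile.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    achieved = ordered[rank - 1]
    # Round down so the suggestion never exceeds the value at that percentile.
    scale = 10 ** decimals
    return math.floor(achieved * scale) / scale


# Hypothetical ten days of availability for one user journey.
history = [0.9991, 0.9998, 0.9995, 0.9972, 0.9999,
           0.9993, 0.9997, 0.9996, 0.9994, 0.9998]
target = candidate_target(history, percentile=10)

print(f"candidate SLO target: {target}")
```

A number produced this way is only a starting point; the target still has to be checked against what customers actually need from that journey.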
3. Issues with implementing and scaling SLOs
Common implementation issues
● SLIs are not customer-focused or not aligned with the business
● No clear stakeholders and ownership over SLOs
● Reactive usage of error budgets
● Unrealistic SLOs (targets)
● Manual SLI/SLO evaluation process
Common scaling issues
Different standards in monitoring
● A large number of engineers and a growing number of teams bring many different approaches to observability
● Different standards around metrics and logs can produce different results
Common scaling issues
Scaling services and distributed systems create complexity
Common scaling issues
● A common understanding of service metrics
● Scaling and automating the evaluation process of SLIs/SLOs
● Tool agnosticism for monitoring infrastructure
4. Observability and SLOs as code
Cloud-native observability
Real-time insights
Provide real-time observability and insights for cloud operations, product teams, customer success, and leadership.
Self-service
Serve elastic infrastructure and applications in such a way that no operational overhead is added to onboard a new service.
Alerting
Configurable alerting for multiple platforms, using different communication channels.
Integrations
Open observability infrastructure that easily integrates with customer data sources, reporting tools, CI/CD, and more.
SLO Framework
A new company-wide SLO framework, adapted to the latest architectures and focused on customer experience.
Operationalizing error budgets for product insights
Powered by
Metrics Events Logs Traces
Observability as code
Metrics as code
Dashboards as code
Alerting as code
GitOps
Zero Trust
Observability as code
Commit + CI
Observability configuration is added via pull request. Build is triggered.
Build
Build is executed and the final TF module and variables are created.
Testing
IaC modules are tested and a notification for test outcomes is sent.
Deploy
Assets are published to artifactory and are ready to be deployed in production. Optionally, deployment is triggered.
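As an illustration of the testing stage, a pull-request check might validate observability configuration before anything is built or deployed. The sketch below is hypothetical: the rule schema, field names, and severity levels are assumptions made for the example and do not come from the talk or from any specific tool.

```python
import re

# Hypothetical conventions that a team might enforce in CI.
REQUIRED_FIELDS = ("name", "expr", "for", "severity", "owner")
ALLOWED_SEVERITIES = {"page", "ticket"}
DURATION_PATTERN = re.compile(r"^\d+[smhd]$")  # e.g. "5m", "1h"


def validate_alert_rule(rule: dict) -> list[str]:
    """Return human-readable problems; an empty list means the rule passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not rule.get(field):
            problems.append(f"missing required field '{field}'")
    severity = rule.get("severity")
    if severity and severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity '{severity}'")
    duration = rule.get("for")
    if duration and not DURATION_PATTERN.match(duration):
        problems.append(f"invalid duration '{duration}'")
    return problems


good_rule = {"name": "HighErrorRate", "expr": "error_ratio > 0.01",
             "for": "5m", "severity": "page", "owner": "team-checkout"}
bad_rule = {"name": "HighLatency", "expr": "latency_p99 > 0.5",
            "for": "five minutes", "severity": "urgent"}

good_problems = validate_alert_rule(good_rule)
bad_problems = validate_alert_rule(bad_rule)

for problem in bad_problems:
    print(f"{bad_rule['name']}: {problem}")
```

A CI job could fail the build whenever any rule returns a non-empty list, which keeps review feedback inside the pull request.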
Operationalizing
How about SLOs as code?
● OpenSLO specification
● Open specification for defining SLOs to enable a common, vendor-agnostic approach to SLOs
● YAML-based format
● Sloth for Prometheus
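The idea behind SLOs as code is that an SLO becomes a small piece of declarative data from which rules and alerts can be derived. The Python sketch below illustrates that idea only; it is not the OpenSLO or Sloth schema, and the class and field names are assumptions. The derived values follow the multi-window burn-rate approach described in Google's SRE Workbook, where a burn rate of 1 spends exactly the whole budget over the SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An SLO declared as data (field names are illustrative)."""
    service: str
    name: str
    objective: float       # target as a ratio, e.g. 0.999
    window_days: int = 30

    @property
    def error_budget(self) -> float:
        return 1.0 - self.objective


def burn_rate_threshold(slo: SLO, budget_fraction: float,
                        alert_window_hours: float) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the alert window."""
    return budget_fraction * (slo.window_days * 24) / alert_window_hours


def error_ratio_threshold(slo: SLO, budget_fraction: float,
                          alert_window_hours: float) -> float:
    """Error ratio over the alert window that corresponds to that burn rate."""
    return burn_rate_threshold(slo, budget_fraction, alert_window_hours) * slo.error_budget


# Hypothetical service and objective.
checkout = SLO(service="checkout", name="requests-availability", objective=0.999)

fast_burn = burn_rate_threshold(checkout, budget_fraction=0.02, alert_window_hours=1)
slow_burn = burn_rate_threshold(checkout, budget_fraction=0.05, alert_window_hours=6)

print(f"page when burn rate exceeds {fast_burn:.1f} over 1h")
print(f"ticket when burn rate exceeds {slow_burn:.1f} over 6h")
```

Keeping the objective in one declared place means every alert threshold is derived rather than hand-tuned, which is what makes the definition reviewable in a pull request.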
SLOs as code – OpenSLO
SLOs as code – Sloth
Sloth example
Putting it all together
Observability Architecture
SLO monitoring
Lessons Learnt
On-call ≠ Reporting
Monitoring and alerting for SRE and on-call teams has totally different requirements than monitoring and alerting for uptime reporting, leadership visibility, etc.
Customer focused
SLI/SLO reporting needs to reflect customer impact. Observability should focus on measuring what the customers are experiencing.
Start somewhere
Observability is a journey of continuous improvement. Find the low-hanging fruit and start from there, and keep in mind that you should always keep working on it.
5. Wrapping up
Key takeaways
● Monitoring has evolved
● Observability is not just about measuring
● Reliability has also adapted to technology and delivery advancements
● Observability is a really important aspect of reliability
● We can scale observability and reliability measurement with cloud-native tools
Thank you!
Stay in touch
● Find me: @iamhantzo
● LinkedIn: George Hantzaras
