SRE Demystified
Practical Alerting
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer
SRE
Image source: https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257
Monitoring
• Monitoring a very large system is challenging for a couple of
reasons:
  • The sheer number of components being analyzed
  • The need to maintain a reasonably low maintenance burden on the
  engineers responsible for the system
• A large system should be designed to aggregate signals and
prune outliers
• We need monitoring systems that allow us to alert for high-level
service objectives, but retain the granularity to inspect
individual components as needed
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Borgmon monitoring at Google
• White-box monitoring
• Instead of executing custom scripts to detect system failures,
Borgmon relies on a common data exposition format
• This enables mass data collection with low overheads and avoids
the costs of subprocess execution and network connection setup
• The data is used both for rendering charts and creating
alerts, which are accomplished using simple arithmetic
• To facilitate mass collection, the metrics format had to be
standardized
Instrumentation of applications
• Example: adding map-valued variables
• An example map-valued variable, showing 25 HTTP 200 responses
and 12 HTTP 500s:
http_responses map:code 200:25 404:0 500:12
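A minimal sketch of how a collector might parse such an exported line. It assumes the whitespace-separated layout shown above: a variable name, then either a plain integer or a map:<label> field followed by key:value pairs. The function name parse_varz_line is hypothetical, and a real exposition format may carry more value types than this handles.

```python
def parse_varz_line(line):
    """Parse one exported line into (name, label, value).

    For a map-valued variable, label is the map's key label and value is a
    dict; for a scalar, label is None and value is an int.
    """
    name, rest = line.split(maxsplit=1)
    if rest.startswith("map:"):
        label_field, *pairs = rest.split()
        label = label_field[len("map:"):]
        values = {}
        for pair in pairs:
            # Split on the last colon so the count is always the final field.
            key, _, raw = pair.rpartition(":")
            values[key] = int(raw)
        return name, label, values
    return name, None, int(rest)


parsed = parse_varz_line("http_responses map:code 200:25 404:0 500:12")
```

Here parsed holds the variable name, the label name "code", and a dictionary of counts keyed by response code.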
Storage in the Time-Series Arena
• A service is typically made up of many binaries running as
many tasks, on many machines, in many clusters
• Borgmon needs to keep all that data organized, while allowing
flexible querying and slicing of that data
• Borgmon stores all the data in an in-memory database,
regularly checkpointed to disk
• The data points have the form (timestamp, value) and are
stored in chronological lists called time-series
• Each time-series is named by a unique set of labels of the
form name=value
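The storage model described above can be sketched as a toy in-memory store. This is an illustration under stated assumptions, not Borgmon's implementation: the class name Arena and its methods are hypothetical, and checkpointing to disk and expiry of old points are left out.

```python
from collections import defaultdict


class Arena:
    """Toy in-memory store: one chronological list of points per labelset."""

    def __init__(self):
        # frozenset of (name, value) label pairs -> list of (timestamp, value)
        self._series = defaultdict(list)

    def append(self, labels, timestamp, value):
        """Record one data point for the time-series named by `labels`."""
        self._series[frozenset(labels.items())].append((timestamp, value))

    def select(self, **matchers):
        """Return every time-series whose labelset contains all the matchers."""
        wanted = set(matchers.items())
        return {
            labelset: points
            for labelset, points in self._series.items()
            if wanted <= labelset
        }


arena = Arena()
arena.append({"var": "errors", "instance": "host0:80"}, 1000, 3)
arena.append({"var": "errors", "instance": "host0:80"}, 1060, 5)
arena.append({"var": "errors", "instance": "host1:80"}, 1000, 0)
per_host = arena.select(var="errors")
```

Keying each series by the full set of labels is what allows the same data to be sliced by any label, for example by host or by variable name.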
Figure: A time-series for errors, labeled by the original host each was collected from
Labels and Vectors
• Time-series are stored as sequences of numbers and
timestamps, which are referred to as vectors
• Like vectors in linear algebra, these vectors are slices and cross-sections of
the multidimensional matrix of data points in the arena
• The name of a time-series is a labelset, because it’s implemented
as a set of labels expressed as key=value pairs. One of these
labels is the variable name itself, the key that appears on the varz
page
Labels and Vectors
• Example variable expression
{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}
Label    Value
var      The name of the variable
job      The name given to the type of server being monitored
service  A loosely defined collection of jobs that provide a service to users, either internal or external
zone     Location of the Borgmon that performed the collection of this variable
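As a rough illustration, a variable expression written in the form above can be turned into a dictionary of labels. The helper name parse_labelset is hypothetical, and the sketch assumes that label values contain no commas or equals signs.

```python
def parse_labelset(expr):
    """Parse '{key=value,key=value,...}' into a dict of labels."""
    body = expr.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("labelset must be wrapped in braces")
    labels = {}
    for pair in body[1:-1].split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


labels = parse_labelset(
    "{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}"
)
```

The resulting dictionary maps each label name (var, job, instance, service, zone) to its value, which is the form a query would match against.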
Rule Evaluation
• The Borgmon program code, also known as Borgmon
rules, consists of simple algebraic expressions that
compute time-series from other time-series
• Rules run in a parallel threadpool where possible, but are
dependent on ordering when using previously defined
rules as input
• Aggregation is the cornerstone of rule evaluation in a
distributed environment
Example Rule
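As an illustration only, written in Python rather than Borgmon's own rule language, the sketch below computes a time-series from other time-series in two steps: a per-task request rate from a counter, then an aggregate that drops the instance label. The function names rate and sum_without and the sample data are assumptions, and counter resets are ignored for brevity.

```python
def rate(points, window):
    """Per-second increase over the trailing `window` seconds.

    `points` is a non-empty chronological list of (timestamp, value) samples
    from a monotonically increasing counter.
    """
    latest_t, latest_v = points[-1]
    in_window = [(t, v) for t, v in points if t >= latest_t - window]
    first_t, first_v = in_window[0]
    if latest_t == first_t:
        return 0.0
    return (latest_v - first_v) / (latest_t - first_t)


def sum_without(values, drop_label):
    """Sum values across series after removing `drop_label` from each labelset."""
    totals = {}
    for labelset, value in values.items():
        key = frozenset((k, v) for k, v in labelset if k != drop_label)
        totals[key] = totals.get(key, 0.0) + value
    return totals


samples = {
    frozenset({("job", "webserver"), ("instance", "host0:80")}): [(0, 0), (600, 600)],
    frozenset({("job", "webserver"), ("instance", "host1:80")}): [(0, 100), (600, 400)],
}
# First rule: a 10-minute rate per task. Second rule: aggregate across tasks.
task_rates = {labels: rate(points, window=600) for labels, points in samples.items()}
cluster_rates = sum_without(task_rates, "instance")
```

The second step takes the output of the first as its input, which is why the order of evaluation matters between dependent rules.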
Example Alert Rule
• Creates an alert when the error ratio over 10 minutes exceeds
1% and the total number of errors exceeds 1 per second
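The condition described above can be sketched as a simple predicate. This is a Python illustration, not the Borgmon alert rule itself; the function name should_alert and its parameters are hypothetical, and both rates are assumed to be already computed over a 10-minute window in events per second.

```python
def should_alert(error_rate, request_rate, max_ratio=0.01, min_error_rate=1.0):
    """Return True when both alert conditions hold.

    error_rate and request_rate are per-second rates over the same window.
    """
    if request_rate <= 0:
        # No traffic means there is no meaningful error ratio to alert on.
        return False
    ratio = error_rate / request_rate
    return ratio > max_ratio and error_rate > min_error_rate


high_ratio_busy = should_alert(error_rate=2.0, request_rate=100.0)
high_ratio_quiet = should_alert(error_rate=0.5, request_rate=10.0)
```

Requiring both conditions keeps a nearly idle service from alerting on a handful of errors: in the second call the ratio is 5% but the error rate is below 1 per second, so no alert is raised.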
Maintaining the configuration
• Borgmon configuration separates the definition of the rules
from the targets being monitored
• Borgmon also supports language templates, which fall into two classes
• The first class simply codifies the emergent schema of
variables exported from a given library of code
• Such templates exist for the HTTP server library, memory
allocation, and the storage client library
• The second class of templates manages the aggregation
of data from a single-server task to the global service footprint
