Monitoring lessons from Waze SRE team
Yonit Gruber-Hazani
A little about me - Yonit Gruber-Hazani
Helpdesk
MS admin
Linux Admin
Production Manager [Linux]
Devops Engineer [Linux]
SRE [Linux]
What we will go through:
- About Waze, my team and Waze's technical structure
- Monitoring, alerting and complexity
- The new monitoring direction
- Our best practices (that work for us)
Waze in Numbers
130M Active Monthly Users
500K Maps Editors
80M API Calls Per Day
Outsmarting traffic together
Thousands of instances
Hundreds of autoscaling groups
2 PB of Cassandra data on ~2,000 Cassandra instances
Waze SRE team
● We build and operate the Waze infrastructure
● We're part of Google
○ Autonomous
○ Running on top of public clouds
● 21 team members across the globe
Waze Structure
Waze microservices, multi cloud
GCP:
- Cache data layer: Memcached, Redis
- Database layer: Cassandra, Spanner, Cloud SQL
- Java microservices on Compute Engine, App Engine, Container Engine
AWS:
- Cache data layer: Memcached, Redis
- Database layer: Cassandra, RDS
- Java microservices on containers, EC2, Lambda
Spinnaker
Waze microservices
Proprietary communications protocol
Geographical Sharding
[Diagram: microservice regions, microservice datacenters, countries]
Israel, North America, Asia Pacific, Europe, South America
Production critical services are split into dozens of geographical shards.
● Spreads the load
● Reduces blast radius
Several logical data centers split across 3 regions
[Chart: daily driving trends, Waze US data, 2017; 8am and 5pm are marked]
In the beginning there was Nagios
Managed monitoring API service
What did we look for?
- Managed monitoring service
- API for metrics collection, dashboards and policy creation
- Support our scale and growing monitoring needs
- Multi cloud support
We chose Stackdriver
How do you deploy monitoring at planet scale?
Baby steps
- Aggregate our proprietary protocol stats from a central location
- Created basic dashboards that show, for each microservice:
  - QPM
  - Latency
  - Failure rate
- We also added metrics from the cloud providers, GCP and AWS, to the dashboards
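The three dashboard numbers can be derived from raw request records. The following is a minimal sketch, assuming a hypothetical `Request` record and a fixed time window; the field names and the choice of mean latency are illustrative assumptions, not the Waze protocol or the Stackdriver API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One request record. The fields are hypothetical, for illustration only."""
    timestamp_s: float  # seconds since the start of the window
    latency_ms: float
    ok: bool


def summarize(requests: list[Request], window_s: float) -> dict[str, float]:
    """Return QPM, mean latency and failure rate for one window of requests."""
    if not requests or window_s <= 0:
        return {"qpm": 0.0, "latency_ms": 0.0, "failure_rate": 0.0}
    failures = sum(1 for r in requests if not r.ok)
    return {
        "qpm": len(requests) / (window_s / 60.0),  # queries per minute
        "latency_ms": mean(r.latency_ms for r in requests),
        "failure_rate": failures / len(requests),
    }


# Four made-up requests observed over a two-minute window.
sample = [
    Request(1.0, 120.0, True),
    Request(15.0, 80.0, True),
    Request(40.0, 400.0, False),
    Request(70.0, 200.0, True),
]
stats = summarize(sample, window_s=120.0)
print(stats)
```

A real dashboard would more likely chart latency percentiles than a mean, since a mean hides slow outliers.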
Deployment steps
Auto monitoring for each microservice of:
- Memory
- Free disk
- CPU load
Zero conf monitoring
- Data layer
- Caching
- Pubsub
- Java GC
- App and config versions
Removing the monitoring bottleneck from our team
What about alerting?
- Free disk space
- Max autoscaling groups
- Too many failed instances in group
- CPU overloaded
- Free memory
"What information consumes is rather obvious: it consumes the attention of its recipients."
Herbert A. Simon
Complexity
What's in our Dashboards
- Server stats
- Client services
- Dependencies
- Data layer
What's this service anyway?
The new monitoring
Error budgets
● SLI - Service Level Indicator
○ Error rate
○ Latency
● SLO - Service Level Objective
○ 95% of logins complete in < 300 ms
● User Journey
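To make the SLI and SLO distinction concrete, here is a small sketch that measures a latency SLI and compares it with the login objective from the slide (95% under 300 ms). The function names and the sample latencies are assumptions made for illustration.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: fraction of requests that finished faster than threshold_ms."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, target: float) -> bool:
    """SLO check: the measured SLI must reach the target fraction."""
    return sli >= target


# Made-up login latencies: 96 fast requests and 4 slow ones.
login_latencies_ms = [120.0] * 96 + [450.0] * 4
sli = latency_sli(login_latencies_ms, threshold_ms=300.0)
print(f"SLI: {sli:.2%}, meets SLO: {meets_slo(sli, target=0.95)}")
```

The SLI is what is measured; the SLO is the target it is compared against.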
"Services need target SLOs that capture the performance and availability levels that, if barely met, would keep the typical customer happy."
SLO Classroom
The happiness test - Critical User Journey
“meets target SLO” ⇒ “happy customers”
“misses target SLO” ⇒ “sad customers”
SLO in Numbers
30 day error budget
99.9% == 43.2 min
99.99% == 4.32 min
99.999% == 26 sec
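The budget figures follow from multiplying the window length by the allowed failure fraction. A minimal sketch of that arithmetic, assuming a 30 day window and an availability SLO expressed as a percentage (the function name is illustrative):

```python
def error_budget_seconds(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in seconds, for an availability SLO over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (100.0 - slo_percent) / 100.0


for slo in (99.9, 99.99, 99.999):
    budget = error_budget_seconds(slo)
    print(f"{slo}% -> {budget / 60:.2f} min ({budget:.0f} s)")
```

Each extra nine cuts the budget by a factor of ten, which is why 99.999% leaves only about 26 seconds per 30 days.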
Best Practices
Replace alerts with automations
Increase Max for autoscaling groups
Add disks
Replace instances with healthy instances
Remove all single pet servers
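One way to picture replacing alerts with automations is a function that turns the state of an autoscaling group into remediation actions instead of a page. This is only a sketch: the `GroupState` fields, the 10% free disk threshold, and the action strings are hypothetical, and a real system would call the cloud provider's API rather than return text.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    """Hypothetical snapshot of one autoscaling group."""
    name: str
    size: int
    max_size: int
    unhealthy_instances: int
    free_disk_percent: float


def plan_remediation(group: GroupState, min_free_disk_percent: float = 10.0) -> list[str]:
    """Return the automated actions to take; an empty list means nothing to do."""
    actions: list[str] = []
    if group.size >= group.max_size:
        actions.append(f"increase max size of {group.name}")
    if group.free_disk_percent < min_free_disk_percent:
        actions.append(f"add disk to {group.name}")
    if group.unhealthy_instances > 0:
        actions.append(
            f"replace {group.unhealthy_instances} unhealthy instance(s) in {group.name}"
        )
    return actions


# A made-up group that has hit its max size and has two unhealthy instances.
state = GroupState("routing-eu", size=20, max_size=20,
                   unhealthy_instances=2, free_disk_percent=35.0)
for action in plan_remediation(state):
    print(action)
```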
Blameless Post mortems
REALLY BLAMELESS
What happened?
Why did it happen?
How was it solved?
Did the Monitoring work?
What worked well?
What didn't?
Action Items
POST POSTMORTEM
Keep a bug list of action items after postmortems, with an owner for each bug
Periodically review EXISTING MONITORS
Review existing monitors and update thresholds
Remove old deprecated alerts
Verify you are monitoring the updated endpoints
Update monitors on the fly
Playbooks for alerts
Add updated playbooks for each alert
Playbooks contain:
- DEV, SRE and QA owners
- Links to dashboards
- Step by step procedures
- Links to system designs
- Relevant data layer dashboards: Cassandra, DB, cache
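The playbook contents listed above can be treated as a checklist that is verified automatically. The sketch below assumes a hypothetical `Playbook` record and reports which parts are still missing; the field names and the example.com link are placeholders, not a real Waze format.

```python
from dataclasses import dataclass, field

REQUIRED_ROLES = ("DEV", "SRE", "QA")


@dataclass
class Playbook:
    """Hypothetical playbook record for one alert."""
    alert_name: str
    owners: dict[str, str] = field(default_factory=dict)  # role -> owner
    dashboard_links: list[str] = field(default_factory=list)
    procedure_steps: list[str] = field(default_factory=list)
    design_links: list[str] = field(default_factory=list)
    data_layer_dashboards: list[str] = field(default_factory=list)


def missing_parts(playbook: Playbook) -> list[str]:
    """List the playbook sections that have not been filled in yet."""
    missing = [f"{role} owner" for role in REQUIRED_ROLES if role not in playbook.owners]
    sections = {
        "dashboard links": playbook.dashboard_links,
        "step by step procedure": playbook.procedure_steps,
        "system design links": playbook.design_links,
        "data layer dashboards": playbook.data_layer_dashboards,
    }
    missing.extend(name for name, value in sections.items() if not value)
    return missing


# A made-up, partially filled playbook.
draft = Playbook(
    alert_name="login-latency",
    owners={"DEV": "alice", "SRE": "bob"},
    dashboard_links=["https://example.com/dashboards/login"],
    procedure_steps=["Check the login dashboard", "Roll back the last deploy"],
)
print(missing_parts(draft))
```

A check like this could run during the periodic monitor review so that every alert keeps an up-to-date playbook.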
Clean your signals
Noisy signals cannot be monitored
Choose your battles
Three levels of alert urgency:
1. Wake up an oncall
2. Open a bug
3. Send an email for debugging and root cause searching
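The three urgency levels can be encoded as an explicit routing decision. In this sketch the levels come from the slide, while the classification rule (page only when an SLO is at risk, open a bug when a fix is needed) is an assumption for illustration.

```python
from enum import Enum


class Urgency(Enum):
    PAGE = "wake up an oncall"
    BUG = "open a bug"
    EMAIL = "send an email for debugging"


def classify(slo_at_risk: bool, needs_fix: bool) -> Urgency:
    """Hypothetical rule for picking one of the three urgency levels."""
    if slo_at_risk:
        return Urgency.PAGE
    if needs_fix:
        return Urgency.BUG
    return Urgency.EMAIL


# Made-up alerts: (name, slo_at_risk, needs_fix)
alerts = [
    ("login error rate high", True, True),
    ("disk filling slowly", False, True),
    ("unusual GC pattern", False, False),
]
for name, slo_at_risk, needs_fix in alerts:
    print(f"{name}: {classify(slo_at_risk, needs_fix).value}")
```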
THINGS I LEARNED FROM BEING A PARENT
Thank you!
