
Monitoring lessons from the Waze SRE team



StatsCraft 2019 talk session in Tel Aviv, from Yonit Gruber-Hazani, Waze SRE team member.
The full YouTube session is available here: https://www.youtube.com/watch?v=iSs4lTrUyI8
The talk is mainly in Hebrew.

Published in: Engineering

  1. Monitoring lessons from the Waze SRE team, by Yonit Gruber-Hazani
  2. A little about me, Yonit Gruber-Hazani: Helpdesk → MS admin → Linux admin → Production manager (Linux) → DevOps engineer (Linux) → SRE (Linux)
  3. A little about me, Yonit Gruber-Hazani
  4. A little about me, Yonit Gruber-Hazani
  5. What we will go through: about Waze, my team and Waze's technical structure; monitoring, alerting and complexity; the new monitoring direction; and our best practices (the ones that work for us).
  6. Waze in numbers: 130M monthly active users, 500K map editors, 80M API calls per day.
  7. Outsmarting traffic together: thousands of instances, hundreds of autoscaling groups, 2 PB of Cassandra data on ~2,000 Cassandra instances.
  8. Waze SRE team: we build and operate the Waze infrastructure. We're part of Google, but autonomous and running on top of public clouds. 21 team members across the globe.
  9. Waze structure
  10. Waze microservices, multi-cloud. On GCP: Java microservices on Compute Engine, App Engine and Container Engine, a cache layer of Memcached and Redis, and a database layer of Cassandra, Spanner and Cloud SQL. On AWS: Java microservices on containers, EC2 and Lambda, the same Memcached/Redis cache layer, and a database layer of Cassandra and RDS.
  11. Spinnaker
  12. Waze microservices
  13. Waze microservices: proprietary communications protocol
  14. Geographical sharding. Production-critical services are split into dozens of geographical shards, from microservice regions down to microservice datacenters and countries (Israel, North America, Asia Pacific, Europe, South America). This spreads the load and reduces the blast radius. Several logical data centers are split across 3 regions.
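The sharding idea on the slide can be sketched as a routing function: pick a shard for a request by its country's region, then spread load within the region. This is a minimal sketch; the region table, shard count and shard naming are all hypothetical, not Waze's actual routing.

```python
# Hypothetical country-to-region table for illustration.
REGION_OF = {
    "IL": "israel",
    "US": "north-america",
    "BR": "south-america",
    "FR": "europe",
    "JP": "asia-pacific",
}

def shard_for(country_code: str, shards_per_region: int = 4) -> str:
    """Route to a region, then spread within it by a stable hash.

    Keeping traffic region-local spreads the load, and a stable
    per-country hash keeps one shard's outage to a small blast radius.
    """
    region = REGION_OF.get(country_code, "europe")  # arbitrary fallback
    index = sum(map(ord, country_code)) % shards_per_region
    return f"{region}-{index}"
```

The hash is deterministic, so the same country always lands on the same shard, which is what makes per-shard dashboards and rollouts meaningful.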
  15. Daily driving trends (chart, Waze US data, 2017): traffic peaks around 8am and 5pm.
  16. In the beginning there was Nagios
  17. Managed monitoring API service. What did we look for? A managed monitoring service; an API for metrics collection, dashboard and policy creation; support for our scale and growing monitoring needs; multi-cloud support. We chose Stackdriver.
  18. How do you deploy monitoring on a planet scale? Baby steps.
  19. Deployment steps, for each microservice: aggregate our proprietary protocol stats from a central location; create basic dashboards showing QPM, latency and failure rate; and add to the dashboards metrics from the cloud providers, GCP and AWS.
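The three basic dashboard signals per microservice (QPM, latency, failure rate) can all be derived from a per-minute rollup of request outcomes. The sketch below is a generic illustration of that rollup, not Waze's proprietary stats protocol or the Stackdriver API; all names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceStats:
    """One-minute rollup of request outcomes for a single microservice."""
    requests: int = 0
    failures: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, latency_ms: float, ok: bool) -> None:
        """Record one request's latency and success/failure."""
        self.requests += 1
        if not ok:
            self.failures += 1
        self.latencies_ms.append(latency_ms)

    def qpm(self) -> int:
        # The window is one minute, so the raw count is queries-per-minute.
        return self.requests

    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0

    def p95_latency_ms(self) -> float:
        # Nearest-rank 95th percentile over the minute's latencies.
        xs = sorted(self.latencies_ms)
        return xs[int(0.95 * (len(xs) - 1))] if xs else 0.0
```

In a real pipeline each minute's rollup would be flushed to the monitoring backend as three time series; the dashboard then just plots them per microservice.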
  20. Zero-conf monitoring: automatic monitoring for each microservice of memory, free disk, CPU load, the data layer, caching, pub/sub, Java GC, and app and config versions.
  21. Removing the monitoring bottleneck from our team
  22. What about alerting? Free disk space, autoscaling groups at max, too many failed instances in a group, CPU overloaded, free memory.
  23. Herbert A. Simon: "What information consumes is rather obvious: it consumes the attention of its recipients."
  24. Complexity
  25. What's in our dashboards
  26. What's in our dashboards: server stats
  27. What's in our dashboards: client services
  28. What's in our dashboards: dependencies
  29. What's in our dashboards: data layer
  30. What's this service anyway?
  31. The new monitoring: error budgets. SLI (Service Level Indicator): error rate, latency. SLO (Service Level Objective): e.g. 95% of logins under 300 ms. Critical user journeys. From the SLO classroom: services need target SLOs that capture the performance and availability levels that, if barely met, would keep the typical customer happy.
  32. The happiness test, for a critical user journey: "meets target SLO" ⇒ happy customers; "misses target SLO" ⇒ sad customers.
  33. SLO in numbers, as a 30-day error budget: 99.9% == 43.2 min, 99.99% == 4.32 min, 99.999% == ~26 sec.
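The numbers on the slide all follow from one formula: the error budget is the window length multiplied by the allowed failure fraction, 1 − SLO target. A quick check in Python:

```python
def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """Seconds of allowed unavailability per window for a given SLO target."""
    window_seconds = window_days * 24 * 60 * 60  # 30 days = 2,592,000 s
    return window_seconds * (1.0 - slo_target)

# Reproduce the slide's table: 43.2 min, 4.32 min, ~26 sec.
for target in (0.999, 0.9999, 0.99999):
    budget = error_budget_seconds(target)
    print(f"{target:.3%} -> {budget / 60:.2f} min ({budget:.0f} s)")
```

Note that 99.999% over 30 days comes out to 25.92 seconds, which the slide rounds to 26.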
  34. Best practices
  35. Replace alerts with automations: increase the max for autoscaling groups, add disks, replace failed instances with healthy ones, and remove all single "pet" servers.
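The "automation instead of alert" pattern can be sketched as follows: rather than paging when instances fail health checks, terminate them and let the autoscaling group launch replacements. The function below is illustrative only; `is_healthy` and `replace` stand in for real cloud-provider API calls.

```python
def remediate(instances, is_healthy, replace):
    """Replace every unhealthy instance instead of alerting on it.

    instances:  iterable of instance identifiers
    is_healthy: callable(instance) -> bool, e.g. a health-check probe
    replace:    callable(instance), e.g. terminate-and-let-ASG-relaunch
    Returns the list of instances that were replaced (for logging).
    """
    replaced = [i for i in instances if not is_healthy(i)]
    for instance in replaced:
        replace(instance)  # the autoscaling group brings up a fresh one
    return replaced
```

Running such a loop on a schedule turns the "too many failed instances in group" page into a log line, which is exactly the point of the slide.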
  36. Blameless postmortems (really blameless): What happened? Why did it happen? How was it solved? Did the monitoring work? What worked well? What didn't?
  37. Action items, post-postmortem: file a bug for each action item from the postmortem, with an owner for every bug.
  38. Periodically review existing monitors: update thresholds, remove old deprecated alerts, verify you are monitoring the updated endpoints, and update monitors on the fly.
  39. Playbooks for alerts: add an updated playbook for each alert. Playbooks contain the DEV, SRE and QA owners, links to dashboards, step-by-step procedures, links to system designs, and the relevant data layers (Cassandra, DB and cache dashboards).
  40. Clean your signals: noisy signals cannot be monitored.
  41. Choose your battles (things I learned from being a parent). Three levels of alert urgency: 1. wake up an on-call; 2. open a bug; 3. send an email for debugging and root-cause searching.
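The three urgency levels reduce to a small routing rule: page only when users are hurting right now, otherwise file a bug or just send an email. The predicate names below are assumptions for illustration, not Waze's actual triage criteria.

```python
def route_alert(user_impact: bool, needs_fix: bool) -> str:
    """Pick the least disruptive channel that still gets the alert handled."""
    if user_impact:
        return "page"   # users are affected now: wake up the on-call
    if needs_fix:
        return "bug"    # real issue, but it can wait for working hours
    return "email"      # informational: debug and root-cause when convenient
```

The ordering matters: every alert must clear the "would I wake someone for this?" bar before it is allowed to page.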
  42. Thank you!
