Statscraft 2019 talk session in TLV by Yonit Gruber-Hazani, Waze SRE team member.
The full YouTube session is available here: https://www.youtube.com/watch?v=iSs4lTrUyI8
The talk is mainly in Hebrew.
What we will go through:
- About Waze, my team and Waze's technical structure
- Monitoring, alerting and complexity
- The new monitoring direction
- Our best practices (that work for us)
Waze SRE team
● We build and operate the Waze infrastructure
● We’re part of Google
○ Autonomous
○ Running on top of public clouds
● 21 team members across the globe
Managed monitoring API service
What did we look for?
- Managed monitoring service
- API for metrics collection, dashboards and policy creation
- Support for our scale and growing monitoring needs
- Multi-cloud support
We chose Stackdriver
How do you deploy monitoring at planet scale?
Baby steps
- Aggregated our proprietary protocol stats from a central location
- Created basic dashboards that show:
- QPM
- Latency
- Failure Rate
- We also added metrics from the cloud providers, GCP and AWS, to the dashboards
For each microservice
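A minimal sketch of how the three dashboard numbers above (QPM, latency, failure rate) could be derived for one microservice from raw request records. The `Request` record and the `summarize` function are hypothetical names used for illustration only; they are not Waze's proprietary stats protocol or the Stackdriver API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """Hypothetical per-request record for a single microservice."""
    latency_ms: float
    ok: bool


def summarize(requests, window_s):
    """Return QPM, mean latency (ms) and failure rate over a time window."""
    if not requests:
        return {"qpm": 0.0, "latency_ms": 0.0, "failure_rate": 0.0}
    failures = sum(1 for r in requests if not r.ok)
    return {
        # Queries per minute: request count scaled to a 60-second window.
        "qpm": len(requests) * 60.0 / window_s,
        "latency_ms": mean(r.latency_ms for r in requests),
        "failure_rate": failures / len(requests),
    }
```

In practice a dashboard would more likely plot latency percentiles than the mean; the mean is used here only to keep the sketch short.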
Deployment steps
Auto monitoring for each microservice of:
- Memory
- Free disk
- CPU load
Zero conf monitoring
- Data layer
- Caching
- Pubsub
- Java GC
- App and config versions
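One way to picture "zero conf" monitoring is a default set of checks that is stamped out for every microservice without any per-service configuration. The sketch below assumes this interpretation; the metric names and threshold values are invented for illustration and do not come from the talk.

```python
# Hypothetical default checks applied to every microservice.
# Metric names and thresholds are illustrative assumptions.
DEFAULT_CHECKS = {
    "memory_used_percent": 90,
    "disk_free_percent": 10,
    "cpu_load_percent": 85,
}


def default_monitors(service_names):
    """Build one monitor definition per (service, default check) pair."""
    return [
        {"service": name, "metric": metric, "threshold": threshold}
        for name in service_names
        for metric, threshold in DEFAULT_CHECKS.items()
    ]
```

A new service then gets its baseline monitors simply by appearing in the service list.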
The new monitoring
Error budgets
● SLI - Service Level Indicator
○ Error rate
○ Latency
● SLO - Service Level Objective
○ 95% Login < 300 ms
● User Journey
Services need target SLOs that capture the performance and availability levels that, if barely met, would keep the typical customer happy.
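As a sketch of how an SLI and an SLO relate, the example SLO "95% Login < 300 ms" can be checked against a list of observed login latencies. The function names are hypothetical, and treating an empty window as compliant is an assumption made here, not something stated in the talk.

```python
def sli_fraction_fast(latencies_ms, threshold_ms=300.0):
    """SLI: fraction of requests that finished under the latency threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as compliant
    good = sum(1 for v in latencies_ms if v < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, target=0.95, threshold_ms=300.0):
    """SLO: the SLI must be at least the target fraction."""
    return sli_fraction_fast(latencies_ms, threshold_ms) >= target
```

For example, 19 fast logins out of 20 give an SLI of 0.95 and just meet the objective, while 18 out of 20 miss it.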
SLO Classroom
The happiness test - Critical User Journey
“meets target SLO” ⇒ “happy customers”
“misses target SLO” ⇒ “sad customers”
30 day error budget
99.9% == 43.2 min
99.99% == 4.32 min
99.999% ≈ 26 sec
SLO in Numbers
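The numbers above follow from simple arithmetic: a 30-day window has 30 × 24 × 60 = 43,200 minutes, and the error budget is the share of that window the SLO allows to be bad. A minimal sketch (the function name is hypothetical):

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.9, 99.99, 99.999):
    minutes = error_budget_minutes(slo)
    print(f"{slo}% -> {minutes:.2f} min ({minutes * 60:.2f} sec)")
```

This gives about 43.2 minutes, 4.32 minutes and 25.92 seconds respectively; the last value is rounded to 26 seconds on the slide.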
Periodically review existing monitors
Review existing monitors and update thresholds
Remove old deprecated alerts
Verify you are monitoring the updated endpoints
Update monitors on the fly
Playbooks for alerts
Add an up-to-date playbook for each alert
Playbooks contain DEV, SRE and QA owners,
links to dashboards,
Step-by-step procedures
Links to system designs
Relevant data layers - Cassandra, DB, cache dashboards
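The playbook contents listed above can be treated as a checklist. The sketch below models a playbook as a small record and reports which parts are still empty; the `Playbook` class and its field names are hypothetical and only mirror the items on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Hypothetical playbook record; fields mirror the slide's checklist."""
    alert: str
    dev_owner: str = ""
    sre_owner: str = ""
    qa_owner: str = ""
    dashboards: list = field(default_factory=list)
    procedure: list = field(default_factory=list)
    design_docs: list = field(default_factory=list)
    data_layer_dashboards: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A periodic review job could flag any alert whose playbook returns a non-empty list from `missing_fields()`.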
Choose your battles
Three levels of alert urgency:
1. Wake up the on-call engineer
2. Open a bug
3. Send an email for debugging and root cause analysis
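The three urgency levels map naturally onto a small routing table. The sketch below is an illustration only: the `Urgency` enum and the `route` function are hypothetical, and the function returns a description of the action instead of calling a real pager, bug tracker or mail system.

```python
from enum import Enum


class Urgency(Enum):
    PAGE = 1   # wake up the on-call engineer
    BUG = 2    # open a bug
    EMAIL = 3  # send an email for debugging and root cause analysis


def route(alert_name, urgency):
    """Describe the action for an alert; real integrations would go here."""
    if urgency is Urgency.PAGE:
        return f"page on-call: {alert_name}"
    if urgency is Urgency.BUG:
        return f"open bug: {alert_name}"
    return f"email team: {alert_name}"
```

Making the urgency an explicit field on every alert forces the "choose your battles" decision at the moment the alert is created.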
THINGS I LEARNED FROM BEING A PARENT