
Monitoring lessons from the Waze SRE team



StatsCraft 2019 talk session in Tel Aviv, from Yonit Gruber-Hazani, Waze SRE team member.
The full YouTube session is available here: https://www.youtube.com/watch?v=iSs4lTrUyI8
The talk is mainly in Hebrew.

Published in: Engineering

  1. Monitoring lessons from the Waze SRE team, by Yonit Gruber-Hazani
  2. A little about me, Yonit Gruber-Hazani: Helpdesk → MS admin → Linux admin → Production manager (Linux) → DevOps engineer (Linux) → SRE (Linux)
  3. A little about me, Yonit Gruber-Hazani
  4. A little about me, Yonit Gruber-Hazani
  5. What we will go through: about Waze, my team and Waze's technical structure; monitoring, alerting and complexity; the new monitoring direction; and our best practices (the ones that work for us).
  6. Waze in numbers: 130M monthly active users, 500K map editors, 80M API calls per day.
  7. Outsmarting traffic together: thousands of instances, hundreds of autoscaling groups, 2 PB of Cassandra data on ~2,000 Cassandra instances.
  8. Waze SRE team: we build and operate the Waze infrastructure. We're part of Google, but autonomous and running on top of public clouds. 21 team members across the globe.
  9. Waze structure
  10. Waze microservices, multi-cloud. On GCP: Java microservices on Compute Engine, App Engine and Container Engine, a cache layer of Memcached and Redis, and a database layer of Cassandra, Spanner and Cloud SQL. On AWS: Java microservices on containers, EC2 and Lambda, the same Memcached/Redis cache layer, and a database layer of Cassandra and RDS.
  11. Spinnaker
  12. Waze microservices
  13. Waze microservices: proprietary communications protocol
  14. Geographical sharding. Production-critical services are split into dozens of geographical shards, from microservice regions down to microservice datacenters and countries (Israel, North America, Asia Pacific, Europe, South America). This spreads the load and reduces the blast radius. Several logical data centers are split across 3 regions.
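The sharding idea on the slide can be sketched as a routing function: pick a shard for a request by its country's region, then spread load within the region. This is a minimal sketch; the region table, shard count and shard naming are all hypothetical, not Waze's actual routing.

```python
# Hypothetical country-to-region table for illustration.
REGION_OF = {
    "IL": "israel",
    "US": "north-america",
    "BR": "south-america",
    "FR": "europe",
    "JP": "asia-pacific",
}

def shard_for(country_code: str, shards_per_region: int = 4) -> str:
    """Route to a region, then spread within it by a stable hash.

    Keeping traffic region-local spreads the load, and a stable
    per-country hash keeps one shard's outage to a small blast radius.
    """
    region = REGION_OF.get(country_code, "europe")  # arbitrary fallback
    index = sum(map(ord, country_code)) % shards_per_region
    return f"{region}-{index}"
```

The hash is deterministic, so the same country always lands on the same shard, which is what makes per-shard dashboards and rollouts meaningful.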
  15. Daily driving trends (chart, Waze US data, 2017): traffic peaks around 8am and 5pm.
  16. In the beginning there was Nagios
  17. Managed monitoring API service. What did we look for? A managed monitoring service; an API for metrics collection, dashboard and policy creation; support for our scale and growing monitoring needs; multi-cloud support. We chose Stackdriver.
  18. How do you deploy monitoring on a planet scale? Baby steps.
  19. Deployment steps, for each microservice: aggregate our proprietary protocol stats from a central location; create basic dashboards showing QPM, latency and failure rate; and add to the dashboards metrics from the cloud providers, GCP and AWS.
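The three basic dashboard signals per microservice (QPM, latency, failure rate) can all be derived from a per-minute rollup of request outcomes. The sketch below is a generic illustration of that rollup, not Waze's proprietary stats protocol or the Stackdriver API; all names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceStats:
    """One-minute rollup of request outcomes for a single microservice."""
    requests: int = 0
    failures: int = 0
    latencies_ms: list = field(default_factory=list)

    def record(self, latency_ms: float, ok: bool) -> None:
        """Record one request's latency and success/failure."""
        self.requests += 1
        if not ok:
            self.failures += 1
        self.latencies_ms.append(latency_ms)

    def qpm(self) -> int:
        # The window is one minute, so the raw count is queries-per-minute.
        return self.requests

    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0

    def p95_latency_ms(self) -> float:
        # Nearest-rank 95th percentile over the minute's latencies.
        xs = sorted(self.latencies_ms)
        return xs[int(0.95 * (len(xs) - 1))] if xs else 0.0
```

In a real pipeline each minute's rollup would be flushed to the monitoring backend as three time series; the dashboard then just plots them per microservice.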
  20. Zero-conf monitoring: automatic monitoring for each microservice of memory, free disk, CPU load, the data layer, caching, pub/sub, Java GC, and app and config versions.
  21. Removing the monitoring bottleneck from our team
  22. What about alerting? Free disk space, autoscaling groups at max, too many failed instances in a group, CPU overloaded, free memory.
  23. Herbert A. Simon: "What information consumes is rather obvious: it consumes the attention of its recipients."
  24. Complexity
  25. What's in our dashboards
  26. What's in our dashboards: server stats
  27. What's in our dashboards: client services
  28. What's in our dashboards: dependencies
  29. What's in our dashboards: data layer
  30. What's this service anyway?
  31. The new monitoring: error budgets. SLI (Service Level Indicator): error rate, latency. SLO (Service Level Objective): e.g. 95% of logins under 300 ms. Critical user journeys. From the SLO classroom: services need target SLOs that capture the performance and availability levels that, if barely met, would keep the typical customer happy.
  32. The happiness test, for a critical user journey: "meets target SLO" ⇒ happy customers; "misses target SLO" ⇒ sad customers.
  33. SLO in numbers, as a 30-day error budget: 99.9% == 43.2 min, 99.99% == 4.32 min, 99.999% == ~26 sec.
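The numbers on the slide all follow from one formula: the error budget is the window length multiplied by the allowed failure fraction, 1 − SLO target. A quick check in Python:

```python
def error_budget_seconds(slo_target: float, window_days: int = 30) -> float:
    """Seconds of allowed unavailability per window for a given SLO target."""
    window_seconds = window_days * 24 * 60 * 60  # 30 days = 2,592,000 s
    return window_seconds * (1.0 - slo_target)

# Reproduce the slide's table: 43.2 min, 4.32 min, ~26 sec.
for target in (0.999, 0.9999, 0.99999):
    budget = error_budget_seconds(target)
    print(f"{target:.3%} -> {budget / 60:.2f} min ({budget:.0f} s)")
```

Note that 99.999% over 30 days comes out to 25.92 seconds, which the slide rounds to 26.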
  34. Best practices
  35. Replace alerts with automations: increase the max for autoscaling groups, add disks, replace failed instances with healthy ones, and remove all single "pet" servers.
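The "automation instead of alert" pattern can be sketched as follows: rather than paging when instances fail health checks, terminate them and let the autoscaling group launch replacements. The function below is illustrative only; `is_healthy` and `replace` stand in for real cloud-provider API calls.

```python
def remediate(instances, is_healthy, replace):
    """Replace every unhealthy instance instead of alerting on it.

    instances:  iterable of instance identifiers
    is_healthy: callable(instance) -> bool, e.g. a health-check probe
    replace:    callable(instance), e.g. terminate-and-let-ASG-relaunch
    Returns the list of instances that were replaced (for logging).
    """
    replaced = [i for i in instances if not is_healthy(i)]
    for instance in replaced:
        replace(instance)  # the autoscaling group brings up a fresh one
    return replaced
```

Running such a loop on a schedule turns the "too many failed instances in group" page into a log line, which is exactly the point of the slide.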
  36. Blameless postmortems (really blameless): What happened? Why did it happen? How was it solved? Did the monitoring work? What worked well? What didn't?
  37. Action items, post-postmortem: file a bug for each action item from the postmortem, with an owner for every bug.
  38. Periodically review existing monitors: update thresholds, remove old deprecated alerts, verify you are monitoring the updated endpoints, and update monitors on the fly.
  39. Playbooks for alerts: add an updated playbook for each alert. Playbooks contain the DEV, SRE and QA owners, links to dashboards, step-by-step procedures, links to system designs, and the relevant data layers (Cassandra, DB and cache dashboards).
  40. Clean your signals: noisy signals cannot be monitored.
  41. Choose your battles (things I learned from being a parent). Three levels of alert urgency: 1. wake up an on-call; 2. open a bug; 3. send an email for debugging and root-cause searching.
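The three urgency levels reduce to a small routing rule: page only when users are hurting right now, otherwise file a bug or just send an email. The predicate names below are assumptions for illustration, not Waze's actual triage criteria.

```python
def route_alert(user_impact: bool, needs_fix: bool) -> str:
    """Pick the least disruptive channel that still gets the alert handled."""
    if user_impact:
        return "page"   # users are affected now: wake up the on-call
    if needs_fix:
        return "bug"    # real issue, but it can wait for working hours
    return "email"      # informational: debug and root-cause when convenient
```

The ordering matters: every alert must clear the "would I wake someone for this?" bar before it is allowed to page.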
  42. Thank you!
