Metrics at scale @UBER
Mantas Klasavičius

Metrics at Scale @ UBER (Mantas Klasavicius Technology Stream)

Lviv IT Arena is a conference designed for programmers, designers, developers, top managers, investors, entrepreneurs and startuppers. It takes place annually at the beginning of October in Lviv, at the Arena Lviv stadium. In 2016 the conference gathered more than 1800 participants and over 100 speakers from companies like Microsoft, Philips, Twitter, UBER and IBM. More details about the conference at itarena.lviv.ua.

Published in: Technology

  1. Metrics at scale @UBER
     Mantas Klasavičius
  2. About Me
     Senior software engineer @ Uber
  3. About Me
     Senior software engineer @ Uber
     <metric_path> <value> <timestamp>
  4. UBER
     - 6 continents
     - 72 countries
     - 425 cities
     - >5 million a day
     - >1000 engineers
     - 7 years
  5. UBER in Vilnius
     - 3y ago
     - >20 engineers
     - 4 Teams:
       - Observability
       - Databases
       - Foundations
       - DevExp
  6. Hypergrowth defines us...
     Growth of Services
  7. Metrics
     - Metrics @UBER is a first class citizen
     - T0 Service
     - Handling ~500M telemetry timeseries
     - Writing ~3M values/sec and running ~1K queries/sec
     - 50M minutes worth of data per sec
     - Growing >25% month over month
  8. Metrics Collection: Graphite ~2013
  9. Metrics Collection: Graphite 2015
  10. Metrics Collection: Considered choices
      - Netflix Atlas
      - Blueflood
      - Update graphite
  11. Metrics Collection: M3
  12. Metrics Collection: M3
  13. Metrics Collection
      Cassandra is a figure of epic tradition and of tragedy.
      - High write throughput
      - Cassandra data model supports time series data-store - DTCS (DateTieredCompactionStrategy)
      - Cassandra's native TTL support
  14. Metrics Collection: Cassandra - our use case
      - Separate clusters for different types of data
      - Clusters span multiple datacenters
      - Dynamically control to which cluster data is written
      - Forcibly deleting old data
      https://github.com/m3db/m3db/
  15. Metrics Collection: Metrics as free resource
      *.application_1431728998581_0361.*
      *.Connections.10_30_3_24.0x64d11081baa1837.*
      *.ply_1b09f59b-a3cf-4b9a-99b4-93e8eb16722c.*
      *.check-<uid_or_uuid>.*
  16. Metrics Collection: Cost accounting and metrics about metrics
  17. Metrics Visualization: M3 - Querying
  18. Metrics Visualization: Grafana
  19. Observability: Past, Present, and Future
      Metrics Visualization: M3QL - Query Like It’s Bash
      Pipe form:
          aggregate = fillNulls target | sum;
          fetch name:requests.errors caller:cn | aggregate
            | asPercent (fetch name:requests caller:cn | aggregate)
            | anomalies | sort max | tail 10
      Nested function form:
          tail(
            sort(
              anomalies(
                asPercent(
                  sum(fillNulls(stats.counts.cn.*.requests.errors)),
                  sum(fillNulls(stats.counts.cn.*.requests))
                ),
                max
              )
            ),
            10
          )
  20. Metrics Visualization: Graphite Way vs. M3QL
  21. Alerting based on metrics: Query Based Alerting
      graphite.absolute_threshold(
          'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
          alias='velocity filter blocked requests',
          warning_over=0.1,
          critical_over=10.0,
      )
  22. Alerting based on metrics: Classic Thresholding
      Classic high / low thresholds have some intrinsic problems.
      • Labor-intensive: each threshold is hand-tuned and manually updated.
      • Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
      • Not sensitive enough: thresholds take a long time to catch slow degradations.
      • Poor UX: configuring really good alerts requires specialized knowledge of the query language.
      • No guidance: system doesn’t offer automated root cause exploration.
  23. Alerting based on metrics: Intelligent Monitoring
      • Zero config: thresholds are set and maintained automatically.
      • Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
      • Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
      • Integrated UX: work within our existing telemetry and alert configuration systems.
      • Helpful: automated root cause analysis.
      In short, the only input is a list of business-critical metrics.
  24. Alerting based on metrics: Dynamic Thresholds
      The max lower threshold exceeds the min upper threshold
  25. Alerting based on metrics: Outage Detection
      - < 1% outages missed.
      - 6.5 out of 10 alerts are true issues.
  26. Alerting based on metrics: F3
      stats.foo
      anomalies(stats.foo)
  27. On-Call Dashboard
  28. We are hiring!
      mantas@uber.com
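
Slide 3 shows the line format `<metric_path> <value> <timestamp>`, which matches Graphite's plaintext protocol, and slide 15 lists metric paths that embed per-instance identifiers (application IDs, connection handles, UUIDs). A minimal Python sketch of both ideas follows. The function names and the regular expressions are illustrative assumptions, not code or rules from the talk.

```python
import re
import time

# Heuristic patterns for path components that embed per-instance identifiers.
# These regexes are assumptions modelled on the slide 15 examples.
UNBOUNDED_PATTERNS = [
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),  # UUID
    re.compile(r"0x[0-9a-f]{8,}"),        # long hexadecimal handle
    re.compile(r"\d{1,3}(_\d{1,3}){3}"),  # IPv4 address written with underscores
    re.compile(r"\d{10,}"),               # long numeric ID, e.g. a millisecond epoch
]


def format_line(path, value, timestamp=None):
    """Render one sample as '<metric_path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}"


def parse_line(line):
    """Split a plaintext line back into (path, value, timestamp)."""
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


def unbounded_components(path):
    """Return the dot-separated components that look like per-instance IDs."""
    return [
        part
        for part in path.split(".")
        if any(pattern.search(part) for pattern in UNBOUNDED_PATTERNS)
    ]
```

For example, `unbounded_components("jobs.application_1431728998581_0361.runtime")` flags only the middle component; the `jobs` prefix and `runtime` suffix are made up here, because the slide elides them with wildcards. A check like this could feed the cost accounting mentioned on slide 16, though the talk does not say how that accounting is implemented.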

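Slides 21 to 24 contrast fixed `warning_over` / `critical_over` thresholds with thresholds that adjust automatically. The Python sketch below illustrates the difference. The talk does not describe its dynamic-threshold algorithm, so the mean plus or minus k standard deviations band used here is a simple stand-in chosen for illustration, and all function names are hypothetical.

```python
from statistics import mean, stdev


def absolute_threshold(value, warning_over, critical_over):
    """Classify one value against fixed upper thresholds (slide 21 style)."""
    if value > critical_over:
        return "critical"
    if value > warning_over:
        return "warning"
    return "ok"


def dynamic_band(history, k=3.0):
    """Return (lower, upper) as mean -/+ k sample standard deviations.

    Illustrative stand-in only; not the algorithm used at Uber.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def is_anomalous(value, history, k=3.0):
    """Flag a value that falls outside the band derived from recent history."""
    lower, upper = dynamic_band(history, k)
    return value < lower or value > upper
```

With `warning_over=0.1` and `critical_over=10.0` as on slide 21, `absolute_threshold(0.5, 0.1, 10.0)` returns `"warning"`. The band version needs no hand-tuned numbers, but a plain mean and deviation does not handle seasonality or rollouts, which slide 23 lists as requirements.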