At Uber we use high cardinality monitoring to observe and detect issues across our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of ways: auto-discovering services and their usage of other systems at Uber, automatically setting up and tearing down alerts for services, sending smart alert notifications that roll up different failures into individual high-level contextual alerts, and more. We’ll also talk about how we keep a global view of our systems with M3, our open source metrics platform. Finally, we’ll take a deep dive into how we use M3DB, now available as an open source Prometheus long term storage backend, to horizontally scale our metrics platform in a cost-efficient manner with a system that’s still sane to operate at petabytes of metrics data.
1. osmc.de
November 7, 2018
Rob Skillington
Observability and M3, Uber NYC
M3
The new age of metrics and monitoring in an increasingly complex world
2. ● Working on monitoring (“observability”, “running computers”) at Uber for 3 continuous years now
● Wrote a lot of database code in the hope that I wouldn’t get paged
anymore (I don’t recommend it, but also.. please contribute!)
● Interested in reusable monitoring software, less interested in “all in one”
ecosystems
Why are you here?
3. Agenda
● Where we came from and where we
are going
● “High dimensionality metrics”
● Why use highly dimensional metrics
● Case study: Uber
● How to use M3, an open source
metrics platform
5. ● Nagios
○ For example, check a local HTTP service (but similar for many things):
#!/bin/sh
# check_http flags: -I target IP, -p port, -s string expected in the response,
# -w/-c warning/critical response time thresholds (seconds)
/usr/lib/nagios/plugins/check_http \
  -I 127.0.0.1 \
  -p 5390 \
  -s OK \
  -w 90 \
  -c 180
Where we came from
7. ● A simple way to interpret Nagios:
○ Execute scripts in a distributed way
○ View summary information of script execution success across
machines
○ Drill down to view logs of single script execution
○ A way to send alerts
What is this really?
8. PromQL for “alert me when more than N kafka local agents are unhealthy”
Wouldn’t it be nice? (adjust on the fly)
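For illustration, a minimal PromQL sketch of such an alert expression; the metric name kafka_agent_healthy (an assumed gauge where 1 = healthy) and the threshold are placeholders, not Uber’s actual metrics:

# fire when more than 3 local Kafka agents report unhealthy
count(kafka_agent_healthy == 0) > 3

In Prometheus this expression would typically live in an alerting rule, where the threshold can be adjusted on the fly without touching the probes themselves.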
9. ● Simple way to execute the actual checks and collect these metrics
○ https://github.com/google/cloudprober
“Integration with open source monitoring stack of Prometheus and Grafana. Cloudprober
exports probe results as counter based metrics that work well with Prometheus and
Grafana.”
○ What about non-HTTP? (but use HTTP where possible)
“Arbitrary, complex probes can be run through the external probe type. For example, you
could write a simple script to insert and delete a row in your database, and execute this
script through the 'EXTERNAL' probe type.”
Can we do this today?
10. Wouldn’t it be nice? (slice and dice by dimensions, look back 30 days)
“Any dimension?” well.. within reason
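As a sketch of what slicing and dicing over a 30 day window can look like in PromQL (the metric and label names here are assumptions for illustration):

# 5xx responses over the last 30 days, broken down by region and endpoint
sum by (region, endpoint) (
  increase(http_requests_total{status=~"5.."}[30d])
)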
11. ● Metrics are great for getting a 10,000 ft view of your applications and systems
● Also becoming increasingly valuable for drill downs when using “high
cardinality” metrics
○ Example: P99 of requests broken down by
■ Region (US_East, EU_West, etc)
■ Endpoint
■ User segment (client attributes, device, operating system, etc)
Wouldn’t it be nice? (to have metrics “for everything”)
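A sketch of the P99 drill-down described above, assuming a Prometheus-style histogram named http_request_duration_seconds with region, endpoint and device labels:

# P99 request latency broken down by region, endpoint and device
histogram_quantile(0.99,
  sum by (le, region, endpoint, device) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)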
12. ● Plug and play compatibility for open source systems becoming common
Wouldn’t it be nice? (to have metrics “for every type of thing”)
Redis Prometheus Grafana Dashboard
13. ● Going from a metrics graph directly to logs
○ Drill down into discovered issues from graphs or double check 5XXs
match a theory
● Today we have to pivot and move to our logging platform and reconstruct
a query that should surface events similar to the metrics query
Wouldn’t it be nice? (easy navigability to detailed logs)
17. ● Take a single metric, such as status code delivered by our frontends
○ http_status_code
● At Uber, we have a few hundred important HTTP routes whose status codes we want
to drill down into by the following dimensions (see the PromQL sketch after this slide):
○ Route (few hundred)
○ Status code (less than a few common status codes)
○ Places of operation (few hundred)
■ failures sometimes isolated to place of operation
○ App device version (few hundred)
High dimensionality metrics
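Drilling into such a metric in PromQL could look like the following sketch; it treats http_status_code as a counter, and the label names and values are placeholders rather than Uber’s actual schema:

# response rate per route and status code for one city and app version
sum by (route, status_code) (
  rate(http_status_code{city="berlin", app_version="1.2.3"}[5m])
)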
18. ● You can roll up metrics to make viewing fast
○ View status codes by route, but across all regions for single app
version
● However, it is still incredibly high dimensional, just to store
○ Routes (say 500) * Status code (say 5) * Region (say 5) * Client version
(say 20)
= 250,000 unique time series
○ Expensive but not too bad..? However add any other dimensions and
it gets out of control (any multiplier on 250k explodes to millions quickly)
High dimensionality metrics
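The rollup described above, as a PromQL sketch that aggregates the region dimension away (label names again assumed); in practice such a rollup can also be precomputed, for example as a recording rule, to keep viewing fast:

# status codes by route for a single app version, across all regions
sum without (region) (
  rate(http_status_code{app_version="1.2.3"}[5m])
)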
20. ● Really useful for drill downs when monitoring lots of distinct things that can
fail independently of each other, even if they are the same software but
running with subtly different attributes (which is an increasing trend)
○ e.g. different region, configuration, client/server versions, upstream or downstream,
protocol, attributes of a request such as “type of a search” or “search in a category”, etc.
● Time series are stored and compressed much more efficiently than log data, so you
can keep much more history when records share common values. They are also
much faster to fetch, visualize, and sub-aggregate together.
Why use high cardinality metrics?
22. ● A manageable way to scale out capacity
○ A single Prometheus instance can hold N million time series
(where N does not usually reach the double digits, even on very big machines)
Ok great but what do I need?
“We good
yeah?”
23. ● It’s a little like playing Tetris: can I fit my high cardinality metrics into an
existing Prometheus instance, or do I set up a new one? And what happens
when a workload eclipses a single instance?
Ok great but what do I need?
25. ● ~4K Microservices with a central Observability self-service platform
● Onboarding? None, simply start emitting metrics
● Used for all manner of things:
○ Real-time alerting using application metrics (e.g., p99 response time)
○ Tracking business metrics (e.g., number of Uber rides in Berlin)
■ Why? So easy to get started
■ metrics.Tagged(Tags{"region": "berlin"}).Counter("ride_start").Inc(1)
○ Capacity planning using system metrics (e.g., container load average)
○ And much more …
Uber and the M3 metrics platform
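To close the loop on the business metrics example above, the emitted counter can be queried back out with PromQL along these lines (a sketch; the metric and tag names come from the snippet, the rest is illustrative):

# ride starts per second in Berlin over the trailing 5 minutes
sum(rate(ride_start{region="berlin"}[5m]))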
28. [Architecture diagram: layers of a typical service - End User Code (“Biz logic”), RPC, Storage (C*/Redis/…), ...]
29. [Same diagram - End User Code (“Biz logic”), RPC, Storage (C*/Redis/…), ... - each layer’s library emits metrics]
Library owners:
- Dashboard panel template = f(serviceName)
- Ensure the library emits metrics following the given template
Application devs:
- Service: uses the library
- Provide “serviceName” at dashboard generation time
34. Scale - Ingress
400-700M Pre-aggregated Metrics/s
(~130Gbits/sec)
(random week when I was making these slides)
35. Scale - Ingress
~29M Metrics Stored/s
(not including Replica Factor = 3)
(~50Gbits/sec)
(random week when I was making these slides)
36. Scale - Ingress
~ 9B Unique Metric IDs
(trending down due to metrics falling
out of retention)
(random week when I was making these slides)
37. Scale - Egress
~ 30B data points per second
(~20 gigabits/sec)
(services 150,000 scheduled “realtime” alerts)
(random week when I was making these slides)
39. ● The best onboarding of a team or service, is no onboarding
○ Just start emitting metrics from an application or probe
○ Which is fine if you can just expand your central metrics cluster in an easy way
● The best operators of computers, is computers
○ Scripts don’t cut it after someone runs them the wrong way N times, where N
is your pain threshold for correcting the output of scripts triggered by humans
● The holy grail, of course, is a system that looks after itself (for the most part)
Needed turnkey way to just “add capacity”
40. ● 2015 - 2016: M3 with Cassandra
○ Up to 80k writes/sec, poor compression
○ Replication but only RF=2
■ RF=3 too expensive, already very costly with RF=2
■ Repairs too slow to be feasible
○ Required expensive SSDs and was hungry on CPU
■ Heavy compactions even with DTCS required top end CPU, RAM, disks
○ 16X YoY growth resulted in > 1500 hosts
■ Cost + Operational overhead not sustainable
Brief History of M3
41. ● 2016 - Now: M3 with M3DB
○ Up to 500k writes/sec, 10x compression vs Cassandra using much less disk
■ This is while using significantly cheaper CPU and disk
○ Reduced cost of running M3 to 2.5% of all hardware/compute in use by Uber
■ This is with little-to-no restrictions on all usage of M3 amongst teams
○ Continually making improvements to further reduce footprint and increase
density in terms of throughput and on disk space usage
Brief History of M3
42. M3DB: High-Level Architecture
● Think Log Structured Merge tree (LSM).. but with much less compaction
● Typical LSM will have levelled or size based compaction; M3DB has time window
compaction, which by default avoids any data compaction (unlike Cassandra time
window compaction)
43. M3DB - Design Principles
● Optimized for high read and write volume per machine
○ vs. those built on large data storage tech (e.g. OpenTSDB/HBase)
● Optimized to reduce cost for a large metrics warehouse (custom compression, etc)
○ vs. more generic IoT/SQL-based TSDBs (e.g. TimescaleDB)
○ (if you want the most instrumentation/insight for your cost, this matters)
● Strongly consistent cluster membership backed by etcd with quorum writes
○ vs. eventual in Cassandra or IronDB
○ (avoid putting up with Gossip hassles with hundreds of machines)
● Low-Latency for metrics “Time To Visible” as we alert off our persisted data
○ Optimized for smaller batch size writes vs larger (InfluxDB)
44. ● Support for high cardinality queries over large data sets (petabytes)
○ Data cannot live in “lukewarm” storage (Thanos uses S3; Cortex uses a variety of
backends for its index; Cassandra doesn’t support full text search, so that is done
client side, which works poorly with a very large index)
● Optimized file-system storage with no need for compactions
○ We have fairly consistent ingestion load
○ We don’t downsample async due to added cost - do more with less!
● Optimized for node startup write availability and self-recovery after being down
○ Larger clusters = more HW failures
○ Need to optimize cluster upgrades with new tech
○ Better for compute platforms
M3DB - Design Principles
47. [Architecture diagram: M3 - OSS (beta) - Applications, M3 Collector, M3 Aggregator, M3DB, M3 Query, PromQL (beta), Grafana/other alert engines]
● Limited guides on how to use this configuration currently
48. H1 2019
● Finish new dedicated scalable open source M3Query service for PromQL and
Graphite
● Open source support for Graphite metrics
● Better guides for M3 Aggregator and M3 Collector usage
● Kubernetes Operator
● CNCF Sandbox Submission
H2 2019
● You tell us…. on Gitter (gitter.im/m3db) or mail (m3db@googlegroups.com) or
GitHub issues (github.com/m3db/proposal/issues)
M3 - Roadmap
50. You can also find us...
● On Google Hangout / Zoom (because video internet doesn’t suck all the time
anymore)
● Book some time at https://m3.appointlet.com