This document discusses scaling issues with Graphite and solutions implemented at Similarweb to handle high volumes of metrics. Key points:
1) Graphite struggled with high IOPS and a single-threaded carbon-cache. Replacing carbon-cache with the multi-threaded go-carbon removed the CPU bottleneck, while moving to SSDs in a JBOD layout addressed the IOPS bottleneck.
2) carbon-relay was replaced with the faster C implementation carbon-c-relay to load balance metrics among go-carbon instances.
3) statsd was replaced with the C implementation statsite for better performance and capabilities like quantiles.
4) The final setup consisted of statsite sending to multiple carbon-c-relay and go-carbon instances, handling a peak of around 1M metric updates per minute with room to spare.
3. The problem
No metrics across the board
● Hard to debug issues
● No intuitive way to measure efficiency, usage
● Capacity planning?
● Dashboards
4. The problem
No metrics across the board
● We knew graphite
● We wanted statsd for application metrics
● And we heard that collectd is nice, so we installed it
500 physical machines
6. Graphite
"Store numeric time series data"
"Render graphs of this data on demand"
Example graphs: write throughput across our Hadoop fleet; ingress traffic to our load-balancing layer
15. Graphite
+ Remember the bottlenecks we had
● Carbon-cache reached 100% CPU on a single core (it's probably single threaded)
● Disks reached maximum IOPS capacity
17. Graphite
+ Carbon-cache
● Persists metrics to disk and serves the hot cache to graphite
● Python, single threaded
● So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister (rough sketch below)
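To make that Agent -> Cache -> Persister split concrete, here is a rough, hypothetical Go sketch of the pipeline shape (not go-carbon's actual code): an agent parses incoming plaintext lines, a cache buffers points per metric name, and a persister periodically drains the cache and writes each metric's points in one batch.

```go
// Minimal sketch of the classic Agent -> Cache -> Persister shape that
// go-carbon follows. Illustrative only, not go-carbon's code.
package main

import (
	"fmt"
	"sync"
	"time"
)

// Point is one value of a time series.
type Point struct {
	Value float64
	Time  time.Time
}

// Cache buffers points per metric name until the persister drains them.
type Cache struct {
	mu   sync.Mutex
	data map[string][]Point
}

func NewCache() *Cache { return &Cache{data: make(map[string][]Point)} }

// Add is called by the agent for every incoming metric line.
func (c *Cache) Add(name string, p Point) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[name] = append(c.data[name], p)
}

// Drain swaps the buffer out so the persister can write it without
// blocking the agent for long.
func (c *Cache) Drain() map[string][]Point {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.data
	c.data = make(map[string][]Point)
	return out
}

func main() {
	cache := NewCache()
	lines := make(chan string)

	// Agent: parse plaintext carbon lines ("metric value timestamp") and
	// push them into the cache. A real agent would read from TCP/UDP.
	go func() {
		for line := range lines {
			var name string
			var value float64
			var ts int64
			if _, err := fmt.Sscanf(line, "%s %f %d", &name, &value, &ts); err == nil {
				cache.Add(name, Point{Value: value, Time: time.Unix(ts, 0)})
			}
		}
	}()

	// Persister: periodically drain the cache and write each metric's
	// points as one batch (go-carbon writes whisper files at this step).
	go func() {
		for range time.Tick(1 * time.Second) {
			for name, points := range cache.Drain() {
				fmt.Printf("persist %s: %d point(s)\n", name, len(points))
			}
		}
	}()

	lines <- "servers.web01.cpu.user 12.5 1500000000"
	lines <- "servers.web01.cpu.user 13.0 1500000060"
	time.Sleep(1500 * time.Millisecond)
}
```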
18. Graphite
+ go-carbon
The result of replacing "carbon" with "go-carbon" on a server handling up to 900 thousand metrics per minute is shown in the go-carbon README (reference below).
Reference: https://github.com/lomik/go-carbon
21. Graphite
+ IOPS
● RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway
● SSD? Yes! But one wasn't enough :(
● Hadoop inspiration! JBOD (no RAID)
● Influx? No!
24. Graphite
+ carbon-relay
● "Load balancer" between metric producers and go-
carbon instances
● Same metric is routed to the same go-carbon instance
via a consistent hashing algorithm
● But… is a single-threaded Python app so your mileage
may vary
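As the summary above notes, carbon-relay was replaced with carbon-c-relay, a fast C implementation that does the same consistent-hash routing. Below is a minimal config sketch under assumed hostnames, ports and instance labels; the carbon_ch cluster type is what pins each metric name to one go-carbon instance (for example, one instance per JBOD disk):

```
# carbon-c-relay config sketch - hostnames, ports and instance labels
# are hypothetical; adjust to your own go-carbon fleet.
cluster gocarbon
    carbon_ch
        10.0.0.11:2003=a
        10.0.0.12:2003=b
        10.0.0.13:2003=c
    ;

# route every incoming metric to the cluster above
match *
    send to gocarbon
    stop
    ;
```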
33. Graphite
+ Statsite
● Wire compatible with statsd (drop-in replacement)
● Pure C with a tight event loop (very fast)
● Low memory footprint
● Supports quantiles, histograms and much more
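Since statsite is wire compatible with statsd, producers keep sending plain name:value|type lines and only the daemon changes. A minimal Go sketch of such a producer is below; the host, port and metric names are assumptions, and it uses TCP rather than UDP in line with the backpressure tip later in the deck:

```go
// Tiny statsd-protocol producer sketch (works against statsd or statsite).
// Host, port and metric names are hypothetical.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// TCP gives backpressure; swap "tcp" for "udp" for fire-and-forget
	// (see the UDP caveats in the tuning tips).
	conn, err := net.DialTimeout("tcp", "statsite.internal:8125", 2*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// statsd line format: <metric>:<value>|<type>
	fmt.Fprintf(conn, "web.requests:1|c\n")         // counter
	fmt.Fprintf(conn, "web.response_time:123|ms\n") // timer -> quantiles/histograms
	fmt.Fprintf(conn, "web.active_sessions:42|g\n") // gauge
}
```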
38. Graphite
+ Pros
● Beast of a graphite stack: peaked at 1M updates per minute, with room for more
● Very efficient: ~10% user-land CPU usage leaves more room for IRQs (disk, network)
● We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there
39. Graphite
+ Cons
● SSDs are still expensive and wear out quickly under heavy random-write workloads - less relevant on AWS :-)
● Bugs - custom components are somewhat less field tested
● Data is not highly available with JBOD
● Doing metrics right is demanding - go SaaS!
40. Graphite
+ Some tuning tips
● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible
● High-frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity
● High Carbon PPU (points per update) signals I/O latency
● Tune the go-carbon cache, especially if you alert on metrics (see the config sketch below)
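On that last point, the main knob is the size of go-carbon's in-memory cache: if it fills up, points queue or get dropped, and anything alerting on fresh data starts lagging. A sketch of the relevant go-carbon.conf section is below; the key names follow go-carbon's example config, so verify them against the version you run, and the values are placeholders rather than recommendations:

```
[cache]
# upper bound on buffered points (not metric names); 0 = unlimited
max-size = 1000000
# which metrics to flush first: "max" (hottest first), "sorted" or "noop"
write-strategy = "max"
```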