MO’ METRICS, MO’ PROBLEMS
Erin Willingham
Infrastructure Engineer at Krux Digital
Twitter: GreenSilex
https://www.linkedin.com/in/erin-willingham-104082126
Krux
http://www.krux.com
GRAPHITE: THEN & NOW
What works, what doesn't and why we did what we did
http://www.lowcountryafricana.com/wp-content/uploads/2015/10/Research-Plan-Chalkboard-Slate-1000px.jpg
GRAPHS
http://i.stack.imgur.com/WBsLg.png
<metric path> <metric value> <metric timestamp>
test.bash.stats.count_ps 50 1473048113
test/bash/stats/count_ps.wsp
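A minimal sketch of the line format and the path-to-file mapping shown above. The helper names are hypothetical, and `send_metric` assumes a carbon plaintext listener on localhost port 2003; adjust both for a real setup.

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value, timestamp: Optional[int] = None) -> str:
    """Build one plaintext line: "<metric path> <metric value> <metric timestamp>"."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def whisper_path(path: str) -> str:
    """Map a dotted metric path to its relative Whisper file path."""
    return path.replace(".", "/") + ".wsp"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one line to a plaintext listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

For example, `format_metric("test.bash.stats.count_ps", 50, 1473048113)` produces the sample line above, and `whisper_path("test.bash.stats.count_ps")` gives the `.wsp` path.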
statsd & collectd
relay
aggregator
graphite whisper
GRAPHITE 1.0
ARCHITECTURE
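A rough sketch of what the aggregator stage in this pipeline does conceptually: collect points for the same metric path into fixed time buckets and emit one value per bucket. This is an illustration only, not carbon-aggregator's implementation; the function name, the 60-second frequency, and the choice of sum as the method are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Point = Tuple[str, float, int]  # (metric path, value, unix timestamp)


def aggregate(points: Iterable[Point], frequency: int = 60) -> Dict[Tuple[str, int], float]:
    """Sum values per (metric path, bucket start), with buckets `frequency` seconds wide."""
    buckets: Dict[Tuple[str, int], float] = defaultdict(float)
    for path, value, timestamp in points:
        # Align the timestamp down to the start of its bucket.
        bucket_start = timestamp - (timestamp % frequency)
        buckets[(path, bucket_start)] += value
    return dict(buckets)
```

Every point for a given path has to reach the same aggregator for the bucket totals to be right, which is why the relays in this design need to know the aggregation rules.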
RULES, MERGING,
EFFICIENCY & OPERATIONS
https://s-media-cache-ak0.pinimg.com/236x/21/ba/0f/21ba0fe48349a1d5382c261ac25cb6c6.jpg
Graphite v1
Relays are aware of aggregation rules
Graphite Whisper merges metrics!
Graphite Aggregators are really efficient.
THREADING, SCALING,
RELAY CPU, & STORAGE
http://i.dailymail.co.uk/i/pix/2012/06/30/article-2166781-13BCE32D000005DC-492_634x948.jpg
Graphite v1
Python - single threaded
Relay is CPU intensive
Graphite Whisper: requires sharding and is very I/O intensive
http://obfuscurity.com/
Slow UI when using distributed remote backends
What are we trying to solve?
What is forcing the change?
http://oakdome.com/k5/lesson-plans/photo-editing/wanted-poster/wanted-reward-poster-background.jpg
Storage!
Relay & Aggregator CPU usage is high
Faster UI
KEEP COSTS LOW
http://3.bp.blogspot.com/-r9l7rltAjnM/Udq8kGlp65I/AAAAAAAAANo/VyQZN48nfMk/s1600/treasurepile.jpg
GRAPHITE ALTERNATIVES
Circonus: All the insights you ever wanted
Hosted Graphite
Zabbix: OSS self-hosted monitoring
CARBON-C-RELAY, KAFKA, SOCAT,
CARBON-RELAY-NG, KAFKACAT
https://wtfbabe.files.wordpress.com/2015/06/kung-fury-23-wtf-watch-the-film-saint-pauly.jpeg
The Tools
Carbon-c-relay
https://github.com/grobian/carbon-c-relay
GRAPHITE 2.0
TOOLS
Carbon-relay-ng
https://github.com/graphite-ng/carbon-relay-ng
GRAPHITE 2.0
TOOLS
Kafka Producer
tcp-stream-kafka-producer
https://github.com/krux/tcp-stream-kafka-producer
GRAPHITE 2.0
TOOLS
kafkacat
https://github.com/edenhill/kafkacat
GRAPHITE 2.0
TOOLS
GRAPHITE 2.0
TOOLS
socat
socat \
  "exec:/usr/bin/kafkacat -C -o end -b <kafka broker> -t <kafka topic>",pty,ctty,echo=0 \
  tcp4-connect:localhost:<relay port>
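For readers who prefer to see the same idea in code, here is a hedged Python sketch of what the socat pipeline does: read lines from a kafkacat consumer and write them to the local relay port. The function names are hypothetical. The three-field check is an addition of this sketch; socat itself forwards bytes without inspecting them.

```python
import socket
import subprocess
from typing import BinaryIO, Iterable


def forward_lines(source: Iterable[bytes], sink: BinaryIO) -> int:
    """Copy metric lines from source to sink and return how many were forwarded."""
    forwarded = 0
    for raw in source:
        line = raw.strip()
        # Keep only lines shaped like "<path> <value> <timestamp>".
        if len(line.split()) != 3:
            continue
        sink.write(line + b"\n")
        forwarded += 1
    sink.flush()
    return forwarded


def run(broker: str, topic: str, relay_port: int) -> None:
    """Hypothetical wiring: kafkacat consumer stdout into the local relay socket."""
    consumer = subprocess.Popen(
        ["/usr/bin/kafkacat", "-C", "-o", "end", "-b", broker, "-t", topic],
        stdout=subprocess.PIPE,
    )
    with socket.create_connection(("localhost", relay_port)) as conn:
        with conn.makefile("wb") as sink:
            forward_lines(consumer.stdout, sink)
```

`run` is not called here. It needs a reachable broker and a relay listening on the given port.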
BACKEND - STORAGE
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
• Whisper
• Ceres
• InfluxDB
• Cyanite
• Riak
• KairosDB
• OpenTSDB
Graphite - Whisper
InfluxDB
KairosDB
GRAPHITE 2.0
ARCHITECTURE
GRAPHITE ARCHITECTURE -
SCALABLE
http://www.dinopit.com/wp-content/uploads/2012/07/dinosaur-cowboy.jpg
Why?
LOAD TESTING THE PARTS AND THE
PIPELINE
https://github.com/feangulo/graphite-stresser
All the Metrics!
Metrics / min
WHAT WORKED?
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Pre-aggregated
Post-aggregated
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
MIRROR PRODUCTION DATA
https://c2.staticflickr.com/6/5278/5903002116_762783602c_b.jpg
UH OH!
THE GRAPHS DON’T MATCH
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Old Cluster
New Cluster
HOW DO WE FIX THIS?
http://www.startres.net/startresWP/wp-content/uploads/2013/06/3702A.jpg
TESTING CARBON-RELAY-NG
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Carbon-relay-ng uses more than 2 CPUs!
FAILURE POINT FOR
CARBON-RELAY-NG
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Post-aggregated
Pre-aggregated
Carbon-relay-ng:
room for improvement
• scale out aggregators horizontally
• monitor for metrics per second and scale out as needed
• pass metrics that don’t need to be aggregated directly to the backend
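The first and third points above could be sketched roughly as follows: metrics matching an aggregation pattern are spread across a pool of aggregators by hashing the metric path, and everything else goes straight to the backend. All names, patterns, and destinations are hypothetical, and this is not how carbon-relay-ng or carbon-c-relay implement routing.

```python
import re
import zlib
from typing import List, Pattern


def needs_aggregation(path: str, patterns: List[Pattern[str]]) -> bool:
    """True if the metric path matches any aggregation pattern."""
    return any(pattern.search(path) for pattern in patterns)


def pick_aggregator(path: str, aggregators: List[str]) -> str:
    """Pick an aggregator deterministically so one path always lands on one node."""
    index = zlib.crc32(path.encode("utf-8")) % len(aggregators)
    return aggregators[index]


def route(path: str, patterns: List[Pattern[str]],
          aggregators: List[str], backend: str) -> str:
    """Return the destination for a metric path."""
    if needs_aggregation(path, patterns):
        return pick_aggregator(path, aggregators)
    return backend  # bypass the aggregators entirely
```

Plain modulo hashing remaps most paths whenever the pool size changes, so a consistent-hash ring may be a better fit if aggregators are added often. Rules that combine several input paths into one output would need to shard on the output name instead of the input path.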
https://github.com/edenhill/kafkacat
SOLUTION
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
QUESTIONS?
