Graphite at CityGrid
if you can’t measure it, you can’t fix it
Wil Heitritter
Director, Tech Ops
Los Angeles DevOps
2014/04/28
The philosopher will prove that the sun is large;
the mathematician, how large it is.
-Seneca
Objectives
- Introduce Graphite to new users
- Show what we like, what we hate
- Present some interesting use-cases
- Generate discussion
Before Graphite
Ganglia
• Predictable interface
• Text “metrics” to store versions
• Slow
• Couldn’t pick and choose metrics to see
Why Ganglia sucked
- Clusters had to be pre-configured
- Multicast vs. Unicast
- Data Retention
- Static Web Interface (can’t pick and choose)
- Static Host List
What did we think we wanted?
Ease of adding metrics
Ease of sending metrics
Powerful metric display
Retain ganglia-style cluster dashboards
Long-term configurable metric retention
Graphite!
What is Graphite?
a highly scalable real-time graphing system
which collects numeric time-series data
that is managed by carbon
and stored as whisper files
and visualized through web interfaces
or queried via the API
http://graphite.wikidot.com/
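To make "queried via the API" concrete: graphite-web serves a render endpoint that takes `target`, `from`, and `format` parameters. The sketch below only builds such a URL. The host `graphite.example.com` and the metric name are placeholders, not CityGrid's setup.

```python
from urllib.parse import urlencode


def render_url(base, target, since="-1h", fmt="json"):
    """Build a graphite-web render URL for a single target expression."""
    # urlencode escapes characters such as '(' and '*' used in Graphite functions.
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{base.rstrip('/')}/render?{query}"


# Hypothetical host and metric, for illustration only.
url = render_url("http://graphite.example.com", "business.test.metric1")
print(url)
```

The same helper accepts function expressions as the target, which is how the "many functions()" are reached over HTTP.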
Graphite: what we like
Sending metrics is simple
Retrieving metrics is simple
Dashboard creation and sharing… is simple
Many functions()
120 million+ metric values received daily
Backfilling past metrics is simple
Expandable - different frontends
Graphite: what sucks
Dashboard ownership/promotion
No ganglia-like standard dashboard
Data retention… is NOT as simple as we thought
CityGrid’s Graphite Implementation
Metric Naming
Business Metrics
- These metrics are not tied to a particular server
- Format:
business.${hierarchical}.${path}.${here}.${metric}
- Example:
business.ec2.testaccount.us-east-1a.OnDemand.running.m2.4xlarge
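A minimal sketch of the business naming format, assuming names are built by joining hierarchy components with dots. The helper `business_metric` is hypothetical and not part of Graphite.

```python
def business_metric(*path):
    """Join hierarchy components into a business.* metric name."""
    if not path or any(not part for part in path):
        raise ValueError("every path component must be non-empty")
    return ".".join(("business",) + tuple(path))


name = business_metric(
    "ec2", "testaccount", "us-east-1a", "OnDemand", "running", "m2.4xlarge"
)
print(name)
# Graphite treats every dot as a hierarchy separator, so "m2.4xlarge"
# occupies two levels of the tree ("m2", then "4xlarge").
print(len(name.split(".")))
```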
Metric Naming
Server Metrics
- These metrics are specific to a particular server (just like Ganglia)
- Format:
servers.${class}.${f_q_d_n}.${metric}
- Example:
servers.rvw.aws1prdrvw1_subdom_cityg_com.LW_api_reviews_QPS
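The `${f_q_d_n}` placeholder and the example suggest that dots in the hostname are replaced with underscores, so the host stays a single level in the metric tree. A sketch under that assumption follows. The helper `server_metric` is hypothetical, and the dotted hostname is inferred from the example above.

```python
def server_metric(host_class, fqdn, metric):
    """Build a servers.* metric name, flattening the FQDN into one path level."""
    # Dots would otherwise split the hostname into several hierarchy levels.
    safe_host = fqdn.replace(".", "_")
    return f"servers.{host_class}.{safe_host}.{metric}"


name = server_metric("rvw", "aws1prdrvw1.subdom.cityg.com", "LW_api_reviews_QPS")
print(name)
```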
Sending metrics
Sending directly from metric scripts
- /etc/graphite.conf
- May need to stagger sends when sending in volume
Collecting from gmond every minute
- Metrics are spread out to prevent spiking
- Can report false (stale) data, since gmond acts as a cache
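One way to spread sends across the minute is to give each host a stable offset derived from its name. This is an illustrative approach, not necessarily the one used at CityGrid. The helper name and hostnames are hypothetical.

```python
import zlib


def send_offset(hostname, interval=60):
    """Return a stable delay in seconds, in [0, interval), for this host."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(hostname.encode("utf-8")) % interval


# Hypothetical hosts: each one sleeps for its own offset before sending.
for host in ("web1.example.com", "web2.example.com", "db1.example.com"):
    print(host, send_offset(host))
```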
Impact of staggered sending
Sending is simply...
echo "$metric $value $timestamp" | nc "$relay" "$port"
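The same one-liner can be sketched in Python using carbon's plaintext protocol, which is one `name value timestamp` line per metric. The function names are illustrative, and the send call is commented out because it needs a listening relay.

```python
import socket
import time


def format_line(metric, value, timestamp=None):
    """Format one metric as a plaintext-protocol line: 'name value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {int(timestamp)}\n"


def send_metric(relay, port, metric, value, timestamp=None):
    """Open a TCP connection to the relay and send a single metric line."""
    line = format_line(metric, value, timestamp)
    with socket.create_connection((relay, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


print(format_line("business.test.metric1", 1, 1398700800), end="")
# Requires a running carbon relay or cache; 2003 is carbon's default plaintext port.
# send_metric("localhost", 2003, "business.test.metric1", 1)
```

Because the timestamp is explicit, backfilling amounts to sending an older timestamp, provided it falls within the metric's retention window.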
Performance
carbon-cache/carbon-relay
SSD
replication within minutes
Maintenance
Changing retention
- whisper-auto-resize.py
Filling holes
- whisper-fill $source $destination
Backups
- Dashboards
- Metrics
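Part of why changing retention takes work is that whisper files are fixed-size and preallocated, so a new retention policy means resizing every existing file. The sketch below estimates file size from a retention string, assuming whisper's on-disk layout of a 16-byte header, 12 bytes per archive, and 12 bytes per data point. The retention string `60s:30d,5m:1y` is a made-up example, not CityGrid's policy.

```python
# Seconds per unit; a year is counted as 365 days here.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(text):
    """Convert strings such as '60s', '5m', '30d', or '1y' to seconds."""
    return int(text[:-1]) * UNITS[text[-1]]


def archive_points(retention):
    """Return (seconds_per_point, number_of_points) for 'precision:duration'."""
    precision, duration = retention.split(":")
    step = to_seconds(precision)
    return step, to_seconds(duration) // step


def whisper_file_size(retentions):
    """Estimate bytes on disk for one metric with the given retention policy."""
    archives = [archive_points(r) for r in retentions.split(",")]
    return 16 + 12 * len(archives) + 12 * sum(points for _, points in archives)


size = whisper_file_size("60s:30d,5m:1y")
print(size)
```

Multiplying the per-file size by the number of metrics gives a rough disk budget before a resize.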
Graphite Use-Cases
Single Metric
Combined Metrics
Key Metrics Dashboard
Examples of Key Metrics
- QPS
- Processing Time (Max/Mean/Distribution)
- Metrics about sub-requests
- Network usage
- CPU/load
Key Metrics Dashboard
Nagios Integration
check_graphite_target!highestMax(
servers.mai.@HOSTNAME@.LW_map_return_code_5*_ratio,
1
)!5!10
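The check above can be read as "take the series with the highest maximum and compare it against two thresholds". Treating `5` and `10` as the warning and critical levels is an assumption about how `check_graphite_target` is defined. The sketch below shows only that comparison, applied to datapoints in the `[value, timestamp]` shape that the render API returns as JSON.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(datapoints, warn, crit):
    """Map the highest non-null value in a series to a Nagios exit code."""
    values = [value for value, _ts in datapoints if value is not None]
    if not values:
        # No data at all is reported as UNKNOWN rather than OK.
        return UNKNOWN
    peak = max(values)
    if peak >= crit:
        return CRITICAL
    if peak >= warn:
        return WARNING
    return OK


# Made-up datapoints for illustration.
status = check_series([[1.0, 1398700800], [None, 1398700860], [6.5, 1398700920]], 5, 10)
print(status)
```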
How about Pie Charts?
Ad-Hoc Dashboards
Demo
What NOT to do
Trying it out for yourself
Quick Setup
Install & Start
# pip install https://github.com/graphite-project/ceres/tarball/master
# pip install whisper
# pip install carbon
# pip install graphite-web
start it up...
send it a metric:
echo business.test.metric1 1 `date "+%s"` | nc localhost 2003
OK, it’s almost that easy...
Discussion
Graphite at CityGrid - LA DevOps April 2014
