5. Why ganglia sucked
- Clusters had to be pre-configured
- Multicast vs. Unicast
- Data Retention
- Static Web Interface (can’t pick and choose)
- Static Host List
6. What did we think we wanted?
Ease of adding metrics
Ease of sending metrics
Powerful metric display
Retain ganglia-style cluster dashboards
Long-term configurable metric retention
8. What is Graphite?
a highly scalable real-time graphing system
which collects numeric time-series data
is managed by carbon
and stored as whisper files
and visualized through web interfaces
or queried via the API
http://graphite.wikidot.com/
9. Graphite: what we like
Sending metrics is simple
Retrieving metrics is simple
Dashboard creation and sharing… is simple
Many functions()
120MM+ metric values received daily
Backfilling past metrics is simple
Expandable - different frontends
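A minimal sketch of why sending, backfilling and retrieving can be called simple, assuming carbon's plaintext listener (default TCP port 2003) and the Graphite render API. The function names and the host names in the usage note are hypothetical and not taken from the slides.

```python
import socket
import time
from urllib.parse import urlencode


def format_metric(path, value, timestamp=None):
    """Return one plaintext-protocol line: path, value, epoch seconds, newline."""
    if timestamp is None:
        timestamp = time.time()  # "now" for live metrics
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(lines, host, port=2003):
    """Send preformatted lines to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def render_url(base_url, target, time_from="-1h", fmt="json"):
    """Build a render API URL for a single target expression."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url}/render?{query}"
```

Backfilling uses the same line format with a past timestamp, e.g. `format_metric("business.test.count", 42, 1300000000)`, as long as that timestamp falls inside the metric's retention window. Sending would look like `send_lines([line], "carbon.example.com")`, and a function such as `sumSeries(...)` can be passed as the `target` of `render_url`.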
10. Graphite: what sucks
Dashboard ownership/promotion
No ganglia-like standard dashboard
Data retention… is NOT as simple as we thought
12. Metric Naming
Business Metrics
- These metrics are not tied to a specific server
- Format:
business.${hierarchical}.${path}.${here}.$metric
- Example:
business.ec2.testaccount.us-east-1a.OnDemand.running.m2.4xlarge
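The format above could be assembled with a small helper like the one below. This is only an illustration: `business_metric` is a hypothetical name, and the validation rule (no empty segments, no spaces) is an assumption rather than something the slides specify.

```python
def business_metric(*path, metric):
    """Join hierarchy segments and the metric name under the 'business' root."""
    segments = ["business", *path, metric]
    # Assumed rule: empty segments or spaces would not form a usable metric path.
    if any(not s or " " in s for s in segments):
        raise ValueError("segments must be non-empty and contain no spaces")
    return ".".join(segments)


name = business_metric(
    "ec2", "testaccount", "us-east-1a", "OnDemand", "running",
    metric="m2.4xlarge",
)
```

Graphite treats every dot as a level separator, so the dot in `m2.4xlarge` adds one more level to the hierarchy in the example.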
13. Metric Naming
Server Metrics
- These metrics are specific to a particular server (just like ganglia)
- Format:
servers.${class}.${f_q_d_n}.${metric}
- Example:
servers.rvw.aws1prdrvw1_subdom_cityg_com.LW_api_reviews_QPS
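The `${f_q_d_n}` placeholder and the example suggest that dots in the host name are replaced with underscores, so the FQDN stays a single level in the metric path. A sketch under that assumption follows; `server_metric` is a hypothetical helper, and the dotted host name is inferred from the example rather than given in the slides.

```python
def server_metric(server_class, fqdn, metric):
    """Build a server metric path, flattening the FQDN into a single segment."""
    # Dots separate levels in Graphite, so replace them inside the host name.
    host = fqdn.replace(".", "_")
    return f"servers.{server_class}.{host}.{metric}"


name = server_metric("rvw", "aws1prdrvw1.subdom.cityg.com", "LW_api_reviews_QPS")
```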
14. Sending metrics
Sending directly from metric scripts
- /etc/graphite.conf
- May need to spread out sending if in volume
Collecting from gmond every minute
- Metrics are spread out to prevent spiking
- Risk of false data: gmond acts as a cache, so stale values can be reported as fresh
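One possible way to spread out sending, so that many hosts do not all report in the same second, is a fixed per-host offset within the interval. This is a sketch of that idea, not necessarily the mechanism used here; `send_offset` is a hypothetical name and the 60-second interval is an assumption based on the once-a-minute collection.

```python
import zlib


def send_offset(hostname, interval=60):
    """Map a host name to a stable offset, in seconds, within the interval."""
    # crc32 is deterministic across runs, so each host keeps the same slot.
    return zlib.crc32(hostname.encode("utf-8")) % interval
```

A metric script could sleep for `send_offset(socket.getfqdn())` seconds after the start of each minute before sending, which keeps each host's send time fixed while distributing the fleet across the minute.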