overview

What will we cover:
- What is Graphite?
- What data to capture
- Chart interpretation

but why

I worked at a place with major scale fail
- boxed vs service
- 100s of servers in multiple datacenters
- manual processes, shell scripts
- no insight into the app, infrastructure
- n-tier architecture
- on-call duties
- needed therapy, got it, didn't help

what is graphite

- Scalable real-time graphing system
- 3 main components:
  - Web front-end, graphite
  - Processing backend, carbon
  - Database, whisper
- Python based*

* It's good to learn other languages

what is graphite

Setup / Documentation:
- Easy to set up
- Decent documentation
- API and CLI access

what is graphite

What does it capture?
- Numeric time-series data...
  - point: some.data.path
  - value: 3.2
  - timestamp: 1337690041 (epoch)

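A minimal sketch of how such a data point could be sent to carbon, assuming its plaintext listener is reachable. The host `localhost` and port `2003` (carbon's usual plaintext port) are assumptions, and `format_metric` / `send_metric` are hypothetical helper names, not part of Graphite.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one formatted line to carbon (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Only formats the example point; call send_metric(...) against a real carbon.
    print(format_metric("some.data.path", 3.2, 1337690041), end="")
```

Keeping formatting separate from sending makes the line easy to check without a running carbon daemon.
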
what is graphite

How much data?
- configurable
- precision
- retention period
- aggregation

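To make precision and retention concrete, here is a rough sketch of the arithmetic. It assumes retentions are written as `precision:history` pairs such as `10s:6h,1m:7d` (the style used in carbon's storage schema configuration). The example values and the helper names are illustrative only.

```python
# Seconds per unit suffix used in retention strings.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a string like '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def points_per_archive(retentions):
    """Return (seconds_per_point, number_of_points) for each archive."""
    result = []
    for archive in retentions.split(","):
        precision, history = archive.strip().split(":")
        step = to_seconds(precision)
        result.append((step, to_seconds(history) // step))
    return result


if __name__ == "__main__":
    for step, points in points_per_archive("10s:6h,1m:7d"):
        print(f"{step}s per point, {points} points kept")
```

Finer precision or longer retention means more points per metric, so these settings trade detail against disk space.
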
what data to capture

Thoughts on maximum vs. minimum:
- What information do you need to capture?
- Application Data (yes!)
- System Data: cpu, disk-io, mem usage
- Network: Connections? Latency? Packet loss?
- Fine-grained vs summary and aggregate?

what data to capture

In your app:
- function / method / calculation time
- template / content generation
- database query execution
- Internal and 3rd-party API calls
- queue sizes, processing times
- A/B testing?

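One way to capture function or query time is a small timing decorator. This is a sketch, not a Graphite API: `timed`, the metric path `app.reports.build_ms`, and the in-memory `collected` list (standing in for whatever ships data to carbon) are all hypothetical.

```python
import functools
import time


def timed(metric_path, sink):
    """Record the wrapped function's wall-clock time in milliseconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                # (path, value, epoch timestamp) mirrors the data point layout
                sink.append((metric_path, elapsed_ms, int(time.time())))
        return wrapper
    return decorator


collected = []  # stand-in for a real sender


@timed("app.reports.build_ms", collected)
def build_report(n):
    """Placeholder for real work such as a database query."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    build_report(1000)
    print(collected)
```

Because the timing is recorded in a `finally` block, calls that raise an exception are measured too.
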
what data to capture

From the systems:
- cpu
- disk usage
- io (disk, network interface)
- memory / paging / swap
- file handles
- log entries

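A rough sketch of gathering a couple of these system numbers with only the Python standard library. The metric prefix `servers.web01` is a made-up naming scheme, and load average is skipped on platforms that do not provide it.

```python
import os
import shutil
import time


def system_metrics(prefix="servers.web01", path="/"):
    """Return '<path> <value> <timestamp>' lines for disk usage and load."""
    now = int(time.time())
    usage = shutil.disk_usage(path)
    metrics = {
        f"{prefix}.disk.used_bytes": usage.used,
        f"{prefix}.disk.free_bytes": usage.free,
    }
    try:
        load1, load5, load15 = os.getloadavg()
        metrics[f"{prefix}.load.1min"] = load1
        metrics[f"{prefix}.load.5min"] = load5
        metrics[f"{prefix}.load.15min"] = load15
    except (AttributeError, OSError):
        pass  # load average is not available on this platform
    return [f"{name} {value} {now}" for name, value in metrics.items()]


if __name__ == "__main__":
    for line in system_metrics():
        print(line)
```

Run on a schedule, output like this gives the system-side series to line up against application data.
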
what data to capture

At the network level:
- connection count
- socket state
- qos levels
- firewall stats
- cdn / cache response
- 3rd party status

chart interpretation

...it's like reading tea leaves...
...domains of knowledge leave gaps...
...that's not my job...
...forest through the trees...

chart interpretation

So what are we looking for:
- normality *
- deviations
- jitters
- historical performance
- double rainbows

* not present per Cal's keynote

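As an illustration of "normality" versus "deviations", the sketch below flags recent values that sit far from a baseline. The three-standard-deviation threshold and the sample numbers are assumptions chosen for the example. This is one simple approach, not something Graphite does for you.

```python
import statistics


def find_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs in `recent` far from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: anything different is a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mean]
    return [
        (i, v) for i, v in enumerate(recent)
        if abs(v - mean) / stdev > threshold
    ]


if __name__ == "__main__":
    baseline = [10, 11, 9, 10, 10, 11, 9, 10]  # made-up "normal" readings
    recent = [10, 11, 25, 9]
    print(find_deviations(baseline, recent))
```

The check is only as good as the baseline, which is why historical data matters.
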
chart interpretation

Because at 3am when you get paged...

Wouldn't it be great to correlate the site going down...
due to swapping...
because of high memory usage...
thanks to that code that got pushed...
that had that change to how you processed row results from a large database query.

chart interpretation

Or that change window that just happened...

Where the security folks made some config changes to one of the firewalls..
that is now blocking your outbound API calls..
just from some app servers in one of the datacenters..

chart interpretation

What about that new kernel that fixes a memory leak...

Can you compare side by side, and with historical context, what that looks like?

What about a physical machine vs a virtual one?

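A side-by-side comparison with historical context can be requested from Graphite's render API by asking for several targets at once. In this sketch the host `graphite.example.com` and the metric names are hypothetical. `timeShift` is a Graphite render function that overlays the same series from an earlier period.

```python
from urllib.parse import urlencode


def render_url(base, targets, start="-24h", fmt="json"):
    """Build a render API URL that requests several targets together."""
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base.rstrip('/')}/render?{urlencode(params)}"


if __name__ == "__main__":
    url = render_url(
        "http://graphite.example.com",  # hypothetical host
        [
            "servers.web01.memory.used",                    # hypothetical metric
            'timeShift(servers.web01.memory.used, "7d")',   # same metric, a week back
            "servers.web02.memory.used",                    # e.g. the other kernel
        ],
    )
    print(url)
```

Each target becomes its own line on the graph, so two hosts and last week's values share one set of axes.
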
chart interpretation

Do we need to retune our load-balancers, app servers, or database replication?

Does higher site traffic over the past few weeks show signs of strain?

Did that cache layer we added help any?

Is historical data choking once-fast pages?

some final thoughts

- come full circle, stats back in
- this is one solution, there are others (statsd)
- part of a larger tool bag
- implement before big changes
- establish a reference / baseline
- suitable for dev, qa, and production
- make implementing data capture easy

resources

http://graphite.wikidot.com
http://wordpress.org
http://memgenerator.net
http://www.flickr.com/groups/webopsviz/

..more resources available online..