Tek12: Graphing real-time performance with Graphite

Published in: Technology

  1. Graphing real-time performance with Graphite
     Neal Anders - https://joind.in/650
  2. whoami
     Neal Anders
     Senior Software Engineer at Infoblox
     http://github.com/nanderoo
     http://neal-anders.com
     @nanderoo
  3. shameless plug
     Infoblox is working on some cool stuff...
     - DNS, DHCP, IPAM, NCCM
     - IPv6 Center of Excellence
     - IF-Map / DNSSec
     - Hiring (sales, services, support, engineering)
  4. disclaimer
     These thoughts and opinions are my own, and not of my employer, bla bla bla...
  5. whois $USER
     Quick poll:
     - Designers
     - Developers
     - Sys-Admins
     - Networking
     - Management
     - Other...?
  6. overview
     What will we cover:
     - What is Graphite?
     - What data to capture
     - Chart interpretation
  7. but why
     I worked at a place with major scale fail
     - boxed vs service
     - 100s of servers in multiple datacenters
     - manual processes, shell scripts
     - no insight into the app, infrastructure
     - n-tier architecture
     - on-call duties
     - needed therapy, got it, didn't help
  8. what is graphite
     - Scalable real-time graphing system
     - 3 main components:
       - Web front-end, graphite
       - Processing backend, carbon
       - Database, whisper
     - Python based*
     * It's good to learn other languages
  9. what is graphite
     Setup / Documentation:
     - Easy to set up
     - Decent documentation
     - API and CLI access
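
The "API access" on slide 9 can be illustrated with the web front-end's render endpoint, which returns stored data when given a target and a time range. This is a minimal sketch that only builds the request URL; the host `graphite.example.com` and the helper name `render_url` are hypothetical, and the metric path is borrowed from slide 10.

```python
from urllib.parse import urlencode


def render_url(base_url, target, time_from="-1h", fmt="json"):
    """Build a /render URL asking the web front-end for one target's data."""
    query = urlencode({"target": target, "from": time_from, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


# Hypothetical host; replace with your own Graphite web front-end.
url = render_url("http://graphite.example.com", "some.data.path")
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should yield the datapoints for the last hour, assuming the server is reachable and the metric exists.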
  10. what is graphite
      What does it capture?
      - Numeric time-series data...
        point      some.data.path
        value      3.2
        timestamp  1337690041 (epoch)
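
The point / value / timestamp triple on slide 10 maps directly onto the plaintext line format that carbon accepts: one `path value timestamp` line per data point. The sketch below formats the slide's own example; the helper names are made up for illustration, and `localhost:2003` assumes a carbon listener on its default plaintext port.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext line: 'path value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="localhost", port=2003):
    """Send one formatted line to a carbon listener over TCP.

    host and port are assumptions; adjust them to your own setup.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# The data point from the slide.
line = format_metric("some.data.path", 3.2, 1337690041)
print(line, end="")
```

Calling `send_metric(line)` is left out of the example run because it needs a live carbon instance.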
  11. what is graphite
      How much data?
      - configurable
      - precision
      - retention period
      - aggregation
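
Precision and retention period together decide how many points are kept per metric, which is where the storage and disk-io cost comes from. The following rough sketch works through that arithmetic for a retention string in the `precision:retention` style used in Graphite's storage configuration. The example string `10s:6h,1m:7d` is invented for illustration, and the 12 bytes per point figure is an assumption about the on-disk format that ignores file headers.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a string such as '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def archive_points(retentions):
    """Return (seconds_per_point, point_count) for each 'precision:retention' pair."""
    archives = []
    for pair in retentions.split(","):
        precision, retention = pair.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(retention) // step))
    return archives


def estimated_bytes(archives, bytes_per_point=12):
    """Rough data size per metric; bytes_per_point is an assumption, headers ignored."""
    return sum(points for _, points in archives) * bytes_per_point


# Hypothetical policy: 10-second points for 6 hours, then 1-minute points for 7 days.
archives = archive_points("10s:6h,1m:7d")
print(archives, estimated_bytes(archives))
```

Multiplying the per-metric estimate by the number of metrics gives a first feel for the precision versus storage trade-off the gotchas slide mentions.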
  12. what is graphite
  13. what is graphite
      Notes / gotchas:
      - Scales horizontally
      - Heavy on disk-io
      - Fault tolerance
      - Data loss
      - Precision or Storage Space / io
  14. what data to capture
      ...so what information should we capture?
      ..how detailed do we get?
      ..and does it have historical relevance?
      ..are just a few key metrics enough?
  15. what data to capture
  16. what data to capture
      Thoughts on maximum vs. minimum:
      - What information do you need to capture?
      - Application Data (yes!)
      - System Data: cpu, disk-io, mem usage
      - Network: Connections? Latency? Packet loss?
      - Fine-grained vs summary and aggregate?
  17. what data to capture
      In your app:
      - function / method / calculation time
      - template / content generation
      - database query execution
      - Internal and 3rd-party API calls
      - queue sizes, processing times
      - A/B testing?
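
Most of the in-app items on slide 17 come down to timing a block of code and recording the result under a metric path. One possible sketch uses a context manager; the name `timed`, the metric path, and the in-memory `samples` list are all hypothetical stand-ins for whatever actually ships the data to Graphite.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(path, sink):
    """Time the enclosed block and append (path, elapsed_ms, timestamp) to sink."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        sink.append((path, elapsed_ms, int(time.time())))


samples = []
with timed("app.db.query_time_ms", samples):
    sum(range(100_000))  # stand-in for a database query

print(samples)
```

Because the measurement is recorded in a `finally` clause, a block that raises still produces a data point, which tends to be when the timing is most interesting.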
  18. what data to capture
      From the systems:
      - cpu
      - disk usage
      - io (disk, network interface)
      - memory / paging / swap
      - file handles
      - log entries
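
A few of the system-level values on slide 18 can be read with the Python standard library alone. The sketch below gathers disk usage and, where the platform provides it, the 1-minute load average as a rough CPU indicator. The prefix `servers.web01` and the metric names are invented for illustration; memory, io, and file-handle counts would need platform-specific sources not shown here.

```python
import os
import shutil
import time


def system_metrics(prefix, disk_path="/"):
    """Collect a few (path, value, timestamp) tuples from the local machine."""
    now = int(time.time())
    usage = shutil.disk_usage(disk_path)
    metrics = [
        (f"{prefix}.disk.used_bytes", usage.used, now),
        (f"{prefix}.disk.free_bytes", usage.free, now),
    ]
    # Load average is not available on every platform, so treat it as optional.
    if hasattr(os, "getloadavg"):
        try:
            load1, _, _ = os.getloadavg()
            metrics.append((f"{prefix}.load.1min", load1, now))
        except OSError:
            pass
    return metrics


points = system_metrics("servers.web01")
for point in points:
    print(point)
```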
  19. what data to capture
      At the network level:
      - connection count
      - socket state
      - qos levels
      - firewall stats
      - cdn / cache response
      - 3rd party status
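
Connection count and socket state from slide 19 can be derived on Linux from `/proc/net/tcp`, where the fourth column holds a hex state code. This is a sketch under that assumption; it parses text passed in rather than reading the file, and the `SAMPLE` rows are shortened, made-up entries, not real output.

```python
from collections import Counter

# Subset of the kernel's TCP state codes; anything else is counted as "other".
TCP_STATES = {"01": "established", "06": "time_wait", "08": "close_wait", "0A": "listen"}


def count_socket_states(proc_net_tcp_text):
    """Count sockets per state from text in /proc/net/tcp layout."""
    counts = Counter()
    for row in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = row.split()
        if len(fields) > 3:
            counts[TCP_STATES.get(fields[3], "other")] += 1
    return counts


# Abbreviated, hypothetical sample in the same column order as /proc/net/tcp.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0CEA 00000000:0000 0A 00000000:00000000
   1: 0100007F:0CEA 0100007F:D1F2 01 00000000:00000000
   2: 0100007F:0CEA 0100007F:D1F4 01 00000000:00000000
   3: 0100007F:0CEA 0100007F:D1F6 06 00000000:00000000
"""

counts = count_socket_states(SAMPLE)
print(dict(counts))
```

Each count could then be recorded under a path such as `network.tcp.established`, again a hypothetical name.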
  20. chart interpretation
      ...it's like reading tea leaves...
      ...domains of knowledge leave gaps...
      ...that's not my job...
      ...forest through the trees...
  21. chart interpretation
      So what are we looking for:
      - normality *
      - deviations
      - jitters
      - historical performance
      - double rainbows
      * not present per Cal's keynote
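
"Normality" and "deviations" on slide 21 can be made concrete by comparing recent values against a baseline window. The sketch below flags values that sit more than a chosen number of standard deviations from the baseline mean. This is one simple heuristic picked for illustration, not something the talk prescribes, and the numbers are invented.

```python
from statistics import mean, pstdev


def deviations(baseline, recent, threshold=3.0):
    """Return recent values more than `threshold` std deviations from the baseline mean."""
    center = mean(baseline)
    spread = pstdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [v for v in recent if v != center]
    return [v for v in recent if abs(v - center) / spread > threshold]


# Hypothetical response times in milliseconds.
baseline = [100, 102, 98, 101, 99, 100]
flagged = deviations(baseline, [101, 97, 180])
print(flagged)
```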
  22. chart interpretation
      Because at 3am when you get paged...
      Wouldn't it be great to correlate the site going down... due to swapping... because of high memory usage... thanks to that code that got pushed... that had that change to how you processed row results from a large database query.
  23. chart interpretation
      Or that change window that just happened...
      Where the security folks made some config changes to one of the firewalls.. that is now blocking your outbound API calls.. just from some app servers in one of the datacenters..
  24. chart interpretation
      What about that new kernel that fixes a memory leak...
      Can you compare side by side, and with historical context, what that looks like?
      What about a physical machine vs a virtual one?
  25. chart interpretation
      Do we need to retune our load-balancers, app servers, or database replication?
      Does higher site traffic over the past few weeks show signs of strain?
      Did that cache layer we added help any?
      Is historical data choking once-fast pages?
  26. demo
      wordpress example
  27. some final thoughts
      - come full circle, stats back in
      - this is one solution, there are others (statsd)
      - part of a larger tool bag
      - implement before big changes
      - establish a reference / baseline
      - suitable for dev, qa, and production
      - make implementing data capture easy
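
For the statsd alternative mentioned on slide 27, a metric is usually sent as a small UDP datagram of the form `name:value|type`, where the type is `c` for counters, `ms` for timings, or `g` for gauges. The helper names below are hypothetical, and `localhost:8125` assumes a statsd daemon on its default port.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one statsd datagram: 'name:value|type'."""
    if metric_type not in ("c", "ms", "g"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="localhost", port=8125):
    """Fire-and-forget UDP send; host and port are assumptions."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


counter = statsd_line("app.page_views", 1, "c")
timing = statsd_line("app.render_time", 320, "ms")
print(counter, timing)
```

UDP keeps the cost of instrumenting code low, which fits the "make implementing data capture easy" point, at the price of possibly dropped datagrams.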
  28. resources
      http://graphite.wikidot.com
      http://wordpress.org
      http://memgenerator.net
      http://www.flickr.com/groups/webopsviz/
      ..more resources available online..
  29. feedback
      joind.in - https://joind.in/6502
      email - neal.anders@yahoo.com
  30. fin
      Thank you.
  31. Bonus
      2001:1868:ad01:1::33
