Tek12: Graphing real-time performance with Graphite
Published in Technology
Transcript

  • 1. Graphing real-time performance with Graphite
    Neal Anders - https://joind.in/650
  • 2. whoami
    Neal Anders
    Senior Software Engineer at Infoblox
    http://github.com/nanderoo
    http://neal-anders.com
    @nanderoo
  • 3. shameless plug
    Infoblox is working on some cool stuff...
    - DNS, DHCP, IPAM, NCCM
    - IPv6 Center of Excellence
    - IF-Map / DNSSec
    - Hiring (sales, services, support, engineering)
  • 4. disclaimer
    These thoughts and opinions are my own, and not those of my employer, bla bla bla...
  • 5. whois $USER
    Quick poll:
    - Designers
    - Developers
    - Sys-Admins
    - Networking
    - Management
    - Other...?
  • 6. overview
    What will we cover:
    - What is Graphite?
    - What data to capture
    - Chart interpretation
  • 7. but why
    I worked at a place with major scale fail
    - boxed vs service
    - 100s of servers in multiple datacenters
    - manual processes, shell scripts
    - no insight into the app, infrastructure
    - n-tier architecture
    - on-call duties
    - needed therapy, got it, didn't help
  • 8. what is graphite
    - Scalable real-time graphing system
    - 3 main components:
      - Web front-end, graphite
      - Processing backend, carbon
      - Database, whisper
    - Python based*
    * It's good to learn other languages
  • 9. what is graphite
    Setup / Documentation:
    - Easy to set up
    - Decent documentation
    - API and CLI access
  • 10. what is graphite
    What does it capture?
    - Numeric time-series data...
      point      some.data.path
      value      3.2
      timestamp  1337690041 (epoch)
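
As a rough illustration of the point format on this slide, the sketch below builds one line in Carbon's plaintext protocol (`path value timestamp`, newline-terminated) and sends it over TCP. The host `localhost` and port `2003` are assumptions (2003 is the usual default for Carbon's plaintext listener), and the helper names `format_metric` and `send_metric` are hypothetical, not part of Graphite.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext-protocol line: path, value, epoch seconds, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, timestamp=None, host="localhost", port=2003):
    """Send a single data point to a Carbon plaintext listener.

    host and port are assumptions; adjust them to your own Carbon setup.
    """
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Only formats the slide's example point; nothing is sent over the network.
    print(format_metric("some.data.path", 3.2, 1337690041), end="")
```

Calling `send_metric("some.data.path", 3.2)` would use the current time as the timestamp, provided a Carbon listener is actually reachable at the assumed address.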
  • 11. what is graphite
    How much data?
    - configurable
    - precision
    - retention period
    - aggregation
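
To make the precision versus retention trade-off concrete, here is a minimal sizing sketch. It assumes each stored point costs about 12 bytes (a 4-byte timestamp plus an 8-byte double) and ignores file headers; the two-archive policy shown is hypothetical and not taken from the talk.

```python
BYTES_PER_POINT = 12  # assumption: 4-byte timestamp + 8-byte double per point

HOUR = 3600
DAY = 24 * HOUR


def points_in_archive(seconds_per_point, retention_seconds):
    """Number of points one archive holds at the given precision."""
    return retention_seconds // seconds_per_point


def estimate_bytes(archives):
    """Approximate per-metric storage for a list of (precision, retention) pairs."""
    total_points = sum(points_in_archive(s, r) for s, r in archives)
    return total_points * BYTES_PER_POINT


# Hypothetical policy: 10-second points for 6 hours, then 1-minute points for 7 days.
example_archives = [(10, 6 * HOUR), (60, 7 * DAY)]

if __name__ == "__main__":
    for precision, retention in example_archives:
        print(precision, retention, points_in_archive(precision, retention))
    print(estimate_bytes(example_archives))
```

Multiplying the per-metric estimate by the number of metrics gives a first guess at disk usage before choosing finer precision or longer retention.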
  • 12. what is graphite
  • 13. what is graphite
    Notes / gotchas:
    - Scales horizontally
    - Heavy on disk-io
    - Fault tolerance
    - Data loss
    - Precision or Storage Space / io
  • 14. what data to capture
    ...so what information should we capture?
    ..how detailed do we get?
    ..and does it have historical relevance?
    ..are just a few key metrics enough?
  • 15. what data to capture
  • 16. what data to capture
    Thoughts on maximum vs. minimum:
    - What information do you need to capture?
    - Application Data (yes!)
    - System Data: cpu, disk-io, mem usage
    - Network: Connections? Latency? Packet loss?
    - Fine-grained vs summary and aggregate?
  • 17. what data to capture
    In your app:
    - function / method / calculation time
    - template / content generation
    - database query execution
    - Internal and 3rd-party API calls
    - queue sizes, processing times
    - A/B testing?
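
One way to capture function or query timings like those listed on this slide is a small timing decorator. This is a sketch under assumptions: the metric path `app.db.query_time_ms`, the `timed` decorator, and the in-memory `record` sink are all hypothetical; in practice the sink would forward each point to Carbon or a similar collector.

```python
import functools
import time


def timed(metric_path, sink):
    """Decorator: time each call and pass (path, milliseconds, epoch) to sink."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so failed calls are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                sink(metric_path, elapsed_ms, int(time.time()))
        return wrapper
    return decorator


collected = []  # stand-in sink storage for this example


def record(path, value, timestamp):
    """Hypothetical sink: keep points in memory instead of sending them."""
    collected.append((path, value, timestamp))


@timed("app.db.query_time_ms", record)
def run_query():
    """Placeholder for a real database call."""
    time.sleep(0.01)
    return ["row"]
```

Keeping the sink as a plain callable means the same decorator can be pointed at a test list, a log, or a network sender without touching the instrumented code.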
  • 18. what data to capture
    From the systems:
    - cpu
    - disk usage
    - io (disk, network interface)
    - memory / paging / swap
    - file handles
    - log entries
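
A few of the system-level numbers above can be gathered with the Python standard library alone. The sketch below collects load averages and disk usage as `(path, value, timestamp)` tuples; the metric prefix `servers.host1` is a hypothetical naming scheme, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil
import time


def collect_system_metrics(prefix="servers.host1", disk_path="/"):
    """Return a list of (metric_path, value, epoch_seconds) tuples."""
    now = int(time.time())
    load1, load5, load15 = os.getloadavg()  # Unix-only
    usage = shutil.disk_usage(disk_path)
    return [
        (f"{prefix}.load.1min", load1, now),
        (f"{prefix}.load.5min", load5, now),
        (f"{prefix}.load.15min", load15, now),
        (f"{prefix}.disk.used_bytes", usage.used, now),
        (f"{prefix}.disk.free_bytes", usage.free, now),
    ]


if __name__ == "__main__":
    for path, value, timestamp in collect_system_metrics():
        print(path, value, timestamp)
```

Run from cron or a small loop, each tuple maps directly onto the path, value, and timestamp that Graphite stores.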
  • 19. what data to capture
    At the network level:
    - connection count
    - socket state
    - qos levels
    - firewall stats
    - cdn / cache response
    - 3rd party status
  • 20. chart interpretation
    ...it's like reading tea leaves...
    ...domains of knowledge leave gaps...
    ...that's not my job...
    ...forest through the trees...
  • 21. chart interpretation
    So what are we looking for:
    - normality *
    - deviations
    - jitters
    - historical performance
    - double rainbows
    * not present per Cal's keynote
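
The idea of spotting deviations from normality can be sketched numerically as well as visually. The example below flags recent values that sit more than a chosen number of standard deviations from a baseline mean. This is one simple technique picked for illustration, not something the talk prescribes; the function name and the threshold of 3.0 are assumptions.

```python
import statistics


def find_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs from recent that stray from the baseline.

    A value is flagged when it is more than `threshold` population standard
    deviations away from the baseline mean.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        # Perfectly flat baseline: anything different counts as a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mean]
    return [
        (i, v) for i, v in enumerate(recent)
        if abs(v - mean) / stdev > threshold
    ]


if __name__ == "__main__":
    baseline = [100, 102, 98, 101, 99]   # made-up "normal" readings
    recent = [100, 101, 150, 99]         # made-up recent readings
    print(find_deviations(baseline, recent))
```

A check like this depends on having a reference period to compare against, which is why establishing a baseline before big changes matters.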
  • 22. chart interpretation
    Because at 3am when you get paged...
    Wouldn't it be great to correlate the site going down... due to swapping... because of high memory usage... thanks to that code that got pushed... that had that change to how you processed row results from a large database query.
  • 23. chart interpretation
    Or that change window that just happened...
    Where the security folks made some config changes to one of the firewalls.. that is now blocking your outbound API calls.. just from some app servers in one of the datacenters..
  • 24. chart interpretation
    What about that new kernel that fixes a memory leak...
    Can you compare side by side, and with historical context, what that looks like?
    What about a physical machine vs a virtual one?
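
For the side-by-side, historical comparison this slide asks about, one option is to request the same metric twice from Graphite's render endpoint, once as-is and once wrapped in the `timeShift` function. The sketch below only builds the URL. The base URL `http://graphite.example.com` and the metric name are hypothetical, and whether `timeShift` is available depends on the installed Graphite version.

```python
from urllib.parse import urlencode


def comparison_url(base_url, metric, shift="7d", window="-24h"):
    """Build a render URL plotting a metric next to its time-shifted self."""
    params = [
        ("target", metric),
        ("target", f'timeShift({metric},"{shift}")'),
        ("from", window),
        ("format", "json"),
    ]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical server and metric name, for illustration only.
    print(comparison_url("http://graphite.example.com",
                         "servers.host1.memory.used"))
```

Swapping the second target for a different host's metric gives the physical versus virtual comparison in the same chart.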
  • 25. chart interpretation
    Do we need to retune our load-balancers, app servers, or database replication?
    Does higher site traffic over the past few weeks show signs of strain?
    Did that cache layer we added help any?
    Is historical data choking once-fast pages?
  • 26. demo
    wordpress example
  • 27. some final thoughts
    - come full circle, stats back in
    - this is one solution, there are others (statsd)
    - part of a larger tool bag
    - implement before big changes
    - establish a reference / baseline
    - suitable for dev, qa, and production
    - make implementing data capture easy
  • 28. resources
    http://graphite.wikidot.com
    http://wordpress.org
    http://memgenerator.net
    http://www.flickr.com/groups/webopsviz/
    ..more resources available online..
  • 29. feedback
    joind.in - https://joind.in/6502
    email - neal.anders@yahoo.com
  • 30. fin Thank you.
  • 31. Bonus
    2001:1868:ad01:1::33