Tek12: Graphing real-time performance with Graphite

Graphing real-time performance with Graphite
Neal Anders - https://joind.in/650

whoami

Neal Anders
Senior Software Engineer at Infoblox
http://github.com/nanderoo
http://neal-anders.com
@nanderoo

shameless plug
Infoblox is working on some cool stuff...
- DNS, DHCP, IPAM, NCCM
- IPv6 Center of Excellence
- IF-Map / DNSSec
- Hiring (sales, services, support, engineering)

disclaimer
These thoughts and opinions are my own, and
not of my employer, bla bla bla...

whois $USER
Quick poll:
- Designers
- Developers
- Sys-Admins
- Networking
- Management
- Other...?

overview
What will we cover:
- What is Graphite?
- What data to capture
- Chart interpretation

but why
I worked at a place with major scale fail
- boxed vs service
- 100s of servers in multiple datacenters
- manual processes, shell scripts
- no insight into the app, infrastructure
- n-tier architecture
- on-call duties
- needed therapy, got it, didn't help

what is graphite
- Scalable real-time graphing system
- 3 main components:
  - Web front-end, graphite
  - Processing backend, carbon
  - Database, whisper
- Python based*

* It's good to learn other languages

what is graphite
Setup / Documentation:
- Easy to set up
- Decent documentation
- API and CLI access
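As an illustration of the API access mentioned above, here is a minimal sketch that builds a URL for Graphite's render endpoint and asks for JSON instead of an image. The host name graphite.example.com and the helper name render_url are assumptions made up for this example; only the /render path and the target, from, and format parameters come from Graphite itself.

```python
from urllib.parse import urlencode

def render_url(base, targets, start="-1h", fmt="json"):
    """Build a render URL for one or more metric paths (hypothetical helper)."""
    # Each metric gets its own repeated "target" parameter.
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base.rstrip('/')}/render?{urlencode(params)}"

# graphite.example.com is a placeholder host, not a real server.
url = render_url("http://graphite.example.com", ["some.data.path"])
print(url)
```

Fetching that URL with any HTTP client should return the series as JSON, which is handy for scripts and command-line checks.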

what is graphite
What does it capture?
- Numeric time-series data...

point      some.data.path
value      3.2
timestamp  1337690041 (epoch)
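The three fields above map directly onto carbon's plaintext protocol, which takes one "path value timestamp" line per data point. The sketch below formats such a line and, optionally, sends it over TCP. The host and port defaults (localhost, 2003) are assumptions about a stock carbon-cache install, and the function names are made up for this example.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, epoch timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

def send_metric(line, host="localhost", port=2003, timeout=2.0):
    """Send one pre-formatted line to a carbon listener (assumed host/port)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))

line = format_metric("some.data.path", 3.2, 1337690041)
print(line, end="")
# send_metric(line)  # uncomment once a carbon listener is reachable
```

Keeping the formatting separate from the network call makes it possible to check the line format without a running carbon daemon.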

what is graphite
How much data?
- configurable:
  - precision
  - retention period
  - aggregation
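To get a feel for the precision versus retention trade-off, the sketch below estimates how many points each archive holds and roughly how much disk it needs. It assumes retention definitions written as precision:retention pairs (for example 10s:6h) and 12 bytes per stored point (a 4-byte timestamp plus an 8-byte value), ignoring whisper's file header. The example definition is made up, not a recommendation.

```python
# Seconds per unit suffix; a year is treated as 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
BYTES_PER_POINT = 12  # 4-byte timestamp + 8-byte value, header overhead ignored

def to_seconds(token):
    """Convert a token such as '10s' or '7d' into seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]

def describe_retention(spec):
    """Return (step_seconds, points, bytes) for each archive in the spec."""
    archives = []
    for part in spec.split(","):
        precision, retention = part.strip().split(":")
        step = to_seconds(precision)
        points = to_seconds(retention) // step
        archives.append((step, points, points * BYTES_PER_POINT))
    return archives

# Hypothetical definition: 10s points for 6h, 1m for 7d, 10m for 5y.
for step, points, size in describe_retention("10s:6h,1m:7d,10m:5y"):
    print(f"{step:>4}s step  {points:>7} points  ~{size / 1024:.0f} KiB")
```

Multiplying the per-metric total by the number of metrics gives a rough idea of the storage and disk-io cost of keeping finer precision for longer.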

what is graphite
Notes / gotchas:
- Scales horizontally
- Heavy on disk-io
- Fault tolerance
- Data loss
- Precision or Storage Space / io

what data to capture
...so what information should we capture?

..how detailed do we get?

..and does it have historical relevance?

..are just a few key metrics enough?

Thoughts on maximum vs. minimum:
- What information do you need to capture?
  - Application Data (yes!)
  - System Data: cpu, disk-io, mem usage
  - Network: Connections? Latency? Packet loss?
- Fine-grained vs summary and aggregate?

In your app:
- function / method / calculation time
- template / content generation
- database query execution
- Internal and 3rd-party API calls
- queue sizes, processing times
- A/B testing?
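One low-effort way to capture timings like those above is a decorator that records how long a call took under a metric path. This is only a sketch: the metric name app.db.query_ms, the timed decorator, and the in-memory collected list are all hypothetical, and a real app would flush the recorded points to carbon rather than keep them in a list.

```python
import functools
import time

collected = []  # hypothetical buffer of (path, milliseconds, epoch) tuples

def timed(path):
    """Record the wall-clock duration of each call under the given path."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the wrapped call raises.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                collected.append((path, elapsed_ms, int(time.time())))
        return wrapper
    return decorator

@timed("app.db.query_ms")
def run_query():
    time.sleep(0.01)  # stand-in for a real database query
    return ["row"]

rows = run_query()
path, elapsed_ms, timestamp = collected[-1]
print(f"{path} {elapsed_ms:.2f} {timestamp}")
```

The same decorator can wrap template rendering or third-party API calls, so adding a new measurement is a one-line change.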

From the systems:
- cpu
- disk usage
- io (disk, network interface)
- memory / paging / swap
- file handles
- log entries
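System numbers can be collected with very little code. The sketch below reads disk usage from the standard library and formats it as "path value timestamp" lines. The servers.HOST.disk naming scheme is an assumption chosen for this example, not a Graphite requirement.

```python
import shutil
import socket
import time

def disk_metrics(mount="/", prefix=None):
    """Return plaintext metric lines for total, used, and free disk bytes."""
    # Dots separate path components in metric names, so replace them in the host name.
    host = socket.gethostname().replace(".", "_")
    prefix = prefix or f"servers.{host}.disk"  # hypothetical naming scheme
    usage = shutil.disk_usage(mount)
    now = int(time.time())
    return [
        f"{prefix}.total_bytes {usage.total} {now}",
        f"{prefix}.used_bytes {usage.used} {now}",
        f"{prefix}.free_bytes {usage.free} {now}",
    ]

for line in disk_metrics():
    print(line)
```

Run from cron or a small loop, the output can be sent to carbon as-is, one line per data point.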

At the network level:
- connection count
- socket state
- qos levels
- firewall stats
- cdn / cache response
- 3rd party status

chart interpretation
...it's like reading tea leaves...

...domains of knowledge leave gaps...

...that's not my job...

...forest through the trees...

So what are we looking for:
- normality *
- deviations
- jitters
- historical performance
- double rainbows

* not present per Cal's keynote
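Spotting deviations from normality can be partly automated once a baseline exists. The sketch below flags recent values that sit more than a few standard deviations from the baseline mean. The sample numbers and the threshold of 3 are made up for illustration, and this simple rule is just one of many possible approaches.

```python
from statistics import mean, stdev

def find_deviations(baseline, recent, threshold=3.0):
    """Return (index, value) pairs in recent that stray far from the baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different counts as a deviation.
        return [(i, v) for i, v in enumerate(recent) if v != mu]
    return [(i, v) for i, v in enumerate(recent) if abs(v - mu) / sigma > threshold]

# Hypothetical response times in milliseconds.
baseline = [120, 118, 125, 122, 119, 121, 123, 120]
recent = [121, 124, 310, 119]
print(find_deviations(baseline, recent))  # [(2, 310)]
```

A check like this only works if the baseline was captured before the change being investigated.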

Because at 3am when you get paged...

Wouldn't it be great to correlate the site going
down... due to swapping... because of high
memory usage... thanks to that code that got
pushed... that had that change to how you
processed row results from a large database
query.

Or that change window that just happened...

Where the security folks made some config
changes to one of the firewalls.. that is now
blocking your outbound API calls.. just from
some app servers in one of the datacenters..

What about that new kernel that fixes a
memory leak...

Can you compare side by side, and with
historical context, what that looks like?
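For a side-by-side comparison with historical context, one option is to plot a metric next to a time-shifted copy of itself using Graphite's timeShift function. The sketch below only builds the render URL. The host graphite.example.com and the metric servers.web01.memory.used are placeholders invented for this example.

```python
from urllib.parse import urlencode

def comparison_url(base, metric, shift="7d", start="-24h"):
    """Build a render URL showing a metric and the same metric shifted back in time."""
    targets = [metric, f'timeShift({metric},"{shift}")']
    params = [("target", t) for t in targets] + [("from", start)]
    return f"{base.rstrip('/')}/render?{urlencode(params)}"

# Placeholder host and metric path, not real ones.
print(comparison_url("http://graphite.example.com", "servers.web01.memory.used"))
```

Opening the resulting URL should draw today's line over last week's, which makes a before-and-after view of a kernel upgrade or a code push easier to read.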

What about a physical machine vs a virtual
one?

Do we need to retune our load-balancers, app
servers, or database replication?

Does higher site traffic over the past few
weeks show signs of strain?

Did that cache layer we added help any?

Is historical data choking once-fast pages?

some final thoughts
- come full circle, stats back in
- this is one solution, there are others (statsd)
- part of a larger tool bag
- implement before big changes
- establish a reference / baseline
- suitable for dev, qa, and production
- make implementing data capture easy

resources
http://graphite.wikidot.com
http://wordpress.org
http://memgenerator.net
http://www.flickr.com/groups/webopsviz/

..more resources available online..

feedback
joind.in - https://joind.in/6502
email - neal.anders@yahoo.com
