• In 2012, I spoke many times on
• But changing from release cycles to
continuous deployment is too big a
change for most organizations, and they
don't have the tools to do it.
• I'm hoping that adding new metrics to
the application becomes so addictive
that you'll want to shorten release
What is DevOps?
• Puppet, Chef, Ansible?
• GitHub? AWS? The Cloud?
• Continuous Deployment?
Yes, but these are tools. Great tools.
• Between machines
• Between team members
• Between Dev and Ops
But in many companies there is a bigger problem
• If you are in Business, you are
invisible to Development and Tech
• If you are in Operations, you are
invisible to Business and
• If you are in Development, you are
invisible to Business and Operations.
• "I don't know what my code will do in
production and ops and let's them deal
• "Why doesn't ops fix these problems?"
• "What does Ops do all day?"
• Why do I have to wait until the end of the
month for a report?
• "Did last week's release change
• "Why don't they understand the impact
of that bug, outage, etc.?"
• Why are they always bothering me?
• I've got work to do!
• Why do we have to do another release
again... can't developers do a better
• "What does this company do?" (really)
This is really destructive
To your Team
To your company.
All of This
Can Be Fixed By Making
Not just technical operations but
Your company is full
So Why Not Expose
Here's a list of excuses I've heard
"But I already have
graphing in my
• Maybe. But it's junk
• Can't share
• Can't do data mash-ups
• Can't do data transformations
• "They won't understand the data so
what's the point of sharing it."
• First, "they" probably do. And more
people looking at ops metrics, the
• Us vs. Them = Fail.
"They might break
• "The data is in our alerting system, we
don't want you to break it."
• Assumes "they" are incompetent, or
malicious. Learn to trust.
"It's not your job,
so you don't need to
"That information isn't
• This excuse is typically caused by fear.
• Why are you deciding what's important?
"I'm not making
duplicating data is bad."
• For operational metrics, it is perfectly OK
to have a redundant copy of the data.
• Completely different goals.
• Use as alerting-beta
"I'm too busy."
"It's too dangerous"
"I don't know how."
• These are real problems.
• So let's fix it!
Let's get 100% of operational metrics in,
and enable the application to make and
share new metrics on demand
without any help from you.
• Similar to RRDTool, Ganglia, Cacti
• Uses specialized data storage
• Uses specialized queries
• Optimized for time series
Graphite isn't Perfect
• Documentation isn't great
(but getting better)
• A few QA issues
• Somewhat odd stack
• Flexible input and output
• REST API for graphs
• Simple UI for mashups and dashboards
• 3rd party, custom, client-side
Makes Sharing Easy
• Do you have an interesting graph?
It's just a URL!
• Dashboards are easy since graphs are
just URLs. Very easy to make HTML
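Since every graph is addressed by a URL, a dashboard can be little more than a page of image tags. A minimal sketch, assuming a Graphite server at the hypothetical host graphite.example.com and made-up metric names; `target`, `from`, `width` and `height` are parameters of Graphite's render endpoint:

```python
from html import escape
from urllib.parse import urlencode

GRAPHITE = "http://graphite.example.com"  # hypothetical host


def graph_url(target, since="-1h", width=400, height=250):
    """Build a URL for Graphite's /render endpoint."""
    query = urlencode({"target": target, "from": since,
                       "width": width, "height": height})
    return f"{GRAPHITE}/render?{query}"


def dashboard_html(targets):
    """Return a bare-bones HTML page with one <img> per graph."""
    imgs = "\n".join(f'<img src="{escape(graph_url(t))}">' for t in targets)
    return f"<html><body>\n{imgs}\n</body></html>"


if __name__ == "__main__":
    # Metric names below are illustrative only.
    print(dashboard_html(["servers.web01.loadavg.01", "stats.logins"]))
```

Anyone who can open the page sees the same live graphs, and sharing one of them is a matter of copying its URL.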
• A single low-end machine should have
capacity for a few thousand metrics per
minute from 50+ machines.
• Graphite is not CPU intensive, but
needs fast disks and/or more memory.
• Graphite is not hard to install, but it is a
• But might be as easy as
"apt-get install graphite" on your
• It would be good to have a workshop
or prebuilt AMI for EC2
• But not today :-(
• You could parse /proc, ps, df,
netstat, etc and write your own
• ...or use Diamond from Brightcove
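For a sense of what writing your own collector involves, here is a rough sketch that reads the load average and sends it with Carbon's plaintext protocol (one "path value timestamp" line per metric, by default on TCP port 2003). The host name and the metric path are assumptions for illustration, and this is not how Diamond itself is implemented.

```python
import os
import socket
import time

CARBON_HOST = "graphite.example.com"  # hypothetical host
CARBON_PORT = 2003                    # Carbon's default plaintext port


def format_metric(path, value, timestamp):
    """Format one plaintext-protocol line: path, value, Unix timestamp."""
    return f"{path} {value} {int(timestamp)}\n"


def collect_loadavg(hostname, now=None):
    """Read the 1-minute load average (Unix only) and format it."""
    now = time.time() if now is None else now
    load1, _, _ = os.getloadavg()
    return format_metric(f"servers.{hostname}.loadavg.01", load1, now)


def send(lines):
    """Open a TCP connection to Carbon and write all metric lines."""
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    short_name = socket.gethostname().split(".")[0]
    send([collect_loadavg(short_name)])
```

Run from cron once a minute, this covers a single metric. A ready-made collector saves you from repeating the exercise for every subsystem.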
Metrics in Diamond now
and many more
100% of pure operational metrics are now shared!
But what about the
And business metrics?
• Your application sends event data to
statsd, as it happens, in real-time.
• StatsD collects this data and computes
(sum, min, max, average)
• Once a minute, it writes data to
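The aggregation step can be pictured with a small sketch. This is not StatsD's actual code, only an illustration of collecting values during an interval and reducing them to sum, min, max and average at flush time. The class and metric names are made up.

```python
from collections import defaultdict


class Aggregator:
    """Toy stand-in for the aggregation a StatsD-like daemon performs."""

    def __init__(self):
        self.values = defaultdict(list)

    def record(self, name, value):
        """Store one event value as it arrives."""
        self.values[name].append(value)

    def flush(self):
        """Summarize everything seen since the last flush, then reset."""
        summary = {
            name: {
                "sum": sum(vals),
                "min": min(vals),
                "max": max(vals),
                "average": sum(vals) / len(vals),
            }
            for name, vals in self.values.items()
        }
        self.values.clear()
        return summary


if __name__ == "__main__":
    agg = Aggregator()
    for ms in (120, 80, 100):
        agg.record("login.time_ms", ms)
    print(agg.flush())
```

In a real deployment the flush would run on a timer rather than being called by hand.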
The Magic of UDP
• Your application sends metrics in a
• UDP is fire-and-forget. No exceptions, no
timeouts. It cannot cause your
application to crash.
• It will not overload your network.
• You may lose metrics, but in an
intranet, it's rare.
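A minimal sketch of the fire-and-forget idea, assuming a StatsD daemon on localhost at its default port 8125. The send call can still fail locally (a bad host name, for example), so this hypothetical helper swallows every error to keep metrics from ever hurting the application.

```python
import socket


def send_udp(payload, host="127.0.0.1", port=8125):
    """Send one datagram and never raise.

    Returns True if the datagram was handed to the OS, False otherwise.
    Delivery itself is never confirmed: UDP has no acknowledgements.
    """
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(payload.encode("ascii"), (host, port))
        finally:
            sock.close()
        return True
    except Exception:
        # Swallow everything: a lost metric must not break the caller.
        return False


if __name__ == "__main__":
    # "logins:1|c" is the StatsD counter format: name, value, type.
    send_udp("logins:1|c")
```

There is no connection to set up and no reply to wait for, so the call returns immediately whether or not anything is listening.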
Let's Count Logins!
• Most StatsD client APIs are
one-file, no C, simple.
• Add one line to your login code.
• That's it!
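To show how small the change is, here is a sketch with a hypothetical one-class client and a placeholder login function. `TinyStatsd`, `login` and `check_password` are invented for illustration. Real client libraries differ in naming but send the same `name:value|c` counter format.

```python
import socket


class TinyStatsd:
    """Hypothetical minimal client; real client libraries offer more."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def counter_line(name, count=1):
        """Format a StatsD counter: name, colon, count, then |c."""
        return f"{name}:{count}|c"

    def incr(self, name, count=1):
        """Send a counter increment; errors are ignored on purpose."""
        try:
            line = self.counter_line(name, count)
            self.sock.sendto(line.encode("ascii"), self.addr)
        except Exception:
            pass  # metrics are best-effort


statsd = TinyStatsd()


def login(username, password, check_password):
    """Example login handler; check_password is a placeholder callable."""
    if not check_password(username, password):
        return False
    statsd.incr("logins")  # the one added line
    return True
```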
• You can also graph low-frequency
• Just send another StatsD request in
your batch script
• Do it on reboots, installs, core dumps.
• New bugs, new hires, new code
• Use drawAsInfinite to display
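`drawAsInfinite` is a Graphite render function that draws every non-zero value as a full-height vertical line, which suits rare events overlaid on a busier graph. A small sketch of building such a URL; the base URL and both metric names are assumptions.

```python
from urllib.parse import urlencode


def overlay_url(base, metric, event, since="-24h"):
    """Render URL plotting a metric plus an event series as vertical lines."""
    params = [
        ("target", metric),
        ("target", f"drawAsInfinite({event})"),
        ("from", since),
    ]
    return f"{base}/render?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical server and metric names.
    print(overlay_url("http://graphite.example.com",
                      "stats.logins", "stats.deploys"))
```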