This document provides an overview of a presentation given by Nick Galbreath at DevOpsDays Tokyo 2013 about making operations visible. The presentation encourages organizations to expose more operational metrics and business data through systems like Graphite and StatsD to improve communication and collaboration between teams. It provides examples of how to collect and visualize different types of data from applications, systems, and business processes. The goal is to overcome excuses for lack of visibility and have organizations complete the "One Machine, One Day, One Person Challenge" to start exposing all of their operational metrics.
7. Continuous
Deployment
• In 2012, I spoke many times on
continuous deployment.
• But changing from release cycles to
continuous deployment is too big a
change for most organizations, and they
don't have the tools to do it.
8. Goal
• I'm hoping that adding new metrics to
the application becomes so addictive
that you'll want to shorten release
cycles.
9. What is DevOps?
• Puppet, Chef, Ansible?
• GitHub? AWS? The Cloud?
• Continuous Deployment?
Yes, but these are tools. Great tools.
10. It's About
Communication
• Between machines
• Between team members
• Between Dev and Ops
But in many companies there is a bigger problem
11. You're Invisible
• If you are in Business, you are
invisible to Development and Tech
Operations
• If you are in Operations, you are
invisible to Business and
Development
• If you are in Development, you are
invisible to Business and Operations.
13. Developer
• "I don't know what my code will do in
production, so I let Ops deal with
it."
• "Why doesn't Ops fix these problems?"
• "What does Ops do all day?"
14. Business
• "Why do I have to wait till the end of
the month for a report?"
• "Did last week's release change
anything?"
• "Why don't they understand the impact
of that bug, outage, etc.?"
15. Operations
• "Why are they always bothering me?"
• "I've got work to do!"
• "Why do we have to do another release
again... can't developers do a better
job?"
• "What does this company do?" (really)
16. This is really destructive
To you
To your Team
To your company.
17. All of This
Can Be Fixed By Making
Operations Visible
with data
Not just technical operations but
company operations.
18. Your company is full
of data!
So Why Not Expose
This Data?
Here's a list of excuses I've heard
19. "But I already have
graphing in my
alerting system"
• Maybe. But it's junk
• Can't share
• Can't do data mash-ups
• Can't do data transformations
20. "They wouldn't
understand."
• "They won't understand the data so
what's the point of sharing it."
• First, "they" probably do. And the more
people looking at ops metrics, the
better.
• Us vs. Them = Fail.
21. "They might break
something."
• "The data is in our alerting system, we
don't want you to break it."
• Assumes "they" are incompetent, or
malicious. Learn to trust.
22. "It's not your job,
so you don't need to
know."
"That information isn't
important"
• This excuse is typically caused by fear.
• Why are you deciding what's important?
23. "I'm not making
another system,
duplicating data is bad."
• For operational metrics it is perfectly OK
to have a redundant copy of data.
• Completely different goals.
• Use as alerting-beta
24. "I'm too busy."
"It's too dangerous"
"I don't know how."
• These are real problems.
• So let's fix it!
25. One Machine,
One Day,
One Person
Challenge!
Let's get 100% of operational metrics in,
and enable the application to make and
share new metrics on demand
without any help from you.
27. Graphite isn't Perfect
• Documentation isn't great
(but getting better)
• A few QA issues
• Somewhat odd stack
(python-twisted, django)
28. Graphite Ecosystem
• Flexible input and output
• REST API for graphs
• Simple UI for mashups and dashboards
• 3rd party, custom, client-side
dashboards
29. Makes Sharing Easy
• Do you have an interesting graph?
It's just a URL!
• Dashboards are easy since graphs are
just URLs. Very easy to make HTML
dashboards.
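Because every graph is a URL, a dashboard can be nothing more than a page of image tags. The following is a minimal sketch in Python (not the PHP used elsewhere in these slides). The host name graphite.example.com and the metric names are placeholders; the /render endpoint with target, from, width and height parameters is Graphite's render API.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical Graphite host; replace with your own server.
GRAPHITE = "http://graphite.example.com"

def graph_url(target, hours=24, width=400, height=250):
    """Build a Graphite render URL for one target expression."""
    query = urlencode({
        "target": target,
        "from": f"-{hours}h",   # relative time: the last N hours
        "width": width,
        "height": height,
    })
    return f"{GRAPHITE}/render?{query}"

def dashboard_html(title, targets):
    """Return a static HTML page with one <img> per graph."""
    images = "\n".join(
        f'<img src="{escape(graph_url(t))}" alt="{escape(t)}">' for t in targets
    )
    return (
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{images}\n</body></html>"
    )
```

Writing the result of dashboard_html("Ops", [...]) to a file on any web server is enough to share it; no dashboard software is required.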
30. One Machine
One Day!
• A single low-end machine should have
capacity for a few thousand metrics per
minute from 50+ machines.
• Graphite is not CPU intensive, but
needs fast disks and/or more memory.
31. One Day,
One Person
• Graphite is not hard to install, but it is a
bit messy.
• But it might be as easy as
"apt-get install graphite" on your
system.
• It would be good to have a workshop
or prebuilt AMI for EC2
• But not today :-(
32. Operational Stats
• You could parse /proc, ps, df,
netstat, etc and write your own
custom scripts....
• ...or use Diamond from Brightcove
• https://github.com/BrightcoveOS/Diamond
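For comparison, here is roughly what the "write your own custom script" route looks like for a single metric. This is a sketch in Python, not how Diamond is implemented. It parses the text of /proc/loadavg and formats it in Carbon's plaintext protocol ("path value timestamp", one metric per line, default port 2003). The metric prefix and the host name in the usage note are placeholders.

```python
import socket

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1, 5 and 15 minute averages."""
    one, five, fifteen = text.split()[:3]
    return {"01": float(one), "05": float(five), "15": float(fifteen)}

def carbon_lines(prefix, values, timestamp):
    """Format values as Carbon plaintext lines: '<path> <value> <timestamp>'."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}"
        for name, value in sorted(values.items())
    ]

def send_to_carbon(lines, host, port=2003):
    """Send the lines to a Carbon plaintext listener over TCP."""
    payload = "\n".join(lines) + "\n"
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("ascii"))
```

A cron job could call parse_loadavg(open("/proc/loadavg").read()), pass the result through carbon_lines("servers.web01.loadavg", values, time.time()), and hand the lines to send_to_carbon(lines, "graphite.example.com"). Multiply that by every metric you care about and a ready-made collector starts to look attractive.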
33. Metrics in Diamond now
• Memory
• CPU
• Disk
• Network
• Apache
• NGINX
• MySQL
• SNMP
and many more
34. 100% of pure operational metrics are now shared!
But what about
your applications?
And business metrics?
35. Enter StatsD
• https://github.com/etsy/statsd
• Your application sends event data to
statsd, as it happens, in real-time.
• StatsD collects this data and computes
time-series metrics
(sum, min, max, average)
• Once a minute, it writes data to
Graphite
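The collect-then-summarize step can be pictured with a small sketch. This is Python written for illustration only; it is not StatsD's code, and the class and metric names are made up. It gathers raw values during one flush interval and reduces them to the four statistics listed above.

```python
from collections import defaultdict

class FlushWindow:
    """Collects raw values for one flush interval and summarizes them."""

    def __init__(self):
        self.values = defaultdict(list)

    def record(self, name, value):
        """Store one raw event value as it arrives."""
        self.values[name].append(value)

    def flush(self):
        """Return {metric: summary} and reset for the next interval."""
        summary = {
            name: {
                "sum": sum(vals),
                "min": min(vals),
                "max": max(vals),
                "average": sum(vals) / len(vals),
            }
            for name, vals in self.values.items()
        }
        self.values.clear()
        return summary
```

The application only ever calls the equivalent of record(); the once-a-minute flush and the write to Graphite happen on the StatsD side.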
36. The Magic of UDP
• Your application sends metrics in a
UDP packet.
• Sending UDP is fire-and-forget: no
connection, no reply to wait for, no
timeouts. A missing StatsD server cannot
crash your application.
• It will not overload your network.
• You may lose metrics, but in an
intranet, it's rare.
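A sketch of what such a send looks like at the socket level, in Python. The function name is an assumption; 8125 is StatsD's default port. The call returns as soon as the datagram is handed to the operating system, and any local socket error is swallowed so that a metrics problem never propagates into application code.

```python
import socket

def send_udp(payload, host="127.0.0.1", port=8125):
    """Send one datagram and return immediately; never raise into the caller."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload.encode("utf-8"), (host, port))
        return True
    except OSError:
        # For example a name lookup failure. Metrics are best-effort,
        # so report failure quietly instead of raising.
        return False
```

There is no connect step and no read, which is why there is nothing to time out on.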
37. Let's Count Logins!
• Most StatsD client APIs are
one-file, no C, simple.
• Add one line to your login code.
StatsD::increment('logins');
• That's it!
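To show how little a one-file client needs, here is a minimal sketch in Python. It is not any particular published client; the class and method names are assumptions. The line format it emits, name:value|type with "c" for counters and "ms" for timers, is the StatsD wire format.

```python
import socket

class StatsClient:
    """Minimal StatsD client: one socket, one datagram per call."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def payload(name, value, kind):
        """Format one metric in the StatsD line format: <name>:<value>|<type>."""
        return f"{name}:{value}|{kind}"

    def _send(self, text):
        try:
            self.sock.sendto(text.encode("utf-8"), self.addr)
        except OSError:
            pass  # metrics are best-effort; never break the caller

    def increment(self, name, count=1):
        self._send(self.payload(name, count, "c"))    # "c" = counter

    def timing(self, name, millis):
        self._send(self.payload(name, millis, "ms"))  # "ms" = timer
```

With this in place, the login handler gains a single line: stats.increment("logins").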
38. Events!
• You can also graph low-frequency
events.
• Just send another StatsD request in
your batch script
StatsD::increment("deploy", 1);
• Do it on reboots, installs, core dumps.
• New bugs, new hires, new code
commits.
• Use Graphite's drawAsInfinite() to display
them as vertical lines.
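As a sketch, an event overlay is just a second target on the same graph URL. The Python below builds such a URL. The host and the metric paths stats.logins and stats.deploy are assumptions; the actual path depends on how your StatsD namespaces its counters. drawAsInfinite() is a Graphite function that draws each non-zero datapoint as a vertical line.

```python
from urllib.parse import urlencode

def overlay_url(base, series, event, hours=24):
    """Render URL that plots `series` with `event` overlaid as vertical lines."""
    params = [
        ("target", series),
        ("target", f"drawAsInfinite({event})"),
        ("from", f"-{hours}h"),
    ]
    return f"{base}/render?{urlencode(params)}"

# Placeholder host and metric paths:
url = overlay_url("http://graphite.example.com", "stats.logins", "stats.deploy")
```

The resulting graph shows the login rate with a vertical marker at each deploy, which makes "did the last release change anything?" a question anyone can answer by looking.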
41. Logins By Country!
• Get the country code from the IP address
• Make a new metric
"login_country" instantly
StatsD::increment('logins');
$kuni = geoip2country($ipv4);
// Double quotes, so PHP interpolates $kuni
StatsD::increment("logins.$kuni");