You can watch the replay for this Geek Sync webcast in the IDERA Resource Center: http://ow.ly/zeeg50A5pII
You know that the state of your systems is effectively invisible unless you monitor them. But there are so many types of monitoring available that it's easy to spend a lot of time and money setting up tools that may or may not meet your most critical needs.
In this Geek Sync presentation, Ernest Mueller, application performance product manager at IDERA, walks through how to start light with the monitoring necessities and iterate toward more complex capabilities. He explains which things are most important to instrument first and how to build your own roadmap to a full monitoring implementation.
2. About Ernest
• Product Manager at IDERA in Austin, TX
• 20 years of IT experience, from startups to
enterprise shops
• Runs CloudAustin user
group, DevOpsDays Austin
conference
• Twitter: @ernestmueller
• Blog: theagileadmin.com
7. What To Do?
• Monitor it all?
– Expensive
– Complex
• How deep?
– Monitor parts of it?
– Gaps in visibility
– Which parts?
8. Monitoring Pitfalls
• “I have 100,000 metrics, but still can’t tell if the
site is down?”
• “Did you know we’re generating 30% of our
system load from monitoring?”
• “It’s going to cost how much? Maybe, but the
procurement cycle will be 9 months…”
• “We’re spending 2 headcount just on
maintaining our monitoring systems!”
• “We get so many alerts we need a secondary
triage system so we know which ones to pay
attention to.”
9. What Is Lean?
• Eliminate Waste
• Amplify Learning
• Decide as late as possible
• Deliver as fast as possible
• Empower the team
• Build quality in
• See the whole
Lean Principles
10. Your Monitoring Is A Product
• Build – Minimum Viable Monitoring
• Measure – All the Monitoring Points
• Learn – About the App and the Monitoring
• Repeat – Go Deeper Where It’s Needed
Iterate Through A Development Cycle
11. Monitoring MVP Areas
1. Service Performance and Uptime
2. Software Component Metrics
3. System Metrics
4. Application Metrics
What are the most important areas to cover?
12. Service Performance and Uptime
• Remember lean principle “see the whole”
• “What do my users see?”
• MVP: external synthetic probe of the end service
• Next: RUM, waterfalls, transactions
• Later: transaction warehousing, cross-tier
transaction tracing
The end user view is always the most critical
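As a rough illustration of the MVP bullet above, an external synthetic probe can be as small as a timed HTTP request run from outside your network. The sketch below is one possible shape, not a definitive implementation: the names ProbeResult, http_status and probe are hypothetical, and treating any 2xx/3xx status as "up" is an assumption you would tune for your own service.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    """Outcome of one synthetic check (hypothetical structure)."""
    url: str
    up: bool
    status: Optional[int]
    seconds: float


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still reached the service; report the code.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Run one check and time it; any other exception counts as 'down'."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception:
        # DNS failure, refused connection, timeout, and so on.
        return ProbeResult(url, False, None, time.monotonic() - start)
    elapsed = time.monotonic() - start
    # Assumption: 2xx and 3xx mean the user can reach the service.
    return ProbeResult(url, 200 <= status < 400, status, elapsed)
```

Calling probe() with your public URL on a schedule, from a machine outside your own infrastructure, gives both an uptime signal and a response-time series. The fetch parameter exists only so the check can be exercised without a network.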
13. Remember the Process
• Build – Minimum Viable Monitoring
• Measure – All the Monitoring Points
• Learn – About the App and the Monitoring
• Repeat – Go Deeper Where It’s Needed
Lean Development Cycle
14. Software Component Metrics
• “Is my service up?”
• Check ports/processes for actionable outages
• MVP: local probes
• Next: More metrics beyond uptime and response
time (most have a set they expose)
• Later: Advanced deep dive database and other app
component APM
What you can page people on
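A local probe for the "check ports" bullet above can be sketched with nothing more than a TCP connection attempt. This is a minimal example under stated assumptions: port_is_open and check_components are hypothetical names, and the component list you pass in is yours to define.

```python
import socket
from typing import Dict, Tuple


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat as down.
        return False


def check_components(components: Dict[str, Tuple[str, int]]) -> Dict[str, bool]:
    """Map each component name to whether its port accepts connections."""
    return {
        name: port_is_open(host, port)
        for name, (host, port) in components.items()
    }
```

For example, check_components({"db": ("127.0.0.1", 5432)}) returns a dictionary with a single True or False entry for "db". A successful connection only shows that something is listening; it says nothing about whether the component is answering correctly, which is where the "Next" metrics come in.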
15. System and Network Metrics
• “What is the root cause?”
• Load on your systems and network devices
• MVP: basic system metrics
(CPU/mem/disk/network)
• Next: More depth, cloud/virt/container layer stats
• Later: Netflow, deeper dive into specific hardware
platform metrics (SANs, etc.)
Diagnosing Issues
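For the MVP level of system metrics, a few standard-library calls already cover CPU load and disk. The sketch below assumes a Unix-like host (os.getloadavg is not available on Windows); basic_system_metrics is a hypothetical name. Memory and network counters are not exposed portably by the standard library, so in practice a third-party package such as psutil, or an agent, would fill that gap.

```python
import os
import shutil
from typing import Dict, Optional


def basic_system_metrics(path: str = "/") -> Dict[str, Optional[float]]:
    """Collect a small set of host metrics (Unix-like systems assumed)."""
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    try:
        load_1m: Optional[float] = os.getloadavg()[0]
    except (OSError, AttributeError):
        load_1m = None  # load average not obtainable on this platform
    disk = shutil.disk_usage(path)
    used_pct = 100.0 * disk.used / disk.total if disk.total else 0.0
    return {
        "cpu_count": float(cpus),
        "load_1m": load_1m,
        # Load per CPU above roughly 1.0 suggests the host is saturated.
        "load_per_cpu": None if load_1m is None else load_1m / cpus,
        "disk_used_pct": used_pct,
    }
```

Sampling this once a minute and shipping the values alongside your probe results is usually enough to start correlating an outage with load or a full disk.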
16. Application Metrics
• “What is really going on?”
• The app knows, get the app to tell you
• MVP: Logging and log aggregation
• Next: Specific app metric emission, application
instrumentation (Management API or bytecode)
• Later: Better logging
Business value and troubleshooting specifics
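One low-cost way to "get the app to tell you" is to emit metrics as structured log lines that a log aggregator can parse. The sketch below is an assumption-laden example rather than a prescribed format: the key=value layout, the logger name app.metrics, and the helpers format_metric, emit and timed are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("app.metrics")  # hypothetical logger name


def format_metric(name: str, value: float, **tags: str) -> str:
    """Render one metric as a key=value line that aggregators can parse."""
    parts = [f"metric={name}", f"value={value:.3f}"]
    parts += [f"{key}={val}" for key, val in sorted(tags.items())]
    return " ".join(parts)


def emit(name: str, value: float, **tags: str) -> None:
    """Write the metric through the normal logging pipeline."""
    logger.info(format_metric(name, value, **tags))


@contextmanager
def timed(name: str, **tags: str):
    """Time the wrapped block and emit its duration in seconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        emit(name, time.monotonic() - start, **tags)
```

Wrapping a business operation in `with timed("checkout.seconds", region="us"):` produces a line such as `metric=checkout.seconds value=0.250 region=us` (the value depends on how long the block took). Because it rides on existing logging, it needs no extra agent until you outgrow it.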
17. Think About The Principles
• Eliminate Waste
• Amplify Learning
• Decide as late as possible
• Deliver as fast as possible
• Empower the team
• Build quality in
• See the whole
Lean Principles
18. Quick Demo
• CopperEgg – Ultra quick-start SaaS-based
monitoring with basics on systems, endpoints,
RUM, custom
• Uptime – Download and install infrastructure and
application monitoring
• Precise – APM suite with deep support from
everything from SAP to Java to SQL
Monitor At the Right Depth
Your systems are complex, and there are many points at which you can instrument them for monitoring, and various methods you can use to perform the measurements.
And the same goes for your applications.
There are many, many different ways to monitor your systems and applications, and for each type there are various instrumentation approaches and levels of depth.
An experienced IT person can make educated guesses at this – but they’re just guesses, and every system is unique. And there is a tendency for experienced folks to say “monitor it all! Maximum resolution forever!” But it’s easy for a solution to be too complex for an operator, and everything you have running has a logistics tail all of its own – maintenance, data storage, etc. Plus, you end up with a flood of data and especially alerts that you may not be prepared to properly handle.
These are all specific real monitoring issues I’ve seen with my own eyes.
Lean Manufacturing, as popularized by Taiichi Ohno’s Toyota Production System, is a method for eliminating waste in a system. This has been adapted into Lean IT and Lean Software Development. Lean software development is characterized by seven principles. It is designed to promote visibility, shorten cycle time, and ensure you’re delivering the highest value first.
Eric Ries applied lean to product development in his book The Lean Startup (2011), which characterizes the core loop inside the product development cycle as “Build – Measure – Learn.” Lean is not cost-cutting; lean is about bringing your maximum force to bear on the item with the highest leverage at any given time.
I’d like to recommend a sample roadmap of where I think you should start your monitoring. Rather than spending $100k in any one area, you want to get broad coverage in many areas first and then deepen those and/or move into adjacencies as needed. These are the four key areas to nail down first, assuming you’re starting from scratch (or trying to learn/redesign a complex set of monitoring solutions already in place).
The most important attribute of your system is not CPU on some box, or a queue length, or whether a process is running. It’s whether users are able to access and use your service from out there in the world, period. That’s the first thing to address.
Remember, we’re iterating here, so that you can learn what you really need (or don’t need), and so that you and your team can learn to use what you have well before getting more. You don’t half-bake each step (remember the “build quality in” principle), but you add one type of monitoring, see what it tells you, and then decide what you need next.
Next, you need more internally focused outage detection to tell you if you have an issue, and where that breakage is in your system.
Now we segue to metrics more useful for root cause analysis – a service is down or slow, why?
The more customized the metric is to your business, the more value it has for troubleshooting and for business purposes. Most issues lie within applications, not system components, but you have to rely on either the application telling you what’s wrong or external profiling.
As you go through each iteration, ask yourself how you are achieving these principles. Usually there are many consumers of a monitoring system, of all kinds of different skill levels. How do you empower those people and help them learn with the system you are constructing? Are you building in quality, and is your monitoring integrated to where it really lets you see the state of the system and not just “a bunch of line graphs”?