Dumb and Dumber - How Smart is Your Monitoring Data?



Big Data is all the rage right now. Everyone from a social media company to your grandmother's online knitting store is suddenly a big data shop. Application monitoring tools are no exception to this trend – they collect gigabytes of monitoring data from your application every minute. But most of this data is useless. It's dumb data. More data isn't better if the data you're getting from your tools isn't helping you do your job – in fact, it's worse. In this session we'll talk about how to be a little smarter about collecting monitoring data, and how to ensure that the data we're collecting is intelligent, too. I'll talk about a few of the monitoring solutions and approaches I've used during my career as a monitoring architect at a large financial services institution, as well as present a few case studies of customers who have managed to make the leap from bigger data to smarter data.

  • Almost everyone in the US thinks that Idaho is Iowa.
  • Monitoring tools exist to help us identify and remediate business impact as fast as possible. This helps ensure our systems are available to make money and to service our customers.
  • Enterprises today are suffering from “data bloating”. Someone just gave it a nice name and called it “big data”. There is a lot of good that can come from big data analytics, but we need to get smarter about what data we really want to keep. There are differing opinions on how to deal with data.
  • Many think we should keep everything. You never know when you might need that one bit of data in the 500 petabytes you kept. In society we have labels for people who use this strategy at home: pack rats, and in extreme cases, hoarders. Hoarders can never find anything in their houses because the pile of junk is just way too much to sift through. You can face a similar problem with your monitoring data; that's why there are so many big data analytics companies coming to market today. You need expensive tools to sift through the mountains of junk data to find the nugget you are after.
  • Then there are those who don’t really consider storage of data important in any way. They just deal with every situation that arises blindly, having no historical references to draw upon. This is also a bad strategy as it extends the time it takes to fix problems because the problems are harder to spot without your historical data to reference.
  • My favorite data strategy. By keeping around the stuff that is relevant and throwing away the clutter, we can be much more efficient and effective most of the time. Will the occasional circumstance arise where we are missing a key piece of data? Sure, but that's what we call a learning experience. Those experiences are what taught us what to keep and what to throw away in the first place.
  • The reality is that the digital hoarders are winning. Somewhere along the way we were taught that storage was cheap, and we took that as a green light to keep as much data hanging around as we could. Now this is okay if your company makes its money by analyzing and selling the information derived from all of this data, but for most of us the enormous amount of storage required to keep it all around is a huge drag on the IT budget.
  • Gartner surveyed a boatload of CIOs from the top companies in the world and found the top three issues they face. Data storage was the biggest problem, with 47% listing it in their top three. Look at the next two: system performance and network bandwidth. These items are all interrelated. The survey also showed that large enterprises see 40-60 percent yearly growth in data storage requirements.
  • We’ve got to stop believing that storage is cheap. It may be cheap relative to other parts of the IT budget on a per-unit basis, but at our massive consumption rate, storage has become a tremendous drag on the IT budget. An IBM study in 2009 concluded that the 5-year costs associated with storage were 80% OpEx and only 20% CapEx. It’s operationally expensive to deploy and maintain storage. Up to 40% of some large enterprise IT budgets are spent on storage. That’s just too much for most organizations. And when you look at fully loaded chargeback costs (the internal rates organizations charge to provide storage within a company), we see a range of $5-$25 per GB per month. We have to keep these costs to a minimum while solving problems as fast as we can.
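As a sanity check on those chargeback figures, the arithmetic is simple enough to script. A minimal sketch, assuming only the $5-$25 per GB per month fully loaded rate cited above:

```python
# Back-of-envelope storage chargeback math from the figures above.
# Assumes a fully loaded cost of $5-$25 per GB per month (the cited range).
GB_PER_TB = 1024

def yearly_cost_per_tb(rate_per_gb_month: float) -> float:
    """Fully loaded yearly chargeback for one terabyte of monitoring data."""
    return rate_per_gb_month * GB_PER_TB * 12

low = yearly_cost_per_tb(5)    # $61,440 per TB per year
high = yearly_cost_per_tb(25)  # $307,200 per TB per year
print(f"${low:,.0f} - ${high:,.0f} per year per TB")
```

Those two outputs match the $61,440-$307,200 per TB per year range quoted later in the deck.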
  • Let’s talk about Smart Data. Smart data combines intelligent data collection (knowing what to collect and what not to collect), intelligent data correlation, analytics, and intelligent archiving.
  • Data, by itself, doesn’t do us much good. We can use data to trigger alarms that have no bearing on business impact, and probably wake someone up in the middle of the night because a backup kicked off and resource consumption during the backup is high enough to trigger a static threshold. Data needs to be converted into information. Information is actionable, and is the result of correlation and analytics. Let’s look at some data and see where it takes us. We just got an alert with the following metrics. This alert is from a really cool person-monitoring tool that can give us all kinds of great data points. Based upon what we see, is this person performing well?
  • The data could just as easily describe this guy
  • As it could this guy. The data we have is insufficient. You’re probably all thinking like crazy right now, trying to decide who we might be talking about.
  • So here’s more data. Eye color? That seems pretty irrelevant to the question at hand. Should we have even collected it? Probably not; we just wasted network bandwidth, compute resources, and storage capacity. Weight helps us decide which person we are talking about. The runner, right? So now do we know if this person is performing well or not? You’re all thinking about what it would take to answer that question properly. Here’s more data: distance and time. Now we’re talking about serious performance metrics. So is the runner performing well? I think so, but I’m no expert on sprinters; these data points seem reasonable to me. Let’s add a final bit of data. Wow, we just created actionable information.
  • Usain Bolt just set a new world record, so we need to go ahead and update Guinness and any other place that world records are stored. All of those bits of data didn’t tell us anything actionable on their own. It took our brains performing correlation and analytics to create that information. So why are our crappy dumb data collectors good enough for monitoring our business- and mission-critical applications?
  • Our traditional tools are misleading. They show us tons of metrics and how they change over time. Look at all of these CPU spikes. We’ve got to be having some sort of impact, right? I don’t know. Show me enough of these charts and I might be able to manually correlate them in my head. But you had also better give me a historical reference point so I have a clue whether these charts look normal or not.
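That "does this look normal?" judgment is exactly what a monitoring platform should automate. A minimal sketch of baseline-relative alerting, purely illustrative; real APM tools build much richer seasonal baselines:

```python
from statistics import mean, stdev

def is_anomalous(history, current, sigmas=3.0):
    """Flag a metric only when it deviates from its own historical norm,
    instead of tripping a static threshold on every backup-window spike."""
    mu, sd = mean(history), stdev(history)
    return abs(current - mu) > sigmas * sd

cpu_history = [22, 25, 21, 24, 23, 26, 22, 24]  # hypothetical hourly CPU %
print(is_anomalous(cpu_history, 25))  # within its normal range -> False
print(is_anomalous(cpu_history, 80))  # a genuine spike         -> True
```

The point is the comparison against observed history, not the specific statistics; a static "alert above 75% CPU" rule would page someone for every nightly backup.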
  • Monitoring tools that bury us with data give us a false sense of security. If I’m looking for a needle in a small haystack, the last thing I need is for the haystack to get 10 times as big. When I have hundreds or thousands of charts to look at during a Sev 1 incident, do you think I’m going to find the problem in a few minutes? Probably not. A few hours? Maybe. Or maybe the problem will fix itself and I’ll never know what the root cause was. Has that ever happened to anyone here?
  • There’s too much focus on Disaster Recovery. This needs to shift to a focus on problem recovery. We’ll talk about what that means in a minute.
  • When I was a kid we used to have these emergency drills in school every year. The purpose of the drill was so that we would be prepared in case a nuclear bomb exploded in the general vicinity. Every year we went and hid under our desks under the pretense that we would be safe from the nuclear explosion going off in our back yard. Kind of a dumb thing to do. And every year companies spend tons of time and money updating and testing their disaster recovery plans for the worst case of losing a data center or two. There’s nothing wrong with planning for disasters, but what are we doing about the problems that impact us really often?
  • We spend entirely too little time planning for the problems that arise daily, weekly, and monthly. How many performance and stability related incidents does your organization have per month? How many times have you had to use your DR plan for real? Has anyone in this room ever had to invoke their disaster recovery plan for their application?
  • Let me introduce you to the concept of problem recovery planning…
  • When I worked in a large enterprise, there was no requirement that any thought be put into the monitoring strategy for each application. There were just standard check boxes that would ensure the most basic forms of infrastructure monitoring were deployed with your application. It wasn’t until a major performance or stability problem reared its ugly head that application support would ask what we could bolt on to figure out the problem.
  • Most of the app support personnel didn’t even know if there was any monitoring in place for their new applications.
  • Problem recovery planning requires that you think about your application and its unique needs ahead of time. There are all kinds of monitoring tools available today: Infrastructure Monitoring (physical and virtual hosts, OS, VMs); NPM, Network Performance Monitoring (switches, routers, packets, network errors, data transfer rates, etc.); APM, Application Performance Monitoring (business transactions, end user experience, application flow maps, deep code diagnostics, correlation and analytics); Log Monitoring (logged errors and business metrics); Database Monitoring (database metrics, table statistics, explain plans, locks, etc.); and Predictive Analytics.
  • I’ll give you an example from my time at a top 10 investment bank…
  • It starts with intelligence at the source (the agent). The collection/aggregation layer needs to know when to have agents collect more or less data. The analytics layer needs to create smart, sparse information from dense data. The archival mechanism needs to know what to throw away and what to keep.
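A hypothetical sketch of what "intelligence at the source" could look like. The `Agent` class, thresholds, and sampling rates here are illustrative inventions, not any real product's API:

```python
# Sketch: the collection layer tells agents to raise or lower their
# sampling rate based on current application health, so dense data is
# only collected while something actually looks wrong.

class Agent:
    def __init__(self):
        self.sample_rate = 0.01  # collect 1% of transactions when healthy

    def set_sample_rate(self, rate: float):
        self.sample_rate = rate

def adjust_collection(agent: Agent, error_rate: float):
    """Collect dense diagnostics only while something looks wrong."""
    if error_rate > 0.05:       # trouble: capture everything
        agent.set_sample_rate(1.0)
    elif error_rate > 0.01:     # warning: capture more detail
        agent.set_sample_rate(0.25)
    else:                       # healthy: sparse, cheap sampling
        agent.set_sample_rate(0.01)

agent = Agent()
adjust_collection(agent, error_rate=0.08)
print(agent.sample_rate)  # 1.0
```

The design choice is the point: storage and bandwidth are spent in proportion to how abnormal the application currently is, rather than at a flat "keep everything" rate.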
  • Application Flow – Active nodes, active tiers, node response time, tier response time, external service response times, etc…

    1. Introduction • I’m a Performance Geek!!! • Designed and implemented monitoring architecture for Wachovia Investment Bank and Wells Fargo Managed Services • I’ve used many of the enterprise-class monitoring tools in existence. • I currently live, work, and play in Idaho, USA
    2. Right Here! This is Idaho, I live here. This is Iowa, I don’t live here.
    3. Agenda: Big Dumb Data • Smart Data Defined • Shifting DR to PR • Smart Data Strategies • Examples • Questions
    4. Big Dumb Data
    5. Why do monitoring tools exist anyway? To quickly identify and remediate the business impact of performance and stability issues.
    6. What is Business Impact?
    7. Big Data = Enterprise Data Bloating • Business Data • Log Files • Monitoring Data • Business Intelligence Data • Legal Data • Regulatory Compliance Data • Email • Etc.
    8. Keep Everything?
    9. Keeping Too Little is Also Bad
    10. Keep Just What You Need
    11. True Story: Oops, that got expensive. 5-7 years ago we installed and operated 3 monitoring tools: BTM, APM, and Predictive Analytics. ~80 applications. Ended up with ~50 management servers and 5-10 TB of data. Explore the hidden costs before you decide to implement.
    12. The Digital Hoarders are Winning
    13. Gartner Survey: Data Storage 47% • System Performance 37% • Network Bandwidth 36%
    14. False Pretense That Storage is Cheap • 5-Year Storage Costs: 80% OpEx, 20% CapEx (2009 IBM Study) • IT Budgets: Up to 40% Spent on Storage • $5-25/GB/month Fully Loaded Cost: $61,440 - $307,200 Per Year Per TB
    15. Smart Data Defined
    16. Data must be turned into information to be useful. Heart Rate = 150 bpm. Blood Pressure = 200 over 100. Is the person performing well or not?
    17. Are we talking about this guy?
    18. Or this guy?
    19. Data must be turned into information to be useful. Eye Color = Brown. Weight = 207 lbs (94 kg). Is the person performing well or not? Distance Run = 100 meters. Time = 9.58s. World Record Time = 9.69s
    20. Correlation + Analytics Turned Data Into Information
    21. Traditional Monitoring Tools Are Misleading. Resource Spikes May or May Not Cause Business Impact
    22. Having a lot of data causes a false sense of security. Your needle is somewhere in there; good luck finding it anytime soon.
    23. We’ve become addicted to metrics! How Much Is Enough???
    24. What do these charts tell us about application performance or business impact?
    25. This is better, but still not good enough. Average Response Time of ProcessOrder Transaction with Historical Baseline
    26. True Story: Wasted Time. Called onto a conference line to help with a Sev 1. Confident I had all of the data I needed to figure out the problem. Searched charts for hours. The problem wasn’t on my servers in the first place.
    27. We need our monitoring platforms to do the heavy lifting for us if we want MTTR < 30 minutes. Monitor my application from the user AND IT perspective. Determine what is normal by observation and analytics. Show me what my application looks like right now using correlation. Alert me if anything above changes for the worse. Have the data I need to solve the problem and lead me to the answer quickly.
    28. Disaster Recovery (DR) Needs to Shift to Problem Recovery (PR)
    29. We spend too much time planning for what will probably never happen.
    30. We spend too little time planning for what happens all too often.
    31. What is Problem Recovery Planning? PR is a strategy and an organizational mindset. It’s the idea that monitoring is critical to managing applications and ensuring an optimal user experience. It’s the practical implementation of a well-defined monitoring architecture.
    32. Monitoring is an afterthought too often.
    33. When a problem occurs… • Do we have monitoring? • What kind? • What are we collecting? • How long do we have history?
    34. Think about what you need ahead of time: DB, Network, Log, App, Infra
    35. True Story: Investment Bank Blues • 40-50 Sev 1 Incidents Per Month • MTTR ~2 Hours • Executive Mandate to Cut Incidents to Single Digits • Executive Mandate of 15-Minute or Less MTTR for All Trading Applications
    36. Had It Already: • Infrastructure Monitoring • NPM, Network Performance Monitoring • Periodic Database Monitoring. Missing: • APM, Application Performance Monitoring • Log Monitoring and Analytics • Always-On Database Monitoring • Predictive Analytics
    37. Added: • APM, Application Performance Monitoring • Predictive Analytics • Always-On Database Monitoring • Business/IT Master Dashboard. Significant Results: • Reduced Sev 1s from 45/month to 4/month • Improved key transaction speeds by 10x • Reduced MTTR from 3 hrs to 30 mins • Detected and repaired problems before impact
    38. Cloud computing is driving the need for PR planning • Cloud apps are highly distributed so they can take advantage of dynamic scaling • Highly distributed applications are much harder to troubleshoot • Use of APM is the fastest way to identify and fix application problems in the cloud
    39. Smart Data Strategies
    40. (image-only slide)
    41. Single High Traffic Application • Transmit and store up to 40 TB of monitoring data per year! (Keep Everything) The costs add up. • Cloud Bandwidth = ~$5000 per year per application, charged $0.12 per GB of data out of the cloud. • Storage Costs = $204,800 per month by end of year 1, using $5 per GB per month. ~1.3 million USD spent by the end of the 1st year.
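A quick script confirming that slide's math under its stated assumptions (40 TB per year, $0.12/GB cloud egress, $5/GB/month storage; the even month-by-month accrual is my assumption):

```python
# Sanity-checking the "keep everything" cost example under its assumptions:
# 40 TB of monitoring data per year, $0.12/GB egress, $5/GB/month storage.
GB_PER_TB = 1024
tb_per_year = 40

egress = tb_per_year * GB_PER_TB * 0.12          # ~$4,915/yr, i.e. ~$5000
storage_month_12 = tb_per_year * GB_PER_TB * 5   # $204,800/month once all 40 TB exist

# Cumulative year-1 storage bill, assuming data accrues evenly month by month:
cumulative = sum(tb_per_year * GB_PER_TB * (m / 12) * 5 for m in range(1, 13))
print(round(egress), storage_month_12, round(cumulative))
```

The cumulative figure comes out around $1.33M, consistent with the "~1.3 million USD by the end of the 1st year" on the slide.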
    42. We need to save THE RIGHT data: Analytics • Aggregation • Correlation • Control • Application • Archive
    43. EUE – Key Performance Indicators (KPIs). EUE – Pages, response time, network time, render time, location performance, etc.
    44. EUE – Key Performance Indicators (KPIs). EUE – Pages, response time, network time, render time, location performance, etc.
    45. Business Transaction KPIs. BTs – Response time, count, rate, errors, CPU Used, CPU Block, CPU Wait, etc.
    46. Application Flow KPIs. Application Flow – Active nodes, active tiers, node response time, tier response time, external service response times, etc.
    47. Deep Diagnostics – We don’t need to save these forever.
    48. Don’t be this guy…
    49. Plan ahead, anticipate your needs, keep your organization nimble, powerful, and purpose-built.
    50. Example
    51. Netflix • Video Streaming • AWS Deployment • Highly dynamic environment • ~10,000 JVM nodes • Doing it right
    52. Netflix: Collecting over 1 million metrics per minute.
    53. What’s the point(s)? • Big data isn’t a bad thing as long as it is serving a purpose. • Big monitoring data slows down MTTR and drives up both OpEx and CapEx. • Focusing on problem recovery will help you figure out your architecture, tools, and process. • Don’t be a digital hoarder!!!
    54. Questions???
    55. Thank You