Redefine Triage by Learning the Golden Nuggets of APM


Successful use of APM doesn’t happen by accident or wishful thinking. You need to learn specific tasks and capabilities, growing competent with the technology while becoming savvy with the philosophy and practice of performance management. We have been validating analytic techniques for APM data and have found that using KPIs taken directly from your managed environment has a distinct advantage over a generic set of metrics. This ensures that your analytics work with meaningful data rather than being distracted by large volumes of spurious metrics. It is a technique that you can apply today, as you begin planning for your upgrade tomorrow.

In a webcast on May 29th 2013, Mike Sydor, Senior Engineering Services Architect at CA Technologies and author of “APM Best Practices”, used this content to discuss how to identify and harness KPIs to make sense of your APM "big data", and how these techniques will help you prepare for your upgrade to the new features and functionality of the upcoming APM release and its tight integration with Advanced Behavior Analytics (ABA).

Listen to the webcast replay
Learn more at

Published in: Technology

Redefine Triage by Learning the Golden Nuggets of APM

  1. Agenda
     • Why so many metrics with APM?
       - “Big Data”?
     • What we are learning with CA-ABA (analytics)
     • How to find KPIs
     • What’s new for CA-APM 9.6 Release
     © 2014 CA. ALL RIGHTS RESERVED.
  2. Typical APM Cluster
     • Dozens to hundreds of applications
       - 2800 JVMs/CLRs
     • Up to 5M metrics, every 15 seconds
     • Large applications span multiple data centers
       - 2-8 APM clusters, typical
       - 30-70 EM Collectors for a nationwide portal application
     • 12M to 28M metrics, every 15 seconds … certainly sounds like big data!!!
  3. What is Big Data???
     • APM information is “big”… but it is not “big data” without enrichment
     • [Diagram: 5M metrics that you don’t fully understand, OR the same 5M metrics with ENRICHMENT]
     • Enrichment sources: Trouble Management, Version Control, Time of ____ Constraints, Air Traffic Advisories, Weather Forecast, AP News Updates, Marketing Campaigns
     • Results: Correlation, Trends, Insights, Anomalies
  4. Challenges for Big Data
     • Data Variety: different sources give different perspectives. Does your data have a significant perspective?
     • Validation: is the data source meaningful/predictive?
     • Consistency: are the values trustworthy?
     • Data Structure and Nomenclature: Mapping, Transformation
     • Temporal Impedance Mismatch
       - APM: real-time with 15 second reporting interval
       - Trouble Management: +15-30 minutes later
       - Stock Ticker: +15-30 minutes later
       - Air Traffic Advisories: +30-60 minutes later
       - Version Control: days to weeks in advance
       - Marketing Campaign Assessment: 2-4 weeks later
  5. KPI Management Maturity
     • SGCM (Platform): Stalls, GC Settings, Concurrency, Memory Management Trends
     • APC (Application): Availability, Performance, Capacity
     • EKB (Transaction): Errors, Key Resource Performance, Business Transaction Survey
     • [Chart axes: VALUE against KPI MATURITY]
  6. What We are Learning with CA-ABA
  7. ABA Logical Architecture
     • [Diagram: APM Cluster (5M metrics) → 100k metrics (via RegEx) → Anomaly Engine → Anomalies, Alerts]
     • Why only 100k metrics??? Why not 5M???
  8. RegEx == Regular Expression
     • analytics.metricfeed.process.3 = Custom Metric Host (Virtual) |Custom Metric Process (Virtual)|Custom Business Application Agent (Virtual)
     • analytics.metricfeed.metric.3 = By Business Service|[^|]+|[^|]+|[^|]+:.+
  9. RegEx is hard… but easy to validate
  10. Metricfeed.3
      • [Chart: metricfeed.3 metric counts, axis 0-200]
      • Broader collection of metrics, but only 87/500 == 17.4% are generally known as useful
  11. Suspects Identified via Baseline Technique
      • [Chart: suspects via baseline techniques, by category: SiteMinder, Backends, JSP, Frontends, JMX, Custom; axis 0-18]
      • Average RT only
      • 100% useful metrics, ready for validation: 47/43625 == 0.1%
  12. Metric Count TypeView
  13. What is an Application?
      • Front-ends
        - Browser? Webservice? Messaging?
      • Back-ends
        - Databases, Webservices, Messaging, Mainframes, Trading Partners
      • Muck-in-the-Middle
        - Software quality, stability and scalability
      • We want to identify KPIs for each of these elements
        - helps us build a useful dashboard for Operations
        - helps expose what the resources are really doing
        - helps us define acceptance criteria, to act proactively
        - helps us to triage really effectively
  14. How to Find KPIs
  15. Capacity KPIs: “Tree Rings”
  16. Performance KPIs: High Volume + Significant Response Time
  17. Create a Simple Alert and Threshold (ConnectionStatus)
  18. Create a Simple Alert, Find Restart and Threshold (MetricCount)
      • “UP”, but not actually doing anything!!!
  19. Understanding Your Environment
      • Identify the KPIs
        - Availability: Agent ConnectionStatus; Number of Live Metrics (Metric Count)
        - Performance: high-volume components with significant response time, NOT “Top 10 Response Time”
        - Capacity: highest-volume components
      • Don’t Wait for Production!!!
        - Make it part of your pre-production review
        - Manage the application lifecycle by trending KPIs
  20. KPI Evolution
      • Good (Platform: coarse information, but not really APM)
        - Stalls
        - GC Settings
        - Concurrency
        - Memory Management (graph)
      • Better (additional)
        - Availability: Connected Status
        - Availability: Metric Count
        - Suspect Performance
        - Suspect Capacity
      • Best (additional)
        - Errors
        - Key Resource Performance
        - Business Transaction Survey
      • Better and Best (Application, Transactions, Resources): the APM advantage
  21. What’s New in CA APM 9.6
      • Simplified, automated, and built on CA APM strengths.
      • Faster, Easier APM
        - Intelligent Deep Transaction Trace is now dynamic and automated, and requires less developer involvement for deep dives into the apps supporting the transactions
        - Simplified triage with easier drill-down in the Application Triage Map, including Socket Grouping
        - Improved response times with software-based Transaction Impact Monitor (end-user experience)
        - Expanding APM's scope with Java 7 EM & Agents
      • Seamless Mainframe Awareness
        - Increased insight by adding DB2 details to transaction traces
        - Greater awareness with CA SYSVIEW MQ alerts & complete status in APM
        - Driving further cross-enterprise depth with CTG traces to fully expand backend calls
        - Other mainframe-based enhancements
  22. Preparing to Upgrade
      • HealthCheck the existing cluster prior to any upgrade
      • Good:
        - Do a clean install of the APM cluster, alongside the existing cluster version
        - Manually duplicate management modules, domains.xml, etc.
        - Bring down the old version, then bring up the new
      • Better:
        - Install the new version in a separate environment, reduced size
        - Migrate a few applications to the new environment for validation
        - Upgrade the primary environment after validation is achieved
      • Best:
        - Install a new GOLD environment in production, separate from the original cluster
        - Migrate agents, as schedules permit, until the original cluster may be decommissioned
        - This provides an opportunity to introduce pre-production review and generally correct any bad deployment habits
  23. Resources
      • APM Community Site (
        - Cookbook: APM HealthCheck
        - Understanding Which Metrics Matter (KPI discussion)
        - Cookbook: Application Audit (more details on the baseline techniques and process)
      • APM best practices: Realizing Application Performance Management, available on and
        - Baselines, Test Plans, App Audits, Triage, Firefighting
        - Organizational Models, Service Catalogs
      • APM Web Page :
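
To put the "Typical APM Cluster" figures in perspective, the reporting volume can be worked out directly. The sketch below is only arithmetic on the numbers given in the slides (metric counts and the 15-second reporting interval); the function name is illustrative and not part of any product.

```python
# Arithmetic sketch: how many data points per day a cluster reports,
# given a metric count and the 15-second interval quoted in the slides.
INTERVAL_SECONDS = 15
SECONDS_PER_DAY = 24 * 60 * 60


def data_points_per_day(metric_count, interval_seconds=INTERVAL_SECONDS):
    """Each metric reports one value per interval."""
    intervals_per_day = SECONDS_PER_DAY // interval_seconds
    return metric_count * intervals_per_day


for label, metrics in [
    ("single cluster (5M metrics)", 5_000_000),
    ("multi-cluster, low end (12M)", 12_000_000),
    ("multi-cluster, high end (28M)", 28_000_000),
]:
    print(f"{label}: {data_points_per_day(metrics):,} data points/day")
```

At 5M metrics this comes to 28.8 billion data points per day, which is why the volume "sounds like big data" even before any enrichment is considered.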
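
The "RegEx is hard… but easy to validate" point can be illustrated by testing a metric-feed expression against a few metric paths before deploying it. This is a minimal sketch using Python's `re` module, not the product's own matcher. It assumes that `|` is the metric-path separator and is meant literally, so it is escaped here; the slide transcript shows it unescaped, and the original escaping cannot be confirmed from the transcript. It also assumes the expression must match the whole path. The sample paths are invented for the check.

```python
import re

# Assumption: "|" separates metric-path segments and must be escaped to be literal.
# The unescaped form (as it appears in the transcript) is kept for comparison.
ESCAPED = r"By Business Service\|[^|]+\|[^|]+\|[^|]+:.+"
UNESCAPED = r"By Business Service|[^|]+|[^|]+|[^|]+:.+"

# Hypothetical metric paths, invented for this check.
SAMPLES = [
    "By Business Service|Trading|Place Order|Browser:Average Response Time (ms)",
    "By Business Service|Trading|Place Order:Average Response Time (ms)",
    "Frontends|Apps|Trading:Average Response Time (ms)",
]


def selected(pattern, paths):
    """Return the paths that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [path for path in paths if compiled.fullmatch(path)]


print(selected(ESCAPED, SAMPLES))    # only the first sample has three segments after the prefix
print(selected(UNESCAPED, SAMPLES))  # bare "|" is alternation, so no sample matches in full
```

Running a candidate expression against a handful of known-good and known-bad paths like this makes escaping mistakes visible immediately, before the feed silently selects the wrong metrics.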
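
The two availability alerts (ConnectionStatus and MetricCount) complement each other: an agent can be connected yet report far fewer metrics than usual, which is the "UP, but not actually doing anything" case. The following sketch shows one way to combine them. The `AgentSample` type, the state names, and the 50% threshold are all assumptions made for illustration; how ConnectionStatus values map to "connected" depends on the product and is not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class AgentSample:
    connected: bool     # derived from the agent's ConnectionStatus metric (mapping assumed)
    metric_count: int   # number of live metrics reported in this interval


def availability_state(sample, baseline_metric_count, min_fraction=0.5):
    """Classify an agent using both availability KPIs.

    min_fraction is a hypothetical threshold: the share of the baseline
    metric count below which a connected agent is treated as idle.
    """
    if not sample.connected:
        return "DOWN"
    if sample.metric_count < baseline_metric_count * min_fraction:
        return "UP_BUT_IDLE"
    return "OK"


print(availability_state(AgentSample(connected=True, metric_count=120), 1000))
print(availability_state(AgentSample(connected=False, metric_count=1000), 1000))
print(availability_state(AgentSample(connected=True, metric_count=950), 1000))
```

The baseline metric count would come from observing the agent under normal load, ideally during pre-production review.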
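
The distinction between performance KPIs ("high volume + significant response time") and a plain "Top 10 Response Time" list can be sketched as a filter followed by a ranking. The component names, figures, and thresholds below are invented, and ranking by total time (volume × average response time) is one plausible choice rather than the method used in the webcast.

```python
from typing import NamedTuple


class Component(NamedTuple):
    name: str
    invocations: int         # responses per interval (volume)
    avg_response_ms: float   # average response time in the interval


def capacity_kpis(components, top_n=2):
    """Capacity candidates: the highest-volume components."""
    return sorted(components, key=lambda c: c.invocations, reverse=True)[:top_n]


def performance_kpis(components, min_invocations, min_response_ms):
    """Performance candidates: high volume AND significant response time."""
    candidates = [
        c for c in components
        if c.invocations >= min_invocations and c.avg_response_ms >= min_response_ms
    ]
    # Rank by total time spent (an assumption of this sketch).
    return sorted(candidates, key=lambda c: c.invocations * c.avg_response_ms, reverse=True)


# Hypothetical data for one reporting interval.
COMPONENTS = [
    Component("ReportServlet", 12, 9000.0),   # slowest, but rarely called
    Component("LoginServlet", 5000, 350.0),
    Component("OrderService", 3000, 800.0),
    Component("HealthPing", 20000, 2.0),      # busiest, but trivially fast
]

print([c.name for c in performance_kpis(COMPONENTS, min_invocations=1000, min_response_ms=100)])
print([c.name for c in capacity_kpis(COMPONENTS)])
```

In this data, `ReportServlet` would head a "Top 10 Response Time" list but is excluded as a performance KPI because almost nothing calls it, while `HealthPing` matters for capacity but not for performance.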