Finding bad apples early: Minimizing performance impact


The big data era is characterized by the ever-increasing velocity and volume of data. To store and analyze this growing data, the operational footprint of data stores and Hadoop has also grown over time. (According to a recent IDC report, spending on big data infrastructure is expected to reach $41.5 billion by 2018.) These clusters comprise several thousand nodes, and their performance is vital for delivering the best user experience and for team productivity.

The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:

# Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)

# A parameterized classification threshold: given the varying data characteristics of different services, no one model fits all

The proposed technique works well with both hourly and daily data and is used in production by multiple services. It has eliminated manual investigation effort and has reduced the impact of slow nodes, which previously went undetected for weeks or months.
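The outline on slide 18 of the transcript (rolling median, pairwise distance matrix, vector of distance sums, threshold test) can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. The function names `rolling_median`, `score_nodes`, and `find_slow_nodes`, the default window of 6 samples, and the default threshold of 2.0 are all assumptions. So is the reading of the slide's ratio: here each node's sum of distances is divided by the median of all nodes' sums. The sketch also assumes every series has the same length.

```python
import numpy as np


def rolling_median(values, window):
    """Trailing rolling median; the first window-1 points use a shorter window."""
    values = np.asarray(values, dtype=float)
    return np.array([
        np.median(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ])


def score_nodes(series, window=6, p=2):
    """Map node name -> (sum of its pairwise distances) / (median of all sums).

    `series` maps a node name to an equal-length sequence of one metric.
    p=2 gives the Euclidean distance; other p values give Minkowski distances.
    """
    names = sorted(series)
    smoothed = np.array([rolling_median(series[n], window) for n in names])
    # Pairwise Minkowski distance matrix between the smoothed series.
    diff = np.abs(smoothed[:, None, :] - smoothed[None, :, :])
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)
    totals = dist.sum(axis=1)
    baseline = np.median(totals)
    if baseline == 0:  # all series identical: nothing stands out
        return {n: 0.0 for n in names}
    return {n: float(t / baseline) for n, t in zip(names, totals)}


def find_slow_nodes(series, threshold=2.0, window=6, p=2):
    """Return nodes whose score exceeds the threshold, worst first."""
    scores = score_nodes(series, window, p)
    flagged = [n for n, s in scores.items() if s > threshold]
    return sorted(flagged, key=lambda n: scores[n], reverse=True)


# Synthetic example (made-up data): five similar nodes, one of which has a
# single-sample spike, plus one node that is consistently offset.
hours = np.arange(24)
pattern = (hours % 6).astype(float)
cluster = {f"n{i}": pattern + 0.1 * i for i in range(5)}
cluster["n2"][12] = 100.0          # one-off anomaly, absorbed by the median
cluster["n5"] = pattern + 10.0     # persistently different node
print(find_slow_nodes(cluster))    # prints ['n5']
```

In this toy data the one-off spike on `n2` is smoothed away by the rolling median, so only the persistently offset node is flagged. Because the distance is symmetric, the sketch flags nodes that deviate in either direction. A larger threshold trades recall for precision, as slide 17 notes.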

We will walk the audience through how the technique is used on real data.

Published in: Technology

  1. Finding Bad Apples Early: Minimizing Performance Impact. Arun Kejariwal (@arun_kejariwal)
  2. Real-time information consumption: > 500M Tweets per day; > 300 hrs video uploaded to YouTube every minute; 69% Y/Y growth in data traffic; mobile e-commerce; mobile trading
  3. Monitoring -> Availability, Performance, Reliability. Large number of micro services; large number of clusters; multiple data centers; millions of time series; data fidelity; anomaly detection
  4. Anomaly Detection: Why bother? Performance regression; potentially user impacting. Sources: software (inefficiency); hardware (slow, failure)
  5. Visual Analysis
  6. Anomaly Detection Algorithms: > 50 yrs of research; no model fits all. Time series analysis: univariate, multivariate. Frequency domain analysis: Fourier transform, wavelet transform. Clustering: K-Means
  7. Flavors of Anomaly Detection (Overview). Univariate time series: # clicks. Multivariate time series: CPU, memory. Set of time series: # tasks completed
  8. Host level
  9. Host level (contd.). Contextual Application Topology Map. Hierarchical: Datacenter -> Applications -> Services -> Hosts. Evolving Needs of Modern Operations: automatically discover the developer/architect's view of the application for the operations team (framework for system config and context); real-time, streaming architecture (keeps up with today's elastic infrastructure); scale to 1000s of hosts, 100s of (micro) services; present evolution of system state over time (DVR-like replay of health, system changes, failures)
  10. Large clusters: "slow" nodes; failed nodes. Visual analysis?
  11. Our Approach. Algorithmic: simple, fast and effective. Automated. Scalable to millions of time series
  12. The Approach (Bird's Eye View). Input: set of time series (nodes of the same cluster; same metric, e.g., CPU utilization). Output: anomalous time series. Classification
  13. The Approach (contd.): Metrics of Interest. Examples: requests; tasks completed; latency (p99, p999); CPU; bytes out (Kafka); rx_bytes; TCP retransmits; iowait, iowait_max
  14. The Approach (contd.): Robust against anomalies. Average; median. Examples
  15. The Approach (contd.): Robust against irregular patterns. Examples
  16. The Approach (contd.): Parameters. Historical window width: 6 hr; 1 day
  17. The Approach (contd.): Parameters. Classification threshold: large (high precision, low recall); small (low precision, high recall)
  18. The Approach (contd.): Outline. Compute rolling median time series. Compute pairwise distance matrix between time series (distance measure: Euclidean, Minkowski). Compute vector of sum of pairwise distances. Classify the time series which satisfy the following as anomalous: Sum(Pairwise Distances)/Median(Pairwise Distances) > threshold
  19. The Approach (contd.): Edge cases. Inconsistent data: time series of unequal length (data collection issue; cold start issue). How to address? Truncate at right; omit inconsistent timestamps; interpolate. Large number of anomalies. Non-uniform underlying trend across the time series
  20. Applications (in production). Hadoop clusters: # tasks completed. Datastores: slow writes. Micro-services: CPU utilization
  21. Applications (contd.): Finding laggards in a Hadoop cluster. Hadoop: big elephant in the room; multiple clusters; thousands of nodes. Slow nodes -> long job completion times; potential business impact
  22. Applications (contd.): Finding laggards in a Hadoop cluster. Rank nodes in decreasing order of "slowness". Long history window: compute-heavy ad-hoc jobs may skew analysis; burst in new jobs. Report, via e-mail, the slowest nodes, subject to an input sensitivity parameter specified by the SRE
  23. Applications (contd.): Finding multiple bands. Similar workload across nodes. Which band should be classified as anomalous? Load imbalance
  24. Applications (contd.): Finding heterogeneity. Disparate hardware; load imbalance
  25. Limitations (open problems): new nodes coming online; nodes dying
  26. Next steps: support for finer granularity data (minutely, sub-minutely)
  27. Next steps (contd.): multi-class classification; presence of multiple bands
  28. Next steps (contd.): cloud and containers
  29. For the curious. Similarity measures: match count based sequence similarity; normalized length of Longest Common Subsequence (LCS). Clustering: k-Means, phased k-means, k-Medoids; EM; single-linkage clustering. One-class SVM
  30. Feedback: @arun_kejariwal
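
Slide 19 names three ways to handle time series of unequal length: truncate at right, omit inconsistent timestamps, and interpolate. The last two could be sketched as below. This is an illustrative sketch under stated assumptions, not the approach used in the talk. The helper names `align_common` and `align_interpolate` are hypothetical, and each node's series is assumed to be a dict mapping a numeric timestamp to a value.

```python
import numpy as np


def align_common(series):
    """Omit inconsistent timestamps: keep only those reported by every node."""
    if not series:
        return [], {}
    common = set.intersection(*(set(s) for s in series.values()))
    timestamps = sorted(common)
    aligned = {n: [s[t] for t in timestamps] for n, s in series.items()}
    return timestamps, aligned


def align_interpolate(series):
    """Interpolate: fill gaps linearly on the union of all timestamps.

    np.interp holds the first/last known value constant outside a node's
    own time range, which may or may not suit a cold-start gap.
    """
    if not series:
        return [], {}
    grid = sorted(set().union(*(set(s) for s in series.values())))
    filled = {}
    for name, s in series.items():
        known = sorted(s)
        filled[name] = np.interp(grid, known, [s[t] for t in known]).tolist()
    return grid, filled


# Made-up data: node "b" is missing the sample at timestamp 1.
data = {"a": {0: 1.0, 1: 2.0, 2: 3.0}, "b": {0: 1.0, 2: 5.0}}
print(align_common(data))       # prints ([0, 2], {'a': [1.0, 3.0], 'b': [1.0, 5.0]})
print(align_interpolate(data))  # prints ([0, 1, 2], {'a': [1.0, 2.0, 3.0], 'b': [1.0, 3.0, 5.0]})
```

Dropping timestamps keeps only observed values but shortens the window. Interpolation keeps the window length at the cost of introducing synthetic points.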