Approximate QoS Rule Derivation Based on Root Cause Analysis for Cloud Computing | PRDC 2019

Presentation at PRDC 2019

  1. 1. Approximate QoS Rule Derivation Based on Root Cause Analysis for Cloud Computing
     Satoshi Konno 1) 2) and Xavier Defago 3)
     1) Yahoo Japan Corporation, 2) Japan Advanced Institute of Science and Technology (JAIST), 3) School of Computing, Tokyo Institute of Technology
     PRDC 2019, December 1-3, 2019, Kyoto, Japan
     Copyright ©2019 Yahoo Japan Corporation. All Rights Reserved.
  2. 2. Database Platforms in Yahoo! JAPAN: 300+ Systems, 100+ Services
  3. 3. Major Services of Yahoo! JAPAN
     [Figure: service map with US and JP labels, covering Media, Search, Video, Answer, Mail, Membership, C2C, Payment, C2C EC, B2C EC, Local Search, Knowledge search, News, Yahoo Auction, Premium, and Loco]
  4. 4. Demand on OSS Database Platforms
     • 300+ Systems overall
     • MySQL (RDB Team): 200+ Systems, 2000+ DBs
     • Cassandra (NoSQL Team): 100+ Systems
     • 4000+ Nodes
     • X : Autonomous Recovery (marked for both teams)
     • Demand for developing autonomous recovery systems
     • The number of nodes is increasing year by year.
     • Human resources are limited.
     [Figure: chart with the values 30, 70, 60, 40]
  5. 5. Table of Contents
     • Background and Related Work
     • Proposed Autonomous Recovery Methods (μQoS and Shape-Root)
     • Evaluation Results
     • Conclusion and Future Plans
  6. 6. Background
  7. 7. Monitoring Studies for Cloud Computing
     • Traditional (Storage) Monitoring Systems: O : Aggregation, X : Analysis
     • In-memory Monitoring Systems: O : Analysis, X : Root Cause
     Tech Giant | Public | System          | OSS | Type                       | Capacity           | Period | Replacing (Legacy)
     Microsoft  | 2010   | DataGarage      | ×   | Distributed                | 4,000 nodes        | -      | TableStore
     Netflix    | 2014   | Atlas + Winston | △   | Centralized                | 2 billion records  | 6 h    | Epic
     Facebook   | 2015   | Gorilla         | △   | Centralized                | 10 billion records | 26 h   | HBase
     Google     | 2016   | Borgmon         | ×   | Distributed + Hierarchical | 10,000 nodes       | 12 h   |
  8. 8. QoS Studies for Cloud Computing
     [Figure: Map of focus areas in research on QoS approaches in cloud computing [1]]
     [Figure: Distribution of primary studies by contribution type [1]]
     • Models: Discusses concepts, makes comparisons, explores relationships, identifies challenges, or makes classifications.
     • Tools: Supports various aspects of QoS approaches in cloud computing.
     • Methods: Presents a model, algorithm, or approach describing the rules.
     [1] Abdelzahir Abdelmaboud, Dayang NA Jawawi, Imran Ghani, Abubakar Elsafi, and Barbara Kitchenham. Quality of service approaches in cloud computing: A systematic mapping study. Journal of Systems and Software, 101:159–179, 2015.
  9. 9. QoS Studies for Cloud Computing
     • O : Resource extension based on increased demand
     • X : Resource expansion based on system failure without finding the root cause
     [1] S Anithakumari and K Chandrasekaran. Monitoring and management of service level agreements in cloud computing. In Cloud and Autonomic Computing (ICCAC), 2015 International Conference on, pages 204–207. IEEE, 2015.
  10. 10. Methods (μQoS and Shape-Root)
  11. 11. μQoS: Reasoning Framework for Guaranteeing QoS
     Overall Autonomous Recovery Sequence
     [Figure: sequence of STEP1 to STEP4 covering QoS Monitoring, Root Cause Analysis, and QoS Rule Derivation on the internal in-memory time-series monitoring system (Foreman)]
     • Expanding QoS Monitoring and Action Rules without Resource Expansion
  12. 12. STEP1 : Separation of Monitoring Rules and Actions
     [Figure: QoS Monitoring Rules and Actions on the internal in-memory time-series monitoring system (Foreman)]
  13. 13. STEP3 : μQoS Concept with Root Cause
     [Figure: STEP1 to STEP4 between Consumer and Provider. A Service QoS Rule (with No Action) reports Unsatisfied Metrics, an Operation QoS Rule (with Recovery Action) covers Root Cause Metrics, and the sequence includes Generating a μQoS Rule and Execute the μQoS Rule.]
     Mandatory Requirements for Fast Autonomous Recovery
     • Reliable Root Cause
     • Fast Root Cause Analysis
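
The rule concept on this slide can be sketched as a small data structure. This is a minimal illustration and not the Foreman implementation: the `QoSRule` class, the `derive_micro_qos_rule` and `run_rules` helpers, the "satisfied while the metric stays at or below a threshold" semantics, and the metric names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class QoSRule:
    """A monitoring rule, optionally paired with a recovery action (hypothetical)."""
    name: str
    metric: str
    threshold: float  # assumption: the rule is satisfied while metric <= threshold
    action: Optional[Callable[[], str]] = None  # None models a rule with no action

    def is_satisfied(self, metrics: Dict[str, float]) -> bool:
        return metrics[self.metric] <= self.threshold


def derive_micro_qos_rule(unsatisfied: QoSRule, root_cause_metric: str,
                          threshold: float, action: Callable[[], str]) -> QoSRule:
    """Build a new rule that watches the root cause metric and carries an action."""
    return QoSRule(f"{unsatisfied.name}-micro", root_cause_metric, threshold, action)


def run_rules(rules: List[QoSRule], metrics: Dict[str, float]) -> List[str]:
    """Execute the action of every unsatisfied rule that has one."""
    return [rule.action() for rule in rules
            if rule.action is not None and not rule.is_satisfied(metrics)]


# Hypothetical metrics and rules, for illustration only.
metrics = {"read_latency_ms": 250.0, "tombstone_count": 5000.0}
service_rule = QoSRule("service-latency", "read_latency_ms", 100.0)  # no action
micro_rule = derive_micro_qos_rule(service_rule, "tombstone_count", 1000.0,
                                   lambda: "compaction requested")
print(run_rules([service_rule], metrics))              # service rule alone triggers nothing
print(run_rules([service_rule, micro_rule], metrics))  # derived rule runs its action
```

Keeping the action optional mirrors the separation of monitoring rules from actions: the service-level rule only detects the violation, while the derived rule is the one that carries a recovery action.
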
  14. 14. Root Cause Analysis (Shape-Root)
  15. 15. Root Cause Analysis Methods for Time-Series Data
     • Correlation Analysis: Traditional parametric statistics method
     • Clustering Analysis: Grouping a set of metrics in the same group
     • Recent Studies: BigRoots, Gorilla, etc.
  16. 16. Root Cause : Correlation Analysis
     PPMCC [1]
     • Pearson product-moment correlation coefficient
     • Traditional parametric statistics method
     • Some monitoring studies on cloud computing [2][3][4] mention using the general correlation algorithm, but they do not describe in detail how to find root causes using PPMCC.
     [1] Karl Pearson. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242, 1895.
     [2] Charles Loboz, Slawek Smyl, and Suman Nath. Datagarage: Warehousing massive performance data on commodity servers. Proceedings of the VLDB Endowment, 3(1-2):1447–1458, 2010.
     [3] Abdullah Mueen, Suman Nath, and Jie Liu. Fast approximate correlation for massive time-series data. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 171–182. ACM, 2010.
     [4] Tuomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, and Kaushik Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
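
For reference, PPMCC between two equally sampled metrics is the covariance divided by the product of the standard deviations. The sketch below is a plain-Python illustration; the function name `pearson`, the sample series, and the choice to return 0.0 for a constant series are assumptions of this example, not part of the presentation.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples of two metrics taken at the same timestamps.
read_latency_ms = [10.0, 12.0, 30.0, 55.0, 50.0, 14.0]
cpu_load = [0.2, 0.25, 0.6, 0.95, 0.9, 0.3]
print(round(pearson(read_latency_ms, cpu_load), 3))
```

Note that the coefficient is symmetric and compares values only at identical timestamps, so on its own it says nothing about which metric moved first.
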
  17. 17. Root Cause : Clustering Analysis
     k-Shape [1]
     • Scalable shape-based clustering algorithm for time-series data based on k-means clustering
     • A normalized version of the cross-correlation is used for measuring distances between metrics
     [1] John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870. ACM, 2015.
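
The distance used by k-Shape can be illustrated as one minus the maximum coefficient-normalized cross-correlation over all shifts of one series against the other. The sketch below covers only this distance, not the clustering itself, and uses a direct quadratic-time loop for readability instead of an FFT. The function names, the sample series, and the value returned for a constant series are assumptions of this example.

```python
import math
from typing import List, Sequence


def z_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in series]


def shape_based_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - max normalized cross-correlation; 0 means identical shape, 2 is the maximum."""
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    xs, ys = z_normalize(x), z_normalize(y)
    n = len(xs)
    denom = math.sqrt(sum(v * v for v in xs) * sum(v * v for v in ys))
    if denom == 0:
        return 1.0  # assumption: a constant series has no shape to compare
    best = -1.0
    for shift in range(-(n - 1), n):
        # Cross-correlation of xs shifted by `shift` against ys.
        cc = sum(xs[i + shift] * ys[i] for i in range(n) if 0 <= i + shift < n)
        best = max(best, cc / denom)
    return 1.0 - best


# Hypothetical metrics: the second is the first delayed by one sample.
spike = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
delayed_spike = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print(shape_based_distance(spike, delayed_spike))
```

Because the maximum is taken over shifts, a delayed copy of a metric stays close to the original, which is what makes the measure shape-based.
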
  18. 18. Root Cause : Recent Studies
     BigRoots [1]
     • Root-cause analysis for the underlying reasons for stragglers.
     • Analyzing the stragglers using general metrics such as shuffle read/write bytes and JVM garbage collection time, CPU, I/O, and network.
     [1] Honggang Zhou, Yunchun Li, Hailong Yang, Jie Jia, and Wei Li. Bigroots: An effective approach for root-cause analysis of stragglers in big data system. IEEE Access, 6:41966–41977, 2018.
  19. 19. Root Cause : Shape-Root for μQoS
     Shape-Root
     • Developed with an emphasis on precision and analysis speed to identify reliable root causes dynamically for μQoS
     • Based on a shape-based algorithm, Dynamic Time Warping (DTW), to measure the correlation distance between metrics
     • Root-cause analysis over all time-series metrics for unsatisfied QoS metrics, unlike BigRoots
     • Excludes descendant metrics and confounding metrics based on the timestamps, unlike PPMCC
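
The DTW distance named on this slide can be sketched with the classic dynamic-programming recurrence. This is a generic textbook DTW and not the Shape-Root implementation: the `rank_candidates` helper, the absolute-difference cost, and the metric names are assumptions, and the timestamp-based exclusion of descendant and confounding metrics is not shown. In practice, metrics with different units would usually be normalized before comparison.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic Time Warping distance with absolute difference as the local cost."""
    if not a or not b:
        raise ValueError("series must be non-empty")
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, len(a) + 1):
        curr = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def rank_candidates(target: Sequence[float],
                    candidates: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Order candidate metrics by DTW distance to the unsatisfied metric."""
    scored = [(name, dtw_distance(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical metrics: gc_time has the same shape as the latency, one sample later.
read_latency = [1.0, 1.0, 2.0, 5.0, 9.0, 9.0, 3.0, 1.0]
candidates = {
    "disk_io": [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    "gc_time": [1.0, 1.0, 1.0, 2.0, 5.0, 9.0, 9.0, 3.0],
}
print(rank_candidates(read_latency, candidates))
```

Warping lets two metrics with the same shape but a small time offset score as close, which a point-by-point comparison would miss.
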
  20. 20. Evaluation
  21. 21. Purpose
     • Q1: Effectiveness of μQoS and Shape-Root in detecting candidate root causes
     • Q2: Correlation between analysis span and precision for the time-series data
     • Q3: Effectiveness of μQoS and Shape-Root for autonomous recovery in real services
  22. 22. Evaluation Environment
     • Apache Cassandra v3.11.4: Distributed NoSQL Database
     • Yahoo! Cloud Serving Benchmark (YCSB) v0.15.0: Benchmark Program for Distributed Databases
     • Foreman v0.8.9: Internal Distributed Monitoring System in Yahoo! JAPAN (11,754 metric types on a 5-minute cycle)
     [Figure: cluster diagram with five boxes labeled Cassandra]
  23. 23. Root Cause Methods for μQoS
     • Shape-Root: Our proposed root cause method for μQoS
     • PPMCC [1]: Standard correlation analysis
     • k-Shape [2]: General time-series clustering analysis algorithm
     • BigRoots [3]: Root cause algorithm for stragglers
     [1] Karl Pearson. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242, 1895.
     [2] John Paparrizos and Luis Gravano. k-Shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870. ACM, 2015.
     [3] Honggang Zhou, Yunchun Li, Hailong Yang, Jie Jia, and Wei Li. Bigroots: An effective approach for root-cause analysis of stragglers in big data system. IEEE Access, 6:41966–41977, 2018.
  24. 24. Evaluation Metrics
     • Evaluating the following metrics based on leave-one-out cross-validation
     • TP: Number of correct potential root causes
     • FP: Number of wrong potential root causes
     • FN: Number of undetected root causes
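
These counts are the usual inputs to precision, TP / (TP + FP). The sketch below derives the counts from a set of detected metrics and a set of actual root causes. The function name and metric names are hypothetical, recall and F1 are included as the standard companions of precision and are not stated on the slide, and the leave-one-out procedure itself is not shown.

```python
from typing import Dict, Set


def score_root_causes(detected: Set[str], actual: Set[str]) -> Dict[str, float]:
    """Count TP/FP/FN for one analysis run and derive precision, recall, and F1."""
    tp = len(detected & actual)   # correct potential root causes
    fp = len(detected - actual)   # wrong potential root causes
    fn = len(actual - detected)   # root causes that were not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical result of one analysis run.
detected = {"cpu_load", "gc_time", "disk_io"}
actual = {"cpu_load", "gc_time", "net_rx"}
print(score_root_causes(detected, actual))
```
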
  25. 25. Experiment 1: Detecting root causes for injected faults
     • Workload: YCSB Read Heavy Workload
     • Injected fault: CPU Stress (30min cycle)
     • Rule11 was unsatisfied with CPU load (30min cycle)
     [Figure: Consumer QoS Rules with results labeled Best, Good, Bad, and ?]
  26. 26. Experiment 2: Detecting root causes in a real system
     • Workload: YCSB Anti-Pattern Workload
     • SSTables of Tombstone
     [Figure: Consumer QoS Rules with results labeled Good, Bad, ?, and Bad]
  27. 27. Experiment 3: Comparing effective analysis period in a real system
     • Workload: YCSB Anti-Pattern Workload
     [Figure: Consumer QoS Rules with results labeled Good, Good, Bad, Bad, Slow, and Good]
  28. 28. Experiment 4: Autonomous Recovery Effectiveness for a real system
     • Workload: YCSB Anti-Pattern Workload
     • Initial QoS Rules: Consumer (Rule41) and Provider (Rule42)
     1. Rule41 is unsatisfied
     2. Root cause analysis for Rule41
     3. Derive Rule43
     4. Add Rule43 and execute Rule43
     5. Execute Rule43
  29. 29. Conclusion
  30. 30. Summary and Future Plans
     μQoS : Event-driven monitoring rule derivation method based on case-based reasoning and root cause analysis for autonomous recovery
     • Good: μQoS demonstrated its effectiveness in a real system, using the high-precision, real-time root cause algorithm called Shape-Root.
     • Bad: The acausal root cause exclusion algorithm is oversimplified because it relies only on the metric timestamps. In complex real systems that have many QoS rules, acausal μQoS rules may be executed.
     The study focused only on root cause analysis for past failures. As the next step, we plan to extend the μQoS framework to future failures, using model-based reasoning or anomaly detection to prevent potential failures.
  31. 31. Thank you