
Meeting Performance Goals in multi-tenant Hadoop Clusters

Published in: Technology

  1. Meeting Performance Goals in Multi-tenant Hadoop Clusters • Shivnath Babu and Brian Majeska • Missed SLAs • Poor performance • Failed applications • Underutilized clusters • Low throughput • Unused datasets • Poor data layout
  2. Who we are • Brian Majeska: Director Engineering Operations, Platform Data Services, YP.com, Glendale, CA 91203. Manages Operations for YP’s Hadoop and HBase clusters; worked with the Hadoop ecosystem for 5 years; System Admin for 13 years; previously at eHarmony, CalTech, EarthLink • Shivnath Babu: Associate Professor, Duke University; Co-founder/CTO, Unravel Data Systems, Menlo Park, CA 94025. R&D on Hadoop, Spark, NoSQL, streaming, & MPP to simplify ongoing app/system management; led work on the first self-tuning Hadoop platform; awards from NSF, IBM, HP; PhD, Stanford University
  3. Lifecycle of a Multi-tenant Hadoop Cluster: Growth, Diversity, Challenges
  4. Growth • Pilot: Migrated ETL to Hadoop 5 years ago • Growth in applications: ETL grew from 3 to 100+ unique workflows in 5 years; ad-hoc runs 7K jobs a day, 24/7 • Growth in users: 100+ active users • Growth in systems: HDFS, MapReduce, Hive, Oozie, HBase, Spark, Kafka
  5. Hadoop Growth Over Time • 5 Years Ago: Production Cluster, 300 servers, 8 cores, 8 x 1T data drives (2PB), 18G RAM, 1G NIC, 1PB data • Today: Multi-tenant Production Cluster, 220 servers, 16 cores, 12 x 4T data drives (7PB), 256G RAM, 10G NIC, 5PB data
  6. Daily Processing on Multi-tenant Cluster • 1.3 Billion events • 300TB HDFS Reads • 35TB HDFS Writes
  7. With Great Power Comes Great Diversity! • Diversity in application types • Diversity in application resource needs • Diversity in users and their skill-sets • Diversity in business criticality of workloads
  8. Lifecycle of a Multi-tenant Hadoop Cluster: Growth & Diversity → Challenges
  9. Growth + Diversity → Big Challenges • More problems • Harder to diagnose (Cascading failures, Application slowdowns, Rogue applications, Missed SLAs, Stuck jobs, Failed queries)
  10. Growth + Diversity → Big Challenges • More problems • Harder to diagnose • Harder to track who is doing what • Harder to control (CPU/IO/Network usage; Files/Tables/partitions created; Best practices on application performance and cluster usage)
  11. Growth + Diversity → Big Challenges • More problems • Harder to diagnose • Harder to track who is doing what • Harder to control • Harder to optimize • Harder to plan (Server configuration, Scheduler parameters, Justifying resource demands, Forecasting capacity needs)
  12. For Easier Ongoing Management of Multi-tenant Hadoop Clusters: Understand, Improve, Control
  13. Understand What is Going On
  14. Real-life Example: Unpredictable Workflow Performance
  15. [Figure-only slide; no text extracted]
  16. [Figure-only slide; no text extracted]
  17. [Figure labels: Bad, Good]
  18. Root Cause of this Resource Contention
  19. But, This is Just One Type of Contention • At Resource Manager Level: App admission time; Container allocation for App Master; Container allocation for tasks; Container allocation for Executor • At Application Level: Workflow Scheduler, e.g., Oozie; Query Engine, e.g., HiveServer2 • At Master Daemon Level: NameNode; Hive MetaStore • [Figure labels: Bad, Good]
  20. Key Takeaways • Resource contention at different levels affects app performance: different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently; manual diagnosis can be hard and time-consuming • Unravel’s approach to diagnose such problems automatically: analyzes full-stack monitoring data; carefully combines system knowledge with statistical analysis
  21. For Easier Ongoing Management of Multi-tenant Hadoop Clusters: Understand, Improve, Control
  22. Improve Performance & Efficiency
  23. Quick Primer on YARN Resource Manager (image from: http://doc.mapr.com/display/MapR/YARN)
  24. Two Ways to Improve Performance & Efficiency: 1. At the level of an individual application’s interaction with the Resource Manager 2. At the level of the Resource Manager’s configuration that affects all applications
  25. Application’s Interaction with the Resource Manager: 1. Number of containers (MapReduce, Spark, & Tez use different techniques to determine this number) 2. Container size (CPU, Memory) (image from: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/)
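The two knobs on slide 25 (number of containers and container size) can be made concrete with a little arithmetic. The following is a minimal sketch, not the method used in the talk: it assumes the common MapReduce behavior of roughly one map task per input split, and it assumes a node offers all of its memory and cores to YARN, which real clusters do not. The function names and the 128 MiB split size are illustrative assumptions; the 256G RAM / 16 core node figures come from slide 5.

```python
import math


def map_task_count(input_bytes: int, split_bytes: int) -> int:
    """Approximate number of map containers: about one per input split (assumption)."""
    return math.ceil(input_bytes / split_bytes)


def containers_per_node(node_mem_mb: int, node_vcores: int,
                        container_mem_mb: int, container_vcores: int) -> int:
    """Containers that fit on one node; the scarcer resource is the limit."""
    by_memory = node_mem_mb // container_mem_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


if __name__ == "__main__":
    # Hypothetical job: 1 TiB of input with an assumed 128 MiB split size.
    print(map_task_count(1024**4, 128 * 1024**2))
    # Node from slide 5 (256G RAM, 16 cores) with assumed 4 GB / 1 vcore containers.
    print(containers_per_node(256 * 1024, 16, 4096, 1))
```

With these assumed numbers the node is CPU-bound: memory would allow 64 containers but the cores allow only 16, which is the kind of mismatch that poorly sized containers produce.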
  26. Application Spawns Too Many Containers
  27. Poorly-sized Containers
  28. Massive Inefficiencies Diagnosed and Eliminated with Intelligent Container Sizing!
  29. Two Ways to Improve Performance & Efficiency: 1. At the level of an individual application’s interaction with the Resource Manager 2. At the level of the Resource Manager’s configuration that affects all applications
  30. Configuring the Resource Manager (Queues/Pools) • Root: Mem Capacity 12 GB, CPU Capacity 24 cores • Marketing: Fair Share Mem 4 GB, Fair Share CPU 8 cores • R&D: Fair Share Mem 4 GB, Fair Share CPU 8 cores • Sales: Fair Share Mem 4 GB, Fair Share CPU 8 cores • Jim’s Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores • Bob’s Team: Fair Share Mem 2 GB, Fair Share CPU 4 cores
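The numbers on slide 30 are consistent with an equal-weight split of each parent queue's resources among its children. A minimal sketch of that arithmetic follows. It is not YARN's Fair Scheduler implementation, which also accounts for demand, minimum shares, and preemption. The `Queue` class and `fair_shares` function are hypothetical names, and placing Jim's Team and Bob's Team under R&D is an assumption, since the transcript does not say which department they belong to.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    weight: float = 1.0
    children: list[Queue] = field(default_factory=list)


def fair_shares(queue: Queue, mem_gb: float, cores: float,
                out: dict[str, tuple[float, float]] | None = None
                ) -> dict[str, tuple[float, float]]:
    """Split a queue's memory and CPU among its children in proportion to weight."""
    if out is None:
        out = {}
    out[queue.name] = (mem_gb, cores)
    total_weight = sum(child.weight for child in queue.children)
    for child in queue.children:
        fair_shares(child,
                    mem_gb * child.weight / total_weight,
                    cores * child.weight / total_weight,
                    out)
    return out


# Hierarchy from slide 30; the parent of the two teams is an assumption.
root = Queue("Root", children=[
    Queue("Marketing"),
    Queue("R&D", children=[Queue("Jim's Team"), Queue("Bob's Team")]),
    Queue("Sales"),
])

if __name__ == "__main__":
    for name, (mem, cpu) in fair_shares(root, mem_gb=12, cores=24).items():
        print(f"{name}: {mem:g} GB, {cpu:g} cores")
```

Changing a single `weight` shifts every sibling's share, which hints at why hand-tuning even a small hierarchy becomes difficult as queues multiply.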
  31. Configuring the Resource Manager (Parameters) (image from: http://www.slideshare.net/SumeetSingh1/hadoop-summit-san-jose-2015-towards-slabased-scheduling-on-yarn-clusters)
  32. Performance Goals that Need to be Met • Deadline: ETL workflow should finish by 6:00 AM • Latency: Average query latency should be under 3 minutes • Utilization: Cluster utilization should be above 70% • Predictability: SLA satisfaction rate should be above 95%
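The four goals on slide 32 can be written down as explicit checks against observed metrics. The sketch below is one possible encoding, not an interface from the talk: the `Observed` class, its field names, and `check_goals` are hypothetical, and the thresholds are copied from the slide. The deadline check compares times of day only, so it assumes the ETL workflow starts and finishes on the same calendar day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Observed:
    etl_finish: time               # time of day the ETL workflow finished
    avg_query_latency_min: float   # average query latency, in minutes
    cluster_utilization: float     # fraction between 0.0 and 1.0
    sla_satisfaction_rate: float   # fraction between 0.0 and 1.0


def check_goals(obs: Observed) -> dict[str, bool]:
    """Return, per goal on slide 32, whether the observed metrics meet it."""
    return {
        "deadline": obs.etl_finish <= time(6, 0),
        "latency": obs.avg_query_latency_min < 3.0,
        "utilization": obs.cluster_utilization > 0.70,
        "predictability": obs.sla_satisfaction_rate > 0.95,
    }


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    sample = Observed(time(5, 45), 3.5, 0.72, 0.96)
    for goal, met in check_goals(sample).items():
        print(f"{goal}: {'met' if met else 'MISSED'}")
```

Stating the goals at this level is simple; the difficulty lies in mapping them onto scheduler parameters.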
  33. Very complex to configure manually for good performance and efficiency! 1. YARN does not understand performance goals 2. Too many low-level parameters to be set 3. Need a deep understanding of application workload & performance requirements 4. Diverse types of application behaviors 5. Workloads change with time
  34. A New Interface to YARN’s Resource Manager • Simple abstraction to specify key performance goals: based on past/current performance; helps pose operational what-if questions • Powerful functionality: learning engine for automated answers to operational what-if questions; recommender system to automatically find parameter settings that meet performance goals • Nonintrusive & no changes needed to YARN
  35. Ask operational what-if questions based on past/current performance: 1. What is the impact of decreasing capacity of ADVERTISING queue by 30%? 2. How to reduce average workflow latency in FINANCIAL queue to 30 minutes? • Figure: Resource allocation & app performance in different queues
  36. Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases • To appear in VLDB in Sept 2016
  37. Key Types of Performance Goals • Application Workload in the Cluster • Models of Fair/Capacity Schedulers • Learning and Optimization Algorithms • Automated Answers to Operational Questions
  38. Key Takeaways • YARN is powerful to meet key multi-tenant resource allocation needs • But, an easy interface to specify & satisfy performance goals is lacking • Our work aims to fill this gap in the ecosystem
  39. For Easier Ongoing Management of Multi-tenant Hadoop Clusters: Understand, Improve, Control
  40. Control Multi-tenant Usage by Enforcing Policies
  41. AutoActions for Policy-based Control • Ops specifies a policy: If resources used by ad-hoc apps in FINANCIAL queue are slowing down the CEO-Report workflow by more than 20%, then move the ad-hoc apps to the QUARANTINE queue • Unravel continuously monitors for policy violations • If a policy violation is detected, then Unravel acts via YARN REST APIs • Helps Ops automate operational processes & get peace of mind • Unravel maintains complete audit trail for post-mortem investigation
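The policy on slide 41 has a simple shape: measure a slowdown, compare it to a threshold, and move applications to another queue. The sketch below illustrates that shape only and is not Unravel's implementation, which the slides do not describe. It assumes slowdown is measured against a baseline runtime, it skips the hard part of attributing the slowdown to the ad-hoc apps, and it only builds the HTTP requests without sending them. The endpoint path follows the ResourceManager REST API's application queue resource (`PUT /ws/v1/cluster/apps/{appid}/queue`) and should be verified against your Hadoop version; the host name, app IDs, and function names are hypothetical.

```python
import json
import urllib.request


def slowdown(baseline_s: float, current_s: float) -> float:
    """Relative slowdown versus a baseline runtime (0.25 means 25% slower)."""
    return (current_s - baseline_s) / baseline_s


def build_move_request(rm_url: str, app_id: str, target_queue: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks YARN to move an app to a queue."""
    body = json.dumps({"queue": target_queue}).encode("utf-8")
    return urllib.request.Request(
        url=f"{rm_url}/ws/v1/cluster/apps/{app_id}/queue",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def plan_actions(baseline_s: float, current_s: float, adhoc_app_ids: list[str],
                 rm_url: str, threshold: float = 0.20,
                 quarantine: str = "QUARANTINE") -> list[urllib.request.Request]:
    """Return move requests for the ad-hoc apps if the slowdown exceeds the threshold."""
    if slowdown(baseline_s, current_s) <= threshold:
        return []
    return [build_move_request(rm_url, app_id, quarantine) for app_id in adhoc_app_ids]


if __name__ == "__main__":
    # Hypothetical values: the workflow normally takes 60 minutes and now takes 75.
    requests = plan_actions(3600, 4500, ["application_1_0001"], "http://rm.example:8088")
    for req in requests:
        print(req.get_method(), req.full_url, req.data.decode())
```

Keeping the decision (`plan_actions`) separate from the side effect (sending the request) also makes it straightforward to log each planned action, in the spirit of the audit trail the slide mentions.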
  42. Examples of AutoActions • Beginner: Enforcing best practices on number of tasks and container sizes • Intermediate: Detecting Rogue Apps and moving them to a capped queue/pool; making workflow execution fault-tolerant under YARN kills & OOMs • Expert: Guaranteeing SLAs by dynamic adjustment of YARN Resource Manager parameters; enabling workload-aware cluster selection to lower Cloud usage costs
  43. Lifecycle of a Multi-tenant Hadoop Cluster: Growth, Diversity, Challenges • For Easier Ongoing Management of Multi-tenant Hadoop Clusters: Understand, Improve, Control
  44. Get Unravel Trial Edition: bit.ly/getunravel • UNCOVER ISSUES • UNLEASH RESOURCES • UNRAVEL PERFORMANCE
