Meeting Performance Goals in multi-tenant Hadoop Clusters
1. Missed SLAs
Poor performance
Failed applications
Underutilized clusters
Low throughput
Unused datasets
Poor data layout
Meeting Performance Goals in
Multi-tenant Hadoop Clusters
Shivnath Babu and Brian Majeska
2. Who we are
Brian Majeska
Director Engineering Operations, Platform Data Services
YP.com, Glendale, CA 91203
• Manages Operations for YP’s Hadoop and HBase clusters
• Worked with Hadoop ecosystem for 5 years
• System Admin for 13 years
• Previously at eHarmony, CalTech, EarthLink
Shivnath Babu
Associate Professor, Duke University
Co-founder/CTO, Unravel Data Systems, Menlo Park, CA 94025
• R&D on Hadoop, Spark, NoSQL, streaming, & MPP to simplify ongoing app/system management
• Led work on first self-tuning Hadoop platform
• Awards from NSF, IBM, HP
• PhD, Stanford University
3. Lifecycle of a Multi-tenant Hadoop Cluster
Growth, Diversity, Challenges
4. Growth
• Pilot: Migrated ETL to Hadoop 5 years ago
• Growth in applications
• ETL: from 3 to 100+ unique workflows in 5 years
• Ad-hoc: 7K jobs a day - 24/7
• Growth in users: 100+ active users
• Growth in systems: HDFS, MapReduce, Hive, Oozie, HBase, Spark, Kafka
5. Hadoop Growth Over Time
5 Years Ago: Production Cluster (1PB data)
• 300 servers
• 8 cores
• 8 x 1T data drives (2PB)
• 18G RAM
• 1G NIC
Today: Multi-tenant Production Cluster (5PB data)
• 220 servers
• 16 cores
• 12 x 4T data drives (7PB)
• 256G RAM
• 10G NIC
7. With Great Power Comes Great Diversity!
• Diversity in application types
• Diversity in application resource needs
• Diversity in users and their skill-sets
• Diversity in business criticality of workloads
8. Lifecycle of a Multi-tenant Hadoop Cluster
Growth & Diversity Challenges
9. Growth + Diversity Big Challenges
• More problems
• Harder to diagnose
Cascading failures
Application slowdowns
Rogue applications
Missed SLAs
Stuck jobs
Failed queries
10. Growth + Diversity Big Challenges
• More problems
• Harder to diagnose
• Harder to track who is doing what
• Harder to control
CPU/IO/Network usage
Files/Tables/partitions created
Best practices on application performance and cluster usage
11. Growth + Diversity Big Challenges
• More problems
• Harder to diagnose
• Harder to track who is doing what
• Harder to control
• Harder to optimize
• Harder to plan
Server configuration
Scheduler parameters
Justifying resource demands
Forecasting capacity needs
12. For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
13. Understand
What is Going On
20. Root Cause of this Resource Contention
21. But, This is Just One Type of Contention
• At Resource Manager Level
• App admission time
• Container allocation for App Master
• Container allocation for tasks
• Container allocation for Executor
• At Application Level
• Workflow Scheduler, e.g., Oozie
• Query Engine, e.g., HiveServer2
• At Master Daemon Level
• NameNode
• Hive MetaStore
22. Key Takeaways
Resource contention at different levels affects app performance
• Different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently
• Manual diagnosis can be hard and time-consuming
Unravel’s approach to diagnose such problems automatically
• Analyzes full-stack monitoring data
• Carefully combines system knowledge with statistical analysis
23. For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
25. Quick Primer on YARN Resource Manager
Image from: http://doc.mapr.com/display/MapR/YARN
26. Two Ways to Improve Performance & Efficiency
1. At the level of an individual application’s interaction with the Resource Manager
2. At the level of the Resource Manager’s configuration, which affects all applications
27. Application’s Interaction with the Resource Manager
1. Number of containers
• MapReduce, Spark, & Tez use different techniques to determine this number
2. Container size
• CPU
• Memory
Image from: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
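As a rough illustration of how container size limits parallelism, the sketch below computes how many containers fit on one node and how many "waves" of tasks a job would need. The node and container sizes are hypothetical, not figures from this deck, and the helper names (`Resources`, `containers_per_node`, `task_waves`) are made up for the example. Real schedulers also account for reserved memory, queue limits, and other tenants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Memory and CPU, either offered by a node or requested by a container."""
    memory_mb: int
    vcores: int


def containers_per_node(node: Resources, container: Resources) -> int:
    """Whole containers that fit on one node; the scarcer resource decides."""
    if container.memory_mb <= 0 or container.vcores <= 0:
        raise ValueError("container size must be positive")
    by_memory = node.memory_mb // container.memory_mb
    by_cpu = node.vcores // container.vcores
    return min(by_memory, by_cpu)


def task_waves(num_tasks: int, num_nodes: int, per_node: int) -> int:
    """Rounds of execution needed if every slot runs one task at a time."""
    slots = num_nodes * per_node
    if slots <= 0:
        raise ValueError("cluster has no container slots")
    return -(-num_tasks // slots)  # ceiling division


# Hypothetical node: 192 GB and 16 vcores made available to YARN.
node = Resources(memory_mb=196608, vcores=16)
small = Resources(memory_mb=4096, vcores=1)   # CPU-bound: 16 containers fit
wide = Resources(memory_mb=2048, vcores=2)    # 2 vcores each: only 8 fit
print(containers_per_node(node, small), containers_per_node(node, wide))
```

In this toy setup, asking for two vcores per container halves the number of concurrent containers even though memory is plentiful, which is one way an oversized request shows up as low cluster utilization.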
28. Application Spawns Too Many Containers
30. Massive Inefficiencies Diagnosed and Eliminated with Intelligent Container Sizing!
31. Two Ways to Improve Performance & Efficiency
1. At the level of an individual application’s interaction with the Resource Manager
2. At the level of the Resource Manager’s configuration, which affects all applications
33. Configuring the Resource Manager (Parameters)
Image from: http://www.slideshare.net/SumeetSingh1/hadoop-summit-san-jose-2015-towards-slabased-scheduling-on-yarn-clusters
34. Performance Goals that Need to be Met
• Deadline: ETL workflow should finish by 6:00 AM
• Latency: Average query latency should be under 3 minutes
• Utilization: Cluster utilization should be above 70%
• Predictability: SLA satisfaction rate should be above 95%
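The four goals above can be written down as explicit checks against measured values. The sketch below is only an illustration: the `Observed` record and `check_goals` function are hypothetical names, the sample measurements are invented, and the deadline comparison assumes the ETL workflow starts and finishes on the same calendar day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Observed:
    """One day's measurements (hypothetical structure for this example)."""
    etl_finish: time               # wall-clock time the ETL workflow finished
    avg_query_latency_min: float   # average query latency in minutes
    cluster_utilization: float     # fraction of cluster capacity used, 0..1
    sla_satisfaction_rate: float   # fraction of SLAs met, 0..1


def check_goals(obs: Observed) -> dict[str, bool]:
    """Return, per goal, whether the slide's threshold was met."""
    return {
        "deadline": obs.etl_finish <= time(6, 0),
        "latency": obs.avg_query_latency_min < 3.0,
        "utilization": obs.cluster_utilization > 0.70,
        "predictability": obs.sla_satisfaction_rate > 0.95,
    }


# Invented sample day: everything is met except utilization.
day = Observed(time(5, 40), 2.5, 0.64, 0.97)
for goal, met in check_goals(day).items():
    print(f"{goal}: {'met' if met else 'MISSED'}")
```

Note that these goals can pull against each other: raising utilization above 70% by packing in more work tends to make the deadline and latency goals harder to hold.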
35. Very complex to configure manually for good performance and efficiency!
1. YARN does not understand performance goals
2. Too many low-level parameters to be set
3. Need a deep understanding of application workload & performance requirements
4. Diverse types of application behaviors
5. Workloads change with time
36. A New Interface to YARN’s Resource Manager
Simple abstraction to specify key performance goals
• Based on past/current performance
• Helps pose operational what-if questions
Powerful functionality
• Learning engine for automated answers to operational what-if questions
• Recommender system to automatically find parameter settings that meet performance goals
Nonintrusive & No changes needed to YARN
37. Ask operational what-if questions based on past/current performance:
1. What is the impact of decreasing capacity of ADVERTISING queue by 30%?
2. How to reduce average workflow latency in FINANCIAL queue to 30 minutes?
Resource allocation & app
performance in different queues
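The first step of a what-if question like the first one above is plain arithmetic on queue capacities, sketched below. This is not the learning engine described in the talk and it predicts nothing about application performance. It assumes "by 30%" means a relative cut, that the freed capacity is shared among the other queues in proportion to their current share, and it uses invented starting capacities plus a hypothetical ADHOC queue.

```python
def decrease_capacity(
    capacities: dict[str, float], queue: str, fraction: float
) -> dict[str, float]:
    """Cut one queue's capacity by `fraction` and redistribute the freed share.

    Capacities are percentages; the total is preserved.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    others = {q: c for q, c in capacities.items() if q != queue}
    total_others = sum(others.values())
    if total_others <= 0:
        raise ValueError("no other queue can absorb the freed capacity")
    freed = capacities[queue] * fraction
    result = {queue: capacities[queue] - freed}
    for q, c in others.items():
        # Each remaining queue grows in proportion to its current share.
        result[q] = c + freed * c / total_others
    return result


# Invented starting point; only the queue names ADVERTISING and FINANCIAL
# come from the slide.
before = {"ADVERTISING": 40.0, "FINANCIAL": 40.0, "ADHOC": 20.0}
after = decrease_capacity(before, "ADVERTISING", 0.30)
print(after)
```

Answering the actual question (what happens to latency and SLAs in each queue) needs a model of how applications respond to the new allocation, which is the part the talk attributes to the learning engine.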
38. Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases
To appear in VLDB in Sept 2016
40. Key Takeaways
YARN is powerful to meet key multi-tenant resource allocation needs
But, an easy interface to specify & satisfy performance goals is lacking
Our work aims to fill this gap in the ecosystem
41. For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
42. Control
Multi-tenant Usage by
Enforcing Policies
43. AutoActions for Policy-based Control
Ops specifies a policy: If resources used by ad-hoc apps in FINANCIAL queue are slowing down the CEO-Report workflow by more than 20%, then move the ad-hoc apps to the QUARANTINE queue
Unravel continuously monitors for policy violations
If a policy violation is detected, then Unravel acts via YARN REST APIs
• Helps Ops automate operational processes & get peace of mind
• Unravel maintains complete audit trail for post-mortem investigation
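A policy of this shape can be sketched as a threshold check followed by one queue-move request per offending app. This is not Unravel's implementation. It assumes the ResourceManager exposes the application queue endpoint (`PUT /ws/v1/cluster/apps/{appid}/queue`) described in the Hadoop YARN REST documentation, so verify that against your Hadoop version. The ResourceManager address, app IDs, and runtimes are invented, and the sketch only builds the requests without sending them. It also takes for granted that the slowdown has already been attributed to the ad-hoc apps, which is the hard part.

```python
import json
import urllib.request

SLOWDOWN_THRESHOLD = 0.20  # "more than 20%" from the policy


def slowdown(baseline_min: float, current_min: float) -> float:
    """Relative slowdown of the protected workflow versus its baseline."""
    return (current_min - baseline_min) / baseline_min


def build_move_request(
    rm_url: str, app_id: str, target_queue: str
) -> urllib.request.Request:
    """Build (but do not send) a request that moves one app to another queue."""
    body = json.dumps({"queue": target_queue}).encode("utf-8")
    return urllib.request.Request(
        url=f"{rm_url}/ws/v1/cluster/apps/{app_id}/queue",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def plan_actions(
    baseline_min: float, current_min: float, adhoc_app_ids: list[str], rm_url: str
) -> list[urllib.request.Request]:
    """Return the queue moves to perform, or nothing if the policy holds."""
    if slowdown(baseline_min, current_min) <= SLOWDOWN_THRESHOLD:
        return []
    return [build_move_request(rm_url, a, "QUARANTINE") for a in adhoc_app_ids]


# Invented numbers: the workflow normally takes 60 minutes and is on track for 75.
actions = plan_actions(
    60.0, 75.0,
    ["application_0000000000000_0001", "application_0000000000000_0002"],
    "http://rm.example.com:8088",
)
for req in actions:
    print(req.get_method(), req.full_url)
```

Keeping the decision (`plan_actions`) separate from the side effect (sending the requests) makes it easy to log every planned move, which fits the audit-trail point above.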
44. Examples of AutoActions
Beginner
• Enforcing best practices on number of tasks and container sizes
Intermediate
• Detecting Rogue Apps and moving them to a capped queue/pool
• Making workflow execution fault-tolerant under YARN kills & OOMs
Expert
• Guaranteeing SLAs by dynamic adjustment of YARN Resource Manager parameters
• Enabling workload-aware cluster selection to lower Cloud usage costs
45. Lifecycle of a Multi-tenant Hadoop Cluster
Growth, Diversity, Challenges
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
- MapReduce, Hive, and now Spark and Kafka
Cores, Memory, Storage
Engineers, Analysts, and First time users
Daily, Weekly, Monthly, Quarterly report jobs
Rogue: DNS DoS
Cascading: OOM before YARN
SLA: Monthly and Quarterly jobs read more data and use more resources but only during certain times of the year.
Stuck: heartbeats are working, but the process isn’t
Failed: bad drives
Resource Usage by user
Hive: 3000 tables and data ownership
NN small files/big files
Tuning the cluster
Tuning the scheduler
Resource usage per user and department
Budget for next year
These are the problems we face. Our goal is to find these problems faster than we have in the past, so we have partnered with Unravel. Let me hand this off to Shivnath, and he can explain more.