Meeting Performance Goals in multi-tenant Hadoop Clusters

Missed SLAs
Poor performance
Failed applications
Underutilized clusters
Low throughput
Unused datasets
Poor data layout
Shivnath Babu and Brian Majeska

Who we are
Brian Majeska
Director Engineering Operations, Platform Data Services
YP.com, Glendale, CA 91203
• Manages Operations for YP’s Hadoop and HBase clusters
• Worked with Hadoop ecosystem for 5 years
• System Admin for 13 years
• Previously at eHarmony, CalTech, EarthLink
Shivnath Babu
Associate Professor, Duke University
Co-founder/CTO, Unravel Data Systems, Menlo Park, CA 94025
• R&D on Hadoop, Spark, NoSQL, streaming, & MPP to simplify ongoing app/system management
• Led work on first self-tuning Hadoop platform
• Awards from NSF, IBM, HP
• PhD, Stanford University

Lifecycle of a Multi-tenant Hadoop Cluster
Growth, Diversity, Challenges

Growth
• Pilot: Migrated ETL to Hadoop 5 years ago
• Growth in applications
• ETL: from 3 to 100+ unique workflows in 5 years
• Ad-hoc: 7K jobs a day - 24/7
• Growth in users: 100+ active users
• Growth in systems: HDFS, MapReduce, Hive, Oozie, HBase, Spark, Kafka

Hadoop Growth Over Time
5 Years Ago… 1PB data: Production Cluster
• 300 servers
• 8 cores
• 8 x 1T data drives (2PB)
• 18G RAM
• 1G NIC
Today… 5PB data: Multi-tenant Production Cluster
• 220 servers
• 16 cores
• 12 x 4T data drives (7PB)
• 256G RAM
• 10G NIC

Daily Processing on Multi-tenant Cluster
• 1.3 Billion events
• 300TB HDFS Reads
• 35TB HDFS Writes

With Great Power Comes Great Diversity!
• Diversity in application types
• Diversity in application resource needs
• Diversity in users and their skill-sets
• Diversity in business criticality of workloads

Growth & Diversity → Challenges

Growth + Diversity → Big Challenges
• More problems
• Harder to diagnose
Cascading failures
Application slowdowns
Rogue applications
Missed SLAs
Stuck jobs
Failed queries

• More problems
• Harder to track who is doing what
• Harder to control
CPU/IO/Network usage
Files/Tables/partitions created
Best practices on application performance and cluster usage

• More problems
• Harder to track who is doing what
• Harder to control
• Harder to optimize
• Harder to plan
Server configuration
Scheduler parameters
Justifying resource demands
Forecasting capacity needs

For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control

Understand
What is Going On

Real-life Example: Unpredictable Workflow
Performance

Root Cause of this Resource Contention

But, This is Just One Type of Contention
• At Resource Manager Level
• App admission time
• Container allocation for App Master
• Container allocation for tasks
• Container allocation for Executor
• At Application Level
• Workflow Scheduler, e.g., Oozie
• Query Engine, e.g., HiveServer2
• At Master Daemon Level
• NameNode
• Hive MetaStore

Key Takeaways
Resource contention at different levels affects app performance
• Different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently
• Manual diagnosis can be hard and time-consuming
Unravel’s approach to diagnosing such problems automatically
• Analyzes full-stack monitoring data
• Carefully combines system knowledge with statistical analysis

Improve
Performance & Efficiency

Quick Primer on YARN Resource Manager
Image from: http://doc.mapr.com/display/MapR/YARN

Two Ways to Improve Performance & Efficiency
1. At the level of an individual application’s interaction with the Resource Manager
2. At the level of the Resource Manager’s configuration, which affects all applications

Application’s Interaction with the
Resource Manager
1. Number of containers
• MapReduce, Spark, & Tez use different techniques to determine this number
2. Container size
• CPU
• Memory
Image from: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
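The two knobs above (number of containers and container size) interact on every node: the scarcer of memory and CPU decides how many containers fit. A minimal Python sketch of that arithmetic follows. The `Node` type, the function names, and the example sizes are hypothetical, and it assumes the whole node is offered to YARN, which a real cluster would not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    memory_mb: int  # memory the node offers to containers (assumed value)
    vcores: int     # virtual cores the node offers to containers (assumed value)


def containers_per_node(node: Node, container_mb: int, container_vcores: int) -> int:
    """Containers of one size that fit on a node; the scarcer resource decides."""
    if container_mb <= 0 or container_vcores <= 0:
        raise ValueError("container size must be positive")
    return min(node.memory_mb // container_mb, node.vcores // container_vcores)


def wasted_memory_mb(container_mb: int, peak_used_mb: int) -> int:
    """Memory reserved by a container but never used, even at its peak."""
    return max(container_mb - peak_used_mb, 0)


# Hypothetical node with 16 cores and 256 GB RAM, all of it given to YARN.
node = Node(memory_mb=256 * 1024, vcores=16)
cpu_bound = containers_per_node(node, container_mb=8192, container_vcores=1)
memory_bound = containers_per_node(node, container_mb=32768, container_vcores=1)
waste = wasted_memory_mb(container_mb=8192, peak_used_mb=3072)
print(cpu_bound, memory_bound, waste)
```

With 8 GB containers the node runs out of cores first; with 32 GB containers it runs out of memory first. A container that reserves 8 GB but peaks at 3 GB leaves 5 GB unusable by other tenants.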

Application Spawns Too Many Containers

Poorly-sized Containers

Massive Inefficiencies Diagnosed and Eliminated with
Intelligent Container Sizing!

Configuring the Resource Manager (Queues/Pools)
Root
• Mem Capacity: 12 GB
• CPU Capacity: 24 cores
Marketing
• Fair Share Mem: 4 GB
• Fair Share CPU: 8 cores
R&D
Sales
Jim’s Team
Bob’s Team
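The Marketing numbers are consistent with the Root capacity being split evenly across three top-level queues (12 GB / 3 = 4 GB, 24 cores / 3 = 8 cores). The Python sketch below reproduces that arithmetic. Equal weights for Marketing, R&D, and Sales are an assumption; the slide states only Marketing's share, and a real fair scheduler also accounts for demand and min/max limits, which this ignores.

```python
def fair_shares(capacity: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a parent's capacity among child queues in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {queue: capacity * w / total for queue, w in weights.items()}


# Assumed: three sibling queues under Root with equal weights.
weights = {"Marketing": 1.0, "R&D": 1.0, "Sales": 1.0}
mem_gb = fair_shares(12, weights)  # Root memory capacity from the slide
cores = fair_shares(24, weights)   # Root CPU capacity from the slide
print(mem_gb["Marketing"], cores["Marketing"])
```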

Configuring the Resource Manager (Parameters)
Image from: http://www.slideshare.net/SumeetSingh1/hadoop-summit-san-jose-2015-towards-slabased-scheduling-on-yarn-clusters

Performance Goals that Need to be Met
• Deadline: ETL workflow should finish by 6:00 AM
• Latency: Average query latency should be under 3 minutes
• Utilization: Cluster utilization should be above 70%
• Predictability: SLA satisfaction rate should be above 95%
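The four goals above can be written down as explicit checks against measured values. This Python sketch is one illustrative encoding. The `Observed` record and its field names are hypothetical, the thresholds are the ones on the slide, and the deadline check assumes the workflow finishes on the same calendar day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Observed:
    etl_finish: time               # wall-clock finish time of the ETL workflow
    avg_query_latency_min: float   # average query latency in minutes
    cluster_utilization: float     # fraction between 0 and 1
    sla_satisfaction_rate: float   # fraction between 0 and 1


def unmet_goals(obs: Observed) -> list[str]:
    """Names of the goals that the observed metrics fail to meet."""
    unmet = []
    if obs.etl_finish > time(6, 0):        # should finish by 6:00 AM
        unmet.append("deadline")
    if obs.avg_query_latency_min >= 3.0:   # should be under 3 minutes
        unmet.append("latency")
    if obs.cluster_utilization <= 0.70:    # should be above 70%
        unmet.append("utilization")
    if obs.sla_satisfaction_rate <= 0.95:  # should be above 95%
        unmet.append("predictability")
    return unmet


# Made-up measurements for illustration only.
report = unmet_goals(Observed(time(5, 45), 2.5, 0.64, 0.97))
print(report)
```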

Very complex to configure manually for
good performance and efficiency!
1. YARN does not understand performance goals
2. Too many low-level parameters to be set
3. Need a deep understanding of application
workload & performance requirements
4. Diverse types of application behaviors
5. Workloads change with time

A New Interface to YARN’s Resource Manager
Simple abstraction to specify key performance goals
• Based on past/current performance
• Helps pose operational what-if questions
Powerful functionality
• Learning engine for automated answers to operational what-if questions
• Recommender system to automatically find parameter settings that meet
performance goals
Nonintrusive: no changes needed to YARN

Ask operational what-if questions based on past/current performance:
1. What is the impact of decreasing capacity of ADVERTISING queue by 30%?
2. How to reduce average workflow latency in FINANCIAL queue to 30 minutes?
Resource allocation & app
performance in different queues
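The first what-if question starts from a changed allocation. The Python sketch below computes only that starting point: one queue's capacity is cut by a fraction and the freed capacity is spread over the remaining queues in proportion to their current capacity. The redistribution rule, the capacity numbers, and the ADHOC queue are assumptions for illustration. Predicting the resulting application performance is the job of the learning engine the slides describe and is not attempted here.

```python
def decrease_capacity(
    capacities: dict[str, float], queue: str, fraction: float
) -> dict[str, float]:
    """Cut one queue's capacity by `fraction`; give the freed part to the others pro rata."""
    if queue not in capacities:
        raise KeyError(queue)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    freed = capacities[queue] * fraction
    others_total = sum(c for q, c in capacities.items() if q != queue)
    result = {}
    for q, c in capacities.items():
        if q == queue:
            result[q] = c - freed
        elif others_total > 0:
            result[q] = c + freed * c / others_total
        else:
            result[q] = c
    return result


# Hypothetical capacities in percent of the cluster.
before = {"ADVERTISING": 40.0, "FINANCIAL": 40.0, "ADHOC": 20.0}
after = decrease_capacity(before, "ADVERTISING", 0.30)
print(after)
```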

Tempo: Robust and Self-Tuning Resource
Management in Multi-tenant Parallel Databases
To appear in VLDB in Sept 2016

Key Types of Performance Goals
Application Workload in the Cluster
Models of Fair/Capacity Schedulers
Learning and Optimization Algorithms
Automated Answers to Operational Questions

Key Takeaways
YARN is powerful enough to meet key multi-tenant resource allocation needs
But, an easy interface to specify & satisfy performance goals is lacking
Our work aims to fill this gap in the ecosystem

Control
Multi-tenant Usage by
Enforcing Policies

AutoActions for Policy-based Control
Ops specifies a policy: If resources used by ad-hoc apps in FINANCIAL queue
are slowing down the CEO-Report workflow by more than 20%, then move the
ad-hoc apps to the QUARANTINE queue
Unravel continuously monitors for policy violations
If a policy violation is detected, then Unravel acts via YARN REST APIs
• Helps Ops automate operational processes & get peace of mind
• Unravel maintains complete audit trail for post-mortem investigation
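The example policy can be sketched as a threshold check that produces a list of planned actions. This Python sketch is not Unravel's implementation: the `Policy` type, the function names, the application ID, and the Resource Manager address are hypothetical. It takes the slowdown measurement and the list of offending ad-hoc apps as given, although attributing the slowdown to those apps is the hard part. The request shape follows the YARN Resource Manager's application queue endpoint as I understand it (`PUT /ws/v1/cluster/apps/{appid}/queue`); the sketch only builds the requests and never sends them.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    protected_workflow: str
    source_queue: str
    target_queue: str
    max_slowdown: float  # 0.20 means "more than 20% slower than baseline"


def slowdown(baseline_s: float, current_s: float) -> float:
    """Relative slowdown of the current run against a baseline duration."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (current_s - baseline_s) / baseline_s


def plan_moves(policy, baseline_s, current_s, adhoc_app_ids, rm_url):
    """Return the queue-move requests to issue, or an empty list if the policy holds."""
    if slowdown(baseline_s, current_s) <= policy.max_slowdown:
        return []
    body = json.dumps({"queue": policy.target_queue})
    return [
        {"method": "PUT", "url": f"{rm_url}/ws/v1/cluster/apps/{app_id}/queue", "body": body}
        for app_id in adhoc_app_ids
    ]


policy = Policy("CEO-Report", "FINANCIAL", "QUARANTINE", 0.20)
# Hypothetical numbers: the workflow normally takes 1000 s and now takes 1300 s.
moves = plan_moves(policy, 1000, 1300, ["application_1_0001"], "http://rm.example:8088")
print(moves)
```

Returning planned requests instead of sending them keeps the decision separate from the action, which also makes it straightforward to log each decision for the audit trail the slide mentions.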

Examples of AutoActions
Beginner
• Enforcing best practices on number of tasks and container sizes
Intermediate
• Detecting Rogue Apps and moving them to a capped queue/pool
• Making workflow execution fault-tolerant under YARN kills & OOMs
Expert
• Guaranteeing SLAs by dynamic adjustment of YARN Resource Manager
parameters
• Enabling workload-aware cluster selection to lower Cloud usage costs

Growth, Diversity, Challenges
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control

Get Unravel Trial Edition:
bit.ly/getunravel
UNCOVER ISSUES
UNLEASH RESOURCES
UNRAVEL PERFORMANCE
