Missed SLAs
Poor performance
Failed applications
Underutilized clusters
Low throughput
Unused datasets
Poor data layout
Meeting Performance Goals in
Multi-tenant Hadoop Clusters
Shivnath Babu and Brian Majeska
Who we are
Brian Majeska
Director Engineering Operations
Platform Data Services
YP.com
Glendale, CA 91203
• Manages Operations for YP’s Hadoop and HBase clusters
• Worked with Hadoop ecosystem for 5 years
• System Admin for 13 years
• Previously at eHarmony, CalTech, EarthLink
Shivnath Babu
Associate Professor, Duke University
Co-founder/CTO, Unravel Data Systems
Menlo Park, CA 94025
• R&D on Hadoop, Spark, NoSQL, streaming, & MPP to simplify ongoing app/system management
• Led work on first self-tuning Hadoop platform
• Awards from NSF, IBM, HP
• PhD, Stanford University
Lifecycle of a Multi-tenant Hadoop Cluster
Growth, Diversity, Challenges
Growth
• Pilot: Migrated ETL to Hadoop 5 years ago
• Growth in applications
• ETL: from 3 to 100+ unique workflows in 5 years
• Ad-hoc: 7K jobs a day - 24/7
• Growth in users: 100+ active users
• Growth in systems: HDFS, MapReduce, Hive, Oozie, HBase, Spark, Kafka
Hadoop Growth Over Time
5 Years Ago: Production Cluster, 1PB data
• 300 servers
• 8 cores
• 8 x 1T data drives (2PB)
• 18G RAM
• 1G NIC
Today: Multi-tenant Production Cluster, 5PB data
• 220 servers
• 16 cores
• 12 x 4T data drives (7PB)
• 256G RAM
• 10G NIC
Daily Processing on Multi-tenant Cluster
• 1.3 Billion events
• 300TB HDFS Reads
• 35TB HDFS Writes
With Great Power Comes Great Diversity!
• Diversity in application types
• Diversity in application resource needs
• Diversity in users and their skill-sets
• Diversity in business criticality of workloads
Lifecycle of a Multi-tenant Hadoop Cluster
Growth & Diversity → Challenges
Growth + Diversity → Big Challenges
• More problems
• Harder to diagnose
Cascading failures
Application slowdowns
Rogue applications
Missed SLAs
Stuck jobs
Failed queries
Growth + Diversity → Big Challenges
• More problems
• Harder to diagnose
• Harder to track who is doing what
• Harder to control
CPU/IO/Network usage
Files/Tables/partitions
created
Best practices on
application performance
and cluster usage
Growth + Diversity → Big Challenges
• More problems
• Harder to diagnose
• Harder to track who is doing what
• Harder to control
• Harder to optimize
• Harder to plan
Server configuration
Scheduler parameters
Justifying resource demands
Forecasting capacity needs
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
Understand
What is Going On
Real-life Example: Unpredictable Workflow
Performance
Bad
Good
Root Cause of this Resource Contention
But, This is Just One Type of Contention
• At Resource Manager Level
• App admission time
• Container allocation for App Master
• Container allocation for tasks
• Container allocation for Executor
• At Application Level
• Workflow Scheduler, e.g., Oozie
• Query Engine, e.g., HiveServer2
• At Master Daemon Level
• NameNode
• Hive MetaStore
Bad
Good
Key Takeaways
Resource contention at different levels affects app performance
• Different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently
• Manual diagnosis can be hard and time-consuming
Unravel’s approach to diagnosing such problems automatically
• Analyzes full-stack monitoring data
• Carefully combines system knowledge with statistical analysis
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
Improve
Performance & Efficiency
Quick Primer on YARN Resource Manager
Image from: http://doc.mapr.com/display/MapR/YARN
Two Ways to Improve Performance & Efficiency
1. At the level of individual application’s interaction with the Resource
Manager
2. At the level of the Resource Manager’s Configuration that affects all
applications
Application’s Interaction with the
Resource Manager
1. Number of containers
• MapReduce, Spark, & Tez use different techniques to determine this number
2. Container size
• CPU
• Memory
Image from: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
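The two knobs on this slide (how many containers an application asks for, and how large each one is) determine how much of the cluster it reserves. The following is a minimal sketch, not Unravel's method, of how over-allocation could be quantified from per-container measurements. The `ContainerSample` type, the 20% headroom, and the 512 MB allocation increment are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ContainerSample:
    """Hypothetical measurement for one container (not a YARN API type)."""
    requested_mb: int   # memory the application asked the Resource Manager for
    peak_used_mb: int   # peak memory the task actually used


def wasted_fraction(samples):
    """Fraction of requested memory that was never used, across all containers."""
    requested = sum(s.requested_mb for s in samples)
    used = sum(min(s.peak_used_mb, s.requested_mb) for s in samples)
    return 0.0 if requested == 0 else (requested - used) / requested


def suggested_size_mb(samples, headroom=1.2, increment_mb=512):
    """Largest observed peak plus headroom, rounded up to an allocation increment.

    headroom and increment_mb are illustrative defaults, not recommendations.
    """
    if not samples:
        raise ValueError("need at least one sample")
    target = max(s.peak_used_mb for s in samples) * headroom
    return math.ceil(target / increment_mb) * increment_mb
```

For example, two 4096 MB containers that peak near 1 GB each waste roughly three quarters of their reservation, and the sketch would suggest a 1536 MB container instead.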
Application Spawns Too Many Containers
Poorly-sized Containers
Massive Inefficiencies Diagnosed and Eliminated with
Intelligent Container Sizing!
Two Ways to Improve Performance & Efficiency
1. At the level of individual application’s interaction with the Resource
Manager
2. At the level of the Resource Manager’s Configuration that affects all
applications
Configuring the Resource Manager (Queues/Pools)
Root
Mem Capacity: 12 GB
CPU Capacity: 24 cores
Marketing
Fair Share Mem: 4 GB
Fair Share CPU: 8 cores
R&D
Fair Share Mem: 4 GB
Fair Share CPU: 8 cores
Sales
Fair Share Mem: 4 GB
Fair Share CPU: 8 cores
Jim’s Team
Fair Share Mem: 2 GB
Fair Share CPU: 4 cores
Bob’s Team
Fair Share Mem: 2 GB
Fair Share CPU: 4 cores
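The fair-share numbers in this queue hierarchy follow from splitting each parent's capacity among its children in proportion to their weights. A minimal sketch of that arithmetic is below, assuming equal weights and all queues active. The `fair_shares` helper is hypothetical, and the parent of the two team queues is an assumption, since the slide text does not say which department they belong to.

```python
def fair_shares(capacity, weights):
    """Split a parent's capacity among child queues in proportion to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: capacity * w / total for name, w in weights.items()}


# Root capacity from the slide.
root_mem_gb, root_cores = 12, 24

# Equal weights are an assumption consistent with the equal shares shown.
departments = {"Marketing": 1, "R&D": 1, "Sales": 1}
mem = fair_shares(root_mem_gb, departments)
cpu = fair_shares(root_cores, departments)

# Assumption: the team queues sit under one department; "R&D" is a placeholder.
assumed_parent = "R&D"
teams = {"Jim's Team": 1, "Bob's Team": 1}
team_mem = fair_shares(mem[assumed_parent], teams)
team_cpu = fair_shares(cpu[assumed_parent], teams)

print(mem, cpu, team_mem, team_cpu)
```

Each department ends up with 4 GB and 8 cores, and each team with 2 GB and 4 cores, matching the slide. A real scheduler recomputes shares as queues become active or idle, which this sketch ignores.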
Configuring the Resource Manager (Parameters)
Image from: http://www.slideshare.net/SumeetSingh1/hadoop-summit-san-jose-2015-towards-slabased-scheduling-on-yarn-clusters
Performance Goals that Need to be Met
• Deadline: ETL workflow should finish by 6:00 AM
• Latency: Average query latency should be under 3 minutes
• Utilization: Cluster utilization should be above 70%
• Predictability: SLA satisfaction rate should be above 95%
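The four goals above can be written down as explicit checks against observed metrics. The sketch below is illustrative only: the `ObservedDay` record and its field names are hypothetical, and the thresholds are the example values from the slide.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ObservedDay:
    """Hypothetical summary of one day of cluster activity."""
    etl_finish: time              # wall-clock time the ETL workflow finished
    avg_query_latency_min: float  # average query latency in minutes
    cluster_utilization: float    # 0.0 to 1.0
    sla_satisfaction: float       # fraction of SLAs met, 0.0 to 1.0


def check_goals(day):
    """Map each goal from the slide to True (met) or False (missed)."""
    return {
        "deadline": day.etl_finish <= time(6, 0),
        "latency": day.avg_query_latency_min < 3.0,
        "utilization": day.cluster_utilization > 0.70,
        "predictability": day.sla_satisfaction > 0.95,
    }
```

A day with `ObservedDay(time(5, 45), 2.5, 0.64, 0.97)` would meet every goal except utilization.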
Very complex to configure manually for
good performance and efficiency!
1. YARN does not understand performance goals
2. Too many low-level parameters to be set
3. Need a deep understanding of application
workload & performance requirements
4. Diverse types of application behaviors
5. Workloads change with time
A New Interface to YARN’s Resource Manager
Simple abstraction to specify key performance goals
• Based on past/current performance
• Helps pose operational what-if questions
Powerful functionality
• Learning engine for automated answers to operational what-if questions
• Recommender system to automatically find parameter settings that meet
performance goals
Nonintrusive & No changes needed to YARN
Ask operational what-if questions based on past/current performance:
1. What is the impact of decreasing capacity of ADVERTISING queue by 30%?
2. How to reduce average workflow latency in FINANCIAL queue to 30 minutes?
Resource allocation & app
performance in different queues
Tempo: Robust and Self-Tuning Resource
Management in Multi-tenant Parallel Databases
To appear in VLDB in Sept 2016
Key Types of
Performance
Goals
Application
Workload
in the
Cluster
Models of
Fair/Capacity
Schedulers
Learning and
Optimization
Algorithms
Automated
Answers to
Operational
Questions
Key Takeaways
YARN is powerful enough to meet key multi-tenant resource allocation needs
But an easy interface to specify & satisfy performance goals is lacking
Our work aims to fill this gap in the ecosystem
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
Control
Multi-tenant Usage by
Enforcing Policies
AutoActions for Policy-based Control
Ops specifies a policy: If resources used by ad-hoc apps in FINANCIAL queue
are slowing down the CEO-Report workflow by more than 20%, then move the
ad-hoc apps to the QUARANTINE queue
Unravel continuously monitors for policy violations
If a policy violation is detected, then Unravel acts via YARN REST APIs
• Helps Ops automate operational processes & get peace of mind
• Unravel maintains complete audit trail for post-mortem investigation
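The example policy above can be sketched as a threshold check followed by one queue-move request per ad-hoc application. This is not Unravel's implementation. It assumes the slowdown is measured against a known baseline duration, that the list of ad-hoc application IDs is already available, and that the Resource Manager exposes the application queue endpoint of the YARN REST API (`PUT /ws/v1/cluster/apps/{appid}/queue`). The function names and the example host are hypothetical, and the requests are built but not sent.

```python
import json
import urllib.request


def slowdown(baseline_min, current_min):
    """Relative slowdown of the protected workflow versus its baseline duration."""
    return (current_min - baseline_min) / baseline_min


def build_move_request(rm_url, app_id, target_queue):
    """Build (but do not send) a request that moves an app to another queue."""
    body = json.dumps({"queue": target_queue}).encode("utf-8")
    return urllib.request.Request(
        url=f"{rm_url}/ws/v1/cluster/apps/{app_id}/queue",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def plan_actions(baseline_min, current_min, adhoc_app_ids, rm_url,
                 threshold=0.20, quarantine="QUARANTINE"):
    """Return the move requests to issue if the policy is violated, else []."""
    if slowdown(baseline_min, current_min) <= threshold:
        return []
    return [build_move_request(rm_url, a, quarantine) for a in adhoc_app_ids]
```

Returning the planned requests rather than sending them keeps the decision separate from the action, which also makes it straightforward to record each planned move in an audit trail before executing it. Attributing the slowdown to specific ad-hoc applications is the hard part and is outside this sketch.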
Examples of AutoActions
Beginner
• Enforcing best practices on number of tasks and container sizes
Intermediate
• Detecting Rogue Apps and moving them to a capped queue/pool
• Making workflow execution fault-tolerant under YARN kills & OOMs
Expert
• Guaranteeing SLAs by dynamic adjustment of YARN Resource Manager
parameters
• Enabling workload-aware cluster selection to lower Cloud usage costs
Lifecycle of a Multi-tenant Hadoop Cluster
Growth, Diversity, Challenges
For Easier Ongoing Management
of Multi-tenant Hadoop Clusters
Understand, Improve, Control
Get Unravel Trial Edition:
bit.ly/getunravel
UNCOVER ISSUES
UNLEASH RESOURCES
UNRAVEL PERFORMANCE


Editor's Notes

  • #3 Shivnath starts
  • #8 MapReduce, Hive, and now Spark and Kafka; cores, memory, storage; engineers, analysts, and first-time users; daily, weekly, monthly, and quarterly report jobs
  • #10 Rogue: DNS DoS. Cascading: OOM before YARN. SLA: monthly and quarterly jobs read more data and use more resources, but only during certain times of the year. Stuck: heartbeats are working, but the process isn’t. Failed: bad drives.
  • #11 Resource usage by user; Hive: 3000 tables and data ownership; NN: small files/big files
  • #12 Tuning the cluster; tuning the scheduler; resource usage per user and department; budget for next year
  • #13 These are the problems we face. Our goal is to find these problems faster than we have in the past, so we have partnered with Unravel. Let me hand this off to Shivnath, who can explain more.