Hui 3.0

Hadoop Usage Insight (HUI 1.0)
Session on Descriptive Analytics
ArulKumar
Synopsis: HUI 1.0 is a convergent analytics application that provides comprehensive insights
into the usage, load, and performance of applications running on Hadoop clusters. It was
developed as a web-enabled tool built on the Eagle framework.

Contents
• Why we do this?
• Our Customers
• Initial Use Case
• Eagle Monitoring Framework as a Solution …
• How we did this? - EagleApp!!
• Functional Coverage
• AS IS Features
• Methods & Metrics
• TO BE Features

Why we do this?
Volume
• 2+ large Hadoop clusters
• 3,000+ nodes
• 20,000+ jobs per day
• 5,000,000+ tasks per day
• 200+ types of Hadoop metrics
• Millions of audit events per day
Complexity
• Varieties of data sources and collectors
• Joins across multiple data sources
• Threshold-based, window-based
• Multiple-metric correlation
• Metric pre-aggregation
• Alert rules can't be hot deployed

Key Stakeholders & Use Cases
Stakeholders
• Sr. Management
• SMEs & Leads
• Hadoop Operations
• Hadoop Developers
• Infrastructure Teams
• Product Managers
Use Cases
• Cluster - Availability and HDFS Usage
  • Rack-wise
  • Node-wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • M/R count and failure status w.r.t. queues
  • Queue-wise elapsed time, job count
  • % of completion, M & R
• Job – Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • Map Reduce (File System, Job, Task)
ROI
• Time to Market
• Reduction in MTTR
• Freedom for Innovation
• Optimized Resource Usage
• Reduction in SLA Time
• Insight on Running Jobs
• Cluster Capacity Insight
• Proactive Remediation

The Initial Use Case …
Anomaly detection algorithm
• Continuously crawl job history immediately after job completion
• Calculate the minute-level job failure ratio for each node
• A node is anomalous when either of two conditions holds:
  • Tasks on the node fail continuously
  • The node has a higher failure ratio than the rest of the nodes in the cluster
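
The two conditions above can be sketched in a few lines of Python. This is only an illustration, not the HUI or Eagle implementation: the input layout `(minute, node, succeeded)` and the thresholds `max_streak`, `ratio_factor`, and `min_ratio` are assumptions chosen for the example, since the slide does not state concrete values.

```python
from collections import defaultdict


def minute_failure_counts(attempts):
    """Count [total, failed] task attempts per (minute, node)."""
    counts = defaultdict(lambda: [0, 0])
    for minute, node, succeeded in attempts:
        counts[(minute, node)][0] += 1
        if not succeeded:
            counts[(minute, node)][1] += 1
    return counts


def anomalous_nodes(attempts, max_streak=3, ratio_factor=3.0, min_ratio=0.2):
    """Return nodes matching either anomaly condition (thresholds are hypothetical)."""
    attempts = list(attempts)
    flagged = set()

    # Condition 1: the node fails `max_streak` task attempts in a row.
    streak = defaultdict(int)
    for _minute, node, succeeded in sorted(attempts, key=lambda a: a[0]):
        streak[node] = 0 if succeeded else streak[node] + 1
        if streak[node] >= max_streak:
            flagged.add(node)

    # Condition 2: in some minute, the node's failure ratio is well above
    # the ratio of all other nodes in the cluster for that same minute.
    counts = minute_failure_counts(attempts)
    per_minute = defaultdict(lambda: [0, 0])
    for (minute, _node), (total, failed) in counts.items():
        per_minute[minute][0] += total
        per_minute[minute][1] += failed
    for (minute, node), (total, failed) in counts.items():
        rest_total = per_minute[minute][0] - total
        rest_failed = per_minute[minute][1] - failed
        if rest_total == 0:
            continue  # no other node to compare against in this minute
        node_ratio = failed / total
        rest_ratio = rest_failed / rest_total
        if node_ratio >= min_ratio and node_ratio > ratio_factor * rest_ratio:
            flagged.add(node)
    return flagged
```

The `min_ratio` floor is a design choice for the sketch: without it, a single failed attempt on an otherwise healthy cluster would mark a node as anomalous.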

Eagle Monitoring Framework as a Solution …
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
Eagle Query Framework

How we do this? - Eagle App - HUI 1.0
[Architecture diagram: Data Collector, Data Processing, Metric Aggregation, Alert Engine, Storage]

Functional Coverage
• Hosts: DEV, PROD
• MR Jobs: Queue, Batch A/C, SLA Jobs
• Counters: JobAttempt, File Sys, MapReduce, HDFS
• Anomalies: RJ analysis, JPA, Skew & Host detection
• Metadata: HBase, Hive, MongoDB
• Reports: Interactive Query
• MR Analytics, MR History, Job Counters, Failure Ratio, Slot Usage (by Queue, Batch A/C, OU-Track, Applications)
• TooLong, NoProgress, TooManyT-Failure, MR Failure Ratio, Spill Over, Host Anomalies (by Queue, Batch A/C, Job, OU-Track)
• Usage, Volume, Growth, Retention, Purge Idea (by Users, Applications, OU, Track, DT & Outside)
• Custom Dashboard, Query By: Queue, Batch A/C, Job Name, Time Slot, OU Specific, Track
• Availability, Capacity, Trend Analysis, Prediction (by Node, OU Usage, Track Usage)
• MapTaskAttempt, ReduceTaskAttempt, MapTaskAttemptFS, ReduceTaskAttemptFS, MR.FileSystem, MR.Job/Task (by Queue, Batch A/C, Job, OU-Track, Application)
Platform, Data, Infra, Business Units

AS-IS Features
• Cluster - Availability and HDFS Usage
  • Rack-wise
  • Node-wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • M/R count and failure status w.r.t. queues
  • Queue-wise elapsed time
  • Job count
  • % of completion, M & R
• Job – Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • Map Reduce (File System, Job, Task)

Task Completion View … (Typical)

Alert type | Alert category | Trigger condition | Email frequency | Actions
-----------|----------------|-------------------|-----------------|--------
Job Performance Alerts | Long execution compared with historical data | Alert when there is a peak | 1 hour or > 10 jobs | Notify user with insight
Job Performance Alerts | Execution time > 12 hours | Execution time > 12 hours | 1 hour or > 10 jobs | Notify user with insight
Job Performance Alerts | Slow progress | HDFS R/W, file R/W has no progress within 15 minutes | 1 hour or > 10 jobs | Resource availability check
Job Performance Alerts | Long scheduling | Map & Reduce progress 0% even after 15 minutes | 1 hour or > 10 jobs | Resource availability check
Job Performance Alerts | Long cleanup | Map & Reduce progress 100% but job not completed in 15 min. | 1 hour or > 10 jobs | System resource availability check
Job Performance Alerts | Abnormal # of HDFS R/W | > 0.5M | 1 hour or > 10 jobs | Notify and optimize job
Job Performance Alerts | Slow processing | File R/W # of bytes is between 100 and 200 K bytes per CPU second | 1 hour or > 10 jobs | Notify and optimize job
Job Performance Alerts | Very large shuffle size | > 10 GB | 1 hour or > 10 jobs | Notify and optimize job
Bad Node | Node Anomaly Alert | Bad node has a high failure ratio | On-demand | Restart daemon / decommission node
Job Exception | Job Anomaly Alert | Buggy job has a much higher failure ratio than any other job | On-demand | Send email to owner

Typical Alert Types and Categories …
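
A few of the trigger conditions listed on this slide can be expressed as simple predicates over a job snapshot. The sketch below is illustrative only: the `JobSnapshot` fields, the rule list, and the reading of "10GB" as 10 * 1024^3 bytes are assumptions made for the example, and only a subset of the alert categories is covered.

```python
from dataclasses import dataclass

GB = 1024 ** 3  # assumption: binary gigabytes


@dataclass
class JobSnapshot:
    """Hypothetical view of a running job; field names are illustrative."""
    elapsed_minutes: float
    map_progress: float      # 0.0 to 1.0
    reduce_progress: float   # 0.0 to 1.0
    hdfs_rw_ops: int
    shuffle_bytes: int


# (alert category, action, trigger predicate), taken from the slide's rows.
RULES = [
    ("Execution time > 12 hours", "Notify user with insight",
     lambda j: j.elapsed_minutes > 12 * 60),
    ("Long scheduling", "Resource availability check",
     lambda j: j.elapsed_minutes > 15
     and j.map_progress == 0 and j.reduce_progress == 0),
    ("Abnormal # of HDFS R/W", "Notify and optimize job",
     lambda j: j.hdfs_rw_ops > 500_000),
    ("Very large shuffle size", "Notify and optimize job",
     lambda j: j.shuffle_bytes > 10 * GB),
]


def evaluate(job):
    """Return (category, action) for every rule the job triggers."""
    return [(name, action) for name, action, trigger in RULES if trigger(job)]
```

Keeping the rules in a plain list, rather than hard-coding them in `evaluate`, means a rule can be added or changed without touching the evaluation logic.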

JPA : Job Performance and Historical Job Analyzer
Monitor and analyze job performance in real-time
• Historical job analysis
• Running job analysis
• Anomaly host detection
• Job data skew detection
• Job performance suggestion
• Anomaly Prediction via machine learning
• Job historical performance trend
• Task and attempt distribution
• Skewness Score
• Anomaly historical performance detection
• TooLowBytesConsumedPerCPUSecond
• Job StatisticLongDuration
• TooLargeShuffleSizeAlert
JOB Performance Design

Real-time Data Skew Detection - Approach
Use Case: Detect data skew using statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Correlation > 0.9 & Max(Z-Score) > 90%
[Figure: distribution panels for HDFS Bytes Read, Input Records, Map Duration (ms), Combine I/P Records, Shuffle Records, Local File Bytes Read, Input Records, Duration (ms)]
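
The detection rule on this slide combines a correlation test with a maximum z-score test. A minimal sketch of that combination follows, assuming per-attempt durations and one counter (for example `mapInputRecords`) are available as plain lists. The slide's "Max(Z-Score) > 90%" criterion is not translated here; the `z_threshold` default is a hypothetical cut-off, and the function names are illustrative.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def max_z_score(values):
    """Largest z-score in the series (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return max((v - mu) / sigma for v in values)


def is_skewed(durations, counter, corr_threshold=0.9, z_threshold=2.0):
    """Flag skew when duration tracks the counter and one attempt is an outlier.

    corr_threshold follows the slide (0.9); z_threshold is an assumed value.
    """
    return (pearson(durations, counter) > corr_threshold
            and max_z_score(durations) > z_threshold)
```

For instance, ten map attempts where nine take about 10 seconds on about 100 records and one takes 60 seconds on 600 records would be flagged, because duration follows input size closely and the slow attempt is a clear outlier.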

Possible Expansions …
• Cluster - Available HDFS Usage
  • Trend of usage
  • Trend of node downtime
• Job Distribution
  • OU-level mapping
• Queue Load Analytics
  • Queue load
  • Reduces, Failed w.r.t. Business
• Optimization
  • Anomaly prediction via machine learning
  • Real-time data skew detection
  • HBase & Hive metadata usage analytics
