Hadoop Usage Insight (HUI 1.0)
Session on Descriptive Analytics
ArulKumar
Synopsis: HUI 1.0 is a convergent analytics application that provides comprehensive insights into the usage, load, and performance of applications running on Hadoop clusters. It was developed as a web-enabled tool built on the Eagle framework.
Contents
• Why we do this?
• Our Customers
• Initial Use Case
• Eagle Monitoring Framework as a Solution …
• How we did this? - Eagle App
• Functional Coverage
• AS IS Features
• Methods & Metrics
• TO BE Features
Why we do this?
Volume
• 2+ large Hadoop clusters
• 3000+ nodes
• 20,000+ jobs per day
• 50,000,00+ tasks per day
• 200+ types of Hadoop metrics
• Millions of audit events per day
Complexity
• Varieties of data sources & collectors
• Join multiple data sources
• Threshold-based, window-based
• Multiple metrics correlation
• Metrics pre-aggregations
• Alert rules can’t be hot deployed
Key Stakeholders & Use Cases
Stakeholders
• Sr. Management
• SMEs & Leads
• Hadoop Operations
• Hadoop Developers
• Infrastructure Teams
• Product Managers
Use Cases
• Cluster - Availability and HDFS Usage
  • Rack-wise
  • Node-wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • MR count and failure status w.r.t. queues
  • Queue-wise elapsed time, job count
  • % of completion (Map & Reduce)
• Job - Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • MapReduce (File System, Job, Task)
ROI
• Time to Market
• Reduction in MTTR
• Freedom for Innovation
• Optimized Resource Usage
• Reduction in SLA Time
• Insight on Running Jobs
• Cluster Capacity Insight
• Proactive Remediation
The Initial Use Case …
Anomaly detection algorithm
• Continuously crawl the job history as soon as each job completes
• Calculate the minute-level job failure ratio for each node
• A node is anomalous when either of 2 conditions holds:
  • Tasks on the node fail continuously
  • Its failure ratio is higher than that of the rest of the nodes in the cluster
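The two conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the detector used in HUI or Eagle: the input format (one `(minute, node, succeeded)` tuple per task attempt), the function names, and the `min_streak` and `ratio_margin` thresholds are all hypothetical, since the slide does not give concrete values.

```python
from collections import defaultdict
from statistics import mean


def failure_ratios(attempts):
    """Return {minute: {node: failure_ratio}} from (minute, node, succeeded) tuples."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [failed, total]
    for minute, node, succeeded in attempts:
        pair = counts[minute][node]
        pair[1] += 1
        if not succeeded:
            pair[0] += 1
    return {
        minute: {node: failed / total for node, (failed, total) in nodes.items()}
        for minute, nodes in counts.items()
    }


def anomalous_nodes(attempts, min_streak=3, ratio_margin=0.5):
    """Flag nodes that meet either condition (both thresholds are assumed values)."""
    ratios = failure_ratios(attempts)
    streak = defaultdict(int)
    flagged = set()
    for minute in sorted(ratios):
        per_node = ratios[minute]
        for node, ratio in per_node.items():
            # Condition 1: every task on the node failed for several
            # consecutive observed minutes.
            streak[node] = streak[node] + 1 if ratio == 1.0 else 0
            if streak[node] >= min_streak:
                flagged.add(node)
            # Condition 2: the ratio is well above the mean of the other nodes.
            others = [r for n, r in per_node.items() if n != node]
            if others and ratio - mean(others) > ratio_margin:
                flagged.add(node)
    return flagged
```

For example, if `n1` fails every attempt while `n2` and `n3` succeed, only `n1` is flagged. Minutes in which a node ran no tasks are simply skipped, so a streak here counts consecutive observed minutes rather than wall-clock minutes.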
Eagle Monitoring Framework as a Solution …
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
Eagle Query Framework
How we do this? - Eagle App - HUI 1.0
• Data Collector
• Data Processing
• Metric Aggregation
• Alert Engine
• Storage
Functional Coverage
Hosts, MR Jobs, Counters, Anomalies, Metadata
DEV: Queue, Batch A/C, JobAttempt, File Sys, RJ analysis, JPA, HBase, Hive, Reports
PROD: SLA Jobs, MapReduce, HDFS, Skew & Host detection, MongoDB, Interactive Query

MR Analytics, MR History, Job Counters, Failure Ratio, SLOT Usage
-------------
Queue, Batch A/C, OU-Track, Applications

TooLong, NoProgress, TooManyT-Failure, MR Failure Ratio, Spill Over, Host Anomalies
-------------
Queue, Batch A/C, Job, OU-Track

Usage, Volume, Growth, Retention, Purge Idea
-------------
Users, Applications, OU, Track, DT & Outside

Custom Dashboard
Query By
• Queue
• Batch A/C
• Job Name
• Time Slot
• OU Specific
• Track

Availability, Capacity, Trend Analysis, Prediction
-------------
Node, OU Usage, Track Usage

MapTaskAttempt, ReduceTaskAttempt, MapTaskAttemptFS, ReduceTaskAttemptFS, MR.FileSystem, MR.Job/Task
-------------
Queue, Batch A/C, Job, OU-Track, Application

Platform, DATA, INFRA, Business Units
AS IS …
AS IS Features
• Cluster - Availability and HDFS Usage
  • Rack-wise
  • Node-wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • MR count and failure status w.r.t. queues
  • Queue-wise elapsed time
  • Job count
  • % of completion (Map & Reduce)
• Job - Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • MapReduce (File System, Job, Task)
Hosts Usage (Typical)
Queue Usage
Task Completion View … (Typical)
Methods & Metrics …
Typical Alert Types and Categories …
Alert type | Alert category | Trigger condition | Email frequency | Actions
Job Performance Alerts | Long execution compared with historical data | Alert when there is a peak | 1 hour or > 10 jobs | Notify user with insight
Job Performance Alerts | Execution time > 12 hours | Execution time > 12 hours | 1 hour or > 10 jobs | Notify user with insight
Job Performance Alerts | Slow progress | HDFS R/W, File R/W has no progress within 15 minutes | 1 hour or > 10 jobs | Resource availability check
Job Performance Alerts | Long scheduling | Map & Reduce progress 0% even after 15 minutes | 1 hour or > 10 jobs | Resource availability check
Job Performance Alerts | Long cleanup | Map & Reduce progress 100% but job not completed in 15 min. | 1 hour or > 10 jobs | System resource availability check
Job Performance Alerts | Abnormal # of HDFS R/W | > 0.5M | 1 hour or > 10 jobs | Notify and optimize job
Job Performance Alerts | Slow processing file RW | # of bytes is between 100 to 200 K bytes per CPU second | 1 hour or > 10 jobs | Notify and optimize job
Job Performance Alerts | Very large shuffle size | > 10GB | 1 hour or > 10 jobs | Notify and optimize job
Bad Node | Node Anomaly Alert | Bad node has high failure ratio | On-demand | Restart daemon / decommission node
Job Exception | Job Anomaly Alert | Buggy job has a much higher failure ratio than any other job | On-demand | Send email to owner
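The fixed-threshold job performance alerts in the table can be read as simple predicates over a job's current state. The sketch below is a hypothetical illustration, not Eagle's policy engine: the `JobSnapshot` class and its field names are invented for this example, "10GB" is read as 10 GiB, "0.5M" as 500,000 operations, and "100 to 200 K bytes" as 100,000 to 200,000 bytes. The history-based "long execution" alert is left out because the slide gives no concrete threshold for it.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # assumption: "10GB" in the table is read as 10 GiB


@dataclass
class JobSnapshot:
    """Hypothetical point-in-time view of one job; field names are illustrative."""
    elapsed_minutes: float = 0.0
    map_progress: float = 0.0             # fraction, 0.0 to 1.0
    reduce_progress: float = 0.0          # fraction, 0.0 to 1.0
    completed: bool = False
    minutes_since_io_progress: float = 0.0
    minutes_at_full_progress: float = 0.0
    hdfs_rw_ops: int = 0
    bytes_per_cpu_second: float = 1_000_000.0
    shuffle_bytes: int = 0


def triggered_alerts(job):
    """Return the alert categories from the table whose trigger condition holds."""
    alerts = []
    if job.elapsed_minutes > 12 * 60:
        alerts.append("Execution time > 12 hours")
    if job.minutes_since_io_progress >= 15:
        alerts.append("Slow progress")
    if job.elapsed_minutes > 15 and job.map_progress == 0 and job.reduce_progress == 0:
        alerts.append("Long scheduling")
    if (job.map_progress == 1 and job.reduce_progress == 1
            and not job.completed and job.minutes_at_full_progress > 15):
        alerts.append("Long cleanup")
    if job.hdfs_rw_ops > 500_000:
        alerts.append("Abnormal # of HDFS R/W")
    if 100_000 <= job.bytes_per_cpu_second <= 200_000:
        alerts.append("Slow processing file RW")
    if job.shuffle_bytes > 10 * GIB:
        alerts.append("Very large shuffle size")
    return alerts
```

A job that has been queued for 20 minutes with no map or reduce progress would yield `["Long scheduling"]`. Keeping each rule as an independent check means one job can raise several alerts at once, which matches the table's per-category layout.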
JPA: Job Performance and Historical Job Analyzer
Monitor and analyze job performance in real time
• Historical job analysis
• Running job analysis
• Anomaly host detection
• Job data skew detection
• Job performance suggestion
• Anomaly prediction via machine learning
• Job historical performance trend
• Task and attempt distribution
• Skewness score
• Anomaly historical performance detection
  • TooLowBytesConsumedPerCPUSecond
  • Job StatisticLongDuration
  • TooLargeShuffleSizeAlert
JOB Performance Design
Real-time Data Skew Detection - Approach
Use Case: Detect data skew from the statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Correlation > 0.9 & Max(Z-Score) > 90%
HDFS Bytes Read → Input Records → Map Duration (ms) → Combine I/P Records
Shuffle Records → Local File Bytes Read → Input Records → Duration (ms) paralyzed
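A rough sketch of the threshold check follows, under loudly stated assumptions. The slide does not say how "Max(Z-Score) > 90%" maps to a cutoff; here it is read as "the largest z-score exceeds the 90th percentile of a standard normal distribution", which is only one possible interpretation. The function names and the choice of population standard deviation are likewise assumptions made for this illustration.

```python
from statistics import NormalDist, mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def max_z_score(values):
    """Largest z-score in the sample, using the population standard deviation."""
    sd = pstdev(values)
    if sd == 0:
        return 0.0
    mu = mean(values)
    return max((v - mu) / sd for v in values)


def is_skewed(durations, counter, min_corr=0.9, z_percentile=0.90):
    """Flag skew when duration tracks the counter and one attempt is an outlier.

    z_percentile is an assumed reading of the slide's "> 90%" threshold.
    """
    z_cutoff = NormalDist().inv_cdf(z_percentile)
    return pearson(durations, counter) > min_corr and max_z_score(durations) > z_cutoff
```

In use, `durations` would hold one value per task attempt (for example `mapDuration`) and `counter` the matching per-attempt counter (for example `mapInputRecords`). The correlation test ties the slow attempt to its input size, so an attempt that is slow for unrelated reasons is not reported as data skew.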
Coming Up …
Possible Expansions …
• Cluster - Available HDFS Usage
• Trend of Usage
• Trend of Node down time
• Job Distribution
• OU Level mapping
• Queue Load Analytics
• Queue Load
• Reduces, Failed w.r.t. Business
• Optimization
• Anomaly Prediction via machine learning
• Real-time Data Skew Detection
• HBase & Hive Metadata Usage Analytics
Thank You

Hui 3.0


Editor's Notes

  • #7 As a framework, Eagle does not assume:
      • Data source (where, what)
      • Business logic execution path (how)
      • Policy engine implementation (how)
      • Data sink (where, what)
    As a framework, Eagle does the following:
      • SQL-like service API
      • High-performing query framework
      • Lightweight stream-processing Java API
      • Extensible policy engine implementation
      • Scalable and distributed rule evaluation
      • Metadata-driven stream processing
      • Data source extensibility
      • Data sink extensibility
      • Interactive dashboard