Hadoop Usage Insight (HUI 1.0)
Session on Descriptive Analytics
ArulKumar
Synopsis: HUI 1.0 is a convergent analytics application that provides comprehensive insights into the usage, load, and performance of applications running on Hadoop clusters. It was developed as a web-enabled tool on top of the Eagle framework.
Contents
• Why we do this?
• Our Customers
• Initial Use Case
• Eagle Monitoring Framework as a Solution …
• How we did this? - Eagle App
• Functional Coverage
• AS IS Features
• Methods & Metrics
• TO BE Features
Why we do this?
Volume
• 2+ large Hadoop clusters
• 3000+ nodes
• 20,000+ jobs per day
• 50,000,00+ tasks per day
• 200+ types of Hadoop metrics
• Millions of audit events per day
Complexity
• Varieties of data sources & collectors
• Join multiple data sources
• Threshold based, window based
• Multiple metrics correlation
• Metrics pre-aggregations
• Alert rules can't be hot deployed
Key Stakeholders & Use Cases
Key Stakeholders
• Sr. Management
• SMEs & Leads
• Hadoop Operations
• Hadoop Developers
• Infrastructure Teams
• Product Managers
Use Cases
• Cluster - Availability and HDFS Usage
  • Rack wise
  • Node wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • M/R count and failure status w.r.t. queues
  • Queue-wise elapsed time, job count
  • % of completion, M & R
• Job - Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • MapReduce (File System, Job, Task)
ROI
• Time to Market
• Reduction in MTTR
• Freedom for Innovation
• Optimized Resource Usage
• Reduction in SLA Time
• Insight on Running Jobs
• Cluster Capacity Insight
• Proactive Remediation
The Initial Use Case …
Anomaly detection algorithm
• Continuously crawl job history immediately after job completion
• Calculate the minute-level job failure ratio for each node
• A node is anomalous when either of two conditions holds:
  • Tasks fail continuously on the node
  • The node has a higher failure ratio than the rest of the nodes in the cluster
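The two conditions above can be sketched in Python. This is only an illustration, not the Eagle implementation: the `TaskAttempt` record, the function name, and both thresholds (`min_consecutive_failures`, `ratio_margin`) are hypothetical, since the slide does not give concrete values.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskAttempt:
    node: str      # host that ran the attempt
    minute: int    # minute bucket of the attempt's finish time
    failed: bool   # True if the attempt failed


def anomalous_nodes(attempts, min_consecutive_failures=5, ratio_margin=0.3):
    """Return the set of nodes flagged by either condition.

    Both thresholds are illustrative assumptions, not values from the slides.
    """
    by_node = defaultdict(list)
    for a in sorted(attempts, key=lambda a: a.minute):
        by_node[a.node].append(a)

    # Failure ratio per node over the attempts passed in
    # (for example, one minute's worth of crawled job history).
    ratio = {n: sum(a.failed for a in xs) / len(xs) for n, xs in by_node.items()}

    flagged = set()
    for node, xs in by_node.items():
        # Condition 1: tasks fail continuously on this node.
        streak = longest = 0
        for a in xs:
            streak = streak + 1 if a.failed else 0
            longest = max(longest, streak)
        if longest >= min_consecutive_failures:
            flagged.add(node)
            continue

        # Condition 2: failure ratio clearly above the rest of the cluster.
        others = [r for n, r in ratio.items() if n != node]
        if others and ratio[node] > mean(others) + ratio_margin:
            flagged.add(node)
    return flagged
```

Comparing each node against the mean of the other nodes, rather than the cluster-wide mean, keeps a single bad node from raising its own baseline.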
Eagle Monitoring Framework as a Solution …
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
Eagle Query Framework
How we do this? - Eagle App - HUI 1.0
• Data Collector
• Data Processing
• Metric Aggregation
• Alert Engine
• Storage
Functional Coverage
Hosts MR Jobs Counters Anomalies Metadata
DEV Queue, BatchA/C JobAttempt, File Sys RJ analysis, JPA Hbase, Hive Reports
PROD SLA Jobs, MapReduce, HDFS Skew & Host detection Mongo DB InteractiveQry
• MR Analytics, MR History, Job Counters, Failure Ratio, Slot Usage | Queue, Batch A/C, OU-Track, Applications
• TooLong, NoProgress, TooManyT-Failure, MR Failure Ratio, Spill Over, Host Anomalies | Queue, Batch A/C, Job, OU-Track
• Usage, Volume, Growth, Retention, Purge Idea | Users, Applications, OU, Track, DT & Outside
• Custom Dashboard | Query by: Queue, Batch A/C, Job Name, Time Slot, OU Specific, Track
• Availability, Capacity, Trend Analysis, Prediction | Node, OU Usage, Track Usage
• MapTaskAttempt, ReduceTaskAttempt, MapTaskAttemptFS, ReduceTaskAttemptFS, MR.FileSystem, MR.Job/Task | Queue, Batch A/C, Job, OU-Track, Application
Platform | Data | Infra | Business Units
AS IS …
AS IS Features
• Cluster - Availability and HDFS Usage
  • Rack wise
  • Node wise
  • Host anomaly detection
• Queue - Load Analysis
  • Queue load w.r.t. Batch A/C
  • M/R count and failure status w.r.t. queues
  • Queue-wise elapsed time
  • Job count
  • % of completion, M & R
• Job - Usage and Progress Status
  • Usage distribution of jobs across queues
  • Job listing & status with elapsed time on queue
• Alerts
  • Job alert categorization and distribution across queues & users
• Optimization
  • Optimization suggestions for jobs in each queue (start time of job & other counter details on screen)
• Counter Analytics
  • Map Task Attempt (File System)
  • Reduce Task Attempt (File System)
  • MapReduce (File System, Job, Task)
Hosts Usage (Typical)
Queue Usage
Task Completion View (Typical)
Methods & Metrics …
Typical Alert Types and Categories

| Alert type | Alert category | Trigger condition | Email frequency | Actions |
|---|---|---|---|---|
| Job Performance Alerts | Long execution compared with historical data | Alert when there is a peak | 1 hour or > 10 jobs | Notify user with insight |
| Job Performance Alerts | Execution time > 12 hours | Execution time > 12 hours | 1 hour or > 10 jobs | Notify user with insight |
| Job Performance Alerts | Slow progress | HDFS R/W, File R/W has no progress within 15 minutes | 1 hour or > 10 jobs | Resource availability check |
| Job Performance Alerts | Long scheduling | Map & Reduce progress 0% even after 15 minutes | 1 hour or > 10 jobs | Resource availability check |
| Job Performance Alerts | Long cleanup | Map & Reduce progress 100% but job not completed in 15 min. | 1 hour or > 10 jobs | System resource availability check |
| Job Performance Alerts | Abnormal | # of HDFS R/W > 0.5M | 1 hour or > 10 jobs | Notify and optimize job |
| Job Performance Alerts | Slow processing | File R/W # of bytes is between 100 and 200 K bytes per CPU second | 1 hour or > 10 jobs | Notify and optimize job |
| Job Performance Alerts | Very large shuffle size | > 10 GB | 1 hour or > 10 jobs | Notify and optimize job |
| Bad Node | Node Anomaly Alert | Bad node has high failure ratio | On-demand | Restart daemon / decommission node |
| Job Exception | Job Anomaly Alert | Buggy job has a much higher failure ratio than any other job | On-demand | Send email to owner |
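The fixed-threshold job performance rules from the table above could be evaluated along the following lines. This is a hedged sketch rather than HUI's actual rule engine: `JobSnapshot` and all of its field names are hypothetical, "0.5M" is read as 500,000 operations, "100 to 200 K bytes" as 100,000 to 200,000 bytes, and "10 GB" as 10 × 1024³ bytes. The "long execution compared with historical data" rule is left out because it needs a history baseline that the slide does not describe.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    """Point-in-time view of a running job (all fields are hypothetical)."""
    elapsed_min: float            # minutes since submission
    map_progress: float           # 0.0 to 1.0
    reduce_progress: float        # 0.0 to 1.0
    completed: bool
    min_since_last_io: float      # minutes since HDFS/file R/W counters last moved
    min_at_full_progress: float   # minutes spent at 100% map and reduce progress
    hdfs_rw_ops: int              # number of HDFS read/write operations
    bytes_per_cpu_sec: float      # file bytes read/written per CPU second
    shuffle_bytes: int


def job_performance_alerts(job: JobSnapshot) -> list[str]:
    """Return the alert categories whose trigger condition holds."""
    alerts = []
    if job.elapsed_min > 12 * 60:
        alerts.append("Execution time > 12 hours")
    if not job.completed and job.min_since_last_io >= 15:
        alerts.append("Slow progress")
    if job.elapsed_min > 15 and job.map_progress == 0 and job.reduce_progress == 0:
        alerts.append("Long scheduling")
    if (job.map_progress == 1 and job.reduce_progress == 1
            and not job.completed and job.min_at_full_progress > 15):
        alerts.append("Long cleanup")
    if job.hdfs_rw_ops > 500_000:
        alerts.append("Abnormal")
    if 100_000 <= job.bytes_per_cpu_sec <= 200_000:
        alerts.append("Slow processing")
    if job.shuffle_bytes > 10 * 1024 ** 3:
        alerts.append("Very large shuffle size")
    return alerts
```

A job can match several categories at once, so the function returns a list; throttling notifications to the "1 hour or > 10 jobs" email frequency would sit in a separate layer.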
JPA: Job Performance and Historical Job Analyzer
Monitor and analyze job performance in real time
• Historical job analysis
• Running job analysis
• Anomaly host detection
• Job data skew detection
• Job performance suggestion
• Anomaly prediction via machine learning
• Job historical performance trend
• Task and attempt distribution
• Skewness score
• Anomaly historical performance detection
  • TooLowBytesConsumedPerCPUSecond
  • Job StatisticLongDuration
  • TooLargeShuffleSizeAlert
JOB Performance Design
Real-time Data Skew Detection - Approach
Use Case: Detect data skew from the statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Correlation > 0.9 & Max(Z-Score) > 90%
HDFS Bytes Read → Input Records → Map Duration (ms) → Combine I/P Records
Shuffle Records → Local File Bytes Read → Input Records → Duration (ms) paralyzed
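A rough Python sketch of the threshold check follows, under stated assumptions. The slide does not define "Max(Z-Score) > 90%"; here it is read as "the largest z-score lies above the 90th percentile of a standard normal distribution", which is only one possible interpretation. The function names and the choice to correlate one counter with the attempt duration are likewise illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def z_scores(values):
    """Standard scores of `values`; all zeros when there is no variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def is_skewed(durations, counter, min_corr=0.9, min_percentile=0.90):
    """Flag skew when a counter tracks duration and one attempt is an outlier.

    durations: per-attempt durations (e.g. mapDuration)
    counter:   per-attempt counter values (e.g. mapInputRecords)
    """
    corr = pearson(counter, durations)
    max_z = max(z_scores(durations))
    # Assumed reading of "Max(Z-Score) > 90%": convert z to a normal percentile.
    percentile = NormalDist().cdf(max_z)
    return corr > min_corr and percentile > min_percentile
```

For example, ten map attempts where nine process about 100 records in about 10 ms and one processes 600 records in 60 ms would be flagged, while attempts with identical durations would not.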
Coming Up …
Possible Expansions …
• Cluster - Available HDFS Usage
  • Trend of usage
  • Trend of node downtime
  • Job distribution
  • OU-level mapping
• Queue Load Analytics
  • Queue load
  • Reduces, failed w.r.t. business
• Optimization
  • Anomaly prediction via machine learning
  • Real-time data skew detection
  • HBase & Hive metadata usage analytics
Thank You

More Related Content

What's hot

Everyday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for HumansEveryday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for Humans
Databricks
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
Databricks
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Spark Summit
 
IoT Austin CUG talk
IoT Austin CUG talkIoT Austin CUG talk
IoT Austin CUG talk
Felicia Haggarty
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorYahoo Developer Network
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Databricks
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
Josh Patterson
 
Tuning Java Servers
Tuning Java Servers Tuning Java Servers
Tuning Java Servers
Srinath Perera
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the People
Databricks
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Spark Summit
 
Puree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using InteranaPuree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using Interana
Jagjit Srawan
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
Databricks
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
Databricks
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
Spark Summit
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
Spark Summit
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsSrinath Perera
 

What's hot (20)

Everyday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for HumansEveryday Probabilistic Data Structures for Humans
Everyday Probabilistic Data Structures for Humans
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
IoT Austin CUG talk
IoT Austin CUG talkIoT Austin CUG talk
IoT Austin CUG talk
 
November 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity CalculatorNovember 2013 HUG: Compute Capacity Calculator
November 2013 HUG: Compute Capacity Calculator
 
Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Tuning Java Servers
Tuning Java Servers Tuning Java Servers
Tuning Java Servers
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the People
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Puree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using InteranaPuree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using Interana
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
Interactive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by SparkInteractive Visualization of Streaming Data Powered by Spark
Interactive Visualization of Streaming Data Powered by Spark
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
 

Similar to Hui 3.0

Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
Xiang Fu
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
DataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
Tony Ng
 
Adding Value in the Cloud with Performance Test
Adding Value in the Cloud with Performance TestAdding Value in the Cloud with Performance Test
Adding Value in the Cloud with Performance Test
Rodolfo Kohn
 
EnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAsEnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAs
EDB
 
An introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud ApplicationsAn introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud Applications
Ravi Yogesh
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
Databricks
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015
Hao Chen
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
gluent.
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
Sathish24111
 
Performance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei RadovPerformance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei Radov
Valeriia Maliarenko
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
AboutYouGmbH
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
Aamir Ameen
 
Resolving problems & high availability
Resolving problems & high availabilityResolving problems & high availability
Resolving problems & high availability
Zend by Rogue Wave Software
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
Georg Heiler
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Codemotion
 
Orsyp Dollar Universe - Performance Management for SAP
Orsyp Dollar Universe - Performance Management for SAPOrsyp Dollar Universe - Performance Management for SAP
Orsyp Dollar Universe - Performance Management for SAP
ORSYP SOFTWARE
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Apache Apex
 

Similar to Hui 3.0 (20)

Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Adding Value in the Cloud with Performance Test
Adding Value in the Cloud with Performance TestAdding Value in the Cloud with Performance Test
Adding Value in the Cloud with Performance Test
 
EnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAsEnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAs
 
An introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud ApplicationsAn introduction to Workload Modelling for Cloud Applications
An introduction to Workload Modelling for Cloud Applications
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015Eagle from eBay at China Hadoop Summit 2015
Eagle from eBay at China Hadoop Summit 2015
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Performance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei RadovPerformance testing in scope of migration to cloud by Serghei Radov
Performance testing in scope of migration to cloud by Serghei Radov
 
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Resolving problems & high availability
Resolving problems & high availabilityResolving problems & high availability
Resolving problems & high availability
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
Orsyp Dollar Universe - Performance Management for SAP
Orsyp Dollar Universe - Performance Management for SAPOrsyp Dollar Universe - Performance Management for SAP
Orsyp Dollar Universe - Performance Management for SAP
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 

Recently uploaded

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 

Recently uploaded (20)

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

Hui 3.0

  • 1. 1 Hadoop Usage Insight (HUI 1.0) Session on Descriptive Analytics ArulKumar Synopsis : HUI 1.0 is a convergent Analytics application that provides comprehensive Insights on Usage, Load and Performance of Applications running on Hadoop Clusters. It has been developed as a web enabled tool leveraging Eagle Framework.
  • 2. Contents
        • Why we do this?
        • Our Customers
        • Initial Use Case
        • Eagle Monitoring Framework as a Solution …
        • How we did this? - EagleApp
        • Functional Coverage
        • AS IS Features
        • Methods & Metrics
        • TO BE Features
  • 3. Why we do this?
      Volume
        • 2+ large Hadoop clusters
        • 3000+ nodes
        • 20,000+ jobs per day
        • 50,000,00+ tasks per day
        • 200+ types of Hadoop metrics
        • Millions of audit events per day
      Complexity
        • Varieties of data sources & collectors
        • Join multiple data sources
        • Threshold based, windows based
        • Multiple metrics correlation
        • Metrics pre-aggregations
        • Alert rules can't be hot deployed
  • 4. Key Stakeholders & Use Cases
      Stakeholders
        • Sr. Management
        • SMEs & Leads
        • Hadoop Operations
        • Hadoop Developers
        • Infrastructure Teams
        • Product Managers
      Use Cases
        • Cluster: availability and HDFS usage
            • Rack-wise
            • Node-wise
            • Host anomaly detection
        • Queue: load analysis
            • Queue load w.r.t. Batch A/C
            • Map/Reduce count and failure status w.r.t. queues
            • Queue-wise elapsed time, job count
            • % completion of Map & Reduce
        • Job: usage and progress status
            • Usage distribution of jobs across queues
            • Job listing and status with elapsed time on queue
        • Alerts
            • Job alert categorization and distribution across queues and users
        • Optimization
            • Optimization suggestions for jobs in each queue (start time of job and other counter details on screen)
        • Counter Analytics
            • Map Task Attempt (File System)
            • Reduce Task Attempt (File System)
            • MapReduce (File System, Job, Task)
      ROI
        • Time to market
        • Reduction in MTTR
        • Freedom for innovation
        • Optimized resource usage
        • Reduction in SLA time
        • Insight on running jobs
        • Cluster capacity insight
        • Proactive remediation
  • 5. The Initial Use Case … Anomaly detection algorithm
        • Continuously crawl job history immediately after job completion
        • Calculate the minute-level job failure ratio for each node
        • A node is anomalous when either of two conditions holds:
            • Tasks fail continuously on this node
            • Its failure ratio is higher than that of the rest of the nodes in the cluster
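
The two anomaly conditions on slide 5 can be sketched roughly as follows. This is a minimal illustration, not the HUI or Eagle implementation: the `(minute, node, succeeded)` event shape, the `streak` length of 3 minutes, and the `margin` of 0.3 are all assumptions, since the slide gives no concrete thresholds.

```python
from collections import defaultdict
from statistics import mean


def failure_ratios(task_events):
    """Minute-level failure ratio per node.

    task_events: iterable of (minute, node, succeeded) tuples (hypothetical shape).
    Returns {minute: {node: failed_tasks / total_tasks}}.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [failed, total]
    for minute, node, succeeded in task_events:
        cell = counts[minute][node]
        cell[1] += 1
        if not succeeded:
            cell[0] += 1
    return {
        minute: {node: failed / total for node, (failed, total) in nodes.items()}
        for minute, nodes in counts.items()
    }


def anomalous_nodes(ratios, streak=3, margin=0.3):
    """Flag nodes that meet either condition from the slide.

    (a) Every task on the node failed for `streak` consecutive reported minutes.
    (b) In some minute, the node's failure ratio exceeds the mean ratio of the
        other nodes by more than `margin`.
    Both thresholds are illustrative assumptions.
    """
    flagged = set()
    run = defaultdict(int)  # consecutive all-failed minutes per node
    for minute in sorted(ratios):
        per_node = ratios[minute]
        for node, ratio in per_node.items():
            run[node] = run[node] + 1 if ratio == 1.0 else 0
            if run[node] >= streak:
                flagged.add(node)
            others = [r for n, r in per_node.items() if n != node]
            if others and ratio - mean(others) > margin:
                flagged.add(node)
    return flagged
```

Comparing against the mean of the other nodes is only one way to read "higher failure ratio than the rest of the nodes"; a percentile or z-score comparison would fit the slide equally well.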
  • 6. Eagle Monitoring Framework as a Solution … (Eagle Query Framework)
      Persistence
        • Metric
        • Event
        • Metadata
        • Alert
        • Log
        • Customized Structure
      Query
        • Search
        • Filter
        • Aggregation
        • Sort
        • Expression
        • …
  • 7. How we do this? - Eagle App - HUI 1.0
      Components: Data Collector, Data Processing, Metric Aggregation, Alert Engine, Storage
  • 8. Functional Coverage
      Areas: Hosts, MR Jobs, Counters, Anomalies, Metadata
      DEV: Queue, Batch A/C; JobAttempt, File Sys; RJ analysis, JPA; HBase, Hive; Reports
      PROD: SLA Jobs; MapReduce, HDFS; Skew & Host detection; MongoDB; Interactive Query
      Panels (items | breakdown):
        • MR Analytics, MR History, Job Counters, Failure Ratio, Slot Usage | Queue, Batch A/C, OU-Track, Applications
        • TooLong, NoProgress, TooManyT-Failure, MR Failure Ratio, Spill Over, Host Anomalies | Queue, Batch A/C, Job, OU-Track
        • Usage, Volume, Growth, Retention, Purge Idea | Users, Applications, OU Track, DT & Outside
        • Custom Dashboard, Query By: Queue, Batch A/C, Job Name, Time Slot, OU Specific, Track
        • Availability, Capacity, Trend Analysis, Prediction | Node, OU Usage, Track Usage
        • MapTaskAttempt, ReduceTaskAttempt, MapTaskAttemptFS, ReduceTaskAttemptFS, MR.FileSystem, MR.Job/Task | Queue, Batch A/C, Job, OU-Track
      Layers: Application, Platform, Data, Infra, Business Units
  • 10. AS IS Features
        • Cluster: availability and HDFS usage
            • Rack-wise
            • Node-wise
            • Host anomaly detection
        • Queue: load analysis
            • Queue load w.r.t. Batch A/C
            • Map/Reduce count and failure status w.r.t. queues
            • Queue-wise elapsed time
            • Job count
            • % completion of Map & Reduce
        • Job: usage and progress status
            • Usage distribution of jobs across queues
            • Job listing and status with elapsed time on queue
        • Alerts
            • Job alert categorization and distribution across queues and users
        • Optimization
            • Optimization suggestions for jobs in each queue (start time of job and other counter details on screen)
        • Counter Analytics
            • Map Task Attempt (File System)
            • Reduce Task Attempt (File System)
            • MapReduce (File System, Job, Task)
  • 15. Typical Alert Types and Categories …
      Alert type | Alert category | Trigger condition | Email frequency | Actions
      Job Performance Alerts | Long execution compared with historical data | Alert when there is a peak | 1 hour or > 10 jobs | Notify user with insight
      Job Performance Alerts | Execution time > 12 hours | Execution time > 12 hours | 1 hour or > 10 jobs | Notify user with insight
      Job Performance Alerts | Slow progress | HDFS R/W, File R/W has no progress within 15 minutes | 1 hour or > 10 jobs | Resource availability check
      Job Performance Alerts | Long scheduling | Map & Reduce progress 0% even after 15 minutes | 1 hour or > 10 jobs | Resource availability check
      Job Performance Alerts | Long cleanup | Map & Reduce progress 100% but job not completed in 15 min. | 1 hour or > 10 jobs | System resource availability check
      Job Performance Alerts | Abnormal # of HDFS R/W | > 0.5M | 1 hour or > 10 jobs | Notify and optimize job
      Job Performance Alerts | Slow processing | File R/W # of bytes is between 100 and 200 K bytes per CPU second | 1 hour or > 10 jobs | Notify and optimize job
      Job Performance Alerts | Very large shuffle size | > 10GB | 1 hour or > 10 jobs | Notify and optimize job
      Bad Node | Node Anomaly Alert | Bad node has a high failure ratio | On-demand | Restart daemon / decommission node
      Job Exception | Job Anomaly Alert | Buggy job has a much higher failure ratio than any other job | On-demand | Send email to owner
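
As a rough illustration of how the threshold-style triggers on slide 15 might be evaluated, the sketch below checks a job snapshot against a small rule list. Everything here is hypothetical: the `JobSnapshot` fields, the rule names, and the reading of "10GB" as 10 × 10^9 bytes are assumptions, and Eagle's own policy engine is not used. The "long execution compared with historical data" and "slow processing" rows are omitted because they need historical baselines and per-CPU-second counters that the slide does not detail.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    """Hypothetical point-in-time view of one job; all field names are illustrative."""
    elapsed_minutes: float
    map_progress: float               # 0.0 to 1.0
    reduce_progress: float            # 0.0 to 1.0
    completed: bool
    minutes_since_io_progress: float  # time since HDFS/file R/W counters last moved
    minutes_at_full_progress: float   # time spent at 100% map and reduce progress
    hdfs_rw_ops: int
    shuffle_bytes: int


# (alert category, predicate) pairs in the order of the slide's table.
RULES = [
    ("Execution time > 12 hours",
     lambda j: j.elapsed_minutes > 12 * 60),
    ("Slow progress",
     lambda j: not j.completed and j.minutes_since_io_progress >= 15),
    ("Long scheduling",
     lambda j: j.elapsed_minutes >= 15 and j.map_progress == 0 and j.reduce_progress == 0),
    ("Long cleanup",
     lambda j: (not j.completed and j.map_progress == 1 and j.reduce_progress == 1
                and j.minutes_at_full_progress >= 15)),
    ("Abnormal # of HDFS R/W",
     lambda j: j.hdfs_rw_ops > 500_000),
    ("Very large shuffle size",
     lambda j: j.shuffle_bytes > 10 * 10**9),  # assumes decimal gigabytes
]


def triggered_alerts(job):
    """Return the categories whose trigger condition holds for this snapshot."""
    return [name for name, predicate in RULES if predicate(job)]
```

Keeping the rules as data rather than hard-coded branches is one way to approach the "alert rules can't be hot deployed" pain point from slide 3, since a rule list can in principle be reloaded without redeploying the evaluator.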
  • 16. JPA: Job Performance and Historical Job Analyzer (Job Performance Design)
      Monitor and analyze job performance in real time
        • Historical job analysis
        • Running job analysis
        • Anomaly host detection
        • Job data skew detection
        • Job performance suggestion
        • Anomaly prediction via machine learning
        • Job historical performance trend
        • Task and attempt distribution
        • Skewness score
        • Anomaly historical performance detection
            • TooLowBytesConsumedPerCPUSecond
            • Job StatisticLongDuration
            • TooLargeShuffleSizeAlert
  • 17. Real-time Data Skew Detection: Approach
      Use case: detect data skew from statistics and distributions of attempt execution durations and counters
      Assumption: durations and counters should follow a normal distribution
      Counters & features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead
      Modeling & statistics: Avg, Min, Max, Distributions, Max z-score, Top-N, Correlation
      Threshold & detection: Correlation > 0.9 & Max(Z-Score) > 90%
      Diagram: HDFS Bytes Read → Input Records → Map Duration (ms) → Combine I/P Records; Shuffle Records → Local File Bytes Read → Input Records → Duration (ms); paralyzed
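
The threshold on slide 17 (Correlation > 0.9 and Max(Z-Score) > 90%) could be checked along these lines. This is a sketch under stated assumptions: "Max(Z-Score) > 90%" is read here as "the largest z-score lies above the 90th percentile of a standard normal distribution", which is only one possible interpretation, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def max_z_score(values):
    """Largest z-score in the sample (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return max((v - mu) / sigma for v in values) if sigma else 0.0


def is_skewed(durations_ms, input_records, min_corr=0.9, min_prob=0.9):
    """Flag skew when attempt duration tracks input size and one attempt is an outlier.

    min_corr and min_prob mirror the 0.9 and 90% figures on the slide; reading the
    latter as a normal-CDF probability is an assumption.
    """
    correlated = pearson(input_records, durations_ms) > min_corr
    outlier = NormalDist().cdf(max_z_score(durations_ms)) > min_prob
    return correlated and outlier
```

For example, five map attempts where one reads roughly nine times as many records and runs roughly nine times as long would be flagged, while attempts with identical input sizes would not. With very few attempts the z-score test is weak, so the thresholds would need tuning on real job history.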
  • 19. Possible Expansions …
        • Cluster: available HDFS usage
        • Trend of usage
        • Trend of node downtime
        • Job distribution
        • OU-level mapping
        • Queue load analytics
        • Queue load
        • Reduces, Failed w.r.t. Business
        • Optimization
        • Anomaly prediction via machine learning
        • Real-time data skew detection
        • HBase & Hive metadata usage analytics

Editor's Notes

  1. As a framework, Eagle does not assume:
       • Data source (where, what)
       • Business logic execution path (how)
       • Policy engine implementation (how)
       • Data sink (where, what)
     As a framework, Eagle provides the following:
       • SQL-like service API
       • High-performing query framework
       • Lightweight stream-processing Java API
       • Extensible policy engine implementation
       • Scalable and distributed rule evaluation
       • Metadata-driven stream processing
       • Data source extensibility
       • Data sink extensibility
       • Interactive dashboard