1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Mool – Automated Root Cause Analysis using ML
Rohit Choudhary & Gaurav Nagar, Hortonworks
Introduction
• HDP
– Cumulative Big Data Package with 25+ Certified Open Source Apache Projects
– Source code arrives from both the Community and Internal Engineering
• QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; 100s of new tests are added every release
• Release Stability
– Determined by System Test failure and pass percentages
– Once new features and System Tests are at 100%, we call the release done!
• Releases
– On-premise Releases
– Cloud Releases – HDI and HDC
Problem Statement
• Test Suite Size
– System Tests are organized as Suites, also called Splits – 700
– Several 1000s of test cases, executed in every run
• Infrastructure
– YarnCloud Infrastructure & OpenStack Infrastructure
– 700 x 5-node+ HDP clusters – creation and teardown
– Test Suites are run on each cluster and logs are collected
– Tests produce 1-1.5 TB of System Logs across our stack every day
• Failure Assessments and Subsequent Process
– Component owners undertake the responsibility of identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)
Mool – Automated Log Analysis
• Root Cause across components in one click
– Identify common failure causes across components
• Recommend actions instead of assisted search, using
– Systemic Knowledge/Repository of Errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems
• Integrate with downstream process lifecycle
– Test Analysis
– Ticketing system integration
Mool – Sanskrit for "Root"
Past Industry Efforts – AALA @Siemens
• Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms
do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true
answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the
test system verdicts.”
RCA Analysis Process
• 1 – Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction
• 2 – Enrichment
– Enriched with Test Execution Time
– Origin Components
• 3 – Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades
Algebra
[Diagram: test cases TC1–TC4 and errors E1–E4, keyed by Run ID, Component, and Suite]
Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}
Error – Test Case Correlation (Conversely)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}
Where Components = {C1, C2, C3, C4}
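
The converse mapping can be derived mechanically from the test case sets. The following is a minimal Python sketch, not Mool's actual implementation; the function names `invert` and `common_errors` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Test case -> errors observed during that test case (values taken from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(mapping):
    """Build the converse mapping: error -> set of test cases that saw it."""
    inverse = defaultdict(set)
    for test_case, errors in mapping.items():
        for error in errors:
            inverse[error].add(test_case)
    return dict(inverse)


def common_errors(mapping, test_cases):
    """Errors shared by every listed test case (candidate common causes)."""
    sets = [mapping[tc] for tc in test_cases]
    return set.intersection(*sets) if sets else set()


error_test_cases = invert(test_case_errors)
```

For example, `common_errors(test_case_errors, ["TC1", "TC4"])` yields `{"E1", "E4"}`, the errors that both failing test cases have in common.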
Algebra - Explained
Single Cluster Run (timeline T = 0 to T = f, with a marker at T = t)
• Suites i, j, k, l, … n each produce their own sets:
– Errors: E1, E2, E3…
– Test Cases: TC1, TC2, TC3…
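Algebra - Explained
Multi-cluster Run (Suite i shown in four time windows, each with its own errors and test cases)
• T = 0 to T = f (marker T = t)
– Errors: E1, E2, E3… Test Cases: TC1, TC2, TC3…
• T = t1 to T = f2 (marker T = t2)
– Errors: E1, E2, E3… Test Cases: TCi1, TCi2, TCi3…
• T = t2 to T = f3 (marker T = t3)
– Errors: E1, E2, E3… Test Cases: TC1, TC2, TC3…
• T = t3 to T = f4 (marker T = t4)
– Errors: E1, E2, E3… Test Cases: TC1, TC2, TC3…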
Error Paths and Feature Extraction
• Test Suites: Hive Suite, Spark, Oozie Workflow
• Stack Calls (component stacks shown in the diagram)
– Hive Server2, Yarn, ATS, HDFS
– Livy, Yarn, HDFS
– Pig, Hive, Yarn, HDFS
• Errors along each stack: E1, E2, E3 / E1, E2, E3, E4, E5, E6 / En…
Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
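
The two feature sets can be modelled as plain records. The sketch below is illustrative only: the class names `TestCaseRecord` and `ErrorRecord` and the helper `errors_during` are hypothetical, and joining errors to a test case by its execution time window is an assumption based on the "Enriched with Test Execution Time" step, not a confirmed detail of Mool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TestCaseRecord:
    """Test case features as listed on the slide."""
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorRecord:
    """Error features as listed on the slide."""
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str      # component that emitted the error
    category: str
    file_name: str   # log file the error was parsed from


def errors_during(test_case, errors):
    """Assumed join rule: keep errors logged inside the test's execution window."""
    return [
        e for e in errors
        if test_case.start_time <= e.occurrence_time <= test_case.end_time
    ]
```

Frozen dataclasses keep the records hashable, which is convenient if they are later collected into sets such as TC1 = {E1, E2, E4}.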
Salient Points: Failure Sample & Error Samples
Test Case Failures
System Interactions
• Sources: Customer Reports, Source Code, Historical Error DB, Ticket Systems
• Core: Data Pipeline, Ensemble Modeling & Learning, Metadata Store
• Outputs: Recommendations, Automated Actions
Application Architecture
• Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components
• Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– Error Hierarchy Association
– RCA/Associative Analysis
• Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
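
The identical-match and nearest-match stages can be sketched as follows. This is a hedged illustration, not Mool's code: the slide mentions a "Levenshtein Adaptation" without detail, so plain Levenshtein distance normalized by length is used here as a stand-in. The masking rules in `normalize`, the `match` function, and the 0.8 threshold are all assumptions.

```python
import hashlib
import re


def normalize(stacktrace):
    """Mask volatile tokens (hex ids, numbers) so equivalent traces compare equal."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", stacktrace)
    text = re.sub(r"\d+", "<N>", text)
    return text.strip()


def fingerprint(stacktrace):
    """Stable key for identical matching of normalized stack traces."""
    return hashlib.sha1(normalize(stacktrace).encode("utf-8")).hexdigest()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance scaled to [0, 1]; 1.0 means the strings are equal."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match(stacktrace, repository, threshold=0.8):
    """repository maps fingerprint -> normalized trace. Returns (kind, key)."""
    key = fingerprint(stacktrace)
    if key in repository:
        return ("identical", key)
    normalized = normalize(stacktrace)
    best_key, best_score = None, 0.0
    for known_key, known_trace in repository.items():
        score = similarity(normalized, known_trace)
        if score > best_score:
            best_key, best_score = known_key, score
    if best_key is not None and best_score >= threshold:
        return ("nearest", best_key)
    return ("new", key)
```

Trying the cheap exact lookup first means the quadratic edit-distance comparison only runs for traces that have not been seen before.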
Deployment Architecture
• Data Source
– Test Clusters, each with a Log Daemon
– Log daemons push logs into HDFS
– Trigger analysis at end of run
• Storage
– HDFS
– MetaData Store
• Data Processing
– Split Processor
– Livy (Job Server)
– Spark Jobs
• Application
– Web Application
– Manual input for selection/rejection of outcome
RCA Versus Error Graph Creation
• Error Graph Creation Failed
– FP Growth Algorithm did not yield desired results
– Too many closed loops, cyclic dependencies
– Time as a split dimension was not enough
• Moved towards RCAs
– Origin of the error chain was easier to find out
– Accuracy was higher
– Enough data supporting multiple code-flows
• Easier to validate through our system analysts
– Unsupervised Learning is hard to validate without manual intervention
RCA Rejections
• False Positives are very prevalent
– Dominating exceptions arise from frequently executed code paths
– They are repetitive and need to be ignored statistically, based on decile values
• Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow a final decision
– If corresponding tickets are open, then those issues take priority
• Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
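
One way to ignore dominating exceptions by decile is to drop errors whose occurrence count exceeds a chosen decile cut point. The sketch below is an assumption-laden illustration: the slide does not say which decile is used, so the 9th decile (90th percentile) default and the function name `dominating_errors` are hypothetical.

```python
import statistics
from collections import Counter


def dominating_errors(error_ids, decile=9):
    """Return error ids whose occurrence count exceeds the given decile cut point.

    error_ids: iterable with one entry per raw error occurrence.
    decile: 1..9; the default of 9 is an assumed choice, not from the source.
    """
    counts = Counter(error_ids)
    if len(counts) < 2:
        # Too few distinct errors to compute quantiles meaningfully.
        return set()
    # n=10 yields the nine cut points that separate the deciles.
    cut_points = statistics.quantiles(counts.values(), n=10, method="inclusive")
    threshold = cut_points[decile - 1]
    return {error for error, count in counts.items() if count > threshold}
```

Errors returned by this filter would be candidates for the "Ignored" bucket rather than being deleted outright, so an analyst can still override the decision.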
RCA Graph
Quick Stats
Item | Data
Total Run IDs Analyzed | 14410
Total Splits across components | 115 K
Raw errors parsed from logs | 120 M
Unique Errors | 45025
Total Test Case failures | 170 K
Errors related to Failed Test Cases | 592 K
Unique Errors related to Failed Test Cases | 30570
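
A few ratios follow directly from these figures. The short Python sketch below only restates the slide's numbers; because the "K" and "M" values are rounded, the results are approximate, and the variable names are illustrative.

```python
# Figures copied from the Quick Stats table (K = thousand, M = million).
raw_errors = 120_000_000
unique_errors = 45_025
failed_test_cases = 170_000
errors_on_failures = 592_000
unique_errors_on_failures = 30_570

# How many raw log errors collapse into one unique error, on average.
dedup_factor = raw_errors / unique_errors

# Average number of related errors per failed test case.
errors_per_failure = errors_on_failures / failed_test_cases

# Share of all unique errors that are tied to at least one failed test case.
share_unique_on_failures = unique_errors_on_failures / unique_errors

print(f"dedup factor: {dedup_factor:.0f}x")
print(f"errors per failed test case: {errors_per_failure:.2f}")
print(f"unique errors tied to failures: {share_unique_on_failures:.1%}")
```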
Adoption Challenges
• Great for fast-changing code bases
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement
• Log collection required multiple iterations
– Order of logs being written and collected
– Central Log server issues
• Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal/recency parameters are harder to provide
• Unsupervised learning verification is harder
– Very hard to effectively judge performance of models without manual intervention
Future Work
• Unsupervised learning validation using automated techniques
• Online processing using Spark Streaming
• Event-based error detection on live production clusters
• Correlation with other log events/customer use cases
Thank You
Rohit Choudhary & Gaurav Nagar

More Related Content

What's hot

Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
DataWorks Summit
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
DataWorks Summit
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Quality for the Hadoop Zoo
Quality for the Hadoop ZooQuality for the Hadoop Zoo
Quality for the Hadoop Zoo
DataWorks Summit
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at Scale
DataWorks Summit
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
DataWorks Summit/Hadoop Summit
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
DataWorks Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
DataWorks Summit/Hadoop Summit
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
DataWorks Summit
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easy
DataWorks Summit
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
DataWorks Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
 
Fine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and HiveFine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and Hive
DataWorks Summit/Hadoop Summit
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
DataWorks Summit/Hadoop Summit
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 

What's hot (20)

Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
#HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course #HSTokyo16 Apache Spark Crash Course
#HSTokyo16 Apache Spark Crash Course
 
Quality for the Hadoop Zoo
Quality for the Hadoop ZooQuality for the Hadoop Zoo
Quality for the Hadoop Zoo
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at Scale
 
Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?Why is my Hadoop cluster slow?
Why is my Hadoop cluster slow?
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Scalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and TesseractScalable OCR with NiFi and Tesseract
Scalable OCR with NiFi and Tesseract
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
SAM—streaming analytics made easy
SAM—streaming analytics made easySAM—streaming analytics made easy
SAM—streaming analytics made easy
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Omid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache PhoenixOmid: scalable and highly available transaction processing for Apache Phoenix
Omid: scalable and highly available transaction processing for Apache Phoenix
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Fine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and HiveFine-Grained Security for Spark and Hive
Fine-Grained Security for Spark and Hive
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 

Similar to Mool - Automated Log Analysis using Data Science and ML

Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
Perficient, Inc.
 
Effective Testing of Apache Accumulo Iterators
Effective Testing of Apache Accumulo IteratorsEffective Testing of Apache Accumulo Iterators
Effective Testing of Apache Accumulo Iterators
Josh Elser
 
10_years_Experience_in_Automation
10_years_Experience_in_Automation10_years_Experience_in_Automation
10_years_Experience_in_AutomationArpita Gohel
 
002 srikanth system & network administrator 8+yrs
002 srikanth system & network administrator 8+yrs002 srikanth system & network administrator 8+yrs
002 srikanth system & network administrator 8+yrs
SREEKANTH Kama
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
DataWorks Summit
 
OpenTelemetry 101 FTW
OpenTelemetry 101 FTWOpenTelemetry 101 FTW
OpenTelemetry 101 FTW
NGINX, Inc.
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challengessnrism
 
IEC 60870-5 101 Protocol Server Simulator User manual
IEC 60870-5 101 Protocol Server Simulator User manualIEC 60870-5 101 Protocol Server Simulator User manual
IEC 60870-5 101 Protocol Server Simulator User manual
FreyrSCADA Embedded Solution
 
SCM Transformation Challenges and How to Overcome Them
SCM Transformation Challenges and How to Overcome ThemSCM Transformation Challenges and How to Overcome Them
SCM Transformation Challenges and How to Overcome Them
Compuware
 
Connectivity challenges APC Europe by Alan Weber
Connectivity challenges APC Europe by Alan WeberConnectivity challenges APC Europe by Alan Weber
Connectivity challenges APC Europe by Alan Weber
Kimberly Daich
 
Michael_Joshua_Validation
Michael_Joshua_ValidationMichael_Joshua_Validation
Michael_Joshua_ValidationMichaelJoshua
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
DataWorks Summit
 
Soma_Mishra_Resume
Soma_Mishra_ResumeSoma_Mishra_Resume
Soma_Mishra_Resumesoma mishra
 
Define enterprise integration strategy by industry leader bhawani nandanprasad
Define enterprise integration strategy by industry leader bhawani nandanprasadDefine enterprise integration strategy by industry leader bhawani nandanprasad
Define enterprise integration strategy by industry leader bhawani nandanprasad
Bhawani N Prasad
 
Cs 568 Spring 10 Lecture 5 Estimation
Cs 568 Spring 10  Lecture 5 EstimationCs 568 Spring 10  Lecture 5 Estimation
Cs 568 Spring 10 Lecture 5 Estimation
Lawrence Bernstein
 
Lee Wei Yann Resume 2016
Lee Wei Yann Resume 2016Lee Wei Yann Resume 2016
Lee Wei Yann Resume 2016WEI YANN LEE
 
eG Innovations
eG InnovationseG Innovations
eG Innovations
janejarvella
 
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
BIOVIA
 

Similar to Mool - Automated Log Analysis using Data Science and ML (20)

Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
Automatically Retrieving and Loading Data into Siebel CTMS from Multiple CRO ...
 
Effective Testing of Apache Accumulo Iterators
Effective Testing of Apache Accumulo IteratorsEffective Testing of Apache Accumulo Iterators
Effective Testing of Apache Accumulo Iterators
 
10_years_Experience_in_Automation
10_years_Experience_in_Automation10_years_Experience_in_Automation
10_years_Experience_in_Automation
 
002 srikanth system & network administrator 8+yrs
002 srikanth system & network administrator 8+yrs002 srikanth system & network administrator 8+yrs
002 srikanth system & network administrator 8+yrs
 
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the EnterpriseUsing Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise
 
OpenTelemetry 101 FTW
OpenTelemetry 101 FTWOpenTelemetry 101 FTW
OpenTelemetry 101 FTW
 
SDN Controller - Programming Challenges
SDN Controller - Programming ChallengesSDN Controller - Programming Challenges
SDN Controller - Programming Challenges
 
IEC 60870-5 101 Protocol Server Simulator User manual
IEC 60870-5 101 Protocol Server Simulator User manualIEC 60870-5 101 Protocol Server Simulator User manual
IEC 60870-5 101 Protocol Server Simulator User manual
 
Tarun_Medimi
Tarun_MedimiTarun_Medimi
Tarun_Medimi
 
SCM Transformation Challenges and How to Overcome Them
SCM Transformation Challenges and How to Overcome ThemSCM Transformation Challenges and How to Overcome Them
SCM Transformation Challenges and How to Overcome Them
 
Connectivity challenges APC Europe by Alan Weber
Connectivity challenges APC Europe by Alan WeberConnectivity challenges APC Europe by Alan Weber
Connectivity challenges APC Europe by Alan Weber
 
Michael_Joshua_Validation
Michael_Joshua_ValidationMichael_Joshua_Validation
Michael_Joshua_Validation
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
 
Soma_Mishra_Resume
Soma_Mishra_ResumeSoma_Mishra_Resume
Soma_Mishra_Resume
 
Define enterprise integration strategy by industry leader bhawani nandanprasad
Define enterprise integration strategy by industry leader bhawani nandanprasadDefine enterprise integration strategy by industry leader bhawani nandanprasad
Define enterprise integration strategy by industry leader bhawani nandanprasad
 
Cs 568 Spring 10 Lecture 5 Estimation
Cs 568 Spring 10  Lecture 5 EstimationCs 568 Spring 10  Lecture 5 Estimation
Cs 568 Spring 10 Lecture 5 Estimation
 
Rajesh - CV
Rajesh - CVRajesh - CV
Rajesh - CV
 
Lee Wei Yann Resume 2016
Lee Wei Yann Resume 2016Lee Wei Yann Resume 2016
Lee Wei Yann Resume 2016
 
eG Innovations
eG InnovationseG Innovations
eG Innovations
 
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
(ATS6-DEV08) Integrating Contur ELN with other systems using a RESTful API
 

More from DataWorks Summit/Hadoop Summit

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Mool - Automated Log Analysis using Data Science and ML

  • 1. Mool – Automated Root Cause Analysis using ML
      Rohit Choudhary & Gaurav Nagar, Hortonworks
      © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  • 2. Introduction
      HDP
        – Cumulative Big Data package with 25+ certified open source Apache projects
        – Source code arrives from both the community and internal engineering
      QE and Certification Process
        – Every change goes through Git and Gerrit
        – System tests are written for each component; hundreds of new tests are added every release
      Release Stability
        – Determined by system test failure and pass percentages
        – Once new features and system tests are at 100%, we call the release done!
      Releases
        – On-premise releases
        – Cloud releases: HDI and HDC
  • 3. Problem Statement
      Test Suite Size
        – System tests are organized as Suites, also called Splits: 700 of them
        – Several thousand test cases, executed in every run
      Infrastructure
        – YarnCloud infrastructure & OpenStack infrastructure
        – 700 × 5-node+ HDP clusters – creation and teardown
        – Test suites are run on each cluster and logs are collected
        – Tests produce 1–1.5 TB of system logs across our stack every day
      Failure Assessments and Subsequent Process
        – Component owners take responsibility for identifying failures
        – Time-consuming and repetitive, without increasing system knowledge
        – Restrictive (reduces our ability to release faster)
  • 4. Mool – Automated Log Analysis
      Root cause across components in one click
        – Identify common failure causes across components
      Recommend actions, instead of assisted search, using
        – Systemic knowledge: a repository of errors and their associations
        – Recency of occurrence
        – Source modifications as data features
        – Current and past reported issues in ticketing systems
      Integrate with downstream process lifecycle
        – Test analysis
        – Ticketing system integration
      Mool: Sanskrit for "root"
  • 5. Past Industry Efforts – AALA @Siemens
      Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
      “The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the test system verdicts.”
  • 6. RCA Analysis Process
      Stages (diagram): 1. Feature Extraction; 2. Enrichment; 3. Learning
      Diagram labels: Log Message Feature Extraction; Test Failure Feature Extraction; Enriched with Test Execution Time, Origin Components; Error Categorization; RCA Analysis; Error Repository Upgrades
  • 7. Algebra
      Diagram labels: Run ID, Component, Suite; test cases TC1–TC4; errors E1–E4
      Test Case – Error Correlation
        TC1 = {E1, E2, E4}
        TC2 = {E1, E3}
        TC3 = {E3, E4}
        TC4 = {E1, E4}
      Error – Test Case Correlation (conversely)
        E1 = {TC1, TC2, TC4}
        E4 = {TC1, TC3, TC4}
      Where Components = {C1, C2, C3, C4}
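The converse mapping on the Algebra slide can be derived mechanically from the test-case sets. The following is a minimal sketch in Python, assuming the four test-case sets exactly as listed on the slide; the names `test_case_errors` and `invert` are illustrative and do not come from the talk. Inverting the listed sets places TC3 under E4 as well, because TC3 contains E4.

```python
from collections import defaultdict

# Test case -> errors observed during it, copied from the slide's example.
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(tc_to_errors):
    """Build the converse mapping: error -> set of test cases that hit it."""
    error_to_tcs = defaultdict(set)
    for tc, errors in tc_to_errors.items():
        for err in errors:
            error_to_tcs[err].add(tc)
    return dict(error_to_tcs)


error_test_cases = invert(test_case_errors)

# Errors shared by many test cases are natural candidates for a common cause.
for err in sorted(error_test_cases):
    print(err, sorted(error_test_cases[err]))
```

An error that appears under several test cases, here E1 and E4, is the kind of shared failure cause the slide's correlation is meant to surface.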
  • 8. Algebra - Explained (single-cluster run)
      Diagram: Suites i, j, k, l … n laid out on one timeline with markers T = 0, T = t and T = f; each suite carries its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…)
  • 9. Algebra - Explained (multi-cluster run)
      Diagram: Suite i shown four times, each with its own timeline (T = 0 to T = f, T = t1 to T = f2, T = t2 to T = f3, T = t3 to T = f4, with markers T = t, t2, t3, t4) and its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…)
  • 10. Error Paths and Feature Extraction
      Diagram labels: Test Suites (Hive Suite, Oozie Workflow); Stack Call (Hive Server2, Yarn, ATS, HDFS, Livy, Pig, Hive, Spark); Errors (E1, E2, E3; E1, E2, E3, E4, E5, E6; En…)
      Test Case Features = {name, suite_name, start_time, end_time, status}
      Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
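The two feature sets above map naturally onto record types. The sketch below is one possible Python representation, not the authors' schema: the field names come from the slide, while the class names `CaseFeatures` and `ErrorFeatures`, the field types, and the helper `errors_during` are assumptions. In particular, attaching an error to a test case by its execution time window is a guess based on the "Enriched with Test Execution Time" label in the process diagram.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class CaseFeatures:
    # Field names follow the slide; the types are assumed.
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorFeatures:
    # Field names follow the slide; the types are assumed.
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_during(case: CaseFeatures, errors: List[ErrorFeatures]) -> List[ErrorFeatures]:
    """Hypothetical helper: errors logged inside the test case's time window."""
    return [e for e in errors if case.start_time <= e.occurrence_time <= case.end_time]


case = CaseFeatures("test_insert", "hive", datetime(2017, 1, 1, 10, 0),
                    datetime(2017, 1, 1, 10, 5), "FAILED")
inside = ErrorFeatures("at org.example.Foo", "Connection refused",
                       datetime(2017, 1, 1, 10, 2), "hdfs", "network", "namenode.log")
outside = ErrorFeatures("at org.example.Bar", "Timeout",
                        datetime(2017, 1, 1, 11, 0), "yarn", "timeout", "rm.log")
matched = errors_during(case, [inside, outside])
```

All sample values in the usage lines are made up for illustration.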
  • 11. Salient Points: Failure Sample & Error Samples
      Test Case Failures
  • 12. System Interactions
      Diagram labels: Ensemble Modeling & Learning; Customer Reports; Data Pipeline; Source Code; Historical Error DB; Ticket Systems; Recommendations; Automated Actions; Metadata Store
  • 13. Application Architecture
      Stages (diagram): Ingestion; Unsupervised Learning; Outcome
      Diagram labels: Log Accumulation/release branch; Grok parsers for HDP/Ambari components; Identical Match (Stacktrace); Nearest Match (Levenshtein Adaptation); RCA/Associative Analysis; Error Hierarchy Association; Automated Ticket Processing; Recommendation Based on Recency
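The slide names two matching steps, identical match and nearest match, without giving details of the "Levenshtein Adaptation". The sketch below uses the plain, textbook Levenshtein edit distance as a stand-in. The function names, the normalisation by the longer string's length, and the `max_ratio` threshold of 0.2 are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_error(message, known_messages, max_ratio=0.2):
    """Return (matched_message, kind), where kind is 'identical', 'nearest' or 'new'."""
    if message in known_messages:
        return message, "identical"
    best, best_ratio = None, None
    for known in known_messages:
        longest = max(len(message), len(known)) or 1
        ratio = levenshtein(message, known) / longest
        if best_ratio is None or ratio < best_ratio:
            best, best_ratio = known, ratio
    if best is not None and best_ratio <= max_ratio:
        return best, "nearest"
    return None, "new"


# Made-up error messages, for illustration only.
known = ["Connection refused to host nn1:8020", "Table not found: default.t1"]
near = match_error("Connection refused to host nn2:8020", known)
exact = match_error("Table not found: default.t1", known)
fresh = match_error("OutOfMemoryError: Java heap space", known)
```

Checking for an identical match first keeps the common case cheap; the quadratic edit-distance comparison only runs for messages not seen before.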
  • 14. Deployment Architecture
      Columns (diagram): Data Source; Data Processing; Application
      Diagram labels: Test Clusters; Log Daemon; Storage; HDFS; Split Processor; Livy (Job Server); Spark Jobs; MetaData Store; Web Application
      Annotations: Log daemons push logs into HDFS; Trigger analysis at end of run; Manual input for selection/rejection of outcome
  • 15. RCA Versus Error Graph Creation
      Error graph creation failed
        – FP-Growth algorithm did not yield the desired results
        – Too many closed loops and cyclic dependencies
        – Time as a split dimension was not enough
      Moved towards RCAs
        – Origin of the error chain was easier to find
        – Accuracy was higher
        – Enough data supporting multiple code flows
      Easier to validate through our system analysts
        – Unsupervised learning is hard to validate without manual intervention
  • 16. RCA Rejections
      False positives are very prevalent
        – Dominating exceptions arise from frequently executed code paths
        – They are repetitive and need to be ignored, statistically, based on decile values
      Priority versus Ignored versus Historical
        – For historical RCAs, source code changes and recency drive the final decision
        – If corresponding tickets are open, those issues take priority
      Common exceptions or common RCAs
        – Prioritize the ones that cause cross-component failures
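The slide says dominating exceptions are ignored "statistically based on decile values" but does not give the rule. One plausible reading, sketched below in Python, is to count how often each error occurs and flag those whose count lies above the ninth decile of all counts. The function name `dominating_errors`, the choice of the ninth decile, and the sample data are assumptions.

```python
from collections import Counter
from statistics import quantiles


def dominating_errors(occurrences, decile=9):
    """Flag errors whose occurrence count exceeds the given decile of all counts."""
    counts = Counter(occurrences)
    if len(counts) < 2:
        # quantiles() needs at least two data points; nothing to compare against.
        return set()
    # n=10 yields the nine decile cut points; 'inclusive' treats the data as the full population.
    cutoffs = quantiles(list(counts.values()), n=10, method="inclusive")
    threshold = cutoffs[decile - 1]
    return {err for err, n in counts.items() if n > threshold}


# Made-up sample: one exception fires on a hot code path, nine others are rare.
sample = ["E_common"] * 50 + [f"E{i}" for i in range(1, 10)]
flagged = dominating_errors(sample)
```

Flagged errors would then be candidates for the "Ignored" bucket, subject to the ticket and recency checks listed on the slide.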
  • 17. RCA Graph
  • 18. Quick Stats
      Item                                          Data
      Total run IDs analyzed                        14410
      Total splits across components                115 K
      Raw errors parsed from logs                   120 M
      Unique errors                                 45025
      Total test case failures                      170 K
      Errors related to failed test cases           592 K
      Unique errors related to failed test cases    30570
  • 19. Adoption Challenges
      Great for a fast-changing code base
        – Individual component owners have reported up to 99% accuracy
        – Multi-component use case scenarios need improvement
      Log collection required multiple iterations
        – Order of logs being written and collected
        – Central log server issues
      Stable releases are harder to instrument
        – Our internal team has been unable to use it
        – Source code changes are minimal, so recency parameters are harder to provide
      Unsupervised learning verification is harder
        – Very hard to judge model performance effectively without manual intervention
  • 20. Future Work
      – Unsupervised learning validation using automated techniques
      – Online processing using Spark Streaming
      – Event-based error detection on live production clusters
      – Correlation with other log events and customer use cases
  • 21. Thank You
      Rohit Choudhary & Gaurav Nagar

Editor's Notes

  1. TALK TRACK Mool is the application th [NEXT SLIDE]