CRYSTAL BALL EVENT PREDICTION (MAPREDUCE)
& LOG ANALYSIS (SPARK)
By: Jivan Nepali, 985095 Big Data (CS522) Project
PRESENTATION OVERVIEW
Pair Approach
• Pseudo-code for Pair Approach
• Java Implementation for Pair Approach
• Pair Approach Result
Stripe Approach
• Pseudo-code for Stripe Approach
• Java Implementation for Stripe Approach
• Stripe Approach Result
Hybrid Approach
• Pseudo-code for Hybrid Approach
• Java Implementation for Hybrid Approach
• Hybrid Approach Result
• Comparison of three Approaches
Spark
• LogAnalysis – Problem Description
• LogAnalysis – Expected Outcomes
• LogAnalysis – Scala Implementation
• LogAnalysis – Results
PAIR APPROACH
IMPLEMENTATION
PSEUDO CODE – MAPPER
Class MAPPER
    method INITIALIZE
        H = new Associative Array
    method MAP (docid a, doc d)
        for all term w in doc d do
            for all term u in Neighbors(w) do
                H { Pair (w, u) } = H { Pair (w, u) } + count 1 // Tally counts
                H { Pair (w, *) } = H { Pair (w, *) } + count 1 // Tally counts for *
    method CLOSE
        for all Pair (w, u) in H do
            EMIT ( Pair (w, u), count H { Pair (w, u) } )
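The pseudo-code leaves Neighbors(w) undefined. A minimal Python sketch, assuming the neighborhood of w is every term after w in the same record, up to (not including) the next occurrence of w. The window rule and the name `neighbors` are assumptions, not taken from the slides; under this rule the two sample input records yield 40 distinct pairs, which agrees with the Reduce Output Records counter reported for the Pair approach.

```python
def neighbors(terms, i):
    """Terms after position i, stopping before the next occurrence of terms[i]."""
    w = terms[i]
    window = []
    for u in terms[i + 1:]:
        if u == w:
            break
        window.append(u)
    return window


record = "18 34 56 29 12 34 56 92 29 34 12".split()
print(neighbors(record, 1))   # window of the first "34"
print(neighbors(record, 10))  # last term has no neighbors
```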
PSEUDO CODE - REDUCER
Class REDUCER
    method INITIALIZE
        TOTALFREQ = 0
    method REDUCE (Pair p, counts [c1, c2, c3, … ])
        // Assumes Pair (w, *) is sorted ahead of every Pair (w, u) and that all pairs for w reach the same reducer
        sum = 0
        for all count c in counts [c1, c2, c3, … ] do
            sum = sum + c
        if ( p.getNeighbor() == “*” ) then // Neighbor is second element of the pair
            TOTALFREQ = sum
        else
            EMIT ( Pair p, sum / TOTALFREQ )
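The Pair mapper and reducer can be simulated end to end in a single process. This is a hedged Python sketch, not the Java implementation from the slides: `neighbors`, `pair_map`, `shuffle` and `pair_reduce` are hypothetical names, the neighbor window is an assumption, and `shuffle` only stands in for the framework's sort.

```python
from collections import defaultdict

RECORDS = [
    "18 34 56 29 12 34 56 92 29 34 12",
    "92 29 18 12 34 79 29 56 12 34 18",
]


def neighbors(terms, i):
    # Assumed window: terms after position i, up to the next occurrence of terms[i].
    window = []
    for u in terms[i + 1:]:
        if u == terms[i]:
            break
        window.append(u)
    return window


def pair_map(records):
    # In-mapper combining: tally each pair and the (w, "*") marginal before emitting.
    h = defaultdict(int)
    for record in records:
        terms = record.split()
        for i, w in enumerate(terms):
            for u in neighbors(terms, i):
                h[(w, u)] += 1
                h[(w, "*")] += 1
    return dict(h)


def shuffle(map_output):
    # Stand-in for the framework sort: within each w, the "*" pair comes first.
    def order(item):
        (w, u), _ = item
        return (w, u != "*", u)
    return [(pair, [count]) for pair, count in sorted(map_output.items(), key=order)]


def pair_reduce(sorted_groups):
    total_freq = 0
    out = {}
    for (w, u), counts in sorted_groups:
        s = sum(counts)
        if u == "*":
            total_freq = s          # marginal count for w
        else:
            out[(w, u)] = s / total_freq
    return out


map_output = pair_map(RECORDS)
freqs = pair_reduce(shuffle(map_output))
print(len(map_output), len(freqs))
```

On the two sample records this gives 47 map output keys and 40 relative frequencies, in line with the Map Output Records and Reduce Output Records counters listed for the Pair approach.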
IMPLEMENTATION - MAPPER
IMPLEMENTATION – MAPPER CONTD…
IMPLEMENTATION - REDUCER
PAIR APPROACH – MAP INPUT RECORDS
18 34 56 29 12 34 56 92 29 34 12
92 29 18 12 34 79 29 56 12 34 18
PAIR APPROACH - RESULT
STRIPE APPROACH
IMPLEMENTATION
PSEUDO CODE – MAPPER
Class MAPPER
    method INITIALIZE
        H = new Associative Array
    method MAP (docid a, doc d)
        for all term w in doc d do
            S = H { w } // Initialize a new Associative Array if H {w} is NULL
            for all term u in Neighbors(w) do
                S { u } = S { u } + count 1 // Tally counts
            H { w } = S
    method CLOSE
        for all term t in H do
            EMIT ( term t, stripe H { t } )
PSEUDO CODE - REDUCER
Class REDUCER
    method REDUCE (term t, stripes [H1, H2, H3, … ])
        TOTALFREQ = 0 // Reset for every term t so counts do not carry over between keys
        Hf = new Associative Array
        for all stripe H in stripes [H1, H2, H3, … ] do
            for all term w in stripe H do
                Hf { w } = Hf { w } + H { w } // Hf = Hf + H ; Element-wise addition
                TOTALFREQ = TOTALFREQ + count H { w }
        for all term w in stripe Hf do
            Hf { w } = Hf { w } / TOTALFREQ
        EMIT ( term t, stripe Hf )
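A hedged Python sketch of the Stripe approach on the same input. The names `stripe_map` and `stripe_reduce` are hypothetical and the neighbor window is an assumption. The accumulator and the total are created inside the reduce call, so each term is normalized independently of the others.

```python
from collections import Counter

RECORDS = [
    "18 34 56 29 12 34 56 92 29 34 12",
    "92 29 18 12 34 79 29 56 12 34 18",
]


def stripe_map(records):
    # One stripe (neighbor -> count) per term, combined inside the mapper.
    h = {}
    for record in records:
        terms = record.split()
        for i, w in enumerate(terms):
            stripe = h.setdefault(w, Counter())
            for u in terms[i + 1:]:
                if u == w:          # assumed window: stop at the next occurrence of w
                    break
                stripe[u] += 1
    return h


def stripe_reduce(term, stripes):
    hf = Counter()
    for stripe in stripes:
        hf.update(stripe)           # element-wise addition of counts
    total_freq = sum(hf.values())
    return term, {u: c / total_freq for u, c in hf.items()}


stripes = stripe_map(RECORDS)
result = dict(stripe_reduce(t, [s]) for t, s in stripes.items())
print(len(stripes), result["79"])
```

Seven stripes leave the mapper, one per distinct term, which matches the Map Output Records counter for the Stripe approach.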
IMPLEMENTATION - MAPPER
IMPLEMENTATION – MAPPER CONTD…
IMPLEMENTATION - REDUCER
IMPLEMENTATION – REDUCER CONTD…
STRIPE APPROACH – MAP INPUT RECORDS
18 34 56 29 12 34 56 92 29 34 12
92 29 18 12 34 79 29 56 12 34 18
STRIPE APPROACH - RESULT
HYBRID APPROACH
IMPLEMENTATION
PSEUDO CODE – MAPPER
Class MAPPER
    method INITIALIZE
        H = new Associative Array
    method MAP (docid a, doc d)
        for all term w in doc d do
            for all term u in Neighbors(w) do
                H { Pair (w, u) } = H { Pair (w, u) } + count 1 // Tally counts
    method CLOSE
        for all Pair (w, u) in H do
            EMIT ( Pair (w, u), count H { Pair (w, u) } )
PSEUDO CODE - REDUCER
Class REDUCER
    method INITIALIZE
        TOTALFREQ = 0
        Hf = new Associative Array
        PREVKEY = “”
    method REDUCE (Pair p, counts [C1, C2, C3, … ])
        sum = 0
        for all count c in counts [ C1, C2, C3, … ] do
            sum = sum + c
        if ( PREVKEY <> p.getKey( ) ) then
            if ( PREVKEY <> “” ) then // Nothing to emit before the first key
                EMIT ( PREVKEY, Hf / TOTALFREQ ) // Element-wise divide
            Hf = new Associative Array
            TOTALFREQ = 0
        TOTALFREQ = TOTALFREQ + sum
        Hf { p.getNeighbor( ) } = Hf { p.getNeighbor( ) } + sum
        PREVKEY = p.getKey( )
    method CLOSE // for the remaining last key
        EMIT ( PREVKEY, Hf / TOTALFREQ ) // Element-wise divide
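The key-change logic of the Hybrid reducer can be sketched in Python as below. `HybridReducer` and the sample groups are hypothetical; the input is assumed to arrive sorted by pair, as the shuffle would deliver it. The sketch skips the emit on the very first call, when no key has been seen yet.

```python
class HybridReducer:
    """Receives sorted (w, u) pairs and emits one normalized stripe per w."""

    def __init__(self):
        self.prev_key = None
        self.hf = {}
        self.total_freq = 0
        self.emitted = []

    def _emit_stripe(self):
        # Element-wise divide of the accumulated stripe by its total.
        stripe = {u: c / self.total_freq for u, c in self.hf.items()}
        self.emitted.append((self.prev_key, stripe))

    def reduce(self, pair, counts):
        w, u = pair
        if self.prev_key is not None and w != self.prev_key:
            self._emit_stripe()              # key changed: flush the finished stripe
            self.hf, self.total_freq = {}, 0
        s = sum(counts)
        self.total_freq += s
        self.hf[u] = self.hf.get(u, 0) + s
        self.prev_key = w

    def close(self):
        if self.prev_key is not None:        # flush the remaining last key
            self._emit_stripe()


# Hypothetical reducer input, already sorted by pair.
groups = [(("a", "b"), [2]), (("a", "c"), [1, 1]), (("b", "a"), [3])]
reducer = HybridReducer()
for pair, counts in groups:
    reducer.reduce(pair, counts)
reducer.close()
print(reducer.emitted)
```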
IMPLEMENTATION - MAPPER
IMPLEMENTATION – MAPPER CONTD…
IMPLEMENTATION - REDUCER
IMPLEMENTATION – REDUCER CONTD …
IMPLEMENTATION – REDUCER CONTD …
HYBRID APPROACH – MAP INPUT RECORDS
18 34 56 29 12 34 56 92 29 34 12
92 29 18 12 34 79 29 56 12 34 18
HYBRID APPROACH - RESULT
MAP-REDUCE JOB PERFORMANCE COMPARISON WITH COUNTERS
Description                             Pair Approach  Stripe Approach  Hybrid Approach
Map Input Records                                   2                2                2
Map Output Records                                 47                7               40
Map Output Bytes                                  463              416              400
Map Output Materialized Bytes                     563              436              486
Input-split Bytes                                 147              149              149
Combine Input Records                               0                0                0
Combine Output Records                              0                0                0
Reduce Input Groups                                47                7               40
Reduce Shuffle Bytes                              563              436              486
Reduce Input Records                               47                7               40
Reduce Output Records                              40                7                7
Shuffled Maps                                       1                1                1
GC Time Elapsed (ms)                              140              175              129
CPU Time Spent (ms)                              1540             1530             1700
Physical Memory (bytes) Snapshot            357101568        354013184        352686080
Virtual Memory (bytes) Snapshot            3022008320       3019862016       3020025856
Total Committed Heap Usage (bytes)          226365440        226365440        226365440
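The shuffle volumes can be put on a common scale by expressing each approach relative to the Pair approach. A small Python sketch using only the Reduce Shuffle Bytes row from the counters; with just two input records the percentages are illustrative and should not be read as a general ranking.

```python
# Reduce Shuffle Bytes, copied from the counters table.
shuffle_bytes = {"Pair": 563, "Stripe": 436, "Hybrid": 486}

baseline = shuffle_bytes["Pair"]
savings = {
    name: 100 * (baseline - value) / baseline
    for name, value in shuffle_bytes.items()
}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% fewer shuffle bytes than Pair")
```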
LOG ANALYSIS WITH SPARK
LOG ANALYSIS
• Log data is a definitive record of what's happening in every business, organization or agency, and it's often an untapped resource when it comes to troubleshooting and supporting broader business objectives.
• 1.5 million log lines per second!
PROBLEM DESCRIPTION
• Web-access log data from Splunk
• Three log files (~12 MB)
Features
• Extract top selling products
• Extract top selling product categories
• Extract top client IPs visiting the e-commerce site
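All three features reduce to count-and-rank over fields extracted from each log line. A minimal plain-Python sketch, not the Scala/Spark implementation from the slides. The log line layout, the `productId`, `categoryId` and `action` parameter names, and treating `action=purchase` as a sale are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical log lines; the layout of the real files is not shown in the text.
LINES = [
    '10.0.0.1 - - [01/Jan/2016:10:00:00] "GET /cart.do?action=purchase&productId=P-1&categoryId=TOYS HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2016:10:00:05] "GET /cart.do?action=purchase&productId=P-1&categoryId=TOYS HTTP/1.1" 200 498',
    '10.0.0.1 - - [01/Jan/2016:10:00:09] "GET /cart.do?action=purchase&productId=P-2&categoryId=BOOKS HTTP/1.1" 200 530',
    '10.0.0.1 - - [01/Jan/2016:10:00:12] "GET /product.screen?productId=P-2&categoryId=BOOKS HTTP/1.1" 200 611',
]


def param(line, name):
    # Value of a query-string parameter, or None if the line does not carry it.
    match = re.search(r'[?&]' + re.escape(name) + r'=([^&\s"]+)', line)
    return match.group(1) if match else None


def top(values, n=10):
    # Count occurrences and return the n most frequent (value, count) pairs.
    return Counter(v for v in values if v is not None).most_common(n)


purchases = [line for line in LINES if param(line, "action") == "purchase"]
top_products = top(param(line, "productId") for line in purchases)
top_categories = top(param(line, "categoryId") for line in purchases)
top_ips = top(line.split()[0] for line in LINES)

print(top_products, top_categories, top_ips, sep="\n")
```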
Sample Data
SPARK, SCALA CONFIGURATION IN ECLIPSE
• Download Scala IDE from http://scala-ide.org/download/sdk.html for Linux 64 bit
• Open the Scala IDE
• Create a new Maven Project
• Configure the pom.xml file
• Run Maven clean, then Maven install
• Set the Scala Installation to Scala 2.10.4 from Project -> Scala -> Set Installation
LOG ANALYSIS - SCALA IMPLEMENTATION
• Add a new Scala Object to the src directory of the project
CREATING & EXECUTING THE .JAR FILE
• Open a Linux terminal
• Go to the project directory and run mvn clean, then mvn package, to create the .jar file
• If needed, make the .jar readable by the submitting user ( chmod 644 filename.jar ); spark-submit only reads the file, so execute permission is not required
• Run the .jar file by providing the input and output directories as arguments
spark-submit --class cs522.sparkproject.LogAnalyzer $LOCAL_DIR/spark/sparkproject-0.0.1-SNAPSHOT.jar $HDFS_DIR/spark/input $HDFS_DIR/spark/output
LOG ANALYSIS – RESULT (TOP PRODUCT IDs)
LOG ANALYSIS – RESULT (TOP PRODUCT CATEGORIES)
LOG ANALYSIS – RESULT (TOP CLIENT IPs)
DEMO
Questions & Answers Session

办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 

Crystal Ball Event Prediction and Log Analysis with Hadoop MapReduce and Spark

  • 1. CRYSTAL BALL EVENT PREDICTION (MAPREDUCE) & LOG ANALYSIS (SPARK) By: Jivan Nepali, 985095 Big Data (CS522) Project
  • 2. PRESENTATION OVERVIEW
      Pair Approach
      • Pseudo-code for Pair Approach
      • Java Implementation for Pair Approach
      • Pair Approach Result
      Stripe Approach
      • Pseudo-code for Stripe Approach
      • Java Implementation for Stripe Approach
      • Stripe Approach Result
      Hybrid Approach
      • Pseudo-code for Hybrid Approach
      • Java Implementation for Hybrid Approach
      • Hybrid Approach Result
      • Comparison of three Approaches
      Spark
      • LogAnalysis – Problem Description
      • LogAnalysis – Expected Outcomes
      • LogAnalysis – Scala Implementation
      • LogAnalysis – Results
  • 4. PSEUDO CODE – MAPPER
      Class MAPPER
        method INITIALIZE
          H = new Associative Array
        method MAP (docid a, doc d)
          for all term w in doc d do
            for all term u in Neighbors(w) do
              H { Pair (w, u) } = H { Pair (w, u) } + count 1   // Tally counts
              H { Pair (w, *) } = H { Pair (w, *) } + count 1   // Tally counts for *
        method CLOSE
          for all Pair (w, u) in H do
            EMIT ( Pair (w, u), count H { Pair (w, u) } )
  • 5. PSEUDO CODE - REDUCER
      Class REDUCER
        method INITIALIZE
          TOTALFREQ = 0
        method REDUCE (Pair p, counts [c1, c2, c3, … ])
          sum = 0
          for all count c in counts [c1, c2, c3, … ] do
            sum = sum + c
          if ( p.getNeighbor() == “*” ) then   // Neighbor is the second element of the pair
            TOTALFREQ = sum
          else
            EMIT ( Pair p, sum / TOTALFREQ )
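The pair mapper and reducer above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not the Hadoop implementation from the slides. It rests on two assumptions that the transcript does not state: Neighbors(w) is taken to be the terms that follow w up to the next occurrence of w, and a pair is encoded as the string "w,u" so that "w,*" sorts before every "w,u" for alphanumeric terms. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the pair approach; no Hadoop classes are used. */
final class PairApproachSketch {

    /** Assumed neighborhood: terms after position i, up to the next occurrence of terms[i]. */
    static List<String> neighbors(String[] terms, int i) {
        List<String> result = new ArrayList<>();
        for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
            result.add(terms[j]);
        }
        return result;
    }

    /** Mapper with in-mapper combining: tallies (w,u) and (w,*) over all records. */
    static TreeMap<String, Integer> map(List<String> records) {
        // The sorted map stands in for the shuffle/sort phase.
        TreeMap<String, Integer> h = new TreeMap<>();
        for (String record : records) {
            String[] terms = record.trim().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                for (String u : neighbors(terms, i)) {
                    h.merge(terms[i] + "," + u, 1, Integer::sum); // tally (w,u)
                    h.merge(terms[i] + ",*", 1, Integer::sum);    // tally (w,*)
                }
            }
        }
        return h;
    }

    /** Reducer: (w,*) arrives first and sets the total; each later (w,u) is divided by it. */
    static Map<String, Double> reduce(TreeMap<String, Integer> sortedCounts) {
        Map<String, Double> out = new TreeMap<>();
        int totalFreq = 0;
        for (Map.Entry<String, Integer> e : sortedCounts.entrySet()) {
            if (e.getKey().endsWith(",*")) {
                totalFreq = e.getValue();
            } else {
                out.put(e.getKey(), (double) e.getValue() / totalFreq);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "18 34 56 29 12 34 56 92 29 34 12",
                "92 29 18 12 34 79 29 56 12 34 18");
        reduce(map(records)).forEach(
                (pair, f) -> System.out.printf("(%s)\t%.4f%n", pair, f));
    }
}
```

Under the assumed neighborhood definition, the two sample records yield 47 mapper keys (40 pairs plus 7 "*" totals) and 40 relative frequencies, which agrees with the Map Output Records and Reduce Output Records counters reported for the pair approach on slide 31.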
  • 9. PAIR APPROACH – MAP INPUT RECORDS
      18 34 56 29 12 34 56 92 29 34 12
      92 29 18 12 34 79 29 56 12 34 18
  • 10. PAIR APPROACH - RESULT
  • 12. PSEUDO CODE – MAPPER
      Class MAPPER
        method INITIALIZE
          H = new Associative Array
        method MAP (docid a, doc d)
          for all term w in doc d do
            S = H { w }   // Initialize a new Associative Array if H { w } is NULL
            for all term u in Neighbors(w) do
              S { u } = S { u } + count 1   // Tally counts
            H { w } = S
        method CLOSE
          for all term t in H do
            EMIT ( term t, stripe H { t } )
  • 13. PSEUDO CODE - REDUCER
      Class REDUCER
        method REDUCE (term t, stripes [H1, H2, H3, … ])
          TOTALFREQ = 0                // Reset for every term t, otherwise totals leak across keys
          Hf = new Associative Array
          for all stripe H in stripes [H1, H2, H3, … ] do
            for all term w in stripe H do
              Hf { w } = Hf { w } + H { w }   // Hf = Hf + H ; Element-wise addition
              TOTALFREQ = TOTALFREQ + count H { w }
          for all term w in stripe Hf do
            Hf { w } = Hf { w } / TOTALFREQ
          EMIT ( term t, stripe Hf )
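The stripe mapper and reducer can be sketched in the same in-memory style. This is an illustrative plain-Java version rather than the Hadoop code shown on the implementation slides; it assumes, as the transcript does not define it, that Neighbors(w) means the terms following w up to its next occurrence. The running total and the merged stripe are local to each reduce call, so nothing carries over from one term to the next. All names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the stripe approach; no Hadoop classes are used. */
final class StripeApproachSketch {

    /** Mapper: builds one stripe (neighbor -> count) per term across all records. */
    static Map<String, Map<String, Integer>> map(List<String> records) {
        Map<String, Map<String, Integer>> h = new TreeMap<>();
        for (String record : records) {
            String[] terms = record.trim().split("\\s+");
            for (int i = 0; i < terms.length; i++) {
                Map<String, Integer> stripe =
                        h.computeIfAbsent(terms[i], k -> new TreeMap<>());
                // Assumed neighborhood: terms after i until terms[i] occurs again.
                for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                    stripe.merge(terms[j], 1, Integer::sum);
                }
            }
        }
        return h;
    }

    /** Reducer for one term: element-wise sum of its stripes, then divide by the total. */
    static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> hf = new TreeMap<>();
        int totalFreq = 0; // local, so it starts at zero for every term
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                hf.merge(e.getKey(), e.getValue(), Integer::sum);
                totalFreq += e.getValue();
            }
        }
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : hf.entrySet()) {
            out.put(e.getKey(), (double) e.getValue() / totalFreq);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "18 34 56 29 12 34 56 92 29 34 12",
                "92 29 18 12 34 79 29 56 12 34 18");
        // With a single mapper there is exactly one stripe per term.
        map(records).forEach((term, stripe) ->
                System.out.println(term + "\t" + reduce(List.of(stripe))));
    }
}
```

Because each term travels as a single associative array, the mapper emits one record per distinct term instead of one per pair, which is the trade-off the counter comparison on slide 31 measures.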
  • 18. STRIPE APPROACH – MAP INPUT RECORDS
      18 34 56 29 12 34 56 92 29 34 12
      92 29 18 12 34 79 29 56 12 34 18
  • 21. PSEUDO CODE – MAPPER
      Class MAPPER
        method INITIALIZE
          H = new Associative Array
        method MAP (docid a, doc d)
          for all term w in doc d do
            for all term u in Neighbors(w) do
              H { Pair (w, u) } = H { Pair (w, u) } + count 1   // Tally counts
        method CLOSE
          for all Pair (w, u) in H do
            EMIT ( Pair (w, u), count H { Pair (w, u) } )
  • 22. PSEUDO CODE - REDUCER
      Class REDUCER
        method INITIALIZE
          TOTALFREQ = 0
          Hf = new Associative Array
          PREVKEY = “”
        method REDUCE (Pair p, counts [C1, C2, C3, … ])
          sum = 0
          for all count c in counts [C1, C2, C3, … ] do
            sum = sum + c
          if ( PREVKEY <> “” and PREVKEY <> p.getKey( ) ) then   // Skip the emit before the first key
            EMIT ( PREVKEY, Hf / TOTALFREQ )   // Element-wise divide
            Hf = new Associative Array
            TOTALFREQ = 0
  • 23. PSEUDO CODE – REDUCER CONTD…
          TOTALFREQ = TOTALFREQ + sum
          Hf { p.getNeighbor( ) } = Hf { p.getNeighbor( ) } + sum
          PREVKEY = p.getKey( )
        method CLOSE   // for the remaining last key
          EMIT ( PREVKEY, Hf / TOTALFREQ )   // Element-wise divide
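The hybrid reducer's state handling (collect pairs into a stripe, flush when the key changes, flush once more at close) is easy to get subtly wrong, so a small sketch may help. The class below is a hypothetical plain-Java stand-in for the reducer, not the Hadoop implementation from the slides. It assumes pairs arrive sorted by key with counts already summed, and it uses null rather than an empty string to mean "no previous key yet".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the hybrid reducer: pairs come in, stripes go out. */
final class HybridReducerSketch {
    private final Map<String, Map<String, Double>> emitted = new LinkedHashMap<>();
    private Map<String, Integer> hf = new TreeMap<>();
    private int totalFreq = 0;
    private String prevKey = null; // null means no key has been seen yet

    /** Called once per pair, in sorted key order, with the summed count for that pair. */
    void reduce(String key, String neighbor, int sum) {
        if (prevKey != null && !prevKey.equals(key)) {
            flush(); // the key changed, so the previous stripe is complete
        }
        totalFreq += sum;
        hf.merge(neighbor, sum, Integer::sum);
        prevKey = key;
    }

    /** Called after the last pair to emit the stripe still held in memory. */
    void close() {
        if (prevKey != null) {
            flush();
            prevKey = null;
        }
    }

    /** Emits the current stripe divided element-wise by its total, then resets the state. */
    private void flush() {
        Map<String, Double> stripe = new TreeMap<>();
        for (Map.Entry<String, Integer> e : hf.entrySet()) {
            stripe.put(e.getKey(), (double) e.getValue() / totalFreq);
        }
        emitted.put(prevKey, stripe);
        hf = new TreeMap<>();
        totalFreq = 0;
    }

    Map<String, Map<String, Double>> output() {
        return emitted;
    }

    public static void main(String[] args) {
        HybridReducerSketch reducer = new HybridReducerSketch();
        // Hypothetical pair counts, already sorted by key.
        reducer.reduce("a", "b", 3);
        reducer.reduce("a", "c", 1);
        reducer.reduce("b", "a", 2);
        reducer.close();
        reducer.output().forEach((term, stripe) -> System.out.println(term + "\t" + stripe));
    }
}
```

The output has one stripe per key rather than one record per pair, which matches the hybrid approach's Reduce Output Records counter of 7 against 40 reduce input records.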
  • 29. HYBRID APPROACH – MAP INPUT RECORDS
      18 34 56 29 12 34 56 92 29 34 12
      92 29 18 12 34 79 29 56 12 34 18
  • 31. MAP-REDUCE JOB PERFORMANCE COMPARISON WITH COUNTERS
      Description                           Pair Approach   Stripe Approach   Hybrid Approach
      Map Input Records                                 2                 2                 2
      Map Output Records                               47                 7                40
      Map Output Bytes                                463               416               400
      Map Output Materialized Bytes                   563               436               486
      Input-split Bytes                               147               149               149
      Combine Input Records                             0                 0                 0
      Combine Output Records                            0                 0                 0
      Reduce Input Groups                              47                 7                40
      Reduce Shuffle Bytes                            563               436               486
      Reduce Input Records                             47                 7                40
      Reduce Output Records                            40                 7                 7
      Shuffled Maps                                     1                 1                 1
      GC Time Elapsed (ms)                            140               175               129
      CPU Time Spent (ms)                            1540              1530              1700
      Physical Memory (bytes) Snapshot          357101568         354013184         352686080
      Virtual Memory (bytes) Snapshot          3022008320        3019862016        3020025856
      Total Committed Heap Usage (bytes)        226365440         226365440         226365440
  • 33. LOG ANALYSIS
      • Log data is a definitive record of what's happening in every business, organization or agency, and it's often an untapped resource when it comes to troubleshooting and supporting broader business objectives.
      • 1.5 million log lines per second!
  • 34. PROBLEM DESCRIPTION
      • Web-access log data from Splunk
      • Three log files (~12 MB)
      Features
      • Extract top-selling products
      • Extract top-selling product categories
      • Extract top client IPs visiting the e-commerce site
      Sample Data
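All three features in the problem description reduce to the same pattern: extract one field per log line, count occurrences, and keep the most frequent values. The sketch below shows that pattern in plain Java streams; it is not the Scala/Spark implementation from the slides. The sample data is not reproduced in the transcript, so the field layout is an assumption: the client IP is taken to be the first whitespace-separated token, and product and category IDs are assumed to appear as productId= and categoryId= query parameters. Whether a request counts as a sale is left out; the sketch simply counts occurrences.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Sketch of top-N extraction over access-log lines; the field layout is assumed. */
final class LogTopNSketch {
    // Assumed query-parameter names; the real log format is not shown in the transcript.
    static final Pattern PRODUCT = Pattern.compile("productId=([^&\\s\"]+)");
    static final Pattern CATEGORY = Pattern.compile("categoryId=([^&\\s\"]+)");

    /** Assumes the client IP is the first whitespace-separated token of the line. */
    static Optional<String> clientIp(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed.split("\\s+")[0]);
    }

    /** Returns the first capture group of the pattern, if the line contains a match. */
    static Optional<String> firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Counts the extracted field over all lines and returns the n most frequent values. */
    static List<Map.Entry<String, Long>> topN(
            List<String> lines, Function<String, Optional<String>> extract, int n) {
        Map<String, Long> counts = lines.stream()
                .map(extract)
                .flatMap(Optional::stream) // drop lines without the field
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey())) // ties broken by name
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical log lines in the assumed layout.
        List<String> lines = List.of(
                "10.0.0.1 - - \"GET /cart.do?productId=P-1&categoryId=TOYS HTTP/1.1\" 200",
                "10.0.0.2 - - \"GET /cart.do?productId=P-1&categoryId=TOYS HTTP/1.1\" 200",
                "10.0.0.1 - - \"GET /cart.do?productId=P-2&categoryId=BOOKS HTTP/1.1\" 200");
        System.out.println("Top IPs:        " + topN(lines, LogTopNSketch::clientIp, 2));
        System.out.println("Top products:   " + topN(lines, l -> firstGroup(PRODUCT, l), 2));
        System.out.println("Top categories: " + topN(lines, l -> firstGroup(CATEGORY, l), 2));
    }
}
```

In Spark the same steps would typically be expressed as a map to (field, 1) pairs followed by a keyed reduction and a sort, so the per-field extractors are the only part that depends on the log layout.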
  • 35. SPARK, SCALA CONFIGURATION IN ECLIPSE • Download Scala IDE from http://scala-ide.org/download/sdk.html for Linux 64 bit
  • 36. SPARK, SCALA CONFIGURATION IN ECLIPSE
      • Open the Scala IDE
      • Create a new Maven Project
      • Configure the pom.xml file
      • Run maven clean, then maven install
      • Set the Scala Installation to Scala 2.10.4 from Project -> Scala -> Set Installation
  • 37. LOG ANALYSIS - SCALA IMPLEMENTATION • Add new Scala Object to the src directory of the project
  • 38. LOG ANALYSIS - SCALA IMPLEMENTATION
  • 39. LOG ANALYSIS - SCALA IMPLEMENTATION
  • 40. LOG ANALYSIS - SCALA IMPLEMENTATION
  • 41. CREATING & EXECUTING THE .JAR FILE
      • Open a Linux terminal
      • Go to the project directory and run mvn clean, then mvn package, to create the .jar file
      • Change the permission of the .jar to executable ( sudo chmod 777 filename.jar )
      • Run the .jar file, providing the input and output directories as arguments:
        spark-submit --class cs522.sparkproject.LogAnalyzer $LOCAL_DIR/spark/sparkproject-0.0.1-SNAPSHOT.jar $HDFS_DIR/spark/input $HDFS_DIR/spark/output
  • 42. LOG ANALYSIS – RESULT (TOP PRODUCT IDs)
  • 43. LOG ANALYSIS – RESULT (TOP PRODUCT CATEGORIES)
  • 44. LOG ANALYSIS – RESULT (TOP CLIENT IPs)
  • 45. DEMO