Big Data – Project Presentation
By:
Yonas Gidey -985054
Submitted to
Professor Prem Nair
April 25, 2017
Relative Frequency-Project
1. Pseudo code for Pair Approach Algorithm
2. Java code for Pair Approach Algorithm
3. Result of Pair Approach Algorithm
4. Pseudo code for Stripe Approach Algorithm
5. Java code for Stripe Approach
6. Result of Stripe Approach Algorithm
7. Pseudo code for Hybrid Approach Algorithm
8. Java code for Hybrid Approach Algorithm
9. Result of Hybrid Approach Algorithm
10. Comparison
11. Spark Project
Steps for implementing the Pairs approach
I. For each line passed to the map function, split on spaces to
create a String array.
II. Construct two nested loops.
III. The outer loop iterates over each word in the array, and the
inner loop iterates over the “neighbors” of the current word.
IV. The number of iterations of the inner loop is determined by the
size of the “window” used to capture neighbors of the current word.
V. At the bottom of each inner-loop iteration, emit a WordPair
object (the current word on the left and the neighbor word on
the right) as the key, and a count of one as the value.
VI. The Reducer for the Pairs implementation sums all of the
counts for the given WordPair key.
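The steps above can be sketched in plain Java without any Hadoop dependency. This is only an illustration, not the project's actual mapper and reducer: the class name `PairsSketch`, the `"w,n"` string key standing in for the WordPair object, and the choice to look `window` positions to both sides of the current word are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

class PairsSketch {
    // Counts (word, neighbor) pairs for one line of text.
    // Assumption: "window" is the number of positions inspected on each side.
    static Map<String, Integer> countPairs(String line, int window) {
        String[] words = line.trim().split("\\s+");
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < words.length; i++) {            // outer loop: current word
            int start = Math.max(0, i - window);
            int end = Math.min(words.length - 1, i + window);
            for (int j = start; j <= end; j++) {             // inner loop: neighbors
                if (j == i) continue;                        // a word is not its own neighbor
                // The map step would emit ((w, n), 1); merge() stands in for the
                // reducer by summing the ones that share a key.
                counts.merge(words[i] + "," + words[j], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("a b a c", 1));
    }
}
```

In a real job the summing would happen in the reducer after the shuffle; it is folded into one method here only to keep the sketch short.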
1. Pseudo code for PAIR Approach
Class Mapper {
  method map(inKey, text) {
    for each word w in text
      for each neighbour n of word w
        emit(Pair(w, n), 1)
        emit(Pair(w, *), 1)      // marginal count for w
  }
}
Class Reducer {
  s = 0                          // marginal for the current w, kept across calls
  method reduce(Pair p; counts [c1; c2; …]) {
    count = 0
    for all count c in counts [c1; c2; …] do
      count = count + c
    if p is Pair(w, *) then
      s = count                  // Pair(w, *) must sort before every Pair(w, u)
    else
      emit(Pair p; count / s)
  }
}
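The division by the marginal count can be sketched in plain Java as follows. This is a hedged illustration rather than the project's code: it assumes the pair counts have already been summed, that keys are strings of the form `"w,u"` with a marginal entry `"w,*"` per word, and that words are alphanumeric and contain no comma, so that `"w,*"` sorts ahead of the other keys of the same word. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class PairRelativeFrequencySketch {
    // Walks the keys in sorted order, as a reducer would receive them.
    // Each "w,*" entry carries the total for w and is seen before the "w,u" entries.
    static Map<String, Double> relativeFrequencies(TreeMap<String, Integer> counts) {
        Map<String, Double> result = new LinkedHashMap<>();
        int marginal = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().endsWith(",*")) {
                marginal = e.getValue();                 // remember the total for this word
            } else {
                result.put(e.getKey(), e.getValue() / (double) marginal);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("a,*", 4);
        counts.put("a,b", 1);
        counts.put("a,c", 3);
        counts.put("b,*", 2);
        counts.put("b,a", 2);
        System.out.println(relativeFrequencies(counts));
    }
}
```

In a distributed run this only works if all pairs with the same left word reach the same reducer, which normally requires a partitioner keyed on the left word alone.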
2. Java code for PAIR approach
Hadoop Commands
• #!/bin/sh
• hadoop fs -mkdir /user/cloudera/relative-frequency /user/cloudera/relative-frequency/pair /user/cloudera/relative-frequency/pair/input
• hadoop fs -put files/input.txt /user/cloudera/relative-frequency/pair/input
• hadoop fs -rm -r /user/cloudera/relative-frequency/pair/output
• hadoop jar files/pairsrf.jar project.crystalBall.pairsApproachAlgorithm.PairRelativeFrequencyDriver /user/cloudera/relative-frequency/pair/input /user/cloudera/relative-frequency/pair/output
• hadoop fs -cat /user/cloudera/relative-frequency/pair/output/*
3. Result of PAIR approach
Steps for Stripes implementation
I. The approach is the same as for Pairs, but all of the “neighbor” words are
collected in a HashMap with the neighbor word as the key and an integer
count as the value.
II. When all of the values have been collected for a given word (the bottom
of the outer loop), the word and the HashMap are emitted.
III. The Reducer for the Stripes approach iterates over a collection of maps
and, for each map, iterates over all of its entries, summing the counts
for each neighbor word.
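A minimal plain-Java sketch of the mapper side of these steps is shown here. It is not the project's Hadoop code: the class name `StripesMapperSketch`, the returned list standing in for emitted key/value pairs, and the two-sided `window` are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripesMapperSketch {
    // Returns one (word, stripe) entry per word position, where the stripe
    // maps each neighbor word to how often it occurs inside the window.
    static List<Map.Entry<String, Map<String, Integer>>> mapLine(String line, int window) {
        String[] words = line.trim().split("\\s+");
        List<Map.Entry<String, Map<String, Integer>>> emitted = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            Map<String, Integer> stripe = new TreeMap<>();   // fresh stripe for each word
            int start = Math.max(0, i - window);
            int end = Math.min(words.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j != i) stripe.merge(words[j], 1, Integer::sum);
            }
            // Bottom of the outer loop: emit the word together with its stripe.
            if (!stripe.isEmpty()) emitted.add(new SimpleEntry<>(words[i], stripe));
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("a b a c", 1));
    }
}
```

Compared with the Pairs approach, each word position produces one record instead of one record per neighbor, at the cost of holding a map in memory.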
4. Pseudo code for STRIPE approach
Class Mapper
  method map(docid a; doc d)
    for all term w in doc d do
      H = new AssociativeArray      // one stripe per term w
      for all term u in Neighbors(w) do
        H{u} = H{u} + 1
      emit(term w; stripe H)
Class Reducer
  method reduce(term w; stripes [H1; H2; H3; …])
    Hf = new AssociativeArray
    for all stripe H in stripes [H1; H2; H3; …] do
      sum(Hf; H)                    // element-wise sum
    // Calculate relative frequencies
    count = 0
    Hnew = new AssociativeArray
    for each u in Hf do
      count += Hf{u}
    for each u in Hf do
      Hnew{u} = Hf{u} / count
    emit(term w; stripe Hnew)
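The reducer side, an element-wise sum followed by division by the total, might look like this in plain Java. The sketch assumes the stripes for one term arrive as a list of maps; `StripesReducerSketch` and `reduce` are hypothetical names, not taken from the project.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripesReducerSketch {
    // Merges all stripes received for one term and converts counts to relative frequencies.
    static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> sum = new TreeMap<>();
        int total = 0;
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Integer::sum);   // element-wise sum
                total += e.getValue();                               // running marginal
            }
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sum.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Integer> h1 = new HashMap<>();
        h1.put("b", 1);
        Map<String, Integer> h2 = new HashMap<>();
        h2.put("c", 3);
        System.out.println(reduce(Arrays.asList(h1, h2)));
    }
}
```

Because the whole stripe for a term is available in one reduce call, the marginal is computed locally and no special key ordering is needed.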
5. Java code for Stripe Approach
Hadoop Commands
• hadoop fs -mkdir /user/cloudera/relative-frequency /user/cloudera/relative-frequency/stripe /user/cloudera/relative-frequency/stripe/input
• hadoop fs -put files/input.txt /user/cloudera/relative-frequency/stripe/input
• hadoop fs -rm -r /user/cloudera/relative-frequency/stripe/output
• hadoop jar files/stripesrf.jar project.crystalBall.stripesApproachAlgorithm.StripeRelativeFrequencyDriver /user/cloudera/relative-frequency/stripe/input /user/cloudera/relative-frequency/stripe/output
• hadoop fs -cat /user/cloudera/relative-frequency/stripe/output/*
6. Result of Stripe approach
7. Pseudo code for HYBRID approach
Class Mapper
  method map(inKey, text)
    for each word w in text
      for each neighbour n of word w
        emit(Pair(w, n), 1)
Class Reducer {
  Hf = new AssociativeArray
  last = empty
  method reduce(Pair p(w, u); counts [c1; c2; c3; …]) {
    if last is not empty and last != w then
      emitStripe()               // all pairs for the previous term have been seen
    for all count c in counts [c1; c2; …] do
      Hf{u} = Hf{u} + c          // build the stripe for term w from its pairs
    last = w
  }
  method emitStripe() {
    count = 0
    for all u in Hf do
      count += Hf{u}             // all occurrences for term last
    for all u in Hf do
      Hf{u} = Hf{u} / count      // element-wise division
    emit(term last; stripe Hf)
    clear Hf
  }
  method close() {
    emitStripe()                 // flush the stripe of the final term
  }
}
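The hybrid reducer logic (pairs in, stripes out) can be sketched in plain Java as below. The sketch assumes summed pair counts keyed by `"w,u"` strings, iterated in sorted order so that all pairs with the same left word are adjacent, and words that contain no comma. `HybridReducerSketch` and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class HybridReducerSketch {
    // Collects the pairs of one left word into a stripe and emits the
    // normalized stripe whenever the left word changes.
    static Map<String, Map<String, Double>> reduce(TreeMap<String, Integer> pairCounts) {
        Map<String, Map<String, Double>> output = new LinkedHashMap<>();
        Map<String, Integer> stripe = new TreeMap<>();
        String last = null;
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            String[] parts = e.getKey().split(",", 2);       // parts[0] = w, parts[1] = u
            if (last != null && !last.equals(parts[0])) {
                output.put(last, normalize(stripe));         // previous word is complete
                stripe.clear();
            }
            stripe.merge(parts[1], e.getValue(), Integer::sum);
            last = parts[0];
        }
        if (last != null) output.put(last, normalize(stripe)); // flush the final word
        return output;
    }

    // Divides every count in the stripe by the stripe's total.
    static Map<String, Double> normalize(Map<String, Integer> stripe) {
        int total = 0;
        for (int c : stripe.values()) total += c;
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("a,b", 1);
        counts.put("a,c", 3);
        counts.put("b,a", 2);
        System.out.println(reduce(counts));
    }
}
```

The final flush after the loop plays the role of the reducer's cleanup step: without it, the stripe of the last word would never be emitted.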
8. Java code for HYBRID approach
Hadoop Commands
• hadoop fs -mkdir /user/cloudera/relative-frequency /user/cloudera/relative-frequency/pair-stripe /user/cloudera/relative-frequency/pair-stripe/input
• hadoop fs -put files/input.txt /user/cloudera/relative-frequency/pair-stripe/input
• hadoop fs -rm -r /user/cloudera/relative-frequency/pair-stripe/output
• hadoop jar files/pairsStriperf.jar project.crystalBall.pairsStrpesHybridAlgorithm.PairStripeRelativeFrequencyDriver /user/cloudera/relative-frequency/pair-stripe/input /user/cloudera/relative-frequency/pair-stripe/output
• hadoop fs -cat /user/cloudera/relative-frequency/pair-stripe/output/*
9. Result of HYBRID approach
10. Comparison
Spark Project
Statement of the problem
In this project I analyze Apache access log files using the
Spark framework and the Scala programming language.
1. Analyze the logs collected from a website by examining the
requests coming from users.
2. Analyze the response codes and count how many of them are
“page not found”, “OK”, “Unauthorized”, and so on.
HTTP status codes considered:
200 OK (success)
301 Moved Permanently
401 Unauthorized
403 Forbidden
404 Not Found
500 Internal Server Error
503 Service Unavailable
I processed the log files with Spark and produced the outputs.
Further analysis can be added on demand.
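The counting of response codes can be illustrated with a small plain-Java sketch. This is not the project's Spark/Scala job: it swaps in an in-memory loop for the distributed computation, assumes the logs follow the Apache Common Log Format (status code directly after the quoted request line), and uses made-up sample lines. The class name `StatusCodeCountSketch` is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StatusCodeCountSketch {
    // Assumption: the status code is the three digits that follow the quoted request line.
    private static final Pattern STATUS = Pattern.compile("\"\\s(\\d{3})\\s");

    // Counts how often each HTTP status code occurs; lines without a match are skipped.
    static Map<String, Integer> countStatusCodes(String[] logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = STATUS.matcher(line);
            if (m.find()) counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the project's data set.
        String[] lines = {
            "127.0.0.1 - - [25/Apr/2017:10:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024",
            "127.0.0.1 - - [25/Apr/2017:10:00:01 +0000] \"GET /missing HTTP/1.1\" 404 512",
            "127.0.0.1 - - [25/Apr/2017:10:00:02 +0000] \"GET /about.html HTTP/1.1\" 200 2048"
        };
        System.out.println(countStatusCodes(lines));
    }
}
```

In Spark the same idea would typically be expressed as a map from each line to a (status, 1) pair followed by a per-key sum.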
Scala Code
Details
• Run the Spark job by passing the JAR file, main class name, input location, and
output location with the following terminal commands.
• hdfs dfs -mkdir spark/input
• hdfs dfs -put input spark
• spark-submit --class sparkPackage --master local SparkProject.jar spark/input spark/output
Spark Output
Thank you!
