1. Big Data – Project Presentation
By:
Yonas Gidey -985054
Submitted to
Professor Prem Nair
April 25, 2017
2. Relative Frequency Project
1. Pseudo code for Pair Approach Algorithm
2. Java code for Pair Approach Algorithm
3. Result of Pair Approach Algorithm
4. Pseudo code for Stripe Approach Algorithm
5. Java code for Stripe Approach Algorithm
6. Result of Stripe Approach Algorithm
7. Pseudo code for Hybrid Approach Algorithm
8. Java code for Hybrid Approach Algorithm
9. Result of Hybrid Approach Algorithm
10. Comparison
11. Spark Project
3. Steps for Implementing the Pairs Approach
I. For each line passed to the map function, we split on spaces, creating a String array.
II. We then construct two loops.
III. The outer loop iterates over each word in the array, and the inner loop iterates over the "neighbors" of the current word.
IV. The number of iterations of the inner loop is dictated by the size of our "window" for capturing neighbors of the current word.
V. At the bottom of each inner-loop iteration, we emit a WordPair object (the current word on the left and the neighbor word on the right) as the key, and a count of one as the value.
VI. The Reducer for the Pairs implementation simply sums all of the counts for a given WordPair key.
4. 1. Pseudo code for PAIR Approach
Class Mapper {
  method Map(inKey, text) {
    for each word w in text
      for each neighbour u of word w
        Emit(Pair(w, u), 1)
        Emit(Pair(w, *), 1)      // marginal count for w
  }
}
Class Reducer {
  // keys are partitioned by the left word w and sorted so that
  // (w, *) arrives before any (w, u)
  s = 0
  method Reduce(pair p(w, u), counts [c1, c2, …]) {
    sum = 0
    for all count c in counts [c1, c2, …] do
      sum = sum + c
    if u == * then
      s = sum                    // total number of pairs with left word w
    else
      Emit(pair p, sum / s)      // relative frequency f(u | w)
  }
}
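The pair logic above can be sanity-checked in plain Java, with an in-memory map standing in for the Hadoop shuffle. This is only a sketch: the class and method names are illustrative, not taken from the project's actual Java code.

```java
import java.util.*;

public class PairApproachDemo {
    // Compute relative frequencies f(u | w) = count(w,u) / count(w,*)
    // over a sliding window, mimicking the pair mapper/reducer in memory.
    public static Map<String, Double> relativeFrequencies(String text, int window) {
        Map<String, Integer> pairCounts = new HashMap<>();   // (w,u) -> count
        Map<String, Integer> marginals  = new HashMap<>();   // w -> count(w,*)
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j == i) continue;
                String key = words[i] + " " + words[j];
                pairCounts.merge(key, 1, Integer::sum);      // emit((w,u), 1)
                marginals.merge(words[i], 1, Integer::sum);  // emit((w,*), 1)
            }
        }
        Map<String, Double> freq = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            String w = e.getKey().split(" ")[0];
            freq.put(e.getKey(), e.getValue() / (double) marginals.get(w));
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequencies("a b a c", 1));
    }
}
```

For the input "a b a c" with window 1, the word "a" has neighbors b, b, c, so f(b | a) = 2/3 and f(c | a) = 1/3.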
9. Steps for Stripes Implementation
I. The approach is the same as Pairs, but all of the "neighbor" words are collected in a HashMap with the neighbor word as the key and an integer count as the value.
II. When all of the values have been collected for a given word (at the bottom of the outer loop), the word and the HashMap are emitted.
III. The Reducer for the Stripes approach iterates over the collection of maps for a given word and, for each map, iterates over all of its values, summing them element-wise before normalizing.
10. 4. Pseudo code for STRIPE Approach
Class Mapper
  method Map(docid a, doc d)
    for all term w in doc d do
      H = new AssociativeArray
      for all term u in Neighbors(w) do
        H{u} = H{u} + 1
      Emit(term w, stripe H)
Class Reducer
  method Reduce(term w, stripes [H1, H2, H3, …])
    Hf = new AssociativeArray
    for all stripe H in stripes [H1, H2, H3, …] do
      Sum(Hf, H)               // element-wise sum
    // calculate relative frequencies
    count = 0
    Hnew = new AssociativeArray
    for all term u in Hf do
      count = count + Hf{u}
    for all term u in Hf do
      Hnew{u} = Hf{u} / count
    Emit(term w, stripe Hnew)
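The stripe logic can likewise be simulated in plain Java, building one neighbor-count map per word and normalizing it at the end. Again a sketch only; the class name is illustrative.

```java
import java.util.*;

public class StripeApproachDemo {
    // Build one stripe (neighbour -> count) per word over a sliding window,
    // then normalise each stripe to relative frequencies, mirroring the
    // stripe mapper/reducer in memory.
    public static Map<String, Map<String, Double>> relativeFrequencies(String text, int window) {
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            Map<String, Integer> h = stripes.computeIfAbsent(words[i], k -> new HashMap<>());
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) h.merge(words[j], 1, Integer::sum);  // H{u} = H{u} + 1
            }
        }
        Map<String, Map<String, Double>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : stripes.entrySet()) {
            int total = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> norm = new HashMap<>();
            e.getValue().forEach((u, c) -> norm.put(u, c / (double) total));
            result.put(e.getKey(), norm);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequencies("a b a c", 1));
    }
}
```

Compared with the pair version, the whole stripe for a word is held (and shuffled) as one unit, which is the key trade-off between the two approaches.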
15. 7. Pseudo Code for HYBRID Approach
Class Mapper
  method Map(inKey, text)
    for each word w in text
      for each neighbour u of word w
        Emit(pair (w, u), 1)
16. Class Reducer {
  Hf = new AssociativeArray
  last = empty
  // pairs are partitioned by the left word w and arrive sorted by w
  method Reduce(pair p(w, u), counts [c1, c2, …]) {
    if last != empty and last != w then
      EmitStripe(last, Hf)        // previous word is complete
      Hf = new AssociativeArray
    last = w
    for all count c in counts [c1, c2, …] do
      Hf{u} = Hf{u} + c           // build the stripe for term w
  }
  method Close() {                // flush the stripe for the final word
    EmitStripe(last, Hf)
  }
  method EmitStripe(term w, stripe H) {
    count = 0
    for all term u in H do
      count = count + H{u}        // all occurrences of term w
    for all term u in H do
      H{u} = H{u} / count         // element-wise division
    Emit(term w, stripe H)
  }
}
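The reducer side of the hybrid approach can be simulated by feeding it pair counts already sorted by the left word, which is what the MapReduce framework's partitioning and sort phase would guarantee. A minimal sketch, with illustrative names:

```java
import java.util.*;

public class HybridApproachDemo {
    // Consume (w,u) -> count pairs sorted by the left word w, accumulate a
    // stripe for the current w, and emit it (normalised) when w changes,
    // mirroring the hybrid reducer's "last word" bookkeeping.
    public static Map<String, Map<String, Double>> reduceSortedPairs(SortedMap<String, Integer> pairCounts) {
        Map<String, Map<String, Double>> out = new HashMap<>();
        Map<String, Integer> hf = new HashMap<>();
        String last = null;
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            String[] wu = e.getKey().split(" ");
            if (last != null && !last.equals(wu[0])) {   // word boundary: flush stripe
                out.put(last, normalise(hf));
                hf = new HashMap<>();
            }
            last = wu[0];
            hf.merge(wu[1], e.getValue(), Integer::sum);
        }
        if (last != null) out.put(last, normalise(hf));  // flush final stripe (Close)
        return out;
    }

    private static Map<String, Double> normalise(Map<String, Integer> h) {
        int total = h.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> norm = new HashMap<>();
        h.forEach((u, c) -> norm.put(u, c / (double) total));
        return norm;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> pc = new TreeMap<>();
        pc.put("a b", 2); pc.put("a c", 1); pc.put("b a", 2);
        System.out.println(reduceSortedPairs(pc));
    }
}
```

The hybrid approach thus keeps the mapper as cheap as Pairs while the reducer only ever holds one stripe in memory, rather than one per document as in Stripes.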
23. Statement of the Problem
In this project I analyze Apache access log files using the Spark framework and the Scala programming language.
1. Analyze the logs collected from a website by examining the requests coming from users.
2. Analyze the response codes and count how many of them are "Not Found", "OK", "Unauthorized", etc.:
• 200 OK
• 301 Moved Permanently
• 401 Unauthorized
• 403 Forbidden
• 404 Not Found
• 500 Internal Server Error
• 503 Service Unavailable
I processed the log files using Spark and produced the outputs; much more analysis can be done on demand.
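The status-code counting step can be sketched in plain Java (the project's actual job is written in Scala on Spark); the class name, the regex, and the sample lines in Apache common log format are illustrative assumptions, not the project's code.

```java
import java.util.*;
import java.util.regex.*;

public class StatusCodeCounter {
    // In Apache common log format, the status code is the first 3-digit
    // field immediately after the quoted request line.
    private static final Pattern STATUS = Pattern.compile("\" (\\d{3}) ");

    // Count how many log lines carry each HTTP status code.
    public static Map<String, Integer> countStatuses(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = STATUS.matcher(line);
            if (m.find()) counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "127.0.0.1 - - [25/Apr/2017:10:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 1024",
            "127.0.0.1 - - [25/Apr/2017:10:00:01 +0000] \"GET /missing HTTP/1.1\" 404 512");
        System.out.println(countStatuses(sample));
    }
}
```

In the Spark job the same idea becomes a map to the status field followed by a count by key over the log RDD.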
26. Details
• Execute the Spark job by handing over the jar file, the main class name, the input location, and the output location via the following terminal commands:
• hdfs dfs -mkdir spark/input
• hdfs dfs -put input spark
• spark-submit --class sparkPackage --master local SparkProject.jar spark/input spark/output