Big Data Project
on
Crystal Ball
Submitted By:
Sushil Sedai(984474)
Suvash Shah(984461)
Submitted to:
Prof. Prem Nair
Pair approach (Mapper) – pseudo code
method map(docid id, doc d)
    for each term w in doc d do
        total = 0;
        for each neighbor u in Neighbor(w) do
            Emit(Pair(w, u), 1);
            total++;
        Emit(Pair(w, *), total);
Pair approach (Mapper) – Java Code
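A minimal plain-Java sketch of the pair mapper logic, not the project's actual Hadoop code. It assumes that Neighbor(w) is every term after w up to the next occurrence of w (an assumption that is consistent with the sample output), and the class name PairMapperSketch and the string record format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PairMapperSketch {
    /** Returns the records a pair mapper would emit for one document, as "key<TAB>value". */
    public static List<String> map(String doc) {
        String[] terms = doc.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            String w = terms[i];
            int total = 0;
            // Assumed window: terms after w, stopping at the next occurrence of w.
            for (int j = i + 1; j < terms.length && !terms[j].equals(w); j++) {
                out.add("(" + w + "," + terms[j] + ")\t1");
                total++;
            }
            // Marginal record: number of neighbors seen for this occurrence of w.
            out.add("(" + w + ",*)\t" + total);
        }
        return out;
    }

    public static void main(String[] args) {
        map("18 29 12 18").forEach(System.out::println);
    }
}
```

In a real job the records would be written through the Hadoop context with a custom pair key rather than collected in a list.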
Pair approach (Reducer) – pseudo code
method reduce(Pair p, Iterable<Int> values)
    if p.secondValue == *
        if p.firstValue is new
            currentValue = p.firstValue;
            marginal = sum(values);
        else
            marginal += sum(values);
    else
        Emit(p, sum(values) / marginal);
Pair approach (Reducer) – Java Code
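A minimal plain-Java sketch of the pair reducer, not the project's actual Hadoop code. It assumes the keys arrive sorted so that (w,*) precedes every (w,u), which a TreeMap over "w,u" strings reproduces here because '*' sorts before digits. The class name PairReducerSketch and the string key format are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairReducerSketch {
    /** grouped: "w,u" -> counts from all mappers. Returns "w,u" -> relative frequency. */
    public static Map<String, Double> reduceAll(TreeMap<String, List<Integer>> grouped) {
        Map<String, Double> out = new LinkedHashMap<>();
        double marginal = 0; // assumed to be set by "w,*" before any "w,u" is seen
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            if (e.getKey().endsWith(",*")) {
                marginal = sum;               // total neighbor count of w
            } else {
                out.put(e.getKey(), sum / marginal);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("10,*", Arrays.asList(2));
        grouped.put("10,12", Arrays.asList(1));
        grouped.put("10,34", Arrays.asList(1));
        System.out.println(reduceAll(grouped));
    }
}
```

Because grouping already hands one key all of its values, the sketch sets the marginal in a single step instead of accumulating it across calls.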
Pair approach – input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
Pair approach – Output (Reducer1)
(10,12) 0.5
(10,34) 0.5
(12,10) 0.09090909090909091
(12,18) 0.09090909090909091
(12,34) 0.36363636363636365
(12,56) 0.18181818181818182
(12,79) 0.09090909090909091
(12,92) 0.18181818181818182
(18,12) 0.25
(18,29) 0.125
(18,34) 0.25
(18,56) 0.125
(18,79) 0.125
(18,92) 0.125
(29,10) 0.06666666666666667
(29,12) 0.26666666666666666
(29,18) 0.06666666666666667
(29,34) 0.26666666666666666
(29,56) 0.13333333333333333
(29,79) 0.06666666666666667
(29,92) 0.13333333333333333
(34,10) 0.08333333333333333
(34,12) 0.25
(34,18) 0.08333333333333333
(34,29) 0.08333333333333333
(34,56) 0.25
(34,79) 0.08333333333333333
(34,92) 0.16666666666666666
(56,10) 0.1
(56,12) 0.3
(56,29) 0.1
(56,34) 0.3
(56,92) 0.2
(92,10) 0.3333333333333333
(92,12) 0.3333333333333333
(92,34) 0.3333333333333333
Pair approach – Output (Reducer2)
(79,12) 0.2
(79,18) 0.2
(79,34) 0.2
(79,56) 0.2
(79,92) 0.2
Stripe approach (Mapper) – pseudo code
method map(docid id, doc d)
    Stripe H;
    for each term w in doc d do
        clear(H);
        for each neighbor u in Neighbor(w) do
            if H.containsKey(u)
                H{u} += 1;
            else
                H.add(u, 1);
        Emit(w, H);
Stripe approach (Mapper) – Java Code
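A minimal plain-Java sketch of the stripe mapper, not the project's actual Hadoop code. It assumes the same neighbor window as before (terms after w up to the next occurrence of w). The class name StripeMapperSketch is hypothetical, and a TreeMap stands in for the Stripe type.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripeMapperSketch {
    /** Returns one (term, stripe) record per term occurrence in the document. */
    public static List<Map.Entry<String, Map<String, Integer>>> map(String doc) {
        String[] terms = doc.trim().split("\\s+");
        List<Map.Entry<String, Map<String, Integer>>> out = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            // A fresh stripe per occurrence plays the role of clear(H).
            Map<String, Integer> h = new TreeMap<>();
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                h.merge(terms[j], 1, Integer::sum); // add 1, or insert with count 1
            }
            out.add(new AbstractMap.SimpleEntry<>(terms[i], h));
        }
        return out;
    }

    public static void main(String[] args) {
        map("18 12 34 12").forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```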
Stripe approach (Reducer) – pseudo code
method reduce(Text key, Stripes [H1, H2, …])
    Stripe H = elementWiseSum(H1, H2, …);
    total = sumValues(H);
    for each Item h in H do
        h.secondValue /= total;
    Emit(key, H);
Stripe approach (Reducer) – Java Code
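A minimal plain-Java sketch of the stripe reducer, not the project's actual Hadoop code: the stripes received for one term are summed element-wise, then each count is divided by the grand total. The class name StripeReducerSketch is hypothetical, and plain maps stand in for the Stripe type.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StripeReducerSketch {
    /** Merges all stripes of one term and converts counts to relative frequencies. */
    public static Map<String, Double> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> sum = new TreeMap<>();
        int total = 0;
        for (Map<String, Integer> stripe : stripes) {
            for (Map.Entry<String, Integer> e : stripe.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Integer::sum); // element-wise sum
                total += e.getValue();
            }
        }
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sum.entrySet()) {
            out.put(e.getKey(), e.getValue() / (double) total);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two stripes that term 18 produces in "18 29 12 34 79 18 56 12 34 92".
        Map<String, Integer> a = new HashMap<>();
        a.put("29", 1); a.put("12", 1); a.put("34", 1); a.put("79", 1);
        Map<String, Integer> b = new HashMap<>();
        b.put("56", 1); b.put("12", 1); b.put("34", 1); b.put("92", 1);
        System.out.println(reduce(Arrays.asList(a, b)));
    }
}
```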
Stripe approach – input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
Stripe approach – Output (Reducer1)
10 [ (34,0.5000) (12,0.5000) ]
12 [ (56,0.1818) (92,0.1818) (34,0.3636) (18,0.0909) (79,0.0909) (10,0.0909) ]
18 [ (56,0.1250) (92,0.1250) (34,0.2500) (79,0.1250) (29,0.1250) (12,0.2500) ]
29 [ (56,0.1333) (92,0.1333) (34,0.2667) (18,0.0667) (79,0.0667) (10,0.0667) (12,0.2667) ]
34 [ (56,0.2500) (92,0.1667) (18,0.0833) (79,0.0833) (29,0.0833) (10,0.0833) (12,0.2500) ]
56 [ (92,0.2000) (34,0.3000) (29,0.1000) (10,0.1000) (12,0.3000) ]
92 [ (34,0.3333) (10,0.3333) (12,0.3333) ]
Stripe approach – Output (Reducer2)
79 [ (56,0.2000) (92,0.2000) (34,0.2000) (18,0.2000) (12,0.2000) ]
Hybrid approach (Mapper) – pseudo code
method map(docid id, doc d)
    HashMap H;
    for each term w in doc d do
        for each neighbor u in Neighbor(w) do
            if H.contains(Pair(w, u))
                H{Pair(w, u)} += 1;
            else
                H.add(Pair(w, u), 1);
    for each Pair p in H do
        Emit(p, H(p));
Hybrid approach (Mapper) – Java Code
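A minimal plain-Java sketch of the hybrid mapper, not the project's actual Hadoop code: pairs are counted in a map inside the mapper, so each distinct pair is emitted once with its local count. It assumes the same neighbor window as the other approaches. The class name HybridMapperSketch and the "w,u" string key are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class HybridMapperSketch {
    /** Returns the locally aggregated pair counts for one document. */
    public static Map<String, Integer> map(String doc) {
        String[] terms = doc.trim().split("\\s+");
        Map<String, Integer> h = new TreeMap<>();
        for (int i = 0; i < terms.length; i++) {
            // Assumed window: terms after w, stopping at the next occurrence of w.
            for (int j = i + 1; j < terms.length && !terms[j].equals(terms[i]); j++) {
                h.merge(terms[i] + "," + terms[j], 1, Integer::sum);
            }
        }
        return h; // each entry corresponds to one Emit(p, H(p))
    }

    public static void main(String[] args) {
        map("18 12 34 12").forEach((pair, count) -> System.out.println(pair + "\t" + count));
    }
}
```

Compared with the pair mapper, this trades mapper memory for fewer intermediate records, since repeated pairs are summed before they are emitted.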
Hybrid approach (Reducer) – pseudo code
prev = null;
HashMap H;
method reduce(Pair p, Iterable<Int> values)
    if prev != null and p.firstValue != prev
        total = sumValues(H);
        for each item h in H
            h.value /= total;
        Emit(prev, H);
        clear(H);
    end if
    prev = p.firstValue;
    H.add(p.secondValue, sum(values));
method close
    // flush the stripe of the last term
    total = sumValues(H);
    for each item h in H
        h.value /= total;
    Emit(prev, H);
Hybrid approach (Reducer) – Java Code
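A minimal plain-Java sketch of the hybrid reducer, not the project's actual Hadoop code: pairs arrive sorted by first term, are collected into a stripe, and the stripe is normalized and emitted when the first term changes and once more in close(). The class name HybridReducerSketch and the emitted map (standing in for the Hadoop context) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HybridReducerSketch {
    private String prev = null;
    private final Map<String, Double> stripe = new LinkedHashMap<>();
    /** Stand-in for the output: term -> (neighbor -> relative frequency). */
    public final Map<String, Map<String, Double>> emitted = new LinkedHashMap<>();

    /** Called once per pair (w, u), in sorted order of w. */
    public void reduce(String w, String u, List<Integer> values) {
        if (prev != null && !prev.equals(w)) {
            flush(); // first term changed: emit the finished stripe of prev
        }
        prev = w;
        double sum = 0;
        for (int v : values) sum += v;
        stripe.put(u, sum);
    }

    /** Called after the last pair so the final stripe is not lost. */
    public void close() {
        if (prev != null) flush();
    }

    private void flush() {
        double total = 0;
        for (double c : stripe.values()) total += c;
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : stripe.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / total);
        }
        emitted.put(prev, normalized);
        stripe.clear();
    }

    public static void main(String[] args) {
        HybridReducerSketch r = new HybridReducerSketch();
        r.reduce("10", "12", Arrays.asList(1));
        r.reduce("10", "34", Arrays.asList(1));
        r.reduce("92", "10", Arrays.asList(1));
        r.close();
        System.out.println(r.emitted);
    }
}
```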
Hybrid approach - Input
Mapper1 input
34 56 29 12 34 56 92 10 34 12
Mapper2 input
18 29 12 34 79 18 56 12 34 92
Hybrid approach – Output (Reducer1)
10 (12,0.5) (34,0.5)
12 (10,0.09090909) (18,0.09090909) (34,0.36363637) (56,0.18181819) (79,0.09090909) (92,0.18181819)
18 (12,0.25) (29,0.125) (34,0.25) (56,0.125) (79,0.125) (92,0.125)
29 (10,0.06666667) (12,0.26666668) (18,0.06666667) (34,0.26666668) (56,0.13333334) (79,0.06666667) (92,0.13333334)
34 (10,0.083333336) (12,0.25) (18,0.083333336) (29,0.083333336) (56,0.25) (79,0.083333336) (92,0.16666667)
56 (10,0.1) (12,0.3) (29,0.1) (34,0.3) (92,0.2)
92 (10,0.33333334) (12,0.33333334) (34,0.33333334)
Hybrid approach – Output (Reducer2)
79 (12,0.2) (18,0.2) (34,0.2) (56,0.2) (92,0.2)
Comparison
Apache Spark
Write a Java program on Spark to calculate the total number of
students entering MUM in different entries. The program
should display the total number of students by country.
Spark - Java Code
Spark - input
2014 Feb Nepal 20
2014 Feb India 15
2014 Oct Italy 2
2014 July France 1
2015 Feb Nepal 10
2015 Feb India 25
2015 Oct Italy 7
Spark - Output
(France,1)
(Italy,9)
(Nepal,30)
(India,40)
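A minimal plain-Java sketch of the aggregation, not the project's actual Spark program: each input line is turned into a (country, count) pair and the counts are summed per country. In Spark's Java API the same two steps would typically be expressed with mapToPair followed by reduceByKey. The class name StudentsByCountrySketch is hypothetical, and the field layout (year, entry, country, count) is assumed from the sample input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StudentsByCountrySketch {
    /** Sums the student count (4th field) per country (3rd field). */
    public static Map<String, Integer> totalByCountry(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+"); // year, entry, country, count
            totals.merge(f[2], Integer.parseInt(f[3]), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "2014 Feb Nepal 20", "2014 Feb India 15", "2014 Oct Italy 2",
            "2014 July France 1", "2015 Feb Nepal 10", "2015 Feb India 25",
            "2015 Oct Italy 7");
        totalByCountry(input).forEach((c, n) -> System.out.println("(" + c + "," + n + ")"));
    }
}
```

The sketch prints countries in alphabetical order, so the line order can differ from the Spark output shown above while the totals are the same.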
Tools Used
• VMware Player Pro 7
• cloudera-quickstart-vm-5.4.0-0-vmware
• Eclipse Luna Service Release 2 (4.4.2)
• Windows 8.1
Thank You

CrystalBall - Compute Relative Frequency in Hadoop
