CSC 570 BIG DATA ANALYTICS
A Project on Weighted Page Rank
DECEMBER 4, 2016
PALLAV SHAH AND MANAV DESHMUKH
Contents
1. INTRODUCTION
2. WEIGHTED PAGE RANK
3. PROCESSING OF DATA
4. GENERATION OF GRAPH
5. IN-DEGREE DISTRIBUTION
6. CALCULATION OF WEIGHTS
7. CALCULATION OF PAGE RANK
8. GETTING A FINAL RESULT
9. CONCLUSION
10. REFERENCES
Acknowledgement
Working on this project, "Calculation of Weighted Page Rank", was a source of immense knowledge for us. We would like to express our sincere gratitude to Dr. Elham Khorasani (Buxton) for her constant support and guidance throughout the course work. We really appreciate your support and are thankful for your cooperation.
1. Introduction:
In this project, we are given a dataset known as ACM Citation, which contains 2,381,688 papers and 10,476,564 referencing relationships. Our goal was to calculate the weighted page rank of the ten most influential papers along with their titles. The given data is semi-structured and has fields such as title, author name, index, and references. Mainly, the indexes and their citations are used throughout the project.
2. Weighted Page Rank:
It is a step ahead or we can say an extension to the standard PageRank algorithm, which calculates
the rank basedonin-linksandoutlinkstoapage.It producesthe popularityof the page asa result.In
otherwords,the algorithmsassigna real numberto each node of a graph. The higherthe PageRank,
it'll be more important.
The formula to calculate the Weighted PageRank (shown as an image in the original report; written out here to match the code in the later sections) is:

WPR(u) = (1 - d)/N + d * sum over v in B(u) of WPR(v) * W_in(v, u) * W_out(v, u)

where d = 0.85 is the damping factor, N is the number of nodes, B(u) is the set of papers citing u, W_in(v, u) = I(u) / (sum of I(p) over all papers p referenced by v), and W_out(v, u) = O(u) / (sum of O(p) over all papers p referenced by v), with I and O denoting in-degree and out-degree.
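To make the formula concrete, here is a minimal plain-Scala sketch (separate from the project's Spark code; the three-node graph and all names are illustrative) that iterates the weighted-rank update on a toy citation graph:

```scala
// Weighted PageRank on a toy citation graph; edges are (citing, cited).
object WprSketch {
  val d = 0.85
  val edges = Seq((1, 2), (1, 3), (2, 3), (3, 1))
  val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  val n = nodes.size
  val inDeg  = edges.groupBy(_._2).map { case (v, es) => v -> es.size }
  val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // weight of edge (v, u): u's share of in-degree and out-degree
  // among all pages that v references (W_in * W_out)
  def weight(v: Int, u: Int): Double = {
    val refs = edges.filter(_._1 == v).map(_._2)
    val wIn  = inDeg(u).toDouble / refs.map(r => inDeg.getOrElse(r, 0)).sum
    val wOut = outDeg.getOrElse(u, 0).toDouble / refs.map(r => outDeg.getOrElse(r, 0)).sum
    wIn * wOut
  }

  def iterate(ranks: Map[Int, Double]): Map[Int, Double] =
    nodes.map { u =>
      val incoming = edges.filter(_._2 == u).map { case (v, _) => ranks(v) * weight(v, u) }.sum
      u -> ((1 - d) / n + d * incoming)
    }.toMap

  def run(iters: Int): Map[Int, Double] = {
    var r = nodes.map(_ -> 1.0 / n).toMap  // default rank 1/N
    for (_ <- 1 to iters) r = iterate(r)
    r
  }
}
```

Running `WprSketch.run(10)` returns the converged rank of each node; the same update is what the Spark SQL queries in Section 7 express over the full dataset.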
3. Processing of Data:
The processing of the data is done in the following steps (these are summarized; only the main steps are mentioned).
Algorithm for processing the input data:
1. Load the input file into an RDD (Resilient Distributed Dataset), using a Hadoop configuration that delimits values by #*.
2. Split each input value on new lines.
3. Filter the data with a regular expression.
4. Generate pairs of index and references.
Extra Notes:
Procedure to use the input:
 We used Hadoop configuration settings to fetch the data, after which we filtered out the useful files.
 After that, we filtered out the data which is useful.
 After getting the useful data, we generated pairs of index and references using the flatMap function.
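The steps above can be sketched in plain Scala (the two-record sample string and the field markers #index and #% are illustrative assumptions about the ACM record layout, not the project's actual parsing code):

```scala
// Steps 1-4: split records on "#*", pull out the #index line and the
// #% reference lines, and emit (index, reference) pairs via flatMap.
object ParseSketch {
  val raw = "#*Paper A\n#index100\n#%200\n#%300\n#*Paper B\n#index200\n#%300\n"

  val records = raw.split("#\\*").filter(_.nonEmpty)   // step 1: delimit by #*
  val pairs = records.toSeq.flatMap { rec =>
    val lines = rec.split("\n")                        // step 2: split on new lines
    val index = lines.find(_.startsWith("#index"))     // step 3: filter index field
      .map(_.stripPrefix("#index"))
    val refs = lines.filter(_.startsWith("#%")).map(_.stripPrefix("#%"))
    index.toSeq.flatMap(i => refs.map(r => (i, r)))    // step 4: (index, reference)
  }
}
```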
4. Generation of Graph:
Algorithm for generating the graph:
1. Convert string values to Long by using hashing.
2. Generate key-value pair maps and hash codes.
3. Convert the values of the map to hash codes.
4. Generate the graph.
Extra Notes:
 We used a hashing technique to generate the graph; the reason is that the index value is in string format, but we have to convert it into Long.
 After converting the data into Long, we got the in-degrees and out-degrees from the graph.
 Further, we used those in-degree and out-degree values to calculate the weights.
 After calculating the weights, we calculated the page ranks.
We have used sample data to demonstrate the procedure of the PageRank calculation.
The sample data looks like:
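The hashing step can be sketched as follows (a minimal illustration, not the project's code; the index strings are made up, and hashCode is one simple choice of hash at the cost of rare collisions):

```scala
// Convert string indexes to Long ids (GraphX vertex ids must be Long),
// then rewrite the edge list in terms of those Long ids.
object HashSketch {
  val indexes = Seq("idx-001", "idx-002", "idx-003")
  val mapIndexes: Map[String, Long] =
    indexes.map(s => s -> s.hashCode.toLong).toMap
  val edges = Seq(("idx-001", "idx-002"), ("idx-002", "idx-003"))
  val longEdges = edges.map { case (a, b) => (mapIndexes(a), mapIndexes(b)) }
}
```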
5. In-Degree Distribution:
 Calculated the in-degree distribution by joining the in-links and out-links.
 After constructing the graph, we obtained the in-degrees of the graph from the function Graph.inDegrees. As a result, we got the in-degree by applying the formula provided, using reduceByKey and counting the values of each node for the in-degrees.
 After getting the outputs, we combined the values and generated the chart in Excel. (Attached below)
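The in-degree distribution can be sketched in plain Scala (a toy edge list, not the project's Spark/GraphX code):

```scala
// Count in-degrees per node, then count how many nodes share each
// in-degree (the distribution that was charted in Excel).
object InDegreeSketch {
  val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L), (4L, 3L))
  val inDegrees: Map[Long, Int] =
    edges.groupBy(_._2).map { case (v, es) => v -> es.size }
  // distribution: in-degree -> number of nodes with that in-degree
  val distribution: Map[Int, Int] =
    inDegrees.values.groupBy(identity).map { case (k, vs) => k -> vs.size }
}
```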
6. Calculation of Weights:
Calculated the weights based on the formula:
To calculate the weights, we first got the values of the in-degrees and out-degrees.
Then we converted those values to data frames, and after that we converted those values into tables for join purposes:
val inlinks = citationgraph.inDegrees
val outlinks = citationgraph.outDegrees
val classInlinks = inlinks.map(element => Links(element._1, element._2))
val classOutlinks = outlinks.map(element => Links(element._1, element._2))
val dfInlinks = classInlinks.toDF
val dfOutlinks = classOutlinks.toDF
dfInlinks.createOrReplaceTempView("tblInlinks")
dfOutlinks.createOrReplaceTempView("tblOutlinks")
Explanation of the algorithm:
 We can fetch the in-degree and out-degree values from the graph using the methods shown.
 We cast to the specific class Links so we can join the data using columns.
 We cannot directly convert an RDD to a table, so first we need to convert the RDD to a data frame.
 Then we can convert it into a table.
The data frames for the inDegrees and outDegrees can be demonstrated as:
 After that, we joined both tables with an inner join to get all in-degrees and out-degrees for each specific node, and converted that into a table.
Source code:
val allLinks = spark.sql("select a.paper_index as index, a.links as inlink, b.links as outlink from tblInlinks as a join tblOutlinks as b on a.paper_index = b.paper_index")
allLinks.createOrReplaceTempView("tblAllLinks")
Moving ahead, after getting all the links we joined those links with the graph. But for that we first needed to convert the graph, so we converted it into a table and then joined all the links to that graph.
Source code:
val indexRdd = graphRdd.map(element => Paper(element._1, element._2))
val dfIndexes = indexRdd.toDF
dfIndexes.createOrReplaceTempView("tblIndexes")
Joining all the links to the graph, we need to fetch the numerator and denominator for each link in order to calculate the weights.
1) In this process of joining with the graph, we joined them on the reference index (2nd column) of the graph.
2) We calculated the denominator by applying a group-by to the numerator.
3) We calculated the denominator in two different formats: one for the in-degree values and the other for the out-degree values.
4) After calculating both denominators, we joined them together and got a common denominator.
Source code:
val numerator = spark.sql("select a.paper_index1, a.paper_index2, b.inlink, b.outlink from tblIndexes as a join tblAllLinks as b on a.paper_index2 = b.index")
val numeratorRdd = numerator.rdd.map(element => (element(0).toString.toLong, element(1).toString.toLong, element(2).toString.toInt, element(3).toString.toInt))
val groupedInlinks = numeratorRdd.groupBy(_._1).map { case (a, b) => (a, b.toArray) }
val denominatorIn = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._3).sum))
val denominatorOut = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._4).sum))
val dfdenominatorIn = denominatorIn.toDF
val dfdenominatorOut = denominatorOut.toDF
The result for that can be displayed as:
 Now, once we got the useful values for the numerator and denominator, we used an inner join between the numerator and the denominator to calculate the weights.
 We joined both tables on the main paper_index (first column). In this join query we applied the weight formula and multiplied the incoming and outgoing ratios.
Source code:
val weights = spark.sql("select a.paper_index1, a.paper_index2, cast(a.inlink/b.inlink as double)*cast(a.outlink/b.outlink as double) as weights from tblnumerator as a join tbldenominator as b on a.paper_index1 = b.paper_index")
weights.createOrReplaceTempView("tblweights")
The result of the final weights is:
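The numerator/denominator logic in the SQL above can be mirrored in plain Scala (the toy numerator rows and all values are made up for illustration):

```scala
// numerator rows: (citing, cited, inlinkOfCited, outlinkOfCited).
// denominator: group-by on the citing paper, summing in- and out-links.
// weight of each edge: (in / inSum) * (out / outSum).
object WeightSketch {
  val numerator = Seq((1L, 2L, 1, 2), (1L, 3L, 3, 1), (2L, 3L, 3, 1))
  val denom = numerator.groupBy(_._1).map { case (v, rows) =>
    v -> (rows.map(_._3).sum, rows.map(_._4).sum)
  }
  val weights = numerator.map { case (v, u, in, out) =>
    val (inSum, outSum) = denom(v)
    (v, u, (in.toDouble / inSum) * (out.toDouble / outSum))
  }
}
```

For instance, paper 1 cites papers 2 and 3, so the edge (1, 3) gets weight (3/4) * (1/3) = 0.25, matching the cast(a.inlink/b.inlink)*cast(a.outlink/b.outlink) expression in the query.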
7. Calculation of Page Rank:
 Calculated the weighted PageRank based on the formula.
 Used the new values of the ranks, going ahead with the iterations.
Algorithm to calculate the page rank:
1. Calculate the constant for the rank.
2. Initialize the default ranks.
3. For i = 1 to 10:
4. Calculate the new ranks and replace the old table of ranks.
5. Get the top ten ranks.
6. Join the table with the in-degree distribution.
 Procedure to calculate the page rank:
To calculate the weighted page rank as per the formula, we calculate the constant first and save it into one table:
val constant = sc.parallelize(Array(Constant((1 - 0.85).toDouble / N.toDouble)))
val dfConstant = constant.toDF
dfConstant.createOrReplaceTempView("tblConstant")
 Formula: (1 - 0.85)/N (where N is the number of nodes)
 Result for the constant:
 After getting the constant, we calculated the default rank for each node.
 A default rank can be calculated with the formula: 1/N
Source code:
val ranksRdd = hashIndexes.map(element => Rank(element._2.toLong, 1.toDouble / N.toDouble))
val dfRanks = ranksRdd.toDF()
dfRanks.createOrReplaceTempView("tblRanks")
The default ranks can be displayed as:
After getting the default ranks, we joined the rank table and the weights table to calculate the page rank.
We calculated the page rank with two different queries:
 The first query calculates the page rank partially; this partial page rank contains the values without the addition of the constant.
 In the second step, we added the constant to the rank values.
Source code:
val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank*a.weights)*0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
rankedWeights.createOrReplaceTempView("tblRanks")
val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
For demonstration purposes, we ran this formula for the first iteration only and got the following result:
 To run this program for 10 iterations, we apply the rank calculation above inside a for loop with 10 iterations. The code will look like:
for (a <- 1 to 10)
{
val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank*a.weights)*0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
rankedWeights.createOrReplaceTempView("tblRanks")
val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
constantRanks.createOrReplaceTempView("tblRanks")
}
 But here in this code sample we added one more statement, since we need to iterate ten times to calculate the rank.
 So, we used the function createOrReplaceTempView() to overwrite the rank table.
 But our execution of the rank calculation is not finished yet.
We need to get only the top ranks; for that we can apply another query:
val topranks = spark.sql("select paper_index, rank from tblRanks order by rank desc limit 10")
val topranksRdd = topranks.rdd.map(element => (element(0).toString.toLong, element(1).toString.toDouble))
 Thus, we got the result of the final rank.
 All of these calculations are done for the rank calculation.
 For demonstration purposes, we used the query with limit one to show the topmost rank.
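The whole iteration of this section can be sketched without Spark as follows (the toy edge weights are made-up values standing in for the tblWeights table):

```scala
// Ten iterations of rank'(u) = (1 - d)/N + d * sum of rank(v) * weight(v, u)
// over incoming edges, then take the top ranks in descending order.
object RankSketch {
  val d = 0.85
  // (citing v, cited u, weight of edge v -> u)
  val weighted = Seq((1L, 2L, 0.5), (2L, 3L, 1.0), (3L, 1L, 1.0), (1L, 3L, 0.5))
  val nodes = weighted.flatMap { case (a, b, _) => Seq(a, b) }.distinct
  val n = nodes.size

  var ranks: Map[Long, Double] = nodes.map(_ -> 1.0 / n).toMap  // default 1/N
  for (_ <- 1 to 10) {
    ranks = nodes.map { u =>
      val in = weighted.filter(_._2 == u).map { case (v, _, w) => ranks(v) * w }.sum
      u -> ((1 - d) / n + d * in)
    }.toMap
  }
  val topRanks = ranks.toSeq.sortBy(-_._2).take(10)
}
```

Reassigning ranks each pass plays the role of createOrReplaceTempView("tblRanks") overwriting the rank table, and the final sortBy/take mirrors the order by rank desc limit 10 query.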
8. Getting a Final Result:
Algorithm to get the joined result:
1. Fetch the titles and indexes from the data.
2. Replace the index values with hash codes.
3. Join the table with titles to the table with the top ten ranks.
4. Join the numerator and denominator to get the final weights.
 On our citation data, we also need to join with the in-degree values and the title.
 For the in-degree values we have: Graph.inDegrees
 For the join operation between the in-degree table and the rank table, we used a simple Scala join.
To fetch the title, we used the following code (where splittedRdd was obtained while parsing the data):
val filteredTitles = splittedRdd.map(element => (element.filter(element => element.contains("#index")).mkString(""), element(0))).filter(element => element._1 != "")
val titleRdd = filteredTitles.map(element => (element._1.substring(6, element._1.length()), element._2))
val mappedTitles = titleRdd.map(element => (mapIndexes.get(element._1), element._2)).filter(element => element._1 != None).map { case (Some(a), b) => (a, b) }
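The final join of ranks with titles can be sketched as follows (the ids, titles, and rank values are invented for illustration; the real code joins Spark tables instead of in-memory maps):

```scala
// Join the top-ranked hashed ids with their titles, keeping only
// ids that have a known title (the Option/Some filtering above).
object FinalJoinSketch {
  val topRanks = Seq((101L, 0.42), (102L, 0.31))
  val titles = Map(101L -> "Paper A", 102L -> "Paper B", 103L -> "Paper C")
  val result = topRanks.flatMap { case (id, r) => titles.get(id).map(t => (t, r)) }
}
```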
 In-Degree Graph:
9. Conclusion:
 In thisprogramgot opportunitytolearnhow page rank worksin the real world.We also
comparedthe original page rankalgorithmwhichisknownaspanda currentlyimplemented
by google.Thus,we Gotoverall ideaof page ranks,backlinksandindexingof websitesina
searchengine.
10. References:
 http://people.cis.ksu.edu/~halmohri/files/weightedPageRank.pdf
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual
 http://spark.apache.org/docs/latest/api/scala/index.html#package