Big Data Analytics Project Report
CSC 570 BIG DATA ANALYTICS
A Project on Weighted Page Rank
DECEMBER 4, 2016
PALLAV SHAH AND MANAV DESHMUKH
Contents
1. INTRODUCTION
2. WEIGHTED PAGE RANK
3. PROCESSING OF DATA
4. GENERATION OF GRAPH
5. IN-DEGREE DISTRIBUTION
6. CALCULATION OF WEIGHTS
7. CALCULATION OF PAGE RANK
8. GETTING A FINAL RESULT
9. CONCLUSION
10. REFERENCES
Acknowledgement
Working on this project, "Calculation of Weighted Page Rank," was a source of immense knowledge to us. We would like to express our sincere gratitude to Dr. Elham Khorasani (Buxton) for her constant support and guidance throughout the course work. We really appreciate your support and are thankful for your cooperation.
1. Introduction:
In this project we were given a dataset known as ACM-Citation, which contains 2,381,688 papers and 10,476,564 referencing relationships. Our goal was to calculate the weighted page rank of the ten most influential papers and report them with their titles. The given data is semi-structured and has fields such as title, author name, index, and references. Mainly the indexes and their citations are used throughout the project.
2. Weighted Page Rank:
Weighted Page Rank is a step ahead of, or an extension to, the standard PageRank algorithm: it calculates the rank of a page based on both its in-links and out-links, producing the popularity of the page as a result. In other words, the algorithm assigns a real number to each node of a graph; the higher the PageRank, the more important the node.
The formula to calculate the Weighted Page Rank is as follows:
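The original figure with the formula is not reproduced here; the following is the standard Weighted PageRank formula (Xing and Ghorbani), written to match the constants used later in this report (damping factor d = 0.85, N = number of nodes):

```latex
WPR(u) = \frac{1-d}{N} + d \sum_{v \in B(u)} WPR(v)\, W^{in}_{(v,u)}\, W^{out}_{(v,u)}
```

Here B(u) is the set of pages linking to u, and the in-link and out-link weights W^{in} and W^{out} are computed from the in-degrees and out-degrees as described in Section 6.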
3. Processing of Data:
The processing of data is done in the following steps (these are summarized; only the main steps are mentioned).
Algorithm for input and processing of data:
1. Load the input file into an RDD (Resilient Distributed Dataset), with the Hadoop configuration set to delimit records by "#*".
2. Split each input value by newline.
3. Filter out data with a regular expression.
4. Generate pairs of index and references.
Procedure to use the input:
We used Hadoop configuration settings to fetch the data, after which we filtered out the useful fields. Once we had the useful data, we generated pairs of index and references using the flatMap function.
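The parsing steps above can be sketched in plain Scala on a single hypothetical record (the actual project reads the file as an RDD with the Hadoop setting textinputformat.record.delimiter = "#*"; the record text below is made up for illustration):

```scala
object ParseSketch {
  // A hypothetical record in the ACM-Citation format: the "#*" record
  // delimiter is consumed by Hadoop, leaving the title as the first line,
  // followed by "#index" and "#%" (reference) lines.
  val record: String =
    "Sample Paper Title\n#@Author One\n#index100\n#%200\n#%300"

  // Steps 2-4: split by newline, filter lines by prefix (the real project
  // uses regular expressions), and generate (index, reference) pairs.
  def indexRefPairs(rec: String): Seq[(String, String)] = {
    val lines = rec.split("\n").toSeq
    val index = lines.filter(_.startsWith("#index")).map(_.stripPrefix("#index"))
    val refs  = lines.filter(_.startsWith("#%")).map(_.stripPrefix("#%"))
    for (i <- index; r <- refs) yield (i, r)
  }
}
```

In the real pipeline this per-record logic is the body of the flatMap over the RDD of records.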
4. Generation of Graph:
Algorithm for generation of the graph:
1. Convert string values to Long by using hashing.
2. Generate key-value pairs of maps and hash codes.
3. Convert the values of the map to hash codes.
4. Generate the graph.
We used a hashing technique to generate the graph. The reason is that the index values are in string format, but the graph requires Long vertex IDs, so we had to convert them to Long. After converting the data to Long, we obtained the in-degrees and out-degrees from the graph. Further, we used those in-degree and out-degree values to calculate the weights, and after calculating the weights, we calculated the page ranks.
We used sample data to demonstrate the procedure of the PageRank calculation. The sample data looks like:
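The hashing step can be sketched in plain Scala (the (index, reference) pairs below are hypothetical; the real project builds a GraphX graph from the resulting Long pairs):

```scala
object HashSketch {
  // Hypothetical (index, reference) pairs with string IDs, as produced
  // by the parsing step.
  val pairs: Seq[(String, String)] =
    Seq(("100", "200"), ("100", "300"), ("200", "300"))

  // Convert string IDs to Long via hashing, keeping a map so the
  // original index can be recovered for the final result.
  def hashed(ps: Seq[(String, String)]): (Map[String, Long], Seq[(Long, Long)]) = {
    val ids   = ps.flatMap { case (a, b) => Seq(a, b) }.distinct
    val map   = ids.map(id => id -> id.hashCode.toLong).toMap
    val edges = ps.map { case (a, b) => (map(a), map(b)) }
    (map, edges)
  }
}
```

Keeping the String-to-Long map around is what later lets the top-ten hashed IDs be joined back to titles.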
5. In-Degree distribution:
We calculated the in-degree distribution by joining the in-links and out-links. After constructing the graph, we obtained the in-degrees from the function Graph.inDegrees. We then applied the provided formula, using reduceByKey to count the in-degree values of each node. After getting the outputs, we combined the values and generated the distribution chart in Excel (attached below).
6. Calculation of Weights:
We calculated the weights based on the following formula:
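The original figure is not reproduced here; these are the standard link-weight definitions (Xing and Ghorbani), consistent with the join queries below:

```latex
W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p}, \qquad
W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}
```

Here I and O denote in-degrees and out-degrees, and R(v) is the set of pages referenced by v.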
To calculate the weights, we first obtained the values of the in-degrees and out-degrees. Then we converted those values to dataframes, and after that we converted the dataframes into tables for join purposes:
val inlinks = citationgraph.inDegrees
val outlinks = citationgraph.outDegrees
val classInlinks = inlinks.map(element => Links(element._1, element._2))
val classOutlinks = outlinks.map(element => Links(element._1, element._2))
val dfInlinks = classInlinks.toDF
val dfOutlinks = classOutlinks.toDF
dfInlinks.createOrReplaceTempView("tblInlinks")
dfOutlinks.createOrReplaceTempView("tblOutlinks")
Explanation of the algorithm:
We can fetch the in-degree and out-degree values from the graph using the inDegrees and outDegrees methods. We cast them to the case class Links so we can join the data using named columns. We cannot directly convert an RDD to a table, so we first need to convert the RDD to a dataframe; then we can convert it into a table.
The dataframes for inDegrees and outDegrees can be demonstrated as:
After that, we joined both tables with an inner join to get all the in-degrees and out-degrees for each specific node, and converted the result into a table.
Source code:
val allLinks = spark.sql("select a.paper_index as index, a.links as inlink, b.links as outlink from tblInlinks as a join tblOutlinks as b on a.paper_index = b.paper_index")
allLinks.createOrReplaceTempView("tblAllLinks")
Moving ahead, after getting all the links we joined them with the graph. But for that we first needed to convert the graph into a table; then we joined all the links to that graph.
Source code:
val indexRdd = graphRdd.map(element => Paper(element._1, element._2))
val dfIndexes = indexRdd.toDF
dfIndexes.createOrReplaceTempView("tblIndexes")
Joining all the links to the graph, we need to fetch the numerator and denominator for each link to calculate the weights.
1) In this process of joining with the graph, we joined them on the reference index (the second column) of the graph.
2) We calculated the denominator by applying a group-by to the numerator.
3) We calculated the denominator in two different forms: one for the in-degree values and the other for the out-degree values.
4) After calculating both denominators, we joined them together and got the common denominator.
Source code:
val numerator = spark.sql("select a.paper_index1, a.paper_index2, b.inlink, b.outlink from tblIndexes as a join tblAllLinks as b on a.paper_index2 = b.index")
val numeratorRdd = numerator.rdd.map { case element => (element(0).toString.toLong, element(1).toString.toLong, element(2).toString.toInt, element(3).toString.toInt) }
val groupedInlinks = numeratorRdd.groupBy(_._1).map { case (a, b) => (a, b.toArray) }
val denominatorIn = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._3).sum))
val denominatorOut = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._4).sum))
val dfdenominatorIn = denominatorIn.toDF
val dfdenominatorOut = denominatorOut.toDF
The result for that can be displayed as:
Now, once we had the useful values for the numerator and denominator, we used an inner join between the numerator and denominator tables to calculate the weights. We joined both tables on the main paper_index (the first column), and in this join query we applied the weight formula, multiplying the incoming and outgoing ratios.
Source code:
val weights = spark.sql("select a.paper_index1, a.paper_index2, cast(a.inlink / b.inlink as double) * cast(a.outlink / b.outlink as double) as weights from tblnumerator as a join tbldenominator as b on a.paper_index1 = b.paper_index")
weights.createOrReplaceTempView("tblweights")
The result of the final weights is:
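The weight computation can be sketched in plain Scala on a hypothetical three-node graph (the real project does this with the Spark SQL joins above; edge list and degrees below are made up):

```scala
object WeightSketch {
  // Hypothetical edge list (citing -> cited) and the degree maps
  // derived from it. Nodes with no in- or out-links are absent.
  val edges  = Seq((1L, 2L), (1L, 3L), (2L, 3L))
  val inDeg  = Map(2L -> 1, 3L -> 2)
  val outDeg = Map(1L -> 2, 2L -> 1)

  // W_in(v,u)  = I(u) / sum of in-degrees over pages referenced by v
  // W_out(v,u) = O(u) / sum of out-degrees over pages referenced by v
  // (division by a zero sum is possible for dangling nodes; this sketch
  // only exercises well-defined cases)
  def weight(v: Long, u: Long): Double = {
    val refs = edges.filter(_._1 == v).map(_._2)
    val wIn  = inDeg.getOrElse(u, 0).toDouble  / refs.map(inDeg.getOrElse(_, 0)).sum
    val wOut = outDeg.getOrElse(u, 0).toDouble / refs.map(outDeg.getOrElse(_, 0)).sum
    wIn * wOut
  }
}
```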
7. Calculation of Page Rank:
We calculated the weighted PageRank based on the formula, using the new rank values as we went ahead with the iterations.
Algorithm to calculate the page rank:
1. Calculate the constant for the rank.
2. Initialize the default ranks.
3. For i -> 1 to 10:
4.   Calculate the new ranks and replace the old table of ranks.
5. Get the top ten ranks.
6. Join the table with the in-degree distribution.
Procedure to calculate the page rank:
To calculate the weighted page rank as per the formula, we calculated the constant first and saved it into one table:
val constant = sc.parallelize(Array(Constant((1 - 0.85).toDouble / N.toDouble)))
val dfConstant = constant.toDF
dfConstant.createOrReplaceTempView("tblConstant")
Formula: (1 - 0.85) / N (where N is the number of nodes)
The result for the constant:
After getting the constant, we calculated the default rank for each node. The default rank is calculated with the formula 1 / N.
Source code:
val ranksRdd = hashIndexes.map(element => Rank(element._2.toLong, 1.toDouble / N.toDouble))
val dfRanks = ranksRdd.toDF()
dfRanks.createOrReplaceTempView("tblRanks")
The default ranks can be displayed as:
After getting the default ranks, we joined the rank table and the weights table to calculate the page rank. We calculated the page rank in two different queries: the first query calculates the page rank partially (the damped sum of weighted ranks), and in the second step we added the constant to those rank values.
Source code:
val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
rankedWeights.createOrReplaceTempView("tblRanks")
val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
For demonstration purposes we ran this formula for the first iteration only and got the following result:
To run the program for ten iterations, we wrapped the rank-calculation queries in a for loop:
for (a <- 1 to 10) {
  val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
  rankedWeights.createOrReplaceTempView("tblRanks")
  val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
  constantRanks.createOrReplaceTempView("tblRanks")
}
In this code sample we added one more statement: since we need to iterate ten times to calculate the rank, we used the createOrReplaceTempView() function to overwrite the rank table on each iteration.
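The iteration can be sketched in plain Scala on a hypothetical two-node graph (the edge weights are made up; the real project uses the Spark SQL loop above):

```scala
object RankSketch {
  // Hypothetical weighted edges: (citing, cited) -> weight.
  val weights = Map((1L, 2L) -> 1.0, (2L, 1L) -> 1.0)
  val nodes = Seq(1L, 2L)
  val d = 0.85
  val n = nodes.size

  // One iteration: rank(u) = (1 - d) / n + d * sum over in-links of rank(v) * w(v, u)
  def step(ranks: Map[Long, Double]): Map[Long, Double] =
    nodes.map { u =>
      val incoming = weights.collect { case ((v, t), w) if t == u => ranks(v) * w }.sum
      u -> ((1 - d) / n + d * incoming)
    }.toMap

  // Ten iterations starting from the default rank 1 / n.
  def ranksAfter10: Map[Long, Double] = {
    val init = nodes.map(_ -> 1.0 / n).toMap
    (1 to 10).foldLeft(init)((r, _) => step(r))
  }
}
```

On this symmetric two-node cycle both ranks stay at the fixed point 0.5, which makes the sketch easy to check by hand.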
But the execution of the rank calculation is not finished yet: we need to keep only the top ranks, for which we can apply another query:
val topranks = spark.sql("select paper_index, rank from tblRanks order by rank desc limit 10")
val topranksRdd = topranks.rdd.map { case element => (element(0).toString.toLong, element(1).toString.toDouble) }
Thus, we got the result of the final ranks. For demonstration purposes we used the query with limit 1 to show the topmost rank.
8. Getting a final result:
Algorithm to get the joined result:
1. Fetch the titles and indexes from the data.
2. Replace the index values with hash codes.
3. Join the table with titles to the table with the top ten ranks.
4. Join the numerator and denominator to get the final weights.
On our citation data we also need to join the ranks with the in-degree values and the titles. For the in-degree values we have Graph.inDegrees, and for the join operation between the in-degree table and the rank table we used a simple Scala join.
And to fetch the titles we used the following code (where splittedRdd is the RDD we got while parsing the data):
val filteredTitles = splittedRdd.map(element => (element.filter(element => element.contains("#index")).mkString(""), element(0))).filter(element => element._1 != "")
val titleRdd = filteredTitles.map(element => (element._1.substring(6, element._1.length()), element._2))
val mappedTitles = titleRdd.map(element => (mapIndexes.get(element._1), element._2)).filter(element => element._1 != None).map { case (Some(a), b) => (a, b) }
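The final join of ranks with titles can be sketched in plain Scala (the IDs, ranks, and titles below are hypothetical):

```scala
object TitleJoinSketch {
  // Hypothetical top ranks (hashed id -> rank) and title map (hashed id -> title).
  val topRanks = Seq((42L, 0.91), (7L, 0.55))
  val titles   = Map(42L -> "Paper A", 7L -> "Paper B")

  // Simple Scala join: look up each ranked id in the title map,
  // dropping ids that have no title.
  def joined: Seq[(Long, Double, String)] =
    topRanks.flatMap { case (id, rank) =>
      titles.get(id).map(title => (id, rank, title))
    }
}
```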
In Degree Graph:
9. Conclusion:
In this project we got the opportunity to learn how page rank works in the real world. We also compared it with the original PageRank algorithm as currently implemented by Google (along with updates such as Panda). Thus, we got an overall idea of page ranks, backlinks, and the indexing of websites in a search engine.