CSC 570 BIG DATA ANALYTICS
A Project on Weighted Page Rank
DECEMBER 4, 2016
PALLAV SHAH AND MANAV DESHMUKH
Contents
1. INTRODUCTION
2. WEIGHTED PAGE RANK
3. PROCESSING OF DATA
4. GENERATION OF GRAPH
5. IN-DEGREE DISTRIBUTION
6. CALCULATION OF WEIGHTS
7. CALCULATION OF PAGE RANK
8. GETTING A FINAL RESULT
9. CONCLUSION
10. REFERENCES
Acknowledgement
Working on this project, “Calculation of Weighted Page Rank,” was a source of immense knowledge to us. We would like to express our sincere gratitude to Dr. Elham Khorasani (Buxton) for her constant support and guidance throughout the course work. We really appreciate your support and are thankful for your cooperation.
1. Introduction:
In this project we were given the ACM citation dataset, which contains 2,381,688 papers and 10,476,564 referencing relationships. Our goal was to calculate the weighted page rank of the ten most influential papers and report them with their titles. The given data is semi-structured, with fields such as title, author name, index, and references. The indexes and their citations are the main fields used throughout the project.
2. Weighted Page Rank:
Weighted PageRank is an extension of the standard PageRank algorithm: it calculates the rank of a page based on both its in-links and out-links, producing a popularity score for the page as a result. In other words, the algorithm assigns a real number to each node of a graph; the higher the PageRank, the more important the node.
The formula to calculate the weighted page rank is as follows:
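The formula image did not survive the text conversion. Reconstructed from the constant (1 − 0.85)/N and the in-/out-link weights computed in sections 6 and 7 (so this is a hedged reconstruction, not the original figure), the weighted PageRank of a paper u is:

```latex
PR(u) = \frac{1-d}{N} + d \sum_{v \in B(u)} PR(v)\cdot W^{in}_{(v,u)}\cdot W^{out}_{(v,u)},
\qquad
W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p},
\qquad
W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}
```

where d = 0.85 is the damping factor, N is the number of nodes, B(u) is the set of papers citing u, R(v) is the set of papers referenced by v, and I_p, O_p are the in- and out-degrees of paper p.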
3. Processing of Data:
The processing of the data is done in the following steps (summarized; only the main steps are mentioned).
Algorithm for input and processing of data:
1. Load the input file into an RDD (Resilient Distributed Dataset), using a Hadoop configuration to delimit values by "#*".
2. Split each input value on new lines.
3. Filter out data with a regular expression.
4. Generate pairs of index and references.
Extra notes on the procedure:
 We used Hadoop configuration settings to fetch the data, and then filtered out the useful files.
 After that, we filtered out the data that was useful.
 From the useful data, we generated pairs of index and references using the flatMap function.
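The four steps above can be sketched in plain Python on a hypothetical miniature of the input format (the "#*", "#index", and "#%" markers follow the ACM citation format; the sample text here is made up for illustration):

```python
import re

# Made-up miniature of the citation file: records separated by "#*",
# each holding an "#index" line and "#%" (reference) lines.
raw = "#*Paper A\n#index1\n#%2\n#%3\n#*Paper B\n#index2\n#%3\n#*Paper C\n#index3\n"

# 1. split the input into records on the "#*" delimiter
records = [r for r in raw.split("#*") if r.strip()]

pairs = []
for record in records:
    # 2. split each record into lines
    lines = record.split("\n")
    # 3. pick out the index and reference lines with regular expressions
    index = next(re.sub(r"^#index", "", l) for l in lines if re.match(r"^#index", l))
    refs = [re.sub(r"^#%", "", l) for l in lines if re.match(r"^#%", l)]
    # 4. emit (index, reference) pairs, as flatMap would in Spark
    pairs.extend((index, ref) for ref in refs)

print(pairs)  # [('1', '2'), ('1', '3'), ('2', '3')]
```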
4. Generation of Graph:
Algorithm for generation of the graph:
1. Convert string values to Long by using hashing.
2. Generate key-value pairs of maps and hash codes.
3. Convert the values of the map to hash codes.
4. Generate the graph.
Extra notes:
 We used a hashing technique to generate the graph, because the index values are strings but the graph requires Long vertex ids.
 After converting the data into Long values, we obtained the in-degrees and out-degrees from the graph.
 Further, we used those in-degree and out-degree values to calculate the weights.
 After calculating the weights, we calculated the page ranks.
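A plain-Python sketch of the hashing step (made-up string indexes; small sequential ids stand in for Scala hash codes):

```python
from collections import Counter

# (paper, reference) edges with string indexes -- illustrative values
edges = [("a1", "b2"), ("c3", "b2"), ("a1", "c3")]

# map each string index to a stable integer id, as hashing does in the report
ids = {}
for s in {n for e in edges for n in e}:
    ids[s] = len(ids)  # deterministic small ids instead of real hash codes

hashed_edges = [(ids[src], ids[dst]) for src, dst in edges]

# in-degree / out-degree per node: the inputs to the weight calculation
out_deg = Counter(src for src, _ in hashed_edges)
in_deg = Counter(dst for _, dst in hashed_edges)

assert in_deg[ids["b2"]] == 2   # b2 is cited twice
assert out_deg[ids["a1"]] == 2  # a1 cites two papers
```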
We used sample data to demonstrate the procedure of the PageRank calculation. The sample data looks like:
5. In-Degree distribution:
 We calculated the in-degree distribution by joining the in-links and out-links.
 After constructing the graph, we obtained the in-degrees from the function Graph.inDegrees. We then computed the distribution by applying the formula provided, using reduceByKey to count the number of nodes for each in-degree value.
 After getting the outputs, we combined the values and generated the graph in Excel. (Attached below.)
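The reduceByKey-style counting can be sketched in plain Python (the in-degree values below are illustrative, not from the dataset):

```python
from collections import Counter

# in-degrees per node, e.g. as returned by Graph.inDegrees -- made-up values
in_deg = {1: 3, 2: 1, 3: 1, 4: 3, 5: 7}

# distribution: for each in-degree value, how many nodes have it;
# this is the count-per-key step described above
distribution = Counter(in_deg.values())
print(sorted(distribution.items()))  # [(1, 2), (3, 2), (7, 1)]
```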
6. Calculation of Weights:
We calculated the weights based on the formula.
To calculate the weights, we first obtained the in-degree and out-degree values. Then we converted those values to DataFrames, and after that we converted them into tables for join purposes:
val inlinks = citationgraph.inDegrees
val outlinks = citationgraph.outDegrees
val classInlinks = inlinks.map(element => Links(element._1, element._2))
val classOutlinks = outlinks.map(element => Links(element._1, element._2))
val dfInlinks = classInlinks.toDF
val dfOutlinks = classOutlinks.toDF
dfInlinks.createOrReplaceTempView("tblInlinks")
dfOutlinks.createOrReplaceTempView("tblOutlinks")
Explanation of the algorithm:
 We fetch the in-degree and out-degree values from the graph using the corresponding methods.
 We cast the values to the case class Links so that we can join the data on its columns.
 We cannot directly convert an RDD to a table, so we first convert the RDD to a DataFrame.
 Then we can convert the DataFrame into a table.
The DataFrames for the inDegrees and outDegrees can be demonstrated as:
 After that, we joined both tables with an inner join to get all in-degrees and out-degrees for each specific node, and converted the result into a table.
Source code:
val allLinks = spark.sql("select a.paper_index as index, a.links as inlink, b.links as outlink from tblInlinks as a join tblOutlinks as b on a.paper_index = b.paper_index")
allLinks.createOrReplaceTempView("tblAllLinks")
Moving ahead, after getting all the links we joined them with the graph. For that we first needed to convert the graph into a table, and then we joined all the links to it.
Source code:
val indexRdd = graphRdd.map(element => Paper(element._1, element._2))
val dfIndexes = indexRdd.toDF
dfIndexes.createOrReplaceTempView("tblIndexes")
By joining all the links to the graph, we fetch the numerator and denominator for each link, which we need to calculate the weights.
1) In this join with the graph, we joined on the reference index (the second column of the graph).
2) We calculated the denominator by applying a group-by to the numerator.
3) We calculated the denominator in two different forms: one for the in-degree values and one for the out-degree values.
4) After calculating both denominators, we joined them together to get the common denominator.
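The group-by step for the denominators can be sketched in plain Python (illustrative rows mirroring the numerator table):

```python
# Rows mirror the numerator RDD:
# (citing paper v, cited paper u, in-degree of u, out-degree of u) -- made up
rows = [
    (1, 2, 3, 1),
    (1, 3, 2, 2),
    (2, 3, 2, 2),
]

# group by the citing paper and sum degrees of everything it references
denom_in, denom_out = {}, {}
for citing, cited, in_d, out_d in rows:
    denom_in[citing] = denom_in.get(citing, 0) + in_d    # sum of I_p over R(v)
    denom_out[citing] = denom_out.get(citing, 0) + out_d # sum of O_p over R(v)

print(denom_in)   # {1: 5, 2: 2}
print(denom_out)  # {1: 3, 2: 2}
```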
Source code:
val numerator = spark.sql("select a.paper_index1, a.paper_index2, b.inlink, b.outlink from tblIndexes as a join tblAllLinks as b on a.paper_index2 = b.index")
val numeratorRdd = numerator.rdd.map(element => (element(0).toString.toLong, element(1).toString.toLong, element(2).toString.toInt, element(3).toString.toInt))
val groupedInlinks = numeratorRdd.groupBy(_._1).map { case (a, b) => (a, b.toArray) }
val denominatorIn = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(_._3).sum))
val denominatorOut = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(_._4).sum))
val dfdenominatorIn = denominatorIn.toDF
val dfdenominatorOut = denominatorOut.toDF
The result can be displayed as:
 Once we had the useful numerator and denominator values, we used an inner join between the numerator and the denominator to calculate the weights.
 We joined both tables on the main paper_index (the first column). In this join query, we applied the weight formula, multiplying the incoming and outgoing ratios.
Source code:
val weights = spark.sql("select a.paper_index1, a.paper_index2, cast(a.inlink / b.inlink as double) * cast(a.outlink / b.outlink as double) as weights from tblnumerator as a join tbldenominator as b on a.paper_index1 = b.paper_index")
weights.createOrReplaceTempView("tblweights")
The resulting final weights are:
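As a small numeric check of this weight calculation (toy values, not from the dataset): for a paper v with two references whose in-degrees are 3 and 2 and whose out-degrees are 1 and 2, the edge weights come out as follows:

```python
import math

# weight(v, u) = (I_u / sum of in-degrees over v's references)
#              * (O_u / sum of out-degrees over v's references)
rows = [(1, 2, 3, 1), (1, 3, 2, 2)]   # (v, u, I_u, O_u) -- illustrative
denom_in = {1: 3 + 2}                  # sum of I_u over v = 1's references
denom_out = {1: 1 + 2}                 # sum of O_u over v = 1's references

weights = {(v, u): (i / denom_in[v]) * (o / denom_out[v])
           for v, u, i, o in rows}

assert math.isclose(weights[(1, 2)], 0.2)    # (3/5) * (1/3)
assert math.isclose(weights[(1, 3)], 4 / 15) # (2/5) * (2/3)
```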
7. Calculation of Page Rank:
 We calculated the weighted PageRank based on the formula, using the new rank values as we proceeded through the iterations.
Algorithm to calculate the page rank:
1. Calculate the constant for the rank.
2. Initialize the default ranks.
3. For i = 1 to 10:
4. Calculate the new ranks and replace the old table of ranks.
5. Get the top ten ranks.
6. Join the table with the in-degree distribution.
 Procedure to calculate the page rank:
To calculate the weighted page rank as per the formula, we first calculated the constant and saved it into a table:
val constant = sc.parallelize(Array(Constant((1 - 0.85).toDouble / N.toDouble)))
val dfConstant = constant.toDF
dfConstant.createOrReplaceTempView("tblConstant")
 Formula: (1 - 0.85) / N, where N is the number of nodes.
 The result for the constant:
 After computing the constant, we calculated a default rank for each node.
 The default rank is calculated with the formula 1/N.
Source code:
val ranksRdd = hashIndexes.map(element => Rank(element._2.toLong, 1.toDouble / N.toDouble))
val dfRanks = ranksRdd.toDF()
dfRanks.createOrReplaceTempView("tblRanks")
The default ranks can be displayed as:
After getting the default ranks, we joined the rank table and the weights table to calculate the page rank. We calculated the page rank in two different queries:
 The first query calculates the page rank partially; this partial page rank contains the values before the addition of the constant.
 In the second step, we added the constant to the rank values.
Source code:
val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
rankedWeights.createOrReplaceTempView("tblRanks")
val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
For demonstration purposes, we ran this formula for the first iteration only and got the following result:
 To run the program for 10 iterations, we placed the rank calculation function inside a for loop:
for (a <- 1 to 10) {
  val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
  rankedWeights.createOrReplaceTempView("tblRanks")
  val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
  constantRanks.createOrReplaceTempView("tblRanks")
}
 In this code sample we added one more statement, since we need to iterate ten times to calculate the rank.
 We used the function createOrReplaceTempView() to overwrite the rank table on each iteration.
 The rank calculation is still not finished: we need to keep only the top ranks, which we can get with another query:
val topranks = spark.sql("select paper_index, rank from tblRanks order by rank desc limit 10")
val topranksRdd = topranks.rdd.map(element => (element(0).toString.toLong, element(1).toString.toDouble))
 Thus, we got the final rank result.
 All of these calculations are done for the rank computation.
 For demonstration purposes, we used the query with limit one to show the topmost rank.
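The whole iteration can be sketched in plain Python: each pass computes rank(u) = (1 − d)/N + d · Σ rank(v) · weight(v, u) over the in-neighbours v of u, overwrites the old ranks (the role of createOrReplaceTempView above), and the final ranks are sorted to take the top ten. The graph and weights below are made up for illustration:

```python
# Pure-Python sketch of the ten-iteration weighted PageRank loop.
d = 0.85
weights = {(1, 2): 0.2, (1, 3): 0.8, (2, 3): 1.0}  # weight of edge v -> u
nodes = {1, 2, 3}
N = len(nodes)

ranks = {u: 1.0 / N for u in nodes}                # default rank 1/N
for _ in range(10):
    new_ranks = {u: (1 - d) / N for u in nodes}    # constant term
    for (v, u), w in weights.items():
        new_ranks[u] += d * ranks[v] * w           # damped weighted sum
    ranks = new_ranks                              # overwrite old ranks

# equivalent of "order by rank desc limit 10"
top = sorted(ranks.items(), key=lambda kv: -kv[1])[:10]
print(top[0])  # node 3, cited by both other nodes, ranks highest
```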
8. Getting a final result:
Algorithm to get the joined result:
1. Fetch the titles and index from the data.
2. Replace the index values with hash codes.
3. Join the table with titles to the table with the top ten ranks.
4. Join the numerator and denominator to get the final weights.
 On our citation data, we also need to join the result with the in-degree values and the title.
 For the in-degree values we have Graph.inDegrees.
 For the join operation between the in-degree table and the rank table, we used a simple Scala join.
Source code:
val filteredTitles = splittedRdd.map(element => (element.filter(line => line.contains("#index")).mkString(""), element(0))).filter(pair => pair._1 != "")
val titleRdd = filteredTitles.map(pair => (pair._1.substring(6, pair._1.length()), pair._2))
val mappedTitles = titleRdd.map(pair => (mapIndexes.get(pair._1), pair._2)).filter(pair => pair._1 != None).map { case (Some(a), b) => (a, b) }
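The final join of titles onto the top-ranked papers can be sketched in plain Python (the ids, ranks, and titles below are made up for illustration):

```python
# Map hashed ids back to titles and attach titles to the top-ranked papers.
titles = {101: "Paper A", 102: "Paper B", 103: "Paper C"}   # id -> title
top_ranks = [(103, 0.91), (101, 0.40)]                      # (id, rank), rank-desc

# simple join, as done with the Scala join on the rank and title tables
final = [(pid, rank, titles[pid]) for pid, rank in top_ranks if pid in titles]
print(final[0])  # (103, 0.91, 'Paper C')
```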
The code above fetches the titles (splittedRdd was obtained while parsing the data).
 In Degree Graph:
9. Conclusion:
 In thisprogramgot opportunitytolearnhow page rank worksin the real world.We also
comparedthe original page rankalgorithmwhichisknownaspanda currentlyimplemented
by google.Thus,we Gotoverall ideaof page ranks,backlinksandindexingof websitesina
searchengine.
10. References:
 people.cis.ksu.edu/~halmohri/files/weightedPageRank.pdf
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual
 http://spark.apache.org/docs/latest/api/scala/index.html#package

More Related Content

What's hot

A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning ExplainedMelanie Swan
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introductionkrishna singh
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sectorAnil Rana
 
Data in Motion vs Data at Rest
Data in Motion vs Data at RestData in Motion vs Data at Rest
Data in Motion vs Data at RestInternap
 
Big data Presentation
Big data PresentationBig data Presentation
Big data PresentationAswadmehar
 
Stock Market Prediction
Stock Market PredictionStock Market Prediction
Stock Market PredictionMRIDUL GUPTA
 

What's hot (20)

A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Hadoop
HadoopHadoop
Hadoop
 
Deep Learning Explained
Deep Learning ExplainedDeep Learning Explained
Deep Learning Explained
 
1. Data Analytics-introduction
1. Data Analytics-introduction1. Data Analytics-introduction
1. Data Analytics-introduction
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sector
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Text MIning
Text MIningText MIning
Text MIning
 
Big data-ppt
Big data-pptBig data-ppt
Big data-ppt
 
Data in Motion vs Data at Rest
Data in Motion vs Data at RestData in Motion vs Data at Rest
Data in Motion vs Data at Rest
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
PAC Learning
PAC LearningPAC Learning
PAC Learning
 
Stock Market Prediction
Stock Market PredictionStock Market Prediction
Stock Market Prediction
 
Big data clustering
Big data clusteringBig data clustering
Big data clustering
 
Data analytics vs. Data analysis
Data analytics vs. Data analysisData analytics vs. Data analysis
Data analytics vs. Data analysis
 

Similar to Big data analytics project report

Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairsphanleson
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4 Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4 Vaclav Kosar
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informaticahomeworkping9
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdfseo18
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command lineSharat Chikkerur
 
Consistent join queries in cloud data stores
Consistent join queries in cloud data storesConsistent join queries in cloud data stores
Consistent join queries in cloud data storesJoão Gabriel Lima
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
phoenix-on-calcite-hadoop-summit-2016
phoenix-on-calcite-hadoop-summit-2016phoenix-on-calcite-hadoop-summit-2016
phoenix-on-calcite-hadoop-summit-2016Maryann Xue
 
Js info vis_toolkit
Js info vis_toolkitJs info vis_toolkit
Js info vis_toolkitnikhilyagnic
 

Similar to Big data analytics project report (20)

Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairs
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Ashwin_Thesis
Ashwin_ThesisAshwin_Thesis
Ashwin_Thesis
 
Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4 Spline 0.3 and Plans for 0.4
Spline 0.3 and Plans for 0.4
 
hadoop
hadoophadoop
hadoop
 
123448572 all-in-one-informatica
123448572 all-in-one-informatica123448572 all-in-one-informatica
123448572 all-in-one-informatica
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdfSchema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdf
Schema-based multi-tenant architecture using Quarkus &amp; Hibernate-ORM.pdf
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
 
Consistent join queries in cloud data stores
Consistent join queries in cloud data storesConsistent join queries in cloud data stores
Consistent join queries in cloud data stores
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Cost-Based query optimization
Cost-Based query optimizationCost-Based query optimization
Cost-Based query optimization
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
phoenix-on-calcite-hadoop-summit-2016
phoenix-on-calcite-hadoop-summit-2016phoenix-on-calcite-hadoop-summit-2016
phoenix-on-calcite-hadoop-summit-2016
 
Js info vis_toolkit
Js info vis_toolkitJs info vis_toolkit
Js info vis_toolkit
 

Recently uploaded

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 

Recently uploaded (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...


the rank based on the in-links and out-links of a page and produces a measure of the page's popularity as a result. In other words, the algorithm assigns a real number to each node of a graph; the higher the PageRank, the more important the node. The formula to calculate the Weighted PageRank is as follows:
PR(u) = (1 - d)/N + d * Σ_{v ∈ B(u)} PR(v) * W_in(v, u) * W_out(v, u)

where d = 0.85 is the damping factor, N is the total number of nodes, B(u) is the set of papers that cite u, and, with R(v) the set of papers referenced by v:

W_in(v, u) = I_u / Σ_{p ∈ R(v)} I_p
W_out(v, u) = O_u / Σ_{p ∈ R(v)} O_p

Here I_p and O_p are the in-degree and out-degree of paper p.
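As a concrete illustration, the formula can be run end to end in plain Scala (no Spark) on a tiny citation graph; the object WprDemo, its four edges, and the node ids below are all invented for this sketch:

```scala
object WprDemo {
  // Invented toy citation graph: edge (v, u) means paper v cites paper u.
  val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 1L))
  val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  val n = nodes.size
  val d = 0.85

  // In-degree and out-degree per node, with 0 for nodes that have none.
  val inDeg  = edges.groupBy(_._2).map { case (u, es) => u -> es.size }.withDefaultValue(0)
  val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }.withDefaultValue(0)
  val refs   = edges.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

  // W_in(v, u) = I_u / sum of I_p over papers p referenced by v; W_out is analogous.
  def wIn(v: Long, u: Long): Double  = inDeg(u).toDouble  / refs(v).map(inDeg).sum
  def wOut(v: Long, u: Long): Double = outDeg(u).toDouble / refs(v).map(outDeg).sum

  // One application of PR(u) = (1 - d)/N + d * sum over citers v of PR(v)*W_in*W_out.
  def step(pr: Map[Long, Double]): Map[Long, Double] =
    nodes.map { u =>
      val citers = edges.collect { case (v, `u`) => v }
      u -> ((1 - d) / n + d * citers.map(v => pr(v) * wIn(v, u) * wOut(v, u)).sum)
    }.toMap

  // Iterate from the uniform initial rank 1/N, as in the Spark implementation.
  def ranks(iters: Int): Map[Long, Double] =
    (1 to iters).foldLeft(nodes.map(_ -> 1.0 / n).toMap)((pr, _) => step(pr))
}
```

On this toy graph, paper 3 (cited by both 1 and 2) ends up ranked above paper 2, matching the intuition that heavily cited papers gain rank.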
3. Processing of Data:

The processing of the data is done in the following steps (summarized; only the main steps are mentioned).

Algorithm for input and processing of data:
1. Load the input file into an RDD (Resilient Distributed Dataset), with the Hadoop configuration set to delimit records by "#*".
2. Split each input value on new lines.
3. Filter the data with regular expressions.
4. Generate pairs of index and references.

Notes on the procedure:
- We used Hadoop configuration settings to fetch the data, then filtered out the useful fields.
- From the useful data we generated pairs of index and references using the flatMap function.

4. Generation of Graph:

Algorithm for graph generation:
1. Convert string values to Long by hashing.
2. Generate key-value pairs of indexes and hash codes.
3. Convert the map values to hash codes.
4. Generate the graph.

Notes:
- We used a hashing technique to generate the graph, because the index values are strings but the graph requires Long vertex ids.
- After converting the data to Long, we obtained the in-degrees and out-degrees from the graph.
- We then used those in-degree and out-degree values to calculate the weights, and from the weights we calculated the page ranks.

We used sample data to demonstrate the procedure of the PageRank calculation. The sample data looks like:
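The parsing and hashing steps above can be sketched in plain Scala; ParseDemo and its two-record sample are invented, following the "#*" / "#index" record layout used in the title-fetching code later (the "#%" reference marker is assumed from the standard ACM citation format):

```scala
object ParseDemo {
  // Two invented records: "#*" starts a record, "#index" its id, "#%" a reference.
  val raw = "#*Paper A\n#index100\n#%200\n#*Paper B\n#index200"

  // Steps 1-2: split the input into records on "#*", then into lines.
  val records = raw.split("(?=#\\*)").filter(_.nonEmpty).map(_.split("\n").toSeq)

  // Steps 3-4: pull the index and its references out of each record.
  val pairs: Seq[(String, String)] = records.toSeq.flatMap { rec =>
    val idx  = rec.collectFirst { case l if l.startsWith("#index") => l.drop(6) }
    val refs = rec.collect { case l if l.startsWith("#%") => l.drop(2) }
    idx.toSeq.flatMap(i => refs.map(r => (i, r)))
  }

  // Hash the string indexes to Long so they can serve as graph vertex ids.
  val edges: Seq[(Long, Long)] = pairs.map { case (a, b) => (a.hashCode.toLong, b.hashCode.toLong) }
}
```

Record B has no references, so it contributes no edge; only the (100, 200) citation pair survives.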
5. In-Degree Distribution:

- We calculated the in-degree distribution by joining the in-links and out-links.
- After constructing the graph, we obtained the in-degrees from the function Graph.inDegrees. We then applied reduceByKey to count the in-degree values of each node.
- After getting the outputs, we combined the values and generated the distribution graph in Excel (attached below).

6. Calculation of Weights:

We calculated the weights based on the weight formula (the in-weight times the out-weight for each link). To calculate the weights, we first obtained the in-degree and out-degree values. Then we converted those values to data frames, and after that we registered them as tables for the join:

val inlinks = citationgraph.inDegrees
val outlinks = citationgraph.outDegrees
val classInlinks = inlinks.map(element => Links(element._1, element._2))
val classOutlinks = outlinks.map(element => Links(element._1, element._2))
val dfInlinks = classInlinks.toDF
val dfOutlinks = classOutlinks.toDF
dfInlinks.createOrReplaceTempView("tblInlinks")
dfOutlinks.createOrReplaceTempView("tblOutlinks")

Explanation of the algorithm:
- We can fetch the in-degree and out-degree values from the graph using the methods above.
- We cast the pairs to the case class Links so we can join the data by column.
- An RDD cannot be converted directly to a table, so we first convert the RDD to a data frame and then register it as a table.
The data frames for inDegrees and outDegrees can be demonstrated as:

- After that, we joined both tables with an inner join to get the in-degree and out-degree for each specific node, and registered the result as a table.

Source code:
val allLinks = spark.sql("select a.paper_index as index, a.links as inlink, b.links as outlink from tblInlinks as a join tblOutlinks as b on a.paper_index = b.paper_index")
allLinks.createOrReplaceTempView("tblAllLinks")
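Outside Spark, the same degree tables and inner join can be built from a plain edge list; DegreeDemo and its three edges are invented for illustration:

```scala
object DegreeDemo {
  // Invented edge list; (v, u) means paper v cites paper u.
  val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L))

  // Equivalents of Graph.inDegrees and Graph.outDegrees.
  val inDegrees: Map[Long, Int]  = edges.groupBy(_._2).map { case (u, es) => u -> es.size }
  val outDegrees: Map[Long, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // Equivalent of the inner join that builds tblAllLinks: keep only nodes
  // present in both tables, pairing (inlink, outlink) per node.
  val allLinks: Map[Long, (Int, Int)] =
    inDegrees.keySet.intersect(outDegrees.keySet)
      .map(k => k -> (inDegrees(k), outDegrees(k)))
      .toMap
}
```

Note that the inner join keeps only node 2 here: node 1 has no in-links and node 3 has no out-links, so they fall out, just as in the SQL join.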
Moving ahead, after getting all the links we joined them with the graph. For that we first converted the graph into a table, then joined all the links to it.

Source code:
val indexRdd = graphRdd.map(element => Paper(element._1, element._2))
val dfIndexes = indexRdd.toDF
dfIndexes.createOrReplaceTempView("tblIndexes")

Joining all the links to the graph, we need to fetch the numerator and denominator for each link in order to calculate the weights:
1) In this join with the graph, we joined on the reference index (the 2nd column) of the graph.
2) We calculated the denominator by applying a group-by to the numerator.
3) We calculated the denominator in two different forms: one for the in-degree values and one for the out-degree values.
4) After calculating both denominators, we joined them together to get a common denominator.

Source code:
val numerator = spark.sql("select a.paper_index1, a.paper_index2, b.inlink, b.outlink from tblIndexes as a join tblAllLinks as b on a.paper_index2 = b.index")
val numeratorRdd = numerator.rdd.map { case element => (element(0).toString.toLong, element(1).toString.toLong, element(2).toString.toInt, element(3).toString.toInt) }
val groupedInlinks = numeratorRdd.groupBy(_._1).map { case (a, b) => (a, b.toArray) }
val denominatorIn = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._3).sum))
val denominatorOut = groupedInlinks.map(element => SingleDenominator(element._1, element._2.map(element => element._4).sum))
val dfdenominatorIn = denominatorIn.toDF
val dfdenominatorOut = denominatorOut.toDF

The result can be displayed as:

- Once we had the numerator and denominator values, we used an inner join between them to calculate the weights.
- We joined both tables on the main paper_index (the first column), and in this join query we applied the weight formula, multiplying the incoming and outgoing ratios.

Source code:
val weights = spark.sql("select a.paper_index1, a.paper_index2, cast(a.inlink / b.inlink as double) * cast(a.outlink / b.outlink as double) as weights from tblnumerator as a join tbldenominator as b on a.paper_index1 = b.paper_index")
weights.createOrReplaceTempView("tblweights")

The result of the final weights is:
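The group-by-and-sum pipeline can be checked on a handful of rows shaped like numeratorRdd, i.e. (paper_index1, paper_index2, inlink, outlink); DenominatorDemo and its numbers are invented for illustration:

```scala
object DenominatorDemo {
  // Invented rows shaped like numeratorRdd: (paper_index1, paper_index2, inlink, outlink).
  val numerator = Seq((1L, 2L, 1, 1), (1L, 3L, 2, 1), (2L, 3L, 2, 1))

  // Group by the citing paper and sum, as in denominatorIn / denominatorOut.
  val grouped = numerator.groupBy(_._1)
  val denominatorIn: Map[Long, Int]  = grouped.map { case (k, rows) => k -> rows.map(_._3).sum }
  val denominatorOut: Map[Long, Int] = grouped.map { case (k, rows) => k -> rows.map(_._4).sum }

  // Per-edge weight: (inlink / sumIn) * (outlink / sumOut), as in the SQL join query.
  val weights: Map[(Long, Long), Double] = numerator.map { case (v, u, in, out) =>
    (v, u) -> in.toDouble / denominatorIn(v) * (out.toDouble / denominatorOut(v))
  }.toMap
}
```

For paper 1, which cites two papers, the in-link denominator is 1 + 2 = 3, so the edge (1, 3) gets in-ratio 2/3; paper 2 cites only one paper, so its single edge gets weight 1.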
7. Calculation of Page Rank:

- We calculated the weighted PageRank based on the formula, feeding the new rank values into each successive iteration.

Algorithm to calculate the page rank:
1. Calculate the constant for the rank.
2. Initialize the default ranks.
3. For i -> 1 to 10: calculate the new ranks and replace the old rank table.
4. Get the top ten ranks.
5. Join the table with the in-degree distribution.

Procedure to calculate the page rank: to calculate the weighted page rank as per the formula, we calculated the constant first and saved it into a table:

val constant = sc.parallelize(Array(Constant((1 - 0.85).toDouble / N.toDouble)))
val dfConstant = constant.toDF
dfConstant.createOrReplaceTempView("tblConstant")

- Formula: (1 - 0.85) / N, where N is the number of nodes.
- The result for the constant:

After that, we calculated the default rank for each node. The default rank is calculated with the formula 1 / N.

Source code:
val ranksRdd = hashIndexes.map(element => Rank(element._2.toLong, 1.toDouble / N.toDouble))
val dfRanks = ranksRdd.toDF()
dfRanks.createOrReplaceTempView("tblRanks")

The default ranks can be displayed as:
After getting the default ranks, we joined the rank table and the weights table to calculate the page rank. We calculated the page rank with two different queries:
- The first query calculates the page rank partially; this partial rank holds the weighted sums before the addition of the constant.
- In the second step we added the constant to the rank values.

Source code:
val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
rankedWeights.createOrReplaceTempView("tblRanks")
val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")

For demonstration purposes we ran this formula for the first iteration only and got the following result:

To run this program for ten iterations, we wrapped the rank-calculation queries in a for loop:

for (a <- 1 to 10) {
  val rankedWeights = spark.sql("select a.paper_index2 as paper_index, sum(b.rank * a.weights) * 0.85 as rank from tblWeights a join tblRanks b on a.paper_index1 = b.paper_index group by a.paper_index2")
  rankedWeights.createOrReplaceTempView("tblRanks")
  val constantRanks = spark.sql("select paper_index, rank + (select constant from tblConstant) as rank from tblRanks")
  constantRanks.createOrReplaceTempView("tblRanks")
}

- Note that this code sample adds one more statement, because we need to iterate ten times to calculate the rank.
- So we used the function createOrReplaceTempView() to overwrite the rank table on each iteration.
- The execution of the rank calculation is not finished yet: we still need to get only the top ranks, for which we can apply another query:

val topranks = spark.sql("select paper_index, rank from tblRanks order by rank desc limit 10")
val topranksRdd = topranks.rdd.map { case element => (element(0).toString.toLong, element(1).toString.toDouble) }

- Thus, we got the final rank result. For demonstration purposes we used a query with limit 1 to show the topmost rank.

8. Getting a Final Result:

Algorithm to get the joined result:
1. Fetch the titles and indexes from the data.
2. Replace the index values with hash codes.
3. Join the table with titles to the table with the top ten ranks.
4. Join the result with the in-degree values.

- On our citation data we also need to join the result with the in-degree values and the title.
- For the in-degree values we have Graph.inDegrees, and for the join operation between the in-degree table and the rank table we used a simple Scala join.
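The final steps (taking the top ten ranks and attaching titles) can be sketched in plain Scala; TopRanksDemo, its ranks, and its titles are invented for illustration:

```scala
object TopRanksDemo {
  // Invented final ranks (paper_index -> rank) and an invented title table.
  val finalRanks = Map(100L -> 0.17, 200L -> 0.42, 300L -> 0.05)
  val titles     = Map(100L -> "Paper A", 200L -> "Paper B", 300L -> "Paper C")

  // Plain-Scala equivalent of "order by rank desc limit 10".
  val topRanks: Seq[(Long, Double)] = finalRanks.toSeq.sortBy(-_._2).take(10)

  // Join the top ranks with their titles, as in the final result table.
  val titled: Seq[(Long, Double, String)] =
    topRanks.flatMap { case (idx, r) => titles.get(idx).map(t => (idx, r, t)) }
}
```

Using titles.get with flatMap drops any ranked index whose title is missing, mirroring an inner join.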
To fetch the title we used the following code (where splittedRdd was obtained while parsing the data):

val filteredTitles = splittedRdd.map(element => (element.filter(element => element.contains("#index")).mkString(""), element(0))).filter(element => element._1 != "")
val titleRdd = filteredTitles.map(element => (element._1.substring(6, element._1.length()), element._2))
val mappedTitles = titleRdd.map(element => (mapIndexes.get(element._1), element._2)).filter(element => element._1 != None).map { case (Some(a), b) => (a, b) }

In-degree graph:

9. Conclusion:

In this project we got the opportunity to learn how page rank works in the real world, and we compared the original PageRank algorithm with the ranking updates (such as Panda) currently implemented by Google. Thus, we got an overall idea of page ranks, backlinks, and the indexing of websites in a search engine.
10. References:

- people.cis.ksu.edu/~halmohri/files/weightedPageRank.pdf
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual
- http://spark.apache.org/docs/latest/api/scala/index.html#package