2. Look at how much fun they’re having.
PageRank is fun!
...however, there are a lot of web pages and a lot of links, so it
becomes a LOT of work to calculate.
3. Overview
How does PageRank work?
-Directed graph (nodes point to other nodes, but it’s a one-way street)
-Adjacency matrix constructed from the graph
-Each page is given an equal weight to distribute to the pages it points to
-A page with no other pages pointing to it is given a weight of 1/(total number of pages),
to represent the random chance that someone goes directly to that
page
-Adjacency matrix is multiplied by the PageRank vector iteratively until the
PageRanks approach an equilibrium and stop changing
-“Damping Factor” applied to simulate a random stop in page exploration
Formula represented as:
$$R(t+1) = d\,M\,R(t) + \frac{1-d}{N}\,\mathbf{1}$$
Where R is the PageRank vector, M is the adjacency matrix, t is the number of iterations done,
d is the damping factor, N is the total number of pages, and 1 is a column vector of N ones.
After a limited number of iterations, the PageRank values converge.
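The iteration described above can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' code: the function name `pagerank`, the 4-page example graph, and the damping value d = 0.85 are assumptions chosen for illustration. M is stored so that M[i][j] is the chance of moving from page j to page i, which means every column sums to 1.

```python
def pagerank(M, d=0.85, iterations=30):
    """Iterate R(t+1) = d * M * R(t) + (1 - d) / N from equal starting weights."""
    N = len(M)
    R = [1.0 / N] * N  # every page starts with the same weight
    for _ in range(iterations):
        # Build the whole new vector from the old one, then replace it.
        R = [(1 - d) / N + d * sum(M[i][j] * R[j] for j in range(N))
             for i in range(N)]
    return R


# Hypothetical 4-page graph: A -> B, A -> C, B -> C, C -> A, D -> C.
# Column j lists where page j sends its weight.
M = [
    [0.0, 0.0, 1.0, 0.0],  # weight flowing into A
    [0.5, 0.0, 0.0, 0.0],  # weight flowing into B
    [0.5, 1.0, 0.0, 1.0],  # weight flowing into C
    [0.0, 0.0, 0.0, 0.0],  # weight flowing into D (nobody links to D)
]

ranks = pagerank(M)
print([round(r, 4) for r in ranks])
print(round(sum(ranks), 4))
```

In this example page D has no incoming links, so its rank settles at (1 - d) / N, the share that comes only from random jumps.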
4. Overview
The adjacency matrix is populated with
1’s initially, and then with 1/#NodesPointedAt,
since a given site pointing to 3 other sites gives
a ⅓ chance to navigate to any one of the 3.
In a very basic representation, the adjacency
matrix is multiplied by the PageRank vector,
in this case on its initial run. All pages have
equal weight at the beginning.
To deal with pages that never link to other
pages, we need to add a probability that
represents the person jumping to another page
by typing its address into the browser.
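The matrix construction described on this slide could look roughly like the sketch below. The function name `build_matrix`, the edge-list input format, and the example graph are assumptions for illustration. Giving a page with no outgoing links an equal 1/N chance of jumping to every page is one common way to model the "types an address into the browser" case, and may differ from what the presenters' code does.

```python
def build_matrix(num_pages, links):
    """Return M where M[i][j] is the chance of moving from page j to page i."""
    out_links = [[] for _ in range(num_pages)]
    for src, dst in links:
        out_links[src].append(dst)

    M = [[0.0] * num_pages for _ in range(num_pages)]
    for j, targets in enumerate(out_links):
        if targets:
            # A page linking to k pages sends 1/k of its weight along each link.
            share = 1.0 / len(targets)
            for i in targets:
                M[i][j] += share
        else:
            # Assumed handling of a page with no links: jump anywhere with equal chance.
            for i in range(num_pages):
                M[i][j] = 1.0 / num_pages
    return M


# Hypothetical graph: page 0 links to 1, 2 and 3; page 1 links to 2;
# page 2 links to 0; page 3 links nowhere.
links = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 0)]
M = build_matrix(4, links)
for row in M:
    print([round(x, 3) for x in row])
```

Page 0 points to three pages, so each of them receives a ⅓ entry in column 0, matching the example in the text.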
5. Overview
Damping Factor-
When a node points to no one else, it can,
over time, become a sink and hold all of
the weight.
The damping factor simulates a user getting
bored of their current train of pages and
going to a random website. This makes sure
that sinks don’t happen, as someone stuck
on C might become bored and navigate to
any other node.
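The sink effect can be seen on a tiny graph. The sketch below is illustrative only: the 3-page graph is hypothetical, and the sink is modeled as a page C whose only link points back to itself, which is an assumption about how a page that "points to no one else" is represented. It compares d = 1 (no damping) with d = 0.85.

```python
def pagerank(M, d, iterations=50):
    """Iterate R(t+1) = d * M * R(t) + (1 - d) / N from equal starting weights."""
    N = len(M)
    R = [1.0 / N] * N
    for _ in range(iterations):
        R = [(1 - d) / N + d * sum(M[i][j] * R[j] for j in range(N))
             for i in range(N)]
    return R


# Hypothetical graph: A -> B, A -> C, B -> C, and C links only to itself.
M = [
    [0.0, 0.0, 0.0],  # weight flowing into A
    [0.5, 0.0, 0.0],  # weight flowing into B
    [0.5, 1.0, 1.0],  # weight flowing into C
]

no_damping = pagerank(M, d=1.0)     # the surfer never gets bored
with_damping = pagerank(M, d=0.85)  # 15% chance of a random jump at each step

print([round(r, 4) for r in no_damping])
print([round(r, 4) for r in with_damping])
```

Without damping, C ends up holding all of the weight. With d = 0.85, A and B keep small but nonzero ranks (roughly 0.05 and 0.07), because a bored surfer on C can still land on them.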
6. Sequential Implementation
The sequential implementation is basically just a big loop through the matrix, performing
the PageRank calculation on every row of the matrix, updating the PageRank vector, and
then repeating n times.
Pseudocode:
for n times
    for rowIndex, row in adjmat
        tempPR[rowIndex] = 0
        for i in row
            tempPR[rowIndex] += PRcalc(row[i], PR[i])
    PR = tempPR
7. Parallel implementation
Running this problem concurrently is actually very slick. Each thread can handle a
row of the adjacency matrix and calculate PR for each node. Each thread writes the
new pagerank to a temporary vector, and after all threads have calculated
pagerank for each node, the new pagerank vector replaces the old one. No threads
will ever write to the same location, so there is no need to use mutexes or any other
kind of read/write control. The only thing to keep control of is making sure the
threads don’t outpace each other and get ahead or behind on the iteration.
*pseudocode for thread:
row = adjmat[next]
thisRank = 0
for i in row
    thisRank += (do PR formula on row[i], pagerank[i])
tempPR[next] = thisRank
*main thread (after all threads finish the iteration):
pagerank = tempPR
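The thread structure above might be sketched in Python as follows. This is not the presenters' implementation: the names `rank_for_row` and `parallel_pagerank`, the use of `ThreadPoolExecutor`, the 4-page matrix, and d = 0.85 are all assumptions for illustration. Each task reads the old vector and produces one entry of the new one, and the main thread swaps the vectors only after every task in the iteration has finished.

```python
from concurrent.futures import ThreadPoolExecutor


def rank_for_row(row, pagerank, d):
    """Compute the new PageRank for one page from its row of the matrix."""
    n = len(pagerank)
    return (1 - d) / n + d * sum(w * r for w, r in zip(row, pagerank))


def parallel_pagerank(adjmat, d=0.85, iterations=30, num_threads=2):
    n = len(adjmat)
    pagerank = [1.0 / n] * n
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for _ in range(iterations):
            # Each task only reads the old vector and returns one entry of
            # the new vector, so no two tasks write to the same location.
            temp_pr = list(pool.map(
                lambda row: rank_for_row(row, pagerank, d), adjmat))
            # list(...) has collected every result at this point, which acts
            # as the per-iteration barrier before the main thread swaps.
            pagerank = temp_pr
    return pagerank


# Hypothetical 4-page matrix; M[i][j] is the chance of moving from page j to page i.
M = [
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

ranks = parallel_pagerank(M, num_threads=2)
print([round(r, 4) for r in ranks])
```

In standard CPython the global interpreter lock generally keeps CPU-bound threads from running at the same time, so this sketch shows the synchronization pattern rather than the speedup reported later.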
11. Output (two iterations)
To test whether the algorithm is correct, we ran our code on a data set with 4 nodes.
Sequential (one thread) Parallel (two threads)
12. The PageRank values are converging after 30 iterations.
Output (30 iterations, sequential)
13. Output (30 iterations, parallel)
The PageRank values
are converging after
30 iterations.
So the algorithm in
our code is CORRECT!
14. Running time
For this part, we ran our code on a big data set named Wiki-Vote (about 8000 nodes),
and we ran it for 20 iterations.
Wiki-Vote was downloaded from http://snap.stanford.edu/data/. It is a directed
graph, described as the Wikipedia who-votes-on-whom network.
Sequential (1 thread): 15.18 sec.
2 threads: 7.89 sec.
4 threads: 4.21 sec.
16 threads: 3.77 sec.
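The timings above can be turned into speedup and efficiency figures with a short calculation. The sketch below uses only the four measurements reported on this slide; the variable names are illustrative.

```python
# Running times reported above, keyed by thread count (seconds).
timings = {1: 15.18, 2: 7.89, 4: 4.21, 16: 3.77}

baseline = timings[1]
speedups = {}
efficiencies = {}
for threads, seconds in timings.items():
    speedups[threads] = baseline / seconds                 # times faster than 1 thread
    efficiencies[threads] = speedups[threads] / threads    # fraction of ideal linear speedup
    print(f"{threads:>2} threads: speedup {speedups[threads]:.2f}x, "
          f"efficiency {efficiencies[threads]:.0%}")
```

The speedup is roughly 1.9x with 2 threads, 3.6x with 4 threads, and 4.0x with 16 threads, so going from 4 to 16 threads adds very little and efficiency falls from about 96% to about 25%.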
16. Conclusion
After more than 20 iterations, the PageRank values show
convergence.
When running the small data set, there is no difference in running
time between the sequential and multithreaded versions.
When running the big data set for many iterations,
there is an obvious difference in running time between
the sequential and multithreaded versions. As the number of threads
increases, the running time decreases.
Thank you!