This document summarizes techniques for computing PageRank on large web graphs in an I/O-efficient manner. It introduces the PageRank algorithm and describes the challenges in computing PageRank over billions of web pages under memory and storage limitations. It then presents four algorithms: the naive algorithm, Haveliwala's algorithm, and two new algorithms proposed by the authors, the Sort-Merge algorithm and the Split-Accumulate algorithm. The document concludes by experimentally evaluating these algorithms on a real web graph with hundreds of millions of pages and over a billion links.
2. Web Graph
• URL as a node
• Hyperlink as a directed edge
• The graph structure represents the World Wide Web
3. PageRank
• Random Surfer model
  – A person who surfs the web by randomly clicking links on visited pages.
• The PageRank of a page is proportional to the frequency with which a random surfer would visit it.
[Figure: example graph with ranks R1=0.286, R2=0.286, R3=0.143, R4=0.143, R5=0.143]
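The random-surfer intuition can be sketched as a simple power iteration. The 3-page cycle below is a hypothetical example, not the graph from the slide:

```python
def pagerank_basic(out_links, iterations=50):
    """Plain PageRank power iteration: in each round, every page splits
    its current rank evenly among the pages it links to."""
    n = len(out_links)
    rank = {p: 1.0 / n for p in out_links}
    for _ in range(iterations):
        new = {p: 0.0 for p in out_links}
        for q, targets in out_links.items():
            share = rank[q] / len(targets)
            for p in targets:
                new[p] += share
        rank = new
    return rank

# Hypothetical 3-page cycle a -> b -> c -> a: by symmetry each page
# converges to rank 1/3.
ranks = pagerank_basic({"a": ["b"], "b": ["c"], "c": ["a"]})
```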
4. Practical PageRank
• Two problems:
  – Rank leak
  – Rank sink
• Pruning
• Add back edges
• Random Jump
[Figure: example graph with d=0.8, giving R1=0.154, R2=0.142, R3=0.101, R4=0.313, R5=0.290]
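The random-jump fix corresponds to a damped iteration. A minimal sketch, assuming every page has at least one out-link (i.e. leaks have already been repaired); the graph is hypothetical:

```python
def pagerank_with_jump(out_links, alpha=0.8, iterations=50):
    """Damped PageRank: with probability alpha the surfer follows a link,
    with probability (1 - alpha) it jumps to a uniformly random page."""
    n = len(out_links)
    rank = {p: 1.0 / n for p in out_links}
    for _ in range(iterations):
        new = {p: (1 - alpha) / n for p in out_links}
        for q, targets in out_links.items():
            share = alpha * rank[q] / len(targets)
            for p in targets:
                new[p] += share
        rank = new
    return rank

# Hypothetical graph; with alpha = d = 0.8 the total rank stays 1.
ranks = pagerank_with_jump({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```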
5. Topic-Sensitive PageRank
• Modified Random Jump
• Only jump to certain pages which are related to a specific topic T
• ODP-biasing
6. Challenge
• 3.5 billion pages on the web
• 49 billion hyperlinks between them
• Storing one 4-byte PageRank value per page requires 14 GB – hard to fit in memory
• Goal: calculate the PageRank values in an I/O-efficient way
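The 14 GB figure follows directly from the slide's numbers (a back-of-the-envelope check, assuming 10^9 bytes per GB):

```python
pages = 3_500_000_000          # 3.5 billion pages
bytes_per_rank = 4             # one 32-bit float per page
total_gb = pages * bytes_per_rank / 10**9
print(total_gb)  # 14.0 GB for a single rank vector
```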
9. Naïve Algorithm
• Two vectors of 32-bit floating point numbers.
• Source vector is on disk.
• Destination vector is in memory.

C_naive = |V| + |V'| + |L| = 2·|V| + |L|
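One iteration of the naïve scheme could look like the sketch below: the source vector and the link file are streamed sequentially from disk while the destination vector lives in memory. The binary record layout (per source node: a 4-byte out-degree followed by that many 4-byte destination ids) is an assumption for illustration, not the paper's exact format:

```python
import struct

def naive_iteration(src_path, link_path, n):
    """One PageRank iteration, naive scheme.  Streams the on-disk source
    vector (n 32-bit floats) and the link file in lockstep; only the
    destination vector `dest` is held in memory."""
    dest = [0.0] * n
    with open(src_path, "rb") as sf, open(link_path, "rb") as lf:
        for _ in range(n):
            (rank,) = struct.unpack("<f", sf.read(4))
            (deg,) = struct.unpack("<I", lf.read(4))  # assumed layout: out-degree, then dst ids
            share = rank / deg if deg else 0.0
            for _ in range(deg):
                (dst,) = struct.unpack("<I", lf.read(4))
                dest[dst] += share
    return dest
```

In practice `dest` would then be written back to disk to serve as the next iteration's source vector.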
10. Haveliwala's Algorithm
• Partition destination vector into d blocks V_i' that each fit into main memory.
• Partition link file into d files L_i, where each contains only the links pointing to nodes in V_i'.

C_h = d·|V| + |V'| + Σ_{0≤i<d} |L_i| = (1+d)·|V| + (1+ε)·|L|
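A minimal in-memory sketch of the blocking idea (the real algorithm streams V and each L_i from disk; `block_size` and the list-of-edge-lists representation are assumptions for illustration):

```python
def haveliwala_iteration(src_rank, out_deg, links_by_block, block_size):
    """One iteration: the destination vector is produced one block at a
    time.  Pass i reads only link file L_i, whose links all point into
    destination block i, so only that block must fit in memory."""
    dest = {}
    for i, links in enumerate(links_by_block):
        block = {}  # destination block V_i', held in RAM
        for src, dst in links:
            assert i * block_size <= dst < (i + 1) * block_size
            block[dst] = block.get(dst, 0.0) + src_rank[src] / out_deg[src]
        dest.update(block)
    return dest

# Hypothetical 4-node graph split into d = 2 destination blocks of size 2.
dest = haveliwala_iteration(
    src_rank={0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},
    out_deg={0: 1, 1: 2, 2: 1, 3: 1},
    links_by_block=[[(1, 0), (2, 1)], [(0, 2), (1, 3), (3, 2)]],
    block_size=2,
)
```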
12. Sort-Merge Algorithm
• Link file is identical to the one in the naïve algorithm.
• Create for each link a packet that contains the line number of the destination and the amount of rank value that has to be transmitted to that destination.
• 8-byte packet: 4-byte id + 4-byte floating point number
13. Sort-Merge Algorithm (continued)
• Route packets by sorting them by destination and combining the ranks into the destination node.
• |P| is the total size of the generated packets, which need to be written out and read back once.

C_sort-merge = |V| + |V'| + |L| + 2·|P|
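The two phases can be sketched as follows. In the paper the sort is an external merge sort over the on-disk packet file; the in-memory version below only shows the data flow, and the example graph is hypothetical:

```python
def sort_merge_iteration(src_rank, out_deg, links, n):
    """Phase 1: emit one (dst_id, rank_share) packet per link.
    Phase 2: 'route' the packets by sorting on destination id, then
    merge consecutive packets into each destination's rank entry."""
    packets = [(dst, src_rank[src] / out_deg[src]) for src, dst in links]
    packets.sort(key=lambda p: p[0])  # external merge sort on disk in the real algorithm
    dest = [0.0] * n
    for dst, share in packets:
        dest[dst] += share
    return dest

# Hypothetical 3-node graph: 0 -> 1, 1 -> 0, 1 -> 2, 2 -> 0
dest = sort_merge_iteration(
    src_rank=[1/3, 1/3, 1/3],
    out_deg=[1, 2, 1],
    links=[(0, 1), (1, 0), (1, 2), (2, 0)],
    n=3,
)
```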
14. Split-Accumulate Algorithm
• Splits the source vector into d blocks V_i, such that the 4-byte rank values of all nodes in a block fit into memory.
• Link file L_i contains information on all links with source node in block V_i.
• It is like the reverse of L_i in Haveliwala's algorithm, but the out-degree information is moved to separate files.
16. Split-Accumulate Algorithm (continued)
• File O_i is a vector of 2-byte integers, storing the out-degree of each element in the source vector.
• File P_i is defined as containing all packets of rank values with destination in block V_i, in arbitrary order.
17. Split-Accumulate Algorithm (continued)
• For each block i:
  – Initialize block V_i in memory.
  – Accumulate phase:
    • Scan P_i (packets with destinations in V_i) and add the rank value in each packet to the appropriate entry in V_i.
  – Scan O_i and divide each rank value in V_i by its out-degree.
18. Split-Accumulate Algorithm (continued)
  – Split phase:
    • Read L_i, and for each record in L_i consisting of several sources in V_i and a destination in V_j, write one packet with this destination node and the total amount of rank to be transmitted to it from these sources into output file P_j' (which will become file P_j in the next iteration).
• Combining packets is simpler and more efficient.
• No in-memory sorting of packets is needed.
19. Split-Accumulate Algorithm (continued)
• In a nutshell, it splits packets into different buckets by destination, and then directly accumulates rank values using a table.

C_split = Σ_{0≤i<d} (|O_i| + |L_i| + 2·|P_i|)
        = 0.5·|V| + (1+ε')·|L| + 2·|P|
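Sketch of one full pass over the d blocks, with the files P_i, L_i, and O_i modeled as in-memory lists and dicts (a simplification of the paper's disk-based version; `block_of` is an assumed helper mapping a node id to its block):

```python
def split_accumulate_pass(packets, links, out_deg, d, block_of):
    """packets[i] stands for file P_i (incoming (dst, rank) pairs for
    block i), links[i] for file L_i (edges whose source lies in block
    V_i), out_deg for the O_i files.  Returns the new buckets P'_j."""
    new_packets = [[] for _ in range(d)]
    for i in range(d):
        # Accumulate phase: rebuild block V_i from its incoming packets.
        v = {}
        for dst, rank in packets[i]:
            v[dst] = v.get(dst, 0.0) + rank
        for node in v:  # scan O_i: divide each rank by its out-degree
            v[node] /= out_deg[node]
        # Split phase: bucket the outgoing rank by destination block.
        for src, dst in links[i]:
            new_packets[block_of(dst)].append((dst, v.get(src, 0.0)))
    return new_packets

# Hypothetical 4-node ring 0 -> 1 -> 2 -> 3 -> 0 in d = 2 blocks {0,1}, {2,3}.
out = split_accumulate_pass(
    packets=[[(0, 0.25), (1, 0.25)], [(2, 0.25), (3, 0.25)]],
    links=[[(0, 1), (1, 2)], [(2, 3), (3, 0)]],
    out_deg={0: 1, 1: 1, 2: 1, 3: 1},
    d=2,
    block_of=lambda node: node // 2,
)
```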
20. Experimental Setup
• Sun Blade 100 (500 MHz UltraSPARC IIe) running Solaris 8, with a 100 GB, 7200 RPM hard disk.
• Various physical memory configurations: 128 MB, 256 MB, 512 MB, 1 GB, 2 GB
• Simulated 32 MB and 64 MB settings under 128 MB memory.
21. Results for Real Data
• 120 M web pages crawled
• 327 M URLs and 1.33 billion links parsed out.
• After pruning:
  – 44.8 M nodes
  – 653 M edges
  – 15.3 edges/node
22. Results for Real Data (continued)
• No pruning.
• Add back edges for nodes which have 0 out-degree.
• 327 M nodes
• 1.96 billion edges
24. Results for Topic-Sensitive PageRank
[Chart: running times (y-axis 0–4000) for 10 Topics (512M), 20 Topics (512M), 10 Topics (256M), and 20 Topics (256M), comparing the Naïve, Haveliwala's, and Split-Accumulate algorithms]
25. PageRank Formulas
• Basic:
  r(p) = Σ_{q→p} r(q) / d(q)
• Random Jump:
  r^(i)(p) = (1−α)·R^(0)/n + α·Σ_{q→p} r^(i−1)(q) / d(q)
• Topic-Sensitive:
  r^(i)(p) = (1−α)·R^(0)/n + α·Σ_{q→p} r^(i−1)(q) / d(q)   if p is special
  r^(i)(p) = α·Σ_{q→p} r^(i−1)(q) / d(q)                   otherwise
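The piecewise topic-sensitive update could be coded as below. As an assumption, the jump mass (1−α) is spread uniformly over the special (topic) pages so that total rank is conserved, and the graph and page names are hypothetical:

```python
def topic_sensitive_iteration(rank, out_links, special, alpha=0.8):
    """One iteration: the surfer follows a link with probability alpha;
    otherwise it jumps to a uniformly random page in `special` only."""
    jump = (1 - alpha) / len(special)
    new = {p: (jump if p in special else 0.0) for p in out_links}
    for q, targets in out_links.items():
        share = alpha * rank[q] / len(targets)
        for p in targets:
            new[p] += share
    return new

# Hypothetical 3-page cycle, with page "a" as the only topic page.
out_links = {"a": ["b"], "b": ["c"], "c": ["a"]}
rank = {p: 1/3 for p in out_links}
for _ in range(50):
    rank = topic_sensitive_iteration(rank, out_links, special={"a"})
```

Because the jump lands only on topic pages, rank concentrates around them: "a" ends up ranked above the pages further from it on the cycle.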