I/O-Efficient Techniques for
Computing Pagerank
Yen-Yu Chen Qingqing Gan Torsten Suel
Polytechnic University, Brooklyn NY
Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to each page based on the importance of other pages pointing to it. The main advantage of the Pagerank measure is that it is independent of the query posed by a user; this means that it can be precomputed and then used to optimize the layout of the inverted index structure accordingly. However, computing the Pagerank measure requires implementing an iterative process on a massive graph corresponding to billions of web pages and hyperlinks.


  1. I/O-Efficient Techniques for Computing Pagerank
     Yen-Yu Chen, Qingqing Gan, Torsten Suel
     Polytechnic University, Brooklyn NY
  2. Web Graph
     • URL as a node
     • Hyperlink as a directed edge
     • The graph structure represents the World Wide Web
  3. Page Rank
     • Random Surfer model
       – A person who surfs the web by randomly clicking links on visited pages.
     • The PageRank of a page is proportional to the frequency with which a random surfer would visit it.
     • Example graph: R1=0.286, R2=0.286, R3=0.143, R4=0.143, R5=0.143
  4. Practical PageRank
     • Two problems:
       – Rank leak
       – Rank sink
     • Pruning
     • Add back edges
     • Random Jump
     • Example graph (d=0.8): R1=0.154, R2=0.142, R3=0.101, R4=0.313, R5=0.290
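The random-jump fix can be sketched as a small in-memory power iteration. This is a minimal illustration, not the authors' code: the four-node graph is hypothetical (the slide's example graph is not recoverable from the transcript), the function name is made up, and `damping=0.8` is assumed to correspond to the d=0.8 shown on the slide. Every node in the toy graph has at least one out-link, so no rank leaks.

```python
def pagerank_random_jump(out_links, damping=0.8, iterations=50):
    """Power iteration with a uniform random jump.

    out_links maps each node to the list of nodes it links to.
    Assumes every node has at least one out-link (no rank leak).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}  # total rank in the system is 1
    for _ in range(iterations):
        # Every page first receives its share of the random jump.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        # Each page q then passes damping * r(q) / outdegree(q) along each link.
        for q, targets in out_links.items():
            share = damping * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy graph; node 4 has no in-links.
graph = {1: [2, 3], 2: [3], 3: [1], 4: [1, 3]}
ranks = pagerank_random_jump(graph)
print(ranks)
```

A page without in-links, such as node 4 here, ends up with exactly the random-jump share, and the ranks keep summing to 1 in every iteration.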
  5. Topic Sensitive PageRank
     • Modified Random Jump
     • Only jump to certain pages which are related to a specific topic
     • ODP-biasing (figure label: Topic T)
  6. Challenge
     • 3.5 billion pages on the web
     • 49 billion hyperlinks between them
     • Storing a 4-byte PageRank value for every page requires 14 GB
       – Hard to fit in memory
     • Goal: calculate the PageRank values in an I/O-efficient way
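The 14 GB figure follows directly from the page count. The short calculation below reproduces it and also estimates how many memory-sized blocks the rank vector would have to be cut into; the 512 MiB budget is an assumption chosen only for illustration, and it optimistically treats all of that memory as available for rank values.

```python
import math

PAGES = 3_500_000_000   # pages on the web, from the slide
BYTES_PER_RANK = 4      # one 32-bit floating-point rank per page

vector_bytes = PAGES * BYTES_PER_RANK
print(vector_bytes / 10**9, "GB for one rank vector")

# Hypothetical memory budget: 512 MiB, all of it used for rank values.
memory_bytes = 512 * 2**20
blocks_needed = math.ceil(vector_bytes / memory_bytes)
print(blocks_needed, "blocks needed so that each block fits in memory")
```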
  7. I/O Efficient Algorithms
     • Naïve Algorithm
     • Haveliwala’s Algorithm
     • Our contribution:
       – Sort-Merge Algorithm
       – Split-Accumulate Algorithm
  8. Related Work
  9. Naïve Algorithm
     • Two vectors of 32-bit floating-point numbers.
     • The source vector $V$ is on disk.
     • The destination vector $V'$ is in memory.
     • Cost, with $L$ the link file: $C_{naive} = |V| + |V'| + |L| = 2 \cdot |V| + |L|$
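One pass of the naïve scheme might look like the sketch below. Plain Python lists stand in for the on-disk source vector and link file, the record layout `(source, [destinations])` and the function name are assumptions, and the random-jump term with `damping=0.8` is added for completeness rather than taken from this slide.

```python
def naive_iteration(source_ranks, link_records, n, damping=0.8):
    """One pass of the naive scheme.

    source_ranks: iterable of ranks in node order (stands in for the on-disk vector V).
    link_records: iterable of (source, [destinations]) in the same node order
                  (stands in for the link file L).
    The destination vector V' is the only structure that must fit in memory.
    """
    dest = [(1.0 - damping) / n] * n  # V', initialised with the random-jump share
    # V and L are read sequentially and in lockstep, one record at a time.
    for rank, (src, targets) in zip(source_ranks, link_records):
        if not targets:
            continue  # a node without out-links passes on no rank
        share = damping * rank / len(targets)
        for t in targets:
            dest[t] += share
    return dest


# Hypothetical three-node graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [(0, [1, 2]), (1, [2]), (2, [0])]
new_ranks = naive_iteration([1 / 3] * 3, links, n=3)
print(new_ranks)
```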
  10. Haveliwala’s Algorithm
      • Partition the destination vector into d blocks $V_i'$ that each fit into main memory.
      • Partition the link file into d files $L_i$, each containing only the links pointing to nodes in $V_i'$.
      • Cost: $C_h = d \cdot |V| + |V'| + \sum_{0 \le i < d} |L_i| = (d+1) \cdot |V| + (1+\epsilon) \cdot |L|$
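The block partitioning can be sketched as follows. This is an illustrative reading of the slide, not Haveliwala's implementation: the record layout `(source, out_degree, destinations_in_block)`, the function names, and the random-jump term are assumptions. Repeating the source and its out-degree in every partition that the source links into is one plausible origin of the $\epsilon$ overhead in the cost formula.

```python
def partition_links(link_records, n, d):
    """Split the link file into d partitions by destination block."""
    block_size = -(-n // d)  # ceiling division
    parts = [[] for _ in range(d)]
    for src, targets in link_records:
        by_block = {}
        for t in targets:
            by_block.setdefault(t // block_size, []).append(t)
        # The source id and full out-degree are repeated in each partition.
        for b, ts in by_block.items():
            parts[b].append((src, len(targets), ts))
    return parts, block_size


def haveliwala_iteration(source_ranks, parts, block_size, n, damping=0.8):
    """One pass: only one block of the destination vector is held at a time."""
    result = []
    for b, part in enumerate(parts):
        lo = b * block_size
        hi = min(lo + block_size, n)
        block = [(1.0 - damping) / n] * (hi - lo)  # block b of V'
        for src, out_degree, ts in part:           # scan of partition L_b
            # On disk, the source vector V is scanned once per block.
            share = damping * source_ranks[src] / out_degree
            for t in ts:
                block[t - lo] += share
        result.extend(block)                       # block b is written out
    return result


links = [(0, [1, 2]), (1, [2]), (2, [0])]  # hypothetical three-node graph
parts, block_size = partition_links(links, n=3, d=2)
new_ranks = haveliwala_iteration([1 / 3] * 3, parts, block_size, n=3)
print(new_ranks)
```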
  11. Sort-Merge Algorithm
      • The link file is identical to the one in the naïve algorithm.
      • For each link, create a packet that contains the line number of the destination and the amount of rank value that has to be transmitted to that destination.
      • 8-byte packet: 4-byte id + 4-byte floating-point number
  12. Sort-Merge Algorithm (continued)
      • Route packets by sorting them by destination and combining the ranks into the destination node.
      • $|P|$ is the total size of the generated packets, which are written out and read back in once.
      • Cost: $C_{sort-merge} = |V| + |V'| + |L| + 2 \cdot |P|$
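The packet routing can be sketched in a few lines. The in-memory `list.sort` below stands in for the external sort that a disk-based implementation would need, and the function name, the record layout, and the random-jump term are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_iteration(source_ranks, link_records, n, damping=0.8):
    """One pass: emit a packet per link, sort by destination, combine."""
    # Packet = (destination id, rank amount), mirroring the 8-byte packet.
    packets = []
    for src, targets in link_records:
        if targets:
            share = damping * source_ranks[src] / len(targets)
            packets.extend((t, share) for t in targets)

    # Stand-in for an external merge sort over packets stored on disk.
    packets.sort(key=itemgetter(0))

    dest = [(1.0 - damping) / n] * n
    # After sorting, all packets for one destination are adjacent.
    for t, group in groupby(packets, key=itemgetter(0)):
        dest[t] += sum(amount for _, amount in group)
    return dest


links = [(0, [1, 2]), (1, [2]), (2, [0])]  # hypothetical three-node graph
new_ranks = sort_merge_iteration([1 / 3] * 3, links, n=3)
print(new_ranks)
```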
  13. Split-Accumulate Algorithm
      • Splits the source vector into d blocks $V_i$, such that the 4-byte rank values of all nodes in a block fit into memory.
      • Link file $L_i$ contains information on all links with source node in block $V_i$.
      • It is like the reverse of $L_i$ in Haveliwala’s algorithm, but the out-degree information is moved to a separate file.
  14. Split-Accumulate Algorithm (continued)
      • File $O_i$ is a vector of 2-byte integers, storing the out-degree of each element in source block $V_i$.
      • File $P_i$ is defined as containing all packets of rank values with destination in block $V_i$, in arbitrary order.
  15. Split-Accumulate Algorithm (continued)
      • For each iteration i:
        – Initialize block $V_i$ in memory
        – Accumulate phase:
          • Scan $P_i$, whose packets have destinations in $V_i$, and add the rank value in each packet to the appropriate entry in $V_i$.
        – Scan $O_i$ and divide each rank value in $V_i$ by its out-degree.
  16. Split-Accumulate Algorithm (continued)
        – Split phase:
          • Read $L_i$. For each record in $L_i$, consisting of several sources in $V_i$ and a destination in $V_j$, write one packet with this destination node and the total amount of rank to be transmitted to it from these sources into output file $P_j'$ (which will become file $P_j$ in the next iteration).
      • Combining packets is simpler and more efficient.
      • No in-memory sorting of packets is needed.
  17. Split-Accumulate Algorithm (continued)
      • In a nutshell, it splits packets into different buckets by destination and then directly accumulates rank values using a table.
      • Cost: $C_{split} = \sum_{0 \le i < d} (|O_i| + |L_i| + 2 \cdot |P_i|) = 0.5 \cdot |V| + (1+\epsilon') \cdot |L| + 2 \cdot |P| = (1+\epsilon) \cdot |L| + 2 \cdot |P|$
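The accumulate and split phases can be put together in a toy, in-memory sketch. Python lists stand in for the files $L_i$, $O_i$, and $P_i$; all function names, the record layout `(destination, [sources])`, the bootstrap from uniform ranks, and the random-jump term applied during accumulation are assumptions for illustration, not details confirmed by the slides.

```python
def build_files(out_links, block_size):
    """Build per-block link files L_i and out-degree files O_i."""
    n = len(out_links)
    d = -(-n // block_size)  # number of blocks, rounded up
    link_files, degree_files = [], []
    for i in range(d):
        lo, hi = i * block_size, min((i + 1) * block_size, n)
        by_dest = {}
        for src in range(lo, hi):
            for dest in out_links[src]:
                by_dest.setdefault(dest, []).append(src)
        link_files.append(sorted(by_dest.items()))  # (destination, sources in V_i)
        degree_files.append([len(out_links[v]) for v in range(lo, hi)])
    return link_files, degree_files


def split(block, lo, records, degrees, block_size, packet_files):
    """Split phase: divide by out-degree, then write one combined packet per record."""
    shares = [r / deg if deg else 0.0 for r, deg in zip(block, degrees)]
    for dest, sources in records:
        amount = sum(shares[s - lo] for s in sources)
        packet_files[dest // block_size].append((dest, amount))  # bucket P'_j


def split_accumulate_pass(packet_files, link_files, degree_files,
                          block_size, n, damping=0.8):
    """One pass over all blocks; returns the new ranks and the next packet files."""
    next_packets = [[] for _ in link_files]
    ranks = []
    for i, (records, degrees) in enumerate(zip(link_files, degree_files)):
        lo = i * block_size
        block = [0.0] * len(degrees)         # block V_i, held in memory
        for dest, amount in packet_files[i]:  # accumulate phase: scan P_i
            block[dest - lo] += amount
        block = [(1.0 - damping) / n + damping * x for x in block]
        ranks.extend(block)
        split(block, lo, records, degrees, block_size, next_packets)
    return ranks, next_packets


out_links = [[1, 2], [2], [0]]  # hypothetical three-node graph
n, block_size = len(out_links), 2
link_files, degree_files = build_files(out_links, block_size)

# Bootstrap: run only the split phase on uniform initial ranks.
packets = [[] for _ in link_files]
for i, (records, degrees) in enumerate(zip(link_files, degree_files)):
    split([1.0 / n] * len(degrees), i * block_size, records, degrees,
          block_size, packets)

ranks, packets = split_accumulate_pass(packets, link_files, degree_files,
                                       block_size, n)
print(ranks)
```

Packets arrive in each bucket in arbitrary order and are added straight into the in-memory block, which is why no sorting step appears anywhere in the pass.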
  18. Experimental Setup
      • Sun Blade 100 (500 MHz UltraSPARC IIe) running Solaris 8 with a 100 GB, 7200 RPM hard disk.
      • Various physical memory configurations: 128M, 256M, 512M, 1G, 2G
      • Simulated 32M and 64M settings under 128M memory.
  19. Results for Real Data
      • 120 M web pages crawled
      • 327 M URLs and 1.33 billion links parsed out.
      • After pruning:
        – 44.8 M nodes
        – 653 M edges
        – 15.3 edges/node
  20. Results for Real Data (continued)
      • No pruning.
      • Add back edges for nodes that have out-degree 0.
      • 327 M nodes
      • 1.96 billion edges
  21. Results for Scaled Data
  22. Results for Topic-Sensitive PR
      [Chart: Naïve, Haveliwala’s, and Split-Accumulate compared for 10 Topics (512M), 20 Topics (512M), 10 Topics (256M), and 20 Topics (256M); axis from 0 to 4000.]
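  23. Page Rank
      • Basic: $r(p) = \sum_{q \to p} \frac{r(q)}{d(q)}$
      • Random Jump: $r^{(i)}(p) = (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)}$
      • Topic-Sensitive: $r^{(i)}(p) = \begin{cases} (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{if } p \text{ is special} \\ \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{otherwise} \end{cases}$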
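The topic-sensitive update differs from the random-jump one only in which pages receive the jump term. A minimal sketch of a single update step follows, assuming that $n$ is the total number of pages, as in the random-jump formula; with that reading the total rank shrinks over time when only a few pages are special, whereas dividing by the number of special pages instead would keep it constant on a graph without leaks. The function name and the toy graph are hypothetical.

```python
def topic_sensitive_step(rank, out_links, special, alpha=0.8, total=1.0):
    """One update: only pages in `special` receive the random-jump term.

    rank: list of current rank values, indexed by page.
    out_links: list of destination lists, indexed by page.
    special: set of pages related to the topic.
    total: the initial total rank R(0).
    """
    n = len(rank)  # assumption: n counts all pages, not just the special ones
    new_rank = [0.0] * n
    # Link term, identical for special and ordinary pages.
    for q, targets in enumerate(out_links):
        for p in targets:
            new_rank[p] += alpha * rank[q] / len(targets)
    # Jump term, restricted to the special pages.
    for p in special:
        new_rank[p] += (1.0 - alpha) * total / n
    return new_rank


out_links = [[1, 2], [2], [0]]  # hypothetical three-node graph
rank = [1 / 3] * 3
new_rank = topic_sensitive_step(rank, out_links, special={0})
print(new_rank)
```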
