I/O-Efficient Techniques for
Computing Pagerank
Yen-Yu Chen Qingqing Gan Torsten Suel
Polytechnic University, Brooklyn NY
Web Graph
• URL as a node
• Hyperlink as a
directed edge
• The graph structure
represents the World
Wide Web
Page Rank
• Random Surfer model
– A person who surfs the
web by randomly
clicking links on
visited pages.
• PageRank of a page is
proportional to the
frequency with which
a random surfer would
visit it.
[Figure: five-node example graph with R1 = 0.286, R2 = 0.286, R3 = 0.143, R4 = 0.143, R5 = 0.143]
Practical PageRank
• Two problems:
– Rank leak
– Rank sink
• Pruning
• Add back edges
• Random Jump
[Figure: five-node example graph with d = 0.8: R1 = 0.154, R2 = 0.142, R3 = 0.101, R4 = 0.313, R5 = 0.290]
Topic Sensitive PageRank
• Modified Random
Jump
• Only jump to certain
pages which are
related to a specific
topic
• ODP-biasing
Topic T
Challenge
• 3.5 billion pages on the web
• 49 billion hyperlinks between them
• Storing a 4-byte PageRank value per page requires
14 GB, which is hard to fit in memory
• Goal: compute the PageRank values in an
I/O-efficient way
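The 14 GB figure follows directly from the page count and the 4-byte value size. A minimal sketch of the arithmetic (the function name is illustrative, not from the slides):

```python
# Back-of-the-envelope size of a dense rank vector, using the slide's numbers.
PAGES = 3_500_000_000      # pages on the web (from the slide)
BYTES_PER_RANK = 4         # one 32-bit float per page (from the slide)


def rank_vector_bytes(num_pages: int, bytes_per_value: int = BYTES_PER_RANK) -> int:
    """Size in bytes of a vector holding one rank value per page."""
    return num_pages * bytes_per_value


total = rank_vector_bytes(PAGES)
print(f"{total / 1e9:.1f} GB")  # 14.0 GB
```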
I/O Efficient Algorithms
• Naïve Algorithm
• Haveliwala’s Algorithm
• Our contribution:
– Sort-Merge Algorithm
– Split-Accumulate Algorithm
Related Work
Naïve Algorithm
• Two vectors of
32-bit floating-point
numbers.
• Source vector is
on disk
• Destination vector
is in memory.
$C_{naive} = |V| + |V'| + |L| = 2 \cdot |V| + |L|$
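One pass of the naïve scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout `(source, destinations)`, the function name, and the use of Python lists in place of disk files are all assumptions, and the update shown is the basic one without a random jump.

```python
from typing import Iterable, List, Tuple

# Assumed link-file record: (source id, list of destination ids), sorted by source.
LinkRecord = Tuple[int, List[int]]


def naive_iteration(source: Iterable[float], links: Iterable[LinkRecord], n: int) -> List[float]:
    """Stream the source vector V and link file L once; keep V' in memory."""
    dest = [0.0] * n                   # destination vector V', held in memory
    links_iter = iter(links)           # stands in for a sequential scan of L
    record = next(links_iter, None)
    for node, rank in enumerate(source):   # stands in for a sequential scan of V
        if record is not None and record[0] == node:
            outs = record[1]
            if outs:
                share = rank / len(outs)   # rank is divided evenly over out-links
                for t in outs:
                    dest[t] += share
            record = next(links_iter, None)
    return dest


# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
example_links = [(0, [1, 2]), (1, [2]), (2, [0])]
example_ranks = naive_iteration([0.4, 0.2, 0.4], example_links, 3)
```

Each input is read once and the destination vector is written once, which matches the $|V| + |V'| + |L|$ cost, provided the whole destination vector fits in memory.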
Haveliwala’s Algorithm
• Partition the destination vector into d blocks
Vi’, each of which fits into main memory.
• Partition the link file into d files Li , each
containing only the links that point to nodes in Vi’ .
$C_h = d \cdot |V| + |V'| + \sum_{0 \le i < d} |L_i| = (d+1) \cdot |V| + (1+\varepsilon) \cdot |L|$
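The block partitioning can be sketched as below. This is an illustrative in-memory stand-in under stated assumptions: the record layout `(source, total out-degree, destinations in this block)`, the equal-sized blocks, and both function names are hypothetical choices, not taken from the slides.

```python
from typing import List, Tuple

# Assumed record in L_i: (source id, total out-degree, destinations inside block i).
BlockRecord = Tuple[int, int, List[int]]


def partition_links(links: List[Tuple[int, List[int]]], n: int, d: int) -> List[List[BlockRecord]]:
    """Split the link file into d files L_i, one per destination block."""
    block_size = -(-n // d)            # ceiling division
    parts: List[List[BlockRecord]] = [[] for _ in range(d)]
    for src, dests in links:
        per_block = {}
        for t in dests:
            per_block.setdefault(t // block_size, []).append(t)
        for b, ts in per_block.items():
            parts[b].append((src, len(dests), ts))   # source repeated per block
    return parts


def haveliwala_iteration(source: List[float], parts: List[List[BlockRecord]], n: int) -> List[float]:
    """One iteration: for each destination block, rescan the source vector and L_i."""
    d = len(parts)
    block_size = -(-n // d)
    new_ranks: List[float] = []
    for b in range(d):
        lo = b * block_size
        hi = min(n, lo + block_size)
        block = [0.0] * max(0, hi - lo)     # V'_b, the only block held in memory
        for src, outdeg, ts in parts[b]:    # one pass over V and L_b per block
            share = source[src] / outdeg
            for t in ts:
                block[t - lo] += share
        new_ranks.extend(block)             # stands in for writing V'_b to disk
    return new_ranks
```

A source with links into several blocks appears in several files, which is one way to read the $(1+\varepsilon)$ overhead on $|L|$; the $d \cdot |V|$ term comes from rereading the source vector once per block.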
Sort-Merge Algorithm
• The link file is identical to the one in the naïve
algorithm.
• For each link, create a packet that contains
the line number of the destination and the
amount of rank to be
transmitted to that destination.
• 8-byte packet: 4-byte ID + 4-byte floating-point
value
Sort-Merge Algorithm (continued)
• Route packets by sorting them by
destination and combining the ranks into the
destination node.
• |P| is the total size of the generated packets,
which are written out and read back in once.
$C_{\text{sort-merge}} = |V| + |V'| + |L| + 2 \cdot |P|$
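The packet routing can be sketched as follows. This is a simplified illustration: an in-memory `list.sort` stands in for the external sort-merge over disk runs, and the record layout and function names are assumptions. The `struct` format only shows that an ID plus a float fits the 8-byte packet described above.

```python
import struct
from itertools import groupby
from operator import itemgetter
from typing import List, Tuple

# 4-byte unsigned id + 4-byte float, no padding: 8 bytes per packet.
PACKET = struct.Struct("<If")

Packet = Tuple[int, float]   # (destination id, rank amount)


def make_packets(source: List[float], links: List[Tuple[int, List[int]]]) -> List[Packet]:
    """Create one packet per link, carrying the rank share sent along that link."""
    packets: List[Packet] = []
    for src, dests in links:
        share = source[src] / len(dests)
        packets.extend((t, share) for t in dests)
    return packets


def sort_merge_iteration(source: List[float], links: List[Tuple[int, List[int]]], n: int) -> List[float]:
    """Sort packets by destination, then combine packets with equal destination."""
    packets = make_packets(source, links)
    packets.sort(key=itemgetter(0))      # stand-in for an external sort
    dest = [0.0] * n
    for t, group in groupby(packets, key=itemgetter(0)):
        dest[t] = sum(amount for _, amount in group)
    return dest
```

In an on-disk version the packets would be written once and read back once, which is where the $2 \cdot |P|$ term comes from.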
Split-Accumulate Algorithm
• Split the source vector into d blocks Vi
such that the 4-byte rank values of all nodes in a
block fit into memory.
• Link file Li contains information on all
links with source node in block Vi.
• Li is like the reverse of Li in Haveliwala’s algorithm, but
the out-degree information is moved to
a separate file.
Split-Accumulate Algorithm (continued)
• File Oi is a vector of 2-byte integers storing
the out-degree of each element in the source
vector.
• File Pi is defined as containing all packets
of rank values with destination in block Vi,
in arbitrary order.
Split-Accumulate Algorithm (continued)
• In each iteration, for each block i:
– Initialize block Vi in memory
– Accumulate phase:
• Scan Pi, whose packets have destinations in Vi , and add the rank value
in each packet to the appropriate entry in Vi.
– Scan Oi and divide each rank value in Vi by its
out-degree.
Split-Accumulate Algorithm (continued)
– Split phase:
• Read Li. For each record in Li, consisting of
several sources in Vi and a destination in Vj, we
write one packet with this destination node and the
total amount of rank to be transmitted to it from
these sources into output file Pj’ (which will
become file Pj in the next iteration).
Combining packets is simpler and more efficient.
No in-memory sorting of packets is needed.
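The accumulate and split phases can be sketched together as below. This is a minimal in-memory illustration under several assumptions: Python lists stand in for the files Li, Oi and Pi, blocks are equal-sized, the record layout `(destination, sources in Vi)` is inferred from the description above, and the update is the basic one without a random jump. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Packet = Tuple[int, float]          # (destination id, rank amount)
LinkRec = Tuple[int, List[int]]     # (destination id, sources inside block V_i)


def build_files(edges: List[Tuple[int, int]], n: int, d: int):
    """Build link files L_i and out-degree files O_i from a (source, dest) edge list."""
    size = -(-n // d)                                # nodes per block (ceiling)
    outdeg = [0] * n
    for s, _ in edges:
        outdeg[s] += 1
    grouped: List[Dict[int, List[int]]] = [dict() for _ in range(d)]
    for s, t in edges:
        grouped[s // size].setdefault(t, []).append(s)
    L = [sorted(g.items()) for g in grouped]
    O = [outdeg[i * size:(i + 1) * size] for i in range(d)]
    return L, O, size


def split_accumulate_iteration(P: List[List[Packet]], L: List[List[LinkRec]],
                               O: List[List[int]], size: int):
    """One iteration; returns the accumulated ranks and the packet files for the next one."""
    d = len(L)
    P_next: List[List[Packet]] = [[] for _ in range(d)]
    ranks: List[float] = []
    for i in range(d):
        lo = i * size
        V = [0.0] * len(O[i])                        # initialize block V_i in memory
        for t, amount in P[i]:                       # accumulate phase: scan P_i
            V[t - lo] += amount
        ranks.extend(V)
        # scan O_i: divide each rank by its out-degree (0 out-degree sends nothing)
        shares = [v / o if o else 0.0 for v, o in zip(V, O[i])]
        for t, sources in L[i]:                      # split phase: scan L_i
            total = sum(shares[s - lo] for s in sources)
            P_next[t // size].append((t, total))     # one combined packet per record
    return ranks, P_next
```

To start, each node's initial rank can be supplied as a packet in the file of its own block. Packets are only appended to the bucket of the destination block and later added into a table, so no sorting is involved.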
Split-Accumulate Algorithm (continued)
• In a nutshell, it splits packets into different
buckets by destination, and then directly
accumulates rank values using a table.
$C_{split} = \sum_{0 \le i < d} (|O_i| + |L_i| + 2 \cdot |P_i|) = 0.5 \cdot |V| + (1+\varepsilon') \cdot |L| + 2 \cdot |P| = (1+\varepsilon) \cdot |L| + 2 \cdot |P|$
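The four cost formulas from the slides can be put side by side in a small helper. This is only a sketch: the function and parameter names are illustrative, all sizes are assumed to be in the same unit, $|V'|$ is taken to equal $|V|$, and the overhead factors `eps` must be supplied by the caller because the slides do not give their values.

```python
def cost_naive(V: float, L: float) -> float:
    """C_naive = 2|V| + |L| (only valid if the destination vector fits in memory)."""
    return 2 * V + L


def cost_haveliwala(V: float, L: float, d: int, eps: float) -> float:
    """C_h = (d + 1)|V| + (1 + eps)|L| for d destination blocks."""
    return (d + 1) * V + (1 + eps) * L


def cost_sort_merge(V: float, L: float, P: float) -> float:
    """C_sort-merge = |V| + |V'| + |L| + 2|P|, taking |V'| = |V|."""
    return 2 * V + L + 2 * P


def cost_split_accumulate(L: float, P: float, eps: float) -> float:
    """C_split = (1 + eps)|L| + 2|P|; the 0.5|V| out-degree term is folded into eps."""
    return (1 + eps) * L + 2 * P
```

The formulas make the trade-off visible: Haveliwala's cost grows with the number of blocks d, while the split-accumulate cost does not depend on d but pays for writing and reading the packets.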
Experimental Setup
• Sun Blade 100 (500 MHz UltraSPARC IIe)
running Solaris 8 with a 100 GB, 7200 RPM
hard disk.
• Various physical memory configurations:
128 MB, 256 MB, 512 MB, 1 GB, 2 GB
• Simulated 32 MB and 64 MB settings under
128 MB of memory.
Results for Real Data
• 120 M web pages
crawled
• 327 M URLs and 1.33
Billion links parsed
out.
• After pruning:
– 44.8 M nodes
– 653M edges
– 15.3 edges/node
Results for Real Data (continued)
• No pruning.
• Add back edges
for nodes that
have out-degree 0.
• 327 M nodes
• 1.96 Billion
edges
Results for Scaled Data
Results for Topic-Sensitive PR
[Figure: bar chart comparing the Naïve, Haveliwala’s, and Split-Accumulate algorithms for 10 and 20 topics at 512M and 256M of memory; vertical axis from 0 to 4000]
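Page Rank
• Basic: $r(p) = \sum_{q \to p} \frac{r(q)}{d(q)}$
• Random Jump: $r^{(i)}(p) = (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)}$
• Topic-Sensitive: $r^{(i)}(p) = \begin{cases} (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{if } p \text{ is special} \\ \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)} & \text{otherwise} \end{cases}$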
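The random-jump and topic-sensitive updates differ only in which pages receive the jump term. A minimal in-memory sketch of one update step, following the formulas on this slide (the function signature, the adjacency-list inputs, and the use of a set for the special pages are assumptions made for illustration):

```python
from typing import List, Optional, Set


def pagerank_step(r: List[float], in_links: List[List[int]], outdeg: List[int],
                  alpha: float, total_rank: float,
                  special: Optional[Set[int]] = None) -> List[float]:
    """One update r^(i-1) -> r^(i).

    in_links[p] lists the pages q with a link q -> p; outdeg[q] is d(q);
    total_rank plays the role of R^(0). With special=None every page receives
    the jump term (random jump); otherwise only pages in `special` do
    (topic-sensitive).
    """
    n = len(r)
    jump = (1.0 - alpha) * total_rank / n
    new_r = []
    for p in range(n):
        linked = alpha * sum(r[q] / outdeg[q] for q in in_links[p])
        gets_jump = special is None or p in special
        new_r.append(linked + (jump if gets_jump else 0.0))
    return new_r


# Three pages in a cycle: 0 -> 1 -> 2 -> 0
cycle_in_links = [[2], [0], [1]]
cycle_outdeg = [1, 1, 1]
uniform = [1.0 / 3] * 3
step = pagerank_step(uniform, cycle_in_links, cycle_outdeg, alpha=0.8, total_rank=1.0)
```

On the symmetric cycle the uniform vector is unchanged by the random-jump update, which is a quick sanity check for an implementation.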
