1. Outline Introduction Models Experiments Summary
Crawling the Infinite Web:
Five Levels are Enough
Ricardo Baeza-Yates and Carlos Castillo
Center for Web Research
www.cwr.cl
WAW 2004
R. Baeza-Yates and C. Castillo Center for Web Research
Crawling the Infinite Web
1 Introduction
2 Models
3 Experiments
4 Summary
Introduction
Dynamic page: “a page which is created on request”
Dynamic pages with links to other dynamic pages
Malicious: loops and/or near-duplicates
Legitimate: recommendation systems, calendars, iterative
algorithms, etc.
The number of pages on the Web can be considered infinite
Conflicting interests
Web site administrator: would like to have all of the Web
site indexed
Search engine administrator: would like to use the available
network and storage capacity efficiently
Search engine user: would like to find what they are looking for
Our approach
Users do not go so deep inside Web sites
If something is important it has to be easily reachable
We will download only a few levels of each Web site
How many levels?
How much do you lose?
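Downloading only a few levels of a site can be sketched as a breadth-first traversal that stops following links once a depth limit is reached. This is a minimal illustration rather than the authors' crawler: the function `crawl_to_depth`, the `links` dictionary, and the toy site with an endless calendar are all hypothetical.

```python
from collections import deque

def crawl_to_depth(links, start, max_depth):
    """Return {page: depth} for pages within max_depth clicks of start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not follow links from the deepest allowed level
        for target in links.get(page, []):
            if target not in depth:  # first visit gives the shortest click distance
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical site: a static branch plus a calendar whose "next" links
# never end (only the first few calendar pages are listed here).
site = {
    "/": ["/about", "/cal/1"],
    "/about": ["/team"],
    "/cal/1": ["/cal/2"],
    "/cal/2": ["/cal/3"],
    "/cal/3": ["/cal/4"],
}
found = crawl_to_depth(site, "/", max_depth=2)
```

With `max_depth=2` the traversal keeps the home page and two further levels, so the calendar is cut off after `/cal/2` however long its chain of links is.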
Models
Navigating a tree ≈ Moving through levels
Actions
Possible actions at a given level
Type of models we study
There is a set of atomic actions
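$A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$
$\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
$\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
The probability $\Pr(\text{next} \mid \ell)$ is constant: it does not depend on the level $\ell$
Stationary distribution → how much time users spend at each
level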
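A stationary distribution over levels can be approximated numerically once the action probabilities are written as a transition matrix. The sketch below is only an illustration under stated assumptions: the helper names, the truncation to a finite number of levels (moves past the last level are clamped to it), and the example rule "go deeper with probability q, otherwise restart at level 0" are all hypothetical choices, not the paper's definitions.

```python
def transition_matrix(action_probs, n_levels):
    """Build a row-stochastic matrix over levels 0..n_levels-1.

    action_probs(level) returns {target_level: probability}; targets outside
    the truncated range are clamped so each row still sums to 1.
    """
    matrix = []
    for level in range(n_levels):
        row = [0.0] * n_levels
        for target, prob in action_probs(level).items():
            row[min(max(target, 0), n_levels - 1)] += prob
        assert abs(sum(row) - 1.0) < 1e-9, "action probabilities must sum to 1"
        matrix.append(row)
    return matrix

def stationary_distribution(matrix, iterations=500):
    """Approximate the stationary distribution by power iteration."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist

# Hypothetical rule: one level deeper with probability q, else restart at level 0.
q = 0.5

def next_or_restart(level):
    return {level + 1: q, 0: 1 - q}

pi = stationary_distribution(transition_matrix(next_or_restart, n_levels=20))
```

Each entry `pi[level]` estimates the fraction of time a user following the assumed rule spends at that level; swapping in a different `action_probs` function changes the model without touching the solver.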
Model A
Forwards and backwards one level at a time
Birth and death process
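As a worked sketch, assume Model A means that from level $\ell$ the user moves to level $\ell+1$ with constant probability $q$ and otherwise steps back to level $\ell-1$ (staying put at level 0). The stationary probabilities $\pi_\ell$ then follow from balancing the flow between adjacent levels. The notation $\pi_\ell$ and the exact transition rule are assumptions made for this example; the closed-form solution in the paper is the authoritative one.

```latex
\begin{align*}
\pi_\ell \, q &= \pi_{\ell+1} \, (1 - q)
  && \text{(flow } \ell \to \ell+1 \text{ equals flow } \ell+1 \to \ell\text{)} \\
\pi_\ell &= \pi_0 \left( \frac{q}{1-q} \right)^{\ell} \\
\pi_0 &= 1 - \frac{q}{1-q} = \frac{1 - 2q}{1 - q}
  && \text{(normalization; requires } q < \tfrac{1}{2}\text{)}
\end{align*}
```

Under this assumption no stationary distribution exists for $q \ge 1/2$, because the user drifts ever deeper into the site.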
Model B
Back to first level
Birth and death process with extinction
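As a worked sketch, assume Model B means that from any level the user goes one level deeper with constant probability $q$ and otherwise returns to level 0. Every level then sends a fraction $1-q$ of its mass to level 0, which fixes the stationary probabilities $\pi_\ell$. The notation and the exact transition rule are assumptions made for this example, not a quotation from the paper.

```latex
\begin{align*}
\pi_0 &= (1 - q) \sum_{j \ge 0} \pi_j = 1 - q \\
\pi_\ell &= q \, \pi_{\ell-1} = (1 - q) \, q^{\ell} \\
\sum_{\ell=0}^{k} \pi_\ell &= 1 - q^{k+1}
\end{align*}
```

For example, with $q = 0.5$ the levels 0 to 4 hold $1 - 0.5^{5} \approx 0.97$ of the probability mass under this assumption.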
Model C
Back to any previous level
Birth and death process with extinction and disaster?
Cumulative probability of levels 0 . . . k
Based on solutions given in the paper
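Given a distribution over levels, the cumulative probability of levels 0 to k tells how deep a crawler must go to cover a target share of visits. The sketch below assumes geometric level distributions, with ratio q/(1-q) if users step back one level at a time and ratio q if they jump straight back to the first level. Both shapes, the value of `q`, and the function names are assumptions for illustration, not the paper's solutions.

```python
def geometric_levels(ratio, n_levels):
    """Level probabilities (1 - ratio) * ratio**level for level = 0..n_levels-1."""
    return [(1 - ratio) * ratio ** level for level in range(n_levels)]

def depth_for_coverage(level_probs, coverage):
    """Smallest k such that levels 0..k hold at least `coverage` of the mass."""
    total = 0.0
    for k, prob in enumerate(level_probs):
        total += prob
        if total >= coverage:
            return k
    return None  # coverage not reached within the truncated list

q = 0.4  # hypothetical probability of going one level deeper

# Assumed shapes: ratio q/(1-q) for one-level-at-a-time back steps,
# ratio q for a jump straight back to the first level.
k_step_back = depth_for_coverage(geometric_levels(q / (1 - q), 50), 0.9)
k_restart = depth_for_coverage(geometric_levels(q, 50), 0.9)
```

The same `depth_for_coverage` helper works for any list of level probabilities, so an empirically measured distribution could be passed in instead of a model.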
Experiments
Anonymized access logs for 13 Web sites
Educational - Commercial - Reference - Organization - Blogs
Analysis of access logs to extract ≈ 250,000 user sessions
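One common way to turn access logs into user sessions is to group requests by client and start a new session after a period of inactivity. The sketch below assumes a 30-minute timeout and a simplified record format of `(client_id, unix_time, url)`; both are assumptions for illustration, and the paper describes the methodology actually used.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed threshold

def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (client_id, unix_time, url) records into per-client sessions."""
    by_client = defaultdict(list)
    for client_id, timestamp, url in requests:
        by_client[client_id].append((timestamp, url))

    sessions = []
    for client_id, hits in by_client.items():
        hits.sort()  # chronological order within each client
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((client_id, current))  # gap too long: close session
                current = []
            current.append(url)
        sessions.append((client_id, current))
    return sessions

# Hypothetical anonymized log records.
log = [
    ("u1", 0, "/"),
    ("u1", 60, "/courses"),
    ("u2", 100, "/"),
    ("u1", 4000, "/"),  # more than 30 minutes after u1's previous request
]
sessions = split_sessions(log)
```

Here client `u1` yields two sessions because of the long gap, and `u2` yields one.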
Distribution of visits per level
Model fitting
Code  Type                 Country  Model  q     Error
E1    Educational          Chile    B      0.51  0.88%
E2    Educational          Spain    B      0.51  2.29%
E3    Educational          US       B      0.64  0.72%
C1    Commercial           Chile    B      0.55  0.39%
C2    Commercial           Chile    B      0.62  5.17%
R1    Reference            Chile    B      0.54  2.96%
R2    Reference            Chile    B      0.59  2.75%
O1    Organization         Italy    C      0.35  2.27%
O2    Organization         US       B      0.62  2.31%
OB1   Organization + Blog  Chile    B      0.65  2.07%
OB2   Organization + Blog  Chile    B      0.72  0.35%
B1    Blog                 Chile    C      0.79  0.88%
B2    Blog                 Chile    C      0.63  1.01%
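Fitting a parameter such as q to a site's visits per level can be sketched as a search for the value that minimizes the gap between the observed and the modeled distribution. The example below assumes a geometric model (1 - q) * q**level, a squared-error criterion, and made-up visit counts. None of these is confirmed as the procedure or the error measure behind the table; they only illustrate the idea.

```python
def fit_geometric_q(visits_per_level, steps=1000):
    """Grid-search q minimizing squared error against (1 - q) * q**level."""
    total = sum(visits_per_level)
    observed = [v / total for v in visits_per_level]
    best_q, best_err = None, float("inf")
    for i in range(1, steps):
        q = i / steps
        err = sum(((1 - q) * q ** level - p) ** 2
                  for level, p in enumerate(observed))
        if err < best_err:
            best_q, best_err = q, err
    return best_q, best_err

# Hypothetical visit counts per level for one site (levels 0..5).
visits = [500, 260, 120, 65, 30, 25]
q_hat, sse = fit_geometric_q(visits)
```

A grid search is used here only because it is easy to verify by eye; any one-dimensional optimizer would serve the same purpose.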
Observed distribution of transitions
Level  Obs.    Next   Start  Jump   Back   Stay   Prev
0      247985  0.457  –      0.527  –      0.008  –
1      120482  0.459  –      0.332  0.185  0.017  –
2      70911   0.462  0.111  0.235  0.171  0.014  –
3      42311   0.497  0.065  0.186  0.159  0.017  0.069
4      27129   0.514  0.057  0.157  0.171  0.009  0.088
5      17544   0.549  0.048  0.138  0.143  0.009  0.108
6      10296   0.555  0.037  0.133  0.155  0.009  0.106
7      6326    0.596  0.033  0.135  0.113  0.006  0.113
8      4200    0.637  0.024  0.104  0.127  0.006  0.096
9      2782    0.663  0.015  0.108  0.113  0.006  0.089
10     2089    0.662  0.037  0.084  0.120  0.005  0.086
Pr(next) is not constant: it grows with depth, so a user who has already
spent some time in the Web site is likely to spend some more
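The trend can be checked directly from the "Obs." and "Next" columns of the table of observed transitions. The sketch below copies those two columns and computes a pooled, observation-weighted Pr(next) along with the change between consecutive levels; the function names and the pooling itself are illustrative choices, not part of the original analysis.

```python
# (level, observations, Pr(next)) copied from the table of observed transitions.
observed = [
    (0, 247985, 0.457), (1, 120482, 0.459), (2, 70911, 0.462),
    (3, 42311, 0.497), (4, 27129, 0.514), (5, 17544, 0.549),
    (6, 10296, 0.555), (7, 6326, 0.596), (8, 4200, 0.637),
    (9, 2782, 0.663), (10, 2089, 0.662),
]

def weighted_next_probability(rows):
    """Pr(next) pooled over all levels, weighted by observations per level."""
    total = sum(obs for _, obs, _ in rows)
    return sum(obs * p_next for _, obs, p_next in rows) / total

def next_probability_steps(rows):
    """Change in Pr(next) between each pair of consecutive levels."""
    probs = [p_next for _, _, p_next in rows]
    return [round(b - a, 3) for a, b in zip(probs, probs[1:])]

q_pooled = weighted_next_probability(observed)
steps = next_probability_steps(observed)
```

The pooled value is dominated by the shallow levels, where most observations lie, which is one reason a single constant q can fit reasonably well even though Pr(next) rises at deeper levels.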
PageRank and depth
Cumulative PageRank by levels in the Chilean Web
Correlation of PageRank and depth is low at deeper levels
Summary
90% of the visits fall within 4-5 clicks of the home page,
except in blogs
Simple models try to explain this behavior
In the paper: explicit methodology, closed-form solutions to the
models, references
Open problems
A model which better fits empirical data
Analyzing blogs
Analyzing the textual content of pages to decide when to stop
Relationship of this with the spam detection problem
Try adaptive strategies: which are the factors that affect the
desired crawling depth in a Web site?
There are other ways of defining which pages to download
from an infinite set
Questions and comments . . .