Crawling the Infinite Web (WAW 2004 Rome)
Transcript

  • 1. Crawling the Infinite Web: Five Levels are Enough
      Ricardo Baeza-Yates and Carlos Castillo, Center for Web Research (www.cwr.cl), WAW 2004
      Sections: Outline, Introduction, Models, Experiments, Summary
  • 2. Outline: 1 Introduction, 2 Models, 3 Experiments, 4 Summary
  • 3. Introduction
      • Dynamic page: “a page which is created on request”
      • Dynamic pages with links to other dynamic pages
      • Malicious: loops and/or near-duplicates
      • Legitimate: recommendation systems, calendars, iterative algorithms, etc.
      • The number of pages on the Web can be considered infinite
  • 8. Conflicting interests
      • Web site administrator: would like to have all of the Web site indexed
      • Search engine administrator: would like to use the available network and storage capacity efficiently
      • Search engine user: would like to find what they are looking for
  • 11. Our approach
      • Users do not go very deep into Web sites
      • If something is important, it has to be easily reachable
      • We will download only a few levels of each Web site
      • How many levels? How much do you lose?
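The idea of downloading only a few levels of each Web site can be sketched as a breadth-first traversal that stops following links at a fixed depth. This is a minimal illustration, not the authors' crawler: the `links` dictionary stands in for fetching and parsing pages, and the toy site with an endless calendar is hypothetical.

```python
from collections import deque


def crawl_to_depth(links, start, max_depth):
    """Breadth-first traversal that does not follow links past max_depth.

    links: dict mapping a page to the pages it links to (a hypothetical
           stand-in for downloading a page and extracting its links).
    Returns a dict mapping each visited page to its depth (clicks from start).
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # deepest allowed level: keep the page, ignore its links
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Toy site: a home page, an "about" page, and a calendar whose days
# link to the next day (a long chain standing in for an infinite one).
site = {"/": ["/about", "/cal/1"], "/about": ["/"]}
for day in range(1, 1000):
    site[f"/cal/{day}"] = [f"/cal/{day + 1}"]

# Levels 0..4, i.e. five levels including the home page.
visited = crawl_to_depth(site, "/", max_depth=4)
```

The traversal visits only the pages within four clicks of the start page, however long the calendar chain is.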
  • 16. Models: navigating a tree ≈ moving through levels
  • 17. Actions: possible actions at a given level
  • 18. Type of models we study
      • There is a set of atomic actions A = {next, start/jump, back, stay, prev, fwd}
      • $\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
      • $\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
      • The probability $\Pr(\text{next} \mid \ell)$ is constant
      • Stationary distribution → how much time users spend at each level
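As a rough illustration of what "stationary distribution" means here, the sketch below applies power iteration to a small row-stochastic transition matrix over levels. The matrix is a made-up three-level example (move one level deeper with probability 0.5, otherwise return to level 0, with the last level looping on itself instead of going deeper); it is not taken from the paper, and power iteration is only one of several ways to obtain the distribution.

```python
def stationary(P, iterations=1000, tol=1e-12):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    Repeatedly multiplies a probability vector by P until it stops changing.
    Converges for irreducible, aperiodic chains such as the toy one below.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical chain over levels 0, 1, 2 (each row sums to 1).
P = [
    [0.5, 0.5, 0.0],  # level 0: stay at 0, or go to level 1
    [0.5, 0.0, 0.5],  # level 1: back to 0, or go to level 2
    [0.5, 0.0, 0.5],  # level 2: back to 0, or remain at the last level
]
pi = stationary(P)
```

Each entry of `pi` is the long-run fraction of steps spent at the corresponding level.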
  • 23. Model A
      • Forwards and backwards one level at a time
  • 24. Model A
      • Forwards and backwards one level at a time
      • Birth and death process
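For a birth and death process with constant probabilities, the stationary distribution has a textbook closed form. The sketch below assumes a walker that moves one level deeper with probability q and one level back with probability 1 - q, staying at level 0 when it cannot go back; these boundary details and the names are assumptions for illustration, not necessarily the paper's exact formulation. Under them a stationary distribution exists only for q < 1/2, and it is geometric with ratio q / (1 - q).

```python
def model_a_stationary(q, levels):
    """First `levels` stationary probabilities of a constant birth-death chain.

    Assumes: next with probability q, back with probability 1 - q,
    and at level 0 the walker stays put instead of going back.
    Detailed balance x[i] * q == x[i + 1] * (1 - q) gives a geometric law.
    """
    if not 0 < q < 0.5:
        raise ValueError("a stationary distribution requires 0 < q < 1/2")
    r = q / (1 - q)
    return [(1 - r) * r ** i for i in range(levels)]


q = 0.4
dist = model_a_stationary(q, 50)

# Largest violation of detailed balance between neighbouring levels.
balance_error = max(
    abs(dist[i] * q - dist[i + 1] * (1 - q)) for i in range(len(dist) - 1)
)
```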
  • 25. Model B
      • Back to first level
  • 26. Model B
      • Back to first level
      • Birth and death process with extinction
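A minimal sketch of the "back to first level" idea, assuming only two actions: go one level deeper with constant probability q, or return to level 0 with probability 1 - q. Under that assumption the balance equations give x_0 = 1 - q and x_{i+1} = q x_i, so x_i = (1 - q) q^i. The function name is hypothetical and the paper's own solution may include further actions.

```python
def model_b_stationary(q, levels):
    """First `levels` stationary probabilities when every step is either
    'next' (probability q) or a return to level 0 (probability 1 - q).
    """
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    return [(1 - q) * q ** i for i in range(levels)]


q = 0.6
dist = model_b_stationary(q, 60)

# One step of the chain applied to dist: everything not moving deeper
# returns to level 0, and level i + 1 is fed only from level i.
after_step = [(1 - q) * sum(dist)] + [q * x for x in dist[:-1]]
step_error = max(abs(a - b) for a, b in zip(dist, after_step))
```

Applying one step of the chain leaves the distribution unchanged up to the tiny mass beyond the truncation point, which is what stationarity requires.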
  • 27. Model C
      • Back to any previous level
  • 28. Model C
      • Back to any previous level
      • Birth and death process with extinction and disaster?
  • 29. Cumulative probability of levels 0 … k
      • Based on solutions given in the paper
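Given any per-level probabilities, the cumulative probability of levels 0 … k answers the question "how many levels cover a target share of visits?". The helper below is a hypothetical illustration; the example distribution is a made-up geometric one (each level half as likely as the previous), not one of the paper's solutions.

```python
def levels_for_coverage(dist, target):
    """Smallest k such that dist[0] + ... + dist[k] >= target.

    dist: per-level probabilities, level 0 first.
    Returns None if the listed levels never reach the target.
    """
    cumulative = 0.0
    for k, p in enumerate(dist):
        cumulative += p
        if cumulative >= target:
            return k
    return None


# Hypothetical distribution: 0.5, 0.25, 0.125, ... over 20 levels.
example = [0.5 ** (i + 1) for i in range(20)]
k90 = levels_for_coverage(example, 0.90)
```

For this example the cumulative values are 0.5, 0.75, 0.875, 0.9375, so levels 0 to 3 are the first to cover 90%.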
  • 30. Experiments
      • Anonymized access logs for 13 Web sites
      • Educational - Commercial - Reference - Organization - Blogs
      • Analysis of access logs to extract ≈ 250,000 user sessions
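The slides do not say how sessions were extracted from the logs. One common heuristic, shown here purely as an assumption, is to group hits by visitor and start a new session after a period of inactivity; the 30-minute timeout, the tuple layout and the function name are all hypothetical.

```python
def split_sessions(hits, timeout=1800):
    """Group log hits into sessions.

    hits: iterable of (visitor_id, unix_time, url) tuples.
    A visitor's hit starts a new session when more than `timeout` seconds
    have passed since that visitor's previous hit (an assumed heuristic).
    Returns a list of sessions, each a list of URLs in time order.
    """
    sessions = []
    last_seen = {}  # visitor_id -> (time of last hit, index into sessions)
    for visitor, ts, url in sorted(hits, key=lambda h: h[1]):
        prev = last_seen.get(visitor)
        if prev is None or ts - prev[0] > timeout:
            sessions.append([])
            index = len(sessions) - 1
        else:
            index = prev[1]
        sessions[index].append(url)
        last_seen[visitor] = (ts, index)
    return sessions


hits = [
    ("a", 0, "/"),
    ("a", 60, "/x"),
    ("b", 100, "/"),
    ("a", 4000, "/"),  # more than 30 minutes after a's previous hit
]
sessions = split_sessions(hits)
```

Visitor "a" ends up with two sessions and visitor "b" with one.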
  • 33. Distribution of visits per level
  • 34. Model fitting

      Code  Type                 Country  Model  q     Error
      E1    Educational          Chile    B      0.51  0.88%
      E2    Educational          Spain    B      0.51  2.29%
      E3    Educational          US       B      0.64  0.72%
      C1    Commercial           Chile    B      0.55  0.39%
      C2    Commercial           Chile    B      0.62  5.17%
      R1    Reference            Chile    B      0.54  2.96%
      R2    Reference            Chile    B      0.59  2.75%
      O1    Organization         Italy    C      0.35  2.27%
      O2    Organization         US       B      0.62  2.31%
      OB1   Organization + Blog  Chile    B      0.65  2.07%
      OB2   Organization + Blog  Chile    B      0.72  0.35%
      B1    Blog                 Chile    C      0.79  0.88%
      B2    Blog                 Chile    C      0.63  1.01%
  • 35. Observed distribution of transitions

      Level  Obs.    Next   Start  Jump   Back   Stay   Prev
      0      247985  0.457  –      0.527  –      0.008  –
      1      120482  0.459  –      0.332  0.185  0.017  –
      2      70911   0.462  0.111  0.235  0.171  0.014  –
      3      42311   0.497  0.065  0.186  0.159  0.017  0.069
      4      27129   0.514  0.057  0.157  0.171  0.009  0.088
      5      17544   0.549  0.048  0.138  0.143  0.009  0.108
      6      10296   0.555  0.037  0.133  0.155  0.009  0.106
      7      6326    0.596  0.033  0.135  0.113  0.006  0.113
      8      4200    0.637  0.024  0.104  0.127  0.006  0.096
      9      2782    0.663  0.015  0.108  0.113  0.006  0.089
      10     2089    0.662  0.037  0.084  0.120  0.005  0.086

  • 36. Observed distribution of transitions (same table as slide 35)
      • Pr(next) is not constant: if you have already spent some time in the Web site, you are more likely to spend some more
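The remark that Pr(next) is not constant can be checked directly from the Obs. and Next columns of the table. The sketch below counts how often Pr(next) rises from one level to the next and computes an observation-weighted average, which is one simple way (an assumption here, not the authors' fitting procedure) to summarize the column as a single value.

```python
# (level, observations, Pr(next)) copied from the table of observed transitions.
observed = [
    (0, 247985, 0.457),
    (1, 120482, 0.459),
    (2, 70911, 0.462),
    (3, 42311, 0.497),
    (4, 27129, 0.514),
    (5, 17544, 0.549),
    (6, 10296, 0.555),
    (7, 6326, 0.596),
    (8, 4200, 0.637),
    (9, 2782, 0.663),
    (10, 2089, 0.662),
]

# Number of consecutive level pairs where Pr(next) increases.
increases = sum(1 for a, b in zip(observed, observed[1:]) if b[2] > a[2])

# Observation-weighted average of Pr(next) over all levels.
total_obs = sum(obs for _, obs, _ in observed)
pooled_next = sum(obs * p for _, obs, p in observed) / total_obs
```

Because the shallow levels have far more observations, the weighted average stays close to the values seen at levels 0 to 2, even though Pr(next) is considerably higher at deeper levels.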
  • 37. PageRank and depth
      • Cumulative PageRank by levels in the Chilean Web
  • 38. PageRank and depth
      • Correlation of PageRank and depth is low at deeper levels
  • 39. Summary
      • 90% of the visits are at most 4-5 clicks away from the home page, except in blogs
      • Simple models try to explain this behavior
      • In the paper: explicit methodology, closed-form solutions to the models, references
  • 42. Open problems
      • A model which better fits empirical data
      • Analyzing blogs
      • Analyzing the textual content of pages to decide when to stop
      • Relationship of this with the spam detection problem
      • Try adaptive strategies: what factors affect the desired crawling depth in a Web site?
      • There are other ways of defining which pages to download from an infinite set
  • 48. Questions and comments . . .
