Crawling the Infinite Web (WAW 2004 Rome)


  1. Crawling the Infinite Web: Five Levels are Enough
      Ricardo Baeza-Yates and Carlos Castillo, Center for Web Research (www.cwr.cl)
      WAW 2004
  2. Outline
      1 Introduction
      2 Models
      3 Experiments
      4 Summary
  3. Introduction
      • Dynamic page: “a page which is created on request”
      • Dynamic pages with links to other dynamic pages
          • Malicious: loops and/or near-duplicates
          • Legitimate: recommendation systems, calendars, iterative algorithms, etc.
      • The number of pages on the Web can be considered infinite
  8. Conflicting interests
      • Web site administrator: would like to have all of the Web site indexed
      • Search engine administrator: would like to use the available network and storage capacity efficiently
      • Search engine user: would like to find what they are looking for
  11. Our approach
      • Users do not go very deep into Web sites
      • If something is important, it has to be easily reachable
      • We will download only a few levels of each Web site
      • How many levels? How much do you lose?
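The idea of downloading only a few levels of each Web site can be sketched as a depth-limited breadth-first traversal. This is a minimal illustration, not the authors' crawler: `crawl_levels`, `toy_links`, and the toy URLs are hypothetical names, and the "calendar" pages stand in for a dynamic section that links to itself forever.

```python
from collections import deque

def crawl_levels(get_links, start, max_depth):
    """Breadth-first traversal that never expands pages at max_depth.

    get_links is a hypothetical callable returning the links found on a page.
    Returns a dict mapping each reached page to its level (clicks from start).
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # deepest allowed level: keep the page, ignore its links
        for target in get_links(page):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

def toy_links(page):
    """Toy site (an assumption for illustration): an endless calendar."""
    if page == "/":
        return ["/about", "/calendar/0"]
    if page.startswith("/calendar/"):
        month = int(page.rsplit("/", 1)[1])
        return ["/calendar/%d" % (month + 1)]  # always one more page
    return []

reached = crawl_levels(toy_links, "/", max_depth=3)
for page, level in sorted(reached.items(), key=lambda item: item[1]):
    print(level, page)
```

Without the depth limit the traversal of this toy site would never terminate, which is the situation the slides describe as an infinite Web.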
  16. Models
      Navigating a tree ≈ Moving through levels
  17. Actions
      Possible actions at a given level
  18. Type of models we study
      • There is a set of atomic actions A = {next, start/jump, back, stay, prev, fwd}
      • Pr(action | ℓ) is the probability of taking an action at level ℓ
      • Σ_{action ∈ A} Pr(action | ℓ) = 1
      • The probability Pr(next | ℓ) is constant
      • Stationary distribution → how much time users spend at each level
  23. Model A
      • Forwards and backwards one level at a time
      • Birth and death process
  25. Model B
      • Back to first level
      • Birth and death process with extinction
  27. Model C
      • Back to any previous level
      • Birth and death process with extinction and disaster?
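To make the notion of a stationary distribution over levels concrete, here is a rough sketch that computes it numerically by power iteration. Several things are assumptions rather than the paper's definitions: the chain is truncated to a finite number of levels, Model A is read as "forward with probability q, otherwise one level back", Model B as "forward with probability q, otherwise back to level 0", and all function names are hypothetical.

```python
def stationary(row, n_levels, iters=5000):
    """Power iteration on a chain truncated to levels 0 .. n_levels-1.

    row(level, n_levels) returns {target_level: probability} for one level.
    """
    x = [1.0 / n_levels] * n_levels
    for _ in range(iters):
        nxt = [0.0] * n_levels
        for level, mass in enumerate(x):
            for target, p in row(level, n_levels).items():
                nxt[target] += mass * p
        x = nxt
    return x

def model_a(q):
    """Assumed reading of Model A: next with prob. q, else one level back."""
    def row(level, n):
        fwd = min(level + 1, n - 1)   # truncation: the last level reflects
        back = max(level - 1, 0)      # at level 0, "back" stays at level 0
        probs = {}
        probs[fwd] = probs.get(fwd, 0.0) + q
        probs[back] = probs.get(back, 0.0) + (1.0 - q)
        return probs
    return row

def model_b(q):
    """Assumed reading of Model B: next with prob. q, else back to level 0."""
    def row(level, n):
        fwd = min(level + 1, n - 1)
        probs = {fwd: q}
        probs[0] = probs.get(0, 0.0) + (1.0 - q)
        return probs
    return row

x_b = stationary(model_b(0.6), n_levels=30)
print("Model B, share of time in levels 0..4:", sum(x_b[:5]))
```

Under these assumptions the Model B shares decay geometrically with the level, and Model A only concentrates near the start page when q is below 1/2.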
  29. Cumulative probability of levels 0 . . . k
      Based on solutions given in the paper
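The slides do not reproduce the paper's closed-form solutions, so the following is only a worked illustration under one stated assumption: if the share of visits at level i is geometric, (1 - q) q^i (which is what "forward with probability q, otherwise back to the first level" yields), then the cumulative share of levels 0 . . . k is 1 - q^(k+1). The function names are hypothetical.

```python
def coverage(q, k):
    """Cumulative share of levels 0..k, ASSUMING a geometric share (1-q)*q**i."""
    return 1.0 - q ** (k + 1)

def levels_needed(q, target=0.9):
    """Smallest k whose cumulative share reaches the target (0 < q < 1)."""
    k = 0
    while coverage(q, k) < target:
        k += 1
    return k

for q in (0.5, 0.6, 0.7):
    k = levels_needed(q)
    print("q=%.2f: levels 0..%d cover %.3f" % (q, k, coverage(q, k)))
```

For q = 0.6 this gives k = 4, that is, five levels (0 to 4) covering about 92% of the visits under the geometric assumption.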
  30. Experiments
      • Anonymized access logs for 13 Web sites
      • Educational - Commercial - Reference - Organization - Blogs
      • Analysis of access logs to extract ≈ 250,000 user sessions
  33. Distribution of visits per level
  34. Model fitting

      Code  Type                 Country  Model  q     Error
      E1    Educational          Chile    B      0.51  0.88%
      E2    Educational          Spain    B      0.51  2.29%
      E3    Educational          US       B      0.64  0.72%
      C1    Commercial           Chile    B      0.55  0.39%
      C2    Commercial           Chile    B      0.62  5.17%
      R1    Reference            Chile    B      0.54  2.96%
      R2    Reference            Chile    B      0.59  2.75%
      O1    Organization         Italy    C      0.35  2.27%
      O2    Organization         US       B      0.62  2.31%
      OB1   Organization + Blog  Chile    B      0.65  2.07%
      OB2   Organization + Blog  Chile    B      0.72  0.35%
      B1    Blog                 Chile    C      0.79  0.88%
      B2    Blog                 Chile    C      0.63  1.01%
  35. Observed distribution of transitions

      Level  Obs.    Next   Start  Jump   Back   Stay   Prev
      0      247985  0.457  –      0.527  –      0.008  –
      1      120482  0.459  –      0.332  0.185  0.017  –
      2      70911   0.462  0.111  0.235  0.171  0.014  –
      3      42311   0.497  0.065  0.186  0.159  0.017  0.069
      4      27129   0.514  0.057  0.157  0.171  0.009  0.088
      5      17544   0.549  0.048  0.138  0.143  0.009  0.108
      6      10296   0.555  0.037  0.133  0.155  0.009  0.106
      7      6326    0.596  0.033  0.135  0.113  0.006  0.113
      8      4200    0.637  0.024  0.104  0.127  0.006  0.096
      9      2782    0.663  0.015  0.108  0.113  0.006  0.089
      10     2089    0.662  0.037  0.084  0.120  0.005  0.086

      Pr(next) is not constant: a user who has already spent some time in the Web site is more likely to spend some more
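The Obs. column of the transitions table can be turned into a rough per-level share and cumulative coverage. This sketch makes two assumptions: Obs. is read as the number of observations at each level, and only levels 0 to 10 are counted because deeper levels are not listed on the slide. The grid search that fits q to a geometric shape (1 - q) q^i is an illustrative stand-in, not the fitting procedure or error measure used by the authors.

```python
# Obs. column for levels 0..10, copied from the transitions table.
OBS = [247985, 120482, 70911, 42311, 27129, 17544,
       10296, 6326, 4200, 2782, 2089]

def shares(counts):
    """Fraction of observations at each level (relative to listed levels only)."""
    total = sum(counts)
    return [c / total for c in counts]

def cumulative(values):
    """Running sum: share of observations at levels 0..k for each k."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out

def fit_q(observed, steps=999):
    """Grid search for q minimizing squared error against (1-q)*q**i.

    The geometric shape is an ASSUMPTION made for illustration.
    """
    best_q, best_err = None, float("inf")
    for s in range(1, steps + 1):
        q = s / (steps + 1)
        err = sum((o - (1.0 - q) * q ** i) ** 2 for i, o in enumerate(observed))
        if err < best_err:
            best_q, best_err = q, err
    return best_q, best_err

level_share = shares(OBS)
coverage = cumulative(level_share)
q_hat, sse = fit_q(level_share)
print("share of observations in levels 0..4:", round(coverage[4], 4))
print("fitted q (geometric assumption):", q_hat)
```

Among the listed levels, levels 0 to 4 account for roughly 92% of the observations (508,818 of 552,055).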
  37. PageRank and depth
      Cumulative PageRank by levels in the Chilean Web
  38. PageRank and depth
      Correlation of PageRank and depth is low at deeper levels
  39. Summary
      • 90% of the visits are within 4-5 clicks of the home page, except in blogs
      • Simple models try to explain this behavior
      • In the paper: explicit methodology, closed-form solutions to the models, references
  42. Open problems
      • A model which better fits empirical data
      • Analyzing blogs
      • Analyzing the textual content of pages to decide when to stop
      • Relationship of this with the spam detection problem
      • Try adaptive strategies: which factors affect the desired crawling depth in a Web site?
      • There are other ways of defining which pages to download from an infinite set
  48. Questions and comments...