Web Crawling

                                   Carlos Castillo

Baeza-Yates, R. and Castillo, C. (2004). Crawling the infinite Web: five levels are enough. ...
In Latin American Web Conference (WebMedia/LA-WEB), Ribeirão Preto, Brazil. ...
Craswell, N., Crimmins, F., Hawking, D., and Moffat, A. (2004). ...
Accessibility of information on the web. Intelligence, 11(1):32–39. ...

Transcript of "Web Crawling"

  1. Web Crawling. Carlos Castillo, Center for Web Research, Computer Science Department, University of Chile, www.cwr.cl
  2. Outline
     • Motivation
     • Behavior of a crawler: Selection policy, Re-visit policy, Politeness policy, Parallelization policy
     • Scheduling: Short-term scheduling, Long-term scheduling, When to stop crawling
     • Architecture: History, Classification, Implementation
     • Practical issues
     • Summary
     • References
  3. An astronomer watching the sky
  4. The problem of abundance
     • 5 exabytes of new information a year [Lyman and Varian, 2003] (1 exabyte = 10^18 bytes)
     • Most directories no longer encourage administrators to submit their Web sites: they have to find the page on their own
     • Adversarial information retrieval
  7. The bandwidth is expensive
     • “Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in a not only scalable, but efficient way if some reasonable measure of quality or freshness is to be maintained” [Edwards et al., 2001]
     • The cost of a “complete” Web crawl is estimated at $1.5 million USD [Craswell et al., 2004], considering network usage only
  9. Combination of policies
     • Selection policy
     • Re-visit policy
     • Politeness policy
     • Parallelization policy
  13. It is necessary to prioritize
     • No search engine indexes more than 16% of the Web [Lawrence and Giles, 2000]
     • Download only the “important” pages
     • Restrict to only a sub-domain
     • Avoid spamming
  14. Selection based on links
     • Order by Pagerank [Cho et al., 1998]
     • Breadth-first search [Najork and Wiener, 2001]
     • Focused crawling [Chakrabarti et al., 1999], attempting to infer similarity to pages before downloading them
  15. Events
     • Creation, which requires a link
     • Update, which can be either minor or major. Most of the changes are minor, but this is not easy to exploit
     • Deletion, which is more damaging to the search engine’s reputation
  16. Cost functions
     • Freshness:
       $$F_p(t) = \begin{cases} 1 & \text{if } p \text{ is not modified at time } t \\ 0 & \text{otherwise} \end{cases}$$
     • Age:
       $$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified} \\ t - \mathrm{lastmod}(p) & \text{otherwise} \end{cases}$$
     • Depending on the cost function used, the behavior can be different
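The two cost functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the argument `lastmod` stands for lastmod(p) from the slide and is taken to be `None` when the page has not been modified since the local copy was made; the function names are hypothetical.

```python
from typing import Optional


def freshness(t: float, lastmod: Optional[float]) -> int:
    """F_p(t): 1 if the page is not modified at time t, 0 otherwise.

    lastmod is the time of the modification that made the local copy stale,
    or None if the page has not been modified (an assumption of this sketch).
    """
    if lastmod is None or t < lastmod:
        return 1
    return 0


def age(t: float, lastmod: Optional[float]) -> float:
    """A_p(t): 0 if the page is not modified, t - lastmod(p) otherwise."""
    if lastmod is None or t < lastmod:
        return 0.0
    return t - lastmod
```

Freshness only records whether a copy is stale, while age keeps growing the longer a stale copy is left unrefreshed, which is why a scheduler can behave differently depending on which one it optimizes.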
  18. Evolution of freshness and age
  19. Estimating freshness and age
     • Page changes can be modeled as a Poisson process [Brewington et al., 2000]
     • The probability that page p is still up to date (not modified) a time t after it was downloaded is
       $$P(F_p(t) = 1) = e^{-\lambda_p t}$$
     • λ_p can be estimated using historical data, especially if the last-modification date is provided by the server [Cho and Garcia-Molina, 2003]
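As a rough numerical illustration of the Poisson model, the sketch below estimates a change rate from historical observations and plugs it into the exponential formula. The estimator used here (detected changes divided by observation time) is a naive choice assumed for this example, not necessarily the one in the cited work; it undercounts when several changes happen between two visits. The function names are hypothetical.

```python
import math


def estimate_change_rate(changes_detected: int, observation_time: float) -> float:
    """Naive estimate of lambda_p: detected changes per unit of time."""
    if observation_time <= 0:
        raise ValueError("observation_time must be positive")
    return changes_detected / observation_time


def prob_still_fresh(rate: float, t: float) -> float:
    """P(F_p(t) = 1) = exp(-lambda_p * t) under the Poisson model."""
    return math.exp(-rate * t)


# Example: 3 changes seen over 30 days gives a rate of 0.1 changes per day.
rate = estimate_change_rate(3, 30.0)
ten_day_freshness = prob_still_fresh(rate, 10.0)  # equals exp(-1)
```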
  20. Web robots can be a threat
     • They consume network resources
     • They can cause server overload
     • The robot exclusion protocol should be honored [Koster, 1996]
     • The re-visiting period should be reasonable (what is reasonable?)
  21. Robot exclusion
     • Server exclusions:
         Disallow: /cgi-bin
     • Page exclusions:
         <meta name="robots" content="noindex,nofollow,nocache">
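A crawler can check both kinds of exclusion with the Python standard library. The sketch below is a minimal illustration: the `User-agent: *` line, the agent name `examplebot`, the example.com URLs, and the helper names are assumptions added for the example; only the `Disallow` rule and the meta tag come from the slide.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots_txt(robots_lines, user_agent, url):
    """Server exclusion: apply robots.txt rules (given as lines) to a URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Page exclusion: collect directives from <meta name="robots" ...>."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def page_directives(html):
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


rules = ["User-agent: *", "Disallow: /cgi-bin"]
can_fetch_home = allowed_by_robots_txt(rules, "examplebot", "http://example.com/index.html")
can_fetch_cgi = allowed_by_robots_txt(rules, "examplebot", "http://example.com/cgi-bin/search")
```

In this sketch the server rule is consulted before a download, while the page-level directives can only be read after the page has been fetched.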
  23. Objectives
     • Distribute the Web crawling
     • Ideally, no central control point
     • Reduce overhead due to communications
     • Reduce overlap, ideally zero
  24. Types of policies
     • Static assignment: typically a hash function on site names
     • Dynamic assignment: more complicated to handle, usually requires central control
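Static assignment by hashing site names could look like the sketch below. The choice of SHA-1, the function name `assign_crawler`, and the example URLs are assumptions made for illustration; any hash that is stable across processes would serve.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index using a hash of its site name.

    All pages of one site go to the same crawler, so no coordination is
    needed to avoid overlap or to enforce per-site politeness.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


same_site = (
    assign_crawler("http://www.example.com/a.html", 4)
    == assign_crawler("http://www.example.com/b/c.html?x=1", 4)
)
```

A cryptographic digest is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would make different crawler processes disagree on the assignment.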
  25. Problem separation
     • Indexing, downloading, and distributed crawling are done in batches; this can be exploited to separate the problem
     • Short-term scheduling: using the network resources efficiently
     • Long-term scheduling: ordering the crawling process to download important pages first
  27. Short-term scheduling
     • If B is the available bandwidth and S_p is the size of page p, then B_p, the downloading speed for page p, is
       $$B_p = \frac{S_p}{T^*}$$
     • where T^* is the optimal time to use all of the available bandwidth:
       $$T^* = \frac{\sum_p S_p}{B}$$
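A small numerical sketch of these two formulas follows. It assumes that S_p is the size of page p and that sizes and bandwidth use consistent units (for instance bytes and bytes per second); the function names are hypothetical.

```python
def optimal_time(sizes, bandwidth):
    """T* = (sum of S_p) / B: time needed when the bandwidth is fully used."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return sum(sizes) / bandwidth


def per_page_speeds(sizes, bandwidth):
    """B_p = S_p / T*: speed at which each page finishes exactly at T*."""
    t_star = optimal_time(sizes, bandwidth)
    if t_star == 0:
        raise ValueError("total size must be positive")
    return [size / t_star for size in sizes]


# Two pages of 100 and 300 bytes over a 200 bytes/second link.
t_star = optimal_time([100.0, 300.0], 200.0)
speeds = per_page_speeds([100.0, 300.0], 200.0)
```

With these speeds all downloads end at the same instant T^*, and the speeds add up to B, so no bandwidth is left idle.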
  28. Full parallelization
  29. Full serialization
  30. Realistic scenario
  31. Number of active crawlers
  32. Objective
     • Download “important” pages first
     • Download X% of the top Y% pages
     • Cumulative Pagerank vs fraction of the Web: total Pagerank is 1, random strategy should give a straight line
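The evaluation curve described here can be computed as in the following sketch, given a Pagerank value per page and the order in which a strategy downloads them. The toy Pagerank values, page names, and function name are made up for illustration.

```python
def cumulative_pagerank(pagerank, order):
    """Return (fraction of pages downloaded, fraction of Pagerank collected).

    pagerank maps each page to its Pagerank; order is the download order
    produced by some crawling strategy.
    """
    total = sum(pagerank.values())
    curve = []
    collected = 0.0
    for count, page in enumerate(order, start=1):
        collected += pagerank[page]
        curve.append((count / len(order), collected / total))
    return curve


toy_pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical values summing to 1
good_curve = cumulative_pagerank(toy_pagerank, ["a", "b", "c"])
bad_curve = cumulative_pagerank(toy_pagerank, ["c", "b", "a"])
```

A strategy that finds important pages early produces a curve above the diagonal; averaged over many random orders, the curve approaches the straight line mentioned on the slide.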
  34. Strategies
     • Oracle with Pagerank
     • Depth-first search
     • Bigger sites first
     • Partial pagerank calculations
  38. Comparison of strategies [Castillo et al., 2004]
  39. Distribution of visits per level [Baeza-Yates and Castillo, 2004]
  40. Pagerank and depth: cumulative Pagerank by levels in the Chilean Web
  41. Pagerank and depth: correlation of Pagerank and depth is low at deeper levels
  42. First crawlers
     • RBSE spider - size of the Web: 100,000 pages
     • Internet archive crawler - www.archive.org
     • Webcrawler - first search engine powered by a Web crawler
     • Pages were a scarce resource
  44. Second generation
     • Mercator, SPHINX - focused crawling
     • Lycos, Excite, Google - large-scale crawling
     • Parallel crawlers
     • Problem of abundance
  46. Standard architecture
  47. Different crawlers have different focus
     • Different issues:
       • Quality: having “good resources”
       • Representation: having complete copies
       • Freshness: having updated copies
     • A global-scale crawler tries to balance them all
  48. Taxonomy of Web crawlers
  49. Key operations
     • Have I seen this URL?
     • Have I seen this page (or a very similar one)?
     • Which pages should I download next?
     • Store this page
     • Download this batch of pages
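Three of these operations (the URL-seen test, the page-seen test, and choosing the next batch) can be sketched with in-memory structures as below. This is a toy illustration, not the architecture of any crawler named in the slides: the class name `Frontier`, the priority values, and the example URLs are hypothetical. The page test here only detects copies that are identical up to whitespace; finding "very similar" pages would need a similarity technique that this sketch does not include.

```python
import hashlib
import heapq


class Frontier:
    """Toy crawl frontier: seen-URL set, seen-page set, priority queue."""

    def __init__(self):
        self._seen_urls = set()
        self._seen_pages = set()
        self._queue = []  # heap of (-priority, insertion order, url)
        self._counter = 0

    def add_url(self, url, priority):
        """Have I seen this URL? If not, remember it and enqueue it."""
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        heapq.heappush(self._queue, (-priority, self._counter, url))
        self._counter += 1
        return True

    def is_new_page(self, content):
        """Have I seen this page? Compares whitespace-normalized fingerprints."""
        normalized = " ".join(content.split())
        fingerprint = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if fingerprint in self._seen_pages:
            return False
        self._seen_pages.add(fingerprint)
        return True

    def next_batch(self, size):
        """Which pages should I download next? Highest priority first."""
        batch = []
        while self._queue and len(batch) < size:
            batch.append(heapq.heappop(self._queue)[2])
        return batch
```

A large-scale crawler cannot keep these sets in memory for billions of URLs, which is part of why the architecture needs careful optimization.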
  54. The architecture needs to be highly optimized: “While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability” [Shkapenyuk and Suel, 2002].
  55. Problems arise in large crawls: Network and protocol problems; Page contents problems; Server problems.
  56. Network and protocol problems: Variable quality of service; Misconfigured firewalls; Crashing DNS servers; Wrong DNS servers pointing to good hosts.
  57. Server problems: Responses lacking headers; Fancy “error” pages; “Deep Web” pages which could be accessible otherwise; Embedded session-ids in URLs.
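
Embedded session-ids make one page appear under many URLs, which defeats the URL-seen test. One common workaround is to normalize URLs before comparing them. The sketch below is an illustration under assumptions: the parameter names in `SESSION_PARAMS` are a guessed list, not one given in the slides, and a name such as `sid` may be a legitimate parameter on some sites.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of session parameter names; tune it per crawl.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

# Some servers put the session in the path, e.g. /page;jsessionid=ABC123
_PATH_SESSION = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def strip_session_id(url):
    """Return the URL without known session parameters and without its fragment."""
    parts = urlsplit(url)
    path = _PATH_SESSION.sub("", parts.path)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))
```

Applying `strip_session_id` before the URL-seen test lets two URLs that differ only in their session-id count as the same page.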
  58. Page contents problems: High prevalence of duplicates; Browsers are very tolerant; Malformed markup; Physical over logical formatting.
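
Because browsers tolerate malformed markup, a crawler's parser has to tolerate it too. As a minimal sketch, Python's standard `html.parser.HTMLParser` can pull links out of a page with unclosed and mismatched tags without raising an error. The `LinkExtractor` class and `extract_links` helper are hypothetical names for this illustration, not part of any crawler described in the slides.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, ignoring how well-formed the page is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values found in `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser never builds a tree, so unclosed or stray tags do not stop link extraction.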
  59. Summary: Web crawling is studied at multiple levels: Long-term scheduling, page selection; Scalability, parallelization; Practical issues, network usage.
  63. Open problems: Scheduling using historical information; Exploiting the Web’s structure; Adversarial IR: Spam detection before downloading the pages.
  66. References
      Baeza-Yates, R. and Castillo, C. (2004). Crawling the infinite Web: five levels are enough. In Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science, pages 156–167, Rome, Italy. Springer.
      Brewington, B., Cybenko, G., Stata, R., Bharat, K., and Maghoul, F. (2000). How dynamic is the web? In Proceedings of the Ninth Conference on World Wide Web, pages 257–276, Amsterdam, Netherlands.
      Castillo, C., Marin, M., Rodriguez, A., and Baeza-Yates, R. (2004). Scheduling algorithms for Web crawling. In Latin American Web Conference (WebMedia/LA-WEB), Ribeirão Preto, Brazil. IEEE CS Press. (To appear).
      Chakrabarti, S., van den Berg, M., and Dom, B. (1999). Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11–16):1623–1640.
      Cho, J. and García-Molina, H. (2003). Estimating frequency of change. ACM Transactions on Internet Technology, 3(3).
      Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.
      Craswell, N., Crimmins, F., Hawking, D., and Moffat, A. (2004). Performance and cost tradeoffs in web search. In Proceedings of the 15th Australasian Database Conference, pages 161–169, Dunedin, New Zealand.
      Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the Tenth Conference on World Wide Web, pages 106–113, Hong Kong. Elsevier Science.
      Koster, M. (1996). A standard for robot exclusion. http://www.robotstxt.org/wc/exclusion.html.
      Lawrence, S. and Giles, C. L. (2000). Accessibility of information on the web. Intelligence, 11(1):32–39.
      Lyman, P. and Varian, H. R. (2003). How much information. http://www.sims.berkeley.edu/how-much-info-2003.
      Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.
      Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357–368, San Jose, California. IEEE CS Press.