Outline            Motivation             Algorithms       Experiments      Summary              References




                   Scheduling Algorithms for Web Crawling

               C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates

                                             Center for Web Research
                                                   www.cwr.cl


                                                LA-WEB 2004







      Motivation


      Algorithms


      Experiments


      Summary


      References








Motivation



              Web search generates more than 13% of the traffic to Web
              sites [StatMarket, 2003].
              No search engine indexes more than one third of the publicly
              available Web [Lawrence and Giles, 1998].
              If we cannot download all of the pages, we should at least
              download the most “important” ones.








The problem of Web crawling


      We must download pages with sizes given by $P_i$, over a connection
      of bandwidth $B$. Trivial solution: we download all the pages
      simultaneously, each at a speed proportional to its size:

          $$B_i = \frac{P_i}{T^*}$$

      $T^*$ is the optimal time to use all the available bandwidth:

          $$T^* = \frac{\sum_i P_i}{B}$$
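      As a quick check of the two formulas above, the following minimal
      sketch computes the optimal time and the per-page speeds for a toy
      input. The function name, the page sizes and the bandwidth value are
      made up for illustration and do not come from the slides.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T*, [B_i]) for the trivial 'download everything at once' solution."""
    total_size = sum(page_sizes)
    t_star = total_size / bandwidth                 # T* = sum(P_i) / B
    rates = [size / t_star for size in page_sizes]  # B_i = P_i / T*
    return t_star, rates


# Hypothetical example: three pages (sizes in KB) over a 50 KB/s connection.
t_star, rates = optimal_schedule([100.0, 300.0, 600.0], bandwidth=50.0)
print(t_star, rates)
```

      By construction the per-page speeds add up to the full bandwidth, so
      every page finishes at the same instant T*.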








Optimal scenario








Restrictions



              Robot exclusion protocol [Koster, 1995]
              Waiting time between accesses to the same Web site ≈ 10 to 30 seconds
              Web site bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$
              Distribution of Web site sizes is very skewed
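      The first two restrictions can be illustrated with Python's standard
      urllib.robotparser module. This is only a sketch: the robots.txt
      content, the "examplebot" user agent and the example.com URLs are
      invented, and Crawl-delay is a non-standard extension that the slides
      do not mention. The helper min_site_time is a hypothetical name for
      the simple lower bound implied by waiting between consecutive pages
      of one site.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 15
"""


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def min_site_time(num_pages, wait_seconds):
    """Lower bound on the time to fetch one site: one wait between consecutive pages."""
    return max(num_pages - 1, 0) * wait_seconds


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch("examplebot", "http://example.com/index.html")
blocked = rules.can_fetch("examplebot", "http://example.com/private/data.html")
delay = rules.crawl_delay("examplebot")
print(allowed, blocked, delay, min_site_time(1000, delay))
```

      With a wait of this order, a large site needs hours even when the
      network is idle, which is why the waiting time dominates scheduling.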








Distribution of site sizes








Realistic scenario








Number of active robots in a batch








Goal




      If each page has a certain score, capture most of the total value of
      this score by downloading just a fraction of the pages.
      As a measure of quality, we use the total Pagerank of the downloaded
      set vs. the fraction of pages downloaded.
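      One way to picture this quality measure is as a curve: for each
      prefix of the download order, the fraction of pages downloaded and
      the fraction of the total score collected so far. The sketch below
      assumes the score of every page is known in a dictionary; the
      function name and the toy scores are hypothetical.

```python
def cumulative_score_curve(download_order, score):
    """Return [(fraction of pages downloaded, fraction of total score collected)]."""
    total = sum(score.values())
    num_pages = len(score)
    curve = []
    collected = 0.0
    for count, page in enumerate(download_order, start=1):
        collected += score[page]
        curve.append((count / num_pages, collected / total))
    return curve


# Toy scores standing in for Pagerank values (invented numbers).
toy_score = {"a": 0.5, "b": 0.3, "c": 0.2}
good_curve = cumulative_score_curve(["a", "b", "c"], toy_score)
bad_curve = cumulative_score_curve(["c", "b", "a"], toy_score)
print(good_curve)
print(bad_curve)
```

      A better ordering produces a curve that rises faster, since it
      collects more of the total score after the same number of downloads.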








Algorithms




      Algorithms are based on a scheduler with two levels of queues:
              Queue of Web sites
              Queue of Web pages in each Web site
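      A two-level scheduler of this kind might be sketched as follows. The
      slides do not say how each queue is ordered, so this sketch makes two
      assumptions: the site queue is ordered by the earliest time a site
      may be contacted again, and each page queue is ordered by a numeric
      priority supplied by the chosen strategy. The class and method names
      are hypothetical.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Queue of Web sites, plus one queue of pages per Web site."""

    def __init__(self, wait_seconds):
        self.wait = wait_seconds
        self.pages = defaultdict(list)  # site -> heap of (-priority, url)
        self.sites = []                 # heap of (ready_time, site)
        self.queued = set()             # sites currently in the site queue

    def add_page(self, site, url, priority, now=0.0):
        heapq.heappush(self.pages[site], (-priority, url))
        if site not in self.queued:
            heapq.heappush(self.sites, (now, site))
            self.queued.add(site)

    def next_page(self, now):
        """Return (site, url) for the next download, or None if every site must wait."""
        if not self.sites or self.sites[0][0] > now:
            return None
        _, site = heapq.heappop(self.sites)
        _, url = heapq.heappop(self.pages[site])
        if self.pages[site]:
            # The site still has pages: it becomes available after the waiting time.
            heapq.heappush(self.sites, (now + self.wait, site))
        else:
            self.queued.discard(site)
        return site, url


scheduler = TwoLevelScheduler(wait_seconds=15)
scheduler.add_page("a.cl", "a.cl/1", priority=0.2)
scheduler.add_page("a.cl", "a.cl/2", priority=0.9)
scheduler.add_page("b.cl", "b.cl/1", priority=0.5)
first = scheduler.next_page(now=0)
second = scheduler.next_page(now=0)
third = scheduler.next_page(now=1)
fourth = scheduler.next_page(now=15)
print(first, second, third, fourth)
```

      With only two sites queued, the scheduler runs out of work after two
      downloads and has to idle until the waiting time expires.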








Queues used for the scheduling








Algorithms based on Pagerank


              Optimal/Oracle: crawler asks for the Pagerank value of each
              page in the frontier using an “Oracle”. This is not available in
              a real crawl as we do not have the entire graph
              The average relative error for estimating the Pagerank four
              months ahead is about 78% [Cho and Adams, 2004], so
              historical information from previous crawls is not too useful
              Batch-Pagerank: Pagerank calculations are executed over
              the subset of known pages [Cho et al., 1998]
              Partial-Pagerank: a “temporary” Pagerank value is assigned
              to the pages in between batch-Pagerank calculations
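      To make the Batch-Pagerank idea concrete, here is a minimal power
      iteration over the subgraph of known pages. It is a textbook-style
      sketch, not the authors' implementation: the damping factor of 0.85,
      the fixed number of iterations, the uniform treatment of pages
      without known out-links and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the known graph: {page: [linked pages]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        # Rank held by pages with no known out-links is spread uniformly.
        dangling = sum(rank[page] for page in nodes if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in nodes}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph of known pages; "d" is in the frontier (known but not downloaded).
known_links = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"]}
ranks = pagerank(known_links)
frontier_order = sorted(["b", "d"], key=ranks.get, reverse=True)
print(ranks, frontier_order)
```

      In a crawler the scores would be recomputed only every so often,
      which is why pages discovered in between need a temporary value.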







Algorithms not based on Pagerank



              Depth: pages are given a priority based on their depth. This
              is graph traversal in breadth-first order
              [Najork and Wiener, 2001]
              Length: pages from the Web sites that seem to be bigger
              are crawled first. The true site sizes are unknown until
              the end of the crawl, so we use the partial information
              available so far
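      Both strategies can be sketched in a few lines. For Depth, pages are
      visited breadth-first from the home pages. For Length, the sketch
      assumes that "seem to be bigger" is measured by the number of pages
      currently known for each site, which is one reading of the partial
      information mentioned above. Function names and toy data are
      hypothetical.

```python
from collections import deque


def depth_order(links, home_pages):
    """Breadth-first order from the home pages, with the depth of each page."""
    depth = {url: 0 for url in home_pages}
    queue = deque(home_pages)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return order, depth


def length_order(known_pages_by_site):
    """Sites with the most known pages first (an estimate of site size)."""
    return sorted(known_pages_by_site,
                  key=lambda site: len(known_pages_by_site[site]),
                  reverse=True)


toy_links = {"home": ["a", "b"], "a": ["c"], "b": ["c"]}
order, depth = depth_order(toy_links, ["home"])
site_order = length_order({"small.cl": ["p1"], "big.cl": ["p1", "p2", "p3"]})
print(order, depth, site_order)
```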








Experiments



              Download a sample of pages using the WIRE crawler
              [Baeza-Yates and Castillo, 2002]
              3.5 million pages from over 50,000 Web sites in .CL
              At most 25,000 pages from each Web site
              Strategies are simulated on a graph built using actual data
              Simulation includes: bandwidth saturation, network speed of
              different Web sites, page sizes, waiting time, latency, etc.








Simulation parameters



              Algorithm
              Waiting time between pages from the same Web site w
              Number of pages downloaded per connection when re-using
              the HTTP connection k
              Number of robots r
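      One possible way to organize these four parameters in a simulator is
      a small record type plus a helper that enumerates every combination
      to run. The field names follow the symbols w, k and r used above; the
      type name, the helper and all concrete values below are invented for
      illustration and are not the settings used in the experiments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimulationParams:
    algorithm: str  # scheduling strategy, e.g. "depth" or "length"
    w: float        # waiting time between pages from the same Web site (seconds)
    k: int          # pages downloaded per re-used HTTP connection
    r: int          # number of robots


def parameter_grid(algorithms, waits, pages_per_connection, robots):
    """Enumerate every combination of the four simulation parameters."""
    return [SimulationParams(a, w, k, r)
            for a, w, k, r in product(algorithms, waits, pages_per_connection, robots)]


# Hypothetical values, chosen only to show the shape of the grid.
grid = parameter_grid(["depth", "length"], [15.0, 30.0], [1, 10], [1])
print(len(grid), grid[0])
```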








Results with one robot








Results with many robots








Speed-ups with the “Length” strategy








Crawling the real Web using the “Length” strategy








Pagerank vs day of crawl








Depth is not correlated with Pagerank
      When depth is ≥ 2 links from the home page








Summary



              The restrictions, especially the waiting time, create a
              difficult scheduling problem
              A strategy with an “oracle” was too greedy
              We try to keep Web sites in the frontier for as long as
              possible, so we always have several Web sites to choose from
              Simulation ensures the same conditions, which is critical
              because the Web is very dynamic








Open problems




              Scheduling using historical information
              Exploiting the Web’s structure
              Adversarial IR: Spam detection before downloading the pages








References

             Baeza-Yates, R. and Castillo, C. (2002).
             Balancing volume, quality and freshness in web crawling.
             In Soft Computing Systems - Design, Management and
             Applications, pages 565–572, Santiago, Chile. IOS Press
             Amsterdam.
             Cho, J. and Adams, R. (2004).
             Page quality: In search of an unbiased Web ranking.
             Technical report, UCLA Computer Science.
             Cho, J., García-Molina, H., and Page, L. (1998).
             Efficient crawling through URL ordering.
             In Proceedings of the seventh conference on World Wide Web,
             Brisbane, Australia.
             Koster, M. (1995).
             Robots in the web: threat or treat?
             ConneXions, 9(4).
             Lawrence, S. and Giles, C. L. (1998).
             Searching the World Wide Web.
             Science, 280(5360):98–100.
             Najork, M. and Wiener, J. L. (2001).
             Breadth-first crawling yields high-quality pages.
             In Proceedings of the Tenth Conference on World Wide Web,
             pages 114–118, Hong Kong. Elsevier Science.
             StatMarket (2003).
             Search engine referrals nearly double worldwide.
             http://websidestory.com/pressroom/pressreleases.html?id=181




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling
Outline            Motivation             Algorithms     Experiments      Summary              References




C. Castillo, M. Marin, A. Rodr´
                              ıguez and R. Baeza-Yates                 Center for Web Research www.cwr.cl
Scheduling Algorithms for Web Crawling

More Related Content

What's hot

Components of a search engine
Components of a search engineComponents of a search engine
Components of a search enginePrimya Tamil
 
Perception in artificial intelligence
Perception in artificial intelligencePerception in artificial intelligence
Perception in artificial intelligenceMinakshi Atre
 
Web Design & Development - Session 1
Web Design & Development - Session 1Web Design & Development - Session 1
Web Design & Development - Session 1Shahrzad Peyman
 
Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.Heart Disease Identification Method Using Machine Learnin in E-healthcare.
Heart Disease Identification Method Using Machine Learnin in E-healthcare.SUJIT SHIBAPRASAD MAITY
 
ppt of web designing and development
ppt of web designing and developmentppt of web designing and development
ppt of web designing and development47ishu
 
WEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMWEB BASED INFORMATION RETRIEVAL SYSTEM
WEB BASED INFORMATION RETRIEVAL SYSTEMSai Kumar Ale
 
Project on disease prediction
Project on disease predictionProject on disease prediction
Project on disease predictionKOYELMAJUMDAR1
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search EngineNIKHIL NAIR
 
Probabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsProbabilistic information retrieval models & systems
Probabilistic information retrieval models & systemsSelman Bozkır
 
2. seo (lecture notes)
2. seo (lecture notes)2. seo (lecture notes)
2. seo (lecture notes)Ebele uchendu
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Introduction to Web Architecture
Introduction to Web ArchitectureIntroduction to Web Architecture
Introduction to Web ArchitectureChamnap Chhorn
 
DBMS LECTURE NOTES FOR AKTU
DBMS LECTURE NOTES FOR AKTU DBMS LECTURE NOTES FOR AKTU
DBMS LECTURE NOTES FOR AKTU Sunit Mishra
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slidesmahavir_a
 

Scheduling Algorithms for Web Crawling

  • 1. Scheduling Algorithms for Web Crawling. C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates, Center for Web Research (www.cwr.cl). LA-WEB 2004.
  • 2. Outline: Motivation, Algorithms, Experiments, Summary, References.
  • 3-5. Motivation: Web search generates more than 13% of the traffic to Web sites [StatMarket, 2003]. No search engine indexes more than one third of the publicly available Web [Lawrence and Giles, 1998]. If we cannot download all of the pages, we should at least download the most “important” ones.
  • 6. The problem of Web crawling: We must download pages with sizes given by $P_i$ over a connection of bandwidth $B$. Trivial solution: we download all the pages simultaneously, each at a speed proportional to its size, $B_i = P_i / T^*$, where $T^*$ is the optimal time to use all the available bandwidth, $T^* = \sum_i P_i / B$.
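The trivial solution on slide 6 can be checked with a small worked example. This is only a sketch: the function name `optimal_schedule` and the page sizes and bandwidth below are hypothetical values chosen for illustration, not data from the talk.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T*, per-page rates) for the all-at-once solution.

    T* = sum(P_i) / B and B_i = P_i / T*, so the rates add up to B.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    t_star = sum(page_sizes) / bandwidth
    if t_star == 0:
        # Nothing to download: every rate is zero.
        return 0.0, [0.0 for _ in page_sizes]
    rates = [size / t_star for size in page_sizes]
    return t_star, rates


# Hypothetical example: three pages (sizes in KB) over a 100 KB/s link.
sizes = [100.0, 300.0, 600.0]
t_star, rates = optimal_schedule(sizes, bandwidth=100.0)
print(t_star)  # time at which all downloads finish together
print(rates)   # bandwidth share assigned to each page
```

Every page finishes at the same instant $T^*$, which is why this schedule keeps the link saturated for the whole crawl.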
  • 7. Optimal scenario
  • 8-11. Restrictions: Robot exclusion protocol [Koster, 1995]. Waiting time ≈ 10-30 seconds. Web sites' bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$. The distribution of Web site sizes is very skewed.
  • 12. Distribution of site sizes
  • 13. Realistic scenario
  • 14. Number of active robots in a batch
  • 15. Goal: If each page has a certain score, capture most of the total value of this score by downloading just a fraction of the pages. We will use the total Pagerank of the downloaded set vs. the fraction of downloaded pages as a measure of quality.
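The quality measure on slide 15 can be sketched as a curve of captured score against the fraction of pages downloaded. The function name `cumulative_score_curve` and the toy Pagerank values are assumptions made for illustration; the talk does not give an implementation.

```python
def cumulative_score_curve(download_order, score):
    """Return [(fraction of pages downloaded, fraction of total score captured)].

    `score` maps every page to its score (here, Pagerank);
    `download_order` lists the pages in the order a strategy fetched them.
    """
    total = sum(score.values())
    n = len(score)
    captured = 0.0
    curve = []
    for i, page in enumerate(download_order, start=1):
        captured += score[page]
        curve.append((i / n, captured / total))
    return curve


# Hypothetical Pagerank values for four pages.
pagerank = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
curve = cumulative_score_curve(["a", "b", "c", "d"], pagerank)
for fraction, captured in curve:
    print(fraction, captured)
```

A strategy whose curve rises faster captures more of the total score early in the crawl, which is the sense in which one ordering is better than another.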
  • 16-18. Algorithms: Algorithms are based on a scheduler with two levels of queues: a queue of Web sites, and a queue of Web pages in each Web site.
  • 19. Queues used for the scheduling
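A minimal sketch of a scheduler with two levels of queues follows. It is not the WIRE implementation: the class name `TwoLevelScheduler`, ordering the site queue by the earliest time each site may be contacted again, and the 15-second wait are all assumptions chosen to illustrate how the waiting-time restriction interacts with per-site page queues.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Queue of Web sites plus one priority queue of pages per site."""

    def __init__(self, wait_seconds):
        self.wait = wait_seconds
        self.pages = defaultdict(list)  # site -> heap of (-priority, url)
        self.sites = []                 # heap of (next allowed time, site)
        self.queued = set()             # sites currently in the site queue
        self.ready_at = {}              # site -> earliest next access time

    def add_page(self, site, url, priority, now=0.0):
        heapq.heappush(self.pages[site], (-priority, url))
        if site not in self.queued:
            when = max(now, self.ready_at.get(site, 0.0))
            heapq.heappush(self.sites, (when, site))
            self.queued.add(site)

    def next_page(self, now):
        """Return (site, url), or None if every queued site must still wait."""
        if not self.sites or self.sites[0][0] > now:
            return None
        _, site = heapq.heappop(self.sites)
        _, url = heapq.heappop(self.pages[site])
        self.ready_at[site] = now + self.wait
        if self.pages[site]:
            # Site still has pages: requeue it after the waiting time.
            heapq.heappush(self.sites, (self.ready_at[site], site))
        else:
            self.queued.discard(site)
        return site, url


# Hypothetical sites, URLs and priorities.
s = TwoLevelScheduler(wait_seconds=15)
s.add_page("a.cl", "a.cl/1", priority=0.9)
s.add_page("a.cl", "a.cl/2", priority=0.5)
s.add_page("b.cl", "b.cl/1", priority=0.7)
first = s.next_page(now=0.0)
second = s.next_page(now=0.0)
third = s.next_page(now=0.0)    # only a.cl is left, and it must wait
fourth = s.next_page(now=15.0)
print(first, second, third, fourth)
```

With only one site left in the queue the robot sits idle until the wait expires, which illustrates why having several sites to choose from matters.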
  • 20-22. Algorithms based on Pagerank: Optimal/Oracle: the crawler asks for the Pagerank value of each page in the frontier using an “Oracle”. This is not available in a real crawl, as we do not have the entire graph. The average relative error for estimating the Pagerank four months ahead is about 78% [Cho and Adams, 2004], so historical information from previous crawls is not too useful. Batch-Pagerank: Pagerank calculations are executed over the subset of known pages [Cho et al., 1998]. Partial-Pagerank: a “temporary” Pagerank value is assigned to the pages in between batch-Pagerank calculations.
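The batch-Pagerank idea, running the calculation over only the pages known so far, can be sketched with a plain power iteration. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without known out-links are common textbook choices assumed here; the talk does not state which variant was used.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the known subgraph.

    `links` maps each known page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # No known out-links: spread this page's rank evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Hypothetical known subgraph: a home page and two pages linking back to it.
known = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
ranks = pagerank(known)
frontier_order = sorted(ranks, key=ranks.get, reverse=True)
print(frontier_order[0])
```

Sorting the frontier by these values gives the download order until the next batch recalculation.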
  • 23-24. Algorithms not based on Pagerank: Depth: pages are given a priority based on their depths. This is graph traversal in breadth-first ordering [Najork and Wiener, 2001]. Length: pages from the Web sites which seem to be bigger are crawled first. We do not know which are really the bigger Web sites until the end of the crawl, so we use partial information.
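The Depth strategy amounts to a breadth-first traversal from each site's home page. The sketch below assigns a depth to every reachable page; the function name `depth_priorities` and the toy link graph are hypothetical.

```python
from collections import deque


def depth_priorities(links, home):
    """Return {page: depth}, the number of links from `home`, via breadth-first search."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical link graph of one small site.
links = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": ["/"]}
depth = depth_priorities(links, "/")
crawl_order = sorted(depth, key=lambda p: (depth[p], p))  # shallowest first
print(crawl_order)
```

Pages with a smaller depth are crawled first; ties are broken alphabetically here only to make the order deterministic.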
  • 25-29. Experiments: Download a sample of pages using the WIRE crawler [Baeza-Yates and Castillo, 2002]: 3.5 million pages from over 50,000 Web sites in .CL, with at most 25,000 pages from each Web site. Strategies are simulated on a graph built using actual data. The simulation includes bandwidth saturation, network speed of different Web sites, page sizes, waiting time, latency, etc.
  • 30-33. Simulation parameters: the algorithm; the waiting time between pages from the same Web site, $w$; the number of pages downloaded per connection when re-using the HTTP connection, $k$; the number of robots, $r$.
  • 34. Results with one robot
  • 35. Results with many robots
  • 36. Speed-ups with the “Length” strategy
  • 37. Crawling the real Web using the “Length” strategy
  • 38. Pagerank vs. day of crawl
  • 39. Depth is not correlated with Pagerank when depth is ≥ 2 links from the home page
  • 40-43. Summary: The restrictions, especially waiting time, create a difficult problem for scheduling. A strategy with an “oracle” was too greedy. We try to keep Web sites in the frontier for as long as possible, so we always have several Web sites to choose from. Simulation ensures the same conditions for every strategy, which is critical because the Web is very dynamic.
  • 44. Open problems
        Scheduling using historical information
        Exploiting the Web’s structure
        Adversarial IR: spam detection before downloading the pages
  • 47. References
        Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in web crawling. In Soft Computing Systems - Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam.
        Cho, J. and Adams, R. (2004). Page quality: In search of an unbiased Web ranking. Technical report, UCLA Computer Science.
        Cho, J., García-Molina, H., and Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.
        Koster, M. (1995). Robots in the web: threat or treat? ConneXions, 9(4).
  • 48. References (continued)
        Lawrence, S. and Giles, C. L. (1998). Searching the World Wide Web. Science, 280(5360):98–100.
        Najork, M. and Wiener, J. L. (2001). Breadth-first crawling yields high-quality pages. In Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong. Elsevier Science.
        StatMarket (2003). Search engine referrals nearly double worldwide. http://websidestory.com/pressroom/pressreleases.html?id=181.