1.
Outline Motivation Algorithms Experiments Summary References
Scheduling Algorithms for Web Crawling
C. Castillo, M. Marin, A. Rodríguez and R. Baeza-Yates
Center for Web Research
www.cwr.cl
LA-WEB 2004
2.
Motivation
Algorithms
Experiments
Summary
References
3.
Motivation
Web search generates more than 13% of the traffic to Web
sites [StatMarket, 2003].
No search engine indexes more than one third of the publicly
available Web [Lawrence and Giles, 1998].
If we cannot download all of the pages, we should at least
download the most “important” ones.
6.
The problem of Web crawling
We must download pages with sizes given by $P_i$ over a connection
of bandwidth $B$. Trivial solution: we download all the pages
simultaneously, each at a speed proportional to its size:
$$B_i = \frac{P_i}{T^*}$$
$T^*$ is the optimal time, which uses all the available bandwidth:
$$T^* = \frac{\sum_i P_i}{B}$$
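As a small worked example of the trivial solution, the sketch below computes the optimal time and the per-page speeds. The function name, the page sizes and the bandwidth value are hypothetical and chosen only for illustration.

```python
def optimal_schedule(page_sizes, bandwidth):
    """Return (T*, [B_i]) for the trivial solution.

    T* = sum(P_i) / B is the time needed when the whole bandwidth is used,
    and B_i = P_i / T* is the speed assigned to page i.
    """
    t_star = sum(page_sizes) / bandwidth
    speeds = [p / t_star for p in page_sizes]
    return t_star, speeds


# Hypothetical values: three pages (bytes) and a 50,000 bytes/s connection.
sizes = [10_000, 30_000, 60_000]
B = 50_000
t_star, speeds = optimal_schedule(sizes, B)
print(t_star)   # 2.0 seconds
print(speeds)   # [5000.0, 15000.0, 30000.0]
```

The speeds add up to the full bandwidth, so every page finishes at the same instant $T^*$.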
7.
Optimal scenario
8.
Restrictions
Robot exclusion protocol [Koster, 1995]
Waiting time between requests to the same Web site ≈ 10–30 seconds
Web site bandwidth $B_i^{MAX}$ is lower than the crawler bandwidth $B$
Distribution of Web site sizes is very skewed
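To see how these restrictions interact, here is a rough sketch of a lower bound on the time needed to download a single Web site. It assumes a simplified model that is not taken from the slides: one connection per site, a fixed waiting time between consecutive pages, and transfers running at the site's maximum bandwidth. All numeric values are hypothetical.

```python
def site_time_lower_bound(page_sizes, site_bandwidth_max, wait_seconds):
    """Lower bound (seconds) to download one site under the assumed model:
    transfer time at the site's maximum bandwidth, plus one waiting period
    between each pair of consecutive pages."""
    if not page_sizes:
        return 0.0
    transfer = sum(page_sizes) / site_bandwidth_max
    waiting = (len(page_sizes) - 1) * wait_seconds
    return transfer + waiting


# Hypothetical sites: 20 kB pages, 100 kB/s site bandwidth, 15 s waiting time.
small_site = site_time_lower_bound([20_000] * 10, 100_000, 15)
large_site = site_time_lower_bound([20_000] * 25_000, 100_000, 15)
print(small_site)  # 137.0
print(large_site)  # 379985.0
```

Under these assumptions the waiting time dominates the transfer time, and a single very large site can determine the length of the whole crawl, which is why the skewed distribution of site sizes matters.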
12.
Distribution of site sizes
13.
Realistic scenario
14.
Number of active robots in a batch
15.
Goal
If each page has a certain score, capture most of the total score
while downloading only a fraction of the pages.
As a measure of quality, we use the total Pagerank of the downloaded
set vs. the fraction of pages downloaded.
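The quality measure can be sketched as a cumulative curve: for each prefix of a download order, the fraction of pages downloaded against the fraction of the total score captured. The function name and the scores below are hypothetical.

```python
def cumulative_score_curve(download_order, score):
    """Return a list of (fraction of pages downloaded, fraction of total score)
    points, one per downloaded page, for a given download order."""
    total = sum(score.values())
    n = len(score)
    curve = []
    captured = 0.0
    for i, page in enumerate(download_order, start=1):
        captured += score[page]
        curve.append((i / n, captured / total))
    return curve


# Hypothetical Pagerank-like scores for four pages.
score = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
good = cumulative_score_curve(["a", "b", "c", "d"], score)
bad = cumulative_score_curve(["d", "c", "b", "a"], score)
print(good[1])  # (0.5, 0.75): half of the pages capture 75% of the score
print(bad[1])   # (0.5, 0.25): same effort, only 25% of the score
```

A better scheduling strategy yields a curve that rises faster, so the two orderings can be compared at any fraction of the crawl.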
16.
Algorithms
Algorithms are based on a scheduler with two levels of queues:
Queue of Web sites
Queue of Web pages in each Web site
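A minimal sketch of a two-level scheduler follows. The class and method names are hypothetical, and two choices are assumptions rather than details from the slides: a site is ranked by the priority of its best pending page, and sites that are currently busy (for example, still in their waiting time) are skipped.

```python
import heapq
from collections import defaultdict


class TwoLevelScheduler:
    """Sites form the first level; each site keeps its own page queue."""

    def __init__(self):
        # site -> heap of (-priority, url); negated so the best page pops first
        self._pages = defaultdict(list)

    def add_page(self, site, url, priority):
        heapq.heappush(self._pages[site], (-priority, url))

    def next_page(self, busy_sites=()):
        """Pick the non-busy site whose best pending page ranks highest,
        then pop that page. Returns (site, url), or None if nothing is ready."""
        best_site, best_key = None, None
        for site, heap in self._pages.items():
            if site in busy_sites or not heap:
                continue
            if best_key is None or heap[0] < best_key:
                best_site, best_key = site, heap[0]
        if best_site is None:
            return None
        _, url = heapq.heappop(self._pages[best_site])
        return best_site, url


scheduler = TwoLevelScheduler()
scheduler.add_page("a.cl", "/", 3)
scheduler.add_page("a.cl", "/x", 1)
scheduler.add_page("b.cl", "/", 2)
first = scheduler.next_page()                      # ("a.cl", "/")
second = scheduler.next_page(busy_sites={"a.cl"})  # ("b.cl", "/")
third = scheduler.next_page()                      # ("a.cl", "/x")
```

The linear scan over sites keeps the sketch short; a real crawler would more likely keep the sites in their own priority queue.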
19.
Queues used for the scheduling
20.
Algorithms based on Pagerank
Optimal/Oracle: the crawler asks an “Oracle” for the Pagerank value of
each page in the frontier. This is not available in a real crawl,
as we do not have the entire graph
The average relative error for estimating the Pagerank four
months ahead is about 78% [Cho and Adams, 2004], so
historical information from previous crawls is of limited use
Batch-Pagerank: Pagerank calculations are executed over
the subset of known pages [Cho et al., 1998]
Partial-Pagerank: a “temporary” Pagerank value is assigned
to the pages in between batch-Pagerank calculations
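The batch calculation can be illustrated with a plain power iteration over the links seen so far. This is a generic Pagerank sketch, not the authors' implementation: the damping factor, the iteration count, the uniform handling of pages without known out-links, and the tiny graph are all assumptions.

```python
def batch_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration Pagerank over the known part of the graph.

    links maps each downloaded page to the list of pages it links to.
    Link targets that were not downloaded yet (the frontier) are included
    as nodes without out-links; their rank is spread uniformly.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in nodes}
        for page in nodes:
            out = links.get(page, [])
            if out:
                share = damping * rank[page] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical partial crawl: "c" is known only as a link target.
known_links = {"home": ["a", "b"], "a": ["home"], "b": ["home", "c"]}
rank = batch_pagerank(known_links)
frontier = [page for page in rank if page not in known_links]
print(max(rank, key=rank.get))  # home
print(frontier)                 # ['c']
```

A crawler using this strategy would rerun the calculation after each batch of downloads and pick the frontier pages with the highest rank.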
23.
Algorithms not based on Pagerank
Depth: pages are given a priority based on their depth. This
is graph traversal in breadth-first order
[Najork and Wiener, 2001]
Length: pages from the Web sites that seem to be bigger
are crawled first. We do not know which Web sites are really the
bigger ones until the end of the crawl, so we use partial
information
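Both orderings can be sketched as sort keys over the frontier. The tuple layout and the URLs are hypothetical, and using the number of pages currently known per site as the "partial information" is an assumption made for this illustration.

```python
from collections import Counter


def depth_order(frontier):
    """Shallowest pages first, which corresponds to breadth-first traversal.
    frontier is a list of (url, site, depth) tuples."""
    return sorted(frontier, key=lambda page: page[2])


def length_order(frontier):
    """Pages from the sites that currently look biggest first, where a site's
    apparent size is the number of its pages known so far (an assumption)."""
    known = Counter(site for _, site, _ in frontier)
    return sorted(frontier, key=lambda page: -known[page[1]])


# Hypothetical frontier.
frontier = [
    ("big.cl/a/b", "big.cl", 2),
    ("small.cl/", "small.cl", 0),
    ("big.cl/a", "big.cl", 1),
    ("big.cl/c", "big.cl", 1),
]
by_depth = [url for url, _, _ in depth_order(frontier)]
by_length = [url for url, _, _ in length_order(frontier)]
print(by_depth)   # ['small.cl/', 'big.cl/a', 'big.cl/c', 'big.cl/a/b']
print(by_length)  # ['big.cl/a/b', 'big.cl/a', 'big.cl/c', 'small.cl/']
```

Because the apparent site sizes change as new links are discovered, the Length ordering has to be recomputed as the crawl progresses.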
25.
Experiments
Download a sample of pages using the WIRE crawler
[Baeza-Yates and Castillo, 2002]
3.5 million pages from over 50,000 Web sites in .CL
At most 25,000 pages from each Web site
Strategies are simulated on a graph built using actual data
Simulation includes: bandwidth saturation, network speed of
different Web sites, page sizes, waiting time, latency, etc.
30.
Simulation parameters
Algorithm
Waiting time between pages from the same Web site, $w$
Number of pages downloaded per connection when re-using
the HTTP connection, $k$
Number of robots, $r$
34.
Results with one robot
35.
Results with many robots
36.
Speed-ups with the “Length” strategy
37.
Crawling the real Web using the “Length” strategy
38.
Pagerank vs day of crawl
39.
Depth is not correlated with Pagerank
When depth is ≥ 2 links from the home page
40.
Summary
The restrictions, especially waiting time, create a difficult
problem for scheduling
A strategy with an “oracle” was too greedy
We try to keep Web sites in the frontier for as long as
possible, so we always have several Web sites to choose from
Simulation ensures the same conditions, which is critical
because the Web is very dynamic
44.
Open problems
Scheduling using historical information
Exploiting the Web’s structure
Adversarial IR: Spam detection before downloading the pages
47.
Baeza-Yates, R. and Castillo, C. (2002).
Balancing volume, quality and freshness in web crawling.
In Soft Computing Systems - Design, Management and
Applications, pages 565–572, Santiago, Chile. IOS Press
Amsterdam.
Cho, J. and Adams, R. (2004).
Page quality: In search of an unbiased Web ranking.
Technical report, UCLA Computer Science.
Cho, J., García-Molina, H., and Page, L. (1998).
Efficient crawling through URL ordering.
In Proceedings of the seventh conference on World Wide Web,
Brisbane, Australia.
Koster, M. (1995).
Robots in the web: threat or treat?
ConneXions, 9(4).
Lawrence, S. and Giles, C. L. (1998).
Searching the World Wide Web.
Science, 280(5360):98–100.
Najork, M. and Wiener, J. L. (2001).
Breadth-first crawling yields high-quality pages.
In Proceedings of the Tenth Conference on World Wide Web,
pages 114–118, Hong Kong. Elsevier Science.
StatMarket (2003).
Search engine referrals nearly double worldwide.
http://websidestory.com/pressroom/pressreleases.html?id=181.