Introduction to
Web Crawlers
Presented By
Rehab
INTRODUCTION
 Definition:
A Web crawler is a computer program that
browses the World Wide Web in a methodical,
automated manner. (Wikipedia)
 Utilities:
Gather pages from the Web.
Support a search engine, perform data mining,
and so on.
WHY CRAWLERS?
 The Internet has a wide expanse of information.
 Finding relevant information requires an efficient mechanism.
 Web crawlers provide that mechanism to the search engine.
Features of a Crawler
 Must provide:
 Robustness: resilience to spider traps (endless chains of
dynamically generated pages)
 Politeness: respect site policies on which pages can be
crawled, and which cannot (a robots.txt check is sketched
after this list)
 Should provide:
 Distributed
 Scalable
 Performance and efficiency
 Quality
 Freshness
 Extensible
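For the politeness requirement, a minimal sketch using Python's standard urllib.robotparser; the user agent string and the example.com URLs are illustrative assumptions, not part of the original slides:

from urllib import robotparser

USER_AGENT = "MyCrawler"  # hypothetical crawler identity (assumption)

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # illustrative target site
rp.read()  # fetch and parse the site's robots.txt

# Only enqueue URLs that the site's policy allows for our user agent.
url = "https://example.com/some/page.html"
if rp.can_fetch(USER_AGENT, url):
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)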
How does a web crawler work?
Architecture of a Crawler
[Diagram: crawler data flow. Pages come from the WWW via DNS resolution,
Fetch, and Parse; parsed content passes a "Content Seen?" test (backed by a
doc fingerprint store), then a URL filter (backed by robots templates), then
duplicate URL elimination (backed by the URL set); surviving URLs are added
to the URL frontier, which feeds the fetcher.]
Architecture of a Crawler
URL Frontier: contains the URLs yet to be fetched in the current crawl. At first,
a seed set is stored in the URL frontier, and the crawler begins by taking a URL
from the seed set.
DNS: Domain Name System resolution. Look up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the page at the URL.
Parse: the fetched page is parsed. Text (plus references to images, videos, etc.)
and links are extracted (a fetch-and-parse sketch follows below).
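A minimal fetch-and-parse sketch using only Python's standard library; the seed URL and the LinkExtractor class are illustrative assumptions:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

seed = "https://example.com/"  # placeholder seed URL (assumption)
html = urlopen(seed).read().decode("utf-8", errors="replace")  # Fetch
parser = LinkExtractor()
parser.feed(html)                                              # Parse
print(parser.links)  # extracted links, to be filtered and enqueued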
Architecture of a Crawler
Content Seen?: test whether a page with the same content has already been seen at
another URL. This requires a way to compute a fingerprint of a web page (a sketch
follows below).
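A minimal fingerprint sketch, here simply an SHA-1 hash of the page text; real crawlers often use near-duplicate schemes such as shingling, but any stable digest illustrates the idea:

import hashlib

seen_fingerprints = set()  # the "Doc Fingerprint" store

def content_seen(page_text: str) -> bool:
    """Return True if a page with identical text was already crawled."""
    fp = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

print(content_seen("hello world"))  # False: first time seen
print(content_seen("hello world"))  # True: duplicate content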
URL Filter:
Decide whether an extracted URL should be excluded from the frontier
(e.g., disallowed by robots.txt).
Extracted URLs must be normalized: relative links are resolved against the URL
of the page they appear on. For example, on en.wikipedia.org/wiki/MainPage,
the relative link <a href="disclaimer">Disclaimers</a> must be expanded to an
absolute URL.
Dup URL Elim: the URL is checked against the URL set for duplicate elimination
(both steps are sketched below).
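A sketch of normalization and duplicate elimination using Python's standard urllib.parse, reusing the Wikipedia example above; the function names are illustrative:

from urllib.parse import urldefrag, urljoin

url_set = set()  # the "URL set" used for duplicate elimination

def normalize(base_url: str, link: str) -> str:
    """Resolve a relative link against its page and drop any #fragment."""
    absolute = urljoin(base_url, link)
    return urldefrag(absolute).url

def dup_url_elim(url: str) -> bool:
    """Return True if the URL is new and should enter the frontier."""
    if url in url_set:
        return False
    url_set.add(url)
    return True

base = "https://en.wikipedia.org/wiki/MainPage"
url = normalize(base, "disclaimer")
print(url)                 # https://en.wikipedia.org/wiki/disclaimer
print(dup_url_elim(url))   # True: new, enqueue it
print(dup_url_elim(url))   # False: duplicate, drop it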
Architecture of a Crawler
Other issues:
 Housekeeping tasks:
 Log crawl progress statistics: URLs crawled, frontier
size, etc. (Every few seconds)
 Checkpointing: a snapshot of the crawler’s state (the
URL frontier) is committed to disk. (Every few hours)
 Priority of URLs in the URL frontier:
 Change rate.
 Quality.
 Politeness:
 Avoid repeated fetch requests to a host within a short
time span (a per-host delay sketch follows this list).
 Otherwise: the crawler gets blocked.
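A minimal politeness sketch for a single-threaded crawler, assuming a hypothetical 2-second minimum delay per host:

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0          # hypothetical per-host delay, in seconds
last_fetch_time = {}     # host -> timestamp of the most recent fetch

def wait_for_politeness(url: str) -> None:
    """Sleep until at least MIN_DELAY has passed since the host's last fetch."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch_time.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    last_fetch_time[host] = time.time()

wait_for_politeness("https://example.com/a")  # first request: no wait
wait_for_politeness("https://example.com/b")  # same host: sleeps ~2 s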
Thank You
For Your
Attention
