Web Crawlers

Definition
• A web crawler (also known as a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering.
• Without crawlers, search engines would not exist.
The working of a web crawler is as follows (a minimal code sketch follows the list):
• Initializing the seed URL or URLs
• Adding them to the frontier
• Selecting a URL from the frontier
• Fetching the web page corresponding to that URL
• Parsing the retrieved page to extract the URLs
• Adding all the unvisited links to the list of URLs, i.e. the frontier
• Returning to step 3 and repeating until the frontier is empty
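The loop above can be written in a few lines. Here is a minimal sketch, assuming only the Python standard library; the names crawl, LinkExtractor, seed_urls, and the max_pages cap are illustrative choices, not taken from the slides:

# Basic frontier-driven crawl loop (illustrative sketch).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)            # steps 1-2: seed the frontier
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()           # step 3: select a URL from the frontier
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                       # skip pages that cannot be fetched
        pages[url] = html                  # step 4: fetch and keep the page
        parser = LinkExtractor()
        parser.feed(html)                  # step 5: parse out the URLs
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)  # step 6: add unvisited links
    return pages                           # loop ends when the frontier is empty

The max_pages cap is only there so the sketch terminates on a real site; an actual crawler would bound the crawl with its selection and politeness policies instead.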
Architecture of a web crawler
The figure shows the generalized architecture of a web crawler. It has three main components: a frontier, which stores the list of URLs to visit; a Page Downloader, which downloads pages from the WWW; and a Web Repository, which receives web pages from the crawler and stores them in a database. These three components are sketched in code below.
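A possible decomposition of those three components into code, assuming in-memory storage; the class names follow the slide's terminology, but the methods and everything else are illustrative assumptions:

# Illustrative decomposition of the three components named above.
from collections import deque
from urllib.request import urlopen

class Frontier:
    """Stores the list of URLs still to be visited."""
    def __init__(self, seeds):
        self._queue = deque(seeds)
        self._seen = set(seeds)

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        return self._queue.popleft() if self._queue else None

class PageDownloader:
    """Downloads pages from the WWW."""
    def fetch(self, url):
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", "replace")

class WebRepository:
    """Receives web pages from the crawler and stores them (here, in memory)."""
    def __init__(self):
        self._pages = {}

    def store(self, url, html):
        self._pages[url] = html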
Flow of a basic crawler
The behaviour of a Web crawler is the outcome of a combination of policies:
• A selection policy: states which pages to download.
• A re-visit policy: states when to check for changes to the pages.
• A politeness policy: states how to avoid overloading Web sites (sketched after this list).
• A parallelization policy: states how to coordinate distributed Web crawlers.
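One common way to implement the politeness policy is to honour robots.txt and to rate-limit requests per host. A minimal sketch, assuming the Python standard library; the one-second default delay and the helper names allowed and polite_fetch are assumptions, not values from the slides:

# Politeness sketch: obey robots.txt and pause between requests to a host.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url, user_agent="*"):
    """Check the site's robots.txt before fetching the page."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # assumption: treat an unreachable robots.txt as "no restrictions"
    return parser.can_fetch(user_agent, url)

def polite_fetch(url, last_hit, delay=1.0):
    """Sleep so the same host is not requested more than once per `delay` seconds."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - last_hit.get(host, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    last_hit[host] = time.monotonic()
    # ... download the page here, e.g. with the loop sketched earlier ...

In the crawl loop above, allowed(url) would be checked before each fetch and polite_fetch would space out requests to the same site.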
Usage
• Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches.
• Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links (see the sketch after this list) or validating HTML code. They can likewise be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
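A minimal sketch of the link-checking maintenance task mentioned above, using only the Python standard library; issuing HEAD requests and the check_links name are illustrative assumptions, not part of the slides:

# Report links that do not resolve, given a list of URLs gathered by a crawl.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def check_links(urls):
    """Return (url, problem) pairs for links that do not resolve."""
    broken = []
    for url in urls:
        request = Request(url, method="HEAD")
        try:
            urlopen(request, timeout=10)
        except HTTPError as err:    # the server answered with an error status
            broken.append((url, f"HTTP {err.code}"))
        except URLError as err:     # DNS failure, refused connection, timeout, ...
            broken.append((url, str(err.reason)))
    return broken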
Types
• Focused Web Crawler
• Incremental Crawler
• Distributed Crawler
• Parallel Crawler
Conclusion
• A Web crawler is a vital tool for information retrieval: it traverses the Web and downloads the web documents that suit the user's need. Web crawlers are used by search engines and other users to regularly ensure that their databases are up to date.
References
• IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 1, Ver. VI (Feb. 2014), pp. 01-05, www.iosrjournals.org
• Information from the web pages of:
• www.wikipedia.org
• www.sciencedaily.com
