ARCHITECTURE OF WEB CRAWLER
APPROACHES FOR CRAWLING PROCESS
UTILITIES OF WEB CRAWLER
SCOPE FOR FUTURE
The number of Internet users and accessible web pages keeps growing.
The Web is a hypertext system.
Web crawlers are among the most crucial components of search engines, and optimizing them has a great effect on search quality.
Programs that exploit the graph structure of the web to move from page to page.
A program that browses the World Wide Web in a methodical, automated manner.
Improves searching efficiency.
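A minimal sketch of such a methodical, automated traversal. The in-memory WEB dictionary stands in for real HTTP fetching and link extraction, so all URLs and names here are illustrative:

```python
from collections import deque

# A tiny in-memory "web": page URL -> outgoing links.
# In a real crawler these links come from fetching and parsing each page.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

def crawl(seed, fetch_links):
    """Breadth-first traversal of the web graph, starting from a seed URL."""
    frontier = deque([seed])   # URLs waiting to be visited
    visited = set()            # URLs already downloaded
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue           # never download the same page twice
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):   # links extracted from the page
            if link not in visited:
                frontier.append(link)
    return order

# Visits each reachable page exactly once, in breadth-first order.
print(crawl("http://a.example/", lambda u: WEB.get(u, [])))
```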
Literature survey paper 1
“Distributed Ontology-Driven Focused Crawling”
•Vertical search technologies.
The web crawler architecture uses URL scoring functions, a scheduler, a DOM parser, and a page ranker to download web pages.
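A rough sketch of how a URL scoring function and a scheduler might fit together. The keyword list and the scoring rule are assumptions for illustration, not the paper's actual functions:

```python
import heapq

# Hypothetical topic keywords; a real scorer might also use anchor text,
# the content of the linking page, or an ontology.
KEYWORDS = ("crawler", "search", "index")

def score_url(url):
    """Higher score = more promising URL (keyword hits in the URL string)."""
    return sum(kw in url.lower() for kw in KEYWORDS)

class Scheduler:
    """Priority frontier: always hands out the highest-scoring URL next."""
    def __init__(self):
        self._heap = []

    def push(self, url):
        # heapq is a min-heap, so negate the score for max-first ordering.
        heapq.heappush(self._heap, (-score_url(url), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

s = Scheduler()
for u in ["http://x.example/about",
          "http://x.example/search/crawler-index",
          "http://x.example/search"]:
    s.push(u)
print(s.pop())  # the URL with the most keyword hits comes out first
```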
Literature survey paper 2
“Efficient Focused Crawling based on Best First Search”
•Seek out pages that are relevant to given keywords.
•A focused crawler analyzes links that are likely to be most relevant.
•“Best” first search strategy is identified as a “focused crawler”
A focused crawler has two main components:
(i) Finding specific web pages.
(ii) Proceeding outward from seed pages.
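The best-first strategy above can be sketched with a priority-queue frontier. The toy PAGES corpus, TOPIC set, and relevance measure are all hypothetical stand-ins for fetched pages and a real similarity function:

```python
import heapq

# Toy corpus standing in for fetched pages: url -> (text, outlinks).
PAGES = {
    "seed": ("focused crawler tutorial", ["p1", "p2"]),
    "p1":   ("cooking recipes", ["junk"]),
    "p2":   ("web crawler search strategies", ["p3"]),
    "p3":   ("focused web search", []),
    "junk": ("more recipes", []),
}
TOPIC = {"crawler", "search", "focused", "web"}

def relevance(text):
    """Fraction of topic keywords present in the page text (0.0 to 1.0)."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def best_first_crawl(seed, limit):
    frontier = [(-1.0, seed)]   # (negated priority, url); seeds get top priority
    visited, order = set(), []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        text, links = PAGES[url]
        for link in links:
            # Score each outlink by the relevance of the page that cites it,
            # so links found on relevant pages are explored first.
            heapq.heappush(frontier, (-relevance(text), link))
    return order

print(best_first_crawl("seed", 4))
```

With a crawl budget of 4 pages, the link cited by the relevant page "p2" is fetched while the link cited by the irrelevant page "p1" never is.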
Literature survey paper 3
“Design of an Ontology based Adaptive Crawler for Hidden Web”
•Deep web/ invisible web / hidden web.
•Accessing deep web using ontology.
•Download relevant hidden web pages.
Literature survey paper 4
“URL Rule Based Focused Crawlers.”
• Use of URL regular expressions.
• Retrieving Topic-specific Pages.
To find topic-specific information, the crawler needs to fetch only a small part of the web and so uses fewer server resources.
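One way such a URL rule might look; the pattern, domain, and paths below are invented for illustration:

```python
import re

# Hypothetical URL rule for a topic-specific crawl: only follow blog or
# product pages under example.com; everything else is discarded unfetched.
RULE = re.compile(r"^https?://(www\.)?example\.com/(blog|products)/")

def should_crawl(url):
    """Return True if the URL matches the crawl rule."""
    return bool(RULE.match(url))

urls = [
    "https://example.com/blog/focused-crawling",
    "https://example.com/careers",
    "http://www.example.com/products/42",
    "https://other.com/blog/post",
]
print([u for u in urls if should_crawl(u)])
```

Because the rule is checked before any page is downloaded, off-topic URLs cost no bandwidth at all.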
Literature survey paper 5
“A Topic-Specific Web Crawler with Web Page
Hierarchy Based on HTML Dom-Tree.”
•Represents page data as a hierarchical DOM tree.
•A DOM tree is a structural representation of an HTML page.
•Uses the concept of ontology.
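A minimal sketch of building such a hierarchical tree with Python's standard html.parser module. The (tag, children) node representation is an assumption for illustration, not the paper's data structure:

```python
from html.parser import HTMLParser

class DomTreeBuilder(HTMLParser):
    """Build a nested (tag, children) tree from HTML markup -- the kind of
    structure a DOM-tree based crawler walks to locate links and content."""

    def __init__(self):
        super().__init__()
        self.root = ("document", [])
        self.stack = [self.root]     # path from root to the current node

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)   # attach under the current parent
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

builder = DomTreeBuilder()
builder.feed("<html><body><h1></h1><ul><li></li><li></li></ul></body></html>")
print(builder.root)
```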
The most prominent challenge with current web crawlers:
Selection of important pages for downloading.
No crawler can download all pages from the web.
It is important for the crawler
“to select the pages and to visit ‘important’ pages first, by prioritizing the URLs in the queue properly.”
The crawler should also minimize the load on the websites it crawls while parallelizing the crawling process.
Approaches for Crawling process
Broadly, there are two different types of crawlers: those that do not follow a specific path, and focused crawlers that do.
A selection policy that states which pages to download.
A politeness policy that states how to avoid overloading websites.
A parallelization policy that states how to coordinate a distributed web crawl.
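The politeness policy can be sketched as a per-host minimum delay between requests. The 1-second default and the class name are assumptions; real crawlers also honour robots.txt:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between requests to the same host so the
    crawler does not overload any single site."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_hit = {}   # host -> timestamp of last request to it

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        ready_at = self.last_hit.get(host, 0.0) + self.delay
        if now < ready_at:
            time.sleep(ready_at - now)   # too soon: sleep out the remainder
        self.last_hit[host] = time.monotonic()

gate = PolitenessGate(delay=0.1)
gate.wait("http://a.example/page1")   # first hit on a.example: no wait
gate.wait("http://a.example/page2")   # same host again: sleeps ~0.1 s
gate.wait("http://b.example/page1")   # different host: no wait
```

A parallel crawl would share one such gate across workers, so the frontier can be processed concurrently while each individual site still sees a polite request rate.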
Utilities of Web Crawler
Gather pages from the Web.
Support a search engine.
Perform data mining.
Improve web sites (web site analysis).
The number of extracted documents is reduced: links are analyzed and a great deal of irrelevant web pages are deleted.
Crawling time is reduced: once the irrelevant pages have been deleted, the crawling load drops.