• In the early days of the Internet, anonymous FTP sites rose in popularity.
• Users downloaded the files they needed from these sites.
The first search engine : Archie
• Created in 1990, it downloaded the directory listings of
all files on anonymous FTP sites, and created
a searchable database of file names.
Google :
• Became popular around 2001.
• Introduced the important concepts of “link popularity” and
“PageRank”.
Yahoo! Search :
• Prior to 2004, Yahoo! used Google to provide
users with search results.
• Launched its own search engine in 2004.
• Used technologies from Inktomi and AltaVista,
which Yahoo! had acquired.
MSN Search :
• The most recent of the major search engines, owned by Microsoft.
• Increasing in popularity.
• Windows Live Search : a new search engine from Microsoft.
Search Engine Defined
“A search engine is a software program that helps in
locating information stored on a
computer system, typically on the World Wide Web.”
Search engines are of two types :
I. Crawler based
II. Human powered
Crawler-Based Search Engines
• Create their listings automatically,
e.g. GOOGLE, YAHOO.
• Crawl or spider the web to create a
directory of information.
• When changes are made to a page,
such search engines will eventually find these
changes on a later crawl.
• Human-Powered Directories
Depend on humans for the creation of their listings.
• Hybrid Search Engines
Can accept both types of results :
those based on web crawlers, and
those based on human-powered listings.
What is WebCrawler?
A single piece of software with
two different functions :
1. Building indexes of web pages.
2. Navigating the web automatically on demand.
KEY DESIGN GOALS
• Breadth-first search to create a broad index.
• Crawler behavior that includes as many
web servers as possible.
Components in WebCrawler
• Crawler : retrieves documents from the web
under the control of the search engine, which
acts as the front end for the crawler.
It starts with a known set of documents and
accesses their contents over the network (e.g. via HTTP).
• Search engine : handles the query from the user.
Web Viewed as a Graph
1. Select a URL from the set of candidates.
2. Download the associated web page.
3. Extract the URLs contained therein.
4. Add those URLs that have not been
encountered before to the candidate set.
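The four-step loop above can be sketched as a breadth-first traversal. The tiny in-memory “web” below is made up for illustration; a real crawler would fetch each page over HTTP instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": maps a URL to the URLs found on that page.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def crawl(seed):
    """Breadth-first crawl: select a URL, 'download' it, extract links,
    and enqueue links not encountered before."""
    candidates = deque([seed])   # candidate set of URLs to visit
    seen = {seed}                # URLs encountered so far
    order = []                   # pages in the order they were fetched
    while candidates:
        url = candidates.popleft()          # 1. select a URL
        links = WEB.get(url, [])            # 2. download page, 3. extract URLs
        order.append(url)
        for link in links:                  # 4. add unseen URLs to candidates
            if link not in seen:
                seen.add(link)
                candidates.append(link)
    return order

print(crawl("http://a.example/"))
```

Using a FIFO queue for the candidate set is what makes the traversal breadth-first; swapping in a priority queue would let the crawler favor particular pages or servers instead.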
• Robots Exclusion Protocol : the crawler must honor each site's robots.txt rules.
• Fast-changing pages must be revisited to keep the index fresh.
• Previously seen URLs are tracked to avoid multiple
downloads of the same page.
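Honoring the Robots Exclusion Protocol can be sketched with Python's standard-library robot-file parser. The robots.txt content below is invented for the example; a real crawler would download it from the site's /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration: disallow /private/ to all agents.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The crawler checks each candidate URL before fetching it.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "http://example.com/private/x"))
```

The first check prints `True` and the second `False`, so the crawler would skip the disallowed URL.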
Typical anatomy of a large-scale crawler
Performance and Reliability
• Need to fetch many pages at the same time
– utilize the network bandwidth
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
– Polling sockets to check for completion of network I/O
– Multi-processing or multi-threading
• Care in URL extraction
– Eliminating duplicates to reduce redundant fetches
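One way to take care in URL extraction is to canonicalize each URL before checking it against a seen-set, so trivially different spellings of the same page are fetched only once. This sketch applies just a few normalization rules (the URLs are invented examples); production crawlers use many more.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize a URL: lowercase scheme and host, drop the fragment,
    drop the default HTTP port, and default an empty path to "/"."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    if not path:
        path = "/"
    return urlunsplit((scheme.lower(), netloc, path, query, ""))

def deduplicate(urls):
    """Keep only the first occurrence of each canonical URL."""
    seen, unique = set(), []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

print(deduplicate([
    "http://Example.com/a",
    "http://example.com:80/a#top",   # same page as the first URL
    "http://example.com/b",
]))
```

Here the second URL is recognized as a duplicate of the first, so only two fetches would be issued.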
WebCrawler : Indexing Mode
• Try to build an index of as much of the web as possible.
• Some heuristics are used :
– Which documents should be selected if the space for storing
indices is limited? (e.g. save only 100 pages)
• A reasonable approach is to ensure that
documents come from as many different servers as possible.
• WebCrawler uses a modified breadth-first search
approach in order to ensure that every server has
at least one document that has been indexed.
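The idea of spreading a limited index budget across many servers can be sketched as a round-robin selection by host. This illustrates the intuition behind the modified breadth-first approach, not WebCrawler's actual implementation; the URLs are made up.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

def select_pages(candidates, budget):
    """Pick up to `budget` pages while maximizing server diversity:
    group candidate URLs by host, then take them round-robin so every
    server contributes one page before any server contributes a second."""
    by_host = defaultdict(deque)
    for url in candidates:
        by_host[urlsplit(url).netloc].append(url)
    hosts = deque(by_host)
    chosen = []
    while hosts and len(chosen) < budget:
        host = hosts.popleft()
        chosen.append(by_host[host].popleft())
        if by_host[host]:            # host still has pages; requeue it
            hosts.append(host)
    return chosen

pages = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
print(select_pages(pages, 4))
```

With a budget of 4, all three servers get one page each before server `a.example` gets a second, so every server is represented in the index.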
WebCrawler : Real-time
• Basic motivation :
Given a user’s query, try to find the documents
that most closely match it.
• A different search algorithm is used by
WebCrawler in this mode.
• Intuitive reasoning :
– If we follow the links from a document that is
similar to what the user is looking for, they
will most likely lead to relevant documents.
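The intuition above can be sketched as a best-first search: always follow links from the most query-similar page seen so far. The toy corpus and the crude word-overlap similarity below are invented for illustration; this is not WebCrawler's actual algorithm.

```python
import heapq

# Toy corpus: each "page" has outgoing links and some text (made up).
PAGES = {
    "home":     {"links": ["news", "pets"], "text": "welcome home page"},
    "news":     {"links": ["politics"],     "text": "daily news and politics"},
    "pets":     {"links": ["cats"],         "text": "pets dogs cats care"},
    "cats":     {"links": [],               "text": "cats cat food cat toys"},
    "politics": {"links": [],               "text": "politics election"},
}

def score(query, text):
    """Crude similarity: count of query words appearing in the page text."""
    words = set(text.split())
    return sum(w in words for w in query.split())

def best_first_search(query, start, limit=3):
    """Expand the most similar page first, using a priority queue with
    negated scores (heapq is a min-heap)."""
    heap = [(-score(query, PAGES[start]["text"]), start)]
    visited, results = set(), []
    while heap and len(results) < limit:
        _neg, page = heapq.heappop(heap)
        if page in visited:
            continue
        visited.add(page)
        results.append(page)
        for link in PAGES[page]["links"]:
            if link not in visited:
                heapq.heappush(heap, (-score(query, PAGES[link]["text"]), link))
    return results

print(best_first_search("cats care", "home"))
```

Starting from `home`, the crawl prefers `pets` over `news` because its text matches the query better, and the links from `pets` then lead straight to the relevant `cats` page.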
• Search Engine Indexing
• Statistical Analysis
• Maintenance of Hypertext Structure
(URL and link validation)
• Resource Discovery
– e.g. a service that mines the web for copyright violations