In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web-accessible documents.
As of 09/2005, Google indexed 8,200,000,000 web pages.
A search engine is a system that collects and organizes Web documents and presents a way to select them based on certain words, phrases, or patterns within the documents.
Model the Web as a full-text DB
Index a portion of the Web docs
Search Web documents using user-specified words/patterns in a text
Two categories of search engine:
general-purpose search engines, e.g. Yahoo!, AltaVista and Google
special-purpose search engines (or Internet portals), e.g. LinuxStart (www.linuxstart.com)
Two main components of a search engine:
web crawler (spider), which collects Web pages on a massive scale.
large database, which stores and indexes the collected Web pages.
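The indexing and search steps can be illustrated with a small inverted index. This is only a sketch under stated assumptions: the page names and contents are made up, the helper names (`tokenize`, `build_index`, `search`) are hypothetical, and a real engine would index crawled pages at a far larger scale with a persistent store.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index, query):
    """Return the pages that contain every term of the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages standing in for what a crawler would collect.
pages = {
    "a.html": "Web search engines index web pages",
    "b.html": "A crawler collects pages from the web",
    "c.html": "Ranking uses the index, not the page text",
}
index = build_index(pages)
print(sorted(search(index, "web pages")))
```

Note that `search` consults only the index and never re-reads the page text.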
Ranking has to be performed using only the index, without accessing the text.
Ranking algorithms: the details are kept “top secret.” Recall is almost impossible to measure, since the number of relevant pages can be very large even for simple queries.
Information Retrieval (IR) is the key to search engines and Web search.
Most commonly used models:
Vector Space Model (VSM)
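As a rough illustration of the Vector Space Model, documents and the query can be represented as term-weight vectors and compared by cosine similarity. The sketch below assumes a TF-IDF weighting with a "+1" smoothing term and whitespace tokenization; both are choices made for this example, not something the slides specify, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed weighting: log(N / df) + 1, so no term gets a zero weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    counts = Counter(query.lower().split())
    # Query terms that appear in no document are ignored.
    q = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical mini-collection.
docs = [
    "web search engine",
    "vector space model",
    "web crawler collects web pages",
]
for score, i in rank("vector model", docs):
    print(round(score, 3), docs[i])
```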
Google is a representative general-purpose search engine: it is successful and combines many techniques, such as using both web page link structure and text content, to take full advantage of web page features.
PageRank in Google is defined as follows:
Assume page A has pages P_1, ..., P_n which point to it. The parameter d is a damping factor which can be set between 0 and 1. Also, C(P_i) is defined as the number of links going out of page P_i. The PageRank of a page A is given as follows:
PR(A) = (1 - d) + d (PR(P_1)/C(P_1) + ... + PR(P_n)/C(P_n))
Usually the parameter d is set to 0.85. PageRank, or PR(A), can be calculated using a simple iterative algorithm.
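The iterative calculation can be sketched as follows. This is a minimal illustration of the formula above, not Google's implementation: the three-page link graph, the starting value of 1.0 for every page, and the fixed number of iterations are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(P_i) / C(P_i)) over all pages.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(P): number of links going out of page P.
    out_count = {p: len(links.get(p, [])) for p in pages}

    # For each page A, the pages P_1 ... P_n that point to it.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    # Assumed starting point: every page begins with a rank of 1.0.
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links never appear in incoming[...],
        # so out_count[q] is always at least 1 here.
        pr = {
            p: (1 - d) + d * sum(pr[q] / out_count[q] for q in incoming[p])
            for p in pages
        }
    return pr


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this form of the formula the ranks are not normalized to sum to 1; when every page has at least one outgoing link, they sum to the number of pages.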
Other features: anchor text processing, location information management, and various data structures that make full use of the features of the web.