Finding key information on the gigantic World Wide Web is like finding a needle lost in a haystack. For that task we would want a special magnet that attracts the needle for us automatically, quickly, and effortlessly. In this scenario, the magnet is the "Search Engine."
"Even a blind squirrel finds a nut, occasionally." But few of us are determined enough to search through millions, or billions, of pages of information to find our "nut." So, to reduce the problem to a more or less manageable one, web "search engines" were introduced a few years ago.
Search Engine
A software program that searches a database and gathers and reports information that contains or is related to specified terms; or, a website whose primary function is to provide search over, and reporting on, information available on the Internet or a portion of the Internet.
Eight reasonably well-known Web search engines are:
Top 10 Search Providers by Searches, August 2007

Provider      Searches (000)    Share of Total Searches (%)
              4,199,495         53.6
              1,561,903         19.9
              1,011,398         12.9
              435,088           5.6
              136,853           1.7
              71,724            0.9
              37,762            0.5
              34,699            0.4
              32,483            0.4
              31,912            0.4
Other         275,812           3.5
All Search    7,829,129         100.0

Source: Nielsen//NetRatings, 2007
Search Engine History
1990 - The first search engine, Archie, was released. There was no World Wide Web at the time. Data resided on defense contractor, university, and government computers, and techies were the only people accessing it. The computers were interconnected by Telnet, and the File Transfer Protocol (FTP) was used for transferring files from computer to computer. There was no such thing as a browser; files were transferred in their native format and viewed using the software associated with each file type. Archie searched FTP servers and indexed their files into a searchable directory.
1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the resulting catalog became known as Gopherspace. 1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was introduced. Telnet and FTP now passed information to the new web browsers, which accessed not FTP sites but WWW sites. Webmasters and web site owners began submitting sites for inclusion in the growing number of web directories.
1995 - Meta tags in the web page were first utilized by some search engines to determine relevancy. 1997 - Search engine rank-checking software was introduced, providing an automated way to determine a web site's position and ranking within the major search engines. 1998 - Search engine algorithms began incorporating esoteric information into their ranking algorithms, e.g., the number of links to a web site as a measure of its "link popularity." Another ranking approach was to count the clicks (visitors) a web site received for relevant keywords and phrases.
2000 - Marketers determined that pay-per-click campaigns were an easy yet expensive way to gain top search rankings. To elevate their sites in the search engine rankings, web sites started adding useful, relevant content while optimizing their pages for each specific search engine.
All robots use the following algorithm for retrieving documents from the Web:
The algorithm uses a list of known URLs. This list contains at least one URL to start with.
A URL is taken from the list, and the corresponding document is retrieved from the Web.
The document is parsed to retrieve information for the index database and to extract the embedded links to other documents.
The URLs of the links found in the document are added to the list of known URLs.
If the list is empty or some limit is exceeded (number of documents retrieved, size of the index database, time elapsed since startup, etc.), the algorithm stops; otherwise it continues at step 2.
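The retrieval loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `fetch_and_parse` is a hypothetical helper, passed in by the caller, that stands for "retrieve the document and extract its embedded links" (steps 2-3).

```python
from collections import deque

def crawl(seed_urls, fetch_and_parse, max_docs=100):
    """Run the generic robot algorithm over a list of known URLs."""
    to_visit = deque(seed_urls)        # step 1: list of known URLs
    seen = set(seed_urls)              # processed URLs are never re-queued
    retrieved = 0
    # step 5: stop when the list is empty or a limit is exceeded
    while to_visit and retrieved < max_docs:
        url = to_visit.popleft()       # step 2: FIFO order = breadth-first
        links = fetch_and_parse(url)   # step 3: retrieve and parse document
        retrieved += 1
        for link in links:             # step 4: add newly found URLs
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return retrieved
```

Using a FIFO queue (`popleft`/`append`) is what makes the traversal breadth-first, as the note below describes.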
Processed URLs are removed from the list, and breadth-first search is used to limit path length.
To avoid memory problems, the pages pointed to are not traced in the same order as they were obtained.
Keyword Indexing
For each entry in url_table, the indexing procedure examines the title and selects all words not on the stop list. Each selected word is written to a file as a line consisting of the word followed by the current url_table entry number. When the whole table has been scanned, the file is sorted by word. The stop list prevents indexing of prepositions, conjunctions, articles, and other words with many hits and little value.
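The indexing pass above can be sketched as follows. The stop list and the sample url_table entries are illustrative, and the (word, entry number) pairs are collected in memory rather than written to a file:

```python
# Illustrative stop list: high-hit, low-value words skipped during indexing.
STOP_LIST = {"the", "a", "an", "and", "or", "of", "in", "on", "for"}

def build_index(url_table):
    """Scan each title, emit (word, entry number) pairs, then sort by word."""
    pairs = []
    for entry_no, title in enumerate(url_table):
        for word in title.lower().split():
            if word not in STOP_LIST:       # select words not on the stop list
                pairs.append((word, entry_no))
    return sorted(pairs)                    # the file is sorted by word

index = build_index(["The History of Search", "Search Engine Basics"])
```

Sorting the pairs by word groups all entry numbers for the same keyword together, which is what makes the file usable as an inverted index.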
The "inverse document frequency" is a measure of the general importance of a term, obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient:

idf(t_i) = log( |D| / |{d : t_i ∈ d}| )

where
|D| : total number of documents in the corpus
|{d : t_i ∈ d}| : number of documents where the term t_i appears (that is, documents in which its term frequency is nonzero).
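The formula translates directly into code. Here each document in the toy corpus is represented simply as a set of terms:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(|D| / number of docs containing term)."""
    containing = sum(1 for d in docs if term in d)   # |{d : t_i in d}|
    return math.log(len(docs) / containing)          # log(|D| / n_i)

docs = [{"web", "search"}, {"web", "crawler"}, {"index"}]
# "web" appears in 2 of 3 documents, so its idf is log(3/2);
# "index" appears in only 1, so it gets the larger weight log(3).
```

Note that a term appearing in every document gets idf = log(1) = 0, which is exactly the "many hits, little value" behavior the stop list approximates.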