SOVAN MISRA, Roll no: CS-07-42, Regd no: 0701230147
Working Of
Search Engine
Guided By :- XxX
Why a Search Engine?
Finding key information on the gigantic World Wide Web is like finding a needle lost in a haystack.
For this purpose we would use a special magnet that automatically, quickly and effortlessly attracts that needle for us.
In this scenario the magnet is the "Search Engine".
“Search Engine is a program that
searches documents for specified
keywords and returns a list of the
documents where the keywords were
found.”
Search Engine History
1990 - The first search engine, Archie, was released.
- Computers were interconnected by Telnet.
- The File Transfer Protocol (FTP) was used for transferring files from computer to computer.
- Archie searched FTP servers and indexed their files into a searchable directory.
1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the resulting catalog became known as Gopherspace.
1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was introduced.
1995 - Meta tags in the web page were first utilized by some search engines to determine relevancy.
1997 - Search engine rank-checking software was introduced.
1998 - Search engine algorithms were introduced that increased the accuracy of searching.
2000 - Marketers determined that pay-per-click campaigns were an easy yet expensive approach to gaining top search rankings.
Eight reasonably well-known Web search engines (shown as logos on the original slide).
Top 10 Search Providers by Searches, February 2010
(provider names appeared as logos on the original slide)

Provider      Searches (000)   Share of Total Searches (%)
All search    9,175,357        100.0
(logo)        5,981,044        65.5
(logo)        1,294,261        14.1
(logo)        1,142,364        12.5
(logo)        206,969          2.3
(logo)        175,074          1.9
(logo)        91,288           1.0
(logo)        55,122           0.6
(logo)        27,002           0.3
(logo)        26,462           0.3
(logo)        24,681           0.3
Others        110,104          1.2

Source: Nielsen//Net Ratings, 2010
Stages in information retrieval

Finding documents:
The required documents may be distributed over tens of thousands of servers and must first be located.
Formulating queries:
The user needs to express exactly what kind of information is to be retrieved.
Determining relevance:
The system must determine whether a document contains the required information or not.
Types of Search Engine
On the basis of their working, search engines are categorized into the following groups:
Crawler-Based Search Engines
Directories
Hybrid Search Engines
Meta Search Engines
Crawler-Based Search Engines
•These use automated software programs, known as 'spiders', 'crawlers', 'robots' or 'bots', to survey and categorize web pages.
•A spider finds a web page, downloads it and analyses the information presented on it. The page is then added to the search engine's database.
•When a user performs a search, the search engine checks its database of web pages for the keywords the user searched.
•The results (a list of suggested links) are listed on pages in order of which is 'closest' (as defined by the 'bots').
Examples of crawler-based search engines are:
•Google (www.google.com)
•Ask (www.ask.com)
Robot Algorithm
All robots use the following algorithm for retrieving documents from the Web:
1. The algorithm uses a list of known URLs. This list contains at least one URL to start with.
2. A URL is taken from the list, and the corresponding document is retrieved from the Web.
3. The document is parsed to retrieve information for the index database and to extract the embedded links to other documents.
4. The URLs of the links found in the document are added to the list of known URLs.
5. If the list is empty or some limit is exceeded (number of documents retrieved, size of the index database, time elapsed since startup, etc.), the algorithm stops; otherwise it continues at step 2.
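A minimal Python sketch of this generic robot loop follows. It is only an illustration of the five steps above; the use of the requests library, the LinkParser helper and the max_documents limit are assumptions, not part of the original slides.

```python
# Minimal sketch of the generic robot/crawler loop described above.
# Real crawlers add politeness (robots.txt), error handling and persistence.
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class LinkParser(HTMLParser):
    """Collects href attributes of <a> tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_documents=100):
    known = list(seed_urls)          # step 1: list of known URLs
    seen = set(known)
    index = {}                       # url -> page text (stands in for the index DB)
    while known and len(index) < max_documents:    # step 5: stop conditions
        url = known.pop(0)                         # step 2: take a URL from the list
        try:
            page = requests.get(url, timeout=5).text   # step 2: retrieve the document
        except requests.RequestException:
            continue
        index[url] = page                          # step 3: feed the index database
        parser = LinkParser()
        parser.feed(page)                          # step 3: extract embedded links
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:               # step 4: add new URLs to the list
                seen.add(absolute)
                known.append(absolute)
    return index
```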
The crawler program treats the World Wide Web as a big graph, with pages as nodes and hyperlinks as arcs.
The crawler works with a simple goal: indexing all the keywords in web pages' titles.
Three data structures are needed for the crawler (robot) algorithm:
•A large linear array, url_table
•A heap
•A hash table
Links Are Important
Wherever the link goes, I follow.
url_table:
•A large linear array that contains millions of entries.
•Each entry contains two pointers:
•a pointer to the URL
•a pointer to the title
Heap:
•A large unstructured chunk of virtual memory to which strings can be appended.
Hash table:
•The third data structure, with 'n' entries.
•Any URL can be run through a hash function to produce a nonnegative integer less than 'n'.
•All URLs that hash to the value 'k' are hooked together on a linked list.
•Every entry in url_table is also entered into the hash table.
•The main use of the hash table is to start with a URL and quickly determine whether it is already present in url_table.
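A small Python sketch of these three structures as described above. The slides describe a C-style design with pointers into a string heap; here Python strings and lists stand in for those pointers, and the table size N_BUCKETS is an assumed value.

```python
# Sketch of the crawler's three data structures described on the slides.
N_BUCKETS = 1024            # assumed hash table size 'n'

url_table = []              # large linear array of (url, title) entries
heap = []                   # stand-in for the string heap: strings are appended here
hash_table = [[] for _ in range(N_BUCKETS)]   # bucket k -> chain of url_table indices

def url_hash(url):
    """Map a URL to a nonnegative integer less than n."""
    return hash(url) % N_BUCKETS

def lookup(url):
    """Return the url_table index of 'url', or None if it is not yet known."""
    for entry_index in hash_table[url_hash(url)]:
        if url_table[entry_index][0] == url:
            return entry_index
    return None

def add_entry(url, title):
    """Append the strings to the heap and register the entry in both tables."""
    heap.append(url)
    heap.append(title)
    url_table.append((url, title))
    entry_index = len(url_table) - 1
    hash_table[url_hash(url)].append(entry_index)
    return entry_index
```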
[Figure: Data structures for the crawler. url_table entries hold pointers to URL and title strings stored in the heap; the hash table (hash codes 0 to n) chains together, via overflow chains, the entries whose URLs hash to the same value.]
Building the index requires two phases:
•Searching (URL processing)
•Indexing
The heart of the search engine is a recursive procedure, process_url, which takes a URL string as input.
Searching is done by the procedure process_url as follows:
•It hashes the URL to see if it is already present in
url_table. If so, it is done and returns immediately.
•If the URL is not already known, its page is
fetched.
•The URL and title are then copied to the heap and
pointers to these two strings are entered in
url_table.
•The URL is also entered into the hash table.
•Finally, process_url extracts all the hyperlinks from the page and calls process_url once per hyperlink, passing the hyperlink's URL as the input parameter.
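A minimal sketch of process_url built on the pieces above. It reuses lookup() and add_entry() from the data-structure sketch and LinkParser from the robot-loop sketch; the crude extract_title() helper and the use of requests are assumptions for illustration only.

```python
# Sketch of the recursive process_url procedure from the slides.
# A production crawler would bound the recursion (or use an explicit work list).
import re
from urllib.parse import urljoin
import requests

def extract_title(page):
    """Pull the <title> text out of the page (crude, for illustration only)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def process_url(url):
    if lookup(url) is not None:        # already present in url_table -> done
        return
    try:
        page = requests.get(url, timeout=5).text    # URL not known: fetch its page
    except requests.RequestException:
        return
    title = extract_title(page)
    add_entry(url, title)              # copy URL and title, update both tables
    parser = LinkParser()
    parser.feed(page)
    for link in parser.links:          # recurse once per hyperlink on the page
        process_url(urljoin(url, link))
```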
Keyword Indexing
For each entry in url_table, the indexing procedure examines the title and selects all words not on the stop list.
Each selected word is written to a file, as a line consisting of the word followed by the current url_table entry number.
When the whole table has been scanned, the file is sorted by word.
The stop list prevents indexing of prepositions, conjunctions, articles, and other words with many hits and little value.
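A short Python sketch of this indexing pass over url_table. The stop list shown is only a tiny illustrative sample, and the output file name is an assumption.

```python
# Scan url_table, emit "word entry_number" lines for every title word not on
# the stop list, then sort the whole file by word.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "or"}

def build_keyword_file(url_table, path="keywords.txt"):
    lines = []
    for entry_number, (url, title) in enumerate(url_table):
        for word in title.lower().split():
            if word not in STOP_LIST:        # skip high-hit, low-value words
                lines.append(f"{word} {entry_number}")
    lines.sort()                             # sort by word (entry numbers follow)
    with open(path, "w") as f:
        f.write("\n".join(lines))
```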
Formulating Queries
Submitting keywords causes a request to be sent to the machine where the index is located (the web server).
The keywords are then looked up in the index database to find the set of url_table indices for each keyword.
These indices are used to look up all the matching titles and URLs in url_table, which is held on the document server.
The results are then combined to form a web page and sent back to the user as the response.
Determining Relevance
The classic "TF/IDF" algorithm is used for determining relevance.
TF-IDF is a weight often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection.
A high TF-IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
Term Frequency
•"Term frequency" is the number of times a given term appears in a document.
•It gives a measure of the importance of the term $t_i$ within the particular document.
Term frequency: $\mathrm{tf}_i = \frac{n_i}{\sum_k n_k}$
where $n_i$ is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms in the document.
E.g.
If a document contains 100 total words and the word computer appears 3 times, then the term frequency of the word computer in the document is 0.03 (3/100).
Inverse Document Frequency
The "inverse document frequency" is a measure of the general importance of the term, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:
$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$
where
•$|D|$ : total number of documents in the corpus
•$|\{d : t_i \in d\}|$ : number of documents where the term $t_i$ appears (that is, $n_i \neq 0$).
Inverse Document Frequency
There are different ways of calculating the IDF.
1) "Document frequency" (DF): determine how many documents contain the word and divide by the total number of documents in the collection.
E.g. if the word computer appears in 1,000 documents out of a total of 10,000,000, then the document frequency is 0.0001 (1,000 / 10,000,000).
2) Alternatively, take the logarithm of the inverse of that ratio. The natural logarithm is commonly used. In this example we would have
IDF = ln(10,000,000 / 1,000) = 9.21.
TF-IDF Score
The final TF-IDF score is then calculated by combining the term frequency with the (inverse) document frequency.
E.g.
The TF-IDF score for computer in the collection would be:
1) TF-IDF = 0.03 / 0.0001 = 300, dividing TF by the plain document frequency.
2) Using the logarithmic IDF, TF-IDF = 0.03 * 9.21 = 0.28 (approximately).
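The computation described above can be sketched in a few lines of Python; the function names here are illustrative, and documents are simply lists of lowercase words.

```python
# TF-IDF: term frequency within a document times the logarithmic inverse
# document frequency across the collection.
import math

def tf(term, document_words):
    """Occurrences of 'term' divided by the total number of words."""
    return document_words.count(term) / len(document_words)

def idf(term, documents):
    """log of (total documents / documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(term, document_words, documents):
    return tf(term, document_words) * idf(term, documents)

# Worked example from the slides: 'computer' appears 3 times in a 100-word
# document and in 1,000 of 10,000,000 documents:
# tf = 3/100 = 0.03, idf = ln(10,000,000/1,000) = 9.21, tf-idf = 0.28 approx.
```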
The Big Picture
You are here; make yourself number one.
You don't need to be #1 for everything. You need to be #1 where it matters in the case of searching.
Directories
Ranking procedure:
•Human editors comprehensively check the website and rank it, based on the information they find, using a pre-defined set of rules.
There are two major directories:
•Yahoo Directory (www.yahoo.com)
•Open Directory (www.dmoz.org)
Hybrid Search Engines
Hybrid search engines use a combination of both crawler-based results and directory results.
Examples of hybrid search engines are:
•Yahoo (www.yahoo.com)
•Google (www.google.com)
Meta Search Engines
•Also known as Multiple Search Engines or Metacrawlers.
•Meta search engines query several other Web search engine databases in parallel and then combine the results in one list.
Examples of meta search engines include:
•Metacrawler (www.metacrawler.com)
•Dogpile (www.dogpile.com)
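A rough Python sketch of that fan-out-and-merge idea. The fetch_from_engine callable is a hypothetical stand-in for each underlying engine's query interface (no real engine API is used here), and the merge policy shown is the simplest possible one.

```python
# Query several engines in parallel and combine their result lists into one.
from concurrent.futures import ThreadPoolExecutor

def meta_search(query, engines, fetch_from_engine):
    """engines: list of engine names; fetch_from_engine(engine, query) -> list of URLs."""
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        result_lists = list(pool.map(lambda engine: fetch_from_engine(engine, query),
                                     engines))
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:          # drop duplicates reported by several engines
                seen.add(url)
                merged.append(url)
    return merged
```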
This is how it works
(in short)
ANY QUERIES…..???
Hey!
This is sovan..
Please send your feedback @
sovan107@gmail.com