Working of a “Search Engine”
Nikhil
D-1
14BTCSERS033
Maths Assignment
What is a Search Engine?
“A web search engine is a software system that
is designed to search for information on the
World Wide Web.”
Purpose of Search Engines
Helping people find what they’re looking for:
• Starts with an “information need”
• Converts it to a query
• Gets results
Types of Search Engines
• Search by Keywords
(e.g. AltaVista, Google)
• Search by categories
(e.g. Yahoo)
The Parts of a Search Engine
Spider (or “crawler”)
Index
Search software (an algorithm)
The “spider” or “crawler”
The spider visits a web page, reads it, and
then follows links to other pages within the
site. This is what it means when someone
refers to a site being "spidered" or
"crawled". This is also known as
“harvesting”. The spider returns to the site
on a regular basis, such as every month or
two, to look for changes.
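The crawl described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FAKE_WEB` is a hypothetical in-memory stand-in for real pages (no network requests are made), and `LinkParser` and `crawl` are names invented for this example, not part of any search engine.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" (URL -> HTML), used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/about">About</a> <a href="/news">News</a>',
    "http://example.com/about": '<a href="/">Home</a> <a href="http://other.org/">Other</a>',
    "http://example.com/news": '<a href="/about">About</a>',
    "http://other.org/": "<p>A different site</p>",
}


def crawl(start_url, fetch=FAKE_WEB.get):
    """Visit start_url, read it, then follow links to other pages within the same site."""
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}  # URL -> page text, in the order the pages were visited
    while queue:
        url = queue.popleft()
        if url in pages:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == site and link not in pages:
                queue.append(link)
    return pages


pages = crawl("http://example.com/")
print(list(pages))
# ['http://example.com/', 'http://example.com/about', 'http://example.com/news']
```

The link to `http://other.org/` is skipped because it points outside the site being crawled. A real spider would also re-run the crawl periodically to pick up changes.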
The Indexer
Everything the spider finds goes
into the second part of a search
engine, the index. The index,
sometimes called the catalog, is like
a giant book containing a copy of
every web page that the spider
finds. If a web page changes, then
this book is updated with the new
information.
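A minimal sketch of such an index, assuming pages arrive as plain text. Besides keeping a copy of each page, it also builds a word-to-pages lookup (an "inverted index"), which is a common technique but is an addition here, not something the slide describes. The names `Index`, `tokenize`, `add` and `lookup` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """Keeps a copy of each page plus a word -> URLs lookup (inverted index)."""

    def __init__(self):
        self.pages = {}                   # URL -> stored copy of the page text
        self.postings = defaultdict(set)  # word -> URLs whose copy contains it

    def add(self, url, text):
        """Store a page, replacing the old copy if the page has changed."""
        if url in self.pages:
            # Remove the words of the outdated copy before storing the new one.
            for word in tokenize(self.pages[url]):
                self.postings[word].discard(url)
        self.pages[url] = text
        for word in tokenize(text):
            self.postings[word].add(url)

    def lookup(self, word):
        """Return the URLs of all stored pages containing the word."""
        return sorted(self.postings.get(word.lower(), set()))


index = Index()
index.add("a.html", "Search engines crawl the web")
index.add("b.html", "Spiders crawl pages")
print(index.lookup("crawl"))   # ['a.html', 'b.html']

# The page changes, so its entry in the "book" is updated.
index.add("a.html", "Search engines rank pages")
print(index.lookup("crawl"))   # ['b.html']
print(index.lookup("pages"))   # ['a.html', 'b.html']
```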
Search engine software
It is the third part of a search
engine. This is the program that
sifts through the millions of pages
recorded in the index to find
matches to a search and ranks them
in order of what it believes is most
relevant.
TF-IDF Ranking Algorithm
Term frequency–inverse document
frequency (tf–idf) is a numerical statistic
intended to reflect how important a
word is to a document in a collection.
Variations of the tf–idf weighting
scheme are often used by search
engines as a central tool in scoring and
ranking a document's relevance given a
user query.
wij = tfij × log(N / n)
wij = weight of term Tj in document Di
tfij = frequency of term Tj in document Di
N = number of documents in the collection
n = number of documents in which term Tj occurs at least once
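The quantities above can be turned into a small ranking sketch. It assumes the common weighting w_ij = tf_ij × log(N / n), whitespace tokenization, and a query score equal to the sum of the weights of the query terms. Real engines use variations of this scheme. The function names and sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, with w_ij = tf_ij * log(N / n)."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)                  # number of documents in the collection
    doc_freq = Counter()                # n: number of documents containing each term
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)            # tf_ij: frequency of each term in this document
        weights.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weights


def rank(query, documents):
    """Score each document by summing the weights of the query terms, best first."""
    weights = tf_idf_weights(documents)
    terms = query.lower().split()
    scores = [sum(w.get(t, 0.0) for t in terms) for w in weights]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in order]


docs = [
    "the spider crawls the web",
    "the index stores pages",
    "the spider spider returns",
]
for doc, score in rank("spider", docs):
    print(f"{score:.3f}  {doc}")
```

The third document ranks first because "spider" occurs twice in it. The word "the" gets weight 0 everywhere because it occurs in every document, so log(N / n) = log(1) = 0.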
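PageRank
• The equation:
PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn))
where t1 … tn are the pages that link to page A, C(t) is the
number of links going out of page t, and d is a damping
factor between 0 and 1
• Used by WebQuery and Google
• Google models a user who follows links at random
(a “random surfer”) to rank documents.
• Google uses a citation graph (518 million links)
• Google computes PageRank for 26 million pages in a few hours.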
PageRank works by counting
the number and quality of
links to a page to determine a
rough estimate of how
important the website is. The
underlying assumption is that
more important websites are
likely to receive more links
from other websites.
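As a rough illustration, the equation PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)) can be applied repeatedly to a small link graph until the values settle. The sketch below assumes d = 0.85 (a commonly used value, not given in the slides), a fixed number of iterations, and a made-up four-page graph. The `pagerank` function is hypothetical, not Google's implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over all pages t linking to A.

    links maps each page to the list of pages it links to;
    C(t) is therefore len(links[t]).
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page t that links here contributes its rank split over its out-links.
            incoming = sum(
                pr[t] / len(links[t])
                for t in links
                if page in links[t]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link graph: A links to B and C, B to C, C to A, D to C.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because three pages link to it, while D, which no page links to, keeps only the baseline value 1 - d.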
The End
Thank you for listening patiently.
