Google and Yahoo case study: page ranking, search engine optimization, types of search engines, inverted files, how search engines work, web crawlers, doc files, queries, etc.
2. Today's Coverage
● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization
3. Introduction
• A web search engine is a software system
designed to search for information on
the World Wide Web. The search results are
generally presented as a list, often
referred to as search engine results pages.
• Search engines look through their own
databases of information to find what
you are looking for.
4. Types of Search Engines
● Crawler Powered Indexes
– Guruji.com, Google.com
● Human Powered Indexes
– www.dmoz.org
● Hybrid Models
– Submitted URLs to a search engine ?
● Semantic Indexes
– Hakia.com
7. Copyleft (ɔ) 2009 Sudarsun Santhiappan 7
How Does a Search Engine Work?
8. How Search Engines Work (Sherman 2003)
[Figure: a crawler fetches pages (URL1 to URL4) from the Web and an indexer stores them in the search engine database. Your browser sends the query "Eggs?" and the search engine answers with ranked matches: Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%. Example page: "All About Eggs" by S. I. Am.]
11. Crawlers
• A crawler is a program that visits Web sites
and reads their pages and other information
in order to create entries for a search
engine index. The major search engines on the
Web all have such a program, which is also
known as a "spider" or a "bot."
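The visit-and-read loop described above can be sketched as follows. This is only an illustration: FAKE_WEB is a hypothetical in-memory stand-in for the Web (so the sketch runs offline), and LinkExtractor and crawl are names invented here, not part of any real search engine.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for the Web: URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "URL1": '<a href="URL2">two</a> <a href="URL3">three</a>',
    "URL2": '<a href="URL3">three</a>',
    "URL3": '<a href="URL1">one</a> <a href="URL4">four</a>',
    "URL4": "All About Eggs",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for name, v in attrs if name == "href" and v)


def crawl(seed):
    """Breadth-first crawl from seed; returns {url: html} in visit order."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        html = FAKE_WEB.get(url)
        if html is None:  # dead link: nothing to read
            continue
        pages[url] = html  # the content that would be handed to the indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages


pages = crawl("URL1")
print(list(pages))
```

The seen set is what keeps the crawl from looping forever on pages that link back to each other.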
12. Indexers
• A database index is a data structure that
improves the speed of data retrieval
operations on a database table at the cost of
additional writes and the use of more storage
space to maintain the extra copy of data.
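A small sketch of that trade-off, assuming a toy table held in a Python list. IndexedTable and its methods are hypothetical names used only for illustration; a real database uses structures such as B-trees.

```python
from collections import defaultdict


class IndexedTable:
    """A list of rows plus a secondary index on one column (illustrative only)."""

    def __init__(self, column):
        self.column = column
        self.rows = []
        self.index = defaultdict(list)  # column value -> row positions (the extra copy)

    def insert(self, row):
        self.rows.append(row)
        # The additional write: every insert must also update the index.
        self.index[row[self.column]].append(len(self.rows) - 1)

    def scan(self, value):
        """Lookup without the index: examine every row."""
        return [r for r in self.rows if r[self.column] == value]

    def lookup(self, value):
        """Lookup with the index: jump straight to the matching rows."""
        return [self.rows[i] for i in self.index.get(value, [])]


table = IndexedTable("term")
for term, doc in [("country", 1), ("time", 1), ("country", 2)]:
    table.insert({"term": term, "doc": doc})

print(table.lookup("country"))
```

Both scan and lookup return the same rows; lookup simply avoids touching rows that cannot match.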
13. Semantics
• Semantics is the study of meaning. It focuses
on the relation between signifiers, like
words, phrases, signs, and symbols, and what
they stand for (their denotation). In search,
semantics is used for understanding human
expression through language.
15. How Inverted Files Are Created
● Periodically rebuilt, static otherwise.
● Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
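The parsing step can be sketched as below. The tokenizer (lowercase, alphabetic runs only) is an assumption; the slides do not say which tokenization rule is used.

```python
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


# One (term, doc_id) pair per token occurrence, in document order.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

for term, doc_id in pairs[:4]:
    print(term, doc_id)
```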
16. How Inverted Files are Created
● After all documents have been parsed, the inverted file is sorted alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
17. How Inverted Files are Created
● Multiple term entries for a single document are merged.
● Within-document term frequency information is compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
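Sorting and merging can be sketched together. The pairs list here is abbreviated to a few terms from the two example documents to keep the sketch short; the variable names are illustrative.

```python
from itertools import groupby

# (term, doc_id) pairs in parse order; only a few terms are shown.
pairs = [("the", 1), ("time", 1), ("to", 1), ("to", 1), ("the", 1),
         ("was", 2), ("the", 2), ("the", 2), ("time", 2), ("was", 2)]

# Step 1: sort alphabetically by term, then by document number.
sorted_pairs = sorted(pairs)

# Step 2: merge identical (term, doc) entries and count them.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(sorted_pairs)]

for row in merged:
    print(*row)
```

groupby only merges adjacent equal items, which is why the sort has to come first.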
18. How Inverted Files are Created
● Finally, the file can be split into
– A Dictionary or Lexicon file
– A Postings file
19. How Inverted Files are Created
The merged file is split into a Postings file and a Dictionary/Lexicon file.
Postings
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Dictionary/Lexicon
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
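A sketch of the split, using a few rows of the merged table. The offset field, which records where a term's entries start in the postings list, is an assumption added so the two files can be joined again; the slides do not show how the dictionary points into the postings.

```python
# (term, doc_id, freq) rows from the merged file; only a few terms are shown.
merged = [("country", 1, 1), ("country", 2, 1), ("the", 1, 2), ("the", 2, 2),
          ("time", 1, 1), ("time", 2, 1), ("to", 1, 2), ("was", 2, 2)]

postings = []    # flat list of (doc_id, freq), in term order
dictionary = {}  # term -> {"n_docs", "tot_freq", "offset"}

for term, doc_id, freq in merged:
    entry = dictionary.setdefault(
        term, {"n_docs": 0, "tot_freq": 0, "offset": len(postings)})
    entry["n_docs"] += 1
    entry["tot_freq"] += freq
    postings.append((doc_id, freq))


def postings_for(term):
    """Use the dictionary entry to slice the term's rows out of the postings."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    return postings[entry["offset"]:entry["offset"] + entry["n_docs"]]


print(dictionary["the"], postings_for("the"))
```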
20. Inverted Index
• In computer science, an inverted index (also referred
to as a postings file or inverted file) is an index data
structure storing a mapping from content, such as
words or numbers, to its locations in a database file, or
in a document or a set of documents. The purpose of
an inverted index is to allow fast full-text searches, at the
cost of increased processing when a document is
added to the database. The inverted file may be the
database file itself, rather than its index. It is the most
popular data structure used in document
retrieval systems, and is used on a large scale in, for example,
search engines.
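A minimal sketch of a full-text lookup through an inverted index, built over the two example documents. The tokenizer and the AND-only query semantics are assumptions made for this illustration.

```python
import re
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


# Build the mapping from each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)


def search(query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


print(sorted(search("country time")))
print(sorted(search("stormy night")))
```

A query only touches the entries for its own terms, which is where the speed comes from; the price is the index-building loop that runs whenever documents are added.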
21. From the description of the FAST search engine, by Knut Risvik
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
● Each row can handle 120 queries per second.
● Each column can handle 7M pages.
● To handle more queries, add another row.
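The sizing rule can be worked through with a little arithmetic. The two capacity figures come from the slide; the workload of 70M pages and 500 queries per second is a made-up example, and grid_size is a hypothetical helper.

```python
PAGES_PER_COLUMN = 7_000_000  # from the slide: each column handles 7M pages
QPS_PER_ROW = 120             # from the slide: each row handles 120 queries/second


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed for the given workload."""
    columns = ceil_div(total_pages, PAGES_PER_COLUMN)  # partitions of the index
    rows = ceil_div(peak_qps, QPS_PER_ROW)             # replicas of each partition
    return rows, columns, rows * columns


# Hypothetical workload: 70M pages, 500 queries per second at peak.
print(grid_size(70_000_000, 500))
```

Under these assumptions, more pages add columns and more queries add rows, so the machine count is the product of the two.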
22. PageRank
● Let A1, A2, …, An be the pages that point to page A. Let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as:
PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )
where d is a damping factor between 0 and 1.
● PageRank is the principal eigenvector of the link matrix of the web.
● It can be computed as the fixpoint of the above equation.
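The fixpoint computation can be sketched by applying the PR(A) equation repeatedly until the values stop changing. The three-page link graph, the default d = 0.85, and the stopping tolerance are assumptions for illustration, and the sketch assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """links maps each page to the list of pages it links to.

    Assumes every page has at least one out-link and every target is a key.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial guess

    # Invert the link map: for each page, which pages point to it?
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    for _ in range(max_iter):
        # PR(A) = (1-d) + d * sum(PR(Ai) / C(Ai)) over pages Ai pointing to A
        new = {p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming[p])
               for p in pages}
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in ranks.items():
    print(page, round(score, 3))
```

In this graph C collects rank from both A and B, so it ends up ranked above B, which only receives half of A's rank.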