Inside Search Engine - A case study

Outline
• WWW of SE (Search Engine)

• A brief overview of SE history

• Getting Started!

• Basic Architecture of SE

• Inside PageRank

• Related work & Future

Motivation
Unedited – anyone can enter content
– Quality issues; Spam

Varied information types
– Catalogs, dissertations, news reports, weather, pictures, videos…

Different kinds of users
• Lexis-Nexis: Paying, professional searchers
• Online catalogs: Scholars searching scholarly literature
• Web: Every type of person with every type of goal

Scale
• Hundreds of millions of searches/day; billions of docs

Motivation

What’s the situation without SE?

Motivation

“Necessity is the mother of invention” (famous saying)

So, it’s a KDD (Knowledge Discovery from Data) process!

Motivation

Search Engine Saves Today!

A search engine helps you find things on the Internet, any time anyone looks up anything on the Internet!

A brief History

•Three major categories of SE
– Full-text Search Engine
– Directory Search Engine (generally speaking)
– Meta Search Engine

•Major Issues of SE
- Understanding Search Queries
- Understanding Websites & Hyperlinks
- Accuracy & Relevance
- Honesty & Anti-Spam!

Getting Started!

•Importance of Links
– Internal links (links within your site)
– Outbound links (sites you link to)
– Inbound links (sites linking to you)

•Good Websites
- Key pages reachable in only a few clicks
- Clear user navigation
- Links that are easy for robots to follow
- Descriptive anchor text

Getting Started!

•Anchor text (descriptive)
•Crawler (spider)
- Main difficulties
- Graph theory
- A simple process
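
The "simple process" of a crawler can be pictured as a breadth-first traversal of the link graph. The sketch below is only an illustration under stated assumptions: the web is replaced by a small in-memory dictionary (`TOY_WEB`, hypothetical), so no network access is involved, and the function name `crawl` and its `max_pages` limit are made up for this example.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl starting at `seed`; returns pages in visit order."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting cycles
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl("a.example/", lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what keeps the crawler from looping forever on cyclic links, which is one reason the web is usually treated as a graph rather than a tree.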

Getting Started!

•Inverted Indexes (the IR way)
•How are I.I. created?
•A detailed example with two docs
•I.I. for Web Search Engines

Getting Started!

How are I.I. files created?
- Periodically rebuilt, static otherwise.
- Docs are parsed to extract tokens. These are saved with the Doc ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

Getting Started!
How are I.I. files created?
- After all documents have been parsed, the inverted file is sorted alphabetically.

Parsed order      Sorted
Term      Doc #   Term      Doc #
now       1       a         2
is        1       aid       1
the       1       all       1
time      1       and       2
for       1       come      1
all       1       country   1
good      1       country   2
men       1       dark      2
to        1       for       1
come      1       good      1
to        1       in        2
the       1       is        1
aid       1       it        2
of        1       manor     2
their     1       men       1
country   1       midnight  2
it        2       night     2
was       2       now       1
a         2       of        1
dark      2       past      2
and       2       stormy    2
stormy    2       the       1
night     2       the       1
in        2       the       2
the       2       the       2
country   2       their     1
manor     2       time      1
the       2       time      2
time      2       to        1
was       2       to        1
past      2       was       2
midnight  2       was       2

Getting Started!
How are I.I. files created?
- Multiple term entries for a single document are merged.
- Within-document term frequency information is compiled.

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Getting Started!

- Finally, the file can be split into:
• A Dictionary or Lexicon file
• A Postings file

Getting Started!

The merged (Term, Doc #, Freq) list is split into a Dictionary/Lexicon and a Postings file:

Dictionary/Lexicon            Postings
Term      N docs  Tot Freq    (Doc #, Freq)
a         1       1           (2, 1)
aid       1       1           (1, 1)
all       1       1           (1, 1)
and       1       1           (2, 1)
come      1       1           (1, 1)
country   2       2           (1, 1) (2, 1)
dark      1       1           (2, 1)
for       1       1           (1, 1)
good      1       1           (1, 1)
in        1       1           (2, 1)
is        1       1           (1, 1)
it        1       1           (2, 1)
manor     1       1           (2, 1)
men       1       1           (1, 1)
midnight  1       1           (2, 1)
night     1       1           (2, 1)
now       1       1           (1, 1)
of        1       1           (1, 1)
past      1       1           (2, 1)
stormy    1       1           (2, 1)
the       2       4           (1, 2) (2, 2)
their     1       1           (1, 1)
time      2       2           (1, 1) (2, 1)
to        1       2           (1, 2)
was       1       2           (2, 2)
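
The steps of the two-document example (parse, merge with term frequencies, split into dictionary and postings) can be sketched in a few lines. This is a minimal illustration, not how a production engine stores its files: the names `tokenize` and `build_index` are hypothetical, tokens are assumed to be lowercase runs of letters, and everything is kept in memory.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Return (dictionary, postings) for a {doc_id: text} mapping."""
    postings = defaultdict(list)          # term -> [(doc_id, freq), ...]
    for doc_id in sorted(docs):
        # Counter merges repeated terms within one document into a frequency.
        for term, freq in Counter(tokenize(docs[doc_id])).items():
            postings[term].append((doc_id, freq))
    # Dictionary entry: (number of docs containing the term, total frequency).
    dictionary = {t: (len(p), sum(f for _, f in p)) for t, p in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_index(DOCS)
for term in sorted(dictionary):           # alphabetical, as in the sorted file
    print(term, dictionary[term], postings[term])
```

For the term "the" this gives the dictionary entry `(2, 4)` and the postings `[(1, 2), (2, 2)]`, matching the example table.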

Getting Started!

Inverted indexes
- Permit fast search for individual terms

- For each term, you get a list consisting of:
• document ID
• frequency of term in doc (optional)
• position of term in doc (optional)

- These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
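
A Boolean AND over two terms amounts to intersecting their sorted document-ID lists. The sketch below assumes postings that hold only document IDs (frequencies and positions omitted); the names `intersect` and `boolean_and` are hypothetical.

```python
POSTINGS = {"country": [1, 2], "manor": [2]}

def intersect(a, b):
    """Merge-style intersection of two sorted doc-ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def boolean_and(index, *terms):
    """Docs containing every term; unknown terms have an empty list."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0] if lists else []
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(boolean_and(POSTINGS, "country", "manor"))
```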

Getting Started!

Inverted Indexes for Web SE
- Inverted indexes are still used, even though the web is so
huge.

- Some systems partition the indexes across different
machines. Each machine handles different parts of the
data.

- Other systems duplicate the data across many machines;
queries are distributed among the machines.

- Most do a combination of these.
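
Partitioning can be pictured as follows. The sketch assumes a document-partitioned layout in which a document's ID modulo the number of shards picks its "machine"; that rule and the names `shard_for`, `build_shards` and `search` are illustrative assumptions, and each shard here is just a dictionary in one process. Replication would add several copies of each shard, with queries spread among the copies; it is not shown.

```python
def shard_for(doc_id, n_shards):
    """Assumed placement rule: doc ID modulo the shard count."""
    return doc_id % n_shards

def build_shards(docs, n_shards):
    """Build one small inverted index (term -> doc IDs) per shard."""
    shards = [dict() for _ in range(n_shards)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id, n_shards)]
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return shards

def search(shards, term):
    """Send the query to every shard and merge the partial results."""
    hits = []
    for index in shards:   # a real system would query the machines in parallel
        hits.extend(index.get(term, []))
    return sorted(hits)

DOCS = {1: "now is the time", 2: "it was a dark night", 3: "the time was past midnight"}
shards = build_shards(DOCS, 2)
print(search(shards, "time"))
```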

Basic Web SE Architecture

[Figure: basic web search engine architecture. Crawl the web; check for duplicates and store the documents (DocIds); create an inverted index. A user query goes to the search engine servers, which consult the inverted index and show results to the user.]

Google’s Architecture
• Sorted barrels = inverted index
• PageRank computed from link structure; combined with IR rank
• IR rank depends on TF, type of “hit”, hit proximity, etc.
• Billion documents
• Hundred million queries a day

Inside PageRank

Motivation
• Web: heterogeneous and unstructured
• No quality control on the web
• Commercial interest in manipulating rankings
• Building an open lab for scientists

Inside PageRank

Motivation
• Most algorithms come from IR (e.g., the vector space model)
• They use only page content and neglect the graph (link) structure

Inside PageRank

Related Work
Assumption: If the pages pointing to this page
are good, then this is also a good page.
– References: Kleinberg 98, Page et al. 98

Draws upon earlier research in sociology and
bibliometrics.
• Kleinberg’s model includes “authorities” (highly
referenced pages) and “hubs” (pages containing
good reference lists).
• Google model is a version with no hubs, and is
closely related to work on influence weights by
Pinski-Narin (1976).

Inside PageRank

PR: Bringing Order to the Web
Basic idea
• Introduce a notion of page authority that is independent of the page content
• Take into account only the topological structure of the web
• Intuition: a page has high rank if the sum of the ranks of its backlinks is high
• A similar idea can be found in scientific citation

Inside PageRank

• Pages with lots of backlinks are important
• Backlinks from important pages convey more importance to a page

• Problem: rank sink (dangling pages)

Inside PageRank

Problem: this loop will accumulate rank but
never distribute any rank outside!

Inside PageRank

Where
• $W = \{w_{i,j}\}$ is the transition matrix
• $w_{i,j} = 1/h_j$ if there is a hyperlink from page $j$ to page $i$, and $w_{i,j} = 0$ otherwise ($h_j$ is the number of outgoing links of page $j$)
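
The definition of W translates directly into code. The sketch below uses a made-up three-page graph (`LINKS`, hypothetical), numbers pages from 0, and stores W as a plain list of rows; the function name `transition_matrix` is likewise an assumption.

```python
def transition_matrix(links, n):
    """links[j] lists the pages that page j links to (no duplicates assumed).

    Returns W as an n x n list of rows with W[i][j] = 1/h_j if j links to i.
    """
    W = [[0.0] * n for _ in range(n)]
    for j, targets in links.items():
        h_j = len(targets)            # number of outgoing links of page j
        for i in targets:
            W[i][j] = 1.0 / h_j
    return W

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
LINKS = {0: [1, 2], 1: [2], 2: [0]}
W = transition_matrix(LINKS, 3)
for row in W:
    print(row)
```

Each column of W sums to 1 when the page has at least one outgoing link; a page with no outgoing links leaves an all-zero column.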

Inside PageRank

Consider this simple case: what will the transition matrix look like?

Inside PageRank

•The system is stable, and x(t) always converges to the stationary solution
•d is a damping factor, with 0 < d < 1
•The iteration is just the Jacobi algorithm for solving a linear system
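
The update rule is not spelled out in the text here, so the sketch below assumes the form x(t) = d·W·x(t−1) + (1−d)·1 used in the "Inside PageRank" paper listed in the references. The value d = 0.85, the tolerance, and the three-page matrix are illustrative choices, and `pagerank` is a hypothetical name.

```python
def pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x(t) = d*W*x(t-1) + (1-d) until successive iterates agree."""
    n = len(W)
    x = [1.0] * n
    for _ in range(max_iter):
        new = [d * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - d)
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x

# Assumed example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (columns sum to 1).
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
ranks = pagerank(W)
print(ranks)
```

Because 0 < d < 1, each step shrinks the error by at least a factor of d, which is why the iteration settles on the stationary solution regardless of the starting vector.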

Inside PageRank

•PR corresponds to the probability distribution of a random walk on the web graph

•The escape term can be personalized!
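
One way to read "personalized" is to replace the uniform escape term with a distribution chosen per user or topic. The sketch assumes the rule x(t) = d·W·x(t−1) + (1−d)·e, where e (called `escape` here, a hypothetical name) is a probability vector; the matrix and d = 0.85 are illustrative.

```python
def personalized_pagerank(W, escape, d=0.85, iters=200):
    """Random-surfer ranks when escapes land on pages according to `escape`."""
    n = len(W)
    x = list(escape)
    for _ in range(iters):
        x = [d * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - d) * escape[i]
             for i in range(n)]
    return x

# Assumed example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
uniform = personalized_pagerank(W, [1 / 3, 1 / 3, 1 / 3])
biased = personalized_pagerank(W, [1.0, 0.0, 0.0])   # always escape to page 0
print(uniform, biased)
```

With the escape mass concentrated on page 0, that page's score rises relative to the uniform case, while the scores still sum to 1 and can be read as a probability distribution.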

Inside PageRank

A stochastic process is any sequence of experiments for which the outcome at any stage depends on chance. A Markov process is a stochastic process with the following properties:

•The set of possible outcomes (states) is finite
•The probability of the next outcome depends only on the previous outcome
•The probabilities are constant over time

Inside PageRank

Theorem 1:
If λ = 1 is a dominant eigenvalue of a stochastic matrix A, then the Markov chain with transition matrix A will converge to a steady state. The Perron theorem can be used to show that if the transition matrix A of a Markov process is positive, then λ = 1 is a dominant eigenvalue of A.

Inside PageRank

Theorem 2:
If A is a positive n×n matrix, then A has a positive real eigenvalue R with the following properties:

1. R has a positive eigenvector X
2. If λ is any other eigenvalue of A, then |λ| < R
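
A small numerical check can make the two theorems concrete. The sketch uses a made-up positive 2×2 matrix whose columns sum to 1 (so it is stochastic with dominant eigenvalue 1) and applies it repeatedly to two different starting distributions; the names `step` and `steady_state` are hypothetical.

```python
def step(A, x):
    """One transition of the chain: returns A * x."""
    n = len(A)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

def steady_state(A, x, iters=200):
    """Apply the transition matrix repeatedly to the distribution x."""
    for _ in range(iters):
        x = step(A, x)
    return x

# Assumed example: every entry is positive and each column sums to 1.
A = [[0.9, 0.5],
     [0.1, 0.5]]
s1 = steady_state(A, [1.0, 0.0])
s2 = steady_state(A, [0.0, 1.0])
print(s1, s2)
```

Both starting points end at the same positive vector, (5/6, 1/6), which is the eigenvector for λ = 1; the other eigenvalue of this matrix is 0.4, so its contribution dies out quickly.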

Outside PageRank

Reference
• The PageRank Citation Ranking: Bringing Order to the Web
• The Anatomy of a Large-Scale Hypertextual Web Search Engine
• Inside PageRank
• Combating Web Spam with TrustRank
• Does “Authority” Mean Quality?
• What Can You Do with a Web in Your Pocket?
• Modern Information Retrieval (book)
• Data Mining: Concepts and Techniques (book)
