2. Outline
• WWW of SE (Search Engine)
• A brief overview of SE history
• Getting Started!
• Basic Architecture of SE
• Inside PageRank
• Related work & Future
3. Motivation
Unedited – anyone can enter content
– Quality issues; Spam
Varied information types
– Catalogs, dissertations, news reports, weather, pictures, videos…
Different kinds of users
• Lexis-Nexis: Paying, professional searchers
• Online catalogs: Scholars searching scholarly literature
• Web: Every type of person with every type of goal
Scale
• Hundreds of millions of searches/day; billions of docs
11. Getting Started!
•Importance of Links
– Internal links (links within your site)
– Outbound links (sites you link to)
– Inbound links (sites linking to you)
•Good Websites
- Key pages reachable within only a few clicks
- Clear user navigation
- Links that are easy for robots to follow
- Anchor text
12. Getting Started!
•Anchor text (descriptive)
•Crawler (spider)
- Main difficulties
- Graph Theory
- A simple process
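Seen through graph theory, a crawl is a traversal of the link graph. A minimal sketch of that simple process follows, assuming a breadth-first visit order; the `TOY_WEB` dictionary and the `crawl` function are hypothetical stand-ins, and a real crawler would fetch pages over HTTP and extract links from the HTML.

```python
from collections import deque

# Hypothetical in-memory "web": page -> list of outgoing links.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],
    "c.html": ["d.html"],
    "d.html": [],
}

def crawl(seed, web):
    """Breadth-first traversal of the link graph starting from `seed`."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # prevents revisiting pages and looping on cycles
    order = []                 # pages in the order they were visited
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in web.get(url, []):   # stands in for "fetch page, parse links"
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl("a.html", TOY_WEB)
print(visited)
```

The `seen` set is what keeps the crawl finite: the link from b.html back to a.html is ignored because a.html was already queued.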
13. Getting Started!
•Inverted Indexes (The IR Way)
•How are inverted indexes (I.I.) created?
•A Detailed Example of Two Docs
•I.I. for Web Search Engines
14. Getting Started!
How are I.I. files created?
- Periodically rebuilt, static otherwise.
- Docs are parsed to extract tokens. These are saved with the Doc ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

(Doc 1 tokens on the left, Doc 2 tokens on the right, in order of appearance)
Term      Doc #    Term      Doc #
now       1        it        2
is        1        was       2
the       1        a         2
time      1        dark      2
for       1        and       2
all       1        stormy    2
good      1        night     2
men       1        in        2
to        1        the       2
come      1        country   2
to        1        manor     2
the       1        the       2
aid       1        time      2
of        1        was       2
their     1        past      2
country   1        midnight  2
15. Getting Started!
How are I.I. files created?
- After all documents have been parsed, the inverted file is sorted alphabetically.

Before sorting        After sorting
Term      Doc #       Term      Doc #
now       1           a         2
is        1           aid       1
the       1           all       1
time      1           and       2
for       1           come      1
all       1           country   1
good      1           country   2
men       1           dark      2
to        1           for       1
come      1           good      1
to        1           in        2
the       1           is        1
aid       1           it        2
of        1           manor     2
their     1           men       1
country   1           midnight  2
it        2           night     2
was       2           now       1
a         2           of        1
dark      2           past      2
and       2           stormy    2
stormy    2           the       1
night     2           the       1
in        2           the       2
the       2           the       2
country   2           their     1
manor     2           time      1
the       2           time      2
time      2           to        1
was       2           to        1
past      2           was       2
midnight  2           was       2
16. Getting Started!
How are I.I. files created?
- Multiple term entries for a single document are merged.
- Within-document term frequency information is compiled.

(Input: the alphabetically sorted Term / Doc # list. Output after merging:)
Term      Doc #   Freq
a         2       1
aid       1       1
all       1       1
and       2       1
come      1       1
country   1       1
country   2       1
dark      2       1
for       1       1
good      1       1
in        2       1
is        1       1
it        2       1
manor     2       1
men       1       1
midnight  2       1
night     2       1
now       1       1
of        1       1
past      2       1
stormy    2       1
the       1       2
the       2       2
their     1       1
time      1       1
time      2       1
to        1       2
was       2       2
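The parse, sort, and merge steps of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DOCS` dictionary, the `tokenize` helper, and the lowercase, letters-only tokenization are choices made here, not something the slides specify.

```python
import re
from itertools import groupby

# The two example documents, keyed by Doc ID.
DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Step 1: parse each document into (term, doc_id) pairs.
pairs = [(term, doc_id) for doc_id, text in DOCS.items() for term in tokenize(text)]

# Step 2: sort alphabetically by term, then by Doc ID.
pairs.sort()

# Step 3: merge identical (term, doc_id) entries and count them,
# which gives the within-document term frequency.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

for row in merged:
    print(row)
```

Sorting first is what makes the merge cheap: identical entries become adjacent, so a single pass with `groupby` is enough.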
17. Getting Started!
How are I.I. files created?
- Finally, the file can be split into:
• A Dictionary or Lexicon file
• A Postings file
18. Getting Started!
How are I.I. files created?
The merged (Term, Doc #, Freq) file is split into a Dictionary/Lexicon (Term, N docs, Tot Freq) and a Postings file (Doc #, Freq). Each dictionary entry points to its postings:

Term      N docs  Tot Freq  Postings (Doc #, Freq)
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1) (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2) (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1) (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)
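As a rough sketch of the split, the merged records can be turned into a dictionary (term, number of docs, total frequency) plus per-term postings. The `split_index` function and the in-memory layout are assumptions for illustration; a real system would store the postings in a separate file and keep offsets into it in the dictionary.

```python
from collections import defaultdict

# A few merged (term, doc_id, freq) records from the example, for brevity.
MERGED = [
    ("country", 1, 1), ("country", 2, 1),
    ("manor", 2, 1),
    ("the", 1, 2), ("the", 2, 2),
    ("time", 1, 1), ("time", 2, 1),
    ("to", 1, 2),
    ("was", 2, 2),
]

def split_index(merged):
    """Split merged records into a dictionary and a postings structure."""
    postings = defaultdict(list)
    for term, doc_id, freq in merged:
        postings[term].append((doc_id, freq))
    # Dictionary entry: (number of docs containing the term, total frequency).
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

dictionary, postings = split_index(MERGED)
print(dictionary["the"], postings["the"])
```

Keeping the dictionary small and separate means it can stay in memory while the much larger postings are read only for the terms a query actually uses.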
19. Getting Started!
Inverted indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of:
• document ID
• frequency of term in doc (optional)
• position of term in doc (optional)
- These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
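A Boolean AND over two postings lists amounts to intersecting two sorted lists of document IDs. The sketch below uses the standard two-pointer merge; the `INDEX` dictionary and the `intersect` name are illustrative choices, not taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two sorted lists of doc IDs with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])   # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1              # advance the list with the smaller doc ID
        else:
            j += 1
    return out

# Postings for the two example terms (doc IDs only).
INDEX = {"country": [1, 2], "manor": [2]}

result = intersect(INDEX["country"], INDEX["manor"])
print(result)
```

Because both lists are sorted, the intersection takes time proportional to their combined length rather than their product.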
20. Getting Started!
Inverted Indexes for Web SE
- Inverted indexes are still used, even though the web is so
huge.
- Some systems partition the indexes across different
machines. Each machine handles different parts of the
data.
- Other systems duplicate the data across many machines;
queries are distributed among the machines.
- Most do a combination of these.
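To make the partitioning idea concrete, here is a toy sketch in which each "machine" is just a separate in-memory index over its share of the documents, and a query is sent to every shard and the hits are merged. The shard assignment by `doc_id % NUM_SHARDS`, the sample documents, and all names are assumptions for illustration only.

```python
def build_index(docs):
    """Build a term -> [doc_id, ...] index for one shard."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

# Hypothetical mini-collection.
DOCS = {
    1: "now is the time",
    2: "the country manor",
    3: "their country",
    4: "dark and stormy night",
}

NUM_SHARDS = 2
# Each shard indexes only the documents assigned to it.
shards = [
    build_index({d: t for d, t in DOCS.items() if d % NUM_SHARDS == s})
    for s in range(NUM_SHARDS)
]

def search(term):
    """Send the query to every shard and merge the partial results."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)

print(search("country"))
```

Replication would instead keep a full copy of one of these indexes on several machines and send each query to only one copy.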
21. Basic Web SE Architecture
[Diagram: basic web search engine pipeline]
- Crawl the web
- Check for duplicates, store the documents (DocIds)
- Create an inverted index
- Search engine servers answer the user query from the inverted index
- Show results to user
22. Google’s Architecture
[Figure: Google's architecture, with these annotations]
- Sorted barrels = inverted index
- PageRank is computed from the link structure and combined with the IR rank
- IR rank depends on TF, type of “hit”, hit proximity, etc.
- Scale: a billion documents, a hundred million queries a day
23. Inside PageRank
Motivation
Web: heterogeneous and unstructured
No quality control on the web
Commercial interest in manipulating ranking
Building an open lab for scientists
25. Inside PageRank
Related Work
Assumption: If the pages pointing to this page
are good, then this is also a good page.
– References: Kleinberg 98, Page et al. 98
Draws upon earlier research in sociology and
bibliometrics.
• Kleinberg’s model includes “authorities” (highly
referenced pages) and “hubs” (pages containing
good reference lists).
• Google model is a version with no hubs, and is
closely related to work on influence weights by
Pinski-Narin (1976).
26. Inside PageRank
PR: Bringing Order to the Web
Basic idea
• Introduce a notion of page authority, which is independent of the page content
• Only take into account the topological structure of the web
• Intuition: a page has high rank if the sum of the ranks of its backlinks is high
• A similar idea can be found in scientific citation analysis
27. Inside PageRank
• Pages with lots of back links are important
• Back links that come from important pages convey more importance to a page
• Problem: rank sink (dangling pages)
28. Inside PageRank
Problem: this loop will accumulate rank but
never distribute any rank outside!
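The effect can be reproduced on a tiny graph. The sketch below assumes the simplified, undamped rule in which each page splits its rank evenly among its outgoing links; the four-page graph (A → B → C, with C and D linking only to each other) is made up for illustration.

```python
# Hypothetical graph: C and D form a loop with no links leading out of it.
LINKS = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}

def propagate(rank, links):
    """One undamped step: every page passes its rank along its out-links."""
    new = {page: 0.0 for page in links}
    for page, outs in links.items():
        for target in outs:
            new[target] += rank[page] / len(outs)
    return new

rank = {page: 0.25 for page in LINKS}   # start with uniform rank
for _ in range(10):
    rank = propagate(rank, LINKS)

print(rank)
```

After a few steps all of the rank sits in the C/D loop and A and B are left with none, which is the behavior the damping factor is meant to prevent.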
33. Inside PageRank
•The system is stable and x(t) always converges to the stationary solution
•d is a damping factor, with 0 < d < 1
•This is just the Jacobi algorithm for solving a linear system
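A minimal sketch of such an iteration follows. It assumes the commonly used update x(t+1) = d·W·x(t) + (1 − d)·1, where each page q passes x_q divided by its number of out-links to the pages it links to; the equation on the original slides is not reproduced here, so treat this form, the value d = 0.85, and the toy graph as assumptions.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x(t+1) = d * W x(t) + (1 - d) until the change is below tol."""
    pages = sorted(links)
    x = {p: 1.0 for p in pages}               # initial guess x(0)
    for _ in range(max_iter):
        new = {p: 1.0 - d for p in pages}     # the (1 - d) term
        for q in pages:
            outs = links[q]
            for p in outs:                    # q shares its rank among its out-links
                new[p] += d * x[q] / len(outs)
        delta = max(abs(new[p] - x[p]) for p in pages)
        x = new
        if delta < tol:
            break
    return x

# Hypothetical graph with a C/D loop; no page here is dangling.
LINKS = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}

x = pagerank(LINKS)
print(x)
```

With d < 1 the loop no longer absorbs everything: A and B keep a nonzero rank, and the iteration settles to the same values regardless of the starting vector.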
34. Inside PageRank
•PR corresponds to the stationary probability distribution of a random walk on the web graph
•The escape term can be personalized!
35. Inside PageRank
A stochastic process is any sequence of experiments for which the outcome at any stage depends on chance. A Markov process is a stochastic process with the following properties:
•The set of possible outcomes (states) is finite
•The probability of the next outcome depends only on the previous outcome
•The probabilities are constant over time
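A two-state example shows what these properties give in practice. The transition matrix `P` below is made up for illustration; repeatedly applying it to a starting distribution leads to a steady state that no longer changes.

```python
# Hypothetical transition matrix: P[i][j] = probability of moving from state i to j.
# Each row sums to 1 and the entries do not change over time.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(p, P):
    """One step of the chain: new distribution = p * P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

p = [1.0, 0.0]            # start in state 0 with certainty
for _ in range(100):
    p = step(p, P)        # the next distribution depends only on the current one

print(p)
```

The result approaches (5/6, 1/6), which satisfies p = p·P, and starting from `[0.0, 1.0]` leads to the same distribution.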
36. Inside PageRank
Theorem 1:
If λ = 1 is a dominant eigenvalue of a stochastic matrix A, then the Markov chain with transition matrix A will converge to a steady state.
The Perron theorem can be used to show that if the transition matrix A of a Markov process is positive, then λ = 1 is a dominant eigenvalue of A.
37. Inside PageRank
Theorem 2:
If A is a positive n×n matrix, then A has a positive real eigenvalue R with the following properties:
1. R has a positive eigenvector x
2. If λ is any other eigenvalue of A, then |λ| < R
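The statement can be checked numerically on a small positive matrix using power iteration, which converges to the dominant eigenvalue and its eigenvector. The matrix `A` below (eigenvalues 3 and 1) and the `power_iteration` helper are illustrative assumptions, not part of the original slides.

```python
# Hypothetical positive matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]

def power_iteration(A, iters=100):
    """Estimate the dominant eigenvalue R and an eigenvector with entries summing to 1."""
    n = len(A)
    v = [1.0] + [0.0] * (n - 1)      # any nonnegative, nonzero start works here
    R = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        R = sum(w)                   # since sum(v) == 1, this estimates the eigenvalue
        v = [wi / R for wi in w]     # renormalize so the entries sum to 1 again
    return R, v

R, v = power_iteration(A)
print(R, v)
```

The estimate settles at R = 3 with a strictly positive eigenvector, and the other eigenvalue, 1, is smaller in absolute value, as the theorem states.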
40. Reference
• The PageRank Citation Ranking: Bringing Order to the Web
• The Anatomy of a Large-Scale Hypertextual Web Search Engine
• Inside PageRank
• Combating Web Spam with TrustRank
• Does "Authority" Mean Quality?
• What Can You Do with a Web in Your Pocket?
• Modern Information Retrieval (book)
• Data Mining: Concepts and Techniques (book)