9. A strongly connected component (SCC) in a
directed graph is a maximal subset of the nodes
such that every node in the subset has a
path to every other node in the subset
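The definition above can be sketched in code. This is a minimal illustration, not an efficient SCC algorithm: it finds the component containing one node by intersecting the nodes it can reach with the nodes that can reach it. The names `reachable`, `scc_containing` and `example_graph` are made up for this example.

```python
from collections import deque

def reachable(graph, start):
    """Return the set of nodes reachable from start (including start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def scc_containing(graph, node):
    """Nodes that `node` can reach and that can also reach `node`."""
    # Build the reversed graph so a forward search on it finds
    # every node that has a path TO `node` in the original graph.
    reversed_graph = {}
    for src, targets in graph.items():
        for dst in targets:
            reversed_graph.setdefault(dst, []).append(src)
    return reachable(graph, node) & reachable(reversed_graph, node)

# Hypothetical graph: A -> B -> C -> A form a cycle, C also points to D.
example_graph = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
component_of_a = scc_containing(example_graph, "A")
component_of_d = scc_containing(example_graph, "D")
```

Here D is reachable from the cycle but has no path back, so it forms a component of its own.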
12. Naïve Approach
1. Collect all pages that are relevant
through text-only techniques: the query
occurs in the title of the page, the query
occurs in the page itself, etc.
2. Sort the results by, e.g., global
PageRank
Problem: If Yahoo! contains the text
“flower”, it will be one of the first few results for
the query
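A minimal sketch of the two steps on this slide, assuming each page is a dictionary with a title, a body text and a precomputed global PageRank score. The page names and the scores are invented for illustration only.

```python
def naive_search(pages, query):
    """Step 1: text-only match. Step 2: sort matches by global PageRank."""
    q = query.lower()
    matches = [
        url for url, page in pages.items()
        if q in page["title"].lower() or q in page["text"].lower()
    ]
    return sorted(matches, key=lambda url: pages[url]["pagerank"], reverse=True)

# Hypothetical pages; the PageRank values are made up.
pages = {
    "yahoo.example": {"title": "Yahoo!", "text": "news, mail, flower delivery ads", "pagerank": 0.9},
    "roses.example": {"title": "Flower guide", "text": "how to grow roses", "pagerank": 0.1},
    "cars.example": {"title": "Cars", "text": "engines", "pagerank": 0.5},
}
results = naive_search(pages, "flower")
```

The portal page only mentions the query word in passing, yet it comes first because its global score dominates the sort.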
14. 1. Collect all pages that are relevant
through text-only techniques: the query
occurs in the title of the page, the query
occurs in the page itself, etc.
2. Let pages in this sample “vote”
through links
Problem: Very popular pages like Yahoo!
still collect many votes, whatever the query
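The voting step can be sketched as a simple in-link count, assuming the sample of relevant pages has already been collected and that `links` maps each page to the pages it links to. All page names here are hypothetical.

```python
from collections import Counter

def count_votes(links, sample):
    """Each page in the sample casts one vote for every distinct page it links to."""
    votes = Counter()
    for voter in sample:
        for target in set(links.get(voter, ())):
            if target != voter:  # ignore self-links
                votes[target] += 1
    return votes

# Hypothetical sample of pages matching the query "newspaper".
links = {
    "p1": ["nytimes", "yahoo"],
    "p2": ["nytimes", "yahoo"],
    "p3": ["yahoo", "wsj"],
}
votes = count_votes(links, ["p1", "p2", "p3"])
```

In this toy sample the portal receives the most votes even though it is not a newspaper, which is the weakness the slide points out.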
16. Lists
Some pages are “lists” of things
A page’s value as a list = sum of votes received by all pages that it voted for
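The list value defined above can be computed in two passes: first count the votes each page receives, then add up those counts over the pages a list points to. This is a sketch on hypothetical data; `list_values` is not a name from the slides.

```python
def list_values(links):
    """Value of a page as a list = sum of the votes received by the pages it voted for."""
    # Pass 1: votes received by each target (one vote per distinct link).
    votes = {}
    for voter, targets in links.items():
        for target in set(targets):
            votes[target] = votes.get(target, 0) + 1
    # Pass 2: each voter's list value is the total votes of its targets.
    return {
        voter: sum(votes[target] for target in set(targets))
        for voter, targets in links.items()
    }

# Hypothetical link structure.
links = {
    "p1": ["nytimes", "yahoo"],
    "p2": ["nytimes", "yahoo"],
    "p3": ["yahoo", "wsj"],
}
values = list_values(links)
```

Pages p1 and p2 score higher as lists than p3 because the pages they point to collected more votes.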
19. Hubs and Authorities: A
Precursor of PageRank
Hubs = high-value lists for the query
Authorities = highly endorsed answers to
the query
We assign each page p two values:
hub(p) and auth(p)
20. Start: for all p, hub(p) = 1, auth(p) = 1
Authority Update Rule: For each page p,
update auth(p) to be the sum of the
hub scores of all pages that point to it
Hub Update Rule: For each page p, update
hub(p) to be the sum of the authority
scores of all pages that it points to
Repeat k times:
Apply Authority Update Rule
Apply Hub Update Rule
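The two update rules and the k-step loop can be sketched as follows. This is a direct transcription of the rules as stated on the slide, without any normalization, so the raw scores grow with every round. The function name `hits` and the example links are assumptions made for illustration.

```python
def hits(links, k):
    """Run k rounds of the Authority and Hub Update Rules, starting from all ones."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1 for p in pages}
    auth = {p: 1 for p in pages}

    for _ in range(k):
        # Authority Update Rule: sum of hub scores of pages pointing to p.
        new_auth = {p: 0 for p in pages}
        for src, targets in links.items():
            for dst in set(targets):
                new_auth[dst] += hub[src]
        auth = new_auth
        # Hub Update Rule: sum of (new) authority scores of pages p points to.
        hub = {p: sum(auth[dst] for dst in set(links.get(p, ()))) for p in pages}
    return hub, auth

# Hypothetical link structure.
links = {
    "p1": ["nytimes", "yahoo"],
    "p2": ["nytimes", "yahoo"],
    "p3": ["yahoo", "wsj"],
}
hub1, auth1 = hits(links, 1)
hub2, auth2 = hits(links, 2)
```

After one round the authority scores equal the plain vote counts and the hub scores equal the list values; the second round already rewards votes from better lists.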
21. To keep the scores from growing
without bound, normalize them after
every update
With normalization, the scores converge!
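A sketch of the normalized iteration. The slide does not say how to normalize; dividing each score by the sum of all scores of the same kind is one common choice and is an assumption here, as are the names `normalize` and `hits_normalized` and the example links.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed scheme; leaves an all-zero dict unchanged)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {p: s / total for p, s in scores.items()}

def hits_normalized(links, k):
    """k rounds of the authority and hub updates, normalizing after each update."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update, then normalize.
        auth = normalize({
            p: sum(hub[src] for src, targets in links.items() if p in targets)
            for p in pages
        })
        # Hub update from the fresh authority scores, then normalize.
        hub = normalize({
            p: sum(auth[dst] for dst in set(links.get(p, ())))
            for p in pages
        })
    return hub, auth

# Hypothetical link structure.
links = {
    "p1": ["nytimes", "yahoo"],
    "p2": ["nytimes", "yahoo"],
    "p3": ["yahoo", "wsj"],
}
hub, auth = hits_normalized(links, 50)
```

On this small example, running one more round changes the scores only negligibly, which is what convergence looks like in practice.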
23. Combining Anchor Text
Link 1: “A great newspaper”
Link 2: “Check out this picture”
Which link is better for the query
“newspaper”?
How do we incorporate this information
into PageRank or “Hubs and Authorities”?
We can multiply each link’s contribution by a
factor that indicates the quality of its anchor text for the query
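One way to picture this is a weighted authority update in which every link carries its anchor text. The weighting scheme below (a fixed boost when the query appears in the anchor text) is purely an assumption for illustration; the slide does not specify how the factor is computed, and the page names are hypothetical.

```python
def anchor_weight(anchor_text, query, boost=2.0):
    """Assumed quality factor: `boost` if the query occurs in the anchor text, else 1."""
    return boost if query.lower() in anchor_text.lower() else 1.0

def weighted_authority(links, hub, query):
    """Authority update where each link's contribution is scaled by its anchor weight."""
    auth = {}
    for src, dst, anchor in links:
        auth[dst] = auth.get(dst, 0.0) + hub[src] * anchor_weight(anchor, query)
    return auth

# Hypothetical links as (source, target, anchor text) triples.
links = [
    ("blog", "paper.example", "A great newspaper"),
    ("blog", "photo.example", "Check out this picture"),
]
hub = {"blog": 1.0}
auth = weighted_authority(links, hub, "newspaper")
```

With the same hub score behind both links, the one whose anchor text mentions the query contributes twice as much under this assumed boost.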
24. Impact Factor of
Scientific Journals
Nature
Science
New England Journal of Medicine
Cell
PNAS
Journal of Biological Chemistry
JAMA
The Lancet
Nature Genetics
Nature Medicine