COMP9313: Big Data Management
Lecturer: Xin Cao
Course web site: http://www.cse.unsw.edu.au/~cs9313/
8.2
Chapter 8: Analysis of Large Graphs
— Link Analysis
Adapted from the slides of Chapter 5 of “Mining Massive Datasets”
8.3
Graph Data: Social Networks
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
8.4
Graph Data: Information Nets
Citation networks and Maps of science
[Börner et al., 2012]
8.5
Graph Data: Communication Nets
Internet
8.6
Web as a Directed Graph
 Web as a directed graph:
 Nodes: Webpages
 Edges: Hyperlinks
8.7
Broad Question
 How to organize the Web?
 First try: Human curated
Web directories
 Yahoo, LookSmart, etc.
 Second try: Web Search
 Information Retrieval investigates:
Find relevant docs in a small
and trusted set
 Newspaper articles, Patents, etc.
 But: Web is huge, full of untrusted documents, random things, web
spam, etc.
 What is the “best” answer to query “newspaper”?
 No single right answer
8.8
Ranking Nodes on the Graph
 Not all web pages are equally “important”
 http://xxx.github.io/ vs. http://www.unsw.edu.au/
 There is large diversity in the web-graph node connectivity. Let’s rank
the pages by the link structure!
8.9
Link Analysis Algorithms
 We will cover the following Link Analysis approaches for computing
importance of nodes in a graph:
 Page Rank
 Topic-Specific (Personalized) Page Rank
 HITS
8.10
Part 1: PageRank
8.11
Links as Votes
 Idea: Links as votes
 Page is more important if it has more links
 In-coming links? Out-going links?
 Think of in-links as votes:
 http://www.unsw.edu.au/ has 23,400 in-links
 http://xxx.github.io/ has 1 in-link
 Are all in-links equal?
 Links from important pages count more
 Recursive question!
8.12
Example: PageRank Scores
[Figure: example web graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and several peripheral nodes scoring 1.6 each]
8.13
Simple Recursive Formulation
 Each link’s vote is proportional to the importance of its source page
 If page j with importance rj has n out-links, each link gets rj / n votes
 Page j’s own importance is the sum of the votes on its in-links
[Figure: page j has in-links from i (contributing ri/3) and k (contributing rk/4), so rj = ri/3 + rk/4; j's own importance rj is split over its 3 out-links as rj/3 each]
8.14
PageRank: The “Flow” Model
 A “vote” from an important page is
worth more
 A page is important if it is pointed to by
other important pages
 Define a “rank” rj for page j
rj = Σi→j ri / di   (di … out-degree of node i)

[Figure: graph on nodes y, a, m with edges y→y, y→a, a→y, a→m, m→a; each out-link of y and a carries half of its rank]
“Flow” equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
8.15
Solving the Flow Equations
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
 𝒓𝒚 + 𝒓𝒂 + 𝒓𝒎 = 𝟏
 Solution: ry = 2/5, ra = 2/5, rm = 1/5
 Gaussian elimination method works for small examples, but we need
a better method for large web-size graphs
 We need a new formulation!
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
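For a concrete check (our own sketch, not part of the original slides), the three flow equations plus the constraint ry + ra + rm = 1 can be solved with NumPy least squares; the variable names below are ours.

import numpy as np

# Column-stochastic matrix of the y, a, m graph (columns = y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

# Stack (I - M) r = 0 with the normalisation row [1, 1, 1] r = 1
A = np.vstack([np.eye(3) - M, np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)   # ≈ [0.4, 0.4, 0.2], i.e. ry = ra = 2/5, rm = 1/5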
8.16
PageRank: Matrix Formulation
 Stochastic adjacency matrix 𝑴
 Let page i have di out-links
 If i → j, then Mji = 1/di, else Mji = 0
 𝑴 is a column stochastic matrix
– Columns sum to 1
 Rank vector 𝒓: vector with an entry per page
 𝑟𝑖 is the importance score of page 𝑖
 Σi ri = 1
 The flow equations can be written
𝒓 = 𝑴 ⋅ 𝒓
8.17
Example
 Remember the flow equation: rj = Σi→j ri / di
 Flow equation in the matrix form: M · r = r
 Suppose page i links to 3 pages, including j
[Figure: the i-th column of M has entry Mji = 1/3 in row j, so row j of M · r picks up ri/3]
8.18
Eigenvector Formulation
 The flow equations can be written
𝒓 = 𝑴 ∙ 𝒓
 So the rank vector r is an eigenvector of the stochastic web matrix
M
 In fact, its first or principal eigenvector,
with corresponding eigenvalue 1
 Largest eigenvalue of M is 1 since M is
column stochastic (with non-negative entries)
– We know |r|1 = 1 and each column of M sums to one, so |M·r|1 ≤ 1
 We can now efficiently solve for r!
 The method is called Power iteration
NOTE: x is an
eigenvector with
the corresponding
eigenvalue λ if:
𝑨𝒙 = 𝝀𝒙
8.19
Example: Flow Equations & M
r = M·r, with M indexed by destination page (rows) and source page (columns):

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0

[Figure: the y, a, m graph of slide 8.14]
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
8.20
Power Iteration Method
 Given a web graph with n nodes, where the nodes are pages and
edges are hyperlinks
 Power iteration: a simple iterative scheme
 Suppose there are N web pages
 Initialize: r(0) = [1/N,….,1/N]T
 Iterate: r(t+1) = M ∙ r(t)
 Stop when |r(t+1) – r(t)|1 < ε
rj(t+1) = Σi→j ri(t) / di
di …. out-degree of node i
|x|1 = 1≤i≤N|xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
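As an illustration (our own sketch, not from the slides), power iteration on the y/a/m matrix of slide 8.19 takes a few lines of NumPy; the function name and tolerance below are assumptions.

import numpy as np

def power_iteration(M, eps=1e-8, max_iter=100):
    # Iterate r(t+1) = M · r(t) from the uniform vector until the
    # L1 change drops below eps (the stopping rule on this slide)
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Stochastic matrix of the y, a, m graph (slide 8.19)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))          # ≈ [0.4, 0.4, 0.2]

# Cross-check against the eigenvector formulation of slide 8.18:
# r should be the eigenvector of M for the largest eigenvalue (1)
vals, vecs = np.linalg.eig(M)
v = np.real(vecs[:, np.argmax(np.real(vals))])
print(v / v.sum())                 # same vector after normalisation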
8.21
PageRank: How to solve?
[Figure: the y, a, m graph]

      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Iteration 0, 1, 2, …
   ry:  1/3   1/3   5/12   …   2/5
   ra:  1/3   1/2   1/3    …   2/5
   rm:  1/3   1/6   1/4    …   1/5
8.22
Why Power Iteration works?
 Intuition: write the initial vector in the eigenvector basis of M; repeated multiplication by M shrinks every component except the one along the principal eigenvector (eigenvalue 1), so r(t) converges to r
8.23
Random Walk Interpretation
 Imagine a random web surfer: at any time t the surfer is on some page i
 At time t+1, the surfer follows an out-link from i uniformly at random and ends up on some page j linked from i; the process repeats indefinitely
 Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t (so p(t) is a probability distribution over pages)

rj = Σi→j ri / dout(i)

[Figure: page j with in-links from pages i1, i2, i3]
8.24
The Stationary Distribution
 Where is the surfer at time t+1? She follows a link uniformly at random, so p(t+1) = M · p(t)
 Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
 Our original rank vector r satisfies r = M · r, so r is a stationary distribution for this random walk

[Figure: page j with in-links from pages i1, i2, i3]
8.25
Existence and Uniqueness
 A central result from the theory of random walks (a.k.a. Markov
processes):
For graphs that satisfy certain conditions,
the stationary distribution is unique and
eventually will be reached no matter what the
initial probability distribution at time t = 0
8.26
PageRank: Two Questions
 Does this converge?
 Does it converge to what we want?
rj(t+1) = Σi→j ri(t) / di      or equivalently      r = M · r
8.27
Does this converge?
 Example:
Iteration:   0   1   2   3   …
   ra:       1   0   1   0   …
   rb:       0   1   0   1   …

[Figure: two-node graph with edges a → b and b → a]

rj(t+1) = Σi→j ri(t) / di

The scores oscillate forever and never converge.
8.28
Does it converge to what we want?
 Example:
Iteration:   0   1   2   3
   ra:       1   0   0   0
   rb:       0   1   0   0

[Figure: two-node graph with a single edge a → b; b is a dead end]

rj(t+1) = Σi→j ri(t) / di

All the importance leaks out through the dead end, so the limit is not the ranking we want.
8.29
PageRank: Problems
2 problems:
 (1) Some pages are dead ends (have no out-links)
 Random walk has “nowhere” to go to
 Such pages cause importance to “leak out”
 (2) Spider traps: (all out-links are within the group)
 Random walker gets “stuck” in a trap
 And eventually spider traps absorb all importance
Dead end
8.30
Problem: Dead Ends
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

[Figure: the y, a, m graph with m's out-link removed, so the m column of M is all zeros]
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
Iteration 0, 1, 2, …
Here the PageRank “leaks” out since the matrix is not stochastic.
8.31
Solution: Teleport!
 Teleports: Follow random teleport links with probability 1.0 from dead-
ends
 Adjust matrix accordingly
Original M (m is a dead end):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    0

Adjusted matrix (from m, teleport to every page with probability ⅓):
      y    a    m
y     ½    ½    ⅓
a     ½    0    ⅓
m     0    ½    ⅓

[Figure: the y, a, m graph with m's missing out-links replaced by teleport links]
8.32
Problem: Spider Traps
Iteration 0, 1, 2, …
All the PageRank score gets “trapped” in node m.

      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1

[Figure: the y, a, m graph where m's only out-link is a self-loop]
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm
m is a spider trap
8.33
Solution: Always Teleports!
 The Google solution for spider traps: At each time step, the random
surfer has two options
 With prob. β, follow a link at random
 With prob. 1-β, jump to some random page
 Common values for β are in the range 0.8 to 0.9
 Surfer will teleport out of spider trap within a few time steps
[Figure: the y, a, m graph before and after adding teleport links out of the spider trap]
8.34
Why Teleports Solve the Problem?
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
 Spider-traps are not a problem, but with traps PageRank scores are
not what we want
 Solution: Never get stuck in a spider trap by teleporting out of it in
a finite number of steps
 Dead-ends are a problem
 The matrix is not column stochastic so our initial assumptions are
not met
 Solution: Make matrix column stochastic by always teleporting
when there is nowhere else to go
8.35
Google’s Solution: Random Teleports
 PageRank equation with random teleports: rj = Σi→j β · ri / di + (1 − β) / N
(di … out-degree of node i)
8.36
The Google Matrix
 The Google Matrix A: A = β · M + (1 − β) · [1/N]N×N, where [1/N]N×N is the N×N matrix with all entries 1/N
 The equation becomes r = A · r, and the Power method still applies
8.37
Random Teleports ( = 0.8)
A = 0.8 · M + 0.2 · [1/N]N×N

M (the y, a, m graph; m is a spider trap):
      y    a    m
y     ½    ½    0
a     ½    0    0
m     0    ½    1

[1/N]N×N:
     1/3  1/3  1/3
     1/3  1/3  1/3
     1/3  1/3  1/3

A:
      y     a     m
y    7/15  7/15   1/15
a    7/15  1/15   1/15
m    1/15  7/15  13/15

Power iteration on A, starting from [1/3, 1/3, 1/3]:
  y:  1/3   0.33   0.24   0.26   …   7/33
  a:  1/3   0.20   0.20   0.18   …   5/33
  m:  1/3   0.46   0.52   0.56   …   21/33

[Figure: the y, a, m graph redrawn with teleport edges, e.g. weights 7/15 and 13/15 on the edges into and around m]
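The computation on this slide can be reproduced with a short NumPy sketch (ours, with assumed variable names): build the Google matrix A = βM + (1−β)[1/N]N×N and iterate.

import numpy as np

beta = 0.8
# y, a, m graph of this slide: m is a spider trap (self-loop)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]

A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)   # Google matrix

r = np.full(N, 1.0 / N)
for _ in range(100):                 # plain power iteration on A
    r = A @ r
print(r)                             # ≈ [0.212, 0.152, 0.636]
print(np.array([7, 5, 21]) / 33)     # the limit shown on the slide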
8.38
Computing Page Rank
 Key step is matrix-vector multiplication
 rnew = A ∙ rold
 Easy if we have enough main memory to hold A, rold, rnew
 Say N = 1 billion pages
 We need 4 bytes for
each entry (say)
 2 billion entries for
vectors, approx 8GB
 Matrix A has N² entries
 N² = 10¹⁸ is a large number!
A = β · M + (1 − β) · [1/N]N×N

  = 0.8 · [ ½  ½  0 ;  ½  0  0 ;  0  ½  1 ]  +  0.2 · [ 1/3 1/3 1/3 ; 1/3 1/3 1/3 ; 1/3 1/3 1/3 ]

  = [ 7/15  7/15  1/15 ;  7/15  1/15  1/15 ;  1/15  7/15  13/15 ]
8.39
Matrix Formulation
 Suppose there are N pages
 Consider page i, with di out-links
 We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
 The random teleport is equivalent to:
 Adding a teleport link from i to every other page and setting transition probability to (1-β)/N
 Reducing the probability of following each out-link from 1/|di| to β/|di|
 Equivalent: Tax each page a fraction (1-β) of its score and redistribute evenly
8.40
Rearranging the Equation
 The PageRank equation r = A · r can be rearranged as r = β M · r + [(1−β)/N]N
[x]N … a vector of length N with all entries x
Note: Here we assumed M
has no dead-ends
8.41
Sparse Matrix Formulation
 With the rearranged equation r = β M · r + [(1−β)/N]N we never need to materialize the dense matrix A
 In each iteration: compute rnew = β M · rold using the sparse matrix M, then add the constant (1−β)/N to every entry of rnew
8.42
PageRank: The Complete Algorithm
 Input: graph G and parameter β; Output: PageRank vector rnew
 In each iteration:
 For every page j: r′j_new = Σi→j β · ri_old / di   (set r′j_new = 0 if the in-degree of j is 0)
 Re-insert the leaked PageRank: rj_new = r′j_new + (1 − S)/N, where S = Σj r′j_new
If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
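A compact in-memory sketch of this complete algorithm (our own illustration; the dense matrix is only for demonstration, the real computation would use the sparse encoding of the next slides):

import numpy as np

def pagerank(M, beta=0.8, eps=1e-8, max_iter=100):
    # M may contain all-zero columns (dead ends). Each iteration computes
    # r' = beta * M @ r and re-inserts the leaked mass 1 - S uniformly,
    # where S = sum(r'), exactly as on slide 8.42.
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_prime = beta * (M @ r)
        S = r_prime.sum()
        r_new = r_prime + (1.0 - S) / N
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

# y, a, m graph of slide 8.30: m is a dead end (its column is all zeros)
M_dead_end = np.array([[0.5, 0.5, 0.0],
                       [0.5, 0.0, 0.0],
                       [0.0, 0.5, 0.0]])
print(pagerank(M_dead_end))   # sums to 1 despite the dead end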
8.43
Sparse Matrix Encoding
 Encode sparse matrix using only nonzero entries
 Space proportional roughly to number of links
 Say 10N, or 4*10*1 billion = 40GB
 Still won’t fit in memory, but will fit on disk
source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
8.44
Basic Algorithm: Update Step
 Assume enough RAM to fit rnew into memory
 Store rold and matrix M on disk
 1 step of power-iteration is:
source   degree   destination
0        3        1, 5, 6
1        4        17, 64, 113, 117
2        2        13, 23

[Figure: the vectors rnew (held in memory) and rold (read from disk), indexed by page id]

Initialize all entries of rnew = (1-β) / N
For each page i (of out-degree di):
  Read into memory: i, di, dest1, …, destdi, rold(i)
  For j = 1…di:
    rnew(destj) += β · rold(i) / di
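The same update step in Python, reading the (source, degree, destinations) encoding of slide 8.43 record by record; in practice the records would be streamed from disk, and the tiny adjacency list below is a made-up example, not the one on the slide.

import numpy as np

def update_step(records, r_old, beta=0.8):
    # One power-iteration step over the sparse encoding:
    # each record is (source node i, out-degree d_i, destination list)
    N = len(r_old)
    r_new = np.full(N, (1.0 - beta) / N)   # start with the teleport share
    for i, d_i, dests in records:          # in practice: read from disk
        for j in dests:
            r_new[j] += beta * r_old[i] / d_i
    return r_new

# Made-up adjacency list in the slide's (source, degree, destinations) format
records = [(0, 3, [1, 5, 6]),
           (1, 4, [2, 3, 4, 5]),
           (2, 2, [5, 6])]
r_old = np.full(7, 1.0 / 7)
print(update_step(records, r_old))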
8.45
Analysis
 Assume enough RAM to fit rnew into memory
 Store rold and matrix M on disk
 In each iteration, we have to:
 Read rold and M
 Write rnew back to disk
 Cost per iteration of Power method:
= 2|r| + |M|
 Question:
 What if we could not even fit rnew in memory?
 Split rnew into blocks. Details ignored
8.46
Some Problems with Page Rank
 Measures generic popularity of a page
 Biased against topic-specific authorities
 Solution: Topic-Specific (Personalized) PageRank (next)
 Uses a single measure of importance
 Other models of importance
 Solution: Hubs-and-Authorities
8.47
Part 2: Topic-Specific (Personalized)
PageRank
8.48
Topic-Specific PageRank
 Instead of generic popularity, can we measure popularity within a
topic?
 Goal: Evaluate Web pages not just according to their popularity, but
by how close they are to a particular topic, e.g. “sports” or “history”
 Allows search queries to be answered based on interests of the user
8.49
Topic-Specific PageRank
 Random walker has a small probability of teleporting at any step
 Teleport can go to:
 Standard PageRank: Any page with equal probability
 To avoid dead-end and spider-trap problems
 Topic Specific PageRank: A topic-specific set of “relevant”
pages (teleport set)
 Idea: Bias the random walk
 When the walker teleports, she picks a page from a set S
 S contains only pages that are relevant to the topic
 E.g., Open Directory (DMOZ) pages for a given topic/query
 For each teleport set S, we get a different vector rS
8.50
Matrix Formulation
 Aij = β · Mij + (1−β)/|S|   if i ∈ S
 Aij = β · Mij                otherwise
 A is still column stochastic, so the standard power iteration applies; here all pages in the teleport set S are weighted equally (one could also assign different weights)
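A minimal sketch of this formulation (ours; the 4-node graph below is a made-up example, not the one on the next slide): teleports land only in the set S.

import numpy as np

def topic_specific_pagerank(M, S, beta=0.8, n_iter=100):
    # Teleport only into the topic set S, i.e.
    # A_ij = beta*M_ij + (1-beta)/|S| if i in S, and beta*M_ij otherwise
    N = M.shape[0]
    teleport = np.zeros(N)
    teleport[list(S)] = 1.0 / len(S)
    r = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        r = beta * (M @ r) + (1 - beta) * teleport
    return r

# Hypothetical column-stochastic graph: 0->1, 0->2, 1->0, 2->3, 3->2
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(topic_specific_pagerank(M, S={0}))           # biased towards node 0
print(topic_specific_pagerank(M, S={0, 1, 2, 3}))  # uniform set = ordinary PageRank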
8.51
Example: Topic-Specific PageRank
[Figure: 4-node example graph on nodes 1, 2, 3, 4, drawn with the teleport-adjusted transition probabilities (edge weights 0.2, 0.4, 0.5, 0.8 and 1)]

Suppose S = {1}, β = 0.8

Node   Iter 0   Iter 1   Iter 2   …   stable
1      0.25     0.4      0.28     …   0.294
2      0.25     0.1      0.16     …   0.118
3      0.25     0.3      0.32     …   0.327
4      0.25     0.2      0.24     …   0.261
S={1,2,3,4}, β=0.8:
r=[0.13, 0.10, 0.39, 0.36]
S={1,2,3} , β=0.8:
r=[0.17, 0.13, 0.38, 0.30]
S={1,2} , β=0.8:
r=[0.26, 0.20, 0.29, 0.23]
S={1} , β=0.8:
r=[0.29, 0.11, 0.32, 0.26]
S={1}, β=0.90:
r=[0.17, 0.07, 0.40, 0.36]
S={1} , β=0.8:
r=[0.29, 0.11, 0.32, 0.26]
S={1}, β=0.70:
r=[0.39, 0.14, 0.27, 0.19]
8.52
Part 3: HITS
8.53
Hubs and Authorities
 HITS (Hypertext-Induced Topic Selection)
 Is a measure of importance of pages or documents, similar to
PageRank
 Proposed at around same time as PageRank (‘98)
 Goal: Say we want to find good newspapers
 Don’t just find newspapers. Find “experts” – people who link in a
coordinated way to good newspapers
 Idea: Links as votes
 Page is more important if it has more links
 In-coming links? Out-going links?
8.54
Finding Newspapers
 Hubs and Authorities
Each page has 2 scores:
 Quality as an expert (hub):
 Total sum of votes of authorities pointed to
 Quality as a content (authority):
 Total sum of votes coming from experts
 Principle of repeated improvement
[Figure: a hub page pointing to newspapers with authority votes: NYT 10, CNN 8, WSJ 9, Yahoo 3, Ebay 3]
8.55
Hubs and Authorities
 Interesting pages fall into two classes:
 Authorities are pages containing
useful information
 Newspaper home pages
 Course home pages
 Home pages of auto manufacturers
 Hubs are pages that link to authorities
 List of newspapers
 Course bulletin
 List of US auto manufacturers
8.56
Counting in-links: Authority
Each page starts with hub
score 1. Authorities collect
their votes
(Note this is idealized example. In reality graph is not bipartite and
each page has both the hub and authority score)
Sum of hub scores
of nodes pointing to
NYT.
8.57
Expert Quality: Hub
Hubs collect authority scores
Sum of authority scores
of nodes that the node
points to.
(Note this is idealized example. In reality graph is not bipartite and
each page has both the hub and authority score)
8.58
Reweighting
Authorities again collect
the hub scores
(Note this is idealized example. In reality graph is not bipartite and
each page has both the hub and authority score)
8.59
Mutually Recursive Definition
 A good hub links to many good authorities
 A good authority is linked to by many good hubs
 Each page i gets a hub score hi and an authority score ai, each defined in terms of the other (a mutually recursive definition)
8.60
Hubs and Authorities
 Hub score of page i: hi = Σi→j aj   (sum of the authority scores of the pages i points to)
 Authority score of page i: ai = Σj→i hj   (sum of the hub scores of the pages pointing to i)
 After each update, normalize the score vectors

[Figure: node i with out-links to j1 … j4 (hub update) and node i with in-links from j1 … j4 (authority update)]
8.61
Hubs and Authorities
 HITS algorithm: initialize aj = hj = 1/√N for every page j, then repeat until convergence: update all hi, update all ai, normalize both vectors
8.62
Hubs and Authorities
 Substituting a = AT · h into h = A · a gives h = A · AT · h, and likewise a = AT · A · a
Convergence criterion: stop when the hub and authority scores change by less than a tolerance ε between iterations
Repeated matrix powering: HITS is power iteration with the matrices A·AT (for h) and AT·A (for a)
8.63
Existence and Uniqueness
 h = A a
 a = AT h
 h = A AT h
 a = AT A a
 Under reasonable assumptions about A,
HITS converges to vectors h* and a*:
 h* is the principal eigenvector of matrix A AT
 a* is the principal eigenvector of matrix AT A
8.64
Example of HITS
[Figure: 3-page graph: Yahoo links to Yahoo, Amazon and M’soft; Amazon links to Yahoo and M’soft; M’soft links to Amazon]

       1  1  1                1  1  0
A  =   1  0  1        AT  =   1  0  1
       0  1  0                1  1  0

Hub scores (unit length), iteration by iteration:
  h(yahoo)   = .58   .80   .80   .79   …   .788
  h(amazon)  = .58   .53   .53   .57   …   .577
  h(m’soft)  = .58   .27   .27   .23   …   .211

Authority scores:
  a(yahoo)   = .58   .58   .62   .62   …   .628
  a(amazon)  = .58   .58   .49   .49   …   .459
  a(m’soft)  = .58   .58   .62   .62   …   .628
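The iteration on this slide can be reproduced with a short sketch (ours): repeated updates h = A·a, a = AT·h with L2 normalisation, i.e. power iteration on A·AT and AT·A.

import numpy as np

def hits(A, n_iter=100):
    # h = A a, a = A^T h, normalising to unit (L2) length each round
    n = A.shape[0]
    h = np.full(n, 1.0 / np.sqrt(n))
    a = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(n_iter):
        h = A @ a
        h /= np.linalg.norm(h)
        a = A.T @ h
        a /= np.linalg.norm(a)
    return h, a

# Link matrix of this slide (rows/columns: Yahoo, Amazon, M'soft)
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
h, a = hits(A)
print(h)   # ≈ [0.788, 0.577, 0.211]
print(a)   # ≈ [0.628, 0.459, 0.628]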
8.65
PageRank and HITS
 PageRank and HITS are two solutions to the same problem:
 What is the value of an in-link from u to v?
 In the PageRank model, the value of the link depends on the links
into u
 In the HITS model, it depends on the value of the other links out
of u
 PageRank computes authorities only. HITS computes both authorities
and hubs.
 The existence of dead ends or spider traps does not affect the solution
of HITS.
8.66
References
 Chapter 5. Mining of Massive Datasets.
End of Chapter 8
8.68
PageRank in MapReduce
 One iteration of the PageRank algorithm involves taking an estimated
PageRank vector r and computing the next estimate r′ by
r′ = β · M · r + [(1 − β)/N]N
 Mapper: input – a line containing node u, ru, a list of out-going
neighbors of u
 For each neighbor v, emit(v, ru/deg(u))
 Emit (u, a list of out-going neighbors of u)
 Reducer: input – (node v, a list of values <ru/deg(u), …>)
 Aggregate the results according to the equation to compute r’v
 Emit node v, r’v, a list of out-going neighbors of v
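A minimal Hadoop-Streaming-style sketch of this mapper and reducer (our own illustration; the tab-separated input format, the script layout, and the values of β and N are assumptions, not from the slides):

import sys

BETA, N = 0.8, 1000000   # assumed damping factor and total number of pages

def mapper():
    # Input line: node_id <TAB> rank <TAB> comma-separated out-neighbours
    for line in sys.stdin:
        u, r_u, neighbours = line.rstrip("\n").split("\t")
        out = neighbours.split(",") if neighbours else []
        for v in out:                               # spread u's rank over its out-links
            print(f"{v}\tRANK\t{float(r_u) / len(out)}")
        print(f"{u}\tLINKS\t{neighbours}")          # pass the graph structure through

def reducer():
    # Hadoop delivers the mapper output grouped (sorted) by node id
    def emit(node, total, links):
        rank = BETA * total + (1 - BETA) / N        # the equation above
        print(f"{node}\t{rank}\t{links}")
    current, total, links = None, 0.0, ""
    for line in sys.stdin:
        node, kind, value = line.rstrip("\n").split("\t")
        if current is not None and node != current:
            emit(current, total, links)
            total, links = 0.0, ""
        current = node
        if kind == "RANK":
            total += float(value)
        else:
            links = value
    if current is not None:
        emit(current, total, links)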