Link Analysis .pptx

LINK ANALYSIS
Ahnaf Tahmeed
ID:23542605015

LINK ANALYSIS
 Link analysis is based on a branch of mathematics called graph
theory, which represents relationships between different objects as
edges in a graph. Link analysis is not a specific modeling technique,
so it can be used for both directed and undirected data mining.
 A link analysis ranking algorithm starts with a set of Web pages.
Depending on how this set of pages is obtained, we distinguish
between query-independent algorithms and query-dependent
algorithms. In the former case, the algorithm ranks the whole Web;
in the latter, it ranks a subset of pages associated with the query.

Link analysis is used for 3
primary purposes:
Find matches in data for known
patterns of interest;
Find anomalies where known
patterns are violated;
Discover new patterns of
interest (social network analysis,
data mining).

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

WEB AS A GRAPH
 Web as a directed graph:
 Nodes: Webpages
 Edges: Hyperlinks
[Figure: example web graph. Nodes are pages ("I teach a class on Networks.", "CS224W: Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University"); edges are the hyperlinks between them.]

 First try: Human curated
Web directories
 Yahoo, DMOZ, LookSmart
 Second try: Web Search
 Information Retrieval investigates:
Find relevant docs in a small
and trusted set
 Newspaper articles, Patents, etc.
 But: Web is huge, full of untrusted
documents, random things, web spam,
etc.

2 challenges of web search:
 (1) Web contains many sources of information
Who to “trust”?
 Trick: Trustworthy pages may point to each other!
 (2) What is the “best” answer to query “newspaper”?
 No single right answer
 Trick: Pages that actually know about newspapers might all
be pointing to many newspapers

RANKING NODES ON THE
GRAPH
 Not all web pages are equally “important”
www.joe-schmoe.com vs. www.stanford.edu
 There is large diversity
in web-graph
node connectivity.
Let’s rank the pages by
their link structure!

Link analysis approaches for computing
the importance
of nodes in a graph:
PageRank algorithm
Hyperlink-Induced Topic Search (HITS)
Topic-Specific (Personalized) PageRank
Web spam detection algorithms

PageRank (PR) is an algorithm used by Google Search to
rank web pages in its search engine results. PageRank
was named after Larry Page, one of the founders of
Google. PageRank is a way of measuring the importance of
web pages.
A site's PageRank matters because it influences whether
the site shows up on the first page or the last page of
the search results when a user is
searching for something related to your business or
product.

LINKS AS VOTES
 Idea: Links as votes
 Page is more important if it has more links
 In-coming links? Out-going links?
 Think of in-links as votes:
 www.stanford.edu has 23,400 in-links
 www.joe-schmoe.com has 1 in-link
 Are all in-links equal?
 Links from important pages count more
 Recursive question!
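A minimal sketch of the "in-links as votes" idea, counting raw in-links on a small made-up graph. The page names and the `in_link_counts` helper are hypothetical, chosen only for illustration. Note that this count treats every vote as equal, which is exactly what the question "Are all in-links equal?" challenges.

```python
from collections import Counter

# Hypothetical toy web graph: page -> list of pages it links to.
links = {
    "a.example": ["hub.example", "b.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example", "a.example"],
    "hub.example": ["a.example"],
}

def in_link_counts(graph):
    """Count how many in-links (votes) each page receives."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in targets:
            counts[target] += 1
    return counts

votes = in_link_counts(links)
# Pages ordered from most to fewest in-links.
ranking = [page for page, _ in votes.most_common()]
```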

[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five unlabeled nodes at 1.6 each.]

SIMPLE RECURSIVE
FORMULATION
 Each link’s vote is proportional to the
importance of its source page
 If page j with importance rj has n out-
links, each link gets rj / n votes
 Page j’s own importance is the sum of the
votes on its in-links
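[Figure: page j has in-links from page i (carrying ri/3) and page k (carrying rk/4), so rj = ri/3 + rk/4. Each of j's three out-links carries rj/3.]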
 A “vote” from an important page is worth more
 A page is important if it is pointed to by other
important pages
 Define a “rank” rj for page j



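$r_j = \sum_{i \to j} \frac{r_i}{d_i}$
𝒅𝒊 … out-degree of node 𝒊
[Figure: "The web in 1839", a three-page graph. y links to itself and to a (y/2 on each link); a links to y and to m (a/2 on each link); m links to a (m on that link).]
“Flow” equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2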
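As a rough illustration, the flow equations can be read directly off the link structure: for each page j, collect one term ri/di for every page i that links to j. The `flow_equations` helper below is a hypothetical name, and the graph is the three-page example (y, a, m).

```python
# The three-page example: page -> pages it links to.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def flow_equations(graph):
    """Build one flow equation per page as a readable string."""
    equations = {}
    for j in graph:
        terms = []
        for i, targets in graph.items():
            if j in targets:
                d = len(targets)  # out-degree of page i
                terms.append(f"r{i}" if d == 1 else f"r{i}/{d}")
        equations[j] = f"r{j} = " + " + ".join(terms)
    return equations

equations = flow_equations(graph)
for line in equations.values():
    print(line)
```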

SOLVING THE FLOW
EQUATIONS
 3 equations, 3 unknowns,
no constants
 No unique solution
 All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
 𝒓𝒚 + 𝒓𝒂 + 𝒓𝒎 = 𝟏
 Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
 Gaussian elimination method works for
small examples, but we need a better method for large web-size graphs
 We need a new formulation!
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
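A small sketch of the Gaussian elimination approach for this three-page example, using exact fractions. The `solve` helper is written here for illustration and assumes the system has a unique solution. Because the three flow equations are redundant, the third one is swapped for the constraint ry + ra + rm = 1.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination with exact fractions; assumes a unique solution."""
    n = len(A)
    aug = [[F(x) for x in row] + [F(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

A = [
    [F(-1, 2), F(1, 2), 0],  # ry/2 + ra/2 - ry = 0
    [F(1, 2), -1, 1],        # ry/2 + rm - ra = 0
    [1, 1, 1],               # ry + ra + rm = 1 (replaces rm = ra/2)
]
b = [0, 0, 1]

ry, ra, rm = solve(A, b)
print(ry, ra, rm)  # 2/5 2/5 1/5
```

The dropped equation rm = ra /2 still holds for the result, which is what makes it safe to replace.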

PAGERANK: MATRIX
FORMULATION
 Stochastic adjacency matrix 𝑴
 Let page 𝑖 have 𝑑𝑖 out-links
 If 𝑖 → 𝑗, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
 𝑴 is a column stochastic matrix
 Columns sum to 1
 Rank vector 𝒓: vector with an entry per page
 𝑟𝑖 is the importance score of page 𝑖
 $\sum_i r_i = 1$
 The flow equations can be written
𝒓 = 𝑴 ⋅ 𝒓



$r_j = \sum_{i \to j} \frac{r_i}{d_i}$
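A minimal sketch of building the column-stochastic matrix 𝑴 from a link structure, using the three-page example (y, a, m). The `stochastic_matrix` helper is a hypothetical name; the sketch assumes every page has at least one out-link and no page links to the same target twice.

```python
from fractions import Fraction

def stochastic_matrix(graph, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links to page j."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in graph.items():
        for j in targets:
            # Column = source page i, row = destination page j.
            M[index[j]][index[i]] = Fraction(1, len(targets))
    return M

graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]
M = stochastic_matrix(graph, pages)

# Each column should sum to 1.
column_sums = [sum(M[row][col] for row in range(len(pages)))
               for col in range(len(pages))]
```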

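 Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
 Flow equation in the matrix form
𝑴 ⋅ 𝒓 = 𝒓
 Suppose page i links to 3 pages, including j
[Figure: M ⋅ r = r. Column i of M has the entry 1/3 in row j; multiplied by entry ri of r, it contributes ri/3 to entry rj of the result.]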

EXAMPLE: FLOW
EQUATIONS & M
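r = M∙r

$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$

M (column = source page, row = destination page):
     y    a    m
y    ½    ½    0
a    ½    0    1
m    0    ½    0

[Figure: the three-page graph with nodes y, a, m]
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2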
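As a quick check, the sketch below multiplies this M by the rank vector found earlier (ry = 2/5, ra = 2/5, rm = 1/5) and confirms that M∙r gives back r. The `matvec` helper is written here only for illustration.

```python
from fractions import Fraction as F

# Rows and columns are ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
r = [F(2, 5), F(2, 5), F(1, 5)]

def matvec(M, r):
    """Multiply matrix M (list of rows) by vector r."""
    return [sum(m_ji * r_i for m_ji, r_i in zip(row, r)) for row in M]

Mr = matvec(M, r)
print(Mr == r)  # True

# Any rescaled vector also satisfies M∙r = r, which is why the
# constraint ry + ra + rm = 1 is needed to pick a single solution.
scaled = [2 * x for x in r]
```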
