LINK ANALYSIS
Ahnaf Tahmeed
ID:23542605015
LINK ANALYSIS
• Link analysis is based on graph theory, a branch of mathematics that represents relationships between different objects as edges in a graph. Link analysis is not a specific modeling technique, so it can be used for both directed and undirected data mining.
• A link analysis ranking algorithm starts with a set of Web pages. Depending on how this set of pages is obtained, we distinguish between query-independent and query-dependent algorithms. In the former case, the algorithm ranks the whole Web; in the latter, it ranks only the pages associated with the query.
Link analysis is used for 3 primary purposes:
• Find matches in data for known patterns of interest;
• Find anomalies where known patterns are violated;
• Discover new patterns of interest (social network analysis, data mining).
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
WEB AS A GRAPH
• Web as a directed graph:
  • Nodes: Webpages
  • Edges: Hyperlinks
[Figure: four example web pages shown as nodes connected by hyperlinks: “I teach a class on Networks.”, “CS224W: Classes are in the Gates building”, “Computer Science Department at Stanford”, “Stanford University”]
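The directed-graph view above can be sketched in a few lines of Python. This is only an illustration: the page names and the link directions below are assumptions (the figure's edges are not recoverable from the text), and `out_links` / `in_links` are hypothetical helper names.

```python
# Hypothetical toy web: each key is a page (node), each value lists the pages
# it links to (directed edges). Names and link directions are assumed.
web_graph = {
    "my-homepage": ["cs224w"],
    "cs224w": ["cs-department"],
    "cs-department": ["stanford"],
    "stanford": [],
}


def out_links(graph, page):
    """Pages that `page` links to (its outgoing edges)."""
    return list(graph[page])


def in_links(graph, page):
    """Pages that link to `page` (its incoming edges)."""
    return [src for src, targets in graph.items() if page in targets]


print(out_links(web_graph, "cs224w"))
print(in_links(web_graph, "stanford"))
```

Storing only out-links mirrors how hyperlinks live on the source page; in-links have to be found by scanning the other pages.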
• First try: Human-curated Web directories
  • Yahoo, DMOZ, LookSmart
• Second try: Web Search
  • Information Retrieval investigates: find relevant docs in a small and trusted set
    • Newspaper articles, Patents, etc.
  • But: the Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
• (1) The Web contains many sources of information. Who to “trust”?
  • Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to the query “newspaper”?
  • No single right answer
  • Trick: Pages that actually know about newspapers might all be pointing to many newspapers
RANKING NODES ON THE GRAPH
• Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
• There is large diversity in web-graph node connectivity. Let’s rank the pages by the link structure!
Link analysis approaches for computing the importance of nodes in a graph:
• PageRank algorithm
• Hyperlink-Induced Topic Search (HITS)
• Topic-Specific (Personalized) PageRank
• Web spam detection algorithms
PageRank (PR) is an algorithm used by Google Search to rank web pages in its search engine results. It is named after Larry Page, one of the founders of Google, and is a way of measuring the importance of web pages.
The PageRank of a website matters because it is one of the factors that decide whether your site shows up on the first page or the last page of the results when a user searches for something related to your business or product.
LINKS AS VOTES
• Idea: Links as votes
  • A page is more important if it has more links
    • In-coming links? Out-going links?
• Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  • Links from important pages count more
  • Recursive question!
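As a rough sketch of the "in-links as votes" idea, the snippet below counts one vote per incoming link. The link list and the helper name `count_in_links` are made up for illustration. Note that this naive count treats all in-links as equal, which is exactly what the last bullet calls into question.

```python
from collections import Counter

# Hypothetical link list: (source page, destination page).
links = [
    ("a.example", "popular.example"),
    ("b.example", "popular.example"),
    ("c.example", "popular.example"),
    ("a.example", "obscure.example"),
]


def count_in_links(edges):
    """Count one vote for a page per link that points to it."""
    return Counter(dst for _src, dst in edges)


votes = count_in_links(links)
# Pages ordered from most to fewest in-links.
ranking = [page for page, _count in votes.most_common()]
print(votes)
print(ranking)
```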
[Figure: example graph with PageRank scores. B: 38.4, C: 34.3, E: 8.1, F: 3.9, D: 3.9, A: 3.3, and five unlabeled nodes at 1.6 each.]
SIMPLE RECURSIVE FORMULATION
• Each link’s vote is proportional to the importance of its source page
• If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
• Page $j$’s own importance is the sum of the votes on its in-links
[Figure: page j has three out-links, each carrying $r_j/3$. It receives $r_i/3$ from page i and $r_k/4$ from page k, so $r_j = r_i/3 + r_k/4$.]
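The voting rule above can be written as a short function. This is a minimal sketch: the filler pages `x1` to `x5`, the importance values for i and k, and the function name `importance` are all assumptions chosen only to reproduce the shape of the example (i has 3 out-links, k has 4).

```python
from fractions import Fraction


def importance(page, out_links, rank):
    """Sum the votes on `page`'s in-links: each source contributes
    its rank divided by its number of out-links."""
    return sum(
        rank[src] / len(targets)
        for src, targets in out_links.items()
        if page in targets
    )


out_links = {
    "i": ["j", "x1", "x2"],        # i has 3 out-links, one of them to j
    "k": ["j", "x3", "x4", "x5"],  # k has 4 out-links, one of them to j
}
# Made-up importance values for the two source pages.
rank = {"i": Fraction(3, 10), "k": Fraction(2, 5)}

r_j = importance("j", out_links, rank)  # r_i/3 + r_k/4
print(r_j)
```

Exact fractions are used so the result can be compared with the hand calculation without rounding.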
• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” $r_j$ for page $j$
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
where $d_i$ is the out-degree of node $i$.
[Figure: “The web in 1839”, a graph of three pages y, a, m. y links to itself and to a (y/2 on each link); a links to y and to m (a/2 on each link); m links only to a (m).]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
SOLVING THE FLOW EQUATIONS
• 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo the scale factor
• Additional constraint forces uniqueness:
  • $r_y + r_a + r_m = 1$
  • Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
• Gaussian elimination works for small examples, but we need a better method for large web-size graphs
• We need a new formulation!
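For a graph this small, the flow equations can be solved directly. The sketch below is a plain Gauss-Jordan elimination over exact fractions; the function name `solve` and the choice to replace the third flow equation with the constraint $r_y + r_a + r_m = 1$ are assumptions made for illustration (the three flow equations are linearly dependent, so one of them can be dropped).

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(b)
    # Augmented matrix [a | b] with Fraction entries.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and swap it up.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot becomes 1.
        m[col] = [v / m[col][col] for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


half = Fraction(1, 2)
# Unknowns ordered (r_y, r_a, r_m).
a = [
    [half, -half, 0],  # r_y = r_y/2 + r_a/2  ->  r_y/2 - r_a/2 = 0
    [-half, 1, -1],    # r_a = r_y/2 + r_m    ->  -r_y/2 + r_a - r_m = 0
    [1, 1, 1],         # r_y + r_a + r_m = 1 (normalization constraint)
]
b = [0, 0, 1]
r_y, r_a, r_m = solve(a, b)
print(r_y, r_a, r_m)
```

Elimination takes on the order of n³ operations for n pages, which is why it does not scale to web-size graphs.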
PAGERANK: MATRIX FORMULATION
• Stochastic adjacency matrix $M$
  • Let page $i$ have $d_i$ out-links
  • If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
  • $M$ is a column-stochastic matrix
    • Columns sum to 1
• Rank vector $r$: vector with an entry per page
  • $r_i$ is the importance score of page $i$
  • $\sum_i r_i = 1$
• The flow equations can be written $r = M \cdot r$
Flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
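The definition of $M$ translates almost directly into code. The sketch below builds $M$ for the three-page graph (y, a, m) from the flow equations; `build_matrix` is a hypothetical helper, and it assumes every page has at least one out-link and no duplicate links.

```python
from fractions import Fraction


def build_matrix(out_links, pages):
    """Return M as a list of rows, with M[j][i] = 1/d_i if page i links
    to page j and 0 otherwise (d_i is the out-degree of page i)."""
    index = {page: n for n, page in enumerate(pages)}
    size = len(pages)
    m = [[Fraction(0)] * size for _ in range(size)]
    for src, targets in out_links.items():
        for dst in targets:
            m[index[dst]][index[src]] = Fraction(1, len(targets))
    return m


pages = ["y", "a", "m"]
# y links to itself and a; a links to y and m; m links to a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = build_matrix(out_links, pages)
# Each column should sum to 1 (column stochastic).
column_sums = [sum(row[i] for row in M) for i in range(len(pages))]
print(M)
print(column_sums)
```

Column $i$ holds the votes page $i$ hands out, which is why the columns, not the rows, sum to 1.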
• Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
• Flow equation in matrix form: $M \cdot r = r$
  • Suppose page $i$ links to 3 pages, including $j$
[Figure: in $M \cdot r = r$, column $i$ of $M$ has the entry $1/3$ in row $j$, so page $i$ contributes $r_i/3$ to $r_j$.]
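EXAMPLE: FLOW EQUATIONS & M
[Figure: graph of pages y, a, m]
Flow equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
Matrix $M$ (columns: source page; rows: destination page):
     y   a   m
y    ½   ½   0
a    ½   0   1
m    0   ½   0
$r = M \cdot r$:
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$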
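As a quick check of the example, the sketch below multiplies $M$ by the solution $r = (2/5, 2/5, 1/5)$ and confirms that the product is $r$ again. The helper name `mat_vec` is an assumption; the matrix and vector are the ones from the example.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Rows and columns ordered (y, a, m).
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(entry * x for entry, x in zip(row, vector)) for row in matrix]


print(mat_vec(M, r) == r)  # r satisfies r = M r

# Any rescaled vector satisfies the flow equations too, which is why the
# extra constraint r_y + r_a + r_m = 1 is needed to pick one solution.
scaled = [5 * x for x in r]
print(mat_vec(M, scaled) == scaled)
```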