LINK ANALYSIS
Ahnaf Tahmeed
ID:23542605015
LINK ANALYSIS
 Link analysis is based on a branch of mathematics called graph
theory, which represents objects as nodes and the relationships between
them as edges in a graph. Link analysis is not a specific modeling
technique, so it can be used for both directed and undirected data mining.
 A link analysis ranking algorithm starts with a set of Web pages.
Depending on how this set of pages is obtained, we distinguish
between query-independent algorithms and query-dependent
algorithms. In the former case, the algorithm ranks the whole Web;
in the latter, it ranks only the pages retrieved for a specific query.
Link analysis is used for 3 primary purposes:
 Find matches in data for known patterns of interest;
 Find anomalies where known patterns are violated;
 Discover new patterns of interest (social network analysis, data mining).
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
WEB AS A GRAPH
 Web as a directed graph:
 Nodes: Webpages
 Edges: Hyperlinks
[Figure: example web pages linked by hyperlinks: “I teach a class on Networks.”, “CS224W: Classes are in the Gates building”, “Computer Science Department at Stanford”, “Stanford University”]
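The directed-graph view above can be sketched in a few lines of Python. This is only an illustration: the page names and links below are hypothetical, and the helper `in_links` is an assumed name, not something from the slides.

```python
from collections import defaultdict

# Hypothetical toy web: each key is a page (node); each list holds the
# pages it links to (directed edges, i.e. hyperlinks).
out_links = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["home", "about"],
}


def in_links(out_links):
    """Invert the out-link map: for each page, list the pages linking to it."""
    incoming = defaultdict(list)
    for source, targets in out_links.items():
        for target in targets:
            incoming[target].append(source)
    return dict(incoming)


incoming = in_links(out_links)
print(incoming["home"])  # pages that link to "home"
```

Storing out-links and deriving in-links on demand keeps the graph in one place; the in-link view is the one the ranking ideas later in the deck rely on.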
 First try: Human curated Web directories
 Yahoo, DMOZ, LookSmart
 Second try: Web Search
 Information Retrieval investigates: Find relevant docs in a small and trusted set
 Newspaper articles, Patents, etc.
 But: Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
 (1) Web contains many sources of information. Who to “trust”?
 Trick: Trustworthy pages may point to each other!
 (2) What is the “best” answer to query “newspaper”?
 No single right answer
 Trick: Pages that actually know about newspapers might all be pointing to many newspapers
RANKING NODES ON THE GRAPH
 Not all web pages are equally “important”: www.joe-schmoe.com vs. www.stanford.edu
 There is large diversity in the web-graph node connectivity.
Let’s rank the pages by the link structure!
Link analysis approaches for computing the importance of nodes in a graph:
 PageRank algorithm
 Hyperlink-Induced Topic Search (HITS)
 Topic-Specific (Personalized) PageRank
 Web spam detection algorithms
PageRank (PR) is an algorithm used by Google Search to
rank web pages in its search engine results. It is
named after Larry Page, one of the founders of
Google, and is a way of measuring the importance of
web pages.
A site's PageRank matters because it influences whether
the site shows up on the first page or a later page of
results when a user searches for something related to
its business or product.
LINKS AS VOTES
 Idea: Links as votes
 Page is more important if it has more links
 In-coming links? Out-going links?
 Think of in-links as votes:
 www.stanford.edu has 23,400 in-links
 www.joe-schmoe.com has 1 in-link
 Are all in-links equal?
 Links from important pages count more
 Recursive question!
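As a rough sketch of the naive "in-links as votes" idea, before any weighting by importance, one can simply count in-links per page. The edge list below is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical directed edges (source page, target page).
edges = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

# One vote per in-link: count how often each page appears as a target.
in_degree = Counter(target for _, target in edges)

# Pages ordered by vote count, most linked-to first.
ranking = [page for page, _ in in_degree.most_common()]
print(ranking)
```

This treats every in-link as equal, which is exactly the shortcoming the bullets above point out: a link from an important page should count for more.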
[Figure: example graph labeled with per-node scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five further nodes with 1.6 each]
SIMPLE RECURSIVE FORMULATION
 Each link’s vote is proportional to the importance of its source page
 If page j with importance r_j has n out-links, each link gets r_j / n votes
 Page j’s own importance is the sum of the votes on its in-links
[Figure: page j has in-links from pages i and k and 3 out-links. Page i contributes r_i/3 and page k contributes r_k/4, so r_j = r_i/3 + r_k/4; each out-link of j carries r_j/3.]
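The vote-splitting rule can be sketched as follows, using the figure's setup where page i has 3 out-links and page k has 4. The rank values assigned to i and k are hypothetical placeholders, and `importance` is an assumed helper name.

```python
from fractions import Fraction


def importance(page, in_links, out_degree, rank):
    """Sum the votes on a page's in-links: each source splits its rank
    evenly across its out-links."""
    return sum(rank[src] / out_degree[src] for src in in_links[page])


# Setup from the figure: j is pointed to by i (3 out-links) and k (4 out-links).
in_links = {"j": ["i", "k"]}
out_degree = {"i": 3, "k": 4}

# Hypothetical importance values for the two source pages.
rank = {"i": Fraction(3, 10), "k": Fraction(2, 5)}

r_j = importance("j", in_links, out_degree, rank)
print(r_j)  # r_i/3 + r_k/4 = 1/10 + 1/10 = 1/5
```

Exact fractions are used here only so the arithmetic matches the hand calculation; floats would work as well.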
 A “vote” from an important page is worth more
 A page is important if it is pointed to by other important pages
 Define a “rank” r_j for page j
Flow equation for the rank of page j, where d_i is the out-degree of node i:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$
[Figure: “The web in 1839”, a graph of three pages y, a, m with edges y→y and y→a (each labeled y/2), a→y and a→m (each labeled a/2), and m→a (labeled m)]
“Flow” equations:
r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2
SOLVING THE FLOW EQUATIONS
 3 equations, 3 unknowns, no constants
 No unique solution
 All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
 $r_y + r_a + r_m = 1$
 Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
 Gaussian elimination works for small examples, but we need a better method for large web-size graphs
 We need a new formulation!
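For a graph this small, the Gaussian elimination mentioned above can be written out directly. The sketch below is one possible way to do it, not the slides' own code: it assumes the redundant third flow equation is replaced by the constraint r_y + r_a + r_m = 1, and the function name `solve` is illustrative.

```python
from fractions import Fraction as F


def solve(a, b):
    """Gauss-Jordan elimination on a small square system a x = b (exact)."""
    n = len(a)
    rows = [[F(x) for x in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# Unknowns ordered (r_y, r_a, r_m).
coefficients = [
    [F(1, 2), F(-1, 2), 0],  # r_y - r_y/2 - r_a/2 = 0
    [F(-1, 2), 1, -1],       # r_a - r_y/2 - r_m   = 0
    [1, 1, 1],               # r_y + r_a + r_m     = 1 (normalization)
]
constants = [0, 0, 1]

r_y, r_a, r_m = solve(coefficients, constants)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination of this kind costs on the order of n³ operations for n pages, which is why the slides ask for a different formulation at web scale.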
PAGERANK: MATRIX FORMULATION
 Stochastic adjacency matrix $M$
 Suppose page $i$ has $d_i$ out-links
 If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
 $M$ is a column stochastic matrix
 Columns sum to 1
 Rank vector $r$: vector with an entry per page
 $r_i$ is the importance score of page $i$
 $\sum_i r_i = 1$
 The flow equations can be written
$r = M \cdot r$
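The definition of M can be sketched in code for the three-page graph (y, a, m) from the flow equations. This is a minimal illustration that assumes every page has at least one out-link; `stochastic_matrix` is an assumed helper name.

```python
from fractions import Fraction


def stochastic_matrix(out_links, pages):
    """Build M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    m = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in out_links.items():
        d_i = len(targets)  # out-degree of page i (assumed non-zero)
        for j in targets:
            m[index[j]][index[i]] = Fraction(1, d_i)
    return m


# Edges read off the flow equations: y -> y, a; a -> y, m; m -> a.
pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = stochastic_matrix(out_links, pages)
for row in M:
    print([str(x) for x in row])
```

Column i of M describes how page i distributes its rank, which is why each column sums to 1.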
 Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
 Flow equation in the matrix form: $M \cdot r = r$
 Suppose page i links to 3 pages, including j
[Figure: the product M · r = r, where row j of M has the entry 1/3 in column i, so page i contributes r_i/3 to r_j]
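EXAMPLE: FLOW EQUATIONS & M
Matrix M for the graph on pages y, a, m (column = source page, row = destination page):
      y    a    m
y     ½    ½    0
a     ½    0    1
m     0    ½    0
r = M∙r:
$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$
Flow equations:
r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2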
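As a quick check of the example, the sketch below multiplies M by the solution (2/5, 2/5, 1/5) and confirms that the vector is unchanged. The helper `mat_vec` is an assumed name for a plain matrix-vector product.

```python
from fractions import Fraction as F


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# M from the example, rows and columns ordered (y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]

r = [F(2, 5), F(2, 5), F(1, 5)]
print(mat_vec(M, r) == r)  # True: r satisfies r = M r

# Any rescaling of r also satisfies the flow equations; only the
# constraint that the entries sum to 1 singles out (2/5, 2/5, 1/5).
scaled = [F(2), F(2), F(1)]
print(mat_vec(M, scaled) == scaled)  # True
```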