Analyzing Hyperlinks Using Link Analysis and Graph Theory
1. Link analysis
Carlos Castillo
Center for Web Research
Computer Science Department
University of Chile
www.cwr.cl
2. Outline
- Motivation
- Bibliometrics
- Bibliometric laws
- Measures of similarity
- Ranking
- Pagerank
- HITS
- Web graph characterization
- Scale-free networks
- Macroscopic structure
- Summary
- References
3. Motivation
Hyperlinks are not placed at random; they provide valuable information for:
- Link-based ranking
- Structure analysis
- Detection of communities
- Spam detection
- ...
4. Topical locality
“We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random or other nearby pages.” [Davison, 2000]
5. What type of relationship?
A link from $P'$ to $P$ means that the content of $P$ is endorsed by the author of $P'$, but links can also mean:
- Disagreement
- Self-citation
- Citation to a popular document
- Citation to a methodological document
6. Bibliometrics
Quantitative analysis and statistics to describe patterns of publication.
7. Example 1: Lotka’s Law
Fraction of authors with $n$ papers $\propto 1/n^2$
- 1 publication = 0.60
- 2 publications = $(1/2)^2 \times 0.60 = 0.15$
- 3 publications = $(1/3)^2 \times 0.60 = 0.07$
- 7 publications = $(1/7)^2 \times 0.60 = 0.01$
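
The figures above can be reproduced with a few lines of Python. This is only a sketch: the function name `lotka_fractions` is illustrative, and the value 0.60 for single-paper authors is taken from the slide's example.

```python
def lotka_fractions(max_papers, single_paper_fraction=0.60):
    """Fraction of authors with n papers under Lotka's law: f(n) = f(1) / n**2."""
    return {n: single_paper_fraction / n**2 for n in range(1, max_papers + 1)}


# 0.60 is the slide's example value for authors with exactly one publication.
fractions = lotka_fractions(7)
for n in (1, 2, 3, 7):
    print(f"{n} publication(s): {fractions[n]:.2f}")
```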
8. Example 2: Bradford’s Law
Journals in a field can be divided into 3 groups, each group containing about the same number of papers. The number of journals in each group will be $\propto 1 : n : n^2$.
9. Counting the number of citations
Problems:
- Quantity, not quality
- Self-citations are frequent
- In some fields there are many publications, in others there are fewer
- Citations go from newer to older articles
- New documents have few citations
- 1/3 of the citations in a paper are not relevant
10. Measures of similarity
Studying the citations of papers:
- Bibliographic coupling: two documents share a significant portion of their bibliographies
- Co-citation: two documents are cited simultaneously by a significant number of other documents
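
Both measures can be computed as simple counts over a citation map. The sketch below assumes a hypothetical dictionary `cites` mapping each document to the set of documents it cites; the slide speaks of a "significant" overlap without fixing a threshold, so the functions return raw counts.

```python
def bibliographic_coupling(cites, a, b):
    """Number of references that documents a and b have in common."""
    return len(cites.get(a, set()) & cites.get(b, set()))


def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


# Hypothetical toy data: d1..d4 are papers, r1..r4 are older references.
cites = {
    "d1": {"r1", "r2", "r3"},
    "d2": {"r2", "r3", "r4"},
    "d3": {"d1", "d2"},
    "d4": {"d1", "d2", "r1"},
}

coupling = bibliographic_coupling(cites, "d1", "d2")  # shared references r2, r3
cocited = co_citation(cites, "d1", "d2")              # cited together by d3, d4
```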
11. Bibliographic coupling
12. Co-citation
13. Types of ranking
- Query-independent ranking
- Query-dependent ranking
14. Adversarial-IR in link-based ranking
“With a simple program, huge numbers of pages can be created easily, artificially inflating citation counts. Because the Web environment contains profit seeking ventures, attention getting strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy which counts replicable features of web pages is prone to manipulation” [Page et al., 1998].
15. Notation
Symbol            Meaning
$p$               A Web page
$\Gamma^{+}(p)$   Links pointing from page $p$
$\Gamma^{-}(p)$   Links pointing to page $p$
16. Hyperlink vector voting
Each link counts as a “vote” on certain keywords [Li, 1998]:
$h(P', P, w) = 1 \iff$ there is a link from $P'$ to $P$ with anchor text $w$
$$\mathrm{HVV}(P, w) = \sum_{P' \in \Gamma^{-}(P)} h(P', P, w)$$
Only the count of links is used ⇒ easy to manipulate
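
A minimal sketch of this vote count follows. It assumes links are given as hypothetical `(source, target, anchor_text)` triples and that anchor text is split into lowercase words to obtain the keywords $w$; neither detail comes from [Li, 1998].

```python
from collections import Counter


def hvv_scores(links):
    """Count, for every (page, word), how many distinct pages link to the page
    with that word in the anchor text."""
    # A set of (source, target, word) triples: each source page votes at most
    # once per word, matching h(P', P, w) being either 0 or 1.
    voters = {
        (src, dst, word.lower())
        for src, dst, anchor in links
        for word in anchor.split()
    }
    return Counter((dst, word) for _, dst, word in voters)


# Hypothetical toy links.
links = [
    ("a", "p", "web research"),
    ("b", "p", "research center"),
    ("a", "p", "research"),  # second link from a: no extra vote for "research"
    ("c", "q", "web"),
]
scores = hvv_scores(links)
```

Because only distinct linking pages are counted, anyone who can create many pages can raise a score, which is the weakness the slide points out.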
17. Pagerank
Measure of importance of a page, defined recursively [Page et al., 1998]: a page with high Pagerank is a page referenced by many pages with high Pagerank.
This is a simplified version:
$$\mathrm{Pagerank}(P) = \sum_{x \in \Gamma^{-}(P)} \frac{\mathrm{Pagerank}(x)}{|\Gamma^{+}(x)|}$$
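
The simplified recursion can be evaluated by repeated substitution. The sketch below is illustrative only: the graph is a hypothetical dictionary `out_links` in which every page appears as a key, scores are renormalized after each pass, and the toy graph is strongly connected so the iteration settles.

```python
def simplified_pagerank(out_links, iterations=100):
    """Iterate Pagerank(P) = sum over in-neighbours x of Pagerank(x) / |out-links of x|."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for x, targets in out_links.items():
            for p in targets:
                # x splits its current score evenly among the pages it links to.
                new_rank[p] += rank[x] / len(targets)
        total = sum(new_rank.values())
        # Normalize so the scores keep summing to 1.
        rank = {p: score / total for p, score in new_rank.items()}
    return rank


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_rank = simplified_pagerank(toy_graph)
```

On this graph the scores settle near 0.4 for `a` and `c` and 0.2 for `b`.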
18. Iterations with pseudo-Pagerank
Note: we should normalize after each iteration
19. Convergence with pseudo-Pagerank
20. Random jumps
So far, so good, but ...
- The Web includes many pages with no out-links; these will accumulate all of the score
- We would like Web pages to accumulate ranking
- We add random jumps (teleportation)
21. Random jumps
22. Random jumps to single node
23. Pagerank with random jumps
$$\mathrm{Pagerank}(P) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{x \in \Gamma^{-}(P)} \frac{\mathrm{Pagerank}(x)}{|\Gamma^{+}(x)|}$$
where $N$ is the total number of pages. Typically $\epsilon \approx 0.15$.
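
A hedged sketch of the iteration with random jumps is shown below. The parameter name `epsilon` for the jump probability and the dictionary `out_links` (every page is a key) are illustrative. Pages with no out-links are treated as if they linked to every page, which is one common convention and an assumption here, not something stated on the slide.

```python
def pagerank(out_links, epsilon=0.15, iterations=100):
    """Power iteration for Pagerank with a uniform random jump of probability epsilon."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share epsilon / N.
        new_rank = {p: epsilon / n for p in pages}
        for x, targets in out_links.items():
            if targets:
                share = (1 - epsilon) * rank[x] / len(targets)
                for p in targets:
                    new_rank[p] += share
            else:
                # Assumption: a page without out-links spreads its score uniformly.
                share = (1 - epsilon) * rank[x] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy graph; page d has no in-links.
jump_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
jump_rank = pagerank(jump_graph)
```

In this toy graph page `d` receives score only through random jumps, so its value is exactly `epsilon / N`.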
24. Pagerank calculation
- Iterative calculation
- Exploit blocks in the matrix (locality of links)
- The extra node can be added after the last iteration
- Dangling nodes can be added after the last iteration [Eiron et al., 2004]
25. Alternatives
- Ranking as a vector: topic-sensitive Pagerank [Haveliwala, 2002]
- Network flows: TrafficRank [Tomlin, 2003]
- Dynamic absorbing model [Amati et al., 2003]
26. Dynamic absorbing model
Scores are the probabilities of the “clone” nodes
27. Hubs and authorities
[Kleinberg, 1999]
28. Algorithm
1) Obtain a result set using text search
2) Expand the result set with in- and out-links
3) Calculate hubs and authorities
29. Expansion phase
30. Calculation phase
Initialize:
$$\mathrm{hub}(p, 0) = \mathrm{auth}(p, 0) = 1$$
Iterate:
$$\mathrm{hub}(p, t) = \sum_{x \in \Gamma^{+}(p)} \frac{\mathrm{auth}(x, t-1)}{|\Gamma^{-}(x)|}$$
$$\mathrm{auth}(p, t) = \sum_{x \in \Gamma^{-}(p)} \frac{\mathrm{hub}(x, t-1)}{|\Gamma^{+}(x)|}$$
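
The calculation phase can be sketched as follows. The sketch makes two assumptions: a page's hub score sums over the pages it links to, with each authority score divided by that page's in-degree, and a page's authority score sums over the pages linking to it, with each hub score divided by that page's out-degree. The function name, the dictionary `out_links` and the toy graph are illustrative.

```python
def hits(out_links, iterations=20):
    """Iterate normalized hub and authority scores on a small link graph."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    out = {p: set(out_links.get(p, ())) for p in pages}
    in_links = {p: set() for p in pages}
    for x, targets in out.items():
        for q in targets:
            in_links[q].add(x)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Both updates read the scores of the previous iteration (t - 1).
        new_hub = {p: sum(auth[x] / len(in_links[x]) for x in out[p]) for p in pages}
        new_auth = {p: sum(hub[x] / len(out[x]) for x in in_links[p]) for p in pages}
        hub, auth = new_hub, new_auth
    return hub, auth


# Hypothetical expanded result set: three hub-like pages, two authority-like pages.
hits_graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(hits_graph)
```

In this toy graph `a1` is linked from all three hubs and ends with the higher authority score, while `h1` and `h2` end with higher hub scores than `h3`.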
31. Web graph characterization
“While entirely of human design, the emerging network appears to have more in common with a cell or an ecological system than with a Swiss watch.” [Barabási, 2001]
32. The Web is not a random network
Scale-free network [Barabási, 2002]
33. Relationship of Web links with goods imports
[To be available]
34. Other scale-free networks
- Power grid designs
- Sexual partners in humans
- Collaboration of movie actors in films
- Citations in scientific publications
- Protein interactions
35. Distribution of degree
36. Giant strongly-connected component
[Broder et al., 2000]
- The core of the Web is a strongly-connected component (MAIN)
- Nodes reachable from this component are in the OUT component
- Nodes that reach this component are in the IN component
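
The three components can be found with two breadth-first searches. The sketch below is simplified and assumes that a seed page known to lie in the core is supplied; `out_links`, `core_seed` and the toy graph are hypothetical names and data.

```python
from collections import deque


def reachable(adjacency, start):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(out_links, core_seed):
    """Split nodes into MAIN, IN and OUT relative to the component containing core_seed."""
    reverse = {}
    for src, targets in out_links.items():
        for dst in targets:
            reverse.setdefault(dst, set()).add(src)
    forward = reachable(out_links, core_seed)   # nodes the seed can reach
    backward = reachable(reverse, core_seed)    # nodes that can reach the seed
    main = forward & backward
    return {"MAIN": main, "IN": backward - main, "OUT": forward - main}


# Hypothetical toy graph: i -> a -> b -> c -> a, c -> o, plus an unrelated link x -> y.
web = {"i": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "o"], "o": [], "x": ["y"]}
parts = bow_tie(web, "a")
```

Nodes such as `x` and `y`, which neither reach the core nor are reached from it, fall outside all three sets.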
37. Bow-tie structure
38. Bow-tie structure and details
[Baeza-Yates and Castillo, 2001]
39. Bow-tie and search behavior
40. Summary
- Hyperlinks should be exploited for extracting information
- Link analysis should exploit global properties
41. References
Amati, G., Ounis, I., and Plachouras, V. (2003). The dynamic absorbing model for the web. Technical Report TR-2003-137, Department of Computing Science, University of Glasgow.

Baeza-Yates, R. and Castillo, C. (2001). Relating Web characteristics with link based Web page ranking. In Proceedings of String Processing and Information Retrieval, pages 21–32, Laguna San Rafael, Chile. IEEE CS Press.

Barabási, A.-L. (2001). The physics of the web. PhysicsWeb.ORG, online journal.

Barabási, A.-L. (2002). Linked: the new science of networks. Perseus Publishing.

Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands.

Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279. ACM Press.

Eiron, N., McCurley, K. S., and Tomlin, J. A. (2004). Ranking the web frontier. In Proceedings of the 13th international conference on World Wide Web, pages 309–318. ACM Press.

Haveliwala, T. H. (2002). Topic-sensitive pagerank. In Proceedings of the Eleventh World Wide Web Conference, pages 517–526, Honolulu, Hawaii, USA. ACM Press.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.

Li, Y. (1998). Toward a qualitative search engine. IEEE Internet Computing, pages 24–29.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The Pagerank citation algorithm: bringing order to the web. In Proceedings of the seventh conference on World Wide Web, Brisbane, Australia.

Tomlin, J. A. (2003). A new paradigm for ranking pages on the world wide web. In Proceedings of the Twelfth Conference on World Wide Web, pages 350–355, Budapest, Hungary. ACM Press.