Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). G...
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and ...
Boldi, P., Santini, M., and Vigna, S. (2005). Pagerank...
Flajolet, P. and Martin, N. G. (1985). Probabilistic c...
Palmer, C. R., Gibbons, P. B., and Faloutsos, ...
Link Analysis (RBY)

  1. Link Analysis on the Web: The big picture, the small picture and the medium-sized picture. Ricardo Baeza-Yates (3, 4). Joint work with: L. Becchetti (1), P. Boldi (2), C. Castillo (1, 3), D. Donato (1, 3), S. Leonardi (1), B. Poblete (5). Affiliations: (1) Università di Roma "La Sapienza", Rome, Italy; (2) Università degli Studi di Milano, Milan, Italy; (3) Yahoo! Research Barcelona, Catalunya, Spain; (4) Yahoo! Research Latin America, Santiago, Chile; (5) Universitat Pompeu Fabra, Catalunya, Spain.
  2. Outline: (1) Levels of Link Analysis; (2) Generalizing PageRank; (3) Other Functional Rankings; (4) Web Spam; (5) Web Spam Detection; (6) Topological Web Spam; (7) Direct Counting of Supporters; (8) Spam Detection Results.
  7. How to find meaningful patterns? Several levels of analysis: macroscopic view (overall structure), microscopic view (nodes), mesoscopic view (regions).
  8. Macroscopic view, e.g. Bow-tie [Broder et al., 2000]
  9. Macroscopic view, e.g. Bow-tie, migration [Baeza-Yates and Poblete, 2006]
  10. Macroscopic view, e.g. Jellyfish: Internet Autonomous Systems (AS) topology [Tauro et al., 2001]
  12. Microscopic view, e.g. Degree [Barabási, 2002] and others
  13. Microscopic view, e.g. Degree. Figure panels: Greece, Chile, Spain, Korea. [Baeza-Yates et al., 2006b] compares this distribution in 8 countries . . . guess what the result is?
  16. Mesoscopic view, e.g. Hop-plot [Baeza-Yates et al., 2006a]. Figure: frequency (0.0 to 0.3) versus distance (5 to 30) for four graphs: .it (40M pages), .uk (18M pages), .eu.int (800K pages) and a synthetic graph (100K pages).
  20. Notation. Let $P_{N \times N}$ be the normalized link matrix of a graph: row-normalized, with no "sinks".
      Definition (PageRank): the stationary state of
      $$\alpha P + \frac{(1 - \alpha)}{N} \mathbf{1}_{N \times N}$$
      Follow links with probability $\alpha$; make a random jump with probability $1 - \alpha$.
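The definition above can be sketched as a power iteration. This is only an illustrative sketch: the adjacency-list representation, the function name `pagerank`, the 4-node example graph and the fixed iteration count are assumptions, not part of the slides.

```python
def pagerank(out_links, alpha=0.8, iterations=100):
    """Approximate the stationary state of alpha*P + (1 - alpha)/N * 1.

    out_links[i] lists the nodes that node i links to; every node must
    have at least one out-link (the "no sinks" assumption).
    """
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Random jump: every node receives an equal share of (1 - alpha).
        nxt = [(1.0 - alpha) / n] * n
        for i, targets in enumerate(out_links):
            # Following links: row-normalized P splits rank[i] evenly.
            share = alpha * rank[i] / len(targets)
            for j in targets:
                nxt[j] += share
        rank = nxt
    return rank


# Hypothetical 4-node graph without sinks (node 3 has no in-links).
graph = [[1, 2], [2], [0], [0, 2]]
scores = pagerank(graph, alpha=0.8)
```

In this toy graph node 3 is never linked to, so it only collects the random-jump mass `(1 - alpha) / N`.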
  22. Explicit Formulas. Formulas for PageRank [Newman et al., 2001, Boldi et al., 2005]:
      $$r(\alpha) = \frac{(1 - \alpha)}{N} \sum_{t=0}^{\infty} (\alpha P)^t$$
      $$r_i(\alpha) = \sum_{p \in \mathrm{Path}(-, i)} \frac{(1 - \alpha)\alpha^{|p|}}{N} \, \mathrm{branching}(p)$$
      where $\mathrm{Path}(-, i)$ is the set of incoming paths in node $i$.
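As a rough numerical check of the series formula, the sketch below truncates the sum after a fixed number of terms and measures how far the result is from satisfying the stationarity equation. It assumes the series is applied to a row vector of ones; the helper names, the example graph and the truncation length are illustrative choices only.

```python
def step(vec, out_links):
    """Return the row-vector product vec * P for a row-normalized P."""
    out = [0.0] * len(vec)
    for i, targets in enumerate(out_links):
        for j in targets:
            out[j] += vec[i] / len(targets)
    return out


def pagerank_series(out_links, alpha, terms):
    """Truncated sum (1 - alpha)/N * 1 * sum_{t < terms} (alpha * P)^t."""
    n = len(out_links)
    term = [(1.0 - alpha) / n] * n      # the t = 0 term
    total = term[:]
    for _ in range(1, terms):
        term = [alpha * x for x in step(term, out_links)]
        total = [a + b for a, b in zip(total, term)]
    return total


# Hypothetical 4-node graph without sinks.
graph = [[1, 2], [2], [0], [0, 2]]
alpha = 0.8
r = pagerank_series(graph, alpha, terms=200)

# Largest violation of r = alpha * r * P + (1 - alpha)/N over all nodes.
residual = max(
    abs(a - (alpha * b + (1.0 - alpha) / len(graph)))
    for a, b in zip(r, step(r, graph))
)
```

Each term only needs the previous term and one pass over the links, so no matrix power is ever stored.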
  23. Branching contribution. Definition (branching contribution of a path): given a path $p = \langle x_1, x_2, \ldots, x_t \rangle$ of length $t = |p|$,
      $$\mathrm{branching}(p) = \frac{1}{d_1 d_2 \cdots d_{t-1}}$$
      where the $d_i$ are the out-degrees of the members of the path. For every node $i$ and every length $t$,
      $$\sum_{p \in \mathrm{Path}(i, -),\; |p| = t} \mathrm{branching}(p) = 1.$$
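The normalization property can be checked by brute force on a small graph. In this sketch a path is a list of nodes, its length is counted in links (an assumption about the indexing convention), and the out-degree of the last node is not included in the product; the graph and function names are hypothetical.

```python
def paths_from(out_links, start, edges):
    """Yield every path (as a node list) with `edges` links starting at `start`."""
    if edges == 0:
        yield [start]
        return
    for nxt in out_links[start]:
        for rest in paths_from(out_links, nxt, edges - 1):
            yield [start] + rest


def branching(out_links, path):
    """1 / (product of the out-degrees of every node on the path but the last)."""
    result = 1.0
    for node in path[:-1]:
        result /= len(out_links[node])
    return result


# Hypothetical 4-node graph without sinks.
graph = [[1, 2], [2], [0], [0, 2]]

# Sum of branching contributions over all outgoing paths, per start node and length.
totals = {
    (i, t): sum(branching(graph, p) for p in paths_from(graph, i, t))
    for i in range(len(graph))
    for t in range(4)
}
```

Every entry of `totals` comes out as 1, because branching(p) is the probability that a walker following links uniformly at random traverses exactly the path p.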
  24. Functional ranking. General functional ranking [Baeza-Yates et al., 2006a]:
      $$r_i(\alpha) = \sum_{p \in \mathrm{Path}(-, i)} \frac{\mathrm{damping}(|p|)}{N} \, \mathrm{branching}(p)$$
      PageRank is a particular case of path-based ranking.
  26. Exponential damping = PageRank. With exponential damping, $\mathrm{damping}(t) = (1 - \alpha)\alpha^t$, the functional ranking is exactly PageRank. Most of the contribution is on the first few levels. Figure: weight (0.00 to 0.30) versus length of the path (t = 1 to 10) for damping(t) with α = 0.8 and α = 0.7.
  27. Linear damping.
      $$\mathrm{damping}(t) = \begin{cases} \dfrac{2(L - t)}{L(L + 1)} & t < L \\ 0 & t \ge L \end{cases}$$
      Figure: weight (0.00 to 0.30) versus length of the path (t = 1 to 10) for damping(t) with L = 15 and L = 10.
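A small sketch comparing the two damping functions may help. The exponential weights used here are the ones implied by the explicit PageRank path formula, $(1 - \alpha)\alpha^t$; the function names and the parameter values (α = 0.8, L = 10, 200 terms) are illustrative assumptions.

```python
def exponential_damping(t, alpha):
    """Weight of paths of length t under PageRank-style damping."""
    return (1.0 - alpha) * alpha ** t


def linear_damping(t, L):
    """Weight of paths of length t under linear damping; zero from length L on."""
    return 2.0 * (L - t) / (L * (L + 1)) if t < L else 0.0


exp_weights = [exponential_damping(t, 0.8) for t in range(200)]
lin_weights = [linear_damping(t, 10) for t in range(200)]

# Both families distribute a total weight of (approximately) 1 over path lengths.
exp_total = sum(exp_weights)
lin_total = sum(lin_weights)
```

The linear weights sum to exactly 1 after L terms, whereas the exponential weights only approach 1 as more terms are added.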
  29. Example: Calculating LinearRank. For calculating LinearRank we use:
      $$\mathrm{LinearRank} = \frac{1}{N} \sum_{t=0}^{\infty} \mathrm{damping}(t) P^t = \frac{1}{N} \sum_{t=0}^{L-1} \frac{2(L - t)}{L(L + 1)} P^t$$
      However, we cannot hold the temporary $P^t$ in memory!
  32. Re-write the damping as a recursion. We have to rewrite the sum to be able to calculate it:
      $$R^{(0)} = \frac{2}{L + 1}$$
      $$R^{(k+1)} = \frac{(L - k - 1)}{(L - k)} R^{(k)} P$$
      $$\mathrm{LinearRank} = \sum_{k=0}^{L-1} R^{(k)}$$
      Now we can give the algorithm . . .
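One way to see why the recursion reproduces the linear damping weights is to look at the ratio of consecutive weights. The derivation below is a sketch that reads $R^{(k)}$ as the row vector $\mathrm{damping}(k)\,\mathbf{1}P^k$ and leaves out the $1/N$ factor; both are assumptions about the intended notation.

```latex
\begin{align*}
\mathrm{damping}(0) &= \frac{2(L - 0)}{L(L + 1)} = \frac{2}{L + 1}, \\
\frac{\mathrm{damping}(k + 1)}{\mathrm{damping}(k)}
  &= \frac{2(L - k - 1)}{L(L + 1)} \cdot \frac{L(L + 1)}{2(L - k)}
   = \frac{L - k - 1}{L - k}, \qquad 0 \le k \le L - 2, \\
R^{(k+1)} &= \mathrm{damping}(k + 1)\, \mathbf{1} P^{k+1}
   = \frac{L - k - 1}{L - k} \left( \mathrm{damping}(k)\, \mathbf{1} P^{k} \right) P
   = \frac{L - k - 1}{L - k}\, R^{(k)} P .
\end{align*}
```

Only the current vector $R^{(k)}$ is needed at each step, which avoids storing any power $P^t$.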
  36. Algorithm.
      for i : 1 . . . N do {Initialization}
          Score[i] ← R[i] ← 2/(L+1)
      end for
      for k : 0 . . . L − 2 do {Iteration step}
          Aux ← 0
          for i : 1 . . . N do {Follow links in the graph}
              for all j such that there is a link from i to j do
                  Aux[j] ← Aux[j] + R[i]/outdegree(i)
              end for
          end for
          for i : 1 . . . N do {Add to ranking value}
              R[i] ← Aux[i] × (L−k−1)/(L−k)
              Score[i] ← Score[i] + R[i]
          end for
      end for
      return Score
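A possible Python rendering of the LinearRank computation is sketched below. It follows the recursion $R^{(k+1)} = \frac{L-k-1}{L-k} R^{(k)} P$ with $k$ running from 0 to $L - 2$ and compares the result with a direct term-by-term evaluation of the damped sum. The function names, the adjacency-list input and the example graph are assumptions; like the slides, the sketch omits the $1/N$ factor and assumes no sinks.

```python
def step(vec, out_links):
    """Return the row-vector product vec * P for a row-normalized P."""
    out = [0.0] * len(vec)
    for i, targets in enumerate(out_links):
        for j in targets:
            out[j] += vec[i] / len(targets)
    return out


def linear_rank(out_links, L):
    """LinearRank via the recursion, keeping only one vector R in memory."""
    n = len(out_links)
    r = [2.0 / (L + 1)] * n                 # R^(0)
    score = r[:]
    for k in range(L - 1):                  # builds R^(1) ... R^(L-1)
        factor = (L - k - 1) / (L - k)
        r = [factor * x for x in step(r, out_links)]
        score = [s + x for s, x in zip(score, r)]
    return score


def linear_rank_direct(out_links, L):
    """Reference value: sum over t < L of damping(t) * 1 * P^t."""
    n = len(out_links)
    power = [1.0] * n                       # 1 * P^0
    total = [0.0] * n
    for t in range(L):
        weight = 2.0 * (L - t) / (L * (L + 1))
        total = [s + weight * x for s, x in zip(total, power)]
        power = step(power, out_links)
    return total


# Hypothetical 4-node graph without sinks.
graph = [[1, 2], [2], [0], [0, 2]]
fast = linear_rank(graph, L=10)
reference = linear_rank_direct(graph, L=10)
max_diff = max(abs(a - b) for a, b in zip(fast, reference))
```

Because the $1/N$ factor is left out, the scores add up to the number of nodes rather than to 1.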
  37. Algorithm (general).
      for i : 1 . . . N do {Initialization}
          Score[i] ← R[i] ← INIT
      end for
      for k : 1 . . . STOP do {Iteration step}
          Aux ← 0
          for i : 1 . . . N do {Follow links in the graph}
              for all j such that there is a link from i to j do
                  Aux[j] ← Aux[j] + R[i]/outdegree(i)
              end for
          end for
          for i : 1 . . . N do {Add to ranking value}
              R[i] ← Aux[i] × FACTOR
              Score[i] ← Score[i] + R[i]
          end for
      end for
      return Score
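The general scheme could be sketched in Python as follows. The slides leave INIT, STOP and FACTOR abstract, so the two instantiations shown here (a PageRank-like one and a LinearRank-like one) are assumptions derived from the damping functions, and `functional_rank` is a hypothetical name. As in the pseudocode, there is no $1/N$ factor and the graph must have no sinks.

```python
def functional_rank(out_links, init, stop, factor):
    """Generic damped ranking; factor(k) is the FACTOR used at iteration k."""
    n = len(out_links)
    r = [init] * n
    score = r[:]
    for k in range(1, stop + 1):
        aux = [0.0] * n
        for i, targets in enumerate(out_links):   # follow links in the graph
            for j in targets:
                aux[j] += r[i] / len(targets)
        f = factor(k)
        r = [f * x for x in aux]                  # damp the propagated mass
        score = [s + x for s, x in zip(score, r)] # add to ranking value
    return score


# Hypothetical 4-node graph without sinks (node 3 has no in-links).
graph = [[1, 2], [2], [0], [0, 2]]

# Assumed PageRank-like instance: INIT = 1 - alpha, FACTOR = alpha,
# STOP chosen large enough for the remaining terms to be negligible.
alpha = 0.8
pr = functional_rank(graph, init=1.0 - alpha, stop=200, factor=lambda k: alpha)

# Assumed LinearRank-like instance: iteration k builds R^(k) from R^(k-1).
L = 10
lr = functional_rank(
    graph, init=2.0 / (L + 1), stop=L - 1,
    factor=lambda k: (L - k) / (L - k + 1),
)
```

Passing FACTOR as a function of the iteration number lets the same loop cover both a constant factor and one that changes at every step.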
  38. Other damping functions
      Empirical damping:
      [Figure: average text similarity (vertical axis, 0.2 to 0.7) versus link distance (horizontal axis, 1 to 5)]
  42. Using LinearRank to approximate PageRank
      Experimental comparison: 18-million nodes in the U.K. Web Graph
      - Calculated PageRank with α = 0.1, 0.2, ..., 0.9
      - Calculated LinearRank with L = 5, 10, ..., 25
      - For certain combinations of parameters, the rankings are almost equal!
  43. Experimental comparison
      Experimental Comparison in the U.K. Web Graph
      [Figure: τ (0.80 to 1.00) as a function of L (5 to 25) and α (0.5 to 0.9); the region τ ≥ 0.95 is labeled]
  44. Prediction of best parameter combination
      Prediction of Best Parameter Combinations (Analysis)
      [Figure: L that maximizes Kendall's τ (5 to 25) versus exponent α (0.5 to 0.9); series: Actual optimum, Predicted optimum with length=5]
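The comparison above is reported in terms of Kendall's τ between two rankings. As a hedged illustration, the sketch below computes the tau-a variant by counting concordant and discordant pairs; the slides do not say which variant or which implementation was used, and this quadratic loop is only practical for small inputs, not for an 18-million-node graph.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equally long score lists.

    Tied pairs count as neither concordant nor discordant (an assumption).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs
```

Two score vectors that order every pair of pages the same way give τ = 1, and fully reversed orders give τ = −1.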
  45. Outline: 1 Levels of Link Analysis; 2 Generalizing PageRank; 3 Other Functional Rankings; 4 Web Spam; 5 Web Spam Detection; 6 Topological Web Spam; 7 Direct Counting of Supporters; 8 Spam Detection Results
  48. What is on the Web?
      Information + Porn + On-line casinos + Free movies + Cheap software + Buy a MBA diploma + Prescription-free drugs + V!-4-gra + Get rich now now now!!!
      Graphic: www.milliondollarhomepage.com
  57. Opportunities for Web spam
      - Spamdexing
      - Keyword stuffing
      - Link farms
      - Scraper, “Made for Advertising” sites
      - Spam blogs (splogs)
      - Cloaking
      - Click spam
      Adversarial relationship
      Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
  58. Typical Web Spam (1)
  59. Typical Web Spam (2)
  60. Hidden text
  61. Made for Advertising (1)
  62. Made for Advertising (2)
  63. Made for Advertising (3)
  64. Search engine?
  65. Fake search engine
  66. Problem: “normal” pages that are spam
  67. Outline: 1 Levels of Link Analysis; 2 Generalizing PageRank; 3 Other Functional Rankings; 4 Web Spam; 5 Web Spam Detection; 6 Topological Web Spam; 7 Direct Counting of Supporters; 8 Spam Detection Results
  68. Machine Learning
  69. Machine Learning (cont.)
  70. Feature Extraction
  73. Challenges: Machine Learning
      Machine Learning Challenges:
      - Learning with interdependent variables (graph)
      - Learning with few examples
      - Scalability
  78. Challenges: Information Retrieval
      Information Retrieval Challenges:
      - Feature extraction: which features?
      - Feature aggregation: page/host/domain
      - Feature propagation (graph)
      - Recall/precision tradeoffs
      - Scalability
  79. Outline: 1 Levels of Link Analysis; 2 Generalizing PageRank; 3 Other Functional Rankings; 4 Web Spam; 5 Web Spam Detection; 6 Topological Web Spam; 7 Direct Counting of Supporters; 8 Spam Detection Results
  81. Topological spam: link farms
      Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
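To make the idea of "nodes sharing their out-links" concrete, here is a deliberately simple Python sketch that groups nodes whose out-link sets are exactly identical. It is an illustration only and does not reproduce the method of the cited paper; the function name, the dictionary input and the two thresholds are hypothetical.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_size=3, min_links=2):
    """Group nodes whose sets of out-links are exactly the same.

    out_links maps a node to the nodes it links to. Groups smaller than
    min_size, or sharing fewer than min_links targets, are ignored
    (both thresholds are illustrative assumptions).
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        if len(key) >= min_links:
            groups[key].append(node)
    return [sorted(nodes) for nodes in groups.values() if len(nodes) >= min_size]
```

Exact matching misses farms whose members differ in even one link, so a practical detector would need some notion of approximate overlap.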
  82. Motivation
      [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
      “in a number of these distributions, outlier values are associated with web spam”
  84. Test collection
      U.K. collection
      - 18.5 million pages downloaded from the .UK domain
      - 5,344 hosts manually classified (6% of the hosts)
      Classified entire hosts:
      ✓ A few hosts are mixed: spam and non-spam pages
      ✗ More coverage: sample covers 32% of the pages
  85. In-degree
      [Figure: In-degree, distribution of the number of in-links (1 to 10000) for the Normal and Spam classes; δ = 0.35]
      (δ = max. difference in C.D.F. plot)
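The slide defines δ only as the maximum difference in the C.D.F. plot. Reading that as the largest vertical gap between the empirical cumulative distributions of the two classes (the two-sample Kolmogorov-Smirnov statistic) is an assumption; under it, δ could be computed as in this sketch, where the function name is hypothetical.

```python
def max_cdf_difference(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    delta = 0.0
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:   # count values of a that are <= x
            i += 1
        while j < len(b) and b[j] <= x:   # count values of b that are <= x
            j += 1
        delta = max(delta, abs(i / len(a) - j / len(b)))
    return delta
```

A value of 0 means the two classes have the same distribution of the feature, and a value of 1 means their value ranges do not overlap at all.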
  86. Out-degree
      [Figure: Out-degree, distribution of the number of out-links (1 to 100) for the Normal and Spam classes; δ = 0.28]
  87. Edge reciprocity
      [Figure: Reciprocity of max. PR page, distribution of the fraction of reciprocal links (0 to 1) for the Normal and Spam classes; δ = 0.35]
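The reciprocity feature is only named on the slide. One plausible reading, assumed here, is the fraction of a page's out-links that are answered by a link back to that page. The function name, the dictionary input and the handling of self-loops are hypothetical.

```python
def reciprocity(node, out_links):
    """Fraction of the out-links of `node` that are answered by a link back.

    out_links maps a node to the nodes it links to.
    """
    targets = set(out_links.get(node, ()))
    targets.discard(node)            # ignore self-loops (assumption)
    if not targets:
        return 0.0
    back = sum(1 for t in targets if node in out_links.get(t, ()))
    return back / len(targets)
```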
  88. Assortativity
      [Figure: Degree / Degree of neighbors, distribution of the degree/degree ratio of the home page (0.001 to 1000) for the Normal and Spam classes; δ = 0.31]
  89. Variance of PageRank
      Suggested in [Benczúr et al., 2005]
      [Figure: panels labeled PageRank]
  90. Variance of PageRank of in-neighbors
      [Figure: Stdev. of PR of Neighbors (Home), distribution of σ² of the logarithm of PageRank (0 to 1) for the Normal and Spam classes; δ = 0.41]
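As a small illustration of the feature on the horizontal axis above, the sketch below computes σ² of the logarithm of PageRank over the in-neighbors of a page. Using the population variance and returning 0 for pages with fewer than two in-neighbors are assumptions, and the names are hypothetical.

```python
import math


def log_pagerank_variance(node, in_links, pagerank):
    """Variance of log(PageRank) over the in-neighbors of `node`.

    in_links maps a node to the nodes linking to it; pagerank maps a node
    to a strictly positive score.
    """
    values = [math.log(pagerank[u]) for u in in_links.get(node, ())]
    if len(values) < 2:
        return 0.0                   # too few in-neighbors (assumption)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```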
  92. TrustRank
      TrustRank [Gyöngyi et al., 2004]
      A node with high PageRank, but far away from a core set of “trusted nodes” is suspicious
      Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
      Trusted nodes: data from http://www.dmoz.org/
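The random walk described above can be sketched as a power iteration that restarts at the trusted set with probability 1 − α. This is a minimal illustration, not the implementation of the cited paper: the uniform restart over trusted nodes, sending the mass of nodes without out-links back to the trusted set, and the fixed iteration count are all assumptions.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """Random walk that returns to the trusted set with probability 1 - alpha.

    out_links[i] lists the nodes that node i links to (0-based indices);
    trusted is an iterable of trusted node indices.
    """
    n = len(out_links)
    seeds = sorted(set(trusted))
    if not seeds:
        raise ValueError("at least one trusted node is required")
    restart = [0.0] * n
    for s in seeds:
        restart[s] = 1.0 / len(seeds)      # uniform over the trusted set
    trust = restart[:]
    for _ in range(iterations):
        nxt = [(1.0 - alpha) * p for p in restart]
        dangling = 0.0
        for i, targets in enumerate(out_links):
            if targets:
                share = alpha * trust[i] / len(targets)
                for j in targets:
                    nxt[j] += share
            else:
                dangling += alpha * trust[i]   # no out-links: mass goes home
        for s in seeds:
            nxt[s] += dangling / len(seeds)
        trust = nxt
    return trust
```

Pages that cannot be reached from any trusted node keep a trust score of zero, which is what makes a high PageRank combined with a low trust score look suspicious.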
  93. TrustRank Idea
  94. TrustRank score
      [Figure: TrustRank score of home page, distribution of TrustRank (1e−06 to 0.001) for the Normal and Spam classes; δ = 0.59]
  95. TrustRank / PageRank
      [Figure: Estimated relative non-spam mass, distribution of TrustRank score/PageRank (0.3 to 100) for the Normal and Spam classes; δ = 0.59]
  97. Truncated PageRank
      Proposed in [Becchetti et al., 2006b]. Idea: reduce the direct contribution of the first levels of links:
      $$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^{t} & t > T \end{cases}$$
      ✓ No extra reading of the graph after PageRank
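A hedged sketch of how the damping function above could be applied: walk mass is propagated level by level, and a level only contributes to the score once t > T. The slide does not give the constant C; the sketch assumes C = (1 − α)/α^(T+1), which makes the damping values sum to 1 over all t > T. The function name and the adjacency-list input are hypothetical.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, stop=50):
    """Score paths of length t with damping(t) = 0 if t <= T, else C * alpha**t.

    out_links[i] lists the nodes that node i links to (0-based indices).
    """
    n = len(out_links)
    c = (1.0 - alpha) / alpha ** (T + 1)   # assumed normalization constant
    r = [1.0 / n] * n                      # undamped walk mass at level 0
    score = [0.0] * n
    for t in range(1, stop + 1):
        aux = [0.0] * n
        for i, targets in enumerate(out_links):
            for j in targets:
                aux[j] += r[i] / len(targets)
        r = aux
        if t > T:                          # the first T levels are ignored
            for i in range(n):
                score[i] += c * alpha ** t * r[i]
    return score
```

A page supported only by direct in-links, as in a single-level farm, gets no score when T is at least 1, while ordinary PageRank would reward it.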
  98. Truncated PageRank(T=2) / PageRank
      [Figure: TruncatedPageRank T=2 / PageRank, distribution of TruncatedPageRank(T=2) / PageRank (up to 1.6) for the Normal and Spam classes; δ = 0.30]
  99. Max. change of Truncated PageRank
      [Figure: Maximum change of Truncated PageRank, distribution of max(TrPR_{i+1}/TrPR_i) (0.85 to 1.1) for the Normal and Spam classes; δ = 0.29]
  100. Outline: 1 Levels of Link Analysis; 2 Generalizing PageRank; 3 Other Functional Rankings; 4 Web Spam; 5 Web Spam Detection; 6 Topological Web Spam; 7 Direct Counting of Supporters; 8 Spam Detection Results
  102. High and low-ranked pages are different
       [Figure: number of nodes (×10^4) versus distance (1 to 20); series: Top 0%−10%, Top 40%−50%, Top 60%−70%]
       Areas below the curves are equal if we are in the same strongly-connected component
  104. Probabilistic counting
       [Figure: bit vectors attached to pages and to a target page. Labels: Propagation of bits using the “OR” operation; Target page; Count bits set to estimate supporters]
       [Becchetti et al., 2006b] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  107. General algorithm
       Require: N: number of nodes, d: distance, k: bits
       for node : 1 ... N, bit : 1 ... k do
           INIT(node, bit)
       end for
       for distance : 1 ... d do {Iteration step}
           Aux ← 0^k
           for src : 1 ... N do {Follow links in the graph}
               for all links from src to dest do
                   Aux[dest] ← Aux[dest] OR V[src,·]
               end for
           end for
           V ← Aux
       end for
       for node : 1 ... N do {Estimate supporters}
           Supporters[node] ← ESTIMATE(V[node,·])
       end for
       return Supporters
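The propagation loop above can be sketched in Python with one integer per node serving as its k-bit vector. This follows the slide literally, including the reset of Aux to zero in every round, so after d rounds a node holds the bits of the nodes from which a walk of exactly d links reaches it. The function names, the 0-based adjacency lists and the `random_init` helper (each bit set with probability `eps`) are assumptions for illustration.

```python
import random


def propagate_bits(out_links, distance, init):
    """OR-propagate bit masks along links for `distance` rounds.

    out_links[i] lists the nodes that node i links to; init(node) returns
    the starting bit mask of a node as a Python int.
    """
    n = len(out_links)
    v = [init(node) for node in range(n)]
    for _ in range(distance):
        aux = [0] * n                      # Aux <- 0^k
        for src, targets in enumerate(out_links):
            for dest in targets:
                aux[dest] |= v[src]        # Aux[dest] <- Aux[dest] OR V[src]
        v = aux
    return v


def random_init(k, eps, rng):
    """Hypothetical INIT: each of the k bits is set to one with probability eps."""
    def init(_node):
        mask = 0
        for bit in range(k):
            if rng.random() < eps:
                mask |= 1 << bit
        return mask
    return init
```

Giving every node its own distinct bit, as in `lambda node: 1 << node`, turns the loop into an exact but memory-hungry computation, which is a convenient way to check the propagation on small graphs.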
  110. Our estimator
       Initialize all bits to one with probability ε
       Estimator: $\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
       Adaptive estimation
       Repeat the above process for ε = 1/2, 1/4, 1/8, ..., and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
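The estimator above follows from a short argument: if n supporters each set a given bit with independent probability ε, the bit is still zero after the OR with probability (1 − ε)^n, so the fraction of ones is about 1 − (1 − ε)^n, and solving for n gives the logarithm. The Python sketch below evaluates it; the function name and the choice to return infinity when every bit is set are assumptions.

```python
import math


def estimate_supporters(ones, k, eps):
    """Evaluate log base (1 - eps) of (1 - ones / k).

    ones: number of bits set in the node's vector; k: vector length;
    eps: probability with which each bit was initially set.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if not 0 <= ones <= k:
        raise ValueError("ones must lie between 0 and k")
    if ones == k:
        return math.inf        # saturated vector: eps is too large for this node
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)
```

For example, with k = 64 and ε = 1/4, two supporters are expected to leave 64 × (1 − 0.75²) = 28 bits set, and 28 ones map back to an estimate of 2. The threshold (1 − 1/e)k is what the same expression gives when n is close to 1/ε, since (1 − ε)^(1/ε) is approximately 1/e.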
  111. Convergence
       [Figure: fraction of nodes with estimates (0% to 100%) versus iteration (5 to 20); one curve per distance, d=1 to d=8]
  112. Error rate
       [Figure: average relative error (0 to 0.5) versus distance (1 to 8); series: Ours 64 bits, epsilon-only estimator; Ours 64 bits, combined estimator; ANF 24 bits × 24 iterations (576 b×i); ANF 24 bits × 48 iterations (1152 b×i); points are annotated with their cost in bits × iterations, from 512 b×i to 1408 b×i]
  113. Hosts at distance 4
       [Figure: Hosts at Distance Exactly 4, distribution of S4 − S3 (1 to 1000) for the Normal and Spam classes; δ = 0.39]
  114. Minimum change of supporters
       [Figure: Minimum change of supporters, distribution of min(S2/S1, S3/S2, S4/S3) (1 to 10) for the Normal and Spam classes; δ = 0.39]
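The two supporter-based features plotted above can be derived from the estimated supporter counts S1 to S4. A minimal sketch, assuming S_d counts the supporters within distance d (which is what makes S4 − S3 the number at distance exactly 4); the function name and the dictionary keys are hypothetical.

```python
def supporter_features(s):
    """Compute S4 - S3 and min(S2/S1, S3/S2, S4/S3).

    s holds the positive supporter counts S1, S2, S3, S4 in that order.
    """
    if len(s) < 4 or min(s[:4]) <= 0:
        raise ValueError("need four positive supporter counts S1..S4")
    s1, s2, s3, s4 = s[:4]
    return {
        "at_distance_4": s4 - s3,
        "min_change": min(s2 / s1, s3 / s2, s4 / s3),
    }
```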
  115. Outline: 1 Levels of Link Analysis; 2 Generalizing PageRank; 3 Other Functional Rankings; 4 Web Spam; 5 Web Spam Detection; 6 Topological Web Spam; 7 Direct Counting of Supporters; 8 Spam Detection Results
  121. Detection rates
       60% (UK-2006) to 80% (UK-2002) detection rate, with 4% to 2% error rate, by combining different attributes [Becchetti et al., 2006a].
       ✗ No magic bullet in link analysis
       ✗ Precision still low compared to e-mail spam filters
       ✓ Measure both home page and max. PageRank page
       ✓ Host-based counts of neighbors are important
       Next step: combine link analysis and content analysis
  125. Upcoming Web Spam Challenge on UK-2006
       - We asked 20+ volunteers to classify entire hosts
       - We provided several examples
       - Asked to classify normal / borderline / spam
       - Do they agree? Mostly ...
  126. Agreement between humans
  134. 134. Result: first public Web Spam collection. Public spam collection: Web graph with ∼80 million pages; ∼11,000 hosts; labels for ∼4,000 hosts by at least 2 humans each. Upcoming Web Spam challenge: machine learning; information retrieval. webspam-announces-subscribe@yahoogroups.com
  135. 135. Thank you!
  137. 137. Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
    Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national web domains. To appear in ACM TOIT.
    Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
    Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
  138. 138. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006a). Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA.
    Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006b). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
    Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
  139. 139. Boldi, P., Santini, M., and Vigna, S. (2005). PageRank as a function of the damping factor. In Proceedings of the 14th International Conference on World Wide Web, pages 557–566, Chiba, Japan. ACM Press.
    Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
    Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France.
  140. 140. Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
    Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment.
    Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
    Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
  141. 141. Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90, New York, NY, USA. ACM Press.
    Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.