Link Analysis (RBY)


  1. Link Analysis on the Web: The big picture, the small picture and the medium-sized picture. Ricardo Baeza-Yates (3, 4). Joint work with: L. Becchetti (1), P. Boldi (2), C. Castillo (1, 3), D. Donato (1, 3), S. Leonardi (1), B. Poblete (5). Affiliations: 1. Università di Roma “La Sapienza”, Rome, Italy; 2. Università degli Studi di Milano, Milan, Italy; 3. Yahoo! Research Barcelona, Catalunya, Spain; 4. Yahoo! Research Latin America, Santiago, Chile; 5. Universitat Pompeu Fabra, Catalunya, Spain
  2. Outline: (1) Levels of Link Analysis; (2) Generalizing PageRank; (3) Other Functional Rankings; (4) Web Spam; (5) Web Spam Detection; (6) Topological Web Spam; (7) Direct Counting of Supporters; (8) Spam Detection Results
  7. How to find meaningful patterns? Several levels of analysis: macroscopic view (overall structure); microscopic view (nodes); mesoscopic view (regions)
  8. Macroscopic view, e.g. Bow-tie [Broder et al., 2000]
  9. Macroscopic view, e.g. Bow-tie, migration [Baeza-Yates and Poblete, 2006]
  10. Macroscopic view, e.g. Jellyfish [Tauro et al., 2001]: Internet Autonomous Systems (AS) topology
  11. Macroscopic view, e.g. Jellyfish
  12. Microscopic view, e.g. Degree [Barabási, 2002] and others
  13. Microscopic view, e.g. Degree. Figure panels: Greece, Chile, Spain, Korea. [Baeza-Yates et al., 2006b] compares this distribution in 8 countries . . . guess what the result is?
  16. Mesoscopic view, e.g. Hop-plot. Figure: frequency versus distance in four panels: .it (40M pages), .uk (18M pages), .eu.int (800K pages), synthetic graph (100K pages) [Baeza-Yates et al., 2006a]
  20. Notation. Let $P_{N \times N}$ be the normalized link matrix of a graph: row-normalized, no “sinks”. Definition (PageRank): stationary state of $\alpha P + \frac{(1-\alpha)}{N} \mathbf{1}_{N \times N}$. Follow links with probability $\alpha$; random jump with probability $1 - \alpha$.
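The definition above can be sketched as a power iteration. This is a minimal illustration rather than the authors' implementation: the adjacency-list representation, the function name `pagerank`, the value α = 0.85, the fixed iteration count and the 4-node graph are all assumptions made for the example.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Power iteration for the stationary state of alpha*P + (1-alpha)/N * 1.

    out_links[i] lists the nodes that node i links to. Every node is assumed
    to have at least one out-link (no "sinks"), as the notation requires.
    """
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Random jump: every node receives (1 - alpha)/N of the total mass.
        new_rank = [(1.0 - alpha) / n] * n
        # Follow links: node i splits alpha * rank[i] evenly among its out-links.
        for i, targets in enumerate(out_links):
            share = alpha * rank[i] / len(targets)
            for j in targets:
                new_rank[j] += share
        rank = new_rank
    return rank


# Hypothetical 4-node graph without sinks; node 3 has no in-links.
graph = [[1, 2], [2], [0], [0, 2]]
scores = pagerank(graph, alpha=0.85)
```

Because node 3 receives no links, its score comes from the random jump alone, that is (1 − α)/N.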
  22. Explicit Formulas. Formulas for PageRank [Newman et al., 2001, Boldi et al., 2005]: $r(\alpha) = \frac{(1-\alpha)}{N} \sum_{t=0}^{\infty} (\alpha P)^t$ and $r_i(\alpha) = \sum_{p \in \mathrm{Path}(-,i)} \frac{(1-\alpha)\alpha^{|p|}}{N} \mathrm{branching}(p)$, where $\mathrm{Path}(-,i)$ are the incoming paths in node $i$.
  23. Branching contribution. Definition (Branching contribution of a path): given a path $p = \langle x_1, x_2, \ldots, x_t \rangle$ of length $t = |p|$, $\mathrm{branching}(p) = \frac{1}{d_1 d_2 \cdots d_{t-1}}$, where $d_i$ are the out-degrees of the members of the path. For every node $i$ and every length $t$: $\sum_{p \in \mathrm{Path}(i,-),\, |p| = t} \mathrm{branching}(p) = 1$.
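The identity at the end of the definition can be checked on a toy graph. The sketch below enumerates every path leaving a node and sums the branching contributions. The helper names `paths_from` and `branching`, the 4-node graph, and the choice to measure a path by the number of links it follows are assumptions for illustration.

```python
def paths_from(out_links, start, steps):
    """Yield every path (list of nodes) that starts at `start` and follows `steps` links."""
    if steps == 0:
        yield [start]
        return
    for nxt in out_links[start]:
        for rest in paths_from(out_links, nxt, steps - 1):
            yield [start] + rest


def branching(out_links, path):
    """1 / (product of the out-degrees of every node on the path except the last)."""
    product = 1
    for node in path[:-1]:
        product *= len(out_links[node])
    return 1.0 / product


# Hypothetical 4-node graph without sinks.
graph = [[1, 2], [2], [0], [0, 2]]

# For each start node and each number of steps, the contributions sum to 1.
totals = {
    (node, steps): sum(branching(graph, p) for p in paths_from(graph, node, steps))
    for node in range(len(graph))
    for steps in range(6)
}
```

The sum is 1 because each step splits the current weight evenly among the out-links, so no weight is created or lost as long as there are no sinks.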
  24. Functional ranking. General functional ranking [Baeza-Yates et al., 2006a]: $r_i(\alpha) = \sum_{p \in \mathrm{Path}(-,i)} \frac{\mathrm{damping}(|p|)}{N} \mathrm{branching}(p)$. PageRank is a particular case of path-based ranking.
  26. Exponential damping = PageRank. Figure: weight versus length of the path (t) for damping(t) with α = 0.8 and α = 0.7. Exponential damping: $\mathrm{damping}(t) = (1-\alpha)\alpha^t$. Most of the contribution is on the first few levels.
  27. Linear damping. Figure: weight versus length of the path (t) for damping(t) with L = 15 and L = 10. Linear damping: $\mathrm{damping}(t) = \begin{cases} \frac{2(L-t)}{L(L+1)} & t < L \\ 0 & t \ge L \end{cases}$
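A quick numerical check of the linear damping function, written as a small sketch. The function name `linear_damping` is an assumption. The check confirms that the weights for t = 0 . . . L − 1 add up to 1, which is what the constant 2/(L(L+1)) is there to guarantee.

```python
def linear_damping(t, L):
    """Linear damping: weight 2(L - t) / (L(L + 1)) for t < L, and 0 from t = L on."""
    if t >= L:
        return 0.0
    return 2.0 * (L - t) / (L * (L + 1))


# Total weight over all path lengths for the two values of L shown in the figure.
total_L10 = sum(linear_damping(t, 10) for t in range(10))
total_L15 = sum(linear_damping(t, 15) for t in range(15))
```

The weights decrease by the same amount at every step, in contrast with the geometric decrease of exponential damping.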
  29. Example: Calculating LinearRank. For calculating LinearRank we use: $\mathrm{LinearRank} = \frac{1}{N} \sum_{t=0}^{\infty} \mathrm{damping}(t) P^t = \frac{1}{N} \sum_{t=0}^{L-1} \frac{2(L-t)}{L(L+1)} P^t$. However, we cannot hold the temporary $P^t$ in memory!
  32. Re-write the damping as a recursion. We have to rewrite to be able to calculate: $R^{(0)} = \frac{2}{L+1}$, $R^{(k+1)} = \frac{(L-k-1)}{(L-k)} R^{(k)} P$, and $\mathrm{LinearRank} = \sum_{k=0}^{L-1} R^{(k)}$. Now we can give the algorithm . . .
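The recursion follows from the ratio of consecutive linear damping values. The short derivation below is a sketch. It assumes that $\mathbf{1}$ denotes the all-ones row vector and that $R^{(k)}$ stands for $\mathrm{damping}(k)\,\mathbf{1}P^k$, with the $1/N$ factor left out as in the recursion itself.

```latex
\begin{align*}
\mathrm{damping}(0) &= \frac{2(L-0)}{L(L+1)} = \frac{2}{L+1}, \\
\frac{\mathrm{damping}(k+1)}{\mathrm{damping}(k)}
  &= \frac{2(L-k-1)}{L(L+1)} \cdot \frac{L(L+1)}{2(L-k)}
   = \frac{L-k-1}{L-k}, \qquad 0 \le k < L, \\
R^{(k)} = \mathrm{damping}(k)\,\mathbf{1}P^{k}
  &\;\Longrightarrow\;
  R^{(k+1)} = \frac{L-k-1}{L-k}\,R^{(k)}P .
\end{align*}
```

Each step therefore needs only the previous vector $R^{(k)}$ and one pass over the links, never the matrix power $P^t$.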
  36. Algorithm
      for i : 1 . . . N do {Initialization}
        Score[i] ← R[i] ← 2/(L+1)
      end for
      for k : 0 . . . L − 2 do {Iteration step}
        Aux ← 0
        for i : 1 . . . N do {Follow links in the graph}
          for all j such that there is a link from i to j do
            Aux[j] ← Aux[j] + R[i]/outdegree(i)
          end for
        end for
        for i : 1 . . . N do {Add to ranking value}
          R[i] ← Aux[i] × (L−k−1)/(L−k)
          Score[i] ← Score[i] + R[i]
        end for
      end for
      return Score
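A runnable sketch of the LinearRank computation in Python. It assumes an adjacency-list graph with no sinks, and it lets the loop index k run from 0 to L − 2 so that each step applies the recursion R(k+1) = (L−k−1)/(L−k) · R(k) P directly. The function name and the example graph are hypothetical.

```python
def linear_rank(out_links, L):
    """LinearRank scores via the recursion on R, one pass over the links per step."""
    n = len(out_links)
    r = [2.0 / (L + 1)] * n            # R(0)
    score = list(r)
    for k in range(L - 1):             # builds R(1) ... R(L-1)
        aux = [0.0] * n
        # Follow links: node i gives R[i]/outdegree(i) to each node it links to.
        for i, targets in enumerate(out_links):
            share = r[i] / len(targets)
            for j in targets:
                aux[j] += share
        factor = (L - k - 1) / (L - k)
        r = [a * factor for a in aux]
        # Add this level's contribution to the ranking value.
        score = [s + x for s, x in zip(score, r)]
    return score


# Hypothetical 4-node graph without sinks; node 3 has no in-links.
graph = [[1, 2], [2], [0], [0, 2]]
scores = linear_rank(graph, L=10)
```

With this initialization the scores add up to N instead of 1, because the 1/N factor is left out. Only the vectors `r`, `aux` and `score` are kept in memory.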
  37. Algorithm (general)
      for i : 1 . . . N do {Initialization}
        Score[i] ← R[i] ← INIT
      end for
      for k : 1 . . . STOP do {Iteration step}
        Aux ← 0
        for i : 1 . . . N do {Follow links in the graph}
          for all j such that there is a link from i to j do
            Aux[j] ← Aux[j] + R[i]/outdegree(i)
          end for
        end for
        for i : 1 . . . N do {Add to ranking value}
          R[i] ← Aux[i] × FACTOR
          Score[i] ← Score[i] + R[i]
        end for
      end for
      return Score
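The general algorithm can be sketched with INIT, STOP and FACTOR passed in as parameters. The instantiation shown (INIT = 1 − α, FACTOR = α) is one plausible way to recover exponential damping, derived from damping(0) and the ratio damping(k+1)/damping(k). It is an assumption for this example, as are the function name and the toy graph.

```python
def functional_rank(out_links, init, stop, factor):
    """Generic damping-based ranking; factor(k) is the multiplier applied at step k."""
    n = len(out_links)
    r = [init] * n
    score = list(r)
    for k in range(1, stop + 1):
        aux = [0.0] * n
        # Follow links in the graph.
        for i, targets in enumerate(out_links):
            share = r[i] / len(targets)
            for j in targets:
                aux[j] += share
        # Rescale by this step's factor and add to the ranking value.
        f = factor(k)
        r = [a * f for a in aux]
        score = [s + x for s, x in zip(score, r)]
    return score


# Hypothetical 4-node graph without sinks; node 3 has no in-links.
graph = [[1, 2], [2], [0], [0, 2]]
alpha = 0.85
scores = functional_rank(graph, init=1 - alpha, stop=200, factor=lambda k: alpha)
```

Swapping the `factor` callable changes the damping function without touching the graph traversal.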
  38. Other damping functions. Empirical damping. Figure: average text similarity versus link distance (1 to 5).
  42. Using LinearRank to approximate PageRank. Experimental comparison: 18 million nodes in the U.K. Web graph. Calculated PageRank with α = 0.1, 0.2, . . . , 0.9. Calculated LinearRank with L = 5, 10, . . . , 25. For certain combinations of parameters, the rankings are almost equal!
  43. Experimental comparison in the U.K. Web graph. Figure: τ as a function of L (5 to 25) and α (0.5 to 0.9), with the region τ ≥ 0.95 marked.
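Agreement between two rankings is measured here with τ. The sketch below computes Kendall's τ in its simplest form (tau-a, where tied pairs count as neither concordant nor discordant). Which variant the experiments used is not stated, so this choice is an assumption. The quadratic pair loop is meant for small examples only and would not scale to an 18-million-node graph.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists defined over the same nodes."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1      # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1      # the rankings disagree on this pair
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores for four nodes under two ranking functions.
same_order = kendall_tau([0.1, 0.2, 0.3, 0.4], [1, 2, 3, 4])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

A value of 1 means the two rankings order every pair identically, and a value of -1 means one ranking is the reverse of the other.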
  44. Prediction of best parameter combinations (analysis). Figure: L that maximizes Kendall’s τ versus exponent α (0.5 to 0.9); series: actual optimum, predicted optimum with length = 5.
  48. What is on the Web? Information + Porn + On-line casinos + Free movies + Cheap software + Buy a MBA diploma + Prescription-free drugs + V!-4-gra + Get rich now now now!!! Graphic: www.milliondollarhomepage.com
  57. Opportunities for Web spam. Spamdexing. Keyword stuffing. Link farms. Scraper, “Made for Advertising” sites. Spam blogs (splogs). Cloaking. Click spam. Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
  58. Typical Web Spam (1)
  59. Typical Web Spam (2)
  60. Hidden text
  61. Made for Advertising (1)
  62. Made for Advertising (2)
  63. Made for Advertising (3)
  64. Search engine?
  65. Fake search engine
  66. Problem: “normal” pages that are spam
  68. Machine Learning
  69. Machine Learning (cont.)
  70. Feature Extraction
  73. Challenges: Machine Learning. Machine Learning Challenges: learning with interdependent variables (graph); learning with few examples; scalability
  78. Challenges: Information Retrieval. Information Retrieval Challenges: feature extraction (which features?); feature aggregation (page/host/domain); feature propagation (graph); recall/precision tradeoffs; scalability
  81. Topological spam: link farms. Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
  82. Motivation. [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: “in a number of these distributions, outlier values are associated with web spam”
  84. Test collection. U.K. collection: 18.5 million pages downloaded from the .UK domain; 5,344 hosts manually classified (6% of the hosts). Classified entire hosts: a few hosts are mixed (spam and non-spam pages); more coverage (sample covers 32% of the pages).
  85. In-degree. Figure: In-degree, Normal vs. Spam; x-axis: number of in-links (1 to 10000); δ = 0.35 (δ = max. difference in C.D.F. plot)
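The statistic δ, the maximum difference between the two cumulative distribution functions, can be computed as in the sketch below. The function name and the in-degree samples are made up for illustration and are not taken from the U.K. collection.

```python
from bisect import bisect_right


def max_cdf_difference(sample_a, sample_b):
    """Largest absolute gap between the empirical C.D.F.s of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    best = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of sample_b that is <= x
        best = max(best, abs(cdf_a - cdf_b))
    return best


# Hypothetical in-degree samples for the two classes.
normal_in_degree = [1, 2, 3, 4]
spam_in_degree = [3, 4, 5, 6]
delta = max_cdf_difference(normal_in_degree, spam_in_degree)
```

A larger δ means the feature separates the two classes better. Identical samples give δ = 0 and samples with no overlap give δ = 1.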
  86. Out-degree. Figure: Out-degree, Normal vs. Spam; x-axis: number of out-links (1 to 100); δ = 0.28
  87. Edge reciprocity. Figure: Reciprocity of max. PR page, Normal vs. Spam; x-axis: fraction of reciprocal links (0 to 1); δ = 0.35
  88. Assortativity. Figure: Degree / Degree of neighbors, Normal vs. Spam; x-axis: degree/degree ratio of home page (0.001 to 1000); δ = 0.31
  89. Variance of PageRank. Suggested in [Benczúr et al., 2005]
  90. Variance of PageRank of in-neighbors. Figure: Stdev. of PR of Neighbors (Home), Normal vs. Spam; x-axis: σ² of the logarithm of PageRank (0 to 1); δ = 0.41
  91. 91. Link Analysis on TrustRank the Web Levels of Link Analysis Generalizing PageRank Other TrustRank [Gy¨ngyi et al., 2004] o Functional Rankings A node with high PageRank, but far away from a core set of Web Spam “trusted nodes” is suspicious Web Spam Detection Topological Web Spam Direct Counting of Supporters Spam Detection Results
  92. 92. TrustRank
      TrustRank [Gyöngyi et al., 2004]: a node with high PageRank, but far away from a core set of "trusted nodes", is suspicious.
      Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step.
      Trusted nodes: data from http://www.dmoz.org/
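The random walk described above can be sketched as a power iteration in which the walker jumps back to the trusted set with probability 1 − α. This is an illustrative sketch only: the function name, the uniform distribution over trusted nodes, the value α = 0.85, and the choice to drop the mass of pages without out-links are all assumptions, not details from the slides.

```python
def trustrank(out_links, trusted, alpha=0.85, iterations=50):
    """out_links[v] lists the pages that v links to; trusted is a set of node ids."""
    n = len(out_links)
    # The walk restarts uniformly over the trusted set (assumption).
    seed = [0.0] * n
    for v in trusted:
        seed[v] = 1.0 / len(trusted)
    score = seed[:]
    for _ in range(iterations):
        # With probability 1 - alpha, return to the trusted nodes.
        nxt = [(1 - alpha) * s for s in seed]
        # With probability alpha, follow a random out-link.
        for src, dests in enumerate(out_links):
            if dests:
                share = alpha * score[src] / len(dests)
                for dest in dests:
                    nxt[dest] += share
        score = nxt
    return score


# Node 0 is trusted and links to 1; nodes 2 and 3 only link to each other.
scores = trustrank([[1], [], [3], [2]], trusted={0})
```

In this toy graph the isolated pair (2, 3) receives no trust at all, while it would still receive some ordinary PageRank from uniform random jumps.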
  93. 93. TrustRank Idea
      [Figure]
  94. 94. TrustRank score
      [Figure: "TrustRank score of home page", distribution of TrustRank (log scale, 1e−06 to 0.001) for Normal vs. Spam; δ = 0.59]
  95. 95. TrustRank / PageRank
      [Figure: "Estimated relative non-spam mass", distribution of TrustRank score/PageRank (log scale, 0.3 to 100) for Normal vs. Spam; δ = 0.59]
  97. 97. Truncated PageRank
      Proposed in [Becchetti et al., 2006b]. Idea: reduce the direct contribution of the first levels of links:
      $\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^{t} & t > T \end{cases}$
      ✓ No extra reading of the graph after PageRank
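A minimal sketch of how this damping function could be applied: accumulate, for each path length t, the probability mass of uniformly started walks of length t, weighted by damping(t). The slide does not define C; here it is assumed to be the constant that makes the damping values sum to 1, C = (1 − α)/α^(T+1). The function name, the truncation at `max_t`, and dropping the mass of pages without out-links are also assumptions.

```python
def truncated_pagerank(out_links, alpha=0.85, T=2, max_t=50):
    """out_links[v] lists the pages that v links to."""
    n = len(out_links)
    # Assumed normalization: sum over t > T of C * alpha**t equals 1.
    C = (1 - alpha) / alpha ** (T + 1)
    # walk[v]: mass of length-t walks ending at v, starting from a uniform vector.
    walk = [1.0 / n] * n
    score = [0.0] * n
    for t in range(max_t + 1):
        if t > T:  # damping(t) is zero for the first T levels of links
            for v in range(n):
                score[v] += C * alpha ** t * walk[v]
        nxt = [0.0] * n
        for src, dests in enumerate(out_links):
            if dests:
                share = walk[src] / len(dests)
                for dest in dests:
                    nxt[dest] += share
        walk = nxt
    return score


# Page 0 is supported only by direct links from pages 1 and 2.
with_direct = truncated_pagerank([[], [0], [0]], T=0)
without_direct = truncated_pagerank([[], [0], [0]], T=1)
```

With T = 1 the score of page 0 drops to zero in this toy graph, because all of its support comes from the first level of links.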
  98. 98. Truncated PageRank(T=2) / PageRank
      [Figure: "TruncatedPageRank T=2 / PageRank", distribution of TruncatedPageRank(T=2) / PageRank (0.2 to 1.6) for Normal vs. Spam; δ = 0.30]
  99. 99. Max. change of Truncated PageRank
      [Figure: "Maximum change of Truncated PageRank", distribution of max(TrPR_{i+1}/TrPR_i) (0.85 to 1.1) for Normal vs. Spam; δ = 0.29]
  102. 102. High and low-ranked pages are different
      [Figure: number of nodes (×10^4) vs. distance (1 to 20), one curve each for pages in the Top 0%−10%, Top 40%−50% and Top 60%−70%]
      Areas below the curves are equal if we are in the same strongly-connected component.
  104. 104. Probabilistic counting
      [Figure: bit vectors of the pages linking to a target page; propagation of bits using the "OR" operation; count bits set to estimate supporters]
      [Becchetti et al., 2006b] shows an improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985].
  107. 107. General algorithm
      Require: N: number of nodes, d: distance, k: bits
      for node : 1 … N, bit : 1 … k do
          INIT(node, bit)
      end for
      for distance : 1 … d do {Iteration step}
          Aux ← 0_k
          for src : 1 … N do {Follow links in the graph}
              for all links from src to dest do
                  Aux[dest] ← Aux[dest] OR V[src,·]
              end for
          end for
          V ← Aux
      end for
      for node : 1 … N do {Estimate supporters}
          Supporters[node] ← ESTIMATE(V[node,·])
      end for
      return Supporters
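The propagation loop of the pseudocode above can be sketched in Python using integers as bit vectors. To keep the example deterministic, INIT is replaced here by giving each node its own bit, so that counting ones is exact; in the slides INIT is random and ESTIMATE is a probabilistic estimator. The function names are illustrative. Note that, as transcribed, Aux starts from zero at each step, so a node keeps its own bits only if some link (for instance a self-loop, which is an assumption added in the usage below) carries them forward.

```python
def propagate_bits(links, init_bits, d):
    """links: list of (src, dest) pairs; init_bits: one integer bit mask per node."""
    v = list(init_bits)
    for _ in range(d):              # iteration step
        aux = [0] * len(v)          # Aux <- 0
        for src, dest in links:     # follow links in the graph
            aux[dest] |= v[src]     # Aux[dest] <- Aux[dest] OR V[src]
        v = aux                     # V <- Aux
    return v


def count_ones(mask):
    """Number of bits set in a bit mask."""
    return bin(mask).count("1")


links = [(0, 2), (1, 2), (2, 3)]
init = [1 << node for node in range(4)]  # one distinct bit per node (illustrative INIT)

# Without self-loops: bits of nodes with a walk of exactly d links to the target.
exact = propagate_bits(links, init, d=2)
# With self-loops added: bits of all nodes within distance d, including the target.
within = propagate_bits(links + [(i, i) for i in range(4)], init, d=2)
```

Each iteration is one sequential pass over the links, which is what makes the approach practical for graphs that do not fit in memory.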
  110. 110. Our estimator
      Initialize all bits to one with probability ε.
      Estimator: $\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
      Adaptive estimation: repeat the above process for ε = 1/2, 1/4, 1/8, …, and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
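The estimator follows from a short argument: if a node has n supporters and every bit is set independently with probability ε, a given bit stays zero after the OR with probability (1 − ε)^n, so the expected fraction of ones is 1 − (1 − ε)^n, and solving for n gives the formula above. A minimal sketch, with illustrative function names and an assumed convention of returning infinity when all bits are set:

```python
import math
import random


def init_bits(k, eps, rng):
    """Return a k-bit mask in which each bit is one with probability eps."""
    mask = 0
    for bit in range(k):
        if rng.random() < eps:
            mask |= 1 << bit
    return mask


def estimate_supporters(ones, k, eps):
    """Invert ones/k = 1 - (1 - eps)**n to recover n; requires 0 < eps < 1."""
    if ones >= k:
        return math.inf  # saturated vector: the estimate is unbounded
    return math.log(1 - ones / k) / math.log(1 - eps)


rng = random.Random(0)
mask = init_bits(64, 0.5, rng)
# 48 of 64 bits set with eps = 1/2 corresponds to two supporters.
n = estimate_supporters(48, 64, 0.5)
```

The saturation case is one way to see why several values of ε are tried: a large ε fills the vector quickly for well-supported nodes, while a small ε leaves too few ones for poorly supported nodes.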
  111. 111. Convergence
      [Figure: fraction of nodes with estimates (0% to 100%) vs. iteration (5 to 20), one curve per distance d = 1, …, 8]
  112. 112. Error rate
      [Figure: average relative error (0 to 0.5) vs. distance (1 to 8) for four methods: Ours 64 bits, epsilon-only estimator; Ours 64 bits, combined estimator; ANF 24 bits × 24 iterations (576 b×i); ANF 24 bits × 48 iterations (1152 b×i). Points are annotated with their cost in bits × iterations, from 512 b×i to 1408 b×i]
  113. 113. Hosts at distance 4
      [Figure: "Hosts at Distance Exactly 4", distribution of S4 − S3 (log scale, 1 to 1000) for Normal vs. Spam; δ = 0.39]
  114. 114. Minimum change of supporters
      [Figure: "Minimum change of supporters", distribution of min(S2/S1, S3/S2, S4/S3) (1 to 10) for Normal vs. Spam; δ = 0.39]
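The two supporter-based features plotted on these slides can be derived from the estimated counts S1 to S4. The sketch below assumes that S_d is the number of supporters within distance d, which is what the "exactly 4" reading of S4 − S3 suggests; the function name and the sample counts are hypothetical.

```python
def supporter_features(s):
    """s = [S1, S2, S3, S4]: supporters within distance 1..4 (all positive)."""
    at_distance_4 = s[3] - s[2]                               # S4 - S3
    min_change = min(s[i + 1] / s[i] for i in range(len(s) - 1))  # min(S2/S1, S3/S2, S4/S3)
    return at_distance_4, min_change


# Hypothetical counts for one host.
features = supporter_features([10, 20, 80, 100])
```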
  121. 121. Detection rates
      60% (UK-2006) to 80% (UK-2002) detection rate, with 4% to 2% error rate, by combining different attributes [Becchetti et al., 2006a].
      ✗ No magic bullet in link analysis
      ✗ Precision still low compared to e-mail spam filters
      ✓ Measure both home page and max. PageRank page
      ✓ Host-based counts of neighbors are important
      Next step: combine link analysis and content analysis
  125. 125. Upcoming Web Spam Challenge on UK-2006
      We asked 20+ volunteers to classify entire hosts
      We provided several examples
      Asked to classify normal / borderline / spam
      Do they agree? Mostly ...
  126. 126. Agreement between humans
      [Figure]
  134. 134. Result: first public Web Spam collection
      Public spam collection
          Web graph with ∼80 million pages
          ∼11,000 hosts
          Labels for ∼4,000 hosts by at least 2 humans each
      Upcoming Web Spam challenge
          Machine learning
          Information retrieval
      webspam-announces-subscribe@yahoogroups.com
  135. 135. Thank you!
  137. 137. References
      Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
      Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national web domains. To appear in ACM TOIT.
      Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
      Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
  138. 138. References (cont.)
      Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006a). Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA.
      Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006b). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
      Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
  139. 139. References (cont.)
      Boldi, P., Santini, M., and Vigna, S. (2005). PageRank as a function of the damping factor. In Proceedings of the 14th international conference on World Wide Web, pages 557–566, Chiba, Japan. ACM Press.
      Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
      Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
  140. 140. References (cont.)
      Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
      Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
      Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
      Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
  141. 141. References (cont.)
      Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
      Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.
