Using rank propagation and Probabilistic counting for Link-Based Spam Detection

Single-technique classifier
Comparison of single-technique classifier (M=5)
Comparison of single-technique classifier (M=30)
Combined classifier
Summary of classifiers: (a) Precision and recall of spam detection
Conclusions
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). ...
Fetterly, D., Manasse, M., and Najork, M. (20...
Gyöngyi, Z. and Garcia-Molina, H. (2005). ...
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). ...

Using Rank Propagation for Spam Detection (WebKDD 2006)


  1. Using rank propagation and Probabilistic counting for Link-Based Spam Detection. Luca Becchetti (1), Carlos Castillo (1), Debora Donato (1), Stefano Leonardi (1) and Ricardo Baeza-Yates (2). (1) Università di Roma “La Sapienza”, Rome, Italy. (2) Yahoo! Research, Barcelona, Spain and Santiago, Chile. August 20th, 2006
  2. Outline: 1 Motivation, 2 Spam pages characterization, 3 Truncated PageRank, 4 Counting supporters, 5 Experiments, 6 Conclusions
  3. Outline
  4. What is on the Web? Information
  5. What is on the Web? Information + Porn
  6. What is on the Web? Information + Porn + On-line casinos + Free movies + Cheap software + Buy a MBA diploma + Prescription-free drugs + V!-4-gra + Get rich now now now!!! Graphic: www.milliondollarhomepage.com
  7. Web spam (keywords + links)
  8. Web spam (mostly keywords)
  9. Search engine?
  10. Fake search engine
  11. Problem: “normal” pages that are spam
  12. Problem: “normal” pages that are spam. Some content is introduced
  13. Problem: borderline pages
  14. Outline
  15. Definitions
  16. Definitions: Any deliberate action that is meant to trigger an unjustifiably favorable relevance or importance for some Web page, considering the page’s true value [Gyöngyi and Garcia-Molina, 2005]
  17. Definitions: any attempt to deceive a search engine’s relevancy algorithm, or simply anything that would not be done if search engines did not exist [Perkins, 2001]
  18. Link farms
  19. Link farms: single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
  20. Link-based spam: [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: “in a number of these distributions, outlier values are associated with web spam”
  21. Link-based spam. Research goal: statistical analysis of link-based spam
  22. Idea: count “supporters” at different distances. Number of different nodes at a given distance:
      .UK (18 mill. nodes): average distance 14.9 clicks
      .EU.INT (860,000 nodes): average distance 10.0 clicks
      [Figure: two histograms, frequency vs. distance, one per collection]
  23. High and low-ranked pages are different. [Figure: number of nodes (×10^4) vs. distance, for pages in the top 0%-10%, top 40%-50% and top 60%-70%]
  24. High and low-ranked pages are different: areas below the curves are equal if we are in the same strongly-connected component
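The idea of counting supporters at each distance can be sketched exactly with a breadth-first search over in-links. This is only an illustration on a hypothetical toy graph; the function name `supporters_by_distance` and the `in_links` dictionary layout are assumptions, not part of the slides, and exact counting like this does not scale to the collections mentioned above.

```python
from collections import deque

def supporters_by_distance(in_links, target, max_distance):
    """Return [c1, c2, ...] where c_d is the number of distinct nodes whose
    shortest path to `target` has exactly d links (d = 1..max_distance).

    in_links maps a node to the list of nodes that link to it.
    """
    seen = {target}
    frontier = deque([target])
    counts = []
    for _ in range(max_distance):
        next_frontier = deque()
        for node in frontier:
            for src in in_links.get(node, []):
                if src not in seen:          # count each supporter once
                    seen.add(src)
                    next_frontier.append(src)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts

# Hypothetical toy graph: a -> t, b -> t, c -> a, d -> c, d -> b
toy_in_links = {"t": ["a", "b"], "a": ["c"], "c": ["d"], "b": ["d"]}
toy_counts = supporters_by_distance(toy_in_links, "t", 3)
```

On the toy graph, `t` has two supporters at distance 1 (`a`, `b`), two at distance 2 (`c`, `d`) and none at distance 3, because `d` was already reached at distance 2.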
  25. Metrics. Graph algorithms: all shortest paths, centrality, betweenness, clustering coefficient... Streamed algorithms: breadth-first and depth-first search, count of neighbors. Symmetric algorithms: (strongly) connected components, approximate count of neighbors, PageRank, Truncated PageRank, Linear Rank, HITS, Salsa, TrustRank
  26. Outline
  27. General functional ranking. Let P be the row-normalized version of the citation matrix of a graph G = (V, E). A functional ranking [Baeza-Yates et al., 2006] is a link-based ranking algorithm to compute a scoring vector W of the form: $W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^t$
  28. General functional ranking: there are many choices for damping(t), including simply a linear function that is as good as PageRank in practice
  29. General functional ranking: $\mathrm{damping}(t) = (1 - \alpha)\alpha^t$
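The functional ranking sum can be approximated by cutting the series off after a finite number of terms. The sketch below assumes the score vector is obtained by starting from the uniform vector 1/N and multiplying by P once per step; the names `functional_ranking`, `pagerank_damping` and the cut-off `t_max` are illustrative assumptions, and every node is assumed to have at least one out-link.

```python
def functional_ranking(out_links, damping, t_max):
    """Approximate W = sum_t damping(t)/N * P^t, truncated at t = t_max.

    out_links maps each node to the list of nodes it links to; every link
    target must also be a key, and every node needs at least one out-link.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    walk = {v: 1.0 / n for v in nodes}               # term for t = 0
    score = {v: damping(0) * walk[v] for v in nodes}
    for t in range(1, t_max + 1):
        nxt = {v: 0.0 for v in nodes}
        for src in nodes:                            # one multiplication by P
            share = walk[src] / len(out_links[src])
            for dest in out_links[src]:
                nxt[dest] += share
        walk = nxt
        for v in nodes:
            score[v] += damping(t) * walk[v]
    return score

ALPHA = 0.85  # assumed damping factor

def pagerank_damping(t):
    """Exponential damping (1 - alpha) * alpha^t."""
    return (1 - ALPHA) * ALPHA ** t

# Hypothetical toy graph: a 3-cycle, where every node must score 1/3.
toy_cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
cycle_scores = functional_ranking(toy_cycle, pagerank_damping, 200)
```

Passing a different `damping` function (for example a linearly decreasing one) changes the ranking without touching the propagation loop, which is the point of the functional formulation.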
  30. Truncated PageRank. Reduce the direct contribution of the first levels of links: $\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^t & t > T \end{cases}$
  31. Truncated PageRank: no extra reading of the graph after PageRank
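The slides do not define C. If it is taken to be the constant that makes the damping weights sum to one (an assumption), its value follows from a geometric series:

```latex
\sum_{t=T+1}^{\infty} C\,\alpha^{t}
  = C\,\frac{\alpha^{T+1}}{1-\alpha} = 1
\quad\Longrightarrow\quad
C = \frac{1-\alpha}{\alpha^{T+1}}
```

Under that assumption, T = −1 gives C = 1 − α, which recovers the damping (1 − α)α^t, and C/N equals the initial value (1 − α)/(α^(T+1) N) used in the general algorithm.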
  32. General algorithm
      Require: N: number of nodes, 0 < α < 1: damping factor, T ≥ −1: distance for truncation
      for i : 1 . . . N do {Initialization}
        R[i] ← (1 − α)/((α^(T+1))N)
        if T ≥ 0 then
          Score[i] ← 0
        else {Calculate normal PageRank}
          Score[i] ← R[i]
        end if
      end for
  33. General algorithm (continued)
      distance = 1
      while not converged do
        Aux ← 0
        for src : 1 . . . N do {Follow links in the graph}
          for all link from src to dest do
            Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
          end for
        end for
        for i : 1 . . . N do {Apply damping factor α}
          R[i] ← Aux[i] × α
          if distance > T then {Add to ranking value}
            Score[i] ← Score[i] + R[i]
          end if
        end for
        distance = distance + 1
      end while
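A direct transcription of the general algorithm into Python might look like the sketch below. The slide leaves "converged" unspecified, so the stopping rule (largest propagated value below `tol`, or `max_iter` passes) is an assumption, as are the function name and the dictionary-based graph; nodes without out-links are not handled.

```python
def truncated_pagerank(out_links, alpha=0.85, T=-1, tol=1e-12, max_iter=1000):
    """Truncated PageRank; T = -1 gives normal PageRank.

    out_links maps each node to the list of nodes it links to.
    """
    nodes = list(out_links)
    n = len(nodes)
    # Initialization, as on the slide
    r = {v: (1 - alpha) / (alpha ** (T + 1) * n) for v in nodes}
    score = {v: (0.0 if T >= 0 else r[v]) for v in nodes}
    distance = 1
    while distance <= max_iter:
        aux = {v: 0.0 for v in nodes}
        for src in nodes:                     # follow links in the graph
            for dest in out_links[src]:
                aux[dest] += r[src] / len(out_links[src])
        r = {v: alpha * aux[v] for v in nodes}   # apply damping factor
        if distance > T:                      # add to ranking value
            for v in nodes:
                score[v] += r[v]
        if max(r.values()) < tol:             # assumed convergence test
            break
        distance += 1
    return score

# Hypothetical toy graph: b has no in-links.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
pagerank_scores = truncated_pagerank(toy_graph, T=-1)
truncated_scores = truncated_pagerank(toy_graph, T=0)
```

In the toy graph, node `b` keeps only its own starting value under normal PageRank and drops to zero once T ≥ 0, since nothing ever reaches it through links.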
  34. Truncated PageRank vs PageRank. [Figure: two log-log scatter plots of Truncated PageRank (T = 1 and T = 4) against normal PageRank.] Comparing PageRank and Truncated PageRank with T = 1 and T = 4: the correlation is high and decreases as more levels are truncated
  35. Outline
  36. Probabilistic counting. [Figure: bit vectors attached to pages linking to a target page.] Propagation of bits using the “OR” operation; count bits set to estimate supporters
  37. Probabilistic counting: improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  38. General algorithm
      Require: N: number of nodes, d: distance, k: bits
      for node : 1 . . . N, bit : 1 . . . k do
        INIT(node, bit)
      end for
  39. General algorithm (continued)
      for distance : 1 . . . d do {Iteration step}
        Aux ← 0^k
        for src : 1 . . . N do {Follow links in the graph}
          for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src,·]
          end for
        end for
        V ← Aux
      end for
  40. General algorithm (continued)
      for node : 1 . . . N do {Estimate supporters}
        Supporters[node] ← ESTIMATE(V[node,·])
      end for
      return Supporters
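The propagation loop can be sketched with Python integers used as k-bit vectors. INIT is passed in as a function because the slide leaves it abstract; the demo uses one private bit per node (an assumption chosen so the result is deterministic and counting set bits is exact), whereas the actual method sets bits at random and applies ESTIMATE. All names here are illustrative.

```python
def propagate_bits(out_links, d, k, init):
    """Run d rounds of OR-propagation along links.

    out_links maps each node to the nodes it links to.
    init(node, bit) -> bool decides whether a bit starts at one.
    Returns a dict node -> int holding the final k-bit vector.
    """
    nodes = list(out_links)
    v = {u: sum(1 << b for b in range(k) if init(u, b)) for u in nodes}
    for _ in range(d):
        aux = {u: 0 for u in nodes}          # Aux <- 0^k
        for src in nodes:                    # follow links in the graph
            for dest in out_links[src]:
                aux[dest] |= v[src]
        v = aux                              # V <- Aux, as on the slide
    return v

# Hypothetical toy graph: a -> t, b -> t, c -> a
toy_out_links = {"a": ["t"], "b": ["t"], "c": ["a"], "t": []}
node_index = {node: i for i, node in enumerate(toy_out_links)}

def private_bit(node, bit):
    """Demo INIT: node i owns bit i, so set bits count nodes exactly."""
    return bit == node_index[node]

bits_d1 = propagate_bits(toy_out_links, 1, 4, private_bit)
bits_d2 = propagate_bits(toy_out_links, 2, 4, private_bit)
```

Because V is replaced by Aux in every round, the vector at a node after d rounds collects the bits of nodes that reach it in exactly d steps: `t` holds two bits after one round (`a`, `b`) and one bit after two rounds (`c`).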
  41. Our estimator. Initialize all bits to one with probability ε
  42. Our estimator: by the independence of the components X_i we have $P[X_i = 1] = 1 - (1 - \epsilon)^{\mathrm{neighbors}(node)}$
  43. Our estimator: $\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
  44. Our estimator. Problem: neighbors(node) can vary by orders of magnitude as node varies
  45. Our estimator: this means that for some values of ε, the computed value of ones(node) might be k (or 0, depending on neighbors(node)) with relatively high probability
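The estimator is a one-line inversion of the probability formula, which also makes the saturation problem easy to see. In this sketch the function name is an assumption, and returning infinity when every bit is set is an illustrative choice rather than something the slides specify.

```python
import math

def estimate_neighbors(ones, k, eps):
    """Invert P[X_i = 1] = 1 - (1 - eps)^n using the observed fraction ones/k.

    ones: number of bits set at the node, k: total bits, eps: per-bit
    initialization probability (0 < eps < 1).
    """
    if ones >= k:
        return math.inf  # all bits set: the formula gives no finite estimate
    return math.log(1 - ones / k) / math.log(1 - eps)

# With the expected number of ones for 10 neighbors, the estimate is 10.
K_BITS, EPS = 64, 0.1
expected_ones = K_BITS * (1 - (1 - EPS) ** 10)
recovered = estimate_neighbors(expected_ones, K_BITS, EPS)
```

With `ones == 0` the estimate is zero and with `ones == k` it is unbounded, which are the two degenerate outcomes that a badly matched ε makes likely.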
  46. Adaptive estimator. If we knew neighbors(node) and chose ε = 1/neighbors(node), we would get: $\mathrm{ones}(node) \simeq \left(1 - \frac{1}{e}\right) k \simeq 0.63k$. Adaptive estimation: repeat the above process for ε = 1/2, 1/4, 1/8, . . . , and look for the transitions from more than (1 − 1/e)k ones to fewer than (1 − 1/e)k ones
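One simple reading of the adaptive scheme is to scan the rounds in order and report 1/ε at the first round whose count of ones falls below (1 − 1/e)k. The sketch below follows that reading; the function name, the input layout, and the choice to return 1/ε directly are assumptions, and the demo feeds in expected counts rather than simulated random bits.

```python
import math

def adaptive_estimate(ones_per_round, k):
    """ones_per_round[i] is ones(node) observed with eps = 2 ** -(i + 1).

    Returns 1/eps for the first round below the (1 - 1/e) * k threshold,
    or None if no transition is observed in the given rounds.
    """
    threshold = (1 - 1 / math.e) * k
    for i, ones in enumerate(ones_per_round):
        if ones < threshold:
            return 2 ** (i + 1)
    return None

# Expected counts for a hypothetical node with 100 neighbors and k = 64 bits.
K_BITS, TRUE_NEIGHBORS = 64, 100
expected_counts = [
    K_BITS * (1 - (1 - 2.0 ** -(i + 1)) ** TRUE_NEIGHBORS) for i in range(12)
]
rough_estimate = adaptive_estimate(expected_counts, K_BITS)
```

For 100 neighbors the transition falls between ε = 1/64 and ε = 1/128, so this coarse version reports 128, accurate to within a factor of two.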
  47. 47. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions
  48. 48. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions 15 iterations for estimating the neighbors at distance 4 or less
  49. 49. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions 15 iterations for estimating the neighbors at distance 4 or less less than 25 iterations for all distances up to 8.
  50. Error rate
      [Figure: average relative error (y-axis, 0 to 0.5) versus distance (x-axis, 1 to 8) for four series: ours with 64 bits, epsilon-only estimator; ours with 64 bits, combined estimator; ANF with 24 bits × 24 iterations (576 b×i); ANF with 24 bits × 48 iterations (1152 b×i). The points for our estimators are labelled with their cost in bits × iterations, ranging from 512 b×i to 1408 b×i.]
  51. Outline: 1. Motivation; 2. Spam pages characterization; 3. Truncated PageRank; 4. Counting supporters; 5. Experiments; 6. Conclusions
  53. Test collection
      U.K. collection:
      - 18.5 million pages downloaded from the .UK domain
      - 5,344 hosts manually classified (6% of the hosts)
      Classified entire hosts:
      - Advantage: more coverage, since the sample covers 32% of the pages
      - Drawback: a few hosts are mixed, containing both spam and non-spam pages
  56. Automatic classifier
      We extracted, for the home page and for the page with maximum PageRank: PageRank, Truncated PageRank at distances 2 to 4, and Supporters at distances 2 to 4.
      We measured:
      $$\mathrm{Precision} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of hosts classified as spam}}, \qquad \mathrm{Recall} = \frac{\#\text{ of spam hosts classified as spam}}{\#\text{ of spam hosts}},$$
      and the two types of errors in spam classification:
      $$\text{False positive rate} = \frac{\#\text{ of normal hosts classified as spam}}{\#\text{ of normal hosts}}, \qquad \text{False negative rate} = \frac{\#\text{ of spam hosts classified as normal}}{\#\text{ of spam hosts}}.$$
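The four measures above follow directly from the counts of a two-class confusion matrix. The sketch below uses a hypothetical helper, `spam_metrics`, and made-up counts that are not taken from the experiments in this deck.

```python
def spam_metrics(spam_as_spam, spam_as_normal, normal_as_spam, normal_as_normal):
    """Compute the four measures from host counts (true class, predicted class)."""
    classified_spam = spam_as_spam + normal_as_spam
    spam_hosts = spam_as_spam + spam_as_normal
    normal_hosts = normal_as_spam + normal_as_normal
    return {
        "precision": spam_as_spam / classified_spam,
        "recall": spam_as_spam / spam_hosts,
        "false_positive_rate": normal_as_spam / normal_hosts,
        "false_negative_rate": spam_as_normal / spam_hosts,
    }


# Illustrative counts only (not the deck's data): 100 spam hosts, 900 normal hosts.
example = spam_metrics(
    spam_as_spam=80, spam_as_normal=20, normal_as_spam=10, normal_as_normal=890
)
```

Because recall and the false negative rate share the same denominator, they always sum to 1, which is why the result tables report pairs such as a recall of 0.50 with a false negative rate of 50%.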
  57. Single-technique classifiers
      - Classifier based on TrustRank: uses as features the PageRank, the estimated non-spam mass, and the estimated non-spam mass divided by PageRank.
      - Classifier based on Truncated PageRank: uses as features the PageRank, the Truncated PageRank with truncation distance t = 2, 3, 4 (with t = 1 it would be based on in-degree alone), and the Truncated PageRank divided by PageRank.
      - Classifier based on Estimation of Supporters: uses as features the PageRank, the estimation of supporters at a given distance d = 2, 3, 4, and the estimation of supporters divided by PageRank.
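As an illustration of how one of these feature sets could be assembled for a single page, here is a minimal sketch for the Truncated PageRank classifier. The function name, the dictionary keys, and the input scores are all hypothetical; the deck does not specify a data layout.

```python
def truncated_pagerank_features(pagerank, truncated_pagerank):
    """Build the feature set: PageRank, Truncated PageRank at t = 2, 3, 4, and ratios.

    truncated_pagerank maps a truncation distance t to the page's score.
    """
    features = {"pagerank": pagerank}
    for t in (2, 3, 4):
        features[f"trunc_pr_{t}"] = truncated_pagerank[t]
        features[f"trunc_pr_{t}_over_pr"] = truncated_pagerank[t] / pagerank
    return features


# Made-up scores for one page, for illustration only.
page_features = truncated_pagerank_features(0.004, {2: 0.002, 3: 0.001, 4: 0.0005})
```

The same pattern would apply to the supporter-based classifier, with supporter estimates at d = 2, 3, 4 in place of the Truncated PageRank scores.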
  58. Comparison of single-technique classifiers (pruning with M = 5)
      Precision and recall refer to the spam class.

      Classifier              Precision  Recall  False pos.  False neg.
      TrustRank               0.82       0.50    2.1%        50%
      Trunc. PageRank t = 2   0.85       0.50    1.6%        50%
      Trunc. PageRank t = 3   0.84       0.47    1.6%        53%
      Trunc. PageRank t = 4   0.79       0.45    2.2%        55%
      Est. Supporters d = 2   0.78       0.60    3.2%        40%
      Est. Supporters d = 3   0.83       0.64    2.4%        36%
      Est. Supporters d = 4   0.86       0.64    2.0%        36%
  59. Comparison of single-technique classifiers (pruning with M = 30)
      Precision and recall refer to the spam class.

      Classifier              Precision  Recall  False pos.  False neg.
      TrustRank               0.80       0.49    2.3%        51%
      Trunc. PageRank t = 2   0.82       0.43    1.8%        57%
      Trunc. PageRank t = 3   0.81       0.42    1.8%        58%
      Trunc. PageRank t = 4   0.77       0.43    2.4%        57%
      Est. Supporters d = 2   0.76       0.52    3.1%        48%
      Est. Supporters d = 3   0.83       0.57    2.1%        43%
      Est. Supporters d = 4   0.80       0.57    2.6%        43%
  60. Combined classifier
      Precision and recall refer to the spam class.

      Pruning      Rules  Precision  Recall  False pos.  False neg.
      M = 5        49     0.87       0.80    2.0%        20%
      M = 30       31     0.88       0.76    1.8%        24%
      No pruning   189    0.85       0.79    2.6%        21%
  61. Outline: 1. Motivation; 2. Spam pages characterization; 3. Truncated PageRank; 4. Counting supporters; 5. Experiments; 6. Conclusions
  62. Summary of classifiers
      [Figure, two panels: (a) precision and recall of spam detection; (b) error rates of the spam classifiers.]
  69. Conclusions
      ✓ Link-based statistics detect 80% of spam
      ✗ No magic bullet in link analysis
      ✗ Precision still low compared to e-mail spam filters
      ✓ Measure both the home page and the page with maximum PageRank
      ✓ Host-based counts are important
      Next step: combine link analysis and content analysis
  70. Thank you!
  71.
      Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
      Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
      Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). Spamrank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
  72.
      Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
      Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
      Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
  73.
      Gyöngyi, Z. and Garcia-Molina, H. (2005). Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web.
      Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
      Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
  74.
      Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
      Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
      Perkins, A. (2001). The classification of search engine spam. Available online at http://www.silverdisc.co.uk/articles/spam-classification/.
