Link-Based Spam Detection (FWS 2006 Barcelona)

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    1 Favorite

    Link-Based Spam Detection (FWS 2006 Barcelona) - Presentation Transcript

    1. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and Using Rank Propagation and Probabilistic R. Baeza-Yates Counting for Link-based Spam Detection Motivation Degree-based measures PageRank Luca Becchetti1 , Carlos Castillo1 ,Debora Donato1 , TrustRank Stefano Leonardi1 and Ricardo Baeza-Yates2 Truncated PageRank 1. Universit` di Roma “La Sapienza” – Rome, Italy a Counting supporters 2. Yahoo! Research – Barcelona, Spain and Santiago, Chile Conclusions May 19th, 2006
    2. Link-Based Spam Detection L. Becchetti, Motivation 1 C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Degree-based measures 2 Motivation PageRank Degree-based 3 measures PageRank TrustRank 4 TrustRank Truncated PageRank Truncated PageRank 5 Counting supporters Conclusions Counting supporters 6 Conclusions 7
    3. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    4. What is on the Web? Link-Based Information Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    5. What is on the Web? Link-Based Information + Porn Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    6. What is on the Web? Link-Based Information + Porn + On-line casinos + Free movies + Spam Detection Cheap software + Buy a MBA diploma + Prescription -free L. Becchetti, C. Castillo, drugs + V!-4-gra + Get rich now now now!!! D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions Graphic: www.milliondollarhomepage.com
    7. Web spam (keywords + links) Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    8. Web spam (mostly keywords) Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    9. Search engine? Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    10. Fake search engine Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    11. Problem: “normal” pages that are spam Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    12. Link farms Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    13. Link farms Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
    14. Motivation Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and [Fetterly et al., 2004] hypothesized that studying the R. Baeza-Yates distribution of statistics about pages could be a good way of Motivation detecting spam pages: Degree-based measures “in a number of these distributions, outlier values are PageRank associated with web spam” TrustRank Truncated PageRank Counting supporters Conclusions
    15. Motivation Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and [Fetterly et al., 2004] hypothesized that studying the R. Baeza-Yates distribution of statistics about pages could be a good way of Motivation detecting spam pages: Degree-based measures “in a number of these distributions, outlier values are PageRank associated with web spam” TrustRank Truncated PageRank Research goal Counting supporters Statistical analysis of link-based spam Conclusions
    16. Metrics Link-Based Graph algorithms Spam Detection L. Becchetti, All shortest paths, centrality, betweenness, clustering coefficient... C. Castillo, D. Donato, Streamed algorithms S. Leonardi and R. Baeza-Yates Breadth-first and depth-first search Motivation Count of neighbors Degree-based Symmetric algorithms measures (Strongly) connected components PageRank TrustRank Approximate count of neighbors Truncated PageRank, Truncated PageRank, Linear Rank PageRank HITS, Salsa, TrustRank Counting supporters Conclusions
    17. Test collection Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and U.K. collection R. Baeza-Yates 18.5 million pages downloaded from the .UK domain Motivation Degree-based 5,344 hosts manually classified (6% of the hosts) measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    18. Test collection Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and U.K. collection R. Baeza-Yates 18.5 million pages downloaded from the .UK domain Motivation Degree-based 5,344 hosts manually classified (6% of the hosts) measures PageRank TrustRank Truncated PageRank Classified entire hosts: Counting V A few hosts are mixed: spam and non-spam pages supporters Conclusions X More coverage: sample covers 32% of the pages
    19. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    20. Degree Link-Based δ = 0.35 In−degree Spam Detection L. Becchetti, Normal C. Castillo, D. Donato, 0.4 Spam S. Leonardi and R. Baeza-Yates Motivation 0.3 Degree-based measures PageRank 0.2 TrustRank Truncated PageRank Counting 0.1 supporters Conclusions 0 1 100 10000 Number of in−links (δ = max. difference in C.D.F. plot)
    21. Degree Link-Based Spam Detection δ = 0.28 Out−degree L. Becchetti, 0.3 C. Castillo, Normal D. Donato, S. Leonardi and Spam R. Baeza-Yates Motivation Degree-based 0.2 measures PageRank TrustRank Truncated PageRank 0.1 Counting supporters Conclusions 0 1 10 50 100 Number of out−links
    22. Edge reciprocity Link-Based Spam Detection δ = 0.35 Reciprocity of max. PR page L. Becchetti, 0.5 C. Castillo, Normal D. Donato, S. Leonardi and Spam R. Baeza-Yates 0.4 Motivation Degree-based measures 0.3 PageRank TrustRank 0.2 Truncated PageRank Counting supporters 0.1 Conclusions 0 0 0.2 0.4 0.6 0.8 1 Fraction of reciprocal links
    23. Assortativity Link-Based Spam Detection δ = 0.31 Degree / Degree of neighbors L. Becchetti, C. Castillo, 0.4 D. Donato, Normal S. Leonardi and Spam R. Baeza-Yates Motivation 0.3 Degree-based measures PageRank 0.2 TrustRank Truncated PageRank Counting 0.1 supporters Conclusions 0 0.001 0.01 0.1 1 10 100 1000 Degree/Degree ratio of home page
    24. Automatic classifier Link-Based Spam Detection L. Becchetti, C. Castillo, All of the following attributes in the home page and the page D. Donato, S. Leonardi and with the maximum PageRank, plus a binary variable R. Baeza-Yates indicating if they are the same page: Motivation - In-degree, out-degree Degree-based measures - Fraction of reciprocal edges PageRank - Degree divided by degree of direct neighbors TrustRank - Average and sum of in-degree of out-neighbors Truncated PageRank - Average and sum of out-degree of in-neighbors Counting supporters Decision tree Conclusions 72.6% of detection rate, with 3.1% false positives
    25. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    26. PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, Let PN×N be the normalized link matrix of a graph D. Donato, S. Leonardi and Row-normalized R. Baeza-Yates No “sinks” Motivation Degree-based Definition (PageRank) measures PageRank Stationary state of: TrustRank (1 − α) Truncated αP + 1N×N PageRank N Counting supporters Conclusions
    27. PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, Let PN×N be the normalized link matrix of a graph D. Donato, S. Leonardi and Row-normalized R. Baeza-Yates No “sinks” Motivation Degree-based Definition (PageRank) measures PageRank Stationary state of: TrustRank (1 − α) Truncated αP + 1N×N PageRank N Counting supporters Conclusions Follow links with probability α Random jump with probability 1 − α
    28. Maximum PageRank in the Host Link-Based Spam Detection δ = 0.23 Maximum PageRank of the site L. Becchetti, C. Castillo, D. Donato, Normal S. Leonardi and Spam R. Baeza-Yates 0.3 Motivation Degree-based measures 0.2 PageRank TrustRank Truncated PageRank 0.1 Counting supporters Conclusions 0 1e−08 1e−07 1e−06 1e−05 0.0001 PageRank
    29. Variance of PageRank Suggested in [Bencz´r et al., 2005] u Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, PageRank PageRank S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    30. Variance of PageRank of in-neighbors Link-Based Spam Detection Stdev. of PR of Neighbors (Home) δ = 0.41 L. Becchetti, C. Castillo, D. Donato, Normal S. Leonardi and Spam R. Baeza-Yates 0.3 Motivation Degree-based measures 0.2 PageRank TrustRank Truncated PageRank 0.1 Counting supporters Conclusions 0 0 0.2 0.4 0.6 0.8 1 σ2 of the logarithm of PageRank
    31. Automatic classifier Link-Based Spam Detection L. Becchetti, Features: degree-based plus the following in the home page C. Castillo, D. Donato, and the page with maximum PageRank: S. Leonardi and R. Baeza-Yates - PageRank - In-degree/PageRank Motivation Degree-based - Out-degree/PageRank measures Standard deviation of PageRank of in-neighbors = σ 2 - PageRank σ 2 /PageRank - TrustRank Plus the PageRank of the home page divided by the Truncated PageRank PageRank of the page with the maximum PageRank. Counting supporters Decision tree Conclusions 74.4% of detection rate, with 2.6% false positives
    32. Automatic classifier Link-Based Spam Detection L. Becchetti, Features: degree-based plus the following in the home page C. Castillo, D. Donato, and the page with maximum PageRank: S. Leonardi and R. Baeza-Yates - PageRank - In-degree/PageRank Motivation Degree-based - Out-degree/PageRank measures Standard deviation of PageRank of in-neighbors = σ 2 - PageRank σ 2 /PageRank - TrustRank Plus the PageRank of the home page divided by the Truncated PageRank PageRank of the page with the maximum PageRank. Counting supporters Decision tree Conclusions 74.4% of detection rate, with 2.6% false positives (Degree-based: 72.6% of detection, 3.1% false positives)
    33. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    34. TrustRank Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates TrustRank [Gy¨ngyi et al., 2004] o Motivation A node with high PageRank, but far away from a core set of Degree-based “trusted nodes” is suspicious measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    35. TrustRank Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates TrustRank [Gy¨ngyi et al., 2004] o Motivation A node with high PageRank, but far away from a core set of Degree-based “trusted nodes” is suspicious measures PageRank Start from a set of trusted nodes, then do a random walk, TrustRank returning to the set of trusted nodes with probability 1 − α at Truncated PageRank each step Counting supporters i Trusted nodes: data from http://www.dmoz.org/ Conclusions
    36. TrustRank score Link-Based Spam Detection δ = 0.59 TrustRank score of home page L. Becchetti, C. Castillo, D. Donato, Normal S. Leonardi and 0.4 Spam R. Baeza-Yates Motivation Degree-based 0.3 measures PageRank TrustRank 0.2 Truncated PageRank Counting 0.1 supporters Conclusions 0 1e−06 0.001 TrustRank
    37. TrustRank / PageRank Link-Based Spam Detection δ = 0.59 Estimated relative non−spam mass L. Becchetti, C. Castillo, D. Donato, Normal 0.8 S. Leonardi and Spam R. Baeza-Yates 0.7 Motivation 0.6 Degree-based measures 0.5 PageRank 0.4 TrustRank Truncated 0.3 PageRank Counting 0.2 supporters Conclusions 0.1 0 0.3 1 10 100 TrustRank score/PageRank
    38. Automatic classifier Link-Based Spam Detection L. Becchetti, C. Castillo, PageRank attributes, plus the following in the home page and D. Donato, S. Leonardi and the page with maximum PageRank: R. Baeza-Yates - TrustScore Motivation - TrustScore/PageRank (estimated relative non-spam mass) Degree-based measures - TrustScore/In-degree PageRank Plus the TrustScore in the home page divided by the TrustRank TrustScore in the page with the maximum PageRank. Truncated PageRank Decision tree Counting supporters 77.3% of detection rate, with 3.0% false positives Conclusions
    39. Automatic classifier Link-Based Spam Detection L. Becchetti, C. Castillo, PageRank attributes, plus the following in the home page and D. Donato, S. Leonardi and the page with maximum PageRank: R. Baeza-Yates - TrustScore Motivation - TrustScore/PageRank (estimated relative non-spam mass) Degree-based measures - TrustScore/In-degree PageRank Plus the TrustScore in the home page divided by the TrustRank TrustScore in the page with the maximum PageRank. Truncated PageRank Decision tree Counting supporters 77.3% of detection rate, with 3.0% false positives Conclusions (PageRank-based: 74.4% of detection, 2.6% false positives)
    40. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    41. Path-based formula for PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, Given a path p = x1 , x2 , . . . , xt of length t = |p| D. Donato, S. Leonardi and R. Baeza-Yates 1 branching(p) = Motivation d1 d2 · · · dt−1 Degree-based measures where di are the out-degrees of the members of the path PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    42. Path-based formula for PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, Given a path p = x1 , x2 , . . . , xt of length t = |p| D. Donato, S. Leonardi and R. Baeza-Yates 1 branching(p) = Motivation d1 d2 · · · dt−1 Degree-based measures where di are the out-degrees of the members of the path PageRank Explicit formula for PageRank [Newman et al., 2001] TrustRank Truncated (1 − α)α|p| PageRank ri (α) = branching(p) Counting N supporters p∈Path(−,i) Conclusions Path(−, i) are incoming paths in node i
    43. General functional ranking Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and In general: R. Baeza-Yates Motivation damping(|p|) ri (α) = branching(p) Degree-based N measures p∈Path(−,i) PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    44. General functional ranking Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and In general: R. Baeza-Yates Motivation damping(|p|) ri (α) = branching(p) Degree-based N measures p∈Path(−,i) PageRank TrustRank Truncated There are many choices for damping(|p|), including simply a PageRank linear function that is as good as PageRank in practice Counting supporters [Baeza-Yates et al., 2006] Conclusions
    45. General functional ranking Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and In general: R. Baeza-Yates Motivation damping(|p|) ri (α) = branching(p) Degree-based N measures p∈Path(−,i) PageRank TrustRank Truncated There are many choices for damping(|p|), including simply a PageRank linear function that is as good as PageRank in practice Counting supporters [Baeza-Yates et al., 2006] Conclusions
    46. Truncated PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Reduce the direct contribution of the first levels of links: S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank t≤T 0 Counting damping(t) = supporters C αt t>T Conclusions
    47. Truncated PageRank Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Reduce the direct contribution of the first levels of links: S. Leonardi and R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank t≤T 0 Counting damping(t) = supporters C αt t>T Conclusions V No extra reading of the graph after PageRank
    48. Truncated PageRank(T=2) / PageRank Link-Based Spam Detection TruncatedPageRank T=2 / PageRank δ = 0.30 L. Becchetti, C. Castillo, D. Donato, Normal S. Leonardi and Spam R. Baeza-Yates 0.3 Motivation Degree-based measures 0.2 PageRank TrustRank Truncated PageRank 0.1 Counting supporters Conclusions 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 TruncatedPageRank(T=2) / PageRank
    49. Max. change of Truncated PageRank Link-Based Spam Detection L. Becchetti, Maximum change of Truncated PageRank δ = 0.29 C. Castillo, D. Donato, S. Leonardi and Normal R. Baeza-Yates Spam 0.2 Motivation Degree-based measures PageRank TrustRank 0.1 Truncated PageRank Counting supporters Conclusions 0 0.85 0.9 0.95 1 1.05 1.1 max(TrPRi+1/TrPri)
    50. Automatic classifier Link-Based PageRank attributes, plus the following in the home page and Spam Detection the page with maximum PageRank: L. Becchetti, C. Castillo, D. Donato, - TruncPageRank(T = 1 . . . 4) S. Leonardi and R. Baeza-Yates - TruncPageRank(T = 4) / TruncPageRank(T = 3) - TruncPageRank(T = 3) / TruncPageRank(T = 2) Motivation - TruncPageRank(T = 2) / TruncPageRank(T = 1) Degree-based measures - TruncPageRank(T = 1 . . . 4) / PageRank PageRank - Minimum, maximum and average of: TrustRank TruncPageRank(T = i)/TruncPageRank(T = i − 1) Truncated PageRank Plus the TruncatedPageRank(T = 1 . . . 4) of the home page Counting divided by the same value in the page with the maximum supporters PageRank. Conclusions Decision tree 76.9% of detection rate, with 2.5% false positives
    51. Automatic classifier Link-Based PageRank attributes, plus the following in the home page and Spam Detection the page with maximum PageRank: L. Becchetti, C. Castillo, D. Donato, - TruncPageRank(T = 1 . . . 4) S. Leonardi and R. Baeza-Yates - TruncPageRank(T = 4) / TruncPageRank(T = 3) - TruncPageRank(T = 3) / TruncPageRank(T = 2) Motivation - TruncPageRank(T = 2) / TruncPageRank(T = 1) Degree-based measures - TruncPageRank(T = 1 . . . 4) / PageRank PageRank - Minimum, maximum and average of: TrustRank TruncPageRank(T = i)/TruncPageRank(T = i − 1) Truncated PageRank Plus the TruncatedPageRank(T = 1 . . . 4) of the home page Counting divided by the same value in the page with the maximum supporters PageRank. Conclusions Decision tree 76.9% of detection rate, with 2.5% false positives (TrustRank-based: 77.3% of detection, 3.0% false positives)
    52. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    53. Idea: count “supporters” at different distances Link-Based Spam Detection L. Becchetti, C. Castillo, Number of different nodes at a given distance: D. Donato, S. Leonardi and R. Baeza-Yates .UK 18 mill. nodes .EU.INT 860,000 nodes Motivation 0.3 0.3 Degree-based measures 0.2 0.2 Frequency Frequency PageRank TrustRank 0.1 0.1 Truncated PageRank 0.0 0.0 Counting 5 10 15 20 25 30 5 10 15 20 25 30 supporters Distance Distance Conclusions Average distance Average distance 14.9 clicks 10.0 clicks
    54. High and low-ranked pages are different Link-Based 4 Spam Detection x 10 Top 0%−10% L. Becchetti, 12 C. Castillo, Top 40%−50% D. Donato, Top 60%−70% S. Leonardi and R. Baeza-Yates 10 Number of Nodes Motivation 8 Degree-based measures 6 PageRank TrustRank 4 Truncated PageRank Counting 2 supporters Conclusions 0 1 5 10 15 20 Distance
    55. High and low-ranked pages are different Link-Based 4 Spam Detection x 10 Top 0%−10% L. Becchetti, 12 C. Castillo, Top 40%−50% D. Donato, Top 60%−70% S. Leonardi and R. Baeza-Yates 10 Number of Nodes Motivation 8 Degree-based measures 6 PageRank TrustRank 4 Truncated PageRank Counting 2 supporters Conclusions 0 1 5 10 15 20 Distance Areas below the curves are equal if we are in the same strongly-connected component
    56. Probabilistic counting Link-Based Spam Detection L. Becchetti, C. Castillo, 1 1 D. Donato, 0 0 S. Leonardi and 0 0 R. Baeza-Yates 0 0 0 1 1 1 1 1 0 0 1 1 0 0 Motivation 0 0 0 0 Propagation of 0 0 1 1 Degree-based bits using the 1 0 1 1 measures “OR” operation 1 0 1 0 PageRank 1 Target 0 Count bits set TrustRank 0 page 0 to estimate 0 0 supporters Truncated 0 0 PageRank 1 1 1 1 0 0 1 1 Counting 0 0 supporters 0 0 1 1 Conclusions 0 0
    57. Probabilistic counting Link-Based Spam Detection L. Becchetti, C. Castillo, 1 1 D. Donato, 0 0 S. Leonardi and 0 0 R. Baeza-Yates 0 0 0 1 1 1 1 1 0 0 1 1 0 0 Motivation 0 0 0 0 Propagation of 0 0 1 1 Degree-based bits using the 1 0 1 1 measures “OR” operation 1 0 1 0 PageRank 1 Target 0 Count bits set TrustRank 0 page 0 to estimate 0 0 supporters Truncated 0 0 PageRank 1 1 1 1 0 0 1 1 Counting 0 0 supporters 0 0 1 1 Conclusions 0 0 Improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
    58. General algorithm Link-Based Require: N: number of nodes, d: distance, k: bits Spam Detection 1: for node : 1 . . . N, bit: 1 . . . k do L. Becchetti, C. Castillo, INIT(node,bit) 2: D. Donato, S. Leonardi and 3: end for R. Baeza-Yates Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    59. General algorithm Link-Based Require: N: number of nodes, d: distance, k: bits Spam Detection 1: for node : 1 . . . N, bit: 1 . . . k do L. Becchetti, C. Castillo, INIT(node,bit) 2: D. Donato, S. Leonardi and 3: end for R. Baeza-Yates 4: for distance : 1 . . . d do {Iteration step} Motivation Aux ← 0k 5: Degree-based for src : 1 . . . N do {Follow links in the graph} measures 6: PageRank for all links from src to dest do 7: TrustRank Aux[dest] ← Aux[dest] OR V[src,·] 8: Truncated end for 9: PageRank end for 10: Counting supporters V ← Aux 11: Conclusions 12: end for
    60. General algorithm Link-Based Require: N: number of nodes, d: distance, k: bits Spam Detection 1: for node : 1 . . . N, bit: 1 . . . k do L. Becchetti, C. Castillo, INIT(node,bit) 2: D. Donato, S. Leonardi and 3: end for R. Baeza-Yates 4: for distance : 1 . . . d do {Iteration step} Motivation Aux ← 0k 5: Degree-based for src : 1 . . . N do {Follow links in the graph} measures 6: PageRank for all links from src to dest do 7: TrustRank Aux[dest] ← Aux[dest] OR V[src,·] 8: Truncated end for 9: PageRank end for 10: Counting supporters V ← Aux 11: Conclusions 12: end for 13: for node: 1 . . . N do {Estimate supporters} Supporters[node] ← ESTIMATE( V[node,·] ) 14: 15: end for 16: return Supporters
    61. Our estimator Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Initialize all bits to one with probability Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    62. Our estimator Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Initialize all bits to one with probability Motivation ones(node) Estimator: neighbors(node) = log(1− ) 1 − k Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    63. Our estimator Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Initialize all bits to one with probability Motivation ones(node) Estimator: neighbors(node) = log(1− ) 1 − k Degree-based measures Adaptive estimation PageRank TrustRank Repeat the above process for = 1/2, 1/4, 1/8, . . . , and look Truncated for the transitions from more than (1 − 1/e)k ones to less PageRank than (1 − 1/e)k ones. Counting supporters Conclusions
    64. Convergence Link-Based Spam Detection L. Becchetti, 100% C. Castillo, D. Donato, 90% S. Leonardi and R. Baeza-Yates 80% Fraction of nodes Motivation 70% with estimates Degree-based 60% measures 50% PageRank d=1 d=2 TrustRank 40% d=3 Truncated 30% d=4 PageRank d=5 20% Counting d=6 supporters d=7 10% Conclusions d=8 0% 5 10 15 20 Iteration
    65. Error rate Link-Based Spam Detection L. Becchetti, C. Castillo, Ours 64 bits, epsilon−only estimator D. Donato, Ours 64 bits, combined estimator S. Leonardi and 0.5 ANF 24 bits × 24 iterations (576 b×i) Average Relative Error R. Baeza-Yates ANF 24 bits × 48 iterations (1152 b×i) Motivation 0.4 Degree-based 960 b×i measures 1216 b×i 512 b×i 832 b×i 1344 b×i 1408 b×i 768 b×i 1152 b×i 0.3 PageRank TrustRank 0.2 Truncated 576 b×i 1152 b×i PageRank 512 b×i 768 b×i Counting 960 b×i 1216 b×i 1344 b×i 1408 b×i 832 b×i 1152 b×i 0.1 supporters Conclusions 0 1 2 3 4 5 6 7 8 Distance
    66. Hosts at distance 4 Link-Based Spam Detection δ = 0.39 Hosts at Distance Exactly 4 L. Becchetti, 0.4 C. Castillo, Normal D. Donato, S. Leonardi and Spam R. Baeza-Yates 0.3 Motivation Degree-based measures PageRank 0.2 TrustRank Truncated PageRank Counting 0.1 supporters Conclusions 0 1 100 1000 S4 − S3
    67. Minimum change of supporters Link-Based Spam Detection δ = 0.39 Minimum change of supporters L. Becchetti, C. Castillo, Normal D. Donato, S. Leonardi and 0.4 Spam R. Baeza-Yates Motivation Degree-based 0.3 measures PageRank TrustRank 0.2 Truncated PageRank Counting supporters 0.1 Conclusions 0 1 5 10 min(S2/S1, S3/S2, S4/S3)
    68. Automatic classifier Link-Based PageRank attributes, plus the following in the home page and Spam Detection the page with maximum PageRank: L. Becchetti, C. Castillo, D. Donato, - Supporters at 2 . . . 4 S. Leonardi and R. Baeza-Yates - Supporters at 2 . . . 4 / PageRank - Supporters at i / Supporters at i − 1 (for i = 1..4) Motivation - Minimum, maximum and average of: Supporters at i / Degree-based measures Supporters at i − 1 (for i = 1..4) PageRank - (Supporters at i - Supporters at i − 1) / PageRank TrustRank Plus the number of supporters at distance 2 . . . 4 in the home Truncated PageRank page divided by the same feature in the page with the Counting maximum PageRank. supporters Conclusions Decision tree 78.9% of detection rate, with 2.5% false positives
    69. Automatic classifier Link-Based PageRank attributes, plus the following in the home page and Spam Detection the page with maximum PageRank: L. Becchetti, C. Castillo, D. Donato, - Supporters at 2 . . . 4 S. Leonardi and R. Baeza-Yates - Supporters at 2 . . . 4 / PageRank - Supporters at i / Supporters at i − 1 (for i = 1..4) Motivation - Minimum, maximum and average of: Supporters at i / Degree-based measures Supporters at i − 1 (for i = 1..4) PageRank - (Supporters at i - Supporters at i − 1) / PageRank TrustRank Plus the number of supporters at distance 2 . . . 4 in the home Truncated PageRank page divided by the same feature in the page with the Counting maximum PageRank. supporters Conclusions Decision tree 78.9% of detection rate, with 2.5% false positives (TruncPR: 76.9% of detection, 2.5% false positives) (TrustRank: 77.7% of detection, 3.0% false positives)
    70. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation 1 Motivation Degree-based measures 2 Degree-based PageRank 3 measures TrustRank 4 PageRank Truncated PageRank 5 TrustRank Counting supporters 6 Truncated PageRank Conclusions 7 Counting supporters Conclusions
    71. Summary of classifiers Detection False Link-Based Spam Detection Metrics rate positives L. Becchetti, Degree (D) 72.6% 3.1% C. Castillo, D. Donato, D + PageRank (P) 74.4% 2.6% S. Leonardi and R. Baeza-Yates D + P + TrustRank 77.7% 3.0% Motivation D + P + Trunc. PageRank 76.9% 2.5% Degree-based D + P + Est. of Supporters 78.9% 2.5% measures All attributes 81.4% 2.8% PageRank All attributes (more rules) 80.8% 1.1% TrustRank Truncated PageRank Counting supporters Conclusions
    72. Summary of classifiers Detection False Link-Based Spam Detection Metrics rate positives L. Becchetti, Degree (D) 72.6% 3.1% C. Castillo, D. Donato, D + PageRank (P) 74.4% 2.6% S. Leonardi and R. Baeza-Yates D + P + TrustRank 77.7% 3.0% Motivation D + P + Trunc. PageRank 76.9% 2.5% Degree-based D + P + Est. of Supporters 78.9% 2.5% measures All attributes 81.4% 2.8% PageRank All attributes (more rules) 80.8% 1.1% TrustRank Truncated PageRank Comparison Counting supporters Content-based analysis [Ntoulas et al., 2006] has shown Conclusions 86.2% detection rate with 2.2% false positives Single-attribute classifier with TrustRank [Gy¨ngyi et al., 2004]: 51.1% detection rate, 3.4% error in o our sample – SpamRank [Bencz´r et al., 2005] reports similar u detection rates
    73. Top 10 metrics Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, 1. Binary variable indicating if homepage is the page with S. Leonardi and R. Baeza-Yates maximum PageRank of the site 2. Edge reciprocity Motivation 3. Different hosts at distance 4 Degree-based measures 4. Different hosts at distance 3 PageRank 5. Minimum change of supporters (different hosts) TrustRank 6. Different hosts at distance 2 Truncated PageRank 7. TruncatedPagerank (T=1) / PageRank Counting 8. TrustRank score divided by PageRank supporters 9. Different hosts at distance 1 Conclusions 10. TruncatedPagerank (T=2) / PageRank
    74. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    75. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    76. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures X Precision still low compared to e-mail spam filters PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    77. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures X Precision still low compared to e-mail spam filters PageRank V Measure both home page and max. PageRank page TrustRank Truncated PageRank Counting supporters Conclusions
    78. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures X Precision still low compared to e-mail spam filters PageRank V Measure both home page and max. PageRank page TrustRank V Truncated Host-based counts are important PageRank Counting supporters Conclusions
    79. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures X Precision still low compared to e-mail spam filters PageRank V Measure both home page and max. PageRank page TrustRank V Truncated Host-based counts are important PageRank Counting supporters Conclusions
    80. Conclusions Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates V Link-based statistics to detect 80% of spam Motivation X No magic bullet in link analysis Degree-based measures X Precision still low compared to e-mail spam filters PageRank V Measure both home page and max. PageRank page TrustRank V Truncated Host-based counts are important PageRank Counting Next step: combine link analysis and content analysis supporters Conclusions
    81. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Degree-based measures Thank you! PageRank TrustRank Truncated PageRank Counting supporters Conclusions
    82. Link-Based Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Spam Detection L. Becchetti, Generalizing pagerank: Damping functions for link-based C. Castillo, ranking algorithms. D. Donato, S. Leonardi and In Proceedings of SIGIR, Seattle, Washington, USA. ACM R. Baeza-Yates Press. Motivation Degree-based Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and measures Baeza-Yates, R. (2006). PageRank Using rank propagation and probabilistic counting for TrustRank link-based spam detection. Truncated PageRank Technical report, DELIS – Dynamically Evolving, Large-Scale Counting Information Systems. supporters Conclusions Bencz´r, A. A., Csalog´ny, K., Sarl´s, T., and Uher, M. u a o (2005). Spamrank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
    83. Link-Based Spam Detection L. Becchetti, Fetterly, D., Manasse, M., and Najork, M. (2004). C. Castillo, D. Donato, Spam, damn spam, and statistics: Using statistical analysis to S. Leonardi and R. Baeza-Yates locate spam web pages. In Proceedings of the seventh workshop on the Web and Motivation databases (WebDB), pages 1–6, Paris, France. Degree-based measures PageRank Flajolet, P. and Martin, N. G. (1985). TrustRank Probabilistic counting algorithms for data base applications. Truncated Journal of Computer and System Sciences, 31(2):182–209. PageRank Counting supporters Gibson, D., Kumar, R., and Tomkins, A. (2005). Conclusions Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
    84. Link-Based Spam Detection Gy¨ngyi, Z., Molina, H. G., and Pedersen, J. (2004). o L. Becchetti, C. Castillo, Combating web spam with trustrank. D. Donato, S. Leonardi and In Proceedings of the Thirtieth International Conference on R. Baeza-Yates Very Large Data Bases (VLDB), pages 576–587, Toronto, Motivation Canada. Morgan Kaufmann. Degree-based measures Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). PageRank Random graphs with arbitrary degree distributions and their TrustRank applications. Truncated PageRank Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2). Counting supporters Ntoulas, A., Najork, M., and Manasse, M. a. (2006). Conclusions Detecting spam web pages through content analysis. In To appear in proceedings of the World Wide Web conference, Edinburgh, Scotland.
    85. Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). Motivation Degree-based ANF: a fast and scalable tool for data mining in massive measures graphs. PageRank In Proceedings of the eighth ACM SIGKDD international TrustRank conference on Knowledge discovery and data mining, pages Truncated PageRank 81–90, New York, NY, USA. ACM Press. Counting supporters Conclusions

    + Carlos CastilloCarlos Castillo, 3 years ago

    custom

    1549 views, 1 favs, 0 embeds more stats

    More info about this document

    CC Attribution License

    Go to text version

    • Total Views 1549
      • 1549 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 1
    • Downloads 81
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories