Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)



Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote spam web pages in the ranking provided by a web search engine.

Cite as:
F. Javier Ortega, Craig Macdonald, José A. Troyano, and Fermín L. Cruz. “Spam Detection with a Content-based Random-Walk Algorithm”. In Proceedings of the Second International Workshop on Search and Mining User-Generated Contents (SMUC), held at the International Conference on Information and Knowledge Management. Toronto, Canada, 2010.



  1. Spam Detection with a Content-based Random-walk Algorithm. F. Javier Ortega, Craig Macdonald, José A. Troyano, Fermín Cruz
  2. Index ♦ Introduction ♦ Related work ♦ Content-based ♦ Link-based ♦ Our Approach ♦ Random-walk algorithm ♦ Content-based metrics ♦ Selection of seeds ♦ Experiments ♦ Future work ♦ References
  3. Introduction ♦ Web Spam: the phenomenon in which web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
  4. Introduction ♦ Self-Promotion: gaining high relevance in a search engine mainly through textual content, e.g. stuffing a web page with a large number of keywords.
  5. Introduction ♦ Mutual-Promotion: gaining a high score by manipulating the out-links and in-links of a web page, e.g. a page with many in-links can be considered relevant by a search engine.
  6. Introduction ♦ Web Spam characteristics: ♦ Textual content: large amounts of invisible content, sets of words with high frequency, many hyperlinks with long anchor texts, very long words, etc. ♦ Link farms: large numbers of pages pointing to one another in order to improve their scores by increasing their in-link counts. ♦ Good pages usually point to good pages. ♦ Spam pages mainly point to other spam pages (link farms); they rarely point to good pages.
  7. Related work: Content-based and Link-based ♦ Content-based techniques classify web pages as spam or not spam according to their textual content. ♦ Heuristics to determine the spam likelihood of a web page: meta tag content, anchor texts, the URL of the page, average length of the words, compression rate, etc. [10, 12] ♦ Inclusion of link-based scores and metrics into a classifier [3] ♦ Link-based techniques exploit the relations between web pages to obtain a ranking of pages ordered by spam likelihood. ♦ Random-walk algorithms that penalize spam-like behaviours: ♦ Do not take into account the nearest neighbours [1] ♦ Take only the scores received from a specific set of good or bad pages [7, 11]
  8. Our Approach ♦ Our approach combines both techniques: ♦ a set of content-based metrics that obtain information from each single web page, and ♦ a link-based algorithm that processes the relations between web pages. ♦ The goal is to obtain a ranking of web pages in which spam pages are demoted according to their spam likelihood.
  9. Our Approach ♦ [diagram: the web pages feed the content-based metrics, which drive the selection of seeds; the random-walk algorithm then runs over the web graph]
  10. Our Approach: random-walk algorithm ♦ We propose a random-walk algorithm that computes two scores for each web page: ♦ PR⁺: relevance of a web page ♦ PR⁻: spam likelihood of a web page ♦ PR⁻(b) changes according to the relation of b with spam-like web pages, and analogously for PR⁺. For a link a → b: the higher PR⁺(a), the higher PR⁺(b); the higher PR⁻(a), the higher PR⁻(b).
  11. Our Approach: random-walk algorithm ♦ Formula: [equation image not preserved in the transcript] ♦ Intuition: a page linked from pages with high PR⁺ receives a higher PR⁺, and a page linked from pages with high PR⁻ receives a higher PR⁻.
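The formula itself is an image that did not survive the transcript, but the two-score intuition can be sketched as a pair of PageRank-style iterations biased towards seed pages. Everything below (function names, the seed dictionaries, the combined final score) is an illustrative assumption, not the paper's exact formulation:

```python
# A minimal sketch of the two-score random walk, assuming a PageRank-style
# update whose restart distribution is biased towards seed pages.

def polarity_walk(graph, pos_seeds, neg_seeds, d=0.85, threshold=0.01):
    """graph: {node: [out-linked nodes]}, with every node present as a key.
    pos_seeds / neg_seeds: {node: a-priori weight}. Returns PR+ minus PR-."""
    nodes = list(graph)
    n = len(nodes)

    def normalize(seeds):
        # A-priori (restart) distribution: the seeds, or uniform if empty.
        total = sum(seeds.values())
        if total == 0:
            return {v: 1.0 / n for v in nodes}
        return {v: seeds.get(v, 0.0) / total for v in nodes}

    e_pos, e_neg = normalize(pos_seeds), normalize(neg_seeds)
    pr_pos, pr_neg = dict(e_pos), dict(e_neg)
    while True:
        # Restart mass goes to the seed distributions, not uniformly.
        new_pos = {v: (1 - d) * e_pos[v] for v in nodes}
        new_neg = {v: (1 - d) * e_neg[v] for v in nodes}
        for u in nodes:
            if not graph[u]:
                continue
            share_pos = d * pr_pos[u] / len(graph[u])
            share_neg = d * pr_neg[u] / len(graph[u])
            for v in graph[u]:
                new_pos[v] += share_pos  # relevance flows along links
                new_neg[v] += share_neg  # spam likelihood flows the same way
        delta = max(abs(new_pos[v] - pr_pos[v]) +
                    abs(new_neg[v] - pr_neg[v]) for v in nodes)
        pr_pos, pr_neg = new_pos, new_neg
        if delta < threshold:
            break
    # Demote pages whose spam score dominates their relevance score.
    return {v: pr_pos[v] - pr_neg[v] for v in nodes}
```

On a toy graph where a trusted seed links to one page and a spam seed links to another, the page reached from the trusted seed ends up above the page reached from the spam seed in the final ranking.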
  12. Our Approach: content-based metrics ♦ Content-based metrics are intended to extract a-priori information from the textual content of the web pages. ♦ Content-based metrics must be: ♦ easy to obtain, to preserve performance; ♦ accurate: precision is preferred over recall.
  13. Our Approach: content-based metrics ♦ Selected metrics: ♦ Compressibility: ratio of the sizes of a web page before and after compression. ♦ Fraction of globally popular words: a web page in which a high fraction of the words are among the most popular words in the entire corpus is likely to be spam. ♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
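As a sketch, the three metrics might be computed along the following lines; zlib compression and plain whitespace tokenization are assumptions of this sketch, since the slides do not specify implementations:

```python
# Illustrative implementations of the three content-based metrics.
import zlib


def compressibility(text):
    """Ratio of compressed to raw size; highly repetitive (spam-like)
    text compresses well, giving a low ratio."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


def popular_word_fraction(text, popular_words):
    """Fraction of the page's words that fall in the corpus-wide
    set of most popular words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in popular_words for w in words) / len(words)


def average_word_length(text):
    """Mean word length; spam pages tend to show much higher values."""
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)
```

A keyword-stuffed page ("spam spam spam ...") scores a far lower compressibility ratio than ordinary prose, which is exactly the signal the metric exploits.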
  14. Our Approach: selection of seeds ♦ Seeds: a set of relevant nodes in terms of spam likelihood (negative seeds) or not-spam likelihood (positive seeds). ♦ The algorithm gives more relevance to the seeds. ♦ Spam-biased algorithm.
  15. Our Approach: selection of seeds ♦ Unsupervised method: content-based metrics as features to choose the seeds. ♦ Pros: ♦ no human intervention is needed; ♦ a larger number of seeds can be considered; ♦ text content is incorporated into a link-based method. ♦ Due to the lack of human intervention: ♦ "false positives" may occur.
  16. Our Approach: selection of seeds ♦ Obtaining an a-priori score for a node a: [equation image not preserved in the transcript] ♦ Selecting seeds: ♦ Pos/Neg Approach ♦ Pos/Neg Metrics Approach ♦ Metric-based Approach
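The exact a-priori formula is again lost with the slide image, but the general idea of unsupervised seed selection can be sketched as follows: combine the content-based metrics into one spam-likelihood score per page, then take the extremes of that ranking as seeds. The combination step and the 5% cutoff below are assumptions of this sketch, not the paper's definitions of the three approaches:

```python
# Hedged sketch of unsupervised seed selection from content-based scores.

def select_seeds(metric_scores, fraction=0.05):
    """metric_scores: {page: spam likelihood in [0, 1]} derived from the
    content-based metrics. Returns (positive_seeds, negative_seeds)."""
    ranked = sorted(metric_scores, key=metric_scores.get)
    k = max(1, int(len(ranked) * fraction))
    positive = ranked[:k]    # lowest spam likelihood -> trusted seeds
    negative = ranked[-k:]   # highest spam likelihood -> spam seeds
    return positive, negative
```

Because no human labels the seeds, some will be wrong (the "false positives" of the previous slide); precision-oriented metrics keep that error rate low.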
  17. Experiments ♦ Dataset: WEBSPAM-UK2006* ♦ ~98 million pages ♦ 11,402 hand-labeled hosts ♦ 7,423 labeled as spam ♦ ~10 million spam web pages ♦ Terrier IR Platform ♦ Random-walk algorithm parameters: ♦ damping factor = 0.85 ♦ threshold = 0.01
* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
  18. Experiments ♦ Evaluation: PR-buckets. Pages are sorted by PageRank and split into 20 buckets of exponentially increasing size (14 top-ranked pages in bucket 1, then 54, 144, 437, 1070, 2130, 2664, 2778, ... up to ~28M pages in the last buckets); consecutive buckets are grouped into PR-buckets, so each PR-bucket covers a band of relevance. [bucket-size table not fully preserved in the transcript]
  19. Experiments ♦ Baseline: TrustRank ♦ Link-based technique. ♦ Seeds chosen in a semi-supervised way: a hand-picked set of good pages, plus the top pages according to an inverse PageRank. ♦ Random-walk algorithm, biased according to the seeds.
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
  20. Experiments ♦ [figure comparing TrustRank with the Pos/Neg, Pos/Neg Metrics, and Metric-based approaches; image not preserved in the transcript]
  21. Experiments ♦ [log-scale plot over PR-buckets 1–10 comparing TrustRank, Pos/Neg, Pos/Neg Metrics, and Metric-based; image not preserved in the transcript]
  22. Conclusions and future work ♦ A novel web spam detection technique that combines concepts from link-based and content-based methods: ♦ content-based metrics as an unsupervised seed-selection method; ♦ a random-walk algorithm that computes two scores for each web page: spam and not-spam likelihood. ♦ Future work: ♦ including new content-based heuristics; ♦ improving the spam-biased selection of the seeds, taking into account the links to/from each node; ♦ content-based metrics to characterize also the edges of the web graph.
  23. References
[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb '06: Adversarial Information Retrieval on the Web, 2006.
[2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.
[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.
[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.
[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
[8] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.
[9] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, New York, NY, USA, 2002. ACM.
[10] P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.
[11] V. Krishnan. Web spam detection with anti-TrustRank. In ACM SIGIR Workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.
[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.
[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
  24. Thanks for your attention! Questions? F. Javier Ortega, Craig Macdonald, José A. Troyano, Fermín Cruz