Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)



Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote the spam web pages in the ranking provided by a web search engine.

Cite as:
F. Javier Ortega, Craig Macdonald, José A. Troyano, and Fermín L. Cruz. "Spam Detection with a Content-based Random-Walk Algorithm". In Proceedings of the Second International Workshop on Search and Mining User-Generated Contents (SMUC), at the International Conference on Information and Knowledge Management. Toronto, Canada, 2010.




  • 1. Spam detection with a content-based random-walk algorithm. F. Javier Ortega, Craig Macdonald, José A. Troyano, Fermín Cruz
  • 2. Index
    ♦ Introduction
    ♦ Related work
      ♦ Content-based
      ♦ Link-based
    ♦ Our Approach
      ♦ Random-walk algorithm
      ♦ Content-based metrics
      ♦ Selection of seeds
    ♦ Experiments
    ♦ Future work
    ♦ References
  • 3. Introduction
    ♦ Web Spam: phenomenon where a number of web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
  • 4. Introduction
    ♦ Self-Promotion: gaining high relevance for a search engine, mainly based on the textual content, e.g. by including a number of keywords in the web page.
  • 5. Introduction
    ♦ Mutual-Promotion: gaining a high score by focusing the attention on the out-links and in-links of a web page, e.g. a web page with lots of in-links can be considered relevant by a search engine.
  • 6. Introduction
    ♦ Web Spam characteristics:
      ♦ Textual content: large amount of invisible content, a set of words with high frequency, lots of hyperlinks with large anchor texts, very long words, etc.
      ♦ Link-farms: large number of pages pointing to one another, in order to improve their scores by increasing the number of in-links to them.
      ♦ Good pages usually point to good pages.
      ♦ Spam pages mainly point to other spam pages (link-farms). They rarely point to good pages.
  • 7. Related work
    ♦ Content-based techniques classify the web pages as spam or not-spam according to their textual content.
      ♦ Heuristics to determine the spam likelihood of a web page: meta tag content, anchor texts, URL of the page, average length of the words, compression rate, etc. [10, 12]
      ♦ Inclusion of link-based scores and metrics into a classifier [3]
    ♦ Link-based techniques exploit the relations between web pages to obtain a rank of pages, ordered according to their spam likelihood.
      ♦ Random-walk algorithms that penalize spam-like behaviors:
        ♦ Don't take into account the nearest neighbours [1]
        ♦ Take only the scores received from a specific set of good or bad pages [7, 11]
  • 8. Our Approach
    ♦ Our approach combines both techniques:
      ♦ A set of content-based metrics that obtains information from each single web page.
      ♦ A link-based algorithm that processes the relations between web pages.
    ♦ The goal is to obtain a ranking of web pages in which spam web pages are demoted according to their spam likelihood.
  • 9. Our Approach
    [Diagram: web pages feed the content-based metrics, which drive the selection of seeds; the random-walk algorithm then runs over the web graph.]
  • 10. Our Approach: random-walk algorithm
    ♦ We propose a random-walk algorithm that computes two scores for each web page:
      ♦ PR⁺: relevance of a web page
      ♦ PR⁻: spam likelihood of a web page
    ♦ PR⁻(b) changes according to the relation of b with spam-like web pages; analogous for PR⁺. For a page a linking to a page b:
      The higher PR⁺(a), the higher PR⁺(b).
      The higher PR⁻(a), the higher PR⁻(b).
  • 11. Our Approach: random-walk algorithm
    ♦ Formula: [shown as an image in the original slide]
    ♦ Intuition: a page pointed to by pages with high PR⁺ gets a higher PR⁺; a page pointed to by pages with high PR⁻ gets a higher PR⁻.
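The propagation rule on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes a PageRank-style update for both scores, biased by a-priori (seed) vectors, with the function name, signature, and teleportation scheme being my own choices. The slide's formula image is not reproduced here.

```python
# Sketch of a two-score random walk: PR+ (relevance) and PR- (spam likelihood)
# are propagated over the same link graph, each biased by its own a-priori
# vector. Damping factor and convergence threshold follow the values reported
# later in the deck (0.85 and 0.01).

def two_score_pagerank(out_links, pos_prior, neg_prior, d=0.85,
                       threshold=0.01, max_iter=100):
    """out_links: dict node -> list of successors.
    pos_prior / neg_prior: a-priori relevance / spam scores per node."""
    nodes = list(out_links)
    pos = {v: 1.0 / len(nodes) for v in nodes}
    neg = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        new_pos = {v: (1 - d) * pos_prior[v] for v in nodes}
        new_neg = {v: (1 - d) * neg_prior[v] for v in nodes}
        for v in nodes:
            succ = out_links[v]
            for w in succ:
                # the higher PR+(v), the higher PR+(w); likewise for PR-
                new_pos[w] += d * pos[v] / len(succ)
                new_neg[w] += d * neg[v] / len(succ)
        diff = sum(abs(new_pos[v] - pos[v]) + abs(new_neg[v] - neg[v])
                   for v in nodes)
        pos, neg = new_pos, new_neg
        if diff < threshold:
            break
    # combined ranking: promote relevant pages, demote spam-like ones
    return {v: pos[v] - neg[v] for v in nodes}
```

On a toy graph where page "a" is a positive seed and "c" a negative seed, pages linked from good pages end up ranked above the spam-like page, which is the demotion effect the deck describes.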
  • 12. Our Approach: content-based metrics
    ♦ Content-based metrics are intended to extract some a-priori information from the textual content of the web pages.
    ♦ Content-based metrics must be:
      ♦ Easy to obtain, to preserve performance.
      ♦ Accurate: precision is preferred over recall.
  • 13. Our Approach: content-based metrics
    ♦ Selected metrics:
      ♦ Compressibility: ratio of the sizes of a web page before and after being compressed.
      ♦ Fraction of globally popular words: a web page with a high fraction of words among the most popular words in the entire corpus is likely to be spam.
      ♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
  • 14. Our Approach: selection of seeds
    ♦ Seeds: set of relevant nodes, in terms of spam (negative seeds) or not-spam likelihood (positive seeds).
    ♦ The algorithm gives more relevance to the seeds.
    ♦ Spam-biased algorithm.
  • 15. Our Approach: selection of seeds
    ♦ Unsupervised method: content-based metrics as features to choose the seeds.
    ♦ Pros:
      ♦ Human intervention is not needed.
      ♦ A larger number of seeds can be considered.
      ♦ Inclusion of text content into a link-based method.
    ♦ Due to the lack of human intervention: "false positives".
  • 16. Our Approach: selection of seeds
    ♦ Obtaining the a-priori score for a node a: [formula shown as an image in the original slide]
    ♦ Selecting seeds:
      ♦ Pos/Neg Approach
      ♦ Pos/Neg Metrics Approach
      ♦ Metric-based Approach
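The a-priori score formulas on this slide exist only as images in the scrape, so the following is a hedged sketch under an assumption: each page already has a combined spam-likelihood score derived from the content metrics, and, in the spirit of the Pos/Neg approach named above, the pages at the two extremes of that score become the positive and negative seeds. The function name and signature are my own.

```python
# Sketch of unsupervised seed selection: rank pages by a combined
# content-metric spam score and take the extremes as seeds.

def select_seeds(spam_score, k):
    """spam_score: dict page -> score (higher = more spam-like).
    Returns (positive_seeds, negative_seeds), k pages each."""
    ranked = sorted(spam_score, key=spam_score.get)
    # lowest-scoring pages look least spam-like -> positive seeds;
    # highest-scoring pages -> negative seeds
    return ranked[:k], ranked[-k:]
```

Because no human labeling is involved, some seeds will be false positives, as the previous slide notes; the random walk tolerates this since seeds only bias, rather than determine, the final scores.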
  • 17. Experiments
    ♦ Dataset: WEBSPAM-UK2006*
      ♦ ~98 million pages
      ♦ 11,402 hand-labeled hosts
        ♦ 7,423 labeled as spam
      ♦ ~10 million spam web pages
    ♦ Terrier IR Platform
    ♦ Random-walk algorithm parameters:
      ♦ Damping factor = 0.85
      ♦ Threshold = 0.01
    * C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
  • 18. Experiments
    ♦ Evaluation: PR-buckets. Pages are ordered by PageRank relevance into 20 buckets of increasing size, and consecutive buckets are merged into PR-buckets:

      Bucket:      1   2   3    4    5     6     7     8    ...  17   18   19   20
      Total pages: 14  54  144  437  1070  2130  2664  2778  ...  16M  28M  28M  28M

      (Buckets 1–2 form PR-bucket 1, bucket 3 PR-bucket 2, bucket 4 PR-bucket 3, buckets 5–6 PR-bucket 4, and so on, up to the total PageRank mass.)
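The bucket construction above can be sketched under one assumption (mine, not stated explicitly on the slide): pages sorted by decreasing PageRank are grouped so that each bucket accumulates roughly an equal share of the total PageRank mass, which is why bucket sizes grow exponentially from 14 pages up to tens of millions.

```python
# Hypothetical PR-bucket construction: equal PageRank mass per bucket.

def pr_buckets(pagerank, n_buckets):
    """pagerank: dict page -> score. Returns a list of page lists,
    ordered from the highest-PageRank bucket to the lowest."""
    pages = sorted(pagerank, key=pagerank.get, reverse=True)
    target = sum(pagerank.values()) / n_buckets
    buckets, current, mass = [], [], 0.0
    for p in pages:
        current.append(p)
        mass += pagerank[p]
        # close the bucket once it holds its share of the total mass
        if mass >= target and len(buckets) < n_buckets - 1:
            buckets.append(current)
            current, mass = [], 0.0
    buckets.append(current)
    return buckets
```

With this scheme a handful of very high-PageRank pages fill the first bucket, while the long tail of low-scoring pages fills the last ones, matching the shape of the table above.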
  • 19. Experiments
    ♦ Baseline: TrustRank
      ♦ Link-based technique.
      ♦ Seeds chosen in a semi-supervised way:
        • Hand-picked set of good pages.
        • Top pages according to an inverse PageRank.
      ♦ Random-walk algorithm, biased according to the seeds.
    Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
  • 20. Experiments
    [Figure: results comparing TrustRank, the Pos/Neg Approach, the Pos/Neg Metrics Approach, and the Metric-based Approach.]
  • 21. Experiments
    [Figure: log-scale plot (y axis 1–1000) over PR-buckets 1–10, comparing TrustRank, Pos/Neg, Pos/Neg Metrics, and Metrics-Based.]
  • 22. Conclusions and future work
    ♦ Novel web spam detection technique that combines concepts from link-based and content-based methods:
      ♦ Content-based metrics as an unsupervised seed selection method.
      ♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood.
    ♦ Future work:
      ♦ Including new content-based heuristics.
      ♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node.
      ♦ Content-based metrics to characterize also the edges of the web graph.
  • 23. References
    [1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb '06: Adversarial Information Retrieval on the Web, 2006.
    [2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank — fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
    [3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
    [4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.
    [5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.
    [6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.
    [7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. Technical Report 2004-17, Stanford InfoLab, March 2004.
    [8] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.
    [9] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538–543, New York, NY, USA, 2002. ACM.
    [10] P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.
    [11] V. Krishnan. Web spam detection with anti-TrustRank. In ACM SIGIR Workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.
    [12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
    [13] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.
    [14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
  • 24. Thanks for your attention! Questions? F. Javier Ortega, Craig Macdonald, José A. Troyano, Fermín Cruz