Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)

2,546 views

Published on

Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote the spam web pages in the ranking provided by a web search engine.

Cite as:
F. Javier Ortega; Craig Macdonald; José A. Troyano; Fermín L. Cruz. “Spam Detection with a Content-based Random-Walk Algorithm”. Proceedings of the Second International Workshop on Search and Mining User-Generated Contents, International Conference on Information and Knowledge Management. 2010. Toronto, Canadá

Published in: Technology, News & Politics
  • DOWNLOAD THAT BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { https://urlzs.com/UABbn } ......................................................................................................................... Download Full EPUB Ebook here { https://urlzs.com/UABbn } ......................................................................................................................... Download Full doc Ebook here { https://urlzs.com/UABbn } ......................................................................................................................... Download PDF EBOOK here { https://urlzs.com/UABbn } ......................................................................................................................... Download EPUB Ebook here { https://urlzs.com/UABbn } ......................................................................................................................... Download doc Ebook here { https://urlzs.com/UABbn } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book that can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer that is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story That Helped Ignite a Movement,-- Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths that Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Spam Detection with a Content-based Random-walk Algorithm (SMUC'2010)

  1. 1. Spam detection witha content-based random-walk algorithm F. Javier Ortega Craig Macdonald javierortega@us.es craigm@dcs.gla.ac.uk José A. Troyano Fermín Cruz troyano@us.es fcruz@us.es
  2. 2. Index ♦ Introduction ♦ Related work ♦ Content-based ♦ Link-based ♦ Our Approach ♦ Random-walk algorithm ♦ Content-based metrics ♦ Selection of seeds ♦ Experiments ♦ Future work ♦ References
  3. 3. Introduction♦ Web Spam: phenomenon where a number of web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
  4. 4. Introduction♦ Self-Promotion: gaining high relevance for a search engine, mainly based on the textual content. i.e.: including a number of keywords in the web page.
  5. 5. Introduction♦ Mutual-Promotion: gaining high score by focusing the attention on the out-links and in-links of a web page. i.e.: a web page with lots of in-links can be considered relevant by a search engine.
  6. 6. Introduction♦ Web Spam characteristics: ♦ Textual content: large amount of invisible content, a set of words with high frequency, lots of hyperlinks with large anchor texts, very long words, etc. ♦ Link-farms: large number of pages pointing one to another, in order to improve their scores by increasing the number of in-links to them. ♦ Good pages usually point to good pages. ♦ Spam pages mainly point to other spam pages (link- farms). They rarely point to good pages.
  7. 7. Related work: Content-based♦ Content-based techniques classify the web pages as spam or not-spam according to their textual content.♦ Heuristics to determine the spam likehood of a web page. ♦ Meta tag content, anchor texts, URL of the page, average lenght of the words, compression rate, etc. [10, 12] ♦ Inclusion of link-based scores and metrics into a classifier [3]♦ Link-based techniques exploit the relations between web pages to obtain a rank of pages, ordered according to their spam likelihood.♦ Random-Walk algorithms that penalizes spam-like behaviors. ♦ Dont take into account the nearest neighbours [1] ♦ Take only the scores received from a specific set of good or bad pages. [7,11]
  8. 8. Our Approach♦ Our approach combines both techniques: ♦ A set of content-based metrics, that obtains information from each single web page. ♦ A link-based algorithm, that processes the relations between web pages.♦ The goal is to obtain a ranking of web pages, in which spam web pages are demoted according to their spam likelihood.
  9. 9. Our Approach Web Content- Selection of pages based metrics Seeds Random-walk algorithm Web graph
  10. 10. Our Approach: random-walk algorithm♦ We propose a random-walk algorithm that computes two scores for each web page: ♦PR⁺: relevance of a web page ♦PR⁻: spam likelihood of a web page♦ PR⁻(b), changes according to the relation of b with spam-like web pages. Analogous with PR⁺. The higher PR⁺(a), the higher PR⁺(b). a b The higher PR⁻(a), the higher PR⁻(b).
  11. 11. Our Approach: random-walk algorithm♦ Formula:♦ Intuition: High PR⁺ High PR⁻ Higher PR⁺!! Higher PR⁻!!
  12. 12. Our Approach: content-based metrics♦ Content-based metrics are intended to extract some a-priori information from the textual content of the web pages.♦ Content-based metrics must be: ♦ Easy to obtain: save the performance! ♦ Accurate: precision is preferred over recall.
  13. 13. Our Approach: content-based metrics♦ Selected metrics: ♦ Compressibility: fraction of the sizes of a web page, before and after being compressed. ♦ Fraction of globally popular words: a web page with a high fraction of words within the most popular words in the entire corpus, is likely to be a spam. ♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
  14. 14. Our Approach: selection of seeds♦ Seeds: set of relevant nodes, in terms of spam (negative seeds) or not-spam likelihood (positive seeds).♦ The algorithm gives more relevance to the seeds.♦ Spam-biased algorithm
  15. 15. Our Approach: selection of seeds♦ Unsupervised method: content-based metrics as features to choose the seeds.♦ Pros: ♦Human intervention is not needed. ♦Larger number of seeds can be considered. ♦Inclusion of text content into a link-based method.♦ Due to the lack of human intervention... ♦“False positives”.
  16. 16. Our Approach: selection of seeds♦ Obtaining a-priori score for a node, a:♦ Selecting seeds: ♦ Pos/Neg Approach: ♦ Pos/Neg Metrics Approach: ♦ Metric-based Approach
  17. 17. Experiments ♦ Dataset: WEBSPAM-UK2006* ♦ ~98 million pages ♦ 11,402 hand-labeled hosts ♦ 7,423 labeled as spam. ♦ ~10 million spam web pages ♦ Terrier IR Platform ♦ Random-walk algorithm parameters: ♦ Damping factor = 0.85 ♦ Threshold = 0.01* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection forweb spam. SIGIR Forum, 40(2):11–24, December 2006.
  18. 18. Experiments♦ Evaluation: PR-buckets Buckets Total Pages 1 14 PageRank 2 54 } PR-bucket 1 3 144 Relevance } PR-bucket 2 4 437 } PR-bucket 3 5 6 1070 2130 } PR-bucket 4 ... 7 8 ... 2664 2778 ... 17 16M 18 28M 19 28M 20 28M Total PR =
  19. 19. Experiments♦ Baseline: TrustRank ♦ Link-based technique. ♦ Seeds chosen in a semi-supervised way: • Hand-picked set of good pages. • Top pages according to an inverse PageRank. ♦ Random-walk algorithm, biased according to the seedsZ. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004
  20. 20. Experiments TrustRank Pos/Neg Approach Pos/Neg Metrics Approach Metric-based Approach
  21. 21. Experiments 1000 100 10 1 1 2 3 4 5 6 7 8 9 10 TrustRank Pos/Neg Pos/Neg Metrics MetricsBased
  22. 22. Conclusions and future work♦ Novel web spam detection technique, that combines concepts from link and content-based methods. ♦ Content-based metrics as an unsupervised seed selection method. ♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood.♦ Future work: ♦ Including new content-based heuristics. ♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node. ♦ Content-based metrics to characterize also the edges of the web graph.
  23. 23. References[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb’06: Adversarial Information Retrieval on the Web, 2006.[2] A. A. Benczur, K. Csalogany, T. Sarlos, M. Uher, and M. Uher. Spamrank - fully automatic link spam detection. In In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb, 2005.[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004.[8] T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. Technical Report 2003- 29, 2003.2.[9] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002. ACM.[10] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.[11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999.[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
  24. 24. Thanks for your attention!! Questions? F. Javier Ortega Craig Macdonald javierortega@us.es craigm@dcs.gla.ac.uk José A. Troyano Fermín Cruz troyano@us.es fcruz@us.es

×