Web Spam (OjoBuscador 2007 Madrid, Spain)


    Web Spam (OjoBuscador 2007 Madrid, Spain) - Presentation Transcript

    1. Know your Neighbors: Web Spam Detection using Web Topology. C. Castillo (2), D. Donato (2), A. Gionis (2), V. Murdock (2) and F. Silvestri (4). Previous work with: R. Baeza-Yates (2,3), L. Becchetti (1), P. Boldi (5), S. Leonardi (1), M. Santini (5) and S. Vigna (5). Affiliations: (1) Università di Roma "La Sapienza", Rome, Italy; (2) Yahoo! Research Barcelona, Catalunya, Spain; (3) Yahoo! Research Santiago, Chile; (4) ISTI-CNR, Pisa, Italy; (5) Università degli Studi di Milano, Milan, Italy. Running title: Web Spam Detection. Sections: Web Spam; Web Spam Detection; A Reference Collection; Topological Web Spam; Counting of Supporters; Content-based Spam detection; Web Topology; Conclusions.
    2. This is a talk about academic research! Tools for dealing with Web Spam.
    3. Outline: 1 Web Spam; 2 Web Spam Detection; 3 A Reference Collection; 4 Topological Web Spam; 5 Counting of Supporters; 6 Content-based Spam detection; 7 Web Topology; 8 Conclusions.
    5. The Web. "The sum of all human knowledge plus porn" (Robert Gilbert). Graphic: www.milliondollarhomepage.com
    6. Adversarial IR Issues on the Web: link spam; content spam; cloaking; comment/forum/wiki spam; spam-oriented blogging; click fraud ×2; reverse engineering of ranking algorithms; Web content filtering; advertisement blocking; stealth crawling; malicious tagging; . . . more?
    18. Opportunities for Web spam. Spamdexing: keyword stuffing; link farms; spam blogs (splogs); cloaking. Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
    20. Naive Web Spam
    21. Hidden text
    22. Made for Advertising
    23. Search engine?
    24. Fake search engine
    25. Problem: "normal" pages that are spam
    26. Outline: 1 Web Spam; 2 Web Spam Detection; 3 A Reference Collection; 4 Topological Web Spam; 5 Counting of Supporters; 6 Content-based Spam detection; 7 Web Topology; 8 Conclusions.
    27. Machine Learning
    28. Training of a Decision Tree
    29. Decision Tree (error = 15%)
    30. Decision Tree (error = 15% → 12%)
    31. Machine Learning (cont.)
    32. Feature Extraction
    33. Challenges: Machine Learning. Instances are not really independent (graph); learning with few examples; scalability.
    36. Challenges: Information Retrieval. Feature extraction: which features? Feature aggregation: page/host/domain. Feature propagation (graph). Recall/precision tradeoffs. Scalability.
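As a rough illustration of the "feature aggregation: page/host/domain" item, the sketch below averages a per-page feature over all pages of the same host. The input format, the example URLs and values, and the choice of the mean as the aggregate are assumptions made for this sketch; the slides do not say how aggregation was done.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlsplit

def aggregate_by_host(page_features):
    """Average a per-page feature over all pages that share a host.

    page_features: iterable of (url, value) pairs (hypothetical input format).
    Returns a dict mapping host name -> mean feature value.
    """
    per_host = defaultdict(list)
    for url, value in page_features:
        host = urlsplit(url).hostname  # e.g. "www.example.co.uk"
        per_host[host].append(value)
    return {host: mean(values) for host, values in per_host.items()}

# Made-up pages and feature values, for illustration only.
pages = [
    ("http://www.example.co.uk/a.html", 0.25),
    ("http://www.example.co.uk/b.html", 0.75),
    ("http://shop.example.org.uk/", 0.5),
]
host_features = aggregate_by_host(pages)
```

Other aggregates (maximum, variance, the value of the home page only) could be swapped in by changing the last line of the function.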
    41. Outline: 1 Web Spam; 2 Web Spam Detection; 3 A Reference Collection; 4 Topological Web Spam; 5 Counting of Supporters; 6 Content-based Spam detection; 7 Web Topology; 8 Conclusions.
    42. Assembling Process: crawling of base data; elaboration of the guidelines and classification interface; labeling; post-processing.
    46. Crawling of base data. U.K. collection: 77.9 M pages downloaded from the .UK domain in May 2006 (LAW, University of Milan); large seed of about 150,000 .uk hosts; 11,400 hosts; 8 levels depth, with <=50,000 pages per host.
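The two crawl limits on the slide (8 levels of depth, at most 50,000 pages per host) can be pictured with a small breadth-first traversal. This is only a sketch under stated assumptions: `get_links` is a hypothetical callback standing in for downloading and parsing a page, seeds are counted as depth 0, and nothing here reflects how the actual crawler was implemented.

```python
from collections import Counter, deque
from urllib.parse import urlsplit

MAX_DEPTH = 8                # "8 levels depth" from the slide
MAX_PAGES_PER_HOST = 50_000  # "<=50,000 pages per host" from the slide

def bounded_crawl(seeds, get_links, max_depth=MAX_DEPTH,
                  max_per_host=MAX_PAGES_PER_HOST):
    """Breadth-first traversal with a depth limit and a per-host page cap."""
    seen = set(seeds)
    per_host = Counter()
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).hostname
        if per_host[host] >= max_per_host:
            continue  # host quota exhausted, skip this page
        per_host[host] += 1
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph (made up) used in place of real downloads.
toy = {
    "http://a.uk/": ["http://a.uk/1", "http://a.uk/2"],
    "http://a.uk/1": ["http://a.uk/3"],
}
small = bounded_crawl(["http://a.uk/"], lambda u: toy.get(u, []),
                      max_depth=1, max_per_host=2)
```

With `max_depth=1` and `max_per_host=2`, the toy run stops after the seed and one linked page, which shows both limits taking effect.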
    50. Classification interface
    51. Labeling process. We asked 20+ volunteers to classify entire hosts. Asked to classify normal / borderline / spam. Do they agree? Mostly . . .
    54. Agreement
    55. Results
        Labels:
        Label              Frequency   Percentage
        Normal                 4,046       61.75%
        Borderline               709       10.82%
        Spam                   1,447       22.08%
        Can not classify         350        5.34%
        Agreement:
        Category      Kappa   Interpretation
        normal         0.62   Substantial agreement
        spam           0.63   Substantial agreement
        borderline     0.11   Slight agreement
        global         0.56   Moderate agreement
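To make the kappa column more concrete, here is a minimal sketch of how an agreement coefficient of this kind can be computed and turned into a verbal label. The slides do not say which kappa variant was used (with 20+ volunteers it was probably a multi-rater one), so the two-rater Cohen's kappa below is a simplifying assumption. The verbal labels in the table appear to follow the commonly used Landis and Koch scale, which `interpret` reproduces. The toy ratings are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters always use one identical label.
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Verbal label on the Landis and Koch scale."""
    if kappa < 0:
        return "Poor agreement"
    scale = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, name in scale:
        if kappa <= upper:
            return f"{name} agreement"
    return "Almost perfect agreement"

# Made-up ratings of four hosts by two hypothetical volunteers.
rater_a = ["normal", "normal", "spam", "spam"]
rater_b = ["normal", "normal", "spam", "normal"]
kappa = cohen_kappa(rater_a, rater_b)
label = interpret(kappa)
```

Applied to the values in the table, `interpret` gives "Substantial agreement" for 0.62 and 0.63, "Slight agreement" for 0.11, and "Moderate agreement" for 0.56.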
    56. Result: first public Web Spam collection. Public spam collection: labels for 6,552 hosts; 2,725 hosts classified by at least 2 humans; 3,106 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk); http://www.yr-bcn.es/webspam/ Upcoming Web Spam challenge: Track I: Information retrieval + Machine learning; Track II: Machine learning; http://webspam.lip6.fr/ AIRWeb 2007 Workshop (21 submissions!): regular and short papers; Track I of the Web Spam Challenge; http://airweb.cse.lehigh.edu/2007/
    69. Web Spam AIRWeb 2007 in Banff, Canada Detection C. Castillo, D. Donato, A. Gionis, V. Murdock, F. Silvestri Web Spam Web Spam Detection A Reference Collection Topological Web Spam Counting of Supporters Content-based Spam detection Web Topology Conclusions
    71. Topological spam: link farms
        Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
    73. Motivation
        Fetterly et al. [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
          “in a number of these distributions, outlier values are associated with web spam”
    74. Restriction
        Semi-streaming model: graph on disk
          for node : 1 ... N do
            INITIALIZE-MEM(node)
          end for
          for distance : 1 ... d do {Iteration step}
            for src : 1 ... N do {Follow links in the graph}
              for all links from src to dest do
                COMPUTE(src,dest)
              end for
            end for
            NORMALIZE
          end for
          POST-PROCESS
          return Something
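The semi-streaming loop on the slide above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: an in-memory edge list stands in for the graph on disk, and the callback names (`init_mem`, `compute`, `normalize`, `post_process`) as well as the toy score-spreading example are hypothetical, chosen only to mirror the pseudocode steps.

```python
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[int, int]


def semi_streaming(
    n_nodes: int,
    read_edges: Callable[[], Iterable[Edge]],
    d: int,
    init_mem: Callable[[int], float],
    compute: Callable[[List[float], List[float], int, int], None],
    normalize: Callable[[List[float]], List[float]],
    post_process: Callable[[List[float]], List[float]],
) -> List[float]:
    """Keep one small value per node in memory; re-scan the edges on every pass."""
    mem = [init_mem(node) for node in range(n_nodes)]
    for _ in range(d):                      # iteration step
        nxt = [0.0] * n_nodes
        for src, dest in read_edges():      # sequential scan of the links
            compute(mem, nxt, src, dest)
        mem = normalize(nxt)
    return post_process(mem)


# Hypothetical toy graph (assumption): 0->1, 0->2, 1->2, 2->0.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 0)]
OUT_DEG = [2, 1, 1]


def spread(mem: List[float], nxt: List[float], src: int, dest: int) -> None:
    # Each node splits its current score evenly among its out-links.
    nxt[dest] += mem[src] / OUT_DEG[src]


def to_unit_sum(values: List[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values] if total else values


scores = semi_streaming(
    n_nodes=3,
    read_edges=lambda: iter(EDGES),
    d=1,
    init_mem=lambda node: 1.0 / 3,
    compute=spread,
    normalize=to_unit_sum,
    post_process=lambda values: values,
)
```

The point of the design is that only the per-node arrays live in memory, while the links are read sequentially once per pass, which is what makes the scheme workable when the graph itself does not fit in memory.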
    77. Link-Based Features
        Degree-related measures
        PageRank
        TrustRank [Gyöngyi et al., 2004]
        Truncated PageRank [Becchetti et al., 2006]
        Estimation of supporters [Becchetti et al., 2006]
        140 features per host (2 pages per host)
    78. Degree-Based
        [Figure: two histograms, Normal vs. Spam.]
    79. TrustRank Idea
    80. TrustRank / PageRank
        [Figure: histogram, Normal vs. Spam.]
    82. High and low-ranked pages are different
        [Figure: number of nodes (x 10^4) vs. distance (1 to 20); series: Top 0%-10%, Top 40%-50%, Top 60%-70%.]
        Areas below the curves are equal if we are in the same strongly-connected component
    84. Probabilistic counting
        [Figure: bit vectors flowing toward a target page. Propagation of bits using the “OR” operation; count bits set to estimate supporters.]
        [Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
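The bit-propagation idea on the slide above can be illustrated with a short Python sketch. Note the assumptions: this uses the classic Flajolet and Martin estimator (position of the lowest unset bit), which is a swapped-in technique and not necessarily the variant of [Becchetti et al., 2006], where set bits are counted. The function names and the toy chain graph are hypothetical.

```python
import random
from typing import List, Tuple

PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


def random_mask(rng: random.Random, bits: int = 32) -> int:
    """Return a mask with one bit set; bit i is chosen with probability 2^-(i+1)."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i


def propagate(masks: List[int], edges: List[Tuple[int, int]], d: int) -> List[int]:
    """After d passes, masks[x] is the OR of the masks of all nodes
    that can reach x in at most d links (x included)."""
    for _ in range(d):
        nxt = list(masks)
        for src, dest in edges:
            nxt[dest] |= masks[src]   # the "OR" operation along each link
        masks = nxt
    return masks


def lowest_zero_bit(mask: int) -> int:
    """Position of the least significant bit that is not set."""
    r = 0
    while mask & (1 << r):
        r += 1
    return r


def estimate(masks_for_node: List[int]) -> float:
    """Average the lowest-zero-bit position over independent sketches of one node."""
    mean_r = sum(lowest_zero_bit(m) for m in masks_for_node) / len(masks_for_node)
    return (2 ** mean_r) / PHI


# Hypothetical chain 0 -> 1 -> 2 -> 3 with hand-picked masks, to show the OR step.
chain = [(0, 1), (1, 2), (2, 3)]
after_two_passes = propagate([0b0001, 0b0010, 0b0100, 0b1000], chain, 2)
```

In practice each node would start from `random_mask` values drawn for several independent sketches, and `estimate` would be applied to the list of masks collected for one node; a single sketch gives only a coarse, power-of-two estimate.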
    86. Bottleneck number
        $b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$: minimum rate of growth of the neighbors of x up to a certain distance.
        We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and that they have smaller bottleneck numbers than non-spam pages.
        [Figure: histogram of the bottleneck number, Normal vs. Spam.]
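As a worked illustration of the bottleneck number defined above, here is a small Python sketch. It assumes that N_j(x) is the set of nodes that can reach x in at most j links, with N_0(x) = {x}, and it computes the sets exactly by breadth-first search over in-links instead of using probabilistic counting. The function name and the toy graph are hypothetical.

```python
from typing import Dict, List, Set


def bottleneck_number(in_links: Dict[int, List[int]], x: int, d: int) -> float:
    """Minimum over j = 1..d of |N_j(x)| / |N_{j-1}(x)|, with N_0(x) = {x}."""
    seen: Set[int] = {x}
    frontier = [x]
    prev_size = 1
    best = float("inf")
    for _ in range(d):
        nxt = []
        for node in frontier:
            for src in in_links.get(node, []):
                if src not in seen:       # new supporter at this distance
                    seen.add(src)
                    nxt.append(src)
        best = min(best, len(seen) / prev_size)
        prev_size = len(seen)
        frontier = nxt
    return best


# Hypothetical graph given as in-link lists: 1 -> 0, 2 -> 0, 3 -> 1, 3 -> 2.
toy_in_links = {0: [1, 2], 1: [3], 2: [3], 3: []}
```

For node 0 in this toy graph the neighborhood sizes are 1, 3, 4, 4 for distances 0 to 3, so the growth rates are 3, 4/3 and 1; a neighborhood that stops growing early drives the bottleneck number down to 1.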
    88. Content-Based Features
        Most of the features reported in [Ntoulas et al., 2006]
          Number of words in the page and title
          Average word length
          Fraction of anchor text
          Fraction of visible text
          Compression rate
          Corpus precision and corpus recall
          Query precision and query recall
          Independent trigram likelihood
          Entropy of trigrams
        96 features per host
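Three of the content-based features listed above are simple enough to sketch in Python. The definitions below are assumptions made for illustration (whitespace tokenization, zlib as the compressor, word trigrams with Shannon entropy in bits) and may differ from the exact definitions in [Ntoulas et al., 2006]; the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by zlib-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw)) if raw else 0.0


def trigram_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the distribution of word trigrams."""
    words = text.split()
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in trigrams.values())
```

Highly repetitive text, such as a keyword repeated hundreds of times, yields a large compression ratio and a low trigram entropy, which is the kind of outlier these features are meant to expose.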
    89. Average word length
        Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
    90. Corpus precision
        Figure: Histogram of the corpus precision in non-spam vs. spam pages.
    91. Query precision
        Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
    93. General hypothesis
        Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
        Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
    96. Topological dependencies: in-links
        Histogram of fraction of spam hosts in the in-links
          0 = no in-link comes from spam hosts
          1 = all of the in-links come from spam hosts
        [Figure: histogram over fractions 0.0 to 1.0; series: in-links of non spam, in-links of spam.]
    97. Topological dependencies: out-links
        Histogram of fraction of spam hosts in the out-links
          0 = none of the out-links points to spam hosts
          1 = all of the out-links point to spam hosts
        [Figure: histogram over fractions 0.0 to 1.0; series: out-links of non spam, out-links of spam.]
    98. Idea 1: Clustering
        Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting

                                 Baseline   Clustering
        Without bagging
          True positive rate     75.6%      74.5%
          False positive rate    8.5%       6.8%
          F-Measure              0.646      0.673
        With bagging
          True positive rate     78.7%      76.9%
          False positive rate    5.7%       5.0%
          F-Measure              0.723      0.728

        ✓ Reduces error rate
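The "classify, cluster, then vote" step on the slide above can be sketched as follows. This is a minimal sketch under assumptions: the initial predictions and the cluster assignment are given as plain dictionaries, ties are resolved as non-spam, and the function name and host names are hypothetical. How the clusters are obtained is outside the sketch.

```python
from collections import defaultdict
from typing import Dict, Hashable


def smooth_by_cluster(
    predicted_spam: Dict[str, bool],
    cluster_of: Dict[str, Hashable],
) -> Dict[str, bool]:
    """Relabel every host with the majority label of its cluster."""
    spam_votes: Dict[Hashable, int] = defaultdict(int)
    sizes: Dict[Hashable, int] = defaultdict(int)
    for host, is_spam in predicted_spam.items():
        cluster = cluster_of[host]
        sizes[cluster] += 1
        spam_votes[cluster] += int(is_spam)
    # A cluster is spam only on a strict majority (assumption: ties go to non-spam).
    cluster_label = {c: 2 * spam_votes[c] > sizes[c] for c in sizes}
    return {host: cluster_label[cluster_of[host]] for host in predicted_spam}


# Hypothetical example: host "c" is outvoted by its two spam cluster-mates.
toy_pred = {"a": True, "b": True, "c": False, "d": False, "e": False}
toy_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
smoothed = smooth_by_cluster(toy_pred, toy_clusters)
```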
    100. Idea 2: Propagate the label
        Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes

                                 Baseline   Fwds.    Backwds.   Both
        Classifier without bagging
          True positive rate     75.6%      70.9%    69.4%      71.4%
          False positive rate    8.5%       6.1%     5.8%       5.8%
          F-Measure              0.646      0.665    0.664      0.676
        Classifier with bagging
          True positive rate     78.7%      76.5%    75.0%      75.2%
          False positive rate    5.7%       5.4%     4.3%       4.7%
          F-Measure              0.723      0.716    0.733      0.724
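A random walk with restart over predicted spamicity, as described on the slide above, can be sketched with a plain power iteration. The assumptions here are mine: the restart vector is the normalized spamicity, the walk follows out-links, dangling nodes return their mass to the restart distribution, and `alpha` and `iters` are illustrative defaults, not values from the talk. Reversing every edge gives propagation in the opposite direction.

```python
from typing import Dict, List


def random_walk_with_restart(
    out_links: Dict[int, List[int]],
    spamicity: List[float],
    alpha: float = 0.85,
    iters: int = 50,
) -> List[float]:
    """Power iteration for p = (1 - alpha) * v + alpha * (walk step on p)."""
    n = len(spamicity)
    total = sum(spamicity)
    v = [s / total for s in spamicity]      # restart distribution
    p = list(v)
    for _ in range(iters):
        nxt = [(1 - alpha) * vi for vi in v]
        for src in range(n):
            targets = out_links.get(src, [])
            if targets:
                share = alpha * p[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # Dangling node: send its mass back through the restart vector.
                for i in range(n):
                    nxt[i] += alpha * p[src] * v[i]
        p = nxt
    return p


# Hypothetical graph: hosts 0 and 1 link to each other, as do hosts 2 and 3.
# Only host 0 starts with a high predicted spamicity.
toy_links = {0: [1], 1: [0], 2: [3], 3: [2]}
propagated = random_walk_with_restart(toy_links, [0.9, 0.1, 0.1, 0.1])
```

In this toy example host 1 ends up with a much higher score than host 3, even though both started with the same spamicity, because host 1 is linked from the suspicious host 0.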
    102. Idea 3: Stacked graphical learning
        Classify, then add the average predicted “spamicity” of neighbors as a new feature for each node, then classify again [Cohen and Kou, 2006]

                                 Baseline   Avg. of in   Avg. of out   Avg. of both
          True positive rate     78.7%      84.4%        78.3%         85.2%
          False positive rate    5.7%       6.7%         4.8%          6.1%
          F-Measure              0.723      0.733        0.742         0.750

        ✓ Increases detection rate
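The stacking step described on the slide above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the classifier is abstracted as a hypothetical `train_and_predict` callback that returns one spamicity value per host, hosts without neighbors get 0.0 for the new feature, and each pass appends the fresh average to the original feature rows (all of these are assumptions).

```python
from typing import Callable, Dict, List

Rows = List[List[float]]


def add_neighbor_feature(
    features: Rows,
    spamicity: List[float],
    neighbors: Dict[int, List[int]],
) -> Rows:
    """Append the average predicted spamicity of each host's neighbors."""
    extended = []
    for host, row in enumerate(features):
        nb = neighbors.get(host, [])
        avg = sum(spamicity[j] for j in nb) / len(nb) if nb else 0.0
        extended.append(row + [avg])
    return extended


def stacked_learning(
    features: Rows,
    labels: List[int],
    neighbors: Dict[int, List[int]],
    train_and_predict: Callable[[Rows, List[int]], List[float]],
    passes: int = 1,
) -> List[float]:
    """Classify, add the neighbor average as a feature, then classify again."""
    spamicity = train_and_predict(features, labels)
    for _ in range(passes):
        extended = add_neighbor_feature(features, spamicity, neighbors)
        spamicity = train_and_predict(extended, labels)
    return spamicity
```

Passing in-link lists, out-link lists, or their union as `neighbors` corresponds to the three variants compared in the table, and `passes=2` corresponds to repeating the procedure a second time.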
    104. Idea 3: Stacked graphical learning x2
        And repeat ...

                                 Baseline   First pass   Second pass
          True positive rate     78.7%      85.2%        88.4%
          False positive rate    5.7%       6.1%         6.3%
          F-Measure              0.723      0.750        0.763

        ✓ Significant improvement over the baseline
    106. Concluding remarks
        ✓ The UK-2006-05 dataset is “harder” than previous datasets
        ✓ Considering content-based and link-based attributes improves the accuracy
        ✓ Considering the dependencies improves the accuracy
    109. Thank you!
    110. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
        Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
        Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
        Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
        Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
    111. Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
        Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
        Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
        Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.

    Carlos Castillo, 3 years ago

    CC Attribution License
