Web Spam (Salamanca 2007)

    Web Spam (Salamanca 2007) - Presentation Transcript

    1. Web Spam Detection
       Carlos Castillo (1), chato@yahoo-inc.com
       With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)
       Affiliations: 1. Yahoo! Research Barcelona, Catalunya, Spain; 2. Università di Roma "La Sapienza", Rome, Italy; 3. Yahoo! Research Santiago, Chile; 4. ISTI-CNR, Pisa, Italy; 5. Università degli Studi di Milano, Milan, Italy
    2. Previous: how search engines work
    3. Search engine: issues
       - Scalability (crawling, indexing, searching, ranking)
       - Relevance (query to document match)
       - Static ranking (content quality)
       - Incentives for cheating ($)
    7. This is a talk about academic research!
       - Tools for dealing with Web Spam
    8. Outline
       1. Web Spam
       2. Web Spam Detection
       3. A Reference Collection
       4. Web Links
       5. Topological Web Spam
       6. Counting of Supporters
       7. Content-based Spam detection
       8. Web Topology
       9. Conclusions
    10. The Web
        “The sum of all human knowledge plus porn” – Robert Gilbert
        Graphic: www.milliondollarhomepage.com
    11. Adversarial IR Issues on the Web
        - Link spam
        - Content spam
        - Cloaking
        - Comment/forum/wiki spam
        - Spam-oriented blogging
        - Click fraud ×2
        - Reverse engineering of ranking algorithms
        - Web content filtering
        - Advertisement blocking
        - Stealth crawling
        - Malicious tagging
        - . . . more?
    23. Opportunities for Web spam
        - Spamdexing
        - Keyword stuffing
        - Link farms
        - Spam blogs (splogs)
        - Cloaking
        Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
    25. Naïve Web Spam
    26. Hidden text
    27. Made for Advertising
    28. Search engine?
    29. Fake search engine
    30. “Normal” content in link farms
    31. Cloaking
    32. Redirection
    33. Redirects using Javascript
        Simple redirect:
          <script>
          document.location="http://www.topsearch10.com/";
          </script>
        “Hidden” redirect:
          <script>
          var1=24; var2=var1;
          if(var1==var2) {
            document.location="http://www.topsearch10.com/";
          }
          </script>
    34. Problem: obfuscated code
        Obfuscated redirect:
          <script>
          var a1="win",a2="dow",a3="loca",a4="tion.",
              a5="replace",a6="('http://www.top10search.com/')";
          var i,str="";
          for(i=1;i<=6;i++)
          {
            str += eval("a"+i);
          }
          eval(str);
          </script>
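To see why obfuscation is a problem for detection, here is a minimal Python sketch of a naive static detector. The regular expression and the function name `find_literal_redirects` are illustrative assumptions, not something taken from the talk; the three sample scripts are the ones shown on slides 33 and 34. The detector only looks for a redirect written as a plain string literal.

```python
import re

# Hypothetical pattern: an assignment to document.location or window.location
# (optionally .href) whose right-hand side is a quoted URL literal.
REDIRECT_RE = re.compile(
    r"""(?:document|window)\.location(?:\.href)?\s*=\s*["'](https?://[^"']+)["']"""
)


def find_literal_redirects(html):
    """Return target URLs of redirects that appear as plain string literals."""
    return REDIRECT_RE.findall(html)


# Sample scripts copied from slides 33 and 34.
SIMPLE = '<script>document.location="http://www.topsearch10.com/";</script>'
HIDDEN = ('<script>var1=24; var2=var1; if(var1==var2) {'
          'document.location="http://www.topsearch10.com/"; }</script>')
OBFUSCATED = ('<script>var a1="win",a2="dow",a3="loca",a4="tion.",'
              'a5="replace",a6="(\'http://www.top10search.com/\')";'
              'var i,str="";for(i=1;i<=6;i++){str += eval("a"+i);}'
              'eval(str);</script>')

if __name__ == "__main__":
    for name, sample in [("simple", SIMPLE), ("hidden", HIDDEN),
                         ("obfuscated", OBFUSCATED)]:
        print(name, find_literal_redirects(sample))
```

The simple and "hidden" redirects are found because the target still appears as a literal next to `document.location`. The obfuscated one is missed, since `window.location.replace(...)` only exists after the string pieces are concatenated and evaluated at run time.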
    35. Problem: really obfuscated code
        Encoded javascript:
          <script>
          var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
          var e = '', i;
          eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
          </script>
        More examples: [Chellapilla and Maykov, 2007]
    37. Machine Learning
    38. Training of a Decision Tree
    39. Decision Tree (error = 15%)
    40. Decision Tree (error = 15% → 12%)
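The figures for the decision-tree slides did not survive in this transcript. As a rough illustration of the idea only, the sketch below trains the simplest possible tree: a single split (a "decision stump") on one numeric feature, chosen to minimize training error. The function `train_stump`, the feature, and the toy data are all made up for this example and are not the classifier or data used in the talk.

```python
def train_stump(values, labels):
    """Pick the threshold on one numeric feature that minimizes training error.

    The stump predicts spam (True) when value > threshold.
    Returns (best_threshold, training_error).
    """
    best_threshold, best_error = None, float("inf")
    for threshold in sorted(set(values)):
        # Count examples where the prediction disagrees with the label.
        errors = sum((v > threshold) != y for v, y in zip(values, labels))
        error = errors / len(values)
        if error < best_error:  # strict "<" keeps the lowest tied threshold
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Made-up toy data: one hypothetical feature value per host,
# with True meaning the host was labeled spam.
toy_values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [False, False, False, True, False, True, True, True]

if __name__ == "__main__":
    threshold, error = train_stump(toy_values, toy_labels)
    print(f"split at {threshold}, training error {error:.1%}")
```

A full decision tree repeats this split search recursively on each side of the threshold and across many features. Lower error after adding splits or features is the kind of improvement the "15% → 12%" slide title suggests.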
    41. Machine Learning (cont.)
    42. Feature Extraction
    43. Challenges: Machine Learning
        - Instances are not really independent (graph)
        - Learning with few examples
        - Scalability
    46. Challenges: Information Retrieval
        - Feature extraction: which features?
        - Feature aggregation: page/host/domain
        - Feature propagation (graph)
        - Recall/precision tradeoffs
        - Scalability
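The "feature aggregation: page/host/domain" item can be made concrete with a small sketch. It assumes a per-page numeric feature is already available and averages it per host. The function name `aggregate_by_host`, the choice of the mean as the aggregate, and the example URLs and values are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlsplit


def aggregate_by_host(page_features):
    """Average a per-page numeric feature over all pages of each host.

    page_features maps a page URL to a feature value.
    """
    by_host = defaultdict(list)
    for url, value in page_features.items():
        by_host[urlsplit(url).hostname].append(value)
    return {host: mean(values) for host, values in by_host.items()}


# Made-up feature values for three pages on two hosts.
pages = {
    "http://www.example.co.uk/": 0.25,
    "http://www.example.co.uk/a.html": 0.75,
    "http://shop.example.co.uk/": 0.5,
}

if __name__ == "__main__":
    print(aggregate_by_host(pages))
```

Aggregating at the domain level would use the same grouping with a different key. Other aggregates, such as the maximum or the variance, are equally plausible choices.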
    52. Data is really important
        - It is dangerous for a search engine to provide labelled data for this
        - Even if they did, it would never reflect a consensus
    54. Assembling Process
        - Crawling of base data
        - Elaboration of the guidelines and classification interface
        - Labeling
        - Post-processing
    58. Crawling of base data
        U.K. collection
        - 77.9 M pages downloaded from the .UK domain in May 2006 (LAW, University of Milan)
        - Large seed of about 150,000 .uk hosts
        - 11,400 hosts
        - 8 levels depth, with <=50,000 pages per host
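The two crawl limits on this slide (8 levels of depth, at most 50,000 pages per host) can be sketched as a bounded breadth-first traversal. This is an illustration, not the crawler used to build the collection. `bounded_crawl` and `get_links` are hypothetical names, seeds are assumed to sit at level 1, depth is assumed to be counted in links from a seed, and fetching is replaced by a caller-supplied function.

```python
from collections import Counter, deque
from urllib.parse import urlsplit


def bounded_crawl(seeds, get_links, max_depth=8, max_pages_per_host=50_000):
    """Breadth-first crawl bounded by link depth and by pages per host.

    get_links(url) returns the out-links of a page; it stands in for
    downloading and parsing. Seeds are level 1 (an assumption).
    """
    pages_per_host = Counter()
    seen = set(seeds)
    queue = deque((url, 1) for url in seeds)
    crawled = []
    while queue:
        url, level = queue.popleft()
        host = urlsplit(url).hostname
        if pages_per_host[host] >= max_pages_per_host:
            continue  # host quota exhausted, skip this page
        pages_per_host[host] += 1
        crawled.append(url)
        if level == max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return crawled


# Made-up link graph used in place of real downloads.
toy_graph = {
    "http://a.uk/": ["http://a.uk/1", "http://b.uk/"],
    "http://a.uk/1": ["http://a.uk/2"],
    "http://a.uk/2": ["http://a.uk/3"],
    "http://b.uk/": [],
}

if __name__ == "__main__":
    links = lambda url: toy_graph.get(url, [])
    print(bounded_crawl(["http://a.uk/"], links, max_depth=2))
    print(bounded_crawl(["http://a.uk/"], links, max_pages_per_host=2))
```

Both limits cap how much any single host can contribute, which matters for spam research because link farms can generate an effectively unbounded number of pages.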
    62. Classification interface
    63. Labeling process
        - We asked 20+ volunteers to classify entire hosts
        - Asked to classify normal / borderline / spam
        - Do they agree? Mostly . . .
    66. Agreement
    67. Results
        Labels
          Label              Frequency   Percentage
          Normal             4,046       61.75%
          Borderline         709         10.82%
          Spam               1,447       22.08%
          Can not classify   350         5.34%
        Agreement
          Category     Kappa   Interpretation
          normal       0.62    Substantial agreement
          spam         0.63    Substantial agreement
          borderline   0.11    Slight agreement
          global       0.56    Moderate agreement
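The slide does not say which kappa statistic was computed. As a hedged illustration of what such a score measures, the sketch below implements Cohen's kappa for two raters: observed agreement corrected for the agreement expected by chance. The `interpret` function maps a value onto the Landis and Koch verbal scale, which the slide's wording appears to follow. The two rating lists are made up and are not the collection's labels.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters pick the same category independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Verbal interpretation on the Landis and Koch scale."""
    if kappa < 0:
        return "Poor agreement"
    for upper, name in [(0.20, "Slight"), (0.40, "Fair"),
                        (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return name + " agreement"
    return "Almost perfect agreement"


# Made-up ratings from two hypothetical volunteers for ten hosts.
rater_a = ["normal", "normal", "spam", "spam", "normal",
           "borderline", "normal", "spam", "normal", "normal"]
rater_b = ["normal", "normal", "spam", "normal", "normal",
           "spam", "normal", "spam", "normal", "borderline"]

if __name__ == "__main__":
    k = cohen_kappa(rater_a, rater_b)
    print(f"kappa = {k:.2f}: {interpret(k)}")
    for value in (0.62, 0.63, 0.11, 0.56):  # values reported on the slide
        print(value, interpret(value))
```

With more than two raters per host, a multi-rater variant such as Fleiss' kappa would be the usual choice. The interpretation bands stay the same.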
    68. Result: first public Web Spam collection
        Public spam collection
        - Labels for 6,552 hosts
        - 2,725 hosts classified by at least 2 humans
        - 3,106 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk)
        - http://www.yr-bcn.es/webspam/
        Upcoming Web Spam challenge
        - Track I: Information retrieval + Machine learning
        - Track II: Machine learning
        - http://webspam.lip6.fr/
        AIRWeb 2007 Workshop (challenge results available)
        - Regular and short papers
        - Track I of the Web Spam Challenge
        - http://airweb.cse.lehigh.edu/2007/
    81. AIRWeb 2007 in Banff, Canada
    82. [Outline]
        1. Web Spam
        2. Web Spam Detection
        3. A Reference Collection
        4. Web Links
        5. Topological Web Spam
        6. Counting of Supporters
        7. Content-based Spam detection
        8. Web Topology
        9. Conclusions
    84. Scale-free networks
    85. How to find meaningful patterns?
        Several levels of analysis:
            Macroscopic view: overall structure
            Microscopic view: nodes
            Mesoscopic view: regions
    88. Macroscopic view, e.g. Bow-tie [Broder et al., 2000]
    89. Macroscopic view, e.g. Bow-tie, migration [Baeza-Yates and Poblete, 2006]
    90. Macroscopic view, e.g. Jellyfish [Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
    91. Macroscopic view, e.g. Jellyfish
    92. Microscopic view, e.g. Degree [Barabási, 2002] and others
    93. Microscopic view, e.g. Degree
        [Figure panels: Greece, Chile, Spain, Korea.]
        [Baeza-Yates et al., 2006b] compares this distribution in 8 countries ... guess what is the result?
    94. Mesoscopic view, e.g. Hop-plot
    96. Mesoscopic view, e.g. Hop-plot
        [Figure: four panels plotting Frequency against Distance: .it (40M pages), .uk (18M pages), .eu.int (800K pages), Synthetic graph (100K pages).]
        [Baeza-Yates et al., 2006a]
    99. Topological spam: link farms
        Single-level farms can be detected by searching groups of nodes sharing their out-links [Gibson et al., 2005]
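    As a rough illustration of the idea on this slide, the sketch below groups nodes whose out-link sets are identical. It is a deliberate simplification and not the method of [Gibson et al., 2005]; the function name `find_shared_outlink_groups`, the `min_size` threshold, and the toy graph are all hypothetical.

```python
from collections import defaultdict

def find_shared_outlink_groups(out_links, min_size=3):
    """Group nodes that have exactly the same non-empty set of out-links."""
    groups = defaultdict(list)
    for node, targets in out_links.items():
        if targets:  # ignore nodes without out-links
            groups[frozenset(targets)].append(node)
    # Keep only groups large enough to look suspicious (threshold is an assumption).
    return [sorted(nodes) for nodes in groups.values() if len(nodes) >= min_size]

# Toy graph: a, b and c all point only to t, which is the pattern of a single-level farm.
toy_graph = {
    "a": {"t"}, "b": {"t"}, "c": {"t"},
    "d": {"a", "x"}, "x": {"d"}, "t": set(),
}
farm_groups = find_shared_outlink_groups(toy_graph)
print(farm_groups)
```

    Exact set equality is brittle on real data; a similarity measure over out-link sets would be a natural relaxation, at a higher cost.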
    101. Motivation
        Fetterly [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: “in a number of these distributions, outlier values are associated with web spam”
    102. Handling large graphs
        For large graphs, random access is not possible.
        Large graphs do not fit in main memory
        Streaming model of computation
    105. Semi-streaming model
        Memory size enough to hold some data per-node
        Disk size enough to hold some data per-edge
        A small number of passes over the data
    106. Restriction
        Semi-streaming model: graph on disk
            for node : 1 ... N do
                INITIALIZE-MEM(node)
            end for
            for distance : 1 ... d do {Iteration step}
                for src : 1 ... N do {Follow links in the graph}
                    for all links from src to dest do
                        COMPUTE(src,dest)
                    end for
                end for
                NORMALIZE
            end for
            POST-PROCESS
            return Something
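    One possible instance of the template on this slide is a PageRank-style computation, sketched below under several assumptions: the graph is a text file with one `src dest` pair per line, per-node data lives in plain Python lists, and the names `streaming_pagerank`, `passes`, and `damping` are illustrative rather than taken from the talk.

```python
import os
import tempfile

def streaming_pagerank(edge_path, n, passes=20, damping=0.85):
    """Per-node arrays stay in memory; edges are read sequentially from disk."""
    # Extra preliminary scan: count out-degrees (still only per-node data).
    out_deg = [0] * n
    with open(edge_path) as f:
        for line in f:
            src, _ = map(int, line.split())
            out_deg[src] += 1
    rank = [1.0 / n] * n                          # INITIALIZE-MEM
    for _ in range(passes):                       # iteration step
        nxt = [0.0] * n
        with open(edge_path) as f:                # one sequential pass over the edges
            for line in f:
                src, dest = map(int, line.split())
                nxt[dest] += rank[src] / out_deg[src]   # COMPUTE(src, dest)
        dangling = sum(r for r, d in zip(rank, out_deg) if d == 0)
        # NORMALIZE: apply damping and spread dangling mass uniformly.
        rank = [(1 - damping) / n + damping * (x + dangling / n) for x in nxt]
    return rank                                   # POST-PROCESS would go here

edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"{s} {d}\n" for s, d in edges)
ranks = streaming_pagerank(tmp.name, n=4)
os.remove(tmp.name)
print([round(r, 3) for r in ranks])
```

    The point of the sketch is the access pattern: memory use grows with the number of nodes, while the edge list is only ever scanned front to back.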
    109. Link-Based Features
        Degree-related measures
        PageRank
        TrustRank [Gyöngyi et al., 2004]
        Truncated PageRank [Becchetti et al., 2006]
        Estimation of supporters [Becchetti et al., 2006]
        140 features per host (2 pages per host)
    110. Degree-Based
        [Figure: two histograms comparing Normal and Spam.]
    111. TrustRank Idea
    112. TrustRank / PageRank
        [Figure: histogram comparing Normal and Spam.]
    114. High and low-ranked pages are different
        [Figure: Number of Nodes (x 10^4) against Distance (1 to 20), with one curve each for pages in the Top 0%-10%, Top 40%-50%, and Top 60%-70%.]
        Areas below the curves are equal if we are in the same strongly-connected component
    116. Probabilistic counting
        [Figure: bit vectors attached to the pages around a target page. Propagation of bits using the “OR” operation; count bits set to estimate supporters.]
        [Becchetti et al., 2006] shows an improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
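    The sketch below illustrates the general mechanism: random bitmasks are propagated along links with OR, then the number of supporters is estimated from the resulting mask. It uses the classic Flajolet-Martin estimator (position of the lowest unset bit, averaged over independent masks), not the improved estimator of [Becchetti et al., 2006]; all function names and the toy star graph are assumptions.

```python
import random

def fm_bitmask(rng, bits=32):
    """Mask with one bit set; bit i is chosen with probability 2^-(i+1)."""
    i = 0
    while i < bits - 1 and rng.random() < 0.5:
        i += 1
    return 1 << i

def propagate(in_links, masks, distance):
    """After `distance` rounds, masks[x] is the OR of the masks of all nodes
    that reach x in at most `distance` links (x included)."""
    for _ in range(distance):
        new_masks = {}
        for node, mask in masks.items():
            for supporter in in_links.get(node, ()):
                mask |= masks[supporter]   # uses the previous round's masks
            new_masks[node] = mask
        masks = new_masks
    return masks

def estimate(mask_list):
    """Flajolet-Martin estimate, averaging the lowest unset bit position."""
    total = 0
    for mask in mask_list:
        r = 0
        while mask & (1 << r):
            r += 1
        total += r
    return 2 ** (total / len(mask_list)) / 0.77351

rng = random.Random(0)
n, trials = 200, 64
in_links = {0: list(range(1, n))}      # every other node links to target 0
runs = []
for _ in range(trials):
    start = {v: fm_bitmask(rng) for v in range(n)}
    runs.append((start, propagate(in_links, start, distance=1)))
target_estimate = estimate([final[0] for _, final in runs])
print(round(target_estimate))
```

    Because OR is idempotent, a supporter reached through several paths is still counted once, which is what makes this usable on graphs with many overlapping paths.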
    118. Bottleneck number
        $b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance.
        We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
        [Figure: histogram of the bottleneck number, Normal vs. Spam.]
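    A small worked example may help here. The sketch below computes the bottleneck number exactly by breadth-first search over in-links, assuming that N_j(x) is the set of nodes that reach x in at most j links and that N_0(x) = {x}; the slide does not spell out these conventions, and the function name and toy graph are hypothetical. On large graphs the set sizes would be estimated rather than enumerated.

```python
def bottleneck_number(in_links, x, d):
    """Minimum over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|."""
    reached = {x}          # N_0(x)
    frontier = {x}
    best = float("inf")
    for _ in range(d):
        prev_size = len(reached)
        # Nodes one more link away that have not been seen yet.
        frontier = {s for node in frontier for s in in_links.get(node, ())} - reached
        reached |= frontier
        best = min(best, len(reached) / prev_size)
    return best

toy_in_links = {
    # A page whose supporters keep growing with distance.
    "x": ["p1", "p2", "p3"],
    "p1": ["q1", "q2", "q3"], "p2": ["q4", "q5", "q6"], "p3": ["q7", "q8", "q9"],
    # A page supported only by a small closed cluster.
    "s": ["f1", "f2"], "f1": ["f2"], "f2": ["f1"],
}
well_supported = bottleneck_number(toy_in_links, "x", d=2)
isolated = bottleneck_number(toy_in_links, "s", d=2)
print(well_supported, isolated)
```

    For "x" the neighborhood sizes are 1, 4, 13, so the minimum ratio is 13/4; for "s" they are 1, 3, 3, so growth stalls and the minimum ratio is 1.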
    120. Content-Based Features
        Most of the features reported in [Ntoulas et al., 2006]:
            Number of words in the page and title
            Average word length
            Fraction of anchor text
            Fraction of visible text
            Compression rate
            Corpus precision and corpus recall
            Query precision and query recall
            Independent trigram likelihood
            Entropy of trigrams
        96 features per host
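    To make a few of the features above concrete, the sketch below computes word count, average word length, a compression rate, and word-trigram entropy for a plain-text string. The definitions are assumptions made for illustration (whitespace tokenization, compression rate as raw bytes divided by zlib-compressed bytes) and may differ from those in [Ntoulas et al., 2006].

```python
import math
import zlib
from collections import Counter

def content_features(text):
    """A handful of content-based features for one page's visible text."""
    words = text.split()
    raw = text.encode("utf-8")
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = 0.0
    if total:
        entropy = -sum(c / total * math.log2(c / total) for c in trigrams.values())
    return {
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }

repetitive = "cheap pills buy now " * 50
varied = "the quick brown fox jumps over the lazy dog near a quiet river bank"
spam_like = content_features(repetitive)
normal_like = content_features(varied)
print(spam_like)
print(normal_like)
```

    Highly repetitive text compresses well and has few distinct trigrams, so it scores a higher compression rate and a lower trigram entropy than the varied sentence.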
    121. Average word length
        Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
    122. Corpus precision
        Figure: Histogram of the corpus precision in non-spam vs. spam pages.
    123. Query precision
        Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
    125. General hypothesis
        Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
        Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
    128. Topological dependencies: in-links
        Histogram of fraction of spam hosts in the in-links
            0 = no in-link comes from spam hosts
            1 = all of the in-links come from spam hosts
        [Figure: histogram over the range 0.0 to 1.0, with one series for in-links of non spam and one for in-links of spam.]
    129. Topological dependencies: out-links
        Histogram of fraction of spam hosts in the out-links
            0 = none of the out-links points to spam hosts
            1 = all of the out-links point to spam hosts
        [Figure: histogram over the range 0.0 to 1.0, with one series for out-links of non spam and one for out-links of spam.]
    130. Idea 1: Clustering
        Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting

                                    Baseline   Clustering
        Without bagging
            True positive rate        75.6%       74.5%
            False positive rate        8.5%        6.8%
            F-Measure                 0.646       0.673
        With bagging
            True positive rate        78.7%       76.9%
            False positive rate        5.7%        5.0%
            F-Measure                 0.723       0.728

        ✓ Reduces error rate
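    The relabeling step of this idea can be sketched as follows, assuming the classifier output is a boolean per host and the clustering is given as a host-to-cluster mapping. The function name, the tie rule (keep the classifier's labels), and the toy data are assumptions; the talk does not specify the clustering algorithm or voting details.

```python
from collections import defaultdict

def smooth_by_cluster(predicted, cluster_of):
    """predicted: host -> True if classified as spam; cluster_of: host -> cluster id."""
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = dict(predicted)
    for hosts in members.values():
        spam_votes = sum(predicted[h] for h in hosts)
        if 2 * spam_votes != len(hosts):      # on a tie, keep the original labels
            majority = 2 * spam_votes > len(hosts)
            for h in hosts:
                smoothed[h] = majority
    return smoothed

predicted = {"a": True, "b": True, "c": False, "d": False, "e": False, "f": True}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(predicted, cluster_of)
print(smoothed)
```

    In the toy data, host c is flipped to spam and host f to non-spam because each disagrees with the majority of its cluster.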
    132. Idea 2: Propagate the label
        Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes

                                    Baseline    Fwds.   Backwds.     Both
        Classifier without bagging
            True positive rate        75.6%     70.9%     69.4%     71.4%
            False positive rate        8.5%      6.1%      5.8%      5.8%
            F-Measure                 0.646     0.665     0.664     0.676
        Classifier with bagging
            True positive rate        78.7%     76.5%     75.0%     75.2%
            False positive rate        5.7%      5.4%      4.3%      4.7%
            F-Measure                 0.723     0.716     0.733     0.724
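    A random walk with restart seeded by predicted spamicity can be sketched as a personalized PageRank-style iteration. This is only an illustration under assumptions: the restart probability of 0.15, the handling of nodes without out-links, and the names below are not from the talk. The walk here follows out-links; passing a reversed graph would walk in the other direction.

```python
def propagate_spamicity(out_links, spamicity, restart=0.15, iterations=50):
    """Stationary-like scores of a walk that restarts in proportion to spamicity."""
    nodes = list(spamicity)
    total = sum(spamicity.values())
    restart_vec = {v: spamicity[v] / total for v in nodes}
    score = dict(restart_vec)
    for _ in range(iterations):
        nxt = {v: restart * restart_vec[v] for v in nodes}
        for src in nodes:
            targets = out_links.get(src, [])
            if targets:
                share = (1 - restart) * score[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:
                # A walker stuck at a node without out-links restarts.
                for v in nodes:
                    nxt[v] += (1 - restart) * score[src] * restart_vec[v]
        score = nxt
    return score

out_links = {"s1": ["s2", "u"], "s2": ["s1", "u"], "u": ["s1"],
             "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.8, "u": 0.1, "n1": 0.1, "n2": 0.1}
scores = propagate_spamicity(out_links, spamicity)
print({k: round(v, 3) for k, v in scores.items()})
```

    Host u starts with the same low spamicity as n1 and n2 but ends with a much higher score, because it is linked from the two hosts the classifier considers likely spam.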
    134. Idea 3: Stacked graphical learning
        Classify, then add the average predicted “spamicity” of neighbors as a new feature for each node, then classify again [Cohen and Kou, 2006]

                                    Baseline   Avg. of in   Avg. of out   Avg. of both
            True positive rate        78.7%       84.4%        78.3%         85.2%
            False positive rate        5.7%        6.7%         4.8%          6.1%
            F-Measure                 0.723       0.733        0.742         0.750

        ✓ Increases detection rate
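    The feature-construction step of this idea can be sketched as below. Only the new feature is computed; retraining the classifier on the extended vectors is left out. The function name, the data layout, and the fallback for hosts without neighbors (reuse the host's own prediction) are assumptions.

```python
def add_neighbor_feature(features, predicted, neighbors):
    """Append the mean predicted spamicity of each host's neighbors as a new feature."""
    extended = {}
    for host, vector in features.items():
        scores = [predicted[n] for n in neighbors.get(host, []) if n in predicted]
        # Fallback for isolated hosts is an assumption, not from the talk.
        avg = sum(scores) / len(scores) if scores else predicted[host]
        extended[host] = vector + [avg]
    return extended

features = {"a": [0.2, 3.0], "b": [0.7, 1.0], "c": [0.1, 5.0]}
predicted = {"a": 0.9, "b": 0.5, "c": 0.1}       # first-pass classifier output
in_neighbors = {"a": ["b", "c"], "b": ["a"]}     # hosts linking to each host
extended = add_neighbor_feature(features, predicted, in_neighbors)
print(extended)
```

    Passing in-links, out-links, or their union as `neighbors` corresponds to the three variants in the table; running the step again on the second-pass predictions gives a further pass.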
    136. Idea 3: Stacked graphical learning x2
        And repeat ...

                                    Baseline   First pass   Second pass
            True positive rate        78.7%       85.2%        88.4%
            False positive rate        5.7%        6.1%         6.3%
            F-Measure                 0.723       0.750        0.763

        ✓ Significant improvement over the baseline
    138. Concluding remarks
        ✓ The UK-2006-05 dataset is “harder” than previous datasets
        ✓ Considering content-based and link-based attributes improves the accuracy
        ✓ Considering the dependencies improves the accuracy
    141. Thank you!
    142. References
        Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
        Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national Web domains. To appear in ACM TOIT.
        Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean Web structure. Comput. Networks, 50(10):1464–1473.
        Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
        Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
    143. References
        Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the Web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
        Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
        Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
        Davison, B. D. (2000). Topical locality in the Web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
        Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
    144. References
        Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
        Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
        Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
        Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
        Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
        Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.
    CC Attribution License