Using Topology to Identify Spam (SIGIR 2007)
 

    Presentation Transcript

    • Know your Neighbors: Web Spam Detection Using the Web Topology. Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2). (1) Yahoo! Research Barcelona, Catalunya, Spain; (2) ISTI-CNR, Pisa, Italy. ACM SIGIR, 25 July 2007, Amsterdam
    • Outline: 1. Spam on the Web; 2. Detecting Web Spam; 3. Link-Based Detection; 4. Content-Based Detection; 5. Using Links and Contents; 6. Using the Web Topology; 7. Conclusions
    • Section 1: Spam on the Web
    • What is on the Web?
    • What is on the Web [2.0]?
    • What else is on the Web? (Source: www.milliondollarhomepage.com)
    • What’s happening on the Web? There is a fierce competition for your attention.
    • What’s happening on the Web? Search engines are to some extent arbiters of this competition, and they must watch it closely, otherwise ...
    • Some cheating occurs (1986 FIFA World Cup, Argentina vs England)
    • Simple web spam
    • Hidden text
    • Made for advertising
    • Search engine?
    • Fake search engine
    • “Normal” content in link farms
    • There are many attempts at cheating on the Web. Most of these are spam: 1,630,000 results for “free mp3 hilton viagra” in SE1; 1,760,000 results for “credit vicodin loan” in SE2; 1,320,000 results for “porn mortgage” in SE3
    • Costs:
      ✗ Costs for users: lower precision for some queries
      ✗ Costs for search engines: wasted storage space, network resources, and processing cycles
      ✗ Costs for the publishers: resources invested in cheating and not in improving their contents
    • Cheating on the Web: link spam; content spam; spam-oriented blogging; comment/forum/Wiki spam; malicious cloaking; click fraud ×2; malicious tagging; ... more? Adversarial relationship: every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
    • Section 2: Detecting Web Spam
    • Research on Web spam detection: Web spam detection techniques
    • Spam, damn spam and statistics: [Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”
    • Machine learning training
    • Machine learning
    • Challenges:
      Scalability
      Machine Learning Challenges: instances are not really independent (graph); training set is relatively small
      Information Retrieval Challenges: it is hard to find out which features are relevant; features can be aggregated in content units (page/host/domain); features can be propagated through the graph
    • Training data:
      ✗ It is hard for search engines to provide labeled data
      ✗ Even if they do, it will not reflect a consensus on what is Web Spam
      ✓ Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/
    • Section 3: Link-Based Detection
    • “Link farms” [diagram labels: Web, link farm, spam page]. Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
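One rough way to picture "groups of nodes sharing their out-links" is to bucket nodes by their exact out-link set. This is only an illustrative stand-in and not the method of [Gibson et al., 2005]; the function name, the `min_size` parameter, and the toy graph are all hypothetical.

```python
from collections import defaultdict

def groups_sharing_out_links(out_links, min_size=2):
    """Group nodes whose out-link sets are exactly equal (hypothetical helper)."""
    by_targets = defaultdict(list)
    for node, targets in out_links.items():
        if targets:  # nodes without out-links cannot be part of a farm
            by_targets[frozenset(targets)].append(node)
    # Keep only groups that are large enough to look suspicious.
    return [sorted(nodes) for nodes in by_targets.values() if len(nodes) >= min_size]

# Toy graph (made up): three farm pages all point at one boosted page.
toy_graph = {
    "farm1": {"spam"},
    "farm2": {"spam"},
    "farm3": {"spam"},
    "blog": {"news", "wiki"},
    "news": {"wiki"},
    "spam": set(),
    "wiki": set(),
}
farms = groups_sharing_out_links(toy_graph, min_size=3)
print(farms)
```

Exact set equality is brittle on real data; a practical detector would need an approximate notion of shared out-links.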
    • Handling large graphs: memory size enough to hold some data per node; disk size enough to hold some data per edge; a small number of passes over the data
    • Semi-streaming model:
        for node : 1 ... N do
          INITIALIZE-MEM(node)
        end for
        for distance : 1 ... d do   {Iteration step}
          for src : 1 ... N do   {Follow links in the graph}
            for all links from src to dest do
              COMPUTE(src, dest)
            end for
          end for
          NORMALIZE
        end for
        POST-PROCESS
        return Something
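As a concrete instance of the semi-streaming template, the sketch below fills in the abstract steps with a plain PageRank computation. It keeps only per-node arrays in memory and re-reads the edge list once per iteration. The function name, the `edges` callable, and the in-memory `EDGE_LIST` (standing in for an on-disk edge file) are assumptions made for illustration.

```python
def semi_streaming_pagerank(edges, num_nodes, d=20, alpha=0.85):
    """edges: callable returning a fresh iterator of (src, dest) pairs for each pass."""
    # Per-node state only: out-degrees and current scores.
    out_degree = [0] * num_nodes
    for src, _ in edges():                      # one extra pass to count out-links
        out_degree[src] += 1
    rank = [1.0 / num_nodes] * num_nodes        # INITIALIZE-MEM

    for _ in range(d):                          # iteration step
        incoming = [0.0] * num_nodes
        for src, dest in edges():               # follow links: one pass over the edges
            incoming[dest] += rank[src] / out_degree[src]   # COMPUTE(src, dest)
        # NORMALIZE: apply damping, then rescale so that the scores sum to 1.
        rank = [(1 - alpha) / num_nodes + alpha * x for x in incoming]
        total = sum(rank)
        rank = [r / total for r in rank]
    return rank                                 # POST-PROCESS is omitted here

# Toy edge list (made up); node 3 has no in-links.
EDGE_LIST = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = semi_streaming_pagerank(lambda: iter(EDGE_LIST), num_nodes=4)
print(ranks)
```

Passing a callable rather than a list makes the "small number of passes" explicit: each call corresponds to one sequential read of the edge data.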
    • Link-based features: degree-related measures; PageRank; TrustRank [Gyöngyi et al., 2004]; Truncated PageRank [Becchetti et al., 2006]; estimation of supporters [Becchetti et al., 2006]. 140 features per host (2 pages per host)
    • Degree-Based [figure: two histograms comparing normal and spam hosts]
    • TrustRank [Gyöngyi et al., 2004]
    • TrustRank / PageRank [figure: histogram comparing normal and spam hosts]
    • Truncated PageRank: proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links: $$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^{t} & t > T \end{cases}$$ ✓ No extra reading of the graph after PageRank
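The damping function can be illustrated with a small sketch that scores each node by summing the weight of paths of every length t, scaled by damping(t). Two assumptions are made here that the slide does not state: C is chosen so that damping(t) sums to 1 over t > T, and the toy graph has no dangling nodes. The function names and `max_len` cutoff are hypothetical.

```python
def damping(t, T, alpha):
    # Assumption: C normalizes the weights so they sum to 1 over all t > T.
    C = (1 - alpha) / alpha ** (T + 1)
    return 0.0 if t <= T else C * alpha ** t

def truncated_pagerank(out_links, T, alpha=0.85, max_len=100):
    nodes = list(out_links)
    n = len(nodes)
    walk = {v: 1.0 / n for v in nodes}     # weight of paths of length t ending at v
    score = {v: 0.0 for v in nodes}
    for t in range(max_len + 1):
        w = damping(t, T, alpha)
        for v in nodes:
            score[v] += w * walk[v]        # paths of length <= T contribute nothing
        nxt = {v: 0.0 for v in nodes}
        for src, dests in out_links.items():
            for dest in dests:
                nxt[dest] += walk[src] / len(dests)
        walk = nxt
    return score

# Toy graph (made up); every node has out-links, and "d" has no in-links.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
plain = truncated_pagerank(graph, T=-1)    # T = -1 keeps every path length
truncated = truncated_pagerank(graph, T=0) # drop the length-0 contribution
print(plain["d"], truncated["d"])
```

With T = -1 the weights reduce to (1 - α)α^t, so the same loop yields an ordinary PageRank score; raising T removes the contribution of short paths.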
    • Hop-plot
    • High and low-ranked pages are different [figure: number of nodes (×10^4) against distance (1 to 20), with curves labeled Top 0%-10%, Top 40%-50%, and Top 60%-70%]. Areas below the curves are equal if we are in the same strongly-connected component
    • Probabilistic counting [diagram: bit vectors propagated toward a target page using the “OR” operation; count bits set to estimate supporters]. [Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
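The bit-propagation idea can be sketched as follows, under simplifying assumptions: every node switches on each of its k bits independently with a fixed probability `eps`, masks are OR-ed along links for d rounds, and the number of bits set is inverted into a count estimate. This is a simplified illustration and may differ in its details from the algorithm in [Becchetti et al., 2006]; all names and parameter values are hypothetical.

```python
import math
import random

def estimate_supporters(edges, nodes, d, k=2048, eps=0.005, seed=0):
    rng = random.Random(seed)
    # Each node switches on each of its k bits independently with probability eps.
    mask = {}
    for v in nodes:
        bits = 0
        for i in range(k):
            if rng.random() < eps:
                bits |= 1 << i
        mask[v] = bits
    # d rounds of propagation along links, combining masks with bitwise OR.
    for _ in range(d):
        new_mask = dict(mask)
        for src, dest in edges:
            new_mask[dest] |= mask[src]
        mask = new_mask
    # A bit stays 0 with probability (1 - eps)^n when n nodes contributed,
    # so the fraction of bits set can be inverted into an estimate of n.
    estimates = {}
    for v in nodes:
        ones = bin(mask[v]).count("1")
        if ones == k:
            estimates[v] = float("inf")    # saturated: eps is too large for this count
        else:
            estimates[v] = math.log(1 - ones / k) / math.log(1 - eps)
    return estimates

# Toy star graph (made up): 200 pages link to page 0.
nodes = range(201)
edges = [(i, 0) for i in range(1, 201)]
est = estimate_supporters(edges, nodes, d=1)
print(round(est[0]), round(est[1], 2))
```

In this sketch the estimate for a node includes the node itself, since its own bits are part of its mask.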
    • Bottleneck number: $$b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$$ Minimum rate of growth of the neighbors of x up to a certain distance.
    • Bottleneck number: spam
    • Bottleneck number: normal
    • Bottleneck number: $$b_d(x) = \min_{j \le d} \{|N_j(x)| / |N_{j-1}(x)|\}$$ [figure: histogram of the bottleneck number, normal vs. spam]
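A small sketch of the bottleneck number on an explicit graph follows. It assumes that N_j(x) is the set of nodes that can reach x in at most j links, with N_0(x) = {x}; the slides do not spell this out, so treat it as an interpretation. The function names and the toy graph are hypothetical, and the sizes are computed exactly by breadth-first search rather than estimated.

```python
from collections import deque

def neighborhood_sizes(in_links, x, d):
    """sizes[j] = number of nodes within distance j of x, following in-links backwards."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == d:
            continue                       # do not explore beyond distance d
        for u in in_links.get(v, ()):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    sizes = [0] * (d + 1)
    for dv in dist.values():
        sizes[dv] += 1
    for j in range(1, d + 1):
        sizes[j] += sizes[j - 1]           # make the counts cumulative
    return sizes

def bottleneck_number(in_links, x, d):
    sizes = neighborhood_sizes(in_links, x, d)
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))

# Toy graph (made up): x has three in-links, and only one of them has a further in-link.
in_links = {"x": ["a", "b", "c"], "a": ["e"], "b": [], "c": [], "e": []}
b2 = bottleneck_number(in_links, "x", 2)
print(b2)
```

Here the neighborhood grows from 1 to 4 to 5 nodes, so the minimum growth rate is 5/4.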
    • Section 4: Content-Based Detection
    • Content-Based Features: most of the features reported in [Ntoulas et al., 2006]: number of words in the page and title; average word length; fraction of anchor text; fraction of visible text; compression rate; corpus precision and corpus recall; query precision and query recall; independent trigram likelihood; entropy of trigrams. 96 features per host
    • Content-based features (entropy related): $T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$ is the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$. Features: entropy of trigrams, $H = -\sum_{w_i \in T} p_i \log p_i$. Also, compression rate, as measured by bzip
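The two entropy-related features can be sketched in a few lines. Several choices below are assumptions not fixed by the slide: trigrams are taken over whitespace-separated words, the logarithm is base 2, and the compression rate is defined as uncompressed size divided by compressed size, using Python's `bz2` module as the bzip implementation.

```python
import bz2
import math
from collections import Counter

def trigram_entropy(text):
    """Entropy (in bits) of the distribution of word trigrams in a text."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def compression_rate(text):
    """Uncompressed size over bzip2-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    return len(raw) / len(bz2.compress(raw))

varied = trigram_entropy("a b c d")        # two distinct trigrams, equally likely
repeated = trigram_entropy("a a a a a")    # one trigram repeated
rate = compression_rate("spam " * 1000)    # highly repetitive text compresses well
print(varied, repeated, rate)
```

Repetitive, keyword-stuffed text tends toward low trigram entropy and a high compression rate under these definitions.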
    • Content-based features (related to popular keywords): F is the set of most frequent terms in the collection; Q is the set of most frequent terms in a query log; P is the set of terms in a page. Features: corpus “precision” $|P \cap F|/|P|$; corpus “recall” $|P \cap F|/|F|$; query “precision” $|P \cap Q|/|P|$; query “recall” $|P \cap Q|/|Q|$
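Since the four keyword features share one formula, a single helper covers them: pass F to get the corpus features, or Q to get the query features. The function name and the example terms below are made up for illustration.

```python
def overlap_features(page_terms, frequent_terms):
    """Return (precision, recall) of a page's term set against a set of popular terms."""
    P, F = set(page_terms), set(frequent_terms)
    common = len(P & F)
    precision = common / len(P) if P else 0.0   # |P ∩ F| / |P|
    recall = common / len(F) if F else 0.0      # |P ∩ F| / |F|
    return precision, recall

# Hypothetical page and popular-term set.
page = "cheap loans cheap mortgage free".split()
popular = {"free", "cheap", "the"}
precision, recall = overlap_features(page, popular)
print(precision, recall)
```

Repeated words on the page count once, because P is a set of terms rather than a bag of words.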
    • Average word length. Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
    • Corpus precision. Figure: Histogram of the corpus precision in non-spam vs. spam pages.
    • Query precision. Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
    • Section 5: Using Links and Contents
    • Cost-sensitive decision tree with bagging. Bagging of 10 decision trees, asymmetrical costs.
      Cost ratio            1       10      20      30      50
      True positive rate    65.8%   66.7%   71.1%   78.7%   84.1%
      False positive rate   2.8%    3.4%    4.5%    5.7%    8.6%
      F-Measure             0.712   0.703   0.704   0.723   0.692
    • Link- and content-based features. Link-based and content-based:
                            Both    Link-only   Content-only
      True positive rate    78.7%   79.4%       64.9%
      False positive rate   5.7%    9.0%        3.7%
      F-Measure             0.723   0.659       0.683
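For readers less familiar with the three metrics in these tables, the sketch below computes them from confusion-matrix counts. It assumes spam is the positive class and that the F-measure is the balanced F1 score; the counts in the example are invented and are not taken from the experiments.

```python
def spam_metrics(tp, fn, fp, tn):
    """True positive rate, false positive rate, and F1, with spam as the positive class."""
    tpr = tp / (tp + fn)            # fraction of spam hosts that are detected
    fpr = fp / (fp + tn)            # fraction of normal hosts wrongly flagged
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f_measure

# Invented counts, for illustration only.
tpr, fpr, f_measure = spam_metrics(tp=80, fn=20, fp=10, tn=90)
print(tpr, fpr, round(f_measure, 3))
```

Because the F-measure depends on precision, it also depends on how common spam is in the test set, which is why it cannot be recovered from the two rates alone.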
    • Hypothesis. Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages. Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000].
    • Topological dependencies: in-links. Figure: Histogram of the fraction of spam hosts in the in-links, for non-spam hosts and for spam hosts (0 = no in-link comes from spam hosts, 1 = all of the in-links come from spam hosts).
    • Topological dependencies: out-links. Figure: Histogram of the fraction of spam hosts in the out-links, for non-spam hosts and for spam hosts (0 = none of the out-links points to spam hosts, 1 = all of the out-links point to spam hosts).
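The quantity plotted in the two histograms above can be computed per host as in the following minimal sketch. The host names, the toy graph, and the labels are made up for illustration; only the definition (fraction of spam hosts among the in-links or out-links) comes from the slides.

```python
def spam_fraction(neighbors, labels):
    """Fraction of labelled neighbours that are spam, or None if there are none."""
    known = [labels[h] for h in neighbors if h in labels]
    if not known:
        return None
    return sum(1 for lab in known if lab == "spam") / len(known)


# Hypothetical host graph: host -> hosts it links to.
out_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "b.example"],
}
labels = {
    "a.example": "normal",
    "b.example": "spam",
    "c.example": "spam",
    "d.example": "spam",
}

# Invert the graph to obtain the in-links of every host.
in_links = {h: [] for h in out_links}
for src, dsts in out_links.items():
    for dst in dsts:
        in_links.setdefault(dst, []).append(src)

print(spam_fraction(in_links["c.example"], labels))   # 2 of 3 in-links are spam
print(spam_fraction(out_links["a.example"], labels))  # both out-links are spam
```

Collecting these fractions separately for spam and non-spam hosts and binning them would give histograms of the kind shown on the slides.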
    • Idea 1: Clustering. Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting.
    • Idea 1: Clustering (cont.) Initial prediction (figure).
    • Idea 1: Clustering (cont.) Clustering (figure).
    • Idea 1: Clustering (cont.) Final prediction (figure).
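The three steps of Idea 1 can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: the clustering itself is assumed to be given as a hypothetical `cluster_of` mapping, and the tie-breaking rule (keep the initial prediction) is an assumption.

```python
from collections import Counter, defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host with the majority label of its cluster.

    Ties keep the host's own initial prediction (an assumption of this sketch).
    """
    by_cluster = defaultdict(list)
    for host, label in predictions.items():
        by_cluster[cluster_of[host]].append(label)

    smoothed = {}
    for host, label in predictions.items():
        counts = Counter(by_cluster[cluster_of[host]])
        if counts["spam"] > counts["normal"]:
            smoothed[host] = "spam"
        elif counts["normal"] > counts["spam"]:
            smoothed[host] = "normal"
        else:
            smoothed[host] = label
    return smoothed


# Hypothetical initial predictions and cluster assignment.
predictions = {"h1": "spam", "h2": "spam", "h3": "normal",
               "h4": "normal", "h5": "normal"}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1}

final = smooth_by_cluster(predictions, cluster_of)
print(final["h3"])  # h3 is outvoted by h1 and h2 in cluster 0
```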
    • Idea 1: Clustering – Results
                                 Baseline   Clustering
        Without bagging
          True positive rate     75.6%      74.5%
          False positive rate    8.5%       6.8%
          F-Measure              0.646      0.673
        With bagging
          True positive rate     78.7%      76.9%
          False positive rate    5.7%       5.0%
          F-Measure              0.723      0.728
      Reduces error rate.
    • Idea 2: Propagate the label. Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes.
    • Idea 2: Propagate the label (cont.) Initial prediction (figure).
    • Idea 2: Propagate the label (cont.) Propagation (figure).
    • Idea 2: Propagate the label (cont.) Final prediction, applying a threshold (figure).
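A random walk with restart of the kind described in Idea 2 can be sketched in Python as below. The damping factor of 0.85, the iteration count, the handling of hosts without out-links, and the toy graph are assumptions of this sketch and are not taken from the slides.

```python
def propagate(out_links, spamicity, damping=0.85, iterations=50):
    """Random walk with restart; the restart vector is the normalised spamicity.

    `out_links` maps every host to the hosts it links to (all of them must
    also be keys). The returned scores sum to 1.
    """
    hosts = list(out_links)
    total = sum(spamicity[h] for h in hosts)
    restart = {h: spamicity[h] / total for h in hosts}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {h: (1.0 - damping) * restart[h] for h in hosts}
        for src in hosts:
            targets = out_links[src]
            if targets:
                share = damping * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Host without out-links: hand its mass back to the restart vector.
                for h in hosts:
                    nxt[h] += damping * score[src] * restart[h]
        score = nxt
    return score


# Hypothetical host graph and classifier outputs.
out_links = {"s1": ["s2"], "s2": ["s1", "x"], "x": [],
             "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.8, "x": 0.5, "n1": 0.1, "n2": 0.1}

scores = propagate(out_links, spamicity)
# Hypothetical threshold: flag hosts scoring above the uniform value 1/N.
flagged = {h for h, s in scores.items() if s > 1.0 / len(scores)}
print(sorted(flagged))
```

Passing the reversed graph instead of `out_links` would propagate along in-links, which is one way to read the forward and backward variants reported in the results.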
    • Idea 2: Propagate the label – Results
                                 Baseline   Fwds.   Backwds.   Both
        Classifier without bagging
          True positive rate     75.6%      70.9%   69.4%      71.4%
          False positive rate    8.5%       6.1%    5.8%       5.8%
          F-Measure              0.646      0.665   0.664      0.676
        Classifier with bagging
          True positive rate     78.7%      76.5%   75.0%      75.2%
          False positive rate    5.7%       5.4%    4.3%       4.7%
          F-Measure              0.723      0.716   0.733      0.724
    • Idea 3: Stacked graphical learning. Meta-learning scheme [Cohen and Kou, 2006]:
        • Derive initial predictions.
        • Generate an additional attribute for each object by combining the predictions on its neighbors in the graph.
        • Append the additional attribute to the data and retrain.
    • Idea 3: Stacked graphical learning (cont.)
        • Let p(x) ∈ [0, 1] be the prediction of a classification algorithm for a host x using k features.
        • Let N(x) be the set of pages related to x (in some way).
        • Compute $f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|}$.
        • Add f(x) as an extra feature for instance x and learn a new model with k + 1 features.
    • Idea 3: Stacked graphical learning (cont.) Initial prediction (figure).
    • Idea 3: Stacked graphical learning (cont.) Computation of new feature (figure).
    • Idea 3: Stacked graphical learning (cont.) New prediction with k + 1 features (figure).
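The extra feature used in stacked graphical learning, the average of the initial predictions p(g) over the related hosts N(x), can be sketched in Python as follows. The data, the function names, and the default value used when N(x) is empty are assumptions of this sketch. The retraining step is left out because it works with any classifier.

```python
def neighbor_average(host, neighbors, p):
    """f(x): mean of p(g) over g in N(x); None when N(x) is empty."""
    related = neighbors.get(host, [])
    if not related:
        return None
    return sum(p[g] for g in related) / len(related)


def add_stacked_feature(features, neighbors, p, default=0.0):
    """Return (k+1)-dimensional vectors: the original k features plus f(x).

    `default` replaces f(x) for hosts with no related hosts (an assumption).
    """
    extended = {}
    for host, vec in features.items():
        f = neighbor_average(host, neighbors, p)
        extended[host] = list(vec) + [default if f is None else f]
    return extended


# Hypothetical initial predictions, neighbourhoods, and k = 2 features.
p = {"a": 0.9, "b": 0.2, "c": 0.7}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
features = {"a": [1.0, 0.0], "b": [0.5, 0.3], "c": [0.1, 0.1]}

extended = add_stacked_feature(features, neighbors, p)
print(extended["b"])  # original features of b followed by p(a)
```

Running a second pass would mean retraining on `extended`, taking the new predictions as `p`, and computing f(x) again.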
    • Idea 3: Stacked graphical learning - Results
                               Baseline   Avg. of in   Avg. of out   Avg. of both
        True positive rate     78.7%      84.4%        78.3%         85.2%
        False positive rate    5.7%       6.7%         4.8%          6.1%
        F-Measure              0.723      0.733        0.742         0.750
      Increases detection rate.
    • Idea 3: Stacked graphical learning x2. And repeat ...
                               Baseline   First pass   Second pass
        True positive rate     78.7%      85.2%        88.4%
        False positive rate    5.7%       6.1%         6.3%
        F-Measure              0.723      0.750        0.763
      Significant improvement over the baseline.
    • Concluding remarks
        • Considering content-based and link-based attributes improves the accuracy of the classifier.
        • Considering the links among pages improves the accuracy.
    • Thank you!
        • Web Spam Dataset: http://www.yr-bcn.es/webspam/
        • Web Spam Challenge I & II: http://webspam.lip6.fr/
        • AIRWeb Workshop: http://airweb.cse.lehigh.edu/
        • GraphLab at ECML/PKDD: http://graphlab.lip6.fr/
        • Newsletter: webspam-announces@yahoogroups.com
    • References
        • Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
        • Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
        • Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
        • Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
        • Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
        • Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
        • Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
        • Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
        • Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.