Survey: Adversarial IR on the Web

    Survey: Adversarial IR on the Web - Presentation Transcript

    1. Adversarial IR on the Web
      Carlos Castillo <chato@yahoo-inc.com>, Yahoo! Research · Barcelona
      Thanks to: Brian Davison, Lehigh University
    2. What's on the Web?
    3.  
    4. And on the Web 2.0?
    5.  
    6. What else?
    7. milliondollarhomepage.com
    8. Two webs
      • Closed web
      • Open web
      Terrence Brooks http://Informationr.net/ir/8-3/paper154.html
    9. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    10. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    11. Everything for a click
    12.  
    13.  
    14.  
    15.  
    16.  
    17.  
    18.  
    19. Origin of “spam” http://www.nowpublic.com/culture/monty-python-spam-skit http://www.flickr.com/photos/42162585@N00/262091025
    20. How much spam? Ntoulas et al. WWW 2006
    21. Search engine spam “web pages that hold no informational value, but are created to lure web searchers to sites they would otherwise not visit” Fetterly et al. WebDB 2004
    22. Spam results are still present
      • 3.3M results for “porn mortgage” in S1
      • 1.4M results for “free mp3 hilton viagra” in S2
      • 2.5M results for “credit vicodin loan” in S3
      May 2009
    23. Who?
      • Activists
      • Marketers
      • Optimizers
      • Spammers
    24. Why?
      • Activists
      • Marketers – for money
      • Optimizers – for money
      • Spammers – for money
    25. Costs
      • Users
      • Search engines
      • Publishers
    26. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    27. Machine learning system (diagram; labels: web pages, features, training labels, learning; sample feature vector: 0.3 0.9 1.7 4.5 3.2 0.0)
    28. Supervised learning
      • Obtain training data
      • Extract features
        • Index-time features: contents, links
        • Crawl-time, rank-time features
        • Usage features
      • Choose a learning scheme
        • Non-graphical / graphical
      • Evaluate
    29. Training data: do editors agree? Sometimes (κ = 0.56) Castillo et al. http://www.acm.org/sigs/sigir/forum/2006D-TOC.html
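The agreement figure on this slide reads as a kappa statistic. As a rough sketch of how such a number is obtained, the following computes Cohen's kappa for two editors. The labels are hypothetical toy data, not the labels from the cited collection, and `cohen_kappa` is an illustrative helper name.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels for ten hosts.
editor_1 = ["spam", "spam", "normal", "normal", "normal",
            "spam", "normal", "normal", "normal", "normal"]
editor_2 = ["spam", "normal", "normal", "normal", "normal",
            "spam", "spam", "normal", "normal", "normal"]
kappa = cohen_kappa(editor_1, editor_2)
print(round(kappa, 3))
```

Here the editors agree on 8 of 10 hosts, but because both label mostly "normal", much of that agreement is expected by chance, which is why kappa is noticeably lower than the raw agreement rate.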
    30. Learning process (diagram; labels: features, training set; sample feature vector: 0.3 0.9 1.7 4.5 3.2 0.0)
    31. (e.g.) decision trees
      • Root: 100 pages, split on x (x <= a / x > a)
        • 60 pages = 50 spam + 10 normal
        • 40 pages, split on y (y <= b / y > b)
          • 35 pages = 34 normal + 1 spam
          • 5 pages = 4 spam + 1 normal
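To make the decision tree slide concrete, here is a small sketch that takes the three leaf counts from the slide, assigns each leaf its majority label, and measures how much the splits reduce Gini impurity. The choice of Gini impurity is an assumption for illustration; the slide does not say which splitting criterion is used.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Leaf counts as shown on the slide.
leaves = [
    {"spam": 50, "normal": 10},  # 60 pages
    {"spam": 1, "normal": 34},   # 35 pages
    {"spam": 4, "normal": 1},    # 5 pages
]

# The root holds all pages before any split.
root = {
    "spam": sum(leaf["spam"] for leaf in leaves),
    "normal": sum(leaf["normal"] for leaf in leaves),
}
total_pages = sum(root.values())

# Each leaf predicts its majority class.
predictions = [max(leaf, key=leaf.get) for leaf in leaves]
correct = sum(leaf[label] for leaf, label in zip(leaves, predictions))
training_accuracy = correct / total_pages

# Size-weighted impurity of the leaves, to compare with the root.
weighted_leaf_gini = sum(
    sum(leaf.values()) * gini(leaf) for leaf in leaves
) / total_pages

print(predictions, training_accuracy)
print(round(gini(root), 3), round(weighted_leaf_gini, 3))
```

With these counts the tree mislabels 12 of the 100 training pages, and the weighted leaf impurity is well below the impurity at the root.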
    32. Graphical learning
      • Nodes are not isolated
      • N->S (a normal page linking to a spam page) is unlikely
    33. Castillo et al. SIGIR 2007
    34. Average spam fraction of in-links Castillo et al. SIGIR 2007
    35. Average spam fraction of out-links Castillo et al. SIGIR 2007
    36. Weight out-links a lot, in-links a little Abernethy, Chapelle and Castillo, AIRWeb 2008
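The link-based features named on the preceding slides can be sketched as follows: for each host, compute the fraction of its labelled in-neighbours and out-neighbours that are spam. The graph, host names, and labels below are hypothetical toy data, and this is only an illustration of the feature, not the method of the cited papers.

```python
from collections import defaultdict


def spam_fraction(neighbors, labels):
    """Fraction of labelled neighbours that are spam, or None if none are labelled."""
    labelled = [labels[n] for n in neighbors if n in labels]
    if not labelled:
        return None
    return sum(label == "spam" for label in labelled) / len(labelled)


# Hypothetical host graph as (source, destination) edges.
edges = [
    ("n1", "n2"), ("n2", "n1"), ("n1", "n3"),
    ("s1", "s2"), ("s2", "s1"), ("s1", "s3"), ("s3", "s1"),
    ("s1", "n1"), ("s2", "n1"),  # spam hosts linking to a normal host
]
labels = {"n1": "normal", "n2": "normal", "n3": "normal",
          "s1": "spam", "s2": "spam", "s3": "spam"}

out_links = defaultdict(set)
in_links = defaultdict(set)
for src, dst in edges:
    out_links[src].add(dst)
    in_links[dst].add(src)

out_fraction = {h: spam_fraction(out_links[h], labels) for h in labels}
in_fraction = {h: spam_fraction(in_links[h], labels) for h in labels}

print(out_fraction["n1"], in_fraction["n1"])
print(out_fraction["s1"], in_fraction["s1"])
```

In this toy graph the normal host n1 has a high spam fraction among its in-links, because spam hosts are free to link to it, while its out-links contain no spam. That asymmetry is one plausible reading of why out-links would carry more weight than in-links.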
    37. Evaluation
      • Point-wise measures: TPR, FPR, F1
      • Most schemes output a family of classifiers
      • Area under ROC
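A minimal sketch of the measures listed on this slide, assuming a classifier that outputs a spam score per page. A single threshold gives the point-wise measures; the area under the ROC curve summarises the whole family of thresholds. The scores and labels are hypothetical, and both classes are assumed to be present.

```python
def pointwise_measures(scores, labels, threshold):
    """TPR, FPR and F1 when pages with score >= threshold are called spam."""
    tp = fp = tn = fn = 0
    for score, is_spam in zip(scores, labels):
        predicted_spam = score >= threshold
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1
        elif not predicted_spam and is_spam:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return tpr, fpr, f1


def area_under_roc(scores, labels):
    """AUC as the probability that a spam page outscores a normal page."""
    spam = [s for s, y in zip(scores, labels) if y]
    normal = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in spam for n in normal
    )
    return wins / (len(spam) * len(normal))


# Hypothetical classifier scores; label 1 = spam, 0 = normal.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

tpr, fpr, f1 = pointwise_measures(scores, labels, threshold=0.5)
auc = area_under_roc(scores, labels)
print(tpr, fpr, f1, auc)
```

Moving the threshold trades TPR against FPR, which is why a threshold-free summary such as the area under the ROC curve is convenient for comparing learning schemes.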
    38. ROC curve (axes: false negative rate, false positive rate) http://wwwiti.cs.uni-magdeburg.de/~sschimke/sose05/15-platasign/Evaluation_of_Biometric_Systems.html
    39. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    40. Content spam
      • Why? query-document matching
      • How? repeat / weave / stitch / copy
      • Which features are useful?
        • URL/title length, document size, compression rate, fraction of popular terms
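The content features listed above are cheap to compute per page. The sketch below is one possible reading of them: "compression rate" is taken here as uncompressed size divided by zlib-compressed size, and the popular-term list is a hypothetical stand-in for one derived from corpus or query-log statistics. The function name and example pages are illustrative only.

```python
import zlib


def content_features(url, title, body, popular_terms):
    """Simple content-based features for one page."""
    raw = body.encode("utf-8")
    words = body.lower().split()
    compressed_size = len(zlib.compress(raw))
    return {
        "url_length": len(url),                 # characters
        "title_length": len(title.split()),     # words
        "document_size": len(raw),              # bytes
        # Highly repetitive text compresses well, giving a large ratio.
        "compression_ratio": len(raw) / compressed_size if raw else 0.0,
        "popular_term_fraction": (
            sum(w in popular_terms for w in words) / len(words) if words else 0.0
        ),
    }


# Hypothetical list of popular terms.
POPULAR_TERMS = {"buy", "cheap", "online", "free"}

spam_page = content_features(
    "http://example.com/buy-cheap-viagra-online-buy-cheap",
    "buy cheap viagra online buy cheap viagra online",
    "buy cheap viagra online " * 200,
    POPULAR_TERMS,
)
normal_page = content_features(
    "http://example.org/about",
    "About us",
    "We are a small research group studying information retrieval "
    "and the structure of hyperlinked document collections.",
    POPULAR_TERMS,
)
print(spam_page["compression_ratio"], normal_page["compression_ratio"])
```

Keyword repetition shows up as an unusually high compression ratio and a high fraction of popular terms. Neither feature is conclusive alone, which is why they are typically fed to a classifier together.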
    41. Methods
      • Document classification
      • Language model disagreement
      • Coding style similarity
      • Near-duplicate detection (for plagiarized content)
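For the near-duplicate detection item, a common approach is to compare documents by their sets of word shingles. The following is a minimal sketch using exact Jaccard similarity over word 3-grams; it is a generic technique, not necessarily what any system mentioned in the slides uses, and large-scale systems usually approximate it with hashing.

```python
def shingles(text, k=3):
    """Set of k-word shingles (contiguous word k-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy cat"
unrelated = "search engines rank pages using many different signals"

similarity_copy = jaccard(shingles(original), shingles(copied))
similarity_unrelated = jaccard(shingles(original), shingles(unrelated))
print(similarity_copy, similarity_unrelated)
```

A page stitched together from copied passages shares many shingles with its sources, so a high similarity against known pages is a hint of plagiarised content.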
    42. Cloaking: different contents at the same URL
      • Search engine: receives the normal document
      • User (clicks on the result in the search engine results page): receives “Buy viagra now!”
    43. Redirection: redirect hidden from search engine
      • Search engine: sees the normal document
      • User (clicks on the result in the search engine results page): is redirected to “Buy viagra now!”
    44. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    45. Link farms
      • Why? Pagerank/HITS and other link analysis methods
      • Sybil attacks/mutual admiration societies
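To illustrate why link farms pay off against link analysis methods, the sketch below runs a plain power-iteration PageRank on a tiny hypothetical graph, first without and then with a farm of pages that all point to a target page. The graph, the page names, and the damping factor of 0.85 are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps each page to its out-links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        for v in nodes:
            targets = links.get(v)
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        for v in nodes:
            new_rank[v] += damping * dangling / n
        rank = new_rank
    return rank


# Hypothetical small web: the target page has no in-links at first.
base = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}
base_rank = pagerank(base)

# Add a farm: five pages that link to the target, which links back to them.
farm = [f"farm{i}" for i in range(5)]
farmed = dict(base)
farmed["target"] = ["a"] + farm
for page in farm:
    farmed[page] = ["target"]
farmed_rank = pagerank(farmed)

print(base_rank["target"], farmed_rank["target"])
```

The target's score rises several-fold even though no legitimate page links to it, which is the incentive behind Sybil-style farms.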
    46. Detection methods
      • Finding dense sub-graphs
      • Finding anomalous linking patterns
      • Finding abnormal link change rates
    47. Finding dense sub-graphs is hard http://www.flickr.com/photos/docco/771439289/
    48. Demotion methods
      • De-duping Sybils
      • Down-weighting “unqualified” links
    49. Propagation of trust and distrust (diagram: “trusted” nodes, suspicious nodes) Gyongyi, Garcia-Molina and Pedersen, VLDB 2004
    50. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
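Trust propagation can be sketched as a PageRank variant in which the random jump goes only to a set of hand-picked trusted seed pages, so trust flows along out-links from the seeds. This is a simplified illustration in the spirit of the cited work, not a reproduction of it: the graph and seed set are hypothetical, distrust propagation is omitted, and trust reaching a page without out-links is simply dropped.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along out-links."""
    nodes = set(links) | set(trusted_seeds)
    for targets in links.values():
        nodes.update(targets)
    # All of the jump probability is concentrated on the trusted seeds.
    seed_mass = {
        v: (1.0 / len(trusted_seeds) if v in trusted_seeds else 0.0)
        for v in nodes
    }
    trust = dict(seed_mass)
    for _ in range(iterations):
        new_trust = {v: (1.0 - damping) * seed_mass[v] for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * trust[v] / len(targets)
                for t in targets:
                    new_trust[t] += share
        trust = new_trust
    return trust


# Hypothetical graph: spam pages link to good pages, but not the reverse.
links = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(links, trusted_seeds={"good1"})
print(sorted(trust.items()))
```

Because no trusted page links into the spam cluster, the spam pages receive no trust no matter how densely they link to each other or to good pages.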
    51. Advertising on the Web http://www.flickr.com/photos/66606673@N00/1308955915
    52. Click fraud
      • Why? For (many) pennies
      • Two types
        • Affiliate click fraud (publisher clicks on his/her ads)
        • Competitive click fraud (click on ads of competition)
      http://www.jameshyman.com/blog/archives/2006_01.html
    53. Search log spam
      • Why? Spammers will try anything
        • Manipulate search suggestion, expected click rates, etc.
    54. Automated search traffic Buehrer, Stokes and Chellapilla, AIRWeb 2008
    55. Usage data can help fight spam
      • Many spammers can be easily spotted
      • Using search logs
        • Abnormal fraction of popular terms
      • Using toolbar logs
        • Abnormal dwell time or bounce rate
        • Abnormal fraction of search-engine visits
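The toolbar-log signals above can be turned into per-host features with a simple aggregation. The sketch below assumes a hypothetical log schema of (host, referrer type, dwell time in seconds, pages viewed) and treats a single-page visit as a bounce; real toolbar logs and bounce definitions will differ.

```python
from collections import defaultdict
from statistics import mean


def usage_features(visits):
    """Aggregate per-host usage features from visit records."""
    by_host = defaultdict(list)
    for host, referrer, dwell_seconds, pages_viewed in visits:
        by_host[host].append((referrer, dwell_seconds, pages_viewed))
    features = {}
    for host, rows in by_host.items():
        n = len(rows)
        features[host] = {
            "search_visit_fraction": sum(r == "search" for r, _, _ in rows) / n,
            "bounce_rate": sum(p == 1 for _, _, p in rows) / n,
            "mean_dwell_seconds": mean(d for _, d, _ in rows),
        }
    return features


# Hypothetical visit records: (host, referrer type, dwell seconds, pages viewed).
visits = [
    ("news.example", "search", 120, 3),
    ("news.example", "direct", 300, 5),
    ("news.example", "link", 60, 1),
    ("news.example", "direct", 200, 4),
    ("pills.example", "search", 3, 1),
    ("pills.example", "search", 5, 1),
    ("pills.example", "search", 2, 1),
    ("pills.example", "search", 6, 2),
]
features = usage_features(visits)
print(features["news.example"])
print(features["pills.example"])
```

A host reached only through search engines, with very short visits and a high bounce rate, looks unlike an ordinary site. What counts as "abnormal" would have to be calibrated against the distribution over known normal hosts.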
    56. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    57. http://www.maxpower.ca/wp-content/uploads/2006/10/bad_splog.png
    58. Splog update regularities (panels: normal blog, power blog, splog) Lin et al. AIRWeb 2007
    59. Publicly-writeable pages
      • Wikis
        • Spam/vandalism
      • Blog comments
    60. Comment spam http://hojtsy.hu/blog/2008-apr-01/thanks-mollom-protecting-blog-spam
    61. Social network spam
      • Fake content
        • “suicide videos on YouTube”
        • Opinion/review spam
        • Self promotion (where is the limit?)
      • Tagging/voting spam
    62. Fake reviews http://www.dilbert.com/
    63. Outline: What is web spam · Web spam detection · Content analysis · Links and trust · Usage analysis · Social media sites · Conclusions
    64. SEO's perspective “[I]f search engines do not flag [some spam] pages today, some day they will. Beyond search engine smarts, overoptimized pages also leave you vulnerable to being reported by your competitors to search engines for spamming – causing a human editor to look check the page and possibly ban your sites” Moran and Hunt “Search Engine Marketing, Inc” 2006
    65. Search engine's perspective “Victory does not require perfection, just a rate of detection that alters the economic balance for the would-be spammer” Ntoulas et al., WWW 2006
    66. Main research topic
      • How to provide the maximum freedom possible to users, while keeping the systems open and generative?
      • Systems that allow cooperative users to defeat non-cooperative users easily
    67. Carlos Castillo · Yahoo! Research · Barcelona buy vicodin xanax viagra online generic cheap tramadol prevacid triamcinolone mesothiloma asbestos home insurance weight loss valium mortgage mp3 ripper music free Thank you!

    Carlos Castillo, 5 months ago


    Slideshow about adversarial IR on the Web, by ChaTo

    CC Attribution License
