ppt presentation


  • -- Put more details on slide -- SPAM SPAM
  • Transition better to 2nd slide. More motivation – why SEO is important -- screenshot of search results -- More than spam, possible exploits -- more coherent story about comment spam -- Moderation nightmare
  • Users want to see useful information. They want to participate in forums, they want to blog, go shopping without being bombarded by irrelevant ads. And of course, everyone has the right to surf the web without fear of being attacked by this or that exploit. Search engines try to point users to quality pages through good search results. They’re also partially motivated by money earned through ads.
  • More reasons here… Define web forum
  • More on trackbacks or pingbacks (how do they work? why do they exist?) -- similarity based on layout, COLOR, backgrounds -- Captcha can’t be used -- more difficult to moderate trackbacks/pingbacks
  • Content-based analysis We get all the doorway pages + the destination. End game is to direct traffic to the destination Why we chose context-based analysis over content-based -- Define -- Related
  • Thumbnails. Define 3rd-party domain here.
  • More detail on the process of recording pages. Define 3rd-party domain and “seeded known spammer domains”. Mention the double funnel -- blacklist, whitelist, spam policies
  • Also do picture for crawler-browser
  • 1st image: Konqueror masquerading as Wget (which doesn’t handle JavaScript). The 2nd image shows Konqueror sending its correct user-agent ID.
  • Use circles/emphasize current graph. shrink
  • First we look at the extent to which web forums are spammed, from the perspective of the web user. Presumably, this is because the spammer has been very busy in leaving his URLs all over the web. And again, the URLs being left about are doorway pages, which are more expendable than actual domains.
  • WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin. A mix of languages (Perl, PHP), hosted/non-hosted. 9 different software packages – highlight differences rather than names -- list all, but more readable (maybe red circles & graphically)
  • Top 5 numbers. Show more non-spammy words -- .edu & .gov sites (why web forums as well) Why is this bad?? (for every perspective)
  • Expand the graph. Growth keeps continuing. Spammers are still visiting. Exponential growth seen on all 3
  • Change colors. Sum 3 lines -- Shift the number. Mark the important dates -- 2nd graph to show rate of change -- mention length of experiment
  • Include percentages
  • Put numbers here -- Google has resources
  • Blogspoint + blogstudio share spammers *** numbers!! Graph/table showing all 4 webhosts Why isn’t spam consistent across Consistent metrics
  • Why are .edu/.gov redirs troublesome
  • Less time on this. Don’t read out loud Highlight how ours differs/relates, their shortcomings (cloaking).
  • Move WWW paper info to: APPLICATION/FUTURE WORK/IMPACT Explain how useful results are to search engines/forum.
  • Add citation: title, partial names, WWW 2007. Add homepage URL. Don’t mention morals.

    1. A Quantitative Study of Forum Spamming Using Context-Based Analysis
         Yi-Min Wang^, Ming Ma^, Yuan Niu*, Hao Chen*, Francis Hsu*
         *UC Davis, ^Microsoft Research
    2. A Look at the Web [image labels: User, Spammer]
    3. Why do we care about spam?
         • Users want to
             • Look at quality pages on the web
             • Interact without the trouble of moderation
             • Surf safely
         • Search engines want to
             • Provide good search results
             • Profit from ads
         • We want to investigate the landscape of the problem
             • Popular battleground: web forums
    4. Why Web Forums?
         • Open communities: wiki, forums, blogs
         • Increasingly easy to contribute
    5. Why Web Forums?
    6. How Spammers Operate
         [Diagram] Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain
         Numbered edges: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User
         Unnumbered edge: Returns
    7. How to deal with the problem?
         • Content-based approach
             • Constrained by content retrieved
             • May be deceived by tricks like cloaking and redirection
         • We propose: context-based analysis
    8. Context-based Analysis
         • Consisting of
             • Redirection analysis
             • Cloaking analysis
         • See dynamic content not served to crawlers
             • Use the Strider URL Tracer
         • Flag the large number of doorway pages that lead to spam domains
         • Based on the intuition that:
             • Publishing links is necessary to increase popularity
             • We must see the destination URL eventually
    9. Doorways & Redirections
         Google search: Coach handbag
    10. Redirection Analysis
         • Fed URLs to Strider URL Tracer, which records all pages visited
             • Ranked top 3rd-party domains by redirections
         • Seeded known spammer domains
         • Identified doorway pages based on association with spammer domains
         • Manually investigated unknown domains to expand the blacklist
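The ranking and flagging steps on the Redirection Analysis slide can be sketched roughly as follows. This is a minimal illustration, not the Strider URL Tracer itself: the trace format (a dict from doorway URL to the list of URLs visited), the helper names, and the `.example` hosts are all assumptions, and "third party" is approximated by a plain hostname comparison.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lower-cased hostname of a URL (hypothetical helper)."""
    return (urlsplit(url).hostname or "").lower()


def rank_third_party_domains(traces):
    """Count, per third-party host, how many doorway URLs led to it.

    `traces` maps each doorway URL to the list of URLs visited while loading it.
    """
    counts = Counter()
    for doorway, visited in traces.items():
        own = host(doorway)
        # Count each third-party host at most once per doorway.
        third_party = {host(u) for u in visited} - {own, ""}
        counts.update(third_party)
    return counts.most_common()


def flag_doorways(traces, blacklist):
    """Return doorway URLs whose trace touches a blacklisted host."""
    return sorted(
        doorway
        for doorway, visited in traces.items()
        if any(host(u) in blacklist for u in visited)
    )


# Made-up traces, for illustration only.
traces = {
    "http://blog-a.example/post1": ["http://blog-a.example/post1", "http://spam-target.example/land"],
    "http://blog-b.example/post2": ["http://blog-b.example/post2", "http://spam-target.example/land"],
    "http://blog-c.example/post3": ["http://blog-c.example/post3", "http://cdn.example/style.css"],
}
ranking = rank_third_party_domains(traces)
flagged = flag_doorways(traces, blacklist={"spam-target.example"})
```

In this sketch, hosts that rank high but are not yet on the blacklist would be the candidates for the manual investigation step.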
    11. Cloaking Analysis
         • Diff-based check
             • Run each URL twice – once with anti-cloaking, once without
         • Crawler-browser cloaking (User-agent, scripting on/off)
         • Click-through cloaking (Referer)
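A diff-based check of this kind might look like the sketch below. It is only an approximation under stated assumptions: the page fetcher is injected as a callable (so no real network library is assumed), the header values and the 0.9 similarity threshold are illustrative, and only the User-Agent and Referer variations are covered, not the scripting on/off variation.

```python
import difflib

# Illustrative header sets (assumed values, not taken from the slides).
CRAWLER_HEADERS = {"User-Agent": "Wget/1.10"}
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://www.google.com/search?q=ringtones",
}


def looks_cloaked(fetch, url, threshold=0.9):
    """Fetch `url` as a crawler and as a click-through browser; compare the pages.

    `fetch(url, headers)` is a caller-supplied function returning page text.
    Returns True when the two views are less similar than `threshold`.
    """
    crawler_view = fetch(url, CRAWLER_HEADERS)
    browser_view = fetch(url, BROWSER_HEADERS)
    similarity = difflib.SequenceMatcher(None, crawler_view, browser_view).ratio()
    return similarity < threshold


def fake_fetch(url, headers):
    """Stand-in fetcher: one host serves keyword text to crawlers and ads to click-throughs."""
    if "cloaked" in url and "Referer" in headers:
        return "<html>ads ads ads buy now</html>"
    if "cloaked" in url:
        return "<html>ringtones download free ringtones forum post</html>"
    return "<html>an ordinary page</html>"


cloaked = looks_cloaked(fake_fetch, "http://cloaked.example/")
plain = looks_cloaked(fake_fetch, "http://plain.example/")
```

Real pages often change between two loads for innocent reasons (rotating ads, timestamps), so a threshold like this would need tuning before it could be trusted.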
    12. Crawler-Browser Cloaking
         Google search: ringtones download
         • www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript disabled)
         • www.welcometuscany.it/images/_notes/xc/26/Ringtones-Download.html (JavaScript enabled)
    13. Crawler-Browser Cloaking
    14. Click-Through Cloaking
         [Screenshot labels] Cached page / scripting off / crawler view; Advertising page from click-throughs; Directly visiting the page
    15. Three Perspectives
         [Diagram] Nodes: Spammer, Doorway Pages (Splogs), Comment Spam, Search Engine, Search Results, Spammer Domain
         Numbered edges: 1. Creates; 2. Writes Splog URLs; 3. Propagates Splog URL; 4. Sends User to Doorway URL; 5. Redirects User
         Unnumbered edge: Returns
         Perspective labels: Search User, Webhost
    16. Search User
    17. Search User
         • Chose 9 popular forum software packages – written in Perl/PHP, hosted/unhosted
             • WWWBoard, Hypernews, Ikonboard, Ezboard, Bravenet, Invision Board, Phpbb, Phorum, and VBulletin
         • Compiled popular tags and common spam terms – list of 190 keywords
             • “Myspace, jewelry, casino, shopping, baseball…”
         • Searched for all <keyword, forum-software> pairs in Google & MSN
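Enumerating the <keyword, forum-software> pairs is a Cartesian product, which the sketch below illustrates. The slides do not say how each pair was phrased as a search query, so the query string format here is an assumption, and only the five keywords quoted on the slide are used rather than the full list of 190.

```python
from itertools import product

FORUM_SOFTWARE = [
    "WWWBoard", "Hypernews", "Ikonboard", "Ezboard", "Bravenet",
    "Invision Board", "Phpbb", "Phorum", "VBulletin",
]
# Only the sample keywords quoted on the slide; the full list had 190 entries.
SAMPLE_KEYWORDS = ["Myspace", "jewelry", "casino", "shopping", "baseball"]


def build_queries(keywords, forums):
    """Return one query string per <keyword, forum-software> pair (format is assumed)."""
    return [f'{keyword} "{forum}"' for keyword, forum in product(keywords, forums)]


queries = build_queries(SAMPLE_KEYWORDS, FORUM_SOFTWARE)
# With the full keyword list this would be 190 * 9 pairs per search engine.
full_pair_count = 190 * len(FORUM_SOFTWARE)
```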
    18. Search User
         • Search terms returned spammed forums in top 20 results from both Google and MSN
             • Only exception is “palm-texas-holdem-game”
         • Top 5 most spammed forums:

             Forum                                                 Pages   Keywords
             http://fs.fed.us/...mm/get/mmforumA.html                175        102
             http://www.comm.fsu.edu/interactive/forum/              134         82
             http://www.usra.edu/phorum                              119         94
             http://classicauthors.net/messageboard/list.php?f=1     117         97
             http://samba.eecs.umich.edu/phorum/list.php?2           105         79
    19. Honeyblogs
         • Spammers:
             • Create their own doorway pages, and
             • Promote the doorways by posting to other people’s pages
         • Honeyblogs lure the spammer in:
             • No moderation, default accept-all policy
             • Pinged blog aggregators with every post
             • Abandoned within three months
    20. Honeyblogs
         • 41,100 comments collected over 339 days
         • 19,297 comments received in the last month
             • Ilium – 930/1432
             • Litlog – 3734/5714
         • Spammer activity got me kicked off my hosting server
    21. Honeyblog Activity
    22. Honeyblog Activity [chart; data label: 3142]
    23. Webhost Perspective
         • Focus on splog doorways
         • The numbers in the table are lower bounds
             • Consider only pages using cloaking & redirection

             Blog Host      Examined URLs   Spam URLs        URLs Using Cloaking
             Blogspot              13,389   1,091 (8.1%)                     652
             Blogspoint             4,714   3,535 (75%)                      131
             Blogstudio               369     198 (54%)                        0
             Blogsharing               99      82 (83%)                        0
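The percentages in the table follow directly from the counts, as the short check below shows. The counts are copied from the table; the dictionary layout and helper name are illustrative choices only.

```python
# Counts from the Webhost Perspective table:
# (examined URLs, spam URLs, spam URLs using cloaking).
HOSTS = {
    "Blogspot": (13389, 1091, 652),
    "Blogspoint": (4714, 3535, 131),
    "Blogstudio": (369, 198, 0),
    "Blogsharing": (99, 82, 0),
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of examined URLs that were spam, per host.
spam_rates = {name: percent(spam, examined) for name, (examined, spam, _) in HOSTS.items()}
# Share of spam URLs that used cloaking, per host.
cloaking_share = {name: percent(cloaked, spam) for name, (_, spam, cloaked) in HOSTS.items()}
```

The Blogspot cloaking share computed this way (about 60%) matches the figure quoted on the following slide.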
    24. Webhost Perspective
         • Blogspot: 1,091 splogs
             • Most popular
             • Randomly sampled 1% of profile pages created in July and extracted all blog links – 13,389
             • 60% of splogs used cloaking
             • 24% of splogs redirected to filldirect.com
    25. Webhost Perspective
         • Blogspoint: 3,535 splogs
             • 2,166 redirected to finance-web-search.com
             • 917 redirected to casino-web-search.com
         • Blogstudio: 198 splogs
             • 130 redirected to finance-web-search.com
             • 54 redirected to casino-web-search.com
         • Blogsharing: 82 splogs
             • Plumber-related link spamming in splogs
    26. Also of note…
         • Malicious URLs
             • Previous work by MSR (Strider HoneyMonkey) [1] discovered sites that actively exploit browser vulnerabilities
             • We tested 8 known malicious URLs for presence on the web
                 • Found 5 spammed in forums, 2 in link farms, 1 in referrer logs
         • Universal redirectors
             • Redirect the user to any URL (sometimes the destination is obfuscated):
                 • www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=[your url here]
                 • http://tinyurl.com/3c7twl
                     • http://www.canadianpharmacyltd.com/group.php?id=59&aid=8 60
             • Could be used to serve malicious URLs, particularly those on .edu and .gov sites
         [1] Yi-Min Wang, et al. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites That Exploit Browser Vulnerabilities. NDSS, 2006.
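One simple way to spot the unobfuscated form of a universal redirector is to look for a query parameter whose value is itself an absolute URL. The sketch below is a heuristic under that assumption, not the method used in the study; the function name and the `destination.example` target are made up, and the click.cgi path is the one shown on the slide.

```python
from urllib.parse import parse_qsl, urlsplit


def embedded_targets(url):
    """Return query-parameter values that look like absolute http(s) URLs."""
    if "://" not in url.split("?", 1)[0]:
        # Slide URLs sometimes omit the scheme; assume http so the URL parses.
        url = "http://" + url
    query = urlsplit(url).query
    return [
        value
        for _, value in parse_qsl(query)
        if value.lower().startswith(("http://", "https://"))
    ]


redirector = "www.rit.edu/~ksa/cgi-bin/splinks/click.cgi?num=2&url=http://destination.example/page"
targets = embedded_targets(redirector)

# A shortened link carries no visible destination, so this heuristic finds nothing.
shortened_targets = embedded_targets("http://tinyurl.com/3c7twl")
```

Obfuscated destinations, such as the shortened link above, would still need to be followed by a tracer to reveal where they lead.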
    27. Related Work (Part 1)
         • Diff-based cloaking
             • Wu & Davison – Diff-based cloaking detection combined with content-based analysis
             • Our approach also detects click-through cloaking
         • Content-based approaches
             • Fetterly, Manasse and Najork – URL properties, clustering pages of similar content
             • Mishne, Carmel, Lempel – Compared statistical models of comments & target pages against post content
             • Kolari, Finin and Joshi – Meta tag text, anchor text, URLs
             • Our approach is complementary to content-based approaches
    28. Related Work (Part 2)
         • Measurements of trust
             • Metaxas et al. – Defined trust neighborhoods
             • Benczur et al. – SpamRank: Identify outliers by looking at PageRank of the site and its “supporters”
             • Similarly, our approach propagates distrust by following redirections
         • Plugins to aid moderating forums/blogs
             • Akismet
             • Bad Behavior, Spam Karma
             • Our approach does not require cooperation from forum owners
    29. Conclusions
         • Context-based approach successfully detects advanced cloaking- and redirection-based spam
         • Spammers are pervasive
             • 189 of 190 search terms returned spammed forums in the top 20 search results from both Google and MSN
             • Same spammer redirecting to two domains on Blogspoint and Blogstudio
    30. Future work
         • There is hope!
             • Economic solution
             • Identifies middlemen in online advertising
         • Read our WWW07 paper [1]
         • http://wwwcsif.cs.ucdavis.edu/~niu
         • http://research.microsoft.com/csm/strider/
         [1] Yi-Min Wang et al. Spam Double-Funnel: Connecting Web Spammers with Advertisers. WWW 2007.