  1. WEB SPAM (7 / 12)
     Presented by Kautilya, Roll No: 36
  2. INTRODUCTION: WEB SEARCH
     • Web search: hundreds of millions of people access the Web through search engines, issuing hundreds of millions of queries per day. Hence, queries + people = TRAFFIC
     • Web site owners want to attract this traffic and rank high in search engines in order to
       – Communicate some message, e.g. commercial, political, religious, etc.
       – Install viruses, adware, etc.
  3. WEB SPAM: DEFINITION
     Web spam can be defined as any intentional activity by a human to generate an unreasonably favorable result or importance for a web page that naturally should not have the weight or significance associated with it. [1]
     In other words: the practice of manipulating web pages in order to cause search engines to rank some web pages higher than they would without any manipulation.
  4. WEB SPAMMERS' ACTIVITIES
     [Figure: search engine pipeline. Labels: The Web; Index the text of documents; Inverted Index; Query; Get indices for relevant documents; Document IDs; Rank results; Retrieve full relevant documents; Display results on a web page; Search Engine Servers]
     Web spammers target the last step.
  5. WEB SPAM IS BAD
     • Bad for users
       – Makes it harder to satisfy an information need
       – Leads to a frustrating search experience
     • Bad for search engines
       – Burns crawling bandwidth
       – Pollutes the corpus (infinite number of spam pages!)
       – Distorts the ranking of results
  6. HISTORY
     • 1st-generation search engine companies
       – Web spam first appeared with these engines in the 1990s
       – The technique came to be known as 'Glittering Generalities'
     • 2nd-generation search engine companies
       – Neutralized Glittering Generalities
       – Ranked pages according to their popularity
       – Popularity was determined by the links pointing to the web page
       – Spammers built link farms to circumvent it
     • 3rd-generation search engine companies
       – Use PageRank and the HITS algorithm to rank pages
       – Spammers have found new ways as well!
  7. SPAMMING TECHNIQUES
     • Boosting rank
       – Term spamming: manipulating the text of web pages in order to appear relevant to queries
       – Link spamming: creating link structures that boost PageRank or hubs-and-authorities scores
     • Hiding techniques
       – Content hiding: use the same color for text and page background
       – Cloaking: return different pages to crawlers and browsers
       – Redirecting: an alternative to cloaking; redirects are followed by browsers but not by crawlers
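A minimal sketch of how the cloaking item above could be checked, assuming the same URL has already been fetched twice (once as a crawler, once as a browser) and the two HTML strings are at hand. The function names, the sample pages, and the 0.5 threshold are illustrative assumptions, not something the slides specify.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # crude tag stripper, enough for a sketch
WORD_RE = re.compile(r"[a-z0-9]+")


def visible_words(html):
    """Return the set of lowercase words left after stripping tags."""
    text = TAG_RE.sub(" ", html).lower()
    return set(WORD_RE.findall(text))


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(crawler_html, browser_html, threshold=0.5):
    """Flag a page whose crawler copy and browser copy share few words."""
    similarity = jaccard(visible_words(crawler_html), visible_words(browser_html))
    return similarity < threshold


# Hypothetical copies of one URL as served to a crawler and to a browser.
crawler_copy = "<html><body><h1>Cheap pills</h1> buy cheap pills online free</body></html>"
browser_copy = "<html><body><h1>Welcome</h1> casino bonus sign up now</body></html>"

print(looks_cloaked(crawler_copy, browser_copy))
print(looks_cloaked(crawler_copy, crawler_copy))
```

Legitimate pages also vary between requests (ads, timestamps, personalization), so a word-overlap threshold like this is only a first filter.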
  8. TERM SPAMMING
     • Repetition
       – Of one or a few specific terms, e.g. free, cheap, Viagra
       – Goal is to subvert TF.IDF ranking schemes
     • Dumping
       – Of a large number of unrelated terms
       – e.g. copying entire dictionaries
     • Weaving
       – Copy legitimate pages and insert spam terms at random positions
     • Phrase stitching
       – Glue together sentences and phrases from different sources
     Term spam targets
     • Body of the web page
     • Title
     • URL
     • HTML meta tags
     • Anchor text
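To see why repetition subverts TF.IDF, here is a small sketch under a deliberately naive assumption: term frequency is the raw count and IDF is log(N / df). The toy corpus and function name are made up for illustration. Real engines usually dampen term frequency, which blunts this effect.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Raw-count TF times log(N / df); returns 0.0 for unseen terms."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# Hypothetical toy corpus: one honest page, one copy stuffed with "cheap".
honest = "cheap flights to rome and cheap hotels".split()
stuffed = honest + ["cheap"] * 20
corpus = [
    honest,
    stuffed,
    "weather forecast for the weekend".split(),
    "lecture notes on information retrieval".split(),
]

honest_score = tf_idf("cheap", honest, corpus)
stuffed_score = tf_idf("cheap", stuffed, corpus)
print(honest_score, stuffed_score)
```

The stuffed page contains the term 22 times instead of 2, so under raw-count TF its score for "cheap" is 11 times higher even though it carries no extra information.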
  9. LINK SPAM
     • Three kinds of web pages from a spammer's point of view
       – Inaccessible pages
       – Accessible pages
         · e.g. web log comment pages
         · The spammer can post links to his pages
       – Own pages
         · Completely controlled by the spammer
         · May span multiple domain names
     • Spammer's goal
       – Maximize the PageRank of target page t
     • Technique
       – Get as many links as possible from accessible pages to target page t
       – Construct a "link farm" to get a PageRank multiplier effect
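The multiplier effect can be sketched with a plain power-iteration PageRank. Everything here is an assumption for illustration: the tiny graph, the page names, the damping factor 0.85, and the farm layout in which target page t links to every farm page and every farm page links back to t.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling pages spread their rank uniformly."""
    pages = set(links)
    for outs in links.values():
        pages.update(outs)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


def with_link_farm(links, target, farm_size):
    """Return a copy of the graph with a farm of own pages wired to target."""
    farmed = {p: list(outs) for p, outs in links.items()}
    farm = ["farm%d" % i for i in range(farm_size)]
    farmed[target] = list(farmed.get(target, [])) + farm
    for page in farm:
        farmed[page] = [target]
    return farmed


# Hypothetical web: "blog" is an accessible page where the spammer posted a link to t.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "blog": ["a", "t"], "t": []}

base = pagerank(web)
farmed = pagerank(with_link_farm(web, "t", 10))
print(round(base["t"], 3), round(farmed["t"], 3))
```

Ranks are normalized to sum to 1, so the comparison shows the share of total rank captured by t. The farm pages recycle t's own rank back into it instead of letting it leak to the rest of the graph.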
  10. WEB SPAM – RECOGNISING WEB SPAM LINKS
      Potential signs of web spam in SERPs:
      • Domain name not pertinent to or not associable with the keyword
      • URL composed of more than one level (long URL) + spam keyword
      • URL pointing to a specific page through parameters such as Id, U, Articleid, etc. + spam keyword
      • Domain suffix gov, edu, org, info, name, net + spam keyword
      • Keyword stuffing: spam keyword in title, description and URL
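The URL-based signs above can be turned into a rough checklist. This is a sketch only: the keyword list is taken from the pharmacy example in this deck, while the function name, the sign labels, and the sample URLs are assumptions. Title and description checks are left out because they need the result snippet, not just the URL.

```python
from urllib.parse import urlparse, parse_qs

SPAM_KEYWORDS = ("viagra", "cialis", "phentermine")   # from the pharmacy example
SUSPECT_PARAMS = {"id", "u", "articleid"}
SUSPECT_SUFFIXES = {"gov", "edu", "org", "info", "name", "net"}


def spam_signs(url):
    """List the signs from the slide that a single result URL triggers."""
    if not any(keyword in url.lower() for keyword in SPAM_KEYWORDS):
        return []  # every sign on the slide requires a spam keyword
    signs = ["spam keyword in URL"]
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not any(keyword in host for keyword in SPAM_KEYWORDS):
        signs.append("domain name not related to keyword")
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) > 1:
        signs.append("multi-level URL")
    params = {name.lower() for name in parse_qs(parsed.query)}
    if params & SUSPECT_PARAMS:
        signs.append("page selected by id-style parameter")
    if host.rsplit(".", 1)[-1] in SUSPECT_SUFFIXES:
        signs.append("trusted-looking domain suffix")
    return signs


# Hypothetical search results.
suspicious = "http://www.example.edu/forum/posts/view.php?articleid=77&topic=buy-viagra-online"
clean = "http://www.example.com/about"

print(spam_signs(suspicious))
print(spam_signs(clean))
```

Each sign on its own is weak evidence; a legitimate university page can have a long URL with an id parameter. The slide's point is the combination with a spam keyword.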
  11. EXAMPLE WEB SPAM – ONLINE PHARMACY KEYWORDS
      The following keywords can be used to identify web spammers in this industry.

      Keywords                  Google       Yahoo        Live         Spam Links
      Buy viagra online         11,200,000   44,600,000   57,400,000   G:4/10  Y:6/10  L:10/10
      Cheap viagra              12,100,100   36,700,000   53,100,000   G:7/10  Y:7/10  L:9/10
      Buy cialis online          7,810,000   33,400,000   25,000,000   G:8/10  Y:9/10  L:10/10
      Buy phentermine online     4,340,000   27,000,000   52,600,000   G:8/10  Y:8/10  L:10/10

      (G = Google, Y = Yahoo, L = Live)
  12. EXAMPLE: LINK FARMS AND LINK EXCHANGES
  13. EXPIRED DOMAINS
  14. DETECTING SPAM
      • Term spamming
        – Analyze text using statistical methods, e.g. Naïve Bayes classifiers
        – Similar to email spam filtering
        – Also useful: detecting approximate duplicate pages
      • Link spamming
        – Open research area
        – One approach: TrustRank
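As an illustration of the statistical approach named above, here is a from-scratch multinomial Naïve Bayes sketch with add-one smoothing. The class name, the two labels, and the four training snippets are invented for the example; a usable filter would need a large labeled corpus and at least one training document per label.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, text, label):
        """log P(label) + sum of log P(token | label) over the text."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + len(self.vocab)
        for token in text.lower().split():
            score += math.log((counts[token] + 1) / denominator)
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Hypothetical training pages.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap viagra online free", "spam")
spam_filter.train("cheap pills free free cheap", "spam")
spam_filter.train("university course schedule and lecture notes", "ham")
spam_filter.train("local weather forecast for the weekend", "ham")

print(spam_filter.classify("cheap viagra free"))
print(spam_filter.classify("lecture notes for the course"))
```

Weaving and phrase stitching are aimed at exactly this kind of classifier: mixing spam terms into legitimate text pulls the word statistics back toward the "ham" class.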
  15. CONCLUSION
      • Web spam is a by-product of the search engine era.
      • Identifying the structure of web spam is the first step to fighting it.
      • Due to the inherent characteristics of the Web, it is difficult to eliminate web spam altogether.
      • Combining different detection techniques detects spam more effectively than any single technique.
  16. REFERENCE
      • [1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
      • www.iseclab.org/papers/webspam.pdf
      • www.cs.wellesley.edu/~cs315/...WebSpamTechniques
      • www.malerisch.net/docs/web_spam_techniques
      • www.courses.ischool.berkeley.edu/i141/f07/lectures/najork-web-spam.pdf
      • www.infolab.stanford.edu/~ullman/mining/pdf/spam.pdf
      • www.research.microsoft.com/pubs/102938/EDS-WebSpamDetection.pdf
