



         WEB SPAM
           PRESENTED BY
             KAUTILYA
            ROLL NO:36
INTRODUCTION: WEB SEARCH
• Web search – hundreds of millions of people access the Web through
  search engines, issuing hundreds of millions of queries per day.

Hence,
               Queries + people = TRAFFIC

• Web site owners want to attract this traffic, so they try to rank
  their sites high in search engines in order to
   – Communicate some message, e.g., commercial, political, religious, etc.
   – Install viruses, adware, etc.
WEB SPAM : DEFINITION
Web Spam can be defined as any intentional activity by a
human to generate an unreasonably favorable result or
importance for a web page that naturally should not have
the weight or significance associated to it.[1]
In other words, web spam is the practice of manipulating web pages
in order to cause search engines to rank some web pages higher than
they would without any manipulation.
WEB SPAMMERS ACTIVITIES

[Figure: search engine pipeline. Labels: The Web; Index the documents;
Inverted Index; Query; Search Engine Servers; Get indices for relevant
documents; Document IDs; Retrieve full text of relevant documents;
Rank Results; Display results on a web page.]




Web Spammers target the last step
WEB SPAM IS BAD
• Bad for users
  – Makes it harder to satisfy information need
  – Leads to frustrating search experience


• Bad for search engines
  – Burns crawling bandwidth
  – Pollutes corpus (infinite number of spam pages!)
  – Distorts ranking of results
HISTORY
• Web spam first appeared in the era of the 1st Generation Search
  Engine Companies, in the 1990s
  - The technique came to be known as ‘Glittering Generalities’

• 2nd Generation Search Engine Companies
    - Neutralized Glittering Generalities
    - Ranked pages according to their popularity
    - Popularity determined by links pointing to the web page
    - Spammers built link farms to circumvent this

• 3rd Generation Search Engine Companies
    - Use PageRank and the HITS algorithm to rank pages
    - Spammers have found new ways as well!
SPAMMING TECHNIQUES
•   Boosting Rank
     •   Term Spamming : Manipulating the text of web pages
         in order to appear relevant to queries
     •   Link Spamming : Creating link structures that boost
         PageRank or hubs-and-authorities (HITS) scores
•   Hiding Techniques:
     •   Content Hiding : Use the same color for text and page
         background
     •   Cloaking : Return different pages to crawlers and to
         browsers
     •   Redirecting
         - Alternative to cloaking
         - Redirects are followed by browsers but not by crawlers
TERM SPAMMING
• Repetition
     – of one or a few specific terms e.g., free, cheap, Viagra
     – Goal is to subvert TF.IDF ranking schemes
• Dumping
     – of a large number of unrelated terms
     – e.g., copy entire dictionaries
• Weaving
     – Copy legitimate pages and insert spam terms at random positions
• Phrase Stitching
     – Glue together sentences and phrases from different sources
                           Term spam targets
•   Body of web page
•   Title
•   URL
•   HTML meta tags
•   Anchor text
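
To make the repetition attack concrete, here is a minimal sketch of why repeating a term inflates a TF.IDF score. It assumes raw (unnormalized) term frequency and a smoothed IDF; the function name `tf_idf` and the three toy documents are invented for illustration and are not from the slides. Real ranking schemes usually dampen or normalize term frequency, which weakens this effect.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Raw term frequency times a smoothed inverse document frequency."""
    tf = Counter(doc)[term]                      # how often the term occurs in doc
    df = sum(1 for d in corpus if term in d)     # how many documents contain the term
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf

# Hypothetical toy corpus: one honest page, one term-spam page, one unrelated page.
honest = "we sell cheap flights and hotel deals".split()
spam = ("cheap " * 20 + "flights").split()
other = "weather report for the weekend".split()
corpus = [honest, spam, other]

honest_score = tf_idf("cheap", honest, corpus)
spam_score = tf_idf("cheap", spam, corpus)
print(f"honest page: {honest_score:.3f}  spam page: {spam_score:.3f}")
```

Both pages share the same IDF for "cheap", so under this scheme the score grows linearly with the number of repetitions.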
LINK SPAM
• Three kinds of web pages from a spammer’s point of view
   – Inaccessible pages
   – Accessible pages
       • e.g., web log comments pages
       • spammer can post links to his pages
   – Own pages
       • Completely controlled by spammer
       • May span multiple domain names
• Spammer’s goal
   – Maximize the PageRank of target page t
• Technique
   – Get as many links as possible from accessible pages to target
     page t
   – Construct a “link farm” to get a PageRank multiplier effect
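
The multiplier effect can be illustrated with a small power-iteration PageRank. This is a sketch under stated assumptions, not any search engine's actual implementation: the damping factor 0.85, the toy graph, and the names `pagerank`, `build_graph`, `blog`, and `farm0..farmN` are all hypothetical. The page `blog` plays the role of an accessible page that links to target page t, and each farm page links back to t.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

def build_graph(farm_size):
    """Toy web: ordinary pages a and b, an accessible page, target t, and a farm."""
    graph = {
        "a": ["b"],
        "b": ["a", "blog"],
        "blog": ["a", "t"],                               # spammer posted a link to t
        "t": [f"farm{i}" for i in range(farm_size)],      # t links to every farm page
    }
    for i in range(farm_size):
        graph[f"farm{i}"] = ["t"]                         # every farm page links back to t
    return graph

rank_without_farm = pagerank(build_graph(0))["t"]
rank_with_farm = pagerank(build_graph(10))["t"]
print(f"t without farm: {rank_without_farm:.3f}  t with farm: {rank_with_farm:.3f}")
```

In this toy graph the farm pages recycle rank back into t, so t ends up with a larger share of the total rank even though the graph contains more pages.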
WEB SPAM – RECOGNISING WEB SPAM LINKS

Potential signs of web spam in SERPs:
   • Domain name not pertinent to (not associable with) the keyword
   • URL composed of more than one level (long URL) + spam keyword
   • URL pointing to a specific page using parameters such as
     id, u, articleid, etc. + spam keyword
   • Domain suffix: gov, edu, org, info, name, net + spam keyword
   • Keyword stuffing – spam keyword in title, description and URL
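
The URL-based signs above could be checked mechanically along these lines. This is only a rough sketch: the function name `spam_signals`, the signal labels, the substring matching, and the example URL are assumptions made for illustration, and a real detector would need far more careful matching. It covers the URL signs only, not the title and description check.

```python
from urllib.parse import urlparse, parse_qs

SUSPICIOUS_PARAMS = {"id", "u", "articleid"}
SUSPICIOUS_SUFFIXES = {"gov", "edu", "org", "info", "name", "net"}

def spam_signals(url, spam_keyword):
    """Return the names of the URL-based warning signs that fire for this URL."""
    parsed = urlparse(url.lower())
    host = parsed.hostname or ""
    words = spam_keyword.lower().split()
    rest = parsed.path + "?" + parsed.query
    # Crude check: does any word of the spam keyword appear in the path or query?
    has_keyword = any(w in rest for w in words)
    signals = []
    if not any(w in host for w in words):
        signals.append("domain_unrelated_to_keyword")
    if has_keyword:
        levels = [seg for seg in parsed.path.split("/") if seg]
        if len(levels) > 1:
            signals.append("multi_level_url")
        if SUSPICIOUS_PARAMS & set(parse_qs(parsed.query)):
            signals.append("suspicious_parameter")
        if host.rsplit(".", 1)[-1] in SUSPICIOUS_SUFFIXES:
            signals.append("unusual_suffix_for_keyword")
    return signals

# Hypothetical search result URL for the keyword "buy viagra online".
example_signals = spam_signals(
    "http://www.example.edu/forum/profile/view.php?id=123&name=buy-viagra-online",
    "buy viagra online",
)
print(example_signals)
```

Each signal on its own is weak evidence; the slide presents them as potential signs, so a practical check would combine several before flagging a result.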




EXAMPLE WEB SPAM – ONLINE PHARMACY KEYWORDS

The following keywords can be used to identify web
spammers in this industry:

Keywords                Google       Yahoo        Live         Spam Links
Buy viagra online       11,200,000   44,600,000   57,400,000   G: 4/10, Y: 6/10, L: 10/10
Cheap viagra            12,100,100   36,700,000   53,100,000   G: 7/10, Y: 7/10, L: 9/10
Buy cialis online       7,810,000    33,400,000   25,000,000   G: 8/10, Y: 9/10, L: 10/10
Buy phentermine online  4,340,000    27,000,000   52,600,000   G: 8/10, Y: 8/10, L: 10/10
EXAMPLE: LINK FARMS AND LINK EXCHANGES
EXPIRED DOMAINS
DETECTING SPAM
• Term spamming
  – Analyze text using statistical methods e.g., Naïve
    Bayes classifiers
  – Similar to email spam filtering
  – Also useful: detecting approximate duplicate
    pages
• Link spamming
  – Open research area
  – One approach: TrustRank
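
As a rough illustration of the statistical approach to term spam, the sketch below trains a tiny multinomial Naïve Bayes classifier with add-one smoothing, in the same spirit as email spam filters. The class name `NaiveBayes`, the labels, and the four training texts are made up for the example; a usable detector would need a large labeled corpus and better features.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words, add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        # Assumes at least one training document per label.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical, hand-labeled training snippets.
classifier = NaiveBayes()
classifier.train("buy cheap viagra online", "spam")
classifier.train("cheap cialis buy online free", "spam")
classifier.train("university course schedule and lecture notes", "ham")
classifier.train("local weather forecast for the weekend", "ham")

print(classifier.predict("buy viagra cheap"))
print(classifier.predict("lecture notes for the course"))
```

Working in log space avoids numeric underflow on long pages, and the add-one smoothing keeps unseen words from zeroing out a class.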
CONCLUSION
• Web spam is a by-product of the search engine era.
• Identifying the structure of web spam is the first step
  to fighting it.
• Due to the inherent characteristics of the Web, it is
  difficult to eliminate web spam altogether.
• Different web spam detection techniques can be combined
  to detect spam more effectively.
REFERENCE
• [1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First
  International Workshop on Adversarial Information Retrieval on the
  Web (AIRWeb), 2005.
• www.iseclab.org/papers/webspam.pdf
• www.cs.wellesley.edu/~cs315/...WebSpamTechniques
• www.malerisch.net/docs/web_spam_techniques
• www.courses.ischool.berkeley.edu/i141/f07/lectures/najork-web-spam.pdf
• www.infolab.stanford.edu/~ullman/mining/pdf/spam.pdf
• www.research.microsoft.com/pubs/102938/EDS-WebSpamDetection.pdf
