Webspam kaut


WEB SPAM
PRESENTED BY
KAUTILYA
ROLL NO:36

INTRODUCTION: WEB SEARCH
• Web search – hundreds of millions of people access the Web
through search engines, issuing hundreds of millions of
queries per day.

Hence,
Queries + people = TRAFFIC

• Web site owners want to attract this traffic, so they want their
web site ranked high in search engines in order to
– Communicate some message, e.g., commercial, political, religious, etc.
– Install viruses, adware, etc.

WEB SPAM : DEFINITION
Web Spam can be defined as any intentional activity by a
human to generate an unreasonably favorable result or
importance for a web page that naturally should not have
the weight or significance associated to it.[1]
In other words, web spam is the practice of manipulating
web pages in order to cause search engines to rank some
web pages higher than they would without any manipulation.

WEB SPAMMERS ACTIVITIES

[Figure: search engine pipeline. Labels: the Web; index the documents; inverted index; query; search engine servers; get indices for relevant documents; rank results; document IDs; retrieve full text of relevant documents; display results on a web page.]
Web Spammers target the last step

WEB SPAM IS BAD
• Bad for users
– Makes it harder to satisfy information need
– Leads to frustrating search experience

• Bad for search engines
– Burns crawling bandwidth
– Pollutes corpus (infinite number of spam pages!)
– Distorts ranking of results

HISTORY
• Web spam emerged in the era of the 1st-generation search engine
companies in the 1990s
- The technique came to be known as ‘Glittering Generalities’

• 2nd Generation Search Engine Companies
- Neutralized Glittering Generalities
- Ranked pages according to their popularity
- Popularity determined by Links pointing to the Web page
- Spammers made Link farms to circumvent it

• 3rd Generation Search Engine Companies
- Use PageRank and the HITS algorithm to rank pages
- Spammers have found new ways as well!

SPAMMING TECHNIQUES
• Boosting Rank
  – Term Spamming: manipulating the text of web pages
    in order to appear relevant to queries
  – Link Spamming: creating link structures that boost
    PageRank or hubs-and-authorities scores
• Hiding Techniques
  – Content Hiding: use the same color for text and page
    background
  – Cloaking: return a different page to crawlers than to
    browsers
  – Redirecting: an alternative to cloaking; redirects are
    followed by browsers but not by crawlers

TERM SPAMMING
• Repetition
– of one or a few specific terms e.g., free, cheap, Viagra
– Goal is to subvert TF.IDF ranking schemes
• Dumping
– of a large number of unrelated terms
– e.g., copy entire dictionaries
• Weaving
– Copy legitimate pages and insert spam terms at random positions
• Phrase Stitching
– Glue together sentences and phrases from different sources
Term spam targets
• Body of web page
• Title
• URL
• HTML meta tags
• Anchor text
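
The effect of repetition on a TF.IDF ranker can be illustrated with a small sketch. It assumes raw term counts and a smoothed IDF, which is only one of several common TF.IDF variants; the toy corpus, the function name `tf_idf_score` and the variable names are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_score(query, doc, corpus):
    """Sum of raw TF * smoothed IDF over the query terms (one common variant)."""
    counts = Counter(doc.lower().split())
    n_docs = len(corpus)
    score = 0.0
    for term in query.lower().split():
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in corpus if term in d.lower().split())
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        score += counts[term] * idf
    return score

# Hypothetical toy corpus: one honest page, one page that repeats "cheap".
honest = "we sell cheap flights to paris"
spam = "cheap cheap cheap cheap cheap pills cheap"
corpus = [honest, spam, "weather report for paris today"]

honest_score = tf_idf_score("cheap", honest, corpus)
spam_score = tf_idf_score("cheap", spam, corpus)
print(honest_score, spam_score)
```

Because the IDF of "cheap" is the same for both pages, the spam page's score grows linearly with the number of repetitions. This is why rankers that rely on raw term frequency alone are easy to subvert.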

LINK SPAM
• Three kinds of web pages from a spammer’s point of view
– Inaccessible pages
– Accessible pages
• e.g., web log comments pages
• spammer can post links to his pages
– Own pages
• Completely controlled by spammer
• May span multiple domain names
• Spammer’s goal
– Maximize the PageRank of target page t
• Technique
– Get as many links as possible from accessible pages to target
page t
– Construct a “link farm” to get a PageRank multiplier effect
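
The multiplier effect of a link farm can be sketched with a small PageRank power iteration. This is a minimal illustration rather than how any real search engine computes rank: the damping factor 0.85, the toy graph and the page names (`a`, `b`, `blog`, `farm0`, ...) are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

def build_graph(farm_size):
    """Hypothetical toy web: two honest pages, a blog, target t, and a farm."""
    graph = {
        "a": ["b"],
        "b": ["a"],
        "blog": ["a", "t"],  # accessible page where the spammer posted a link
        "t": [],
    }
    for i in range(farm_size):
        farm_page = f"farm{i}"
        graph[farm_page] = ["t"]      # every farm page links to the target
        graph["t"].append(farm_page)  # the target links back to the farm
    return graph

rank_without_farm = pagerank(build_graph(0))["t"]
rank_with_farm = pagerank(build_graph(10))["t"]
print(rank_without_farm, rank_with_farm)
```

In this toy graph, the target's share of the total rank grows several-fold once the farm is added. The farm pages and the target keep passing rank back and forth instead of letting it flow out to the rest of the web.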

WEB SPAM – RECOGNISING WEB SPAM LINKS

Potential signs of web spam in SERPs:
• Domain name not pertinent to (not associable with) the keyword
• URL composed of more than one level (long URL) + spam keyword
• URL pointing to a specific page using parameters such as
Id, U, Articleid, etc. + spam keyword
• Domain suffix: gov, edu, org, info, name, net + spam keyword
• Keyword stuffing – spam keyword in title, description and URL
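
The URL-based signs in this list can be turned into a rough heuristic check. The sketch below is only an illustration: the keyword list, the function name `spam_signs` and the example URLs are assumptions, and real spam detection would need far more evidence than a URL.

```python
from urllib.parse import urlparse, parse_qs

SPAM_KEYWORDS = {"viagra", "cialis", "phentermine"}  # assumed example list
SUSPECT_PARAMS = {"id", "u", "articleid"}
SUSPECT_SUFFIXES = {"gov", "edu", "org", "info", "name", "net"}

def spam_signs(url):
    """Return the list of heuristic signs that a URL triggers."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Every sign requires a spam keyword somewhere in the URL.
    if not any(k in url.lower() for k in SPAM_KEYWORDS):
        return []
    signs = []
    if not any(k in host for k in SPAM_KEYWORDS):
        signs.append("domain unrelated to keyword")
    path_levels = [part for part in parsed.path.split("/") if part]
    if len(path_levels) > 1:
        signs.append("multi-level URL")
    params = {name.lower() for name in parse_qs(parsed.query)}
    if params & SUSPECT_PARAMS:
        signs.append("suspect query parameter")
    if host.rsplit(".", 1)[-1] in SUSPECT_SUFFIXES:
        signs.append("suspect domain suffix")
    return signs

# Hypothetical URLs, for illustration only.
suspicious = spam_signs(
    "http://physics.example.edu/forum/posts/view.php?id=42&t=buy-viagra-online"
)
clean = spam_signs("http://example.com/about")
print(suspicious, clean)
```

A URL that triggers several signs at once is a candidate for closer inspection, not proof of spam; legitimate pages can match any one of these patterns.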


EXAMPLE WEB SPAM – ONLINE PHARMACY KEYWORDS

The following keywords can be used to identify web
spammers in this industry
Keywords                 Google       Yahoo        Live         Spam Links
Buy viagra online        11,200,000   44,600,000   57,400,000   G: 4/10, Y: 6/10, L: 10/10
Cheap viagra             12,100,100   36,700,000   53,100,000   G: 7/10, Y: 7/10, L: 9/10
Buy cialis online        7,810,000    33,400,000   25,000,000   G: 8/10, Y: 9/10, L: 10/10
Buy phentermine online   4,340,000    27,000,000   52,600,000   G: 8/10, Y: 8/10, L: 10/10

EXAMPLE
LINK FARMS AND LINK EXCHANGES

DETECTING SPAM
• Term spamming
– Analyze text using statistical methods e.g., Naïve
Bayes classifiers
– Similar to email spam filtering
– Also useful: detecting approximate duplicate
pages
• Link spamming
– Open research area
– One approach: TrustRank
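
The statistical approach to term spam can be sketched with a tiny Naïve Bayes classifier, the same idea used in email spam filtering. This is a minimal sketch assuming a multinomial model with add-one smoothing and whitespace tokenization; the class name `NaiveBayesSpamFilter` and the four training texts are made up for illustration.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label) over known words.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for word in text.lower().split():
            if word in self.vocabulary:  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))

# Hypothetical training data, for illustration only.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap viagra online", "spam")
spam_filter.train("cheap pills buy now free", "spam")
spam_filter.train("lecture notes on information retrieval", "ham")
spam_filter.train("search engine ranking lecture", "ham")

print(spam_filter.classify("buy cheap pills"))
print(spam_filter.classify("information retrieval lecture"))
```

A real filter would be trained on a large labeled corpus of pages. It would also have to cope with weaving and phrase stitching, which deliberately mix spam terms into legitimate text.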

CONCLUSION
• Web Spam is a by-product of the search engine era

• Identifying the structure of web spam is the first step
to fighting it.
• Due to the inherent characteristics of the Web, it is
difficult to eliminate web spam altogether.
• Different web spam detection techniques can be
combined to detect spam more effectively.

REFERENCE
• [1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First
International Workshop on Adversarial Information Retrieval on the
Web (AIRWeb), 2005.
• www.iseclab.org/papers/webspam.pdf
• www.cs.wellesley.edu/~cs315/...WebSpamTechniques
• www.malerisch.net/docs/web_spam_techniques
• www.courses.ischool.berkeley.edu/i141/f07/lectures/najork-web-spam.pdf
• www.infolab.stanford.edu/~ullman/mining/pdf/spam.pdf
• www.research.microsoft.com/pubs/102938/EDS-WebSpamDetection.pdf
