Web Spam (OjoBuscador 2007 Madrid, Spain) - Presentation Transcript
Know your Neighbors:
Web Spam Detection using Web Topology
C. Castillo (2), D. Donato (2), A. Gionis (2), V. Murdock (2) and F. Silvestri (4)
Previous work with: R. Baeza-Yates (2,3), L. Becchetti (1), P. Boldi (5), S. Leonardi (1), M. Santini (5) and S. Vigna (5)
1. Università di Roma “La Sapienza” – Rome, Italy
2. Yahoo! Research Barcelona – Catalunya, Spain
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy

This is a talk about academic research!
Tools for dealing with Web Spam

1. Web Spam
2. Web Spam Detection
3. A Reference Collection
4. Topological Web Spam
5. Counting of Supporters
6. Content-based Spam detection
7. Web Topology
8. Conclusions

The Web
“The sum of all human knowledge plus porn” – Robert Gilbert
Graphic: www.milliondollarhomepage.com

Adversarial IR Issues on the Web
  Link spam
  Content spam
  Cloaking
  Comment/forum/wiki spam
  Spam-oriented blogging
  Click fraud ×2
  Reverse engineering of ranking algorithms
  Web content filtering
  Advertisement blocking
  Stealth crawling
  Malicious tagging
  ... more?

Opportunities for Web spam
Spamdexing
  Keyword stuffing
  Link farms
  Spam blogs (splogs)
  Cloaking
Adversarial relationship
  Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.

Naive Web Spam

Hidden text

Made for Advertising

Search engine?

Fake search engine

Problem: “normal” pages that are spam

Machine Learning

Training of a Decision Tree

Decision Tree (error = 15%)

Decision Tree (error = 15% → 12%)

Machine Learning (cont.)

Feature Extraction

Challenges: Machine Learning
Machine Learning Challenges:
  Instances are not really independent (graph)
  Learning with few examples
  Scalability

Challenges: Information Retrieval
Information Retrieval Challenges:
  Feature extraction: which features?
  Feature aggregation: page/host/domain
  Feature propagation (graph)
  Recall/precision tradeoffs
  Scalability

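As a minimal sketch of the "feature aggregation: page/host/domain" item, the Python below groups one page-level feature by host and summarizes it. The function name, the sample URLs, the feature values, and the choice of mean and maximum as aggregates are all illustrative assumptions; the talk does not say which aggregates were used.

```python
from collections import defaultdict
from statistics import mean
from urllib.parse import urlsplit


def aggregate_by_host(page_features):
    """Aggregate one numeric page-level feature to host level.

    page_features: dict mapping a page URL to a feature value
    (hypothetical feature, e.g. fraction of visible text on the page).
    Returns a dict mapping host -> {"mean": ..., "max": ..., "pages": ...}.
    """
    per_host = defaultdict(list)
    for url, value in page_features.items():
        host = urlsplit(url).hostname  # lower-cased host name, or None
        if host is not None:
            per_host[host].append(value)
    return {
        host: {"mean": mean(values), "max": max(values), "pages": len(values)}
        for host, values in per_host.items()
    }


# Made-up page-level values for two hypothetical hosts.
pages = {
    "http://a.example.co.uk/": 0.5,
    "http://a.example.co.uk/about": 1.0,
    "http://b.example.co.uk/": 0.25,
}
host_features = aggregate_by_host(pages)
```

The same grouping could be repeated with the registered domain as the key to move from host level to domain level.
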
Assembling Process
  Crawling of base data
  Elaboration of the guidelines and classification interface
  Labeling
  Post-processing

Crawling of base data
U.K. collection
  77.9 M pages downloaded from the .uk domain in May 2006 (LAW, University of Milan)
  Large seed of about 150,000 .uk hosts
  11,400 hosts
  8 levels of depth, with at most 50,000 pages per host

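The two crawl limits listed above (a depth limit and a cap on pages per host) can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: it walks an in-memory toy graph instead of fetching pages, `bounded_crawl` and `get_links` are hypothetical names, and counting the seed as depth 1 is a guess. It does not describe the crawler actually used for the collection.

```python
from collections import Counter, deque
from urllib.parse import urlsplit


def bounded_crawl(seeds, get_links, max_depth=8, max_pages_per_host=50_000):
    """Breadth-first traversal with a depth limit and a per-host page cap.

    seeds: iterable of start URLs (treated as depth 1 here; an assumption).
    get_links: callable url -> iterable of linked URLs (hypothetical fetcher).
    Returns the list of URLs visited, in visiting order.
    """
    pages_per_host = Counter()
    seen = set()
    queue = deque()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append((url, 1))
    visited = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).hostname
        if pages_per_host[host] >= max_pages_per_host:
            continue  # host already reached its page budget
        pages_per_host[host] += 1
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for the Web (made-up URLs).
toy_graph = {
    "http://x.example.uk/": ["http://x.example.uk/1", "http://x.example.uk/2"],
    "http://x.example.uk/1": ["http://x.example.uk/3"],
    "http://x.example.uk/2": [],
    "http://x.example.uk/3": [],
}
result = bounded_crawl(
    ["http://x.example.uk/"],
    lambda u: toy_graph.get(u, []),
    max_depth=2,
    max_pages_per_host=2,
)
```

With the small limits used in the toy call, the traversal stops after two pages of the single host, which shows both limits taking effect.
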
Classification interface

Labeling process
  We asked 20+ volunteers to classify entire hosts
  Asked to classify normal / borderline / spam
  Do they agree? Mostly ...

Agreement

Results
Labels
  Label              Frequency   Percentage
  Normal                 4,046       61.75%
  Borderline               709       10.82%
  Spam                   1,447       22.08%
  Can not classify         350        5.34%
Agreement
  Category     Kappa   Interpretation
  normal        0.62   Substantial agreement
  spam          0.63   Substantial agreement
  borderline    0.11   Slight agreement
  global        0.56   Moderate agreement

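For readers unfamiliar with kappa, here is a minimal sketch of how such an agreement score is computed and read. The talk does not say which kappa variant was used for the volunteers; the code assumes Cohen's kappa for exactly two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The verbal scale is that of Landis and Koch (1977), which is consistent with the interpretations in the table. The judgments in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


def interpret(kappa):
    """Verbal scale of Landis and Koch (1977)."""
    if kappa < 0:
        return "Poor agreement"
    for upper, name in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                        (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return f"{name} agreement"
    return "Almost perfect agreement"


# Made-up judgments from two hypothetical raters on eight hosts.
rater_1 = ["normal", "normal", "spam", "spam",
           "normal", "borderline", "spam", "normal"]
rater_2 = ["normal", "normal", "spam", "normal",
           "normal", "spam", "spam", "normal"]
kappa = cohen_kappa(rater_1, rater_2)
```

With more than two raters per host, a multi-rater statistic such as Fleiss' kappa would be the usual choice; the interpretation scale stays the same.
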
Result: first public Web Spam collection
Public spam collection
  Labels for 6,552 hosts
  2,725 hosts classified by at least 2 humans
  3,106 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk)
  http://www.yr-bcn.es/webspam/
Upcoming Web Spam challenge
  Track I: Information retrieval + Machine learning
  Track II: Machine learning
  http://webspam.lip6.fr/
AIRWeb 2007 Workshop (21 submissions!)
  Regular and short papers
  Track I of the Web Spam Challenge
  http://airweb.cse.lehigh.edu/2007/

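The rule that treats hosts under certain second-level domains as normal can be written as a suffix check. This is a sketch only: the suffix list is copied from the slide, while the function name and the sample host names are hypothetical.

```python
# Second-level domains listed on the slide as automatically "normal".
AUTO_NORMAL_SUFFIXES = (
    ".ac.uk", ".sch.uk", ".gov.uk", ".mod.uk", ".nhs.uk", ".police.uk",
)


def is_auto_normal(host):
    """Return True if the host falls under one of the listed domains."""
    host = host.lower().rstrip(".")  # ignore case and a trailing root dot
    return host.endswith(AUTO_NORMAL_SUFFIXES)


# Made-up host names for illustration.
hosts = ["www.example.ac.uk", "shop.example.co.uk", "WWW.EXAMPLE.NHS.UK"]
auto_normal = [h for h in hosts if is_auto_normal(h)]
```

Keeping the leading dot in each suffix avoids matching unrelated names that merely end in the same letters.
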
AIRWeb 2007 in Banff, Canada

Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]

Motivation
Fetterly et al. [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
Restriction
Semi-streaming model: graph on disk
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src,dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
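The loop structure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `semi_streaming`, the `read_edges` callable (standing in for a sequential scan of the graph on disk), and the PageRank-style choice for COMPUTE and NORMALIZE are all hypothetical and not taken from the slides. The point is that only one value per node is held in memory while the edges are read sequentially on every pass.

```python
from typing import Callable, Iterable, List, Tuple

Edge = Tuple[int, int]


def semi_streaming(
    num_nodes: int,
    read_edges: Callable[[], Iterable[Edge]],
    distance: int,
    damping: float = 0.85,
) -> List[float]:
    """Run `distance` sequential passes over the edge list while keeping
    only O(num_nodes) state in memory (here: one score per node)."""
    # INITIALIZE-MEM: one score per node, plus out-degrees from one extra pass.
    score = [1.0 / num_nodes] * num_nodes
    out_degree = [0] * num_nodes
    for src, _dest in read_edges():
        out_degree[src] += 1

    for _ in range(distance):  # iteration step
        incoming = [0.0] * num_nodes
        for src, dest in read_edges():  # follow links in the graph
            # COMPUTE(src, dest): push a share of src's score to dest
            # (an assumed, PageRank-like update).
            incoming[dest] += score[src] / out_degree[src]
        # NORMALIZE: apply damping, then rescale so the scores sum to 1.
        score = [(1 - damping) / num_nodes + damping * v for v in incoming]
        total = sum(score)
        score = [v / total for v in score]
    # POST-PROCESS would go here; this sketch returns the raw scores.
    return score
```

For example, `semi_streaming(4, lambda: [(0, 1), (1, 2), (2, 0), (3, 0)], 3)` makes three passes over a four-node graph; in a real setting `read_edges` would re-read the edge file from disk on each call.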
Link-Based Features
Degree-related measures
PageRank
TrustRank [Gyöngyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
Degree-Based
[Figure: two histograms of degree-based features, each comparing Normal and Spam hosts]
TrustRank Idea
TrustRank / PageRank
[Figure: histogram of the TrustRank / PageRank ratio, comparing Normal and Spam hosts]
5. Counting of Supporters
High and low-ranked pages are different
[Figure: y-axis "Number of Nodes" (x 10^4), x-axis "Distance" (1 to 20); curves for Top 0%-10%, Top 40%-50% and Top 60%-70%]
Areas below the curves are equal if we are in the same strongly-connected component
Probabilistic counting
[Figure: bit vectors attached to the pages linking to a target page. Labels: Propagation of bits using the “OR” operation; Target page; Count bits set to estimate supporters]
[Becchetti et al., 2006] presents an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
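A simplified sketch of the idea in the figure: every page starts with a random bit vector, vectors are propagated along links with bitwise OR, and the number of bits set at a page is used to estimate how many pages reach it. The function name, the single fixed probability `epsilon`, and the estimator below are assumptions made for illustration; the cited papers describe more refined schemes. The estimator follows from the fact that a bit stays 0 after OR-ing n independent vectors with probability (1 - epsilon)^n.

```python
import math
import random
from typing import List, Tuple


def estimate_supporters(
    num_nodes: int,
    edges: List[Tuple[int, int]],
    distance: int,
    num_bits: int = 4096,
    epsilon: float = 0.01,
    seed: int = 0,
) -> List[float]:
    """Estimate, for every node, how many nodes (itself included) reach it
    in at most `distance` hops, using OR-propagated random bit vectors."""
    rng = random.Random(seed)
    # Each node gets a bit vector stored as a Python int; every bit is set
    # independently with probability epsilon.
    vectors = [
        sum(1 << b for b in range(num_bits) if rng.random() < epsilon)
        for _ in range(num_nodes)
    ]
    for _ in range(distance):
        updated = list(vectors)
        for src, dest in edges:
            # Propagate the bits of src to dest using the OR operation.
            updated[dest] |= vectors[src]
        vectors = updated

    estimates = []
    for vec in vectors:
        ones = bin(vec).count("1")
        if ones == num_bits:  # saturated vector: no finite estimate
            estimates.append(float("inf"))
            continue
        # Invert E[fraction of ones] = 1 - (1 - epsilon) ** n for n.
        estimates.append(math.log(1 - ones / num_bits) / math.log(1 - epsilon))
    return estimates
```

With a star graph in which 50 pages link to page 0, the estimate for page 0 at distance 1 should land close to 51 (the page plus its 50 supporters); the accuracy depends on `num_bits` and `epsilon`.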
Bottleneck number
$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$: minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
[Figure: histogram of the bottleneck number for Normal vs. Spam hosts; x-axis from 1.11 to 4.52]
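The bottleneck number can be computed exactly on a small graph with a breadth-first search, as in the sketch below. It assumes that N_j(x) is the set of pages that reach x in at most j hops, with N_0(x) = {x}; the slides do not spell this definition out, so treat it as an assumption. The function name and the `in_links` dictionary are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bottleneck_number(in_links: Dict[int, List[int]], x: int, d: int) -> float:
    """Return min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|, where N_j(x)
    is assumed to be the set of nodes reaching x in at most j hops."""
    dist = {x: 0}
    queue = deque([x])
    while queue:  # breadth-first search backwards along in-links
        node = queue.popleft()
        if dist[node] == d:
            continue  # do not explore beyond distance d
        for pred in in_links.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)

    # sizes[j] = |N_j(x)|: cumulative number of nodes within j hops.
    sizes = [0] * (d + 1)
    for hops in dist.values():
        sizes[hops] += 1
    for j in range(1, d + 1):
        sizes[j] += sizes[j - 1]
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))
```

For `in_links = {0: [1, 2], 1: [3], 2: [3]}`, the neighborhoods of page 0 have sizes 1, 3 and 4, so the ratios are 3 and 4/3 and the bottleneck number for d = 2 is 4/3.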
6. Content-based Spam detection
Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
96 features per host
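A few of the features listed above are easy to compute from the page text alone. The sketch below is illustrative only: the function name is hypothetical, whitespace tokenization is a simplification, and the definition of compression rate as uncompressed size divided by compressed size is an assumption rather than something stated on the slide.

```python
import math
import zlib
from collections import Counter
from typing import Dict


def content_features(text: str) -> Dict[str, float]:
    """Compute a handful of simple content-based features for one page."""
    words = text.split()  # naive whitespace tokenization
    raw = text.encode("utf-8")

    # Entropy (in bits) of the distribution of word trigrams in the page.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in trigrams.values())
        if total
        else 0.0
    )

    return {
        "num_words": float(len(words)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Highly repetitive pages compress well, giving a large ratio.
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }
```

A page that repeats the same phrase hundreds of times gets a much higher compression rate than a short natural sentence, which is the kind of outlier these features are meant to expose.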
Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
7. Web Topology
General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
[Figure: histogram with two series, in-links of non-spam and in-links of spam; x-axis from 0.0 to 1.0]
Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
[Figure: histogram with two series, out-links of non-spam and out-links of spam; x-axis from 0.0 to 1.0]
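The quantity plotted in the in-link and out-link histograms can be computed per host as in this sketch. The function name and the data layout (a list of host-to-host links and a dictionary of known labels) are assumptions for illustration; neighbors without a label are simply ignored.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def neighbor_spam_fractions(
    edges: List[Tuple[str, str]], is_spam: Dict[str, bool]
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """For each labeled host, return (fraction of spam among its in-links,
    fraction of spam among its out-links); None if there are no such links."""
    in_labels = defaultdict(list)
    out_labels = defaultdict(list)
    for src, dest in edges:
        if src in is_spam:  # label of the host where the link comes from
            in_labels[dest].append(is_spam[src])
        if dest in is_spam:  # label of the host the link points to
            out_labels[src].append(is_spam[dest])

    def fraction(labels: List[bool]) -> Optional[float]:
        return sum(labels) / len(labels) if labels else None

    return {
        host: (fraction(in_labels[host]), fraction(out_labels[host]))
        for host in is_spam
    }
```

A value of 0 means none of the labeled neighbors are spam and 1 means all of them are, matching the axis of the histograms.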
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
                      Baseline   Clustering
Without bagging
True positive rate    75.6%      74.5%
False positive rate   8.5%       6.8%
F-Measure             0.646      0.673
With bagging
True positive rate    78.7%      76.9%
False positive rate   5.7%       5.0%
F-Measure             0.723      0.728
Reduces error rate
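The relabeling step of this idea can be sketched as below, assuming the classifier output and the cluster assignment are already available as dictionaries. The function name is hypothetical, and breaking ties in favor of non-spam is an assumption; the slide does not say how clusters are built or how ties are handled.

```python
from collections import Counter, defaultdict
from typing import Dict


def relabel_by_cluster(
    predicted: Dict[str, bool], cluster_of: Dict[str, int]
) -> Dict[str, bool]:
    """Give every host the majority predicted label of its cluster.
    True means spam; ties are resolved as non-spam (an assumption)."""
    votes = defaultdict(Counter)
    for host, label in predicted.items():
        votes[cluster_of[host]][label] += 1
    return {
        host: votes[cluster_of[host]][True] > votes[cluster_of[host]][False]
        for host in predicted
    }
```

A non-spam prediction inside a cluster where most hosts were predicted as spam is flipped to spam, and vice versa.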
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
                      Baseline   Fwds.    Backwds.   Both
Classifier without bagging
True positive rate    75.6%      70.9%    69.4%      71.4%
False positive rate   8.5%       6.1%     5.8%       5.8%
F-Measure             0.646      0.665    0.664      0.676
Classifier with bagging
True positive rate    78.7%      76.5%    75.0%      75.2%
False positive rate   5.7%       5.4%     4.3%       4.7%
F-Measure             0.723      0.716    0.733      0.724
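A random walk with restart can be sketched with plain power iteration, as below. The function name, the restart probability of 0.15, the fixed number of iterations, and the handling of hosts without out-links are all assumptions; the slide gives no parameters. Forward propagation follows the links as given, and backward propagation would use the same function on the reversed edge list.

```python
from typing import List, Tuple


def propagate_spamicity(
    num_nodes: int,
    edges: List[Tuple[int, int]],
    spamicity: List[float],
    restart: float = 0.15,
    iterations: int = 50,
) -> List[float]:
    """Random walk with restart, where the restart distribution is
    proportional to the predicted spamicity of each node."""
    total = sum(spamicity)
    if total <= 0:
        raise ValueError("at least one node needs a positive spamicity")
    restart_dist = [s / total for s in spamicity]

    out_links: List[List[int]] = [[] for _ in range(num_nodes)]
    for src, dest in edges:
        out_links[src].append(dest)

    score = list(restart_dist)
    for _ in range(iterations):
        nxt = [restart * r for r in restart_dist]
        for src, targets in enumerate(out_links):
            mass = (1 - restart) * score[src]
            if targets:
                # Follow one of the out-links uniformly at random.
                for dest in targets:
                    nxt[dest] += mass / len(targets)
            else:
                # No out-links: the walker jumps back to a restart node.
                for node, r in enumerate(restart_dist):
                    nxt[node] += mass * r
        score = nxt
    return score
```

The resulting scores sum to 1 and are highest near the nodes the classifier considered likely spam; a threshold on the score then gives the propagated label.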
Idea 3: Stacked graphical learning
Classify, then add the average predicted “spamicity” of neighbors as a new feature for each node, then classify again [Cohen and Kou, 2006]
                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate    78.7%      84.4%        78.3%         85.2%
False positive rate   5.7%       6.7%         4.8%          6.1%
F-Measure             0.723      0.733        0.742         0.750
Increases detection rate
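The feature-construction step between the two classification passes can be sketched as follows for the "both" variant (in-neighbors and out-neighbors together). The function name and data layout are hypothetical, and using 0.0 for hosts without neighbors is an assumption. The classifier itself is left out: any model trained on the original features produces `predicted`, and a second model is then trained on the extended vectors.

```python
from typing import Dict, List, Set, Tuple


def add_neighbor_spamicity(
    features: Dict[str, List[float]],
    predicted: Dict[str, float],
    edges: List[Tuple[str, str]],
) -> Dict[str, List[float]]:
    """Append to each host's feature vector the average predicted
    spamicity of its in- and out-neighbors."""
    neighbors: Dict[str, Set[str]] = {host: set() for host in features}
    for src, dest in edges:
        if src in neighbors and dest in neighbors:
            neighbors[src].add(dest)
            neighbors[dest].add(src)

    extended = {}
    for host, vector in features.items():
        nbrs = neighbors[host]
        # Hosts without neighbors get 0.0 (an assumed default).
        avg = sum(predicted[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
        extended[host] = vector + [avg]
    return extended
```

Running the function again with the predictions of the second classifier gives a second pass of stacking.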
Idea 3: Stacked graphical learning x2
And repeat ...
                      Baseline   First pass   Second pass
True positive rate    78.7%      85.2%        88.4%
False positive rate   5.7%       6.1%         6.3%
F-Measure             0.723      0.750        0.763
Significant improvement over the baseline
8. Conclusions
Concluding remarks
The UK-2006-05 dataset is “harder” than previous datasets
Considering content-based and link-based attributes improves the accuracy
Considering the dependencies improves the accuracy
Thank you!
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.