Web Spam (Salamanca 2007) - Presentation Transcript
Web Spam Detection
Carlos Castillo (1)
chato@yahoo-inc.com
With: R. Baeza-Yates (1,3), L. Becchetti (2), P. Boldi (5), D. Donato (1), A. Gionis (1), S. Leonardi (2), V. Murdock (1), M. Santini (5), F. Silvestri (4), S. Vigna (5)
1. Yahoo! Research Barcelona – Catalunya, Spain
2. Università di Roma “La Sapienza” – Rome, Italy
3. Yahoo! Research Santiago – Chile
4. ISTI-CNR – Pisa, Italy
5. Università degli Studi di Milano – Milan, Italy
Previous: how search engines work
Search engine: issues
Scalability (crawling, indexing, searching, ranking)
Relevance (query to document match)
Static ranking (content quality)
Incentives for cheating ($)
This is a talk about academic research!
Tools for dealing with Web Spam
1. Web Spam
2. Web Spam Detection
3. A Reference Collection
4. Web Links
5. Topological Web Spam
6. Counting of Supporters
7. Content-based Spam detection
8. Web Topology
9. Conclusions
The Web
“The sum of all human knowledge plus porn” – Robert Gilbert
Graphic: www.milliondollarhomepage.com
Adversarial IR Issues on the Web
Link spam
Content spam
Cloaking
Comment/forum/wiki spam
Spam-oriented blogging
Click fraud ×2
Reverse engineering of ranking algorithms
Web content filtering
Advertisement blocking
Stealth crawling
Malicious tagging
... more?
Opportunities for Web spam
Spamdexing
Keyword stuffing
Link farms
Spam blogs (splogs)
Cloaking
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
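The adversarial relationship can be made concrete with precision at k, the fraction of the top k results that are relevant. The sketch below is only an illustration: the rankings, the document names, and the choice of k = 3 are assumptions, not data from the talk.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

# Hypothetical data for a single query (names are illustrative only).
relevant = {"a", "b", "c", "d"}
honest_ranking = ["a", "b", "c", "x", "d"]
# A spam page gains an undeserved position at rank 2 and pushes the rest down.
spammed_ranking = ["a", "spam", "b", "c", "x"]

before = precision_at_k(honest_ranking, relevant, k=3)
after = precision_at_k(spammed_ranking, relevant, k=3)
print(f"precision@3 before: {before:.2f}, after: {after:.2f}")
```

In this toy case the spammer's gain of one position among the top three costs the engine one third of its precision at that cutoff.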
Naïve Web Spam
Hidden text
Made for Advertising
Search engine?
Fake search engine
“Normal” content in link farms
Cloaking
Redirection
Redirects using JavaScript
Simple redirect
<script>
// Send the visitor to another site as soon as the page loads.
document.location="http://www.topsearch10.com/";
</script>
“Hidden” redirect
<script>
// The condition is always true, so the redirect always fires.
var1=24; var2=var1;
if(var1==var2) {
  document.location="http://www.topsearch10.com/";
}
</script>
Problem: obfuscated code
Obfuscated redirect
<script>
// The pieces a1..a6 concatenate to a window.location.replace(...) call.
var a1="win",a2="dow",a3="loca",a4="tion.",
    a5="replace",a6="('http://www.top10search.com/')";
var i,str="";
for(i=1;i<=6;i++)
{
  str += eval("a"+i);
}
eval(str);
</script>
Problem: really obfuscated code
Encoded JavaScript
<script>
var s = "%5CBE0D%5C%05GDHJ BDE%16...%04%0E";
var e = '', i;
eval(unescape('s%eDunescape%28s%29%3Bfor...%3B'));
</script>
More examples: [Chellapilla and Maykov, 2007]
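The redirect examples on the preceding slides suggest why static inspection of scripts is fragile. The following is a minimal sketch, assuming a naive regular-expression detector (the pattern and the function name `find_redirect_target` are hypothetical, not a method from the talk): it catches the simple and the "hidden" redirect, but not the obfuscated one, because the redirect call only exists after the string pieces are concatenated and evaluated.

```python
import re

# Naive pattern: a literal URL assigned to document.location / window.location,
# or passed to location.replace(...).
REDIRECT_RE = re.compile(
    r"""(?:document|window)\.location(?:\.href)?\s*=\s*["'](?P<assign>[^"']+)["']"""
    r"""|location\.replace\(\s*["'](?P<call>[^"']+)["']\s*\)"""
)

def find_redirect_target(script_source):
    """Return the redirect URL found by the naive pattern, or None."""
    match = REDIRECT_RE.search(script_source)
    if match is None:
        return None
    return match.group("assign") or match.group("call")

simple = 'document.location="http://www.topsearch10.com/";'
hidden = ('var1=24; var2=var1;\n'
          'if(var1==var2) {\n'
          'document.location="http://www.topsearch10.com/";\n'
          '}')
pieces = ["win", "dow", "loca", "tion.", "replace",
          "('http://www.top10search.com/')"]
obfuscated = ('var a1="win",a2="dow",a3="loca",a4="tion.",'
              'a5="replace",a6="(\'http://www.top10search.com/\')";')

print(find_redirect_target(simple))           # found
print(find_redirect_target(hidden))           # found
print(find_redirect_target(obfuscated))       # missed by static matching
# Only after emulating the concatenation does the call become visible.
print(find_redirect_target("".join(pieces)))
```

A detector that has to cope with the encoded variant would need to execute or emulate the script rather than pattern-match its source, which is the difficulty the slides point at.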
Machine Learning
Training of a Decision Tree
Decision Tree (error = 15%)
Decision Tree (error = 15% → 12%)
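The decision tree slides survive only as titles. As a rough illustration of what training a single split involves, here is a minimal sketch of a one-level tree (a decision stump). The feature values and labels are invented toy data, and the stump is a simplification chosen for brevity, not the classifier used in the talk.

```python
def train_stump(samples):
    """Pick the threshold on one numeric feature that minimizes training
    error, predicting spam when the feature value exceeds the threshold."""
    values = sorted({x for x, _ in samples})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_error = None, float("inf")
    for t in candidates:
        errors = sum((x > t) != is_spam for x, is_spam in samples)
        error_rate = errors / len(samples)
        if error_rate < best_error:
            best_threshold, best_error = t, error_rate
    return best_threshold, best_error

# Hypothetical toy data: (feature value, is_spam) for eight hosts.
samples = [(0.1, False), (0.2, False), (0.3, False), (0.4, True),
           (0.6, True), (0.7, False), (0.8, True), (0.9, True)]
threshold, error = train_stump(samples)
print(f"split at {threshold:.2f}, training error {error:.1%}")
```

A full decision tree repeats this search recursively on each side of the split and over many features; adding splits or features is one way a training error can drop from one figure to a lower one.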
Machine Learning (cont.)
Feature Extraction
Challenges: Machine Learning
Machine Learning Challenges:
Instances are not really independent (graph)
Learning with few examples
Scalability
Challenges: Information Retrieval
Information Retrieval Challenges:
Feature extraction: which features?
Feature aggregation: page/host/domain
Feature propagation (graph)
Recall/precision tradeoffs
Scalability
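Feature aggregation from page level to host level can be sketched as follows. This is a minimal example under stated assumptions: the feature is a single number per page, the aggregate is a plain average, and the URLs and values are made up. Other aggregates (maximum, variance, value on the home page) would follow the same pattern.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def aggregate_by_host(page_features):
    """Average a per-page numeric feature over all pages of each host."""
    totals = defaultdict(lambda: [0.0, 0])  # host -> [sum, count]
    for url, value in page_features.items():
        host = urlsplit(url).hostname
        totals[host][0] += value
        totals[host][1] += 1
    return {host: s / n for host, (s, n) in totals.items()}

# Hypothetical per-page feature values; the URLs are illustrative only.
page_features = {
    "http://www.example.co.uk/": 0.8,
    "http://www.example.co.uk/about": 0.6,
    "http://shop.example.org.uk/index.html": 0.2,
}
host_features = aggregate_by_host(page_features)
print(host_features)
```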
Data is really important
It is dangerous for a search engine to provide labelled data for this
Even if they do, it would never reflect a consensus
Assembling Process
Crawling of base data
Elaboration of the guidelines and classification interface
Labeling
Post-processing
Crawling of base data
U.K. collection
77.9 M pages downloaded from the .UK domain in May 2006 (LAW, University of Milan)
Large seed of about 150,000 .uk hosts
11,400 hosts
Crawl depth of 8 levels, with <=50,000 pages per host
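The two crawl limits above (depth and pages per host) can be sketched as a bounded breadth-first traversal. This is an illustration only: `get_links` is a hypothetical caller-supplied function, depth is counted in links from the seed (an assumption, since the slide does not say how levels were measured), and a real crawler would also fetch pages, respect robots.txt, and so on.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

def bounded_crawl(seeds, get_links, max_depth=8, max_pages_per_host=50_000):
    """Breadth-first traversal limited by link depth and pages per host.

    get_links(url) is a hypothetical function returning the URLs linked
    from `url`; here it stands in for fetching and parsing a page.
    """
    pages_per_host = defaultdict(int)
    visited = set()
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).hostname
        if url in visited or pages_per_host[host] >= max_pages_per_host:
            continue
        visited.add(url)
        pages_per_host[host] += 1
        order.append(url)
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in get_links(url))
    return order

# Toy link graph with made-up URLs, crawled with small limits for display.
toy_web = {
    "http://a.example.uk/": ["http://a.example.uk/1", "http://b.example.uk/"],
    "http://a.example.uk/1": ["http://a.example.uk/2"],
    "http://a.example.uk/2": ["http://a.example.uk/3"],
    "http://b.example.uk/": [],
}
crawled = bounded_crawl(["http://a.example.uk/"],
                        lambda u: toy_web.get(u, []),
                        max_depth=2, max_pages_per_host=2)
print(crawled)
```

With the per-host cap set to 2, the third page of a.example.uk is skipped even though it is within the depth limit.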
Classification interface
Labeling process
We asked 20+ volunteers to classify entire hosts
Asked to classify normal / borderline / spam
Do they agree? Mostly ...
Agreement
Results
Labels
Label              Frequency   Percentage
Normal             4,046       61.75%
Borderline         709         10.82%
Spam               1,447       22.08%
Can not classify   350         5.34%
Agreement
Category     Kappa   Interpretation
normal       0.62    Substantial agreement
spam         0.63    Substantial agreement
borderline   0.11    Slight agreement
global       0.56    Moderate agreement
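The transcript does not say which kappa statistic was used. As a simplified illustration, the sketch below computes Cohen's kappa for two raters (an assumption; with 20+ volunteers a multi-rater variant is more likely) and maps the value to the Landis and Koch verbal scale, which is consistent with the interpretations in the table. The two label sequences are invented toy data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Verbal interpretation following the Landis and Koch scale."""
    if kappa < 0:
        return "Poor agreement"
    scale = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, name in scale:
        if kappa <= upper:
            return name + " agreement"
    return "Almost perfect agreement"

# Hypothetical labels from two raters for the same eight hosts.
rater_a = ["normal", "normal", "spam", "spam",
           "normal", "borderline", "spam", "normal"]
rater_b = ["normal", "normal", "spam", "normal",
           "normal", "spam", "spam", "normal"]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f} ({interpret(kappa)})")
```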
Result: first public Web Spam collection
Public spam collection
Labels for 6,552 hosts
2,725 hosts classified by at least 2 humans
3,106 automatically considered normal (.ac.uk, .sch.uk, .gov.uk, .mod.uk, .nhs.uk or .police.uk)
http://www.yr-bcn.es/webspam/
Upcoming Web Spam challenge
Track I: Information retrieval + Machine learning
Track II: Machine learning
http://webspam.lip6.fr/
AIRWeb 2007 Workshop (challenge results available)
Regular and short papers
Track I of the Web Spam Challenge
http://airweb.cse.lehigh.edu/2007/
AIRWeb 2007 in Banff, Canada
1. Web Spam
2. Web Spam Detection
3. A Reference Collection
4. Web Links
5. Topological Web Spam
6. Counting of Supporters
7. Content-based Spam detection
8. Web Topology
9. Conclusions
Scale-free networks
How to find meaningful patterns?
Several levels of analysis:
Macroscopic view: overall structure
Microscopic view: nodes
Mesoscopic view: regions
Macroscopic view, e.g. Bow-tie
[Broder et al., 2000]
Macroscopic view, e.g. Bow-tie, migration
[Baeza-Yates and Poblete, 2006]
Macroscopic view, e.g. Jellyfish
[Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
Microscopic view, e.g. Degree
[Barabási, 2002] and others
[Figure panels: Greece, Chile, Spain, Korea]
[Baeza-Yates et al., 2006b] compares this distribution in 8 countries . . . guess what the result is?
Mesoscopic view, e.g. Hop-plot
[Figure: hop-plots (frequency vs. distance, distances 5 to 30) for .it (40M pages), .uk (18M pages), .eu.int (800K pages) and a synthetic graph (100K pages)]
[Baeza-Yates et al., 2006a]
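A hop-plot of the kind shown on this slide can be sketched for a small in-memory graph as follows. This is only an illustration: it runs one breadth-first search per node, which would not scale to the crawls named above, and the function name `hop_plot` and the toy graph are assumptions, not part of the talk.

```python
from collections import Counter, deque

def hop_plot(out_links):
    """Return {distance: fraction of reachable ordered pairs at that distance}."""
    counts = Counter()
    for source in out_links:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from `source`
            node = queue.popleft()
            for nxt in out_links.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    counts[dist[nxt]] += 1
                    queue.append(nxt)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())} if total else {}

# Hypothetical toy graph: a 3-cycle with one extra page hanging off it.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
frequencies = hop_plot(toy)
```

Plotting `frequencies` with distance on the x-axis gives the same kind of curve as the panels in the figure.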
Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
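The idea of looking for nodes that share their out-links can be illustrated with a deliberately simplified sketch. It only groups nodes whose out-link sets are exactly identical; it is not the algorithm of [Gibson et al., 2005], which is designed for massive graphs and approximate overlap. The function name, the thresholds and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict

def groups_sharing_out_links(out_links, min_group_size=3, min_out_degree=2):
    """Group nodes whose sets of out-links are exactly the same."""
    by_targets = defaultdict(list)
    for node, targets in out_links.items():
        key = frozenset(targets)
        if len(key) >= min_out_degree:  # ignore trivial pages with few links
            by_targets[key].append(node)
    return [sorted(nodes) for nodes in by_targets.values()
            if len(nodes) >= min_group_size]

# Hypothetical toy graph: three farm pages all point to the same two targets.
toy = {
    "farm1": ["target", "hub"],
    "farm2": ["hub", "target"],
    "farm3": ["target", "hub"],
    "blog": ["news", "wiki"],
    "news": ["wiki"],
}
suspicious = groups_sharing_out_links(toy)
```

In practice exact matching is too strict, since farm pages rarely have identical link sets; a similarity measure over the sets would be the natural next step.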
Motivation
Fetterly [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
Handling large graphs
For large graphs, random access is not possible.
Large graphs do not fit in main memory
Streaming model of computation
Semi-streaming model
Memory size enough to hold some data per-node
Disk size enough to hold some data per-edge
A small number of passes over the data
Restriction
Semi-streaming model: graph on disk
for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src,dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
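As one possible instance of this template, the sketch below computes PageRank while keeping only per-node arrays in memory and reading the edges sequentially on every pass. The talk does not give this code; the function name, the `edge_stream` callable (standing in for a sequential read of the graph from disk) and the parameter values are assumptions.

```python
def semi_streaming_pagerank(edge_stream, num_nodes, passes=20, damping=0.85):
    """edge_stream() must return a fresh iterator over (src, dest) pairs."""
    # Extra sequential pass: out-degrees are per-node data, so they fit in memory.
    out_degree = [0] * num_nodes
    for src, _ in edge_stream():
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes              # INITIALIZE-MEM
    for _ in range(passes):                           # iteration step
        incoming = [0.0] * num_nodes
        for src, dest in edge_stream():               # follow links: COMPUTE(src, dest)
            incoming[dest] += rank[src] / out_degree[src]
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(r for r, deg in zip(rank, out_degree) if deg == 0)
        rank = [(1 - damping) / num_nodes
                + damping * (inc + dangling / num_nodes)
                for inc in incoming]                  # NORMALIZE
    return rank

# Hypothetical toy graph: a 3-cycle plus one page that only links in.
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
scores = semi_streaming_pagerank(lambda: iter(edges), num_nodes=4)
```

Memory use is proportional to the number of nodes, and each iteration costs one sequential scan of the edge list, which matches the restrictions stated for the semi-streaming model.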
Link-Based Features
Degree-related measures
PageRank
TrustRank [Gyöngyi et al., 2004]
Truncated PageRank [Becchetti et al., 2006]
Estimation of supporters [Becchetti et al., 2006]
140 features per host (2 pages per host)
Degree-Based
[Figure: two histograms comparing normal and spam hosts]
TrustRank Idea
TrustRank / PageRank
[Figure: histogram of TrustRank / PageRank for normal vs. spam hosts]
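The TrustRank / PageRank feature can be sketched on a toy graph. The sketch assumes the usual description of TrustRank as PageRank whose random jumps land only on a set of trusted seed pages; the helper names, the damping factor and the toy graph are illustrative assumptions, and the code keeps the whole graph in memory.

```python
def personalized_pagerank(out_links, restart, damping=0.85, iterations=50):
    """Power iteration where random jumps follow the `restart` distribution."""
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in out_links}
        for src, targets in out_links.items():
            if targets:
                share = damping * score[src] / len(targets)
                for dest in targets:
                    nxt[dest] += share
            else:  # page without out-links: send its mass back to the restart set
                for n in out_links:
                    nxt[n] += damping * score[src] * restart[n]
        score = nxt
    return score

def pagerank(out_links):
    uniform = {n: 1.0 / len(out_links) for n in out_links}
    return personalized_pagerank(out_links, uniform)

def trustrank(out_links, trusted_seeds):
    restart = {n: (1.0 / len(trusted_seeds) if n in trusted_seeds else 0.0)
               for n in out_links}
    return personalized_pagerank(out_links, restart)

# Hypothetical toy graph: a normal component and an isolated link farm.
toy = {
    "univ": ["news"], "news": ["univ", "shop"], "shop": ["news"],
    "farm1": ["spam"], "farm2": ["spam"], "spam": ["farm1", "farm2"],
}
trust = trustrank(toy, {"univ"})
rank = pagerank(toy)
ratio = {n: trust[n] / rank[n] for n in toy}
```

Pages that trusted seeds cannot reach keep a ratio of zero even when their PageRank is high, which is the contrast the histogram is meant to expose.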
High and low-ranked pages are different
[Figure: number of nodes (×10^4) at each distance (1 to 20) for pages in the top 0%–10%, top 40%–50% and top 60%–70%]
Areas below the curves are equal if we are in the same strongly-connected component
Probabilistic counting
[Figure: bit vectors attached to the pages that link to a target page; propagation of bits using the “OR” operation; count bits set to estimate supporters]
[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
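The bit-propagation idea can be sketched as follows. Note the substitution: this uses the basic Flajolet-Martin estimator (position of the lowest unset bit, averaged over independent sketches), not the improved estimator of [Becchetti et al., 2006]. All names, the number of sketches and the toy graph are assumptions, and the graph is held in memory for brevity.

```python
import random

PHI = 0.77351  # correction constant from Flajolet and Martin

def random_sketch(rng, num_sketches, bits=32):
    """One bit set per sketch; bit i is chosen with probability 2^-(i+1)."""
    sketch = []
    for _ in range(num_sketches):
        i = 0
        while i < bits - 1 and rng.random() < 0.5:
            i += 1
        sketch.append(1 << i)
    return sketch

def lowest_zero_bit(mask):
    position = 0
    while mask & (1 << position):
        position += 1
    return position

def estimate_supporters(in_links, distance, num_sketches=64, seed=0):
    """Estimate how many pages reach each node within `distance` links (node included)."""
    rng = random.Random(seed)
    sketches = {node: random_sketch(rng, num_sketches) for node in in_links}
    for _ in range(distance):  # one propagation round per unit of distance
        updated = {}
        for node, sources in in_links.items():
            merged = list(sketches[node])
            for src in sources:  # bitwise OR with every in-neighbor
                merged = [a | b for a, b in zip(merged, sketches[src])]
            updated[node] = merged
        sketches = updated
    return {
        node: 2 ** (sum(lowest_zero_bit(m) for m in masks) / len(masks)) / PHI
        for node, masks in sketches.items()
    }

# Hypothetical toy graph: 50 pages all linking to one target.
toy = {"target": [f"s{i}" for i in range(50)]}
toy.update({f"s{i}": [] for i in range(50)})
base = estimate_supporters(toy, distance=0)
near = estimate_supporters(toy, distance=1)
```

Because OR can only add bits, the estimate for a node never decreases as the distance grows, and the per-node state has a fixed size regardless of how many supporters the node has.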
Bottleneck number
$b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$. Minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and they have smaller bottleneck numbers than non-spam pages.
[Figure: histogram of the bottleneck number for normal vs. spam; x-axis from 1.11 to 4.52]
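A direct, exact computation of the bottleneck number on a small graph might look like the sketch below. It assumes that N_j(x) is the set of pages that can reach x by following at most j links, with N_0(x) = {x}; the slide does not spell this out, so treat that reading, the function name and the toy graph as assumptions.

```python
def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j = 1..d (requires d >= 1)."""
    reached = {x}      # N_0(x), assumed to contain only x
    frontier = {x}
    sizes = [1]
    for _ in range(d):
        # Pages linking to the current frontier that were not seen before.
        frontier = {src for node in frontier
                    for src in in_links.get(node, ())} - reached
        reached |= frontier
        sizes.append(len(reached))
    return min(sizes[j] / sizes[j - 1] for j in range(1, d + 1))

# Hypothetical toy graph given as in-link lists.
toy = {"x": ["a", "b", "c"], "a": ["e"], "b": [], "c": [], "e": []}
b1 = bottleneck_number(toy, "x", 1)   # sizes 1, 4       -> 4.0
b2 = bottleneck_number(toy, "x", 2)   # sizes 1, 4, 5    -> min(4.0, 1.25)
```

On a real crawl the set sizes would be estimated, for example with the probabilistic counting described earlier, rather than enumerated exactly.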
Content-Based Features
Most of the features reported in [Ntoulas et al., 2006]
Number of words in the page and title
Average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision and corpus recall
Query precision and query recall
Independent trigram likelihood
Entropy of trigrams
96 features per host
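A few of the features listed above can be computed from plain text with the standard library, as in the sketch below. The tokenization, the use of zlib, and the reading of compression rate as uncompressed size divided by compressed size are assumptions made for illustration; they are not necessarily the exact definitions used for the 96 features.

```python
import math
import re
import zlib
from collections import Counter

def content_features(text):
    """Compute a handful of simple content-based features for one page."""
    words = re.findall(r"\w+", text.lower())
    raw = text.encode("utf-8")
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    entropy = (-sum((c / total) * math.log2(c / total) for c in trigrams.values())
               if total else 0.0)
    return {
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Highly repetitive (keyword-stuffed) text compresses very well.
        "compression_rate": len(raw) / len(zlib.compress(raw)) if raw else 0.0,
        "trigram_entropy": entropy,
    }

short = content_features("buy cheap pills")
stuffed = content_features("cheap pills " * 200)
```

The repeated text yields a high compression rate and a low trigram entropy, which is the kind of outlier behaviour these features are meant to capture.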
Average word length
Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.
Corpus precision
Figure: Histogram of the corpus precision in non-spam vs. spam pages.
Query precision
Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.
General hypothesis
Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.
Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]
Topological dependencies: in-links
Histogram of fraction of spam hosts in the in-links
0 = no in-link comes from spam hosts
1 = all of the in-links come from spam hosts
[Figure: histogram, in-links of non spam vs. in-links of spam; x-axis from 0.0 to 1.0]
Topological dependencies: out-links
Histogram of fraction of spam hosts in the out-links
0 = none of the out-links points to spam hosts
1 = all of the out-links point to spam hosts
[Figure: histogram, out-links of non spam vs. out-links of spam; x-axis from 0.0 to 1.0]
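The quantity plotted in these two histograms can be computed per host as in the sketch below. The function names, the label strings and the toy graph are assumptions; unlabeled neighbors are simply skipped.

```python
def spam_fraction(neighbors, labels):
    """Fraction of labeled neighbors that are spam, or None if none is labeled."""
    labeled = [labels[n] for n in neighbors if n in labels]
    if not labeled:
        return None
    return sum(1 for label in labeled if label == "spam") / len(labeled)

def in_link_spam_fractions(out_links, labels):
    in_links = {}
    for src, targets in out_links.items():
        for dest in targets:
            in_links.setdefault(dest, set()).add(src)
    return {host: spam_fraction(srcs, labels) for host, srcs in in_links.items()}

def out_link_spam_fractions(out_links, labels):
    return {host: spam_fraction(targets, labels)
            for host, targets in out_links.items()}

# Hypothetical toy host graph and labels.
toy = {"s1": ["s2", "n1"], "s2": ["s1"], "n1": ["n2"], "n2": ["n1"]}
labels = {"s1": "spam", "s2": "spam", "n1": "normal", "n2": "normal"}
in_fractions = in_link_spam_fractions(toy, labels)
out_fractions = out_link_spam_fractions(toy, labels)
```

Binning `in_fractions` separately for spam and non-spam hosts gives the two series of the in-link histogram; the out-link histogram is built the same way from `out_fractions`.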
Idea 1: Clustering
Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
                      Baseline   Clustering
Without bagging
True positive rate    75.6%      74.5%
False positive rate   8.5%       6.8%
F-Measure             0.646      0.673
With bagging
True positive rate    78.7%      76.9%
False positive rate   5.7%       5.0%
F-Measure             0.723      0.728
Reduces error rate
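The relabeling step of this idea can be sketched as a plain majority vote inside each cluster. How the clusters are obtained is outside the sketch: `cluster_of` is a hypothetical mapping from host to cluster id, and ties are broken by whichever label was seen first, which is an arbitrary choice.

```python
from collections import Counter

def relabel_by_cluster(predicted, cluster_of):
    """Give every host the most common predicted label of its cluster."""
    votes = {}
    for host, label in predicted.items():
        votes.setdefault(cluster_of[host], Counter())[label] += 1
    winner = {cluster: counts.most_common(1)[0][0]
              for cluster, counts in votes.items()}
    return {host: winner[cluster_of[host]] for host in predicted}

# Hypothetical classifier output and clustering.
predicted = {"a": "spam", "b": "spam", "c": "normal", "d": "normal", "e": "normal"}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
smoothed = relabel_by_cluster(predicted, cluster_of)
```

Host "c" is flipped to spam because two of the three hosts in its cluster were predicted as spam.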
Idea 2: Propagate the label
Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
                             Baseline   Fwds.   Backwds.   Both
Classifier without bagging
True positive rate           75.6%      70.9%   69.4%      71.4%
False positive rate          8.5%       6.1%    5.8%       5.8%
F-Measure                    0.646      0.665   0.664      0.676
Classifier with bagging
True positive rate           78.7%      76.5%   75.0%      75.2%
False positive rate          5.7%       5.4%    4.3%       4.7%
F-Measure                    0.723      0.716   0.733      0.724
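A random walk with restart seeded by the predicted spamicity can be sketched as below. Forward propagation follows the links as they are and backward propagation uses the reversed graph. The damping factor, the handling of pages without out-links and all names are assumptions, not the settings behind the table.

```python
def reverse(out_links):
    """Return the graph with every link direction flipped."""
    reversed_links = {node: [] for node in out_links}
    for src, targets in out_links.items():
        for dest in targets:
            reversed_links[dest].append(src)
    return reversed_links

def random_walk_with_restart(links, spamicity, damping=0.85, iterations=50):
    total = sum(spamicity.values())
    restart = {n: spamicity[n] / total for n in links}  # restart proportional to spamicity
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in links}
        for src, targets in links.items():
            if targets:
                for dest in targets:
                    nxt[dest] += damping * score[src] / len(targets)
            else:  # no links to follow: restart
                for n in links:
                    nxt[n] += damping * score[src] * restart[n]
        score = nxt
    return score

# Hypothetical toy graph: "u" links to a spam host but was predicted as clean.
toy = {"s1": ["s2"], "s2": ["s1"], "u": ["s1"], "n1": ["n2"], "n2": ["n1"]}
spamicity = {"s1": 0.9, "s2": 0.8, "u": 0.0, "n1": 0.0, "n2": 0.0}
forward = random_walk_with_restart(toy, spamicity)
backward = random_walk_with_restart(reverse(toy), spamicity)
```

In this toy case only the backward walk raises the score of "u", since spam mass flows from "s1" to the hosts that link to it.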
Idea 3: Stacked graphical learning
Classify, then add the average predicted “spamicity” of neighbors as a new feature for each node, then classify again [Cohen and Kou, 2006]
                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate    78.7%      84.4%        78.3%         85.2%
False positive rate   5.7%       6.7%         4.8%          6.1%
F-Measure             0.723      0.733        0.742         0.750
Increases detection rate
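The feature-augmentation step of stacked graphical learning can be sketched as below, for the variant that averages over both in- and out-neighbors. Training the two classifiers is left out. The names are hypothetical, and using a host's own spamicity when it has no neighbors is an assumption made for this sketch.

```python
def add_neighbor_spamicity(features, spamicity, out_links):
    """Append the average predicted spamicity of in- and out-neighbors to each row."""
    neighbors = {host: list(out_links.get(host, ())) for host in features}
    for src, targets in out_links.items():
        for dest in targets:
            neighbors[dest].append(src)   # src is an in-neighbor of dest
    extended = {}
    for host, row in features.items():
        linked = neighbors[host]
        if linked:
            average = sum(spamicity[n] for n in linked) / len(linked)
        else:
            average = spamicity[host]     # assumption: fall back to own score
        extended[host] = row + [average]
    return extended

# Hypothetical first-pass output: one original feature and a spamicity per host.
features = {"a": [1.0], "b": [2.0], "c": [3.0]}
spamicity = {"a": 0.75, "b": 0.25, "c": 0.5}
out_links = {"a": ["b"], "b": ["c"], "c": []}
stacked = add_neighbor_spamicity(features, spamicity, out_links)
```

A second classifier is then trained on the rows of `stacked`; repeating the whole procedure with the new predictions gives a further pass.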
Idea 3: Stacked graphical learning x2
And repeat ...
                      Baseline   First pass   Second pass
True positive rate    78.7%      85.2%        88.4%
False positive rate   5.7%       6.1%         6.3%
F-Measure             0.723      0.750        0.763
Significant improvement over the baseline
Concluding remarks
The UK-2006-05 dataset is “harder” than previous datasets
Considering content-based and link-based attributes improves the accuracy
Considering the dependencies improves the accuracy
Thank you!
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national web domains. To appear in ACM TOIT.
Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
Chellapilla, K. and Maykov, A. (2007). A taxonomy of JavaScript redirection spam. In AIRWeb ’07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, pages 81–88, New York, NY, USA. ACM Press.
Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.
Davison, B. D. (2000). Topical locality in the web. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pages 272–279, Athens, Greece. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web conference, pages 83–92, Edinburgh, Scotland.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the Internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.