Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
Using Rank Propagation and Probabilistic
R. Baeza-Yates
Counting for Link-based Spam Detection
Motivation
Degree-based
measures
PageRank
Luca Becchetti1 , Carlos Castillo1 ,Debora Donato1 ,
TrustRank
Stefano Leonardi1 and Ricardo Baeza-Yates2
Truncated
PageRank
1. Universit` di Roma “La Sapienza” – Rome, Italy
a
Counting
supporters
2. Yahoo! Research – Barcelona, Spain and Santiago, Chile
Conclusions
May 19th, 2006
Link-Based
Spam Detection
L. Becchetti,
Motivation
1
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Degree-based measures
2
Motivation
PageRank
Degree-based 3
measures
PageRank
TrustRank
4
TrustRank
Truncated
PageRank
Truncated PageRank
5
Counting
supporters
Conclusions
Counting supporters
6
Conclusions
7
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
What is on the Web?
Link-Based
Information
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
What is on the Web?
Link-Based
Information + Porn
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
What is on the Web?
Link-Based
Information + Porn + On-line casinos + Free movies +
Spam Detection
Cheap software + Buy a MBA diploma + Prescription -free
L. Becchetti,
C. Castillo,
drugs + V!-4-gra + Get rich now now now!!!
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Graphic: www.milliondollarhomepage.com
Web spam (keywords + links)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Web spam (mostly keywords)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Search engine?
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Fake search engine
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Problem: “normal” pages that are spam
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Link farms
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Link farms
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Single-level farms can be detected by searching groups of
nodes sharing their out-links [Gibson et al., 2005]
Motivation
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
[Fetterly et al., 2004] hypothesized that studying the
R. Baeza-Yates
distribution of statistics about pages could be a good way of
Motivation
detecting spam pages:
Degree-based
measures
“in a number of these distributions, outlier values are
PageRank
associated with web spam”
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Motivation
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
[Fetterly et al., 2004] hypothesized that studying the
R. Baeza-Yates
distribution of statistics about pages could be a good way of
Motivation
detecting spam pages:
Degree-based
measures
“in a number of these distributions, outlier values are
PageRank
associated with web spam”
TrustRank
Truncated
PageRank
Research goal
Counting
supporters
Statistical analysis of link-based spam
Conclusions
Metrics
Link-Based
Graph algorithms
Spam Detection
L. Becchetti,
All shortest paths, centrality, betweenness, clustering coefficient...
C. Castillo,
D. Donato,
Streamed algorithms
S. Leonardi and
R. Baeza-Yates
Breadth-first and depth-first search
Motivation
Count of neighbors
Degree-based
Symmetric algorithms
measures
(Strongly) connected components
PageRank
TrustRank
Approximate count of neighbors
Truncated
PageRank, Truncated PageRank, Linear Rank
PageRank
HITS, Salsa, TrustRank
Counting
supporters
Conclusions
Test collection
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
U.K. collection
R. Baeza-Yates
18.5 million pages downloaded from the .UK domain
Motivation
Degree-based
5,344 hosts manually classified (6% of the hosts)
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Test collection
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
U.K. collection
R. Baeza-Yates
18.5 million pages downloaded from the .UK domain
Motivation
Degree-based
5,344 hosts manually classified (6% of the hosts)
measures
PageRank
TrustRank
Truncated
PageRank
Classified entire hosts:
Counting
V A few hosts are mixed: spam and non-spam pages
supporters
Conclusions
X More coverage: sample covers 32% of the pages
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
Degree
Link-Based
δ = 0.35
In−degree
Spam Detection
L. Becchetti,
Normal
C. Castillo,
D. Donato,
0.4 Spam
S. Leonardi and
R. Baeza-Yates
Motivation
0.3
Degree-based
measures
PageRank
0.2
TrustRank
Truncated
PageRank
Counting
0.1
supporters
Conclusions
0
1 100 10000
Number of in−links
(δ = max. difference in C.D.F. plot)
Degree
Link-Based
Spam Detection
δ = 0.28
Out−degree
L. Becchetti,
0.3
C. Castillo,
Normal
D. Donato,
S. Leonardi and
Spam
R. Baeza-Yates
Motivation
Degree-based
0.2
measures
PageRank
TrustRank
Truncated
PageRank
0.1
Counting
supporters
Conclusions
0
1 10 50 100
Number of out−links
Edge reciprocity
Link-Based
Spam Detection
δ = 0.35
Reciprocity of max. PR page
L. Becchetti,
0.5
C. Castillo,
Normal
D. Donato,
S. Leonardi and
Spam
R. Baeza-Yates
0.4
Motivation
Degree-based
measures
0.3
PageRank
TrustRank
0.2
Truncated
PageRank
Counting
supporters
0.1
Conclusions
0
0 0.2 0.4 0.6 0.8 1
Fraction of reciprocal links
Assortativity
Link-Based
Spam Detection
δ = 0.31
Degree / Degree of neighbors
L. Becchetti,
C. Castillo,
0.4
D. Donato,
Normal
S. Leonardi and
Spam
R. Baeza-Yates
Motivation
0.3
Degree-based
measures
PageRank
0.2
TrustRank
Truncated
PageRank
Counting
0.1
supporters
Conclusions
0
0.001 0.01 0.1 1 10 100 1000
Degree/Degree ratio of home page
Automatic classifier
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
All of the following attributes in the home page and the page
D. Donato,
S. Leonardi and
with the maximum PageRank, plus a binary variable
R. Baeza-Yates
indicating if they are the same page:
Motivation
- In-degree, out-degree
Degree-based
measures
- Fraction of reciprocal edges
PageRank
- Degree divided by degree of direct neighbors
TrustRank
- Average and sum of in-degree of out-neighbors
Truncated
PageRank
- Average and sum of out-degree of in-neighbors
Counting
supporters
Decision tree
Conclusions
72.6% of detection rate, with 3.1% false positives
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
Let PN×N be the normalized link matrix of a graph
D. Donato,
S. Leonardi and
Row-normalized
R. Baeza-Yates
No “sinks”
Motivation
Degree-based
Definition (PageRank)
measures
PageRank
Stationary state of:
TrustRank
(1 − α)
Truncated
αP + 1N×N
PageRank
N
Counting
supporters
Conclusions
PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
Let PN×N be the normalized link matrix of a graph
D. Donato,
S. Leonardi and
Row-normalized
R. Baeza-Yates
No “sinks”
Motivation
Degree-based
Definition (PageRank)
measures
PageRank
Stationary state of:
TrustRank
(1 − α)
Truncated
αP + 1N×N
PageRank
N
Counting
supporters
Conclusions
Follow links with probability α
Random jump with probability 1 − α
Maximum PageRank in the Host
Link-Based
Spam Detection
δ = 0.23
Maximum PageRank of the site
L. Becchetti,
C. Castillo,
D. Donato,
Normal
S. Leonardi and
Spam
R. Baeza-Yates
0.3
Motivation
Degree-based
measures
0.2
PageRank
TrustRank
Truncated
PageRank
0.1
Counting
supporters
Conclusions
0
1e−08 1e−07 1e−06 1e−05 0.0001
PageRank
Variance of PageRank
Suggested in [Bencz´r et al., 2005]
u
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
PageRank PageRank
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Variance of PageRank of in-neighbors
Link-Based
Spam Detection
Stdev. of PR of Neighbors (Home) δ = 0.41
L. Becchetti,
C. Castillo,
D. Donato,
Normal
S. Leonardi and
Spam
R. Baeza-Yates
0.3
Motivation
Degree-based
measures
0.2
PageRank
TrustRank
Truncated
PageRank
0.1
Counting
supporters
Conclusions
0
0 0.2 0.4 0.6 0.8 1
σ2 of the logarithm of PageRank
Automatic classifier
Link-Based
Spam Detection
L. Becchetti,
Features: degree-based plus the following in the home page
C. Castillo,
D. Donato,
and the page with maximum PageRank:
S. Leonardi and
R. Baeza-Yates
- PageRank
- In-degree/PageRank
Motivation
Degree-based
- Out-degree/PageRank
measures
Standard deviation of PageRank of in-neighbors = σ 2
-
PageRank
σ 2 /PageRank
-
TrustRank
Plus the PageRank of the home page divided by the
Truncated
PageRank
PageRank of the page with the maximum PageRank.
Counting
supporters
Decision tree
Conclusions
74.4% of detection rate, with 2.6% false positives
Automatic classifier
Link-Based
Spam Detection
L. Becchetti,
Features: degree-based plus the following in the home page
C. Castillo,
D. Donato,
and the page with maximum PageRank:
S. Leonardi and
R. Baeza-Yates
- PageRank
- In-degree/PageRank
Motivation
Degree-based
- Out-degree/PageRank
measures
Standard deviation of PageRank of in-neighbors = σ 2
-
PageRank
σ 2 /PageRank
-
TrustRank
Plus the PageRank of the home page divided by the
Truncated
PageRank
PageRank of the page with the maximum PageRank.
Counting
supporters
Decision tree
Conclusions
74.4% of detection rate, with 2.6% false positives
(Degree-based: 72.6% of detection, 3.1% false positives)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
TrustRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
TrustRank [Gy¨ngyi et al., 2004]
o
Motivation
A node with high PageRank, but far away from a core set of
Degree-based
“trusted nodes” is suspicious
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
TrustRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
TrustRank [Gy¨ngyi et al., 2004]
o
Motivation
A node with high PageRank, but far away from a core set of
Degree-based
“trusted nodes” is suspicious
measures
PageRank
Start from a set of trusted nodes, then do a random walk,
TrustRank
returning to the set of trusted nodes with probability 1 − α at
Truncated
PageRank
each step
Counting
supporters
i Trusted nodes: data from http://www.dmoz.org/
Conclusions
TrustRank score
Link-Based
Spam Detection
δ = 0.59
TrustRank score of home page
L. Becchetti,
C. Castillo,
D. Donato,
Normal
S. Leonardi and
0.4 Spam
R. Baeza-Yates
Motivation
Degree-based
0.3
measures
PageRank
TrustRank
0.2
Truncated
PageRank
Counting
0.1
supporters
Conclusions
0
1e−06 0.001
TrustRank
TrustRank / PageRank
Link-Based
Spam Detection
δ = 0.59
Estimated relative non−spam mass
L. Becchetti,
C. Castillo,
D. Donato,
Normal
0.8
S. Leonardi and
Spam
R. Baeza-Yates
0.7
Motivation
0.6
Degree-based
measures
0.5
PageRank
0.4
TrustRank
Truncated
0.3
PageRank
Counting
0.2
supporters
Conclusions
0.1
0
0.3 1 10 100
TrustRank score/PageRank
Automatic classifier
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
PageRank attributes, plus the following in the home page and
D. Donato,
S. Leonardi and
the page with maximum PageRank:
R. Baeza-Yates
- TrustScore
Motivation
- TrustScore/PageRank (estimated relative non-spam mass)
Degree-based
measures
- TrustScore/In-degree
PageRank
Plus the TrustScore in the home page divided by the
TrustRank
TrustScore in the page with the maximum PageRank.
Truncated
PageRank
Decision tree
Counting
supporters
77.3% of detection rate, with 3.0% false positives
Conclusions
Automatic classifier
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
PageRank attributes, plus the following in the home page and
D. Donato,
S. Leonardi and
the page with maximum PageRank:
R. Baeza-Yates
- TrustScore
Motivation
- TrustScore/PageRank (estimated relative non-spam mass)
Degree-based
measures
- TrustScore/In-degree
PageRank
Plus the TrustScore in the home page divided by the
TrustRank
TrustScore in the page with the maximum PageRank.
Truncated
PageRank
Decision tree
Counting
supporters
77.3% of detection rate, with 3.0% false positives
Conclusions
(PageRank-based: 74.4% of detection, 2.6% false positives)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
Path-based formula for PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
Given a path p = x1 , x2 , . . . , xt of length t = |p|
D. Donato,
S. Leonardi and
R. Baeza-Yates
1
branching(p) =
Motivation
d1 d2 · · · dt−1
Degree-based
measures
where di are the out-degrees of the members of the path
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Path-based formula for PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
Given a path p = x1 , x2 , . . . , xt of length t = |p|
D. Donato,
S. Leonardi and
R. Baeza-Yates
1
branching(p) =
Motivation
d1 d2 · · · dt−1
Degree-based
measures
where di are the out-degrees of the members of the path
PageRank
Explicit formula for PageRank [Newman et al., 2001]
TrustRank
Truncated
(1 − α)α|p|
PageRank
ri (α) = branching(p)
Counting
N
supporters
p∈Path(−,i)
Conclusions
Path(−, i) are incoming paths in node i
General functional ranking
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
In general:
R. Baeza-Yates
Motivation
damping(|p|)
ri (α) = branching(p)
Degree-based
N
measures
p∈Path(−,i)
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
General functional ranking
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
In general:
R. Baeza-Yates
Motivation
damping(|p|)
ri (α) = branching(p)
Degree-based
N
measures
p∈Path(−,i)
PageRank
TrustRank
Truncated
There are many choices for damping(|p|), including simply a
PageRank
linear function that is as good as PageRank in practice
Counting
supporters
[Baeza-Yates et al., 2006]
Conclusions
General functional ranking
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
In general:
R. Baeza-Yates
Motivation
damping(|p|)
ri (α) = branching(p)
Degree-based
N
measures
p∈Path(−,i)
PageRank
TrustRank
Truncated
There are many choices for damping(|p|), including simply a
PageRank
linear function that is as good as PageRank in practice
Counting
supporters
[Baeza-Yates et al., 2006]
Conclusions
Truncated PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Reduce the direct contribution of the first levels of links:
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
t≤T
0
Counting
damping(t) =
supporters
C αt t>T
Conclusions
Truncated PageRank
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
Reduce the direct contribution of the first levels of links:
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
t≤T
0
Counting
damping(t) =
supporters
C αt t>T
Conclusions
V No extra reading of the graph after PageRank
Truncated PageRank(T=2) / PageRank
Link-Based
Spam Detection
TruncatedPageRank T=2 / PageRank δ = 0.30
L. Becchetti,
C. Castillo,
D. Donato,
Normal
S. Leonardi and
Spam
R. Baeza-Yates
0.3
Motivation
Degree-based
measures
0.2
PageRank
TrustRank
Truncated
PageRank
0.1
Counting
supporters
Conclusions
0
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
TruncatedPageRank(T=2) / PageRank
Max. change of Truncated PageRank
Link-Based
Spam Detection
L. Becchetti,
Maximum change of Truncated PageRank δ = 0.29
C. Castillo,
D. Donato,
S. Leonardi and
Normal
R. Baeza-Yates
Spam
0.2
Motivation
Degree-based
measures
PageRank
TrustRank
0.1
Truncated
PageRank
Counting
supporters
Conclusions
0
0.85 0.9 0.95 1 1.05 1.1
max(TrPRi+1/TrPri)
Automatic classifier
Link-Based
PageRank attributes, plus the following in the home page and
Spam Detection
the page with maximum PageRank:
L. Becchetti,
C. Castillo,
D. Donato,
- TruncPageRank(T = 1 . . . 4)
S. Leonardi and
R. Baeza-Yates
- TruncPageRank(T = 4) / TruncPageRank(T = 3)
- TruncPageRank(T = 3) / TruncPageRank(T = 2)
Motivation
- TruncPageRank(T = 2) / TruncPageRank(T = 1)
Degree-based
measures
- TruncPageRank(T = 1 . . . 4) / PageRank
PageRank
- Minimum, maximum and average of:
TrustRank
TruncPageRank(T = i)/TruncPageRank(T = i − 1)
Truncated
PageRank
Plus the TruncatedPageRank(T = 1 . . . 4) of the home page
Counting
divided by the same value in the page with the maximum
supporters
PageRank.
Conclusions
Decision tree
76.9% of detection rate, with 2.5% false positives
Automatic classifier
Link-Based
PageRank attributes, plus the following in the home page and
Spam Detection
the page with maximum PageRank:
L. Becchetti,
C. Castillo,
D. Donato,
- TruncPageRank(T = 1 . . . 4)
S. Leonardi and
R. Baeza-Yates
- TruncPageRank(T = 4) / TruncPageRank(T = 3)
- TruncPageRank(T = 3) / TruncPageRank(T = 2)
Motivation
- TruncPageRank(T = 2) / TruncPageRank(T = 1)
Degree-based
measures
- TruncPageRank(T = 1 . . . 4) / PageRank
PageRank
- Minimum, maximum and average of:
TrustRank
TruncPageRank(T = i)/TruncPageRank(T = i − 1)
Truncated
PageRank
Plus the TruncatedPageRank(T = 1 . . . 4) of the home page
Counting
divided by the same value in the page with the maximum
supporters
PageRank.
Conclusions
Decision tree
76.9% of detection rate, with 2.5% false positives
(TrustRank-based: 77.3% of detection, 3.0% false positives)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
Idea: count “supporters” at different distances
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
Number of different nodes at a given distance:
D. Donato,
S. Leonardi and
R. Baeza-Yates
.UK 18 mill. nodes .EU.INT 860,000 nodes
Motivation
0.3 0.3
Degree-based
measures
0.2 0.2
Frequency
Frequency
PageRank
TrustRank
0.1 0.1
Truncated
PageRank
0.0 0.0
Counting
5 10 15 20 25 30 5 10 15 20 25 30
supporters
Distance Distance
Conclusions
Average distance Average distance
14.9 clicks 10.0 clicks
High and low-ranked pages are different
Link-Based
4
Spam Detection
x 10
Top 0%−10%
L. Becchetti,
12
C. Castillo,
Top 40%−50%
D. Donato,
Top 60%−70%
S. Leonardi and
R. Baeza-Yates
10
Number of Nodes
Motivation
8
Degree-based
measures
6
PageRank
TrustRank
4
Truncated
PageRank
Counting
2
supporters
Conclusions
0
1 5 10 15 20
Distance
High and low-ranked pages are different
Link-Based
4
Spam Detection
x 10
Top 0%−10%
L. Becchetti,
12
C. Castillo,
Top 40%−50%
D. Donato,
Top 60%−70%
S. Leonardi and
R. Baeza-Yates
10
Number of Nodes
Motivation
8
Degree-based
measures
6
PageRank
TrustRank
4
Truncated
PageRank
Counting
2
supporters
Conclusions
0
1 5 10 15 20
Distance
Areas below the curves are equal if we are in the same
strongly-connected component
Probabilistic counting
Link-Based
Spam Detection
L. Becchetti,
C. Castillo, 1
1
D. Donato, 0
0
S. Leonardi and 0
0
R. Baeza-Yates 0
0
0 1
1 1
1
1
0 0
1 1
0
0
Motivation 0
0 0 0
Propagation of 0
0 1
1
Degree-based bits using the 1
0 1
1
measures “OR” operation 1
0 1
0
PageRank
1
Target
0 Count bits set
TrustRank 0
page
0 to estimate
0
0 supporters
Truncated
0
0
PageRank 1
1 1
1
0
0 1
1
Counting 0
0
supporters 0
0
1
1
Conclusions
0
0
Probabilistic counting
Link-Based
Spam Detection
L. Becchetti,
C. Castillo, 1
1
D. Donato, 0
0
S. Leonardi and 0
0
R. Baeza-Yates 0
0
0 1
1 1
1
1
0 0
1 1
0
0
Motivation 0
0 0 0
Propagation of 0
0 1
1
Degree-based bits using the 1
0 1
1
measures “OR” operation 1
0 1
0
PageRank
1
Target
0 Count bits set
TrustRank 0
page
0 to estimate
0
0 supporters
Truncated
0
0
PageRank 1
1 1
1
0
0 1
1
Counting 0
0
supporters 0
0
1
1
Conclusions
0
0
Improvement of ANF algorithm [Palmer et al., 2002] based on
probabilistic counting [Flajolet and Martin, 1985]
General algorithm
Link-Based
Require: N: number of nodes, d: distance, k: bits
Spam Detection
1: for node : 1 . . . N, bit: 1 . . . k do
L. Becchetti,
C. Castillo,
INIT(node,bit)
2:
D. Donato,
S. Leonardi and
3: end for
R. Baeza-Yates
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
General algorithm
Link-Based
Require: N: number of nodes, d: distance, k: bits
Spam Detection
1: for node : 1 . . . N, bit: 1 . . . k do
L. Becchetti,
C. Castillo,
INIT(node,bit)
2:
D. Donato,
S. Leonardi and
3: end for
R. Baeza-Yates
4: for distance : 1 . . . d do {Iteration step}
Motivation
Aux ← 0k
5:
Degree-based
for src : 1 . . . N do {Follow links in the graph}
measures
6:
PageRank
for all links from src to dest do
7:
TrustRank
Aux[dest] ← Aux[dest] OR V[src,·]
8:
Truncated
end for
9:
PageRank
end for
10:
Counting
supporters
V ← Aux
11:
Conclusions
12: end for
General algorithm
Link-Based
Require: N: number of nodes, d: distance, k: bits
Spam Detection
1: for node : 1 . . . N, bit: 1 . . . k do
L. Becchetti,
C. Castillo,
INIT(node,bit)
2:
D. Donato,
S. Leonardi and
3: end for
R. Baeza-Yates
4: for distance : 1 . . . d do {Iteration step}
Motivation
Aux ← 0k
5:
Degree-based
for src : 1 . . . N do {Follow links in the graph}
measures
6:
PageRank
for all links from src to dest do
7:
TrustRank
Aux[dest] ← Aux[dest] OR V[src,·]
8:
Truncated
end for
9:
PageRank
end for
10:
Counting
supporters
V ← Aux
11:
Conclusions
12: end for
13: for node: 1 . . . N do {Estimate supporters}
Supporters[node] ← ESTIMATE( V[node,·] )
14:
15: end for
16: return Supporters
Our estimator
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Initialize all bits to one with probability
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Our estimator
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Initialize all bits to one with probability
Motivation ones(node)
Estimator: neighbors(node) = log(1− ) 1 − k
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Our estimator
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Initialize all bits to one with probability
Motivation ones(node)
Estimator: neighbors(node) = log(1− ) 1 − k
Degree-based
measures
Adaptive estimation
PageRank
TrustRank
Repeat the above process for = 1/2, 1/4, 1/8, . . . , and look
Truncated
for the transitions from more than (1 − 1/e)k ones to less
PageRank
than (1 − 1/e)k ones.
Counting
supporters
Conclusions
Convergence
Link-Based
Spam Detection
L. Becchetti,
100%
C. Castillo,
D. Donato,
90%
S. Leonardi and
R. Baeza-Yates
80%
Fraction of nodes
Motivation
70%
with estimates
Degree-based
60%
measures
50%
PageRank
d=1
d=2
TrustRank
40%
d=3
Truncated
30% d=4
PageRank
d=5
20%
Counting
d=6
supporters
d=7
10%
Conclusions
d=8
0%
5 10 15 20
Iteration
Hosts at distance 4
Link-Based
Spam Detection
δ = 0.39
Hosts at Distance Exactly 4
L. Becchetti,
0.4
C. Castillo,
Normal
D. Donato,
S. Leonardi and
Spam
R. Baeza-Yates
0.3
Motivation
Degree-based
measures
PageRank
0.2
TrustRank
Truncated
PageRank
Counting
0.1
supporters
Conclusions
0
1 100 1000
S4 − S3
Minimum change of supporters
Link-Based
Spam Detection
δ = 0.39
Minimum change of supporters
L. Becchetti,
C. Castillo,
Normal
D. Donato,
S. Leonardi and
0.4 Spam
R. Baeza-Yates
Motivation
Degree-based
0.3
measures
PageRank
TrustRank
0.2
Truncated
PageRank
Counting
supporters
0.1
Conclusions
0
1 5 10
min(S2/S1, S3/S2, S4/S3)
Automatic classifier
Link-Based
PageRank attributes, plus the following in the home page and
Spam Detection
the page with maximum PageRank:
L. Becchetti,
C. Castillo,
D. Donato,
- Supporters at 2 . . . 4
S. Leonardi and
R. Baeza-Yates
- Supporters at 2 . . . 4 / PageRank
- Supporters at i / Supporters at i − 1 (for i = 1..4)
Motivation
- Minimum, maximum and average of: Supporters at i /
Degree-based
measures
Supporters at i − 1 (for i = 1..4)
PageRank
- (Supporters at i - Supporters at i − 1) / PageRank
TrustRank
Plus the number of supporters at distance 2 . . . 4 in the home
Truncated
PageRank
page divided by the same feature in the page with the
Counting
maximum PageRank.
supporters
Conclusions
Decision tree
78.9% of detection rate, with 2.5% false positives
Automatic classifier
Link-Based
PageRank attributes, plus the following in the home page and
Spam Detection
the page with maximum PageRank:
L. Becchetti,
C. Castillo,
D. Donato,
- Supporters at 2 . . . 4
S. Leonardi and
R. Baeza-Yates
- Supporters at 2 . . . 4 / PageRank
- Supporters at i / Supporters at i − 1 (for i = 1..4)
Motivation
- Minimum, maximum and average of: Supporters at i /
Degree-based
measures
Supporters at i − 1 (for i = 1..4)
PageRank
- (Supporters at i - Supporters at i − 1) / PageRank
TrustRank
Plus the number of supporters at distance 2 . . . 4 in the home
Truncated
PageRank
page divided by the same feature in the page with the
Counting
maximum PageRank.
supporters
Conclusions
Decision tree
78.9% of detection rate, with 2.5% false positives
(TruncPR: 76.9% of detection, 2.5% false positives)
(TrustRank: 77.7% of detection, 3.0% false positives)
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
1
Motivation
Degree-based measures
2
Degree-based
PageRank
3
measures
TrustRank
4
PageRank
Truncated PageRank
5
TrustRank
Counting supporters
6
Truncated
PageRank Conclusions
7
Counting
supporters
Conclusions
Summary of classifiers
Detection False
Link-Based
Spam Detection
Metrics rate positives
L. Becchetti,
Degree (D) 72.6% 3.1%
C. Castillo,
D. Donato,
D + PageRank (P) 74.4% 2.6%
S. Leonardi and
R. Baeza-Yates
D + P + TrustRank 77.7% 3.0%
Motivation
D + P + Trunc. PageRank 76.9% 2.5%
Degree-based
D + P + Est. of Supporters 78.9% 2.5%
measures
All attributes 81.4% 2.8%
PageRank
All attributes (more rules) 80.8% 1.1%
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Summary of classifiers
Detection False
Link-Based
Spam Detection
Metrics rate positives
L. Becchetti,
Degree (D) 72.6% 3.1%
C. Castillo,
D. Donato,
D + PageRank (P) 74.4% 2.6%
S. Leonardi and
R. Baeza-Yates
D + P + TrustRank 77.7% 3.0%
Motivation
D + P + Trunc. PageRank 76.9% 2.5%
Degree-based
D + P + Est. of Supporters 78.9% 2.5%
measures
All attributes 81.4% 2.8%
PageRank
All attributes (more rules) 80.8% 1.1%
TrustRank
Truncated
PageRank
Comparison
Counting
supporters
Content-based analysis [Ntoulas et al., 2006] has shown
Conclusions
86.2% detection rate with 2.2% false positives
Single-attribute classifier with TrustRank
[Gy¨ngyi et al., 2004]: 51.1% detection rate, 3.4% error in
o
our sample – SpamRank [Bencz´r et al., 2005] reports similar
u
detection rates
Top 10 metrics
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
1. Binary variable indicating if homepage is the page with
S. Leonardi and
R. Baeza-Yates
maximum PageRank of the site
2. Edge reciprocity
Motivation
3. Different hosts at distance 4
Degree-based
measures
4. Different hosts at distance 3
PageRank
5. Minimum change of supporters (different hosts)
TrustRank
6. Different hosts at distance 2
Truncated
PageRank
7. TruncatedPagerank (T=1) / PageRank
Counting
8. TrustRank score divided by PageRank
supporters
9. Different hosts at distance 1
Conclusions
10. TruncatedPagerank (T=2) / PageRank
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
X Precision still low compared to e-mail spam filters
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
X Precision still low compared to e-mail spam filters
PageRank
V Measure both home page and max. PageRank page
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
X Precision still low compared to e-mail spam filters
PageRank
V Measure both home page and max. PageRank page
TrustRank
V
Truncated
Host-based counts are important
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
X Precision still low compared to e-mail spam filters
PageRank
V Measure both home page and max. PageRank page
TrustRank
V
Truncated
Host-based counts are important
PageRank
Counting
supporters
Conclusions
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
V Link-based statistics to detect 80% of spam
Motivation
X No magic bullet in link analysis
Degree-based
measures
X Precision still low compared to e-mail spam filters
PageRank
V Measure both home page and max. PageRank page
TrustRank
V
Truncated
Host-based counts are important
PageRank
Counting
Next step: combine link analysis and content analysis
supporters
Conclusions
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Motivation
Degree-based
measures
Thank you!
PageRank
TrustRank
Truncated
PageRank
Counting
supporters
Conclusions
Link-Based
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Spam Detection
L. Becchetti, Generalizing pagerank: Damping functions for link-based
C. Castillo,
ranking algorithms.
D. Donato,
S. Leonardi and
In Proceedings of SIGIR, Seattle, Washington, USA. ACM
R. Baeza-Yates
Press.
Motivation
Degree-based Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and
measures
Baeza-Yates, R. (2006).
PageRank
Using rank propagation and probabilistic counting for
TrustRank
link-based spam detection.
Truncated
PageRank
Technical report, DELIS – Dynamically Evolving, Large-Scale
Counting
Information Systems.
supporters
Conclusions
Bencz´r, A. A., Csalog´ny, K., Sarl´s, T., and Uher, M.
u a o
(2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop on
Adversarial Information Retrieval on the Web, Chiba, Japan.
Link-Based
Spam Detection
L. Becchetti,
Fetterly, D., Manasse, M., and Najork, M. (2004).
C. Castillo,
D. Donato,
Spam, damn spam, and statistics: Using statistical analysis to
S. Leonardi and
R. Baeza-Yates
locate spam web pages.
In Proceedings of the seventh workshop on the Web and
Motivation
databases (WebDB), pages 1–6, Paris, France.
Degree-based
measures
PageRank
Flajolet, P. and Martin, N. G. (1985).
TrustRank
Probabilistic counting algorithms for data base applications.
Truncated
Journal of Computer and System Sciences, 31(2):182–209.
PageRank
Counting
supporters
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Conclusions
Discovering large dense subgraphs in massive graphs.
In VLDB ’05: Proceedings of the 31st international conference
on Very large data bases, pages 721–732. VLDB Endowment.
Link-Based
Spam Detection
Gy¨ngyi, Z., Molina, H. G., and Pedersen, J. (2004).
o
L. Becchetti,
C. Castillo,
Combating web spam with trustrank.
D. Donato,
S. Leonardi and
In Proceedings of the Thirtieth International Conference on
R. Baeza-Yates
Very Large Data Bases (VLDB), pages 576–587, Toronto,
Motivation
Canada. Morgan Kaufmann.
Degree-based
measures
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
PageRank
Random graphs with arbitrary degree distributions and their
TrustRank
applications.
Truncated
PageRank
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Counting
supporters
Ntoulas, A., Najork, M., and Manasse, M. a. (2006).
Conclusions
Detecting spam web pages through content analysis.
In To appear in proceedings of the World Wide Web
conference, Edinburgh, Scotland.
Link-Based
Spam Detection
L. Becchetti,
C. Castillo,
D. Donato,
S. Leonardi and
R. Baeza-Yates
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
Motivation
Degree-based
ANF: a fast and scalable tool for data mining in massive
measures
graphs.
PageRank
In Proceedings of the eighth ACM SIGKDD international
TrustRank
conference on Knowledge discovery and data mining, pages
Truncated
PageRank
81–90, New York, NY, USA. ACM Press.
Counting
supporters
Conclusions
0 comments
Post a comment