1.
Link Analysis on the Web
The big picture, the small picture and the medium-sized picture
Ricardo Baeza-Yates (3, 4)
Joint work with: L. Becchetti (1), P. Boldi (2), C. Castillo (1, 3), D. Donato (1, 3), S. Leonardi (1), B. Poblete (5)
1. Università di Roma “La Sapienza” – Rome, Italy
2. Università degli Studi di Milano – Milan, Italy
3. Yahoo! Research Barcelona – Catalunya, Spain
4. Yahoo! Research Latin America – Santiago, Chile
5. Universitat Pompeu Fabra – Catalunya, Spain
2.
1 Levels of Link Analysis
2 Generalizing PageRank
3 Other Functional Rankings
4 Web Spam
5 Web Spam Detection
6 Topological Web Spam
7 Direct Counting of Supporters
8 Spam Detection Results
7.
How to find meaningful patterns?
Several levels of analysis:
Macroscopic view: overall structure
Microscopic view: nodes
Mesoscopic view: regions
8.
Macroscopic view, e.g. Bow-tie
[Broder et al., 2000]
9.
Macroscopic view, e.g. Bow-tie, migration
[Baeza-Yates and Poblete, 2006]
10.
Macroscopic view, e.g. Jellyfish
[Tauro et al., 2001] - Internet Autonomous Systems (AS) Topology
12.
Microscopic view, e.g. Degree
[Barabási, 2002] and others
13.
Microscopic view, e.g. Degree
[Figure: degree distributions for Greece, Chile, Spain and Korea]
[Baeza-Yates et al., 2006b] - compares this distribution in 8 countries . . . guess what is the result?
14.
Mesoscopic view, e.g. Hop-plot
16.
Mesoscopic view, e.g. Hop-plot
[Figure: four hop-plots, frequency (0.0 to 0.3) vs. distance (5 to 30), for .it (40M pages), .uk (18M pages), .eu.int (800K pages) and a synthetic graph (100K pages)]
[Baeza-Yates et al., 2006a]
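A hop-plot of this kind can be approximated by running a breadth-first search from a set of start nodes and histogramming the distances found. The sketch below is an illustration only: the adjacency-list format and the name `hop_plot` are assumptions, and it is not the procedure used in [Baeza-Yates et al., 2006a].

```python
from collections import Counter, deque

def hop_plot(graph, sources):
    """Return {distance: fraction of reachable (source, target) pairs}.

    graph   -- dict mapping each node to a list of out-neighbors (assumed format)
    sources -- iterable of start nodes (a sample, on a large graph)
    """
    counts = Counter()
    for s in sources:
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # plain breadth-first search
            u = queue.popleft()
            for v in graph.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}

# Toy directed cycle a -> b -> c -> d -> a
toy = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
plot = hop_plot(toy, toy.keys())
```

On a Web-scale graph the start nodes would have to be sampled, since a search from every node is too expensive.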
20.
Notation
Let $P_{N \times N}$ be the normalized link matrix of a graph
Row-normalized
No “sinks”
Definition (PageRank)
Stationary state of:
$$\alpha P + \frac{(1-\alpha)}{N} \mathbf{1}_{N \times N}$$
Follow links with probability α
Random jump with probability 1 − α
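The stationary state can be approximated by power iteration. The following is a minimal sketch under stated assumptions: the graph is a dict of out-link lists, every node has at least one out-link (the "no sinks" condition), and the name `pagerank` and the fixed iteration count are illustrative choices.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Power iteration for the stationary state of alpha*P + (1-alpha)/N * 1.

    out_links -- dict: node -> list of out-neighbors; every node needs at
                 least one out-link, and every target must also be a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    r = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: (1.0 - alpha) / n for u in nodes}    # random jump
        for u in nodes:
            share = alpha * r[u] / len(out_links[u])   # follow a link
            for v in out_links[u]:
                nxt[v] += share
        r = nxt
    return r

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy)
```

In the toy graph, node `c` collects links from both `a` and `b`, so it ends up ranked above `b`.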
22.
Explicit Formulas
Formulas for PageRank [Newman et al., 2001, Boldi et al., 2005]
$$r(\alpha) = \frac{(1-\alpha)}{N} \sum_{t=0}^{\infty} (\alpha P)^t$$
$$r_i(\alpha) = \sum_{p \in Path(-,i)} \frac{(1-\alpha)\alpha^{|p|}}{N} \, branching(p)$$
Path(−, i) are incoming paths in node i
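The series can be recovered from the stationary-state definition in a few lines. The derivation below is a sketch that assumes $r$ is a row vector whose entries sum to 1 and writes $\mathbf{1}$ for the all-ones row vector of length $N$; the slides leave both conventions implicit.

```latex
\begin{align*}
r &= r\left(\alpha P + \frac{(1-\alpha)}{N}\,\mathbf{1}_{N\times N}\right)
   = \alpha\, r P + \frac{(1-\alpha)}{N}\,\mathbf{1}
   && \text{since } r\,\mathbf{1}_{N\times N} = \mathbf{1} \\
r\,(I - \alpha P) &= \frac{(1-\alpha)}{N}\,\mathbf{1} \\
r &= \frac{(1-\alpha)}{N}\,\mathbf{1}\,(I - \alpha P)^{-1}
   = \frac{(1-\alpha)}{N}\,\mathbf{1} \sum_{t=0}^{\infty} (\alpha P)^t
   && \text{for } 0 \le \alpha < 1
\end{align*}
```

Entry $i$ of $\mathbf{1} P^t$ adds up the products of $1/\text{out-degree}$ along every path with $t$ links that ends in $i$, which is where the path-based form comes from.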
23.
Branching contribution
Definition (Branching contribution of a path)
Given a path $p = \langle x_1, x_2, \ldots, x_t \rangle$ of length $t = |p|$
$$branching(p) = \frac{1}{d_1 d_2 \cdots d_{t-1}}$$
where $d_i$ are the out-degrees of the members of the path
For every node i and every length t
$$\sum_{p \in Path(i,-),\, |p|=t} branching(p) = 1.$$
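The normalization property can be checked numerically on a small graph. This is a sketch under assumptions: the graph is a dict of out-link lists with no sinks, the function name is hypothetical, and `edges` counts the links of a path, so it corresponds to t − 1 in the notation above.

```python
def path_branching_sum(out_links, start, edges):
    """Sum branching(p) over all paths with `edges` links that start at `start`.

    branching(p) is the product of 1/out-degree over every node of the path
    except the last one. Assumes a graph without sinks.
    """
    if edges == 0:
        return 1.0                       # the empty product
    d = len(out_links[start])
    return sum(path_branching_sum(out_links, nxt, edges - 1) / d
               for nxt in out_links[start])

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a", "b", "c"]}
totals = [path_branching_sum(toy, "a", t) for t in range(1, 6)]
```

Each entry of `totals` is 1 up to rounding, because every step splits the remaining weight evenly among the out-links of the current node.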
24.
Functional ranking
General functional ranking [Baeza-Yates et al., 2006a]
$$r_i(\alpha) = \sum_{p \in Path(-,i)} \frac{damping(|p|)}{N} \, branching(p)$$
PageRank is a particular case of path-based ranking
26.
Exponential damping = PageRank
[Figure: damping(t) with α=0.8 and with α=0.7; weight (0.00 to 0.30) vs. length of the path (t), 1 to 10]
$$damping(t) = (1-\alpha)\alpha^t$$
Most of the contribution is on the first few levels.
27.
Linear damping
[Figure: damping(t) with L=15 and with L=10; weight (0.00 to 0.30) vs. length of the path (t), 1 to 10]
$$damping(t) = \begin{cases} \dfrac{2(L-t)}{L(L+1)} & t < L \\ 0 & t \geq L \end{cases}$$
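Both damping functions are normalized so that the weights over all path lengths add up to 1. The sketch below checks this numerically; the function names are illustrative, and the exponential weight is taken as (1 − α)α^t, the form that appears in the explicit PageRank formula.

```python
def exponential_damping(t, alpha):
    """PageRank weight of paths of length t: (1 - alpha) * alpha**t."""
    return (1.0 - alpha) * alpha ** t

def linear_damping(t, L):
    """LinearRank weight: 2(L - t) / (L(L + 1)) for t < L, else 0."""
    return 2.0 * (L - t) / (L * (L + 1)) if t < L else 0.0

# Truncated sums; the exponential tail beyond t = 500 is negligible.
linear_total = sum(linear_damping(t, 10) for t in range(50))
exp_total = sum(exponential_damping(t, 0.8) for t in range(500))
```

The linear weights reach exactly zero at t = L, while the exponential weights only approach it, which is why a finite number of iterations suffices for LinearRank.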
29.
Example: Calculating LinearRank
For calculating LinearRank we use:
$$LinearRank = \frac{1}{N} \sum_{t=0}^{\infty} damping(t) P^t = \frac{1}{N} \sum_{t=0}^{L-1} \frac{2(L-t)}{L(L+1)} P^t$$
However, we cannot hold the temporary $P^t$ in memory!
32.
Re-write the damping as a recursion
We have to rewrite to be able to calculate:
$$R^{(0)} = \frac{2}{L+1}$$
$$R^{(k+1)} = \frac{(L-k-1)}{(L-k)} R^{(k)} P$$
$$LinearRank = \sum_{k=0}^{L-1} R^{(k)}$$
Now we can give the algorithm . . .
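Unrolling the recursion shows that it reproduces the linear damping weights. The derivation below is a sketch: it reads $R^{(0)}$ as the row vector with every entry equal to $2/(L+1)$, writes $\mathbf{1}$ for the all-ones row vector, and leaves out the constant factor $1/N$, as the recursion itself does.

```latex
\begin{align*}
R^{(k)} &= \frac{2}{L+1} \left( \prod_{j=0}^{k-1} \frac{L-j-1}{L-j} \right) \mathbf{1} P^{k} \\
        &= \frac{2}{L+1} \cdot \frac{L-1}{L} \cdot \frac{L-2}{L-1} \cdots \frac{L-k}{L-k+1} \; \mathbf{1} P^{k} \\
        &= \frac{2(L-k)}{L(L+1)} \; \mathbf{1} P^{k}
         = damping(k) \; \mathbf{1} P^{k}
\end{align*}
```

The product telescopes, so each step needs only the previous vector $R^{(k)}$ and one pass over the links, never the matrix power $P^k$.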
36.
Algorithm
for i : 1 . . . N do {Initialization}
    Score[i] ← R[i] ← 2/(L+1)
end for
for k : 0 . . . L − 2 do {Iteration step}
    Aux ← 0
    for i : 1 . . . N do {Follow links in the graph}
        for all j such that there is a link from i to j do
            Aux[j] ← Aux[j] + R[i]/outdegree(i)
        end for
    end for
    for i : 1 . . . N do {Add to ranking value}
        R[i] ← Aux[i] × (L−k−1)/(L−k)
        Score[i] ← Score[i] + R[i]
    end for
end for
return Score
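The algorithm translates almost line by line into Python. The sketch below rests on assumptions: the graph is a dict of out-link lists with no sinks, the name `linear_rank` is illustrative, the loop applies the recursion R(k+1) = (L−k−1)/(L−k) R(k) P for k = 0, ..., L − 2, and the final division by N is added so that the scores sum to 1.

```python
def linear_rank(out_links, L):
    """LinearRank via the recursion R(k+1) = (L-k-1)/(L-k) * R(k) P.

    out_links -- dict: node -> list of out-neighbors, no sinks (assumed format).
    Scores are divided by N at the end so that they sum to 1.
    """
    nodes = list(out_links)
    n = len(nodes)
    r = {u: 2.0 / (L + 1) for u in nodes}          # R(0)
    score = dict(r)
    for k in range(L - 1):                         # builds R(1) .. R(L-1)
        aux = {u: 0.0 for u in nodes}
        for u in nodes:                            # follow links: aux = R(k) P
            share = r[u] / len(out_links[u])
            for v in out_links[u]:
                aux[v] += share
        factor = (L - k - 1) / (L - k)
        r = {u: aux[u] * factor for u in nodes}
        score = {u: score[u] + r[u] for u in nodes}
    return {u: s / n for u, s in score.items()}

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = linear_rank(toy, L=10)
```

Only two vectors of length N are kept in memory at any time, which is the point of rewriting the damping as a recursion.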
37.
Algorithm (general)
for i : 1 . . . N do {Initialization}
    Score[i] ← R[i] ← INIT
end for
for k : 1 . . . STOP do {Iteration step}
    Aux ← 0
    for i : 1 . . . N do {Follow links in the graph}
        for all j such that there is a link from i to j do
            Aux[j] ← Aux[j] + R[i]/outdegree(i)
        end for
    end for
    for i : 1 . . . N do {Add to ranking value}
        R[i] ← Aux[i] × FACTOR
        Score[i] ← Score[i] + R[i]
    end for
end for
return Score
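The general scheme can be sketched with INIT, FACTOR and STOP passed in as parameters. Everything below is an assumption made for illustration: the function name, the dict-of-lists graph format, and the two instantiations (INIT = (1 − α)/N with FACTOR = α for a PageRank-like ranking, INIT = 2/(N(L+1)) with FACTOR = (L−k)/(L−k+1) for a LinearRank-like one).

```python
def functional_rank(out_links, init, factor, stop):
    """Generic path-based ranking with pluggable INIT, FACTOR and STOP.

    init   -- initial value of every R[i]
    factor -- function k -> multiplier applied in iteration k (k = 1, 2, ...)
    stop   -- number of iterations
    """
    nodes = list(out_links)
    r = {u: init for u in nodes}
    score = dict(r)
    for k in range(1, stop + 1):
        aux = {u: 0.0 for u in nodes}
        for u in nodes:                          # follow links in the graph
            share = r[u] / len(out_links[u])
            for v in out_links[u]:
                aux[v] += share
        f = factor(k)
        r = {u: aux[u] * f for u in nodes}       # add to ranking value
        score = {u: score[u] + r[u] for u in nodes}
    return score

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
n, alpha, L = len(toy), 0.85, 10
pr = functional_rank(toy, init=(1 - alpha) / n, factor=lambda k: alpha, stop=200)
lr = functional_rank(toy, init=2 / (L + 1) / n,
                     factor=lambda k: (L - k) / (L - k + 1), stop=L - 1)
```

The only difference between the two rankings is the sequence of multipliers, so one pass over the graph per iteration serves any damping function that can be written as a running product.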
38.
Other damping functions
Empirical damping:
[Figure: average text similarity (0.2 to 0.7) vs. link distance (1 to 5)]
42.
Using LinearRank to approximate PageRank
Experimental comparison: 18-million nodes in the U.K. Web Graph
Calculated PageRank with α = 0.1, 0.2, . . . , 0.9
Calculated LinearRank with L = 5, 10, . . . , 25
For certain combinations of parameters, the rankings are almost equal!
43.
Experimental comparison
Experimental Comparison in the U.K. Web Graph
[Figure: τ (0.80 to 1.00) as a function of L (5 to 25) and α (0.5 to 0.9), with the region τ ≥ 0.95 marked]
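Agreement between two rankings can be measured with Kendall's τ, which compares the relative order of every pair of items. The sketch below is a plain quadratic-time version (the tau-a variant, ignoring ties); which variant and which algorithm were used for the experiment is not stated in the slides, so this is an assumption.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists (O(n^2) pairs).

    Ties count as neither concordant nor discordant. This quadratic version
    is only meant for small inputs.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

same = kendall_tau([0.4, 0.3, 0.2, 0.1], [9, 7, 5, 1])
reversed_order = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

Identical orderings give τ = 1, a fully reversed ordering gives τ = −1, and a single adjacent swap among four items leaves five of the six pairs in agreement.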
44.
Prediction of best parameter combination
Prediction of Best Parameter Combinations (Analysis)
[Figure: L that maximizes Kendall’s τ (5 to 25) vs. exponent α (0.5 to 0.9); series: actual optimum, predicted optimum with length=5]
48.
What is on the Web?
Information + Porn + On-line casinos + Free movies + Cheap software + Buy a MBA diploma + Prescription-free drugs + V!-4-gra + Get rich now now now!!!
Graphic: www.milliondollarhomepage.com
57.
Opportunities for Web spam
V Spamdexing
Keyword stuffing
Link farms
Scraper, “Made for Advertising” sites
Spam blogs (splogs)
Cloaking
Click spam
Adversarial relationship
Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.
58.
Typical Web Spam (1)
59.
Typical Web Spam (2)
60.
Hidden text
61.
Made for Advertising (1)
62.
Made for Advertising (2)
63.
Made for Advertising (3)
64.
Search engine?
65.
Fake search engine
66.
Problem: “normal” pages that are spam
68.
Machine Learning
69.
Machine Learning (cont.)
70.
Feature Extraction
73.
Challenges: Machine Learning
Machine Learning Challenges:
Learning with interdependent variables (graph)
Learning with few examples
Scalability
78.
Challenges: Information Retrieval
Information Retrieval Challenges:
Feature extraction: which features?
Feature aggregation: page/host/domain
Feature propagation (graph)
Recall/precision tradeoffs
Scalability
81.
Topological spam: link farms
Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
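The idea of looking for nodes that share their out-links can be illustrated with exact matching of out-link sets. This is a deliberately simplified sketch and not the method of the cited paper: the function name, the `min_size` threshold and the toy graph are all assumptions, and real link farms would call for approximate rather than exact set matching.

```python
from collections import defaultdict

def shared_outlink_groups(out_links, min_size=3):
    """Group nodes whose out-link sets are identical.

    A simplified stand-in for link-farm detection: only exact matches of
    the whole out-link set are reported.
    """
    groups = defaultdict(list)
    for node, targets in out_links.items():
        if targets:                                  # skip nodes without links
            groups[frozenset(targets)].append(node)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]

toy = {
    "farm1": ["target", "hub"], "farm2": ["hub", "target"],
    "farm3": ["target", "hub"], "blog": ["news"], "news": ["blog", "hub"],
    "target": [], "hub": [],
}
suspicious = shared_outlink_groups(toy)
```

Hashing the out-link set keeps the grouping linear in the number of links, which matters at Web scale.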
82.
Motivation
[Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages:
“in a number of these distributions, outlier values are associated with web spam”
84.
Test collection
U.K. collection
18.5 million pages downloaded from the .UK domain
5,344 hosts manually classified (6% of the hosts)
Classified entire hosts:
V A few hosts are mixed: spam and non-spam pages
X More coverage: sample covers 32% of the pages
85.
In-degree
[Figure: distribution of in-degree for normal and spam hosts; frequency (0 to 0.4) vs. number of in-links (1 to 10000, log scale); δ = 0.35]
(δ = max. difference in C.D.F. plot)
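The statistic δ, the largest vertical gap between two cumulative distribution functions, is the two-sample Kolmogorov-Smirnov statistic. The sketch below computes it for two small samples; the function name and the in-degree values are made up for illustration and are not data from the collection.

```python
from bisect import bisect_right

def max_cdf_difference(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    delta = 0.0
    for x in set(a) | set(b):          # the gap can only change at data points
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        delta = max(delta, abs(cdf_a - cdf_b))
    return delta

normal_indegree = [1, 2, 2, 3, 5, 8, 13, 40]          # made-up values
spam_indegree = [50, 60, 80, 120, 150, 200, 300, 900]  # made-up values
delta = max_cdf_difference(normal_indegree, spam_indegree)
overlap = max_cdf_difference([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 1 means the feature separates the two classes almost completely, and a value near 0 means their distributions are nearly indistinguishable.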
86.
Out-degree
[Figure: distribution of out-degree for normal and spam hosts; frequency (0 to 0.3) vs. number of out-links (1 to 100); δ = 0.28]
87.
Edge reciprocity
[Figure: reciprocity of max. PR page for normal and spam hosts; frequency (0 to 0.5) vs. fraction of reciprocal links (0 to 1); δ = 0.35]
88.
Assortativity
[Figure: degree / degree of neighbors for normal and spam hosts; frequency (0 to 0.4) vs. degree/degree ratio of home page (0.001 to 1000, log scale); δ = 0.31]
89.
Variance of PageRank
Suggested in [Benczúr et al., 2005]
90.
Variance of PageRank of in-neighbors
[Figure: stdev. of PR of neighbors (home) for normal and spam hosts; frequency (0 to 0.3) vs. σ² of the logarithm of PageRank (0 to 1); δ = 0.41]
92.
TrustRank
TrustRank [Gyöngyi et al., 2004]
A node with high PageRank, but far away from a core set of “trusted nodes” is suspicious
Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability 1 − α at each step
i Trusted nodes: data from http://www.dmoz.org/
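The random walk described above is a PageRank computation whose random jump lands only on trusted nodes. The sketch below illustrates that idea under assumptions: the graph format, the function name, the uniform restart over the trusted set and the toy graph are all illustrative, and it is not the implementation of [Gyöngyi et al., 2004].

```python
def trust_rank(out_links, trusted, alpha=0.85, iterations=100):
    """Random walk that restarts at the trusted set with probability 1 - alpha.

    out_links -- dict: node -> list of out-neighbors, no sinks (assumed format)
    trusted   -- non-empty collection of hand-picked trusted nodes
    """
    nodes = list(out_links)
    restart = {u: (1.0 / len(trusted) if u in trusted else 0.0) for u in nodes}
    t = dict(restart)
    for _ in range(iterations):
        nxt = {u: (1.0 - alpha) * restart[u] for u in nodes}   # back to seeds
        for u in nodes:
            share = alpha * t[u] / len(out_links[u])           # follow a link
            for v in out_links[u]:
                nxt[v] += share
        t = nxt
    return t

toy = {
    "seed": ["good"], "good": ["seed"],
    "spam1": ["spam2"], "spam2": ["spam1"],   # never linked from trusted pages
}
trust = trust_rank(toy, trusted={"seed"})
```

Pages that cannot be reached from the trusted set receive a score of zero, however many links they exchange among themselves.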
93.
Link Analysis on
TrustRank Idea
the Web
Levels of Link
Analysis
Generalizing
PageRank
Other
Functional
Rankings
Web Spam
Web Spam
Detection
Topological Web
Spam
Direct Counting
of Supporters
Spam Detection
Results
94.
TrustRank score
[Figure: "TrustRank score of home page", δ = 0.59. Histograms for Normal and Spam hosts; x-axis: TrustRank (log scale, 1e−06 to 0.001); y-axis: 0 to 0.4]
95.
TrustRank / PageRank
[Figure: "Estimated relative non-spam mass", δ = 0.59. Histograms for Normal and Spam hosts; x-axis: TrustRank score/PageRank (log scale, 0.3 to 100); y-axis: 0 to 0.8]
97.
Truncated PageRank
Proposed in [Becchetti et al., 2006b]. Idea: reduce the direct contribution of the first levels of links:
$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\,\alpha^{t} & t > T \end{cases}$$
✓ No extra reading of the graph after PageRank
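A minimal sketch of how the damping function could be applied, assuming a score is the damping-weighted sum over walk lengths t of the probability that a uniform-start random walk of length t ends at the node. The normalization C = (1 − α)/α^(T+1), the dropping of dangling mass, and all names below are assumptions for illustration. Both scores are accumulated in the same pass over the links, which is one way to read "no extra reading of the graph".

```python
from typing import Dict, List, Tuple


def pagerank_and_truncated(
    out_links: Dict[int, List[int]],
    n: int,
    alpha: float = 0.85,
    T: int = 2,
    max_t: int = 50,
) -> Tuple[List[float], List[float]]:
    """Accumulate PageRank and Truncated PageRank from the same walk vectors."""
    walk = [1.0 / n] * n  # probability of a t-step walk ending at each node
    pr = [0.0] * n
    tr = [0.0] * n
    # Assumed constant, chosen so that the sum of C * alpha**t over t > T is 1.
    c = (1.0 - alpha) / alpha ** (T + 1)
    for t in range(max_t + 1):
        for v in range(n):
            pr[v] += (1.0 - alpha) * alpha ** t * walk[v]
            if t > T:  # damping(t) = 0 for t <= T
                tr[v] += c * alpha ** t * walk[v]
        nxt = [0.0] * n
        for src in range(n):
            targets = out_links.get(src, [])
            if not targets:
                continue  # assumption: mass on dangling nodes is dropped
            share = walk[src] / len(targets)
            for dest in targets:
                nxt[dest] += share
        walk = nxt
    return pr, tr
```

In this sketch, a node whose support comes only from pages one link away keeps a positive PageRank but gets a Truncated PageRank of zero for T = 2.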
98.
Truncated PageRank(T=2) / PageRank
[Figure: "TruncatedPageRank T=2 / PageRank", δ = 0.30. Histograms for Normal and Spam hosts; x-axis: TruncatedPageRank(T=2) / PageRank (0.2 to 1.6); y-axis: 0 to 0.3]
99.
Max. change of Truncated PageRank
[Figure: "Maximum change of Truncated PageRank", δ = 0.29. Histograms for Normal and Spam hosts; x-axis: max(TrPR_{i+1}/TrPR_i) (0.85 to 1.1); y-axis: 0 to 0.2]
102.
High and low-ranked pages are different
[Figure: number of nodes (y-axis, ×10^4, 0 to 12) vs. distance (x-axis, 1 to 20). Series: Top 0%−10%, Top 40%−50%, Top 60%−70%]
Areas below the curves are equal if we are in the same strongly-connected component
104.
Probabilistic counting
[Figure: bit vectors attached to the pages that link, directly or indirectly, to a target page. Labels: Propagation of bits using the “OR” operation; Target page; Count bits set to estimate supporters]
[Becchetti et al., 2006b] presents an improvement of the ANF algorithm [Palmer et al., 2002], based on probabilistic counting [Flajolet and Martin, 1985]
107.
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node : 1 . . . N, bit : 1 . . . k do
    INIT(node, bit)
end for
for distance : 1 . . . d do {Iteration step}
    Aux ← 0^k
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src,·]
        end for
    end for
    V ← Aux
end for
for node : 1 . . . N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node,·])
end for
return Supporters
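The propagation loop can be sketched in Python by packing each node's k bits into one integer. This is an illustrative transcription, not the authors' code: `propagate_bits`, the edge-list format and the `init_bit` callback (standing in for INIT) are assumed names. The sketch follows the pseudocode literally, so Aux is reset to zero in every round and a node's own bits are not carried over.

```python
from typing import Callable, List, Tuple


def propagate_bits(
    n_nodes: int,
    links: List[Tuple[int, int]],
    d: int,
    k: int,
    init_bit: Callable[[int, int], int],
) -> List[int]:
    """Return one k-bit mask per node after d rounds of OR propagation."""
    # INIT: build the bit vector V[node, .] as a single integer.
    v = [0] * n_nodes
    for node in range(n_nodes):
        for bit in range(k):
            if init_bit(node, bit):
                v[node] |= 1 << bit
    # Iteration step: push every source mask along its out-links.
    for _ in range(d):
        aux = [0] * n_nodes
        for src, dest in links:
            aux[dest] |= v[src]
        v = aux
    return v


def count_ones(mask: int) -> int:
    """Number of bits set in a mask, the input to ESTIMATE."""
    return bin(mask).count("1")
```

With a one-hot initialization (bit i set only for node i) the bit count is exact rather than estimated, which is convenient for checking the propagation on a toy graph before switching to random bits.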
110.
Our estimator
Initialize all bits to one with probability ε
Estimator:
$$\mathrm{neighbors}(\mathit{node}) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(\mathit{node})}{k}\right)$$
Adaptive estimation
Repeat the above process for ε = 1/2, 1/4, 1/8, . . . , and look for the transitions from more than (1 − 1/e)k ones to less than (1 − 1/e)k ones.
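A short sketch of the estimator, assuming ε denotes the probability with which each bit is initialized to one. If a node has n supporters, a given bit stays zero after the OR propagation with probability (1 − ε)^n, so the expected fraction of ones is 1 − (1 − ε)^n, and solving for n gives the logarithm used below. The adaptive part is one plausible reading of the slide, not a confirmed implementation: it takes the largest ε whose count of ones falls below (1 − 1/e)k. The function names and the `ones_by_eps` mapping are hypothetical.

```python
import math
from typing import Dict


def estimate_supporters(ones: int, k: int, eps: float) -> float:
    """Invert ones/k = 1 - (1 - eps)**n to recover n."""
    if ones >= k:
        return math.inf  # all bits set: the vector is saturated
    return math.log(1.0 - ones / k) / math.log(1.0 - eps)


def adaptive_estimate(ones_by_eps: Dict[float, int], k: int) -> float:
    """Try eps = 1/2, 1/4, ... and stop at the first unsaturated count."""
    threshold = (1.0 - 1.0 / math.e) * k
    for eps in sorted(ones_by_eps, reverse=True):
        if ones_by_eps[eps] < threshold:
            return estimate_supporters(ones_by_eps[eps], k, eps)
    return math.inf  # no tested eps was small enough
```

The threshold makes sense because (1 − ε)^n is close to 1/e when n is close to 1/ε, so the transition locates the order of magnitude of the supporter count.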
111.
Convergence
[Figure: fraction of nodes with estimates (y-axis, 0% to 100%) vs. iteration (x-axis, 5 to 20). One curve per distance, d=1 to d=8]
112.
Error rate
[Figure: average relative error (y-axis, 0 to 0.5) vs. distance (x-axis, 1 to 8). Series: Ours 64 bits, epsilon-only estimator; Ours 64 bits, combined estimator; ANF 24 bits × 24 iterations (576 b×i); ANF 24 bits × 48 iterations (1152 b×i). Points for our method are annotated with costs from 512 b×i to 1408 b×i]
113.
Hosts at distance 4
[Figure: "Hosts at Distance Exactly 4", δ = 0.39. Histograms for Normal and Spam hosts; x-axis: S4 − S3 (log scale, 1 to 1000); y-axis: 0 to 0.4]
114.
Minimum change of supporters
[Figure: "Minimum change of supporters", δ = 0.39. Histograms for Normal and Spam hosts; x-axis: min(S2/S1, S3/S2, S4/S3) (1 to 10); y-axis: 0 to 0.4]
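The two supporter-based attributes plotted on these slides can be derived from the counts S1 to S4, read here as the number of supporters within distance 1 to 4 (which is what "distance exactly 4" = S4 − S3 implies). The function name, the dictionary keys and the requirement that all counts be positive are assumptions for illustration.

```python
from typing import Dict, Sequence


def supporter_features(s: Sequence[float]) -> Dict[str, float]:
    """s[0..3] hold S1..S4, the supporters within distance 1..4."""
    s1, s2, s3, s4 = s[:4]
    return {
        # Hosts at distance exactly 4.
        "hosts_at_distance_4": s4 - s3,
        # Smallest growth factor between consecutive distances.
        "min_change": min(s2 / s1, s3 / s2, s4 / s3),
    }
```

For example, counts of 10, 30, 60 and 90 give 30 hosts at distance exactly 4 and a minimum change of 1.5.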
121.
Detection rates
Detection rate of 60% (UK-2006) to 80% (UK-2002), with a 4% to 2% error rate, by combining different attributes [Becchetti et al., 2006a].
✗ No magic bullet in link analysis
✗ Precision still low compared to e-mail spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts of neighbors are important
Next step: combine link analysis and content analysis
125.
Upcoming Web Spam Challenge on UK-2006
We asked 20+ volunteers to classify entire hosts
We provided several examples
We asked them to label each host as normal / borderline / spam
Do they agree? Mostly . . .
126.
Agreement between humans
134.
Result: first public Web Spam collection
Public spam collection
  Web graph with ∼80 million pages
  ∼11,000 hosts
  Labels for ∼4,000 hosts by at least 2 humans each
Upcoming Web Spam challenge
  Machine learning
  Information retrieval
webspam-announces-subscribe@yahoogroups.com
136.
Thank you!
137.
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006a). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR, pages 308–315, Seattle, Washington, USA. ACM Press.
Baeza-Yates, R., Castillo, C., and Efthimiadis, E. (2006b). Characterization of national web domains. To appear in ACM TOIT.
Baeza-Yates, R. and Poblete, B. (2006). Dynamics of the Chilean web structure. Comput. Networks, 50(10):1464–1473.
Barabási, A.-L. (2002). Linked: The New Science of Networks. Perseus Books Group.
138.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006a). Link-based characterization and detection of Web Spam. In Second International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Seattle, USA.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006b). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
139.
Boldi, P., Santini, M., and Vigna, S. (2005). PageRank as a function of the damping factor. In Proceedings of the 14th international conference on World Wide Web, pages 557–566, Chiba, Japan. ACM Press.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web: Experiments and models. In Proceedings of the Ninth Conference on World Wide Web, pages 309–320, Amsterdam, Netherlands. ACM Press.
Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the seventh workshop on the Web and databases (WebDB), pages 1–6, Paris, France.
140.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
141.
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 81–90, New York, NY, USA. ACM Press.
Tauro, L., Palmer, C., Siganos, G., and Faloutsos, M. (2001). A simple conceptual model for the internet topology. In Global Internet, San Antonio, Texas, USA. IEEE CS Press.