SlideShare a Scribd company logo
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

                   Using rank propagation and Probabilistic
 L. Becchetti,
  C. Castillo,
  D. Donato,
                   counting for Link-Based Spam Detection
S. Leonardi and
R. Baeza-Yates

Motivation

                   Luca Becchetti1 , Carlos Castillo1 ,Debora Donato1 ,
Spam pages
characterization
                     Stefano Leonardi1 and Ricardo Baeza-Yates2
Truncated
PageRank
                         1. Universit` di Roma “La Sapienza” – Rome, Italy
                                     a
Counting
                     2. Yahoo! Research – Barcelona, Spain and Santiago, Chile
supporters

Experiments

                                      August 20th, 2006
Conclusions
Using rank
propagation and
  Probabilistic
  counting for
                       Motivation
  Link-Based       1
Spam Detection

 L. Becchetti,
  C. Castillo,
                       Spam pages characterization
                   2
  D. Donato,
S. Leonardi and
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation

Spam pages
characterization
                       Counting supporters
                   4
Truncated
PageRank

Counting
supporters
                       Experiments
                   5
Experiments

Conclusions
                       Conclusions
                   6
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
What is on the Web?
   Using rank
                   Information
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
What is on the Web?
   Using rank
                   Information + Porn
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
What is on the Web?
   Using rank
                   Information + Porn + On-line casinos + Free movies +
propagation and
  Probabilistic
                   Cheap software + Buy a MBA diploma + Prescription -free
  counting for
  Link-Based
                   drugs + V!-4-gra + Get rich now now now!!!
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions




                                    Graphic: www.milliondollarhomepage.com
Web spam (keywords + links)
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Web spam (mostly keywords)
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Search engine?
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Fake search engine
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Problem: “normal” pages that are spam
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Problem: “normal” pages that are spam
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions




                   Some content is introduced
Problem: borderline pages
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Definitions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Definitions
   Using rank
propagation and
  Probabilistic
  counting for
                           Any deliberate action that is meant to trigger
  Link-Based
Spam Detection
                         an unjustifiably favorable relevance or importance
 L. Becchetti,
                       for some Web page, considering the page’s true value
  C. Castillo,
  D. Donato,
                                [Gy¨ngyi and Garcia-Molina, 2005]
                                    o
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Definitions
   Using rank
propagation and
  Probabilistic
  counting for
                           Any deliberate action that is meant to trigger
  Link-Based
Spam Detection
                         an unjustifiably favorable relevance or importance
 L. Becchetti,
                       for some Web page, considering the page’s true value
  C. Castillo,
  D. Donato,
                                [Gy¨ngyi and Garcia-Molina, 2005]
                                    o
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization
                   any attempt to deceive a search engine’s relevancy algorithm
Truncated
PageRank
                                             or simply
Counting
supporters

Experiments
                                 anything that would not be done
Conclusions
                                  if search engines did not exist.
                                           [Perkins, 2001]
Link farms
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Link farms
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments
                   Single-level farms can be detected by searching groups of
Conclusions
                   nodes sharing their out-links [Gibson et al., 2005]
Link−based spam
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                   [Fetterly et al., 2004] hypothesized that studying the
 L. Becchetti,
  C. Castillo,
                   distribution of statistics about pages could be a good way of
  D. Donato,
                   detecting spam pages:
S. Leonardi and
R. Baeza-Yates

Motivation
                   “in a number of these distributions, outlier values are
Spam pages
                   associated with web spam”
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Link−based spam
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                   [Fetterly et al., 2004] hypothesized that studying the
 L. Becchetti,
  C. Castillo,
                   distribution of statistics about pages could be a good way of
  D. Donato,
                   detecting spam pages:
S. Leonardi and
R. Baeza-Yates

Motivation
                   “in a number of these distributions, outlier values are
Spam pages
                   associated with web spam”
characterization

Truncated
PageRank

                   Research goal
Counting
supporters
                   Statistical analysis of link-based spam
Experiments

Conclusions
Idea: count “supporters” at different distances
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                                       Number of different nodes at a given distance:
Spam Detection

 L. Becchetti,
  C. Castillo,
                                  .UK 18 mill. nodes                             .EU.INT 860,000 nodes
  D. Donato,
S. Leonardi and
                                 0.3                                                  0.3
R. Baeza-Yates

Motivation
                                 0.2                                                  0.2
                     Frequency




                                                                          Frequency
Spam pages
characterization
                                 0.1                                                  0.1
Truncated
PageRank
                                 0.0                                                  0.0
                                         5   10     15     20   25   30                       5   10     15     20   25   30
Counting
supporters                                        Distance                                             Distance
Experiments
                                       Average distance                                     Average distance
Conclusions
                                          14.9 clicks                                          10.0 clicks
High and low-ranked pages are different
   Using rank
                                                  4
propagation and
                                               x 10
  Probabilistic
                                                                     Top 0%−10%
  counting for
                                          12
  Link-Based
                                                                     Top 40%−50%
Spam Detection
                                                                     Top 60%−70%
 L. Becchetti,
                                          10

                        Number of Nodes
  C. Castillo,
  D. Donato,
S. Leonardi and
                                          8
R. Baeza-Yates

Motivation
                                          6
Spam pages
characterization
                                          4
Truncated
PageRank

                                          2
Counting
supporters

Experiments
                                          0
                                           1          5     10       15        20
Conclusions
                                                          Distance
High and low-ranked pages are different
   Using rank
                                                   4
propagation and
                                                x 10
  Probabilistic
                                                                      Top 0%−10%
  counting for
                                           12
  Link-Based
                                                                      Top 40%−50%
Spam Detection
                                                                      Top 60%−70%
 L. Becchetti,
                                           10

                         Number of Nodes
  C. Castillo,
  D. Donato,
S. Leonardi and
                                           8
R. Baeza-Yates

Motivation
                                           6
Spam pages
characterization
                                           4
Truncated
PageRank

                                           2
Counting
supporters

Experiments
                                           0
                                            1          5     10       15        20
Conclusions
                                                           Distance
                   Areas below the curves are equal if we are in the same
                   strongly-connected component
Metrics
   Using rank
                          Graph algorithms
propagation and
  Probabilistic
  counting for
                      All shortest paths, centrality, betweenness, clustering coefficient...
  Link-Based
Spam Detection

                             Streamed algorithms
 L. Becchetti,

                         Breadth-first and depth-first search
  C. Castillo,
  D. Donato,

                         Count of neighbors
S. Leonardi and
R. Baeza-Yates
                            Symmetric algorithms
Motivation
                          (Strongly) connected components
Spam pages

                          Approximate count of neighbors
characterization

Truncated
                          PageRank, Truncated PageRank, Linear Rank
PageRank

                          HITS, Salsa, TrustRank
Counting
supporters

Experiments

Conclusions
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
General functional ranking
   Using rank
propagation and
  Probabilistic
                   Let P the row-normalized version of the citation matrix of a
  counting for
  Link-Based
                   graph G = (V , E )
Spam Detection

                   A functional ranking [Baeza-Yates et al., 2006] is a
 L. Becchetti,
  C. Castillo,
                   link-based ranking algorithm to compute a scoring vector W
  D. Donato,
S. Leonardi and
                   of the form:
R. Baeza-Yates
                                             ∞
                                               damping(t) t
                                      W=                  P.
Motivation
                                                    N
                                           t=0
Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
General functional ranking
   Using rank
propagation and
  Probabilistic
                   Let P the row-normalized version of the citation matrix of a
  counting for
  Link-Based
                   graph G = (V , E )
Spam Detection

                   A functional ranking [Baeza-Yates et al., 2006] is a
 L. Becchetti,
  C. Castillo,
                   link-based ranking algorithm to compute a scoring vector W
  D. Donato,
S. Leonardi and
                   of the form:
R. Baeza-Yates
                                             ∞
                                               damping(t) t
                                      W=                  P.
Motivation
                                                    N
                                           t=0
Spam pages
characterization

Truncated
PageRank
                   There are many choices for damping(t), including simply a
Counting
                   linear function that is as good as PageRank in practice
supporters

Experiments

Conclusions
General functional ranking
   Using rank
propagation and
  Probabilistic
                   Let P the row-normalized version of the citation matrix of a
  counting for
  Link-Based
                   graph G = (V , E )
Spam Detection

                   A functional ranking [Baeza-Yates et al., 2006] is a
 L. Becchetti,
  C. Castillo,
                   link-based ranking algorithm to compute a scoring vector W
  D. Donato,
S. Leonardi and
                   of the form:
R. Baeza-Yates
                                             ∞
                                               damping(t) t
                                      W=                  P.
Motivation
                                                    N
                                           t=0
Spam pages
characterization

Truncated
PageRank
                   There are many choices for damping(t), including simply a
Counting
                   linear function that is as good as PageRank in practice
supporters

Experiments

Conclusions
                                     damping(t) = (1 − α)αt
Truncated PageRank
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                   Reduce the direct contribution of the first levels of links:
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank
                                                            t≤T
                                                     0
                                    damping(t) =
Counting
                                                     C αt
supporters
                                                            t>T
Experiments

Conclusions
Truncated PageRank
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                   Reduce the direct contribution of the first levels of links:
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank
                                                            t≤T
                                                     0
                                    damping(t) =
Counting
                                                     C αt
supporters
                                                            t>T
Experiments
                   V No extra reading of the graph after PageRank
Conclusions
General algorithm
                   Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for
   Using rank
propagation and        truncation
  Probabilistic
                    1: for i : 1 . . . N do {Initialization}
  counting for
                    2: R[i] ← (1 − α)/((αT +1 )N)
  Link-Based
Spam Detection
                    3: if T≥ 0 then
                    4:       Score[i] ← 0
 L. Becchetti,
                    5: else {Calculate normal PageRank}
  C. Castillo,
  D. Donato,
                    6:       Score[i] ← R[i]
S. Leonardi and
                    7: end if
R. Baeza-Yates
                    8: end for
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
General algorithm
                   Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for
   Using rank
propagation and        truncation
  Probabilistic
                    1: for i : 1 . . . N do {Initialization}
  counting for
                    2: R[i] ← (1 − α)/((αT +1 )N)
  Link-Based
Spam Detection
                    3: if T≥ 0 then
                    4:       Score[i] ← 0
 L. Becchetti,
                    5: else {Calculate normal PageRank}
  C. Castillo,
  D. Donato,
                    6:       Score[i] ← R[i]
S. Leonardi and
                    7: end if
R. Baeza-Yates
                    8: end for
                    9: distance = 1
Motivation
                   10: while not converged do
Spam pages
                   11: Aux ← 0
characterization
                   12: for src : 1 . . . N do {Follow links in the graph}
Truncated
                   13:       for all link from src to dest do
PageRank
                   14:          Aux[dest] ← Aux[dest] + R[src]/outdegree(src)
Counting
                   15:       end for
supporters
                   16: end for
Experiments
                   17: for i : 1 . . . N do {Apply damping factor α}
Conclusions        18:       R[i] ← Aux[i] ×α
                   19:       if distance > T then {Add to ranking value}
                   20:          Score[i] ← Score[i] + R[i]
                   21:       end if
                   22: end for
                   23: distance = distance +1
                   24: end while
Truncated PageRank vs PageRank
   Using rank
propagation and
  Probabilistic
  counting for                                 −3                                                       −3
                                              10                                                       10
  Link-Based
Spam Detection
                     Truncated PageRank T=1




                                                                              Truncated PageRank T=4
                                               −4                                                       −4
                                              10                                                       10
 L. Becchetti,
                                               −5                                                       −5
  C. Castillo,                                10                                                       10
  D. Donato,
                                               −6                                                       −6
S. Leonardi and                               10                                                       10
R. Baeza-Yates
                                               −7                                                       −7
                                              10                                                       10

Motivation                                     −8                                                       −8
                                              10                                                       10

Spam pages                                     −9                                                       −9
                                              10 −8                                                    10 −8
characterization                                          −6             −4                                        −6             −4
                                                10      10              10                               10      10              10
                                                      Normal PageRank                                          Normal PageRank
Truncated
PageRank

                   Comparing PageRank and Truncated PageRank with T = 1
Counting
supporters
                   and T = 4.
Experiments
                   The correlation is high and decreases as more levels are
Conclusions
                   truncated.
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Probabilistic counting
   Using rank
propagation and
  Probabilistic
  counting for
                                                                        1
                            1
  Link-Based
                                                                        0
                            0
Spam Detection
                                                                        0
                            0
 L. Becchetti,                                                          0
                            0
                                          0                                     1
                                    1                                                     1
  C. Castillo,                                                          1
                            1
                                          0                                     0
                                    1                                                     1
  D. Donato,                                                            0
                            0
                                                                                0
                                    0     0                                               0
S. Leonardi and                                   Propagation of                0
                                    0                                                     1
                                          1
R. Baeza-Yates                                     bits using the               1
                                    0                                                     1
                                          1
                                                 “OR” operation                 1
                                    0                                                     1
                                          0
Motivation
                                                                            1
                                        Target
                                0                                                   Count bits set
Spam pages
                                                                            0
                                         page
                                0
characterization                                                                     to estimate
                                                                            0
                                0                                                    supporters
Truncated                                                                   0
                                0
                                                                    1
                        1
PageRank                                                                    1
                                1
                                                                    0
                        0                                                   1
                                1
                                                                    0
                        0
Counting
                                                                    0
                        0
supporters
                                                                    1
                        1
Experiments                                                         0
                        0

Conclusions
Probabilistic counting
   Using rank
propagation and
  Probabilistic
  counting for
                                                                        1
                            1
  Link-Based
                                                                        0
                            0
Spam Detection
                                                                        0
                            0
 L. Becchetti,                                                          0
                            0
                                          0                                     1
                                    1                                                     1
  C. Castillo,                                                          1
                            1
                                          0                                     0
                                    1                                                     1
  D. Donato,                                                            0
                            0
                                                                                0
                                    0     0                                               0
S. Leonardi and                                   Propagation of                0
                                    0                                                     1
                                          1
R. Baeza-Yates                                     bits using the               1
                                    0                                                     1
                                          1
                                                 “OR” operation                 1
                                    0                                                     1
                                          0
Motivation
                                                                            1
                                        Target
                                0                                                   Count bits set
Spam pages
                                                                            0
                                         page
                                0
characterization                                                                     to estimate
                                                                            0
                                0                                                    supporters
Truncated                                                                   0
                                0
                                                                    1
                        1
PageRank                                                                    1
                                1
                                                                    0
                        0                                                   1
                                1
                                                                    0
                        0
Counting
                                                                    0
                        0
supporters
                                                                    1
                        1
Experiments                                                         0
                        0

Conclusions



                   Improvement of ANF algorithm [Palmer et al., 2002] based on
                   probabilistic counting [Flajolet and Martin, 1985]
General algorithm
   Using rank
                   Require: N: number of nodes, d: distance, k: bits
propagation and
  Probabilistic
                    1: for node : 1 . . . N, bit: 1 . . . k do
  counting for
  Link-Based
                         INIT(node,bit)
                    2:
Spam Detection

                    3: end for
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
General algorithm
   Using rank
                   Require: N: number of nodes, d: distance, k: bits
propagation and
  Probabilistic
                    1: for node : 1 . . . N, bit: 1 . . . k do
  counting for
  Link-Based
                         INIT(node,bit)
                    2:
Spam Detection

                    3: end for
 L. Becchetti,
  C. Castillo,
                    4: for distance : 1 . . . d do {Iteration step}
  D. Donato,
S. Leonardi and
                         Aux ← 0k
                    5:
R. Baeza-Yates

                         for src : 1 . . . N do {Follow links in the graph}
                    6:
Motivation
                            for all links from src to dest do
                    7:
Spam pages
                               Aux[dest] ← Aux[dest] OR V[src,·]
characterization    8:
Truncated
                            end for
                    9:
PageRank
                         end for
                   10:
Counting
supporters
                         V ← Aux
                   11:
Experiments
                   12: end for
Conclusions
General algorithm
   Using rank
                   Require: N: number of nodes, d: distance, k: bits
propagation and
  Probabilistic
                    1: for node : 1 . . . N, bit: 1 . . . k do
  counting for
  Link-Based
                         INIT(node,bit)
                    2:
Spam Detection

                    3: end for
 L. Becchetti,
  C. Castillo,
                    4: for distance : 1 . . . d do {Iteration step}
  D. Donato,
S. Leonardi and
                         Aux ← 0k
                    5:
R. Baeza-Yates

                         for src : 1 . . . N do {Follow links in the graph}
                    6:
Motivation
                            for all links from src to dest do
                    7:
Spam pages
                               Aux[dest] ← Aux[dest] OR V[src,·]
characterization    8:
Truncated
                            end for
                    9:
PageRank
                         end for
                   10:
Counting
supporters
                         V ← Aux
                   11:
Experiments
                   12: end for
Conclusions
                   13: for node: 1 . . . N do {Estimate supporters}
                         Supporters[node] ← ESTIMATE( V[node,·] )
                   14:
                   15: end for
                   16: return Supporters
Our estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Initialize all bits to one with probability
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Our estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Initialize all bits to one with probability
Spam Detection

                   by the independence of the i − th component Xi ’s we have,
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                             P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Our estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Initialize all bits to one with probability
Spam Detection

                   by the independence of the i − th component Xi ’s we have,
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                             P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation

Spam pages                                                        ones(node)
                                                             1−
                   Estimator: neighbors(node) = log(1−
characterization                                         )             k
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Our estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Initialize all bits to one with probability
Spam Detection

                   by the independence of the i − th component Xi ’s we have,
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                             P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation

Spam pages
                   Estimator: neighbors(node) = log(1− ) 1 − ones(node)
characterization
                                                                  k
Truncated
                   Problem: neighbors(node) can vary by orders of magnitudes
PageRank
                   as node varies.
Counting
supporters

Experiments

Conclusions
Our estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Initialize all bits to one with probability
Spam Detection

                   by the independence of the i − th component Xi ’s we have,
 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                             P[Xi = 1] = 1 − (1 − )neighbors(node) ,
Motivation

Spam pages
                   Estimator: neighbors(node) = log(1− ) 1 − ones(node)
characterization
                                                                  k
Truncated
                   Problem: neighbors(node) can vary by orders of magnitudes
PageRank
                   as node varies.
Counting
supporters
                   This means that for some values of , the computed value of
Experiments
                   ones(node) might be k (or 0, depending on neighbors(node))
Conclusions
                   with relatively high probability.
Adaptive estimator
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                                                                          1
Spam Detection
                   if we knew neighbors(node) and chose        =                     we
                                                                   neighbors(node)
 L. Becchetti,
                   would get:
  C. Castillo,
  D. Donato,
S. Leonardi and
                                                       1
R. Baeza-Yates
                                                  1−
                               ones(node)                  k       0.63k,
                                                       e
Motivation

Spam pages
characterization
                   Adaptive estimation
Truncated
PageRank
                   Repeat the above process for = 1/2, 1/4, 1/8, . . . , and look
Counting
                   for the transitions from more than (1 − 1/e)k ones to less
supporters

                   than (1 − 1/e)k ones.
Experiments

Conclusions
Convergence
   Using rank
                                            100%
propagation and
  Probabilistic
                                            90%
  counting for
  Link-Based
Spam Detection
                                            80%


                        Fraction of nodes
 L. Becchetti,
                                            70%


                         with estimates
  C. Castillo,
  D. Donato,
                                            60%
S. Leonardi and
R. Baeza-Yates
                                            50%                         d=1
                                                                        d=2
Motivation
                                            40%
                                                                        d=3
Spam pages
                                            30%                         d=4
characterization
                                                                        d=5
Truncated
                                            20%
                                                                        d=6
PageRank
                                                                        d=7
                                            10%
Counting
                                                                        d=8
supporters
                                             0%
                                                   5   10          15   20
Experiments
                                                       Iteration
Conclusions
Convergence
   Using rank
                                             100%
propagation and
  Probabilistic
                                             90%
  counting for
  Link-Based
Spam Detection
                                             80%


                         Fraction of nodes
 L. Becchetti,
                                             70%


                          with estimates
  C. Castillo,
  D. Donato,
                                             60%
S. Leonardi and
R. Baeza-Yates
                                             50%                         d=1
                                                                         d=2
Motivation
                                             40%
                                                                         d=3
Spam pages
                                             30%                         d=4
characterization
                                                                         d=5
Truncated
                                             20%
                                                                         d=6
PageRank
                                                                         d=7
                                             10%
Counting
                                                                         d=8
supporters
                                              0%
                                                    5   10          15   20
Experiments
                                                        Iteration
Conclusions



                   15 iterations for estimating the neighbors at distance 4 or less
Convergence
   Using rank
                                             100%
propagation and
  Probabilistic
                                             90%
  counting for
  Link-Based
Spam Detection
                                             80%


                         Fraction of nodes
 L. Becchetti,
                                             70%


                          with estimates
  C. Castillo,
  D. Donato,
                                             60%
S. Leonardi and
R. Baeza-Yates
                                             50%                         d=1
                                                                         d=2
Motivation
                                             40%
                                                                         d=3
Spam pages
                                             30%                         d=4
characterization
                                                                         d=5
Truncated
                                             20%
                                                                         d=6
PageRank
                                                                         d=7
                                             10%
Counting
                                                                         d=8
supporters
                                              0%
                                                    5   10          15   20
Experiments
                                                        Iteration
Conclusions



                   15 iterations for estimating the neighbors at distance 4 or less
                   less than 25 iterations for all distances up to 8.
Error rate
   Using rank
propagation and
  Probabilistic
  counting for
                                                                                   Ours 64 bits, epsilon−only estimator
  Link-Based
Spam Detection
                                                                                   Ours 64 bits, combined estimator
                                                  0.5                              ANF 24 bits × 24 iterations (576 b×i)

                         Average Relative Error
 L. Becchetti,
  C. Castillo,
                                                                                   ANF 24 bits × 48 iterations (1152 b×i)
  D. Donato,
S. Leonardi and
                                                  0.4
R. Baeza-Yates
                                                                                       960 b×i

                                                                                                             1216 b×i
                                                        512 b×i              832 b×i                                    1344 b×i 1408 b×i
Motivation                                                        768 b×i                         1152 b×i
                                                  0.3
Spam pages
characterization

                                                  0.2                                        576 b×i
Truncated                                                         1152 b×i
PageRank
                                                        512 b×i   768 b×i              960 b×i               1216 b×i   1344 b×i 1408 b×i
                                                                             832 b×i              1152 b×i
Counting
                                                  0.1
supporters

Experiments

                                                   0
Conclusions
                                                        1           2          3         4             5       6          7            8
                                                                                       Distance
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Test collection
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

                    U.K. collection
 L. Becchetti,
  C. Castillo,
                    18.5 million pages downloaded from the .UK domain
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                    5,344 hosts manually classified (6% of the hosts)
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Test collection
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

                    U.K. collection
 L. Becchetti,
  C. Castillo,
                    18.5 million pages downloaded from the .UK domain
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                    5,344 hosts manually classified (6% of the hosts)
Motivation

Spam pages
characterization

Truncated
                    Classified entire hosts:
PageRank

Counting
                      V A few hosts are mixed: spam and non-spam pages
supporters

                      X More coverage: sample covers 32% of the pages
Experiments

Conclusions
Automatic classifier
                   We extracted (for the home page and the page with
   Using rank
propagation and
                   maximum PageRank) PageRank, Truncated PageRank at
  Probabilistic
  counting for
                   2 . . . 4, Supporters at 2 . . . 4
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Automatic classifier
                   We extracted (for the home page and the page with
   Using rank
propagation and
                   maximum PageRank) PageRank, Truncated PageRank at
  Probabilistic
  counting for
                   2 . . . 4, Supporters at 2 . . . 4
  Link-Based
Spam Detection
                   We measured:
 L. Becchetti,
  C. Castillo,
  D. Donato,
                                      # of spam hosts classified as spam
S. Leonardi and
                        Precision =
R. Baeza-Yates
                                        # of hosts classified as spam
Motivation
                                    # of spam hosts classified as spam
                         Recall =                                     .
Spam pages
                                            # of spam hosts
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Automatic classifier
                   We extracted (for the home page and the page with
   Using rank
propagation and
                   maximum PageRank) PageRank, Truncated PageRank at
  Probabilistic
  counting for
                   2 . . . 4, Supporters at 2 . . . 4
  Link-Based
Spam Detection
                   We measured:
 L. Becchetti,
  C. Castillo,
  D. Donato,
                                       # of spam hosts classified as spam
S. Leonardi and
                         Precision =
R. Baeza-Yates
                                         # of hosts classified as spam
Motivation
                                     # of spam hosts classified as spam
                          Recall =                                     .
Spam pages
                                             # of spam hosts
characterization

Truncated
PageRank
                   and the two types of errors in spam classification
Counting
supporters

Experiments
                                            # of normal hosts classified as spam
Conclusions
                    False positive rate =
                                                    # of normal hosts
                                            # of spam hosts classified as normal
                   False negative rate =                                        .
                                                     # of spam hosts
Single-technique classifier
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
                   Classifier based on TrustRank: uses as features the PageRank,
Spam Detection
                                 the estimated non-spam mass, and the
 L. Becchetti,
  C. Castillo,
                                 estimated non-spam mass divided by PageRank.
  D. Donato,
S. Leonardi and
                   Classifier based on Truncated PageRank: uses as features the
R. Baeza-Yates

                                 PageRank, the Truncated PageRank with
Motivation
                                 truncation distance t = 2, 3, 4 (with t = 1 it
Spam pages
characterization
                                 would be just based on in-degree), and the
Truncated
                                 Truncated PageRank divided by PageRank.
PageRank

                   Classifier based on Estimation of Supporters: uses as features
Counting
supporters
                                 the PageRank, the estimation of supporters at a
Experiments
                                 given distance d = 2, 3, 4, and the estimation of
Conclusions
                                 supporters divided by PageRank.
Comparison of single-technique classifier (M=5)
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                    Classifiers               Spam   class    False   False
 L. Becchetti,
  C. Castillo,
                    (pruning with M = 5)    Prec.   Recall    Pos.   Neg.
  D. Donato,
S. Leonardi and
                     TrustRank              0.82     0.50    2.1%    50%
R. Baeza-Yates

                    Trunc. PageRank t = 2   0.85     0.50    1.6%    50%
Motivation
                    Trunc. PageRank t = 3   0.84     0.47    1.6%    53%
Spam pages
characterization
                    Trunc. PageRank t = 4   0.79     0.45    2.2%    55%
Truncated
                    Est. Supporters d = 2   0.78     0.60    3.2%    40%
PageRank
                    Est. Supporters d = 3   0.83     0.64    2.4%    36%
Counting
supporters
                    Est. Supporters d = 4   0.86     0.64    2.0%    36%
Experiments

Conclusions
Comparison of single-technique classifier (M=30)
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection
                    Classifiers               Spam   class    False   False
 L. Becchetti,
  C. Castillo,
                    (pruning with M = 30)   Prec.   Recall    Pos.   Neg.
  D. Donato,
S. Leonardi and
                    TrustRank               0.80     0.49    2.3%    51%
R. Baeza-Yates

                    Trunc. PageRank t = 2   0.82     0.43    1.8%    57%
Motivation
                    Trunc. PageRank t = 3   0.81     0.42    1.8%    58%
Spam pages
characterization
                    Trunc. PageRank t = 4   0.77     0.43    2.4%    57%
Truncated
                    Est. Supporters d = 2   0.76     0.52    3.1%    48%
PageRank
                    Est. Supporters d = 3   0.83     0.57    2.1%    43%
Counting
supporters
                    Est. Supporters d = 4   0.80     0.57    2.6%    43%
Experiments

Conclusions
Combined classifier
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                                            Spam class      False   False
S. Leonardi and
R. Baeza-Yates
                     Pruning     Rules   Precision Recall    Pos.   Neg.
Motivation
                      M=5         49       0.87     0.80    2.0%    20%
Spam pages
                      M=30        31       0.88     0.76    1.8%    24%
characterization

                    No pruning    189      0.85     0.79    2.6%    21%
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
                       Motivation
                   1
S. Leonardi and
                       Spam pages characterization
                   2
R. Baeza-Yates

                       Truncated PageRank
                   3
Motivation
                       Counting supporters
                   4
Spam pages
                       Experiments
                   5
characterization

                       Conclusions
                   6
Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Summary of classifiers
   Using rank
                         (a) Precision and recall of spam detection
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank
                           (b) Error rates of the spam classifiers
Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
R. Baeza-Yates

Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
                     X   Precision still low compared to e-mail spam filters
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
                     X   Precision still low compared to e-mail spam filters
Motivation


                     V
Spam pages
                         Measure both home page and max. PageRank page
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
                     X   Precision still low compared to e-mail spam filters
Motivation


                     V
Spam pages
                         Measure both home page and max. PageRank page
characterization

                     V
Truncated
                         Host-based counts are important
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
                     X   Precision still low compared to e-mail spam filters
Motivation


                     V
Spam pages
                         Measure both home page and max. PageRank page
characterization

                     V
Truncated
                         Host-based counts are important
PageRank

Counting
supporters

Experiments

Conclusions
Conclusions
   Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
                     V   Link-based statistics to detect 80% of spam
  D. Donato,
S. Leonardi and
                     X
R. Baeza-Yates
                         No magic bullet in link analysis
                     X   Precision still low compared to e-mail spam filters
Motivation


                     V
Spam pages
                         Measure both home page and max. PageRank page
characterization

                     V
Truncated
                         Host-based counts are important
PageRank

Counting
                   Next step: combine link analysis and content analysis
supporters

Experiments

Conclusions
Using rank
propagation and
  Probabilistic
  counting for
  Link-Based
Spam Detection

 L. Becchetti,
  C. Castillo,
  D. Donato,
S. Leonardi and
R. Baeza-Yates

                   Thank you!
Motivation

Spam pages
characterization

Truncated
PageRank

Counting
supporters

Experiments

Conclusions
Using rank
                   Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
propagation and
  Probabilistic
                   Generalizing PageRank: Damping functions for link-based
  counting for
  Link-Based
                   ranking algorithms.
Spam Detection

                   In Proceedings of SIGIR, Seattle, Washington, USA. ACM
 L. Becchetti,
  C. Castillo,
                   Press.
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                   Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and
                   Baeza-Yates, R. (2006).
Motivation

                   Using rank propagation and probabilistic counting for
Spam pages
characterization
                   link-based spam detection.
Truncated
                   In Proceedings of the Workshop on Web Mining and Web
PageRank

                   Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Counting
supporters

Experiments
                   Bencz´r, A. A., Csalog´ny, K., Sarl´s, T., and Uher, M.
                        u                a            o
Conclusions        (2005).
                   Spamrank: fully automatic link spam detection.
                   In Proceedings of the First International Workshop on
                   Adversarial Information Retrieval on the Web, Chiba, Japan.
Using rank
propagation and
  Probabilistic
  counting for
                   Fetterly, D., Manasse, M., and Najork, M. (2004).
  Link-Based
Spam Detection
                   Spam, damn spam, and statistics: Using statistical analysis to
 L. Becchetti,
                   locate spam web pages.
  C. Castillo,
  D. Donato,
                   In Proceedings of the seventh workshop on the Web and
S. Leonardi and
R. Baeza-Yates
                   databases (WebDB), pages 1–6, Paris, France.
Motivation
                   Flajolet, P. and Martin, N. G. (1985).
Spam pages
characterization
                   Probabilistic counting algorithms for data base applications.
Truncated
                   Journal of Computer and System Sciences, 31(2):182–209.
PageRank

Counting
                   Gibson, D., Kumar, R., and Tomkins, A. (2005).
supporters

Experiments
                   Discovering large dense subgraphs in massive graphs.
Conclusions
                   In VLDB ’05: Proceedings of the 31st international conference
                   on Very large data bases, pages 721–732. VLDB Endowment.
Using rank
propagation and
  Probabilistic
                   Gy¨ngyi, Z. and Garcia-Molina, H. (2005).
                     o
  counting for
  Link-Based
                   Web spam taxonomy.
Spam Detection

 L. Becchetti,
                   In First International Workshop on Adversarial Information
  C. Castillo,
                   Retrieval on the Web.
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                   Gy¨ngyi, Z., Molina, H. G., and Pedersen, J. (2004).
                     o
Motivation
                   Combating web spam with trustrank.
Spam pages
                   In Proceedings of the Thirtieth International Conference on
characterization

                   Very Large Data Bases (VLDB), pages 576–587, Toronto,
Truncated
PageRank
                   Canada. Morgan Kaufmann.
Counting
supporters
                   Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Experiments
                   Random graphs with arbitrary degree distributions and their
Conclusions
                   applications.
                   Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Using rank
propagation and
                   Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
  Probabilistic
  counting for
  Link-Based
                   Detecting spam web pages through content analysis.
Spam Detection
                   In Proceedings of the World Wide Web conference, pages
 L. Becchetti,
  C. Castillo,
                   83–92, Edinburgh, Scotland.
  D. Donato,
S. Leonardi and
R. Baeza-Yates
                   Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
                   ANF: a fast and scalable tool for data mining in massive
Motivation

                   graphs.
Spam pages
characterization
                   In Proceedings of the eighth ACM SIGKDD international
Truncated
                   conference on Knowledge discovery and data mining, pages
PageRank

                   81–90, New York, NY, USA. ACM Press.
Counting
supporters

                   Perkins, A. (2001).
Experiments

Conclusions
                   The classification of search engine spam.
                   Available online at
                   http://www.silverdisc.co.uk/articles/spam-classification/.

More Related Content

Viewers also liked

Risk Beyond Acquiring: Merchant Risk Across FinTech
Risk Beyond Acquiring: Merchant Risk Across FinTechRisk Beyond Acquiring: Merchant Risk Across FinTech
Risk Beyond Acquiring: Merchant Risk Across FinTech
Geo Coelho
 
PhD Thesis presentation
PhD Thesis presentationPhD Thesis presentation
PhD Thesis presentation
Javier Ortega
 
Visa risk-management-guide-ecommerce
Visa risk-management-guide-ecommerceVisa risk-management-guide-ecommerce
Visa risk-management-guide-ecommerce
Sergey Krayev
 
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
Pedram Hayati
 
Accepting payment at your ecommerce website
Accepting payment at your ecommerce websiteAccepting payment at your ecommerce website
Accepting payment at your ecommerce website
youcandothisbiz
 
Bing Shopping program - Merchant Integration Guide
Bing Shopping program - Merchant Integration GuideBing Shopping program - Merchant Integration Guide
Bing Shopping program - Merchant Integration Guide
Juan Pittau
 
eCommerce Summit Atlanta Mountain Media
eCommerce Summit Atlanta Mountain MediaeCommerce Summit Atlanta Mountain Media
eCommerce Summit Atlanta Mountain Media
eCommerce Merchants
 
Wallet Infographic
Wallet InfographicWallet Infographic
Wallet Infographic
Worldline
 
How Innovative Onboarding Sets Up Sales
How Innovative Onboarding Sets Up SalesHow Innovative Onboarding Sets Up Sales
How Innovative Onboarding Sets Up Sales
TruebridgeFinancialMarketing
 
Finacle Digital Commerce
Finacle Digital CommerceFinacle Digital Commerce
Finacle Digital Commerce
Infosys Finacle
 
E mail image spam filtering techniques
E mail image spam filtering techniquesE mail image spam filtering techniques
E mail image spam filtering techniques
ranjit banshpal
 
Improving spam detection with automaton
Improving spam detection with automatonImproving spam detection with automaton
Improving spam detection with automaton
Antonio Costa aka Cooler_
 
Étude des techniques de classification et de filtrage automatique de Pourriels
Étude des techniques de classification et de filtrage automatique de PourrielsÉtude des techniques de classification et de filtrage automatique de Pourriels
Étude des techniques de classification et de filtrage automatique de Pourriels
guest3a44d425
 
Why Size Matters in Merchant Onboarding
Why Size Matters in Merchant OnboardingWhy Size Matters in Merchant Onboarding
Why Size Matters in Merchant Onboarding
Provenir
 
Spam
SpamSpam
Mobile Wallet Platform 2015
Mobile Wallet Platform 2015Mobile Wallet Platform 2015
Mobile Wallet Platform 2015
Mikhail Miroshnichenko
 
Ecash and ewallet
Ecash and ewalletEcash and ewallet
Ecash and ewallet
Tirthankar Sutradhar
 
Risk Assessment Process NIST 800-30
Risk Assessment Process NIST 800-30Risk Assessment Process NIST 800-30
Risk Assessment Process NIST 800-30
timmcguinness
 

Viewers also liked (20)

Risk Beyond Acquiring: Merchant Risk Across FinTech
Risk Beyond Acquiring: Merchant Risk Across FinTechRisk Beyond Acquiring: Merchant Risk Across FinTech
Risk Beyond Acquiring: Merchant Risk Across FinTech
 
PhD Thesis presentation
PhD Thesis presentationPhD Thesis presentation
PhD Thesis presentation
 
Visa risk-management-guide-ecommerce
Visa risk-management-guide-ecommerceVisa risk-management-guide-ecommerce
Visa risk-management-guide-ecommerce
 
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Sp...
 
Accepting payment at your ecommerce website
Accepting payment at your ecommerce websiteAccepting payment at your ecommerce website
Accepting payment at your ecommerce website
 
Bing Shopping program - Merchant Integration Guide
Bing Shopping program - Merchant Integration GuideBing Shopping program - Merchant Integration Guide
Bing Shopping program - Merchant Integration Guide
 
eCommerce Summit Atlanta Mountain Media
eCommerce Summit Atlanta Mountain MediaeCommerce Summit Atlanta Mountain Media
eCommerce Summit Atlanta Mountain Media
 
Wallet Infographic
Wallet InfographicWallet Infographic
Wallet Infographic
 
How Innovative Onboarding Sets Up Sales
How Innovative Onboarding Sets Up SalesHow Innovative Onboarding Sets Up Sales
How Innovative Onboarding Sets Up Sales
 
Finacle Digital Commerce
Finacle Digital CommerceFinacle Digital Commerce
Finacle Digital Commerce
 
E mail image spam filtering techniques
E mail image spam filtering techniquesE mail image spam filtering techniques
E mail image spam filtering techniques
 
Improving spam detection with automaton
Improving spam detection with automatonImproving spam detection with automaton
Improving spam detection with automaton
 
Étude des techniques de classification et de filtrage automatique de Pourriels
Étude des techniques de classification et de filtrage automatique de PourrielsÉtude des techniques de classification et de filtrage automatique de Pourriels
Étude des techniques de classification et de filtrage automatique de Pourriels
 
Why Size Matters in Merchant Onboarding
Why Size Matters in Merchant OnboardingWhy Size Matters in Merchant Onboarding
Why Size Matters in Merchant Onboarding
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
Spam
SpamSpam
Spam
 
Mobile Wallet Platform 2015
Mobile Wallet Platform 2015Mobile Wallet Platform 2015
Mobile Wallet Platform 2015
 
Ecash and ewallet
Ecash and ewalletEcash and ewallet
Ecash and ewallet
 
Digital wallet
Digital walletDigital wallet
Digital wallet
 
Risk Assessment Process NIST 800-30
Risk Assessment Process NIST 800-30Risk Assessment Process NIST 800-30
Risk Assessment Process NIST 800-30
 

More from Carlos Castillo (ChaTo)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
Carlos Castillo (ChaTo)
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
Carlos Castillo (ChaTo)
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Carlos Castillo (ChaTo)
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
Carlos Castillo (ChaTo)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
Carlos Castillo (ChaTo)
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
Carlos Castillo (ChaTo)
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
Carlos Castillo (ChaTo)
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
Carlos Castillo (ChaTo)
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
Carlos Castillo (ChaTo)
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
Carlos Castillo (ChaTo)
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
Carlos Castillo (ChaTo)
 
Link prediction
Link predictionLink prediction
Link prediction
Carlos Castillo (ChaTo)
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Carlos Castillo (ChaTo)
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
Carlos Castillo (ChaTo)
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
Carlos Castillo (ChaTo)
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
Carlos Castillo (ChaTo)
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
Carlos Castillo (ChaTo)
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
Carlos Castillo (ChaTo)
 
Indexing
IndexingIndexing
Text Summarization
Text SummarizationText Summarization
Text Summarization
Carlos Castillo (ChaTo)
 

More from Carlos Castillo (ChaTo) (20)

Finding High Quality Content in Social Media
Finding High Quality Content in Social MediaFinding High Quality Content in Social Media
Finding High Quality Content in Social Media
 
When no clicks are good news
When no clicks are good newsWhen no clicks are good news
When no clicks are good news
 
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
Socia Media and Digital Volunteering in Disaster Management @ DSEM 2017
 
Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)Detecting Algorithmic Bias (keynote at DIR 2016)
Detecting Algorithmic Bias (keynote at DIR 2016)
 
Discrimination Discovery
Discrimination DiscoveryDiscrimination Discovery
Discrimination Discovery
 
Fairness-Aware Data Mining
Fairness-Aware Data MiningFairness-Aware Data Mining
Fairness-Aware Data Mining
 
Big Crisis Data for ISPC
Big Crisis Data for ISPCBig Crisis Data for ISPC
Big Crisis Data for ISPC
 
Databeers: Big Crisis Data
Databeers: Big Crisis DataDatabeers: Big Crisis Data
Databeers: Big Crisis Data
 
Observational studies in social media
Observational studies in social mediaObservational studies in social media
Observational studies in social media
 
Natural experiments
Natural experimentsNatural experiments
Natural experiments
 
Content-based link prediction
Content-based link predictionContent-based link prediction
Content-based link prediction
 
Link prediction
Link predictionLink prediction
Link prediction
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Graph Partitioning and Spectral Methods
Graph Partitioning and Spectral MethodsGraph Partitioning and Spectral Methods
Graph Partitioning and Spectral Methods
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Link-Based Ranking
Link-Based RankingLink-Based Ranking
Link-Based Ranking
 
Text Indexing / Inverted Indices
Text Indexing / Inverted IndicesText Indexing / Inverted Indices
Text Indexing / Inverted Indices
 
Indexing
IndexingIndexing
Indexing
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 

Recently uploaded

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 

Recently uploaded (20)

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 

Using Rank Propagation for Spam Detection (WebKDD 2006)

  • 1. Using rank propagation and Probabilistic counting for Link-Based Spam Detection Using rank propagation and Probabilistic L. Becchetti, C. Castillo, D. Donato, counting for Link-Based Spam Detection S. Leonardi and R. Baeza-Yates Motivation Luca Becchetti1 , Carlos Castillo1 ,Debora Donato1 , Spam pages characterization Stefano Leonardi1 and Ricardo Baeza-Yates2 Truncated PageRank 1. Universit` di Roma “La Sapienza” – Rome, Italy a Counting 2. Yahoo! Research – Barcelona, Spain and Santiago, Chile supporters Experiments August 20th, 2006 Conclusions
  • 2. Using rank propagation and Probabilistic counting for Motivation Link-Based 1 Spam Detection L. Becchetti, C. Castillo, Spam pages characterization 2 D. Donato, S. Leonardi and R. Baeza-Yates Truncated PageRank 3 Motivation Spam pages characterization Counting supporters 4 Truncated PageRank Counting supporters Experiments 5 Experiments Conclusions Conclusions 6
  • 3. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 4. What is on the Web? Using rank Information propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 5. What is on the Web? Using rank Information + Porn propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 6. What is on the Web? Using rank Information + Porn + On-line casinos + Free movies + propagation and Probabilistic Cheap software + Buy a MBA diploma + Prescription -free counting for Link-Based drugs + V!-4-gra + Get rich now now now!!! Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions Graphic: www.milliondollarhomepage.com
  • 7. Web spam (keywords + links) Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 8. Web spam (mostly keywords) Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 9. Search engine? Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 10. Fake search engine Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 11. Problem: “normal” pages that are spam Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 12. Problem: “normal” pages that are spam Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions Some content is introduced
  • 13. Problem: borderline pages Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 14. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 15. Definitions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 16. Definitions Using rank propagation and Probabilistic counting for Any deliberate action that is meant to trigger Link-Based Spam Detection an unjustifiably favorable relevance or importance L. Becchetti, for some Web page, considering the page’s true value C. Castillo, D. Donato, [Gy¨ngyi and Garcia-Molina, 2005] o S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 17. Definitions Using rank propagation and Probabilistic counting for Any deliberate action that is meant to trigger Link-Based Spam Detection an unjustifiably favorable relevance or importance L. Becchetti, for some Web page, considering the page’s true value C. Castillo, D. Donato, [Gy¨ngyi and Garcia-Molina, 2005] o S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization any attempt to deceive a search engine’s relevancy algorithm Truncated PageRank or simply Counting supporters Experiments anything that would not be done Conclusions if search engines did not exist. [Perkins, 2001]
  • 18. Link farms Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 19. Link farms Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Single-level farms can be detected by searching groups of Conclusions nodes sharing their out-links [Gibson et al., 2005]
  • 20. Link−based spam Using rank propagation and Probabilistic counting for Link-Based Spam Detection [Fetterly et al., 2004] hypothesized that studying the L. Becchetti, C. Castillo, distribution of statistics about pages could be a good way of D. Donato, detecting spam pages: S. Leonardi and R. Baeza-Yates Motivation “in a number of these distributions, outlier values are Spam pages associated with web spam” characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 21. Link−based spam Using rank propagation and Probabilistic counting for Link-Based Spam Detection [Fetterly et al., 2004] hypothesized that studying the L. Becchetti, C. Castillo, distribution of statistics about pages could be a good way of D. Donato, detecting spam pages: S. Leonardi and R. Baeza-Yates Motivation “in a number of these distributions, outlier values are Spam pages associated with web spam” characterization Truncated PageRank Research goal Counting supporters Statistical analysis of link-based spam Experiments Conclusions
  • 22. Idea: count “supporters” at different distances Using rank propagation and Probabilistic counting for Link-Based Number of different nodes at a given distance: Spam Detection L. Becchetti, C. Castillo, .UK 18 mill. nodes .EU.INT 860,000 nodes D. Donato, S. Leonardi and 0.3 0.3 R. Baeza-Yates Motivation 0.2 0.2 Frequency Frequency Spam pages characterization 0.1 0.1 Truncated PageRank 0.0 0.0 5 10 15 20 25 30 5 10 15 20 25 30 Counting supporters Distance Distance Experiments Average distance Average distance Conclusions 14.9 clicks 10.0 clicks
  • 23. High and low-ranked pages are different Using rank 4 propagation and x 10 Probabilistic Top 0%−10% counting for 12 Link-Based Top 40%−50% Spam Detection Top 60%−70% L. Becchetti, 10 Number of Nodes C. Castillo, D. Donato, S. Leonardi and 8 R. Baeza-Yates Motivation 6 Spam pages characterization 4 Truncated PageRank 2 Counting supporters Experiments 0 1 5 10 15 20 Conclusions Distance
  • 24. High and low-ranked pages are different Using rank 4 propagation and x 10 Probabilistic Top 0%−10% counting for 12 Link-Based Top 40%−50% Spam Detection Top 60%−70% L. Becchetti, 10 Number of Nodes C. Castillo, D. Donato, S. Leonardi and 8 R. Baeza-Yates Motivation 6 Spam pages characterization 4 Truncated PageRank 2 Counting supporters Experiments 0 1 5 10 15 20 Conclusions Distance Areas below the curves are equal if we are in the same strongly-connected component
  • 25. Metrics Using rank Graph algorithms propagation and Probabilistic counting for All shortest paths, centrality, betweenness, clustering coefficient... Link-Based Spam Detection Streamed algorithms L. Becchetti, Breadth-first and depth-first search C. Castillo, D. Donato, Count of neighbors S. Leonardi and R. Baeza-Yates Symmetric algorithms Motivation (Strongly) connected components Spam pages Approximate count of neighbors characterization Truncated PageRank, Truncated PageRank, Linear Rank PageRank HITS, Salsa, TrustRank Counting supporters Experiments Conclusions
  • 26. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 27. General functional ranking Using rank propagation and Probabilistic Let P the row-normalized version of the citation matrix of a counting for Link-Based graph G = (V , E ) Spam Detection A functional ranking [Baeza-Yates et al., 2006] is a L. Becchetti, C. Castillo, link-based ranking algorithm to compute a scoring vector W D. Donato, S. Leonardi and of the form: R. Baeza-Yates ∞ damping(t) t W= P. Motivation N t=0 Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 28. General functional ranking Using rank propagation and Probabilistic Let P the row-normalized version of the citation matrix of a counting for Link-Based graph G = (V , E ) Spam Detection A functional ranking [Baeza-Yates et al., 2006] is a L. Becchetti, C. Castillo, link-based ranking algorithm to compute a scoring vector W D. Donato, S. Leonardi and of the form: R. Baeza-Yates ∞ damping(t) t W= P. Motivation N t=0 Spam pages characterization Truncated PageRank There are many choices for damping(t), including simply a Counting linear function that is as good as PageRank in practice supporters Experiments Conclusions
  • 29. General functional ranking Using rank propagation and Probabilistic Let P the row-normalized version of the citation matrix of a counting for Link-Based graph G = (V , E ) Spam Detection A functional ranking [Baeza-Yates et al., 2006] is a L. Becchetti, C. Castillo, link-based ranking algorithm to compute a scoring vector W D. Donato, S. Leonardi and of the form: R. Baeza-Yates ∞ damping(t) t W= P. Motivation N t=0 Spam pages characterization Truncated PageRank There are many choices for damping(t), including simply a Counting linear function that is as good as PageRank in practice supporters Experiments Conclusions damping(t) = (1 − α)αt
  • 30. Truncated PageRank Using rank propagation and Probabilistic counting for Link-Based Spam Detection Reduce the direct contribution of the first levels of links: L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank t≤T 0 damping(t) = Counting C αt supporters t>T Experiments Conclusions
  • 31. Truncated PageRank Using rank propagation and Probabilistic counting for Link-Based Spam Detection Reduce the direct contribution of the first levels of links: L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank t≤T 0 damping(t) = Counting C αt supporters t>T Experiments V No extra reading of the graph after PageRank Conclusions
  • 32. General algorithm Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for Using rank propagation and truncation Probabilistic 1: for i : 1 . . . N do {Initialization} counting for 2: R[i] ← (1 − α)/((αT +1 )N) Link-Based Spam Detection 3: if T≥ 0 then 4: Score[i] ← 0 L. Becchetti, 5: else {Calculate normal PageRank} C. Castillo, D. Donato, 6: Score[i] ← R[i] S. Leonardi and 7: end if R. Baeza-Yates 8: end for Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 33. General algorithm Require: N: number of nodes, 0 < α < 1: damping factor, T≥ −1: distance for Using rank propagation and truncation Probabilistic 1: for i : 1 . . . N do {Initialization} counting for 2: R[i] ← (1 − α)/((αT +1 )N) Link-Based Spam Detection 3: if T≥ 0 then 4: Score[i] ← 0 L. Becchetti, 5: else {Calculate normal PageRank} C. Castillo, D. Donato, 6: Score[i] ← R[i] S. Leonardi and 7: end if R. Baeza-Yates 8: end for 9: distance = 1 Motivation 10: while not converged do Spam pages 11: Aux ← 0 characterization 12: for src : 1 . . . N do {Follow links in the graph} Truncated 13: for all link from src to dest do PageRank 14: Aux[dest] ← Aux[dest] + R[src]/outdegree(src) Counting 15: end for supporters 16: end for Experiments 17: for i : 1 . . . N do {Apply damping factor α} Conclusions 18: R[i] ← Aux[i] ×α 19: if distance > T then {Add to ranking value} 20: Score[i] ← Score[i] + R[i] 21: end if 22: end for 23: distance = distance +1 24: end while
  • 34. Truncated PageRank vs PageRank Using rank propagation and Probabilistic counting for −3 −3 10 10 Link-Based Spam Detection Truncated PageRank T=1 Truncated PageRank T=4 −4 −4 10 10 L. Becchetti, −5 −5 C. Castillo, 10 10 D. Donato, −6 −6 S. Leonardi and 10 10 R. Baeza-Yates −7 −7 10 10 Motivation −8 −8 10 10 Spam pages −9 −9 10 −8 10 −8 characterization −6 −4 −6 −4 10 10 10 10 10 10 Normal PageRank Normal PageRank Truncated PageRank Comparing PageRank and Truncated PageRank with T = 1 Counting supporters and T = 4. Experiments The correlation is high and decreases as more levels are Conclusions truncated.
  • 35. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 36. Probabilistic counting Using rank propagation and Probabilistic counting for 1 1 Link-Based 0 0 Spam Detection 0 0 L. Becchetti, 0 0 0 1 1 1 C. Castillo, 1 1 0 0 1 1 D. Donato, 0 0 0 0 0 0 S. Leonardi and Propagation of 0 0 1 1 R. Baeza-Yates bits using the 1 0 1 1 “OR” operation 1 0 1 0 Motivation 1 Target 0 Count bits set Spam pages 0 page 0 characterization to estimate 0 0 supporters Truncated 0 0 1 1 PageRank 1 1 0 0 1 1 0 0 Counting 0 0 supporters 1 1 Experiments 0 0 Conclusions
  • 37. Probabilistic counting Using rank propagation and Probabilistic counting for 1 1 Link-Based 0 0 Spam Detection 0 0 L. Becchetti, 0 0 0 1 1 1 C. Castillo, 1 1 0 0 1 1 D. Donato, 0 0 0 0 0 0 S. Leonardi and Propagation of 0 0 1 1 R. Baeza-Yates bits using the 1 0 1 1 “OR” operation 1 0 1 0 Motivation 1 Target 0 Count bits set Spam pages 0 page 0 characterization to estimate 0 0 supporters Truncated 0 0 1 1 PageRank 1 1 0 0 1 1 0 0 Counting 0 0 supporters 1 1 Experiments 0 0 Conclusions Improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
  • 38. General algorithm Using rank Require: N: number of nodes, d: distance, k: bits propagation and Probabilistic 1: for node : 1 . . . N, bit: 1 . . . k do counting for Link-Based INIT(node,bit) 2: Spam Detection 3: end for L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 39. General algorithm Using rank Require: N: number of nodes, d: distance, k: bits propagation and Probabilistic 1: for node : 1 . . . N, bit: 1 . . . k do counting for Link-Based INIT(node,bit) 2: Spam Detection 3: end for L. Becchetti, C. Castillo, 4: for distance : 1 . . . d do {Iteration step} D. Donato, S. Leonardi and Aux ← 0k 5: R. Baeza-Yates for src : 1 . . . N do {Follow links in the graph} 6: Motivation for all links from src to dest do 7: Spam pages Aux[dest] ← Aux[dest] OR V[src,·] characterization 8: Truncated end for 9: PageRank end for 10: Counting supporters V ← Aux 11: Experiments 12: end for Conclusions
  • 40. General algorithm Using rank Require: N: number of nodes, d: distance, k: bits propagation and Probabilistic 1: for node : 1 . . . N, bit: 1 . . . k do counting for Link-Based INIT(node,bit) 2: Spam Detection 3: end for L. Becchetti, C. Castillo, 4: for distance : 1 . . . d do {Iteration step} D. Donato, S. Leonardi and Aux ← 0k 5: R. Baeza-Yates for src : 1 . . . N do {Follow links in the graph} 6: Motivation for all links from src to dest do 7: Spam pages Aux[dest] ← Aux[dest] OR V[src,·] characterization 8: Truncated end for 9: PageRank end for 10: Counting supporters V ← Aux 11: Experiments 12: end for Conclusions 13: for node: 1 . . . N do {Estimate supporters} Supporters[node] ← ESTIMATE( V[node,·] ) 14: 15: end for 16: return Supporters
  • 41. Our estimator Using rank propagation and Probabilistic counting for Link-Based Initialize all bits to one with probability Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 42. Our estimator Using rank propagation and Probabilistic counting for Link-Based Initialize all bits to one with probability Spam Detection by the independence of the i − th component Xi ’s we have, L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates P[Xi = 1] = 1 − (1 − )neighbors(node) , Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 43. Our estimator Using rank propagation and Probabilistic counting for Link-Based Initialize all bits to one with probability Spam Detection by the independence of the i − th component Xi ’s we have, L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates P[Xi = 1] = 1 − (1 − )neighbors(node) , Motivation Spam pages ones(node) 1− Estimator: neighbors(node) = log(1− characterization ) k Truncated PageRank Counting supporters Experiments Conclusions
  • 44. Our estimator Using rank propagation and Probabilistic counting for Link-Based Initialize all bits to one with probability Spam Detection by the independence of the i − th component Xi ’s we have, L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates P[Xi = 1] = 1 − (1 − )neighbors(node) , Motivation Spam pages Estimator: neighbors(node) = log(1− ) 1 − ones(node) characterization k Truncated Problem: neighbors(node) can vary by orders of magnitudes PageRank as node varies. Counting supporters Experiments Conclusions
  • 45. Our estimator Using rank propagation and Probabilistic counting for Link-Based Initialize all bits to one with probability Spam Detection by the independence of the i − th component Xi ’s we have, L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates P[Xi = 1] = 1 − (1 − )neighbors(node) , Motivation Spam pages Estimator: neighbors(node) = log(1− ) 1 − ones(node) characterization k Truncated Problem: neighbors(node) can vary by orders of magnitudes PageRank as node varies. Counting supporters This means that for some values of , the computed value of Experiments ones(node) might be k (or 0, depending on neighbors(node)) Conclusions with relatively high probability.
  • 46. Adaptive estimator Using rank propagation and Probabilistic counting for Link-Based 1 Spam Detection if we knew neighbors(node) and chose = we neighbors(node) L. Becchetti, would get: C. Castillo, D. Donato, S. Leonardi and 1 R. Baeza-Yates 1− ones(node) k 0.63k, e Motivation Spam pages characterization Adaptive estimation Truncated PageRank Repeat the above process for = 1/2, 1/4, 1/8, . . . , and look Counting for the transitions from more than (1 − 1/e)k ones to less supporters than (1 − 1/e)k ones. Experiments Conclusions
  • 47. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions
  • 48. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions 15 iterations for estimating the neighbors at distance 4 or less
  • 49. Convergence Using rank 100% propagation and Probabilistic 90% counting for Link-Based Spam Detection 80% Fraction of nodes L. Becchetti, 70% with estimates C. Castillo, D. Donato, 60% S. Leonardi and R. Baeza-Yates 50% d=1 d=2 Motivation 40% d=3 Spam pages 30% d=4 characterization d=5 Truncated 20% d=6 PageRank d=7 10% Counting d=8 supporters 0% 5 10 15 20 Experiments Iteration Conclusions 15 iterations for estimating the neighbors at distance 4 or less less than 25 iterations for all distances up to 8.
  • 50. Error rate Using rank propagation and Probabilistic counting for Ours 64 bits, epsilon−only estimator Link-Based Spam Detection Ours 64 bits, combined estimator 0.5 ANF 24 bits × 24 iterations (576 b×i) Average Relative Error L. Becchetti, C. Castillo, ANF 24 bits × 48 iterations (1152 b×i) D. Donato, S. Leonardi and 0.4 R. Baeza-Yates 960 b×i 1216 b×i 512 b×i 832 b×i 1344 b×i 1408 b×i Motivation 768 b×i 1152 b×i 0.3 Spam pages characterization 0.2 576 b×i Truncated 1152 b×i PageRank 512 b×i 768 b×i 960 b×i 1216 b×i 1344 b×i 1408 b×i 832 b×i 1152 b×i Counting 0.1 supporters Experiments 0 Conclusions 1 2 3 4 5 6 7 8 Distance
  • 51. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 52. Test collection Using rank propagation and Probabilistic counting for Link-Based Spam Detection U.K. collection L. Becchetti, C. Castillo, 18.5 million pages downloaded from the .UK domain D. Donato, S. Leonardi and R. Baeza-Yates 5,344 hosts manually classified (6% of the hosts) Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 53. Test collection Using rank propagation and Probabilistic counting for Link-Based Spam Detection U.K. collection L. Becchetti, C. Castillo, 18.5 million pages downloaded from the .UK domain D. Donato, S. Leonardi and R. Baeza-Yates 5,344 hosts manually classified (6% of the hosts) Motivation Spam pages characterization Truncated Classified entire hosts: PageRank Counting V A few hosts are mixed: spam and non-spam pages supporters X More coverage: sample covers 32% of the pages Experiments Conclusions
  • 54. Automatic classifier We extracted (for the home page and the page with Using rank propagation and maximum PageRank) PageRank, Truncated PageRank at Probabilistic counting for 2 . . . 4, Supporters at 2 . . . 4 Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 55. Automatic classifier We extracted (for the home page and the page with Using rank propagation and maximum PageRank) PageRank, Truncated PageRank at Probabilistic counting for 2 . . . 4, Supporters at 2 . . . 4 Link-Based Spam Detection We measured: L. Becchetti, C. Castillo, D. Donato, # of spam hosts classified as spam S. Leonardi and Precision = R. Baeza-Yates # of hosts classified as spam Motivation # of spam hosts classified as spam Recall = . Spam pages # of spam hosts characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 56. Automatic classifier We extracted (for the home page and the page with Using rank propagation and maximum PageRank) PageRank, Truncated PageRank at Probabilistic counting for 2 . . . 4, Supporters at 2 . . . 4 Link-Based Spam Detection We measured: L. Becchetti, C. Castillo, D. Donato, # of spam hosts classified as spam S. Leonardi and Precision = R. Baeza-Yates # of hosts classified as spam Motivation # of spam hosts classified as spam Recall = . Spam pages # of spam hosts characterization Truncated PageRank and the two types of errors in spam classification Counting supporters Experiments # of normal hosts classified as spam Conclusions False positive rate = # of normal hosts # of spam hosts classified as normal False negative rate = . # of spam hosts
  • 57. Single-technique classifier Using rank propagation and Probabilistic counting for Link-Based Classifier based on TrustRank: uses as features the PageRank, Spam Detection the estimated non-spam mass, and the L. Becchetti, C. Castillo, estimated non-spam mass divided by PageRank. D. Donato, S. Leonardi and Classifier based on Truncated PageRank: uses as features the R. Baeza-Yates PageRank, the Truncated PageRank with Motivation truncation distance t = 2, 3, 4 (with t = 1 it Spam pages characterization would be just based on in-degree), and the Truncated Truncated PageRank divided by PageRank. PageRank Classifier based on Estimation of Supporters: uses as features Counting supporters the PageRank, the estimation of supporters at a Experiments given distance d = 2, 3, 4, and the estimation of Conclusions supporters divided by PageRank.
  • 58. Comparison of single-technique classifier (M=5) Using rank propagation and Probabilistic counting for Link-Based Spam Detection Classifiers Spam class False False L. Becchetti, C. Castillo, (pruning with M = 5) Prec. Recall Pos. Neg. D. Donato, S. Leonardi and TrustRank 0.82 0.50 2.1% 50% R. Baeza-Yates Trunc. PageRank t = 2 0.85 0.50 1.6% 50% Motivation Trunc. PageRank t = 3 0.84 0.47 1.6% 53% Spam pages characterization Trunc. PageRank t = 4 0.79 0.45 2.2% 55% Truncated Est. Supporters d = 2 0.78 0.60 3.2% 40% PageRank Est. Supporters d = 3 0.83 0.64 2.4% 36% Counting supporters Est. Supporters d = 4 0.86 0.64 2.0% 36% Experiments Conclusions
  • 59. Comparison of single-technique classifier (M=30) Using rank propagation and Probabilistic counting for Link-Based Spam Detection Classifiers Spam class False False L. Becchetti, C. Castillo, (pruning with M = 30) Prec. Recall Pos. Neg. D. Donato, S. Leonardi and TrustRank 0.80 0.49 2.3% 51% R. Baeza-Yates Trunc. PageRank t = 2 0.82 0.43 1.8% 57% Motivation Trunc. PageRank t = 3 0.81 0.42 1.8% 58% Spam pages characterization Trunc. PageRank t = 4 0.77 0.43 2.4% 57% Truncated Est. Supporters d = 2 0.76 0.52 3.1% 48% PageRank Est. Supporters d = 3 0.83 0.57 2.1% 43% Counting supporters Est. Supporters d = 4 0.80 0.57 2.6% 43% Experiments Conclusions
  • 60. Combined classifier Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Spam class False False S. Leonardi and R. Baeza-Yates Pruning Rules Precision Recall Pos. Neg. Motivation M=5 49 0.87 0.80 2.0% 20% Spam pages M=30 31 0.88 0.76 1.8% 24% characterization No pruning 189 0.85 0.79 2.6% 21% Truncated PageRank Counting supporters Experiments Conclusions
  • 61. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, Motivation 1 S. Leonardi and Spam pages characterization 2 R. Baeza-Yates Truncated PageRank 3 Motivation Counting supporters 4 Spam pages Experiments 5 characterization Conclusions 6 Truncated PageRank Counting supporters Experiments Conclusions
  • 62. Summary of classifiers Using rank (a) Precision and recall of spam detection propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank (b) Error rates of the spam classifiers Counting supporters Experiments Conclusions
  • 63. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 64. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 65. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis X Precision still low compared to e-mail spam filters Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 66. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis X Precision still low compared to e-mail spam filters Motivation V Spam pages Measure both home page and max. PageRank page characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 67. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis X Precision still low compared to e-mail spam filters Motivation V Spam pages Measure both home page and max. PageRank page characterization V Truncated Host-based counts are important PageRank Counting supporters Experiments Conclusions
  • 68. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis X Precision still low compared to e-mail spam filters Motivation V Spam pages Measure both home page and max. PageRank page characterization V Truncated Host-based counts are important PageRank Counting supporters Experiments Conclusions
  • 69. Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, V Link-based statistics to detect 80% of spam D. Donato, S. Leonardi and X R. Baeza-Yates No magic bullet in link analysis X Precision still low compared to e-mail spam filters Motivation V Spam pages Measure both home page and max. PageRank page characterization V Truncated Host-based counts are important PageRank Counting Next step: combine link analysis and content analysis supporters Experiments Conclusions
  • 70. Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Thank you! Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions
  • 71. Using rank Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). propagation and Probabilistic Generalizing PageRank: Damping functions for link-based counting for Link-Based ranking algorithms. Spam Detection In Proceedings of SIGIR, Seattle, Washington, USA. ACM L. Becchetti, C. Castillo, Press. D. Donato, S. Leonardi and R. Baeza-Yates Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Motivation Using rank propagation and probabilistic counting for Spam pages characterization link-based spam detection. Truncated In Proceedings of the Workshop on Web Mining and Web PageRank Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press. Counting supporters Experiments Bencz´r, A. A., Csalog´ny, K., Sarl´s, T., and Uher, M. u a o Conclusions (2005). Spamrank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.
  • 72. Using rank propagation and Probabilistic counting for Fetterly, D., Manasse, M., and Najork, M. (2004). Link-Based Spam Detection Spam, damn spam, and statistics: Using statistical analysis to L. Becchetti, locate spam web pages. C. Castillo, D. Donato, In Proceedings of the seventh workshop on the Web and S. Leonardi and R. Baeza-Yates databases (WebDB), pages 1–6, Paris, France. Motivation Flajolet, P. and Martin, N. G. (1985). Spam pages characterization Probabilistic counting algorithms for data base applications. Truncated Journal of Computer and System Sciences, 31(2):182–209. PageRank Counting Gibson, D., Kumar, R., and Tomkins, A. (2005). supporters Experiments Discovering large dense subgraphs in massive graphs. Conclusions In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 721–732. VLDB Endowment.
  • 73. Using rank propagation and Probabilistic Gy¨ngyi, Z. and Garcia-Molina, H. (2005). o counting for Link-Based Web spam taxonomy. Spam Detection L. Becchetti, In First International Workshop on Adversarial Information C. Castillo, Retrieval on the Web. D. Donato, S. Leonardi and R. Baeza-Yates Gy¨ngyi, Z., Molina, H. G., and Pedersen, J. (2004). o Motivation Combating web spam with trustrank. Spam pages In Proceedings of the Thirtieth International Conference on characterization Very Large Data Bases (VLDB), pages 576–587, Toronto, Truncated PageRank Canada. Morgan Kaufmann. Counting supporters Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Experiments Random graphs with arbitrary degree distributions and their Conclusions applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
  • 74. Using rank propagation and Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Probabilistic counting for Link-Based Detecting spam web pages through content analysis. Spam Detection In Proceedings of the World Wide Web conference, pages L. Becchetti, C. Castillo, 83–92, Edinburgh, Scotland. D. Donato, S. Leonardi and R. Baeza-Yates Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive Motivation graphs. Spam pages characterization In Proceedings of the eighth ACM SIGKDD international Truncated conference on Knowledge discovery and data mining, pages PageRank 81–90, New York, NY, USA. ACM Press. Counting supporters Perkins, A. (2001). Experiments Conclusions The classification of search engine spam. Available online at http://www.silverdisc.co.uk/articles/spam-classification/.