Outskewer: Using Skewness to Spot Outliers in Samples and Time Series

CNRS - UPMC, Laboratoire d’Informatique de Paris 6

Sébastien Heymann, Matthieu Latapy, Clémence Magnien

ASONAM 2012

Did you know?

Outlier detection is an important problem in data mining:

source: https://xkcd.com/539/


How to detect outliers?

• No formal definition: it is a subjective concept.
• Depends on the case and on the hypotheses made about the data.
• Intuitively: identify values which deviate remarkably from
the remainder of the values (Grubbs, 1969).



Usual approaches in literature

• Hypothesis: data ∼ normal distribution.
• Distance between data points and theoretical values.



Problem statement

Most of the time, we cannot make strong assumptions about:
• the theoretical distribution of values.
• how the data should evolve over time (time series).

We therefore want a method that makes no hypothesis about the data.



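Skewness coefficient

$$\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^3$$

[Figure: two density curves over x, one with γ < 0 and one with γ > 0.]
Example of skewed distributions.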

The coefficient is sensitive to extremal values (min/max) far from the mean!
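
As an illustration, here is a minimal Python sketch of the skewness coefficient. It assumes that "standard deviation" means the sample standard deviation (n − 1 denominator), an assumption that reproduces the value 1.09 given for the sample X = {-3, ..., 7} in these slides. The function name `skewness` is illustrative and is not taken from the authors' R code.

```python
import math
import statistics


def skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / sd) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 values are needed")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("skewness is undefined for a constant sample")
    # Sum the cubed deviations first, then divide once by sd ** 3.
    third = math.fsum((x - mean) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * third / sd ** 3


sample = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
with_extreme = skewness(sample)          # 7 lies far from the mean
without_extreme = skewness(sample[:-1])  # same sample with 7 dropped
print(round(with_extreme, 2), round(without_extreme, 2))  # 1.09 0.22
```

Dropping the single value 7 moves the coefficient from about 1.09 to about 0.22, which is the sensitivity to extremal values that the method relies on.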



Skewness signature

Definition
Evolution of the skewness coefficient γ when extremal values are
removed one by one from the sample.

Algorithm
If γ > 0 then remove max(X),
else remove min(X).

Example
X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}
γ: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73
[Figure: skewness against the number of extremal values removed.]
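
The removal rule above can be sketched in Python as follows. This is an illustrative sketch, not the authors' R implementation: the names `skewness` and `skewness_signature` are hypothetical, the sample standard deviation (n − 1 denominator) is assumed, and the loop stops once fewer than 3 values remain or all remaining values are equal, since the coefficient is undefined there.

```python
import math
import statistics


def skewness(values):
    """Sample skewness coefficient (assumes the n - 1 standard deviation)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    third = math.fsum((x - mean) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * third / sd ** 3


def skewness_signature(values):
    """Skewness after removing 0, 1, 2, ... extremal values."""
    remaining = sorted(values)
    signature = []
    # Stop when the coefficient is undefined (n < 3 or constant sample).
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()    # remove max(X)
        else:
            remaining.pop(0)   # remove min(X)
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
signature = skewness_signature(X)
print([round(g, 2) for g in signature])
# [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

On the example sample this reproduces the seven values of γ listed on the slide.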



Our method: Outskewer

Our definition
Outlier = extremal value which skews a distribution of values.

Implication
Removing these extremal values one by one should reduce
the skewness of the distribution.
Otherwise, there is no outlier as we define it.



Outskewer: non-relevant cases

The method is not relevant where extremal values far from the mean are common,
e.g. power-law distributions.



Outskewer: p-stability
Is the signature p-stable?
p: fraction of extremal values removed.
p-stable ⇔ |γ(p′)| ≤ 0.5 − p for each p′ from p to 0.5

[Figure: cumulative distribution of x (left) and |γ| against p (right), with t and T marked.]
Example: 0.16-stable but not 0.30-stable



If yes: there may be outliers.
If no for every p: the skewness coefficient is always too large, so the
sample contains no outlier as we define it.
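
The p-stability test can be sketched in Python as below. The sketch assumes the condition reads |γ(p′)| ≤ 0.5 − p for every fraction p′ between p and 0.5, and that fractions are taken on the grid p = k / n, where k is the number of extremal values removed from a sample of n values. The function name `is_p_stable` and the second signature are hypothetical; only the first signature comes from the slides.

```python
def is_p_stable(signature, n, k):
    """Check p-stability for p = k / n.

    signature[i] is the skewness after removing i extremal values
    from a sample of n values.
    """
    last = min(n // 2, len(signature) - 1)  # largest i with i / n <= 0.5
    if not 0 <= k <= last:
        raise ValueError("k / n must lie between 0 and 0.5")
    threshold = 0.5 - k / n
    # Every |gamma| from p = k / n up to 0.5 must stay under the threshold.
    return all(abs(signature[i]) <= threshold for i in range(k, last + 1))


# Signature of X = {-3, -2, -1, -1, 0, 1, 2, 3, 7} from the slides (n = 9).
slide_signature = [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
print(is_p_stable(slide_signature, 9, 2))   # False: 0.4 exceeds 0.5 - 2/9

# Hypothetical signature for n = 20, made up for illustration only.
toy_signature = [1.2, 0.9, 0.5, 0.3, 0.2, 0.15, 0.1, 0.25, 0.05, 0.05, 0.02]
print(is_p_stable(toy_signature, 20, 3))    # True: all values <= 0.35
print(is_p_stable(toy_signature, 20, 6))    # False: 0.25 exceeds 0.2
```

The toy signature is stable for a small p but not for a larger one, because the threshold 0.5 − p tightens as p grows.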



Outskewer: outlier detection

[Figure: cumulative frequency of x with each value marked not outlier, potential outlier, or outlier (left); |γ| against p, divided into the area of outliers, the area of potential outliers, and the area with no outlier, with t, T, t′ and T′ marked (right).]

t: smallest t-stable value; t′: smallest value so that |γ| ≤ 0.5 − t
T: largest T-stable value; T′: smallest value so that |γ| ≤ 0.5 − T
Example: 50 values, including 7 outliers and 5 potential outliers


Outskewer: outcome

Each value of the sample is classified as follows:
• not outlier
• potential outlier
• outlier
• or unknown, when the method is not applicable (skewness
signature never p-stable).


Extension to time series

On a sliding window of size w, each value of X is classified w
times.
The final class of a value is the one that appears most often.
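
The voting scheme can be sketched in Python as follows. The per-window classifier used here (`toy_classifier`) is a deliberately simple placeholder, not Outskewer itself, and the names `classify_series` and `toy_classifier` are hypothetical. Values near the ends of the series fall in fewer than w windows, and ties are broken by the first label seen; both are choices of this sketch.

```python
from collections import Counter
from statistics import median


def toy_classifier(window):
    """Placeholder, NOT Outskewer: flag values far from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 3 else "not outlier" for x in window]


def classify_series(series, w, classify_window=toy_classifier):
    """Classify each value in every window of size w, then take the majority."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        labels = classify_window(series[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # A value covered by no window gets the class "unknown".
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


series = [1, 2, 1, 2, 9, 1, 2, 1, 2]
final_classes = classify_series(series, w=5)
print(final_classes[4])  # outlier
```

In this example the value 9 belongs to five windows and is flagged in all of them, so its final class is "outlier"; every other value ends up as "not outlier".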


Experimental Validation
False positive rate.
Regime change.


False positive rate

• Normal distribution: 3% for n = 10, 0.01% for n = 100

• Pareto distribution: 5% for n = 100, 0.01% for n = 1000



Regime change

Video

[Figure: six frames of a time series x against t (0 to 200); values are marked not outlier, potential outlier, outlier, or unknown.]

Experimental Results
French population during the 20th century.
Logs of a P2P search engine.


French population
during the 20th century
[Figure: Number of inhabitants per year. Axes: Year (1900 to 2000), population (40M to 60M).]

[Figure: Difference over years. Axes: Year (1900 to 2000), ∆population (−1500000 to 1000000); status: not outlier, potential outlier, outlier.]


Harry Potter on eDonkey
[Figure: Number of outliers per day, 15 Jul to 1 Dec, showing outliers and potential outliers, with annotations: in theatre, unknown event, pirate release.]

Data:
• search logs of the eDonkey P2P network.
• # queries containing “half blood prince” per hour, computed
every 10 minutes.
• over 28 weeks.
• more than 205 million queries.
• from 24.4 million IP addresses.



Contributions

Our method:
• is non-parametric, except for the size of the time window.
• classifies values only when the statistical conditions are met.
• generalizes naturally to on-line analysis.



Conclusion

• Motivation: outlier detection with no hypothesis on data.
• Method based on the skewness of distributions.
• Excellent experimental results.
• Relevant on various data sets.
• Open-source code in R at
http://outskewer.sebastien.pro


Questions?
Outskewer: Using Skewness to Spot Outliers
in Samples and Time Series
<sebastien.heymann@lip6.fr>


Homogeneous / heterogeneous data
Outlier = unexpected extremal value?

Extremal values far from the mean?
• heterogeneous (Pareto, Zipf...): common
• homogeneous (normal, Laplace...): uncommon

[Figure: density on a logarithmic scale (10^0 down to 10^-20) for x from −10 to 10.]
Probability density function of normal and Pareto laws.


Skewness signature
[Figure: Normal. s(p) against p (0.0 to 1.0); legend: median, min, max, q1, q3.]

[Figure: Pareto. s(p) against p (0.0 to 1.0); legend: median, min, max, q1, q3.]


Local view of the internet topology
[Figure: Nb nodes (11000 to 13000) against Nb rounds (0 to 5000); values marked outlier, potential outlier, not outlier, or unknown.]

M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in Complex Systems, 20 (1), 23-30, 2011.
