Outskewer: Using Skewness to Spot Outliers in Samples and Time Series
 

Usage Rights: CC Attribution-NoDerivs License


    Presentation Transcript

    • CNRS - UPMC, Laboratoire d'Informatique de Paris 6. Outskewer: Using Skewness to Spot Outliers in Samples and Time Series. Sébastien Heymann, Matthieu Latapy, Clémence Magnien. ASONAM 2012.
    • Did you know? Outlier detection is an important problem in data mining. Source: https://xkcd.com/539/
    • How to detect outliers?
      • There is no formal definition; it is a subjective concept.
      • It depends on the case and on the hypotheses made about the data.
      • Intuitively: identify values which deviate remarkably from the remainder of the values (Grubbs, 1969).
    • Usual approaches in the literature
      • Hypothesis: the data follow a normal distribution (data ∼ normal distribution).
      • Distance between data points and theoretical values.
    • Problem statement
      Most of the time, we cannot make strong assumptions about:
      • the theoretical distribution of values;
      • how the data should evolve over time (time series).
      Thus we want a method which makes no hypothesis about the data.
    • Our Method
    • Skewness coefficient
      $\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^3$
      [Figure: two density curves illustrating skewed distributions, one with γ < 0 and one with γ > 0.]
      The coefficient is sensitive to extremal values (min/max) far from the mean!
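The coefficient can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function name `sample_skewness` is hypothetical, and it assumes that "standard deviation" means the sample standard deviation (divisor n − 1). The sample is the one used in the skewness signature example of the slides.

```python
import math


def sample_skewness(values):
    """Skewness coefficient: n / ((n-1)(n-2)) * sum(((x - mean) / sd)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("the coefficient needs at least 3 values")
    mean = math.fsum(values) / n
    deviations = [x - mean for x in values]
    # Assumption: sample standard deviation (divisor n - 1).
    sd = math.sqrt(math.fsum(d * d for d in deviations) / (n - 1))
    if sd == 0:
        raise ValueError("all values are equal; skewness is undefined")
    # Sum of cubed deviations, divided by sd^3 afterwards.
    third = math.fsum(d * d * d for d in deviations)
    return n / ((n - 1) * (n - 2)) * third / sd ** 3


sample = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print(round(sample_skewness(sample), 2))       # 1.09 with the extreme value 7
print(round(sample_skewness(sample[:-1]), 2))  # 0.22 once 7 is removed
```

Removing the single value 7 cuts the coefficient from 1.09 to 0.22, which illustrates the sensitivity to extremal values.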
    • Skewness signature
      • Definition: evolution of the skewness coefficient γ when extremal values are removed one by one from the sample.
      • Algorithm: if γ > 0 then remove max(X), else remove min(X).
      • Example: X = {-3, -2, -1, -1, 0, 1, 2, 3, 7} gives γ = 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73 as extremal values are removed one by one.
      [Figure: skewness plotted against the number of extremal values removed.]
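The removal rule can be sketched as follows. The names `sample_skewness` and `skewness_signature` are hypothetical, the sample standard deviation (divisor n − 1) is an assumption, and stopping once fewer than 3 values remain (or all remaining values are equal) is also an assumption, since the coefficient is undefined there. This is an illustration, not the authors' R implementation.

```python
import math


def sample_skewness(values):
    """Skewness coefficient with the sample standard deviation (assumption)."""
    n = len(values)
    mean = math.fsum(values) / n
    deviations = [x - mean for x in values]
    sd = math.sqrt(math.fsum(d * d for d in deviations) / (n - 1))
    third = math.fsum(d * d * d for d in deviations)
    return n / ((n - 1) * (n - 2)) * third / sd ** 3


def skewness_signature(values):
    """Skewness after removing 0, 1, 2, ... extremal values."""
    remaining = sorted(values)
    signature = []
    # Assumption: stop when the coefficient is no longer defined.
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = sample_skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()    # remove max(X)
        else:
            remaining.pop(0)   # remove min(X), also when gamma == 0
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print([round(g, 2) for g in skewness_signature(X)])
# [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

The printed values reproduce the example signature from the slide.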
    • Our method: Outskewer
      • Our definition: an outlier is an extremal value which skews a distribution of values.
      • Implication: removing these extremal values one by one should reduce the skewness of the distribution.
      • Otherwise, there is no outlier as we define it.
    • Outskewer: non-relevant cases. The method is not relevant where extremal values far from the mean are common, e.g. power-law distributions.
    • Outskewer: p-stability
      Is the signature p-stable? Here p is the fraction of extremal values removed.
      p-stable ⇔ |γ(p′)| ≤ 0.5 − p for each p′ from p to 0.5, where γ(p′) is the skewness after removing a fraction p′ of the values.
      [Figure: cumulative distribution of a sample x (left) and |skewness| against p (right). Example: 0.16-stable but not 0.30-stable.]
      • If yes: there may be outliers.
      • If no for all p: the skewness coefficient is always too large, thus no outlier as we define it can lie in the sample.
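A p-stability check can be sketched as below, under one reading of the condition: the bound 0.5 − p must hold for the skewness at every removed fraction k/n lying between p and 0.5. The function name `is_p_stable`, the restriction to the grid of fractions k/n, and the treatment of an empty range as "not stable" are assumptions made for this illustration. The signature values are the rounded example values from the slides (n = 9).

```python
def is_p_stable(signature, n, p):
    """Check |gamma| <= 0.5 - p at every fraction k / n with p <= k / n <= 0.5.

    signature[k] is the skewness after removing k extremal values from a
    sample of n values.
    """
    checked = [
        abs(gamma)
        for k, gamma in enumerate(signature)
        if p <= k / n <= 0.5
    ]
    # Assumption: with no fraction k / n in [p, 0.5], report "not stable".
    return bool(checked) and all(g <= 0.5 - p for g in checked)


signature = [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]  # slide example, n = 9
print(is_p_stable(signature, 9, 0.05))  # True: 0.22, 0.17, 0, 0.4 are all <= 0.45
print(is_p_stable(signature, 9, 0.20))  # False: 0.4 exceeds 0.5 - 0.20 = 0.30
```

The bound tightens as p grows, so a signature can be stable for a small p and not for a larger one.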
    • Outskewer: outlier detection
      • t: smallest value such that the signature is t-stable; t′: smallest value such that |γ| ≤ 0.5 − t.
      • T: largest value such that the signature is T-stable; T′: smallest value such that |γ| ≤ 0.5 − T.
      [Figure: cumulative frequency of a sample x with values marked as not outlier, potential outlier, or outlier (left), and |skewness| against p with the area of outliers, the area of potential outliers, and the area with no outlier (right). Example: 50 values, including 7 outliers and 5 potential outliers.]
    • Outskewer: outcome. Each value of the sample is classified as not outlier, potential outlier, or outlier, or as unknown when the method is not applicable (skewness signature never p-stable).
    • Extension to time series. On a sliding window of size w, each value of X is classified w times. The final class of a value is the one that appears the most.
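The sliding-window vote can be sketched independently of the classifier. In this illustration the names `classify_time_series` and `toy_classifier` are hypothetical, and the toy classifier is only a stand-in (it is not Outskewer). How ties are broken is not stated in the slides; here `Counter.most_common` keeps the label encountered first.

```python
from collections import Counter


def classify_time_series(series, w, classify_window):
    """Majority vote over all sliding windows of size w containing each value."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        labels = classify_window(series[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Values near both ends of the series fall in fewer than w windows.
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


def toy_classifier(window):
    # Stand-in only, NOT Outskewer: flags values far from the window median.
    ordered = sorted(window)
    median = ordered[len(ordered) // 2]
    return ["outlier" if abs(x - median) > 5 else "not outlier" for x in window]


series = [0, 1, 0, 10, 1, 0, 1]
print(classify_time_series(series, 3, toy_classifier))
```

Passing the classifier as a function keeps the voting logic separate, so any per-window method returning one label per value could be plugged in.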
    • Experimental Validation: false positive rate; regime change.
    • False positive rate
      • Normal distribution: 3% for n = 10, 0.01% for n = 100.
      • Pareto distribution: 5% for n = 100, 0.01% for n = 1000.
    • Regime change
      [Video: six frames of a time series x against t from 0 to 200, with values marked as not outlier, potential outlier, outlier, or unknown.]
    • Experimental Results: French population during the 20th century; logs of a P2P search engine.
    • French population during the 20th century
      [Figure, top panel: number of inhabitants per year, from about 40M to 60M, for the years 1900 to 2000.]
      [Figure, bottom panel: difference over years (Δpopulation), with each value marked as not outlier, potential outlier, or outlier.]
    • Harry Potter on eDonkey
      [Figure: number of outliers and potential outliers per day, from 15 Jul to 1 Dec, with annotations "in theatre", "unknown event", and "pirate release".]
      Data:
      • search logs on the P2P network eDonkey;
      • number of queries containing “half blood prince” per hour, computed every 10 minutes;
      • over 28 weeks;
      • over 205 million queries;
      • from 24.4 million IP addresses.
    • Contributions
      Our method:
      • is non-parametric, except for the size of the time window;
      • classifies values only when the statistical conditions are met;
      • generalizes naturally to on-line analysis.
    • Conclusion
      • Motivation: outlier detection with no hypothesis on the data.
      • Method based on the skewness of distributions.
      • Excellent experimental results.
      • Relevant on various data sets.
      • Open-source code in R at http://outskewer.sebastien.pro
    • Questions? Outskewer: Using Skewness to Spot Outliers in Samples and Time Series. Contact: <sebastien.heymann@lip6.fr>
    • Homogeneous / heterogeneous data
      Outlier = unexpected extremal value?
      Are extremal values far from the mean common?
      • Heterogeneous data (Pareto, Zipf...): common.
      • Homogeneous data (normal, Laplace...): uncommon.
      [Figure: probability density functions of the normal and Pareto laws, with density on a log scale from 10^0 down to 10^−20, for x from −10 to 10.]
    • Skewness signature
      [Figure: two panels, Normal and Pareto, showing s(p) against p from 0 to 1, with curves for the median, min, max, q1, and q3.]
    • Local view of the internet topology
      [Figure: number of nodes (about 11000 to 13000) against number of rounds (0 to 5000), with values marked as outlier, potential outlier, not outlier, or unknown.]
      M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, Complex Systems, 20 (1), 23-30, 2011.