5. Before we begin…
Slides?
You can download them at
https://bit.ly/introtoadvml-week1-slides
Questions?
Post your questions in the Q&A box; one of
the panelists will answer!
Issues?
Chat directly with the panelists if you are
facing any issues!
6. If you torture the data enough, it will
confess to anything.
Ronald Coase
7. Agenda
1. What are Outliers?
2. Type of Outliers
3. Causes of outliers
4. Impact of Outliers on Data
5. Detecting and Fixing Outliers
6. Other Ways of Handling Outliers
7. References & Assignment
8. Implementation on Google Colab
9. What are Outliers?
Outliers are extreme values that
fall a long way outside of the other
observations. For example, in a
Gaussian distribution, outliers may
be values on the tails of the
distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 1 Gaussian Distribution showing position of outliers
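As an illustrative sketch (not from the slides), the tail values of a Gaussian can be flagged with a simple z-score rule; the 3-standard-deviation threshold and the synthetic data are assumed conventions:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 standard-normal samples plus two injected extreme values
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -7.5]])

# Flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # includes the two injected tail values
```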
10. What are Outliers?
We will generally define outliers
as samples that are exceptionally
far from the mainstream of the
data.
-Page 33, Applied Predictive
Modeling, 2013.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf
Fig 2a Data points and Outliers
Fig 2b Data points and Regression Model
12. Genesis of Outliers
Most common causes of outliers in a data set
✓ Errors in Data entry (human errors)
✓ Errors during Measurement (instrument errors)
✓ Experimental errors (data extraction or experiment planning/executing errors)
✓ Dummy outliers created to test detection methods
✓ Data processing errors (data manipulation or data set unintended mutations)
✓ Sampling errors (extracting or mixing data from wrong or various sources)
✓ Natural (not an error, novelties in data)
Those causes that are not a product of an error are called novelties.
14. Types of Outliers
1. Univariate Outliers
2. Multivariate Outliers
3. Point or Global Outliers
4. Collective Outliers
5. Contextual Outliers
6. Other Outliers e.g. Outliers in Time Series Data:
a. Additive Outliers
b. Innovative Outliers
18. A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense.
https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435
Fig 7 Collective Outlier
19. Outlier Types
A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context.
https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4
Fig 8 Contextual or Conditional Outlier
22. Play Ground
Which of the following is NOT a type of outlier?
1. Multivariate outlier
2. Global outlier
3. Angular outlier
4. Contextual outlier
NB: Drop your choice of answer in the chat section
23. Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
● Outliers can drastically change the results of data analysis and statistical modeling. Among their negative impacts on a data set:
● They increase the error variance and reduce the power of statistical tests.
● If the outliers are non-randomly distributed, they can decrease normality.
● They can bias or influence estimates that may be of substantive interest.
● They can also violate the basic assumptions of regression and other statistical models.
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 11 Calculations showing the impact of outliers
24. Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html
Fig 12 Impact of outliers on Model Performance
25. Play Ground
Some data scientists argue that outliers can be used in a positive way to improve model performance, while others disagree.
What is your take on the argument?
1. Agree
2. Disagree
NB: Drop your answer in the chat section
26. Tea Break
Exercise: You have a dataset and you are building a model to classify it. Your classification model has three candidate lines to accomplish this. Choose between 1, 2 and 3: which line of best fit best classifies your data?
NB: Take note of the outlier in the dataset. Drop your selected number in the chat section.
https://classroom.udacity.com/courses/ud120
Fig 13 Choosing the right model
28. Detecting Outliers in Data
There are several ways of detecting outliers in a dataset; here we will discuss two of the most frequently used and effective methods.
The most commonly used approach is visualization, using methods such as box plots, histograms and scatter plots (above, we have used a box plot and a scatter plot). Beyond visualization, detection can also build on ML-based techniques:
→ Multivariate detection
→ Dimensionality Reduction
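As a minimal sketch of the box-plot approach mentioned above (the 1.5 × IQR whisker rule is the usual convention, assumed here; the sample data is illustrative):

```python
import numpy as np

def iqr_outliers(x):
    """Return values outside the box-plot whiskers (1.5 * IQR rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lower) | (x > upper)]

data = np.array([10, 12, 11, 13, 12, 11, 10, 95.0])
print(iqr_outliers(data))  # flags only the extreme value 95.0
```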
29. Detecting Outliers in Data
→ Multivariate outliers can be detected using the Mahalanobis distance: the distance of a data point from the centroid of the other cases, where the centroid is calculated as the intersection of the means of the variables being assessed.
↳ Mahalanobis Distance
↳ Cook's Distance
→ Dimensionality Reduction
(PCA, LDA)
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 14 Detecting outliers using Mahalanobis Distance
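A possible NumPy sketch of the Mahalanobis-distance idea described above; the synthetic correlated data and the injected point are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 correlated bivariate-normal samples, plus one point that
# is far from the data's correlation structure
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [[4.0, -4.0]]])

centroid = X.mean(axis=0)                      # centroid = per-variable means
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - centroid
# Mahalanobis distance of every point from the centroid
d_mahal = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

print(X[np.argmax(d_mahal)])  # the injected point has the largest distance
```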
30. Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273
Fig 15 Detecting outliers using Cook's Distance
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
A general rule of thumb is that observations with a Cook's D of more than 3 times the mean, μ, are possible outliers.
→ Dimensionality Reduction
(PCA, LDA)
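The rule of thumb above can be sketched with plain NumPy; the synthetic regression data and the injected outlier are illustrative assumptions (libraries such as statsmodels also provide Cook's distance directly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 1, 30)
y[15] += 15.0  # inject an outlier in the response

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages
residuals = y - H @ y
p = X.shape[1]
s2 = residuals @ residuals / (len(y) - p)   # residual variance estimate
# Cook's D: influence of each observation on the fitted coefficients
cooks_d = residuals**2 / (p * s2) * h / (1 - h)**2

# Rule of thumb from the slide: flag observations with D > 3 * mean(D)
flagged = np.where(cooks_d > 3 * cooks_d.mean())[0]
print(flagged)  # the injected observation (index 15) is flagged
```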
31. Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, the outlier is mapped into the center of the reduced data set.
↳ Principal Component
Analysis (PCA)
↳ Linear Discriminant
Analysis (LDA)
32. Detecting & Fixing Outliers in Data
https://sebastianraschka.com/faq/docs/lda-vs-pca.html
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original data.
↳ Linear Discriminant Analysis (LDA)
Fig 16 The process of a PCA
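A minimal PCA sketch via the eigendecomposition of the covariance matrix, matching the description above; the synthetic correlated data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# Correlated 2-D data: most variance lies along one direction
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=300)

# Center the data, then find directions of maximal variance
# (eigenvectors of the covariance matrix)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Project onto the leading principal component
# (np.linalg.eigh returns eigenvalues in ascending order, so take the last)
pc1 = eigvecs[:, -1]
projected = Xc @ pc1

print(eigvals)  # the leading eigenvalue captures most of the total variance
```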
33. Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
Fig 17 An example of PCA for DR
34. Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
LDA attempts to find a feature subspace that maximizes class separability.
Fig 18 The process of an LDA
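A minimal two-class Fisher LDA sketch of the idea above; the class means and covariance used here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two Gaussian classes that overlap on each axis but are separable jointly
class0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100)
class1 = rng.multivariate_normal([2, 2], [[1, 0.5], [0.5, 1]], size=100)

m0, m1 = class0.mean(axis=0), class1.mean(axis=0)
# Within-class scatter: pooled covariance of the two classes
Sw = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
# Fisher's direction: maximizes between-class over within-class scatter
w = np.linalg.solve(Sw, m1 - m0)

# Project both classes onto w: the 1-D projections separate the classes
p0, p1 = class0 @ w, class1 @ w
print(p0.mean(), p1.mean())
```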
35. Differences between PCA & LDA
Principal Component Analysis (PCA): unsupervised learning; ignores class labels, so it is not effective for exploiting a labelled class in a dataset; used for feature extraction.
Linear Discriminant Analysis (LDA): supervised learning; works best with large labelled datasets; used for data classification.
38. Recap
→ What are Outliers?
→ Types of Outliers
↳ Univariate Outliers, Multivariate Outliers, Point or
Global Outliers, Collective Outliers, Contextual
Outliers
→ Causes of outliers
↳ Errors: Human, Natural, Sampling, Data
processing, etc
→ Impact of Outliers on Data
↳ error variance, decrease normality, bias or
influence estimates, reduce model performance
→ Detecting and Fixing Outliers
↳ Multivariate detection, Dimensionality Reduction
→ Other Ways of Handling Outliers
↳ Missing Values, High Correlations (Spearman’s
Correlation), Low Variance
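One of the "other ways" recapped above, treating outliers as missing values, can be sketched as follows (the IQR rule and median imputation are assumed choices; the data is illustrative):

```python
import numpy as np

data = np.array([10.0, 11, 9, 12, 10, 95, 11, 10])

# Mark IQR-rule outliers as missing (NaN) ...
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
cleaned = data.astype(float)
cleaned[mask] = np.nan

# ... then impute them with the median of the remaining values
cleaned[np.isnan(cleaned)] = np.nanmedian(cleaned)
print(cleaned)  # the extreme value 95 is replaced by the median
```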
39. Organize Your Research
Useful Links
↳ A Brief Overview of Outlier Detection
Techniques (TowardsDataScience)
↳ PCA on Iris Dataset (Github)
↳ LDA on Iris Dataset (Github)
Assignment
↳ Principal Component Analysis
(TowardsDataScience)
41. Homework
1. Letter Recognition Dataset (Multi-dimensional)
2. New York Times Corpus (Time-series for event detection)
3. Yahoo Labs: Server Traffic (Multi-variate time-series)