5. Before we begin…
Slides?
You can download them at
https://bit.ly/introtoadvml-week1-slides
Questions?
Post your questions in the Q&A box; one of
the panelists will answer!
Issues?
Chat directly with the panelists if you are
facing any issues!
6. If you torture the data enough, it will
confess to anything.
Ronald Coase
7. Agenda
1. What are Outliers?
2. Type of Outliers
3. Causes of outliers
4. Impact of Outliers on Data
5. Detecting and Fixing Outliers
6. Other Ways of Handling Outliers
7. References & Assignment
8. Implementation on Google Colab
9. What are Outliers?
Outliers are extreme values that
fall a long way outside of the other
observations. For example, in a
Gaussian distribution, outliers may
be values on the tails of the
distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 1 Gaussian Distribution showing position of outliers
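As an illustrative sketch (not from the slides), the tail values of a Gaussian can be flagged with a simple z-score rule; the 3-standard-deviation threshold and the synthetic data are assumed conventions:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 standard-normal samples plus two injected extreme values
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -7.5]])

# Flag points more than 3 standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # includes the two injected tail values
```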
10. What are Outliers?
We will generally define outliers
as samples that are exceptionally
far from the mainstream of the
data.
-Page 33, Applied Predictive
Modeling, 2013.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf
Fig 2a Data points and Outliers
Fig 2b Data points and Regression Model
12. Genesis of Outliers
Most common causes of outliers in a data set
✓ Errors in Data entry (human errors)
✓ Errors during Measurement (instrument errors)
✓ Experimental errors (data extraction or experiment planning/executing errors)
✓ Dummy outliers created to test detection methods
✓ Data processing errors (data manipulation or data set unintended mutations)
✓ Sampling errors (extracting or mixing data from wrong or various sources)
✓ Natural (not an error, novelties in data)
Those causes that are not a product of an error are called novelties.
14. Types of Outliers
1. Univariate Outliers
2. Multivariate Outliers
3. Point or Global Outliers
4. Collective Outliers
5. Contextual Outliers
6. Other Outliers e.g. Outliers in Time Series Data:
a. Additive Outliers
b. Innovative Outliers
18. A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense.
https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435
Fig 7 Collective Outlier
19. Outlier Types
A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context.
https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4
Fig 8 Contextual or Conditional Outlier
22. Play Ground
Which of the following is NOT a type of outlier?
1. Multivariate outlier
2. Global outlier
3. Angular outlier
4. Contextual outlier
NB: Drop your choice of answer in the chat section
23. Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
● Outliers can drastically change the results of data analysis and statistical modeling. Among their negative impacts on a data set:
● They increase the error variance and reduce the power of statistical tests.
● If the outliers are non-randomly distributed, they can decrease normality.
● They can bias or influence estimates that may be of substantive interest.
● They can also violate the basic assumptions of regression and other statistical models.
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 11 Calculations showing the impact of outliers
24. Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html
Fig 12 Impact of outliers on Model Performance
25. Play Ground
Some data scientists argue that outliers can be used in a positive way to improve model performance, while others disagree.
What is your take on the argument?
1. Agree
2. Disagree
NB: Drop your answer in the chat section
26. Tea Break
Exercise: You have a dataset and you are building a model to classify it. Your classification model has three candidate lines to accomplish this. Choose between 1, 2 and 3: which line of best fit best classifies your data?
NB: Take note of the outlier in the dataset. Drop your selected number in the chat section.
https://classroom.udacity.com/courses/ud120
Fig 13 Choosing the right model
28. Detecting Outliers in Data
There are several ways of detecting outliers in a dataset; here we will discuss two of the most frequently used and effective methods.
The most commonly used approach is visualization, using methods such as box plots, histograms and scatter plots (above, we have used a box plot and a scatter plot). Beyond visualization, detection can also build on ML-based techniques:
→ Multivariate detection
→ Dimensionality Reduction
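As a minimal sketch of the box-plot approach mentioned above (the 1.5 × IQR whisker rule is the usual convention, assumed here; the sample data is illustrative):

```python
import numpy as np

def iqr_outliers(x):
    """Return values outside the box-plot whiskers (1.5 * IQR rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lower) | (x > upper)]

data = np.array([10, 12, 11, 13, 12, 11, 10, 95.0])
print(iqr_outliers(data))  # flags only the extreme value 95.0
```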
29. Detecting Outliers in Data
→ Multivariate outliers can be detected using the Mahalanobis distance: the distance of a data point from the centroid of the other cases, where the centroid is calculated as the intersection of the means of the variables being assessed.
↳ Mahalanobis Distance
↳ Cook's Distance
→ Dimensionality Reduction
(PCA, LDA)
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 14 Detecting outliers using Mahalanobis Distance
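A possible NumPy sketch of the Mahalanobis-distance idea described above; the synthetic correlated data and the injected point are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# 200 correlated bivariate-normal samples, plus one point that
# is far from the data's correlation structure
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [[4.0, -4.0]]])

centroid = X.mean(axis=0)                      # centroid = per-variable means
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - centroid
# Mahalanobis distance of every point from the centroid
d_mahal = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

print(X[np.argmax(d_mahal)])  # the injected point has the largest distance
```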
30. Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273
Fig 15 Detecting outliers using Cook's Distance
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
A general rule of thumb is that observations with a Cook's D of more than 3 times the mean, μ, are possible outliers.
→ Dimensionality Reduction
(PCA, LDA)
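The rule of thumb above can be sketched with plain NumPy; the synthetic regression data and the injected outlier are illustrative assumptions (libraries such as statsmodels also provide Cook's distance directly):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(0, 1, 30)
y[15] += 15.0  # inject an outlier in the response

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverages
residuals = y - H @ y
p = X.shape[1]
s2 = residuals @ residuals / (len(y) - p)   # residual variance estimate
# Cook's D: influence of each observation on the fitted coefficients
cooks_d = residuals**2 / (p * s2) * h / (1 - h)**2

# Rule of thumb from the slide: flag observations with D > 3 * mean(D)
flagged = np.where(cooks_d > 3 * cooks_d.mean())[0]
print(flagged)  # the injected observation (index 15) is flagged
```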
31. Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, the outlier is mapped into the center of the reduced data set.
↳ Principal Component
Analysis (PCA)
↳ Linear Discriminant
Analysis (LDA)
32. Detecting & Fixing Outliers in Data
https://sebastianraschka.com/faq/docs/lda-vs-pca.html
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original data.
↳ Linear Discriminant Analysis (LDA)
Fig 16 The process of a PCA
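A minimal PCA sketch via the eigendecomposition of the covariance matrix, matching the description above; the synthetic correlated data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# Correlated 2-D data: most variance lies along one direction
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 3]], size=300)

# Center the data, then find directions of maximal variance
# (eigenvectors of the covariance matrix)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Project onto the leading principal component
# (np.linalg.eigh returns eigenvalues in ascending order, so take the last)
pc1 = eigvecs[:, -1]
projected = Xc @ pc1

print(eigvals)  # the leading eigenvalue captures most of the total variance
```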
33. Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
Fig 17 An example of PCA for DR
34. Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook's Distance)
→ Dimensionality Reduction (PCA, LDA)
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
LDA attempts to find a feature subspace that maximizes class separability.
Fig 18 The process of an LDA
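A minimal two-class Fisher LDA sketch of the idea above; the class means and covariance used here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two Gaussian classes that overlap on each axis but are separable jointly
class0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=100)
class1 = rng.multivariate_normal([2, 2], [[1, 0.5], [0.5, 1]], size=100)

m0, m1 = class0.mean(axis=0), class1.mean(axis=0)
# Within-class scatter: pooled covariance of the two classes
Sw = np.cov(class0, rowvar=False) + np.cov(class1, rowvar=False)
# Fisher's direction: maximizes between-class over within-class scatter
w = np.linalg.solve(Sw, m1 - m0)

# Project both classes onto w: the 1-D projections separate the classes
p0, p1 = class0 @ w, class1 @ w
print(p0.mean(), p1.mean())
```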
35. Differences between PCA & LDA
Principal Component Analysis (PCA): unsupervised learning; ignores class labels, so it is not effective for exploiting a labelled class in a dataset; used for feature extraction.
Linear Discriminant Analysis (LDA): supervised learning; works best with large labelled datasets; used for data classification.
38. Recap
→ What are Outliers?
→ Types of Outliers
↳ Univariate Outliers, Multivariate Outliers, Point or
Global Outliers, Collective Outliers, Contextual
Outliers
→ Causes of outliers
↳ Errors: Human, Natural, Sampling, Data
processing, etc
→ Impact of Outliers on Data
↳ error variance, decrease normality, bias or
influence estimates, reduce model performance
→ Detecting and Fixing Outliers
↳ Multivariate detection, Dimensionality Reduction
→ Other Ways of Handling Outliers
↳ Missing Values, High Correlations (Spearman’s
Correlation), Low Variance
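One of the "other ways" recapped above, treating outliers as missing values, can be sketched as follows (the IQR rule and median imputation are assumed choices; the data is illustrative):

```python
import numpy as np

data = np.array([10.0, 11, 9, 12, 10, 95, 11, 10])

# Mark IQR-rule outliers as missing (NaN) ...
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
cleaned = data.astype(float)
cleaned[mask] = np.nan

# ... then impute them with the median of the remaining values
cleaned[np.isnan(cleaned)] = np.nanmedian(cleaned)
print(cleaned)  # the extreme value 95 is replaced by the median
```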
39. Organize Your Research
Useful Links
↳ A Brief Overview of Outlier Detection
Techniques (TowardsDataScience)
↳ PCA on Iris Dataset (Github)
↳ LDA on Iris Dataset (Github)
Assignment
↳ Principal Component Analysis
(TowardsDataScience)
41. Homework
1. Letter Recognition Dataset (Multi-dimensional)
2. New York Times Corpus (Time-series for event detection)
3. Yahoo Labs: Server Traffic (Multi-variate time-series)