Thanks to our leaders
Thanks to our sponsor
Series Schedule
Before we begin...
Slides?
You can download them at
https://bit.ly/introtoadvml-week1-slides
Questions?
Post your questions in the Q&A box; one of the panelists will answer!
Issues?
Chat directly with the panelists if you are facing any issues!
If you torture the data enough, it will confess to anything.
Ronald Coase
Agenda
1. What are Outliers?
2. Types of Outliers
3. Causes of Outliers
4. Impact of Outliers on Data
5. Detecting and Fixing Outliers
6. Other Ways of Handling Outliers
7. References & Assignment
8. Implementation on Google Colab
What are Outliers?
Outliers are extreme values that fall a long way outside of the other observations. For example, in a Gaussian distribution, outliers may be values on the tails of the distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 1 Gaussian distribution showing the position of outliers
What are Outliers?
We will generally define outliers as samples that are exceptionally far from the mainstream of the data.
- Page 33, Applied Predictive Modeling, 2013.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf
Fig 2a Data points and Outliers
Fig 2b Data points and Regression Model
What are Outliers?
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 3a Outliers in clustered data
Fig 3b Outliers in time series data
Genesis of Outliers
Most common causes of outliers in a data set:
✓ Errors in data entry (human errors)
✓ Errors during measurement (instrument errors)
✓ Experimental errors (data extraction or experiment planning/executing errors)
✓ Dummy outliers created to test detection methods
✓ Data processing errors (data manipulation or unintended mutations of the data set)
✓ Sampling errors (extracting or mixing data from wrong or various sources)
✓ Natural (not an error; novelties in the data)
Outliers that are not the product of an error are called novelties.
Types of Outliers
1. Univariate Outliers
2. Multivariate Outliers
3. Point or Global Outliers
4. Collective Outliers
5. Contextual Outliers
6. Other Outliers e.g. Outliers in Time Series Data:
a. Additive Outliers
b. Innovative Outliers
Univariate
Univariate outliers can be found when looking at the distribution of values in a single feature space.
https://journals.sagepub.com/doi/pdf/10.1177/0844562118786647
Outlier Types
Fig 4 Univariate Outlier
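One common way to flag univariate outliers is the box-plot (IQR) rule: a value is suspicious if it lies more than 1.5 × IQR below the first quartile or above the third. The sketch below is an illustration and not part of the original deck; the function name `iqr_outliers`, the sample `ages`, and the multiplier `k=1.5` are assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical single-feature sample with one extreme value.
ages = [23, 25, 26, 27, 28, 29, 30, 31, 32, 95]
print(iqr_outliers(ages))
```

The cutoff is a convention rather than a law; a larger `k` flags fewer points.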
Multivariate
Multivariate outliers can be found in an n-dimensional space (of n features).
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Outlier Types
Fig 5 Multivariate Outlier
Point/Global
Point outliers are single data points that lie far from the rest of the distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Outlier Types
Fig 6 Global or Point Outlier
Collective
A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense.
https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435
Outlier Types
Fig 7 Collective Outlier
Outlier Types
Contextual/Conditional
A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context.
https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4
Fig 8 Contextual or Conditional Outlier
Other: Time Series → Additive
An additive outlier occurs at time T if the underlying process is perturbed/altered additively at time T.
https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720
Outlier Types
Fig 9 Additive Outlier
Other: Time Series → Innovative
An innovative outlier occurs at time t if the error (also known as an innovation) at time t is perturbed/altered.
https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720
Outlier Types
Fig 10 Innovative Outlier
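The difference between the two time-series outlier types is easiest to see on a simple autoregressive process. The sketch below assumes an AR(1) model x[t] = phi * x[t-1] + e[t]; the model choice, the helper `simulate_ar1`, and the values of `phi`, `T`, and `shock` are illustrative assumptions, not taken from the deck. The innovations are set to zero so that only the effect of the shock is visible.

```python
def simulate_ar1(n, phi, innovations):
    """Simulate x[t] = phi * x[t-1] + e[t], starting from x[-1] = 0."""
    series, prev = [], 0.0
    for t in range(n):
        prev = phi * prev + innovations[t]
        series.append(prev)
    return series

n, phi, T, shock = 8, 0.5, 3, 8.0   # hypothetical values for illustration
quiet = [0.0] * n                   # zero innovations keep the effect easy to read

# Additive outlier: the shock is added to the observed value at time T only.
additive = simulate_ar1(n, phi, quiet)
additive[T] += shock

# Innovative outlier: the shock enters the innovation at time T, so the
# process dynamics carry it forward with geometrically decaying weight.
shocked = quiet.copy()
shocked[T] = shock
innovative = simulate_ar1(n, phi, shocked)

print("additive:  ", additive)
print("innovative:", innovative)
```

In this sketch the additive outlier disturbs a single observation, while the innovative outlier also affects the observations that follow it.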
Play Ground
Which of the following is NOT a type of outlier?
1. Multivariate outlier
2. Global outlier
3. Angular outlier
4. Contextual outlier
NB: Drop your choice of answer in the chat section
Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
Outliers can drastically change the results of data analysis and statistical modeling. They have several negative impacts on a data set:
● They increase the error variance and reduce the power of statistical tests.
● If the outliers are non-randomly distributed, they can decrease normality.
● They can bias or influence estimates that may be of substantive interest.
● They can also violate the basic assumptions of regression and other statistical models.
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 11 Calculations showing the impact of outliers
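A quick calculation makes the impact concrete. The sketch below uses made-up numbers (the lists `clean` and `with_outlier` are assumptions, not the data behind the figure) and compares the mean, median, and standard deviation with and without a single extreme value.

```python
import statistics

clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]     # hypothetical data-entry error

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data),
          "stdev =", round(statistics.stdev(data), 2))
```

The mean and standard deviation shift strongly, while the median barely moves, which is one reason robust statistics are often preferred when outliers are suspected.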
Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html
Fig 12 Impact of outliers on Model Performance
Play Ground
Some data scientists argue that outliers can be used in a positive way to improve model performance, while others disagree.
What is your take on the argument?
1. Agree
2. Disagree
NB: Drop your chosen answer in the chat section
Tea Break
Exercise: You have a dataset and you are building a model to classify it. You observe that your classification model has three ways to accomplish this.
Choose between 1, 2, and 3 the line that best classifies your data.
NB: Take note of the outlier in the dataset. Drop your selected number in the chat section.
https://classroom.udacity.com/courses/ud120
Fig 13 Choosing the right model
Detecting & Fixing Outliers
Detecting Outliers in Data
There are several ways of detecting outliers in a dataset. We will discuss the two most frequently used and effective methods.
The most commonly used way to detect outliers is visualization, using plots such as box plots, histograms, and scatter plots (above, we have used box plots and scatter plots). Whatever the visualization may be, the detection relies on an underlying ML technique:
→ Multivariate detection
→ Dimensionality Reduction
Detecting Outliers in Data
→ Multivariate outliers can be identified using the Mahalanobis distance: the distance of a data point from the centroid of the other cases, where the centroid is the point at which the means of the variables being assessed intersect.
↳ Mahalanobis Distance
↳ Cook’s Distance
→ Dimensionality Reduction (PCA, LDA)
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 14 Detecting outliers using Mahalanobis Distance
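The Mahalanobis distance described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the function name `mahalanobis_distances`, the synthetic data, and the seed are hypothetical, and for simplicity the centroid and covariance are estimated from all rows, including the suspected outlier.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the column-mean centroid."""
    centroid = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - centroid
    # d_i = sqrt(diff_i^T  S^-1  diff_i), computed for all rows via einsum.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
# Hypothetical data: 200 strongly correlated 2-D points plus one point
# that breaks the correlation pattern.
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])
X = np.vstack([X, [2.0, -2.0]])

d = mahalanobis_distances(X)
print("most distant row:", int(np.argmax(d)), "distance:", round(float(d.max()), 2))
```

Neither coordinate of the added point is extreme on its own; it stands out only because the distance accounts for the correlation between the two features.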
Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273
Fig 15 Detecting outliers using Cook’s Distance
↳ Cook’s Distance
A general rule of thumb is that an observation with a Cook’s D of more than 3 times the mean, μ, is a possible outlier.
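The rule of thumb above can be tried on a small regression. The sketch below computes Cook's distance for an ordinary least squares fit with the standard formula D_i = (e_i² / (p · s²)) · h_ii / (1 − h_ii)², where e_i is the residual, h_ii the leverage, p the number of fitted parameters, and s² the residual mean square. The helper `cooks_distance`, the simple one-feature model, and the data are assumptions made for illustration.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance of each observation in a simple OLS fit y ~ a + b*x."""
    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat (projection) matrix
    leverage = np.diag(H)
    residuals = y - H @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * leverage / (1 - leverage) ** 2

# Hypothetical data: a clean linear trend with one corrupted response.
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[19] += 15.0

d = cooks_distance(x, y)
flagged = np.where(d > 3 * d.mean())[0]   # the 3 x mean rule of thumb
print("flagged observations:", flagged)
```

The threshold is a heuristic for picking out candidates to inspect, not a formal test.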
Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook’s Distance)
→ Dimensionality Reduction is a technique for reducing the number of dimensions in the data in order to reduce over-fitting and avoid making a model overly complex. When dimensionality is reduced by projecting all points onto a line, an outlier may be mapped into the center of the reduced data set.
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
Detecting & Fixing Outliers in Data
https://sebastianraschka.com/faq/docs/lda-vs-pca.html
↳ Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original data.
Fig 16 The Process of a PCA
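The PCA idea described above (find the directions of maximum variance, then project) can be sketched with a singular value decomposition. This is one possible implementation under assumptions: the function `pca`, the synthetic 3-D data, and the seed are hypothetical and not taken from the linked notebooks.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components; also return explained variance ratios."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(42)
# Hypothetical 3-D data whose variance lies mostly along one direction.
t = rng.normal(size=(300, 1))
X = t @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(300, 3))

Z, ratio = pca(X, n_components=1)
print(Z.shape, "variance explained by PC1:", round(float(ratio[0]), 3))
```

Here a single component keeps almost all of the variance, so the data can be reduced from three dimensions to one with little loss.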
Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589
Fig 17 An example of PCA for dimensionality reduction (DR)
Detecting & Fixing Outliers in Data
↳ Linear Discriminant Analysis (LDA)
LDA attempts to find a feature subspace that maximizes class separability.
Fig 18 The process of an LDA
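For two classes, the subspace that maximizes class separability is a single direction, which can be found with Fisher's linear discriminant. The sketch below is a minimal two-class illustration, not a general LDA implementation; the function `fisher_lda_direction`, the synthetic labelled data, and the seed are assumptions.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Unit vector maximizing between-class over within-class scatter (two classes)."""
    X0, X1 = X[y == 0], X[y == 1]
    mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two classes' scatter matrices.
    Sw = (X0 - mean0).T @ (X0 - mean0) + (X1 - mean1).T @ (X1 - mean1)
    w = np.linalg.solve(Sw, mean1 - mean0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
# Hypothetical labelled data: two elongated classes separated along the second axis.
cov = [[4.0, 0.0], [0.0, 0.1]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([0, 2], cov, 100)])
y = np.repeat([0, 1], 100)

w = fisher_lda_direction(X, y)
z = X @ w                               # 1-D projection onto the LDA axis
print("LDA direction:", np.round(w, 2))
```

In this data the largest variance lies along the first axis, which is the direction PCA would favor, while the classes differ along the second axis, which is the direction LDA recovers because it uses the labels.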
Differences between PCA & LDA
Principal Component Analysis (PCA) | Linear Discriminant Analysis (LDA)
Unsupervised learning | Supervised learning
Does not use class labels | Works best with a large labelled dataset
Used for feature extraction | Used for data classification
Other Methods to Handle Outliers
→ Missing Values
→ High Correlations (Spearman’s Correlation)
→ Low Variance
Recap
→ What are Outliers?
→ Types of Outliers
↳ Univariate Outliers, Multivariate Outliers, Point or Global Outliers, Collective Outliers, Contextual Outliers
→ Causes of Outliers
↳ Errors (human, sampling, data processing, etc.) and natural novelties
→ Impact of Outliers on Data
↳ Increased error variance, decreased normality, biased or influenced estimates, reduced model performance
→ Detecting and Fixing Outliers
↳ Multivariate detection, Dimensionality Reduction
→ Other Ways of Handling Outliers
↳ Missing Values, High Correlations (Spearman’s Correlation), Low Variance
Organize Your Research
Useful Links
↳ A Brief Overview of Outlier Detection Techniques (TowardsDataScience)
↳ PCA on Iris Dataset (GitHub)
↳ LDA on Iris Dataset (GitHub)
Assignment
↳ Principal Component Analysis (TowardsDataScience)
Google Colab Project
bit.ly/introtoadvml-week3-notebook
Homework
1. Letter Recognition Dataset (Multi-dimensional)
2. New York Times Corpus (Time-series for event detection)
3. Yahoo Labs: Server Traffic (Multi-variate time-series)
See you next week!
Questions?
Join us on Slack and post your questions to the #help-me

Introduction to unsupervised learning: outlier detection

  • 5. Before we begin.. Slides? You can download them at https://bit.ly/introtoadvml-week1-slides Questions? Post your questions in the QA box, one of the panelists will answer! Issues? Chat directly with the panelists if you are facing any issues!
  • 6. If you torture the data enough, it will confess to anything. Ronald Coase
  • 7. Agenda 1. What are Outliers? 2. Type of Outliers 3. Causes of outliers 4. Impact of Outliers on Data 5. Detecting and Fixing Outliers 6. Other Ways of Handling Outliers 7. References & Assignment 8. Implementation on Google Colab
  • 9. What are Outliers? Outliers are extreme values that fall a long way outside of the other observations. For example, in a gaussian distribution, outliers may be values on the tails of the distribution. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Fig 1 Gaussian Distribution showing position of outliers
  • 10. What are Outliers? We will generally define outliers as samples that are exceptionally far from the mainstream of the data. -Page 33, Applied Predictive Modeling, 2013. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf Fig 2a Data points and Outliers Fig 2b Data points and Regression Model
  • 12. Genesis of Outliers Most common causes of outliers in a data set ✓ Errors in data entry (human errors) ✓ Errors during measurement (instrument errors) ✓ Experimental errors (data extraction or experiment planning/executing errors) ✓ Dummy outliers created to test detection methods ✓ Data processing errors (data manipulation or unintended mutations of the data set) ✓ Sampling errors (extracting or mixing data from wrong or various sources) ✓ Natural (not an error, novelties in data) Outliers that are not the product of an error are called novelties.
  • 14. Types of Outliers 1. Univariate Outliers 2. Multivariate Outliers 3. Point or Global Outliers 4. Collective Outliers 5. Contextual Outliers 6. Other Outliers e.g. Outliers in Time Series Data: a. Additive Outliers b. Innovative Outliers
  • 15. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Univariate outliers can be found when looking at a distribution of values in a single feature space. https://journals.sagepub.com/doi/pdf/10.1177/0844562118786647 Outlier Types Fig 4 Univariate Outlier
  • 16. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Multivariate outliers can be found in an n-dimensional space (of n features). https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Outlier Types Fig 5 Multivariate Outlier
  • 17. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Point outliers are single data points that lie far from the rest of the distribution. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Outlier Types Fig 6 Global or Point Outlier
  • 18. A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435 Outlier Types Fig 7 Collective Outlier
  • 19. Outlier Types Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context. https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4 Fig 8 Contextual or Conditional Outlier
  • 20. Univariate Multivariate Point/Global Collective Contextual/Conditional Other: Time Series → Additive → Innovative An additive outlier occurs at time T if the underlying process is perturbed/altered additively at time T. https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720 Outlier Types Fig 9 Additive Outlier
  • 21. Univariate Multivariate Point/Global Collective Contextual/Conditional Other: Time Series → Additive → Innovative An innovative outlier occurs at time t if the error (also known as an innovation) at time t is perturbed/altered. https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720 Outlier Types Fig 10 Innovative Outlier
  • 22. Play Ground Which of the following is NOT a type of outlier? 1. Multivariate outlier 2. Global outlier 3. Angular outlier 4. Contextual outlier NB: Drop your choice of answer in the chat section
  • 23. Outliers as a Tsunami in Data What is the Impact of Outliers on a dataset? ● Outliers can drastically change the results of data analysis and statistical modeling. There are several negative impacts of outliers in the data set: ● They increase the error variance and reduce the power of statistical tests. ● If the outliers are non-randomly distributed, they can decrease normality. ● They can bias or influence estimates that may be of substantive interest. ● They can also violate the basic assumptions of regression and of other statistical models. https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878 Fig 11 Calculations showing the impact of outliers
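The effect on error variance and on estimates can be seen with a few lines of arithmetic. The following is a minimal sketch; the sample values are made up for illustration and are not taken from Fig 11.

```python
import numpy as np

# Hypothetical sample; the appended value plays the role of an outlier.
clean = np.array([4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0])
with_outlier = np.append(clean, 50.0)

def summarize(x):
    """Return mean, median and sample standard deviation of x."""
    return float(np.mean(x)), float(np.median(x)), float(np.std(x, ddof=1))

mean_c, med_c, sd_c = summarize(clean)
mean_o, med_o, sd_o = summarize(with_outlier)

print(f"clean:        mean={mean_c:.2f} median={med_c:.2f} sd={sd_c:.2f}")
print(f"with outlier: mean={mean_o:.2f} median={med_o:.2f} sd={sd_o:.2f}")
```

In this sample a single extreme value doubles the mean and inflates the standard deviation many times over, while the median does not move. That is why median-based summaries are often described as robust to outliers.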
  • 24. Outliers as a Tsunami in Data What is the Impact of Outliers on a dataset? https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html Fig 12 Impact of outliers on Model Performance
  • 25. Play Ground Some data scientists argue that outliers can be used in a positive way to improve model performance, while others disagree. What is your take on the argument? 1. Agree 2. Disagree NB: Drop your chosen answer in the chat section
  • 26. Tea Break Exercise: You have a dataset and you are building a model to classify it. You observe that your classification model has three ways to accomplish this. Choose between 1, 2 and 3: which line best classifies your data? NB: Take note of the outlier in the dataset. Drop your selected number in the chat section. https://classroom.udacity.com/courses/ud120 Fig 13 Choosing the right model
  • 27. Detecting & Fixing Outliers
  • 28. Detecting Outliers in Data There are several ways of detecting outliers in a dataset; we will discuss two of the most frequently used and effective methods. The most commonly used way to detect outliers is visualization, using plots such as box plots, histograms, and scatter plots (above, we have used box plot and scatter plot for visualization). Whichever visualization is used, detection depends on some ML algorithms: → Multivariate detection → Dimensionality Reduction
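A box plot conventionally draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and shows anything outside them as individual points. The sketch below applies that same rule numerically. The multiplier `k=1.5` and the sample values are assumptions for illustration; the slides do not specify a threshold.

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Return the values of x lying outside the box-plot fences
    [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 45]
flagged = iqr_outliers(values)
print(flagged)
```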
  • 29. Detecting Outliers in Data → Multivariate outliers can be identified with the use of Mahalanobis distance, which is the distance of a data point from the calculated centroid of the other cases, where the centroid is calculated as the intersection of the means of the variables being assessed. ↳ Mahalanobis Distance ↳ Cook’s Distance → Dimensionality Reduction (PCA, LDA) https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878 Fig 14 Detecting outliers using Mahalanobis Distance
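The Mahalanobis distance of a point x from a centroid μ with covariance Σ is sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)). The following is a minimal NumPy sketch under stated assumptions: for simplicity the centroid and covariance are estimated from all points, including the candidate outlier, and the synthetic data is invented for illustration.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample centroid."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)               # mean of each variable
    cov = np.cov(X, rowvar=False)           # sample covariance matrix
    inv_cov = np.linalg.pinv(cov)           # pseudo-inverse guards against singular cov
    diff = X - centroid
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic, strongly correlated 2-D data (assumption for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)
X = np.column_stack([x, y])

# Append a point that is unremarkable on each axis but breaks the correlation.
X = np.vstack([X, [2.0, -2.0]])

d = mahalanobis_distances(X)
print("index of most distant point:", int(np.argmax(d)))
```

The appended point would not stand out in a univariate check of either variable. It is only extreme once the correlation between the variables is taken into account, which is the case the Mahalanobis distance is designed for.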
  • 30. Detecting & Fixing Outliers in Data https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273 Fig 15 Detecting outliers using Cook’s Distance → Multivariate outliers can be identified with the use of Mahalanobis distance, which is the distance of a data point from the calculated centroid of the other cases, where the centroid is calculated as the intersection of the means of the variables being assessed. ↳ Mahalanobis Distance ↳ Cook’s Distance A general rule of thumb is that an observation with a Cook’s D of more than 3 times the mean, μ, is a possible outlier. → Dimensionality Reduction (PCA, LDA)
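Cook's distance measures how much a fitted regression changes when one observation is left out. For ordinary least squares it can be computed from the residuals e_i and leverages h_ii as D_i = e_i² / (p · s²) · h_ii / (1 − h_ii)², where p is the number of fitted parameters and s² is the residual variance. The sketch below assumes a simple linear regression with an intercept; the data is synthetic and the 3 × mean cutoff is the rule of thumb quoted on the slide.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance of each observation in a simple OLS fit y ~ a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    resid = y - X @ beta
    xtx_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)  # diagonal of the hat matrix
    s2 = resid @ resid / (n - p)                   # residual variance estimate
    return (resid ** 2 / (p * s2)) * (leverage / (1.0 - leverage) ** 2)

# Synthetic linear data with one corrupted response (assumption for illustration).
rng = np.random.default_rng(1)
x = np.arange(20.0)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=20)
y[19] += 30.0

D = cooks_distance(x, y)
flagged = np.where(D > 3 * D.mean())[0]
print("possible outliers at indices:", flagged.tolist())
```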
  • 31. Detecting & Fixing Outliers in Data → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid making a model complex. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) ↳ Linear Discriminant Analysis (LDA)
  • 32. Detecting & Fixing Outliers in Data https://sebastianraschka.com/faq/docs/lda-vs-pca.html → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid making a model complex. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original data. ↳ Linear Discriminant Analysis (LDA) Fig 16 The Process of a PCA
  • 33. Detecting & Fixing Outliers in Data https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589 → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid making a model complex. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original data. ↳ Linear Discriminant Analysis (LDA) Fig 17 An example of PCA for DR
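The PCA description above can be sketched with a singular value decomposition of the centered data: the right singular vectors are the directions of maximum variance, and projecting onto the first few of them gives the lower-dimensional subspace. This is a minimal NumPy illustration on synthetic data, not the notebook used in the session; the function name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first n_components principal directions.

    Returns (scores, components, explained_variance)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # rows are unit-length directions
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    scores = Xc @ components.T                     # coordinates in the new subspace
    return scores, components, explained_variance

# Synthetic 3-D data that varies mostly along the direction (1, 1, 0).
rng = np.random.default_rng(2)
t = 5.0 * rng.normal(size=200)
X = np.column_stack([t, t, np.zeros(200)]) + rng.normal(scale=0.3, size=(200, 3))

scores, components, explained_variance = pca_project(X, n_components=2)
print("first principal direction:", np.round(components[0], 2))
print("explained variance:", np.round(explained_variance, 2))
```

The sign of each principal direction is arbitrary, so the first direction may come out as roughly (0.71, 0.71, 0) or its negative.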
  • 34. Detecting & Fixing Outliers in Data → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid making a model complex. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) ↳ Linear Discriminant Analysis (LDA) LDA attempts to find a feature subspace that maximizes class separability. Fig 18 The process of an LDA
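For two classes, the direction that maximizes class separability is Fisher's linear discriminant, w ∝ S_w⁻¹ (m₁ − m₀), where S_w is the within-class scatter matrix and m₀, m₁ are the class means. The sketch below covers only this two-class case on synthetic data; the midpoint threshold used to classify is an assumption for illustration.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Unit vector maximizing between-class over within-class scatter
    for a two-class problem with labels 0 and 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

# Synthetic classes: large spread along axis 0, separated along axis 1.
rng = np.random.default_rng(3)
X0 = np.column_stack([rng.normal(0, 5, 100), rng.normal(0, 0.5, 100)])
X1 = np.column_stack([rng.normal(0, 5, 100), rng.normal(3, 0.5, 100)])
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

w = fisher_lda_direction(X, y)
z = X @ w                                            # 1-D projection
threshold = (z[y == 0].mean() + z[y == 1].mean()) / 2
accuracy = float(np.mean((z > threshold) == (y == 1)))
print("discriminant direction:", np.round(w, 2), "accuracy:", accuracy)
```

PCA on the same data would pick axis 0, the direction of largest variance, which carries no class information. LDA uses the labels and picks axis 1 instead.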
  • 35. Differences between PCA & LDA Principal Component Analysis (PCA): Unsupervised Learning; Does not use class labels; Used for feature classification. Linear Discriminant Analysis (LDA): Supervised Learning; Works best with Large Labelled Dataset; Used for data classification.
  • 36. Other Methods to Handle Outliers → Missing Values → High Correlations (Spearman’s Correlation) → Low Variance
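The three checks listed above can be sketched as simple feature screens. This is a NumPy-only illustration under several assumptions: the thresholds are arbitrary, the rank-based Spearman helper assumes no tied values, and the variance and correlation screens assume missing values have already been dropped or imputed. All function names are hypothetical.

```python
import numpy as np

def missing_fraction(X):
    """Fraction of missing (NaN) entries in each column."""
    return np.isnan(np.asarray(X, dtype=float)).mean(axis=0)

def spearman_matrix(X):
    """Spearman correlation: Pearson correlation of the ranks (assumes no ties)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def screen_features(X, var_threshold=1e-8, corr_threshold=0.95):
    """Return (low-variance column indices, highly correlated column pairs)."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0, ddof=1)
    low_var = [int(j) for j in np.where(variances < var_threshold)[0]]
    keep = [j for j in range(X.shape[1]) if j not in low_var]
    rho = spearman_matrix(X[:, keep])   # only columns that actually vary
    pairs = []
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if abs(rho[a, b]) > corr_threshold:
                pairs.append((keep[a], keep[b]))
    return low_var, pairs

# Synthetic features: column 1 is a monotone transform of column 0,
# column 2 is independent noise, column 3 is constant.
rng = np.random.default_rng(4)
a = rng.normal(size=100)
X = np.column_stack([a, np.exp(a), rng.normal(size=100), np.full(100, 7.0)])

low_var, pairs = screen_features(X)
print("low-variance columns:", low_var)
print("highly correlated pairs:", pairs)
```

Spearman's correlation is used here rather than Pearson's because it depends only on ranks, so it detects the monotone but non-linear link between columns 0 and 1 and is less affected by extreme values.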
  • 37. Recap
  • 38. Recap → What are Outliers? → Type of Outliers ↳ Univariate Outliers, Multivariate Outliers, Point or Global Outliers, Collective Outliers, Contextual Outliers → Causes of outliers ↳ Errors: Human, Natural, Sampling, Data processing, etc → Impact of Outliers on Data ↳ error variance, decrease normality, bias or influence estimates, reduce model performance → Detecting and Fixing Outliers ↳ Multivariate detection, Dimensionality Reduction → Other Ways of Handling Outliers ↳ Missing Values, High Correlations (Spearman’s Correlation), Low Variance
  • 39. Organize Your Research Useful Links ↳ A Brief Overview of Outlier Detection Techniques (TowardsDataScience) ↳ PCA on Iris Dataset (Github) ↳ LDA on Iris Dataset (Github) Assignment ↳ Principal Component Analysis (TowardsDataScience)
  • 41. Homework 1. Letter Recognition Dataset (Multi-dimensional) 2. New York Times Corpus (Time-series for event detection) 3. Yahoo Labs: Server Traffic (Multivariate time-series)
  • 42. See you next week! Questions? Join us on Slack and post your questions to the #help-me