Thanks to
our leaders
Thanks to
our sponsor
Series
Schedule
Before we
begin..
Slides?
You can download them at
https://bit.ly/introtoadvml-week1-slides
Questions?
Post your questions in the Q&A box; one of
the panelists will answer!
Issues?
Chat directly with the panelists if you are
facing any issues!
If you torture the data enough, it will
confess to anything.
Ronald Coase
Agenda
1. What are Outliers?
2. Types of Outliers
3. Causes of outliers
4. Impact of Outliers on Data
5. Detecting and Fixing Outliers
6. Other Ways of Handling Outliers
7. References & Assignment
8. Implementation on Google Colab
What are Outliers?
What are Outliers?
Outliers are extreme values that fall a long way outside of the other observations. For example, in a Gaussian distribution, outliers may be values in the tails of the distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 1 Gaussian Distribution showing position
of outliers
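As a rough illustration of "values in the tails", the sketch below flags points whose z-score exceeds a cutoff. The 3-standard-deviation threshold and the sample readings are assumptions made for this example, not values taken from the slides.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be extreme
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sample: 20 readings near 50 plus one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [95]
print(zscore_outliers(readings))
```

The cutoff is a convention rather than a law; a stricter or looser threshold may suit other data.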
What are Outliers?
"We will generally define outliers as samples that are exceptionally far from the mainstream of the data."
Page 33, Applied Predictive Modeling, 2013.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf
Fig 2a Data points and
Outliers
Fig 2b Data points and
Regression Model
What are Outliers?
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Fig 3a Outliers in clustered data
Fig 3b Outliers in time series data
Genesis of Outliers
Most common causes of outliers in a data set:
✓ Data entry errors (human errors)
✓ Measurement errors (instrument errors)
✓ Experimental errors (data extraction or experiment planning/execution errors)
✓ Dummy outliers created to test detection methods
✓ Data processing errors (data manipulation or unintended mutations of the data set)
✓ Sampling errors (extracting or mixing data from wrong or various sources)
✓ Natural (not an error; novelties in the data)
Outliers that are not the product of an error are called novelties.
Types of Outliers
Types of Outliers
1. Univariate Outliers
2. Multivariate Outliers
3. Point or Global Outliers
4. Collective Outliers
5. Contextual Outliers
6. Other Outliers e.g. Outliers in Time Series Data:
a. Additive Outliers
b. Innovative Outliers
Univariate
Multivariate
Point/Global
Collective
Contextual/Conditional
Other
→ Additive
→ Innovative
Univariate outliers can be found when looking at a distribution of values in a single feature space.
https://journals.sagepub.com/doi/pdf/10.1177/0844562118786647
Outlier Types
Fig 4 Univariate Outlier
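A univariate outlier can be flagged with the interquartile-range rule that box plots use. This is a minimal sketch: the 1.5 × IQR fences are the usual box-plot convention, and the `ages` sample is hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] for a single feature."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical single feature: ages with one implausible entry.
ages = [22, 25, 27, 28, 30, 31, 33, 35, 36, 120]
print(iqr_outliers(ages))
```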
Multivariate outliers can be found in an n-dimensional space (of n features).
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Outlier Types
Fig 5 Multivariate Outlier
Point outliers are single data points that lie far from the rest of the distribution.
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Outlier Types
Fig 6 Global or Point Outlier
A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense.
https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435
Outlier Types
Fig 7 Collective Outlier
Outlier Types
A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context.
https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4
Fig 8 Contextual or Conditional Outlier
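One way to picture a contextual outlier is to score each value only against the other values that share its context. The sketch below assumes a context label per record (here a hypothetical season) and an illustrative cutoff of 2 standard deviations; both are assumptions for this example.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs that are extreme within their own context."""
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    flagged = []
    for context, value in records:
        values = groups[context]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        if std > 0 and abs(value - mean) / std > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperatures: 30 degrees is ordinary in summer, unusual in winter.
records = [("summer", t) for t in (28, 29, 30, 31, 30, 29, 30)]
records += [("winter", t) for t in (1, 2, 0, 3, 2, 1, 30)]
print(contextual_outliers(records))
```

The same value, 30, appears in both groups, but only the winter reading is flagged.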
An additive outlier occurs at time t if the underlying process is perturbed/altered additively at time t.
https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720
Outlier Types
Fig 9 Additive Outlier
An innovative outlier occurs at time t if the error (also known as an innovation) at time t is perturbed/altered.
https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720
Outlier Types
Fig 10 Innovative Outlier
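The difference between the two time-series outlier types is easiest to see on a toy process. The sketch below assumes a noise-free AR(1) series, x_t = phi * x_(t-1) + e_t, which is an illustrative choice and not a model named in the slides. The additive outlier is added to a single observation, while the innovative outlier is fed into the error term and so carries forward.

```python
def simulate_ar1(n, phi, shock_at=None, shock=0.0, additive_at=None, bump=0.0):
    """Simulate a noise-free AR(1) series with optional outliers.

    shock_at/shock:   innovative outlier, perturbs the error term e_t.
    additive_at/bump: additive outlier, perturbs only the observed value.
    """
    observed = []
    prev = 0.0
    for t in range(n):
        e = shock if t == shock_at else 0.0
        x = phi * prev + e                      # underlying process
        observed.append(x + (bump if t == additive_at else 0.0))
        prev = x                                # the bump is not carried forward
    return observed

print("additive:  ", simulate_ar1(6, 0.5, additive_at=3, bump=10.0))
print("innovative:", simulate_ar1(6, 0.5, shock_at=3, shock=10.0))
```

In this sketch the additive outlier touches one point only, whereas the innovative outlier decays gradually over the following time steps.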
Play Ground
Which of the following is NOT a type of outlier?
1. Multivariate outlier
2. Global outlier
3. Angular outlier
4. Contextual outlier
NB: Drop your choice of answer in the chat section
Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
● Outliers can drastically change the results of data analysis and statistical modeling. They have several negative effects on a data set:
● They increase the error variance and reduce the power of statistical tests.
● If the outliers are non-randomly distributed, they can decrease normality.
● They can bias or influence estimates that may be of substantive interest.
● They can also affect the basic assumptions of regression and other statistical models.
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 11 Calculations showing the impact of outliers
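A small calculation in the spirit of Fig 11 shows how a single extreme value shifts summary statistics. The numbers are hypothetical and chosen only to make the effect visible.

```python
import statistics

def summarize(data):
    """Return the mean, median and sample standard deviation of `data`."""
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
    }

clean = [10, 11, 12, 13, 14]      # hypothetical measurements
with_outlier = clean + [100]      # same data plus one extreme value

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(label, summarize(data))
```

Here the mean more than doubles and the standard deviation grows sharply, while the median barely moves.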
Outliers as a Tsunami in Data
What is the Impact of Outliers on a dataset?
https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html
Fig 12 Impact of
outliers on Model
Performance
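The effect on a model can be sketched with an ordinary least-squares line fitted with and without one outlier. The data are hypothetical (an exact y = 2x relationship with the last response corrupted), and `fit_line` is a helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]          # follows y = 2x exactly
ys_bad = ys[:-1] + [40]            # last response replaced by an outlier

print("clean fit:       ", fit_line(xs, ys))
print("fit with outlier:", fit_line(xs, ys_bad))
```

One corrupted point out of six is enough to pull the fitted slope well away from 2.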
Play Ground
Some data scientists argue that outliers can be used in a
positive way to improve model performance, while others
disagree.
What is your take on the argument?
1. Agree
2. Disagree
NB: Drop your chosen answer in the chat section
Tea Break
Exercise: You have a dataset and you are building a model to classify it. Your classification model has three ways to accomplish this.
Choose between lines 1, 2 and 3: which line of best fit classifies your data best?
NB: Take note of the outlier in the dataset. Drop your selected number in the chat section.
https://classroom.udacity.com/courses/ud120
Fig 13 Choosing the right model
Detecting & Fixing Outliers
Detecting Outliers in Data
There are several ways of detecting outliers in a dataset. Here we discuss two of the most frequently used and effective methods.
The most commonly used way to detect outliers is visualization, using plots such as box plots, histograms and scatter plots (the figures above use box plots and scatter plots). Whatever the visualization may be, the detection behind it relies on one of the following ML techniques:
→ Multivariate detection
→ Dimensionality Reduction
Detecting Outliers in Data
→ Multivariate outliers can be identified using the Mahalanobis distance: the distance of a data point from the centroid of the other cases, where the centroid is the intersection of the means of the variables being assessed.
↳ Mahalanobis Distance
↳ Cook’s Distance
→ Dimensionality Reduction
(PCA, LDA)
https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878
Fig 14 Detecting outliers using Mahalanobis Distance
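The Mahalanobis distance described on this slide can be sketched for two features without any external library. As a simplification, this version measures each point against the centroid and covariance of all cases, not of "the other cases"; the sample points are hypothetical.

```python
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the centroid of all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Hypothetical data: seven points on a trend and one point off it.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (2, 6)]
distances = mahalanobis_2d(points)
print(max(zip(distances, points)))
```

The point (2, 6) is unremarkable on either axis alone, yet it has the largest distance because it breaks the correlation between the two features.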
Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273
Fig 15 Detecting outliers using Cook’s Distance
A general rule of thumb is that an observation with a Cook’s D of more than 3 times the mean Cook’s D, μ, is a possible outlier.
→ Dimensionality Reduction
(PCA, LDA)
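The Cook's D rule of thumb on this slide can be sketched for a simple linear regression. The formula is the standard one for a model with an intercept and one slope; the data are hypothetical, and a statistics library would normally be used in practice.

```python
def cooks_distance(xs, ys):
    """Cook's distance of each observation in a simple linear regression."""
    n = len(xs)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(r * r for r in residuals) / (n - p)

    result = []
    for x, r in zip(xs, residuals):
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        result.append((r * r / (p * mse)) * leverage / (1 - leverage) ** 2)
    return result

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 30]   # last response is an outlier

d = cooks_distance(xs, ys)
cutoff = 3 * sum(d) / len(d)        # rule of thumb: 3 times the mean Cook's D
flagged = [i for i, value in enumerate(d) if value > cutoff]
print(flagged)
```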
Detecting & Fixing Outliers in Data
→ Multivariate detection (Mahalanobis Distance, Cook’s Distance)
→ Dimensionality Reduction is a technique for reducing the number of dimensions in the data in order to reduce over-fitting and avoid making a model complex. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set.
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
Detecting & Fixing Outliers in Data
https://sebastianraschka.com/faq/docs/lda-vs-pca.html
↳ Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original data.
↳ Linear Discriminant Analysis (LDA)
Fig 16 The
Process of
a PCA
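The PCA description above can be sketched for two features, where the direction of maximum variance has a closed form. This is an illustration only: real data sets have more dimensions and would normally go through a library routine. The helper name and sample points are hypothetical.

```python
import math

def pca_first_component(points):
    """Return (direction, variance, scores) of the first principal component in 2-D."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)

    # Largest eigenvalue = variance along the first principal direction.
    variance = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

    # Matching eigenvector; fall back to an axis when the features are uncorrelated.
    if b != 0:
        vx, vy = b, variance - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Project each centered point onto the direction: 2 dimensions become 1.
    scores = [dx * direction[0] + dy * direction[1] for dx, dy in centered]
    return direction, variance, scores

points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.2)]
direction, variance, scores = pca_first_component(points)
print(direction, variance)
```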
Detecting & Fixing Outliers in Data
https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589
Fig 17 An example of PCA for dimensionality reduction (DR)
Detecting & Fixing Outliers in Data
↳ Principal Component Analysis (PCA)
↳ Linear Discriminant Analysis (LDA)
LDA attempts to find a feature subspace that maximizes class separability.
Fig 18
The
process
of an LDA
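For two classes and two features, the direction that maximizes class separability can be sketched with Fisher's linear discriminant, w = Sw^-1 (mean_b - mean_a), where Sw is the within-class scatter matrix. The helper and the two small classes below are hypothetical and chosen so the result is easy to check by eye.

```python
def fisher_direction(class_a, class_b):
    """Direction w = Sw^-1 (mean_b - mean_a) for two classes of 2-D points."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((x - m[0]) ** 2 for x, _ in points)
        syy = sum((y - m[1]) ** 2 for _, y in points)
        sxy = sum((x - m[0]) * (y - m[1]) for x, y in points)
        return sxx, sxy, syy

    mean_a, mean_b = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, mean_a)
    bxx, bxy, byy = scatter(class_b, mean_b)

    # Within-class scatter matrix Sw = [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy ** 2

    dx, dy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
    # Multiply the 2x2 inverse of Sw by the difference of the class means.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    return wx, wy

class_a = [(1, 1), (2, 1), (1, 2), (2, 2)]
class_b = [(5, 1), (6, 1), (5, 2), (6, 2)]
w = fisher_direction(class_a, class_b)

# Projecting onto w reduces each point to one number that separates the classes.
print([w[0] * x + w[1] * y for x, y in class_a + class_b])
```

Unlike the PCA sketch, this one needs the class labels, which is what makes LDA a supervised technique.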
Differences between PCA & LDA
Principal Component Analysis (PCA) | Linear Discriminant Analysis (LDA)
Unsupervised learning | Supervised learning
Ignores class labels | Works best with a large labelled dataset
Used for feature extraction | Used for data classification
Other Methods to Handle Outliers
→ Missing Values
→ High Correlations (Spearman’s Correlation)
→ Low Variance
Recap
Recap
→ What are Outliers?
→ Types of Outliers
↳ Univariate Outliers, Multivariate Outliers, Point or Global Outliers, Collective Outliers, Contextual Outliers
→ Causes of Outliers
↳ Errors (human, sampling, data processing, etc.) and natural novelties
→ Impact of Outliers on Data
↳ Increased error variance, decreased normality, biased or influenced estimates, reduced model performance
→ Detecting and Fixing Outliers
↳ Multivariate detection, Dimensionality Reduction
→ Other Ways of Handling Outliers
↳ Missing Values, High Correlations (Spearman’s Correlation), Low Variance
Organize
Your
Research
Useful Links
↳ A Brief Overview of Outlier Detection
Techniques (TowardsDataScience)
↳ PCA on Iris Dataset (Github)
↳ LDA on Iris Dataset (Github)
Assignment
↳ Principal Component Analysis
(TowardsDataScience)
Google Colab Project
bit.ly/introtoadvml-week3-notebook
Homework
1. Letter Recognition Dataset (Multi-dimensional)
2. New York Times Corpus (Time-series for event detection)
3. Yahoo Labs: Server Traffic (Multi-variate time-series)
See you
next week!
Questions?
Join us on Slack and
post your questions
to the #help-me

Introduction to unsupervised learning: outlier detection

  • 1.
  • 5. Before we begin.. Slides? You can download them at https://bit.ly/introtoadvml-week1-slides Questions? Post your questions in the QA box, one of the panelists will answer! Issues? Chat directly with the panelists if you are facing any issues!
  • 6. If you torture the data enough, it will confess to anything. Ronald Coase
  • 7. Agenda 1. What are Outliers? 2. Type of Outliers 3. Causes of outliers 4. Impact of Outliers on Data 5. Detecting and Fixing Outliers 6. Other Ways of Handling Outliers 7. References & Assignment 8. Implementation on Google Colab
  • 9. What are Outliers? Outliers are extreme values that fall a long way outside of the other observations. For example, in a gaussian distribution, outliers may be values on the tails of the distribution. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Fig 1 Gaussian Distribution showing position of outliers
  • 10. What are Outliers? We will generally define outliers as samples that are exceptionally far from the mainstream of the data. -Page 33, Applied Predictive Modeling, 2013. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561, https://research.vu.nl/ws/portalfiles/portal/21334642/hoofdstuk+3.pdf Fig 2a Data points and Outliers Fig 2b Data points and Regression Model
  • 12. Genesis of Outliers Most common causes of outliers on a data set ✓ Errors in Data entry (human errors) ✓ Errors during Measurement (instrument errors) ✓ Experimental errors (data extraction or experiment planning/executing errors) ✓ dummy outliers created to test detection methods ✓ Data processing errors (data manipulation or data set unintended mutations) ✓ Sampling errors (extracting or mixing data from wrong or various sources) ✓ Natural (not an error, novelties in data) Those causes that are not a product of an error are called novelties.
  • 14. Types of Outliers 1. Univariate Outliers 2. Multivariate Outliers 3. Point or Global Outliers 4. Collective Outliers 5. Contextual Outliers 6. Other Outliers e.g. Outliers in Time Series Data: a. Additive Outliers b. Innovative Outliers
  • 15. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Univariate outliers can be found when looking at a distribution of values in a single feature space. https://journals.sagepub.com/doi/pdf/10.1177/0844562118786647 Outlier Types Fig 4 Univariate Outlier
  • 16. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Multivariate outliers can be found in a n- dimensional space (of n-features). https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Outlier Types Fig 5 Multivariate Outlier
  • 17. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative Point outliers are single data points that lay far from the rest of the distribution. https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Outlier Types Fig 6 Global or Point Outlier
  • 18. A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, but the values of the individual data points are not themselves anomalous in either a contextual or global sense. Univariate Multivariate Point/Global Collective Contextual/Conditional Other → Additive → Innovative https://www.researchgate.net/figure/Collective-outlier-in-an-human-ECG-output-corresponding-to-an-Atrial-Premature_fig3_267964435 Outlier Types Fig 7 CollectiveOutlier
  • 19. Outlier Types Univariate Multivariate Point/Global Collective Contextual/Conditiona l Other → Additive → Innovative A data point is considered a contextual outlier if its value significantly deviates from the rest the data points in the same context. https://www.semanticscholar.org/paper/Contextual-Outlier-Detection-in-Sensor-Data-Using-Haque-Mineno/5cc5b6760d2de45add2959b150044f4c70a78aea/figure/4 Fig 8 Contextual or Conditional Outlier
  • 20. Univariate Multivariate Point/Global Collective Contextual/Conditional Other: Time Series → Additive → Innovative An additive outlier occurs at time T if the underlying process is perturbed/altered additively at time T. https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720 Outlier Types Fig 9 Additive Outlier
  • 21. Univariate Multivariate Point/Global Collective Contextual/Conditional Other: Time Series → Additive → Innovative An innovative outlier occurs at time t if the error (also known as an innovation) at time t is perturbed/ altered. https://www.researchgate.net/figure/Examples-of-three-types-of-outliers_fig2_270763720 Outlier Types Fig 10 Innovative Outlier
  • 22. Play Ground Which of the following is NOT a type of outlier? 1. Multivariate outlier 2. Global outlier 3. Angular outlier 4. Contextual outlier NB: Drop your choice of answer in the chat section
  • 23. Outliers as a Tsunami in Data What is the Impact of Outliers on a dataset? ● Outliers is able to rapidly change the results of the data analysis and statistical modeling. There are several negative impacts of outliers in the data set: ● It increases the error variance and reduces the power of statistical tests. ● If the outliers are non-randomly distributed, they can decrease normality. ● They can bias or influence estimates that may be of substantive interest. ● They can also impact the basic assumption of Regression, and other statistical model assumptions. https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878 Fig 11 Calculations showing the impact of outliers
  • 24. Outliers as a Tsunami in Data What is the Impact of Outliers on a dataset? https://www.kdnuggets.com/2018/08/make-machine-learning-models-robust-outliers.html Fig 12 Impact of outliers on Model Performance
  • 25. Play Ground Some data scientist argued that, outliers are can be used in a positive way for improving model performance while others disagree. What is your take on the argument? 1. Agree 2. Disagree NB: Drop your choosing answer in the chat section
  • 26. Tea Break Exercise: You have some datasets and you are building a model to classify the dataset. You observe that, your classification model has three ways to accomplish this. Choose between 1,2 and 3, the line of best fit that can best classify your data? NB: Take note of the outlier in the dataset. Drop your selected number on the chat section. https://classroom.udacity.com/courses/ud120 Fig 13 Choosing the right model
  • 27. Detecting & Fixing Outliers
  • 28. Detecting Outliers in Data There are several ways of detecting outliers in a dataset. Meanwhile we will discuss two most frequently used and effective methods. Most commonly used method to detect outliers is visualization. We use various visualization methods, like Box-plot, Histogram, Scatter Plot (above, we have used box plot and scatter plot for visualization). Whatever the visualization maybe, it all depends on some ML algorithms; → Multivariate detection → Dimensionality Reduction
  • 29. Detecting Outliers in Data → Multivariate outliers can be identified using the Mahalanobis distance, which is the distance of a data point from the centroid of the other cases, where the centroid is the point at the intersection of the means of the variables being assessed. ↳ Mahalanobis Distance ↳ Cook’s Distance → Dimensionality Reduction (PCA, LDA) https://journals.sagepub.com/doi/pdf/10.1177/1475921717748878 Fig 14 Detecting outliers using Mahalanobis Distance
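The Mahalanobis distance described above can be sketched with NumPy. For simplicity this version measures each point against the centroid and covariance of all cases (including the point itself), not strictly "the other cases"; the function name and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the centroid of X."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)            # mean of each variable
    cov = np.cov(X, rowvar=False)        # covariance between variables
    inv_cov = np.linalg.pinv(cov)        # pseudo-inverse tolerates a singular cov
    diff = X - centroid
    # Quadratic form diff_i^T * inv_cov * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))

# Synthetic, strongly correlated data (hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.3 * rng.normal(size=50)
X = np.column_stack([x, y])

# Append a point that is ordinary in each variable alone
# but breaks the x-y relationship.
X = np.vstack([X, [2.0, -2.0]])

distances = mahalanobis_distances(X)
most_extreme = int(np.argmax(distances))
print(most_extreme, distances[most_extreme])
```

The appended point would pass a per-variable check, yet it gets the largest distance because it is measured against the joint spread of both variables.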
  • 30. Detecting & Fixing Outliers in Data https://www.researchgate.net/figure/Outliers-by-Cooks-distance-with-a-red-line-plotted-to-indicate-division-to-outlierhood_fig1_323265273 Fig 15 Detecting outliers using Cook’s Distance → Multivariate outliers can be identified using the Mahalanobis distance, which is the distance of a data point from the centroid of the other cases, where the centroid is the point at the intersection of the means of the variables being assessed. ↳ Mahalanobis Distance ↳ Cook’s Distance A general rule of thumb is that an observation with a Cook’s D of more than 3 times the mean, μ, is a possible outlier. → Dimensionality Reduction (PCA, LDA)
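The rule of thumb for Cook's D can be tried on a simple linear regression. This is a sketch under assumptions: it uses the standard formula D_i = e_i² / (p·s²) · h_i / (1 − h_i)² for ordinary least squares with an intercept, and the data (a straight line with one shifted point) is invented for the example.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance of each observation for the OLS fit y ~ a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    resid = y - X @ beta
    # Leverage h_i: diagonal of the hat matrix X (X^T X)^-1 X^T.
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)                   # residual variance estimate
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2

# Hypothetical data: a noisy line with the last point pushed far off it.
x = np.arange(20.0)
y = 2.0 * x + 1.0 + 0.5 * np.sin(x)
y[19] += 30.0

d = cooks_distance(x, y)
threshold = 3.0 * d.mean()                         # rule of thumb from the slide
flagged = np.flatnonzero(d > threshold)
print(flagged)
```

The threshold is only a screening heuristic; flagged observations are candidates to inspect, not points to delete automatically.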
  • 31. Detecting & Fixing Outliers in Data → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) ↳ Linear Discriminant Analysis (LDA)
  • 32. Detecting & Fixing Outliers in Data https://sebastianraschka.com/faq/docs/lda-vs-pca.html → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original. ↳ Linear Discriminant Analysis (LDA) Fig 16 The Process of a PCA
  • 33. Detecting & Fixing Outliers in Data https://www.researchgate.net/figure/Principal-component-analysis-example-PC-1-contains-the-most-energy-of-the-data-but-does_fig2_279177589 → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original. ↳ Linear Discriminant Analysis (LDA) Fig 17 An example of PCA for DR
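A minimal PCA can be written with NumPy's singular value decomposition. This is a sketch under assumptions, not the implementation used in the linked notebooks: the function name `pca` and the synthetic 3-D data are hypothetical, and the data is centered but not scaled.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its first n_components directions of maximum variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # center each variable
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    projected = Xc @ components.T                  # coordinates in the new subspace
    return projected, components, explained_variance

# Synthetic 3-D data that mostly varies along one direction (hypothetical).
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = t * np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=(100, 3))

projected, components, explained_variance = pca(X, n_components=2)
print(projected.shape)
print(explained_variance)
```

Here almost all of the variance lands on the first component, so the three original dimensions can be summarized by one or two new ones.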
  • 34. Detecting & Fixing Outliers in Data → Multivariate detection (Mahalanobis Distance, Cook’s Distance) → Dimensionality Reduction is a technique for reducing the number of dimensions of the data in order to reduce over-fitting and avoid an overly complex model. When dimensionality is reduced by projecting all points onto a line, an outlier can be mapped into the center of the reduced data set. ↳ Principal Component Analysis (PCA) ↳ Linear Discriminant Analysis (LDA) LDA attempts to find a feature subspace that maximizes class separability. Fig 18 The process of an LDA
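For two classes, the direction that maximizes class separability has a closed form (Fisher's linear discriminant). The sketch below covers only that two-class case and uses synthetic labelled data; the function name `fisher_lda_direction` is hypothetical, and a full multi-class LDA needs a generalized eigenvalue problem instead.

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Unit vector w that best separates two classes (Fisher's criterion)."""
    X0 = np.asarray(X0, dtype=float)
    X1 = np.asarray(X1, dtype=float)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)      # class means
    # Within-class scatter: spread of each class around its own mean.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)               # w is proportional to Sw^-1 (m1 - m0)
    return w / np.linalg.norm(w)

# Synthetic labelled data: two 2-D classes with shifted means (hypothetical).
rng = np.random.default_rng(0)
X0 = rng.normal(size=(30, 2))
X1 = rng.normal(size=(30, 2)) + np.array([3.0, 1.0])

w = fisher_lda_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w                            # 1-D projections of each class
print(w, z0.mean(), z1.mean())
```

Unlike PCA, this projection needs the class labels: it looks for the line along which the two class means are far apart relative to the spread within each class.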
  • 35. Differences between PCA & LDA Principal Component Analysis (PCA): Unsupervised Learning; Not effective with labelled classes of a dataset; Used for feature classification. Linear Discriminant Analysis (LDA): Supervised Learning; Works best with a large labelled dataset; Used for data classification.
  • 36. Other Methods to Handle Outliers → Missing Values → High Correlations (Spearman’s Correlation) → Low Variance
  • 37. Recap
  • 38. Recap → What are Outliers? → Type of Outliers ↳ Univariate Outliers, Multivariate Outliers, Point or Global Outliers, Collective Outliers, Contextual Outliers → Causes of outliers ↳ Errors: Human, Natural, Sampling, Data processing, etc → Impact of Outliers on Data ↳ error variance, decrease normality, bias or influence estimates, reduce model performance → Detecting and Fixing Outliers ↳ Multivariate detection, Dimensionality Reduction → Other Ways of Handling Outliers ↳ Missing Values, High Correlations (Spearman’s Correlation), Low Variance
  • 39. Organize Your Research Useful Links ↳ A Brief Overview of Outlier Detection Techniques (TowardsDataScience) ↳ PCA on Iris Dataset (Github) ↳ LDA on Iris Dataset (Github) Assignment ↳ Principal Component Analysis (TowardsDataScience)
  • 41. Homework 1. Letter Recognition Dataset (Multi-dimensional) 2. New York Times Corpus (Time-series for event detection) 3. Yahoo Labs: Server Traffic (Multi-variate time-series)
  • 42. See you next week! Questions? Join us on Slack and post your questions to the #help-me