An Integral Part of Data Mining - Outliers
Data mining is the exercise of sorting through a large data set to identify
patterns and make connections that help solve problems and predict future
trends. During this analysis, some observations deviate from the trends in
the rest of the data; these are called outliers. Outliers are data objects
too, but they behave distinctly from the rest of the data objects. An early
and widely cited definition of outliers was given by Grubbs in 1969. This
article looks at outlier analysis in data mining and at the types of
outliers in data mining.
Outliers and Noise
Outliers are not the same as noise. Noise is random error or variance in a
measured variable, whereas outliers are treated as not belonging to the same
set of data objects, often because they result from an incorrect entry or a
computational or execution error. It is also wise to remove the noise before
outlier detection.
Classifying Outliers
Broadly, outliers are classified as:
Univariate outliers, which are found when only one dimension (feature) of the
feature space is considered.
Multivariate outliers, which occur in a feature space of many dimensions.
Beyond this, outliers are of the following three types:
1. Point or Global Outliers:
This is the most elementary form of outlier. These are the few points in a
dataset that deviate strongly from the rest of the data points and therefore
lie far away from the data distribution or cluster.
2. Contextual or Conditional Outliers:
These data points deviate greatly within a specific context or condition but
may show normal behavior in other conditions, so the context must be
specified in the problem statement. For example, a temperature of 30 degrees
Celsius may be normal in summer but an outlier in winter. The objects of data
have two types of attributes: contextual attributes, which define the
context, and behavioral attributes, which define the objects'
characteristics.
3. Collective Outliers:
These outliers deviate from the rest of the dataset as a group, for example
by forming a small cluster away from the rest of the data. They arise when a
collection of data points behaves anomalously together.
Outlier Detection Techniques
The techniques and approaches for detecting the outliers described above are
discussed below:
1. Sorting
Sorting is one of the simplest ways of detecting outliers in data mining. The
data is sorted by magnitude, and the values at the extreme high or low end of
the range can be considered possible outliers.
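The sorting approach can be sketched in a few lines of Python. This is only a minimal illustration: the function name extremes, the parameter k, and the sample readings are all made up for the example, and how many values at each end deserve a closer look is a judgment call.

```python
def extremes(values, k=2):
    """Return the k smallest and k largest values after sorting."""
    ordered = sorted(values)
    return ordered[:k], ordered[-k:]


# Hypothetical readings: most values sit near 50, while 2 and 98 do not.
readings = [51, 49, 50, 98, 52, 48, 2, 50]
lowest, highest = extremes(readings)
print("lowest values: ", lowest)
print("highest values:", highest)
```

Sorting only surfaces candidates at the two ends of the range; deciding whether they really are outliers still needs one of the other techniques or domain knowledge.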
2. Graphing
This method requires plotting all the data in a graph, using a histogram,
scatter plot, or box plot, so that the user can see which data points diverge
from the rest of the dataset.
• A histogram is favorable for observing bulk data.
• A scatter plot is preferable when looking at the degree of association
between two numerical variables.
3. Z-score for detecting outliers
This method assumes a Gaussian distribution and measures how far each data
point deviates from the sample mean, expressed in units of the standard
deviation.
• To calculate the Z-score for an observation, take the raw value, subtract
the mean, and then divide by the standard deviation:
z = (x - mean) / standard deviation.
• When the data does not follow a Gaussian distribution, transformations
such as scaling are sometimes applied first. Python libraries such as
Scikit-Learn and SciPy have built-in functions that make these
transformations easy to implement.
• A positive Z-score indicates that the object lies above the mean, whereas
a negative Z-score indicates that it lies below the mean, in both cases by
that number of standard deviations.
• A threshold is applied to the Z-score (a common choice is an absolute
value of about 3). It is unusual for the Z-score to be far away from zero,
and such unusual deviations from zero help us determine the outliers.
• In the case of a parametric distribution in a feature space of low
dimensions, the Z-score is a simple and effective method for removing
outliers from a dataset.
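The Z-score calculation described in the list above can be sketched with Python's standard statistics module. The function names, the sample data, and the threshold of 3 are assumptions chosen for illustration; the population standard deviation is used here, and the sample standard deviation would be an equally reasonable choice.

```python
import statistics


def z_scores(values):
    """Z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]


def z_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the (assumed) threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical measurements clustered around 10, plus one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 12, 10, 40]
print(z_outliers(data))
```

Note that in very small samples no point can reach a large Z-score, because the extreme value itself inflates the standard deviation, so a threshold of 3 is only meaningful with enough observations.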
4. DBSCAN
This method is a clustering approach; the name stands for Density-Based
Spatial Clustering of Applications with Noise. Clustering methods are
convenient for visualizing and understanding data, and they can be used to
represent graphically the relationships between the features and the trends
in the dataset. A cluster identified in a feature space through this method
is a set of points connected through 'density'. An outlier is a point that
does not belong to any cluster and is not 'density-connected' to other
points. A cluster must satisfy two properties: all points in the cluster are
mutually density-connected, and any point that is density-reachable from a
point in the cluster is also part of the cluster.
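To make the density idea concrete, here is a small, unoptimized DBSCAN sketch in plain Python. It is a teaching illustration rather than a library implementation: the parameter names eps (neighborhood radius) and min_pts (minimum number of neighbors for a dense point), their values, and the sample points are assumptions for this example. Points labeled -1 end up in no cluster and are the outlier candidates.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep expanding
    return labels


# Two hypothetical dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The result depends strongly on eps and min_pts; with a much larger eps the isolated point would be absorbed into a cluster, so these values usually need tuning for each dataset.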
5. Isolation Forests
This is a widely used method built on binary trees. It relies on the outlier
points being few in number and deviating far enough to be distinguished
clearly. The algorithm picks a feature at random and then picks a random
split value between the minimum and the maximum of that feature, which
partitions the observations. Repeating this recursively builds a tree, and a
forest of such trees is built over the observations in the set. The 'path
length' of an observation is the number of splittings needed to isolate it.
An outlier is expected to have a shorter path length than the other
observations in the dataset.
The approaches for outlier analysis in data mining can also be grouped into
statistical methods, such as graphing, the Z-score, and the Grubbs test;
supervised methods, which use training sets of data with labeled instances
to identify classes within the data; and unsupervised methods, where there
are no labeled instances and the predictions are based on the assumption
that the majority of instances in the dataset are normal.
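The random-splitting idea behind Isolation Forests can be illustrated with a simplified sketch in plain Python. It is not the full algorithm as implemented in libraries such as Scikit-Learn, which build the trees once on random subsamples and normalize the scores; here, as an assumption made to keep the example short, each point is simply isolated again and again with fresh random splits and its average path length is reported. All names, the number of trees, and the data are hypothetical.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:                                 # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                  # random value in [min, max]
    # Keep only the side of the split that still contains the point.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average path length of `point` over many random isolation runs."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical data: nine points in the unit square and one far away.
data = [(0.1, 0.3), (0.4, 0.2), (0.2, 0.6), (0.5, 0.5), (0.3, 0.8),
        (0.7, 0.4), (0.6, 0.9), (0.8, 0.7), (0.9, 0.1), (10.0, 10.0)]
outlier_len = mean_path_length((10.0, 10.0), data)
inlier_len = mean_path_length((0.5, 0.5), data)
print("outlier:", outlier_len, "inlier:", inlier_len)
```

The distant point is usually cut off by the very first split, while a point inside the dense group needs several splits, which is why the shorter average path length marks the outlier.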
6. Using the Interquartile Range to Create Outlier Fences
An outlier boxplot is a variation of the skeletal boxplot whose whiskers
extend to the most distant observation within 1.5 x IQR of the quartiles.
Observations farther than 1.5 x IQR from the quartiles are identified as
possible outliers. The interquartile range, IQR = Q3 - Q1, shows how the
middle half of the data is spread about the median.
Using the Interquartile Rule to Find Outliers: The interquartile range can be
used to detect outliers. Compute the lower fence Q1 - 1.5 x IQR and the upper
fence Q3 + 1.5 x IQR; observations outside these fences are treated as
outliers.
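The interquartile rule can be sketched with Python's standard statistics module. The function names and the sample readings are made up for the example, and the quartiles are computed with the module's "inclusive" method, which is one of several common quartile conventions; other conventions can give slightly different fences.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Lower and upper outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [x for x in values if x < low or x > high]


# Hypothetical readings: most values sit near 50, while 2 and 98 do not.
readings = [51, 49, 50, 98, 52, 48, 2, 50]
print("fences:  ", iqr_fences(readings))
print("outliers:", iqr_outliers(readings))
```

Because the quartiles are barely affected by a few extreme values, this rule is less sensitive to the outliers themselves than a rule based on the mean and standard deviation.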
Conclusion
In this article we have discussed what outliers are in data mining and what
outlier analysis involves. Outliers are usually discarded because they can
lead to wrong predictions during data analysis. Yet there are scenarios where
the outliers themselves are the point of interest, for example, in the
detection of fraud. Either way, detecting outliers is significant in data
mining, and we have discussed several methods for detecting outliers of
different types. Data mining is an integral part of our digital lives and
outliers are a major part of it. For deeper learning, you can check out the
Skillslash Data Science Course in Bangalore, Full Stack Developer Course in
Bangalore, and other courses, which offer coaching and a learning experience
with a 100% placement guarantee.
