DATA SCIENCE - Outlier Detection and Treatment - Sachin Pathania
Sachin Pathania
OUTLIER DETECTION AND TREATMENT
Introduction to Outlier Treatment
Handling outliers is an important part of data pre-processing. Outliers in
the data can distort the results of an analysis, and how strongly they do so
depends on the data. Outlier treatment is the process of detecting these
values and removing them. First, we need to understand what an outlier is.
What is an Outlier?
An outlier is a value that behaves differently from the other observations,
or, put simply:
“A value that lies outside the data”
Example: A new coach has been working with the Long Jump team this month,
and the athletes' performance has changed.
Augustus: +0.15m
Tom: +0.11m
June: +0.06m
Carol: +0.06m
Bob: +0.12m
Sam: -0.56m
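Every athlete except Sam improved by between 0.06m and 0.15m, while Sam's
result dropped by 0.56m. So here, Sam is an outlier.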
[Figure: the results plotted on a number line]
The following are two methods for detecting and removing outliers:
Interquartile Range (IQR)
Z-Score
Only the IQR method is used here.
Interquartile Range (IQR)
Quartiles divide a sorted distribution into four equal parts. The
Interquartile Range (IQR) is the width of the middle two parts, the region
where most of the values lie. A box plot uses the IQR to detect the outliers
in the data.
The following quartiles are used to identify the IQR:
1st quartile (Q1): the value below which 25% of the data lies
3rd quartile (Q3): the value below which 75% of the data lies
2nd quartile (Q2): divides the distribution into two equal parts of 50%,
so it is the same as the median.
The interquartile range is defined as the difference between the third and the
first quartile. In other words, IQR equals Q3 minus Q1.
Formula: IQR = Q3 - Q1
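The quartiles and the IQR can be checked on the long jump example from earlier. The following is only a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the choice of `method="inclusive"` (linear interpolation between neighbouring values) is an assumption, since different tools compute quartiles in slightly different ways and can give slightly different numbers.

```python
from statistics import quantiles

# Changes in jump distance (metres) from the long jump example.
changes = [0.15, 0.11, 0.06, 0.06, 0.12, -0.56]

# n=4 asks for the three cut points that split the data into quartiles.
# method="inclusive" is an assumed convention: it interpolates linearly
# between neighbouring sorted values.
q1, q2, q3 = quantiles(changes, n=4, method="inclusive")

# IQR = Q3 - Q1
iqr = q3 - q1

print(f"Q1={q1:.4f}  Q2={q2:.4f}  Q3={q3:.4f}  IQR={iqr:.4f}")
```

Under this convention Q2 matches the median of the six values, as described above.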
Identify the Outliers Using IQR Method
As a rule of thumb, observations can be qualified as outliers when they lie
more than 1.5 * IQR below the first quartile or more than 1.5 * IQR above the
third quartile. Outliers are values that “lie outside” the other values.
These two limits are the lower bound (LB) and the upper bound (UB):
LB = Q1 - 1.5 * IQR
UB = Q3 + 1.5 * IQR
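As a rough illustration, the LB and UB rule can be applied to the long jump example. This is a sketch under stated assumptions, not a definitive implementation: `iqr_bounds` is a hypothetical helper name, and the quartiles come from the standard library with `method="inclusive"`, so other quartile conventions may give slightly different bounds.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (LB, UB) = (Q1 - k * IQR, Q3 + k * IQR).

    Hypothetical helper for illustration; k=1.5 is the rule of thumb.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Changes in jump distance (metres) from the long jump example.
changes = {
    "Augustus": 0.15,
    "Tom": 0.11,
    "June": 0.06,
    "Carol": 0.06,
    "Bob": 0.12,
    "Sam": -0.56,
}

lb, ub = iqr_bounds(list(changes.values()))

# A value is flagged when it falls below LB or above UB.
outliers = {name: v for name, v in changes.items() if v < lb or v > ub}

print(f"LB={lb:.5f}  UB={ub:.5f}")
print(outliers)
```

With these six values only Sam falls outside the bounds, which agrees with the example.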
Outlier Treatment using IQR in Python: