Outlier Handling
Data Science
Copyright 2018, Vasu Bajaj
Outliers
Outliers are values which are either smaller or greater than the majority of samples in a population.
Let's talk about the measures of central tendency [Mean, Median, Mode].
Let's assume a sample A = [1,1,2,3,4,5,6,7]
Mode(A): (most frequent value) 1
Mean(A): (average) Sum(A)/n = 29/8 ≈ 3.6
Median(A): (sort A, then take the ((n+1)/2)th term if n is odd; if n is even, average the (n/2)th and
(n/2+1)th terms) (3+4)/2 = 3.5
Let's add an outlier [100] to the sample A.
Hence, the new sample becomes A = [1,1,2,3,4,5,6,7,100]
Mode(A): 1
Mean(A): 129/9 ≈ 14.3
Median(A): 4
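As a quick check, a minimal sketch using Python's built-in statistics module reproduces these numbers:
#Quick check: effect of a single outlier on mode, mean and median
from statistics import mean, median, mode
A = [1, 1, 2, 3, 4, 5, 6, 7]
print(mode(A), mean(A), median(A))    # 1  3.625  3.5
A = A + [100]                         # add the outlier
print(mode(A), mean(A), median(A))    # 1  14.33...  4 -> the mean shifts sharply, the median barely moves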
What are outliers and why do they matter?
Outlier Visualization
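A box plot is the usual way to visualize this; a minimal sketch, assuming matplotlib and the sample above:
#Minimal sketch: box plot of the sample with the outlier
import matplotlib.pyplot as plt
A = [1, 1, 2, 3, 4, 5, 6, 7, 100]
plt.boxplot(A)          # the value 100 shows up as a flier beyond the whiskers
plt.title('Outlier Visualization')
plt.show()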
Outlier Handling
Median Replacement
Upper Value Capping
Lower Value Capping
Outlier Handling - Median Replacement
Imputation here means replacing the outlier values (with the column median) rather than deleting the records.
Normally, records beyond mean +/- 1.5*IQR are considered outliers (the snippet below uses 1.5 * standard deviation around the mean).
But please note that the 1.5 multiplier is entirely use-case and business driven.
#Python Code Snippet (assumes df is a pandas DataFrame with a numeric column 'col')
import numpy as np
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Median Replacement: impute values outside the limits with the column median
df['col'] = np.where(df['col'] > upper_limit, df['col'].median(),
            np.where(df['col'] < lower_limit, df['col'].median(), df['col']))
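Since the 1.5 multiplier is more commonly applied to the interquartile range (IQR) than to the standard deviation, an alternative sketch of the limits, assuming the same DataFrame, could be:
#Alternative sketch: IQR-based limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
q1, q3 = df['col'].quantile(0.25), df['col'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr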
Outlier Handling - Upper Capping
#Python Code Snippet
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Upper Capping: replace values above the upper limit with the upper limit itself
df['col'] = np.where(df['col'] > upper_limit, upper_limit, df['col'])
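For reference, pandas' built-in clip() achieves the same capping in a single call; a minimal sketch, assuming the limits computed above:
#Minimal sketch: clip() caps values outside the given bounds in one call
df['col'] = df['col'].clip(upper=upper_limit)                      # upper capping only
df['col'] = df['col'].clip(lower=lower_limit, upper=upper_limit)   # cap both tails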
Outlier Handling - Lower Capping
#Python Code Snippet
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Lower Capping: replace values below the lower limit with the lower limit itself
df['col'] = np.where(df['col'] < lower_limit, lower_limit, df['col'])
Points to note..
Please note that outlier removal/imputation is purely a business-insight-driven step. Hence, you should contact
either the business or senior folks who can provide the details behind such an observation and help you conclude.
It is also possible that a default value substituted for a missing value is itself an outlier, like 99999 for a
numerical column; these need to be treated based on data availability and the business use case.
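For example, a minimal sketch of converting such a sentinel default back to a proper missing value before any outlier treatment, assuming 99999 is the placeholder:
#Sketch: treat the sentinel default 99999 as missing before handling outliers
import numpy as np
df['col'] = df['col'].replace(99999, np.nan)
df['col'] = df['col'].fillna(df['col'].median())   # or leave as NaN for a dedicated imputation step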
About Me
I am a data engineer with 7+ years of experience in data. I love exploring new technologies and writing blogs.
You can reach me on my social handles below:
Thank you
https://in.linkedin.com/public-profile/in/vasu-bajaj
Vasu.c.Bajaj@outlook.com