2. Outliers
Copyright 2018, Vasu Bajaj
Outliers are values that are much smaller or much larger than the majority of samples in a population.
Let's look at the measures of central tendency: mean, median, and mode.
Let's assume a sample A = [1,1,2,3,4,5,6,7]
Mode(A): (most frequent sample) 1
Mean(A): (average) Sum(A)/n = 29/8 ≈ 3.6
Median(A): (sort A, then take the ((n+1)/2)th term if n is odd, or the average of the (n/2)th and
(n/2+1)th terms if n is even) (3+4)/2 = 3.5
Let's add an outlier [100] to the sample A
Hence, the new sample becomes A = [1,1,2,3,4,5,6,7,100]
Mode(A): 1
Mean(A): 129/9 ≈ 14.3
Median(A): 4
The single outlier drags the mean up sharply, while the median and mode barely change.
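The comparison above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `summarize` is made up for illustration, and the mean is rounded to one decimal place.

```python
from statistics import mean, median, mode

A = [1, 1, 2, 3, 4, 5, 6, 7]
A_with_outlier = A + [100]

def summarize(sample):
    """Return the three central tendency measures for a sample."""
    return {
        "mode": mode(sample),
        "mean": round(mean(sample), 1),
        "median": median(sample),
    }

before = summarize(A)
after = summarize(A_with_outlier)

print(before)
print(after)
```

Only the mean reacts strongly to the added value, which is why the median is a common replacement value for outliers.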
What are outliers and why do they matter?
5. Outlier Handling - Median Replacement
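Imputation means replacing the outlier values with a substitute, such as the median, instead of deleting the records
Normally, records outside mean +/- 1.5*IQR (interquartile range) are considered outliers; the snippet below uses the standard deviation as the measure of spread instead
But please note that the 1.5 multiplier is entirely use case and business
case driven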
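# Python code snippet (assumes df is an existing pandas DataFrame)
import numpy as np

# Limits: mean +/- 1.5 * standard deviation of the column
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()

# Median replacement: values outside the limits become the column median
median = df['col'].median()
df['col'] = np.where((df['col'] > upper_limit) | (df['col'] < lower_limit),
                     median, df['col'])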
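To try median replacement end to end, here is a small self-contained sketch on the sample from the earlier slide. The function name `replace_outliers_with_median` and the parameter `k` are made up for illustration. Measuring the spread with the standard deviation is an assumption based on the "sd" hint in the snippet, and pandas computes the sample standard deviation by default.

```python
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside mean +/- k * std with the series median."""
    center = series.mean()
    spread = series.std()  # pandas default: sample standard deviation (ddof=1)
    lower_limit = center - k * spread
    upper_limit = center + k * spread
    is_outlier = (series < lower_limit) | (series > upper_limit)
    # Keep values inside the limits, substitute the median elsewhere
    return series.where(~is_outlier, series.median())

df = pd.DataFrame({"col": [1, 1, 2, 3, 4, 5, 6, 7, 100]})
df["col"] = replace_outliers_with_median(df["col"])

print(df["col"].tolist())
```

On this sample the value 100 lies above the upper limit and is replaced by the median, 4. The multiplier `k` is exposed as a parameter because the right threshold depends on the use case.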
8. Points to note
1. Outlier removal or imputation is purely a business-insight-driven step. Hence, you should contact
the business team or senior colleagues who can explain such an observation and help you reach a conclusion.
2. A default value substituted for a missing value, such as 99999 for a numerical field, may itself show up as an outlier.
Such values need to be treated based on data availability and the business use case.
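One possible way to handle such a placeholder is to convert it back to a missing value before computing any statistics. The sketch below assumes that 99999 is the agreed placeholder and that the column is named `col`. Both are illustrative, and the right treatment still depends on the business use case.

```python
import numpy as np
import pandas as pd

SENTINEL = 99999  # assumed placeholder that stands in for "missing"

df = pd.DataFrame({"col": [1, 1, 2, 3, 4, 5, 6, 7, SENTINEL]})

# Work in float so the column can hold NaN, then mark the placeholder as missing
df["col"] = df["col"].astype("float64").replace(SENTINEL, np.nan)

missing_count = int(df["col"].isna().sum())
median_without_sentinel = df["col"].median()  # NaN values are skipped by default

print(missing_count, median_without_sentinel)
```

With the placeholder marked as missing, it no longer distorts the mean or the outlier limits, and it can be imputed or dropped like any other missing value.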
9. About Me
I am a data engineer with 7+ years of experience in
data. I love exploring new technologies and writing blogs.
You can reach me on my social handles below:
https://in.linkedin.com/public-profile/in/vasu-bajaj
Vasu.c.Bajaj@outlook.com
Thank you