Outlier Handling
Data Science
Copyright 2018, Vasu Bajaj
Outliers
Outliers are values which are either smaller or greater than the majority of samples in a population.
Let's talk about the measures of central tendency [Mean, Median, Mode].
Let's assume a sample A = [1,1,2,3,4,5,6,7]
Mode(A): (most frequent value) 1
Mean(A): (average) Sum(A)/n = 29/8 ≈ 3.6
Median(A): (sort A, then take the ((n+1)/2)th term if n is odd; if n is even, average the (n/2)th and
(n/2+1)th terms) (3+4)/2 = 3.5
Let's add an outlier [100] to the sample A.
Hence, the new sample becomes A = [1,1,2,3,4,5,6,7,100]
Mode(A): 1
Mean(A): 129/9 ≈ 14.3
Median(A): 4
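As a quick check, a minimal sketch using Python's built-in statistics module reproduces these numbers:
#Quick check: effect of a single outlier on mode, mean and median
from statistics import mean, median, mode
A = [1, 1, 2, 3, 4, 5, 6, 7]
print(mode(A), mean(A), median(A))    # 1  3.625  3.5
A = A + [100]                         # add the outlier
print(mode(A), mean(A), median(A))    # 1  14.33...  4 -> the mean shifts sharply, the median barely moves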
What are outliers and why do they matter?
Outlier Visualization
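A box plot is the usual way to visualize this; a minimal sketch, assuming matplotlib and the sample above:
#Minimal sketch: box plot of the sample with the outlier
import matplotlib.pyplot as plt
A = [1, 1, 2, 3, 4, 5, 6, 7, 100]
plt.boxplot(A)          # the value 100 shows up as a flier beyond the whiskers
plt.title('Outlier Visualization')
plt.show()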
Outlier Handling
Median Replacement
Upper Value Capping
Lower Value Capping
Outlier Handling - Median Replacement
Imputation here means replacing the outlier values (with the column median) rather than deleting the records.
Normally, records beyond mean +/- 1.5*IQR are considered outliers (the snippet below uses 1.5 * standard deviation around the mean).
But please note that the 1.5 multiplier is entirely use-case and business driven.
#Python Code Snippet (assumes df is a pandas DataFrame with a numeric column 'col')
import numpy as np
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Median Replacement: impute values outside the limits with the column median
df['col'] = np.where(df['col'] > upper_limit, df['col'].median(),
            np.where(df['col'] < lower_limit, df['col'].median(), df['col']))
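Since the 1.5 multiplier is more commonly applied to the interquartile range (IQR) than to the standard deviation, an alternative sketch of the limits, assuming the same DataFrame, could be:
#Alternative sketch: IQR-based limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
q1, q3 = df['col'].quantile(0.25), df['col'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr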
Outlier Handling - Upper Capping
#Python Code Snippet
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Upper Capping: replace values above the upper limit with the upper limit itself
df['col'] = np.where(df['col'] > upper_limit, upper_limit, df['col'])
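For reference, pandas' built-in clip() achieves the same capping in a single call; a minimal sketch, assuming the limits computed above:
#Minimal sketch: clip() caps values outside the given bounds in one call
df['col'] = df['col'].clip(upper=upper_limit)                      # upper capping only
df['col'] = df['col'].clip(lower=lower_limit, upper=upper_limit)   # cap both tails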
Outlier Handling - Lower Capping
#Python Code Snippet
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
#Lower Capping: replace values below the lower limit with the lower limit itself
df['col'] = np.where(df['col'] < lower_limit, lower_limit, df['col'])
Points to note..
Please note that outlier removal/imputation is purely a business-insight-driven step. Hence, you should contact
either the business or senior folks who can provide the details behind such an observation and help you conclude.
It is also possible that a default value substituted for a missing value is itself an outlier, like 99999 for a
numerical column; these need to be treated based on data availability and the business use case.
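For example, a minimal sketch of converting such a sentinel default back to a proper missing value before any outlier treatment, assuming 99999 is the placeholder:
#Sketch: treat the sentinel default 99999 as missing before handling outliers
import numpy as np
df['col'] = df['col'].replace(99999, np.nan)
df['col'] = df['col'].fillna(df['col'].median())   # or leave as NaN for a dedicated imputation step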
About Me
I am a data engineer with 7+ years of experience in data. I love exploring new technologies and writing blogs.
You can reach me on my social handles below:
Thank you
https://in.linkedin.com/public-profile/in/vasu-bajaj
Vasu.c.Bajaj@outlook.com