Outlier Handling
Data Science
Copyright 2018, Vasu Bajaj
Outliers
Outliers are values that are much smaller or much larger than the majority of samples in a population.
Let's look at the measures of central tendency [Mean, Median, Mode].
Let's assume a sample A = [1,1,2,3,4,5,6,7]
Mode(A): (most frequent value) 1
Mean(A): (average) Sum(A)/n = 29/8 = 3.625
Median(A): (sort A, then take the ((n+1)/2)th term if n is odd, or the average of the (n/2)th and (n/2+1)th terms if n is even) (3+4)/2 = 3.5
Let's add an outlier [100] to the sample A
Hence, the new sample becomes A = [1,1,2,3,4,5,6,7,100]
Mode(A): 1
Mean(A): 129/9 ~ 14.3
Median(A): 4
A single outlier nearly quadruples the mean, while the median barely moves and the mode does not change.
What are outliers and why do they matter?
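As a quick check of the numbers on this slide, here is a minimal sketch using Python's standard `statistics` module. The helper name `summarize` is made up for illustration; the two samples are the ones from the slide.

```python
import statistics

# Sample A from the slide, before and after adding the outlier 100
a = [1, 1, 2, 3, 4, 5, 6, 7]
a_with_outlier = a + [100]


def summarize(sample):
    """Return the three measures of central tendency for a list of numbers."""
    return {
        "mode": statistics.mode(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
    }


before = summarize(a)
after = summarize(a_with_outlier)

print(before)  # mode 1, mean 3.625, median 3.5
print(after)   # mode 1, mean about 14.33, median 4
```

The mean is the only one of the three measures that shifts substantially, which is why the median is often preferred as a replacement value.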
Outlier Visualization
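The chart from this slide is not reproduced here. As an assumption about what such a chart would show, the sketch below computes the quantities a box plot draws: the quartiles, the interquartile range (IQR), and the 1.5*IQR fences beyond which points are usually plotted as outliers. It uses NumPy's `percentile` with its default linear interpolation and the sample A (with the outlier 100) from the earlier slide.

```python
import numpy as np

# Sample A including the outlier, as on the earlier slide
a = np.array([1, 1, 2, 3, 4, 5, 6, 7, 100])

# First and third quartiles (default linear interpolation)
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1

# Box-plot fences: points outside these are usually drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = a[(a < lower_fence) | (a > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Other quartile conventions give slightly different fences on small samples, so the exact cut-off is a choice rather than a fixed rule.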
Outlier Handling
Median Replacement
Upper Value Capping
Lower Value Capping
Outlier Handling - Median Replacement
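Imputation means replacing the outlier values with a substitute (here, the median) rather than deleting the records.
A common rule flags records outside mean +/- 1.5*SD (standard deviation) as outliers; the box-plot convention instead uses Q1 - 1.5*IQR and Q3 + 1.5*IQR (IQR = interquartile range).
Note that the 1.5 multiplier is entirely use-case and business-case driven.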
# Python code snippet
import numpy as np  # df is assumed to be an existing pandas DataFrame
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
# Median replacement: values outside the limits become the column median
median = df['col'].median()
df['col'] = np.where((df['col'] > upper_limit) | (df['col'] < lower_limit),
                     median, df['col'])
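To see median replacement end to end, here is a small self-contained sketch. It assumes the limits are mean +/- 1.5 standard deviations (pandas' `std()`, which uses the sample formula), and it reuses sample A with the outlier 100 as made-up data in a column named `col`, following the slide's naming.

```python
import numpy as np
import pandas as pd

# Illustrative data: sample A plus the outlier 100
df = pd.DataFrame({"col": [1, 1, 2, 3, 4, 5, 6, 7, 100]})

# Limits assumed to be mean +/- 1.5 standard deviations
mean = df["col"].mean()
std = df["col"].std()
lower_limit = mean - 1.5 * std
upper_limit = mean + 1.5 * std

# Compute the median once, before any value is replaced
median = df["col"].median()

# Replace anything outside the limits with the median
is_outlier = (df["col"] > upper_limit) | (df["col"] < lower_limit)
df["col"] = np.where(is_outlier, median, df["col"])

print(df["col"].tolist())
```

Only the value 100 falls outside the limits here, so it is the only one replaced by the median (4).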
Outlier Handling - Upper Capping
# Python code snippet
import numpy as np  # df is assumed to be an existing pandas DataFrame
upper_limit = df['col'].mean() + 1.5 * df['col'].std()
# Upper capping: values above upper_limit are set to upper_limit
df['col'] = np.where(df['col'] > upper_limit, upper_limit, df['col'])
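As an alternative to `np.where`, upper capping can also be sketched with pandas' `Series.clip`. This is a swapped-in technique, not the slide's own code; the data is sample A with the outlier 100, and the limit is assumed to be mean + 1.5 standard deviations.

```python
import pandas as pd

# Illustrative data: sample A plus the high outlier 100
df = pd.DataFrame({"col": [1, 1, 2, 3, 4, 5, 6, 7, 100]})

# Upper limit assumed to be mean + 1.5 standard deviations
upper_limit = df["col"].mean() + 1.5 * df["col"].std()

# Cap only the upper side; values below the limit are left untouched
capped = df["col"].clip(upper=upper_limit)

print(round(upper_limit, 2))
print(capped.tolist())
```

Unlike median replacement, capping keeps the outlier at the extreme end of the distribution, just less extreme than before.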
Outlier Handling - Lower Capping
# Python code snippet
import numpy as np  # df is assumed to be an existing pandas DataFrame
lower_limit = df['col'].mean() - 1.5 * df['col'].std()
# Lower capping: values below lower_limit are set to lower_limit
df['col'] = np.where(df['col'] < lower_limit, lower_limit, df['col'])
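Lower capping works the same way on the other tail. The sketch below uses made-up data with a low outlier (-100, not from the slides) and pandas' `Series.clip` with only the `lower` argument; the limit is again assumed to be mean - 1.5 standard deviations.

```python
import pandas as pd

# Illustrative data with a low outlier (-100 is a hypothetical value)
df = pd.DataFrame({"col": [-100, 1, 1, 2, 3, 4, 5, 6, 7]})

# Lower limit assumed to be mean - 1.5 standard deviations
lower_limit = df["col"].mean() - 1.5 * df["col"].std()

# Cap only the lower side; values above the limit are left untouched
capped = df["col"].clip(lower=lower_limit)

print(round(lower_limit, 2))
print(capped.tolist())
```

If both tails need capping, `clip` accepts `lower` and `upper` together in a single call.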
Points to note..
1. Outlier removal or imputation is purely a business-insight-driven step. Contact the business team or senior colleagues who can give you the details behind such an observation and help you reach a conclusion.
2. A default value substituted for a missing value, such as 99999 in a numeric column, can itself show up as an outlier. Such values need to be treated based on data availability and the business use case.
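The point about default values can be illustrated with a short sketch. It assumes 99999 is used as a placeholder for missing values in a column named `col` (both illustrative), and it marks those entries as missing before any statistics are computed, so the placeholder does not distort the mean or the outlier limits.

```python
import pandas as pd

# Assumed placeholder for "missing" in this illustrative column
SENTINEL = 99999

df = pd.DataFrame({"col": [1, 1, 2, 3, SENTINEL, 4, 5, 6, 7]})

# Mean with the placeholder treated as a real value
mean_with_sentinel = df["col"].mean()

# Turn placeholder entries into NaN; pandas skips NaN in mean() by default
cleaned = df["col"].mask(df["col"] == SENTINEL)
mean_without_sentinel = cleaned.mean()

print(mean_with_sentinel, mean_without_sentinel)
```

Whether the missing entries are then dropped, imputed, or left as missing is the business decision the slide refers to.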
About Me
I am a data engineer with 7+ years of experience in data.
I love exploring new technologies and writing blogs.
You can reach me through the handles below:
Thank you
https://in.linkedin.com/public-profile/in/vasu-bajaj
Vasu.c.Bajaj@outlook.com
