SlideShare a Scribd company logo
Manipulating Data - II
Missing value treatment
Rupak Roy
 is.na(): is a generic function that indicates which elements are missing. It
uses logical indicator TRUE FALSE to indicate the missing values.
#load the data
>mdata<-read.csv(file.choose(),header = TRUE, na.strings = c("","NA"))
>is.na(mdata) #to identify any missing values
>mdata1<-mdata #keeping backup
#we can also add functions like sum() to calculate total missing values
>sum(is.na(mdata))
Missing Value Treatment using is.na()
 Two Simple methods:
#if we are aware of the missing values we can directly impute a value
>mdata$TransAmt3[is.na(mdata$TransAmt3)]<- 52
#else we can also use the average value to impute the missing values
>mdata$TransAmt3[is.na(mdata$TransAmt3)]<-mean(mdata1$TransAmt3, na.rm
= TRUE) #where na.rm indicates to remove the NA values and execute.
>summary(mdata)
Imputing the missing values
Rupak Roy
#a quick summary of the data distributed for the TransAmt2 column
>summary(mdata1$TransAmt2)
#even visualize the data distribution to get a clear picture
>boxplot(mdata1$TransAmt2)
#further breakup the data distribution in percentage
>quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE)
#from this 3 methods we are getting a clear picture that half way of the total data
the average median value is 20 then at 75% the average value is 85. So almost 75% of
the data contains an average value 20+85/2= 52.5
Let’s do a sanity check before we can conclude an average value for the missing
values. In the boxplot diagram and the summary there’s a sudden spike in the values
from 85 to 6783 which looks very unusual. Common reasons behind this is a chance
of human error. So let’s remove the outlier and redo the steps to see any difference
Impute missing values: numeric
variables
#saving the position of the values whose TransAmt>=2000
>index<-which(mdata1$TransAmt2>=2000)
>mdata1<-mdata1[-index,] #removing the values >=2000(outliers)
>View(mdata1)
>summary(mdata1$TransAmt2)
>boxplot(mdata1$TransAmt2)
#again breakup the data distributed in percentage
>quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE)
Impute missing values: numeric
variables
Rupak Roy
#breakup the data distribution from 10%
>quantile(mdata1$TransAmt2,p=(10:100)/100,na.rm = TRUE)
We can see the pattern of data distribution remains the same. Therefore we can
conclude that the 75% of the data contains an average value of 52.5
#impute the missing values with 52.5
>mdata$TransAmt2[is.na(mdata$TransAmt2)]<-52.5
>sum(is.na(mdata$TransAmt2))
Impute missing values: numeric
variables
Rupak Roy
>summary(mdata1$Department)
#Or find the frequency for each factors of Departments
>table1<-table(mdata$Department, useNA = "always")
#convert into class table table1 into dataframe
>table1df<-data.frame(table1)
>View(table1df)
#add the rate of frequency for each factors
of Departments
>table1df$rate<-table1df$Freq/sum(table1df$Freq)
>quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = T)
Impute missing values:
character/factor variables
>quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = T)
#from the quantile output we can observe that 75% of the departments
occurred with an estimate of 5000times and 25% with an estimate of
8000times. So again we take an average estimate of 2629+5060/2 = 3845
Because from 0-25% it has 1503 times,
25-50%: 1126 times and
from 50-75%:2431 times. Hence we will conclude a value(dept.) which
is closet to 3845.
#filter the data based on frequency range from 3000 to 4000
>table1df_v<-table1df[table1df$Freq>=3000 & table1df$Freq<=4000,]
>View(table1df_v)
Impute missing values:
character/factor variables
Rupak Roy
#impute the missing values with “Storage & Organization”
>mdata$Department[is.na(mdata$Department)]<-"Storage & Organization”
#else we can categorize the missing values as missing
>mdata$Department[is.na(mdata$Department)]<-“missing"
Impute missing values:
character/factor variables
Rupak Roy
Next:
Transpose, Manipulating Character Strings, Pattern Matching and
Replacement.
Manipulating Data
Rupak Roy

More Related Content

What's hot

SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic Statistics
Avjinder (Avi) Kaler
 
Array BPK 2
Array BPK 2Array BPK 2
Array BPK 2
Riki Afriansyah
 
Merging tables using R
Merging tables using R Merging tables using R
Merging tables using R
Rupak Roy
 
Day 1b R structures objects.pptx
Day 1b   R structures   objects.pptxDay 1b   R structures   objects.pptx
Day 1b R structures objects.pptx
Adrien Melquiond
 
Day 1d R structures & objects: matrices and data frames.pptx
Day 1d   R structures & objects: matrices and data frames.pptxDay 1d   R structures & objects: matrices and data frames.pptx
Day 1d R structures & objects: matrices and data frames.pptx
Adrien Melquiond
 
R factors
R   factorsR   factors
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors
krishna singh
 
Day 2 repeats.pptx
Day 2 repeats.pptxDay 2 repeats.pptx
Day 2 repeats.pptx
Adrien Melquiond
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
Tim Essam
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
Laura Hughes
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r
Ashwini Mathur
 
Data preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and PythonData preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and Python
Akhilesh Joshi
 
DPLYR package in R
DPLYR package in RDPLYR package in R
DPLYR package in R
Bimba Pawar
 
Day 2b i/o.pptx
Day 2b   i/o.pptxDay 2b   i/o.pptx
Day 2b i/o.pptx
Adrien Melquiond
 
Day 5b statistical functions.pptx
Day 5b   statistical functions.pptxDay 5b   statistical functions.pptx
Day 5b statistical functions.pptx
Adrien Melquiond
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
Rsquared Academy
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
Florian Uhlitz
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
Laura Hughes
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processing
Tim Essam
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
senthil0809
 

What's hot (20)

SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic Statistics
 
Array BPK 2
Array BPK 2Array BPK 2
Array BPK 2
 
Merging tables using R
Merging tables using R Merging tables using R
Merging tables using R
 
Day 1b R structures objects.pptx
Day 1b   R structures   objects.pptxDay 1b   R structures   objects.pptx
Day 1b R structures objects.pptx
 
Day 1d R structures & objects: matrices and data frames.pptx
Day 1d   R structures & objects: matrices and data frames.pptxDay 1d   R structures & objects: matrices and data frames.pptx
Day 1d R structures & objects: matrices and data frames.pptx
 
R factors
R   factorsR   factors
R factors
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors
 
Day 2 repeats.pptx
Day 2 repeats.pptxDay 2 repeats.pptx
Day 2 repeats.pptx
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r
 
Data preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and PythonData preprocessing for Machine Learning with R and Python
Data preprocessing for Machine Learning with R and Python
 
DPLYR package in R
DPLYR package in RDPLYR package in R
DPLYR package in R
 
Day 2b i/o.pptx
Day 2b   i/o.pptxDay 2b   i/o.pptx
Day 2b i/o.pptx
 
Day 5b statistical functions.pptx
Day 5b   statistical functions.pptxDay 5b   statistical functions.pptx
Day 5b statistical functions.pptx
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
Next Generation Programming in R
Next Generation Programming in RNext Generation Programming in R
Next Generation Programming in R
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processing
 
Get started with R lang
Get started with R langGet started with R lang
Get started with R lang
 

Similar to Handling Missing Values

R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
Avjinder (Avi) Kaler
 
R programming
R programmingR programming
R programming
Pramodkumar Jha
 
bobok
bobokbobok
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
Dennis
 
A brief introduction to apply functions
A brief introduction to apply functionsA brief introduction to apply functions
A brief introduction to apply functions
NIKET CHAURASIA
 
R console
R consoleR console
R console
Ananth Raj
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdf
annikasarees
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
Rajib Layek
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
chhabria-nitesh
 
Python-Tuples
Python-TuplesPython-Tuples
Python-Tuples
Krishna Nanda
 
Basic R Data Manipulation
Basic R Data ManipulationBasic R Data Manipulation
Basic R Data Manipulation
Chu An
 
Programming in R
Programming in RProgramming in R
Programming in R
Smruti Sarangi
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
062MayankSinghal
 
R tutorial (R program 101)
R tutorial (R program 101)R tutorial (R program 101)
R tutorial (R program 101)
Gregory Choi, MBA, CISSP
 
you need to complete the r code and a singlepage document c.pdf
you need to complete the r code and a singlepage document c.pdfyou need to complete the r code and a singlepage document c.pdf
you need to complete the r code and a singlepage document c.pdf
adnankhan605720
 
Intro to matlab
Intro to matlabIntro to matlab
Intro to matlab
Norhan Mohamed
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
Mohammed El Rafie Tarabay
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
Dr. Volkan OBAN
 
In the binary search, if the array being searched has 32 elements in.pdf
In the binary search, if the array being searched has 32 elements in.pdfIn the binary search, if the array being searched has 32 elements in.pdf
In the binary search, if the array being searched has 32 elements in.pdf
arpitaeron555
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2
Kevin Chun-Hsien Hsu
 

Similar to Handling Missing Values (20)

R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
R programming
R programmingR programming
R programming
 
bobok
bobokbobok
bobok
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 
A brief introduction to apply functions
A brief introduction to apply functionsA brief introduction to apply functions
A brief introduction to apply functions
 
R console
R consoleR console
R console
 
R is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdfR is a very flexible and powerful programming language, as well as a.pdf
R is a very flexible and powerful programming language, as well as a.pdf
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
Python-Tuples
Python-TuplesPython-Tuples
Python-Tuples
 
Basic R Data Manipulation
Basic R Data ManipulationBasic R Data Manipulation
Basic R Data Manipulation
 
Programming in R
Programming in RProgramming in R
Programming in R
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
 
R tutorial (R program 101)
R tutorial (R program 101)R tutorial (R program 101)
R tutorial (R program 101)
 
you need to complete the r code and a singlepage document c.pdf
you need to complete the r code and a singlepage document c.pdfyou need to complete the r code and a singlepage document c.pdf
you need to complete the r code and a singlepage document c.pdf
 
Intro to matlab
Intro to matlabIntro to matlab
Intro to matlab
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
 
In the binary search, if the array being searched has 32 elements in.pdf
In the binary search, if the array being searched has 32 elements in.pdfIn the binary search, if the array being searched has 32 elements in.pdf
In the binary search, if the array being searched has 32 elements in.pdf
 
[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2[1062BPY12001] Data analysis with R / week 2
[1062BPY12001] Data analysis with R / week 2
 

More from Rupak Roy

Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLP
Rupak Roy
 
Clustering K means and Hierarchical - NLP
Clustering K means and Hierarchical - NLPClustering K means and Hierarchical - NLP
Clustering K means and Hierarchical - NLP
Rupak Roy
 
Network Analysis - NLP
Network Analysis  - NLPNetwork Analysis  - NLP
Network Analysis - NLP
Rupak Roy
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
Rupak Roy
 
Sentiment Analysis Practical Steps
Sentiment Analysis Practical StepsSentiment Analysis Practical Steps
Sentiment Analysis Practical Steps
Rupak Roy
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
Rupak Roy
 
Text Mining using Regular Expressions
Text Mining using Regular ExpressionsText Mining using Regular Expressions
Text Mining using Regular Expressions
Rupak Roy
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
Rupak Roy
 
Apache Hbase Architecture
Apache Hbase ArchitectureApache Hbase Architecture
Apache Hbase Architecture
Rupak Roy
 
Introduction to Hbase
Introduction to Hbase Introduction to Hbase
Introduction to Hbase
Rupak Roy
 
Apache Hive Table Partition and HQL
Apache Hive Table Partition and HQLApache Hive Table Partition and HQL
Apache Hive Table Partition and HQL
Rupak Roy
 
Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export
Rupak Roy
 
Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive
Rupak Roy
 
Scoop Job, import and export to RDBMS
Scoop Job, import and export to RDBMSScoop Job, import and export to RDBMS
Scoop Job, import and export to RDBMS
Rupak Roy
 
Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode
Rupak Roy
 
Introduction to scoop and its functions
Introduction to scoop and its functionsIntroduction to scoop and its functions
Introduction to scoop and its functions
Rupak Roy
 
Introduction to Flume
Introduction to FlumeIntroduction to Flume
Introduction to Flume
Rupak Roy
 
Apache PIG casting, reference
Apache PIG casting, referenceApache PIG casting, reference
Apache PIG casting, reference
Rupak Roy
 
Pig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store FunctionsPig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store Functions
Rupak Roy
 
Introduction to PIG components
Introduction to PIG components Introduction to PIG components
Introduction to PIG components
Rupak Roy
 

More from Rupak Roy (20)

Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLP
 
Clustering K means and Hierarchical - NLP
Clustering K means and Hierarchical - NLPClustering K means and Hierarchical - NLP
Clustering K means and Hierarchical - NLP
 
Network Analysis - NLP
Network Analysis  - NLPNetwork Analysis  - NLP
Network Analysis - NLP
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
Sentiment Analysis Practical Steps
Sentiment Analysis Practical StepsSentiment Analysis Practical Steps
Sentiment Analysis Practical Steps
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
 
Text Mining using Regular Expressions
Text Mining using Regular ExpressionsText Mining using Regular Expressions
Text Mining using Regular Expressions
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Apache Hbase Architecture
Apache Hbase ArchitectureApache Hbase Architecture
Apache Hbase Architecture
 
Introduction to Hbase
Introduction to Hbase Introduction to Hbase
Introduction to Hbase
 
Apache Hive Table Partition and HQL
Apache Hive Table Partition and HQLApache Hive Table Partition and HQL
Apache Hive Table Partition and HQL
 
Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export
 
Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive
 
Scoop Job, import and export to RDBMS
Scoop Job, import and export to RDBMSScoop Job, import and export to RDBMS
Scoop Job, import and export to RDBMS
 
Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode Apache Scoop - Import with Append mode and Last Modified mode
Apache Scoop - Import with Append mode and Last Modified mode
 
Introduction to scoop and its functions
Introduction to scoop and its functionsIntroduction to scoop and its functions
Introduction to scoop and its functions
 
Introduction to Flume
Introduction to FlumeIntroduction to Flume
Introduction to Flume
 
Apache PIG casting, reference
Apache PIG casting, referenceApache PIG casting, reference
Apache PIG casting, reference
 
Pig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store FunctionsPig Latin, Data Model with Load and Store Functions
Pig Latin, Data Model with Load and Store Functions
 
Introduction to PIG components
Introduction to PIG components Introduction to PIG components
Introduction to PIG components
 

Recently uploaded

Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptxPrésentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
siemaillard
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
Solutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptxSolutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptx
spdendr
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
สมใจ จันสุกสี
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
MysoreMuleSoftMeetup
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
IGCSE Biology Chapter 14- Reproduction in Plants.pdf
IGCSE Biology Chapter 14- Reproduction in Plants.pdfIGCSE Biology Chapter 14- Reproduction in Plants.pdf
IGCSE Biology Chapter 14- Reproduction in Plants.pdf
Amin Marwan
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
TechSoup
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
haiqairshad
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Leena Ghag-Sakpal
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 

Recently uploaded (20)

Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptxPrésentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
Présentationvvvvvvvvvvvvvvvvvvvvvvvvvvvv2.pptx
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
Solutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptxSolutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptx
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
คำศัพท์ คำพื้นฐานการอ่าน ภาษาอังกฤษ ระดับชั้น ม.1
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
IGCSE Biology Chapter 14- Reproduction in Plants.pdf
IGCSE Biology Chapter 14- Reproduction in Plants.pdfIGCSE Biology Chapter 14- Reproduction in Plants.pdf
IGCSE Biology Chapter 14- Reproduction in Plants.pdf
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 

Handling Missing Values

  • 1. Manipulating Data - II Missing value treatment Rupak Roy
  • 2.  is.na(): is a generic function that indicates which elements are missing. It uses logical indicator TRUE FALSE to indicate the missing values. #load the data >mdata<-read.csv(file.choose(),header = TRUE, na.strings = c("","NA")) >is.na(mdata) #to identify any missing values >mdata1<-mdata #keeping backup #we can also add functions like sum() to calculate total missing values >sum(is.na(mdata)) Missing Value Treatment using is.na()
  • 3.  Two Simple methods: #if we are aware of the missing values we can directly impute a value >mdata$TransAmt3[is.na(mdata$TransAmt3)]<- 52 #else we can also use the average value to impute the missing values >mdata$TransAmt3[is.na(mdata$TransAmt3)]<-mean(mdata1$TransAmt3, na.rm = TRUE) #where na.rm indicates to remove the NA values and execute. >summary(mdata) Imputing the missing values Rupak Roy
  • 4. #a quick summary of the data distributed for the TransAmt2 column >summary(mdata1$TransAmt2) #even visualize the data distribution to get a clear picture >boxplot(mdata1$TransAmt2) #further breakup the data distribution in percentage >quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE) #from this 3 methods we are getting a clear picture that half way of the total data the average median value is 20 then at 75% the average value is 85. So almost 75% of the data contains an average value 20+85/2= 52.5 Let’s do a sanity check before we can conclude an average value for the missing values. In the boxplot diagram and the summary there’s a sudden spike in the values from 85 to 6783 which looks very unusual. Common reasons behind this is a chance of human error. So let’s remove the outlier and redo the steps to see any difference Impute missing values: numeric variables
  • 5. #saving the position of the values whose TransAmt>=2000 >index<-which(mdata1$TransAmt2>=2000) >mdata1<-mdata1[-index,] #removing the values >=2000(outliers) >View(mdata1) >summary(mdata1$TransAmt2) >boxplot(mdata1$TransAmt2) #again breakup the data distributed in percentage >quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE) Impute missing values: numeric variables Rupak Roy
  • 6. #breakup the data distribution from 10% >quantile(mdata1$TransAmt2,p=(10:100)/100,na.rm = TRUE) We can see the pattern of data distribution remains the same. Therefore we can conclude that the 75% of the data contains an average value of 52.5 #impute the missing values with 52.5 >mdata$TransAmt2[is.na(mdata$TransAmt2)]<-52.5 >sum(is.na(mdata$TransAmt2)) Impute missing values: numeric variables Rupak Roy
  • 7. >summary(mdata1$Department) #Or find the frequency for each factors of Departments >table1<-table(mdata$Department, useNA = "always") #convert into class table table1 into dataframe >table1df<-data.frame(table1) >View(table1df) #add the rate of frequency for each factors of Departments >table1df$rate<-table1df$Freq/sum(table1df$Freq) >quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = T) Impute missing values: character/factor variables
  • 8. >quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = T) #from the quantile output we can observe that 75% of the departments occurred with an estimate of 5000times and 25% with an estimate of 8000times. So again we take an average estimate of 2629+5060/2 = 3845 Because from 0-25% it has 1503 times, 25-50%: 1126 times and from 50-75%:2431 times. Hence we will conclude a value(dept.) which is closet to 3845. #filter the data based on frequency range from 3000 to 4000 >table1df_v<-table1df[table1df$Freq>=3000 & table1df$Freq<=4000,] >View(table1df_v) Impute missing values: character/factor variables Rupak Roy
  • 9. #impute the missing values with “Storage & Organization” >mdata$Department[is.na(mdata$Department)]<-"Storage & Organization” #else we can categorize the missing values as missing >mdata$Department[is.na(mdata$Department)]<-“missing" Impute missing values: character/factor variables Rupak Roy
  • 10. Next: Transpose, Manipulating Character Strings, Pattern Matching and Replacement. Manipulating Data Rupak Roy