Handling Missing Values

Manipulating Data - II
Missing value treatment
Rupak Roy

is.na() is a generic function that indicates which elements are missing. It
returns a logical value for each element: TRUE where the value is missing and
FALSE otherwise.
#load the data
>mdata<-read.csv(file.choose(),header = TRUE, na.strings = c("","NA"))
>is.na(mdata) #to identify any missing values
>mdata1<-mdata #keeping backup
#we can also add functions like sum() to calculate total missing values
>sum(is.na(mdata))
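As a small illustration of the calls above, here is a sketch on a toy data frame. The data frame `toy` and its values are made up for the example and are not the dataset used in these slides. It also uses colSums() to count the missing values per column, which is often more useful than the grand total.

```r
# Toy data frame (hypothetical values, not the slides' dataset)
toy <- data.frame(
  TransAmt2  = c(10, NA, 20, 85),
  Department = c("Bath", NA, "Kitchen", NA),
  stringsAsFactors = FALSE
)

na_mask    <- is.na(toy)        # logical matrix, TRUE where a value is missing
total_na   <- sum(na_mask)      # TRUE counts as 1, so this is the total number of NAs
na_per_col <- colSums(na_mask)  # number of missing values in each column

print(total_na)
print(na_per_col)
```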
Missing Value Treatment using is.na()

Two simple methods (use one or the other):
#if we already know what the missing values should be, we can impute that value directly
>mdata$TransAmt3[is.na(mdata$TransAmt3)]<- 52
#otherwise we can impute the column mean, computed from the backup copy;
#na.rm = TRUE drops the NA values before the mean is calculated
>mdata$TransAmt3[is.na(mdata$TransAmt3)]<-mean(mdata1$TransAmt3, na.rm = TRUE)
>summary(mdata)
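The mean-imputation step can be tried on a toy column, as in the sketch below. The names `toy`, `backup` and `col_mean` and the values are assumptions made for the example only.

```r
# Toy column with two missing entries (hypothetical values)
toy    <- data.frame(TransAmt3 = c(40, NA, 60, NA, 50))
backup <- toy  # untouched copy, in the same spirit as mdata1 in the slides

# Mean of the observed values only: (40 + 60 + 50) / 3
col_mean <- mean(toy$TransAmt3, na.rm = TRUE)

# Replace every NA with that mean
toy$TransAmt3[is.na(toy$TransAmt3)] <- col_mean

print(toy$TransAmt3)
print(sum(is.na(toy$TransAmt3)))  # number of NAs left after imputation
```

Once the NAs are filled, a second assignment of the same form has nothing left to replace, which is why the two methods above are alternatives rather than consecutive steps.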
Imputing the missing values

#a quick summary of the data distributed for the TransAmt2 column
>summary(mdata1$TransAmt2)
#visualize the distribution to get a clearer picture
>boxplot(mdata1$TransAmt2)
#break the distribution down further by quantile
>quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE)
#these three methods give a clear picture: the median (50th percentile) is 20
#and the 75th percentile is 85. So we take the midpoint of the two,
#(20 + 85)/2 = 52.5, as a representative average for roughly the lower 75% of the data.
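The average of 20 and 85 quoted above comes out as 52.5 only when the sum is parenthesized. A short sketch of the arithmetic in R, with the two quantile values typed in by hand from the text (the variable names are made up for the example):

```r
med <- 20  # 50th percentile reported in the text
q3  <- 85  # 75th percentile reported in the text

midpoint        <- (med + q3) / 2  # average of the two quantiles
without_parens  <- med + q3 / 2    # division binds tighter than addition

print(midpoint)        # 52.5
print(without_parens)  # 62.5
```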
Let’s do a sanity check before settling on a value for the missing entries. In the
boxplot and the summary, the values jump suddenly from 85 to 6783, which looks very
unusual. A common cause of such a spike is human error. So let’s remove the outlier
and redo the steps to see whether anything changes.
Impute missing values: numeric
variables

#save the positions of the rows where TransAmt2 >= 2000
>index<-which(mdata1$TransAmt2>=2000)
>mdata1<-mdata1[-index,] #remove the rows with values >= 2000 (outliers)
>View(mdata1)
>summary(mdata1$TransAmt2)
>boxplot(mdata1$TransAmt2)
#again break the distribution down by quantile
>quantile(mdata1$TransAmt2,c(.25,.50,.75,1),na.rm = TRUE)
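One caveat with removing rows by negative index: if no value reaches the cutoff, which() returns an empty vector and the negative-index subset drops every row. The sketch below shows that behaviour on toy data and a logical-mask variant that avoids it. The toy values and the names `cutoff`, `keep` and `trimmed` are assumptions for the example, not part of the original slides.

```r
# Toy data with one extreme value and one NA (hypothetical values)
toy    <- data.frame(TransAmt2 = c(10, 20, NA, 85, 6783))
cutoff <- 2000

# Logical mask: keep rows that are missing or below the cutoff.
# NA rows are kept on purpose so they can still be imputed later.
keep    <- is.na(toy$TransAmt2) | toy$TransAmt2 < cutoff
trimmed <- toy[keep, , drop = FALSE]
print(nrow(trimmed))

# The pitfall: with a cutoff that nothing reaches, which() is empty
# and toy[-integer(0), ] selects zero rows instead of all of them.
no_hits <- which(toy$TransAmt2 >= 1e6)
emptied <- toy[-no_hits, , drop = FALSE]
print(nrow(emptied))

# The logical mask keeps all rows in the same situation
safe <- toy[is.na(toy$TransAmt2) | toy$TransAmt2 < 1e6, , drop = FALSE]
print(nrow(safe))
```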

#break the distribution down from 10% to 100% in steps of 1%
>quantile(mdata1$TransAmt2,probs=(10:100)/100,na.rm = TRUE)
We can see that the pattern of the distribution remains the same after removing the
outlier. Therefore we can keep 52.5 as the average value for the lower 75% of the data.
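The before-and-after comparison can be sketched on made-up numbers. The vector `amounts` below is hypothetical (twenty ordinary values plus one extreme entry) and only illustrates the idea that a single outlier moves the maximum a great deal while the lower quantiles shift only slightly.

```r
# Hypothetical amounts: twenty ordinary values plus one extreme entry
amounts <- c(1:20, 6783)
probs   <- c(.25, .50, .75)

before <- quantile(amounts, probs = probs, na.rm = TRUE)
after  <- quantile(amounts[amounts < 2000], probs = probs, na.rm = TRUE)

# Side-by-side view of the 25%, 50% and 75% quantiles
print(rbind(before, after))

# The maximum is where the outlier actually shows up
print(c(max_before = max(amounts), max_after = max(amounts[amounts < 2000])))
```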
#impute the missing values with 52.5
>mdata$TransAmt2[is.na(mdata$TransAmt2)]<-52.5
>sum(is.na(mdata$TransAmt2))

>summary(mdata1$Department)
#or find the frequency of each level of Department
>table1<-table(mdata$Department, useNA = "always")
#convert table1 from class table into a data frame
>table1df<-data.frame(table1)
>View(table1df)
#add the relative frequency (rate) of each level of Department
>table1df$rate<-table1df$Freq/sum(table1df$Freq)
>quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = TRUE)
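The frequency-table steps can be reproduced on a toy character vector, as sketched below. The vector `dept` and its values are invented for the example. Note that the first column of the resulting data frame takes its name from the object passed to table(), so it is `dept` here, whereas passing `mdata$Department` yields a column named `Var1`.

```r
# Toy department column with two missing entries (hypothetical values)
dept <- c("Bath", "Kitchen", NA, "Bath", "Storage", "Bath", NA, "Kitchen")

# Frequency of each value, with NA counted as its own row
freq <- table(dept, useNA = "always")

# Convert the table into a data frame with columns dept and Freq
freq_df <- data.frame(freq)

# Relative frequency of each row; the rates add up to 1
freq_df$rate <- freq_df$Freq / sum(freq_df$Freq)

print(freq_df)
```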
Impute missing values:
character/factor variables

>quantile(table1df$Freq,c(.25,.50,.75,1),na.rm = TRUE)
#from the quantile output, the frequency at 25% is 1503, at 50% it is
#1503 + 1126 = 2629, and at 75% it is 2629 + 2431 = 5060. So 75% of the
#departments occur up to about 5000 times and the remaining 25% about 8000 times.
#As before, we take the midpoint of the median and the 75th percentile,
#(2629 + 5060)/2 = 3844.5, or about 3845, and choose the value (department)
#whose frequency is closest to 3845.
#filter the data based on frequency range from 3000 to 4000
>table1df_v<-table1df[table1df$Freq>=3000 & table1df$Freq<=4000,]
>View(table1df_v)
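Instead of filtering on a hand-picked range, the department closest to the target frequency can also be selected directly. This is only a sketch: the frequencies in `freq_df` below are hypothetical placeholders rather than the real counts from the dataset, and `target` is the 3845 derived above.

```r
# Hypothetical frequency table (made-up counts for illustration)
freq_df <- data.frame(
  Department = c("Bath", "Kitchen", "Storage & Organization", "Decor"),
  Freq       = c(1503, 8000, 3600, 5060),
  stringsAsFactors = FALSE
)

target <- 3845

# Row whose frequency has the smallest absolute distance to the target
closest_row <- which.min(abs(freq_df$Freq - target))
closest     <- freq_df$Department[closest_row]

print(closest)
```

which.min() returns the first minimum, so ties are resolved in favour of the earlier row.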

#impute the missing values with "Storage & Organization"
>mdata$Department[is.na(mdata$Department)]<-"Storage & Organization"
#or, alternatively, label the missing values as their own category
>mdata$Department[is.na(mdata$Department)]<-"missing"
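Whether Department is read in as a character or a factor column depends on the R version and on the stringsAsFactors argument of read.csv() (the default changed to FALSE in R 4.0.0). If it is a factor, assigning a value that is not already a level produces NA with a warning, so the new level has to be added first. A minimal sketch on a toy factor, with made-up values:

```r
# Toy factor with two missing entries (hypothetical values)
dept <- factor(c("Bath", NA, "Kitchen", NA))

# Register "missing" as a valid level before assigning it
levels(dept) <- c(levels(dept), "missing")

# Now the assignment works without generating NAs
dept[is.na(dept)] <- "missing"

print(dept)
print(table(dept))
```

For a plain character column this extra step is not needed, and the assignment shown above works as written.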

Next:
Transpose, Manipulating Character Strings, Pattern Matching and
Replacement.
Manipulating Data