Missing Data Handling
by
Gautam Kumar
There are 3 main types of missing data:
Missing completely at random (MCAR)
Missing at random (MAR)
Not missing at random (NMAR)
Missing data handling techniques:
Do nothing
Mean/median imputation
Substitution (most frequent) and zero or constant imputation
Hot deck imputation
Cold deck imputation
Regression imputation
Stochastic regression imputation
Interpolation and extrapolation
Imputation using K-NN
Imputation using Multivariate Imputation by Chained Equations (MICE)
Imputation using Deep Learning (Datawig)
Do Nothing (the easy one)
• Let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the best imputation values for them based on the reduction in training loss (e.g. XGBoost).
• Others have the option to simply ignore them (e.g. LightGBM with use_missing=false).
• Some algorithms will throw an error complaining about the missing values (e.g. scikit-learn's LinearRegression). In that case, you will need to handle the missing data and clean it before feeding it to the algorithm.
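A minimal sketch of the do-nothing approach, assuming XGBoost is installed; the toy arrays are purely illustrative:

```python
import numpy as np
import xgboost as xgb

# Toy data with missing values: XGBoost treats np.nan as "missing"
# and learns the best default branch direction for it during training.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)          # no imputation step needed
print(model.predict(X))  # predictions despite the NaNs
```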
Mean/median imputation
• Calculate the mean/median of the non-missing values in a column and replace that column's missing values with it; each column is treated separately and independently of the others.
• It can only be used with numeric data.
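A short sketch using scikit-learn's SimpleImputer (the toy array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0],
              [4.0, np.nan],
              [np.nan, 6.0]])

# Each column's missing entries are replaced by that column's mean
# (use strategy="median" for median imputation instead).
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))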
Substitution (most frequent), zero or constant imputation, and imputation based on logical rules
• Most frequent is another statistical strategy to impute missing values. It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.
• Zero or constant imputation, as the name suggests, replaces the missing values with either zero or any constant value you specify.
• Imputation based on logical rules: suppose a DataFrame has DOB and age as two features and one of them is missing; we can then fill the missing value from the other using a simple rule.
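A sketch of the most-frequent, constant, and rule-based strategies with pandas and scikit-learn; the city/age/birth_year columns and the 2023-based rule are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "city": ["Delhi", "Pune", np.nan, "Delhi"],
    "age":  [25, np.nan, 40, 31],
    "birth_year": [1998, 1993, np.nan, 1992],
})

# Most frequent: fill the categorical column with its mode.
df[["city"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])

# Constant: fill a numeric column with a fixed value (here 0).
df[["age"]] = SimpleImputer(strategy="constant", fill_value=0).fit_transform(df[["age"]])

# Logical rule: derive a missing birth_year from age (illustrative rule).
df["birth_year"] = df["birth_year"].fillna(2023 - df["age"])
print(df)
```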
Hot deck imputation
• A randomly chosen value from an individual in the sample who has similar values on the other variables (columns).
• Find all the sample subjects who are similar on the other variables, then randomly choose one of their values on the missing variable (a sketch covering both hot- and cold-deck imputation follows the cold-deck description below).
Cold deck imputation
• A systematically chosen value from an individual who has similar values on the other variables (columns).
• This is similar to hot deck in most ways, but removes the random variation. For example, you may always choose the third individual in the same experimental condition and block.
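A rough pandas sketch of both donor-based approaches, assuming an illustrative "group" column that marks similar individuals: hot deck draws a random donor value from the group, while cold deck always takes a fixed (here, the first) one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "income": [30, np.nan, 34, 70, 72, np.nan],
})

def hot_deck(s):
    # Replace NaNs with a randomly drawn observed value from the same group.
    donors = s.dropna().to_numpy()
    return s.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

def cold_deck(s):
    # Replace NaNs with a systematically chosen donor value (here: the first).
    donor = s.dropna().iloc[0]
    return s.fillna(donor)

df["income_hot"] = df.groupby("group")["income"].transform(hot_deck)
df["income_cold"] = df.groupby("group")["income"].transform(cold_deck)
print(df)
```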
Regression imputation
• The predicted value obtained by regressing the missing variable on the other variables.
• So instead of just taking the mean, you're taking the predicted value based on other variables. This preserves relationships among the variables involved in the imputation model, but not the variability around the predicted values (see the sketch after the stochastic variant below).
Stochastic regression imputation
• The predicted value from a regression plus a random residual value.
• This has all the advantages of regression imputation, but adds the advantages of the random component.
• Most multiple imputation is based on some form of stochastic regression imputation.
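A minimal sketch of both variants using scikit-learn's LinearRegression on illustrative toy columns; the stochastic version simply adds a residual drawn from the fitted model's error distribution:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [2.1, 3.9, np.nan, 8.2, np.nan, 12.1]})

observed = df["y"].notna()
reg = LinearRegression().fit(df.loc[observed, ["x"]], df.loc[observed, "y"])
pred = reg.predict(df.loc[~observed, ["x"]])

# Regression imputation: plug in the predicted value.
df["y_reg"] = df["y"]
df.loc[~observed, "y_reg"] = pred

# Stochastic regression imputation: prediction plus a random residual,
# restoring some of the variability lost by the deterministic fill.
resid_std = (df.loc[observed, "y"] - reg.predict(df.loc[observed, ["x"]])).std()
df["y_stoch"] = df["y"]
df.loc[~observed, "y_stoch"] = pred + rng.normal(0, resid_std, size=pred.shape)
print(df)
```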
Interpolation and extrapolation
• An estimated value from other observations of the same individual. It usually only works with longitudinal data.
• Use caution, though. Interpolation, for example, makes more sense for a variable like height in children, one that can't go back down over time.
• Extrapolation means you're estimating beyond the actual range of the data, and that requires making more assumptions than you should.
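A small pandas sketch on an illustrative longitudinal height series; interpolate() fills gaps between observed time points, while extrapolation beyond the observed range would need an explicit trend model:

```python
import numpy as np
import pandas as pd

# Height of one child measured over time (longitudinal data).
height = pd.Series([95.0, np.nan, 101.0, np.nan, 107.0],
                   index=pd.to_datetime(["2020-01", "2020-07", "2021-01",
                                         "2021-07", "2022-01"]))

# Interpolation: estimate the gaps from the surrounding observations.
print(height.interpolate(method="time"))

# Extrapolation (e.g. beyond 2022-01) is not done here because it
# requires assuming a trend outside the observed range.
```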
Imputation using K-NN
• k-nearest neighbours (k-NN) is an algorithm commonly used for simple classification.
• The algorithm uses feature similarity to predict the values of new data points: a new point is assigned a value based on how closely it resembles the points in the training set.
• We can perform a similar kind of prediction to fill in the missing data points using the Impyute library in Python, which provides a simple and easy way to use KNN for imputation.
• It creates a basic mean impute, then uses the resulting complete data to construct a KDTree. It then uses that KDTree to compute the nearest neighbours (NN). After it finds the k nearest neighbours, it takes their weighted average.
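The slide refers to the Impyute library; a comparable sketch with scikit-learn's KNNImputer (a widely available alternative) looks like this:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is filled with the (distance-weighted) average of
# that feature over the k most similar complete rows.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```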
Imputation using Multivariate Imputation by Chained Equations (MICE)
• MICE works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they measure the uncertainty of the missing values in a better way.
• The chained-equations approach is also very flexible and can handle variables of different data types (e.g. continuous or binary) as well as complexities such as bounds or survey skip patterns.
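scikit-learn's IterativeImputer is modelled on the chained-equations idea and can stand in for a MICE-style fill (it is still flagged experimental, hence the extra enabling import); a minimal sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each feature with missing values is modelled as a function of the other
# features, and the imputations are refined over several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```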
Imputation using Deep Learning (Datawig)
• This method works very well with categorical and non-numerical features. (Datawig is a Python library for deep-learning-based imputation.)
• Datawig learns machine-learning models using deep neural networks to impute missing values in a DataFrame.
• It also supports both CPU and GPU for training.
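A sketch of the typical Datawig workflow; the DataFrame and column names are illustrative, and API details may vary between Datawig versions:

```python
import pandas as pd
import datawig

# Tiny illustrative DataFrame; "color" has a missing value to impute.
df = pd.DataFrame({
    "description": ["red shirt", "blue jeans", "red dress", "blue cap"],
    "color": ["red", "blue", None, "blue"],
})

imputer = datawig.SimpleImputer(
    input_columns=["description"],  # features used to predict the target
    output_column="color",          # column whose missing values are imputed
    output_path="imputer_model",    # directory where the trained model is stored
)
imputer.fit(train_df=df[df["color"].notna()], num_epochs=10)
imputed = imputer.predict(df[df["color"].isna()])
print(imputed)
```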
Any questions?
Contact: gautam.kmr2893@outlook.com
Thank you
