POOR MAN’S MISSING VALUE RECOVERY METHODS
By Leonardo E. Auslender, AT&T/Bell Labs [1]
Introduction.
I describe two fast nonparametric methods to impute missing values, especially appropriate for the working environment found in database marketing: a) very tight deadlines, b) the need for an 'easy' algorithm to 'score' a large database, c) massive data sets for model development, and d) an audience not very sympathetic to the plights of modelers.
The Problem.
The presence of missing values is characteristic of large databases. Databases with more than 100 million observations and thousands of variables (both business and demographic information, for instance) are not uncommon. Under these circumstances, almost 100% of the observations contain at least one variable with a missing value.
While missing value imputation has been, and continues to be, extensively analyzed by statisticians and others (see the classic Little and Rubin, 1987, or any recent issue of JASA), these methods are difficult to implement in a database marketing (DM) environment because:
1) they require more time than is usually available. Deadlines are sometimes as short as two days to complete a profiling and modeling project. Imputing missing values by modeling implies searches over model specifications (variable selection, functional form) and inference, which are time consuming.
2) the resulting imputation algorithm(s) are not easily implementable in large databases in a quick, efficient and error-free manner. Implementation errors could produce catastrophic business results. Therefore, the algorithm(s) must be 'simple', and cannot exceed present hardware and software constraints in the frame of massive databases.
3) the ad hoc methods, such as ascribing the overall mean or a constant (such as zero, to prevent dropping the observation altogether, as SAS would do) could be more harmful than doing nothing (Little and Rubin, 1987). Imputing a mean arbitrarily reduces the variance of the entire distribution, and thus changes the empirical distribution function. Further, imputing a mean to a categorical variable requires at least a suspension of disbelief. Deleting all observations with some missing values, or all variables which have some missing values, is another approach of the 'barbarian' persuasion.
The Poor Man's methods I advocate in this paper impute missing values from the univariate and multivariate empirical distributions. The first method, which is simpler, sacrifices all the multivariate information available for the sake of meeting a deadline. The marginal densities are not affected if the data are missing at random. The second method tries to remedy the lack of multivariate structure, within the constraints just mentioned.
Poor Man’s Imputation Methods (PMIM)
1) Univariate cumulative
empirical distribution
Let x1, ..., xn be n independent observations from a distribution F(x). Let the real world, represented by F(x), be replaced by the ordered observations x(1) < x(2) < ... < x(n), and represent these data by the empirical distribution function Fn(x), which puts probability 1/n on each observed value:

$$
F_n(x) \;=\; \frac{\#\{x_i \le x\}}{n} \;=\;
\begin{cases}
0, & x < x_{(1)} \\
k/n, & x_{(k)} \le x < x_{(k+1)},\ k < n \\
1, & x_{(n)} \le x
\end{cases}
\qquad (1)
$$
Based on (1), discretize each of the variables with at least one missing value. The number of bands or categories is a macro parameter, fixed at ten in this paper and upper-bounded by computer resources. The larger this parameter, the greater the detail and accuracy, and the greater the computer resources used. Through discretization, the information for each variable with some missing values is temporarily replaced by the band to which the variable's value belongs in each observation. If the variable is binary (which occurs very frequently in database marketing), the number of bands will obviously be two. Even for continuous variables, it is seldom the case that those variables will be discretized into the full ten bands (if the number of bands chosen is ten), because there are regions of the range which are empty.

[1] I wish to thank Andreas Buja and Chuanhai Liu of Bell Labs/Statistics Research for the extensive discussions on this paper. The usual caveats apply.
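As an illustration of the discretization step just described, a variable can be banded into at most ten groups with PROC RANK. This is a minimal sketch under assumed names (DEVELOP, VAR2 and BAND2 are placeholders), not the paper's original macro code:

PROC RANK DATA=develop OUT=banded GROUPS=10;
   VAR var2;       /* variable with missing values                 */
   RANKS band2;    /* band index 0-9; missing values stay missing  */
RUN;

PROC RANK leaves missing values of VAR2 missing, so the bands are computed from the observed values only; ties and empty regions of the range can leave fewer than ten distinct bands, matching the remark above.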
Once the variables are discretized,
corresponding means and standard errors
are estimated for each band within each
variable. Likewise, the cumulative
percentage of observations along the bands
is obtained. This information contains all the
elements that define the empirical
distribution function for each variable with
missing values.
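A sketch of this summarization step, continuing with the assumed names above; PROC MEANS supplies the per-band means and standard errors, and a short data step accumulates the cumulative percentages that define the empirical distribution:

PROC MEANS DATA=banded NOPRINT;
   CLASS band2;
   VAR var2;
   OUTPUT OUT=bandstat MEAN=mean2 STDERR=se2 N=n2;
RUN;

DATA bandstat;
   SET bandstat;
   RETAIN tot;
   IF _TYPE_ = 0 THEN DO;    /* overall row: capture total N     */
      tot = n2;
      DELETE;
   END;
   cumperc + n2 / tot;       /* cumulative % along the bands     */
RUN;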
Given a missing value in observation i for variable j, generate a random number R from a uniform distribution, and a random number Z from a standard normal distribution N(0,1). If Perc_k ≤ R < Perc_{k+1}, where Perc_k is the cumulative percentage at band k, the rule is then to impute Mean_k + SE_k · Z.
As a comment, overall mean
imputation is equivalent to applying this
method with only one band. In this sense,
overall mean imputation is an extreme case.
1.1) Example of SAS code used to impute the variable VAR2

IF VAR2 = . THEN DO;
   AUX = RANUNI(9);                     /* uniform draw R               */
   IF 0 < AUX <= 0.1703606086 THEN      /* cumulative % bound of band 1 */
      VAR2 = 616.90998594 + RANNOR(8) * 2.6743066012;
   ELSE IF 0.1703606086 < AUX <= 0.3429974841 THEN
      VAR2 = 933.89312977 + RANNOR(8) * 2.6652926231;
   /* ... one ELSE IF clause per remaining band ... */
END; /* IF VAR2 = . */
OUTPUT;
1.2) Criticisms of the method and
possible improvements.
1) Incorporate the multivariate structure into the method: a multivariate extension of the above discretization method is not directly possible because of the curse of dimensionality. For instance, let us assume ten binary variables with missing values. The number of possible patterns is then 2^10 = 1024, and despite large volumes of data it is not always possible to obtain all patterns for all cases of missingness. If missingness is itself considered to be a category, i.e. each binary variable becomes trinary, then the number of interesting patterns would be 3^10 - 1 = 59048 (the case "all missing" is clearly uninformative). If we consider that present databases contain perhaps thousands of categorical and continuous variables, the number of possible patterns would far exceed the information content of the database.
2) As an implication of point 1 above, the method does not address data missing not at random. In the ideal case in which the mechanism generating missingness were known, it would be possible to model and thus impute the values, which could perhaps be done in a prompt, efficient and scorable way. Missingness at random does, however, occur, such as when matching different data sources by name or by another index, where the corresponding information is not very accurate, not updated, or has changed (such as a change of name due to change in marital status, death, etc.).
3) Since the method leaves the univariate distributions unaffected while disregarding multivariate information, it most likely weakens the structure of interdependencies among variables, which affects modeling and profiling results.
2) Multivariate Cumulative
Empirical Distribution
The curse of dimensionality prevents the direct generalization of the previous method to the multivariate setting.

Since there is a need to limit the number of variables, let us invite modeling. For the sake of presentation, I chose as the variable selection method the five variables best correlated with the one containing missing values. The number five is a macro parameter, and together with the number of bands must be determined with a view to computing constraints. Since variables with missing values might be highly correlated with other missing-value variables, the method proceeds by imputing one variable at a time, in increasing order of percentage of missingness. Once the first variable is imputed, the following correlation searches are performed with the imputed data set (a caveat on this procedure is mentioned later on).
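A sketch of this crude selection step, assuming MVAR is the variable to be imputed and X1-X50 stands in for the candidate list; the five variables with the largest absolute correlations would be kept as predictors:

PROC CORR DATA=develop NOPRINT
          OUTP=corrs (WHERE=(_TYPE_='CORR'));
   VAR mvar;
   WITH x1-x50;              /* candidate predictors (placeholder) */
RUN;

DATA corrs;
   SET corrs;
   abscorr = ABS(mvar);      /* rank candidates by |correlation|   */
RUN;

PROC SORT DATA=corrs;
   BY DESCENDING abscorr;    /* keep the top five _NAME_ values    */
RUN;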
For each observation in which the variable to be imputed, say mvar, is not missing, categorize each of the selected variables into its corresponding band. That is, the information contained in these five variables is categorized and patterned. Find the corresponding means and standard errors of mvar for each of the patterns, as well as for partial patterns: for instance, if the pattern 11111 exists, also find the means and standard errors for patterns .1111, 1.111, 11.11, ..., ..111, etc. Store all this information in SAS formats.
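A sketch of the patterning step, with B1-B5 as the assumed banded predictors. A convenient property of PROC SUMMARY is that a CLASS statement without the NWAY option produces every subset of the classifiers, which is exactly the set of full and partial patterns; the band values are then concatenated into a character key ('.' marking an absent band) and loaded as formats:

PROC SUMMARY DATA=banded;
   WHERE mvar NE .;                   /* summarize non-missing MVAR  */
   CLASS b1 b2 b3 b4 b5;              /* all 2**5 subsets = patterns */
   VAR mvar;
   OUTPUT OUT=patstat MEAN=pmean STDERR=pse;
RUN;

DATA fmt_in;
   SET patstat;
   LENGTH fmtname $ 8 type $ 1 start $ 5 label $ 20;
   type  = 'C';
   /* a missing CLASS value prints as '.', marking a partial pattern */
   start = CATS(PUT(b1,1.), PUT(b2,1.), PUT(b3,1.),
                PUT(b4,1.), PUT(b5,1.));
   fmtname = 'pmean'; label = PUT(pmean, BEST12.); OUTPUT;
   fmtname = 'pse';   label = PUT(pse,   BEST12.); OUTPUT;
RUN;

PROC SORT DATA=fmt_in; BY fmtname start; RUN;
PROC FORMAT CNTLIN=fmt_in; RUN;       /* creates $PMEAN. and $PSE.   */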
For mvar missing, find the pattern of the five variables, and impute the mean plus a normal random number times the standard error. If the pattern is not present in the summarization just performed (recall dimensionality's curse?), search through all partial patterns to find the closest pattern.
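A sketch of the imputation pass, using the $PMEAN. and $PSE. formats built above. For brevity, the fallback below coarsens the pattern one band at a time instead of searching all partial patterns for the closest match, so it simplifies the search the text describes:

DATA scored;
   SET develop;
   LENGTH pat $ 5 m $ 20;
   IF mvar = . THEN DO;
      pat = CATS(PUT(b1,1.), PUT(b2,1.), PUT(b3,1.),
                 PUT(b4,1.), PUT(b5,1.));
      m = PUTC(pat, '$pmean.');       /* unmatched values return as-is */
      DO i = 1 TO 5 WHILE (m = pat);  /* fallback: blank out bands     */
         SUBSTR(pat, i, 1) = '.';
         m = PUTC(pat, '$pmean.');
      END;
      IF m NE pat THEN
         mvar = INPUT(m, BEST32.)
              + RANNOR(8) * INPUT(PUTC(pat, '$pse.'), BEST32.);
   END;
   DROP pat m i;
RUN;

Because the all-missing pattern '.....' corresponds to the overall summary row, the fallback always terminates with a match.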
2.1) Criticisms of the method and
possible improvements.
1) The present 'modeling' method is very crude and ad hoc, and many possibilities to replace it come to mind: for instance, the multirelation coefficient (Dresner, 1995), principal components, a stepwise or backwards regression, logistic regression, etc. The scoring of the database is not affected by the difficulty of the modeling method, because the modeling method only affects the development stage and the creation of the formats necessary for the imputation.
2) The use of just-imputed variables to impute other variables inflates the variances of the imputed variables, which is not easily accounted for.
3) The method takes longer than the univariate method, and scoring requires the storing of large format libraries. If the scoring is not done in SAS, the translation of these enormous formats could be infeasible. On the other hand, were the scoring done in SAS, new updates of the format libraries would be relatively straightforward.
4) When data are not missing at random, the method can provide some insight into the mechanism by providing details of the distribution of patterns. That is, missingness not at random will show itself as a pronounced lack of patterns in a region of the data. The method could then be modified to weight the information differently in those regions.
3) Empirical application.
Demographic information is
commonly used in segmentation
applications. In our case, the focus of
research lies in classifying customers as
youngies (below 26 years of age) or oldies
(otherwise). The age information commonly
available from census sources contains a
large proportion of missing values.
The study involves two parts: first, a profile of the distribution of the imputed values of selected variables (the top three variables of an original regression tree); second, a model which classifies prospects into one of the two groups. It is important to see how missing value imputation affects model performance. Customer profiling is omitted due to corporate concerns.
Our development data set contains about 10,000 observations with more than 100 variables. For the sake of brevity, validation results are omitted, as well as most of the information which plays no vital role in this study. The classification model was obtained by running a regression tree program because, at least in the case of categorical missing values, regression tree programs provide a solution. Further, when I tried logistic regression, convergence was not attained.
I present the results for the following cases: a) original data (which contained some missingness; at this point, I added missingness at random in key variables); b) deleting observations with missing values; c) deleting variables with missing values; d) estimation of missing values by the corresponding means; e) univariate PMIM; f) multivariate PMIM. The graphical representations of the regression trees are drastically abbreviated for the sake of space.
Data description
Eighteen percent of the observations of the variables VAR1 (binary), VAR2 (continuous) and VAR3 (continuous) were set to missing at random, independently of each other (Table 1).
Table 1: Data description

CASE         TOTAL OBS  OLD/YOUNG  FREQUENCY  PERC.   PERC. OTHER  %_CASES GUESS=     SST =
                                              CATEG.  CATEG.       OUTCOME (NONINFO)  N x P x Q
a, c, d,       10101    OLDIE        5009     49.59     50.41        50.00            2525.079
e, f                    YOUNGIE      5092     50.41     49.59        50.00            2525.079
b               4025    OLDIE        2307     57.32     42.68        51.07             984.702
                        YOUNGIE      1718     42.68     57.32        51.07             984.702
Case b), which eliminates all observations with some missing values, reduces the effective development sample size by more than 50%, and also changes the proportions of the dependent variable, thereby affecting modeling and profiling results.
Distributions of the imputed variables
The continuous variables graphed
below were rescaled to fit between 0 and
100.
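For reference, a minimal sketch of such a rescaling (data set and variable names are placeholders):

PROC MEANS DATA=imputed NOPRINT;
   VAR var2;
   OUTPUT OUT=mm MIN=mn MAX=mx;
RUN;

DATA rescaled;
   IF _N_ = 1 THEN SET mm;      /* bring in MN and MX once */
   SET imputed;
   var2_r = 100 * (var2 - mn) / (mx - mn);
RUN;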
VAR1 (binary), imputed missing values only:

METHOD          % .     % 0      % 1    MEDIAN   MODE CATEG.   MODE %
TRUE             .     47.30    52.70      1          1        52.70
UNIVARIATE       .     44.80    55.20      1          1        55.20
MULTIVARIATE     .     47.58    52.42      1          1        52.42
VAR1 (binary), entire file:

METHOD          % .     % 0      % 1    MEDIAN   MODE CATEG.   MODE %
TRUE             .     45.92    54.08      1          1        54.08
UNIVARIATE       .     45.47    54.53      1          1        54.53
MULTIVARIATE     .     45.97    54.03      1          1        54.03
Legend for the schematic box plots below: MIN = 0, Q1 = 1, MEDIAN = 2, MEAN = M, Q3 = 3, MAX = 4, OVERPRINT = *, reference line at 50 = |.
VARIABLE MIN VAR1 MAX
0 Missing values only 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
0 Full File 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
VAR1 is a binary variable, imputed as continuous by the mean imputation method, which collapses the mean, median, Q1 and Q3 at one point. The univariate and multivariate methods are closer to the true distribution, especially the multivariate one (two tables above).
VARIABLE MIN VAR2 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-------1-----------2|--M----------------3-----------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0-----1-------------2|-M---------------3-------------------4|
MULTIVARIATE |0---1-------------2--M-----------------3-------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-------1------------2--M---------------3------------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0--------1-----------2--M--------------3-------------------4|
MULTIVARIATE |0---------1----------2--M------------3---------------------4|
*------------------------------------------------------------*
The mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate
methods are closer to the true distribution, particularly so the univariate method for the missing
values only case, as graphed in the previous two plots.
VARIABLE MIN VAR3 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2----M3--|-------------------------------------4|
UNIVARIATE |*-----2---------M----|---3---------------------------------4|
MULTIVARIATE |0----1-------------2-|M--------------3---------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2-----M-3|-------------------------------------4|
UNIVARIATE |0-1-----2---------M--|----3--------------------------------4|
MULTIVARIATE |0---1---2---------M--|------3------------------------------4|
*------------------------------------------------------------*
The mean imputation has again shrunk the distribution of the data, this time VAR3, while the univariate and multivariate methods are closer to the true distribution. The univariate distribution has probably collapsed too much towards the low end when we view the distribution of the imputed missing values only.
IMPUTATION ACCURACY

Mean squared error of imputation:

VARIABLE    MEAN IMPUTATION   UNIVARIATE IMP.   MULTIVARIATE IMP.
VAR1                  0.25              0.25               0.25
VAR2            1119839.15        1105777.94          717651.84
VAR3                 26.63             23.14              17.69
On average, PMIM are more accurate than mean imputation.
MODEL COMPARISONS
I compare the resulting trees. Note that I do not present case c), deleting variables with any missing observations, because the resulting tree is extreme, lacks interest, and its performance is very poor.
ORIGINAL CLASSIFICATION TREE MODEL, CASE a)

[Tree diagram, abbreviated: root node YOUNGIE; first split on VAR1 (0 vs. 1); subsequent splits on VAR2 between 1682 and 1683 and between 1697 and 1698; leaves labeled y (youngie) and o (oldie).]
DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)

[Tree diagram, abbreviated: root node YOUNGIE; first split on VAR2 at 1610.5; subsequent splits on VAR1 (0 vs. 1); leaves labeled y and o.]
MEAN IMPUTATION, CASE d)

[Tree diagram, abbreviated: root node YOUNGIE; first split on VAR2 at 1387.5; subsequent splits on VAR3 (0 vs. 1) and on VAR1 at 0.37; leaves labeled y and o.]
UNIVARIATE IMPUTATION, CASE e)

[Tree diagram, abbreviated: root node YOUNGIE; first split on VAR3 (0 vs. 1); subsequent splits on VAR2 between 1446 and 1447 and between 1708 and 1709; leaves labeled y and o.]
MULTIVARIATE IMPUTATION, CASE f)

[Tree diagram, abbreviated: root node YOUNGIE; first split on VAR2 at 1708; subsequent splits on VAR1 (0 vs. 1); leaves labeled y and o.]
3.1) Notes on the Tree models
1) Cases b, d and f reverse the order of the most important variables when compared to case a), the original tree. It is important to mention that even case a) contained some missing information originally, which has somehow been eliminated in cases b, d and f.

2) Case d) imputes a mean value to a binary variable (var1 <> .37), which is at least inappropriate. The cutoff value for var2 (continuous), around 1400, is far from the value estimated by the other methods, which is closer to 1700. This is probably due to the reduction in the variance of var2, and explains the poor modeling performance shown below.

3) The univariate imputation method, case e), uses var3 instead of var1 as the principal variable, and then follows a pattern similar to cases a and f. Var3 and var1 are highly correlated binary variables; however, var3 did not have any original or artificially generated missing values.

4) Trees for cases a) and f) are very similar. It is worth mentioning that case f) has imputed all missing values, both original and artificially created.
3.2) Model performance and diagnostics.

I measured performance using several statistics, defined as follows:

1) HITRATE: percentage of cases where the prediction agrees with the true state of nature (see the sketch after this list).
2) T: standardized difference in mean predicted probability between oldies and youngies. The larger the value of T, the better the model discriminates.
3) Number of nodes: simpler trees are preferable to larger ones.
4) Classification rate: percentage of a true state of nature predicted to be that same state of nature.
5) True positive rate: of those predicted to be in a certain state of nature, the percentage that truly belong to that state.
6) TOP 50: cumulative percentage of youngies captured at the 5th decile.
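HITRATE, the classification rates and the true positive rates can all be read from a cross-tabulation of actual versus predicted class; a minimal sketch, with ACTUAL and PRED as placeholder variable names:

PROC FREQ DATA=scoredmodel;
   /* Diagonal cell percentages sum to HITRATE; the diagonal row */
   /* percentages are the classification rates, and the diagonal */
   /* column percentages are the true positive rates.            */
   TABLES actual * pred;
RUN;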
The results clearly indicate the power of the Poor Man's methods. While mean imputation performs satisfactorily (T = 1.01), multivariate imputation is superior in all respects. Database marketers also look at the gains chart, summarized here in the column TOP 50. While univariate imputation performs very satisfactorily, multivariate imputation is still superior. The performance of mean imputation has suffered, however.
Diagnostics for binary dependent variable:

CASE                             HITRATE    T    #NODES  OLDIES  YNGIES  OLDIES  YNGIES  TOP 50
                                                         CLSF    CLSF    TRUE    TRUE    CUM %
                                                         RATE    RATE    POS     POS
AS IS                              75.53  1.09      6     72.23   78.77   77.00   74.25   74.94
CREATED MISSINGS & DELETED OBS     77.52  1.14      8     84.31   68.39   78.18   76.45   78.81
DELETING VARS WITH MISSINGS        73.50  0.93      8     85.49   54.97   74.60   71.00   70.48
MEAN IMPUTATION                    75.48  1.01      7     86.28   58.78   76.40   73.47   69.12
UNIVARIATE IMPUTATION              75.53  1.01      7     87.60   56.87   75.86   74.78   74.69
MULT IMPUTATION                    78.58  1.15      8     86.81   65.85   79.73   76.34   76.60
4) Hardware considerations.
The macro systems were run on a Sun Sparc 10 with 256 MB of RAM. The entire run, including regression trees and diagnostics, took less than one hour of elapsed time. However, running the multivariate imputation with five bands and 15 correlated variables took more than 6 hours of CPU time, and those results are not reported here.
5) General comments and
conclusion
The methods just presented provide alternatives to the beleaguered analyst in a fast-paced environment. Especially in the case of profiling, the Poor Man's methods perform better than the even more hurried mean imputation method. Model performance was very good, especially for the multivariate case. The number of bands is still a research area, and the correlation aspects of the multivariate imputation deserve further investigation.
Possible areas of improvement hinge especially on the modeling step, such as implementing Dresner's or Leahy's (1995) suggestions. It is also necessary to further test the methods under different conditions of missingness, such as different patterns and different percentages of missingness across variables. There is also an extensive literature on bandwidth selection (e.g., Thombs and Sheather, 1990).
6) Bibliography
Dresner, A. (1995): Multirelation: correlation among more than two variables, Computational Statistics and Data Analysis.

Leahy, K. (1995): Nature, prevalence, and benefits of suppression effects in direct response segmentation, presented at the 1995 American Statistical Association meeting.

Little, R.J.A. and Rubin, D.B. (1987): Statistical Analysis with Missing Data, Wiley.

Thombs, L. and Sheather, S. (1990): Local bandwidth selection for density estimation, Interface '90, Proceedings of the 22nd Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS: MULTIVARIATE IMPUTATION

Proc Univariate: determine variables with missing values, their ranges and minima.
Proc Corr: determine the best correlated variables.
Data Step: for each observation of each missing variable, determine the corresponding patterns.
Proc Summary: for each pattern, determine the mean and SE of the missing variable.
Proc Format: create formats of the mean, SE and frequencies for every pattern.
Data Step: for each missing observation, find the pattern and impute.
More Related Content

What's hot

ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalisedTetsuya Sakai
 
Hierarchical clustering and topology for psychometric validation
Hierarchical clustering and topology for psychometric validationHierarchical clustering and topology for psychometric validation
Hierarchical clustering and topology for psychometric validationColleen Farrelly
 
Stat11t alq chapter03
Stat11t alq chapter03Stat11t alq chapter03
Stat11t alq chapter03raylenepotter
 
Evaluation measures for models assessment over imbalanced data sets
Evaluation measures for models assessment over imbalanced data setsEvaluation measures for models assessment over imbalanced data sets
Evaluation measures for models assessment over imbalanced data setsAlexander Decker
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsColleen Farrelly
 
Machine Learning by Analogy II
Machine Learning by Analogy IIMachine Learning by Analogy II
Machine Learning by Analogy IIColleen Farrelly
 
2014 IIAG Imputation Assessments
2014 IIAG Imputation Assessments2014 IIAG Imputation Assessments
2014 IIAG Imputation AssessmentsDr Lendy Spires
 
Introduction to the t Statistic
Introduction to the t StatisticIntroduction to the t Statistic
Introduction to the t Statisticjasondroesch
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis緯鈞 沈
 
PyData Miami 2019, Quantum Generalized Linear Models
PyData Miami 2019, Quantum Generalized Linear ModelsPyData Miami 2019, Quantum Generalized Linear Models
PyData Miami 2019, Quantum Generalized Linear ModelsColleen Farrelly
 
Statistical Estimation
Statistical Estimation Statistical Estimation
Statistical Estimation Remyagharishs
 
Estimation in statistics
Estimation in statisticsEstimation in statistics
Estimation in statisticsRabea Jamal
 

What's hot (19)

ecir2019tutorial-finalised
ecir2019tutorial-finalisedecir2019tutorial-finalised
ecir2019tutorial-finalised
 
Hierarchical clustering and topology for psychometric validation
Hierarchical clustering and topology for psychometric validationHierarchical clustering and topology for psychometric validation
Hierarchical clustering and topology for psychometric validation
 
Stat11t alq chapter03
Stat11t alq chapter03Stat11t alq chapter03
Stat11t alq chapter03
 
Decision tree
Decision tree Decision tree
Decision tree
 
Stat11t chapter3
Stat11t chapter3Stat11t chapter3
Stat11t chapter3
 
Evaluation measures for models assessment over imbalanced data sets
Evaluation measures for models assessment over imbalanced data setsEvaluation measures for models assessment over imbalanced data sets
Evaluation measures for models assessment over imbalanced data sets
 
Stat11t chapter1
Stat11t chapter1Stat11t chapter1
Stat11t chapter1
 
Deep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problemsDeep vs diverse architectures for classification problems
Deep vs diverse architectures for classification problems
 
Machine Learning by Analogy II
Machine Learning by Analogy IIMachine Learning by Analogy II
Machine Learning by Analogy II
 
2014 IIAG Imputation Assessments
2014 IIAG Imputation Assessments2014 IIAG Imputation Assessments
2014 IIAG Imputation Assessments
 
WSDM2019tutorial
WSDM2019tutorialWSDM2019tutorial
WSDM2019tutorial
 
Introduction to the t Statistic
Introduction to the t StatisticIntroduction to the t Statistic
Introduction to the t Statistic
 
Morse-Smale Regression
Morse-Smale RegressionMorse-Smale Regression
Morse-Smale Regression
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
PyData Miami 2019, Quantum Generalized Linear Models
PyData Miami 2019, Quantum Generalized Linear ModelsPyData Miami 2019, Quantum Generalized Linear Models
PyData Miami 2019, Quantum Generalized Linear Models
 
Statistical Estimation
Statistical Estimation Statistical Estimation
Statistical Estimation
 
Two Means Independent Samples
Two Means Independent Samples  Two Means Independent Samples
Two Means Independent Samples
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
Estimation in statistics
Estimation in statisticsEstimation in statistics
Estimation in statistics
 

Similar to Poor man's missing value imputation

Missing Value imputation, Poor man's
Missing Value imputation, Poor man'sMissing Value imputation, Poor man's
Missing Value imputation, Poor man'sLeonardo Auslender
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial heteroEdda Kang
 
Variable and feature selection
Variable and feature selectionVariable and feature selection
Variable and feature selectionAaron Karper
 
SELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODSSELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODSKAMIL MAJEED
 
Prob and statistics models for outlier detection
Prob and statistics models for outlier detectionProb and statistics models for outlier detection
Prob and statistics models for outlier detectionTrilochan Panigrahi
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classificationcsandit
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...cscpconf
 
Credit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionCredit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionLalit Jain
 
A researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docxA researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docxevonnehoggarth79783
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)Anisa Aulia Sabilah
 
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Gingles Caroline
 
A Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano ClassifierA Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano Classifierjournal ijrtem
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IJDKP
 

Similar to Poor man's missing value imputation (20)

Missing Value imputation, Poor man's
Missing Value imputation, Poor man'sMissing Value imputation, Poor man's
Missing Value imputation, Poor man's
 
Bel ventutorial hetero
Bel ventutorial heteroBel ventutorial hetero
Bel ventutorial hetero
 
Variable and feature selection
Variable and feature selectionVariable and feature selection
Variable and feature selection
 
Dimensionality Reduction for Classification with High-Dimensional Data
Dimensionality Reduction for Classification with High-Dimensional DataDimensionality Reduction for Classification with High-Dimensional Data
Dimensionality Reduction for Classification with High-Dimensional Data
 
SELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODSSELECTED DATA PREPARATION METHODS
SELECTED DATA PREPARATION METHODS
 
Prob and statistics models for outlier detection
Prob and statistics models for outlier detectionProb and statistics models for outlier detection
Prob and statistics models for outlier detection
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classification
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
 
report
reportreport
report
 
Credit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly DetectionCredit Card Fraud Detection - Anomaly Detection
Credit Card Fraud Detection - Anomaly Detection
 
Data Analyst - Interview Guide
Data Analyst - Interview GuideData Analyst - Interview Guide
Data Analyst - Interview Guide
 
A researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docxA researcher in attempting to run a regression model noticed a neg.docx
A researcher in attempting to run a regression model noticed a neg.docx
 
Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)
Chapter 7 Beyond The Error Matrix (Congalton & Green 1999)
 
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...Histogram-Based Method for Effective Initialization of the K-Means Clustering...
Histogram-Based Method for Effective Initialization of the K-Means Clustering...
 
A Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano ClassifierA Magnified Application of Deficient Data Using Bolzano Classifier
A Magnified Application of Deficient Data Using Bolzano Classifier
 
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
IDENTIFICATION OF OUTLIERS IN OXAZOLINES AND OXAZOLES HIGH DIMENSION MOLECULA...
 

More from Leonardo Auslender

4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdfLeonardo Auslender
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdfLeonardo Auslender
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdfLeonardo Auslender
 
4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdfLeonardo Auslender
 
4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdfLeonardo Auslender
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdfLeonardo Auslender
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdfLeonardo Auslender
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07Leonardo Auslender
 

More from Leonardo Auslender (20)

1 UMI.pdf
1 UMI.pdf1 UMI.pdf
1 UMI.pdf
 
Ensembles.pdf
Ensembles.pdfEnsembles.pdf
Ensembles.pdf
 
Suppression Enhancement.pdf
Suppression Enhancement.pdfSuppression Enhancement.pdf
Suppression Enhancement.pdf
 
4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf4_2_Ensemble models and gradient boosting2.pdf
4_2_Ensemble models and gradient boosting2.pdf
 
4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf4_5_Model Interpretation and diagnostics part 4_B.pdf
4_5_Model Interpretation and diagnostics part 4_B.pdf
 
4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf4_2_Ensemble models and grad boost part 2.pdf
4_2_Ensemble models and grad boost part 2.pdf
 
4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf4_2_Ensemble models and grad boost part 3.pdf
4_2_Ensemble models and grad boost part 3.pdf
 
4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf4_5_Model Interpretation and diagnostics part 4.pdf
4_5_Model Interpretation and diagnostics part 4.pdf
 
4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf4_3_Ensemble models and grad boost part 2.pdf
4_3_Ensemble models and grad boost part 2.pdf
 
4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf4_2_Ensemble models and grad boost part 1.pdf
4_2_Ensemble models and grad boost part 1.pdf
 
4_1_Tree World.pdf
4_1_Tree World.pdf4_1_Tree World.pdf
4_1_Tree World.pdf
 
Classification methods and assessment.pdf
Classification methods and assessment.pdfClassification methods and assessment.pdf
Classification methods and assessment.pdf
 
Linear Regression.pdf
Linear Regression.pdfLinear Regression.pdf
Linear Regression.pdf
 
4 MEDA.pdf
4 MEDA.pdf4 MEDA.pdf
4 MEDA.pdf
 
2 UEDA.pdf
2 UEDA.pdf2 UEDA.pdf
2 UEDA.pdf
 
3 BEDA.pdf
3 BEDA.pdf3 BEDA.pdf
3 BEDA.pdf
 
1 EDA.pdf
1 EDA.pdf1 EDA.pdf
1 EDA.pdf
 
0 Statistics Intro.pdf
0 Statistics Intro.pdf0 Statistics Intro.pdf
0 Statistics Intro.pdf
 
0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf0 Model Interpretation setting.pdf
0 Model Interpretation setting.pdf
 
4 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-074 2 ensemble models and grad boost part 3 2019-10-07
4 2 ensemble models and grad boost part 3 2019-10-07
 

Recently uploaded

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 

Recently uploaded (20)

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 

Poor man's missing value imputation

  • 1. POOR MAN’S MISSING VALUE RECOVERY METHODS By Leonardo E. Auslender, ATT/Bell Labs1 Introduction. I describe two fast nonparametric methods to impute missing values, especially appropriate for the working environment found in database marketing: a) very tight deadlines, b) the need for an ‘easy ’ algorithm to ‘score ’ a large database, c) massive data sets for model development, and d) a not very sympathetic audience to the plights of modelers. The Problem. The presence of missing values is characteristic of large data bases. Databases with more than 100 million observations and thousands of variables (both business and demographic information, for instance) are not uncommon. Under these circumstances, almost 100% of the observations contain at least one variable with a missing value. While missing value imputation has been and is extensively analyzed by statisticians and others (see the classic Rubin and Little, 1987 or any recent JASA). These methods are difficult to implement in a DM enviroment because: 1) they require more time than usually available. Time deadlines are sometimes two days to complete a profiling and modeling project. Imputing missing values by modeling implies searches for model specifications (variable selection, functional form) and inference, which are time consuming. 2) the resulting imputing algorithm/s are not easily implementable in large databases, in a quick, efficient and error-free manners. Implementation errors could produce catastrophic business results. Therefore, the algorithm/s must be ‘simple ’, and cannot exceed present hardware and software constraints in the frame of massive databases. 3) the ad hoc methods, such as ascribing the overall mean or a constant (such as zero, to prevent dropping the observation altogether, as SAS would do) could be more harmful than doing nothing (Little and Rubin, 1987). Imputing a mean reduces the variance of the entire distribution arbitrarily, and thus changes the empirical distribution function. Further, imputing a mean to a categorical variable needs at least suspension of disbelief. To delete all those observations with some missing values or all those variables which have some missing values, is another approach of the ‘barbarian’ persuasion.. The Poor Man’s methods I advocate in this paper are based on imputing missing values based on the univariate and multivariate empirical distributions. The first method, which is simpler, sacrifices all the multivariate information available for the sake of meeting a deadline. The marginal densities are not affected if the data are missing at random. The second method tries to remedy the lack of multivariability, within the constraints just mentioned. Poor Man’s Imputation Methods (PMIM) 1) Univariate cumulative empirical distribution Let x1...,xn, be n independent observations from a distribution F(x). Let the real world, represented by F(x), be replaced by the observations x(1) < x(2) < .... x(n), and we represent these data by the empirical function Fn(x) putting probability 1/n on each observed value. 0 x < x(1) Fn(x) = # (xi ≤ x) = { k / n x(k) ≤ x < x(k+1) , k < n (1) n 1 x(n) ≤ x Based on (1), discretize each of the variables with at least one missing value. The number of bands or categories is a macro parameter, in this paper fixed at ten, and upper-bounded by computer resources. The larger this parameter, the greater the detail and accuracy, and the greater the computer resources utilized. 
By discretizing, the information for each variable with some missing information will be temporarily replaced by the band within which each variable within each observation belongs. If the variable is binary (which occurs very frequently in data base marketing), the number of bands will be obviously two. Even 1 I wish to thank Andreas Buja and Chuanhai Liu of Bell Labs/Statistics Research for the extensive discussions on this paper. The usual caveats apply.
  • 2. for continuous variables, it is seldom the case that those variables will be discretized in 10 bands (if the number of bands chosen is 10), because there are regions of the range which are empty. Once the variables are discretized, corresponding means and standard errors are estimated for each band within each variable. Likewise, the cumulative percentage of observations along the bands is obtained. This information contains all the elements that define the empirical distribution function for each variable with missing values. Given the missing observation i for variable j, generate a random number R from a uniform distribution, and a random number Normal from a Normal Distribution N(0,1). For Perci ≤ R < Perci+1, the rule is then to impute Meani + SEi Normal. As a comment, overall mean imputation is equivalent to applying this method with only one band. In this sense, overall mean imputation is an extreme case. 1.1) Example of SAS code used to impute the variable VAR2 IF VAR2 = . THEN DO; AUX = RANUNI(9); IF 0 < AUX <= 0.1703606086 THEN VAR2 = 616.90998594 + RANNOR(8) * 2.6743066012 ; END; ELSE IF 0.1703606086 < AUX <= 0.3429974841 THEN VAR2 = 933.89312977 + RANNOR(8) * 2.6652926231 ; END; ... END /* IF VAR2 = .*/; OUTPUT ; 1.2) Criticisms of the method and possible improvements. 1) Incorporate the multivariate structure into the method: a multivariate extension of the above method of discretization is not directly possible because of the curse of dimensionality. For instance, let assume ten binary variables with missing values. The number of possible patterns is then 210 = 1024, and despite large volumes of data it is not always possible to obtain all patterns for all cases of missingness. If missingness is itself considered to be a category, i.e. each binary becomes trinary, then the number of interesting patterns would be 310 - 1 = 59048 (the case “all missing” is clearly uninformative). If we consider that present databases contain perhaps thousands of categorical and continuous variables, the number of possible patterns would far exceed the information content of the data base. 2) As an implication from point 1 above, the method does not address missing data not at random. In the ideal case in which the mechanism generating missingness were known, it would be possible to model and thus impute the values, which could perhaps be done in a prompt, efficient and scorable ways Missingness at random does however occur, such as when matching different data sources by name or by other index, the corresponding information of which is not very accurate, updated or perhaps changed (such as change of name due to change in marital status, death, etc). 3) Since the method leaves the univariate distributions unaffected while disregarding multivariate information, it most likely diminishes the structure of interdependencies among variables, which affect modeling and profiling results. 2) Multivariate Cumulative Empirical Distribution The curse of dimensionality prevents the direct generalization of the previous method into a multivariate dimension. Since there is a need to limit the number of variables, let us invite modeling. For the sake of presentation, I chose as a modeling method of variable selection the best five correlated variables with the one with missing values. The number five is a macro parameter, and together with the number of bands must be determined with a view to computing constraints. 
Since variables with missing values might be highly correlated with other missing-value variables, the method proceeds by imputing one variable at a time, in increasing order of percentage of missingness. Once the first variable is imputed, the subsequent correlation searches are performed on the imputed data set (a caveat on this procedure is mentioned later). For each observation in which the variable to be imputed, say mvar, is not missing, categorize each of the five selected variables into its corresponding band. That is, the information contained in these five variables is categorized and patterned. Find the corresponding means and standard errors of mvar for each of the patterns, as well as for partial patterns: for instance, if the pattern 11111 exists, also find the means and standard errors for the patterns .1111, 1.111, 11.11, ..., ..111, etc. Store all this information in formats; a sketch of these steps follows.
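For concreteness (this is not the paper's original code), the pattern summarization and format creation might look roughly as follows; the banded predictors BAND_V1-BAND_V5, the target MVAR, and all data set names are hypothetical:

   /* Build a five-character pattern key from the banded predictors    */
   /* and summarize MVAR within each pattern over the observed cases.  */
   data patterns;
      set dev_banded;
      where mvar ne .;                 /* development: MVAR observed */
      length pattern $5;
      pattern = cats(put(band_v1,1.), put(band_v2,1.), put(band_v3,1.),
                     put(band_v4,1.), put(band_v5,1.));
   run;

   proc summary data=patterns nway;
      class pattern;
      var mvar;
      output out=pstats mean=p_mean stderr=p_se;
   run;

   /* Turn the pattern means into a character format, so that scoring  */
   /* reduces to a PUT() lookup; a $PSE. format is built the same way  */
   /* from P_SE.  Partial patterns (.1111, 1.111, ...) would be added  */
   /* by repeating the summarization with one band position blanked.   */
   data fmt_mean;
      set pstats;
      retain fmtname 'PMEAN' type 'C';
      start = pattern;
      label = put(p_mean, best16.);
   run;

   proc format cntlin=fmt_mean;
   run;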
For mvar missing, find the pattern of the five variables and impute the mean plus a normal random number times the standard error. If the pattern is not present in the summarization just performed (recall the curse of dimensionality), search through the partial patterns to find the closest one; a sketch of this scoring step appears at the end of this section.

2.1) Criticisms of the method and possible improvements.

1) The present 'modeling' method is very crude and ad hoc, and many possible replacements come to mind: the multirelation coefficient (Dresner, 1995), principal components, stepwise or backward regression, logistic regression, etc. The difficulty of the modeling method does not affect the scoring of the database, because modeling only enters at the development stage, in the creation of the formats needed for imputation.

2) The use of just-imputed variables to impute other variables inflates the variances of the later-imputed variables, which is not easily accounted for.

3) The method takes longer than the univariate method, and scoring requires storing large format libraries. If the scoring is not done in SAS, translating these enormous formats could be infeasible. On the other hand, were the scoring done in SAS, updates of the format libraries would be relatively straightforward.

4) When data are missing not at random, the method can provide some insight into the mechanism by detailing the distribution of patterns: missingness not at random will show itself as a pronounced lack of patterns in some region of the data space. The method could then be modified to weight the information differently in those regions.
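Closing the section, a minimal sketch of the scoring step with a crude partial-pattern fallback (not the paper's original code); it assumes the $PMEAN. and $PSE. formats built above also contain entries for the partial patterns, and all data set and variable names are hypothetical:

   /* Score-time imputation of MVAR: look up the full pattern first;   */
   /* if it is absent, blank out one band at a time and retry.  A band */
   /* that is itself missing prints as '.', matching the partial       */
   /* patterns.  An unmatched value passes through a SAS character     */
   /* format unchanged, which is how "not found" is detected below.    */
   data scored;
      set score_file;
      length pattern try $5 meanlbl $16;
      pattern = cats(put(band_v1,1.), put(band_v2,1.), put(band_v3,1.),
                     put(band_v4,1.), put(band_v5,1.));
      if mvar = . then do;
         try = pattern;
         if put(try, $pmean.) = try then      /* full pattern absent   */
            do i = 1 to 5 until (put(try, $pmean.) ne try);
               try = pattern;                 /* blank out band i and  */
               substr(try, i, 1) = '.';       /* retry (crude search)  */
            end;
         meanlbl = put(try, $pmean.);
         if meanlbl ne try then               /* some pattern matched  */
            mvar = input(meanlbl, 16.)
                   + rannor(8) * input(put(try, $pse.), 16.);
         /* Otherwise no full or partial pattern was found: MVAR stays */
         /* missing, and a wider partial-pattern search is needed.     */
      end;
      drop pattern try meanlbl i;
   run;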
3) Empirical application.

Demographic information is commonly used in segmentation applications. In our case, the focus of the research is classifying customers as youngies (below 26 years of age) or oldies (otherwise). The age information commonly available from census sources contains a large proportion of missing values. The study involves two parts: first, a profile of the distribution of the imputed values of selected variables (the top three variables of an original regression tree); second, a model that classifies prospects into one group or the other. It is important to see how missing value imputation affects model performance. Customer profiling is omitted due to corporate concerns. Our development data set contains about 10,000 observations with more than 100 variables. For the sake of brevity, validation results are omitted, as is most of the information that plays no vital role in this study. The classification model was obtained by running a tree regression program because, at least in the case of categorical missing values, tree regression programs provide a solution; further, when I tried logistic regression, convergence was not attained. I present results for the following cases: a) original data (which contained some missingness), to which I added missingness at random in key variables; b) deleting observations with missing values; c) deleting variables with missing values; d) imputation of missing values by corresponding means; e) univariate PMIM; f) multivariate PMIM. The graphical representations of the regression trees are drastically abbreviated for the sake of space.

Data description

Eighteen percent of the observations of the variables VAR1 (binary), VAR2 (continuous) and VAR3 (continuous) were set to missing at random, independently of each other (Table 1).

TABLE 1: DATA DESCRIPTION

CASE            TOTAL OBS   CATEGORY   FREQUENCY   % CATEG.   % OTHER CATEG.   GUESS=OUTCOME NONINFO   SST = N x P x Q
a, c, d, e, f      10101    OLDIE          5009      49.59         50.41               50.00                2525.079
                            YOUNGIE        5092      50.41         49.59               50.00                2525.079
b                   4025    OLDIE          2307      57.32         42.68               51.07                 984.702
                            YOUNGIE        1718      42.68         57.32               51.07                 984.702
Case b), which eliminates all observations with some missing values, reduces the effective development sample size by more than 50% and also changes the proportions of the dependent variable, thereby affecting modeling and profiling results.

Distributions of the imputed variables

The continuous variables graphed below were rescaled to fit between 0 and 100.

VAR1 (binary), missing values only:

METHOD          % .     % 0     % 1    MEDIAN   MODE CATEGORY   MODE %
TRUE             .     47.30   52.70      1           1          52.70
UNIVARIATE       .     44.80   55.20      1           1          55.20
MULTIVARIATE     .     47.58   52.42      1           1          52.42
VAR1 (binary), entire file:

METHOD          % .     % 0     % 1    MEDIAN   MODE CATEGORY   MODE %
TRUE             .     45.92   54.08      1           1          54.08
UNIVARIATE       .     45.47   54.53      1           1          54.53
MULTIVARIATE     .     45.97   54.03      1           1          54.03

Legend for the schematic plots below: MIN = 0, Q1 = 1, MEDIAN = 2, MEAN = M, Q3 = 3, MAX = 4, OVERPRINT = *, REFERENCE LINE AT 50 = |.

VAR1, missing values only (0 to 100):
              *------------------------------------------------------------*
MEAN IMP      |*--------------------|-----------*-------------------------*|
              *------------------------------------------------------------*
VAR1, full file (0 to 100):
              *------------------------------------------------------------*
MEAN IMP      |*--------------------|-----------*-------------------------*|
              *------------------------------------------------------------*

VAR1 is a binary variable, imputed as continuous by the mean imputation method, which collapses the mean, median, Q1 and Q3 at one point. The univariate and multivariate methods are closer to the true distribution, especially the multivariate method (see the two tables above).

VAR2 (continuous), missing values only (0 to 100):
TRUE          |0-------1-----------2|--M----------------3-----------------4|
MEAN IMP      |0---------1----------|--*--------3-------------------------4|
UNIVARIATE    |0-----1-------------2|-M---------------3-------------------4|
MULTIVARIATE  |0---1-------------2--M-----------------3-------------------4|
VAR2 (continuous), full file (0 to 100):
TRUE          |0-------1------------2--M---------------3------------------4|
MEAN IMP      |0---------1----------|--*--------3-------------------------4|
UNIVARIATE    |0--------1-----------2--M--------------3-------------------4|
MULTIVARIATE  |0---------1----------2--M------------3---------------------4|

Mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate methods are closer to the true distribution, particularly the univariate method in the missing-values-only case, as graphed in the two plots above.
VAR3 (continuous), missing values only (0 to 100):
TRUE          |0-1-----2---------M--|------3------------------------------4|
MEAN IMP      |0---1-------2----M3--|-------------------------------------4|
UNIVARIATE    |*-----2---------M----|---3---------------------------------4|
MULTIVARIATE  |0----1-------------2-|M--------------3---------------------4|
VAR3 (continuous), full file (0 to 100):
TRUE          |0-1-----2---------M--|------3------------------------------4|
MEAN IMP      |0---1-------2-----M-3|-------------------------------------4|
UNIVARIATE    |0-1-----2---------M--|----3--------------------------------4|
MULTIVARIATE  |0---1---2---------M--|------3------------------------------4|

Mean imputation has again shrunk the distribution of the data, this time VAR3, while the univariate and multivariate methods are closer to the true distribution. The univariate distribution has probably collapsed too much towards the low end of the distribution when we view the distribution of the imputed missing values only.
IMPUTATION ACCURACY

MEAN SQ ERROR OF IMPUTATION    MEAN IMPUTATION   UNIVARIATE IMPUTATION   MULTIVARIATE IMPUTATION
VAR1                                  0.25                0.25                    0.25
VAR2                            1119839.15          1105777.94               717651.84
VAR3                                 26.63               23.14                   17.69

On average, the PMIM are more accurate than mean imputation.
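These accuracies can be computed because the masked true values are known. A minimal sketch for one variable (not the paper's code): SET_MISSING flags the artificially masked cells, and TRUE_VAR2 and IMP_VAR2 hold the true and imputed values; all names are hypothetical:

   /* Mean squared imputation error for VAR2 over the artificially     */
   /* masked observations only.                                        */
   data _null_;
      set dev_imputed end=last;
      where set_missing = 1;
      retain sse 0 n 0;
      sse + (imp_var2 - true_var2)**2;
      n + 1;
      if last then do;
         mse = sse / n;
         put 'MSE for VAR2 = ' mse best12.;
      end;
   run;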
MODEL COMPARISONS

I compare the resulting trees. Note that I do not present case c), deleting variables with any missing observations, because the resulting tree is extreme, lacks interest, and performs very poorly.

ORIGINAL CLASSIFICATION TREE MODEL, CASE a)
[Tree diagram: root split on VAR1 (0 vs. 1); secondary splits on VAR2, with cutoffs between 1697 and 1698 and between 1682 and 1683; leaves labeled y (youngie) / o (oldie).]

DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
[Tree diagram: root split on VAR2 at 1610.5; secondary splits on VAR1 (0 vs. 1); leaves labeled y / o.]

MEAN IMPUTATION, CASE d)
[Tree diagram: root split on VAR2 at 1387.5; secondary splits on VAR3 (0 vs. 1) and on VAR1 at .37; leaves labeled y / o.]

UNIVARIATE IMPUTATION, CASE e)
[Tree diagram: root split on VAR3 (0 vs. 1); secondary splits on VAR2, with cutoffs between 1446 and 1447 and between 1708 and 1709; leaves labeled y / o.]

MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram: root split on VAR2 at 1708; secondary splits on VAR1 (0 vs. 1); leaves labeled y / o.]
3.1) Notes on the tree models

1) Cases b, d and f reverse the order of the most important variables when compared with case a), the original tree. It is important to mention that even case a) contained some missing information originally, which was eliminated in one way or another in cases b, d and f.

2) Case d) imputes a mean value to a binary variable (producing splits around VAR1 = .37), which is at best inappropriate. The cutoff value for VAR2 (continuous), around 1400, is far from the value estimated by the other methods, which is closer to 1700. This is probably due to the reduction in the variance of VAR2, and explains the poor modeling performance shown below.

3) The univariate imputation method, case e), uses VAR3 instead of VAR1 as the principal variable, and then follows a pattern similar to cases a) and f). VAR3 and VAR1 are highly correlated binary variables; however, VAR3 did not have any original or artificially generated missing values.

4) The trees for cases a) and f) are very similar. It is worth mentioning that case f) has imputed all missing values, both original and artificially created.

3.2) Model performance and diagnostics.

I measured performance using several statistics, defined as follows:
1) HITRATE: percentage of cases where the prediction agrees with the true state of nature.
2) T: standardized difference in mean predicted probability between oldies and youngies; the larger T, the better the model discriminates.
3) Number of nodes: simpler trees are preferable to larger ones.
4) Classification rate: percentage of a true state of nature predicted to be that same state.
5) True positive rate: of those predicted to be in a certain state of nature, the percentage that truly belong to that state.
6) TOP 50: cumulative percentage of youngies captured by the fifth decile.

The results clearly indicate the power of the Poor Man's methods. While mean imputation performs satisfactorily (T = 1.01), multivariate imputation is superior in all respects. Database marketers also look at the gains chart, summarized here in the column TOP 50: univariate imputation performs very satisfactorily, and multivariate imputation is still superior, while the performance of mean imputation has suffered.
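Before the results table, a minimal sketch (not the paper's code) of how such diagnostics can be tabulated, assuming a scored data set SCORED with variables TRUTH and PRED taking the values 'OLDIE' and 'YOUNGIE'; all names are hypothetical:

   /* Cross-tabulation: row percentages give the classification rates, */
   /* column percentages give the true positive rates.                 */
   proc freq data=scored;
      tables truth*pred;
   run;

   /* Overall hit rate: share of cases where prediction equals truth.  */
   data _null_;
      set scored end=last;
      retain hits 0 n 0;
      n + 1;
      hits + (pred = truth);
      if last then do;
         hitrate = 100 * hits / n;
         put 'HITRATE (%) = ' hitrate 6.2;
      end;
   run;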
DIAGNOSTICS FOR BINARY DEPENDENT VARIABLE

CASE                             HITRATE    T     # NODES   OLDIES      YNGIES      OLDIES      YNGIES      TOP 50
                                                            CLSF RATE   CLSF RATE   TRUE POS    TRUE POS    CUM %
AS IS                             75.53   1.09       6       72.23       78.77       77.00       74.25      74.94
CREATED MISSINGS & DELETED OBS    77.52   1.14       8       84.31       68.39       78.18       76.45      78.81
DELETING VARS WITH MISSINGS       73.50   0.93       8       85.49       54.97       74.60       71.00      70.48
MEAN IMPUTATION                   75.48   1.01       7       86.28       58.78       76.40       73.47      69.12
UNIVARIATE IMPUTATION             75.53   1.01       7       87.60       56.87       75.86       74.78      74.69
MULTIVARIATE IMPUTATION           78.58   1.15       8       86.81       65.85       79.73       76.34      76.60

4) Hardware considerations.

The macro systems were run on a Sun SPARC 10 with 256 MB of RAM. The entire run, including the regression trees and diagnostics, took less than one hour of elapsed time. However, running the multivariate imputation with five bands and 15 correlated variables took more than six hours of CPU time; its results are not reported here.

5) General comments and conclusion

The methods just presented provide alternatives for the beleaguered analyst in a fast-paced environment. Especially in the case of profiling, the Poor Man's methods perform better than the even more hurried mean imputation method. Model performance was very good, especially in the multivariate case. The number of bands is still a research area, and the correlation aspects of the multivariate imputation deserve further investigation. Possible improvements hinge especially on the modeling step, such as implementing Dresner's (1995) or Leahy's (1995) suggestions. It is also necessary to test the methods further under different conditions of missingness, such as different patterns and different percentages of missingness across variables. There is also an extensive literature on bandwidth selection (e.g., Thombs and Sheather, 1990).

6) Bibliography

Dresner, A. (1995): Multirelation - correlation among more than two variables, Computational Statistics and Data Analysis.
Leahy, K. (1995): Nature, prevalence, and benefits of suppression effects in direct response segmentation, presented at the 1995 American Statistical Association meetings.
Little, R.J.A. and Rubin, D.B. (1987): Statistical Analysis with Missing Data, Wiley.
Thombs, L. and Sheather, S. (1990): Local bandwidth selection for density estimation, Interface '90, Proceedings of the 22nd Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS - MULTIVARIATE IMPUTATION

Proc Univariate: determine variables with missing values, their ranges and minima.
Proc Corr:       determine the best correlated variables.
Data Step:       for each observation of each missing variable, determine the corresponding patterns.
Proc Summary:    for each pattern, determine the mean and SE of the missing variable.
Proc Format:     create formats of means, SEs and frequencies for every pattern.
Data Step:       for each missing observation, find the pattern and impute.
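As a closing illustration (not code from the paper), the diagram above could be wrapped in a hypothetical macro driver that processes the variables in increasing order of percentage of missingness, as the method prescribes; all names are assumptions:

   %macro impute_all(data=, mvars=);
      /* MVARS is assumed pre-sorted in increasing order of percent    */
      /* missing, per the method's description.                        */
      %local i v;
      %let i = 1;
      %do %while(%length(%scan(&mvars, &i)) > 0);
         %let v = %scan(&mvars, &i);
         /* For &v, run the steps of the diagram:                      */
         /*   proc corr    - pick the five best-correlated predictors  */
         /*   data step    - band them and build the pattern key       */
         /*   proc summary / proc format - pattern means and SEs       */
         /*   data step    - impute &v by format lookup                */
         %let i = %eval(&i + 1);
      %end;
   %mend impute_all;

   %impute_all(data=dev, mvars=var1 var3 var2);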