POOR MAN’S MISSING VALUE RECOVERY METHODS
By Leonardo E. Auslender, AT&T/Bell Labs [1]
Introduction.
I describe two fast nonparametric
methods to impute missing values,
especially appropriate for the working
environment found in database marketing: a)
very tight deadlines, b) the need for an
‘easy’ algorithm to ‘score’ a large
database, c) massive data sets for model
development, and d) an audience not very
sympathetic to the plights of modelers.
The Problem.
The presence of missing values is
characteristic of large data bases.
Databases with more than 100 million
observations and thousands of variables
(both business and demographic
information, for instance) are not uncommon.
Under these circumstances, almost 100% of
the observations contain at least one
variable with a missing value.
Missing value imputation has been
and continues to be extensively analyzed by
statisticians and others (see the classic
Rubin and Little, 1987, or any recent JASA).
Nevertheless, these methods are difficult to implement in a
DM environment because:
1) They require more time than is
usually available: deadlines are
sometimes as short as two days to complete a profiling
and modeling project. Imputing missing
values by modeling implies searches for
model specifications (variable selection,
functional form) and inference, which are
time consuming.
2) The resulting imputation algorithm(s)
are not easily implementable in large
databases in a quick, efficient and error-free
manner. Implementation errors could
produce catastrophic business results.
Therefore, the algorithm(s) must be
‘simple’, and cannot exceed present
hardware and software constraints in the
frame of massive databases.
3) The ad hoc methods, such as
ascribing the overall mean or a constant
(such as zero, to prevent dropping the
observation altogether, as SAS would do),
could be more harmful than doing nothing
(Little and Rubin, 1987). Imputing a mean
arbitrarily reduces the variance of the entire
distribution, and thus changes the
empirical distribution function. Further,
imputing a mean to a categorical variable
requires at least a suspension of disbelief.
Deleting all observations with some
missing values, or all variables
with some missing values, is another
approach of the ‘barbarian’ persuasion.
The Poor Man’s methods I
advocate in this paper impute missing
values from the
univariate and multivariate empirical
distributions. The first method, which is
simpler, sacrifices all the multivariate
information available for the sake of meeting
a deadline. The marginal densities are not
affected if the data are missing at random.
The second method tries to remedy the lack
of multivariate structure, within the constraints
just mentioned.
Poor Man’s Imputation Methods (PMIM)
1) Univariate cumulative
empirical distribution
Let x1, ..., xn be n independent
observations from a distribution F(x). Let the
real world, represented by F(x), be replaced
by the ordered observations x(1) < x(2) < ... < x(n),
and represent these data by the empirical
distribution function Fn(x), which puts probability
1/n on each observed value.
              #(xi ≤ x)     { 0     x < x(1)
    Fn(x)  =  ----------  = { k/n   x(k) ≤ x < x(k+1), k < n      (1)
                  n         { 1     x(n) ≤ x
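As a minimal illustration of (1) — in Python rather than the paper's SAS, and with a sample of my own — Fn can be computed directly from the data:

```python
# Illustrative sketch (not the paper's code): the empirical distribution
# function Fn of equation (1), putting probability 1/n on each observed value.
def ecdf(sample):
    xs = sorted(sample)
    n = len(xs)
    def Fn(x):
        # number of observations <= x, divided by n
        return sum(1 for xi in xs if xi <= x) / n
    return Fn

Fn = ecdf([3, 1, 4, 1, 5])
# Fn(0) is 0.0 (x below x(1)); Fn(1) is 0.4 (2 of 5 obs <= 1); Fn(5) is 1.0
```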
Based on (1), discretize each of the
variables with at least one missing value.
The number of bands or categories is a
macro parameter, in this paper fixed at ten,
and upper-bounded by computer resources.
The larger this parameter, the greater the
detail and accuracy, and the greater the
computer resources utilized. By discretizing,
each value of a variable with some
missing observations is temporarily
replaced by the band within which it
falls for each observation. If
the variable is binary (which occurs very
frequently in database marketing), the
number of bands is obviously two. Even
[1] I wish to thank Andreas Buja and Chuanhai Liu of Bell Labs/Statistics Research for the
extensive discussions on this paper. The usual caveats apply.
for continuous variables, it is seldom the
case that those variables will be discretized
in 10 bands (if the number of bands chosen
is 10), because there are regions of the
range which are empty.
Once the variables are discretized,
corresponding means and standard errors
are estimated for each band within each
variable. Likewise, the cumulative
percentage of observations along the bands
is obtained. This information contains all the
elements that define the empirical
distribution function for each variable with
missing values.
Given a missing observation i for
variable j, generate a random number R
from a uniform distribution, and a random
number Normal from a N(0,1) distribution.
For Perc_k ≤ R < Perc_{k+1}, the rule is
then to impute Mean_k + SE_k * Normal.
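The two steps just described — estimating each band's mean, standard error and cumulative percentage during development, and drawing from that table when scoring — can be sketched as follows. This is an illustrative Python sketch, not the paper's SAS macro; the names `band_table`, `impute` and `n_bands` are mine, and bands are formed naively by splitting the sorted sample into equal-count chunks:

```python
import random
import statistics

def band_table(values, n_bands=10):
    """Development step: (cumulative %, mean, se) for each band."""
    xs = sorted(values)
    n = len(xs)
    bands = []
    cum = 0.0
    for k in range(n_bands):
        chunk = xs[k * n // n_bands:(k + 1) * n // n_bands]
        if not chunk:                 # empty regions of the range yield no band
            continue
        cum += len(chunk) / n
        mean = statistics.mean(chunk)
        se = statistics.pstdev(chunk) / len(chunk) ** 0.5
        bands.append((cum, mean, se))
    return bands

def impute(bands, rng=random):
    """Scoring step: R ~ Uniform(0,1) picks a band; impute mean + se*N(0,1)."""
    r = rng.random()
    for cum, mean, se in bands:
        if r < cum:
            return mean + se * rng.gauss(0, 1)
    return bands[-1][1]               # floating-point edge case at r ~ 1
```

With one band, this degenerates to overall mean imputation (plus noise), matching the comment below.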
As a comment, overall mean
imputation is equivalent to applying this
method with only one band. In this sense,
overall mean imputation is an extreme case.
1.1) Example of SAS code used
to impute the variable VAR2
IF VAR2 = . THEN DO;
   AUX = RANUNI(9);
   IF 0 < AUX <= 0.1703606086 THEN DO;
      VAR2 = 616.90998594 + RANNOR(8) *
             2.6743066012 ;
   END;
   ELSE IF 0.1703606086 < AUX <= 0.3429974841
   THEN DO;
      VAR2 = 933.89312977 + RANNOR(8) *
             2.6652926231 ;
   END;
   ...
END; /* IF VAR2 = . */
OUTPUT ;
1.2) Criticisms of the method and
possible improvements.
1) Incorporate the multivariate
structure into the method: a multivariate
extension of the above method of
discretization is not directly possible
because of the curse of dimensionality. For
instance, let us assume ten binary variables
with missing values. The number of possible
patterns is then 2^10 = 1,024, and despite large
volumes of data it is not always possible to
obtain all patterns for all cases of
missingness. If missingness is itself
considered to be a category, i.e. each binary
becomes trinary, then the number of
interesting patterns would be 3^10 - 1 = 59,048
(the case “all missing” is clearly
uninformative). If we consider that present
databases contain perhaps thousands of
categorical and continuous variables, the
number of possible patterns would far
exceed the information content of the data
base.
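A quick check of the pattern arithmetic above:

```python
# With ten binary variables there are 2^10 full patterns; treating
# "missing" as a third category gives 3^10 - 1 informative patterns
# (the all-missing pattern is excluded as uninformative).
binary_patterns = 2 ** 10        # 1024
trinary_patterns = 3 ** 10 - 1   # 59048
```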
2) As an implication from point 1
above, the method does not address
missing data not at random. In the ideal case
in which the mechanism generating
missingness were known, it would be
possible to model and thus impute the
values, which could perhaps be done in a
prompt, efficient and scorable way.
Missingness not at random does however occur,
such as when matching different data
sources by name or by other index, where the
corresponding information is not
very accurate, not up to date, or has perhaps changed
(such as a change of name due to a change in
marital status, death, etc.).
3) Since the method leaves the
univariate distributions unaffected while
disregarding multivariate information, it most
likely diminishes the structure of
interdependencies among variables, which
affect modeling and profiling results.
2) Multivariate Cumulative
Empirical Distribution
The curse of dimensionality prevents
the direct generalization of the previous
method into a multivariate dimension.
Since there is a need to limit the
number of variables, let us invite modeling.
For the sake of presentation, I chose as the
variable-selection method the five variables
best correlated with the one
with missing values. The number five is a
macro parameter, and together with the
number of bands must be determined with a
view to computing constraints. Since
variables with missing values might be highly
correlated with other missing value
variables, the method proceeds by imputing
one variable at a time, in increasing order of
percentage of missingness. Once the first
variable is imputed, the following correlation
searches are performed with the imputed
data set (a caveat on this procedure will be
mentioned later on).
For each observation in which the
“to be” imputed variable, say mvar, is not
missing, categorize each one of the selected
variables into corresponding bands. That is,
the information contained in these five
variables is categorized and patterned. Find
corresponding means and standard errors of
mvar for each of the patterns, as well as for
partial patterns. For instance, if the pattern
11111 exists, also find the means and
standard errors for patterns .1111, 1.111,
11.11, ..., ..111, etc. Store all this information
in formats.
For mvar missing, find the
pattern of the five variables, and impute the
mean plus a normal random number times
the standard error. If the pattern is not
present in the summarization just performed
(recall the curse of dimensionality), search
through all partial patterns to find the closest
pattern.
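The pattern lookup with partial-pattern fallback can be sketched as follows. This is an illustrative Python version, not the paper's SAS implementation: the dictionary `stats` stands in for the stored formats, mapping each full or partial pattern (with '.' marking a blanked band) to the (mean, se) of mvar:

```python
import random
from itertools import combinations

def impute_mvar(pattern, stats, rng=random):
    """Impute mvar from a band pattern like '11011', falling back to the
    closest partial pattern (fewest blanked positions) when the full
    pattern was never observed in the development data."""
    if pattern in stats:
        mean, se = stats[pattern]
        return mean + se * rng.gauss(0, 1)
    k = len(pattern)
    # blank out one position at a time, then two, etc.
    for n_blank in range(1, k):
        for idx in combinations(range(k), n_blank):
            partial = "".join('.' if i in idx else c
                              for i, c in enumerate(pattern))
            if partial in stats:
                mean, se = stats[partial]
                return mean + se * rng.gauss(0, 1)
    return None                  # no pattern information at all
```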
2.1) Criticisms of the method and
possible improvements.
1) The present ‘modeling’ method
is very crude and ad-hoc, and many
possibilities to replace it come to mind. For
instance, the multirelation coefficient
(Dresner, 1995), principal components, a
stepwise or backwards regression or logistic
regression, etc. The scoring of the database
is not affected by the difficulty of the
modeling method, because the modeling
method only affects the development stage,
and the creation of formats necessary for the
imputation.
2) The use of just imputed variables
to impute other variables increases the level
of the variances of the imputed variables,
which is not easily accounted for.
3) The method takes longer than the
univariate method, and scoring requires the
storing of large format libraries. If the scoring
is not done in SAS, the transformation of
these enormous formats could be
unfeasible. On the other hand, were the
scoring done in SAS, new updates of the
format libraries would be relatively
straightforward.
4) In the case when data is not
missing at random, the method can provide
some insight about the mechanism, by
providing details of the distribution of
patterns. That is, missingness not at random
will show itself in a pronounced lack of
patterns in a region of the dimension of the
data. The method could then be modified to
weight the information differently in those
regions.
3) Empirical application.
Demographic information is
commonly used in segmentation
applications. In our case, the focus of
research lies in classifying customers as
youngies (below 26 years of age) or oldies
(otherwise). The age information commonly
available from census sources contains a
large proportion of missing values.
The study involves two parts. First, a
profile of the distribution of the imputed
values of selected variables (the top three
variables of an original regression tree).
Second, a model which classifies prospects
into one or the other. It is important to see
how missing value imputation affects model
performance. Customer profiling is omitted
due to corporate concerns.
Our development data set contains
about 10,000 observations with more than
100 variables. For the sake of brevity,
validation results will be omitted, as well as
most of the information that plays no vital
role in this study. The classification model
was obtained by running a tree regression
program, because at least in the case of
categorical missing values, tree regression
programs provide a solution. Further, when I
tried logistic regression, convergence was
not attained.
I present the results for the following
cases: a) original data (which contained
some missingness). At this point, I added
missingness at random in key variables; b)
deleting observations with missing values;
c) deleting variables with missing values; d)
estimation of missing values by
corresponding means, e) univariate PMIM, f)
multivariate PMIM. The regression trees
graphical representations will be drastically
abbreviated for the sake of space.
Data description
Eighteen percent of observations of
the variables VAR1 (binary), VAR2
(continuous) and VAR3 (continuous) were
set to missing at random, independently of
each other (Table 1).
----------------------------------------------------------------------------------
|DATA DESCRIPTION            |        |         |  PERC.  | %_CASES |            |
|                            |FREQUEN-|  PERC.  |  OTHER  | GUESS = |SST = N x P |
|TABLE 1                     |   CY   |  CATEG  |  CATEG  | OUTCOME | x Q        |
|                            |        |         |         | NONINFO |            |
|----------------------------+--------+---------+---------+---------+------------|
|CASE    |TOTAL  |OLD/YOUNG  |        |         |         |         |            |
|        |OBS    |           |        |         |         |         |            |
|--------+-------+-----------+--------+---------+---------+---------+------------|
|a c d e |10101  |OLDIE      |    5009|    49.59|    50.41|    50.00|    2525.079|
|f       |       |YOUNGIE    |    5092|    50.41|    49.59|    50.00|    2525.079|
|--------+-------+-----------+--------+---------+---------+---------+------------|
|b       |4025   |OLDIE      |    2307|    57.32|    42.68|    51.07|     984.702|
|        |       |YOUNGIE    |    1718|    42.68|    57.32|    51.07|     984.702|
----------------------------------------------------------------------------------
The case b), which eliminates all
observations with some missings, reduces
the effective development sample size by
more than 50%, and also changes the
proportion of observations of the dependent
variable, thereby affecting modeling and
profiling results.
Distributions of the imputed variables
The continuous variables graphed
below were rescaled to fit between 0 and
100.
---------------------------------------------------------
|VAR1 | CATEGORIES | | | |
|(binary) |--------------------| | | |
| | . | 0 | 1 | | MODE | |
|Missing |------+------+------| |CATEG-| |
|values only | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 47.30| 52.70| 1| 1| 52.70|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 44.80| 55.20| 1| 1| 55.20|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 47.58| 52.42| 1| 1| 52.42|
---------------------------------------------------------
---------------------------------------------------------
| VAR1 | CATEGORIES | | | |
| (binary)    |--------------------|      |      |      |
| | . | 0 | 1 | | MODE | |
| Entire |------+------+------| |CATEG-| |
| file | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 45.92| 54.08| 1| 1| 54.08|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 45.47| 54.53| 1| 1| 54.53|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 45.97| 54.03| 1| 1| 54.03|
---------------------------------------------------------
MIN = 0 Q1 = 1 MEDIAN = 2 MEAN = M Q3 = 3 MAX= 4 OVERPRINT = *
REFERENCE LINE AT 50 = |
VARIABLE MIN VAR1 MAX
0 Missing values only 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
0 Full File 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
VAR1 is a binary variable, imputed as continuous by the mean imputation method, which
collapses mean, median, q1 and q3 at one point. The univariate and multivariate methods are
closer to the true distribution, especially so the multivariate case (two tables above).
VARIABLE MIN VAR2 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-------1-----------2|--M----------------3-----------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0-----1-------------2|-M---------------3-------------------4|
MULTIVARIATE |0---1-------------2--M-----------------3-------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-------1------------2--M---------------3------------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0--------1-----------2--M--------------3-------------------4|
MULTIVARIATE |0---------1----------2--M------------3---------------------4|
*------------------------------------------------------------*
The mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate
methods are closer to the true distribution, particularly so the univariate method for the missing
values only case, as graphed in the previous two plots.
VARIABLE MIN VAR3 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2----M3--|-------------------------------------4|
UNIVARIATE |*-----2---------M----|---3---------------------------------4|
MULTIVARIATE |0----1-------------2-|M--------------3---------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2-----M-3|-------------------------------------4|
UNIVARIATE |0-1-----2---------M--|----3--------------------------------4|
MULTIVARIATE |0---1---2---------M--|------3------------------------------4|
*------------------------------------------------------------*
The mean imputation has again shrunk the distribution of the data, this time VAR3, while the
univariate and multivariate methods are closer to the true distribution. The univariate distribution
has probably collapsed too much towards the low end when we view the
distribution of the imputed missing values only.
IMPUTATION ACCURACY
-------------------------------------------------------------
|MEAN SQ ERROR OF | MEAN | UNIVARIATE |MULTIVARIATE|
|IMPUTATION | IMPUTATION | IMPUTATION | IMPUTATION |
|--------------------+------------+------------+------------|
|VARIABLE | | | |
|--------------------| | | |
|VAR1 | 0.25| 0.25| 0.25|
|--------------------+------------+------------+------------|
|VAR2 | 1119839.15| 1105777.94| 717651.84|
|--------------------+------------+------------+------------|
|VAR3 | 26.63| 23.14| 17.69|
-------------------------------------------------------------
On average, the PMIMs are more accurate than mean imputation.
MODEL COMPARISONS
I compare the resulting trees. Note that I do not present case c), the case of
deleting variables with any missing observations, because the resulting tree is extreme, lacks
interest, and performs very poorly.
ORIGINAL CLASSIFICATION TREE MODEL, CASE a)
[Tree diagram: root node YOUNGIE; first split on VAR1 (0 vs 1); secondary splits on VAR2 between 1697-1698 and 1682-1683.]
DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
[Tree diagram: root node YOUNGIE; first split on VAR2 at 1610.5; secondary splits on VAR1 (0 vs 1).]
MEAN IMPUTATION, CASE d)
[Tree diagram: root node YOUNGIE; first split on VAR2 at 1387.5; secondary splits on VAR3 (0 vs 1) and on VAR1 at .37.]
UNIVARIATE IMPUTATION, CASE e)
[Tree diagram: root node YOUNGIE; first split on VAR3 (0 vs 1); secondary splits on VAR2 between 1446-1447 and 1708-1709.]
MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram: root node YOUNGIE; first split on VAR2 at 1708; secondary splits on VAR1 (0 vs 1).]
3.1) Notes on the Tree models
1) Cases b, d and f reverse the order
of the most important variables
compared to case a), the original tree. It is
important to mention that even case a)
contained some missing information
originally, which has been somehow
eliminated in cases b, d and f.
2) Case d) imputes a mean value to
a binary variable (var1 <> .37), which is at
least inappropriate. The cutoff value for var2
(continuous), around 1400, is far from the
value estimated by the other methods, which
is closer to 1700. This is probably due to the
reduction in variance of var2, and explains
the poor modeling performance shown
below.
3) The univariate imputation method,
case e), utilizes var3 instead of var1 as the
principal variable, and then follows a pattern
similar to cases a) and f). Var3 and var1 are
binary variables, highly correlated. However,
Var3 did not have any original nor artificially
generated missing values.
4) Trees for cases a) and f) are very
similar. It is worth mentioning that case f) has
imputed all missing values, both original and
artificially created.
3.2) Models performance and
diagnostics.
I measured performance using
several statistics, defined as follows:
1) HITRATE: % of cases where the
prediction agrees with the true nature.
2) T: standardized difference in
probability means for oldies and youngies.
The larger the value of T, the better the
model discriminates.
3) Number of nodes: simpler trees
are preferable to larger ones.
4) Classification rate: percentage of
true state of nature predicted to be that
same state of nature.
5) True positive rate: of those
predicted to be a certain state of nature, the
percentage that truly belong in that state.
6) Top 50: Cumulative percentage of
Youngies captured at the 5th decile.
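As an illustration of how statistics 1), 4) and 5) relate to a 2x2 classification table, here is a hedged Python sketch; the function name and the counts used in the example are hypothetical, and T and TOP50 are omitted because they require the underlying predicted probabilities:

```python
def diagnostics(tab):
    """tab: 2x2 counts, rows = true class, columns = predicted class,
    in the order [oldie, youngie]."""
    (oo, oy), (yo, yy) = tab
    n = oo + oy + yo + yy
    hitrate = 100 * (oo + yy) / n              # 1) overall agreement
    oldie_clsf_rate = 100 * oo / (oo + oy)     # 4) true oldies predicted oldie
    yngie_clsf_rate = 100 * yy / (yo + yy)
    oldie_true_pos = 100 * oo / (oo + yo)      # 5) predicted oldies truly oldie
    yngie_true_pos = 100 * yy / (oy + yy)
    return (hitrate, oldie_clsf_rate, yngie_clsf_rate,
            oldie_true_pos, yngie_true_pos)

# e.g. diagnostics([[80, 20], [30, 70]]) gives a hitrate of 75.0
```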
The results clearly indicate the
power of the Poor man’s methods. While
mean imputation performs satisfactorily (T =
1.01), multivariate imputation is superior in
all aspects. Database marketers also look at
the gainschart, here summarized in the
column TOP50.
While univariate imputation performs
very satisfactorily, multivariate imputation is
still superior. The performance by mean
imputation has suffered, however.
---------------------------------------------------------------------------------------
|DIAGN. FOR BINARY DEP VARIABLE| | | # | | |OLDIES |YNGIES | TOP |
| | | |NO- |OLDIES |YNGIES | TRUE | TRUE | 50 |
| |HITRA-| |DES | CLSF | CLSF | POS | POS |CUM |
| | TE | T | | RATE | RATE | RATE | RATE | % |
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CASE | | | | | | | | |
|------------------------------| | | | | | | | |
|AS IS | 75.53|1.09| 6| 72.23| 78.77| 77.00| 74.25|74.94|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CREATED MISSINGS & DELETED OBS| 77.52|1.14| 8| 84.31| 68.39| 78.18| 76.45|78.81|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|DELETING VARS WITH MISSINGS | 73.50|0.93| 8| 85.49| 54.97| 74.60| 71.00|70.48|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MEAN IMPUTATION | 75.48|1.01| 7| 86.28| 58.78| 76.40| 73.47|69.12|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|UNIVARIATE IMPUTATION | 75.53|1.01| 7| 87.60| 56.87| 75.86| 74.78|74.69|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MULT IMPUTATION | 78.58|1.15| 8| 86.81| 65.85| 79.73| 76.34|76.60|
---------------------------------------------------------------------------------------
4) Hardware considerations.
The macro systems were run on a
Sun Sparc 10 with 256 MB of RAM. The entire
run, including regression trees and
diagnostics, took less than one hour of elapsed
time. However, running the multivariate
imputation with five bands and 15 correlated
variables took more than 6 hours of CPU
time, and its results are not reported here.
5) General comments and
conclusion
The methods just presented provide
alternatives to the beleaguered analyst in a
fast-paced environment. Especially in the
case of profiling, the poor man’s methods
perform better than the even more hurried
mean imputation method. The model
performance was very good, especially for
the multivariate case. The number of bands
is still a research area, and the correlation
aspects in the multivariate imputation
deserve further investigation.
Possible areas of improvement
hinge especially on the modeling stage,
such as implementing Dresner’s
or Leahy’s (1995) suggestions. It is also
necessary to further test the methods under
different conditions of missingness, such as
different patterns and different percentages
of missingness across variables. There is an
extensive literature in the area of bandwidth
selection as well (e.g., Thombs and
Sheather, 1990).
6) Bibliography
Dresner, A. (1995): Multirelation - correlation
among more than two variables,
Computational Statistics and Data Analysis.
Leahy, K. (1995): Nature, prevalence, and
benefits of suppression effects in direct
response segmentation, presented at 1995
American Statistical Association meeting.
Rubin, D. and Little, R. (1987): Statistical
Analysis with Missing Data, Wiley.
Thombs, L. and Sheather, S. (1990): Local
bandwidth selection for density estimation,
Interface ‘90, Proceedings of the 22nd
Symposium of the Interface.
BRIEF DIAGRAM OF SAS STEPS
Multivariate Imputation
Proc Univariate: determine vars with missing values, ranges and minima.
Proc Corr: determine the best correlated variables.
Data Step: for each obs of each missing var, determine corresponding patterns.
Proc Summary: for each pattern, determine the mean and se of the missing variable.
Proc Format: create formats of mean, se and frequencies for every pattern.
Data Step: for each missing obs, find the pattern and impute.