MISSING IMPUTATION METHOD FOR LARGE DATABASES
By Leonardo E. Auslender
Chase Manhattan Mortgage Corp.
Introduction.
In Auslender (1996), I proposed
nonparametric methods for univariate and
multivariate missing value imputation of large
databases in a database marketing
environment. In this environment, the analyst
faces tight deadline constraints due to
marketing campaign pressures, and the
'demand' (mostly from those who
'score' the algorithms) for equations that
are easy to code and that will not impose
extraordinary demands on existing
hardware or on time to market.
In this paper, I continue the
development of the multivariate algorithm
proposed in Auslender (1996). I first briefly
describe the multivariate method proposed
earlier. Second, I propose further
developments of the algorithm, which are
tested in a third section. I end the paper with
conclusions and suggestions for future
research.
1) Multivariate Imputation Method
A typical database contains
thousands (if not millions) of observations.
There are usually fewer variables, but
their number can nonetheless easily run into
the thousands. The information contained in
these databases is usually obtained from
different sources, both internal and external,
but we will not enter into the problems of data
integration, quality, and scalability.
In this context, the presence of
missing observations is almost assured. In
Auslender (1996), in addition to the
constraints imposed by database marketing,
the multivariate method of imputation faced
the curse of dimensionality in effectively
estimating the empirical multivariate
distribution function of the variable whose
values are being imputed. As a
remedy, I proposed modeling on a handful
of variables, with each variable categorized
into as many bins as hardware constraints allow.
Variable selection was performed by
finding the 'p' variables best correlated
with the variable being imputed. The value
'p' and the maximum number of bins 'k'
are user parameters. In practice, for a
database containing a few million
observations and, say, 1000 variables, all of
which must be imputed, a wise user would
choose p <= 10 and k <= 5. The unwise
user will be sorry for doing otherwise.
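As an illustration, the selection and binning step can be sketched as follows. This is a minimal Python sketch, not the original SAS macro; the function name select_and_bin and the quantile-based binning are my assumptions about a reasonable implementation.

```python
import numpy as np
import pandas as pd

def select_and_bin(df, mvar, p=10, k=5):
    """Pick the p variables most correlated (in absolute value) with
    mvar, then discretize each into at most k quantile-based bins."""
    corr = df.corr()[mvar].drop(mvar).abs()
    predictors = corr.nlargest(p).index.tolist()
    binned = pd.DataFrame(index=df.index)
    for col in predictors:
        # duplicates="drop" guards against ties producing fewer than k bins
        binned[col] = pd.qcut(df[col], q=k, labels=False, duplicates="drop")
    return predictors, binned
```

Quantile binning is only one choice; equal-width bins, or bins dictated by hardware constraints as in the paper, work with the same interface.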
Since missing values in one variable
might be highly correlated with missing values
in others, the method eliminates those
independent variables whose ranges of
missingness overlap. Further, the method proceeds
by imputing one variable at a time, in
increasing order of missingness.
For each observation in which the
to-be-imputed variable, say mvar, is not
missing, categorize each of the selected
variables into its corresponding bands. That is,
the information contained in these five
variables is categorized and patterned. Find
the corresponding means and standard errors of
mvar for each of the patterns, as well as for
partial patterns. For instance, if the pattern
11111 exists, also find the means and
standard errors for the patterns .1111, 1.111,
11.11, ..., ..111, etc. Store all this information
in formats.
When mvar is missing, find the
pattern of the five variables, and impute the
mean plus a normal random number times
the standard error. If the pattern is not
present in the summarization just created
(recall the curse of dimensionality?), search
through all partial patterns to find the closest
pattern.
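The pattern summarization and imputation draw described above can be sketched as follows. This is again an illustrative Python version, not the paper's SAS formats; the one-position-wildcard fallback is a simplification of the closest-partial-pattern search.

```python
import numpy as np
import pandas as pd

def fit_patterns(binned, mvar_values):
    """Mean and standard error of mvar for every full bin pattern
    observed among the non-missing cases."""
    key = binned.astype(int).astype(str).apply(lambda r: "".join(r), axis=1)
    return {pat: (g.mean(),
                  (g.std(ddof=1) / np.sqrt(len(g))) if len(g) > 1 else 0.0)
            for pat, g in mvar_values.groupby(key)}

def impute_one(pattern, table, rng):
    """Impute mean + z * se for the observation's bin pattern; if the
    full pattern is unseen, fall back to patterns that agree on all
    but one position, then to the grand mean over all patterns."""
    if pattern in table:
        mean, se = table[pattern]
        return mean + rng.standard_normal() * se
    for i in range(len(pattern)):
        matches = [v for pat, v in table.items()
                   if pat[:i] + pat[i + 1:] == pattern[:i] + pattern[i + 1:]]
        if matches:
            return float(np.mean([m for m, _ in matches]))
    return float(np.mean([m for m, _ in table.values()]))
```

In the paper's setting the summaries would be stored as SAS formats and the draw performed in a scoring data step; the dictionary here stands in for that format library.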
2) Possible improvements and caveats.
1) The present 'modeling' method
is very crude and ad hoc, and many
possibilities to replace it come to mind: for
instance, the multirelation coefficient
(Dresner, 1995), principal components, stepwise
or backwards regression, or logistic
regression. The scoring of the database
is not affected by the complexity of the
modeling method, because the modeling
method only affects the development stage
and the creation of the formats necessary for the
imputation.
2) The use of just-imputed variables
to impute other variables increases the
variances of the imputed variables,
which is not easily accounted for.
3) The method takes longer than the
univariate method, and scoring requires
storing large format libraries. If the scoring
is not done in SAS, translating
these enormous formats could be
infeasible. Were the scoring done in SAS,
on the other hand, new updates of the
format libraries would be relatively
straightforward.
4) When data are not
missing at random, the method can provide
some insight into the missingness mechanism
by detailing the distribution of
patterns. That is, missingness not at random
will show itself as a pronounced lack of
patterns in some region of the data space.
The method could then be modified to
weight the information differently in those
regions.
3) Empirical application.
Demographic information is
commonly used in segmentation
applications. In our case, the focus of the
research lies in classifying customers as
youngies (below 26 years of age) or oldies
(otherwise). The age information commonly
available from census sources contains a
large proportion of missing values.
The study involves two parts: first, a
profile of the distribution of the imputed
values of selected variables (the top three
variables of an original regression tree);
second, a model that classifies prospects
into one group or the other. It is important to see
how missing value imputation affects model
performance. Customer profiling is omitted
due to corporate concerns.
Our development data set contains
about 10,000 observations with more than
100 variables. For the sake of brevity,
validation results are omitted, as is
most of the information that plays no vital
role in this study. The classification model
was obtained by running a tree regression
program because, at least in the case of
categorical missing values, tree regression
programs provide a solution. Further, when I
tried logistic regression, convergence was
not attained.
I present the results for the following
cases: a) the original data (which contained
some missingness), to which I added
missingness at random in key variables;
b) deleting observations with missing values;
c) deleting variables with missing values;
d) estimating missing values by the
corresponding means; e) univariate PMIM
(poor man's imputation method); f)
multivariate PMIM. The graphical
representations of the regression trees are
drastically abbreviated for the sake of space.
Data description
Eighteen percent of the observations of
the variables VAR1 (binary), VAR2
(continuous), and VAR3 (continuous) were
set to missing at random, independently of
each other (Table 1).
----------------------------------------------------------------------------------
|TABLE 1: DATA |TOTAL |OLD/    |FREQUEN-| PERC. | PERC. | %_CASES  |SST =       |
|DESCRIPTION   | OBS  |YOUNG   |  CY    | CATEG | OTHER |GUESS=OUT-| N x P x Q  |
|CASE          |      |        |        |       | CATEG |COME NON- |            |
|              |      |        |        |       |       |  INFO    |            |
|--------------+------+--------+--------+-------+-------+----------+------------|
|a, c, d, e, f |10101 |OLDIE   |   5009 | 49.59 | 50.41 |   50.00  |  2525.079  |
|              |      |YOUNGIE |   5092 | 50.41 | 49.59 |   50.00  |  2525.079  |
|--------------+------+--------+--------+-------+-------+----------+------------|
|b             | 4025 |OLDIE   |   2307 | 57.32 | 42.68 |   51.07  |   984.702  |
|              |      |YOUNGIE |   1718 | 42.68 | 57.32 |   51.07  |   984.702  |
----------------------------------------------------------------------------------
|b |4025 |OLDIE | 2307| 57.32| 42.68| 51.07| 984.702|
| | |----------+--------+---------+---------+---------+------------|
| | |YOUNGIE | 1718| 42.68| 57.32| 51.07| 984.702|
----------------------------------------------------------------------------------
Case b), which eliminates all
observations with any missing values, reduces
the effective development sample size by
more than 50% and also changes the
proportions of the dependent
variable, thereby affecting modeling and
profiling results.
Distributions of the imputed variables
The continuous variables graphed
below were rescaled to fit between 0 and
100.
---------------------------------------------------------
|VAR1 | CATEGORIES | | | |
|(binary) |--------------------| | | |
| | . | 0 | 1 | | MODE | |
|Missing |------+------+------| |CATEG-| |
|values only | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 47.30| 52.70| 1| 1| 52.70|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 44.80| 55.20| 1| 1| 55.20|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 47.58| 52.42| 1| 1| 52.42|
---------------------------------------------------------
---------------------------------------------------------
| VAR1 | CATEGORIES | | | |
| (binary) --------------------| | | |
| | . | 0 | 1 | | MODE | |
| Entire |------+------+------| |CATEG-| |
| file | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 45.92| 54.08| 1| 1| 54.08|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 45.47| 54.53| 1| 1| 54.53|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 45.97| 54.03| 1| 1| 54.03|
---------------------------------------------------------
MIN = 0 Q1 = 1 MEDIAN = 2 MEAN = M Q3 = 3 MAX= 4 OVERPRINT = *
REFERENCE LINE AT 50 = |
VARIABLE MIN VAR1 MAX
0 Missing values only 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
0 Full File 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
VAR1 is a binary variable, imputed as continuous by the mean imputation method, which
collapses the mean, median, Q1, and Q3 into one point. The univariate and multivariate methods are
closer to the true distribution, especially the multivariate case (two tables above).
VARIABLE MIN VAR2 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-------1-----------2|--M----------------3-----------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0-----1-------------2|-M---------------3-------------------4|
MULTIVARIATE |0---1-------------2--M-----------------3-------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-------1------------2--M---------------3------------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0--------1-----------2--M--------------3-------------------4|
MULTIVARIATE |0---------1----------2--M------------3---------------------4|
*------------------------------------------------------------*
Mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate
methods are closer to the true distribution, particularly the univariate method in the missing-values-only
case, as graphed in the previous two plots.
VARIABLE MIN VAR3 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2----M3--|-------------------------------------4|
UNIVARIATE |*-----2---------M----|---3---------------------------------4|
MULTIVARIATE |0----1-------------2-|M--------------3---------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2-----M-3|-------------------------------------4|
UNIVARIATE |0-1-----2---------M--|----3--------------------------------4|
MULTIVARIATE |0---1---2---------M--|------3------------------------------4|
*------------------------------------------------------------*
Mean imputation has again shrunk the distribution of the data, this time VAR3, while the
univariate and multivariate methods are closer to the true distribution. The univariate distribution
has probably collapsed too much towards the low end when we view the
distribution of the imputed missing values only.
IMPUTATION ACCURACY
-------------------------------------------------------------
|MEAN SQ ERROR OF | MEAN | UNIVARIATE |MULTIVARIATE|
|IMPUTATION | IMPUTATION | IMPUTATION | IMPUTATION |
|--------------------+------------+------------+------------|
|VARIABLE | | | |
|--------------------| | | |
|VAR1 | 0.25| 0.25| 0.25|
|--------------------+------------+------------+------------|
|VAR2 | 1119839.15| 1105777.94| 717651.84|
|--------------------+------------+------------+------------|
|VAR3 | 26.63| 23.14| 17.69|
-------------------------------------------------------------
On average, PMIM are more accurate than mean imputation.
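The accuracy table above compares true and imputed values over the artificially masked cells; a minimal sketch of that computation (a Python illustration; the names are mine, not the paper's):

```python
import numpy as np

def imputation_mse(true_vals, imputed_vals, was_masked):
    """Mean squared error between true and imputed values, restricted
    to the cells that were artificially set to missing."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    mask = np.asarray(was_masked, dtype=bool)
    diff = true_vals[mask] - imputed_vals[mask]
    return float(np.mean(diff ** 2))
```

Restricting to the masked cells is what allows the comparison at all: the true values there are known because the missingness was created artificially.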
MODEL COMPARISONS
I compare the resulting trees. Note that I do not present case c), deleting
variables with any missing observations, because the resulting tree is extreme, lacks
interest, and performs very poorly.
ORIGINAL CLASSIFICATION TREE MODEL, CASE a)
[Tree diagram: first split on VAR1 (= 0 vs. = 1); secondary splits on VAR2
near 1697-1698 and near 1682-1683; leaves labeled y (youngie) and o (oldie).]
DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
[Tree diagram: first split on VAR2 at 1610.5; secondary splits on VAR1
(= 0 vs. = 1); leaves labeled y and o.]
MEAN IMPUTATION, CASE d)
[Tree diagram: first split on VAR2 at 1387.5; secondary splits on VAR3
(= 0 vs. = 1) and on VAR1 at .37; leaves labeled y and o.]
UNIVARIATE IMPUTATION, CASE e)
[Tree diagram: first split on VAR3 (= 0 vs. = 1); secondary splits on VAR2
near 1446-1447 and near 1708-1709; leaves labeled y and o.]
MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram: first split on VAR2 at 1708; secondary splits on VAR1
(= 0 vs. = 1); leaves labeled y and o.]
3.1) Notes on the Tree models
1) Cases b, d, and f reverse the order
of the most important variables when
compared to case a), the original tree. It is
important to mention that even case a)
originally contained some missing information,
which has been eliminated in one way or
another in cases b, d, and f.
2) Case d) imputes a mean value to
a binary variable (splitting var1 at .37), which is at
the very least inappropriate. The cutoff value for var2
(continuous), around 1400, is far from the
value estimated by the other methods, which
is closer to 1700. This is probably due to the
reduction in the variance of var2, and it explains
the poor modeling performance shown
below.
3) The univariate imputation method,
case e), uses var3 instead of var1 as the
principal variable, and then follows a pattern
similar to cases a) and f). Var3 and var1 are
highly correlated binary variables; however,
var3 did not have any original or artificially
generated missing values.
4) The trees for cases a) and f) are very
similar. It is worth mentioning that case f) has
imputed all missing values, both original and
artificially created.
3.2) Model performance and
diagnostics.
I measured performance using
several statistics, defined as follows:
1) HITRATE: percentage of cases where the
prediction agrees with the true state of nature.
2) T: standardized difference in mean
predicted probability between oldies and
youngies. The larger the value of T, the better
the model discriminates.
3) Number of nodes: simpler trees
are preferable to larger ones.
4) Classification rate: percentage of
cases in a true state of nature predicted to be
in that same state.
5) True positive rate: of those
predicted to be in a certain state of nature, the
percentage that truly belong to that state.
6) TOP 50: cumulative percentage of
youngies captured by the 5th decile.
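Diagnostics 1), 4), 5), and 6) can be computed directly from predictions; a hedged Python sketch follows (the function and key names are my assumptions, and T and the node count come from the tree software itself):

```python
import numpy as np

def tree_diagnostics(y_true, y_pred, score=None):
    """Hit rate, per-class classification rate, and per-class true
    positive rate for a binary 0/1 outcome; optionally, the share of
    class-1 cases captured in the top half ranked by score (TOP 50)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    out = {"hitrate": float((y_true == y_pred).mean())}
    for cls in (0, 1):
        truth = y_true == cls
        pred = y_pred == cls
        hits = (pred & truth).sum()
        out[f"clsf_rate_{cls}"] = float(hits / max(truth.sum(), 1))
        out[f"true_pos_rate_{cls}"] = float(hits / max(pred.sum(), 1))
    if score is not None:
        order = np.argsort(-np.asarray(score))
        half = order[: len(order) // 2]
        out["top50"] = float(y_true[half].sum() / max(y_true.sum(), 1))
    return out
```

The classification rate here is recall for each class, and the true positive rate is precision, matching the verbal definitions above.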
The results clearly indicate the
power of the poor man's methods. While
mean imputation performs satisfactorily (T =
1.01), multivariate imputation is superior in
all respects. Database marketers also look at
the gains chart, summarized here in the
column TOP 50: univariate imputation
performs very satisfactorily there,
multivariate imputation is still superior, and
the performance of mean imputation has
suffered.
---------------------------------------------------------------------------------------
|DIAGN. FOR BINARY DEP VARIABLE| | | # | | |OLDIES |YNGIES | TOP |
| | | |NO- |OLDIES |YNGIES | TRUE | TRUE | 50 |
| |HITRA-| |DES | CLSF | CLSF | POS | POS |CUM |
| | TE | T | | RATE | RATE | RATE | RATE | % |
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CASE | | | | | | | | |
|------------------------------| | | | | | | | |
|AS IS | 75.53|1.09| 6| 72.23| 78.77| 77.00| 74.25|74.94|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CREATED MISSINGS & DELETED OBS| 77.52|1.14| 8| 84.31| 68.39| 78.18| 76.45|78.81|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|DELETING VARS WITH MISSINGS | 73.50|0.93| 8| 85.49| 54.97| 74.60| 71.00|70.48|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MEAN IMPUTATION | 75.48|1.01| 7| 86.28| 58.78| 76.40| 73.47|69.12|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|UNIVARIATE IMPUTATION | 75.53|1.01| 7| 87.60| 56.87| 75.86| 74.78|74.69|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MULT IMPUTATION | 78.58|1.15| 8| 86.81| 65.85| 79.73| 76.34|76.60|
---------------------------------------------------------------------------------------
4) Hardware considerations.
The macro systems were run on a
Sun SPARC 10 with 256 MB of RAM. The entire
run, including regression trees and
diagnostics, took less than one hour of
wall-clock time. However, running the multivariate
imputation with five bands and 15 correlated
variables took more than six hours of CPU
time; its results are not reported here.
5) General comments and
conclusion
The methods just presented provide
alternatives to the beleaguered analyst in a
fast-paced environment. Especially in the
case of profiling, the poor man's methods
perform better than the even more hurried
mean imputation method. Model
performance was very good, especially in
the multivariate case. The number of bands
is still a research area, and the correlation
aspects of the multivariate imputation
deserve further investigation.
Possible improvements hinge
especially on the modeling step, such as
implementing Dresner's (1995)
or Leahy's (1995) suggestions. It is also
necessary to further test the methods under
different conditions of missingness, such as
different patterns and different percentages
of missingness across variables. There is also
an extensive literature on bandwidth
selection (e.g., Thombs and
Sheather, 1990).
6) Bibliography
Dresner, A. (1995): Multirelation: a correlation
among more than two variables,
Computational Statistics and Data Analysis.
Little, R.J.A. and Rubin, D.B. (1987): Statistical
Analysis with Missing Data, Wiley.
Scott, D.W. (1992): Multivariate Density
Estimation, John Wiley & Sons.
Thombs, L. and Sheather, S. (1990): Local
bandwidth selection for density estimation,
Interface '90, Proceedings of the 22nd
Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS: Multivariate Imputation

Proc Univariate: determine the variables with missing values, their ranges and minima.
Proc Corr: determine the best correlated variables.
Data Step: for each observation of each missing variable, determine the corresponding pattern.
Proc Summary: for each pattern, determine the mean and standard error of the missing variable.
Proc Format: create formats of the mean, standard error, and frequencies for every pattern.
Data Step: for each missing observation, find the pattern and impute.
Missing Value imputation, Poor man's

  • 1.
    MISSING IMPUTATION METHODFOR LARGE DATABASES By Leonardo E. Auslender Chase Manhattan Mortgage Corp. Introduction. In Auslender (1996), I proposed nonparametric methods for univariate and multivariate missing value impution of large databases in a database marketing environemnt. In this environment, the analyst faces tight deadlines constraints due to marketing campaign pressures, and the ‘demand ’ (mostly from those who ‘score ’ the algorithms) for equations that are easy to code, and which will not impose ‘extra-ordinary’ demands on existent hardware and time to market. . In this paper, I further continue the development of the multivariate algorithm proposed in Auslender (1996). I first briefly describe the multivariate method proposed earlier. Second, I proposed further developments of the algorithm, which are tested in a third section. I end the paper with conclusions and suggestions for future research. 1) Multivariate Imputation Method A typical database contains thousands (if not millions) of observations. There are fewer variables usually, but nonetheless their number can easily rank in the thousands. The information contained in these databases is usually obtained from different sources, both internal and external, but we will not enter the problem of data integration, quality, and scalability.. In this context, the presence off missing observations is almost assured. In Auslender (1996), in addition to the constraints imposed by database marketing, the multivariate method of imputation faced the curse of dimensionality to effectively estimate the empirical multivariate distribution function of the variable the values of which are being imputed. As a panacea, I proposed modeling on a handful of variables, and each variable on as many bins as hardware constraints allow. Variable selection was performed by finding the best ‘p ’ correlated variables with the variable being imputed. The value ‘p ’ and the maximum number of bins ‘k ’ is a user’s parameter. 
In practise, for a database containing a few million observations, and say, 1000 variables, all of which must be imputed, a wise user would choose p <= 10, and ‘k ’ <= 5.The unwise user will be sorry for doing otherwise. Since missing variables might be highly correlated with other missing values, the method eliminates those independent variables with overlapping ranges of missingness. Further, the method proceeds by imputing one variable at a time, in increasing order of missingness. For each observation in which the “to be” imputed variable, say mvar, is not missing, categorize each one of the selected variables into corresponding bands. That is, the information contained in these five variables is categorized and patterned. Find corresponding means and standard errors of mvar for each of the patterns, as well as for partial patterns. For instance, if the pattern 11111 exists, also find the means and standard errors for patterns .1111, 1.111, 11.11, ..., ..111, etc. Store all this information in formats. For mvar missing, then find the pattern of the five variables, and impute the mean plus a normal random number times the standard error. If the pattern is not present in the just done summarization (recall dimensionality’s curse?), search through all partial patterns to find the closest pattern. 2.) A possible improvement. 1) The present ‘modeling’ method is very crude and ad-hoc, and many possibilities to replace it come to mind. For instance, the multirelation coefficient (Dresner, 1995), principal components, a stepwise or backwards regression or logistic regression, etc. The scoring of the database is not affected by the difficulty of the modeling method, because the modeling method only affects the development stage, and the creation of formats necessary for the imputation.
  • 2.
    2) The useof just imputed variables to impute other variables increases the level of the variances of the imputed variables, which is not easily accounted for. 3) The method takes longer than the univariate method, and scoring requires the storing of large format libraries. If the scoring is not done in SAS, the transformation of these enormous formats could be unfeasible. On the other hand, were the scoring done in SAS, new updates of the format libraries would be relatively straightforward. 4) In the case when data is not missing at random, the method can provide some insight about the mechanism, by providing details of the distribution of patterns. That is, missingness not at random will show itself in a pronounced lack of patterns in a region of the dimension of the data. The method could then be modified to weight the information differently in those regions. 3) Empirical application. Demographic information is commonly used in segmentation applications. In our case, the focus of research lies in classifying customers as youngies (below 26 years of age) or oldies (otherwise). The age information commonly available from census sources contains a large proportion of missing values. The study involves two parts. First, a profile of the distribution of the imputed values of selected variables (the top three variables of an original regression tree). Second, a model which classifies prospects into one or the other. It is important to see how missing value imputation affects model performance. Customer profiling is omitted due to corporate concerns. Our development data set contains about 10,000 observations with more than 100 variables. For the sake of brevity, validation results will be omitted, as well as most of the information which play no vital role in this study. The classification model was obtained by running a tree regression program, because at least in the case of categorical missing values, tree regression programs provide a solution. 
Further, when I tried logistic regression, convergence was not attained. I present the results for the following cases: a) original data (which contained some missingness). At this point, I added missingness at random in key variables; b) deleted observations with missing variables , c) deleting variables with missing values, d) estimation of missing values by corresponding means, e) univariate PMIM, f) multivariate PMIM. The regression trees graphical representations will be drastically abbreviated for the sake of space. Data description Eighteen percent of observations of the variables VAR1 (binary), VAR2 (continuous) and VAR3 (continuous) were set to missing at random, independently of each other (Table 1) ---------------------------------------------------------------------------------- |DATA DESCRIPTION | | | | %_CASES | | | | | | PERC. |GUESS=OU-| | | |FREQUEN-| PERC. | OTHER | TCOME |SST = N x P | |TABLE 1 | CY | CATEG | CATEG | NONINFO | x Q | |----------------------------+--------+---------+---------+---------+------------| |CASE |TOTAL |OLD/YOUNG | | | | | | |--------|OBS | | | | | | | |a c d e |--------+----------| | | | | | |f |10101 |OLDIE | 5009| 49.59| 50.41| 50.00| 2525.079| | | |----------+--------+---------+---------+---------+------------| | | |YOUNGIE | 5092| 50.41| 49.59| 50.00| 2525.079| ---------------------------------------------------------------------------------- |b |4025 |OLDIE | 2307| 57.32| 42.68| 51.07| 984.702| | | |----------+--------+---------+---------+---------+------------| | | |YOUNGIE | 1718| 42.68| 57.32| 51.07| 984.702| ---------------------------------------------------------------------------------- The case b), which eliminates all observations with some missings, reduces the effective development sample size by more than 50%, and also changes the proportion of observations of the dependent variable, thereby affecting modeling and profiling results.. Distributions of the imputed variables 2
  • 3.
    The continuous variablesgraphed below were rescaled to fit between 0 and 100. --------------------------------------------------------- |VAR1 | CATEGORIES | | | | |(binary) |--------------------| | | | | | . | 0 | 1 | | MODE | | |Missing |------+------+------| |CATEG-| | |values only | % | % | % |MEDIAN| ORY | MODE | |-------------+------+------+------+------+------+------| |VARIABLES | | | | | | | |-------------| | | | | | | |TRUE | .| 47.30| 52.70| 1| 1| 52.70| |-------------+------+------+------+------+------+------| |UNIVARIATE | .| 44.80| 55.20| 1| 1| 55.20| |-------------+------+------+------+------+------+------| |MULTIVARIATE | .| 47.58| 52.42| 1| 1| 52.42| --------------------------------------------------------- 3
  • 4.
    --------------------------------------------------------- | VAR1 |CATEGORIES | | | | | (binary) --------------------| | | | | | . | 0 | 1 | | MODE | | | Entire |------+------+------| |CATEG-| | | file | % | % | % |MEDIAN| ORY | MODE | |-------------+------+------+------+------+------+------| |VARIABLES | | | | | | | |-------------| | | | | | | |TRUE | .| 45.92| 54.08| 1| 1| 54.08| |-------------+------+------+------+------+------+------| |UNIVARIATE | .| 45.47| 54.53| 1| 1| 54.53| |-------------+------+------+------+------+------+------| |MULTIVARIATE | .| 45.97| 54.03| 1| 1| 54.03| --------------------------------------------------------- MIN = 0 Q1 = 1 MEDIAN = 2 MEAN = M Q3 = 3 MAX= 4 OVERPRINT = * REFERENCE LINE AT 50 = | VARIABLE MIN VAR1 MAX 0 Missing values only 100 *------------------------------------------------------------* MEAN IMP |*--------------------|-----------*-------------------------*| *------------------------------------------------------------* 0 Full File 100 *------------------------------------------------------------* MEAN IMP |*--------------------|-----------*-------------------------*| *------------------------------------------------------------* VAR1 is a binary variable, imputed as continuous by the mean imputation method, which collapses mean, median, q1 and q3 at one point. The univariate and multivariate methods are closer to the true distribution, especially so the multivariate case (two tables above). 
VARIABLE MIN VAR2 (continuous) MAX 0 Missing values only 100 *------------------------------------------------------------* TRUE |0-------1-----------2|--M----------------3-----------------4| MEAN IMP |0---------1----------|--*--------3-------------------------4| UNIVARIATE |0-----1-------------2|-M---------------3-------------------4| MULTIVARIATE |0---1-------------2--M-----------------3-------------------4| *------------------------------------------------------------* 0 Full file 100 *------------------------------------------------------------* TRUE |0-------1------------2--M---------------3------------------4| MEAN IMP |0---------1----------|--*--------3-------------------------4| UNIVARIATE |0--------1-----------2--M--------------3-------------------4| MULTIVARIATE |0---------1----------2--M------------3---------------------4| *------------------------------------------------------------* The mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate methods are closer to the true distribution, particularly so the univariate method for the missing values only case, as graphed in the previous two plots. 
VARIABLE MIN VAR3 (continuous) MAX 0 Missing values only 100 *------------------------------------------------------------* TRUE |0-1-----2---------M--|------3------------------------------4| MEAN IMP |0---1-------2----M3--|-------------------------------------4| UNIVARIATE |*-----2---------M----|---3---------------------------------4| MULTIVARIATE |0----1-------------2-|M--------------3---------------------4| *------------------------------------------------------------* 0 Full file 100 *------------------------------------------------------------* TRUE |0-1-----2---------M--|------3------------------------------4| MEAN IMP |0---1-------2-----M-3|-------------------------------------4| UNIVARIATE |0-1-----2---------M--|----3--------------------------------4| MULTIVARIATE |0---1---2---------M--|------3------------------------------4| *------------------------------------------------------------* The mean imputation has again shrunk the distribution of the data, this time VAR3, while the univariate and multivariate methods are closer to the true distribution. The univariate distribution 4
has probably collapsed too much towards the low end of the distribution when we view the distribution of the imputed missing values only.

IMPUTATION ACCURACY
-------------------------------------------------------------
|MEAN SQ ERROR OF    |    MEAN    | UNIVARIATE |MULTIVARIATE|
|IMPUTATION          | IMPUTATION | IMPUTATION | IMPUTATION |
|--------------------+------------+------------+------------|
|VAR1                |       0.25 |       0.25 |       0.25 |
|VAR2                | 1119839.15 | 1105777.94 |  717651.84 |
|VAR3                |      26.63 |      23.14 |      17.69 |
-------------------------------------------------------------

On average, the poor man's imputation methods (PMIM) are more accurate than mean imputation.

MODEL COMPARISONS

I compare the resulting trees. Note that I do not present case c), deleting variables with any missing observation, because the resulting tree is extreme, lacks interest, and its performance is very poor.

ORIGINAL CLASSIFICATION TREE MODEL, CASE a)

[Tree diagram for YOUNGIE: root split on VAR1 (0 vs. 1); each branch splits again on VAR2, with cutoffs between 1697 and 1698 and between 1682 and 1683; leaves labeled y (youngie) and o (oldie).]

DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
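The accuracy criterion in the table above is the mean squared error of imputation: the average squared gap between the imputed value and the known true value, taken over the artificially deleted observations only. A minimal sketch (function name and toy numbers are mine, not the paper's):

```python
# Mean squared error of imputation over the artificially
# deleted observations only.
def imputation_mse(true_vals, imputed_vals, missing_mask):
    errs = [(t - i) ** 2
            for t, i, m in zip(true_vals, imputed_vals, missing_mask) if m]
    return sum(errs) / len(errs)

# Toy check with hypothetical values:
truth   = [10.0, 20.0, 30.0, 40.0]
filled  = [10.0, 25.0, 30.0, 34.0]
missing = [False, True, False, True]
print(imputation_mse(truth, filled, missing))  # (25 + 36) / 2 = 30.5
```

This comparison is only possible in a study design like the paper's, where observed values were deleted artificially and their true values retained.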
[Tree diagram for YOUNGIE: root split on VAR2 at 1610.5; branches split again on VAR1 (0 vs. 1); leaves labeled y and o.]

MEAN IMPUTATION, CASE d)

[Tree diagram for YOUNGIE: root split on VAR2 at 1387.5; branches split on VAR3 (0 vs. 1) and on VAR1 at .37; leaves labeled y and o.]

UNIVARIATE IMPUTATION, CASE e)

[Tree diagram for YOUNGIE: root split on VAR3 (0 vs. 1); branches split on VAR2, with cutoffs between 1446 and 1447 and between 1708 and 1709; leaves labeled y and o.]

MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram for YOUNGIE: root split on VAR2 at 1708; branches split again on VAR1 (0 vs. 1); leaves labeled y and o.]

3.1) Notes on the tree models

1) Cases b), d) and f) reverse the order of the most important variables relative to case a), the original tree. It is worth noting that even case a) originally contained some missing information, which has somehow been eliminated in cases b), d) and f).
2) Case d) imputes a mean value to a binary variable (var1 <> .37), which is at the very least inappropriate. The cutoff value for var2 (continuous), around 1400, is far from the value estimated by the other methods, which is closer to 1700. This is probably due to the reduction in the variance of var2, and explains the poor modeling performance shown below.
3) The univariate imputation method, case e), uses var3 instead of var1 as the principal variable, and then follows a pattern similar to cases a) and f). Var3 and var1 are highly correlated binary variables; however, var3 had no original or artificially generated missing values.
4) The trees for cases a) and f) are very similar. It is worth mentioning that case f) has imputed all missing values, both original and artificially created.

3.2) Model performance and diagnostics

I measured performance with the following statistics:
1) HITRATE: percentage of cases where the prediction agrees with the true state of nature.
2) T: standardized difference between the mean predicted probabilities for oldies and youngies. The larger the value of T, the better the model discriminates.
3) Number of nodes: simpler trees are preferable to larger ones.
4) Classification rate: percentage of observations in a true state of nature predicted to be in that same state of nature.
5) True positive rate: of those predicted to be in a given state of nature, the percentage that truly belong to it.
6) TOP 50: cumulative percentage of youngies captured at the 5th decile.

The results clearly indicate the power of the poor man's methods.
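A few of the diagnostics defined above can be sketched for a binary oldie/youngie target as follows (labels, names, and data are illustrative, not the paper's):

```python
# Hit rate, youngie classification rate, and youngie true positive
# rate for a binary oldie ("o") / youngie ("y") target.
def diagnostics(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    hitrate = 100.0 * hits / len(actual)
    # Classification rate: share of true youngies predicted youngie.
    pred_of_true_y = [p for a, p in zip(actual, predicted) if a == "y"]
    y_clsf = 100.0 * pred_of_true_y.count("y") / len(pred_of_true_y)
    # True positive rate: of those predicted youngie, share truly youngie.
    true_of_pred_y = [a for a, p in zip(actual, predicted) if p == "y"]
    y_tpr = 100.0 * true_of_pred_y.count("y") / len(true_of_pred_y)
    return hitrate, y_clsf, y_tpr

actual    = ["y", "y", "o", "o", "y", "o"]
predicted = ["y", "o", "o", "y", "y", "o"]
print(diagnostics(actual, predicted))
```

The oldie-side rates are computed symmetrically; T and TOP 50 additionally require the predicted probabilities, not just the predicted classes.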
While mean imputation performs satisfactorily (T = 1.01), multivariate imputation is superior in all respects. Database marketers also look at the gains chart, summarized here in the column TOP 50. While univariate imputation performs very satisfactorily, multivariate imputation is still superior. The performance of mean imputation, however, has suffered.
DIAGNOSTICS FOR BINARY DEP. VARIABLE
---------------------------------------------------------------------------------------
|                              |HITRA-|    | #  |OLDIES |YNGIES |OLDIES |YNGIES | TOP |
|CASE                          |  TE  | T  |NO- | CLSF  | CLSF  | TRUE  | TRUE  | 50  |
|                              |      |    |DES | RATE  | RATE  | POS   | POS   |CUM %|
|------------------------------+------+----+----+-------+-------+-------+-------+-----|
|AS IS                         | 75.53|1.09|  6 | 72.23 | 78.77 | 77.00 | 74.25 |74.94|
|CREATED MISSINGS & DELETED OBS| 77.52|1.14|  8 | 84.31 | 68.39 | 78.18 | 76.45 |78.81|
|DELETING VARS WITH MISSINGS   | 73.50|0.93|  8 | 85.49 | 54.97 | 74.60 | 71.00 |70.48|
|MEAN IMPUTATION               | 75.48|1.01|  7 | 86.28 | 58.78 | 76.40 | 73.47 |69.12|
|UNIVARIATE IMPUTATION         | 75.53|1.01|  7 | 87.60 | 56.87 | 75.86 | 74.78 |74.69|
|MULT IMPUTATION               | 78.58|1.15|  8 | 86.81 | 65.85 | 79.73 | 76.34 |76.60|
---------------------------------------------------------------------------------------

4) Hardware considerations

The macro systems were run on a Sun SPARC 10 with 256 MB of RAM. The entire run, including regression trees and diagnostics, took less than one hour of elapsed time. However, running the multivariate imputation with five bands and 15 correlated variables took more than six hours of CPU time, and its results are not reported here.

5) General comments and conclusion

The methods just presented provide alternatives to the beleaguered analyst in a fast-paced environment.
Especially in the case of profiling, the poor man's methods perform better than the even more hurried mean imputation method. Model performance was very good, especially in the multivariate case. The number of bands is still a research area, and the correlation aspects of the multivariate imputation deserve further investigation. Possible improvements lie especially in the area of modeling, such as implementing Dresner's or Leahy's (1995) suggestions. It is also necessary to test the methods further under different conditions of missingness, such as different patterns and different percentages of missingness across variables. There is also an extensive literature on bandwidth selection (e.g., Thombs and Sheather, 1990).

6) Bibliography

Dresner, A. (1995): Multirelation - correlation among more than two variables, Computational Statistics and Data Analysis.
Little, R. and Rubin, D. (1987): Statistical Analysis with Missing Data, Wiley.
Scott, David W. (1992): Multivariate Density Estimation, John Wiley & Sons, Inc.
Thombs, L. and Sheather, S. (1990): Local bandwidth selection for density estimation, Interface '90, Proceedings of the 22nd Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS: MULTIVARIATE IMPUTATION

Proc Corr: determine the best correlated variables.
Proc Univariate: determine variables with missing values, their ranges and minima.
Data Step: for each observation of each missing variable, determine the corresponding pattern.
Proc Summary: for each pattern, determine the mean and s.e. of the missing variable.
Proc Format: create formats of mean, s.e. and frequencies for every pattern.
Data Step: for each missing observation, find the pattern and impute.
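The SAS flow above can be sketched in a few lines of Python. This is a rough illustration, not the paper's macro system: function and variable names are mine, the predictors are assumed fully observed, and unseen patterns fall back to the overall mean.

```python
import statistics
from collections import defaultdict

# Pattern-based imputation sketch: cut each of the p best-correlated
# predictors into k bands, treat each band combination as a "pattern",
# and impute the pattern mean of the target for its missing values.
def pattern_impute(rows, target, predictors, k=5):
    # Band boundaries from observed ranges and minima (cf. Proc Univariate).
    bands = {}
    for p in predictors:
        vals = [r[p] for r in rows]
        lo = min(vals)
        width = (max(vals) - lo) / k or 1.0  # guard against constant vars
        bands[p] = (lo, width)

    def pattern(row):
        return tuple(min(int((row[p] - bands[p][0]) / bands[p][1]), k - 1)
                     for p in predictors)

    # Mean of the target within each pattern (cf. Proc Summary).
    groups = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            groups[pattern(r)].append(r[target])
    means = {pat: statistics.mean(vs) for pat, vs in groups.items()}
    overall = statistics.mean(v for vs in groups.values() for v in vs)

    # Impute each missing observation from its pattern (final Data Step).
    for r in rows:
        if r[target] is None:
            r[target] = means.get(pattern(r), overall)
    return rows

rows = [{"x": 1.0, "y": 10.0}, {"x": 1.2, "y": 12.0},
        {"x": 9.0, "y": 90.0}, {"x": 9.5, "y": None}]
pattern_impute(rows, "y", ["x"], k=2)
print(rows[3]["y"])  # imputed from the mean of the high-x pattern: 90.0
```

In the SAS implementation, the pattern means are distributed as Proc Format lookup tables so that the final Data Step imputes in a single pass, which is what keeps the method viable on very large files.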