MISSING IMPUTATION METHOD FOR LARGE DATABASES
By Leonardo E. Auslender
Chase Manhattan Mortgage Corp.
Introduction.
In Auslender (1996), I proposed
nonparametric methods for univariate and
multivariate missing value imputation of large
databases in a database marketing
environment. In this environment, the analyst
faces tight deadline constraints due to
marketing campaign pressures, and the
'demand' (mostly from those who
'score' the algorithms) for equations that
are easy to code and that will not impose
extraordinary demands on existing
hardware or on time to market.
In this paper, I continue the
development of the multivariate algorithm
proposed in Auslender (1996). I first briefly
describe the multivariate method proposed
earlier. Second, I propose further
developments of the algorithm, which are
tested in a third section. I end the paper with
conclusions and suggestions for future
research.
1) Multivariate Imputation Method
A typical database contains
thousands (if not millions) of observations.
There are usually fewer variables, but
their number can nonetheless easily run into
the thousands. The information contained in
these databases is usually obtained from
different sources, both internal and external,
but we will not enter into the problems of data
integration, quality, and scalability.
In this context, the presence of
missing observations is almost assured. In
Auslender (1996), in addition to the
constraints imposed by database marketing,
the multivariate method of imputation faced
the curse of dimensionality in effectively
estimating the empirical multivariate
distribution function of the variable whose
values are being imputed. As a
remedy, I proposed modeling on a handful
of variables, with each variable categorized
into as many bins as hardware constraints allow.
Variable selection was performed by
finding the 'p' variables best correlated
with the variable being imputed. The value
'p' and the maximum number of bins 'k'
are user parameters. In practice, for a
database containing a few million
observations and, say, 1000 variables, all of
which must be imputed, a wise user would
choose p <= 10 and k <= 5. The unwise
user will be sorry for doing otherwise.
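As an illustration, the selection and binning step can be sketched as follows. This is a minimal Python sketch, not the original SAS macro; the function name select_and_bin and the quantile-based binning are my assumptions about a reasonable implementation.

```python
import numpy as np
import pandas as pd

def select_and_bin(df, mvar, p=10, k=5):
    """Pick the p variables most correlated (in absolute value) with
    mvar, then discretize each into at most k quantile-based bins."""
    corr = df.corr()[mvar].drop(mvar).abs()
    predictors = corr.nlargest(p).index.tolist()
    binned = pd.DataFrame(index=df.index)
    for col in predictors:
        # duplicates="drop" guards against ties producing fewer than k bins
        binned[col] = pd.qcut(df[col], q=k, labels=False, duplicates="drop")
    return predictors, binned
```

Quantile binning is only one choice; equal-width bins, or bins dictated by hardware constraints as in the paper, work with the same interface.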
Since missing values in one variable
might be highly correlated with missing values
in others, the method eliminates those
independent variables whose ranges of
missingness overlap. Further, the method proceeds
by imputing one variable at a time, in
increasing order of missingness.
For each observation in which the
to-be-imputed variable, say mvar, is not
missing, categorize each of the selected
variables into its corresponding bands. That is,
the information contained in these five
variables is categorized and patterned. Find
the corresponding means and standard errors of
mvar for each of the patterns, as well as for
partial patterns. For instance, if the pattern
11111 exists, also find the means and
standard errors for the patterns .1111, 1.111,
11.11, ..., ..111, etc. Store all this information
in formats.
When mvar is missing, find the
pattern of the five variables, and impute the
mean plus a normal random number times
the standard error. If the pattern is not
present in the summarization just created
(recall the curse of dimensionality?), search
through all partial patterns to find the closest
pattern.
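The pattern summarization and imputation draw described above can be sketched as follows. This is again an illustrative Python version, not the paper's SAS formats; the one-position-wildcard fallback is a simplification of the closest-partial-pattern search.

```python
import numpy as np
import pandas as pd

def fit_patterns(binned, mvar_values):
    """Mean and standard error of mvar for every full bin pattern
    observed among the non-missing cases."""
    key = binned.astype(int).astype(str).apply(lambda r: "".join(r), axis=1)
    return {pat: (g.mean(),
                  (g.std(ddof=1) / np.sqrt(len(g))) if len(g) > 1 else 0.0)
            for pat, g in mvar_values.groupby(key)}

def impute_one(pattern, table, rng):
    """Impute mean + z * se for the observation's bin pattern; if the
    full pattern is unseen, fall back to patterns that agree on all
    but one position, then to the grand mean over all patterns."""
    if pattern in table:
        mean, se = table[pattern]
        return mean + rng.standard_normal() * se
    for i in range(len(pattern)):
        matches = [v for pat, v in table.items()
                   if pat[:i] + pat[i + 1:] == pattern[:i] + pattern[i + 1:]]
        if matches:
            return float(np.mean([m for m, _ in matches]))
    return float(np.mean([m for m, _ in table.values()]))
```

In the paper's setting the summaries would be stored as SAS formats and the draw performed in a scoring data step; the dictionary here stands in for that format library.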
2) Possible improvements and caveats.
1) The present 'modeling' method
is very crude and ad hoc, and many
possibilities to replace it come to mind: for
instance, the multirelation coefficient
(Dresner, 1995), principal components, stepwise
or backwards regression, or logistic
regression. The scoring of the database
is not affected by the complexity of the
modeling method, because the modeling
method only affects the development stage
and the creation of the formats necessary for the
imputation.
2) The use of just-imputed variables
to impute other variables increases the
variances of the imputed variables,
which is not easily accounted for.
3) The method takes longer than the
univariate method, and scoring requires
storing large format libraries. If the scoring
is not done in SAS, translating
these enormous formats could be
infeasible. Were the scoring done in SAS,
on the other hand, new updates of the
format libraries would be relatively
straightforward.
4) When data are not
missing at random, the method can provide
some insight into the missingness mechanism
by detailing the distribution of
patterns. That is, missingness not at random
will show itself as a pronounced lack of
patterns in some region of the data space.
The method could then be modified to
weight the information differently in those
regions.
3) Empirical application.
Demographic information is
commonly used in segmentation
applications. In our case, the focus of the
research lies in classifying customers as
youngies (below 26 years of age) or oldies
(otherwise). The age information commonly
available from census sources contains a
large proportion of missing values.
The study involves two parts: first, a
profile of the distribution of the imputed
values of selected variables (the top three
variables of an original regression tree);
second, a model that classifies prospects
into one group or the other. It is important to see
how missing value imputation affects model
performance. Customer profiling is omitted
due to corporate concerns.
Our development data set contains
about 10,000 observations with more than
100 variables. For the sake of brevity,
validation results are omitted, as is
most of the information that plays no vital
role in this study. The classification model
was obtained by running a tree regression
program because, at least in the case of
categorical missing values, tree regression
programs provide a solution. Further, when I
tried logistic regression, convergence was
not attained.
I present the results for the following
cases: a) the original data (which contained
some missingness), to which I added
missingness at random in key variables;
b) deleting observations with missing values;
c) deleting variables with missing values;
d) estimating missing values by the
corresponding means; e) univariate PMIM
(poor man's imputation method); f)
multivariate PMIM. The graphical
representations of the regression trees are
drastically abbreviated for the sake of space.
Data description
Eighteen percent of the observations of
the variables VAR1 (binary), VAR2
(continuous), and VAR3 (continuous) were
set to missing at random, independently of
each other (Table 1).
----------------------------------------------------------------------------------
|TABLE 1: DATA |TOTAL |OLD/    |FREQUEN-| PERC. | PERC. | %_CASES  |SST =       |
|DESCRIPTION   | OBS  |YOUNG   |  CY    | CATEG | OTHER |GUESS=OUT-| N x P x Q  |
|CASE          |      |        |        |       | CATEG |COME NON- |            |
|              |      |        |        |       |       |  INFO    |            |
|--------------+------+--------+--------+-------+-------+----------+------------|
|a, c, d, e, f |10101 |OLDIE   |   5009 | 49.59 | 50.41 |   50.00  |  2525.079  |
|              |      |YOUNGIE |   5092 | 50.41 | 49.59 |   50.00  |  2525.079  |
|--------------+------+--------+--------+-------+-------+----------+------------|
|b             | 4025 |OLDIE   |   2307 | 57.32 | 42.68 |   51.07  |   984.702  |
|              |      |YOUNGIE |   1718 | 42.68 | 57.32 |   51.07  |   984.702  |
----------------------------------------------------------------------------------
|b |4025 |OLDIE | 2307| 57.32| 42.68| 51.07| 984.702|
| | |----------+--------+---------+---------+---------+------------|
| | |YOUNGIE | 1718| 42.68| 57.32| 51.07| 984.702|
----------------------------------------------------------------------------------
Case b), which eliminates all
observations with any missing values, reduces
the effective development sample size by
more than 50% and also changes the
proportions of the dependent
variable, thereby affecting modeling and
profiling results.
Distributions of the imputed variables
The continuous variables graphed
below were rescaled to fit between 0 and
100.
---------------------------------------------------------
|VAR1 | CATEGORIES | | | |
|(binary) |--------------------| | | |
| | . | 0 | 1 | | MODE | |
|Missing |------+------+------| |CATEG-| |
|values only | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 47.30| 52.70| 1| 1| 52.70|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 44.80| 55.20| 1| 1| 55.20|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 47.58| 52.42| 1| 1| 52.42|
---------------------------------------------------------
---------------------------------------------------------
| VAR1 | CATEGORIES | | | |
| (binary) --------------------| | | |
| | . | 0 | 1 | | MODE | |
| Entire |------+------+------| |CATEG-| |
| file | % | % | % |MEDIAN| ORY | MODE |
|-------------+------+------+------+------+------+------|
|VARIABLES | | | | | | |
|-------------| | | | | | |
|TRUE | .| 45.92| 54.08| 1| 1| 54.08|
|-------------+------+------+------+------+------+------|
|UNIVARIATE | .| 45.47| 54.53| 1| 1| 54.53|
|-------------+------+------+------+------+------+------|
|MULTIVARIATE | .| 45.97| 54.03| 1| 1| 54.03|
---------------------------------------------------------
MIN = 0 Q1 = 1 MEDIAN = 2 MEAN = M Q3 = 3 MAX= 4 OVERPRINT = *
REFERENCE LINE AT 50 = |
VARIABLE MIN VAR1 MAX
0 Missing values only 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
0 Full File 100
*------------------------------------------------------------*
MEAN IMP |*--------------------|-----------*-------------------------*|
*------------------------------------------------------------*
VAR1 is a binary variable, imputed as continuous by the mean imputation method, which
collapses the mean, median, Q1, and Q3 into one point. The univariate and multivariate methods are
closer to the true distribution, especially the multivariate case (two tables above).
VARIABLE MIN VAR2 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-------1-----------2|--M----------------3-----------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0-----1-------------2|-M---------------3-------------------4|
MULTIVARIATE |0---1-------------2--M-----------------3-------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-------1------------2--M---------------3------------------4|
MEAN IMP |0---------1----------|--*--------3-------------------------4|
UNIVARIATE |0--------1-----------2--M--------------3-------------------4|
MULTIVARIATE |0---------1----------2--M------------3---------------------4|
*------------------------------------------------------------*
Mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate
methods are closer to the true distribution, particularly the univariate method in the missing-values-only
case, as graphed in the previous two plots.
VARIABLE MIN VAR3 (continuous) MAX
0 Missing values only 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2----M3--|-------------------------------------4|
UNIVARIATE |*-----2---------M----|---3---------------------------------4|
MULTIVARIATE |0----1-------------2-|M--------------3---------------------4|
*------------------------------------------------------------*
0 Full file 100
*------------------------------------------------------------*
TRUE |0-1-----2---------M--|------3------------------------------4|
MEAN IMP |0---1-------2-----M-3|-------------------------------------4|
UNIVARIATE |0-1-----2---------M--|----3--------------------------------4|
MULTIVARIATE |0---1---2---------M--|------3------------------------------4|
*------------------------------------------------------------*
Mean imputation has again shrunk the distribution of the data, this time VAR3, while the
univariate and multivariate methods are closer to the true distribution. The univariate distribution
has probably collapsed too much towards the low end when we view the
distribution of the imputed missing values only.
IMPUTATION ACCURACY
-------------------------------------------------------------
|MEAN SQ ERROR OF | MEAN | UNIVARIATE |MULTIVARIATE|
|IMPUTATION | IMPUTATION | IMPUTATION | IMPUTATION |
|--------------------+------------+------------+------------|
|VARIABLE | | | |
|--------------------| | | |
|VAR1 | 0.25| 0.25| 0.25|
|--------------------+------------+------------+------------|
|VAR2 | 1119839.15| 1105777.94| 717651.84|
|--------------------+------------+------------+------------|
|VAR3 | 26.63| 23.14| 17.69|
-------------------------------------------------------------
On average, PMIM are more accurate than mean imputation.
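The accuracy table above compares true and imputed values over the artificially masked cells; a minimal sketch of that computation (a Python illustration; the names are mine, not the paper's):

```python
import numpy as np

def imputation_mse(true_vals, imputed_vals, was_masked):
    """Mean squared error between true and imputed values, restricted
    to the cells that were artificially set to missing."""
    true_vals = np.asarray(true_vals, dtype=float)
    imputed_vals = np.asarray(imputed_vals, dtype=float)
    mask = np.asarray(was_masked, dtype=bool)
    diff = true_vals[mask] - imputed_vals[mask]
    return float(np.mean(diff ** 2))
```

Restricting to the masked cells is what allows the comparison at all: the true values there are known because the missingness was created artificially.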
MODEL COMPARISONS
I compare the resulting trees. Note that I do not present case c), deleting
variables with any missing observations, because the resulting tree is extreme, lacks
interest, and performs very poorly.
ORIGINAL CLASSIFICATION TREE MODEL, CASE a)
[Tree diagram: first split on VAR1 (= 0 vs. = 1); secondary splits on VAR2
near 1697-1698 and near 1682-1683; leaves labeled y (youngie) and o (oldie).]
DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
[Tree diagram: first split on VAR2 at 1610.5; secondary splits on VAR1
(= 0 vs. = 1); leaves labeled y and o.]
MEAN IMPUTATION, CASE d)
[Tree diagram: first split on VAR2 at 1387.5; secondary splits on VAR3
(= 0 vs. = 1) and on VAR1 at .37; leaves labeled y and o.]
UNIVARIATE IMPUTATION, CASE e)
[Tree diagram: first split on VAR3 (= 0 vs. = 1); secondary splits on VAR2
near 1446-1447 and near 1708-1709; leaves labeled y and o.]
MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram: first split on VAR2 at 1708; secondary splits on VAR1
(= 0 vs. = 1); leaves labeled y and o.]
3.1) Notes on the Tree models
1) Cases b, d, and f reverse the order
of the most important variables when
compared to case a), the original tree. It is
important to mention that even case a)
originally contained some missing information,
which has been eliminated in one way or
another in cases b, d, and f.
2) Case d) imputes a mean value to
a binary variable (splitting var1 at .37), which is at
the very least inappropriate. The cutoff value for var2
(continuous), around 1400, is far from the
value estimated by the other methods, which
is closer to 1700. This is probably due to the
reduction in the variance of var2, and it explains
the poor modeling performance shown
below.
3) The univariate imputation method,
case e), uses var3 instead of var1 as the
principal variable, and then follows a pattern
similar to cases a) and f). Var3 and var1 are
highly correlated binary variables; however,
var3 did not have any original or artificially
generated missing values.
4) The trees for cases a) and f) are very
similar. It is worth mentioning that case f) has
imputed all missing values, both original and
artificially created.
3.2) Model performance and
diagnostics.
I measured performance using
several statistics, defined as follows:
1) HITRATE: percentage of cases where the
prediction agrees with the true state of nature.
2) T: standardized difference in mean
predicted probability between oldies and
youngies. The larger the value of T, the better
the model discriminates.
3) Number of nodes: simpler trees
are preferable to larger ones.
4) Classification rate: percentage of
cases in a true state of nature predicted to be
in that same state.
5) True positive rate: of those
predicted to be in a certain state of nature, the
percentage that truly belong to that state.
6) TOP 50: cumulative percentage of
youngies captured by the 5th decile.
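Diagnostics 1), 4), 5), and 6) can be computed directly from predictions; a hedged Python sketch follows (the function and key names are my assumptions, and T and the node count come from the tree software itself):

```python
import numpy as np

def tree_diagnostics(y_true, y_pred, score=None):
    """Hit rate, per-class classification rate, and per-class true
    positive rate for a binary 0/1 outcome; optionally, the share of
    class-1 cases captured in the top half ranked by score (TOP 50)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    out = {"hitrate": float((y_true == y_pred).mean())}
    for cls in (0, 1):
        truth = y_true == cls
        pred = y_pred == cls
        hits = (pred & truth).sum()
        out[f"clsf_rate_{cls}"] = float(hits / max(truth.sum(), 1))
        out[f"true_pos_rate_{cls}"] = float(hits / max(pred.sum(), 1))
    if score is not None:
        order = np.argsort(-np.asarray(score))
        half = order[: len(order) // 2]
        out["top50"] = float(y_true[half].sum() / max(y_true.sum(), 1))
    return out
```

The classification rate here is recall for each class, and the true positive rate is precision, matching the verbal definitions above.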
The results clearly indicate the
power of the poor man's methods. While
mean imputation performs satisfactorily (T =
1.01), multivariate imputation is superior in
all respects. Database marketers also look at
the gains chart, summarized here in the
column TOP 50: univariate imputation
performs very satisfactorily there,
multivariate imputation is still superior, and
the performance of mean imputation has
suffered.
---------------------------------------------------------------------------------------
|DIAGN. FOR BINARY DEP VARIABLE| | | # | | |OLDIES |YNGIES | TOP |
| | | |NO- |OLDIES |YNGIES | TRUE | TRUE | 50 |
| |HITRA-| |DES | CLSF | CLSF | POS | POS |CUM |
| | TE | T | | RATE | RATE | RATE | RATE | % |
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CASE | | | | | | | | |
|------------------------------| | | | | | | | |
|AS IS | 75.53|1.09| 6| 72.23| 78.77| 77.00| 74.25|74.94|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|CREATED MISSINGS & DELETED OBS| 77.52|1.14| 8| 84.31| 68.39| 78.18| 76.45|78.81|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|DELETING VARS WITH MISSINGS | 73.50|0.93| 8| 85.49| 54.97| 74.60| 71.00|70.48|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MEAN IMPUTATION | 75.48|1.01| 7| 86.28| 58.78| 76.40| 73.47|69.12|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|UNIVARIATE IMPUTATION | 75.53|1.01| 7| 87.60| 56.87| 75.86| 74.78|74.69|
|------------------------------|------|----|----|-------|-------|-------|-------|-----|
|MULT IMPUTATION | 78.58|1.15| 8| 86.81| 65.85| 79.73| 76.34|76.60|
---------------------------------------------------------------------------------------
4) Hardware considerations.
The macro systems were run on a
Sun SPARC 10 with 256 MB of RAM. The entire
run, including regression trees and
diagnostics, took less than one hour of
wall-clock time. However, running the multivariate
imputation with five bands and 15 correlated
variables took more than six hours of CPU
time; its results are not reported here.
5) General comments and
conclusion
The methods just presented provide
alternatives to the beleaguered analyst in a
fast-paced environment. Especially in the
case of profiling, the poor man's methods
perform better than the even more hurried
mean imputation method. Model
performance was very good, especially in
the multivariate case. The number of bands
is still a research area, and the correlation
aspects of the multivariate imputation
deserve further investigation.
Possible improvements hinge
especially on the modeling step, such as
implementing Dresner's (1995)
or Leahy's (1995) suggestions. It is also
necessary to further test the methods under
different conditions of missingness, such as
different patterns and different percentages
of missingness across variables. There is also
an extensive literature on bandwidth
selection (e.g., Thombs and
Sheather, 1990).
6) Bibliography
Dresner, A. (1995): Multirelation: a correlation
among more than two variables,
Computational Statistics and Data Analysis.
Little, R.J.A. and Rubin, D.B. (1987): Statistical
Analysis with Missing Data, Wiley.
Scott, D.W. (1992): Multivariate Density
Estimation, John Wiley & Sons.
Thombs, L. and Sheather, S. (1990): Local
bandwidth selection for density estimation,
Interface '90, Proceedings of the 22nd
Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS: Multivariate Imputation

Proc Univariate: determine the variables with missing values, their ranges and minima.
Proc Corr: determine the best correlated variables.
Data Step: for each observation of each missing variable, determine the corresponding pattern.
Proc Summary: for each pattern, determine the mean and standard error of the missing variable.
Proc Format: create formats of the mean, standard error, and frequencies for every pattern.
Data Step: for each missing observation, find the pattern and impute.
Missing Value imputation, Poor man's

  • 1.
    MISSING IMPUTATION METHODFOR LARGE DATABASES By Leonardo E. Auslender Chase Manhattan Mortgage Corp. Introduction. In Auslender (1996), I proposed nonparametric methods for univariate and multivariate missing value impution of large databases in a database marketing environemnt. In this environment, the analyst faces tight deadlines constraints due to marketing campaign pressures, and the ‘demand ’ (mostly from those who ‘score ’ the algorithms) for equations that are easy to code, and which will not impose ‘extra-ordinary’ demands on existent hardware and time to market. . In this paper, I further continue the development of the multivariate algorithm proposed in Auslender (1996). I first briefly describe the multivariate method proposed earlier. Second, I proposed further developments of the algorithm, which are tested in a third section. I end the paper with conclusions and suggestions for future research. 1) Multivariate Imputation Method A typical database contains thousands (if not millions) of observations. There are fewer variables usually, but nonetheless their number can easily rank in the thousands. The information contained in these databases is usually obtained from different sources, both internal and external, but we will not enter the problem of data integration, quality, and scalability.. In this context, the presence off missing observations is almost assured. In Auslender (1996), in addition to the constraints imposed by database marketing, the multivariate method of imputation faced the curse of dimensionality to effectively estimate the empirical multivariate distribution function of the variable the values of which are being imputed. As a panacea, I proposed modeling on a handful of variables, and each variable on as many bins as hardware constraints allow. Variable selection was performed by finding the best ‘p ’ correlated variables with the variable being imputed. The value ‘p ’ and the maximum number of bins ‘k ’ is a user’s parameter. 
In practise, for a database containing a few million observations, and say, 1000 variables, all of which must be imputed, a wise user would choose p <= 10, and ‘k ’ <= 5.The unwise user will be sorry for doing otherwise. Since missing variables might be highly correlated with other missing values, the method eliminates those independent variables with overlapping ranges of missingness. Further, the method proceeds by imputing one variable at a time, in increasing order of missingness. For each observation in which the “to be” imputed variable, say mvar, is not missing, categorize each one of the selected variables into corresponding bands. That is, the information contained in these five variables is categorized and patterned. Find corresponding means and standard errors of mvar for each of the patterns, as well as for partial patterns. For instance, if the pattern 11111 exists, also find the means and standard errors for patterns .1111, 1.111, 11.11, ..., ..111, etc. Store all this information in formats. For mvar missing, then find the pattern of the five variables, and impute the mean plus a normal random number times the standard error. If the pattern is not present in the just done summarization (recall dimensionality’s curse?), search through all partial patterns to find the closest pattern. 2.) A possible improvement. 1) The present ‘modeling’ method is very crude and ad-hoc, and many possibilities to replace it come to mind. For instance, the multirelation coefficient (Dresner, 1995), principal components, a stepwise or backwards regression or logistic regression, etc. The scoring of the database is not affected by the difficulty of the modeling method, because the modeling method only affects the development stage, and the creation of formats necessary for the imputation.
  • 2.
    2) The useof just imputed variables to impute other variables increases the level of the variances of the imputed variables, which is not easily accounted for. 3) The method takes longer than the univariate method, and scoring requires the storing of large format libraries. If the scoring is not done in SAS, the transformation of these enormous formats could be unfeasible. On the other hand, were the scoring done in SAS, new updates of the format libraries would be relatively straightforward. 4) In the case when data is not missing at random, the method can provide some insight about the mechanism, by providing details of the distribution of patterns. That is, missingness not at random will show itself in a pronounced lack of patterns in a region of the dimension of the data. The method could then be modified to weight the information differently in those regions. 3) Empirical application. Demographic information is commonly used in segmentation applications. In our case, the focus of research lies in classifying customers as youngies (below 26 years of age) or oldies (otherwise). The age information commonly available from census sources contains a large proportion of missing values. The study involves two parts. First, a profile of the distribution of the imputed values of selected variables (the top three variables of an original regression tree). Second, a model which classifies prospects into one or the other. It is important to see how missing value imputation affects model performance. Customer profiling is omitted due to corporate concerns. Our development data set contains about 10,000 observations with more than 100 variables. For the sake of brevity, validation results will be omitted, as well as most of the information which play no vital role in this study. The classification model was obtained by running a tree regression program, because at least in the case of categorical missing values, tree regression programs provide a solution. 
Further, when I tried logistic regression, convergence was not attained. I present the results for the following cases: a) original data (which contained some missingness). At this point, I added missingness at random in key variables; b) deleted observations with missing variables , c) deleting variables with missing values, d) estimation of missing values by corresponding means, e) univariate PMIM, f) multivariate PMIM. The regression trees graphical representations will be drastically abbreviated for the sake of space. Data description Eighteen percent of observations of the variables VAR1 (binary), VAR2 (continuous) and VAR3 (continuous) were set to missing at random, independently of each other (Table 1) ---------------------------------------------------------------------------------- |DATA DESCRIPTION | | | | %_CASES | | | | | | PERC. |GUESS=OU-| | | |FREQUEN-| PERC. | OTHER | TCOME |SST = N x P | |TABLE 1 | CY | CATEG | CATEG | NONINFO | x Q | |----------------------------+--------+---------+---------+---------+------------| |CASE |TOTAL |OLD/YOUNG | | | | | | |--------|OBS | | | | | | | |a c d e |--------+----------| | | | | | |f |10101 |OLDIE | 5009| 49.59| 50.41| 50.00| 2525.079| | | |----------+--------+---------+---------+---------+------------| | | |YOUNGIE | 5092| 50.41| 49.59| 50.00| 2525.079| ---------------------------------------------------------------------------------- |b |4025 |OLDIE | 2307| 57.32| 42.68| 51.07| 984.702| | | |----------+--------+---------+---------+---------+------------| | | |YOUNGIE | 1718| 42.68| 57.32| 51.07| 984.702| ---------------------------------------------------------------------------------- The case b), which eliminates all observations with some missings, reduces the effective development sample size by more than 50%, and also changes the proportion of observations of the dependent variable, thereby affecting modeling and profiling results.. Distributions of the imputed variables 2
  • 3.
    The continuous variablesgraphed below were rescaled to fit between 0 and 100. --------------------------------------------------------- |VAR1 | CATEGORIES | | | | |(binary) |--------------------| | | | | | . | 0 | 1 | | MODE | | |Missing |------+------+------| |CATEG-| | |values only | % | % | % |MEDIAN| ORY | MODE | |-------------+------+------+------+------+------+------| |VARIABLES | | | | | | | |-------------| | | | | | | |TRUE | .| 47.30| 52.70| 1| 1| 52.70| |-------------+------+------+------+------+------+------| |UNIVARIATE | .| 44.80| 55.20| 1| 1| 55.20| |-------------+------+------+------+------+------+------| |MULTIVARIATE | .| 47.58| 52.42| 1| 1| 52.42| --------------------------------------------------------- 3
  • 4.
    --------------------------------------------------------- | VAR1 |CATEGORIES | | | | | (binary) --------------------| | | | | | . | 0 | 1 | | MODE | | | Entire |------+------+------| |CATEG-| | | file | % | % | % |MEDIAN| ORY | MODE | |-------------+------+------+------+------+------+------| |VARIABLES | | | | | | | |-------------| | | | | | | |TRUE | .| 45.92| 54.08| 1| 1| 54.08| |-------------+------+------+------+------+------+------| |UNIVARIATE | .| 45.47| 54.53| 1| 1| 54.53| |-------------+------+------+------+------+------+------| |MULTIVARIATE | .| 45.97| 54.03| 1| 1| 54.03| --------------------------------------------------------- MIN = 0 Q1 = 1 MEDIAN = 2 MEAN = M Q3 = 3 MAX= 4 OVERPRINT = * REFERENCE LINE AT 50 = | VARIABLE MIN VAR1 MAX 0 Missing values only 100 *------------------------------------------------------------* MEAN IMP |*--------------------|-----------*-------------------------*| *------------------------------------------------------------* 0 Full File 100 *------------------------------------------------------------* MEAN IMP |*--------------------|-----------*-------------------------*| *------------------------------------------------------------* VAR1 is a binary variable, imputed as continuous by the mean imputation method, which collapses mean, median, q1 and q3 at one point. The univariate and multivariate methods are closer to the true distribution, especially so the multivariate case (two tables above). 
VARIABLE MIN VAR2 (continuous) MAX 0 Missing values only 100 *------------------------------------------------------------* TRUE |0-------1-----------2|--M----------------3-----------------4| MEAN IMP |0---------1----------|--*--------3-------------------------4| UNIVARIATE |0-----1-------------2|-M---------------3-------------------4| MULTIVARIATE |0---1-------------2--M-----------------3-------------------4| *------------------------------------------------------------* 0 Full file 100 *------------------------------------------------------------* TRUE |0-------1------------2--M---------------3------------------4| MEAN IMP |0---------1----------|--*--------3-------------------------4| UNIVARIATE |0--------1-----------2--M--------------3-------------------4| MULTIVARIATE |0---------1----------2--M------------3---------------------4| *------------------------------------------------------------* The mean imputation has shrunk the distribution of VAR2, while the univariate and multivariate methods are closer to the true distribution, particularly so the univariate method for the missing values only case, as graphed in the previous two plots. 
VARIABLE MIN VAR3 (continuous) MAX 0 Missing values only 100 *------------------------------------------------------------* TRUE |0-1-----2---------M--|------3------------------------------4| MEAN IMP |0---1-------2----M3--|-------------------------------------4| UNIVARIATE |*-----2---------M----|---3---------------------------------4| MULTIVARIATE |0----1-------------2-|M--------------3---------------------4| *------------------------------------------------------------* 0 Full file 100 *------------------------------------------------------------* TRUE |0-1-----2---------M--|------3------------------------------4| MEAN IMP |0---1-------2-----M-3|-------------------------------------4| UNIVARIATE |0-1-----2---------M--|----3--------------------------------4| MULTIVARIATE |0---1---2---------M--|------3------------------------------4| *------------------------------------------------------------* The mean imputation has again shrunk the distribution of the data, this time VAR3, while the univariate and multivariate methods are closer to the true distribution. The univariate distribution 4
has probably collapsed too much towards the low end of the distribution when we view the distribution of the imputed missing values only.

IMPUTATION ACCURACY
-------------------------------------------------------------
|MEAN SQ ERROR OF    |    MEAN    | UNIVARIATE |MULTIVARIATE|
|IMPUTATION          | IMPUTATION | IMPUTATION | IMPUTATION |
|--------------------+------------+------------+------------|
|VAR1                |       0.25 |       0.25 |       0.25 |
|VAR2                | 1119839.15 | 1105777.94 |  717651.84 |
|VAR3                |      26.63 |      23.14 |      17.69 |
-------------------------------------------------------------

On average, the poor man's imputation methods (PMIM) are more accurate than mean imputation.

MODEL COMPARISONS

I compare the resulting trees. Note that I do not present case c), deleting variables with any missing observation, because the resulting tree is extreme, lacks interest, and its performance is very poor.

ORIGINAL CLASSIFICATION TREE MODEL, CASE a)

[Tree diagram for YOUNGIE: root split on VAR1 (0 vs. 1); each branch splits again on VAR2, with cutoffs between 1697 and 1698 and between 1682 and 1683; leaves labeled y (youngie) and o (oldie).]

DELETING ARTIFICIALLY CREATED OBSERVATIONS, CASE b)
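The accuracy criterion in the table above is the mean squared error of imputation: the average squared gap between the imputed value and the known true value, taken over the artificially deleted observations only. A minimal sketch (function name and toy numbers are mine, not the paper's):

```python
# Mean squared error of imputation over the artificially
# deleted observations only.
def imputation_mse(true_vals, imputed_vals, missing_mask):
    errs = [(t - i) ** 2
            for t, i, m in zip(true_vals, imputed_vals, missing_mask) if m]
    return sum(errs) / len(errs)

# Toy check with hypothetical values:
truth   = [10.0, 20.0, 30.0, 40.0]
filled  = [10.0, 25.0, 30.0, 34.0]
missing = [False, True, False, True]
print(imputation_mse(truth, filled, missing))  # (25 + 36) / 2 = 30.5
```

This comparison is only possible in a study design like the paper's, where observed values were deleted artificially and their true values retained.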
[Tree diagram for YOUNGIE: root split on VAR2 at 1610.5; branches split again on VAR1 (0 vs. 1); leaves labeled y and o.]

MEAN IMPUTATION, CASE d)

[Tree diagram for YOUNGIE: root split on VAR2 at 1387.5; branches split on VAR3 (0 vs. 1) and on VAR1 at .37; leaves labeled y and o.]

UNIVARIATE IMPUTATION, CASE e)

[Tree diagram for YOUNGIE: root split on VAR3 (0 vs. 1); branches split on VAR2, with cutoffs between 1446 and 1447 and between 1708 and 1709; leaves labeled y and o.]

MULTIVARIATE IMPUTATION, CASE f)
[Tree diagram for YOUNGIE: root split on VAR2 at 1708; branches split again on VAR1 (0 vs. 1); leaves labeled y and o.]

3.1) Notes on the tree models

1) Cases b), d) and f) reverse the order of the most important variables relative to case a), the original tree. It is worth noting that even case a) originally contained some missing information, which has somehow been eliminated in cases b), d) and f).
2) Case d) imputes a mean value to a binary variable (var1 <> .37), which is at the very least inappropriate. The cutoff value for var2 (continuous), around 1400, is far from the value estimated by the other methods, which is closer to 1700. This is probably due to the reduction in the variance of var2, and explains the poor modeling performance shown below.
3) The univariate imputation method, case e), uses var3 instead of var1 as the principal variable, and then follows a pattern similar to cases a) and f). Var3 and var1 are highly correlated binary variables; however, var3 had no original or artificially generated missing values.
4) The trees for cases a) and f) are very similar. It is worth mentioning that case f) has imputed all missing values, both original and artificially created.

3.2) Model performance and diagnostics

I measured performance with the following statistics:
1) HITRATE: percentage of cases where the prediction agrees with the true state of nature.
2) T: standardized difference between the mean predicted probabilities for oldies and youngies. The larger the value of T, the better the model discriminates.
3) Number of nodes: simpler trees are preferable to larger ones.
4) Classification rate: percentage of observations in a true state of nature predicted to be in that same state of nature.
5) True positive rate: of those predicted to be in a given state of nature, the percentage that truly belong to it.
6) TOP 50: cumulative percentage of youngies captured at the 5th decile.

The results clearly indicate the power of the poor man's methods.
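A few of the diagnostics defined above can be sketched for a binary oldie/youngie target as follows (labels, names, and data are illustrative, not the paper's):

```python
# Hit rate, youngie classification rate, and youngie true positive
# rate for a binary oldie ("o") / youngie ("y") target.
def diagnostics(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    hitrate = 100.0 * hits / len(actual)
    # Classification rate: share of true youngies predicted youngie.
    pred_of_true_y = [p for a, p in zip(actual, predicted) if a == "y"]
    y_clsf = 100.0 * pred_of_true_y.count("y") / len(pred_of_true_y)
    # True positive rate: of those predicted youngie, share truly youngie.
    true_of_pred_y = [a for a, p in zip(actual, predicted) if p == "y"]
    y_tpr = 100.0 * true_of_pred_y.count("y") / len(true_of_pred_y)
    return hitrate, y_clsf, y_tpr

actual    = ["y", "y", "o", "o", "y", "o"]
predicted = ["y", "o", "o", "y", "y", "o"]
print(diagnostics(actual, predicted))
```

The oldie-side rates are computed symmetrically; T and TOP 50 additionally require the predicted probabilities, not just the predicted classes.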
While mean imputation performs satisfactorily (T = 1.01), multivariate imputation is superior in all respects. Database marketers also look at the gains chart, summarized here in the column TOP 50. While univariate imputation performs very satisfactorily, multivariate imputation is still superior. The performance of mean imputation, however, has suffered.
DIAGNOSTICS FOR BINARY DEP. VARIABLE
---------------------------------------------------------------------------------------
|                              |HITRA-|    | #  |OLDIES |YNGIES |OLDIES |YNGIES | TOP |
|CASE                          |  TE  | T  |NO- | CLSF  | CLSF  | TRUE  | TRUE  | 50  |
|                              |      |    |DES | RATE  | RATE  | POS   | POS   |CUM %|
|------------------------------+------+----+----+-------+-------+-------+-------+-----|
|AS IS                         | 75.53|1.09|  6 | 72.23 | 78.77 | 77.00 | 74.25 |74.94|
|CREATED MISSINGS & DELETED OBS| 77.52|1.14|  8 | 84.31 | 68.39 | 78.18 | 76.45 |78.81|
|DELETING VARS WITH MISSINGS   | 73.50|0.93|  8 | 85.49 | 54.97 | 74.60 | 71.00 |70.48|
|MEAN IMPUTATION               | 75.48|1.01|  7 | 86.28 | 58.78 | 76.40 | 73.47 |69.12|
|UNIVARIATE IMPUTATION         | 75.53|1.01|  7 | 87.60 | 56.87 | 75.86 | 74.78 |74.69|
|MULT IMPUTATION               | 78.58|1.15|  8 | 86.81 | 65.85 | 79.73 | 76.34 |76.60|
---------------------------------------------------------------------------------------

4) Hardware considerations

The macro systems were run on a Sun SPARC 10 with 256 MB of RAM. The entire run, including regression trees and diagnostics, took less than one hour of elapsed time. However, running the multivariate imputation with five bands and 15 correlated variables took more than six hours of CPU time, and its results are not reported here.

5) General comments and conclusion

The methods just presented provide alternatives to the beleaguered analyst in a fast-paced environment.
Especially in the case of profiling, the poor man's methods perform better than the even more hurried mean imputation method. Model performance was very good, especially in the multivariate case. The number of bands is still a research area, and the correlation aspects of the multivariate imputation deserve further investigation. Possible improvements lie especially in the area of modeling, such as implementing Dresner's or Leahy's (1995) suggestions. It is also necessary to test the methods further under different conditions of missingness, such as different patterns and different percentages of missingness across variables. There is also an extensive literature on bandwidth selection (e.g., Thombs and Sheather, 1990).

6) Bibliography

Dresner, A. (1995): Multirelation - correlation among more than two variables, Computational Statistics and Data Analysis.
Little, R. and Rubin, D. (1987): Statistical Analysis with Missing Data, Wiley.
Scott, David W. (1992): Multivariate Density Estimation, John Wiley & Sons, Inc.
Thombs, L. and Sheather, S. (1990): Local bandwidth selection for density estimation, Interface '90, Proceedings of the 22nd Symposium on the Interface.
BRIEF DIAGRAM OF SAS STEPS: MULTIVARIATE IMPUTATION

Proc Corr: determine the best correlated variables.
Proc Univariate: determine variables with missing values, their ranges and minima.
Data Step: for each observation of each missing variable, determine the corresponding pattern.
Proc Summary: for each pattern, determine the mean and s.e. of the missing variable.
Proc Format: create formats of mean, s.e. and frequencies for every pattern.
Data Step: for each missing observation, find the pattern and impute.
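The SAS flow above can be sketched in a few lines of Python. This is a rough illustration, not the paper's macro system: function and variable names are mine, the predictors are assumed fully observed, and unseen patterns fall back to the overall mean.

```python
import statistics
from collections import defaultdict

# Pattern-based imputation sketch: cut each of the p best-correlated
# predictors into k bands, treat each band combination as a "pattern",
# and impute the pattern mean of the target for its missing values.
def pattern_impute(rows, target, predictors, k=5):
    # Band boundaries from observed ranges and minima (cf. Proc Univariate).
    bands = {}
    for p in predictors:
        vals = [r[p] for r in rows]
        lo = min(vals)
        width = (max(vals) - lo) / k or 1.0  # guard against constant vars
        bands[p] = (lo, width)

    def pattern(row):
        return tuple(min(int((row[p] - bands[p][0]) / bands[p][1]), k - 1)
                     for p in predictors)

    # Mean of the target within each pattern (cf. Proc Summary).
    groups = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            groups[pattern(r)].append(r[target])
    means = {pat: statistics.mean(vs) for pat, vs in groups.items()}
    overall = statistics.mean(v for vs in groups.values() for v in vs)

    # Impute each missing observation from its pattern (final Data Step).
    for r in rows:
        if r[target] is None:
            r[target] = means.get(pattern(r), overall)
    return rows

rows = [{"x": 1.0, "y": 10.0}, {"x": 1.2, "y": 12.0},
        {"x": 9.0, "y": 90.0}, {"x": 9.5, "y": None}]
pattern_impute(rows, "y", ["x"], k=2)
print(rows[3]["y"])  # imputed from the mean of the high-x pattern: 90.0
```

In the SAS implementation, the pattern means are distributed as Proc Format lookup tables so that the final Data Step imputes in a single pass, which is what keeps the method viable on very large files.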