Statistical Inference and Data Mining

Clark Glymour, David Madigan, Daryl Pregibon, and Padhraic Smyth

Statistics may have little to offer the search architectures in a data mining search, but a great deal to offer in evaluating hypotheses in the search, in evaluating the results of the search, and in applying the results.
Data mining aims to discover something new from the facts recorded in a database. For many reasons—encoding errors, measurement errors, unrecorded causes of recorded features—the information in a database is almost always noisy; therefore, inference from databases invites applications of the theory of probability. From a statistical point of view, databases are usually uncontrolled convenience samples; therefore data mining poses a collection of interesting, difficult—sometimes impossible—inference problems, raising many issues, some well studied and others unexplored or at least unsettled.

Data mining almost always involves a search architecture requiring evaluation of hypotheses at the stages of the search, evaluation of the search output, and appropriate use of the results. Statistics has little to offer in understanding search architectures but a great deal to offer in evaluation of hypotheses in the course of a search, in evaluating the results of a search, and in understanding the appropriate uses of the results.
COMMUNICATIONS OF THE ACM November 1996/Vol. 39, No. 11 35
Here we describe some of the central statistical ideas relevant to data mining, along with a number of recent techniques that may sometimes be applied. Our topics include features of probability distributions, estimation, hypothesis testing, model scoring, Gibbs sampling, rational decision making, causal inference, prediction, and model averaging. For a rigorous survey of statistics, the mathematically inclined reader should see [7]. Due to space limitations, we must also ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions

The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables—functions defined on the “events” to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming a conditional probability measure from a probability measure on a sample space and some event of positive probability). Essential relations among random variables include independence, conditional independence, and various measures of dependence—of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference (e.g., conjugate families closed under conditionalization and the multinormal family closed under linear combination). Knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.
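The marginalization and conditionalization operations just described are easy to make concrete. The sketch below is ours, not the authors'; the joint distribution is an invented table over two binary variables:

```python
# A joint distribution P(X, Y) over two binary variables,
# stored as a table from outcomes to probabilities (invented numbers).
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal_x(joint):
    """Marginalization: sum P(X, Y) over the values of Y."""
    p = {}
    for (x, _y), pr in joint.items():
        p[x] = p.get(x, 0.0) + pr
    return p

def conditional_y_given_x(joint, x):
    """Conditionalization: renormalize the slice of the joint with X = x."""
    px = marginal_x(joint)[x]
    return {y: pr / px for (xv, y), pr in joint.items() if xv == x}

print(marginal_x(joint))                # P(X): {0: 0.5, 1: 0.5}
print(conditional_y_given_x(joint, 1))  # P(Y | X=1): {0: 0.2, 1: 0.8}
```

The same two operations, applied to tables with more variables, are the workhorses of inference in the graphical models discussed later in the article.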
Inference involves the following features:

• Estimation
• Consistency
• Uncertainty
• Assumptions
• Robustness
• Model averaging

Many procedures of inference can be thought of as estimators, or functions from data to some object to be estimated, whether the object is the values of a parameter, intervals of values, structures, decision trees, or something else. Where the data are a sample from a larger (actual or potential) collection described by a probability distribution for any given sample size, the array of values of an estimator over samples of that size has a probability distribution. Statistics investigates such distributions of estimates to identify features of an estimator related to the information, reliability, and uncertainty it provides.

An important feature of an estimator is consistency; in the limit, as the sample size increases without bound, estimates should almost certainly converge to the correct value of whatever is being estimated. Heuristic procedures, which abound in machine learning (and in statistics), have no guarantee of ever converging on the right answer. An equally important feature is the uncertainty of an estimate made from a finite sample. That uncertainty can be thought of as the probability distribution of estimates made from hypothetical samples of the same size obtained in the same way. Statistical theory provides measures of uncertainty (e.g., standard errors) and methods of calculating them for various families of estimators. A variety of resampling and simulation techniques have also been developed for assessing uncertainties of estimates [1]. Other things (e.g., consistency) being equal, estimators that minimize uncertainty are preferred.
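One such resampling technique can be sketched in a few lines. The bootstrap below is our illustration, not a procedure from [1]; the data and the choice of the mean as the estimator are invented:

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """Approximate an estimator's standard error by the spread of its
    values over samples redrawn, with replacement, from the data."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(replicates)

data = [2.1, 3.4, 2.9, 4.0, 3.3, 2.5, 3.8, 3.1]  # invented sample
se = bootstrap_se(data, statistics.mean)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

For the mean, the answer can be checked against the textbook formula s/sqrt(n); the point of the method is that the same function works unchanged for estimators with no closed-form standard error, such as the median or a trimmed mean.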
The importance of uncertainty assessments can be illustrated in many ways. For example, in recent research aimed at predicting the mortality of hospitalized pneumonia patients, a large medical database was divided into a training set and a test set. (Search procedures used the training set to form a model, and the test set helped assess the predictions of the model.) A neural net using a large number of variables outperformed several other methods. However, the neural net’s performance turned out to be an accident of the particular train/test division. When a random selection of other train/test divisions (with the same proportions) was made and the neural net and competing methods trained and tested according to each, the average neural net performance was comparable to that of logistic regression.

Estimation is almost always made on the basis of a set of assumptions, or model, but for a variety of reasons the assumptions may not be strictly met. If the model is incorrect, estimates based on it are also expected to be incorrect, although that is not always the case. One aim of statistical research is to find ways to weaken the assumptions necessary for good estimation. Robust statistics looks for estimators that work satisfactorily for larger families of distributions; resilient statistics [3] concern estimators—often order statistics—that typically have small errors when assumptions are violated.
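The contrast between a sensitive estimator and a resilient order statistic can be seen with a single contaminated sample (our toy numbers, not the article's):

```python
import statistics

# Seven well-behaved measurements plus one gross recording error.
clean = [9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean + [1000.0]

print(statistics.mean(clean), statistics.median(clean))                # ~10.0 and 10.0
print(statistics.mean(contaminated), statistics.median(contaminated))  # ~133.75 and ~10.05
```

One wild value drags the mean from about 10 to about 134, while the median, an order statistic, barely moves.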
A more Bayesian approach to the problem of estimation under assumptions emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered, each with an appropriate probability, and when each of the competing models yields an estimate of the quantity of interest, an estimate can be obtained as the weighted average of the estimates given by the individual models [5]. When the probability weights are well calibrated to the frequencies with which the various models obtain, model averaging is bound to improve estimation on average. Since the models obtained in data mining are usually the result of some automated search procedure, the advantages of model averaging are best obtained if the error frequencies of the search procedure are known—something usually obtainable only through extensive Monte Carlo exploration. Our impression is that the error rates of search procedures proposed and used in the data mining and statistical literatures are rarely estimated in this way. (See [10] and [11] for Monte Carlo test designs for search procedures.) When the probabilities of various models are entirely subjective, model averaging gives at least coherent estimates.
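A minimal sketch of the weighted-average idea (ours; the model probabilities and per-model estimates are invented):

```python
# Hypothetical candidate models, each with a probability and an
# estimate of the quantity of interest (all numbers invented).
models = {
    "M1": {"prob": 0.5, "estimate": 2.0},
    "M2": {"prob": 0.3, "estimate": 2.6},
    "M3": {"prob": 0.2, "estimate": 1.5},
}

def model_average(models):
    """Probability-weighted average of the per-model estimates."""
    total = sum(m["prob"] for m in models.values())
    return sum(m["prob"] * m["estimate"] for m in models.values()) / total

print(model_average(models))  # 0.5*2.0 + 0.3*2.6 + 0.2*1.5 = 2.08
```

The quality of the averaged estimate depends entirely on how well the weights are calibrated, which is exactly the point the passage above makes about knowing the search procedure's error frequencies.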
Hypothesis Testing

Hypothesis testing can be viewed as one-sided estimation in which, for a specific hypothesis and any sample of an appropriate kind, a testing rule either conjectures that the hypothesis is false or makes no conjecture. The testing rule is based on the conditional sampling distribution (conditional on the truth of the hypothesis to be tested) of some statistic or other. The significance level of a statistical test specifies the probability of erroneously conjecturing that the hypothesis is false (often called rejecting the hypothesis) when the hypothesis is in fact true. Given an appropriate alternative hypothesis, the probability of failing to reject the hypothesis under test can be calculated; that probability is called the power of the test against the alternative. The power of a test is obviously a function of the alternative hypothesis being considered.

Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, a level-α test of one hypothesis and a level-α test of another hypothesis do not jointly provide a level-α test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses [4]. An important corollary for data mining is that the alpha level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at α = 0.5, then 0.5 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. That relation would hold (approximately) only when the sample size is much larger than the number of variables considered. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.

In many, perhaps most, realistic hypothesis spaces, hypothesis testing is comparatively uninformative. If a hypothesis is not rejected by a test rule and a sample, the same test rule and the same sample may very well also not reject many other hypotheses. And in the absence of knowledge of the entire power function of the test, the testing procedure provides no information about such alternatives. Further, the error probabilities of tests have to do with the truth of hypotheses, not with approximate truth; hypotheses that are excellent approximations may be rejected in large samples. Tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.
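The point about alpha levels in search can be checked by simulation. In this sketch (ours), every variable is generated independently, so any "dependence" found is a false alarm; with 10 variables there are 45 pairwise tests, and some pair almost always passes a per-test 0.05 threshold:

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def any_spurious_dependence(rng, n_vars=10, n_obs=30, r_crit=0.361):
    """Draw n_vars mutually independent variables and report whether any
    pair's |correlation| exceeds r_crit, the approximate two-sided 0.05
    critical value for a sample of 30."""
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return any(
        abs(pearson_r(data[i], data[j])) > r_crit
        for i in range(n_vars) for j in range(i + 1, n_vars)
    )

rng = random.Random(1)
trials = 200
hits = sum(any_spurious_dependence(rng) for _ in range(trials))
print(f"trials with at least one spurious 'dependence': {hits}/{trials}")
```

Each individual test keeps its 0.05 level, but across 45 pairs the chance that at least one rejects is roughly 1 - 0.95**45, close to 0.9, which is what the simulation shows.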
Model Scoring

The evidence provided by data should lead us to prefer some models or hypotheses to others and to be indifferent about still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For such reasons, scoring rules are often an attractive alternative to tests. Indeed, the values of test statistics are sometimes themselves used as scores, especially in the structural-equation literature. Typical rules assign to a model a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (AIC), Bayes Information Criterion (BIC), and Minimum Description Length. Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The BIC approximates posterior probabilities in large samples.

There is a notion of consistency appropriate to scoring rules; in the large sample limit, the true model should almost surely be among those receiving maximal scores. AIC scores are generally not consistent [8]. The probability (p) values assigned to statistics in hypothesis tests of models are scores, but it does not seem to be known whether and under what conditions they form a consistent set of scores. There are also uncertainties associated with scores, since two different samples of the same size from the same distribution can yield not only different numerical values for the same model but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; however, it is often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores. In some contexts, inferences made using Bayes scores and posteriors can differ a great deal from inferences made with hypothesis tests. (See [5] for examples of models that account for almost all of the variance of an outcome of interest and that have very high posterior or Bayes scores but are overwhelmingly rejected by statistical tests.)

Of the various scoring rules, perhaps the most interesting is the posterior probability, because, unlike many other consistent scores, posterior probability has a central role in the theory of rational choice. Unfortunately, posteriors can be difficult to compute.
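For a model with maximized log-likelihood l, k free parameters, and sample size n, the usual forms are AIC = 2k - 2l and BIC = k ln(n) - 2l, lower values preferred. The sketch below (ours; the fitted log-likelihoods are invented) shows how the two criteria can disagree:

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2*loglik (lower is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayes Information Criterion: k*ln(n) - 2*loglik (lower is better)."""
    return k * math.log(n) - 2 * loglik

n = 1000
small = {"loglik": -2100.0, "k": 3}  # hypothetical fitted models: the
large = {"loglik": -2094.0, "k": 8}  # larger one buys 6 nats with 5 parameters

for name, m in (("small", small), ("large", large)):
    print(name, aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n))
```

With these made-up numbers AIC prefers the larger model, while BIC, whose per-parameter penalty ln(n) is about 6.9 here versus AIC's constant 2, prefers the smaller one; the consistency of BIC and the inconsistency of AIC noted above trace back to exactly this difference in penalties.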
Gibbs Sampling

Statistical theory typically gives asymptotic results that can be used to describe posteriors or likelihoods in large samples. Unfortunately, even in very large databases, the number of cases relevant to a particular question can be quite small. For example, in studying the effects of hospitalization on survival of pneumonia patients, mortality comparisons between those treated at home and those treated in a hospital might be wanted. But even in a very large database, the number of pneumonia patients treated at
home and who die of pneumonia complications is very small. And statistical theory typically provides few or no ways to calculate distributions in small samples in which the application of asymptotic formulas can be wildly misleading. Recently, a family of simulation methods—often described as Gibbs sampling, after the great American physicist Josiah Willard Gibbs (1839–1903)—has been adapted from statistical mechanics, permitting the approximate calculation of many distributions. A review of these procedures is in [9].
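A minimal Gibbs sampler (our sketch, not one of the procedures reviewed in [9]): for a standard bivariate normal with correlation rho, each full conditional is itself normal, so the chain simply alternates one-dimensional draws:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_draws=20000, burn_in=1000, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.
    Each full conditional is normal: X | Y=y ~ N(rho*y, 1 - rho**2)."""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    x = y = 0.0
    draws = []
    for i in range(burn_in + n_draws):
        x = rng.gauss(rho * y, sd)   # draw X given the current Y
        y = rng.gauss(rho * x, sd)   # draw Y given the new X
        if i >= burn_in:
            draws.append((x, y))
    return draws

def sample_corr(pairs):
    """Sample correlation of the collected (x, y) draws."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

r = sample_corr(gibbs_bivariate_normal(rho=0.8))
print(f"correlation recovered by the chain: {r:.2f}")  # close to 0.8
```

Here the target distribution is known, so the example only checks that the chain recovers it; the method earns its keep when, as in the posterior calculations above, the joint distribution is intractable but the full conditionals are easy to sample.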
Rational Decision Making and Planning

The theory of rational choice assumes the decision maker has a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, and knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, as well as knowledge of the probabilities of various possible states of the world. Given all this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules—maximizing expected utility, minimizing maximum possible loss, and more. Rational decision making and planning are typically the goals of data mining, but rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities for alternative states of affairs and knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause-and-effect relations. Extracting such causal information is often one of the principal goals of data mining and more generally of statistical inference.
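The expected-utility rule can be written down directly. In this sketch (ours; the actions, states, probabilities, and payoffs are all invented), the rule picks the action whose probability-weighted payoff is largest:

```python
# A toy decision problem: two actions, two states of the world,
# P(state), and utility[action][state] (all numbers invented).
p_state = {"demand_high": 0.3, "demand_low": 0.7}
utility = {
    "expand": {"demand_high": 100.0, "demand_low": -40.0},
    "hold":   {"demand_high": 30.0,  "demand_low": 10.0},
}

def expected_utility(action):
    """Sum of utilities weighted by the probabilities of the states."""
    return sum(p * utility[action][s] for s, p in p_state.items())

best = max(utility, key=expected_utility)
for action in utility:
    print(action, expected_utility(action))
print("maximize expected utility ->", best)
```

Expected utilities here are 2 for expand and 16 for hold, so the expected-utility rule picks hold; a minimax-loss rule, which compares worst cases (-40 versus 10), happens to agree in this example, though the two rules diverge in general.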
Inference to Causes

Understanding causation is the hidden motivation behind the historical development of statistics. From the beginning of the field, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence [12]; the same idea is fundamental in the theory of experimental design. In 1934, Sewall Wright, a biologist, introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences); these graphs have become common representations of causal hypotheses in the social sciences, biology, computer science, and engineering. In 1982, statisticians Harry Kiiveri and T. P. Speed combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition—if Y is not an effect of X, then X and Y are conditionally independent, given the direct causes of X. Kiiveri and Speed showed that much of the linear modeling literature tacitly assumed the Markov condition; the Markov condition is also satisfied by most causal models of categorical data and of virtually all causal models of systems without feedback. Under additional assumptions, conditional independence provides information about causal dependence. The most common—and most thoroughly investigated—additional assumption is that all conditional independencies are due to the Markov condition’s being applied to the directed graph describing the actual causal processes generating the data, a requirement with many names (e.g., faithfulness). Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures (e.g., Bayes nets, belief nets, structural equation models, and path models).
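The Markov condition for the simplest chain, X -> Y -> Z, can be checked by simulation (our sketch): X and Z are marginally correlated, but their partial correlation controlling for Y, which measures the linear dependence that remains once the direct cause is conditioned on, is near zero:

```python
import math
import random

def corr(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(xs, zs, ys):
    """Correlation of x and z after removing the linear influence of y."""
    rxz, rxy, rzy = corr(xs, zs), corr(xs, ys), corr(zs, ys)
    return (rxz - rxy * rzy) / math.sqrt((1 - rxy ** 2) * (1 - rzy ** 2))

rng = random.Random(42)
n = 5000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.9 * x + rng.gauss(0, 1) for x in xs]   # Y is caused by X
zs = [0.9 * y + rng.gauss(0, 1) for y in ys]   # Z is caused by Y

print(f"corr(X, Z)     = {corr(xs, zs):.2f}")              # clearly nonzero
print(f"corr(X, Z | Y) = {partial_corr(xs, zs, ys):.2f}")  # near zero
```

For multinormal distributions, vanishing partial correlations of this kind are exactly the conditional independencies the Markov condition predicts from the graph.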
Causal inference from uncontrolled convenience samples is liable to many sources of error. Three of the most important are latent variables (or confounders), sample selection bias, and model equivalence. A latent variable is any unrecorded feature that varies among recorded units and whose variation influences recorded features. The result is an association among recorded features not in fact due to any causal influence of the recorded features themselves. The possibility of latent variables can seldom, if ever, be ignored in data mining. Sample selection bias occurs when the values of any two of the variables under study, say X and Y, themselves influence whether a feature is recorded in a database. That influence produces a statistical association between X and Y (and other variables) that has no causal significance. Datasets with missing values pose sample selection bias problems. Models with quite different graphs may generate the same constraints on probability distributions through the Markov condition and may therefore be indistinguishable without experimental intervention. Any procedure that arbitrarily selects one or a few of the equivalents may badly mislead users when the models are given a causal significance. If model search is viewed as a form of estimation, all of these difficulties are sources of inconsistency.

Standard data mining methods run afoul of these difficulties. The search algorithms in such commercial linear model analysis programs as LISREL select one from an unknown number of statistically indistinguishable models. Regression methods are inconsistent for all of the reasons listed earlier. For example, consider the structure: Y = aT + eY; X1 = bT
+ cQ + e1; X2 = dQ + e2, where T and Q are unrecorded. Neither X1 nor X2 has any influence on Y. For all nonzero values of a, b, c, d, however, in sufficiently large samples, regression of Y on X1, X2 yields significant regression coefficients for X1 and X2. With the causal interpretation often given it, regression says that X1 and X2 are causes of Y. Assuming the Markov and faithfulness conditions, all that can be inferred correctly (in large samples) from data on X1, X2, and Y is that X1 is not a cause of X2 or of Y; X2 is not a cause of Y; Y is not a cause of X2; and there is no common cause of Y and X2. Nonregression algorithms implemented in the TETRAD II program [6, 10] give the correct result asymptotically in this case and in all cases in which the Markov and faithfulness conditions hold. The results are also robust against the three problems with causal inference noted in the previous paragraph [11]. However, the statistical decisions made by the algorithms are not really optimal, and the implementations are limited to the multinomial and multinormal families of probability distributions. A review of Bayesian search procedures for causal models is given in [2].
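The structure just described is easy to simulate (our sketch; it does not reproduce TETRAD II). With T and Q left unrecorded, least-squares regression of Y on X1 and X2 returns coefficients well away from zero even though neither regressor influences Y:

```python
import random

def ols2(y, x1, x2):
    """Least-squares slopes for y ~ x1 + x2 (variables centered first),
    solved from the 2x2 normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    y = [v - my for v in y]
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(7)
a = b = c = d = 1.0
ys, x1s, x2s = [], [], []
for _ in range(20000):
    t, q = rng.gauss(0, 1), rng.gauss(0, 1)   # T and Q stay unrecorded
    ys.append(a * t + rng.gauss(0, 1))        # Y = aT + eY
    x1s.append(b * t + c * q + rng.gauss(0, 1))  # X1 = bT + cQ + e1
    x2s.append(d * q + rng.gauss(0, 1))       # X2 = dQ + e2

b1, b2 = ols2(ys, x1s, x2s)
print(f"slopes from regressing Y on X1, X2: {b1:.2f}, {b2:.2f}")
```

With a = b = c = d = 1 and unit-variance noise, the population regression slopes work out to 0.4 and -0.2; in a sample this large both estimates sit close to those values with tiny standard errors, so both would be declared "significant" despite the absence of any causal influence.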
Prediction

Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed the two samples are obtained from the same probability distribution. As with estimation, prediction is interested in accuracy and uncertainty, and is often measured by the variance of the predictor.

Prediction methods for this sort of prediction problem always assume some regularities—constraints—in the probability distribution. In data mining contexts, the constraints are typically either supplied by human experts or automatically inferred from the database. For example, regression assumes a particular functional form for relating variables or, in the case of logistic regression, relating the values of some variables to the probabilities of other variables; but constraints are implicit in any prediction method that uses a database to adjust or estimate the parameters used in prediction. Other forms of constraint may include independence, conditional independence, and higher-order conditions on correlations (e.g., tetrad constraints). On average, a prediction method guaranteeing satisfaction of the constraints realized in the probability distribution is more accurate and has a smaller variance than a prediction method that does not. Finding the appropriate constraints to be satisfied is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the probabilities of the alternative assumptions imposed by the model are available.
Another sort of prediction involves interventions that alter the probability distribution—as in predicting the values (or probabilities) of variables under a change in manufacturing procedures or changes in economic or medical treatment policies. Making accurate predictions of this kind requires some knowledge of the relevant causal structure and is generally quite different from prediction without intervention, although the same caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in [10]. In 1995, some of these procedures were extended and made into a more convenient calculus by Judea Pearl, a computer scientist. A related theory without graphical models was developed in 1974 by Donald Rubin, a statistician, and