
Statistical Inference and Data Mining



Clark Glymour, David Madigan, Daryl Pregibon, and Padhraic Smyth

COMMUNICATIONS OF THE ACM, November 1996/Vol. 39, No. 11
Illustration: Terry Widener

Statistics may have little to offer the search architectures in a data mining search, but a great deal to offer in evaluating hypotheses in the search, in evaluating the results of the search, and in applying the results.

Data mining aims to discover something new from the facts recorded in a database. For many reasons (encoding errors, measurement errors, unrecorded causes of recorded features) the information in a database is almost always noisy; therefore, inference from databases invites applications of the theory of probability. From a statistical point of view, databases are usually uncontrolled convenience samples; therefore data mining poses a collection of interesting, difficult (sometimes impossible) inference problems, raising many issues, some well studied and others unexplored or at least unsettled.

Data mining almost always involves a search architecture requiring evaluation of hypotheses at the stages of the search, evaluation of the search output, and appropriate use of the results. Statistics has little to offer in understanding search architectures but a great deal to offer in evaluation of hypotheses in the course of a search, in evaluating the results of a search, and in understanding the appropriate uses of the results.
Here we describe some of the central statistical ideas relevant to data mining, along with a number of recent techniques that may sometimes be applied. Our topics include features of probability distributions, estimation, hypothesis testing, model scoring, Gibbs sampling, rational decision making, causal inference, prediction, and model averaging. For a rigorous survey of statistics, the mathematically inclined reader should see [7]. Due to space limitations, we must also ignore a number of interesting topics, including time series analysis and meta-analysis.

Probability Distributions

The statistical literature contains mathematical characterizations of a wealth of probability distributions, as well as properties of random variables, that is, functions defined on the "events" to which a probability measure assigns values. Important relations among probability distributions include marginalization (summing over a subset of values) and conditionalization (forming a conditional probability measure from a probability measure on a sample space and some event of positive probability). Essential relations among random variables include independence, conditional independence, and various measures of dependence, of which the most famous is the correlation coefficient. The statistical literature also characterizes families of distributions by properties useful in identifying any particular member of the family from data, or by closure properties useful in model construction or inference (e.g., conjugate families closed under conditionalization and the multinormal family closed under linear combination). Knowledge of the properties of distribution families can be invaluable in analyzing data and making appropriate inferences.

Inference involves the following features:

• Estimation
• Consistency
• Uncertainty
• Assumptions
• Robustness
• Model averaging

Many procedures of inference can be thought of as estimators, or functions from data to some object to be estimated, whether the object is the values of a parameter, intervals of values, structures, decision trees, or something else. Where the data are a sample from a larger (actual or potential) collection described by a probability distribution, for any given sample size the array of values of an estimator over samples of that size has a probability distribution. Statistics investigates such distributions of estimates to identify features of an estimator related to the information, reliability, and uncertainty it provides.

An important feature of an estimator is consistency; in the limit, as the sample size increases without bound, estimates should almost certainly converge to the correct value of whatever is being estimated. Heuristic procedures, which abound in machine learning (and in statistics), have no guarantee of ever converging on the right answer. An equally important feature is the uncertainty of an estimate made from a finite sample. That uncertainty can be thought of as the probability distribution of estimates made from hypothetical samples of the same size obtained in the same way. Statistical theory provides measures of uncertainty (e.g., standard errors) and methods of calculating them for various families of
estimators. A variety of resampling and simulation techniques have also been developed for assessing uncertainties of estimates [1]. Other things (e.g., consistency) being equal, estimators that minimize uncertainty are preferred.

The importance of uncertainty assessments can be illustrated in many ways. For example, in recent research aimed at predicting the mortality of hospitalized pneumonia patients, a large medical database was divided into a training set and a test set. (Search procedures used the training set to form a model, and the test set helped assess the predictions of the model.) A neural net using a large number of variables outperformed several other methods. However, the neural net's performance turned out to be an accident of the particular train/test division. When a random selection of other train/test divisions (with the same proportions) was made and the neural net and competing methods were trained and tested according to each, the average neural net performance was comparable to that of logistic regression.

Estimation is almost always made on the basis of a set of assumptions, or model, but for a variety of reasons the assumptions may not be strictly met. If the model is incorrect, estimates based on it are also expected to be incorrect, although that is not always the case. One aim of statistical research is to find ways to weaken the assumptions necessary for good estimation. Robust statistics looks for estimators that work satisfactorily for larger families of distributions; resilient statistics [3] concerns estimators, often order statistics, that typically have small errors when assumptions are violated.

A more Bayesian approach to the problem of estimation under assumptions emphasizes that alternative models and their competing assumptions are often plausible. Rather than making an estimate based on a single model, several models can be considered, each with an appropriate probability, and when each of the competing models yields an estimate of the quantity of interest, an estimate can be obtained as the weighted average of the estimates given by the individual models [5]. When the probability weights are well calibrated to the frequencies with which the various models obtain, model averaging is bound to improve estimation on average. Since the models obtained in data mining are usually the result of some automated search procedure, the advantages of model averaging are best obtained if the error frequencies of the search procedure are known, something usually obtainable only through extensive Monte Carlo exploration. Our impression is that the error rates of search procedures proposed and used in the data mining and statistical literatures are rarely estimated in this way. (See [10] and [11] for Monte Carlo test design-for-search procedures.) When the probabilities of various models are entirely subjective, model averaging gives at least coherent estimates.

Hypothesis Testing

Hypothesis testing can be viewed as one-sided estimation in which, for a specific hypothesis and any sample of an appropriate kind, a testing rule either conjectures that the hypothesis is false or makes no conjecture. The testing rule is based on the conditional sampling distribution (conditional on the truth of the hypothesis to be tested) of some statistic or other. The significance level of a statistical test specifies the probability of erroneously conjecturing that the hypothesis is false (often called rejecting the hypothesis) when the hypothesis is in fact true. Given an appropriate alternative hypothesis, the probability of rejecting the hypothesis under test when the alternative is true can be calculated; that probability is called the power of the test against the alternative. The power of a test is obviously a function of the alternative hypothesis being considered.

Since statistical tests are widely used, some of their important limitations should be noted. Viewed as a one-sided estimation method, hypothesis testing is inconsistent unless the alpha level of the testing rule is decreased appropriately as the sample size increases. Generally, a level α test of one hypothesis and a level α test of another hypothesis do not jointly provide a level α test of the conjunction of the two hypotheses. In special cases, rules (sometimes called contrasts) exist for simultaneously testing several hypotheses [4]. An important corollary for data mining is that the alpha level of a test has nothing directly to do with the probability of error in a search procedure that involves testing a series of hypotheses. If, for example, for each pair of a set of variables, hypotheses of independence are tested at α = 0.5, then 0.5 is not the probability of erroneously finding some dependent set of variables when in fact all pairs are independent. That relation would hold (approximately) only when the sample size is much larger than the number of variables considered. Thus, in data mining procedures that use a sequence of hypothesis tests, the alpha level of the tests cannot generally be taken as an estimate of any error probability related to the outcome of the search.

In many, perhaps most, realistic hypothesis spaces, hypothesis testing is comparatively uninformative. If a hypothesis is not rejected by a test rule and a sample, the same test rule and the same sample may very well
also not reject many other hypotheses. And in the absence of knowledge of the entire power function of the test, the testing procedure provides no information about such alternatives. Further, the error probabilities of tests have to do with the truth of hypotheses, not with approximate truth; hypotheses that are excellent approximations may be rejected in large samples. Tests of linear models, for example, typically reject them in very large samples no matter how closely they seem to fit the data.

Model Scoring

The evidence provided by data should lead us to prefer some models or hypotheses to others and to be indifferent about still other models. A score is any rule that maps models and data to numbers whose numerical ordering corresponds to a preference ordering over the space of models, given the data. For such reasons, scoring rules are often an attractive alternative to tests. Indeed, the values of test statistics are sometimes themselves used as scores, especially in the structural-equation literature. Typical rules assign to a model a value determined by the likelihood function associated with the model, the number of parameters, or dimension, of the model, and the data. Popular rules include the Akaike Information Criterion (AIC), Bayes Information Criterion (BIC), and Minimum Description Length. Given a prior probability distribution over models, the posterior probability on the data is itself a scoring function, arguably a privileged one. The BIC approximates posterior probabilities in large samples.

There is a notion of consistency appropriate to scoring rules; in the large sample limit, the true model should almost surely be among those receiving maximal scores. AIC scores are generally not consistent [8]. The probability (p) values assigned to statistics in hypothesis tests of models are scores, but it does not seem to be known whether and under what conditions they form a consistent set of scores. There are also uncertainties associated with scores, since two different samples of the same size from the same distribution can yield not only different numerical values for the same model but even different orderings of models.

For obvious combinatorial reasons, it is often impossible when searching a large model space to calculate scores for all models; however, it is often feasible to describe and calculate scores for a few equivalence classes of models receiving the highest scores.

In some contexts, inferences made using Bayes scores and posteriors can differ a great deal from inferences made with hypothesis tests. (See [5] for examples of models that account for almost all of the variance of an outcome of interest and that have very high posterior or Bayes scores but are overwhelmingly rejected by statistical tests.)

Of the various scoring rules, perhaps the most interesting is the posterior probability, because, unlike many other consistent scores, posterior probability has a central role in the theory of rational choice. Unfortunately, posteriors can be difficult to compute.

Gibbs Sampling

Statistical theory typically gives asymptotic results that can be used to describe posteriors or likelihoods in large samples. Unfortunately, even in very large databases, the number of cases relevant to a particular question can be quite small. For example, in studying the effects of hospitalization on survival of pneumonia patients, mortality comparisons between those treated at home and those treated in a hospital might be wanted. But even in a very large database, the number of pneumonia patients treated at
home and who die of pneumonia complications is very small. And statistical theory typically provides few or no ways to calculate distributions in small samples in which the application of asymptotic formulas can be wildly misleading. Recently, a family of simulation methods, often described as Gibbs sampling after the great American physicist Josiah Willard Gibbs (1839–1903), has been adapted from statistical mechanics, permitting the approximate calculation of many distributions. A review of these procedures is in [9].

Rational Decision Making and Planning

The theory of rational choice assumes the decision maker has a definite set of alternative actions, knowledge of a definite set of possible alternative states of the world, and knowledge of the payoffs or utilities of the outcomes of each possible action in each possible state of the world, as well as knowledge of the probabilities of various possible states of the world. Given all this information, a decision rule specifies which of the alternative actions ought to be taken. A large literature in statistics and economics addresses alternative decision rules: maximizing expected utility, minimizing maximum possible loss, and more. Rational decision making and planning are typically the goals of data mining, but rather than providing techniques or methods for data mining, the theory of rational choice poses norms for the use of information obtained from a database.

The very framework of rational decision making requires probabilities for alternative states of affairs and knowledge of the effects alternative actions will have. To know the outcomes of actions is to know something of cause-and-effect relations. Extracting such causal information is often one of the principal goals of data mining and more generally of statistical inference.

Inference to Causes

Understanding causation is the hidden motivation behind the historical development of statistics. From the beginning of the field, in the work of Bernoulli and Laplace, the absence of causal connection between two variables has been taken to imply their probabilistic independence [12]; the same idea is fundamental in the theory of experimental design. In 1934, Sewall Wright, a biologist, introduced directed graphs to represent causal hypotheses (with vertices as random variables and edges representing direct influences); these graphs have become common representations of causal hypotheses in the social sciences, biology, computer science, and engineering. In 1982, statisticians Harry Kiiveri and T. P. Speed combined directed graphs with a generalized connection between independence and absence of causal connection in what they called the Markov condition: if Y is not an effect of X, then X and Y are conditionally independent, given the direct causes of X. Kiiveri and Speed showed that much of the linear modeling literature tacitly assumed the Markov condition; the Markov condition is also satisfied by most causal models of categorical data and by virtually all causal models of systems without feedback. Under additional assumptions, conditional independence provides information about causal dependence. The most common (and most thoroughly investigated) additional assumption is that all conditional independencies are due to the Markov condition's being applied to the directed graph describing the actual causal processes generating the data, a requirement with many names (e.g., faithfulness). Directed graphs with associated probability distributions satisfying the Markov condition are called by different names in different literatures (e.g., Bayes nets, belief nets, structural equation models, and path models).

Causal inference from uncontrolled convenience samples is liable to many sources of error. Three of the most important are latent variables (or confounders), sample selection bias, and model equivalence. A latent variable is any unrecorded feature that varies among recorded units and whose variation influences recorded features. The result is an association among recorded features not in fact due to any causal influence of the recorded features themselves. The possibility of latent variables can seldom, if ever, be ignored in data mining. Sample selection bias occurs when the values of any two of the variables under study, say X and Y, themselves influence whether a feature is recorded in a database. That influence produces a statistical association between X and Y (and other variables) that has no causal significance. Datasets with missing values pose sample selection bias problems. Model equivalence arises because models with quite different graphs may generate the same constraints on probability distributions through the Markov condition and may therefore be indistinguishable without experimental intervention. Any procedure that arbitrarily selects one or a few of the equivalents may badly mislead users when the models are given a causal significance. If model search is viewed as a form of estimation, all of these difficulties are sources of inconsistency.

Standard data mining methods run afoul of these difficulties. The search algorithms in such commercial linear model analysis programs as LISREL select one from an unknown number of statistically indistinguishable models. Regression methods are inconsistent for all of the reasons listed earlier. For example, consider the structure: Y = aT + eY; X1 = bT + cQ + e1; X2 = dQ + e2, where T and Q are unrecorded. Neither X1 nor X2 has any influence on Y. For all nonzero values of a, b, c, d, however, in sufficiently large samples, regression of Y on X1, X2 yields significant regression coefficients for X1 and X2. With the causal interpretation often given it, regression says that X1 and X2 cause Y. Assuming the Markov and faithfulness conditions, all that can be inferred correctly (in large samples) from data on X1, X2, and Y is that X1 is not a cause of X2 or of Y; X2 is not a cause of Y; Y is not a cause of X2; and there is no common cause of Y and X2. Nonregression algorithms implemented in the TETRAD II program [6, 10] give the correct result asymptotically in this case and in all cases in which the Markov and faithfulness conditions hold. The results are also robust against the three problems with causal inference noted in the previous paragraph [11]. However, the statistical decisions made by the algorithms are not really optimal, and the implementations are limited to the multinomial and multinormal families of probability distributions. A review of Bayesian search procedures for causal models is given in [2].

Prediction

Sometimes one is interested in using a sample, or a database, to predict properties of a new sample, where it is assumed the two samples are obtained from the same probability distribution. As with estimation, the concerns in prediction are accuracy and uncertainty, the latter often measured by the variance of the predictor.

Prediction methods for this sort of prediction problem always assume some regularities (constraints) in the probability distribution. In data mining contexts, the constraints are typically either supplied by human experts or automatically inferred from the database. For example, regression assumes a particular functional form for relating variables or, in the case of logistic regression, relating the values of some variables to the probabilities of other variables; but constraints are implicit in any prediction method that uses a database to adjust or estimate the parameters used in prediction. Other forms of constraint may include independence, conditional independence, and higher-order conditions on correlations (e.g., tetrad constraints). On average, a prediction method guaranteeing satisfaction of the constraints realized in the probability distribution is more accurate and has a smaller variance than a prediction method that does not. Finding the appropriate constraints to be satisfied is the most difficult issue in this sort of prediction. As with estimation, prediction can be improved by model averaging, provided the probabilities of the alternative assumptions imposed by the model are available.

Another sort of prediction involves interventions that alter the probability distribution, as in predicting the values (or probabilities) of variables under a change in manufacturing procedures or changes in economic or medical treatment policies. Making accurate predictions of this kind requires some knowledge of the relevant causal structure and is generally quite different from prediction without intervention, although the same caveats about uncertainty and model averaging apply. For graphical representations of causal hypotheses according to the Markov condition, general algorithms for predicting the outcomes of interventions from complete or incomplete causal models were developed in [10]. In 1995, some of these procedures were extended and made into a more convenient calculus by Judea Pearl, a computer scientist. A related theory without graphical models was developed in 1974 by Donald Rubin, a statistician, and
others, and in 1986 by James Robins.

Well-known studies by Herbert Needleman, a physician and statistician, of the correlation of lead deposits in children's teeth and the children's IQs resulted, eventually, in removal of tetraethyl lead from gasoline in the U.S. One dataset Needleman examined included more than 200 subjects and measured a large number of covariates. In 1985, Needleman and his colleagues reanalyzed the data using backward stepwise regression of verbal IQ on these variables and obtained six significant regressors, including lead. In 1988, Steven Klepper, an economist, and his collaborators reanalyzed the data assuming that all the variables were measured with error. Klepper's model assumes that each measured number is a linear combination of the true value and an error and that the parameters of interest are not the regression coefficients but the coefficients relating the unmeasured true-value variables to the unmeasured true value of verbal IQ.

These coefficients are in fact indeterminate or, in econometric terminology, unidentifiable. However, an interval estimate of the coefficients that is strictly positive or negative for each coefficient can be made if the amount of measurement error can be bounded with prior knowledge by an amount that varies from case to case. For example, Klepper found that the bound required to ensure the existence of a strictly negative interval estimate for the lead-to-IQ coefficient was much too strict to be credible; thus he concluded that the case against lead was not nearly as strong as Needleman's analysis suggested.

Allowing the possibility of latent variables, Richard Scheines in 1996 reanalyzed the correlations with the TETRAD II program and concluded that three of the six regressors could have no influence on IQ. The regression included the three extra variables only because the partial regression coefficient is estimated by conditioning on all other regressors: just the right thing to do for linear prediction, but the wrong thing to do for causal inference using the Markov condition (see the example at the end of the earlier section Inference to Causes). Using the Klepper model (but without the three irrelevant variables) and assigning to all of the parameters a normal prior probability with mean zero and a substantial variance, Scheines used Gibbs sampling to compute a posterior probability distribution for the lead-to-IQ parameter. The probability is very high that lead exposure reduces verbal IQ.

Conclusion

The statistical literature has a wealth of technical procedures and results to offer data mining, but it also offers several methodological morals:

• Prove that estimation and search procedures used in data mining are consistent under conditions reasonably thought to apply in applications;
• Use and reveal uncertainty, don't hide it;
• Calibrate the errors of search, for honesty and to take advantage of model averaging;
• Don't confuse conditioning with intervening, that is, don't take the error probabilities of hypothesis tests to be the error probabilities of search procedures.

Otherwise, good luck. You'll need it.

References

1. Efron, B. The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics (SIAM), Number 38, Philadelphia, 1982.
2. Heckerman, D. Bayesian networks for data mining. Data Mining and Knowledge Discovery, submitted.
3. Hoaglin, D., Mosteller, F., and Tukey, J. Understanding Robust and Exploratory Data Analysis. Wiley, New York, 1983.
4. Miller, R. Simultaneous Statistical Inference. Springer-Verlag, New York, 1981.
5. Raftery, A.E. Bayesian model selection in social research. Working Paper 94-12, Center for Studies in Demography and Ecology, Univ. of Washington, Seattle, 1994.
6. Scheines, R., Spirtes, P., Glymour, C., and Meek, C. TETRAD II: Tools for Causal Modeling. Users Manual. Erlbaum, Hillsdale, N.J., 1994.
7. Schervish, M. Theory of Statistics. Springer-Verlag, New York, 1995.
8. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6 (1978), 461–464.
9. Smith, A.F.M., and Roberts, G.O. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc., Series B, 55 (1993), 3–23.
10. Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. Springer-Verlag, New York, 1993.
11. Spirtes, P., Meek, C., and Richardson, T. Causal inference in the presence of latent variables and selection bias. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. P. Besnard and S. Hanks, Eds. Morgan Kaufmann Publishers, San Mateo, Calif., 1995, pp. 499–506.
12. Stigler, S. The History of Statistics. Harvard University Press, Cambridge, Mass., 1986.

Additional references for this article can be found at http://www.research.microsoft.com/research/datamine/CACM-DM-refs/.

CLARK GLYMOUR is Alumni University Professor at Carnegie Mellon University and Valtz Family Professor of Philosophy at the University of California, San Diego. He can be reached at cg09@andrew.cmu.edu.

DAVID MADIGAN is an associate professor of statistics at the University of Washington in Seattle. He can be reached at madigan@stat.washington.edu.

DARYL PREGIBON is the head of statistics research in AT&T Laboratories. He can be reached at daryl@research.att.com.

PADHRAIC SMYTH is an assistant professor of information and computer science at the University of California, Irvine. He can be reached at smyth@ics.uci.edu.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© ACM 0002-0782/96/1100 $3.50
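
Illustrative sketch: test level versus search error. The Hypothesis Testing section argues that the alpha level of the individual tests is not the error probability of a search that runs many tests. The following is a minimal Monte Carlo sketch of that point and is not taken from the article. It assumes independent standard normal variables, a Fisher z approximation for the test of zero correlation, and a per-test level of 0.05 chosen here as a conventional value; the function names and all parameter values are hypothetical.

```python
import itertools
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rejects_independence(xs, ys, alpha):
    """Two-sided test of zero correlation (Fisher z approximation, an assumption)."""
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(len(xs) - 3)
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def search_error_rate(n_vars, n_cases, alpha, n_runs, seed=0):
    """Fraction of runs in which at least one pair is wrongly declared dependent.

    Every variable is generated independently, so any rejection is an error.
    """
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_runs):
        data = [[rng.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_vars)]
        pairs = itertools.combinations(range(n_vars), 2)
        if any(rejects_independence(data[i], data[j], alpha) for i, j in pairs):
            false_alarms += 1
    return false_alarms / n_runs


# 8 variables give 28 pairwise tests, each run at the same per-test level.
rate = search_error_rate(n_vars=8, n_cases=100, alpha=0.05, n_runs=200)
print(f"per-test alpha = 0.05, estimated search-level error rate = {rate:.2f}")
```

Under these assumptions the estimated search-level error rate comes out far above the per-test level, which is the kind of calibration by simulation that the article recommends for search procedures.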
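
Illustrative sketch: regression with unrecorded common causes. The Inference to Causes section gives the structure Y = aT + eY; X1 = bT + cQ + e1; X2 = dQ + e2 with T and Q unrecorded, and states that regressing Y on X1 and X2 yields nonzero coefficients for both even though neither influences Y. The sketch below simulates that structure under assumptions that are not in the article: a = b = c = d = 1 and independent standard normal T, Q, and error terms. With those choices the population regression coefficients work out to 0.4 for X1 and -0.2 for X2. The function names are hypothetical.

```python
import random


def simulate(n, a=1.0, b=1.0, c=1.0, d=1.0, seed=0):
    """Draw n cases of (Y, X1, X2); the latent T and Q are never returned."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t, q = rng.gauss(0, 1), rng.gauss(0, 1)  # unrecorded variables
        y = a * t + rng.gauss(0, 1)
        x1 = b * t + c * q + rng.gauss(0, 1)
        x2 = d * q + rng.gauss(0, 1)
        rows.append((y, x1, x2))
    return rows


def ols_two_regressors(rows):
    """Least-squares slopes of Y on X1 and X2 (with intercept).

    Solves the 2x2 normal equations on mean-centered data.
    """
    n = len(rows)
    my = sum(r[0] for r in rows) / n
    m1 = sum(r[1] for r in rows) / n
    m2 = sum(r[2] for r in rows) / n
    s11 = sum((r[1] - m1) ** 2 for r in rows)
    s22 = sum((r[2] - m2) ** 2 for r in rows)
    s12 = sum((r[1] - m1) * (r[2] - m2) for r in rows)
    s1y = sum((r[1] - m1) * (r[0] - my) for r in rows)
    s2y = sum((r[2] - m2) * (r[0] - my) for r in rows)
    det = s11 * s22 - s12 ** 2
    beta1 = (s22 * s1y - s12 * s2y) / det
    beta2 = (s11 * s2y - s12 * s1y) / det
    return beta1, beta2


beta1, beta2 = ols_two_regressors(simulate(20000))
print(f"coefficient on X1: {beta1:.3f}, coefficient on X2: {beta2:.3f}")
```

X2 is uncorrelated with Y on its own; it receives a nonzero coefficient only because conditioning on X1 links it to Y through the unrecorded Q and T, which is why the coefficients cannot be read as causal effects.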
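
Illustrative sketch: a Gibbs sampler. The Gibbs Sampling section describes simulation methods that approximate distributions when asymptotic formulas are unreliable. As a minimal sketch of the general idea, and not of the posterior computations reviewed in [9] or used in the lead study, the code below samples a standard bivariate normal with correlation rho by alternately drawing each coordinate from its conditional distribution given the other. The target distribution, the value rho = 0.8, the burn-in length, and the function names are all assumptions made for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_draws, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Uses the known conditionals X | Y=y ~ N(rho*y, 1 - rho^2) and
    Y | X=x ~ N(rho*x, 1 - rho^2).
    """
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    x = y = 0.0
    draws = []
    for step in range(burn_in + n_draws):
        x = rng.gauss(rho * y, sd)  # draw X given the current Y
        y = rng.gauss(rho * x, sd)  # draw Y given the new X
        if step >= burn_in:
            draws.append((x, y))
    return draws


def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(rho=0.8, n_draws=20000)
r = sample_correlation(draws)
print(f"target correlation 0.8, estimated from Gibbs draws: {r:.3f}")
```

Successive draws are correlated with one another, so a chain of a given length carries less information than the same number of independent samples; that is one more uncertainty to report when such estimates are used.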