Statistical Approaches to Missing Data


Statistical Approaches to Missing Data: Imputation, Interpolation, and Data Fusion. 3rd Socio-Cultural Data Summit.

  1. 3rd Socio-Cultural Data Summit. Statistical Approaches to Missing Data: Imputation, Interpolation, and Data Fusion. Brian Efird, Ph.D., National Defense University
  2. What Do We Mean By “Missing Data”?
     • In a structured, quantitative dataset, we simply mean that some of the “observations” have null values. That is, there is no observation for some part(s) of the dataset.
       − E.g., in a survey, one or more respondents did not answer one or more questions.
       − We intended to have these observations, but they are not present in the dataset.
     • Missing responses can also be “strategic” (e.g. deception or self-preservation).
     • However, we would still like to say something or make an inference about the phenomena that are supposedly measured by the dataset, as if we had no missing values.
     • One approach simply ignores the missing data. Another applies one of various statistical techniques to “fill” the holes in the dataset.
     • Either approach has consequences and requires one to understand a bit more about why the data are missing.
  3. Typical Assumptions About Missing Data for Statistics
     • Values can be missing on dependent (response) variables or on independent (explanatory) variables.
     • Missing data can affect the properties of estimators (for example, means, percentages, percentiles, variances, ratios, regression parameters, and so on).
     • Missing data can also affect inferences, i.e. the properties of tests, confidence intervals, and Bayesian posterior distributions.
     • A critical determinant of these effects is the way in which the probability of an observation being missing (the missingness mechanism) depends on other variables (measured or not) and on its own value.
     • If one ignores missing data, it may bias the sample. E.g., if you only include observations in behavioral data where every question is answered, you typically end up with a very odd sample.
  4. More Assumptions About Missing Data for Statistics
     • In contrast with the sampling process, which is usually known, the missingness mechanism is usually unknown.
     • The additional assumptions needed to allow the observed data to be the basis of inferences that would have been available from the complete data can usually be expressed in terms of either:
       − The relationship between selection of missing observations and the values they would have taken, or
       − The statistical behavior of the unseen data.
     • These additional assumptions are not subject to assessment from the data under analysis; their plausibility cannot be definitively determined from the data.
  5. What Type of Missing Data Do You Have – MCAR?
     • Missing data are said to be missing completely at random (MCAR) if the probability that data are missing does not depend on observed or unobserved data.
     • Under MCAR, the missing-data values are a simple random sample of all data values, and so any analysis that discards the missing values remains consistent (although maybe inefficient).
     • An example of a MCAR mechanism would be that a laboratory sample is dropped, so the resulting observation is missing. Or data may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly.
     • This is the best case. It means there is no underlying mechanism or pattern (observed or unobserved) which explains the missing data. Proceed…
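The consistency claim for MCAR can be checked with a small simulation. This is only a sketch: the population (normal, mean 50, standard deviation 10), the 30% loss rate, and the seed are illustrative assumptions, not values from the talk.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical complete data: 20,000 draws from a normal distribution.
complete = [random.gauss(50.0, 10.0) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being lost,
# regardless of its own size or of any other variable.
observed = [y for y in complete if random.random() >= 0.30]

full_mean = statistics.fmean(complete)
observed_mean = statistics.fmean(observed)
retained = len(observed) / len(complete)

print(f"complete-data mean: {full_mean:.2f}")
print(f"observed-only mean: {observed_mean:.2f}")
print(f"share retained:     {retained:.2f}")
```

The two means should agree closely. The cost of discarding the missing cases is a smaller sample (less efficiency), not bias.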
  6. What Type of Missing Data Do You Have – MAR?
     • Missing data are said to be missing at random (MAR) if the probability that data are missing does not depend on unobserved data but may depend on observed data.
     • That is, the data are not missing completely at random.
     • In other words, under MAR, the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of random.
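The verbal definitions of MCAR and MAR can be written compactly. The notation is an assumption borrowed from the standard missing-data literature rather than from the slides: R is the indicator of which values are missing, and the data are split into an observed part and an unobserved (missing) part.

```latex
% R: missingness indicator; Y_obs, Y_mis: observed and unobserved parts of the data
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}})
\end{align*}
```

MCAR is the special case of MAR in which the observed data play no role in the missingness either.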
  7. What Type of Missing Data Do You Have – MAR? (cont’d)
     • For example:
       − People who are depressed might be less inclined to report their income, and thus whether income is reported will be related to depression.
       − Depressed people might also have a lower income in general, and thus when we have a high rate of missing data among depressed individuals, the mean of the reported incomes might be higher than it would be without missing data.
       − However, if, within depressed patients, the probability of reporting income was unrelated to income level, then the data would be considered MAR, though not MCAR.
       − Another way of saying this is that, to the extent that missingness can be explained by (is correlated with) other variables that are included in the analysis, the data are MAR.
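The depression and income example can be simulated to show both the bias and why MAR is still workable. Every number here (group shares, income levels in thousands, reporting rates, seed) is a hypothetical choice for illustration only.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 30% depressed; income in thousands.
people = []
for _ in range(50_000):
    depressed = random.random() < 0.30
    income = random.gauss(40.0 if depressed else 60.0, 10.0)
    # MAR: the chance of reporting depends only on depression status
    # (observed for everyone), not on income within a group.
    reports = random.random() >= (0.60 if depressed else 0.10)
    people.append((depressed, income, reports))

true_mean = statistics.fmean([inc for _, inc, _ in people])
naive_mean = statistics.fmean([inc for _, inc, rep in people if rep])

# Adjust using the observed variable: average the reported incomes within
# each depression group, then weight by each group's share of everyone.
adjusted_mean = 0.0
for group in (True, False):
    share = sum(1 for d, _, _ in people if d == group) / len(people)
    group_mean = statistics.fmean(
        [inc for d, inc, rep in people if d == group and rep]
    )
    adjusted_mean += share * group_mean

print(f"true mean:     {true_mean:.2f}")
print(f"naive mean:    {naive_mean:.2f}")
print(f"adjusted mean: {adjusted_mean:.2f}")
```

The naive mean of reported incomes runs high because depressed (lower-income) people are under-represented among reporters. Conditioning on depression status, which is observed for everyone in this sketch, brings the estimate back close to the true mean.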
  8. What Type of Missing Data Do You Have – MNAR?
     • Missing data are said to be missing not at random (MNAR) if the probability that data are missing depends on unobserved data, i.e. the data are missing for a specific and systematic, but unobserved, reason.
     • We cannot ignore data that are MNAR.
     • For example:
       − If we are studying mental health and people who have been diagnosed as depressed are less likely than others to report their mental status, the data are not missing at random.
       − Clearly, the mean mental status score for the available data will not be an unbiased estimate of the mean that we would have obtained with complete data.
       − The same thing happens when people with low income are less likely to report their income on a data collection form.
       − Or, if you ask opinions on a large number of instruments, typically only highly educated people answer all of them. If you drop non-responses, you bias the sample badly.
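A short simulation of the low-income example shows why MNAR cannot be ignored. The income distribution, the cutoff of 45, the reporting rates, and the seed are all hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the illustration is repeatable

# Hypothetical incomes (in thousands).
incomes = [random.gauss(50.0, 10.0) for _ in range(50_000)]

# MNAR: the chance of reporting depends on the income value itself.
# Here, people below 45 report 40% of the time, everyone else 90%.
reported = [
    y for y in incomes
    if random.random() < (0.40 if y < 45.0 else 0.90)
]

true_mean = statistics.fmean(incomes)
observed_mean = statistics.fmean(reported)

print(f"true mean:     {true_mean:.2f}")
print(f"observed mean: {observed_mean:.2f}")
```

The observed mean overstates the true mean, and in this sketch there is no fully observed variable to condition on, so the bias cannot be removed using the observed data alone.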
  9. Introduction to Imputation
     • Missing data arise frequently.
     • The technique of multiple imputation, which originated in the early 1970s in application to survey nonresponse, has gained popularity over the years.
     • An imputation represents one set of plausible values for missing data. Multiple imputations represent multiple sets of plausible values.
     • Multiple imputation is a simulation-based exercise in which a number of plausible values are generated for each missing observation.
     • This raises the secondary but still important question: if multiple imputations are to be generated, how many should one simulate? More is better to some extent…
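The slides do not say how results from the multiple imputed datasets are combined. The usual approach is Rubin's rules, sketched below: the pooled estimate is the average across imputations, and its variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the five estimates and variances are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)    # pooled point estimate
    u_bar = statistics.fmean(variances)    # within-imputation variance
    b = statistics.variance(estimates)     # between-imputation variance (m - 1 divisor)
    total = u_bar + (1 + 1 / m) * b        # total variance
    return q_bar, total


# Hypothetical results from m = 5 imputed datasets.
estimates = [10.2, 9.8, 10.5, 10.1, 9.9]
variances = [0.40, 0.38, 0.42, 0.41, 0.39]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.3f}")
print(f"total variance:  {total_var:.3f}")
```

The factor (1 + 1/m) shrinks toward 1 as the number of imputations grows, which is one way to see why more imputations help only up to a point.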
  10. Interpolation – A Simple Example of Imputation
     We have data points on y and x, although sometimes the observations on y are missing. We believe that y is a function of x, justifying filling in the missing values by linear interpolation.
     Interpolation uses the values of x to approximate missing values of y in y1 and y2.
     Inference is using the data that we do have (i.e. in a survey, those questions that were answered) to fill in values for what we don’t have (i.e. what they didn’t answer or were unwilling to answer).
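Linear interpolation of missing y values from x can be sketched in a few lines. The function name and the sample data are hypothetical; the sketch assumes x is sorted in ascending order with no repeated values and leaves gaps at the ends unfilled instead of extrapolating.

```python
def interpolate_missing(xs, ys):
    """Fill None entries of ys by linear interpolation on xs.

    Assumes xs is sorted ascending with distinct values. Missing values
    before the first or after the last known y stay None.
    """
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        left = [(kx, ky) for kx, ky in known if kx < x]
        right = [(kx, ky) for kx, ky in known if kx > x]
        if not left or not right:
            filled.append(None)  # no neighbor on one side: do not extrapolate
            continue
        (x0, y0), (x1, y1) = left[-1], right[0]
        # Point on the straight line between the two nearest known neighbors.
        filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return filled


xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, None, 6.0, None, None, 12.0]
filled = interpolate_missing(xs, ys)
print(filled)
```

Each filled value lies on the line joining the nearest observed points on either side, which is only reasonable if y really does vary roughly linearly in x between observations.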
  11. A Bit More on Imputation
     • Univariate imputation is used to impute a single variable. It can be used repeatedly to impute multiple variables only when the variables are independent and will be used in separate analyses.
       − Well-established techniques are available for a variety of variable types, e.g. continuous, censored, binary, categorical, and count variables.
     • If variables follow a “monotone-missing” pattern, they can be imputed sequentially using univariate conditional distributions.
     • When the pattern of missing values is arbitrary, iterative or multivariate methods should be used to fill in missing values.
     • As with any statistical procedure, choosing an appropriate imputation approach is an art, and the choice should ultimately be determined by your data and research objectives. It is good practice to check that your imputations are sensible and to…
  12. More Concretely
     • Essentially, imputation is using responses we do have to construct a model to fill in responses we do NOT have.
     • Other, naive techniques (e.g., filling in non-responses with the mean of the respondents) are not as good as using a model (i.e. treating the variable with missing data as a dependent variable and using logical independent variables to help fill in the values).
     • For example:
       − If a person skips a policy instrument (e.g., abortion) but answered the questions on gay marriage and religion in politics, plus demographics, it’s easy to impute the abortion response, and a lot more logically satisfying than filling in the mean.
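The contrast between mean filling and model-based filling can be made concrete with a small simulation. This is a sketch under simplifying assumptions: a single continuous predictor x and a continuous response y stand in for the survey items on the slide, the true relationship (y = 2 + 3x plus noise), the 30% missing rate, and the seed are hypothetical, and the model is a plain least-squares line fitted by hand.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the illustration is repeatable

# Hypothetical data: y depends linearly on x, plus noise.
n = 2_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]
missing = [random.random() < 0.30 for _ in range(n)]  # which y values are lost

obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Fit y = intercept + slope * x by least squares on the observed cases.
mean_x, mean_y = statistics.fmean(obs_x), statistics.fmean(obs_y)
slope = (
    sum((a - mean_x) * (b - mean_y) for a, b in zip(obs_x, obs_y))
    / sum((a - mean_x) ** 2 for a in obs_x)
)
intercept = mean_y - slope * mean_x

# Compare both fill-in rules against the true (hidden) values.
held_out = [(xi, yi) for xi, yi, m in zip(x, y, missing) if m]
mean_rmse = math.sqrt(
    statistics.fmean([(mean_y - yi) ** 2 for _, yi in held_out])
)
model_rmse = math.sqrt(
    statistics.fmean([(intercept + slope * xi - yi) ** 2 for xi, yi in held_out])
)

print(f"mean imputation RMSE:  {mean_rmse:.2f}")
print(f"model imputation RMSE: {model_rmse:.2f}")
```

The model-based values land much closer to the hidden truth because they use what each respondent did answer. Filling in the fitted value alone still understates the variability of the missing responses, which is one motivation for drawing multiple plausible values as described under multiple imputation.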