Missing Data: Five Practical Guidelines
R22013
Prakriti Sinha
Missing Data Levels
Item-Level Missingness
- Answering only j out of J possible items on a scale
Construct-Level Missingness
- Answering zero items on a construct (the entire scale)
Person-Level Missingness
- Failure of a person to return the survey
• Missing data levels are nested
- Item-level missingness can aggregate into construct-level missingness
- Construct-level missingness can aggregate into person-level missingness
• The choice of an appropriate missing data technique can depend on the level of missingness
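The nesting of the three levels can be sketched in a few lines of Python. This is only an illustration: the data layout (one dict per person, `None` for a skipped item) and the function names `construct_status` and `person_status` are assumptions, not part of the original slides.

```python
# Hypothetical data layout: one dict per person, mapping each construct
# name to a list of item answers, with None marking a skipped item.

def construct_status(items):
    """Classify one scale as complete, item-level, or construct-level missing."""
    answered = sum(value is not None for value in items)
    if answered == 0:
        return "construct-level"   # zero items answered on this scale
    if answered < len(items):
        return "item-level"        # only j out of J items answered
    return "complete"


def person_status(responses):
    """Return per-construct statuses and whether the whole survey is missing."""
    statuses = {name: construct_status(items) for name, items in responses.items()}
    # Person-level missingness: every construct is entirely unanswered.
    person_level = all(status == "construct-level" for status in statuses.values())
    return statuses, person_level
```

For example, a person who skips one satisfaction item and the whole commitment scale shows item-level missingness on the first construct and construct-level missingness on the second, but is not a person-level nonrespondent.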
Missing Data Are Partly Unavoidable, and Partly Avoidable
Missing data are natural and unavoidable
• A consequence of the ethical principle of respect for persons
• Members of the target population are allowed to autonomously opt out of the study
Much missing data are avoidable
• Personally distributing surveys
• Using identification numbers
• Personalization of the survey invitation
• University sponsorship of the survey
• Giving advance notice
3 Missing Data Mechanisms
Data can be missing:
- Randomly
- Systematically
• MCAR (missing completely at random): R_miss(Y), the missingness of Y, is not related to X or to Y
• MAR (missing at random): R_miss(Y) is related to X, but is not related to Y after controlling for X
• MNAR (missing not at random): R_miss(Y) is related to Y
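A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the data-generating model (Y = 0.5·X + noise), the 30% MCAR rate, the logistic missingness curves, and the function name `observed_vs_full_mean` are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics


def observed_vs_full_mean(mechanism, n=20000, seed=0):
    """Simulate (X, Y), delete Y under one mechanism, and return
    (mean of all Y, mean of the Y values that remain observed)."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 0.5 * x + rng.gauss(0.0, 1.0)
        if mechanism == "MCAR":
            p_miss = 0.3                                 # unrelated to X and Y
        elif mechanism == "MAR":
            p_miss = 1.0 / (1.0 + math.exp(-2.0 * x))    # depends on X only
        elif mechanism == "MNAR":
            p_miss = 1.0 / (1.0 + math.exp(-2.0 * y))    # depends on Y itself
        else:
            raise ValueError("mechanism must be MCAR, MAR, or MNAR")
        full.append(y)
        if rng.random() >= p_miss:
            observed.append(y)
    return statistics.fmean(full), statistics.fmean(observed)
```

In this setup the mean of the observed Y values stays close to the full-data mean under MCAR, but drifts downward under MAR and MNAR, because high-X or high-Y cases are the ones most likely to go missing.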
Missing Data Treatments
• Listwise Deletion
• Pairwise Deletion
• Single Imputation
• Multiple Imputation
• Maximum Likelihood
• Sensitivity Analysis
Missing Data Treatments: LISTWISE DELETION
• Deleting the entire row (person-level) for any person with missing data, then proceeding with the analysis
• This procedure converts item-level and scale-level missingness into person-level missingness!
Guideline 1: Use All the Available Data
• Listwise deletion
- Compounds the problem of sample nonresponse
- Often greatly reduces sample size and statistical power
- Yields biased parameter estimates under systematic (MAR and MNAR) missingness
- Implies a target population of "individuals who fill out surveys completely", which is not theoretically defensible
Avoid listwise deletion outright!
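The loss of sample size under listwise deletion can be illustrated with a short sketch. The row layout (`None` for a missing value), the function names, and the assumption that each variable is missing independently with the same probability are all hypothetical simplifications.

```python
def listwise_delete(rows):
    """Keep only the rows (persons) with no missing value at all."""
    return [row for row in rows if all(value is not None for value in row)]


def expected_complete_fraction(p_missing, n_variables):
    """Expected share of complete rows, ASSUMING each variable is missing
    independently with the same probability p_missing."""
    return (1.0 - p_missing) ** n_variables
```

Under that independence assumption, 10% missingness on each of 20 variables leaves only about 12% of persons with complete data (0.9 raised to the 20th power), which is why listwise deletion can remove most of a sample even when each variable is only lightly affected.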
Missing Data Treatments: SINGLE IMPUTATION
Single imputation techniques fill in each missing datum with a "good guess" as to what the missing datum should be.
• Mean imputation (i.e., the mean across persons) underestimates variances and correlations
• Hot deck imputation, which uses "donors", increases error and performs worse than regression imputation
• Regression imputation, which uses predicted values, underestimates variance and can bias correlations
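Why mean imputation underestimates variance can be seen in a tiny worked example. The helper name `mean_impute` and the toy values are assumptions used only for illustration.

```python
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Toy data: four observed scores and two missing ones.
scores = [2.0, 4.0, 6.0, 8.0, None, None]
variance_observed = statistics.variance([v for v in scores if v is not None])
variance_imputed = statistics.variance(mean_impute(scores))
```

Every imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while still increasing n. Here the sample variance falls from about 6.67 (observed cases only) to 4.0 after imputation.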
Guideline 2: Do Not Use Single Imputation
• Single imputation
- Most single imputation techniques are biased even under MCAR
- Accurate SEs cannot be calculated for hypothesis testing
- This creates Type I errors of inference
Place a moratorium on single imputation!
Guideline 3: Construct-Level Missingness: Use Maximum Likelihood or Multiple Imputation Missing Data Treatments Whenever 10% or More of the Respondent Sample Is Made Up of Construct-Level Partial Respondents
Response rate = (n partial respondents + n full respondents) / n contacted
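The response rate formula and the 10% threshold can be written as a short sketch. The function names and the example counts (40 partial respondents, 260 full respondents, 1,000 people contacted) are made up for illustration.

```python
def response_rate(n_partial, n_full, n_contacted):
    """Response rate = (partial + full respondents) / number contacted."""
    return (n_partial + n_full) / n_contacted


def partial_respondent_share(n_partial, n_full):
    """Share of the respondent sample made up of partial respondents."""
    return n_partial / (n_partial + n_full)


def needs_ml_or_mi(n_partial, n_full, threshold=0.10):
    """Guideline 3 check: True when partial respondents reach the threshold."""
    return partial_respondent_share(n_partial, n_full) >= threshold
```

With the example counts, the response rate is 300 / 1000 = 0.30 and partial respondents make up 40 / 300, or about 13%, of the respondent sample, so the guideline would point to maximum likelihood or multiple imputation.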
Missing Data Treatments: MULTIPLE IMPUTATION
Multiple Imputation (MI) is a 3-step process:
• Step 1) Impute (fill in) missing values multiple times to create multiple, partly imputed datasets.
• Step 2) Run the analysis on each of these partly imputed datasets.
• Step 3) Combine the multiple results to get parameter estimates and standard errors.
Each single imputation contains some inaccuracy, so the imputations are performed multiple times and then aggregated in a way that accounts for the uncertainty of each imputation.
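The three MI steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the procedure from the slides: Y is partly missing and X is fully observed, step 1 uses stochastic regression imputation fitted on a bootstrap resample of the complete cases, the analysis in step 2 is just the mean of Y, and step 3 pools with Rubin's rules (the slides do not name a pooling formula). All function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares of y on x; returns intercept, slope, residual SD."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5


def impute_once(xs, ys, rng):
    """Step 1: fill in each missing y with a prediction plus random noise."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    boot = [rng.choice(complete) for _ in complete]  # reflects parameter uncertainty
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    return [y if y is not None else a + b * x + rng.gauss(0.0, sd)
            for x, y in zip(xs, ys)]


def analyze(ys):
    """Step 2: the analysis of interest; here the mean and its squared SE."""
    return statistics.fmean(ys), statistics.variance(ys) / len(ys)


def pool(estimates, variances):
    """Step 3: Rubin's rules. Total variance = within + (1 + 1/m) * between."""
    m = len(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total = within + (1.0 + 1.0 / m) * between
    return statistics.fmean(estimates), total ** 0.5


def multiple_imputation(xs, ys, m=20, seed=0):
    """Run all three steps and return the pooled estimate and its SE."""
    rng = random.Random(seed)
    results = [analyze(impute_once(xs, ys, rng)) for _ in range(m)]
    return pool([r[0] for r in results], [r[1] for r in results])
```

The between-imputation term in `pool` is what distinguishes MI from single imputation: it carries the uncertainty about the imputed values into the final standard error.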
Missing Data Treatments: MAXIMUM LIKELIHOOD
Maximum Likelihood (ML): direct estimation of parameters and standard errors by choosing the estimates that maximize the probability of the observed data
• There are two common ML missing data techniques: Full Information Maximum Likelihood (FIML) and the EM algorithm. FIML directly estimates parameters and provides accurate standard errors (SEs), while the EM algorithm calculates summary statistics for further analysis.
• Auxiliary variables (i.e., variables used only for imputation, not part of the theoretical model being tested) are easily incorporated into the EM algorithm. These variables can make an MNAR mechanism more similar to MAR.
ML methods are acknowledged as mathematically complex.
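To show how the EM algorithm produces summary statistics, here is a minimal sketch for the simplest case: two variables assumed bivariate normal, with X fully observed and Y partly missing. The function name `em_bivariate`, the fixed iteration count, and the two-variable setup are assumptions for illustration; real software handles arbitrary missingness patterns and many variables.

```python
import statistics


def em_bivariate(xs, ys, n_iter=200):
    """EM estimates of means, variances, and covariance for (X, Y),
    where ys contains None for missing values and xs is complete."""
    n = len(xs)
    mu_x = statistics.fmean(xs)
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n   # ML variance of X
    observed = [y for y in ys if y is not None]
    mu_y = statistics.fmean(observed)             # starting values
    s_yy = statistics.pvariance(observed)
    s_xy = 0.0
    for _ in range(n_iter):
        # E-step: expected y and y^2 for each case, given current estimates.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in zip(xs, ys):
            if y is None:
                ey = mu_y + beta * (x - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            sum_y += ey
            sum_yy += eyy
            sum_xy += x * ey
        # M-step: update the summary statistics from the expected sums.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return {"mu_x": mu_x, "mu_y": mu_y, "s_xx": s_xx, "s_xy": s_xy, "s_yy": s_yy}
```

The returned means and (co)variances are the kind of summary statistics that can then be passed to a further analysis such as a regression. Note that this sketch gives point estimates only, with no standard errors.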
Guideline 4: Item-Level Missingness: One Item Is Enough!
Two Approaches for Handling Item-Level Missing Data:
• Listwise Deletion Cutoffs: This approach drops participants from the analysis if they fail to respond to at least half of the items on a scale. It is commonly taught but arbitrary, and it converts item-level missingness into construct-level missingness, which may lead to data loss.
• Mean Across Available Items: This approach calculates an individual's scale score using only the items they responded to. It is sometimes referred to as "mean substitution across items." It avoids data loss but may introduce some reduction in reliability.
Recommendation:
The guideline recommends the Mean Across Available Items method for handling item-level missing data, as it typically offers greater expected statistical power than listwise deletion cutoffs, even when only one item has been answered. Both methods may suffer from bias under MAR and MNAR mechanisms.
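The two item-level approaches differ only in how many answered items they require before a scale score is computed. A minimal sketch, in which the function name `scale_score`, the `min_items` parameter, and the use of `None` for a skipped item are assumptions:

```python
def scale_score(items, min_items=1):
    """Mean across available items.

    min_items=1 follows the "one item is enough" recommendation.
    Setting min_items to half the scale length mimics a listwise
    deletion cutoff, returning None (a missing construct) instead.
    """
    answered = [value for value in items if value is not None]
    if len(answered) < min_items:
        return None
    return sum(answered) / len(answered)
```

For a three-item scale answered as `[None, None, 3]`, `scale_score` returns 3.0 with the default setting, while a cutoff of `min_items=2` would discard the response and turn item-level missingness into construct-level missingness.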
Guideline 5: Person-Level Missingness: If the Response Rate Is Below 30%, Report Systematic Nonresponse Parameters and Consider Conducting Sensitivity Analyses
Researchers are encouraged to report response rates and systematic nonresponse parameters, and to conduct response rate sensitivity analyses, to assess the potential direction and magnitude of missing data bias.
• Report the overall response rate, calculated as the ratio of full respondents plus partial respondents to the total number of individuals contacted.
• Report systematic nonresponse parameters (SNPs) if possible, which capture differences between respondents and nonrespondents on variables of interest in the study.
• Conduct response rate sensitivity analyses by estimating the response rate–corrected correlations u
Decision tree for choosing missing data treatments.
• Aids in the selection of appropriate missing data techniques to address item-level missingness, construct-level missingness, and person-level missingness.
Thank You
Prakriti Sinha
