The methodology for handling missing data during development of predictive model
 

    Presentation Transcript

    • The methodology for handling missing data during development of predictive model
      Xiao-Ou Ping1, Ja-Der Liang3, Yi-Ju Tseng2, Pei-Ming Yang3,
      Guan-Tarn Huang3, Feipei Lai1, 2
      1. Department of Computer Science and Information Engineering,
      National Taiwan University
      2. Graduate Institute of Biomedical Electronic and Bioinformatics,
      National Taiwan University
      3. Department of Internal Medicine,
      National Taiwan University Hospital and
      National Taiwan University College of Medicine
    • Introduction
      In medical research, missing data occur frequently
      According to a review of 100 articles from seven cancer journals,
      81 of the articles showed evidence of missing data
      Our study, which develops a predictive model for liver cancer recurrence, also has missing data
      Methods for dealing with missing data are therefore necessary
      The aims of this study are as follows:
      To evaluate the performance of imputation methods in terms of stability and accuracy
      To present the impact of different missing data handling methods on the results of the recurrence predictive model
      To determine whether clinical features with missing values still have the potential to yield a more accurate predictive model
    • Materials and methods
      Predictive models were developed from incomplete clinical data using three missing data handling methods:
      complete case (CC) analysis, complete variable (CV) analysis, and imputation methods (IMs)
      A total of 92 liver cancer patients were included in the study
      The data set has 7.6% missing values
      Analyzed features include age, gender, laboratory tests, tumor size, tumor number, and cancer staging, among others
      In the simulation experiment:
      Observed values are randomly masked as missing, and the IMs are used to impute the masked entries
      After imputation, the masked true values and the values estimated by the IMs can be compared
      The normalized root mean squared error (NRMSE) is used to evaluate the imputation accuracy of each IM
      The sum of the first quartile, the median, and the third quartile is used as the IM selection criterion for comparing imputation performance in terms of stability and accuracy
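
The masking experiment described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the NRMSE is taken here as the root mean squared error divided by the standard deviation of the masked true values (the slides do not give the formula), the quartiles are assumed to be taken over the NRMSEs from repeated masking runs, the data are synthetic, and `mean_impute` is a hypothetical stand-in rather than one of the eight IMs compared in the study.

```python
import math
import random
import statistics

def nrmse(true_vals, imputed_vals):
    """RMSE divided by the standard deviation of the true values (assumed definition)."""
    mse = statistics.fmean((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    return math.sqrt(mse) / statistics.pstdev(true_vals)

def mean_impute(column):
    """Stand-in imputer: replace None with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]

def selection_score(nrmses):
    """Sum of the first quartile, median, and third quartile (lower is better)."""
    q1, median, q3 = statistics.quantiles(nrmses, n=4)
    return q1 + median + q3

def simulate(column, impute, mask_fraction=0.1, repeats=50, seed=0):
    """Repeatedly mask observed values, impute them, and collect one NRMSE per run."""
    rng = random.Random(seed)
    n_mask = max(2, round(mask_fraction * len(column)))
    scores = []
    for _ in range(repeats):
        masked_idx = rng.sample(range(len(column)), n_mask)
        masked = list(column)
        for i in masked_idx:
            masked[i] = None  # hide the true value from the imputer
        imputed = impute(masked)
        scores.append(nrmse([column[i] for i in masked_idx],
                            [imputed[i] for i in masked_idx]))
    return scores

# Synthetic, fully observed column with 92 entries (one per hypothetical patient).
data_rng = random.Random(1)
column = [data_rng.gauss(50.0, 10.0) for _ in range(92)]
scores = simulate(column, mean_impute)
print(round(selection_score(scores), 3))
```

Summing the three quartile values rewards an imputer whose errors are both low (small median) and tightly spread (small first and third quartiles), which is one way to read "stability and accuracy" as a single number.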
    • Evaluation of missing data handling methods
      • Complete case (CC) analysis: analyzing only the data of patients without missing data
      • Complete variable (CV) analysis: dropping the variables with missing data and analyzing only the variables without missing data
      • Imputation method (IM): estimating the missing values (MVs) based on different methods
      Eight IMs are compared in this study:
      6 single IMs
      1. “SVDImpute”
      2. “LLSImpute”
      3. “PPCA”
      4. “BPCA”
      5. “NLPCA”
      6. “Nipals PCA”
      2 multiple IMs
      1. “MICE”
      2. “mi”
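
To make the difference between the three handling strategies concrete, here is a minimal sketch on a toy table. The records, the feature names (`age`, `tumor_size`, `lab_test`), and the use of column-mean imputation are all assumptions for illustration; mean imputation is not one of the eight IMs listed above and only stands in for the general idea of estimating missing values.

```python
# Toy records; None marks a missing value. Values and feature names are hypothetical.
patients = [
    {"age": 61, "tumor_size": 3.2, "lab_test": 12.0},
    {"age": 55, "tumor_size": None, "lab_test": 8.5},
    {"age": 70, "tumor_size": 4.1, "lab_test": 20.3},
    {"age": 48, "tumor_size": 2.7, "lab_test": 5.1},
]

def complete_case(rows):
    """CC analysis: keep only patients with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def complete_variable(rows):
    """CV analysis: keep only features that are observed for every patient."""
    keep = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]

def mean_impute(rows):
    """Stand-in IM: fill each missing value with the mean of its feature."""
    filled = [dict(r) for r in rows]
    for key in rows[0]:
        observed = [r[key] for r in rows if r[key] is not None]
        mean = sum(observed) / len(observed)
        for r in filled:
            if r[key] is None:
                r[key] = mean
    return filled

print(len(complete_case(patients)))          # patients kept by CC
print(list(complete_variable(patients)[0]))  # features kept by CV
print(mean_impute(patients)[1])              # second patient after imputation
```

CC loses a patient and CV loses a feature, while imputation keeps both at the cost of introducing estimated values.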
    • Results and Conclusion
      The sensitivity and specificity of CC are 100% and 83%
      The best sensitivity and specificity of CV are 86% and 79%
      In this study, we designed an IM selection criterion score for choosing the more appropriate IMs
      The best three IMs achieve better predictive accuracy than CV (i.e., the same sensitivity, 86%, and a better specificity, 88%)
      The best IM, “BPCA”, needs just four features,
      which include a feature with missing values
      IMs applied to data with missing values still show a comparable benefit for the recurrence predictive model
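
For readers unfamiliar with the two metrics reported above, the sketch below shows how sensitivity and specificity are computed from binary labels. The label vectors are toy values chosen for illustration and are not the study's data; the coding 1 = recurrence, 0 = no recurrence is an assumption.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # recurrences correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # recurrences missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-recurrences correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels only.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Sensitivity is the share of actual recurrences the model catches, and specificity is the share of non-recurrences it correctly clears, so the two figures quoted for CC, CV, and the IMs should be read as a pair.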