The methodology for handling missing data during development of predictive model


Published in: Health & Medicine, Technology

  1. 1. The methodology for handling missing data during development of predictive model<br />Xiao-Ou Ping1, Ja-Der Liang3, Yi-Ju Tseng2, Pei-Ming Yang3, <br />Guan-Tarn Huang3, Feipei Lai1, 2<br />1. Department of Computer Science and Information Engineering, <br />National Taiwan University <br />2. Graduate Institute of Biomedical Electronic and Bioinformatics, <br />National Taiwan University <br />3. Department of Internal Medicine, <br />National Taiwan University Hospital and <br />National Taiwan University College of Medicine<br />
  2. 2. Introduction<br />In medical research, missing data occur frequently<br />In a review of 100 articles from seven cancer journals, 81 showed evidence of missing data<br />Our study, which develops a predictive model for liver cancer recurrence, also contains missing data<br />A method for dealing with missing data is therefore necessary<br />The aims of this study are as follows<br />To evaluate the stability and accuracy of imputation methods<br />To show how different missing data handling methods affect the results of the recurrence predictive model<br />To determine whether clinical features with missing values can still help build a more accurate predictive model<br />
  3. 3. Materials and methods<br />Predictive models are developed from incomplete clinical data using three missing data handling methods<br />Complete case (CC) analysis, complete variable (CV) analysis, and imputation method (IM)<br />92 liver cancer patients were included in the study<br />The data contain 7.6% missing values<br />Analyzed features include age, gender, laboratory tests, tumor size, tumor number, and cancer staging, etc.<br />In the simulation experiment<br />Observed values are randomly masked as missing values, and the IMs are used to impute these masked entries<br />After imputation, the masked true values and the imputed values estimated by the IMs can be compared<br />The normalized root mean squared error (NRMSE) is used to evaluate the imputation accuracy of the IMs<br />The sum of the first quartile, the median, and the third quartile of the NRMSEs is used as the IM selection criterion for comparing the stability and accuracy of imputation<br />
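The masking experiment and the selection criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how the NRMSE is normalized, so the standard deviation of the masked true values is assumed here; `mean_impute` is a simple stand-in and not one of the eight IMs compared in the study; the data are synthetic; and all function names are hypothetical.

```python
import math
import random
import statistics


def mean_impute(column):
    """Stand-in imputer (NOT one of the study's IMs): fill None with the observed mean."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


def nrmse(true_values, imputed_values):
    """RMSE on the masked entries divided by the std. dev. of the true values.

    The normalization is an assumption; the slides do not define it.
    """
    squared = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(statistics.fmean(squared)) / statistics.pstdev(true_values)


def simulate_once(column, mask_fraction, rng, impute):
    """Mask a random subset of observed entries, impute them, return the NRMSE."""
    n_mask = max(2, round(mask_fraction * len(column)))
    masked_idx = rng.sample(range(len(column)), n_mask)
    masked = list(column)
    for i in masked_idx:
        masked[i] = None
    imputed = impute(masked)
    return nrmse([column[i] for i in masked_idx],
                 [imputed[i] for i in masked_idx])


def selection_score(nrmses):
    """Sum of first quartile, median, and third quartile (lower is better)."""
    q1, median, q3 = statistics.quantiles(nrmses, n=4)
    return q1 + median + q3


# Synthetic feature column; 92 rows and a 7.6% masking rate mirror the slides.
rng = random.Random(0)
column = [rng.gauss(50, 10) for _ in range(92)]
scores = [simulate_once(column, 0.076, rng, mean_impute) for _ in range(100)]
print(selection_score(scores))
```

Summing the three quartiles rewards a method whose errors are both low (small median) and tightly spread (small first and third quartiles), which matches the stated goal of judging accuracy and stability together.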
  4. 4. Evaluation of missing data handling methods<br /><ul><li>Complete case (CC) analysis: analyzing only the data of patients without missing data
  5. 5. Complete variable (CV) analysis: dropping the variables with missing data and analyzing only the variables without missing data
  6. 6. Imputation method (IM): estimating the missing values (MVs) with different methods</li></ul>8 IMs are compared in this study:<br />6 single imputation methods<br />1. “SVDImpute”<br />2. “LLSImpute”<br />3. “PPCA”<br />4. “BPCA”<br />5. “NLPCA”<br />6. “Nipals PCA”<br />2 multiple imputation methods<br />1. “MICE”<br />2. “mi”<br />
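The difference between CC and CV analysis is easiest to see on a tiny table. The sketch below assumes each patient is a dictionary with `None` marking a missing value; the function names and the patient values are hypothetical and only illustrate the two definitions given above.

```python
def complete_case(rows):
    """CC analysis: keep only patients (rows) that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def complete_variable(rows):
    """CV analysis: keep only variables (columns) observed for every patient."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]


# Hypothetical patients; None marks a missing value.
patients = [
    {"age": 61, "tumor_size": 2.5, "tumor_number": 1},
    {"age": 54, "tumor_size": None, "tumor_number": 2},
    {"age": 70, "tumor_size": 4.1, "tumor_number": 1},
]

print(complete_case(patients))      # drops the second patient
print(complete_variable(patients))  # drops the tumor_size variable
```

CC analysis loses patients and CV analysis loses features, whereas an IM keeps both at the cost of estimating the missing entries.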
  7. 7. Results and Conclusion<br />The sensitivity and specificity of CC are 100% and 83%, respectively<br />The best sensitivity and specificity of CV are 86% and 79%, respectively<br />In this study, we designed an Imputation Method (IM) selection criterion score for selecting the more appropriate IMs<br />The best three IMs achieve better predictive accuracy than CV (i.e. the same sensitivity, 86%, and a better specificity, 88%)<br />The best IM, “BPCA”, can use just four features, including the feature with missing values<br />IMs applied to data with missing values still show a comparable benefit for the recurrence predictive model<br />
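For reference, the sensitivity and specificity figures reported above follow the standard definitions: sensitivity is the share of true recurrences that the model predicts, and specificity is the share of non-recurrences it correctly rules out. A minimal sketch, assuming labels are coded 1 for recurrence and 0 for no recurrence; the function name and the example labels are hypothetical and are not the study's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 3 recurrences, 4 non-recurrences.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```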