The methodology for handling missing data during development of predictive model
1. Xiao-Ou Ping (1), Ja-Der Liang (3), Yi-Ju Tseng (2), Pei-Ming Yang (3), Guan-Tarn Huang (3), Feipei Lai (1, 2)
(1) Department of Computer Science and Information Engineering, National Taiwan University
(2) Graduate Institute of Biomedical Electronic and Bioinformatics, National Taiwan University
(3) Department of Internal Medicine, National Taiwan University Hospital and National Taiwan University College of Medicine
2. Introduction
In medical research, missing data occur frequently. According to a review of 100 articles from seven cancer journals, 81 of the articles showed evidence of missing data. Missing data also arise in our study on developing a liver cancer recurrence predictive model, so methods for dealing with missing data are necessary.
The aims of this study are as follows:
- To evaluate the stability and accuracy of imputation methods
- To present the impact of different missing-data handling methods on the results of the recurrence predictive model
- To determine whether clinical features with missing values still have the potential to yield a more accurate predictive model
3. Materials and methods
Predictive models were developed from incomplete clinical data using three missing-data handling methods: complete case (CC) analysis, complete variable (CV) analysis, and imputation methods (IMs).
A total of 92 liver cancer patients were included in the study, with 7.6% missing values. The analyzed features include age, gender, laboratory tests, tumor size, tumor number, and cancer staging, among others.
In the simulation experiment:
- Observed values are randomly masked as missing values, and the IMs are employed to impute these missing entries.
- After data imputation, the masked true values and the imputed values estimated by the IMs can be compared.
- The normalized root mean squared errors (NRMSEs) are used to evaluate the imputation accuracy of the IMs.
- The sum of the first quartile, the median, and the third quartile of the NRMSEs is taken as the IM selection criterion for comparing the stability and accuracy of imputation.
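The masking experiment and the selection criterion can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say how the NRMSE is normalized, so the sketch divides the RMSE by the standard deviation of the masked true values, and a simple mean imputer (`mean_impute`, a hypothetical stand-in) takes the place of the IMs compared in the study. The function names, the mask fraction, and the synthetic data are likewise made up for the example.

```python
import math
import random
import statistics


def nrmse(true_values, imputed_values):
    """RMSE divided by the standard deviation of the true values.

    The normalization is an assumption; the slides do not define it.
    """
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(statistics.fmean(squared_errors)) / statistics.pstdev(true_values)


def mean_impute(column):
    """Stand-in imputer: replace None with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


def masking_trial(column, mask_fraction, rng):
    """Mask a random subset of observed entries, impute them, return the NRMSE."""
    observed_idx = [i for i, v in enumerate(column) if v is not None]
    n_mask = max(2, round(mask_fraction * len(observed_idx)))
    masked_idx = rng.sample(observed_idx, n_mask)
    masked = list(column)
    for i in masked_idx:
        masked[i] = None
    imputed = mean_impute(masked)
    return nrmse([column[i] for i in masked_idx], [imputed[i] for i in masked_idx])


def selection_score(nrmses):
    """Sum of the first quartile, median, and third quartile; lower is better."""
    q1, median, q3 = statistics.quantiles(nrmses, n=4)
    return q1 + median + q3


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic feature column standing in for one clinical variable.
    column = [rng.gauss(50.0, 10.0) for _ in range(92)]
    trials = [masking_trial(column, 0.1, rng) for _ in range(20)]
    print("selection score:", selection_score(trials))
```

Repeating the trial many times for each imputer and comparing the resulting scores rewards methods whose errors are both low (median) and tightly spread (quartiles).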
4.
5. Complete variable (CV) analysis: dropping the variables with missing data and analyzing only the variables without missing data
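The difference between complete case (CC) and complete variable (CV) analysis is easiest to see on a toy table. The sketch below assumes records are stored as dictionaries with `None` marking a missing value; the field names and values are hypothetical and are not taken from the study data.

```python
def complete_case(rows):
    """CC analysis: keep only the records (patients) with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def complete_variable(rows):
    """CV analysis: keep only the variables observed for every record."""
    keep = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in keep} for r in rows]


# Hypothetical toy records; None marks a missing laboratory value.
patients = [
    {"age": 61, "tumor_size": 2.5, "lab_test": 12.0},
    {"age": 55, "tumor_size": 3.1, "lab_test": None},
    {"age": 70, "tumor_size": 1.8, "lab_test": 8.4},
]

cc = complete_case(patients)      # drops the second patient
cv = complete_variable(patients)  # drops the "lab_test" column
```

CC shrinks the number of patients while CV shrinks the number of features, which is why the two approaches can lead to different predictive models from the same data.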
6. Imputation method (IM): estimating the missing values (MVs) based on different methods
Eight IMs are compared in this study.
Six single-imputation methods: 1. “SVDImpute” 2. “LLSImpute” 3. “PPCA” 4. “BPCA” 5. “NLPCA” 6. “Nipals PCA”
Two multiple-imputation methods: 1. “MICE” 2. “mi”
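To illustrate the distinction between single and multiple imputation, here is a minimal sketch. It does not implement any of the eight methods listed above: the single imputer fills with the column mean, and the multiple imputer draws each missing value at random from the observed values, both chosen only because they fit in a few lines. All function names are hypothetical.

```python
import random
import statistics


def single_impute(column):
    """Single imputation: one completed column (mean fill as a stand-in)."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


def multiple_impute(column, m, rng):
    """Multiple imputation: m completed columns, each with random draws
    from the observed values (a stand-in, not MICE or mi)."""
    observed = [v for v in column if v is not None]
    return [
        [rng.choice(observed) if v is None else v for v in column]
        for _ in range(m)
    ]


def pooled_mean(completed_columns):
    """Pool a point estimate by averaging the per-dataset estimates."""
    return statistics.fmean([statistics.fmean(c) for c in completed_columns])


column = [1.0, None, 3.0, None, 5.0]
single = single_impute(column)
multiple = multiple_impute(column, m=5, rng=random.Random(0))
estimate = pooled_mean(multiple)
```

Single imputation produces one filled-in dataset, whereas multiple imputation produces several plausible datasets whose analyses are combined afterwards, so the spread across them reflects the uncertainty of the imputed values.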
7. Results and Conclusion
The sensitivity and specificity of CC are 100% and 83%, respectively.
The best sensitivity and specificity of CV are 86% and 79%, respectively.
In this study, we designed the IM selection criterion score for selecting the more appropriate IMs.
The best three IMs achieve better predictive accuracy than CV (i.e., the same sensitivity, 86%, and a better specificity, 88%).
The best IM, “BPCA”, needs just four features, including the feature with missing values.
IMs for data with missing values still show a comparable benefit for the recurrence predictive model.
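For reference, sensitivity and specificity are computed from the confusion counts as TP / (TP + FN) and TN / (TN + FP). The sketch below uses made-up labels (1 = recurrence, 0 = no recurrence) and does not reproduce the study's predictions; the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive.

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for seven patients.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

With a cohort of this size, a single misclassified patient shifts either measure by several percentage points, which is worth keeping in mind when comparing the handling methods.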