This document presents a machine learning approach for early detection of chronic kidney disease (CKD) using an advanced feature selection method. The proposed method uses Grey Wolf Optimization (GWO) for feature selection and Random Forest for classification. It selects the top 5 features that result in 98.75% accuracy for CKD detection. SHAP and partial dependence plots are used to interpret and explain the model's predictions. The results show improved performance over other state-of-the-art methods while using fewer features and having lower time complexity.
Machine Learning Predicts Early CKD Using Advanced Feature Selection
1. Prediction of Early-Stage Chronic Kidney Disease Using Machine Learning with Advanced Feature Selection
2. Motivation / Background
• Chronic Kidney Disease (CKD) is one of the deadliest
non-communicable diseases globally.
• In 2017, CKD caused 1.2 million deaths worldwide, an
increase of 41.5% since 1990.
• The global burden of CKD is increasing, and CKD is projected
to become the 5th most common cause of years of life
lost globally by 2040.
• In CKD, the kidneys are unable to perform essential functions in
the body, which leads to other critical conditions such as
heart disease, high blood pressure, and diabetes.
• Therefore, early-stage detection of CKD is essential for
containing further progression of the disease, which in
turn significantly reduces the mortality rate and treatment
cost.
3. Literature Review
In one study (Elhoseny et al., 2019), researchers proposed density-based feature selection with
Ant Colony Optimization (D-ACO) for CKD detection, which achieved an accuracy of 95% and a
sensitivity of 96%.
In another study (Jerlin Rubini and Perumal, 2020), researchers proposed a bio-inspired Fruit Fly
Optimization (FFO) based feature selection technique with a Multi-Kernel Support Vector Machine
(MKSVM) classifier, which achieved an accuracy of 98.5% and a sensitivity of 97.6%.
In a third study, researchers employed a neuro-fuzzy algorithm for CKD detection, which
achieved an accuracy of 97%.
In a further study (Khamparia et al., 2019), researchers used a deep-learning-based stacked
auto-encoder feature selection approach, which achieved 100% accuracy and 100% sensitivity.
4. Problem Statement
Previous studies have employed only a limited range of feature selection
techniques, such as filter and wrapper methods.
The existing literature handles missing values using only mean- and mode-based
imputation.
No previous study has employed a model interpretation technique to explain its
black-box model.
Apart from D-ACO, no technique has employed advanced meta-heuristic
feature selection to find the optimal feature set for early-stage CKD prediction.
7. Methodology
• Grey Wolf Optimization (GWO) is a meta-heuristic algorithm introduced by
S. Mirjalili, S.M. Mirjalili, and A. Lewis in 2014.
• It is based on the leadership hierarchy and hunting pattern of grey
wolves in nature.
• In the leadership hierarchy, α is the leader and decision maker.
• β and δ assist α in decision making.
• The rest of the wolves are ω, which serve as scapegoats.
• The main steps of grey wolf hunting are:
1. Searching for prey
2. Tracking, chasing, and approaching the prey
3. Pursuing, encircling, and harassing the prey until it stops moving
4. Attacking the prey
8. Methodology
Flow chart of the Grey Wolf Optimization algorithm:
1. Start
2. Initialize parameters (number of grey wolves, number of iterations, etc.)
3. Create the initial population of grey wolves with different social
hierarchy (α, β, δ and ω)
4. Estimate the position of the prey using α, β, and δ
5. Evaluate the position of the grey wolves by the position of the prey
6. Grade the grey wolves
7. Stopping criteria satisfied? Yes: End. No: continue iterating.
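The loop in the flow chart can be sketched in a few lines of Python. This is a minimal illustration of the standard continuous GWO update from Mirjalili et al. (2014), not the MV-GWO variant used in this work, whose details the slides do not give. The sphere objective, the function names, and all parameter values are assumptions chosen for the example.

```python
import random


def sphere(x):
    """Toy objective (assumed for illustration): minimum 0 at the origin."""
    return sum(v * v for v in x)


def grey_wolf_optimize(objective, dim, n_wolves=20, n_iters=100,
                       lower=-10.0, upper=10.0, seed=0):
    """Minimise `objective` with a basic Grey Wolf Optimization loop."""
    rng = random.Random(seed)
    # Create the initial population of wolves at random positions.
    wolves = [[rng.uniform(lower, upper) for _ in range(dim)]
              for _ in range(n_wolves)]
    best_pos, best_score = None, float("inf")

    for it in range(n_iters):
        # Grade the wolves: the three fittest act as alpha, beta and delta.
        ranked = sorted(wolves, key=objective)
        leaders = [list(w) for w in ranked[:3]]
        score = objective(leaders[0])
        if score < best_score:
            best_pos, best_score = list(leaders[0]), score

        # 'a' decreases linearly from 2 to 0, shifting the pack from
        # exploration (searching) to exploitation (attacking).
        a = 2.0 * (1.0 - it / n_iters)

        new_wolves = []
        for wolf in wolves:
            new_pos = []
            for d in range(dim):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - wolf[d])  # distance to this leader
                    estimate += leader[d] - A * D
                # The prey position is estimated as the mean of the three
                # leader-guided moves, clipped to the search bounds.
                new_pos.append(min(upper, max(lower, estimate / 3.0)))
            new_wolves.append(new_pos)
        wolves = new_wolves

    # Final grading after the last position update.
    final = min(wolves, key=objective)
    if objective(final) < best_score:
        best_pos, best_score = list(final), objective(final)
    return best_pos, best_score


best, score = grey_wolf_optimize(sphere, dim=3)
print("best position:", best)
print("best score:", score)
```

For feature selection, the objective would instead score a candidate feature subset, for example by classifier error; how MV-GWO encodes subsets is not stated in the slides.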
9. Methodology
Model Interpretation: SHapley Additive exPlanations (SHAP)
• SHAP was introduced by Lundberg and Lee (2017).
• The main objective of SHAP is to explain a model's prediction by showing the contribution of each feature
to that prediction.
• It builds on cooperative game theory, where the success of a team is attributed to the contribution of each
player in the game.
• It compares what the model would predict with and without a feature.
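The "with and without a feature" idea can be made concrete with an exact Shapley value computation on a tiny model. The sketch below enumerates every feature coalition, which is only feasible for a handful of features; the SHAP library uses faster approximations and model-specific algorithms instead. The toy model, the instance, and the all-zero baseline are hypothetical and are not taken from the CKD study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    A feature that is absent from a coalition is replaced by its
    baseline value (an assumed convention for "without the feature").
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    """Hypothetical score with one interaction term between x[0] and x[2]."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print("contributions:", phi)
print("sum of contributions:", sum(phi))
print("prediction minus baseline:", toy_model(instance) - toy_model(baseline))
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additive property that SHAP plots rely on.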
10. Methodology
Model Interpretation: Partial Dependence Plot (PDP)
• PDPs were proposed by Friedman (2001).
• The partial dependence plot is a model-agnostic tool that plots how the average predicted
value changes as a feature varies, averaging over the marginal distribution of the other features in the dataset.
• A PDP gives the overall picture of a feature's contribution to the prediction, i.e. how the
predicted value changes as the value of the feature changes.
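A partial dependence curve can be computed by hand as a rough sketch: for each grid value, overwrite the feature of interest in every row, predict, and average. The toy model and the three data rows below are hypothetical; in practice the fitted classifier and the real dataset would take their place.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over the dataset for each forced feature value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the data stays unchanged
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(x):
    """Hypothetical model: linear in x[0], quadratic in x[1]."""
    return 3.0 * x[0] + x[1] ** 2


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]
curve = partial_dependence(toy_model, rows, feature=0, grid=grid)
for value, avg in zip(grid, curve):
    print(value, avg)
```

Because the toy model is linear in the first feature, the curve rises by 3 for each unit step on the grid; the second feature only shifts the curve by its average effect.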
11. Results and Discussion
Comparison of averaged 10-fold cross-validation accuracy with and without feature selection

Model            Without Feature Selection    With Feature Selection
                 Accuracy (%)                 Accuracy (%)
Random Forest    97.18                        98.43
Adaboost         99.37                        99.37
XGBoost          96.87                        96.87
12. Results and Discussion
Comparison of results on test set with and without feature selection

Criterion                    ML Algorithm     Accuracy (%)   Precision (%)   Sensitivity (%)   Specificity (%)   F1-score (%)   MCC
Without Feature Selection    Random Forest    95.00          97.91           94.00             96.67             95.91          0.8959
Without Feature Selection    Adaboost         97.50          100.00          96.00             100.00            97.95          0.9486
Without Feature Selection    XGBoost          96.25          100.00          94.00             100.00            96.90          0.9244
With Feature Selection       Random Forest    98.75          98.03           100.00            96.67             99.00          0.9735
With Feature Selection       Adaboost         96.25          97.95           96.00             96.67             96.96          0.9208
With Feature Selection       XGBoost          96.25          94.33           100.00            90.00             97.08          0.9214
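All six reported metrics follow from the four confusion-matrix counts, as the sketch below shows. The slides do not report the confusion matrices, so the counts used here (50 true positives, 1 false positive, 0 false negatives, 29 true negatives on an 80-sample test set) are an assumption. They were chosen because they reproduce the Random Forest row with feature selection to within a difference in the last reported digit.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient, ranging from -1 to 1.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Assumed counts, not taken from the slides.
metrics = classification_metrics(tp=50, fp=1, fn=0, tn=29)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC is a correlation coefficient rather than a percentage, which is why its column holds values such as 0.9735 while the other columns are in percent.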
13. Results and Discussion
Comparison of results with recent state-of-the-art methods

Methodology                                                               # Features   Accuracy (%)   Sensitivity (%)   Specificity (%)   F1-score (%)
FFO (Jerlin Rubini and Perumal, 2020)                                     11           98.50          97.60             100.00            -
Improved SVM-Radial with Chi-Square (Harimoorthy and Thangavelu, 2020)    11           98.30          100.00            97.60             -
Random Forest with Chi-Square Feature Selection (Yashfi et al., 2020)     20           97.12          97.00             -                 97.00
Deep stacked auto encoder (Khamparia et al., 2019)                        10           100.00         100.00            100.00            100.00
D-ACO (Elhoseny et al., 2019)                                             14           95.00          96.00             93.33             96.00
Naïve Bayes with Best First Algorithm (Arulanthu and Perumal, 2019)       5            80.65          71.00             85.93             -
Proposed MV-GWO                                                           5            98.75          100.00            96.67             99.00
22. Conclusion and Future Work
The proposed MV-GWO method selected 5 critical features, i.e., packed cell volume,
diabetes mellitus, red blood cells, blood urea and pus cells, which resulted in higher
performance using the Random Forest model.
The proposed feature selection method with Random Forest showed promising results
comparable to state-of-the-art feature selection methods in the literature.
The results showed that the time complexity of the proposed method with Random Forest was
significantly lower than that of the other algorithms using all features.
In the study, we also used SHAP and PDP plots to analyse the effect of the top 5 critical
features and to explain how these features contribute to the prediction of CKD.
In future, we will extend our method to the detection of other chronic and critical diseases
such as cardiovascular disease, lung disease, liver disease, breast cancer, and cervical
cancer.