Parkinson disease classification recorded v2.0

Parkinson Disease
Classification
ANLY 530 Late Summer 2019
Nikhil Shrivastava

Agenda
• Executive Summary
• Project Overview
• Process Involved
• Project Summary
• Data Summary Statistics
• Data Visualizations
• Feature Importance
• ML Methodologies
• Conclusion & Future Efforts
• Lessons Learned

Executive Summary
OBJECTIVE
• Design a machine learning model that predicts whether a person has Parkinson's disease.
APPROACH
• After data cleaning, apply several machine learning classification techniques (Decision Tree, Naïve Bayes, Random Forest, etc.) to develop a classification model.
RECOMMENDATION
• Identify the important features that help detect Parkinson's disease so that it can be caught and treated before it gets worse.

Project Overview & Background
• Issue: Parkinson’s disease is the second most prevalent neurodegenerative disorder after Alzheimer’s, affecting more than 10 million people worldwide. Symptoms include frozen facial features, slowness of movement, and tremor.
• Current Accuracy: A study from the National Institute of Neurological Disorders finds that early diagnosis is only 53% accurate.
• Goal: Design a machine learning model that predicts whether a person has Parkinson’s disease from speech attributes such as relative jitter, shimmer, and Mel frequency cepstral coefficients (MFCCs).

Process Involved
Data Cleaning
What
• Check for missing values.
• Check for imbalance, if any, in the target variable.
How
• Use the isnull method to detect missing values and the fillna method to impute them.
• Check the distribution of the classes. If they are imbalanced, use synthetic oversampling (SMOTE, ADASYN) or undersampling to create a balanced training dataset.
Feature Engineering
What
• Check the categorical features, if any, for label encoding.
• Check the correlation between the independent features.
How
• Use the LabelEncoder class to label-encode categorical columns.
• Use the corr method to check for Pearson coefficients greater than 0.8 or less than -0.8.
Data Visualization
What
• Visualize the target variable and its relationship with certain independent attributes.
• Visualize the important features after the feature importance test.
How
• Use the seaborn and matplotlib libraries to understand the different relationships.
• Use distplot, histograms, and scatterplots to understand the important features and their relationship with the target variable.
Model Building
What
• Start with a basic tree-based model, then move to ensemble techniques to predict the target variable.
• Run cross-validation and check feature importance, accuracy score, confusion matrix, etc.
How
• Use different models from the sklearn library, such as Decision Tree, Naïve Bayes, and Random Forest.
• Check the accuracy score, bias-variance trade-off, and confusion matrix to validate that the model is properly fitted.
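The data-cleaning checks described in this process can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the small DataFrame and its values are made up, and median imputation is an assumed choice (the slides only say the gaps are imputed "accordingly").

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the voice-feature table (illustrative values only).
df = pd.DataFrame({
    "Status":     [0, 0, 1, 1, 1, 0],
    "Jitter_rel": [0.25, np.nan, 0.61, 0.58, 0.49, 0.30],
    "HNR15":      [70.1, 68.4, 52.3, np.nan, 55.0, 71.9],
})

# Detect: count the missing values in each column.
missing_per_column = df.isnull().sum()

# Impute: fill each gap with its column's median (an assumed strategy).
df_clean = df.fillna(df.median(numeric_only=True))

# Class balance: fraction of rows belonging to each Status value.
class_share = df["Status"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

If the class shares were far from equal, the resampling step mentioned in the process (SMOTE or ADASYN) would be applied to the training split only, so that synthetic rows never leak into the test set.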

Project Summary
Overall Results – Before Hyper-parameter Tuning & CV
Decision Tree
Accuracy: 100%
Confusion Matrix:
Predicted    0    1
Actual 0    39    0
Actual 1     0   33
Random Forest
Accuracy: 94.4%
Confusion Matrix (these counts give 70 of 72 correct, or 97.2%, which is the accuracy reported after tuning):
Predicted    0    1
Actual 0    39    0
Actual 1     2   31
Gaussian Naïve Bayes
Accuracy: 87.5%
Confusion Matrix:
Predicted    0    1
Actual 0    38    1
Actual 1     8   25
Overall Results – After Hyper-parameter Tuning & CV
Decision Tree
Accuracy: 100%
Overfitting.
Random Forest
Accuracy: 97.2%
CV Technique: Grid Search
Bootstrap: True
Impurity Criterion: Gini
Gaussian Naïve Bayes
Accuracy: 87.5%
Lower than Random Forest
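The accuracy figures can be cross-checked against the confusion-matrix counts listed in this section. The sketch below recomputes accuracy, recall, and specificity from the three sets of counts. It assumes PD (Status = 1) is the positive class; the helper name `metrics` and the labels "matrix 1" to "matrix 3" are only for illustration.

```python
def metrics(tn, fp, fn, tp):
    """Return accuracy, recall and specificity for a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # share of PD recordings caught
        "specificity": tn / (tn + fp),   # share of healthy recordings cleared
    }

# Counts in (TN, FP, FN, TP) order, read row by row from the slide,
# assuming class 1 (PD) is the positive class.
matrices = {
    "matrix 1": (39, 0, 0, 33),
    "matrix 2": (38, 1, 8, 25),
    "matrix 3": (39, 0, 2, 31),
}

results = {name: metrics(*cells) for name, cells in matrices.items()}
for name, m in results.items():
    print(f"{name}: accuracy={m['accuracy']:.3f} "
          f"recall={m['recall']:.3f} specificity={m['specificity']:.3f}")
```

Reporting recall alongside accuracy matters here because a missed PD case (a false negative) and a false alarm (a false positive) carry different costs in a clinical setting.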

Understanding the Data
1. ID: Subject's identifier.
2. Recording: Number of the recording.
3. Status: 0=Healthy; 1=PD
4. Gender: 0=Man; 1=Woman
5. Pitch local perturbation measures: relative jitter (Jitter_rel), absolute jitter (Jitter_abs),
relative average perturbation (Jitter_RAP), and pitch perturbation quotient
(Jitter_PPQ).
6. Amplitude perturbation measures: local shimmer (Shim_loc), shimmer in dB
(Shim_dB), 3-point amplitude perturbation quotient (Shim_APQ3), 5-point amplitude
perturbation quotient (Shim_APQ5), and 11-point amplitude perturbation quotient
(Shim_APQ11).
7. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency bands 0-500 Hz (HNR05), 0-1500 Hz (HNR15), 0-2500 Hz (HNR25), 0-3500 Hz (HNR35), and 0-3800 Hz (HNR38).
8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (MFCC0,
MFCC1,..., MFCC12) and their derivatives (Delta0, Delta1,..., Delta12).
9. Recurrence period density entropy (RPDE).
10. Detrended fluctuation analysis (DFA).
11. Pitch period entropy (PPE).
12. Glottal-to-noise excitation ratio (GNE).

Data & Summary Statistics
• We have used Parkinson’s disease data for this project. It has 48 attributes for determining whether a patient has the disease.
• Number of records: 240
• Number of attributes: 48
• Key attributes: Delta3, MFCC4, HNR15
Key Attribute    Mean     Std      Comments
Delta3           1.34     0.19     Derivative of the order-3 MFCC; one of the key attributes
MFCC4            1.355    0.21     MFCC of order 4
HNR15            63.67    15.62    Harmonic-to-noise ratio in the 0-1500 Hz band

Data Visualizations
Status vs HNR15
People who have Parkinson's disease have a relatively low harmonic-to-noise ratio in the 0-1500 Hz frequency band.
Status vs Shim_loc
People who have Parkinson's disease have relatively high local shimmer.

Data Visualizations - Multicollinearity
Checking collinearity of different features. Most of the Mel frequency cepstral coefficients (MFCCs) are highly correlated with the harmonic-to-noise ratio measures. We removed features whose Pearson coefficient with another feature is higher than 0.8. No features show a very high negative correlation.
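A correlation filter of the kind described here could look like the sketch below. It is an assumed implementation, not the project's code: the function name `drop_correlated`, the column names f1 to f3, and the synthetic data are all hypothetical. For each pair above the threshold it drops the later column, which is one common convention; the slides do not say which member of a pair was kept.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = features.corr(method="pearson")
    # Keep the strict upper triangle so each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # abs() covers both r > 0.8 and r < -0.8.
    to_drop = [col for col in upper.columns if (upper[col].abs() > threshold).any()]
    return features.drop(columns=to_drop)

# Synthetic demo: f2 is a noisy copy of f1, f3 is independent.
rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
demo = pd.DataFrame({
    "f1": f1,
    "f2": f1 + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})
reduced = drop_correlated(demo)
print(list(reduced.columns))
```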

Outlier Detection
• Removed outliers using z-scores with a threshold of 3 standard deviations.
• Three observations had multiple columns with z-scores greater than 5, so I removed those.
• After outlier detection and removal, 237 records remained.
• Rerunning the Random Forest model after removing the outliers did not improve precision or accuracy. The number of false positives increased, which is undesirable in a healthcare setting.
Before outlier removal:
Accuracy: 94.4%
False Positives: 0
After outlier removal:
Accuracy: 93.5%
False Positives: 1
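The z-score filter can be sketched with NumPy as below. This is a minimal version under stated assumptions: the function name `remove_outliers` is hypothetical, the data is synthetic, and a row is dropped if any single column exceeds the threshold, which may be stricter than the rule actually used in the project.

```python
import numpy as np

def remove_outliers(X, threshold=3.0):
    """Drop rows in which any feature lies more than `threshold` std devs from its mean."""
    X = np.asarray(X, dtype=float)
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))  # column-wise z-scores
    keep = (z <= threshold).all(axis=1)               # True for rows with no extreme value
    return X[keep], keep

# Synthetic demo: 100 rows of standard-normal features with one planted outlier.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
X[0, 2] = 25.0
filtered, keep = remove_outliers(X)
print(f"kept {filtered.shape[0]} of {X.shape[0]} rows")
```

Because the mean and standard deviation are themselves pulled by extreme values, a robust alternative (median and interquartile range) may be worth comparing on a dataset this small.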

Feature Importance
Based on sklearn's feature_importances_ attribute, the top 5 features were:
• Delta3
• MFCC3
• MFCC9
• MFCC8
• HNR05
The Mel frequency cepstral coefficients of order 3, 8, and 9 are important in determining whether a person has Parkinson's disease or is healthy.
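A ranking like the one above can be obtained from a fitted tree ensemble. The sketch below uses a Random Forest on synthetic data; the slides do not say which estimator the importances came from, so the choice of model, the generated dataset, and the names feature_0 to feature_9 are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the voice features: 240 rows, 10 columns, 3 informative.
X, y = make_classification(n_samples=240, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per column; scores sum to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
top5 = ranked[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can be diluted across correlated features, so it is usually safer to compute them after the multicollinearity filter has been applied.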

Machine Learning Methodology
Decision Tree
• Why: A decision tree classifier is a single tree-based model that splits on feature values to classify different class instances.
• Accuracy: Decision Tree accuracy was 100%. Clearly it was overfitted.
• Conclusion: Decision Tree was clearly overfitted, so we tried Naïve Bayes, which was not overfitted and gave 87.5%.
• Comments: Pre-pruning and post-pruning could reduce overfitting and produce a model that generalizes better.
Random Forest
• Why: Random Forest uses a bagging approach with multiple trees, which improves the model and gives better results.
• Accuracy: Random Forest trains many trees on bootstrapped samples with random feature subsets, and its accuracy was 94% without cross-validation and 97% after cross-validation.
• Conclusion: Random Forest gave 97.2%, with 31 true positives and 39 true negatives (taking PD, Status = 1, as the positive class).
• Comments: Random Forest was the best of the three ML algorithms we ran, based on accuracy, TPs, and FPs.
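The pruning idea mentioned for the Decision Tree can be illustrated with scikit-learn. The sketch below compares an unpruned tree with a pre-pruned one (depth and leaf-size limits) and a post-pruned one (cost-complexity pruning via `ccp_alpha`). The synthetic data, the 30% test split, and the specific parameter values are assumptions chosen for illustration, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; 30% of 240 rows gives a 72-row test set.
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Unpruned tree: grows until every training sample is classified correctly.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Pre-pruning: stop growth early with depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X_train, y_train)
# Post-pruning: grow fully, then cut back branches by cost-complexity.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02,
                                     random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()} "
          f"train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")
```

A gap between perfect training accuracy and lower test accuracy on the unpruned tree is the usual sign of overfitting. A 100% score on the test set itself, as reported here, would more likely point to something else worth checking, such as recordings from the same subject appearing in both splits.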

Conclusion and Future Efforts
• Random Forest was the best of the three ML models.
• Grid search with cross-validation improved accuracy by selecting the best hyper-parameters.
• We could further optimize the Decision Tree using pre-pruning or post-pruning techniques to reduce overfitting.
• We could try other ML techniques such as XGBoost and LightGBM, which could give more accurate results because they build trees sequentially, each one correcting the errors of the previous trees, with a learning rate controlling each tree's contribution.
• In further iterations of the model, we could check skewness and examine outliers more closely to further filter the dataset and get more accurate results.
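The grid-search step could be set up along the following lines. This is a sketch under assumptions: the parameter grid, the 3-fold setting, and the synthetic data are illustrative and not taken from the project, which reports only that the selected model used bootstrap sampling and the Gini criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with the same row count as the project dataset.
X, y = make_classification(n_samples=240, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hypothetical grid; every combination is scored by 3-fold cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # refits the best combination on all training rows

print("best parameters:", search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")
print(f"held-out accuracy: {search.score(X_test, y_test):.3f}")
```

Keeping the test split out of the search means the held-out accuracy stays an honest estimate; tuning and evaluating on the same rows would overstate the improvement.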

Lessons Learned
• Labelled Data:
Having enough samples is key to any ML algorithm, but the healthcare industry in particular faces challenges with patient population and patient data. Even when data is available, it is often protected by laws and regulations. In this project the models were built with around 240 records, which is too few for making predictions and deploying them for practical purposes. To address the patient data challenge, the National Institutes of Health is running multiple programs through CTSA grants to build patient population databases and to support research through informatics and AI.
• Domain/Functional Knowledge:
Any data science project depends on the domain knowledge of the data scientist. Even with millions of records and thousands of attributes, domain knowledge is critical because it helps in establishing hypotheses, for example that high blood pressure drives the risk of heart attack. Lack of domain knowledge was another challenge in this project, which I realized after about 30% of the work was done.
