Parkinson Disease
Classification
ANLY 530 Late Summer 2019
Nikhil Shrivastava
Agenda
• Executive Summary
• Project Overview
• Process Involved
• Project Summary
• Data Summary Statistics
• Data Visualizations
• Feature Importance
• ML Methodologies
• Conclusion & Future Efforts
• Lessons Learned
Executive Summary
OBJECTIVE
• Design a machine learning model that can predict whether a person has
Parkinson's disease.
APPROACH
• After data cleaning, use machine learning classification techniques
such as Decision Tree, Naïve Bayes, and Random Forest to develop a
classification model.
RECOMMENDATION
• Identify the important features that can help detect Parkinson's disease so
that it can be treated before it gets worse.
Project Overview & Background
• Issue: Parkinson’s disease is the second most prevalent neurodegenerative
disorder after Alzheimer’s, affecting more than 10 million people worldwide.
Symptoms include frozen facial features, slowness of movement, tremor, etc.
• Current Accuracy: A study from the National Institute of Neurological Disorders found
that early diagnosis is only 53% accurate.
• Goal: Design a machine learning model that can predict whether a person has
Parkinson's disease based on attributes of speech such as relative
jitter, shimmer, and MFCCs.
Process Involved
Data Cleaning
What
• Check for missing values.
• Check the target variable for class imbalance.
How
• Use the isnull method to detect missing values and the fillna method to impute them.
• Check the distribution of the classes. If they are imbalanced, use synthetic oversampling techniques (SMOTE, ADASYN) or undersampling to create a balanced training dataset.
Feature Engineering
What
• Check whether any categorical features need label encoding.
• Check the correlation between independent features.
How
• Use the LabelEncoder class to label-encode categorical columns.
• Use the corr method to look for Pearson coefficients greater than 0.8 or less than -0.8.
Data Visualization
What
• Visualize the target variable and its relationship with selected independent attributes.
• Visualize the important features after the feature importance test.
How
• Use the seaborn and matplotlib libraries to explore the relationships.
• Use distplot, histograms, and scatterplots to understand the important features and their relationship with the target variable.
Model Building
What
• Start with a basic tree-based model, then move to ensemble techniques to predict the target variable.
• Run cross-validation and check feature importance, accuracy score, confusion matrix, etc.
How
• Use models from the sklearn library such as Decision Tree, Naïve Bayes, and Random Forest.
• Check the accuracy score, bias-variance trade-off, and confusion matrix to validate that the model is properly fitted.
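The class-balance step in the data cleaning stage can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: the deck names SMOTE and ADASYN, while the sketch substitutes simple random oversampling, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal.

    Stand-in for SMOTE/ADASYN, which synthesize new points instead of
    duplicating existing ones. Apply to the training split only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy training set: 6 healthy (0) and 2 PD (1) recordings.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
print(class_distribution(labels))      # share of each class before balancing
bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))             # class counts after balancing
```

Oversampling before the train/test split would leak duplicated rows into the test set, which is why the sketch assumes it is given training data only.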
Project Summary
Overall Results – Before Hyper-parameter Tuning & CV
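Decision Tree
Accuracy: 100%
Confusion Matrix:
            Predicted 0   Predicted 1
Actual 0        39             0
Actual 1         0            33
Random Forest
Accuracy: 94.4%
Confusion Matrix:
            Predicted 0   Predicted 1
Actual 0        39             0
Actual 1         2            31
(As printed, this matrix gives 70/72 = 97.2%, which matches the Random Forest accuracy reported after tuning.)
Gaussian Naïve Bayes
Accuracy: 87.5%
Confusion Matrix:
            Predicted 0   Predicted 1
Actual 0        38             1
Actual 1         8            25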
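The accuracy figures can be cross-checked against the three confusion matrices printed on this slide. The sketch below is a plain-Python illustration, not the project's code; the helper name `metrics` is hypothetical, and PD (Status = 1) is assumed to be the positive class. The three matrices work out to accuracies of 100%, 87.5%, and 97.2%.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall for a 2x2 confusion matrix.

    Class 1 (PD) is treated as the positive class.
    """
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# The three matrices printed on the slide, each as (TN, FP, FN, TP).
matrices = [
    (39, 0, 0, 33),
    (38, 1, 8, 25),
    (39, 0, 2, 31),
]
for cm in matrices:
    acc, prec, rec = metrics(*cm)
    print(f"{cm}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Recall matters as much as accuracy here: the matrix with 8 false negatives misses 8 of the 33 PD cases even though it produces only one false positive.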
Overall Results – After Hyper-parameter Tuning & CV
Decision Tree
Accuracy: 100%
Overfitting.
Gaussian Naïve Bayes
Accuracy: 87.5%
Lower than Random Forest.
Random Forest
Accuracy: 97.2%
CV Technique: Grid Search
Bootstrap: True
Impurity Criterion: Gini
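The tuned Random Forest lists Gini as its impurity criterion. As a reminder of what that quantity measures, here is a minimal plain-Python sketch of Gini impurity, 1 - sum(p_k^2), and of the weighted impurity of a binary split. The function names and the example counts are assumptions for illustration only.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


def split_gini(left_counts, right_counts):
    """Weighted Gini impurity of a binary split; lower is better."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left * gini_impurity(left_counts)
            + n_right * gini_impurity(right_counts)) / total


# Hypothetical node holding 39 healthy and 33 PD samples.
print(gini_impurity([39, 33]))
# A hypothetical split that separates the two classes perfectly.
print(split_gini([39, 0], [0, 33]))
```

A tree grown with this criterion picks, at each node, the split with the lowest weighted impurity.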
Understanding the Data
1. ID: Subject's identifier.
2. Recording: Number of the recording.
3. Status: 0=Healthy; 1=PD
4. Gender: 0=Man; 1=Woman
5. Pitch local perturbation measures: relative jitter (Jitter_rel), absolute jitter (Jitter_abs),
relative average perturbation (Jitter_RAP), and pitch perturbation quotient
(Jitter_PPQ).
6. Amplitude perturbation measures: local shimmer (Shim_loc), shimmer in dB
(Shim_dB), 3-point amplitude perturbation quotient (Shim_APQ3), 5-point amplitude
perturbation quotient (Shim_APQ5), and 11-point amplitude perturbation quotient
(Shim_APQ11).
7. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency band
0-500 Hz (HNR05), in 0-1500 Hz (HNR15), in 0-2500 Hz (HNR25), in 0-3500 Hz (HNR35),
and in 0-3800 Hz (HNR38).
8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (MFCC0,
MFCC1,..., MFCC12) and their derivatives (Delta0, Delta1,..., Delta12).
9. Recurrence period density entropy (RPDE).
10. Detrended fluctuation analysis (DFA).
11. Pitch period entropy (PPE).
12. Glottal-to-noise excitation ratio (GNE).
Data & Summary Statistics
• We have used Parkinson's disease data for this project. It has 48 attributes for determining whether a patient
has the disease.
• Number of records: 240
• Number of attributes: 48
• Key attributes: Delta3, MFCC4, HNR15
Key Attribute   Mean    Std     Comments
Delta3          1.34    0.19    Derivative of the order-3 MFCC; one of the key attributes
MFCC4           1.355   0.21    MFCC of order 4
HNR15           63.67   15.62   Harmonic-to-noise ratio in the 0-1500 Hz band
Data Visualizations
Status vs HNR15
People who have Parkinson's disease have a relatively low harmonic-to-noise ratio
in the 0-1500 Hz frequency band.
Status vs Shim_loc
People who have Parkinson's disease have relatively high local shimmer.
Data Visualizations - Multicollinearity
Checking collinearity of different
features. Most of the Mel frequency
cepstral coefficients (MFCCs) are highly
correlated with the harmonic-to-noise ratio
measures. We have removed features
whose pairwise Pearson coefficient is higher than
0.8. No features show a very
high negative correlation.
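The correlation filter described above can be sketched without any libraries. This is an illustration under assumptions, not the project's code (the deck mentions the corr method): the greedy keep-or-drop rule, the function names, and the toy values are hypothetical, and constant columns, which would cause a division by zero, are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Keep a feature only if |r| <= threshold against every kept feature."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values; the names only echo the dataset's columns.
columns = {
    "HNR05": [1.0, 2.0, 3.0, 4.0, 5.0],
    "HNR15": [1.1, 2.1, 2.9, 4.2, 5.0],    # nearly a copy of HNR05
    "Shim_loc": [0.5, 0.1, 0.4, 0.2, 0.3],
}
print(drop_correlated(columns))
```

Using the absolute value covers both thresholds mentioned in the process slide (greater than 0.8 or less than -0.8). The result depends on column order, because the first feature of a correlated pair is the one that survives.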
Outlier Detection
• Removed outliers using z-scores, with a threshold of 3 standard deviations.
• Three observations had multiple columns with a z-score greater than 5, so those were removed.
• After outlier detection and removal, 237 records remained.
• After removing outliers, rerunning the Random Forest model did not give better precision or
accuracy. The number of false positives increased, which is undesirable for healthcare
problems.
Before outlier removal:
Accuracy: 94.4%
False Positives: 0
After outlier removal:
Accuracy: 93.5%
False Positives: 1
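The z-score rule used for outlier detection can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names and the toy table are assumptions, the population standard deviation is an arbitrary choice, and columns with zero variance are not handled.

```python
import statistics


def z_scores(values):
    """Z-score of each value using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def outlier_rows(table, threshold=3.0):
    """Indices of rows where any column has |z| above the threshold."""
    flagged = set()
    for column in zip(*table):              # iterate column by column
        for i, z in enumerate(z_scores(column)):
            if abs(z) > threshold:
                flagged.add(i)
    return sorted(flagged)


# Hypothetical single-feature table: 19 typical rows and one extreme row.
table = [[1.0 + 0.01 * i] for i in range(19)] + [[10.0]]
print(outlier_rows(table))
```

Because an extreme value also inflates the standard deviation it is measured against, very small tables can never produce a z-score above 3, so the rule is only meaningful with enough rows.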
Feature Importance
Based on sklearn's feature_importances_ attribute,
the top 5 features were:
• Delta3
• MFCC3
• MFCC9
• MFCC8
• HNR05
The Mel frequency cepstral coefficients of order
3, 8, and 9 are important in
determining whether a person has
Parkinson's disease or is healthy.
Machine Learning Methodology
Decision Tree
Why: A Decision Tree classifier is a single tree-based model that classifies instances by recursively splitting on feature values.
Accuracy: Decision Tree accuracy was 100%. Clearly it was overfitted.
Conclusion: The Decision Tree was clearly overfitted, so we tried Naïve Bayes, which was not overfitted and gave 87.5%.
Comments: Pre-pruning and post-pruning could reduce overfitting and produce a model that generalizes better.
Random Forest
Why: Random Forest uses a bagging approach: it trains multiple trees on bootstrap samples and combines their votes, which improves the model and gives better results.
Accuracy: Random Forest combines many randomized trees, and the accuracy was 94% without cross-validation and 97% after cross-validation.
Conclusion: Random Forest gave 97.2% with 39 true negatives and 31 true positives (taking PD, Status = 1, as the positive class).
Comments: Random Forest was the best of the three ML algorithms that we ran, based on accuracy, TPs, and FPs.
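Random Forest is a bagging (bootstrap aggregating) ensemble, and the idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project's model: it bags one-feature threshold rules ("stumps") instead of full trees, it omits the random feature selection that Random Forest adds at each split, and all names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-feature threshold rule: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(
            fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical 1-D data: label 1 for the larger feature values.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_bagged_stumps(xs, ys)
print(predict(model, 0.05), predict(model, 0.95))
```

Each stump sees a slightly different resample of the data, so individual errors tend to average out in the vote. Boosting, by contrast, trains its models sequentially on the mistakes of the earlier ones.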
Conclusion and Future Efforts
• Random Forest was the best of the three ML models.
• Grid search with cross-validation improved Random Forest accuracy from 94.4% to 97.2% by selecting better hyper-parameters.
• We could further optimize the Decision Tree using pre-pruning or post-pruning techniques to reduce overfitting.
• We could try other ML techniques such as XGBoost and LightGBM, which might give more accurate results
because they build trees sequentially, with each tree correcting the errors of the previous ones at a set learning rate.
• In further iterations of the model, we could check skewness and outliers to further filter the
dataset and get more accurate results.
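The mechanics of grid search with cross-validation can be sketched in plain Python. The deck tuned a Random Forest with sklearn. To stay self-contained, this sketch swaps in a tiny one-feature k-nearest-neighbour classifier, so the model, the parameter grid, the fold scheme, and the data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def knn_predict(train_x, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds):
    """Mean accuracy over n_folds folds; fold i holds out rows i, i+n_folds, ..."""
    scores = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(xs), n_folds))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        train_x, train_y = zip(*train)
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds


def grid_search(xs, ys, grid, n_folds=4):
    """Try every parameter combination and keep the best CV score."""
    best_params, best_score = None, -1.0
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(xs, ys, n_folds=n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical 1-D data with one mislabeled point at x = 0.35.
xs = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0, 0.15, 0.85]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1]
print(grid_search(xs, ys, {"k": [1, 3, 5]}))
```

Scoring each candidate on held-out folds rather than on the training data is what lets the search prefer settings that generalize, which is the property the 100%-accurate Decision Tree lacked.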
Lessons Learned
• Labelled Data:
The availability of enough samples is key to any ML algorithm, but the healthcare industry in particular faces challenges
with patient populations and patient data. Even when the data is available, it is sometimes protected by laws and regulations. In this
project the models were built with around 240 records, which is too few for making predictions and deploying a model for practical
purposes. To overcome the patient data challenge, the National Institutes of Health is running multiple programs through its CTSA grants to
build patient population databases and to support research via informatics and AI.
• Domain/Functional Knowledge:
Any data science project depends on the domain knowledge of the data scientist. Even with millions of
records and thousands of attributes available, domain knowledge is critical because it helps in establishing hypotheses,
for example that high blood pressure drives the risk of heart attack. Lack of domain knowledge was another challenge in this project,
which I realized after about 30% of the work was done.

 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 

Parkinson disease classification recorded v2.0

  • 1. Parkinson Disease Classification. ANLY 530, Late Summer 2019. Nikhil Shrivastava
  • 2. Agenda • Executive Summary • Project Overview • Process Involved • Project Summary • Data Summary Statistics • Data Visualizations • Feature Importance • ML Methodologies • Conclusion & Future Efforts • Lessons Learned
  • 3. Executive Summary
      • Objective: Design a machine learning model that can predict whether or not a person has Parkinson's disease.
      • Approach: After data cleaning, use different machine learning classification techniques such as Decision Tree, Naïve Bayes, and Random Forest to develop a classification model.
      • Recommendation: Identify the important features that can help detect Parkinson's disease so that it can be treated early, before it gets worse.
  • 4. Project Overview & Background
      • Issue: Parkinson's disease is the second most prevalent neurodegenerative disorder after Alzheimer's, affecting more than 10 million people worldwide. Symptoms include frozen facial features, slowness of movement, and tremor.
      • Current Accuracy: A study from the National Institute of Neurological Disorders finds that early diagnosis is only 53% accurate.
      • Goal: Design a machine learning model that can predict whether or not a person has Parkinson's disease based on attributes of speech such as relative jitter, shimmer, and MFCCs.
  • 5. Process Involved
      Data Cleaning
      • What: Check for missing values. Check the target variable for class imbalance.
      • How: Detect missing values and impute them with the fillna method. Check the distribution of the classes; if they are imbalanced, use synthetic oversampling (SMOTE, ADASYN) or undersampling to create a balanced training dataset.
      Feature Engineering
      • What: Check whether any categorical features need label encoding. Check the correlation between the independent features.
      • How: Use the LabelEncoder class to encode categorical columns. Use the corr method to find feature pairs with a Pearson coefficient greater than 0.8 or less than -0.8.
      Data Visualization
      • What: Visualize the target variable and its relationship with selected independent attributes. Visualize the important features after the feature importance test.
      • How: Use the seaborn and matplotlib libraries to understand the relationships. Use distplot, histograms, and scatterplots to understand the important features and their relationship with the target variable.
      Model Building
      • What: Start with a basic tree-based model, then use ensemble techniques to predict the target variable. Run cross-validation and check feature importance, accuracy score, confusion matrix, etc.
      • How: Use models from the sklearn library such as Decision Tree, Naïve Bayes, and Random Forest. Check the accuracy score, bias-variance trade-off, and confusion matrix to validate that the model is properly fitted.
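The class-imbalance step on the process slide can be illustrated with a small sketch. This is not SMOTE or ADASYN (those synthesize new points by interpolation); it is plain random oversampling, used here only as a simpler stand-in. The function name `random_oversample` and the toy labels are assumptions made for illustration, and in practice the resampling would be applied to the training split only.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest one.

    Hypothetical helper for illustration; a simpler stand-in for SMOTE/ADASYN.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows belonging to this class
        idx = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(target - n):
            j = rng.choice(idx)
            out_x.append(features[j])
            out_y.append(cls)
    return out_x, out_y


# Toy, made-up data: 8 healthy (0) and 3 PD (1) rows with one feature each
labels = [0] * 8 + [1] * 3
features = [[float(i)] for i in range(len(labels))]

bal_x, bal_y = random_oversample(features, labels)
print("before:", dict(Counter(labels)))
print("after: ", dict(Counter(bal_y)))
```

Checking `Counter(labels)` first shows whether any resampling is needed at all; if the classes are already roughly balanced, this step can be skipped.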
  • 6. Project Summary
      Overall results before hyper-parameter tuning and cross-validation (CV):
      • Decision Tree: accuracy 100%
      • Random Forest: accuracy 94.4%
      • Gaussian Naïve Bayes: accuracy 87.5%
      Confusion matrices (rows: actual class, columns: predicted class), in the order they appear on the slide:
                    Predicted 0   Predicted 1
        Actual 0         39             0
        Actual 1          0            33      (72 of 72 correct, 100%)

                    Predicted 0   Predicted 1
        Actual 0         38             1
        Actual 1          8            25      (63 of 72 correct, 87.5%)

                    Predicted 0   Predicted 1
        Actual 0         39             0
        Actual 1          2            31      (70 of 72 correct, 97.2%)
      Overall results after hyper-parameter tuning and CV:
      • Decision Tree: accuracy 100% (overfitting)
      • Gaussian Naïve Bayes: accuracy 87.5% (lower than Random Forest)
      • Random Forest: accuracy 97.2%. CV technique: grid search. Bootstrap: True. Impurity criterion: Gini.
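The accuracy figures on the results slide can be checked directly from the confusion-matrix counts. The sketch below assumes rows are actual classes and columns are predicted classes, and that class 1 (PD) is the positive class; the helper name `summarize` is illustrative only.

```python
# Confusion matrices in the order they appear on the slide:
# rows = actual, columns = predicted, class order [0 (healthy), 1 (PD)].
matrices = [
    [[39, 0], [0, 33]],
    [[38, 1], [8, 25]],
    [[39, 0], [2, 31]],
]


def summarize(cm):
    """Return (accuracy, precision, recall), treating class 1 (PD) as positive."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


results = [summarize(cm) for cm in matrices]
for acc, prec, rec in results:
    print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The three matrices work out to accuracies of 100%, 87.5%, and 97.2%. Recall on the PD class is worth reporting alongside accuracy, since a missed PD case (false negative) is usually the costlier error in a screening setting.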
  • 7. Understanding the Data
      1. ID: Subject's identifier.
      2. Recording: Number of the recording.
      3. Status: 0=Healthy; 1=PD
      4. Gender: 0=Man; 1=Woman
      5. Pitch local perturbation measures: relative jitter (Jitter_rel), absolute jitter (Jitter_abs), relative average perturbation (Jitter_RAP), and pitch perturbation quotient (Jitter_PPQ).
      6. Amplitude perturbation measures: local shimmer (Shim_loc), shimmer in dB (Shim_dB), 3-point amplitude perturbation quotient (Shim_APQ3), 5-point amplitude perturbation quotient (Shim_APQ5), and 11-point amplitude perturbation quotient (Shim_APQ11).
      7. Harmonic-to-noise ratio measures: harmonic-to-noise ratio in the frequency band 0-500 Hz (HNR05), in 0-1500 Hz (HNR15), in 0-2500 Hz (HNR25), in 0-3500 Hz (HNR35), and in 0-3800 Hz (HNR38).
      8. Mel frequency cepstral coefficient-based spectral measures of order 0 to 12 (MFCC0, MFCC1,..., MFCC12) and their derivatives (Delta0, Delta1,..., Delta12).
      9. Recurrence period density entropy (RPDE).
      10. Detrended fluctuation analysis (DFA).
      11. Pitch period entropy (PPE).
      12. Glottal-to-noise excitation ratio (GNE).
  • 8. Data & Summary Statistics
      • We used Parkinson's disease data for this project. It has around 48 attributes for determining whether a patient has the disease.
      • Number of records: 240
      • Number of attributes: 48
      • Key attributes: Delta3, MFCC4, HNR15
        Key attribute   Mean     Std      Comments
        Delta3          1.34     0.19     Derivative of the MFCC; one of the key attributes
        MFCC4           1.355    0.21     MFCC of order 4
        HNR15           63.67    15.62    Harmonic-to-noise ratio measure (0-1500 Hz band)
  • 9. Data Visualizations
      • Status vs HNR15: People who have Parkinson's disease have a relatively low harmonic-to-noise ratio in the 0-1500 Hz frequency band.
      • Status vs Shim_loc: People who have Parkinson's disease have relatively high local shimmer.
  • 10. Data Visualizations - Multicollinearity. Checking the collinearity of the different features: most of the Mel frequency cepstral coefficients (MFCCs) are highly correlated with the harmonic-to-noise ratio measures. We removed features whose Pearson coefficient is higher than 0.8. No features show a very high negative correlation.
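The correlation filter described on the multicollinearity slide can be sketched as follows. The slides do not say which feature of a correlated pair was dropped, so the greedy rule here (keep the first feature seen, drop any later feature that correlates too strongly with one already kept) is an assumption, as are the helper names and the toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (a constant column would divide by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.8):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy, made-up values; only the column names come from the dataset description
toy = {
    "HNR15": [60.0, 62.0, 65.0, 70.0, 72.0],
    "HNR25": [61.0, 63.5, 66.0, 71.0, 73.5],  # almost a shifted copy of HNR15
    "Shim_loc": [0.05, 0.02, 0.06, 0.03, 0.04],
}

kept = drop_correlated(toy)
print(kept)
```

Using the absolute value of the coefficient covers both the "greater than 0.8" and "less than -0.8" cases in a single comparison.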
  • 11. Outlier Detection
      • Removed outliers using the z-score, with a threshold of 3 standard deviations.
      • I found 3 observations that had multiple columns with a z-score greater than 5, so I removed those.
      • After outlier detection and removal, 237 records remained.
      • After removing the outliers, the Random Forest model did not give better results in terms of precision and accuracy. The number of false positives increased, which is undesirable for healthcare problems.
      • Before outlier removal: accuracy 94.4%, false positives 0.
      • After outlier removal: accuracy 93.5%, false positives 1.
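A minimal sketch of z-score outlier filtering with a threshold of 3 is shown below. It assumes the population standard deviation and flags a row if any single column exceeds the threshold; the slides do not state either choice, and the function name and toy rows are made up for illustration.

```python
from statistics import mean, pstdev


def zscore_outlier_rows(rows, threshold=3.0):
    """Return the indices of rows in which any column has |z-score| > threshold."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    flagged = []
    for i, row in enumerate(rows):
        for value, (mu, sigma) in zip(row, stats):
            # Skip constant columns, where the z-score is undefined
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(i)
                break
    return flagged


# Toy, made-up data: 19 ordinary rows plus one row with an extreme first column
rows = [[1.0 + 0.01 * i, 50.0 + i] for i in range(19)] + [[10.0, 60.0]]

flagged = zscore_outlier_rows(rows)
cleaned = [row for i, row in enumerate(rows) if i not in flagged]
print(flagged, len(cleaned))
```

Because the outlier itself inflates the mean and standard deviation, a z-score above 3 is only reachable with enough rows; with very small samples a robust alternative (for example, one based on the median) may be preferable.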
  • 12. Feature Importance. Based on sklearn's feature_importances_, the top 5 features were: • Delta3 • MFCC3 • MFCC9 • MFCC8 • HNR05. The Mel frequency cepstral coefficients of order 3, 8, and 9 are therefore important in determining whether a person has Parkinson's disease or is healthy.
  • 13. Machine Learning Methodology
      Decision Tree
      • Why: The Decision Tree classifier is a single tree-based model that classifies instances into the different classes.
      • Accuracy: Decision Tree accuracy was 100%. Clearly it was overfitted.
      • Conclusion: The Decision Tree was clearly overfitted, so we tried Naïve Bayes, which was not overfitted and gave 87.5%.
      • Comments: Pre-pruning and post-pruning could reduce overfitting and produce a model that generalizes better.
      Random Forest
      • Why: Random Forest uses a bagging approach with multiple trees, which improves the model and gives better results.
      • Accuracy: Random Forest trains many trees on bootstrap samples of the data; its accuracy was 94% without cross-validation and 97% after cross-validation.
      • Conclusion: Random Forest gave 97.2%, with 31 true positives and 39 true negatives when PD (Status = 1) is taken as the positive class.
      • Comments: Random Forest was the best model out of the three ML algorithms that we ran, based on accuracy, TPs, and FPs.
  • 14. Conclusion and Future Efforts
      • Random Forest was the best of the three ML models.
      • Cross-validation using grid search improved the accuracy by finding the best parameters.
      • We could further optimize the Decision Tree using pre-pruning or post-pruning techniques to minimize overfitting.
      • We could use other ML techniques such as XGBoost or LightGBM, which could give more accurate results because of the learning rate they use to train every tree instance.
      • In further iterations of the model, we could check skewness and outliers to further filter the dataset and get more accurate results.
  • 15. Lessons Learned
      • Labelled Data: Having enough samples is key to any ML algorithm, but the healthcare industry in particular faces challenges with patient population and patient data. Even when the data is available, it is sometimes protected by laws and regulations. In this project the models were built with around 240 records, which is too few for making predictions and deploying them for practical purposes. To overcome the patient data challenge, the National Institutes of Health is running multiple programs through the CTSA grant to build a patient population database and to support research through informatics and AI.
      • Domain/Functional Knowledge: Any data science project depends on the domain knowledge of the data scientist. Even with millions of records and thousands of attributes available, domain knowledge is critical because it helps in establishing hypotheses, for example, that high blood pressure drives the risk of heart attack. Lack of domain knowledge was another challenge in this project, which I realized after about 30% of the work.