ANALYSIS ON STATISTICAL MODEL
THAT BEST PREDICTS HEART
DISEASE
Prepared for Healthy Living, Inc.
Prepared by Team 4, LLC
April 15, 2015
Objective
• To provide a statistical model that predicts heart disease
and can be implemented in a consumer-facing app
• To present our methodologies and model results
• To provide our recommendation on which model should
be used
• To provide necessary R code to implement the model
Overall Process
Data Preparation and Variable Selection
• Raw data set (“heart disease data_hungarian.xlsx”) includes
76 variables and 294 observations
• Response variable for predicting heart disease is a binary
variable, with a value of 1 for positive, and 0 for negative (no
heart disease).
  • The response recodes a prior variable (originally called “num”): values 1 through 4
  become 1 in the new response variable, and 0 stays 0
• The number of variables needs to be cut down while still predicting heart disease
efficiently
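The recoding of “num” into a binary response can be sketched as follows. This is a minimal Python illustration, not the team's R code; the function name `recode_num` and the sample values are hypothetical.

```python
def recode_num(num):
    """Collapse the original 0-4 'num' variable into a binary response.

    0 stays 0 (no heart disease); 1 through 4 become 1 (heart disease).
    """
    if num not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected value for num: {num!r}")
    return 1 if num >= 1 else 0


# Hypothetical sample of original 'num' values, for illustration only.
num_values = [0, 2, 0, 4, 1, 3]
response = [recode_num(v) for v in num_values]
print(response)  # [0, 1, 0, 1, 1, 1]
```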
Process to Reduce Number of Variables
• Remove Variables with No Relevant Information
  • Thirty-two variables were removed because they were empty, irrelevant,
  lacked any information in the data dictionary, or represented the
  month, day, or year
• Remove Variables with Small Variability
  • Some variables had one value for almost all observations
• Remove Variables with Multicollinearity
  • Nine variables were removed for high multicollinearity, using a
  collinearity cutoff of 0.6
• Remove Variables with Low Correlation with Outcome
  • Four additional variables were removed because their absolute correlation
  with the outcome fell below a minimum threshold of 10 percent
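The three data-driven filters on this slide could be sketched roughly as below, in plain Python rather than the team's R code. Several details are assumptions, since the slides do not state them: the `dominant_share` cutoff for "almost all observations", the use of Pearson correlation, and the rule that, within a collinear pair, the variable less correlated with the outcome is the one dropped.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def reduce_variables(columns, outcome, dominant_share=0.95,
                     collinearity_cutoff=0.6, min_outcome_corr=0.10):
    """Return the names of the variables that survive the three filters.

    columns: dict mapping variable name -> list of numeric values.
    outcome: list of 0/1 response values.
    dominant_share is an assumed cutoff; the slides give no number for it.
    """
    # 1. Small variability: drop variables where a single value dominates.
    kept = []
    for name, values in columns.items():
        top_count = max(values.count(v) for v in set(values))
        if top_count / len(values) < dominant_share:
            kept.append(name)

    # 2. Multicollinearity: visit variables from most to least correlated
    #    with the outcome and drop any that is too correlated with a
    #    variable already selected (assumed tie-breaking rule).
    kept.sort(key=lambda n: abs(pearson(columns[n], outcome)), reverse=True)
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) <= collinearity_cutoff
               for s in selected):
            selected.append(name)

    # 3. Low correlation with the outcome.
    return [n for n in selected
            if abs(pearson(columns[n], outcome)) >= min_outcome_corr]
```

The first filter on the slide (empty, irrelevant, or undocumented variables) is a manual review step and is not covered by this sketch.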
Monte Carlo Data Imputation
• In order to maximize the power of our tests, we use a Monte
Carlo technique for data imputation.
• We examine the distribution of each variable and generate
random numbers from these distributions to fill in the missing
data.
• Monte Carlo imputation generally leads to more accurate and
less biased results.
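A minimal sketch of this kind of imputation is shown below, in Python rather than the team's R code. It assumes that "the distribution of each variable" means the empirical distribution of the observed values; the slides do not say whether a parametric distribution was fitted instead. The function name and the sample readings are hypothetical.

```python
import random


def monte_carlo_impute(values, rng=None):
    """Fill missing entries (None) by sampling from the observed values.

    Drawing from the observed values with replacement approximates drawing
    from the variable's empirical distribution.
    """
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    return [v if v is not None else rng.choice(observed) for v in values]


# Hypothetical cholesterol readings with two missing entries.
chol = [237, None, 219, 198, None, 225]
filled = monte_carlo_impute(chol, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the imputed values reproducible between runs.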
Validation Set
We were originally provided with separate training and testing sets. Upon
investigating the distribution of the response variable, we found that
the testing set contains very few observations that are not classified
as heart disease:

                                              Training Data Set   Testing Data Set
Total number of observations                  294                 123
Patients with heart disease (% of total)      106 (36%)           115 (93%)
Patients without heart disease (% of total)   188 (64%)           8 (7%)

• The two data sets come from fundamentally different populations
Proposed Models
• The following models were considered in the analysis:
• K-Nearest Neighbors
• Logistic Regression
• Linear and Quadratic Discriminant Analysis
• Decision Trees: Bagging
• Decision Trees: Random Forests
• Decision Trees: Boosting
Analysis Results – K-Nearest Neighbors (KNN)
• K = 13 performed best
• KNN achieves the lowest false positive rate (6.6%) of all the
methods.
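For readers unfamiliar with the method, a bare-bones k-nearest-neighbors classifier and a helper for comparing values of k might look like this. It is an illustrative Python sketch, not the team's R code; the function names and the toy points are hypothetical, and Euclidean distance is an assumption.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_x, train_y, valid_x, valid_y, k):
    """Share of validation points that k-NN classifies correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(valid_x, valid_y))
    return hits / len(valid_y)


# Toy two-feature data, for illustration only.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
valid_x = [(0.5, 0.5), (5.5, 5.5)]
valid_y = [0, 1]

# Compare odd values of k (odd k avoids tied votes with two classes).
best_k = max([1, 3, 5],
             key=lambda k: accuracy_for_k(train_x, train_y, valid_x, valid_y, k))
```

Because the vote depends on distances, features measured on different scales are usually standardized before fitting.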
Logistic Regression
• Our variable selection technique quickly isolates the variables
that affect the outcome the most: we chose the variables that
are most correlated with the outcome.
• We ran a model with the single most correlated variable, then
with the 2 most correlated variables, then with the 3 most
correlated, and so on, and compared the accuracy of each
model.
• The model with the 10 most highly correlated variables has
the best accuracy, 84%.
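The nested search described on this slide can be sketched as a short loop. This is an illustrative Python sketch, not the team's R code: `score_subset` is a hypothetical callable, supplied by the caller, that fits a logistic regression on the given variables and returns its validation accuracy.

```python
def best_top_n(ranked_names, score_subset):
    """Evaluate the top-1, top-2, ... variable subsets and return the best.

    ranked_names: variable names sorted from most to least correlated
                  with the outcome.
    score_subset: callable taking a list of names and returning a
                  validation accuracy (hypothetical; supplied by the caller).
    """
    best_names, best_score = [], float("-inf")
    for n in range(1, len(ranked_names) + 1):
        subset = ranked_names[:n]
        score = score_subset(subset)
        if score > best_score:  # strict comparison: ties keep the smaller model
            best_names, best_score = subset, score
    return best_names, best_score
```

Keeping the smaller model on ties is a design choice of this sketch; the slides do not say how ties were handled.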
Decision Tree – Bagging
• Variable importance is measured by the mean decrease in Gini index
and the mean decrease in accuracy for each variable, relative to the
largest. On these measures, the most important variables are oldpeak
and relrest.
• While the results give insight into the most impactful variables,
bagging does not perform well compared to the other methods.
Decision Tree – Random Forests
• The number of variables randomly sampled at each split is set to 4
  • This follows the commonly used rule for classification with
  random forests, √p ≈ 4, where p is the number of features in the dataset
• While random forests improved the prediction results compared to
bagging, they ultimately did not beat the other models.
Decision Tree – Boosting
• Boosting performs best when using 14 trees, as shown by the
graph below:
• Of the decision tree methods, boosting achieved the best
results, but it did not match logistic regression in accuracy or sensitivity.
Linear Discriminant Analysis and
Quadratic Discriminant Analysis
• LDA had decent results, but did not surpass logistic
regression on any of the relevant metrics (accuracy, sensitivity,
false positive rate)
• QDA had the highest sensitivity of any method, almost 77%,
but at the cost of a higher false positive rate.
Model Summary
Model                             Accuracy   Sensitivity   False Positive Rate
K-Nearest Neighbors               80.95%     60.71%        6.59%
Logistic Regression               84.35%     73.21%        8.79%
Decision Tree – Bagging           78.23%     66.07%        14.29%
Decision Tree – Random Forest     79.59%     66.07%        12.09%
Decision Tree – Boosting          82.99%     69.64%        8.79%
Linear Discriminant Analysis      80.95%     66.07%        9.89%
Quadratic Discriminant Analysis   81.63%     76.79%        15.38%
Conclusion
• Logistic regression is the best model
  • It has the highest accuracy, reasonably high sensitivity, and a low
  false positive rate

                              Observed: No heart disease   Observed: Heart disease
Predicted: No heart disease   83 cases                     15 cases
Predicted: Heart disease      8 cases                      41 cases
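As a worked check, the three reported metrics can be recomputed from the four cells of the confusion matrix on this slide. The Python function below is an illustrative sketch (the name `classification_metrics` is hypothetical), using the standard definitions of accuracy, sensitivity, and false positive rate.

```python
def classification_metrics(tn, fn, fp, tp):
    """Accuracy, sensitivity, and false positive rate from confusion counts."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tp + tn) / total,          # correct / all cases
        "sensitivity": tp / (tp + fn),          # detected / observed heart disease
        "false_positive_rate": fp / (fp + tn),  # false alarms / observed no disease
    }


# Counts from the logistic regression confusion matrix.
metrics = classification_metrics(tn=83, fn=15, fp=8, tp=41)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 84.35%
# sensitivity: 73.21%
# false_positive_rate: 8.79%
```

These values agree with the logistic regression row of the model summary.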
ROC Curve for Optimizing Logistic Regression Threshold
Logistic Regression Model
• By using a threshold of 0.29, we are able to boost the
sensitivity of the model by about 7 percentage points while decreasing
the accuracy by only about 6 percentage points. The false positive rate
rises by about 14 percentage points.

Logistic Regression Model   Accuracy   Sensitivity   False Positive Rate
Threshold = 0.5             84.35%     73.21%        8.79%
Threshold = 0.29            78.23%     80.36%        23.08%
Difference                  -6.12%     +7.15%        +14.29%
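The effect of moving the classification threshold can be illustrated with a small Python sketch. The probabilities and labels below are hypothetical toy values, not the study data; they only show how lowering the threshold trades a higher false positive rate for higher sensitivity.

```python
def metrics_at_threshold(probabilities, labels, threshold):
    """Predict heart disease when probability >= threshold, then score."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical predicted probabilities; the first six patients have no
# heart disease (label 0), the last four do (label 1).
probabilities = [0.05, 0.10, 0.30, 0.35, 0.40, 0.60, 0.20, 0.32, 0.70, 0.90]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

at_050 = metrics_at_threshold(probabilities, labels, 0.5)
at_029 = metrics_at_threshold(probabilities, labels, 0.29)
```

On this toy data, sensitivity rises from 0.50 to 0.75 when the threshold drops from 0.5 to 0.29, while the false positive rate rises from 1/6 to 4/6.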
Final Model
• $\text{odds}_{\text{heart disease}} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{10} X_{10}}$
• where the $X_i$ are variables that can be summarized as:
  • Patient demographic information
    • “sex” (patient gender)
    • “age” (patient age)
  • Physical health
    • “chol” (cholesterol level)
    • “fbs” (indicating high fasting blood sugar)
  • Exercise-related indications
    • “oldpeak” (ST depression induced by exercise relative to rest)
    • “relrest” (indicating relief after rest)
    • “thalach” (maximum heart rate)
    • “thalrest” (resting heart rate)
    • “pro” and “prop” (indicating medications used during the exercise ECG)
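To show how the odds formula turns into a prediction, the Python sketch below evaluates it for one patient and converts the odds to a probability with the standard identity p = odds / (1 + odds). The coefficient values are placeholders chosen for illustration, not the fitted coefficients, which the slides do not report; only two of the ten variables are used to keep the example short.

```python
from math import exp


def heart_disease_odds(intercept, coefficients, features):
    """Odds = exp(b0 + b1*x1 + ... + bk*xk) for one patient."""
    linear = intercept + sum(coefficients[name] * features[name]
                             for name in coefficients)
    return exp(linear)


def odds_to_probability(odds):
    """Convert odds to a probability between 0 and 1."""
    return odds / (1.0 + odds)


# Placeholder coefficients (NOT fitted values) and a hypothetical patient.
intercept = -1.0
coefficients = {"age": 0.02, "sex": 0.5}
patient = {"age": 50, "sex": 1}

odds = heart_disease_odds(intercept, coefficients, patient)
probability = odds_to_probability(odds)
flagged = probability >= 0.29  # threshold chosen on the previous slide
```

With these placeholder values the linear predictor is 0.5, so the odds are e^0.5 and the probability is about 0.62, which is above the 0.29 threshold.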
Prototype of App
To access the application, please navigate to
the following address:
www.tinyurl.com/hdprototype
Roles and Responsibilities
• KNN & Boosting - Ravi & Eugenia
• Bagging & Random Forests - Zijian & Jiayang
• Logistic & LDA/QDA - Shijie & Armen
• Final Model runs and app design - Ravi
• Missing value imputation and final QA - Zijian & Jiayang
• Final report writing - Armen, Eugenia, Shijie