This document analyzes several statistical models for predicting heart disease, using a data set of 294 patients and 76 variables. Logistic regression performed best, with 84% accuracy, reasonably high sensitivity, and a low false positive rate; we recommend implementing a logistic regression model in a consumer-facing heart disease prediction app.
1. ANALYSIS OF THE STATISTICAL MODEL THAT BEST PREDICTS HEART DISEASE
Prepared for Healthy Living, Inc.
Prepared by Team 4, LLC
April 15, 2015
2. Objective
• To provide a statistical model that predicts heart disease
and can be implemented in a consumer-facing app
• To present our methodologies and model results
• To provide our recommendation on which model should
be used
• To provide necessary R code to implement the model
4. Data Preparation and Variable Selection
• Raw data set (“heart disease data_hungarian.xlsx”) includes
76 variables and 294 observations
• The response variable for predicting heart disease is binary, with a value of 1 for positive and 0 for negative (no heart disease)
• We recode the prior variable (originally called “num”), which took values 1 through 4, to a value of 1 in the new response variable
• We need to cut down the number of variables while still predicting heart disease efficiently
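The recoding step above can be sketched as follows. This is a minimal illustration on a toy frame; the column name "target" and the sample values are hypothetical, and the real data would be read from "heart disease data_hungarian.xlsx".

```python
import pandas as pd

# Toy stand-in for the raw data; the real set has 294 rows and 76 columns.
df = pd.DataFrame({"num": [0, 1, 2, 3, 4, 0]})

# Collapse the original "num" variable (0 = no disease, 1-4 = degrees of
# disease) into a binary response: 1 for any heart disease, 0 otherwise.
df["target"] = (df["num"] > 0).astype(int)
```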
5. Process to Reduce Number of Variables
• Remove Variables with No Relevant Information
• Thirty-two variables were removed because they were empty, irrelevant, lacked any information in the data dictionary, or recorded only the month, day, and year
• Remove Variables with Small Variability
• Some variables had one value for almost all observations
• Remove Variables with Multicollinearity
• Nine variables with high multicollinearity were removed, using a collinearity cutoff of 0.6
• Remove Variables with Low Correlation with Outcome
• A minimum absolute correlation threshold of 10 percent was used to remove four additional variables that were not related to the outcome
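The last three filters above can be sketched as one small pipeline. This is an illustrative sketch on synthetic data, not the team's actual code; the column names ("useful", "twin", "constant", "noise") are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, 500).astype(float), name="heart_disease")
df = pd.DataFrame({
    "useful":   y + rng.normal(0, 0.3, 500),   # related to the outcome
    "twin":     y + rng.normal(0, 0.3, 500),   # collinear with "useful"
    "constant": np.ones(500),                  # no variability
    "noise":    rng.normal(0, 1.0, 500),       # unrelated to the outcome
})

# Step 1: drop variables with no variability (one value for all rows).
df = df.loc[:, df.nunique() > 1]

# Step 2: drop one of each pair whose |correlation| exceeds the 0.6 cutoff.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.6).any()])

# Step 3: drop variables with |correlation| to the outcome below 10 percent.
df = df[[c for c in df.columns if abs(df[c].corr(y)) >= 0.10]]
```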
6. Monte Carlo Data Imputation
• In order to maximize the power of our tests, we use a Monte Carlo technique for data imputation
• We examine the distribution of each variable and generate random numbers from these distributions to fill in the missing data
• Monte Carlo imputation tends to be more accurate than single-value imputation and generally leads to less biased results
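One simple form of the imputation described above is to draw replacement values from the observed (empirical) distribution of each variable. The sketch below illustrates this on a toy cholesterol column; the function name and sample values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy column with missing values (the real data has 294 patients).
chol = pd.Series([200.0, 240.0, np.nan, 180.0, np.nan, 260.0], name="chol")

def monte_carlo_impute(s, rng):
    """Fill NAs by drawing random values from the observed
    (empirical) distribution of the variable."""
    s = s.copy()
    observed = s.dropna().to_numpy()
    s[s.isna()] = rng.choice(observed, size=s.isna().sum())
    return s

chol_filled = monte_carlo_impute(chol, rng)
```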
7. Validation Set
We were originally provided separate training and testing data sets. Upon investigating the distribution of the response variable, we found that the testing set contains very few observations that are not classified as heart disease:

                                             Training Data Set   Testing Data Set
Total number of observations                        294                123
Patients with heart disease (% of total)         106 (36%)          115 (93%)
Patients without heart disease (% of total)      188 (64%)            8 (7%)
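The slides do not state how this imbalance was handled. One common remedy, shown here purely as a hypothetical sketch on synthetic data, is to pool the observations and re-split them with stratification so both sets keep a similar class mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(294, 5))          # stand-in feature matrix
y = rng.integers(0, 2, size=294)       # stand-in binary response

# Stratified split: the class proportions in train and test stay close.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
```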
9. Proposed Models
• The following models were considered in the analysis:
• K-Nearest Neighbors
• Logistic Regression
• Linear and Quadratic Discriminant Analysis
• Decision Trees: Bagging
• Decision Trees: Random Forests
• Decision Trees: Boosting
10. Analysis Results – K-Nearest Neighbors (KNN)
• K = 13 performed best
• The KNN method achieves the lowest false positive rate (6.6%) of all the methods compared
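A KNN fit with the chosen k can be sketched as below. The data here is synthetic and the outcome rule is invented for illustration; only k = 13 comes from the analysis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary outcome

# k = 13 was the best-performing neighborhood size in the analysis.
knn = KNeighborsClassifier(n_neighbors=13).fit(X, y)
preds = knn.predict(X)
```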
11. Logistic Regression
• The variable selection technique we chose allows us to quickly
isolate the variables that affect our outcome the most: We
chose the variables that are most correlated with the
outcome.
• This was done by running a model with the single most correlated variable, then with the 2 most correlated variables, then with the first 3, etc., and comparing the accuracy of each model
• We found that a model with the 10 most highly correlated variables has the best accuracy, at 84%
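The forward-by-correlation search above can be sketched as a short loop. This is an illustrative sketch on synthetic data; the variable names and the use of training accuracy as the score are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
X = pd.DataFrame({
    "signal1": y + rng.normal(0, 0.8, n),
    "signal2": y + rng.normal(0, 1.2, n),
    "noise1":  rng.normal(0, 1, n),
    "noise2":  rng.normal(0, 1, n),
})

# Rank variables by |correlation| with the outcome, then fit models on
# the top 1, top 2, ... variables and keep the most accurate model.
ranked = X.corrwith(pd.Series(y)).abs().sort_values(ascending=False).index
best_k, best_acc = 0, 0.0
for k in range(1, len(ranked) + 1):
    cols = list(ranked[:k])
    acc = LogisticRegression().fit(X[cols], y).score(X[cols], y)
    if acc > best_acc:
        best_k, best_acc = k, acc
```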
12. Decision Tree – Bagging
• The variable importance plot reports, for each variable, the mean decrease in Gini index and the mean decrease in accuracy relative to the largest; by these measures, the most important variables are oldpeak and relrest
• While the results give insight into the most impactful variables, bagged trees do not perform well compared to the other methods
13. Decision Tree – Random Forests
• The number of variables randomly sampled at each split is set to 4
• This follows the commonly used heuristic for classification with random forests of sampling √p variables at each split, where p is the number of features in the data set; here √p ≈ 4
• While random forests improved the prediction results compared to bagging, they ultimately did not beat the other models
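A random forest with the √p feature-sampling rule can be sketched as below. The data is synthetic; only the choice of 4 sampled variables comes from the analysis (with p = 16 here, √16 = 4; `max_features="sqrt"` would compute the same value automatically).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                # p = 16 synthetic features
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Sample 4 candidate variables at each split, per the sqrt(p) heuristic.
rf = RandomForestClassifier(max_features=4, random_state=0).fit(X, y)
```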
14. Decision Tree – Boosting
• Boosting performs best when using 14 trees, as shown by the graph below:
• Out of all the decision tree methods, boosting achieved the best results, but it still did not match the best of the other models
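A boosted classifier with the chosen tree count can be sketched as below. The data is synthetic and gradient boosting is used as the illustrative algorithm; only the count of 14 trees comes from the analysis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # synthetic binary outcome

# 14 boosting iterations, mirroring the tree count chosen in the analysis.
gbm = GradientBoostingClassifier(n_estimators=14, random_state=0).fit(X, y)
```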
15. Linear Discriminant Analysis and Quadratic Discriminant Analysis
• The LDA method produced decent results, but was unable to surpass logistic regression on any of the relevant metrics (accuracy, sensitivity, false positive rate)
• QDA achieved the highest sensitivity of any method, almost 77%, but at the cost of a higher false positive rate
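Fitting the two discriminant models can be sketched as below. The two-class Gaussian data is synthetic, constructed only so that both methods have something to separate.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two synthetic classes with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(1.5, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)    # shared covariance
qda = QuadraticDiscriminantAnalysis().fit(X, y) # per-class covariance
```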
16. Model Summary

Model                              Accuracy   Sensitivity   False Positive Rate
K-Nearest Neighbors                 80.95%      60.71%            6.59%
Logistic Regression                 84.35%      73.21%            8.79%
Decision Tree – Bagging             78.23%      66.07%           14.29%
Decision Tree – Random Forest       79.59%      66.07%           12.09%
Decision Tree – Boosting            82.99%      69.64%            8.79%
Linear Discriminant Analysis        80.95%      66.07%            9.89%
Quadratic Discriminant Analysis     81.63%      76.79%           15.38%
17. Conclusion
• Logistic regression is the best model
• High accuracy, reasonably high sensitivity, and a low false positive rate
                               Observed – No heart disease   Observed – Heart disease
Predicted – No heart disease             83 cases                    15 cases
Predicted – Heart disease                 8 cases                    41 cases
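The reported metrics follow directly from the confusion matrix above (rows are predictions, columns are observed outcomes; "positive" means heart disease):

```python
# Cell counts from the confusion matrix above.
tn, fn = 83, 15   # predicted "no disease": correct / missed cases
fp, tp = 8, 41    # predicted "disease": false alarms / correct

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # (41 + 83) / 147
sensitivity = tp / (tp + fn)                    # 41 / 56
fpr         = fp / (fp + tn)                    # 8 / 91
```

These reproduce the 84.35% accuracy, 73.21% sensitivity, and 8.79% false positive rate reported for logistic regression.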
18. ROC Curve for Optimizing Logistic Regression Threshold
19. Logistic Regression Model
• By using a threshold of 0.29, we are able to boost the model's sensitivity by about 7 percentage points, while decreasing accuracy by only about 6 percentage points
Logistic Regression Model    Accuracy   Sensitivity   False Positive Rate
Threshold = 0.5               84.35%      73.21%            8.79%
Threshold = 0.29              78.23%      80.36%           23.08%
Difference                    -6.12%      +7.15%          +14.29%
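Applying a custom threshold to a fitted model's predicted probabilities is a one-line change; the sketch below uses hypothetical probabilities purely for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities from a fitted logistic model.
probs = np.array([0.10, 0.25, 0.31, 0.50, 0.74, 0.28])

# Lowering the cutoff from 0.5 to 0.29 flags more patients as positive,
# trading some accuracy for higher sensitivity.
pred_default = (probs >= 0.5).astype(int)
pred_tuned   = (probs >= 0.29).astype(int)
```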
20. Final Model
• odds(heart disease) = e^(β0 + β1·X1 + β2·X2 + … + β10·X10)
• where the Xi are variables that can be summarized as:
• Patient demographic information
• “sex” (patient gender)
• “age” (patient age)
• Physical health
• “chol” (cholesterol level)
• “fbs” (indicating high blood sugar)
• ECG measurements
• “oldpeak” (ST depression induced by exercise, relative to rest)
• “relrest” (indicating relief after rest)
• Exercise-related indications
• “thalach” (maximum heart rate)
• “thalrest” (resting heart rate)
• “pro” and “prop” (indicating ECG measurement specifications)
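The odds expression above converts to a probability via the logistic function, which is what an app would display to a user. A minimal sketch (the function name is hypothetical):

```python
import math

def heart_disease_probability(log_odds):
    """Convert the linear predictor b0 + b1*X1 + ... + b10*X10
    into a probability: p = odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1 + odds)
```

A linear predictor of 0 corresponds to even odds, i.e. a probability of 0.5; the 0.29 decision threshold would then be applied to this probability.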
21. Prototype of App
To access the application, please navigate to
the following address:
www.tinyurl.com/hdprototype
22. Roles and Responsibilities
• KNN & Boosting - Ravi & Eugenia
• Bagging & Random Forests - Zijian & Jiayang
• Logistic & LDA/QDA - Shijie & Armen
• Final Model runs and app design - Ravi
• Missing value imputation and final QA - Zijian & Jiayang
• Final report writing - Armen, Eugenia, Shijie