This document summarizes a study of asthma prevalence in the United States in 2017. It presents an exploratory analysis of the relationships between asthma and variables such as age, sex, race, income, and geography. A logistic regression model is then built to predict the likelihood of asthma from these variables. Tested on a separate dataset, the model achieves 84% accuracy and an area under the ROC curve of 0.6695, suggesting that logistic regression can reasonably classify individuals as asthmatic or non-asthmatic. Challenges arising from the survey method and missing data are also noted.
4. THE DATASET
• SOURCE: LANDLINE/CELLULAR SURVEY BY CDC IN 2017
• 450,016 OBSERVATIONS – 358 VARIABLES
• SELECTED 11 VARIABLES
5. ASTHMA PREVALENCE WITH RESPECT TO
• AGE GROUP
• SEX
• RACE
• INCOME GROUP
• EDUCATION LEVEL
• WEIGHT GROUP
• SMOKING STATUS
• DEPRESSIVE DISORDER
• PHYSICAL ACTIVITY
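As a rough illustration of how prevalence within each group could be computed, here is a minimal base R sketch. The data frame `survey`, its column names, and the helper `prevalence_by` are hypothetical and stand in for the survey variables. It assumes asthma status is already coded 0/1, so the group mean equals the group prevalence.

```r
# Hypothetical toy data; column names are illustrative, not the survey's own.
survey <- data.frame(
  AsthmaStatus = c(1, 0, 0, 1, 0, 0, 0, 1),
  Sex          = c("F", "F", "M", "F", "M", "M", "F", "M"),
  stringsAsFactors = TRUE
)

# Prevalence within each group = mean of the 0/1 asthma indicator.
prevalence_by <- function(df, group_col) {
  tapply(df$AsthmaStatus, df[[group_col]], mean)
}

prev_sex <- prevalence_by(survey, "Sex")
print(prev_sex)
```

The same helper could be called once per grouping variable (age group, race, income group, and so on) to build the comparisons listed above.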
6. GEOGRAPHIC DISTRIBUTION - COUNT
Largest count of asthma:
• Kansas
• Florida
Smallest count of asthma:
• Alaska
• Nevada
7. GEOGRAPHIC DISTRIBUTION - PROBABILITY
Largest Probability:
• West Virginia
Smallest Probability:
• Minnesota
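The count and probability views can rank states differently, because a raw count depends on how many respondents each state contributed. A small sketch with invented data (the state labels "A" and "B" and all values are hypothetical) shows the distinction, assuming probability here means the share of a state's respondents who report asthma:

```r
# Hypothetical toy data: state labels and values are invented for illustration.
survey <- data.frame(
  State        = c("A", "A", "A", "A", "A", "A", "B", "B"),
  AsthmaStatus = c(1,   1,   0,   0,   0,   0,   1,   0)
)

# Number of respondents reporting asthma in each state.
asthma_count <- tapply(survey$AsthmaStatus, survey$State, sum)

# Share of respondents reporting asthma in each state.
asthma_prob <- tapply(survey$AsthmaStatus, survey$State, mean)

# The state with the largest count need not have the largest share.
top_count <- names(which.max(asthma_count))
top_prob  <- names(which.max(asthma_prob))
print(c(top_count = top_count, top_prob = top_prob))
```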
9. FULL MODEL WITH FULL DATASET
• ASTHMA STATUS: CHANGED TO BINARY
• 0: DOES NOT CURRENTLY HAVE ASTHMA
• 1: CURRENTLY HAS ASTHMA
• CHANGE ALL 9 PREDICTORS TO FACTOR VARIABLES
glm(AsthmaStatus ~ ., family = binomial(link = "logit"), data = data)
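To show how the steps on this slide might fit together, the sketch below converts predictors to factors, fits the logistic regression, and scores a held-out set. It is not the study's actual code. The data are simulated, the two predictors and their effects are made up, and the 70/30 split, the 0.5 cutoff, and the helper `auc` are all assumptions.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the survey; variable names and effects are hypothetical.
data <- data.frame(
  Sex     = sample(c("Male", "Female"), n, replace = TRUE),
  Smoking = sample(c("Never", "Former", "Current"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
logit <- -1.5 + 0.6 * (data$Sex == "Female") + 0.5 * (data$Smoking == "Current")
data$AsthmaStatus <- rbinom(n, 1, plogis(logit))

# Convert every predictor to a factor, as the slide describes.
predictors <- setdiff(names(data), "AsthmaStatus")
data[predictors] <- lapply(data[predictors], factor)

# Hold out 30% of rows for testing (the split ratio is an assumption).
test_idx <- sample(n, size = round(0.3 * n))
train <- data[-test_idx, ]
test  <- data[test_idx, ]

fit <- glm(AsthmaStatus ~ ., family = binomial(link = "logit"), data = train)

# Predicted probabilities, then a 0.5 cutoff (the cutoff is also an assumption).
p_hat <- predict(fit, newdata = test, type = "response")
pred  <- as.integer(p_hat > 0.5)
accuracy <- mean(pred == test$AsthmaStatus)

# AUC via the rank (Mann-Whitney) formulation, using base R only.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_value <- auc(p_hat, test$AsthmaStatus)

print(c(accuracy = accuracy, auc = auc_value))
```

When the outcome is imbalanced, accuracy can be high simply because most respondents do not have asthma, so it helps to read the AUC alongside it.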