Classification of Heart Diseases Patients using Data Mining Techniques
Project: Mortality Rate Analysis in USA for Deadly Causes
Jatri Dave (jad752) , Prashantkumar Patel (pnp249)
December 14, 2016
Project Outline
We obtained the data from CHSI (Community Health Status Indicators). In this project we try to identify the leading causes of death in the four major regions of the USA (Northeast, West, Midwest, South). After identifying the major causes of death, we analyse the daily human characteristics that contribute most to them.
The steps we implemented to solve the problem stated above are briefly explained below.
Data Cleaning and Normalization.
First, when we gathered the data we did not realize that a significant amount of it was missing. In addition, the data contained hundreds of unnecessary features, so we selected only the features required for the project. Moreover, the data was not on a single, balanced scale: some values were percentages, some were rates per 100,000, some were raw population counts, and so on. We needed a common scale on which to normalize them. Furthermore, the data did not cover a single time span; for example, some of it spanned 1999-2003 and some 1995-2003. Hence, we normalized the data to individual years. We performed all of these operations in Microsoft Excel (2016). We then combined the necessary features into a comma-separated values (CSV) file, which we use directly in R for the project.
Partitioning the data into region-wise data sets.
The code is shown below.
#Reading the data
data<-read.csv("E:/NYU/1/Foundation of Data Science/Projects/Foundations-of-Data-Science/USADataCleanPra
#Adding new column for region selection
data[,"region"] <- NA
#Removing unnecessary columns
data$X<-NULL
data$X.1<-NULL
#Partitioning the data based on regions. We manually used the names of the states in each region to create the region codes.
#region1
data$region[data$CHSI_State_Name=="Connecticut" | data$CHSI_State_Name=="Maine" | data$CHSI_State_Name==
#region2
data$region[data$CHSI_State_Name=="Illinois" | data$CHSI_State_Name=="Indiana" | data$CHSI_State_Name=="
#region3
data$region[data$CHSI_State_Name=="Delaware" | data$CHSI_State_Name=="Florida" | data$CHSI_State_Name=="
#region4
data$region[data$CHSI_State_Name=="Arizona" | data$CHSI_State_Name=="Colorado" | data$CHSI_State_Name=="
#converting state names into lower case letters
data$CHSI_State_Name <- tolower(data$CHSI_State_Name)
#Creating separate data sets for each region so that we can perform analysis on these separate data sets
Northeast<- data[data$region==1,]
Midwest<-data[data$region==2,]
South<-data[data$region==3,]
West<-data[data$region==4,]
Region 1 (Northeast region)
We will analyze the Northeast region for the problem described above. All the code and the necessary visualizations are included in the following section.
#Cleaning the data in region 1.
#Removing missing data.
Northeast <- subset(Northeast, Northeast$No_Exercise!=0)
Northeast <- subset(Northeast, Northeast$Few_Fruit_Veg!=0)
Northeast <- subset(Northeast, Northeast$Obesity!=0)
Northeast <- subset(Northeast, Northeast$High_Blood_Pres!=0)
Northeast <- subset(Northeast, Northeast$Smoker!=0)
Northeast <- subset(Northeast, Northeast$Diabetes!=0)
Northeast <- subset(Northeast, Northeast$Lung_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Col_Cancer!=0)
Northeast <- subset(Northeast, Northeast$CHD!=0)
Northeast <- subset(Northeast, Northeast$Brst_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Suicide!=0)
Northeast <- subset(Northeast, Northeast$Total_Death_Causes!=0)
Northeast <- subset(Northeast, Northeast$Injury!=0)
Northeast<-subset(Northeast,Northeast$Stroke!=0)
Northeast <- subset(Northeast, Northeast$MVA!=0)
We then applied regression models relating the different kinds of deaths to the total number of deaths. For simplicity we include here only the top two reasons people are dying in Region 1. We reached that conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximize the R^2 value. The following table shows our experimental results.
disease region1 (R squared)
breast cancer 0.04
mva 0.17
chd 0.71
colon cancer 0.24
lung cancer 0.16
injury 0.1
suicide 0.07
stroke 0.04
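The univariate screening described above can be sketched as a small helper. This is a minimal illustration: `rank_by_r2` is our own name for it, and in the report it would be applied to the `Northeast` data frame with the disease columns from the table above.

```r
# Sketch: rank candidate causes of death by the R-squared of a
# univariate regression of total deaths on each disease column.
rank_by_r2 <- function(df, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# In this report, roughly:
# rank_by_r2(Northeast, "Total_Death_Causes",
#            c("Brst_Cancer", "MVA", "CHD", "Col_Cancer",
#              "Lung_Cancer", "Injury", "Suicide", "Stroke"))
```

The top-ranked predictors are then combined into the multivariate model fitted below.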
Based on the R^2 values we consider the following diseases to be the major reasons people are dying in the Northeast region.
1. CHD (Coronary heart disease)
2. Colon Cancer
#Since we have taken CHD and Colon Cancer as the major reasons why people are dying, we will perform multivariate regression.
regressionModel<-lm(Northeast$Total_Death_Causes~Northeast$CHD+Northeast$Col_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Northeast$Total_Death_Causes ~ Northeast$CHD + Northeast$Col_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.766 -5.234 -1.086 5.389 28.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.05596 2.51849 3.596 0.000433 ***
## Northeast$CHD 1.00017 0.04322 23.141 < 2e-16 ***
## Northeast$Col_Cancer 8.15527 0.44732 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.382 on 157 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9616
## F-statistic: 1991 on 2 and 157 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: diagnostic plots for the Region 1 regression model - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
Now that we have established the major diseases in the Northeast region, we will analyse the relationship between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot. After that we try to find the confounders and perform multivariate regression with training and testing data.
The following code describes the procedure for CHD.
NE.states<-Northeast
NE.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8886204 0.8202385 0.7790907
## No_Exercise 0.8886204 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.8202385 0.8788325 1.0000000 0.9008216
## Obesity 0.7790907 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.7685505 0.8430622 0.9158679 0.8826446
## Smoker 0.7693031 0.8041099 0.8774229 0.9155866
## Diabetes 0.7731015 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## CHD 0.7685505 0.7693031 0.7731015
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8705658
#Summary
summary(NE.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7686
## Diabetes :7 Diabetes :7 1st Qu.:0.8340
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8706
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9008
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in the risk-factor data for Region 1 (CHD and risk factors), with correlations ranging from 0.7 to 1.0.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures$Obesity),
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_NE
[Figure: scatterplot of Percentage of Obesity vs. Percentage of No Exercise, points colored by % Obesity.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: scatterplot of Percentage of No Exercise vs. Percentage of Few Fruits and Vegetables, points colored by % Few Fruits and Vegetables.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedur
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: scatterplot of Percentage of No Exercise vs. Percentage of High Blood Pressure, points colored by % High Blood Pressure.]
#Plotting the multivariate scatter plot in order to understand the correlations better.
pairs(~NE.Procedures$CHD+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obesity+NE.Procedures$High_Blood_Pres+NE.Procedures$Smoker+NE.Procedures$Diabetes, main="Multivariate Scatterplot : Region 1")
[Figure: multivariate scatterplot of CHD and the risk factors for Region 1.]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most highly correlated with:
1. No Exercise.
Now we will find the confounding variables. Using confounding variables we can fit the model better, and the output predictions will be less error prone.
Based on the correlation heatmap and the scatterplot, we take the following as candidate confounders: 1. High Blood Pressure 2. Diabetes 3. Obesity.
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-NE.Procedures[train_part,]
TestingCHD<-NE.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## TrainingCHD$No_Exercise 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$High_Blood_Pres,C3=TrainingCHD$Diabetes)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## E 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.604 -6.818 -1.170 5.811 77.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9462 3.4303 1.733 0.0855 .
## E 5.7057 0.6647 8.583 3.06e-14 ***
## C1 1.3495 0.7389 1.826 0.0702 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.39 on 125 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.78
## F-statistic: 226.2 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with High Blood pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.440 -6.496 -1.075 5.928 81.639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3274 3.4403 1.549 0.1240
## E 5.6011 0.6125 9.145 1.39e-15 ***
## C2 1.2760 0.5716 2.232 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.31 on 125 degrees of freedom
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7828
## F-statistic: 229.9 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.229 -6.466 -1.525 6.379 79.871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.057 3.284 2.149 0.0336 *
## E 5.612 0.612 9.171 1.2e-15 ***
## C3 4.022 1.817 2.213 0.0287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.32 on 125 degrees of freedom
## Multiple R-squared: 0.7861, Adjusted R-squared: 0.7827
## F-statistic: 229.7 on 2 and 125 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.767 -6.870 -1.347 5.835 81.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5001 3.5445 1.552 0.123
## E 5.1938 0.7261 7.153 6.68e-11 ***
## C1 0.4091 0.9195 0.445 0.657
## C2 0.7147 0.7485 0.955 0.342
## C3 2.0806 2.4541 0.848 0.398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.34 on 123 degrees of freedom
## Multiple R-squared: 0.7886, Adjusted R-squared: 0.7817
## F-statistic: 114.7 on 4 and 123 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for the multivariate CHD model - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$High_Blood_Pres,C3=TestingCHD$Diabetes)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 8567.965
SS.regression/SS.total
## [1] 0.5136492
#This is the R-squared value for the testing data
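One caveat, offered as a suggestion rather than a correction: on held-out data the decomposition SS.total = SS.regression + SS.residual no longer holds exactly (hence the nonzero difference printed above), so a common alternative is to report the test R^2 as 1 - SS.residual/SS.total. A minimal sketch:

```r
# Sketch: out-of-sample R-squared computed as 1 - SSres/SStot,
# which avoids relying on the training-only ANOVA decomposition.
test_r2 <- function(y, y_hat) {
  ss_res <- sum((y - y_hat)^2)
  ss_tot <- sum((y - mean(y))^2)
  1 - ss_res / ss_tot
}

# Here it would be called as test_r2(test.y, test.predictions).
```

This form can only decrease when predictions worsen, which makes it easier to compare models on the same test set.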
Now that we have fitted the regression model for "CHD", we will do the same for colon cancer.
Here is the code for the colon cancer model.
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("Col_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## Col_Cancer No_Exercise Few_Fruit_Veg Obesity
## Col_Cancer 1.0000000 0.8447630 0.9067467 0.8433966
## No_Exercise 0.8447630 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.9067467 0.8788325 1.0000000 0.9008216
## Obesity 0.8433966 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.8606917 0.8430622 0.9158679 0.8826446
## Smoker 0.8306595 0.8041099 0.8774229 0.9155866
## Diabetes 0.7996891 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## Col_Cancer 0.8606917 0.8306595 0.7996891
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8822818
#Summary
summary(NE.Melt)
## x y value
## Col_Cancer :7 Col_Cancer :7 Min. :0.7997
## Diabetes :7 Diabetes :7 1st Qu.:0.8448
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8823
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9067
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in the risk-factor data for Region 1 (colon cancer and risk factors), with correlations ranging from 0.7 to 1.0.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS Few Fruits and Vegetables
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Procedures$Obesity),
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Few Fruits and Vegetables")
plt_obevsNoEx_NE
[Figure: scatterplot of Percentage of Obesity vs. Percentage of Few Fruits and Vegetables, points colored by % Obesity.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: scatterplot of Percentage of No Exercise vs. Percentage of Few Fruits and Vegetables, points colored by % Few Fruits and Vegetables.]
#graph of Few Fruits and Vegetables VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Proced
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: scatterplot of Percentage of Few Fruits and Vegetables vs. Percentage of High Blood Pressure, points colored by % High Blood Pressure.]
#Plotting the multivariate scatter plot in order to understand the correlations better.
pairs(~NE.Procedures$Col_Cancer+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obesity+NE.Procedures$High_Blood_Pres+NE.Procedures$Smoker+NE.Procedures$Diabetes, main="Multivariate Scatterplot : Region 1")
[Figure: multivariate scatterplot of colon cancer and the risk factors for Region 1.]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that colon cancer is most highly correlated with:
1. Few Fruits and Vegetables.
Now we will find the confounding variables. Using confounding variables we can fit the model better, and the output predictions will be less error prone.
Based on the correlation heatmap and the scatterplot, we take the following as candidate confounders: 1. No Exercise 2. High Blood Pressure 3. Obesity.
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$Col_Cancer,p = 0.80,list = FALSE)
TrainingColon <- NE.Procedures[train_part,]
TestingColon <- NE.Procedures[-train_part,]
#Performing regression between Colon Cancer and Few fruits and vegetables
chdRegr<-lm(TrainingColon$Col_Cancer~TrainingColon$Few_Fruit_Veg)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingColon$Col_Cancer ~ TrainingColon$Few_Fruit_Veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## TrainingColon$Few_Fruit_Veg 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingColon$Few_Fruit_Veg, C1=TrainingColon$Obesity,C2=TrainingColon$High_Blood_Pres,C3=TrainingColon$Diabetes)
temp <- mutate(temp, O = TrainingColon$Col_Cancer)
#Regression on Colon Cancer and Few Fruits and Vegetables
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## E 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1345 -0.6380 -0.0406 0.5283 3.7629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66357 0.33068 2.007 0.0469 *
## E 0.22619 0.02392 9.455 2.32e-16 ***
## C1 0.12091 0.07412 1.631 0.1054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.095 on 126 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8317
## F-statistic: 317.2 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3833 -0.6435 0.0166 0.4775 3.8119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68140 0.32910 2.070 0.0405 *
## E 0.21531 0.02585 8.330 1.16e-13 ***
## C2 0.13163 0.06773 1.944 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8331
## F-statistic: 320.5 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -0.6606 -0.0144 0.4913 3.8662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71645 0.32566 2.200 0.0296 *
## E 0.21276 0.02128 10.000 <2e-16 ***
## C3 0.14657 0.05628 2.604 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.078 on 126 degrees of freedom
## Multiple R-squared: 0.8394, Adjusted R-squared: 0.8369
## F-statistic: 329.4 on 2 and 126 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0938 -0.6481 0.0027 0.4265 3.7933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.70568 0.32562 2.167 0.0321 *
## E 0.17857 0.03122 5.719 7.56e-08 ***
## C1 0.03900 0.07983 0.489 0.6261
## C2 0.09059 0.07048 1.285 0.2011
## C3 0.11996 0.06030 1.989 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.077 on 124 degrees of freedom
## Multiple R-squared: 0.8424, Adjusted R-squared: 0.8373
## F-statistic: 165.7 on 4 and 124 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for the multivariate colon cancer model - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1=TestingColon$Obesity,C2=TestingColon$High_Blood_Pres,C3=TestingColon$Diabetes)
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
SS.regression/SS.total
## [1] 0.7494172
#This is the R-squared value for the testing data
Region 2 (Midwest region)
We will analyze the Midwest region for the same problem. All the code and the necessary visualizations are included in the following section.
#Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest<-subset(Midwest,Midwest$Stroke!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
We then applied regression models relating the different kinds of deaths to the total number of deaths in this region. For simplicity we include here only the top two reasons people are dying in Region 2. We reached that conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximize the R^2 value. The following table shows our experimental results.
disease region2 (R squared)
breast cancer 0.016
mva 0.19
chd 0.77
colon cancer 0.12
lung cancer 0.25
injury 0.11
suicide 0.05
stroke 0.13
Based on the R^2 values we consider the following diseases to be the major reasons people are dying in the Midwest region.
1. CHD (Coronary heart disease)
2. Lung Cancer
#Since we have taken CHD and Lung Cancer as the major reasons why people are dying, we will perform multivariate regression.
regressionModel<-lm(Midwest$Total_Death_Causes~Midwest$CHD+Midwest$Lung_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) for the Midwest total-deaths model.]
Now that we have established the major diseases in the Midwest region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot. After that we will try to find the confounders
and perform multivariate regression with training and testing data.
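One simple heuristic for flagging candidate confounders from a correlation matrix is to keep the variables strongly correlated with both the exposure and the outcome. A sketch, with the 0.85 cutoff chosen arbitrarily for illustration:

```r
# Flag variables correlated above a cutoff with both the outcome
# (e.g. CHD) and the exposure (e.g. No_Exercise).
flag_confounders <- function(cm, outcome, exposure, cutoff = 0.85) {
  others <- setdiff(colnames(cm), c(outcome, exposure))
  others[cm[outcome, others] > cutoff & cm[exposure, others] > cutoff]
}
# e.g. flag_confounders(cor(MW.Procedures), "CHD", "No_Exercise")
```

This automates the visual reading of the heatmap; the cutoff should be tuned to the data rather than taken as given.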
Following is the code which describes the procedure for CHD.
MW.states<-Midwest
MW.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.9109385 0.8956826 0.9072785
## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907
## Obesity 0.9072785 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831
## Smoker 0.9076982 0.9292124 0.9400104 0.9312576
## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## CHD 0.9045115 0.9076982 0.8902612
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9275097
#Summary
summary(MW.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.8688
## Diabetes :7 Diabetes :7 1st Qu.:0.9073
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: heat map of correlations in Risk Factors data, Region 2 (CHD).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Obesity),
color = factor(signif(MW.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_MW
[Figure: percentage of Obesity vs. percentage of No Exercise, Region 2.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_MW
[Figure: percentage of No Exercise vs. percentage of Smoker, Region 2.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$High_Blood_Pres),
                                                       color = factor(signif(MW.Procedures$High_Blood_Pres, 0)))) +
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_MW
[Figure: percentage of No Exercise vs. percentage of High Blood Pressure, Region 2.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$CHD+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: multivariate scatterplot of CHD and risk factors, Region 2.]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most
highly correlated with:
1. No Exercise
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, so the
predictions will be less error prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. High Blood Pressure
2. Smoker
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-MW.Procedures[train_part,]
TestingCHD<-MW.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; nevertheless, we will add the confounders and then
perform the multivariate regression.
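The one-confounder-at-a-time comparisons below can also be judged with a formal nested-model F-test via `anova()`. A sketch on a built-in dataset rather than the report's data:

```r
# Does adding hp improve a model of mpg that already contains wt?
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
anova(m1, m2)  # a small Pr(>F) means the extra term earns its keep
```

The same call, e.g. `anova(reg.E, reg.EC1234)`, would test whether the full confounder set improves significantly on the single-predictor model.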
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$High_Blood_Pres)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No excercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.236 -5.870 -0.862 5.417 36.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6657 1.2036 0.553 0.581
## E 3.9856 0.4212 9.462 < 2e-16 ***
## C1 3.1306 0.4123 7.593 3.64e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.723 on 313 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681
## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.550 -5.426 -0.776 4.322 37.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6510 1.1790 2.249 0.0252 *
## E 4.0462 0.4223 9.581 < 2e-16 ***
## C2 2.8735 0.3872 7.420 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.757 on 313 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671
## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.324 -5.482 -0.944 5.139 34.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4248 1.1767 1.211 0.227
## E 4.1168 0.3898 10.560 < 2e-16 ***
## C3 2.8134 0.3545 7.936 3.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.654 on 313 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699
## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.493 -5.404 -0.842 4.904 33.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3202 1.1689 1.129 0.259602
## E 2.8621 0.4635 6.175 2.07e-09 ***
## C1 1.1091 0.5728 1.936 0.053747 .
## C2 1.5669 0.4458 3.514 0.000506 ***
## C3 1.4402 0.4852 2.968 0.003230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.36 on 311 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777
## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
#Now we test our regression model with the testing data to check the performance.
#(The next three lines are reconstructed from the analogous Lung Cancer section; the original page was lost in extraction.)
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$High_Blood_Pres)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 64 65 66 67 68 69 70
## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330
## 71 72 73 74 75 76
## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] -1040.415
SS.regression/SS.total
## [1] 0.854508
#This is the R-squared value for the testing data
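A note on the computation above: on held-out data the identity SS.total = SS.regression + SS.residual no longer holds (hence the non-zero difference printed), so the usual out-of-sample R² is written as 1 - SS.residual/SS.total. A small helper:

```r
# Out-of-sample R-squared: fraction of held-out variance explained.
oos_r2 <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}
# e.g. oos_r2(test.y, test.predictions)
```

Unlike SS.regression/SS.total, this definition penalizes biased predictions and can even go negative for a model worse than predicting the mean.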
Now that we have fitted the regression model for CHD, we will do the same for lung cancer.
Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666
## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907
## Obesity 0.9459666 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831
## Smoker 0.9494486 0.9292124 0.9400104 0.9312576
## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.9279799 0.9494486 0.8984837
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9354118
#Summary
summary(MW.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.8688
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: heat map of correlations in Risk Factors data, Region 2 (Lung Cancer).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS Few Fruits and Vegetables
plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedures$Smoker),
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Smoker vs.\n Percentage of Few Fruits and Vegetables")
plt_Smovsfru_MW
[Figure: percentage of Smoker vs. percentage of Few Fruits and Vegetables, Region 2.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_MW
[Figure: percentage of No Exercise vs. percentage of Smoker, Region 2.]
#graph of Smoker VS Obesity
plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smoker),
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'Obesity %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_MW
[Figure: percentage of Obesity vs. percentage of Smoker, Region 2.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$Lung_Cancer+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: multivariate scatterplot of Lung Cancer and risk factors, Region 2.]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that lung cancer is
most highly correlated with:
1. Smoker
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, so the
predictions will be less error prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; nevertheless, we will add the confounders and then
perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=TrainingLung$No_Exercise)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2839 -1.2744 0.1438 1.3621 7.6394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1171 0.3171 -3.523 0.00049 ***
## E 1.3294 0.1012 13.133 < 2e-16 ***
## C1 1.1735 0.1075 10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.598 on 313 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269
## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8397 -1.2740 0.0911 1.3515 8.0799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 ***
## E 1.24197 0.11201 11.088 < 2e-16 ***
## C2 0.39028 0.03696 10.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.621 on 313 degrees of freedom
## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256
## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4258 -1.3278 -0.2252 1.5037 7.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9052 0.3127 -2.895 0.00406 **
## E 1.3013 0.1054 12.341 < 2e-16 ***
## C3 1.2172 0.1137 10.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.611 on 313 degrees of freedom
## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262
## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0675 -1.1707 0.1069 1.2777 7.5823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 ***
## E 0.67286 0.11725 5.739 2.27e-08 ***
## C1 0.44770 0.13018 3.439 0.000664 ***
## C2 0.21279 0.04183 5.087 6.31e-07 ***
## C3 0.79075 0.11385 6.946 2.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.327 on 311 degrees of freedom
## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414
## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) for the lung cancer confounder model.]
#Now we test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=TestingLung$No_Exercise)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
SS.regression/SS.total
## [1] 0.8705158
#This is the R-squared value for the testing data
Region 3 (South region)
We will be analyzing the South region for the following problem. All the code and necessary visualizations
are included in the following section.
#Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South<-subset(South,South$Stroke!=0)
South <- subset(South, South$MVA!=0)
Next we fit regression models relating the different causes of death to the total number of deaths in this
region. For simplicity we report only the top three causes of death in Region 3. We reached this conclusion
by running a univariate regression of total deaths against each individual disease and then combining the
features that maximized the R² value. The following table shows our experimental results.
| disease | Region 3 (R²) |
|---|---|
| breast cancer | 0.06 |
| MVA | 0.26 |
| CHD | 0.79 |
| colon cancer | 0.16 |
| lung cancer | 0.35 |
| injury | 0.17 |
| suicide | 0.03 |
| stroke | 0.14 |
Based on the R² values, we consider the following diseases the major causes of death in the
South region.
1. CHD (coronary heart disease)
2. Lung cancer
3. MVA (motor vehicle accidents)
#Since we have identified CHD, Lung Cancer and MVA as the major causes of death, we will perform multivariate regression on them.
regressionModel<-lm(South$Total_Death_Causes~South$CHD+South$Lung_Cancer+South$MVA)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.105 -5.062 -0.461 4.612 30.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17128 0.98511 1.189 0.235
## South$CHD 1.16308 0.02562 45.398 < 2e-16 ***
## South$Lung_Cancer 2.84552 0.07955 35.770 < 2e-16 ***
## South$MVA 1.08049 0.15806 6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) for the South total-deaths model.]
Now that we have established the major diseases in the South region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot. After that we will try to find the confounders
and perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
SO.states<-South
SO.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8652391 0.8531583 0.8529559
## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308
## Obesity 0.8529559 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588
## Smoker 0.8387031 0.8636349 0.8920701 0.8694016
## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## CHD 0.8240507 0.8387031 0.7821009
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8792101
#Summary
summary(SO.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7821
## Diabetes :7 Diabetes :7 1st Qu.:0.8530
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.8, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: heat map of correlations in Risk Factors data, Region 3.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Obesity),
color = factor(signif(SO.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_SO
[Figure: percentage of Obesity vs. percentage of No Exercise, Region 3.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_SO
[Figure: percentage of No Exercise vs. percentage of Smoker, Region 3.]
#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Few_Fruit_Veg),
                                                     color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0)))) +
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
plt_noExvsfru_SO
[Figure: scatter plot, Percentage of No Exercise vs. Percentage of Few Fruits And Vegetables]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$CHD+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 (CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, Diabetes)]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most highly correlated with:

1. No Exercise

Now we will find the confounding variables; using them we can fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the likely confounders are:

1. Few Fruits And Vegetables
2. Smoker
3. Obesity
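Why a confounder matters can be seen in a small simulation: when the exposure and the outcome share a common cause, the single-variable coefficient absorbs part of the confounder's effect. This is a toy sketch on simulated data (not the CHSI data); the variable names are only illustrative.

```r
# Toy simulation of omitted-variable bias:
# E is the exposure (think No Exercise), C a confounder (think Obesity).
set.seed(1)
n <- 500
C <- rnorm(n)                    # confounder
E <- 0.8 * C + rnorm(n)          # exposure correlated with the confounder
O <- 2 * E + 3 * C + rnorm(n)    # outcome; the true effect of E is 2

coef(lm(O ~ E))["E"]        # inflated: absorbs part of C's effect
coef(lm(O ~ E + C))["E"]    # close to the true value 2
```

Dropping the confounder biases the exposure coefficient toward cov(O, E)/var(E), which is the motivation for adding Obesity, Smoker, and Few_Fruit_Veg to the CHD model below before trusting the fit.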
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-SO.Procedures[train_part,]
TestingCHD<-SO.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; still, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## E 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.692 -6.553 -0.426 6.782 49.965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7745 1.5323 2.463 0.0141 *
## E 3.6456 0.3685 9.893 < 2e-16 ***
## C1 3.1443 0.4080 7.707 8.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 445 degrees of freedom
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837
## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.419 -7.093 -0.348 6.464 50.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2383 1.5379 1.455 0.146
## E 3.8791 0.3103 12.503 <2e-16 ***
## C2 3.0972 0.3567 8.684 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.46 on 445 degrees of freedom
## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904
## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.863 -6.174 -0.278 6.583 48.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0217 1.4558 2.076 0.0385 *
## E 3.3541 0.3044 11.019 <2e-16 ***
## C3 1.1467 0.1064 10.776 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.608 -6.372 -0.209 6.585 48.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9532 1.4613 1.337 0.182035
## E 2.5933 0.3754 6.907 1.72e-11 ***
## C1 0.8052 0.4921 1.636 0.102483
## C2 1.4445 0.4122 3.504 0.000505 ***
## C3 0.7514 0.1527 4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
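The warning above is worth a note: `abline()` can only draw a line from an intercept and a single slope, so on a model with five coefficients it silently uses the first two and the drawn line is not meaningful. The `plot(reg.EC1234)` call already produces the four standard diagnostic panels, so the `abline()` call could simply be dropped; if a visual check of fit is wanted, a fitted-vs-observed plot with a y = x reference line works in any dimension. A sketch on toy data (`reg` stands in for `reg.EC1234`; names are illustrative, not from the report's data):

```r
# Toy multivariate fit; d is simulated, not the CHSI data.
set.seed(3)
d <- data.frame(E = rnorm(100), C1 = rnorm(100), C2 = rnorm(100))
d$O <- 1 + 2 * d$E + d$C1 - d$C2 + rnorm(100)
reg <- lm(O ~ E + C1 + C2, data = d)   # 4 coefficients, so abline(reg) would warn

plot(fitted(reg), d$O, xlab = "Fitted values", ylab = "Observed O",
     main = "Fitted vs. observed")
abline(a = 0, b = 1)   # y = x reference line: valid regardless of dimension
```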
#Now we will test our regression model with testing data to check the performance.
#(Block reconstructed to mirror the lung-cancer testing section below; only prediction rows 64-110 survived extraction.)
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 64 65 66 67 68 69 70
## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659
## 71 72 73 74 75 76 77
## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857
## 78 79 80 81 82 83 84
## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464
## 85 86 87 88 89 90 91
## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904
## 92 93 94 95 96 97 98
## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157
## 99 100 101 102 103 104 105
## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587
## 106 107 108 109 110
## 60.00317 46.91665 59.17558 47.64249 24.00712
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 2561.812
SS.regression/SS.total
## [1] 0.7099096
#This is the R-squared value for the testing data
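A note on the quantity computed above: the decomposition SS.total = SS.regression + SS.residual holds exactly only on the training data, where the OLS residuals are orthogonal to the fitted values. On held-out data the cross term does not vanish, which is why `SS.total - (SS.regression+SS.residual)` is nonzero, and `SS.regression/SS.total` can differ from the more common predictive R-squared, `1 - SS.residual/SS.total`. A minimal sketch with made-up numbers standing in for `test.y` and `test.predictions`:

```r
# Made-up test targets and out-of-sample predictions; only the identity check matters.
set.seed(2)
test.y <- rnorm(50, mean = 60, sd = 15)
test.predictions <- test.y + rnorm(50, sd = 8)   # imperfect predictions

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)

SS.total - (SS.regression + SS.residual)   # nonzero: the ANOVA identity fails out of sample
1 - SS.residual / SS.total                 # predictive R-squared
```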
Now that we have fitted the regression model for CHD, we will do the same for lung cancer. Here is the code for the lung cancer model.
#Creating a data frame in order to generate correlation heat map
SO.Procedures<-SO.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772
## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308
## Obesity 0.8688772 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588
## Smoker 0.9145492 0.8636349 0.8920701 0.8694016
## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.8698291 0.9145492 0.7993788
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8860905
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.7994
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, blue) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 (Lung_Cancer and risk factors; legend: correlations 0.900-1.000)]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS High Blood Pressure
plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$Smoker),
                            color = factor(signif(SO.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Smoker vs.\n Percentage of High Blood Pressure")
plt_Smovsblood_SO
[Figure: scatter plot, Percentage of Smoker vs. Percentage of High Blood Pressure]
#graph of Few Fruits and Vegetables VS Smoker
plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedures$Smoker),
                          color = factor(signif(SO.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of Smoker")
plt_fruvsSmo_SO
[Figure: scatter plot, Percentage of Few Fruits and Vegetables vs. Percentage of Smoker]
#graph of Smoker VS Obesity
plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smoker),
                          color = factor(signif(SO.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'Obesity %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_SO
[Figure: scatter plot, Percentage of Obesity vs. Percentage of Smoker]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$Lung_Cancer+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 (Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, Diabetes)]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that lung cancer is most highly correlated with:

1. Smoker

Now we will find the confounding variables; using them we can fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the likely confounders are:

1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
lungRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(lungRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
As you can see, this model also already has good accuracy; still, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=TrainingLung$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## E 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.098 -1.752 -0.184 1.773 16.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92734 0.41643 2.227 0.0265 *
## E 1.71902 0.09419 18.250 < 2e-16 ***
## C1 0.75201 0.09266 8.115 4.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 445 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491
## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4221 -1.8371 -0.1437 1.7834 17.3119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99954 0.40928 2.442 0.015 *
## E 1.65336 0.09547 17.317 <2e-16 ***
## C2 0.70444 0.08065 8.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.078 on 445 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521
## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -1.660 -0.174 1.687 15.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14230 0.40135 2.846 0.00463 **
## E 1.50746 0.10386 14.515 < 2e-16 ***
## C3 0.29928 0.03189 9.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.044 on 445 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554
## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8100 -1.7365 -0.1321 1.7312 15.9466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79644 0.39875 1.997 0.0464 *
## E 1.29944 0.10929 11.889 < 2e-16 ***
## C1 0.18044 0.12280 1.469 0.1424
## C2 0.38893 0.09591 4.055 5.92e-05 ***
## C3 0.17908 0.04193 4.271 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 443 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626
## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots for reg.EC1234 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3=TestingLung$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 882.6084
SS.regression/SS.total
## [1] 0.7934475
#This is the R-squared value for the testing data
Similarly, we have done the same calculations for MVA (motor vehicle accidents). Here is the code for the MVA model.
#Creating a data frame in order to generate correlation heat map
SO.Procedures<-SO.states[,c("MVA","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## MVA No_Exercise Few_Fruit_Veg Obesity
## MVA 1.0000000 0.7037265 0.5939313 0.6419514
## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308
## Obesity 0.6419514 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588
## Smoker 0.6515928 0.8636349 0.8920701 0.8694016
## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## MVA 0.6708440 0.6515928 0.6625232
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8346534
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.5939
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681
## MVA :7 MVA :7 Mean :0.8347
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.5, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 (MVA and risk factors; legend: correlations 0.5-1.0)]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$No_Exercise),
                             color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
  geom_point() +
  scale_color_discrete(name="% No Exercise") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_SO
[Figure: scatter plot, Percentage of No Exercise vs. Percentage of High Blood Pressure]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Smoker),
                           color = factor(signif(SO.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_SO
[Figure: scatter plot, Percentage of No Exercise vs. Percentage of Smoker]
#graph of Diabetes VS No Exercise
plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No_Exercise),
                           color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
  geom_point() +
  scale_color_discrete(name="% No Exercise") +
  scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  ggtitle(label = "Percentage of Diabetes vs.\n Percentage of No Exercise")
plt_noExvsDia_SO
[Figure: scatter plot, Percentage of Diabetes vs. Percentage of No Exercise]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$MVA+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 (MVA, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, Diabetes)]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that MVA is most highly correlated with:

1. No Exercise

Now we will find the confounding variables; using them we can fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the likely confounders are:

1. High Blood Pressure
2. Diabetes
3. Smoker
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
mvaRegr<-lm(TrainingMVA$MVA~TrainingMVA$No_Exercise)
#Regression Summary
summary(mvaRegr)
##
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## TrainingMVA$No_Exercise 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
As you can see, the model already has decent accuracy; still, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1=TrainingMVA$Smoker,C2=TrainingMVA$High_Blood_Pres,C3=TrainingMVA$Diabetes)
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## E 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -1.1454 -0.0693 1.0158 10.1595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52869 0.26785 5.707 2.1e-08 ***
## E 0.50745 0.05789 8.765 < 2e-16 ***
## C1 0.09930 0.06317 1.572 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5950 -1.1474 -0.0641 1.0431 10.7984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.51447 0.26458 5.724 1.91e-08 ***
## E 0.45073 0.05872 7.676 1.05e-13 ***
## C2 0.14265 0.05414 2.635 0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6516 -1.1369 -0.0745 1.1027 10.0249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60721 0.26225 6.129 1.95e-09 ***
## E 0.45541 0.05443 8.367 7.68e-16 ***
## C3 0.44833 0.15954 2.810 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4964 -1.1325 -0.0407 1.0714 10.5865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53795 0.26771 5.745 1.71e-08 ***
## E 0.39844 0.06998 5.693 2.27e-08 ***
## C1 0.02554 0.06972 0.366 0.7143
## C2 0.08004 0.06679 1.198 0.2314
## C3 0.31152 0.18517 1.682 0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients