Project: Mortality Rate Analysis of Deadly Causes in the USA
Jatri Dave (jad752) , Prashantkumar Patel (pnp249)
December 14, 2016
Project Outline
We obtained the data from CHSI (Community Health Status Indicators). In this project we identify the leading causes of death in the four major regions of the USA (Northeast, Midwest, South, West). After finding the major causes of death, we analyse the daily human behaviours that contribute to them.
The steps we took to solve this problem are briefly explained below.
Data Cleaning and Normalizing.
First, when we gathered the data we did not realize how much of it was missing. There were also hundreds of unnecessary features, so we selected only the features required for the project. Moreover, the data was not on a common scale: some values were percentages, some were rates per 100,000, and some were raw population counts. We needed a common scale on which to normalize them. Furthermore, the data did not all cover the same time span; for example, some covered 1999-2003 and some 1995-2003, so we normalized the data to individual years. We performed all of these operations in Microsoft Excel (2016), then combined the necessary features into a comma-separated values (CSV) file that we read directly into R.
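As an illustration of the rescaling step (the column names below are hypothetical, and the actual cleaning was done in Excel, not R), a rate per 100,000 and a multi-year total can be brought onto a common per-100-persons, per-year footing:

```r
# Hypothetical sketch of the normalization described above (real work was done in Excel).
# Bring a per-100,000 rate onto a per-100 (percentage) scale, and convert a
# 5-year total (e.g. 1999-2003) to a single-year figure.
df <- data.frame(
  deaths_per_100k = c(250, 410),    # rate per 100,000
  obesity_pct     = c(24.5, 31.2),  # already a percentage
  injuries_5yr    = c(1500, 2300)   # total over a 5-year span
)
df$deaths_pct  <- df$deaths_per_100k / 1000  # per 100,000 -> per 100
df$injuries_yr <- df$injuries_5yr / 5        # normalize to a single year
df
```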
Partitioning the data into region-wise subsets.
The code is described below.
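The chunks that follow call `melt`, `ggplot`, `createDataPartition`, and `mutate` without showing the corresponding `library()` calls, which were presumably in a hidden setup chunk; under that assumption, the packages needed are:

```r
# Packages presumably loaded in a hidden knitr setup chunk (an assumption;
# the original document does not show its library() calls).
library(ggplot2)   # ggplot(), geom_tile(), etc.
library(reshape2)  # melt() for the correlation matrices
library(caret)     # createDataPartition() for the train/test splits
library(dplyr)     # mutate()
```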
#Reading the data
data<-read.csv("E:/NYU/1/Foundation of Data Science/Projects/Foundations-of-Data-Science/USADataCleanPra
#Adding new column for region selection
data[,"region"] <- NA
#Removing unnecessary columns
data$X<-NULL
data$X.1<-NULL
#Partitioning the data based on regions. We have manually used the names of the states in order to crea
#region1
data$region[data$CHSI_State_Name=="Connecticut" | data$CHSI_State_Name=="Maine" | data$CHSI_State_Name==
#region2
data$region[data$CHSI_State_Name=="Illinois" | data$CHSI_State_Name=="Indiana" | data$CHSI_State_Name=="
#region3
data$region[data$CHSI_State_Name=="Delaware" | data$CHSI_State_Name=="Florida" | data$CHSI_State_Name=="
#region4
data$region[data$CHSI_State_Name=="Arizona" | data$CHSI_State_Name=="Colorado" | data$CHSI_State_Name=="
#converting state names into lower case letters
data$CHSI_State_Name <- tolower(data$CHSI_State_Name)
#Creating separate data sets for each region so that we can perform analysis on them separately
Northeast<- data[data$region==1,]
Midwest<-data[data$region==2,]
South<-data[data$region==3,]
West<-data[data$region==4,]
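The long chains of `|` comparisons above can be written more compactly with `%in%`. A sketch on synthetic data, using the standard U.S. Census region groupings (which the truncated state lists above presumably follow):

```r
# Sketch: assign region codes with %in% instead of long | chains.
# State groupings here follow the standard U.S. Census regions (an assumption
# about the truncated lists in the original code).
northeast <- c("Connecticut","Maine","Massachusetts","New Hampshire","Rhode Island",
               "Vermont","New Jersey","New York","Pennsylvania")
midwest   <- c("Illinois","Indiana","Michigan","Ohio","Wisconsin","Iowa","Kansas",
               "Minnesota","Missouri","Nebraska","North Dakota","South Dakota")
data <- data.frame(CHSI_State_Name = c("Maine","Ohio","Texas"), stringsAsFactors = FALSE)
data$region <- NA
data$region[data$CHSI_State_Name %in% northeast] <- 1
data$region[data$CHSI_State_Name %in% midwest]   <- 2
data$region
```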
Region 1 (Northeast region)
We analyze the Northeast region for the problem stated above. All the code and necessary visualizations are included in the following section.
#Cleaning the data in region 1.
#Removing missing data.
Northeast <- subset(Northeast, Northeast$No_Exercise!=0)
Northeast <- subset(Northeast, Northeast$Few_Fruit_Veg!=0)
Northeast <- subset(Northeast, Northeast$Obesity!=0)
Northeast <- subset(Northeast, Northeast$High_Blood_Pres!=0)
Northeast <- subset(Northeast, Northeast$Smoker!=0)
Northeast <- subset(Northeast, Northeast$Diabetes!=0)
Northeast <- subset(Northeast, Northeast$Lung_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Col_Cancer!=0)
Northeast <- subset(Northeast, Northeast$CHD!=0)
Northeast <- subset(Northeast, Northeast$Brst_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Suicide!=0)
Northeast <- subset(Northeast, Northeast$Total_Death_Causes!=0)
Northeast <- subset(Northeast, Northeast$Injury!=0)
Northeast<-subset(Northeast,Northeast$Stroke!=0)
Northeast <- subset(Northeast, Northeast$MVA!=0)
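The fifteen `subset` calls above can be collapsed into one vectorized filter. A sketch on a small synthetic frame (the real column list has fifteen entries, with 0 serving as the dataset's missing-value code):

```r
# Sketch: drop rows where any listed column is 0 (the dataset's missing-value code),
# replacing a long chain of subset() calls with a single vectorized filter.
cols <- c("No_Exercise", "Obesity", "CHD")  # in the report this list has 15 columns
Northeast <- data.frame(
  No_Exercise = c(25, 0, 30),
  Obesity     = c(20, 22, 0),
  CHD         = c(150, 160, 170)
)
Northeast <- Northeast[rowSums(Northeast[, cols] == 0) == 0, ]
nrow(Northeast)  # only the first row survives
```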
Next we regressed the total number of deaths on the different kinds of death. For simplicity we include only the top two causes of death in region 1. We reached this conclusion by running a univariate regression of total deaths against each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.
disease         region1 (R squared)
-------------   -------------------
breast cancer   0.04
mva             0.17
chd             0.71
colon cancer    0.24
lung cancer     0.16
injury          0.10
suicide         0.07
stroke          0.04
Based on the R^2 values, we consider the following diseases the major causes of death in the Northeast region:
1. CHD (coronary heart disease)
2. Colon cancer
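The univariate screening behind the table above can be sketched as a loop over candidate predictors (synthetic data here; in the project each disease was regressed against `Total_Death_Causes` and ranked by R^2):

```r
# Sketch of the univariate R^2 screening: regress total deaths on each disease
# separately and compare R^2 values. Synthetic data for illustration only.
set.seed(1)
df <- data.frame(CHD = rnorm(50, 100, 20), Col_Cancer = rnorm(50, 8, 2))
df$Total_Death_Causes <- df$CHD + 8 * df$Col_Cancer + rnorm(50, 0, 10)

r2_for <- function(predictor) {
  summary(lm(reformulate(predictor, "Total_Death_Causes"), data = df))$r.squared
}
sapply(c("CHD", "Col_Cancer"), r2_for)
```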
#Since we have taken CHD and Colon Cancer as the major causes of death, we will perform multivariate regression.
regressionModel<-lm(Northeast$Total_Death_Causes~Northeast$CHD+Northeast$Col_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Northeast$Total_Death_Causes ~ Northeast$CHD + Northeast$Col_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.766 -5.234 -1.086 5.389 28.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.05596 2.51849 3.596 0.000433 ***
## Northeast$CHD 1.00017 0.04322 23.141 < 2e-16 ***
## Northeast$Col_Cancer 8.15527 0.44732 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.382 on 157 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9616
## F-statistic: 1991 on 2 and 157 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: diagnostic plots for the regression model — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
Now that we have established the major diseases in the Northeast region, we analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot; then we look for confounders and perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
NE.states<-Northeast
NE.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8886204 0.8202385 0.7790907
## No_Exercise 0.8886204 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.8202385 0.8788325 1.0000000 0.9008216
## Obesity 0.7790907 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.7685505 0.8430622 0.9158679 0.8826446
## Smoker 0.7693031 0.8041099 0.8774229 0.9155866
## Diabetes 0.7731015 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## CHD 0.7685505 0.7693031 0.7731015
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8705658
#Summary
summary(NE.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7686
## Diabetes :7 Diabetes :7 1st Qu.:0.8340
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8706
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9008
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Building color ramps to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map of the correlations, built with ggplot2
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in risk factors data, Region 1 (CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.7 to 1.0).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_NE
[Figure: percentage of obesity vs. percentage of no exercise, Region 1.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: percentage of no exercise vs. percentage of few fruits and vegetables, Region 1.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedur
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: percentage of no exercise vs. percentage of high blood pressure, Region 1.]
#Plotting the multivariate scatterplot in order to understand the correlations better.
pairs(~NE.Procedures$CHD+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obesity+NE.
[Figure: multivariate scatterplot of CHD and the risk factors, Region 1.]
Based on the heatmap and the multivariate scatterplot above, we can see that CHD is most highly correlated with:
1. No exercise.
Now we will find the confounding variables. Using confounders we can fit the model better, and the predictions will be less error-prone. Based on the correlation heatmap and the scatterplot, we take the confounders to be: 1. High blood pressure 2. Diabetes 3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-NE.Procedures[train_part,]
TestingCHD<-NE.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## TrainingCHD$No_Exercise 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
Although the model already fits well, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1 = TrainingCHD$Obesity, C2 = TrainingCHD$High_Blood_Pres, C3 = TrainingCHD$Diabetes)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## E 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.604 -6.818 -1.170 5.811 77.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9462 3.4303 1.733 0.0855 .
## E 5.7057 0.6647 8.583 3.06e-14 ***
## C1 1.3495 0.7389 1.826 0.0702 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.39 on 125 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.78
## F-statistic: 226.2 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.440 -6.496 -1.075 5.928 81.639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3274 3.4403 1.549 0.1240
## E 5.6011 0.6125 9.145 1.39e-15 ***
## C2 1.2760 0.5716 2.232 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.31 on 125 degrees of freedom
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7828
## F-statistic: 229.9 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.229 -6.466 -1.525 6.379 79.871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.057 3.284 2.149 0.0336 *
## E 5.612 0.612 9.171 1.2e-15 ***
## C3 4.022 1.817 2.213 0.0287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.32 on 125 degrees of freedom
## Multiple R-squared: 0.7861, Adjusted R-squared: 0.7827
## F-statistic: 229.7 on 2 and 125 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the suspected confounders.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.767 -6.870 -1.347 5.835 81.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5001 3.5445 1.552 0.123
## E 5.1938 0.7261 7.153 6.68e-11 ***
## C1 0.4091 0.9195 0.445 0.657
## C2 0.7147 0.7485 0.955 0.342
## C3 2.0806 2.4541 0.848 0.398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.34 on 123 degrees of freedom
## Multiple R-squared: 0.7886, Adjusted R-squared: 0.7817
## F-statistic: 114.7 on 4 and 123 DF, p-value: < 2.2e-16
#plotting regression diagnostics
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1 = TestingCHD$Obesity, C2 = TestingCHD$High_Blood_Pres, C3 = TestingCHD$Diabetes)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 8567.965
SS.regression/SS.total
## [1] 0.5136492
#This is the R-squared value for the testing data
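Note that on held-out data the ANOVA identity SS.total = SS.regression + SS.residual no longer holds (the nonzero difference printed above shows this), so out-of-sample R^2 is more commonly defined as 1 - SS.residual/SS.total. A small self-contained sketch:

```r
# Sketch: out-of-sample R^2 computed as 1 - SS.residual/SS.total.
# On test data SS.total != SS.regression + SS.residual, so the two
# definitions of R^2 disagree. Synthetic data for illustration.
set.seed(2)
train <- data.frame(x = rnorm(80)); train$y <- 2 * train$x + rnorm(80)
test  <- data.frame(x = rnorm(20)); test$y  <- 2 * test$x  + rnorm(20)

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

SS.total    <- sum((test$y - mean(test$y))^2)
SS.residual <- sum((test$y - pred)^2)
r2.test     <- 1 - SS.residual / SS.total
r2.test
```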
Now that we have fitted the regression model for CHD, we will do the same for colon cancer. Here is the code for the colon cancer model.
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("Col_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## Col_Cancer No_Exercise Few_Fruit_Veg Obesity
## Col_Cancer 1.0000000 0.8447630 0.9067467 0.8433966
## No_Exercise 0.8447630 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.9067467 0.8788325 1.0000000 0.9008216
## Obesity 0.8433966 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.8606917 0.8430622 0.9158679 0.8826446
## Smoker 0.8306595 0.8041099 0.8774229 0.9155866
## Diabetes 0.7996891 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## Col_Cancer 0.8606917 0.8306595 0.7996891
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8822818
#Summary
summary(NE.Melt)
## x y value
## Col_Cancer :7 Col_Cancer :7 Min. :0.7997
## Diabetes :7 Diabetes :7 1st Qu.:0.8448
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8823
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9067
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Building color ramps to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map of the correlations, built with ggplot2
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in risk factors data, Region 1 (Col_Cancer, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.7 to 1.0).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS Few Fruits and Vegetables
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = NE.Procedures$Few_Fruit_Veg, y = NE.Procedures$Obesity,
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Few Fruits and Vegetables")
plt_obevsNoEx_NE
[Figure: percentage of obesity vs. percentage of few fruits and vegetables, Region 1.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: percentage of no exercise vs. percentage of few fruits and vegetables, Region 1.]
#graph of Few Fruits and Vegetables VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Proced
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: percentage of few fruits and vegetables vs. percentage of high blood pressure, Region 1.]
#Plotting the multivariate scatterplot in order to understand the correlations better.
pairs(~NE.Procedures$Col_Cancer+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obes
[Figure: multivariate scatterplot of colon cancer and the risk factors, Region 1.]
Based on the heatmap and the multivariate scatterplot above, we can see that colon cancer is most highly correlated with:
1. Few fruits and vegetables.
Now we will find the confounding variables. Using confounders we can fit the model better, and the predictions will be less error-prone. Based on the correlation heatmap and the scatterplot, we take the confounders to be: 1. No exercise 2. High blood pressure 3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$Col_Cancer,p = 0.80,list = FALSE)
TrainingColon <- NE.Procedures[train_part,]
TestingColon <- NE.Procedures[-train_part,]
#Performing regression between Colon Cancer and Few fruits and vegetables
chdRegr<-lm(TrainingColon$Col_Cancer~TrainingColon$Few_Fruit_Veg)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingColon$Col_Cancer ~ TrainingColon$Few_Fruit_Veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## TrainingColon$Few_Fruit_Veg 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
Although the model already fits well, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingColon$Few_Fruit_Veg, C1 = TrainingColon$Obesity, C2 = TrainingColon$High_Blood_Pres, C3 = TrainingColon$Diabetes)
temp <- mutate(temp, O = TrainingColon$Col_Cancer)
#Regression on Colon Cancer and Few Fruits and Vegetables
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## E 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1345 -0.6380 -0.0406 0.5283 3.7629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66357 0.33068 2.007 0.0469 *
## E 0.22619 0.02392 9.455 2.32e-16 ***
## C1 0.12091 0.07412 1.631 0.1054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.095 on 126 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8317
## F-statistic: 317.2 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3833 -0.6435 0.0166 0.4775 3.8119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68140 0.32910 2.070 0.0405 *
## E 0.21531 0.02585 8.330 1.16e-13 ***
## C2 0.13163 0.06773 1.944 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8331
## F-statistic: 320.5 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -0.6606 -0.0144 0.4913 3.8662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71645 0.32566 2.200 0.0296 *
## E 0.21276 0.02128 10.000 <2e-16 ***
## C3 0.14657 0.05628 2.604 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.078 on 126 degrees of freedom
## Multiple R-squared: 0.8394, Adjusted R-squared: 0.8369
## F-statistic: 329.4 on 2 and 126 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the suspected confounders.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0938 -0.6481 0.0027 0.4265 3.7933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.70568 0.32562 2.167 0.0321 *
## E 0.17857 0.03122 5.719 7.56e-08 ***
## C1 0.03900 0.07983 0.489 0.6261
## C2 0.09059 0.07048 1.285 0.2011
## C3 0.11996 0.06030 1.989 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.077 on 124 degrees of freedom
## Multiple R-squared: 0.8424, Adjusted R-squared: 0.8373
## F-statistic: 165.7 on 4 and 124 DF, p-value: < 2.2e-16
#plotting regression diagnostics
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: diagnostic plots for reg.EC1234 (colon cancer model) — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1 = TestingColon$Obesity, C2 = TestingColon$High_Blood_Pres, C3 = TestingColon$Diabetes)
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
SS.regression/SS.total
## [1] 0.7494172
#This is the R-squared value for the testing data
Region 2 (Midwest region)
We analyze the Midwest region for the same problem. All the code and necessary visualizations are included in the following section.
#Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest<-subset(Midwest,Midwest$Stroke!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
Next we regressed the total number of deaths on the different kinds of death in this region. For simplicity we include only the top two causes of death in region 2. We reached this conclusion by running a univariate regression of total deaths against each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.
disease         region2 (R squared)
-------------   -------------------
breast cancer   0.016
mva             0.19
chd             0.77
colon cancer    0.12
lung cancer     0.25
injury          0.11
suicide         0.05
stroke          0.13
Based on the R^2 values, we consider the following diseases the major causes of death in the Midwest region:
1. CHD (coronary heart disease)
2. Lung cancer
#Since we have taken CHD and Lung Cancer as the major causes of death, we will perform multivariate regression.
regressionModel<-lm(Midwest$Total_Death_Causes~Midwest$CHD+Midwest$Lung_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: lm diagnostic plots for the total-deaths model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
Now that we have established the major diseases in the Midwest region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot, and then look for the confounders and
perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
MW.states<-Midwest
MW.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.9109385 0.8956826 0.9072785
## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907
## Obesity 0.9072785 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831
## Smoker 0.9076982 0.9292124 0.9400104 0.9312576
## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## CHD 0.9045115 0.9076982 0.8902612
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9275097
#Summary
summary(MW.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.8688
## Diabetes :7 Diabetes :7 1st Qu.:0.9073
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 2" over CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Obesity),
                                                     color = factor(signif(MW.Procedures$Obesity, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Obesity") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
  ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_MW
[Figure: scatterplot "Percentage of obesity vs. Percentage of No Exercise", colored by % Obesity.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
                                                     color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_MW
[Figure: scatterplot "Percentage of No Exercise vs. Smoker", colored by % Smoker.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$High_Blood_Pres),
                                                       color = factor(signif(MW.Procedures$High_Blood_Pres, 0)))) +
  geom_point() +
  scale_color_discrete(name="% High Blood Pressure") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_MW
[Figure: scatterplot "Percentage of No Exercise vs. Percentage of High Blood Pressure", colored by % High Blood Pressure.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$CHD+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: "Multivariate Scatterplot : Region 2" pairs plot of CHD and the risk factors.]
Based on the heatmap and the multivariate scatterplot above, we can see that CHD is most highly
correlated with:
1. No exercise.
Next we find the confounding variables; accounting for confounders lets us fit the model better and makes
the predicted output less error prone. Based on the correlation heatmap and the scatterplot, we take the
confounders to be:
1. High Blood Pressure
2. Smoker
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-MW.Procedures[train_part,]
TestingCHD<-MW.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
Although the model already fits well, we will still add the confounders and then perform a multivariate
regression.
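Before fitting the combined model, one optional sanity check (not part of the original analysis) is an F-test between the nested models, which tells us whether an added confounder significantly improves the fit. The data below is synthetic, purely to illustrate the call.

```r
# Sketch (synthetic data): does adding a correlated confounder C1 to the
# exposure-only model significantly improve the fit?
set.seed(2)
n  <- 200
E  <- rnorm(n, 25, 5)                   # exposure, e.g. % no exercise
C1 <- 0.8 * E + rnorm(n, 0, 3)          # confounder correlated with the exposure
O  <- 4 * E + 3 * C1 + rnorm(n, 0, 10)  # outcome, e.g. CHD rate

small <- lm(O ~ E)
full  <- lm(O ~ E + C1)
anova(small, full)  # a small Pr(>F) means the confounder adds explanatory power
```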
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$High_Blood_Pres)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.236 -5.870 -0.862 5.417 36.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6657 1.2036 0.553 0.581
## E 3.9856 0.4212 9.462 < 2e-16 ***
## C1 3.1306 0.4123 7.593 3.64e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.723 on 313 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681
## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.550 -5.426 -0.776 4.322 37.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6510 1.1790 2.249 0.0252 *
## E 4.0462 0.4223 9.581 < 2e-16 ***
## C2 2.8735 0.3872 7.420 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.757 on 313 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671
## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.324 -5.482 -0.944 5.139 34.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4248 1.1767 1.211 0.227
## E 4.1168 0.3898 10.560 < 2e-16 ***
## C3 2.8134 0.3545 7.936 3.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.654 on 313 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699
## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders we identified.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.493 -5.404 -0.842 4.904 33.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3202 1.1689 1.129 0.259602
## E 2.8621 0.4635 6.175 2.07e-09 ***
## C1 1.1091 0.5728 1.936 0.053747 .
## C2 1.5669 0.4458 3.514 0.000506 ***
## C3 1.4402 0.4852 2.968 0.003230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.36 on 311 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777
## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: lm diagnostic plots for the CHD model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$High_Blood_Pres)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 47.35611 71.16578 74.24867 74.90092 88.18896 85.76173 53.21561
## 8 9 10 11 12 13 14
## 46.55992 49.44050 85.85667 62.59279 42.36807 92.85802 41.66495
## 15 16 17 18 19 20 21
## 90.65576 38.70206 44.74763 18.59404 16.51755 23.08953 19.65119
## 22 23 24 25 26 27 28
## 17.38575 19.99998 40.34271 25.57896 22.67974 18.95382 75.20951
## 29 30 31 32 33 34 35
## 20.59550 44.87894 22.94132 21.52470 28.72008 16.55805 21.38459
## 36 37 38 39 40 41 42
## 53.07606 51.01929 52.00420 22.51745 111.28415 89.03471 80.60128
## 43 44 45 46 47 48 49
## 70.85496 44.13435 65.03187 73.84524 36.00484 39.05419 62.18069
## 50 51 52 53 54 55 56
## 29.67611 63.29361 76.79059 38.52942 50.68324 49.06936 21.51555
## 57 58 59 60 61 62 63
## 50.18942 42.64157 20.31264 78.53103 20.19143 18.05501 20.39115
## 64 65 66 67 68 69 70
## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330
## 71 72 73 74 75 76
## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] -1040.415
SS.regression/SS.total
## [1] 0.854508
#This is the R-squared value for the testing data.
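A note on the test-set check above: on held-out data the decomposition SS.total = SS.regression + SS.residual no longer holds exactly (that gap is the nonzero difference printed above), so a common alternative is to report 1 - SS.residual/SS.total. A minimal helper, with made-up numbers for illustration:

```r
# Out-of-sample R^2 computed as 1 - SS.residual / SS.total.
test_r2 <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}
test_r2(c(3, 5, 7, 9), c(3.1, 4.8, 7.2, 8.9))  # 0.995
```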
Now that we have fitted the regression model for "CHD", we will do the same for lung cancer.
Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666
## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907
## Obesity 0.9459666 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831
## Smoker 0.9494486 0.9292124 0.9400104 0.9312576
## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.9279799 0.9494486 0.8984837
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9354118
#Summary
summary(MW.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.8688
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 2" over Diabetes, Few_Fruit_Veg, High_Blood_Pres, Lung_Cancer, No_Exercise, Obesity, Smoker; correlation scale 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS Few Fruits and Vegetables
plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedures$Smoker),
                                                    color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Smoker vs.\n Percentage of Few Fruits and Vegetables")
plt_Smovsfru_MW
[Figure: scatterplot "Percentage of Smoker vs. Percentage of Few Fruits and Vegetables", colored by % Smoker.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
                                                     color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_MW
[Figure: scatterplot "Percentage of No Exercise vs. Percentage of Smoker", colored by % Smoker.]
#graph of Smoker VS Obesity
plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smoker),
                                                    color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'Obesity %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_MW
[Figure: scatterplot "Percentage of Obesity vs. Percentage of Smoker", colored by % Smoker.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$Lung_Cancer+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: "Multivariate Scatterplot : Region 2" pairs plot of lung cancer and the risk factors.]
Based on the heatmap and the multivariate scatterplot above, we can see that lung cancer is most highly
correlated with:
1. Smoker.
Next we find the confounding variables; accounting for confounders lets us fit the model better and makes
the predicted output less error prone. Based on the correlation heatmap and the scatterplot, we take the
confounders to be:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
lungRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(lungRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
Although the model already fits well, we will still add the confounders and then perform a multivariate
regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=TrainingLung$No_Exercise)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2839 -1.2744 0.1438 1.3621 7.6394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1171 0.3171 -3.523 0.00049 ***
## E 1.3294 0.1012 13.133 < 2e-16 ***
## C1 1.1735 0.1075 10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.598 on 313 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269
## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8397 -1.2740 0.0911 1.3515 8.0799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 ***
## E 1.24197 0.11201 11.088 < 2e-16 ***
## C2 0.39028 0.03696 10.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.621 on 313 degrees of freedom
## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256
## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4258 -1.3278 -0.2252 1.5037 7.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9052 0.3127 -2.895 0.00406 **
## E 1.3013 0.1054 12.341 < 2e-16 ***
## C3 1.2172 0.1137 10.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.611 on 313 degrees of freedom
## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262
## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders we identified.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0675 -1.1707 0.1069 1.2777 7.5823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 ***
## E 0.67286 0.11725 5.739 2.27e-08 ***
## C1 0.44770 0.13018 3.439 0.000664 ***
## C2 0.21279 0.04183 5.087 6.31e-07 ***
## C3 0.79075 0.11385 6.946 2.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.327 on 311 degrees of freedom
## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414
## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: lm diagnostic plots for the lung cancer model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=TestingLung$No_Exercise)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
SS.regression/SS.total
## [1] 0.8705158
#This is the R-squared value for the testing data.
Region 3 (South region)
We will now analyse the South region for the same problem. All the code and necessary visualizations
are included in the following section.
#Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South <- subset(South, South$Stroke!=0)
South <- subset(South, South$MVA!=0)
We then regressed the individual causes of death against the total number of deaths in this region. For
simplicity we include only the top three causes of death in region 3. We reached this conclusion by running
a univariate regression of total deaths on each disease and then combining the features that maximized the
R² value. The following table shows our experimental results.
| disease       | region 3 (R²) |
|---------------|---------------|
| breast cancer | 0.06          |
| mva           | 0.26          |
| chd           | 0.79          |
| colon cancer  | 0.16          |
| lung cancer   | 0.35          |
| injury        | 0.17          |
| suicide       | 0.03          |
| stroke        | 0.14          |
Based on the R² values, we consider the following diseases to be the major causes of death in the South
region.
1. CHD (coronary heart disease)
2. Lung Cancer
3. MVA (motor vehicle accidents)
#Since we have identified CHD, Lung Cancer, and MVA as the major causes of death, we will perform multivariate regression on them.
regressionModel<-lm(South$Total_Death_Causes~South$CHD+South$Lung_Cancer+South$MVA)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.105 -5.062 -0.461 4.612 30.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17128 0.98511 1.189 0.235
## South$CHD 1.16308 0.02562 45.398 < 2e-16 ***
## South$Lung_Cancer 2.84552 0.07955 35.770 < 2e-16 ***
## South$MVA 1.08049 0.15806 6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: lm diagnostic plots for the total-deaths model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
Now that we have established the major diseases in the South region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot, and then look for the confounders and
perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
SO.states<-South
SO.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8652391 0.8531583 0.8529559
## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308
## Obesity 0.8529559 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588
## Smoker 0.8387031 0.8636349 0.8920701 0.8694016
## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## CHD 0.8240507 0.8387031 0.7821009
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8792101
#Summary
summary(SO.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7821
## Diabetes :7 Diabetes :7 1st Qu.:0.8530
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.8, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 3" over CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.80 to 1.00.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Obesity),
                                                     color = factor(signif(SO.Procedures$Obesity, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Obesity") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
  ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_SO
[Figure: scatter plot "Percentage of Obesity vs. Percentage of No Exercise" — axes in percent, points colored by % Obesity.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_SO
[Figure: scatter plot "Percentage of No Exercise vs. Smoker" — axes in percent, points colored by % Smoker.]
#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Few_Fruit_Veg),
color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0)))) +
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
plt_noExvsfru_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of Few Fruits And Vegetables" — axes in percent, points colored by % Few Fruits And Vegetables.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$CHD+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, CHD is most highly correlated with:
1. No Exercise.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. Few Fruits And Vegetables
2. Smoker
3. Obesity
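The confounder screening above can be sketched as a quick rule of thumb: a variable is a plausible confounder when it correlates with both the exposure and the outcome. Here is a minimal sketch on synthetic data — the variable names echo the report's columns, but the numbers are made up purely for illustration:

```r
# Synthetic illustration of confounder screening (made-up data, not SO.Procedures).
set.seed(1)
no_exercise <- rnorm(200, mean = 12, sd = 3)                           # exposure
obesity     <- 0.8 * no_exercise + rnorm(200, sd = 2)                  # candidate confounder
chd         <- 5 + 4 * no_exercise + 2 * obesity + rnorm(200, sd = 5)  # outcome

cor(obesity, no_exercise)  # high: the candidate is tied to the exposure
cor(obesity, chd)          # high: and to the outcome, so we adjust for it
```

A candidate that fails either check is unlikely to bias the coefficient on the exposure and can be left out of the model.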
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-SO.Procedures[train_part,]
TestingCHD<-SO.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has good accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## E 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on CHD and No excercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.692 -6.553 -0.426 6.782 49.965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7745 1.5323 2.463 0.0141 *
## E 3.6456 0.3685 9.893 < 2e-16 ***
## C1 3.1443 0.4080 7.707 8.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 445 degrees of freedom
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837
## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.419 -7.093 -0.348 6.464 50.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2383 1.5379 1.455 0.146
## E 3.8791 0.3103 12.503 <2e-16 ***
## C2 3.0972 0.3567 8.684 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.46 on 445 degrees of freedom
## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904
## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.863 -6.174 -0.278 6.583 48.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0217 1.4558 2.076 0.0385 *
## E 3.3541 0.3044 11.019 <2e-16 ***
## C3 1.1467 0.1064 10.776 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.608 -6.372 -0.209 6.585 48.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9532 1.4613 1.337 0.182035
## E 2.5933 0.3754 6.907 1.72e-11 ***
## C1 0.8052 0.4921 1.636 0.102483
## C2 1.4445 0.4122 3.504 0.000505 ***
## C3 0.7514 0.1527 4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
#Now we will test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 93.11863 54.33024 52.85395 45.79852 84.97513 56.88569 90.36550
## 8 9 10 11 12 13 14
## 54.14329 27.01033 21.19971 22.31869 51.65249 44.54791 47.33240
## 15 16 17 18 19 20 21
## 82.57801 95.08141 82.71398 87.34274 77.96068 95.69180 39.98928
## 22 23 24 25 26 27 28
## 92.73972 78.85468 78.94780 80.59074 80.79515 88.44522 51.13704
## 29 30 31 32 33 34 35
## 53.76243 54.93526 44.19820 61.92209 47.95062 57.45545 51.17947
## 36 37 38 39 40 41 42
## 27.33281 62.49750 23.71136 28.54283 26.16418 27.07990 58.66436
## 43 44 45 46 47 48 49
## 102.65357 91.78205 105.83215 55.19101 80.64446 40.27956 53.95203
## 50 51 52 53 54 55 56
## 48.46951 84.13853 40.82755 102.13602 54.52694 48.51574 50.28585
## 57 58 59 60 61 62 63
## 65.78484 53.00168 50.13015 53.58431 48.56988 45.02007 80.19727
## 64 65 66 67 68 69 70
## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659
## 71 72 73 74 75 76 77
## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857
## 78 79 80 81 82 83 84
## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464
## 85 86 87 88 89 90 91
## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904
## 92 93 94 95 96 97 98
## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157
## 99 100 101 102 103 104 105
## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587
## 106 107 108 109 110
## 60.00317 46.91665 59.17558 47.64249 24.00712
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 2561.812
SS.regression/SS.total
## [1] 0.7099096
#This is the R-squared value of the regression on the testing data
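One caveat about this metric: on held-out data, SS.total generally no longer equals SS.regression + SS.residual (the nonzero difference printed above shows exactly that), so the more standard out-of-sample R-squared is 1 - SS.residual/SS.total rather than SS.regression/SS.total. A small self-contained sketch with made-up numbers:

```r
# Made-up test-set values, just to show the computation.
test.y           <- c(10, 12, 15, 20, 23)
test.predictions <- c(11, 11, 16, 19, 24)

SS.total    <- sum((test.y - mean(test.y))^2)        # 118
SS.residual <- sum((test.y - test.predictions)^2)    # 5

# Out-of-sample R^2: fraction of outcome variance the model explains.
R2 <- 1 - SS.residual / SS.total
R2  # 1 - 5/118, about 0.958
```

For the CHD model above this definition would give a slightly different number than the 0.71 reported, but the qualitative conclusion is the same.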
Now that we have fitted the regression model for "CHD", we will do the same for lung cancer. Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772
## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308
## Obesity 0.8688772 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588
## Smoker 0.9145492 0.8636349 0.8920701 0.8694016
## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.8698291 0.9145492 0.7993788
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8860905
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.7994
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, blue) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 — Diabetes, Few_Fruit_Veg, High_Blood_Pres, Lung_Cancer, No_Exercise, Obesity, and Smoker on both axes; correlation color scale from 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS High Blood Pressure
plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Smoker vs.\n Percentage of High Blood Pressure")
plt_Smovsblood_SO
[Figure: scatter plot "Percentage of Smoker vs. Percentage of High Blood Pressure" — axes in percent, points colored by % Smoker.]
#graph of Few Fruits and Vegetables VS Smoker
plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of Smoker")
plt_fruvsSmo_SO
[Figure: scatter plot "Percentage of Few Fruits and Vegetables vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#graph of Smoker VS Obesity
plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'Obesity %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_SO
[Figure: scatter plot "Percentage of Obesity vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$Lung_Cancer+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, lung cancer is most highly correlated with:
1. Smoker.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has good accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=TrainingLung$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## E 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.098 -1.752 -0.184 1.773 16.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92734 0.41643 2.227 0.0265 *
## E 1.71902 0.09419 18.250 < 2e-16 ***
## C1 0.75201 0.09266 8.115 4.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 445 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491
## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4221 -1.8371 -0.1437 1.7834 17.3119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99954 0.40928 2.442 0.015 *
## E 1.65336 0.09547 17.317 <2e-16 ***
## C2 0.70444 0.08065 8.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.078 on 445 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521
## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -1.660 -0.174 1.687 15.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14230 0.40135 2.846 0.00463 **
## E 1.50746 0.10386 14.515 < 2e-16 ***
## C3 0.29928 0.03189 9.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.044 on 445 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554
## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8100 -1.7365 -0.1321 1.7312 15.9466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79644 0.39875 1.997 0.0464 *
## E 1.29944 0.10929 11.889 < 2e-16 ***
## C1 0.18044 0.12280 1.469 0.1424
## C2 0.38893 0.09591 4.055 5.92e-05 ***
## C3 0.17908 0.04193 4.271 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 443 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626
## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3=TestingLung$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 882.6084
SS.regression/SS.total
## [1] 0.7934475
#This is the R-squared value of the regression on the testing data
Similarly, we have performed the same calculations for MVA. Here is the code for the MVA model.
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("MVA","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## MVA No_Exercise Few_Fruit_Veg Obesity
## MVA 1.0000000 0.7037265 0.5939313 0.6419514
## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308
## Obesity 0.6419514 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588
## Smoker 0.6515928 0.8636349 0.8920701 0.8694016
## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## MVA 0.6708440 0.6515928 0.6625232
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8346534
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.5939
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681
## MVA :7 MVA :7 Mean :0.8347
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.5, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 — Diabetes, Few_Fruit_Veg, High_Blood_Pres, MVA, No_Exercise, Obesity, and Smoker on both axes; correlation color scale from 0.5 to 1.0.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of no Exercise VS High Blood Pressure
plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$No_Exercise),
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of High Blood Pressure" — axes in percent, points colored by % No Exercise.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#graph of Diabetes VS No Exercise
plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No_Exercise),
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of Diabetes vs.\n Percentage of No Exercise")
plt_noExvsDia_SO
[Figure: scatter plot "Percentage of Diabetes vs. Percentage of No Exercise" — axes in percent, points colored by % No Exercise.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$MVA+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of MVA, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, MVA is most highly correlated with:
1. No Exercise.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. High Blood Pressure
2. Diabetes
3. Smoker
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
mvaRegr<-lm(TrainingMVA$MVA~TrainingMVA$No_Exercise)
#Regression Summary
summary(mvaRegr)
##
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## TrainingMVA$No_Exercise 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has decent accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1=TrainingMVA$Smoker,C2=TrainingMVA$High_Blood_Pres,C3=TrainingMVA$Diabetes)
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## E 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -1.1454 -0.0693 1.0158 10.1595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52869 0.26785 5.707 2.1e-08 ***
## E 0.50745 0.05789 8.765 < 2e-16 ***
## C1 0.09930 0.06317 1.572 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5950 -1.1474 -0.0641 1.0431 10.7984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.51447 0.26458 5.724 1.91e-08 ***
## E 0.45073 0.05872 7.676 1.05e-13 ***
## C2 0.14265 0.05414 2.635 0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6516 -1.1369 -0.0745 1.1027 10.0249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60721 0.26225 6.129 1.95e-09 ***
## E 0.45541 0.05443 8.367 7.68e-16 ***
## C3 0.44833 0.15954 2.810 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4964 -1.1325 -0.0407 1.0714 10.5865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53795 0.26771 5.745 1.71e-08 ***
## E 0.39844 0.06998 5.693 2.27e-08 ***
## C1 0.02554 0.06972 0.366 0.7143
## C2 0.08004 0.06679 1.198 0.2314
## C3 0.31152 0.18517 1.682 0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
  • 24. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1=TestingColon$Obesity,C2=TestingColon$High_Bloo
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
SS.regression/SS.total
## [1] 0.7494172
#This is the R-squared value of the regression on the testing data
Region 2 (Midwest region)
We will be analyzing the Midwest region for the same problem. All the code and necessary visualizations are included in the following section.
24
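One cautionary note on the held-out R-squared computed above: on test data the training-set identity SS.total = SS.regression + SS.residual no longer holds exactly (which is why the printed difference is non-zero), so a common alternative is to report 1 - SS.residual/SS.total instead of SS.regression/SS.total. A minimal sketch on simulated data (all names here are illustrative, not from the CHSI dataset):

```r
# Sketch: out-of-sample R^2 on simulated data. On held-out observations
# the ANOVA identity SS.total = SS.regression + SS.residual breaks down,
# so 1 - SS.residual/SS.total is a standard held-out R^2.
set.seed(100)
x <- rnorm(200)
y <- 2 * x + rnorm(200)
train <- 1:160; test <- 161:200
fit  <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
pred <- predict(fit, newdata = data.frame(x = x[test]))
obs  <- y[test]
SS.total    <- sum((obs - mean(obs))^2)
SS.residual <- sum((obs - pred)^2)
R2.test <- 1 - SS.residual / SS.total
```

With a signal this strong the held-out value lands close to the training R-squared; the point is only the formula, not the particular numbers.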
  • 25. #Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest<-subset(Midwest,Midwest$Stroke!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
Next we fit regression models of the total number of deaths in this region against the individual causes of death. For simplicity we report only the top two causes of death in region 2. We reached this conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.

disease          region2 (R squared)
breast cancer    0.016
mva              0.19
chd              0.77
colon cancer     0.12
lung cancer      0.25
injury           0.11
suicide          0.05
stroke           0.13

Based on the R^2 values we consider the following diseases to be the major causes of death in the Midwest region.
1. CHD (Coronary heart disease)
2. Lung Cancer
#Since we have identified CHD and Lung Cancer as the major causes of death, we will perform multivariate regression between them and the total deaths.
regressionModel<-lm(Midwest$Total_Death_Causes~Midwest$CHD+Midwest$Lung_Cancer)
#Summary of regression between total deaths and the diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
##
25
  • 26. ## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Diagnostic plots for regressionModel : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
Now that we have established the major diseases in the Midwest region, we will analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot; after that we will try to find the confounders and perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
26
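The heatmap construction that follows melts a correlation matrix into long (x, y, value) form with reshape2's melt(); for reference, the same long table can also be produced in base R. A small self-contained sketch with toy data (the variable names are illustrative):

```r
# Sketch: turning a correlation matrix into long (x, y, value) form
# without reshape2, using as.data.frame(as.table(...)).
d  <- data.frame(a = 1:10, b = (1:10)^2, c = 10:1)
cm <- cor(d)
melted <- as.data.frame(as.table(cm))
names(melted) <- c("x", "y", "value")   # same columns melt() would produce
```

The resulting data frame has one row per (x, y) pair and feeds into geom_tile() exactly like the melted matrix used below.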
  • 27. MW.states<-Midwest MW.Procedures<-data.frame() #Creating a data frame inorder to generate correlation heat map MW.Procedures<-MW.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di MW.Procedures.Matrix<-as.matrix(MW.Procedures) MW.Cor<-cor(MW.Procedures) #Correlation metrix MW.Cor ## CHD No_Exercise Few_Fruit_Veg Obesity ## CHD 1.0000000 0.9109385 0.8956826 0.9072785 ## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991 ## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907 ## Obesity 0.9072785 0.9265991 0.9520907 1.0000000 ## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831 ## Smoker 0.9076982 0.9292124 0.9400104 0.9312576 ## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784 ## High_Blood_Pres Smoker Diabetes ## CHD 0.9045115 0.9076982 0.8902612 ## No_Exercise 0.9155135 0.9292124 0.9031664 ## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473 ## Obesity 0.9536831 0.9312576 0.9153784 ## High_Blood_Pres 1.0000000 0.9187944 0.9194037 ## Smoker 0.9187944 1.0000000 0.8869094 ## Diabetes 0.9194037 0.8869094 1.0000000 #Melting the correlation matrix and creating a data frame MW.Melt<-melt(data=MW.Cor,varnames = c("x","y")) MW.Melt <- MW.Melt[order(MW.Melt$value),] #Mean of the melt mean(MW.Melt$value) ## [1] 0.9275097 #Summary summary(MW.Melt) ## x y value ## CHD :7 CHD :7 Min. :0.8688 ## Diabetes :7 Diabetes :7 1st Qu.:0.9073 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188 ## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 MW.Melt<-MW.Melt[(!MW.Melt$value==1),] MW.MeltMean<-mean(MW.Melt$value) #Making various colors to geMWrate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) 27
  • 28. RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, red) ) # Heat map - using colors | used ggplot2 for the colors plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = MW.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 2") plt_heat_blue CHD Diabetes Few_Fruit_Veg High_Blood_Pres No_Exercise Obesity Smoker CHD DiabetesFew_Fruit_VegHigh_Blood_PresNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 2 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Obesity VS No Exercise plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Obesity, 0)))) + geom_point() + 28
  • 29. scale_color_discrete(name="% Obesity") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') + ggtitle(label = "Percentage of obesity vs.n Percentage of No Exercise") plt_obevsNoEx_MW 10% 10% No Exercise %s Obesity%s % Obesity 2 3 4 5 6 7 8 9 10 20 Percentage of obesity vs. Percentage of No Exercise #graph of No Exercise VS Smoker plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Smoker") plt_noExvsSmo_MW 29
  • 30. 20% 30% 40% 50% 20% 30% 40% No Exercise %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Smoker #graph of No Exercise VS High Blood Pressure plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedur color = factor(signif(MW.Procedures$High_Blood_Pres, geom_point() + scale_color_discrete(name="% High Blood Pressure") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') + ggtitle(label = "Percentage of No Excersice vs.n Percentage of High Blood Pressure") plt_noExvsblood_MW 30
  • 31. 20% 30% 40% 20% 30% 40% No Exercise %s HighBloodPressure%s % High Blood Pressure 2 3 4 5 6 7 8 9 10 20 Percentage of No Excersice vs. Percentage of High Blood Pressure #Plotting the multivariate scatter plot in order to understand the correlation better. pairs(~MW.Procedures$CHD+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW. 31
  • 32. [Multivariate scatterplot : Region 2 — pairwise plots of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes in MW.Procedures]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most highly correlated with:
1. No Exercise
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, we can say the likely confounders are:
1. High Blood Pressure
2. Smoker
3. Obesity
#Partitioning the data into training and testing sets.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-MW.Procedures[train_part,]
TestingCHD<-MW.Procedures[-train_part,]
#Performing regression between CHD and No Exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
32
  • 33. ##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add the confounders and then perform the multivariate regression.
#Making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$High_Blood_Pres)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
33
  • 34. ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.236 -5.870 -0.862 5.417 36.030 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.6657 1.2036 0.553 0.581 ## E 3.9856 0.4212 9.462 < 2e-16 *** ## C1 3.1306 0.4123 7.593 3.64e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.723 on 313 degrees of freedom ## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681 ## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -30.550 -5.426 -0.776 4.322 37.881 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.6510 1.1790 2.249 0.0252 * ## E 4.0462 0.4223 9.581 < 2e-16 *** ## C2 2.8735 0.3872 7.420 1.11e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.757 on 313 degrees of freedom ## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671 ## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.324 -5.482 -0.944 5.139 34.475 ## 34
  • 35. ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.4248 1.1767 1.211 0.227 ## E 4.1168 0.3898 10.560 < 2e-16 *** ## C3 2.8134 0.3545 7.936 3.74e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.654 on 313 degrees of freedom ## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699 ## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -25.493 -5.404 -0.842 4.904 33.167 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.3202 1.1689 1.129 0.259602 ## E 2.8621 0.4635 6.175 2.07e-09 *** ## C1 1.1091 0.5728 1.936 0.053747 . ## C2 1.5669 0.4458 3.514 0.000506 *** ## C3 1.4402 0.4852 2.968 0.003230 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.36 on 311 degrees of freedom ## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777 ## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 35
  • 36. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$High_Blood_Pres)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 47.35611 71.16578 74.24867 74.90092 88.18896 85.76173 53.21561
## 8 9 10 11 12 13 14
## 46.55992 49.44050 85.85667 62.59279 42.36807 92.85802 41.66495
## 15 16 17 18 19 20 21
## 90.65576 38.70206 44.74763 18.59404 16.51755 23.08953 19.65119
## 22 23 24 25 26 27 28
## 17.38575 19.99998 40.34271 25.57896 22.67974 18.95382 75.20951
## 29 30 31 32 33 34 35
## 20.59550 44.87894 22.94132 21.52470 28.72008 16.55805 21.38459
## 36 37 38 39 40 41 42
## 53.07606 51.01929 52.00420 22.51745 111.28415 89.03471 80.60128
## 43 44 45 46 47 48 49
## 70.85496 44.13435 65.03187 73.84524 36.00484 39.05419 62.18069
## 50 51 52 53 54 55 56
## 29.67611 63.29361 76.79059 38.52942 50.68324 49.06936 21.51555
## 57 58 59 60 61 62 63
## 50.18942 42.64157 20.31264 78.53103 20.19143 18.05501 20.39115
36
  • 37. ## 64 65 66 67 68 69 70 ## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330 ## 71 72 73 74 75 76 ## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753 test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] -1040.415 SS.regression/SS.total ## [1] 0.854508 #This is the regression value Rsquare value for testing data Now as we have fitted the regression model for “CHD” we will do the same for the lung cancer. Here is the code for the lung cancer model. #Creating a data frame inorder to generate correlation heat map MW.Procedures<-MW.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smo MW.Procedures.Matrix<-as.matrix(MW.Procedures) MW.Cor<-cor(MW.Procedures) #Correlation metrix MW.Cor ## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity ## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666 ## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991 ## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907 ## Obesity 0.9459666 0.9265991 0.9520907 1.0000000 ## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831 ## Smoker 0.9494486 0.9292124 0.9400104 0.9312576 ## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784 ## High_Blood_Pres Smoker Diabetes ## Lung_Cancer 0.9279799 0.9494486 0.8984837 ## No_Exercise 0.9155135 0.9292124 0.9031664 ## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473 ## Obesity 0.9536831 0.9312576 0.9153784 ## High_Blood_Pres 1.0000000 0.9187944 0.9194037 ## Smoker 0.9187944 1.0000000 0.8869094 ## Diabetes 0.9194037 0.8869094 1.0000000 #Melting the correlation matrix and creating a data frame MW.Melt<-melt(data=MW.Cor,varnames = c("x","y")) MW.Melt <- MW.Melt[order(MW.Melt$value),] #Mean of the melt mean(MW.Melt$value) 37
  • 38. ## [1] 0.9354118 #Summary summary(MW.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.8688 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313 ## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 MW.Melt<-MW.Melt[(!MW.Melt$value==1),] MW.MeltMean<-mean(MW.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, red) ) # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = MW.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 2") plt_heat_blue 38
  • 39. Diabetes Few_Fruit_Veg High_Blood_Pres Lung_Cancer No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresLung_CancerNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 2 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Smoker VS Few Fruit vegitables plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedure color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Smoker vs.n Percentage of Few Fruit and vegitables") plt_Smovsfru_MW 39
  • 40. 10% 10% 20% 30% 40% Few Fruits and Vegetables %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of Smoker vs. Percentage of Few Fruit and vegitables #graph of No Exercise VS Smoker plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of Smoker") plt_noExvsSmo_MW 40
  • 41. 20% 30% 40% 50% 20% 30% 40% No Exercise %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of Smoker #graph of Smoker VS Obesity plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smok color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'Obesity %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of Obesity vs.n Percentage of Smoker") plt_SmovsObe_MW 41
  • 42. 20% 30% 40% 50% 20% 30% 40% Obesity %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of Obesity vs. Percentage of Smoker #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~MW.Procedures$Lung_Cancer+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obe 42
  • 43. [Multivariate scatterplot : Region 2 — pairwise plots of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes in MW.Procedures]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that lung cancer is most highly correlated with:
1. Smoker
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, we can say the likely confounders are:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#Partitioning the data into training and testing sets.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
#Performing regression between lung cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
43
  • 44. ##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add the confounders and then perform the multivariate regression.
#Making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=TrainingLung$No_Exercise)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
44
  • 45. ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.2839 -1.2744 0.1438 1.3621 7.6394 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.1171 0.3171 -3.523 0.00049 *** ## E 1.3294 0.1012 13.133 < 2e-16 *** ## C1 1.1735 0.1075 10.911 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.598 on 313 degrees of freedom ## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269 ## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.8397 -1.2740 0.0911 1.3515 8.0799 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 *** ## E 1.24197 0.11201 11.088 < 2e-16 *** ## C2 0.39028 0.03696 10.559 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.621 on 313 degrees of freedom ## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256 ## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.4258 -1.3278 -0.2252 1.5037 7.6570 ## 45
  • 46. ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.9052 0.3127 -2.895 0.00406 ** ## E 1.3013 0.1054 12.341 < 2e-16 *** ## C3 1.2172 0.1137 10.703 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.611 on 313 degrees of freedom ## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262 ## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.0675 -1.1707 0.1069 1.2777 7.5823 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 *** ## E 0.67286 0.11725 5.739 2.27e-08 *** ## C1 0.44770 0.13018 3.439 0.000664 *** ## C2 0.21279 0.04183 5.087 6.31e-07 *** ## C3 0.79075 0.11385 6.946 2.21e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.327 on 311 degrees of freedom ## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414 ## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 46
  • 47. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=TestingLung$No_Exercise)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
SS.regression/SS.total
## [1] 0.8705158
#This is the R-squared value of the regression on the testing data
Region 3 (South region)
We will be analyzing the South region for the same problem. All the code and necessary visualizations are included in the following section.
47
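In the confounder analysis above we judged each added variable by the change in R-squared across reg.E, reg.EC1, reg.EC2 and reg.EC3; anova() on nested models gives the corresponding F-test directly. A sketch on simulated data (variable names are illustrative, not the CHSI columns):

```r
# Sketch: F-test for whether a confounder adds explanatory power,
# via anova() on nested linear models fit to simulated data.
set.seed(100)
n  <- 300
E  <- rnorm(n)
C1 <- 0.8 * E + rnorm(n)            # confounder correlated with the exposure
O  <- 1 + 0.5 * E + 0.6 * C1 + rnorm(n)
d  <- data.frame(O, E, C1)
m0 <- lm(O ~ E, data = d)
m1 <- update(m0, . ~ . + C1)
cmp <- anova(m0, m1)                # row 2 holds the test for adding C1
p.added <- cmp[["Pr(>F)"]][2]
```

A small p-value here plays the same role as the jump in R-squared used above, but it comes with a formal significance level.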
  • 48. #Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South<-subset(South,South$Stroke!=0)
South <- subset(South, South$MVA!=0)
Next we fit regression models of the total number of deaths in this region against the individual causes of death. For simplicity we report only the top three causes of death in region 3. We reached this conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.

disease          region3 (R squared)
breast cancer    0.06
mva              0.26
chd              0.79
colon cancer     0.16
lung cancer      0.35
injury           0.17
suicide          0.03
stroke           0.14

Based on the R^2 values we consider the following diseases to be the major causes of death in the South region.
1. CHD (Coronary heart disease)
2. Lung Cancer
3. MVA (Motor Vehicle Accidents)
#Since we have identified CHD, Lung Cancer and MVA as the major causes of death, we will perform multivariate regression between them and the total deaths.
regressionModel<-lm(South$Total_Death_Causes~South$CHD+South$Lung_Cancer+South$MVA)
#Summary of regression between total deaths and the diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
48
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -40.105   -5.062   -0.461    4.612   30.418
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         1.17128    0.98511   1.189    0.235
## South$CHD           1.16308    0.02562  45.398  < 2e-16 ***
## South$Lung_Cancer   2.84552    0.07955  35.770  < 2e-16 ***
## South$MVA           1.08049    0.15806   6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)

[Figure: diagnostic plots for regressionModel — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance).]

Now that we have established the major diseases in the South region, we will analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot. After that we will try to find the confounders and perform multivariate regression with training and testing data.
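The univariate screening behind the R-squared table above can also be generated programmatically instead of fitting each model by hand. The following is a sketch under the assumption that the cleaned South data frame uses the column names appearing in this report.

```r
# Sketch of the single-variable R-squared screening for region 3.
# Assumes the cleaned `South` data frame with the columns named in this report.
diseases <- c("Brst_Cancer", "MVA", "CHD", "Col_Cancer",
              "Lung_Cancer", "Injury", "Suicide", "Stroke")
r2 <- sapply(diseases, function(d) {
  fit <- lm(reformulate(d, response = "Total_Death_Causes"), data = South)
  summary(fit)$r.squared
})
sort(round(r2, 2), decreasing = TRUE)  # the top entries drive feature selection
```

This makes the selection rule explicit and reproducible: the three diseases with the highest univariate R^2 are the ones carried forward into the combined model.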
  • 50. Following is the code which describes the procedure for CHD. SO.states<-South SO.Procedures<-data.frame() #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## CHD No_Exercise Few_Fruit_Veg Obesity ## CHD 1.0000000 0.8652391 0.8531583 0.8529559 ## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308 ## Obesity 0.8529559 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588 ## Smoker 0.8387031 0.8636349 0.8920701 0.8694016 ## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## CHD 0.8240507 0.8387031 0.7821009 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8792101 #Summary summary(SO.Melt) ## x y value ## CHD :7 CHD :7 Min. :0.7821 ## Diabetes :7 Diabetes :7 1st Qu.:0.8530 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681 ## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) 50
  • 51. #Making various colors to geSOrate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, red) ) WtoGrange<-colorRampPalette(c(red, green) ) # Heat map - using colors | used ggplot2 for the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.8, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue CHD Diabetes Few_Fruit_Veg High_Blood_Pres No_Exercise Obesity Smoker CHD DiabetesFew_Fruit_VegHigh_Blood_PresNo_Exercise Obesity Smoker 0.80 0.85 0.90 0.95 1.00 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Obesity VS No Exercise plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Obesity, 0)))) + 51
  • 52. geom_point() + scale_color_discrete(name="% Obesity") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') + ggtitle(label = "Percentage of obesity vs.n Percentage of No Exercise") plt_obevsNoEx_SO 10% 10% 20% No Exercise %s Obesity%s % Obesity 2 3 4 5 6 7 8 9 10 20 Percentage of obesity vs. Percentage of No Exercise #graph of No Exercise VS Smoker plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Smoker") plt_noExvsSmo_SO 52
[Figure: scatterplot "Percentage of No Exercise vs. Smoker" — No Exercise % on the x-axis, Smoker % on the y-axis, points coloured by % Smoker.]

#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedure
color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
plt_noExvsfru_SO
  • 54. 10% 20% 30% 40% 10% 20% No Exercise %s FewFruitsAndVegetables%s % Few Fruits And Vegetables 8 9 10 20 30 40 Percentage of No Excersice vs. Percentage of Few Fruits And Vegetables #Plotting the multivariate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$CHD+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO. 54
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most strongly correlated with:
1. No Exercise.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. Few Fruits And Vegetables
2. Smoker
3. Obesity

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-SO.Procedures[train_part,]
TestingCHD<-SO.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
  • 56. ## ## Residuals: ## Min 1Q Median 3Q Max ## -63.726 -7.474 -0.401 7.460 53.480 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.3934 1.6142 3.341 0.000904 *** ## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 12.37 on 446 degrees of freedom ## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554 ## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16 As you can see that model is already having good accuracy we will still add more confounders and then perform the multivariate regression. #making a temporary table with all required variables temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=Training temp <- mutate(temp, O = TrainingCHD$CHD) #Regression on CHD and No Exercise reg.E <- lm(O ~ E, data = temp) summary(reg.E) ## ## Call: ## lm(formula = O ~ E, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -63.726 -7.474 -0.401 7.460 53.480 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.3934 1.6142 3.341 0.000904 *** ## E 6.2141 0.1672 37.167 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 12.37 on 446 degrees of freedom ## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554 ## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16 #Regression on CHD and No excercise with Confounder 1 reg.EC1 <- lm(O ~ E+C1, data = temp) summary(reg.EC1) ## ## Call: ## lm(formula = O ~ E + C1, data = temp) 56
  • 57. ## ## Residuals: ## Min 1Q Median 3Q Max ## -56.692 -6.553 -0.426 6.782 49.965 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.7745 1.5323 2.463 0.0141 * ## E 3.6456 0.3685 9.893 < 2e-16 *** ## C1 3.1443 0.4080 7.707 8.45e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.64 on 445 degrees of freedom ## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837 ## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -58.419 -7.093 -0.348 6.464 50.962 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.2383 1.5379 1.455 0.146 ## E 3.8791 0.3103 12.503 <2e-16 *** ## C2 3.0972 0.3567 8.684 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.46 on 445 degrees of freedom ## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904 ## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -57.863 -6.174 -0.278 6.583 48.496 ## 57
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0217     1.4558   2.076   0.0385 *
## E             3.3541     0.3044  11.019   <2e-16 ***
## C3            1.1467     0.1064  10.776   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16

#Regression with the explanatory variable and all the confounders we believe are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -55.608  -6.372  -0.209   6.585  48.139
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.9532     1.4613   1.337 0.182035
## E             2.5933     0.3754   6.907 1.72e-11 ***
## C1            0.8052     0.4921   1.636 0.102483
## C2            1.4445     0.4122   3.504 0.000505 ***
## C3            0.7514     0.1527   4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16

#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
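Rather than judging the added confounders by eyeballing R-squared alone, the nested models fitted above can be compared formally. This sketch (not part of the original report) uses base R's nested-model F-test and AIC on the single-predictor model `reg.E` and the full model `reg.EC1234` already in the workspace.

```r
# Sketch: formal comparison of the nested regression models fitted above.
anova(reg.E, reg.EC1234)   # F-test: is the reduction in residual SS significant?
AIC(reg.E, reg.EC1234)     # lower AIC indicates a better fit/complexity trade-off
```

Because `reg.E` uses only the predictor E while `reg.EC1234` adds C1, C2 and C3 on the same data, `anova()` treats them as nested and reports whether the extra confounders jointly improve the fit.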
  • 59. 20 40 60 80 100 120 −60060 Fitted values Residuals Residuals vs Fitted 171 43 202 −3 −2 −1 0 1 2 3 −404 Theoretical Quantiles Standardizedresiduals Normal Q−Q 171 43 202 20 40 60 80 100 120 0.01.5 Fitted values Standardizedresiduals Scale−Location 171 43 202 0.00 0.02 0.04 0.06 0.08 0.10 −604 Leverage Standardizedresiduals Cook's distance 1 0.5 0.5 Residuals vs Leverage 364 171 202 #Now we will test our regression model with testing data to check the prerformance. tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=Testing tempTest <- mutate(tempTest, O = TestingCHD$CHD) test.predictions<-predict(reg.EC1234,newdata = tempTest) test.predictions ## 1 2 3 4 5 6 7 ## 93.11863 54.33024 52.85395 45.79852 84.97513 56.88569 90.36550 ## 8 9 10 11 12 13 14 ## 54.14329 27.01033 21.19971 22.31869 51.65249 44.54791 47.33240 ## 15 16 17 18 19 20 21 ## 82.57801 95.08141 82.71398 87.34274 77.96068 95.69180 39.98928 ## 22 23 24 25 26 27 28 ## 92.73972 78.85468 78.94780 80.59074 80.79515 88.44522 51.13704 ## 29 30 31 32 33 34 35 ## 53.76243 54.93526 44.19820 61.92209 47.95062 57.45545 51.17947 ## 36 37 38 39 40 41 42 ## 27.33281 62.49750 23.71136 28.54283 26.16418 27.07990 58.66436 ## 43 44 45 46 47 48 49 ## 102.65357 91.78205 105.83215 55.19101 80.64446 40.27956 53.95203 ## 50 51 52 53 54 55 56 ## 48.46951 84.13853 40.82755 102.13602 54.52694 48.51574 50.28585 ## 57 58 59 60 61 62 63 ## 65.78484 53.00168 50.13015 53.58431 48.56988 45.02007 80.19727 59
  • 60. ## 64 65 66 67 68 69 70 ## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659 ## 71 72 73 74 75 76 77 ## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857 ## 78 79 80 81 82 83 84 ## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464 ## 85 86 87 88 89 90 91 ## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904 ## 92 93 94 95 96 97 98 ## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157 ## 99 100 101 102 103 104 105 ## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587 ## 106 107 108 109 110 ## 60.00317 46.91665 59.17558 47.64249 24.00712 test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] 2561.812 SS.regression/SS.total ## [1] 0.7099096 #This is the regression value Rsquare value for testing data Now as we have fitted the regression model for “CHD” we will do the same for the lung cancer. Here is the code for the lung cancer model. #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smo SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity ## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772 ## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308 ## Obesity 0.8688772 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588 ## Smoker 0.9145492 0.8636349 0.8920701 0.8694016 ## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## Lung_Cancer 0.8698291 0.9145492 0.7993788 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 60
  • 61. ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8860905 #Summary summary(SO.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.7994 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698 ## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, red) ) WtoGrange<-colorRampPalette(c(red, blue) ) # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue 61
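The correlation matrix just computed shows the risk factors themselves are strongly intercorrelated (mostly 0.80–0.92), so coefficient estimates in the multivariable models will be inflated by collinearity. As a sketch (our own addition, assuming the `SO.Procedures` data frame built above), variance inflation factors can be computed in base R by regressing each predictor on the others:

```r
# Sketch: manual variance inflation factors for the risk-factor predictors.
# VIF = 1 / (1 - R^2) from regressing each predictor on the remaining ones;
# values well above 5-10 signal strong collinearity, which helps explain why
# some confounders lose individual significance in the joint models.
predictors <- c("No_Exercise", "Few_Fruit_Veg", "Obesity",
                "High_Blood_Pres", "Smoker", "Diabetes")
vif <- sapply(predictors, function(p) {
  fit <- lm(reformulate(setdiff(predictors, p), response = p),
            data = SO.Procedures)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 1)
```

High VIFs do not invalidate the predictions, but they caution against interpreting any single coefficient causally.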
  • 62. Diabetes Few_Fruit_Veg High_Blood_Pres Lung_Cancer No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresLung_CancerNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Smoker VS High Blood Pressure plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proce color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Smoker vs.n Percentage of High Blood Pressure") plt_Smovsblood_SO 62
  • 63. 10% 10% 20% High Blood Pressure %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Smoker vs. Percentage of High Blood Pressure #graph of Few Fruits and Vegetables VS Smoker plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedure color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fuits and Vegetable %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Few Fuits and Vegetable vs.n Percentage of Smoker") plt_fruvsSmo_SO 63
  • 64. 10% 10% 20% 30% 40% Few Fuits and Vegetable %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Few Fuits and Vegetable vs. Percentage of Smoker #graph of Smoker VS Obesity plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smok color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'Obesity %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of Obesity vs.n Percentage of Smoker") plt_SmovsObe_SO 64
  • 65. 20% 30% 40% 50% 20% 30% 40% 50% Obesity %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Obesity vs. Percentage of Smoker #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$Lung_Cancer+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obe 65
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that Lung Cancer is most strongly correlated with:
1. Smoker.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
  • 67. ## Residuals: ## Min 1Q Median 3Q Max ## -17.9394 -2.0631 -0.1777 1.8757 17.4538 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.69422 0.43405 3.903 0.00011 *** ## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.328 on 446 degrees of freedom ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271 ## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16 As you can see that model is already having good accuracy we will still add more confounders and then perform the multivariate regression. #making a temporary table with all required variables temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=T temp <- mutate(temp, O = TrainingLung$Lung_Cancer) #Regression on Lung Cancer and Smoker reg.E <- lm(O ~ E, data = temp) summary(reg.E) ## ## Call: ## lm(formula = O ~ E, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.9394 -2.0631 -0.1777 1.8757 17.4538 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.69422 0.43405 3.903 0.00011 *** ## E 2.37666 0.05138 46.254 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.328 on 446 degrees of freedom ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271 ## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 1 reg.EC1 <- lm(O ~ E+C1, data = temp) summary(reg.EC1) ## ## Call: ## lm(formula = O ~ E + C1, data = temp) ## 67
  • 68. ## Residuals: ## Min 1Q Median 3Q Max ## -18.098 -1.752 -0.184 1.773 16.095 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.92734 0.41643 2.227 0.0265 * ## E 1.71902 0.09419 18.250 < 2e-16 *** ## C1 0.75201 0.09266 8.115 4.78e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.109 on 445 degrees of freedom ## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491 ## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -13.4221 -1.8371 -0.1437 1.7834 17.3119 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.99954 0.40928 2.442 0.015 * ## E 1.65336 0.09547 17.317 <2e-16 *** ## C2 0.70444 0.08065 8.735 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.078 on 445 degrees of freedom ## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521 ## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -18.485 -1.660 -0.174 1.687 15.611 ## ## Coefficients: 68
  • 69. ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.14230 0.40135 2.846 0.00463 ** ## E 1.50746 0.10386 14.515 < 2e-16 *** ## C3 0.29928 0.03189 9.385 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.044 on 445 degrees of freedom ## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554 ## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.8100 -1.7365 -0.1321 1.7312 15.9466 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.79644 0.39875 1.997 0.0464 * ## E 1.29944 0.10929 11.889 < 2e-16 *** ## C1 0.18044 0.12280 1.469 0.1424 ## C2 0.38893 0.09591 4.055 5.92e-05 *** ## C3 0.17908 0.04193 4.271 2.38e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.967 on 443 degrees of freedom ## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626 ## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 69
  • 70. 5 10 15 20 25 30 35 −1010 Fitted values Residuals Residuals vs Fitted 211 382 369 −3 −2 −1 0 1 2 3 −426 Theoretical Quantiles Standardizedresiduals Normal Q−Q 382 211 369 5 10 15 20 25 30 35 0.01.5 Fitted values Standardizedresiduals Scale−Location 382211 369 0.00 0.02 0.04 0.06 0.08 −604 Leverage Standardizedresiduals Cook's distance 0.5 0.5 Residuals vs Leverage 382 369360 #Now we will test our regression model with testing data to check the performance. tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3= tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer) test.predictions<-predict(reg.EC1234,newdata = tempTest) test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] 882.6084 SS.regression/SS.total ## [1] 0.7934475 #This is the regression value Rsquare value for testing data Similarly , we have done calculations for MVA. Here is the code for the MVA model. 70
  • 71. #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("MVA","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## MVA No_Exercise Few_Fruit_Veg Obesity ## MVA 1.0000000 0.7037265 0.5939313 0.6419514 ## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308 ## Obesity 0.6419514 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588 ## Smoker 0.6515928 0.8636349 0.8920701 0.8694016 ## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## MVA 0.6708440 0.6515928 0.6625232 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8346534 #Summary summary(SO.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.5939 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681 ## MVA :7 MVA :7 Mean :0.8347 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, green) ) 71
  • 72. # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.5, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue Diabetes Few_Fruit_Veg High_Blood_Pres MVA No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresMVA No_Exercise Obesity Smoker 0.5 0.6 0.7 0.8 0.9 1.0 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of no Exercise VS High Blood Pressure plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proc color = factor(signif(SO.Procedures$No_Exercise, 0)))) + geom_point() + scale_color_discrete(name="% No Exercise") + scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'No Excercise %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of High Blood Pressure") plt_noExvsblood_SO 72
  • 73. 10% 20% 10% 20% High Blood Pressure %s NoExcercise%s % No Exercise 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of High Blood Pressure #graph of No Exercise VS Smoker plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of Smoker") plt_noExvsSmo_SO 73
  • 74. 10% 10% 20% No Exercise %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of Smoker #graph of Diabetes VS No Exercise plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No color = factor(signif(SO.Procedures$No_Exercise, 0)) geom_point() + scale_color_discrete(name="% No Exercise") + scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + ggtitle(label = "Percentage of Diabetes vs.n Percentage of No Exercise") plt_noExvsDia_SO 74
  • 75. 10% 20% Diabetes %s NoExercise%s % No Exercise 2 3 4 5 6 7 8 9 10 20 Percentage of Diabetes vs. Percentage of No Exercise #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$MVA+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO. 75
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of MVA, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that MVA is most strongly correlated with:
1. No Exercise.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. High Blood Pressure
2. Diabetes
3. Smoker

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
mvaRegr<-lm(TrainingMVA$MVA~TrainingMVA$No_Exercise)
#Regression Summary
summary(mvaRegr)
##
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
  • 77. ##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.0631 -1.2072 -0.0759  1.0708  9.8437
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              1.60141    0.26426    6.06  2.9e-09 ***
## TrainingMVA$No_Exercise  0.58738    0.02772   21.19  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16

As you can see, the model already has decent accuracy; nevertheless, we will add the confounders and then perform the multivariate regression.

#Making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1 = TrainingMVA$Smoker,
                   C2 = TrainingMVA$High_Blood_Pres, C3 = TrainingMVA$Diabetes)
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.0631 -1.2072 -0.0759  1.0708  9.8437
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.60141    0.26426    6.06  2.9e-09 ***
## E            0.58738    0.02772   21.19  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E + C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
  • 78. ##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.9779 -1.1454 -0.0693  1.0158 10.1595
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.52869    0.26785   5.707  2.1e-08 ***
## E            0.50745    0.05789   8.765  < 2e-16 ***
## C1           0.09930    0.06317   1.572    0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E + C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.5950 -1.1474 -0.0641  1.0431 10.7984
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.51447    0.26458   5.724 1.91e-08 ***
## E            0.45073    0.05872   7.676 1.05e-13 ***
## C2           0.14265    0.05414   2.635  0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E + C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.6516 -1.1369 -0.0745  1.1027 10.0249
##
  • 79. ## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.60721    0.26225   6.129 1.95e-09 ***
## E            0.45541    0.05443   8.367 7.68e-16 ***
## C3           0.44833    0.15954   2.810  0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16

#Regression with the explanatory variable and all the confounders that we identified.
reg.EC1234 <- lm(O ~ E + C1 + C2 + C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.4964 -1.1325 -0.0407  1.0714 10.5865
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.53795    0.26771   5.745 1.71e-08 ***
## E            0.39844    0.06998   5.693 2.27e-08 ***
## C1           0.02554    0.06972   0.366   0.7143
## C2           0.08004    0.06679   1.198   0.2314
## C3           0.31152    0.18517   1.682   0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16

#Plotting the regression diagnostics.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
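The progression from reg.E to reg.EC1234 above is essentially an omitted-variable check: when a confounder drives both the exposure and the outcome, leaving it out inflates the exposure's coefficient (here, E's slope drops from about 0.59 alone to about 0.40 with all confounders included). A minimal Python sketch of that effect on synthetic data (all names and generating coefficients are illustrative, not from the CHSI data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 448

# Synthetic data (hypothetical): the confounder c drives both e and o.
c = rng.normal(0, 1, n)                              # confounder, e.g. a Diabetes stand-in
e = 10 + 2.0 * c + rng.normal(0, 1, n)               # exposure, e.g. a No_Exercise stand-in
o = 1.5 + 0.4 * e + 0.5 * c + rng.normal(0, 1, n)    # outcome: true effect of e is 0.4

def ols(y, *cols):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_simple = ols(o, e)      # O ~ E     : e's slope absorbs part of c's effect (~0.6)
b_adjust = ols(o, e, c)   # O ~ E + C : e's slope moves back toward the true 0.4
print(round(b_simple[1], 2), round(b_adjust[1], 2))
```

The adjusted slope is the honest one; this is why the report fits the confounders jointly before trusting the coefficient on No Exercise. One further note on the diagnostics above: `abline()` draws a single straight line, so R warns when it is handed a model with five coefficients, which is why only `plot(reg.EC1234)` produces meaningful panels here.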