Project: Mortality Rate Analysis of Deadly Causes in the USA
Jatri Dave (jad752) , Prashantkumar Patel (pnp249)
December 14, 2016
Project Outline
We obtained the data from CHSI (Community Health Status Indicators). In this project we identify the leading causes of death in the four major regions of the USA (Northeast, Midwest, South, West). After finding the major causes of death, we analyse the daily human behaviours that contribute to them.
The steps we took to solve this problem are briefly explained below.
Data Cleaning and Normalizing.
First, when we gathered the data we did not realize how much of it was missing. There were also hundreds of unnecessary features, so we selected only the features required for the project. Moreover, the data was not on a common scale: some values were percentages, some were rates per 100,000, and some were raw population counts. We needed a common scale on which to normalize them. Furthermore, the data did not all cover the same time span; for example, some covered 1999-2003 and some 1995-2003, so we normalized the data to individual years. We performed all of these operations in Microsoft Excel (2016), then combined the necessary features into a comma-separated values (CSV) file that we read directly into R.
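As an illustration of the rescaling step (the column names below are hypothetical, and the actual cleaning was done in Excel, not R), a rate per 100,000 and a multi-year total can be brought onto a common per-100-persons, per-year footing:

```r
# Hypothetical sketch of the normalization described above (real work was done in Excel).
# Bring a per-100,000 rate onto a per-100 (percentage) scale, and convert a
# 5-year total (e.g. 1999-2003) to a single-year figure.
df <- data.frame(
  deaths_per_100k = c(250, 410),    # rate per 100,000
  obesity_pct     = c(24.5, 31.2),  # already a percentage
  injuries_5yr    = c(1500, 2300)   # total over a 5-year span
)
df$deaths_pct  <- df$deaths_per_100k / 1000  # per 100,000 -> per 100
df$injuries_yr <- df$injuries_5yr / 5        # normalize to a single year
df
```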
Partitioning the data into region-wise subsets.
The code is described below.
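The chunks that follow call `melt`, `ggplot`, `createDataPartition`, and `mutate` without showing the corresponding `library()` calls, which were presumably in a hidden setup chunk; under that assumption, the packages needed are:

```r
# Packages presumably loaded in a hidden knitr setup chunk (an assumption;
# the original document does not show its library() calls).
library(ggplot2)   # ggplot(), geom_tile(), etc.
library(reshape2)  # melt() for the correlation matrices
library(caret)     # createDataPartition() for the train/test splits
library(dplyr)     # mutate()
```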
#Reading the data
data<-read.csv("E:/NYU/1/Foundation of Data Science/Projects/Foundations-of-Data-Science/USADataCleanPra
#Adding new column for region selection
data[,"region"] <- NA
#Removing unnecessary columns
data$X<-NULL
data$X.1<-NULL
#Partitioning the data based on regions. We have manually used the names of the states in order to crea
#region1
data$region[data$CHSI_State_Name=="Connecticut" | data$CHSI_State_Name=="Maine" | data$CHSI_State_Name==
#region2
data$region[data$CHSI_State_Name=="Illinois" | data$CHSI_State_Name=="Indiana" | data$CHSI_State_Name=="
#region3
data$region[data$CHSI_State_Name=="Delaware" | data$CHSI_State_Name=="Florida" | data$CHSI_State_Name=="
#region4
data$region[data$CHSI_State_Name=="Arizona" | data$CHSI_State_Name=="Colorado" | data$CHSI_State_Name=="
#converting state names into lower case letters
data$CHSI_State_Name <- tolower(data$CHSI_State_Name)
#Creating separate data sets for each region so that we can perform analysis on them separately
Northeast<- data[data$region==1,]
Midwest<-data[data$region==2,]
South<-data[data$region==3,]
West<-data[data$region==4,]
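The long chains of `|` comparisons above can be written more compactly with `%in%`. A sketch on synthetic data, using the standard U.S. Census region groupings (which the truncated state lists above presumably follow):

```r
# Sketch: assign region codes with %in% instead of long | chains.
# State groupings here follow the standard U.S. Census regions (an assumption
# about the truncated lists in the original code).
northeast <- c("Connecticut","Maine","Massachusetts","New Hampshire","Rhode Island",
               "Vermont","New Jersey","New York","Pennsylvania")
midwest   <- c("Illinois","Indiana","Michigan","Ohio","Wisconsin","Iowa","Kansas",
               "Minnesota","Missouri","Nebraska","North Dakota","South Dakota")
data <- data.frame(CHSI_State_Name = c("Maine","Ohio","Texas"), stringsAsFactors = FALSE)
data$region <- NA
data$region[data$CHSI_State_Name %in% northeast] <- 1
data$region[data$CHSI_State_Name %in% midwest]   <- 2
data$region
```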
Region 1 (Northeast region)
We analyze the Northeast region for the problem stated above. All the code and necessary visualizations are included in the following section.
#Cleaning the data in region 1.
#Removing missing data.
Northeast <- subset(Northeast, Northeast$No_Exercise!=0)
Northeast <- subset(Northeast, Northeast$Few_Fruit_Veg!=0)
Northeast <- subset(Northeast, Northeast$Obesity!=0)
Northeast <- subset(Northeast, Northeast$High_Blood_Pres!=0)
Northeast <- subset(Northeast, Northeast$Smoker!=0)
Northeast <- subset(Northeast, Northeast$Diabetes!=0)
Northeast <- subset(Northeast, Northeast$Lung_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Col_Cancer!=0)
Northeast <- subset(Northeast, Northeast$CHD!=0)
Northeast <- subset(Northeast, Northeast$Brst_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Suicide!=0)
Northeast <- subset(Northeast, Northeast$Total_Death_Causes!=0)
Northeast <- subset(Northeast, Northeast$Injury!=0)
Northeast<-subset(Northeast,Northeast$Stroke!=0)
Northeast <- subset(Northeast, Northeast$MVA!=0)
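The fifteen `subset` calls above can be collapsed into one vectorized filter. A sketch on a small synthetic frame (the real column list has fifteen entries, with 0 serving as the dataset's missing-value code):

```r
# Sketch: drop rows where any listed column is 0 (the dataset's missing-value code),
# replacing a long chain of subset() calls with a single vectorized filter.
cols <- c("No_Exercise", "Obesity", "CHD")  # in the report this list has 15 columns
Northeast <- data.frame(
  No_Exercise = c(25, 0, 30),
  Obesity     = c(20, 22, 0),
  CHD         = c(150, 160, 170)
)
Northeast <- Northeast[rowSums(Northeast[, cols] == 0) == 0, ]
nrow(Northeast)  # only the first row survives
```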
Next we regressed the total number of deaths on the different kinds of death. For simplicity we include only the top two causes of death in region 1. We reached this conclusion by running a univariate regression of total deaths against each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.
disease         region1 (R squared)
-------------   -------------------
breast cancer   0.04
mva             0.17
chd             0.71
colon cancer    0.24
lung cancer     0.16
injury          0.10
suicide         0.07
stroke          0.04
Based on the R^2 values, we consider the following diseases the major causes of death in the Northeast region:
1. CHD (coronary heart disease)
2. Colon cancer
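The univariate screening behind the table above can be sketched as a loop over candidate predictors (synthetic data here; in the project each disease was regressed against `Total_Death_Causes` and ranked by R^2):

```r
# Sketch of the univariate R^2 screening: regress total deaths on each disease
# separately and compare R^2 values. Synthetic data for illustration only.
set.seed(1)
df <- data.frame(CHD = rnorm(50, 100, 20), Col_Cancer = rnorm(50, 8, 2))
df$Total_Death_Causes <- df$CHD + 8 * df$Col_Cancer + rnorm(50, 0, 10)

r2_for <- function(predictor) {
  summary(lm(reformulate(predictor, "Total_Death_Causes"), data = df))$r.squared
}
sapply(c("CHD", "Col_Cancer"), r2_for)
```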
#Since we have taken CHD and Colon Cancer as the major causes of death, we will perform multivariate regression.
regressionModel<-lm(Northeast$Total_Death_Causes~Northeast$CHD+Northeast$Col_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Northeast$Total_Death_Causes ~ Northeast$CHD + Northeast$Col_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.766 -5.234 -1.086 5.389 28.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.05596 2.51849 3.596 0.000433 ***
## Northeast$CHD 1.00017 0.04322 23.141 < 2e-16 ***
## Northeast$Col_Cancer 8.15527 0.44732 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.382 on 157 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9616
## F-statistic: 1991 on 2 and 157 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: diagnostic plots for the regression model — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
Now that we have established the major diseases in the Northeast region, we analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot; then we look for confounders and perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
NE.states<-Northeast
NE.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8886204 0.8202385 0.7790907
## No_Exercise 0.8886204 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.8202385 0.8788325 1.0000000 0.9008216
## Obesity 0.7790907 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.7685505 0.8430622 0.9158679 0.8826446
## Smoker 0.7693031 0.8041099 0.8774229 0.9155866
## Diabetes 0.7731015 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## CHD 0.7685505 0.7693031 0.7731015
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8705658
#Summary
summary(NE.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7686
## Diabetes :7 Diabetes :7 1st Qu.:0.8340
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8706
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9008
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Building color ramps to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map of the correlations, built with ggplot2
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in risk factors data, Region 1 (CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.7 to 1.0).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_NE
[Figure: percentage of obesity vs. percentage of no exercise, Region 1.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: percentage of no exercise vs. percentage of few fruits and vegetables, Region 1.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedur
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: percentage of no exercise vs. percentage of high blood pressure, Region 1.]
#Plotting the multivariate scatterplot in order to understand the correlations better.
pairs(~NE.Procedures$CHD+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obesity+NE.
[Figure: multivariate scatterplot of CHD and the risk factors, Region 1.]
Based on the heatmap and the multivariate scatterplot above, we can see that CHD is most highly correlated with:
1. No exercise.
Now we will find the confounding variables. Using confounders we can fit the model better, and the predictions will be less error-prone. Based on the correlation heatmap and the scatterplot, we take the confounders to be: 1. High blood pressure 2. Diabetes 3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-NE.Procedures[train_part,]
TestingCHD<-NE.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## TrainingCHD$No_Exercise 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
Although the model already fits well, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1 = TrainingCHD$Obesity, C2 = TrainingCHD$High_Blood_Pres, C3 = TrainingCHD$Diabetes)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## E 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.604 -6.818 -1.170 5.811 77.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9462 3.4303 1.733 0.0855 .
## E 5.7057 0.6647 8.583 3.06e-14 ***
## C1 1.3495 0.7389 1.826 0.0702 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.39 on 125 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.78
## F-statistic: 226.2 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.440 -6.496 -1.075 5.928 81.639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3274 3.4403 1.549 0.1240
## E 5.6011 0.6125 9.145 1.39e-15 ***
## C2 1.2760 0.5716 2.232 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.31 on 125 degrees of freedom
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7828
## F-statistic: 229.9 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.229 -6.466 -1.525 6.379 79.871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.057 3.284 2.149 0.0336 *
## E 5.612 0.612 9.171 1.2e-15 ***
## C3 4.022 1.817 2.213 0.0287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.32 on 125 degrees of freedom
## Multiple R-squared: 0.7861, Adjusted R-squared: 0.7827
## F-statistic: 229.7 on 2 and 125 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the suspected confounders.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.767 -6.870 -1.347 5.835 81.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5001 3.5445 1.552 0.123
## E 5.1938 0.7261 7.153 6.68e-11 ***
## C1 0.4091 0.9195 0.445 0.657
## C2 0.7147 0.7485 0.955 0.342
## C3 2.0806 2.4541 0.848 0.398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.34 on 123 degrees of freedom
## Multiple R-squared: 0.7886, Adjusted R-squared: 0.7817
## F-statistic: 114.7 on 4 and 123 DF, p-value: < 2.2e-16
#plotting regression diagnostics
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1 = TestingCHD$Obesity, C2 = TestingCHD$High_Blood_Pres, C3 = TestingCHD$Diabetes)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 8567.965
SS.regression/SS.total
## [1] 0.5136492
#This is the R-squared value for the testing data
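Note that on held-out data the ANOVA identity SS.total = SS.regression + SS.residual no longer holds (the nonzero difference printed above shows this), so out-of-sample R^2 is more commonly defined as 1 - SS.residual/SS.total. A small self-contained sketch:

```r
# Sketch: out-of-sample R^2 computed as 1 - SS.residual/SS.total.
# On test data SS.total != SS.regression + SS.residual, so the two
# definitions of R^2 disagree. Synthetic data for illustration.
set.seed(2)
train <- data.frame(x = rnorm(80)); train$y <- 2 * train$x + rnorm(80)
test  <- data.frame(x = rnorm(20)); test$y  <- 2 * test$x  + rnorm(20)

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

SS.total    <- sum((test$y - mean(test$y))^2)
SS.residual <- sum((test$y - pred)^2)
r2.test     <- 1 - SS.residual / SS.total
r2.test
```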
Now that we have fitted the regression model for CHD, we will do the same for colon cancer. Here is the code for the colon cancer model.
#Creating a data frame in order to generate the correlation heat map
NE.Procedures<-NE.states[,c("Col_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
NE.Procedures.Matrix<-as.matrix(NE.Procedures)
NE.Cor<-cor(NE.Procedures)
#Correlation matrix
NE.Cor
## Col_Cancer No_Exercise Few_Fruit_Veg Obesity
## Col_Cancer 1.0000000 0.8447630 0.9067467 0.8433966
## No_Exercise 0.8447630 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.9067467 0.8788325 1.0000000 0.9008216
## Obesity 0.8433966 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.8606917 0.8430622 0.9158679 0.8826446
## Smoker 0.8306595 0.8041099 0.8774229 0.9155866
## Diabetes 0.7996891 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## Col_Cancer 0.8606917 0.8306595 0.7996891
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
mean(NE.Melt$value)
## [1] 0.8822818
#Summary
summary(NE.Melt)
## x y value
## Col_Cancer :7 Col_Cancer :7 Min. :0.7997
## Diabetes :7 Diabetes :7 1st Qu.:0.8448
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8823
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9067
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
NE.Melt<-NE.Melt[(!NE.Melt$value==1),]
NE.MeltMean<-mean(NE.Melt$value)
#Building color ramps to generate a dynamic range of colors from a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map of the correlations, built with ggplot2
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
plt_heat_blue
[Figure: heat map of correlations in risk factors data, Region 1 (Col_Cancer, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.7 to 1.0).]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS Few Fruits and Vegetables
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = NE.Procedures$Few_Fruit_Veg, y = NE.Procedures$Obesity,
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Few Fruits and Vegetables")
plt_obevsNoEx_NE
[Figure: percentage of obesity vs. percentage of few fruits and vegetables, Region 1.]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
plt_noExvsfru_NE
[Figure: percentage of no exercise vs. percentage of few fruits and vegetables, Region 1.]
#graph of Few Fruits and Vegetables VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Proced
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_NE
[Figure: percentage of few fruits and vegetables vs. percentage of high blood pressure, Region 1.]
#Plotting the multivariate scatterplot in order to understand the correlations better.
pairs(~NE.Procedures$Col_Cancer+NE.Procedures$No_Exercise+NE.Procedures$Few_Fruit_Veg+NE.Procedures$Obes
[Figure: multivariate scatterplot of colon cancer and the risk factors, Region 1.]
Based on the heatmap and the multivariate scatterplot above, we can see that colon cancer is most highly correlated with:
1. Few fruits and vegetables.
Now we will find the confounding variables. Using confounders we can fit the model better, and the predictions will be less error-prone. Based on the correlation heatmap and the scatterplot, we take the confounders to be: 1. No exercise 2. High blood pressure 3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = NE.Procedures$Col_Cancer,p = 0.80,list = FALSE)
TrainingColon <- NE.Procedures[train_part,]
TestingColon <- NE.Procedures[-train_part,]
#Performing regression between Colon Cancer and Few fruits and vegetables
chdRegr<-lm(TrainingColon$Col_Cancer~TrainingColon$Few_Fruit_Veg)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingColon$Col_Cancer ~ TrainingColon$Few_Fruit_Veg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## TrainingColon$Few_Fruit_Veg 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
Although the model already fits well, we will add the confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingColon$Few_Fruit_Veg, C1 = TrainingColon$Obesity, C2 = TrainingColon$High_Blood_Pres, C3 = TrainingColon$Diabetes)
temp <- mutate(temp, O = TrainingColon$Col_Cancer)
#Regression on Colon Cancer and Few Fruits and Vegetables
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## E 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1345 -0.6380 -0.0406 0.5283 3.7629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66357 0.33068 2.007 0.0469 *
## E 0.22619 0.02392 9.455 2.32e-16 ***
## C1 0.12091 0.07412 1.631 0.1054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.095 on 126 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8317
## F-statistic: 317.2 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3833 -0.6435 0.0166 0.4775 3.8119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68140 0.32910 2.070 0.0405 *
## E 0.21531 0.02585 8.330 1.16e-13 ***
## C2 0.13163 0.06773 1.944 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8331
## F-statistic: 320.5 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -0.6606 -0.0144 0.4913 3.8662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71645 0.32566 2.200 0.0296 *
## E 0.21276 0.02128 10.000 <2e-16 ***
## C3 0.14657 0.05628 2.604 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.078 on 126 degrees of freedom
## Multiple R-squared: 0.8394, Adjusted R-squared: 0.8369
## F-statistic: 329.4 on 2 and 126 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the suspected confounders.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0938 -0.6481 0.0027 0.4265 3.7933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.70568 0.32562 2.167 0.0321 *
## E 0.17857 0.03122 5.719 7.56e-08 ***
## C1 0.03900 0.07983 0.489 0.6261
## C2 0.09059 0.07048 1.285 0.2011
## C3 0.11996 0.06030 1.989 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.077 on 124 degrees of freedom
## Multiple R-squared: 0.8424, Adjusted R-squared: 0.8373
## F-statistic: 165.7 on 4 and 124 DF, p-value: < 2.2e-16
#plotting regression diagnostics
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: diagnostic plots for reg.EC1234 (colon cancer model) — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1 = TestingColon$Obesity, C2 = TestingColon$High_Blood_Pres, C3 = TestingColon$Diabetes)
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
SS.regression/SS.total
## [1] 0.7494172
#This is the R-squared value for the testing data
Region 2 (Midwest region)
We analyze the Midwest region for the same problem. All the code and necessary visualizations are included in the following section.
#Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest<-subset(Midwest,Midwest$Stroke!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
Next we regressed the total number of deaths on the different kinds of death in this region. For simplicity we include only the top two causes of death in region 2. We reached this conclusion by running a univariate regression of total deaths against each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.
disease         region2 (R squared)
-------------   -------------------
breast cancer   0.016
mva             0.19
chd             0.77
colon cancer    0.12
lung cancer     0.25
injury          0.11
suicide         0.05
stroke          0.13
Based on the R^2 values, we consider the following diseases the major causes of death in the Midwest region:
1. CHD (coronary heart disease)
2. Lung cancer
#Since we have taken CHD and Lung Cancer as the major causes of death, we will perform multivariate regression.
regressionModel<-lm(Midwest$Total_Death_Causes~Midwest$CHD+Midwest$Lung_Cancer)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: lm diagnostic plots for the total-deaths model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
Now that we have established the major diseases in the Midwest region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot, and then look for the confounders and
perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
MW.states<-Midwest
MW.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.9109385 0.8956826 0.9072785
## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907
## Obesity 0.9072785 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831
## Smoker 0.9076982 0.9292124 0.9400104 0.9312576
## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## CHD 0.9045115 0.9076982 0.8902612
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9275097
#Summary
summary(MW.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.8688
## Diabetes :7 Diabetes :7 1st Qu.:0.9073
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 2" over CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Obesity),
                                                     color = factor(signif(MW.Procedures$Obesity, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Obesity") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
  ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_MW
[Figure: scatterplot "Percentage of obesity vs. Percentage of No Exercise", colored by % Obesity.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
                                                     color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_MW
[Figure: scatterplot "Percentage of No Exercise vs. Smoker", colored by % Smoker.]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$High_Blood_Pres),
                                                       color = factor(signif(MW.Procedures$High_Blood_Pres, 0)))) +
  geom_point() +
  scale_color_discrete(name="% High Blood Pressure") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_MW
[Figure: scatterplot "Percentage of No Exercise vs. Percentage of High Blood Pressure", colored by % High Blood Pressure.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$CHD+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: "Multivariate Scatterplot : Region 2" pairs plot of CHD and the risk factors.]
Based on the heatmap and the multivariate scatterplot above, we can see that CHD is most highly
correlated with:
1. No exercise.
Next we find the confounding variables; accounting for confounders lets us fit the model better and makes
the predicted output less error prone. Based on the correlation heatmap and the scatterplot, we take the
confounders to be:
1. High Blood Pressure
2. Smoker
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-MW.Procedures[train_part,]
TestingCHD<-MW.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
Although the model already fits well, we will still add the confounders and then perform a multivariate
regression.
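Before fitting the combined model, one optional sanity check (not part of the original analysis) is an F-test between the nested models, which tells us whether an added confounder significantly improves the fit. The data below is synthetic, purely to illustrate the call.

```r
# Sketch (synthetic data): does adding a correlated confounder C1 to the
# exposure-only model significantly improve the fit?
set.seed(2)
n  <- 200
E  <- rnorm(n, 25, 5)                   # exposure, e.g. % no exercise
C1 <- 0.8 * E + rnorm(n, 0, 3)          # confounder correlated with the exposure
O  <- 4 * E + 3 * C1 + rnorm(n, 0, 10)  # outcome, e.g. CHD rate

small <- lm(O ~ E)
full  <- lm(O ~ E + C1)
anova(small, full)  # a small Pr(>F) means the confounder adds explanatory power
```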
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$High_Blood_Pres)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.236 -5.870 -0.862 5.417 36.030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6657 1.2036 0.553 0.581
## E 3.9856 0.4212 9.462 < 2e-16 ***
## C1 3.1306 0.4123 7.593 3.64e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.723 on 313 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681
## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.550 -5.426 -0.776 4.322 37.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6510 1.1790 2.249 0.0252 *
## E 4.0462 0.4223 9.581 < 2e-16 ***
## C2 2.8735 0.3872 7.420 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.757 on 313 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671
## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.324 -5.482 -0.944 5.139 34.475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4248 1.1767 1.211 0.227
## E 4.1168 0.3898 10.560 < 2e-16 ***
## C3 2.8134 0.3545 7.936 3.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.654 on 313 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699
## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders we identified.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.493 -5.404 -0.842 4.904 33.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3202 1.1689 1.129 0.259602
## E 2.8621 0.4635 6.175 2.07e-09 ***
## C1 1.1091 0.5728 1.936 0.053747 .
## C2 1.5669 0.4458 3.514 0.000506 ***
## C3 1.4402 0.4852 2.968 0.003230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.36 on 311 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777
## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: lm diagnostic plots for the CHD model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$High_Blood_Pres)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 47.35611 71.16578 74.24867 74.90092 88.18896 85.76173 53.21561
## 8 9 10 11 12 13 14
## 46.55992 49.44050 85.85667 62.59279 42.36807 92.85802 41.66495
## 15 16 17 18 19 20 21
## 90.65576 38.70206 44.74763 18.59404 16.51755 23.08953 19.65119
## 22 23 24 25 26 27 28
## 17.38575 19.99998 40.34271 25.57896 22.67974 18.95382 75.20951
## 29 30 31 32 33 34 35
## 20.59550 44.87894 22.94132 21.52470 28.72008 16.55805 21.38459
## 36 37 38 39 40 41 42
## 53.07606 51.01929 52.00420 22.51745 111.28415 89.03471 80.60128
## 43 44 45 46 47 48 49
## 70.85496 44.13435 65.03187 73.84524 36.00484 39.05419 62.18069
## 50 51 52 53 54 55 56
## 29.67611 63.29361 76.79059 38.52942 50.68324 49.06936 21.51555
## 57 58 59 60 61 62 63
## 50.18942 42.64157 20.31264 78.53103 20.19143 18.05501 20.39115
## 64 65 66 67 68 69 70
## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330
## 71 72 73 74 75 76
## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] -1040.415
SS.regression/SS.total
## [1] 0.854508
#This is the R-squared value for the testing data.
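A note on the test-set check above: on held-out data the decomposition SS.total = SS.regression + SS.residual no longer holds exactly (that gap is the nonzero difference printed above), so a common alternative is to report 1 - SS.residual/SS.total. A minimal helper, with made-up numbers for illustration:

```r
# Out-of-sample R^2 computed as 1 - SS.residual / SS.total.
test_r2 <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}
test_r2(c(3, 5, 7, 9), c(3.1, 4.8, 7.2, 8.9))  # 0.995
```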
Now that we have fitted the regression model for "CHD", we will do the same for lung cancer.
Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
MW.Procedures<-MW.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
MW.Procedures.Matrix<-as.matrix(MW.Procedures)
MW.Cor<-cor(MW.Procedures)
#Correlation matrix
MW.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666
## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907
## Obesity 0.9459666 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831
## Smoker 0.9494486 0.9292124 0.9400104 0.9312576
## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.9279799 0.9494486 0.8984837
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
mean(MW.Melt$value)
## [1] 0.9354118
#Summary
summary(MW.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.8688
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
MW.Melt<-MW.Melt[(!MW.Melt$value==1),]
MW.MeltMean<-mean(MW.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 2" over Diabetes, Few_Fruit_Veg, High_Blood_Pres, Lung_Cancer, No_Exercise, Obesity, Smoker; correlation scale 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS Few Fruits and Vegetables
plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedures$Smoker),
                                                    color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Smoker vs.\n Percentage of Few Fruits and Vegetables")
plt_Smovsfru_MW
[Figure: scatterplot "Percentage of Smoker vs. Percentage of Few Fruits and Vegetables", colored by % Smoker.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures$Smoker),
                                                     color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_MW
[Figure: scatterplot "Percentage of No Exercise vs. Percentage of Smoker", colored by % Smoker.]
#graph of Smoker VS Obesity
plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smoker),
                                                    color = factor(signif(MW.Procedures$Smoker, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Smoker") +
  scale_x_continuous(labels = percent, name = 'Obesity %s') +
  scale_y_continuous(labels = percent, name = 'Smoker %s') +
  ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_MW
[Figure: scatterplot "Percentage of Obesity vs. Percentage of Smoker", colored by % Smoker.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~MW.Procedures$Lung_Cancer+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW.Procedures$High_Blood_Pres+MW.Procedures$Smoker+MW.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 2")
[Figure: "Multivariate Scatterplot : Region 2" pairs plot of lung cancer and the risk factors.]
Based on the heatmap and the multivariate scatterplot above, we can see that lung cancer is most highly
correlated with:
1. Smoker.
Next we find the confounding variables; accounting for confounders lets us fit the model better and makes
the predicted output less error prone. Based on the correlation heatmap and the scatterplot, we take the
confounders to be:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
lungRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(lungRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
Although the model already fits well, we will still add the confounders and then perform a multivariate
regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=TrainingLung$No_Exercise)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2839 -1.2744 0.1438 1.3621 7.6394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1171 0.3171 -3.523 0.00049 ***
## E 1.3294 0.1012 13.133 < 2e-16 ***
## C1 1.1735 0.1075 10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.598 on 313 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269
## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8397 -1.2740 0.0911 1.3515 8.0799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 ***
## E 1.24197 0.11201 11.088 < 2e-16 ***
## C2 0.39028 0.03696 10.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.621 on 313 degrees of freedom
## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256
## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4258 -1.3278 -0.2252 1.5037 7.6570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9052 0.3127 -2.895 0.00406 **
## E 1.3013 0.1054 12.341 < 2e-16 ***
## C3 1.2172 0.1137 10.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.611 on 313 degrees of freedom
## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262
## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders we identified.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0675 -1.1707 0.1069 1.2777 7.5823
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 ***
## E 0.67286 0.11725 5.739 2.27e-08 ***
## C1 0.44770 0.13018 3.439 0.000664 ***
## C2 0.21279 0.04183 5.087 6.31e-07 ***
## C3 0.79075 0.11385 6.946 2.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.327 on 311 degrees of freedom
## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414
## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
[Figure: lm diagnostic plots for the lung cancer model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
#Now we will test our regression model on the testing data to check its performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=TestingLung$No_Exercise)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
SS.regression/SS.total
## [1] 0.8705158
#This is the R-squared value for the testing data.
Region 3 (South region)
We will now analyse the South region for the same problem. All the code and necessary visualizations
are included in the following section.
#Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South <- subset(South, South$Stroke!=0)
South <- subset(South, South$MVA!=0)
We then regressed the individual causes of death against the total number of deaths in this region. For
simplicity we include only the top three causes of death in region 3. We reached this conclusion by running
a univariate regression of total deaths on each disease and then combining the features that maximized the
R² value. The following table shows our experimental results.
| disease       | region 3 (R²) |
|---------------|---------------|
| breast cancer | 0.06          |
| mva           | 0.26          |
| chd           | 0.79          |
| colon cancer  | 0.16          |
| lung cancer   | 0.35          |
| injury        | 0.17          |
| suicide       | 0.03          |
| stroke        | 0.14          |
Based on the R² values, we consider the following diseases to be the major causes of death in the South
region.
1. CHD (coronary heart disease)
2. Lung Cancer
3. MVA (motor vehicle accidents)
#Since we have identified CHD, Lung Cancer, and MVA as the major causes of death, we will perform multivariate regression on them.
regressionModel<-lm(South$Total_Death_Causes~South$CHD+South$Lung_Cancer+South$MVA)
#Summary of regression between total deaths and diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.105 -5.062 -0.461 4.612 30.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17128 0.98511 1.189 0.235
## South$CHD 1.16308 0.02562 45.398 < 2e-16 ***
## South$Lung_Cancer 2.84552 0.07955 35.770 < 2e-16 ***
## South$MVA 1.08049 0.15806 6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Figure: lm diagnostic plots for the total-deaths model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage).]
Now that we have established the major diseases in the South region, we will analyse the relationship
between these diseases and daily human activities using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot, and then look for the confounders and
perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
SO.states<-South
SO.Procedures<-data.frame()
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8652391 0.8531583 0.8529559
## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308
## Obesity 0.8529559 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588
## Smoker 0.8387031 0.8636349 0.8920701 0.8694016
## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## CHD 0.8240507 0.8387031 0.7821009
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8792101
#Summary
summary(SO.Melt)
## x y value
## CHD :7 CHD :7 Min. :0.7821
## Diabetes :7 Diabetes :7 1st Qu.:0.8530
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.8, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: "Heat map of correlations in Risk Factors data : Region 3" over CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker; correlation scale 0.80 to 1.00.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Obesity),
                                                     color = factor(signif(SO.Procedures$Obesity, 0)))) +
  geom_point() +
  scale_color_discrete(name="% Obesity") +
  scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
  scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
  ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
plt_obevsNoEx_SO
[Figure: scatter plot "Percentage of Obesity vs. Percentage of No Exercise" — axes in percent, points colored by % Obesity.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
plt_noExvsSmo_SO
[Figure: scatter plot "Percentage of No Exercise vs. Smoker" — axes in percent, points colored by % Smoker.]
#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Few_Fruit_Veg),
color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0)))) +
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
plt_noExvsfru_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of Few Fruits And Vegetables" — axes in percent, points colored by % Few Fruits And Vegetables.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$CHD+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, CHD is most highly correlated with:
1. No Exercise.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. Few Fruits And Vegetables
2. Smoker
3. Obesity
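The confounder screening above can be sketched as a quick rule of thumb: a variable is a plausible confounder when it correlates with both the exposure and the outcome. Here is a minimal sketch on synthetic data — the variable names echo the report's columns, but the numbers are made up purely for illustration:

```r
# Synthetic illustration of confounder screening (made-up data, not SO.Procedures).
set.seed(1)
no_exercise <- rnorm(200, mean = 12, sd = 3)                           # exposure
obesity     <- 0.8 * no_exercise + rnorm(200, sd = 2)                  # candidate confounder
chd         <- 5 + 4 * no_exercise + 2 * obesity + rnorm(200, sd = 5)  # outcome

cor(obesity, no_exercise)  # high: the candidate is tied to the exposure
cor(obesity, chd)          # high: and to the outcome, so we adjust for it
```

A candidate that fails either check is unlikely to bias the coefficient on the exposure and can be left out of the model.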
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-SO.Procedures[train_part,]
TestingCHD<-SO.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has good accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## E 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on CHD and No excercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.692 -6.553 -0.426 6.782 49.965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7745 1.5323 2.463 0.0141 *
## E 3.6456 0.3685 9.893 < 2e-16 ***
## C1 3.1443 0.4080 7.707 8.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.64 on 445 degrees of freedom
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837
## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.419 -7.093 -0.348 6.464 50.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2383 1.5379 1.455 0.146
## E 3.8791 0.3103 12.503 <2e-16 ***
## C2 3.0972 0.3567 8.684 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.46 on 445 degrees of freedom
## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904
## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.863 -6.174 -0.278 6.583 48.496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0217 1.4558 2.076 0.0385 *
## E 3.3541 0.3044 11.019 <2e-16 ***
## C3 1.1467 0.1064 10.776 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.608 -6.372 -0.209 6.585 48.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9532 1.4613 1.337 0.182035
## E 2.5933 0.3754 6.907 1.72e-11 ***
## C1 0.8052 0.4921 1.636 0.102483
## C2 1.4445 0.4122 3.504 0.000505 ***
## C3 0.7514 0.1527 4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
#Now we will test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 93.11863 54.33024 52.85395 45.79852 84.97513 56.88569 90.36550
## 8 9 10 11 12 13 14
## 54.14329 27.01033 21.19971 22.31869 51.65249 44.54791 47.33240
## 15 16 17 18 19 20 21
## 82.57801 95.08141 82.71398 87.34274 77.96068 95.69180 39.98928
## 22 23 24 25 26 27 28
## 92.73972 78.85468 78.94780 80.59074 80.79515 88.44522 51.13704
## 29 30 31 32 33 34 35
## 53.76243 54.93526 44.19820 61.92209 47.95062 57.45545 51.17947
## 36 37 38 39 40 41 42
## 27.33281 62.49750 23.71136 28.54283 26.16418 27.07990 58.66436
## 43 44 45 46 47 48 49
## 102.65357 91.78205 105.83215 55.19101 80.64446 40.27956 53.95203
## 50 51 52 53 54 55 56
## 48.46951 84.13853 40.82755 102.13602 54.52694 48.51574 50.28585
## 57 58 59 60 61 62 63
## 65.78484 53.00168 50.13015 53.58431 48.56988 45.02007 80.19727
## 64 65 66 67 68 69 70
## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659
## 71 72 73 74 75 76 77
## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857
## 78 79 80 81 82 83 84
## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464
## 85 86 87 88 89 90 91
## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904
## 92 93 94 95 96 97 98
## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157
## 99 100 101 102 103 104 105
## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587
## 106 107 108 109 110
## 60.00317 46.91665 59.17558 47.64249 24.00712
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 2561.812
SS.regression/SS.total
## [1] 0.7099096
#This is the R-squared value of the regression on the testing data
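One caveat about this metric: on held-out data, SS.total generally no longer equals SS.regression + SS.residual (the nonzero difference printed above shows exactly that), so the more standard out-of-sample R-squared is 1 - SS.residual/SS.total rather than SS.regression/SS.total. A small self-contained sketch with made-up numbers:

```r
# Made-up test-set values, just to show the computation.
test.y           <- c(10, 12, 15, 20, 23)
test.predictions <- c(11, 11, 16, 19, 24)

SS.total    <- sum((test.y - mean(test.y))^2)        # 118
SS.residual <- sum((test.y - test.predictions)^2)    # 5

# Out-of-sample R^2: fraction of outcome variance the model explains.
R2 <- 1 - SS.residual / SS.total
R2  # 1 - 5/118, about 0.958
```

For the CHD model above this definition would give a slightly different number than the 0.71 reported, but the qualitative conclusion is the same.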
Now that we have fitted the regression model for "CHD", we will do the same for lung cancer. Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772
## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308
## Obesity 0.8688772 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588
## Smoker 0.9145492 0.8636349 0.8920701 0.8694016
## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.8698291 0.9145492 0.7993788
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8860905
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.7994
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, blue) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 — Diabetes, Few_Fruit_Veg, High_Blood_Pres, Lung_Cancer, No_Exercise, Obesity, and Smoker on both axes; correlation color scale from 0.900 to 1.000.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS High Blood Pressure
plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Smoker vs.\n Percentage of High Blood Pressure")
plt_Smovsblood_SO
[Figure: scatter plot "Percentage of Smoker vs. Percentage of High Blood Pressure" — axes in percent, points colored by % Smoker.]
#graph of Few Fruits and Vegetables VS Smoker
plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of Smoker")
plt_fruvsSmo_SO
[Figure: scatter plot "Percentage of Few Fruits and Vegetables vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#graph of Smoker VS Obesity
plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'Obesity %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
plt_SmovsObe_SO
[Figure: scatter plot "Percentage of Obesity vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$Lung_Cancer+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, lung cancer is most highly correlated with:
1. Smoker.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has good accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=TrainingLung$Few_Fruit_Veg)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## E 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.098 -1.752 -0.184 1.773 16.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92734 0.41643 2.227 0.0265 *
## E 1.71902 0.09419 18.250 < 2e-16 ***
## C1 0.75201 0.09266 8.115 4.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 445 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491
## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4221 -1.8371 -0.1437 1.7834 17.3119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99954 0.40928 2.442 0.015 *
## E 1.65336 0.09547 17.317 <2e-16 ***
## C2 0.70444 0.08065 8.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.078 on 445 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521
## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -1.660 -0.174 1.687 15.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14230 0.40135 2.846 0.00463 **
## E 1.50746 0.10386 14.515 < 2e-16 ***
## C3 0.29928 0.03189 9.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.044 on 445 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554
## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8100 -1.7365 -0.1321 1.7312 15.9466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79644 0.39875 1.997 0.0464 *
## E 1.29944 0.10929 11.889 < 2e-16 ***
## C1 0.18044 0.12280 1.469 0.1424
## C2 0.38893 0.09591 4.055 5.92e-05 ***
## C3 0.17908 0.04193 4.271 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 443 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626
## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: diagnostic plots for reg.EC1234 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3=TestingLung$Few_Fruit_Veg)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 882.6084
SS.regression/SS.total
## [1] 0.7934475
#This is the R-squared value of the regression on the testing data
Similarly, we have performed the same calculations for MVA. Here is the code for the MVA model.
#Creating a data frame in order to generate the correlation heat map
SO.Procedures<-SO.states[,c("MVA","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Diabetes")]
SO.Procedures.Matrix<-as.matrix(SO.Procedures)
SO.Cor<-cor(SO.Procedures)
#Correlation matrix
SO.Cor
## MVA No_Exercise Few_Fruit_Veg Obesity
## MVA 1.0000000 0.7037265 0.5939313 0.6419514
## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308
## Obesity 0.6419514 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588
## Smoker 0.6515928 0.8636349 0.8920701 0.8694016
## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## MVA 0.6708440 0.6515928 0.6625232
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
mean(SO.Melt$value)
## [1] 0.8346534
#Summary
summary(SO.Melt)
## x y value
## Diabetes :7 Diabetes :7 Min. :0.5939
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681
## MVA :7 MVA :7 Mean :0.8347
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
SO.Melt<-SO.Melt[(!SO.Melt$value==1),]
SO.MeltMean<-mean(SO.Melt$value)
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.5, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
plt_heat_blue
[Figure: Heat map of correlations in Risk Factors data : Region 3 — Diabetes, Few_Fruit_Veg, High_Blood_Pres, MVA, No_Exercise, Obesity, and Smoker on both axes; correlation color scale from 0.5 to 1.0.]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of no Exercise VS High Blood Pressure
plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Procedures$No_Exercise),
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
plt_noExvsblood_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of High Blood Pressure" — axes in percent, points colored by % No Exercise.]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures$Smoker),
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
plt_noExvsSmo_SO
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of Smoker" — axes in percent, points colored by % Smoker.]
#graph of Diabetes VS No Exercise
plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No_Exercise),
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of Diabetes vs.\n Percentage of No Exercise")
plt_noExvsDia_SO
[Figure: scatter plot "Percentage of Diabetes vs. Percentage of No Exercise" — axes in percent, points colored by % No Exercise.]
#Plotting the multivariate scatter plot in order to understand the correlation better.
pairs(~SO.Procedures$MVA+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO.Procedures$High_Blood_Pres+SO.Procedures$Smoker+SO.Procedures$Diabetes, main = "Multivariate Scatterplot : Region 3")
[Figure: Multivariate Scatterplot : Region 3 — pairwise scatter plots of MVA, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker, and Diabetes.]
Based on the heatmap and the multivariate scatterplot above, MVA is most highly correlated with:
1. No Exercise.
Next we identify the confounding variables. Accounting for confounders lets us fit the model better, so the output predictions will be less error-prone.
Based on the correlation heatmap and the scatterplot, the likely confounders are:
1. High Blood Pressure
2. Diabetes
3. Smoker
#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
mvaRegr<-lm(TrainingMVA$MVA~TrainingMVA$No_Exercise)
#Regression Summary
summary(mvaRegr)
##
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## TrainingMVA$No_Exercise 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
Although the model already has decent accuracy, we will add more confounders and then perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1=TrainingMVA$Smoker,C2=TrainingMVA$High_Blood_Pres,C3=TrainingMVA$Diabetes)
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## E 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -1.1454 -0.0693 1.0158 10.1595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52869 0.26785 5.707 2.1e-08 ***
## E 0.50745 0.05789 8.765 < 2e-16 ***
## C1 0.09930 0.06317 1.572 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5950 -1.1474 -0.0641 1.0431 10.7984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.51447 0.26458 5.724 1.91e-08 ***
## E 0.45073 0.05872 7.676 1.05e-13 ***
## C2 0.14265 0.05414 2.635 0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6516 -1.1369 -0.0745 1.1027 10.0249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60721 0.26225 6.129 1.95e-09 ***
## E 0.45541 0.05443 8.367 7.68e-16 ***
## C3 0.44833 0.15954 2.810 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4964 -1.1325 -0.0407 1.0714 10.5865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53795 0.26771 5.745 1.71e-08 ***
## E 0.39844 0.06998 5.693 2.27e-08 ***
## C1 0.02554 0.06972 0.366 0.7143
## C2 0.08004 0.06679 1.198 0.2314
## C3 0.31152 0.18517 1.682 0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
  • 24. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1=TestingColon$Obesity,C2=TestingColon$High_Bloo
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
SS.regression/SS.total
## [1] 0.7494172
#This is the R-squared value of the regression on the testing data
Region 2 (Midwest region)
We will be analyzing the Midwest region for the same problem. All the code and necessary visualizations are included in the following section.
24
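One cautionary note on the held-out R-squared computed above: on test data the training-set identity SS.total = SS.regression + SS.residual no longer holds exactly (which is why the printed difference is non-zero), so a common alternative is to report 1 - SS.residual/SS.total instead of SS.regression/SS.total. A minimal sketch on simulated data (all names here are illustrative, not from the CHSI dataset):

```r
# Sketch: out-of-sample R^2 on simulated data. On held-out observations
# the ANOVA identity SS.total = SS.regression + SS.residual breaks down,
# so 1 - SS.residual/SS.total is a standard held-out R^2.
set.seed(100)
x <- rnorm(200)
y <- 2 * x + rnorm(200)
train <- 1:160; test <- 161:200
fit  <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
pred <- predict(fit, newdata = data.frame(x = x[test]))
obs  <- y[test]
SS.total    <- sum((obs - mean(obs))^2)
SS.residual <- sum((obs - pred)^2)
R2.test <- 1 - SS.residual / SS.total
```

With a signal this strong the held-out value lands close to the training R-squared; the point is only the formula, not the particular numbers.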
  • 25. #Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest<-subset(Midwest,Midwest$Stroke!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
Next we fit regression models of the total number of deaths in this region against the individual causes of death. For simplicity we report only the top two causes of death in region 2. We reached this conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.

disease          region2 (R squared)
breast cancer    0.016
mva              0.19
chd              0.77
colon cancer     0.12
lung cancer      0.25
injury           0.11
suicide          0.05
stroke           0.13

Based on the R^2 values we consider the following diseases to be the major causes of death in the Midwest region.
1. CHD (Coronary heart disease)
2. Lung Cancer
#Since we have identified CHD and Lung Cancer as the major causes of death, we will perform multivariate regression between them and the total deaths.
regressionModel<-lm(Midwest$Total_Death_Causes~Midwest$CHD+Midwest$Lung_Cancer)
#Summary of regression between total deaths and the diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
##
25
  • 26. ## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)
[Diagnostic plots for regressionModel : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
Now that we have established the major diseases in the Midwest region, we will analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot; after that we will try to find the confounders and perform multivariate regression with training and testing data.
Following is the code which describes the procedure for CHD.
26
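The heatmap construction that follows melts a correlation matrix into long (x, y, value) form with reshape2's melt(); for reference, the same long table can also be produced in base R. A small self-contained sketch with toy data (the variable names are illustrative):

```r
# Sketch: turning a correlation matrix into long (x, y, value) form
# without reshape2, using as.data.frame(as.table(...)).
d  <- data.frame(a = 1:10, b = (1:10)^2, c = 10:1)
cm <- cor(d)
melted <- as.data.frame(as.table(cm))
names(melted) <- c("x", "y", "value")   # same columns melt() would produce
```

The resulting data frame has one row per (x, y) pair and feeds into geom_tile() exactly like the melted matrix used below.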
  • 27. MW.states<-Midwest MW.Procedures<-data.frame() #Creating a data frame inorder to generate correlation heat map MW.Procedures<-MW.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di MW.Procedures.Matrix<-as.matrix(MW.Procedures) MW.Cor<-cor(MW.Procedures) #Correlation metrix MW.Cor ## CHD No_Exercise Few_Fruit_Veg Obesity ## CHD 1.0000000 0.9109385 0.8956826 0.9072785 ## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991 ## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907 ## Obesity 0.9072785 0.9265991 0.9520907 1.0000000 ## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831 ## Smoker 0.9076982 0.9292124 0.9400104 0.9312576 ## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784 ## High_Blood_Pres Smoker Diabetes ## CHD 0.9045115 0.9076982 0.8902612 ## No_Exercise 0.9155135 0.9292124 0.9031664 ## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473 ## Obesity 0.9536831 0.9312576 0.9153784 ## High_Blood_Pres 1.0000000 0.9187944 0.9194037 ## Smoker 0.9187944 1.0000000 0.8869094 ## Diabetes 0.9194037 0.8869094 1.0000000 #Melting the correlation matrix and creating a data frame MW.Melt<-melt(data=MW.Cor,varnames = c("x","y")) MW.Melt <- MW.Melt[order(MW.Melt$value),] #Mean of the melt mean(MW.Melt$value) ## [1] 0.9275097 #Summary summary(MW.Melt) ## x y value ## CHD :7 CHD :7 Min. :0.8688 ## Diabetes :7 Diabetes :7 1st Qu.:0.9073 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188 ## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 MW.Melt<-MW.Melt[(!MW.Melt$value==1),] MW.MeltMean<-mean(MW.Melt$value) #Making various colors to geMWrate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) 27
  • 28. RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, red) ) # Heat map - using colors | used ggplot2 for the colors plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = MW.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 2") plt_heat_blue CHD Diabetes Few_Fruit_Veg High_Blood_Pres No_Exercise Obesity Smoker CHD DiabetesFew_Fruit_VegHigh_Blood_PresNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 2 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Obesity VS No Exercise plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Obesity, 0)))) + geom_point() + 28
  • 29. scale_color_discrete(name="% Obesity") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') + ggtitle(label = "Percentage of obesity vs.n Percentage of No Exercise") plt_obevsNoEx_MW 10% 10% No Exercise %s Obesity%s % Obesity 2 3 4 5 6 7 8 9 10 20 Percentage of obesity vs. Percentage of No Exercise #graph of No Exercise VS Smoker plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Smoker") plt_noExvsSmo_MW 29
  • 30. 20% 30% 40% 50% 20% 30% 40% No Exercise %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Smoker #graph of No Exercise VS High Blood Pressure plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedur color = factor(signif(MW.Procedures$High_Blood_Pres, geom_point() + scale_color_discrete(name="% High Blood Pressure") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') + ggtitle(label = "Percentage of No Excersice vs.n Percentage of High Blood Pressure") plt_noExvsblood_MW 30
  • 31. 20% 30% 40% 20% 30% 40% No Exercise %s HighBloodPressure%s % High Blood Pressure 2 3 4 5 6 7 8 9 10 20 Percentage of No Excersice vs. Percentage of High Blood Pressure #Plotting the multivariate scatter plot in order to understand the correlation better. pairs(~MW.Procedures$CHD+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obesity+MW. 31
  • 32. [Multivariate scatterplot : Region 2 — pairwise plots of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes in MW.Procedures]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most highly correlated with:
1. No Exercise
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, we can say the likely confounders are:
1. High Blood Pressure
2. Smoker
3. Obesity
#Partitioning the data into training and testing sets.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-MW.Procedures[train_part,]
TestingCHD<-MW.Procedures[-train_part,]
#Performing regression between CHD and No Exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
32
  • 33. ##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add the confounders and then perform the multivariate regression.
#Making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=TrainingCHD$High_Blood_Pres)
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
33
  • 34. ## ## Residuals: ## Min 1Q Median 3Q Max ## -23.236 -5.870 -0.862 5.417 36.030 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.6657 1.2036 0.553 0.581 ## E 3.9856 0.4212 9.462 < 2e-16 *** ## C1 3.1306 0.4123 7.593 3.64e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.723 on 313 degrees of freedom ## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681 ## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -30.550 -5.426 -0.776 4.322 37.881 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.6510 1.1790 2.249 0.0252 * ## E 4.0462 0.4223 9.581 < 2e-16 *** ## C2 2.8735 0.3872 7.420 1.11e-12 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.757 on 313 degrees of freedom ## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671 ## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -22.324 -5.482 -0.944 5.139 34.475 ## 34
  • 35. ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.4248 1.1767 1.211 0.227 ## E 4.1168 0.3898 10.560 < 2e-16 *** ## C3 2.8134 0.3545 7.936 3.74e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.654 on 313 degrees of freedom ## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699 ## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -25.493 -5.404 -0.842 4.904 33.167 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.3202 1.1689 1.129 0.259602 ## E 2.8621 0.4635 6.175 2.07e-09 *** ## C1 1.1091 0.5728 1.936 0.053747 . ## C2 1.5669 0.4458 3.514 0.000506 *** ## C3 1.4402 0.4852 2.968 0.003230 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.36 on 311 degrees of freedom ## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777 ## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 35
  • 36. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=TestingCHD$High_Blood_Pres)
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.predictions
## 1 2 3 4 5 6 7
## 47.35611 71.16578 74.24867 74.90092 88.18896 85.76173 53.21561
## 8 9 10 11 12 13 14
## 46.55992 49.44050 85.85667 62.59279 42.36807 92.85802 41.66495
## 15 16 17 18 19 20 21
## 90.65576 38.70206 44.74763 18.59404 16.51755 23.08953 19.65119
## 22 23 24 25 26 27 28
## 17.38575 19.99998 40.34271 25.57896 22.67974 18.95382 75.20951
## 29 30 31 32 33 34 35
## 20.59550 44.87894 22.94132 21.52470 28.72008 16.55805 21.38459
## 36 37 38 39 40 41 42
## 53.07606 51.01929 52.00420 22.51745 111.28415 89.03471 80.60128
## 43 44 45 46 47 48 49
## 70.85496 44.13435 65.03187 73.84524 36.00484 39.05419 62.18069
## 50 51 52 53 54 55 56
## 29.67611 63.29361 76.79059 38.52942 50.68324 49.06936 21.51555
## 57 58 59 60 61 62 63
## 50.18942 42.64157 20.31264 78.53103 20.19143 18.05501 20.39115
36
  • 37. ## 64 65 66 67 68 69 70 ## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330 ## 71 72 73 74 75 76 ## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753 test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] -1040.415 SS.regression/SS.total ## [1] 0.854508 #This is the regression value Rsquare value for testing data Now as we have fitted the regression model for “CHD” we will do the same for the lung cancer. Here is the code for the lung cancer model. #Creating a data frame inorder to generate correlation heat map MW.Procedures<-MW.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smo MW.Procedures.Matrix<-as.matrix(MW.Procedures) MW.Cor<-cor(MW.Procedures) #Correlation metrix MW.Cor ## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity ## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666 ## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991 ## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907 ## Obesity 0.9459666 0.9265991 0.9520907 1.0000000 ## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831 ## Smoker 0.9494486 0.9292124 0.9400104 0.9312576 ## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784 ## High_Blood_Pres Smoker Diabetes ## Lung_Cancer 0.9279799 0.9494486 0.8984837 ## No_Exercise 0.9155135 0.9292124 0.9031664 ## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473 ## Obesity 0.9536831 0.9312576 0.9153784 ## High_Blood_Pres 1.0000000 0.9187944 0.9194037 ## Smoker 0.9187944 1.0000000 0.8869094 ## Diabetes 0.9194037 0.8869094 1.0000000 #Melting the correlation matrix and creating a data frame MW.Melt<-melt(data=MW.Cor,varnames = c("x","y")) MW.Melt <- MW.Melt[order(MW.Melt$value),] #Mean of the melt mean(MW.Melt$value) 37
  • 38. ## [1] 0.9354118 #Summary summary(MW.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.8688 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313 ## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 MW.Melt<-MW.Melt[(!MW.Melt$value==1),] MW.MeltMean<-mean(MW.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, red) ) # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = MW.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 2") plt_heat_blue 38
  • 39. Diabetes Few_Fruit_Veg High_Blood_Pres Lung_Cancer No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresLung_CancerNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 2 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Smoker VS Few Fruit vegitables plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedure color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Smoker vs.n Percentage of Few Fruit and vegitables") plt_Smovsfru_MW 39
  • 40. 10% 10% 20% 30% 40% Few Fruits and Vegetables %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of Smoker vs. Percentage of Few Fruit and vegitables #graph of No Exercise VS Smoker plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'No Exercise %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of Smoker") plt_noExvsSmo_MW 40
  • 41. 20% 30% 40% 50% 20% 30% 40% No Exercise %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of Smoker #graph of Smoker VS Obesity plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smok color = factor(signif(MW.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'Obesity %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of Obesity vs.n Percentage of Smoker") plt_SmovsObe_MW 41
  • 42. 20% 30% 40% 50% 20% 30% 40% Obesity %s Smoker%s % Smoker 1 2 3 4 5 6 7 8 9 10 20 Percentage of Obesity vs. Percentage of Smoker #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~MW.Procedures$Lung_Cancer+MW.Procedures$No_Exercise+MW.Procedures$Few_Fruit_Veg+MW.Procedures$Obe 42
  • 43. [Multivariate scatterplot : Region 2 — pairwise plots of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes in MW.Procedures]
Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that lung cancer is most highly correlated with:
1. Smoker
Now we will find the confounding variables. Accounting for confounders lets us fit the model better, and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, we can say the likely confounders are:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#Partitioning the data into training and testing sets.
set.seed(100)
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
#Performing regression between lung cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
43
  • 44. ##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
As you can see, the model already has good accuracy; we will nevertheless add the confounders and then perform the multivariate regression.
#Making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=TrainingLung$No_Exercise)
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
44
  • 45. ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.2839 -1.2744 0.1438 1.3621 7.6394 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.1171 0.3171 -3.523 0.00049 *** ## E 1.3294 0.1012 13.133 < 2e-16 *** ## C1 1.1735 0.1075 10.911 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.598 on 313 degrees of freedom ## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269 ## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11.8397 -1.2740 0.0911 1.3515 8.0799 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 *** ## E 1.24197 0.11201 11.088 < 2e-16 *** ## C2 0.39028 0.03696 10.559 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.621 on 313 degrees of freedom ## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256 ## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -8.4258 -1.3278 -0.2252 1.5037 7.6570 ## 45
  • 46. ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.9052 0.3127 -2.895 0.00406 ** ## E 1.3013 0.1054 12.341 < 2e-16 *** ## C3 1.2172 0.1137 10.703 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.611 on 313 degrees of freedom ## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262 ## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.0675 -1.1707 0.1069 1.2777 7.5823 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 *** ## E 0.67286 0.11725 5.739 2.27e-08 *** ## C1 0.44770 0.13018 3.439 0.000664 *** ## C2 0.21279 0.04183 5.087 6.31e-07 *** ## C3 0.79075 0.11385 6.946 2.21e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.327 on 311 degrees of freedom ## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414 ## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 46
  • 47. [Diagnostic plots for reg.EC1234 : Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=TestingLung$No_Exercise)
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y<-tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
SS.regression/SS.total
## [1] 0.8705158
#This is the R-squared value of the regression on the testing data
Region 3 (South region)
We will be analyzing the South region for the same problem. All the code and necessary visualizations are included in the following section.
47
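In the confounder analysis above we judged each added variable by the change in R-squared across reg.E, reg.EC1, reg.EC2 and reg.EC3; anova() on nested models gives the corresponding F-test directly. A sketch on simulated data (variable names are illustrative, not the CHSI columns):

```r
# Sketch: F-test for whether a confounder adds explanatory power,
# via anova() on nested linear models fit to simulated data.
set.seed(100)
n  <- 300
E  <- rnorm(n)
C1 <- 0.8 * E + rnorm(n)            # confounder correlated with the exposure
O  <- 1 + 0.5 * E + 0.6 * C1 + rnorm(n)
d  <- data.frame(O, E, C1)
m0 <- lm(O ~ E, data = d)
m1 <- update(m0, . ~ . + C1)
cmp <- anova(m0, m1)                # row 2 holds the test for adding C1
p.added <- cmp[["Pr(>F)"]][2]
```

A small p-value here plays the same role as the jump in R-squared used above, but it comes with a formal significance level.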
  • 48. #Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South<-subset(South,South$Stroke!=0)
South <- subset(South, South$MVA!=0)
Next we fit regression models of the total number of deaths in this region against the individual causes of death. For simplicity we report only the top three causes of death in region 3. We reached this conclusion by running a univariate regression of total deaths on each individual disease and then combining the features that maximized the R^2 value. The following table shows our experimental results.

disease          region3 (R squared)
breast cancer    0.06
mva              0.26
chd              0.79
colon cancer     0.16
lung cancer      0.35
injury           0.17
suicide          0.03
stroke           0.14

Based on the R^2 values we consider the following diseases to be the major causes of death in the South region.
1. CHD (Coronary heart disease)
2. Lung Cancer
3. MVA (Motor Vehicle Accidents)
#Since we have identified CHD, Lung Cancer and MVA as the major causes of death, we will perform multivariate regression between them and the total deaths.
regressionModel<-lm(South$Total_Death_Causes~South$CHD+South$Lung_Cancer+South$MVA)
#Summary of regression between total deaths and the diseases we selected.
summary(regressionModel)
##
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
48
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -40.105   -5.062   -0.461    4.612   30.418
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         1.17128    0.98511   1.189    0.235
## South$CHD           1.16308    0.02562  45.398  < 2e-16 ***
## South$Lung_Cancer   2.84552    0.07955  35.770  < 2e-16 ***
## South$MVA           1.08049    0.15806   6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16

par(mfrow=c(2,2))
#plotting regression analysis
plot(regressionModel)

[Figure: diagnostic plots for regressionModel — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance).]

Now that we have established the major diseases in the South region, we will analyse the relationship between these diseases and daily human activities using multivariate regression. We use a correlation heatmap and a multivariate scatterplot. After that we will try to find the confounders and perform multivariate regression with training and testing data.
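The univariate screening behind the R-squared table above can also be generated programmatically instead of fitting each model by hand. The following is a sketch under the assumption that the cleaned South data frame uses the column names appearing in this report.

```r
# Sketch of the single-variable R-squared screening for region 3.
# Assumes the cleaned `South` data frame with the columns named in this report.
diseases <- c("Brst_Cancer", "MVA", "CHD", "Col_Cancer",
              "Lung_Cancer", "Injury", "Suicide", "Stroke")
r2 <- sapply(diseases, function(d) {
  fit <- lm(reformulate(d, response = "Total_Death_Causes"), data = South)
  summary(fit)$r.squared
})
sort(round(r2, 2), decreasing = TRUE)  # the top entries drive feature selection
```

This makes the selection rule explicit and reproducible: the three diseases with the highest univariate R^2 are the ones carried forward into the combined model.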
  • 50. Following is the code which describes the procedure for CHD. SO.states<-South SO.Procedures<-data.frame() #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("CHD","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## CHD No_Exercise Few_Fruit_Veg Obesity ## CHD 1.0000000 0.8652391 0.8531583 0.8529559 ## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308 ## Obesity 0.8529559 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588 ## Smoker 0.8387031 0.8636349 0.8920701 0.8694016 ## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## CHD 0.8240507 0.8387031 0.7821009 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8792101 #Summary summary(SO.Melt) ## x y value ## CHD :7 CHD :7 Min. :0.7821 ## Diabetes :7 Diabetes :7 1st Qu.:0.8530 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681 ## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) 50
  • 51. #Making various colors to geSOrate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, red) ) WtoGrange<-colorRampPalette(c(red, green) ) # Heat map - using colors | used ggplot2 for the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.8, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue CHD Diabetes Few_Fruit_Veg High_Blood_Pres No_Exercise Obesity Smoker CHD DiabetesFew_Fruit_VegHigh_Blood_PresNo_Exercise Obesity Smoker 0.80 0.85 0.90 0.95 1.00 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Obesity VS No Exercise plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Obesity, 0)))) + 51
  • 52. geom_point() + scale_color_discrete(name="% Obesity") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') + ggtitle(label = "Percentage of obesity vs.n Percentage of No Exercise") plt_obevsNoEx_SO 10% 10% 20% No Exercise %s Obesity%s % Obesity 2 3 4 5 6 7 8 9 10 20 Percentage of obesity vs. Percentage of No Exercise #graph of No Exercise VS Smoker plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Smoker") plt_noExvsSmo_SO 52
[Figure: scatterplot "Percentage of No Exercise vs. Smoker" — No Exercise % on the x-axis, Smoker % on the y-axis, points coloured by % Smoker.]

#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedure
color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
plt_noExvsfru_SO
  • 54. 10% 20% 30% 40% 10% 20% No Exercise %s FewFruitsAndVegetables%s % Few Fruits And Vegetables 8 9 10 20 30 40 Percentage of No Excersice vs. Percentage of Few Fruits And Vegetables #Plotting the multivariate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$CHD+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO. 54
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of CHD, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that CHD is most strongly correlated with:
1. No Exercise.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. Few Fruits And Vegetables
2. Smoker
3. Obesity

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD<-SO.Procedures[train_part,]
TestingCHD<-SO.Procedures[-train_part,]
#Performing regression between CHD and No exercise
chdRegr<-lm(TrainingCHD$CHD~TrainingCHD$No_Exercise)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
  • 56. ## ## Residuals: ## Min 1Q Median 3Q Max ## -63.726 -7.474 -0.401 7.460 53.480 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.3934 1.6142 3.341 0.000904 *** ## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 12.37 on 446 degrees of freedom ## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554 ## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16 As you can see that model is already having good accuracy we will still add more confounders and then perform the multivariate regression. #making a temporary table with all required variables temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=Training temp <- mutate(temp, O = TrainingCHD$CHD) #Regression on CHD and No Exercise reg.E <- lm(O ~ E, data = temp) summary(reg.E) ## ## Call: ## lm(formula = O ~ E, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -63.726 -7.474 -0.401 7.460 53.480 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.3934 1.6142 3.341 0.000904 *** ## E 6.2141 0.1672 37.167 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 12.37 on 446 degrees of freedom ## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554 ## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16 #Regression on CHD and No excercise with Confounder 1 reg.EC1 <- lm(O ~ E+C1, data = temp) summary(reg.EC1) ## ## Call: ## lm(formula = O ~ E + C1, data = temp) 56
  • 57. ## ## Residuals: ## Min 1Q Median 3Q Max ## -56.692 -6.553 -0.426 6.782 49.965 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.7745 1.5323 2.463 0.0141 * ## E 3.6456 0.3685 9.893 < 2e-16 *** ## C1 3.1443 0.4080 7.707 8.45e-14 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.64 on 445 degrees of freedom ## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837 ## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -58.419 -7.093 -0.348 6.464 50.962 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.2383 1.5379 1.455 0.146 ## E 3.8791 0.3103 12.503 <2e-16 *** ## C2 3.0972 0.3567 8.684 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.46 on 445 degrees of freedom ## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904 ## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on CHD and No exercise with Confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -57.863 -6.174 -0.278 6.583 48.496 ## 57
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0217     1.4558   2.076   0.0385 *
## E             3.3541     0.3044  11.019   <2e-16 ***
## C3            1.1467     0.1064  10.776   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16

#Regression with the explanatory variable and all the confounders we believe are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -55.608  -6.372  -0.209   6.585  48.139
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.9532     1.4613   1.337 0.182035
## E             2.5933     0.3754   6.907 1.72e-11 ***
## C1            0.8052     0.4921   1.636 0.102483
## C2            1.4445     0.4122   3.504 0.000505 ***
## C3            0.7514     0.1527   4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16

#plotting regression.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
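Rather than judging the added confounders by eyeballing R-squared alone, the nested models fitted above can be compared formally. This sketch (not part of the original report) uses base R's nested-model F-test and AIC on the single-predictor model `reg.E` and the full model `reg.EC1234` already in the workspace.

```r
# Sketch: formal comparison of the nested regression models fitted above.
anova(reg.E, reg.EC1234)   # F-test: is the reduction in residual SS significant?
AIC(reg.E, reg.EC1234)     # lower AIC indicates a better fit/complexity trade-off
```

Because `reg.E` uses only the predictor E while `reg.EC1234` adds C1, C2 and C3 on the same data, `anova()` treats them as nested and reports whether the extra confounders jointly improve the fit.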
  • 59. 20 40 60 80 100 120 −60060 Fitted values Residuals Residuals vs Fitted 171 43 202 −3 −2 −1 0 1 2 3 −404 Theoretical Quantiles Standardizedresiduals Normal Q−Q 171 43 202 20 40 60 80 100 120 0.01.5 Fitted values Standardizedresiduals Scale−Location 171 43 202 0.00 0.02 0.04 0.06 0.08 0.10 −604 Leverage Standardizedresiduals Cook's distance 1 0.5 0.5 Residuals vs Leverage 364 171 202 #Now we will test our regression model with testing data to check the prerformance. tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=Testing tempTest <- mutate(tempTest, O = TestingCHD$CHD) test.predictions<-predict(reg.EC1234,newdata = tempTest) test.predictions ## 1 2 3 4 5 6 7 ## 93.11863 54.33024 52.85395 45.79852 84.97513 56.88569 90.36550 ## 8 9 10 11 12 13 14 ## 54.14329 27.01033 21.19971 22.31869 51.65249 44.54791 47.33240 ## 15 16 17 18 19 20 21 ## 82.57801 95.08141 82.71398 87.34274 77.96068 95.69180 39.98928 ## 22 23 24 25 26 27 28 ## 92.73972 78.85468 78.94780 80.59074 80.79515 88.44522 51.13704 ## 29 30 31 32 33 34 35 ## 53.76243 54.93526 44.19820 61.92209 47.95062 57.45545 51.17947 ## 36 37 38 39 40 41 42 ## 27.33281 62.49750 23.71136 28.54283 26.16418 27.07990 58.66436 ## 43 44 45 46 47 48 49 ## 102.65357 91.78205 105.83215 55.19101 80.64446 40.27956 53.95203 ## 50 51 52 53 54 55 56 ## 48.46951 84.13853 40.82755 102.13602 54.52694 48.51574 50.28585 ## 57 58 59 60 61 62 63 ## 65.78484 53.00168 50.13015 53.58431 48.56988 45.02007 80.19727 59
  • 60. ## 64 65 66 67 68 69 70 ## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659 ## 71 72 73 74 75 76 77 ## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857 ## 78 79 80 81 82 83 84 ## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464 ## 85 86 87 88 89 90 91 ## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904 ## 92 93 94 95 96 97 98 ## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157 ## 99 100 101 102 103 104 105 ## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587 ## 106 107 108 109 110 ## 60.00317 46.91665 59.17558 47.64249 24.00712 test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] 2561.812 SS.regression/SS.total ## [1] 0.7099096 #This is the regression value Rsquare value for testing data Now as we have fitted the regression model for “CHD” we will do the same for the lung cancer. Here is the code for the lung cancer model. #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("Lung_Cancer","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smo SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity ## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772 ## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308 ## Obesity 0.8688772 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588 ## Smoker 0.9145492 0.8636349 0.8920701 0.8694016 ## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## Lung_Cancer 0.8698291 0.9145492 0.7993788 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 60
  • 61. ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8860905 #Summary summary(SO.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.7994 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698 ## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, red) ) WtoGrange<-colorRampPalette(c(red, blue) ) # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.9, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue 61
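The correlation matrix just computed shows the risk factors themselves are strongly intercorrelated (mostly 0.80–0.92), so coefficient estimates in the multivariable models will be inflated by collinearity. As a sketch (our own addition, assuming the `SO.Procedures` data frame built above), variance inflation factors can be computed in base R by regressing each predictor on the others:

```r
# Sketch: manual variance inflation factors for the risk-factor predictors.
# VIF = 1 / (1 - R^2) from regressing each predictor on the remaining ones;
# values well above 5-10 signal strong collinearity, which helps explain why
# some confounders lose individual significance in the joint models.
predictors <- c("No_Exercise", "Few_Fruit_Veg", "Obesity",
                "High_Blood_Pres", "Smoker", "Diabetes")
vif <- sapply(predictors, function(p) {
  fit <- lm(reformulate(setdiff(predictors, p), response = p),
            data = SO.Procedures)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 1)
```

High VIFs do not invalidate the predictions, but they caution against interpreting any single coefficient causally.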
  • 62. Diabetes Few_Fruit_Veg High_Blood_Pres Lung_Cancer No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresLung_CancerNo_Exercise Obesity Smoker 0.900 0.925 0.950 0.975 1.000 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of Smoker VS High Blood Pressure plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proce color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Smoker vs.n Percentage of High Blood Pressure") plt_Smovsblood_SO 62
  • 63. 10% 10% 20% High Blood Pressure %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Smoker vs. Percentage of High Blood Pressure #graph of Few Fruits and Vegetables VS Smoker plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedure color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fuits and Vegetable %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of Few Fuits and Vegetable vs.n Percentage of Smoker") plt_fruvsSmo_SO 63
  • 64. 10% 10% 20% 30% 40% Few Fuits and Vegetable %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Few Fuits and Vegetable vs. Percentage of Smoker #graph of Smoker VS Obesity plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smok color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent, name = 'Obesity %s') + scale_y_continuous(labels = percent, name = 'Smoker %s') + ggtitle(label = "Percentage of Obesity vs.n Percentage of Smoker") plt_SmovsObe_SO 64
  • 65. 20% 30% 40% 50% 20% 30% 40% 50% Obesity %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of Obesity vs. Percentage of Smoker #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$Lung_Cancer+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obe 65
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of Lung_Cancer, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that Lung Cancer is most strongly correlated with:
1. Smoker.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
#Performing regression between Lung Cancer and Smoker
chdRegr<-lm(TrainingLung$Lung_Cancer~TrainingLung$Smoker)
#Regression Summary
summary(chdRegr)
##
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
##
  • 67. ## Residuals: ## Min 1Q Median 3Q Max ## -17.9394 -2.0631 -0.1777 1.8757 17.4538 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.69422 0.43405 3.903 0.00011 *** ## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.328 on 446 degrees of freedom ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271 ## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16 As you can see that model is already having good accuracy we will still add more confounders and then perform the multivariate regression. #making a temporary table with all required variables temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=T temp <- mutate(temp, O = TrainingLung$Lung_Cancer) #Regression on Lung Cancer and Smoker reg.E <- lm(O ~ E, data = temp) summary(reg.E) ## ## Call: ## lm(formula = O ~ E, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.9394 -2.0631 -0.1777 1.8757 17.4538 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.69422 0.43405 3.903 0.00011 *** ## E 2.37666 0.05138 46.254 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.328 on 446 degrees of freedom ## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271 ## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 1 reg.EC1 <- lm(O ~ E+C1, data = temp) summary(reg.EC1) ## ## Call: ## lm(formula = O ~ E + C1, data = temp) ## 67
  • 68. ## Residuals: ## Min 1Q Median 3Q Max ## -18.098 -1.752 -0.184 1.773 16.095 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.92734 0.41643 2.227 0.0265 * ## E 1.71902 0.09419 18.250 < 2e-16 *** ## C1 0.75201 0.09266 8.115 4.78e-15 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.109 on 445 degrees of freedom ## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491 ## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 2 reg.EC2 <- lm(O ~ E+C2, data = temp) summary(reg.EC2) ## ## Call: ## lm(formula = O ~ E + C2, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -13.4221 -1.8371 -0.1437 1.7834 17.3119 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.99954 0.40928 2.442 0.015 * ## E 1.65336 0.09547 17.317 <2e-16 *** ## C2 0.70444 0.08065 8.735 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.078 on 445 degrees of freedom ## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521 ## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16 #Regression on Lung Cancer and Smoker with confounder 3 reg.EC3 <- lm(O ~ E+C3, data = temp) summary(reg.EC3) ## ## Call: ## lm(formula = O ~ E + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -18.485 -1.660 -0.174 1.687 15.611 ## ## Coefficients: 68
  • 69. ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.14230 0.40135 2.846 0.00463 ** ## E 1.50746 0.10386 14.515 < 2e-16 *** ## C3 0.29928 0.03189 9.385 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.044 on 445 degrees of freedom ## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554 ## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16 #Regression with Explainatory variable and all the counfounders that we think are there. reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp) summary(reg.EC1234) ## ## Call: ## lm(formula = O ~ E + C1 + C2 + C3, data = temp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.8100 -1.7365 -0.1321 1.7312 15.9466 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.79644 0.39875 1.997 0.0464 * ## E 1.29944 0.10929 11.889 < 2e-16 *** ## C1 0.18044 0.12280 1.469 0.1424 ## C2 0.38893 0.09591 4.055 5.92e-05 *** ## C3 0.17908 0.04193 4.271 2.38e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.967 on 443 degrees of freedom ## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626 ## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16 #plotting regression. par(mfrow=c(2,2)) plot(reg.EC1234) abline(reg.EC1234) ## Warning in abline(reg.EC1234): only using the first two of 5 regression ## coefficients 69
  • 70. 5 10 15 20 25 30 35 −1010 Fitted values Residuals Residuals vs Fitted 211 382 369 −3 −2 −1 0 1 2 3 −426 Theoretical Quantiles Standardizedresiduals Normal Q−Q 382 211 369 5 10 15 20 25 30 35 0.01.5 Fitted values Standardizedresiduals Scale−Location 382211 369 0.00 0.02 0.04 0.06 0.08 −604 Leverage Standardizedresiduals Cook's distance 0.5 0.5 Residuals vs Leverage 382 369360 #Now we will test our regression model with testing data to check the performance. tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3= tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer) test.predictions<-predict(reg.EC1234,newdata = tempTest) test.y<-tempTest$O SS.total <- sum((test.y - mean(test.y))^2) SS.residual <- sum((test.y - test.predictions)^2) SS.regression <- sum((test.predictions - mean(test.y))^2) SS.total - (SS.regression+SS.residual) ## [1] 882.6084 SS.regression/SS.total ## [1] 0.7934475 #This is the regression value Rsquare value for testing data Similarly , we have done calculations for MVA. Here is the code for the MVA model. 70
  • 71. #Creating a data frame inorder to generate correlation heat map SO.Procedures<-SO.states[,c("MVA","No_Exercise","Few_Fruit_Veg","Obesity","High_Blood_Pres","Smoker","Di SO.Procedures.Matrix<-as.matrix(SO.Procedures) SO.Cor<-cor(SO.Procedures) #Correlation metrix SO.Cor ## MVA No_Exercise Few_Fruit_Veg Obesity ## MVA 1.0000000 0.7037265 0.5939313 0.6419514 ## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475 ## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308 ## Obesity 0.6419514 0.9031475 0.9171308 1.0000000 ## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588 ## Smoker 0.6515928 0.8636349 0.8920701 0.8694016 ## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419 ## High_Blood_Pres Smoker Diabetes ## MVA 0.6708440 0.6515928 0.6625232 ## No_Exercise 0.8680525 0.8636349 0.8543135 ## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958 ## Obesity 0.8827588 0.8694016 0.8497419 ## High_Blood_Pres 1.0000000 0.8691031 0.8753804 ## Smoker 0.8691031 1.0000000 0.8209986 ## Diabetes 0.8753804 0.8209986 1.0000000 #Melting the correlation matrix and creating a data frame SO.Melt<-melt(data=SO.Cor,varnames = c("x","y")) SO.Melt <- SO.Melt[order(SO.Melt$value),] #Mean of the melt mean(SO.Melt$value) ## [1] 0.8346534 #Summary summary(SO.Melt) ## x y value ## Diabetes :7 Diabetes :7 Min. :0.5939 ## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033 ## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681 ## MVA :7 MVA :7 Mean :0.8347 ## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921 ## Obesity :7 Obesity :7 Max. :1.0000 ## Smoker :7 Smoker :7 SO.Melt<-SO.Melt[(!SO.Melt$value==1),] SO.MeltMean<-mean(SO.Melt$value) #Making various colors to generate dynamic range of colors using a given pallate red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1) RtoWrange<-colorRampPalette(c(white, blue) ) WtoGrange<-colorRampPalette(c(blue, green) ) 71
  • 72. # Heat map - using colors | used ggplot2 fr the colors plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) + theme(panel.background = element_rect(fill = "snow2")) + geom_tile(aes(fill = value)) + scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray", midpoint = SO.MeltMean, limits = c(0.5, 1), name = "Correlations") + scale_x_discrete(expand = c(0,0)) + scale_y_discrete(expand = c(0,0)) + labs(x=NULL, y=NULL) + theme(panel.background = element_rect(fill = "snow2")) + ggtitle("Heat map of correlations in Risk Factors data : Region 3") plt_heat_blue Diabetes Few_Fruit_Veg High_Blood_Pres MVA No_Exercise Obesity Smoker DiabetesFew_Fruit_VegHigh_Blood_PresMVA No_Exercise Obesity Smoker 0.5 0.6 0.7 0.8 0.9 1.0 Correlations Heat map of correlations in Risk Factors data : Region 3 percent <- c("10%","20%","30%","40%","50%") breaks <- c(10,20,30,40,50) #graph of no Exercise VS High Blood Pressure plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proc color = factor(signif(SO.Procedures$No_Exercise, 0)))) + geom_point() + scale_color_discrete(name="% No Exercise") + scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'No Excercise %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of High Blood Pressure") plt_noExvsblood_SO 72
  • 73. 10% 20% 10% 20% High Blood Pressure %s NoExcercise%s % No Exercise 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of High Blood Pressure #graph of No Exercise VS Smoker plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures color = factor(signif(SO.Procedures$Smoker, 0)))) + geom_point() + scale_color_discrete(name="% Smoker") + scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') + ggtitle(label = "Percentage of No Exercise vs.n Percentage of Smoker") plt_noExvsSmo_SO 73
  • 74. 10% 10% 20% No Exercise %s Smoker%s % Smoker 2 3 4 5 6 7 8 9 10 20 Percentage of No Exercise vs. Percentage of Smoker #graph of Diabetes VS No Exercise plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No color = factor(signif(SO.Procedures$No_Exercise, 0)) geom_point() + scale_color_discrete(name="% No Exercise") + scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') + scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') + ggtitle(label = "Percentage of Diabetes vs.n Percentage of No Exercise") plt_noExvsDia_SO 74
  • 75. 10% 20% Diabetes %s NoExercise%s % No Exercise 2 3 4 5 6 7 8 9 10 20 Percentage of Diabetes vs. Percentage of No Exercise #Plotting the multi variate scatter plot in order to understand the correlation better. pairs(~SO.Procedures$MVA+SO.Procedures$No_Exercise+SO.Procedures$Few_Fruit_Veg+SO.Procedures$Obesity+SO. 75
[Figure: "Multivariate Scatterplot : Region 3" — pairs() matrix of MVA, No_Exercise, Few_Fruit_Veg, Obesity, High_Blood_Pres, Smoker and Diabetes.]

Based on the analysis of the above heatmap and the multivariate scatterplot, we can see that MVA is most strongly correlated with:
1. No Exercise.
Now we will find the confounding variables. Using confounding variables we can fit the model better and the output prediction will be less error prone. Based on the correlation heatmap and the scatterplot, the confounders are:
1. High Blood Pressure
2. Diabetes
3. Smoker

#partitioning the data into Training and Testing set.
set.seed(100)
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
mvaRegr<-lm(TrainingMVA$MVA~TrainingMVA$No_Exercise)
#Regression Summary
summary(mvaRegr)
##
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
  • 77. ##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.0631 -1.2072 -0.0759  1.0708  9.8437
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              1.60141    0.26426    6.06  2.9e-09 ***
## TrainingMVA$No_Exercise  0.58738    0.02772   21.19  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16

As you can see, the model already has decent accuracy; nevertheless, we will add the confounders and then perform the multivariate regression.

#Making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1 = TrainingMVA$Smoker,
                   C2 = TrainingMVA$High_Blood_Pres, C3 = TrainingMVA$Diabetes)
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
summary(reg.E)
##
## Call:
## lm(formula = O ~ E, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.0631 -1.2072 -0.0759  1.0708  9.8437
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.60141    0.26426    6.06  2.9e-09 ***
## E            0.58738    0.02772   21.19  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E + C1, data = temp)
summary(reg.EC1)
##
## Call:
## lm(formula = O ~ E + C1, data = temp)
  • 78. ##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.9779 -1.1454 -0.0693  1.0158 10.1595
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.52869    0.26785   5.707  2.1e-08 ***
## E            0.50745    0.05789   8.765  < 2e-16 ***
## C1           0.09930    0.06317   1.572    0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E + C2, data = temp)
summary(reg.EC2)
##
## Call:
## lm(formula = O ~ E + C2, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.5950 -1.1474 -0.0641  1.0431 10.7984
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.51447    0.26458   5.724 1.91e-08 ***
## E            0.45073    0.05872   7.676 1.05e-13 ***
## C2           0.14265    0.05414   2.635  0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16

#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E + C3, data = temp)
summary(reg.EC3)
##
## Call:
## lm(formula = O ~ E + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.6516 -1.1369 -0.0745  1.1027 10.0249
##
  • 79. ## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.60721    0.26225   6.129 1.95e-09 ***
## E            0.45541    0.05443   8.367 7.68e-16 ***
## C3           0.44833    0.15954   2.810  0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16

#Regression with the explanatory variable and all the confounders that we identified.
reg.EC1234 <- lm(O ~ E + C1 + C2 + C3, data = temp)
summary(reg.EC1234)
##
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.4964 -1.1325 -0.0407  1.0714 10.5865
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.53795    0.26771   5.745 1.71e-08 ***
## E            0.39844    0.06998   5.693 2.27e-08 ***
## C1           0.02554    0.06972   0.366   0.7143
## C2           0.08004    0.06679   1.198   0.2314
## C3           0.31152    0.18517   1.682   0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16

#Plotting the regression diagnostics.
par(mfrow=c(2,2))
plot(reg.EC1234)
abline(reg.EC1234)
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
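The progression from reg.E to reg.EC1234 above is essentially an omitted-variable check: when a confounder drives both the exposure and the outcome, leaving it out inflates the exposure's coefficient (here, E's slope drops from about 0.59 alone to about 0.40 with all confounders included). A minimal Python sketch of that effect on synthetic data (all names and generating coefficients are illustrative, not from the CHSI data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 448

# Synthetic data (hypothetical): the confounder c drives both e and o.
c = rng.normal(0, 1, n)                              # confounder, e.g. a Diabetes stand-in
e = 10 + 2.0 * c + rng.normal(0, 1, n)               # exposure, e.g. a No_Exercise stand-in
o = 1.5 + 0.4 * e + 0.5 * c + rng.normal(0, 1, n)    # outcome: true effect of e is 0.4

def ols(y, *cols):
    """Least-squares fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_simple = ols(o, e)      # O ~ E     : e's slope absorbs part of c's effect (~0.6)
b_adjust = ols(o, e, c)   # O ~ E + C : e's slope moves back toward the true 0.4
print(round(b_simple[1], 2), round(b_adjust[1], 2))
```

The adjusted slope is the honest one; this is why the report fits the confounders jointly before trusting the coefficient on No Exercise. One further note on the diagnostics above: `abline()` draws a single straight line, so R warns when it is handed a model with five coefficients, which is why only `plot(reg.EC1234)` produces meaningful panels here.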