Project : Mortality Rate Analysis in USA for deadly
Jatri Dave (jad752) , Prashantkumar Patel (pnp249)
December 14, 2016
Project Outline
We obtained the data from CHSI (Community Health Status Indicators). In this project we try to identify
the leading causes of death in the four major regions of the USA (Northeast, West, Midwest, South).
Once the major causes of death are identified, we analyse which everyday lifestyle factors contribute
most to them.
The steps we implemented to address this problem are briefly explained below.
Data Cleaning and Normalizing.
When we first gathered the data, we did not realize how much of it was missing. The data set also contained
hundreds of features that we did not need, so we selected only those required for the project. Moreover, the
data was not on a common scale: some values were percentages, some were rates per 100,000 people, some were
raw population counts, and so on. We therefore needed a common scale on which to normalize them. Furthermore,
the data did not cover a single time period: some values covered 1999-2003, others covered 1995-2003, and so
on. Hence, we normalized the data to individual years. We performed all of these operations in Microsoft
Excel (2016). We then combined the necessary features into a comma-separated values (CSV) file, which we read
directly into R for the project.
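The report does this step in Excel and does not show the formulas. As a rough sketch only, the three conversions described above could look like this in R. The helper names and the toy numbers are hypothetical, and dividing a multi-year total by the number of years is an assumption about what normalizing to individual years means here.

```r
# Hypothetical helpers and toy numbers; these are not the CHSI columns.

# Convert a percentage of the population to a rate per 100,000 people.
pct_to_per_100k <- function(pct) pct / 100 * 1e5

# Convert a raw count to a rate per 100,000 people.
count_to_per_100k <- function(count, population) count / population * 1e5

# Convert a total accumulated over a span of years to an average per year.
# Both end years are counted, so 1999-2003 covers 5 years.
per_year <- function(total, first_year, last_year) {
  total / (last_year - first_year + 1)
}

pct_to_per_100k(25)              # 25% of the population
count_to_per_100k(120, 60000)    # 120 cases in a county of 60,000 people
per_year(450, 1999, 2003)        # 450 deaths recorded over 1999-2003
```

Keeping the conversions as small functions makes it easy to apply the same rule to every column that shares a unit.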
Partitioning the data by region.
The code is shown below.
#Reading the data
data<-read.csv("E:/NYU/1/Foundation of Data Science/Projects/Foundations-of-Data-Science/USADataCleanPra
#Adding new column for region selection
data[,"region"] <- NA
#Removing unnecessary columns
#Partitioning the data based on regions. We have manually used the names of the states in order to crea
data$region[data$CHSI_State_Name=="Connecticut" | data$CHSI_State_Name=="Maine" | data$CHSI_State_Name==
data$region[data$CHSI_State_Name=="Illinois" | data$CHSI_State_Name=="Indiana" | data$CHSI_State_Name=="
data$region[data$CHSI_State_Name=="Delaware" | data$CHSI_State_Name=="Florida" | data$CHSI_State_Name=="
data$region[data$CHSI_State_Name=="Arizona" | data$CHSI_State_Name=="Colorado" | data$CHSI_State_Name=="
#converting state names into lower case letters
data$CHSI_State_Name <- tolower(data$CHSI_State_Name)
#Creating separate data sets for each region so that we can perform analysis on these separate data
Northeast<- data[data$region==1,]
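The code above assigns regions by listing state names by hand. A shorter alternative, shown here only as a sketch, is a lookup table built from R's built-in `state.name` and `state.region` data sets. It assumes that the Census grouping in `state.region` matches the manual grouping used in this report, which has not been checked here. The toy data frame is hypothetical, and the lookup has to run before the state names are converted to lower case.

```r
# Built-in R data sets: state.name (50 state names) and state.region
# (a factor with levels Northeast, South, North Central, West).
region_of <- setNames(as.character(state.region), state.name)
region_of[region_of == "North Central"] <- "Midwest"  # same area, other name

# Toy stand-in for the CHSI table (hypothetical rows).
toy <- data.frame(CHSI_State_Name = c("Maine", "Illinois", "Florida", "Arizona"),
                  stringsAsFactors = FALSE)

# Names that are not in state.name (for example "District of Columbia")
# come back as NA and would need to be handled separately.
toy$region <- unname(region_of[toy$CHSI_State_Name])

# One data frame per region.
by_region <- split(toy, toy$region)
names(by_region)
```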
Region 1 (NorthEast region)
We now analyze the Northeast region. All the code and the necessary visualizations are included in the
following section.
#Cleaning the data in region 1.
#Removing missing data.
Northeast <- subset(Northeast, Northeast$No_Exercise!=0)
Northeast <- subset(Northeast, Northeast$Few_Fruit_Veg!=0)
Northeast <- subset(Northeast, Northeast$Obesity!=0)
Northeast <- subset(Northeast, Northeast$High_Blood_Pres!=0)
Northeast <- subset(Northeast, Northeast$Smoker!=0)
Northeast <- subset(Northeast, Northeast$Diabetes!=0)
Northeast <- subset(Northeast, Northeast$Lung_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Col_Cancer!=0)
Northeast <- subset(Northeast, Northeast$CHD!=0)
Northeast <- subset(Northeast, Northeast$Brst_Cancer!=0)
Northeast <- subset(Northeast, Northeast$Suicide!=0)
Northeast <- subset(Northeast, Northeast$Total_Death_Causes!=0)
Northeast <- subset(Northeast, Northeast$Injury!=0)
Northeast <- subset(Northeast, Northeast$MVA!=0)
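The repeated `subset()` calls above drop every row that has a 0 in any of the listed columns. A more compact way to express the same filter is sketched below on a small made-up data frame. The column list is shortened for the example, and treating `NA` as missing as well is an extra assumption that the original code does not make.

```r
# Columns in which a value of 0 is treated as missing
# (three of the report's column names, shortened for the toy example).
cols <- c("No_Exercise", "Obesity", "CHD")

# Toy data (hypothetical values).
toy <- data.frame(No_Exercise = c(20, 0, 25, 30),
                  Obesity     = c(22, 24, 0, 28),
                  CHD         = c(150, 160, 170, 180))

# Flag every cell that is 0 or NA, then keep rows with no flagged cell.
bad  <- toy[, cols] == 0 | is.na(toy[, cols])
keep <- rowSums(bad) == 0
cleaned <- toy[keep, ]
nrow(cleaned)
```

Adding or removing a column then means editing `cols` only, and the same lines work unchanged for every region.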
Next we regressed the total number of deaths on each individual cause of death. For simplicity we include
only the top two causes of death in region 1. We selected them by running a single-variable regression of
total deaths on each disease and then combining the features that gave the maximum R^2 value. The following
table shows the results of our experiments.
Disease          Region 1 (R squared)
Breast cancer    0.04
MVA              0.17
CHD              0.71
Colon cancer     0.24
Lung cancer      0.16
Injury           0.1
Suicide          0.07
Stroke           0.04
Based on the R^2 values, we consider the following diseases to be the major causes of death in the
Northeast region.
1. CHD (Coronary heart disease)
2. Colon Cancer
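The code behind the R^2 table is not shown in the report. One way such a table could be produced is to loop over the candidate causes and fit one single-variable regression per cause, as sketched below. The data is synthetic and only the column names follow the report.

```r
set.seed(1)
# Synthetic stand-in data; values are made up.
n <- 100
toy <- data.frame(CHD        = rnorm(n, 100, 30),
                  Col_Cancer = rnorm(n, 8, 2),
                  Suicide    = rnorm(n, 5, 1))
toy$Total_Death_Causes <- toy$CHD + 8 * toy$Col_Cancer + rnorm(n, 0, 10)

causes <- c("CHD", "Col_Cancer", "Suicide")

# Fit Total_Death_Causes ~ <cause> for each cause and keep the R-squared.
r2 <- sapply(causes, function(cause) {
  fit <- lm(reformulate(cause, response = "Total_Death_Causes"), data = toy)
  summary(fit)$r.squared
})
round(sort(r2, decreasing = TRUE), 2)
```

Note that a high R^2 here says that a cause tracks total deaths closely across counties; it does not by itself say how many deaths the cause accounts for.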
#Since we have taken the CHD and Colon Cancer as the major reason why people are dying we will perform m
#Summary of regression between total deaths and diseases we selected.
## Call:
## lm(formula = Northeast$Total_Death_Causes ~ Northeast$CHD + Northeast$Col_Cancer)
## Residuals:
## Min 1Q Median 3Q Max
## -24.766 -5.234 -1.086 5.389 28.828
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.05596 2.51849 3.596 0.000433 ***
## Northeast$CHD 1.00017 0.04322 23.141 < 2e-16 ***
## Northeast$Col_Cancer 8.15527 0.44732 18.231 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9.382 on 157 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9616
## F-statistic: 1991 on 2 and 157 DF, p-value: < 2.2e-16
#plotting regression analysis
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage with Cook's distance)]
Now that we have established the major diseases in the Northeast region, we analyse the relationship
between these diseases and everyday lifestyle factors using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot. After that we try to identify the confounders
and perform multivariate regression with training and testing data.
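The correlation step described above can be sketched in base R as follows. The data is synthetic. The report itself uses `melt()` to reshape the matrix; the `as.table()` route below is a swapped-in base R alternative that yields the same long layout of one row per pair of variables.

```r
set.seed(2)
n <- 50
shared <- rnorm(n)   # common component so that the toy columns correlate
toy <- data.frame(CHD         = shared + rnorm(n, sd = 0.5),
                  No_Exercise = shared + rnorm(n, sd = 0.5),
                  Obesity     = shared + rnorm(n, sd = 0.5))

cor_mat <- cor(toy)   # symmetric matrix with 1s on the diagonal

# Long format: one row per (x, y) pair.
cor_long <- as.data.frame(as.table(cor_mat), responseName = "value")
names(cor_long)[1:2] <- c("x", "y")
cor_long <- cor_long[order(cor_long$value), ]

# Mean correlation, usable as the midpoint of a colour scale.
mean(cor_long$value)
```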
The following code describes the procedure for CHD.
#Creating a data frame in order to generate the correlation heat map
#Correlation matrix
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8886204 0.8202385 0.7790907
## No_Exercise 0.8886204 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.8202385 0.8788325 1.0000000 0.9008216
## Obesity 0.7790907 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.7685505 0.8430622 0.9158679 0.8826446
## Smoker 0.7693031 0.8041099 0.8774229 0.9155866
## Diabetes 0.7731015 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## CHD 0.7685505 0.7693031 0.7731015
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
## [1] 0.8705658
## x y value
## CHD :7 CHD :7 Min. :0.7686
## Diabetes :7 Diabetes :7 1st Qu.:0.8340
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8706
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9008
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
[Figure: Heat map of correlations in Risk Factors data : Region 1 (axes: CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker)]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
[Figure: Percentage of obesity vs. Percentage of No Exercise (x axis: No Exercise %s; legend: % Obesity)]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
[Figure: Percentage of No Exercise vs. Percentage of Few Fruits and Vegetables (x axis: No Exercise %s; legend: % Few Fruits and Vegetables)]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedur
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
[Figure: Percentage of No Exercise vs. Percentage of High Blood Pressure (x axis: No Exercise %s; legend: % High Blood Pressure)]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 1]
Based on the heatmap and the multivariate scatterplot above, CHD is most highly correlated with
1. No exercise.
Next we look for confounding variables. Accounting for confounders lets the model fit better and makes its
predictions less error prone.
Based on the correlation heatmap and the scatterplot, we treat the following as candidate confounders:
1. High Blood Pressure
2. Diabetes
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = NE.Procedures$CHD,p = 0.80,list = FALSE)
TrainingCHD <- NE.Procedures[train_part,]
TestingCHD <- NE.Procedures[-train_part,]
#Performing regression between CHD and No exercise
#Regression Summary
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## TrainingCHD$No_Exercise 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
Although this model already fits well, we add the candidate confounders and then perform the multivariate
regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$High_Blood_Pres,C3
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -27.186 -6.009 -1.259 6.963 76.044
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7172 3.3207 2.324 0.0217 *
## E 6.7704 0.3225 20.996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.51 on 126 degrees of freedom
## Multiple R-squared: 0.7777, Adjusted R-squared: 0.7759
## F-statistic: 440.8 on 1 and 126 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -25.604 -6.818 -1.170 5.811 77.683
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9462 3.4303 1.733 0.0855 .
## E 5.7057 0.6647 8.583 3.06e-14 ***
## C1 1.3495 0.7389 1.826 0.0702 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.39 on 125 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.78
## F-statistic: 226.2 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with High Blood pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -25.440 -6.496 -1.075 5.928 81.639
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3274 3.4403 1.549 0.1240
## E 5.6011 0.6125 9.145 1.39e-15 ***
## C2 1.2760 0.5716 2.232 0.0274 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.31 on 125 degrees of freedom
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7828
## F-statistic: 229.9 on 2 and 125 DF, p-value: < 2.2e-16
#Regression on CHD and No Exercise with Diabetes
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -27.229 -6.466 -1.525 6.379 79.871
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.057 3.284 2.149 0.0336 *
## E 5.612 0.612 9.171 1.2e-15 ***
## C3 4.022 1.817 2.213 0.0287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.32 on 125 degrees of freedom
## Multiple R-squared: 0.7861, Adjusted R-squared: 0.7827
## F-statistic: 229.7 on 2 and 125 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -25.767 -6.870 -1.347 5.835 81.655
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5001 3.5445 1.552 0.123
## E 5.1938 0.7261 7.153 6.68e-11 ***
## C1 0.4091 0.9195 0.445 0.657
## C2 0.7147 0.7485 0.955 0.342
## C3 2.0806 2.4541 0.848 0.398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.34 on 123 degrees of freedom
## Multiple R-squared: 0.7886, Adjusted R-squared: 0.7817
## F-statistic: 114.7 on 4 and 123 DF, p-value: < 2.2e-16
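A common way to read the models above is to look at how much the coefficient of the explanatory variable E (No_Exercise) moves when a candidate confounder is added. The sketch below only does that arithmetic on the estimates printed in the summaries above. The vector names are made up for the example, and the report does not state a cut-off for how large a change has to be for a variable to count as a confounder.

```r
# Coefficient of E (No_Exercise) copied from the regression summaries above.
coef_E <- c(E_only = 6.7704,   # O ~ E
            E_C1   = 5.7057,   # O ~ E + C1
            E_C2   = 5.6011,   # O ~ E + C2
            E_C3   = 5.612,    # O ~ E + C3
            E_all  = 5.1938)   # O ~ E + C1 + C2 + C3

# Percent change relative to the model with E alone.
pct_change <- 100 * (coef_E - coef_E["E_only"]) / coef_E["E_only"]
round(pct_change, 1)
```

The estimate for E drops by roughly a quarter in the full model, while the adjusted R-squared reported above moves only from 0.7759 to 0.7817. The added variables therefore shift the estimated effect of E much more than they improve the overall fit.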
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$High_Blood_Pres,C
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 8567.965
#This is the R-squared value for the testing data
SS.regression/SS.total
## [1] 0.5136492
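The first number printed above, 8567.965, shows that on held-out data the total sum of squares is not equal to the regression and residual sums of squares added together. As a result, the two usual formulas for R-squared no longer agree on test data. The sketch below uses synthetic data and a plain random split instead of `createDataPartition`. It computes the test-set R-squared as 1 - SS.residual/SS.total, a common alternative that is named here as a swapped-in choice and is not the report's calculation. It also reports the root mean squared error.

```r
set.seed(3)
n <- 200
toy <- data.frame(E = runif(n, 5, 20))
toy$O <- 8 + 6 * toy$E + rnorm(n, sd = 12)

# Plain 80/20 random split.
idx   <- sample(seq_len(n), size = round(0.8 * n))
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(O ~ E, data = train)
pred <- predict(fit, newdata = test)

ss_total    <- sum((test$O - mean(test$O))^2)
ss_residual <- sum((test$O - pred)^2)

r2_test <- 1 - ss_residual / ss_total      # can be negative for a poor model
rmse    <- sqrt(mean((test$O - pred)^2))   # same units as the outcome
c(r2_test = r2_test, rmse = rmse)
```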
Now that we have fitted the regression model for CHD, we do the same for colon cancer.
Here is the code for the colon cancer model.
#Creating a data frame in order to generate the correlation heat map
#Correlation matrix
## Col_Cancer No_Exercise Few_Fruit_Veg Obesity
## Col_Cancer 1.0000000 0.8447630 0.9067467 0.8433966
## No_Exercise 0.8447630 1.0000000 0.8788325 0.8627715
## Few_Fruit_Veg 0.9067467 0.8788325 1.0000000 0.9008216
## Obesity 0.8433966 0.8627715 0.9008216 1.0000000
## High_Blood_Pres 0.8606917 0.8430622 0.9158679 0.8826446
## Smoker 0.8306595 0.8041099 0.8774229 0.9155866
## Diabetes 0.7996891 0.8495609 0.8339938 0.8830184
## High_Blood_Pres Smoker Diabetes
## Col_Cancer 0.8606917 0.8306595 0.7996891
## No_Exercise 0.8430622 0.8041099 0.8495609
## Few_Fruit_Veg 0.9158679 0.8774229 0.8339938
## Obesity 0.8826446 0.9155866 0.8830184
## High_Blood_Pres 1.0000000 0.8503166 0.8839868
## Smoker 0.8503166 1.0000000 0.8479606
## Diabetes 0.8839868 0.8479606 1.0000000
#Melting the correlation matrix and creating a data frame
NE.Melt<-melt(data=NE.Cor,varnames = c("x","y"))
NE.Melt <- NE.Melt[order(NE.Melt$value),]
#Mean of the melt
## [1] 0.8822818
## x y value
## Col_Cancer :7 Col_Cancer :7 Min. :0.7997
## Diabetes :7 Diabetes :7 1st Qu.:0.8448
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8774
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8823
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9067
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, green) )
WtoGrange<-colorRampPalette(c(green, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = NE.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = NE.MeltMean,
limits = c(0.7, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 1")
[Figure: Heat map of correlations in Risk Factors data : Region 1 (axes: Col_Cancer, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker)]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS Few Fruits and Vegetables
plt_obevsNoEx_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Procedur
color = factor(signif(NE.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of Few Fruits and Vegetables")
[Figure: Percentage of obesity vs. Percentage of Few Fruits and Vegetables (legend: % Obesity)]
#graph of No Exercise VS Few Fruits and Vegetables
plt_noExvsfru_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$No_Exercise), y = (NE.Procedures
color = factor(signif(NE.Procedures$Few_Fruit_Veg,
geom_point() +
scale_color_discrete(name="% Few Fruits and Vegetables") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits and Vegetables")
[Figure: Percentage of No Exercise vs. Percentage of Few Fruits and Vegetables (x axis: No Exercise %s; legend: % Few Fruits and Vegetables)]
#graph of Few Fruits and Vegetables VS High Blood Pressure
plt_noExvsblood_NE <- ggplot(data = NE.Procedures, aes(x = (NE.Procedures$Few_Fruit_Veg), y = (NE.Proced
color = factor(signif(NE.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetables vs.\n Percentage of High Blood Pressure")
[Figure: Percentage of Few Fruits and Vegetables vs. Percentage of High Blood Pressure (x axis: Few Fruits and Vegetables %s; legend: % High Blood Pressure)]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 1]
Based on the heatmap and the multivariate scatterplot above, colon cancer is most highly correlated with
1. Few fruits and vegetables.
Next we look for confounding variables. Accounting for confounders lets the model fit better and makes its
predictions less error prone.
Based on the correlation heatmap and the scatterplot, we treat the following as candidate confounders:
1. No Exercise
2. High Blood Pressure
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = NE.Procedures$Col_Cancer,p = 0.80,list = FALSE)
TrainingColon <- NE.Procedures[train_part,]
TestingColon <- NE.Procedures[-train_part,]
#Performing regression between Colon Cancer and Few fruits and vegetables
#Regression Summary
## Call:
## lm(formula = TrainingColon$Col_Cancer ~ TrainingColon$Few_Fruit_Veg)
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## TrainingColon$Few_Fruit_Veg 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
Although this model already fits well, we add the candidate confounders and then perform the multivariate
regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingColon$Few_Fruit_Veg, C1=TrainingColon$Obesity,C2=TrainingColon$High_Blood
temp <- mutate(temp, O = TrainingColon$Col_Cancer)
#Regression on Colon Cancer and Few Fruits and Vegetables
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -3.3197 -0.6742 -0.0447 0.5826 3.8689
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67978 0.33268 2.043 0.0431 *
## E 0.26134 0.01047 24.971 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.102 on 127 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 623.5 on 1 and 127 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with Obesity
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -3.1345 -0.6380 -0.0406 0.5283 3.7629
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66357 0.33068 2.007 0.0469 *
## E 0.22619 0.02392 9.455 2.32e-16 ***
## C1 0.12091 0.07412 1.631 0.1054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.095 on 126 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8317
## F-statistic: 317.2 on 2 and 126 DF, p-value: < 2.2e-16
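In the two summaries above, the standard error of E roughly doubles (0.01047 to 0.02392) once Obesity (C1) is added. A likely reason is that the two predictors are strongly correlated: the correlation matrix for this region gives 0.9008216 between Few_Fruit_Veg and Obesity. With two predictors, the variance inflation factor is 1 / (1 - r^2), and its square root is the expected inflation of the standard error. The sketch below only does that arithmetic. It is an approximation, because the correlation was computed on the full regional data while the regressions use the training subset, and because the residual standard error also changes slightly between the two models.

```r
r   <- 0.9008216                 # cor(Few_Fruit_Veg, Obesity) from the matrix
vif <- 1 / (1 - r^2)             # variance inflation factor, two predictors

se_ratio_expected <- sqrt(vif)          # expected inflation of the std. error
se_ratio_observed <- 0.02392 / 0.01047  # std. errors of E from the summaries

round(c(vif = vif,
        expected = se_ratio_expected,
        observed = se_ratio_observed), 2)
```

The expected and observed ratios are close, which is consistent with collinearity being the main reason the individual coefficients lose significance in the larger models.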
#Regression on Colon Cancer and Few Fruits and Vegetables with High Blood Pressure
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -3.3833 -0.6435 0.0166 0.4775 3.8119
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.68140 0.32910 2.070 0.0405 *
## E 0.21531 0.02585 8.330 1.16e-13 ***
## C2 0.13163 0.06773 1.944 0.0542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.09 on 126 degrees of freedom
## Multiple R-squared: 0.8357, Adjusted R-squared: 0.8331
## F-statistic: 320.5 on 2 and 126 DF, p-value: < 2.2e-16
#Regression on Colon Cancer and Few Fruits and Vegetables with the third confounder (C3)
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -3.0632 -0.6606 -0.0144 0.4913 3.8662
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71645 0.32566 2.200 0.0296 *
## E 0.21276 0.02128 10.000 <2e-16 ***
## C3 0.14657 0.05628 2.604 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.078 on 126 degrees of freedom
## Multiple R-squared: 0.8394, Adjusted R-squared: 0.8369
## F-statistic: 329.4 on 2 and 126 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -3.0938 -0.6481 0.0027 0.4265 3.7933
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.70568 0.32562 2.167 0.0321 *
## E 0.17857 0.03122 5.719 7.56e-08 ***
## C1 0.03900 0.07983 0.489 0.6261
## C2 0.09059 0.07048 1.285 0.2011
## C3 0.11996 0.06030 1.989 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.077 on 124 degrees of freedom
## Multiple R-squared: 0.8424, Adjusted R-squared: 0.8373
## F-statistic: 165.7 on 4 and 124 DF, p-value: < 2.2e-16
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
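The warning above appears because `abline()` can draw a fitted line only for a regression with a single predictor; given a model with several predictors it uses the intercept and the first slope and ignores the rest. A sketch of an alternative for multi-predictor models is to plot observed against fitted values with a 45-degree reference line, followed by the four standard `lm` diagnostic panels. The data is synthetic, and `pdf(NULL)` is used only so that the example runs without opening a plot window.

```r
set.seed(4)
n <- 80
toy <- data.frame(E = rnorm(n), C1 = rnorm(n))
toy$O <- 1 + 2 * toy$E + 0.5 * toy$C1 + rnorm(n)
fit <- lm(O ~ E + C1, data = toy)

pdf(NULL)   # draw off-screen; remove this line to see the plots

# Observed vs fitted: points near the 45-degree line are well predicted.
plot(fitted(fit), toy$O, xlab = "Fitted values", ylab = "Observed values")
abline(a = 0, b = 1)

# The four standard diagnostic panels for an lm fit.
par(mfrow = c(2, 2))
plot(fit)

dev.off()
```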
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage with Cook's distance)]
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingColon$Few_Fruit_Veg, C1=TestingColon$Obesity,C2=TestingColon$High_Bloo
tempTest <- mutate(tempTest, O = TestingColon$Col_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 13.39228
#This is the R-squared value for the testing data
SS.regression/SS.total
## [1] 0.7494172
Region 2 (Midwest region)
We now analyze the Midwest region. All the code and the necessary visualizations are included in the
following section.
#Cleaning the data in region 2.
#Removing missing data.
Midwest <- subset(Midwest, Midwest$No_Exercise!=0)
Midwest <- subset(Midwest, Midwest$Few_Fruit_Veg!=0)
Midwest <- subset(Midwest, Midwest$Obesity!=0)
Midwest <- subset(Midwest, Midwest$High_Blood_Pres!=0)
Midwest <- subset(Midwest, Midwest$Smoker!=0)
Midwest <- subset(Midwest, Midwest$Diabetes!=0)
Midwest <- subset(Midwest, Midwest$Lung_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Col_Cancer!=0)
Midwest <- subset(Midwest, Midwest$CHD!=0)
Midwest <- subset(Midwest, Midwest$Brst_Cancer!=0)
Midwest <- subset(Midwest, Midwest$Suicide!=0)
Midwest <- subset(Midwest, Midwest$Total_Death_Causes!=0)
Midwest <- subset(Midwest, Midwest$Injury!=0)
Midwest <- subset(Midwest, Midwest$MVA!=0)
Next we regressed the total number of deaths in this region on each individual cause of death. For
simplicity we include only the top two causes of death in region 2. We selected them by running a
single-variable regression of total deaths on each disease and then combining the features that gave the
maximum R^2 value. The following table shows our experimental results.
Disease          Region 2 (R squared)
Breast cancer    0.016
MVA              0.19
CHD              0.77
Colon cancer     0.12
Lung cancer      0.25
Injury           0.11
Suicide          0.05
Stroke           0.13
Based on the R^2 values, we consider the following diseases to be the major causes of death in the
Midwest region.
1. CHD (Coronary heart disease)
2. Lung Cancer
#Since we have taken the CHD and Lung Cancer as the major reason why people are dying. We will perform m
#Summary of regression between total deaths and diseases we selected.
## Call:
## lm(formula = Midwest$Total_Death_Causes ~ Midwest$CHD + Midwest$Lung_Cancer)
## Residuals:
## Min 1Q Median 3Q Max
## -19.6367 -4.0070 -0.6864 3.8916 22.6175
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.26689 0.65579 11.08 <2e-16 ***
## Midwest$CHD 1.12881 0.02958 38.16 <2e-16 ***
## Midwest$Lung_Cancer 2.97013 0.08260 35.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 6.172 on 389 degrees of freedom
## Multiple R-squared: 0.9886, Adjusted R-squared: 0.9885
## F-statistic: 1.688e+04 on 2 and 389 DF, p-value: < 2.2e-16
#plotting regression analysis
[Figure: regression diagnostic plots (Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage with Cook's distance)]
Now that we have established the major diseases in the Midwest region, we analyse the relationship
between these diseases and everyday lifestyle factors using multivariate regression.
We use a correlation heatmap and a multivariate scatterplot. After that we try to identify the confounders
and perform multivariate regression with training and testing data.
The following code describes the procedure for CHD.
#Creating a data frame in order to generate the correlation heat map
#Correlation matrix
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.9109385 0.8956826 0.9072785
## No_Exercise 0.9109385 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.8956826 0.9135682 1.0000000 0.9520907
## Obesity 0.9072785 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9045115 0.9155135 0.9331822 0.9536831
## Smoker 0.9076982 0.9292124 0.9400104 0.9312576
## Diabetes 0.8902612 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## CHD 0.9045115 0.9076982 0.8902612
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
## [1] 0.9275097
## x y value
## CHD :7 CHD :7 Min. :0.8688
## Diabetes :7 Diabetes :7 1st Qu.:0.9073
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.9188
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.9275
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9400
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
[Figure: Heat map of correlations in Risk Factors data : Region 2 (axes: CHD, Diabetes, Few_Fruit_Veg, High_Blood_Pres, No_Exercise, Obesity, Smoker)]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures
color = factor(signif(MW.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
[Figure: Percentage of obesity vs. Percentage of No Exercise (x axis: No Exercise %s; legend: % Obesity)]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
[Figure: Percentage of No Exercise vs. Smoker (x axis: No Exercise %s; legend: % Smoker)]
#graph of No Exercise VS High Blood Pressure
plt_noExvsblood_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedur
color = factor(signif(MW.Procedures$High_Blood_Pres,
geom_point() +
scale_color_discrete(name="% High Blood Pressure") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'High Blood Pressure %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
[Figure: Percentage of No Exercise vs. Percentage of High Blood Pressure (x axis: No Exercise %s; legend: % High Blood Pressure)]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 2]
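The plotting call for this figure is not shown in the report. One way to draw such a scatterplot matrix is base R's `pairs()`. The sketch below assumes a small simulated data frame (`toy`) whose column names mirror the risk-factor columns; the values are invented for illustration.

```r
set.seed(3)
# Simulated stand-in for the risk-factor columns (values are invented)
no_exercise <- runif(80, 10, 40)
toy <- data.frame(
  No_Exercise = no_exercise,
  Obesity     = 0.8 * no_exercise + rnorm(80, sd = 3),
  CHD         = 3 * no_exercise + rnorm(80, sd = 10)
)

# One panel per pair of variables; the title mirrors the report's figure
pairs(toy, main = "Multivariate Scatterplot : Region 2")

# The matching correlation matrix, for reading the panels side by side
toy_cor <- cor(toy)
```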
Based on the heat map and the multivariate scatterplot above, CHD is highly correlated with:
1. No exercise.
Next we identify the confounding variables. Adjusting for confounders lets the model fit better and makes the predictions less error prone.
Based on the correlation heat map and the scatterplots, we treat the following as confounders:
1. High Blood Pressure
2. Smoker
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = MW.Procedures$CHD,p = 0.80,list = FALSE)
#Performing regression between CHD and No exercise
#Regression Summary
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## TrainingCHD$No_Exercise 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
The single-variable model already fits well (R-squared of about 0.84), but we will still add the confounders and
perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=Training
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -26.669 -5.992 -1.075 4.611 37.145
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6506 1.2765 2.077 0.0387 *
## E 6.9596 0.1684 41.334 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 10.56 on 314 degrees of freedom
## Multiple R-squared: 0.8447, Adjusted R-squared: 0.8443
## F-statistic: 1709 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -23.236 -5.870 -0.862 5.417 36.030
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6657 1.2036 0.553 0.581
## E 3.9856 0.4212 9.462 < 2e-16 ***
## C1 3.1306 0.4123 7.593 3.64e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9.723 on 313 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8681
## F-statistic: 1037 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -30.550 -5.426 -0.776 4.322 37.881
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6510 1.1790 2.249 0.0252 *
## E 4.0462 0.4223 9.581 < 2e-16 ***
## C2 2.8735 0.3872 7.420 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9.757 on 313 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8671
## F-statistic: 1029 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -22.324 -5.482 -0.944 5.139 34.475
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4248 1.1767 1.211 0.227
## E 4.1168 0.3898 10.560 < 2e-16 ***
## C3 2.8134 0.3545 7.936 3.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9.654 on 313 degrees of freedom
## Multiple R-squared: 0.8708, Adjusted R-squared: 0.8699
## F-statistic: 1054 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -25.493 -5.404 -0.842 4.904 33.167
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3202 1.1689 1.129 0.259602
## E 2.8621 0.4635 6.175 2.07e-09 ***
## C1 1.1091 0.5728 1.936 0.053747 .
## C2 1.5669 0.4458 3.514 0.000506 ***
## C3 1.4402 0.4852 2.968 0.003230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9.36 on 311 degrees of freedom
## Multiple R-squared: 0.8793, Adjusted R-squared: 0.8777
## F-statistic: 566.3 on 4 and 311 DF, p-value: < 2.2e-16
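In the output above, the coefficient on E falls from 6.9596 in the single-variable model to 3.9856 once C1 is added, a drop of roughly 43%. A shift of that size is the usual sign that the added variable is confounded with the exposure. The sketch below illustrates the comparison on simulated data; the data frame `sim` and its coefficients are assumptions made for illustration, not the CHSI data.

```r
set.seed(1)
n <- 300
# Simulated confounder that drives both the exposure and the outcome
confounder <- rnorm(n, mean = 20, sd = 4)
exposure   <- 0.8 * confounder + rnorm(n, sd = 2)
outcome    <- 3 * exposure + 3 * confounder + rnorm(n, sd = 5)
sim <- data.frame(O = outcome, E = exposure, C1 = confounder)

crude    <- lm(O ~ E, data = sim)       # exposure only
adjusted <- lm(O ~ E + C1, data = sim)  # exposure plus confounder

# Relative change in the exposure coefficient after adjustment
change <- unname((coef(crude)["E"] - coef(adjusted)["E"]) / coef(crude)["E"])
```

A common rule of thumb treats a change of more than about 10% as worth adjusting for, although the threshold is a convention rather than a test.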
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots. Panels: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage (with Cook's distance contours). x axes: Fitted values, Theoretical Quantiles]
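The warning above appears because `abline()` can only draw a line from an intercept and one slope, so for a model with several predictors it silently uses the first two coefficients. A hedged alternative, sketched here on simulated data (the data frame `sim` is an assumption), is to rely on the built-in diagnostic panels and an observed-versus-fitted plot instead.

```r
set.seed(2)
# Simulated data with two predictors (values are invented)
sim <- data.frame(E = runif(100, 5, 40), C1 = runif(100, 5, 40))
sim$O <- 2 + 3 * sim$E + 1.5 * sim$C1 + rnorm(100, sd = 8)
fit <- lm(O ~ E + C1, data = sim)

# The four standard diagnostic panels in one 2x2 layout
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# With several predictors, compare observed and fitted values
# rather than calling abline(fit)
plot(fitted(fit), sim$O, xlab = "Fitted values", ylab = "Observed")
abline(a = 0, b = 1)  # reference line y = x
```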
#Now we will test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=Testing
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
## 1 2 3 4 5 6 7
## 47.35611 71.16578 74.24867 74.90092 88.18896 85.76173 53.21561
## 8 9 10 11 12 13 14
## 46.55992 49.44050 85.85667 62.59279 42.36807 92.85802 41.66495
## 15 16 17 18 19 20 21
## 90.65576 38.70206 44.74763 18.59404 16.51755 23.08953 19.65119
## 22 23 24 25 26 27 28
## 17.38575 19.99998 40.34271 25.57896 22.67974 18.95382 75.20951
## 29 30 31 32 33 34 35
## 20.59550 44.87894 22.94132 21.52470 28.72008 16.55805 21.38459
## 36 37 38 39 40 41 42
## 53.07606 51.01929 52.00420 22.51745 111.28415 89.03471 80.60128
## 43 44 45 46 47 48 49
## 70.85496 44.13435 65.03187 73.84524 36.00484 39.05419 62.18069
## 50 51 52 53 54 55 56
## 29.67611 63.29361 76.79059 38.52942 50.68324 49.06936 21.51555
## 57 58 59 60 61 62 63
## 50.18942 42.64157 20.31264 78.53103 20.19143 18.05501 20.39115
## 64 65 66 67 68 69 70
## 87.13676 84.23844 86.80106 71.39024 69.61883 40.42645 18.85330
## 71 72 73 74 75 76
## 46.86680 32.61050 38.34111 38.45673 67.09643 42.34753
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] -1040.415
## [1] 0.854508
#This is the R-squared value for the testing data
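On held-out data the total sum of squares does not have to equal the regression and residual sums of squares added together, which is why the first value printed above is not zero. The exact expression behind the reported R-squared is not shown in the report. The sketch below uses the definition 1 - SS.residual/SS.total; the helper name `test_r_squared` and the numbers are hypothetical.

```r
# Hypothetical helper, not part of the original report:
# R-squared of predictions against observed values on a test set
test_r_squared <- function(observed, predicted) {
  ss_total    <- sum((observed - mean(observed))^2)
  ss_residual <- sum((observed - predicted)^2)
  1 - ss_residual / ss_total
}

# Small made-up example
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)
r2 <- test_r_squared(observed, predicted)
```

With this definition a model that predicts worse than the test-set mean gets a negative value, which is a useful warning sign when comparing regions.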
Now that we have fitted the regression model for CHD, we will do the same for lung cancer.
Here is the code for the lung cancer model.
#Creating a data frame in order to generate the correlation heat map
#Correlation matrix
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.9402863 0.9478072 0.9459666
## No_Exercise 0.9402863 1.0000000 0.9135682 0.9265991
## Few_Fruit_Veg 0.9478072 0.9135682 1.0000000 0.9520907
## Obesity 0.9459666 0.9265991 0.9520907 1.0000000
## High_Blood_Pres 0.9279799 0.9155135 0.9331822 0.9536831
## Smoker 0.9494486 0.9292124 0.9400104 0.9312576
## Diabetes 0.8984837 0.9031664 0.8688473 0.9153784
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.9279799 0.9494486 0.8984837
## No_Exercise 0.9155135 0.9292124 0.9031664
## Few_Fruit_Veg 0.9331822 0.9400104 0.8688473
## Obesity 0.9536831 0.9312576 0.9153784
## High_Blood_Pres 1.0000000 0.9187944 0.9194037
## Smoker 0.9187944 1.0000000 0.8869094
## Diabetes 0.9194037 0.8869094 1.0000000
#Melting the correlation matrix and creating a data frame
MW.Melt<-melt(data=MW.Cor,varnames = c("x","y"))
MW.Melt <- MW.Melt[order(MW.Melt$value),]
#Mean of the melt
## [1] 0.9354118
## x y value
## Diabetes :7 Diabetes :7 Min. :0.8688
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.9155
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.9313
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.9354
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9494
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, red) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = MW.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = MW.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 2")
[Figure: Heat map of correlations in Risk Factors data : Region 2]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS Few Fruits and Vegetables
plt_Smovsfru_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Few_Fruit_Veg), y = (MW.Procedure
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetables %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Smoker vs.\n Percentage of Few Fruits and Vegetables")
[Figure: Percentage of Smoker vs. Percentage of Few Fruits and Vegetables (x axis: Few Fruits and Vegetables %s; legend: % Smoker)]
#graph of No Exercise VS Smoker
plt_noExvsSmo_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$No_Exercise), y = (MW.Procedures
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'No Exercise %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
[Figure: Percentage of No Exercise vs. Percentage of Smoker (x axis: No Exercise %s; legend: % Smoker)]
#graph of Smoker VS Obesity
plt_SmovsObe_MW <- ggplot(data = MW.Procedures, aes(x = (MW.Procedures$Obesity), y = (MW.Procedures$Smok
color = factor(signif(MW.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent, name = 'Obesity %s') +
scale_y_continuous(labels = percent, name = 'Smoker %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
[Figure: Percentage of Obesity vs. Percentage of Smoker (x axis: Obesity %s; legend: % Smoker)]
#Plotting the multi variate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 2]
Based on the heat map and the multivariate scatterplot above, lung cancer is highly correlated with:
1. Smoker.
Next we identify the confounding variables. Adjusting for confounders lets the model fit better and makes the predictions less error prone.
Based on the correlation heat map and the scatterplots, we treat the following as confounders:
1. No Exercise
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = MW.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- MW.Procedures[train_part,]
TestingLung <- MW.Procedures[-train_part,]
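`createDataPartition()` draws a random sample, so the split (and every number that follows) changes between runs unless a seed is set first. The sketch below shows the same 80/20 idea in base R on a simulated data frame (`toy`, an assumption). It uses a plain random split, whereas `createDataPartition()` additionally balances the split across quantile groups of the outcome.

```r
set.seed(42)  # fix the random split so results are reproducible

# Simulated stand-in for MW.Procedures (values are invented)
toy <- data.frame(Lung_Cancer = runif(50, 5, 35),
                  Smoker      = runif(50, 5, 30))

# Plain random 80/20 split
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
training  <- toy[train_idx, ]
testing   <- toy[-train_idx, ]
```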
#Performing regression between Lung Cancer and Smoker
#Regression Summary
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## TrainingLung$Smoker 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
The single-variable model already fits well (R-squared of about 0.90), but we will still add the confounders and
perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$Few_Fruit_Veg,C3=Tra
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -11.8531 -1.4729 0.0134 1.5588 7.9568
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.10692 0.34785 0.307 0.759
## E 2.35403 0.04434 53.096 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.047 on 314 degrees of freedom
## Multiple R-squared: 0.8998, Adjusted R-squared: 0.8995
## F-statistic: 2819 on 1 and 314 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -11.2839 -1.2744 0.1438 1.3621 7.6394
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1171 0.3171 -3.523 0.00049 ***
## E 1.3294 0.1012 13.133 < 2e-16 ***
## C1 1.1735 0.1075 10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.598 on 313 degrees of freedom
## Multiple R-squared: 0.9274, Adjusted R-squared: 0.9269
## F-statistic: 1999 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -11.8397 -1.2740 0.0911 1.3515 8.0799
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.31635 0.32814 -4.012 7.55e-05 ***
## E 1.24197 0.11201 11.088 < 2e-16 ***
## C2 0.39028 0.03696 10.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.621 on 313 degrees of freedom
## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9256
## F-statistic: 1961 on 2 and 313 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -8.4258 -1.3278 -0.2252 1.5037 7.6570
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9052 0.3127 -2.895 0.00406 **
## E 1.3013 0.1054 12.341 < 2e-16 ***
## C3 1.2172 0.1137 10.703 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.611 on 313 degrees of freedom
## Multiple R-squared: 0.9266, Adjusted R-squared: 0.9262
## F-statistic: 1977 on 2 and 313 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -9.0675 -1.1707 0.1069 1.2777 7.5823
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.79363 0.29587 -6.062 3.88e-09 ***
## E 0.67286 0.11725 5.739 2.27e-08 ***
## C1 0.44770 0.13018 3.439 0.000664 ***
## C2 0.21279 0.04183 5.087 6.31e-07 ***
## C3 0.79075 0.11385 6.946 2.21e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.327 on 311 degrees of freedom
## Multiple R-squared: 0.9421, Adjusted R-squared: 0.9414
## F-statistic: 1265 on 4 and 311 DF, p-value: < 2.2e-16
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots. Panels: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage (with Cook's distance contours). x axes: Fitted values, Theoretical Quantiles]
#Now we will test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$Few_Fruit_Veg,C3=Te
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 500.1066
## [1] 0.8705158
#This is the R-squared value for the testing data
Region 3 (South region)
We now analyze the South region for the same problem. All the code and the necessary visualizations
are included in the following section.
#Cleaning the data in region 3.
#Removing missing data.
South <- subset(South, South$No_Exercise!=0)
South <- subset(South, South$Few_Fruit_Veg!=0)
South <- subset(South, South$Obesity!=0)
South <- subset(South, South$High_Blood_Pres!=0)
South <- subset(South, South$Smoker!=0)
South <- subset(South, South$Diabetes!=0)
South <- subset(South, South$Lung_Cancer!=0)
South <- subset(South, South$Col_Cancer!=0)
South <- subset(South, South$CHD!=0)
South <- subset(South, South$Brst_Cancer!=0)
South <- subset(South, South$Suicide!=0)
South <- subset(South, South$Total_Death_Causes!=0)
South <- subset(South, South$Injury!=0)
South <- subset(South, South$MVA!=0)
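The fourteen filters above all apply the same rule: drop a row if the given column is zero. The same rule can be applied in one step. This is a minimal sketch, assuming a small made-up data frame (`toy`) and assuming, as the filters above do, that missing values are coded as 0 rather than NA.

```r
# Made-up rows; a 0 marks a missing value, as in the cleaning step above
toy <- data.frame(No_Exercise = c(20, 0, 25, 30),
                  Obesity     = c(22, 24, 0, 28),
                  CHD         = c(60, 55, 70, 0))

# Columns that must all be non-zero for a row to be kept
cols <- c("No_Exercise", "Obesity", "CHD")

# Count the zeros in each row and keep only rows with none
keep    <- rowSums(toy[, cols] == 0) == 0
cleaned <- toy[keep, ]
```

Listing the columns in one vector makes it harder to forget one and easier to reuse the same rule for the other regions.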
We then regressed the total number of deaths in this region on the different causes of death.
For simplicity we include only the top three causes of death in region 3. We selected them by running a
single-variable regression of total deaths on each individual cause and then combining the features that
give the maximum R^2 value. The following table shows the R^2 value obtained for each cause.
cause of death    R^2 in region 3
breast cancer     0.06
mva               0.26
chd               0.79
colon cancer      0.16
lung cancer       0.35
injury            0.17
suicide           0.03
stroke            0.14
Based on the R^2 values we consider the following to be the major causes of death in the
South region.
1. CHD (Coronary heart disease)
2. Lung Cancer
3. MVA (Motor Vehicle Accidents)
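The screening step that produces a table like the one above can be written as a loop over single-variable regressions. The sketch below runs on simulated data; the data frame `toy`, its three columns and the way the total is generated are assumptions made for illustration, so the resulting numbers are unrelated to the table.

```r
set.seed(7)
n <- 200
# Simulated causes of death (values are invented)
toy <- data.frame(CHD         = runif(n, 20, 120),
                  Lung_Cancer = runif(n, 5, 35),
                  Suicide     = runif(n, 1, 6))
toy$Total_Death_Causes <- toy$CHD + 2 * toy$Lung_Cancer + rnorm(n, sd = 10)

causes <- c("CHD", "Lung_Cancer", "Suicide")

# One single-variable regression per cause; keep its R-squared
r2 <- sapply(causes, function(cause) {
  fit <- lm(reformulate(cause, response = "Total_Death_Causes"), data = toy)
  summary(fit)$r.squared
})

# Causes ordered from most to least explanatory
ranked <- sort(r2, decreasing = TRUE)
```

The top entries of `ranked` are then the candidates for the combined model, in the same way CHD, lung cancer and MVA are carried forward here.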
#Since we have taken the CHD , Lung Cancer and MVA as the major reason why people are dying. We will per
#Summary of regression between total deaths and diseases we selected.
## Call:
## lm(formula = South$Total_Death_Causes ~ South$CHD + South$Lung_Cancer +
## South$MVA)
## Residuals:
## Min 1Q Median 3Q Max
## -40.105 -5.062 -0.461 4.612 30.418
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17128 0.98511 1.189 0.235
## South$CHD 1.16308 0.02562 45.398 < 2e-16 ***
## South$Lung_Cancer 2.84552 0.07955 35.770 < 2e-16 ***
## South$MVA 1.08049 0.15806 6.836 2.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7.829 on 554 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9783
## F-statistic: 8374 on 3 and 554 DF, p-value: < 2.2e-16
#plotting regression analysis
[Figure: regression diagnostic plots. Panels: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage (with Cook's distance contours). x axes: Fitted values, Theoretical Quantiles]
Now that we have established the major causes of death in the South region, we will analyse the relationship
between these causes and daily human activities using multivariate regression.
We use a correlation heat map and a multivariate scatterplot. After that we identify the confounders
and perform multivariate regression with training and testing data.
The following code describes the procedure for CHD.
#Creating a data frame in order to generate the correlation heat map
#Correlation matrix
## CHD No_Exercise Few_Fruit_Veg Obesity
## CHD 1.0000000 0.8652391 0.8531583 0.8529559
## No_Exercise 0.8652391 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8531583 0.8673295 1.0000000 0.9171308
## Obesity 0.8529559 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8240507 0.8680525 0.8880798 0.8827588
## Smoker 0.8387031 0.8636349 0.8920701 0.8694016
## Diabetes 0.7821009 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## CHD 0.8240507 0.8387031 0.7821009
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
## [1] 0.8792101
## x y value
## CHD :7 CHD :7 Min. :0.7821
## Diabetes :7 Diabetes :7 1st Qu.:0.8530
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 Median :0.8681
## High_Blood_Pres:7 High_Blood_Pres:7 Mean :0.8792
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate a dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.8, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
[Figure: Heat map of correlations in Risk Factors data : Region 3]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Obesity VS No Exercise
plt_obevsNoEx_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures
color = factor(signif(SO.Procedures$Obesity, 0)))) +
geom_point() +
scale_color_discrete(name="% Obesity") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
ggtitle(label = "Percentage of obesity vs.\n Percentage of No Exercise")
[Figure: Percentage of obesity vs. Percentage of No Exercise (x axis: No Exercise %s; legend: % Obesity)]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Smoker")
[Figure: Percentage of No Exercise vs. Smoker (x axis: No Exercise %s; legend: % Smoker)]
#graph of No Exercise VS Few Fruits And Vegetables
plt_noExvsfru_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures
color = factor(signif(SO.Procedures$Few_Fruit_Veg, 0
geom_point() +
scale_color_discrete(name="% Few Fruits And Vegetables") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Few Fruits And Vegetables %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Few Fruits And Vegetables")
[Figure: Percentage of No Exercise vs. Percentage of Few Fruits And Vegetables (x axis: No Exercise %s; legend: % Few Fruits And Vegetables)]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 3]
Based on the heat map and the multivariate scatterplot above, CHD is highly correlated with:
1. No exercise.
Next we identify the confounding variables. Adjusting for confounders lets the model fit better and makes the predictions less error prone.
Based on the correlation heat map and the scatterplots, we treat the following as confounders:
1. Few Fruits And Vegetables
2. Smoker
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = SO.Procedures$CHD,p = 0.80,list = FALSE)
#Performing regression between CHD and No exercise
#Regression Summary
## Call:
## lm(formula = TrainingCHD$CHD ~ TrainingCHD$No_Exercise)
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## TrainingCHD$No_Exercise 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
The single-variable model already fits reasonably well (R-squared of about 0.76), but we will still add the confounders and
perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingCHD$No_Exercise, C1=TrainingCHD$Obesity,C2=TrainingCHD$Smoker,C3=Training
temp <- mutate(temp, O = TrainingCHD$CHD)
#Regression on CHD and No Exercise
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -63.726 -7.474 -0.401 7.460 53.480
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3934 1.6142 3.341 0.000904 ***
## E 6.2141 0.1672 37.167 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 12.37 on 446 degrees of freedom
## Multiple R-squared: 0.7559, Adjusted R-squared: 0.7554
## F-statistic: 1381 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -56.692 -6.553 -0.426 6.782 49.965
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7745 1.5323 2.463 0.0141 *
## E 3.6456 0.3685 9.893 < 2e-16 ***
## C1 3.1443 0.4080 7.707 8.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 11.64 on 445 degrees of freedom
## Multiple R-squared: 0.7847, Adjusted R-squared: 0.7837
## F-statistic: 810.8 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -58.419 -7.093 -0.348 6.464 50.962
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2383 1.5379 1.455 0.146
## E 3.8791 0.3103 12.503 <2e-16 ***
## C2 3.0972 0.3567 8.684 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 11.46 on 445 degrees of freedom
## Multiple R-squared: 0.7913, Adjusted R-squared: 0.7904
## F-statistic: 843.6 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on CHD and No exercise with Confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -57.863 -6.174 -0.278 6.583 48.496
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0217 1.4558 2.076 0.0385 *
## E 3.3541 0.3044 11.019 <2e-16 ***
## C3 1.1467 0.1064 10.776 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 11.03 on 445 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.8056
## F-statistic: 927 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with the explanatory variable and all the confounders that we think are present.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -55.608 -6.372 -0.209 6.585 48.139
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9532 1.4613 1.337 0.182035
## E 2.5933 0.3754 6.907 1.72e-11 ***
## C1 0.8052 0.4921 1.636 0.102483
## C2 1.4445 0.4122 3.504 0.000505 ***
## C3 0.7514 0.1527 4.922 1.21e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 10.87 on 443 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8112
## F-statistic: 481.1 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots. Panels: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage (with Cook's distance contours). x axes: Fitted values, Theoretical Quantiles]
#Now we will test our regression model with the testing data to check the performance.
tempTest <- data.frame(E = TestingCHD$No_Exercise, C1=TestingCHD$Obesity,C2=TestingCHD$Smoker,C3=Testing
tempTest <- mutate(tempTest, O = TestingCHD$CHD)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
## 1 2 3 4 5 6 7
## 93.11863 54.33024 52.85395 45.79852 84.97513 56.88569 90.36550
## 8 9 10 11 12 13 14
## 54.14329 27.01033 21.19971 22.31869 51.65249 44.54791 47.33240
## 15 16 17 18 19 20 21
## 82.57801 95.08141 82.71398 87.34274 77.96068 95.69180 39.98928
## 22 23 24 25 26 27 28
## 92.73972 78.85468 78.94780 80.59074 80.79515 88.44522 51.13704
## 29 30 31 32 33 34 35
## 53.76243 54.93526 44.19820 61.92209 47.95062 57.45545 51.17947
## 36 37 38 39 40 41 42
## 27.33281 62.49750 23.71136 28.54283 26.16418 27.07990 58.66436
## 43 44 45 46 47 48 49
## 102.65357 91.78205 105.83215 55.19101 80.64446 40.27956 53.95203
## 50 51 52 53 54 55 56
## 48.46951 84.13853 40.82755 102.13602 54.52694 48.51574 50.28585
## 57 58 59 60 61 62 63
## 65.78484 53.00168 50.13015 53.58431 48.56988 45.02007 80.19727
## 64 65 66 67 68 69 70
## 85.16485 43.43088 101.14805 81.53974 81.15735 66.76741 60.25659
## 71 72 73 74 75 76 77
## 42.04435 54.65376 103.69421 49.09867 59.43466 51.29171 56.07857
## 78 79 80 81 82 83 84
## 25.54696 55.45294 23.65491 93.97364 51.35518 46.81402 92.22464
## 85 86 87 88 89 90 91
## 84.72256 90.79933 95.60329 59.45610 90.05428 45.56084 76.83904
## 92 93 94 95 96 97 98
## 82.85766 74.82753 77.28768 64.59310 51.51838 40.74676 79.32157
## 99 100 101 102 103 104 105
## 38.59788 52.26890 49.81253 87.43576 89.41201 90.57723 79.44587
## 106 107 108 109 110
## 60.00317 46.91665 59.17558 47.64249 24.00712
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 2561.812
## [1] 0.7099096
#This is the R-squared value for the testing data
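One common way to obtain a test-set R-squared is 1 - SS.residual/SS.total, because on held-out data SS.total need not equal SS.regression + SS.residual. The sketch below illustrates that calculation on simulated data, not the CHSI data; the data frame dat, the 160/40 split, and the name test.rsq are assumptions made for illustration, and this is not necessarily the exact expression behind the value reported here.

```r
# Simulated stand-in for a training/testing split (hypothetical data).
set.seed(1)
n <- 200
dat <- data.frame(E = runif(n, 5, 40))
dat$O <- 2 + 1.5 * dat$E + rnorm(n, sd = 5)
train <- dat[1:160, ]
test  <- dat[161:200, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- lm(O ~ E, data = train)
test.predictions <- predict(fit, newdata = test)

test.y      <- test$O
SS.total    <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)

# Out-of-sample R-squared: share of test-set variance explained by the predictions.
test.rsq <- 1 - SS.residual / SS.total
test.rsq
```

Defined this way, the value can be compared directly with the Multiple R-squared reported for the training data.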
Now that we have fitted the regression model for “CHD”, we will do the same for lung cancer.
Here is the code for the lung cancer model.
#Creating a data frame in order to generate correlation heat map
#Correlation matrix
## Lung_Cancer No_Exercise Few_Fruit_Veg Obesity
## Lung_Cancer 1.0000000 0.8398485 0.8922953 0.8688772
## No_Exercise 0.8398485 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.8922953 0.8673295 1.0000000 0.9171308
## Obesity 0.8688772 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.8698291 0.8680525 0.8880798 0.8827588
## Smoker 0.9145492 0.8636349 0.8920701 0.8694016
## Diabetes 0.7993788 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## Lung_Cancer 0.8698291 0.9145492 0.7993788
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
## [1] 0.8860905
## x y value
## Diabetes :7 Diabetes :7 Min. :0.7994
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8636
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8698
## Lung_Cancer :7 Lung_Cancer :7 Mean :0.8861
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.9031
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
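The mean of the melted values includes the diagonal of the correlation matrix (each variable's correlation of 1 with itself), which pulls the average upward. As a hedged illustration on simulated data, the sketch below builds the same long format with base R and compares the mean over all cells with the mean over distinct variable pairs. The data frame risk and the names cor.long, mean.all and mean.offdiag are assumptions, not part of the project code.

```r
# Three hypothetical, mutually correlated risk factors.
set.seed(2)
x <- rnorm(100)
risk <- data.frame(A = x + rnorm(100, sd = 0.5),
                   B = x + rnorm(100, sd = 0.5),
                   C = x + rnorm(100, sd = 0.5))
cor.mat <- cor(risk)

# Long format: one row per (x, y) pair, similar in shape to melt() output.
cor.long <- as.data.frame(as.table(cor.mat), responseName = "value")
names(cor.long)[1:2] <- c("x", "y")

mean.all     <- mean(cor.long$value)               # includes the 1.0 diagonal
mean.offdiag <- mean(cor.mat[upper.tri(cor.mat)])  # distinct pairs only
c(all = mean.all, offdiag = mean.offdiag)
```

Either value could serve as the heat map midpoint; the off-diagonal mean describes the relationships between different variables only.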
#Making various colors to generate dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, red) )
WtoGrange<-colorRampPalette(c(red, blue) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.9, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
[Figure: Heat map of correlations in Risk Factors data : Region 3; axes: Diabetes, Few_Fruit_Veg, High_Blood_Pres, Lung_Cancer, No_Exercise, Obesity, Smoker]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of Smoker VS High Blood Pressure
plt_Smovsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proce
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Smoker vs.\n Percentage of High Blood Pressure")
[Figure: scatter plot "Percentage of Smoker vs. Percentage of High Blood Pressure"; x-axis: High Blood Pressure %s; color legend: % Smoker]
#graph of Few Fruits and Vegetables VS Smoker
plt_fruvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Few_Fruit_Veg), y = (SO.Procedure
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Few Fruits and Vegetable %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Few Fruits and Vegetable vs.\n Percentage of Smoker")
[Figure: scatter plot "Percentage of Few Fruits and Vegetable vs. Percentage of Smoker"; x-axis: Few Fruits and Vegetable %s; color legend: % Smoker]
#graph of Smoker VS Obesity
plt_SmovsObe_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Obesity), y = (SO.Procedures$Smok
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'Obesity %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of Obesity vs.\n Percentage of Smoker")
[Figure: scatter plot "Percentage of Obesity vs. Percentage of Smoker"; x-axis: Obesity %s; color legend: % Smoker]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 3]
Based on the heatmap and the multivariate scatterplot above, lung cancer is most highly correlated with:
1. Smoker.
Next we identify the confounding variables. Including confounders should let the model fit better and make its predictions less error prone.
Based on the correlation heatmap and the scatterplot, we take the following as confounders:
1. High Blood Pressure
2. Few Fruits and Vegetables
3. Obesity
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = SO.Procedures$Lung_Cancer,p = 0.80,list = FALSE)
TrainingLung <- SO.Procedures[train_part,]
TestingLung <- SO.Procedures[-train_part,]
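Because createDataPartition (from the caret package) draws a random sample, the regression results that follow depend on which rows land in the training set. A minimal sketch of a reproducible 80/20 split is given below, using base R sample() as a simple unstratified stand-in. The seed value and the data frame dat are arbitrary assumptions for illustration.

```r
# Fix the random number generator so the same rows are chosen on every run.
set.seed(2016)  # arbitrary seed; any fixed value works

# Hypothetical data standing in for SO.Procedures.
dat <- data.frame(y = rnorm(100), x = rnorm(100))

# Draw 80% of the row indices without replacement.
train.idx <- sample(seq_len(nrow(dat)), size = floor(0.80 * nrow(dat)))
Training <- dat[train.idx, ]
Testing  <- dat[-train.idx, ]

c(train = nrow(Training), test = nrow(Testing))
```

Calling set.seed() before createDataPartition should have the same effect of making the partition repeatable.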
#Performing regression between Lung Cancer and Smoker
#Regression Summary
## Call:
## lm(formula = TrainingLung$Lung_Cancer ~ TrainingLung$Smoker)
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## TrainingLung$Smoker 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
The single-predictor model already fits well (R-squared = 0.8275); we will still add the confounders and then
perform the multivariate regression.
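To illustrate why adjusting for a confounder matters, the sketch below uses simulated data in which a confounder C1 drives both the exposure E and the outcome O. The coefficient of E is compared with and without C1 in the model. All variable values, effect sizes, and the names coef.unadjusted, coef.adjusted and pct.change are assumptions for illustration; only the model formulas mirror the ones used in this section.

```r
set.seed(3)
n  <- 500
C1 <- rnorm(n, mean = 25, sd = 5)            # hypothetical confounder
E  <- 0.8 * C1 + rnorm(n, sd = 2)            # exposure correlated with the confounder
O  <- 1 + 1.0 * E + 1.0 * C1 + rnorm(n, sd = 3)
temp <- data.frame(E, C1, O)

reg.E   <- lm(O ~ E, data = temp)            # unadjusted model
reg.EC1 <- lm(O ~ E + C1, data = temp)       # adjusted for the confounder

coef.unadjusted <- unname(coef(reg.E)["E"])
coef.adjusted   <- unname(coef(reg.EC1)["E"])

# Relative change in the exposure coefficient after adjustment.
pct.change <- 100 * (coef.unadjusted - coef.adjusted) / coef.unadjusted
c(unadjusted = coef.unadjusted, adjusted = coef.adjusted, pct.change = pct.change)
```

A large relative change in the coefficient of E after adding a variable is one informal sign that the variable confounds the exposure-outcome relationship; the same comparison can be read from the regression summaries in this section.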
#making a temporary table with all required variables
temp <- data.frame(E = TrainingLung$Smoker, C1=TrainingLung$Obesity,C2=TrainingLung$High_Blood_Pres,C3=T
temp <- mutate(temp, O = TrainingLung$Lung_Cancer)
#Regression on Lung Cancer and Smoker
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -17.9394 -2.0631 -0.1777 1.8757 17.4538
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69422 0.43405 3.903 0.00011 ***
## E 2.37666 0.05138 46.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.328 on 446 degrees of freedom
## Multiple R-squared: 0.8275, Adjusted R-squared: 0.8271
## F-statistic: 2139 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -18.098 -1.752 -0.184 1.773 16.095
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92734 0.41643 2.227 0.0265 *
## E 1.71902 0.09419 18.250 < 2e-16 ***
## C1 0.75201 0.09266 8.115 4.78e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.109 on 445 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8491
## F-statistic: 1258 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -13.4221 -1.8371 -0.1437 1.7834 17.3119
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99954 0.40928 2.442 0.015 *
## E 1.65336 0.09547 17.317 <2e-16 ***
## C2 0.70444 0.08065 8.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.078 on 445 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8521
## F-statistic: 1288 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on Lung Cancer and Smoker with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -1.660 -0.174 1.687 15.611
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14230 0.40135 2.846 0.00463 **
## E 1.50746 0.10386 14.515 < 2e-16 ***
## C3 0.29928 0.03189 9.385 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 3.044 on 445 degrees of freedom
## Multiple R-squared: 0.856, Adjusted R-squared: 0.8554
## F-statistic: 1323 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -15.8100 -1.7365 -0.1321 1.7312 15.9466
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79644 0.39875 1.997 0.0464 *
## E 1.29944 0.10929 11.889 < 2e-16 ***
## C1 0.18044 0.12280 1.469 0.1424
## C2 0.38893 0.09591 4.055 5.92e-05 ***
## C3 0.17908 0.04193 4.271 2.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.967 on 443 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.8626
## F-statistic: 702.5 on 4 and 443 DF, p-value: < 2.2e-16
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients
[Figure: regression diagnostic plots for the lung cancer model; panels: Residuals vs Fitted, Normal Q-Q, Residuals vs Leverage (with Cook's distance)]
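The warning above arises because abline() applied to a model with more than two coefficients draws a line from only the intercept and the first slope, which does not represent the multivariate fit. A hedged alternative, sketched here on simulated data (the data frame temp and its coefficients are made up for illustration), is to plot observed against fitted values and to call plot() on the model for the standard diagnostic panels.

```r
set.seed(5)
n <- 200
temp <- data.frame(E = rnorm(n, 10, 3), C1 = rnorm(n, 25, 5))  # hypothetical predictors
temp$O <- 1 + 1.3 * temp$E + 0.4 * temp$C1 + rnorm(n, sd = 3)
fit <- lm(O ~ E + C1, data = temp)

# Observed vs fitted: points close to the 45-degree line indicate a good fit.
plot(fitted(fit), temp$O, xlab = "Fitted values", ylab = "Observed values")
abline(a = 0, b = 1)

# Standard diagnostics: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)
```

The observed-versus-fitted plot works for any number of predictors, since it collapses the model to a single fitted value per observation.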
#Now we will test our regression model with testing data to check the performance.
tempTest <- data.frame(E = TestingLung$Smoker, C1=TestingLung$Obesity,C2=TestingLung$High_Blood_Pres,C3=
tempTest <- mutate(tempTest, O = TestingLung$Lung_Cancer)
test.predictions<-predict(reg.EC1234,newdata = tempTest)
test.y <- tempTest$O
SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.predictions)^2)
SS.regression <- sum((test.predictions - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
## [1] 882.6084
## [1] 0.7934475
#This is the R-squared value for the testing data
Similarly, we have done the calculations for MVA. Here is the code for the MVA model.
#Creating a data frame in order to generate correlation heat map
#Correlation matrix
## MVA No_Exercise Few_Fruit_Veg Obesity
## MVA 1.0000000 0.7037265 0.5939313 0.6419514
## No_Exercise 0.7037265 1.0000000 0.8673295 0.9031475
## Few_Fruit_Veg 0.5939313 0.8673295 1.0000000 0.9171308
## Obesity 0.6419514 0.9031475 0.9171308 1.0000000
## High_Blood_Pres 0.6708440 0.8680525 0.8880798 0.8827588
## Smoker 0.6515928 0.8636349 0.8920701 0.8694016
## Diabetes 0.6625232 0.8543135 0.8032958 0.8497419
## High_Blood_Pres Smoker Diabetes
## MVA 0.6708440 0.6515928 0.6625232
## No_Exercise 0.8680525 0.8636349 0.8543135
## Few_Fruit_Veg 0.8880798 0.8920701 0.8032958
## Obesity 0.8827588 0.8694016 0.8497419
## High_Blood_Pres 1.0000000 0.8691031 0.8753804
## Smoker 0.8691031 1.0000000 0.8209986
## Diabetes 0.8753804 0.8209986 1.0000000
#Melting the correlation matrix and creating a data frame
SO.Melt<-melt(data=SO.Cor,varnames = c("x","y"))
SO.Melt <- SO.Melt[order(SO.Melt$value),]
#Mean of the melt
## [1] 0.8346534
## x y value
## Diabetes :7 Diabetes :7 Min. :0.5939
## Few_Fruit_Veg :7 Few_Fruit_Veg :7 1st Qu.:0.8033
## High_Blood_Pres:7 High_Blood_Pres:7 Median :0.8681
## MVA :7 MVA :7 Mean :0.8347
## No_Exercise :7 No_Exercise :7 3rd Qu.:0.8921
## Obesity :7 Obesity :7 Max. :1.0000
## Smoker :7 Smoker :7
#Making various colors to generate dynamic range of colors using a given palette
red=rgb(1,0,0); green=rgb(0,1,0); blue=rgb(0,0,1); white=rgb(1,1,1)
RtoWrange<-colorRampPalette(c(white, blue) )
WtoGrange<-colorRampPalette(c(blue, green) )
# Heat map - using colors | used ggplot2 for the colors
plt_heat_blue <- ggplot(data = SO.Melt, aes(x=x, y = y)) +
theme(panel.background = element_rect(fill = "snow2")) +
geom_tile(aes(fill = value)) +
scale_fill_gradient2(low = RtoWrange(100),mid = WtoGrange(100), high="gray",
midpoint = SO.MeltMean,
limits = c(0.5, 1), name = "Correlations") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
labs(x=NULL, y=NULL) +
theme(panel.background = element_rect(fill = "snow2")) +
ggtitle("Heat map of correlations in Risk Factors data : Region 3")
[Figure: Heat map of correlations in Risk Factors data : Region 3; axes: Diabetes, Few_Fruit_Veg, High_Blood_Pres, MVA, No_Exercise, Obesity, Smoker]
percent <- c("10%","20%","30%","40%","50%")
breaks <- c(10,20,30,40,50)
#graph of no Exercise VS High Blood Pressure
plt_noExvsblood_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$High_Blood_Pres), y = (SO.Proc
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'High Blood Pressure %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of High Blood Pressure")
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of High Blood Pressure"; x-axis: High Blood Pressure %s; color legend: % No Exercise]
#graph of No Exercise VS Smoker
plt_noExvsSmo_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$No_Exercise), y = (SO.Procedures
color = factor(signif(SO.Procedures$Smoker, 0)))) +
geom_point() +
scale_color_discrete(name="% Smoker") +
scale_x_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'Smoker %s') +
ggtitle(label = "Percentage of No Exercise vs.\n Percentage of Smoker")
[Figure: scatter plot "Percentage of No Exercise vs. Percentage of Smoker"; x-axis: No Exercise %s; color legend: % Smoker]
#graph of Diabetes VS No Exercise
plt_noExvsDia_SO <- ggplot(data = SO.Procedures, aes(x = (SO.Procedures$Diabetes), y = (SO.Procedures$No
color = factor(signif(SO.Procedures$No_Exercise, 0)))) +
geom_point() +
scale_color_discrete(name="% No Exercise") +
scale_x_continuous(labels = percent,breaks = breaks,name = 'Diabetes %s') +
scale_y_continuous(labels = percent,breaks = breaks, name = 'No Exercise %s') +
ggtitle(label = "Percentage of Diabetes vs.\n Percentage of No Exercise")
[Figure: scatter plot "Percentage of Diabetes vs. Percentage of No Exercise"; x-axis: Diabetes %s; color legend: % No Exercise]
#Plotting the multivariate scatter plot in order to understand the correlation better.
[Figure: Multivariate Scatterplot : Region 3]
Based on the heatmap and the multivariate scatterplot above, MVA is most highly correlated with:
1. No Exercise.
Next we identify the confounding variables. Including confounders should let the model fit better and make its predictions less error prone.
Based on the correlation heatmap and the scatterplot, we take the following as confounders:
1. High Blood Pressure
2. Diabetes
3. Smoker
#partitioning the data into Training and Testing set.
train_part <- createDataPartition(y = SO.Procedures$MVA,p = 0.80,list = FALSE)
TrainingMVA <- SO.Procedures[train_part,]
TestingMVA <- SO.Procedures[-train_part,]
#Performing regression between MVA and No exercise
#Regression Summary
## Call:
## lm(formula = TrainingMVA$MVA ~ TrainingMVA$No_Exercise)
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## TrainingMVA$No_Exercise 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
The single-predictor model already explains about half of the variance (R-squared = 0.5016); we will still add the confounders and then
perform the multivariate regression.
#making a temporary table with all required variables
temp <- data.frame(E = TrainingMVA$No_Exercise, C1=TrainingMVA$Smoker,C2=TrainingMVA$High_Blood_Pres,C3=
temp <- mutate(temp, O = TrainingMVA$MVA)
#Regression on MVA and No Exercise
reg.E <- lm(O ~ E, data = temp)
## Call:
## lm(formula = O ~ E, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -6.0631 -1.2072 -0.0759 1.0708 9.8437
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60141 0.26426 6.06 2.9e-09 ***
## E 0.58738 0.02772 21.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.989 on 446 degrees of freedom
## Multiple R-squared: 0.5016, Adjusted R-squared: 0.5005
## F-statistic: 448.9 on 1 and 446 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 1
reg.EC1 <- lm(O ~ E+C1, data = temp)
## Call:
## lm(formula = O ~ E + C1, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -5.9779 -1.1454 -0.0693 1.0158 10.1595
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52869 0.26785 5.707 2.1e-08 ***
## E 0.50745 0.05789 8.765 < 2e-16 ***
## C1 0.09930 0.06317 1.572 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.986 on 445 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.5022
## F-statistic: 226.4 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 2
reg.EC2 <- lm(O ~ E+C2, data = temp)
## Call:
## lm(formula = O ~ E + C2, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -5.5950 -1.1474 -0.0641 1.0431 10.7984
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.51447 0.26458 5.724 1.91e-08 ***
## E 0.45073 0.05872 7.676 1.05e-13 ***
## C2 0.14265 0.05414 2.635 0.00871 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.976 on 445 degrees of freedom
## Multiple R-squared: 0.5093, Adjusted R-squared: 0.5071
## F-statistic: 230.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression on MVA and No Exercise with confounder 3
reg.EC3 <- lm(O ~ E+C3, data = temp)
## Call:
## lm(formula = O ~ E + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -5.6516 -1.1369 -0.0745 1.1027 10.0249
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60721 0.26225 6.129 1.95e-09 ***
## E 0.45541 0.05443 8.367 7.68e-16 ***
## C3 0.44833 0.15954 2.810 0.00517 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.974 on 445 degrees of freedom
## Multiple R-squared: 0.5103, Adjusted R-squared: 0.5081
## F-statistic: 231.9 on 2 and 445 DF, p-value: < 2.2e-16
#Regression with explanatory variable and all the confounders that we think are there.
reg.EC1234 <- lm(O ~ E+C1+C2+C3, data = temp)
## Call:
## lm(formula = O ~ E + C1 + C2 + C3, data = temp)
## Residuals:
## Min 1Q Median 3Q Max
## -5.4964 -1.1325 -0.0407 1.0714 10.5865
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53795 0.26771 5.745 1.71e-08 ***
## E 0.39844 0.06998 5.693 2.27e-08 ***
## C1 0.02554 0.06972 0.366 0.7143
## C2 0.08004 0.06679 1.198 0.2314
## C3 0.31152 0.18517 1.682 0.0932 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.974 on 443 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5083
## F-statistic: 116.5 on 4 and 443 DF, p-value: < 2.2e-16
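In the MVA models above, R-squared rises only from 0.5016 with E alone to 0.5127 with all three confounders, and none of C1, C2, C3 is significant at the 5% level in the full model. One way to test formally whether the extra terms improve the fit is an F-test between the nested models with anova(). The sketch below shows the call on simulated data; the variables, effect sizes, and the names reg.full, cmp and p.value are assumptions for illustration, not results from the CHSI data.

```r
set.seed(4)
n  <- 448
E  <- rnorm(n, 10, 3)
C1 <- 0.9 * E + rnorm(n)                  # correlated with E, no direct effect on O
C2 <- 0.9 * E + rnorm(n)                  # correlated with E, no direct effect on O
O  <- 1.5 + 0.6 * E + rnorm(n, sd = 2)
temp <- data.frame(E, C1, C2, O)

reg.E    <- lm(O ~ E, data = temp)            # reduced model
reg.full <- lm(O ~ E + C1 + C2, data = temp)  # full model

# F-test of the null hypothesis that the added coefficients are all zero.
cmp <- anova(reg.E, reg.full)
cmp
p.value <- cmp$`Pr(>F)`[2]
```

A large p-value here would suggest that the reduced model is adequate and that the added variables contribute little beyond E.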
#plotting regression.
## Warning in abline(reg.EC1234): only using the first two of 5 regression
## coefficients

