SlideShare a Scribd company logo
1 of 10
Download to read offline
Second Computing Project
Group: 10
Group 10 Members: Carlos H. Diaz, Jinlong Zhao, Hyewha Lee, Hyun Woo Kim
Introduction
The following report refers to the data set obtained as it is run through statistical models for analysis.
The purpose for analysis was to estimate the function used to generate our data. Using Caspi et al, this function
will explain the correlation between depression outcomes and gene-environmental factors; the hypothesis being
that some genes buffer the effect of stressful life events on depression. Our data set contains one dependent
variable and 20 independent variables total: 5 of which represent environmental factors, E1 – E5, and 15 that
represent gene variables, G1 – G15. There were 3000 observations.
Methodology
To analyze the data we received for computing assignment 2, we’ve decided SAS would be the optimal
software to help us to analyze the data. Since the date file was originally created in Excel, it was necessary to
import the file into SAS. This was done by using the built-in import data option under the ‘file’ menu. In SAS,
we first ran the CORR procedure of two groups, which were DV(Dependent Variable) vs. E1, E2, E3, E4, E5
and DV vs. G1, G2, G3, G4, G5, G6, G7, G8, G9, G10, G11, G12, G13, G14, G15. In total, there were 5
environmental variables and 15 gene variables. From the CORR procedure, we were able to find the Pearson
Correlation Coefficients between DV and the other 20 variables, namely E1-E5 and G1-G15. From the Pearson
Correlation Coefficients table of each group, we observed that there is only one variable that has a strong
correlation with the dependent variable, meaning its absolute value of Pearson Correlation Coefficient was close
to 1; that variable being G7, with an absolute coefficient value of .97953. All the other variables had their
respective coefficient values less than .1, which is not close to 1 (indicative of a weak correlation). After we
analyzed the Pearson Correlation Coefficient tables, we used ‘proc transreg’ command to create a box-cox
transformation to find nonlinear transformations of the dependent variable. We also computed two-way
interaction and used the stepwise regression to select the most plausible independent variable for this
experiment.
Results
Results from our procedures are shown in the appendix (Tables 1-3 and five BoxCox plots
representative of five different transformations of our dependent variable with regards to our E and G
independent variables). Through the ‘proc transreg’ process, we create nonlinear transformations. Table 3.1
represents our procedure as an ANOVA table. Our lambda of choice is shown as 1.5 with a 95% Confidence
Interval; our p-value being less than 0.0001. Our experiment prompted 105 two-way interactions.
Using the CORR procedure, we produce our Pearson Correlation Coefficient tables to find strong, weak,
and unrelated correlations, shown in Table 1.1 and 1.2 (one for our E variables and another for our G variables).
Looking at Table 1.1, we find a strong linear correlation between DV and E4. In Table 1.2, we find strong
linear correlations between DV and G4, DV and G5, DV and G6, and DV and G14. These interactions show
extremely small p-values. We also see that G7 has an R-value of 0.97953, which is the closest to 1, indicative
of a strong correlation. We consider these individual interactions on the dependent variable as well as both
factors by way of two-way interaction. Using the transformation log(DV), we start with G6, and step 25 of the
Stepwise Selection shows 23 variables (Table 4.2).
Regression Model -
Conclusion
From Step 25 of the Stepwise Selection, we note the high R-square value indicating how much of the
dependent variable can be explained by the independent variable (Table 4.1); in this case, 99.08%. The F value
is 13943.2 with p < 0.0001 (less than alpha being 0.05). Due to this, we reject the null hypothesis (H0: B=0).
Therefore, we have a positive relationship between the dependent variable and independent variable. The
association between the gene and environmental factors has a significant impact on depression. Utilizing other
transformations of the dependent variable shows altered outcome values and stepwise selection tables.
Appendix
Table 1.1:
Table 1.2:
Table 2:
BoxCox Model 1.1: (DV)
BoxCox Model 1.2: exp(DV)
BoxCox Model 1.3: sqrt(DV)
BoxCox Model 1.4: log(DV)
BoxCox Model 1.5: 1/sqrt(DV)
Table 3.1:
Table 3.2:
Table 4.1
Table 4.2
SAS Script:
PROC IMPORT OUT= WORK.PROJECT2
DATAFILE= "mysbfiles.campus.stonybrook.eduhwkimGroup10-2
.xlsx"
DBMS=EXCEL REPLACE;
RANGE="'10$'";
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
proc corr data=project2;
var DV E1-E5;
run;
proc corr data=project2;
var DV G1-G15;
run;
proc transreg data=project2 ss2 detail;
model BoxCox(DV/lambda=-3 to 3 by 0.5)=identity (E1-E5 G1-G15);
output;
run;
data new;
set project2;
Y=(log(DV));
run;
data new1;
set new;
array one[*] E1-E5 G1-G15;
array two[*]
e1e2 e1e3 e1e4 e1e5 e1g1 e1g2 e1g3 e1g4 e1g5 e1g6 e1g7 e1g8 e1g9 e1g10 e1g11
e1g12 e1g13 e1g14 e1g15
e2e3 e2e4 e2e5 e2g1 e2g2 e2g3 e2g4 e2g5 e2g6 e2g7 e2g8 e2g9 e2g10 e2g11
e2g12 e2g13 e2g14 eg15
e3e4 e3e5 e3g1 e3g2 e3g3 e3g4 e3g5 e3g6 e3g7 e3g8 e3g9 e3g10 e3g11
e3g12 e3g13 e3g14 e3g15
e4e5 e4g1 e4g2 e4g3 e4g4 e4g5 e4g6 e4g7 e4g8 e4g9 e4g10 e4g11
e4g12 e4g13 e4g14 e4g15
e5g1 e5g2 e5g3 e5g4 e5g5 e5g6 e5g7 e5g8 e5g9 e5g10 e5g11
e5g12 e5g13 e5g14 e5g15
g1g2 g1g3 g1g4 g1g5 g1g6 g1g7 g1g8 g1g9 g1g10 g1g11
g1g12 g1g13 g1g14 g1g15
g2g3 g2g4 g2g5 g2g6 g2g7 g2g8 g2g9 g2g10 g2g11
g2g12 g2g13 g2g14 g2g15
g3g4 g3g5 g3g6 g3g7 g3g8 g3g9 g3g10 g3g11
g3g12 g3g13 g3g14 g3g15
g4g5 g4g6 g4g7 g4g8 g4g9 g4g10 g4g11
g4g12 g4g13 g4g14 g4g15
g5g6 g5g7 g5g8 g5g9 g5g10 g5g11
g5g12 g5g13 g5g14 g5g15
g6g7 g6g8 g6g9 g6g10 g6g11
g6g12 g6g13 g6g14 g6g15
g7g8 g7g9 g7g10 g7g11
g7g12 g7g13 g7g14 g7g15
g8g9 g8g10 g8g11
g8g12 g8g13 g8g14 g8g15
g9g10 g9g11
g9g12 g9g13 g9g14 g9g15
g10g11 g10g12 g10g13 g10g14 g10g15
g11g12 g11g13 g11g14 g11g15
g12g13 g12g14 g12g15
g13g14 g13g15
g14g15
;
n=0;
do i=1 to dim(one);
do j=i+1 to dim(one);
n=n+1;
two(n)=one(i)*one(j);
end;
end;
run;
proc reg data=new1;
model Y= E1-E5 G1-G15
e1e2 e1e3 e1e4 e1e5 e1g1 e1g2 e1g3 e1g4 e1g5 e1g6 e1g7 e1g8 e1g9 e1g10 e1g11
e1g12 e1g13 e1g14 e1g15
e2e3 e2e4 e2e5 e2g1 e2g2 e2g3 e2g4 e2g5 e2g6 e2g7 e2g8 e2g9 e2g10 e2g11
e2g12 e2g13 e2g14 eg15
e3e4 e3e5 e3g1 e3g2 e3g3 e3g4 e3g5 e3g6 e3g7 e3g8 e3g9 e3g10 e3g11
e3g12 e3g13 e3g14 e3g15
e4e5 e4g1 e4g2 e4g3 e4g4 e4g5 e4g6 e4g7 e4g8 e4g9 e4g10 e4g11
e4g12 e4g13 e4g14 e4g15
e5g1 e5g2 e5g3 e5g4 e5g5 e5g6 e5g7 e5g8 e5g9 e5g10 e5g11
e5g12 e5g13 e5g14 e5g15
g1g2 g1g3 g1g4 g1g5 g1g6 g1g7 g1g8 g1g9 g1g10 g1g11
g1g12 g1g13 g1g14 g1g15
g2g3 g2g4 g2g5 g2g6 g2g7 g2g8 g2g9 g2g10 g2g11
g2g12 g2g13 g2g14 g2g15
g3g4 g3g5 g3g6 g3g7 g3g8 g3g9 g3g10 g3g11
g3g12 g3g13 g3g14 g3g15
g4g5 g4g6 g4g7 g4g8 g4g9 g4g10 g4g11
g4g12 g4g13 g4g14 g4g15
g5g6 g5g7 g5g8 g5g9 g5g10 g5g11
g5g12 g5g13 g5g14 g5g15
g6g7 g6g8 g6g9 g6g10 g6g11
g6g12 g6g13 g6g14 g6g15
g7g8 g7g9 g7g10 g7g11
g7g12 g7g13 g7g14 g7g15
g8g9 g8g10 g8g11
g8g12 g8g13 g8g14 g8g15
g9g10 g9g11
g9g12 g9g13 g9g14 g9g15
g10g11 g10g12 g10g13 g10g14 g10g15
g11g12 g11g13 g11g14 g11g15
g12g13 g12g14 g12g15
g13g14 g13g15
g14g15
/selection=stepwise SLENTRY=0.01;
plot residual.*predicted.;
run;

More Related Content

Similar to project2an

Week 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docxWeek 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docx
cockekeshia
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 census
Shuai Yuan
 
Table 2Survival Status Disease SeverityDon.docx
Table 2Survival Status      Disease         SeverityDon.docxTable 2Survival Status      Disease         SeverityDon.docx
Table 2Survival Status Disease SeverityDon.docx
perryk1
 

Similar to project2an (20)

Stats ca report_18180485
Stats ca report_18180485Stats ca report_18180485
Stats ca report_18180485
 
Week 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docxWeek 4 Lecture 12 Significance Earlier we discussed co.docx
Week 4 Lecture 12 Significance Earlier we discussed co.docx
 
Binary Logistic Regression
Binary Logistic RegressionBinary Logistic Regression
Binary Logistic Regression
 
Splay Method of Model Acquisition Assessment
Splay Method of Model Acquisition AssessmentSplay Method of Model Acquisition Assessment
Splay Method of Model Acquisition Assessment
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 census
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
Chapter 14
Chapter 14 Chapter 14
Chapter 14
 
Statistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way AnovaStatistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way Anova
 
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERSTUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
 
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERSTUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
TUNING OF GAINS IN BASIS WEIGHT USING FUZZY CONTROLLERS
 
AUTO MPG Regression Analysis
AUTO MPG Regression AnalysisAUTO MPG Regression Analysis
AUTO MPG Regression Analysis
 
Relations between coffee world market price and retail price in USA: Applicat...
Relations between coffee world market price and retail price in USA: Applicat...Relations between coffee world market price and retail price in USA: Applicat...
Relations between coffee world market price and retail price in USA: Applicat...
 
Design of Experiments
Design of Experiments Design of Experiments
Design of Experiments
 
Correlation
CorrelationCorrelation
Correlation
 
Table 2Survival Status Disease SeverityDon.docx
Table 2Survival Status      Disease         SeverityDon.docxTable 2Survival Status      Disease         SeverityDon.docx
Table 2Survival Status Disease SeverityDon.docx
 
Causal Inference in R
Causal Inference in RCausal Inference in R
Causal Inference in R
 
Spss in soil science
Spss in soil scienceSpss in soil science
Spss in soil science
 
Auto MPG Regression Analysis
Auto MPG Regression AnalysisAuto MPG Regression Analysis
Auto MPG Regression Analysis
 
Wine.Final.Project.MJv3
Wine.Final.Project.MJv3Wine.Final.Project.MJv3
Wine.Final.Project.MJv3
 
Group5
Group5Group5
Group5
 

project2an

  • 1. Second Computing Project Group: 10 Group 10 Members: Carlos H. Diaz, Jinlong Zhao, Hyewha Lee, Hyun Woo Kim Introduction The following report refers to the data set obtained as it is run through statistical models for analysis. The purpose for analysis was to estimate the function used to generate our data. Using Caspi et al, this function will explain the correlation between depression outcomes and gene-environmental factors; the hypothesis being that some genes buffer the effect of stressful life events on depression. Our data set contains one dependent variable and 20 independent variables total: 5 of which represent environmental factors, E1 – E5, and 15 that represent gene variables, G1 – G15. There were 3000 observations. Methodology To analyze the data we received for computing assignment 2, we’ve decided SAS would be the optimal software to help us to analyze the data. Since the date file was originally created in Excel, it was necessary to import the file into SAS. This was done by using the built-in import data option under the ‘file’ menu. In SAS, we first ran the CORR procedure of two groups, which were DV(Dependent Variable) vs. E1, E2, E3, E4, E5 and DV vs. G1, G2, G3, G4, G5, G6, G7, G8, G9, G10, G11, G12, G13, G14, G15. In total, there were 5 environmental variables and 15 gene variables. From the CORR procedure, we were able to find the Pearson Correlation Coefficients between DV and the other 20 variables, namely E1-E5 and G1-G15. From the Pearson Correlation Coefficients table of each group, we observed that there is only one variable that has a strong correlation with the dependent variable, meaning its absolute value of Pearson Correlation Coefficient was close to 1; that variable being G7, with an absolute coefficient value of .97953. All the other variables had their respective coefficient values less than .1, which is not close to 1 (indicative of a weak correlation). After we analyzed the Pearson Correlation Coefficient tables, we used ‘proc transreg’ command to create a box-cox transformation to find nonlinear transformations of the dependent variable. We also computed two-way interaction and used the stepwise regression to select the most plausible independent variable for this experiment. Results Results from our procedures are shown in the appendix (Tables 1-3 and five BoxCox plots representative of five different transformations of our dependent variable with regards to our E and G independent variables). Through the ‘proc transreg’ process, we create nonlinear transformations. Table 3.1 represents our procedure as an ANOVA table. Our lambda of choice is shown as 1.5 with a 95% Confidence Interval; our p-value being less than 0.0001. Our experiment prompted 105 two-way interactions. Using the CORR procedure, we produce our Pearson Correlation Coefficient tables to find strong, weak, and unrelated correlations, shown in Table 1.1 and 1.2 (one for our E variables and another for our G variables). Looking at Table 1.1, we find a strong linear correlation between DV and E4. In Table 1.2, we find strong linear correlations between DV and G4, DV and G5, DV and G6, and DV and G14. These interactions show extremely small p-values. We also see that G7 has an R-value of 0.97953, which is the closest to 1, indicative of a strong correlation. We consider these individual interactions on the dependent variable as well as both
  • 2. factors by way of two-way interaction. Using the transformation log(DV), we start with G6, and step 25 of the Stepwise Selection shows 23 variables (Table 4.2). Regression Model - Conclusion From Step 25 of the Stepwise Selection, we note the high R-square value indicating how much of the dependent variable can be explained by the independent variable (Table 4.1); in this case, 99.08%. The F value is 13943.2 with p < 0.0001 (less than alpha being 0.05). Due to this, we reject the null hypothesis (H0: B=0). Therefore, we have a positive relationship between the dependent variable and independent variable. The association between the gene and environmental factors has a significant impact on depression. Utilizing other transformations of the dependent variable shows altered outcome values and stepwise selection tables.
  • 5. BoxCox Model 1.2: exp(DV) BoxCox Model 1.3: sqrt(DV)
  • 6. BoxCox Model 1.4: log(DV) BoxCox Model 1.5: 1/sqrt(DV)
  • 9. SAS Script: PROC IMPORT OUT= WORK.PROJECT2 DATAFILE= "mysbfiles.campus.stonybrook.eduhwkimGroup10-2 .xlsx" DBMS=EXCEL REPLACE; RANGE="'10$'"; GETNAMES=YES; MIXED=NO; SCANTEXT=YES; USEDATE=YES; SCANTIME=YES; RUN; proc corr data=project2; var DV E1-E5; run; proc corr data=project2; var DV G1-G15; run; proc transreg data=project2 ss2 detail; model BoxCox(DV/lambda=-3 to 3 by 0.5)=identity (E1-E5 G1-G15); output; run; data new; set project2; Y=(log(DV)); run; data new1; set new; array one[*] E1-E5 G1-G15; array two[*] e1e2 e1e3 e1e4 e1e5 e1g1 e1g2 e1g3 e1g4 e1g5 e1g6 e1g7 e1g8 e1g9 e1g10 e1g11 e1g12 e1g13 e1g14 e1g15 e2e3 e2e4 e2e5 e2g1 e2g2 e2g3 e2g4 e2g5 e2g6 e2g7 e2g8 e2g9 e2g10 e2g11 e2g12 e2g13 e2g14 eg15 e3e4 e3e5 e3g1 e3g2 e3g3 e3g4 e3g5 e3g6 e3g7 e3g8 e3g9 e3g10 e3g11 e3g12 e3g13 e3g14 e3g15 e4e5 e4g1 e4g2 e4g3 e4g4 e4g5 e4g6 e4g7 e4g8 e4g9 e4g10 e4g11 e4g12 e4g13 e4g14 e4g15 e5g1 e5g2 e5g3 e5g4 e5g5 e5g6 e5g7 e5g8 e5g9 e5g10 e5g11 e5g12 e5g13 e5g14 e5g15 g1g2 g1g3 g1g4 g1g5 g1g6 g1g7 g1g8 g1g9 g1g10 g1g11 g1g12 g1g13 g1g14 g1g15 g2g3 g2g4 g2g5 g2g6 g2g7 g2g8 g2g9 g2g10 g2g11 g2g12 g2g13 g2g14 g2g15 g3g4 g3g5 g3g6 g3g7 g3g8 g3g9 g3g10 g3g11 g3g12 g3g13 g3g14 g3g15 g4g5 g4g6 g4g7 g4g8 g4g9 g4g10 g4g11 g4g12 g4g13 g4g14 g4g15 g5g6 g5g7 g5g8 g5g9 g5g10 g5g11 g5g12 g5g13 g5g14 g5g15 g6g7 g6g8 g6g9 g6g10 g6g11 g6g12 g6g13 g6g14 g6g15 g7g8 g7g9 g7g10 g7g11 g7g12 g7g13 g7g14 g7g15 g8g9 g8g10 g8g11 g8g12 g8g13 g8g14 g8g15 g9g10 g9g11 g9g12 g9g13 g9g14 g9g15 g10g11 g10g12 g10g13 g10g14 g10g15
  • 10. g11g12 g11g13 g11g14 g11g15 g12g13 g12g14 g12g15 g13g14 g13g15 g14g15 ; n=0; do i=1 to dim(one); do j=i+1 to dim(one); n=n+1; two(n)=one(i)*one(j); end; end; run; proc reg data=new1; model Y= E1-E5 G1-G15 e1e2 e1e3 e1e4 e1e5 e1g1 e1g2 e1g3 e1g4 e1g5 e1g6 e1g7 e1g8 e1g9 e1g10 e1g11 e1g12 e1g13 e1g14 e1g15 e2e3 e2e4 e2e5 e2g1 e2g2 e2g3 e2g4 e2g5 e2g6 e2g7 e2g8 e2g9 e2g10 e2g11 e2g12 e2g13 e2g14 eg15 e3e4 e3e5 e3g1 e3g2 e3g3 e3g4 e3g5 e3g6 e3g7 e3g8 e3g9 e3g10 e3g11 e3g12 e3g13 e3g14 e3g15 e4e5 e4g1 e4g2 e4g3 e4g4 e4g5 e4g6 e4g7 e4g8 e4g9 e4g10 e4g11 e4g12 e4g13 e4g14 e4g15 e5g1 e5g2 e5g3 e5g4 e5g5 e5g6 e5g7 e5g8 e5g9 e5g10 e5g11 e5g12 e5g13 e5g14 e5g15 g1g2 g1g3 g1g4 g1g5 g1g6 g1g7 g1g8 g1g9 g1g10 g1g11 g1g12 g1g13 g1g14 g1g15 g2g3 g2g4 g2g5 g2g6 g2g7 g2g8 g2g9 g2g10 g2g11 g2g12 g2g13 g2g14 g2g15 g3g4 g3g5 g3g6 g3g7 g3g8 g3g9 g3g10 g3g11 g3g12 g3g13 g3g14 g3g15 g4g5 g4g6 g4g7 g4g8 g4g9 g4g10 g4g11 g4g12 g4g13 g4g14 g4g15 g5g6 g5g7 g5g8 g5g9 g5g10 g5g11 g5g12 g5g13 g5g14 g5g15 g6g7 g6g8 g6g9 g6g10 g6g11 g6g12 g6g13 g6g14 g6g15 g7g8 g7g9 g7g10 g7g11 g7g12 g7g13 g7g14 g7g15 g8g9 g8g10 g8g11 g8g12 g8g13 g8g14 g8g15 g9g10 g9g11 g9g12 g9g13 g9g14 g9g15 g10g11 g10g12 g10g13 g10g14 g10g15 g11g12 g11g13 g11g14 g11g15 g12g13 g12g14 g12g15 g13g14 g13g15 g14g15 /selection=stepwise SLENTRY=0.01; plot residual.*predicted.; run;