1. Second Computing Project
Group: 10
Group 10 Members: Carlos H. Diaz, Jinlong Zhao, Hyewha Lee, Hyun Woo Kim
Introduction
The following report refers to the data set obtained as it is run through statistical models for analysis.
The purpose for analysis was to estimate the function used to generate our data. Using Caspi et al, this function
will explain the correlation between depression outcomes and gene-environmental factors; the hypothesis being
that some genes buffer the effect of stressful life events on depression. Our data set contains one dependent
variable and 20 independent variables total: 5 of which represent environmental factors, E1 – E5, and 15 that
represent gene variables, G1 – G15. There were 3000 observations.
Methodology
To analyze the data we received for computing assignment 2, we’ve decided SAS would be the optimal
software to help us to analyze the data. Since the date file was originally created in Excel, it was necessary to
import the file into SAS. This was done by using the built-in import data option under the ‘file’ menu. In SAS,
we first ran the CORR procedure of two groups, which were DV(Dependent Variable) vs. E1, E2, E3, E4, E5
and DV vs. G1, G2, G3, G4, G5, G6, G7, G8, G9, G10, G11, G12, G13, G14, G15. In total, there were 5
environmental variables and 15 gene variables. From the CORR procedure, we were able to find the Pearson
Correlation Coefficients between DV and the other 20 variables, namely E1-E5 and G1-G15. From the Pearson
Correlation Coefficients table of each group, we observed that there is only one variable that has a strong
correlation with the dependent variable, meaning its absolute value of Pearson Correlation Coefficient was close
to 1; that variable being G7, with an absolute coefficient value of .97953. All the other variables had their
respective coefficient values less than .1, which is not close to 1 (indicative of a weak correlation). After we
analyzed the Pearson Correlation Coefficient tables, we used ‘proc transreg’ command to create a box-cox
transformation to find nonlinear transformations of the dependent variable. We also computed two-way
interaction and used the stepwise regression to select the most plausible independent variable for this
experiment.
Results
Results from our procedures are shown in the appendix (Tables 1-3 and five BoxCox plots
representative of five different transformations of our dependent variable with regards to our E and G
independent variables). Through the ‘proc transreg’ process, we create nonlinear transformations. Table 3.1
represents our procedure as an ANOVA table. Our lambda of choice is shown as 1.5 with a 95% Confidence
Interval; our p-value being less than 0.0001. Our experiment prompted 105 two-way interactions.
Using the CORR procedure, we produce our Pearson Correlation Coefficient tables to find strong, weak,
and unrelated correlations, shown in Table 1.1 and 1.2 (one for our E variables and another for our G variables).
Looking at Table 1.1, we find a strong linear correlation between DV and E4. In Table 1.2, we find strong
linear correlations between DV and G4, DV and G5, DV and G6, and DV and G14. These interactions show
extremely small p-values. We also see that G7 has an R-value of 0.97953, which is the closest to 1, indicative
of a strong correlation. We consider these individual interactions on the dependent variable as well as both
2. factors by way of two-way interaction. Using the transformation log(DV), we start with G6, and step 25 of the
Stepwise Selection shows 23 variables (Table 4.2).
Regression Model -
Conclusion
From Step 25 of the Stepwise Selection, we note the high R-square value indicating how much of the
dependent variable can be explained by the independent variable (Table 4.1); in this case, 99.08%. The F value
is 13943.2 with p < 0.0001 (less than alpha being 0.05). Due to this, we reject the null hypothesis (H0: B=0).
Therefore, we have a positive relationship between the dependent variable and independent variable. The
association between the gene and environmental factors has a significant impact on depression. Utilizing other
transformations of the dependent variable shows altered outcome values and stepwise selection tables.