
Using SPLS Regression to identify Genomic Loci affecting Essential
Polyunsaturated Fatty Acid Desaturation and Elongation
Daniel Crawford (0590539)
BINF 6999 - University of Guelph
Advisors: Mutch, D., Dang, S.
Submitted: 8 Aug 2013
Final Weighting: 80% Analysis
20% Laboratory

Table of Contents
Introduction
Methods
Results
Model 1
Model 2 (a)
Model 2 (b)
Model 3 (a)
Model 3 (b)
Discussion
Conclusions
Appendix i - Confidence intervals for desaturation / elongation activity
Appendix ii - Confidence intervals for effect of SNPs on PUFA levels
Appendix iii - Genomic location of SNPs in mapping the FADS gene cluster
References

INTRODUCTION
Investigations into the human plasma lipidome provide useful information
regarding many areas of human health. The polyunsaturated fatty acids (PUFAs)
alpha-linolenic acid (ALA; 18:3 n-3) and linoleic acid (LA; 18:2 n-6) are essential fatty
acids for humans (Goodhart, R., 1980). These fatty acids (FAs) are precursors to longer chain
PUFAs such as arachidonic acid (AA; 20:4 n-6), eicosapentaenoic acid (EPA; 20:5 n-3),
and docosapentaenoic acid (DPA; 22:5 n-3). Levels of these FAs in cellular
membranes and plasma are related to dietary intake of either the preformed products or
the precursor FAs (Marangoni, R., 2002).
Relative amounts of n-6 and n-3 FAs in the diet have an important impact on an
individual's overall health. Typically, a diet rich in n-6 PUFAs will shift the physiological
state towards being prothrombotic and proaggregatory (Simopoulos, A., 1999). High
levels of long chain n-3 PUFAs decrease the production of inflammatory eicosanoids
and cytokines (Calder, P., 2006). An increased proportion of EPA and docosahexaenoic
acid (DHA; 22:6 n-3) in inflammatory cell phospholipids directly decreases the availability
of AA as a substrate for synthesis of pro-inflammatory eicosanoids (Caughey, GE., 1996).
EPA can also be used as a substrate for eicosanoid production, yielding molecules that
are less pro-inflammatory (Goldman, DW., 1983) or even anti-inflammatory (Serhan, CN.,
2000). Long chain n-3 PUFAs can also affect gene transcription, including transcription
of lipogenic enzymes (Clarke, SD., 1994). While the long-chain n-3s exhibit potent
anti-inflammatory effects, the precursor ALA does not, in itself, contribute anti-inflammatory
effects to the same degree (Simopoulos, A., 1999; Calder, P., 2006); therefore
investigations into desaturation and elongation of the essential PUFAs are warranted.
The fatty acid desaturation pathway
The fatty acid desaturation pathway (Figure 1) consists of a delta-6 desaturation
(D6D) step, an elongation step, and a delta-5 desaturation (D5D) step. The enzymes
responsible for these steps are Δ6 desaturase, fatty acid elongase, and Δ5 desaturase.
n-3 and n-6 PUFAs share these enzymes and compete for them as substrates. Δ6 and Δ5
desaturases are rate-limiting enzymes which catalyze double bond formation at the Δ6
and Δ5 positions of essential PUFAs, and are encoded by the genes FADS2 and FADS1
respectively. The two FADS genes are found in a head-to-head configuration on
chromosome 11 (61.57 – 61.63 Mb). Recent gene association studies have identified
links between genetic variants in the FADS cluster of genes, plasma fatty acids, and
development of diseases such as metabolic syndrome (Truong, H. et al., 2009),
coronary artery disease (Martinelli, N. et al., 2008), myocardial infarction (Baylin, A. et al.,
2007), and dyslipidemia (Lu, Y. et al., 2010). Genetic variation in the form of single
nucleotide polymorphisms (SNPs) within the FADS cluster that alters
desaturation activity can be identified using regression.
The goal of this independent research project was to identify the causal genetic
variants altering the efficiency of the fatty acid desaturase pathway, and construct a
haplotype to differentiate high converters and low converters of n-3 and n-6 PUFAs. A
high converter is an individual with high aggregate desaturase activity (ADA), meaning
the overall activity of the desaturation pathway is increased.
Figure 1 - Fatty acid desaturase pathway. The FADS2 (fatty acid desaturase 2) gene
encodes Δ-6 desaturase, which catalyzes double bond formation at the Δ-6
carbon. Fatty acid elongase 2, encoded by ELOVL2, adds 2 carbons. The FADS1 (fatty
acid desaturase 1) gene encodes Δ-5 desaturase, which catalyzes double bond
formation at the Δ-5 carbon.

METHODS
Toronto Nutrigenomics and Health (TNH) Study
The Toronto Nutrigenomics and Health (TNH) Study examined over 2000 young
adults from the University of Toronto. Participants of the study completed a health and
lifestyle questionnaire, which included information about smoking habits, physical
activity, medical history, caffeine habits, as well as ethnic background. The two largest
ethnocultural groups were Caucasian and East Asian. Sex was recorded, and
anthropometric measurements such as waist circumference and Body Mass Index (BMI)
were collected. The study collected genomic and lipidomic data from a subset of the
population.
Sparse Partial Least Squares (SPLS) Regression
As modern clinical studies such as the TNH Study are able to produce copious
amounts of genomic and metabolomic data, investigations into statistical methodologies
and bioinformatics are critical for extracting biologically relevant information
efficiently.
Partial Least Squares (PLS) was introduced by Wold in 1966. PLS performs a
basic latent decomposition of the predictor matrix and the response matrix to find a
small number of direction vectors which represent linear combinations of the predictor
variables. In 2010, Chun, H., and Keles, S. introduced Sparse PLS, which incorporates
sparsity directly into the dimension reduction step of PLS. This results in sparse linear
combinations of the original predictors and, in the case of this analysis, the selection of
the SNPs most likely to be biologically relevant. Sparsity is imposed by implementing an
L1 penalty; SPLS uses two tuning parameters, eta and K.
Eta (η) is the thresholding parameter and must be between 0 and 1. SPLS uses
a form of soft thresholding in which components are retained if they are greater than
some fraction, determined by eta, of the maximum component. Eta directly determines
the sparsity of the final solution, as a high eta will result in fewer selected variables. K is
the number of latent components, and must be less than the rank of the predictor
matrix. As K decreases, the solution becomes more sparse. Selection of the optimal
parameters to use in an SPLS calculation is done using 10-fold cross validation. This
algorithm randomly partitions the observations into 10 subgroups, and uses each subgroup
once as the validation group while the other 9 are used as the training set. Optimal
parameters can then be estimated as a combination of the results from each analysis.
An advantage of 10-fold cross validation is that each observation is used once and only
once for validation. An inherent consequence of this method is that the 10
subgroups are randomly assigned, and therefore subsequent cross-validations may
result in different estimates of the optimal parameters. Optimal parameters are those
determined to have the lowest mean square error (MSE). Performing the SPLS analysis
using alternate parameters may affect the final number of SNPs selected as having
a significant effect; however, SNPs that are more significant are less likely to be affected by
small changes in eta and K. The SPLS algorithm automatically sets the regression
coefficients of SNPs that are not selected to 0.
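The thresholding rule described above can be illustrated with a minimal Python sketch (the analysis itself was run in R). It assumes the soft-thresholding form usually given for SPLS, in which each entry of a direction vector is shrunk by eta times the largest absolute entry and every entry at or below that cutoff becomes 0. The function name and the example weights are hypothetical and are not taken from the study data.

```python
def soft_threshold(direction, eta):
    """Shrink each entry by eta * max|entry|; entries at or below the cutoff become 0."""
    if not 0 <= eta < 1:
        raise ValueError("eta is assumed to lie in [0, 1)")
    cutoff = eta * max(abs(w) for w in direction)
    return [
        (abs(w) - cutoff) * (1 if w > 0 else -1) if abs(w) > cutoff else 0.0
        for w in direction
    ]


# Hypothetical direction vector over five SNPs (illustrative values only).
weights = [0.90, -0.85, 0.30, -0.10, 0.05]

low_eta = soft_threshold(weights, 0.2)   # cutoff 0.18
high_eta = soft_threshold(weights, 0.9)  # cutoff 0.81

# Count how many variables survive at each setting.
selected_low = sum(1 for w in low_eta if w != 0.0)
selected_high = sum(1 for w in high_eta if w != 0.0)
```

With these example weights the lower eta keeps three variables and the higher eta keeps two, which is the sense in which a high eta gives a sparser solution.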
While previous studies found associations with many SNPs within the FADS
gene cluster, SPLS regression selects a small number of important variables.
This method will identify the SNP(s) within FADS1/2 and ELOVL2 that are most likely
responsible for the observed variation in FA levels in plasma.
Data were collected as part of the Toronto Nutrigenomics and Health Study.
Phenotype data for this analysis were gas chromatography (GC) measurements for the
following fatty acids:
Fatty acid                 Abbreviation   Notation
Linoleic acid              LA             18:2 n-6
γ-Linolenic acid           GLA            18:3 n-6
Dihomo-γ-linolenic acid    DGLA           20:3 n-6
Arachidonic acid           AA             20:4 n-6
Alpha-linolenic acid       ALA            18:3 n-3
Eicosapentaenoic acid      EPA            20:5 n-3

Enzyme activity was approximated by the following FA ratios:
Activity                                        Ratio
n-6 Δ6 Desaturase (n-6 D6D)                     GLA / LA
n-6 Elongase                                    DGLA / GLA
n-6 Δ5 Desaturase (n-6 D5D)                     AA / DGLA
n-6 Aggregate Desaturase Activity (n-6 ADA)     AA / LA
n-3 Aggregate Desaturase Activity (n-3 ADA)     EPA / ALA
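As an illustration of how the n-6 ratios above are formed, the following Python sketch computes them for a single individual. The function name and the fatty acid levels are hypothetical, chosen only to make the arithmetic easy to follow; the n-3 aggregate ratio is built in the same way from ALA and EPA.

```python
def n6_activity_ratios(la, gla, dgla, aa):
    """Product/precursor ratios used as proxies for n-6 pathway activity."""
    return {
        "n6_D6D": gla / la,         # delta-6 desaturation: LA -> GLA
        "n6_elongase": dgla / gla,  # elongation: GLA -> DGLA
        "n6_D5D": aa / dgla,        # delta-5 desaturation: DGLA -> AA
        "n6_ADA": aa / la,          # aggregate activity: LA -> AA
    }


# Hypothetical plasma levels for one individual (arbitrary units, not study data).
ratios = n6_activity_ratios(la=30.0, gla=0.3, dgla=1.5, aa=7.5)

# The intermediate fatty acids cancel, so the three step ratios multiply
# to the aggregate ratio.
stepwise_product = ratios["n6_D6D"] * ratios["n6_elongase"] * ratios["n6_D5D"]
```

Because the aggregate ratio is the product of the three step ratios, the step-wise and aggregate responses are not independent of one another, which is worth keeping in mind when they are modelled together as a multivariate response.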
Genotype data for 26 SNPs mapped 3 genes: FADS1, FADS2, and ELOVL2
(see Appendix iii). The genotype data were formatted in a matrix such that individuals
were given a ‘0’ for SNPs which were homozygous for the major allele, and ‘1’ for SNPs
that were either heterozygous or homozygous for the minor allele. 37 individuals were
missing data for one or more SNPs, and were removed from the analysis. Each SNP was
tested for Hardy-Weinberg equilibrium (HWE) in the whole population, the Caucasian
population, and the East Asian population, to identify ethnicity-dependent polymorphisms.
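The genotype coding and the HWE check can be sketched in Python as follows. The report does not state which HWE test was used, so the Pearson chi-square goodness-of-fit test shown here is an assumption, as are the function names and the genotype counts.

```python
CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def dominant_code(minor_allele_count):
    """0 = homozygous major; 1 = heterozygous or homozygous minor."""
    return 0 if minor_allele_count == 0 else 1


def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square statistic comparing genotype counts to Hardy-Weinberg proportions."""
    n = n_major_hom + n_het + n_minor_hom
    p = (2 * n_major_hom + n_het) / (2 * n)  # major-allele frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("SNP is monomorphic; the test is undefined")
    observed = (n_major_hom, n_het, n_minor_hom)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical genotype counts (not study data).
in_hwe = hwe_chi_square(324, 108, 9) < CRITICAL_5PCT_1DF      # counts match p = 6/7 exactly
out_of_hwe = hwe_chi_square(360, 80, 10) > CRITICAL_5PCT_1DF  # statistic is 4.5
```

Note that the HWE test needs the three genotype counts, so it has to be run before the genotypes are collapsed to the 0/1 coding.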
Data analysis was performed in R (R Core Team, 2013). The package
“spls” (Chun, H. and Keles, S., 2010) was used for regression analysis. Optimal
parameters for regression were first determined using cross-validation (CV). The spls
function was then used for variable selection, where an SPLS object was created from
the predictor and response matrix data. Confidence intervals were calculated for the
effect of each SNP on the multivariate responses.
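The 10-fold partition behind the cross-validation step can be sketched in Python as follows. This only illustrates how individuals are split into folds; it is not the routine used by the R package, and the function names and the seed are hypothetical. The sample size of 335 matches the Model 2 training population.

```python
import random


def k_fold_indices(n_observations, k=10, seed=None):
    """Randomly partition observation indices into k roughly equal folds."""
    rng = random.Random(seed)
    order = list(range(n_observations))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def train_validation_splits(folds):
    """Yield (training, validation) index lists, using each fold once for validation."""
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


folds = k_fold_indices(335, k=10, seed=1)
splits = list(train_validation_splits(folds))
```

Changing the seed changes which individuals fall in which fold, which is why repeated cross-validation runs can return slightly different optimal parameters.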
For model 1, predictor matrices were created for both the Caucasian and Asian
populations, containing genotype data for only the SNPs used in the Merino et al.
analysis. 15 SNPs were tested in Caucasians: 3 SNPs in FADS1 (rs174547, rs412334,
and rs695867) and 12 SNPs in FADS2 (rs174576, rs174579, rs174593, rs174602,
rs174611, rs174627, rs17831757, rs2072114, rs2845573, rs482548, rs498793, and
rs968567). In Asians, 15 SNPs were tested: 3 in FADS1 (rs174547, rs412334, and
rs695867) and 12 SNPs in FADS2 (rs174570, rs174576, rs174579, rs174593,
rs174602, rs174611, rs174627, rs17831757, rs2072114, rs2845573, rs498793, and
rs526126).

For models 2 and 3, a training data set was created for the Caucasian
population, composed of randomly selected individuals of Caucasian ethnicity. The
predictor matrix for these models included the SNPs in HWE within the Caucasian
population. 19 SNPs from the FADS cluster were tested for HWE (RS174547,
RS174570, RS174576, RS174579, RS174593, RS174602, RS174626, RS174627,
RS17831757, RS2072114, RS412334, RS482548, RS498793, RS526126, RS695867,
RS968567, RS174611, RS2845573, RS2851682). 7 SNPs from ELOVL2 were tested
for HWE (RS12195587, RS13204015, RS3798719, RS8523, RS976081, RS3798720,
RS911196). Age, BMI, and sex were included as confounding variables to account for
the effect of these factors on plasma lipid levels.
The SPLS object, containing the set of regression coefficients, was then used to
create a predicted fatty acid profile. Agreement between the predicted response and
the actual response was measured with the Pearson product-moment correlation
coefficient and is presented as r2 values.
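For reference, the Pearson coefficient and its square can be computed as in the Python sketch below. The function name and the predicted and actual values are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical predicted and measured responses for four individuals.
predicted = [1.0, 2.0, 3.0, 4.0]
actual = [1.0, 3.0, 2.0, 4.0]

r = pearson_r(predicted, actual)
r_squared = r ** 2
```

Here r is 0.8, so the r2 value is 0.64, meaning that 64% of the variance in the actual values is accounted for by a linear function of the predictions.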
RESULTS
Data from 1059 individuals (309 males and 750 females) aged 20-29 years were
used for this study. Participants were of normal BMI (22.3 +/- 3.5) and non-smokers.
The population was ethnoculturally diverse; the ethnic group with the largest
representation was Caucasian (n=450), and the second largest was East Asian (n=384).
Gas chromatography results showed that the PUFA with the highest average level in
plasma was linoleic acid, and the second highest was arachidonic acid (Figure 2). The highest
ratio was AA/DGLA in Caucasians (Table 1), as well as in Asians (Table 2), and represents
n-6 D5D; the second highest ratio was DGLA/GLA, which represents n-6 elongase.

Figure 2 - Distribution of levels of each PUFA as measured by GC.
Table 1 - Desaturation pathway activity - Average of Caucasian population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.011     4.662       4.832     0.203     1.596
Table 2 - Desaturation pathway activity - Average of East Asian population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.011     4.779       5.291     0.204     1.597
The 26 SNPs mapping the FADS1, FADS2, and ELOVL2 genes were tested for HWE.
In the Caucasian population (n=451) 25 SNPs were in HWE:
"RS12195587" "RS174547" "RS174570" "RS174576" "RS174579"
"RS174593" "RS174602" "RS174626" "RS174627" "RS17831757"
"RS2072114" "RS3798719" "RS412334" "RS482548" "RS498793"
"RS526126" "RS695867" "RS8523" "RS968567" "RS976081"
"RS174611" "RS2845573" "RS2851682" "RS3798720" "RS911196"
RS13204015 was not in HWE.
In the East Asian population (n=326), the following 19 SNPs were in HWE:
"RS12195587" "RS13204015" "RS174576" "RS174579" "RS174593"
"RS174602" "RS174626" "RS174627" "RS17831757" "RS2072114"
"RS3798719" "RS412334" "RS482548" "RS526126" "RS695867"
"RS968567" "RS976081" "RS174611" "RS911196"

In the general population (n=1059), the following 9 SNPs were in HWE:
"RS174579" "RS174593" "RS174626" "RS17831757" "RS3798719" "RS482548"
"RS695867" "RS968567" "RS976081"
MODEL 1 - Replication of Merino et al analysis
The goal of this model was to compare results of SPLS regression to linear
regression. SPLS regression is expected to select the variables that are most important
in terms of having an actual effect on desaturation activity. Merino et al. used
linear regression to test 15 SNPs within a Caucasian group (n=78) for significant
associations with D5D, D6D, n-6 ADA and n-3 ADA. Significant associations were found
for 9 SNPs. Testing the same 15 SNPs using SPLS (eta=0.8 and K=2) in a larger
Caucasian population (n=450) identified 3 SNPs ("RS174547" "RS174576"
"RS968567") with significant associations to desaturase activity.
SPLS regression (Eta = 0.88, K=3) results show that in the Caucasian population
RS174547 is significantly associated with D6D and n-6 ADA, RS174576 is significantly
associated with n-3 ADA, and RS968567 is significantly associated with n-6 D6D
activity (Table 3).
Table 3 - Regression coefficients for presence of at least 1 minor allele in SNPs
with significant associations to desaturation activity (Caucasian population).
Negative coefficients indicate decreased desaturation activity.
SNP          n-6 D6D   n-6 D5D   n-6 ADA   n-3 ADA
RS174547     -0.299    0         -0.003    0
RS174576     0         0         0         -0.130
RS968567     -0.227    0         0         0

Linear regression by Merino et al. resulted in 9 SNPs having significant
associations. SPLS (Eta = 0.88, K=2) results show that in the Asian population
RS174547 and RS174576 are each significantly associated with D5D, D6D, n-6 ADA,
and n-3 ADA, and RS174611 is significantly associated with n-6 D6D activity (Table 4).
Table 4 - Regression coefficients for presence of at least 1 ‘C’ allele in SNPs with
significant associations to desaturation activity (Asian population). In the Asian
population, the ‘C’ allele is the major allele in the SNP RS174547. Negative coefficients
indicate decreased desaturation activity.
SNP          n-6 D6D   n-6 D5D   n-6 ADA   n-3 ADA
RS174547     -0.250    -0.001    -0.013    -0.041
RS174576     -0.199    -0.002    -0.014    -0.045
RS174611     -0.314    0         0         0
MODEL 2(a) - Multivariate Analysis of Desaturase and Elongase activity (Caucasian
population)
The goal of this model was to select SNPs which affect all or a subset of the
desaturation/elongation activities, including ADA. It is hypothesized that cross validation
will select optimal parameters that result in a solution with the appropriate sparsity, and
that SPLS will select the SNPs best associated with the desaturase activity responses while
excluding SNPs that are highly collinear but do not have any direct effect on any
desaturation activity.
A sample of Caucasians (n=335) was randomly selected to be used as the
training population. Genotype data for 25 SNPs mapping ELOVL2 and FADS1/2, as well
as age, sex, and BMI, comprised the predictor matrix. Fatty acid ratios were used to
estimate desaturation/elongation activity and aggregate n-6 and n-3 desaturation
activity. The FA ratios were log transformed, as the transformed values were more normally
distributed. 10-fold cross validation determined that the optimal parameters were eta = 0.64
and K = 3, with a mean square prediction error of 0.157 (Figure 3). The cross-validation
results indicate that the most likely solution has low sparsity.
Figure 3 - CV Heatmap of Mean Square Prediction Errors for range of possible
SPLS parameters.
SPLS regression selected 10 SNPs that were significantly associated with at
least one step in the fatty acid desaturation pathway. 6 SNPs were associated with D6D
activity, 4 SNPs with elongation, 7 SNPs with n-6 D5D, 9 SNPs with n-6 ADA, and 6
SNPs with n-3 ADA. Table 5 shows the regression coefficients for the SNPs with
significant associations; coefficients whose confidence interval contains 0 are set to 0.

Table 5 - Regression coefficients for selected variables and desaturation/
elongation activity (eta = 0.64, K=3). Negative coefficients indicate presence of at least
one minor allele results in decreased desaturation activity.
Variable     n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
RS174547     -0.017    0           -0.022    -0.023    -0.033
RS174570     -0.034    0.027       0         -0.018    0
RS174576     -0.016    0           -0.022    -0.023    -0.034
RS174579     0         0           -0.024    -0.013    -0.033
RS174593     0         0           -0.023    -0.013    0
RS2072114    -0.045    0.048       -0.023    -0.020    -0.023
RS968567     0         0           -0.029    -0.012    -0.039
RS976081     0         0           0.036     0         0.034
RS2845573    -0.036    0.028       0         -0.018    0
RS2851682    -0.032    0.025       0         -0.018    0
sex          -0.079    0.125       -0.058    0         0
The SPLS-derived regression coefficients were used to predict the desaturation
activity response in an independent population of Caucasian individuals. The Caucasian
individuals not selected as part of the training population were used as the testing
population (n=115). As SPLS creates linear combinations of the original variables, the
predicted response has a linear relationship to the actual response; therefore Pearson's
correlation coefficient was used to evaluate how well predicted data matched actual
data. This regression model, with direction vectors containing 10 SNPs, was best able to
predict n-6 ADA (r2 = 0.24) in the test population.
Table 6 - r2 value for predicted response in Caucasian testing population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.085     0.184       0.188     0.239     0.080
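The prediction step amounts to a linear combination of the 0/1 genotype codes and the fitted coefficients, as the Python sketch below illustrates. It uses the first coefficient column of Table 5 for RS174547 and RS174576 only; the intercept, the function name, and the two individuals are hypothetical, and any centering or scaling applied by the fitted model is ignored.

```python
# Coefficients for two SNPs from the first coefficient column of Table 5.
coefficients = {"RS174547": -0.017, "RS174576": -0.016}

# Hypothetical intercept on the log-ratio scale (not reported in the text).
INTERCEPT = -4.5


def predict(coefs, intercept, genotype_codes):
    """Predicted response = intercept + sum of coefficient * 0/1 genotype code."""
    return intercept + sum(
        beta * genotype_codes.get(snp, 0) for snp, beta in coefs.items()
    )


# Two hypothetical individuals: one carrying a minor allele at both SNPs, one at neither.
carrier = {"RS174547": 1, "RS174576": 1}
non_carrier = {"RS174547": 0, "RS174576": 0}

difference = predict(coefficients, INTERCEPT, carrier) - predict(
    coefficients, INTERCEPT, non_carrier
)
```

The carrier's predicted value is lower by 0.033, the sum of the two coefficients. With SNPs in very high LD, most individuals carry either both minor alleles or neither, so the two coefficients act largely as one combined effect.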

The model was then tested on the general population (n=1059), including the training
population (Table 7), and on the second largest ethnic group, the East Asian population
(Table 8):
Table 7 - r2 value for whole population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.252     0.226       0.089     0.249     0.064
Table 8 - r2 value for East Asian population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.221     0.229       0.053     0.223     0.050
Figure 4 - Predicted desaturation and elongation activity in Caucasian Testing
population compared to actual values. SPLS regression coefficients from training
Caucasian population were applied to a test set of Caucasians.

MODEL 2(b)
This model uses the same predictor and response data as model 2(a). The
cross validation heatmap for model 2(a) (Figure 3) also indicated that a high sparsity
solution could be applied with a small increase (0.001) in MSE. As the parameters for this
model (eta = 0.96, K=2) impose higher sparsity, it is expected that the number of selected
variables will be reduced to a smaller number of SNPs.
SPLS regression using these parameters reduced the number of significant variables to
3. Two SNPs, RS174547 and RS174576, were selected. Sex was also found to
contribute to desaturase activity.
Table 9 - Regression coefficients for selected variables and desaturation/
elongation activity (eta = 0.96, K=2). Negative coefficients indicate presence of at least
one minor allele results in decreased desaturation activity.
Variable     n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
RS174547     -0.059    0.063       -0.057    -0.053    -0.081
RS174576     -0.058    0.061       -0.055    -0.053    -0.080
sex          -0.078    0.149       -0.084    0         -0.064
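Table 10 - r2 value for Caucasian test population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.051     0.103       0.202     0.200     0.032
Table 11 - r2 value for East Asian population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.224     0.242       0.058     0.210     0.047
Table 12 - r2 value for whole population
n-6 D6D   n-6 Elong   n-6 D5D   n-6 ADA   n-3 ADA
0.229     0.238       0.073     0.231     0.058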

population, from SPLS regression using training Caucasian population.
MODEL 3 (a)
An SPLS regression analysis was done to identify associations between SNPs in
the FADS gene cluster and the ELOVL2 gene, and levels of the plasma FAs 18.2n6,
18.3n6, 20.3n6, 20.4n6, 18.3n3, and 20.5n3. Fatty acid levels were generally normally
distributed. The training population consisted of 338 randomly selected Caucasian
individuals. Genotype data included the 25 SNPs mapping the genes FADS1, FADS2,
and ELOVL2 which were in HWE in the Caucasian population. The confounding variables
age, BMI, and sex were included. The response matrix was composed of the 6
individual plasma fatty acid levels as measured by gas chromatography (GC). Cross
validation determined that the optimal parameters were eta = 0.92 and K = 2, with MSE =
2.6128 (Figure 6).

Figure 6 - CV Heatmap of Mean Square Prediction Errors for range of possible
SPLS parameters
SPLS regression was performed and 2 SNPs were selected as having significant
effects on the 6 FAs. Only one SNP (RS174547) had a significant effect on
18.2n6 (LA). None of the SNPs significantly explained variation in 20.3n6 (DGLA), and
the remaining FAs were affected by both SNPs (RS174547 and RS174576) by
approximately the same amount. Age and BMI had significant effects on the amount of
18.2n6 (LA).
Table 13 - Regression coefficients for selected variables and PUFA levels (eta = 0.92,
K=2). Negative coefficients indicate presence of at least one minor allele results in
a decreased level of that fatty acid.
Variable    18.2n6 (LA)   18.3n6 (GLA)   20.3n6 (DGLA)   20.4n6 (AA)   18.3n3 (ALA)   20.5n3 (EPA)
RS174547    0.219         -0.017         0               -0.270        0.018          -0.040
RS174576    0             -0.017         0               -0.268        0.018          -0.042
age         0.478         0              0               0             0              0
bmi         -0.774        0              0               0             0              0

population compared to actual fatty acid levels, from SPLS regression using
training Caucasian population.
When the model was tested on the test set of Caucasians (n=113), the best correlated
prediction was for fa20.4n6 (AA).
Table 14 - r2 value for predicted response in Caucasian testing population
18.2n6 (LA) 18.3n6 (GLA) 20.3n6 (DGLA) 20.4n6 (AA) 18.3n3 (ALA) 20.5n3 (EPA)
0.08679 0.05626 0.01965 0.21063 0.019205 0.00013
When this model was tested on all individuals (n=1059), the predicted fa20.4n6 (AA)
values once again were the best correlated to the actual values.
Table 15 - r2 value for whole population
18.2n6 (LA) 18.3n6 (GLA) 20.3n6 (DGLA) 20.4n6 (AA) 18.3n3 (ALA) 20.5n3 (EPA)
0.04656 0.15318 0.01167 0.24159 0.01100 0.01226

Additionally, this model was tested on the East Asian population (n=384). The prediction
of GLA levels for the East Asian population was better than in the general population. The
best predicted FA in the East Asian population was AA.
Table 16 - r2 value for East Asian population
18.2n6 (LA) 18.3n6 (GLA) 20.3n6 (DGLA) 20.4n6 (AA) 18.3n3 (ALA) 20.5n3 (EPA)
0.03744 0.21149 0.01749 0.21721 0.00822 0.00312
MODEL 3 (b)
To explore the effects of dimension reduction, the regression algorithm was run
again with the eta parameter set to 0. This model therefore does not impose sparsity, and
each variable is assigned a coefficient. While each SNP is assigned a
coefficient, not all are significant. The following table shows the association of the 14
SNPs with each FA.
Table 17 - Regression coefficients for variables and PUFA levels (eta = 0, K=2).
Negative coefficients indicate presence of at least one minor allele results in a decreased
level of that fatty acid.
Variable     18.2n6 (LA)   18.3n6 (GLA)   20.3n6 (DGLA)   20.4n6 (AA)   18.3n3 (ALA)   20.5n3 (EPA)
RS12195587   -0.3885458    0              0.036           0             0              0
RS174547     0             -0.005         0               -0.086        0.006          -0.013
RS174570     0             -0.004         0               -0.061        0              0
RS174576     0             -0.005         0               -0.083        0.006          -0.014
RS174579     0             0              0               -0.051        0              -0.010
RS174593     0             -0.004         0               -0.063        0.005          0
RS174602     0             0              0               -0.031        0              0
RS174626     0             0              0               -0.050        0.004          0
RS174627     0             0              0               -0.049        0              0
RS2072114    0             -0.004         0               -0.057        0              0
RS968567     0             0              0               -0.038        0              -0.011
RS174611     0             -0.004         0               -0.066        0.005          0
RS2845573    0             -0.004         0               -0.053        0              0
RS2851682    0             -0.003         0               -0.046        0              0
BMI          -0.7001214    0              0.062           0             0              0
These coefficients predicted the fatty acid levels approximately as well as when the
dimension reduction step is included.
Table 18 - Correlation with actual values of Caucasian test population
              18.2n6 (LA)   18.3n6 (GLA)   20.3n6 (DGLA)   20.4n6 (AA)   18.3n3 (ALA)   20.5n3 (EPA)
Sparsity      0.08679       0.05626        0.01965         0.21063       0.019205       0.00013
No Sparsity   0.02742       0.06301        0.03332         0.23542       0.01023        0.00023
DISCUSSION
Identification of genomic loci affecting PUFA metabolism lends itself to functional
analysis of SNPs with potential importance in the pathology of metabolic diseases and
to understanding how those diseases could be prevented. Previous studies tested for
associations with genetic variants using linear regression. When considering genomic
data, a number of issues arise that are not properly dealt with in linear regression
models. SNPs within the same gene can be found in high linkage disequilibrium (LD), thus
resulting in the genomic data being highly collinear. Using linear regression to model
highly collinear relationships can result in unstable coefficients (Wold, 1984) that obscure
the individual effects of each SNP; predicted responses are therefore often poor. For the
type of data presented in this study, it is assumed that only a few SNPs are causing the
observed effects; this is otherwise known as the sparsity principle. This type of SNP
selection problem is effectively handled by Sparse Partial Least Squares (SPLS)
regression, created as an adaptation of PLS. SPLS performs simultaneous dimension
reduction and variable selection to identify the most relevant variables, and calculates
coefficients to predict the response.
In high-throughput biological research, accurate variable selection techniques are
critical in determining relevant information. In studies focusing on single nucleotide
polymorphisms, statistical complexities arise, including high collinearity between the
variables and sparsity of relevant SNPs. The present study examined 26 SNPs
mapping 3 genes. All SNPs were tested for HWE in the whole population, as well as
in the Caucasian population and the Asian population. The population with the most
SNPs in HWE was the Caucasian population. This was expected, as the SNPs for
genotyping were selected based on their presence within a Caucasian population using
HapMap.
Predictive ability of SPLS was tested by comparison to linear regression. Merino
et al. performed linear regression on SNPs within FADS1 and FADS2. Ratios of fatty acids
approximating D6-Desaturase, D5-Desaturase, n-6 ADA, and n-3 ADA were the
response variables. 15 SNPs were tested in the Caucasian population (n=78), and 9
were found to be significant, with the strongest association between RS174547
(FADS1) and n-6 ADA (p=3.99×10^-8). All other SNPs were no longer significantly
associated when RS174547 was considered as a covariate. The same analysis was
repeated using SPLS. In the Caucasian population, 3 SNPs were selected as relevant:
RS174547 (FADS1), RS174576 (FADS2), and RS968567 (FADS2). The advantage of SPLS in this
context is that, through the selection of optimal parameters, cross validation will provide
a solution with the appropriate sparsity. Linear regression selected more variables than
SPLS, which suggests that many of these variables were included because of their high
collinearity with the causal loci.

In the Asian population of the Merino et al. analysis (n=69), of the 15 SNPs studied, 8
were found to be significantly associated with altered fatty acid levels. The strong
association between RS174547 (FADS1) and n-6 ADA was also identified in Asians, as well
as a strong association between RS498793 (FADS2) and n-3 ADA. All other associations
became non-significant when RS174547 was added as a covariate. The same analysis was
repeated using SPLS. In the East Asian population, 3 SNPs were selected as relevant:
RS174547 (FADS1), RS174576 (FADS2), and RS174611 (FADS2).
RS174547 and RS174576 were selected as significant in all models and,
moreover, at different parameters within these models. This was the case not only in the
Caucasian population, but also in the East Asian population. This is evidence that genetic
variation at one or both of these loci, or at a closely linked non-genotyped locus, may be
directly responsible for variation in plasma PUFAs by affecting desaturation enzyme
activity. RS174547 and RS174576 have been identified as significantly associated with
altered fatty acids in several other studies (Rzehak, J., 2009; Schaeffer, L., 2006;
Martinelli, N., 2008; Boker, S., 2010); however, these studies identified multiple SNPs, all
in high LD with RS174547 and RS174576.
RS174547 (FADS1)
Presence of at least one minor allele (T) at this locus was associated with
decreases in D6D, D5D, n-6 ADA, and n-3 ADA. The T allele was associated with
increased precursor levels and decreased product levels. The T allele was also associated
with increased elongation activity. In model 1, this SNP was associated with D6D activity
and not D5D activity. As this SNP is within the gene for D6 desaturase, this is likely
close to its actual effect. Associations with D5D activity in other models are likely due to
the very high collinearity between this SNP and RS174576. RS174547 is in high LD
with RS174576 (r2 = 0.97).
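The LD measure quoted here is distinct from the r2 values used to score predictions. The report does not say how the value of 0.97 was obtained; the Python sketch below shows the standard definition of LD r2 from haplotype and allele frequencies. The function name and all frequencies are hypothetical.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """LD r2 = D^2 / (pA (1 - pA) pB (1 - pB)), where D = pAB - pA * pB.

    p_ab is the frequency of the haplotype carrying allele A at the first
    locus and allele B at the second; p_a and p_b are the allele frequencies.
    """
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Hypothetical frequencies: the two alleles always travel together.
complete_ld = ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical frequencies: the haplotype occurs exactly as often as chance predicts.
no_ld = ld_r_squared(p_ab=0.09, p_a=0.3, p_b=0.3)
```

An r2 near 1 means that the genotype at one SNP almost fully determines the genotype at the other, so a regression model has very little information with which to separate their effects.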

RS174576 (FADS2)
Presence of at least one minor allele (A) is associated with decreased D5D
activity, D6D activity, n-3 ADA, and n-6 ADA. Minor allele presence was associated with
decreased GLA, AA, and EPA, increased ALA, and increased n-6 elongation activity. It
is unclear whether this SNP has actual causal effects or whether its associations are due
to high collinearity with the causal SNPs.
RS968567 (FADS2)
Presence of at least one minor (A) allele at this locus was significantly associated
with decreased D6D activity, D5D activity, n-3 and n-6 ADA, and with
decreased levels of arachidonic acid (fa20.4n6). This SNP is found within the
promoter region of FADS2, in a predicted binding site for transcription factors such as
SREBP1 and PPARa (Lattka, E. 2010). Functional analysis by Lattka, E. et al. using
luciferase reporter gene assays showed that two transcription factors, including ELK1, bind
to the promoter region in a manner specific to the RS968567 allele. Promoter activity
increased with the minor T allele. It would be expected that the presence of a minor
allele here would be associated with increased D5D activity; however, this was not
observed (Lattka, E. 2010).
ELONGATION ACTIVITY
Both RS174576 and RS174547 showed positive associations with elongation activity;
however, this is most likely an artifact of highly correlated enzyme activities creating the
precursor and depleting the product of the elongation step. No ELOVL2 SNPs were
selected in any SPLS model. This suggests that genetic variation within this gene is
not resulting in any altered metabolite levels within this pathway. As the elongation step
is not the rate-limiting step within this pathway, changes in enzyme activity are unlikely to
significantly impact the lipid measurements.

CROSS VALIDATION
For each model, optimal parameters for SPLS regression were selected using
10-fold cross validation. Each time a CV is performed it may give different results, as the
subgroups are randomly selected. A CV heat map can be plotted to visualize the mean
squared prediction error, and can be a useful statistical diagnostic tool displaying which
parameters will provide the ideal sparsity, and whether multiple optimal parameters exist.
The optimal parameters selected correspond to the global minimum of prediction error;
however, local minima may exist. Figure 3 in model 2 demonstrates this idea, since the
global minimum for mean square error is 0.157 and corresponds to eta = 0.64 and K=3,
but a local minimum of 0.158 exists when eta = 0.96 and K=2. This shows that a higher
sparsity can be imposed with only a very small increase in mean squared prediction
error. Using these new parameters reduced the number of selected SNPs from 10
down to 2.
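One way to make this choice explicit is to pick the sparsest parameter pair whose cross-validated error lies within a stated tolerance of the minimum. The Python sketch below is a hypothetical illustration of that rule, not a feature of the R package. The first two grid entries use the Model 2 values reported in the Results (eta = 0.64, K = 3, MSE 0.157; eta = 0.96, K = 2, MSE 0.158); the third entry, the tolerance, and the function name are invented for the example.

```python
def sparsest_within_tolerance(cv_results, tolerance):
    """Among settings within `tolerance` of the lowest MSE, prefer high eta, then low K."""
    best_mse = min(result["mse"] for result in cv_results)
    candidates = [r for r in cv_results if r["mse"] - best_mse <= tolerance]
    return max(candidates, key=lambda r: (r["eta"], -r["K"]))


cv_results = [
    {"eta": 0.64, "K": 3, "mse": 0.157},  # reported optimum for Model 2(a)
    {"eta": 0.96, "K": 2, "mse": 0.158},  # reported sparser alternative, Model 2(b)
    {"eta": 0.80, "K": 1, "mse": 0.170},  # hypothetical extra grid point
]

strict_choice = sparsest_within_tolerance(cv_results, tolerance=0.0)
relaxed_choice = sparsest_within_tolerance(cv_results, tolerance=0.002)
```

With a tolerance of zero the rule returns the global minimum; allowing a tolerance of 0.002 switches the choice to eta = 0.96 and K = 2, the high-sparsity solution.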
Outcomes of SPLS can be modified by manually selecting the parameters used.
Since eta, the thresholding parameter, determines the sparsity of the final solution, the
variable selection feature of the SPLS function can even be omitted by setting eta=0, in
which case every SNP receives a coefficient. This was done with the fatty acid response
matrix to elucidate the multicollinearity of many SNPs. When coefficients are assigned to
each variable without variable selection, the predicted fatty acid levels are largely the
same as when the sparsity principle is applied. This demonstrates that many of the SNPs
selected under low sparsity conditions are not contributing significantly to the fatty acid
responses.
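The role of eta can be illustrated with a soft-thresholding rule applied to a vector of
variable weights. This is a minimal Python sketch under the assumption that each weight
is shrunk by eta times the largest absolute weight; it is not the project's code, and the
weights shown are hypothetical values, not results from this study.

```python
def soft_threshold(direction, eta):
    """Shrink each weight toward zero by eta * max|weight|.

    Weights whose magnitude does not exceed the cutoff become exactly zero,
    which is what removes a variable from the model.
    """
    cutoff = eta * max(abs(w) for w in direction)
    shrunk = []
    for w in direction:
        if abs(w) > cutoff:
            sign = 1 if w > 0 else -1
            shrunk.append(sign * (abs(w) - cutoff))
        else:
            shrunk.append(0.0)
    return shrunk


# Hypothetical weights for five SNPs (illustrative only, not project data).
weights = [0.90, 0.85, 0.20, -0.10, 0.05]

dense = soft_threshold(weights, eta=0.0)   # eta = 0: every SNP keeps its weight
sparse = soft_threshold(weights, eta=0.9)  # high eta: only the largest survive
n_selected = sum(1 for w in sparse if w != 0.0)
print(dense)
print(sparse, n_selected)
```

With eta = 0 the cutoff is zero and no variable is dropped, while a value of eta close to 1
keeps only the variables whose weights are nearly as large as the largest one.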
An important caveat of this approach is that the tag SNPs which are used to
map the genes may not include the causal variant. The tag SNPs which are selected
have the highest likelihood, among those which were genotyped, of being causal, yet may
only be indicative of the region where the true causal SNP exists. Often more than one
SNP is selected as significant, and this method alone does not distinguish whether they
are both causal, whether one is causal and the other is in very high LD with it, or whether
the two SNPs are in equal LD with the causal variant. Molecular biology approaches should
be used to further investigate functional changes associated with each selected polymorphism.
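How strongly two selected SNPs track each other can be checked with the squared
correlation of their genotype codes, a common proxy for LD. The Python below is a
sketch only: the genotype vectors are hypothetical (coded as 0, 1 or 2 copies of the minor
allele) and the function name is an assumption for the example, not part of this analysis.

```python
def genotype_r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("genotype vectors must be non-empty and equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a monomorphic SNP has no defined correlation")
    return (cov * cov) / (var_x * var_y)


# Hypothetical genotypes for eight individuals (not project data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 1, 1]  # differs from snp_a in one individual

r2 = genotype_r_squared(snp_a, snp_b)
print(round(r2, 3))
```

A value near 1 means the two SNPs carry almost the same information, so a regression
model cannot separate their effects and either one could be tagging the causal variant.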

CONCLUSION
SPLS regression is an effective statistical approach for determining the most
relevant SNPs affecting a multivariate response. SPLS handles multicollinearity by
implementing an L1 penalty, and can impose a reasonable amount of sparsity on the
solution by selecting optimal parameters through cross validation. Selection of optimal
parameters should not be left solely to the included cross validation function, as many
parameter sets will give useful results with negligible increases in MSE. The SNPs
RS174547 and RS174576 were consistently selected as relevant variables, and
therefore one or both of them are likely either the causal SNPs or very closely linked
to the causal loci.

Appendix i - Confidence Intervals for desaturation / elongation activity
Training Caucasian population (eta=0.96, K=2)

Confidence intervals of effect of selected SNPs on D6D activity
             2.5%     97.5%
RS174547   -0.082    -0.036
RS174576   -0.082    -0.034
sex        -0.125    -0.032

Confidence intervals of effect of selected SNPs on elongation
             2.5%     97.5%
RS174547    0.042     0.085
RS174576    0.039     0.081
sex         0.114     0.186

Confidence intervals of effect of selected SNPs on D5D
             2.5%     97.5%
RS174547   -0.074    -0.040
RS174576   -0.073    -0.038
sex        -0.117    -0.052

Confidence intervals of effect of selected SNPs on n-6 ada
             2.5%     97.5%
RS174547   -0.067    -0.040
RS174576   -0.066    -0.040
sex        -0.040     0.014

Confidence intervals of effect of selected SNPs on n-3 ada
             2.5%     97.5%
RS174547   -0.111    -0.053
RS174576   -0.110    -0.053
sex        -0.116    -0.007

Appendix ii - Confidence intervals for effect of SNPs on PUFA levels.

Confidence intervals of effect of selected SNPs on Linoleic acid (18:2 n-6)
             2.5%     97.5%
RS174547    0.018     0.427
RS174576   -0.029     0.383
sex         0.062     0.915
bmi        -1.216    -0.302

Confidence intervals of effect of selected SNPs on γ-Linolenic acid (18:3 n-6)
             2.5%     97.5%
RS174547   -0.025    -0.009
RS174576   -0.025    -0.009
sex        -0.010     0.009
bmi        -0.006     0.018

Confidence intervals of effect of selected SNPs on Dihomo γ-Linolenic acid (20:3 n-6)
             2.5%     97.5%
RS174547   -0.007     0.033
RS174576   -0.004     0.037
sex        -0.051     0.004
bmi        -0.005     0.081

Confidence intervals of effect of selected SNPs on Arachidonic acid (20:4 n-6)
             2.5%     97.5%
RS174547   -0.332    -0.214
RS174576   -0.328    -0.210
sex        -0.103     0.102
bmi        -0.004     0.236

Confidence intervals of effect of selected SNPs on Alpha-linolenic acid (18:3 n-3)
             2.5%     97.5%
RS174547    0.006     0.035
RS174576    0.005     0.035
sex        -0.012     0.024
bmi        -0.037     0.012

Confidence intervals of effect of selected SNPs on Eicosapentaenoic acid (20:5 n-3)
             2.5%     97.5%
RS174547   -0.064    -0.019
RS174576   -0.066    -0.019
sex        -0.007     0.065
bmi        -0.043     0.023

Appendix iii - Genomic location of SNPs in mapping the FADS gene cluster (from NCBI)

Literature Cited
Robert S. Goodhart and Maurice E. Shils (1980). Modern Nutrition in Health and
Disease (6th ed.). Philadelphia: Lea and Febiger. pp. 134–138. ISBN 0-8121-0645-8.
Risé P, Marangoni F, Galli C. Regulation of PUFA metabolism: pharmacological and
toxicological aspects. Prostaglandins Leukot Essent Fatty Acids. 2002 Aug-Sep;
67(2-3):85-9.
Artemis P Simopoulos, Essential fatty acids in health and chronic disease. Am J Clin
Nutr 1999;70(suppl):560S–9S.
Calder, P.C., n−3 Polyunsaturated fatty acids, inflammation, and inflammatory diseases.
2006, American Society for Clinical Nutrition.
Caughey GE, Mantzioris E, Gibson RA, Cleland LG, James MJ. The effect on human
tumor necrosis factor α and interleukin 1β production of diets enriched in n−3 fatty acids
from vegetable oil or fish oil. Am J Clin Nutr 1996;63:116–22.
Goldman DW, Pickett WC, Goetzl EJ. Human neutrophil chemotactic and degranulating
activities of leukotriene B5 (LTB5) derived from eicosapentaenoic acid. Biochem Biophys
Res Commun 1983;117:282–8.
Serhan CN, Clish CB, Brannon J, Colgan SP, Gronert K, Chiang N. Anti-inflammatory
lipid signals generated from dietary n−3 fatty acids via cyclooxygenase-2 and
transcellular processing: a novel mechanism for NSAID and n−3 PUFA therapeutic
actions. J Physiol Pharmacol 2000;4:643–54.
Clarke SD, Jump DB. Dietary polyunsaturated fatty acid regulation of gene transcription.
Annu Rev Nutr 1994;14:83–98.
Raatz, K., Bibus, D., Thomas, W., Kris-Etherton, P. Total Fat Intake Modifies Plasma
Fatty Acid Composition in Humans
N. Martinelli, D. Girelli, G. Malerba, P. Guarini, T. Illig, E. Trabetti, M. Sandri, S. Friso, F. Pizzolo,
L. Schaeffer, J. Heinrich, P.F. Pignatti, R. Corrocher, O. Olivieri, FADS genotypes and
desaturase activity estimated by the ratio of arachidonic acid to linoleic acid are
associated with inflammation and coronary artery disease, Am. J. Clin. Nutr. 88 (2008)
941–949.
H. Truong, J.R. DiBello, E. Ruiz-Narvaez, P. Kraft, H. Campos, A. Baylin, Does genetic
variation in the Δ6-desaturase promoter modify the association between α-linolenic
acid and the prevalence of metabolic syndrome? Am. J. Clin. Nutr. 89 (2009)
920–925.
A. Baylin, E. Ruiz-Narvaez, P. Kraft, H. Campos, α-Linolenic acid, Δ6-desaturase
gene polymorphism, and the risk of nonfatal myocardial infarction, Am. J.
Clin. Nutr. 85 (2007) 554–560.
Y. Lu, E.J. Feskens, M.E. Dolle, S. Imholz, W.M. Verschuren, M. Muller, J.M. Boer,
Dietary n−3 and n−6 polyunsaturated fatty acid intake interacts with FADS1 genetic
variation to affect total and HDL-cholesterol concentrations in the Doetinchem Cohort
Study, Am. J. Clin. Nutr. 92 (2010) 258–265.
R Core Team (2013). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria.
URL http://www.R-project.org/
Lattka, E., A common FADS2 promoter polymorphism increases promoter activity and
facilitates binding of transcription factor ELK1, J. Lipid Res. 51 (2010) 182–191.
Wold, S., The collinearity problem in linear regression. The PLS approach to
generalized inverses, SIAM J. Sci. Stat. Comput. Vol. 5, No. 3, September 1984.
P. Rzehak, J. Heinrich, N. Klopp, L. Schaeffer, S. Hoff, G. Wolfram, T. Illig, J. Linseisen, Evidence
for an association between genetic variants of the fatty acid desaturase 1 fatty acid
desaturase 2 (FADS1 FADS2) gene cluster and the fatty acid composition of erythrocyte
membranes, Br. J. Nutr. 101 (2009) 20–26.
L. Schaeffer, H. Gohlke, M. Muller, I. Heid, L. Palmer, I. Kompauer, H. Demmelmair, T. Illig, B.
Koletzko, J. Heinrich, Common genetic variants of the FADS1 FADS2 gene cluster and
their reconstructed haplotypes are associated with the fatty acid composition in
phospholipids, Hum. Mol. Genet. 15 (2006) 1745–1756.
S. Bokor, J. Dumont, A. Spinneker, M. Gonzalez-Gross, E. Nova, K. Widhalm, G.
Moschonis, P. Stehle, P. Amouyel, S. De Henauw, D. Molnar, L.A. Moreno, A.
Meirhaeghe, J. Dallongeville, Single nucleotide polymorphisms in the FADS gene
cluster are associated with delta-5 and delta-6 desaturase activities estimated by serum
fatty acid ratios, J. Lipid Res. 51 (2010) 2325–2333.