Background Overall Study Design Proposed Journal Articles Conclusion and Future Work Acknowledgement References
Data-adaptive SNP-set-based Association Tests
for Longitudinal Traits
Doctoral Dissertation Defense
Yang Yang
B.S, Biological Science
M.S, Bioinformatics
Ph.D. Candidate, Biostatistics
Department of Biostatistics, School of Public Health, The University
of Texas Health Science Center at Houston
November 30th 2015
Yang Yang (UTSPH) Doctoral Dissertation Defense Nov 30 2015 1 / 84
Dissertation Committee:
Dissertation Chair and Academic Advisor, Peng Wei, PhD
Minor Advisor, Alanna C. Morrison, PhD
Breadth Advisor, Yun-Xin Fu, PhD
External Advisor, Han Liang, PhD
External Reviewer, Xiaoming Liu, PhD
Table of Contents
1 Background
Introduction to Genome-wide association study (GWAS)
SNP-set based association tests
Longitudinal data analysis strategy in GWAS
2 Overall Study Design
Simulation studies
Real data application
3 Proposed Journal Articles
Journal Article 1: Data-adaptive SNP-set based association tests ...
Methods
Data Simulation Methods
Simulation Results
Application to the ARIC Study
Discussion
Journal Article 2: Pathway-based data-adaptive association tests ...
Methods
Data Simulation Methods
Simulation Results
Application to the ARIC Study
Discussion
Journal Article 3: LaSPU: a suite of powerful data-adaptive SNP-set and ...
4 Conclusion and Future Work
5 Acknowledgement
6 References
Introduction to GWAS
A work flow of GWAS
Introduction to GWAS
GWAS contributes to precision medicine
Figure: GWAS contributes to precision medicine
SNP-set based association tests
By pooling multiple low MAF SNPs together, the SNP-set based association test can
detect the signal(s) from a region (such as a gene), rather than from a single SNP.
Longitudinal data analysis in GWAS
Figure: Trajectories of phenotype left hippocampus volume over time (in months) in three allele
groups of SNP rs2075650 [XSP+14]
Longitudinal data analysis in GWAS
A recent study by Xu et al. [XSP+14] demonstrated the power gain of
longitudinal data analysis over the traditional cross-sectional data analysis
used in GWAS.
Figure: Comparison of the Manhattan plots for genome-wide p-values for phenotype left
hippocampus volume from longitudinal analysis (left) and from cross-sectional analysis (right)
[XSP+14]
Simulation Studies
General Purpose
Evaluate the proposed methods’ performance in
maintaining the nominal Type I error rate
achieving high power under different simulation scenarios
Application to ARIC ExomeChip Data
Figure: HDL-C (y-axis, 0 to 150) by visit (x-axis, visits 1 to 4), shown separately for carriers and non-carriers. Panels: LPL's effect on HDL-C; LIPC's effect on HDL-C.
Phenotype: HDL-C levels measured at four visits in each of n=11,478 ARIC EA subjects
Genotype: rare/low frequency variants (MAF < 5%) on the ExomeChip
Statistical methods: LaSPU(path) proposed here and its extensions
Baseline (aSPU) vs all four measures (LaSPU)
Gene-based and biological pathway-based association analysis
Journal Article 1
Title of Journal Article
Data-adaptive SNP-set based association tests for longitudinal phenotypes within
the Generalized Estimating Equations framework.
Methods I
Journal Article 1
Suppose that for each subject $i = 1, \ldots, n$ we have $k$ longitudinal measurements
\[ y_i = (y_{i1}, y_{i2}, \ldots, y_{ik}) \]
with $y_{im}$ as an element, $p$ SNPs of interest as a row vector
\[ x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \]
with $x_{ij}$ coded as 0, 1 or 2 for the count of the minor allele, and
\[ z_i = (z_{i1}, z_{i2}, \ldots, z_{iq}) \]
as a row vector for $q$ covariates, e.g., time, gender and leading principal components for
population substructure.
Thus, we have:
\[
X_i = \begin{pmatrix} x_i \\ x_i \\ \vdots \\ x_i \end{pmatrix}, \qquad
Z_i = \begin{pmatrix} 1 & z_i \\ 1 & z_i \\ \vdots & \vdots \\ 1 & z_i \end{pmatrix}
\]
Methods II
Journal Article 1
Xi is a k × p matrix, and Zi is a k × (q + 1) matrix.
Denote the regression coefficients by $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ and $\phi = (\phi_1, \phi_2, \ldots, \phi_{q+1})$ for $X_i$ and
$Z_i$ respectively. The GLM equation is then
\[ g(\mu_i) = \eta_i = Z_i \phi + X_i \beta = H_i \theta, \]
where $H_i = (Z_i, X_i)$ and $\theta$ stacks $\phi$ and $\beta$.
In the analysis of the ARIC data,
Zi = time + gender + BMI(baseline) + age(baseline) + age(baseline)^2 + PC1 + PC2
Xi = SNP1 + SNP2 + . . . + SNPp
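As a rough illustration of the stacking described above, the following Python sketch builds X_i, Z_i and H_i = (Z_i, X_i) for one subject. The function name build_design and the toy values are hypothetical. The sketch repeats the same covariate row at every visit, as written on the slide; a time-varying covariate such as visit time would need per-visit values instead.

```python
import numpy as np

def build_design(x_i, z_i, k):
    """Stack one subject's SNP row vector x_i (length p) and covariate row
    vector z_i (length q) into X_i (k x p), Z_i (k x (q+1)) and H_i = (Z_i, X_i)."""
    x_i = np.asarray(x_i, dtype=float)
    z_i = np.asarray(z_i, dtype=float)
    X_i = np.tile(x_i, (k, 1))                                  # same genotypes at every visit
    Z_i = np.hstack([np.ones((k, 1)), np.tile(z_i, (k, 1))])    # leading column of 1s = intercept
    H_i = np.hstack([Z_i, X_i])
    return X_i, Z_i, H_i

# Toy subject: p = 3 SNPs, q = 2 covariates, k = 4 visits (values are made up).
X_i, Z_i, H_i = build_design(x_i=[0, 1, 2], z_i=[1.0, 27.5], k=4)
print(X_i.shape, Z_i.shape, H_i.shape)
```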
Methods III
Journal Article 1
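Consistent and asymptotically normal estimates of $\beta$ and $\phi$ can be obtained by solving the
GEE [LZ86]:
\[
U(\phi, \beta) = \sum_{i=1}^{n} U_i(\phi, \beta)
= \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \theta} \right)^{\top} V_i^{-1} (Y_i - \mu_i) = 0,
\]
with
\[
\frac{\partial \mu_i}{\partial \theta} = \frac{\partial g^{-1}(H_i \theta)}{\partial \theta}, \qquad
V_i = \varphi A_i^{1/2} R_w A_i^{1/2},
\]
and
\[
A_i = \begin{pmatrix}
v(\mu_{i1}) & 0 & \cdots & 0 \\
0 & v(\mu_{i2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & v(\mu_{ik})
\end{pmatrix}
\]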
Methods IV
Journal Article 1
Quantitative traits
We use the identity link, i.e. $g(\mu_{im}) = \mu_{im}$ and $v(\mu_{im}) = \varphi \times 1 = \varphi$. Then we have:
\[ U = \sum_i (Z_i, X_i)^{\top} R_w^{-1} (Y_i - \mu_i) \]
\[ \Sigma = \sum_i (Z_i, X_i)^{\top} R_w^{-1} (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^{\top} R_w^{-1} (Z_i, X_i) \tag{1} \]
If the assumption of a common covariance matrix across the $Y_i$ is valid, e.g. for quantitative
continuous traits [Pan01], we can adopt a more efficient covariance estimator:
\[
\Sigma = \sum_{i=1}^{n} (Z_i, X_i)^{\top} \, \mathrm{var}(Y_i) \, (Z_i, X_i)
= \sum_{i=1}^{n} (Z_i, X_i)^{\top} \left[ \frac{1}{n} \sum_{l=1}^{n} (Y_l - \hat\mu_l)(Y_l - \hat\mu_l)^{\top} \right] (Z_i, X_i)
= \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \tag{2}
\]
which is used by default for its better finite-sample performance [Pan01].
Methods V
Journal Article 1
Testing null hypothesis
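\[ H_0 : \beta = (\beta_1, \beta_2, \ldots, \beta_p) = 0, \]
under which we fit the null model $g(\mu_i) = Z_i \phi$ to obtain $\hat\phi$ and predict $\hat\mu = g^{-1}(Z \hat\phi)$. The score vector
under a working independence model is then
\[ U(\hat\phi, 0) = (U_{.1}, U_{.2}) = \sum_{i=1}^{n} (U_{i1}, U_{i2}), \]
where
\[ U_{.1} = \sum_i Z_i^{\top} (Y_i - \hat\mu_i), \qquad U_{.2} = \sum_i X_i^{\top} (Y_i - \hat\mu_i). \]
Because $U$ asymptotically follows a multivariate normal distribution under $H_0$, the score vector
for $\beta$ also has an asymptotic normal distribution:
\[ U_{.2} \sim N(0, \Sigma_{.2}), \qquad \Sigma_{.2} = \mathrm{Cov}(U_{.2}) = V_{22} - V_{21} V_{11}^{-1} V_{12}, \]
where the blocks $V_{11}, V_{12}, V_{21}, V_{22}$ are defined in Equation 2.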
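A minimal numerical sketch of the score vector U.2 and its covariance Σ.2 for a quantitative trait follows. It assumes the identity link, a working independence model (so the null fit reduces to ordinary least squares of Y on Z) and the common-covariance estimator of Equation 2. The function name score_and_cov and the array layout are assumptions made for illustration, not the dissertation's software.

```python
import numpy as np

def score_and_cov(Y, Z, X):
    """Y: (n, k) traits; Z: (n, k, q+1) covariates incl. intercept; X: (n, k, p) SNPs.
    Returns U.2 (length p) and Sigma.2 (p x p) under H0: beta = 0."""
    n, k = Y.shape
    q1 = Z.shape[2]
    # Null model, identity link, working independence: OLS of Y on Z.
    phi_hat, *_ = np.linalg.lstsq(Z.reshape(n * k, q1), Y.reshape(n * k), rcond=None)
    resid = Y - Z @ phi_hat                        # (n, k): Y_i - mu_hat_i
    H = np.concatenate([Z, X], axis=2)             # H_i = (Z_i, X_i)
    U = np.einsum("ikj,ik->j", H, resid)           # sum_i H_i' (Y_i - mu_hat_i)
    var_Y = resid.T @ resid / n                    # common k x k covariance (Eq. 2)
    V = np.einsum("ika,kl,ilb->ab", H, var_Y, H)   # sum_i H_i' var(Y) H_i
    V11, V12 = V[:q1, :q1], V[:q1, q1:]
    V21, V22 = V[q1:, :q1], V[q1:, q1:]
    Sigma2 = V22 - V21 @ np.linalg.solve(V11, V12) # V22 - V21 V11^{-1} V12
    return U[q1:], Sigma2
```

The first q + 1 entries of U (the U.1 block) are zero up to rounding after the least-squares fit, which is why only the SNP block is returned.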
Methods VI
Journal Article 1
A general score-vector-based statistic can be written as:
\[ T_w = w^{\top} U = \sum_{j=1}^{p} w_j U_j, \]
where $w = (w_1, \ldots, w_p)^{\top}$ is a vector of weights for the $p$ SNVs [LT11],
with special cases:
\[ T_{Sum} = 1^{\top} U = \sum_{j=1}^{p} U_j, \qquad T_{SSU} = U^{\top} U = \sum_{j=1}^{p} U_j^2. \]
These two tests are called the Sum test and the SSU test [Pan09].
Methods VII
Journal Article 1
If we choose the weights to be
\[ w_j = U_{.2,j}^{\gamma - 1} \]
for a series of integer values $\gamma = 1, 2, \ldots, \infty$, we obtain the sum of powered score (U) tests,
called SPU tests:
\[ T_{SPU(\gamma)} = \sum_{j=1}^{p} U_{.2,j}^{\gamma - 1} U_{.2,j}. \]
In the extreme case $\gamma \to \infty$, where $\infty$ is assumed to be an even number, we have
\[
T_{SPU(\gamma)} \propto \|U\|_{\gamma} = \left( \sum_{j=1}^{p} |U_{.2,j}|^{\gamma} \right)^{1/\gamma}
\to \|U\|_{\infty} = \max_{j=1}^{p} |U_{.2,j}| \equiv T_{SPU(\infty)}.
\]
In our experience, an SPU($\gamma$) test with a large $\gamma > 8$ usually gives results similar to those of the
SPU($\infty$) test [PKZ+14], so we used $\gamma \in \Gamma = \{1, 2, \ldots, 8, \infty\}$ throughout the dissertation work.
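The SPU statistics above can be sketched in a few lines of Python. The names GAMMAS and spu_stat and the toy score vector are hypothetical; the sketch only evaluates the formulas on this slide, where γ = 1 gives the Sum statistic and γ = 2 gives the SSU statistic.

```python
import numpy as np

# Gamma set used on the slide: {1, 2, ..., 8, infinity}.
GAMMAS = (1, 2, 3, 4, 5, 6, 7, 8, np.inf)

def spu_stat(U, gamma):
    """SPU(gamma) statistic: sum_j U_j^gamma, or max_j |U_j| for gamma = infinity."""
    U = np.asarray(U, dtype=float)
    if np.isinf(gamma):
        return np.max(np.abs(U))
    return np.sum(U ** gamma)

U = np.array([1.0, -2.0, 0.5])      # made-up score vector
print(spu_stat(U, 1))               # Sum statistic: -0.5
print(spu_stat(U, 2))               # SSU statistic: 5.25
print(spu_stat(U, np.inf))          # SPU(inf): 2.0
```

Odd γ keeps the sign of each score, so effects in opposite directions can cancel; even γ and γ = ∞ do not.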
Methods VIII
Journal Article 1
P-value calculation of the SPU test
Let $T$ denote $T_{SPU(\gamma)}$ for a specific $\gamma$.
Because the null distribution of SPU with $\gamma$ larger than 2 is difficult to derive, as is that of the aSPU test
introduced later, we resort to a simulation- or permutation-based method to obtain the null
distribution.
With the empirical distribution of $U_{.2}$ constructed, we can obtain the statistics under the null
hypothesis, $T^{(b)} = \sum_{j=1}^{p} \left( U_{.2,j}^{(b)} \right)^{\gamma}$, and calculate the p-value of $T_{SPU(\gamma)}$ as
\[ P_{SPU(\gamma)} = \frac{\sum_{b=1}^{B} I\left( T^{(b)} \ge T^{obs} \right) + 1}{B + 1}. \]
Methods IX
Journal Article 1
Empirical Null Distribution of the Score Vector
We can obtain the empirical distribution of the score vector U.2 under the null hypothesis by
either the simulation or permutation procedure.
With the simulation procedure, we draw $B$ samples of $U_{.2}$ from its asymptotic distribution:
\[ U_{.2}^{(b)} \sim MVN\left( 0, \hat\Sigma_{.2} \right), \]
with $b = 1, 2, \ldots, B$.
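The simulation procedure and the Monte Carlo p-value of each SPU(γ) test might be sketched as follows. The function name and arguments are hypothetical. One assumption differs from the formula as printed: the sketch compares absolute values of the statistics, which makes the test two-sided for odd γ, whereas the slide compares T^(b) with T^obs directly.

```python
import numpy as np

def spu_pvalues_by_simulation(U_obs, Sigma2, gammas, B=1000, seed=0):
    """Draw B null score vectors from MVN(0, Sigma2) and return the
    Monte Carlo p-value of SPU(gamma) for each gamma, plus the draws."""
    U_obs = np.asarray(U_obs, dtype=float)
    rng = np.random.default_rng(seed)
    U_null = rng.multivariate_normal(np.zeros(len(U_obs)), Sigma2, size=B)  # (B, p)
    pvals = {}
    for g in gammas:
        if np.isinf(g):
            T_obs = np.max(np.abs(U_obs))
            T_null = np.max(np.abs(U_null), axis=1)
        else:
            T_obs = np.sum(U_obs ** g)
            T_null = np.sum(U_null ** g, axis=1)
        # Assumption: absolute values are compared (two-sided for odd gamma).
        pvals[g] = (np.sum(np.abs(T_null) >= np.abs(T_obs)) + 1) / (B + 1)
    return pvals, U_null
```

Returning the null draws alongside the p-values lets the same B score vectors be reused by a later adaptive test.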
Methods X
Journal Article 1
The permutation procedure is more robust to failure of the asymptotic approximation, e.g., in the
analysis of rare variants. It can be implemented as follows:
1 Identify the maximum k across all n subjects, which is the number of longitudinal
measurements, e.g. k = 4.
2 Check whether the data have missing values; if so, fill each missing value with NA to complete
the data dimension. For example, subject i with $Y_i = (y_{i,1}, y_{i,4})$ is missing the
measurements at time 2 and time 3; after filling, it becomes
$Y_i = (y_{i,1}, NA, NA, y_{i,4})$. Every subject's $Y_i$ now has
dimension k × 1.
3 Complete $H_i$ to full dimension, i.e. k × (p + q + 1), for covariates and SNPs.
Each subject i now has an augmented matrix $(Y_i, H_i)$ of dimension k × (p + q + 2),
where $H_i = (Z_i, X_i)$. For all n subjects, we have the row-wise bound matrix
\[ M = \begin{pmatrix} Y_1 & H_1 \\ Y_2 & H_2 \\ \vdots & \vdots \\ Y_n & H_n \end{pmatrix} \]
of dimension nk × (p + q + 2).
Methods XI
Journal Article 1
4 Permute the genotype codes across individuals, i.e. swap the $X_i$ in $(Y_i, Z_i, X_i)$ with
the $X_j$ in $(Y_j, Z_j, X_j)$, where $i \ne j$.
5 With the permuted matrix
\[
M^{*(b)} = \begin{pmatrix}
Y_1 & Z_1, X_1^{*(b)} \\
Y_2 & Z_2, X_2^{*(b)} \\
\vdots & \vdots \\
Y_n & Z_n, X_n^{*(b)}
\end{pmatrix}
\]
we recalculate $U_{.2}^{*(b)}$ by $U_{.2}^{*(b)} = \sum_i (X_i^{*(b)})^{\top} (Y_i - \hat\mu_i)$.
6 Repeat steps 4 and 5 B times to produce $U_{.2}^{*(b)}$ with $b = 1, 2, \ldots, B$.
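The permutation steps above can be sketched as follows, under stated assumptions: each subject's genotype block X_i is moved as a whole to another subject while Y_i and Z_i stay in place, and missing visits (NA, represented here as NaN residuals) are set to zero so that they contribute nothing to the score. The function name permute_scores and the array layout are hypothetical.

```python
import numpy as np

def permute_scores(X, resid, B=1000, seed=0):
    """X: (n, k, p) genotype matrices X_i; resid: (n, k) null-model residuals
    Y_i - mu_hat_i, with NaN at missing visits. Returns (B, p) permuted U.2."""
    rng = np.random.default_rng(seed)
    n, _, p = X.shape
    resid0 = np.nan_to_num(resid, nan=0.0)   # assumption: missing visits contribute 0
    U_perm = np.empty((B, p))
    for b in range(B):
        idx = rng.permutation(n)             # step 4: shuffle genotypes across subjects
        # step 5: U.2*(b) = sum_i (X_i*(b))' (Y_i - mu_hat_i)
        U_perm[b] = np.einsum("ikj,ik->j", X[idx], resid0)
    return U_perm
```

Because only X is shuffled, the fitted null model and its residuals are computed once and reused across all B permutations.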
Methods XII
Journal Article 1
The longitudinal adaptive SPU test (LaSPU)
Although we have a list of SPU($\gamma$) statistics and p-values, we do not know which one is the
most powerful for a specific unknown association pattern. It would thus be convenient to have a
test that adaptively and automatically selects the best SPU($\gamma$) test.
We therefore propose a longitudinal adaptive SPU (LaSPU) test:
\[ T_{aSPU} = \min_{\gamma \in \Gamma} P_{SPU(\gamma)}. \]
Methods XIII
Journal Article 1
P-value calculation of LaSPU test
Similarly,
\[
P_{SPU(\gamma)}^{(b)} = \frac{\sum_{b_1 \ne b} I\left( T_{SPU(\gamma)}^{(b_1)} \ge T_{SPU(\gamma)}^{(b)} \right) + 1}{(B - 1) + 1}
\]
for every $\gamma$ and every $b$. Then we have $T_{aSPU}^{(b)} = \min_{\gamma \in \Gamma} P_{SPU(\gamma)}^{(b)}$, and the final p-value of the
LaSPU test is:
\[
P_{LaSPU} = \frac{\sum_{b=1}^{B} I\left( T_{aSPU}^{(b)} \le T_{aSPU}^{obs} \right) + 1}{B + 1}.
\]
It is worth noting that the same B simulated score vectors ($U_{.2}$) are used in
calculating $P_{LaSPU}$.
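A sketch of the adaptive p-value calculation, reusing one set of B null statistics for both layers as described above. The function name aspu_pvalue and the input layout (one column per γ) are hypothetical, and the sketch again assumes that absolute values of the statistics are compared.

```python
import numpy as np

def aspu_pvalue(T_obs, T_null):
    """T_obs: (G,) observed SPU statistics, one per gamma.
    T_null: (B, G) null SPU statistics built from the same B score vectors.
    Returns the adaptive p-value and the per-gamma p-values."""
    T_obs = np.abs(np.asarray(T_obs, dtype=float))
    T_null = np.abs(np.asarray(T_null, dtype=float))
    B = T_null.shape[0]
    p_obs = (np.sum(T_null >= T_obs, axis=0) + 1) / (B + 1)
    p_null = np.empty_like(T_null)
    for g in range(T_null.shape[1]):
        srt = np.sort(T_null[:, g])
        # Count of null statistics >= T^(b), including T^(b) itself.
        n_ge = B - np.searchsorted(srt, T_null[:, g], side="left")
        # ((n_ge - 1) + 1) / ((B - 1) + 1): rank among the other B - 1 draws.
        p_null[:, g] = n_ge / B
    t_obs = p_obs.min()                  # T_aSPU^obs
    t_null = p_null.min(axis=1)          # T_aSPU^(b)
    p_aspu = (np.sum(t_null <= t_obs) + 1) / (B + 1)
    return p_aspu, p_obs
```

Sorting each column once keeps the cost at O(B log B) per γ instead of comparing all pairs of draws.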
Methods XIV
Journal Article 1
The stage-wise genome wide scan strategy
In practice, for genome-wide computation, we can use a stage-wise aSPU test strategy:
1 we first start with a smaller B, say B = 1000
2 we increase B, up to say 10^6, only for the few sets of SNPs that pass a pre-determined
significance cutoff (e.g., p-value ≤ 5/B) in step 1
3 repeat step 2 until a pre-determined B is reached
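The stage-wise strategy can be sketched as a small loop. The function name stagewise_pvalue and the callback pvalue_fn (anything that returns a p-value for a given B) are hypothetical, and growing B by a factor of 10 per stage is an assumption; the slide only fixes the starting B, the 5/B cutoff and an upper limit such as 10^6.

```python
def stagewise_pvalue(pvalue_fn, B_start=1000, B_max=10**6, factor=10, cutoff_multiplier=5):
    """Re-test a SNP set with more null draws only while it keeps passing
    the cutoff p <= cutoff_multiplier / B and B is below B_max."""
    B = B_start
    while True:
        p = pvalue_fn(B)
        if p > cutoff_multiplier / B or B >= B_max:
            return p, B
        B = min(B * factor, B_max)   # assumed growth schedule

# A clearly null SNP set stops after the first, cheap stage.
print(stagewise_pvalue(lambda B: 0.5))
```

Most SNP sets exit at B = 1000, so the expensive stages are spent only on the few candidates whose p-values are too small to resolve with a small B.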
Methods XV
Journal Article 1
Other versions of LaSPU test
LaSPUw test
The SPUw test is a diagonal-variance-weighted version of the SPU test, defined as:
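\[ T_{SPUw(\gamma)} = \sum_{j=1}^{p} \left( \frac{U_{.2,j}}{\hat\Sigma_{.2,jj}} \right)^{\gamma} \]
LaSPU(w).Score test
\[ T_{aSPU.Score} = \min\left\{ \min_{\gamma \in \Gamma} P_{SPU(\gamma)}, \; P_{Score} \right\}, \]
LaSPU omnibus test
\[ T_{aSPU.o} = \min\left\{ \min_{\gamma \in \Gamma} P_{SPU(\gamma)}, \; \min_{\gamma \in \Gamma} P_{SPUw(\gamma)}, \; P_{Score} \right\}. \]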
Simulation Methods I
Journal Article 1
Simulation of genotype data following [WE07]
Figure: Demo graph of genotype simulation
Simulation Methods II
Journal Article 1
Simulation of phenotype data
We set up the mixed-effects model to achieve the AR(1) correlation structure as:
\[ y_{im} = \mu_i + b_i + \underbrace{\rho e_{i,m-1} + s_{i,m}}_{e_{i,m}}, \tag{3} \]
where $m = 1, \ldots, k$ indexes the longitudinal measurements within subject i;
\[ \mu_i = Z_i \phi + X_i \beta = H_i \theta \]
as in the quantitative trait case; $b_i$ is the random intercept representing the subject-level random
effect, and
\[ \rho e_{i,m-1} + s_{i,m} = e_{i,m}, \]
where $\rho$ is the lag-one autocorrelation coefficient, which we set to $\rho = 0.7$ based on our estimate
from the real data. We assume the following distributions:
Simulation Methods III
Journal Article 1
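\[ b_i \sim N(0, \sigma_b^2), \qquad e_{i,m} \sim N(0, \sigma_e^2), \qquad s_{i,m} \sim N(0, (1 - \rho^2)\sigma_e^2) \]
Under this assumption, the variance-covariance matrix across longitudinal measurements
becomes (assuming k = 4 for the number of longitudinal measurements):
\[
\Sigma_{4 \times 4} = \mathrm{Var}\begin{pmatrix}
b_i + e_{i1} \\
b_i + \rho e_{i1} + s_{i2} \\
b_i + \rho e_{i2} + s_{i3} \\
b_i + \rho e_{i3} + s_{i4}
\end{pmatrix}
= \sigma_b^2 \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
+ \sigma_e^2 \begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 \\
\rho & 1 & \rho & \rho^2 \\
\rho^2 & \rho & 1 & \rho \\
\rho^3 & \rho^2 & \rho & 1
\end{pmatrix} \tag{4}
\]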
Simulation Methods IV
Journal Article 1
Summary of parameter setup in simulation studies
$h_j^2 = 0.001$
σb = 7
σe = 27
n varies between 500 and 3000
k = 4
1000 replicates of simulated dataset (5000 replicates as optional validation, result not
shown)
α = 0.05
ρy = 0.7
ρx = 0.8
R = AR(1)
Rw = I
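The random-intercept plus AR(1) error model of Equations 3 and 4, with the parameter values listed above (σb = 7, σe = 27, ρ = 0.7, k = 4), can be sketched as follows. The function names are hypothetical, and the sketch simulates only the error part b_i + e_im; adding μ_i = H_i θ would give the phenotype y_im.

```python
import numpy as np

def ar1_random_intercept_cov(k, sigma_b, sigma_e, rho):
    """Theoretical covariance of Eq. 4: sigma_b^2 * J + sigma_e^2 * AR(1)."""
    lags = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return sigma_b**2 * np.ones((k, k)) + sigma_e**2 * rho**lags

def simulate_errors(n, k, sigma_b, sigma_e, rho, seed=0):
    """Simulate b_i + e_im for n subjects and k visits (Eq. 3 without mu_i)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma_b, size=(n, 1))            # subject-level random intercept
    e = np.empty((n, k))
    e[:, 0] = rng.normal(0.0, sigma_e, size=n)
    for m in range(1, k):
        s = rng.normal(0.0, np.sqrt(1 - rho**2) * sigma_e, size=n)
        e[:, m] = rho * e[:, m - 1] + s                  # e_im = rho * e_{i,m-1} + s_im
    return b + e

Sigma = ar1_random_intercept_cov(k=4, sigma_b=7.0, sigma_e=27.0, rho=0.7)
errors = simulate_errors(n=3000, k=4, sigma_b=7.0, sigma_e=27.0, rho=0.7)
print(np.round(Sigma, 1))
print(np.round(np.cov(errors, rowvar=False), 1))         # empirical counterpart
```

The innovation variance (1 − ρ²)σe² is what keeps Var(e_im) equal to σe² at every visit, so the diagonal of Equation 4 is constant.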
Simulation Results I
Journal Article 1
Simulation setting A
B = 1000 for simulation of U
sample size of 500, 1000, 2000 and 3000
5/15 + 0/35 SNPs allocation
0.05 < MAF < 0.4
$h_j^2 = 0$
α = 0.05
n     pSSU   pSSUw  pScore  pSum   pUminP  LaSPU  LaSPUw  LaSPU.sco  LaSPU.omni
500   0.047  0.048  0.047   0.052  0.048   0.060  0.058   0.058      0.058
1000  0.047  0.046  0.044   0.058  0.048   0.057  0.057   0.057      0.057
2000  0.049  0.047  0.051   0.048  0.048   0.061  0.058   0.058      0.058
3000  0.051  0.052  0.052   0.052  0.050   0.063  0.060   0.059      0.059
Table: Empirical Type I Error Table in the simulation setting A
Simulation Results II
Journal Article 1
Simulation setting B
$h_j^2 = 0.001$
Figure: Empirical power benchmark under different n in the simulation setting B. Plot title: Power Benchmark on different sample size; x-axis: Sample Size (500, 1000, 2000, 3000); y-axis: Empirical Power (0.0 to 1.0). Legend, aSPU family: aspu_P, aspu_weighted_P, aspu_sco_P, aspu_weighted_sco_P; others: geescoreP, UminP, Sum Test, w.Sum Test, SSU.
Simulation Results III
Journal Article 1
Simulation setting C: Testing with a growing number of null SNPs (in the second block)
Figure: Empirical power benchmark under an increased number of null SNPs in the simulation setting C. Plot title: Power Benchmark on different number of null SNPs; x-axis: # of null SNPs (50, 75, 100, 200, 400); y-axis: Empirical Power (0.0 to 1.0). Legend, aSPU family: aspu_P, aspu_weighted_P, aspu_sco_P, aspu_weighted_sco_P; others: geescoreP, UminP, Sum Test, w.Sum Test, SSU.
Simulation Results IV
Journal Article 1
Simulation setting D: assessing power gain from longitudinal measurements by
decreasing the residual variance via repeated measures
Figure: Power increases from repeated measurements in the simulation setting D. Plot title: Power increase from repeated measurements on different sample size; x-axis: Sample Size (500, 1000, 2000, 3000); y-axis: Empirical Power (0.0 to 1.0); curves labeled 1 to 5.
Simulation Results for Rare Variants I
Journal Article 1
Simulation setting A’
5/15 with 5 included in the test
0.001 < MAF < 0.01
permutation based method to calculate the P-value
n     pSSU   pSSUw  pScore  pSum   pUminP  LaSPU  LaSPUw  LaSPU.sco  LaSPU.omni
500   0.053  0.054  0.052   0.049  0.047   0.054  0.053   0.060      0.058
1000  0.055  0.040  0.042   0.048  0.054   0.047  0.045   0.052      0.051
2000  0.054  0.050  0.048   0.049  0.046   0.063  0.057   0.058      0.057
3000  0.045  0.044  0.039   0.060  0.053   0.049  0.053   0.049      0.053
Table: Empirical Type I Error Table in the simulation setting A’.
Simulation Results for Rare Variants II
Journal Article 1
Simulation setting B’
Figure: Empirical power benchmark under different n in the simulation setting B’. Plot title: Power Benchmark on different sample size; x-axis: Sample Size (500, 1000, 2000, 3000); y-axis: Empirical Power (0.0 to 1.0). Legend, aSPU family (curves k to p): aspu_weighted_sco_P, aspu_weighted_P, aspu_sco_P, aspu_P, aspu_aspuw.sco_P, aspu_aspuw_P; others (curves a to j): UminP, mvn.UminP, pSum, pSSUw, pSSU, pScore, SPU(2), SPUw(2), SPUw(1), SPU(1).
Application to the ARIC study I
Journal Article 1
Recap the ARIC data
Figure: HDL-C (y-axis, 0 to 150) by visit (x-axis, visits 1 to 4), shown separately for carriers and non-carriers. Panels: LPL's effect on HDL-C; LIPC's effect on HDL-C.
Application to the ARIC study II
Journal Article 1
Phenotype: HDL-C levels measured at four visits in each of n=11,478 ARIC EA subjects
Genotype: rare/low frequency variants (MAF < 5%) on the ExomeChip
Statistical methods: LaSPU proposed here and its extensions
Baseline (aSPU) vs all four measures (LaSPU)
Gene-based association analysis
Application to the ARIC study III
Journal Article 1
Methods in data application
RV defined by MAF < 0.05
Inclusion of only nonsynonymous and splice site variants
QC procedures as in [PAB+14]
Additive genetic model
Covariates include age, age^2, gender, BMI, time of measurement and top two PCs
Gene boundary defined in [PAB+14]
Significant: 0.05/25,000 = 2e-06; marginally significant: 1e-03
Application to the ARIC study IV
Journal Article 1
Results in data application: compare longitudinal method with baseline method
Figure: Manhattan Plot Comparison between baseline study and longitudinal study by LaSPU test on the association between
HDL-C and Rare Variants in the ARIC study. A. baseline study; B. longitudinal study using total four measurements.
Figure annotations: LIPC: previously not reported. LPL, LIPG, ANGPTL4: previously reported; FAM65B: previously not reported.
Application to the ARIC study V
Journal Article 1
Results in data application: compare longitudinal method with baseline method
Table: Top Gene-Based Association Results Based on Level of Statistical Significance
Gene      Chr  p Value   No. Variants (a)  CMAC (b)  CMAF (c)  p Value of Baseline (d)
LPL       8    1.00E-06  10                879       0.00807   9.99E-04
FAM65A*   16   1.00E-06  11                751       0.00627   3.79E-04
LIPG      18   1.00E-06  11                369       0.00308   3.13E-02
ANGPTL4   19   1.00E-06  9                 579       0.00591   2.89E-01
ANGPTL8   19   1.06E-04  5                 64        0.00118   2.07E-01
APOC3     11   5.87E-04  3                 21        0.00064   9.99E-04
PAFAH1B2  11   2.19E-03  3                 287       0.00879   1.50E-02
(a) number of variants contributing to the test
(b) cumulative minor allele count of the variants contributing to the test
(c) cumulative minor allele frequency of the variants contributing to the test
(d) aSPU test using only the baseline measurement of HDL-C
* newly identified gene(s)
Application to the ARIC study VI
Journal Article 1
Gene effect visualization of previously reported genes
Figure (panels A to F): HDL-C (y-axis) by visit (x-axis, visits 1 to 4) for rare-variant carriers and non-carriers. Panel titles: LPL's effect on HDL-C; LIPG's effect on HDL-C; ANGPTL4's effect on HDL-C; ANGPTL8's effect on HDL-C; APOC3's effect on HDL-C; PAFAH1B2's effect on HDL-C.
Application to the ARIC study VII
Journal Article 1
Results in data application: compare LaSPU with LSSU method
Figure: Manhattan plot comparison between the LaSPU test and the LSSU test on the association between HDL-C and rare variants in the ARIC study. A: result by LaSPU; B: result by LSSU.
Annotation on panel A: LPL, LIPG, ANGPTL4: previously reported; FAM65A: not previously reported.
Annotation on panel B: LPL, LIPG, ANGPTL4: previously reported.
Application to the ARIC study VIII
Journal Article 1
Gene effect visualization of previously unreported genes
A: FAM65A, identified only by the LaSPU and LSum tests; B: LIPC, identified only by aSPU (baseline analysis)
Figure (panels A and B): HDL-C (y-axis) plotted against Visit 1 to 4 (x-axis) for rare-variant carriers versus non-carriers. Panel titles: FAM65A's effect on HDL-C; LIPC's effect on HDL-C.
Discussion I
Journal Article 1
LaSPU can efficiently use the longitudinal information to increase testing power
LaSPU can adaptively select the best test and thereby achieve greater power
LaSPU proved its ability in real data analysis by identifying the previously reported genes associated with HDL-C using the ARIC data set alone
LaSPU identified one novel gene associated with HDL-C, FAM65A, which warrants further validation
Journal Article 2
Title of Journal Article
Pathway-based data-adaptive association tests for longitudinal phenotypes.
Methods I
Journal Article 2
Pathway idea illustration
Figure: Aggregation of SNPs in a Pathway. The diagram groups SNPs (SNP1, SNP2, SNP3, ..., SNP998, SNP999, SNP1000) into genes (Gene1, ..., Gene100), and a subset of those genes (Gene1, Gene16, Gene36, Gene66, Gene100) into Pathway1.
Methods II
Journal Article 2
A pathway analysis involves multiple genes (e.g. 20 is a typical number). Because the genes within a pathway may contain different numbers of RVs, the aSPU test must be modified to adjust for varying gene length, so that no large (or small) gene dominates the statistic.
Let the shorthand $U_{g\cdot}$ denote $U_{\cdot 2}$, the part of the whole score vector that corresponds to the RVs $X_i$, so that $U_{g\cdot} = (U_{g,1}, U_{g,2}, \ldots, U_{g,p_g})$ is the score vector for gene $g$ with its own $p_g$ RVs.
Given a pathway (or gene set) $S$, the gene-specific SPU statistic is as follows:
$$T_{\mathrm{SPU}}(\gamma; g) \propto \|U_{g\cdot}\|_{\gamma} = \left( \frac{\sum_{j=1}^{p_g} |U_{g,j}|^{\gamma}}{p_g} \right)^{1/\gamma} \tag{5}$$
Accordingly, the pathway-based SPU statistic is
$$T_{\mathrm{Path\text{-}SPU}}(\gamma, \gamma_2; S) = \sum_{g \in S} \left( T_{\mathrm{SPU}}(\gamma; g) \right)^{\gamma_2} \tag{6}$$
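Equations (5) and (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the LaSPU implementation: the function names `gene_spu` and `pathway_spu` and the toy score vectors are hypothetical, and the proportionality in equation (5) is taken as an equality.

```python
from typing import Dict, List


def gene_spu(scores: List[float], gamma: int) -> float:
    """Gene-specific SPU statistic of equation (5).

    The gamma-norm of the gene's score vector is normalised by the number
    of RVs in the gene (p_g), so that long genes do not dominate.
    """
    p_g = len(scores)
    mean_power = sum(abs(u) ** gamma for u in scores) / p_g
    return mean_power ** (1.0 / gamma)


def pathway_spu(gene_scores: Dict[str, List[float]], gamma: int, gamma2: int) -> float:
    """Pathway-based SPU statistic of equation (6): sum over genes in S."""
    return sum(gene_spu(u, gamma) ** gamma2 for u in gene_scores.values())


# Hypothetical score vectors for a two-gene pathway (made-up numbers).
toy_scores = {"geneA": [3.0, 4.0], "geneB": [1.0, 1.0, 1.0, 1.0]}
toy_stat = pathway_spu(toy_scores, gamma=2, gamma2=1)
```

With `gamma2 = 1` every gene contributes its normalised norm linearly; larger values of `gamma2` let the genes with the largest statistics dominate the sum.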
Methods III
Journal Article 2
A data-adaptive pathway-based longitudinal association test: LaSPUpath
The pathway-based aSPU statistic is thus
$$T_{\mathrm{Path\text{-}aSPU}}(S) = \min_{\gamma, \gamma_2} P_{\mathrm{Path\text{-}SPU}}(\gamma, \gamma_2; S) \tag{7}$$
where $P_{\mathrm{Path\text{-}SPU}}(\gamma, \gamma_2; S)$ is the P-value of the corresponding Path-SPU test.
We propose to use $\gamma_2 \in \Gamma_2 = \{1, 2, 4, 8\}$. This set covers a Sum-like test, an SSU-like test, and two further tests that favor the sparse-causal-gene situation (e.g. only 2 or 3 genes associated with the trait in a pathway of, say, 20 genes).
For any given $\gamma$ and $\gamma_2$, we resort to a simulation- or permutation-based strategy similar to that of the LaSPU test to calculate the P-values of a class of SPUpath tests, and then calculate the final P-value of the LaSPUpath test.
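The min-P combination in equation (7) needs its own null distribution, because the minimum of several P-values is not itself a valid P-value. The sketch below shows one common way to obtain it from a single set of null replicates, in the spirit of aSPU-type tests. It is an assumption about the general strategy, not a transcript of the LaSPUpath code. The function name `adaptive_p_value` and the toy inputs are hypothetical, and the statistics are assumed non-negative, so that larger means more extreme.

```python
from typing import List, Sequence


def adaptive_p_value(observed: Sequence[float], null: Sequence[Sequence[float]]) -> float:
    """Final P-value of a min-P adaptive test.

    observed[k] is the observed statistic for the k-th (gamma, gamma2) pair;
    null[b][k] is the same statistic in the b-th simulated or permuted
    null replicate.
    """
    n_rep = len(null)
    n_tests = len(observed)

    # Step 1: P-value of each individual test from the null replicates.
    p_obs = [
        (1 + sum(null[b][k] >= observed[k] for b in range(n_rep))) / (n_rep + 1)
        for k in range(n_tests)
    ]
    t_obs = min(p_obs)

    # Step 2: the same min-P statistic for every null replicate, with each
    # replicate ranked against the remaining replicates.
    t_null: List[float] = []
    for b in range(n_rep):
        p_b = [
            (1 + sum(null[c][k] >= null[b][k] for c in range(n_rep) if c != b)) / n_rep
            for k in range(n_tests)
        ]
        t_null.append(min(p_b))

    # Step 3: compare the observed min-P with its null distribution.
    return (1 + sum(t <= t_obs for t in t_null)) / (n_rep + 1)


# Toy check with made-up numbers: an observed statistic far beyond all 99
# null replicates yields the smallest attainable P-value.
toy_null = [[float(b), float(b)] for b in range(99)]
p_extreme = adaptive_p_value([1000.0, 1000.0], toy_null)
p_null_like = adaptive_p_value([0.0, 0.0], toy_null)
```

The double loop costs on the order of B² comparisons per test, which is acceptable for a sketch; sorting the null statistics once per test would be the usual optimisation.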
Simulation Methods I
Journal Article 2
We simulated a pathway with 20 independent genes. Each gene g contained
pg SNPs, with pg randomly drawn from a uniform distribution U(3, 20); 5 of
the 20 genes were randomly selected to be causal, with each causal gene
containing U(1, 3) causal SNPs.
The other parts of the simulation are the same as in Article 1.
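The pathway layout described above can be drawn as follows. This is a sketch under stated assumptions: U(3, 20) and U(1, 3) are read as discrete uniform distributions with inclusive integer endpoints, and the function name `simulate_pathway_layout` and the seed are hypothetical. Genotype and phenotype generation, which follow Article 1, are not reproduced.

```python
import random
from typing import Dict, List, Tuple


def simulate_pathway_layout(
    rng: random.Random, n_genes: int = 20, n_causal_genes: int = 5
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Draw gene sizes and causal SNP positions for one simulated pathway."""
    # Number of SNPs per gene, p_g ~ U(3, 20) (assumed inclusive integers).
    sizes = [rng.randint(3, 20) for _ in range(n_genes)]

    # Pick the causal genes without replacement.
    causal_genes = rng.sample(range(n_genes), n_causal_genes)

    # Each causal gene carries U(1, 3) causal SNPs at random positions.
    causal_snps: Dict[int, List[int]] = {}
    for g in causal_genes:
        n_causal = rng.randint(1, 3)
        causal_snps[g] = sorted(rng.sample(range(sizes[g]), n_causal))
    return sizes, causal_snps


# Arbitrary seed, used only to make the illustration reproducible.
sizes, causal_snps = simulate_pathway_layout(random.Random(2015))
```

Because every gene has at least 3 SNPs, sampling up to 3 causal positions per gene never exceeds the gene size.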
Simulation Methods II
Journal Article 2
Summary of parameter setup in simulation studies
$h_j^2 = 0$ or $h_j^2 = 0.001, 0.0025, 0.005, 0.0075$ and $0.010$
σb = 7
σe = 27
k = 5
1000 replicates of simulated dataset
B = 1000
α = 0.05
ρy = 0.7
ρx = 0.8
R = AR(1)
Rw = I
0.05 < MAF < 0.4 for CVs; 0.001 < MAF < 0.01 for RVs
causal SNPs are excluded from testing for CVs but retained for RVs
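For reference, the AR(1) structure named in the list has the standard form below for k = 5 repeated measures. Here $\rho$ is a generic correlation parameter: the slide does not spell out which of the listed $\rho$ values enters $R$, so that mapping is left open. Reading $R_w = I$ as the identity matrix of the same dimension is likewise an assumption.

```latex
R = \left( \rho^{|s-t|} \right)_{s,t=1}^{5}
  = \begin{pmatrix}
      1      & \rho   & \rho^2 & \rho^3 & \rho^4 \\
      \rho   & 1      & \rho   & \rho^2 & \rho^3 \\
      \rho^2 & \rho   & 1      & \rho   & \rho^2 \\
      \rho^3 & \rho^2 & \rho   & 1      & \rho   \\
      \rho^4 & \rho^3 & \rho^2 & \rho   & 1
    \end{pmatrix},
\qquad
R_w = I_5 .
```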
Simulation Results
Journal Article 2
Simulation setting A
0.05 < MAF < 0.4
$h_j^2 = 0$
α = 0.05
Table: Empirical Type I Error in simulation set-up A.
Test                    pSSU   pSSUw  pScore  pSum   pUminP  LaSPUpath
Empirical Type I error  0.037  0.042  0.025   0.049  0.050   0.053
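To judge how close these empirical rates are to the nominal level, it helps to know the Monte Carlo error of a rejection rate estimated from 1000 replicates. The sketch below assumes a test rejects when its P-value is below α and that replicates are independent; the function names are hypothetical.

```python
import math
from typing import Sequence


def empirical_rate(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of replicates whose P-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(alpha: float = 0.05, n_rep: int = 1000) -> float:
    """Binomial standard error of a rejection rate when the true rate is alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_rep)


se = monte_carlo_se()                       # about 0.0069 for 1000 replicates
lower, upper = 0.05 - 2 * se, 0.05 + 2 * se  # rough two-standard-error band
```

Under these assumptions the band runs from roughly 0.036 to 0.064. The rates reported for pSSU, pSSUw, pSum, pUminP and LaSPUpath fall inside it, while 0.025 for pScore falls below, which suggests that test is conservative in this setting.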
Simulation Results
Journal Article 2
Simulation setting B
Figure: Empirical power benchmark under different heritability ($h^2$) in simulation set-up B. Plot title: Power Benchmark on different sample size. x-axis: Heritability ($h^2$) at 0.0010, 0.0025, 0.0050, 0.0075 and 0.0100; y-axis: Empirical Power from 0.0 to 1.0. Curves: a = pSSU, b = pSSUw, c = pScore, d = pSum, e = pUminP, f = LaSPUpath.
Application to the ARIC study I
Journal Article 2
Methods in data application
RV defined by MAF < 0.05
Inclusion of only nonsynonymous and splice site variants
QC procedures as in [PAB+14]
Additive genetic model
Covariates include age, age^2, gender, BMI, time of measurement and top two PCs
Biological pathways obtained from the KEGG database [OGS+99]
Pathways trimmed to moderate size (between 10 and 500 genes)
Gene boundaries defined as in [PAB+14]
Significant: 0.05/197 ≈ 0.00025; marginally significant: 1e-03
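The significance threshold in the last item is a Bonferroni correction over the 197 pathways tested. A small sketch of that arithmetic and of the two-level classification follows; the function names and the use of strict inequalities are assumptions.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


# 0.05 / 197 is about 2.54e-4, which the slide rounds to 0.00025.
threshold = bonferroni_threshold(0.05, 197)


def classify(p_value: float, significant: float = threshold, marginal: float = 1e-3) -> str:
    """Label a pathway P-value using the two cut-offs quoted on the slide."""
    if p_value < significant:
        return "significant"
    if p_value < marginal:
        return "marginally significant"
    return "not significant"
```

For example, `classify(1e-6)` falls in the first category and `classify(5e-4)` in the second.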
Application to the ARIC study II
Journal Article 2
Results in data application: significant pathways
Table: Results of the ARIC Data Application: KEGG Pathways with p Value < 0.00025
KEGG ID   Pathway Name             No. of Genes  No. of SNPs  p Value   Contributing Genes (a)
hsa00561  Glycerolipid metabolism  44            311          1.00E-06  LPL, LIPG, LIPC, DGKQ, PPAP2A, PNLIPRP3
hsa03320  PPAR signaling pathway   55            465          1.00E-06  LPL, ANGPTL4, NR1H3, CD36, APOA1
hsa05010  Alzheimer's disease      98            747          1.00E-06  LPL, APOE, BID, PLCB3, NDUFS8, NDUFS3, NDUFB6, RYR3, NCSTN
(a) p Value of the gene < 0.05
hsa00561: belongs to the lipid metabolism class; associated with hyperlipoproteinemia, type I
hsa03320: reported to play a role in the clearance of circulating or cellular lipids via the regulation of gene expression; associated with multiple lipid-related diseases such as hyperalphalipoproteinemia
hsa05010: a number of studies report that a reduced level of HDL-C is highly correlated with the severity of AD; atherosclerosis also increases the risk of AD.
Discussion I
Journal Article 2
LaSPUpath can efficiently use the longitudinal information to increase testing power
LaSPUpath can efficiently use biological information (such as pathways) to increase testing power
LaSPUpath can adaptively select the best test and thereby achieve greater power
LaSPUpath identified three pathways significantly associated with HDL-C. The annotated functions of these pathways are closely related to HDL-C: they regulate lipids (including HDL-C) and/or lead to lipid-related diseases such as hyperlipoproteinemia and Alzheimer's disease.
LaSPUpath could be used in more general gene-set-based testing, for example with annotations of functional variation available in ENCODE (https://www.encodeproject.org/) and epigenome activities collected in the Roadmap Epigenomic Data at NCBI/GEO (http://www.roadmapepigenomics.org/).
Journal Article 3
Title of Journal Article
LaSPU: a suite of powerful data-adaptive SNP-set and pathway-based association
testing tools for longitudinal traits.
Features
Journal Article 3
implements the LaSPU method
compatible with Unix-like systems
command-line operated program
optimized, with a flexible parallel computing scheme
user-friendly manual and examples
Workflow
Journal Article 3
Computation Environment Setup
•Serial
•Single node with multiple cores
•Multiple nodes with multiple cores
Data Integration
•Genotype
•Genotype annotation
•Longitudinal trait
•Covariate
Ad-Hoc QC for genotype data
•Missing rate
•Monomorphic site
•Set with not enough SNPs
GEE fitting of longitudinal data under H0 to derive the score vector and variance-covariance matrix
(optional) Permutation procedure to obtain the null distribution of the score vector for LaSPU
Execute Five Tests (default):
•Sum Test
•SSU(w) Tests
•UminP Test
•Score Test
•LaSPU
Figure: LaSPU Workflow Chart.
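The ad-hoc QC step of the workflow can be illustrated with a small filter. This sketch is not the LaSPU package's code or interface: the function name `qc_snp_sets`, the data layout (per-SNP lists of 0/1/2 minor-allele counts with `None` for missing calls) and both thresholds are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0/1/2 minor-allele count; None marks a missing call


def qc_snp_sets(
    genotypes: Dict[str, List[Genotype]],
    snp_sets: Dict[str, List[str]],
    max_missing_rate: float = 0.05,  # hypothetical threshold
    min_snps: int = 2,               # hypothetical threshold
) -> Dict[str, List[str]]:
    """Apply the three QC filters listed in the workflow to each SNP set."""

    def passes(calls: List[Genotype]) -> bool:
        if not calls:
            return False
        observed = [g for g in calls if g is not None]
        # Filter 1: missing rate.
        if 1.0 - len(observed) / len(calls) > max_missing_rate:
            return False
        # Filter 2: monomorphic site (all observed calls identical).
        return len(set(observed)) > 1

    kept: Dict[str, List[str]] = {}
    for set_name, snps in snp_sets.items():
        good = [s for s in snps if s in genotypes and passes(genotypes[s])]
        # Filter 3: drop sets left with too few SNPs.
        if len(good) >= min_snps:
            kept[set_name] = good
    return kept


# Toy data with made-up SNP identifiers and calls for four subjects.
toy_genotypes = {
    "rs1": [0, 1, 0, 0],
    "rs2": [0, 0, 0, 0],        # monomorphic
    "rs3": [0, None, None, 1],  # 50% missing
    "rs4": [0, 0, 2, 0],
}
toy_sets = {"geneA": ["rs1", "rs2", "rs4"], "geneB": ["rs2", "rs3"]}
kept = qc_snp_sets(toy_genotypes, toy_sets, max_missing_rate=0.25, min_snps=2)
```

In this toy example only geneA survives, with rs1 and rs4 retained.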
Result
Journal Article 3
Licensed under the CC BY-NC 4.0 license and freely available at
https://github.com/xyy2006/LaSPU/.
Conclusion I
In brief, our developed association test(s) utilized
the longitudinal information
the data adaptivity to select the most powerful test
the biological information such as biological pathways
to boost the statistical power.
Conclusion II
More elaborately, in this dissertation we have
developed the LaSPU method.
developed the LaSPUpath method.
performed simulation studies that demonstrated that both LaSPU and LaSPUpath could
efficiently utilize the longitudinal information and adaptively select the best test to
increase the statistical power as compared to non-adaptive and/or baseline testing
methods.
performed simulation studies that also demonstrated that the LaSPUpath could efficiently
utilize the additional biological information when available, such as the biological pathway
information, to further boost the statistical power.
Real data application of LaSPU demonstrated that LaSPU can replicate the previously
reported genes with a much smaller sample size (about 1/4). In addition, LaSPU
identified a novel gene, FAM65A, associated with HDL-C.
Real data application of LaSPUpath identified three significant pathways, the functional
annotations of which all closely relate to the regulation of lipid metabolism and/or the onset
of lipid-related diseases.
The development of the "LaSPU" software package enables the research community to
implement the novel methods with ease.
Future Work
Joint test of the SNPs’ main effect and the SNPs’ interaction effect with time in the
LaSPU/LaSPUpath framework
Setup of appropriate weighting schema for SNPs (and genes) to incorporate functional
annotations of SNPs (and genes)
Setup of appropriate weighting schema for SNPs to enable the analysis of CVs and RVs
in the same region
Extension from testing genetic pathways to other functionally related gene sets.
Inclusion of the aSPUpath method into the "LaSPU" software package.
Acknowledgement
Advisors:
Peng Wei, Ph.D
Alanna C. Morrison, Ph.D
Yunxin Fu, Ph.D
Han Liang, Ph.D, Associate Professor, Department of Bioinformatics and Computational
Biology, The University of Texas M.D Anderson Cancer Center
External Reviewer:
Xiaoming Liu, Ph.D
Collaborator
Wei Pan, Ph.D, Professor, Division of Biostatistics, School of Public Health, University of
Minnesota
Supporting Grant:
Title: Association Analysis of Rare Variants with Sequencing Data
Funding Source: NIH/NHLBI (1R01HL116720)
My classmates, colleagues and friends at UTSPH and MDACC
...
My Parents
Tianpeng Yang and Qi Lu
My Wife
Nainan Hei
Thanks to Audience
Thank you for your participation!
References
[BHH+12] Alan P Boyle, Eurie L Hong, Manoj Hariharan, Yong Cheng, Marc A Schaub, Maya Kasowski, Konrad J Karczewski, Julie Park, Benjamin C Hitz, Shuai Weng, et al., Annotation of functional variation in personal genomes using RegulomeDB, Genome Research 22 (2012), no. 9, 1790–1797.
[LT11] Dan-Yu Lin and Zheng-Zheng Tang, A general framework for detecting disease associations with rare variants in sequencing studies, The American Journal of Human Genetics 89 (2011), no. 3, 354–367.
[LZ86] Kung-Yee Liang and Scott L Zeger, Longitudinal data analysis using generalized linear models, Biometrika 73 (1986), no. 1, 13–22.
[OGS+99] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru Fujibuchi, Hidemasa Bono, and Minoru Kanehisa, KEGG: Kyoto encyclopedia of genes and genomes, Nucleic Acids Research 27 (1999), no. 1, 29–34.
[PAB+14] Gina M Peloso, Paul L Auer, Joshua C Bis, Arend Voorman, Alanna C Morrison, Nathan O Stitziel, Jennifer A Brody, Sumeet A Khetarpal, Jacy R Crosby, Myriam Fornage, et al., Association of low-frequency and rare coding-sequence variants with blood lipids and coronary heart disease in 56,000 whites and blacks, The American Journal of Human Genetics 94 (2014), no. 2, 223–232.
[Pan01] Wei Pan, On the robust variance estimator in generalised estimating equations, Biometrika 88 (2001), no. 3, 901–906.
[Pan09] Wei Pan, Asymptotic tests of association with multiple SNPs in linkage disequilibrium, Genetic Epidemiology 33 (2009), no. 6, 497–507.
[PKZ+14] Wei Pan, Junghi Kim, Yiwei Zhang, Xiaotong Shen, and Peng Wei, A powerful and adaptive association test for rare variants, Genetics (2014), genetics–114.
[WE07] Tao Wang and Robert C Elston, Improved power by use of a weighted score test for linkage disequilibrium mapping, The American Journal of Human Genetics 80 (2007), no. 2, 353–360.
[XSP+14] Zhiyuan Xu, Xiaotong Shen, Wei Pan, Alzheimer's Disease Neuroimaging Initiative, et al., Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes, PLoS ONE 9 (2014), no. 8, e102312.
Yang Yang (UTSPH) Doctoral Dissertation Defense Nov 30 2015 84 / 84

More Related Content

What's hot

Dissertation Defense Presentation
Dissertation Defense PresentationDissertation Defense Presentation
Dissertation Defense PresentationDr. Timothy Kelly
 
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...Coni Judge, PhD
 
Large Molecule Bioanalysis: LC-MS or ELISA? A Case Study
Large Molecule Bioanalysis: LC-MS or ELISA? A Case StudyLarge Molecule Bioanalysis: LC-MS or ELISA? A Case Study
Large Molecule Bioanalysis: LC-MS or ELISA? A Case StudyQPS Holdings, LLC
 
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEADGWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEADvv628048
 
Review of Liao et al - A draft human pangenome reference - Nature (2023)
Review of Liao et al - A draft human pangenome reference - Nature (2023)Review of Liao et al - A draft human pangenome reference - Nature (2023)
Review of Liao et al - A draft human pangenome reference - Nature (2023)Stuart MacGowan
 
Anil bl gather mass spectrometry
Anil bl gather mass spectrometryAnil bl gather mass spectrometry
Anil bl gather mass spectrometryANIL BL GATHER
 
Cytogenetics and Molecular Cytogenetics (CRC Press, 2022).pdf
Cytogenetics and Molecular Cytogenetics (CRC Press, 2022).pdfCytogenetics and Molecular Cytogenetics (CRC Press, 2022).pdf
Cytogenetics and Molecular Cytogenetics (CRC Press, 2022).pdfQusayAlMaghayerh
 
Dissertation Defense Presentation
Dissertation Defense PresentationDissertation Defense Presentation
Dissertation Defense PresentationAvril El-Amin
 
qPCR Design Strategies for Specific Applications
qPCR Design Strategies for Specific ApplicationsqPCR Design Strategies for Specific Applications
qPCR Design Strategies for Specific ApplicationsIntegrated DNA Technologies
 
Case Study Research Proposal
Case Study Research ProposalCase Study Research Proposal
Case Study Research ProposalDawnAdolfson
 
Dissertation Proposal Defense
Dissertation Proposal DefenseDissertation Proposal Defense
Dissertation Proposal DefenseShajaira Lopez
 
Dissertation defense power point
Dissertation defense power pointDissertation defense power point
Dissertation defense power pointKelly Dodson
 
Dissertation defense ppt
Dissertation defense ppt Dissertation defense ppt
Dissertation defense ppt Dr. James Lake
 
FTIR analysis of secondary structure of protein
FTIR analysis of secondary structure of proteinFTIR analysis of secondary structure of protein
FTIR analysis of secondary structure of proteinAmit Chauhan
 
My Dissertation Proposal Defense
My Dissertation Proposal DefenseMy Dissertation Proposal Defense
My Dissertation Proposal DefenseLaura Pasquini
 
Comparative Genomics and Visualisation BS32010
Comparative Genomics and Visualisation BS32010Comparative Genomics and Visualisation BS32010
Comparative Genomics and Visualisation BS32010Leighton Pritchard
 
PhD proposal presentation
PhD proposal presentationPhD proposal presentation
PhD proposal presentationMichael Rowe
 
臨床疫学研究におけるNDBの利活用
臨床疫学研究におけるNDBの利活用臨床疫学研究におけるNDBの利活用
臨床疫学研究におけるNDBの利活用Yasuyuki Okumura
 

What's hot (20)

Dissertation Defense Presentation
Dissertation Defense PresentationDissertation Defense Presentation
Dissertation Defense Presentation
 
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...
Doctoral Dissertation Defense Conilyn Poulsen Judge The Relationship Between ...
 
Large Molecule Bioanalysis: LC-MS or ELISA? A Case Study
Large Molecule Bioanalysis: LC-MS or ELISA? A Case StudyLarge Molecule Bioanalysis: LC-MS or ELISA? A Case Study
Large Molecule Bioanalysis: LC-MS or ELISA? A Case Study
 
Lcms Basics
Lcms BasicsLcms Basics
Lcms Basics
 
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEADGWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
GWAS "GENOME WIDE ASSOCIATION STUDIES" A STEP AHEAD
 
dissertation_defense_final

  • 1. Background Overall Study Design Proposed Journal Articles Conclusion and Future Work Acknowledgement References Data-adaptive SNP-set-based Association Tests for Longitudinal Traits Doctoral Dissertation Defense Yang Yang B.S, Biological Science M.S, Bioinformatics Ph.D. Candidate, Biostatistics Department of Biostatistics, School of Public Health, The University of Texas Health Science Center at Houston November 30th 2015 Yang Yang (UTSPH) Doctoral Dissertation Defense Nov 30 2015 1 / 84
  • 2. Dissertation Committee: Dissertation Chair and Academic Advisor, Peng Wei, PhD; Minor Advisor, Alanna C. Morrison, PhD; Breadth Advisor, Yun-Xin Fu, PhD; External Advisor, Han Liang, PhD; External Reviewer, Xiaoming Liu, PhD
  • 3. Table of Contents
    1 Background
        Introduction to Genome-wide association study (GWAS)
        SNP-set based association tests
        Longitudinal data analysis strategy in GWAS
    2 Overall Study Design
        Simulation studies
        Real data application
    3 Proposed Journal Articles
        Journal Article 1: Data-adaptive SNP-set based association tests ...
            Methods
            Data Simulation Methods
            Simulation Results
            Application to the ARIC Study
            Discussion
        Journal Article 2: Pathway-based data-adaptive association tests ...
            Methods
            Data Simulation Methods
            Simulation Results
            Application to the ARIC Study
            Discussion
        Journal Article 3: LaSPU: a suite of powerful data-adaptive SNP-set and ...
    4 Conclusion and Future Work
    5 Acknowledgement
    6 References
  • 5. Introduction to GWAS: A workflow of GWAS
  • 6. Introduction to GWAS. Figure: GWAS contributes to precision medicine
  • 7. SNP-set based association tests: By pooling multiple low-MAF SNPs together, a SNP-set based association test can detect signal(s) from a region (such as a gene) rather than from a single SNP.
  • 8. Longitudinal data analysis in GWAS. Figure: Trajectories of the phenotype left hippocampus volume over time (in months) in three allele groups of SNP rs2075650 [XSP+14]
  • 9. Longitudinal data analysis in GWAS: A recent study by Xu et al. [XSP+14] demonstrated the power gain from longitudinal data analysis over the traditional cross-sectional data analysis used in GWAS. Figure: Comparison of the Manhattan plots of genome-wide p-values for the phenotype left hippocampus volume from longitudinal analysis (left) and from cross-sectional analysis (right) [XSP+14]
  • 11. Simulation Studies. General purpose: evaluate the proposed methods' performance in (1) maintaining the nominal Type I error rate and (2) maintaining higher power under different simulation scenarios.
  • 12. Application to ARIC ExomeChip Data
    Figure: LPL's effect on HDL-C and LIPC's effect on HDL-C (HDL-C, axis 0 to 150, against visit 1 to 4; panels "Genotype: Carrier" and "Genotype: Non-Carrier")
    Phenotype: HDL-C levels measured at four visits in each of n = 11,478 ARIC EA subjects
    Genotype: rare/low-frequency variants (MAF < 5%) on the ExomeChip
    Statistical methods: LaSPU(path) proposed here and its extensions
    Baseline (aSPU) vs all four measures (LaSPU)
    Gene-based and biological pathway-based association analysis
  • 14. Journal Article 1. Title of journal article: Data-adaptive SNP-set based association tests for longitudinal phenotypes within the Generalized Estimating Equations framework.
  • 16. Methods I (Journal Article 1)
    Suppose for each subject $i = 1, \ldots, n$, we have $k$ longitudinal measurements $y_i = (y_{i1}, y_{i2}, \ldots, y_{ik})^{\top}$ with $y_{im}$ as an element; $p$ SNPs of interest as a row vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ with $x_{ij}$ coded as 0, 1 or 2 for the count of the minor allele; and $z_i = (z_{i1}, z_{i2}, \ldots, z_{iq})$ as a row vector of $q$ covariates, e.g., time, gender and leading principal components for population substructure. Thus, we have:
    \[ X_i = \begin{pmatrix} x_i \\ x_i \\ \vdots \\ x_i \end{pmatrix}, \qquad Z_i = \begin{pmatrix} 1 & z_i \\ 1 & z_i \\ \vdots & \vdots \\ 1 & z_i \end{pmatrix} \]
  • 17. Methods II (Journal Article 1)
    $X_i$ is a $k \times p$ matrix, and $Z_i$ is a $k \times (q + 1)$ matrix. Denote the regression coefficients by $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^{\top}$ and $\phi = (\phi_1, \phi_2, \ldots, \phi_{q+1})^{\top}$ for $X_i$ and $Z_i$ respectively. We then have the GLM equation:
    \[ g(\mu_i) = \eta_i = Z_i \phi + X_i \beta = H_i \theta, \]
    where $H_i = (Z_i, X_i)$ and $\theta = (\phi^{\top}, \beta^{\top})^{\top}$.
    In the analysis of the ARIC data:
    $Z_i$ = time + gender + BMI(baseline) + age(baseline) + age(baseline)$^2$ + PC1 + PC2
    $X_i$ = SNP$_1$ + SNP$_2$ + ... + SNP$_p$
  • 18. Methods III (Journal Article 1)
    Consistent and asymptotically normal estimates of $\beta$ and $\phi$ can be obtained by solving the GEE [LZ86]:
    \[ U(\phi, \beta) = \sum_{i=1}^{n} U_i(\phi, \beta) = \sum_{i=1}^{n} \left( \frac{\partial \mu_i}{\partial \theta^{\top}} \right)^{\top} V_i^{-1} (Y_i - \mu_i) = 0, \]
    with
    \[ \frac{\partial \mu_i}{\partial \theta^{\top}} = \frac{\partial g^{-1}(H_i \theta)}{\partial \theta^{\top}}, \qquad V_i = \varphi A_i^{1/2} R_w A_i^{1/2}, \qquad A_i = \begin{pmatrix} v(\mu_{i1}) & 0 & \cdots & 0 \\ 0 & v(\mu_{i2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & v(\mu_{ik}) \end{pmatrix} \]
  • 19. Methods IV (Journal Article 1)
    Quantitative traits: we use the identity link, i.e. $g(\mu_{im}) = \mu_{im}$ and $v(\mu_{im}) = \varphi \times 1 = \varphi$. Then we have:
    \[ U = \sum_{i} (Z_i, X_i)^{\top} R_w^{-1} (Y_i - \mu_i) \]
    \[ \Sigma = \sum_{i} (Z_i, X_i)^{\top} R_w^{-1} (Y_i - \hat{\mu}_i)(Y_i - \hat{\mu}_i)^{\top} R_w^{-1} (Z_i, X_i) \tag{1} \]
    If the assumption of a common covariance matrix across $Y_i$ for all $i$ is valid, e.g. for quantitative continuous traits [Pan01], we can adopt a more efficient covariance estimator (the inner sum is written with its own index $l$ to avoid clashing with the outer index $i$):
    \[ \Sigma = \sum_{i=1}^{n} (Z_i, X_i)^{\top} \, \mathrm{var}(Y_i) \, (Z_i, X_i) = \sum_{i=1}^{n} (Z_i, X_i)^{\top} \left[ \frac{\sum_{l=1}^{n} (Y_l - \hat{\mu}_l)(Y_l - \hat{\mu}_l)^{\top}}{n} \right] (Z_i, X_i) = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} \tag{2} \]
    which is used by default for its better finite-sample performance [Pan01].
  • 20. Methods V (Journal Article 1)
    Testing the null hypothesis $H_0: \beta = (\beta_1, \beta_2, \ldots, \beta_p)^{\top} = 0$: under $H_0$ we fit the null model $g(\mu_i) = Z_i \phi$ to obtain $\hat{\phi}$ and predict $\hat{\mu}_i = g^{-1}(Z_i \hat{\phi})$. The score vector under a working independence model is then:
    \[ U(\hat{\phi}, 0) = (U_{.1}^{\top}, U_{.2}^{\top})^{\top} = \sum_{i=1}^{n} (U_{i1}^{\top}, U_{i2}^{\top})^{\top}, \qquad U_{.1} = \sum_{i} Z_i^{\top} (Y_i - \hat{\mu}_i), \qquad U_{.2} = \sum_{i} X_i^{\top} (Y_i - \hat{\mu}_i) \]
    As $U$ asymptotically follows a multivariate normal distribution under $H_0$, the score vector for $\beta$ also has an asymptotic normal distribution:
    \[ U_{.2} \sim N(0, \Sigma_{.2}), \qquad \Sigma_{.2} = \mathrm{Cov}(U_{.2}) = V_{22} - V_{21} V_{11}^{-1} V_{12}, \]
    where the blocks $V_{11}, V_{12}, V_{21}, V_{22}$ are defined in Equation 2.
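A minimal NumPy sketch of the score vector for $\beta$ under $H_0$ and of $\Sigma_{.2}$ built from Equation 2 may help make the notation concrete. It assumes a quantitative trait with identity link, a working independence correlation and complete data (no missing visits). The function name `null_score` and the array layout are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def null_score(Y, Z, X):
    """Score vector and covariance for beta under H0 (hypothetical helper).

    Y: (n, k) traits; Z: (n, k, q+1) covariates including the intercept;
    X: (n, k, p) genotypes (each subject's row vector repeated over k visits).
    """
    n, k = Y.shape
    q1 = Z.shape[2]
    # Fit the null model mu_i = Z_i phi by least squares (identity link).
    phi_hat, *_ = np.linalg.lstsq(Z.reshape(n * k, q1), Y.reshape(-1), rcond=None)
    resid = Y - Z @ phi_hat                       # (n, k) residuals Y_i - mu_hat_i
    U1 = np.einsum('nkq,nk->q', Z, resid)         # sum_i Z_i' (Y_i - mu_hat_i)
    U2 = np.einsum('nkp,nk->p', X, resid)         # sum_i X_i' (Y_i - mu_hat_i)
    # Equation 2: common k x k covariance of Y_i estimated from pooled residuals.
    R = resid.T @ resid / n
    H = np.concatenate([Z, X], axis=2)            # H_i = (Z_i, X_i)
    V = np.einsum('nka,kl,nlb->ab', H, R, H)      # sum_i H_i' R H_i
    V11, V12 = V[:q1, :q1], V[:q1, q1:]
    V21, V22 = V[q1:, :q1], V[q1:, q1:]
    Sigma2 = V22 - V21 @ np.linalg.solve(V11, V12)
    return U1, U2, Sigma2
```

Because the null model is fitted by least squares with the same $Z_i$, the returned `U1` is numerically zero, so only `U2` and `Sigma2` carry information for the test.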
  • 21. Methods VI (Journal Article 1)
    A general form of score-vector-based statistic can be written as:
    \[ T_w = w^{\top} U = \sum_{j=1}^{p} w_j U_j, \]
    where $w = (w_1, \ldots, w_p)^{\top}$ is a vector of weights for the $p$ SNVs [LT11], with special cases:
    \[ T_{Sum} = 1^{\top} U = \sum_{j=1}^{p} U_j, \qquad T_{SSU} = U^{\top} U = \sum_{j=1}^{p} U_j^2. \]
    These two tests are called the Sum test and the SSU test [Pan09].
  • 22. Methods VII (Journal Article 1)
    If we choose the weights to be $w_j = U_{.2,j}^{\gamma - 1}$ for a series of integer values $\gamma = 1, 2, \ldots, \infty$, we obtain the sum of powered score (U) tests, called SPU tests:
    \[ T_{SPU(\gamma)} = \sum_{j=1}^{p} U_{.2,j}^{\gamma - 1} U_{.2,j} \]
    When $\gamma \to \infty$ as an extreme situation, where $\infty$ is assumed to be an even number, we have
    \[ T_{SPU(\gamma)} \propto \|U\|_{\gamma} = \left( \sum_{j=1}^{p} |U_{.2,j}|^{\gamma} \right)^{1/\gamma} \to \|U\|_{\infty} = \max_{j=1}^{p} |U_{.2,j}| \equiv T_{SPU(\infty)}. \]
    In our experience, an SPU($\gamma$) test with a large $\gamma > 8$ usually gave similar results to the SPU($\infty$) test [PKZ+14], so we used $\gamma \in \Gamma = \{1, 2, \ldots, 8, \infty\}$ for the whole dissertation work.
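The SPU family can be sketched in a few lines of NumPy. The helper name `spu_stat` is an illustrative assumption; the formulas are the ones on the slide, with $\gamma = 1$ giving the Sum statistic, $\gamma = 2$ the SSU statistic, and $\gamma = \infty$ the maximum absolute score.

```python
import numpy as np

def spu_stat(U, gamma):
    """Sum of powered scores for one gamma (hypothetical helper).

    U: score vector U_.2 of length p; gamma: positive integer or np.inf.
    """
    U = np.asarray(U, dtype=float)
    if np.isinf(gamma):
        return np.max(np.abs(U))      # T_SPU(inf) = max_j |U_j|
    return np.sum(U ** gamma)         # T_SPU(gamma) = sum_j U_j^gamma

# The candidate set used throughout the slides.
GAMMAS = [1, 2, 3, 4, 5, 6, 7, 8, np.inf]
```

For example, `[spu_stat(U, g) for g in GAMMAS]` evaluates all nine statistics on one score vector.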
  • 23. Methods VIII (Journal Article 1)
    P-value calculation of the SPU test: let $T$ denote $T_{SPU(\gamma)}$ for a specific $\gamma$. Because the null distribution of SPU with $\gamma$ larger than 2 is difficult to derive, as is that of aSPU (introduced later), we resort to a simulation- or permutation-based method to obtain the null distribution. With the empirical distribution of $U_{.2}$ constructed, we can obtain the statistics under the null hypothesis,
    \[ T^{(b)} = \sum_{j=1}^{p} \left( U_{.2,j}^{(b)} \right)^{\gamma}, \]
    and calculate the p-value of $T_{SPU(\gamma)}$ as
    \[ P_{SPU(\gamma)} = \frac{\sum_{b=1}^{B} I\left(T^{(b)} \geq T^{obs}\right) + 1}{B + 1}. \]
  • 24. Methods IX (Journal Article 1)
    Empirical null distribution of the score vector: we can obtain the empirical distribution of the score vector $U_{.2}$ under the null hypothesis by either the simulation or the permutation procedure. In the simulation procedure, we draw $B$ samples of $U_{.2}$ from its asymptotic distribution:
    \[ U_{.2}^{(b)} \sim MVN\left(0, \hat{\Sigma}_{.2}\right), \qquad b = 1, 2, \ldots, B. \]
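The simulation procedure and the p-value formula for SPU($\gamma$) can be sketched together as follows. The function name `spu_pvalue_mc` and its arguments are illustrative assumptions. One deliberate deviation is flagged in the comments: the sketch compares absolute values of the statistics so that odd $\gamma$ gives a two-sided comparison; for even $\gamma$ and $\gamma = \infty$ this coincides with the formula on the slide.

```python
import numpy as np

def spu_pvalue_mc(U_obs, Sigma, gamma, B=1000, seed=0):
    """Monte Carlo p-value of SPU(gamma) (hypothetical helper).

    U_obs: observed score vector U_.2; Sigma: estimate of Cov(U_.2).
    """
    rng = np.random.default_rng(seed)
    U_obs = np.asarray(U_obs, dtype=float)
    # Draw B null score vectors from MVN(0, Sigma).
    U_null = rng.multivariate_normal(np.zeros(len(U_obs)), Sigma, size=B)

    def stat(U):
        if np.isinf(gamma):
            return np.max(np.abs(U), axis=-1)
        # Assumption: absolute value taken so odd gamma is two-sided.
        return np.abs(np.sum(U ** gamma, axis=-1))

    T_obs = stat(U_obs)
    T_null = stat(U_null)
    # (number of null statistics at least as extreme + 1) / (B + 1)
    return (np.sum(T_null >= T_obs) + 1) / (B + 1)
```

The smallest p-value this estimator can return is $1/(B+1)$, which is why a larger $B$ is needed to resolve very small p-values.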
  • 25. Methods X (Journal Article 1)
    The permutation procedure is more robust to failure of the asymptotic property, e.g., in the analysis of rare variants. It can be implemented as follows:
    1. Identify the maximum $k$ across all $n$ subjects, which is the number of longitudinal measurements, e.g. $k = 4$.
    2. Detect whether the data have missing values; if so, fill each missing value with NA to complete the data dimension. For example, subject $i$ with $Y_i = (y_{i,1}, \cdot, \cdot, y_{i,4})^{\top}$ has two missing measurements, at time 2 and time 3; after filling, it becomes $Y_i = (y_{i,1}, NA, NA, y_{i,4})^{\top}$. Every subject now has $Y_i$ of dimension $k \times 1$.
    3. Complete $H_i$ to full dimension, i.e. $k \times (p + q + 1)$, for covariates and SNPs. We now have $(Y_i, H_i)$ as an augmented matrix of dimension $k \times (p + q + 2)$ for each subject $i$, where $H_i = (Z_i, X_i)$. For all $n$ subjects, we have the row-wise bound matrix
    \[ M = \begin{pmatrix} Y_1 & H_1 \\ Y_2 & H_2 \\ \vdots & \vdots \\ Y_n & H_n \end{pmatrix} \]
    of dimension $nk \times (p + q + 2)$.
  • 26. Methods XI (Journal Article 1)
    4. Permute the genotype codes across different individuals, i.e. exchange the $X_i$ in $(Y_i, Z_i, X_i)$ with the $X_j$ in $(Y_j, Z_j, X_j)$, where $i \neq j$.
    5. With the permuted matrix
    \[ M^{*(b)} = \begin{pmatrix} Y_1 & Z_1, X_1^{*(b)} \\ Y_2 & Z_2, X_2^{*(b)} \\ \vdots & \vdots \\ Y_n & Z_n, X_n^{*(b)} \end{pmatrix} \]
    we recalculate $U_{.2}^{*(b)}$ by
    \[ U_{.2}^{*(b)} = \sum_{i} \left( X_i^{*(b)} \right)^{\top} (Y_i - \hat{\mu}_i). \]
    6. Repeat steps 4 and 5 $B$ times to produce $U_{.2}^{*(b)}$ with $b = 1, 2, \ldots, B$.
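The permutation steps above can be sketched as follows, under stated assumptions: the null-model residuals $Y_i - \hat{\mu}_i$ have already been computed and padded to $k$ columns with NaN for missing visits, and each subject's genotype row vector $x_i$ is constant across visits, so that $X_i^{\top}(Y_i - \hat{\mu}_i) = x_i^{\top} \sum_m (y_{im} - \hat{\mu}_{im})$. The function name `permuted_scores` is hypothetical.

```python
import numpy as np

def permuted_scores(resid, x, B=1000, seed=0):
    """Permutation null draws of U_.2 (hypothetical helper).

    resid: (n, k) null residuals, NaN where a visit is missing;
    x: (n, p) genotype row vector per subject (same for all k visits).
    Returns a (B, p) array whose b-th row is U_.2^{*(b)}.
    """
    rng = np.random.default_rng(seed)
    n, p = x.shape
    # Missing visits contribute nothing to the score, so sum over observed ones.
    r_sum = np.nansum(resid, axis=1)              # (n,)
    U_perm = np.empty((B, p))
    for b in range(B):
        idx = rng.permutation(n)                  # shuffle genotypes across subjects
        U_perm[b] = x[idx].T @ r_sum              # sum_i x*_i' * r_sum_i
    return U_perm
```

Only the genotype rows are shuffled; the traits and covariates stay attached to their original subjects, which preserves the within-subject correlation of the repeated measures.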
  • 27. Methods XII (Journal Article 1)
    The longitudinal adaptive SPU test (LaSPU): although we have a list of SPU($\gamma$) statistics and p-values, we do not know which one is the most powerful under a specific, unknown association pattern. It is therefore convenient to have a test that adaptively and automatically selects the best SPU($\gamma$) test. We propose a longitudinal adaptive SPU (LaSPU) test:
    \[ T_{aSPU} = \min_{\gamma \in \Gamma} P_{SPU(\gamma)}. \]
  • 28. Methods XIII (Journal Article 1)
    P-value calculation of the LaSPU test: similarly,
    \[ P_{SPU(\gamma)}^{(b)} = \frac{\sum_{b_1 \neq b} I\left( T_{SPU(\gamma)}^{(b_1)} \geq T_{SPU(\gamma)}^{(b)} \right) + 1}{(B - 1) + 1} \]
    for every $\gamma$ and every $b$. Then we have
    \[ T_{aSPU}^{(b)} = \min_{\gamma \in \Gamma} P_{SPU(\gamma)}^{(b)}, \]
    and the final p-value of the LaSPU test is:
    \[ P_{LaSPU} = \frac{\sum_{b=1}^{B} I\left( T_{aSPU}^{(b)} \leq T_{aSPU}^{obs} \right) + 1}{B + 1}. \]
    It is worth noting again that the same $B$ simulated score vectors ($U_{.2}$) are used in calculating $P_{LaSPU}$.
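The reuse of one set of $B$ null draws for both the per-$\gamma$ p-values and the adaptive p-value can be sketched as below. The function name `aspu_pvalue` and the input layout are illustrative assumptions; the inputs are the SPU statistics themselves, compared exactly as in the formulas on the slide.

```python
import numpy as np

def aspu_pvalue(T_obs, T_null):
    """Adaptive SPU p-value from shared null draws (hypothetical helper).

    T_obs: (G,) observed SPU statistics, one per gamma;
    T_null: (B, G) null SPU statistics from the same B score vectors.
    """
    T_obs = np.asarray(T_obs, dtype=float)
    T_null = np.asarray(T_null, dtype=float)
    B = T_null.shape[0]
    # Per-gamma p-values of the observed statistics.
    p_obs = (np.sum(T_null >= T_obs, axis=0) + 1) / (B + 1)
    # For each draw b, count the other draws b1 != b that are at least as large.
    ge = T_null[:, None, :] >= T_null[None, :, :]     # ge[b1, b, g]
    counts = ge.sum(axis=0) - 1                       # drop the self-comparison
    p_null = (counts + 1) / B                         # denominator (B - 1) + 1
    t_obs = p_obs.min()                               # T_aSPU observed
    t_null = p_null.min(axis=1)                       # T_aSPU for each draw
    return (np.sum(t_null <= t_obs) + 1) / (B + 1)
```

The pairwise comparison uses $O(B^2 |\Gamma|)$ memory, which is acceptable for $B$ in the low thousands; a rank-based computation would be a natural substitute for larger $B$.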
  • 29. Methods XIV (Journal Article 1)
    The stage-wise genome-wide scan strategy: in practice, to keep a genome-wide computation feasible, we can use a stage-wise aSPU test strategy:
    1. Start with a smaller $B$, say $B = 1000$.
    2. Increase $B$ to, say, $10^6$ for just the few sets of SNPs that pass a pre-determined significance cutoff (e.g., p-value $\leq 5/B$) in step 1.
    3. Repeat step 2 until a pre-determined $B$ is reached.
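The stage-wise strategy can be sketched as a small driver loop. The function name `stagewise_pvalue`, the callback `pvalue_fn` (any function that returns a p-value for a given number of draws $B$) and the tenfold growth factor are assumptions made for illustration; the slide only specifies a small starting $B$, the cutoff $5/B$ and a pre-determined maximum.

```python
def stagewise_pvalue(pvalue_fn, B_start=1000, B_max=10**6, factor=10, c=5):
    """Escalate B only for SNP sets that look significant (hypothetical helper).

    pvalue_fn: callable taking B and returning a p-value computed with B draws.
    Returns the final p-value and the B at which it was computed.
    """
    B = B_start
    while True:
        p = pvalue_fn(B)
        # Stop when the set fails the cutoff p <= c / B, or B has hit its cap.
        if p > c / B or B >= B_max:
            return p, B
        B = min(B * factor, B_max)   # assumed growth schedule
```

Most SNP sets stop at the first stage, so the expensive large-$B$ computation is spent only on the few candidates that survive.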
  • 30. Methods XV (Journal Article 1)
    Other versions of the LaSPU test:
    LaSPUw test. The SPUw test is a diagonal-variance-weighted version of the SPU test, defined as:
    \[ T_{SPUw(\gamma)} = \sum_{j=1}^{p} \left( \frac{U_{.2,j}}{\hat{\Sigma}_{.2,jj}} \right)^{\gamma} \]
    LaSPU(w).Score test:
    \[ T_{aSPU.Score} = \min\left\{ \min_{\gamma \in \Gamma} P_{SPU(\gamma)}, \; P_{Score} \right\} \]
    LaSPU omnibus test:
    \[ T_{aSPU.o} = \min\left\{ \min_{\gamma \in \Gamma} P_{SPU(\gamma)}, \; \min_{\gamma \in \Gamma} P_{SPUw(\gamma)}, \; P_{Score} \right\} \]
  • 32. Simulation Methods I (Journal Article 1). Simulation of genotype data following [WE07]. Figure: Demo graph of genotype simulation
  • 33. Simulation Methods II (Journal Article 1)
    Simulation of phenotype data: we set up the mixed-effects model to achieve the AR(1) correlation structure as
    \[ y_{im} = \mu_i + b_i + \underbrace{\rho e_{i,m-1} + s_{i,m}}_{e_{i,m}}, \tag{3} \]
    where $m = 1, \ldots, k$ indexes the longitudinal measurements within subject $i$; $\mu_i = Z_i \phi + X_i \beta = H_i \theta$ as in the quantitative trait case; $b_i$ is the random intercept representing the subject-level random effect; and $\rho e_{i,m-1} + s_{i,m} = e_{i,m}$, where $\rho$ is the lag-one autocorrelation coefficient, so we can plug in our estimate from real data by setting $\rho = 0.7$. We assume the following distribution:
  • 34. Simulation Methods III (Journal Article 1)
    \[ b_i \sim N(0, \sigma_b^2), \qquad e_{i,m} \sim N(0, \sigma_e^2), \qquad s_{i,m} \sim N\left(0, (1 - \rho^2) \sigma_e^2\right) \]
    Under this assumption, the variance-covariance matrix across longitudinal measurements becomes (assuming $k = 4$ for the number of longitudinal measurements):
    \[ \Sigma_{4 \times 4} = \mathrm{Var} \begin{pmatrix} b_i + e_{i1} \\ b_i + \rho e_{i1} + s_{i2} \\ b_i + \rho e_{i2} + s_{i3} \\ b_i + \rho e_{i3} + s_{i4} \end{pmatrix} = \sigma_b^2 \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} + \sigma_e^2 \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix} \tag{4} \]
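Equation 4 can be checked numerically. The sketch below builds the closed-form covariance and simulates the random part of Equation 3 (random intercept plus AR(1) errors, without the mean $\mu_i$). The function names and default seed are illustrative assumptions; the parameter values in the usage note are the ones listed in the simulation setup ($\rho = 0.7$, $\sigma_b = 7$, $\sigma_e = 27$, $k = 4$).

```python
import numpy as np

def ar1_random_intercept_cov(k, rho, sigma_b, sigma_e):
    """Closed-form covariance of Equation 4 for k visits (hypothetical helper)."""
    lags = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
    return sigma_b**2 * np.ones((k, k)) + sigma_e**2 * rho**lags

def simulate_errors(n, k, rho, sigma_b, sigma_e, seed=0):
    """Simulate b_i + e_{i,m} for n subjects and k visits (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma_b, size=n)                  # random intercepts
    e = np.empty((n, k))
    e[:, 0] = rng.normal(0.0, sigma_e, size=n)            # stationary start
    for m in range(1, k):
        s = rng.normal(0.0, np.sqrt(1 - rho**2) * sigma_e, size=n)
        e[:, m] = rho * e[:, m - 1] + s                   # AR(1) recursion
    return b[:, None] + e
```

With the slide's values, `ar1_random_intercept_cov(4, 0.7, 7.0, 27.0)` has $\sigma_b^2 + \sigma_e^2 = 778$ on the diagonal, and `np.cov(simulate_errors(200000, 4, 0.7, 7.0, 27.0), rowvar=False)` should approach the same matrix as the number of simulated subjects grows.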
  • 35. Simulation Methods IV (Journal Article 1)
    Summary of parameter setup in simulation studies:
    $h_j^2 = 0.001$
    $\sigma_b = 7$
    $\sigma_e = 27$
    $n$ varies between 500 and 3000
    $k = 4$
    1000 replicates of simulated dataset (5000 replicates as optional validation, result not shown)
    $\alpha = 0.05$
    $\rho_y = 0.7$
    $\rho_x = 0.8$
    $R$ = AR(1)
    $R_w = I$
  • 37. Simulation Results I (Journal Article 1)
    Simulation setting A: B = 1000 for simulation of U; sample size of 500, 1000, 2000 and 3000; 5/15 + 0/35 SNPs allocation; 0.05 < MAF < 0.4; $h_j^2 = 0$; $\alpha = 0.05$
    Table: Empirical Type I Error Table in the simulation setting A
    n      pSSU   pSSUw  pScore pSum   pUminP LaSPU  LaSPUw LaSPU.sco LaSPU.omni
    500    0.047  0.048  0.047  0.052  0.048  0.060  0.058  0.058     0.058
    1000   0.047  0.046  0.044  0.058  0.048  0.057  0.057  0.057     0.057
    2000   0.049  0.047  0.051  0.048  0.048  0.061  0.058  0.058     0.058
    3000   0.051  0.052  0.052  0.052  0.050  0.063  0.060  0.059     0.059
  • 38. Simulation Results II (Journal Article 1). Simulation setting B: $h_j^2 = 0.001$. Figure: Empirical power benchmark under different n in the simulation setting B. (Plot title: "Power Benchmark on different sample size"; x-axis: Sample Size at 500, 1000, 2000, 3000; y-axis: Empirical Power from 0.0 to 1.0; legend, aSPU family: aspu_P, aspu_weighted_P, aspu_sco_P, aspu_weighted_sco_P; others: geescoreP, UminP, Sum Test, w.Sum Test, SSU.)
  • 39. Simulation Results III (Journal Article 1)
      Simulation setting C: testing with a growing number of null SNPs (in the second block)
      Figure: Empirical power benchmark under an increased number of null SNPs in simulation setting C. x-axis: number of null SNPs (50, 75, 100, 200, 400); y-axis: empirical power. Legend: aSPU family (aspu_P, aspu_weighted_P, aspu_sco_P, aspu_weighted_sco_P); others (geescoreP, UminP, Sum Test, w.Sum Test, SSU).
  • 40. Simulation Results IV (Journal Article 1)
      Simulation setting D: assessing the power gain from longitudinal measurements, which decrease the residual variance via repeated measures
      Figure: Power increases from repeated measurements in simulation setting D. x-axis: sample size (500, 1000, 2000, 3000); y-axis: empirical power; curves labeled 1 to 5.
  • 41. Simulation Results for Rare Variants I (Journal Article 1)
      Simulation setting A':
      5/15 with 5 included in the test
      0.001 < MAF < 0.01
      permutation-based method to calculate the P-value
      Table: Empirical Type I error rates in simulation setting A'.
      n     pSSU   pSSUw  pScore  pSum   pUminP  LaSPU  LaSPUw  LaSPU.sco  LaSPU.omni
      500   0.053  0.054  0.052   0.049  0.047   0.054  0.053   0.060      0.058
      1000  0.055  0.040  0.042   0.048  0.054   0.047  0.045   0.052      0.051
      2000  0.054  0.050  0.048   0.049  0.046   0.063  0.057   0.058      0.057
      3000  0.045  0.044  0.039   0.060  0.053   0.049  0.053   0.049      0.053
  • 42. Simulation Results for Rare Variants II (Journal Article 1)
      Simulation setting B'
      Figure: Empirical power benchmark under different n in simulation setting B'. x-axis: sample size (500, 1000, 2000, 3000); y-axis: empirical power. Legend, aSPU family (curves k to p, in legend order): aspu_weighted_sco_P, aspu_weighted_P, aspu_sco_P, aspu_P, aspu_aspuw.sco_P, aspu_aspuw_P; others (curves a to j, in legend order): UminP, mvn.UminP, pSum, pSSUw, pSSU, pScore, SPU(2), SPUw(2), SPUw(1), SPU(1).
  • 44. Application to the ARIC study I (Journal Article 1)
      Recap of the ARIC data
      Figure: HDL-C (0 to 150) by visit (1 to 4), shown separately for carriers and non-carriers. Panels: LPL's effect on HDL-C; LIPC's effect on HDL-C.
  • 45. Application to the ARIC study II (Journal Article 1)
      Phenotype: HDL-C levels measured at four visits in each of n = 11,478 ARIC EA subjects
      Genotype: rare/low-frequency variants (MAF < 5%) on the ExomeChip
      Statistical methods: LaSPU proposed here and its extensions
      Baseline (aSPU) vs. all four measures (LaSPU)
      Gene-based association analysis
  • 46. Application to the ARIC study III (Journal Article 1)
      Methods in data application:
      RV defined by MAF < 0.05
      Inclusion of only nonsynonymous and splice-site variants
      QC procedures as in [PAB+14]
      Additive genetic model
      Covariates include age, age^2, gender, BMI, time of measurement and the top two PCs
      Gene boundaries defined as in [PAB+14]
      Significant: 0.05/25,000 = 2e-06; marginally significant: 1e-03
  • 47. Application to the ARIC study IV (Journal Article 1)
      Results in data application: comparing the longitudinal method with the baseline method
      Figure: Manhattan plot comparison between the baseline study and the longitudinal study by the LaSPU test of the association between HDL-C and rare variants in the ARIC study. A. baseline study; B. longitudinal study using all four measurements.
      Plot annotations: LIPC: previously not reported. LPL, LIPG, ANGPTL4: previously reported; FAM65A: previously not reported.
  • 48. Application to the ARIC study V (Journal Article 1)
      Results in data application: comparing the longitudinal method with the baseline method
      Table: Top gene-based association results, ordered by level of statistical significance.
      Gene      Chr  p Value   No. Variants(a)  CMAC(b)  CMAF(c)  p Value of Baseline(d)
      LPL       8    1.00E-06  10               879      0.00807  9.99E-04
      FAM65A*   16   1.00E-06  11               751      0.00627  3.79E-04
      LIPG      18   1.00E-06  11               369      0.00308  3.13E-02
      ANGPTL4   19   1.00E-06  9                579      0.00591  2.89E-01
      ANGPTL8   19   1.06E-04  5                64       0.00118  2.07E-01
      APOC3     11   5.87E-04  3                21       0.00064  9.99E-04
      PAFAH1B2  11   2.19E-03  3                287      0.00879  1.50E-02
      (a) number of variants contributing to the test
      (b) cumulative minor allele count of the variants contributing to the test
      (c) cumulative minor allele frequency of the variants contributing to the test
      (d) aSPU test using only the baseline measurement of HDL-C
      * newly identified gene(s)
  • 49. Application to the ARIC study VI (Journal Article 1)
      Gene effect visualization of previously reported genes
      Figure: HDL-C by visit (1 to 4) for rare-variant carriers vs. non-carriers, panels A to F: LPL's effect on HDL-C; LIPG's effect on HDL-C; ANGPTL4's effect on HDL-C; ANGPTL8's effect on HDL-C; APOC3's effect on HDL-C; PAFAH1B2's effect on HDL-C.
  • 50. Application to the ARIC study VII (Journal Article 1)
      Results in data application: comparing LaSPU with the LSSU method
      Figure: Manhattan plot comparison between the LaSPU test and the LSSU test of the association between HDL-C and rare variants in the ARIC study. A. result by LaSPU; B. result by LSSU.
      Plot annotations: first, LPL, LIPG, ANGPTL4: previously reported; FAM65A: previously not reported. Second, LPL, LIPG, ANGPTL4: previously reported.
  • 51. Application to the ARIC study VIII (Journal Article 1)
      Gene effect visualization of previously unreported genes
      A: FAM65A, identified only by the LaSPU and LSum tests; B: LIPC, identified only by aSPU (baseline analysis)
      Figure: HDL-C by visit (1 to 4) for rare-variant carriers vs. non-carriers. Panels: A. FAM65A's effect on HDL-C; B. LIPC's effect on HDL-C.
  • 53. Discussion I (Journal Article 1)
      LaSPU could efficiently use the longitudinal information to increase testing power
      LaSPU could adaptively select the best test and achieve greater power
      LaSPU proved its ability in real data analysis by identifying the same reported genes associated with HDL-C using the ARIC data set only
      LaSPU identified one novel gene associated with HDL-C, FAM65A, which warrants further validation
  • 55. Journal Article 2
      Title of Journal Article: Pathway-based data-adaptive association tests for longitudinal phenotypes.
  • 57. Methods I (Journal Article 2)
      Pathway idea illustration
      Figure: Aggregation of SNPs in a Pathway. Schematic with three levels: SNPs (SNP1, SNP2, SNP3, ..., SNP998, SNP999, SNP1000), genes (Gene1, ..., Gene100) and a pathway (Pathway1, containing Gene1, Gene16, Gene36, Gene66 and Gene100).
  • 58. Methods II (Journal Article 2)
      A pathway analysis involves multiple genes (e.g., 20 is a typical number). Because the genes within a pathway may contain different numbers of RVs, we need to modify the aSPU test to adjust for gene length, so that no large (or small) gene dominates the statistic.
      As shorthand, let $U_{g\cdot}$ stand for $U_{\cdot 2}$, the part of the whole score vector that corresponds to the RVs $X_i$, restricted to gene $g$: $U_{g\cdot} = (U_{g,1}, U_{g,2}, \ldots, U_{g,p_g})$ is the score vector for gene $g$ with its $p_g$ RVs. Given a pathway (or gene set) $S$, the gene-specific SPU statistic is
      $$T_{\mathrm{SPU}(\gamma; g)} \propto \|U_{g\cdot}\|_{\gamma} = \left( \frac{\sum_{j=1}^{p_g} |U_{g,j}|^{\gamma}}{p_g} \right)^{1/\gamma} \qquad (5)$$
      Accordingly, the pathway-based SPU statistic is
      $$T_{\mathrm{Path\text{-}SPU}(\gamma, \gamma_2; S)} = \sum_{g \in S} \left( T_{\mathrm{SPU}(\gamma; g)} \right)^{\gamma_2} \qquad (6)$$
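Equations (5) and (6) can be sketched in a few lines of Python. This is only an illustration of the formulas as written on the slide, not the LaSPU package's code; the function names and the toy score vectors are hypothetical.

```python
def gene_spu(scores, gamma):
    """Gene-level SPU statistic, Eq. (5): (sum_j |U_gj|^gamma / p_g)^(1/gamma)."""
    p_g = len(scores)  # number of RVs in the gene
    return (sum(abs(u) ** gamma for u in scores) / p_g) ** (1.0 / gamma)


def pathway_spu(gene_scores, gamma, gamma2):
    """Pathway-level SPU statistic, Eq. (6): sum over genes of T_SPU^gamma2."""
    return sum(gene_spu(u_g, gamma) ** gamma2 for u_g in gene_scores)


# Hypothetical score vectors for a pathway with three genes of different sizes.
toy_pathway = [
    [3.0, -4.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0],
]
t_value = pathway_spu(toy_pathway, gamma=2, gamma2=1)
print(round(t_value, 4))
```

Dividing by p_g inside the gene-level statistic is what keeps a gene with many RVs from dominating the pathway sum; the second gene above has four variants but contributes exactly 1.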
  • 59. Methods III (Journal Article 2)
      A data-adaptive pathway-based longitudinal association test: LaSPUpath
      The pathway-based aSPU statistic is thus
      $$T_{\mathrm{Path\text{-}aSPU}(S)} = \min_{\gamma, \gamma_2} P_{\mathrm{Path\text{-}SPU}(\gamma, \gamma_2; S)} \qquad (7)$$
      We propose to use $\gamma_2 \in \Gamma_2 = \{1, 2, 4, 8\}$. The set {1, 2, 4, 8} covers a Sum-like test, an SSU-like test, and two more tests that favor the sparse-causal-gene situation (e.g., only 2 or 3 genes are associated with the trait in a pathway of, say, 20 genes).
      For any given $\gamma$ and $\gamma_2$, we resort to a simulation- or permutation-based strategy similar to that of the LaSPU test to calculate the P-values of a class of SPUpath tests, and then calculate the final P-value of the LaSPUpath test.
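The two-step P-value calculation can be sketched as follows. The sketch assumes that the statistics for every (γ, γ2) pair have already been computed on the observed data and on B null draws (simulated or permuted), and that a single layer of null draws is reused both for the individual P-values and for the null distribution of the minimum P-value. That reuse is an assumption made here for illustration; the slides do not spell out the exact scheme, and the function name is hypothetical.

```python
import bisect


def adaptive_min_p(observed, null_stats):
    """Combine K test statistics into one adaptive P-value.

    observed:   list of K statistics, one per (gamma, gamma2) pair.
    null_stats: list of B rows, each with K statistics from one null draw.
    Larger statistics are treated as more extreme.
    """
    n_null = len(null_stats)
    n_tests = len(observed)
    obs_p = []
    null_p = [[0.0] * n_tests for _ in range(n_null)]
    for k in range(n_tests):
        column = sorted(row[k] for row in null_stats)
        # Number of null statistics at least as large as the observed one.
        n_ge = n_null - bisect.bisect_left(column, observed[k])
        obs_p.append((n_ge + 1) / (n_null + 1))
        for b in range(n_null):
            # Number of OTHER null draws at least as large as draw b.
            n_ge_b = n_null - bisect.bisect_left(column, null_stats[b][k]) - 1
            null_p[b][k] = (n_ge_b + 1) / n_null
    t_obs = min(obs_p)                      # Eq. (7) on the observed data
    t_null = [min(row) for row in null_p]   # Eq. (7) on each null draw
    final_p = (sum(t <= t_obs for t in t_null) + 1) / (n_null + 1)
    return obs_p, final_p


# Hypothetical toy input: B = 99 null draws and K = 2 candidate tests.
null_draws = [[float(b), float(b)] for b in range(1, 100)]
obs_p, final_p = adaptive_min_p([1000.0, 50.0], null_draws)
print(obs_p, final_p)
```

Taking the minimum over the candidate tests and then calibrating that minimum against the same null draws is what keeps the adaptive test from paying an uncorrected multiple-testing price for trying several (γ, γ2) pairs.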
  • 61. Simulation Methods I (Journal Article 2)
      We simulated a pathway with 20 independent genes. Each gene g contained $p_g$ SNPs, with $p_g$ randomly drawn from a uniform distribution U(3, 20); 5 of the 20 genes were randomly selected to be causal, with each causal gene containing U(1, 3) causal SNPs. The other parts of the simulation are the same as in Article 1.
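The pathway layout described above can be sketched as follows. The sketch assumes that U(3, 20) and U(1, 3) are discrete uniform distributions on the integers, endpoints included; the function name and the seed are hypothetical, and genotype and phenotype generation are not covered.

```python
import random


def simulate_pathway_layout(n_genes=20, n_causal_genes=5, seed=2015):
    """Draw gene sizes and causal SNP positions for one simulated pathway."""
    rng = random.Random(seed)
    # p_g ~ U(3, 20): number of SNPs in each gene (assumed integer-valued).
    gene_sizes = [rng.randint(3, 20) for _ in range(n_genes)]
    # Pick the causal genes without replacement.
    causal_genes = sorted(rng.sample(range(n_genes), n_causal_genes))
    causal_snps = {}
    for g in causal_genes:
        n_causal = rng.randint(1, 3)  # U(1, 3) causal SNPs per causal gene
        causal_snps[g] = sorted(rng.sample(range(gene_sizes[g]), n_causal))
    return gene_sizes, causal_snps


gene_sizes, causal_snps = simulate_pathway_layout()
print(gene_sizes)
print(causal_snps)
```

Because every gene has at least 3 SNPs, sampling up to 3 causal SNPs per causal gene never runs out of positions.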
  • 62. Simulation Methods II (Journal Article 2)
      Summary of parameter setup in simulation studies:
      $h_j^2 = 0$ or $h_j^2 = 0.001, 0.0025, 0.005, 0.0075$ and $0.010$
      $\sigma_b = 7$
      $\sigma_e = 27$
      k = 5
      1000 replicates of simulated dataset
      B = 1000
      α = 0.05
      $\rho_y = 0.7$
      $\rho_x = 0.8$
      R = AR(1)
      $R_w = I$
      0.05 < MAF < 0.4 for CVs; 0.001 < MAF < 0.01 for RVs
      causal SNPs are excluded from testing for CVs but retained for RVs
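As a small illustration of the two correlation structures named in the list, the sketch below builds an AR(1) matrix and an identity matrix. It assumes that k = 5 is the number of repeated measurements, that ρy = 0.7 is the AR(1) parameter between adjacent measurements, and that R and Rw are the true and working correlation matrices; the slide does not state these interpretations explicitly, and the function names are hypothetical.

```python
def ar1_correlation(n_visits, rho):
    """AR(1) correlation matrix: entry (s, t) equals rho^|s - t|."""
    return [[rho ** abs(s - t) for t in range(n_visits)] for s in range(n_visits)]


def identity_correlation(n_visits):
    """Independence correlation matrix (the identity)."""
    return [[1.0 if s == t else 0.0 for t in range(n_visits)] for s in range(n_visits)]


# Assumed reading of the slide: k = 5 measurements, rho_y = 0.7.
R_true = ar1_correlation(5, 0.7)
R_working = identity_correlation(5)
for row in R_true:
    print([round(value, 3) for value in row])
```

Under AR(1) the correlation decays geometrically with the gap between measurements, so the first and last of five measurements are correlated at 0.7^4, roughly 0.24.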
  • 64. Simulation Results (Journal Article 2)
      Simulation setting A:
      0.05 < MAF < 0.4
      $h_j^2 = 0$
      α = 0.05
      Table: Empirical Type I error rates in simulation set-up A.
      pSSU   pSSUw  pScore  pSum   pUminP  LaSPUpath
      0.037  0.042  0.025   0.049  0.050   0.053
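One way to read the Type I error rates above is to compare them with the Monte Carlo sampling error expected at the nominal level. The sketch below assumes 1000 replicates (the number given in the parameter summary for Journal Article 2) and a normal approximation to the binomial; the helper name is hypothetical and the check is not part of the original analysis.

```python
import math


def monte_carlo_band(alpha, n_replicates, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return alpha - z * se, alpha + z * se


# Empirical Type I error rates copied from the table for simulation set-up A.
rates = {
    "pSSU": 0.037,
    "pSSUw": 0.042,
    "pScore": 0.025,
    "pSum": 0.049,
    "pUminP": 0.050,
    "LaSPUpath": 0.053,
}
lower, upper = monte_carlo_band(alpha=0.05, n_replicates=1000)
outside = [name for name, rate in rates.items() if not lower <= rate <= upper]
print(round(lower, 4), round(upper, 4), outside)
```

Under these assumptions the band is roughly 0.036 to 0.064, so only pScore (0.025) falls outside it, on the conservative side; LaSPUpath at 0.053 is within sampling error of the nominal 0.05.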
  • 65. Simulation Results (Journal Article 2)
      Simulation setting B
      Figure: Empirical power benchmark under different heritability ($h^2$) in simulation set-up B. x-axis: heritability (0.0010, 0.0025, 0.0050, 0.0075, 0.0100); y-axis: empirical power. Legend: f = LaSPUpath; others: a = pSSU, b = pSSUw, c = pScore, d = pSum, e = pUminP.
  • 67. Application to the ARIC study I (Journal Article 2)
      Methods in data application:
      RV defined by MAF < 0.05
      Inclusion of only nonsynonymous and splice-site variants
      QC procedures as in [PAB+14]
      Additive genetic model
      Covariates include age, age^2, gender, BMI, time of measurement and the top two PCs
      Biological pathways obtained from the KEGG database [OGS+99]
      Pathways trimmed to those of moderate size (between 10 and 500 genes)
      Gene boundaries defined as in [PAB+14]
      Significant: 0.05/197 = 0.00025; marginally significant: 1e-03
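The significance cut-off in the last item is a Bonferroni correction: the family-wise level 0.05 divided by the number of tests. A minimal sketch of the arithmetic, with a hypothetical helper name, covering the pathway-level count of 197 and the gene-level count of 25,000 used in Journal Article 1:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


pathway_cutoff = bonferroni_threshold(0.05, 197)    # 197 KEGG pathways tested
gene_cutoff = bonferroni_threshold(0.05, 25000)     # 25,000 gene-level tests
print(f"{pathway_cutoff:.6f}", f"{gene_cutoff:.1e}")
```

The pathway-level value is about 0.000254, which the slide rounds to 0.00025.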
  • 68. Application to the ARIC study II (Journal Article 2)
      Results in data application: significant pathways
      Table: Results of the ARIC data application: KEGG pathways with p Value < 0.00025.
      KEGG ID   Pathway Name             No. of Genes  No. of SNPs  p Value   Contributing Genes(a)
      hsa00561  Glycerolipid metabolism  44            311          1.00E-06  LPL, LIPG, LIPC, DGKQ, PPAP2A, PNLIPRP3
      hsa03320  PPAR signaling pathway   55            465          1.00E-06  LPL, ANGPTL4, NR1H3, CD36, APOA1
      hsa05010  Alzheimer's disease      98            747          1.00E-06  LPL, APOE, BID, PLCB3, NDUFS8, NDUFS3, NDUFB6, RYR3, NCSTN
      (a) p Value of the gene < 0.05
      hsa00561: belongs to the lipid metabolism class; associated with hyperlipoproteinemia, type I.
      hsa03320: reported to play a role in the clearance of circulating or cellular lipids via the regulation of gene expression; associated with multiple lipid-related diseases such as hyperalphalipoproteinemia.
      hsa05010: a number of studies report that a reduced level of HDL-C is highly correlated with the severity of AD; atherosclerosis also increases the risk of AD.
  • 70. Discussion I (Journal Article 2)
      LaSPUpath could efficiently use the longitudinal information to increase testing power
      LaSPUpath could efficiently use biological information (such as pathways) to increase testing power
      LaSPUpath could adaptively select the best test and achieve greater power
      LaSPUpath identified three pathways significantly associated with HDL-C. The annotated functions of the pathways are closely related to HDL-C, either as regulators of lipids (including HDL-C) and/or as contributors to lipid-related diseases such as hyperlipoproteinemia and Alzheimer's disease.
      LaSPUpath could be used in more general gene-set-based testing, for example with annotations of functional variation available in ENCODE (https://www.encodeproject.org/) and epigenome activities collected in the Roadmap Epigenomic Data at NCBI/GEO (http://www.roadmapepigenomics.org/).
  • 72. Journal Article 3
      Title of Journal Article: LaSPU: a suite of powerful data-adaptive SNP-set and pathway-based association testing tools for longitudinal traits.
  • 73. Features (Journal Article 3)
      implements the LaSPU method
      compatible with Unix-like systems
      command-line operated program
      optimized, with a flexible parallel computing scheme
      user-friendly manual and examples
  • 74. Workflow (Journal Article 3)
      Figure: LaSPU Workflow Chart. Steps shown in the chart:
      1. Computation environment setup: serial; single node with multiple cores; multiple nodes with multiple cores
      2. Data integration: genotype; genotype annotation; longitudinal trait; covariate
      3. Ad hoc QC for genotype data: missing rate; monomorphic site; set with not enough SNPs
      4. GEE fitting of the longitudinal data under H0 to derive the score vector and variance-covariance matrix
      5. (Optional) permutation procedure to obtain the null distribution of the score vector for LaSPU
      6. Execute five tests (default): Sum test; SSU(w) tests; UminP test; Score test; LaSPU
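The genotype QC step in the chart names three filters: missing rate, monomorphic sites, and sets with too few SNPs. The sketch below shows what such filters could look like. It is not the LaSPU package's code; the function names, the genotype coding (0/1/2 minor-allele counts with None for missing) and the thresholds are all hypothetical.

```python
def passes_site_qc(genotypes, max_missing_rate):
    """Check one SNP: reject if too many calls are missing or the site is monomorphic."""
    n_total = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n_total == 0 or not observed:
        return False
    missing_rate = (n_total - len(observed)) / n_total
    if missing_rate > max_missing_rate:
        return False
    # Monomorphic: only one allele is seen among the observed genotypes.
    minor_allele_count = sum(observed)
    return 0 < minor_allele_count < 2 * len(observed)


def filter_snp_set(snp_set, min_snps=2, max_missing_rate=0.1):
    """Apply site-level QC to a SNP set; return None if too few SNPs survive."""
    kept = {
        name: genotypes
        for name, genotypes in snp_set.items()
        if passes_site_qc(genotypes, max_missing_rate)
    }
    return kept if len(kept) >= min_snps else None


# Hypothetical gene with four SNPs genotyped on ten subjects.
toy_gene = {
    "snp_ok_1": [0, 1, 0, 0, 2, 0, 0, 0, 1, 0],
    "snp_ok_2": [0, 0, 0, 1, 0, 0, 0, 0, 0, None],
    "snp_monomorphic": [0] * 10,
    "snp_missing": [None, None, None, 0, 1, 0, 0, 0, 0, 0],
}
kept = filter_snp_set(toy_gene)
print(sorted(kept))
```

Dropping a whole set when too few SNPs survive mirrors the "set with not enough SNPs" item in the chart; the minimum of two used here is only a placeholder.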
  • 75. Result (Journal Article 3)
      Licensed under the CC BY-NC 4.0 license and freely available at https://github.com/xyy2006/LaSPU/.
  • 77. Conclusion I
      In brief, our developed association test(s) utilized
      the longitudinal information
      the data adaptivity to select the most powerful test
      the biological information such as biological pathways
      to boost the statistical power.
  • 78. Conclusion II
      More elaborately, in this dissertation we have
      developed the LaSPU method.
      developed the LaSPUpath method.
      performed simulation studies demonstrating that both LaSPU and LaSPUpath can efficiently utilize the longitudinal information and adaptively select the best test to increase statistical power compared with non-adaptive and/or baseline testing methods.
      performed simulation studies demonstrating that LaSPUpath can also efficiently utilize additional biological information when available, such as biological pathway information, to further boost statistical power.
      Real data application of LaSPU demonstrated that LaSPU can replicate the previously reported genes with a much smaller sample size (1/4). In addition, LaSPU identified a novel gene, FAM65A, associated with HDL-C.
      Real data application of LaSPUpath identified three significant pathways, the functional annotations of which all closely relate to the regulation of lipid metabolism and/or the onset of lipid-related diseases.
      The development of the "LaSPU" software package enables the research community to apply the novel methods with ease.
  • 79. Future Work
      Joint test of the SNPs' main effect and the SNPs' interaction effect with time in the LaSPU/LaSPUpath framework
      Setup of an appropriate weighting scheme for SNPs (and genes) to incorporate functional annotations of SNPs (and genes)
      Setup of an appropriate weighting scheme for SNPs to enable the analysis of CVs and RVs in the same region
      Extension from testing genetic pathways to other functionally related gene sets
      Inclusion of the aSPUpath method in the "LaSPU" software package
  • 81. Acknowledgement
      Advisors: Peng Wei, Ph.D; Alanna C. Morrison, Ph.D; Yun-Xin Fu, Ph.D; Han Liang, Ph.D, Associate Professor, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center
      External Reviewer: Xiaoming Liu, Ph.D
      Collaborator: Wei Pan, Ph.D, Professor, Division of Biostatistics, School of Public Health, University of Minnesota
      Supporting Grant: Title: Association Analysis of Rare Variants with Sequencing Data; Funding Source: NIH/NHLBI (1R01HL116720)
      My classmates, colleagues and friends at UTSPH and MDACC ...
      My parents, Tianpeng Yang and Qi Lu
      My wife, Nainan Hei
  • 82. Thanks to Audience
      Thank you for your participation!
  • 83. References I
      Alan P Boyle, Eurie L Hong, Manoj Hariharan, Yong Cheng, Marc A Schaub, Maya Kasowski, Konrad J Karczewski, Julie Park, Benjamin C Hitz, Shuai Weng, et al., Annotation of functional variation in personal genomes using RegulomeDB, Genome Research 22 (2012), no. 9, 1790–1797.
      Dan-Yu Lin and Zheng-Zheng Tang, A general framework for detecting disease associations with rare variants in sequencing studies, The American Journal of Human Genetics 89 (2011), no. 3, 354–367.
      Kung-Yee Liang and Scott L Zeger, Longitudinal data analysis using generalized linear models, Biometrika 73 (1986), no. 1, 13–22.
      [OGS+99] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru Fujibuchi, Hidemasa Bono, and Minoru Kanehisa, KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Research 27 (1999), no. 1, 29–34.
      [PAB+14] Gina M Peloso, Paul L Auer, Joshua C Bis, Arend Voorman, Alanna C Morrison, Nathan O Stitziel, Jennifer A Brody, Sumeet A Khetarpal, Jacy R Crosby, Myriam Fornage, et al., Association of low-frequency and rare coding-sequence variants with blood lipids and coronary heart disease in 56,000 whites and blacks, The American Journal of Human Genetics 94 (2014), no. 2, 223–232.
      Wei Pan, On the robust variance estimator in generalised estimating equations, Biometrika 88 (2001), no. 3, 901–906.
      Wei Pan, Asymptotic tests of association with multiple SNPs in linkage disequilibrium, Genetic Epidemiology 33 (2009), no. 6, 497–507.
      Wei Pan, Junghi Kim, Yiwei Zhang, Xiaotong Shen, and Peng Wei, A powerful and adaptive association test for rare variants, Genetics (2014), genetics–114.
      Tao Wang and Robert C Elston, Improved power by use of a weighted score test for linkage disequilibrium mapping, The American Journal of Human Genetics 80 (2007), no. 2, 353–360.
  • 84. References II
      Zhiyuan Xu, Xiaotong Shen, Wei Pan, Alzheimer's Disease Neuroimaging Initiative, et al., Longitudinal analysis is more powerful than cross-sectional analysis in detecting genetic association with neuroimaging phenotypes, PLoS ONE 9 (2014), no. 8, e102312.