Application of Bayesian and Sparse Network Models for Assessing Linkage Disequilibrium in Animals and Plants
1. Application of Bayesian and Sparse Network
Models for Assessing Linkage Disequilibrium in
Animals and Plants
C-36-6
Gota Morota
Department of Animal Sciences
University of Wisconsin-Madison
Aug 30, 2012
1 / 16
2. Systems Genetics
Figure 1: Multi-dimensional gene network
Purpose of this study
• take the view that loci associate and interact together as a
network
• evaluate LD reflecting the biological nature that loci interact as
a complex system
3. IAMB algorithm
Incremental Association Markov Blanket (Tsamardinos et al. 2003)
1. Compute Markov Blankets (MB)
2. Compute Graph Structure
3. Orient Edges
Figure 2: The Markov Blanket of a node xi
4. Identifying the MB of a node
• Growing phase
• heuristic function:
$$f(X; T \mid CMB) = MI(X; T \mid CMB) = \sum_{cmb \in CMB} P(cmb) \sum_{x \in X} \sum_{t \in T} P(x, t \mid cmb) \log \frac{P(x, t \mid cmb)}{P(x \mid cmb)\, P(t \mid cmb)}$$
• conditional independence tests (Pearson’s χ2 test):
H0: P(X, T | CMB) = P(X | CMB) · P(T | CMB) (do not add X)
HA: P(X, T | CMB) ≠ P(X | CMB) · P(T | CMB) (add X to the CMB)
• Shrinking phase
• conditional independence tests (Pearson’s χ2 test):
H0: P(X, T | CMB − X) ≠ P(X | CMB − X) · P(T | CMB − X) (keep X)
HA: P(X, T | CMB − X) = P(X | CMB − X) · P(T | CMB − X) (remove X)
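The growing and shrinking phases can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: it replaces the Pearson χ2 test with a simple cutoff on the empirical conditional mutual information, and the names `conditional_mutual_information`, `markov_blanket`, and `threshold` are hypothetical.

```python
from collections import Counter
from math import log


def conditional_mutual_information(x, t, z):
    """Empirical MI(X; T | Z) in nats from equal-length discrete sequences."""
    n = len(x)
    n_xtz = Counter(zip(x, t, z))
    n_xz = Counter(zip(x, z))
    n_tz = Counter(zip(t, z))
    n_z = Counter(z)
    mi = 0.0
    for (xv, tv, zv), c in n_xtz.items():
        # P(x,t,z) * log[ P(x,t|z) / (P(x|z) P(t|z)) ], written with counts
        ratio = c * n_z[zv] / (n_xz[(xv, zv)] * n_tz[(tv, zv)])
        mi += (c / n) * log(ratio)
    return mi


def markov_blanket(data, target, threshold=0.01):
    """Grow-shrink search for the Markov blanket of `target`.

    data: dict mapping variable name -> list of discrete values.
    threshold: assumed MI cutoff standing in for the chi-square test.
    """
    n = len(data[target])

    def score(var, cond):
        # Joint state of the conditioning variables, one tuple per sample
        z = list(zip(*(data[c] for c in cond))) if cond else [()] * n
        return conditional_mutual_information(data[var], data[target], z)

    cmb = []
    # Growing phase: add the candidate with the largest conditional MI
    while True:
        candidates = [v for v in data if v != target and v not in cmb]
        if not candidates:
            break
        best = max(candidates, key=lambda v: score(v, cmb))
        if score(best, cmb) <= threshold:
            break
        cmb.append(best)
    # Shrinking phase: drop members independent of the target given the rest
    for v in list(cmb):
        rest = [c for c in cmb if c != v]
        if score(v, rest) <= threshold:
            cmb.remove(v)
    return cmb
```

With a significance test in place of the cutoff, the decision rule in each phase would follow the hypotheses listed above; the search loop itself would stay the same.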
5. Network Structure
Algorithm
Suppose Y ∈ MB(T). Then T and Y are connected if they are
conditionally dependent given all subsets of the smaller of
MB(T) − (Y) and MB(Y) − (T).
Example:
• MB(T) = (A, B, Y), MB(Y) = (C, D, E, F, T)
• since |MB(T)| < |MB(Y)|, independence tests are conditional
on all subsets of MB(T) − (Y) = (A, B).
• if any of
CI(T, Y | {}), CI(T, Y | {A}), CI(T, Y | {B}), and CI(T, Y | {A, B})
imply conditional independence,
↓
• T and Y are considered separate (spouses)
• repeat for all T ∈ S and Y ∈ MB(T)
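The neighbor check in the example can be sketched as follows. This is an illustrative outline under assumptions: `is_independent` is a hypothetical callback standing in for whichever conditional independence test is used, and `mb` is assumed to be a dictionary of already computed Markov blankets.

```python
from itertools import combinations


def all_subsets(items):
    """Yield every subset of `items`, from the empty set to the full set."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def are_neighbors(t, y, mb, is_independent):
    """Return True if t and y stay dependent given every tested subset.

    mb: dict mapping node -> set of nodes in its Markov blanket.
    is_independent(t, y, cond): assumed CI test returning True or False.
    """
    side_t = mb[t] - {y}
    side_y = mb[y] - {t}
    # Condition on subsets of the smaller of the two reduced blankets
    smaller = side_t if len(side_t) <= len(side_y) else side_y
    return not any(is_independent(t, y, s) for s in all_subsets(smaller))
```

For the blankets in the example, the subsets visited are {}, {A}, {B}, and {A, B}, and a single test reporting independence is enough to leave T and Y unconnected.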
6. Materials
1. Data
• 4,898 Holstein bulls (USDA-ARS AIPL)
• 37,217 SNP markers (MAF > 0.025)
• milk protein yield
2. Imputation of missing genotypes
• fastPHASE (Scheet and Stephens, 2006)
3. Select 15 SNPs
• Bayesian LASSO
4. Uncover associations among the set of marker loci found to
have the strongest effects on milk protein yield
7. Results – Top 15 SNPs
Figure 3: Pairwise LD among SNPs (r²); R2 color key from 0 to 1
Figure 4: IAMB algorithm
8. Conclusion and Possible Improvements
• LD relationships are of a multivariate nature
• r² gives an incomplete description of LD
⇓
• undirected networks
• sparsity
9. Pairwise Binary Markov Networks
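We estimate the Markov network parameters $\Theta_{p \times p}$ by maximizing
a log-likelihood.
$$f(x_1, \ldots, x_p) = \frac{1}{\Psi(\Theta)} \exp\left( \sum_{j=1}^{p} \theta_{j,j} x_j + \sum_{1 \le j < k \le p} \theta_{j,k} x_j x_k \right) \tag{1}$$
where $x_j \in \{0, 1\}$ and
$$\Psi(\Theta) = \sum_{x \in \{0,1\}^p} \exp\left( \sum_{j=1}^{p} \theta_{j,j} x_j + \sum_{1 \le j < k \le p} \theta_{j,k} x_j x_k \right) \tag{2}$$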
• the first term is the main effect of binary marker $x_j$ (node
potential)
• the second term corresponds to an "interaction effect" between
binary markers $x_j$ and $x_k$ (link potential)
• Ψ(Θ) is the normalization constant (partition function)
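For a handful of markers the density and its partition function can be evaluated by brute force, which makes the role of each term concrete. The sketch below assumes Θ is stored as a nested list whose diagonal holds the node potentials and whose upper triangle holds the link potentials; the function names are illustrative only.

```python
from itertools import product
from math import exp


def energy(x, theta):
    """Exponent of the model: node potentials plus link potentials."""
    p = len(x)
    main = sum(theta[j][j] * x[j] for j in range(p))
    pair = sum(theta[j][k] * x[j] * x[k]
               for j in range(p) for k in range(j + 1, p))
    return main + pair


def partition_function(theta):
    """Psi(Theta): sum of exp(energy) over all 2^p binary configurations."""
    p = len(theta)
    return sum(exp(energy(x, theta)) for x in product((0, 1), repeat=p))


def density(x, theta):
    """Probability of configuration x under the pairwise binary model."""
    return exp(energy(x, theta)) / partition_function(theta)
```

The sum in `partition_function` has 2^p terms, so this direct evaluation is only practical for very small p.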
10. Ravikumar et al. (2010)
The pseudo-likelihood based on the local conditional likelihood
associated with each binary marker can be represented as
$$l(\Theta) = \prod_{i=1}^{n} \prod_{j=1}^{p} \phi_{i,j}^{x_{i,j}} (1 - \phi_{i,j})^{1 - x_{i,j}} \tag{4}$$
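where $\phi_{i,j}$ is the conditional probability of $x_{i,j} = 1$ given all other
variables. Using a logistic link function,
$$\phi_{i,j} = P(x_{i,j} = 1 \mid x_{i,k},\, k \ne j;\ \theta_{j,k},\, 1 \le k \le p) = \frac{\exp\left(\theta_{j,j} + \sum_{k \ne j} \theta_{j,k} x_{i,k}\right)}{1 + \exp\left(\theta_{j,j} + \sum_{k \ne j} \theta_{j,k} x_{i,k}\right)} \tag{5}$$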
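The pseudo-likelihood is cheap to evaluate because each factor is an ordinary logistic probability. The following sketch computes the conditional probability and the logarithm of the pseudo-likelihood for a small data set; it assumes the genotype matrix is a list of 0/1 rows and Θ is a full p × p nested list, and the function names are hypothetical.

```python
from math import exp, log


def conditional_prob(x_i, j, theta):
    """phi_{i,j}: P(x_{i,j} = 1 | all other markers of individual i)."""
    eta = theta[j][j] + sum(theta[j][k] * x_i[k]
                            for k in range(len(x_i)) if k != j)
    # exp(eta) / (1 + exp(eta)) written in its equivalent logistic form
    return 1.0 / (1.0 + exp(-eta))


def log_pseudo_likelihood(X, theta):
    """Log of the product over individuals i and markers j."""
    total = 0.0
    for x_i in X:
        for j in range(len(x_i)):
            phi = conditional_prob(x_i, j, theta)
            total += log(phi) if x_i[j] == 1 else log(1.0 - phi)
    return total
```

Working on the log scale turns the double product into a double sum, which avoids numerical underflow when n and p are large.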
11. Ravikumar et al. (2010) (cont.)
• L1-regularized logistic regression problem
• regressing each marker on the rest of the markers
• the network structure is recovered from the sparsity pattern of
the regression coefficients
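$$\hat{\Theta} = \begin{pmatrix}
0 & \hat{\beta}^{-1}_{2} & \cdots & \hat{\beta}^{-1}_{p-1} & \hat{\beta}^{-1}_{p} \\
\hat{\beta}^{-2}_{1} & 0 & \cdots & \hat{\beta}^{-2}_{p-1} & \hat{\beta}^{-2}_{p} \\
\vdots & & \ddots & & \vdots \\
\hat{\beta}^{-(p-1)}_{1} & \cdots & \hat{\beta}^{-(p-1)}_{p-2} & 0 & \hat{\beta}^{-(p-1)}_{p} \\
\hat{\beta}^{-p}_{1} & \cdots & \hat{\beta}^{-p}_{p-2} & \hat{\beta}^{-p}_{p-1} & 0
\end{pmatrix} \tag{7}$$
$$\tilde{\Theta} = \hat{\Theta} \bullet \hat{\Theta}^{T} \tag{8}$$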
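Because each marker is regressed separately, the stacked coefficient matrix is generally not symmetric, and the last step combines it with its transpose. The sketch below assumes that "•" denotes the elementwise (Hadamard) product, which the slide does not state explicitly, and reads edges off the nonzero entries; the function names are illustrative.

```python
def combine_with_transpose(theta_hat):
    """Elementwise product of theta_hat and its transpose (assumed meaning of the bullet operator)."""
    p = len(theta_hat)
    return [[theta_hat[j][k] * theta_hat[k][j] for k in range(p)]
            for j in range(p)]


def edges_from_sparsity(theta_tilde):
    """List marker pairs (j, k), j < k, whose combined entry is nonzero."""
    p = len(theta_tilde)
    return [(j, k) for j in range(p) for k in range(j + 1, p)
            if theta_tilde[j][k] != 0.0]
```

Under this reading an edge between markers j and k survives only when both regressions, j on the rest and k on the rest, assign the other marker a nonzero coefficient.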
14. Summary
Interactions and associations among the cells and genes form a
complex biological system
⇓
• r² → association(m1, m2) | ∅ (empty set)
• L1-regularized MN → association(m1, m2) | else
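The first bullet can be made concrete with the usual pairwise definition r² = D² / (pA(1 − pA) pB(1 − pB)), which looks at two markers at a time and conditions on nothing else. The sketch below assumes phased alleles coded 0/1 at each marker, which is a simplification relative to the SNP genotypes in the study; the function name is hypothetical.

```python
def r_squared(a, b):
    """Pairwise LD r^2 between two 0/1 coded markers of equal length."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    # Undefined (division by zero) if either marker is monomorphic
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

No third marker enters this calculation, which is what distinguishes it from the conditional associations captured by the network models.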
A final remark
• selecting tag SNPs unconditionally, as well as conditionally
on other markers, when the dimension of the data is high
• data generated from next-generation sequencing technologies