Linear models for GWAS
II: Linear mixed models

Christoph Lippert
Microsoft Research, Los Angeles, USA

Current topics in computational biology, UCLA
October 17th, 2012

C. Lippert, Linear models for GWAS II, October 17th, 2012
Course structure

October 15th
  Introduction
    Terminology
    Study design
    Data preparation
    Challenges and pitfalls
    Course overview
  Linear regression
    Parameter estimation
    Statistical testing

October 17th
  Basic probability theory
  Linear mixed models
    Population structure correction
    Parameter estimation
    Variance component modeling
    Phenotype prediction
Probability Theory

Probabilities
  Let X be a random variable, defined over a set (or measurable space) 𝒳.
  P(X = x) denotes the probability that X takes value x, in short p(x).
  Probabilities are positive: P(X = x) ≥ 0.
  Probabilities sum (or integrate) to one:
    ∫_{x∈𝒳} p(x) dx = 1   (continuous case)
    Σ_{x∈𝒳} p(x) = 1      (discrete case)
Probability Theory

Probability Theory
  Here n_{i,j} counts the draws with X = x_i and Y = y_j out of N total draws,
  and c_i = Σ_j n_{i,j} counts the draws with X = x_i.

Joint Probability
  P(X = x_i, Y = y_j) = n_{i,j} / N

Marginal Probability
  P(X = x_i) = c_i / N

Conditional Probability
  P(Y = y_j | X = x_i) = n_{i,j} / c_i

Product Rule
  P(X = x_i, Y = y_j) = n_{i,j} / N = (n_{i,j} / c_i) · (c_i / N)
                      = P(Y = y_j | X = x_i) P(X = x_i)

Sum Rule
  P(X = x_i) = c_i / N = (1/N) Σ_{j=1}^{L} n_{i,j} = Σ_j P(X = x_i, Y = y_j)

(C.M. Bishop, Pattern Recognition and Machine Learning)
Probability Theory

The Rules of Probability
  Sum Rule:     p(x) = Σ_y p(x, y)
  Product Rule: p(x, y) = p(y | x) p(x)
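The counting definitions and the two rules above can be checked numerically on a table of counts. A minimal sketch with a made-up 3×4 count table (all names and numbers here are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical 3x4 table of counts n[i, j]: number of draws with X = x_i, Y = y_j.
n = np.array([[10, 4, 6, 0],
              [3, 7, 5, 5],
              [2, 1, 9, 8]], dtype=float)
N = n.sum()                                       # total number of draws
p_xy = n / N                                      # joint: P(X = x_i, Y = y_j)
p_x = p_xy.sum(axis=1)                            # sum rule: p(x) = sum_y p(x, y)
p_y_given_x = n / n.sum(axis=1, keepdims=True)    # conditional: n_ij / c_i

# Product rule: p(x, y) = p(y | x) p(x), recovered exactly from the counts.
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)
```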
Probability Theory

Bayes' Theorem
  Using the product rule we obtain
    p(y | x) = p(x | y) p(y) / p(x),   where   p(x) = Σ_y p(x | y) p(y).
Probability Theory

Bayesian probability calculus
  Bayes' rule is the basis for inference and learning.
  Assume we have a model with parameters θ, e.g. y = θ₀ + θ₁ · x.
  Goal: learn the parameters θ given data D.
    p(θ | D) = p(D | θ) p(θ) / p(D)
    posterior ∝ likelihood · prior
  [Figure: linear fit through data points in the X-Y plane, with a prediction at a test input x*]
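The relation posterior ∝ likelihood · prior can be made concrete on a discrete grid over θ. A minimal sketch (the coin-flip data and the grid are invented for illustration):

```python
import numpy as np

# theta: coin bias on a grid; data D: 7 heads in 10 flips.
theta = np.linspace(0.01, 0.99, 99)
prior = np.full_like(theta, 1.0 / len(theta))   # uniform prior p(theta)
likelihood = theta**7 * (1 - theta)**3          # p(D | theta), up to the binomial constant
posterior = likelihood * prior
posterior /= posterior.sum()                    # normalize by p(D) = sum_theta p(D|theta) p(theta)
theta_map = theta[np.argmax(posterior)]         # posterior mode
```

With a uniform prior the posterior mode coincides with the maximum-likelihood estimate 7/10.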
Probability Theory

Probability distributions

Gaussian
  p(x | µ, σ²) = N(x | µ, σ²) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )
  [Plot: univariate standard normal density over x ∈ [−6, 6]]

Multivariate Gaussian
  p(x | µ, Σ) = N(x | µ, Σ) = |2πΣ|^{−1/2} exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) )
  [Plot: contours of a bivariate Gaussian with Σ = [[1, 0.8], [0.8, 1]]]
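Both densities above can be written out directly. A small sketch (gauss_pdf and mvn_pdf are names chosen here, not from any library):

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density |2 pi Sigma|^{-1/2} exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])        # the covariance from the slide
p_uni = gauss_pdf(0.0, 0.0, 1.0)                  # standard normal at its mode
p_multi = mvn_pdf(np.zeros(2), np.zeros(2), Sigma)
```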
Population Structure

Genome-wide association studies (GWAS)
  Identify associations between variable genetic loci and phenotypes.
  Linear and logistic regression
  Statistical dependence tests (F-test, likelihood ratio)
    Likelihood ratio:  N(y | Xβ; σ²_e I) / N(y | 0; σ²_e I)   (1)
  [Figure: aligned sequences with one variable site (SNP column X, alleles C/T)
   and the phenotype vector y; the tested model is y ∼ N(Xβ; σ²_e I)]
Population Structure

Population stratification
  Confounding structure leads to false positives:
    Population structure
    Family structure
    Cryptic relatedness
  [Figure: SNP column X and phenotype y as before, now with samples related
   through population structure (distance), family structure, and cryptic
   relatedness; the tested model is still y ∼ N(Xβ; σ²_e I)]
Population Structure

Population stratification
  GWA on inflammatory bowel disease (WTCCC)
  3.4k cases, 11.9k controls
  Methods: linear regression, likelihood ratio test
  [Burton et al., 2007]
Population structure correction

Outline
  Probability Theory
  Population Structure
  Population structure correction
  Variance component models
  Multi-locus models
  Phenotype prediction
Population structure correction

Genomic control [Devlin and Roeder, Biometrics 1999]

Genomic control λ
  λ = median(2·LR) / median(χ²₁)
  (the denominator is the theoretical median of the χ² distribution with 1 d.o.f., ≈ 0.455)
  λ = 1: calibrated P-values
  λ > 1: inflation
  λ < 1: deflation
  Correct by dividing the test statistic by λ.
  Applicable in combination with every method.
  Does not change the (non-)uniformity of P-values.
  Very conservative.
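The correction above can be sketched on simulated test statistics (genomic_control is a name chosen here; the constant is the theoretical χ²₁ median):

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364                  # median of the chi^2 distribution, 1 d.o.f.

def genomic_control(stats):
    """lambda = median(observed 2*LR) / median(chi^2_1); divide statistics by lambda."""
    lam = np.median(stats) / CHI2_1_MEDIAN
    return lam, np.asarray(stats) / lam

rng = np.random.default_rng(0)
calibrated = rng.chisquare(df=1, size=100_000)       # well-calibrated scan
inflated = 1.5 * rng.chisquare(df=1, size=100_000)   # scan inflated by confounding

lam_ok, _ = genomic_control(calibrated)
lam_bad, corrected = genomic_control(inflated)
```

After dividing by λ the median of the corrected statistics matches the χ²₁ median again, but the uniform inflation factor cannot distinguish true signals from confounding, hence the conservativeness noted above.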
Population structure correction

Eigenstrat
  Population structure causes genome-wide correlations between SNPs.
  A large part of the total variation in the SNPs can be explained by
  population differences.
  Novembre et al. [2008] show that the eigenvectors of the SNP covariance
  matrix reflect population structure.
  Eigenstrat uses this property to correct for population structure in GWAS.
  [Price et al., 2006, Patterson et al., 2006, Novembre et al., 2008]
Population structure correction

Eigenstrat
  Eigenstrat procedure:
    Compute the covariance matrix of the centered genotypes X − E[X]
    (genome-wide SNP covariance).
    Compute the eigenvectors of the covariance matrix.
    Add the eigenvector with the largest eigenvalue as a covariate to the
    regression.
    Repeat until the P-values are uniform.
  [Figure: genome-wide SNP covariance of X − E[X], its eigenvectors and
   eigenvalues; the top eigenvectors are added as covariates to the per-SNP
   regression]
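The procedure can be sketched on simulated genotypes with two hypothetical subpopulations (all names, sizes, and frequencies are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp = 200, 500
# Two subpopulations with independently drawn allele frequencies per SNP.
freqs = np.vstack([rng.uniform(0.1, 0.9, n_snp), rng.uniform(0.1, 0.9, n_snp)])
pop = np.repeat([0, 1], n_ind // 2)
X = rng.binomial(2, freqs[pop]).astype(float)   # genotype matrix, entries 0/1/2

Xc = X - X.mean(axis=0)                         # center: X - E[X]
C = Xc @ Xc.T / n_snp                           # genome-wide SNP covariance (individuals x individuals)
evals, evecs = np.linalg.eigh(C)                # eigh returns eigenvalues in ascending order
pc1 = evecs[:, -1]                              # eigenvector with the largest eigenvalue

# Add the top eigenvector (plus an intercept) as covariates to the per-SNP regression.
covariates = np.column_stack([np.ones(n_ind), pc1])
```

On data with this much differentiation, the first eigenvector tracks the subpopulation labels almost perfectly, which is exactly the property Eigenstrat exploits.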
Population structure correction

Linear mixed models (LMM)
  Covariance matrix K, estimated from SNP data:
    Kinship coefficients
    Identity by state
    Identity by descent
    Realized relationship matrix (linear)
  Sample a random effect u ∼ N(0; σ²_g K), then sample the phenotype
  y ∼ N(Xβ + u; σ²_e I). Integrating out u gives the marginal likelihood:
    ∫_u N(y | Xβ + u; σ²_e I) N(u | 0; σ²_g K) du = N(y | Xβ; σ²_g K + σ²_e I)
  [Figure: SNP column X and phenotype y with samples related through
   population structure, family structure, and cryptic relatedness; genetic
   similarity enters the model through the covariance K]
Population structure correction

Linear mixed models (LMM)
  Corrects for all levels of population structure.
  ML estimation is computationally demanding:
    Non-convex in σ²_g and σ²_e.
  Model:  y ∼ N(Xβ; σ²_g K + σ²_e I)
  [Figure: as before, with the phenotype covariance σ²_g K + σ²_e I]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana

Linear Model:
  [QQ-plot of observed vs. expected P-values; on the −log₁₀ scale the
   observed values reach ≈25 against an expected range of ≈6, showing strong
   inflation]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana
  Linear Model vs. Linear Mixed Model:
  [Comparison of the association scans under the linear model and the linear
   mixed model]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana

Linear Mixed Model:
  [QQ-plot of observed vs. expected P-values; on the −log₁₀ scale the
   observed values reach ≈12 against an expected range of ≈6, much closer to
   the diagonal than under the linear model]
Population structure correction

Linear mixed models (LMM)

LMM log likelihood
  LL(β, σ²_g, σ²_e) = log N(y | Xβ; σ²_g K + σ²_e I).
  Change of variables, introducing δ = σ²_e / σ²_g:
    LL(β, σ²_g, δ) = log N(y | Xβ; σ²_g (K + δI)).
  The ML parameters β̂ and σ̂²_g follow in closed form.
  Use an optimizer to solve the 1-dimensional optimization problem over δ.
  [Kang et al., 2008]
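Under the δ-parameterization, β and σ²_g can be profiled out in closed form, leaving a one-dimensional problem in δ. A sketch of this idea (function names are chosen here; a grid search stands in for a proper 1-D optimizer, and the toy K is not a real kinship matrix):

```python
import numpy as np

def neg_ll_delta(delta, y, X, K):
    """Negative LMM log likelihood in delta, with beta and sigma_g^2 profiled out."""
    n = len(y)
    Kd = K + delta * np.eye(n)
    Kd_inv_y = np.linalg.solve(Kd, y)
    Kd_inv_X = np.linalg.solve(Kd, X)
    beta = np.linalg.solve(X.T @ Kd_inv_X, X.T @ Kd_inv_y)  # closed-form beta_ML
    r = y - X @ beta
    sg2 = r @ np.linalg.solve(Kd, r) / n                    # closed-form sigma_g^2 ML
    _, logdet = np.linalg.slogdet(Kd)
    # -LL = n/2 log(2 pi sg2) + 1/2 log|K + delta I| + n/2
    return 0.5 * (n * np.log(2 * np.pi * sg2) + logdet + n)

def fit_delta(y, X, K, grid=np.logspace(-3, 3, 61)):
    """1-D optimization over delta, here by grid search."""
    return min(grid, key=lambda d: neg_ll_delta(d, y, X, K))

rng = np.random.default_rng(3)
n = 60
G = rng.standard_normal((n, 2 * n))
K = G @ G.T / (2 * n)                                  # toy positive-definite "kinship"
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)
delta_hat = fit_delta(y, X, K)
```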
Population structure correction
Linear mixed models (LMM)
ML parameters
Gradient of the LMM log likelihood w.r.t. β
β log N y|Xβ; σ2
g (K + δI) = β −
1
2σ2
g
(y − Xβ)
T
(K + δI)
−1
(y − Xβ)
=
1
σ2
g
−XT
(K + δI)
−1
y + XT
(K + δI)
−1
X
Population structure correction
Linear mixed models (LMM)
ML parameters

Gradient of the LMM log likelihood w.r.t. β:

∇_β log N(y | Xβ; σ²_g (K + δI)) = ∇_β [ −1/(2σ²_g) (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]
                                 = 1/σ²_g [ X^T (K + δI)^{−1} y − X^T (K + δI)^{−1} Xβ ]

Set the gradient to zero:

0 = 1/σ²_g [ X^T (K + δI)^{−1} y − X^T (K + δI)^{−1} Xβ ]
X^T (K + δI)^{−1} Xβ = X^T (K + δI)^{−1} y
β_ML = ( X^T (K + δI)^{−1} X )^{−1} X^T (K + δI)^{−1} y

Note that this solution is analogous to the ML solution of linear regression, (X^T X)^{−1} X^T y.

C. Lippert Linear models for GWAS II October 17th 2012 22
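The closed form for β_ML can be sketched in a few lines of NumPy (the slides contain no code, so Python is a choice here; X, K, δ, and y below are made-up toy data, and linear solves stand in for the explicit inverses):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3

# Toy data (illustrative only): fixed effects X, similarity matrix K, ratio delta.
X = rng.standard_normal((N, P))
A = rng.standard_normal((N, N))
K = A @ A.T / N                      # any positive semi-definite similarity matrix
delta = 0.5
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)

# beta_ML = (X^T (K + dI)^{-1} X)^{-1} X^T (K + dI)^{-1} y,
# computed with linear solves rather than explicit matrix inverses.
V = K + delta * np.eye(N)
Vinv_X = np.linalg.solve(V, X)       # (K + dI)^{-1} X
Vinv_y = np.linalg.solve(V, y)       # (K + dI)^{-1} y
beta_ml = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
```

At the solution the gradient above vanishes, which gives a direct numerical check.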
Population structure correction
Linear mixed models (LMM)
ML parameters

Derivative of the LMM log likelihood w.r.t. σ²_g:

∂/∂σ²_g log N(y | Xβ; σ²_g (K + δI)) = −1/2 [ N/σ²_g − 1/σ⁴_g (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]

Set the derivative to zero:

0 = −1/2 [ N/σ²_g − 1/σ⁴_g (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]
σ²_g,ML = 1/N (y − Xβ)^T (K + δI)^{−1} (y − Xβ)

Bottleneck: for every SNP that we test we need to calculate (K + δI)^{−1}.
If done naively, this is an O(N³) operation per SNP (very expensive!).

C. Lippert Linear models for GWAS II October 17th 2012 23
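Given β, the σ²_g,ML estimator is a single quadratic form; a minimal NumPy sketch with made-up inputs (β here is assumed to have been estimated already):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = rng.standard_normal((N, 2))
beta = np.array([0.3, -1.2])         # assume beta has already been estimated
A = rng.standard_normal((N, N))
K = A @ A.T / N                      # toy similarity matrix
delta = 1.0
y = X @ beta + rng.standard_normal(N)

# sigma2_g,ML = (1/N) (y - X beta)^T (K + delta I)^{-1} (y - X beta)
r = y - X @ beta
V = K + delta * np.eye(N)
sigma2_g_ml = r @ np.linalg.solve(V, r) / N
```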
Population structure correction
FaST-LMM

N(y | Xβ; σ²_g (K + δI))
  = N(y | Xβ; σ²_g (U S U^T + δI))                       (spectral decomposition K = U S U^T)
  = N(U^T y | U^T Xβ; σ²_g (U^T U S U^T U + δ U^T U))    (rotate by U^T; U^T U = I)
  = N(U^T y | U^T Xβ; σ²_g (S + δI))

[Lippert et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 24
Population structure correction
FaST-LMM

N(U^T y | U^T Xβ; σ²_g (S + δI)).  (2)

Factored Spectrally Transformed LMM
O(N³) once for the spectral decomposition.
O(N²) runtime per SNP tested (multiplication with U^T).
O(N²) memory for storing K and U.

[Lippert et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 25
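The identity behind FaST-LMM, that the rotated likelihood equals the dense one with the covariance made diagonal, can be checked numerically; a sketch on toy data (K, δ, and σ²_g are made-up values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
G = rng.standard_normal((N, 200))
K = G @ G.T / G.shape[1]            # realized relationship matrix (toy)
S, U = np.linalg.eigh(K)            # O(N^3), done once: K = U diag(S) U^T
delta, sigma2_g = 0.7, 1.3
y = rng.standard_normal(N)

def loglik_naive(y, cov):
    # Dense Gaussian log likelihood: O(N^3) per evaluation.
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (N * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(cov, y))

ll_dense = loglik_naive(y, sigma2_g * (K + delta * np.eye(N)))

# FaST-LMM: rotate once by U^T; the covariance becomes diagonal, so the
# quadratic form and log-determinant each cost only O(N) afterwards.
y_rot = U.T @ y
d = sigma2_g * (S + delta)
ll_fast = -0.5 * (N * np.log(2 * np.pi) + np.sum(np.log(d))
                  + np.sum(y_rot**2 / d))
```

The per-SNP O(N²) cost quoted above is the rotation U^T x_s of each tested SNP.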
Population structure correction
Summary

Population structure correction
Genomic control
Simple method
Works with any statistical test
Can be combined with other correction methods
Very conservative!
Eigenstrat (PCA)
Corrects well for differences on population level
Does not work well for closer relatedness
Linear mixed models
Corrects well for most forms of relatedness.

C. Lippert Linear models for GWAS II October 17th 2012 26
Population structure correction
Overview

Single marker association model with random effect term:

y = x_s β_s (genetic effect) + u (random effect covariates) + ε (noise)

Shortcomings
Weak effects are not captured by single-marker analysis.
Complex traits are controlled by more than a single SNP.

C. Lippert Linear models for GWAS II October 17th 2012 27
Population structure correction
Multi locus models

Generalization to multiple genetic factors:

y = Σ_{s=1}^S x_s β_s (genetic effect) + u (random effect covariates) + ε (noise)

Challenge: N ≪ S, so explicit estimation of all β_s is not feasible.

Solutions
Regularize the β_s (ridge regression, LASSO)
Variance component modeling

[Wu et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 28
Outline
C. Lippert Linear models for GWAS II October 17th 2012 29
Variance component models
Outline
Probability Theory
Population Structure
Population structure correction
Variance component models
Multi locus models
Phenotype prediction
C. Lippert Linear models for GWAS II October 17th 2012 30
Variance component models Multi locus models
Multi locus models
Random effect models

For now, let's drop the random effect term:

y = Σ_{s=1}^S x_s β_s + ε.

For mathematical convenience, we choose a shared Gaussian distribution on the weights and Gaussian noise:

p(β_1, ..., β_S) = Π_{s=1}^S N(β_s | 0, σ²_g),    p(ε) = N(ε | 0, σ²_e I)

Marginalize out the weights β_1, ..., β_S:

p(y | X, σ²_g, σ²_e) = ∫ N(y | Σ_{s=1}^S x_s β_s, σ²_e I) [data likelihood] Π_{s=1}^S N(β_s | 0, σ²_g) [weight distribution] dβ
                     = N(y | 0, σ²_g Σ_{s=1}^S x_s x_s^T + σ²_e I)

C. Lippert Linear models for GWAS II October 17th 2012 31
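The marginalization result can be verified by simulation; a NumPy sketch (the genotype matrix X and variance components below are made-up) comparing the empirical covariance of y = Xβ + ε against σ²_g X X^T + σ²_e I:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S = 3, 2
X = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, -1.0]])          # toy genotype matrix (hypothetical)
sigma2_g, sigma2_e = 2.0, 0.5

# Analytic marginal covariance: sigma2_g * X X^T + sigma2_e * I.
cov_analytic = sigma2_g * X @ X.T + sigma2_e * np.eye(N)

# Monte Carlo: draw beta ~ N(0, sigma2_g I) and eps ~ N(0, sigma2_e I),
# form y = X beta + eps, and compare the empirical covariance of y.
reps = 200_000
beta = rng.normal(0.0, np.sqrt(sigma2_g), size=(reps, S))
eps = rng.normal(0.0, np.sqrt(sigma2_e), size=(reps, N))
y = beta @ X.T + eps
cov_empirical = np.cov(y.T)
```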
Variance component models Multi locus models
Multi locus models
Remarks

p(y | X, σ²_g, σ²_e) = N(y | 0, σ²_g Σ_{s=1}^S x_s x_s^T + σ²_e I)    (3)

where K_g = Σ_{s=1}^S x_s x_s^T is the genotype covariance matrix.
Closely related to the kinship matrix explaining population structure.
Inference can be done by maximum likelihood.

(Figure: sample-by-sample heatmap of the genotype covariance matrix.)

The ratio of σ²_g and σ²_e defines the narrow-sense heritability of the trait:

h² = σ²_g / (σ²_g + σ²_e).

C. Lippert Linear models for GWAS II October 17th 2012 32
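h² can be estimated by maximizing the marginal likelihood in (3); a sketch with simulated data, using a grid search over h² and one eigendecomposition of K_g, with the total variance profiled out (an illustrative estimator, not the code behind these slides):

```python
import numpy as np

rng = np.random.default_rng(4)
N, S = 500, 1000
G = rng.standard_normal((N, S))
K = G @ G.T / S                      # toy genotype covariance K_g

# Simulate y ~ N(0, sigma2_g K + sigma2_e I) with true h2 = 0.8.
sigma2_g, sigma2_e = 0.8, 0.2
L = np.linalg.cholesky(sigma2_g * K + sigma2_e * np.eye(N))
y = L @ rng.standard_normal(N)

# ML over h2: with K = U diag(S) U^T, the covariance h2*K + (1-h2)*I has
# eigenvalues h2*S + (1-h2), so each grid point costs only O(N).
Svals, U = np.linalg.eigh(K)
y_rot = U.T @ y

def neg_loglik(h2):
    d = h2 * Svals + (1.0 - h2)
    s2p = np.mean(y_rot**2 / d)      # profiled total variance sigma2_g + sigma2_e
    return N * np.log(s2p) + np.sum(np.log(d))

grid = np.linspace(0.01, 0.99, 99)
h2_hat = grid[np.argmin([neg_loglik(h) for h in grid])]
```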
Variance component models Multi locus models
Heritability

Heritability estimated on 107 A. thaliana phenotypes
Global genetic heritability

(Figure: density of heritability estimates across the 107 phenotypes, heritability from 0.0 to 1.0.)

C. Lippert Linear models for GWAS II October 17th 2012 33
Variance component models Multi locus models
Heritability

Heritability estimated on 107 A. thaliana phenotypes

The estimate can be restricted to a genomic region such as a single chromosome, etc.:

N(y | 0, σ²_g Σ_{s∈Chrom} x_s x_s^T + σ²_e I)

(Figure: heatmap of per-chromosome heritability, chromosomes 1–5 by phenotype ID 1–106, scale 0.0 to 1.0.)

C. Lippert Linear models for GWAS II October 17th 2012 34
Variance component models Multi locus models
Window-based composite variance analysis
Region-based testing

Just fitting a particular region ignores the genome-wide context.

Variance dissection with region-based separation:

p(y | W) = N(y | 0, σ²_w Σ_{s∈W} x_s x_s^T + σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I)

with K_w = Σ_{s∈W} x_s x_s^T and K_g = Σ_{s∉W} x_s x_s^T.

Explained variance components can be read off, subject to suitable normalization of the covariances K_w and K_g.

"Local" heritability:

h²(W) = σ²_w / (σ²_w + σ²_g + σ²_e)

C. Lippert Linear models for GWAS II October 17th 2012 35
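Building the two kernels and reading off the local heritability can be sketched as follows (the window, genotypes, and fitted variance components are made-up; normalizing each kernel by its mean diagonal is one plausible choice of the "suitable normalization"):

```python
import numpy as np

rng = np.random.default_rng(5)
N, S = 60, 300
X = rng.standard_normal((N, S))      # toy genotypes
window = np.arange(100, 120)         # hypothetical window W
mask = np.zeros(S, dtype=bool)
mask[window] = True

def kernel(Z):
    # Normalize by the mean diagonal so variance components are comparable.
    K = Z @ Z.T
    return K / np.mean(np.diag(K))

K_w = kernel(X[:, mask])             # SNPs inside the window
K_g = kernel(X[:, ~mask])            # background SNPs outside the window

# Given fitted variance components (hypothetical values here),
# the "local" heritability of window W:
sigma2_w, sigma2_g, sigma2_e = 0.3, 0.5, 0.2
h2_local = sigma2_w / (sigma2_w + sigma2_g + sigma2_e)
```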
Variance component models Multi locus models
Window-based composite variance analysis
C. Lippert Linear models for GWAS II October 17th 2012 36
Variance component models Multi locus models
Window-based composite variance analysis
Significance testing

Analogously to fixed effect testing, the significance of a specific window can be tested.

Likelihood-ratio statistic to score the relevance of a particular genomic region W:

LOD(W) = N(y | 0, σ²_w Σ_{s∈W} x_s x_s^T + σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I) / N(y | 0, σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I)

P-values can be obtained from permutation statistics or analytical approximations (variants of score tests or likelihood ratio tests).

[Wu et al., 2011, Listgarten et al., 2012]
C. Lippert Linear models for GWAS II October 17th 2012 37
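The log of this likelihood ratio can be computed directly; a sketch with made-up variance components (in practice σ²_w, σ²_g, σ²_e would be refit under each model):

```python
import numpy as np

rng = np.random.default_rng(6)
N, S = 50, 200
X = rng.standard_normal((N, S))      # toy genotypes
mask = np.zeros(S, dtype=bool)
mask[20:40] = True                   # hypothetical window W
y = rng.standard_normal(N)

def log_gauss(y, cov):
    # Log density of N(y | 0, cov).
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(cov, y))

K_w = X[:, mask] @ X[:, mask].T / mask.sum()
K_g = X[:, ~mask] @ X[:, ~mask].T / (~mask).sum()

# Hypothetical fitted variance components for alternative and null models.
s2w, s2g, s2e = 0.2, 0.5, 0.5
cov_alt = s2w * K_w + s2g * K_g + s2e * np.eye(N)
cov_null = s2g * K_g + s2e * np.eye(N)

log_lod = log_gauss(y, cov_alt) - log_gauss(y, cov_null)
```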
Variance component models Phenotype prediction
Making predictions with linear mixed models

Linear model, accounting for a set of measured SNPs X:

p(y | X, θ, σ²) = N(y | Σ_{s=1}^S x_s θ_s, σ² I)

Prediction at an unseen test input x* given the maximum likelihood weights θ̂:

p(y* | x*, θ̂) = N(y* | x* θ̂, σ²)

Marginal likelihood:

p(y | X, σ², σ²_g) = ∫ N(y | Xθ, σ² I) N(θ | 0, σ²_g I) dθ = N(y | 0, σ²_g X X^T + σ² I)

where K = X X^T.

Making predictions with linear mixed models?

C. Lippert Linear models for GWAS II October 17th 2012 38
Variance component models Phenotype prediction
The Gaussian distribution

Linear mixed models are merely based on the good old multivariate Gaussian:

N(x | µ, K) = 1/√|2πK| exp( −1/2 (x − µ)^T K^{−1} (x − µ) )

K is the covariance matrix or kernel matrix.

C. Lippert Linear models for GWAS II October 17th 2012 39
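The density formula transcribes directly to code; a small NumPy sketch, sanity-checked against the product of univariate normals for a diagonal K (the evaluation point and K are arbitrary choices):

```python
import numpy as np

def mvn_density(x, mu, K):
    """N(x | mu, K) = exp(-0.5 (x-mu)^T K^{-1} (x-mu)) / sqrt(|2 pi K|)."""
    d = x - mu
    quad = d @ np.linalg.solve(K, d)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * K))
    return np.exp(-0.5 * quad) / norm

# For diagonal K the joint density factorizes into univariate normals.
x = np.array([0.3, -1.0])
mu = np.zeros(2)
K = np.diag([2.0, 0.5])
p_joint = mvn_density(x, mu, K)
p_prod = np.prod(np.exp(-0.5 * x**2 / np.diag(K))
                 / np.sqrt(2.0 * np.pi * np.diag(K)))
```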
Variance component models Phenotype prediction
A 2D Gaussian

(Figure: probability contour and samples of a 2D Gaussian over (y1, y2), both axes from −3 to 3.)

K = [ 1, 0.6; 0.6, 1 ]

C. Lippert Linear models for GWAS II October 17th 2012 40
Variance component models Phenotype prediction
A 2D Gaussian
Varying the covariance matrix

(Figure: three contour plots over (y1, y2) for
K = [ 1, 0.14; 0.14, 1 ],  K = [ 1, 0.6; 0.6, 1 ],  K = [ 1, −0.9; −0.9, 1 ].)

C. Lippert Linear models for GWAS II October 17th 2012 41
Variance component models Phenotype prediction
A 2D Gaussian
Inference

(Figure: conditioning a 2D Gaussian on an observed value; both axes from −3 to 3.)

C. Lippert Linear models for GWAS II October 17th 2012 42
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =?
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =?
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =
P(y, y )
P(y)
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =
P(y, y )
P(y)
=
N
y
y
| 0, σ2
g
K K:,
KT
:, K , σ2
e I
N 0, σ2
gK + σ2
eI
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
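To make the "ratio of Gaussians is again a Gaussian" step concrete, here is a minimal numeric sketch. The two-individual kinship matrix, variance components, and observed phenotype below are made-up illustrative values, not from the slides: it checks that $P(\mathbf{y}, y_\star)/P(\mathbf{y})$ coincides with the conditional density obtained by standard Gaussian conditioning.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    # Naive multivariate normal density; fine for tiny examples like this one.
    d = np.atleast_1d(x) - np.atleast_1d(mean)
    cov = np.atleast_2d(cov)
    k = d.shape[0]
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

# Hypothetical example: one training phenotype y, one new individual y_star.
sg2, se2 = 0.7, 0.3                      # assumed variance components
K_joint = np.array([[1.0, 0.5],          # relatedness of (training, new)
                    [0.5, 1.0]])
Sigma = sg2 * K_joint + se2 * np.eye(2)  # joint covariance of (y, y_star)

y = 1.2
# Standard Gaussian conditioning gives the predictive mean and variance directly.
mu_star = Sigma[1, 0] / Sigma[0, 0] * y
Sigma_star = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]

# The ratio P(y, y_star) / P(y) matches N(y_star | mu_star, Sigma_star)
# for every value of y_star, confirming the conditional is Gaussian.
for y_star in (-1.0, 0.0, 2.5):
    ratio = gauss_pdf([y, y_star], [0, 0], Sigma) \
          / gauss_pdf([y], [0], [[Sigma[0, 0]]])
    assert np.isclose(ratio, gauss_pdf([y_star], [mu_star], [[Sigma_star]]))
```

Because the ratio is normalized in $y_\star$ by construction, the constant factors really do drop out, as the slide argues.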
Inference
Gaussian conditioning in 2D

$$
\begin{aligned}
p(y_2 \mid y_1, K)
&= \frac{p(y_1, y_2 \mid K)}{p(y_1 \mid K)}
\propto \exp\!\left\{-\tfrac{1}{2}\,[y_1, y_2]\, K^{-1} \begin{bmatrix}y_1\\ y_2\end{bmatrix}\right\}\\
&= \exp\!\left\{-\tfrac{1}{2}\left(y_1^2 K^{-1}_{1,1} + y_2^2 K^{-1}_{2,2} + 2\, y_1 K^{-1}_{1,2}\, y_2\right)\right\}\\
&= \exp\!\left\{-\tfrac{1}{2}\left(y_2^2 K^{-1}_{2,2} + 2\, y_2 K^{-1}_{1,2}\, y_1 + C\right)\right\}\\
&= Z \exp\!\left\{-\tfrac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2\, y_2 \frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)\right\}\\
&= Z \exp\!\left\{-\tfrac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2\, y_2 \frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}} + \left(\frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)^{\!2}\right) + \tfrac{1}{2} K^{-1}_{2,2}\left(\frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)^{\!2}\right\}\\
&= Z' \exp\!\Bigg\{-\tfrac{1}{2}\, \underbrace{K^{-1}_{2,2}}_{\Sigma^{-1}} \Big(y_2 + \underbrace{\tfrac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}}_{-\mu}\Big)^{\!2}\Bigg\}
\propto \mathcal{N}(y_2 \mid \mu, \Sigma)
\end{aligned}
$$

Here $C$ collects the terms that do not depend on $y_2$, and $Z, Z'$ absorb constant factors. Reading off the completed square gives the conditional mean $\mu = -K^{-1}_{1,2}\, y_1 / K^{-1}_{2,2}$ and variance $\Sigma = \left(K^{-1}_{2,2}\right)^{-1}$.
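A quick numeric sanity check of the completed square, using an illustrative 2×2 covariance (the values are made up): the mean and variance read off from the precision matrix $K^{-1}$ must agree with the familiar bivariate conditioning formulas $\mu = (K_{2,1}/K_{1,1})\, y_1$ and $\Sigma = K_{2,2} - K_{2,1} K_{1,2} / K_{1,1}$.

```python
import numpy as np

# Illustrative 2x2 covariance; condition y2 on an observed y1.
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])
y1 = 2.0
Ki = np.linalg.inv(K)

# Read off from the completed square:
#   Sigma = 1 / K^-1_{2,2},  mu = -K^-1_{1,2} * y1 / K^-1_{2,2}
mu = -Ki[0, 1] * y1 / Ki[1, 1]
Sigma = 1.0 / Ki[1, 1]

# Standard conditioning formulas for a bivariate Gaussian agree:
assert np.isclose(mu, K[1, 0] / K[0, 0] * y1)                    # 1.6
assert np.isclose(Sigma, K[1, 1] - K[1, 0] * K[0, 1] / K[0, 0])  # 0.36
```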
Phenotype prediction
Best linear unbiased prediction

Given the phenotype values of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual.

$$
P(y_\star \mid \mathbf{y}) = \mathcal{N}\!\left(y_\star \mid \mu_\star,\; \sigma_g^2 V_g + \sigma_e^2 I\right)
$$

Predictive mean (BLUP): $\mu_\star = \sigma_g^2 K^g_{\star,:}\left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1}\mathbf{y}$

Predictive variance: $V_g = K^g_{\star,\star} - \sigma_g^2 K^g_{\star,:}\left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} K^g_{:,\star}$
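The BLUP formulas translate directly into a few lines of linear algebra. A minimal sketch with toy data (the kinship matrix, phenotypes, and variance components below are all made up, and $\sigma_g^2, \sigma_e^2$ are treated as known rather than estimated, e.g. by REML):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # training individuals, plus 1 new one
G = rng.standard_normal((n + 1, 50))    # toy genotype-like features
K_full = G @ G.T / 50.0                 # realized relationship matrix (train + new)

K = K_full[:n, :n]        # train/train kinship
K_star = K_full[n, :n]    # new/train cross-relatedness
K_ss = K_full[n, n]       # new individual's self-relatedness
y = rng.standard_normal(n)
sg2, se2 = 0.6, 0.4       # variance components, assumed known here

V = sg2 * K + se2 * np.eye(n)
mu_star = sg2 * K_star @ np.linalg.solve(V, y)            # BLUP predictive mean
V_g = K_ss - sg2 * K_star @ np.linalg.solve(V, K_star)    # genetic variance

# Cross-check against direct conditioning of the joint Gaussian of (y, y_star):
# the mean matches, and the conditional variance equals sg2 * V_g + se2.
S = sg2 * K_full + se2 * np.eye(n + 1)
mu_cond = S[n, :n] @ np.linalg.solve(S[:n, :n], y)
var_cond = S[n, n] - S[n, :n] @ np.linalg.solve(S[:n, :n], S[:n, n])
assert np.isclose(mu_star, mu_cond)
assert np.isclose(sg2 * V_g + se2, var_cond)
```

Using `np.linalg.solve` on $\sigma_g^2 K + \sigma_e^2 I$ avoids forming the inverse explicitly, which is the numerically preferable way to evaluate both BLUP expressions.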
Summary

Basic probability theory
Linear mixed models
Population structure correction
Parameter estimation
Variance component modeling
Phenotype prediction
Acknowledgements

Joint course material: O. Stegle
FaST-LMM: J. Listgarten, D. Heckerman
References I

P. Burton, D. Clayton, L. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. Kwiatkowski, M. McCarthy, W. Ouwehand, N. Samani, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, 1999.

H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.

C. Lippert, J. Listgarten, Y. Liu, C. Kadie, R. Davidson, and D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011. doi: 10.1038/nmeth.1681.

J. Listgarten, C. Lippert, and D. Heckerman. An efficient group test for genetic markers that handles confounding. arXiv preprint arXiv:1205.0793, 2012.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.

N. Patterson, A. L. Price, and D. Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006.
References II

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

M. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 2011.

Population structure correction
Linear mixed models (LMM)

y ~ N(y | Xβ, σ²_g K + σ²_e I)

Corrects for all levels of population structure.
ML estimation is computationally demanding: the likelihood is non-convex in σ²_g and σ²_e.
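The covariance matrix K in the model above is typically estimated from the SNP data itself. A minimal NumPy sketch of the realized relationship matrix (the linear kernel over standardized genotypes); the function name, sizes, and the exact standardization scheme are illustrative assumptions, not the slides' recipe:

```python
import numpy as np

def realized_relationship_matrix(G):
    """Linear kinship estimate from an N x S matrix G of minor-allele
    counts (0/1/2): standardize each SNP column, then K = Z Z^T / S,
    so every SNP contributes equally to the sample covariance."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    return Z @ Z.T / G.shape[1]

# tiny synthetic example: 50 samples, 200 SNPs
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 200)).astype(float)
K = realized_relationship_matrix(G)
```

Under this normalization K is symmetric positive semi-definite with trace equal to the number of samples, which is a common convention for variance component estimation.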
Population structure correction
GWAS for flowering time in Arabidopsis thaliana

Linear model:
[Figure: QQ-plot of expected vs. observed p-values; the observed −log p-values reach ≈25 against an expected range of ≈6, far above the diagonal.]

Linear mixed model:
[Figure: QQ-plot of expected vs. observed p-values; the observed −log p-values reach ≈12 and lie much closer to the diagonal.]
Population structure correction
Linear mixed models (LMM)

LMM log likelihood:
LL(β, σ²_g, σ²_e) = log N(y | Xβ, σ²_g K + σ²_e I)

Change of variables, introducing δ = σ²_e / σ²_g:
LL(β, σ²_g, δ) = log N(y | Xβ, σ²_g (K + δI))

The ML parameters β̂ and σ̂²_g follow in closed form given δ.
Use a numerical optimizer to solve the remaining 1-dimensional optimization problem over δ. [Kang et al., 2008]
Population structure correction
Linear mixed models (LMM): ML parameters

Gradient of the LMM log likelihood w.r.t. β:
∇_β log N(y | Xβ, σ²_g (K + δI))
  = ∇_β [ −(1 / 2σ²_g) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ) ]
  = (1 / σ²_g) [ Xᵀ (K + δI)⁻¹ y − Xᵀ (K + δI)⁻¹ Xβ ]

Set the gradient to zero:
0 = Xᵀ (K + δI)⁻¹ y − Xᵀ (K + δI)⁻¹ Xβ
Xᵀ (K + δI)⁻¹ Xβ = Xᵀ (K + δI)⁻¹ y
β_ML = (Xᵀ (K + δI)⁻¹ X)⁻¹ Xᵀ (K + δI)⁻¹ y

Note that this solution is analogous to the ML solution of linear regression, (Xᵀ X)⁻¹ Xᵀ y.
Population structure correction
Linear mixed models (LMM): ML parameters

Derivative of the LMM log likelihood w.r.t. σ²_g:
∂/∂σ²_g log N(y | Xβ, σ²_g (K + δI))
  = −(1/2) [ N / σ²_g − (1 / σ⁴_g) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ) ]

Set the derivative to zero:
σ²_g,ML = (1/N) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ)

Bottleneck: for every SNP that we test, we need to calculate (K + δI)⁻¹.
Done naively, this is an O(N³) operation per SNP (very expensive!).
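The two closed-form solutions above suggest a simple fitting strategy: profile out β and σ²_g, then search only over δ. A NumPy sketch of this, using a log-spaced grid for the 1-D search (a bounded scalar optimizer would also do); function names, grid bounds, and the synthetic data are illustrative assumptions:

```python
import numpy as np

def profile_loglik(delta, y, X, K):
    """LMM log-likelihood profiled over beta and sigma_g^2: for a fixed
    ratio delta = sigma_e^2 / sigma_g^2, both have closed-form ML values."""
    N = len(y)
    V = K + delta * np.eye(N)
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)  # GLS estimate of beta
    r = y - X @ beta
    sg2 = (r @ Vi @ r) / N                              # closed-form ML variance
    logdet = np.linalg.slogdet(V)[1]
    ll = -0.5 * (N * np.log(2 * np.pi * sg2) + logdet + N)
    return ll, beta, sg2

def fit_delta(y, X, K, deltas=np.logspace(-4, 4, 81)):
    """Search the single remaining parameter delta on a log-spaced grid."""
    return max(((profile_loglik(d, y, X, K), d) for d in deltas),
               key=lambda t: t[0][0])

# small synthetic problem
rng = np.random.default_rng(1)
N = 20
A = rng.normal(size=(N, N))
K = A @ A.T / N                                         # some PSD "kinship"
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
(ll, beta, sg2), delta = fit_delta(y, X, K)
```

The profiled value agrees with the dense Gaussian log-density evaluated at (β_ML, σ²_g,ML), which is a convenient sanity check on the algebra.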
Population structure correction
FaST-LMM

Spectral decomposition K = U S Uᵀ, with U orthogonal (Uᵀ U = I) and S diagonal:

N(y | Xβ, σ²_g (K + δI))
  = N(y | Xβ, σ²_g (U S Uᵀ + δI))
  = N(Uᵀ y | Uᵀ Xβ, σ²_g (Uᵀ U S Uᵀ U + δ Uᵀ U))
  = N(Uᵀ y | Uᵀ Xβ, σ²_g (S + δI))

[Lippert et al., 2011]
Population structure correction
FaST-LMM

N(Uᵀ y | Uᵀ Xβ, σ²_g (S + δI))     (2)

Factored Spectrally Transformed LMM:
O(N³) once, for the spectral decomposition.
O(N²) runtime per SNP tested (multiplication with Uᵀ).
O(N²) memory for storing K and U.

[Lippert et al., 2011]
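The rotation argument above can be checked numerically: after one eigendecomposition K = U S Uᵀ, the covariance of the rotated data is diagonal, so each likelihood evaluation becomes O(N) instead of requiring a dense solve. A sketch under illustrative names and synthetic data:

```python
import numpy as np

def dense_loglik(y, X, beta, sg2, delta, K):
    """Naive LMM log-likelihood with the dense covariance sg2*(K + delta*I)."""
    N = len(y)
    V = sg2 * (K + delta * np.eye(N))
    r = y - X @ beta
    return -0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(V)[1]
                   + r @ np.linalg.solve(V, r))

def fast_loglik(ty, tX, beta, sg2, delta, S):
    """Same likelihood after rotating y and X by U^T once: the rotated
    covariance sg2*(S + delta) is diagonal, so evaluation is O(N)."""
    d = sg2 * (S + delta)
    r = ty - tX @ beta
    return -0.5 * (len(ty) * np.log(2 * np.pi) + np.log(d).sum()
                   + (r ** 2 / d).sum())

rng = np.random.default_rng(2)
N = 30
A = rng.normal(size=(N, N))
K = A @ A.T / N
S, U = np.linalg.eigh(K)              # spectral decomposition K = U diag(S) U^T
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
beta = np.array([0.1, -0.3])
```

Both functions return the same value for any (β, σ²_g, δ), which is exactly the chain of equalities on the slide.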
Population structure correction
Summary: population structure correction

Genomic control
Simple method.
Works with any statistical test.
Can be combined with other correction methods.
Very conservative!

Eigenstrat (PCA)
Corrects well for differences on the population level.
Does not work well for closer relatedness.

Linear mixed models
Correct well for most forms of relatedness.
Population structure correction
Overview

Single-marker association model with random effect term:
y = x_s β_s (genetic effect) + u (random effect, covariates) + ε (noise)

Shortcomings:
Weak effects are not captured by single-marker analysis.
Complex traits are controlled by more than a single SNP.
Population structure correction
Multilocus models

Generalization to multiple genetic factors:
y = Σ_{s=1..S} x_s β_s (genetic effect) + u (random effect, covariates) + ε (noise)

Challenge: N ≪ S, so explicit estimation of all β_s is not feasible.

Solutions:
Regularize the β_s (ridge regression, LASSO).
Variance component modeling. [Wu et al., 2011]
Outline

Probability theory
Population structure
Population structure correction
Variance component models
Multilocus models
Phenotype prediction
Variance component models
Multilocus models: random effect models

For now, drop the random effect term:
y = Σ_{s=1..S} x_s β_s + ε

For mathematical convenience, choose a shared Gaussian distribution on the weights, and Gaussian noise:
p(β_1, …, β_S) = Π_{s=1..S} N(β_s | 0, σ²_g)
p(ε) = N(ε | 0, σ²_e I)

Marginalize out the weights β_1, …, β_S:
p(y | X, σ²_g, σ²_e) = ∫ N(y | Σ_s x_s β_s, σ²_e I) · Π_s N(β_s | 0, σ²_g) dβ
                     = N(y | 0, σ²_g Σ_s x_s x_sᵀ + σ²_e I)
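The marginalization step can be checked empirically: drawing weights and noise from their priors and forming y should give a sample covariance close to σ²_g XXᵀ + σ²_e I. A Monte Carlo sketch; the sizes, seed, and variances are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S = 4, 7
sg2, se2 = 0.5, 0.2
X = rng.normal(size=(N, S))

# draw many phenotypes from the generative model: y = X beta + eps,
# with beta_s ~ N(0, sg2) i.i.d. and eps ~ N(0, se2 * I)
n_draws = 200_000
betas = rng.normal(scale=np.sqrt(sg2), size=(S, n_draws))
eps = rng.normal(scale=np.sqrt(se2), size=(N, n_draws))
Y = X @ betas + eps                       # each column is one draw of y

empirical_cov = Y @ Y.T / n_draws         # y has mean zero, so no centering
analytic_cov = sg2 * X @ X.T + se2 * np.eye(N)
```

With enough draws the empirical covariance converges to the analytic one, which is the closed-form marginal on the slide.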
Variance component models
Multilocus models: remarks

p(y | X, σ²_g, σ²_e) = N(y | 0, σ²_g Σ_{s=1..S} x_s x_sᵀ + σ²_e I),  with K_g = Σ_s x_s x_sᵀ.     (3)

K_g is the genotype covariance matrix, closely related to the kinship matrix explaining population structure.
Inference can be done by maximum likelihood.
The ratio of σ²_g and σ²_e defines the narrow-sense heritability of the trait:
h² = σ²_g / (σ²_g + σ²_e)

[Figure: heatmap of the genotype covariance matrix K_g across samples.]
Variance component models
Multilocus models: heritability

Heritability estimated on 107 A. thaliana phenotypes.

[Figure: density histogram of the estimated global genetic heritability across the 107 phenotypes, with heritability ranging from 0 to 1 on the x-axis.]
Variance component models
Multilocus models: heritability

The estimate can be restricted to a genomic region, such as a single chromosome:
N(y | 0, σ²_g Σ_{s ∈ Chrom} x_s x_sᵀ + σ²_e I)

[Figure: heatmap of per-chromosome heritability estimates (chromosomes 1–5 vs. phenotype IDs 1–107), values from 0.0 to 1.0.]
Variance component models
Multilocus models: window-based composite variance analysis

Region-based testing: just fitting a particular region ignores the genome-wide context.

Variance dissection with region-based separation:
p(y | W) = N(y | 0, σ²_w Σ_{s ∈ W} x_s x_sᵀ + σ²_g Σ_{s ∉ W} x_s x_sᵀ + σ²_e I),
with window component K_w = Σ_{s ∈ W} x_s x_sᵀ and background component K_g = Σ_{s ∉ W} x_s x_sᵀ.

The explained variance components can be read off, subject to suitable normalization of the covariances K_w and K_g.

"Local" heritability:
h²(W) = σ²_w / (σ²_w + σ²_g + σ²_e)
  • 130.
Variance component models / Multi-locus models

Window-based composite variance analysis

[Figure slide]
Variance component models / Multi-locus models

Window-based composite variance analysis: significance testing

Analogously to fixed-effect testing, the significance of a specific window can be tested. Likelihood-ratio statistic (log of the likelihood ratio) to score the relevance of a particular genomic region W:

\mathrm{LOD}(W) = \log \frac{\mathcal{N}\!\left(y \;\middle|\; 0,\; \sigma_w^2 \sum_{s \in W} x_s x_s^T + \sigma_g^2 \sum_{s \notin W} x_s x_s^T + \sigma_e^2 I\right)}{\mathcal{N}\!\left(y \;\middle|\; 0,\; \sigma_g^2 \sum_{s \notin W} x_s x_s^T + \sigma_e^2 I\right)}

P-values can be obtained from permutation statistics or analytical approximations (variants of score tests or likelihood-ratio tests). [Wu et al., 2011, Listgarten et al., 2012]
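The LOD score amounts to two maximum-likelihood fits, one with and one without the window's variance component. A toy version on simulated data, using a shared grid for all variance components (a real implementation would optimize the variance components properly):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 80
Xw = rng.standard_normal((n, 30))      # SNPs in the window W (simulated)
Xg = rng.standard_normal((n, 300))     # SNPs outside the window (simulated)
Kw, Kg = Xw @ Xw.T / 30, Xg @ Xg.T / 300
y = np.linalg.cholesky(0.3 * Kw + 0.3 * Kg + 0.4 * np.eye(n)) @ rng.standard_normal(n)

def log_lik(y, C):
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

def max_log_lik(y, kernels, grid):
    """Maximize log N(y | 0, sum_i v_i K_i + ve*I) over a crude grid."""
    I, best = np.eye(len(y)), -np.inf
    for vs in product(grid, repeat=len(kernels)):
        for ve in grid:
            if ve <= 0:                # noise variance must stay positive
                continue
            C = ve * I + sum(v * K for v, K in zip(vs, kernels))
            best = max(best, log_lik(y, C))
    return best

grid = (0.0, 0.1, 0.2, 0.4, 0.8, 1.6)
ll_alt = max_log_lik(y, [Kw, Kg], grid)   # window + background + noise
ll_null = max_log_lik(y, [Kg], grid)      # background + noise only
lod = ll_alt - ll_null                    # >= 0: the null is nested (vw = 0)
```

P-values would then come from permutations of y or from the analytic approximations cited on the slide.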
Variance component models / Phenotype prediction

Making predictions with linear mixed models

Linear model, accounting for a set of measured SNPs X:

p(y \mid X, \theta, \sigma^2) = \mathcal{N}\!\left(y \;\middle|\; \sum_{s=1}^{S} x_s \theta_s,\; \sigma^2 I\right)

Prediction at an unseen test input x^*, given the maximum-likelihood weights \hat{\theta}:

p(y^* \mid x^*, \hat{\theta}) = \mathcal{N}\!\left(y^* \;\middle|\; x^{*T} \hat{\theta},\; \sigma^2\right)

Marginal likelihood:

p(y \mid X, \sigma^2, \sigma_g^2) = \int_\theta \mathcal{N}(y \mid X\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid 0, \sigma_g^2 I)\, d\theta = \mathcal{N}\!\Big(y \;\Big|\; 0,\; \sigma_g^2 \underbrace{XX^T}_{K} + \sigma^2 I\Big)

Making predictions with linear mixed models?
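The last identity, integrating out the weights to get a zero-mean Gaussian with covariance \sigma_g^2 XX^T + \sigma^2 I, can be checked by simulation. A small Monte Carlo sanity check (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, S = 4, 3                             # tiny dimensions keep the check fast
X = rng.standard_normal((n, S))
sg2, se2 = 0.5, 0.2

# Analytic marginal covariance after integrating out theta ~ N(0, sg2*I):
C = sg2 * X @ X.T + se2 * np.eye(n)

# Monte Carlo check: sample theta and noise, form y = X @ theta + eps.
m = 200_000
theta = rng.standard_normal((m, S)) * np.sqrt(sg2)
eps = rng.standard_normal((m, n)) * np.sqrt(se2)
Y = theta @ X.T + eps
C_emp = Y.T @ Y / m                     # empirical covariance (mean is zero)
```

The empirical covariance of the sampled y matches the kernel form up to Monte Carlo noise.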
Variance component models / Phenotype prediction

The Gaussian distribution

Linear mixed models are merely based on the good old multivariate Gaussian:

\mathcal{N}(x \mid \mu, K) = \frac{1}{\sqrt{|2\pi K|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T K^{-1} (x - \mu)\right)

K is called the covariance matrix or kernel matrix.
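The density translates directly into code. A small sketch that checks a general slogdet/solve implementation of the formula against the explicit 2x2 case with a hand-inverted covariance matrix:

```python
import numpy as np

def gauss_pdf(x, mu, K):
    """N(x | mu, K) = |2*pi*K|^(-1/2) * exp(-(x-mu)^T K^{-1} (x-mu) / 2)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * K)
    return np.exp(-0.5 * (logdet + d @ np.linalg.solve(K, d)))

# Explicit 2D check: invert the 2x2 covariance by hand.
K = np.array([[1.0, 0.6], [0.6, 1.0]])
x, mu = np.array([0.5, -0.3]), np.zeros(2)
det = K[0, 0] * K[1, 1] - K[0, 1] * K[1, 0]
Kinv = np.array([[K[1, 1], -K[0, 1]], [-K[1, 0], K[0, 0]]]) / det
d = x - mu
direct = np.exp(-0.5 * d @ Kinv @ d) / (2 * np.pi * np.sqrt(det))
```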
Variance component models / Phenotype prediction

A 2D Gaussian

[Figure: probability contour and samples of a 2D Gaussian over (y_1, y_2), both axes from -3 to 3.]

K = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}
Variance component models / Phenotype prediction

A 2D Gaussian: varying the covariance matrix

[Figure: three sample clouds over (y_1, y_2), axes from -3 to 3, for the covariance matrices below.]

K = \begin{pmatrix} 1 & 0.14 \\ 0.14 & 1 \end{pmatrix}, \quad K = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \quad K = \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix}
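To see the effect of the off-diagonal entry, one can draw samples from each covariance matrix and recover the correlation empirically (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
rho_hats = {}
for rho in (0.14, 0.6, -0.9):
    K = np.array([[1.0, rho], [rho, 1.0]])
    # Draw many samples from N(0, K) and measure the empirical correlation.
    Y = rng.multivariate_normal(np.zeros(2), K, size=100_000)
    rho_hats[rho] = np.corrcoef(Y.T)[0, 1]
```

The empirical correlations recover the off-diagonal entries, which is what the three sample clouds on the slide visualize.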
Variance component models / Phenotype prediction

A 2D Gaussian: inference

[Figure: inference in the 2D Gaussian, both axes from -3 to 3.]
Variance component models / Phenotype prediction

Phenotype prediction: best linear unbiased prediction

Given the phenotype values y of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual, y^*. Use the conditional probability distribution:

P(y^* \mid y) = \frac{P(y, y^*)}{P(y)} = \frac{\mathcal{N}\!\left(\begin{bmatrix} y \\ y^* \end{bmatrix} \middle|\; 0,\; \sigma_g^2 \begin{bmatrix} K & K_{:,*} \\ K_{:,*}^T & K_{*,*} \end{bmatrix} + \sigma_e^2 I\right)}{\mathcal{N}\!\left(y \mid 0,\; \sigma_g^2 K + \sigma_e^2 I\right)} \propto \mathcal{N}\!\left(\begin{bmatrix} y \\ y^* \end{bmatrix} \middle|\; 0,\; \sigma_g^2 \begin{bmatrix} K & K_{:,*} \\ K_{:,*}^T & K_{*,*} \end{bmatrix} + \sigma_e^2 I\right) = \mathcal{N}(y^* \mid \mu^*, \Sigma^*)

Note that the result is again a Gaussian distribution! As a Gaussian distribution is always normalized, we can drop any constant terms that do not contain y^*. Completing the square identifies the mean \mu^* and the variance \Sigma^* of y^* given y.
Variance component models / Phenotype prediction

Inference: Gaussian conditioning in 2D

p(y_2 \mid y_1, K) = \frac{p(y_1, y_2 \mid K)}{p(y_1 \mid K)} \propto \exp\!\left(-\frac{1}{2} [y_1, y_2]\, K^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)

= \exp\!\left\{-\frac{1}{2}\left(y_1^2 K^{-1}_{1,1} + y_2^2 K^{-1}_{2,2} + 2 y_1 K^{-1}_{1,2} y_2\right)\right\}

= \exp\!\left\{-\frac{1}{2}\left(y_2^2 K^{-1}_{2,2} + 2 y_2 K^{-1}_{1,2} y_1\right) + C\right\}

= Z \exp\!\left\{-\frac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2 y_2 \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)\right\}

= Z \exp\!\left\{-\frac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2 y_2 \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}} + \left(\frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)^2\right) + \frac{1}{2} K^{-1}_{2,2}\left(\frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)^2\right\}

= Z' \exp\!\left\{-\frac{1}{2} \underbrace{K^{-1}_{2,2}}_{\Sigma^{-1}}\Big(y_2 \underbrace{{}+ \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}}_{-\mu}\Big)^2\right\} \propto \mathcal{N}(y_2 \mid \mu, \Sigma)
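Completing the square gives \mu = -K^{-1}_{1,2} y_1 / K^{-1}_{2,2} and \Sigma = (K^{-1}_{2,2})^{-1}. These agree with the textbook conditioning formulas \mu = (K_{1,2}/K_{1,1})\, y_1 and \Sigma = K_{2,2} - K_{1,2}^2/K_{1,1}, which a two-line check confirms:

```python
import numpy as np

K = np.array([[1.0, 0.6], [0.6, 1.0]])
Kinv = np.linalg.inv(K)
y1 = 1.5

# Mean and variance read off from the completed square on the slide:
mu = -Kinv[0, 1] * y1 / Kinv[1, 1]
Sigma = 1.0 / Kinv[1, 1]

# Standard conditioning formulas for a zero-mean 2D Gaussian:
mu_std = K[0, 1] / K[0, 0] * y1
Sigma_std = K[1, 1] - K[0, 1] ** 2 / K[0, 0]
```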
Variance component models / Phenotype prediction

Phenotype prediction: best linear unbiased prediction

Given the phenotype values of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual.

P(y^* \mid y) = \mathcal{N}\!\left(y^* \;\middle|\; \mu^*,\; \sigma_g^2 V_g^* + \sigma_e^2 I\right)

Predictive mean (BLUP):

\mu^* = \sigma_g^2 K_g^{*,:} \left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} y

Predictive variance:

V_g^* = K_g^{*,*} - \sigma_g^2 K_g^{*,:} \left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} K_g^{:,*}
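The BLUP formulas translate directly into code. A minimal numpy sketch on simulated relatedness (the sizes, seed, and variance components are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_star, s = 60, 5, 200               # training individuals, test individuals, SNPs

# Simulated relatedness for training + test individuals (illustrative).
X = rng.standard_normal((n + n_star, s))
K_full = X @ X.T / s
K = K_full[:n, :n]                      # K_g       (train x train)
K_star = K_full[n:, :n]                 # K_g^{*,:} (test x train)
K_ss = K_full[n:, n:]                   # K_g^{*,*} (test x test)

sg2, se2 = 0.6, 0.4
y = np.linalg.cholesky(sg2 * K + se2 * np.eye(n)) @ rng.standard_normal(n)

C = sg2 * K + se2 * np.eye(n)
mu_star = sg2 * K_star @ np.linalg.solve(C, y)               # BLUP predictive mean
V_star = K_ss - sg2 * K_star @ np.linalg.solve(C, K_star.T)  # predictive variance term
```

Note the solve against sigma_g^2 K_g + sigma_e^2 I is shared between the mean and the variance, so it only needs to be factorized once.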
Variance component models / Phenotype prediction

Summary

- Basic probability theory
- Linear mixed models
- Population structure correction
- Parameter estimation
- Variance component modeling
- Phenotype prediction
Variance component models / Phenotype prediction

Acknowledgements

Joint course material: O. Stegle
FaST-LMM: J. Listgarten, D. Heckerman
References I

P. Burton, D. Clayton, L. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. Kwiatkowski, M. McCarthy, W. Ouwehand, N. Samani, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, 1999.

H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.

C. Lippert, J. Listgarten, Y. Liu, C. Kadie, R. Davidson, and D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011. doi: 10.1038/nmeth.1681.

J. Listgarten, C. Lippert, and D. Heckerman. An efficient group test for genetic markers that handles confounding. arXiv preprint arXiv:1205.0793, 2012.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.

N. Patterson, A. L. Price, and D. Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006.
References II

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

M. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 2011.