Linear models for GWAS
II: Linear mixed models

Christoph Lippert
Microsoft Research, Los Angeles, USA

Current topics in computational biology, UCLA
October 17th, 2012

C. Lippert, Linear models for GWAS II, October 17th, 2012
Course structure

October 15th
  Introduction
    Terminology
    Study design
    Data preparation
    Challenges and pitfalls
    Course overview
  Linear regression
    Parameter estimation
    Statistical testing

October 17th
  Basic probability theory
  Linear mixed models
    Population structure correction
    Parameter estimation
    Variance component modeling
    Phenotype prediction
Probability Theory

Probabilities
  Let X be a random variable, defined over a set (or measurable space) 𝒳.
  P(X = x) denotes the probability that X takes value x, in short p(x).
  Probabilities are positive: P(X = x) ≥ 0.
  Probabilities sum (or integrate) to one:
    ∫_{x∈𝒳} p(x) dx = 1   (continuous case)
    Σ_{x∈𝒳} p(x) = 1      (discrete case)
Probability Theory

Probability Theory
  Here n_{i,j} counts the draws with X = x_i and Y = y_j out of N total draws,
  and c_i = Σ_j n_{i,j} counts the draws with X = x_i.

Joint Probability
  P(X = x_i, Y = y_j) = n_{i,j} / N

Marginal Probability
  P(X = x_i) = c_i / N

Conditional Probability
  P(Y = y_j | X = x_i) = n_{i,j} / c_i

Product Rule
  P(X = x_i, Y = y_j) = n_{i,j} / N = (n_{i,j} / c_i) · (c_i / N)
                      = P(Y = y_j | X = x_i) P(X = x_i)

Sum Rule
  P(X = x_i) = c_i / N = (1/N) Σ_{j=1}^{L} n_{i,j} = Σ_j P(X = x_i, Y = y_j)

(C.M. Bishop, Pattern Recognition and Machine Learning)
Probability Theory

The Rules of Probability
  Sum Rule:     p(x) = Σ_y p(x, y)
  Product Rule: p(x, y) = p(y | x) p(x)
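The counting definitions and the two rules above can be checked numerically on a table of counts. A minimal sketch with a made-up 3×4 count table (all names and numbers here are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical 3x4 table of counts n[i, j]: number of draws with X = x_i, Y = y_j.
n = np.array([[10, 4, 6, 0],
              [3, 7, 5, 5],
              [2, 1, 9, 8]], dtype=float)
N = n.sum()                                       # total number of draws
p_xy = n / N                                      # joint: P(X = x_i, Y = y_j)
p_x = p_xy.sum(axis=1)                            # sum rule: p(x) = sum_y p(x, y)
p_y_given_x = n / n.sum(axis=1, keepdims=True)    # conditional: n_ij / c_i

# Product rule: p(x, y) = p(y | x) p(x), recovered exactly from the counts.
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)
```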
Probability Theory

Bayes' Theorem
  Using the product rule we obtain
    p(y | x) = p(x | y) p(y) / p(x),   where   p(x) = Σ_y p(x | y) p(y).
Probability Theory

Bayesian probability calculus
  Bayes' rule is the basis for inference and learning.
  Assume we have a model with parameters θ, e.g. y = θ₀ + θ₁ · x.
  Goal: learn the parameters θ given data D.
    p(θ | D) = p(D | θ) p(θ) / p(D)
    posterior ∝ likelihood · prior
  [Figure: linear fit through data points in the X-Y plane, with a prediction at a test input x*]
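The relation posterior ∝ likelihood · prior can be made concrete on a discrete grid over θ. A minimal sketch (the coin-flip data and the grid are invented for illustration):

```python
import numpy as np

# theta: coin bias on a grid; data D: 7 heads in 10 flips.
theta = np.linspace(0.01, 0.99, 99)
prior = np.full_like(theta, 1.0 / len(theta))   # uniform prior p(theta)
likelihood = theta**7 * (1 - theta)**3          # p(D | theta), up to the binomial constant
posterior = likelihood * prior
posterior /= posterior.sum()                    # normalize by p(D) = sum_theta p(D|theta) p(theta)
theta_map = theta[np.argmax(posterior)]         # posterior mode
```

With a uniform prior the posterior mode coincides with the maximum-likelihood estimate 7/10.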
Probability Theory

Probability distributions

Gaussian
  p(x | µ, σ²) = N(x | µ, σ²) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )
  [Plot: univariate standard normal density over x ∈ [−6, 6]]

Multivariate Gaussian
  p(x | µ, Σ) = N(x | µ, Σ) = |2πΣ|^{−1/2} exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) )
  [Plot: contours of a bivariate Gaussian with Σ = [[1, 0.8], [0.8, 1]]]
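Both densities above can be written out directly. A small sketch (gauss_pdf and mvn_pdf are names chosen here, not from any library):

```python
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density |2 pi Sigma|^{-1/2} exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])        # the covariance from the slide
p_uni = gauss_pdf(0.0, 0.0, 1.0)                  # standard normal at its mode
p_multi = mvn_pdf(np.zeros(2), np.zeros(2), Sigma)
```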
Population Structure

Genome-wide association studies (GWAS)
  Identify associations between variable genetic loci and phenotypes.
  Linear and logistic regression
  Statistical dependence tests (F-test, likelihood ratio)
    Likelihood ratio:  N(y | Xβ; σ²_e I) / N(y | 0; σ²_e I)   (1)
  [Figure: aligned sequences with one variable site (SNP column X, alleles C/T)
   and the phenotype vector y; the tested model is y ∼ N(Xβ; σ²_e I)]
Population Structure

Population stratification
  Confounding structure leads to false positives:
    Population structure
    Family structure
    Cryptic relatedness
  [Figure: SNP column X and phenotype y as before, now with samples related
   through population structure (distance), family structure, and cryptic
   relatedness; the tested model is still y ∼ N(Xβ; σ²_e I)]
Population Structure

Population stratification
  GWA on inflammatory bowel disease (WTCCC)
  3.4k cases, 11.9k controls
  Methods: linear regression, likelihood ratio test
  [Burton et al., 2007]
Population structure correction

Outline
  Probability Theory
  Population Structure
  Population structure correction
  Variance component models
  Multi-locus models
  Phenotype prediction
Population structure correction

Genomic control [Devlin and Roeder, Biometrics 1999]

Genomic control λ
  λ = median(2·LR) / median(χ²₁)
  (the denominator is the theoretical median of the χ² distribution with 1 d.o.f., ≈ 0.455)
  λ = 1: calibrated P-values
  λ > 1: inflation
  λ < 1: deflation
  Correct by dividing the test statistic by λ.
  Applicable in combination with every method.
  Does not change the (non-)uniformity of P-values.
  Very conservative.
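The correction above can be sketched on simulated test statistics (genomic_control is a name chosen here; the constant is the theoretical χ²₁ median):

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364                  # median of the chi^2 distribution, 1 d.o.f.

def genomic_control(stats):
    """lambda = median(observed 2*LR) / median(chi^2_1); divide statistics by lambda."""
    lam = np.median(stats) / CHI2_1_MEDIAN
    return lam, np.asarray(stats) / lam

rng = np.random.default_rng(0)
calibrated = rng.chisquare(df=1, size=100_000)       # well-calibrated scan
inflated = 1.5 * rng.chisquare(df=1, size=100_000)   # scan inflated by confounding

lam_ok, _ = genomic_control(calibrated)
lam_bad, corrected = genomic_control(inflated)
```

After dividing by λ the median of the corrected statistics matches the χ²₁ median again, but the uniform inflation factor cannot distinguish true signals from confounding, hence the conservativeness noted above.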
Population structure correction

Eigenstrat
  Population structure causes genome-wide correlations between SNPs.
  A large part of the total variation in the SNPs can be explained by
  population differences.
  Novembre et al. [2008] show that the eigenvectors of the SNP covariance
  matrix reflect population structure.
  Eigenstrat uses this property to correct for population structure in GWAS.
  [Price et al., 2006, Patterson et al., 2006, Novembre et al., 2008]
Population structure correction

Eigenstrat
  Eigenstrat procedure:
    Compute the covariance matrix of the centered genotypes X − E[X]
    (genome-wide SNP covariance).
    Compute the eigenvectors of the covariance matrix.
    Add the eigenvector with the largest eigenvalue as a covariate to the
    regression.
    Repeat until the P-values are uniform.
  [Figure: genome-wide SNP covariance of X − E[X], its eigenvectors and
   eigenvalues; the top eigenvectors are added as covariates to the per-SNP
   regression]
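The procedure can be sketched on simulated genotypes with two hypothetical subpopulations (all names, sizes, and frequencies are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp = 200, 500
# Two subpopulations with independently drawn allele frequencies per SNP.
freqs = np.vstack([rng.uniform(0.1, 0.9, n_snp), rng.uniform(0.1, 0.9, n_snp)])
pop = np.repeat([0, 1], n_ind // 2)
X = rng.binomial(2, freqs[pop]).astype(float)   # genotype matrix, entries 0/1/2

Xc = X - X.mean(axis=0)                         # center: X - E[X]
C = Xc @ Xc.T / n_snp                           # genome-wide SNP covariance (individuals x individuals)
evals, evecs = np.linalg.eigh(C)                # eigh returns eigenvalues in ascending order
pc1 = evecs[:, -1]                              # eigenvector with the largest eigenvalue

# Add the top eigenvector (plus an intercept) as covariates to the per-SNP regression.
covariates = np.column_stack([np.ones(n_ind), pc1])
```

On data with this much differentiation, the first eigenvector tracks the subpopulation labels almost perfectly, which is exactly the property Eigenstrat exploits.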
Population structure correction

Linear mixed models (LMM)
  Covariance matrix K, estimated from SNP data:
    Kinship coefficients
    Identity by state
    Identity by descent
    Realized relationship matrix (linear)
  Sample a random effect u ∼ N(0; σ²_g K), then sample the phenotype
  y ∼ N(Xβ + u; σ²_e I). Integrating out u gives the marginal likelihood:
    ∫_u N(y | Xβ + u; σ²_e I) N(u | 0; σ²_g K) du = N(y | Xβ; σ²_g K + σ²_e I)
  [Figure: SNP column X and phenotype y with samples related through
   population structure, family structure, and cryptic relatedness; genetic
   similarity enters the model through the covariance K]
Population structure correction

Linear mixed models (LMM)
  Corrects for all levels of population structure.
  ML estimation is computationally demanding:
    Non-convex in σ²_g and σ²_e.
  Model:  y ∼ N(Xβ; σ²_g K + σ²_e I)
  [Figure: as before, with the phenotype covariance σ²_g K + σ²_e I]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana

Linear Model:
  [QQ-plot of observed vs. expected P-values; on the −log₁₀ scale the
   observed values reach ≈25 against an expected range of ≈6, showing strong
   inflation]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana
  Linear Model vs. Linear Mixed Model:
  [Comparison of the association scans under the linear model and the linear
   mixed model]
Population structure correction

GWAS for Flowering Time in Arabidopsis thaliana

Linear Mixed Model:
  [QQ-plot of observed vs. expected P-values; on the −log₁₀ scale the
   observed values reach ≈12 against an expected range of ≈6, much closer to
   the diagonal than under the linear model]
Population structure correction

Linear mixed models (LMM)

LMM log likelihood
  LL(β, σ²_g, σ²_e) = log N(y | Xβ; σ²_g K + σ²_e I).
  Change of variables, introducing δ = σ²_e / σ²_g:
    LL(β, σ²_g, δ) = log N(y | Xβ; σ²_g (K + δI)).
  The ML parameters β̂ and σ̂²_g follow in closed form.
  Use an optimizer to solve the 1-dimensional optimization problem over δ.
  [Kang et al., 2008]
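Under the δ-parameterization, β and σ²_g can be profiled out in closed form, leaving a one-dimensional problem in δ. A sketch of this idea (function names are chosen here; a grid search stands in for a proper 1-D optimizer, and the toy K is not a real kinship matrix):

```python
import numpy as np

def neg_ll_delta(delta, y, X, K):
    """Negative LMM log likelihood in delta, with beta and sigma_g^2 profiled out."""
    n = len(y)
    Kd = K + delta * np.eye(n)
    Kd_inv_y = np.linalg.solve(Kd, y)
    Kd_inv_X = np.linalg.solve(Kd, X)
    beta = np.linalg.solve(X.T @ Kd_inv_X, X.T @ Kd_inv_y)  # closed-form beta_ML
    r = y - X @ beta
    sg2 = r @ np.linalg.solve(Kd, r) / n                    # closed-form sigma_g^2 ML
    _, logdet = np.linalg.slogdet(Kd)
    # -LL = n/2 log(2 pi sg2) + 1/2 log|K + delta I| + n/2
    return 0.5 * (n * np.log(2 * np.pi * sg2) + logdet + n)

def fit_delta(y, X, K, grid=np.logspace(-3, 3, 61)):
    """1-D optimization over delta, here by grid search."""
    return min(grid, key=lambda d: neg_ll_delta(d, y, X, K))

rng = np.random.default_rng(3)
n = 60
G = rng.standard_normal((n, 2 * n))
K = G @ G.T / (2 * n)                                  # toy positive-definite "kinship"
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)
delta_hat = fit_delta(y, X, K)
```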
Population structure correction
Linear mixed models (LMM)
ML parameters
Gradient of the LMM log likelihood w.r.t. β
β log N y|Xβ; σ2
g (K + δI) = β −
1
2σ2
g
(y − Xβ)
T
(K + δI)
−1
(y − Xβ)
=
1
σ2
g
−XT
(K + δI)
−1
y + XT
(K + δI)
−1
X
Population structure correction
Linear mixed models (LMM)
ML parameters

Gradient of the LMM log likelihood w.r.t. β:

∇_β log N(y | Xβ; σ²_g (K + δI)) = ∇_β [ −1/(2σ²_g) (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]
                                 = 1/σ²_g [ X^T (K + δI)^{−1} y − X^T (K + δI)^{−1} Xβ ]

Set the gradient to zero:

0 = 1/σ²_g [ X^T (K + δI)^{−1} y − X^T (K + δI)^{−1} Xβ ]
X^T (K + δI)^{−1} Xβ = X^T (K + δI)^{−1} y
β_ML = ( X^T (K + δI)^{−1} X )^{−1} X^T (K + δI)^{−1} y

Note that this solution is analogous to the ML solution of linear regression, (X^T X)^{−1} X^T y.

C. Lippert Linear models for GWAS II October 17th 2012 22
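The closed form for β_ML can be sketched in a few lines of NumPy (the slides contain no code, so Python is a choice here; X, K, δ, and y below are made-up toy data, and linear solves stand in for the explicit inverses):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3

# Toy data (illustrative only): fixed effects X, similarity matrix K, ratio delta.
X = rng.standard_normal((N, P))
A = rng.standard_normal((N, N))
K = A @ A.T / N                      # any positive semi-definite similarity matrix
delta = 0.5
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(N)

# beta_ML = (X^T (K + dI)^{-1} X)^{-1} X^T (K + dI)^{-1} y,
# computed with linear solves rather than explicit matrix inverses.
V = K + delta * np.eye(N)
Vinv_X = np.linalg.solve(V, X)       # (K + dI)^{-1} X
Vinv_y = np.linalg.solve(V, y)       # (K + dI)^{-1} y
beta_ml = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
```

At the solution the gradient above vanishes, which gives a direct numerical check.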
Population structure correction
Linear mixed models (LMM)
ML parameters

Derivative of the LMM log likelihood w.r.t. σ²_g:

∂/∂σ²_g log N(y | Xβ; σ²_g (K + δI)) = −1/2 [ N/σ²_g − 1/σ⁴_g (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]

Set the derivative to zero:

0 = −1/2 [ N/σ²_g − 1/σ⁴_g (y − Xβ)^T (K + δI)^{−1} (y − Xβ) ]
σ²_g,ML = 1/N (y − Xβ)^T (K + δI)^{−1} (y − Xβ)

Bottleneck: for every SNP that we test we need to calculate (K + δI)^{−1}.
If done naively, this is an O(N³) operation per SNP (very expensive!).

C. Lippert Linear models for GWAS II October 17th 2012 23
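Given β, the σ²_g,ML estimator is a single quadratic form; a minimal NumPy sketch with made-up inputs (β here is assumed to have been estimated already):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = rng.standard_normal((N, 2))
beta = np.array([0.3, -1.2])         # assume beta has already been estimated
A = rng.standard_normal((N, N))
K = A @ A.T / N                      # toy similarity matrix
delta = 1.0
y = X @ beta + rng.standard_normal(N)

# sigma2_g,ML = (1/N) (y - X beta)^T (K + delta I)^{-1} (y - X beta)
r = y - X @ beta
V = K + delta * np.eye(N)
sigma2_g_ml = r @ np.linalg.solve(V, r) / N
```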
Population structure correction
FaST-LMM

N(y | Xβ; σ²_g (K + δI))
  = N(y | Xβ; σ²_g (U S U^T + δI))                       (spectral decomposition K = U S U^T)
  = N(U^T y | U^T Xβ; σ²_g (U^T U S U^T U + δ U^T U))    (rotate by U^T; U^T U = I)
  = N(U^T y | U^T Xβ; σ²_g (S + δI))

[Lippert et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 24
Population structure correction
FaST-LMM

N(U^T y | U^T Xβ; σ²_g (S + δI)).  (2)

Factored Spectrally Transformed LMM
O(N³) once for the spectral decomposition.
O(N²) runtime per SNP tested (multiplication with U^T).
O(N²) memory for storing K and U.

[Lippert et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 25
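The identity behind FaST-LMM, that the rotated likelihood equals the dense one with the covariance made diagonal, can be checked numerically; a sketch on toy data (K, δ, and σ²_g are made-up values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
G = rng.standard_normal((N, 200))
K = G @ G.T / G.shape[1]            # realized relationship matrix (toy)
S, U = np.linalg.eigh(K)            # O(N^3), done once: K = U diag(S) U^T
delta, sigma2_g = 0.7, 1.3
y = rng.standard_normal(N)

def loglik_naive(y, cov):
    # Dense Gaussian log likelihood: O(N^3) per evaluation.
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (N * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(cov, y))

ll_dense = loglik_naive(y, sigma2_g * (K + delta * np.eye(N)))

# FaST-LMM: rotate once by U^T; the covariance becomes diagonal, so the
# quadratic form and log-determinant each cost only O(N) afterwards.
y_rot = U.T @ y
d = sigma2_g * (S + delta)
ll_fast = -0.5 * (N * np.log(2 * np.pi) + np.sum(np.log(d))
                  + np.sum(y_rot**2 / d))
```

The per-SNP O(N²) cost quoted above is the rotation U^T x_s of each tested SNP.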
Population structure correction
Summary

Population structure correction
Genomic control
Simple method
Works with any statistical test
Can be combined with other correction methods
Very conservative!
Eigenstrat (PCA)
Corrects well for differences on population level
Does not work well for closer relatedness
Linear mixed models
Corrects well for most forms of relatedness.

C. Lippert Linear models for GWAS II October 17th 2012 26
Population structure correction
Overview

Single marker association model with random effect term:

y = x_s β_s (genetic effect) + u (random effect covariates) + ε (noise)

Shortcomings
Weak effects are not captured by single-marker analysis.
Complex traits are controlled by more than a single SNP.

C. Lippert Linear models for GWAS II October 17th 2012 27
Population structure correction
Multi locus models

Generalization to multiple genetic factors:

y = Σ_{s=1}^S x_s β_s (genetic effect) + u (random effect covariates) + ε (noise)

Challenge: N ≪ S, so explicit estimation of all β_s is not feasible.

Solutions
Regularize the β_s (ridge regression, LASSO)
Variance component modeling

[Wu et al., 2011]
C. Lippert Linear models for GWAS II October 17th 2012 28
Outline
C. Lippert Linear models for GWAS II October 17th 2012 29
Variance component models
Outline
Probability Theory
Population Structure
Population structure correction
Variance component models
Multi locus models
Phenotype prediction
C. Lippert Linear models for GWAS II October 17th 2012 30
Variance component models Multi locus models
Multi locus models
Random effect models

For now, let's drop the random effect term:

y = Σ_{s=1}^S x_s β_s + ε.

For mathematical convenience, we choose a shared Gaussian distribution on the weights and Gaussian noise:

p(β_1, ..., β_S) = Π_{s=1}^S N(β_s | 0, σ²_g),    p(ε) = N(ε | 0, σ²_e I)

Marginalize out the weights β_1, ..., β_S:

p(y | X, σ²_g, σ²_e) = ∫ N(y | Σ_{s=1}^S x_s β_s, σ²_e I) [data likelihood] Π_{s=1}^S N(β_s | 0, σ²_g) [weight distribution] dβ
                     = N(y | 0, σ²_g Σ_{s=1}^S x_s x_s^T + σ²_e I)

C. Lippert Linear models for GWAS II October 17th 2012 31
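The marginalization result can be verified by simulation; a NumPy sketch (the genotype matrix X and variance components below are made-up) comparing the empirical covariance of y = Xβ + ε against σ²_g X X^T + σ²_e I:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S = 3, 2
X = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, -1.0]])          # toy genotype matrix (hypothetical)
sigma2_g, sigma2_e = 2.0, 0.5

# Analytic marginal covariance: sigma2_g * X X^T + sigma2_e * I.
cov_analytic = sigma2_g * X @ X.T + sigma2_e * np.eye(N)

# Monte Carlo: draw beta ~ N(0, sigma2_g I) and eps ~ N(0, sigma2_e I),
# form y = X beta + eps, and compare the empirical covariance of y.
reps = 200_000
beta = rng.normal(0.0, np.sqrt(sigma2_g), size=(reps, S))
eps = rng.normal(0.0, np.sqrt(sigma2_e), size=(reps, N))
y = beta @ X.T + eps
cov_empirical = np.cov(y.T)
```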
Variance component models Multi locus models
Multi locus models
Remarks

p(y | X, σ²_g, σ²_e) = N(y | 0, σ²_g Σ_{s=1}^S x_s x_s^T + σ²_e I)    (3)

where K_g = Σ_{s=1}^S x_s x_s^T is the genotype covariance matrix.
Closely related to the kinship matrix explaining population structure.
Inference can be done by maximum likelihood.

(Figure: sample-by-sample heatmap of the genotype covariance matrix.)

The ratio of σ²_g and σ²_e defines the narrow-sense heritability of the trait:

h² = σ²_g / (σ²_g + σ²_e).

C. Lippert Linear models for GWAS II October 17th 2012 32
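h² can be estimated by maximizing the marginal likelihood in (3); a sketch with simulated data, using a grid search over h² and one eigendecomposition of K_g, with the total variance profiled out (an illustrative estimator, not the code behind these slides):

```python
import numpy as np

rng = np.random.default_rng(4)
N, S = 500, 1000
G = rng.standard_normal((N, S))
K = G @ G.T / S                      # toy genotype covariance K_g

# Simulate y ~ N(0, sigma2_g K + sigma2_e I) with true h2 = 0.8.
sigma2_g, sigma2_e = 0.8, 0.2
L = np.linalg.cholesky(sigma2_g * K + sigma2_e * np.eye(N))
y = L @ rng.standard_normal(N)

# ML over h2: with K = U diag(S) U^T, the covariance h2*K + (1-h2)*I has
# eigenvalues h2*S + (1-h2), so each grid point costs only O(N).
Svals, U = np.linalg.eigh(K)
y_rot = U.T @ y

def neg_loglik(h2):
    d = h2 * Svals + (1.0 - h2)
    s2p = np.mean(y_rot**2 / d)      # profiled total variance sigma2_g + sigma2_e
    return N * np.log(s2p) + np.sum(np.log(d))

grid = np.linspace(0.01, 0.99, 99)
h2_hat = grid[np.argmin([neg_loglik(h) for h in grid])]
```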
Variance component models Multi locus models
Heritability

Heritability estimated on 107 A. thaliana phenotypes
Global genetic heritability

(Figure: density of heritability estimates across the 107 phenotypes, heritability from 0.0 to 1.0.)

C. Lippert Linear models for GWAS II October 17th 2012 33
Variance component models Multi locus models
Heritability

Heritability estimated on 107 A. thaliana phenotypes

The estimate can be restricted to a genomic region such as a single chromosome, etc.:

N(y | 0, σ²_g Σ_{s∈Chrom} x_s x_s^T + σ²_e I)

(Figure: heatmap of per-chromosome heritability, chromosomes 1–5 by phenotype ID 1–106, scale 0.0 to 1.0.)

C. Lippert Linear models for GWAS II October 17th 2012 34
Variance component models Multi locus models
Window-based composite variance analysis
Region-based testing

Just fitting a particular region ignores the genome-wide context.

Variance dissection with region-based separation:

p(y | W) = N(y | 0, σ²_w Σ_{s∈W} x_s x_s^T + σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I)

with K_w = Σ_{s∈W} x_s x_s^T and K_g = Σ_{s∉W} x_s x_s^T.

Explained variance components can be read off, subject to suitable normalization of the covariances K_w and K_g.

"Local" heritability:

h²(W) = σ²_w / (σ²_w + σ²_g + σ²_e)

C. Lippert Linear models for GWAS II October 17th 2012 35
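Building the two kernels and reading off the local heritability can be sketched as follows (the window, genotypes, and fitted variance components are made-up; normalizing each kernel by its mean diagonal is one plausible choice of the "suitable normalization"):

```python
import numpy as np

rng = np.random.default_rng(5)
N, S = 60, 300
X = rng.standard_normal((N, S))      # toy genotypes
window = np.arange(100, 120)         # hypothetical window W
mask = np.zeros(S, dtype=bool)
mask[window] = True

def kernel(Z):
    # Normalize by the mean diagonal so variance components are comparable.
    K = Z @ Z.T
    return K / np.mean(np.diag(K))

K_w = kernel(X[:, mask])             # SNPs inside the window
K_g = kernel(X[:, ~mask])            # background SNPs outside the window

# Given fitted variance components (hypothetical values here),
# the "local" heritability of window W:
sigma2_w, sigma2_g, sigma2_e = 0.3, 0.5, 0.2
h2_local = sigma2_w / (sigma2_w + sigma2_g + sigma2_e)
```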
Variance component models Multi locus models
Window-based composite variance analysis
C. Lippert Linear models for GWAS II October 17th 2012 36
Variance component models Multi locus models
Window-based composite variance analysis
Significance testing

Analogously to fixed effect testing, the significance of a specific window can be tested.

Likelihood-ratio statistic to score the relevance of a particular genomic region W:

LOD(W) = N(y | 0, σ²_w Σ_{s∈W} x_s x_s^T + σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I) / N(y | 0, σ²_g Σ_{s∉W} x_s x_s^T + σ²_e I)

P-values can be obtained from permutation statistics or analytical approximations (variants of score tests or likelihood ratio tests).

[Wu et al., 2011, Listgarten et al., 2012]
C. Lippert Linear models for GWAS II October 17th 2012 37
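The log of this likelihood ratio can be computed directly; a sketch with made-up variance components (in practice σ²_w, σ²_g, σ²_e would be refit under each model):

```python
import numpy as np

rng = np.random.default_rng(6)
N, S = 50, 200
X = rng.standard_normal((N, S))      # toy genotypes
mask = np.zeros(S, dtype=bool)
mask[20:40] = True                   # hypothetical window W
y = rng.standard_normal(N)

def log_gauss(y, cov):
    # Log density of N(y | 0, cov).
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(cov, y))

K_w = X[:, mask] @ X[:, mask].T / mask.sum()
K_g = X[:, ~mask] @ X[:, ~mask].T / (~mask).sum()

# Hypothetical fitted variance components for alternative and null models.
s2w, s2g, s2e = 0.2, 0.5, 0.5
cov_alt = s2w * K_w + s2g * K_g + s2e * np.eye(N)
cov_null = s2g * K_g + s2e * np.eye(N)

log_lod = log_gauss(y, cov_alt) - log_gauss(y, cov_null)
```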
Variance component models Phenotype prediction
Making predictions with linear mixed models

Linear model, accounting for a set of measured SNPs X:

p(y | X, θ, σ²) = N(y | Σ_{s=1}^S x_s θ_s, σ² I)

Prediction at an unseen test input x* given the maximum likelihood weights θ̂:

p(y* | x*, θ̂) = N(y* | x* θ̂, σ²)

Marginal likelihood:

p(y | X, σ², σ²_g) = ∫ N(y | Xθ, σ² I) N(θ | 0, σ²_g I) dθ = N(y | 0, σ²_g X X^T + σ² I)

where K = X X^T.

Making predictions with linear mixed models?

C. Lippert Linear models for GWAS II October 17th 2012 38
Variance component models Phenotype prediction
The Gaussian distribution

Linear mixed models are merely based on the good old multivariate Gaussian:

N(x | µ, K) = 1/√|2πK| exp( −1/2 (x − µ)^T K^{−1} (x − µ) )

K is the covariance matrix or kernel matrix.

C. Lippert Linear models for GWAS II October 17th 2012 39
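The density formula transcribes directly to code; a small NumPy sketch, sanity-checked against the product of univariate normals for a diagonal K (the evaluation point and K are arbitrary choices):

```python
import numpy as np

def mvn_density(x, mu, K):
    """N(x | mu, K) = exp(-0.5 (x-mu)^T K^{-1} (x-mu)) / sqrt(|2 pi K|)."""
    d = x - mu
    quad = d @ np.linalg.solve(K, d)
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * K))
    return np.exp(-0.5 * quad) / norm

# For diagonal K the joint density factorizes into univariate normals.
x = np.array([0.3, -1.0])
mu = np.zeros(2)
K = np.diag([2.0, 0.5])
p_joint = mvn_density(x, mu, K)
p_prod = np.prod(np.exp(-0.5 * x**2 / np.diag(K))
                 / np.sqrt(2.0 * np.pi * np.diag(K)))
```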
Variance component models Phenotype prediction
A 2D Gaussian

(Figure: probability contour and samples of a 2D Gaussian over (y1, y2), both axes from −3 to 3.)

K = [ 1, 0.6; 0.6, 1 ]

C. Lippert Linear models for GWAS II October 17th 2012 40
Variance component models Phenotype prediction
A 2D Gaussian
Varying the covariance matrix

(Figure: three contour plots over (y1, y2) for
K = [ 1, 0.14; 0.14, 1 ],  K = [ 1, 0.6; 0.6, 1 ],  K = [ 1, −0.9; −0.9, 1 ].)

C. Lippert Linear models for GWAS II October 17th 2012 41
Variance component models Phenotype prediction
A 2D Gaussian
Inference

(Figure: conditioning a 2D Gaussian on an observed value; both axes from −3 to 3.)

C. Lippert Linear models for GWAS II October 17th 2012 42
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =?
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =?
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =
P(y, y )
P(y)
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
Variance component models Phenotype prediction
Phenotype prediction
Best linear unbiased prediction
Given the phenotype values y of a set of individuals and the genetic
relatedness, we can predict the genetic component of the phenotype
of a new individual y .
P(y | y) =
P(y, y )
P(y)
=
N
y
y
| 0, σ2
g
K K:,
KT
:, K , σ2
e I
N 0, σ2
gK + σ2
eI
Use conditional probability distribution
Note, that the result is again a Gaussian distribution!
As a Gaussian distribution is always normalized, we can drop any
constant terms, that do not contain y .
Completing the square identifies the mean µ and the variance Σ of
y given y.
C. Lippert Linear models for GWAS II October 17
th
2012 43
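To make the "ratio of Gaussians is again a Gaussian" step concrete, here is a minimal numeric sketch. The two-individual kinship matrix, variance components, and observed phenotype below are made-up illustrative values, not from the slides: it checks that $P(\mathbf{y}, y_\star)/P(\mathbf{y})$ coincides with the conditional density obtained by standard Gaussian conditioning.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    # Naive multivariate normal density; fine for tiny examples like this one.
    d = np.atleast_1d(x) - np.atleast_1d(mean)
    cov = np.atleast_2d(cov)
    k = d.shape[0]
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

# Hypothetical example: one training phenotype y, one new individual y_star.
sg2, se2 = 0.7, 0.3                      # assumed variance components
K_joint = np.array([[1.0, 0.5],          # relatedness of (training, new)
                    [0.5, 1.0]])
Sigma = sg2 * K_joint + se2 * np.eye(2)  # joint covariance of (y, y_star)

y = 1.2
# Standard Gaussian conditioning gives the predictive mean and variance directly.
mu_star = Sigma[1, 0] / Sigma[0, 0] * y
Sigma_star = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]

# The ratio P(y, y_star) / P(y) matches N(y_star | mu_star, Sigma_star)
# for every value of y_star, confirming the conditional is Gaussian.
for y_star in (-1.0, 0.0, 2.5):
    ratio = gauss_pdf([y, y_star], [0, 0], Sigma) \
          / gauss_pdf([y], [0], [[Sigma[0, 0]]])
    assert np.isclose(ratio, gauss_pdf([y_star], [mu_star], [[Sigma_star]]))
```

Because the ratio is normalized in $y_\star$ by construction, the constant factors really do drop out, as the slide argues.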
Inference
Gaussian conditioning in 2D

$$
\begin{aligned}
p(y_2 \mid y_1, K)
&= \frac{p(y_1, y_2 \mid K)}{p(y_1 \mid K)}
\propto \exp\!\left\{-\tfrac{1}{2}\,[y_1, y_2]\, K^{-1} \begin{bmatrix}y_1\\ y_2\end{bmatrix}\right\}\\
&= \exp\!\left\{-\tfrac{1}{2}\left(y_1^2 K^{-1}_{1,1} + y_2^2 K^{-1}_{2,2} + 2\, y_1 K^{-1}_{1,2}\, y_2\right)\right\}\\
&= \exp\!\left\{-\tfrac{1}{2}\left(y_2^2 K^{-1}_{2,2} + 2\, y_2 K^{-1}_{1,2}\, y_1 + C\right)\right\}\\
&= Z \exp\!\left\{-\tfrac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2\, y_2 \frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)\right\}\\
&= Z \exp\!\left\{-\tfrac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2\, y_2 \frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}} + \left(\frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)^{\!2}\right) + \tfrac{1}{2} K^{-1}_{2,2}\left(\frac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}\right)^{\!2}\right\}\\
&= Z' \exp\!\Bigg\{-\tfrac{1}{2}\, \underbrace{K^{-1}_{2,2}}_{\Sigma^{-1}} \Big(y_2 + \underbrace{\tfrac{K^{-1}_{1,2}\, y_1}{K^{-1}_{2,2}}}_{-\mu}\Big)^{\!2}\Bigg\}
\propto \mathcal{N}(y_2 \mid \mu, \Sigma)
\end{aligned}
$$

Here $C$ collects the terms that do not depend on $y_2$, and $Z, Z'$ absorb constant factors. Reading off the completed square gives the conditional mean $\mu = -K^{-1}_{1,2}\, y_1 / K^{-1}_{2,2}$ and variance $\Sigma = \left(K^{-1}_{2,2}\right)^{-1}$.
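A quick numeric sanity check of the completed square, using an illustrative 2×2 covariance (the values are made up): the mean and variance read off from the precision matrix $K^{-1}$ must agree with the familiar bivariate conditioning formulas $\mu = (K_{2,1}/K_{1,1})\, y_1$ and $\Sigma = K_{2,2} - K_{2,1} K_{1,2} / K_{1,1}$.

```python
import numpy as np

# Illustrative 2x2 covariance; condition y2 on an observed y1.
K = np.array([[1.0, 0.8],
              [0.8, 1.0]])
y1 = 2.0
Ki = np.linalg.inv(K)

# Read off from the completed square:
#   Sigma = 1 / K^-1_{2,2},  mu = -K^-1_{1,2} * y1 / K^-1_{2,2}
mu = -Ki[0, 1] * y1 / Ki[1, 1]
Sigma = 1.0 / Ki[1, 1]

# Standard conditioning formulas for a bivariate Gaussian agree:
assert np.isclose(mu, K[1, 0] / K[0, 0] * y1)                    # 1.6
assert np.isclose(Sigma, K[1, 1] - K[1, 0] * K[0, 1] / K[0, 0])  # 0.36
```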
Phenotype prediction
Best linear unbiased prediction

Given the phenotype values of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual.

$$
P(y_\star \mid \mathbf{y}) = \mathcal{N}\!\left(y_\star \mid \mu_\star,\; \sigma_g^2 V_g + \sigma_e^2 I\right)
$$

Predictive mean (BLUP): $\mu_\star = \sigma_g^2 K^g_{\star,:}\left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1}\mathbf{y}$

Predictive variance: $V_g = K^g_{\star,\star} - \sigma_g^2 K^g_{\star,:}\left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} K^g_{:,\star}$
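The BLUP formulas translate directly into a few lines of linear algebra. A minimal sketch with toy data (the kinship matrix, phenotypes, and variance components below are all made up, and $\sigma_g^2, \sigma_e^2$ are treated as known rather than estimated, e.g. by REML):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                   # training individuals, plus 1 new one
G = rng.standard_normal((n + 1, 50))    # toy genotype-like features
K_full = G @ G.T / 50.0                 # realized relationship matrix (train + new)

K = K_full[:n, :n]        # train/train kinship
K_star = K_full[n, :n]    # new/train cross-relatedness
K_ss = K_full[n, n]       # new individual's self-relatedness
y = rng.standard_normal(n)
sg2, se2 = 0.6, 0.4       # variance components, assumed known here

V = sg2 * K + se2 * np.eye(n)
mu_star = sg2 * K_star @ np.linalg.solve(V, y)            # BLUP predictive mean
V_g = K_ss - sg2 * K_star @ np.linalg.solve(V, K_star)    # genetic variance

# Cross-check against direct conditioning of the joint Gaussian of (y, y_star):
# the mean matches, and the conditional variance equals sg2 * V_g + se2.
S = sg2 * K_full + se2 * np.eye(n + 1)
mu_cond = S[n, :n] @ np.linalg.solve(S[:n, :n], y)
var_cond = S[n, n] - S[n, :n] @ np.linalg.solve(S[:n, :n], S[:n, n])
assert np.isclose(mu_star, mu_cond)
assert np.isclose(sg2 * V_g + se2, var_cond)
```

Using `np.linalg.solve` on $\sigma_g^2 K + \sigma_e^2 I$ avoids forming the inverse explicitly, which is the numerically preferable way to evaluate both BLUP expressions.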
Summary

Basic probability theory
Linear mixed models
Population structure correction
Parameter estimation
Variance component modeling
Phenotype prediction
Acknowledgements

Joint course material: O. Stegle
FaST-LMM: J. Listgarten, D. Heckerman
References I

P. Burton, D. Clayton, L. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. Kwiatkowski, M. McCarthy, W. Ouwehand, N. Samani, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, 1999.

H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.

C. Lippert, J. Listgarten, Y. Liu, C. Kadie, R. Davidson, and D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011. doi: 10.1038/nmeth.1681.

J. Listgarten, C. Lippert, and D. Heckerman. An efficient group test for genetic markers that handles confounding. arXiv preprint arXiv:1205.0793, 2012.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.

N. Patterson, A. L. Price, and D. Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006.
References II

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

M. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 2011.

Population structure correction
Linear mixed models (LMM)

y ~ N(y | Xβ, σ²_g K + σ²_e I)

Corrects for all levels of population structure.
ML estimation is computationally demanding: the likelihood is non-convex in σ²_g and σ²_e.
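The covariance matrix K in the model above is typically estimated from the SNP data itself. A minimal NumPy sketch of the realized relationship matrix (the linear kernel over standardized genotypes); the function name, sizes, and the exact standardization scheme are illustrative assumptions, not the slides' recipe:

```python
import numpy as np

def realized_relationship_matrix(G):
    """Linear kinship estimate from an N x S matrix G of minor-allele
    counts (0/1/2): standardize each SNP column, then K = Z Z^T / S,
    so every SNP contributes equally to the sample covariance."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    return Z @ Z.T / G.shape[1]

# tiny synthetic example: 50 samples, 200 SNPs
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 200)).astype(float)
K = realized_relationship_matrix(G)
```

Under this normalization K is symmetric positive semi-definite with trace equal to the number of samples, which is a common convention for variance component estimation.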
Population structure correction
GWAS for flowering time in Arabidopsis thaliana

Linear model:
[Figure: QQ-plot of expected vs. observed p-values; the observed −log p-values reach ≈25 against an expected range of ≈6, far above the diagonal.]

Linear mixed model:
[Figure: QQ-plot of expected vs. observed p-values; the observed −log p-values reach ≈12 and lie much closer to the diagonal.]
Population structure correction
Linear mixed models (LMM)

LMM log likelihood:
LL(β, σ²_g, σ²_e) = log N(y | Xβ, σ²_g K + σ²_e I)

Change of variables, introducing δ = σ²_e / σ²_g:
LL(β, σ²_g, δ) = log N(y | Xβ, σ²_g (K + δI))

The ML parameters β̂ and σ̂²_g follow in closed form given δ.
Use a numerical optimizer to solve the remaining 1-dimensional optimization problem over δ. [Kang et al., 2008]
Population structure correction
Linear mixed models (LMM): ML parameters

Gradient of the LMM log likelihood w.r.t. β:
∇_β log N(y | Xβ, σ²_g (K + δI))
  = ∇_β [ −(1 / 2σ²_g) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ) ]
  = (1 / σ²_g) [ Xᵀ (K + δI)⁻¹ y − Xᵀ (K + δI)⁻¹ Xβ ]

Set the gradient to zero:
0 = Xᵀ (K + δI)⁻¹ y − Xᵀ (K + δI)⁻¹ Xβ
Xᵀ (K + δI)⁻¹ Xβ = Xᵀ (K + δI)⁻¹ y
β_ML = (Xᵀ (K + δI)⁻¹ X)⁻¹ Xᵀ (K + δI)⁻¹ y

Note that this solution is analogous to the ML solution of linear regression, (Xᵀ X)⁻¹ Xᵀ y.
Population structure correction
Linear mixed models (LMM): ML parameters

Derivative of the LMM log likelihood w.r.t. σ²_g:
∂/∂σ²_g log N(y | Xβ, σ²_g (K + δI))
  = −(1/2) [ N / σ²_g − (1 / σ⁴_g) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ) ]

Set the derivative to zero:
σ²_g,ML = (1/N) (y − Xβ)ᵀ (K + δI)⁻¹ (y − Xβ)

Bottleneck: for every SNP that we test, we need to calculate (K + δI)⁻¹.
Done naively, this is an O(N³) operation per SNP (very expensive!).
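The two closed-form solutions above suggest a simple fitting strategy: profile out β and σ²_g, then search only over δ. A NumPy sketch of this, using a log-spaced grid for the 1-D search (a bounded scalar optimizer would also do); function names, grid bounds, and the synthetic data are illustrative assumptions:

```python
import numpy as np

def profile_loglik(delta, y, X, K):
    """LMM log-likelihood profiled over beta and sigma_g^2: for a fixed
    ratio delta = sigma_e^2 / sigma_g^2, both have closed-form ML values."""
    N = len(y)
    V = K + delta * np.eye(N)
    Vi = np.linalg.inv(V)
    beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)  # GLS estimate of beta
    r = y - X @ beta
    sg2 = (r @ Vi @ r) / N                              # closed-form ML variance
    logdet = np.linalg.slogdet(V)[1]
    ll = -0.5 * (N * np.log(2 * np.pi * sg2) + logdet + N)
    return ll, beta, sg2

def fit_delta(y, X, K, deltas=np.logspace(-4, 4, 81)):
    """Search the single remaining parameter delta on a log-spaced grid."""
    return max(((profile_loglik(d, y, X, K), d) for d in deltas),
               key=lambda t: t[0][0])

# small synthetic problem
rng = np.random.default_rng(1)
N = 20
A = rng.normal(size=(N, N))
K = A @ A.T / N                                         # some PSD "kinship"
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
(ll, beta, sg2), delta = fit_delta(y, X, K)
```

The profiled value agrees with the dense Gaussian log-density evaluated at (β_ML, σ²_g,ML), which is a convenient sanity check on the algebra.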
Population structure correction
FaST-LMM

Spectral decomposition K = U S Uᵀ, with U orthogonal (Uᵀ U = I) and S diagonal:

N(y | Xβ, σ²_g (K + δI))
  = N(y | Xβ, σ²_g (U S Uᵀ + δI))
  = N(Uᵀ y | Uᵀ Xβ, σ²_g (Uᵀ U S Uᵀ U + δ Uᵀ U))
  = N(Uᵀ y | Uᵀ Xβ, σ²_g (S + δI))

[Lippert et al., 2011]
Population structure correction
FaST-LMM

N(Uᵀ y | Uᵀ Xβ, σ²_g (S + δI))     (2)

Factored Spectrally Transformed LMM:
O(N³) once, for the spectral decomposition.
O(N²) runtime per SNP tested (multiplication with Uᵀ).
O(N²) memory for storing K and U.

[Lippert et al., 2011]
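The rotation argument above can be checked numerically: after one eigendecomposition K = U S Uᵀ, the covariance of the rotated data is diagonal, so each likelihood evaluation becomes O(N) instead of requiring a dense solve. A sketch under illustrative names and synthetic data:

```python
import numpy as np

def dense_loglik(y, X, beta, sg2, delta, K):
    """Naive LMM log-likelihood with the dense covariance sg2*(K + delta*I)."""
    N = len(y)
    V = sg2 * (K + delta * np.eye(N))
    r = y - X @ beta
    return -0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(V)[1]
                   + r @ np.linalg.solve(V, r))

def fast_loglik(ty, tX, beta, sg2, delta, S):
    """Same likelihood after rotating y and X by U^T once: the rotated
    covariance sg2*(S + delta) is diagonal, so evaluation is O(N)."""
    d = sg2 * (S + delta)
    r = ty - tX @ beta
    return -0.5 * (len(ty) * np.log(2 * np.pi) + np.log(d).sum()
                   + (r ** 2 / d).sum())

rng = np.random.default_rng(2)
N = 30
A = rng.normal(size=(N, N))
K = A @ A.T / N
S, U = np.linalg.eigh(K)              # spectral decomposition K = U diag(S) U^T
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
beta = np.array([0.1, -0.3])
```

Both functions return the same value for any (β, σ²_g, δ), which is exactly the chain of equalities on the slide.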
Population structure correction
Summary: population structure correction

Genomic control
Simple method.
Works with any statistical test.
Can be combined with other correction methods.
Very conservative!

Eigenstrat (PCA)
Corrects well for differences on the population level.
Does not work well for closer relatedness.

Linear mixed models
Correct well for most forms of relatedness.
Population structure correction
Overview

Single-marker association model with random effect term:
y = x_s β_s (genetic effect) + u (random effect, covariates) + ε (noise)

Shortcomings:
Weak effects are not captured by single-marker analysis.
Complex traits are controlled by more than a single SNP.
Population structure correction
Multilocus models

Generalization to multiple genetic factors:
y = Σ_{s=1..S} x_s β_s (genetic effect) + u (random effect, covariates) + ε (noise)

Challenge: N ≪ S, so explicit estimation of all β_s is not feasible.

Solutions:
Regularize the β_s (ridge regression, LASSO).
Variance component modeling. [Wu et al., 2011]
Outline

Probability theory
Population structure
Population structure correction
Variance component models
Multilocus models
Phenotype prediction
Variance component models
Multilocus models: random effect models

For now, drop the random effect term:
y = Σ_{s=1..S} x_s β_s + ε

For mathematical convenience, choose a shared Gaussian distribution on the weights, and Gaussian noise:
p(β_1, …, β_S) = Π_{s=1..S} N(β_s | 0, σ²_g)
p(ε) = N(ε | 0, σ²_e I)

Marginalize out the weights β_1, …, β_S:
p(y | X, σ²_g, σ²_e) = ∫ N(y | Σ_s x_s β_s, σ²_e I) · Π_s N(β_s | 0, σ²_g) dβ
                     = N(y | 0, σ²_g Σ_s x_s x_sᵀ + σ²_e I)
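The marginalization step can be checked empirically: drawing weights and noise from their priors and forming y should give a sample covariance close to σ²_g XXᵀ + σ²_e I. A Monte Carlo sketch; the sizes, seed, and variances are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S = 4, 7
sg2, se2 = 0.5, 0.2
X = rng.normal(size=(N, S))

# draw many phenotypes from the generative model: y = X beta + eps,
# with beta_s ~ N(0, sg2) i.i.d. and eps ~ N(0, se2 * I)
n_draws = 200_000
betas = rng.normal(scale=np.sqrt(sg2), size=(S, n_draws))
eps = rng.normal(scale=np.sqrt(se2), size=(N, n_draws))
Y = X @ betas + eps                       # each column is one draw of y

empirical_cov = Y @ Y.T / n_draws         # y has mean zero, so no centering
analytic_cov = sg2 * X @ X.T + se2 * np.eye(N)
```

With enough draws the empirical covariance converges to the analytic one, which is the closed-form marginal on the slide.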
Variance component models
Multilocus models: remarks

p(y | X, σ²_g, σ²_e) = N(y | 0, σ²_g Σ_{s=1..S} x_s x_sᵀ + σ²_e I),  with K_g = Σ_s x_s x_sᵀ.     (3)

K_g is the genotype covariance matrix, closely related to the kinship matrix explaining population structure.
Inference can be done by maximum likelihood.
The ratio of σ²_g and σ²_e defines the narrow-sense heritability of the trait:
h² = σ²_g / (σ²_g + σ²_e)

[Figure: heatmap of the genotype covariance matrix K_g across samples.]
Variance component models
Multilocus models: heritability

Heritability estimated on 107 A. thaliana phenotypes.

[Figure: density histogram of the estimated global genetic heritability across the 107 phenotypes, with heritability ranging from 0 to 1 on the x-axis.]
Variance component models
Multilocus models: heritability

The estimate can be restricted to a genomic region, such as a single chromosome:
N(y | 0, σ²_g Σ_{s ∈ Chrom} x_s x_sᵀ + σ²_e I)

[Figure: heatmap of per-chromosome heritability estimates (chromosomes 1–5 vs. phenotype IDs 1–107), values from 0.0 to 1.0.]
Variance component models
Multilocus models: window-based composite variance analysis

Region-based testing: just fitting a particular region ignores the genome-wide context.

Variance dissection with region-based separation:
p(y | W) = N(y | 0, σ²_w Σ_{s ∈ W} x_s x_sᵀ + σ²_g Σ_{s ∉ W} x_s x_sᵀ + σ²_e I),
with window component K_w = Σ_{s ∈ W} x_s x_sᵀ and background component K_g = Σ_{s ∉ W} x_s x_sᵀ.

The explained variance components can be read off, subject to suitable normalization of the covariances K_w and K_g.

"Local" heritability:
h²(W) = σ²_w / (σ²_w + σ²_g + σ²_e)
  • 130.
Variance component models / Multi-locus models

Window-based composite variance analysis

[Figure slide]
Variance component models / Multi-locus models

Window-based composite variance analysis: significance testing

Analogously to fixed-effect testing, the significance of a specific window can be tested. Likelihood-ratio statistic (log of the likelihood ratio) to score the relevance of a particular genomic region W:

\mathrm{LOD}(W) = \log \frac{\mathcal{N}\!\left(y \;\middle|\; 0,\; \sigma_w^2 \sum_{s \in W} x_s x_s^T + \sigma_g^2 \sum_{s \notin W} x_s x_s^T + \sigma_e^2 I\right)}{\mathcal{N}\!\left(y \;\middle|\; 0,\; \sigma_g^2 \sum_{s \notin W} x_s x_s^T + \sigma_e^2 I\right)}

P-values can be obtained from permutation statistics or analytical approximations (variants of score tests or likelihood-ratio tests). [Wu et al., 2011, Listgarten et al., 2012]
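The LOD score amounts to two maximum-likelihood fits, one with and one without the window's variance component. A toy version on simulated data, using a shared grid for all variance components (a real implementation would optimize the variance components properly):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 80
Xw = rng.standard_normal((n, 30))      # SNPs in the window W (simulated)
Xg = rng.standard_normal((n, 300))     # SNPs outside the window (simulated)
Kw, Kg = Xw @ Xw.T / 30, Xg @ Xg.T / 300
y = np.linalg.cholesky(0.3 * Kw + 0.3 * Kg + 0.4 * np.eye(n)) @ rng.standard_normal(n)

def log_lik(y, C):
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

def max_log_lik(y, kernels, grid):
    """Maximize log N(y | 0, sum_i v_i K_i + ve*I) over a crude grid."""
    I, best = np.eye(len(y)), -np.inf
    for vs in product(grid, repeat=len(kernels)):
        for ve in grid:
            if ve <= 0:                # noise variance must stay positive
                continue
            C = ve * I + sum(v * K for v, K in zip(vs, kernels))
            best = max(best, log_lik(y, C))
    return best

grid = (0.0, 0.1, 0.2, 0.4, 0.8, 1.6)
ll_alt = max_log_lik(y, [Kw, Kg], grid)   # window + background + noise
ll_null = max_log_lik(y, [Kg], grid)      # background + noise only
lod = ll_alt - ll_null                    # >= 0: the null is nested (vw = 0)
```

P-values would then come from permutations of y or from the analytic approximations cited on the slide.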
Variance component models / Phenotype prediction

Making predictions with linear mixed models

Linear model, accounting for a set of measured SNPs X:

p(y \mid X, \theta, \sigma^2) = \mathcal{N}\!\left(y \;\middle|\; \sum_{s=1}^{S} x_s \theta_s,\; \sigma^2 I\right)

Prediction at an unseen test input x^*, given the maximum-likelihood weights \hat{\theta}:

p(y^* \mid x^*, \hat{\theta}) = \mathcal{N}\!\left(y^* \;\middle|\; x^{*T} \hat{\theta},\; \sigma^2\right)

Marginal likelihood:

p(y \mid X, \sigma^2, \sigma_g^2) = \int_\theta \mathcal{N}(y \mid X\theta, \sigma^2 I)\, \mathcal{N}(\theta \mid 0, \sigma_g^2 I)\, d\theta = \mathcal{N}\!\Big(y \;\Big|\; 0,\; \sigma_g^2 \underbrace{XX^T}_{K} + \sigma^2 I\Big)

Making predictions with linear mixed models?
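The last identity, integrating out the weights to get a zero-mean Gaussian with covariance \sigma_g^2 XX^T + \sigma^2 I, can be checked by simulation. A small Monte Carlo sanity check (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, S = 4, 3                             # tiny dimensions keep the check fast
X = rng.standard_normal((n, S))
sg2, se2 = 0.5, 0.2

# Analytic marginal covariance after integrating out theta ~ N(0, sg2*I):
C = sg2 * X @ X.T + se2 * np.eye(n)

# Monte Carlo check: sample theta and noise, form y = X @ theta + eps.
m = 200_000
theta = rng.standard_normal((m, S)) * np.sqrt(sg2)
eps = rng.standard_normal((m, n)) * np.sqrt(se2)
Y = theta @ X.T + eps
C_emp = Y.T @ Y / m                     # empirical covariance (mean is zero)
```

The empirical covariance of the sampled y matches the kernel form up to Monte Carlo noise.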
Variance component models / Phenotype prediction

The Gaussian distribution

Linear mixed models are merely based on the good old multivariate Gaussian:

\mathcal{N}(x \mid \mu, K) = \frac{1}{\sqrt{|2\pi K|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T K^{-1} (x - \mu)\right)

K is called the covariance matrix or kernel matrix.
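The density translates directly into code. A small sketch that checks a general slogdet/solve implementation of the formula against the explicit 2x2 case with a hand-inverted covariance matrix:

```python
import numpy as np

def gauss_pdf(x, mu, K):
    """N(x | mu, K) = |2*pi*K|^(-1/2) * exp(-(x-mu)^T K^{-1} (x-mu) / 2)."""
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * K)
    return np.exp(-0.5 * (logdet + d @ np.linalg.solve(K, d)))

# Explicit 2D check: invert the 2x2 covariance by hand.
K = np.array([[1.0, 0.6], [0.6, 1.0]])
x, mu = np.array([0.5, -0.3]), np.zeros(2)
det = K[0, 0] * K[1, 1] - K[0, 1] * K[1, 0]
Kinv = np.array([[K[1, 1], -K[0, 1]], [-K[1, 0], K[0, 0]]]) / det
d = x - mu
direct = np.exp(-0.5 * d @ Kinv @ d) / (2 * np.pi * np.sqrt(det))
```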
Variance component models / Phenotype prediction

A 2D Gaussian

[Figure: probability contour and samples of a 2D Gaussian over (y_1, y_2), both axes from -3 to 3.]

K = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}
Variance component models / Phenotype prediction

A 2D Gaussian: varying the covariance matrix

[Figure: three sample clouds over (y_1, y_2), axes from -3 to 3, for the covariance matrices below.]

K = \begin{pmatrix} 1 & 0.14 \\ 0.14 & 1 \end{pmatrix}, \quad K = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \quad K = \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix}
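To see the effect of the off-diagonal entry, one can draw samples from each covariance matrix and recover the correlation empirically (a quick sketch; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
rho_hats = {}
for rho in (0.14, 0.6, -0.9):
    K = np.array([[1.0, rho], [rho, 1.0]])
    # Draw many samples from N(0, K) and measure the empirical correlation.
    Y = rng.multivariate_normal(np.zeros(2), K, size=100_000)
    rho_hats[rho] = np.corrcoef(Y.T)[0, 1]
```

The empirical correlations recover the off-diagonal entries, which is what the three sample clouds on the slide visualize.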
Variance component models / Phenotype prediction

A 2D Gaussian: inference

[Figure: inference in the 2D Gaussian, both axes from -3 to 3.]
Variance component models / Phenotype prediction

Phenotype prediction: best linear unbiased prediction

Given the phenotype values y of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual, y^*. Use the conditional probability distribution:

P(y^* \mid y) = \frac{P(y, y^*)}{P(y)} = \frac{\mathcal{N}\!\left(\begin{bmatrix} y \\ y^* \end{bmatrix} \middle|\; 0,\; \sigma_g^2 \begin{bmatrix} K & K_{:,*} \\ K_{:,*}^T & K_{*,*} \end{bmatrix} + \sigma_e^2 I\right)}{\mathcal{N}\!\left(y \mid 0,\; \sigma_g^2 K + \sigma_e^2 I\right)} \propto \mathcal{N}\!\left(\begin{bmatrix} y \\ y^* \end{bmatrix} \middle|\; 0,\; \sigma_g^2 \begin{bmatrix} K & K_{:,*} \\ K_{:,*}^T & K_{*,*} \end{bmatrix} + \sigma_e^2 I\right) = \mathcal{N}(y^* \mid \mu^*, \Sigma^*)

Note that the result is again a Gaussian distribution! As a Gaussian distribution is always normalized, we can drop any constant terms that do not contain y^*. Completing the square identifies the mean \mu^* and the variance \Sigma^* of y^* given y.
Variance component models / Phenotype prediction

Inference: Gaussian conditioning in 2D

p(y_2 \mid y_1, K) = \frac{p(y_1, y_2 \mid K)}{p(y_1 \mid K)} \propto \exp\!\left(-\frac{1}{2} [y_1, y_2]\, K^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)

= \exp\!\left\{-\frac{1}{2}\left(y_1^2 K^{-1}_{1,1} + y_2^2 K^{-1}_{2,2} + 2 y_1 K^{-1}_{1,2} y_2\right)\right\}

= \exp\!\left\{-\frac{1}{2}\left(y_2^2 K^{-1}_{2,2} + 2 y_2 K^{-1}_{1,2} y_1\right) + C\right\}

= Z \exp\!\left\{-\frac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2 y_2 \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)\right\}

= Z \exp\!\left\{-\frac{1}{2} K^{-1}_{2,2}\left(y_2^2 + 2 y_2 \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}} + \left(\frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)^2\right) + \frac{1}{2} K^{-1}_{2,2}\left(\frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}\right)^2\right\}

= Z' \exp\!\left\{-\frac{1}{2} \underbrace{K^{-1}_{2,2}}_{\Sigma^{-1}}\Big(y_2 \underbrace{{}+ \frac{K^{-1}_{1,2} y_1}{K^{-1}_{2,2}}}_{-\mu}\Big)^2\right\} \propto \mathcal{N}(y_2 \mid \mu, \Sigma)
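Completing the square gives \mu = -K^{-1}_{1,2} y_1 / K^{-1}_{2,2} and \Sigma = (K^{-1}_{2,2})^{-1}. These agree with the textbook conditioning formulas \mu = (K_{1,2}/K_{1,1})\, y_1 and \Sigma = K_{2,2} - K_{1,2}^2/K_{1,1}, which a two-line check confirms:

```python
import numpy as np

K = np.array([[1.0, 0.6], [0.6, 1.0]])
Kinv = np.linalg.inv(K)
y1 = 1.5

# Mean and variance read off from the completed square on the slide:
mu = -Kinv[0, 1] * y1 / Kinv[1, 1]
Sigma = 1.0 / Kinv[1, 1]

# Standard conditioning formulas for a zero-mean 2D Gaussian:
mu_std = K[0, 1] / K[0, 0] * y1
Sigma_std = K[1, 1] - K[0, 1] ** 2 / K[0, 0]
```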
Variance component models / Phenotype prediction

Phenotype prediction: best linear unbiased prediction

Given the phenotype values of a set of individuals and the genetic relatedness, we can predict the genetic component of the phenotype of a new individual.

P(y^* \mid y) = \mathcal{N}\!\left(y^* \;\middle|\; \mu^*,\; \sigma_g^2 V_g^* + \sigma_e^2 I\right)

Predictive mean (BLUP):

\mu^* = \sigma_g^2 K_g^{*,:} \left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} y

Predictive variance:

V_g^* = K_g^{*,*} - \sigma_g^2 K_g^{*,:} \left(\sigma_g^2 K_g + \sigma_e^2 I\right)^{-1} K_g^{:,*}
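The BLUP formulas translate directly into code. A minimal numpy sketch on simulated relatedness (the sizes, seed, and variance components are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_star, s = 60, 5, 200               # training individuals, test individuals, SNPs

# Simulated relatedness for training + test individuals (illustrative).
X = rng.standard_normal((n + n_star, s))
K_full = X @ X.T / s
K = K_full[:n, :n]                      # K_g       (train x train)
K_star = K_full[n:, :n]                 # K_g^{*,:} (test x train)
K_ss = K_full[n:, n:]                   # K_g^{*,*} (test x test)

sg2, se2 = 0.6, 0.4
y = np.linalg.cholesky(sg2 * K + se2 * np.eye(n)) @ rng.standard_normal(n)

C = sg2 * K + se2 * np.eye(n)
mu_star = sg2 * K_star @ np.linalg.solve(C, y)               # BLUP predictive mean
V_star = K_ss - sg2 * K_star @ np.linalg.solve(C, K_star.T)  # predictive variance term
```

Note the solve against sigma_g^2 K_g + sigma_e^2 I is shared between the mean and the variance, so it only needs to be factorized once.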
Variance component models / Phenotype prediction

Summary

- Basic probability theory
- Linear mixed models
- Population structure correction
- Parameter estimation
- Variance component modeling
- Phenotype prediction
Variance component models / Phenotype prediction

Acknowledgements

Joint course material: O. Stegle
FaST-LMM: J. Listgarten, D. Heckerman
References I

P. Burton, D. Clayton, L. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. Kwiatkowski, M. McCarthy, W. Ouwehand, N. Samani, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics, 55(4):997–1004, 1999.

H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.

C. Lippert, J. Listgarten, Y. Liu, C. Kadie, R. Davidson, and D. Heckerman. FaST linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011. doi: 10.1038/nmeth.1681.

J. Listgarten, C. Lippert, and D. Heckerman. An efficient group test for genetic markers that handles confounding. arXiv preprint arXiv:1205.0793, 2012.

J. Novembre, T. Johnson, K. Bryc, Z. Kutalik, A. R. Boyko, A. Auton, A. Indap, K. S. King, S. Bergmann, M. R. Nelson, M. Stephens, and C. D. Bustamante. Genes mirror geography within Europe. Nature, 456(7218):98–101, 2008.

N. Patterson, A. L. Price, and D. Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006.
References II

A. L. Price, N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.

M. Wu, S. Lee, T. Cai, Y. Li, M. Boehnke, and X. Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 2011.