Confounding from population structure, extended families and inbreeding can be a significant issue for burden and kernel association tests on rare variants from next generation DNA sequencing. An obvious solution is to combine the power of a mixed model regression analysis with the ability to assess the rare variant burden using methods such as KBAC or CMC. Recent approaches have adjusted burden and kernel tests using linear regression models; this method adjusts for the relatedness of samples and includes that directly into a logistic regression model.
This webcast will focus on the details of bringing Mixed Model Regression and KBAC together, including: deriving an optimal logistic mixed model algorithm for calculating the reduced model score, how the kinship or random effects matrix should be specified, and how it all comes together into one algorithm. Results from applying the method to variants from the 1000 Genomes project will also be presented and compared to famSKAT.
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
ย
MM - KBAC: Using mixed models to adjust for population structure in a rare-variant burden test
1. MM-KBAC: Using mixed
models to adjust for
population structure in a
rare-variant burden test
Tuesday, June 10, 2014
Greta Linse Peterson
Director of Product Management and Quality
2. Questions during
the presentation
Use the Questions pane in
your GoToWebinar window
3. Golden Helix Offerings
Software
๏ง SNP & Variation
Suite (SVS) for
NGS, SNP, & CNV
data
๏ง GenomeBrowse
๏ง New products in
development
Support
๏ง Support comes
standard with
software
๏ง Customers rave
about our support
๏ง Extensive online
materials including
tutorials and more
Services
๏ง Genomic Analytics
๏ง Genotype
Imputation
๏ง Workflow
Automation
๏ง SVS Certification &
Training
4. Core
Features
Packages
Core Features
๏ง Powerful Data Management
๏ง Rich Visualizations
๏ง Robust Statistics
๏ง Flexible
๏ง Easy-to-use
Applications
๏ง Genotype Analysis
๏ง DNA sequence analysis
๏ง CNV Analysis
๏ง RNA-seq differential
expression
๏ง Family Based Association
SNP & Variation Suite (SVS)
6. Study Design
๏ง Large cohort population based design (cases with matched controls
or quantitative phenotypes and complex traits)
- Assumes: independent and well matched samples
- Can interrogate complex traits
๏ง Small families (trios, quads, small extended pedigrees)
- Can only analyze a single family at a time, looking for de Novo, recessive or
compound het variants unique to an affected sample in a single family
- Looking for highly penetrant variants
7. What If????
๏ง What if we have:
- Known population structure
- Cannot guarantee independence between samples
- Controls were borrowed from a different study
- Multiple families with affected offspring all exhibiting the same phenotype
- Multiple large extended pedigrees of unknown structure
8. Just Add Random Effects!
๏ง Why canโt we just add random effects to our regression models for
our rare-variant burden testing algorithms?
๏ง Existing mixed model algorithms assume a linear model
๏ง Kernel-based adaptive clustering (KBAC) uses a logistic regression
model
๏ง Hmm what to doโฆ.?
9. WARNING!
What is about to follow are formulas and statistics, specifically matrix
algebraโฆ
But donโt worry weโll end the webcast with a presentation of some
preliminary results! So hang in there!
10. But firstโฆ.
The dataset we have chosen for today is the 1000 Genomes Pilot 3
Exons dataset with a simulated phenotype.
12. Why Mixed Models + KBAC?
๏ง OK Mixed Models makes sense, but why KBAC?
๏ง KBAC was chosen as our proof of concept rare-variant burden test
for complex traits
๏ง KBAC uses a score test which is trivial to calculate once you
compute the reduced model
๏ง Mixed models can be added to other burden and kernel tests using
the same principles
13. What is KBAC?
๏ง KBAC = Kernel-based Adaptive
Clustering
๏ง Catalogs and counts multi-marker
genotypes based on variant data
๏ง Assumes the data has been filtered to
only rare variants
๏ง Performs a special case/control test
based on the counts of variants per
region (aka gene)
๏ง Test is weighted based on how often each
genotype is expected to occur according
to the null hypothesis
๏ง Genotypes with higher sample risks are
given higher weights
๏ง One-sided test primarily, which means it
detects higher sample risks
18. KBAC Statistic
ํ ํํ
๏ง ํพํตํดํถ1 = ํ=1
ํด
ํํด โ
ํํ
ํ
ํพํํ ํ
0 ํ ํ
๏ง Where the weight is defined as:
0 ํ ํ =
ํคํ = ํพํ
ํ ํ
ํํ
0
0 ํ ํํ
The weight can be calculated as a:
- Hyper-geometric kernel
- Marginal binomial kernel
- Asymptotic normal kernel
19. Determining the KBAC p-value
๏ง Monte-Carlo Method is used as an approximation for finding the p-value
ํดfor each genotype ํบํ approximates a
๏ง The number of cases ํํ
ํด~ ํตํํํํ ํํ ,
binomial distribution ํํ
ํํด
ํ
๏ง The case status is permuted among all samples. The covariates and
genotypes are held fixed.
20. Logistic Mixed Model Equation
log
ํ ํํ = 1 | ํํ , ํํํํ , ํขํ
1 โ ํ ํํ = 1 | ํํ , ํํํํ , ํขํ
= ํฝ0 + ํฝ1ํํ +
ํ
ํฝํํํํํํ + ํขํ
Null hypothesis: ํป0: ํฝ1 = 0
The score statistic to test the null of the independence of the model
from ํํ is:
ํ = ํ ํํ ํํ โ ํํ , where
ํํ = โ ํํ =
ํ
ํํ
1+ํ
ํํ, and
ํํ = ํฝ0 + ํ ํฝํํํํํํ + ํขํ, and
ํขํ is the random effect for the ํ^(ํกโ) sample.
21. Logistic (Reduced) Mixed Model Equation
log
ํ ํํ = 1 | ํํํํ , ํขํ
1 โ ํ ํํ = 1 | ํํํํ , ํขํ
= ํฝ0 +
ํ
ํฝํํํํํํ + ํขํ
Which can be rewritten as:
ํธ ํ ํข = โ ํํํฝ + ํข = โ ํ = ํ
And
ํํํ ํ ํข = ํด = ํด1
2ํด 1
2
Where ํด is the variance of the binomial distribution itself, where
ํดํํ = ํํ 1 โ ํํ ํํํ ํ = 1. . ํ
ํดํํ = 0 ํํํ ํ โ ํ
And the linear predictor for the model is ํ = ํํฝ + ํข
While โ โ is the inverse link function for the model
22. Solving the Logistic Mixed Model
Iterate between creating a linear pseudo-model and solving for the
pseudo-modelโs coefficients
โ ํ = โ ํ + ฮ ํ ํฝ โ ํฝ + ฮ (ํข โ ํข )
Where
ฮ =
ํโ ํ
ํํ ํฝ, ํข
Rearranging yields
ฮ โ1 ํ โ โ ํ + ํํฝ + ํข = ํํฝ + ํข
The left side is the expected value, conditional on ํข, of
ํ โก ฮ โ1 ํ โ โ ํ + ํํฝ + ํข
The variance of ํ given ํข is
ํํํ ํ ํข = ฮ โ1ํด 1
2ํด 1
2 ฮโ1
Where
ํด ํํ = ํ ํ 1 โ ํ ํ , ํด ํํ = 0 (ํํํ ํ โ ํ)
23. Transform Pseudo-Model to use EMMA
Pseudo-model: ํ = ํํฝ + ํข + ํ and ํํํ ํ = ํํํ ํ ํข
NOTE: As an alternative, rather than using the prediction of ํข from the pseudo-model,
we can use the expected value of ํข, which is zero
Want to solve using EMMA (Kang 2008)
Find ํ such that ํํํ ํํ = ํผ
So that we can write
ํํ = ํํํฝ + ํํข + ํํ
And use EMMA to solve the mixed model
ํํ = ํํํฝ + ํํข + ํโ
Where the variance of ํโis proportional to ํผ
It can be shown that this is solved by letting
ํ = ํด โ 1
2 ฮ
24. Summary of the Algorithm
First pick starting values of ํฝ and ํข , such as all zeros. Repeat the following
steps until the changes in ํฝ and ํข are sufficiently small:
1. Find ํ and ํ from the original linear predictor equation and the definition
of โ โ
2. Find the (diagonal) ฮ matrix
3. Find the pseudo-model ํ
4. Find the (diagonal) matrix ํ
5. Solve the following for new values of ํฝ and ํข using EMMA:
ํํ = ํํํฝ + ํํข
NOTE: The alternative method modifies Step 5 to use EMMA to determine the variance
components and to find a new value for ํฝ , while leaving the value of ํข at its expected
value of zero.
After convergence, the alternative method predicts the values of ํข, and
computes the final values of ํ and ํ from this prediction
31. Conclusion
๏ง This will method will be added into SVS in the near futureโฆ
In the meantimeโฆ
๏ง Like to try it out on your dataset โ ask us to be part of our early-access
program!
๏ง We have submitted an abstract to ASHG, hope to see you there!
32. Announcements
๏ง Webcast recording and slides will be up on our website tomorrow.
๏ง T-shirt Design Contest! Details at www.goldenhelix.com/events/t-shirtcontest.
html
๏ง Next scheduled webcast is July 22nd, but Heather Huson of Cornell
University.