Simulating Genes in Genome-wide Association Studies

Talk given to the UCI Genetic Epidemiology Research Group (GERI, http://www.geri.uci.edu/) on May 16, 2014. Covers recent results on the power to detect associations in growing populations and the need for better statistical tests.


  1. Simulating Genes in GWAS. Kevin R. Thornton, Ecology and Evolutionary Biology, UC Irvine. Slides will be available at http://www.slideshare.net/molpopgen, http://www.molpopgen.org
  2. Acknowledgements: Tony Long, Andrew Foran, Jaleal Sanjak
  3. Excerpt from Burton et al. (doi:10.1038/nature05911): "Several genomic regions have been implicated in linkage studies and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with [...]" "[...] expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 ($P = 2.2 \times 10^{-7}$) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias [...]" [Figure: $-\log_{10}(P)$ vs. chromosome; panels: type 2 diabetes, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes, bipolar disorder] Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases $-\log_{10}$ of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values $< 1 \times 10^{-5}$ highlighted in green. All panels are truncated at $-\log_{10}(P\ \text{value}) = 15$, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
  4. Excerpt from Manolio et al. (doi:10.1038/nature08494): "[...] the differences observed in their allelic architecture. Some apparent differences may simply be due to differences in the stage of investigation across traits. Studies in several conditions have clearly demonstrated that the number of detected variants increases with increasing sample size [22–24]. Population genetic theory suggests an explanation for the paucity of variants explaining a large proportion of disease predisposition, in that decreased reproductive fitness should typically act to reduce the frequencies of high-risk variants. This might explain the relative lack of variants detected so far for some neuropsychiatric conditions, such as autism spectrum disorders, given their low reproductive fitness [25]. Yet for a condition such as type 1 diabetes, which has a similar prevalence, familial risk, early onset and poor reproductive fitness (at [...]" "[...] yielded intriguing new variants [33,34]. Studies of populations of recent African ancestry in particular is likely to increase the yield of rare variants and narrow the large chromosomal regions of association identified in the ‘younger’ population due to extended linkage disequilibrium, or the tendency for adjacent genetic loci to be inherited together [31]. Isolated populations may also be of value given their potential to be enriched in unique variants [35]. The accuracy of current heritability estimates is also important, because experimentally identified variants could never explain all the variance in an erroneously inflated heritability estimate. Heritability of quantitative traits, formally defined as the proportion of phenotypic variance in a population attributable to additive genetic factors (narrow-sense heritability, $h^2$ (ref. 36)) is typically estimated from [...]"
     Table 1 | Estimates of heritability and number of loci for several complex traits
     Disease | Number of loci | Proportion of heritability explained | Heritability measure
     Age-related macular degeneration [72] | 5 | 50% | Sibling recurrence risk
     Crohn’s disease [21] | 32 | 20% | Genetic risk (liability)
     Systemic lupus erythematosus [73] | 6 | 15% | Sibling recurrence risk
     Type 2 diabetes [74] | 18 | 6% | Sibling recurrence risk
     HDL cholesterol [75] | 7 | 5.2% | Residual* phenotypic variance
     Height [15] | 40 | 5% | Phenotypic variance
     Early onset myocardial infarction [76] | 9 | 2.8% | Phenotypic variance
     Fasting glucose [77] | 4 | 1.5% | Phenotypic variance
     * Residual is after adjustment for age, gender, diabetes.
  5. NHGRI GWA Catalog (www.genome.gov/GWAStudies, www.ebi.ac.uk/fgpt/gwas/). Published Genome-Wide Associations through 12/2012, $P \le 5 \times 10^{-8}$, for 17 trait categories.
  6. Wray et al., doi:10.1371/journal.pbio.1000579
  7. Excerpt from Hindorff et al. (www.pnas.org/cgi/doi/10.1073/pnas.0903103106), clipped at the right edge: "Unsurprisingly, since the GWAS method is primarily powered [...] common alleles, risk allele frequencies were well above 5% [...] all TASPs (reported index TASs with an association p valu[...] $5.0 \times 10^{-8}$ and all HapMap phase II CEU SNPs in LD [$r^2 > 0$[...]" [Figure: odds ratio (log scale) vs. reported risk allele frequency, %; labeled points: OCA2, eye color; MC1R, hair color; LOXL1, exfoliation glaucoma] "[...] 1. Published odds ratios for discrete traits by reported risk allele frequencies. Labeled SNP-trait associations are those with the highest ORs. Note tha[...]is is on the log scale."
  8. Excerpt from Robinson et al. (http://dx.doi.org/10.1016/j.tig.2014.02.003), clipped at both edges: "[...]tion explained by rare variants, because natural selection should minimize the frequency of deleterious variants in the population [24]. Therefore, for any phenotype, many causal variants will be rare, and [...] proportion of population-level genetic variance in complex phenotypes attributable to variants across the allele frequency spectrum will depend upon the strength of selection in our evolutionary past. The problem is that this is something that we do not [...]" "[...] that the power of detection is proportional to $pa^2$, but it is clear [...] for each complex trait, variance is contributed from the entire allele frequency spectrum. This highlights the scarcity of low-frequency variants identified by GWAS for quantitative traits and com[...] disease in humans. Detecting these variants will require a combination of greater sample size, better genotyping, and improved phenotyping." [Figure: (A) absolute effect (SD units) vs. minor allele frequency; (B) odds ratio vs. risk allele frequency] Figure I. For quantitative traits (A), the absolute effect is plotted against the minor allele frequency, whereas for complex common diseases (B), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits are represented by different colors. Abbreviation: SD, standard deviation.
  9. [Figure: odds ratio (log scale, 1 to 10) vs. annotation set; annotation sets: non-synonymous sites, promoters (1kb), promoters (5kb), 5’UTRs, 3’UTRs, miRTS, intronic regions, intergenic regions, intergenic TFBSs, CpG islands, PReMod sites, ORegAnno elements, EAR regions, MCSs, HARs, PSGs; panel title: Enrichment/depletion analysis after adjusting for ‘hitchhiking’ effects from non-synonymous sites] Fig. 2. Odds ratios for TAS block enrichment/depletion analysis after adjusting for ‘‘hitchhiking’’ effects from nonsynonymous sites. Four annotation sets (Splice sites, Validated enhancers, EvoFold elements, and noncoding RNAs) are not represented here because no TAS blocks mapped to these annotation sets. The blue circle represents the point estimate of the odds ratio (OR) and the red lines represent the 95% CI. Possible ‘‘hitchhiking’’ effects from nonsynonymous sites are reduced by discarding any TASP/control SNP in $r^2 > 0.6$ with a nonsynonymous SNP. For an explanation of the annotation sets on the x axis, we refer the reader to Table S4. Note that the y axis is on the log scale. Nonsynonymous OR computation is not adjusted for ‘‘hitchhiking’’ effects. Hindorff et al., www.pnas.org/cgi/doi/10.1073/pnas.0903103106
  10. Observation → Interpretation. Missing H → Lots. Uniform frequencies of “hits” → Common associations exist. Rare hits have larger OR → Rare alleles may have larger effects. Larger OR in genes → Genes matter.
  11. Observation → Interpretation. Rare hits have larger OR → Rare alleles may have larger effects. Larger OR in genes → Genes matter. Disease is harmful with respect to fitness (in the evolutionary sense).
  12. [Figure, panels a and b; axes: frequency of observations, causal variant frequency] Figure 3 | Inconsistency between genome-wide association stu[...] a | The frequency distribution of risk allele frequencies (shown in ligh[...] Gibson, doi:10.1038/nrg3118
  13. [Figure, panels a and b; axes: frequency of observations, causal variant frequency, odds ratio] Figure 3 | Inconsistency between genome-wide association study results and rare variant expectations. a | The frequency distribution of risk allele frequencies (shown in light red) for 414 common variant associations with 17 diseases is only slightly skewed towards lower-frequency variants. By contrast, simulations, in this case assuming up to nine rare causal variants inducing the common variant association with SNPs at the same frequency as observed on common genotyping platforms (light green bars), result in a marked left-skew with a peak for common variants whose frequency is less than 10%. (The skew is even stronger if only a single causal variant is responsible.) The observed data are thus not immediately consistent with the rare variant model. b | Part of the problem with synthetic associations is that they would explain too much heritability if they were pervasively responsible for common variant effects. This is due to the relationship between allele frequency, maximum possible linkage disequilibrium (LD) and the amount of variance explained [19]. The plot shows the expected odds ratio due to a rare variant of the indicated frequency (from 0.5% to 2%) if it increases the odds ratio at a common SNP (with which it is in maximum possible LD) by 1.1-fold. Intermediate effect sizes (2 < odds ratio < 5) require combined causal variant frequencies in excess of 1%. As the number of rare variants increases, the likelihood that they are in high LD with the common variant also drops, further [...] The multiplicative model: $G = \prod_i (1 + e_i)$. Risch & colleagues, Pritchard, countless others.
  14. The multiplicative model: $G = \prod_i (1 + e_i)$. [Figure: causative mutations on maternal allele (0 to 10) vs. causative mutations on paternal allele (0 to 10); scale 0.2 to 1.4] Risch & colleagues, Pritchard, countless others.
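As a rough illustration of the multiplicative model on this slide, the sketch below evaluates $G = \prod_i (1 + e_i)$ for one diploid. The function name and the effect sizes are assumptions made up for the example; they are not taken from the talk.

```python
import math

def multiplicative_genetic_value(effects):
    """Return G = prod_i (1 + e_i) over the effect sizes of every
    causative mutation an individual carries (both haplotypes pooled)."""
    return math.prod(1.0 + e for e in effects)

# Hypothetical effect sizes, chosen only for illustration.
paternal = [0.10, 0.05]
maternal = [0.20]

g = multiplicative_genetic_value(paternal + maternal)
print(round(g, 4))

# The same three mutations all on one haplotype give the same G:
# under this model only the pooled set of mutations matters.
g_cis = multiplicative_genetic_value([0.10, 0.05, 0.20] + [])
print(math.isclose(g, g_cis))
```

Because the product runs over all mutations regardless of which haplotype carries them, this model does not distinguish a mutation-free haplotype paired with a loaded one from two moderately loaded haplotypes.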
  15. WWHD? (What would Haldane do?) Genotype: AA, Aa, aa. Mating frequency: $p^2$, $2pq$, $q^2$. Fitness: $1$, $1 - sh$, $1 - 2s$. $\hat{q} = \frac{u}{sh}$; $\hat{q} \approx \sqrt{\frac{u}{s}}$ as $h \to 0$. Haldane, DOI: 10.1017/S0305004100015644
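The two equilibrium expressions on this slide can be evaluated directly. The sketch below is only a calculator for those two formulas as written; the function names and the values of u, s and h are hypothetical.

```python
import math

def q_hat_partially_dominant(u, s, h):
    """Equilibrium frequency as written on the slide: q_hat = u / (s * h)."""
    if s <= 0 or h <= 0:
        raise ValueError("s and h must be positive")
    return u / (s * h)

def q_hat_recessive(u, s):
    """Approximation on the slide for h -> 0: q_hat ~ sqrt(u / s)."""
    if s <= 0:
        raise ValueError("s must be positive")
    return math.sqrt(u / s)

# Hypothetical parameter values, for illustration only.
u, s = 1e-5, 0.1
q_dom = q_hat_partially_dominant(u, s, h=0.5)
q_rec = q_hat_recessive(u, s)
print(q_dom, q_rec)
```

With these illustrative values the recessive approximation gives an equilibrium frequency roughly fifty times higher than the partially dominant one, which is the contrast the slide sets up for the gene-based model that follows.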
  16. Mutation at rate u (per gamete per generation). [Diagram: “A” allele; “a” allele, with mutations marked X] The “a” allele is heterogeneous in its molecular origin; trans-heterozygotes are at risk. Phenotype has (weak) effect on individual fitness. Thornton et al., doi:10.1371/journal.pgen.1003258
  17. Effect sizes $\sim \mathrm{Exp}(\lambda)$. [Figure: density vs. effect size] $h_i$ = effect of haplotype, additive over causative mutations. Thornton et al., doi:10.1371/journal.pgen.1003258
  18. $G_{ij} = \sqrt{h_i \times h_j}$ (geometric mean). [Figure: causative mutations on maternal allele (0 to 10) vs. causative mutations on paternal allele (0 to 10); scale 0.05 to 0.4] $P_{i,j} = G_{i,j} + N(0, \sigma_e)$; $w = e^{-\frac{(P_{i,j})^2}{2\sigma_S^2}}$. Thornton et al., doi:10.1371/journal.pgen.1003258
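The gene-based model on slides 17 and 18 can be sketched for a single diploid as follows. This is a minimal illustration, not the fwdpp implementation: the function names, the seed and the parameter values are assumptions, λ is treated as the mean of the exponential distribution (following the "Mean effect size (λ)" axis labels), and the second argument of N(0, ·) is treated as a standard deviation.

```python
import math
import random

def haplotype_effect(n_mutations, mean_effect, rng):
    """h = sum of the effect sizes of the causative mutations on one
    haplotype. Each effect is exponential with mean `mean_effect`
    (random.expovariate takes the rate, i.e. 1 / mean)."""
    return sum(rng.expovariate(1.0 / mean_effect) for _ in range(n_mutations))

def genetic_value(h_i, h_j):
    """G_ij = sqrt(h_i * h_j), the geometric mean of the haplotype effects."""
    return math.sqrt(h_i * h_j)

def phenotype(g, sigma_e, rng):
    """P = G + Gaussian noise with mean 0 and standard deviation sigma_e."""
    return g + rng.gauss(0.0, sigma_e)

def fitness(p, sigma_s):
    """Gaussian fitness function: w = exp(-P^2 / (2 * sigma_s^2))."""
    return math.exp(-(p ** 2) / (2.0 * sigma_s ** 2))

rng = random.Random(42)                            # fixed seed, arbitrary
mean_effect, sigma_e, sigma_s = 0.25, 0.075, 1.0   # illustrative values only

h_pat = haplotype_effect(2, mean_effect, rng)      # two mutations on one haplotype
h_mat = haplotype_effect(1, mean_effect, rng)      # one mutation on the other
g = genetic_value(h_pat, h_mat)
p = phenotype(g, sigma_e, rng)
w = fitness(p, sigma_s)
print(h_pat, h_mat, g, p, w)

# A mutation-free haplotype gives G = 0 whatever the other haplotype carries.
g_one_clean = genetic_value(0.0, h_pat)
print(g_one_clean)
```

The last two lines show the property stated on slide 16: because the genetic value is a geometric mean, an individual needs causative mutations on both haplotypes to have a non-zero genetic value, unlike in the multiplicative model.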
  19. Aside: simulation tools • C++ library for rapid forward simulation • Available from https://github.com/molpopgen/fwdpp • Preprint on arXiv at http://arxiv.org/abs/1401.3786
  20. [Figure, four panels: mean run time (days) and mean peak memory use (Mb) vs. population size (N diploids: 1000, 10000, 50000), for θ = ρ = 100 and θ = ρ = 500; programs: sfs_code, SLiM, fwdpp (gamete-based), fwdpp (individual-based)] Thornton, http://arxiv.org/abs/1401.3786
  21. [Figure: mean run time (hours) and mean peak memory use (megabytes) vs. proportion of new mutations that are deleterious (0.1, 0.5, 1), for 2Nsh = 1, 2Nsh = 10 and 2Nsh = 100; simulations: fwdpp (gamete-based), fwdpp (individual-based), SLiM] Thornton, http://arxiv.org/abs/1401.3786
  22. Selection is weak. [Figure: relative fitness vs. mean effect size (λ); series: population mean fitness, average fitness of a case, average minimum fitness] Thornton et al., doi:10.1371/journal.pgen.1003258
  23. Heritability plateaus. [Figure: broad-sense heritability vs. mean effect size (λ)] Thornton et al., doi:10.1371/journal.pgen.1003258
  24. Rare alleles. [Figure: proportion vs. derived allele frequency; annotation: “= 0.25”] Thornton et al., doi:10.1371/journal.pgen.1003258
  25. GWAS have poor power. [Figure: power vs. mean effect size (λ); series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination] Thornton et al., doi:10.1371/journal.pgen.1003258
  26. Compare model to data… [Figure, panels a and b; axes: frequency of observations, causal variant frequency, odds ratio] Figure 3 | Inconsistency between genome-wide association study results and rare variant expectations. a | The frequency distribution of risk allele frequencies (shown in light red) for 414 common variant associations with 17 diseases is only slightly skewed towards lower-frequency variants. By contrast, simulations, in this case assuming up to nine rare causal variants inducing the common variant association with SNPs at the same frequency as observed on common genotyping platforms (light green bars), result in a marked left-skew with a peak for common variants whose frequency is less than 10%. (The skew is even stronger if only a single causal variant is responsible.) The observed data are thus not immediately consistent with the rare variant model. b | Part of the problem with synthetic associations is that they would explain too much heritability if they were pervasively responsible for common variant effects. This is due to the relationship between allele frequency, maximum possible linkage disequilibrium (LD) and the amount of variance explained [19]. The plot shows the expected odds ratio due to a rare variant of the indicated frequency (from 0.5% to 2%) if it increases the odds ratio at a common SNP (with which it is in maximum possible LD) by 1.1-fold. Intermediate effect sizes (2 < odds ratio < 5) require combined causal variant frequencies in excess of 1%. As the number of rare variants increases, the likelihood that they are in high LD with the common variant also drops, further reducing the probability that they can explain observed common variant association. Gibson, doi:10.1038/nrg3118; Wray et al., doi:10.1371/journal.pbio.1000579
  27. …reveals a pretty good fit. [Figure: mean number of markers vs. MAF of most significant marker (in cases), 0 to 0.5; annotations: n = 36.899, “= 0.05”] (Based on simulating imperfect SNP chips.) Wray et al., doi:10.1371/journal.pbio.1000579
  28. “Burden” tests do badly… [Figure, two panels: power vs. mean effect size (λ). Series in one panel: GWAS; GWAS, no recombination; Resequencing; Resequencing, no recombination. Series in the other panel: 50, 100, 200 and 250 markers, each with and without recombination] Madsen and Browning (2009); Li and Leal (2008). Thornton et al., doi:10.1371/journal.pgen.1003258
  29. …because the model is wrong. [Figure: mean number of causative mutations per diploid vs. mean effect size (λ); series: Controls, Cases, Controls (rares), Cases (rares)] Thornton et al., doi:10.1371/journal.pgen.1003258
  30. SKAT does ok. [Figure: power vs. mean effect size (λ); series: Resequencing, default weights and optimal p-values; GWAS, default weights and optimal p-values; Resequencing, Madsen-Browning weights and optimal p-values; GWAS, Madsen-Browning weights and optimal p-values] Thornton et al., doi:10.1371/journal.pgen.1003258
  31. Manhattan plots. [Two simulated panels: $-\log_{10}(p)$ vs. position (kbp, 0 to 100); marker classes: Common; Common, causative; Rare; Rare, causative] Thornton et al., doi:10.1371/journal.pgen.1003258. [Background: page from Burton et al., doi:10.1038/nature05911, NATURE|Vol 447|7 June 2007, body text clipped; figure axes: $-\log_{10}(P)$ vs. chromosome, observed test statistic vs. expected chi-squared value] Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value $< 1 \times 10^{-5}$. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).
  32. A new association test: $\mathrm{ESM}_K = \sum_{i=1}^{K} \left( -\log_{10}(p_i) + \log_{10}\frac{i}{K} \right)$. Thornton et al., doi:10.1371/journal.pgen.1003258. [Background: page from NATURE|Vol 447|7 June 2007, each text line clipped at the right edge] "evolutionary interest, genes showing eviden[...] particularly interesting for the biology of tra[...] eases; possible targets for selection include N[...] tase 1) at 11q13, which could have a role in [...] well as TLR1 (toll-like receptor 1) at 4p14 [...] biology of tuberculosis and leprosy has been [...] There may be important population st[...] captured by current geographical region[...] implementations of strongly model-base[...] STRUCTURE [11,12] are impracticable for dat[...] reverted to the classical method of principa[...] subset of 197,175 SNPs chosen to reduce in[...] librium. Nevertheless, four of the first si[...] clearly picked up effects attributable to loc[...] rather than genome-wide structure. The rem[...] show the same predominant geographical t[...] perhaps unsurprisingly, London is set some[...] tary Fig. 8). The overall effect of population struc[...] results seems to be small, once recent [...] Europe are excluded. Estimates of over-disp[...] trend test statistics (usually denoted λ; ref. 1[...] 1.05 for RA and T1D, respectively, to 1.08 [...]" [Figure axes: $-\log_{10}(P)$ vs. chromosome; observed test statistic vs. expected chi-squared value] Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project [...]
  33. ESM is a more powerful test. [Figure: power vs. mean effect size (λ); series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination] (Caveat: requires permutation to get p-values.) Thornton et al., doi:10.1371/journal.pgen.1003258
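The caveat on this slide, that ESM needs permutation to obtain p-values, can be sketched as below. This is an illustration under stated assumptions rather than the published implementation: the statistic follows the formula on slide 32 with the p-values sorted in ascending order, the single-marker test is a plain 1-d.f. allelic chi-squared test chosen as a stand-in, and all function names and the toy genotypes are hypothetical.

```python
import math
import random

def allelic_p_value(genotypes, is_case):
    """P-value of a 1-d.f. chi-squared test on the 2x2 table of allele
    counts (derived vs. ancestral) in cases vs. controls. `genotypes`
    holds 0, 1 or 2 copies of the derived allele per individual."""
    case_alt = sum(g for g, c in zip(genotypes, is_case) if c)
    ctrl_alt = sum(g for g, c in zip(genotypes, is_case) if not c)
    n_case = 2 * sum(is_case)
    n_ctrl = 2 * (len(is_case) - sum(is_case))
    a, b = case_alt, n_case - case_alt
    c, d = ctrl_alt, n_ctrl - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # monomorphic marker or empty group: no evidence
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-squared(1)

def esm(p_values, k):
    """ESM_K as on slide 32, summed over the K smallest p-values."""
    top = sorted(p_values)[:k]
    k = len(top)
    # Floor p at a tiny value so log10 never sees an underflowed zero.
    return sum(-math.log10(max(p, 1e-300)) + math.log10(i / k)
               for i, p in enumerate(top, start=1))

def esm_permutation_p(markers, is_case, k, n_perm, rng):
    """One-sided permutation p-value: shuffle case/control labels,
    recompute every single-marker p-value and ESM_K, and count permuted
    statistics at least as large as the observed one."""
    def statistic(labels):
        return esm([allelic_p_value(m, labels) for m in markers], k)
    observed = statistic(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if statistic(labels) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 100 individuals, 5 markers with random genotypes (hypothetical).
rng = random.Random(1)
is_case = [True] * 50 + [False] * 50
markers = [[rng.choice([0, 0, 1, 2]) for _ in is_case] for _ in range(5)]
obs, pval = esm_permutation_p(markers, is_case, k=3, n_perm=200, rng=rng)
print(obs, pval)
```

Shuffling the labels of whole individuals, rather than permuting each marker separately, keeps the correlation between markers intact in every permuted data set, which is why the whole region is re-tested in each permutation.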
  34. Running ESM on real data • We think we can implement ESM using a mix of the PLINK toolkit plus some custom programs. • We need data to test it out on. • There are very few modern GWAS available for reanalysis. • Lack of data sharing hurts the field.
  35. Rare alleles and missing heritability • Current tests are underpowered • Heterogeneity means that GWAS “hits” tag few causative mutations • Causative mutations that are tagged tend to be (relatively) common. These “common” mutations have effect sizes much smaller than the typical causative mutation that segregates
  36. [Figure: mean number of causative singletons per individual vs. number of copies of derived allele at focal SNP (0, 1, 2); panels labelled 0.010, 0.025, 0.050, 0.075, 0.100, 0.125, 0.175, 0.250, 0.350, 0.500; focal SNP: most significant marker or unassociated SNP] Thornton et al., doi:10.1371/journal.pgen.1003258
  37. Population growth. [Diagram: population size vs. time, from past to present]
  38. H^2 insensitive to growth. [Figure: mean broad-sense heritability vs. average effect size of new mutation; models: constant, growth] Unpublished
  39. Consistent with recent findings from other groups. Excerpt from Simons et al. (doi:10.1038/ng.2896), clipped at the left edge: "[...] despite these substantial shifts in the overall frequency spectrum, the impact on genetic load (namely, the mean number of deleterious variants per individual and thus the average fitness) is much more subtle. In the semidominant case, the individual burden is essentially unaffected by these demographic events (Fig. 1c,d). With growth, the increased number of segregating sites is balanced exactly by a decrease in the mean frequency (with the converse being true for the bottleneck model) so that the number of variants per individual stays constant. This kind of balance is predicted by classic mutation-selection balance models [18] and can be shown to hold for general changes in population size, provided that selection is strong and deleterious alleles are at least partially dominant (Supplementary Note). The behavior of the recessive model is more complicated (Fig. 1e,f). In the bottle[...]" [Figure, panels a to f: population size vs. time (generations) for the bottleneck and growth models; number per MB vs. time since beginning of bottleneck or growth (generations), for semidominant and recessive mutations; series include number of segregating sites, number of rare segregating sites, number of deleterious alleles per individual, number of rare deleterious alleles per individual, number of homozygous sites per individual] Figure 1 Time course of load and other key aspects of variation through a bottleneck and exponential growth. (a,b) The bottleneck (a) and exponential growth (b). (c–f) The expected number of variants and alleles per MB assuming semidominant mutations (c,d) or recessive mutations (e,f) with s = 1% and a mutation rate per site per generation of $10^{-8}$.
  40. Power is affected. [Figure, two panels: frequency in population vs. effect size of segregating causative mutation (models: Constant, Growth); power vs. mean effect size of causative mutation (statistics: ESM50, Logit, SKAT; models: Constant, Growth)] Unpublished
  41. Excellent fit to empirical data. [Figure: no. markers vs. frequency of most-associated marker (0.0 to 1.0)] Unpublished
  42. Implications • Power to detect regions with modest effects on risk (4-5% contribution to broad-sense heritability) is very low in growing populations • The explanatory power of simple models is probably far from exhausted
  43. Implications • Much more likely to detect loci with mutations of modest effect • Underlying distribution of mean effect size across loci is completely unknown in any system. [Figure: power vs. mean effect size of causative mutation; statistics: ESM50, Logit, SKAT; models: Constant, Growth] Unpublished
  44. Future work • Multilocus models with epistasis • Machine learning approaches: do they work? • Develop new simulation tools • Make simulation output available • Implement ESM test for analyzing real GWAS data
  45. Other work in the lab • Copy number variation in Drosophila: doi:10.1093/molbev/msu124 • Detecting TE insertions using paired-end data in Drosophila: doi:10.1093/molbev/mst129 • Modeling experimental evolution: doi:10.1093/molbev/msu048 • Structural variation and variation in gene expression
