Analytical Profile of Coleus Forskohlii | Forskolin .pptx
gnomAD at AGBT
1. Variation across 141,456
individuals reveals the
spectrum of loss-of-function
intolerance of the human
genome
Konrad Karczewski
March 2, 2019
@konradjk
broad.io/gnomad_lof
2. Range of LoF impact
embryonic lethal
recessive disease
non-essential
complex disease
beneficial
haploinsufficient disease
3. Identifying true LoF variants is challenging
• LoFs are rare
• LoFs are enriched for artifacts
4. Identifying true LoF variants is challenging
• LoFs are rare
• LoFs are enriched for artifacts
6. gnomAD 2.1.1
• Data provided by 109 PIs
• 1.3 and 1.6 petabytes of BAM files
• Uniformly processed and joint called
• 12 and 24 terabyte VCFs
• Developed a novel QC pipeline
• Complete pipeline publicly available: broad.io/gnomad_qc
• All QC and analysis performed using Hail: hail.is
• Scalable to thousands of CPUs
• Enabled rapid iteration (few hours for each component,
few days for entire process)
Grace
Tiao
Laurent
Francioli
Broad Genomics Platform
Broad Data Sciences Platform
Tim PoterbaCotton Seed
9. Staggering amounts of pLoFs
• gnomAD 2.1 contains:
• 230M variants in 15,708
genomes
• 15M variants in 125,748
exomes
• Of these, we observe
515,326 predicted loss-of-
function (pLoF) variants
• Stop-gained
• Essential splice
• Frameshift indel
pLoF
0
100,000
200,000
300,000
400,000
0 40,000 80,000 125,748
Sample size
Numberobserved
10. Identifying true LoF variants is challenging
• LoFs are rare
• LoFs are enriched for artifacts
11. LOFTEE removes benign variation
• LoF filtering plugin to VEP, LOFTEE
• Variants retained by LOFTEE are:
• more rare, and thus
• more deleterious
• After filtering, we discover 443,769
high-confidence pLoFs in gnomAD
https://github.com/konradjk/loftee
●
●
●
●
0.00
0.05
0.10
0.15
synonymous
missense
low
confidence pLoF
high confidence pLoF
MAPS
12. Detecting genes depleted for pLoFs
• Mutational model that predicts the number of SNVs in a given
functional class we would expect to see in each gene in a cohort
• Now incorporating methylation, improved coverage correction, LOFTEE
• Previously transformed into the probability of LoF intolerance (pLI)
• Applying to 125,748 gnomAD exomes
• Median of 17.3 pLoFs expected per gene
• Direct estimate of observed/expected ratio
• New scores: LOEUF
• Upper bound of 90% confidence interval
Kaitlin Samocha
(Samocha et al. 2014;
Lek et al. 2016)
13. Most genes are depleted of LoF variation
MED13L FNDC3B
Phenotype Severe Intellectual Disability None
Observed Expected Obs/Exp (CI) Observed Expected Obs/Exp (CI)
Synonymous 462 465 0.993 (0.92-1.07) 271 266 1.02 (0.92-1.13)
pLoF 0 102 0 (0-0.029) 0 68 0 (0-0.043)
• Many are extremely depleted
(<20% observed compared to
expected)
• Including most known (curated)
haploinsufficient genes
• Using upper bound of
confidence interval corrects
for small genes
0
500
1000
1500
0.0 0.5 1.0 1.5
Observed/Expected
Numberofgenes
0
200
400
600
800
0.0 0.5 1.0 1.5 2.0
LOEUF
Numberofgenes
14. • Binning this spectrum into deciles
Resolving the spectrum of LoF intolerance
Haploinsufficient
Autosomal Recessive
Olfactory Genes
0%
20%
40%
0% 20% 40% 60% 80% 100%
LOEUF decile
Percentofgenelist
More depleted
More constrained
More tolerant
Less constrained
15. • Known haploinsufficient genes have ~10% of the expected pLoFs
Resolving the spectrum of LoF intolerance
Haploinsufficient
Autosomal Recessive
Olfactory Genes
0%
20%
40%
0% 20% 40% 60% 80% 100%
LOEUF decile
Percentofgenelist
16. • Autosomal recessive genes are centered around 60% of expected
Resolving the spectrum of LoF intolerance
Haploinsufficient
Autosomal Recessive
Olfactory Genes
0%
20%
40%
0% 20% 40% 60% 80% 100%
LOEUF decile
Percentofgenelist
Gene list from:
Blekhman et al., 2008
Berg et al., 2013
21. Population-based constraint
• Repeating this constraint
calculation for each population
(8,128 individuals per
population)
• Lower mean o/e and LOEUF
in African/African-Americans
• Consistent with increased
efficiency of selection in
populations with larger effective
population sizes
●
●
●
●
●
0.45
0.50
0.55
0.60
African/
African−American Latino
East Asian
European
South Asian
LOEUF
22. ●
●
●
●
●
●
●● ●● ●● ●
●
●● ●● ●●
synonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymoussynonymous
pLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoFpLoF
0
5
10
15
0% 20% 40% 60% 80% 100%
LOEUF decile
Rateratiofordenovo
variantsinID/DDcases
comparedtocontrols
Constraint improves rare disease diagnosis
• Patients with developmental
delay/intellectual disability are
15X more likely to have an de
novo LoF in a constrained gene
• 8,095 de novos in 5,305 cases
• 2,623 de novos in 2,179 controls
• Integrating expression data
improves this further
Jack
Kosmicki
Beryl
Cummings
broad.io/tx_annotation
23. Constraint informs common disease etiologies
• Compared to genome-wide
background, SNPs near
constrained genes are
enriched in their contribution
to heritability of common traits
• In particular, traits that
previously1 showed an
enrichment of ultra-rare
variants are also enriched
among constrained genes
●
●
●
●
●
●
● ●
●
●
1.0
1.2
1.4
0% 20% 40% 60% 80% 100%
LOEUF decile
Partitioningheritability
enrichment
Schizoprenia
Qualifications: College or University degree
Duration to first press of snap−button in each round
Educational attainment Bipolar
10-2
10
-4
10
-6
10-8
10-10
10
-12
10
-14
Activities
Cardiovascular
Cognitive
Environment
Hematological
Metabolic
Nutritional
Ophthalmological
Psychiatric
Reproduction
Respiratory
Skeletal
Social Interactions
Other
Traitenrichment
p−value
1Ganna et al. 2018 AJHG
24. Data publicly released with no publication restrictions
gnomad.broadinstitute.org
Matt
Solomonson
Nick
Watts
Gene model with
transcripts
Pathogenic Clinvar
Variants
Dataset
selection box
Tissue
isoform
expression
Constraint
metrics
25. Coming soon! Structural variant calls in the browser
gnomad.broadinstitute.org
Matt
Solomonson
Nick
Watts
26. • broad.io/gnomad_lof – Flagship LoF manuscript (these slides)
• broad.io/gnomad_drugs – Applications to drug targets
• broad.io/gnomad_lrrk2 – LRRK2 LoF carriers do not exhibit phenotypes
• broad.io/gnomad_uorfs – Deleterious 5’ UTR variants
• broad.io/tx_annotation – Transcript-based annotation
• Coming soon:
• broad.io/gnomad_mnv – Mechanisms of multi-nucleotide variants
• broad.io/gnomad_sv – Structural variant map of 15K genomes
27. Acknowledgments
• Laurent Francioli
• Grace Tiao
• Kaitlin Samocha
• Ben Neale
• Jessica Alföldi
• Kristen Laricchia
• Genomics Platform
• Data Science
Platform
• Daniel MacArthur
• Mark Daly
• Mike Talkowski
• Beryl Cummings
• Matt Solomonson
• Ben Weisburd
• Hail team
• Tim Poterba
• Cotton Seed
• Analytic and
Translational
Genetics Unit
• Genome Aggregation Database (gnomAD)
• Full list of consortia and contributors at:
gnomad.broadinstitute.org/about
broad.io/gnomad_lof
Editor's Notes
I'm going to talk about work I've done with the gnomAD consortium on loss-of-function variation.
Imagine if we could put each of the 20K genes in the genome along a spectrum of sensitivity to functional disruption, that is, the clinical or phenotypic impact that a loss-of-function variant might have in that gene.
For instance, here over on the left are genes where we’ll never see LoF variants in living humans as these would be incompatible with human life. In the middle are variants and genes that we typically study in the clinical genetics space, from causal variants for dominant and recessive diseases to risk factors for complex disease. On the right, we have genes that are relatively tolerant of LoF variation, potentially even homozygous inactivation.
And unlike in model organisms, where we can effectively engineer such mutations, there are obvious technical and ethical barriers to doing so in humans. But when we sequence healthy individuals or individuals with common diseases, we find plenty of genes inactivated, in the form of naturally-occurring predicted loss-of-function variants (or pLoFs). However…
In order to discover the high-impact rare LoF variants, our group and others have undertaken large sequencing projects to catalog natural genetic variation.
Today, I'm going to describe recent efforts using the larger Genome Aggregation Database (gnomAD) which comprises 125K exomes and 15K genomes, which we’ve now released to the public. We’ve released version 2.1.1 which…
We’ve released version 2.1.1 which comprises about 3 PB of BAM files, generously provided by over 100 PIs. Thanks to the tireless efforts of the Broad Genomics and Data Sciences Platforms, these have been run through a uniform pipeline and joint called into 36 terabytes of VCFs. If you've been using the ExAC or gnomAD datasets, we recommend switching to this new release now, as we’ve developed a novel QC pipeline that we believe increases the quality of the dataset substantially.
This dataset contains a substantial amount of variation, including...
Here you can see the number of variants discovered in the exomes, broken down by functional class, as a function of sample size, which follow approximately a square root law. If we zoom into the predicted LoFs...
...we observe over half a million LoFs, and I should clarify that when I'm talking about pLoFs today, I'm referring to stop-gained, essential splice, and frameshift variants.
So now that we've increased our sensitivity and discovered a bunch of rare putative LoFs, now we'd like to increase our specificity...
To this end, we've created a tool called LOFTEE, a plugin to VEP that filters out common error modes based on first principles, and importantly, does not use frequency. In spite of that, when we look at the mutability adjusted proportion of singletons, or MAPS, a metric of deleteriousness based on frequency, LOFTEE filters out variants that have a frequency spectrum consistent with missense variants, while variants that are retained are much more rare on average and thus more deleterious. After filtering...
A few years ago, Kaitlin Samocha built a mutational model to predict...
And we've now built on this model with a number of improvements to refine the model and increase specificity.
Previously, this constraint metric was defined in a metric called pLI.
However, now that we're applying to a larger dataset, with our greater resolution, we can use the more interpretable observed/expected ratio, and build a confidence interval around this value, which can give us a conservative estimate of the observed to expected ratio.
Using this method, we can return to the question of where genes fall on this spectrum of LoF tolerance. Most genes have a depletion of pLoFs (that is, observed/expected less than 1), and many are extremely depleted, including most known HI genes.
Just a note for anyone who has used the pLI scores previously, we've now flipped the scale, so the genes over on the left side are high pLI genes. So these improvements solidify our ability to detect constraint, here are two very clear examples.
LoFs in MED13L previously demonstrated to cause severe ID, facial features, and cardiac phenotypes. FNDC3B has no known human phenotype, but results in death at birth when knocked out in mice. But if you find a rare disease patient with an LoF in this gene, you might be concerned.
Some of you may notice this tick on the left side, where observed/expected is zero. This can happen due to extreme constraint, or small genes (say, observed = 0, expected = 2). At larger sample sizes, this will even itself out and we could use the observed/expected ratio, but for now, we can use the upper bound of the confidence interval, which we term LOEUF, resulting in a much smoother distribution, which I’ll use from here on out.
So we can bin this metric into deciles, which I'll show on every slide from here on out with the left ...
116
1164
350
so the metric is well-calibrated, and importantly this means we now have improved LoF tolerance scores for all protein-coding genes in the genome
This fits with what we see in model systems, where genes that are embryonically lethal in mouse are more likely to be in the human constrained genes. Similarly, in CRISPR screens, genes that are essential for cell viability are also more likely to be constrained and the opposite for the confidently non-essential genes.
We next explored the correlation between constraint against SNVs with patterns of structural variation. Ryan and Harrison called SVs in 14K individuals, identifying about 10K rare biallelic autosomal LoFs that disrupt gene function.
When they looked at the occurrence of SVs in each of the constraint deciles, they found that on average, the constrained genes had a strong depletion of structural variation,
If here with more than 4 mins to go: Important to note that this is not a per-gene SV metric, as even this dataset of 15K has less than one rare LoF SV per gene.
One interesting aspect we can explore with the new constraint framework is population-specific constraint, and ask whether genes are constrained in one population or another. While our sample size for each population is too small to detect constraint against individual elements, we can see that on average, constraint is most effective in the African/African-American population, consistent with…
If we look at the burden of de novo LoFs in patients with developmental delay or intellectual disability, we observe a 15-fold increased burden in the top 10% most constrained genes in the genome. So in other words, this lowest decile contains genes where a single LoF mutation will prevent you, by an estimable amount, from progressing through a healthy development during childhood.
Finally, we can investigate how these constraint metrics relate to common disease biology. In a partitioning heritability analysis, we find that SNPs near constrained genes are enriched for heritability of common traits.
If we zoom in on which traits have the strongest enrichment for heritability among constrained genes, we find schizophrenia, bipolar disorder, and educational attainment, which is consistent with previous work that marked these traits as enriched for ultra-rare coding variants
And you can load TTN!
And you can load TTN!
If anyone is interested in learning more about the dataset and what you can do with it, we’ve put 5 papers on biorxiv with 2 to come in the next few weeks on different aspects.