2015 functional genomics variant annotation and interpretation- tools and public data
1. Variant Annotation and
Interpretation:
Tools and Public Data
Functional Genomics Symposium, Qatar
December 12, 2015
Gabe Rudy
@gabeinformatics
VP Product Management and Engineering
Golden Helix
2. My Background
Golden Helix
- Founded in 1998
- Genetic association software
- Analytic services
- Over ten thousand users worldwide
- Over 800 customer citations in journals
Products I Build with My Team
- SNP & Variation Suite (SVS) - Research
- VarSeq – Clinical & NGS Research
- GenomeBrowse (Free!) - All
What I Do (Coding, Bioinformatics)
- Build tools, build pipelines of tools
- Blog
- Participate in GA4GH, HGVS Discussions,
NCBI EVAC
3. Topics
• ACMG Guidelines for Variant Interpretation
• Necessity of visualization
• Public data and tools for annotations
• Accurate gene annotations, choice of “clinical transcripts”
• Variant representation, “left-shifting” and HGVS nomenclature
• Warehousing variants
4. ACMG Guidelines
Five-tier terminology system:
- “pathogenic,” “likely pathogenic,” “uncertain significance,” “likely
benign,” and “benign”
- Mendelian and mitochondrial variants
- Variant assessment guidelines as combined from 11 labs
- Report variant with condition and inheritance pattern
- c.1521_1523delCTT (p.Phe508del),
pathogenic, cystic fibrosis, autosomal recessive
- “Likely pathogenic” and “likely benign” mean greater than 90% certainty
- Provide genomic coordinates (g.)
- Transcript selection up to lab to define "clinically relevant"
5. You Need Visualization, Not Just a Table to Interpret
Recovery of Frameshift (in Supercentenarian)
6. Visualization of Variants to Aid Interpretation
Variants + Genomic Context
- Where it is in gene
- Annotations that match, don’t match
- Other variants in cohort / warehouse
- Locality and rare/common variants
- Locality of pathogenic variants
Interpreting Multiple Transcripts
Alignment Evidence
- BAM files provide more than is in VCF
- Phasing of same-read mutations
- Examine sites of related samples with
no variants called
7. Visualization
Free Genome Browsers:
- IGV
- Popular desktop by Broad
- UCSC
- Web-based, most extensive
annotations
- GenomeBrowse
- Designed to be publication ready
- Smooth zoom and navigation
- Built in all Golden Helix curated
annotations (stream or
download)
8. Annotation with Public Data
Population databases:
- Don't assume “population” == healthy controls
- ExAC, EVS, 1kG, dbSNP
Disease databases:
- OMIM, ClinVar, HGMD
In-Silico Prediction
- Whether missense change is damaging
- 65–80% accurate when examining known disease variants
- Expect them to be over-sensitive, but they can serve as a low-pass filter to call “likely benign”
- Expect correlation between tools, as they often use similar underlying pieces of evidence
- Splicing: predicting a variant's effect on the splicing of genes
RefSeqGenes and Human Reference
9. Annotations are Hard!
HGVS is a standard that is not
computable
- Tries to serve different goals
- Many representations of same variant
- Difficult when used as identifier, but only
alternative is genomic representation (g.)
Transcripts
- Transcript set choice extremely important
- Hard to curate with meaningful tx attributes.
Public Data Curation
- ClinVar: multi-record lines, bits in VCF/XML
- NHLBI: MAF vs AAF, splitting “glob” fields
- 1kG: No genotype counts
- ExAC: Multi-allelic splitting, left-align
- ClinVitae (and COSMIC): only HGVS
- dbNSFP: Abbreviations and aggregate
scores
Versioning and Issues
- ClinVar missing ~5K pathogenic in VCF
- dbSNP patches without version changes
10. Population Catalogs
1000 Genomes (WGS, Exome, SNP Array)
- Many releases, most recent now
standardized, still incrementally updated
- 2,500 genomes – Phase3
“ESP” (NHLBI 6,500 Exomes) (a.k.a EVS)
- Had many releases, now V2-SSA137 0.0.30
- European American / African American only
ExAC (Broad 61,486 Exomes v0.3)
- Many sub-populations
Supercentenarians (110+ yo, 17 WGS)
- Available as raw Complete Genomics data
- Requires normalizing to match Illumina NGS
11. InSilico Predictions
Non-synonymous functional
predictions
- SIFT, PolyPhen-2, LRT, MutationTaster,
MutationAssessor, FATHMM
Conservation
- GERP++, PhyloP, phastCons
All-In-One Scores
- CADD, VAAST, VEST3, DANN, FATHMM-MKL,
MetaSVM and MetaLR
- Use machine learning, “feature selection”,
train and predict on public databases
- Can predict synonymous and intergenic variants
dbNSFP 3.0 – 82M precomputed scores
- N of 6 Voting on prediction algorithms
RNA Splicing Effect (dbscSNV)
- 5+ splice algorithms, can pre-compute
- Covers −3 to +8 at the 5’ splice site, −12 to +2 at the 3’ splice site
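The “N of 6” voting idea above can be sketched in a few lines of Python. This is a minimal illustration, not dbNSFP's own implementation: the boolean encoding of each call (True = damaging, None = no prediction available), the example values, and the `min_votes` threshold are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def count_votes(predictions: Dict[str, Optional[bool]]) -> Tuple[int, int]:
    """Return (damaging_votes, tools_that_made_a_call).

    Tools with no prediction (None) are left out of both counts.
    """
    calls = [call for call in predictions.values() if call is not None]
    return sum(calls), len(calls)


def passes_vote(predictions: Dict[str, Optional[bool]], min_votes: int = 4) -> bool:
    """True when at least `min_votes` tools call the variant damaging.

    The default threshold is a hypothetical choice, not a recommendation.
    """
    damaging, _ = count_votes(predictions)
    return damaging >= min_votes


# Made-up calls for one missense variant; LRT has no prediction here.
example = {
    "SIFT": True,
    "PolyPhen2": True,
    "LRT": None,
    "MutationTaster": True,
    "MutationAssessor": False,
    "FATHMM": True,
}

print(count_votes(example))  # (4, 5)
print(passes_vote(example))  # True
```

Reporting the denominator alongside the vote count matters because not every tool scores every variant, so “4 of 5” and “4 of 6” are different statements. Since the tools are correlated, the count should not be read as independent lines of evidence.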
12. Disease Databases
ClinVar
- Voluntary submissions from labs
- Use 5-tier classification (variant + phenotype
pairs)
- Star-rating of variants
- Lab owns submission, can revoke and
monitor status
ClinVitae (Invitae curated, not updated)
OMIM
- Gene to Phenotype documentation
- Expertly curated, hand updated
- Changes dynamically
- Small list of cited / implicated variants
HGMD
- Commercially supported
- Best linkage of (possible) publication to
variant/genes
- Classifications not directly trusted
Your own Lab (more later)
13. Web-Based Annotation Tools
NCBI Variant Reporter
- HGVS Annotation
- PubMed, ClinVar links
SeattleSeq
- NHLBI supported
- Some public annotations
Ensembl VEP
- Same as running VEP locally
Scripps Genome ADVISER
- Out-of-date annotations
- Scripps Wellderly Frequencies
- Splice Site Predictions
- Basic Java GUI for filtering
Mutalyzer – HGVS only
14. Variant Annotation Tools
snpEff
- Open source, commercial use allowed
- Tx Annotation, HGVS output
- Limited public annotations
ANNOVAR
- Academic/Commercial split
- Many public annotations
- Non-standard Tx prioritization
Ensembl VEP
- Ensembl tx only, HGVS output
- Limited public annotations
VarSeq
- Commercially supported
- Largest public annotation repo
- RefSeq/Ensembl tx, HGVS
- Clinical Tx, many export formats
- Integrated data transformations
16. RefSeqGenes – mRNA sequence archive, with mappings to genomes
- Provided mappings to Locus Reference Genomic (LRG) database
- Use genome mappings by NCBI (through genome annotation builds). NOT UCSC
- “Clinically Relevant” metric:
- LRG if available
- Longest if tied
Ensembl – defined directly against the human genome
- More inclusive of genes discovered with high-throughput methods
- Gencode subset – similar to RefSeqGenes in size / definition
Each has unique Accessions and Version Numbers
- Newer releases GRCh38
- GRCh37 mappings not being updated (unfortunately)
17. Reference Sequence Versus Gene Sequence
EMG1 on GRCh37
“Gap” of the mRNA coding sequence versus reference seq:
Handled differently by 3 different “gene alignments”
18. Reference Sequence Versus Gene Sequence
EMG1 on GRCh38
Reference sequence patched, no gap
Alignments agree
20. RefSeq Accession Not Sufficient for Var-Tx Interaction
RefSeq defines transcripts as mRNA sequence
NCBI “Annotation Releases” (like v105) provides alignments using “Splign”
UCSC pulls RefSeq mRNA and aligns it itself using “BLAT”
They can choose equally valid but different alignments for the same accession
This alignment of NM_052814.3 places the exon at dramatically different loci.
Will result in different annotations of any variant overlapping these exons
21. Variant Representation and Normalization
Allelic Primitives
- AG/CT -> A/C & G/T
- AT/G -> A/- & T/G
- May have different annotations
Left Align
- NGS standard, not consistent
historically
- May be needed after primitives
- HGVS -> 3’ shift (right-shift for forward-strand transcripts)
Multi-Allelic (2 Non-Ref Alleles)
- Each non-ref has own annotations
- Pop level should be “split” for counts
HGVS, Transcript Projection
- Dependent on Tx->Genome Mapping
- hgvs-eval: Benchmarking tool in
progress
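Two of the normalization steps listed above can be sketched as follows: splitting a multi-allelic site into one record per non-reference allele (carrying the per-allele counts along), and decomposing an equal-length substitution such as AG/CT into allelic primitives. The record layout (`alts`, `ac`, `an` keys, loosely modeled on the VCF AC/AN fields) and the function names are assumptions for illustration. Unequal-length cases such as AT/G need an alignment step and are deliberately not handled in this sketch.

```python
from typing import Dict, List, Tuple


def split_multiallelic(record: Dict) -> List[Dict]:
    """Split one multi-allelic record into one bi-allelic record per ALT.

    `ac` holds one allele count per ALT; `an` is the total number of
    called alleles at the site and is shared by every split record.
    """
    split = []
    for alt, ac in zip(record["alts"], record["ac"]):
        split.append({
            "chrom": record["chrom"],
            "pos": record["pos"],
            "ref": record["ref"],
            "alt": alt,
            "ac": ac,
            "an": record["an"],
            "af": ac / record["an"] if record["an"] else 0.0,
        })
    return split


def decompose_primitives(pos: int, ref: str, alt: str) -> List[Tuple[int, str, str]]:
    """Decompose an equal-length substitution into single-base changes."""
    if len(ref) != len(alt):
        raise ValueError("unequal-length alleles need an alignment step")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


# Made-up site: reference G with alternates T and C.
site = {"chrom": "1", "pos": 100, "ref": "G",
        "alts": ["T", "C"], "ac": [3, 1], "an": 200}

for rec in split_multiallelic(site):
    print(rec["ref"], rec["alt"], rec["ac"], rec["af"])

print(decompose_primitives(100, "AG", "CT"))
# [(100, 'A', 'C'), (101, 'G', 'T')]
```

After either step the resulting alleles may still carry shared flanking bases, which is why left-alignment may be needed afterwards.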
22. Left-Align Annotations
Using a Smith-Waterman algorithm to left-align variants from public databases shows non-obvious differences
NGS alignment and variant calling are always left-aligned
Left-align your databases so they can be annotated
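To make “left-align” concrete, here is a minimal sketch. It does not use Smith-Waterman as the slide describes; it swaps in the simpler trim-and-extend procedure commonly used for VCF normalization, which gives the left-most, minimal representation of a single indel. The function name, the 1-based `pos` convention, and the in-memory `ref_seq` string are assumptions for the example.

```python
from typing import Tuple


def left_align(ref_seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left-align and trim one REF/ALT pair against a reference sequence.

    `pos` is the 1-based position of the first base of `ref` in `ref_seq`.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")

    changed = True
    while changed:
        changed = False
        # Drop a shared right-most base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGGCACACACAGGG"
# The same CA deletion written at two different places in the repeat:
print(left_align(reference, 8, "CAC", "C"))  # (3, 'GCA', 'G')
print(left_align(reference, 5, "ACA", "A"))  # (3, 'GCA', 'G')
```

Both inputs collapse to the same record, which is the point: two databases can describe one deletion with different coordinates, and a plain position match would miss it until both are normalized.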
26. Multi Allelic
The Supercentenarian annotation found records for both alternates, and looks
like this:
Trio Analysis, Variant is a G/T/C (Reference G, Alternates of T/C):
27. Variant Warehouse
"Clinical laboratories
should implement an internal system to track all sequence variants
identified in each gene and clinical assertions when reported.
This is important for tracking genotype–phenotype correlations and
the frequency of variants in affected and normal populations."
28. Why Warehouse?
A place to archive full VCFs of every
sequenced sample (by assay/test)
Query and retrieve subsets of data
at any time
Ask the Variant Warehouse:
- Have I ever seen this variant in my
previous test samples?
- At what frequency? (counts as well)
- Does this gene contain other rare variants
in my cohort?
- Did I provide a pathogenicity assessment
for this variant? Has that changed?
- Has ClinVar changed since that
assessment was initially made?
- Have I put this variant into a clinical report
for any previous samples?
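The first two questions (“have I seen this variant?” and “at what frequency?”) can be sketched with an in-memory SQLite database. The two-table schema, the column names, and the `variant_history` helper are hypothetical, the coordinates are made up, and the frequency assumes diploid calls. A real warehouse would also store assessments, report history, and annotation-source versions to answer the remaining questions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE samples (sample_id TEXT PRIMARY KEY, assay TEXT NOT NULL);
CREATE TABLE calls (
    sample_id  TEXT NOT NULL REFERENCES samples(sample_id),
    chrom      TEXT NOT NULL,
    pos        INTEGER NOT NULL,
    ref        TEXT NOT NULL,
    alt        TEXT NOT NULL,
    alt_copies INTEGER NOT NULL  -- 1 = heterozygous, 2 = homozygous
);
"""


def variant_history(conn, assay, chrom, pos, ref, alt):
    """Count carriers and alternate alleles for one variant within one assay."""
    total = conn.execute(
        "SELECT COUNT(*) FROM samples WHERE assay = ?", (assay,)
    ).fetchone()[0]
    carriers, alt_alleles = conn.execute(
        """
        SELECT COUNT(DISTINCT c.sample_id), COALESCE(SUM(c.alt_copies), 0)
        FROM calls AS c
        JOIN samples AS s ON s.sample_id = c.sample_id
        WHERE s.assay = ? AND c.chrom = ? AND c.pos = ?
          AND c.ref = ? AND c.alt = ?
        """,
        (assay, chrom, pos, ref, alt),
    ).fetchone()
    # Diploid assumption: every sample contributes two alleles.
    frequency = alt_alleles / (2 * total) if total else 0.0
    return {"samples": total, "carriers": carriers,
            "alt_alleles": alt_alleles, "frequency": frequency}


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("S1", "exome"), ("S2", "exome"), ("S3", "exome"), ("S4", "panel")],
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
    [("S1", "21", 100, "G", "C", 1),
     ("S2", "21", 100, "G", "C", 2),
     ("S4", "21", 100, "G", "C", 1)],
)

print(variant_history(conn, "exome", "21", 100, "G", "C"))
# {'samples': 3, 'carriers': 2, 'alt_alleles': 3, 'frequency': 0.5}
```

Keying the lookup on assay keeps the denominator honest: a sample sequenced on a panel that never covered the site should not count as a non-carrier. Exact-match queries like this only work if variants are stored left-aligned and split per allele.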
30. NM_002626.4:c.1877G>C in PFKL
NP_002617.3:p.Arg626Pro missense mutation
Predicted damaging by 4/5 functional predictions
VEST3: 0.948, GERP++: 4.59
ExAC and 1kG have a G>A, but G>C is novel
Variants in region are extremely rare (G>C ExAC 4 of 122,364 alleles) – 0.003%
No ClinVar variants for gene
OMIM entry has no known disease association
PubMed search shows few recent articles; the most recent, a 1998 paper, showed:
- phosphofructokinase (PFKL) overexpressed in Down syndrome (DS)
- Transgenic PFKL mice had an abnormal glucose metabolism with reduced clearance
rate from blood and enhanced metabolic rate in brain.