The Clinical Significance of Transcript Alignment Discrepancies … and tools to help you deal with them - Reece Hart

  1. Ⓒ 2014 Invitae
     Reece Hart, Ph.D.
     reece@invitae.com
     Human Variome Project Meeting 2014, Paris
     The Clinical Significance of Transcript Alignment Discrepancies … and tools to help you deal with them.
  2. The fidelity of transcript-genome mapping matters.
     Variants are identified and processed in genome coordinates.
     Variants are analyzed and communicated using transcript coordinates.
     Mapping runs both ways: genome to transcript (g. to c.) and transcript to genome (c. to g.).
  3. Motivation 1: Discordant exon coordinates
     NCBI and UCSC report different coordinates for NM_052813.3, exon 12.
     Figure: UCSC (BLAT) and NCBI (Splign) alignments; exon 12 displaced 322 nt.
     Consequences:
     1. An assay that targets the wrong genomic region will generate uninformative sequence data.
     2. A genomic variant will be interpreted as exonic when it is intronic, or vice versa.
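The second consequence above can be illustrated with a minimal sketch. The exon coordinates here are hypothetical (they are not the real NM_052813.3 alignments); only the 322 nt displacement is taken from the slide. The same genomic position is classified against two exon sets that differ in one exon.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # genomic interval, 0-based, half-open [start, end)


def classify(pos: int, exons: List[Exon]) -> str:
    """Classify a genomic position relative to one exon set."""
    if any(start <= pos < end for start, end in exons):
        return "exonic"
    first_start = min(start for start, _ in exons)
    last_end = max(end for _, end in exons)
    if first_start <= pos < last_end:
        return "intronic"
    return "outside"


# Hypothetical exon sets for one transcript from two sources.
# The middle exon is displaced by 322 nt in source_b.
source_a = [(1000, 1100), (2000, 2150), (5000, 5200)]
source_b = [(1000, 1100), (2322, 2472), (5000, 5200)]

for pos in (2100, 2400):
    print(pos, classify(pos, source_a), classify(pos, source_b))
```

With these made-up coordinates, each of the two positions is exonic under one source and intronic under the other, which is the kind of disagreement the slide warns about.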
  4. Motivation 2: indels confound mapping
     NM_006158.3 (NEFL) contains an indel in the CDS.
  5. Challenges and Solutions in Transcript Management
     ➢ Biological
       ● Alternative splicing
       ● Paralogs
       ● Natural polymorphisms
       ● Alternative references
     ➢ Technical / Logistical
       ● Multiple transcript sources
       ● Multiple alignment methods
       ● Multiple references
       ● Genome-transcript sequence differences
       ● Historical transcript alignments
     ➢ Existing resources
       ● RefSeq, UCSC, Ensembl
       ● Locus Reference Genomic
       ● Mutalyzer
     ➢ See also
       ● McCarthy DJ, et al. Genome Medicine 6:26 (2014).
       ● Garla V, et al. Bioinformatics 27(3): 416–8 (2010).
  6. Universal Transcript Archive (UTA)
     ➢ Single database of:
       ● Multiple transcripts and versions
       ● … from multiple sources
       ● … aligned to multiple references
       ● … by multiple alignment methods
     ➢ Freely available!
       ● Apache licensed
       ● Public PostgreSQL database instance at uta.invitae.com:5432
       ● Local installation instructions
       ● Code at http://bitbucket.org/invitae/uta/
  7. Our Bermuda Triangle
     Figure: triangle with vertices RefSeq (NM), Ensembl (ENST), and Genome (GRCh37).
     RefAgree: Do transcript and genome sequences agree?
     Transcript Equivalence: Which RefSeq and Ensembl transcripts are equivalent?
     Markers: ➊ SNV, ➋ Indel, ➌, ➍ Historical Transcripts
  8. Universal Transcript Archive (UTA)
     Multiple sources, multiple versions, multiple alignment methods in one database

       transcript    reference     method
       NM_01234.4    NM_01234.4    self
       NM_01234.4    NC_000012.3   splign
       NM_01234.5    NM_01234.5    self
       NM_01234.5    NC_000012.3   splign
       NM_01234.5    AC_45678.9    splign
       NM_01234.5    NC_000012.3   blat
       ENST012345    ENST012345    self
       ENST012345    NC_000012.3   genebuild

     (figure labels: exons, exon set)
  9. Universal Transcript Archive (UTA)
     Multiple sources, multiple versions, multiple alignment methods in one database
     Exon alignments:

       transcript    reference     exon   alignment
       NM_01234.4    NC_000012.3   0      50=
       NM_01234.4    NC_000012.3   1      100=1X49=   ➊
       NM_01234.4    NC_000012.3   2      5=1I44=     ➋

     Alignments use coordinates from source databases.
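The exon alignment strings above look like extended CIGAR strings. A minimal sketch of how such strings could be summarized follows; it assumes the usual operator meanings (`=` match, `X` mismatch, `I` insertion, `D` deletion), and the function names are illustrative, not part of UTA. The slides do not say which sequence an `I` is relative to, so the sketch only counts operations.

```python
import re
from collections import Counter

_CIGAR_FULL = re.compile(r"(?:\d+[=XID])+")
_CIGAR_OP = re.compile(r"(\d+)([=XID])")


def cigar_counts(cigar: str) -> Counter:
    """Total length per CIGAR operator, e.g. '100=1X49=' -> {'=': 149, 'X': 1}."""
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    counts: Counter = Counter()
    for length, op in _CIGAR_OP.findall(cigar):
        counts[op] += int(length)
    return counts


def agrees_with_reference(cigar: str) -> bool:
    """True when the exon aligns with matches only (no X, I, or D)."""
    counts = cigar_counts(cigar)
    return counts["X"] == 0 and counts["I"] == 0 and counts["D"] == 0


for cigar in ("50=", "100=1X49=", "5=1I44="):
    print(cigar, dict(cigar_counts(cigar)), agrees_with_reference(cigar))
```

Under these assumptions, exon 0 agrees with the reference, exon 1 carries a single-base mismatch (the SNV case), and exon 2 carries a one-base insertion (the indel case).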
  12. “RefAgree” Statistics by Protein Coding Transcript
     Sequence concordance between RefSeq and GRCh37 primary assembly ➊➋
     cf. Garla V, et al. Bioinformatics 27(3): 416–8 (2010).
     34531 NM transcripts (Jan 2014)

       count   fraction   discrepancy
       760     0.2%       with length discrepancies
       3481    10%        with substitutions
       321     0.9%       with deletions
       255     0.7%       with insertions
  13. NCBI (Splign) v. UCSC (BLAT) Alignment Statistics
     Splign and BLAT provide significantly different exon structures for 886 transcripts. ➌
     32358 transcripts w/exon structures
     Are Splign and BLAT similar?
       Y: 31472 (97.3%) transcripts
       N: 886 (2.7%) transcripts
     “similar” means either 1) identical exon coordinates, or 2) coordinates that differ only by short 3' terminal artifacts
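The "similar" criterion above can be sketched as a small predicate. This is one possible reading, not the analysis code behind the slide: exon sets are assumed to be plus-strand and sorted, so the 3' terminus is the end of the last exon, and the `max_tail` threshold is a hypothetical value because the slide does not say how short "short" is.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # genomic interval, 0-based, half-open [start, end)


def similar(a: List[Exon], b: List[Exon], max_tail: int = 10) -> bool:
    """True if exon sets are identical or differ only at the 3' terminus.

    Assumes plus-strand transcripts with exons sorted by start.
    max_tail is a hypothetical cutoff for a "short" 3' terminal difference.
    """
    if a == b:
        return True
    if len(a) != len(b):
        return False
    if a[:-1] != b[:-1]:
        return False  # an internal exon boundary differs
    (a_start, a_end), (b_start, b_end) = a[-1], b[-1]
    return a_start == b_start and abs(a_end - b_end) <= max_tail


splign = [(0, 100), (200, 300)]
print(similar(splign, [(0, 100), (200, 300)]))  # identical
print(similar(splign, [(0, 100), (200, 304)]))  # 4 nt 3' terminal difference
print(similar(splign, [(0, 100), (522, 622)]))  # displaced exon
```

A minus-strand transcript would need the comparison applied to the start of the first exon instead.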
  14. Characterization of transcript discrepancies
     Whether alignments provided by NCBI and UCSC agree with GRCh37 primary sequence.
     886 transcripts with significant discrepancies

       Splign / BLAT    T      F
       T                14     18
       F                545    311
  15. Characterization of transcript discrepancies
     Reference agreement (blue) and alignment “simplicity” (green)
     886 transcripts with significant discrepancies
     Figure: Splign / BLAT agreement table (T: 14, 18; F: 545, 311) with one breakdown panel per cell:
       T: 200 (0), 4 (97); F: 90 (82), 16 (84)
       T: 6 (41), 12 (180); F:
       T: 434 (7); F: 110 (652)
       T: 14 (11); F:
  16. Summary of Splign-BLAT gene-wise coordinate deltas
     delta ≝ minimum per gene of maximum per transcript of difference of exon coordinates between NCBI and UCSC.

       delta     # genes   # ACMG must report
       =0        15206     44
       >=1       183       8    LDLR, MYL2, PRKAG2, SDHB, SDHC, TGFBR1, TGFBR2, WT1
       >=10      116       0
       >=25      6         0
       >=50      5         0
       >=250     13        0
       >=1000    94        2    MYH7, TNNI3 (all trivial diffs)
       ND                  3    APOB, MYBPC3, NTRK1
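The delta definition above (minimum per gene of the maximum per transcript) can be written out as a short sketch. The data layout, the function names, and the handling of transcripts whose exon counts differ between sources (returned as `None`, loosely mirroring the ND row) are assumptions for illustration; the slides do not describe how ND was assigned.

```python
from typing import Dict, List, Optional, Tuple

Exon = Tuple[int, int]  # genomic (start, end)
ExonPair = Tuple[List[Exon], List[Exon]]  # (NCBI exons, UCSC exons)


def transcript_delta(ncbi: List[Exon], ucsc: List[Exon]) -> Optional[int]:
    """Largest absolute difference over all exon starts and ends.

    Returns None when exon counts differ (assumed to mean "not determined").
    """
    if len(ncbi) != len(ucsc):
        return None
    return max(
        (max(abs(ns - us), abs(ne - ue)) for (ns, ne), (us, ue) in zip(ncbi, ucsc)),
        default=0,
    )


def gene_delta(transcripts: Dict[str, ExonPair]) -> Optional[int]:
    """Minimum over a gene's transcripts of the per-transcript delta."""
    deltas = [
        d
        for d in (transcript_delta(n, u) for n, u in transcripts.values())
        if d is not None
    ]
    return min(deltas) if deltas else None


# Hypothetical gene with two transcripts: one concordant, one with a displaced exon.
gene = {
    "tx1": ([(0, 100), (200, 300)], [(0, 100), (200, 300)]),
    "tx2": ([(0, 100), (200, 300)], [(0, 100), (522, 622)]),
}
print(gene_delta(gene))
```

Taking the minimum per gene means a gene counts as discordant only when every one of its transcripts is discordant, which makes the summary a conservative estimate.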
  17. HGVS Python Package
     http://bitbucket.org/invitae/hgvs/
     ➢ Parser
       ● HGVS → Python object
       ● Based on a Parsing Expression Grammar
     ➢ Formatter
       ● Python object → HGVS
     ➢ Validator
       ● intrinsic & extrinsic validation
     ➢ Mapping tools (indel-aware!)
       ● g. ↔ c. → p. (m, n, r also supported)
       ● transcript-to-transcript liftover
       ● uses UTA data
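The parser and formatter roles above amount to a round trip between an HGVS string and a structured object. The sketch below illustrates that idea for simple substitutions only. It is not the hgvs package: the package uses a Parsing Expression Grammar and covers far more of the nomenclature, whereas this uses a single regular expression, and the `Substitution` class and `parse_substitution` function are hypothetical names.

```python
import re
from dataclasses import dataclass

# Accession, coordinate type, position (with optional intronic offset), ref > alt.
_SUB = re.compile(
    r"^(?P<ac>[A-Z]{2}_\d+\.\d+):(?P<type>[cgmn])\."
    r"(?P<pos>\d+(?:[+-]\d+)?)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


@dataclass(frozen=True)
class Substitution:
    ac: str    # sequence accession, e.g. NM_182763.2
    type: str  # coordinate type: c, g, m, or n
    pos: str   # position, kept as text so offsets like 688+403 survive
    ref: str
    alt: str

    def __str__(self) -> str:
        """Format back to an HGVS-style string."""
        return f"{self.ac}:{self.type}.{self.pos}{self.ref}>{self.alt}"


def parse_substitution(text: str) -> Substitution:
    """Parse a simple substitution; raise ValueError for anything else."""
    match = _SUB.match(text)
    if match is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    return Substitution(**match.groupdict())


variant = parse_substitution("NM_182763.2:c.688+403C>T")
print(variant.ac, variant.type, variant.pos, variant.ref, variant.alt)
print(str(variant))
```

Even this toy version shows why a grammar helps with validation: a string that omits the coordinate type prefix fails to parse instead of being silently accepted.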
  18. Example: Variant liftover between transcripts
     Map from ➀ NM_182763.2:c.688+403C>T
     to ➁ NC_000001.10:g.150550916G>A
     to ➂ NM_001197320.1:c.281C>T
     with Splign alignments.
     Figure labels: NM_001197320.1 (NP_001184249.1), NM_182763.2 (NP_877495.1), NC_000001.10
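The liftover above goes transcript to genome to transcript. A much-simplified sketch of that two-step projection follows. It uses hypothetical exon coordinates, 0-based offsets along the spliced transcript instead of c. numbering, and gap-free alignments, so it handles neither intronic positions such as c.688+403 nor indels; the hgvs package uses UTA alignments for those cases.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # genomic interval, 0-based, half-open, sorted by start


def tx_to_genome(offset: int, exons: List[Exon], strand: int) -> int:
    """Project a 0-based spliced-transcript offset onto the genome."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    ordered = exons if strand == 1 else exons[::-1]  # 5'->3' exon order
    for start, end in ordered:
        length = end - start
        if offset < length:
            return start + offset if strand == 1 else end - 1 - offset
        offset -= length
    raise ValueError("offset beyond transcript end")


def genome_to_tx(pos: int, exons: List[Exon], strand: int) -> Optional[int]:
    """Project a genomic position onto a transcript; None if not exonic."""
    ordered = exons if strand == 1 else exons[::-1]
    consumed = 0
    for start, end in ordered:
        if start <= pos < end:
            return consumed + (pos - start if strand == 1 else end - 1 - pos)
        consumed += end - start
    return None


def liftover(offset: int, src: List[Exon], dst: List[Exon], strand: int) -> Optional[int]:
    """Map an offset in one transcript to the other via the genome."""
    return genome_to_tx(tx_to_genome(offset, src, strand), dst, strand)


# Two hypothetical minus-strand transcripts of the same gene.
tx_a = [(100, 200), (300, 400)]
tx_b = [(150, 200), (300, 400), (500, 550)]
for offset in (0, 120, 180):
    print(offset, tx_to_genome(offset, tx_a, -1), liftover(offset, tx_a, tx_b, -1))
```

The last offset falls in a part of `tx_a` that `tx_b` does not cover, so the liftover returns `None`; a real tool would report that as an intronic or upstream position in the target transcript.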
  19. Developer Info
     Testing
     ➢ 91% code coverage
     ➢ 25665 test variants
       ● ~200 hand curated, rest from dbSNP
       ● 23436 sub, 1254 del, 908 ins, 45 delins, 22 dup
       ● 44 distinct transcripts, many selected for difficulty
     Upcoming issues (all issues are publicly readable)
     ➢ multi-variant alleles
     ➢ release LRG
     ➢ GRCh38
     ➢ API changes
  20. Acknowledgements
     ➢ Vince Fusaro
     ➢ John Garcia
     ➢ Emily Hare
     ➢ Kevin Jacobs
     ➢ Geoff Nilsen
     ➢ Rudy Rico
     ➢ Jody Westbrook
     http://bitbucket.com/invitae/
     ➢ Code (Python)
     ➢ Documentation & Examples
     ➢ Issues
     ➢ BED files
     ➢ Code testing is public
     Or just: pip install hgvs
  22. UTA solves four issues with transcript management.
     ➊ SNV and ➋ InDel: Transcript ≠ Genome Reference
     ➌ Exon coordinate differences between sources for same accession
     ➍ Historical transcript alignments no longer available
     Figure labels: RefSeq NM_01234.4, RefSeq NM_01234.5, UCSC NM_01234.5, Genome Reference
  23. ENSTs equivalent with NMs

     => select N.hgnc, N.es_fingerprint, N.tx_ac, E.tx_ac
        from uta_20140210.tx_exon_set_summary_mv N
        join uta_20140210.tx_exon_set_summary_mv E
          on N.es_fingerprint = E.es_fingerprint
          and N.tx_ac ~ '^NM_' and E.tx_ac ~ '^ENST'
          and N.alt_aln_method = 'transcript'
          and E.alt_aln_method = 'transcript';

     ┌─────────┬──────────────────────────────────┬────────────────┬─────────────────┐
     │  hgnc   │          es_fingerprint          │     tx_ac      │      tx_ac      │
     ├─────────┼──────────────────────────────────┼────────────────┼─────────────────┤
     │ AFF2    │ db0e20be1a2bb687c33227d2e6bf9d53 │ NM_002025.3    │ ENST00000370460 │
     │ UBE3A   │ d1eace7da295c45378fa5f898f2f03f6 │ NM_130838.1    │ ENST00000438097 │
     │ ANXA8L1 │ 1f6fd4f3fe9854aa468489ec7f507512 │ NM_001098845.1 │ ENST00000359178 │
     │ APOL5   │ 939a9e9e4a46ef9aef862cf9b369afe6 │ NM_030642.1    │ ENST00000249044 │
     │ ARID4B  │ 524fc954d10b08a4014e86aee81d0358 │ NM_016374.5    │ ENST00000264183 │
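The query above pairs NM and ENST accessions that share an exon set fingerprint. The idea can be sketched in a few lines. How UTA actually derives `es_fingerprint` is not stated in the slides; the MD5-of-coordinates scheme here is an assumption suggested only by the 32 hex digit values, and the accessions and exon coordinates are made up.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # exon (start, end) in transcript coordinates


def exon_set_fingerprint(exons: List[Exon]) -> str:
    """Hash a canonical text form of the exon coordinates (assumed scheme)."""
    canonical = ";".join(f"{start},{end}" for start, end in sorted(exons))
    return hashlib.md5(canonical.encode("ascii")).hexdigest()


def equivalent_pairs(exon_sets: Dict[str, List[Exon]]) -> List[Tuple[str, str]]:
    """Pair NM_ and ENST accessions whose exon sets have the same fingerprint."""
    by_fingerprint: Dict[str, List[str]] = defaultdict(list)
    for accession, exons in exon_sets.items():
        by_fingerprint[exon_set_fingerprint(exons)].append(accession)
    pairs = []
    for accessions in by_fingerprint.values():
        nms = sorted(a for a in accessions if a.startswith("NM_"))
        ensts = sorted(a for a in accessions if a.startswith("ENST"))
        pairs.extend((nm, enst) for nm in nms for enst in ensts)
    return sorted(pairs)


# Hypothetical accessions and exon structures.
exon_sets = {
    "NM_EXAMPLE1.1": [(0, 120), (120, 300), (300, 450)],
    "ENST_EXAMPLE1": [(0, 120), (120, 300), (300, 450)],
    "NM_EXAMPLE2.1": [(0, 90), (90, 400)],
    "ENST_EXAMPLE2": [(0, 95), (95, 400)],
}
print(equivalent_pairs(exon_sets))
```

Comparing fingerprints instead of full exon lists turns the equivalence test into a single equality join, which is what the SQL above relies on.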