Clinical significance of transcript alignment discrepancies gne - 20141016

The Clinical Significance of Transcript
Alignment Discrepancies
… and tools to help you deal with them.
Reece Hart, Ph.D.
rhart@23andme.com
Genentech
2014-10-16
Available on SlideShare (http://www.slideshare.net/reecehart)

The fidelity of transcript-genome mapping matters.
Variants are identified and computed on in genome coordinates.
Variants are analyzed and communicated using transcript coordinates.
genome to transcript (g. to c.)
transcript to genome (c. to g.)

Motivation 1: Discordant exon coordinates
NCBI and UCSC report different coordinates for CARD9, NM_052813.3, exon 12
UCSC (BLAT) vs. NCBI (Splign): exon 12 displaced 322 nt
Consequences:
1. An assay that targets the wrong genomic region will generate
uninformative sequence data.
2. A genomic variant will be interpreted as exonic when it is
intronic, or vice versa.
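
The second consequence can be illustrated with a minimal sketch. The coordinates below are invented for illustration (they are not the CARD9 coordinates), and `classify` is a hypothetical helper, not part of any tool named in these slides. It shows how the same genomic position can be called exonic under one source's exon structure and intronic under another's when the last exon is displaced by 322 nt.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based half-open genomic interval


def classify(position: int, exons: List[Exon]) -> str:
    """Return 'exonic', 'intronic', or 'outside' for a genomic position."""
    if not exons:
        return "outside"
    tx_start = min(start for start, _ in exons)
    tx_end = max(end for _, end in exons)
    if not (tx_start <= position < tx_end):
        return "outside"
    if any(start <= position < end for start, end in exons):
        return "exonic"
    return "intronic"


# Hypothetical exon structures for one accession from two sources.
# The last exon is displaced by 322 nt between them.
source_a = [(1000, 1200), (2000, 2150), (3000, 3400)]
source_b = [(1000, 1200), (2000, 2150), (3322, 3722)]

variant_pos = 3100
print(classify(variant_pos, source_a))  # exonic
print(classify(variant_pos, source_b))  # intronic
```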

Motivation 2: Indels confound mapping
NM_006158.3 (NEFL) contains an indel in CDS
Deletion justified differently!

Motivation 3: Data management challenges
➢ Mutable data (!)
➢ Sporadic failures
➢ Inconsistent data from a single source
➢ Inconsistent data across sources
➢ Opaque and implicit data definitions
➢ Historical alignment data not available
Source   AC            Reference     exons
EUtils   NM_005168.3   GRCh37.p10    1146 / 125 / 320 / 1998
         NM_005168.4   NG_008492.1   1398 / 125 / 320 / 1998
seqgene  NM_005168.3   GRCh37.p10    102 / 1046 / 125 / 321 / 143 / 1855
UCSC     NM_005168.4   hg19          1398 / 135 / 244 / 76 / 1997

Motivation 4: Use Ensembl for Variant Effect Prediction
RefAgree: Do transcript and genome sequences agree?
Transcript Equivalence: Which RefSeq and Ensembl transcripts are equivalent?
[Diagram: RefSeq (NM), Ensembl (ENST), UCSC (NM), and LRG, BIC, … transcripts related to the Genome (GRCh37); differences marked ➊ SNV, ➋ Indel, ➌, ➍ Historical Transcripts]

Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011).
MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions.
Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658

Challenges and Solutions in Transcript Management
➢ Biological
● Alternative splicing
● Paralogs
● Natural polymorphisms
● Alternative references
➢ Technical / Logistical
● Multiple transcript sources
● Multiple alignment methods
● Multiple references
● Genome-transcript sequence differences
● Historical transcript alignments
➢ Existing resources
● RefSeq, UCSC, Ensembl
● Locus Reference Genomic
● Mutalyzer
➢ See also
● McCarthy DJ, et al. Genome Medicine 6:26 (2014).
● Garla V, et al. Bioinformatics 27(3): 416–8 (2011).

Part 1
The Universal Transcript Archive

UTA solves four issues with transcript management.
➊ SNV: Transcript ≠ Genome Reference (e.g., A vs. T)
➋ InDel: Transcript ≠ Genome Reference
➌ Exon coordinate differences between sources for same accession (e.g., RefSeq NM_01234.5 vs. UCSC NM_01234.5)
➍ Historical transcript alignments no longer available (e.g., RefSeq NM_01234.4)

Universal Transcript Archive (UTA)
Multiple sources, multiple versions, multiple alignment methods in one database
exon set (each with its exons)
transcript   reference     method
NM_01234.4   NM_01234.4    self
NM_01234.4   NC_000012.3   splign
NM_01234.5   NM_01234.5    self
NM_01234.5   NC_000012.3   splign
NM_01234.5   AC_45678.9    splign
NM_01234.5   NC_000012.3   blat
ENST012345   ENST012345    self
ENST012345   NC_000012.3   genebuild
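
As a rough illustration of why keying exon sets by (transcript, reference, method) is useful, the sketch below stores the eight rows from the slide in memory and queries them. The names `ExonSetKey` and `methods_for` are hypothetical and are not the UTA schema or API; the accessions are the placeholder values shown on the slide.

```python
from typing import List, NamedTuple


class ExonSetKey(NamedTuple):
    """Identifies one exon set: a transcript aligned to a reference by a method."""
    transcript: str
    reference: str
    method: str


# The eight example rows from the slide (placeholder accessions).
EXON_SETS = [
    ExonSetKey("NM_01234.4", "NM_01234.4", "self"),
    ExonSetKey("NM_01234.4", "NC_000012.3", "splign"),
    ExonSetKey("NM_01234.5", "NM_01234.5", "self"),
    ExonSetKey("NM_01234.5", "NC_000012.3", "splign"),
    ExonSetKey("NM_01234.5", "AC_45678.9", "splign"),
    ExonSetKey("NM_01234.5", "NC_000012.3", "blat"),
    ExonSetKey("ENST012345", "ENST012345", "self"),
    ExonSetKey("ENST012345", "NC_000012.3", "genebuild"),
]


def methods_for(transcript: str, reference: str) -> List[str]:
    """List the alignment methods available for a transcript/reference pair."""
    return sorted(
        key.method
        for key in EXON_SETS
        if key.transcript == transcript and key.reference == reference
    )


# Two methods for the same pair means their exon coordinates can be compared.
print(methods_for("NM_01234.5", "NC_000012.3"))  # ['blat', 'splign']
# An older transcript version remains queryable next to the newer one.
print(methods_for("NM_01234.4", "NC_000012.3"))  # ['splign']
```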

exon alignments ➊➋
transcript   reference     exon   cigar
NM_01234.4   NC_000012.3   0      50=
NM_01234.4   NC_000012.3   1      100=1X49=
NM_01234.4   NC_000012.3   2      5=1I44=
Alignments use coordinates from source databases.
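
A minimal sketch of how per-exon alignment strings like these could be summarized. It assumes the extended-CIGAR convention in which `=` is a match, `X` a mismatch, and `I`/`D` an insertion or deletion; the function names are hypothetical and this is not UTA code.

```python
import re
from collections import Counter

# One CIGAR element: a run length followed by an operation code.
_CIGAR_ELEMENT = re.compile(r"(\d+)([=XID])")


def cigar_summary(cigar: str) -> Counter:
    """Count aligned positions per operation; reject malformed strings."""
    counts: Counter = Counter()
    pos = 0
    for match in _CIGAR_ELEMENT.finditer(cigar):
        if match.start() != pos:
            raise ValueError(f"unexpected text in CIGAR at offset {pos}: {cigar!r}")
        counts[match.group(2)] += int(match.group(1))
        pos = match.end()
    if pos != len(cigar) or not counts:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return counts


def agrees_with_reference(cigar: str) -> bool:
    """True when the exon aligns with matches only (no X, I, or D)."""
    return set(cigar_summary(cigar)) == {"="}


for cigar in ("50=", "100=1X49=", "5=1I44="):
    print(cigar, dict(cigar_summary(cigar)), agrees_with_reference(cigar))
```

The three strings are the ones from the slide: the first exon matches throughout, the second carries one substitution (issue ➊), and the third carries a one-base indel (issue ➋).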

➌ The exon set table holds alignments of the same transcript to the same reference by different methods (e.g., NM_01234.5 on NC_000012.3 by splign and by blat).

➍ The exon set table retains alignments for earlier transcript versions (e.g., NM_01234.4 alongside NM_01234.5).

“RefAgree” Statistics by Protein Coding Transcript
Sequence concordance between RefSeq and GRCh37 primary assembly ➊➋
34531 NM transcripts (Jan 2014)
  760   0.2%   with length discrepancies
 3481   10%    with substitutions
  321   0.9%   with deletions
  255   0.7%   with insertions
c.f. Garla V, et al. Bioinformatics 27(3): 416–8 (2011).

Exon structures have unique fingerprints
Identifying ENST-NM equivalences with fingerprints
=> select N.hgnc,N.es_fingerprint,N.tx_ac,E.tx_ac
from uta_20140210.tx_exon_set_summary_mv N
join uta_20140210.tx_exon_set_summary_mv E
on N.es_fingerprint=E.es_fingerprint
and N.tx_ac ~ '^NM_' and E.tx_ac ~ '^ENST'
and N.alt_aln_method='transcript'
and E.alt_aln_method='transcript';
┌─────────┬──────────────────────────────────┬────────────────┬─────────────────┐
│ hgnc │ es_fingerprint │ tx_ac │ tx_ac │
├─────────┼──────────────────────────────────┼────────────────┼─────────────────┤
│ AFF2 │ db0e20be1a2bb687c33227d2e6bf9d53 │ NM_002025.3 │ ENST00000370460 │
│ UBE3A │ d1eace7da295c45378fa5f898f2f03f6 │ NM_130838.1 │ ENST00000438097 │
│ ANXA8L1 │ 1f6fd4f3fe9854aa468489ec7f507512 │ NM_001098845.1 │ ENST00000359178 │
│ APOL5 │ 939a9e9e4a46ef9aef862cf9b369afe6 │ NM_030642.1 │ ENST00000249044 │
│ ARID4B │ 524fc954d10b08a4014e86aee81d0358 │ NM_016374.5 │ ENST00000264183 │
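
The idea behind the fingerprint join can be sketched as follows. The slides do not say how `es_fingerprint` is computed; the 32-character hex values merely look like MD5 digests, so the canonical string and the use of MD5 below are assumptions, and the accessions and coordinates are invented.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates


def exon_set_fingerprint(exons: List[Exon]) -> str:
    """Hash a canonical, order-independent rendering of the exon coordinates."""
    canonical = ";".join(f"{start},{end}" for start, end in sorted(exons))
    return hashlib.md5(canonical.encode("ascii")).hexdigest()


def group_equivalent(exon_sets: Dict[str, List[Exon]]) -> List[List[str]]:
    """Group accessions whose exon structures share a fingerprint."""
    by_fingerprint = defaultdict(list)
    for accession, exons in exon_sets.items():
        by_fingerprint[exon_set_fingerprint(exons)].append(accession)
    return [sorted(acs) for acs in by_fingerprint.values() if len(acs) > 1]


# Hypothetical transcripts aligned to the same genome reference.
exon_sets = {
    "NM_0000001.1": [(100, 200), (300, 400)],
    "ENST00000000001": [(300, 400), (100, 200)],  # same exons, other order
    "NM_0000002.1": [(100, 200), (300, 450)],     # last exon differs
}
print(group_equivalent(exon_sets))
```

Comparing one short hash per transcript turns the equivalence search into an equality join, which is what the SQL above does.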

NCBI (Splign) v. UCSC (BLAT) Alignment Statistics ➌
Splign and BLAT provide significantly different exon structures for 886 transcripts
32358 transcripts w/exon structures
Are Splign and BLAT similar?
  Y: 31472 (97.3%) transcripts
  N: 886 (2.7%) transcripts
“similar” means either
1) identical exon coordinates, or
2) coordinates that differ only by short 3' terminal artifacts
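
One possible reading of this "similar" test, as a sketch. The slides do not define "short", so the `max_artifact` threshold of 20 nt is a made-up parameter, and treating a 3' terminal artifact as a difference only in the 3' end of the 3'-most exon is an assumption. The function name is hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, sorted ascending


def similar(a: List[Exon], b: List[Exon], strand: int, max_artifact: int = 20) -> bool:
    """True if exon sets are identical or differ only at the transcript 3' end."""
    if a == b:
        return True
    if len(a) != len(b) or not a:
        return False
    # In ascending genomic order the 3'-most exon is last on the plus
    # strand and first on the minus strand.
    last = len(a) - 1 if strand == 1 else 0
    for i, (exon_a, exon_b) in enumerate(zip(a, b)):
        if i != last and exon_a != exon_b:
            return False
    (start_a, end_a), (start_b, end_b) = a[last], b[last]
    if strand == 1:
        return start_a == start_b and abs(end_a - end_b) <= max_artifact
    return end_a == end_b and abs(start_a - start_b) <= max_artifact


base = [(100, 200), (300, 400)]
print(similar(base, [(100, 200), (300, 410)], strand=1))   # True: 10 nt at 3' end
print(similar(base, [(100, 200), (300, 722)], strand=1))   # False: 322 nt
print(similar(base, [(95, 200), (300, 400)], strand=-1))   # True: 3' end on minus strand
```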

Characterization of transcript discrepancies
Whether alignments provided by NCBI and UCSC agree with GRCh37 primary sequence.
886 transcripts with significant discrepancies
Splign BLAT
      T     F
T    14    18
F   545   311

Characterization of transcript discrepancies
Reference agreement (blue) and alignment “simplicity” (green)
886 transcripts with significant discrepancies
Splign BLAT
      T     F
T    14    18
F   545   311

Splign BLAT
      T          F
T     200 (0)    4 (97)
F     90 (82)    16 (84)

Splign BLAT
      T          F
T     6 (41)     12 (180)
F

Splign BLAT
      T F
T     434 (7)
F     110 (652)

Splign BLAT
      T F
T     14 (11)
F

ACMG “Must Report” Genes
Green, R. C., Berg, J. S., Grody, W. W., Kalia, S. S., Korf, B. R., Martin, C. L., … Biesecker, L. G. (2013).
ACMG recommendations for reporting of incidental findings in clinical exome and genome
sequencing. Genetics in Medicine : Official Journal of the American College of Medical Genetics,
15(7), 565–74. doi:10.1038/gim.2013.73

Summary of Splign-BLAT gene-wise coordinate deltas.
delta     # genes   # ACMG must report
=0        15206     45
>=1       183       8
>=10      116       0
>=25      6         0
>=50      5         0
>=250     13        0
>=1000    94        3
delta ≝ minimum per gene of maximum per transcript of difference of exon coordinates between NCBI and UCSC.
ACMG genes with delta >=1 (Identical Exon Structures, all trivial diffs): LDLR, MYL2, PRKAG2, SDHB, SDHC, TGFBR1, TGFBR2, WT1
ACMG genes with delta >=1000: MYBPC3, MYH7, TNNI3
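
The delta definition can be written out as a short sketch. It assumes that the NCBI and UCSC exon lists for a transcript have the same number of exons and are paired by index; the slides do not say how differing exon counts were handled, so that case raises an error here. Function names, accessions, and coordinates are all hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates


def transcript_delta(ncbi: List[Exon], ucsc: List[Exon]) -> int:
    """Maximum absolute difference over all paired exon start/end coordinates."""
    if len(ncbi) != len(ucsc):
        raise ValueError("exon counts differ; not covered by this sketch")
    return max(
        max(abs(n_start - u_start), abs(n_end - u_end))
        for (n_start, n_end), (u_start, u_end) in zip(ncbi, ucsc)
    )


def gene_delta(transcripts: Dict[str, Tuple[List[Exon], List[Exon]]]) -> int:
    """Minimum over a gene's transcripts of the per-transcript maximum."""
    return min(transcript_delta(ncbi, ucsc) for ncbi, ucsc in transcripts.values())


# Hypothetical gene: one transcript agrees exactly, one has a displaced exon.
gene_x = {
    "TX_A.1": ([(100, 200), (300, 400)], [(100, 200), (300, 400)]),
    "TX_B.1": ([(100, 200), (300, 400)], [(100, 200), (622, 722)]),
}
# Hypothetical gene where every transcript disagrees.
gene_y = {
    "TX_C.1": ([(100, 200), (300, 400)], [(100, 200), (622, 722)]),
    "TX_D.1": ([(100, 200), (300, 400)], [(100, 205), (300, 400)]),
}
print(gene_delta(gene_x))  # 0
print(gene_delta(gene_y))  # 5
```

Taking the minimum per gene means a gene scores 0 as soon as any one of its transcripts has identical coordinates in both sources, so a nonzero delta indicates that every transcript of the gene disagrees.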

Part 2
Using HGVS “Nomenclature”
(http://www.hgvs.org/mutnomen/)

HGVS Python Package
http://bitbucket.org/hgvs/hgvs/
➢ Parser
● HGVS → Python object
● Based on a Parsing Expression Grammar
➢ Formatter
● Python object → HGVS
➢ Validator
● intrinsic & extrinsic validation
➢ Mapping tools (indel-aware!)
● g. ↔ c. → p. (m,n,r also supported)
● transcript-to-transcript liftover
● uses UTA data
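
To make the parse/format/validate round trip concrete, here is a toy sketch using only the standard library. It is not the hgvs package and does not use its grammar: it handles only single-base substitutions on c., g., m., and n. variants with positive positions, and all names (`SimpleSub`, `parse_sub`, `format_sub`) are hypothetical.

```python
import re
from typing import NamedTuple


class SimpleSub(NamedTuple):
    ac: str      # accession, e.g. NM_182763.2
    type: str    # coordinate type: c, g, m, or n
    pos: int     # base position
    offset: int  # intronic offset (0 when absent)
    ref: str
    alt: str


_SUB_RE = re.compile(
    r"^(?P<ac>[A-Z]+_?\d+(?:\.\d+)?):"
    r"(?P<type>[cgmn])\."
    r"(?P<pos>\d+)(?P<offset>[+-]\d+)?"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_sub(text: str) -> SimpleSub:
    """HGVS string -> Python object, with a basic intrinsic check."""
    m = _SUB_RE.match(text)
    if m is None:
        raise ValueError(f"not a supported substitution: {text!r}")
    if m["ref"] == m["alt"]:
        raise ValueError("ref and alt alleles are identical")
    return SimpleSub(m["ac"], m["type"], int(m["pos"]),
                     int(m["offset"] or 0), m["ref"], m["alt"])


def format_sub(v: SimpleSub) -> str:
    """Python object -> HGVS string."""
    offset = f"{v.offset:+d}" if v.offset else ""
    return f"{v.ac}:{v.type}.{v.pos}{offset}{v.ref}>{v.alt}"


var = parse_sub("NM_182763.2:c.688+403C>T")
print(var)
print(format_sub(var))
```

A string without a coordinate type (for example `NM_001197320.1:281C>T`) is rejected by this sketch, which is the kind of error a parser-based validator can catch before any mapping is attempted.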

Example: Variant liftover between transcripts
Map
from ➀ NM_182763.2:c.688+403C>T
to ➁ NC_000001.10:g.150550916G>A
to ➂ NM_001197320.1:c.281C>T
with Splign alignments
Transcripts (proteins): ➀ NM_182763.2 (NP_877495.1), ➂ NM_001197320.1 (NP_001184249.1)
Genome reference: ➁ NC_000001.10
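
The c. → g. → c. chain can be sketched with plain coordinate arithmetic. Everything below is a toy: the exon coordinates are invented and do not correspond to NM_182763.2 or NM_001197320.1, the helper names are hypothetical, alignments are assumed gap-free (no indel handling, unlike the hgvs package), and only positive c. positions are supported.

```python
from typing import List, NamedTuple, Optional, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


class Transcript(NamedTuple):
    exons: List[Tuple[int, int]]  # (g_start, g_end), 1-based inclusive, 5'->3' order
    strand: int                   # +1 or -1
    cds_start_n: int              # 1-based transcript position of c.1


def c_to_g(tx: Transcript, c_pos: int, offset: int = 0) -> int:
    """Map c.<pos>[+/-offset] to a genomic position."""
    n = c_pos + tx.cds_start_n - 1
    for g_start, g_end in tx.exons:
        length = g_end - g_start + 1
        if n <= length:
            anchor = g_start + n - 1 if tx.strand == 1 else g_end - (n - 1)
            return anchor + tx.strand * offset  # offsets run in transcript direction
        n -= length
    raise ValueError("position beyond transcript end")


def g_to_c(tx: Transcript, g_pos: int) -> Optional[int]:
    """Map a genomic position to c.; None if it is not in an exon."""
    consumed = 0
    for g_start, g_end in tx.exons:
        if g_start <= g_pos <= g_end:
            within = g_pos - g_start + 1 if tx.strand == 1 else g_end - g_pos + 1
            return consumed + within - tx.cds_start_n + 1
        consumed += g_end - g_start + 1
    return None


def to_plus_strand(tx: Transcript, base: str) -> str:
    """Alleles on minus-strand transcripts are complemented in g. coordinates."""
    return base if tx.strand == 1 else COMPLEMENT[base]


# Two hypothetical minus-strand transcripts; tx_b has an extra exon that
# lies inside an intron of tx_a.
tx_a = Transcript([(9001, 9100), (5001, 5100)], strand=-1, cds_start_n=1)
tx_b = Transcript([(9001, 9100), (8501, 8700), (5001, 5100)], strand=-1, cds_start_n=1)

g = c_to_g(tx_a, 100, offset=403)        # intronic in tx_a
print(g, to_plus_strand(tx_a, "C"), ">", to_plus_strand(tx_a, "T"))  # 8598 G > A
print(g_to_c(tx_b, g))                   # 203: exonic in tx_b
```

As in the slide's example, a variant that is intronic with respect to one transcript becomes an exonic variant on another, and the C>T change on a minus-strand transcript appears as G>A in genome coordinates.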

Developer Info
Testing
➢ 91% code coverage
➢ 25665 test variants
● ~200 hand curated, rest from dbSNP
● 23436 sub, 1254 del, 908 ins, 45 delins, 22 dup
● 44 distinct transcripts, many selected for difficulty
➢ >99% concordance with Mutalyzer
● using >100K variants from ClinVar
Upcoming directions
(all issues are publicly readable)
➢ multi-variant alleles
➢ release LRG
➢ GRCh38
➢ API changes

Conclusions
➢ The fidelity of reference-transcript mapping matters
● For ~800 transcripts, Splign and BLAT generate significantly different alignments
● These differences might affect the interpretation of clinically-relevant genes (including 3 ACMG must report genes)
➢ Current resources have important limitations
➢ Two tools may help you deal with these limitations
● UTA – Freely available archive of transcripts from multiple sources
● HGVS – Comprehensive parsing, formatting, manipulation, and validation of variants

Acknowledgements
➢ Invitae
● Vince Fusaro
● John Garcia
● Emily Hare
● Kevin Jacobs
● Geoff Nilsen
● Rudy Rico
● Jody Westbrook
● http://goo.gl/dq2uoW
http://bitbucket.com/hgvs/hgvs
http://bitbucket.com/uta/uta
➢ Code (Python)
➢ Documentation & Examples
➢ Issues
➢ BED files
➢ Code testing is public
Or just:
pip install hgvs
