Mane v2 final

Presentation on the MANE project by Terence Murphy at GRC/GIAB workshop at ASHG 2018 meeting


  1. MANE: Matched Annotation from the NCBI and EMBL-EBI
     Terence Murphy, Team Lead, NCBI RefSeq
  2. NCBI RefSeq
     • RefSeq Curators: Shashi Pujar, Eric Cox, Catherine Farrell, Tamara Goldfarb, John Jackson, Vinita Joardar, Kelly McGarvey, Michael Murphy, Nuala O’Leary, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, David Webb
     • Terence Murphy, RefSeq Team Lead
     • RefSeq Developers: Alex Astashyn, Olga Ermolaeva, Vamsi Kodali, Craig Wallin
     Ensembl
     • Adam Frankish, Manual Genome Annotation Coordinator
     • Fiona Cunningham, Variation Annotation Team Lead
     • Ensembl HAVANA/LRG curators: Jane Loveland, Joannella Morales, Ruth Bennett, Andrew Berry, Claire Davidson, Laurent Gil, Jose Manuel Gonzalez, Matt Hardy, Mike Kay, Aoife McMahon, Marie-Marthe Suner, Glen Threadgold
     This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
  3. NCBI RefSeq vs. Ensembl/GENCODE
     NCBI’s RefSeq:
     • NM/NR: manually annotated set
       • Only includes full-length transcripts
     • XM/XR: automatically produced
       • Predict full-length from partial data
     • Transcripts don’t necessarily match the genome assembly:
       • Represent a prevalent, 'standard' allele
       • Independent of reference assembly changes
     • Clinical annotation predominantly done using a RefSeq transcript or a subset of NMs
     Ensembl/GENCODE:
     • ENS ID: more manually reviewed transcripts
     • Includes partial transcripts
     • More transcripts for non-coding genes
     • On average more transcripts per gene
     • Must match reference genome
     • Reference set for gnomAD/ExAC, GTEx, Decipher, 100,000 Genomes Project, COSMIC, ICGC
  4. A core set of annotation matches*
     [Figure: overlap of RefSeq AR109 vs. Ensembl 94 annotation on HGNC-named protein-coding loci of the GRCh38 primary assembly]
     • Overlap labels and counts, in extraction order: "Different UTR(s) 1k Different end(s) 31k identical 5k"
     • Other NM/NR: 20k
     • RefSeq models: 72k
     • Other GENCODE basic: 20k
     • GENCODE comprehensive: 62k
     • GENCODE comprehensive partials: 32k
     • CCDS (97% of HGNC-named protein-coding genes)
  5. But most have some differences
     [Figure: example transcript comparison; track labels: RefSeq, Ensembl]
     • Often subtle
     • RefSeq mismatches require special mapping logic
     • Differences complicate data exchange, especially for clinical reporting
     • “Can we match for at least one representative transcript for each gene?”
  6. Why define a representative transcript?
     • Preferred substrate for clinical reporting
     • Useful for comparative / evolutionary genomics
     • Standardize default across resources
       • LRG, VEP, gnomAD, COSMIC, UCSC, UniProt, others all have their own defaults
     • Help make a better choice than “I just use the longest/first one”
  7. Matched Annotation from the NCBI and EMBL-EBI
     • Set of 100% identical RefSeq & Ensembl transcripts
     • Scope: at least one transcript for all protein-coding genes
     • Match GRCh38, identical 5’ and 3’ ends, all splice sites, CDS
     • Three tiers:
       • MANE Select: one per gene, representative of biology at each locus
         • Well-supported, expressed, conserved
       • MANE Plus: alternate transcripts to capture key aspects of gene structure
       • MANE Extended: additional transcripts that match
     • Both RefSeq & Ensembl will have additional unmatched transcripts
     • Fairly stable, but will allow updates when necessary
  8. Methodology
     • How to pick a Select transcript
     • How to match ends
     • Opportunities to improve both RefSeq & Ensembl/GENCODE
  9. Choosing a Select transcript
     • Ensembl Pipeline
       • Length
       • Expression
       • Conservation (APPRIS)
       • Representation in UniProt and RefSeq
       • Coverage of pathogenic variants
     • RefSeq Select Pipeline
       • Conservation (PhyloCSF)
       • Expression
       • CAGE
       • Representation in UniProt and Ensembl
       • Length
       • Prior manual curation (LRG)
     [Chart: RefSeq:Ensembl:S P, 13644; RefSeq:Ensembl CDS match, 4569; other, 1219]
  10. Define 5’ ends from FANTOM CAGE data
     • Deep sequencing dataset of 5’ ends
     • Integrate data to pick 5’-most strong site (not always the absolute peak)
     [Figure: KNG1 example; track labels: Ensembl, RefSeq, CAGE, Transcripts, RNAseq]
  11. Define 5’ ends from FANTOM CAGE data
     [Figure: histogram for RefSeq and Ensembl; x-axis bins from < -200 to > 200 in steps of 40, labeled "CAGE shorter" and "CAGE longer"; y-axis 0 to 7000]
     • Bias towards shorter 5’ UTR
     • 83% of select transcripts matched to CAGE data
     [Chart: good CAGE, 14670; CAGE needs review, 1573; no CAGE, 1970; other, 1219]
  12. Define 3’ ends from polyA sequencing
     • Long and short read data to define maximum 3’ UTRs
     • Integrating multiple datasets to define sites within clusters (polyA_DB, PolyAsite, +more)
     • 72% of select transcripts matched to polyA data
     [Chart: polyA cluster, no extension, 10968; polyA cluster, possible extension, 3023; other extensions, 646; no polyA, 3576; no match, 1219]
  13. Some CDS start sites need to be revised
  14. Analyses find some genes missing significant splice variants
  15. Deliverables
     • Annotation files and tracks in genome browsers
     • Synonymous RefSeq & Ensembl IDs
     • Reciprocal markup in NCBI and EMBL-EBI resources
  16. Timelines
     • Dec 2018: alpha dataset available, one Matched Select transcript for 50% of coding genes
     • Bulk RefSeq transcript updates starting in next few months
     • In browsers Spring 2019
     • 2019: select and match transcripts for 90% of coding genes
     • Emphasis on clinically-relevant loci
  17. We want to hear from you!
     • NCBI booth: #315
     • Find us at this meeting: Terence Murphy, Adam Frankish, Jane Loveland, Joannella Morales
     • E-mail: refseq-support@nlm.nih.gov, gencode-help@ebi.ac.uk
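
Slide 7 lists the matching criteria: same GRCh38 coordinates, identical 5’ and 3’ ends, all splice sites, and CDS. A minimal sketch of such a check follows. The `Transcript` class, its fields, and the placeholder accessions and coordinates are hypothetical illustrations, not the RefSeq or Ensembl data model or the project's actual matching code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) genomic coordinates, inclusive


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model (an assumption for illustration)."""
    accession: str
    chrom: str
    strand: str              # "+" or "-"
    exons: Tuple[Exon, ...]  # sorted by genomic start
    cds_start: int           # leftmost CDS coordinate
    cds_end: int             # rightmost CDS coordinate


def intron_boundaries(tx: Transcript) -> List[Tuple[int, int]]:
    """Return (exon end, next exon start) pairs in genomic order."""
    return [(left[1], right[0]) for left, right in zip(tx.exons, tx.exons[1:])]


def is_full_match(a: Transcript, b: Transcript) -> bool:
    """True when both models agree on locus, ends, splice sites, and CDS bounds."""
    same_locus = (a.chrom, a.strand) == (b.chrom, b.strand)
    same_ends = (a.exons[0][0], a.exons[-1][1]) == (b.exons[0][0], b.exons[-1][1])
    same_splicing = intron_boundaries(a) == intron_boundaries(b)
    same_cds = (a.cds_start, a.cds_end) == (b.cds_start, b.cds_end)
    return same_locus and same_ends and same_splicing and same_cds


# Made-up coordinates and placeholder accessions, for illustration only.
ref = Transcript("NM_EXAMPLE", "chr1", "+", ((100, 200), (300, 400)), 150, 350)
ens = Transcript("ENST_EXAMPLE", "chr1", "+", ((100, 200), (300, 400)), 150, 350)
shorter_utr = Transcript("ENST_EXAMPLE2", "chr1", "+", ((120, 200), (300, 400)), 150, 350)

print(is_full_match(ref, ens))          # identical models
print(is_full_match(ref, shorter_utr))  # differs only at the 5' end
```

The third transcript shows the kind of subtle difference slide 5 describes: the CDS and splicing agree, but a different 5’ end is enough to fail the match.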
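
Slide 10 says the 5’ end is set at the 5’-most strong CAGE site, which is not always the absolute peak. One way to sketch that idea is shown here. The `CagePeak` type, the `min_fraction` threshold of 0.5, and the signal values are assumptions made for illustration; the slides do not say how "strong" is defined or how the datasets are integrated.

```python
from typing import List, NamedTuple


class CagePeak(NamedTuple):
    position: int   # genomic coordinate of the peak
    signal: float   # peak strength; units are an assumption


def pick_five_prime_site(peaks: List[CagePeak], strand: str,
                         min_fraction: float = 0.5) -> CagePeak:
    """Pick the 5'-most peak with signal >= min_fraction of the strongest peak."""
    if not peaks:
        raise ValueError("no CAGE peaks supplied")
    strongest = max(p.signal for p in peaks)
    strong = [p for p in peaks if p.signal >= min_fraction * strongest]
    # 5'-most is the lowest coordinate on "+" and the highest on "-".
    if strand == "+":
        return min(strong, key=lambda p: p.position)
    return max(strong, key=lambda p: p.position)


# Made-up peaks: the absolute peak is at 1050, a weaker strong site at 1000,
# and a weak upstream site at 900 that falls below the threshold.
peaks = [CagePeak(900, 5.0), CagePeak(1000, 60.0), CagePeak(1050, 100.0)]
print(pick_five_prime_site(peaks, "+"))
print(pick_five_prime_site(peaks, "-"))
```

On the plus strand the sketch picks the site at 1000 rather than the absolute peak at 1050, which mirrors the "not always the absolute peak" note on the slide.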
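
The chart counts on slides 9, 11, and 12 can be cross-checked with a little arithmetic. The sketch below copies the counts from the slides. Which categories count as "matched" is an assumption: "good CAGE" plus "CAGE needs review" for CAGE, and the two "polyA cluster" categories for polyA.

```python
# Counts copied from the charts on slides 9, 11, and 12.
select = {"RefSeq:Ensembl:S P": 13644, "RefSeq:Ensembl CDS match": 4569, "other": 1219}
cage = {"good CAGE": 14670, "CAGE needs review": 1573, "no CAGE": 1970, "other": 1219}
polya = {
    "polyA cluster, no extension": 10968,
    "polyA cluster, possible extension": 3023,
    "other extensions": 646,
    "no polyA": 3576,
    "no match": 1219,
}

totals = {name: sum(chart.values())
          for name, chart in (("select", select), ("cage", cage), ("polya", polya))}

# Assumption: these categories are the ones the slides call "matched".
cage_matched = cage["good CAGE"] + cage["CAGE needs review"]
polya_matched = (polya["polyA cluster, no extension"]
                 + polya["polyA cluster, possible extension"])

cage_pct = round(100 * cage_matched / totals["cage"], 1)
polya_pct = round(100 * polya_matched / totals["polya"], 1)

print(totals)     # all three charts sum to 19432
print(cage_pct)   # 83.6
print(polya_pct)  # 72.0
```

Under that assumption the three charts describe the same 19,432 transcripts, the polyA figure reproduces the stated 72%, and the CAGE figure comes out at about 83.6%, close to the 83% on slide 11.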
