The Matched Annotation from NCBI and EMBL-EBI (MANE) Project

  1. A standardized “default” transcript set: The Matched Annotation from NCBI and EMBL-EBI (MANE) Project
     Joannella Morales, PhD
     European Bioinformatics Institute (EMBL-EBI)
     jmorales@ebi.ac.uk
     ASHG 2019
  2. Rationale
     • Accurate identification and description of the genes in the human genome is foundational for biology
     • The availability of high-quality reference materials is essential for clinical genomics
     • Comprehensive transcript annotation is central to this endeavor
  3. Sources of transcript annotation: RefSeq and Ensembl/GENCODE
     NCBI’s RefSeq:
     • NM_xxxxxx: manually annotated; XM_xxxxxx: automatically produced
     • May not match the primary reference genome: transcripts represent a prevalent, 'standard' allele, which is not always the reference allele
     • Clinical annotation is predominantly done using RefSeq transcripts
     EBI’s Ensembl/GENCODE:
     • ENSTxxxxxx: more manually reviewed transcripts
     • Must match the primary reference genome
     • On average, more Ensembl transcripts per gene than RefSeq transcripts
     • Reference set for gnomAD/ExAC, GTEx, Decipher, 100,000 Genomes Project, ICGC, etc.
  4. Rationale
     • Comprehensive annotation is good BUT…
     • This can cause some challenges in the clinical context
       • There are numerous alternatively spliced transcripts for a given gene
       • Transcripts get updated over time: versions change, making it hard to track a variant over time
     • There is no standard
       • Variant reporting can be done on any transcript
       • Commonly used tools (gnomAD, HGMD, Decipher, etc.) often have different “canonical” transcripts
     • Which one(s) should be used?
       • Often the longest transcript at the locus (or the first one described) is used
       • Even though this one may not be relevant (e.g. minor or not expressed in tissue of interest)
  5. Solution: Define a joint ‘representative’ transcript set
     • Standardize transcript set across genomics browsers
       • VEP, gnomAD, HGMD, COSMIC, UniProt, and others all have their own “canonicals”
     • Identify a transcript that captures the most information about each protein-coding gene
     • Standardize clinical reporting
     • Useful as starting point for comparative/evolutionary genomics
     • All transcripts should always be considered for clinical interpretation
     • We are NOT saying that biology can be simplified to a single transcript at each genomic locus
  6. What is MANE? (Matched Annotation from the NCBI and EMBL-EBI)
     • A transcript set with the following attributes:
       • Must match GRCh38 sequence
       • 100% identical between the RefSeq and corresponding Ensembl transcript (5’UTR, CDS, and 3’UTR)
     • Transcripts should be:
       • Well-supported, expressed, conserved
       • Representative of biology at each locus
     • Phase 1, MANE Select: one transcript for each protein-coding locus, to be used as “default” across genomics resources
     • Phase 2, MANE Plus: additional well-supported transcripts of particular interest
       • For example, for clinical reporting
  7. MANE Select Methodology
     • Automated with a layer of manual review
     • Built independent pipelines to select a transcript from each set
     RefSeq Select Pipeline:
       • Expression
       • Conservation
       • Representation in UniProt and Ensembl
       • Length
       • Prior manual curation (LRG)
     Ensembl Select Pipeline:
       • Length
       • Expression
       • Conservation
       • Representation in UniProt and RefSeq
       • Coverage of pathogenic variants
  8. MANE Select Methodology
     [Figure: three-step workflow. Step 1, Select: one transcript each from RefSeq and Ensembl/GENCODE. Step 2, Review: review UTRs. Step 3, Match: MANE Select, with identical splicing, CDS, and UTRs.]
  9. Initial pipeline comparison and bins
     • Bin 1: Identical
     • Bin 2: Same CDS, but different UTR length or splicing pattern
     • Bin 3: Different CDS, with or without different UTR length or splicing pattern
     [Figure labels: “Majority of cases”, “Complex loci”, “Annotation differences”]
  10. Reducing Bin 2
     Bin 2 = Both pipelines pick the same CDS. The chosen ENST and NM differ only in UTR length and/or UTR splicing pattern.
     • Defined rules to jointly define the extent of the 5’ and 3’ UTRs
       • “Longest strong”
     • Trimmed/extended ends in an automated manner
  11. Selecting UTRs, 5’ end
     CAGE = Cap Analysis of Gene Expression, developed by RIKEN. This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a quantification of the RNA abundance.
     [Figure: KNG1 in the Ensembl Genome Browser, with Ensembl/GENCODE, RefSeq, RNAseq, and CAGE count tracks; candidate 5’ ends labelled “Longest”, “Strongest”, and “Longest strong”.]
  12. Selecting UTRs, 3’ end
     PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript.
     [Figure: REM2 in NCBI’s Genome Data Viewer, with Ensembl, RefSeq, RNAseq, PolyA count, and INSDC coverage tracks; candidate 3’ ends labelled “Longest” and “Longest strong”.]
  13. Reducing Bin 3
     • Bin 3 = Pipelines picked different CDS
     • Improved pipelines, based on review of genes in bin 3
     • Manually curating genes unresolved after pipeline improvement (prioritizing clinical genes)
     • This is the hardest bin!
       • In some cases, there is no right answer. Either one could be selected. This is biology!
       • In other cases, the corresponding transcript in the other set does not exist, thus requiring a full annotation update. Very time-consuming!
  14. MANE Select Progress Update
     • In April, we released:
       • MANE Select v0.5 on all browsers, with coverage of 54% across the genome
     • In September, we released:
       • MANE Select v0.6 on all browsers, increasing MANE Select coverage to 67% across the genome
     • Identified an additional 4%, which will increase coverage to 71% across the genome
     • We are aiming to increase coverage to 75–80% by the end of the year
     • Our ultimate goal is to achieve genome-wide coverage by 2020
  15. Accessing MANE: Ensembl
  16. Accessing MANE: NCBI Browser
  17. Accessing MANE: UCSC browser
  18. Accessing MANE: NCBI’s FTP
     ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
  19. Limitations
     • MANE Select does not capture biological complexity (requires a single choice)
       • Transcripts excluded may score approximately equal to MANE Select on any or all supporting attributes
     • Tissue-specificity vs general pattern of expression
       • Most highly supported transcript might exclude important tissue-specific or clinically relevant isoforms
     • Gaps in data
       • Insufficient information to determine transcriptional specificity
       • Transcript-level quantification still difficult
  20. Summary
     • NCBI and EMBL-EBI are working together to review annotation and produce a matched set of “high-value” transcripts
     • These transcripts will match GRCh38 and will represent 100% identity between a RefSeq transcript and its corresponding Ensembl transcript
     • We will define one “default” transcript per locus (MANE Select)
     • We aim to have widespread adoption of MANE Select as default across genomics resources
     • We will define additional well-supported transcripts (MANE Plus)
     • We expect all transcripts required for clinical reporting to be in Select and Plus
     • Feedback welcome: MANE-help@ebi.ac.uk
  21. Acknowledgments
     Fiona Cunningham, Variation Annotation Team Lead
     Adam Frankish, Manual Genome Annotation Coordinator
     Ensembl/LRG curators: Jane Loveland, Joannella Morales, Ruth Bennett, Andrew Berry, Claire Davidson, Laurent Gil, Jose Manuel Gonzalez, Matt Hardy, Mike Kay, Aoife McMahon, Marie-Marthe Suner, Glen Threadgold
     MANE-help@ebi.ac.uk
     Terence Murphy, RefSeq Team Lead
     RefSeq Curators: Shashi Pujar, Eric Cox, Catherine Farrell, Tamara Goldfarb, John Jackson, Vinita Joardar, Kelly McGarvey, Michael Murphy, Nuala O’Leary, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, David Webb
     RefSeq Developers: Alex Astashyn, Olga Ermolaeva, Vamsi Kodali, Craig Wallin
     MANE-help@ncbi.nlm.nih.gov
     This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.
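
The three bins from slide 9 can be illustrated with a short sketch. This is not the MANE pipeline code; it is a minimal illustration that assumes both transcripts lie on the same chromosome and strand and are described only by exon coordinates plus the genomic span of the CDS. The `Transcript` class, the `assign_bin` function, and the accessions are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model, for illustration only."""
    accession: str
    exons: Tuple[Interval, ...]  # sorted by start coordinate
    cds_start: int               # lowest genomic coordinate of the CDS
    cds_end: int                 # highest genomic coordinate of the CDS


def cds_segments(tx: Transcript) -> Tuple[Interval, ...]:
    """Clip each exon to the CDS span, giving the coding segments."""
    segments = []
    for start, end in tx.exons:
        s, e = max(start, tx.cds_start), min(end, tx.cds_end)
        if s <= e:
            segments.append((s, e))
    return tuple(segments)


def assign_bin(refseq: Transcript, ensembl: Transcript) -> int:
    """Return 1, 2, or 3 following the bin definitions on slide 9."""
    same_structure = (
        refseq.exons == ensembl.exons
        and (refseq.cds_start, refseq.cds_end) == (ensembl.cds_start, ensembl.cds_end)
    )
    if same_structure:
        return 1  # identical splicing, CDS, and UTRs
    if cds_segments(refseq) == cds_segments(ensembl):
        return 2  # same CDS, different UTR length or UTR splicing
    return 3      # different CDS


# Made-up coordinates and accessions, purely for demonstration.
nm = Transcript("NM_example", ((100, 200), (300, 400), (500, 650)), 150, 550)
enst_identical = Transcript("ENST_example_a", ((100, 200), (300, 400), (500, 650)), 150, 550)
enst_utr_diff = Transcript("ENST_example_b", ((120, 200), (300, 400), (500, 600)), 150, 550)
enst_cds_diff = Transcript("ENST_example_c", ((100, 200), (500, 650)), 150, 550)

for candidate in (enst_identical, enst_utr_diff, enst_cds_diff):
    print(candidate.accession, "-> bin", assign_bin(nm, candidate))
```

Comparing coordinates alone is a simplification: the MANE criteria on slide 6 also require that the transcript sequence match GRCh38, which this sketch does not check.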

Editor's Notes

  • Solutions?
    LRG project – clinical focus
    MANE project – broader focus, useful for clinical community
  • Our effort to select one high-quality transcript at all protein-coding loci, and to have this be consistent across all genomics resources, will give a consistent starting view of biology for researchers, whether the intent is to use it for reporting variants, comparative genomics or any other endeavour. That said, all the transcripts we annotate should always be considered and we are certainly NOT saying that biology can be simplified to a single transcript at each genomic locus. We anticipate expanding the project to include a larger set of transcripts that are well-supported, predicted to be functional or relevant to specific user groups.
  • We did this computationally, by building two independent, in-house pipelines
    The use of two independently built pipelines, each designed without input from the other group, was important to validate the result. If both groups identify the same model, then there is a higher degree of confidence that we have arrived at the correct answer
    Here you see the different factors that each pipeline took into account. You will see that similar factors such as expression, conservation, length were taken into account.
  • CAGE = Cap Analysis of Gene Expression, developed by RIKEN
    This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a quantification of the RNA abundance.

    PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript.
  • 5’ UTR of KNG1 gene.
    CAGE counts overlaid from NCBI’s data reprocessing
  • Above threshold of 50%
    Clustering algorithm: red cluster is the strongest
  • Supporting attributes: overall support
    expression
    conservation
    known clinical variation
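
One way to read the “longest strong” rule from slides 10 to 12, together with the 50% threshold mentioned in the notes, is sketched below. The interpretation is an assumption, not the documented MANE rule: a candidate transcript end counts as “strong” if its supporting count (CAGE tags at the 5’ end, polyA reads at the 3’ end) is at least 50% of the strongest candidate's count, and the longest UTR among the strong candidates is chosen. The function name `longest_strong` and the counts are hypothetical.

```python
from typing import Dict


def longest_strong(support_by_utr_length: Dict[int, int], threshold: float = 0.5) -> int:
    """Pick the longest candidate UTR whose support is 'strong'.

    support_by_utr_length maps a candidate UTR length (in bases) to the
    read or tag count supporting that end. 'Strong' is assumed to mean
    at least `threshold` times the count of the best-supported candidate.
    """
    if not support_by_utr_length:
        raise ValueError("no candidate ends supplied")
    strongest_count = max(support_by_utr_length.values())
    strong_lengths = [
        length
        for length, count in support_by_utr_length.items()
        if count >= threshold * strongest_count
    ]
    return max(strong_lengths)


# Made-up counts: the longest candidate (250) is weakly supported, the
# strongest candidate (120) is shorter, and 180 is the compromise.
candidates = {250: 3, 180: 60, 120: 100, 40: 20}
print("longest:", max(candidates))
print("strongest:", max(candidates, key=candidates.get))
print("longest strong:", longest_strong(candidates))
```

The example is chosen so that the three labels on the KNG1 slide (“Longest”, “Strongest”, “Longest strong”) point at different candidates; the notes also mention a clustering step, which this sketch leaves out.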
