A standardized “default” transcript set
The Matched Annotation from NCBI and EMBL-EBI
(MANE) Project
Joannella Morales, PhD
European Bioinformatics Institute (EMBL-EBI)
jmorales@ebi.ac.uk
ASHG 2019
Rationale
• Accurate identification and description of the genes in the human genome is
foundational for biology
• The availability of high-quality reference materials is essential for clinical genomics
• Comprehensive transcript annotation is central to this endeavor
Sources of transcript annotation:
RefSeq and Ensembl/GENCODE
NCBI’s RefSeq:
• NM_xxxxxx: manually annotated; XM_xxxxxx: automatically produced
• May not match the primary reference genome:
• Represents a prevalent, 'standard' allele, which is not always the reference allele
• Clinical annotation predominantly done using RefSeq transcripts
EBI’s Ensembl/GENCODE:
• ENSTxxxxxx: More manually-reviewed transcripts
• Must match primary reference genome
• On average more Ensembl transcripts per gene compared to RefSeqs
• Reference set for gnomAD/ExAC, GTEx, Decipher, 100,000 Genomes Project, ICGC etc.
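The accession prefixes listed above can be told apart mechanically. The following is a minimal sketch, assuming only the three prefixes named on the slide (NM_, XM_, ENST) and an optional ".version" suffix; the function names and the example accessions are made up for illustration and are not part of any RefSeq or Ensembl tool.

```python
import re

# Prefixes taken from the slide; anything else is reported as "unknown".
# The labels repeat the slide's wording and are not an official vocabulary.
_PATTERNS = [
    (re.compile(r"^NM_\d+(\.\d+)?$"), ("RefSeq", "manually annotated")),
    (re.compile(r"^XM_\d+(\.\d+)?$"), ("RefSeq", "automatically produced")),
    (re.compile(r"^ENST\d+(\.\d+)?$"), ("Ensembl/GENCODE", "transcript")),
]


def classify_accession(accession):
    """Return (source, kind) for a transcript accession."""
    for pattern, label in _PATTERNS:
        if pattern.match(accession):
            return label
    return ("unknown", "unknown")


def split_version(accession):
    """Split 'NM_000000.3' into ('NM_000000', 3); version is None if absent."""
    base, _, version = accession.partition(".")
    return base, (int(version) if version else None)


if __name__ == "__main__":
    # Placeholder accessions, not real records.
    for acc in ["NM_000000.1", "XM_000000.2", "ENST00000000000.5", "NR_000000.1"]:
        print(acc, classify_accession(acc), split_version(acc))
```

Keeping the version separate from the base accession matters for the later point about tracking a variant across transcript updates.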
Rationale
• Comprehensive annotation is good BUT…
• This can cause some challenges in the clinical context
• There are numerous alternatively spliced transcripts for a given gene
• Transcripts get updated over time – version changes, hard to track a variant over time
• There is no standard
• Variant reporting can be done on any transcript
• Commonly used tools (gnomAD, HGMD, Decipher etc.) often have different “canonical” transcripts
• Which one(s) should be used?
• Often the longest transcript at the locus (or the first one described) is used
• Even though this one may not be relevant (e.g. minor or not expressed in tissue of interest)
Solution: Define a joint ‘representative’ transcript set
• Standardize transcript set across genomics browsers
• VEP, gnomAD, HGMD, COSMIC, UniProt, others all have their own “canonicals”
• Identify a transcript that captures the most information about each protein-coding
gene
• Standardize clinical reporting
• Useful as starting point for comparative/evolutionary genomics
• All transcripts should always be considered for clinical interpretation
• We are NOT saying that biology can be simplified to a single transcript at
each genomic locus
What is MANE?
(Matched Annotation from the NCBI and EMBL-EBI)
• A transcript set with the following attributes:
• Must match GRCh38 sequence
• 100% identical between the RefSeq and corresponding Ensembl transcript
• 5’UTR, CDS, and 3’UTR
• Transcripts should be:
• Well-supported, expressed, conserved
• Representative of biology at each locus
• Phase 1 - MANE Select – One transcript for each protein-coding locus; to be used as “default”
across genomics resources
• Phase 2 - MANE Plus – Additional well-supported transcripts of particular interest
• For example, for clinical reporting
• Automated with a layer of manual review
• Built independent pipelines to select a transcript from each set
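The "100% identical" requirement can be read as a check on coordinates: if two transcript models sit on the same GRCh38 sequence and strand and share every exon boundary and the same coding start and end, then their 5’UTR, CDS, and 3’UTR are necessarily the same. Below is a minimal sketch under that reading. The `TranscriptModel` class, its field names, and the toy coordinates are hypothetical; this is not the project's actual comparison code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TranscriptModel:
    """Hypothetical, simplified transcript model on a single assembly."""
    accession: str
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # sorted, inclusive (start, end) pairs
    cds_start: int  # lowest coding coordinate
    cds_end: int    # highest coding coordinate


def is_matched(a, b):
    """True when both models have identical exon structure and coding range.

    Identical exons plus identical coding bounds imply identical
    5'UTR, CDS, and 3'UTR on the shared reference sequence.
    """
    return (a.chrom, a.strand, a.exons, a.cds_start, a.cds_end) == (
        b.chrom, b.strand, b.exons, b.cds_start, b.cds_end
    )


if __name__ == "__main__":
    # Toy coordinates and placeholder accessions.
    refseq = TranscriptModel("NM_000000.1", "chr1", "+",
                             ((100, 200), (300, 400)), 150, 350)
    ensembl = TranscriptModel("ENST00000000000.1", "chr1", "+",
                              ((100, 200), (300, 400)), 150, 350)
    print(is_matched(refseq, ensembl))
```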
MANE Select Methodology
• RefSeq Select Pipeline
• Expression
• Conservation
• Representation in UniProt and Ensembl
• Length
• Prior manual curation (LRG)
• Ensembl Select Pipeline
• Length
• Expression
• Conservation
• Representation in UniProt and RefSeq
• Coverage of pathogenic variants
MANE Select Methodology
• Step 1 (Select): one transcript chosen from RefSeq and one from Ensembl/GENCODE
• Step 2 (Review): review UTRs
• Step 3 (Match): identical splicing, CDS, and UTRs, giving the MANE Select transcript
Initial pipeline comparison and bins
• Bin 1: Identical
• Bin 2: Same CDS, but different UTR length or splicing pattern
• Bin 3: Different CDS, with or without different UTR length or splicing pattern
[Figure labels: Majority of cases; Complex loci; Annotation differences]
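The three bins can be expressed as a small decision rule over a pair of pipeline choices. This is a sketch only: each transcript is represented as a hypothetical `(exons, cds_start, cds_end)` tuple with inclusive coordinates on the same strand of the same assembly, and the function names are illustrative rather than taken from either pipeline.

```python
def cds_intervals(exons, cds_start, cds_end):
    """Clip exon intervals to the coding range, dropping exons that are entirely UTR."""
    clipped = []
    for start, end in exons:
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo <= hi:
            clipped.append((lo, hi))
    return clipped


def assign_bin(refseq, ensembl):
    """Return 1, 2, or 3 for a (RefSeq, Ensembl) pair of transcript tuples.

    Bin 1: identical exon structure and coding range.
    Bin 2: same CDS, but UTR length or UTR splicing differs.
    Bin 3: different CDS.
    """
    same_cds = cds_intervals(*refseq) == cds_intervals(*ensembl)
    if same_cds and list(refseq[0]) == list(ensembl[0]):
        return 1
    if same_cds:
        return 2
    return 3


if __name__ == "__main__":
    # Toy coordinates only.
    nm = ([(100, 200), (300, 400), (500, 600)], 150, 550)
    enst_same = ([(100, 200), (300, 400), (500, 600)], 150, 550)
    enst_utr = ([(120, 200), (300, 400), (500, 650)], 150, 550)
    enst_cds = ([(100, 200), (500, 600)], 150, 550)
    print(assign_bin(nm, enst_same), assign_bin(nm, enst_utr), assign_bin(nm, enst_cds))
```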
Reducing Bin 2
Bin 2 = Both pipelines pick same CDS. Chosen ENST and NM only differ in UTR
length and/or UTR splicing pattern
• Defined rules to jointly define extent of 5’ and 3’ UTRs
• “Longest strong”
• Trimmed/Extended ends in an automated manner
Selecting UTRs, 5’ end:
CAGE = Cap Analysis of Gene Expression, developed by RIKEN
This is a way of capturing the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a
quantification of the RNA abundance.
[Figure: KNG1 in the Ensembl Genome Browser. Tracks: Ensembl/GENCODE, RefSeq, RNAseq, CAGE counts. Candidate 5’ ends labelled "Longest", "Strongest", and "Longest strong".]
Selecting UTRs, 3’ end:
PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated
region of mRNA, defining the end of a transcript.
[Figure: REM2 in NCBI’s Genome Data Viewer. Tracks: Ensembl, RefSeq, RNAseq, PolyA counts, INSDC coverage. Candidate 3’ ends labelled "Longest" and "Longest Strong".]
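The slides name a “longest strong” rule for UTR ends but do not give its thresholds. The sketch below shows one possible reading for a 5’ end, and every detail of it is an assumption: a candidate end counts as "strong" when its CAGE tag count is at least a chosen fraction of the strongest candidate's count, and among the strong candidates the one giving the longest UTR is kept. The function name, the `min_fraction` parameter, and the toy counts are hypothetical.

```python
def longest_strong(candidates, strand="+", min_fraction=0.5):
    """Pick a 5' end from {genomic position: CAGE tag count}.

    Assumed rule (not specified in the slides): keep candidates whose count
    is at least `min_fraction` of the strongest count, then take the one
    that yields the longest 5'UTR (most upstream on the given strand).
    """
    if not candidates:
        raise ValueError("no candidate ends supplied")
    strongest = max(candidates.values())
    strong = [
        pos for pos, count in candidates.items()
        if count > 0 and count >= min_fraction * strongest
    ]
    if not strong:
        raise ValueError("no candidate end has supporting tags")
    # Upstream means a lower coordinate on '+' and a higher coordinate on '-'.
    return min(strong) if strand == "+" else max(strong)


if __name__ == "__main__":
    # Toy data: position 1000 is the longest, 1100 the strongest,
    # and 1050 is the longest end that still has strong support.
    cage = {1000: 5, 1050: 60, 1100: 100}
    print(longest_strong(cage, strand="+"))
```

The same shape of rule would apply at the 3’ end with polyA counts in place of CAGE tags, with "longest" then meaning the most downstream position.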
Reducing Bin 3
• Bin 3 = Pipelines picked different CDS
• Improved pipelines, based on review of genes in bin 3
• Manually curating genes unresolved after pipeline improvement (prioritizing clinical genes)
• This is the hardest bin!
• In some cases, there is no right answer. Either one could be selected. This is biology!
• In other cases, the corresponding transcript in the other set does not exist, thus requiring a full annotation update. Very time consuming!
MANE Select Progress Update
• In April, we released:
• MANE Select v0.5 on all browsers, with coverage of 54% across the genome
• In September, we released:
• MANE Select v0.6 on all browsers, increasing MANE Select coverage to 67% across the genome
• Identified an additional 4% to increase coverage to 71% across the genome
• We are aiming to increase coverage to 75 – 80% by the end of the year
• Our ultimate goal is to achieve genome-wide coverage by 2020
Accessing MANE: Ensembl
Accessing MANE: NCBI Browser
Accessing MANE: UCSC browser
Accessing MANE: NCBI’s FTP
ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
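For scripted access, the directory above can be listed with Python's standard `ftplib`. This is a sketch under assumptions: it presumes the server accepts anonymous FTP logins, and it only lists whatever names are present, since the slide does not say which files the directory contains. The helper names are illustrative.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URL exactly as given on the slide.
MANE_FTP_URL = "ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/"


def split_ftp_url(url):
    """Return (host, path) for an ftp:// URL."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parsed.netloc, parsed.path


def list_mane_directory(url=MANE_FTP_URL, timeout=30):
    """List entries in the MANE directory (needs network access).

    Assumes anonymous login is permitted by the server.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)
        return ftp.nlst()


if __name__ == "__main__":
    for name in list_mane_directory():
        print(name)
```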
Limitations
• MANE Select does not capture biological complexity (requires a single choice)
• Transcripts excluded may score approximately equal to MANE Select on any or all
supporting attributes
• Tissue-specificity vs general pattern of expression
• Most highly supported transcript might exclude important tissue specific or clinically
relevant isoforms
• Gaps in data
• Insufficient information to determine transcriptional specificity
• Transcript level quantification still difficult
Summary
• NCBI and EMBL-EBI are working together to review annotation and produce a matched set
of “high-value” transcripts
• These transcripts will match GRCh38 and will represent 100% identity between a RefSeq
and its corresponding Ensembl transcript
• We will define one “default” transcript per locus (MANE Select)
• We aim to have widespread adoption of MANE Select as default across genomics resources
• We will define additional well-supported transcripts (MANE Plus)
• We expect all transcripts required for clinical reporting to be in Select and Plus
• Feedback welcome - MANE-help@ebi.ac.uk
Acknowledgments
Fiona Cunningham, Variation Annotation Team Lead
Adam Frankish, Manual Genome Annotation Coordinator
Ensembl/LRG curators
Jane Loveland
Joannella Morales
Ruth Bennett
Andrew Berry
Claire Davidson
Laurent Gil
Jose Manuel Gonzalez
Matt Hardy
Mike Kay
Aoife McMahon
Marie-Marthe Suner
Glen Threadgold
MANE-help@ebi.ac.uk
RefSeq Curators
Shashi Pujar
Eric Cox
Catherine Farrell
Tamara Goldfarb
John Jackson
Vinita Joardar
Kelly McGarvey
Michael Murphy
Nuala O’Leary
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
David Webb
Terence Murphy, RefSeq Team Lead
RefSeq Developers
Alex Astashyn
Olga Ermolaeva
Vamsi Kodali
Craig Wallin
MANE-help@ncbi.nlm.nih.gov
This research was supported by the Intramural Research
Program of the NIH, National Library of Medicine.

The Matched Annotation from NCBI and EMBL-EBI (MANE) Project

  • 1. A standardized “default” transcript set The Matched Annotation from NCBI and EMBL-EBI (MANE) Project Joannella Morales, PhD European Bioinformatics Institute (EMBL-EBI) jmorales@ebi.ac.uk ASHG 2019
  • 2. Rationale • Accurate identification and description of the genes in the human genome is foundational for biology • The availability of high-quality reference materials is essential for clinical genomics • Comprehensive transcript annotation is central to this endeavor
  • 3. Sources of transcript annotation: RefSeq and Ensembl/GENCODE NCBI’s RefSeq: • NM_xxxxxx: manually annotated; XM_xxxxxx: automatically produced • May not match the primary reference genome: • represent a prevalent, 'standard' allele but not always reference • Clinical annotation predominantly done using RefSeq transcripts EBI’s Ensembl/GENCODE: • ENSTxxxxxx: More manually-reviewed transcripts • Must match primary reference genome • On average more Ensembl transcripts per gene compared to RefSeqs • Reference set for gnomAD/ ExAC, GTEx, Decipher, 100,000 Genomes Project, ICGC etc.
  • 4. Rationale • Comprehensive annotation is good BUT… • This can cause some challenges in the clinical context • There are numerous alternatively spliced transcripts for a given gene • Transcripts get updated over time – version changes, hard to track a variant over time • There is no standard • Variant reporting can be done on any transcript • Commonly used tools (gnomAD, HGMD, Decipher etc.) often have different “canonical” transcripts • Which one(s) should be used? • Often the longest transcript at the locus (or the first one described) is used • Even though this one may not be relevant (e.g. minor or not expressed in tissue of interest)
  • 5. Solution: Define a joint ‘representative’ transcript set • Standardize transcript set across genomics browsers • VEP, gnomAD, HGMD, COSMIC, UniProt, others all have their own “canonicals” • Identify a transcript that captures the most information about each protein-coding gene • Standardize clinical reporting • Useful as starting point for comparative/evolutionary genomics • All transcripts should always be considered for clinical interpretation • We are NOT saying that biology can be simplified to a single transcript at each genomic locus
  • 6. What is MANE? (Matched Annotation from the NCBI and EMBL-EBI) • A transcript set with the following attributes: • Must match GRCh38 sequence • 100% identical between the RefSeq and corresponding Ensembl transcript • 5’UTR, CDS, and 3’UTR • Transcripts should be: • Well-supported, expressed, conserved • Representative of biology at each locus • Phase 1 - MANE Select – One transcript for each protein-coding locus; to be used as “default” across genomics resources • Phase 2 - MANE Plus – Additional well-supported transcripts of particular interest • For example, for clinical reporting
  • 7. MANE Select Methodology • Automated with a layer of manual review • Built independent pipelines to select a transcript from each set • RefSeq Select Pipeline • Expression • Conservation • Representation in UniProt and Ensembl • Length • Prior manual curation (LRG) • Ensembl Select Pipeline • Length • Expression • Conservation • Representation in UniProt and RefSeq • Coverage of pathogenic variants
  • 8. MANE Select Methodology • Step 1: Select (one RefSeq transcript and one Ensembl/GENCODE transcript) • Step 2: Review (review UTRs) • Step 3: Match (identical splicing, CDS, UTRs, giving the MANE Select transcript)
  • 9. Initial pipeline comparison and bins • Bin 1: Identical • Bin 2: Same CDS, but different UTR length or splicing pattern • Bin 3: Different CDS, with or without different UTR length or splicing pattern • Figure labels: Majority of cases; Complex loci; Annotation differences
  • 10. Reducing Bin 2 Bin 2 = Both pipelines pick same CDS. Chosen ENST and NM only differ in UTR length and/or UTR splicing pattern • Defined rules to jointly define extent of 5’ and 3’ UTRs • “Longest strong” • Trimmed/Extended ends in an automated manner
  • 11. Selecting UTRs, 5’ end: CAGE = Cap Analysis of Gene Expression, developed by RIKEN. This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a quantification of the RNA abundance. Figure labels: Longest; Strongest; Longest strong; Ensembl/GENCODE; RefSeq; RNAseq; CAGE counts (Ensembl Genome Browser, KNG1)
  • 12. Selecting UTRs, 3’ end: PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript. Figure labels: Longest; Longest Strong; Ensembl; RefSeq; RNAseq; PolyA counts; INSDC coverage (NCBI’s Genome Data Viewer, REM2)
  • 13. Reducing Bin 3 • Bin 3 = Pipelines picked different CDS • Improved pipelines, based on review of genes in bin 3 • Manually curating genes unresolved after pipeline improvement (prioritizing clinical genes) • This is the hardest bin! • In some cases, there is no right answer. Either one could be selected. This is biology! • In other cases, the corresponding transcript in the other set does not exist, thus requiring a full annotation update. Very time consuming!
  • 14. MANE Select Progress Update • In April, we released: • MANE Select v0.5 on all browsers, with coverage of 54% across the genome • In September, we released: • MANE Select v0.6 on all browsers, increasing MANE Select coverage to 67% across the genome • Identified an additional 4% to increase coverage to 71% across the genome • We are aiming to increase coverage to 75 – 80% by the end of the year • Our ultimate goal is to achieve genome-wide coverage by 2020
  • 18. Accessing MANE: NCBI’s FTP ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
  • 19. Limitations • MANE Select does not capture biological complexity (requires a single choice) • Transcripts excluded may score approximately equal to MANE Select on any or all supporting attributes • Tissue-specificity vs general pattern of expression • Most highly supported transcript might exclude important tissue specific or clinically relevant isoforms • Gaps in data • Insufficient information to determine transcriptional specificity • Transcript level quantification still difficult
  • 20. Summary • NCBI and EMBL-EBI are working together to review annotation and produce a matched set of “high-value” transcripts • These transcripts will match GRCh38 and will represent 100% identity between a RefSeq and its corresponding Ensembl transcript • We will define one “default” transcript per locus (MANE Select) • We aim to have widespread adoption of MANE Select as default across genomics resources • We will define additional well-supported transcripts (MANE Plus) • We expect all transcripts required for clinical reporting to be in Select and Plus • Feedback welcome - MANE-help@ebi.ac.uk
  • 21. Acknowledgments Fiona Cunningham, Variation Annotation Team Lead Adam Frankish, Manual Genome Annotation Coordinator This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. RefSeq Curators Shashi Pujar Eric Cox Catherine Farrell Tamara Goldfarb John Jackson Vinita Joardar Kelly McGarvey Michael Murphy Nuala O’Leary Bhanu Rajput Sanjida Rangwala Lillian Riddick David Webb Terence Murphy, RefSeq Team Lead RefSeq Developers Alex Astashyn Olga Ermolaeva Vamsi Kodali Craig Wallin MANE-help@ebi.ac.uk Matt Hardy Mike Kay Aoife McMahon Marie-Marthe Suner Glen Threadgold MANE-help@ncbi.nlm.nih.gov Ensembl/LRG curators Jane Loveland Joannella Morales Ruth Bennett Andrew Berry Claire Davidson Laurent Gil Jose Manuel Gonzalez
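
The three bins described on slides 9, 10 and 13 can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `Transcript` class, the `assign_bin` function, the coordinate layout and the placeholder accessions are all hypothetical, and real comparisons would work from the RefSeq and Ensembl annotation files.

```python
from dataclasses import dataclass
from typing import Tuple

# A segment is a (start, end) pair of genomic coordinates on GRCh38.
Segment = Tuple[int, int]


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model used only for this sketch."""
    accession: str
    exons: Tuple[Segment, ...]  # all exons, UTRs included, sorted by start
    cds: Tuple[Segment, ...]    # coding segments only, sorted by start


def assign_bin(refseq: Transcript, ensembl: Transcript) -> int:
    """Return 1, 2 or 3 following the bin definitions in the slides."""
    if refseq.cds != ensembl.cds:
        # Bin 3: the two pipelines picked different CDS.
        return 3
    if refseq.exons != ensembl.exons:
        # Bin 2: same CDS, but UTR length or UTR splicing differs.
        return 2
    # Bin 1: identical splicing, CDS and UTRs.
    return 1


if __name__ == "__main__":
    # Placeholder accessions and coordinates, not real records.
    nm = Transcript("NM_EXAMPLE.1", ((100, 200), (300, 450)), ((150, 200), (300, 400)))
    enst = Transcript("ENST_EXAMPLE.1", ((120, 200), (300, 450)), ((150, 200), (300, 400)))
    print(assign_bin(nm, enst))
```

Comparing the CDS first mirrors the order of difficulty in the slides: a CDS disagreement sends the pair to Bin 3 regardless of how the UTRs compare.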

Editor's Notes

  1. Solutions? LRG project – clinical focus; MANE project – broader focus, useful for clinical community
  2. Our effort to select one high-quality transcript at all protein-coding loci, and to have this be consistent across all genomics resources, will give a consistent starting view of biology for researchers, whether the intent is to use it for reporting variants, comparative genomics or any other endeavour. That said, all the transcripts we annotate should always be considered and we are certainly NOT saying that biology can be simplified to a single transcript at each genomic locus. We anticipate expanding the project to include a larger set of transcripts that are well-supported, predicted to be functional or relevant to specific user groups.
  3. We did this computationally, by building two independent in-house pipelines. The use of two independently built pipelines, each designed without input from the other group, was important to validate the result. If both groups identify the same model then there is a higher degree of confidence that we have arrived at the correct answer. Here you see the different factors that each pipeline took into account. You will see that similar factors such as expression, conservation and length were taken into account.
  4. CAGE = Cap Analysis of Gene Expression, developed by RIKEN. This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a quantification of the RNA abundance. PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript.
  5. 5’ UTR of KNG1 gene. CAGE counts overlaid from NCBI’s data reprocessing
  6. Above threshold of 50%. Clustering algorithm: red cluster is the strongest.
  7. Supporting attributes: overall support, expression, conservation, known clinical variation
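
The “longest strong” rule for the 5’ end, together with the 50% threshold mentioned in note 6, can be sketched in Python. This is one possible reading, offered as an assumption rather than the project's method: a candidate start site is treated as “strong” when its CAGE count is at least 50% of the strongest site, and the longest UTR among the strong candidates is kept. The function name `longest_strong`, its parameters and the input format (position mapped to count) are hypothetical, and the clustering step from the note is not modelled.

```python
from typing import Dict


def longest_strong(cage_counts: Dict[int, int], strand: str = "+",
                   threshold: float = 0.5) -> int:
    """Pick a 5' end from candidate start sites and their CAGE counts.

    cage_counts maps a genomic position to its tag count (assumed input).
    A site is "strong" if its count is >= threshold * the highest count.
    Among strong sites, return the one giving the longest 5' UTR.
    """
    if not cage_counts:
        raise ValueError("no candidate start sites supplied")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    strongest = max(cage_counts.values())
    strong_sites = [pos for pos, count in cage_counts.items()
                    if count >= threshold * strongest]

    # On the plus strand the most upstream site has the smallest coordinate;
    # on the minus strand it has the largest.
    return min(strong_sites) if strand == "+" else max(strong_sites)


if __name__ == "__main__":
    # Invented counts for illustration only.
    sites = {100: 5, 120: 60, 150: 100}
    print(longest_strong(sites, strand="+"))
```

With these invented counts, position 100 is the longest candidate but falls below the threshold, and position 150 is the strongest but shorter, so the rule settles on position 120. The same idea would apply at the 3’ end with PolyA counts and the direction reversed.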