GRC Assembly Analysis Workshop At Genome Informatics
September 21, 2014
The NCBI Eukaryotic Genome Annotation Pipeline
And Alternate Genomic Sequences
Paul Kitts
NCBI
National Center for Biotechnology Information
Genomes Annotated By NCBI
Human GRCh38
2014-02-03
Zebrafish GRCz10
in progress
Mouse GRCm38.p2
2013-12-27
Outline
• Overview of the NCBI Eukaryotic Genome Annotation Pipeline
• What to do with alternate loci & patch scaffolds?
• How we use the alt/patch/PAR alignments to inform our annotation
• Examples:
– Annotation only on alternate loci
– Different alleles annotated on primary assembly and alternate loci
– Annotation improved by patches
– Pseudoautosomal Regions annotated consistently on X & Y
• Recent enhancements:
– Using RNA-Seq evidence for gene prediction
– Gap-filling gene models using transcript sequences
– Annotation reports
Eukaryotic Genome
Annotation Pipeline
Overview
Ranking Alignments
• Rank alignments for each query sequence
– using a quality score that combines identity & coverage
– Rank-1 > Rank-2 > Rank-3…
• Conflicting alignments cannot have same rank
– alignments of the same query sequence to an assembly
conflict if they have significant overlap (>= 30%)
[Figure: two pairs of spans, one in alignment A and one in alignment B, illustrating insignificant overlap and significant overlap]
• A subset of rank-1 alignments is used for annotation
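The ranking rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's implementation: the slides say the quality score "combines identity & coverage" without giving a formula, so the product used here is an assumption, as is measuring overlap on the query span relative to the shorter of the two spans. The 30% threshold is the one quoted on the slide. All names (`Alignment`, `rank_alignments`, and so on) are hypothetical.

```python
from dataclasses import dataclass

OVERLAP_THRESHOLD = 0.30  # ">= 30%" threshold quoted on the slide


@dataclass(frozen=True)
class Alignment:
    query: str        # e.g. an mRNA name
    target: str       # chromosome or scaffold name
    q_start: int      # start of the aligned span on the query (0-based)
    q_end: int        # end of the aligned span on the query (exclusive)
    identity: float   # fraction of identical bases, 0..1
    coverage: float   # fraction of the query that is aligned, 0..1

    @property
    def score(self) -> float:
        # Assumption: the slides give no formula, so use a simple product.
        return self.identity * self.coverage


def overlap_fraction(a: Alignment, b: Alignment) -> float:
    """Shared query span as a fraction of the shorter span (an assumption)."""
    shared = min(a.q_end, b.q_end) - max(a.q_start, b.q_start)
    shorter = min(a.q_end - a.q_start, b.q_end - b.q_start)
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def conflicts(a: Alignment, b: Alignment) -> bool:
    """Alignments of the same query conflict if they overlap significantly."""
    return a.query == b.query and overlap_fraction(a, b) >= OVERLAP_THRESHOLD


def rank_alignments(alignments):
    """Greedy ranking: best score first, lowest rank with no conflict."""
    by_rank = {}   # rank -> alignments already placed at that rank
    ranks = {}     # alignment -> rank
    for aln in sorted(alignments, key=lambda x: x.score, reverse=True):
        rank = 1
        while any(conflicts(aln, other) for other in by_rank.get(rank, [])):
            rank += 1
        by_rank.setdefault(rank, []).append(aln)
        ranks[aln] = rank
    return ranks


# Toy data: mRNA-F1 aligns to two places, mRNA-F2 to one.
f1_chr1 = Alignment("mRNA-F1", "Chr1", 0, 1000, 0.99, 1.00)
f1_scaf = Alignment("mRNA-F1", "Unplaced scaffold1", 0, 900, 0.95, 0.90)
f2_scaf = Alignment("mRNA-F2", "Unplaced scaffold1", 0, 800, 0.98, 1.00)

ranks = rank_alignments([f1_chr1, f1_scaf, f2_scaf])
rank_1 = [aln for aln, rank in ranks.items() if rank == 1]
for aln in rank_1:
    print(aln.query, "->", aln.target)
```

In this toy case the two alignments of mRNA-F1 overlap completely on the query, so they cannot share a rank; the weaker one drops to rank 2 and would be filtered out before annotation.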
Annotation Of A Simple Assembly Using Ranked Alignments
[Figure: input mRNAs (mRNA-F1, mRNA-F2) and genes in the assembly (GeneF1, GeneF2) on Chr1 and Unplaced scaffold1. Steps shown: align mRNAs; rank alignments (Rank-1, Rank-2, Rank-3); filter out alignments that are not rank-1; resulting annotation.]
What to do with alternate loci & patch scaffolds?
1. Omit the alternate loci & patch scaffolds
2. Include the alternate loci & patch scaffolds;
no special treatment
3. Include the alternate loci & patch scaffolds;
use known relationships to primary assembly
Annotation Omitting Alt-scaffolds
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); genes/alleles represented in the assembly (Gene1/A and Gene2 on Primary Chr1; Gene1/B and Gene3 on Alt-scaffold1 and Alt-scaffold2); rank-1 mRNA alignments to Primary Chr1; resulting annotation on Primary Chr1.]
Scenario 1: no annotation for Gene3
no annotation for Gene1/Allele-B
Scenario 2: Gene3 annotated at the wrong location
no annotation for Gene1/Allele-B
Annotation Using Alt-scaffolds Without Alt-to-primary Alignments
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); genes/alleles represented in the assembly (Gene1/A and Gene2 on Primary Chr1; Gene1/B and Gene3 on Alt-scaffold1 and Alt-scaffold2); rank-1 mRNA alignments to Primary Chr1 and both alt-scaffolds; resulting annotation (Gene1/A, Gene2, Gene3, Gene4).]
Annotation Using Alt-scaffolds & Alt-to-primary Alignments
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); genes/alleles represented in the assembly (Gene1/A and Gene2 on Primary Chr1; Gene1/B and Gene3 on Alt-scaffold1 and Alt-scaffold2), with alt-to-primary alignments; rank-1 mRNA alignments to Primary Chr1 and both alt-scaffolds; resulting annotation (Gene1/A, Gene2, Gene3, Gene1/B).]
Pros & cons of different choices for dealing with
alternate loci & patch scaffolds
1. Omit the alternate loci & patch scaffolds
Pros: Easy to implement
Cons: No representation for genes or alleles only on alts.
Incorrect models for genes that have been patched.
2. Include the alternate loci & patch scaffolds;
no special treatment
Pros: Easy to implement
Cons: Incorrectly annotate genes that have alternate alleles or patches
as if they were paralogs.
Wrongly penalize sequences for having multiple or ambiguous
placements.
3. Include the alternate loci & patch scaffolds;
use known relationships to primary assembly
Pros: Genes only on alts are annotated.
Correctly annotate genes with alternate alleles.
Correctly annotate patched genes.
Cons: Requires software and pipeline changes.
Eukaryotic Genome Annotation Pipeline: Steps using alt-to-primary alignments
[Figure: pipeline flowchart; labelled inputs: alt-to-primary alignments, curated gene localization]
Ranking Alignments Across Assembly Units
• Create graph of related alignments
– Alignments that are collocated or mappable
– Transcript/protein to genomic
– Alt or patch scaffold to primary assembly
• Partition graph into clusters
– Each alignment in the cluster is related to at least one other
alignment in the same cluster
– No alignment is related to any alignment in another cluster
– Split conflicting alignments within a cluster into separate groups
– Merge non-conflicting clusters into groups
• Evaluate groups, sort and assign ranks
– All alignments in a group get the same rank
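The clustering and ranking steps above can be illustrated with a small sketch. It is written under several assumptions that are not in the slides: alignments are identified by plain string labels, "related" and "conflicting" pairs are supplied by the caller, a group is scored by its best member, and the "merge non-conflicting clusters" step is left out for brevity. Function names are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Partition the alignment graph into clusters using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for n in nodes:
        clusters[find(n)].append(n)
    return list(clusters.values())


def split_into_groups(cluster, conflict_pairs, score):
    """Split conflicting alignments within a cluster into separate groups."""
    groups = []
    for aln in sorted(cluster, key=lambda n: score[n], reverse=True):
        for group in groups:
            if not any(frozenset((aln, m)) in conflict_pairs for m in group):
                group.append(aln)
                break
        else:
            groups.append([aln])
    return groups


def rank_groups(nodes, related_edges, conflict_pairs, score):
    """Evaluate groups, sort them, and give every member the group's rank."""
    groups = []
    for cluster in connected_components(nodes, related_edges):
        groups.extend(split_into_groups(cluster, conflict_pairs, score))
    # Assumption: a group is evaluated by its best-scoring alignment.
    groups.sort(key=lambda g: max(score[n] for n in g), reverse=True)
    return {n: rank for rank, g in enumerate(groups, start=1) for n in g}


# Toy data for one mRNA: three alignments at corresponding loci on the
# primary assembly and two alt scaffolds, plus one at an unrelated site.
score = {"primary": 0.99, "alt1": 0.97, "alt2": 0.98, "elsewhere": 0.90}
related = [("primary", "alt1"), ("primary", "alt2")]  # via alt-to-primary alignments
ranks = rank_groups(list(score), related, set(), score)
print(ranks)
```

Because the primary, alt1 and alt2 alignments are linked through the alt-to-primary alignments, they fall into one group and share rank 1, so the alternate alleles are not penalized as if they were competing paralogous placements.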
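Ranked Alignment Groups Across Assembly Units
[Figure: mRNA1 and mRNA2 alignments to Assembly1-Primary, Assembly1-Alt1 and Assembly1-Alt2, linked by assembly alignments and collected into clusters and rank groups (Rank-1, Rank-2). Legend: assembly unit, assembly alignment, mRNA1 alignment, mRNA2 alignment, cluster, rank group.]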
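Ranked Alignment Groups Across Assemblies
[Figure: mRNA1 and mRNA2 alignments to Assembly1-Primary, Assembly1-Alt1, Assembly1-Alt2, Assembly2-Primary and Assembly3-Primary, linked by assembly alignments and collected into clusters and rank groups (Rank-1, Rank-2). Legend: assembly unit, assembly alignment, mRNA1 alignment, mRNA2 alignment, cluster, rank group.]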
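Ranked Alignment Groups Across Pseudoautosomal Regions (PARs)
[Figure: mRNA1 and mRNA2 alignments to Chromosome X and Chromosome Y, linked by the PAR alignments for PAR#1 and PAR#2 and collected into clusters and a Rank-1 rank group. Legend: chromosome, PAR alignment, mRNA1 alignment, mRNA2 alignment, cluster, rank group.]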
Genes Only Annotated On GRCh38 Alternate Loci
NCBI > Gene > "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter]
NCBI > Gene > "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]
Num. Gene Type
20 Protein Coding
40 Protein Coding (model)
21 Pseudogene
32 Pseudogene (model)
32 ncRNA (model)
5 Other
3 Other (model)
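For readers who prefer to run the Gene query programmatically, the sketch below builds a request URL for the NCBI E-utilities `esearch` endpoint from the first query string on this slide. It only constructs the URL and does not contact NCBI. The helper name `esearch_url` is hypothetical, and there is no guarantee that the query returns the same counts today as it did for the GRCh38 annotation shown here.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query text copied from the slide (straight quotes, as Entrez expects).
TERM = (
    '"Homo sapiens"[orgn] '
    'AND "only annotated on alternate loci in reference assembly"[Text Word] '
    'AND gene_nucleotide_pos[filter]'
)


def esearch_url(term: str, db: str = "gene", retmax: int = 0) -> str:
    """Build an esearch URL; retmax=0 asks for the hit count only."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url(TERM)
print(url)
```

Fetching the printed URL (for example with `urllib.request.urlopen`) would return a JSON document whose `esearchresult` section carries the hit count.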
Different Alleles Annotated On GRCh38
Primary Assembly And Alternate Loci
ALT_REF_LOCI_2
ALT_REF_LOCI_7
NM_001243042.1 comment:
This variant represents the C*07:01:01:01 allele of the HLA-C gene.
NM_002117.5 comment:
This variant represents the C*07:02:01 allele of the HLA-C gene.
Annotation Of GRCh37 Improved By Patch Scaffold
EPPK1 gene on primary assembly chromosome 8 has an internal deletion.
EPPK1 gene on patch scaffold is complete.
Primary Assembly chromosome 8
Patch scaffold HG104_HG975_PATCH
Pseudoautosomal Regions Annotated Consistently
on GRCh38 chromosomes X & Y
Recent Enhancements To The Genome Annotation Pipeline:
#1 Using RNA-Seq Evidence For Gene Prediction
[Figure: two bar charts comparing predictions without RNA-Seq and with RNA-Seq: number of coding transcripts predicted, and number of genes predicted]
75 organisms annotated with RNA-Seq data
Example Of Tracks Made Using RNA-Seq Data
NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]
Recent Enhancements To The Genome Annotation Pipeline:
#2 Gap-filling Gene Models Using Transcript Sequences
How gap-filling works
[Figure: a transcript alignment with blocks numbered 1 to 4 against a genomic sequence that contains a gap, and the resulting RefSeq model]
Reporting of gap-filled regions
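A toy sketch of the gap-filling idea follows. The slides do not describe the actual procedure, so this assumes a simplified representation: the transcript alignment is a list of blocks, each mapping a transcript interval to a genomic start position, with `None` marking a block that falls in an assembly gap. Aligned blocks take their sequence from the genome, unaligned blocks take it from the transcript, and the filled intervals are reported in model coordinates. The function name and data layout are hypothetical.

```python
def gap_fill(genome: str, transcript: str, blocks):
    """Build a model sequence, filling blocks that fall in a genomic gap.

    blocks: list of (t_start, t_end, g_start) tuples, 0-based and
    end-exclusive on the transcript; g_start is None when the block
    has no genomic alignment because it lies in a gap.
    Returns (model_sequence, filled_regions_in_model_coordinates).
    """
    parts = []
    filled = []
    pos = 0  # current position in the model sequence
    for t_start, t_end, g_start in blocks:
        length = t_end - t_start
        if g_start is None:
            # No genomic sequence available: borrow it from the transcript.
            parts.append(transcript[t_start:t_end])
            filled.append((pos, pos + length))
        else:
            parts.append(genome[g_start:g_start + length])
        pos += length
    return "".join(parts), filled


# Toy data: the middle exon of the transcript lies in a run of Ns.
genome = "ATGGCC" + "GTAAG" + "NNNNNNNN" + "CAG" + "TTTTGA"
transcript = "ATGGCC" + "GATTAC" + "TTTTGA"
blocks = [(0, 6, 0), (6, 12, None), (12, 18, 22)]

model, filled = gap_fill(genome, transcript, blocks)
print(model)
print(filled)
```

Returning the filled intervals alongside the sequence mirrors the point of the slide: a gap-filled model is only trustworthy if the regions that came from the transcript rather than the genome are reported.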
Recent Enhancements To The Genome Annotation Pipeline:
#3 Annotation Reports
RNA-Seq
Summary
Including the alternate loci & patch scaffolds and
using their known relationships to the primary
assembly significantly improves the annotation of
GRC assemblies.
It is worth the extra effort!
CREDITS
Genome pipeline infrastructure
Alex Astashyn
Nathan Bouk
Rob Cohen
Mike Dicuccio
Eric Engelson
Olga Ermoloeva
Wratko Hlavina
Lucian Ion
Avi Kimchi
Boris Kiryutin
David Managadze
Eyal Mozes
Terence Murphy
Daniel Rausch
Robert Smith
Sasha Souvorov
Craig Wallin
Alex Zasypkin
Eukaryotic annotation
setup & execution
Françoise Thibaud-Nissen
Jinna Choi
Patrick Masterson
Kim Pruitt and the “genome champions”
from the RefSeq group
Genomic Collections DB
Avi Kimchi
Victor Sapojnikov
Charlie Xiang
Andrey Zherikov
Genome assemblies with alt/patch to primary alignments
Genome Reference Consortium
The Wellcome Trust Sanger Institute
The Genome Institute at Washington University
The European Bioinformatics Institute
The National Center for Biotechnology Information
Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/

THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
Renu Jangid
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 

Recently uploaded (20)

Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 

The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

  • 1. GRC Assembly Analysis Workshop At Genome Informatics September 21, 2014 The NCBI Eukaryotic Genome Annotation Pipeline And Alternate Genomic Sequences Paul Kitts NCBI National Center for Biotechnology Information
  • 2. Genomes Annotated By NCBI Human GRCh38 2014-02-03 Zebrafish GRCz10 in progress Mouse GRCm38.p2 2013-12-27
  • 3. Outline • Overview of the NCBI Eukaryotic Genome Annotation Pipeline • What to do with alternate loci & patch scaffolds? • How we use the alt/patch/PAR alignments to inform our annotation • Examples: – Annotation only on alternate loci – Different alleles annotated on primary assembly and alternate loci – Annotation improved by patches – Pseudoautosomal Regions annotated consistently on X & Y • Recent enhancements: – Using RNA-Seq evidence for gene prediction – Gap-filling gene models using transcript sequences – Annotation reports
  • 5. Ranking Alignments • Rank alignments for each query sequence – using a quality score that combines identity & coverage – Rank-1 > Rank-2 > Rank-3… • Conflicting alignments cannot have same rank – alignments of the same query sequence to an assembly conflict if they have significant overlap (>= 30%) – Insignificant – Significant • A subset of rank-1 alignments is used for annotation Span in alignment B Span in alignment A Span in alignment B Span in alignment A
  • 6. mRNA-F1 Annotation Of A Simple Assembly Using Ranked Alignments mRNA-F1 mRNA-F2 Input mRNAsGenes in the assembly mRNA-F2 Unplaced scaffold1 mRNA-F1 Filter out alignments that are not rank-1 GeneF1 GeneF2Chr1 GeneF1Chr1 Resulting annotation GeneF2 Unplaced scaffold1 mRNA-F2 mRNA-F1 * * ** * * *mRNA-F1mRNA-F2 * * Rank alignments Unplaced scaffold1 GeneF2Chr1 GeneF1 Rank-1 Rank-2 Rank-3Rank-1 Rank-2 Align mRNAs Unplaced scaffold1GeneF1 GeneF2Chr1
  • 7. What to do with alternate loci & patch scaffolds? 1. Omit the alternate loci & patch scaffolds 2. Include the alternate loci & patch scaffolds; no special treatment 3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly
  • 8. Gene1/A G2-Allele-APrimary Chr1 Resulting annotation Gene2 mRNA-3A * * * Annotation Omitting Alt-scaffolds mRNA-1A mRNA-1B mRNA-2A Input mRNAs Gene3 Primary Chr1 Alt-scaffold1 Genes/Alleles represented in the assembly Gene1/A Gene2 Gene1/B Alt-scaffold2 mRNA-3A Gene3 Scenario 1: no annotation for Gene3 no annotation for Gene1/Allele-B ✔ mRNA-1A Rank-1 mRNA alignments Gene1/A Gene2Primary Chr1 mRNA-2A ✗✔ Scenario 2: Gene3 annotated at the wrong location no annotation for Gene1/Allele-B
  • 9. Gene1/A G2-Allele-A Gene3 Gene4 Primary Chr1 Alt-scaffold2 Alt-scaffold1 Resulting annotation Gene2 Annotation Using Alt-scaffolds Without Alt-to-primary Alignments mRNA-1A mRNA-1B mRNA-2A Input mRNAs Gene3 Primary Chr1 Alt-scaffold1 Genes/Alleles represented in the assembly Gene1/A Gene2 Gene1/B Alt-scaffold2 ✔ ✗ ✔ mRNA-3A mRNA-1A mRNA-1B mRNA-3A Rank-1 mRNA alignments Gene1/A Gene2 Gene3 Gene1/B Primary Chr1 Alt-scaffold2 Alt-scaffold1 mRNA-2A ✔
  • 10. Gene1/A G2-Allele-A Gene3 Gene1/B Primary Chr1 Alt-scaffold2 Alt-scaffold1 Resulting annotation Gene2 Annotation Using Alt-scaffolds & Alt-to-primary Alignments mRNA-1A mRNA-1B mRNA-2A Input mRNAs Gene3 Primary Chr1 Alt-scaffold1 Genes/Alleles represented in the assembly Gene1/A Gene2 Gene1/B Alt-scaffold2 alt-to-primary alignment ✔ ✔ ✔ mRNA-3A mRNA-1A mRNA-1B mRNA-3A Rank-1 mRNA alignments Gene1/A Gene2 Gene3 Gene1/B Primary Chr1 Alt-scaffold2 Alt-scaffold1 mRNA-2A ✔
  • 11. Pros & cons of different choices for dealing with alternate loci & patch scaffolds 1. Omit the alternate loci & patch scaffolds Pros: Easy to implement Cons: No representation for genes or alleles only on alts. Incorrect models for genes that have been patched. 2. Include the alternate loci & patch scaffolds; no special treatment Pros: Easy to implement Cons: Incorrectly annotate genes that have alternate alleles or patches as if they were paralogs. Wrongly penalize sequences for having multiple or ambiguous placements. 3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly Pros: Genes only on alts are annotated. Correctly annotate genes with alternate alleles. Correctly annotate patched genes Cons: Requires software and pipelines changes
  • 12. Eukaryotic Genome Annotation Pipeline: Steps using alt-to-primary alignments Alt-to-primary alignments Curated gene localization
  • 13. Ranking Alignments Across Assembly Units • Create graph of related alignments – Alignments that are collocated or mappable – Transcript/protein to genomic – Alt or patch scaffold to primary assembly • Partition graph into clusters – Each alignment in the cluster is related to at least one other alignment in the same cluster – No alignment is related to any alignment in another cluster – Split conflicting alignments within a cluster into separate groups – Merge non-conflicting clusters into groups • Evaluate groups, sort and assign ranks – All alignments in a group get the same rank
  • 14. Ranked Alignment Groups Across Assembly Units Assembly unit Assembly alignment mRNA1 alignment mRNA2 alignment Cluster Rank group Assembly1-Primary Assembly1-Alt1 Assembly1-Alt2 Rank-1 Rank-2
  • 15. Ranked Alignment Groups Across Assemblies Assembly unit Assembly alignment mRNA1 alignment mRNA2 alignment Cluster Rank group Assembly1-Primary Assembly1-Alt1 Assembly1-Alt2 Rank-1 Rank-2 Assembly2-Primary Assembly3-Primary
  • 16. Ranked Alignment Groups Across Pseudoautosomal Regions (PARs) Chromosome PAR alignment mRNA1 alignment mRNA2 alignment Cluster Rank group Chromosome Y Chromosome X Rank-1 PAR#1 PAR#2
  • 17. Genes Only Annotated On GRCh38 Alternate Loci Query: NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] Query constrained to protein-coding genes with known RefSeqs: NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop] Number of genes by gene type: 20 Protein Coding; 40 Protein Coding (model); 21 Pseudogene; 32 Pseudogene (model); 32 ncRNA (model); 5 Other; 3 Other (model)
  • 18. Different Alleles Annotated On GRCh38 Primary Assembly And Alternate Loci ALT_REF_LOCI_2 ALT_REF_LOCI_7 NM_001243042.1 comment: This variant represents the C*07:01:01:01 allele of the HLA-C gene. NM_002117.5 comment: This variant represents the C*07:02:01 allele of the HLA-C gene.
  • 19. Annotation Of GRCh37 Improved By Patch Scaffold EPPK1 gene on primary assembly chromosome 8 has an internal deletion. EPPK1 gene on patch scaffold is complete. Primary Assembly chromosome 8 Patch scaffold HG104_HG975_PATCH
  • 20. Pseudoautosomal Regions Annotated Consistently on GRCh38 chromosomes X & Y
  • 21. Recent Enhancements To The Genome Annotation Pipeline: #1 Using RNA-Seq Evidence For Gene Prediction [Charts: number of genes predicted and number of coding transcripts predicted, without RNA-Seq vs. with RNA-Seq] 75 organisms annotated with RNA-Seq data
  • 22. Example Of Tracks Made Using RNA-Seq Data NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]
  • 23. Recent Enhancements To The Genome Annotation Pipeline: #2 Gap-filling Gene Models Using Transcript Sequences Genomic sequence Transcript alignment 1 32 4 RefSeq model Gap How gap-filling works Reporting of gap-filled regions
  • 24. Recent Enhancements To The Genome Annotation Pipeline: #3 Annotation Reports RNA-Seq
  • 25. Summary Including the alternate loci & patch scaffolds and using their known relationships to the primary assembly significantly improves the annotation of GRC assemblies. It is worth the extra effort!
  • 26. CREDITS Genome pipeline infrastructure Alex Astashyn Nathan Bouk Rob Cohen Mike Dicuccio Eric Engelson Olga Ermoloeva Wratko Hlavina Lucian Ion Avi Kimchi Boris Kiryutin David Managadze Eyal Mozes Terence Murphy Daniel Rausch Robert Smith Sasha Souvorov Craig Wallin Alex Zasypkin Eukaryotic annotation setup & execution Françoise Thibaud-Nissen Jinna Choi Patrick Masterson Kim Pruitt and the “genome champions” from the RefSeq group Genomic Collections DB Avi Kimchi Victor Sapojnikov Charlie Xiang Andrey Zherikov Genome assemblies with alt/patch to primary alignments Genome Reference Consortium The Wellcome Trust Sanger Institute The Genome Institute at Washington University The European Bioinformatics Institute The National Center for Biotechnology Information Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/
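
The ranking rules on slide 5 (score each alignment of a query, and never give two conflicting alignments the same rank) can be sketched in Python. This is a minimal illustration, not the pipeline's code. The slides say only that the quality score "combines identity & coverage", so the product used here is an assumption. Measuring overlap on the query as a fraction of the shorter span is also an assumption, as are the greedy rank assignment and every name in the block (`Alignment`, `rank_alignments`, and so on).

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One alignment of a query sequence (hypothetical structure)."""
    name: str
    query_start: int   # 0-based, inclusive
    query_end: int     # exclusive
    identity: float    # fraction in [0, 1]
    coverage: float    # fraction in [0, 1]


def score(aln):
    # Assumption: the slides do not give the formula, so use a simple product.
    return aln.identity * aln.coverage


def overlap_fraction(a, b):
    # Assumption: overlap is measured on the query, relative to the shorter span.
    shared = min(a.query_end, b.query_end) - max(a.query_start, b.query_start)
    shorter = min(a.query_end - a.query_start, b.query_end - b.query_start)
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def conflicts(a, b, threshold=0.30):
    # Slide 5: overlap of 30% or more counts as significant.
    return overlap_fraction(a, b) >= threshold


def rank_alignments(alignments):
    """Return {alignment name: rank} for the alignments of one query."""
    ranks = {}
    members = []  # members[r - 1] holds the alignments already given rank r
    for aln in sorted(alignments, key=score, reverse=True):
        for rank, group in enumerate(members, start=1):
            # Take the best rank that holds no conflicting alignment.
            if not any(conflicts(aln, other) for other in group):
                group.append(aln)
                ranks[aln.name] = rank
                break
        else:
            members.append([aln])
            ranks[aln.name] = len(members)
    return ranks


if __name__ == "__main__":
    example = [
        Alignment("Chr1:GeneF1", 0, 1000, 1.00, 1.00),
        Alignment("Chr1:GeneF2", 0, 1000, 0.95, 1.00),
        Alignment("Unplaced scaffold1", 0, 900, 0.90, 0.90),
    ]
    print(rank_alignments(example))
```

Under these assumptions, full-length alignments of one mRNA to several members of a gene family all conflict with each other and receive successive ranks, while two partial alignments covering different halves of the query can both be rank-1.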

Editor's Notes

  1. NCBI developed a genome annotation pipeline 14 years ago to annotate draft versions of the human genome assembly. Since then we have continued to develop and improve the annotation pipeline and to apply it to more and more genomes. So far, we have annotated the genomes for over 150 different eukaryotes, >> we last annotated the GRC’s mouse assembly in December, the GRC’s human assembly in February, and our annotation of the GRC’s new zebrafish assembly is humming along as I speak.
  2. Since the theme of this workshop is using the full GRC assemblies, including the alternate loci and patch scaffolds, and making use of the known relationship between these scaffolds and the primary assembly, the focus of my talk will be on how NCBI uses this information in our annotation pipeline. I will begin by giving an overview of the NCBI eukaryotic genome annotation pipeline. Then I will raise the question of what to do with alternate loci and patch scaffolds, sketching out the consequences of some different choices. I will explain how we use the alt, patch, & PAR alignments to inform our annotation, and then show some examples of the results. I will finish by briefly highlighting some recent enhancements to our annotation pipeline.
  3. Here is a flowchart of our genome annotation pipeline. The substrate for annotation is one or more genome assemblies (in grey). The genomic sequences are masked, and transcripts (blue), proteins (green) and RNA-Seq reads (orange) are aligned to the genome. If available for the organism being annotated, curated RefSeq genomic sequences (pink) are also aligned. The alignments of these different types of evidence go through a ranking step, which I will describe in more detail later, and a filtering step, before being fed into the gene model prediction step. The best models are selected from those known RefSeqs that aligned to the genome and from the predicted models. The selected models are then assigned to genes and named. After that, the annotation products are formatted and deployed to NCBI’s public resources: Nucleotide, Protein, Gene, BLAST & FTP. >> In order to talk about how we handle alt and patch scaffolds in our pipeline, I first need to give you more detail on how the alignments are ranked >>
  4. We rank alignments for each query sequence using a quality score that combines identity & coverage. So the best alignment gets rank-1, the next best alignment gets rank-2, and so on. … Some rank-1 alignments may not be used: because even rank-1 alignments may not be good enough quality, because curated input may exclude a transcript from being placed on certain assemblies or on certain chromosomes, or because the alignments may be rejected later at the best model selection step.
  5. I am now going to show a cartoon of how we annotate a simple assembly using ranked alignments. The genomic sequence is shown in orange, with the locations of a couple of genes from a gene family shown in grey. We have mRNA sequences corresponding to these genes shown in green. >> We start by aligning the mRNAs. Alignments are shown in blue, with red stars indicating mismatches. The F1 mRNA aligns not only to the gene from which it is derived, but it also aligns to other genes in the same family, and even to less related sequences. >> the F2 mRNA also aligns to multiple locations. >> So then we rank the alignments of mRNA-F1, >> and the alignments of mRNA-F2. >> We filter out alignments that are not rank-1. >> Keeping only the rank-1 alignments allows us to correctly annotate the two genes in this family. So you see that annotating a simple assembly is pretty straightforward, at least in this cartoon! >> so then the question is…
  6. Here I list three choices. … I will now sketch out the consequences of each of these three choices for gene annotation.
  7. … >> align the mRNAs, rank them, and filter out the non-rank-1 alignments. >> The resulting annotation is good, with Gene1/Allele-A & Gene2 correctly annotated, but incomplete. Our annotation release would be missing Gene3 & Allele-B of Gene1. >> neither we nor our users would be happy. Since the true home for mRNA-3A is only present on one of the alt loci scaffolds that was omitted, one of two things could happen. In the first scenario I showed you, mRNA-3A fails to align anywhere. >> An alternative scenario is that mRNA-3A aligns imperfectly to another gene or pseudogene. Because the true home for this mRNA is missing, this alignment would get assigned rank-1. >> as a result Gene3 would be annotated at the wrong location. >> again, neither we nor our users would be happy.
  8. … >> after aligning the mRNAs, ranking them, and filtering out the non-rank-1 alignments, the picture would look like this… >> using these rank-1 alignments for annotation correctly annotates Gene2, Gene3 and Gene1/Allele-A. But Allele-B of Gene1 on the alt-scaffold is incorrectly annotated as being a different gene because we had nothing to relate alt-scaffold1 to the primary assembly, and consequently failed to recognize that these were variants of the same gene. >> we would not feel good about putting out such annotation.
  9. … In some other pipelines, a sequence that aligns to equivalent locations on both the primary assembly and on an alt or patch scaffold may be penalized for having multiple or ambiguous placements.
  10. The NCBI Eukaryotic Genome Annotation pipeline has been engineered to use alt-to-primary alignments in two steps: first, in the ranking of transcript & protein alignments; and second in gene assignment & naming. >> We also use input on gene localization to particular chromosomes or assembly-units that our curators maintain. The challenge is how to do ranking across assembly-units. >>
  11. Here is an outline of our algorithm for doing this.
  12. … All the alignments in the group on the right get assigned rank-1. This is what enables us to annotate a gene consistently when it appears in equivalent locations on more than one assembly unit.
  13. Here I have added two additional assemblies into the picture, along with assembly-assembly alignments that relate segments of assembly 1 to assembly 2 or assembly 3. >> the algorithm used to cluster related alignments and rank them can be applied to this more complex picture. This allows us to annotate the same gene consistently across multiple assemblies. For example in our most recent human annotation run, we annotated the HuRef and CHM1 assemblies as well as GRCh38.
  14. The algorithm used to cluster related alignments and rank them is also applicable to pseudoautosomal regions. In this case, alignments between the pseudoautosomal regions on chromosome X and chromosome Y are used to cluster transcript alignments for ranking. Enough of the theory. How do we do in practice?
  15. The NCBI Gene resource tags genes only annotated on the alts with the phrase “only annotated on alternate loci in reference assembly”, which makes it easy to retrieve this set of genes. Running this query shows that there are 153 genes only annotated on the alts. >> the search can be further constrained to just the protein coding genes from known RefSeqs, by adding terms to the query. >> Doing variations of this search shows that the genes only annotated on the alts include 20 protein coding genes with known RefSeqs, another 40 protein coding gene models, and a number of pseudogenes, non-coding RNAs and genes of other types. 
  16. My next example is different alleles… On the top we have chromosome 6 of the primary-assembly around the HLA-C gene, lower down are a scaffold from ALT_REF_LOCI_2 and a scaffold from ALT_REF_LOCI_7. Between them are grey bars showing the alt-to-primary alignments with red lines indicating mismatches, and blue hour-glasses indicating insertions or deletions. The green bar shows the extent of the gene. The blue bar shows the mRNA, with thick parts representing exons and thin parts introns. The red bar is the coding sequence. The two boxes below point out that our annotation used different RefSeq mRNAs for the HLA-C gene on primary vs ALT_REF_LOCI_2, and the comments on these mRNA records identify them as representing different HLA-C alleles. NT_113891.3 Homo sapiens chromosome 6 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_2 HSCHR6_MHC_COX_CTG1 NT_167249.2 Homo sapiens chromosome 6 genomic scaffold, GRCh38 alternate locus group ALT_REF_LOCI_7 HSCHR6_MHC_SSTO_CTG1 http://www.ncbi.nlm.nih.gov/nuccore/NC_000006.12?report=graph&from=31268240&to=31272643&strand=true&app_context=Gene&assm_context=GCF_000001405.26
  17. The GRC has not yet released any patches to GRCh38, so I went back to our annotation of GRCh37.p13 for an example of how annotation can be improved by a patch scaffold. On the top is chromosome 8 from the primary assembly in the region of the EPPK1 gene. This primary assembly has an internal deletion in this gene that was corrected in the patch scaffold shown below. The patch-to-primary alignment, shown as this grey bar, helped us to annotate the EPPK1 gene at corresponding locations on the primary assembly chromosome & patch scaffold.
  18. My final example, is…
  19. I just quickly want to tell you about some recent enhancements to the genome annotation pipeline, the first of which is using RNA-Seq data as evidence for gene prediction. We have annotated 75 eukaryotes using RNA-Seq data over the last eighteen months. The chart on the left shows that adding RNA-Seq increased the number of genes predicted by about 20% on average for the organisms in this chart. The chart on the right shows that adding RNA-Seq resulted in an even bigger increase in the number of coding transcripts, as more transcript variants were annotated, an average increase of 88% for these reference assemblies. For many of the less well studied eukaryotes that we have annotated, RNA-Seq data has provided the primary source of evidence for generating gene models.
  20. Here is a graphical view of the Xenopus tropicalis nbr1 gene. We generated four model transcript variants for the nbr1 gene, shown in green. Below the genes track are three tracks generated using RNA-Seq data. The first shows a histogram of the RNA-Seq coverage of the exons, the second track, which looks like a mirror image of the first, shows a histogram of the intron-spanning reads, and the third track shows the intron features inferred from the RNA-Seq alignments.
  21. In April this year we enhanced our annotation pipeline by adding gap-filling of gene models using transcript sequences. What this means is that we can extend model RefSeqs into assembly gaps based on alignments of transcripts or Transcriptome Shotgun Assemblies. How this works is illustrated here. This genomic sequence contains a gap. A transcript aligns to the genomic sequence except for the 5’ end which falls in the gap in the assembly. We can construct a model RefSeq transcript based on the transcript sequence for exon 1 and the genomic sequence for exons 2, 3 & 4. >> Gap-filled regions are reported in the flat file record of transcript and protein models. <Example is platypus model XM_007659754.1>
  22. Last November we began making annotation reports for each annotation run that we do, both as a web page and as XML files. I will give you just a quick taste of what is in each report. At the top of the report is the Annotation Release information, which includes the date the input data was frozen, and the date the annotation was made public. The assemblies that were annotated are also shown in this section. >> Gene and feature statistics, for example compare Gene or CDS counts between different assemblies or assembly units. >> RefSeq transcript alignment quality report. RefSeq alignments can be used as a metric for relative assembly quality. >> RNA-Seq alignment details
  23. The GRC has not released any patches to GRCh38 yet, so I went back to our annotation of GRCh37.p13 for an example of how annotation can be improved by a patch scaffold. This patch scaffold adds a component that extends the chromosome 17 sequence <point in Tiling Path track>. This is reflected in the Patch to Primary alignment which ends at the junction with the new component. The DOC2B gene on the primary assembly chromosome is partial (as indicated by the double black arrows and the grey bar). It is missing the last three exons. The extra sequence in the patch included the missing exons, hence, the DOC2B gene on the patch scaffold is complete. The patch-to-primary alignment helped us to annotate the DOC2B gene at corresponding locations on the primary assembly chromosome & patch scaffold. NW_004070872.2 Homo sapiens chromosome 17 genomic patch of type FIX, GRCh37.p13 PATCHES HG417_PATCH NC_000017.10 Homo sapiens chromosome 17, GRCh37.p13 Primary Assembly http://www.ncbi.nlm.nih.gov/nuccore/NW_004070872.2?report=graph&from=81581&to=126850&strand=true&app_context=Gene&assm_context=GCF_000001405.25
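
The cross-assembly-unit ranking outlined on slide 13 (build a graph of related alignments, partition it into clusters, then give every alignment in a group the same rank) can be sketched as a connected-components problem. This is a minimal illustration under stated assumptions, not NCBI's implementation. It takes the "related" pairs as given input, such as transcript alignments whose genomic locations are linked by an alt-to-primary alignment. It skips the step that splits conflicting alignments within a cluster. It also evaluates each group by its best member score, which the slides do not specify. All function names and identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_alignments(ids, related_pairs):
    """Partition alignment ids into clusters (connected components)."""
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return list(clusters.values())


def rank_groups(scores, related_pairs):
    """Return {alignment id: rank}; every member of a group shares one rank.

    scores: {alignment id: quality score} for the alignments of one transcript.
    related_pairs: pairs of ids that are collocated or mappable to each other.
    """
    groups = cluster_alignments(list(scores), related_pairs)
    # Assumption: a group is evaluated by its best-scoring member.
    ordered = sorted(groups, key=lambda g: max(scores[i] for i in g), reverse=True)
    return {i: rank for rank, group in enumerate(ordered, start=1) for i in group}


if __name__ == "__main__":
    # Alignments of one mRNA: the gene on the primary assembly, its allele
    # on an alt scaffold, and a weaker hit elsewhere on the primary assembly.
    scores = {
        "Primary:Gene1": 0.99,
        "Alt-scaffold1:Gene1": 0.97,
        "Primary:pseudogene": 0.80,
    }
    related = [("Primary:Gene1", "Alt-scaffold1:Gene1")]
    print(rank_groups(scores, related))
```

In this sketch the two alignments linked by the alt-to-primary relationship land in one group and share rank-1, so the alt-scaffold copy is not demoted or treated as a paralog, while the unrelated weaker alignment gets rank-2.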