The NCBI Eukaryotic Genome Annotation Pipeline and Alternate Genomic Sequences

GRC Assembly Analysis Workshop At Genome Informatics
September 21, 2014
Paul Kitts
NCBI
National Center for Biotechnology Information

Genomes Annotated By NCBI
• Human GRCh38: 2014-02-03
• Zebrafish GRCz10: in progress
• Mouse GRCm38.p2: 2013-12-27

Outline
• Overview of the NCBI Eukaryotic Genome Annotation Pipeline
• What to do with alternate loci & patch scaffolds?
• How we use the alt/patch/PAR alignments to inform our annotation
• Examples:
– Annotation only on alternate loci
– Different alleles annotated on primary assembly and alternate loci
– Annotation improved by patches
– Pseudoautosomal Regions annotated consistently on X & Y
• Recent enhancements:
– Using RNA-Seq evidence for gene prediction
– Gap-filling gene models using transcript sequences
– Annotation reports

Eukaryotic Genome Annotation Pipeline: Overview

Ranking Alignments
• Rank alignments for each query sequence
– using a quality score that combines identity & coverage
– Rank-1 > Rank-2 > Rank-3…
• Conflicting alignments cannot have same rank
– alignments of the same query sequence to an assembly conflict if they have significant overlap (>= 30%)
• A subset of rank-1 alignments is used for annotation
[Figure: spans in alignment A and alignment B, shown once with insignificant overlap and once with significant overlap]
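The ranking rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's code: the slide does not give the scoring formula, so `identity * coverage` stands in for it; overlap is measured on the query span relative to the shorter alignment, which is also an assumption; and the `Alignment` class and all function names are hypothetical.

```python
from dataclasses import dataclass

OVERLAP_THRESHOLD = 0.30  # "significant overlap (>= 30%)" from the slide


@dataclass
class Alignment:
    name: str
    query_start: int  # 0-based, inclusive (assumed convention)
    query_end: int    # exclusive
    identity: float   # fraction of matching bases, 0..1
    coverage: float   # fraction of the query that is aligned, 0..1

    @property
    def score(self) -> float:
        # Assumption: a plain product stands in for the real quality score.
        return self.identity * self.coverage


def overlap_fraction(a: Alignment, b: Alignment) -> float:
    """Shared query span divided by the shorter of the two spans (assumed definition)."""
    shared = min(a.query_end, b.query_end) - max(a.query_start, b.query_start)
    shorter = min(a.query_end - a.query_start, b.query_end - b.query_start)
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def conflicts(a: Alignment, b: Alignment) -> bool:
    return overlap_fraction(a, b) >= OVERLAP_THRESHOLD


def rank_alignments(alignments):
    """Give each alignment the best rank not already held by a conflicting alignment."""
    ranks = {}
    members = []  # members[i] holds the alignments already assigned rank i + 1
    for aln in sorted(alignments, key=lambda x: x.score, reverse=True):
        for i, group in enumerate(members):
            if not any(conflicts(aln, other) for other in group):
                group.append(aln)
                ranks[aln.name] = i + 1
                break
        else:
            members.append([aln])
            ranks[aln.name] = len(members)
    return ranks


example = [
    Alignment("X", 0, 600, identity=0.98, coverage=0.6),
    Alignment("Y", 500, 1000, identity=0.98, coverage=0.5),
    Alignment("Z", 0, 1000, identity=0.50, coverage=1.0),
]
example_ranks = rank_alignments(example)
rank1_names = [name for name, rank in example_ranks.items() if rank == 1]
```

In this toy input X and Y overlap by only 20% of the shorter span, so they may share a rank, while Z overlaps X completely and has to take a lower one. Keeping only the rank-1 names mirrors the "filter out alignments that are not rank-1" step.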

Annotation Of A Simple Assembly Using Ranked Alignments
[Figure: mRNA-F1 and mRNA-F2 aligned to an assembly made up of Chr1 and Unplaced scaffold1, which carry GeneF1 and GeneF2. Panels: Input mRNAs; Genes in the assembly; Align mRNAs; Rank alignments (Rank-1, Rank-2, Rank-3); Filter out alignments that are not rank-1; Resulting annotation]

What to do with alternate loci & patch scaffolds?
1. Omit the alternate loci & patch scaffolds
2. Include the alternate loci & patch scaffolds; no special treatment
3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly

Annotation Omitting Alt-scaffolds
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A); genes/alleles represented in the assembly on Primary Chr1, Alt-scaffold1 and Alt-scaffold2 (Gene1/A, Gene1/B, Gene2, Gene3); rank-1 mRNA alignments to Primary Chr1, marked ✔ or ✗]
Scenario 1: no annotation for Gene3; no annotation for Gene1/Allele-B
Scenario 2: Gene3 annotated at the wrong location; no annotation for Gene1/Allele-B

Annotation Using Alt-scaffolds Without Alt-to-primary Alignments
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A) aligned to Primary Chr1, Alt-scaffold1 and Alt-scaffold2. Gene labels shown: Gene1/A, Gene1/B, Gene2, G2-Allele-A, Gene3, Gene4. Three items are marked ✔ and one ✗]

Annotation Using Alt-scaffolds & Alt-to-primary Alignments
[Figure: input mRNAs (mRNA-1A, mRNA-1B, mRNA-2A, mRNA-3A) aligned to Primary Chr1, Alt-scaffold1 and Alt-scaffold2, with the alt-to-primary alignment drawn between the scaffolds. Gene labels shown: Gene1/A, Gene1/B, Gene2, G2-Allele-A, Gene3. All four marked items are ✔]

Pros & cons of different choices for dealing with alternate loci & patch scaffolds
1. Omit the alternate loci & patch scaffolds
Pros: Easy to implement
Cons: No representation for genes or alleles only on alts.
      Incorrect models for genes that have been patched.
2. Include the alternate loci & patch scaffolds; no special treatment
Pros: Easy to implement
Cons: Incorrectly annotate genes that have alternate alleles or patches as if they were paralogs.
      Wrongly penalize sequences for having multiple or ambiguous placements.
3. Include the alternate loci & patch scaffolds; use known relationships to primary assembly
Pros: Genes only on alts are annotated.
      Correctly annotate genes with alternate alleles.
      Correctly annotate patched genes.
Cons: Requires software and pipeline changes

Eukaryotic Genome Annotation Pipeline: Steps using alt-to-primary alignments
[Figure: pipeline flowchart; labels: Alt-to-primary alignments, Curated gene localization]

Ranking Alignments Across Assembly Units
• Create graph of related alignments
– Alignments that are collocated or mappable
– Transcript/protein to genomic
– Alt or patch scaffold to primary assembly
• Partition graph into clusters
– Each alignment in the cluster is related to at least one other alignment in the same cluster
– No alignment is related to any alignment in another cluster
– Split conflicting alignments within a cluster into separate groups
– Merge non-conflicting clusters into groups
• Evaluate groups, sort and assign ranks
– All alignments in a group get the same rank
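The clustering steps above can be sketched roughly as follows. This is a simplified illustration, not the pipeline's implementation. Only transcript alignments are nodes here, and an edge means that two of them are collocated or mappable through an alt-to-primary alignment. Clusters are the connected components of that graph. The merging of non-conflicting clusters is left out. The group score (best member score) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_alignments(nodes, related_pairs):
    """Partition alignments into connected components of the 'related' graph."""
    parent = {n: n for n in nodes}
    for a, b in related_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    clusters = defaultdict(list)
    for n in nodes:
        clusters[find(parent, n)].append(n)
    return list(clusters.values())


def split_into_groups(cluster, conflict_pairs, score):
    """Split one cluster so that no group contains two conflicting alignments."""
    conflict = {frozenset(pair) for pair in conflict_pairs}
    groups = []
    for n in sorted(cluster, key=lambda x: score[x], reverse=True):
        for group in groups:
            if all(frozenset((n, m)) not in conflict for m in group):
                group.append(n)
                break
        else:
            groups.append([n])
    return groups


def rank_groups(nodes, related_pairs, conflict_pairs, score):
    """Rank groups by their best member; every member of a group gets the group's rank."""
    groups = []
    for cluster in cluster_alignments(nodes, related_pairs):
        groups.extend(split_into_groups(cluster, conflict_pairs, score))
    groups.sort(key=lambda g: max(score[n] for n in g), reverse=True)
    return {n: rank for rank, g in enumerate(groups, start=1) for n in g}


# Hypothetical alignments of one mRNA; names and scores are made up.
nodes = ["mRNA1@Primary", "mRNA1@Alt1", "mRNA1@Primary-b", "mRNA1@Elsewhere"]
related = [("mRNA1@Primary", "mRNA1@Alt1"), ("mRNA1@Primary", "mRNA1@Primary-b")]
conflicting = [("mRNA1@Primary", "mRNA1@Primary-b")]
score = {"mRNA1@Primary": 0.99, "mRNA1@Alt1": 0.98,
         "mRNA1@Primary-b": 0.50, "mRNA1@Elsewhere": 0.80}
group_ranks = rank_groups(nodes, related, conflicting, score)
```

The point of the sketch is that the alignment on the alt scaffold shares rank 1 with its counterpart on the primary assembly instead of being pushed down to rank 2 as if it were a competing placement.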

Ranked Alignment Groups Across Assembly Units
[Figure: mRNA1 and mRNA2 alignments on Assembly1-Primary, Assembly1-Alt1 and Assembly1-Alt2, combined into rank groups Rank-1 and Rank-2. Legend: Assembly unit; Assembly alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group]

Ranked Alignment Groups Across Assemblies
[Figure: mRNA1 and mRNA2 alignments on Assembly1-Primary, Assembly1-Alt1, Assembly1-Alt2, Assembly2-Primary and Assembly3-Primary, combined into rank groups Rank-1 and Rank-2. Legend: Assembly unit; Assembly alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group]

Ranked Alignment Groups Across Pseudoautosomal Regions (PARs)
[Figure: mRNA1 and mRNA2 alignments on Chromosome X and Chromosome Y, linked through PAR#1 and PAR#2 into a Rank-1 group. Legend: Chromosome; PAR alignment; mRNA1 alignment; mRNA2 alignment; Cluster; Rank group]

Genes Only Annotated On GRCh38 Alternate Loci
NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter]
NCBI> Gene> "Homo sapiens"[orgn] AND "only annotated on alternate loci in reference assembly"[Text Word] AND gene_nucleotide_pos[filter] AND "genetype protein coding"[prop] AND srcdb_refseq_known[prop]
Num.  Gene Type
20    Protein Coding
40    Protein Coding (model)
21    Pseudogene
32    Pseudogene (model)
32    ncRNA (model)
5     Other
3     Other (model)
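The Gene queries above are written for the NCBI web search box. As a hedged sketch, the same term could presumably also be sent to the E-utilities `esearch` endpoint. The slides do not mention E-utilities, so that endpoint, the `retmax` value and the helper name `build_gene_search_url` are all assumptions. The snippet only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (not mentioned in the slides; assumed here).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# First query from the slide, with straight quotes.
ALT_ONLY_TERM = (
    '"Homo sapiens"[orgn] AND '
    '"only annotated on alternate loci in reference assembly"[Text Word] AND '
    "gene_nucleotide_pos[filter]"
)


def build_gene_search_url(term: str, retmax: int = 200) -> str:
    """Return an esearch URL for the Gene database; the caller decides how to fetch it."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


alt_only_url = build_gene_search_url(ALT_ONLY_TERM)
```

Counts returned today will differ from the 2014 table, because Gene records are updated with each annotation release.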

Different Alleles Annotated On GRCh38 Primary Assembly And Alternate Loci
ALT_REF_LOCI_2
ALT_REF_LOCI_7
NM_001243042.1 comment:
This variant represents the C*07:01:01:01 allele of the HLA-C gene.
NM_002117.5 comment:
This variant represents the C*07:02:01 allele of the HLA-C gene.

Annotation Of GRCh37 Improved By Patch Scaffold
EPPK1 gene on primary assembly chromosome 8 has an internal deletion.
EPPK1 gene on patch scaffold is complete.
Primary Assembly chromosome 8
Patch scaffold HG104_HG975_PATCH

Pseudoautosomal Regions Annotated Consistently on GRCh38 chromosomes X & Y

Recent Enhancements To The Genome Annotation Pipeline:
#1 Using RNA-Seq Evidence For Gene Prediction
[Figure: two bar charts comparing predictions without RNA-Seq and with RNA-Seq: number of coding transcripts predicted (axis 0 to 80000) and number of genes predicted (axis 0 to 60000)]
75 organisms annotated with RNA-Seq data

Example Of Tracks Made Using RNA-Seq Data
NCBI > GENE > Xenopus (Silurana) tropicalis nbr1 [neighbor of BRCA1 gene 1]

#2 Gap-filling Gene Models Using Transcript Sequences
[Figure: how gap-filling works: a genomic sequence containing a gap, a transcript alignment with segments numbered 1 to 4, and the resulting RefSeq model. A second panel shows the reporting of gap-filled regions]
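The idea in the figure can be illustrated with a toy sketch. It is not the pipeline's algorithm. The assumptions are that the alignment is given as ungapped blocks, that introns are ignored, and that any transcript bases between two consecutive blocks fall in an assembly gap. The function name `gap_fill` and the coordinate conventions are hypothetical.

```python
def gap_fill(genome, transcript, blocks):
    """Build a model sequence from genomic bases, using transcript bases across gaps.

    blocks: sorted, non-overlapping (t_start, t_end, g_start) triples with
    half-open transcript coordinates; each block aligns without indels.
    Returns the model sequence and the gap-filled regions, both in transcript
    coordinates counted from the start of the transcript.
    """
    model = []
    filled = []
    pos = blocks[0][0]  # only fill between blocks, never before the first one
    for t_start, t_end, g_start in blocks:
        if t_start > pos:
            # Transcript bases with no genomic counterpart: take them from the transcript.
            model.append(transcript[pos:t_start])
            filled.append((pos, t_start))
        # Aligned bases: take them from the genome, so the model follows the assembly.
        model.append(genome[g_start:g_start + (t_end - t_start)])
        pos = t_end
    return "".join(model), filled


# Toy data: the genome has a run of N where the transcript reads TGCA.
toy_genome = "AAATCC" + "NNNNNN" + "GGGTTT"
toy_transcript = "AAACCC" + "TGCA" + "GGGTTT"
toy_blocks = [(0, 6, 0), (10, 16, 12)]
toy_model, toy_filled = gap_fill(toy_genome, toy_transcript, toy_blocks)
```

The list of filled intervals corresponds to the "reporting of gap-filled regions" panel: a consumer of the model can tell which bases came from the transcript and not from the assembly.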

#3 Annotation Reports
RNA-Seq

Summary
Including the alternate loci & patch scaffolds and using their known relationships to the primary assembly significantly improves the annotation of GRC assemblies.
It is worth the extra effort!

CREDITS
Genome pipeline infrastructure
Alex Astashyn
Nathan Bouk
Rob Cohen
Mike Dicuccio
Eric Engelson
Olga Ermoloeva
Wratko Hlavina
Lucian Ion
Avi Kimchi
Boris Kiryutin
David Managadze
Eyal Mozes
Terence Murphy
Daniel Rausch
Robert Smith
Sasha Souvorov
Craig Wallin
Alex Zasypkin
Eukaryotic annotation setup & execution
Françoise Thibaud-Nissen
Jinna Choi
Patrick Masterson
Kim Pruitt and the “genome champions” from the RefSeq group
Genomic Collections DB
Avi Kimchi
Victor Sapojnikov
Charlie Xiang
Andrey Zherikov
Genome assemblies with alt/patch to primary alignments
Genome Reference Consortium
The Wellcome Trust Sanger Institute
The Genome Institute at Washington University
The European Bioinformatics Institute
The National Center for Biotechnology Information
Eukaryotic Genome Annotation at NCBI: www.ncbi.nlm.nih.gov/genome/annotation_euk/
