Genome Annotation
Karan Veer Singh,
Scientist.
NBAGR, Karnal,
India

The Genome
• The genome contains all the biological information required to
  build and maintain any given living organism

• The genome contains the organism's molecular history

• Decoding the biological information encoded in these molecules
  will have an enormous impact on our understanding of biology
Genomics

1. Structural genomics: genetic and physical mapping of genomes.

2. Functional genomics: analysis of gene function (and non-genes).

3. Comparative genomics: comparison of genomes across species.
   – Includes structural and functional genomics.
   – Evolutionary genomics.
Human Genome Project

The Human Genome Project promised to revolutionise medicine and
explain every base of our DNA.

Large MEDICAL GENETICS focus
• Identify variation in the genome that is disease causing
• Determine how individual genes play a role in health and disease
Human Genome Project & Functional Genome

• It cost about 3 billion dollars and took 13 years to complete
  (1990-2003), 2 fewer than initially planned.
• Approx 200 Mb still in progress
  – Heterochromatin
  – Repetitive
Genomics & Genome annotation

• The first genome annotation software system was designed in 1995 by Dr.
  Owen White at The Institute for Genomic Research, the team that sequenced
  and analyzed the first genome of a free-living organism to be decoded,
  the bacterium Haemophilus influenzae

• Genome analysis involves assembling the reads into contigs, then assembling
  the contigs against a reference genome (reference assembly) or without one
  (de novo assembly) to obtain the complete genome

• Variations such as mutations, SNPs and InDels can then be identified

• The genome is then annotated by structural and functional annotation

• Mapping: an image of the whole genome in an easily understandable manner.
Sequence to Annotation
Input 1 to Genome Viewer - Variant Annotation
Input 2 to Genome Viewer - Structural Annotation
  – Structural annotation: AUGUSTUS (version 2.5.5)
Input 3 to Genome Viewer - Functional Annotation
Genome Annotation
• The process of identifying the locations of genes and coding
  regions in a genome and determining what those genes do

• Finding the structural elements of the genome and attaching the
  related function to each genomic location
Genome Annotation

Gene structure prediction
  Identifying elements (introns/exons, CDS, start, stop) in the genome

Gene function prediction
  Attaching biological information to these elements, e.g. which
  protein an exon will code for
Structural annotation
Structural annotation - identification of genomic elements
• Open reading frames and their localisation
• Gene structure
• Coding regions
• Location of regulatory motifs
Functional annotation
Functional annotation - attaching biological information to genomic elements
• Biochemical function
• Biological function
• Regulation involved
Genome annotation - workflow
1. Genome sequence
2. Repeats
3. Masked or un-masked genome sequence
4. Structural annotation - gene finding
   – nc-RNAs (tRNA, rRNA), introns
   – Protein-coding genes
5. Functional annotation
6. View in genome viewer
Genome Repeats & features
Polymorphic between individuals/populations
Percentage of repetitive sequences in different organisms

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

Repeat types: microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR
Finding repeats as a preliminary to gene prediction
• Repeat discovery
  – Homology-based approaches
  – Use RepeatMasker to search the genome and mask the sequence
Masked sequence

• A repeat-masked sequence is an artificial construction in which regions
  thought to be repetitive are marked with X's
• Widely used to reduce the overhead of subsequent computational analyses and
  to reduce the impact of TEs (transposable elements) in the final annotation set

>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct

>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct

Positions/locations are not affected by masking
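
The point that masking leaves coordinates untouched can be illustrated with a minimal Python sketch. It is not how RepeatMasker works internally; the function name `hard_mask`, the 0-based half-open interval convention and the example interval are assumptions made for illustration only.

```python
def hard_mask(seq, repeats, mask_char="x"):
    """Overwrite each repeat interval with mask_char.

    repeats: list of (start, end) pairs, 0-based and end-exclusive
    (a convention assumed here, not taken from the slides).
    """
    masked = list(seq)
    for start, end in repeats:
        end = min(end, len(seq))      # never grow the sequence
        start = max(0, start)
        for i in range(start, end):   # one character in, one character out
            masked[i] = mask_char
    return "".join(masked)


seq = "atgagcttcgatagcgatcagctagcgatcaggct"
repeats = [(6, 12)]                   # hypothetical repeat location
masked = hard_mask(seq, repeats)

# Same length, and every unmasked base keeps its original position.
assert len(masked) == len(seq)
assert all(a == b for a, b in zip(seq, masked) if b != "x")
```

Because every masked base is replaced one-for-one, gene coordinates predicted on the masked sequence can be used directly on the unmasked one.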
Types of Masking - Hard or Soft?

Sometimes we want to mark up repetitive sequence but not exclude it from
downstream analyses. This is achieved using a format known as soft-masking,
in which repeats are written in lower case and the rest of the sequence in
upper case.

>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT

>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT

>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
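
Since a soft-masked sequence keeps the bases and only changes their case, it can be converted to a hard-masked one, or to a list of repeat coordinates, at any time. The sketch below shows one way this might be done in Python; the function names and the choice of "X" as the mask character are illustrative assumptions, not part of any particular tool.

```python
import re

LOWERCASE_RUN = re.compile(r"[a-z]+")


def soft_to_hard(seq, mask_char="X"):
    """Replace every lower-case (soft-masked) run with mask_char."""
    return LOWERCASE_RUN.sub(lambda m: mask_char * len(m.group()), seq)


def repeat_intervals(seq):
    """Return (start, end) of each soft-masked run, 0-based, end-exclusive."""
    return [(m.start(), m.end()) for m in LOWERCASE_RUN.finditer(seq)]


soft = "TACTATTggcttctctagactcgtctatctctatt"   # one line of the example above
hard = soft_to_hard(soft)
runs = repeat_intervals(soft)

assert len(hard) == len(soft)          # coordinates are preserved
assert runs == [(7, len(soft))]        # the repeat starts after TACTATT
```

The reverse conversion is not possible: once bases are overwritten by the mask character, the original sequence cannot be recovered, which is one reason to keep the soft-masked file.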
Structural annotation
Identification of genomic elements
• Open reading frames and their localization
• Coding regions
• Location of regulatory motifs
• Start/stop codons
• Splice sites
• Non-coding regions/RNAs
• Introns
Methods
• Similarity
  – Similarity between sequences, which does not necessarily imply any
    evolutionary linkage

• Ab initio prediction
  – Prediction of gene structure from first principles using only the genome
    sequence
Genefinding
• ab initio
• similarity
ab initio prediction
Signals read from the genome sequence:
• Coding potential
• ATG & stop codons
• Splice sites

Examples:
Genefinder, Augustus,
Glimmer, SNAP, fgenesh
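
To make the "ATG & stop codons" signal concrete, here is a deliberately simple Python sketch that scans the forward strand for open reading frames. It is a toy illustration, not the algorithm used by any of the tools listed: real gene finders combine such signals with statistical models of coding potential and splice sites. The function name `find_orfs` and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find ATG...stop open reading frames on the forward strand.

    Returns (start, end, frame) tuples; start/end are 0-based,
    end-exclusive, and the stop codon is included in the interval.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


for start, end, frame in find_orfs("ccATGAAATTTTAGgg"):
    print(f"ORF in frame {frame}: positions {start}-{end}")
```

A fuller version would also scan the reverse complement, and in eukaryotes an ORF scan alone is insufficient because introns interrupt the coding sequence.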

Genefinding - similarity
• Use known coding sequence to define coding regions
  – EST sequences
  – Peptide sequences
• Problem: handling fuzzy alignment regions around splice sites
• Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal

Gene-finding - comparative
• Use two or more genomic sequences to predict genes based on
  conservation of exon sequences
• Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes

• Non-coding RNA genes can be predicted using knowledge of their
  structure or by similarity with known examples

• tRNAscan - uses an HMM and covariance model for prediction of
  tRNA genes

• Rfam - a suite of covariance models (HMM-like models that also capture
  secondary structure) trained against a large number of different RNA genes
Gene-finding omissions

Alternative isoforms
• Currently there is no good method for predicting alternative isoforms
• Only created where supporting transcript evidence is present

Pseudogenes
• Each genome project has a fuzzy definition of pseudogenes
• Badly curated/described across the board

Promoters
• Rarely a priority for a genome project
• Some algorithms exist but are usually not integrated into an annotation set
Practical - structural annotation
Eukaryotes - AUGUSTUS (gene model)

~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial \
  --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=tr \
  --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea \
  our_genome.fasta > structural_annotation.gff

Prokaryotes - PRODIGAL (codon usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa \
  -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
Structural Annotation - output

Structural annotation conducted using AUGUSTUS (version 2.5.5), with
magnaporthe_grisea as the species (gene) model
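
A quick way to inspect a GFF3 output such as structural_annotation.gff is to count how many features of each type it contains. The Python sketch below assumes only the standard nine tab-separated GFF3 columns; the function name `count_features` and the three example records are hypothetical and are not taken from a real AUGUSTUS run.

```python
from collections import Counter


def count_features(gff_lines):
    """Count GFF3 records per feature type (column 3)."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blanks, comments, directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue                      # not a well-formed GFF3 record
        counts[cols[2]] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\tCDS\t100\t400\t0.9\t+\t0\tParent=g1.t1",
    "contig1\tAUGUSTUS\tCDS\t500\t900\t0.9\t+\t0\tParent=g1.t1",
]
for feature, n in count_features(example).items():
    print(feature, n)
```

On a real file the same function could be fed an open file handle, e.g. `count_features(open("structural_annotation.gff"))`.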
Functional annotation
Functional annotation
[Figure: from genome to function]
Genome → (transcription) → primary transcript → (RNA processing) →
processed mRNA (m7G cap, ATG ... STOP, poly-A tail AAAn) → (translation) →
polypeptide → (protein folding) → folded protein → (find function) →
enzyme activity / functional activity
Functional annotation
Attaching biological information to genomic elements
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression

• Use the known structural annotation to derive the predicted protein sequence
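
Deriving a protein sequence from a predicted coding sequence amounts to translating it codon by codon. The following Python sketch assumes the standard genetic code (translation table 1) and a CDS that is already spliced and in frame; the function name `translate` is illustrative. Other codes, such as the bacterial table 11 used with Prodigal, differ mainly in which codons may act as starts.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame CDS, stopping at the first stop codon.

    Codons containing ambiguous bases (e.g. N) are translated as 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGAAATTTTAG"))
```

The resulting protein sequences are what is searched against protein databases in the homology-based step.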

Functional annotation – Homology Based

• Predicted exons/CDS/ORFs are searched against non-redundant
  protein databases (NCBI nr, Swiss-Prot) to find similar sequences

• Visually assess the top 5-10 hits to identify whether these have
  been assigned a function

• Functions are then assigned to the predicted genes on the basis of these hits
Functional annotation - Other features
Other features which can be determined
• Signal peptides
• Transmembrane domains
• Low complexity regions
• Various binding sites, glycosylation sites etc.
• Protein domains
• Secretome

See http://expasy.org/tools/ for a good list of possible prediction algorithms
Functional annotation - Other features
(Ontologies)
Use of ontologies to annotate gene products
• Gene Ontology (GO)
  – Cellular component
  – Molecular function
  – Biological process
Practical - FUNCTIONAL ANNOTATION

Homology Based Method
• Set up a BLAST database (nucleotide/protein)
• BLAST the genome.fasta sequences against it for annotations (nucleotide/protein)
• Filter the BLAST hits by E-value, keeping hits at or below the
  cut-off (<= 0.01), for nucleotide/protein
• Assign functions
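
The filtering and assignment steps above can be sketched in Python, assuming the BLAST results were saved in the standard 12-column tabular format (outfmt 6: query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name `best_hits`, the rule "best hit = highest bit score" and the example records are assumptions for illustration, not the workflow actually used in the practical.

```python
def best_hits(tab_lines, max_evalue=0.01):
    """Keep, per query, the highest-scoring hit with E-value <= max_evalue.

    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue                      # hit too weak to support a function
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits, for illustration only.
hits = [
    "g1\tprotA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400",
    "g1\tprotB\t60.0\t150\t60\t2\t1\t150\t5\t154\t1e-20\t90",
    "g2\tprotC\t30.0\t50\t35\t3\t1\t50\t10\t59\t2.5\t20",
]
for query, (subject, evalue, score) in best_hits(hits).items():
    print(query, subject, evalue, score)
```

Queries with no hit under the cut-off, like g2 here, receive no assignment and would typically be labelled as hypothetical proteins.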

Functional annotation- output

August 2008

Bioinformatics tools for Comparative Genomics
of Vectors

Conclusion

• Annotation accuracy depends on the supporting data available at the
  time of annotation; the annotation must be updated as new information arrives

• Gene predictions will change over time as new data become
  available (e.g. at NCBI), including sequences more similar than previous ones

• Functional assignments will change over time as new data become
  available (e.g. characterization of hypothetical proteins)
Genome Viewer
The files that can be visualised
• Annotation files
• Indel files
• Consensus sequence

Comparative Genomics

Genome View
Short Read track

Thank You
Genome annotation 2013


Editor's Notes

  • #17, #22, #29, #35, #44 Try to describe genome annotation as a process. Emphasize the ongoing nature of annotation: there is no real end point to the annotation process (only artificially defined ones). Best to think of this as a ‘best guess’ annotation.
  • #20, #21 Softmasking