08_Annotation_2022.pdf

Genome Annotation
MICROBIO 590B Bioinformatics Lab: Bacterial Genomics
Professor Kristen DeAngelis
UMass Amherst
Fall 2022
1

Lecture Learning Goals
• Describe how genes are identified.
• Distinguish between an open reading frame, a genome feature, a
gene, and a protein coding region.
• Explain how genomes are annotated and the kinds of databases that
are used to classify genes.
• List the genes involved in cellular metabolism, for both energy
generation (catabolism) and cell growth (anabolism).
• Explain the idea behind metabolic models, and describe one
application.
2

Annotation of an Open Reading Frame
• …
3

Open Reading Frames
• Some ORFs are located on one strand and others on the other strand, facing in the
opposite orientation. The strands are designated + or –, and ORFs are
diagrammed on the top (+) or bottom (–) strand. The diagram
below shows that most ORFs in the S-TIM5 bacteriophage are in the same
orientation.
• For the ORFs located above the line, ‘upstream’, where the promoter is located
(the 5’ end of the ORF), is to the left.
• Open reading frames (ORFs) are sections of the genome that are flanked by start
and stop codons, and thus can be readily identified with computer algorithms.
Algorithms identify ORFs that may or may not be used by the cell to produce a
protein; an ORF that is used is termed a coding sequence (CDS).
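A minimal sketch of how an algorithm might scan for ORFs, assuming the standard ATG start codon and the three standard stop codons. The function names and the `min_codons` cutoff are illustrative choices, not part of any particular annotation tool, and real gene callers use much richer models.

```python
# Toy ORF finder: scans all six reading frames for ATG...stop stretches.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf_sequence) tuples.

    start/end are 0-based, end-exclusive coordinates on the scanned strand
    (for '-' they refer to the reverse-complemented sequence).
    min_codons counts the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGAAATTTGGGTAACC")` reports a single ORF on the + strand. Whether such an ORF is a real CDS cannot be decided from the sequence scan alone.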
4

Open Reading Frames
5
Sabehi, Shaulov, Silver, Yanai, Harel, and Lindell, PNAS. 2012

Origin of replication
• Models for bacterial (A) and eukaryotic
(B) DNA replication initiation.
• A) Circular bacterial chromosomes contain a
cis-acting element, the replicator, that is
located at or near replication origins.
• B) Linear eukaryotic chromosomes contain
many replication origins.
• Most bacterial chromosomes are
circular and contain a single origin of
chromosomal replication (oriC).
• Origins in bacteria contain three
functional elements that control origin
activity:
• conserved DNA repeats that are specifically
recognized by DnaA (called DnaA-boxes)
• an AT-rich DNA unwinding element (DUE)
• and binding sites for proteins that help
regulate replication initiation
6

Ribosomal operons tend to locate near the origin
of replication
• rRNA is the ribosomal RNA, a major constituent of the ribosome, accounting for about 2/3
of its mass
• A large number of ribosomes is required for growing cells
• Fast-growing cells have many copies of the ribosomal operon
7
http://book.bionumbers.org/how-many-ribosomal-rna-gene-copies-are-in-the-genome/

GC skew
• The leading strand (taken as a single
strand) tends to have more Gs than Cs,
even though the numbers of Gs and Cs
are equal when you count across both
strands of the double helix.
• The difference is referred to as GC
skew, which can be examined to
locate the origin of replication.
• When the G content exceeds the C
content, the skew is positive, which
indicates a leading strand.
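A minimal sketch of a GC skew calculation, assuming the common definition (G - C) / (G + C) computed in non-overlapping windows. The window size and function names are illustrative. In many bacterial genomes the minimum of the cumulative skew is interpreted as lying near the origin and the maximum near the terminus, but that interpretation is a heuristic, not a guarantee.

```python
from itertools import accumulate


def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid dividing by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew_extremes(skews):
    """Return the window indices of the minimum and maximum cumulative skew."""
    cumulative = list(accumulate(skews))
    i_min = min(range(len(cumulative)), key=cumulative.__getitem__)
    i_max = max(range(len(cumulative)), key=cumulative.__getitem__)
    return i_min, i_max
```

Plotting the cumulative values along the chromosome gives the V-shaped curve often shown in genome papers.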
8
Billings et al., Standards in Genomic Sciences 2015

Key elements of genome annotation
1. The program scans through the sequence to identify rRNA and tRNA genes.
• rRNA = ribosomal RNA genes; structural RNAs that, together with ribosomal proteins, make up the ribosome
• tRNA = transfer RNA genes; tRNAs carry amino acids to the ribosome and match them to mRNA codons as the protein grows
2. The program predicts gene-encoding regions (also known as Open Reading
Frames, or ORFs)
3. The program looks for other elements of interest (phages, CRISPR arrays, etc)
4. The program compares the sequence of each feature (any of items 1-3) to a reference
database of sequences with known functions. If the sequence looks similar to one
already annotated in the database (hopefully based on experimental
evidence), the program assigns the same function to this sequence, whether or not
that is actually what it does! But it's the best we can do.
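Step 4 can be sketched as a toy function that transfers the label of the most similar reference sequence. This sketch assumes a crude k-mer Jaccard similarity in place of the alignment or HMM searches that real annotation pipelines use. The reference sequences, the threshold, and the function names are all invented for illustration.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Shared k-mers divided by total distinct k-mers (0.0 if both are empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def annotate(query, reference, threshold=0.5):
    """Transfer the function of the best-scoring reference, if similar enough.

    reference maps a function label to a sequence. Below the threshold the
    query is left as 'hypothetical protein'.
    """
    best_label, best_score = "hypothetical protein", 0.0
    for label, ref_seq in reference.items():
        score = jaccard(query, ref_seq)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "hypothetical protein", best_score
    return best_label, best_score


# Invented sequences, for illustration only.
REFERENCE = {"beta-glucosidase": "MKTAYIAKQR", "ATP synthase": "GGSWLLPVNH"}
```

The transferred label is only as good as the reference annotation, which is the caveat made in step 4.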
9

Ribosomes and non-coding RNA
• Ribosomal RNA genes are mostly encoded in operons
• Ribosome structure requires 3 types of structural RNA molecules: 5S, 16S, and
23S rRNAs
• Ribosomes also require proteins; these are also good phylogenetic markers
• Unlinked rRNA genes are widespread among bacteria and archaea
10
Brewer et al., ISMEJ 2019

Annotate Genomes with Prokka
• Number of genes predicted
• aka total CDS
• aka total coding sequences
• Number of protein coding genes
• Number of genes with non-hypothetical
function
• Number of genes with EC number
• Total tRNAs
• Total rRNAs
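The summary statistics above can be tallied from a GFF3 annotation file. The sketch below assumes that feature types appear in column 3, that unannotated proteins carry `product=hypothetical protein`, and that EC numbers appear in an `eC_number` attribute, which is how Prokka output is commonly described; check these against your own files. The example records are invented and are not real Prokka output.

```python
from collections import Counter

# Invented GFF3-style records for illustration only.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "contig1\ttool\tCDS\t1\t300\t.\t+\t0\tID=g1;eC_number=3.2.1.21;product=Beta-glucosidase",
    "contig1\ttool\tCDS\t400\t900\t.\t-\t0\tID=g2;product=hypothetical protein",
    "contig1\ttool\tCDS\t1000\t1600\t.\t+\t0\tID=g3;product=DNA gyrase subunit A",
    "contig1\ttool\ttRNA\t1700\t1775\t.\t+\t.\tID=g4;product=tRNA-Ala",
    "contig1\ttool\trRNA\t1800\t3300\t.\t+\t.\tID=g5;product=16S ribosomal RNA",
    "##FASTA",
    ">contig1",
    "ACGT",
])


def summarize_gff(text):
    """Count feature types, plus CDS with a named function or an EC number."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            break  # the embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        feature_type = fields[2]
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        counts[feature_type] += 1
        if feature_type == "CDS":
            product = attributes.get("product", "hypothetical protein")
            if product != "hypothetical protein":
                counts["CDS_with_function"] += 1
            if "eC_number" in attributes:
                counts["CDS_with_EC"] += 1
    return counts
```

Calling `summarize_gff(EXAMPLE_GFF)` on the invented records gives 3 CDS, of which 2 have a non-hypothetical function and 1 has an EC number, plus 1 tRNA and 1 rRNA.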
11
Seemann, Bioinformatics 2014

How many ORFs are annotated?
• Up to half of all ORFs have no known homologs!
• Orphan genes, or ORFans … usually considered unique to a very narrow taxon,
generally a species
• Orphans are a subset of taxonomically-restricted genes (TRGs), which are
unique to a specific taxonomic level (e.g. plant-specific)
• Non-homology based methods based on the context and the interactions of a
protein may help identify missing metabolic activities and functional
annotation
• Why?
• Some are sequencing errors
• Some may be derived from horizontal gene transfer, duplication and
divergence, or de novo origination
• Some could be non-coding RNAs
12

Pseudogenes
• Pseudogenes are nonfunctional segments of DNA that resemble
functional genes
• Most bacterial pseudogenes are found in non-free-living organisms,
like symbionts or obligate intracellular parasites
• These will (generally) not be included in genome annotations
13

Categorizing protein coding genes
• Many organizational schemes categorize protein coding genes
• Which one you choose depends upon which are available and on your goals
• Common options include:
• Enzyme (enzyme nomenclature) and EC numbers,
• FIGfams (functional homologs, part of SEED subsystems),
• Pfam and TIGRfam (curated protein families),
• COG (curated clusters of orthologous groups of proteins),
• KO (KEGG Orthology), KEGG (metabolic pathways and reactions),
• InterPro (protein families and domains),
• GO (gene ontologies),
• LIGAND (compounds), and
• MetaCyc (metabolic pathways)
14
https://img.jgi.doe.gov/datasource.html

Categorizing protein coding genes: EC number
• EC number stands for Enzyme Commission number
• EC numbers are assigned by the Nomenclature Committee of the
International Union of Biochemistry and Molecular Biology
15

Categorizing protein coding genes: EC number
• EC numbers have four positions which describe exactly what kind of
reaction the enzyme catalyzes
• An example is beta-glucosidase, the enzyme that catalyzes the final step in the
depolymerization of cellulose to sugars
16
EC 3.2.1.21 (https://www.qmul.ac.uk/sbcs/iubmb/enzyme/EC3/2/1/21.html)
• EC 3: general type of reaction catalyzed by the enzyme; the EC 3 group is hydrolases
• EC 3.2: subclass of the top-level group; the EC 3.2 group is glycosylases
• EC 3.2.1: sub-subclass of the top-level group; the EC 3.2.1 group is glycosidases, i.e. enzymes hydrolysing O- and S-glycosyl compounds
• EC 3.2.1.21: serial number of the enzyme in its sub-subclass; β-glucosidase, hydrolysis of terminal, non-reducing β-D-glucosyl residues with release of β-D-glucose
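The four-level hierarchy of an EC number such as EC 3.2.1.21 can be unpacked with a few lines of code. This is a minimal sketch; the function name and the returned dictionary layout are illustrative, and only the seven top-level class names are filled in.

```python
# Top-level EC classes as defined by the IUBMB nomenclature.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def parse_ec(ec):
    """Split a string like 'EC 3.2.1.21' into its four hierarchical levels."""
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four positions, got {ec!r}")
    return {
        "class": parts[0],
        "class_name": EC_CLASSES.get(int(parts[0]), "unknown"),
        "subclass": ".".join(parts[:2]),
        "sub_subclass": ".".join(parts[:3]),
        "serial": ".".join(parts),
    }
```

Two enzymes that share the same `sub_subclass` value catalyze the same kind of reaction and differ mainly in substrate.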

Categorizing protein coding genes: FIGfams
• The original SEED Project was started in 2003 by the Fellowship for Interpretation
of Genomes (FIG) as an open source effort
• Annotation is done by the curation of subsystems across
many genomes, not on a gene-by-gene basis
• From the curated subsystems we extract a set of freely
available protein families (FIGfams)
• These FIGfams form the core component of the RAST server
(RAST = Rapid Annotation using Subsystem Technology)
17
https://www.theseed.org/wiki/Home_of_the_SEED

18
Meyer et al., Nucleic Acids Research 2009

• Each FIGfam is a set of proteins that are believed to be isofunctional
homologs
• they all are believed to implement the same function,
• and they are believed to derive from a common ancestor because they
appear to be similar
19

Categorizing protein coding genes: Pfam
• The Pfam database is a large collection of protein families, each
represented by multiple sequence alignments and hidden Markov
models (HMMs).
• Pfam 34.0 (March 2021, 19179 entries)
• The general purpose of the Pfam database is to provide a complete
and accurate classification of protein families and domains
20
http://pfam.xfam.org; Mistry et al., Nucleic Acids Research, 2020

Categorizing protein coding genes: Pfam
• Proteins may match multiple Pfam entries, since Pfam characterizes individual domains
21
Mistry et al., Nucleic Acids Research, 2020
• Pfam entries that cover the
SARS-CoV-2 proteome were newly
revised, with new entries added for
regions not previously covered by Pfam.
• The structure of NSP15 from
Kim et al. shows the three new
Pfam domains,
• (i) CoV_NSP15_N Coronavirus
replicase domain in red,
• (ii) CoV_NSP15_M Coronavirus
replicase NSP15 domain in blue,
• (iii) CoV_NSP15_C Coronavirus
replicase NSP15, uridylate-specific
endoribonuclease in green.

Categorizing protein coding genes: COGs
• Clusters of Orthologous Genes (COGs)
• A relatively small collection of fewer than 5000 clusters of
orthologous proteins, consisting of the products of the most
widespread bacterial and archaeal genes
22
https://www.ncbi.nlm.nih.gov/research/COG

Categorizing protein coding genes: COGs
23
Shields et al., mSphere 2018
• An example of how COGs are used in analyzing change in relative
abundance of protein coding genes across treatments

Categorizing protein coding genes: KEGG and KO
• KEGG: Kyoto Encyclopedia of Genes and Genomes
• KEGG is a database resource for understanding high-level functions
and utilities of the biological system, such as the cell, the organism
and the ecosystem, from molecular-level information, especially
large-scale molecular datasets generated by genome sequencing and
other high-throughput experimental technologies.
24
https://www.genome.jp/kegg/

Categorizing protein coding genes: KEGG and KO
• …
25
https://www.genome.jp/kegg/

• KEGG consists of
eighteen original
databases in four
categories
26
Kanehisa et al., Nucleic Acids Research 2020

27

28
• Circles represent metabolites
• Lines represent enzymes that
make biochemical
transformations

29

• The KO (KEGG Orthology) database is a database of molecular
functions represented in terms of functional orthologs.
• A functional ortholog is manually defined in the context of KEGG molecular
networks, namely, KEGG pathway maps, BRITE hierarchies and KEGG
modules.
• Each node of the network, such as a box in the KEGG pathway map, is given a
KO identifier (called K number) as a functional ortholog defined from
experimentally characterized genes and proteins in specific organisms, which
are then used to assign orthologous genes in other organisms based on
sequence similarity.
• The granularity of "function" is context-dependent, and the resulting KO
grouping may correspond to a group of highly similar sequences within a
limited organism group or it may be a more divergent group.
30

• The KO (KEGG Orthology) database
• KEGG pathway maps are drawn based on experimental evidence in
specific organisms but they are designed to be applicable to other
organisms as well, because different organisms, such as human and
mouse, often share identical pathways consisting of functionally
identical genes, called orthologous genes or orthologs
31

Metabolism
• All chemical reactions inside a cell
• Metabolic pathways are the stepwise reactions that generate energy
by breaking down larger molecules (catabolism) or that are
biosynthetic and require energy (anabolism)
32
https://openstax.org/books/microbiology/pages/8-1-energy-matter-and-enzymes

Metabolism
• The energy currencies
of cells include ATP and
the electron carriers
NAD+, NADP+, and FAD
• Exergonic reactions
are coupled to
endergonic reactions
to make the
combinations
favorable
33
https://openstax.org/books/microbiology/pages/8-1-energy-matter-and-enzymes

Catabolism of carbohydrates: glycolysis
• the most common pathway for the metabolism of glucose
• Produces energy, reduced electron carriers, and precursor molecules
for anabolism
• Can be coupled to aerobic or anaerobic growth
• Glycolysis
• Embden-Meyerhof-Parnas (EMP) pathway, aka “glycolysis”
• Entner-Doudoroff pathway is an alternative glycolysis
• Pentose-phosphate pathway processes five-carbon sugars
34

Glycolysis, the “upper” half
• 2 ATPs are used to
phosphorylate
glucose, which is
then split into two
3-carbon molecules
35
https://openstax.org/books/microbiology/pages/c-metabolic-pathways

Glycolysis, the “lower” half
• Further phosphorylation,
coupled to oxidation by NAD+,
leads to 4 ATPs per glucose
• Net 2 ATP per glucose
36

Substrate-level phosphorylation
• Shown: one of the two enzymatic reactions in the energy payoff phase of
glycolysis that generate ATP
37
https://openstax.org/books/microbiology/pages/8-2-catabolism-of-carbohydrates

Entner-Doudoroff pathway
• To catabolize glucose to
pyruvate, ED uses two
unique enzymes:
• 6-phosphogluconate
dehydratase (EC 4.2.1.12) and
• 2-keto-3-deoxy-6-phosphogluconate
(KDPG) aldolase (EC 4.1.2.14)
38

Entner-Doudoroff pathway
• EMP glycolysis generates
net 2 ATP per glucose
• ED glycolysis only generates
one ATP per glucose
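The net yields can be checked with simple bookkeeping. The EMP numbers (2 ATP invested, 4 produced) come from the glycolysis slides. The ED split into 1 ATP invested and 2 produced is the usual textbook accounting and is stated here as an assumption, since the slides give only the net value.

```python
# ATP invested and produced per glucose; ED values are an assumed breakdown.
PATHWAYS = {
    "EMP": {"atp_invested": 2, "atp_produced": 4},
    "ED": {"atp_invested": 1, "atp_produced": 2},
}


def net_atp(pathway):
    """Return ATP produced minus ATP invested per glucose."""
    entry = PATHWAYS[pathway]
    return entry["atp_produced"] - entry["atp_invested"]
```

With these numbers `net_atp("EMP")` is twice `net_atp("ED")`, which is the difference the comparison between the two pathways rests on.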
39
Flamholz et al., PNAS 2013

• EMP glycolysis generates
net 2 ATP per glucose
• ED glycolysis only generates
one ATP per glucose
• Why?
40

• “ED pathway is expected to
require several-fold less
enzymatic protein to
achieve the same glucose
conversion rate as the EMP
pathway”
41

• “energy-deprived anaerobes
overwhelmingly rely upon
the higher ATP yield of the
EMP pathway, whereas the
ED pathway is common
among facultative
anaerobes and even more
common among aerobes”
42

Pentose-Phosphate pathway
• aka phosphogluconate pathway and the hexose monophosphate shunt
• Parallels glycolysis; generates NADPH and 5C sugars, including ribose
5-phosphate, a precursor for the synthesis of nucleotides from glucose
43

The Transition Reaction
• Glycolysis produces pyruvate, which can be further oxidized to
generate more energy
• For this to happen, pyruvate must be decarboxylated (below, left)
• This is accomplished with coenzyme A (“CoA”, below, right), which accepts the resulting 2C acetyl group to form acetyl-CoA
44

Tricarboxylic Acid (TCA) Cycle
• Closed loop pathway in 8 steps that capture the 2C acetyl group of
acetyl-CoA, producing 2 CO2, 1 ATP, 3 NADH and 1 FADH2
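Because one glucose yields two pyruvate and therefore two acetyl-CoA, the per-turn yields double per glucose. A minimal sketch of that arithmetic, using the per-turn values from the bullet above (the function name is illustrative):

```python
# Products of one turn of the TCA cycle (one acetyl-CoA), as listed above.
TCA_PER_TURN = {"CO2": 2, "ATP": 1, "NADH": 3, "FADH2": 1}


def tca_yield(acetyl_coa=1):
    """Scale the per-turn products by the number of acetyl-CoA oxidized."""
    return {product: n * acetyl_coa for product, n in TCA_PER_TURN.items()}


# One glucose -> 2 pyruvate -> 2 acetyl-CoA -> 2 turns of the cycle.
PER_GLUCOSE = tca_yield(acetyl_coa=2)
```

These totals cover only the cycle itself; they do not include the products of glycolysis or of the transition reaction.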
45

TCA cycle intersects anabolism and catabolism
• As well as generating energy,
intermediate compounds are
precursors for biosynthesis of
• amino acids,
• chlorophylls,
• fatty acids, and
• nucleotides
• TCA cycle is anabolic and
catabolic
46

Respiration
• Most cellular ATP is
generated by oxidative
phosphorylation
• As opposed to substrate-
level phosphorylation
• In oxidative
phosphorylation, ATP is
formed from the transfer
of electrons from NADH
or FADH2 to O2 by a
series of electron
carriers
• How much ATP depends
on the terminal electron
acceptor
• More ATP from O2 than
from NO3⁻, SO4²⁻, Fe³⁺,
CO2, other inorganics
48
https://openstax.org/books/microbiology/pages/8-3-cellular-respiration

Electron Transport Chain
• A series of electron
carriers and ion pumps
embedded in the cell
membrane that pump
protons (H+) across a
membrane
• Proton motive force is
generated by expelling
protons outside of the cell
• Protons then tend to flow
back across the membrane,
but must pass through ATP
synthase, which drives
ATP production
49

Carbohydrate Active Enzymes (CAZy)
http://www.cazy.org
Modules that catalyze the breakdown, biosynthesis or modification of
carbohydrates and glycoconjugates :
• Glycoside Hydrolases (GHs) : hydrolysis and/or rearrangement of glycosidic bonds
• GlycosylTransferases (GTs) : formation of glycosidic bonds
• Polysaccharide Lyases (PLs) : non-hydrolytic cleavage of glycosidic bonds
• Carbohydrate Esterases (CEs) : hydrolysis of carbohydrate esters
• Auxiliary Activities (AAs) : redox enzymes that act in conjunction with CAZymes.
Associated Modules currently covered
• Carbohydrate-Binding Modules (CBMs) : adhesion to carbohydrates
50

Metabolic Modeling
• Combination of genome
sequence with physiology to
predict growth
• Mathematical network
model that represents the
systems biology of metabolic
pathways within an organism
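The network idea can be illustrated with a toy stoichiometric matrix. This sketch assumes an invented three-reaction chain (uptake, one conversion, export) and checks the steady-state condition that production and consumption of each internal metabolite balance. Real genome-scale models solve for fluxes with linear programming (flux balance analysis); the shortcut in `max_chain_flux` is valid only for an unbranched chain like this one.

```python
# Rows are metabolites (A, B); columns are reactions (uptake, r1, export).
# uptake: -> A      r1: A -> B      export: B ->
S = [
    [1, -1, 0],   # A: made by uptake, consumed by r1
    [0, 1, -1],   # B: made by r1, consumed by export
]


def is_steady_state(stoich, fluxes, tol=1e-9):
    """True if every metabolite's net production (S . v) is zero."""
    return all(
        abs(sum(coef * v for coef, v in zip(row, fluxes))) <= tol
        for row in stoich
    )


def max_chain_flux(upper_bounds):
    """Largest steady-state flux through an unbranched chain.

    At steady state every reaction in a chain carries the same flux, so the
    tightest upper bound (for example, substrate uptake) limits the output.
    """
    return min(upper_bounds)
```

For example, with an uptake limit of 10 and generous limits elsewhere, the predicted export ("growth") flux is 10, and `[10, 10, 10]` satisfies the steady-state check while `[10, 5, 5]` does not.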
51
Sertbas & Ulgen, Front. C.D.B, 2020

Metabolic models help predict pathogenesis
• …
52
Sertbas & Ulgen, Front. C.D.B, 2020

Metabolic models to identify novel antimicrobial
drug targets and develop new antibiotics
53
https://doi.org/10.1038/s41429-020-00366-2

Metabolic models improve food fermentation
• Lactic acid bacteria like Lactococcus
lactis make lactic acid from sugars in
foods like cheese, yogurt, wine, salami,
and sauerkraut
• They also make therapeutic proteins &
flavor ingredients
• By targeting the lac operon (below),
genetic engineers can tune metabolic
pathways and products (left)
54
https://doi.org/10.1016/j.tibtech.2003.11.011

Lecture Learning Goals
• Describe how genes are identified.
• Distinguish between an open reading frame, a genome feature, a
gene, and a protein coding region.
• Explain how genomes are annotated and the kinds of databases that
are used to classify genes.
• List the genes involved in cellular metabolism, for both energy
generation (catabolism) and cell growth (anabolism).
• Explain the idea behind metabolic models, and describe one
application.
55
