6688–6719 Nucleic Acids Research, 2008, Vol. 36, No. 21 Published online 23 October 2008
doi:10.1093/nar/gkn668
SURVEY AND SUMMARY
Genomics of bacteria and archaea: the emerging
dynamic view of the prokaryotic world
Eugene V. Koonin* and Yuri I. Wolf
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Bethesda, MD, USA
Received June 23, 2008; Revised September 15, 2008; Accepted September 22, 2008
ABSTRACT
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after
these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time
of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis
of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations
on the principles of genome organization and evolution. A crucial finding that enables functional
characterization of the sequenced genomes and evolutionary reconstruction is that the majority of
archaeal and bacterial genes have conserved orthologs in other, often distant, organisms. However,
comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of
prokaryotic evolution, along with
Very shortly thereafter, the second bacterial genome, that of Mycoplasma genitalium, was sequenced (2),
and modern comparative genomics was born. A considerable number of sequences from diverse organisms
were available prior to these reports, but the first fully sequenced bacterial genome forever changed
the state of the art in genome analysis. The availability of complete genomes (i.e. with nearly all the
genetic material from the given organism sequenced as opposed to, say, 90%, so that all genes are
available for analysis) is crucial to the entire enterprise of comparative genomics for at least two
related but distinct, fundamental reasons: (i) some caveats notwithstanding (see below), the
availability of complete genome sequences (or, more precisely, full complements of genes) makes it
possible to identify sets of orthologs, i.e. genes that evolved from the same ancestral gene in the
common ancestor of the compared genomes; (ii) comparison of complete genomes (gene sets) is the
necessary condition to determine not only which genes are present in any particular genome but also which
Sequencing costs have fallen so dramatically that a single laboratory can now afford to sequence
large, even human-sized, genomes. Ironically, although sequencing has become easy, in many ways,
genome annotation has become more challenging. Several factors are responsible for this. First, the
shorter read lengths of second-
with some basic UNIX skills, ‘do-it-yourself’ genome annotation projects are quite feasible using
present-day tools. Here we provide an overview of the eukaryotic genome annotation process, describe
the available toolsets and outline some best-practice approaches.
A beginner’s guide to eukaryotic
genome annotation
Mark Yandell and Daniel Ence
Abstract | The falling cost of genome sequencing is having a marked impact on the
research community with respect to which genomes are sequenced and how and where
they are annotated. Genome annotation projects have generally become small-scale
affairs that are often carried out by an individual laboratory. Although annotating
a eukaryotic genome assembly is now within the reach of non-experts, it remains a
challenging task. Here we provide an overview of the genome annotation process and
the available tools and describe some best-practice approaches.
STUDY DESIGNS
already known phyla and have shown that only 10% of
Figure 1. The temporal dynamics of genome sequencing for bacteria
and archaea. Bacteria: doubling time 20 months. Archaea: doubling
time 34 months.
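The doubling times in this caption translate directly into growth factors. The following is a minimal sketch, assuming idealized exponential growth N(t) = N0 · 2^(t/T); the function name and the 10-year horizon are illustrative choices, not values from the article.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Fold increase after `months` under ideal exponential growth."""
    return 2.0 ** (months / doubling_time_months)


# Doubling times taken from the figure caption.
BACTERIA_DOUBLING = 20.0
ARCHAEA_DOUBLING = 34.0

if __name__ == "__main__":
    # The 120-month horizon is an arbitrary illustration.
    for label, t in (("bacteria", BACTERIA_DOUBLING), ("archaea", ARCHAEA_DOUBLING)):
        print(f"{label}: x{growth_factor(120, t):.1f} over 10 years")
```

Under this idealization, a 20-month doubling time gives six doublings in ten years, while a 34-month doubling time gives between three and four.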
principles. The sequenced bacterial genomes span two orders of magnitude in size, from 180 kb in the
intracellular symbiont Carsonella ruddii (11) to 13 Mb in the soil bacterium Sorangium cellulosum (12).
Remarkably, bacteria show a clear-cut bimodal distribution of genome
Table 1. The state of genome sequencing for the archaeal and bacterial phyla (a)
Phylum | No. of genomes sequenced | Genome size range, Mb | Representative (first genome sequenced)
Archaea
Crenarchaeota | 16 | 1.3–3 | Aeropyrum pernix K1
Euryarchaeota | 34 | 1.6–5.8 | Methanocaldococcus jannaschii DSM 2661
Korarchaeota | 1 | 1.6 | Korarchaeum cryptofilum OPF8
Nanoarchaeota | 1 | 0.5 | Nanoarchaeum equitans Kin4-M
Bacteria
Acidobacteria | 2 | 5.7–10.0 | Acidobacteria bacterium Ellin345
Actinobacteria | 54 | 0.9–9.7 | Mycobacterium tuberculosis H37Rv
Aquificae | 2 | 1.6–1.8 | Aquifex aeolicus VF5
Bacteroidetes/Chlorobi group | 21 | 0.3–6.3 | Chlorobium tepidum TLS
Chlamydiae/Verrucomicrobia group | 16 | 1.0–6.0 | Chlamydia trachomatis D/UW-3/CX
Chloroflexi | 7 | 1.3–6.7 | Dehalococcoides ethenogenes 195
Chrysiogenetes | 0 | N/A | N/A
Cyanobacteria | 33 | 1.6–9.0 | Synechocystis sp. PCC 6803
Deinococcus–Thermus group | 4 | 2.1–3.2 | Deinococcus radiodurans R1
Firmicutes (Gram-positive bacteria) | 150 | 0.6–6.0 | Mycoplasma genitalium G37
Fusobacteria | 1 | 2.2 | Fusobacterium nucleatum subsp. nucleatum ATCC 25586
Gemmatimonadetes | 0 | N/A | N/A
Nitrospirae | 0 | N/A | N/A
Planctomycetes | 1 | 7.1 | Rhodopirellula baltica SH 1
Proteobacteria | 353 | 0.2–13.0 | Haemophilus influenzae Rd KW20
Spirochaetes | 13 | 0.9–4.7 | Borrelia burgdorferi B31
Synergistetes | 0 | N/A | N/A
Thermodesulfobacteria | 0 | N/A | N/A
Thermotogae | 7 | 1.8–2.2 | Thermotoga maritima MSB8
(a) The classification is from the NCBI taxonomy as of 10 June 2008 (http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy).
such as that of the microsporidian Encephalitozoon
Figure 2. Distribution of genome sizes among bacteria and archaea.
The distribution curves were obtained by Gaussian-kernel smoothing
of the individual data points (276).
Figure 3. Density of protein-coding genes in bacterial and archaeal
genomes. The distribution curves were obtained by Gaussian-kernel
smoothing of the individual data points (276).
Figure 4. Length distributions of protein-coding genes (a) and intergenic
regions (b) in bacterial and archaeal genomes. The distribution
curves were obtained by Gaussian-kernel smoothing of the individual
data points (276).
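The captions above all mention Gaussian-kernel smoothing of individual data points. The following is a minimal sketch of that general technique (a kernel density estimate), not the authors' actual procedure; the bandwidth, the grid and the genome sizes are hypothetical values chosen only to illustrate how a bimodal curve can emerge.

```python
import math


def gaussian_kde(points, grid, bandwidth):
    """Gaussian-kernel density estimate of `points`, evaluated on `grid`."""
    if not points:
        raise ValueError("need at least one data point")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        # Each data point contributes one Gaussian bump centred on itself.
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        density.append(norm * total)
    return density


# Hypothetical genome sizes in Mb (illustrative values only).
sizes_mb = [1.8, 2.0, 2.1, 2.3, 4.6, 4.9, 5.1, 5.4]
grid = [i * 0.1 for i in range(81)]  # 0.0 to 8.0 Mb in 0.1-Mb steps
curve = gaussian_kde(sizes_mb, grid, bandwidth=0.4)
```

The bandwidth controls how much the curve is smoothed: a very small value reproduces individual points as spikes, and a very large one merges separate peaks.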
[Figure: scatter plot of log10(gene size) (bp), axis range 2.8–4.2, against log10(genome size) (Mbp), axis range 0.5–3.5. Labeled points: Escherichia coli, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Arabidopsis thaliana, Volvox carteri, Drosophila melanogaster, Citrus clementina, Takifugu rubripes, Oryza sativa, Populus trichocarpa, Eucalyptus grandis, Gallus gallus, Danio rerio, Ornithorhynchus anatinus, Zea mays, Ailuropoda melanoleuca, Mus musculus, Homo sapiens. Legend: fungi, invertebrates, plants, bacteria, mammalian vertebrates, non-mammalian vertebrates.]
Figure 1 | Genome and gene sizes for a representative set of genomes. Gene size is plotted as a function
of genome size for some representative bacteria, fungi, plants and animals. This figure illustrates a simple rule of
thumb: in general, bigger genomes have bigger genes. Thus, accurate annotation of a larger genome requires a more
contiguous genome assembly in order to avoid splitting genes across scaffolds. Note too that although the human
and mouse genomes deviate from the simple linear model shown here, the trend still holds. Their unusually large
genes are likely to be a consequence of the mature status of their annotations, which are much more complete as
regards annotation of alternatively spliced transcripts and untranslated regions than those of most other genomes.
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Gene Density
Chapter 7 • BACTERIAL AND ARCHAEAL GENET
ple sequence repeats (e.g., microsatellites and minisatellites), gene duplications (both tan-
dem arrays and pseudogenes), and transposable elements. Although bacterial and ar-
[Figure: two 50-kb genome segments drawn on the same 0–50 kb scale, (A) human and (B) Escherichia coli. Key: gene, human pseudogene, repetitive DNA element.]
FIGURE 7.2. Genome density. Comparison of the genome density and content of humans and Escherichia coli.
Each segment is 50 kb in length and represents (A) a portion of the human β T-cell receptor locus and
(B) a region of the E. coli K12 genome. Note the much greater proportion of genes (red boxes) in E. coli
compared to humans.
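Gene density as described in this caption can be quantified in at least two ways: genes per kilobase, and the fraction of bases covered by genes. The following is a minimal sketch with made-up coordinates; the function names and the two example segments are hypothetical and do not come from the figure.

```python
def coding_fraction(genes, segment_length):
    """Fraction of a segment covered by at least one gene.

    `genes` is a list of (start, end) pairs, 0-based and end-exclusive.
    Overlapping genes are merged so that shared bases are counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / segment_length


def genes_per_kb(genes, segment_length):
    """Number of genes per 1,000 bases of the segment."""
    return len(genes) / (segment_length / 1000)


# Hypothetical 5-kb segments: one densely packed, one sparse.
dense = [(0, 900), (950, 2100), (2000, 3500), (3600, 5000)]
sparse = [(1000, 1500), (4000, 4300)]
```

With these invented coordinates, the dense segment is almost entirely covered by genes while the sparse one is mostly intergenic, which is the contrast the figure draws between E. coli and human DNA.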
Number of Genes
DNA or selfish DNA. Junk DNA appears to provide little benefit or no function to the
organism. (In some cases this designation is a misnomer resulting from a lack of infor-
[Figure: number of genes (0–30,000) plotted against genome size (10^5–10^10, logarithmic axis) for bacteria, archaea, eukaryotes and viruses.]
FIGURE 7.3. Genome size vs. number of protein-coding genes. The number of genes is highly correlated
to genome size for bacteria, archaea, and viruses, but less so for eukaryotes. Many archaeal
points (blue triangles) are hidden under bacterial ones (yellow squares).
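One simple way to put a number on "highly correlated" is the Pearson correlation coefficient. The following is a minimal sketch; the genome sizes and gene counts below are hypothetical values in a prokaryote-like range, not data read off the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical genome sizes (Mb) and gene counts, for illustration only.
sizes_mb = [0.6, 1.8, 4.6, 6.3]
gene_counts = [500, 1700, 4300, 5600]
r = pearson_r(sizes_mb, gene_counts)
```

A value of r close to 1 corresponds to the tight, nearly linear relationship the caption describes for bacteria, archaea and viruses.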
Operons
ORIGIN AND DIVERSIFICATION OF LIFE
[Figure: map of the lac operon showing the CAP site, promoter, operator and the genes lacZ, lacY and lacA, with labels for β-galactosidase, transacetylase and lactose permease (which transports lactose into the cell), and the reaction splitting lactose into galactose + glucose drawn as chemical structures.]
FIGURE 7.4. Lac operon from Escherichia coli. This operon consists of three genes whose transcription
is regulated by a single promoter. The genes encode proteins involved in utilizing lactose, including
a permease (encoded by lacY), which brings lactose into the cell from the outside, and two
enzymes (encoded by lacZ and lacA), which split lactose into glucose + galactose (see pp. 52–53).
mation. Some stretches of “junk DNA” have been determined to be involved in gene reg-
Where Are The
Genes?
Box 2 | Gene prediction versus gene annotation
Although the terms ‘gene prediction’ and ‘gene annotation’ are often used as if they are synonyms, they are not. With a
few exceptions, gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated
regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more
accurate description might be ‘canonical CDS prediction’.
Gene annotations, conversely, generally include UTRs, alternative splice isoforms and have attributes such as evidence
trails. The figure shows a genome annotation and its associated evidence. Terms in parentheses are the names of
commonly used software tools for assembling particular types of evidence. Note that the gene annotation (shown in blue)
captures both alternatively spliced forms and the 5′ and 3′UTRs suggested by the evidence. By contrast, the gene
prediction that is generated by SNAP (shown in green) is incorrect as regards the gene’s 5′ exons and start-of-translation
site and, like most gene-predictors, it predicts only a single transcript with no UTR.
Gene annotation is thus a more complex task than gene prediction. A pipeline for genome annotation must not only
deal with heterogeneous types of evidence in the form of the expressed sequence tags (ESTs), RNA-seq data, protein
homologies and gene predictions, but it must also synthesize all of these data into coherent gene models and produce
an output that describes its results in sufficient detail for these outputs to become suitable inputs to genome browsers
and annotation databases.
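The distinction drawn in this box can be made concrete as a data structure. The following is a minimal sketch, not the output format of any annotation pipeline: an annotation holds several transcripts, each with exons, a coding span and an evidence trail, whereas a CDS-only prediction corresponds to just the coding segments of one transcript. All class names, field names and coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


@dataclass
class Transcript:
    exons: List[Interval]   # exon coordinates, sorted by start
    cds: Interval           # genomic span from start codon to stop codon

    def utr_lengths(self) -> Tuple[int, int]:
        """Exonic bases before and after the coding span, in genomic order.

        Strand is ignored here, so which side is 5' depends on orientation.
        """
        before = after = 0
        for start, end in self.exons:
            if start < self.cds[0]:
                before += min(end, self.cds[0] - 1) - start + 1
            if end > self.cds[1]:
                after += end - max(start, self.cds[1] + 1) + 1
        return before, after


@dataclass
class GeneAnnotation:
    gene_id: str
    transcripts: List[Transcript] = field(default_factory=list)  # splice forms
    evidence: List[str] = field(default_factory=list)            # evidence trail


def cds_segments(transcript: Transcript) -> List[Interval]:
    """Clip exons to the coding span: roughly what a CDS-only predictor reports."""
    segments = []
    for start, end in transcript.exons:
        s, e = max(start, transcript.cds[0]), min(end, transcript.cds[1])
        if s <= e:
            segments.append((s, e))
    return segments


# Hypothetical gene with two splice forms (coordinates are made up).
gene = GeneAnnotation(
    gene_id="gene0001",
    transcripts=[
        Transcript(exons=[(100, 200), (300, 400), (500, 650)], cds=(150, 600)),
        Transcript(exons=[(100, 200), (500, 650)], cds=(150, 600)),
    ],
    evidence=["protein alignment", "EST alignment", "ab initio prediction"],
)
```

Calling `cds_segments(gene.transcripts[0])` discards the untranslated exon ends and the second splice form, which is the information a canonical CDS prediction does not carry.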
[Figure: genome browser view spanning roughly 226,500–229,500 bp with four tracks: the gene annotation resulting from synthesizing all available evidence (two alternative splice forms, with 5′UTR, 3′UTR, start codon and stop codon marked); protein evidence (BLASTX); mRNA or EST evidence (Exonerate); and the gene prediction (SNAP).]
Box 3 | Non-coding RNAs
Non-coding RNA (ncRNA) annotation is still in its infancy compared with
protein-coding gene annotation, but it is advancing rapidly. The heterogeneity and
poorly conserved nature of many ncRNA genes present major challenges for
annotation pipelines. Unlike protein-encoding genes, ncRNAs are usually not
well-conserved at the primary sequence level; even when they are, nucleotide
homologies are not as easily detected as protein homologies, which limits the power
of evidence-based approaches.
One common approach is to identify ncRNA genes using conserved secondary structures and motifs.
Established examples of these types of tools include tRNAscan-SE (ref. 118) and Snoscan (ref. 119).
MicroRNA (miRNA) gene finders are also available (ref. 120). A more general approach is first to align
nucleotide sequences (genomic, RNA-seq and ESTs) from closely related organisms to the target genome
and then search these for signs of conserved secondary structures. This is a complex process, however,
and can require substantial computational resources; qRNA is one such tool (ref. 121), another is
StemLoc (ref. 122). Be aware that these tools have high false-positive rates. RNA sequencing is also
greatly aiding ncRNA identification. For example, miRNAs can be directly identified using specialized
RNA preps and sequencing protocols (refs 123,124). Even with such sophisticated tools and techniques,
distinguishing between bona fide ncRNA genes, spurious transcription and poorly conserved
protein-encoding genes that produce small peptides remains difficult, especially in the cases of long
intergenic non-coding RNAs (lincRNAs) (refs 125,126) and expressed pseudogenes (refs 127,128).
Another approach is to annotate possible ncRNA genes liberally and then use Infernal (ref. 129) and
Rfam (ref. 114) to triage and classify these genes based on primary sequence and secondary structure
similarities. Even with these resources, however, many ncRNAs will remain unclassifiable. Currently,
ncRNA annotation is cutting edge, and those using ncRNA annotations should bear in mind that ncRNA
annotation accuracies are generally much lower than those of their protein-coding counterparts.
GC- and AT-skews, i.e. excess of purines or pyrimidines
(violation of Chargaff’s second parity rule). The underly-
ing causes of the GC/AT-skews are thought to reflect an
interplay of selective and mutational forces, i.e. selection
One of the earliest and central concepts of bacterial genetics is the operon, a group of cotranscribed
and coregulated genes (82). Although an enormous amount of variation on the simple theme of regulation
by the Lac
Figure 13. Evolution of gene order in bacteria and archaea: genomic dot-plots. (a) Colinearity with a few breakpoints between closely related
bacteria: Geobacillus thermodenitrificans versus Geobacillus kaustophilus; (b) X-shaped pattern between moderately diverged bacteria: Shewanella sp.
MR-4 versus Shewanella oneidensis; (c) X-shaped pattern between moderately diverged archaea: Pyrococcus horikoshii OT3 versus Pyrococcus abyssi
GE5; and (d) No clear pattern between more distantly related bacteria: Streptococcus gordonii str. Challis versus Streptococcus pneumoniae R6.
In each panel, the genome indicated first is plotted along the vertical axis.
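The patterns named in this caption can be described by where ortholog pairs fall in a dot-plot. The following is a minimal sketch, assuming each gene position is expressed as a fraction of its circular genome measured from the origin of replication: colinear pairs lie near y = x, and pairs on the second arm of an X-shaped pattern lie near y = 1 − x. The function names and the tolerance are hypothetical choices, not the method used for the figure.

```python
def classify_pair(x, y, tol=0.05):
    """Classify an ortholog pair by its place in a normalized dot-plot.

    x, y: positions in the two genomes as fractions in [0, 1), origin at 0.
    """
    def circular_distance(a, b):
        # Distance on a circle of circumference 1.
        d = abs(a - b) % 1.0
        return min(d, 1.0 - d)

    if circular_distance(y, x) <= tol:
        return "diagonal"
    if circular_distance(y, (1.0 - x) % 1.0) <= tol:
        return "anti-diagonal"
    return "scattered"


def summarize(pairs, tol=0.05):
    """Count how many ortholog pairs fall in each class."""
    counts = {"diagonal": 0, "anti-diagonal": 0, "scattered": 0}
    for x, y in pairs:
        counts[classify_pair(x, y, tol)] += 1
    return counts
```

In this scheme, a panel like (a) would be dominated by the "diagonal" class, panels like (b) and (c) would split between "diagonal" and "anti-diagonal", and a panel like (d) would be mostly "scattered". Pairs near the origin or the terminus sit on both lines at once and are counted as "diagonal" here.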
Gene Order
[Slide diagram: a circular chromosome with its origin of replication and terminus of replication marked is artificially opened and drawn as a line running from the origin through the terminus to the origin again. Genome 1 and Genome 2 are then plotted against each other, with O and T marking origin and terminus on each axis.]
[Figure: dot-plot of E. coli K12 against E. coli O157:H7; an island, an inversion and a repeat are circled.]
FIGURE 7.10. Conserved gene order in the backbone of Escherichia coli K12 and O157:H7. The two genomes
were aligned with each other and the matching regions were plotted. The conserved order of genes in the
backbone of the two E. coli strains is indicated by the diagonal line. Three important genomic regions
are circled. An island present in one of the two strains causes a slight shift in the position of the
main diagonal.
they also occur in virtually the same order in both strains (Fig. 7.10). The genes unique
to each strain are clustered into “islands” interspersed among the stretches of common
genes. Similar patterns of DNA “islands” within a conserved genome backbone have
been found among other related bacteria or archaea.
mon is symmetric inversion around the origin of replication (Fig. 7.14). Such inversions
are seen in almost every comparison of moderately closely related strains or species. Al-
though other rearrangements occur, the symmetric inversions serve as a useful tool for
understanding some features of general evolution and we focus on them here.
Symmetric inversions around the origin are due to a combination of mutation bias
and selection bias. To understand how mutation bias could cause this, it is helpful to un-
[Figure: dot-plot with the H. influenzae Rd chromosome (0–1,830,137) on the horizontal axis and the H. pylori 26695 chromosome (0–1,667,867) on the vertical axis.]
FIGURE 7.11. The lack of conservation of gene order between Haemophilus influenzae and Helicobacter
pylori is illustrated. Linearized chromosomes of H. influenzae and H. pylori are plotted on the
horizontal and vertical axes, respectively. Each dot represents a single pair of orthologous proteins.
Genes in similar operons, which do exist, are too close together to give separated points on the scale
used.
cently replicated DNA, thereby causing an inversion. As the two replication forks should
[Figure: gene order in the ribosomal protein gene clusters (rpoBC, str, S10, spc, alpha), running from L11 (rplK) through rpoB, rpoC, fusA, tufA, S10 (rpsJ), secY, adk, map, infA and rpoA to L17 (rplQ), compared across Sinorhizobium meliloti, Bacillus subtilis, Borrelia burgdorferi, Treponema pallidum, Helicobacter pylori, Escherichia coli, Haemophilus influenzae, Rickettsia prowazekii, Mycoplasma sp., Aquifex aeolicus, Thermotoga maritima, Deinococcus radiodurans, Mycobacterium tuberculosis, Chlamydia sp., Synechocystis and Archaea. Key: small-subunit r-protein genes, large-subunit r-protein genes, nonribosomal genes, unknown genes, breakpoint, gene insertion, Rho-independent terminator, missing gene.]
FIGURE 7.12. Conservation of gene order of ribosomal protein operons across bacterial and archaeal species.
V. cholerae vs. E. coli All
[Dot-plot: V. cholerae coordinates (0–3,000,000) against E. coli coordinates (0–5,000,000). Eisen et al., 2000]
V. cholerae vs. E. coli Best
[Dot-plot: V. cholerae coordinates (0–3,000,000) against E. coli coordinates (0–5,000,000). Eisen et al., 2000]
V. cholerae vs. E. coli, Rotated
[Dot-plot: V. cholerae ORF coordinates (0–3,000,000) against E. coli ORF coordinates (0–5,000,000). Eisen et al., 2000]
Duplication and Gene Loss Model
Eisen et al., 2000
V. cholerae vs. E. coli
Orthologs on Both Diagonals
[Dot-plot: V. cholerae ORF coordinates (0–3,000,000) against E. coli ORF coordinates (0–5,000,000). Eisen et al., 2000]
C. trachomatis vs C. pneumoniae
[Dot-plot: C. trachomatis MoPn against C. pneumoniae AR39, with the origin and terminus marked.]
[Figure: circular genomes with genes numbered 1–32, shown for the common ancestor of A and B and for the descendant genomes A1, A2, A3 and B1, B2, B3. Successive steps are labeled "Inversion Around Terminus" or "Inversion Around Origin", and asterisks (*) mark the inverted regions.]
Symmetric Inversion Model
Eisen et al., 2000
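The effect of a symmetric inversion on gene order can be mimicked on a toy circular genome. The following is a minimal sketch, assuming the genome is a list of gene labels written so that the origin lies between the last and the first element; strand flips are ignored, and the names `symmetric_inversion` and `half_width` are illustrative rather than taken from Eisen et al.

```python
def symmetric_inversion(genome, half_width):
    """Invert the segment extending `half_width` genes on each side of the origin.

    `genome` is a list of gene labels for a circular chromosome, written so
    that the origin lies between the last and the first element.
    """
    n = len(genome)
    if not 0 <= 2 * half_width <= n:
        raise ValueError("inverted segment cannot be longer than the genome")
    left = genome[n - half_width:]    # genes just before the origin
    right = genome[:half_width]       # genes just after the origin
    segment = (left + right)[::-1]    # reverse the order across the origin
    middle = genome[half_width:n - half_width]
    # The first half of the reversed segment lands before the origin,
    # the second half after it.
    return segment[half_width:] + middle + segment[:half_width]


# A toy genome with 32 genes, as in the slide; the half-width is arbitrary.
ancestor = list(range(1, 33))
descendant = symmetric_inversion(ancestor, 5)
```

Genes outside the inverted segment keep their index, so they stay on the main diagonal of a dot-plot; a gene at index i inside the segment moves to index n − 1 − i, which places it on the anti-diagonal. Repeating such inversions with different half-widths in two lineages is one way an X-shaped pattern can build up.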
The X-Files
[Slide: a collage of genome dot-plots with the panel titles Streps, B. subt vs. Staph, M. tb vs. M. leprae (Mycobacterium leprae against Mycobacterium tuberculosis), Pyrococcus, Thermoplasmas and Pseudomonas.]
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Functional Prediction
• Identification of motifs
  - Short regions of sequence similarity that are indicative of general activity
  - e.g., ATP binding
• Homology/similarity based methods
  - Gene sequence is searched against a database of other sequences
  - If significantly similar genes are found, their functional information is used
• Problem
  - Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
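The motif-based approach in the first bullet can be sketched with a regular expression. The example below uses the P-loop (Walker A) ATP/GTP-binding pattern, written in PROSITE style as [AG]-x(4)-G-K-[ST]; this particular motif is chosen here only as a familiar illustration and is not named on the slide, and the test sequence is invented.

```python
import re

# PROSITE-style pattern [AG]-x(4)-G-K-[ST] (P-loop / Walker A motif),
# translated to a regular expression. Used here only as an example motif.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(protein_seq, pattern=P_LOOP):
    """Return (start, matched_text) for each non-overlapping match, 0-based."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein_seq.upper())]


# Invented peptide containing one match, for illustration only.
hits = find_motif("MAGPTGSGKSTL")
```

Because such patterns are short, matches also occur by chance in unrelated proteins, which is the problem raised in the last bullet: a motif hit suggests a general activity but does not by itself establish function.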
corresponds to the ‘cloud’ (24 000 clusters) that consists
of genes shared by a small number of organisms. The pos-
sibility exists that the size of the cloud is somewhat inflated,
[Figure: chart with a 0–100% axis and one entry for each of 40 archaeal and bacterial genomes, from Aeropyrum pernix K1 to Thermotoga maritima MSB8.]
Figure 5. Coverage of bacterial and archaeal genomes with clusters of orthologous genes (COGs). The COGs were from the EggNOG database (41), and the
proteins from each genome were assigned to these clusters using a modified COGNITOR method (42).
Figure 9. The prokaryotic genome space: a SOM. Th
[Figure: gene counts per functional class for Pyrobaculum aerophilum IM2, Methanosarcina barkeri Fusaro, Aquifex aeolicus VF5, Lactobacillus casei ATCC 334 and Pseudomonas aeruginosa PAO1.]
Figure 10. Distribution of predicted gene functional classes for selected
archaeal and bacterial genomes. Red, information processing genes;
blue, genes involved in cellular functions; green, genes involved in
metabolism and transport; light gray, general prediction only; dark
gray, no prediction. The function class assignment is based on the
inclusion of the respective genes in COGs (34).
Figure 11. D
logs for info
and replicat
are operatio
by Gaussian
velopment, the genome of the
s found to contain only 170
n any estimates of the minimal
er, this unusual organism lacks
ent in all other known bacteria
roteins that appear to be indis-
minoacyl-tRNA synthetases. At
xplanation is that this organism
teins from the host cell, thereby
traint affecting other prokaryo-
s, even intracellular ones (133).
ella is a case of a bacterium-to-
gress (142). The minimal com-
c organism growing on a rich
n at approximately 250 genes.
rrently known free-living organ-
.3 Mb in size, with 1100 genes
n these genomes contain up to
erally, nonessential, it is reason-
gene set for a free-living organ-
d number of approximately 1000
wide spread of nonorthologous
mal prokaryotic gene set is not
nes. Instead, there can be a large
sms with diverse life styles but,
of genes (135).
tions, perhaps, are what deter-
y of bacterial and archaeal gen-
g gives the upper bound to this
s problem, we turn to the ana-
functional categories of genes
already referred to in the above
uction systems. As first noticed
ver et al. (143) in the course of
e bacterium Pseudomonas aeru-
ail by Van Nimwegen (130) and
ly confirmed and explored by
showed an exponent greater than one (Figure 14b).
Figure 14. Scaling of genes in different functional categories with the
total number of genes in archaeal and bacterial genomes. (a) Data
for individual protein-coding genes. (b) Data for COGs. The function
class assignment is based on the inclusion of the respective genes in
COGs (34).
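A scaling exponent of the kind discussed here can be estimated as the slope of a straight line fitted on log-log axes. The following is a minimal sketch using ordinary least squares; the gene counts are synthetic values generated from exact power laws for illustration and are not the data behind the figure.

```python
import math


def scaling_exponent(total_genes, category_genes):
    """Least-squares slope of log(category count) against log(total count).

    For a power law category = c * total**k, the returned slope equals k.
    """
    if len(total_genes) != len(category_genes) or len(total_genes) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    xs = [math.log(t) for t in total_genes]
    ys = [math.log(c) for c in category_genes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all genomes have the same total gene count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic genomes: one category grows linearly, one quadratically.
totals = [500, 1000, 2000, 4000, 8000]
linear_category = [0.1 * t for t in totals]
quadratic_category = [0.001 * t ** 2 for t in totals]
```

An exponent of one means a category keeps a constant share of the genome as it grows; an exponent greater than one means the category takes up a growing share in larger genomes.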
ents between bacteria
ht be, it seems appro-
zed functional devices
elated organisms).
demonstration and
ecent HGT is detect-
composition, oligonu-
and other ‘linguistic’
at reveal horizontally
nomalous for a given
rizontally transferred
tively high rate as the
d’ during evolution
HGT between closely
probably, not comple-
gation, bacteriophage-
mation (159).
d HGT among closely
GT across long evolu-
on the evolution of
tter of intense debate.
d ample indications of
ery distant organisms,
The first clear-cut indi-
l HGT were obtained
hermophilic bacteria,
T. maritima (178), con-
haracteristic archaeal
as well as proteins
a and bacteria but
ity to the latter than
Figure 15. The taxonomic breakdown of the best database hits
for proteins encoded in diverse bacterial and archaeal genomes.
(a) A mesophilic bacterium, Bifidobacterium longum (Biflo), compared
to a hyperthermophilic bacterium, T. maritima (Thema). (b) A mesophilic
archaeon, M. mazei (Metma), compared to a hyperthermophilic
archaeon, Sulfolobus solfataricus (Sulso). The best hits were obtained
by processing the results of the searches of the NCBI’s nonredundant
protein sequence database using the BLASTP program (277).
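A taxonomic breakdown of best hits can be tallied from search results in a few lines. The following is a minimal sketch, assuming the hits have already been parsed into (query, subject, bit score) tuples and that a subject-to-taxon lookup is available; the identifiers, scores and taxon labels are hypothetical, and the authors' actual post-processing (for example, how hits to the query's own lineage were handled) is not described in the caption.

```python
from collections import Counter


def best_hit_breakdown(hits, taxon_of):
    """Count the taxon of the top-scoring hit for each query protein.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    taxon_of: dict mapping subject_id to a taxon label.
    """
    best = {}
    for query, subject, score in hits:
        # Keep only the highest-scoring subject seen so far for this query.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return Counter(taxon_of.get(subject, "unassigned") for subject, _ in best.values())


# Hypothetical search results and taxon assignments.
example_hits = [
    ("q1", "s1", 50.0),
    ("q1", "s2", 80.0),
    ("q2", "s3", 60.0),
    ("q3", "s9", 30.0),
]
example_taxa = {"s1": "Archaea", "s2": "Firmicutes", "s3": "Archaea"}
breakdown = best_hit_breakdown(example_hits, example_taxa)
```

Dividing each count by the number of queries gives the fractions shown in a breakdown of this kind.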
[Figure: two phylogenetic trees, panels (a) and (b), with scale bars. Labeled tips and clades include Acidobacteria, Pyrobaculum aerophilum, Thermotoga maritima MSB8, Lactobacillus, Spirochaetes, Mycoplasmatales, Thermoplasma, Picrophilus torridus DSM 9790, Sulfolobus, Aeropyrum pernix K1, Leptospira interrogans serovar Lai str. 56601, Colwellia psychrerythraea 34H, Mycoplasma synoviae 53, Dehalococcoides sp. CBDB1, Methanosarcinaceae, Pyrococcus horikoshii OT3 and Thermoplasma acidophilum DSM 1728, alongside collapsed Bacteria and Archaea clades.]
Figure 16. Two cases of readily demonstrable horizontal gene transfer between archaea and bacteria. (a) COG0030, dimethyladenosine transferase,
an enzyme involved in rRNA methylation. (b) COG0206, FtsZ, a GTPase involved in cell division. Blue, bacteria; magenta, archaea. The trees were
constructed using the maximum likelihood method implemented in the PhyML software (278) (WAG evolutionary model; γ-distributed site-specific
rates with the shape parameter 1.0). The complete information on the analyzed sequences and the alignments are available from the authors upon
request.
‘highways’ of HGT that connect closely related or habitat- replication are much less prone to HGT than operational
gene flow, emphasizing the interplay between the two
THE PRINCIPAL PROC
EVOLUTION
Having formulated the no
world, we are now in a
processes that affect evo
so, one necessarily must
tion–genetic theory of ev
that was recently expou
essence of this theory is th
increase of complexity su
only when purifying selec
weak, i.e. substantial co
during population bottlen
complexity is not adaptiv
population–genetic proce
ifying selection is (relative
cation starts off as a ‘geno
features subsequently beco
In contrast, in ‘highly suc
fying selection is intense, a
tion is thought to be geno
The concepts of genom
streamlining embody the
tion under which the sele
of an evolving lineage (a f
tive population size and
that affects the entire co
sponding genomes (237).
perspective that is cent
[Figure 17 legend labels: extant genomes; ancestral genomes; extra- and intracellular mobilome elements; vertical inheritance; horizontal exchange; mobilome exchange.]
Figure 17. The dynamic view of the prokaryotic world. The figure is a
conceptual schematic representation that is not based on specific data.
The larger blue circles denote extant (solid lines) or ancestral (dashed
lines) archaeal and bacterial genomes. The small red circles denote
mobilome components such as plasmids or phages. Gray lines denote
vertical inheritance of genes; green lines denote recent (solid) or ancient
(dashed) HGT; red lines denote the permanent ongoing process of the
exchange of genetic material between mobilome elements. The thickness
of connecting lines reflects the intensity of gene transfer between the
respective genetic elements.
necessary and, at least, at coarse grain, sufficient, to
account for prokaryotic genome evolution.
regions are contracted. In particular, P. ubique seems to
perfectly fit this description, having no detectable
[Figure 18 diagram. Axis: genome size, Mbp (number of genes), logarithmic: 0.1 (100), 1.0 (1,000), 10 (10,000). Marked on the axis: C.r.; MG; MFL; main peak of bacterial/archaeal genome size distribution; VNL; 2nd peak of bacterial genome size distribution; S.c. Forces shown as triangles: genome degradation; genome streamlining/purifying selection; innovation: duplication, HGT, replicon fusion.]
Figure 18. The principal forces of evolution in prokaryotes and their effects on archaeal and bacterial genomes. The horizontal line shows archaeal
and bacterial genome size on a logarithmic scale (in megabase pairs) and the approximate corresponding number of genes (in parentheses). On this
axis, some values that are important in the context of comparative genomics are roughly mapped: the two peaks of genome size distribution
(Figure 2); ‘Van Nimwegen Limit’ (VNL) determined by the ‘cellular bureaucracy’ burden; the minimal genome size of free-living archaea and
bacteria (MFL); the minimal genome size inferred by genome comparison [MG, (133,135,136)]; the smallest (C.r., C. rudii); and the largest (S.c.,
S. cellulosum) known bacterial genome size. The effects of the main forces of prokaryotic genome evolution are denoted by triangles that are
positioned, roughly, over the ranges of genome size for which the corresponding effects are thought to be most pronounced.
where
pass th
direct a
bacteri
ately su
Gene
lome is
theless
mobilo
is relat
more s
of the m
tion ty
of arch
ally m
plasmid
and oc
also c
Moreo
become
be view
and the
[Figure 19 plot: DN/DS (logarithmic axis; ticks 0.01, 0.1, 1) versus number of genes (logarithmic axis; ticks 400, 1600, 6400). Rs = −0.52 (p = 7 × 10^−5); R² = 0.2523.]
Figure 19. The dependence between genome size and selection pressure
in prokaryotes. The data are from the analysis of 41 alignable tight
genome clusters (ATGCs) of bacteria and archaea [(240); P.S.
Novichkov, Y.I.W., I. Dubchak and E.V.K., unpublished data]. DN
is the median of dN, and DS is the median of dS for the respective
ATGC. The greater the DN/DS ratio, the weaker the purifying selection
acting on the genomes within an ATGC is considered to be. Rs is the
Spearman rank correlation coefficient.
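The quantities named in the Figure 19 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the cluster values below are invented toy numbers, and the helper names (`average_ranks`, `pearson`, `spearman`, `dn_ds_ratio`) are hypothetical. It only shows how a per-cluster DN/DS ratio (median dN over median dS) and a Spearman rank correlation against the number of genes could be computed.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def dn_ds_ratio(dn_values, ds_values):
    """DN/DS for one cluster: median dN divided by median dS."""
    return median(dn_values) / median(ds_values)


# Hypothetical clusters (NOT data from the paper):
# (number of genes, per-gene dN values, per-gene dS values)
clusters = [
    (600, [0.020, 0.030, 0.025], [0.10, 0.12, 0.11]),
    (1500, [0.012, 0.015, 0.010], [0.10, 0.09, 0.11]),
    (3000, [0.008, 0.010, 0.009], [0.12, 0.10, 0.11]),
    (5500, [0.004, 0.006, 0.005], [0.10, 0.11, 0.12]),
]
gene_counts = [c[0] for c in clusters]
ratios = [dn_ds_ratio(c[1], c[2]) for c in clusters]
rs = spearman(gene_counts, ratios)
print(f"Rs = {rs:.2f}")
```

Because the toy ratios fall monotonically as the gene count rises, Rs comes out at about −1 here; real data, as in the figure, would give a weaker negative value.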
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and
computational analyses
• Many diverse approaches
• All improved both by “phylogenomic” type analyses that
integrate evolutionary reconstructions and understanding
of how new functions evolve

Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics

  • 1.
    6688–6719 Nucleic Acids Research, 2008, Vol. 36, No. 21 Published online 23 October 2008 doi:10.1093/nar/gkn668 SURVEY AND SUMMARY Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world Eugene V. Koonin* and Yuri I. Wolf National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA Received June 23, 2008; Revised September 15, 2008; Accepted September 22, 2008 ABSTRACT The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with
    Very shortly, thereafter, the second bacterial genome, that of Mycoplasma genitalium, was sequenced (2), and modern comparative genomics was born. A considerable amount of sequences from diverse organisms was available prior to these reports, but the first fully sequenced bacterial genome forever changed the state of the art in genome analysis. The availability of complete genomes (i.e. with nearly all the genetic material from the given organism sequenced as opposed to, say, 90%, so that all genes are available for analysis) is crucial to the entire enterprise of comparative genomics for at least two related but distinct, fundamental reasons: (i) some caveats notwithstanding (see below), the availability of complete genome sequences (or, more precisely, full complements of genes) provides for the possibility to identify sets of orthologs, i.e. genes that evolved from the same ancestral gene in the common ancestor of the compared genomes, (ii) comparison of complete genomes (gene sets) is the necessary condition to determine not only which genes are present in any particular genome but also which
  • 2.
    Sequencing costs have fallen so dramatically that a single laboratory can now afford to sequence large, even human-sized, genomes. Ironically, although sequencing has become easy, in many ways, genome annotation has become more challenging. Several factors are responsible for this. First, the shorter read lengths of second- with some basic UNIX skills, ‘do-it-yourself’ genome annotation projects are quite feasible using present-day tools. Here we provide an overview of the eukaryotic genome annotation process, describe the available toolsets and outline some best-practice approaches. A beginner’s guide to eukaryotic genome annotation Mark Yandell and Daniel Ence Abstract | The falling cost of genome sequencing is having a marked impact on the research community with respect to which genomes are sequenced and how and where they are annotated. Genome annotation projects have generally become small-scale affairs that are often carried out by an individual laboratory. Although annotating a eukaryotic genome assembly is now within the reach of non-experts, it remains a challenging task. Here we provide an overview of the genome annotation process and the available tools and describe some best-practice approaches. STUDY DESIGNS
  • 3.
    already known phyla and have shown that only 10% of Figure 1. The temporal dynamics of genome sequencing for bacteria and archaea. Bacteria: doubling time 20 months. Archaea: doubling time 34 months.
  • 4.
    principles. The sequenced bacterial genomes span two orders of magnitude in size, from 180 kb in the intracellular symbiont Carsonella rudii (11) to 13 Mb in the soil bacterium Sorangium cellulosum (12). Remarkably, bacteria show a clear-cut bimodal distribution of genome
    Table 1. The state of genome sequencing for the archaeal and bacterial phyla (a)
    Phylum | No. of genomes sequenced | Genome size range, Mb | Representative (first genome sequenced)
    Archaea
    Crenarchaeota | 16 | 1.3–3 | Aeropyrum pernix K1
    Euryarchaeota | 34 | 1.6–5.8 | Methanocaldococcus jannaschii DSM 2661
    Korarchaeota | 1 | 1.6 | Korarchaeum cryptofilum OPF8
    Nanoarchaeota | 1 | 0.5 | Nanoarchaeum equitans Kin4-M
    Bacteria
    Acidobacteria | 2 | 5.7–10.0 | Acidobacteria bacterium Ellin345
    Actinobacteria | 54 | 0.9–9.7 | Mycobacterium tuberculosis H37Rv
    Aquificae | 2 | 1.6–1.8 | Aquifex aeolicus VF5
    Bacteroidetes/Chlorobi group | 21 | 0.3–6.3 | Chlorobium tepidum TLS
    Chlamydiae/Verrucomicrobia group | 16 | 1.0–6.0 | Chlamydia trachomatis D/UW-3/CX
    Chloroflexi | 7 | 1.3–6.7 | Dehalococcoides ethenogenes 195
    Chrysiogenetes | 0 | N/A | N/A
    Cyanobacteria | 33 | 1.6–9.0 | Synechocystis sp. PCC 6803
    Deinococcus–Thermus group | 4 | 2.1–3.2 | Deinococcus radiodurans R1
    Firmicutes (Gram-positive bacteria) | 150 | 0.6–6.0 | Mycoplasma genitalium G37
    Fusobacteria | 1 | 2.2 | Fusobacterium nucleatum subsp. nucleatum ATCC 25586
    Gemmatimonadetes | 0 | N/A | N/A
    Nitrospirae | 0 | N/A | N/A
    Planctomycetes | 1 | 7.1 | Rhodopirellula baltica SH 1
    Proteobacteria | 353 | 0.2–13.0 | Haemophilus influenzae Rd KW20
    Spirochaetes | 13 | 0.9–4.7 | Borrelia burgdorferi B31
    Synergistetes | 0 | N/A | N/A
    Thermodesulfobacteria | 0 | N/A | N/A
    Thermotogae | 7 | 1.8–2.2 | Thermotoga maritima MSB8
    (a) The classification is from the NCBI taxonomy as of 10 June 2008 (http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy).
    Figure 1. The temporal dynamics of genome sequencing for bacteria and archaea. Bacteria: doubling time 20 months. Archaea: doubling time 34 months.
  • 5.
    such as that of the microsporidian Encephalitozoon Figure 2. Distribution of genome sizes among bacteria and archaea. The distribution curves were obtained by Gaussian-kernel smoothing of the individual data points (276).
  • 8.
    Figure 3. Density of protein-coding genes in bacterial and archaeal genomes. The distribution curves were obtained by Gaussian-kernel smoothing of the individual data points (276).
  • 9.
    t has become weenbacteria d, the mimi- 8) and so is stly, parasitic d, nearly, the ving archaea to be abun- r side of the otic genomes, cephalitozoon Figure 4. Length distributions of protein-coding genes (a) and intergenic regions (b) in bacterial and archaeal genomes. The distribution curves were obtained by Gaussian-kernel smoothing of the individual data points (276).
  • 10.
    [Scatter plot, Nature Reviews | Genetics: log10 (gene size) (bp) versus log10 (genome size) (Mbp). Genomes shown: Escherichia coli, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Arabidopsis thaliana, Volvox carteri, Drosophila melanogaster, Citrus clementina, Takifugu rubripes, Oryza sativa, Populus trichocarpa, Eucalyptus grandis, Gallus gallus, Danio rerio, Ornithorhynchus anatinus, Zea mays, Ailuropoda melanoleuca, Mus musculus, Homo sapiens. Groups: Fungi, Invertebrates, Plants, Bacteria, Mammalian vertebrates, Non-mammalian vertebrates.] Figure 1 | Genome and gene sizes for a representative set of genomes. Gene size is plotted as a function of genome size for some representative bacteria, fungi, plants and animals. This figure illustrates a simple rule of thumb: in general, bigger genomes have bigger genes. Thus, accurate annotation of a larger genome requires a more contiguous genome assembly in order to avoid splitting genes across scaffolds. Note too that although the human and mouse genomes deviate from the simple linear model shown here, the trend still holds. Their unusually large genes are likely to be a consequence of the mature status of their annotations, which are much more complete as regards annotation of alternatively spliced transcripts and untranslated regions than those of most other genomes.
  • 11.
    Gene Density. ple sequence repeats (e.g., microsatellites and minisatellites), gene duplications (both tandem arrays and pseudogenes), and transposable elements. Although bacterial and ar- [Diagram: two 50-kb segments (scale 0 to 50 kb), A Human and B Escherichia coli. Key: gene; human pseudogene; repetitive DNA element.] FIGURE 7.2. Genome density. Comparison of the genome density and content of humans and Escherichia coli. Each segment is 50 kb in length and represents (A) a portion of the human β T-cell receptor locus and (B) a region of the E. coli K12 genome. Note the much greater proportion of genes (red boxes) in E. coli compared to humans.
  • 12.
    Number of Genes. DNA or selfish DNA. Junk DNA appears to provide little benefit or no function to the organism. (In some cases this designation is a misnomer resulting from a lack of infor- [Scatter plot: number of genes (0 to 30,000) versus genome size (10^5 to 10^10, logarithmic). Groups: Bacteria, Archaea, Eukaryotes, Viruses.] FIGURE 7.3. Genome size vs. number of protein-coding genes. The number of genes is highly correlated to genome size for bacteria, archaea, and viruses, but less so for eukaryotes. Many archaeal points (blue triangles) are hidden under bacterial ones (yellow squares).
  • 13.
    Operons. [Diagram of the lac operon: CAP site, promoter, operator, lacZ, lacY, lacA. Labels: lactose permease transports lactose into the cell; β-galactosidase (+ transacetylase) split lactose to galactose + glucose; chemical structures of lactose, galactose and glucose.] FIGURE 7.4. Lac operon from Escherichia coli. This operon consists of three genes whose transcription is regulated by a single promoter. The genes encode proteins involved in utilizing lactose, including a permease (encoded by lacY), which brings lactose into the cell from the outside, and two enzymes (encoded by lacZ and lacA), which split lactose into glucose + galactose (see pp. 52–53). mation. Some stretches of “junk DNA” have been determined to be involved in gene reg-
  • 14.
  • 15.
    Box 2 | Gene prediction versus gene annotation Although the terms ‘gene prediction’ and ‘gene annotation’ are often used as if they are synonyms, they are not. With a few exceptions, gene predictors find the single most likely coding sequence (CDS) of a gene and do not report untranslated regions (UTRs) or alternatively spliced variants. Gene prediction is therefore a somewhat misleading term. A more accurate description might be ‘canonical CDS prediction’. Gene annotations, conversely, generally include UTRs, alternative splice isoforms and have attributes such as evidence trails. The figure shows a genome annotation and its associated evidence. Terms in parentheses are the names of commonly used software tools for assembling particular types of evidence. Note that the gene annotation (shown in blue) captures both alternatively spliced forms and the 5′ and 3′UTRs suggested by the evidence. By contrast, the gene prediction that is generated by SNAP (shown in green) is incorrect as regards the gene’s 5′ exons and start-of-translation site and, like most gene-predictors, it predicts only a single transcript with no UTR. Gene annotation is thus a more complex task than gene prediction. A pipeline for genome annotation must not only deal with heterogeneous types of evidence in the form of the expressed sequence tags (ESTs), RNA-seq data, protein homologies and gene predictions, but it must also synthesize all of these data into coherent gene models and produce an output that describes its results in sufficient detail for these outputs to become suitable inputs to genome browsers and annotation databases. [Figure, Nature Reviews | Genetics: genome browser view spanning roughly 226,500 to 229,500 bp, with tracks for the gene annotation resulting from synthesizing all available evidence (two alternative splice forms, with 5′UTR and 3′UTR), protein evidence (BLASTX), mRNA or EST evidence (Exonerate) and gene prediction (SNAP); start codon and stop codon marked.]
  • 16.
    For instance most likelyc untranslated transcripts (B gene predict such as codo exon length regions and Most gene pr eter files that genomes, su nogaster, Ar However, un to an organis are available, the genome t organisms c codon usage Given eno tivity of ab in However, th structures is important to existing, hig perfect geno produce high sets are rarel In princi
    Box 3 | Non-coding RNAs Non-coding RNA (ncRNA) annotation is still in its infancy compared with protein-coding gene annotation, but it is advancing rapidly. The heterogeneity and poorly conserved nature of many ncRNA genes present major challenges for annotation pipelines. Unlike protein-encoding genes, ncRNAs are usually not well-conserved at the primary sequence level; even when they are, nucleotide homologies are not as easily detected as protein homologies, which limits the power of evidence-based approaches. One common approach is to identify ncRNA genes using conserved secondary structures and motifs. Established examples of these types of tools include tRNAscan-SE (118) and Snoscan (119). MicroRNA (miRNA) gene finders are also available (120). A more general approach is first to align nucleotide sequences (genomic, RNA-seq and ESTs) from closely related organisms to the target genome and then search these for signs of conserved secondary structures. This is a complex process, however, and can require substantial computational resources; qRNA is one such tool (121); another is StemLoc (122). Be aware that these tools have high false-positive rates. RNA sequencing is also greatly aiding ncRNA identification. For example, miRNAs can be directly identified using specialized RNA preps and sequencing protocols (123,124).
    Even with such sophisticated tools and techniques, distinguishing between bona fide ncRNA genes, spurious transcription and poorly conserved protein-encoding genes that produce small peptides remains difficult, especially in the cases of long intergenic non-coding RNAs (lincRNAs) (125,126) and expressed pseudogenes (127,128). Another approach is to annotate possible ncRNA genes liberally and then use Infernal (129) and Rfam (114) to triage and classify these genes based on primary and secondary sequence similarities. Even with these resources, however, many ncRNAs will remain unclassifiable. Currently, ncRNA annotation is cutting edge, and those using ncRNA annotations should bear in mind that ncRNA annotation accuracies are generally much lower than those of their protein-coding counterparts.
  • 17.
    GC- and AT-skews, i.e. excess of purines or pyrimidines (violation of Chargaff’s second parity rule). The underlying causes of the GC/AT-skews are thought to reflect an interplay of selective and mutational forces, i.e. selection
    One of the earliest and central concepts of bacterial genetics is the operon, a group of cotranscribed and coregulated genes (82). Although enormous amount of variation on the simple theme of regulation by the Lac
    Figure 13. Evolution of gene order in bacteria and archaea: genomic dot-plots. (a) Colinearity with a few breakpoints between closely related bacteria: Geobacillus thermodenitrificans versus Geobacillus kaustophilus; (b) X-shaped pattern between moderately diverged bacteria: Shewanella sp. MR-4 versus Shewanella oneidensis; (c) X-shaped pattern between moderately diverged archaea: Pyrococcus horikoshii OT3 versus Pyrococcus abyssi GE5; and (d) No clear pattern between more distantly related bacteria: Streptococcus gordonii str. Challis versus Streptococcus pneumoniae R6. In each panel, the genome indicated first is plotted along the vertical axis.
  • 18.
    Gene Order
  • 19.
  • 20.
    [Diagram: a circular genome with its origin of replication and terminus of replication, artificially opened into a line (origin, terminus, origin again), and a dot-plot frame with axes Genome 1 and Genome 2, each marked O (origin) and T (terminus).]
  • 21.
  • 22.
    [Dot plot: E. coli K12 versus E. coli O157:H7, with three circled regions labeled Island, Inversion and Repeat.] FIGURE 7.10. Conserved gene order in the backbone of Escherichia coli K12 and O157:H7. The two genomes were aligned with each other and the matching regions were plotted. The conserved order of genes in the backbone of the two E. coli strains is indicated by the diagonal line. Three important genomic regions are circled. An island present in one of the two strains causes a slight shift in the position of the main diagonal.
    they also occur in virtually the same order in both strains (Fig. 7.10). The genes unique to each strain are clustered into “islands” interspersed among the stretches of common genes. Similar patterns of DNA “islands” within a conserved genome backbone have been found among other related bacteria or archaea.
  • 23.
    mon is symmetric inversion around the origin of replication (Fig. 7.14). Such inversions are seen in almost every comparison of moderately closely related strains or species. Although other rearrangements occur, the symmetric inversions serve as a useful tool for understanding some features of general evolution and we focus on them here. Symmetric inversions around the origin are due to a combination of mutation bias and selection bias. To understand how mutation bias could cause this, it is helpful to un-
    [Dot plot: H. influenzae Rd chromosome (horizontal axis, 0 to 1,830,137) versus H. pylori 26695 chromosome (vertical axis, 0 to 1,667,867).] FIGURE 7.11. The lack of conservation of gene order between Haemophilus influenzae and Helicobacter pylori is illustrated. Linearized chromosomes of H. influenzae and H. pylori are plotted on the horizontal and vertical axes, respectively. Each dot represents a single pair of orthologous proteins. Genes in similar operons, which do exist, are too close together to give separated points on the scale used.
  • 24.
    cently replicated DNA, thereby causing an inversion. As the two replication forks should
    [Gene-order diagram. Genomes: Sinorhizobium meliloti, Bacillus subtilis, Borrelia burgdorferi, Treponema pallidum, Helicobacter pylori, Escherichia coli, Haemophilus influenzae, Rickettsia prowazekii, Mycoplasma sp., Aquifex aeolicus, Thermotoga maritima, Deinococcus radiodurans, Mycobacterium tuberculosis, Chlamydia sp., Synechocystis, Archaea. Cluster labels: S6; SUI1-X1; S-4E; L32-L19; X2; cdk-L1--ccm-mms; rpoBC; str; S10; spc; alpha; S4. Key: small SU r-protein genes; large SU r-protein genes; nonribosomal genes; unknown genes; breakpoint; gene insertion; Rho-independent terminator; missing gene. Gene order as extracted: L11(rplK) L1(rplA) L10(rplJ) L7/L12(rplL) rpoB rpoC unknown S12(rpsL) S7(rpsG) fusA tufA S10(rpsJ) L3(rplC) L4(rplD) L23(rplW) L2(rplB) S19(rpsS) L22(rplY) S3(rpsC) L16(rplP) L29(rpmC) S17(rpsQ) L14(rplN) L24(rplX) L5(rplE) S14(rpsN) S8(rpsH) L6(rplF) L18(rplR) S5(rpsE) L30(rpmD) L15(rplO) secY adk map infA L36(rpmJ) S13(rpsM) S11(rpsK) S4(rpsD) rpoA L17(rplQ).]
    FIGURE 7.12. Conservation of gene order of ribosomal protein operons across bacterial and archaeal species.
  • 25.
    V. cholerae vs. E. coli, All. [Dot plot: E. coli coordinates (0 to 5,000,000) versus V. cholerae coordinates (0 to 3,000,000).] Eisen et al., 2000
  • 26.
    V. cholerae vs. E. coli, Best. [Dot plot: E. coli coordinates (0 to 5,000,000) versus V. cholerae coordinates (0 to 3,000,000).] Eisen et al., 2000
  • 27.
    V. cholerae vs. E. coli, Rotated. [Dot plot: E. coli ORF coordinates (0 to 5,000,000) versus V. cholerae ORF coordinates (0 to 3,000,000).] Eisen et al., 2000
  • 28.
    Duplication and Gene Loss Model. Eisen et al., 2000
  • 29.
    V. cholerae vs. E. coli, Orthologs on Both Diagonals. [Dot plot: E. coli ORF coordinates (0 to 5,000,000) versus V. cholerae ORF coordinates (0 to 3,000,000).] Eisen et al., 2000
  • 30.
    C. trachomatis vs C. pneumoniae. [Dot plot: C. trachomatis MoPn versus C. pneumoniae AR39, with origin and terminus marked.]
  • 31.
    Symmetric Inversion Model. [Diagram: circular gene orders (genes 1 to 32) for the common ancestor of A and B and for the descendant genomes A1, A2, A3 and B1, B2, B3, related by successive inversions around the origin (*) and inversions around the terminus (*).] Eisen et al., 2000
  • 32.
    The X-Files. [Dot plots showing X-shaped patterns; panel titles: Streps; B. subt vs. Staph; M. tb vs. M. leprae (Mycobacterium tuberculosis versus Mycobacterium leprae); Pyrococcus; Thermoplasmas; Pseudomonas.]
  • 33.
    Functional Prediction
    • Identification of motifs
      - Short regions of sequence similarity that are indicative of general activity
      - e.g., ATP binding
    • Homology/similarity based methods
      - Gene sequence is searched against a database of other sequences
      - If significantly similar genes are found, their functional information is used
    • Problem
      - Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
  • 34.
    corresponds to the ‘cloud’ (24 000 clusters) that consists of genes shared by a small number of organisms. The possibility exists that the size of the cloud is somewhat inflated,
    [Bar chart: coverage, 0% to 100%, for the following genomes: Aeropyrum pernix K1; Archaeoglobus fulgidus DSM 4304; Halobacterium sp. NRC-1; Methanothermobacter thermautotrophicus str. Delta H; Methanocaldococcus jannaschii DSM 2661; Methanosarcina acetivorans C2A; Methanopyrus kandleri AV19; Pyrococcus horikoshii OT3; Thermoplasma volcanium GSS1; Nanoarchaeum equitans Kin4-M; Mycobacterium tuberculosis H37Rv; Streptomyces coelicolor A3(2); Bifidobacterium longum NCC2705; Aquifex aeolicus VF5; Bacteroides thetaiotaomicron VPI-5482; Salinibacter ruber DSM 13855; Chlorobium tepidum TLS; Chlamydia muridarum Nigg; Candidatus Protochlamydia amoebophila UWE25; Dehalococcoides ethenogenes 195; Synechocystis sp. PCC 6803; Anabaena variabilis ATCC 29413; Prochlorococcus marinus subsp. marinus str. CCMP1375; Deinococcus radiodurans R1; Thermus thermophilus HB27; Solibacter usitatus Ellin6076; Bacillus subtilis subsp. subtilis str. 168; Lactococcus lactis subsp. lactis Il1403; Clostridium acetobutylicum ATCC 824; Mesoplasma florum L1; Fusobacterium nucleatum subsp. nucleatum ATCC 25586; Pirellula sp.; Agrobacterium tumefaciens str. C58; Burkholderia mallei ATCC 23344; Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough; Escherichia coli K12; Pseudomonas aeruginosa PAO1; Leptospira interrogans serovar Copenhageni str. Fiocruz L1-…; Treponema pallidum subsp. pallidum str. Nichols; Thermotoga maritima MSB8.]
    Figure 5. Coverage of bacterial and archaeal genomes with cluster of orthologous genes. The COGs were from the EggNOG database (41), and the proteins from each genome were assigned to these clusters using a modified COGNITOR method (42).
  • 35.
  • 36.
    PRINCIPL ARCHITE Almost im genome se in bacteria served (4,
    [Pie charts for Pyrobaculum aerophilum IM2, Methanosarcina barkeri Fusaro, Aquifex aeolicus VF5, Lactobacillus casei ATCC 334 and Pseudomonas aeruginosa PAO1. Segment values as extracted (assignment to genomes and classes not recoverable): 347, 218, 656, 342, 1042, 1104, 492, 1029, 444, 555, 289, 157, 517, 320, 261, 930, 225, 751, 323, 542, 1386, 492, 1923, 1002, 765.]
    Figure 10. Distribution of predicted gene functional classes for selected archaeal and bacterial genomes. Red, information processing genes; blue, genes involved in cellular functions; green, genes involved in metabolism and transport; light gray, general prediction only; dark gray, no prediction. The function class assignment is based on the inclusion of the respective genes in COGs (34). Figure 11. D logs for info and replicat are operatio by Gaussian
  • 37.
    velopment, the genome of the s found to contain only 170 n any estimates of the minimal er, this unusual organism lacks ent in all other known bacteria roteins that appear to be indis- minoacyl-tRNA synthetases. At xplanation is that this organism teins from the host cell, thereby traint affecting other prokaryo- s, even intracellular ones (133). ella is a case of a bacterium-to- gress (142). The minimal com- c organism growing on a rich n at approximately 250 genes. rrently known free-living organ- .3 Mb in size, with 1100 genes n these genomes contain up to erally, nonessential, it is reason- gene set for a free-living organ- d number of approximately 1000 wide spread of nonorthologous mal prokaryotic gene set is not nes. Instead, there can be a large sms with diverse life styles but, of genes (135). tions, perhaps, are what deter- y of bacterial and archaeal gen- g gives the upper bound to this s problem, we turn to the ana- functional categories of genes already referred to in the above uction systems. As first noticed ver et al. (143) in the course of e bacterium Pseudomonas aeru- ail by Van Nimwegen (130) and ly confirmed and explored by showed an exponent greater than one (Figure 14b). Figure 14. Scaling of genes in different functional categories with the total number of genes in archaeal and bacterial genomes. (a) Data for individual protein-coding genes. (b) Data for COGs. The function class assignment is based on the inclusion of the respective genes in COGs (34).
  • 38.
    ents between bacteria ht be, it seems appro- zed functional devices elated organisms). demonstration and ecent HGT is detect- composition, oligonu- and other ‘linguistic’ at reveal horizontally nomalous for a given rizontally transferred tively high rate as the d’ during evolution HGT between closely probably, not comple- gation, bacteriophage- mation (159). d HGT among closely GT across long evolu- on the evolution of tter of intense debate. d ample indications of ery distant organisms, The first clear-cut indi- l HGT were obtained hermophilic bacteria, T. maritima (178), con- haracteristic archaeal as well as proteins a and bacteria but ity to the latter than Figure 15. The taxonomic breakdown of the best database hits for proteins encoded in diverse bacterial and archaeal genomes. (a) A mesophilic bacterium, Bifidobacterium longum (Biflo), compared to a hyperthermophilic bacterium, T. maritima (Thema). (b) A mesophilic archaeon, M. mazei (Metma), compared to a hyperthermophilic archaeon, Sulfolobus solfataricus (Sulso). The best hits were obtained by processing the results of the searches of the NCBI’s nonredundant protein sequence database using the BLASTP program (277).
  • 39.
    Figure 16. Two cases of readily demonstrable horizontal gene transfer between archaea and bacteria. (a) COG0030, dimethyladenosine transferase, an enzyme involved in rRNA methylation. (b) COG0206, FtsZ, a GTPase involved in cell division. Blue, bacteria; magenta, archaea. The trees were constructed using the maximum likelihood method implemented in the PhyML software (278) (WAG evolutionary model; γ-distributed site-specific rates with the shape parameter 1.0). The complete information on the analyzed sequences and the alignments are available from the authors upon request.
  • 40.
    ‘highways’ of HGT that connect closely related or habitat- replication are much less prone to HGT than operational
  • 41.
    gene flow, emphasizing the interplay between the two THE PRINCIPAL PROC EVOLUTION Having formulated the no world, we are now in a processes that affect evo so, one necessarily must tion–genetic theory of ev that was recently expou essence of this theory is th increase of complexity su only when purifying selec weak, i.e. substantial co during population bottlen complexity is not adaptiv population–genetic proce ifying selection is (relative cation starts off as a ‘geno features subsequently beco In contrast, in ‘highly suc fying selection is intense, a tion is thought to be geno The concepts of genom streamlining embody the tion under which the sele of an evolving lineage (a f tive population size and that affects the entire co sponding genomes (237). perspective that is cent
    Figure 17. The dynamic view of the prokaryotic world. The figure is a conceptual schematic representation that is not based on specific data. The larger blue circles denote extant (solid lines) or ancestral (dashed lines) archaeal and bacterial genomes. The small red circles denote mobilome components such as plasmids or phages. Gray lines denote vertical inheritance of genes; green lines denote recent (solid) or ancient (dashed) HGT; red lines denote the permanent ongoing process of the exchange of genetic material between mobilome elements. The thickness of connecting lines reflects the intensity of gene transfer between the respective genetic elements.
  • 42.
    necessary and, at least, at coarse grain, sufficient, to account for prokaryotic genome evolution. regions are contracted. In particular, P. ubique seems to perfectly fit this description, having no detectable
    Figure 18. The principal forces of evolution in prokaryotes and their effects on archaeal and bacterial genomes. The horizontal line shows archaeal and bacterial genome size on a logarithmic scale (in megabase pairs) and the approximate corresponding number of genes (in parentheses). On this axis, some values that are important in the context of comparative genomics are roughly mapped: the two peaks of genome size distribution (Figure 2); the ‘Van Nimwegen Limit’ (VNL) determined by the ‘cellular bureaucracy’ burden; the minimal genome size of free-living archaea and bacteria (MFL); the minimal genome size inferred by genome comparison [MG, (133,135,136)]; and the smallest (C.r., C. ruddii) and the largest (S.c., S. cellulosum) known bacterial genome sizes. The effects of the main forces of prokaryotic genome evolution (genome degradation; genome streamlining/purifying selection; innovation by duplication, HGT and replicon fusion) are denoted by triangles that are positioned, roughly, over the ranges of genome size for which the corresponding effects are thought to be most pronounced.
  • 43.
    where pass th direct a bacteri atelysu Gene lome is theless mobilo is relat more s of the m tion ty of arch ally m plasmid and oc also c Moreo become be view and the
    Figure 19. The dependence between genome size and selection pressure in prokaryotes. The plot shows DN/DS against the number of genes; Rs = −0.52 (p = 7 × 10^−5); R² = 0.2523. The data are from the analysis of 41 alignable tight genome clusters (ATGCs) of bacteria and archaea [(240); P.S. Novichkov, Y.I.W., I. Dubchak and E.V.K., unpublished data]. DN is the median of dN, and DS is the median of dS for the respective ATGC. The greater the DN/DS, the lower the pressure of purifying selection affecting the evolution of the genomes within an ATGC is considered to be. Rs is the Spearman rank correlation coefficient.
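    The Spearman rank correlation reported in Figure 19 is the Pearson correlation computed on ranks rather than on raw values. The following is a minimal standard-library sketch of that calculation; the helper names (`average_ranks`, `spearman`) and the five data points are hypothetical and do not reproduce the ATGC data set.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical genome clusters: gene counts and DN/DS ratios.
gene_counts = [600, 1500, 2400, 4200, 6000]
dn_ds = [0.30, 0.12, 0.15, 0.06, 0.04]

rho = spearman(gene_counts, dn_ds)
print(f"Spearman Rs = {rho:.2f}")
```

    A negative value, as in the figure, corresponds to larger genomes tending to have lower DN/DS, that is, to evolve under stronger purifying selection. Because only ranks are used, the statistic does not assume a linear relationship between the two quantities.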
  • 44.
    Slides for UCDavis EVE161 Course Taught by Jonathan Eisen Winter 2014
    Functional Prediction by Phylogeny
    • Key step in genome projects
    • More accurate predictions help guide experimental and computational analyses
    • Many diverse approaches
    • All improved both by “phylogenomic” type analyses that integrate evolutionary reconstructions and understanding of how new functions evolve