2. Microbial Genomics
General features of microbial genomes
Historical overview
Genome sequencing, annotation and analysis
Genome evolution
What can we learn from a genome sequence?
3. General features of genomes
Microbial
Small WYSIWYG genomes (Mbp)
Gene density high (>90%); intergenic regions short
Very little repetitive or non-coding DNA
Introns very rare
Protein-coding genes (CDS) short (~1 kbp)
Operons with promoters just upstream
Fewer non-coding RNAs
Human
Very large genomes (Gbp)
Gene density low
Only 25% is genes
Introns mean only 1% codes
Genes can span ≥30 kbp
Genes have ~3 transcripts
Splicing and splice variants
4. Bacterial genome organisation
Chromosomes
Most commonly a single circular chromosome (always DNA)
BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
BUT a few species have two chromosomes (e.g. Vibrio cholerae)
Can be a mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
Plasmids
Independent autonomous replicon, can be circular or linear
May integrate into chromosome
Copy number varies from 1 to 10s
Often carry non-essential genes that confer an adaptive advantage in certain conditions
5. Bacterial Genome Size
Species that occupy restricted ecological niches (e.g. obligate intracellular parasites and endosymbionts) tend to have smaller genomes (<1.5 Mb) than generalist bacteria
Smallest known bacterial genome: Carsonella ruddii, 160 kb! (Nakabachi et al. 2006)
BUT mitochondrial genomes are smaller
Largest genomes found in bacteria with complex developmental cycles, e.g. Streptomyces
Largest bacterial genome: Sorangium cellulosum, 13 Mb
6. Bacterial genomes are made from DNA
In 1944, Oswald Avery, Colin MacLeod, and Maclyn
McCarty showed that DNA (not proteins) was the genetic
material responsible for inheritance.
Identified DNA as the "transforming principle" while studying
Streptococcus pneumoniae
Avery, Oswald T., Colin M. MacLeod, and Maclyn McCarty.
Studies on the chemical nature of the substance inducing
transformation of pneumococcal types. Journal of Experimental
Medicine. 1944 Feb 1; 79(2): 137-158.
In 1952, this work was supported by Alfred Hershey and
Martha Chase who showed that only the DNA of a virus
needs to enter a bacterium to infect it.
Used radioactively labelled bacteriophage
Hershey AD and Chase M. Independent functions of viral
protein and nucleic acid in growth of bacteriophage. Journal of
General Physiology. 1952. 36: 39-56.
7. Viral genomes are variable
Use RNA or DNA but not both in genome
Some have RNA genomes!
Grouped into families depending on type of genome: DNA or RNA, single- or double-stranded
Typically dozens of genes or fewer
Large genomes in pox viruses (~200 kb)
Massive genomes in megaviruses (1 Mbp!)
8. Microbial Genomics Timeline
Year Milestone
1977 Invention of dideoxy chain terminator sequencing (“Sanger sequencing”)
1977 Sequencing of the 5.4-kilobase genome of bacteriophage phiX174
1981 First human mitochondrial genome sequence*
1982 Determination of the 48.5-kilobase genome sequence of bacteriophage lambda through first use of shotgun sequencing
1986 Development of automated fluorescent sequencing
1995 First complete genome sequences obtained of free-living bacteria (Haemophilus influenzae and Mycoplasma genitalium)
1996 Mycoplasma becomes first bacterial genus that has completely sequenced genomes from two different species (M. genitalium and M. pneumoniae)
1997 First genome sequences from Escherichia coli and Bacillus subtilis
1998 First genome sequence from Mycobacterium tuberculosis; genome sequence from Rickettsia prowazekii provides first evidence of reductive evolution
9. Microbial Genomics Timeline
Year Milestone
1999 Helicobacter pylori becomes the first species with completely sequenced genomes from two isolates
2000 Meningococcal genome sequence primes first application of reverse vaccinology
2001 Second E. coli genome sequence reveals unexpected level of horizontal gene transfer; genome sequence of M. leprae provides compelling evidence of bacterial pseudogenes and reductive evolution; first paper reporting genome sequences of two strains from one species (Staphylococcus aureus) in a single publication
2002 Genome sequencing of multiple strains of Bacillus anthracis to provide markers for forensic epidemiology
2003 Genome sequencing of uncultivable Tropheryma whipplei leads to design of axenic growth medium
2004 Genome sequence of mimivirus blurs distinctions between bacteria and viruses
2005 Whole-genome sequencing used to identify target of new anti-tuberculosis drug; Mycoplasma genitalium genome sequenced using pyrosequencing
2006-2011 Bacterial metagenomics survey of the Sargasso Sea yields >1 million new genes; rise of next-generation or high-throughput sequencing
10. The first genome sequences
The first sequenced gene was from bacteriophage MS2
The gene encoding the coat protein
1972
Min Jou W, Haegeman G, Ysebaert M, and Fiers W. Nucleotide
sequence of the gene coding for the bacteriophage MS2 coat
protein. Nature. 1972 May 12; 237(5350): 82-88.
The first sequenced genome was bacteriophage MS2
1976
RNA genome is 3,569 nucleotides
Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant
D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van
den Berghe A, Volckaert G, and Ysebaert M. Complete
nucleotide sequence of bacteriophage MS2 RNA: primary and
secondary structure of the replicase gene. Nature. 1976 Apr 8;
260(5551): 500-507.
11. The first genome sequences
The first sequenced DNA genome was bacteriophage ΦX174
1977
5,386 base pairs
Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes
CA, Hutchison CA, Slocombe PM, and Smith M. Nucleotide
sequence of bacteriophage phi X174 DNA. Nature. 1977 265
(5596): 687-695.
The first sequenced bacterial genome was Haemophilus
influenzae
1995
1,830,140 base pairs
Fleischmann R, Adams M, White O, Clayton R, Kirkness
E, Kerlavage A, Bult C, Tomb J, Dougherty B, and Merrick J.
Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science, 1995. 269 (5223): 496-
512.
12. Overview of a genome project
Choose strain: fresh isolate or tractable lab strain?
Choose strategy: shotgun sequencing; paired-end sequencing; draft or complete?
Choose chemistry: Sanger; 454; Illumina; Ion Torrent
Assembly: automated
Closure and finishing: manually intensive; difficulty depends on how repetitive the genome is
Data release: immediate or delayed?
Annotation: manually intensive bottleneck
Publication
13. Methods for genome sequencing – historic
Sanger method sequencing
Sanger F and Coulson AR. A rapid method for
determining sequences in DNA by primed synthesis
with DNA polymerase. Journal of Molecular Biology.
1975 94: 441-448.
Step 1, a sequence-specific DNA primer is radiolabeled
Step 2, the primer is annealed to the template DNA
Step 3, the primer is extended by DNA polymerase
Incorporation of a deoxynucleotide - further extension possible
Incorporation of a dideoxynucleotide – chain termination
Four reactions set up
ddATP, dATP, dCTP, dGTP, dTTP
ddCTP, dATP, dCTP, dGTP, dTTP
ddGTP, dATP, dCTP, dGTP, dTTP
ddTTP, dATP, dCTP, dGTP, dTTP
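The logic of the four reactions can be sketched in Python. This is a minimal illustration rather than a model of the chemistry: it assumes enough template molecules that chain termination is observed at every position where the dideoxynucleotide can be incorporated. The extension sequence, the primer length and the function names are hypothetical.

```python
def lane_fragments(extension, primer_len, dd_base):
    """Fragment lengths seen in the reaction that contains dd_base.

    `extension` is the strand synthesised onto the primer (5'->3').
    Assumption: termination is observed at every position where
    dd_base can be incorporated.
    """
    return [primer_len + i + 1
            for i, base in enumerate(extension)
            if base == dd_base]


def read_gel(lanes):
    """Read the sequence from the shortest to the longest fragment."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[length] for length in sorted(by_length))


extension = "GATTACAGGC"  # hypothetical newly synthesised strand
lanes = {dd: lane_fragments(extension, primer_len=20, dd_base=dd)
         for dd in "ACGT"}
print(lanes)            # one list of fragment lengths per ddNTP reaction
print(read_gel(lanes))  # sequence recovered from the four lanes
```

Each fragment length occurs in exactly one lane, so sorting the fragments by size and noting the lane gives the base order.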
15. Methods for genome sequencing – automated Sanger sequencing
Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C,
Kent SBH, and Hood LE. Fluorescence detection in automated DNA
sequence analysis. Nature. 1986 321: 674-679.
Replaced radioisotopes with fluorescent dyes
Safer for the researchers
Each of the four DNA bases could be dyed a different colour
Eliminated the need to run separate reactions in separate lanes
The migration of the dye could be read because of the fluorescence
This information allowed automatic gel reading
Further improvements were made
Improved dye chemistry using fluorescent dideoxy-terminators (DuPont): Prober
JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ,
Jensen MA, and Baumeister K. A system for rapid DNA sequencing with
fluorescent chain-terminating dideoxynucleotides. Science 1987 238: 336-341.
Replacing slab gels with re-useable capillary tubes: Ruiz-Martinez MC, Berka J,
Belenkii A, Foret F, Miller AW, and Karger BL. DNA sequencing by capillary
electrophoresis with replaceable linear polyacrylamide and laser-induced
fluorescence detection. Analytical Chemistry 1993 65: 2851-2858.
16. Whole-Genome Shotgun Sanger Sequencing
Bacterial chromosome
Random shearing
Size selection
Cloning into plasmid vector to create shotgun library
Pick colonies
Plasmid preps
Sequence each insert with two primers
17. High-throughput Sequencing
100x faster, 100x cheaper!
A disruptive technology
Several technologies in the marketplace from 2007
onwards
454 (Roche)
Illumina
Ion Torrent
PacBio
Fundamentally new approaches
Solid-phase amplification of clonal templates in “molecular
colonies”
Massive increase in number of “clones” compensates for shorter
read length
New chemistries for sequence reading
454: pyrophosphate detection on base addition
Illumina: reversible de-protection of fluorescent bases
19. 454 sequencing
Emulsion-based clonal amplification
Anneal sstDNA to an excess of DNA Capture Beads
Emulsify beads and PCR reagents in water-in-oil microreactors
Clonal amplification occurs inside microreactors
Break microreactors, enrich for DNA-positive beads
20. Pyrosequencing
DNA template with primer mixed with the enzymes along with the two substrates adenosine 5'-phosphosulfate (APS) and luciferin
1. One of the four nucleotides added to reaction
2. If complementary to base in template strand then DNA polymerase incorporates it
3. Pyrophosphate (PPi) released then converted to ATP by sulfurylase in the presence of APS
4. ATP serves as a substrate for luciferase, causing a light reaction
5. Excess nucleotides degraded by apyrase
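The cycle above can be sketched as a toy flowgram in Python. This is an idealised illustration: it assumes the light signal is exactly proportional to the number of bases incorporated in a flow, which real homopolymer runs only approximate. The flow order "TACG" and the function names are assumptions made for the example.

```python
def flowgram(extension, flow_order="TACG", max_flows=100):
    """Return (nucleotide, signal) for each flow.

    `extension` is the strand being synthesised (5'->3'). The signal is
    the number of bases incorporated in that flow (idealised light units).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(extension) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # The polymerase keeps adding the dispensed nucleotide while it matches.
        while pos < len(extension) and extension[pos] == nucleotide:
            incorporated += 1
            pos += 1
        signals.append((nucleotide, incorporated))
        flow += 1  # apyrase removes the excess before the next flow
    return signals


def call_bases(signals):
    """Turn flow signals back into a base sequence."""
    return "".join(nucleotide * count for nucleotide, count in signals)


signals = flowgram("TTAGC")
print(signals)
print(call_bases(signals))
```

A flow with signal 0 means the dispensed nucleotide was not complementary; a signal of 2 means a run of two identical bases.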
22. The Sequence Assembly Problem
Sequencing technologies generate reads of <1000
bp
These reads must be assembled into a single
continuous genomic sequence.
Shotgun sequencing exploits many overlapping
sequences (high coverage) to infer ordering directly
from the sequences themselves
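How overlaps alone can order reads is sketched below with a greedy merge in Python. This is a simplified stand-in for a real assembler: it assumes error-free reads from one strand, with no read contained inside another. The three reads are hypothetical fragments cut from the start of the example sequence used later in this deck.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["AAATTAGTTAAATC", "ATGAGTACCGCT", "TACCGCTAAATTAG"]
assembled = greedy_assemble(reads)
print(assembled)
```

The reads are supplied out of order; the overlaps are enough to recover a single contig.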
23. The Repeat Problem
Repeats at read ends can be assembled in multiple
ways
Correct
ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGATATCCCT
Incorrect
ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGATATCCCT
ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
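The ambiguity can be made explicit by listing which reads may follow which. The Python sketch below uses the four reads from this slide; the read labels r1 to r4, the minimum overlap of 8 bases and the function names are assumptions made for the example.

```python
def overlap(a, b, min_len=8):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def successors(reads, min_len=8):
    """For each read, list the reads whose start overlaps its end."""
    return {
        a: sorted(b for b in reads
                  if a != b and overlap(reads[a], reads[b], min_len) > 0)
        for a in reads
    }


reads = {
    "r1": "ATTTATGTGTGTGTGGTGTG",
    "r2": "GTGTGGTGTGCACTACTGCT",
    "r3": "ACTACTGCTGACTACTGTGTGGTGTG",
    "r4": "GTGTGGTGTGATATCCCT",
}
graph = successors(reads)
print(graph)
```

Both r1 and r3 end in the repeat GTGTGGTGTG, and both r2 and r4 begin with it, so each of r1 and r3 has two equally good successors. Sequence overlap alone cannot choose between the correct and incorrect layouts.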
24. Paired-end Sequencing
Random shearing of bacterial chromosome
Size selection for 3 kb or 8 kb etc.
Add linkers
Circularise
Shear and select on size and presence of linkers
Add adapters
Obtain sequences from either side of linker, a known distance apart in the genome
Create long fragments of known length
Obtain sequence from paired ends a known distance apart
Allows assembly of contigs across repeats into scaffolds
25. Genome Assembly
Contig 1 Contig 2 Contig 3
Sequence Gap
Scaffold
Physical Gap
26. Re-sequencing
Short reads (<200 bp) are inefficient for de novo assembly
Instead they are mapped against a reference genome
Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
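Mapping against a reference can be sketched naively in Python: slide the read along the reference and keep the position with the fewest mismatches. Real read mappers use indexed searches and handle indels; this brute-force version assumes substitutions only. The reference fragment is taken from the example sequence used later in this deck, and the read, the mismatch limit and the function name are hypothetical.

```python
def map_read(read, reference, max_mismatches=2):
    """Return (start, mismatches) for the best placement, or None.

    Each mismatch is (reference_position, reference_base, read_base),
    with 0-based positions.
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) <= max_mismatches and (
            best is None or len(mismatches) < len(best[1])
        ):
            best = (start, mismatches)
    return best


reference = "ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGC"
read = "GTTAAATTAAAAGC"  # hypothetical read carrying one substitution
hit = map_read(read, reference)
print(hit)
```

The mismatch reported at the mapped position is a candidate SNP relative to the reference.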
27. Genome annotation
Annotation is the addition of information about the
predicted sequence features to the flat file of DNA code
Identification of potential coding sequences - CDS
Homology searches to predict function
Other features can be annotated as well
rRNAs
Potential promoters
tRNAs
Small non-coding RNAs
Repeat sequences
Insertion sequences (ISs), transposons, gene fragments
Location of the origin of replication
Determination of the number of bases, genes, and
G+C%.
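The last item is simple arithmetic, sketched here in Python. The function name and the example sequence are hypothetical; ambiguous bases such as N are ignored, which is an assumption rather than a fixed convention.

```python
def base_composition(sequence):
    """Return (number of A/C/G/T bases, per-base counts, G+C percentage)."""
    sequence = sequence.upper()
    counts = {base: sequence.count(base) for base in "ACGT"}
    total = sum(counts.values())
    gc_percent = 100.0 * (counts["G"] + counts["C"]) / total if total else 0.0
    return total, counts, gc_percent


total, counts, gc_percent = base_composition("GGCCAATT")
print(total, counts, gc_percent)
```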
29. …to this?
FT gene complement(9299..10702)
FT /db_xref="GenBank:2367266"
FT /gene="dnaA"
FT /note="b3702"
FT CDS complement(9299..10702)
FT /db_xref="GI:2367267"
FT /db_xref="PID:g2367267"
FT /function="putative regulator; DNA - replication, repair,
FT restriction/modification"
FT /codon_start=1
FT /protein_id="AAC76725.1"
FT /gene="dnaA"
FT /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT SNLIRTLSS"
FT /product="DNA biosynthesis; initiation of chromosome
FT replication; can be transcription regulator"
FT /transl_table=11
FT /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT CG Site No. 851"
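The location string in the feature table above can be unpacked with a few lines of Python. This sketch handles only simple locations of the form start..end or complement(start..end); joins and fuzzy ends are out of scope. The function name is hypothetical.

```python
import re

# Matches "complement(a..b)" or plain "a..b" only.
LOCATION = re.compile(r"complement\((\d+)\.\.(\d+)\)|(\d+)\.\.(\d+)")


def parse_location(location):
    """Return (start, end, strand) with 1-based inclusive coordinates."""
    match = LOCATION.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    if match.group(1) is not None:
        return int(match.group(1)), int(match.group(2)), -1
    return int(match.group(3)), int(match.group(4)), 1


start, end, strand = parse_location("complement(9299..10702)")
length_bp = end - start + 1   # coordinates are inclusive
codons = length_bp // 3       # includes the stop codon
print(start, end, strand, length_bp, codons)
```

For dnaA this gives 1404 bp on the reverse strand, i.e. 468 codons: 467 amino acids plus the stop codon, which agrees with the "f467" note in the record.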
31. An ORF is not a CDS!
An ORF is just an open reading frame
There are many more ORFs than protein coding genes (CDSs) in a
genome
Non-coding ORFs
CDSs (note: an ORF can extend upstream of the start codon)
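The distinction can be illustrated with a small ORF scan in Python. Here an ORF is taken to be any stop-free stretch in one of the three forward reading frames, which is one common definition and an assumption of this sketch; the reverse strand is omitted for brevity. The test sequence, the minimum length and the function name are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=5):
    """List stop-free stretches in the three forward frames.

    Returns (frame, start, end, first_atg) tuples with 0-based,
    end-exclusive coordinates; first_atg is None if the ORF has
    no in-frame ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        # Position just after the last complete codon in this frame.
        frame_end = frame + 3 * ((len(seq) - frame) // 3)
        orf_start = frame
        for i in list(range(frame, len(seq) - 2, 3)) + [frame_end]:
            if i == frame_end or seq[i:i + 3] in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    atgs = [j for j in range(orf_start, i, 3)
                            if seq[j:j + 3] == "ATG"]
                    orfs.append((frame, orf_start, i, atgs[0] if atgs else None))
                orf_start = i + 3
    return orfs


orfs = open_reading_frames("TAAGCCATGAAACCCGGGTAGTT")
print(orfs)
```

In this toy sequence two ORFs pass the length filter, but only one contains a start codon, and that ORF begins one codon upstream of its ATG. Length and the presence of a start codon are only first filters; gene finders add further evidence before calling a CDS.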
32. The Problem of Frameshift Errors
Actual sequence
10 20 30 40 50 60 70
| | | | | | |
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
M S T A K L V K S K A T N L L Y T R N D V S D S E K
• V P L N • L N Q K R P I C F I P A T M S P T A R K
E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K
10 20 30 40 50 60 70
| | | | | | |
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
• V P L N • L N Q K A T N L L Y T R N D V S D S E K
E Y R • I S • I K K R P I C F I P A T M S P T A R K
Frameshifted sequence after single base error
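The effect shown on this slide can be reproduced in Python by translating the sequence before and after inserting a single base. The codon table is built from the standard genetic code, which assigns amino acids to codons in the same way as the bacterial table 11; the function and variable names are hypothetical. Stop codons print as "*" rather than the bullet used on the slide.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate(seq, frame=0):
    """Translate complete codons of seq starting at the given frame offset."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(frame, len(seq) - 2, 3))


actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGC"
          "TTTATACCCGCAACGATGTCTCCGACAGCGAGAAA")
# A single extra A in the run of A's, as in the slide.
frameshifted = actual[:30] + "A" + actual[30:]

print(translate(actual))
print(translate(frameshifted))
```

The two translations agree for the first ten residues and then diverge completely, which is why a single-base sequencing error can truncate or scramble a predicted protein.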
33. Homology
Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)
the cat sat on the mat
die Katze sass auf der Matte
Homology is not just sequence similarity
Two sequences can be similar without any common ancestry, particularly if low complexity
vge|GBant88-2 ITLITCVSVKDNSKRYVVAG
vge|GEfae9-178 LTLITCDQATKTTGRIIVIA
vge|GSpne1-403 MTLITCDPIPTFNKRLLVNF
sortase_staur LTLITCDDYNEKTGVWEKRK
34. Types of Homology
Homologues can be divided into
Orthologues: lines of descent congruent with whole genome
Paralogues: result of gene duplication
Xenologues: result of HGT
35. Homology Searches
The aim of homology searches is to identify sequences within sequence databases that are homologous to your sequence.
This involves comparing your sequence with all the
database sequences
looking for stretches of sequence that appear to be similar
then scoring the matches and ranking them
a measure of the significance of the match is given
Most common program used for homology searches is
BLAST
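The compare, score and rank steps can be sketched in Python with a plain Smith-Waterman local alignment score. This is not how BLAST works internally: BLAST uses a faster seeded heuristic with substitution matrices and reports statistical significance, which this sketch does not compute. The scoring values, function names and the toy database are assumptions; the two real sequences are taken from the homology slide above.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_hits(query, database):
    """Score the query against every database entry, best first."""
    return sorted(
        ((local_alignment_score(query, seq), name) for name, seq in database.items()),
        reverse=True,
    )


query = "LTLITCDDYNEKTGVWEKRK"               # sortase_staur, from the slide
database = {
    "vge|GSpne1-403": "MTLITCDPIPTFNKRLLVNF",  # from the slide
    "unrelated": "PPPPPPPPPP",                 # hypothetical low-complexity entry
}
hits = rank_hits(query, database)
print(hits)
```

A raw score is only a starting point; deciding whether a hit reflects homology needs the significance measure mentioned above.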
36. Bacterial Genome Dynamics
Gene Loss: drastic downsizing in isolated intracellular niches; accumulation of pseudogenes and IS elements after shift to new niche
Gene Gain: gene duplication; horizontal gene transfer by phage, plasmids, pathogenicity islands
Gene Change: recombination and rearrangements; single nucleotide polymorphisms (SNPs)
Rapid emergence of genetically uniform pathogens from variable ancestral populations
37. Horizontal gene transfer
Horizontal (or lateral) gene transfer denotes any
transfer, exchange or acquisition of genetic material that
differs from the normal mode of transmission from
parents to offspring (vertical transmission).
Vertical gene transfer
Horizontal gene transfer
38. Bacterial mobile genetic elements
Transposons
pieces of DNA that act as 'jumping genes', changing location on the chromosome or a plasmid
encode transposase that catalyses the transposition
event
can carry resistance or virulence genes
Insertion sequences (IS elements)
transposable elements that encode only the transposase
multiple copies of same IS within genome provide targets
for homologous recombination, rearrangements and
replicon fusions
Conjugative transposons
normally integrated into the chromosome
excise then transferred to recipient cells by conjugation
39. Bacterial mobile genetic elements
Plasmids
self-replicating extrachromosomal replicons
usually circular but can be linear
Can carry resistance or virulence genes
Bacteriophages
bacterial viruses
can carry virulence genes
can insert into bacterial chromosome as prophages (lysogeny)
Integrons
complex natural cloning and gene expression systems
able to capture promoterless gene cassettes by site-
specific recombination
allow formation of large arrays of gene cassettes
transferred as a whole between different replicons.
40. Genomic islands
large chromosomal regions, part of the flexible gene pool
previously transferred by other mobile genetic elements
present in some bacteria but absent in close relatives
carry multiple genes that increase phenotypic versatility
contribute to dynamic character of bacterial chromosomes and can be excised from the chromosome and transferred to other recipients
pathogenicity islands contain dozens of genes that allow quantum leap to complex new virulence
41. Core genomes and Pangenomes
Core genome
pool of genes shared by all members of a bacterial
species
Accessory or dispensable genome
pool of genes present in some but not all genomes within
the same bacterial species
Pangenome
global gene repertoire of a bacterial species, comprised of
core genome + accessory genome
Metagenome
global gene repertoire of mixed microbial population
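The core, accessory and pangenome definitions map directly onto set operations, sketched here in Python. The strain names and gene content are hypothetical, and the sketch assumes genes have already been grouped into families with shared names, which in practice is the hard part.

```python
def pangenome(strains):
    """Return (core, accessory, pan) gene sets for a dict of strain -> genes."""
    gene_sets = [set(genes) for genes in strains.values()]
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)   # present in every genome
    pan = set.union(*gene_sets)           # present in at least one genome
    accessory = pan - core                # present in some but not all
    return core, accessory, pan


strains = {  # hypothetical gene content
    "strain_1": {"dnaA", "gyrB", "recA", "stx2"},
    "strain_2": {"dnaA", "gyrB", "recA", "aggR"},
    "strain_3": {"dnaA", "gyrB", "recA", "stx2", "aggR", "blaCTX-M"},
}
core, accessory, pan = pangenome(strains)
print(sorted(core), sorted(accessory), len(pan))
```

Adding more genomes can only shrink the core and grow the pangenome, which is why both are usually reported together with the number of genomes sampled.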
42. Escherichia coli Core and Pan-genomes
Welch et al. Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):17020-4
43. Metagenomics
Environmental shotgun sequencing
DNA extracted from mixed microbial communities sequenced en masse
Assembled into contigs
Typically only small contigs can be obtained
44. Uses of a genome sequence
Gene discovery
Fuelling hypothesis driven research on pathogen biology
Comparative genomics
SNP discovery and genomic epidemiology
Functional genomics
Transcriptomics
Proteomics
Interactome
Structural Genomics
Mass Mutagenesis
45. Haemolytic-uraemic syndrome
Shiga-toxin-producing E. coli (STEC)
bloody diarrhoea; damage to kidneys and brain
anaemia; loss of platelets
46. German E. coli O104:H4 outbreak
May-July 2011
>4000 cases
>40 deaths
Link to sprouting seeds
High risk of haemolytic-uraemic syndrome
Females particularly at risk
Frank et al DOI: 10.1056/NEJMoa1106483
48. Take-away messages from the genome
Pathogens don't bother with passports!
Not a new strain: something similar seen in Germany ten
years ago and in Korea
closest genome-sequenced strain was isolated from Central
African Republic in late 1990s, belongs to an
enteroaggregative lineage
German STEC probably comes from a lineage
circulating in human populations rather than from an
animal source (unlike E. coli O157)
49. Take-away messages
Bacteria evolve quickly
Virulence factors in E. coli can jump from one lineage to another on mobile genetic elements
Pathotypes can overlap and evolve
Antibiotic resistance seen where no obvious prior use of antibiotics
51. Take-away messages from genome sequence
Genome sequencing brings the advantages of
open-endedness (revealing the “unknown unknowns”),
universal applicability
ultimate in resolution
Bench-top sequencing platforms now generate data
sufficiently quickly and cheaply to have an impact on
real-world clinical and epidemiological problems