Bio153 microbial genomics 2012

Bio153 Microbial Genomics

Professor Mark Pallen
University of Birmingham

Microbial Genomics
 General features of microbial genomes
 Historical overview
 Genome sequencing, annotation and analysis
 Genome evolution
 What can we learn from a genome sequence?

General features of genomes
Microbial
 Small WYSIWYG genomes (Mbp)
 Gene density high (>90%)
 Intergenic regions short
 Very little repetitive or non-coding DNA
 Introns very rare
 Protein-coding genes (CDS) short (~1 kbp)
 Operons with promoters just upstream
 Fewer non-coding RNAs
Human
 Very large genomes (Gbp)
 Gene density low
 Only 25% is genes
 Introns mean only 1% codes
 Genes can span ≥30 kbp
 Genes have ~3 transcripts
 Splicing and splice variants

Bacterial genome organisation

Chromosomes
 Most commonly single circular chromosome (always DNA)
 BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
 BUT a few species with two chromosomes (e.g. Vibrio cholerae)
 Can be mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
Plasmids
 Independent autonomous replicon, can be circular or linear
 may integrate into chromosome
 copy number varies from 1 to tens
 often carry non-essential genes that confer an adaptive advantage in certain conditions

Bacterial Genome Size
 species which occupy restricted ecological
niches, (e.g. obligate intracellular parasites and
endosymbionts) tend to have smaller genomes
(<1.5 Mb) than generalist bacteria
 smallest known bacterial genome:
Carsonella ruddii, 160 kb! (Nakabachi et al. 2006)
 BUT mitochondrial genomes are smaller
 largest genomes found in bacteria with complex
developmental cycles, e.g. Streptomyces
 largest bacterial genome: Sorangium cellulosum, 13 Mb

Bacterial genomes are made from DNA
 In 1944, Oswald Avery, Colin MacLeod, and Maclyn
McCarty showed that DNA (not proteins) was the genetic
material responsible for inheritance.
 Identified DNA as the "transforming principle" while studying
Streptococcus pneumoniae
 Avery, Oswald T., Colin M. MacLeod, and Maclyn McCarty.
Studies on the chemical nature of the substance inducing
transformation of pneumococcal types. Journal of Experimental
Medicine. 1944 Feb 1; 79(2): 137-158.
 In 1952, this work was supported by Alfred Hershey and
Martha Chase who showed that only the DNA of a virus
needs to enter a bacterium to infect it.
 Used radioactively labelled bacteriophage
 Hershey AD and Chase M. Independent functions of viral
protein and nucleic acid in growth of bacteriophage. Journal of
General Physiology. 1952. 36: 39-56.

Viral genomes are variable
 Use RNA or DNA but not
both in genome
 Some have RNA genomes!
 Grouped into families depending on
 type of genome: DNA or RNA, single- or double-stranded
 Typically dozens of genes
or fewer
 Large genomes in pox
viruses (~200 kb)
 Massive genomes in
megaviruses (1Mbp!)

Microbial Genomics Timeline

Year Milestone
1977 Invention of dideoxy chain terminator sequencing (“Sanger sequencing”)
1977 Sequencing of the 5.3-kilobase genome of bacteriophage phiX174
1981 First human mitochondrial genome sequence*
1982 Determination of the 48.5-kilobase genome sequence of bacteriophage lambda through first use
of shotgun sequencing
1986 Development of automated fluorescent sequencing
1995 First complete genome sequences obtained of free-living bacteria (Haemophilus influenzae and
Mycoplasma genitalium)
1996 Mycoplasma becomes first bacterial genus that has completely sequenced genomes from two
different species (M. genitalium and M. pneumoniae)
1997 First genome sequences from Escherichia coli and Bacillus subtilis
1998 First genome sequence from Mycobacterium tuberculosis; genome sequence from
Rickettsia prowazekii provides first evidence of reductive evolution

Microbial Genomics Timeline
Year Milestone
1999 Helicobacter pylori becomes the first species with completely sequenced genomes from two
isolates
2000 Meningococcal genome sequence primes first application of reverse vaccinology
2001 Second E. coli genome sequences reveal unexpected level of horizontal gene transfer;
genome sequence of M. leprae provides compelling evidence of bacterial pseudogenes and
reductive evolution; first paper reporting genome sequences of two strains from one species
(Staphylococcus aureus) in a single publication.
2002 Genome sequencing of multiple strains of Bacillus anthracis to provide markers for forensic
epidemiology
2003 Genome sequencing of uncultivable Tropheryma whipplei leads to design of axenic growth
medium
2004 Genome sequence of mimivirus blurs distinctions between bacteria and viruses
2005 Whole-genome sequencing used to identify target of new anti-tuberculosis drug;
Mycoplasma genitalium genome sequenced using pyrosequencing
2006-2011 Bacterial metagenomics survey of the Sargasso sea yields >1 million new genes;
rise of next-generation or high-throughput sequencing

The first genome sequences
 The first sequenced gene was from bacteriophage MS2
 The gene encoding the coat protein
 1972
 Min Jou W, Haegeman G, Ysebaert M, and Fiers W. Nucleotide
sequence of the gene coding for the bacteriophage MS2 coat
protein. Nature. 1972 May 12; 237(5350): 82-88.
 The first sequenced genome was bacteriophage MS2
 1976
 RNA genome is 3,569 nucleotides
 Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant
D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van
den Berghe A, Volckaert G, and Ysebaert M. Complete
nucleotide sequence of bacteriophage MS2 RNA: primary and
secondary structure of the replicase gene. Nature. 1976 Apr 8;
260(5551): 500-507.

The first genome sequences
 The first sequenced DNA genome was bacteriophage ΦX174
 1977
 5368 base pairs
 Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes
CA, Hutchison CA, Slocombe PM, and Smith M. Nucleotide
sequence of bacteriophage phi X174 DNA. Nature. 1977 265
(5596): 687-695.
 The first sequenced bacterial genome was Haemophilus
influenzae
 1995
 1,830,140 base pairs
 Fleischmann R, Adams M, White O, Clayton R, Kirkness
E, Kerlavage A, Bult C, Tomb J, Dougherty B, and Merrick J.
Whole-genome random sequencing and assembly of
Haemophilus influenzae Rd. Science. 1995; 269(5223): 496-512.

Overview of a genome project
 Choose strain
 Fresh isolate or tractable lab strain?
 Choose strategy
 Shotgun sequencing
 Paired-end sequencing
 Draft or complete?
 Choose chemistry
 Sanger; 454; Illumina; Ion Torrent
 Assembly
 Automated
 Closure and finishing
 Manually intensive
 Difficulty depends on how repetitive the genome is
 Data Release
 Immediate or delayed?
 Annotation
 Manually intensive bottleneck
 Publication

Methods for genome sequencing – historic
Sanger method sequencing
 Sanger F and Coulson AR. A rapid method for
determining sequences in DNA by primed synthesis
with DNA polymerase. Journal of Molecular Biology.
1975 94: 441-448.
 Step 1, a sequence-specific DNA primer is radiolabeled
 Step 2, the primer is annealed to the template DNA
 Step 3, the primer is extended by DNA polymerase
 Incorporation of a deoxynucleotide - further extension possible
 Incorporation of a dideoxynucleotide – chain termination
 Four reactions set up
 ddATP, dATP, dCTP, dGTP, dTTP
 ddCTP, dATP, dCTP, dGTP, dTTP
 ddGTP, dATP, dCTP, dGTP, dTTP
 ddTTP, dATP, dCTP, dGTP, dTTP
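The logic of reading the four reactions can be sketched in Python. This is an idealised illustration, not a model of real gel data: it assumes every possible termination product is present and perfectly resolved, and the function names `sanger_lanes` and `read_gel` are made up for this example.

```python
def sanger_lanes(new_strand):
    """Return, for each ddNTP reaction, the lengths of the terminated fragments.

    new_strand is the strand synthesised from the primer (5' to 3').
    A fragment of length n ends in the ddNTP matching base n of that strand.
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering bands from shortest to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes)            # fragment lengths per lane
print(read_gel(lanes))  # sequence recovered from the band order
```

Each lane alone only says where one base occurs; the sequence emerges from ranking the bands of all four lanes by size.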

Methods for genome sequencing –
automated Sanger sequencing
 Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C,
Kent SBH, and Hood LE. Fluorescence detection in automated DNA
sequence analysis. Nature. 1986 321: 674-679.
 Replaced radioisotopes with fluorescent dyes
 Safer for the researchers
 Each of the four DNA bases could be dyed a different colour
 Eliminated the need to run separate reactions in separate lanes
 The migration of the dye could be read because of the fluorescence
 This information allowed automatic gel reading
 Further improvements were made
 Improved dye chemistry using fluorescent dideoxy-terminators (DuPont): Prober
JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ,
Jensen MA, and Baumeister K. A system for rapid DNA sequencing with
fluorescent chain-terminating dideoxynucleotides. Science. 1987 238: 336-341.
 Replacing slab gels with re-useable capillary tubes: Ruiz-Martinez MC, Berka J,
Belenkii A, Foret F, Miller AW, and Karger BL. DNA sequencing by capillary
electrophoresis with replaceable linear polyacrylamide and laser-induced
fluorescence detection. Analytical Chemistry 1993 65: 2851-2858.

Whole-Genome Shotgun Sanger Sequencing
 Random shearing of bacterial chromosome
 Size selection
 Cloning into plasmid vector to create shotgun library
 Pick colonies
 Plasmid preps
 Sequence each insert with two primers

High-throughput Sequencing
 100x faster, 100x cheaper!
 A disruptive technology
 Several technologies in the marketplace from 2007
onwards
 454 (Roche)
 Illumina
 Ion Torrent
 PacBio
 Fundamentally new approaches
 Solid-phase amplification of clonal templates in “molecular
colonies”
 Massive increase in number of “clones” compensates for shorter
read length
 New chemistries for sequence reading
 454: pyrophosphate detection on base addition
 Illumina: reversible de-protection of fluorescent bases

High-Throughput Shotgun Sequencing
 Random shearing of bacterial chromosome
 Size selection
 Add adapters
 Amplify
 Sequence

454 sequencing

Emulsion-based clonal amplification

 Anneal sstDNA to an excess of DNA Capture Beads
 Emulsify beads and PCR reagents in water-in-oil microreactors
 Clonal amplification occurs inside microreactors
 Break microreactors, enrich for DNA-positive beads

Pyrosequencing
 DNA template with primer
mixed with the enzymes (DNA polymerase,
sulfurylase, luciferase, apyrase) along
with the two substrates
adenosine 5'-phosphosulfate
(APS) and luciferin
1. one of the four nucleotides
added to reaction
2. If complementary to base in
template strand then DNA
polymerase incorporates it
3. Pyrophosphate (PPi)
released then converted to
ATP by sulfurylase in the
presence of APS.
4. ATP serves as a substrate to
luciferase, causing a light
reaction.
5. Excess nucleotides degraded
by apyrase.
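The cycle above can be mimicked with a short Python sketch. It assumes an idealised light signal that is exactly proportional to the number of bases incorporated in each flow, and a repeating T, A, C, G flow order; the function names and the flow order are assumptions chosen for illustration.

```python
def pyrosequence(new_strand, flow_order="TACG"):
    """Simulate nucleotide flows over the strand being synthesised.

    Returns a list of (nucleotide, signal) pairs, where signal is the number
    of bases incorporated in that flow (0 means no light was produced).
    """
    if any(base not in flow_order for base in new_strand):
        raise ValueError("sequence contains a base that is never flowed")
    flows, pos = [], 0
    while pos < len(new_strand):
        for nt in flow_order:
            signal = 0
            # A homopolymer run is incorporated in a single flow,
            # giving a proportionally stronger light signal.
            while pos < len(new_strand) and new_strand[pos] == nt:
                signal += 1
                pos += 1
            flows.append((nt, signal))
            if pos >= len(new_strand):
                break
    return flows


def decode(flows):
    """Turn the flow signals back into a base sequence."""
    return "".join(nt * signal for nt, signal in flows)


flows = pyrosequence("TTAGC")
print(flows)
print(decode(flows))
```

In practice the signal for long homopolymer runs is not cleanly proportional, which is why run lengths are a typical source of error for this chemistry.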

The Sequence Assembly Problem
 Sequencing technologies generate reads of <1000
bp
 These reads must be assembled into a single
continuous genomic sequence.
 Shotgun sequencing exploits many overlapping
sequences (high coverage) to infer ordering directly
from the sequences themselves
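A toy Python sketch of inferring order from overlaps follows. It uses a simple greedy strategy (repeatedly merge the pair of reads with the longest exact suffix-prefix overlap), which is only one of several assembly approaches and assumes error-free reads from one strand; the example genome and function names are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap; returns the remaining contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the contigs cannot be joined
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


genome = "ATGGCGTACGTTAGCCTAGGATCCGA"            # hypothetical 26-base "genome"
reads = [genome[0:12], genome[6:18], genome[12:26]]  # three overlapping reads
print(greedy_assemble(reads))
```

With real data the same idea needs tolerance for sequencing errors, both strands, and far more efficient overlap detection.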

The Repeat Problem
 Repeats at read ends can be assembled in multiple
ways
Correct
ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGATATCCCT

Incorrect
ATTTATGTGTGTGTGGTGTG
GTGTGGTGTGATATCCCT

ACTACTGCTGACTACTGTGTGGTGTG
GTGTGGTGTGCACTACTGCT
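The ambiguity can be checked directly on the four reads shown above. This Python sketch computes exact suffix-prefix overlaps (the function name `overlap` and the read labels A to D are just labels for this example) and shows that the repeat gives two equally good successors for a read.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


reads = {
    "A": "ATTTATGTGTGTGTGGTGTG",
    "B": "GTGTGGTGTGCACTACTGCT",
    "C": "ACTACTGCTGACTACTGTGTGGTGTG",
    "D": "GTGTGGTGTGATATCCCT",
}

# Reads A and C both end in the repeat GTGTGGTGTG; reads B and D both start with it.
for left in ("A", "C"):
    for right in ("B", "D"):
        print(left, "->", right, overlap(reads[left], reads[right]))
```

Because A overlaps B and D equally well (and so does C), the overlaps alone cannot distinguish the correct layout A, B, C, D from the incorrect one that joins A directly to D.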

Paired-end sequencing
 Random shearing of bacterial chromosome
 Size selection for 3 kb or 8 kb etc.
 Add linkers
 Circularise
 Shear and select on size and presence of linkers
 Add adapters
 Sequencing: obtain sequences from either side of linker
 Paired ends are a known distance apart in genome

 Create long fragments of known length
 Obtain sequence from paired ends known distance apart
 Allows assembly of contigs across repeats into scaffolds

Genome Assembly

[Figure: Contig 1, Contig 2 and Contig 3 joined into a scaffold, with a sequence gap and a physical gap marked]

Re-sequencing
 Short reads (<200 bp) make
de novo assembly
inefficient
 Instead they are
mapped against a
reference genome
 Re-sequencing is like
assembling a jigsaw
puzzle using the image
on the lid
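A minimal Python sketch of the mapping idea: slide each read along the reference and keep the best-matching position. Real read mappers use indexed data structures rather than this brute-force scan; the reference string, the mismatch threshold and the function name `map_read` are assumptions for illustration.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None if unmapped."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ATGGCGTACGTTAGCCTAGGATCCGA"   # hypothetical reference genome
print(map_read("TACGTTAGCC", reference))   # exact match
print(map_read("TACGTAAGCC", reference))   # one difference: a candidate SNP
print(map_read("CCCCCCCCCC", reference))   # does not map
```

Positions where many mapped reads agree with each other but differ from the reference are the raw material for SNP discovery.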

Genome annotation
 Annotation is the addition of information about the
predicted sequence features to the flat file of DNA code
 Identification of potential coding sequences - CDS
 Homology searches to predict function
 Other features can be annotated as well
 rRNAs
 Potential promoters
 tRNAs
 Small non-coding RNAs
 Repeat sequences
 Insertion sequences (ISs), transposons, gene fragments
 Location of the origin of replication
 Determination of the number of bases, genes, and
G+C%.
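The simplest of these summary statistics can be computed in a few lines of Python. The function name `genome_stats` is illustrative, and the sketch assumes an uppercase or lowercase DNA string with no header line.

```python
def genome_stats(sequence):
    """Return the number of bases and the G+C percentage of a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    gc = sequence.count("G") + sequence.count("C")
    return {"length": len(sequence), "gc_percent": 100.0 * gc / len(sequence)}


print(genome_stats("ATGC"))
```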

How to go from this….?
>Escherichia coli K-12 MG1655_3870656-3890655
TGCTGCTGCCTGCTGCGCGGTGCGCTCTACGGATTGCCCGGCGCGATAGAGATCGCTGCCTAAGCCCGCCCCTGCACAACCTGCGTCTATCCACTGCGCCAGGTTTTCTGCGTCACGCCGCAAC
GGCAAAGACTGCGATGTCCGATGGCAATACCGCTTTTAACGCTTTGATGTATTGCGGACCAAAAGCCGATGACGGAAATATTTTCAGCGCCTGCGGCGCCCGCTTCGAGCGCGGTAAAGGCTTCG
GTCGCCGTCGCGCAGCCGGGGCAGACGTCATGCCGTAGCCCACCGCACGGCGGATCACTTCACTATGGATATTGGGCGTAACGATGAGCTGACAGCCCATCCTGGCGAGCGCATCGACCTGTT
CAGGTTTCAGTACCGTACCTGCGCCAATCAACGCCTTGTCGCCGTACGCATCAACGATGCGGGAATGCTTTGCTCCCATTGTGGGGAATTCAGCGGGATTTCAACCGCGTCGAACCCGGCGTCAA
TCACCGCGCCAACATGCGCCAGCGCCTCGTCGGGCGTAATACCGCGCAAAATGGCGATCAGCGGGAGTTTAGTTTGCCACTGCATGAGGATGCTCCTTATACCAGCCTGAAATGCCGTGTCGCC
CGCCACCGCCGTCACGTCGCAACCCATCGCCTGAAAGGCTTGCTGGTAGCGCGCGGTCAGCGATGTTCCGGCGACAAGGGTGATGGCGTGTTGATGGGCCACATAGTCGCGCATACTGGGACC
TCTGCGCCAATCAACAAACCAGAGAGAAATTCGCTGACCTGTTCGCGGGGAAGTGTTCCCAGCACATGCGAGGCGCGAACTTCAAAAAGCTGCGGCAATATGGCGGGCGTATTAAGACCACGCT
CAAGGCCAGCTGTGAAGGCATCGGCAGGTTTTCCTGCGGCGGCAAACCTGCGCCAATCAATGAGTGATTTAACAGTAAATGATGTAATTCACCGGTCATCACGGTGCGAAAATCGTTGATTTGCTG
GCTATCGGCCTGCACCCATTTGCAATGGGTTCCGGGCATGACATAAAGAGAGGAAGAGCCAGAGCTCGCGCGCCGATCAATTGTGTTTCTTCGCCGCGCATCACATTGTGGTTATCGTCATGAGA
GACACATAATCCGGGAATAATCCAGATATTGTCGCCAACTGACGTTAATTGTTCGCCAATAGACGAAAAACAGGCAGGAACAGATAATACGGTGCAACTTTCCAGCCGACGTTGCTGCCAACCATT
CCTGCCATTACCACTGGCGTTTTCTCTTCACGCCAGTCGGTCGTGACTTCTGCTAACACCGCAGCCGGAGATTTTCCGTTCAGGCGCGTGACGCCTGCTTCTGATTGCCTGCTCTCAGGCAGTGG
TCGCCCTGATAAAGCCAGGCGCGCAGATTGGTCGATCCCCAGTCAATTGCGATGTAGCGAGCTGTCATGTGATTTCCTTTAACCTTCGTGTCGAGCTGGCGATCATGGTAAGCGCCGCCTGCTCT
GCCGCATCGCCGTCCTGATGCGTATCGCATCGAACAGCGCCTTATGTTCCTGGAGCGTTTGCGGCATGTTGGCCTCATCGCCCATCCAGGTTCGTTCAAAAACCGCCCGCTGCAGCGAACTGATC
GCAATGCTAAGTTGCTGTAACACCGGGTTATGCACCGACTGCAGCACCGCTCGTGGTAGCGAATATCCGCTTCGTTAAACGCTTCGCGGTCCTGATTGTTGGCAATCATCTCGTTCAGCGCCGATT
CAATCTGCGCCAGATCGCTGGAAGTCGCGCGCTCTGCTCCCAACGGGCAATCGCCGGTTCCACCAGATTTCGCACTTCGTCATGGCACTGATAAGCCGTGGGTCGTAGTCATTTTCCAGCACCCA
TTGCAGTACGTCAGTGTCGAGGTAATTCCACTGGTTACGCGGTGCCACAAACGCCCCGCGATAACGTTTCATTTCAATCAGCCGCTTCGCCATCAGCGAACGGAACACCCACGGATGATGTTGCG
CGAGGTTGCAAACTCCTCACAGAGTTCCGCCTCAGCCGGAAGCGGCGAGCCTGGCACGTATTTGCCGTGAACGATCTGTTTACCCAGCGTAATGACAATGCGATCGGTTTTATTGAGAGTCATGG
AGAGTCCTTGTGCTTGTATGTTCTTCTCTACTTTACCCCGATCGATGCATAACGCGGCAACTTTGTAGTACCAGCGTGATGACGTTCGCGTTTGCCGTGCGTGTAATGTAGTACAAACTTATATTGTT
GTACTACAATTTAGATCACAAAAAGAACAATGCATAAAAAATGACATGCGTCGGGCAGAAATCTGAAAAGGGATATCAGGCGCTAAACAGGAGGGAAAGAAGAGTATGCTTTCAACGGCTTAGCTA
CTCGTTTAAAGGATTAATCATGAAGTTGAATTTTAAGGGATTTTTTAAGGCTGCCGGTTTATTCCCACTGCGCTGATGCTTTCAGGCTGTATCTCGTATGCTCTGGTTTCCCATACCGCAAAGGGTAG
TTCAGGAAAGTATCAATCGCAGTCAGACACCATCACTGGGCTATCGCAGGCAAAAGATAGTAATGGAACAAAAGGCTATGTTTTTGTAGGGGAATCGTGGATTACCTTATCACTGATGGTGCCGAT
GACATCGTTAAGATGCTCAATGATCCAGCACTTAACCGGCACAATATTCAGGTTGCCGATGACGCAAGATTTGTTTTAAATGCGGGGAAAAAGAAATTTACCGGCACAATATCGCTTTACTACTACG
GAATAACGAAGAAGAAAAGGCACTGGCAACGCATTATGGTTTTGCCTGTGGTGTTCAACACTGTACCAGGTCACTGGAAAACCTAAAAGGCACAATCCATGAGAAAAATAAAAACATGGATTACTCA
AAGGTGATGGCGTTCTACCATCCATTTAAGTGCGATTTTATGAATACTATTCACCCAGAGGCATTCCGGGATGGTGTTTCCGCAGCATTACTGCCAGTGACTGTTACGCTGGACATCATTACTGCAC
CGCTGCAATTTCTGGTTGTATATGCAGTAAACCAATAATCAGTAAGCGGGCAAACCGTTTATGCTGTTTGCCCGCCCACAGATTAATTCAGCACATACTTCTCAATAGCAAACGCCACGCCATCTTCA
AGGTTAGATTTGGTGACAAAGTTCGCCACTTCTTTCACTGAAGGAATAGCGTTATCCATCGCCACACCGACGCCTGCATATTAATCATTGCGATATCGTTTTCCTGATCGCCAATCGCCATGATTTCT
TCCGGTTTAATACCTAACACGTCGGCCAGTGATTTCACCCCCGTACCTTTGTTAACGCGTTTATCGAGGATTTCGAGGAAGTACGGCGCACTTTTCAGCACGGTATATTCTCTTTCACTTCCTGCGG
AATACGCGCGATAGCCTGGTCGAGGATGGCGGGTTCATCAATCATCATCACTTTCAGGAACTGGGTATTGGGGTCCATTTTCTCCGCTTCGCAGAACACCAGCGGAATGGTGGCAACGAAGGATT
CATGCACCGTGTGTAGCTGATATCACGGTTGGCGGTGTACAGCGTGGTGCGGTCCAGGGCGTGGAAATGAGAACCGACTTCGCGAGAGAGTTTTTCCAGGAAACGATAGTCGTCATAGCTGAGA
GCAGTTTGCGCCACGGTGCTACCATCAGCGGCCTTCTGTACCACGCGCCGTTATAAGTAATGCAGTAGTCGCCCGGCTGTTCCATATGCAGCTCTTTCAGGTAGTTGTGCACACCTGCATACGGG
CGACCCGTCGTTAGCACGACATTCACGCCACGGGCGCGAGCTGCGGCAATCGCATTTTTAACGGCGGGTGAAAGGTGTGATCGGGCAGCAGAAGGGTGCCATCCATATCGATAGCAATGAGTTT
AATAGCCATGAGTTCCCCAGGTAGATTGGTTCCTGACCCATGCTAACGCGATTCCGCTCAAAAATCAGTACAACACCCGAGGGAAAAGGGGGATGCAACGCGCGTGCGTGCTCCCTTTTTGCTTA
GCGGAAGAGTTTCCCTTTCAGCAGTTCCATGCCTGCGGAAAGCAGATCGTTATTGGCTTGTGGTGACACTTCACCTTGCGGTGAGAGCGCATCAATAATCTTCGGCAATTGTTCTGCCAGTAAACT
GGAAGCTGACTGGTATCCACGCCAAGTTTTTGCCCGAGATCGGACACCGCATTTGTGCCGAGCGCCGATTCCAGTTGCTCGCCACTAACCGATTGATTGCCCTGTTGATTACTCAGCCAGGTTGA
GAGAATGGCCCCTAAGCCGCCACTTTGCAGTTTTTCCACAGCACCTGAATGCCGCCCTGCTCCTCAACCCAACTTAAAATAGCCTGATATTTCCCCGCATCGCCTTTCAGAAAGGCACCGACAACTT
CATCAAAAAGCCCCATGATAATCACCTGTAAAGCGTTACGTGTTGACCCAAAAAGTATAGATTTGCGGATGATAATTGCGGATTGCAGAAATAAAAAGGGCGGAGATGATCTCCGCCCTTTTCTTAT
AGCTTCTTGCCGGATGCGGCGTGAACGCCTTATCCGGCCTACAAAATCATGAAAATTCAATACATTGCAAGATTTTCGTAGGCCTGATAAGCGTGCGCATCAGGCACGCTCGCATGGTTAGCGCCA
TTAAATATCGATATTCGCCGCTTTCAGGGCGTTCTCTTCAATAAACGCACGGCGCGGTTCAACGGCGTCGCCCATCAGCGTGGTGAACAACTGGTCGGCAGCAATCGCATCTTTAACGGTAACCG
CAGCATACGACGACTTTCCGGGTCCATAGTGGTTTCCCACAGCTGTTCCGGGTTCATCTCGCCCAGACCTTTATAACGCTGGATGGAGAGGCCGCGACGGGACTCTTTCACCAGCCAGTCCAGCG
CCTGCTCGAAGCTGGCTACCGGCTGACGCGCTCGCCACGTTCGATAAACGCATCTTCTTCCAGCAAGCCACGCAGTTTCTCACCCAGCGTGCAGATACGACGATATTCGCCACCGGTGATAAACT
CGTGATCCAGCGGATAGTCAGTATCCACACCGTGGGTACGCACGCGAACAATCGGCTCAACAGGTTTTGCTCAGCATTGGTGTGAACATCAAACTTCCACTGGCTGCCGTGCTGTTCTTTGTCGTT
CAGTTCGCTGACCAGCGCGTTCACCCAGCGGGTAACGGTCTGCTCATCAGAAAGGTCAGCTTCCGTCAACGTCGGCTGATAGATAAGTCTTTCAGCATTGCTTTCGGATAACGACGCTCCATACG
ATTGATCATTTTCTGCGTCGCGTTGTACTCAGATACCAGTTTCTCTAACGCTTCGCCAGCCAATGCCGGTGCACTGGCGTTGGTGTGCAGCGTTGCGCCGTCCAGCGCGATAGAGATTGGTACTG
ATCCATCGCTTCGTCGTCTTTAATGTACTGTTCCTGCTTGCCTTTCTTCACTTTGTACAGCGGCGGCTGAGCGATGTAGACGTGACCGCGTTCAACGATTTCCGGCATCTGACGATAGAAGAAGGT
CAACAGCAGCGTACGAATGTGGAGCCGTCGACGTCCGCATCGGTCATGATGATGATGCTGTGATAACGCAGTTTGTCCGGGTTGTACTCGTCACGACCGATACCACAGCCAAGCGCGGTGATAA
GCGTCGCCACTTCCTGAGAAGAGAGCATCTTATCGAAGCGCGCTTTCTCGACTTGAGGATTTTACCCTTCAGCGGCAGAATCGCCTGGTTCTTGCGGTTACGCCCCTGCTTCGCAGAGCCGCCCG
CGGAGTCCCCTTCCACCAGGTACAGTTCGGAAAGCGCCGGATCGCGTTCCTGGCAGTCTGCCAGTTTGCCCGGCAGGCCCGCAAGTCGAGCGCACCTTTACGGCGGGTCATTTCACGCGCGCG
ACGCGGCGCTTCACGGGCACGGGCAGCATCGATAATTTTGCCAACCACGATTTTCGCGTCGGTTGGGTTTTCCAGCAGGTATTCTGCCAGCAGTTCGTTCATCTGCTGTTCAACGCCGATTTCACC
TCAGAAGAAACCAGTTTGTCTTTGGTCTGGGAGGAGAATTTCGGGTCCGGCACTTTCACGGAAACGACCGCAATCAGGCCTTCACGCGCATCGTCACCGGTGGCGCTGACTTTGGCTTTTTTGCT
GTAGCCTTCTTTGTCCATTAGGCGTTCAGGGTACGGGTCATCGCCGCACGGAAGCCTGCCAGGTGAGTACCGCCGTCACGCTGCGGAATGTTGTTGGTAAAGCAGTAGATGTTTTCCTGGAAGCC
ATCGTTCCACTGCAACGCCACTTCGACGCCAATACCGTCTTTTTCAGTGAGAAGTAGAAGATATTCGGGTGGATCGGCGTTTTGTTCTTGTTCAGATATTCAACGAACGCCTTGATGCCGCCTTCAT
AGTGGAAGTGGTCTTCTTTGCCGTCGCGCTTGTCGCGCAGACGAATGGAAACGCCGGAGTTGAGGAACGACAACTCCGCAGACGTTTCGCCAGAATTTCATATTCGAACTCGGTCACATTGGTGA
AGGTTTCGAGGCTGGGCCAGAAACGCACCATGGTGCCGGTTTTTTCAGTCTCGCCGGTAACCGCCAGCGGGGCCTGCGGTACACCGTGTTCGTAGATCTGACGGTGATTTTACCCTCGCGCTGG
ATAACCAGCTCCAGTTTTTGCGACAGGGCGTTTACTACCGAAACACCAACGCCGTGCAGACCGCCGGACACTTTATAGGAGTTATCGTCAAATTTACCGCCTGCGTGCAGAACGGTCATGATCACT
TCCGCCGCCGA

…to this?
 FT gene complement(9299..10702)
 FT /db_xref="GenBank:2367266"
 FT /gene="dnaA"
 FT /note="b3702"
 FT CDS complement(9299..10702)
 FT /db_xref="GI:2367267"
 FT /db_xref="PID:g2367267"
 FT /function="putative regulator; DNA - replication, repair,
 FT restriction/modification"
 FT /codon_start=1
 FT /protein_id="AAC76725.1"
 FT /gene="dnaA"
 FT /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
 FT FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
 FT QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
 FT AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
 FT KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
 FT LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
 FT NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
 FT SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
 FT SNLIRTLSS"
 FT /product="DNA biosynthesis; initiation of chromosome
 FT replication; can be transcription regulator"
 FT /transl_table=11
 FT /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
 FT CG Site No. 851"


An ORF is not a CDS!
An ORF is just an open reading frame
There are many more ORFs than protein coding genes (CDSs) in a
genome

Non-coding ORFs

CDSs
(note ORF can extend
upstream of start codon)
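A short Python sketch makes the distinction concrete. Here an ORF is taken to be a stop-free stretch that ends at a stop codon, scanned in all six frames; the first ATG inside it is only a candidate start, and nothing in the code decides whether the ORF is a real CDS. The function names, the `min_codons` cut-off and the test sequence are assumptions, and ORFs that run off the end of the sequence are ignored for brevity.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """List ORFs as (strand, frame, orf_start, stop_position, first_atg).

    Coordinates are 0-based on the scanned strand; first_atg is None when the
    ORF contains no ATG at all (so it cannot be a conventional CDS).
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = frame
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - orf_start) // 3 >= min_codons:
                        codons = [s[j:j + 3] for j in range(orf_start, i, 3)]
                        atg = (orf_start + 3 * codons.index("ATG")
                               if "ATG" in codons else None)
                        orfs.append((strand, frame, orf_start, i, atg))
                    orf_start = i + 3
    return orfs


# The ORF starts at position 0 but the candidate start codon is at position 3.
print(find_orfs("CCCATGAAACCCGGGTAA"))
```

Gene finders go further by weighing codon usage, ribosome-binding sites and homology before calling an ORF a CDS.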

The Problem of Frameshift Errors
Actual sequence

10 20 30 40 50 60 70
| | | | | | |
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
M S T A K L V K S K A T N L L Y T R N D V S D S E K
• V P L N • L N Q K R P I C F I P A T M S P T A R K
E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K

10 20 30 40 50 60 70
| | | | | | |
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
• V P L N • L N Q K A T N L L Y T R N D V S D S E K
E Y R • I S • I K K R P I C F I P A T M S P T A R K

Frameshifted sequence after single base error
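The effect can be reproduced in Python with the sequence from this slide. The sketch builds the standard genetic code, translates the actual sequence in frame 1, then inserts a single extra A after base 30 and translates again; the variable and function names are illustrative, and stop codons would be shown as `*`.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order at each position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate frame 1 of a DNA string; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTAT"
          "ACCCGCAACGATGTCTCCGACAGCGAGAAA")
shifted = actual[:30] + "A" + actual[30:]  # single-base insertion error

print(translate(actual))   # MSTAKLVKSKATNLLYTRNDVSDSEK
print(translate(shifted))  # MSTAKLVKSKSDQSALYPQRCLRQRE
```

The first ten residues agree, but everything downstream of the inserted base is read in the wrong frame, which is why a single sequencing error can destroy a gene prediction.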

Homology
 Similarities in form (sequence) allow us to infer similarities in “meaning” (structure and function)
 the cat sat on the mat
 die Katze sass auf der Matte
 Homology is not just sequence similarity
 Two sequences can be similar without any common ancestry, particularly if low complexity
vge|GBant88-2 ITLITCVSVKDNSKRYVVAG
vge|GEfae9-178 LTLITCDQATKTTGRIIVIA
vge|GSpne1-403 MTLITCDPIPTFNKRLLVNF
sortase_staur LTLITCDDYNEKTGVWEKRK

Types of Homology
 Homologues can be
divided into
 Orthologues: lines of
descent congruent with
whole genome
 Paralogues: result of
gene duplication
 Xenologues: result of
HGT

Homology Searches
 The aim of homology searches is to identify sequences
within these databases that are homologous to your
sequence.
 This involves comparing your sequence with all the
database sequences
 looking for stretches of sequence that appear to be similar
 then scoring the matches and ranking them
 a measure of the significance of the match is given
 Most common program used for homology searches is
BLAST
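The compare, score and rank steps can be sketched in Python. This is not how BLAST works internally: BLAST uses a fast word-seeding heuristic, substitution matrices and a statistical significance estimate, whereas the sketch below scores every database sequence exhaustively with a basic Smith-Waterman local alignment and arbitrary scoring values, and reports no significance. The database entries and function names are invented for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def search(query, database):
    """Score the query against every database entry and rank the hits."""
    hits = [(local_alignment_score(query, seq), name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)


database = {                      # hypothetical mini-database
    "candidate_1": "MTLITCDPIP",
    "candidate_2": "GGGGGGGG",
}
for score, name in search("LTLITC", database):
    print(name, score)
```

A high score alone does not prove homology; the significance measure reported by real search tools is what guards against chance or low-complexity matches.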

Bacterial Genome Dynamics
Gene Loss
 Drastic downsizing in isolated intracellular niches
 Accumulation of pseudogenes and IS elements after shift to new niche
Gene Gain
 Gene Duplication
 Horizontal gene transfer by phage, plasmids, pathogenicity islands
Gene Change
 Recombination and rearrangements
 single nucleotide polymorphisms (SNPs)
 Rapid emergence of genetically uniform pathogens from variable ancestral populations

Horizontal gene transfer
 Horizontal (or lateral) gene transfer denotes any
transfer, exchange or acquisition of genetic material that
differs from the normal mode of transmission from
parents to offspring (vertical transmission).

[Figure: vertical gene transfer compared with horizontal gene transfer]

Bacterial mobile genetic elements
 Transposons
 pieces of DNA that act as ‘jumping genes’, changing
their location on the chromosome or a plasmid
 encode transposase that catalyses the transposition
event
 can carry resistance or virulence genes
 Insertion sequences (IS elements)
 transposable elements that encode only the transposase
 multiple copies of same IS within genome provide targets
for homologous recombination, rearrangements and
replicon fusions
 Conjugative transposons
 normally integrated into the chromosome
 excise and are then transferred to recipient cells by conjugation

Bacterial mobile genetic elements
 Plasmids
 self-replicating extrachromosomal replicons
 usually circular but can be linear
 Can carry resistance or virulence genes
 Bacteriophages
 bacterial viruses
 can carry virulence genes
 can insert into bacterial chromosome as prophages
(lysogeny)
 Integrons
 complex natural cloning and gene expression systems
able to capture promoterless gene cassettes by site-
specific recombination
 allow formation of large arrays of gene cassettes
transferred as a whole between different replicons.

Genomic islands
 large chromosomal regions, part of the flexible gene
pool
 previously transferred by other mobile genetic
elements
 present in some bacteria but absent in close
relatives
 carry multiple genes that increase phenotypic
versatility
 contribute to dynamic character of bacterial
chromosomes and can be excised from the
chromosome and transferred to other recipients
 pathogenicity islands contain dozens of genes that
allow quantum leap to complex new virulence

Core genomes and Pangenomes
 Core genome
 pool of genes shared by all members of a bacterial
species
 Accessory or dispensable genome
 pool of genes present in some but not all genomes within
the same bacterial species
 Pangenome
 global gene repertoire of a bacterial species, comprised of
core genome + accessory genome
 Metagenome
 global gene repertoire of mixed microbial population
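The core, accessory and pangenome definitions map directly onto set operations. The Python sketch below uses made-up strain names and small gene lists purely for illustration; real analyses first have to cluster genes into orthologous families.

```python
def summarise_pangenome(genomes):
    """Return (core, accessory, pan) gene sets for a dict of strain -> genes."""
    gene_sets = [set(genes) for genes in genomes.values()]
    core = set.intersection(*gene_sets)   # present in every genome
    pan = set.union(*gene_sets)           # present in at least one genome
    accessory = pan - core                # present in some but not all
    return core, accessory, pan


genomes = {                                # hypothetical gene content
    "strain_1": {"dnaA", "gyrB", "recA", "stx2"},
    "strain_2": {"dnaA", "gyrB", "recA", "aggR"},
    "strain_3": {"dnaA", "gyrB", "recA"},
}
core, accessory, pan = summarise_pangenome(genomes)
print(sorted(core), sorted(accessory), len(pan))
```

As more genomes are added the core can only shrink or stay the same, while the pangenome can only grow or stay the same.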

Escherichia coli Core and Pan-genomes

Welch et al. Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):17020-4

Metagenomics
 Environmental shotgun
sequencing
 DNA extracted from
mixed microbial
communities sequenced
en masse
 Assembled into contigs
 Typically only small
contigs can be obtained

Uses of a genome sequence
 Gene discovery
 Fuelling hypothesis driven research on pathogen biology
 Comparative genomics
 SNP discovery and genomic epidemiology
 Functional genomics
 Transcriptomics
 Proteomics
 Interactome
 Structural Genomics
 Mass Mutagenesis

Haemolytic-uraemic syndrome
 Shiga-toxin-producing E. coli (STEC)
 bloody diarrhoea; damage to kidneys and brain
 anaemia; loss of platelets

German E. coli O104:H4 outbreak

 May-July 2011
 >4000 cases
 >40 deaths
 Link to sprouting seeds
 High risk of haemolytic-
uraemic syndrome
 Females particularly at risk

Frank et al DOI: 10.1056/NEJMoa1106483

Take-away messages from the genome
 Pathogens don't bother with passports!
 Not a new strain: something similar seen in Germany ten
years ago and in Korea
 closest genome-sequenced strain was isolated from Central
African Republic in late 1990s, belongs to an
enteroaggregative lineage
 German STEC probably comes from a lineage
circulating in human populations rather than from an
animal source (unlike E. coli O157)

Take-away messages
 Bacteria evolve
quickly
 Virulence factors in E.
coli can jump from one
lineage to another on
mobile genetic
elements
 Pathotypes can
overlap and evolve
 Antibiotic resistance
seen where no
obvious prior use of
antibiotics

Take-away messages from genome sequence
 Genome sequencing brings the advantages of
 open-endedness (revealing the “unknown unknowns”),
 universal applicability
 ultimate in resolution
 Bench-top sequencing platforms now generate data
sufficiently quickly and cheaply to have an impact on
real-world clinical and epidemiological problems

Comprehensive Coverage of Human Microbiome

Comprehensive coverage of tree of life

What will you do when you can sequence
everything?
