Bio153 Microbial Genomics Professor Mark Pallen University of Birmingham
Microbial Genomics
  General features of microbial genomes
  Historical overview
  Genome sequencing, annotation and analysis
  Genome evolution
  What can we learn from a genome sequence?
General features of genomes
  Microbial
    Small WYSIWYG genomes (Mbp)
    Gene density high (>90%), intergenic regions short
    Very little repetitive or non-coding DNA
    Introns very rare
    Protein-coding genes (CDS) short (~1 kbp)
    Operons with promoters just upstream
    Fewer non-coding RNAs
  Human
    Very large genomes (Gbp)
    Gene density low
    Only 25% is genes
    Introns mean that only 1% is coding
    Genes can span ≥30 kbp
    Genes have ~3 transcripts
    Splicing and splice variants
Bacterial genome organisation
  Chromosomes
    Most commonly single circular chromosome
    BUT many species have linear chromosome(s) (e.g. Borrelia, Streptomyces, Rhodococcus)
    BUT a few species with two chromosomes (e.g. Vibrio cholerae)
    Can be mix of circular and linear (e.g. Agrobacterium tumefaciens, B. burgdorferi)
  Plasmids
    Independent autonomous replicon, can be circular or linear (always DNA)
    May integrate into chromosome
    Copy number varies 1 to 10s
    Often carry non-essential genes that confer an adaptive advantage in certain conditions
Bacterial Genome Size
  Species which occupy restricted ecological niches (e.g. obligate intracellular parasites and endosymbionts) tend to have smaller genomes (<1.5 Mb) than generalist bacteria
  Smallest known bacterial genome: Carsonella ruddii, 160 kb! (Nakabachi et al. 2006)
    BUT mitochondrial genomes are smaller
  Largest genomes found in bacteria with complex developmental cycles, e.g. Streptomyces
  Largest bacterial genome: Sorangium cellulosum, 13 Mb
Bacterial genomes are made from DNA
  In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty showed that DNA (not proteins) was the genetic material responsible for inheritance.
    Identified DNA as the "transforming principle" while studying Streptococcus pneumoniae
    Avery, Oswald T., Colin M. MacLeod, and Maclyn McCarty. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. Journal of Experimental Medicine. 1944 Feb 1; 79(2): 137-158.
  In 1952, this work was supported by Alfred Hershey and Martha Chase, who showed that only the DNA of a virus needs to enter a bacterium to infect it.
    Used radioactively labelled bacteriophage
    Hershey AD and Chase M. Independent functions of viral protein and nucleic acid in growth of bacteriophage. Journal of General Physiology. 1952. 36: 39-56.
Viral genomes are variable
  Use RNA or DNA but not both in genome
    Some have RNA genomes!
  Grouped into families depending on type of genome: DNA or RNA, single- or double-stranded
  Typically dozens of genes or fewer
  Large genomes in pox viruses (~200 kb)
  Massive genomes in megaviruses (1 Mbp!)
Microbial Genomics Timeline
  Year  Milestone
  1977  Invention of dideoxy chain terminator sequencing ("Sanger sequencing")
  1979  Sequencing of the 5.3-kilobase genome of bacteriophage phiX174
  1981  First human mitochondrial genome sequence*
  1982  Determination of the 48.5-kilobase genome sequence of bacteriophage lambda through first use of shotgun sequencing
  1986  Development of automated fluorescent sequencing
  1995  First complete genome sequences obtained of free-living bacteria (Haemophilus influenzae and Mycoplasma genitalium)
  1996  Mycoplasma becomes first bacterial genus that has completely sequenced genomes from two different species (M. genitalium and M. pneumoniae)
  1997  First genome sequences from Escherichia coli and Bacillus subtilis
  1998  First genome sequence from Mycobacterium tuberculosis; genome sequence from Rickettsia prowazekii provides first evidence of reductive evolution
Microbial Genomics Timeline (continued)
  Year  Milestone
  1999  Helicobacter pylori becomes the first species with completely sequenced genomes from two isolates
  2000  Meningococcal genome sequence primes first application of reverse vaccinology
  2001  Second E. coli genome sequences reveal unexpected level of horizontal gene transfer; genome sequence of M. leprae provides compelling evidence of bacterial pseudogenes and reductive evolution; first paper reporting genome sequences of two strains from one species (Staphylococcus aureus) in a single publication
  2002  Genome sequencing of multiple strains of Bacillus anthracis to provide markers for forensic epidemiology
  2003  Genome sequencing of uncultivable Tropheryma whipplei leads to design of axenic growth medium
  2004  Genome sequence of mimivirus blurs distinctions between bacteria and viruses
  2005  Whole-genome sequencing used to identify target of new anti-tuberculosis drug; Mycoplasma genitalium genome sequenced using pyrosequencing
  2006-2011  Bacterial metagenomics survey of the Sargasso Sea yields >1 million new genes; rise of next-generation or high-throughput sequencing
The first genome sequences
  The first sequenced gene was from bacteriophage MS2
    The gene encoding the coat protein
    1972
    Min Jou W, Haegeman G, Ysebaert M, and Fiers W. Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature. 1972 May 12; 237(5350): 82-88.
  The first sequenced genome was bacteriophage MS2
    1976
    RNA genome is 3,569 nucleotides
    Fiers W, Contreras R, Duerinck F, Haegeman G, Iserentant D, Merregaert J, Min Jou W, Molemans F, Raeymaekers A, Van den Berghe A, Volckaert G, and Ysebaert M. Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature. 1976 Apr 8; 260(5551): 500-507.
The first genome sequences
  The first sequenced DNA genome was bacteriophage ΦX174
    1977
    5,386 nucleotides
    Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA, Hutchison CA, Slocombe PM, and Smith M. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977 265 (5596): 687-695.
  The first sequenced bacterial genome was Haemophilus influenzae
    1995
    1,830,140 base pairs
    Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, and Merrick J. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995 269 (5223): 496-512.
Overview of a genome project
  Choose strain
    Fresh isolate or tractable lab strain?
  Choose strategy
    Shotgun sequencing
    Paired-end sequencing
  Choose chemistry
    Sanger; 454; Illumina; Ion Torrent
  Assembly
    Automated
  Closure and finishing
    Manually intensive
    Difficulty depends on how repetitive
  Data Release
    Immediate or delayed?
    Draft or complete?
  Annotation
    Manually intensive bottleneck
  Publication
Methods for genome sequencing (historic): Sanger method sequencing
  Sanger F and Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of Molecular Biology. 1975 94: 441-448.
  Step 1: a sequence-specific DNA primer is radiolabelled
  Step 2: the primer is annealed to the template DNA
  Step 3: the primer is extended by DNA polymerase
    Incorporation of a deoxynucleotide: further extension possible
    Incorporation of a dideoxynucleotide: chain termination
  Four reactions set up
    ddATP, dATP, dCTP, dGTP, dTTP
    ddCTP, dATP, dCTP, dGTP, dTTP
    ddGTP, dATP, dCTP, dGTP, dTTP
    ddTTP, dATP, dCTP, dGTP, dTTP
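The logic of the four reactions can be illustrated with a toy model. This is a sketch, not a simulation of the chemistry: it assumes a hypothetical 8-base template, measures fragment lengths from the end of the primer, and assumes every possible termination product is present. Each reaction (lane) contains fragments ending in one base, and reading all lanes from the shortest to the longest fragment recovers the newly synthesised strand.

```python
# Toy model of the four dideoxy reactions (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesised_strand(template):
    """Strand made by the polymerase, 5'->3' (reverse complement of the template)."""
    return "".join(COMPLEMENT[base] for base in reversed(template))


def termination_lengths(new_strand):
    """For each ddNTP reaction, the fragment lengths at which extension can stop."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)  # a ddNTP of this base terminates here
    return lanes


def read_gel(lanes):
    """Read the lanes from shortest to longest fragment to recover the sequence."""
    base_at_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(base_at_length[n] for n in sorted(base_at_length))


template = "TACGGTCA"  # hypothetical template, written 5'->3'
strand = synthesised_strand(template)
lanes = termination_lengths(strand)
print(lanes)
print(read_gel(lanes))
```

Reading the "gel" returns the synthesised strand, which is the reverse complement of the template.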
Methods for genome sequencing: automated Sanger sequencing
  Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, and Hood LE. Fluorescence detection in automated DNA sequence analysis. Nature. 1986 321: 674-679.
  Replaced radioisotopes with fluorescent dyes
    Safer for the researchers
    Each of the four DNA bases could be dyed a different colour
    Eliminated the need to run separate reactions in separate lanes
  The migration of the dye could be read because of the fluorescence
    This information allowed automatic gel reading
  Further improvements were made
    Improved dye chemistry using fluorescent dideoxy-terminators (DuPont): Prober JM, Trainor GL, Dam RJ, Hobbs FW, Robertson CW, Zagursky RJ, Cocuzza AJ, Jensen MA, and Baumeister K. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 1987 238: 336-341.
    Replacing slab gels with re-usable capillary tubes: Ruiz-Martinez MC, Berka J, Belenkii A, Foret F, Miller AW, and Karger BL. DNA sequencing by capillary electrophoresis with replaceable linear polyacrylamide and laser-induced fluorescence detection. Analytical Chemistry 1993 65: 2851-2858.
Whole-Genome Shotgun Sanger Sequencing
  Random shearing of bacterial chromosome
  Size selection
  Cloning into plasmid vector
  Pick colonies to create shotgun library
  Plasmid preps
  Sequence each insert with two primers
High-throughput Sequencing
  100x faster, 100x cheaper!
  A disruptive technology
  Several technologies in the marketplace from 2007 onwards
    454 (Roche)
    Illumina
    Ion Torrent
    PacBio
  Fundamentally new approaches
    Solid-phase amplification of clonal templates in "molecular colonies"
    Massive increase in number of "clones" compensates for shorter read length
  New chemistries for sequence reading
    454: pyrophosphate detection on base addition
    Illumina: reversible de-protection of fluorescent bases
454 sequencing: emulsion-based clonal amplification
  Anneal sstDNA to an excess of DNA Capture Beads
  Emulsify beads and PCR reagents in water-in-oil microreactors
  Clonal amplification occurs inside microreactors
  Break microreactors, enrich for DNA-positive beads
Pyrosequencing
  DNA template with primer is mixed with the enzymes along with the two substrates adenosine 5'-phosphosulfate (APS) and luciferin
  1. One of the four nucleotides is added to the reaction
  2. If it is complementary to the base in the template strand, DNA polymerase incorporates it
  3. Pyrophosphate (PPi) is released, then converted to ATP by sulfurylase in the presence of APS
  4. ATP serves as a substrate for luciferase, causing a light reaction
  5. Excess nucleotides are degraded by apyrase
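The cycle above can be sketched as a simple loop over nucleotide flows. This is a minimal illustration under stated assumptions: the flow order "TACG" and the example strand are hypothetical, the enzymes are assumed to work perfectly, and the light signal is modelled as exactly equal to the number of bases incorporated in that flow (so a run of identical bases gives one proportionally brighter flash).

```python
def flowgram(new_strand, flow_order="TACG", max_flows=400):
    """Return a list of (nucleotide, signal) pairs, one per nucleotide flow.

    new_strand is the strand being synthesised (complementary to the template).
    signal counts the bases incorporated during that flow; 0 means no light.
    """
    flows = []
    position = 0
    flow_number = 0
    while position < len(new_strand) and flow_number < max_flows:
        nucleotide = flow_order[flow_number % len(flow_order)]
        incorporated = 0
        # The polymerase keeps adding this nucleotide while it matches.
        while position < len(new_strand) and new_strand[position] == nucleotide:
            incorporated += 1
            position += 1
        flows.append((nucleotide, incorporated))
        flow_number += 1  # apyrase removes the excess before the next flow
    return flows


def call_bases(flows):
    """Convert the flowgram back into a sequence."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = flowgram("TTAGGGC")
print(flows)
print(call_bases(flows))
```

In real data the signal for long homopolymers is harder to quantify than this idealised integer count suggests.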
The Sequence Assembly Problem
  Sequencing technologies generate reads of <1000 bp
  These reads must be assembled into a single continuous genomic sequence
  Shotgun sequencing exploits many overlapping sequences (high coverage) to infer ordering directly from the sequences themselves
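A minimal sketch of how ordering can be inferred from overlaps alone: repeatedly merge the two reads with the longest suffix-to-prefix overlap. The reads and the minimum overlap are hypothetical, the reads are assumed error-free and all from the same strand, and real assemblers use far more sophisticated methods than this greedy loop.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining contigs do not overlap: gaps in coverage
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]  # hypothetical error-free reads
print(greedy_assemble(reads))
```

If coverage is incomplete, the function returns several contigs rather than one sequence.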
The Repeat Problem
  Repeats at read ends can be assembled in multiple ways
  Correct
    ATTTATGTGTGTGTGGTGTG
              GTGTGGTGTGCACTACTGCT
    ACTACTGCTGACTACTGTGTGGTGTG
                    GTGTGGTGTGATATCCCT
  Incorrect
    ATTTATGTGTGTGTGGTGTG
              GTGTGGTGTGATATCCCT
    ACTACTGCTGACTACTGTGTGGTGTG
                    GTGTGGTGTGCACTACTGCT
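The ambiguity can be checked directly on the four reads from this slide. In the sketch below the read labels A to D and the minimum overlap of 9 bases are assumptions added for illustration. Two reads end in the same 10-base repeat and two reads begin with it, so the overlaps alone cannot say which pairing is correct.

```python
def overlap(a, b, min_len=9):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


# The four reads shown on the slide; the labels are hypothetical.
READS = {
    "A": "ATTTATGTGTGTGTGGTGTG",
    "B": "GTGTGGTGTGCACTACTGCT",
    "C": "ACTACTGCTGACTACTGTGTGGTGTG",
    "D": "GTGTGGTGTGATATCCCT",
}


def successors(reads, min_len=9):
    """For each read, list the reads that could follow it in an assembly."""
    return {
        x: [y for y in reads if y != x and overlap(reads[x], reads[y], min_len) > 0]
        for x in reads
    }


graph = successors(READS)
for name, following in graph.items():
    print(name, "->", following)
```

Reads A and C each have two equally good successors (B and D), which is exactly the situation that paired-end information is used to resolve.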
Paired-end Sequencing
  Workflow (figure labels)
    Random shearing of bacterial chromosome
    Size selection for 3 kb or 8 kb etc
    Add linkers
    Circularise
    Add adapters
    Shear and select on size and presence of linkers
    Obtain sequences from either side of linker, a known distance apart in the genome
  Create long fragments of known length
  Obtain sequence from paired ends a known distance apart
  Allows assembly of contigs across repeats into scaffolds
Genome Assembly
  [Figure: Contig 1, Contig 2 and Contig 3 joined into a scaffold; labels: Sequence Gap, Physical Gap]
Re-sequencing
  Short reads (<200 bp) are inefficient for de novo assembly
  Instead they are mapped against a reference genome
  Re-sequencing is like assembling a jigsaw puzzle using the image on the lid
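Mapping can be sketched as sliding each read along the reference and keeping the position with the fewest mismatches. This is a brute-force illustration only: the reference fragment, the read and the mismatch limit are hypothetical, positions are 0-based, gaps are not allowed, and real read mappers use indexed searches rather than scanning every position.

```python
def best_hit(read, reference, max_mismatches=2):
    """Return (start, mismatches) for the best ungapped placement, or None.

    mismatches is a list of (reference_position, reference_base, read_base).
    """
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) > max_mismatches:
            continue
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
    return best


reference = "ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACC"  # hypothetical reference fragment
read = "CGCTAGATTAG"  # hypothetical read carrying one difference from the reference
print(best_hit(read, reference))
```

The single mismatch reported for the mapped read is the kind of difference that would be followed up as a candidate SNP.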
Genome annotation
  Annotation is the addition of information about the predicted sequence features to the flat file of DNA code
  Identification of potential coding sequences (CDS)
  Homology searches to predict function
  Other features can be annotated as well
    rRNAs
    tRNAs
    Potential promoters
    Small non-coding RNAs
    Repeat sequences
    Insertion sequences (ISs), transposons, gene fragments
    Location of the origin of replication
  Determination of the number of bases, genes, and G+C%
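The simplest of these summary statistics, the number of bases and the G+C%, can be computed directly from the sequence. The function name and the example sequence below are hypothetical; ambiguous bases such as N are assumed to be excluded from the G+C denominator.

```python
from collections import Counter


def base_composition(sequence):
    """Return the length, per-base counts and G+C percentage of a DNA sequence."""
    counts = Counter(sequence.upper())
    unambiguous = sum(counts[base] for base in "ACGT")
    gc_percent = 100.0 * (counts["G"] + counts["C"]) / unambiguous if unambiguous else 0.0
    return {
        "length": len(sequence),
        "counts": {base: counts[base] for base in "ACGT"},
        "gc_percent": gc_percent,
    }


print(base_composition("ATGCGCNN"))
```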
…to this?
FT   gene            complement(9299..10702)
FT                   /db_xref="GenBank:2367266"
FT                   /gene="dnaA"
FT                   /note="b3702"
FT   CDS             complement(9299..10702)
FT                   /db_xref="GI:2367267"
FT                   /db_xref="PID:g2367267"
FT                   /function="putative regulator; DNA - replication, repair,
FT                   restriction/modification"
FT                   /codon_start=1
FT                   /protein_id="AAC76725.1"
FT                   /gene="dnaA"
FT                   /translation="MSLSLWQQCLARLQDELPATEFSMWIRPLQAELSDNTLALYAPNR
FT                   FVLDWVRDKYLNNINGLLTSFCGADAPQLRFEVGTKPVTQTPQAAVTSNVAAPAQVAQT
FT                   QPQRAAPSTRSGWDNVPAPAEPTYRSNVNVKHTFDNFVEGKSNQLARAAARQVADNPGG
FT                   AYNPLFLYGGTGLGKTHLLHAVGNGIMARKPNAKVVYMHSERFVQDMVKALQNNAIEEF
FT                   KRYYRSVDALLIDDIQFFANKERSQEEFFHTFNALLEGNQQIILTSDRYPKEINGVEDR
FT                   LKSRFGWGLTVAIEPPELETRVAILMKKADENDIRLPGEVAFFIAKRLRSNVRELEGAL
FT                   NRVIANANFTGRAITIDFVREALRDLLALQEKLVTIDNIQKTVAEYYKIKVADLLSKRR
FT                   SRSVARPRQMAMALAKELTNHSLPEIGDAFGGRDHTTVLHACRKIEQLREESHDIKEDF
FT                   SNLIRTLSS"
FT                   /product="DNA biosynthesis; initiation of chromosome
FT                   replication; can be transcription regulator"
FT                   /transl_table=11
FT                   /note="f467; 100 pct identical to DNAA_ECOLI SW: P03004;
FT                   CG Site No. 851"
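The location string in this entry carries enough information for a quick sanity check. The sketch below uses hypothetical helper names and only handles simple locations of the form shown here (not joins or partial features). It parses the coordinates and strand, then derives the expected protein length by assuming the last codon is a stop codon.

```python
import re

# Matches "start..end" or "complement(start..end)" only.
LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)(\))?$")


def parse_location(text):
    """Return (start, end, strand) for a simple feature location."""
    match = LOCATION.match(text.strip())
    if match is None or bool(match.group(1)) != bool(match.group(4)):
        raise ValueError("unsupported location: " + text)
    strand = "-" if match.group(1) else "+"
    return int(match.group(2)), int(match.group(3)), strand


def protein_length(text):
    """Number of amino acids encoded, assuming the final codon is a stop."""
    start, end, _ = parse_location(text)
    nucleotides = end - start + 1  # coordinates are inclusive
    if nucleotides % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return nucleotides // 3 - 1


print(parse_location("complement(9299..10702)"))
print(protein_length("complement(9299..10702)"))
```

For this entry the span is 1,404 nucleotides, which gives 467 amino acids; that figure appears to match the "f467" in the entry's note.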
An ORF is not a CDS!
  An ORF is just an open reading frame
  There are many more ORFs than protein-coding genes (CDSs) in a genome
  [Figure: non-coding ORFs and CDSs (note that an ORF can extend upstream of the start codon)]
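A short sketch shows why ORFs outnumber genes. Here an ORF is taken to be any stop-free stretch in one reading frame, with no start codon required; only the three forward frames are scanned, and the minimum length of 10 codons is an arbitrary assumption. The test sequence is the 78-base "actual sequence" from the frameshift slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(sequence, min_codons=10):
    """Return (frame, start, end) for stop-free stretches on the forward strand.

    Coordinates are 0-based and end is exclusive; the stop codon is not included.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = frame
        for pos in range(frame, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos))
                start = pos + 3  # next ORF begins after the stop codon
        # Stretch still open when the sequence runs out.
        end = frame + 3 * ((len(sequence) - frame) // 3)
        if (end - start) // 3 >= min_codons:
            orfs.append((frame, start, end))
    return orfs


SEQUENCE = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTG"
            "CTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA")
for orf in open_reading_frames(SEQUENCE):
    print(orf)
```

Even on this short fragment every forward frame contains an open stretch of at least 10 codons, which illustrates why an ORF on its own is weak evidence of a gene.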
The Problem of Frameshift Errors
  Actual sequence
    ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
    Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
    Frame 2: • V P L N • L N Q K R P I C F I P A T M S P T A R K
    Frame 3: E Y R • I S • I K S D Q S A L Y P Q R C L R Q R E K
  Frameshifted sequence after single base error
    ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
    Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
    Frame 2: • V P L N • L N Q K A T N L L Y T R N D V S D S E K
    Frame 3: E Y R • I S • I K K R P I C F I P A T M S P T A R K
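The effect of the single extra base can be reproduced with a few lines of code. The sketch below builds the standard genetic code table, translates the actual sequence from this slide, then inserts one extra A into the run of A's and translates again. Function names are hypothetical, and stop codons are written as "*" rather than the bullet used on the slide.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate complete codons of one forward frame; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[sequence[pos:pos + 3]]
        for pos in range(frame, len(sequence) - 2, 3)
    )


ACTUAL = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTG"
          "CTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA")
# Simulate the sequencing error: one extra A in the run of A's.
SHIFTED = ACTUAL[:27] + "A" + ACTUAL[27:]

print(translate(ACTUAL))
print(translate(SHIFTED))
```

Downstream of the error, frame 1 of the shifted sequence reads what was frame 3 of the actual sequence, so the predicted protein changes completely from that point on.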
Homology
  Similarities in form (sequence) allow us to infer similarities in "meaning" (structure and function)
    the cat sat on the mat
    die Katze sass auf der Matte
  Homology is not just sequence similarity
    Two sequences can be similar without any common ancestry, particularly if low complexity
  vge|GBant88-2   ITLITCVSVKDNSKRYVVAG
  vge|GEfae9-178  LTLITCDQATKTTGRIIVIA
  vge|GSpne1-403  MTLITCDPIPTFNKRLLVNF
  sortase_staur   LTLITCDDYNEKTGVWEKRK
Types of Homology
  Homologues can be divided into
    Orthologues: lines of descent congruent with whole genome
    Paralogues: result of gene duplication
    Xenologues: result of HGT
Homology Searches
  The aim of homology searches is to identify sequences within the databases that are homologous to your sequence
  This involves
    comparing your sequence with all the database sequences
    looking for stretches of sequence that appear to be similar
    scoring the matches and ranking them
    giving a measure of the significance of each match
  The most common program used for homology searches is BLAST
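The "score and rank" idea can be sketched with a basic local alignment score (Smith-Waterman style dynamic programming). This is not how BLAST is implemented: BLAST uses heuristic seeding, substitution matrices and E-values, whereas the sketch below assumes a flat match/mismatch/gap scheme with made-up values and reports no significance measure. The small "database" reuses the four sequence fragments from the Homology slide, and the query is hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (score only, no traceback)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_hits(query, database):
    """Return (name, score) pairs sorted from best to worst score."""
    scores = {name: local_alignment_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


DATABASE = {
    "vge|GBant88-2": "ITLITCVSVKDNSKRYVVAG",
    "vge|GEfae9-178": "LTLITCDQATKTTGRIIVIA",
    "vge|GSpne1-403": "MTLITCDPIPTFNKRLLVNF",
    "sortase_staur": "LTLITCDDYNEKTGVWEKRK",
}
for name, score in rank_hits("LTLITCD", DATABASE):  # hypothetical query
    print(name, score)
```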
Bacterial Genome Dynamics
  [Figure: four categories of change (Gene Loss, Gene Duplication, Gene Gain, Gene Change) arranged around "Bacterial Genome Dynamics"]
  Processes shown
    Drastic downsizing in isolated intracellular niches
    Horizontal gene transfer by phage, plasmids, pathogenicity islands
    Rapid emergence of genetically uniform pathogens from variable ancestral populations
    Accumulation of pseudogenes and IS elements after shift to new niche
    Recombination and rearrangements
    Single nucleotide polymorphisms (SNPs)
Horizontal gene transfer
  Horizontal (or lateral) gene transfer denotes any transfer, exchange or acquisition of genetic material that differs from the normal mode of transmission from parents to offspring (vertical transmission).
  [Figure: vertical gene transfer versus horizontal gene transfer]
Bacterial mobile genetic elements
  Transposons
    pieces of DNA that act as 'jumping genes', changing location on the chromosome or on a plasmid
    encode transposase that catalyses the transposition event
    can carry resistance or virulence genes
  Insertion sequences (IS elements)
    transposable elements that encode only the transposase
    multiple copies of same IS within genome provide targets for homologous recombination, rearrangements and replicon fusions
  Conjugative transposons
    normally integrated into the chromosome
    excise, then are transferred to recipient cells by conjugation
Bacterial mobile genetic elements
  Plasmids
    self-replicating extrachromosomal replicons
    usually circular but can be linear
    can carry resistance or virulence genes
  Bacteriophages
    bacterial viruses
    can carry virulence genes
    can insert into bacterial chromosome as prophages (lysogeny)
  Integrons
    complex natural cloning and gene expression systems
    able to capture promoterless gene cassettes by site-specific recombination
    allow formation of large arrays of gene cassettes transferred as a whole between different replicons
Genomic islands
  large chromosomal regions, part of the flexible gene pool
  previously transferred by other mobile genetic elements
  present in some bacteria but absent in close relatives
  carry multiple genes that increase phenotypic versatility
  contribute to dynamic character of bacterial chromosomes and can be excised from the chromosome and transferred to other recipients
  pathogenicity islands contain dozens of genes that allow quantum leap to complex new virulence
Core genomes and Pangenomes
  Core genome
    pool of genes shared by all members of a bacterial species
  Accessory or dispensable genome
    pool of genes present in some but not all genomes within the same bacterial species
  Pangenome
    global gene repertoire of a bacterial species, comprised of core genome + accessory genome
  Metagenome
    global gene repertoire of mixed microbial population
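These definitions map directly onto set operations. In the sketch below the strain names and gene presence/absence data are invented purely for illustration, and genes are assumed to have already been grouped into families with consistent names.

```python
def core_and_pan(genomes):
    """Return (core, pan, accessory) gene sets for a dict of strain -> gene names."""
    gene_sets = [set(genes) for genes in genomes.values()]
    core = set.intersection(*gene_sets)  # present in every genome
    pan = set.union(*gene_sets)          # present in at least one genome
    accessory = pan - core               # present in some but not all
    return core, pan, accessory


# Hypothetical presence/absence data for three strains of one species.
GENOMES = {
    "strain_1": ["dnaA", "gyrB", "recA", "geneX"],
    "strain_2": ["dnaA", "gyrB", "recA", "geneY"],
    "strain_3": ["dnaA", "gyrB", "recA", "geneX", "geneZ"],
}
core, pan, accessory = core_and_pan(GENOMES)
print(sorted(core))
print(sorted(pan))
print(sorted(accessory))
```

As more genomes are added the core can only shrink or stay the same, while the pangenome can only grow or stay the same.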
Escherichia coli Core and Pan-genomes Welch et al. Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):17020-4
Metagenomics
  Environmental shotgun sequencing
  DNA extracted from mixed microbial communities
  Sequenced en masse
  Assembled into contigs
  Typically only small contigs can be obtained
Uses of a genome sequence
  Gene discovery
  Fuelling hypothesis-driven research on pathogen biology
  Comparative genomics
  SNP discovery and genomic epidemiology
  Functional genomics
  Transcriptomics
  Proteomics
  Interactome
  Structural Genomics
  Mass Mutagenesis
Haemolytic-uraemic syndrome
  Shiga-toxin-producing E. coli (STEC)
  bloody diarrhoea; damage to kidneys and brain
  anaemia; loss of platelets
German E. coli O104:H4 outbreak
  May-July 2011
  >4000 cases
  >40 deaths
  Link to sprouting seeds
  High risk of haemolytic-uraemic syndrome
  Females particularly at risk
  Frank et al DOI: 10.1056/NEJMoa1106483
Take-away messages from the genome
  Pathogens don't bother with passports!
  Not a new strain: something similar seen in Germany ten years ago and in Korea
  Closest genome-sequenced strain was isolated from Central African Republic in late 1990s, belongs to an enteroaggregative lineage
  German STEC probably comes from a lineage circulating in human populations rather than from an animal source (unlike E. coli O157)
Take-away messages
  Bacteria evolve quickly
  Virulence factors in E. coli can jump from one lineage to another on mobile genetic elements
  Pathotypes can overlap and evolve
  Antibiotic resistance seen where no obvious prior use of antibiotics
Take-away messages from genome sequence
  Genome sequencing brings the advantages of
    open-endedness (revealing the "unknown unknowns")
    universal applicability
    the ultimate in resolution
  Bench-top sequencing platforms now generate data sufficiently quickly and cheaply to have an impact on real-world clinical and epidemiological problems