Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing

EVE 161: Microbial Phylogenomics
Lecture 10
UC Davis, Winter 2016
Instructors: Jonathan Eisen & Holly Ganz

Answer 2 of these. Please make your answers short.
• 1) List 4-5 steps in a “Whole Genome Shotgun Sequencing” project.
• 2) What is meant by the “Add on Costs of Sequencing”?
• 3) Explain one form of evidence used to infer lateral gene transfer and why that evidence can sometimes be misleading.
• 4) Give examples of 3 different ways to fragment genomic DNA.

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
1st Genome Sequence
Fleischmann
et al. 1995

Complete Genome/Chromosome Progress

Fraser et al. 2000
insight progress
Microbes were the first organisms on Earth
and preceded animals and plants by more
than 3 billion years. They are the
foundation of the biosphere, from both
an evolutionary and an environmental
perspective¹. It has been estimated that microbial species
comprise about 60% of the Earth’s biomass. The genetic,
metabolic and physiological diversity of microbial species
is far greater than that found in plants and animals. But
the diversity of the microbial world is largely unknown,
with less than one-half of 1% of the estimated 2–3 billion
microbial species identified. Of those species that have
been described, their biological diversity is extraordinary,
having adapted to grow under extremes of temperature,
advances in DNA-sequencing technology, the sequencing of
whole genomes had not progressed beyond lambda-sized
clones (about 40 kbp) because of the lack of sufficient
computational approaches that would enable the efficient
assembly of a large number of independent random
sequences into a single contig.
For the H. influenzae and subsequent projects, we have
used a computational method that was developed to create
assemblies from hundreds of thousands of complementary
DNA sequences 300–500-bp long⁴. This approach has
proved to be a cost-effective and efficient approach to
sequencing megabase-sized segments of genomic DNA.
This strategy does not require an ordered set of cosmids or
other subclones, thus significantly reducing the overall cost
Microbial genome sequencing
Claire M. Fraser, Jonathan A. Eisen & Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
Complete genome sequences of 30 microbial species have been determined during the past five years, and
work in progress indicates that the complete sequences of more than 100 further microbial species will be
available in the next two to four years. These results have revealed a tremendous amount of information on
the physiology and evolution of microbial species, and should provide novel approaches to the diagnosis and
treatment of infectious disease.

Fraser et al. Shotgun Sequencing 2000 insight progress
analysis of the genomes of two thermophilic bacterial species,
Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of
the genes in these species were more similar to genes from archaea
than those from bacteria¹³,¹⁴. This led to the suggestion of possible
extensive gene exchanges between these species and archaeal
be extensive, it is somehow constrained by phylogenetic relation-
ships. Other evidence for a ‘core’ of particular lineages comes from
the finding of a conserved core of euryarchaeal genomes²¹,²² and
another finding that some types of gene might be more prone to gene
transfer than others²³. It therefore seems likely that horizontal gene
Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.
1. Library construction: (i) Isolate DNA; (ii) Fragment DNA; (iii) Clone DNA
2. Random sequencing phase: (i) Sequence DNA (15,000 sequences per Mb)
3. Closure phase: (i) Assemble sequences; (ii) Close gaps; (iii) Edit; (iv) Annotation
4. Complete genome sequence

From http://genomesonline.org

Loman et al. 2012
In bacteriology, the genomic era began
in 1995, when the first bacterial genome
was sequenced using conventional Sanger
sequencing¹. Back then, sequencing projects
required six-figure budgets and
be used to analyse these data and thus move
from draft to complete genomes.
Several high-throughput sequencing
platforms are now chasing the US$1,000
human genome³. Given that the average
High-throughput bacterial genome
sequencing: an embarrassment of
choice, a world of opportunity
Nicholas J. Loman¹, Chrystala Constantinidou¹, Jacqueline Z. M. Chan¹, Mihail Halachev¹, Martin Sergeant¹, Charles W. Penn¹, Esther R. Robinson² and Mark J. Pallen¹
Abstract | Here, we take a snapshot of the high-throughput sequencing platforms,
together with the relevant analytical tools, that are available to microbiologists in
2012, and evaluate the strengths and weaknesses of these platforms in obtaining
bacterial genome sequences. We also scan the horizon of future possibilities,
speculating on how the availability of sequencing that is ‘too cheap to metre’ might
change the face of microbiology forever.

Loman et al. Shotgun Sequencing 2014

Table 1 | Comparison of next-generation sequencing platforms
Fields for each machine: chemistry; modal read length* (bases); run time; Gb per run; current, approximate cost (US$)‡; advantages; disadvantages

High-end instruments

454 GS FLX+ (Roche)
  Chemistry: Pyrosequencing
  Modal read length*: 700–800
  Run time: … hours
  Gb per run: 0.7
  Cost (US$)‡: 500,000
  Advantages: Long read lengths
  Disadvantages: Appreciable hands-on time; High reagent costs; High error rate in homopolymers

HiSeq 2000/2500 (Illumina)
  Chemistry: Reversible terminator
  Modal read length*: 2×100
  Run time: 11 days (regular mode) or … days (rapid run mode)§
  Gb per run: 600 (regular mode) or 120 (rapid run mode)§
  Cost (US$)‡: 750,000
  Advantages: Cost-effectiveness; Steadily improving read lengths; Massive throughput; Minimal hands-on time
  Disadvantages: Long run time; Short read lengths; HiSeq 2500 instrument upgrade not available at time of writing (available end 2012)

5500xl SOLiD (Life Technologies)
  Chemistry: Ligation
  Modal read length*: 75 + 35
  Run time: … days
  Gb per run: 150
  Cost (US$)‡: 350,000
  Advantages: Low error rate; Massive throughput
  Disadvantages: Very short read lengths; Long run times

PacBio RS (Pacific Biosciences)
  Chemistry: Real-time sequencing
  Modal read length*: 3,000 (maximum 15,000)
  Run time: … minutes
  Gb per run: 3 per day
  Cost (US$)‡: 750,000
  Advantages: Simple sample preparation; Low reagent costs; Very long read lengths
  Disadvantages: High error rate; Expensive system; Difficult installation

Bench-top instruments

454 GS Junior (Roche)
  Chemistry: Pyrosequencing
  Modal read length*: 500
  Run time: … hours
  Gb per run: 0.035
  Cost (US$)‡: 100,000
  Advantages: Long read lengths
  Disadvantages: Appreciable hands-on time; High reagent costs; High error rate in homopolymers

Ion Personal Genome Machine (Life Technologies)
  Chemistry: Proton detection
  Modal read length*: 100 or 200
  Run time: … hours
  Gb per run: 0.01–0.1 (314 chip), 0.1–0.5 (316 chip) or up to 1 (318 chip)
  Cost (US$)‡: 80,000 (including OneTouch and server)
  Advantages: Short run times; Appropriate throughput for microbial applications
  Disadvantages: Appreciable hands-on time; High error rate in homopolymers

Ion Proton (Life Technologies)
  Chemistry: Proton detection
  Modal read length*: Up to 200
  Run time: 2 hours
  Gb per run: Up to 10 (Proton I chip) or up to 100 (Proton II chip)
  Cost (US$)‡: 145,000 + 75,000 for compulsory server
  Advantages: Short run times; Flexible chip reagents
  Disadvantages: Instrument not available at time of writing

MiSeq (Illumina)
  Chemistry: Reversible terminator
  Modal read length*: 2×150
  Run time: … hours
  Gb per run: 1.5
  Cost (US$)‡: 125,000
  Advantages: Cost-effectiveness; Short run times; Appropriate throughput for microbial applications; Minimal hands-on time
  Disadvantages: Read lengths too short for efficient assembly

*Average read length for a fragment-based run.
‡Approximate cost per machine plus additional instrumentation and service contract. See REF. 58.
§Available only on the HiSeq 2500.

De novo assemblies can be compared using
Mauve²⁵ or Mugsy²⁶, and the assemblies
can be manually examined using the Tablet²⁷
intensive. Some workflows combine a series
of programs and provide an accessible
interface for microbiologists who are not
Table 2 | The applicability of the major high-throughput sequencing platforms
Machines compared*: 454 GS Junior‡; 454 GS FLX+‡; Ion Personal Genome Machine (318 chip)§; MiSeq||; HiSeq 2000||; 5500xl SOLiD§; PacBio RS¶
Example applications in bacteriology, each with its desirable characteristics:
• De novo sequencing of novel strains to generate a single-scaffold reference genome
– Long reads
– Paired-end protocol and/or long mate-pair protocol
– Even coverage of genome
• Rapid characterization of a novel pathogen (draft de novo assembly of a genome for a single strain)
– Total run time (library preparation plus sequencing) of under … hours
– Sufficient coverage of a bacterial genome in a single run
• Rough-draft de novo sequencing of small numbers of strains (<20) for comparative analysis of gene content
– Long or paired-end reads
– High throughput
– Ease of library and sequencing workflow
– Cost-effective
• Re-sequencing of many similar strains (>50) for the discovery of single nucleotide polymorphisms and for phylogenetics
– Very high throughput
– Low-cost, high-throughput sequence library construction
– High accuracy
• Small-scale transcriptomics-by-sequencing experiments (for example, two strains under four growth conditions with two biological replicates, so 16 strains)
– High per-isolate coverage
• Phylogenetic profiling to genus-level using partial 16S rRNA gene amplicon sequencing
– High coverage
– Long amplicon input (≥500 bp)
– Long reads
– High single-read accuracy (error rate <1%)
• Whole-genome metagenomics for the reconstruction of multiple genomes in a single sample
– Long reads or paired-end reads
– Very high throughput
– Low error rate
*Each machine is rated per application as particularly well suited, suitable, or X (not suitable); the per-machine marks are not legible in the slide text.
‡From Roche. §From Life Technologies. ||From Illumina. ¶From Pacific Biosciences.
interest in alignment-free approaches for
constructing bacterial phylogenies, as it
is thought that these approaches may help

DNA target sample
Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt

Shotgun DNA Sequencing (1995-2005)
DNA target sample
→ SHEAR
→ SIZE SELECT (e.g., 10 kbp ± 8% std. dev.)
→ LIGATE & CLONE into vector
→ SEQUENCE from primer: end reads (mates), 550 bp

Short read genome sequencing (2005-current)

Genomic DNA
Random fragmentation: 270 bp fragments; 4-8 kb fragments
molecular biology
Sequencing (Illumina)
Paired-end short insert reads (10’s of millions)
Paired-end long insert reads (10’s of millions)

How do we assemble this data back into a genome?

Step 3: Assemble

Assembly outline
Reads → Contigs → Scaffolds
Assembly algorithms, e.g. Allpaths, Velvet, Meraculous

De Bruijn Graph Assembly

De Bruijn example
“It was the best of times, it was the worst of
times, it was the age of wisdom, it was the
age of foolishness, it was the epoch of belief,
it was the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Example courtesy of J. Leipzig 2010

De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…

De Bruijn example
Generate random ‘reads’
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho
hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe
astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea
eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast
astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo
heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw
fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel
theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli
fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft
itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast
wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe
…etc. to 10’s of millions of reads

De Bruijn example
How do we assemble?
Traditional all-vs-all assemblers fail due to immense
computational resources (scales with the square of the number of reads)
A million (10^6) reads requires a trillion (10^12) pairwise alignments
De Bruijn solution:
Represent the data as a graph (scales with genome size)
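
The quadratic scaling quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are made up for this example, and the slide's "trillion" corresponds to comparing every read with every read (n × n); counting each unordered pair once gives roughly half that.

```python
def ordered_comparisons(n_reads: int) -> int:
    """Every read compared against every read: n * n."""
    return n_reads * n_reads


def unordered_pairs(n_reads: int) -> int:
    """Each distinct pair of reads counted once: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


for n in (1_000, 1_000_000):
    print(n, ordered_comparisons(n), unordered_pairs(n))
```

Either way the cost grows with the square of the read count, which is the point the slide makes.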

De Bruijn example
Step 1:
Convert reads into “Kmers”
Kmer: a substring of defined length

De Bruijn example
Step 1:
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…..etc for all reads in the dataset
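
Step 1 can be sketched in a couple of lines. The helper name `kmers` is an assumption made for this illustration, not something from the slides; it simply slides a window of length k along the read.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("theageofwi", 3))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
```

A read of length 10 yields 10 - 3 + 1 = 8 kmers for k=3, matching the slide.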

De Bruijn example
Step 2:
Build a De-Bruijn graph from the kmers

De Bruijn example
Step 2:
Link kmers that follow one another in a read; kmers shared between reads (e.g. sth, the, sto, tof, ast) become single nodes:
the → hea → eag → age → geo → eof → ofw → fwi
ast → sth → the → hea → eag → age → geo → eof
sth → the → heb → ebe → bes → est → sto → tof
wor → ors → rst → sto → tof → oft → fti → tim
ime → mes → esi → sit → itw → twa → was → ast
…..etc for all ‘kmers’ in the dataset
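
A minimal sketch of Step 2, following the slides' picture in which each kmer is a node and an edge joins kmers that are adjacent in some read. (Many assemblers instead treat kmers as edges between (k-1)-mers; that variant is not shown here.) The function name `build_graph` and the dictionary-of-sets representation are assumptions for illustration.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Map each kmer to the set of kmers that directly follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        for current, following in zip(ks, ks[1:]):
            graph[current].add(following)
    return graph


reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
graph = build_graph(reads, 3)
print(sorted(graph["the"]))  # ['hea', 'heb']
```

The node `the` has two successors because "the age" and "the best" share it; branch points like this are where repeats make the graph ambiguous.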

De Bruijn example
Step 3:
Simplify the graph as much as possible:
A De Bruijn Graph
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,.... “
De Bruijn assemblies ‘broken’ by repeats longer than kmer

No single solution!
Drawback of De Bruijn approach
Break graph to produce final assembly
Step 4: Dump graph into consensus (fasta)

Kmer size is an important parameter in De Bruijn assembly
The final assembly (k=3), broken into short pieces:
wor  times  itwasthe  foolishness  incredulity  age  epoch  be  st  wisdom  of  belief

Repeat with a longer “kmer” length
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Why not always use longest ‘k’ possible?
Sequencing errors: a read with one error, sthebentof (true sequence: sthebestof)
k=3: sth the heb ebe ben ent nto tof (mostly unaffected kmers)
k=10: sthebentof (100% wrong kmer)
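
The trade-off can be quantified with a small sketch that compares the kmers of an error-containing read with those of the true sequence. The function names are hypothetical, and the comparison assumes a substitution error so that both reads have the same length.

```python
def kmers(read, k):
    """All substrings of length k, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_kmer_fraction(true_read, observed_read, k):
    """Fraction of the observed read's kmers that differ from the true ones."""
    true_k = kmers(true_read, k)
    observed_k = kmers(observed_read, k)
    wrong = sum(1 for t, o in zip(true_k, observed_k) if t != o)
    return wrong / len(observed_k)


print(wrong_kmer_fraction("sthebestof", "sthebentof", 3))   # 0.375
print(wrong_kmer_fraction("sthebestof", "sthebentof", 10))  # 1.0
```

With k=3 only the three kmers overlapping the error (ben, ent, nto) are wrong; with k=10 the single kmer is the whole read, so one error spoils it entirely.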

Scaffolding

Scaffolding
Reads → ‘De Bruijn’ assembly → Contigs → Scaffolds (An assembly)
Align reads to De Bruijn contigs
Join contigs using evidence from paired end data
“Captured” gaps caused by repeats are represented by “NNN” in the assembly

Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\dfrac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
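
The Lander-Waterman expectations above translate directly into code. This is a sketch using the slide's symbols; the function name and the example numbers (500 bp reads, a 1.8 Mb genome, 20,000 reads, 50 bp minimum overlap) are hypothetical values chosen only to exercise the formulas.

```python
import math


def lander_waterman(L, T, G, N):
    """Return (coverage c, expected number of islands, expected island size).

    L = read length, T = minimum detectable overlap,
    G = genome size, N = number of reads.
    """
    c = N * L / G
    sigma = 1 - T / L
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size


# Hypothetical project: the numbers are illustrative, not taken from the slides.
c, islands, size = lander_waterman(L=500, T=50, G=1_800_000, N=20_000)
print(f"coverage={c:.2f}x  islands={islands:.1f}  mean island size={size:.0f} bp")
```

Raising N (and hence c) makes the expected number of islands fall exponentially, which is why modest increases in coverage close many gaps.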

Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:213-224

Mis-assembled repeats
[Figure: three classes of repeat mis-assembly: collapsed tandem, excision, rearrangement]

Real life assembly is messy!
Assembly in theory
Uniform coverage, no errors, no contamination
Assembly in reality
Biased coverage (-> gaps)
Sequencing errors (-> fragmented assembly)
Chimeric reads (-> mis-joins)
Contaminant reads (-> incorrect + inflated assembly)
Worse than predicted assemblies!

[Plots: fraction of normalized coverage vs. GC% of 100 base windows, with a theoretical curve; coverage (x) vs. reference position (bp)]

Genome properties can also make assembly difficult
• Biased sequence composition
ACTGTCTAGTCAGCGCGCGCGC
GCGCGCCCGCGCGCGCGGGCG
GCGGCGCGGGCGGGCGCATGTA
GTGATC
RESULT: incomplete / fragmented assembly
• High repeat content
RESULT: misassemblies / collapsed assemblies
• Polyploidy
RESULT: fragmented assembly
• Biased sequence abundance
RESULT: incomplete / fragmented assembly

N50
The N50 size of a set of entities (e.g., contigs or scaffolds)
represents the largest entity E such that at least half of the
total size of the entities is contained in entities at least as
large as E.
For example, given a collection of contigs with sizes 7, 4,
3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb
because we can cover 10 kb with contigs of 4 kb or bigger.
(http://www.cbcb.umd.edu/research/castats.shtml)
N50 length is the length ‘x’ such that 50% of the sequence
is contained in contigs of length x or greater.
(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
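
The N50 definition can be sketched as a short function: sort the lengths from largest to smallest and accumulate until half of the total is reached. The function name `n50` is an assumption for this example.

```python
def n50(lengths):
    """Length x such that contigs of length >= x hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 of an empty collection is undefined")


print(n50([7, 4, 3, 2, 2, 1, 1]))  # 4
```

For the slide's example, 7 kb alone covers less than 10 kb, while 7 + 4 = 11 kb does, so the N50 is 4 kb.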


Why Completeness is Important
• Improves characterization of genome features
• Gene order, replication origins
• Better comparative genomics
• Genome duplications, inversions
• Presence and absence of particular genes can be very
important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies

Step 4: Closure
• Physical map information
• PCR and gap spanning
• Other sequencing data

General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics

General Steps in Analysis of Complete Genomes
• Structural Annotation
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Functional Annotation
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Evolutionary Annotation

Structural Annotation I: Genes in Genomes
• Protein coding genes.
– In long open reading frames
– ORFs interrupted by introns in eukaryotes
– Take up most of the genome in prokaryotes, but only a
small portion of the eukaryotic genome
• RNA-only genes
– Transfer RNA
– Ribosomal RNA
– snoRNAs (guide ribosomal and transfer RNA
maturation)
– intron splicing
– guiding mRNAs to the membrane for translation
– gene regulation: this is a growing list

Structural Annotation II: Other Features to Find
• Gene control sequences
– Promoters
– Regulatory elements
• Transposable elements, both active and defective
– DNA transposons and retrotransposons
– Many types and sizes
• Other repeated sequences
– Centromeres and telomeres
– Many with unknown (or no) function
• Unique sequences that have no obvious function

Bacterial / Archaeal Protein Coding Genes
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and
a few others are occasionally used.
– Remember that start codons are also used internally: the actual start codon may not be the first
one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
– stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use
of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap is common
enough so that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that non-coding
sequence will be conserved.
– But, a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often
found just upstream from the start codon
– however, some aren’t recognizable
– genes in operons don’t always have a separate ribosome binding site for each gene
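
The start and stop codon rules above suggest a naive ORF scan, sketched here under strong simplifying assumptions: forward strand only, the first start codon in a frame is taken as the start (which, as noted above, may not be the true one), and overlaps, ribosome binding sites and selenocysteine are ignored. The function name and the `min_codons` parameter are hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs, end exclusive, for forward-strand ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i  # first candidate start in this frame
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC"))  # [(2, 14)]
```

Real gene finders add the composition and homology evidence discussed on the following slides rather than relying on codon positions alone.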

Composition Methods
• The frequency of various codons is different in coding regions as
compared to non-coding regions.
– This extends to G-C content, dinucleotide frequencies, and other
measures of composition. Dicodons (groups of 6 bases) are often
used
– Well documented experimentally.
• The composition varies between different proteins of course, and
it is affected within a species by the amounts of the various
tRNAs present
– horizontally transferred genes can also confuse things: they tend to
have compositions that reflect their original species.
– A second group with unusual compositions are highly expressed
genes.
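
Two of the composition measures mentioned above, G-C content and codon counts, can be sketched as follows. The function names are assumptions for illustration, and the codon counter simply reads the sequence in frame 0 without checking that it is a real coding region.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_counts(seq):
    """Count codons read in frame 0; a trailing partial codon is ignored."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


print(gc_content("GGCCAATT"))
print(codon_counts("ATGAAAAAATAG").most_common(1))
```

Comparing such statistics between a candidate ORF and known genes from the same genome is the basic idea behind composition-based gene finding; dicodon models extend the same counting to groups of 6 bases.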

Eukaryotic Genes Harder to Find
• Some fundamental differences between
prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
– First step: find repeated sequences and RNA
genes
– Note that eukaryotes have 3 main RNA
polymerases. RNA polymerase 2 (pol2)
transcribes all protein-coding genes, while pol1
and pol3 transcribe various RNA-only genes.
• most eukaryotic genes are split into exons and
introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation starts at
the first ATG in the mRNA
– thus, in eukaryotic genomes, searching for the
transcription start site (TSS) makes sense.
• Many fewer eukaryotic genomes have been
sequenced

Exons
• Exon sequences can often be identified by sequence conservation,
at least roughly.
• Dicodon statistics, as used for prokaryotes, are also useful
– eukaryotic genomes tend to contain many isochores, regions of
different GC content, and composition statistics can vary between
isochores.
• The initial and terminal exons contain untranslated regions, and
thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information about
the sequences surrounding each possible GT/AG pair, then running
this information through some combination of decision trees, Markov
models, discriminant analysis, or neural networks, in an attempt to
massage the data into giving a reliable score.
– In general, sites are more likely to be correct if predicted by multiple
methods
– Experimental data from ESTs can be very helpful here.

How to Find ncRNAs
• The most universal genes, such as tRNA and rRNA, are very conserved and thus
easy to detect. Finding them first removes some areas of the genome from further
consideration.
• One easy approach to finding common RNA genes is just looking for sequence
homology with related species: a BLAST search will find most of them quite easily
• Functional RNAs are characterized by secondary structure caused by base pairing
within the molecule.
• Determining the folding pattern is a matter of testing many possibilities to find the
one with the minimum free energy, which is the most stable structure.
• The free energy calculations are in turn based on experiments where short synthetic
RNA molecules are melted
• Related to this is the concept that paired regions (stems) will be conserved across
species lines even if the individual bases aren’t conserved. That is, if there is an A-U
pairing in one species, the same position might be occupied by a G-C in another
species.
• This is an example of compensatory evolution: a deleterious mutation at one site is
cancelled by a compensating mutation at another site.

RNA Structure
• RNA differs from DNA in having fairly
common G-U base pairs. Also, many
functional RNAs have unusual modified
bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop
and a sequence outside its stem, is
especially difficult to detect:
computationally intense and not subject to
the normal situation that RNA base pairing
follows a nested pattern
– But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start
with all possible short sequences, then
build to larger ones, adding the
contribution of each structural element.
– There is an element of dynamic
programming here as well.
– And, “stochastic context-free grammars”,
something I really don’t want to approach
right now!
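
The "start with short sequences, then build to larger ones" idea can be illustrated with a Nussinov-style dynamic program. Note the simplification: this sketch maximizes the number of nested base pairs rather than minimizing free energy as real folding programs do, and it cannot represent pseudoknots. The function name and the `min_loop` default are assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair mentioned above.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs, with hairpin loops >= min_loop."""
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Fill the table from short subsequences up to the whole molecule.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j is unpaired
            for k in range(i, j - min_loop):  # case: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_pairs("GGGAAAUCC"))  # 3
```

In the example the three pairs are G-C, G-C and G-U around an AAA loop; energy-based programs replace the "+1 per pair" score with experimentally derived stacking and loop energies.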

Finding tRNAs
• tRNAs have a highly conserved
structure, with 3 main stem-and-
loop structures that form a
cloverleaf structure, and several
conserved bases. Finding such
sequences is a matter of looking in
the DNA for the proper features
located the proper distance apart.
• Looking for such sequences is
well-suited to a decision tree, a
series of steps that the sequence
must pass.
• In addition, a score is kept, rating
how well the sequence passed
each step. This allows a more
stringent analysis later on, to
eliminate false positives.

Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis
gene against a comprehensive protein database
Gene ID      Similarity (%)  Length (bp)  Gene name (organism)                                     E-value*
GP:2905647   44.8            1,191        D-Arabinitol kinase (Klebsiella pneumoniae)              6.2e-15
EGAD:22614   46.2            1,191        Gluconokinase (Bacillus subtilis)                        1.4e-13
EGAD:20418   43.0            1,302        Xylulose kinase (Lactobacillus pentosus)                 4.8e-13
EGAD:105114  43.4            1,320        Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)  4.7e-12
GP:2895855   42.7            1,263        Xylulokinase (Lactobacillus brevis)                      1.0e-07
EGAD:10899   45.4            1,296        Xylulose kinase (Escherichia coli)                       2.1e-06
*E-value is a statistical measure of the significance of a BLAST search result.

insight progress
Figure 2 Diagram depicting how complete microbial genome sequence data can accelerate vaccine development.
• N. meningitidis genome: all potential antigens
• Selection of vaccine targets: a total of 570 putative secreted proteins or surface proteins
• Protein expression: a total of ~350 recombinant proteins expressed in E. coli and used to immunize mice
• Immune sera screening: bactericidal activity; binding to surface of MenB cells
• Seven proteins selected for follow-up based on high titres
• Final candidate selection: two proteins were found to exhibit no sequence variability → clinical trials
Timescale labels in the figure: hours; few months; 3–12 months

Functional Annotation

Functional Classification I: GO
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to
describe gene products with a structured controlled vocabulary, a set of invariant
terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For
example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function. A
gene product will probably be described by GO terms from each of these “ontologies”.
(ontology is a branch of philosophy concerned with the nature of being, and the basic
categories of being and their relationships.)
– For instance, cytochrome c is described with the molecular function term “oxidoreductase
activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”,
and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree.
This means simply that each term can have more than one parent term, but the
direction of parent to child (i.e. less specific to more specific) is always maintained.
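
The two structural points above, the GO:nnnnnnn identifier format and the directed acyclic graph in which a term can have several parents, can be sketched as follows. The toy hierarchy uses made-up placeholder term names, not real GO terms or relationships, and the function names are assumptions.

```python
import re

GO_ID = re.compile(r"GO:\d{7}")


def is_go_id(text):
    """True if text has the form GO: followed by exactly 7 digits."""
    return GO_ID.fullmatch(text) is not None


# Hypothetical toy DAG: each term maps to its set of parent terms.
PARENTS = {
    "specific": {"mid_a", "mid_b"},  # two parents: a DAG, not a tree
    "mid_a": {"root"},
    "mid_b": {"root"},
    "root": set(),
}


def ancestors(term, parents):
    """All less specific terms reachable by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(is_go_id("GO:0005102"), sorted(ancestors("specific", PARENTS)))
```

Because annotation with a specific term implies all of its ancestors, tools that compare annotations usually expand terms this way first.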

Functional Classification II: Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
– Across many species, the enzymes that perform a specific function are usually
evolutionarily related. However, this isn’t necessarily true. There are cases of two
entirely different enzymes evolving similar functions.
– Often, two or more gene products in a genome will have the same E.C. number.
• Enzyme functions are given unique numbers by the Enzyme Commission.
– E.C. numbers are four integers separated by dots. The left-most number is the
least specific
– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose
components indicate the following groups of enzymes:
• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a
polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• Top level E.C. numbers:
– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between
molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
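
The left-to-right specificity of E.C. numbers can be shown by splitting a code into its nested levels. This sketch uses only the six top-level classes listed above; the function names are assumptions for illustration.

```python
TOP_LEVEL = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_levels(ec):
    """Split e.g. 'EC 3.4.11.4' into its nested levels, least specific first."""
    parts = ec.replace("EC", "").strip().split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ec_class(ec):
    """Name of the top-level class, taken from the left-most number."""
    return TOP_LEVEL[int(ec_levels(ec)[0])]


print(ec_levels("EC 3.4.11.4"))  # ['3', '3.4', '3.4.11', '3.4.11.4']
print(ec_class("EC 3.4.11.4"))   # hydrolases
```

Two gene products with the same full E.C. number share every level of this list, whether or not they are evolutionarily related.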

Functional Prediction
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam,
FIGfam).
• Sequence motifs and domains. If the gene is not a good match to
previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions
are often found together in operons and divergons (genes
transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has
most of the genes needed to perform a function, any missing
functions are probably present too.
– Also, experimental data about an organism’s capacities can be used to
decide whether the relevant functions are present in the genome.

Functional Prediction II: Membrane Spanning
• Integral membrane proteins contain amino acid
sequences that go through the membrane one or
several times.
– There are also peripheral membrane proteins that stick
to the hydrophilic head groups by ionic and polar
interactions
– There are also some that have covalently bound
hydrophobic groups, such as myristate, a 14-carbon
saturated fatty acid that is attached to the N-terminal
amino group.
• There are 2 main protein structures that cross
membranes.
– Most are alpha helices, and in proteins that span
multiple times, these alpha helices are packed together
in a bundle. Length of each helix = 15-30 amino acids.
– Less commonly, there are proteins with membrane
spanning “beta barrels”, composed of beta sheets
wrapped into a cylinder. An example: porins, which
let water and other small hydrophilic molecules diffuse
across the membrane.
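
One common way to look for candidate membrane-spanning alpha helices is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle scale, which the slide does not name; the window of 19 residues and the cutoff of 1.6 are conventional choices for that scale and are assumptions here, as are the function names. It will not find beta barrels.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not named on the slide).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of `window` residues along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]  # raises KeyError on non-standard residues
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions (0-based) of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score >= threshold]

if __name__ == "__main__":
    # Synthetic sequence: charged flanks around a 20-residue leucine stretch.
    toy = "D" * 20 + "L" * 20 + "K" * 20
    print(candidate_tm_windows(toy))
```

Real predictors add more information (for example the distribution of charged residues on each side of the membrane), so hits from a profile like this are only a first clue.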

Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and
computational analyses
• Many diverse approaches
• All are improved by “phylogenomic” type analyses that
integrate evolutionary reconstructions with an understanding
of how new functions evolve

Functional Prediction
• Identification of motifs
– Short regions of sequence similarity that are indicative
of general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of
other sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs
and multiple genes, not all with the same function

Helicobacter pylori

H. pylori genome - 1997
“The ability of H. pylori to
perform mismatch repair is
suggested by the presence of
methyl transferases, mutS
and uvrD. However,
orthologues of MutH and
MutL were not identified.”

MutL ??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html

Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins from bacterial, archaeal and eukaryotic species.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

MutS Subfamilies
[Figure: MutS family tree with the subfamilies labeled: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

Overlaying Functions onto Tree
[Figure: MutS family tree with known functions overlaid on the subfamilies MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6 and MSH2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

MutS Subfamilies
• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus
• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over

Functional Prediction Using Tree
[Figure: MutS family tree with a function assigned to each subfamily.]
• MSH1 - Mitochondrial Repair
• MSH3 - Nuclear Repair of Loops
• MSH6 - Nuclear Repair of Mismatches
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
• MutS1 - Bacterial Mismatch and Loop Repair
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over
• MutS2 - Unknown Functions
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

Table 3. Presence of MutS Homologs in Complete Genome Sequences

Species                                      # of MutS  Which          MutL
                                             Homologs   Subfamilies?   Homologs
Bacteria
Escherichia coli K12                         1          MutS1          1
Haemophilus influenzae Rd KW20               1          MutS1          1
Neisseria gonorrhoeae                        1          MutS1          1
Helicobacter pylori 26695                    1          MutS2          -
Mycoplasma genitalium G-37                   -          -              -
Mycoplasma pneumoniae M129                   -          -              -
Bacillus subtilis 168                        2          MutS1,MutS2    1
Streptococcus pyogenes                       2          MutS1,MutS2    1
Mycobacterium tuberculosis                   -          -              -
Synechocystis sp. PCC6803                    2          MutS1,MutS2    1
Treponema pallidum Nichols                   1          MutS1          1
Borrelia burgdorferi B31                     2          MutS1,MutS2    1
Aquifex aeolicus                             2          MutS1,MutS2    1
Deinococcus radiodurans R1                   2          MutS1,MutS2    1
Archaea
Archaeoglobus fulgidus VC-16, DSM4304        -          -              -
Methanococcus jannaschii DSM 2661            -          -              -
Methanobacterium thermoautotrophicum ΔH      1          MutS2          -
Eukaryotes
Saccharomyces cerevisiae                     6          MSH1-6         3+
Homo sapiens                                 5          MSH2-6         3+

Blast Search of H. pylori “MutS”
Sequences producing significant alignments:        Score (bits)   E Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN   117            3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN   69             1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN   64             3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN   62             2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN   57             4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN   57             4e-07
• Blast search pulls up Synechocystis sp. MutS#2 with a much higher
score (and much lower E value) than other MutS homologs
• Based on this, TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
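
The pitfall on this slide can be made concrete with a short Python sketch that parses the hit list above and transfers the annotation of the best hit. The parser and its field layout are assumptions made for illustration (real BLAST output has several formats); only the six hit lines come from the slide.

```python
# Hit lines copied from the slide: accession, description, bit score, E value.
HITS = """\
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN 117 3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN 69 1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN 64 3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN 62 2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN 57 4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN 57 4e-07
"""

def parse_hits(text):
    """Parse lines of the form '<accession> <description...> <bits> <evalue>'."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        hits.append({
            "accession": fields[0],
            "description": " ".join(fields[1:-2]),
            "bits": float(fields[-2]),
            "evalue": float(fields[-1]),
        })
    return hits

def best_hit(hits):
    """Return the hit with the lowest E value."""
    return min(hits, key=lambda hit: hit["evalue"])

if __name__ == "__main__":
    hits = parse_hits(HITS)
    top = best_hit(hits)
    # Naive annotation transfer: copy the description of the best hit.
    print(top["accession"], top["description"], top["evalue"])
    # Every hit carries the same description, so the text alone cannot
    # distinguish a MutS1-like protein from a MutS2-like one.
    print({hit["description"] for hit in hits})
```

Because all six database entries carry the same description, a top-hit rule returns "DNA mismatch repair protein" regardless of which MutS subfamily the best hit belongs to.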

High Mutation Rate in H. pylori
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.

PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: outline of a phylogenomics method, shown for two examples (EXAMPLE A and EXAMPLE B) alongside the actual evolution of the genes (assumed to be unknown), with inferred duplication events marked. Some predictions are ambiguous.]
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
Based on Eisen, 1998 Genome Res 8: 163-167.
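
The last two steps of the method (overlay known functions onto the tree, then infer the likely function of the gene of interest) can be sketched in Python. This is a simplified illustration and not the published procedure: the tree is a hand-written nested tuple, the topology and labels are a toy loosely modeled on the MutS1/MutS2 split, and the rule used (take the smallest clade around the query whose annotated members all agree) is an assumption.

```python
def leaves(node):
    """All leaf names under a node; a leaf is a string, a clade is a tuple."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names

def path_to(node, target):
    """Clades from the root down to the leaf `target`, or None if absent."""
    if isinstance(node, str):
        return [node] if node == target else None
    for child in node:
        sub = path_to(child, target)
        if sub is not None:
            return [node] + sub
    return None

def predict_function(tree, known, query):
    """Infer a function for `query` from the nearest annotated relatives."""
    path = path_to(tree, query)
    if path is None:
        raise ValueError(f"{query} is not in the tree")
    # Walk outward from the smallest clade that contains the query.
    for clade in reversed(path[:-1]):
        functions = {known[name] for name in leaves(clade)
                     if name in known and name != query}
        if len(functions) == 1:
            return functions.pop()
        if len(functions) > 1:
            return "ambiguous"
    return "unknown"

# Toy gene tree (hypothetical topology and leaf names).
TREE = ((("Ecoli_MutS", "Bacsu_MutS1"), "Synsp_MutS1"),
        (("Synsp_MutS2", "Helpy_MutS"), "Bacsu_MutS2"))
KNOWN = {
    "Ecoli_MutS": "mismatch repair",
    "Bacsu_MutS1": "mismatch repair",
    "Synsp_MutS1": "mismatch repair",
    "Synsp_MutS2": "MutS2 (function unknown)",
    "Bacsu_MutS2": "MutS2 (function unknown)",
}

if __name__ == "__main__":
    print(predict_function(TREE, KNOWN, "Helpy_MutS"))
```

In this toy tree the H. pylori gene groups with the MutS2 sequences, so the prediction is "function unknown" rather than mismatch repair, even though a MutS homolog is present.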


Chemosynthetic Symbionts
Eisen et al. 1992. J. Bact. 174: 3416

Carboxydothermus hydrogenoformans
• Isolated from a Russian hot spring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS
Genetics 1: e65. )

Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.

Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.

Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Step 2: For each gene, ask yes or no:
is it found in each other species?
• Step 3: Cluster genes by distribution
patterns (profiles)
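
A minimal Python sketch of the profiling and clustering steps follows. It assumes the all-against-all searches have already been run and summarized as a set of genomes in which each gene was found; the gene and genome names are hypothetical, and the greedy clustering with a Hamming-distance cutoff is just one simple choice.

```python
def build_profiles(presence, genomes):
    """Turn {gene: set of genomes with a hit} into 0/1 profile tuples."""
    return {
        gene: tuple(1 if genome in found else 0 for genome in genomes)
        for gene, found in presence.items()
    }

def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def cluster_by_profile(profiles, max_distance=0):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for gene in sorted(profiles):
        for cluster in clusters:
            if hamming(profiles[gene], profiles[cluster[0]]) <= max_distance:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

# Hypothetical search results: which genomes contain a homolog of each gene.
GENOMES = ["sporeformer_1", "sporeformer_2", "nonsporeformer_1", "nonsporeformer_2"]
PRESENCE = {
    "spoX": {"sporeformer_1", "sporeformer_2"},
    "spoY": {"sporeformer_1", "sporeformer_2"},
    "hypothetical_1": {"sporeformer_1", "sporeformer_2"},
    "rpoB": set(GENOMES),
}

if __name__ == "__main__":
    profiles = build_profiles(PRESENCE, GENOMES)
    print(cluster_by_profile(profiles))
```

Here the uncharacterized gene shares its profile with the two sporulation genes, which is the kind of pattern that flags it as a candidate sporulation gene without any sequence similarity to a known one.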

Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.

B. subtilis new sporulation genes

Functional Prediction III: Colocalization
• Operon structure is often
maintained over fairly large
taxonomic regions.
– Sometimes gene order is altered,
and sometimes one or more
enzymes are missing.
– But in general, this phenomenon
allows recognition or verification
that widely diverged enzymes do
in fact have the same function.
• This is an operon that contains
part of the glycolytic pathway.
– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate
dehydrogenase
– 6: central glycolytic gene regulator
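
Conserved gene neighborhoods can be found by counting how often two gene families sit next to each other across genomes. The Python sketch below is an assumption-laden illustration: the three gene orders are hypothetical, the labels are shorthand for the enzymes listed above, chromosomes are treated as linear, and strand is ignored.

```python
from collections import Counter

def adjacent_pairs(order):
    """Unordered pairs of neighboring gene families in one genome."""
    return {frozenset(pair) for pair in zip(order, order[1:]) if pair[0] != pair[1]}

def conserved_neighbors(genomes, min_genomes=2):
    """Pairs of families that are adjacent in at least `min_genomes` genomes."""
    counts = Counter()
    for order in genomes.values():
        counts.update(adjacent_pairs(order))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}

# Hypothetical gene orders; genome_B lacks the regulator, genome_C is rearranged.
GENE_ORDERS = {
    "genome_A": ["regulator", "GAPDH", "PGK", "TPI", "PGM", "enolase"],
    "genome_B": ["GAPDH", "PGK", "TPI", "PGM", "enolase"],
    "genome_C": ["PGK", "GAPDH", "other", "TPI", "PGM"],
}

if __name__ == "__main__":
    for pair, n in conserved_neighbors(GENE_ORDERS, min_genomes=3).items():
        print(sorted(pair), n)
```

Pairs that stay adjacent despite rearrangements and missing genes are the ones that support assigning the same function to widely diverged sequences.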

Metabolic Predictions

Comparative Genomics


Using the Core

800 NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com
between even related species.
Our molecular picture of evolution for the past 20 years has been
dominated by the small-subunit ribosomal RNA phylogenetic tree
analysed. Analyses of complete genome sequences have led to many
recent suggestions that the extent of horizontal gene exchange is
much greater than was previously realized10–12. For example, an
Table 2 Genome features from 24 microbial genome sequencing projects

Organism                               Genome      No. of ORFs    Unknown        Unique
                                       size (Mbp)  (% coding)     function       ORFs
Aeropyrum pernix K1                    1.67        1,885 (89%)
A. aeolicus VF5                        1.50        1,749 (93%)    663 (44%)      407 (27%)
A. fulgidus                            2.18        2,437 (92%)    1,315 (54%)    641 (26%)
B. subtilis                            4.20        4,779 (87%)    1,722 (42%)    1,053 (26%)
B. burgdorferi                         1.44        1,738 (88%)    1,132 (65%)    682 (39%)
Chlamydia pneumoniae AR39              1.23        1,134 (90%)    543 (48%)      262 (23%)
Chlamydia trachomatis MoPn             1.07        936 (91%)      353 (38%)      77 (8%)
C. trachomatis serovar D               1.04        928 (92%)      290 (32%)      255 (29%)
Deinococcus radiodurans                3.28        3,187 (91%)    1,715 (54%)    1,001 (31%)
E. coli K-12-MG1655                    4.60        5,295 (88%)    1,632 (38%)    1,114 (26%)
H. influenzae                          1.83        1,738 (88%)    592 (35%)      237 (14%)
H. pylori 26695                        1.66        1,589 (91%)    744 (45%)      539 (33%)
Methanobacterium thermoautotrophicum   1.75        2,008 (90%)    1,010 (54%)    496 (27%)
Methanococcus jannaschii               1.66        1,783 (87%)    1,076 (62%)    525 (30%)
M. tuberculosis CSU#93                 4.41        4,275 (92%)    1,521 (39%)    606 (15%)
M. genitalium                          0.58        483 (91%)      173 (37%)      7 (2%)
M. pneumoniae                          0.81        680 (89%)      248 (37%)      67 (10%)
N. meningitidis MC58                   2.24        2,155 (83%)    856 (40%)      517 (24%)
Pyrococcus horikoshii OT3              1.74        1,994 (91%)    859 (42%)      453 (22%)
Rickettsia prowazekii Madrid E         1.11        878 (75%)      311 (37%)      209 (25%)
Synechocystis sp.                      3.57        4,003 (87%)    2,384 (75%)    1,426 (45%)
T. maritima MSB8                       1.86        1,879 (95%)    863 (46%)      373 (26%)
T. pallidum                            1.14        1,039 (93%)    461 (44%)      280 (27%)
Vibrio cholerae El Tor N1696           4.03        3,890 (88%)    1,806 (46%)    934 (24%)
Total                                  50.60       52,462 (89%)   22,358 (43%)   12,161 (23%)
© 2000 Macmillan Magazines Ltd

After the Genomes
• Better analysis and annotation
• Functional genomics (Experimental analysis of gene
function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome-wide genetic experiments
