EVE 161: Microbial Phylogenomics
Lecture 10
UC Davis, Winter 2016
Instructors: Jonathan Eisen & Holly Ganz
Answer 2 of these. Please make your answers short.
• 1) List 4-5 Steps in a “Whole Genome Shotgun
Sequencing” Project
• 2) What is meant by the “Add-on Costs of Sequencing”?
• 3) Explain one form of evidence used to infer lateral gene
transfer and why that evidence sometimes can be
misleading
• 4) Give examples of 3 different ways to fragment genomic
DNA
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
1st Genome Sequence
Fleischmann et al. 1995
Complete Genome/Chromosome Progress
Fraser et al. 2000
Microbes were the first organisms on Earth and preceded animals and plants by more than 3 billion years. They are the foundation of the biosphere, from both an evolutionary and an environmental perspective [1]. It has been estimated that microbial species comprise about 60% of the Earth’s biomass. The genetic, metabolic and physiological diversity of microbial species is far greater than that found in plants and animals. But the diversity of the microbial world is largely unknown, with less than one-half of 1% of the estimated 2–3 billion microbial species identified. Of those species that have been described, their biological diversity is extraordinary, having adapted to grow under extremes of temperature, …
… advances in DNA-sequencing technology, the sequencing of whole genomes had not progressed beyond lambda-sized clones (about 40 kbp) because of the lack of sufficient computational approaches that would enable the efficient assembly of a large number of independent random sequences into a single contig.
For the H. influenzae and subsequent projects, we have used a computational method that was developed to create assemblies from hundreds of thousands of complementary DNA sequences 300–500-bp long [4]. This approach has proved to be a cost-effective and efficient approach to sequencing megabase-sized segments of genomic DNA. This strategy does not require an ordered set of cosmids or other subclones, thus significantly reducing the overall cost
Microbial genome sequencing
Claire M. Fraser, Jonathan A. Eisen & Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
Complete genome sequences of 30 microbial species have been determined during the past five years, and
work in progress indicates that the complete sequences of more than 100 further microbial species will be
available in the next two to four years. These results have revealed a tremendous amount of information on
the physiology and evolution of microbial species, and should provide novel approaches to the diagnosis and
treatment of infectious disease.
Shotgun Sequencing
Fraser et al. 2000
… analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria [13,14]. This led to the suggestion of possible extensive gene exchanges between these species and archaeal …
… be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes [21,22] and another finding that some types of gene might be more prone to gene transfer than others [23]. It therefore seems likely that horizontal gene …
[Figure 1: Diagram depicting the steps in a whole-genome shotgun sequencing project.
1. Library construction: (i) Isolate DNA, (ii) Fragment DNA, (iii) Clone DNA
2. Random sequencing phase: (i) Sequence DNA (15,000 sequences per Mb)
3. Closure phase: (i) Assemble sequences, (ii) Close gaps, (iii) Edit, (iv) Annotation
4. Complete genome sequence]
From http://genomesonline.org
Loman et al. 2012
In bacteriology, the genomic era began in 1995, when the first bacterial genome was sequenced using conventional Sanger sequencing [1]. Back then, sequencing projects required six-figure budgets and …
… be used to analyse these data and thus move from draft to complete genomes.
Several high-throughput sequencing platforms are now chasing the US$1,000 human genome [3]. Given that the average …
High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity
Nicholas J. Loman, Chrystala Constantinidou, Jacqueline Z. M. Chan, Mihail Halachev, Martin Sergeant, Charles W. Penn, Esther R. Robinson and Mark J. Pallen
Abstract | Here, we take a snapshot of the high-throughput sequencing platforms, together with the relevant analytical tools, that are available to microbiologists in 2012, and evaluate the strengths and weaknesses of these platforms in obtaining bacterial genome sequences. We also scan the horizon of future possibilities, speculating on how the availability of sequencing that is ‘too cheap to metre’ might change the face of microbiology forever.
Loman et al. Shotgun Sequencing 2014
Table 1 | Comparison of next-generation sequencing platforms
Machine (manufacturer) | Chemistry | Modal read length* (bases) | Run time | Gb per run | Current, approximate cost (US$)‡ | Advantages | Disadvantages
High-end instruments
454 GS FLX+ (Roche) | Pyrosequencing | 700–800 | … hours | 0.7 | 500,000 | Long read lengths | Appreciable hands-on time; high reagent costs; high error rate in homopolymers
HiSeq 2000/2500 (Illumina) | Reversible terminator | 2×100 | 11 days (regular mode) or … days (rapid run mode)§ | 600 (regular mode) or 120 (rapid run mode)§ | 750,000 | Cost-effectiveness; steadily improving read lengths; massive throughput; minimal hands-on time | Long run time; short read lengths; HiSeq 2500 instrument upgrade not available at time of writing (available end 2012)
5500xl SOLiD (Life Technologies) | Ligation | 75 + 35 | … days | 150 | 350,000 | Low error rate; massive throughput | Very short read lengths; long run times
PacBio RS (Pacific Biosciences) | Real-time sequencing | 3,000 (maximum 15,000) | … minutes | 3 per day | 750,000 | Simple sample preparation; low reagent costs; very long read lengths | High error rate; expensive system; difficult installation
Bench-top instruments
454 GS Junior (Roche) | Pyrosequencing | 500 | … hours | 0.035 | 100,000 | Long read lengths | Appreciable hands-on time; high reagent costs; high error rate in homopolymers
Ion Personal Genome Machine (Life Technologies) | Proton detection | 100 or 200 | … hours | 0.01–0.1 (314 chip), 0.1–0.5 (316 chip) or up to 1 (318 chip) | 80,000 (including OneTouch and server) | Short run times; appropriate throughput for microbial applications | Appreciable hands-on time; high error rate in homopolymers
Ion Proton (Life Technologies) | Proton detection | Up to 200 | 2 hours | Up to 10 (Proton I chip) or up to 100 (Proton II chip) | 145,000 + 75,000 for compulsory server | Short run times; flexible chip reagents | Instrument not available at time of writing
MiSeq (Illumina) | Reversible terminator | 2×150 | … hours | 1.5 | 125,000 | Cost-effectiveness; short run times; appropriate throughput for microbial applications; minimal hands-on time | Read lengths too short for efficient assembly
*Average read length for a fragment-based run. ‡Approximate cost per machine plus additional instrumentation and service contract. See REF. 58. §Available only on the HiSeq 2500.
De novo assemblies can be compared using Mauve [25] or Mugsy [26], and the assemblies can be manually examined using the Tablet [27] …
… intensive. Some workflows combine a series of programs and provide an accessible interface for microbiologists who are not
Table 2 | The applicability of the major high-throughput sequencing platforms
Machines* compared: 454 GS Junior‡, 454 GS FLX+‡, Ion Personal Genome Machine (318 chip)§, MiSeq||, HiSeq 2000||, 5500xl SOLiD§, PacBio RS¶
[Per-machine suitability ratings not recoverable]
Example application in bacteriology | Desirable characteristics
De novo sequencing of novel strains to generate a single-scaffold reference genome | Long reads; paired-end protocol and/or long mate-pair protocol; even coverage of genome
Rapid characterization of a novel pathogen (draft de novo assembly of a genome for a single strain) | Total run time (library preparation plus sequencing) of under … hours; sufficient coverage of a bacterial genome in a single run
Rough-draft de novo sequencing of small numbers of strains (<20) for comparative analysis of gene content | Long or paired-end reads; high throughput; ease of library and sequencing workflow; cost-effective
Re-sequencing of many similar strains (>50) for the discovery of single nucleotide polymorphisms and for phylogenetics | Very high throughput; low-cost, high-throughput sequence library construction; high accuracy
Small-scale transcriptomics-by-sequencing experiments (for example, two strains under four growth conditions with two biological replicates, so 16 strains) | High per-isolate coverage
Phylogenetic profiling to genus-level using partial 16S rRNA gene amplicon sequencing | High coverage; long amplicon input (≥500 bp); long reads; high single-read accuracy (error rate <1%)
Whole-genome metagenomics for the reconstruction of multiple genomes in a single sample | Long reads or paired-end reads; very high throughput; low error rate
*Rated as particularly well suited, suitable, or not suitable (X). ‡From Roche. §From Life Technologies. ||From Illumina. ¶From Pacific Biosciences.
… interest in alignment-free approaches for constructing bacterial phylogenies, as it is thought that these approaches may help …
Step 1: Get DNA
Step 2: Shotgun Sequence
DNA target sample
Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
Shotgun DNA Sequencing (1995-2005)
DNA target sample
SHEAR
SIZE SELECT (e.g., 10Kbp ± 8% std.dev.)
LIGATE & CLONE (into Vector)
SEQUENCE (Primer; End Reads (Mates), 550bp)
Short read genome sequencing (2005-current)
Genomic DNA
Random fragmentation
270 bp fragments -> Paired-end short insert reads (10’s of millions)
4-8 kb fragments -> Paired-end long insert reads (10’s of millions)
Molecular biology, then sequencing (Illumina)
How do we assemble this data back into a genome?
Step 3: Assemble
Assembly outline
Reads -> Contigs -> Scaffolds
Assembly algorithms, e.g. Allpaths, Velvet, Meraculous
De Bruijn Graph Assembly
De Bruijn example
“It was the best of times, it was the worst of
times, it was the age of wisdom, it was the
age of foolishness, it was the epoch of belief,
it was the epoch of incredulity,.... “
Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall
Example courtesy of J. Leipzig 2010
De Bruijn example
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness…
Generate random ‘reads’
fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho
hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe
astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea
eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast
astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo
heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw
fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel
theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli
fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft
itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast
wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe
…etc. to 10’s of millions of reads
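The "Generate random reads" step above can be imitated in a few lines. This is only a sketch under stated assumptions: the function name `shotgun_reads`, the fixed read length of 10, the read count of 200 and the random seed are illustrative choices, not part of the original slides, and real reads would also carry errors and come from both strands.

```python
import random


def shotgun_reads(genome: str, read_len: int, n_reads: int, seed: int = 0) -> list[str]:
    """Sample n_reads substrings of length read_len at uniformly random start positions."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


# The "genome" is the run-together Dickens text used on the slide.
text = ("itwasthebestoftimesitwastheworstoftimes"
        "itwastheageofwisdomitwastheageoffoolishness")
reads = shotgun_reads(text, read_len=10, n_reads=200)
print(reads[:5])
```

Each read is a 10-letter window of the text, so many reads overlap one another; that redundancy is what assembly exploits.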
How do we assemble?
Traditional all-vs-all assemblers fail due to immense
computational resources (scales with the square of the number of reads)
A million (10^6) reads requires a trillion (10^12) pairwise alignments
De Bruijn solution:
Represent the data as a graph (scales with genome size)
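The scaling argument can be made concrete with a little arithmetic. The sketch below assumes a hypothetical 5 Mb genome and k = 31 (neither value comes from the slides) and counts only error-free k-mers; sequencing errors would add extra k-mers in practice.

```python
def all_vs_all_comparisons(n_reads: int) -> int:
    """Read pairs an all-vs-all overlapper considers (n squared, as on the slide)."""
    return n_reads ** 2


def max_distinct_kmers(genome_size: int, k: int) -> int:
    """Upper bound on distinct error-free k-mers (graph nodes) in a linear genome."""
    return max(genome_size - k + 1, 0)


GENOME_SIZE = 5_000_000  # hypothetical bacterial genome size
K = 31                   # hypothetical k-mer length

for n_reads in (10**4, 10**6, 10**7):
    print(n_reads, all_vs_all_comparisons(n_reads), max_distinct_kmers(GENOME_SIZE, K))
```

The pairwise count grows a millionfold between the first and last rows, while the bound on graph size stays fixed because it depends on the genome, not on how many reads cover it.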
De Bruijn example
Step 1:
Convert reads into “Kmers”
Reads and their kmers (k=3):
theageofwi: the hea eag age geo eof ofw fwi
sthebestof: sth the heb ebe bes est sto tof
astheageof: ast sth the hea eag age geo eof
worstoftim: wor ors rst sto tof oft fti tim
imesitwast: ime mes esi sit itw twa was ast
…..etc for all reads in the dataset
Kmer: a substring of defined length
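Step 1 can be sketched as a sliding window over each read. The helper name `kmers` is an illustrative choice, not something named in the slides.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("theageofwi", 3))
# ['the', 'hea', 'eag', 'age', 'geo', 'eof', 'ofw', 'fwi']
```

A read of length L yields L - k + 1 kmers, so the 10-letter reads in the example each give 8 kmers at k=3.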
De Bruijn example
Step 2:
Build a De Bruijn graph from the kmers
the -> hea -> eag -> age -> geo -> eof -> ofw -> fwi
sth -> the -> heb -> ebe -> bes -> est -> sto -> tof
ast -> sth -> the -> hea -> eag -> age -> geo -> eof
wor -> ors -> rst -> sto -> tof -> oft -> fti -> tim
ime -> mes -> esi -> sit -> itw -> twa -> was -> ast
…..etc for all ‘kmers’ in the dataset
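Step 2 can be sketched with a dictionary of sets. Following the slide's drawing, this sketch treats each kmer as a node and links kmers that are adjacent in some read; many assemblers instead use (k-1)-mers as nodes and kmers as edges, so this is one convention among several. The names `kmers` and `build_graph` are illustrative.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each kmer to the set of kmers that directly follow it in any read."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        ks = kmers(read, k)
        for km in ks:
            graph.setdefault(km, set())  # make sure terminal kmers are nodes too
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


reads = ["theageofwi", "sthebestof", "astheageof", "worstoftim", "imesitwast"]
graph = build_graph(reads, k=3)
print(sorted(graph["the"]))  # ['hea', 'heb']
```

Identical kmers from different reads collapse into one node, which is why the graph grows with the genome rather than with the number of reads. The node "the" has two successors because it occurs in both "theage" and "thebest"; such branches are where the assembler has to make choices.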
De Bruijn example
Step 3:
Simplify the graph as much as possible:
A De Bruijn Graph
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,.... “
De Bruijn assemblies ‘broken’ by repeats longer than kmer
No single solution!
Drawback of De Bruijn approach
Break graph to produce final assembly
Step 4: Dump graph into consensus (fasta)
Kmer size is an important parameter in De Bruijn assembly
The final assembly (k=3)
wor times itwasthe foolishness
incredulity age epoch be
st wisdom
of belief
A better assembly (k=20)
itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis…
Repeat with a longer “kmer” length
Why not always use longest ‘k’ possible?
Sequencing errors:
sthebentof
k=10: sthebentof
100% wrong kmer
k=3: sth the heb ebe ben ent nto tof
Mostly unaffected
kmers
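The trade-off can be checked directly by comparing the kmers of the erroneous read "sthebentof" with those of the true read "sthebestof". The function names are illustrative, and the comparison against a known true read is an assumption made for the demonstration; a real assembler has no such reference.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position along the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def fraction_wrong(true_read: str, observed_read: str, k: int) -> float:
    """Fraction of the observed read's kmers that do not occur in the true read."""
    true_set = set(kmers(true_read, k))
    observed = kmers(observed_read, k)
    wrong = [km for km in observed if km not in true_set]
    return len(wrong) / len(observed)


for k in (3, 10):
    print(k, fraction_wrong("sthebestof", "sthebentof", k))
# 3 0.375
# 10 1.0
```

A single error spoils up to k consecutive kmers, so at k=3 only three of eight kmers (ben, ent, nto) are lost, while at k=10 the read's only kmer is wrong.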
Scaffolding
Reads
‘De Bruijn’ assembly -> Contigs
Align reads to De Bruijn contigs
Join contigs using evidence from paired end data -> Scaffolds (An assembly)
“Captured” gaps caused by repeats. Represented by “NNN” in assembly
Lander-Waterman statistics
L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
contig = island with 2 or more reads
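The Lander-Waterman expectations above are easy to evaluate numerically. In this sketch the project parameters (a 2 Mb genome, 500 bp reads, 32,000 reads, 50 bp minimum overlap) are hypothetical values chosen for illustration, and the function name `lander_waterman` is likewise an assumption.

```python
import math


def lander_waterman(G: int, L: int, N: int, T: int) -> tuple[float, float, float]:
    """Return (coverage c, expected number of islands, expected island size in bp)."""
    c = N * L / G          # coverage
    sigma = 1 - T / L      # fraction of a read not needed for overlap detection
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size


# Hypothetical project: 2 Mb genome, 500 bp reads, 32,000 reads, 50 bp minimum overlap.
c, islands, island_size = lander_waterman(G=2_000_000, L=500, N=32_000, T=50)
print(f"coverage = {c:.1f}x, islands = {islands:.1f}, island size = {island_size:.0f} bp")
```

Because the island count falls off exponentially with coverage, modest increases in sequencing depth sharply reduce the number of gaps left for the closure phase, at least under the model's idealized assumption of uniformly random reads.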
Mis-assembly of repetitive sequence
Schatz M C et al. Brief Bioinform 2013;14:213-224
Mis-assembled repeats
[Figure: three types of repeat mis-assembly: collapsed tandem, excision, rearrangement]
Real life assembly is messy!
Assembly in theory
Uniform coverage, no errors, no contamination
Assembly in reality
Biased coverage (-> gaps)
Sequencing errors (-> fragmented assembly)
Chimeric reads (-> mis-joins)
Contaminant reads (-> incorrect + inflated assembly)
Worse than predicted assemblies!
Real life assembly is messy!
[Figure: plots of fraction of normalized coverage against GC% of 100 base windows (with the theoretical expectation), and of coverage (x) against reference position (bp)]
Genome properties can also make assembly difficult
Biased sequence composition
RESULT:
incomplete / fragmented assembly
ACTGTCTAGTCAGCGCGCGCGC
GCGCGCCCGCGCGCGCGGGCG
GCGGCGCGGGCGGGCGCATGTA
GTGATC
High repeat content
RESULT:
misassemblies / collapsed assemblies
Polyploidy
RESULT:
fragmented assembly
Biased sequence abundance
RESULT:
Incomplete / fragmented assembly
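Biased sequence composition, as in the GC-rich example on this slide, is commonly quantified as GC% per fixed-size window, the measure used on the coverage plot earlier. A minimal sketch, assuming non-overlapping windows and dropping any trailing partial window (both choices are assumptions of this example, as is the function name):

```python
def gc_percent_windows(seq: str, window: int = 100) -> list[float]:
    """GC% of each consecutive, non-overlapping window; a trailing partial window is dropped."""
    seq = seq.upper()
    percents = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        percents.append(100.0 * gc / window)
    return percents


# Toy sequence: one all-GC window followed by one all-AT window.
print(gc_percent_windows("GC" * 50 + "AT" * 50))  # [100.0, 0.0]
```

Plotting read coverage against these per-window values is one way to see whether extreme-GC regions are under-represented in a sequencing run.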
N50
The N50 size of a set of entities (e.g., contigs or scaffolds)
represents the largest entity E such that at least half of the
total size of the entities is contained in entities of size E or
larger.
For example, given a collection of contigs with sizes 7, 4,
3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb
because we can cover 10 kb with contigs of 4 kb or larger (7 + 4 = 11 kb).
(http://www.cbcb.umd.edu/research/castats.shtml)
N50 length is the length ‘x’ such that 50% of the sequence
is contained in contigs of length x or greater.
(Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
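Both definitions lead to the same simple procedure: sort the lengths from largest to smallest and accumulate until half the total is reached. A minimal sketch (the function name `n50` is an illustrative choice):

```python
def n50(lengths: list[int]) -> int:
    """Length x such that contigs of length >= x hold at least half of the total sequence."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig length")


print(n50([7, 4, 3, 2, 2, 1, 1]))  # 4
```

With the example sizes, the running total reaches 7 kb and then 11 kb, which first exceeds half of the 20 kb total at the 4 kb contig.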
Why Completeness is Important
• Improves characterization of genome features
• Gene order, replication origins
• Better comparative genomics
• Genome duplications, inversions
• Presence and absence of particular genes can be very
important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies
Step 4: Closure
• Physical map information
• PCR and gap spanning
• Other sequencing data
General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
Step 5: Annotate
General Steps in Analysis of Complete Genomes
• Structural Annotation
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Functional Annotation
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Evolutionary Annotation
• Comparative genomics
Structural Annotation I: Genes in Genomes
• Protein coding genes.
– In long open reading frames
– ORFs interrupted by introns in eukaryotes
– Take up most of the genome in prokaryotes, but only a
small portion of the eukaryotic genome
• RNA-only genes
– Transfer RNA
– Ribosomal RNA
– snoRNAs (guide ribosomal and transfer RNA
maturation)
– intron splicing
– guiding mRNAs to the membrane for translation
– gene regulation (this is a growing list)
Structural Annotation II: Other Features to Find
• Gene control sequences
– Promoters
– Regulatory elements
• Transposable elements, both active and defective
– DNA transposons and retrotransposons
– Many types and sizes
• Other repeated sequences
– Centromeres and telomeres
– Many with unknown (or no) function
• Unique sequences that have no obvious function
Bacteria / Archaeal Protein Coding Genes
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and
a few others are occasionally used.
– Remember that start codons are also used internally: the actual start codon may not be the first
one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
– stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use
of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap is common
enough so that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that non-coding
sequence will be conserved.
– But, a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often
found just upstream from the start codon
– however, some aren’t recognizable
– genes in operons don’t always have a separate ribosome binding site for each gene
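The start and stop codon rules above can be turned into a very simple ORF scan. The sketch below is only an illustration: it looks at the forward strand only, ignores ribosome binding sites and homology, and reports the most upstream start codon for each stop. As the slide notes, that may not be the true start, because start codons also occur internally. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # main bacterial start codons from the slide
STOP_CODONS = {"TGA", "TAA", "TAG"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the most upstream start codon since the
    previous in-frame stop; end is the index just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # remember the first candidate start
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # a stop ends any open ORF
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

A real gene finder would also scan the reverse complement and weigh alternative internal starts using the other evidence listed on the slide.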
Composition Methods
• The frequency of various codons is different in coding regions as
compared to non-coding regions.
– This extends to G-C content, dinucleotide frequencies, and other
measures of composition. Dicodons (groups of 6 bases) are often
used
– Well documented experimentally.
• The composition varies between different proteins of course, and
it is affected within a species by the amounts of the various
tRNAs present
– horizontally transferred genes can also confuse things: they tend to
have compositions that reflect their original species.
– A second group with unusual compositions are highly expressed
genes.
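As a rough sketch of the composition measures mentioned above, the following computes G-C content and in-frame dicodon (6-base) counts for a sequence. The function names are chosen for this example, and how such counts are turned into a coding score (for instance by comparing against counts from known genes) is left open here.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def dicodon_counts(seq, frame=0):
    """Count 6-base words that start on codon boundaries of the given frame."""
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 5, 3):
        counts[seq[i:i + 6]] += 1
    return counts


print(gc_content("GGCCAATT"))          # 0.5
print(dicodon_counts("ATGAAAATGAAA"))  # ATGAAA seen twice, AAAATG once
```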
Eukaryotic Genes Harder to Find
• Some fundamental differences between
prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
– First step: find repeated sequences and RNA
genes
– Note that eukaryotes have 3 main RNA
polymerases. RNA polymerase 2 (pol2)
transcribes all protein-coding genes, while pol1
and pol3 transcribe various RNA-only genes.
• Most eukaryotic genes are split into exons and
introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation starts at
the first ATG in the mRNA
– thus, in eukaryotic genomes, searching for the
transcription start site (TSS) makes sense.
• Many fewer eukaryotic genomes have been
sequenced
Exons
• Exon sequences can often be identified by sequence conservation,
at least roughly.
• Dicodon statistics, as used for prokaryotes, are also useful
– eukaryotic genomes tend to contain many isochores, regions of
different GC content, and composition statistics can vary between
isochores.
• The initial and terminal exons contain untranslated regions, and
thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information about
the sequences surrounding each possible GT/AG pair (introns typically
begin with GT and end with AG), then running
this information through some combination of decision trees, Markov
models, discriminant analysis, or neural networks, in an attempt to
massage the data into giving a reliable score.
– In general, sites are more likely to be correct if predicted by multiple
methods
– Experimental data from ESTs can be very helpful here.
How to Find ncRNAs
• The most universal genes, such as tRNA and rRNA, are very conserved and thus
easy to detect. Finding them first removes some areas of the genome from further
consideration.
• One easy approach to finding common RNA genes is just looking for sequence
homology with related species: a BLAST search will find most of them quite easily
• Functional RNAs are characterized by secondary structure caused by base pairing
within the molecule.
• Determining the folding pattern is a matter of testing many possibilities to find the
one with the minimum free energy, which is the most stable structure.
• The free energy calculations are in turn based on experiments where short synthetic
RNA molecules are melted
• Related to this is the concept that paired regions (stems) will be conserved across
species lines even if the individual bases aren’t conserved. That is, if there is an A-U
pairing in one species, the same position might be occupied by a G-C in another
species.
• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is
cancelled by a compensating mutation at another site.
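The idea that a stem stays paired while its bases change can be checked directly on an alignment. The sketch below, with a made-up toy alignment, tests whether two alignment columns can pair in every sequence and counts how many different base pairs occupy that position. The function name and the alignment are hypothetical.

```python
# Watson-Crick pairs plus the G-U wobble pair found in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covariation(alignment, i, j):
    """Return (all_paired, n_distinct_pairs) for alignment columns i and j."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    all_paired = all(pair in PAIRS for pair in observed)
    return all_paired, len(set(observed))


# Toy alignment of three hypothetical RNA sequences (one per species).
toy_alignment = ["AGGGU", "GGGGC", "CGGGG"]
print(covariation(toy_alignment, 0, 4))  # (True, 3)
print(covariation(toy_alignment, 1, 3))  # (False, 1)
```

Columns that stay pairable while using several different base pairs are the signal of a conserved stem; columns that never pair are not.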
RNA Structure
• RNA differs from DNA in having fairly
common G-U base pairs. Also, many
functional RNAs have unusual modified
bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop
and a sequence outside its stem, is
especially difficult to detect:
computationally intense and not subject to
the normal situation that RNA base pairing
follows a nested pattern
– But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start
with all possible short sequences, then
build to larger ones, adding the
contribution of each structural element.
– There is an element of dynamic
programming here as well.
– And, “stochastic context-free grammars”,
something I really don’t want to approach
right now!
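To make the "start with short sequences, then build to larger ones" idea concrete, here is a minimal dynamic-programming sketch in the style of the Nussinov algorithm. Note the substitution: it maximizes the number of nested base pairs and does not compute free energies, so it is a simplification of what real folding programs do. It allows G-U pairs, requires a minimum hairpin loop size (the `min_loop` default of 3 is an assumption), and cannot find pseudoknots.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP)."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can form within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]             # case 1: base j is unpaired
            for k in range(i, j - min_loop):   # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3
```

In the example, the three pairs are G-C, G-C, and a G-U wobble pair closing an AAA loop.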
Finding tRNAs
• tRNAs have a highly conserved
structure, with 3 main stem-and-
loop structures that form a
cloverleaf structure, and several
conserved bases. Finding such
sequences is a matter of looking in
the DNA for the proper features
located the proper distance apart.
• Looking for such sequences is
well-suited to a decision tree, a
series of steps that the sequence
must pass.
• In addition, a score is kept, rating
how well the sequence passed
each step. This allows a more
stringent analysis later on, to
eliminate false positives.
Step 6: Analyze
Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis
gene against a comprehensive protein database
Gene ID | Similarity (%) | Length (bp) | Gene name (organism) | E-value*
GP:2905647 | 44.8 | 1,191 | D-Arabinitol kinase (Klebsiella pneumoniae) | 6.2e-15
EGAD:22614 | 46.2 | 1,191 | Gluconokinase (Bacillus subtilis) | 1.4e-13
EGAD:20418 | 43.0 | 1,302 | Xylulose kinase (Lactobacillus pentosus) | 4.8e-13
EGAD:105114 | 43.4 | 1,320 | Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus) | 4.7e-12
GP:2895855 | 42.7 | 1,263 | Xylulokinase (Lactobacillus brevis) | 1.0e-07
EGAD:10899 | 45.4 | 1,296 | Xylulose kinase (Escherichia coli) | 2.1e-06
*E-value is a statistical measure of the significance of a BLAST search result.
[Figure 2: Diagram depicting how complete microbial genome sequence data can accelerate vaccine development. Text recoverable from the figure, for N. meningitidis:]
• All potential antigens (genome coordinates 1 to 800,000)
• A total of 570 putative secreted proteins or surface proteins
• Protein expression
• A total of ~350 recombinant proteins expressed in E. coli and used to immunize mice
• Immune sera screening: bactericidal activity; binding to surface of MenB cells
• Selection of vaccine targets
• Seven proteins selected for follow-up based on high titres
• Final candidate selection: two proteins were found to exhibit no sequence variability -> clinical trials
• Time labels on the figure: hours, few months, 3–12 months
LGT
Functional Annotation
Functional Classification I: GO
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to
describe gene products with a structured controlled vocabulary, a set of invariant
terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For
example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function. A
gene product will probably be described by GO terms from each of these “ontologies”.
(ontology is a branch of philosophy concerned with the nature of being, and the basic
categories of being and their relationships.)
– For instance, cytochrome c is described with the molecular function term “oxidoreductase
activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”,
and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree.
This means simply that each term can have more than one parent term, but the
direction of parent to child (i.e. less specific to more specific) is always maintained.
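The difference between a tree and a directed acyclic graph is easy to see in code. The sketch below uses a small, made-up set of terms (not the real GO graph or real GO identifiers) in which one term has two parents, and collects all of a term's ancestors by following parent links.

```python
# Hypothetical toy ontology: each term maps to its list of parent terms.
# The most specific term has two parents, which a tree would not allow.
TOY_PARENTS = {
    "molecular function (toy root)": [],
    "binding (toy)": ["molecular function (toy root)"],
    "catalytic activity (toy)": ["molecular function (toy root)"],
    "ATP binding (toy)": ["binding (toy)"],
    "kinase activity (toy)": ["catalytic activity (toy)"],
    "ATP-dependent kinase (toy)": ["ATP binding (toy)", "kinase activity (toy)"],
}


def ancestors(term, parents):
    """Return every less specific term reachable from term via parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


for name in sorted(ancestors("ATP-dependent kinase (toy)", TOY_PARENTS)):
    print(name)
```

Because links always run from more specific to less specific terms, the walk must terminate at the root, and a gene product annotated with the most specific term is implicitly annotated with all five ancestors.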
Functional Classification II: Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
– Across many species, the enzymes that perform a specific function are usually
evolutionarily related. However, this isn’t necessarily true. There are cases of two
entirely different enzymes evolving similar functions.
– Often, two or more gene products in a genome will have the same E.C. number.
• Enzyme functions are given unique numbers by the Enzyme Commission.
– E.C. numbers are four integers separated by dots. The left-most number is the
least specific
– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose
components indicate the following groups of enzymes:
• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a
polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• Top level E.C. numbers:
– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between
molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
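The left-to-right hierarchy of an E.C. number can be sketched as a small parser. The class names come from the list above; the function names and the accepted prefixes ("EC" and "E.C.") are choices made for this example.

```python
# Top-level classes as listed on the slide.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_levels(ec):
    """Split 'EC 3.4.11.4' into its nested groups, least specific first."""
    text = ec.strip()
    for prefix in ("E.C.", "EC"):
        if text.upper().startswith(prefix):
            text = text[len(prefix):].strip()
            break
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-part E.C. number: {ec!r}")
    return [".".join(parts[:k]) for k in range(1, 5)]


def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    top = int(ec_levels(ec)[0])
    return EC_CLASSES.get(top, "class not listed on the slide")


print(ec_levels("EC 3.4.11.4"))   # ['3', '3.4', '3.4.11', '3.4.11.4']
print(ec_class("E.C. 3.4.11.4"))  # hydrolases
```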
Functional Prediction
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam,
FIGfam).
• Sequence motifs and domains. If the gene is not a good match to
previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions
are often found together in operons and divergons (genes
transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has
most of the genes needed to perform a function, any missing
functions are probably present too.
– Also, experimental data about an organism’s capacities can be used to
decide whether the relevant functions are present in the genome.
Functional Prediction II: Membrane Spanning
• Integral membrane proteins contain amino acid
sequences that go through the membrane one or
several times.
– There are also peripheral membrane proteins that stick
to the hydrophilic head groups by ionic and polar
interactions
– There are also some that have covalently bound
hydrophobic groups, such as myristate, a 14 carbon
saturated fatty acid that is attached to the N-terminal
amino group.
• There are 2 main protein structures that cross
membranes.
– Most are alpha helices, and in proteins that span
multiple times, these alpha helices are packed together
in a coiled-coil. Length = 15-30 amino acids.
– Less commonly, there are proteins with membrane
spanning “beta barrels”, composed of beta sheets
wrapped into a cylinder. An example: porins, which
let small hydrophilic molecules diffuse across the bacterial outer membrane.
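One common way to look for membrane-spanning alpha helices is a sliding-window hydropathy plot. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 cutoff are conventional choices for that scale and are assumptions here, not something stated on the slide. It will not detect beta barrels, and the function names are chosen for this example.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    means = hydropathy_windows(protein, window)
    return [i for i, mean in enumerate(means) if mean >= threshold]


# Hypothetical toy protein: a 19-residue leucine stretch between charged ends.
toy = "DDDD" + "L" * 19 + "KKKK"
print(candidate_tm_windows(toy))
```

Overlapping windows that pass the cutoff would normally be merged into a single predicted helix; that step is omitted to keep the sketch short.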
Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and
computational analyses
• Many diverse approaches
• All are improved by “phylogenomic” type analyses that
integrate evolutionary reconstructions and understanding
of how new functions evolve
Functional Prediction
• Identification of motifs
– Short regions of sequence similarity that are indicative
of general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of
other sequences
– If significantly similar genes are found, their functional
information is used
• Problem
– Genes frequently have similarity to hundreds of motifs
and multiple genes, not all with the same function
Helicobacter pylori
H. pylori genome - 1997
“The ability of H. pylori to
perform mismatch repair is
suggested by the presence of
methyl transferases, mutS
and uvrD. However,
orthologues of MutH and
MutL were not identified.”
MutL ??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of MutS family proteins; tip labels are abbreviated species names (e.g., Ecoli, Bacsu, Helpy, Yeast, Human).]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
[Figure: MutS family tree with subfamilies labeled: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Overlaying Functions onto Tree
[Figure: MutS family tree with known functions overlaid on the labeled subfamilies: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2.]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus
• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over
Functional Prediction Using Tree
[Figure: MutS family tree with a predicted function for each subfamily:]
• MSH1 - Mitochondrial Repair
• MSH3 - Nuclear Repair Of Loops
• MSH6 - Nuclear Repair Of Mismatches
• MutS1 - Bacterial Mismatch and Loop Repair
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over
• MutS2 - Unknown Functions
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Table 3. Presence of MutS Homologs in Complete Genome Sequences
Species | # of MutS Homologs | Which Subfamilies? | MutL Homologs
Bacteria
Escherichia coli K12 | 1 | MutS1 | 1
Haemophilus influenzae Rd KW20 | 1 | MutS1 | 1
Neisseria gonorrhoeae | 1 | MutS1 | 1
Helicobacter pylori 26695 | 1 | MutS2 | -
Mycoplasma genitalium G-37 | - | - | -
Mycoplasma pneumoniae M129 | - | - | -
Bacillus subtilis 168 | 2 | MutS1, MutS2 | 1
Streptococcus pyogenes | 2 | MutS1, MutS2 | 1
Mycobacterium tuberculosis | - | - | -
Synechocystis sp. PCC6803 | 2 | MutS1, MutS2 | 1
Treponema pallidum Nichols | 1 | MutS1 | 1
Borrelia burgdorferi B31 | 2 | MutS1, MutS2 | 1
Aquifex aeolicus | 2 | MutS1, MutS2 | 1
Deinococcus radiodurans R1 | 2 | MutS1, MutS2 | 1
Archaea
Archaeoglobus fulgidus VC-16, DSM4304 | - | - | -
Methanococcus jannaschii DSM 2661 | - | - | -
Methanobacterium thermoautotrophicum ΔH | 1 | MutS2 | -
Eukaryotes
Saccharomyces cerevisiae | 6 | MSH1-6 | 3+
Homo sapiens | 5 | MSH2-6 | 3+
Blast Search of H. pylori “MutS”
                                                     Score    E
Sequences producing significant alignments:          (bits)   Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN     117      3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN     69       1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN     64       3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN     62       2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN     57       4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN     57       4e-07
• Blast search pulls up Syn. sp MutS#2 with a much higher score (and much
lower E value) than other MutS homologs
• Based on this, TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
High Mutation Rate in H. pylori
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart of the METHOD, worked through for EXAMPLE A and EXAMPLE B, alongside the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN) of genes 1-6 and 1A-3B in Species 1, Species 2, and Species 3. Inferred duplications are marked on the gene trees, and one prediction is labeled Ambiguous.]
METHOD
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
Based on Eisen, 1998 Genome Res 8: 163-167.
Phylogenomics
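The last two steps of the method (overlay known functions onto the tree, then infer the function of the gene of interest) can be sketched with a toy tree. Everything in this example is hypothetical: the nested-tuple tree, the sequence names, and the simple rule used, which takes the smallest clade that contains the query plus at least one annotated gene and reports a function only if all annotated genes in that clade agree. It is an illustration of the idea, not the published procedure.

```python
def leaves(tree):
    """All leaf names in a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def predict_function(tree, query, known):
    """Predict query's function from the smallest annotated clade around it."""
    best = None
    node = tree
    while not isinstance(node, str):
        labels = {known[name] for name in leaves(node)
                  if name != query and name in known}
        if labels:
            best = labels          # a smaller clade overrides a larger one
        for child in node:         # step down toward the query
            if query in leaves(child):
                node = child
                break
        else:
            raise ValueError("query is not in the tree")
    if best is None:
        return None
    return next(iter(best)) if len(best) == 1 else "ambiguous"


# Hypothetical toy tree loosely modeled on the MutS1 / MutS2 split.
toy_tree = (("Ecoli_MutS1", "Bacsu_MutS1"),
            (("Synsp_MutS2", "Helpy_query"), "Bacsu_MutS2"))
toy_known = {
    "Ecoli_MutS1": "mismatch repair",
    "Bacsu_MutS1": "mismatch repair",
    "Synsp_MutS2": "function unknown",
    "Bacsu_MutS2": "function unknown",
}
print(predict_function(toy_tree, "Helpy_query", toy_known))  # function unknown
```

In this toy case the query groups with the MutS2-like genes, so it does not inherit the "mismatch repair" label from the more distant MutS1-like genes, even though all of them would appear as hits in a similarity search.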
Chemosynthetic Symbionts
Eisen et al. 1992. J. Bact. 174: 3416
Carboxydothermus hydrogenoformans
• Isolated from a Russian hot spring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon
Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS
Genetics 1: e65. )
Homologs of Sporulation Genes
Wu et al. 2005 PLoS
Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Non-Homology Predictions:
Phylogenetic Profiling
• Step 1: Search all genes in
organisms of interest against all
other genomes
• Ask: Yes or No, is each gene found
in each other species
• Cluster genes by distribution
patterns (profiles)
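The profiling steps above reduce to building a yes/no vector per gene and grouping genes whose vectors match. A minimal sketch, assuming the presence/absence calls have already been made (for example from similarity searches); the gene and genome names are hypothetical, and real analyses usually cluster similar rather than only identical profiles.

```python
from collections import defaultdict


def cluster_by_profile(presence, genomes):
    """Group genes that share the same presence/absence profile.

    presence maps each gene to the set of genomes where a homolog was found;
    genomes fixes the order of the yes/no entries in each profile.
    """
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(genome in found_in for genome in genomes)
        clusters[profile].append(gene)
    return dict(clusters)


# Hypothetical toy data: three genes searched against three genomes.
toy_genomes = ["sp1", "sp2", "sp3"]
toy_presence = {
    "geneA": {"sp1", "sp3"},
    "geneB": {"sp1", "sp3"},
    "geneC": {"sp2"},
}
for profile, genes in cluster_by_profile(toy_presence, toy_genomes).items():
    print(profile, genes)
```

Genes that fall in the same cluster (here geneA and geneB) are candidates for a shared process even when they have no sequence similarity to each other.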
Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
B. subtilis new sporulation genes
Functional Prediction III: Colocalization
• Operon structure is often
maintained over fairly large
taxonomic ranges.
– Sometimes gene order is altered,
and sometimes one or more
enzymes are missing.
– But in general, this phenomenon
allows recognition or verification
that widely diverged enzymes do
in fact have the same function.
• This is an operon that contains
part of the glycolytic pathway.
– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate
dehydrogenase
– 6: central glycolytic gene regulator
Metabolic Predictions
Comparative Genomics
Using the Core
800 NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com
between even related species.
Our molecular picture of evolution for the past 20 years has been
dominated by the small-subunit ribosomal RNA phylogenetic tree
analysed. Analyses of complete genome sequences have led to many
recent suggestions that the extent of horizontal gene exchange is
much greater than was previously realized [10–12]. For example, an
Table 2 Genome features from 24 microbial genome sequencing projects
Organism | Genome size (Mbp) | No. of ORFs (% coding) | Unknown function | Unique ORFs
Aeropyrum pernix K1 | 1.67 | 1,885 (89%) | |
A. aeolicus VF5 | 1.50 | 1,749 (93%) | 663 (44%) | 407 (27%)
A. fulgidus | 2.18 | 2,437 (92%) | 1,315 (54%) | 641 (26%)
B. subtilis | 4.20 | 4,779 (87%) | 1,722 (42%) | 1,053 (26%)
B. burgdorferi | 1.44 | 1,738 (88%) | 1,132 (65%) | 682 (39%)
Chlamydia pneumoniae AR39 | 1.23 | 1,134 (90%) | 543 (48%) | 262 (23%)
Chlamydia trachomatis MoPn | 1.07 | 936 (91%) | 353 (38%) | 77 (8%)
C. trachomatis serovar D | 1.04 | 928 (92%) | 290 (32%) | 255 (29%)
Deinococcus radiodurans | 3.28 | 3,187 (91%) | 1,715 (54%) | 1,001 (31%)
E. coli K-12-MG1655 | 4.60 | 5,295 (88%) | 1,632 (38%) | 1,114 (26%)
H. influenzae | 1.83 | 1,738 (88%) | 592 (35%) | 237 (14%)
H. pylori 26695 | 1.66 | 1,589 (91%) | 744 (45%) | 539 (33%)
Methanobacterium thermoautotrophicum | 1.75 | 2,008 (90%) | 1,010 (54%) | 496 (27%)
Methanococcus jannaschii | 1.66 | 1,783 (87%) | 1,076 (62%) | 525 (30%)
M. tuberculosis CSU#93 | 4.41 | 4,275 (92%) | 1,521 (39%) | 606 (15%)
M. genitalium | 0.58 | 483 (91%) | 173 (37%) | 7 (2%)
M. pneumoniae | 0.81 | 680 (89%) | 248 (37%) | 67 (10%)
N. meningitidis MC58 | 2.24 | 2,155 (83%) | 856 (40%) | 517 (24%)
Pyrococcus horikoshii OT3 | 1.74 | 1,994 (91%) | 859 (42%) | 453 (22%)
Rickettsia prowazekii Madrid E | 1.11 | 878 (75%) | 311 (37%) | 209 (25%)
Synechocystis sp. | 3.57 | 4,003 (87%) | 2,384 (75%) | 1,426 (45%)
T. maritima MSB8 | 1.86 | 1,879 (95%) | 863 (46%) | 373 (26%)
T. pallidum | 1.14 | 1,039 (93%) | 461 (44%) | 280 (27%)
Vibrio cholerae El Tor N1696 | 4.03 | 3,890 (88%) | 1,806 (46%) | 934 (24%)
Total | 50.60 | 52,462 (89%) | 22,358 (43%) | 12,161 (23%)
© 2000 Macmillan Magazines Ltd
After the Genomes
• Better analysis and annotation
• Comparative genomics
• Functional genomics (Experimental analysis of gene
function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome wide genetic experiments

More Related Content

What's hot

Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsMicrobial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsJonathan Eisen
 
UC Davis EVE161 Lecture 10 by @phylogenomics
UC Davis EVE161 Lecture 10 by @phylogenomicsUC Davis EVE161 Lecture 10 by @phylogenomics
UC Davis EVE161 Lecture 10 by @phylogenomicsJonathan Eisen
 
UC Davis EVE161 Lecture 11 by @phylogenomics
UC Davis EVE161 Lecture 11 by @phylogenomicsUC Davis EVE161 Lecture 11 by @phylogenomics
UC Davis EVE161 Lecture 11 by @phylogenomicsJonathan Eisen
 
Microbial Phylogenomics (EVE161) Class 14: Metagenomics
Microbial Phylogenomics (EVE161) Class 14: MetagenomicsMicrobial Phylogenomics (EVE161) Class 14: Metagenomics
Microbial Phylogenomics (EVE161) Class 14: MetagenomicsJonathan Eisen
 
UC Davis EVE161 Lecture 8 - rRNA ecology - by Jonathan Eisen @phylogenomics
UC Davis EVE161 Lecture 8 - rRNA ecology - by Jonathan Eisen @phylogenomicsUC Davis EVE161 Lecture 8 - rRNA ecology - by Jonathan Eisen @phylogenomics
UC Davis EVE161 Lecture 8 - rRNA ecology - by Jonathan Eisen @phylogenomicsJonathan Eisen
 
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups
Microbial Phylogenomics (EVE161) Class 7: rRNA PCR and Major Groups Jonathan Eisen
 
EVE 161 Winter 2018 Class 14
EVE 161 Winter 2018 Class 14EVE 161 Winter 2018 Class 14
EVE 161 Winter 2018 Class 14Jonathan Eisen
 
EveMicrobial Phylogenomics (EVE161) Class 9
EveMicrobial Phylogenomics (EVE161) Class 9EveMicrobial Phylogenomics (EVE161) Class 9
EveMicrobial Phylogenomics (EVE161) Class 9Jonathan Eisen
 
UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics
UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomicsUC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomics
UC Davis EVE 161 Lecture 7 - rRNA workflows - by Jonathan Eisen @phylogenomicsJonathan Eisen
 
EVE 161 Winter 2018 Class 10
EVE 161 Winter 2018 Class 10EVE 161 Winter 2018 Class 10
EVE 161 Winter 2018 Class 10Jonathan Eisen
 
EVE 161 Winter 2018 Class 15
EVE 161 Winter 2018 Class 15EVE 161 Winter 2018 Class 15
EVE 161 Winter 2018 Class 15Jonathan Eisen
 
EVE 161 Winter 2018 Class 17
EVE 161 Winter 2018 Class 17EVE 161 Winter 2018 Class 17
EVE 161 Winter 2018 Class 17Jonathan Eisen
 
Microbial Phylogenomics (EVE161) Class 3: Woese and the Tree of Life
Microbial Phylogenomics (EVE161) Class 3: Woese and the Tree of LifeMicrobial Phylogenomics (EVE161) Class 3: Woese and the Tree of Life
Microbial Phylogenomics (EVE161) Class 3: Woese and the Tree of LifeJonathan Eisen
 
Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4Microbial Phylogenomics (EVE161) Class 4
Microbial Phylogenomics (EVE161) Class 4Jonathan Eisen
 
Microbial Phylogenomics (EVE161) Class 6: Era II - Culture Independent rRNA
Microbial Phylogenomics (EVE161) Class 6: Era II - Culture Independent rRNAMicrobial Phylogenomics (EVE161) Class 6: Era II - Culture Independent rRNA
Microbial Phylogenomics (EVE161) Class 6: Era II - Culture Independent rRNAJonathan Eisen
 
American Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk UniversityAmerican Gut Project presentation at Masaryk University
American Gut Project presentation at Masaryk Universitymcdonadt
 
EVE 161 Winter 2018 Class 13
EVE 161 Winter 2018 Class 13EVE 161 Winter 2018 Class 13
EVE 161 Winter 2018 Class 13Jonathan Eisen
 

What's hot (20)

Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative GenomicsMicrobial Phylogenomics (EVE161) Class 13 - Comparative Genomics
Microbial Phylogenomics (EVE161) Class 13 - Comparative Genomics
 
UC Davis EVE161 Lecture 10 by @phylogenomics
UC Davis EVE161 Lecture 10 by @phylogenomicsUC Davis EVE161 Lecture 10 by @phylogenomics
UC Davis EVE161 Lecture 10 by @phylogenomics
 
UC Davis EVE161 Lecture 11 by @phylogenomics
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing

  • 1. EVE 161: Microbial Phylogenomics. Lecture 10. UC Davis, Winter 2016. Instructors: Jonathan Eisen & Holly Ganz
  • 2. Answer 2 of these. Please make your answers short. • 1) List 4-5 steps in a “Whole Genome Shotgun Sequencing” project. • 2) What is meant by the “Add on Costs of Sequencing”? • 3) Explain one form of evidence used to infer lateral gene transfer and why that evidence can sometimes be misleading. • 4) Give examples of 3 different ways to fragment genomic DNA.
  • 3. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014. 1st Genome Sequence: Fleischmann et al. 1995
  • 4. Complete Genome/Chromosome Progress
  • 5. Fraser et al. 2000. Microbial genome sequencing. Claire M. Fraser, Jonathan A. Eisen & Steven L. Salzberg. The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA.
 Abstract: Complete genome sequences of 30 microbial species have been determined during the past five years, and work in progress indicates that the complete sequences of more than 100 further microbial species will be available in the next two to four years. These results have revealed a tremendous amount of information on the physiology and evolution of microbial species, and should provide novel approaches to the diagnosis and treatment of infectious disease.
 Excerpt: Microbes were the first organisms on Earth and preceded animals and plants by more than 3 billion years. They are the foundation of the biosphere, from both an evolutionary and an environmental perspective (ref. 1). It has been estimated that microbial species comprise about 60% of the Earth’s biomass. The genetic, metabolic and physiological diversity of microbial species is far greater than that found in plants and animals. But the diversity of the microbial world is largely unknown, with less than one-half of 1% of the estimated 2–3 billion microbial species identified. Of those species that have been described, their biological diversity is extraordinary, having adapted to grow under extremes of temperature, [...] advances in DNA-sequencing technology, the sequencing of whole genomes had not progressed beyond lambda-sized clones (about 40 kbp) because of the lack of sufficient computational approaches that would enable the efficient assembly of a large number of independent random sequences into a single contig. For the H. influenzae and subsequent projects, we have used a computational method that was developed to create assemblies from hundreds of thousands of complementary DNA sequences 300–500-bp long (ref. 4). This approach has proved to be a cost-effective and efficient approach to sequencing megabase-sized segments of genomic DNA. This strategy does not require an ordered set of cosmids or other subclones, thus significantly reducing the overall cost
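The Fraser et al. excerpt attributes the breakthrough to computational assembly of many independent random sequences into a single contig. As a rough illustration only, the sketch below fragments a made-up sequence into overlapping reads and merges them greedily by longest suffix-prefix overlap. It is a toy under stated assumptions, not the assembler used for H. influenzae: the function names, the `min_len` threshold, and the example sequence are all hypothetical, and real assemblers must also handle sequencing errors, reverse complements, and repeats.

```python
import random


def shotgun_reads(genome, n_reads, read_len, rng):
    """Sample n_reads random substrings of length read_len (toy 'fragment + sequence' step)."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=8):
    """Repeatedly merge the pair of sequences with the largest overlap until none remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    # Drop reads that are fully contained in another read.
    contigs = [c for c in contigs if not any(c != d and c in d for d in contigs)]
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining contigs are separated by gaps (work for the closure phase)
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = [c for c in contigs if c not in merged]  # now-contained sequences
        contigs.append(merged)
    return contigs


# Hypothetical 28-base "genome" with no repeated 8-mers, tiled by 12-base reads every 4 bases.
genome = "ATGGCGTACCTTAGACGGATTCAGCTAA"
tiled_reads = [genome[i:i + 12] for i in range(0, len(genome) - 11, 4)]
contigs = greedy_assemble(tiled_reads)
```

With evenly tiled, error-free reads the greedy merge recovers one contig equal to the input sequence. With randomly sampled reads (for example from `shotgun_reads`), some regions may be missed entirely, which is why a shotgun project ends with a separate gap-closure step.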
  • 6. Fraser et al. 2000: Shotgun Sequencing. [...] analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria (refs 13, 14). This led to the suggestion of possible extensive gene exchanges between these species and archaeal [...] be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes (refs 21, 22) and another finding that some types of gene might be more prone to gene transfer than others (ref. 23). It therefore seems likely that horizontal gene [...]
 Figure 1: Diagram depicting the steps in a whole-genome shotgun sequencing project. 1. Library construction: (i) isolate DNA, (ii) fragment DNA, (iii) clone DNA. 2. Random sequencing phase: (i) sequence DNA (15,000 sequences per Mb). 3. Closure phase: (i) assemble sequences, (ii) close gaps, (iii) edit, (iv) annotation. 4. Complete genome sequence.
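The figure's "15,000 sequences per Mb" can be turned into an approximate depth of coverage. The sketch below assumes a read length of 500 bp, which is not stated on the slide, and uses the standard Poisson approximation that a base sampled at mean depth c is left unsequenced with probability e^(-c). The function names are illustrative only.

```python
import math


def coverage(n_reads, read_len_bp, genome_len_bp):
    """Mean depth of coverage c = N * L / G."""
    return n_reads * read_len_bp / genome_len_bp


def expected_unsequenced_fraction(c):
    """Poisson approximation: fraction of bases hit by zero reads at mean depth c."""
    return math.exp(-c)


reads_per_mb = 15_000   # from Figure 1 (Fraser et al. 2000)
read_len = 500          # ASSUMPTION: read length in bp, not given on the slide
c = coverage(reads_per_mb, read_len, 1_000_000)
missing = expected_unsequenced_fraction(c)

print(f"coverage = {c:.1f}x, expected unsequenced fraction = {missing:.2e}")
```

Under these assumptions the depth is 7.5x and only a small fraction of bases is expected to be missed by random sampling alone, which is consistent with the figure placing a short gap-closing step after the random sequencing phase. A shorter assumed read length lowers the depth proportionally.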
  • 7. From http://genomesonline.org
  • 8. Loman et al. 2012. High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nicholas J. Loman, Chrystala Constantinidou, Jacqueline Z. M. Chan, Mihail Halachev, Martin Sergeant, Charles W. Penn, Esther R. Robinson and Mark J. Pallen.
 Abstract: Here, we take a snapshot of the high-throughput sequencing platforms, together with the relevant analytical tools, that are available to microbiologists in 2012, and evaluate the strengths and weaknesses of these platforms in obtaining bacterial genome sequences. We also scan the horizon of future possibilities, speculating on how the availability of sequencing that is ‘too cheap to metre’ might change the face of microbiology forever.
 Excerpt: In bacteriology, the genomic era began in 1995, when the first bacterial genome was sequenced using conventional Sanger sequencing (ref. 1). Back then, sequencing projects required six-figure budgets and [...] be used to analyse these data and thus move from draft to complete genomes. Several high-throughput sequencing platforms are now chasing the US$1,000 human genome (ref. 3). Given that the average [...]
  • 9. Loman et al. Shotgun Sequencing 2014
  • 10. Table 1 | Comparison of next-generation sequencing platforms (Loman et al. 2012; PROGRESS, FOCUS ON NEXT-GENERATION SEQUENCING)
    Columns: Machine (manufacturer); Chemistry; Modal read length* (bases); Run time; Gb per run; Current, approximate cost (US$)‡; Advantages; Disadvantages
    High-end instruments
    – 454 GS FLX+ (Roche); Pyrosequencing; 700–800; [...] hours; 0.7; 500,000; Advantages: long read lengths; Disadvantages: appreciable hands-on time, high reagent costs, high error rate in homopolymers
    – HiSeq 2000/2500 (Illumina); Reversible terminator; 2×100; 11 days (regular mode) or [...] days (rapid run mode)§; 600 (regular mode) or 120 (rapid run mode)§; 750,000; Advantages: cost-effectiveness, steadily improving read lengths, massive throughput, minimal hands-on time; Disadvantages: long run time, short read lengths, HiSeq 2500 instrument upgrade not available at time of writing (available end 2012)
    – 5500xl SOLiD (Life Technologies); Ligation; 75 + 35; [...] days; 150; 350,000; Advantages: low error rate, massive throughput; Disadvantages: very short read lengths, long run times
    – PacBio RS (Pacific Biosciences); Real-time sequencing; 3,000 (maximum 15,000); [...] minutes; 3 per day; 750,000; Advantages: simple sample preparation, low reagent costs, very long read lengths; Disadvantages: high error rate, expensive system, difficult installation
    Bench-top instruments
    – 454 GS Junior (Roche); Pyrosequencing; 500; [...] hours; 0.035; 100,000; Advantages: long read lengths; Disadvantages: appreciable hands-on time, high reagent costs, high error rate in homopolymers
    – Ion Personal Genome Machine (Life Technologies); Proton detection; 100 or 200; [...] hours; 0.01–0.1 (314 chip), 0.1–0.5 (316 chip) or up to 1 (318 chip); 80,000 (including OneTouch and server); Advantages: short run times, appropriate throughput for microbial applications; Disadvantages: appreciable hands-on time, high error rate in homopolymers
    – Ion Proton (Life Technologies); Proton detection; up to 200; 2 hours; up to 10 (Proton I chip) or up to 100 (Proton II chip); 145,000 + 75,000 for compulsory server; Advantages: short run times, flexible chip reagents; Disadvantages: instrument not available at time of writing
    – MiSeq (Illumina); Reversible terminator; 2×150; [...] hours; 1.5; 125,000; Advantages: cost-effectiveness, short run times, appropriate throughput for microbial applications, minimal hands-on time; Disadvantages: read lengths too short for efficient assembly
    *Average read length for a fragment-based run. ‡Approximate cost per machine plus additional instrumentation and service contract. See REF. 58. §Available only on the HiSeq 2500.
  • 11. Loman et al. 2012 (PROGRESS)
    Excerpt: "De novo assemblies can be compared using Mauve [25] or Mugsy [26], and the assemblies can be manually examined using the Tablet [27] ..."
    Excerpt: "... intensive. Some workflows combine a series of programs and provide an accessible interface for microbiologists who are not ..."
    Excerpt: "... interest in alignment-free approaches for constructing bacterial phylogenies, as it is thought that these approaches may help ..."
    Table 2 | The applicability of the major high-throughput sequencing platforms
    Machines compared*: 454 GS Junior‡; 454 GS FLX+‡; Ion Personal Genome Machine (318 chip)§; MiSeq||; HiSeq 2000||; 5500xl SOLiD§; PacBio RS¶
    Example application in bacteriology, with desirable characteristics:
    – De novo sequencing of novel strains to generate a single-scaffold reference genome: long reads; paired-end protocol and/or long mate-pair protocol; even coverage of genome
    – Rapid characterization of a novel pathogen (draft de novo assembly of a genome for a single strain): total run time (library preparation plus sequencing) of under [...] hours; sufficient coverage of a bacterial genome in a single run
    – Rough-draft de novo sequencing of small numbers of strains (<20) for comparative analysis of gene content: long or paired-end reads; high throughput; ease of library and sequencing workflow; cost-effective
    – Re-sequencing of many similar strains (>50) for the discovery of single nucleotide polymorphisms and for phylogenetics: very high throughput; low-cost, high-throughput sequence library construction; high accuracy
    – Small-scale transcriptomics-by-sequencing experiments (for example, two strains under four growth conditions with two biological replicates, so 16 strains): high per-isolate coverage
    – Phylogenetic profiling to genus-level using partial 16S rRNA gene amplicon sequencing: high coverage; long amplicon input (≥500 bp); long reads; high single-read accuracy (error rate <1%)
    – Whole-genome metagenomics for the reconstruction of multiple genomes in a single sample: long reads or paired-end reads; very high throughput; low error rate
    *Each machine is rated per application as particularly well suited, suitable, or X (not suitable); the per-machine marks are not recoverable here. ‡From Roche. §From Life Technologies. ||From Illumina. ¶From Pacific Biosciences.
  • 12. Step 1: Get DNA
  • 13. Step 2: Shotgun Sequence
  • 14. DNA target sample Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 15. Shotgun DNA Sequencing (1995-2005) DNA target sample Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 16. SHEAR Shotgun DNA Sequencing (1995-2005) DNA target sample Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 17. SIZE SELECT e.g., 10Kbp ± 8% std.dev. SHEAR Shotgun DNA Sequencing (1995-2005) DNA target sample Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 18. SIZE SELECT e.g., 10Kbp ± 8% std.dev. SHEAR Shotgun DNA Sequencing (1995-2005) DNA target sample Vector LIGATE & CLONE Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 19. SIZE SELECT e.g., 10Kbp ± 8% std.dev. SHEAR Shotgun DNA Sequencing (1995-2005) DNA target sample Vector LIGATE & CLONE Primer End Reads (Mates) SEQUENCE 550bp Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 20. Short read genome sequencing (2005-current) Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 21. Short read genome sequencing (2005-current) Genomic DNA 270 bp fragments Random fragmentation Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 22. Short read genome sequencing (2005-current) Genomic DNA 270 bp fragments Random fragmentation Paired-end short insert reads (10’s millions) molecular biology Sequencing (Illumina) Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 23. Short read genome sequencing (2005-current) Genomic DNA 270 bp fragments Random fragmentation 4-8 kb fragments Paired-end long insert reads (10’s millions) Paired-end short insert reads (10’s millions) molecular biology Sequencing (Illumina) Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 24. Short read genome sequencing (2005-current) How do we assemble this data back into a genome? Genomic DNA 270 bp fragments Random fragmentation 4-8 kb fragments Paired-end long insert reads (10’s millions) Paired-end short insert reads (10’s millions) molecular biology Sequencing (Illumina) Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 26. Step 3: Assemble Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 27. Assembly outline Contigs Scaffolds Reads Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 28. Assembly outline Assembly algorithms e.g. Allpaths, Velvet, Meraculous Contigs Scaffolds Reads Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 29. De Bruijn Graph Assembly Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 30. De Bruijn example “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “ Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall Example courtesy of J. Leipzig 2010 Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 31. De Bruijn example itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness… Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 32. De Bruijn example itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness… Generate random ‘reads’ fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe …etc. to 10’s of millions of reads Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 33. De Bruijn example How do we assemble? (same text and random 10-letter 'reads' as on the previous slide, …etc. to 10's of millions of reads)
  • 34. De Bruijn example How do we assemble? Traditional all-vs-all assemblers fail due to immense computational resources (scales with the square of the number of reads): a million (10⁶) reads requires a trillion (10¹²) pairwise alignments
  • 35. De Bruijn example How do we assemble? De Bruijn solution: Represent the data as a graph (scales with genome size). Traditional all-vs-all assemblers fail due to immense computational resources (scales with the square of the number of reads): a million (10⁶) reads requires a trillion (10¹²) pairwise alignments
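The scaling argument on these slides can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the k-mer bound assumes a single linear sequence with no sequencing errors (errors add extra k-mers in practice).

```python
def all_vs_all_alignments(n_reads: int) -> int:
    """Pairwise comparisons as counted on the slide: n_reads squared."""
    return n_reads * n_reads


def unordered_pairs(n_reads: int) -> int:
    """Distinct unordered read pairs; still grows quadratically."""
    return n_reads * (n_reads - 1) // 2


def max_distinct_kmers(genome_length: int, k: int) -> int:
    """Upper bound on distinct k-mers in an error-free linear genome.

    This depends on genome size, not on how many reads were sequenced.
    """
    return max(genome_length - k + 1, 0)


if __name__ == "__main__":
    print(all_vs_all_alignments(10**6))       # a million reads
    print(max_distinct_kmers(2_000_000, 31))  # hypothetical 2 Mb genome, k=31
```

Doubling the number of reads quadruples the all-vs-all work, while the k-mer bound stays fixed for a given genome.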
  • 36. De Bruijn example Step 1: Convert reads into “Kmers” Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 37. De Bruijn example Step 1: Convert reads into “Kmers” Reads: theageofwi Kmers : (k=3) the Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 38. De Bruijn example Step 1: Convert reads into “Kmers” Reads: theageofwi Kmers : (k=3) the hea Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 39. De Bruijn example Step 1: Convert reads into “Kmers” Reads: theageofwi Kmers : (k=3) the hea eag Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 40. De Bruijn example Step 1: Convert reads into “Kmers” Reads: theageofwi age geo eof ofw fwi Kmers : (k=3) the hea eag Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 41. De Bruijn example Step 1: Convert reads into “Kmers” Reads: theageofwi age geo eof ofw fwi sthebestof sth the heb ebe bes est sto tof astheageof ast sth the hea eag age geo eof worstoftim wor ors rst sto tof oft fti tim imesitwast ime mes esi sit itw twa was ast …..etc for all reads in the dataset Kmers : (k=3) the hea eag Kmer: a substring of defined length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
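Step 1 can be sketched in a few lines of Python. The function name `kmers` is a hypothetical helper, not something taken from the slides or from any assembler.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k, in order, sliding one letter at a time."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    # The read used on the slide, with k=3.
    print(kmers("theageofwi", 3))
```

A read of length L yields L - k + 1 kmers, so the 10-letter reads in this example each give 8 kmers at k=3.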
  • 42. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 43. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers: the → hea → eag → age → geo → eof → ofw → fwi
  • 45. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers: the → hea → eag → age → geo → eof → ofw → fwi; added: ast → sth → the → hea → eag → age → geo → eof
  • 46. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers: the → hea → eag → age → geo → eof → ofw → fwi; ast → sth → the → hea → eag → age → geo → eof; added: sth → the → heb → ebe → bes → est → sto → tof
  • 47. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers: the → hea → eag → age → geo → eof → ofw → fwi; ast → sth → the → hea → eag → age → geo → eof; sth → the → heb → ebe → bes → est → sto → tof; added: wor → ors → rst → sto → tof → oft → fti → tim and ime → mes → esi → sit → itw → twa → was → ast
  • 48. De Bruijn example Step 2: Build a De-Bruijn graph from the kmers: the → hea → eag → age → geo → eof → ofw → fwi; ast → sth → the → hea → eag → age → geo → eof; sth → the → heb → ebe → bes → est → sto → tof; wor → ors → rst → sto → tof → oft → fti → tim; ime → mes → esi → sit → itw → twa → was → ast …..etc for all 'kmers' in the dataset
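Step 2 can be sketched as follows, using the convention drawn on the slides: each distinct kmer is a node, and an edge joins two kmers that occur consecutively in some read. The names `kmers` and `build_graph` are illustrative, and real assemblers use far more compact data structures than a dictionary of sets.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list[str]:
    """All substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each kmer to the set of kmers that directly follow it in any read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for current, following in zip(ks, ks[1:]):
            graph[current].add(following)
    return dict(graph)


if __name__ == "__main__":
    reads = ["theageofwi", "sthebestof", "astheageof"]
    graph = build_graph(reads, 3)
    # 'the' is followed by both 'hea' and 'heb': a branch point in the graph.
    print(sorted(graph["the"]))
```

Because identical kmers from different reads collapse into one node, the graph grows with the number of distinct kmers rather than with the number of reads.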
  • 49. De Bruijn example Step 3: Simplify the graph as much as possible: A De Bruijn Graph Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 51. De Bruijn example Step 3: Simplify the graph as much as possible: A De Bruijn Graph “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “ De Bruijn assemblies ‘broken’ by repeats longer than kmer Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 52. No single solution! Drawback of De Bruijn approach Break graph to produce final assembly Step 4: Dump graph into consensus (fasta) Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 53. Kmer size is an important parameter in De Bruijn assembly The final assembly (k=3) wor times itwasthe foolishness incredulity age epoch be st wisdom of belief Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 54. Kmer size is an important parameter in De Bruijn assembly The final assembly (k=3) wor times itwasthe foolishness incredulity age epoch be st wisdom of belief A better assembly (k=20) itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis… Repeat with a longer “kmer” length Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 55. Kmer size is an important parameter in De Bruijn assembly. The final assembly (k=3): wor times itwasthe foolishness incredulity age epoch be st wisdom of belief. A better assembly (k=20), obtained by repeating with a longer "kmer" length: itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis… Why not always use the longest 'k' possible? Sequencing errors. Take the erroneous read sthebentof (true sequence sthebestof): with k=3 the kmers are sth the heb ebe ben ent nto tof (mostly unaffected kmers); with k=10 the only kmer is sthebentof (100% wrong kmer).
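The effect of one sequencing error at different kmer lengths can be counted directly. This is a minimal sketch: `wrong_kmer_fraction` is a hypothetical helper, and it assumes the true read is known and the error is a substitution, so both reads have the same length.

```python
def kmers(read: str, k: int) -> list[str]:
    """All substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def wrong_kmer_fraction(true_read: str, observed_read: str, k: int) -> float:
    """Fraction of kmers in the observed read that differ from the true kmers."""
    if len(true_read) != len(observed_read):
        raise ValueError("this sketch only handles substitution errors")
    true_kmers = kmers(true_read, k)
    observed_kmers = kmers(observed_read, k)
    wrong = sum(t != o for t, o in zip(true_kmers, observed_kmers))
    return wrong / len(observed_kmers)


if __name__ == "__main__":
    for k in (3, 10):
        print(k, wrong_kmer_fraction("sthebestof", "sthebentof", k))
```

With k=3 only the kmers overlapping the error (ben, ent, nto) are wrong; with k=10 the single kmer is the whole read, so it is wrong.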
  • 56. Scaffolding Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 57. Scaffolding Contigs Scaffolds (An assembly) Reads ‘De Bruijn’ assembly Join contigs using evidence from paired end data Align reads to DeBruijn contigs Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 58. Scaffolding Contigs Scaffolds (An assembly) Reads ‘De Bruijn’ assembly “Captured” gaps caused by repeats. Represented by “NNN” in assembly Join contigs using evidence from paired end data Align reads to DeBruijn contigs Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 59. Lander-Waterman statistics
    L = read length; T = minimum detectable overlap; G = genome size; N = number of reads; c = coverage, $c = NL/G$; $\sigma = 1 - T/L$
    $E(\#\text{islands}) = N e^{-c\sigma}$
    $E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
    contig = island with 2 or more reads
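The Lander-Waterman formulas on this slide can be evaluated numerically. The sketch below transcribes them directly; the function name and the example project parameters (500 bp reads, 50 bp minimum overlap, 1 Mb genome, 20,000 reads) are assumptions chosen for illustration, not values from the slides.

```python
import math


def lander_waterman(L: float, T: float, G: float, N: float) -> dict[str, float]:
    """Expected island count and island size under the Lander-Waterman model.

    L = read length, T = minimum detectable overlap,
    G = genome size, N = number of reads.
    """
    c = N * L / G        # coverage
    sigma = 1 - T / L
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return {"coverage": c, "sigma": sigma,
            "islands": islands, "island_size": island_size}


if __name__ == "__main__":
    # Hypothetical project: 20,000 reads of 500 bp on a 1 Mb genome.
    for name, value in lander_waterman(L=500, T=50, G=1_000_000, N=20_000).items():
        print(f"{name}: {value:.3f}")
```

The exponential term is why the expected number of islands drops so quickly as coverage rises, and why real assemblies, which violate the model's uniform-coverage assumption, tend to do worse than predicted.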
  • 60. Mis-assembly of repetitive sequence Schatz M C et al. Brief Bioinform 2013;14:213-224 Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 61. Mis-assembled repeats [Figure: three classes of repeat mis-assembly, shown as the true order of segments and repeat copies versus the assembled order: collapsed tandem, excision, rearrangement]
  • 62. Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 63. Biased coverage (->gaps) Assembly in reality Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 64. Biased coverage (->gaps) Assembly in reality Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Sequencing errors (-> fragmented assembly) * * *** * * Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 65. Biased coverage (->gaps) Assembly in reality Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Chimeric reads (->mis-joins) Sequencing errors (-> fragmented assembly) * * *** * * Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 66. Biased coverage (->gaps) Assembly in reality Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Contaminant reads (-> incorrect + inflated assembly) Chimeric reads (->mis-joins) Sequencing errors (-> fragmented assembly) * * *** * * Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 67. Biased coverage (->gaps) Assembly in reality Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Contaminant reads (-> incorrect + inflated assembly) Chimeric reads (->mis-joins) Sequencing errors (-> fragmented assembly) * * *** * * * Worse than predicted assemblies! Slides from Presentation by Alicia Clum genomebiology.jgi-psf.org/Content/MGM-13.Sep2012/.../3.clum.ppt
  • 68. Real life assembly is messy! [Figure: two plots. Fraction of normalized coverage versus GC% of 100-base windows, with the theoretical expectation; coverage (x) versus reference position (bp)]
  • 69. Genome properties can also make assembly difficult
    – Biased sequence composition (e.g. ACTGTCTAGTCAGCGCGCGCGCGCGCGCCCGCGCGCGCGGGCGGCGGCGCGGGCGGGCGCATGTAGTGATC). RESULT: incomplete / fragmented assembly
    – High repeat content (repeat r occurring many times). RESULT: misassemblies / collapsed assemblies
    – Polyploidy (haplotypes a and a'). RESULT: fragmented assembly
    – Biased sequence abundance. RESULT: incomplete / fragmented assembly
  • 70. N50 The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities of size E or larger. For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20 kb), the N50 length is 4 kb because the contigs of 4 kb or larger (7 + 4 = 11 kb) cover at least 10 kb, whereas the 7 kb contig alone does not. (http://www.cbcb.umd.edu/research/castats.shtml) N50 length is the length 'x' such that 50% of the sequence is contained in contigs of length x or greater. (Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf)
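The N50 definition translates into a short calculation: sort the lengths from largest to smallest and accumulate until half of the total is reached. The function name `n50` is illustrative, and this sketch follows the "length x or greater" wording of the definition.

```python
def n50(lengths: list[int]) -> int:
    """Length x such that contigs of length >= x hold at least half the total."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # The worked example from the slide (sizes in kb).
    print(n50([7, 4, 3, 2, 2, 1, 1]))
```

N50 says nothing about correctness: a mis-joined assembly can have a higher N50 than a correct but more fragmented one.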
  • 71. TIGR
  • 72. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 Why Completeness is Important • Improves characterization of genome features • Gene order, replication origins • Better comparative genomics • Genome duplications, inversions • Presence and absence of particular genes can be very important • Missing sequence might be important (e.g., centromere) • Allows researchers to focus on biology not sequencing • Facilitates large scale correlation studies
  • 73. Step 4: Closure • Physical map information • PCR and gap spanning • Other sequencing data
  • 74. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 General Steps in Analysis of Complete Genomes • Identification/prediction of genes • Characterization of gene features • Characterization of genome features • Prediction of gene function • Prediction of pathways • Integration with known biological data • Comparative genomics
  • 76. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 General Steps in Analysis of Complete Genomes • Structural Annotation • Identification/prediction of genes • Characterization of gene features • Characterization of genome features • Functional Annotation • Prediction of gene function • Prediction of pathways • Integration with known biological data • Evolutionary Annotation • Comparative genomics
  • 77. Structural Annotation I: Genes in Genomes • Protein coding genes. – In long open reading frames – ORFs interrupted by introns in eukaryotes – Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome • RNA-only genes – Transfer RNA – ribosomal RNA – snoRNAs (guide ribosomal and transfer RNA maturation) – intron splicing – guiding mRNAs to the membrane for translation – gene regulation (this is a growing list)
  • 78. Structural Annotation II: Other Features to Find • Gene control sequences – Promoters – Regulatory elements • Transposable elements, both active and defective – DNA transposons and retrotransposons – Many types and sizes • Other repeated sequences. – Centromeres and telomeres – Many with unknown (or no) function • Unique sequences that have no obvious function
  • 79. Bacterial / Archaeal Protein Coding Genes • Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used. – Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF. • The stop codons are the same as in eukaryotes: TGA, TAA, TAG – stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation. • Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough that you can't just eliminate overlaps as impossible. • Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved. – But a significant minority of genes (say 20%) are unique to a given species. • Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon – however, some aren't recognizable – genes in operons don't always have a separate ribosome binding site for each gene
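The start and stop codon rules above are enough for a naive ORF scan. The sketch below is illustrative only and is not how real gene finders work: it scans the forward strand in three frames, takes the first start codon after the previous stop (which, as the slide warns, may not be the true start), and ignores overlaps, the reverse strand, and ribosome binding sites. The names `find_orfs` and `min_codons` are assumptions.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAA", "TAG"}


def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    Each ORF runs from the first start codon in a frame through the next
    in-frame stop codon; min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

In practice a minimum length of a few hundred bases is commonly applied, because short ORFs arise frequently by chance.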
  • 80. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 Composition Methods • The frequency of various codons is different in coding regions as compared to non-coding regions. – This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used – Well documented experimentally. • The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present – horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species. – A second group with unusual compositions are highly expressed genes.
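The composition measures named on this slide, G-C content and dicodon frequencies, are simple to compute. This is a minimal sketch with hypothetical function names; a real gene finder would compare such counts between known coding and non-coding training sequences rather than inspect a single sequence.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def dicodon_counts(seq: str, frame: int = 0) -> Counter:
    """Count in-frame dicodons (6-base windows), stepping one codon at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(frame, len(seq) - 5, 3))


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(dicodon_counts("ATGAAATTT"))
```

Computing the same statistics in a sliding window along a genome is one way to flag regions whose composition differs from the genome average, such as the horizontally transferred genes mentioned above.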
  • 81. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014 Eukaryotic Genes Harder to Find • Some fundamental differences between prokaryotes and eukaryotes: • There is lots of non-coding DNA in eukaryotes. – First step: find repeated sequences and RNA genes – Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes. • most eukaryotic genes are split into exons and introns. • Only 1 gene per transcript in eukaryotes. • No ribosome binding sites: translation starts at the first ATG in the mRNA – thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense. • Many fewer eukaryotic genomes have been sequenced
  • 82. Exons • Exon sequences can often be identified by sequence conservation, at least roughly. • Dicodon statistics, as used for prokaryotes, are also useful – eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores. • The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them. • Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision trees, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score. – In general, sites are more likely to be correct if predicted by multiple methods – Experimental data from ESTs can be very helpful here.
  • 83. How to Find ncRNAs
    • The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.
    • One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily.
    • Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.
    • Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.
    • The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted.
    • Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.
    • This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
  • 84. RNA Structure
    • RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
    • The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intensive to predict and breaks the usual rule that RNA base pairing follows a nested pattern.
      – But pseudoknots seem to be fairly rare.
    • Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
      – There is an element of dynamic programming here as well.
      – And, “stochastic context-free grammars”, something I really don’t want to approach right now!
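To make the "start with short sequences, then build to larger ones" idea concrete, here is a minimal dynamic programming sketch in the style of the Nussinov algorithm. It is a deliberate simplification: it maximizes the number of nested base pairs rather than minimizing free energy, so it is not what production folding programs compute, and it cannot represent pseudoknots. The function name and the `min_loop` default are assumptions for this example.

```python
def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (A-U, G-C, G-U) in an RNA sequence.

    min_loop is the minimum number of unpaired bases required between the
    two partners of a pair (hairpin loops cannot be arbitrarily tight).
    """
    rna = rna.upper()
    n = len(rna)
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    # best[i][j] = best pair count for the subsequence rna[i..j] inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # shorter spans are solved first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]             # case 1: base j is unpaired
            for k in range(i, j - min_loop):   # case 2: base j pairs with base k
                if (rna[k], rna[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


# Toy hairpin, invented for illustration: a 3-pair stem closing an AAA loop.
print(max_base_pairs("GGGAAAUCC"))
```

The nesting assumption is what makes the table fill work: each subsequence is scored independently of what lies outside it, which is exactly the property a pseudoknot violates.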
  • 85. Finding tRNAs
    • tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
    • Looking for such sequences is well suited to a decision tree, a series of steps that the sequence must pass.
    • In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
  • 87. Table 1. Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database
    Gene ID       Similarity (%)  Length (bp)  Gene name                                                  E-value*
    GP:2905647    44.8            1,191        D-Arabinitol kinase (Klebsiella pneumoniae)                6.2e-15
    EGAD:22614    46.2            1,191        Gluconokinase (Bacillus subtilis)                          1.4e-13
    EGAD:20418    43.0            1,302        Xylulose kinase (Lactobacillus pentosus)                   4.8e-13
    EGAD:105114   43.4            1,320        Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)  4.7e-12
    GP:2895855    42.7            1,263        Xylulokinase (Lactobacillus brevis)                        1.0e-07
    EGAD:10899    45.4            1,296        Xylulose kinase (Escherichia coli)                         2.1e-06
    *E-value is a statistical measure of the significance of a BLAST search result.
  • 88. Figure 2. Diagram depicting how complete microbial genome sequence data can accelerate vaccine development (N. meningitidis). Steps shown in the figure:
    • All potential antigens are identified from the genome sequence: a total of 570 putative secreted proteins or surface proteins.
    • Protein expression: a total of ~350 recombinant proteins expressed in E. coli and used to immunize mice.
    • Immune sera screening: bactericidal activity; binding to surface of MenB cells.
    • Seven proteins selected for follow-up based on high titres.
    • Final candidate selection (selection of vaccine targets): two proteins were found to exhibit no sequence variability ➞ clinical trials.
  • 89. LGT
  • 90. Functional Annotation
  • 91. Functional Classification I: GO
    • The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
    • Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
    • There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (Ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
      – For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”.
    • The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
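The difference between a tree and a directed acyclic graph is easiest to see in code. The sketch below stores a toy ontology as a child-to-parents map and collects all ancestors of a term. The term names and their relationships are hypothetical, invented for this example, and are not taken from the real GO files.

```python
# Hypothetical toy ontology: each term maps to a list of parent terms.
# A term with two parents is what makes this a DAG rather than a tree.
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "toy term X": ["binding", "catalytic activity"],
}


def ancestors(term: str, parents: dict = PARENTS) -> set:
    """Return every less specific term reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:      # a shared ancestor is visited only once
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("toy term X")))
```

Because a gene product annotated with a specific term is implicitly annotated with all of that term's ancestors, this kind of traversal is the basic operation behind most GO-based summaries.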
  • 92. Functional Classification II: Enzyme Nomenclature
    • Enzyme functions: which reactants are converted to which products.
      – Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.
      – Often, two or more gene products in a genome will have the same E.C. number.
    • Enzyme functions are given unique numbers by the Enzyme Commission.
      – E.C. numbers are four integers separated by dots. The left-most number is the least specific.
      – For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:
        • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
        • EC 3.4 are hydrolases that act on peptide bonds
        • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
        • EC 3.4.11.4 are those that cleave off the amino-terminal amino acid from a tripeptide
    • Top level E.C. numbers:
      – E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
      – E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules
      – E.C. 3: hydrolases: splitting a molecule by adding water to a bond
      – E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
      – E.C. 5: isomerases: rearrangements of atoms within a molecule
      – E.C. 6: ligases: joining two molecules using energy from ATP
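The hierarchical reading of an E.C. number can be sketched in a few lines. The example below splits a number into its four levels, from least to most specific, and looks up the top-level class using the six classes listed on this slide. The function names are assumptions for this example.

```python
# Top-level classes exactly as listed on the slide.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_levels(ec: str) -> list:
    """Split an E.C. number into its four nested levels, least specific first."""
    number = ec.upper().lstrip("EC. ")   # accept "EC 3.4.11.4" or "E.C. 3.4.11.4"
    parts = number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return [".".join(parts[:k]) for k in range(1, 5)]


def ec_class(ec: str) -> str:
    """Name of the top-level class for an E.C. number."""
    return EC_CLASSES[int(ec_levels(ec)[0])]


print(ec_levels("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```

Comparing the level lists of two enzymes shows how far down the hierarchy they agree, which is a quick way to ask whether two annotations describe the same general chemistry.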
  • 93. Functional Prediction
    • BLAST searches.
    • HMMs (hidden Markov models) of specific genes or gene families (Pfam, TIGRfam, FIGfam).
    • Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
    • Cellular location predictions, especially for transmembrane proteins.
    • Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
    • Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.
      – Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
  • 94. Functional Prediction II: Membrane Spanning
    • Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
      – There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions.
      – There are also some that have covalently bound hydrophobic groups, such as myristate, a 14-carbon saturated fatty acid that is attached to the N-terminal amino group.
    • There are 2 main protein structures that cross membranes.
      – Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.
      – Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let water and other small hydrophilic molecules cross the membrane.
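The slide does not say how membrane-spanning helices are predicted. One classic approach, shown here as an illustration rather than as the method used in the course, is a Kyte-Doolittle hydropathy scan: average a per-residue hydrophobicity value over a sliding window and flag windows that score high. The window of 19 residues and the threshold of 1.6 are a commonly quoted rule of thumb and are treated here as assumptions; the function names and toy sequences are also invented for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein: str, window: int = 19) -> list:
    """Mean hydropathy of every full window (non-standard residues raise KeyError)."""
    protein = protein.upper()
    scores = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        scores.append(sum(KD[aa] for aa in segment) / window)
    return scores


def candidate_tm_starts(protein: str, window: int = 19, threshold: float = 1.6) -> list:
    """Start positions of windows hydrophobic enough to be candidate helices."""
    scores = hydropathy_windows(protein, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Toy protein, invented for illustration: a poly-leucine stretch between charged ends.
toy = "DDDD" + "L" * 19 + "KKKK"
print(candidate_tm_starts(toy))
```

A scan like this only suits alpha-helical spans; beta barrels alternate hydrophobic and hydrophilic residues and do not produce long high-scoring windows.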
  • 95. Functional Prediction by Phylogeny
    • Key step in genome projects
    • More accurate predictions help guide experimental and computational analyses
    • Many diverse approaches
    • All are improved by “phylogenomic” type analyses that integrate evolutionary reconstructions and an understanding of how new functions evolve
  • 96. Functional Prediction
    • Identification of motifs
      – Short regions of sequence similarity that are indicative of general activity
      – e.g., ATP binding
    • Homology/similarity based methods
      – Gene sequence is searched against a database of other sequences
      – If significantly similar genes are found, their functional information is used
    • Problem
      – Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
  • 97. Helicobacter pylori
  • 98. H. pylori genome - 1997
    “The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”
  • 99. MutL ??
    From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
  • 100. Phylogenetic Tree of MutS Family
    [Figure: phylogenetic tree of MutS family proteins; tips are labelled with abbreviated species names such as Ecoli, Bacsu, Helpy, Yeast, and Human.]
    Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
  • 101. MutS Subfamilies
    [Figure: the MutS family tree with its subfamilies labelled: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2.]
    Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
  • 102. Overlaying Functions onto Tree
    [Figure: the MutS family tree with subfamilies MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, and MSH2 labelled, and known functions overlaid on the branches.]
    Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
  • 103. MutS Subfamilies
    • MutS1: Bacterial MMR
    • MSH1: Euk - mitochondrial MMR
    • MSH2: Euk - all MMR in nucleus
    • MSH3: Euk - loop MMR in nucleus
    • MSH6: Euk - base:base MMR in nucleus
    • MutS2: Bacterial - function unknown
    • MSH4: Euk - meiotic crossing-over
    • MSH5: Euk - meiotic crossing-over
  • 104. Functional Prediction Using Tree
    [Figure: the MutS family tree with each subfamily labelled by its function.]
    • MSH1 - Mitochondrial Repair
    • MSH3 - Nuclear Repair of Loops
    • MSH6 - Nuclear Repair of Mismatches
    • MutS1 - Bacterial Mismatch and Loop Repair
    • MSH4 - Meiotic Crossing Over
    • MSH5 - Meiotic Crossing Over
    • MutS2 - Unknown Functions
    • MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
    Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
  • 105. Table 3. Presence of MutS Homologs in Complete Genome Sequences
    Species                                     # of MutS Homologs  Which Subfamilies?  MutL Homologs
    Bacteria
      Escherichia coli K12                      1                   MutS1               1
      Haemophilus influenzae Rd KW20            1                   MutS1               1
      Neisseria gonorrhoeae                     1                   MutS1               1
      Helicobacter pylori 26695                 1                   MutS2               -
      Mycoplasma genitalium G-37                -                   -                   -
      Mycoplasma pneumoniae M129                -                   -                   -
      Bacillus subtilis 168                     2                   MutS1, MutS2        1
      Streptococcus pyogenes                    2                   MutS1, MutS2        1
      Mycobacterium tuberculosis                -                   -                   -
      Synechocystis sp. PCC6803                 2                   MutS1, MutS2        1
      Treponema pallidum Nichols                1                   MutS1               1
      Borrelia burgdorferi B31                  2                   MutS1, MutS2        1
      Aquifex aeolicus                          2                   MutS1, MutS2        1
      Deinococcus radiodurans R1                2                   MutS1, MutS2        1
    Archaea
      Archaeoglobus fulgidus VC-16, DSM4304     -                   -                   -
      Methanococcus jannaschii DSM 2661         -                   -                   -
      Methanobacterium thermoautotrophicum ΔH   1                   MutS2               -
    Eukaryotes
      Saccharomyces cerevisiae                  6                   MSH1-6              3+
      Homo sapiens                              5                   MSH2-6              3+
  • 106. Blast Search of H. pylori “MutS”
    Sequences producing significant alignments:         Score (bits)  E value
    sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN    117           3e-25
    sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN    69            1e-10
    sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN    64            3e-09
    sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN    62            2e-08
    sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN    57            4e-07
    sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN    57            4e-07
    • The BLAST search pulls up Synechocystis sp. MutS#2 with a much higher score (and a much lower E value) than the other MutS homologs
    • Based on this, TIGR predicted this species had mismatch repair
    Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
  • 107. High Mutation Rate in H. pylori
    Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
  • 108. PHYLOGENETIC PREDICTION OF GENE FUNCTION (Phylogenomics)
    [Figure: the method applied to two examples (Example A and Example B) drawn from three species, alongside the actual evolution, which is assumed to be unknown. Inferred duplications are marked on the gene trees; some are marked as ambiguous.]
    METHOD
    • CHOOSE GENE(S) OF INTEREST
    • IDENTIFY HOMOLOGS
    • ALIGN SEQUENCES
    • CALCULATE GENE TREE
    • OVERLAY KNOWN FUNCTIONS ONTO TREE
    • INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
    Based on Eisen, 1998 Genome Res 8: 163-167.
  • 110. Chemosynthetic Symbionts
    Eisen et al. 1992. J. Bact. 174: 3416
  • 111. Carboxydothermus hydrogenoformans
    • Isolated from a Russian hot spring
    • Thermophile (grows at 80°C)
    • Anaerobic
    • Grows very efficiently on CO (carbon monoxide)
    • Produces hydrogen gas
    • Low GC Gram positive (Firmicute)
    • Genome determined (Wu et al. 2005 PLoS Genetics 1: e65.)
  • 112. Homologs of Sporulation Genes
    Wu et al. 2005 PLoS Genetics 1: e65.
  • 113. Carboxydothermus sporulates
    Wu et al. 2005 PLoS Genetics 1: e65.
  • 114. Non-Homology Predictions: Phylogenetic Profiling
    • Step 1: Search all genes in the organisms of interest against all other genomes
    • Step 2: Ask, yes or no, whether each gene is found in each other species
    • Step 3: Cluster genes by distribution patterns (profiles)
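The profiling steps above can be sketched with a small presence/absence table. In this minimal version, "clustering" simply groups genes with identical profiles; real analyses typically use a distance measure so that similar, not just identical, profiles fall together. The gene names, genome names, and hit table are hypothetical, and the search step that would produce the hits is assumed to have been run already.

```python
from collections import defaultdict


def build_profiles(presence: dict, genomes: list) -> dict:
    """Turn gene -> set of genomes with a hit into gene -> tuple of 0/1 flags."""
    return {
        gene: tuple(int(genome in hits) for genome in genomes)
        for gene, hits in presence.items()
    }


def cluster_by_profile(profiles: dict) -> dict:
    """Group genes that share exactly the same presence/absence profile."""
    clusters = defaultdict(list)
    for gene, profile in profiles.items():
        clusters[profile].append(gene)
    return dict(clusters)


# Hypothetical search results: which other genomes contain a homolog of each gene.
genomes = ["sp1", "sp2", "sp3"]
presence = {
    "geneA": {"sp1", "sp3"},
    "geneB": {"sp1", "sp3"},
    "geneC": {"sp2"},
}
profiles = build_profiles(presence, genomes)
for profile, genes in cluster_by_profile(profiles).items():
    print(profile, genes)
```

Genes that land in the same cluster are candidates for a shared function even when they have no sequence similarity to each other, which is why this counts as a non-homology prediction.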
  • 115. Sporulation Gene Profile
    Wu et al. 2005 PLoS Genetics 1: e65.
  • 116. B. subtilis new sporulation genes
  • 117. Functional Prediction III: Colocalization
    • Operon structure is often maintained over fairly large taxonomic ranges.
      – Sometimes gene order is altered, and sometimes one or more enzymes are missing.
      – But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.
    • This is an operon that contains part of the glycolytic pathway.
      – 1: phosphoglycerate mutase
      – 2: triosephosphate isomerase
      – 3: enolase
      – 4: phosphoglycerate kinase
      – 5: glyceraldehyde 3-phosphate dehydrogenase
      – 6: central glycolytic gene regulator
  • 118. Metabolic Predictions
  • 119. Comparative Genomics
  • 121. Using the Core
  • 122. Excerpt from Nature, vol. 406, 17 August 2000, p. 800 (www.nature.com). © 2000 Macmillan Magazines Ltd
    “... between even related species. Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree ... Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized (refs 10–12). For example, an ...”
    Table 2. Genome features from 24 microbial genome sequencing projects
    Organism                               Genome size (Mbp)  No. of ORFs (% coding)  Unknown function  Unique ORFs
    Aeropyrum pernix K1                    1.67               1,885 (89%)
    A. aeolicus VF5                        1.50               1,749 (93%)             663 (44%)         407 (27%)
    A. fulgidus                            2.18               2,437 (92%)             1,315 (54%)       641 (26%)
    B. subtilis                            4.20               4,779 (87%)             1,722 (42%)       1,053 (26%)
    B. burgdorferi                         1.44               1,738 (88%)             1,132 (65%)       682 (39%)
    Chlamydia pneumoniae AR39              1.23               1,134 (90%)             543 (48%)         262 (23%)
    Chlamydia trachomatis MoPn             1.07               936 (91%)               353 (38%)         77 (8%)
    C. trachomatis serovar D               1.04               928 (92%)               290 (32%)         255 (29%)
    Deinococcus radiodurans                3.28               3,187 (91%)             1,715 (54%)       1,001 (31%)
    E. coli K-12-MG1655                    4.60               5,295 (88%)             1,632 (38%)       1,114 (26%)
    H. influenzae                          1.83               1,738 (88%)             592 (35%)         237 (14%)
    H. pylori 26695                        1.66               1,589 (91%)             744 (45%)         539 (33%)
    Methanobacterium thermoautotrophicum   1.75               2,008 (90%)             1,010 (54%)       496 (27%)
    Methanococcus jannaschii               1.66               1,783 (87%)             1,076 (62%)       525 (30%)
    M. tuberculosis CSU#93                 4.41               4,275 (92%)             1,521 (39%)       606 (15%)
    M. genitalium                          0.58               483 (91%)               173 (37%)         7 (2%)
    M. pneumoniae                          0.81               680 (89%)               248 (37%)         67 (10%)
    N. meningitidis MC58                   2.24               2,155 (83%)             856 (40%)         517 (24%)
    Pyrococcus horikoshii OT3              1.74               1,994 (91%)             859 (42%)         453 (22%)
    Rickettsia prowazekii Madrid E         1.11               878 (75%)               311 (37%)         209 (25%)
    Synechocystis sp.                      3.57               4,003 (87%)             2,384 (75%)       1,426 (45%)
    T. maritima MSB8                       1.86               1,879 (95%)             863 (46%)         373 (26%)
    T. pallidum                            1.14               1,039 (93%)             461 (44%)         280 (27%)
    Vibrio cholerae El Tor N1696           4.03               3,890 (88%)             1,806 (46%)       934 (24%)
    Total                                  50.60              52,462 (89%)            22,358 (43%)      12,161 (23%)
  • 123. After the Genomes
    • Better analysis and annotation
    • Comparative genomics
    • Functional genomics (experimental analysis of gene function on a genome scale)
    • Genome-wide gene expression studies
    • Proteomics
    • Genome-wide genetic experiments