2. Environmental Genome Shotgun Sequencing of the Sargasso Sea
J. Craig Venter,1* Karin Remington,1 John F. Heidelberg,3 Aaron L. Halpern,2 Doug Rusch,2 Jonathan A. Eisen,3 Dongying Wu,3 Ian Paulsen,3 Karen E. Nelson,3 William Nelson,3 Derrick E. Fouts,3 Samuel Levy,2 Anthony H. Knap,6 Michael W. Lomas,6 Ken Nealson,5 Owen White,3 Jeremy Peterson,3 Jeff Hoffman,1 Rachel Parsons,6 Holly Baden-Tillson,1 Cynthia Pfannkoch,1 Yu-Hui Rogers,4 Hamilton O. Smith1
We have applied “whole-genome shotgun sequencing” to microbial populations
collected en masse on tangential flow and impact filters from seawater samples
collected from the Sargasso Sea near Bermuda. A total of 1.045 billion base pairs
RESEARCH ARTICLE
22. Four main challenges for de novo sequencing.
• Repeats.
• Low coverage.
• Errors.
These introduce breaks in the construction of contigs.
• Variation in coverage – transcriptomes and metagenomes, as well as amplified genomic DNA.
This challenges the assembler to distinguish between erroneous connections (e.g. repeats) and real connections.
Slide from talk by Titus Brown
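A toy illustration of why repeats break contigs. This is a minimal sketch that assumes a de Bruijn graph formulation, which is one common way assemblers represent reads and is not named on the slide. The genome string, read length, and k are made up for the example. A node with more than one successor is a point where the assembler cannot tell which continuation is the real one.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor: a contig cannot be extended unambiguously here."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Hypothetical genome in which "GATTACA" occurs twice, followed by different sequence.
genome = "CCGATTACATTGGCAGATTACAGGTC"
# Error-free reads of length 10 taken at every offset (idealized full coverage).
reads = [genome[i:i + 10] for i in range(len(genome) - 9)]
graph = de_bruijn_edges(reads, k=5)
print(branching_nodes(graph))
```

With these inputs the only branching node is the end of the repeated segment. Low coverage and sequencing errors cause breaks in a different way: the former removes edges, the latter adds spurious ones.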
23. More assembly issues
• Many parameters to optimize!
• Metagenomes have variation in copy number; naïve assemblers can treat this as repetitive and eliminate it.
• Assembly requires gobs of memory (4 lanes, 60 M reads => ~150 GB RAM)
• How do we evaluate assemblies?
– What’s the best assembler?
Slide from talk by Titus Brown
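One partial answer to "How do we evaluate assemblies?" is a contiguity statistic. The sketch below computes N50, a widely used summary that the slide itself does not name; it is offered here as an assumption about what an evaluation might start with. N50 says nothing about correctness, so a misassembled metagenome can still score well.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

# Hypothetical assemblies with the same total size (300 bases) but different contiguity.
few_long = [100, 80, 60, 40, 20]
many_short = [10] * 30
print(n50(few_long), n50(many_short))
```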
24. Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
Wolbachia Metagenomic Sequencing
shotgun
sequence
29.
Wu et al., 2004. Collaboration between Jonathan Eisen and Scott O’Neill (Yale, U. Queensland).
Wolbachia pipientis wMel
30. Pierce’s Disease
31. Glassy-winged sharpshooter
• Obligate xylem feeder
• Transmits Xylella between plants
• Much like mosquitoes transmit the malarial pathogen
• Only animal listed as possible “bioterror” agent by US DHS
Glassy-winged Sharpshooter: A Serious Threat to California Agriculture
From the University of California’s Pierce’s Disease Research and Emergency Response Task Force
This informational brochure was produced by ANR Communication Services for the University of California Pierce’s Disease Research and Emergency Response Task Force. You may download a copy of the brochure from the Division of Agriculture and Natural Resources web site at http://danr.ucop.edu or from the Communication Services web site at http://danrcs.ucdavis.edu.
For local information, contact your UC Cooperative Extension farm advisor.
[Figure: Glassy-winged Sharpshooter Generalized Lifecycle. Series: adults, egg masses. X axis: month, Jan. to Nov. Y axis: 0 to 100.]
Glassy-winged sharpshooters overwinter as adults and begin laying egg masses in late February through May. This first generation matures as adults in late May through late August. Second-generation egg masses are laid starting in mid-June through late September, which develop into over-wintering adults.
33. PCR and phylogenetic analysis of rRNA genes
[Figure: workflow from DNA extraction, to PCR (makes lots of copies of the rRNA genes in sample), to sequencing of rRNA genes, to sequence alignment (= data matrix), to phylogenetic tree. Taxa shown: rRNA1, Yeast, E. coli, Humans. rRNA1: 5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’]
[Image text, truncated in source: "r two obligate bacteriomes"; "cola” (Gammaproteobacteria)"; "roidetes)"; "domen of H. vitripennis"; credit: D Takiya]
Moran et al. 2003 Environ. Microbiol.
Moran et al. 2005 Appl. Environ. Microbiol.
34. Baumannia is a close relative of Buchnera symbionts of aphids
[Figure: phylogenetic tree with tips labeled by host: sharpshooter, aphids, ants, flies]
Wu et al. 2006 PLoS Biology 4: e188.
38. Sharpshooter Shotgun Sequencing
Wu et al. 2006 PLoS Biology 4: e188.
Collaboration with Nancy Moran’s lab
42. Predict metabolic networks from Genome
Wu et al. 2006 PLoS Biology 4: e188.
43. Baumannia is a Vitamin and Cofactor Producing Machine
Wu et al. 2006 PLoS Biology 4: e188.
44. How to do Binning
• Method 2: Assembly in More Complex Ecosystems
• How would this work?
• Advantages and disadvantages?
45. mited depth of se-
level of diversity
n (WGS) assembly
/EMBL/GenBank
AACY00000000,
osited in a corre-
hive. The version
the first version,
onventional WGS
t just contigs and
d paired singletons
order to accurate-
the sample and
tire sample with-
assemblies. Our
well-sampled ge-
folds with at least
were 333 scaffolds
nd spanning 30.9
able S3), account-
ds, or 25% of the
om this set of well-
ble to cluster and
sm; from the rare
d sequence similar-
with computational
alitative and quan-
nd functional diver-
ne environment.
riteria to sort the
entative organism
f coverage, oligo-
Fig. 2. Gene conservation among closely related Prochlorococcus. The outermost concentric circle of the diagram depicts the completed genomic sequence of Prochlorococcus marinus MED4 (11). Fragments from environmental sequencing were compared to this completed Prochlorococcus genome and are shown in the inner concentric circles and were given boxed outlines. Genes for the outermost circle have been assigned pseudospectrum colors based on the position of those genes along the chromosome, where genes nearer to the start of the genome are colored in red, and genes nearer to the end of the genome are colored in blue. Fragments from environmental sequencing were subjected to an analysis that identifies conserved gene order between those fragments and the completed Prochlorococcus MED4 genome. Genes on the environmental genome segments that exhibited conserved gene order are colored with the same color assignments as the Prochlorococcus MED4 chromosome. Colored regions on the environmental segments exhibiting color differences from the adjacent outermost concentric circle are the result of conserved gene order with other MED4 regions and probably represent chromosomal rearrangements. Genes that did not exhibit conserved gene order are colored in black.
47. Fig. 4)
separat
nation
nomic
greater
genome
Dis
contin
scaffold
and 9.3
single
10,000
ence of
the rem
SNP ra
a length
We clo
alignme
and we
distinct
related
creasin
(10), an
homog
consens
haploty
fold reg
Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.
48. DeLong Lab
• Studying Sar86 and other marine plankton
• Note: published one of the first genomic studies of uncultured microbes, in 1996
JOURNAL OF BACTERIOLOGY, Feb. 1996, p. 591–599, Vol. 178, No. 3
Copyright 1996, American Society for Microbiology
Characterization of Uncultivated Prokaryotes: Isolation and Analysis of a 40-Kilobase-Pair Genome Fragment from a Planktonic Marine Archaeon
JEFFEREY L. STEIN,1* TERENCE L. MARSH,2 KE YING WU,3 HIROAKI SHIZUYA,4 AND EDWARD F. DELONG3*
Recombinant BioCatalysis, Inc., La Jolla, California 92037 (1); Microbiology Department, University of Illinois, Urbana, Illinois 61801 (2); Department of Ecology, Evolution and Marine Biology, University of California, Santa Barbara, California 93106 (3); and Division of Biology, California Institute of Technology, Pasadena, California 91125 (4)
Received 14 July 1995/Accepted 14 November 1995
One potential approach for characterizing uncultivated prokaryotes from natural assemblages involves
genomic analysis of DNA fragments retrieved directly from naturally occurring microbial biomass. In this
study, we sought to isolate large genomic fragments from a widely distributed and relatively abundant but as
yet uncultivated group of prokaryotes, the planktonic marine Archaea. A fosmid DNA library was prepared
49. consensus without any apparent s
haplotypes, such as the Prochlor
fold region (Fig. 5). Indeed, the
cus scaffolds display considerab
ity not only at the nucleotide s
(Fig. 5) but also at the genomic
multiple scaffolds align with the s
the MED4 (11) genome but diff
or genomic island insertion, delet
ment events. This observation is c
previous findings (12). For insta
2221918 and 2223700 share gen
each other and MED4 but differ b
of 15 genes of probable phage
representing an integrated bacteri
genomic differences are display
in Fig. 2, where it is evident th
conflicting scaffolds can align w
region of the MED4 genome. M
of the Prochlorococcus MED4 g
aligned with Sargasso Sea sca
than 10 kb; however, there ap
couple of regions of MED4 that
sented in the 10-kb scaffolds
larger of these two regions (
PMM1277) consists primarily of
coding for surface polysaccharid
which may represent a MED4-sp
charide absent or highly diverge
gasso Sea Prochlorococcus bacte
ogeneity of the Prochlorococcussc
that the scaffolds are not derived
discrete strain, but instead probab
conglomerate assembled from a
closely related Prochlorococcus b
The gene complement of t
The heterogeneity of the Sarga
complicates the identification
genes. The typical approach for
notation, model-based gene find
tirely on training with a subse
Fig. 4. Circular diagrams of nine complete megaplasmids. Genes encoded in the forward direction
are shown in the outer concentric circle; reverse coding genes are shown in the inner concentric
circle. The genes have been given role category assignment and colored accordingly: amino acid
biosynthesis, violet; biosynthesis of cofactors, prosthetic groups, and carriers, light blue; cell
envelope, light green; cellular processes, red; central intermediary metabolism, brown; DNA
metabolism, gold; energy metabolism, light gray; fatty acid and phospholipid metabolism, magenta;
protein fate and protein synthesis, pink; purines, pyrimidines, nucleosides, and nucleotides, orange;
regulatory functions and signal transduction, olive; transcription, dark green; transport and binding
proteins, blue-green; genes with no known homology to other proteins and genes with homology
to genes with no known function, white; genes of unknown function, gray. Tick marks are placed
on 10-kb intervals.
50. d curated genes. With the vast ma-
Sargasso sequence in short (less
unassociated scaffolds and single-
undreds of different organisms, it is
o apply this approach. Instead, we
n evidence-based gene finder (5).
dence in the form of protein align-
quences in the bacterial portion of
ndant amino acid (nraa) data set
sed to determine the most likely
e. Likewise, approximate start and
s were determined from the bound-
tes of the alignments and refined to
cific start and stop codons. This
entified 1,214,207 genes covering
B of the total data set. This repre-
ximately an order of magnitude
nces than currently archived in the
ssProt database (14), which con-
51. MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea
Venter et al., revised
Figure S10. Scaffold 2217664, containing the gene encoding Proteorhodopsin. Genes are
colored using color assignments described in Fig. 2, and contig boundaries are indicated
with red vertical lines. In this scaffold, rhodopsin is associated with a DNA-directed
RNA polymerase, sigma subunit (rpoD) originating in the CFB group.
52. MS 1093857: Environmental Genome Shotgun Sequencing of the Sargasso Sea
Venter et al., revised
Figure S7. Each point in the figure corresponds to a scaffold from the assembly
(restricted to scaffolds > 10kb). Scaffolds were placed in separate panels of the figure
according to the most closely related organism as indicated by the BLAST searches
described in the text. Within a panel, a scaffold is shown with x coordinate equal to its
length, y coordinate equal to its estimated depth of coverage, and color determined by
which of 6 k-mer composition clusters it was assigned to. Depth of coverage was
estimated as the total base pairs in reads belonging to a given assembly piece divided by
the length of the consensus sequence for the piece. K-mer composition clusters were
determined by representing each scaffold as a vector of the frequencies of all possible 4-
mers, considering both the forward and reverse strands of the sequence, and then
applying the K-means clustering algorithm.
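The two per-scaffold quantities described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, windows containing non-ACGT characters are skipped (the caption does not say how ambiguous bases were handled), and the clustering step itself is left out.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 possible 4-mers

def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters pass through unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]

def tetramer_frequencies(seq):
    """Vector of 256 4-mer frequencies, counted on both the forward and reverse strands."""
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if set(kmer) <= set("ACGT"):  # assumption: skip windows with N or other codes
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]

def depth_of_coverage(read_lengths, consensus_length):
    """Total base pairs in the reads of an assembly piece divided by its consensus length."""
    return sum(read_lengths) / consensus_length

# Hypothetical scaffold and reads, for illustration only.
vector = tetramer_frequencies("ACGTACGTACGGTTAACC")
depth = depth_of_coverage([500, 600, 400], consensus_length=1000)
print(len(vector), depth)
```

One vector per scaffold would then be handed to a K-means implementation (for example `sklearn.cluster.KMeans`, with 6 clusters to match the caption). Counting both strands makes the vector independent of which strand the scaffold happened to be assembled on.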
53. How to do Binning
• Method 2: Assembly in More Complex Ecosystems
• How would this work?
• Advantages and disadvantages?
54. How to do Binning
• Method 3: Composition of Reads / Contigs
• How would this work?
• Advantages and disadvantages?
56. Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
Abundance 1:1 Abundance 20:1
57. How to do Binning
• Method 4: Coverage of Reads / Contigs
• How would this work?
• Advantages and disadvantages?
59. How to do Binning
• Method 5: Phylogenetic Analysis of Reads / Contigs
• How would this work?
• Advantages and disadvantages?
60. Baumannia is a Vitamin and Cofactor Producing Machine
Wu et al. 2006 PLoS Biology 4: e188.
61. Baumannia is a Vitamin and Cofactor Producing Machine
No pathways for essential amino acid synthesis
Wu et al. 2006 PLoS Biology 4: e188.
64. ???????
65. PCR and phylogenetic analysis of rRNA genes
[Figure: workflow from DNA extraction, to PCR (makes lots of copies of the rRNA genes in sample), to sequencing of rRNA genes, to sequence alignment (= data matrix), to phylogenetic tree. Taxa shown: rRNA1, rRNA2, E. coli, Humans. rRNA1: 5’ ...ACACACATAGGTGGAGCTAGCGATCGATCGA... 3’. rRNA2: 5’ ...TACAGTATAGGTGGAGCTAGCGATCGATCGA... 3’]
Moran et al. 2003 Environ. Microbiol.
Moran et al. 2005 Appl. Environ. Microbiol.
67. Commonly Used Binning Methods Did not Work Well
• Assembly
– Only Baumannia generated good contigs
• Depth of coverage
– Everything else 0-1X coverage
• Nucleotide composition
– No detectable peaks in any vector we looked at
68. Binning challenge
[Figure: items labeled A, B, C, D, E, F, G and T, U, V, W, X, Y, Z]
69. Binning challenge
Best binning method: reference genomes
70. Binning challenge
No reference genome? What do you do?
Phylogeny ....
71. CFB Phyla
72. Wu et al. 2006 PLoS Biology 4: e188.
Baumannia makes vitamins and cofactors
Sulcia makes amino acids
73. How to do Binning
• Method 6: Long Reads
• How would this work?
• Advantages and disadvantages?
74. How to do Binning
• Method 7: Cross-linking
• How would this work?
• Advantages and disadvantages?
75. HiC
From Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods. 2012 Nov;58(3):268-76. doi: 10.1016/j.ymeth.2012.05.001.
76. HiC Crosslinking & Sequencing
Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore
RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-
level deconvolution of a synthetic metagenome by
sequencing proximity ligation products. PeerJ 2:e415
http://dx.doi.org/10.7717/peerj.415
Table 1 Species alignment fractions. The number of reads aligning to each replicon present in the
synthetic microbial community are shown before and after filtering, along with the percent of total
constituted by each species. The GC content (“GC”) and restriction site counts (“#R.S.”) of each replicon,
species, and strain are shown. Bur1: B. thailandensis chromosome 1. Bur2: B. thailandensis chromosome
2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2: L. brevis plasmid 2, Ped: P. pentosaceus,
K12: E. coli K12 DH10B, BL21: E. coli BL21. An expanded version of this table can be found in Table S2.
Sequence   Alignment    % of Total   Filtered     % of aligned   Length      GC      #R.S.
Lac0       10,603,204   26.17%       10,269,562   96.85%         2,291,220   0.462   629
Lac1       145,718      0.36%        145,478      99.84%         13,413      0.386   3
Lac2       691,723      1.71%        665,825      96.26%         35,595      0.385   16
Lac        11,440,645   28.23%       11,080,865   96.86%         2,340,228   0.46    648
Ped        2,084,595    5.14%        2,022,870    97.04%         1,832,387   0.373   863
BL21       12,882,177   31.79%       2,676,458    20.78%         4,558,953   0.508   508
K12        9,693,726    23.92%       1,218,281    12.57%         4,686,137   0.507   568
E. coli    22,575,903   55.71%       3,894,739    17.25%         9,245,090   0.51    1076
Bur1       1,886,054    4.65%        1,797,745    95.32%         2,914,771   0.68    144
Bur2       2,536,569    6.26%        2,464,534    97.16%         3,809,201   0.672   225
Bur        4,422,623    10.91%       4,262,279    96.37%         6,723,972   0.68    369
Figure 1 Hi-C insert distribution. The distribution of genomic distances between Hi-C read pairs is
shown for read pairs mapping to each chromosome. For each read pair the minimum path length on
the circular chromosome was calculated and read pairs separated by less than 1000 bp were discarded.
The 2.5 Mb range was divided into 100 bins of equal size and the number of read pairs in each bin
was recorded for each chromosome. Bin values for each chromosome were normalized to sum to 1 and
plotted.
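The binning procedure in this caption can be sketched as below. This is a rough illustration rather than the authors' pipeline: the function names are hypothetical, the read-pair coordinates are invented, and dropping pairs at or beyond the 2.5 Mb range is an assumption the caption does not state.

```python
def circular_distance(pos_a, pos_b, chrom_length):
    """Minimum path length between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % chrom_length
    return min(d, chrom_length - d)

def insert_distribution(pairs, chrom_length, max_dist=2_500_000, n_bins=100, min_dist=1000):
    """Histogram of read-pair distances in n_bins equal bins, normalized to sum to 1."""
    counts = [0] * n_bins
    bin_width = max_dist / n_bins
    for a, b in pairs:
        d = circular_distance(a, b, chrom_length)
        if d < min_dist or d >= max_dist:  # discard close pairs; out-of-range is an assumption
            continue
        counts[int(d // bin_width)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

# Hypothetical read pairs on a chromosome the length of the E. coli K12 assembly (4,686,137 bp).
K12_LENGTH = 4_686_137
pairs = [
    (10, 4_686_100),              # 47 bp apart across the linearization point: discarded
    (0, 30_000),                  # 30 kb apart
    (0, 26_000),                  # 26 kb apart
    (100, K12_LENGTH - 49_900),   # 50 kb apart going the short way around the circle
]
dist = insert_distribution(pairs, K12_LENGTH)
print(dist[1], dist[2])
```

Measuring distance the short way around the circle is what keeps pairs that span the linearization point from looking megabases apart.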
E. coli K12 genome were distributed in a similar manner as previously reported (Fig. 1;
(Lieberman-Aiden et al., 2009)). We observed a minor depletion of alignments spanning
the linearization point of the E. coli K12 assembly (e.g., near coordinates 0 and 4686137)
due to edge effects induced by BWA treating the sequence as a linear chromosome rather
than circular.
Figure 2 Metagenomic Hi-C associations. The log-scaled, normalized number of Hi-C read pairs
associating each genomic replicon in the synthetic community is shown as a heat map (see color scale,
blue to yellow: low to high normalized, log scaled association rates). Bur1: B. thailandensis chromosome
1. Bur2: B. thailandensis chromosome 2. Lac0: L. brevis chromosome, Lac1: L. brevis plasmid 1, Lac2:
L. brevis plasmid 2, Ped: P. pentosaceus, K12: E. coli K12 DH10B, BL21: E. coli BL21.
reference assemblies of the members of our synthetic microbial community with the same
alignment parameters as were used in the top ranked clustering (described above). We first
Figure 3 Contigs associated by Hi-C reads. A graph is drawn with nodes depicting contigs and edges
depicting associations between contigs as indicated by aligned Hi-C read pairs, with the count thereof
depicted by the weight of edges. Nodes are colored to reflect the species to which they belong (see legend)
with node size reflecting contig size. Contigs below 5 kb and edges with weights less than 5 were excluded.
Contig associations were normalized for variation in contig size.
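A small sketch of how a contig graph like this one might be built and then split into candidate bins. Several things here are assumptions and not taken from the paper: the threshold of 5 is applied to raw read-pair counts, the size normalization divides by the product of the two contig lengths (in kb), and connected components stand in as the simplest possible grouping step. All names and numbers are hypothetical.

```python
from collections import defaultdict

def filter_contig_graph(contig_lengths, pair_counts, min_contig=5000, min_weight=5):
    """Keep contigs of at least min_contig bp and edges with at least min_weight Hi-C read pairs."""
    graph = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        if a == b or n < min_weight:
            continue
        if contig_lengths[a] < min_contig or contig_lengths[b] < min_contig:
            continue
        # Assumed normalization: read pairs per kb x kb of contig length.
        weight = n / (contig_lengths[a] * contig_lengths[b] / 1e6)
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)

def connected_components(graph):
    """Group contigs that are linked, directly or indirectly, by retained edges."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node])
        components.append(component)
    return components

# Hypothetical contigs (lengths in bp) and Hi-C read-pair counts between them.
lengths = {"c1": 10_000, "c2": 8_000, "c3": 6_000, "c4": 3_000, "c5": 7_000}
counts = {("c1", "c2"): 40, ("c2", "c3"): 3, ("c3", "c5"): 12, ("c1", "c4"): 50}
graph = filter_contig_graph(lengths, counts)
bins = sorted(sorted(c) for c in connected_components(graph))
print(bins)
```

In a real community a few spurious inter-species links would merge everything into one component, so a weighted clustering method would be needed in practice.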
typically represent the reads and variant sites as a variant graph wherein variant sites are
represented as nodes, and sequence reads define edges between variant sites observed in
the same read (or read pair). We reasoned that variant graphs constructed from Hi-C
data would have much greater connectivity (where connectivity is defined as the mean
path length between randomly sampled variant positions) than graphs constructed from
mate-pair sequencing data, simply because Hi-C inserts span megabase distances. Such
Figure 4 Hi-C contact maps for replicons of Lactobacillus brevis. Contact maps show the number of
Hi-C read pairs associating each region of the L. brevis genome. The L. brevis chromosome (Lac0, (A),
Chris Beitel
@datscimed
Aaron Darling
@koadman
77. How to do Binning
• Method 8: Experiments …
• How would this work?
• Advantages and disadvantages?