Shotgun
metagenomics
C. Titus Brown
UC Davis / School of Veterinary Medicine
ctbrown@ucdavis.edu
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun sequencing.png
Computational protocol for
assembly
Genome assembly works pretty well
Metric                      Velvet      IDBA        SPAdes
Total length (>= 0 bp) 1.6E+08 2.0E+08 2.0E+08
Total length (>= 1000 bp) 1.6E+08 1.9E+08 1.9E+08
Largest contig 561,449 979,948 1,387,918
# misassembled contigs 631 1032 752
Genome fraction (%) 72.949 90.969 90.424
Duplication ratio 1.004 1.007 1.004
Conclusion: SPAdes and IDBA achieve similar results.
Dr. Sherine Awad
Evaluating short-read data /
coverage plots
Assembly starts to work @ ~10x
What’s coming?
Lots more data.
Who ya gonna call??
…to do your bioinformatics?
Topics to explore -
• Experimental design to enable good data analysis.
• Extracting whole genomes and evaluating strain
variation.
• Technology review and open problems.
• Assembly graphs.
• Long reads and infinite sequencing – what next?
Or just a Q&A.
I can talk for 90 minutes straight, too, but I’d rather not.
Assembly graphs
Assembly “graphs” are a way of representing
sequencing data that retains all variation;
assemblers then crunch this down into a single
sequence.
Image from Iqbal et al., 2012. PMID 23172865
Long read sequencing
See:
http://angus.readthedocs.org/en/2015/_static/Torsten_Seemann_LRS.pdf
& nanopore etc.
Intro, overview,
and case studies
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun sequencing.png
FASTQ etc.
@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
Name (the @ header line); quality score (the fourth line, one character per base).
Note: /1 and /2 suffixes => paired reads, interleaved in one file
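As a concrete illustration (not part of the original slides), here is a minimal Python sketch that parses an interleaved FASTQ file like the one above into read pairs; the filename in the usage comment is a placeholder.

def read_fastq(path):
    """Yield (name, sequence, quality) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            name = fh.readline().strip()
            if not name:
                break                      # end of file
            seq = fh.readline().strip()
            fh.readline()                  # the '+' separator line
            qual = fh.readline().strip()
            yield name[1:], seq, qual      # drop the leading '@'

def read_pairs(path):
    """Pair up consecutive records from an interleaved FASTQ file (/1 then /2)."""
    records = read_fastq(path)
    for r1 in records:
        r2 = next(records)
        assert r1[0].rsplit("/", 1)[0] == r2[0].rsplit("/", 1)[0], "not interleaved?"
        yield r1, r2

# Example usage (placeholder filename):
# for (name1, seq1, q1), (name2, seq2, q2) in read_pairs("interleaved.fastq"):
#     ...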
Shotgun sequencing &
assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Dealing with shotgun
metagenome reads.
1) Annotate/analyze individual reads.
2) Assemble reads into contigs, genomes, etc.
Annotating individual
reads
• Works really well when you have EITHER
o (a) evolutionarily close references
o (b) rather long sequences
(This is obvious, right?)
Annotating individual
reads #2
• We have found that this does not work well
with Illumina samples from unexplored
environments (e.g. soil).
• Sensitivity is fine (correct match is usually
there)
• Specificity is bad (correct match may be
drowned out by incorrect matches)
Assembly: good.
Howe et al., 2014, PMID 24632729
Assemblies yield much
more significant
homology matches.
Recommendation:
For reads < 200-300 bp,
• Annotate individual reads for human-
associated samples, or exploration of well-
studied systems.
• Otherwise, look to assembly.
So, why assemble?
• Increase your ability to assign
homology/orthology correctly!!
Essentially all functional annotation systems
depend on sequence similarity to assign
homology. This is why you want to assemble
your data.
Why else would you want
to assemble?
• Assemble new “reference”.
• Look for large-scale variation from reference
– pathogenicity islands, etc.
• Discriminate between different members of
gene families.
• Discover operon assemblages & annotate
on co-incidence of genes.
• Reduce size of data!!
Why don’t you want to
assemble??
• Abundance threshold – low-coverage filter.
• Strain variation
• Chimerism
Assembly, in
general
Short read lengths are
hard.
Whiteford et al., Nuc. Acid Res, 2005
Short read lengths are
hard.
Whiteford et al., Nuc. Acid Res, 2005
Conclusion: even with
a read length of 200, the
E. coli genome cannot be
assembled completely.
Why?
Short read lengths are
hard.
Whiteford et al., Nuc. Acid Res, 2005
Conclusion: even with
a read length of 200, the
E. coli genome cannot be
assembled completely.
Why? REPEATS.
This is why paired-end
sequencing is so important
for assembly.
Four main challenges for
de novo sequencing.
• Repeats.
• Low coverage.
• Errors
These introduce breaks in the
construction of contigs.
• Variation in coverage – transcriptomes and
metagenomes, as well as amplified genomic DNA.
This challenges the assembler to distinguish between
erroneous connections (e.g. repeats) and real
connections.
Repeats
• Overlaps don’t place sequences uniquely when
there are repeats present.
UMD assembly primer (cbcb.umd.edu)
Metagenomics: Mixed
community sampling
Coverage distribution
Conclusions re mixed
community sampling
In shotgun metagenomics, you are sampling
randomly from the mixed population.
Therefore, the lowest abundance member of the
population (that you want to observe) drives the
required depth of sequencing!
1 in a million => ~50 Tbp sequencing for 10x coverage.
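A quick back-of-envelope check of that figure (assuming a ~5 Mbp genome, the same assumption used in the “How much? (II)” slide later in the deck):

# Check of the "1 in a million => ~50 Tbp for 10x" claim (assumes a ~5 Mbp genome).
target_coverage = 10        # x
genome_size = 5e6           # bp
rarity = 1e6                # organism of interest is 1 in a million

required_bp = target_coverage * genome_size * rarity
print(f"{required_bp / 1e12:.0f} Tbp")   # -> 50 Tbp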
Assembly depends on high coverage
HMP mock community assembly; Velvet-based protocol.
Conclusions from
previous slide
• To recover any contigs at all, you need coverage >
10 (green line).
• To recover long sequences, you want coverage >
20 (blue line).
A few assembly stories
• Low complexity/Osedax symbionts
• High complexity/soil.
Osedax symbionts
Table S1. Specimens, and collection sites, used in this study (Osedax frankpressi with Rs1 symbiont).

Site     Dive¹    Date       Time (months)   # of specimens
2890 m   T486     Oct 2002   8               2
         T610     Aug 2003   18              3
         T1069    Jan 2007   59              2
         DR010    Mar 2009   85              3
         DR098    Nov 2009   93              4
         DR204    Oct 2010   104             1
         DR234    Jun 2011   112             2
1820 m   T1048    Oct 2006   7               3
         T1071    Jan 2007   10              3
         DR012    Mar 2009   36              4
         DR236²   Jun 2011   63              2

¹ Dive numbers begin with the remotely operated vehicle name; T = Tiburon, DR = Doc Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute).
² Samples used for genomic analysis were collected during dive DR236.

(Photo: O. frankpressi from DR236.)
Goffredi et al., submitted.
(Figure: 16S vs. shotgun results.)
Metagenomic assembly followed by
binning enabled isolation of fairly
complete genomes
Assembly allowed genomic content
comparisons to nearest cultured relative
Osedax assembly story
Conclusions include:
• Osedax symbionts have genes needed for
free living stage;
• metabolic versatility in carbon, phosphate,
and iron uptake strategies;
• Genome includes mechanisms for
intracellular survival, and numerous potential
virulence capabilities.
Osedax assembly story
• Low diversity metagenome!
• Physical isolation => MDA => sequencing =>
diginorm => binning =>
o 94% complete Rs1
o 66-89% complete Rs2
• Note: many interesting critters are hard to
isolate => so, basically, metagenomes.
Human-associated
communities
• “Time series community genomics analysis
reveals rapid shifts in bacterial species,
strains, and phage during infant gut
colonization.” Sharon et al. (Banfield lab);
Genome Res. 2013 23: 111-120
Setup
• Collected 11 fecal samples from a premature
female infant, days 15-24.
• 260 million 100-bp paired-end Illumina HiSeq
reads; 400 and 900 base fragments.
• Assembled > 96% of reads into contigs >
500bp; 8 complete genomes; reconstructed
genes down to 0.05% of population
abundance.
Sharon et al., 2013; pmid 22936250
Key strategy: abundance
binning
Bin reads by k-mer abundance.
Assemble the most abundant bin.
Remove reads that map to the assembly.
Sharon et al., 2013; pmid 22936250
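A minimal sketch of the k-mer abundance binning idea (my illustration, not the Sharon et al. code): compute each read's median k-mer abundance from a global k-mer count, then group reads into coarse abundance bins. The k value and bin edges below are arbitrary placeholders.

from collections import Counter

K = 20
BIN_EDGES = [0, 10, 50, 200, float("inf")]   # placeholder abundance bin edges

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def median(values):
    s = sorted(values)
    return s[len(s) // 2] if s else 0

def bin_reads_by_abundance(reads):
    """reads: list of DNA strings. Returns {bin_index: [reads]}."""
    counts = Counter()                       # global k-mer abundances
    for r in reads:
        counts.update(kmers(r))

    bins = {}
    for r in reads:
        med = median([counts[km] for km in kmers(r)])
        for i in range(len(BIN_EDGES) - 1):
            if BIN_EDGES[i] <= med < BIN_EDGES[i + 1]:
                bins.setdefault(i, []).append(r)
                break
    return bins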
Sharon et al., 2013; pmid 22936250
Tracking abundance
Sharon et al., 2013; pmid 22936250
Conclusions
• Recovered strain variation, phage variation,
abundance variation, lateral gene transfer.
• Claim that “recovered genomes are
superior to draft genomes generated in
most isolate genome sequencing projects.”
Environmental
metagenomics:
Deepwater Horizon spill
• “Transcriptional response of bathypelagic
marine bacterioplankton to the Deepwater
Horizon oil spill.” Rivers et al., 2013, Moran
lab. Pmid 23902988.
Sequencing strategy
Protocol for messenger RNA
extraction, assembly,
downstream analysis, including
differential expression.
Note: used overlapping paired-
end reads.
Great Prairie Grand
Challenge - soil
• Together with Janet Jansson, Jim Tiedje, et
al.
• “Can we make sense of soil with deep
Illumina sequencing?”
Great Prairie Grand
Challenge - soil
• What ecosystem level functions are present, and
how do microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without
shotgun sequencing:
o What kind of strain-level heterogeneity is present in the population?
o What does the phage and viral population look like?
o What species are where?
A “Grand Challenge”
dataset (DOE/JGI)
(Bar chart: basepairs of sequencing (Gbp), GAII vs. HiSeq, per sample: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass.)
For scale: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.
Total: 1,846 Gbp soil metagenome
Approach 1: Digital normalization
(a computational version of library normalization)
Suppose species A is 10x more abundant
than species B. To get 10x coverage of B,
you need 100x coverage of A. Overkill!!
That extra 100x consumes disk space and,
because of errors, memory.
We can discard it for you…
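A minimal sketch of the digital normalization idea (my illustration, not the khmer implementation): keep a read only if its median k-mer abundance, in the counts accumulated so far, is still below the target coverage C. The real tool (khmer's normalize-by-median) does the same thing with a fixed-memory probabilistic counter.

from collections import Counter

K = 20
CUTOFF = 10   # target coverage C

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def median(values):
    s = sorted(values)
    return s[len(s) // 2] if s else 0

def digital_normalize(reads, cutoff=CUTOFF):
    """Yield a subset of reads whose estimated coverage stays <= cutoff."""
    counts = Counter()
    for read in reads:
        kms = kmers(read)
        if median([counts[km] for km in kms]) < cutoff:
            counts.update(kms)    # only count reads we keep
            yield read
        # otherwise: the read is redundant at >= cutoff coverage; discard it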
Approach 2: Data partitioning
(a computational version of cell sorting)
Split reads into “bins”
belonging to different
source species.
Can do this based
almost entirely on
connectivity of
sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
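A toy sketch of connectivity-based partitioning (my illustration; the real implementation works on a compact De Bruijn graph in low memory): reads that share a k-mer, directly or transitively, end up in the same partition, here via a union-find structure.

K = 31

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def partition_reads(reads, k=K):
    """Group reads into partitions of (transitively) k-mer-connected reads."""
    uf = UnionFind()
    kmer_owner = {}                      # first read seen containing each k-mer
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            km = read[i:i + k]
            if km in kmer_owner:
                uf.union(idx, kmer_owner[km])
            else:
                kmer_owner[km] = idx
    partitions = {}
    for idx in range(len(reads)):
        partitions.setdefault(uf.find(idx), []).append(reads[idx])
    return list(partitions.values())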
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)
Total Assembly    Total Contigs (> 300 bp)    % Reads Assembled    Predicted protein coding
2.5 billion bp    4.5 million                 19%                  5.3 million
3.5 billion bp    5.9 million                 22%                  6.8 million
Adina Howe
Resulting contigs are low
coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes (corn and prairie panels).
Iowa prairie & corn -
very even.
Taxonomy – Iowa prairie
Note: this is predicted taxonomy of contigs w/o considering abundance
or length. (MG-RAST)
Taxonomy – Iowa corn
Note: this is predicted taxonomy of contigs w/o
considering abundance or length. (MG-RAST)
Strain variation?
(Plot: top two allele frequencies vs. position within contig.)
Of 5000 most
abundant
contigs, only 1
has a
polymorphism
rate > 5%
Can measure by
read mapping.
Concluding thoughts on
assembly (I)
• There’s no standard approach yet; almost
every paper uses a specialized pipeline of
some sort.
o More like genome assembly
o But unlike transcriptomics…
Concluding thoughts on
assembly (II)
• Anecdotally, everyone worries about strain
variation.
o Some groups (e.g. Banfield, us) have found that
this is not a problem in their system so far.
o Others (viral metagenomes! HMP!) have found
this to be a big concern.
Concluding thoughts on
assembly (III)
• Some groups have found metagenome assembly
to be very useful.
• Others (us! soil!) have not yet proven its utility.
High coverage is critical.
Low coverage is the dominant
problem blocking assembly of
metagenomes
Questions --
• How much sequencing should I do?
• How do I evaluate metagenome
assemblies?
• Which assembler is best?
Assembly strategies &
techniques
Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
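A worked version of this definition (illustrative numbers only): expected coverage = (number of reads x read length) / genome size.

n_reads = 3_000_000        # illustrative numbers only
read_len = 100             # bp
genome_size = 30_000_000   # bp

coverage = n_reads * read_len / genome_size
print(f"~{coverage:.0f}x coverage")   # -> ~10x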
Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
Coverage distribution
matters!
(MDA-amplified)
Assembly depends on high
coverage
Shared low-level
transcripts may
not reach the
threshold for
assembly.
K-mer based assemblers
scale poorly
Why do big data sets require big machines??
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
De Bruijn graphs scale poorly with data size
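To make the scaling argument concrete, here is a toy back-of-envelope model (my illustration; the constants are placeholders, not measurements from Conway & Bromage): the number of true k-mers plateaus at roughly the genome size, while error k-mers keep accumulating with every read.

# Toy model: distinct k-mers in a De Bruijn graph as sequencing depth grows.
genome_size = 5e6        # bp (illustrative)
read_len = 100
k = 31
error_rate = 0.01        # per-base error rate

def distinct_kmers(n_reads):
    true_kmers = genome_size                        # plateaus once genome is covered
    # each sequencing error creates up to k novel (mostly unique) k-mers
    error_kmers = n_reads * read_len * error_rate * k
    return true_kmers + error_kmers

for n in (1e6, 1e7, 1e8):
    print(f"{n:.0e} reads -> ~{distinct_kmers(n):.2e} distinct k-mers")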
Practical memory
measurements
Velvet measurements (Adina Howe)
How much data do we
need? (I)
(“More” is rather vague…)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)
Total Assembly    Total Contigs (> 300 bp)    % Reads Assembled    Predicted protein coding
2.5 billion bp    4.5 million                 19%                  5.3 million
3.5 billion bp    5.9 million                 22%                  6.8 million
Adina Howe
Resulting contigs are low
coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
How much? (II)
• Suppose we need 10x coverage to assemble a
microbial genome, and microbial genomes average
5e6 bp of DNA.
• Further suppose that we want to be able to assemble
a microbial species that is “1 in 100,000”, i.e. 1 in
1e5.
• Shotgun sequencing samples randomly, so we must
sample deeply to be sensitive.
10x coverage x 5e6 bp x 1e5 = 5e12 bp, or 5 Tbp of
sequence.
Currently this would cost approximately $100k, for 10 full
Illumina runs, but in a year we will be able to do it for
much less.
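The same arithmetic wrapped in a small helper (my code; the defaults are this slide's assumptions), so other abundances can be plugged in:

def required_sequencing_bp(coverage=10, genome_size=5e6, rarity=1e5):
    """Bp of shotgun sequencing needed to reach `coverage` of a genome
    present at a relative abundance of 1/`rarity` in the community."""
    return coverage * genome_size * rarity

print(f"{required_sequencing_bp() / 1e12:.0f} Tbp")            # 1 in 1e5 -> 5 Tbp
print(f"{required_sequencing_bp(rarity=1e6) / 1e12:.0f} Tbp")  # 1 in 1e6 -> 50 Tbp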
We can estimate
sequencing req’d:
http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
“Whoa, that’s a lot of
data…”
http://ivory.idyll.org/blog/how-much-sequencing-is-needed.html
Some approximate
metagenome sizes
• Deep carbon mine data set: 60 Mbp (x 10x => 600
Mbp)
• Great Prairie soil: 12 Gbp (4x human genome)
• Amazon Rain Forest Microbial Observatory soil: 26
Gbp
Partitioning separates reads by genome.
When computationally spiking HMP mock data with one E.
coli genome (left) or multiple E. coli strains (right), majority of
partitions contain reads from only a single genome (blue) vs
multi-genome partitions (green).
Partitions containing spiked data indicated with a *.
Adina Howe
Is your assembly good?
• Truly reference-free assembly is hard to evaluate.
• Traditional “genome” measures like N50 and NG50
simply do not apply to metagenomes, because
very often you don’t know what the genome “size”
is.
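Since N50 and NG50 come up here, a quick sketch of how they are computed (standard definitions; the toy numbers are mine): N50 uses the total assembly size, while NG50 uses an estimated genome size, which is exactly what is unknown for a metagenome.

def n50(contig_lengths, total=None):
    """Length L such that contigs >= L contain half of `total` bp.
    With total=None this is N50; pass an estimated genome size to get NG50."""
    lengths = sorted(contig_lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= total / 2:
            return length
    return 0    # undefined: assembly covers less than half of `total`

contigs = [5000, 4000, 3000, 1000, 500]   # toy example
print(n50(contigs))                        # N50 = 4000
print(n50(contigs, total=30000))           # NG50 for an assumed 30 kb genome -> 0 here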
Evaluating assembly
(Diagram: a predicted genome; reads are noisy observations of some genome, with errors marked X; an assembler sits in between as a Big Black Box.)
Evaluating correctness of metagenomes is still undiscovered country.
Assembler overlap?
See tutorial.
What’s the best
assembler?
(Chart: cumulative Z-score from ranking metrics, bird assembly; entrants in the order shown: BCM*, BCM, ALLP, NEWB, SOAP**, MERAC, CBCB, SOAP*, SOAP, SGA, PHUS, RAY, MLK, ABL.)
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
What’s the best
assembler?
(Chart: cumulative Z-score from ranking metrics, fish assembly; entrants in the order shown: BCM, CSHL, CSHL*, SYM, ALLP, SGA, SOAP*, MERAC, ABYSS, RAY, IOB, CTD, IOB*, CSHL**, CTD**, CTD*.)
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
What’s the best
assembler?
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
(Chart: cumulative Z-score from ranking metrics, snake assembly; entrants in the order shown: SGA, PHUS, MERAC, SYMB, ABYSS, CRACS, BCM, RAY, SOAP, GAM, CURT.)
Answer: for eukaryotic
genomes, it depends
• Different assemblers perform differently, depending
on
o Repeat content
o Heterozygosity
• Generally the results are very good (estimated
completeness, etc.) but differ between
assemblers (!)
• There Is No One Answer.
Estimated completeness: CEGMA
Each assembler lost a different ~5% of CEGs
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
Tradeoffs in N50 and %
incl.
(Chart, bird assembly: NG50 scaffold length (bp) vs. % of estimated genome size present in scaffolds >= 25 Kbp, for BCM, ALLP, SOAP, NEWB, MERAC, SGA, CBCB, PHUS, RAY, MLK, ABL.)
Bradnam et al., Assemblathon 2:
http://arxiv.org/pdf/1301.5406v1.pdf
Evaluating assemblies
• Every assembly returns different results for eukaryotic
genomes.
• For metagenomes, it’s even worse.
o No systematic exploration of precision, recall, etc.
o Very little in the way of cross-comparable data sets
o Often sequencing technology being evaluated is out of date
o etc. etc.
Our experience
• Our metagenome assemblies compare well with
others, but we have little in the way of ground truth
with which to evaluate.
• Scaffold assembly is tricky; we believe in contig
assembly for metagenomes, but not scaffolding.
• See arXiv paper, “Assembling large, complex
metagenomes”, for our suggested pipeline and
statistics & references.
Metagenomic assemblies are highly variable
Adina Howe et al., arXiv 1212.0159
How to choose a
metagenome assembler
• Try a few.
• Use what seems to perform best (most bp in contigs
above some minimum length; see the sketch after this list).
• Try: MEGAHIT and SPAdes and IDBA.
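As a concrete version of “most bp above some minimum”, here is a small sketch (my code; the FASTA filenames are placeholders) that totals the bp in contigs of at least 1 kb for each candidate assembly:

def contig_lengths(fasta_path):
    """Yield contig lengths from a FASTA file."""
    length = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

MIN_LEN = 1000
for assembly in ["megahit.fa", "spades.fa", "idba.fa"]:   # placeholder filenames
    total = sum(l for l in contig_lengths(assembly) if l >= MIN_LEN)
    print(assembly, total, "bp in contigs >=", MIN_LEN)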
An example
assembly
Deep Carbon data set
• Name: DCO_TCO_MM5
• Masimong Gold Mine; microbial cells filtered from
fracture water from within a 1.9km borehole.
(32,000 year old water?!)
• M.C.Y. Lau, C. Magnabosco, S. Grim, G. Lacrampe
Couloume, K. Wilkie, B. Sherwood Lollar, D.N. Simkus,
G.F. Slater, S. Hendrickson, M. Pullin, T.L. Kieft, O.
Kuloyo, B. Linage, G. Borgonie, E. van Heerden, J.
Ackerman, C. van Jaarsveld, and T.C. Onstott
DCO_TCO_MM5
“Could you take a look at this?
MG-RAST is telling us we
have a lot of artificially
duplicated reads, i.e. the data
is bad.”
Pipeline: adapter trim & quality filter => diginorm to C=10 => trim high-coverage reads at low-abundance k-mers.
20m reads / 2.1 Gbp in => 5.6m reads / 601.3 Mbp out.
Entire process took ~4 hours of
computation, or so.
(Minimum genome size est: 60.1 Mbp)
Assembly stats:
k    N contigs (all)   Sum BP (all)   N contigs (> 1kb)   Sum BP (> 1kb)   Max contig
21 343263 63217837 6271 10537601 9987
23 302613 63025311 7183 13867885 21348
25 276261 62874727 7375 15303646 34272
27 258073 62500739 7424 16078145 48742
29 242552 62001315 7349 16426147 48746
31 228043 61445912 7307 16864293 48750
33 214559 60744478 7241 17133827 48768
35 203292 60039871 7129 17249351 45446
37 189948 58899828 7088 17527450 59437
39 180754 58146806 7027 17610071 54112
41 172209 57126650 6914 17551789 65207
43 165563 56440648 6925 17654067 73231
DCO_TCO_MM5
(Minimum genome size est: 60.1 Mbp)
Chose two:
• A: k=43 (“long contigs”)
• 165563 contigs
• 56.4 Mbp
• longest contig: 73231 bp
• B: k=21 (“high recall”)
• 343263 contigs
• 63.2 Mbp
• longest contig is 9987 bp
DCO_TCO_MM5
How to evaluate??
How many reads map
back?
Mapped 3.8m paired-end reads (one subsample):
• high-recall: 41% of pairs map
• longer-contigs: 70% of pairs map
+ 150k single-end reads:
• high-recall: 49% of sequences map
• longer-contigs: 79% of sequences map
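One way to compute these numbers (a sketch assuming the reads have already been mapped to each assembly with a mapper such as bwa and stored as a BAM file; uses the third-party pysam library, my choice rather than anything named in the slides; the filename is a placeholder):

import pysam   # third-party; pip install pysam

def mapping_rate(bam_path):
    """Fraction of reads in a BAM file that mapped to the assembly."""
    mapped = total = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue           # count each read once
            total += 1
            if not read.is_unmapped:
                mapped += 1
    return mapped / total if total else 0.0

print(f"{100 * mapping_rate('reads_vs_assembly.bam'):.0f}% of reads map")  # placeholder filename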
Annotation/exploration
with MG-RAST
• You can upload genome assemblies to MG-RAST,
and annotate them with coverage; tutorial to
follow.
• What does MG-RAST do?
Conclusion
• This is a pretty good metagenome assembly – >
80% of reads map!
• Surprised that the larger assembly (63.2 Mbp, “high
recall”) accounts for a smaller percentage of the
reads – 49% vs 79% for the 56.4 Mbp “long contigs”
data set.
• I now suspect that different parameters are
recovering different subsets of the sample…
• Don’t trust MG-RAST ADR calls.
You can estimate
metagenome size…
Estimates of metagenome
size
Calculation: # reads * (avg read len) / (diginorm
coverage)
Assumes: few entirely erroneous reads (upper bound);
saturation (lower bound).
• E. coli: 384k * 86.1 / 5.0 => 6.6 Mbp est. (true: 4.5
Mbp)
• MM5 deep carbon: 60 Mbp
• Great Prairie soil: 12 Gbp
• Amazon Rain Forest Microbial Observatory: 26 Gbp
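The same calculation in code (trivial, but handy); the numbers are the E. coli example from this slide:

def estimate_metagenome_size(n_reads, avg_read_len, diginorm_coverage):
    """Rough metagenome size from reads surviving digital normalization."""
    return n_reads * avg_read_len / diginorm_coverage

print(f"{estimate_metagenome_size(384_000, 86.1, 5.0) / 1e6:.1f} Mbp")  # ~6.6 Mbp (E. coli)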
Extracting whole
genomes?
So far, we have only assembled contigs, but not whole
genomes.
Can entire genomes be
assembled from metagenomic
data?
Iverson et al. (2012), from
the Armbrust lab, describes a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
Pmid 22301318, Iverson et al.
Concluding thoughts
• What works?
• What needs work?
• What will work?
What works?
Today,
• From deep metagenomic data, you can get the
gene and operon content (including abundance of
both) from communities.
• You can get microarray-like expression information
from metatranscriptomics.
What needs work?
• Assembling ultra-deep samples is going to require
more engineering, but is straightforward. (“Infinite
assembly.”)
• Building scaffolds and extracting whole genomes
has been done, but I am not yet sure how feasible it
is to do systematically with existing tools (c.f.
Armbrust Lab).
What will work,
someday?
• Sensitive analysis of strain variation.
o Both assembly and mapping approaches do a poor job detecting many
kinds of biological novelty.
o The 1000 Genomes Project has developed some good tools that need to
be evaluated on community samples.
• Ecological/evolutionary dynamics in vivo.
o Most work done on 16s, not on genomes or functional content.
o Here, sensitivity is really important!
The interpretation
challenge
• For soil, we have generated approximately 1200
bacterial genomes worth of assembled genomic DNA
from two soil samples.
• The vast majority of this genomic DNA contains unknown
genes with largely unknown function.
• Most annotations of gene function & interaction are
from a few phylogenetically limited model organisms
o Est 98% of annotations are computationally inferred: transferred from model
organisms to genomic sequence, using homology.
o Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
What are future needs?
• High-quality, medium+ throughput annotation of
genomes?
o Extrapolating from model organisms is both immensely important and yet
lacking.
o Strong phylogenetic sampling bias in existing annotations.
• Synthetic biology for investigating non-model
organisms?
(Cleverness in experimental biology doesn’t scale.)
• Integration of microbiology, community
ecology/evolution modeling, and data analysis.
Editor's Notes

  • #38 Both BLAST. Shotgun: annotated reads only.
  • #55 Beware of Janet Jansson bearing data.
  • #63 Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.
  • #76 A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases the number of edges in the graph due to the underlying genome that will plateau when every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.