2014 marine-microbes-grc

Assembling diverse & rich
metagenomes: the secrets of
the ancients.
C. Titus Brown
ctb@msu.edu

Introducing myself --
ged.msu.edu/
 “Data-intensive biology” – tools, etc.
 Not a marine microbiologist at all!
Note: these slides are all on slideshare.
(Google “titus brown slide share”)

My goals
 Enable hypothesis-driven biology
through better hypothesis generation
& refinement.
 Devalue “interest level” of sequence
analysis and put myself out of a job.
 Be a good mutualist!

Part I: Soil Assembly & the
Great Prairie Grand
Challenge
2008

Soil microbial ecology -
questions
 What ecosystem level functions are present,
and how do microbes do them?
 How does agricultural soil differ from native
soil?
 How does soil respond to climate
perturbation?
 Questions that are not easy to answer
without shotgun sequencing:
◦ What kind of strain-level heterogeneity is present
in the population?
◦ What does the phage and viral population look
like?
◦ What species are where?

A “Grand Challenge” dataset
(DOE/JGI)
[Figure: bar chart of basepairs of sequencing (Gbp) per site, y-axis 0-600, split by GAII and HiSeq. Sites: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass.]
Reference points marked on the chart:
 NCBI nr database, 37 Gbp
 Rumen K-mer Filtered, 111 Gbp
 Rumen (Hess et al., 2011), 268 Gbp
 MetaHIT (Qin et al., 2011), 578 Gbp
Total: 1,846 Gbp soil metagenome
Adina Howe

Approach – assemble into
contigs.
 We found that short reads from
phylogenetically distant and
microbially diverse environments
could not be reliably annotated.
=> Build into longer contigs first.
…5 year odyssey…

(Friends don’t let friends BLAST short
reads.**)
** Applicable to most environmental samples.
Howe et al., 2014

Developed two new methods
--
I. Computational “cell sorting”
II. Computational “library
normalization.”
See:
• Pell et al., Tiedje, Brown (2012);
• Howe et al., Tiedje, Brown (2014);
• Goffredi et al. (2014)

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Result: we (easily, casually) assembled
two of the biggest metagenomes ever.
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill | 4.5 mill | 19% | 5.3 mill
Howe et al, 2014; pmid 24632729
(I’ll come back to this)

So…
We can now achieve an assembly of
pretty much anything (soil was really
hard, virtually everything else is easier!)
Lots of people are interested in
collaborating with us on this!
…but we regard it as a
largely solved problem.

I: assembly “protocols”
 khmer-protocols: open, versioned, citable,
forkable set of instructions to assemble euk
mRNAseq and metagenomes on widely
accessible compute resources.
 Explicit command-line instructions to go from
raw reads to annotated “final product”.
 For mRNAseq: ~$150/compute for $2000 of
data.
(Still in beta, note.)

khmer-protocols Read cleaning
Preprocessing
Assembly
Annotation

Example - Deep Carbon data
set
 Masimong Gold Mine; microbial cells
filtered from fracture water from within
a 1.9km borehole. (32,000 year old
water)
 5.6m reads, 601.3 Mbp;
◦ computational protocol took 4 hours;
◦ Assembled to 56 Mbp > 300 bp
◦ longest contig is 73kb
◦ 70% of paired-end reads mapped.
w/M.C.Y. Lau, Tullis Onstott

Our (open) approach:
 If the protocols work for you, great! Cite
us.
 If the protocols don’t work for you, please
let us know so we can fix them.
 If it’s a challenging problem, we’d love
to collaborate.
 We are also happy to help train people.

Things we no longer worry about
(much) – let’s chat:
 Inter-species assembly chimerae
…apart from w/in strain variants, chimerae
are hard to form with contig assembly.
 Finding homology matches in metagenomes
…contigs give as good a
match as possible.
 Assembling contigs when we have sufficient
coverage
…not enough coverage is
usually the problem.

II: Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the
top through all of the reads.
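As a minimal sketch of that definition (the function name and the toy numbers are illustrative, not from the talk), average coverage is just total sequenced bases divided by genome length:

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads overlapping each base: (reads * read length) / genome size."""
    return num_reads * read_length / genome_size


# Toy example (hypothetical numbers): 500,000 reads of 100 bp over a 5 Mbp genome.
print(mean_coverage(500_000, 100, 5_000_000))  # 10.0
```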

Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
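A rough sketch of why random sampling forces deep sampling, assuming an idealized model in which per-base depth is Poisson-distributed with mean equal to the coverage (uniform random reads, no bias; this model and the helper names are assumptions, not something stated on the slide):

```python
import math


def bases_needed(genome_size: int, coverage: float) -> float:
    """Total sequenced bases required to reach a target average coverage."""
    return genome_size * coverage


def fraction_covered(coverage: float, min_depth: int = 1) -> float:
    """P(depth >= min_depth) when per-base depth ~ Poisson(coverage)."""
    p_below = sum(
        math.exp(-coverage) * coverage**k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below


HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp
print(bases_needed(HUMAN_GENOME_BP, 100) / 1e9)  # 300.0 (Gbp at 100x)

# Under this model, average coverage of 1 leaves many bases unsampled,
# and even at 10x some bases are seen fewer than 5 times.
print(fraction_covered(1.0))
print(fraction_covered(10.0, min_depth=5))
```

Real libraries are less uniform than this model, so these fractions are best read as optimistic.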

Assembly depends on high
coverage
HMP mock community

Downstream goals of
assembly:
(Even assuming ribotyping works perfectly)
 Annotate genes with higher confidence.
 Reconstruct operons & ultimately even
full genomes.
 Analyze strain variation.
 Study organisms that ribotyping can’t
(phage & virus)

Main questions --
I. How do we know if we’ve sequenced
enough?
II. Can we predict how much more we
need to sequence to see <insert
some feature here>?
Note: necessary sequencing depth cannot
accurately be predicted from SSU/amplicon
data

Method 1: looking for WGS
saturation
We can track how many sequences we
keep of the sequences we’ve seen, to
detect saturation.
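One way to make this concrete, assuming a digital-normalization-style rule (keep a read only if the median count of its k-mers among reads kept so far is below a cutoff C). The exact in-memory counter below is a toy stand-in for illustration, not the khmer implementation, and the function names are hypothetical:

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def saturation_curve(reads, k=20, cutoff=10):
    """Return (reads seen, reads kept) after each read.

    A read is kept only while the median count of its k-mers is below
    `cutoff`; once the curve flattens, new reads add little new sequence.
    """
    counts = Counter()
    kept = 0
    curve = []
    for seen, read in enumerate(reads, start=1):
        ks = kmers(read, k)
        if ks and median(counts[km] for km in ks) < cutoff:
            counts.update(ks)  # only kept reads contribute to the counts
            kept += 1
        curve.append((seen, kept))
    return curve


# Toy input: the same read 30 times saturates after `cutoff` copies.
curve = saturation_curve(["ACGTTGCAAG"] * 30, k=4, cutoff=5)
print(curve[-1])  # (30, 5)
```

Plotting kept against seen gives the saturation curve; a slope near zero suggests the sample is sequenced deeply enough at that cutoff.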

Data from Shakya et al., 2013 (pmid: 23387867)
We can detect saturation of
shotgun sequencing

We can detect saturation of
shotgun sequencing
C=10, for assembly

Estimating metagenome nt
richness:
# bp at saturation / coverage
 MM5 deep carbon: 60 Mbp
 Iowa prairie soil: 12 Gbp
 Amazon Rain Forest Microbial
Observatory soil: 26 Gbp
Assumes: few entirely erroneous reads (upper
bound); at saturation (lower bound).
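The estimate above is a single division. A minimal sketch, with a hypothetical function name and an input chosen only to make the arithmetic easy to follow (it is not a figure from the talk):

```python
def estimate_richness_bp(bp_at_saturation: float, coverage: float) -> float:
    """Nucleotide richness ~ (bp retained at saturation) / (coverage cutoff)."""
    return bp_at_saturation / coverage


# Hypothetical: 1.2 Gbp retained at a coverage cutoff of C=10.
print(estimate_richness_bp(1.2e9, 10) / 1e6)  # 120.0 (Mbp)
```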

WGS saturation approach:
 Tells us when we have enough
sequence.
 Can’t be predictive… if you haven’t
sampled something, you can’t say
anything about it.
Can we correlate deep amplicon
sequencing with shallower WGS?

Correlating 16s and shotgun
seq
Errors do not strongly affect saturation.
How much of 16s do you see… with how much shotgun sequencing
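The slide does not spell out how the comparison was computed. One plausible sketch, under the assumption that "seeing" 16s means observing its k-mers in the shotgun reads (all names here are hypothetical), tracks the fraction of amplicon k-mers recovered as shotgun reads accumulate:

```python
def kmer_set(seqs, k):
    """Set of all k-mers across a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def amplicon_recovery_curve(amplicon_seqs, shotgun_reads, k=20):
    """Return (shotgun reads seen, fraction of amplicon k-mers observed)."""
    target = kmer_set(amplicon_seqs, k)
    found = set()
    curve = []
    for n, read in enumerate(shotgun_reads, start=1):
        found |= kmer_set([read], k) & target
        curve.append((n, len(found) / len(target) if target else 0.0))
    return curve


# Toy input: two of the three shotgun reads overlap the amplicon.
curve = amplicon_recovery_curve(
    ["ACGTTGCAAG"], ["ACGTTG", "TTTTTT", "TTGCAAG"], k=4
)
print(curve[-1])  # (3, 1.0)
```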

WGS saturation ~matches 16s saturation
< rRNA copy
number >

16s region choice is not significant (?!)

Method is robust to organisms
unsampled by amplicons.
Insensitive to
amplicon primer
bias.
Robust to genome
size differences,
eukaryotes, phage.

Can examine specific OTUs

OTU abundance is ~correct.

Running on real communities
--

Thoughts on 16s/WGS
comparison:
 Robust to some real problems (primer
bias; organisms unsampled by
amplicon seq) & insensitive to 16s seq
error.
 Hopefully can be used to build a
predictive framework to answer “how
much more sequencing should I do?”
◦ Sensitivity: “What have I missed?”
◦ Planning: “How much $$ should I ask

Other things that y’all might be
interested in:
 Comparing 16s from amplicon and
shotgun sequencing.
 Metatranscriptome assembly protocol
 Biogeography of genomic sequence

Metatranscriptome assembly
(soil)
Sample | Total Length (bp) | Total rRNA (bp) | Total annotated by MG-RAST m5nr SEED
Unassembled MetaT | 20,525,296,600 | 16,987,863,800 (82.8%) | 48,080,200 (0.23%)
Assembled MetaT | 32,471,548 | 7,061,913 (21.8%) | 2,075,701 (6.4%)
Aaron Garoutte (w/Tiedje & Howe)

Using shotgun sequence to
cross-validate amplicon
predictions
[Figure: bar chart, y-axis 0.00%-40.00%, comparing AMP/RDP, AMP/SILVA, WGS/RDP, WGS/SILVA, and WGS/SILVA(LSU).]
Amplicon seq missing Verrucomicrobia
Jaron Guo

Primer bias against
Verrucomicrobia
Check taxonomy of reads causing
mismatch (A)
Verrucomicrobia cause
70% (117/168) of
mismatch
Current primer is not effective at amplifying
Verrucomicrobia
Jaron Guo

Biogeography of genomic
DNA
How much genomic DNA is shared between
different sites?
Qingpeng Zhang

Biogeography of genomic DNA
(2)
How much genomic richness is shared
between different sites?
Qingpeng Zhang

Concluding thoughts
 Tools and protocols for data analysis are
fast becoming intrinsic to practice of
biology.
◦ Most tools are wrong, but some are useful.
◦ All of our tools are openly, freely available in
every way possible.
 We are trying to make assembly fast,
cheap, easy, and good.
 We are building on our assembly-based
approaches & intuition to tackle other
questions.

Big Data is neither the real
problem nor the solution.
 Dealing with Big Data requires a new
mentality, so training/experience is
probably most effective way forward.
 With sequencing, few if any of your
biology problems go away, although
some aspects may become more
tractable.
 Think future: any -ome you want from
any sample you can get. …So now

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
We don’t know what most genes do.
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
Howe et al, 2014; pmid 24632729

Potential discussion topics
A. Funding and collaboration models.
B. Leveraging data & computation to
help understand gene function.
C. Computational/data infrastructure
…but planning for poverty, not wealth:
sustainability and “bus factor”.
D. Capacity building
 Standardized data sets; data availability.
 Workshops and training.

Training in data analysis et al.
 Software Carpentry.
 Data Carpentry.
 STAMPS, EDAMAME, MSU NGS
course.
 <other courses go here>

