Assembling diverse & rich metagenomes: the secrets of the ancients.
C. Titus Brown
ctb@msu.edu
Introducing myself --
ged.msu.edu/
• “Data-intensive biology” – tools, etc.
• Not a marine microbiologist at all!
Note: these slides are all on SlideShare.
(Google “titus brown slide share”)
My goals
• Enable hypothesis-driven biology through better hypothesis generation & refinement.
• Devalue “interest level” of sequence analysis and put myself out of a job.
• Be a good mutualist!
Part I: Soil Assembly & the Great Prairie Grand Challenge
2008
Soil microbial ecology - questions
• What ecosystem-level functions are present, and how do microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun sequencing:
◦ What kind of strain-level heterogeneity is present in the population?
◦ What do the phage and viral populations look like?
◦ What species are where?
A “Grand Challenge” dataset (DOE/JGI)
[Figure: base pairs of sequencing (Gbp, axis 0 to 600) per sample, split by platform (GAII, HiSeq). Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass. Shown for comparison: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.]
Total: 1,846 Gbp soil metagenome
Adina Howe
Approach – assemble into contigs.
• We found that short reads from phylogenetically distant and microbially diverse environments could not be reliably annotated.
=> Build into longer contigs first.
…5 year odyssey…
(Friends don’t let friends BLAST short reads.**)
** Applicable to most environmental samples. Howe et al., 2014
Developed two new methods --
I. Computational “cell sorting”
II. Computational “library normalization”
See:
• Pell et al., Tiedje, Brown (2012);
• Howe et al., Tiedje, Brown (2014);
• Goffredi et al. (2014)
Digital normalization
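The slides under this heading were figures, so the method is not spelled out in the text. As a rough sketch of the idea behind digital normalization (keep a read only if the median count of its k-mers, among reads kept so far, is still below a coverage cutoff C), something like the following would work on toy data. The function names, the default `k` and `cutoff`, and the exact dictionary-based counting are assumptions for illustration; khmer itself uses memory-efficient probabilistic counting, which this does not reproduce.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer in seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=10):
    """Keep a read only if its median k-mer count so far is below cutoff.

    Illustrative sketch: exact counting in a Counter, single pass, no error trimming.
    """
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        # Median k-mer count approximates the coverage already collected for this read.
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)
            kept.append(read)
    return kept
```

With 50 identical copies of a read and `cutoff=5`, only the first five copies are kept; a read from a different, so far unseen region is still kept.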
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Result: we (easily, casually) assembled two of the biggest metagenomes ever.
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 billion | 4.5 million | 19% | 5.3 million
3.5 billion | 5.9 million | 22% | 6.8 million
Howe et al., 2014; pmid 24632729
(I’ll come back to this)
So…
We can now achieve an assembly of pretty much anything (soil was really hard, virtually everything else is easier!)
Lots of people are interested in collaborating with us on this!
…but we regard it as a largely solved problem.
I: assembly “protocols”
• khmer-protocols: an open, versioned, citable, forkable set of instructions to assemble eukaryotic mRNAseq and metagenomes on widely accessible compute resources.
• Explicit command-line instructions to go from raw reads to annotated “final product”.
• For mRNAseq: ~$150 of compute for $2000 of data.
(Still in beta, note.)
[Figure: khmer-protocols stages: read cleaning, preprocessing, assembly, annotation.]
Example - Deep Carbon data set
• Masimong Gold Mine; microbial cells filtered from fracture water from within a 1.9 km borehole. (32,000-year-old water)
• 5.6 million reads, 601.3 Mbp;
◦ computational protocol took 4 hours;
◦ assembled to 56 Mbp in contigs > 300 bp;
◦ longest contig is 73 kb;
◦ 70% of paired-end reads mapped.
w/M.C.Y. Lau, Tullis Onstott
Our (open) approach:
• If the protocols work for you, great! Cite us.
• If the protocols don’t work for you, please let us know so we can fix them.
• If it’s a challenging problem, we’d love to collaborate.
• We are also happy to help train people.
Things we no longer worry about (much) – let’s chat:
• Inter-species assembly chimerae
…apart from within-strain variants, chimerae are hard to form with contig assembly.
• Finding homology matches in metagenomes
…contigs give as good a match as possible.
• Assembling contigs when we have sufficient coverage
…not enough coverage is usually the problem.
II: Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
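The arithmetic behind these coverage numbers can be sketched as below. This assumes an idealized model in which reads land uniformly at random on a single genome (the standard Poisson approximation); real metagenomes have very uneven abundances, so treat it as a back-of-the-envelope aid only. The function names are hypothetical.

```python
import math


def mean_coverage(n_reads, read_len, genome_size):
    """Average number of reads overlapping each base: total sequenced bp / genome size."""
    return n_reads * read_len / genome_size


def bases_needed(genome_size, coverage):
    """Total bp of sequencing needed to reach a target mean coverage."""
    return genome_size * coverage


def fraction_unsampled(coverage):
    """Expected fraction of bases hit by zero reads, assuming Poisson-distributed sampling."""
    return math.exp(-coverage)
```

For a ~3 billion bp genome at 100x, `bases_needed(3_000_000_000, 100)` gives 300,000,000,000 bp, which is the 300 Gbp figure quoted for human.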
Assembly depends on high coverage
[Figure: HMP mock community]
Downstream goals of assembly:
(Even assuming ribotyping works perfectly)
• Annotate genes with higher confidence.
• Reconstruct operons & ultimately even full genomes.
• Analyze strain variation.
• Study organisms that ribotyping can’t (phage & virus)
Main questions --
I. How do we know if we’ve sequenced enough?
II. Can we predict how much more we need to sequence to see <insert some feature here>?
Note: necessary sequencing depth cannot accurately be predicted from SSU/amplicon data
Method 1: looking for WGS saturation
We can track how many sequences we keep of the sequences we’ve seen, to detect saturation.
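One way to picture this tracking, as a sketch under assumptions rather than the actual khmer implementation: stream through the reads, apply a median k-mer count rule with cutoff C to decide whether each read is kept, and record reads kept against reads seen. When the kept count stops growing, new reads are mostly redundant and the sample is near saturation at that coverage. The function name, `step`, and the exact counting below are illustrative choices.

```python
from collections import Counter
from statistics import median


def saturation_curve(reads, k=20, cutoff=10, step=1000):
    """Return (reads_seen, reads_kept) pairs, recorded every `step` reads.

    A read is kept only if the median count of its k-mers, among reads
    kept so far, is below `cutoff`. A flattening curve suggests saturation.
    """
    counts = Counter()
    kept = 0
    curve = []
    for seen, read in enumerate(reads, start=1):
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if words and median(counts[w] for w in words) < cutoff:
            counts.update(words)
            kept += 1
        if seen % step == 0:
            curve.append((seen, kept))
    return curve
```

Plotting the second value against the first gives the kind of curve the following slides refer to; the slope near the end indicates how much novelty each additional read still contributes.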
We can detect saturation of shotgun sequencing
C=10, for assembly
Data from Shakya et al., 2013 (pmid: 23387867)
Estimating metagenome nt richness:
# bp at saturation / coverage
• MM5 deep carbon: 60 Mbp
• Iowa prairie soil: 12 Gbp
• Amazon Rain Forest Microbial Observatory soil: 26 Gbp
Assumes: few entirely erroneous reads (otherwise the estimate is an upper bound); sequencing is at saturation (otherwise it is a lower bound).
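As a worked form of the formula above (a sketch; the function name and the input numbers in the note are hypothetical, not values from the talk):

```python
def nt_richness(bp_kept_at_saturation, coverage):
    """Estimate nucleotide richness as (bp retained at saturation) / (coverage cutoff)."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return bp_kept_at_saturation / coverage
```

For example, if 600 Mbp of reads were retained at a coverage cutoff of 10, the estimate would be `nt_richness(600_000_000, 10)`, i.e. 60 Mbp of distinct sequence.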
WGS saturation approach:
• Tells us when we have enough sequence.
• Can’t be predictive… if you haven’t sampled something, you can’t say anything about it.
Can we correlate deep amplicon sequencing with shallower WGS?
Correlating 16s and shotgun seq
Errors do not strongly affect saturation.
[Figure: how much of 16s do you see… with how much shotgun sequencing]
Data from Shakya et al., 2013 (pmid: 23387867)
WGS saturation ~matches 16s saturation
< rRNA copy number >
16s region choice is not significant (?!)
Data from Shakya et al., 2013 (pmid: 23387867)
Method is robust to organisms unsampled by amplicons.
Insensitive to amplicon primer bias.
Robust to genome size differences, eukaryotes, phage.
Data from Shakya et al., 2013 (pmid: 23387867)
Can examine specific OTUs
Data from Shakya et al., 2013 (pmid: 23387867)
OTU abundance is ~correct.
Data from Shakya et al., 2013 (pmid: 23387867)
Running on real communities --
Thoughts on 16s/WGS comparison:
• Robust to some real problems (primer bias; organisms unsampled by amplicon seq) & insensitive to 16s seq error.
• Hopefully can be used to build a predictive framework to answer “how much more sequencing should I do?”
◦ Sensitivity: “What have I missed?”
◦ Planning: “How much $$ should I ask for?”
Other things that y’all might be interested in:
• Comparing 16s from amplicon and shotgun sequencing.
• Metatranscriptome assembly protocol
• Biogeography of genomic sequence
Metatranscriptome assembly (soil)
Sample | Total Length (bp) | Total rRNA (bp) | Total annotated by MG-RAST m5nr SEED
Unassembled MetaT | 20,525,296,600 | 16,987,863,800 (82.8%) | 48,080,200 (0.23%)
Assembled MetaT | 32,471,548 | 7,061,913 (21.8%) | 2,075,701 (6.4%)
Aaron Garoutte (w/Tiedje & Howe)
Using shotgun sequence to cross-validate amplicon predictions
[Figure: percentage axis (0% to 40%) comparing AMP/RDP, AMP/SILVA, WGS/RDP, WGS/SILVA, and WGS/SILVA (LSU).]
Amplicon seq missing Verrucomicrobia
Jaron Guo
Primer bias against Verrucomicrobia
Check taxonomy of reads causing mismatch (A)
Verrucomicrobia cause 70% (117/168) of mismatches
Current primer is not effective at amplifying Verrucomicrobia
Jaron Guo
Biogeography of genomic DNA
How much genomic DNA is shared between different sites?
Qingpeng Zhang
Biogeography of genomic DNA (2)
How much genomic richness is shared between different sites?
Qingpeng Zhang
Concluding thoughts
• Tools and protocols for data analysis are fast becoming intrinsic to the practice of biology.
◦ Most tools are wrong, but some are useful.
◦ All of our tools are openly, freely available in every way possible.
• We are trying to make assembly fast, cheap, easy, and good.
• We are building on our assembly-based approaches & intuition to tackle other questions.
Big Data is neither the real problem nor the solution.
• Dealing with Big Data requires a new mentality, so training/experience is probably the most effective way forward.
• With sequencing, few if any of your biology problems go away, although some aspects may become more tractable.
• Think future: any -ome you want from any sample you can get. …So now
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
We don’t know what most genes do.
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 billion | 4.5 million | 19% | 5.3 million
3.5 billion | 5.9 million | 22% | 6.8 million
Howe et al., 2014; pmid 24632729
Potential discussion topics
A. Funding and collaboration models.
B. Leveraging data & computation to help understand gene function.
C. Computational/data infrastructure
…but planning for poverty, not wealth: sustainability and “bus factor”.
D. Capacity building
• Standardized data sets; data availability.
• Workshops and training.
Training in data analysis et al.
• Software Carpentry.
• Data Carpentry.
• STAMPS, EDAMAME, MSU NGS course.
• <other courses go here>

2014 marine-microbes-grc


Editor's Notes

  • #5 Fly-over country (that I live in)
  • #8 Nothing more frustrating to biologists than having data that you can’t analyze 
  • #19 Est 200 hrs of my effort
  • #28 ~Easy to say how much you need for a single genome.
  • #35 Note: 16s is higher copy number, more sensitive than WGS.
  • #39 otu5 is acidobacterium; one species, Acidobacterium capsulatum, with one rRNA; 4.6% of BA community, 4.7% of Illumina reads; # otu2 is chlorobium; five species, total of 10 rRNA; 9.1% of Illumina. Correction factor of 5.
  • #45 JGI v6, 454 amplicon sequencing
  • #47 Original motivation was, should we combine samples?