Assembling diverse & rich
metagenomes: the secrets of
the ancients.
C. Titus Brown
ctb@msu.edu
Introducing myself --
ged.msu.edu/
 “Data-intensive biology” – tools, etc.
 Not a marine microbiologist at all!
Note: th...
My goals
 Enable hypothesis-driven biology
through better hypothesis generation
& refinement.
 Devalue “interest level” ...
Part I: Soil Assembly & the
Great Prairie Grand
Challenge
2008
Soil microbial ecology -
questions
 What ecosystem level functions are present,
and how do microbes do them?
 How does a...
A “Grand Challenge” dataset
(DOE/JGI)
[Figure: bar chart of sequencing per sample (Gbp), axis 0 to 600; samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultiva...]
Approach – assemble into
contigs.
 We found that short reads from
phylogenetically distant and
microbially diverse enviro...
(Friends don’t let friends BLAST short
reads.**)
** Applicable to most environmental samples. Howe et al., 2014
Developed two new methods
--
I. Computational “cell sorting”
II. Computational “library
normalization.”
See:
• Pell et al....
Digital normalization
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Result: we (easily, casu...
So…
We can now achieve an assembly of
pretty much anything (soil was really
hard, virtually everything else is easier!)
Lo...
I: assembly “protocols”
 khmer-protocols: open, versioned, citable,
forkable set of instructions to assemble euk
mRNAseq ...
khmer-protocols Read cleaning
Preprocessing
Assembly
Annotation
Example - Deep Carbon data
set
 Masimong Gold Mine; microbial cells
filtered from fracture water from within
a 1.9km bore...
Our (open) approach:
 If the protocols work for you, great! Cite
us.
 If the protocols don’t work for you, please
let us...
Things we no longer worry about
(much) – let’s chat:
 Inter-species assembly chimerae
…apart from w/in strain variants, c...
II: Shotgun sequencing and
coverage
“Coverage” is simply the average number of reads that overlap
each true base in genome...
Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
Assembly depends on high
coverage
HMP mock community
Downstream goals of
assembly:
(Even assuming ribotyping works perfectly)
 Annotate genes with higher confidence.
 Recons...
Main questions --
I. How do we know if we’ve sequenced
enough?
II. Can we predict how much more we
need to sequence to see...
Method 1: looking for WGS
saturation
We can track how many sequences we
keep of the sequences we’ve seen, to
detect satura...
Data from Shakya et al., 2013 (pmid: 23387867)
We can detect saturation of
shotgun sequencing
Data from Shakya et al., 2013 (pmid: 23387867)
We can detect saturation of
shotgun sequencing
C=10, for assembly
Estimating metagenome nt
richness:
# bp at saturation / coverage
 MM5 deep carbon: 60 Mbp
 Iowa prairie soil: 12 Gbp
 A...
WGS saturation approach:
 Tells us when we have enough
sequence.
 Can’t be predictive… if you haven’t
sampled something,...
Correlating 16s and shotgun
seq
Errors do not strongly affect saturation
How much of 16s do you see… with how much shotgun ...
Data from Shakya et al., 2013 (pmid: 23387867)
WGS saturation ~matches 16s saturation
< rRNA copy
number >
16s region choice is not significant (?!)
Data from Shakya et al., 2013 (pmid: 23387867)
Method is robust to organisms
unsampled by amplicons.
Insensitive to
amplicon primer
bias.
Robust to genome
size differenc...
Can examine specific OTUs
Data from Shakya et al., 2013 (pmid: 23387867)
OTU abundance is ~correct.
Data from Shakya et al., 2013 (pmid: 23387867)
Running on real communities
--
Thoughts on 16s/WGS
comparison:
 Robust to some real problems (primer
bias; organisms unsampled by
amplicon seq) & insens...
Other things that y’all might be
interested in:
 Comparing 16s from amplicon and
shotgun sequencing.
 Metatranscriptome ...
Metatranscriptome assembly
(soil)
Table columns: Total Length (bp); Total rRNA (bp); Total annotated by MG-RAST m5nr SEED
Unassembled MetaT...
Using shotgun sequence to
cross-validate amplicon
predictions
[Figure: bar chart; y-axis ticks from 0.00% to 40.00%...]
Primer bias against
Verrucomicrobia
Check taxonomy of reads causing
mismatch (A)
Verrucomicrobia cause
70% (117/168) of
mi...
Biogeography of genomic
DNA
How much genomic DNA is shared between
different sites?
Qingpeng Zhang
Biogeography of genomic DNA
(2)
How much genomic richness is shared
between different sites?
Qingpeng Zhang
Concluding thoughts
 Tools and protocols for data analysis are
fast becoming intrinsic to practice of
biology.
◦ Most too...
Big Data is neither the real
problem nor the solution.
 Dealing with Big Data requires a new
mentality, so training/exper...
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
We don’t know what most ...
Potential discussion topics
A. Funding and collaboration models.
B. Leveraging data & computation to
help understand gene ...
Training in data analysis et al.
 Software Carpentry.
 Data Carpentry.
 STAMPS, EDAMAME, MSU NGS
course.
 <other cours...
  • Fly-over country (that I live in)
  • Nothing more frustrating to biologists than having data that you can’t analyze 
  • Est 200 hrs of my effort
  • ~Easy to say how much you need for a single genome.
  • Note: 16s is higher copy number, more sensitive than WGS.
  • otu5 is acidobacterium; one species, Acidobacterium capsulatum, with one rRNA; 4.6% of BA community, 4.7% of Illumina reads. otu2 is chlorobium; five species, total of 10 rRNA; 9.1% of Illumina. Correction factor of 5.
  • JGI v6, 454 amplicon sequencing
  • Original motivation was, should we combine samples?
  • 2014 marine-microbes-grc

    1. 1. Assembling diverse & rich metagenomes: the secrets of the ancients. C. Titus Brown ctb@msu.edu
    2. 2. Introducing myself -- ged.msu.edu/  “Data-intensive biology” – tools, etc.  Not a marine microbiologist at all! Note: these slides are all on slideshare. (Google “titus brown slide share”)
    3. 3. My goals  Enable hypothesis-driven biology through better hypothesis generation & refinement.  Devalue “interest level” of sequence analysis and put myself out of a job.  Be a good mutualist!
    4. 4. Part I: Soil Assembly & the Great Prairie Grand Challenge 2008
    5. 5. Soil microbial ecology - questions  What ecosystem level functions are present, and how do microbes do them?  How does agricultural soil differ from native soil?  How does soil respond to climate perturbation?  Questions that are not easy to answer without shotgun sequencing: ◦ What kind of strain-level heterogeneity is present in the population? ◦ What does the phage and viral population look like? ◦ What species are where?
    6. 6. A “Grand Challenge” dataset (DOE/JGI). [Figure: bar chart of basepairs of sequencing (Gbp), axis 0 to 600; legend: GAII, HiSeq; samples: Iowa, Continuous corn; Iowa, Native Prairie; Kansas, Cultivated corn; Kansas, Native Prairie; Wisconsin, Continuous corn; Wisconsin, Native Prairie; Wisconsin, Restored Prairie; Wisconsin, Switchgrass.] Total: 1,846 Gbp soil metagenome. Reference points on the chart: Rumen (Hess et al., 2011), 268 Gbp; Rumen K-mer Filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp. Adina Howe
    7. 7. Approach – assemble into contigs.  We found that short reads from phylogenetically distant and microbially diverse environments could not be reliably annotated. => Build into longer contigs first. …5 year odyssey…
    8. 8. (Friends don’t let friends BLAST short reads.**) ** Applicable to most environmental samples. Howe et al., 2014
    9. 9. Developed two new methods -- I. Computational “cell sorting” II. Computational “library normalization.” See: • Pell et al., Tiedje, Brown (2012); • Howe et al., Tiedje, Brown (2014); • Goffredi et al. (2014)
    10. 10. Digital normalization
    11. 11. Digital normalization
    12. 12. Digital normalization
    13. 13. Digital normalization
    14. 14. Digital normalization
    15. 15. Digital normalization
    16. 16. Putting it in perspective: Total equivalent of ~1200 bacterial genomes. Human genome ~3 billion bp. Result: we (easily, casually) assembled two of the biggest metagenomes ever.

        Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
        2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
        3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill

        Howe et al, 2014; pmid 24632729 (I’ll come back to this)
    17. 17. So… We can now achieve an assembly of pretty much anything (soil was really hard, virtually everything else is easier!) Lots of people are interested in collaborating with us on this! …but we regard it as a largely solved problem.
    18. 18. I: assembly “protocols”  khmer-protocols: open, versioned, citable, forkable set of instructions to assemble euk mRNAseq and metagenomes on widely accessible compute resources.  Explicit command-line instructions to go from raw reads to annotated “final product”.  For mRNAseq: ~$150/compute for $2000 of data. (Still in beta, note.)
    19. 19. khmer-protocols Read cleaning Preprocessing Assembly Annotation
    20. 20. Example - Deep Carbon data set  Masimong Gold Mine; microbial cells filtered from fracture water from within a 1.9km borehole. (32,000 year old water)  5.6m reads, 601.3 Mbp; ◦ computational protocol took 4 hours; ◦ Assembled to 56 Mbp > 300 bp ◦ longest contig is 73kb ◦ 70% of paired-end reads mapped. w/M.C.Y. Lau, Tullis Onstott
    21. 21. Our (open) approach:  If the protocols work for you, great! Cite us.  If the protocols don’t work for you, please let us know so we can fix them.  If it’s a challenging problem, we’d love to collaborate.  We are also happy to help train people.
    22. 22. Things we no longer worry about (much) – let’s chat:  Inter-species assembly chimerae …apart from w/in strain variants, chimerae are hard to form with contig assembly.  Finding homology matches in metagenomes …contigs give as good a match as possible.  Assembling contigs when we have sufficient coverage …not enough coverage is usually the problem.
    23. 23. II: Shotgun sequencing and coverage “Coverage” is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
    24. 24. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (300 Gbp for human)
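The coverage arithmetic on the two slides above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names are made up here, and the 5 Mbp bacterial genome and read counts in the usage lines are hypothetical values chosen only to give round numbers.

```python
def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of reads overlapping each base: total sequenced bp / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def bp_needed(target_coverage: float, genome_size_bp: int) -> float:
    """Total sequencing (in bp) required to reach a target average coverage."""
    return target_coverage * genome_size_bp


HUMAN_GENOME_BP = 3_000_000_000  # ~3 billion bp, as stated on the slides

# 100x coverage of a human genome, expressed in Gbp
human_100x_gbp = bp_needed(100, HUMAN_GENOME_BP) / 1e9

# Hypothetical example: 500,000 reads of 100 bp from a 5 Mbp genome
toy_coverage = coverage(500_000, 100, 5_000_000)

print(human_100x_gbp, toy_coverage)
```

The first value reproduces the "300 Gbp for human" figure at the 100x end of the 10-100x range. Note that this is an average: with random sampling, many bases fall well below it, which is why deep sampling is needed.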
    25. 25. Assembly depends on high coverage. HMP mock community
    26. 26. Downstream goals of assembly: (Even assuming ribotyping works perfectly)  Annotate genes with higher confidence.  Reconstruct operons & ultimately even full genomes.  Analyze strain variation.  Study organisms that ribotyping can’t (phage & virus)
    27. 27. Main questions -- I. How do we know if we’ve sequenced enough? II. Can we predict how much more we need to sequence to see <insert some feature here>? Note: necessary sequencing depth cannot accurately be predicted from SSU/amplicon data
    28. 28. Method 1: looking for WGS saturation We can track how many sequences we keep of the sequences we’ve seen, to detect saturation.
    29. 29. Data from Shakya et al., 2013 (pmid: 23387867) We can detect saturation of shotgun sequencing
    30. 30. Data from Shakya et al., 2013 (pmid: 23387867) We can detect saturation of shotgun sequencing C=10, for assembly
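One way to picture "tracking how many sequences we keep of the sequences we've seen" is the toy sketch below. It assumes the digital normalization rule of keeping a read only while the median abundance of its k-mers is below a cutoff C. Everything here is illustrative: it uses an exact in-memory `Counter` rather than the memory-efficient counting that khmer uses, it ignores reverse complements, and the function names and the `k=8` toy setting are choices made for this example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=10):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Returns the kept reads and a saturation curve of
    (reads seen so far, reads kept so far) pairs.
    """
    counts = Counter()
    kept = []
    curve = []
    for seen, read in enumerate(reads, start=1):
        read_kmers = kmers(read, k)
        # Reads shorter than k have no k-mers and are skipped.
        if read_kmers and median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            kept.append(read)
        curve.append((seen, len(kept)))
    return kept, curve


# Toy input: the same 20 bp read sequenced 50 times.
toy_read = "ACGTTGCAAGGCTTAACCGG"
kept, curve = normalize([toy_read] * 50, k=8, cutoff=10)
print(len(kept), curve[-1])
```

In this toy case the "kept" count stops growing once the read has been seen ten times, so the curve flattens. A flattening curve on real data is the saturation signal the slides describe, with C=10 being the cutoff named for assembly.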
    31. 31. Estimating metagenome nt richness: # bp at saturation / coverage  MM5 deep carbon: 60 Mbp  Iowa prairie soil: 12 Gbp  Amazon Rain Forest Microbial Observatory soil: 26 Gbp Assumes: few entirely erroneous reads (upper bound); at saturation (lower bound).
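The richness formula on this slide is a single division. The sketch below reads "# bp at saturation" as the number of basepairs retained once the data set saturates, and "coverage" as the normalization cutoff C; that reading is an assumption, and the input value is hypothetical, not one of the data sets listed above.

```python
def estimate_richness_bp(bp_kept_at_saturation: float, coverage_cutoff: float) -> float:
    """Estimated nucleotide richness: bp retained at saturation / coverage cutoff."""
    return bp_kept_at_saturation / coverage_cutoff


# Hypothetical input: 5 Gbp retained at saturation with a cutoff of C=10
richness_mbp = estimate_richness_bp(5e9, 10) / 1e6
print(richness_mbp)
```

Per the caveats on the slide, erroneous reads inflate the retained basepairs (making the estimate an upper bound), while a data set that has not actually reached saturation yields only a lower bound.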
    32. 32. WGS saturation approach:  Tells us when we have enough sequence.  Can’t be predictive… if you haven’t sampled something, you can’t say anything about it. Can we correlate deep amplicon sequencing with shallower WGS?
    33. 33. Correlating 16s and shotgun seq Errors do not strongly affect saturation How much of 16s do you see… with how much shotgun sequencing
    34. 34. Data from Shakya et al., 2013 (pmid: 23387867) WGS saturation ~matches 16s saturation < rRNA copy number >
    35. 35. 16s region choice is not significant (?!) Data from Shakya et al., 2013 (pmid: 23387867)
    36. 36. Method is robust to organisms unsampled by amplicons. Insensitive to amplicon primer bias. Robust to genome size differences, eukaryotes, phage. Data from Shakya et al., 2013 (pmid: 23387867)
    37. 37. Can examine specific OTUs Data from Shakya et al., 2013 (pmid: 23387867)
    38. 38. OTU abundance is ~correct. Data from Shakya et al., 2013 (pmid: 23387867)
    39. 39. Running on real communities --
    40. 40. Running on real communities --
    41. 41. Thoughts on 16s/WGS comparison:  Robust to some real problems (primer bias; organisms unsampled by amplicon seq) & insensitive to 16s seq error.  Hopefully can be used to build a predictive framework to answer “how much more sequencing should I do?” ◦ Sensitivity: “What have I missed?” ◦ Planning: “How much $$ should I ask
    42. 42. Other things that y’all might be interested in:  Comparing 16s from amplicon and shotgun sequencing.  Metatranscriptome assembly protocol  Biogeography of genomic sequence
    43. 43. Metatranscriptome assembly (soil)

                          | Total Length (bp) | Total rRNA (bp)        | Total annotated by MG-RAST m5nr SEED
        Unassembled MetaT | 20,525,296,600    | 16,987,863,800 (82.8%) | 48,080,200 (0.23%)
        Assembled MetaT   | 32,471,548        | 7,061,913 (21.8%)      | 2,075,701 (6.4%)

        Aaron Garoutte (w/Tiedje & Howe)
    44. 44. Using shotgun sequence to cross-validate amplicon predictions. [Figure: bar chart; y-axis 0.00% to 40.00%; series: AMP/RDP, AMP/SILVA, WGS/RDP, WGS/SILVA, WGS/SILVA(LSU).] Amplicon seq missing Verrucomicrobia. Jaron Guo
    45. 45. Primer bias against Verrucomicrobia Check taxonomy of reads causing mismatch (A) Verrucomicrobia cause 70% (117/168) of mismatch Current primer is not effective at amplifying Verrucomicrobia Jaron Guo
    46. 46. Biogeography of genomic DNA How much genomic DNA is shared between different sites? Qingpeng Zhang
    47. 47. Biogeography of genomic DNA (2) How much genomic richness is shared between different sites? Qingpeng Zhang
    48. 48. Concluding thoughts  Tools and protocols for data analysis are fast becoming intrinsic to practice of biology. ◦ Most tools are wrong, but some are useful. ◦ All of our tools are openly, freely available in every way possible.  We are trying to make assembly fast, cheap, easy, and good.  We are building on our assembly-based approaches & intuition to tackle other questions.
    49. 49. Big Data is neither the real problem nor the solution.  Dealing with Big Data requires a new mentality, so training/experience is probably most effective way forward.  With sequencing, few if any of your biology problems go away, although some aspects may become more tractable.  Think future: any -ome you want from any sample you can get. …So now
    50. 50. Putting it in perspective: Total equivalent of ~1200 bacterial genomes Human genome ~3 billion bp We don’t know what most genes do. Total Assembly Total Contigs (> 300 bp) % Reads Assembled Predicted protein coding 2.5 bill 4.5 mill 19% 5.3 mill 3.5 bill 5.9 mill 22% 6.8 mill Howe et al, 2014; pmid 24632729
    51. 51. Potential discussion topics A. Funding and collaboration models. B. Leveraging data & computation to help understand gene function. C. Computational/data infrastructure …but planning for poverty, not wealth: sustainability and “bus factor”. D. Capacity building  Standardized data sets; data availability.  Workshops and training.
    52. 52. Training in data analysis et al.  Software Carpentry.  Data Carpentry.  STAMPS, EDAMAME, MSU NGS course.  <other courses go here>
