Metagenome assembly – part I

         C. Titus Brown
         ctb@msu.edu
About me
• Asst Prof at MSU, in CSE and Micro

• Software: http://github.com/ged-lab/

• Blog: http://ivory.idyll.org/blog/

• Pubs & grants:
  http://ged.msu.edu/interests.html
Tomorrow (talk #2)
             My research!

  Soil! Great Prairie Grand Challenge!

   MASSIVE AMOUNTS OF DATA!!!!

My research solves all your problems!! *

                         *
                             Results may vary. Terms and conditions apply.
Some basic assembly references
• “Assembly algorithms for next gen sequence
  data,” Miller et al., pmid 20211242

• Metagenome assembly tools:
  – MetaVelvet, pmid 22821567
  – MetaIDBA, pmid 21685107
  – SOAPdenovo, pmid 20511140


• My precious! khmer, pmid 22847406.
Illumina + metagenomic assembly
•   MetaHIT (2010): pmid 20203603
•   Rumen (2011): pmid 21273488
•   Permafrost (2011): pmid 22056985
•   Hydrothermal plumes (2012): pmid 22695863
•   HMP (2012): pmid 22699610

       Please let me know if I’ve missed any!
Culture independent methods
• Observation that 99% of microbes cannot easily be
  cultured in the lab. (“The great plate count anomaly”)
• While this is less true for host-associated microbes,
  culture independent methods are still important:
   –   Syntrophic relationships
   –   Niche-specificity or unknown physiology
   –   Dormant microbes
   –   Abundance within communities

       Single-cell sequencing & shotgun metagenomics are two
          common ways to investigate microbial communities.
Shotgun metagenomics
• Collect samples;

• Extract DNA;

• Feed into sequencer;

• Computationally analyze.

                     Wikipedia: Environmental shotgun sequencing.png
Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
     reassemble computationally.




                    UMD assembly primer (cbcb.umd.edu)
Shotgun sequencing & assembly
• Why assembly?
   – Assumption free (no reference needed)
   – Necessary for soil and marine; useful for host-associated?
   – Assembly can serve as reference for metatranscriptome interpretation

• Fragment, sequence, computationally assemble.

• What kind of results do you get?
   – Almost certainly chimerism between different strains; but still useful
     for gene content & operon structure.
   – Specificity seems high, but sensitivity is dependent on sequencing
     depth.

• Because of sampling rate, Illumina is primary choice for complex
  metagenomes.
Shotgun metagenomics: good news
• Cheap and easy to generate vast whole
  metagenome/metatranscriptome shotgun data sets from
  essentially any community you can sample.

• Such data can be quite interesting!
    – Low hanging fruit – correlation with diet, etc.
    – Still early days for observation of “pan genome” and functional
      content.

• Potential to illuminate or inform:
    – Dynamics and selective pressures of antibiotic resistance, virulence
      genes, and pathogenicity islands
    – Phage and viral communities
    – Community interactions.
Shotgun metagenomics: bad news
• Massive data needed for complex populations (tomorrow!)

• Computational techniques are still relatively immature
   –   Mapping to known genomes?
   –   Discovery of unknown genomes & strain variants?
   –   Sensitivity and specificity are hard to evaluate.
   –   Computational ecosystem is not that rich…

• Interpreting the data is still the bottleneck, of course.
   – Vast majority of genes not usefully annotated.
   – Reliance on specific reference databases, annotations.
   – Tools for (e.g.) inferring community interactions from
     community dynamics & functional capacity are desperately
     needed.
Assembly vs mapping
• No reference needed, for assembly!
  – De novo genomes, transcriptomes…

• But:
  – Scales poorly; need a much bigger computer.
  – Biology gets in the way (repeats!)
  – Need higher coverage

• But but:
  – Often your reference isn’t that great, so assembly
    may actually be the best/only way to go.
Assembly
            It was the best of times, it was the wor
             , it was the worst of times, it was the
               isdom, it was the age of foolishness
            mes, it was the age of wisdom, it was th



It was the best of times, it was the worst of times, it was the age
              of wisdom, it was the age of foolishness

               …but for lots and lots of fragments!
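A toy illustration of the idea on this slide, using the four fragments shown above. This is only a sketch under simplifying assumptions (exact matches, no errors, greedy merging of the longest suffix/prefix overlap); the function names are made up for the example, and real assemblers do not work this naively.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join: several contigs remain
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


FRAGMENTS = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]

print(greedy_assemble(FRAGMENTS))
```

Note that the repeated phrase "it was the" already produces a spurious overlap between the fourth and second fragments; here it is shorter than the true overlaps, so the greedy choice happens to avoid it.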
Assemble based on word overlaps:




Repeats cause problems:
Assembly – no subdivision!
Assembly is inherently an all by all process.
 There is no good way to subdivide the reads
without potentially missing a key connection
Short-read assembly
• Short-read assembly is problematic
• Relies on very deep coverage, ruthless read
  trimming, paired ends.




                           UMD assembly primer (cbcb.umd.edu)
Short read lengths are hard.




                Whiteford et al., Nuc. Acid Res, 2005
Short read lengths are hard.

                              Conclusion: even with
                              a read length of 200, the
                              E. coli genome cannot be
                              assembled completely.

                              Why?




                Whiteford et al., Nuc. Acid Res, 2005
Short read lengths are hard.

                              Conclusion: even with
                              a read length of 200, the
                              E. coli genome cannot be
                              assembled completely.

                              Why? REPEATS.

                              This is why paired-end
                              sequencing is so important
                              for assembly.




                Whiteford et al., Nuc. Acid Res, 2005
Four main challenges for de novo
               sequencing.
• Repeats.
• Low coverage.
• Errors

                   These introduce breaks in the
                     construction of contigs.

• Variation in coverage – transcriptomes and metagenomes, as well
  as amplified genomic.

   This challenges the assembler to distinguish between erroneous
             connections (e.g. repeats) and real connections.
Repeats
• Overlaps don’t place sequences uniquely
  when there are repeats present.




                          UMD assembly primer (cbcb.umd.edu)
Coverage
Easy calculation:

(# reads x avg read length) / genome size

So, for haploid human genome:

30m reads x 100 bp = 3 bn bp sequenced

3 bn bp / ~3 bn bp genome = ~1x coverage
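The calculation above as a small sketch. The function name is made up for illustration, and the genome size is rounded to 3 billion bp.

```python
def coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total bases sequenced / genome size."""
    return n_reads * read_length / genome_size


# 30 million reads of 100 bp against a ~3 Gbp haploid human genome
print(coverage(30_000_000, 100, 3_000_000_000))  # 1.0
```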
Coverage
• “1x” doesn’t mean every DNA sequence is
  read once.
• It means that, if sampling were systematic, it
  would be.
• Sampling isn’t systematic, it’s random!

        (What does ‘coverage’ mean, for
                metagenomes?)
Actual coverage varies widely from the
              average.
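One common way to reason about this is to assume reads land uniformly at random, so per-base depth is roughly Poisson-distributed around the average. That model is an assumption added here, not something stated on the slide, and real data usually varies more than it predicts. A minimal sketch:

```python
import math


def poisson_pmf(depth: int, mean_coverage: float) -> float:
    """Probability that a base is covered exactly `depth` times."""
    return math.exp(-mean_coverage) * mean_coverage ** depth / math.factorial(depth)


def fraction_with_depth_below(min_depth: int, mean_coverage: float) -> float:
    """Expected fraction of bases covered fewer than `min_depth` times."""
    return sum(poisson_pmf(d, mean_coverage) for d in range(min_depth))


for c in (1, 5, 10, 20):
    print(c, round(fraction_with_depth_below(1, c), 6))
```

Under this model, "1x" average coverage leaves a fraction exp(-1) of the bases, a bit over a third, unsequenced.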
Two basic assembly approaches
• Overlap/layout/consensus
• De Bruijn k-mer graphs



The former is used for long reads, esp. all Sanger-based
  assemblies. The latter is used for short reads because
               of its memory efficiency.
Overlap/layout/consensus
Essentially,
1. Calculate all overlaps.
2. Cluster based on overlap.
3. Do a multiple sequence alignment.




                          UMD assembly primer (cbcb.umd.edu)
K-mers
Essentially, break reads (of any length) down into
   multiple overlapping words of fixed length k.

ATGGACCAGATGACAC (k=12) =>

ATGGACCAGATG
 TGGACCAGATGA
  GGACCAGATGAC
   GACCAGATGACA
    ACCAGATGACAC
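The decomposition above, as a short sketch (the function name is chosen just for this example):

```python
def kmers(seq: str, k: int) -> list[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


for offset, word in enumerate(kmers("ATGGACCAGATGACAC", 12)):
    print(" " * offset + word)
```

A read of length L yields L - k + 1 k-mers, so the 16-base sequence gives five 12-mers.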
K-mers – what k to use?




                 Butler et al., Genome Res, 2009
K-mers – what k to use?




                 Butler et al., Genome Res, 2009
Big genomes are problematic




                  Butler et al., Genome Res, 2009
K-mer graphs - overlaps




                J.R. Miller et al. / Genomics (2010)
K-mer graph (k=14)




Each node represents a 14-mer;
Links between each node are 13-mer overlaps
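A minimal sketch of such a graph: nodes are k-mers, and an edge joins two k-mers that occur consecutively in a read (and therefore overlap by k-1 bases). It is simplified on purpose, ignoring reverse complements and sequencing errors, and the function name is an assumption for the example.

```python
def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k):
            left, right = read[i:i + k], read[i + 1:i + k + 1]
            graph.setdefault(left, set()).add(right)
            graph.setdefault(right, set())  # make sure the last k-mer is a node
    return graph


graph = build_kmer_graph(["ATGGACCAGATGACAC"], 14)
for node, successors in graph.items():
    print(node, "->", sorted(successors))

# Nodes with more than one successor are where the graph branches.
branch_points = [n for n, succ in graph.items() if len(succ) > 1]
print(branch_points)
```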
K-mer graph (k=14)




Branches in the graph represent partially overlapping sequences.
K-mer graph (k=14)




Single nucleotide variations cause long branches
K-mer graph (k=14)




Single nucleotide variations cause long branches;
            They don’t rejoin quickly.
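Why the branches are long: a single base change alters every k-mer that spans it, which is up to k consecutive k-mers. A small sketch with a made-up 40-base sequence and k=14 (both the sequence and the helper name are illustrative assumptions):

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """The set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 14
reference = "ATGGACCAGATGACACTTGCATCGGATTACCGTAGCTAAG"

# Introduce one substitution (A -> G) in the middle of the sequence.
position = 20
variant = reference[:position] + "G" + reference[position + 1:]

only_in_variant = kmer_set(variant, K) - kmer_set(reference, K)
only_in_reference = kmer_set(reference, K) - kmer_set(variant, K)

# Each side of the branch is K k-mers long before the paths rejoin.
print(len(only_in_variant), len(only_in_reference))
```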
K-mer graphs - branching




For decisions about which paths etc, biology-based
           heuristics come into play as well.
K-mer graph complexity - spur



       (Short) dead-end in graph.

Can be caused by error at the end of some
    overlapping reads, or low coverage

                          J.R. Miller et al. / Genomics (2010)
K-mer graph complexity - bubble




Multiple parallel paths that diverge and join.

      Caused by sequencing error and true
      polymorphism / polyploidy in sample.
                                J.R. Miller et al. / Genomics (2010)
K-mer graph complexity – “frayed
             rope”



   Converging, then diverging paths.

   Caused by repetitive sequences.


                         J.R. Miller et al. / Genomics (2010)
Resolving graph complexity
• Primarily heuristic (approximate) approaches.

• Detecting complex graph structures can generally
  not be done efficiently.

• Much of the divergence in functionality of new
  assemblers comes from this.

• Three examples:
Read threading




Single read spans k-mer graph => extract the
               single-read path.



                            J.R. Miller et al. / Genomics (2010)
Mate threading




Resolve “frayed-rope” pattern caused by
  repeats, by separating paths based on mate-
  pair reads.
                             J.R. Miller et al. / Genomics (2010)
Path following




Reject inconsistent paths based on mate-pair
             reads and insert size.

                            J.R. Miller et al. / Genomics (2010)
More assembly issues
• Many parameters to optimize!

• Metagenomes have variation in copy number; naïve
  assemblers can treat this as repetitive and eliminate it.

• Assembly requires gobs of memory (4 lanes, 60m reads
  => ~150 GB RAM)

• How do we evaluate assemblies?
   – What’s the best assembler?
Metagenomics: Mixed community
          sampling




          Coverage distribution
Conclusions re mixed community
              sampling
  In shotgun metagenomics, you are sampling
      randomly from the mixed population.

  Therefore, the lowest abundance member of
the population (that you want to observe) drives
       the required depth of sequencing!

  1 in a million => ~50 Tbp sequencing for 10x
                     coverage.
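A back-of-the-envelope sketch of where a number like this can come from. The 5 Mbp genome size is an assumption made for the example (the slide does not state one), and abundance is treated as the organism's share of the sequenced DNA.

```python
def required_sequencing_bp(genome_size_bp: float,
                           relative_abundance: float,
                           target_coverage: float) -> float:
    """Total bases to sequence so that one community member reaches target coverage."""
    return genome_size_bp * target_coverage / relative_abundance


# Assumed 5 Mbp genome, present at 1 in a million, target 10x coverage
total = required_sequencing_bp(5_000_000, 1e-6, 10)
print(f"{total / 1e12:.0f} Tbp")
```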
‘k’ parameter sets effective coverage.
            Simulated data set.




                    coverage
Conclusions re ‘k’ parameter
• The previous slide shows you coverage histograms for per-
  base (mapping) coverage, as well as k-mer distributions.

• People will tell you k is about specificity: a longer ‘k’ is more
  stringent and requires a more specific overlap between
  reads.

• However, the practical effect of increasing k is to lower
  your effective coverage.

• This is one (the?) reason why different ‘k’ parameters can
  give you different subsets of the metagenomic population.
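A commonly used relation between per-base coverage C and k-mer coverage for reads of length L is C_k = C x (L - k + 1) / L, since each read contributes only L - k + 1 k-mers. The sketch below applies it under the assumptions of error-free, fixed-length reads; the function name is chosen for the example.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Effective k-mer coverage for fixed-length, error-free reads."""
    return base_coverage * (read_length - k + 1) / read_length


# 20x per-base coverage with 100 bp reads
for k in (21, 31, 51, 71):
    print(k, kmer_coverage(20, 100, k))
```

At a fixed sequencing depth, raising k pushes low-abundance members of the community below whatever coverage the assembler needs, which is one way different k values end up recovering different subsets.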
Assembly depends on high coverage
     HMP mock community assembly; Velvet-based protocol.
Conclusions from previous slide
• To recover any contigs at all, you need
  coverage > 10 (green line).
• To recover long sequences, you want
  coverage > 20 (blue line).
Assemblers fail to assemble complex
       regions into contigs.
Conclusions from previous slide
• Contig assemblers don’t like “complex” regions in the
  graph (repeats, high polymorphism, etc.)
• They will simply end the contig there.

• This is why you need paired-end sequencing and
  scaffolding.

• Friends don’t let friends scaffold metagenomic data 
   – See rumen paper, Hess et al., pmid 21273488, for
     discussion.
What do metagenomic assemblers do?
    MetaVelvet and MetaIDBA (and khmer)
  “partition” the assembly graph into sections
 from different organisms, and then assemble
                those individually.

This allows them to adjust coverage parameters
                   “locally”.
MetaVelvet & partitioning
Errors
200x coverage – but most k-mers are from errors!
Conclusions from previous slide
• For a simulated data set with coverage of 200,
  the vast majority (80%) of unique k-mers are low-
  abundance and caused by errors.

• Errors cause major problems for de Bruijn graph
  assemblers.

• For genomes, you can trim off low-abundance k-
  mers. For metagenomes, that removes real data.
  Dilemma.
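A toy sketch of k-mer abundance counting and of the trimming decision described above. The reads, the cutoff, and the function names are invented for illustration; this is not how khmer or any particular assembler implements it.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_abundance_fraction(counts, cutoff):
    """Fraction of distinct k-mers seen fewer than `cutoff` times."""
    low = sum(1 for c in counts.values() if c < cutoff)
    return low / len(counts)


# Five error-free copies of a read plus one copy with a single-base error
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, 4)

print(low_abundance_fraction(counts, 2))

# Trimming is safe for a single genome at high coverage, but in a metagenome
# a rare organism's real k-mers look exactly like these error k-mers.
trimmed = {kmer: c for kmer, c in counts.items() if c >= 2}
print(sorted(trimmed))
```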
For genomes, you can trim low-
      abundance k-mers.
         Not so for metagenomes…




               coverage
Some concluding thoughts (day 1)
• Opinions:
  – For polymorphic/strain variants, contig assembly is
    more likely to fail to produce a contig than it is to
    produce a chimera (contig assembly is specific).
  – Scaffolding, at least with Velvet/MetaVelvet, seems to
    be prone to producing chimeras.
  – My biggest concern with metagenome assembly is not
    specificity but rather sensitivity.
  – We know so little about most environments that we
    have no good way of assessing what we’re missing.
Some more concluding thoughts
• Assembly is a gigantic black box into which you
  feed your data, and out of which comes…
  something.

• Think hard about how to evaluate the results and
  be prepared to spend lots of time doing so.

• Your computation is part of your science! If
  you’re just running someone else’s program
  blindly, you’re doing it wrong.

2012 stamps-mbl-1


Editor's Notes

  • #27 High coverage is essential.
  • #30 Note, no tolerance for indels