Climbing Mt. Metagenome
Assembling very large metagenomes from Illumina short reads.

  • Speaker note on CAMERA annotation: briefly, all six open reading frames (ORFs) were translated by ORF_finder (or ORFs were predicted by MetaGene) using translation table 11, with a minimum length of 30 aa. The ORFs were clustered at 90% identity (default) to identify non-redundant sequences, which were further clustered into families at a conservative threshold of 60% identity (default) over 80% (default) of ORF length. The resulting ORFs were annotated against Pfam and TIGRFAM with HMMER (accelerated with Hammerhead), and against COG with RPS-BLAST at e-values below 0.001. GO annotations were mapped from Pfam or TIGRFAM, and EC numbers were mapped from the GO database.
Climbing Mt. Metagenome: Presentation Transcript

  • Scaling Mt. Metagenome: Assembling very large data sets
    C. Titus Brown
    Assistant Professor
    Computer Science and Engineering /
    Microbiology and Molecular Genetics
    Michigan State University
  • Thanks for coming!
    Note: this talk is about the computational side of metagenome assembly, motivated by the Great Prairie Grand Challenge soil sequencing project.
    Jim Tiedje will talk about the project as a whole at the JGI User’s Meeting.
  • The basic problem.
    Lots of metagenomic sequence data
    (200 GB Illumina for < $20k?)
    Assembly, especially metagenome assembly, scales poorly (due to high diversity).
    Standard assembly techniques don’t work well with sequences from multiple abundance genomes.
    Many people don’t have the computational resources needed to assemble at all (~1 TB of RAM or more).
  • We can’t just throw more hardware at the problem, either.
    Lincoln Stein
  • Jumping to the end:
    We have implemented a solution for these problems:
    Scalability of assembly,
    Lack of resources,
    and parameter choice.
    We demonstrate this solution for a high diversity sample (219.1 Gb of Iowa corn field soil metagenome).
    …there is an additional surprise or two, so you should stick around!
  • Whole genome shotgun sequencing & assembly
    Randomly fragment & sequence from DNA;
    reassemble computationally.
    UMD assembly primer (cbcb.umd.edu)
  • K-mer graphs - overlaps
    J.R. Miller et al. / Genomics (2010)
  • K-mer graphs - branching
    For decisions about which paths etc, biology-based heuristics come into play as well.
  • Too much data – what can we do?
    Reduce the size of the data (either with an approximate or an exact approach)
    Divide & conquer: subdivide the problem.
    For exact data reduction or subdivision, need to grok the entire assembly graph structure.
    …but that is why assembly scales poorly in the first place.
  • Abundance filtering
    Approach used in two published Illumina metagenomic papers (MetaHIT/human microbiome and rumen papers)
    Remove or trim reads with low-abundance k-mers
    Either due to errors, or low-abundance organisms.
    Inexact data reduction: may or may not remove usable data.
    Works well for high-coverage data sets (rumen est. 56x!)
    However, for low-coverage or high-diversity data sets, abundance filtering will reject potentially useful reads.
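As an illustration only (not khmer's actual implementation; the function names here are hypothetical), abundance filtering can be sketched with an exact k-mer counting pass:

```python
from collections import Counter

def kmers(seq, k):
    """Yield all k-length substrings of a read."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def abundance_filter(reads, k, min_count=2):
    """Drop reads containing any k-mer seen fewer than min_count times.

    This is an inexact reduction: a low-abundance k-mer may be a
    sequencing error *or* a genuinely rare organism, so useful reads
    from low-coverage/high-diversity samples can be lost.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [read for read in reads
            if all(counts[km] >= min_count for km in kmers(read, k))]
```

A production implementation counts k-mers in a constant-memory probabilistic structure rather than an exact in-memory Counter, which is exactly the scaling problem the rest of the talk addresses.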
  • Abundance filtering
  • Two exact data reduction techniques:
    Eliminate reads that do not connect to many other reads.
    Group reads by connectivity into different partitions of the entire graph.
    For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
  • Eliminating unconnected reads
    “Graphsize filtering”
  • Subdividing reads by connection
    “Partitioning”
  • Two exact data reduction techniques:
    Eliminate reads that do not connect to many other reads (“graphsize filtering”).
    Group reads by connectivity into different partitions of the entire graph (“partitioning”).
    For k-mer graph assemblers like Velvet and ABYSS, these are exact solutions.
  • Engineering overview
    Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure;
    With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k.
    Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes).
    For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
  • Store graph nodes in Bloom filter
    Graph traversal is done in full k-mer space;
    Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
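A toy Python version of this idea (a hypothetical class, not the khmer API; khmer's C++ implementation uses its own DNA hashing and prime-sized tables):

```python
import hashlib

class KmerPresenceFilter:
    """Several hash tables with no collision tracking (a Bloom filter).

    A 'no' answer is guaranteed correct; a 'yes' answer may be a false
    positive, with probability roughly the product of the tables'
    fractional occupancies.
    """

    def __init__(self, table_sizes=(999983, 999979)):
        self.sizes = table_sizes                      # ideally coprime sizes
        self.tables = [bytearray(size) for size in table_sizes]

    def _indexes(self, kmer):
        # One big hash, reduced modulo each table size.
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest(), "big")
        return [h % size for size in self.sizes]

    def add(self, kmer):
        for table, i in zip(self.tables, self._indexes(kmer)):
            table[i] = 1

    def __contains__(self, kmer):
        return all(table[i]
                   for table, i in zip(self.tables, self._indexes(kmer)))
```

Because only hashed positions are stored, memory per (unique) k-mer is a constant ~1-2 bytes regardless of k, which is what makes arbitrary-k graphs feasible.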
  • Practical application
    Enables:
    graph trimming (exact removal)
    partitioning (exact subdivision)
    abundance filtering
    … all for K <= 64, for 200+ gb sequence collections.
    All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores.
    Similar running times to using Velvet alone.
  • We pre-filter data for assembly:
  • Does removing small graphs work?
    Small data set (35m reads / 3.4 gb rhizosphere soil sample)
    Filtered at k=32, assembled at k=33 with ABYSS
    N contigs / total bp / largest contig:
    Unfiltered (35m reads): 130 / 223,341 / 61,766
    Filtered (2m reads):    130 / 223,341 / 61,766
    YES.
  • Does partitioning into disconnected graphs work?
    Partitioned same data set (35m reads / 3.5 gb) into
    45k partitions containing > 10 reads; assembled
    partitions separately (partitioned at k=32, assembled at k=33).
    N contigs / total bp / largest contig:
    Unfiltered (35m reads): 130 / 223,341 / 61,766
    Sum of partitions:      130 / 223,341 / 61,766
    YES.
  • Data reduction for assembly / practical details
    Reduction performed on machine with 16 gb of RAM.
    Removing poorly connected reads: 35m -> 2m reads.
    - Memory required reduced from 40 gb to 2 gb;
    - Time reduced from 4 hrs to 20 minutes.
    Partitioning reads into disconnected groups:
    - Biggest group is 300k reads
    - Memory required reduced from 40 gb to 500 mb;
    - Time reduced from 4 hrs to < 5 minutes/group.
  • Does it work on bigger data sets?
    35 m read data set partition sizes:
    P1: 277,043 reads
    P2: 5776 reads
    P3: 4444 reads
    P4: 3513 reads
    P5: 2528 reads
    P6: 2397 reads

    Iowa continuous corn GA2 partitions (218.5 m reads):
    P1: 204,582,365 reads
    P2: 3583 reads
    P3: 2917 reads
    P4: 2463 reads
    P5: 2435 reads
    P6: 2316 reads

  • Problem: big data sets have one big partition!?
    Too big to handle on EC2.
    Assembles with low coverage.
    Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage
    As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble!
    Both for our approach,
    And possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size)
  • Why this lump?
    Real biological connectivity (rRNA, conserved genes, etc.)
    Bug in our software
    Sequencing artifact or error
  • Why this lump?
    Real biological connectivity? Probably not.
    - Increasing K from 32 to ~64 didn’t break up the lump: not biological.
    Bug in our software? Probably not.
    • We have a second, completely separate approach & implementation that confirmed the lump (bleu, by Rosangela Canino-Koning)
    Sequencing artifact or error? YES.
    - (Note, we do filter & quality trim all sequences already)
  • “Good” vs “bad” assembly graph
    Low density
    High density
  • Non-biological levels of local graph connectivity:
  • Higher local graph density correlates with position in read
  • Higher local graph density correlates with position in read
    ARTIFACT
  • Trimming reads
    Trim at high “soddd” (sum of degree-of-degree distribution):
    From each k-mer in each read, walk two k-mers in all directions in the graph;
    If more than 3 k-mers can be found at exactly two steps, trim the remainder of the sequence.
    Overly stringent; actually trimming the (k-1) connectivity graph by degree.
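A rough sketch of this trimming rule (hypothetical helper names; the actual khmer code differs in details), representing the k-mer graph as a plain set of k-mers:

```python
def neighbors(kmer, graph):
    """k-mers one step away in the k-mer graph (graph = set of k-mers)."""
    found = set()
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in graph:
                found.add(cand)
    return found

def soddd_trim(read, graph, k, max_at_two_steps=3):
    """Trim a read at the first k-mer whose two-step graph neighborhood
    exceeds max_at_two_steps distinct k-mers (non-biological density)."""
    for i in range(len(read) - k + 1):
        km = read[i:i + k]
        one_step = neighbors(km, graph)
        two_step = set().union(*(neighbors(n, graph) for n in one_step))
        two_step -= one_step | {km}
        if len(two_step) > max_at_two_steps:
            return read[:i + k]   # keep through the offending k-mer
    return read
```

A simple linear path in the graph never exceeds the threshold, so clean reads pass through untrimmed; only reads entering dense, knot-like regions get cut.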
  • Trimmed read examples
    >895:5:1:1986:16019/2
    TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT
    CGACCTGGGCCAACCGATGCGCC
    >895:5:1:1995:6913/1
    TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC
    GCGATG
    >895:5:1:1995:6913/2
    GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT
    GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
  • Preferential attachment due to bias
    Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
    These artifacts will then connect that group of reads to all other groups possessing artifacts;
    …and all high-coverage contigs will amalgamate into a single graph.
  • Artifacts from sequencing falsely connect graphs
  • Preferential attachment due to bias
    Any sufficiently large collection of connected reads will have one or more reads containing an artifact;
    These artifacts will then connect that group of reads to all other groups possessing artifacts;
    …and all high-coverage contigs will amalgamate into a single graph.
  • Groxel view of knot-like region / Arend Hintze
  • Density trimming breaks up the lump:
    Old P1, soddd-trimmed
    (204.6 m reads -> 179 m):
    P1: 23,444,332 reads
    P2: 60,703 reads
    P3: 48,818 reads
    P4: 39,755 reads
    P5: 34,902 reads
    P6: 33,284 reads

    Untrimmed partitioning (218.5 m reads):
    P1: 204,582,365 reads
    P2: 3583 reads
    P3: 2917 reads
    P4: 2463 reads
    P5: 2435 reads
    P6: 2316 reads

  • What does density trimming do to assembly?
    204 m reads in lump:
    assembles into 52,610 contigs;
    total 73.5 MB
    180 m reads in trimmed lump:
    assembles into 57,135 contigs;
    total 83.6 MB
    (all contigs > 1kb)
    Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0
  • Wait, what?
    Yes, trimming these “knot-like” sequences improves the overall assembly!
    We remove 25.6 m reads and gain 10.1 MB!?
    Trend is the same for ABySS, another k-mer graph assembler, as well.
  • Is this a valid assembly?
    Paired-end usage is good.
    50% of contigs have BLASTX hit better than 1e-20 in Swissprot;
    75% of contigs have BLASTX hit better than 1e-20 in TrEMBL;
    Reference genomes sequenced by JGI:
    Frateuria aurantia: 1376 hits > 100 aa
    Saprospira grandis: 1114 hits > 100 aa
    (> 50% identity over > 50% of gene)
  • So what’s going on?
    Current assemblers are bad at dealing with certain graph structures (“knots”).
    If we can untangle knots for them, that’s good, maybe?
    Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves?
    Happens with other k-mer graph assemblers (ABYSS), and with at least one other (non-metagenomic) data set.
  • OK, let’s assemble!
    Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
    148,053 contigs,
    in 220 MB;
    max length 20322
    max coverage ~10x
    …all done on Amazon EC2, ~ 1 week for under $500.
    Filtered/partitioned @ k=32, assembled @ k=33, exp_cov=auto, cov_cutoff=0
  • Full Iowa corn / mapping stats
    1,806,800,000 QC/trimmed reads (1.8 bn)
    204,900,000 reads map to some contig (11%)
    37,244,000 reads map to contigs > 1kb (2.1%)
    > 1 kb contig is a stringent criterion!
    Compare:
    80% of MetaHIT reads map to contigs > 500 bp;
    65%+ of rumen reads map to contigs > 1 kb
  • Percentage mapped vs contig size
  • High coverage partitions assemble more reads
  • Success, tentatively.
    We are still evaluating assembly and assembly parameters; should be possible to improve in every way.
    (~10 hrs to redo entire assembly, once partitioned.)
    The main engineering point is that we can actually run this entire pipeline on a relatively small machine
    (8 core/68 GB RAM)
    We can do dozens of these in parallel on Amazon rental hardware.
    And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
  • Optimizing per-partition assembly
    Metagenomes contain mixed-abundance genomes.
    Current assemblers are not built for mixed-abundance samples (problem with mRNAseq, too).
    Repeat resolution
    Error/edge trimming
    Since we’re breaking the data set into multiple partitions containing reads that may assemble together, can we optimize assembler parameters (k, coverage) for each partition?
  • Mixing parameters improves assembly statistics
    Objective function: maximize sum(contigs > 1kb)
    4.5x average coverage – gained 228 contigs/469 kb
    (over 152/215 kb)
    5.8x average coverage – gained 78 contigs/304 kb
    (over 248/708 kb)
    8.2x average coverage – lost 58 contigs /gained 116 kb
    (over 279/803 kb)
  • Conclusions
    Engineering: can assemble large data sets.
    Scaling: can assemble on rented machines.
    Science: can optimize assembly for individual partitions.
    Science: retain low-abundance sequence.
  • Caveats
    Quality of assembly??
    Illumina sequencing bias/error issue needs to be explored.
    Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.
    Need to better analyze upper limits of data structures.
    Have not applied our approaches to high-coverage data yet; in progress.
  • Future thoughts
    Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly.
    Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future.
    This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence.
    Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …)
  • Acknowledgements
    The k-mer gang:
    Adina Howe
    Jason Pell
    Rosangela Canino-Koning
    Qingpeng Zhang
    Arend Hintze
    Collaborators:
    Jim Tiedje (Il padrino)
    Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)
    Charles Ofria (MSU)
    Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
  • A guide to khmer
    Python wrapping C++; BSD license.
    Tools for:
    K-mer abundance filtering (constant mem; inexact)
    Assembly graph size filtering (constant mem; exact)
    Assembly graph partitioning (exact)
    Error trimming (constant mem; inexact)
    Still in alpha form… and largely undocumented.
  • k-mer coverage by partition
  • Abundance filtering affects low-coverage contigs dramatically
  • Many read pairs map together
  • Bonus slides
    How much more do we need to sequence, anyway??
  • Calculating expected k-mer numbers
    (Diagram: two samples, S1 and S2, drawn from the entire population.)
    Note: no simple way to correct abundance bias, so we don’t, yet.
  • Coverage estimates
    (Based on k-mer mark/recapture analysis.)
    Iowa prairie (136 GB): est. 1.26x
    Iowa corn (62 GB): est. 0.86x
    Wisconsin corn (190 GB): est. 2.17x
    For comparison, the panda genome assembly
    used ~50x with short reads.
    Qingpeng Zhang
  • Coverage estimates: getting to 50x…
    Human -> 150 GB for 50x
    Iowa prairie (136 GB): est 1.26 x -> 5.4 TB for 50x
    Iowa corn (62 GB): est 0.86 x -> 3.6 TB for 50x
    Wisconsin corn (190 GB): est 2.17 x -> 4.4 TB for 50x
    …note that it’s not clear what “coverage” exactly means in this case, since 16s-estimated diversity is very high.
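The terabyte figures above are simple linear extrapolation, assuming coverage grows linearly with bases sequenced (a sketch; the function name is illustrative):

```python
def gb_for_target(current_gb, current_cov, target_cov=50):
    """Extrapolate sequencing needed to reach a target coverage,
    assuming coverage scales linearly with bases sequenced."""
    return current_gb / current_cov * target_cov

# e.g. Iowa prairie: 136 GB at ~1.26x extrapolates to ~5.4 TB for 50x
```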
  • What does coverage mean here?
    “Unseen” sequence:
    1x ~ 37%
    2x ~ 14%
    5x ~ 0.7%
    10x ~ 0.005%
    50x ~ 2e-20%
    For metagenomes, coverage is of abundance weighted DNA.
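These percentages are consistent with a Poisson coverage model, in which the fraction of (abundance-weighted) sequence receiving zero reads at mean coverage c is e^(-c):

```python
import math

def unseen_fraction(coverage):
    """Poisson model: fraction of sequence covered by zero reads
    at the given mean coverage."""
    return math.exp(-coverage)
```

For example, unseen_fraction(1) is about 0.37 and unseen_fraction(50) is about 2e-22, matching the 1x and 50x rows above.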
  • CAMERA Annotation of full set contigs (>1000 bp)
    # of ORFS: 344,661 (Metagene)
    Longest ORF: 1,974 bp
    Shortest ORF: 20 bp
    Average ORF: 173 bp
    # of COG hits: 153,138 (e-value < 0.001)
    # of Pfam hits: 170,072
    # of TIGRfam hits: 315,776
  • CAMERA COG Summary
  • The k-mer oracle
    Q: is this k-mer present in the data set?
    A: no => then it is not.
    A: yes => it may or may not be present.
    This lets us store k-mers efficiently.
  • Building on the k-mer oracle:
    Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
  • The k-mer graph oracle
    Q: does this k-mer overlap with this other k-mer?
    A: no => then it does not, guaranteed.
    A: yes => it may or may not.
    This lets us traverse de Bruijn graphs efficiently.
  • The contig size oracle
    Q: could this read contribute to a contig bigger than N?
    A: no => then it does not, guaranteed.
    A: yes => then it might.
    This lets us eliminate reads that do not belong to “big” contigs.
  • The read partition oracle
    Does this read connect to this other read in any way?
    A: no => then it does not, guaranteed.
    A: yes => then it might.
    This lets us subdivide the assembly problem into many smaller, disconnected problems that are much easier.
  • Oracular fact
    All of these oracles are cheap, yield answers with one-sided error, and can be “chained” together (so you can keep asking oracles for as long as you want, and get more and more accurate answers).
  • Implementing a basic k-mer oracle
    Conveniently, perhaps the simplest data structure in computer science is what we need…
    …a hash table that ignores collisions.
    Note, P(false positive) = fractional occupancy.
  • A more reliable k-mer oracle
    Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable.
  • Scaling the k-mer oracle
    Adding additional filters increases discrimination at the cost of speed.
    This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
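Concretely: if each table has fractional occupancy f, then n independent tables in series give a false-positive rate of roughly f^n (a back-of-envelope model, assuming independent hash functions):

```python
def false_positive_rate(occupancy, n_tables):
    """Approximate FP rate of n collision-ignoring hash tables in
    series, each at the given fractional occupancy (independence
    of the hash functions is assumed)."""
    return occupancy ** n_tables
```

One half-full table lies 50% of the time; four half-full tables together lie only about 6% of the time, at 4x the hashing cost.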
  • The k-mer oracle, revisited
    We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately.
    This implicitly lets us store the graph structure, too!
  • B. Partitioning graphs into disconnected subgraphs
    Which nodes do not connect to each other?
  • Partitioning graphs – it looks easy
    Which nodes do not connect to each other?
  • But partitioning big graphs is expensive
    Requires exhaustive exploration.
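The exhaustive approach amounts to labeling connected components by traversal (a sketch with hypothetical names): every node and edge must be visited, which is what makes global partitioning expensive at the multi-billion-node scale of these graphs.

```python
from collections import deque

def partition(nodes, neighbors):
    """Assign each node a partition id; two nodes share an id iff
    they are connected.

    `neighbors(n)` returns the nodes adjacent to n. Breadth-first
    search touches every node and edge once: O(V + E) work, with
    essentially no memory locality on huge graphs.
    """
    part = {}
    next_id = 0
    for start in nodes:
        if start in part:
            continue                      # already labeled
        part[start] = next_id
        queue = deque([start])
        while queue:
            for m in neighbors(queue.popleft()):
                if m not in part:
                    part[m] = next_id
                    queue.append(m)
        next_id += 1
    return part
```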
  • Tabu search – avoid global searches
  • Tabu search – systematic local exploration
  • Strategies for completing big searches…
  • Hard-to-traverse graphs are well-connected
  • Add neighborhood-exclusion to tabu search
  • Exclusion strategy lets you systematically explore big graphs with a local algorithm
  • Potential problems
    Our oracle can mistakenly connect clusters.
  • Potential problems
    This is a problem if the rate is sufficiently high!
  • However, the error is one-sided:
    Graphs will never be erroneously disconnected
  • The error is one-sided:
    Nodes will never be erroneously disconnected.
    This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers.
    This, in turn, lets us reliably partition graphs into smaller graphs.
  • Actual implementation