Goal: do first-stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Sea lamprey life cycle: larvae live in stream bottoms for 3-6 years; parasitic adults move to the Great Lakes and feed for 12-20 months (5-8 years total). Each parasite consumes ~40 lbs of fish over its lifetime. ~98% of the fish in the Great Lakes disappeared!
Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to, e.g., marine free-spawning animals.
Building better genomes, transcriptomes, and metagenomes with improved techniques for de novo assembly -- an easier way to do it
C. Titus Brown
Assistant Professor, CSE, MMG, BEACON
Michigan State University
December 2012
firstname.lastname@example.org
Acknowledgements
Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald
Collaborators: Jim Tiedje, MSU; Erich Schwarz, Caltech/Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li
Funding: USDA NIFA; NSF IOS; BEACON
We practice open science! "Be the change you want." Everything discussed here:
- Code: github.com/ged-lab/ ; BSD license
- Blog: http://ivory.idyll.org/blog ('titus brown blog')
- Twitter: @ctitusbrown
- Grants on lab web site: http://ged.msu.edu/interests.html
- Preprints: on arXiv, q-bio: 'diginorm arxiv'
My interests
I work primarily on organisms of agricultural, evolutionary, or ecological importance, which tend to have poor reference genomes and transcriptomes. Focus on:
- Improving assembly sensitivity to better recover genomic/transcriptomic sequence, often from "weird" samples.
- Scaling sequence assembly approaches so that huge assemblies are possible and big assemblies are straightforward.
"Weird" biological samples (sample type -> coverage problem):
- Single genome: hard-to-sequence DNA (e.g. GC/AT bias)
- Transcriptome: differential expression!
- High-polymorphism data: multiple alleles
- Whole genome amplified: often extreme amplification bias (next slide)
- Metagenome (mixed microbial community): differential abundance within community
Shotgun sequencing and coverage
"Coverage" is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 -- just draw a line straight down from the top through all of the reads.
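As a quick worked version of this definition (a hypothetical example; all numbers below are illustrative, not from any dataset in this talk), average coverage is just total sequenced bases divided by genome size:

```python
# Coverage C = N * L / G: average number of reads overlapping each true base.
# Illustrative values only.
n_reads = 1_000_000       # N: number of reads
read_len = 100            # L: read length (bp)
genome_size = 10_000_000  # G: genome size (bp)

coverage = n_reads * read_len / genome_size
print(coverage)  # -> 10.0, i.e. ~10x, as in the figure above
```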
Non-normal coverage distributions lead to decreased assembly sensitivity
Many assemblers embed a "coverage model" in their approach:
- Genome assemblers: abnormally low coverage is erroneous; abnormally high coverage is repetitive sequence.
- Transcriptome assemblers: isoforms should have the same coverage across the entire isoform.
- Metagenome assemblers: differing abundances indicate different strains.
Is there a different way? (Yes.)
K-mer based assemblers scale poorly
Why do big data sets require big machines?
Memory usage ~ "real" variation + number of errors
Number of errors ~ size of data set
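To see why errors dominate memory, note that a single base error in one read can create up to k novel k-mers that the assembler must store. A minimal sketch (synthetic reads, illustrative k):

```python
# One substitution error creates up to K spurious k-mers, so distinct k-mer
# counts -- and hence assembler memory -- grow with dataset size (errors),
# not just with real variation.
K = 20

def kmers(seq):
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}

true_read = "A" * 30 + "C" * 30 + "G" * 30
err_read = true_read[:45] + "T" + true_read[46:]  # single base error

print(len(kmers(err_read) - kmers(true_read)))  # -> 20 novel k-mers
```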
Why does efficiency matter?
It is now cheaper to generate sequence than it is to analyze it computationally -- in both machine time and (wo)man power/time.
- More efficient programs allow better exploration of analysis parameters for maximizing sensitivity.
- Better or more sensitive bioinformatic approaches can be developed on top of more efficient theory.
Approach: digital normalization (a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B, you need to collect 100x coverage of A. Overkill! That 100x of A consumes disk space and, because of errors, memory. We can discard it for you...
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
- Is single pass: looks at each read only once;
- Does not "collect" the majority of errors;
- Keeps all low-coverage reads;
- Smooths out coverage of regions.
Coverage before digital normalization (MDA-amplified sample): [figure]
Coverage after digital normalization: [figure]
- Normalizes coverage
- Discards redundancy
- Eliminates the majority of errors
- Scales assembly dramatically
- Assembly is 98% identical
Digital normalization approach
A digital analog to cDNA library normalization, diginorm is a read prefiltering approach that:
- Is single pass: looks at each read only once;
- Does not "collect" the majority of errors;
- Keeps all low-coverage reads;
- Smooths out coverage of regions.
(A sketch of the core loop follows below.)
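Concretely, the algorithm keeps a read only when its estimated coverage so far is below a fixed cutoff C. Below is a minimal sketch, not the production implementation: khmer's normalize-by-median uses a memory-efficient probabilistic counting structure, while this uses a plain Python dict; K and CUTOFF are illustrative.

```python
# Minimal sketch of the digital normalization loop (after Brown et al.,
# arXiv:1203.4802). A plain dict stands in for khmer's probabilistic
# counting structure; reads are assumed to be longer than K.
K = 20
CUTOFF = 20  # desired coverage C

kmer_counts = {}

def median_kmer_count(seq):
    """Estimate a read's coverage as the median count of its k-mers."""
    counts = sorted(kmer_counts.get(seq[i:i + K], 0)
                    for i in range(len(seq) - K + 1))
    return counts[len(counts) // 2]

def normalize(reads):
    """Single pass: keep a read only if its estimated coverage < CUTOFF."""
    for read in reads:
        if median_kmer_count(read) < CUTOFF:
            for i in range(len(read) - K + 1):
                kmer = read[i:i + K]
                kmer_counts[kmer] = kmer_counts.get(kmer, 0) + 1
            yield read  # retained: this region is not yet sampled to C
        # else: discard -- redundant coverage (and most late-arriving errors)
```

Note how errors fall out for free: an erroneous read from an already-saturated region is simply discarded, which is why diginorm does not "collect" the majority of errors.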
Contig assembly is significantly more efficient and now scales with underlying genome size.
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
Some diginorm examples:
1. Assembly of the H. contortus parasitic nematode genome -- a "high polymorphism" problem.
2. Reference-free assembly of the lamprey (P. marinus) transcriptome -- a "big assembly" problem.
3. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie -- the "impossible" assembly problem.
1. The H. contortus problem
- A sheep parasite with a ~350 Mbp genome.
- Sequenced DNA from 6 individuals after whole genome amplification; estimated 10% heterozygosity (!?)
- Significant bacterial contamination.
(w/Robin Gasser, Paul Sternberg, and Erich Schwarz)
H. contortus life cycle
Refs: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868; Prichard and Geary (2008), Nature 452, 157-158.
The power of next-gen sequencing: get 180x coverage... and then watch your assemblies never finish.
Libraries built and sequenced:
- 300-nt inserts, 2x75-nt paired-end reads
- 500-nt inserts, 2x75- and 2x100-nt paired-end reads
- 2-kb, 5-kb, and 10-kb inserts, 2x49-nt paired-end reads
Nothing would assemble at all until filtered for basic quality. Filtering let the ≤500-nt insert libraries assemble in a mere week, but the 2+ kb insert libraries would not assemble even then.
Assembly after digital normalization
- Diginorm readily enabled assembly of a 404 Mbp genome with N50 of 15.6 kb.
- Post-processing with GapCloser and SOAPdenovo scaffolding led to a final assembly of 453 Mbp with N50 of 34.2 kb.
- CEGMA estimates a 73-94% complete genome.
Diginorm helped by:
- Suppressing high polymorphism, especially in repeats;
- Eliminating 95% of sequencing errors;
- "Squashing" coverage variation from whole genome amplification and bacterial contamination.
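For readers unfamiliar with the N50 metric quoted above: it is the contig length at which contigs of that length or longer contain half of all assembled bases. A minimal sketch of the standard definition (not the authors' specific tooling):

```python
# N50: the contig length L such that contigs of length >= L together
# contain at least half of the total assembled bases.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

print(n50([50, 40, 30, 20, 10]))  # -> 40 (50 + 40 >= 150 / 2)
```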
Next steps with H. contortus
- Publish the genome paper.
- Identification of drug (anthelmintic) targets for treatment in agricultural settings (animal husbandry).
- Serving as a "reference approach" for a wide variety of parasitic nematodes, many of which have similar genomic issues.
2. Lamprey transcriptome assembly
- Lamprey genome is draft and missing ~30%.
- No closely related reference.
- Full-length and exon-level gene predictions are 50-75% reliable, and rarely capture UTRs/isoforms.
De novo assembly, if we do it well, can identify:
- Novel genes
- Novel exons
- Fast-evolving genes
- Somatic recombination: how much are we missing, really?
Sea lamprey in the Great Lakes
- Non-native
- Parasite of medium to large fishes
- Caused populations of host fishes to crash
(Li Lab / Y-W C-D)
Transcriptome results
- Started with 5.1 billion reads from 50 different tissues.
- Digital normalization discarded 98.7% of them as redundant, leaving 87m (!).
- These assembled into 15,100 transcripts > 1 kb.
- Against known transcripts: 98.7% agreement (accuracy); 99.7% included (contiguity).
Next steps with lamprey
Far more complete transcriptome than the one predicted from the genome! Enabling studies in:
- Basal vertebrate phylogeny
- Biliary atresia
- Evolutionary origin of brown fat (previously thought to be mammalian-only!)
- Pheromonal response in adults
3. Soil metagenome assembly
Observation: 99% of microbes cannot easily be cultured in the lab ("the great plate count anomaly"). Many reasons why you can't or don't want to culture:
- Syntrophic relationships
- Niche-specificity or unknown physiology
- Dormant microbes
- Abundance within communities
Single-cell sequencing and shotgun metagenomics are two common ways to investigate microbial communities.
Investigating soil microbial ecology
- What ecosystem-level functions are present, and how do microbes do them?
- How does agricultural soil differ from native soil?
- How does soil respond to climate perturbation?
Questions that are not easy to answer without shotgun sequencing:
- What kind of strain-level heterogeneity is present in the population?
- What does the phage and viral population look like?
- What species are where?
"Whoa, that's a lot of data..."
[Chart: estimated sequencing required (bp, w/Illumina) for E. coli genome, human genome, vertebrate transcriptome, human gut, marine, and soil samples; soil is by far the largest, approaching ~5x10^14 bp.]
Additional approach for metagenomes: data partitioning (a computational version of cell sorting)
- Split reads into "bins" belonging to different source species.
- Can do this based almost entirely on connectivity of sequences.
- "Divide and conquer."
- Memory-efficient implementation helps to scale assembly.
(Pell et al., 2012, PNAS)
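The connectivity idea can be sketched with a union-find over k-mers: reads that share any k-mer end up in the same partition. This is only a toy illustration; the published method (Pell et al., 2012, PNAS) achieves memory efficiency with a probabilistic (Bloom filter) representation of the assembly graph, which this sketch does not attempt.

```python
# Toy sketch of connectivity-based read partitioning: reads sharing any
# k-mer land in the same partition (connected component of the k-mer graph).
# K is illustrative; reads are assumed to be longer than K.
K = 31

parent = {}  # union-find forest over k-mers

def find(x):
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

def partition(reads):
    """Map each read to a representative k-mer naming its partition."""
    for read in reads:
        kmers = [read[i:i + K] for i in range(len(read) - K + 1)]
        for a, b in zip(kmers, kmers[1:]):
            union(a, b)  # all k-mers within one read are connected
    return {read: find(read[:K]) for read in reads}
```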
Partitioning separates reads by genome.
When computationally spiking HMP mock community data with reads from one E. coli genome (left) or from multiple E. coli strains (right), the majority of partitions contain reads from only a single genome (blue) vs. multi-genome partitions (green). Partitions containing spiked data are indicated with a *.
(Adina Howe)
Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes):

Sample         Total assembly   Contigs (>300 bp)   % reads assembled   Predicted protein coding
Iowa corn      2.5 bill         4.5 mill            19%                 5.3 mill
Iowa prairie   3.5 bill         5.9 mill            22%                 6.8 mill

Putting it in perspective: total equivalent of ~1200 bacterial genomes; the human genome is ~3 billion bp.
(Adina Howe)
Resulting contigs are low coverage.
[Figure: coverage (median basepair) distribution of assembled contigs from soil metagenomes.]
Strain variation? Can measure by read mapping.
Of the 5000 most abundant contigs, only 1 has a polymorphism rate > 5%.
[Figure: top two allele frequencies by position within contig.]
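A hedged sketch of how such a polymorphism rate can be computed from read mapping (the per-position pileup format and thresholds here are illustrative assumptions, not the authors' exact pipeline):

```python
# Per-contig polymorphism rate from a read-mapping pileup. A position is
# called polymorphic when the second most frequent allele exceeds a minor
# allele frequency cutoff. Thresholds are illustrative.
def polymorphism_rate(pileup, min_minor_freq=0.05, min_depth=10):
    polymorphic = 0
    called = 0
    for counts in pileup:  # counts: dict of base -> count at one position
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too shallow to call
        called += 1
        ranked = sorted(counts.values(), reverse=True)
        minor = ranked[1] if len(ranked) > 1 else 0
        if minor / depth >= min_minor_freq:
            polymorphic += 1
    return polymorphic / called if called else 0.0

# Example: one polymorphic site out of three covered positions
pileup = [{"A": 98, "G": 2}, {"C": 50, "T": 50}, {"G": 100}]
print(polymorphism_rate(pileup))  # -> 0.333...
```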
Tentative observations from our soil samples:
- We need 100x as much data...
- A lot of our sample may consist of phage.
- Phylogeny varies more than functional predictions.
- We see little to no strain variation within our samples -- but note these are not bulk soil! Very small, localized, low-coverage samples.
- We may be able to do selective, really deep sequencing and then infer the rest from 16S. Implications for soil aggregate assembly?
Some concluding thoughts
- Digital normalization is a very powerful technique for "fixing" weird samples. (...and all samples are weird.)
- A number of real-world projects are using diginorm successfully (~6-10 in my lab; ~30-40? overall).
- A diginorm-derived procedure is now a recommended part of the Trinity mRNAseq assembler.
- Diginorm is: 1. very computationally efficient; 2. always "cheaper" than running an assembler in the first place.
Where next?
- Assembly in the cloud!
- Study and formalize paired-end/mate-pair handling in diginorm.
- Web interface to run and evaluate assemblies.
- New methods to evaluate and improve assemblies, including a "meta-assembly" approach for metagenomes.
- Fast and efficient error correction of sequencing data.
- Can also address assembly of high-polymorphism sequence, allelic mapping bias, and others; can also enable fast/efficient storage and search of nucleic acid databases.
Four+ papers on our work, soon:
- 2012, PNAS, Pell et al., PMID 22847406 (partitioning).
- In review, Brown et al., arXiv:1203.4802 (digital normalization).
- Submitted, Howe et al., arXiv:1212.0159 (artifact removal from Illumina metagenomes).
- In preparation, Howe et al. -- assembling the heck out of soil.
- In preparation, Zhang et al. -- efficient k-mer counting.
Education: next-gen sequence course
- June 2013, Kellogg Biological Station; < $500.
- Hands-on exposure to data and analysis tools.
- Metagenomics workshop HERE, tomorrow, 9am-3pm -- contact Lex Nederbragt.
Thanks! Everything discussed here:
- Code: github.com/ged-lab/ ; BSD license
- Blog: http://ivory.idyll.org/blog ('titus brown blog')
- Twitter: @ctitusbrown
- Grants on lab web site: http://ged.msu.edu/interests.html
- Preprints: on arXiv, q-bio: 'diginorm arxiv'
Why are we applying short-read sequencing to RNAseq and metagenomics?
- Short-read sampling is deep and quantitative.
- Statistical argument: your ability to observe rare sequences -- your sensitivity of measurement -- is directly related to the number of independent sequences you sample.
- Longer reads (PacBio, 454, Ion Torrent) are less informative for quantitation.
- The majority of metagenome studies going forward will make use of Illumina (a ~2-3 year statement).
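That statistical argument can be made concrete with a hypothetical calculation (illustrative numbers): for a sequence present at frequency f, the probability of sampling it at least once with N independent reads is 1 - (1 - f)^N, which grows directly with N.

```python
# Probability of observing a rare sequence at least once as a function of
# read count N, for a sequence at frequency f (illustrative values).
f = 1e-7  # abundance of the rare sequence

for n_reads in (10**6, 10**8, 10**9):
    p = 1 - (1 - f) ** n_reads
    print(f"{n_reads:.0e} reads -> P(observed) = {p:.4f}")
# 1e+06 reads -> P(observed) = 0.0952
# 1e+08 reads -> P(observed) = 1.0000  (~= 1 - e**-10)
# 1e+09 reads -> P(observed) = 1.0000
```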
Digital normalization retains information, while discarding data and errors.