Scaling metagenome assembly –to infinity and beeeeeeeeeeyond! C. Titus Brown et al. Computer Science / Microbiology Depts Michigan State University In collaboration with Great Prairie Grand Challenge (Tiedje, Jansson, Tringe)
SAMPLING LOCATIONS
Sampling strategy per site [Figure: sampling layout around a reference soil core, with spacings labeled 1 cM, 1 M, and 10 M] Soil cores: 1 inch diameter, 4 inches deep. Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing).
Great Prairie sequencing summary 200x human genome…! > 10x more challenging (total diversity)
Our perspective Great Prairie project: there is no end to the data! Immense biological depth: estimate ~1-2 TB (10**12) of raw sequence needed to assemble top ~20-40% of microbes. Improvements in sequencing tech Existing methods for scaling assembly simply will not suffice: this is a losing battle. Abundance filtering XXX Better data structures XXX Parallelization is not going to be sufficient; neither are advances in data structures. I think: bad scaling is holding back assembly progress.
Our perspective, #2 Deep sampling is needed for these samples Illumina is it, for now. The last thing in the world we want to do is write yet another assembler…pre-assembly filtering, instead. All of our techniques can be used together with any assembler. We’ve mostly stuck with Velvet, for reasons of historical contingency.
Two enabling technologies Very efficient k-mer counting Bloom counting hash/MinCount Sketch data structure; constant memory Scales ~10x over traditional data structures k-independent. Probabilistic properties well suited to next-gen data sets. Very efficient de Bruijn graph representation We traverse k-mers stored in constant-memory Bloom filters. Compressible probabilistic data structure; very accurate. Scales ~20x over traditional data structures. K-independent. …cannot directly be used for assembly because of FP.
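The constant-memory k-mer counting idea can be illustrated with a small Count-Min Sketch. This is a minimal sketch under stated assumptions, not the khmer implementation: the class name, the table sizes, and the use of salted BLAKE2 hashes are all illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter (hypothetical minimal version)."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.depth = depth
        # Memory is width * depth counters, regardless of how many k-mers are added.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One hash per table, derived by salting with the table index.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), salt=bytes([i]), digest_size=8
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for i, j in self._slots(kmer):
            self.tables[i][j] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.tables[i][j] for i, j in self._slots(kmer))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


sketch = CountMinSketch()
for km in kmers("ACGTACGT", 4):
    sketch.add(km)
```

Because the tables never grow, memory stays fixed for any k and any number of reads; the price is that counts may be overestimated, never underestimated.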
Approach 1: Partitioning Use compressible graph representation to explore natural structure of data: many disconnected components.
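One way to picture the compressible graph representation is a Bloom filter that stores only k-mer membership, with graph edges left implicit. The sketch below is an illustration under assumptions (class and function names, filter size, and hash choice are hypothetical, and reverse complements are ignored); it is not the khmer code.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, size=100_003, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per slot for clarity, not compactness

    def _slots(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=bytes([i]), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for j in self._slots(item):
            self.bits[j] = 1

    def __contains__(self, item):
        return all(self.bits[j] for j in self._slots(item))


def graph_neighbors(kmer, bloom):
    """Edges are implicit: test the 8 possible one-base extensions for membership."""
    candidates = [kmer[1:] + b for b in "ACGT"] + [b + kmer[:-1] for b in "ACGT"]
    return [c for c in candidates if c != kmer and c in bloom]


bloom = BloomFilter()
seq, k = "ACGTTGCA", 4
for i in range(len(seq) - k + 1):
    bloom.add(seq[i:i + k])
```

A false positive in the filter shows up as a spurious k-mer or edge, which is why such a structure can guide traversal and partitioning but cannot be used directly for assembly.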
Partitioning for scaling Can be done in ~10x less memory than assembly. Partition at low k and assemble exactly at any higher k (DBG). Partitions can then be assembled independently Multiple processors -> scaling Multiple k, coverage -> improved assembly Multiple assembly packages (tailored to high variation, etc.) Can eliminate small partitions/contigs in the partitioning phase. In theory, an exact approach to divide and conquer/data reduction.
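The divide-and-conquer idea can be sketched with a union-find over reads. This is a simplified, exact, in-memory illustration: it joins reads that share at least one identical k-mer, ignores reverse complements, and does not use the probabilistic graph described in the slides. The function name and toy reads are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into components; reads sharing any k-mer end up together."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                a, b = find(idx), find(first_seen[kmer])
                if a != b:
                    parent[a] = b  # merge the two components
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    # Largest partition first.
    return sorted(groups.values(), key=len, reverse=True)


partitions = partition_reads(["ACGTAC", "GTACGG", "TTTTTT"], k=4)
```

Each returned partition can then be handed to an assembler on its own, at any k at or above the partitioning k, and small partitions can be dropped before assembly.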
Adina Howe
Partitioning challenges Technical challenge: existence of “knots” in the graph that artificially connect everything. Unfortunately, partitioning is not the solution. Runs afoul of same k-mer/error scaling problem that all k-mer assemblers have… 20x scaling isn’t nearly enough, anyway :(
Digression: sequencing artifacts Adina Howe
Approach 2: Digital normalization “Squash” high coverage reads Eliminate reads we’ve seen before (e.g. “> 5 times”) Digital version of experimental “mRNA normalization”. Nice algorithm! Single-pass Constant memory Trivial to implement Easy to parallelize / scale (memory AND throughput) “Perfect” solution? (Works fine for MDA, mRNAseq…)
Digital normalization Two benefits: Decrease the amount of data (real, but redundant, sequence). Eliminate the errors associated with that redundant sequence. Single-pass algorithm (cf. streaming sketch algorithms).
Digital normalization validation? Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.
Comparing assemblies quantitatively Build a “vector basis” for assemblies out of orthogonal M-base windows of DNA. This allows us to disassemble assemblies into vectors, compare them, and even “subtract” them from one another.
Running HMMs over de Bruijn graphs (=> cross validation) hmmgs: Assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. [Figure: example de Bruijn graph of 3-mers] Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Digital normalization validation Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated. :) Hmmgs results tell us that Velvet multi-k assembly is also very sensitive. Our primary concern at this point is about long-range artifacts (chimeric assembly).
Techniques Developed suite of techniques that work for scaling, without loss of information (?) While we have no good way to assess chimeras and misassemblies, basic sequence content and gene content stay the same across treatments. And… what, are we just sitting here writing code? No!  We have data to assemble!
Assembling Great Prairie data, v0.8 Iowa corn GAII, ~500m reads / 50 Gb => largest partition ~200k reads; 84 Mb in 53,501 contigs > 1kb. Iowa prairie GAII, ~500m reads / 50 Gb => largest partition ~100k reads; 102 Mb in 70,895 contigs > 1kb. Both done on a single 8-core Amazon EC2 bigmem node, 68 GB of RAM, ~$100. (Yay, we can do it! Boo, we’re only using 2% of reads.) No systematic optimization of partitions yet; 2-4x improvement expected. Normalization of HiSeq is also yet to be done. Note: we have applied this to other metagenomes as well; longer story.
Future directions? khmer software reasonably stable & well-tested; needs documentation, software engineering love. github.com/ctb/khmer/  (see ‘refactor’ branch…) Massively scalable implementation (HPC & cloud). Scalable digital normalization (~10 TB / 1 day? ;) Iterative partitioning Integrating other types of sequencing data (454, PacBio, …)? Polymorphism rates / error rates seem to be quite a bit higher. Validation and standard data sets?  Someone?  Please?
Lossless assembly; boosting.
Acknowledgements: The k-mer gang: Adina Howe, Jason Pell, Arend Hintze, Qingpeng Zhang, Rose Canino-Koning, Tim Brom. mRNAseq: Likit Preeyanon, Alexis Pyrkosz, Hans Cheng, Billie Swalla, and Weiming Li. HMM graph search: Jordan Fish, Qiong Wang, Jim Cole. Great Prairie consortium: Jim Tiedje, Rachel Mackelprang, Susannah Tringe, Janet Jansson. Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
Lumps! Adina Howe
Knots in the graph are caused by sequencing artifacts.
Identifying the source of knots Use a systematic traversal algorithm to identify highly-connected k-mers. Removal of these k-mers (trimming) breaks up the knots. Many, but not all, of these highly-connected k-mers are associated with high-abundance k-mers.
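The slides do not specify the traversal algorithm, so the following is only a stand-in: it flags k-mers whose number of graph neighbors (k-mers overlapping by k-1 bases) reaches a threshold, then trims them. The function names, the exact-set representation, and the `min_degree` threshold are assumptions made for illustration.

```python
def neighbors(kmer, kmer_set):
    """k-mers in kmer_set that overlap kmer by k-1 bases on either side."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in kmer_set:
                found.add(candidate)
    return found


def highly_connected(kmer_set, min_degree=3):
    """k-mers with at least min_degree neighbors; a linear path has degree 2."""
    return {km for km in kmer_set if len(neighbors(km, kmer_set)) >= min_degree}


def trim(kmer_set, min_degree=3):
    """Remove highly-connected k-mers, which breaks apart the knots they form."""
    return kmer_set - highly_connected(kmer_set, min_degree)


toy = {"ACG", "CGT", "CGA", "CGC", "TTT"}
hubs = highly_connected(toy)
```

In this toy set, `ACG` branches into three successors and is the only k-mer flagged; removing it leaves the remaining k-mers disconnected from one another.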
Highly connected k-mers are position-dependent Adina Howe
HCKs under-represented in assembly Adina Howe
HCKs tend to end contigs Adina Howe
Our current model Contigs are extended or joined around artifacts, with an observation bias towards such extensions (because of length cutoff). Tendency is for a long contig to be extended by 1-2 reads, so artifacts trend towards location at end of contig. Adina Howe
Conclusions (artifacts) They connect lots of stuff (preferential attachment) They result from something in the sequencing (3’ bias in reads) Assemblers don’t like using them The major effect of removing them is to shorten many contigs by a read.
Digital normalization algorithm
    for read in dataset:
        if median_kmer_count(read) < CUTOFF:
            update_kmer_counts(read)
            save(read)
        else:
            pass  # discard read
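A runnable version of this algorithm might look like the following. It is a minimal sketch: an exact `Counter` stands in for the constant-memory counting structure, and the function names, the default `k`, and the handling of reads shorter than k are assumptions rather than details from the slides.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=20, cutoff=5):
    """Single-pass digital normalization; yields the reads that are kept."""
    counts = Counter()  # exact counts; a fixed-memory sketch would replace this
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # assumption: reads shorter than k are skipped
        # Median k-mer abundance estimates how often this region has been seen.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            yield read
        # else: the read is redundant at this coverage and is discarded


reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = list(normalize(reads, k=4, cutoff=3))
```

Using the median rather than the mean or minimum means a few erroneous, low-count k-mers in a read do not make an otherwise well-covered read look novel.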
Supplemental: abundance filtering is very lossy.
Per-partition assembly optimization Strategy: vary k from 21 to 51 and assemble with Velvet; choose the k that maximizes sum(contigs > 1kb). Ran top partitions in Iowa corn (4.2m reads, 303 partitions). For k=33: 3.5 Mb in 1876 contigs > 1kb, max 15.7 kb. For the best k for each partition (varied between 31 and 47): 5.7 Mb in 2511 contigs > 1kb, max 51.7 kb.
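The selection rule can be written down in a few lines. This sketch assumes the assemblies have already been run and that contig lengths per k are available as a dictionary; the function names and the example lengths are hypothetical, not project data.

```python
def assembled_bases(contig_lengths, min_len=1000):
    """Total bases in contigs strictly longer than min_len."""
    return sum(n for n in contig_lengths if n > min_len)


def best_k(lengths_by_k, min_len=1000):
    """Return the k whose assembly has the most bases in contigs > min_len."""
    return max(lengths_by_k, key=lambda k: assembled_bases(lengths_by_k[k], min_len))


# Made-up contig lengths for one partition, keyed by k.
example = {
    31: [1500, 900, 2000],
    33: [4000, 1001],
    35: [1000, 1000, 9000],
}
chosen = best_k(example)
```

Applying `best_k` to each partition separately is what lets the chosen k vary between partitions instead of being fixed for the whole data set.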
Comparing assemblies / dendrogram

UiPath Test Automation using UiPath Test Suite series, part 4
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 

Scaling metagenome assembly

  • 1. Scaling metagenome assembly – to infinity and beeeeeeeeeeyond! C. Titus Brown et al., Computer Science / Microbiology Depts, Michigan State University. In collaboration with Great Prairie Grand Challenge (Tiedje, Jansson, Tringe).
  • 3. Sampling strategy per site. Soil cores: 1 inch diameter, 4 inches deep. Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing). [Figure: sampling layout around the reference soil, with spacing labels 1 cM, 1 M, and 10 M.]
  • 4. Great Prairie sequencing summary 200x human genome…! > 10x more challenging (total diversity)
  • 5. Our perspective Great Prairie project: there is no end to the data! Immense biological depth: estimate ~1-2 TB (10**12) of raw sequence needed to assemble top ~20-40% of microbes. Improvements in sequencing tech Existing methods for scaling assembly simply will not suffice: this is a losing battle. Abundance filtering XXX Better data structures XXX Parallelization is not going to be sufficient; neither are advances in data structures. I think: bad scaling is holding back assembly progress.
  • 6. Our perspective, #2 Deep sampling is needed for these samples Illumina is it, for now. The last thing in the world we want to do is write yet another assembler…pre-assembly filtering, instead. All of our techniques can be used together with any assembler. We’ve mostly stuck with Velvet, for reasons of historical contingency.
  • 7. Two enabling technologies. (1) Very efficient k-mer counting: Bloom counting hash / Count-Min Sketch data structure; constant memory; scales ~10x over traditional data structures; k-independent; probabilistic properties well suited to next-gen data sets. (2) Very efficient de Bruijn graph representation: we traverse k-mers stored in constant-memory Bloom filters; compressible probabilistic data structure; very accurate; scales ~20x over traditional data structures; k-independent; …cannot directly be used for assembly because of false positives (FP).
  • 8. Approach 1: Partitioning Use compressible graph representation to explore natural structure of data: many disconnected components.
  • 9. Partitioning for scaling Can be done in ~10x less memory than assembly. Partition at low k and assemble exactly at any higher k (DBG). Partitions can then be assembled independently Multiple processors -> scaling Multiple k, coverage -> improved assembly Multiple assembly packages (tailored to high variation, etc.) Can eliminate small partitions/contigs in the partitioning phase. In theory, an exact approach to divide and conquer/data reduction.
  • 11. Partitioning challenges Technical challenge: existence of “knots” in the graph that artificially connect everything. Unfortunately, partitioning is not the solution. Runs afoul of the same k-mer/error scaling problem that all k-mer assemblers have… 20x scaling isn’t nearly enough, anyway.
  • 14. Approach 2: Digital normalization “Squash” high coverage reads Eliminate reads we’ve seen before (e.g. “> 5 times”) Digital version of experimental “mRNA normalization”. Nice algorithm! Single-pass Constant memory Trivial to implement Easy to parallelize / scale (memory AND throughput) “Perfect” solution? (Works fine for MDA, mRNAseq…)
  • 15. Digital normalization Two benefits: Decrease the amount of data (real, but redundant sequence). Eliminate errors associated with that redundant sequence. Single-pass algorithm (cf. streaming sketch algorithms).
  • 16. Digital normalization validation? Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.
  • 17. Comparing assemblies quantitatively Build a “vector basis” for assemblies out of orthogonal M-base windows of DNA. This allows us to disassemble assemblies into vectors, compare them, and even “subtract” them from one another.
  • 18. Running HMMs over de Bruijn graphs (=> cross validation) hmmgs: Assemble based on good-scoring HMM paths through the graph. Independent of other assemblers; very sensitive, specific. 95% of hmmgs rplB domains are present in our partitioned assemblies. Jordan Fish, Qiong Wang, and Jim Cole (RDP). [Figure: de Bruijn graph with 3-mer nodes CTC, ACT, TTC, GTA, GAC, ATA, ACC, CTA, GTT.]
  • 19. Digital normalization validation Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated. Hmmgs results tell us that Velvet multi-k assembly is also very sensitive. Our primary concern at this point is about long-range artifacts (chimeric assembly).
  • 20. Techniques Developed a suite of techniques that work for scaling, without loss of information (?) While we have no good way to assess chimeras and misassemblies, basic sequence content and gene content stay the same across treatments. And… what, are we just sitting here writing code? No! We have data to assemble!
  • 21. Assembling Great Prairie data, v0.8 Iowa corn GAII, ~500M reads / 50 Gb => largest partition ~200k reads; 84 Mb in 53,501 contigs > 1kb. Iowa prairie GAII, ~500M reads / 50 Gb => biggest partition ~100k reads; 102 Mb in 70,895 contigs > 1kb. Both done on a single 8-core Amazon EC2 bigmem node, 68 GB of RAM, ~$100. (Yay, we can do it! Boo, we’re only using 2% of reads.) No systematic optimization of partitions yet; 2-4x improvement expected. Normalization of HiSeq is also yet to be done. Have applied to other metagenomes, note; longer story.
  • 22. Future directions? khmer software reasonably stable & well-tested; needs documentation, software engineering love. github.com/ctb/khmer/ (see ‘refactor’ branch…) Massively scalable implementation (HPC & cloud). Scalable digital normalization (~10 TB / 1 day? ;) Iterative partitioning Integrating other types of sequencing data (454, PacBio, …)? Polymorphism rates / error rates seem to be quite a bit higher. Validation and standard data sets? Someone? Please?
  • 24. Acknowledgements: The k-mer gang: Adina Howe, Jason Pell, Arend Hintze, Qingpeng Zhang, Rose Canino-Koning, Tim Brom. mRNAseq: Likit Preeyanon, Alexis Pyrkosz, Hans Cheng, Billie Swalla, and Weiming Li. HMM graph search: Jordan Fish, Qiong Wang, Jim Cole. Great Prairie consortium: Jim Tiedje, Rachel Mackelprang, Susannah Tringe, Janet Jansson. Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
  • 29. Knots in the graph are caused by sequencing artifacts.
  • 30. Identifying the source of knots Use a systematic traversal algorithm to identify highly-connected k-mers. Removal of these k-mers (trimming) breaks up the knots. Many, but not all, of these highly-connected k-mers are associated with high-abundance k-mers.
  • 31. Highly connected k-mers are position-dependent Adina Howe
  • 32. HCKs under-represented in assembly Adina Howe
  • 33. HCKs tend to end contigs Adina Howe
  • 34. Our current model Contigs are extended or joined around artifacts, with an observation bias towards such extensions (because of length cutoff). Tendency is for a long contig to be extended by 1-2 reads, so artifacts trend towards location at end of contig. Adina Howe
  • 35. Conclusions (artifacts) They connect lots of stuff (preferential attachment) They result from something in the sequencing (3’ bias in reads) Assemblers don’t like using them The major effect of removing them is to shorten many contigs by a read.
  • 36. Digital normalization algorithm
        for read in dataset:
            if median_kmer_count(read) < CUTOFF:
                update_kmer_counts(read)
                save(read)
            else:
                pass  # discard read
  • 38. Per-partition assembly optimization Strategy: Vary k from 21 to 51, assemble with Velvet. Choose the k that maximizes sum(contigs > 1kb). Ran top partitions in Iowa corn (4.2M reads, 303 partitions). For k=33: 3.5 Mb in 1876 contigs > 1kb, max 15.7 kb. For the best k for each partition (varied between 31 and 47): 5.7 Mb in 2511 contigs > 1kb, max 51.7 kb.
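
The digital normalization loop on slide 36 can be sketched as runnable Python. This is a minimal illustration, not the khmer implementation: the function names `kmers` and `digital_normalization`, the default `k=20`, and the use of an exact `Counter` in place of the constant-memory probabilistic counter from slide 7 are all assumptions made for the example. The default `cutoff=5` echoes the "> 5 times" example on slide 14.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every overlapping k-mer in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=5):
    """Single pass: keep a read only while its median k-mer count is below cutoff."""
    # Exact counts for clarity; an assumption standing in for the
    # constant-memory probabilistic counter described on slide 7.
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count (an assumption)
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to counts
            kept.append(read)
        # otherwise the read is redundant at this coverage and is discarded
    return kept
```

Because counts are updated only for reads that are kept, a read seen many times stops being saved once its median k-mer count reaches the cutoff, while a read from a low-coverage region still passes.
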

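The constant-memory k-mer counting on slide 7 can be illustrated with a Count-Min style table. This is a hedged sketch only: the class name `CountMinKmerCounter`, the table sizes, and the use of salted `hashlib.blake2b` hashes are assumptions for illustration and do not reflect khmer's actual API or hash functions. Reverse-complement handling, which real k-mer counters usually include, is omitted.

```python
import hashlib


class CountMinKmerCounter:
    """Fixed-memory approximate k-mer counter (Count-Min style).

    Hypothetical illustration; names and parameters are not khmer's API.
    """

    def __init__(self, k, num_tables=4, table_size=100_003):
        self.k = k
        self.table_size = table_size
        # Memory is fixed up front and does not grow with the data or with k.
        self.tables = [[0] * table_size for _ in range(num_tables)]

    def _slots(self, kmer):
        # One hash per table, derived by salting with the table index.
        for i in range(len(self.tables)):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.table_size

    def add(self, kmer):
        for i, slot in self._slots(kmer):
            self.tables[i][slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest estimate and never undercounts.
        return min(self.tables[i][slot] for i, slot in self._slots(kmer))

    def consume(self, read):
        """Count every overlapping k-mer in a read."""
        for start in range(len(read) - self.k + 1):
            self.add(read[start:start + self.k])
```

The trade-off is one-sided error: a reported count may be too high when unrelated k-mers collide in every table, but it is never too low, which suits abundance thresholds applied to noisy sequencing data.
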
Editor's Notes

  1. Thank organizers; point to talk online. Mention Susannah/first asst prof problem.
  2. 1) Very high diversity ~30 billion k-mers. 2) No k-mer overlap between Iowa corn and prairie; co-assembly futile.
  3. Indicate “surprising/awesome” components.
  4. Connectivity source organism abundance
  5. Comparing assemblies is hard, and we’ve had to build tools to build tools to let us compare assemblies. However, the results are good. Multi-k assemblies are essential, note.
  6. Completely different style of assembler; useful for cross validation.
  7. Note that all of this was done on Amazon in 68 GB of RAM.
  8. Move towards loosely coupled environment for lossless approaches to scaling assembly? Weak classifiers & boosting theory can also be applied (trivially). Note, at some point you should just sequence single cells or something.
  9. Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.
  11. Multi-k stuff.
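
The multi-k strategy from slide 38 (vary k, assemble, keep the k that maximizes the total length of contigs over 1 kb) can be sketched as follows. The `assemble(reads, k)` callable is a hypothetical stand-in for an external assembler run such as Velvet, and the function names and the choice of odd k values only are assumptions for this example.

```python
def assembled_bases(contigs, min_length=1000):
    """Total length of contigs longer than min_length (sum(contigs > 1kb))."""
    return sum(len(c) for c in contigs if len(c) > min_length)


def best_k_for_partition(reads, assemble, k_values=range(21, 52, 2)):
    """Assemble one partition at each k and return (best_k, best_score).

    `assemble(reads, k)` is a hypothetical callable standing in for an
    assembler run; it must return a list of contig strings.
    """
    best_k, best_score = None, -1
    for k in k_values:
        score = assembled_bases(assemble(reads, k))
        if score > best_score:  # ties keep the smallest k tried first
            best_k, best_score = k, score
    return best_k, best_score
```

Since partitions are independent, this search can be run separately for each partition, which is what allows the best k to differ between partitions.
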