C. Titus Brown
ctb@msu.edu
Assistant Professor (2008)
Computer Science & Engineering /
Microbiology and Molecular Genetics,
Michigan State University
BA Reed College/Math
PhD Caltech / Developmental Biology
Member of the Python Software Foundation
(a.k.a. awesomest programming language)
I’m a bit sick, so I may cough loudly and
obnoxiously at times.
1. O’Reilly folk asked if I had anything to talk
about.
2. Professors love talking.
3. Nifty techniques, applied to a new problem.
1. Can they be applied to your problem?
2. Do you have any ideas for me?
 ctb@msu.edu
 http://ged.msu.edu/
 http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.
Slide courtesy of Lincoln Stein
My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ;
cloud computing will not save us!
“Quantity has a quality all its own”
J. Stalin
“Ours is a just cause; victory will be ours!”
V. Molotov
SAMPLING LOCATIONS
 Wisconsin
◦ Native prairie (Goose Pond,
Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously
corn)
◦ Restored prairie (from 1998)
 Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)
 Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)
Iowa Native Prairie
Switchgrass
(Wisconsin)
Iowa >100 yr tilled
 30 Gb of sequence from Iowa corn
 50 Gb of sequence from Iowa prairie
 200 Gb of sequence from Wisconsin corn,
prairie
http://ivory.idyll.org/blog/aug-10/assembly-part-i
http://ivory.idyll.org/blog/jul-10/kmer-filtering
http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
 Whole (meta)genome shotgun sequencing
involves fragmenting and sequencing,
followed by re-assembly.
 The shorter the reads, the more difficult this
is to do reliably.
 Assembly scales poorly.
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Assembly is inherently an all by all process.
There is no good way to subdivide the short
sequences without potentially missing a key
connection:
Essentially, break reads (of any length) down into
multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
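The decomposition above can be sketched in a few lines of Python. The function name `kmers` is an assumption made for this illustration and is not taken from the khmer package.

```python
def kmers(read, k):
    """Return every overlapping word of length k in read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read and k from the slide: a 16-base read yields 16 - 12 + 1 = 5 words.
words = kmers("ATGGACCAGATGACAC", 12)
for word in words:
    print(word)
```

A read shorter than k yields no words at all, which is one reason very short reads carry little assembly information.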
J.R. Miller et al. / Genomics (2010)
For decisions about which paths etc, biology-based
heuristics come into play as well.
 Fixed-length words => great CS techniques
(hashing, trie structures, etc.)
 Data loading/comparison scales with size of your
data, N.
 Memory usage scales with # of unique words.
 This is an advantage over other techniques
◦ NxN comparisons…
 Some disadvantages, too; see review,
 J.R. Miller et al. / Genomics (2010)
 Unlike some other common computational
science problems in physics and chemistry,
which are combinatorial in nature, graph
analysis requires a lot of RAM (to store the
graph).
 This leads to the mildly unusual HPC scaling
issue of RAM as a limiting factor.
 …and RAM is expensive.
 If we knew which original genomes our short
sequences came from?
 Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!
 Which nodes do not connect to each other?
 Unfortunately this is already equivalent to
solving the hard component of the assembly
problem…
 Q: is this k-mer present in the data set?
 A: no => then it is not.
 A: yes => it may or may not be present.
This lets us store k-mers efficiently.
 Once we can store/query k-mers efficiently in
this oracle, we can build additional oracles on
top of it:
 Q: does this k-mer overlap with this other k-mer?
 A: no => then it does not, guaranteed.
 A: yes => it may or may not.
This lets us traverse k-mer graphs efficiently.
 Conveniently, perhaps
the simplest data
structure in computer
science is what we
need…
 …a hash table that
ignores collisions.
 Note, P(false positive) =
fractional occupancy.
 If you ignore collisions…
 O(1) query, insertion, update
 Fixed memory usage
 Ridiculously simple to implement (although
developing a good hash function can take
some effort)
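A minimal sketch of such a table, assuming `zlib.crc32` as a stand-in hash function and one byte per slot for readability (a real implementation would pack bits). The class name `KmerOracle` and its methods are hypothetical, chosen only for this example.

```python
import zlib


class KmerOracle:
    """Fixed-size presence table; collisions are ignored, never resolved."""

    def __init__(self, size):
        self.size = size
        self.slots = bytearray(size)  # all zero: nothing stored yet

    def _slot(self, kmer):
        # crc32 is only a placeholder hash for this sketch.
        return zlib.crc32(kmer.encode("ascii")) % self.size

    def add(self, kmer):
        self.slots[self._slot(kmer)] = 1

    def __contains__(self, kmer):
        # False is always correct; True may be a collision (false positive).
        return self.slots[self._slot(kmer)] == 1

    def occupancy(self):
        # Fraction of slots set; this is the chance that a random absent
        # k-mer lands on a filled slot, i.e. P(false positive).
        return sum(self.slots) / self.size


oracle = KmerOracle(1009)
for word in ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC"]:
    oracle.add(word)
print("ATGGACCAGATG" in oracle, oracle.occupancy())
```

Memory is fixed at construction time, so loading more k-mers raises the occupancy (and the false positive rate) rather than the memory footprint.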
Use a Bloom filter approach – multiple oracles,
in serial, are multiplicatively more reliable.
http://en.wikipedia.org/wiki/Bloom_filter
Adding additional filters increases discrimination
at the cost of speed.
This gives you a fairly straightforward tradeoff:
memory (decrease individual false positives) vs
computation (more filters!)
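The tradeoff can be sketched as several collision-ignoring tables queried in series. This is an illustration under stated assumptions, not the talk's implementation: `BloomOracle` is a hypothetical name, `zlib.crc32` is a placeholder hash, and reducing one hash value modulo distinct prime table sizes is just one simple way to get a different slot in each table.

```python
import zlib


class BloomOracle:
    """Several collision-ignoring tables; a k-mer must hit in all of them."""

    def __init__(self, sizes):
        # Assumption of this sketch: sizes are distinct primes.
        self.tables = [bytearray(n) for n in sizes]

    def _slots(self, kmer):
        h = zlib.crc32(kmer.encode("ascii"))
        return [h % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] = 1

    def __contains__(self, kmer):
        # One "no" from any table is a guaranteed "no" overall.
        return all(t[s] for t, s in zip(self.tables, self._slots(kmer)))

    def false_positive_estimate(self):
        # Product of the per-table fractional occupancies.
        p = 1.0
        for table in self.tables:
            p *= sum(table) / len(table)
        return p


bloom = BloomOracle([1009, 1013])
for word in ["ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC"]:
    bloom.add(word)
print(bloom.false_positive_estimate())
```

Each extra table multiplies the false positive estimate by another occupancy fraction, at the price of one more lookup per query.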
Memory usage, Bloom filter vs trie (theoretical minimum)
 We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
 This implicitly lets us store the graph
structure, too!
Once you can look up k-mers quickly, traversal
is easy: there are only 8 possible overlapping
k-mers:
4 before, and 4 after.
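As a sketch, the 8 candidates can be enumerated directly and then filtered through the membership query. Here a plain Python `set` stands in for the probabilistic oracle, and the function name `neighbors` is an assumption made for this example.

```python
def neighbors(kmer):
    """The 8 candidate k-mers overlapping kmer by k-1 bases."""
    before = [base + kmer[:-1] for base in "ACGT"]  # 4 possible predecessors
    after = [kmer[1:] + base for base in "ACGT"]    # 4 possible successors
    return before + after


# Stand-in for the oracle: an exact set of the k-mers seen in the data.
present = {"ATGGACCAGATG", "TGGACCAGATGA", "GGACCAGATGAC"}

# Only candidates the oracle says "yes" to are edges in the graph.
edges = [n for n in neighbors("TGGACCAGATGA") if n in present]
print(edges)
```

No edge list is stored anywhere; the graph structure is recomputed from membership queries whenever it is needed.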
 We can now traverse this graph structure and
ask several types of questions:
Which of these graphs has more than 3 nodes?
Which nodes do not connect to each other?
Our oracle can mistakenly connect clusters.
This is a problem if the rate is sufficiently high!
Graphs will never be erroneously disconnected
Nodes will never be erroneously disconnected.
This is critically important: it guarantees that our
k-mer graph representation yields reliable “no”
answers.
This, in turn, lets us reliably partition graphs into
smaller graphs…
…and we can do so iteratively.
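A minimal sketch of partitioning by traversal, assuming an exact Python `set` in place of the probabilistic oracle and ignoring reverse complements for brevity. The names `partition`, `neighbors` and `kmers` are hypothetical helpers for this illustration.

```python
from collections import deque


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def neighbors(kmer):
    return [b + kmer[:-1] for b in "ACGT"] + [kmer[1:] + b for b in "ACGT"]


def partition(all_kmers):
    """Group k-mers into connected components of the k-mer graph."""
    present = set(all_kmers)  # stand-in for the membership oracle
    seen = set()
    parts = []
    for start in all_kmers:
        if start in seen:
            continue
        # Breadth-first walk over every k-mer reachable from start.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for candidate in neighbors(node):
                if candidate in present and candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
        parts.append(component)
    return parts


# Two reads that share no k-mers should land in two separate partitions.
words = kmers("ATGGACCAGATGACAC", 12) + kmers("TTTTCCCCGGGGAAAATT", 12)
parts = partition(words)
print(len(parts), [len(p) for p in parts])
```

With a probabilistic oracle in place of the set, a false positive can only add an edge, so it may merge two partitions but cannot split one; each partition can then be handed to an assembler independently.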
1. Built lightweight probabilistic data
structure/algorithm for k-mer storage.
- Constant memory, constant lookup
- Linear time to create structure
2. Implemented systematic graph traversal of
arbitrarily large graphs (> ~3 billion connected
k-mers, so far)
- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for
memory
3. Built partitioning system to eliminate small
graphs and extract disconnected graphs.
Pre-filter/partition for somebody else’s
assembler
N.B. This results in identical assembly.
 Python wrapping C++, ~5000 LoC. (Python handles
parallelization; go free, GIL!)
 Partitioning & assembling a 2 Gb data set can be done in ~8
GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature opt)
 Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM,
single chassis, 8 CPU.
 Not yet clear how well it scales to 200 Gb, but should…
 …all of this is running on Amazon cloud rentals.
 Lightweight probabilistic storage system for
k-mers, ~1 byte / k-mer.
 Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion
 Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover
(connectivity increases).
 More general assembly graph analysis
 Breaking graphs in good places
 Clustering of large protein similarity graphs/matrices
Caveats:
 Preferential attachment with false positives?
First publication --
 Bloom counting hash (see kmer-filtering blog post)
 We were lucky & could turn our graph traversal
problem into a set membership query.
 Tabu search / neighborhood exclusion for
exhaustive graph traversal isn’t novel, but might
be useful. Requires systematic tagging.
 But… random and probabilistic approaches (skip
lists, Bloom filters, etc.) can be surprisingly
useful.
◦ One sided errors are awesome for Big Data.
http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning,
grad
Qingpeng Zhang, grad
Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje
(w/Janet Jansson, Rachel
Mackelprang (JGI))
Funding
USDA NIFA, NSF, DOE,
Michigan State U.
 ABySS assembler – multi-node assembly in RAM
On-disk assembly:
 SOAP assembler (BGI) – not open source
 Cortex assembler (EBI) – unpub/not released
 Contrail assembler (Michael Schatz) – unpub/not
released
It’s hard for me to tell how these last three compare ;)
BUT our current approach is orthogonal and can be
used in conjunction (as a pre-filter) with these
assemblers.

Essentials of Automations: Optimizing FME Workflows with Parameters
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 

Probabilistic breakdown of assembly graphs

  • 2. Assistant Professor (2008), Computer Science & Engineering / Microbiology and Molecular Genetics, Michigan State University. BA, Reed College (Math). PhD, Caltech (Developmental Biology). Member of the Python Software Foundation (a.k.a. awesomest programming language).
  • 3. I’m a bit sick, so I may cough loudly and obnoxiously at times.
  • 4. (1) O’Reilly folk asked if I had anything to talk about. (2) Professors love talking. (3) Nifty techniques, applied to a new problem: (a) Can they be applied to your problem? (b) Do you have any ideas for me?
  • 5. ctb@msu.edu; http://ged.msu.edu/; http://github.com/ctb/ (khmer package, BSD license; k-mer analysis; …lotsa other stuff).
  • 6. Slide courtesy of Lincoln Stein. My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ; cloud computing will not save us!
  • 7. “Quantity has a quality all its own” (J. Stalin)
  • 8. “Quantity has a quality all its own” (J. Stalin); “Ours is a just cause; victory will be ours!” (V. Molotov)
  • 10. Wisconsin: native prairie (Goose Pond, Audubon); long-term cultivation (corn); switchgrass rotation (previously corn); restored prairie (from 1998). Iowa: native prairie (Morris prairie); long-term cultivation (corn). Kansas: native prairie (Konza prairie); long-term cultivation (corn). Image labels: Iowa native prairie; switchgrass (Wisconsin); Iowa >100 yr tilled.
  • 11. 30 Gb of sequence from Iowa corn; 50 Gb of sequence from Iowa prairie; 200 Gb of sequence from Wisconsin corn, prairie. http://ivory.idyll.org/blog/aug-10/assembly-part-i ; http://ivory.idyll.org/blog/jul-10/kmer-filtering ; http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
  • 12. Whole (meta)genome shotgun sequencing involves fragmenting and sequencing, followed by re-assembly. The shorter the reads, the more difficult this is to do reliably. Assembly scales poorly.
  • 13. Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 14. Assembly is inherently an all by all process. There is no good way to subdivide the short sequences without potentially missing a key connection:
  • 15. Essentially, break reads (of any length) down into multiple overlapping words of fixed length k. ATGGACCAGATGACAC (k=12) => ATGGACCAGATG, TGGACCAGATGA, GGACCAGATGAC, GACCAGATGACA, ACCAGATGACAC
  • 16. J.R. Miller et al. / Genomics (2010)
  • 18. For decisions about which paths to follow, etc., biology-based heuristics come into play as well.
  • 19. Fixed-length words => great CS techniques (hashing, trie structures, etc.). Data loading/comparison scales with the size of your data, N. Memory usage scales with the number of unique words. This is an advantage over other techniques (NxN comparisons…). Some disadvantages, too; see the review, J.R. Miller et al. / Genomics (2010).
  • 20. Unlike some other common computational science problems in physics and chemistry, which are combinatorial in nature, graph analysis requires a lot of RAM (to store the graph). This leads to the mildly unusual HPC scaling issue of RAM as a limiting factor. …and RAM is expensive.
  • 21. What if we knew which original genomes our short sequences came from? Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently!
  • 22. Which nodes do not connect to each other?
  • 23. What if we knew which original genomes our short sequences came from? Then we could just put all the sequences that came from a particular genome in a smaller bin, and assemble that independently! Unfortunately this is already equivalent to solving the hard component of the assembly problem…
  • 24. Q: is this k-mer present in the data set? A: no => then it is not. A: yes => it may or may not be present. This lets us store k-mers efficiently.
  • 25. Once we can store/query k-mers efficiently in this oracle, we can build additional oracles on top of it:
  • 26. Q: does this k-mer overlap with this other k-mer? A: no => then it does not, guaranteed. A: yes => it may or may not. This lets us traverse k-mer graphs efficiently.
  • 27. Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
  • 28. If you ignore collisions… O(1) query, insertion, update; fixed memory usage; ridiculously simple to implement (although developing a good hash function can take some effort).
  • 29. Conveniently, perhaps the simplest data structure in computer science is what we need… …a hash table that ignores collisions. Note, P(false positive) = fractional occupancy.
  • 30. Use a Bloom filter approach – multiple oracles, in serial, are multiplicatively more reliable. http://en.wikipedia.org/wiki/Bloom_filter
  • 31. Adding additional filters increases discrimination at the cost of speed. This gives you a fairly straightforward tradeoff: memory (decrease individual false positives) vs computation (more filters!)
  • 34. Memory usage, Bloom filter vs trie (theoretical minimum)
  • 35. We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too!
  • 36. Once you can look up k-mers quickly, traversal is easy: there are only 8 possible overlapping k-mers: 4 before, and 4 after.
  • 37. We can now ask, “does k-mer ACGTGGCAGG… occur in the data set?”, quickly and accurately. This implicitly lets us store the graph structure, too, because there are only 8 possible connected nodes. We can now traverse this graph structure and ask several types of questions:
  • 38. Which of these graphs has more than 3 nodes?
  • 40. Which nodes do not connect to each other?
  • 51. Our oracle can mistakenly connect clusters.
  • 52. This is a problem if the rate is sufficiently high!
  • 53. Graphs will never be erroneously disconnected
  • 54. Nodes will never be erroneously disconnected
  • 55. Nodes will never be erroneously disconnected. This is critically important: it guarantees that our k-mer graph representation yields reliable “no” answers. This, in turn, lets us reliably partition graphs into smaller graphs… …and we can do so iteratively.
  • 57.
      1. Built lightweight probabilistic data structure/algorithm for k-mer storage.
          - Constant memory, constant lookup
          - Linear time to create structure
      2. Implemented systematic graph traversal of arbitrarily large graphs (> ~3 billion connected k-mers, so far).
          - Affine memory (with small linear constant)
          - Bounded time for exploration; bound traded for memory
      3. Built partitioning system to eliminate small graphs and extract disconnected graphs.
  • 59. Pre-filter/partition for somebody else’s assembler. N.B. This results in identical assembly.
  • 60.
      - Python wrapping C++, ~5000 LoC. (Python handles parallelization; go free, GIL!)
      - Partitioning & assembling a 2 Gb data set can be done in ~8 GB of RAM in < 1 day.
          ◦ Compare with the 40 GB requirement for existing (released) assemblers.
          ◦ Probably 10-fold speed improvement easily (KISS; no premature opt).
      - Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM, single chassis, 8 CPU.
      - Not yet clear how well it scales to 200 Gb, but should…
      - …all of this is running on Amazon cloud rentals.
  • 61.
      - Lightweight probabilistic storage system for k-mers, ~1 byte / k-mer.
      - Large graph traversal (10-20 bn k-mers)
          ◦ Tabu search
          ◦ Neighborhood exclusion
      - Graph partitioning, trimming, grokking.
          ◦ Iterative refinement is “perfect”
          ◦ Failure rate ~ memory usage, with good failover (connectivity increases).
  • 62.
      - More general assembly graph analysis
      - Breaking graphs in good places
      - Clustering of large protein similarity graphs/matrices
      Caveats:
      - Preferential attachment with false positives?
      First publication:
      - Bloom counting hash (see kmer-filtering blog post)
  • 63.
      - We were lucky & could turn our graph traversal problem into a set membership query.
      - Tabu search / neighborhood exclusion for exhaustive graph traversal isn’t novel, but might be useful. Requires systematic tagging.
      - But… random and probabilistic approaches (skip lists, Bloom filters, etc.) can be surprisingly useful.
          ◦ One-sided errors are awesome for Big Data.
      http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
  • 64.
      - GED lab / k-mer gang: Adina Howe (w/Tiedje); Arend Hintze, postdoc; Jason Pell, grad; Rosangela Canino-Koning, grad; Qingpeng Zhang, grad
      - Collaborators (MSU): Weiming Li; Charles Ofria; Jim Tiedje (w/Janet Jansson, Rachel Mackelprang (JGI))
      - Funding: USDA NIFA, NSF, DOE, Michigan State U.
  • 65.
      - ABySS assembler – multi-node assembly in RAM
      On-disk assembly:
      - SOAP assembler (BGI) – not open source
      - Cortex assembler (EBI) – unpub/not released
      - Contrail assembler (Michael Schatz) – unpub/not released
      It’s hard for me to tell how these last three compare ;) BUT our current approach is orthogonal and can be used in conjunction (as a pre-filter) with these assemblers.
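
The k-mer oracle described in slides 15 and 24-31 can be sketched roughly as follows. This is a minimal illustration and not the khmer implementation: the class name `KmerOracle`, the table sizes, and the use of `hashlib.blake2b` as the hash function are all assumptions made for the example. It breaks a read into overlapping k-mers, stores them in several collision-ignoring hash tables, and answers membership queries with one-sided error.

```python
import hashlib


def kmers(read, k):
    """Yield every overlapping word of fixed length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class KmerOracle:
    """Hash tables that ignore collisions, queried in series (Bloom-filter style).

    Hypothetical illustration: one byte per slot for readability; a
    memory-conscious version would pack slots into bits.
    """

    def __init__(self, table_sizes):
        self.tables = [bytearray(size) for size in table_sizes]

    def _slot(self, index, kmer):
        # Prefix the table index so each table effectively gets its own hash.
        data = bytes([index]) + kmer.encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % len(self.tables[index])

    def add(self, kmer):
        for index, table in enumerate(self.tables):
            table[self._slot(index, kmer)] = 1

    def __contains__(self, kmer):
        # "No" is always correct; "yes" may be a false positive.
        return all(table[self._slot(index, kmer)]
                   for index, table in enumerate(self.tables))

    def false_positive_estimate(self):
        # Per table, P(false positive) = fractional occupancy; assuming the
        # tables behave independently, the rates multiply.
        estimate = 1.0
        for table in self.tables:
            estimate *= sum(table) / len(table)
        return estimate


oracle = KmerOracle([1009, 1013])  # table sizes chosen arbitrarily
words = list(kmers("ATGGACCAGATGACAC", 12))
for word in words:
    oracle.add(word)
print(words)
print(all(word in oracle for word in words))
print(oracle.false_positive_estimate())
```

Adding more tables lowers the estimated false positive rate at the cost of extra hashing per query, which is the memory-versus-computation tradeoff from slide 31.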

Editor's Notes

  1. Note, no tolerance for indels
  2. @@
  3. @@
  4. Paint between the greens.
  5. When a green connects two or more colors, recolor one color.
  6. Dependent on minimum density tagging
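
The traversal and partitioning idea from slides 36-55 (each k-mer has at most 8 possible neighbors, so a membership query is enough to walk the graph and split it into disconnected pieces) might look like the sketch below. It is written under simplifying assumptions: a plain Python `set` stands in for the probabilistic oracle, reverse complements are ignored, and an exact `seen` set replaces the tagging scheme the slides mention. The function names `neighbors` and `partition` are hypothetical.

```python
from collections import deque

ALPHABET = "ACGT"


def kmers(read, k):
    """Yield every overlapping word of fixed length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def neighbors(kmer, present):
    """Return the overlapping k-mers (at most 8) that the oracle reports."""
    candidates = [kmer[1:] + base for base in ALPHABET]    # 4 after
    candidates += [base + kmer[:-1] for base in ALPHABET]  # 4 before
    return [c for c in candidates if c != kmer and present(c)]


def partition(start_kmers, present):
    """Group k-mers into connected components by breadth-first traversal."""
    seen = set()
    components = []
    for start in start_kmers:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbors(current, present):
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Two made-up reads that share no overlapping 8-mers.
reads = ["ATGGACCAGATG", "TTTTCCCCGGGG"]
stored = {word for read in reads for word in kmers(read, 8)}
components = partition(sorted(stored), stored.__contains__)
print(sorted(len(component) for component in components))
```

With a probabilistic oracle in place of the exact set, a false positive could only add edges, so two components might be merged by mistake but a real connection would never be lost, which is the property slides 51-55 rely on.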