C. Titus Brown
ctb@msu.edu
Assistant Professor (2008)
Computer Science & Engineering /
Microbiology and Molecular Genetics,
Michigan State University
BA Reed College/Math
PhD Caltech / Developmental Biology
Member of the Python Software Foundation
(a.k.a. awesomest programming language)
I’m a bit sick, so I may cough loudly and
obnoxiously at times.
1. O’Reilly folk asked if I had anything to talk
about.
2. Professors love talking.
3. Nifty techniques, applied to a new problem.
   1. Can they be applied to your problem?
   2. Do you have any ideas for me?
 ctb@msu.edu
 http://ged.msu.edu/
 http://github.com/ctb/
◦ khmer package, BSD license; k-mer analysis.
◦ …lotsa other stuff.
Slide courtesy of Lincoln Stein
My blog: http://ivory.idyll.org/blog/oct-10/sky-is-falling ;
cloud computing will not save us!
“Quantity has a quality all its own”
J. Stalin
“Ours is a just cause; victory will be ours!”
V. Molotov
SAMPLING LOCATIONS
 Wisconsin
◦ Native prairie (Goose Pond,
Audubon)
◦ Long term cultivation (corn)
◦ Switchgrass rotation (previously
corn)
◦ Restored prairie (from 1998)
 Iowa
◦ Native prairie (Morris prairie)
◦ Long term cultivation (corn)
 Kansas
◦ Native prairie (Konza prairie)
◦ Long term cultivation (corn)
Iowa Native Prairie
Switchgrass (Wisconsin)
Iowa >100 yr tilled
 30 Gb of sequence from Iowa corn
 50 Gb of sequence from Iowa prairie
 200 Gb of sequence from Wisconsin corn,
prairie
http://ivory.idyll.org/blog/aug-10/assembly-part-i
http://ivory.idyll.org/blog/jul-10/kmer-filtering
http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology
 Whole (meta)genome shotgun sequencing
involves fragmenting and sequencing,
followed by re-assembly.
 The shorter the reads, the more difficult this
is to do reliably.
 Assembly scales poorly.
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Assembly is inherently an all-by-all process.
There is no good way to subdivide the short
sequences without potentially missing a key
connection:
Essentially, break reads (of any length) down into
multiple overlapping words of fixed length k.
ATGGACCAGATGACAC (k=12) =>
ATGGACCAGATG
TGGACCAGATGA
GGACCAGATGAC
GACCAGATGACA
ACCAGATGACAC
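
The decomposition above can be sketched in a few lines of Python. The function name `kmers` is made up for illustration and is not taken from the khmer package.

```python
def kmers(read, k):
    """Return every overlapping word of length k in read, in order."""
    # A read shorter than k yields an empty list.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide, broken into words of length 12.
for word in kmers("ATGGACCAGATGACAC", 12):
    print(word)
```

A read of length L yields L - k + 1 words, so the 16-base read above gives 5 words at k=12.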
J.R. Miller et al. / Genomics (2010)
For decisions about which paths to follow, etc., biology-based
heuristics come into play as well.
 Fixed-length words => great CS techniques
(hashing, trie structures, etc.)
 Data loading/comparison scales with size of your
data, N.
 Memory usage scales with # of unique words.
 This is an advantage over other techniques
◦ NxN comparisons…
 Some disadvantages, too; see review,
 J.R. Miller et al. / Genomics (2010)
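
To make the scaling points concrete, here is a small sketch that counts words with an ordinary Python `Counter`. This is an exact table, not a probabilistic structure; the function name `count_kmers` and the second read are invented for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length word across all reads (one pass over the data)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Two overlapping reads; the second one is a made-up example.
reads = ["ATGGACCAGATGACAC", "GGACCAGATGACACTT"]
counts = count_kmers(reads, 12)

total_words = sum(counts.values())   # work done while loading
unique_words = len(counts)           # keys that must be held in memory
print(total_words, unique_words)
```

Loading touches every word once, but only the distinct words take up space in the table, which is why memory tracks the number of unique words rather than the raw amount of sequence.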
 Unlike some other common computational
science problems in physics and chemistry,
which are combinatorial in nature, graph
analysis requires a lot of RAM (to store the
graph).
 This leads to the mildly unusual HPC scaling
issue of RAM as a limiting factor.
 …and RAM is expensive.
 What if we knew which original genomes our short
sequences came from?
 Then we could just put all the sequences that
came from a particular genome in a smaller
bin, and assemble that independently!
 Which nodes do not connect to each other?
 Unfortunately this is already equivalent to
solving the hard component of the assembly
problem…
 Q: is this k-mer present in the data set?
 A: no => then it is not.
 A: yes => it may or may not be present.
This lets us store k-mers efficiently.
 Once we can store/query k-mers efficiently in
this oracle, we can build additional oracles on
top of it:
 Q: does this k-mer overlap with this other k-mer?
 A: no => then it does not, guaranteed.
 A: yes => it may or may not.
This lets us traverse k-mer graphs efficiently.
 Conveniently, perhaps
the simplest data
structure in computer
science is what we
need…
 …a hash table that
ignores collisions.
 Note, P(false positive) =
fractional occupancy.
 If you ignore collisions…
 O(1) query, insertion, update
 Fixed memory usage
 Ridiculously simple to implement (although
developing a good hash function can take
some effort)
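
A minimal sketch of such a table, assuming one byte per slot for readability (a real implementation would pack bits) and using `hashlib.blake2b` as a stand-in for "a good hash function". The class name `KmerOracle` and its methods are hypothetical, not khmer's API.

```python
import hashlib


class KmerOracle:
    """Fixed-size presence table; collisions are ignored, never resolved."""

    def __init__(self, size):
        self.size = size
        self.slots = bytearray(size)  # memory is fixed at construction time

    def _slot(self, kmer):
        # Assumption: any well-mixed hash works; blake2b is just convenient.
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add(self, kmer):
        self.slots[self._slot(kmer)] = 1          # O(1) insertion

    def __contains__(self, kmer):
        return self.slots[self._slot(kmer)] == 1  # O(1) query

    def occupancy(self):
        """Fraction of slots set; this is the chance a random query says yes."""
        return sum(self.slots) / self.size


oracle = KmerOracle(1009)
oracle.add("ATGGACCAGATG")
print("ATGGACCAGATG" in oracle)  # an inserted k-mer is always reported present
print(oracle.occupancy())
```

A "no" is always correct because a slot is only ever set by an insertion. A "yes" may come from a different k-mer that landed in the same slot, which is why the false positive rate equals the fraction of occupied slots.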
Use a Bloom filter approach – multiple oracles,
in serial, are multiplicatively more reliable.
http://en.wikipedia.org/wiki/Bloom_filter
Adding additional filters increases discrimination
at the cost of speed.
This gives you a fairly straightforward tradeoff:
memory (decrease individual false positives) vs
computation (more filters!)
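
The tradeoff can be sketched as several collision-ignoring tables queried one after another. In this sketch the tables differ only in size (distinct primes), so one hash value lands in different slots in each; that choice, the class name `MultiTableOracle`, and the use of `hashlib.blake2b` are assumptions for illustration. The false positive estimate multiplies the per-table occupancies, which treats the tables as independent.

```python
import hashlib


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (stand-in for a tuned hash function)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class MultiTableOracle:
    """Several presence tables; a k-mer is 'present' only if all agree."""

    def __init__(self, sizes):
        self.tables = [bytearray(size) for size in sizes]

    def add(self, kmer):
        h = kmer_hash(kmer)
        for table in self.tables:
            table[h % len(table)] = 1

    def __contains__(self, kmer):
        h = kmer_hash(kmer)
        # Each extra table costs one more lookup per query.
        return all(table[h % len(table)] for table in self.tables)

    def estimated_false_positive_rate(self):
        # Product of fractional occupancies, assuming independent tables.
        rate = 1.0
        for table in self.tables:
            rate *= sum(table) / len(table)
        return rate


oracle = MultiTableOracle([1009, 1013, 1019])
for kmer in ["ATGGACCAGATG", "TGGACCAGATGA"]:
    oracle.add(kmer)
print("ATGGACCAGATG" in oracle)
print(oracle.estimated_false_positive_rate())
```

Bigger tables lower each factor in the product (memory), while more tables add factors (computation); a "no" from any single table is still definitive.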
Memory usage, Bloom filter vs trie (theoretical minimum)
 We can now ask, “does k-mer
ACGTGGCAGG… occur in the data set?”,
quickly and accurately.
 This implicitly lets us store the graph
structure, too!
Once you can look up k-mers quickly, traversal
is easy: there are only 8 possible overlapping
k-mers:
4 before, and 4 after.
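
A sketch of that neighbor lookup. `present` can be anything that supports `in`; a plain Python set is used here so the result is exact. The function name is hypothetical, and reverse complements are left out to keep the example short.

```python
def neighbors(kmer, present):
    """Return the overlapping k-mers (up to 8) that the membership test accepts."""
    found = []
    for base in "ACGT":
        before = base + kmer[:-1]  # overlaps the first k-1 letters of kmer
        after = kmer[1:] + base    # overlaps the last k-1 letters of kmer
        for candidate in (before, after):
            if candidate in present and candidate not in found:
                found.append(candidate)
    return found


present = {
    "ATGGACCAGATG",
    "TGGACCAGATGA",
    "GGACCAGATGAC",
    "GACCAGATGACA",
    "ACCAGATGACAC",
}
print(neighbors("TGGACCAGATGA", present))
```

No edges are stored anywhere: the graph is implied by which of the 8 candidate k-mers pass the membership query.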
 We can now traverse this graph structure and
ask several types of questions:
Which of these graphs has more than 3 nodes?
Which nodes do not connect to each other?
Our oracle can mistakenly connect clusters.
This is a problem if the rate is sufficiently high!
Graphs will never be erroneously disconnected
Nodes will never be erroneously disconnected.
This is critically important: it guarantees that our
k-mer graph representation yields reliable “no”
answers.
This, in turn, lets us reliably partition graphs into
smaller graphs…
…and we can do so iteratively.
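
A minimal sketch of partitioning by connectivity, using breadth-first traversal over the implicit k-mer graph. The function names are hypothetical, an exact Python set stands in for the probabilistic oracle, and reverse complements are ignored. This is not the traversal strategy used in the actual software, only an illustration of the idea.

```python
from collections import deque


def neighbors(kmer, present):
    """Overlapping k-mers (4 before, 4 after) that the oracle accepts."""
    found = []
    for base in "ACGT":
        for candidate in (base + kmer[:-1], kmer[1:] + base):
            if candidate in present and candidate not in found:
                found.append(candidate)
    return found


def partition(kmers, present):
    """Group k-mers into connected components of the implicit graph."""
    seen = set()
    components = []
    for start in kmers:
        if start in seen:
            continue
        seen.add(start)
        component = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors(node, present):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Two made-up groups of 4-mers that share no overlaps with each other.
kmers = ["ACGT", "CGTA", "TTTG", "TTGG"]
for component in partition(kmers, set(kmers)):
    print(component)
```

If `present` were a lossy oracle instead of a set, a false positive could only add a connection and merge two components. It could never split one, so every partition that comes out separate really is separate and can be handed to an assembler on its own.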
1. Built lightweight probabilistic data
structure/algorithm for k-mer storage.
- Constant memory, constant lookup
- Linear time to create structure
2. Implemented systematic graph traversal of
arbitrarily large graphs (> ~3 billion connected
k-mers, so far)
- Affine memory (with small linear constant)
- Bounded time for exploration; bound traded for
memory
3. Built partitioning system to eliminate small
graphs and extract disconnected graphs.
Pre-filter/partition for somebody else’s
assembler
N.B. This results in identical assembly.
 Python wrapping C++, ~5000 LoC. (Python handles
parallelization; go free, GIL!)
 Partitioning & assembling a 2 Gb data set can be done in ~8
GB of RAM in < 1 day
◦ Compare with 40 GB requirement for existing (released) assemblers.
◦ Probably 10-fold speed improvement easily (KISS; no premature opt)
 Can partition, assemble ~50 Gb in < 1 wk in 70 GB of RAM,
single chassis, 8 CPU.
 Not yet clear how well it scales to 200 Gb, but should…
 …all of this is running on Amazon cloud rentals.
 Lightweight probabilistic storage system for
k-mers, ~1 byte / k-mer.
 Large graph traversal (10-20 bn k-mers)
◦ Tabu search
◦ Neighborhood exclusion
 Graph partitioning, trimming, grokking.
◦ Iterative refinement is “perfect”
◦ Failure rate ~ memory usage, with good failover
(connectivity increases).
 More general assembly graph analysis
 Breaking graphs in good places
 Clustering of large protein similarity graphs/matrices
Caveats:
 Preferential attachment with false positives?
First publication --
 Bloom counting hash (see kmer-filtering blog post)
 We were lucky & could turn our graph traversal
problem into a set membership query.
 Tabu search / neighborhood exclusion for
exhaustive graph traversal isn’t novel, but might
be useful. Requires systematic tagging.
 But… random and probabilistic approaches (skip
lists, Bloom filters, etc.) can be surprisingly
useful.
◦ One sided errors are awesome for Big Data.
http://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
GED lab / k-mer gang
Adina Howe (w/Tiedje)
Arend Hintze, postdoc
Jason Pell, grad
Rosangela Canino-Koning,
grad
Qingpeng Zhang, grad
Collaborators (MSU)
Weiming Li
Charles Ofria
Jim Tiedje
(w/Janet Jansson, Rachel
Mackelprang (JGI))
Funding
USDA NIFA, NSF, DOE,
Michigan State U.
 ABySS assembler – multi-node assembly in RAM
On-disk assembly:
 SOAP assembler (BGI) – not open source
 Cortex assembler (EBI) – unpub/not released
 Contrail assembler (Michael Schatz) – unpub/not
released
It’s hard for me to tell how these last three compare ;)
BUT our current approach is orthogonal and can be
used in conjunction (as a pre-filter) with these
assemblers.

Probabilistic breakdown of assembly graphs


Editor's Notes

  1. Note, no tolerance for indels
  2. @@
  3. @@
  4. Paint between the greens.
  5. When a green connects two or more colors, recolor one color.
  6. Dependent on minimum-density tagging