PyCon 2011 talk - ngram assembly with Bloom filters
Handling ridiculous amounts of data with probabilistic data structures
C. Titus Brown
Michigan State University
Computer Science / Microbiology
Resources
http://www.slideshare.net/c.titus.brown/
Webinar: http://oreillynet.com/pub/e/1784
Source: github.com/ctb/
N-grams (this talk): khmer-ngram
DNA (the real cheese): khmer
khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)
Lincoln Stein Sequencing capacity is outscaling Moore’s Law.
Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.
Data generation vs data analysis
It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week. (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)
…x1000 sequencers
Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge?
Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume.
Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness
A brief intro to shotgun assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn+ fragments.
Not subdivisible; not easy to distribute; memory intensive.
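The overlap idea on this slide can be sketched in a few lines. This is only an illustration: the word size `K = 8` and the helper name `kmers` are assumptions made for the example, not values taken from the talk. Each fragment is cut into overlapping words of length K, and two fragments that came from neighboring text share some of those words.

```python
def kmers(text, k):
    """Return every overlapping substring of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

K = 8  # illustrative word size (assumption, not from the slides)

frag_a = "It was the best of times, it was the wor"
frag_b = ", it was the worst of times, it was the "

# Words that occur in both fragments hint that the fragments overlap.
shared = set(kmers(frag_a, K)) & set(kmers(frag_b, K))
print(len(shared) > 0)
print(sorted(shared)[:3])
```

Because the sentence itself repeats phrases, some shared words come from the repeat rather than from the true overlap, which is the difficulty the later "repetitive strings" slide points at.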
Define a hash function (word => num)
def hash(word):
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value
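A small worked example may make the hash concrete. It is a sketch under two assumptions: `MAX_K = 8` (the slide leaves `MAX_K` undefined) and the name `hash_word`, used here only to avoid shadowing Python's built-in `hash`. For ASCII input the function is a base-128 positional encoding, so two different words of the same length get different values; collisions appear only once the value is reduced modulo a table size.

```python
MAX_K = 8  # assumed upper bound on word length; not given in the slides

def hash_word(word):
    """Base-128 positional value of an ASCII word (same arithmetic as the slide)."""
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

print(hash_word("a"))          # 97
print(hash_word("ab"))         # 97 + 98*128 = 12641
print(hash_word("ab") % 1001)  # 629, the slot used in a table of size 1001
```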
class BloomFilter(object):
    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [ (size, [0] * size)
                        for size in tablesizes ]
        self.k = k

    def add(self, word):    # insert; ignore collisions
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all( ht[val % size]
                    for (size, ht) in self.tables )
Storing words in a Bloom filter
>>> x = BloomFilter([1001, 1003, 1005])
>>> 'oogaboog' in x
False
>>> x.add('oogaboog')
>>> 'oogaboog' in x
True
>>> x = BloomFilter([2])
>>> x.add('a')
>>> 'a' in x    # no false negatives
True
>>> 'b' in x
False
>>> 'c' in x    # …but false positives
True
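How often false positives like `'c' in x` happen can be estimated with a standard back-of-the-envelope formula. The sketch below assumes each inserted word lands on a uniformly random slot in every table, independently across tables, which idealises the `val % size` lookup; the function name is hypothetical. For the `BloomFilter([2])` example with one word inserted, half the slots are set, so roughly half of all absent words should test as present.

```python
def expected_false_positive_rate(tablesizes, n_words):
    """Approximate chance that an absent word tests as present.

    Assumes uniformly random, independent slot choices per table
    (an idealisation, not a property proven for the slide's hash).
    """
    rate = 1.0
    for size in tablesizes:
        # Fraction of slots expected to be set after n_words insertions.
        occupancy = 1.0 - (1.0 - 1.0 / size) ** n_words
        # A false positive needs a set slot in every table.
        rate *= occupancy
    return rate

print(expected_false_positive_rate([2], 1))                  # 0.5
print(expected_false_positive_rate([1001, 1003, 1005], 36))  # small
```

Adding tables or enlarging them both push the estimate down, which is why the three-table example behaves far better than the two-slot one.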
Storing text in a Bloom filter
class BloomFilter(object):
    …
    def insert_text(self, text):
        for i in range(len(text)-self.k+1):
            self.add(text[i:i+self.k])
def next_words(bf, word):    # try all 1-ch extensions
    prefix = word[1:]
    for ch in bf.allchars:
        word = prefix + ch
        if word in bf:
            yield ch

# descend into all successive 1-ch extensions
def retrieve_all_sentences(bf, start):
    word = start[-bf.k:]
    n = -1
    for n, ch in enumerate(next_words(bf, word)):
        ss = retrieve_all_sentences(bf, start + ch)
        for sentence in ss:
            yield sentence
    if n < 0:
        yield start
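The transcript uses `bf.allchars` and, on the following slides, `retrieve_first_sentence`, but defines neither. The sketch below is one way they might be filled in so the pieces run together under Python 3. Everything marked "assumed" is a guess for illustration: `DEFAULT_K = 8`, `allchars = string.printable`, the `hash_word` name, and the body of `retrieve_first_sentence` (including its cycle guard) are not taken from the talk.

```python
import string

DEFAULT_K = 8  # assumed word size; the slides do not state it

def hash_word(word):
    # Same base-128 arithmetic as the slide's hash(), renamed (assumed name).
    return sum(ord(ch) * 128**n for n, ch in enumerate(word))

class BloomFilter(object):
    allchars = string.printable  # assumed alphabet for 1-character extensions

    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        val = hash_word(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash_word(word)
        return all(ht[val % size] for size, ht in self.tables)

    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])

def next_words(bf, word):
    prefix = word[1:]
    for ch in bf.allchars:
        if prefix + ch in bf:
            yield ch

def retrieve_first_sentence(bf, start):
    """Hypothetical helper: follow the first extension until none is left."""
    sentence, seen = start, set()
    while True:
        word = sentence[-bf.k:]
        if word in seen:      # guard against looping on repeated words
            return sentence
        seen.add(word)
        ch = next(next_words(bf, word), None)
        if ch is None:        # dead end: no stored word extends this one
            return sentence
        sentence += ch

x = BloomFilter([1001, 1003, 1005, 1007])
x.insert_text('foo bar bazbif zap!')
print(retrieve_first_sentence(x, 'foo bar '))
```

Taking only the first extension keeps the walk linear, at the cost of silently following a false positive if one happens to sort before the true next character.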
Storing and retrieving text
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('foo bar bazbif zap!')
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> print(retrieve_first_sentence(x, 'foo bar '))
foo bar bazbif zap!
>>> print(retrieve_first_sentence(x, 'the quic'))
the quick brown fox jumped over the lazy dog
Sequence assembly
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('the quick brown fox jumped ')
>>> x.insert_text('jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic')
the quick brown fox jumped over the lazy dog
(This is known as the de Bruijn graph approach to assembly; cf. Velvet, ABySS, SOAPdenovo)
Repetitive strings are the devil
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('nanana, batman!')
>>> x.insert_text('my chemical romance: nanana')
>>> retrieve_first_sentence(x, "my chemical")
'my chemical romance: nanana, batman!'
Note, it’s a probabilistic data structure
Retrieval errors:
>>> x = BloomFilter([1001, 1003])    # small Bloom filter…
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic'),
('the quick brY',)
Assembling DNA sequence
Can’t directly assemble with Bloom filter approach (false connections, and also lacking many convenient graph properties)
But we can use the data structure to grok graph properties and eliminate/break up data:
    Eliminate small graphs (no false negatives!)
    Disconnected partitions (parts -> map reduce)
    Local graph complexity reduction & error/artifact trimming
…and then feed into other programs.
This is a data reducing prefilter
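The "eliminate small graphs" step can be sketched as a bounded breadth-first walk over k-mers. This is an illustration only: the function name, the threshold, and the use of a plain Python `set` as a stand-in for the Bloom filter are all assumptions, and this is not khmer's implementation. The reasoning it illustrates comes from the slide: because the filter has no false negatives, the graph it describes contains every real connection, so a component that looks small in the filter is at most that small in reality and can be dropped safely.

```python
from collections import deque

def component_is_small(kmer_store, start, alphabet, threshold):
    """True if fewer than `threshold` k-mers are reachable from `start`.

    `kmer_store` is anything supporting `in` (a set here; a Bloom
    filter in the setting the slide describes).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        # Neighbors overlap by k-1 characters, forwards or backwards.
        neighbors = [word[1:] + ch for ch in alphabet]
        neighbors += [ch + word[:-1] for ch in alphabet]
        for nb in neighbors:
            if nb in kmer_store and nb not in seen:
                seen.add(nb)
                if len(seen) >= threshold:
                    return False  # big enough; stop exploring early
                queue.append(nb)
    return True

def kmers(text, k):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

store = kmers("ACGTA", 4) | kmers("TTGCCATTGGA", 4)
print(component_is_small(store, "ACGT", "ACGT", 5))  # True: 2 k-mers
print(component_is_small(store, "TTGC", "ACGT", 5))  # False: 8 k-mers
```

The early return keeps the cost of rejecting a large component proportional to the threshold rather than to the component size.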
Right, but does it work??
Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500).
…compare with not at all on a 512 GB RAM machine.
Error/repeat trimming on a tricky worm genome: reduction from 170 GB resident / 60 hrs to 54 GB resident / 13 hrs
How good is this graph representation?
V. low false positive rates at ~2 bytes/k-mer;
Nearly exact human genome graph in ~5 GB.
Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome)
Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter)
Did I mention it’s constant memory? And independent of word size?
…only works for de Bruijn graphs
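The two figures on this slide combine into a rough memory budget. The arithmetic below uses only the slide's numbers (~2 bytes per k-mer, 50 billion k-mers); the function name and the choice of decimal gigabytes are assumptions for the example.

```python
BYTES_PER_KMER = 2        # "~2 bytes/k-mer" from the slide
SOIL_KMERS = 50 * 10**9   # "50 billion k-mers" from the slide

def table_memory_gb(n_kmers, bytes_per_kmer=BYTES_PER_KMER):
    """Approximate table size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_kmers * bytes_per_kmer / 1e9

print(table_memory_gb(SOIL_KMERS))  # 100.0
```

At that density the soil metagenome estimate comes to about 100 GB, and the figure does not change with the word size because only one fixed-size slot set is stored per k-mer.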
Thoughts for the future
Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics
Synopsis data structures & algorithms (which incl. probabilistic data structures) are a neat approach to parsing problem structure.
Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.
Groxel view of knot-like region / Arend Hintze
Acknowledgements:
The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze
Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.

Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 

PyCon 2011 talk - ngram assembly with Bloom filters

  • 1.
  • 2. Handling ridiculous amounts of data with probabilistic data structures C. Titus Brown Michigan State University Computer Science / Microbiology
  • 3. Resources http://www.slideshare.net/c.titus.brown/ Webinar: http://oreillynet.com/pub/e/1784 Source: github.com/ctb/ N-grams (this talk): khmer-ngram DNA (the real cheese): khmer khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)
  • 4. Lincoln Stein Sequencing capacity is outscaling Moore’s Law.
  • 5. Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.
  • 6. Data generation vs data analysis It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week. (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.) …x1000 sequencers Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
  • 7. The challenge? Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume. Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
  • 8. Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness
  • 9. A brief intro to shotgun assembly. Fragments: "It was the best of times, it was the wor"; ", it was the worst of times, it was the "; "isdom, it was the age of foolishness"; "mes, it was the age of wisdom, it was th". Assembled: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness". …but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
  • 10. Define a hash function (word => num) def hash(word): assert len(word) <= MAX_K value = 0 for n, ch in enumerate(word): value += ord(ch) * 128**n return value
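A runnable sketch of the slide-10 hash, for readers who want to try it. The function is renamed `kmer_hash` here so it does not shadow Python's built-in `hash`, and `MAX_K = 8` is an assumption; the slide does not give a value.

```python
MAX_K = 8  # assumed word-length limit; not stated on the slide


def kmer_hash(word):
    """Treat each character as a base-128 digit and sum the digits."""
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value


print(kmer_hash("a"))   # 97
print(kmer_hash("ab"))  # 97 + 98 * 128 = 12641
```

For ASCII words without NUL characters every digit is between 1 and 127, so distinct words give distinct integers. Collisions only appear later, when the value is reduced modulo a table size.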
  • 11. class BloomFilter(object): def __init__(self, tablesizes, k=DEFAULT_K): self.tables = [ (size, [0] * size) for size in tablesizes ] self.k = k def add(self, word): # insert; ignore collisions val = hash(word) for size, ht in self.tables: ht[val % size] = 1 def __contains__(self, word): val = hash(word) return all( ht[val % size] for (size, ht) in self.tables )
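The class above, laid out as runnable Python 3. This is a sketch rather than the khmer-ngram source: `DEFAULT_K = 8` is an assumption (the slides never state it), and `kmer_hash` stands in for the slide's `hash`.

```python
DEFAULT_K = 8  # assumed; chosen to match the 8-character start strings used later


def kmer_hash(word):
    """Base-128 positional hash, as on the hash-function slide."""
    return sum(ord(ch) * 128**n for n, ch in enumerate(word))


class BloomFilter:
    def __init__(self, tablesizes, k=DEFAULT_K):
        # One table per size; every table sees the same hash value,
        # reduced modulo its own size.
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        # Insert; collisions are simply ignored.
        val = kmer_hash(word)
        for size, table in self.tables:
            table[val % size] = 1

    def __contains__(self, word):
        # Present only if the slot is set in every table.
        val = kmer_hash(word)
        return all(table[val % size] for size, table in self.tables)


bf = BloomFilter([1001, 1003, 1005])
print("oogaboog" in bf)  # False
bf.add("oogaboog")
print("oogaboog" in bf)  # True

tiny = BloomFilter([2])
tiny.add("a")
print("a" in tiny, "b" in tiny, "c" in tiny)  # True False True
```

The `tiny` filter shows both properties at once: `'a'` is always found (no false negatives), while `'c'` hashes to the same slot as `'a'` modulo 2 and is reported as present (a false positive).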
  • 14. Storing words in a Bloom filter >>> x = BloomFilter([1001, 1003, 1005]) >>> 'oogaboog' in x False >>> x.add('oogaboog') >>> 'oogaboog' in x True >>> x = BloomFilter([2]) >>> x.add('a') >>> 'a' in x # no false negatives True >>> 'b' in x False >>> 'c' in x # …but false positives True
  • 15. Storing words in a Bloom filter >>> x = BloomFilter([1001, 1003, 1005]) >>> 'oogaboog' in x False >>> x.add('oogaboog') >>> 'oogaboog' in x True >>> x = BloomFilter([2]) # …false positives >>> x.add('a') >>> 'a' in x True >>> 'b' in x False >>> 'c' in x True
  • 16. Storing text in a Bloom filter class BloomFilter(object): … def insert_text(self, text): for i in range(len(text)-self.k+1): self.add(text[i:i+self.k])
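To see what `insert_text` actually adds, the same sliding window can be run on its own. The helper name `kmers` and the value `k = 8` are illustrative assumptions.

```python
def kmers(text, k):
    """Return every length-k window of text, in order (same loop as insert_text)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kmers("foo bar baz", 8))
# ['foo bar ', 'oo bar b', 'o bar ba', ' bar baz']
print(kmers("short", 8))
# []
```

Consecutive windows overlap by k-1 characters, which is what later lets the text be walked one character at a time. Text shorter than k contributes nothing.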
  • 17. def next_words(bf, word): # try all 1-ch extensions prefix = word[1:] for ch in bf.allchars: word = prefix + ch if word in bf: yield ch # descend into all successive 1-ch extensions def retrieve_all_sentences(bf, start): word = start[-bf.k:] n = -1 for n, ch in enumerate(next_words(bf, word)): ss = retrieve_all_sentences(bf,start + ch) for sentence in ss: yield sentence if n < 0: yield start
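A runnable Python 3 sketch of the traversal. Two things here are assumptions: `allchars` is taken to be `string.printable` (the slides use `bf.allchars` without defining it), and an exact set-backed class, `ExactKmerSet`, stands in for the Bloom filter so the output is free of false positives. Any object offering `in`, `.k`, and `.allchars` would do.

```python
import string


class ExactKmerSet:
    """Exact stand-in for the Bloom filter: same interface, no false positives."""
    allchars = string.printable  # assumed alphabet

    def __init__(self, k=8):
        self.k = k
        self.words = set()

    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.words.add(text[i:i + self.k])

    def __contains__(self, word):
        return word in self.words


def next_words(bf, word):
    # Try every 1-character extension of the last k-1 characters.
    prefix = word[1:]
    for ch in bf.allchars:
        if prefix + ch in bf:
            yield ch


def retrieve_all_sentences(bf, start):
    # Descend into every extension; yield `start` only at a dead end.
    word = start[-bf.k:]
    n = -1
    for n, ch in enumerate(next_words(bf, word)):
        yield from retrieve_all_sentences(bf, start + ch)
    if n < 0:
        yield start


store = ExactKmerSet()
store.insert_text("the quick brown fox")
store.insert_text("the quick brown cat")
print(sorted(retrieve_all_sentences(store, "the quic")))
# ['the quick brown cat', 'the quick brown fox']
```

The walk branches wherever two stored texts diverge. As written it recurses once per character and has no cycle check, so long texts or repeats that loop back on themselves would need an iterative version with a visited set.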
  • 19. Storing and retrieving text >>> x = BloomFilter([1001, 1003, 1005, 1007]) >>> x.insert_text('foo bar bazbif zap!') >>> x.insert_text('the quick brown fox jumped over the lazy dog') >>> print retrieve_first_sentence(x, 'foo bar ') foo bar bazbif zap! >>> print retrieve_first_sentence(x, 'the quic') the quick brown fox jumped over the lazy dog
  • 20. Sequence assembly >>> x = BloomFilter([1001, 1003, 1005, 1007]) >>> x.insert_text('the quick brown fox jumped ') >>> x.insert_text('jumped over the lazy dog') >>> retrieve_first_sentence(x, 'the quic') the quick brown fox jumped over the lazy dog (This is known as the de Bruijn graph approach to assembly; cf. Velvet, ABySS, SOAPdenovo)
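The slides call `retrieve_first_sentence` without showing it. A minimal guess at what it might do is sketched below: follow the first available extension until none is left. It uses a plain Python set of k-mers instead of the Bloom filter, and `K`, `ALPHABET`, and `max_len` are all assumptions made for the example.

```python
import string

K = 8                         # assumed word size
ALPHABET = string.printable   # assumed set of candidate characters


def kmers(text, k=K):
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def retrieve_first_sentence(words, start, k=K, max_len=1000):
    """Greedy walk: keep appending the first character that forms a known k-mer."""
    sentence = start
    while len(sentence) < max_len:  # guard against walking a cycle forever
        prefix = sentence[-k:][1:]
        ext = next((ch for ch in ALPHABET if prefix + ch in words), None)
        if ext is None:
            break
        sentence += ext
    return sentence


# Two overlapping fragments, stored only as k-mers.
words = kmers("the quick brown fox jumped ") | kmers("jumped over the lazy dog")
print(retrieve_first_sentence(words, "the quic"))
# the quick brown fox jumped over the lazy dog
```

Neither fragment contains the whole sentence. The join happens because the k-mer `' jumped '` from the first fragment shares its last seven characters with `'jumped o'` from the second.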
  • 21. Repetitive strings are the devil >>> x = BloomFilter([1001, 1003, 1005, 1007]) >>> x.insert_text('nanana, batman!') >>> x.insert_text('my chemical romance: nanana') >>> retrieve_first_sentence(x, "my chemical") 'my chemical romance: nanana, batman!'
  • 22. Note, it’s a probabilistic data structure Retrieval errors: >>> x = BloomFilter([1001, 1003]) # small Bloom filter… >>> x.insert_text('the quick brown fox jumped over the lazy dog') >>> retrieve_first_sentence(x, 'the quic'), ('the quick brY',)
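A rough way to see why the smaller filter misfires more often is to estimate the false positive rate from table occupancy. The sketch assumes an unseen word lands uniformly at random in each table and that tables behave independently, which is only an approximation here because every table is indexed from the same hash value. The function names and `k = 8` are illustrative.

```python
def kmer_hash(word):
    return sum(ord(ch) * 128**n for n, ch in enumerate(word))


def build_tables(tablesizes, text, k=8):
    """Fill one 0/1 table per size with the k-mers of text."""
    tables = [[0] * size for size in tablesizes]
    for i in range(len(text) - k + 1):
        val = kmer_hash(text[i:i + k])
        for table in tables:
            table[val % len(table)] = 1
    return tables


def estimated_false_positive_rate(tables):
    # A false positive needs a set slot in every table, so multiply
    # the fraction of occupied slots across tables.
    rate = 1.0
    for table in tables:
        rate *= sum(table) / len(table)
    return rate


text = "the quick brown fox jumped over the lazy dog"
small = build_tables([1001, 1003], text)
large = build_tables([1001, 1003, 1005, 1007], text)
print(estimated_false_positive_rate(small))
print(estimated_false_positive_rate(large))
```

Each extra table multiplies the estimate by another occupancy fraction, so the four-table filter can never score worse than the two-table one on the same text. During retrieval the per-query rate is also amplified, since every step tests one candidate word per character in the alphabet.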
  • 23. Assembling DNA sequence. Can’t directly assemble with the Bloom filter approach (false connections, and it also lacks many convenient graph properties). But we can use the data structure to grok graph properties and eliminate/break up data: eliminate small graphs (no false negatives!); disconnected partitions (parts -> map reduce); local graph complexity reduction & error/artifact trimming; …and then feed into other programs. This is a data-reducing prefilter.
  • 24. Right, but does it work?? Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500). …compare with not at all on a 512 GB RAM machine. Error/repeat trimming on a tricky worm genome: reduction from 170 GB resident / 60 hrs to 54 GB resident / 13 hrs.
  • 25. How good is this graph representation? Very low false positive rates at ~2 bytes/k-mer; nearly exact human genome graph in ~5 GB. Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome). Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter.) Did I mention it’s constant memory? And independent of word size? …only works for de Bruijn graphs.
  • 26. Thoughts for the future. Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics. Synopsis data structures & algorithms (which include probabilistic data structures) are a neat approach to parsing problem structure. Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.
  • 27. Groxel view of knot-like region / Arend Hintze
  • 28. Acknowledgements: The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.

Editor's Notes

  1. Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.