Instrument ALL the things:
Studying data-intensive
workflows in the clowd.
C. Titus Brown
Michigan State University
(See blog post)
A few upfront definitions
Big Data, n: whatever is still inconvenient to compute on.
Data scientist, n: a statistician who lives in San Francisco.
Professor, n: someone who writes grants to fund people
who do the work (cf. Fernando Perez)
I am a professor (not a data scientist) who
writes grants so that others can do
data-intensive biology.
This talk dedicated to Terry Peppers
Titus, I no longer understand
what you actually do…
Daddy, what do you do at
work!?
I assemble puzzles for a living.
Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.
Three bioinformatic strategies in use
• Greedy: “if the piece sorta fits…”
• N^2 – “Do these two pieces match? How about
this next one?”
• The Dutch approach.
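To make the N^2 strategy concrete, here is a minimal sketch of all-against-all comparison. The function names, the toy reads, and the minimum overlap length are assumptions made up for illustration; real assemblers are far more careful about errors and orientation.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def all_pairs(reads, min_len=3):
    """Check every ordered pair of reads: N * (N - 1) comparisons."""
    hits = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            hits[(a, b)] = n
    return hits

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy reads, invented for this example
hits = all_pairs(reads)
comparisons = len(reads) * (len(reads) - 1)
print(hits, comparisons)
```

Three pieces already need six comparisons; with billions of pieces the pair count is what makes this strategy impractical.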
The Dutch Solution
(De Bruijn assembly)
Find similarities within puzzle pieces
The Dutch Solution
Algorithmically:
• Is linear in time with number of pieces
(Way better than N^2!)
• Is linear in memory with volume of data
(This is due to errors in the digitization process.)
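A rough sketch of the idea, assuming a plain dictionary-of-sets graph and toy reads invented for this example (this is not how Velvet or khmer store things): each read is cut into overlapping k-mers in a single pass, and every sequencing error contributes brand-new k-mers, which is why memory tracks data volume.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:                  # one pass over the reads: linear time
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def count_distinct(reads, k):
    """Number of distinct k-mers, a stand-in for the memory the graph needs."""
    return len({km for read in reads for km in kmers(read, k)})

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy reads, invented for this example
graph = build_de_bruijn(reads, k=4)
clean = count_distinct(reads, 4)
noisy = count_distinct(reads + ["ATGGAG"], 4)  # same region, one wrong base
print(clean, noisy)
```

The extra read covers sequence already seen, yet its single error adds two new k-mers that the graph has to keep.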
Practical memory measurements
Velvet measurements (Adina Howe), memory in GB RAM
(About $500 of data)
Our research challenges –
1. It costs only $10k & 1 week to generate
enough sequence data that no commodity
computer (and few supercomputers) can
assemble it.
2. Hundreds -> thousands of such data sets are
being generated each year.
Our research (i) - CS
• Streaming lossy compression approach that
discards pieces we’ve seen before.
• Low memory probabilistic data structures.
(…see PyCon 2013 talk)
=> RAM now scales better: O(I) where I << N
(I is sample dependent but typically I < N/20)
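One way such a streaming filter could look, as a sketch only: the Count-Min Sketch counter, the median rule, the cutoff, and all names below are assumptions chosen for illustration, not a description of the khmer code.

```python
import hashlib
from statistics import median

class CountMinSketch:
    """Fixed-size approximate counter: may overcount, never undercounts."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by salting the item with the row number.
        for row, table in enumerate(self.tables):
            digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield table, int(digest, 16) % self.width

    def add(self, item):
        for table, slot in self._slots(item):
            table[slot] += 1

    def count(self, item):
        return min(table[slot] for table, slot in self._slots(item))

def stream_filter(reads, k=4, cutoff=2, sketch=None):
    """Yield only reads whose median k-mer count is still below `cutoff`."""
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kms:
            continue                    # read shorter than k: nothing to count
        if median(sketch.count(km) for km in kms) < cutoff:
            for km in kms:
                sketch.add(km)          # only kept reads update the counts
            yield read

# Toy stream: one piece seen five times, one seen once.
stream = ["ATGGCGTG"] * 5 + ["TTTTACCA"]
kept = list(stream_filter(stream))
print(kept)
```

Memory is set by the table size rather than by the number of reads, and repeats of an already well-covered piece are dropped as they stream past.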
Our research (ii) - approach
• Open source, open data, open science, and
reproducible computational research.
– GitHub
– Automated testing, CI, & literate reSTing
– Blogging, Twitter
– IPython Notebook for data analysis, figures.
• Protocols for assembling in the cloud.
Molgula oculata and Molgula occulta
Real solutions, tackling squishy biology!
Elijah Lowe & Billie Swalla
Doing things right => #awesomesauce
Protocols in English for running analyses in the cloud
Literate reSTing => shell scripts
Tool competitions
Benchmarking
Education
Acceptance tests
Benchmarking strategy
• Rent a bunch of cloud VMs from Amazon and
Rackspace.
• Extract commands from tutorials using
literate-resting.
• Use ‘sar’ (sysstat pkg) to sample CPU, RAM,
and disk I/O.
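The extraction step can be pictured with a small sketch. It assumes the tutorial commands sit in indented reST literal blocks introduced by a line ending in "::"; the function name and the toy tutorial are hypothetical, and the real literate-resting script may use different conventions.

```python
def extract_commands(rst_text):
    """Collect the lines of indented literal blocks introduced by '::'."""
    commands, in_block = [], False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue                    # blank lines around the block
            if line.startswith((" ", "\t")):
                commands.append(stripped)   # indented: part of the block
                continue
            in_block = False                # a dedented line closes the block
        if stripped.endswith("::"):
            in_block = True
    return commands

# Toy tutorial text, invented for this example.
tutorial = """\
First step::

    echo step one

Then two more::

    echo step two
    echo step three

Done.
"""

commands = extract_commands(tutorial)
print("\n".join(commands))
```

The resulting list can be written out as a shell script and run on each rented VM while sar samples the machine in the background.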
Benchmarking output
Data subset; AWS m1.xlarge
Each protocol has many steps
Data subset; AWS m1.xlarge
Most interested in RAM-intensive bit
Data subset; AWS m1.xlarge
Most interested in RAM-intensive bit
Complete data; AWS m1.xlarge
Observation #1: Rackspace is faster
machine          data disk       working disk    hours   cost
rackspace-15gb   200 GB          100 GB          34.9    $23.70
m2.xlarge        EBS             ephemeral       44.7    $18.34
m1.xlarge        EBS             ephemeral       45.5    $21.82
m1.xlarge        EBS, max IOPS   ephemeral       49.1    $23.56
m1.xlarge        EBS, max IOPS   EBS, max IOPS   52.5    $25.20
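A quick way to read the table is to derive an effective hourly rate and a runtime relative to the fastest run. The numbers are copied from the table; the derived columns and variable names are additions for illustration, and reading the third column as the working disk is an assumption.

```python
runs = [
    # (machine, data disk, working disk, hours, cost in USD)
    ("rackspace-15gb", "200 GB", "100 GB", 34.9, 23.70),
    ("m2.xlarge", "EBS", "ephemeral", 44.7, 18.34),
    ("m1.xlarge", "EBS", "ephemeral", 45.5, 21.82),
    ("m1.xlarge", "EBS, max IOPS", "ephemeral", 49.1, 23.56),
    ("m1.xlarge", "EBS, max IOPS", "EBS, max IOPS", 52.5, 25.20),
]

fastest = min(hours for *_, hours, _ in runs)

summary = [
    (machine, data, work, round(cost / hours, 2), round(hours / fastest, 2))
    for machine, data, work, hours, cost in runs
]

for machine, data, work, rate, slowdown in summary:
    print(f"{machine:15s} {data:14s} {work:14s} {rate:5.2f} $/h  {slowdown:4.2f}x")
```

On these numbers the Rackspace run finishes first but costs the most per hour, while m2.xlarge is the cheapest run overall.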
Surprise #1: AWS ephemeral storage is
FASTER
Observation #2: NUMA costs
Same task done with varying memory sizes.
Can’t we just use a faster computer?
• Demo data on m1.xlarge: 2789 s
• Demo data on m3.xlarge: 1970 s – 30% faster!
(Why?
m3.xlarge has 2x40 GB SSD drives & 40% faster
cores.)
Great! Let’s try it out!
Observation #3: multifaceted problem!
• Full data on m1.xlarge: 45.5 h
• Full data on m3.xlarge: out of disk space.
We need about 200 GB to run the full pipeline.
You can have fast disk or lots of disk but not
both, for the moment.
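The arithmetic behind these two slides, as a small check. It assumes the two 40 GB SSDs are the only storage available to the pipeline on m3.xlarge; the variable names are made up for this sketch.

```python
m1_seconds, m3_seconds = 2789, 1970      # demo data timings from the slides

# Fraction of the m1.xlarge runtime saved by moving to m3.xlarge.
time_saved = 1 - m3_seconds / m1_seconds

m3_disk_gb = 2 * 40                      # two 40 GB SSD drives
needed_gb = 200                          # space needed by the full pipeline

fits = m3_disk_gb >= needed_gb
shortfall_gb = max(0, needed_gb - m3_disk_gb)

print(f"time saved: {time_saved:.0%}, fits: {fits}, short by {shortfall_gb} GB")
```

The faster machine saves a bit under a third of the runtime on the demo data but falls 120 GB short of what the full run needs.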
Future directions
1. Invest in cache-local data structures and
algorithms.
2. Invest in streaming/in-memory approaches.
3. Not clear (to me) that straight code
optimization or infrastructure engineering is
a worthwhile investment.
Frequently Offered Solutions
1. You should like, totally multithread that.
(See: McDonald & Brown, POSA)
2. Hadoop will just crush that workload, dude.
(Unlikely to be cost-effective.)
3. Have you tried <my proprietary Big Data
technology stack>?
(Thatz Not Science)
Optimization vs scaling
• Linear time/memory improvements would not
have addressed our core problem.
(2 years, 20x improvement, 100x increase in data.)
• Puzzle problem is a graph problem with big
data, no locality, small compute. Not friendly.
• We need(ed) to scale our algorithms.
• Can now run on single-chassis, in ~15 GB RAM.
Optimization vs scaling --
Scaling can be more important!
What are we losing by focusing our
engineering on pleasantly parallel
problems?
• Hadoop is fundamentally not that interesting.
• Research is about the 100x.
• Scaling new problems, evaluating/creating
new data structures and algorithms, etc.
(From my PyCon 2011 talk.)
Theme: Life’s too short to tackle the
easy problems – come to academia!
Thanks!
• Leigh Sheneman, for starting the
benchmarking project.
• Labbies: Michael R. Crusoe, Luiz Irber, Likit
Preeyanon, Camille Scott, and Qingpeng
Zhang.
Thanks!
• github.com/ged-lab/
– khmer – core project
– khmer-protocols – tutorials/acceptance tests
– literate-resting – script to pull out code from reST tutorials
• Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
• Michael R. Crusoe, Likit Preeyanon, Camille Scott, and
Qingpeng Zhang are here at PyCon.
…note, you can probably afford to
buy them off me :)
Different computational strategies for
k-mer counting, revealed!
Khmer-counting paper pipeline; Qingpeng Zhang

2014 pycon-talk

  • 1. Instrument ALL the things: Studying data-intensive workflows in the clowd. C. Titus Brown Michigan State University (See blog post)
  • 2. A few upfront definitions Big Data, n: whatever is still inconvenient to compute on. Data scientist, n: a statistician who lives in San Francisco. Professor, n: someone who writes grants to fund people who do the work (c.f. Fernando Perez) I am a professor (not a data scientist) who writes grants so that others can do data- intensive biology.
  • 3. This talk dedicated to Terry Peppers Titus, I no longer understand what you actually do… Daddy, what do you do at work!?
  • 4. I assemble puzzles for a living. Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.
  • 5. Three bioinformatic strategies in use • Greedy: “if the piece sorta fits…” • N2 – “Do these two pieces match? How about this next one?” • The Dutch approach.
  • 6. The Dutch Solution (De Bruijn assembly) Find similarities within puzzle pieces
  • 7. The Dutch Solution Algorithmically: • Is linear in time with number of pieces  (Way better than N2!) • Is linear in memory with volume of data  (This is due to errors in digitization process.)
  • 8. Practical memory measurements Velvet measurements (Adina Howe) GB RAM (About $500 of data)
  • 9. Our research challenges – 1. It costs only $10k & 1 week to generate enough sequence data that no commodity computer (and few supercomputers) can assemble it. 2. Hundreds -> thousands of such data sets are being generated each year.
  • 10. Our research challenges – 1. It costs only $10k & 1 week to generate enough sequence data that no commodity computer (and few supercomputers) can assemble it. 2. Hundreds -> thousands of such data sets are being generated each year.
  • 11. Our research (i) - CS • Streaming lossy compression approach that discards pieces we’ve seen before. • Low memory probabilistic data structures. (…see Pycon 2013 talk) => RAM now scales better: O(I) where I << N (I is sample dependent but typically I < N/20)
  • 12. Our research (ii) - approach • Open source, open data, open science, and reproducible computational research. – GitHub – Automated testing, CI, & literate reSTing – Blogging, Twitter – IPython Notebook for data analysis, figures. • Protocols for assembling in the cloud.
  • 13. Molgula oculata Molgula occulta Molgula oculata Real solutions, tackling squishy biology! Elijah Lowe & Billie Swalla
  • 14. Doing things right => #awesomesauce Protocols in English for running analyses in the cloud Literate reSTing => shell scripts Tool competitions Benchmarking Education Acceptance tests
  • 15. Benchmarking strategy • Rent a bunch of cloud VMs from Amazon and Rackspace. • Extract commands from tutorials using literate-resting. • Use ‘sar’ (sysstat pkg) to sample CPU, RAM, and disk I/O.
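
The command-extraction step on slide 15 might look roughly like the sketch below. It is a hypothetical stand-in, not the literate-resting script: it assumes commands live in reST literal blocks introduced by a line ending in `::`, and the tutorial text is made up. Sampling with `sar` is not shown.

```python
import textwrap


def _clean(block):
    """Dedent a literal block and drop blank lines."""
    dedented = textwrap.dedent("\n".join(block))
    return [line.strip() for line in dedented.splitlines() if line.strip()]


def extract_commands(rst_text):
    """Return the shell commands found in reST literal blocks (after '::')."""
    commands = []
    in_block = False
    block = []
    for line in rst_text.splitlines():
        if in_block:
            if line.strip() == "" or line.startswith((" ", "\t")):
                block.append(line)
                continue
            # A non-indented line ends the literal block.
            commands.extend(_clean(block))
            block, in_block = [], False
        if line.rstrip().endswith("::"):
            in_block = True
    commands.extend(_clean(block))
    return commands


# Made-up tutorial text, for illustration only.
tutorial = """\
Set up a working directory::

   mkdir -p /mnt/work

Then move into it and start::

   cd /mnt/work
   echo assembling

Done.
"""

commands = extract_commands(tutorial)
print("\n".join(commands))
```

Writing the extracted lines to a shell script gives something that can be run unattended on each rented VM while the system monitor samples in the background.
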
  • 17. Each protocol has many steps Data subset; AWS m1.xlarge
  • 18. Most interested in RAM-intensive bit Data subset; AWS m1.xlarge
  • 19. Most interested in RAM-intensive bit Complete data; AWS m1.xlarge
  • 20. Observation #1: Rackspace is faster
      machine        | data disk     | working disk  | hours | cost
      rackspace-15gb | 200 GB        | 100 GB        | 34.9  | $23.70
      m2.xlarge      | EBS           | ephemeral     | 44.7  | $18.34
      m1.xlarge      | EBS           | ephemeral     | 45.5  | $21.82
      m1.xlarge      | EBS, max IOPS | ephemeral     | 49.1  | $23.56
      m1.xlarge      | EBS, max IOPS | EBS, max IOPS | 52.5  | $25.20
  • 21. Surprise #1: AWS ephemeral storage is FASTER
      machine        | data disk     | working disk  | hours | cost
      rackspace-15gb | 200 GB        | 100 GB        | 34.9  | $23.70
      m2.xlarge      | EBS           | ephemeral     | 44.7  | $18.34
      m1.xlarge      | EBS           | ephemeral     | 45.5  | $21.82
      m1.xlarge      | EBS, max IOPS | ephemeral     | 49.1  | $23.56
      m1.xlarge      | EBS, max IOPS | EBS, max IOPS | 52.5  | $25.20
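
The hours and cost columns on slides 20 and 21 can be combined into an implied hourly rate. The snippet below is a small worked calculation that assumes each run was billed at a flat hourly rate (so rate = cost / hours); the slides do not state the billing model.

```python
# (machine, data disk, working disk, hours, total cost in USD), copied from the table.
runs = [
    ("rackspace-15gb", "200 GB", "100 GB", 34.9, 23.70),
    ("m2.xlarge", "EBS", "ephemeral", 44.7, 18.34),
    ("m1.xlarge", "EBS", "ephemeral", 45.5, 21.82),
    ("m1.xlarge", "EBS, max IOPS", "ephemeral", 49.1, 23.56),
    ("m1.xlarge", "EBS, max IOPS", "EBS, max IOPS", 52.5, 25.20),
]

# Assumption: flat hourly billing, so the implied rate is cost divided by hours.
rates = [round(cost / hours, 2) for *_, hours, cost in runs]

for (machine, data_disk, working_disk, hours, cost), rate in zip(runs, rates):
    print(f"{machine:<16}{data_disk:<15}{working_disk:<15}{hours:>5.1f} h  ${rate:.2f}/h")
```

On these numbers the Rackspace run finishes in the fewest hours but at the highest implied hourly rate, while the m2.xlarge run has the lowest total cost.
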
  • 22. Observation #2: NUMA costs Same task done with varying memory sizes.
  • 24. Can’t we just use a faster computer? • Demo data on m1.xlarge: 2789 s • Demo data on m3.xlarge: 1970 s – 30% faster! (Why? m3.xlarge has 2x40 GB SSD drives & 40% faster cores.) Great! Let’s try it out!
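
The "30% faster" figure on slide 24 is best read as a reduction in wall-clock time. A quick check of the arithmetic, using only the two timings from the slide:

```python
m1_seconds = 2789  # demo data on m1.xlarge
m3_seconds = 1970  # demo data on m3.xlarge

# Fraction of wall-clock time saved by moving to m3.xlarge.
time_saved = 1 - m3_seconds / m1_seconds

# Equivalent speedup factor (how many times faster the job completes).
speedup = m1_seconds / m3_seconds

print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

The same pair of numbers can be quoted as roughly 30% less time or roughly a 1.4x speedup, depending on which way the ratio is taken.
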
  • 25. Observation #3: multifaceted problem! • Full data on m1.xlarge: 45.5 h • Full data on m3.xlarge: out of disk space. We need about 200 GB to run the full pipeline. You can have fast disk or lots of disk but not both, for the moment.
  • 26. Future directions 1. Invest in cache-local data structures and algorithms. 2. Invest in streaming/in-memory approaches. 3. Not clear (to me) that straight code optimization or infrastructure engineering is a worthwhile investment.
  • 27. Frequently Offered Solutions 1. You should like, totally multithread that. (See: McDonald & Brown, POSA) 2. Hadoop will just crush that workload, dude. (Unlikely to be cost-effective.) 3. Have you tried <my proprietary Big Data technology stack>? (Thatz Not Science)
  • 28. Optimization vs scaling • Linear time/memory improvements would not have addressed our core problem. (2 years, 20x improvement, 100x increase in data.) • Puzzle problem is a graph problem with big data, no locality, small compute. Not friendly. • We need(ed) to scale our algorithms. • Can now run on single-chassis, in ~15 GB RAM.
  • 30. Scaling can be more important!
  • 31. What are we losing by focusing our engineering on pleasantly parallel problems? • Hadoop is fundamentally not that interesting. • Research is about the 100x. • Scaling new problems, evaluating/creating new data structures and algorithms, etc.
  • 32. (From my PyCon 2011 talk.) Theme: Life’s too short to tackle the easy problems – come to academia!
  • 33. Thanks! • Leigh Sheneman, for starting the benchmarking project. • Labbies: Michael R. Crusoe, Luiz Irber, Likit Preeyanon, Camille Scott, and Qingpeng Zhang.
  • 34. Thanks! • github.com/ged-lab/ – khmer – core project – khmer-protocols – tutorials/acceptance tests – literate-resting – script to pull out code from reST tutorials • Blog post at: http://ivory.idyll.org/blog/2014-pycon.html • Michael R. Crusoe, Likit Preeyanon, Camille Scott, and Qingpeng Zhang are here at PyCon. …note, you can probably afford to buy them off me :)
  • 35. Different computational strategies for k-mer counting, revealed! Khmer-counting paper pipeline; Qingpeng Zhang

Editor's Notes

  1. …spent the last 15 years getting to the point where I earn considerably less than many of you
  2. Billions of pieces; hi-dimensional puzzle
  3. Acceptance testing other people’s software
  4. Color.
  5. Walk through
  6. Add cost.
  7. Add cost.
  8. Apparently I’m approachable, trying to work on that.