2014 pycon-talk

Instrument ALL the things:
Studying data-intensive
workflows in the clowd.
C. Titus Brown
Michigan State University
(See blog post)
A few upfront definitions
Big Data, n: whatever is still inconvenient to compute on.
Data scientist, n: a statistician who lives in San Francisco.
Professor, n: someone who writes grants to fund people
who do the work (cf. Fernando Perez)
I am a professor (not a data scientist) who
writes grants so that others can do
data-intensive biology.
This talk dedicated to Terry Peppers
Titus, I no longer understand
what you actually do…
Daddy, what do you do at
work!?
I assemble puzzles for a living.
Well, ok, I strategize about solving multi-dimensional puzzles with billions of pieces and no box.
Three bioinformatic strategies in use
• Greedy: “if the piece sorta fits…”
• N^2 – “Do these two pieces match? How about
this next one?”
• The Dutch approach.
The Dutch Solution
(De Bruijn assembly)
Find similarities within puzzle pieces
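To make the puzzle analogy concrete, here is a minimal sketch of De Bruijn-style assembly. Each read is cut into overlapping k-mers, and reads are joined through the (k-1)-mers they share. The function names, the toy reads, and the choice of k = 4 are illustrative assumptions, not taken from the talk or from any particular assembler.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from start, spelling out the joined sequence."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # toy overlapping reads (made up)
graph = build_de_bruijn(reads, k=4)
print(walk(graph, "ATG"))
```

No read is ever compared against another read directly; each read is visited once, which is where the linear running time comes from.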
The Dutch Solution
Algorithmically:
• Is linear in time with number of pieces
(Way better than N^2!)
• Is linear in memory with volume of data
(This is due to errors in digitization process.)
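A small sketch of why errors drive memory use: without errors, extra coverage of the same sequence adds no new k-mers, but each erroneous read can contribute up to k k-mers that have never been seen before. The toy sequence, the single substitution, and the helper name are assumptions made for illustration.

```python
def distinct_kmers(reads, k):
    """Return the set of distinct k-mers across all reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


genome = "ATGGCGTGCAATCCGA"                 # toy sequence (made up)
k = 4
clean = [genome] * 50                       # 50 identical, error-free reads
noisy = clean[:-1] + ["ATGGCGTGCTATCCGA"]   # one read carries one substitution

print("distinct k-mers, clean reads:", len(distinct_kmers(clean, k)))
print("distinct k-mers, one error:  ", len(distinct_kmers(noisy, k)))
```

With a roughly constant error rate per base, the number of novel k-mers keeps growing as more data is sequenced, so a structure that stores every k-mer grows with it.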
Practical memory measurements
[Figure: Velvet memory usage in GB RAM (measurements by Adina Howe); about $500 of data]
Our research challenges –
1. It costs only $10k & 1 week to generate
enough sequence data that no commodity
computer (and few supercomputers) can
assemble it.
2. Hundreds -> thousands of such data sets are
being generated each year.
Our research (i) - CS
• Streaming lossy compression approach that
discards pieces we’ve seen before.
• Low memory probabilistic data structures.
(…see PyCon 2013 talk)
=> RAM now scales better: O(I) where I << N
(I is sample dependent but typically I < N/20)
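The two bullets above can be sketched together: a fixed-memory approximate counter (a Count-Min Sketch) tracks k-mer counts, and a single streaming pass drops any read whose k-mers have mostly been seen often enough already. This is a sketch under assumptions, not the khmer implementation; the class name, the median-count rule, the cutoff, and the table sizes are all chosen here for illustration.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter; it may overcount but never undercounts."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One independent-looking hash per table, derived by salting with the row.
        for row, table in enumerate(self.tables):
            digest = hashlib.sha1(f"{row}:{item}".encode()).hexdigest()
            yield table, int(digest, 16) % self.width

    def add(self, item):
        for table, slot in self._slots(item):
            table[slot] += 1

    def count(self, item):
        return min(table[slot] for table, slot in self._slots(item))


def normalize(reads, k=4, cutoff=3):
    """Stream reads once; keep a read only if its median k-mer count is below cutoff."""
    counts = CountMinSketch()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts.count(km) for km in kmers) < cutoff:
            for km in kmers:
                counts.add(km)
            yield read


reads = ["ATGGCGTGCA"] * 20 + ["TTACCGGATC"] * 2   # toy data (made up)
kept = list(normalize(reads))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Memory is set by the table size rather than by the number of reads, and the abundant read is capped near the cutoff while the rare one passes through untouched. That is the sense in which RAM tracks the information content I rather than the raw data volume N.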
Our research (ii) - approach
• Open source, open data, open science, and
reproducible computational research.
– GitHub
– Automated testing, CI, & literate reSTing
– Blogging, Twitter
– IPython Notebook for data analysis, figures.
• Protocols for assembling in the cloud.
Molgula oculata
Molgula occulta
Molgula oculata
Real solutions, tackling squishy biology!
Elijah Lowe & Billie Swalla
Doing things right => #awesomesauce
Protocols in English
for running analyses in
the cloud
Literate reSTing =>
shell scripts
Tool
competitions
Benchmarking
Education
Acceptance
tests
Benchmarking strategy
• Rent a bunch of cloud VMs from Amazon and
Rackspace.
• Extract commands from tutorials using
literate-resting.
• Use ‘sar’ (sysstat pkg) to sample CPU, RAM,
and disk I/O.
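The extraction step can be sketched as follows: pull every indented literal block (the text following a line that ends in `::`) out of a reST tutorial, giving a list of shell commands to run on a fresh VM. This is a simplified stand-in written for illustration, not the actual literate-resting script, and the toy tutorial and its commands are made up.

```python
def extract_commands(rst_text):
    """Return the lines of every reST literal block (indented text after '::')."""
    commands, in_block = [], False
    for line in rst_text.splitlines():
        if in_block:
            if line.startswith((" ", "\t")):
                commands.append(line.strip())
                continue
            if not line.strip():      # blank lines may sit inside a block
                continue
            in_block = False          # first unindented line ends the block
        if line.rstrip().endswith("::"):
            in_block = True
    return commands


tutorial = """\
Install the software
====================

First, update the package list::

    apt-get update
    apt-get -y install sysstat

Then start recording system activity::

    sar -u -r -d 1 > sar.txt &

Now read on.
"""

print("\n".join(extract_commands(tutorial)))
```

Because the commands come straight from the tutorial text, a run that fails means the tutorial itself is broken, which is what lets the protocols double as acceptance tests.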
Benchmarking output
Data subset; AWS m1.xlarge
Each protocol has many steps
Data subset; AWS m1.xlarge
Most interested in RAM-intensive bit
Data subset; AWS m1.xlarge
Most interested in RAM-intensive bit
Complete data; AWS m1.xlarge
Observation #1: Rackspace is faster
machine         data disk      working disk   hours  cost
rackspace-15gb  200 GB         100 GB         34.9   $23.70
m2.xlarge       EBS            ephemeral      44.7   $18.34
m1.xlarge       EBS            ephemeral      45.5   $21.82
m1.xlarge       EBS, max IOPS  ephemeral      49.1   $23.56
m1.xlarge       EBS, max IOPS  EBS, max IOPS  52.5   $25.20
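A quick way to read the table is to divide cost by hours. The hourly rates printed below are derived from the table's own numbers, not quoted provider prices, and the variable names are just for this sketch.

```python
runs = [
    # (machine, data disk, working disk, hours, cost in USD), from the table above
    ("rackspace-15gb", "200 GB", "100 GB", 34.9, 23.70),
    ("m2.xlarge", "EBS", "ephemeral", 44.7, 18.34),
    ("m1.xlarge", "EBS", "ephemeral", 45.5, 21.82),
    ("m1.xlarge", "EBS, max IOPS", "ephemeral", 49.1, 23.56),
    ("m1.xlarge", "EBS, max IOPS", "EBS, max IOPS", 52.5, 25.20),
]

for machine, data, working, hours, cost in runs:
    print(f"{machine:15} {data:14} {working:14} {hours:5.1f} h  ${cost / hours:.2f}/h")

fastest = min(runs, key=lambda r: r[3])
cheapest = min(runs, key=lambda r: r[4])
print("fastest:", fastest[0], "| cheapest:", cheapest[0])
```

The fastest run is not the cheapest one: the implied hourly rate differs enough between machines that a slower machine can still win on cost.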
Surprise #1: AWS ephemeral storage is FASTER
Observation #2: NUMA costs
Same task done with varying memory sizes.
Can’t we just use a faster computer?
• Demo data on m1.xlarge: 2789 s
• Demo data on m3.xlarge: 1970 s – 30% faster!
(Why?
m3.xlarge has 2x40 GB SSD drives & 40% faster
cores.)
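The "30% faster" figure can be checked from the two timings. Note that a drop in elapsed time and a rise in throughput are different ratios of the same numbers; the variable names below are only for this sketch.

```python
m1_seconds = 2789   # demo data on m1.xlarge
m3_seconds = 1970   # demo data on m3.xlarge

time_saved = (m1_seconds - m3_seconds) / m1_seconds
speedup = m1_seconds / m3_seconds

print(f"elapsed time drops by {time_saved:.1%}")
print(f"throughput rises by a factor of {speedup:.2f}")
```

The throughput ratio lines up with the 40% faster cores quoted above.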
Great! Let’s try it out!
Observation #3: multifaceted problem!
• Full data on m1.xlarge: 45.5 h
• Full data on m3.xlarge: out of disk space.
We need about 200 GB to run the full pipeline.
You can have fast disk or lots of disk but not
both, for the moment.
Future directions
1. Invest in cache-local data structures and
algorithms.
2. Invest in streaming/in-memory approaches.
3. Not clear (to me) that straight code
optimization or infrastructure engineering is
a worthwhile investment.
Frequently Offered Solutions
1. You should like, totally multithread that.
(See: McDonald & Brown, POSA)
2. Hadoop will just crush that workload, dude.
(Unlikely to be cost-effective.)
3. Have you tried <my proprietary Big Data
technology stack>?
(Thatz Not Science)
Optimization vs scaling
• Linear time/memory improvements would not
have addressed our core problem.
(2 years, 20x improvement, 100x increase in data.)
• Puzzle problem is a graph problem with big
data, no locality, small compute. Not friendly.
• We need(ed) to scale our algorithms.
• Can now run on single-chassis, in ~15 GB RAM.
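The arithmetic behind the first bullet, reading the parenthetical as a 20x efficiency gain set against a 100x growth in data over the same two years (that reading is an assumption):

```python
years = 2
efficiency_gain = 20    # "20x improvement"
data_growth = 100       # "100x increase in data"

net_load = data_growth / efficiency_gain
print(f"after {years} years the workload is {net_load:.0f}x larger "
      f"than capacity, despite a {efficiency_gain}x gain")
```

A constant-factor speedup only delays the wall. Changing what memory scales with, as in the O(I) result earlier in the talk, is what moves it.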
Optimization vs scaling --
Scaling can be more important!
What are we losing by focusing our
engineering on pleasantly parallel
problems?
• Hadoop is fundamentally not that interesting.
• Research is about the 100x.
• Scaling new problems, evaluating/creating
new data structures and algorithms, etc.
(From my PyCon 2011 talk.)
Theme: Life’s too short to tackle the
easy problems – come to academia!
Thanks!
• Leigh Sheneman, for starting the
benchmarking project.
• Labbies: Michael R. Crusoe, Luiz Irber, Likit
Preeyanon, Camille Scott, and Qingpeng
Zhang.
Thanks!
• github.com/ged-lab/
– khmer – core project
– khmer-protocols – tutorials/acceptance tests
– literate-resting – script to pull out code from reST tutorials
• Blog post at: http://ivory.idyll.org/blog/2014-pycon.html
• Michael R. Crusoe, Likit Preeyanon, Camille Scott, and
Qingpeng Zhang are here at PyCon.
…note, you can probably afford to
buy them off me :)
Different computational strategies for
k-mer counting, revealed!
Khmer-counting paper pipeline; Qingpeng Zhang

Editor's Notes

  1. …spent last 15 years getting to the point where I earn considerably less than many of you
  2. Billions of pieces; high-dimensional puzzle
  3. Acceptance testing other people’s software
  4. Color.
  5. Walk through
  6. Add cost.
  7. Add cost.
  8. Apparently I’m approachable, trying to work on that.