Openness and reproducibility 
in computational science: 
tools, approaches, and 
thought patterns. 
C. Titus Brown 
ctb@msu.edu 
October 16, 2014
Hello! 
Assistant Professor @ MSU; Microbiology; Computer 
Science; etc. 
=> UC Davis VetMed in 2015. 
More information at: 
• ged.msu.edu/ 
• github.com/ged-lab/ 
• ivory.idyll.org/blog/ 
• @ctitusbrown
The challenges of non-model 
sequencing 
• Missing or low quality genome reference. 
• Evolutionarily distant. 
• Most extant computational tools focus on model 
organisms – 
o Assume low polymorphism (internal variation) 
o Assume reference genome 
o Assume somewhat reliable functional annotation 
o More significant compute infrastructure 
…and cannot easily or directly be used on critters of interest.
Shotgun sequencing & assembly 
http://eofdreams.com/library.html; 
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; 
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Shotgun sequencing 
analysis goals: 
• Assembly (what is the text?) 
o Produces new genomes & transcriptomes. 
o Gene discovery for enzymes, drug targets, etc. 
• Counting (how many copies of each book?) 
o Measure gene expression levels, protein-DNA 
interactions 
• Variant calling (how does each edition vary?) 
o Discover genetic variation: genotyping, linkage 
studies… 
o Allele-specific expression analysis.
Assembly 
It was the best of times, it was the wor 
, it was the worst of times, it was the 
isdom, it was the age of foolishness 
mes, it was the age of wisdom, it was th 
It was the best of times, it was the worst of times, it was 
the age of wisdom, it was the age of foolishness 
…but for lots and lots of fragments!
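The shredded-book picture above can be sketched in a few lines. This is a toy greedy overlap merge, assuming error-free fragments and using hypothetical helper names (`overlap`, `greedy_assemble`); real assemblers use graph-based methods and must cope with sequencing errors and repeats, so treat it as an illustration only.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    # Remove exact duplicates (keeping order) and fragments contained in others.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# The four fragments from the slide (trailing whitespace stripped).
FRAGMENTS = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]

if __name__ == "__main__":
    print(greedy_assemble(FRAGMENTS))
```

Note that the repeated phrase "it was the" already creates a spurious overlap between the last and second fragments; the greedy choice of the longest overlap happens to resolve it here, which is not guaranteed in general.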
Shared low-level 
fragments may 
not reach the 
threshold for 
assembly. 
Lamprey mRNAseq:
Assembly graphs scale with data size, not 
information. 
Conway T C , Bromage A J Bioinformatics 2011;27:479-486 
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, 
please email: journals.permissions@oup.com
Practical memory 
measurements (soil) 
Velvet measurements (Adina Howe)
Data set size and cost 
• $1000 gets you ~200 million “reads”, or about 20-80 GB of 
data, in about a week. 
• > 1000 labs doing this regularly. 
• Each data set analysis is ~custom. 
• Analyses are data intensive and memory intensive.
Efficient data structures & 
algorithms 
Efficient online 
counting of k-mers 
Trimming reads 
on abundance 
Efficient De 
Bruijn graph 
representations 
Read 
abundance 
normalization
Shotgun sequencing is massively redundant; can we 
eliminate redundancy while retaining information? 
Analog: JPEG lossy compression 
[Diagram: raw data (~10-100 GB) => compression (~2 GB) => analysis => "Information" (~1 GB) => database & integration]
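One way to make "eliminate redundancy while retaining information" concrete is abundance-based read filtering. The sketch below assumes a median k-mer abundance rule (keep a read only if the median count of its k-mers, among reads kept so far, is below a cutoff). It uses an exact in-memory counter, and the names `normalize_by_median`, `k`, and `cutoff` are illustrative; it is not the khmer implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k: int = 4, cutoff: int = 2):
    """Single pass over the reads; keeps a read only while its k-mers are still rare."""
    counts = Counter()  # exact counts; a memory-efficient sketch would replace this at scale
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            kept.append(read)
    return kept


if __name__ == "__main__":
    reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
    print(normalize_by_median(reads))  # redundant copies are dropped, the rare read stays
```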
Sparse collections of k-mers can be 
stored efficiently in Bloom filters 
Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
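For readers unfamiliar with Bloom filters, here is a minimal generic sketch of storing k-mers in one. The class name, the use of seeded SHA-256 as the hash family, and the sizes are all assumptions made for illustration; the structure in Pell et al. differs in its details.

```python
import hashlib


class KmerBloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer: str):
        # Derive num_hashes bit positions by hashing the k-mer with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for p in self._positions(kmer):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, kmer: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(kmer))


if __name__ == "__main__":
    bf = KmerBloomFilter(num_bits=10_000, num_hashes=3)
    seq = "ACGTACGTTGCA"
    for i in range(len(seq) - 4 + 1):
        bf.add(seq[i:i + 4])
    print("ACGT" in bf)  # True: inserted k-mers are always reported present
```

Memory is fixed by `num_bits` regardless of how many k-mers are inserted; the cost is that an absent k-mer is occasionally reported as present.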
Data structures & 
algorithms papers 
• “These are not the k-mers you are looking for…”, 
Zhang et al., PLoS One, 2014. 
• “Scaling metagenome sequence assembly with 
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012. 
• “A Reference-Free Algorithm for Computational 
Normalization of Shotgun Sequencing Data”, Brown 
et al., arXiv 1203.4802.
Data analysis papers 
• “Tackling soil diversity with the assembly of large, 
complex metagenomes”, Howe et al., PNAS, 2014. 
• Assembling novel ascidian genomes & 
transcriptomes, Stolfi et al. (eLife 2014), Lowe et al. (in 
prep) 
• A de novo lamprey transcriptome from large scale 
multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not 
intentional, but working out. 
Novel data 
structures and 
algorithms 
Implement at 
scale 
Apply to real 
biological 
problems
This leads to good things. 
Efficient online 
counting of k-mers 
Trimming reads 
on abundance 
Efficient De 
Bruijn graph 
representations 
(khmer software) 
Read 
abundance 
normalization
Efficient online 
counting of k-mers 
Trimming reads 
on abundance 
Efficient De 
Bruijn graph 
representations 
Read 
abundance 
normalization 
Streaming 
algorithms for 
assembly, 
variant calling, 
and error 
correction 
Efficient graph 
labeling & 
exploration 
Cloud assembly 
protocols 
Efficient search 
for target genes 
Data set 
partitioning 
approaches 
Assembly-free 
comparison of 
data sets 
HMM-guided 
assembly 
Current research 
(khmer software)
Testing & version control 
– the not so secret sauce 
• High test coverage - grown over time. 
• Stupidity driven testing – we write tests for bugs after 
we find them and before we fix them. 
• Pull requests & continuous integration – does your 
proposed merge break tests? 
• Pull requests & code review – does new code meet 
our minimal coding (etc.) requirements? 
o Note: spellchecking!!!
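As an illustration of "stupidity driven testing", here is what a regression test written after finding a bug might look like. The function and the bug (lowercase bases not being complemented) are hypothetical, invented for this sketch; they are not taken from khmer.

```python
# Translation table covering both uppercase and lowercase bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def test_reverse_complement_basic():
    assert reverse_complement("AACG") == "CGTT"


def test_reverse_complement_lowercase_regression():
    # Hypothetical bug: lowercase bases used to pass through uncomplemented.
    # The test is written first, seen to fail, and then kept forever.
    assert reverse_complement("acgT") == "Acgt"


if __name__ == "__main__":
    test_reverse_complement_basic()
    test_reverse_complement_lowercase_regression()
    print("ok")
```

Run under continuous integration, a test like this turns each past mistake into a permanent check on every pull request.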
Our “novel research” enables 
this: 
• Novel data structures and algorithms; 
• Permit low(er) memory data analysis; 
• Liberate analyses from specialized hardware.
Running entirely w/in cloud 
~40 hours 
Complete data; AWS m1.xlarge 
(See PyCon 2014 talk; video and blog post.) 
MEMORY
On the “novel research” side: 
• Novel data structures and algorithms; 
• Permit low(er) memory data analysis; 
• Liberate analyses from specialized hardware. 
This last bit? => reproducibility.
Reproducibility! 
Scientific progress relies on reproducibility of 
analysis. (Aristotle, Nature, 322 BCE.) 
“There is no such thing as ‘reproducible science’. 
There is only ‘science’, and ‘not science.’” – 
someone on Twitter (Fernando Perez?)
Disclaimer 
Not a researcher of reproducibility! 
Merely a practitioner. 
Please take my points below as an argument 
and not as research conclusions. 
(But I’m right.)
Replication vs 
reproducibility 
• I will not clearly distinguish. 
• There are important differences. 
o Replication: someone using the same data and the same tools => same results 
o Reproduction: someone using different data and/or different tools => 
same result. 
• The former is much easier. 
• The latter is much stronger. 
• Science is failing even mere replication!? 
• So, mostly I will talk about how we make our 
analyses replicable.
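A simple, strict way to check replication in the sense above (same data, same tools => same results) is to compare checksums of the output files against a recorded manifest. This is a sketch under the assumption that outputs are byte-identical between runs, which fails for tools that embed timestamps; the function names and file names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex digest of a file, read in chunks so large outputs fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def mismatches(expected: dict, directory) -> list:
    """Names of expected files that are missing or whose contents differ."""
    bad = []
    for name, digest in expected.items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != digest:
            bad.append(name)
    return sorted(bad)
```

A first run would record `{name: sha256_of(path)}` for each result file; a later rerun replicates if `mismatches(...)` returns an empty list.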
My usual intro: 
We practice open science! 
Everything discussed here: 
• Code: github.com/ged-lab/ ; BSD license 
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’) 
• Twitter: @ctitusbrown 
• Grants on Lab Web site: 
http://ged.msu.edu/research.html 
• Preprints available. 
Everything is > 80% reproducible.
My lab & the diginorm paper. 
• All our code was already on github; 
• Much of our data analysis was already in the cloud; 
• Our figures were already made in IPython Notebook 
• Our paper was already in LaTeX
IPython Notebook: data + 
code => IPython Notebook
My lab & the diginorm paper. 
• All our code was already on github; 
• Much of our data analysis was already in the cloud; 
• Our figures were already made in IPython Notebook 
• Our paper was already in LaTeX 
…why not push a bit more and make it easily 
reproducible? 
This involved writing a tutorial. And that’s it.
To reproduce our paper: 
git clone <khmer> && python setup.py install 
git clone <pipeline> 
cd pipeline 
wget <data> && tar xzf <data> 
make && cd ../notebook && make 
cd ../ && make
Now standard in lab -- 
Our papers now have: 
• Source hosted on github; 
• Data hosted there or on AWS; 
• Long running data analysis => 
‘make’ 
• Graphing and data digestion 
=> IPython Notebook (also in 
github) 
Qingpeng Zhang
Research process 
Generate new 
results; encode 
in Makefile 
Summarize in 
IPython 
Notebook 
Discuss, explore Push to github
Literate graphing & 
interactive exploration
The process 
• We start with pipeline reproducibility 
• Baked into lab culture; default “use git; write scripts” 
Community of practice! 
• Use standard open source approaches, so OSS 
developers learn it easily. 
• Enables easy collaboration w/in lab 
• Valuable learning tool!
Growing & refining the 
process 
• Now moving to Ubuntu Long-Term Support + install 
instructions. 
• Everything is as automated as is convenient. 
• Students expected to communicate with me in IPython 
Notebooks. 
• Trying to avoid building (or even using) new repro tools. 
• Avoid maintenance burden as much as possible.
1. Use standard OS; provide 
install instructions 
• Providing install and execution instructions for Ubuntu 
Long-Term Support release 14.04: supported through 2017 and 
beyond. 
• Avoid pre-configured virtual machines! They: 
o Lock you into specific cloud homes. 
o Challenge remixability and extensibility.
2. Automate 
• Literate graphing now easy with knitr and IPython 
Notebook. 
• Build automation with make, or whatever. To first 
order, it does not matter what tools you use. 
• Explicit is better than implicit. Make it easy to 
understand what you’re doing and how to extend it.
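To show how little machinery build automation needs, here is a sketch of the core rule behind make: rebuild a target only when it is missing or older than one of its inputs. It is written in Python to stay tool-neutral; `needs_rebuild` and `build` are illustrative names, and this is not a replacement for make.

```python
import os


def needs_rebuild(target: str, sources) -> bool:
    """True if the target is missing or any source is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target: str, sources, recipe) -> bool:
    """Run recipe() only when needed; return whether it ran."""
    if needs_rebuild(target, sources):
        recipe()
        return True
    return False
```

Each step of an analysis then becomes an explicit (target, sources, recipe) triple, which is exactly the information a Makefile rule records.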
k-mer counting paper 
(Ubuntu 14.04, git, make, IPython Notebook, latex)
Time from publication of KAnalyze to our 
100% reproducible re-evaluation? ~8 hours.
3. Protocols, not pipelines. 
STOP HIDING THE ANALYSIS STEPS.
Write down what you’re 
doing… 
https://khmer-protocols.readthedocs.org/
…and add automated 
end-to-end tests. 
c.f. “literate ReSTing”
4. Drive sustainable software 
development with use cases.
…that are explicit…
…versioned…
…and automated.
5. Invest in automated, reproducible 
workflows 
Genome Reference 
         Quality Filtered   Diginorm    Partition   Reinflation 
Velvet   -                  80.90       83.64       84.57 
IDBA     90.96              91.38       90.52       88.80 
SPAdes   90.42              90.35       89.57       90.02 

Mis-assembled Contig Length 
         Quality Filtered   Diginorm    Partition   Reinflation 
Velvet   -                  52071358    44730449    45381867 
IDBA     21777032           20807513    17159671    18684159 
SPAdes   28238787           21506019    14247392    18851571 
Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013 
Also! Tip o’ the hat to Michael Barton, nucleotid.es
Automation enables super 
fun paper reviews! 
• “What a nice new transcriptome assembler! Interesting 
how it doesn’t perform that well on my 10 test data sets.” 
• “Hey, so you make these claims, but I ran your code, 
and…” 
• “Fun fact! Your source code has a syntax error in it – 
even Perl has standards! You’re still sure that’s the script 
you used?” 
• “Here – use our evaluation pipeline, since you clearly 
need something better.” 
The Brown Lab: taking passive aggression to a whole new level!
Myths of reproducible 
research 
(Opinions from personal experience.)
Myth 1: Partial 
reproducibility is hard. 
“Here’s my script.” => Methods 
More generally, 
• Many scientists cannot replicate any part of their 
analysis without a lot of manual work. 
• Automating this is a win for reasons that have 
nothing to do with reproducibility… efficiency! 
See: Software Carpentry.
Myth 2: Incomplete 
reproducibility is useless 
Paraphrase: “We can’t possibly reproduce the 
experimental data exactly, so we shouldn’t bother 
with anything else, either.” 
(Analogous arg re software testing & code coverage.) 
• …I really have a hard time arguing the paraphrase 
honestly… 
• Being able to reanalyze your raw data? Interesting. 
• Knowing how you made your figures? Really useful.
Myth 3: We need new 
platforms 
• Techies always want to build something (which is fun!) 
but don’t want to do science (which is hard!) 
• We probably do need new platforms, but stop thinking 
that building them does a service. 
• Platforms need to be use driven. Seriously. 
• If you write good software for scientific inquiry and make 
it easy to use reproducibly, that will drive virtuous practice.
Myth 4. Virtual Machine 
reproducibility is an end solution. 
• Good start! Better than nothing! 
But: 
• Limits understanding & reuse. 
• Limits remixing: often cannot install other software! 
• “Chinese Room” argument: could be just a lookup 
table. 
…what about Docker?
Myth 5: We can use GUIs 
for reproducible research 
(OK, this is partly just to make people think ;) 
• Almost all data analysis takes place within a larger 
pipeline; the GUI must consume the entire pipeline in 
order to be reproducible. 
• IFF GUI wraps command line, that’s a decent 
compromise (e.g. Galaxy) but handicaps 
researchers using novel approaches. 
• By the time it’s in a GUI, it’s no longer research. But it 
can be useful for research…
Our current efforts? 
• Semantic versioning of our own code: stable 
command-line interface. 
• Writing easy-to-teach tutorials and protocols for 
common analysis pipelines. 
• Automate ‘em for testing purposes. 
• Encourage their use, inclusion, and adaptation by 
others.
Literate testing 
• Our shell-command tutorials for bioinformatics can 
now be executed in an automated fashion – 
commands are extracted automatically into shell 
scripts. 
• See: github.com/ged-lab/literate-resting/. 
• Tremendously improves peace of mind and 
confidence moving forward! 
Leigh Sheneman
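The extraction step can be sketched roughly as follows, assuming the tutorials are reStructuredText and that every shell command sits in an indented literal block introduced by a line ending in `::`. This is an independent illustration with hypothetical function names, not the code in the literate-resting repository.

```python
def extract_commands(rst_text: str):
    """Collect the indented lines of every `::` literal block, in order."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue  # blank lines are allowed inside a literal block
            if line[0] in " \t":
                commands.append(stripped)
                continue
            in_block = False  # first unindented line ends the block
        if stripped.endswith("::"):
            in_block = True
    return commands


def to_shell_script(commands) -> str:
    """Wrap the commands in a script that stops at the first failure."""
    return "\n".join(["#!/bin/bash", "set -e", *commands]) + "\n"


if __name__ == "__main__":
    tutorial = (
        "Make a working directory::\n"
        "\n"
        "   mkdir work\n"
        "\n"
        "Then write a file into it::\n"
        "\n"
        "   echo hello > work/out.txt\n"
        "\n"
        "That is all.\n"
    )
    print(to_shell_script(extract_commands(tutorial)))
```

Because the script exits on the first failing command, running it in a fresh machine doubles as an end-to-end test of the tutorial itself.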
Doing things right 
=> #awesomesauce 
Protocols in English 
for running analyses in 
the cloud 
Literate reSTing => 
shell scripts 
Tool 
competitions 
Benchmarking 
Education 
Acceptance 
tests
What bits should people 
adopt? 
• Version control! 
• Literate graphing - IPython Notebook/knitr! 
• Automated “build” from data => results! 
• Make data available as early in your pipeline as 
possible.
Our approaches -- 
• We are not doing anything particularly neat on the 
computational side... No “magic sauce.” 
• Much of our effort is now driven by sheer utility: 
o Automation reduces our maintenance burden. 
o Extensibility makes revisions much easier! 
o Explicit instructions are great for training. 
• Some effort needed at the beginning, but once 
practices are established, “virtuous cycle” takes 
over.
New science vs 
reproducibility 
• Nobody would care that we were doing things 
reproducibly if our science wasn’t decent. 
• Make sure students realize that faffing about on 
infrastructure isn’t science. 
• Research is about doing science. Reproducibility 
(like other good practices) is much easier to 
proselytize if you can link it to progress in science.
Is there a reproducibility 
crisis? 
• Mina Bissell: maybe, but science is hard and we 
should not overly focus on replicating published 
results vs doing new research. 
Bissell, 2013. 
• “But we can’t even get the software in the first 
place!” 
Collberg et al., 2014. 
Computational science should be the easiest thing to 
replicate… but it’s not!?
“Replication debt” 
• Can we borrow idea of “technical debt” from 
software engineering? 
• Semi-independent replication after initial 
exploratory phase, followed by articulation of 
protocols and independent replication. 
Monday, July 11th, 2039 Image from blog.crisp.se
“Replication debt” 
• Semi-independent replication after initial 
exploratory phase, followed by articulation of 
protocols and independent replication. 
• Public acknowledgement of debt is important. 
Image from blog.crisp.se
Biology & sequence analysis is in a 
perfect place for reproducibility 
We are lucky! A good opportunity! 
• Big Data: laptops are too small; 
• Excel doesn’t scale any more; 
• Few tools in use; most of them are $$ or UNIX; 
• Little in the way of entrenched research practice;
Thanks! 
Talk will soon be on slideshare: 
slideshare.net/c.titus.brown 
E-mail or tweet me: 
ctb@msu.edu 
@ctitusbrown 
Talk at ANU, 3:30pm today

Plasmid: types, structure and functions.
 
Daily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter PhysicsDaily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter Physics
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 
GBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismGBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) Metabolism
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 

2014 nicta-reproducibility

  • 1. Openness and reproducibility in computational science: tools, approaches, and thought patterns. C. Titus Brown ctb@msu.edu October 16, 2014
  • 2. Hello! Assistant Professor @ MSU; Microbiology; Computer Science; etc. => UC Davis VetMed in 2015. More information at: • ged.msu.edu/ • github.com/ged-lab/ • ivory.idyll.org/blog/ • @ctitusbrown
  • 3. The challenges of non-model sequencing • Missing or low quality genome reference. • Evolutionarily distant. • Most extant computational tools focus on model organisms – o Assume low polymorphism (internal variation) o Assume reference genome o Assume somewhat reliable functional annotation o More significant compute infrastructure …and cannot easily or directly be used on critters of interest.
  • 4. Shotgun sequencing & assembly http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 5. Shotgun sequencing analysis goals: • Assembly (what is the text?) o Produces new genomes & transcriptomes. o Gene discovery for enzymes, drug targets, etc. • Counting (how many copies of each book?) o Measure gene expression levels, protein-DNA interactions • Variant calling (how does each edition vary?) o Discover genetic variation: genotyping, linkage studies… o Allele-specific expression analysis.
  • 6. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
  • 7. Shared low-level fragments may not reach the threshold for assembly. Lamprey mRNAseq:
  • 8. Assembly graphs scale with data size, not information. Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
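A small simulation can make the slide 8 claim concrete. This is a sketch, not code from the talk or from Conway & Bromage: the genome length, read length, k, and error rate are illustrative assumptions, and errors are modeled as independent substitutions. It counts distinct k-mers (a proxy for graph size) in error-free reads versus reads with errors.

```python
import random


def simulate_distinct_kmers(genome_len=2000, read_len=50, k=15,
                            n_reads=2000, error_rate=0.01, seed=1):
    """Return (true_kmers, distinct_error_free, distinct_with_errors).

    All parameters are illustrative assumptions, not values from the talk.
    """
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    true_kmers = {genome[i:i + k] for i in range(genome_len - k + 1)}

    clean, noisy = set(), set()
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        read = genome[start:start + read_len]
        # Substitute each base with probability error_rate.
        mutated = "".join(
            rng.choice([b for b in "ACGT" if b != base])
            if rng.random() < error_rate else base
            for base in read
        )
        for i in range(read_len - k + 1):
            clean.add(read[i:i + k])
            noisy.add(mutated[i:i + k])
    return len(true_kmers), len(clean), len(noisy)
```

The error-free count can never exceed the number of k-mers in the genome, so it plateaus once coverage is complete. The count with errors keeps growing as reads are added, because each new error contributes k-mers that are not in the genome.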
  • 9. Practical memory measurements (soil) Velvet measurements (Adina Howe)
  • 10. Data set size and cost • $1000 gets you ~200m “reads”, or about 20-80 GB of data, in ~week. • > 1000 labs doing this regularly. • Each data set analysis is ~custom. • Analyses are data intensive and memory intensive.
  • 11. Efficient data structures & algorithms Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization
  • 12. Raw data (~10-100 GB) Analysis "Information" ~1 GB Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information? Analog: JPEG lossy compression "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB)
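The "eliminate redundancy while retaining information" idea on slide 12 can be sketched as read abundance normalization: keep a read only if the median abundance of its k-mers, among reads kept so far, is still below a cutoff. The version below is a minimal illustration under stated assumptions. It uses an exact `Counter` rather than a memory-efficient counting structure, and the function names, `k`, and `cutoff` values are hypothetical, not khmer's interface.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=20, cutoff=20):
    """Single-pass, streaming filter that discards redundant reads.

    A read is kept (and its k-mers counted) only while the median count
    of its k-mers is below `cutoff`. Exact counting is used here for
    clarity; it is an assumption of this sketch, not the lab's method.
    """
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept
```

Because the decision uses only counts accumulated so far, the data can be processed in one pass without loading everything into memory, and high-coverage regions stop contributing reads once they reach the cutoff.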
  • 13. Sparse collections of k-mers can be stored efficiently in Bloom filters Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
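To illustrate slide 13, here is a textbook Bloom filter holding k-mers. It is a sketch only: the class name, bit-array size, number of hashes, and the use of salted `hashlib.blake2b` are assumptions made for this example and are not taken from Pell et al. or from khmer.

```python
import hashlib


class KmerBloomFilter:
    """Set-membership structure with false positives but no false negatives."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, kmer):
        # One bit position per hash function; the salt makes each hash distinct.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, kmer):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(kmer))
```

Memory use is fixed by `n_bits` regardless of k or of how many k-mers are inserted; the cost is a false-positive rate that rises as the bit array fills.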
  • 14. Data structures & algorithms papers • “These are not the k-mers you are looking for…”, Zhang et al., PLoS One, 2014. • “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012. • “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802.
  • 15. Data analysis papers • “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014. • Assembling novel ascidian genomes & transcriptomes, Stolfi et al. (eLife 2014), Lowe et al. (in prep) • A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
  • 16. Lab approach – not intentional, but working out. Novel data structures and algorithms Implement at scale Apply to real biological problems
  • 17. This leads to good things. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations (khmer software) Read abundance normalization
  • 18. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization Streaming algorithms for assembly, variant calling, and error correction Efficient graph labeling & exploration Cloud assembly protocols Efficient search for target genes Data set partitioning approaches Assembly-free comparison of data sets HMM-guided assembly Current research (khmer software)
  • 19. Testing & version control – the not so secret sauce • High test coverage - grown over time. • Stupidity driven testing – we write tests for bugs after we find them and before we fix them. • Pull requests & continuous integration – does your proposed merge break tests? • Pull requests & code review – does new code meet our minimal coding etc requirements? o Note: spellchecking!!!
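"Stupidity driven testing" might look like the sketch below. The function and the bug are hypothetical, invented for illustration: suppose a `reverse_complement` helper raised `KeyError` on reads containing `N` or lowercase bases. The tests are written first, so they fail against the buggy code, and then the fix is made.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (fixed version)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def test_reverse_complement_handles_ambiguous_base():
    # Regression test for the hypothetical bug: 'N' used to raise KeyError.
    assert reverse_complement("ACGTN") == "NACGT"


def test_reverse_complement_handles_lowercase():
    # Regression test for the hypothetical bug: lowercase input used to fail.
    assert reverse_complement("acgt") == "ACGT"
```

Functions named `test_*` are collected automatically by common Python test runners such as pytest, so running the suite under continuous integration on every pull request catches a reintroduction of the same bug.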
  • 20. Our “novel research” enables this: • Novel data structures and algorithms; • Permit low(er) memory data analysis; • Liberate analyses from specialized hardware.
  • 21. Running entirely w/in cloud ~40 hours Complete data; AWS m1.xlarge (See PyCon 2014 talk; video and blog post.) MEMORY
  • 22. On the “novel research” side: • Novel data structures and algorithms; • Permit low(er) memory data analysis; • Liberate analyses from specialized hardware. This last bit? => reproducibility.
  • 23. Reproducibility! Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.) “There is no such thing as ‘reproducible science’. There is only ‘science’, and ‘not science.’” – someone on Twitter (Fernando Perez?)
  • 24. Disclaimer Not a researcher of reproducibility! Merely a practitioner. Please take my points below as an argument and not as research conclusions. (But I’m right.)
  • 25. Replication vs reproducibility • I will not clearly distinguish. • There are important differences. o Replication: someone using same data, same tools, => same results o Reproduction: someone using different data and/or different tools => same result. • The former is much easier. • The latter is much stronger. • Science is failing even mere replication!? • So, mostly I will talk about how we make our analyses replicable.
  • 26. My usual intro: We practice open science! Everything discussed here: • Code: github.com/ged-lab/ ; BSD license • Blog: http://ivory.idyll.org/blog (‘titus brown blog’) • Twitter: @ctitusbrown • Grants on Lab Web site: http://ged.msu.edu/research.html • Preprints available. Everything is > 80% reproducible.
  • 28. My lab & the diginorm paper. • All our code was already on github; • Much of our data analysis was already in the cloud; • Our figures were already made in IPython Notebook • Our paper was already in LaTeX
  • 29. IPython Notebook: data + code => IPython Notebook
  • 30. My lab & the diginorm paper. • All our code was already on github; • Much of our data analysis was already in the cloud; • Our figures were already made in IPython Notebook • Our paper was already in LaTeX …why not push a bit more and make it easily reproducible? This involved writing a tutorial. And that’s it.
  • 31. To reproduce our paper:
      git clone <khmer> && python setup.py install
      git clone <pipeline>
      cd pipeline
      wget <data> && tar xzf <data>
      make && cd ../notebook && make
      cd ../ && make
  • 32. Now standard in lab -- Our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Qingpeng Zhang
  • 33. Research process Generate new results; encode in Makefile Summarize in IPython Notebook Discuss, explore Push to github
  • 34. Literate graphing & interactive exploration
  • 35. The process • We start with pipeline reproducibility • Baked into lab culture; default “use git; write scripts” Community of practice! • Use standard open source approaches, so OSS developers learn it easily. • Enables easy collaboration w/in lab • Valuable learning tool!
  • 36. Growing & refining the process • Now moving to Ubuntu Long-Term Support + install instructions. • Everything is as automated as is convenient. • Students expected to communicate with me in IPython Notebooks. • Trying to avoid building (or even using) new repro tools. • Avoid maintenance burden as much as possible.
  • 37. 1. Use standard OS; provide install instructions • Providing install and execution instructions for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond. • Avoid pre-configured virtual machines! They: o Lock you into specific cloud homes. o Challenge remixability and extensibility.
  • 38. 2. Automate • Literate graphing now easy with knitr and IPython Notebook. • Build automation with make, or whatever. To first order, it does not matter what tools you use. • Explicit is better than implicit. Make it easy to understand what you’re doing and how to extend it.
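Since the slide argues that the choice of tool matters less than being explicit, here is the core rule that `make` applies, written out in Python as a sketch. The function names are hypothetical, and this stands in for a real build tool only to show the idea: rerun a step when its output is missing or older than any input.

```python
import os


def is_stale(target, sources):
    """True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, sources, action):
    """Run action(sources, target) only when target is out of date.

    Returns True if the action ran, False if the target was up to date.
    """
    if is_stale(target, sources):
        action(sources, target)
        return True
    return False
```

Each long-running analysis step then becomes one explicit `build(...)` call naming its inputs and output, which documents the pipeline and avoids recomputing results that are already current.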
  • 39. k-mer counting paper (Ubuntu 14.04, git, make, IPython Notebook, latex)
  • 40. Time from publication of KAnalyze to our 100% reproducible re-evaluation? ~8 hours.
  • 41. 3. Protocols, not pipelines. STOP HIDING THE ANALYSIS STEPS.
  • 42. Write down what you’re doing… https://khmer-protocols.readthedocs.org/
  • 43. …and add automated end-to-end tests. c.f. “literate ReSTing”
  • 44. 4. Drive sustainable software development with use cases.
  • 48. 5. Invest in automated, reproducible workflows
      Genome Reference               Quality Filtered   Diginorm   Partition   Reinflation
        Velvet                       -                  80.90      83.64       84.57
        IDBA                         90.96              91.38      90.52       88.80
        SPAdes                       90.42              90.35      89.57       90.02
      Mis-assembled Contig Length    Quality Filtered   Diginorm   Partition   Reinflation
        Velvet                       -                  52071358   44730449    45381867
        IDBA                         21777032           20807513   17159671    18684159
        SPAdes                       28238787           21506019   14247392    18851571
      Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013. Also! Tip o’ the hat to Michael Barton, nucleotid.es
  • 49. Automation enables super fun paper reviews! • “What a nice new transcriptome assembler! Interesting how it doesn’t perform that well on my 10 test data sets.” • “Hey, so you make these claims, but I ran your code, and…” • “Fun fact! Your source code has a syntax error in it – even Perl has standards! You’re still sure that’s the script you used?” • “Here – use our evaluation pipeline, since you clearly need something better.” The Brown Lab: taking passive aggression to a whole new level!
  • 50. Myths of reproducible research (Opinions from personal experience.)
  • 51. Myth 1: Partial reproducibility is hard. “Here’s my script.” => Methods More generally, • Many scientists cannot replicate any part of their analysis without a lot of manual work. • Automating this is a win for reasons that have nothing to do with reproducibility… efficiency! See: Software Carpentry.
  • 52. Myth 2: Incomplete reproducibility is useless Paraphrase: “We can’t possibly reproduce the experimental data exactly, so we shouldn’t bother with anything else, either.” (Analogous arg re software testing & code coverage.) • …I really have a hard time arguing the paraphrase honestly… • Being able to reanalyze your raw data? Interesting. • Knowing how you made your figures? Really useful.
  • 53. Myth 3: We need new platforms • Techies always want to build something (which is fun!) but don’t want to do science (which is hard!) • We probably do need new platforms, but stop thinking that building them does a service. • Platforms need to be use driven. Seriously. • If you write good software for scientific inquiry and make it easy to use reproducibly, that will drive virtuousity.
  • 54. Myth 4. Virtual Machine reproducibility is an end solution. • Good start! Better than nothing! But: • Limits understanding & reuse. • Limits remixing: often cannot install other software! • “Chinese Room” argument: could be just a lookup table. …what about Docker?
  • 55. Myth 5: We can use GUIs for reproducible research (OK, this is partly just to make people think ;) • Almost all data analysis takes place within a larger pipeline; the GUI must consume entire pipeline in order to be reproducible. • IFF GUI wraps command line, that’s a decent compromise (e.g. Galaxy) but handicaps researchers using novel approaches. • By the time it’s in a GUI, it’s no longer research. But it can be useful for research…
  • 56. Our current efforts? • Semantic versioning of our own code: stable command-line interface. • Writing easy-to-teach tutorials and protocols for common analysis pipelines. • Automate ‘em for testing purposes. • Encourage their use, inclusion, and adaptation by others.
  • 57. Literate testing • Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts. • See: github.com/ged-lab/literate-resting/. • Tremendously improves peace of mind and confidence moving forward! Leigh Sheneman
  • 58. Doing things right => #awesomesauce Protocols in English for running analyses in the cloud Literate reSTing => shell scripts Tool competitions Benchmarking Education Acceptance tests
  • 59. What bits should people adopt? • Version control! • Literate graphing - IPython Notebook/knitr! • Automated “build” from data => results! • Make data available as early in your pipeline as possible.
  • 60. Our approaches -- • We are not doing anything particularly neat on the computational side... No “magic sauce.” • Much of our effort is now driven by sheer utility: o Automation reduces our maintenance burden. o Extensibility makes revisions much easier! o Explicit instructions are great for training. • Some effort needed at the beginning, but once practices are established, “virtuous cycle” takes over.
  • 61. New science vs reproducibility • Nobody would care that we were doing things reproducibly if our science wasn’t decent. • Make sure students realize that faffing about on infrastructure isn’t science. • Research is about doing science. Reproducibility (like other good practices) is much easier to proselytize if you can link it to progress in science.
  • 62. Is there a reproducibility crisis? • Mina Bissell: maybe, but science is hard and we should not overly focus on replicating published results vs doing new research. Bissell, 2013. • “But we can’t even get the software in the first place!” Collberg et al., 2014. Computational science should be the easiest thing to replicate… but it’s not!?
  • 63. “Replication debt” • Can we borrow idea of “technical debt” from software engineering? • Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication. Monday, July 11th, 2039 Image from blog.crisp.se
  • 64. “Replication debt” • Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication. • Public acknowledgement of debt is important. Monday, July 11th, 2039 Image from blog.crisp.se
  • 65. Biology & sequence analysis is in a perfect place for reproducibility We are lucky! A good opportunity! • Big Data: laptops are too small; • Excel doesn’t scale any more; • Few tools in use; most of them are $$ or UNIX; • Little in the way of entrenched research practice;
  • 66. Thanks! Talk will soon be on slideshare: slideshare.net/c.titus.brown E-mail or tweet me: ctb@msu.edu @ctitusbrown Talk at ANU, 3:30pm today

Editor's Notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, as the number of sequence reads increases, the number of edges in the graph due to the underlying genome will plateau when every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. Slow, but powerful.
  3. Acceptance testing other people’s software