This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
2014 nicta-reproducibility
1. Openness and reproducibility in computational science: tools, approaches, and thought patterns.
C. Titus Brown
ctb@msu.edu
October 16, 2014
2. Hello!
Assistant Professor @ MSU; Microbiology; Computer Science; etc.
=> UC Davis VetMed in 2015.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
3. The challenges of non-model sequencing
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms –
o Assume low polymorphism (internal variation)
o Assume reference genome
o Assume somewhat reliable functional annotation
o Assume more significant compute infrastructure
…and so they cannot easily or directly be used on the critters of interest.
5. Shotgun sequencing analysis goals:
• Assembly (what is the text?)
o Produces new genomes & transcriptomes.
o Gene discovery for enzymes, drug targets, etc.
• Counting (how many copies of each book?)
o Measure gene expression levels, protein-DNA interactions.
• Variant calling (how does each edition vary?)
o Discover genetic variation: genotyping, linkage studies…
o Allele-specific expression analysis.
6. Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
=> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
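To make the overlap intuition concrete, here is a toy Python sketch (invented for illustration, not a real assembler) that greedily merges the fragments above by their longest suffix/prefix overlap. Real tools work at much larger scale with De Bruijn graph methods, discussed later in this talk.

# Toy greedy overlap assembly: repeatedly merge the two fragments
# with the longest suffix/prefix overlap. Illustration only.
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    frags = list(fragments)
    while len(frags) > 1:
        # find the pair of distinct fragments with the best overlap
        n, a, b = max(((overlap(a, b), a, b)
                       for a in frags for b in frags if a is not b),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])   # merge the pair
    return frags[0]

pieces = ["It was the best of times, it was the wor",
          ", it was the worst of times, it was the",
          "mes, it was the age of wisdom, it was th",
          "isdom, it was the age of foolishness"]
print(greedy_assemble(pieces))   # recovers the full sentence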
10. Data set size and cost
• $1000 gets you ~200m "reads", or about 20-80 GB of data, in about a week.
• > 1000 labs doing this regularly.
• Each data set analysis is ~custom.
• Analyses are data intensive and memory intensive.
11. Efficient data structures & algorithms
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
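For illustration, here is a deliberately naive Python sketch of the first item, online k-mer counting. This is not how khmer works internally – khmer uses fixed-memory probabilistic structures – but it shows the quantity being computed.

# Naive exact k-mer counting. Memory grows with the number of
# distinct k-mers; khmer's whole point is to avoid that.
from collections import Counter

def count_kmers(reads, k=20):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ATGGCGTAGCTAGCTAGCTAGGCTA", "GCGTAGCTAGCTAGCTAGGCTAATG"]
print(count_kmers(reads, k=8).most_common(3))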
12. Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information?
Analog: JPEG lossy compression.
[Diagram: raw data (~10-100 GB) => compression (~2 GB) => analysis => "information" (~1 GB) => database & integration]
13. Sparse collections of k-mers can be stored efficiently in Bloom filters
Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
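As a hedged illustration of the idea – not the khmer/Pell et al. implementation, which uses multiple hash tables with carefully chosen sizes – a minimal Bloom filter for k-mer membership might look like this in Python:

# Minimal Bloom filter sketch for k-mer membership queries.
# False positives are possible; false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, size=10**6, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, kmer):
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))

K = 20
bf = BloomFilter()
read = "ATGGCGTAGCTAGCTAGCTAGGCTA"
for i in range(len(read) - K + 1):
    bf.add(read[i:i + K])
print(read[0:K] in bf)   # True: definitely stored
print("A" * K in bf)     # False with high probability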
14. Data structures & algorithms papers
• "These are not the k-mers you are looking for…", Zhang et al., PLoS One, 2014.
• "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs", Pell et al., PNAS, 2012.
• "A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data", Brown et al., arXiv 1203.4802.
15. Data analysis papers
• "Tackling soil diversity with the assembly of large, complex metagenomes", Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes & transcriptomes, Stolfi et al. (eLife 2014), Lowe et al. (in prep).
• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
16. Lab approach – not intentional, but working out:
novel data structures and algorithms => implement at scale => apply to real biological problems.
17. This leads to good things (the khmer software):
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
18. Current research – building on the khmer software (efficient online counting of k-mers, trimming reads on abundance, efficient De Bruijn graph representations, read abundance normalization):
• Streaming algorithms for assembly, variant calling, and error correction
• Efficient graph labeling & exploration
• Cloud assembly protocols
• Efficient search for target genes
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
19. Testing & version control – the not so secret sauce
• High test coverage – grown over time.
• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
• Pull requests & continuous integration – does your proposed merge break the tests?
• Pull requests & code review – does new code meet our minimal coding (etc.) requirements?
o Note: spellchecking!!!
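A sketch of what stupidity driven testing looks like in practice: when a bug is found, a failing regression test is committed first, then the fix. The function and the bug below are invented for illustration, not taken from khmer.

# Hypothetical example of a regression test pinned to a bug.
def enumerate_kmers(seq, k):
    # The (hypothetical) buggy version used range(len(seq) - k),
    # silently dropping the final k-mer of every read.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def test_enumerate_kmers_includes_final_kmer():
    # Written when the off-by-one bug was reported, *before* the fix.
    assert enumerate_kmers("ATGGC", 3) == ["ATG", "TGG", "GGC"]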
20. Our "novel research" enables this:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.
21. Running entirely within the cloud: complete data analysis in ~40 hours on an AWS m1.xlarge. (See PyCon 2014 talk; video and blog post.)
[Figure: memory usage over the course of the run.]
22. On the "novel research" side:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.
This last bit? => reproducibility.
23. Reproducibility!
Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)
"There is no such thing as 'reproducible science'. There is only 'science', and 'not science.'" – someone on Twitter (Fernando Perez?)
24. Disclaimer
Not a researcher of reproducibility! Merely a practitioner.
Please take my points below as an argument and not as research conclusions.
(But I'm right.)
25. Replication vs reproducibility
• I will not clearly distinguish, but there are important differences:
o Replication: someone using the same data and same tools => same results.
o Reproduction: someone using different data and/or different tools => same result.
• The former is much easier; the latter is much stronger.
• Science is failing even mere replication!?
• So, mostly I will talk about how we make our analyses replicable.
26. My usual intro: We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog ('titus brown blog')
• Twitter: @ctitusbrown
• Grants on lab web site: http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible.
28. My lab & the diginorm paper.
• All our code was already on github;
• Much of our data analysis was already in the cloud;
• Our figures were already made in IPython Notebook;
• Our paper was already in LaTeX.
…why not push a bit more and make it easily reproducible?
This involved writing a tutorial. And that's it.
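The core idea of the diginorm paper can be sketched in a few lines of Python: keep a read only if the median abundance of its k-mers, among the reads kept so far, is below a coverage cutoff. khmer does the counting in fixed memory with probabilistic data structures; the plain dict here is only for clarity.

# Sketch of digital normalization (Brown et al., arXiv 1203.4802).
from statistics import median

def diginorm(reads, k=20, cutoff=20):
    counts = {}                      # k-mer -> abundance seen so far
    kept = []
    for read in reads:
        if len(read) < k:
            continue
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # estimate this read's coverage via its median k-mer abundance
        if median(counts.get(km, 0) for km in kmers) < cutoff:
            kept.append(read)        # still novel: keep the read...
            for km in kmers:         # ...and count its k-mers
                counts[km] = counts.get(km, 0) + 1
    return kept

Redundant reads never enter the counting structure at all, which is why memory stays roughly proportional to the information content of the data rather than its volume.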
31. To reproduce our paper:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
32. Now standard in lab – our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => 'make';
• Graphing and data digestion => IPython Notebook (also in github).
Qingpeng Zhang
33. Research process: generate new results and encode them in a Makefile => summarize in IPython Notebook => discuss, explore => push to github.
35. The process
• We start with pipeline reproducibility.
• Baked into lab culture; the default is "use git; write scripts". Community of practice!
• Use standard open source approaches, so OSS developers learn it easily.
• Enables easy collaboration w/in lab.
• Valuable learning tool!
36. Growing & refining the process
• Now moving to Ubuntu Long-Term Support + install instructions.
• Everything is as automated as is convenient.
• Students expected to communicate with me in IPython Notebooks.
• Trying to avoid building (or even using) new repro tools.
• Avoid maintenance burden as much as possible.
37. 1. Use standard OS; provide install instructions
• Providing install and execute instructions for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond.
• Avoid pre-configured virtual machines! They:
o Lock you into specific cloud homes.
o Challenge remixability and extensibility.
38. 2. Automate
• Literate graphing is now easy with knitr and IPython Notebook.
• Build automation with make, or whatever. To first order, it does not matter what tools you use.
• Explicit is better than implicit. Make it easy to understand what you're doing and how to extend it.
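A minimal sketch of what one automated "literate graphing" step might look like: a single Python script from data file to finished figure, so that the build tool can regenerate it on demand. The file names here are hypothetical.

# One explicit step of a reproducible pipeline: data in, figure out.
import csv
import matplotlib
matplotlib.use("Agg")              # render headlessly, e.g. on a cloud VM
import matplotlib.pyplot as plt

abundances, counts = [], []
with open("kmer-abundance.csv") as f:      # hypothetical input data
    for abundance, count in csv.reader(f):
        abundances.append(int(abundance))
        counts.append(int(count))

plt.plot(abundances, counts)
plt.xlabel("k-mer abundance")
plt.ylabel("number of k-mers")
plt.savefig("kmer-abundance.pdf")          # rebuilt by 'make' on demand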
48. 5. Invest in automated, reproducible workflows

Genome reference:
         Quality Filtered   Diginorm   Partition   Reinflation
Velvet   -                  80.90      83.64       84.57
IDBA     90.96              91.38      90.52       88.80
SPAdes   90.42              90.35      89.57       90.02

Mis-assembled contig length:
         Quality Filtered   Diginorm   Partition   Reinflation
Velvet   -                  52071358   44730449    45381867
IDBA     21777032           20807513   17159671    18684159
SPAdes   28238787           21506019   14247392    18851571

Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013.
Also! Tip o' the hat to Michael Barton, nucleotid.es
49. Automation enables super fun paper reviews!
• "What a nice new transcriptome assembler! Interesting how it doesn't perform that well on my 10 test data sets."
• "Hey, so you make these claims, but I ran your code, and…"
• "Fun fact! Your source code has a syntax error in it – even Perl has standards! You're still sure that's the script you used?"
• "Here – use our evaluation pipeline, since you clearly need something better."
The Brown Lab: taking passive aggression to a whole new level!
51. Myth 1: Partial reproducibility is hard.
"Here's my script." => Methods
More generally,
• Many scientists cannot replicate any part of their analysis without a lot of manual work.
• Automating this is a win for reasons that have nothing to do with reproducibility… efficiency!
See: Software Carpentry.
52. Myth 2: Incomplete reproducibility is useless
Paraphrase: "We can't possibly reproduce the experimental data exactly, so we shouldn't bother with anything else, either."
(An analogous argument comes up around software testing & code coverage.)
• …I really have a hard time arguing the paraphrase honestly…
• Being able to reanalyze your raw data? Interesting.
• Knowing how you made your figures? Really useful.
53. Myth 3: We need new platforms
• Techies always want to build something (which is fun!) but don't want to do science (which is hard!)
• We probably do need new platforms, but stop thinking that building them does a service.
• Platforms need to be use driven. Seriously.
• If you write good software for scientific inquiry and make it easy to use reproducibly, that will drive virtuosity.
54. Myth 4: Virtual machine reproducibility is an end solution.
• Good start! Better than nothing!
But:
• Limits understanding & reuse.
• Limits remixing: often you cannot install other software!
• "Chinese Room" argument: the VM could be just a lookup table.
…what about Docker?
55. Myth 5: We can use GUIs for reproducible research
(OK, this is partly just to make people think ;)
• Almost all data analysis takes place within a larger pipeline; the GUI must consume the entire pipeline in order to be reproducible.
• If (and only if) the GUI wraps a command line, that's a decent compromise (e.g. Galaxy), but it handicaps researchers using novel approaches.
• By the time it's in a GUI, it's no longer research. But it can be useful for research…
56. Our current efforts?
• Semantic versioning of our own code: stable command-line interface.
• Writing easy-to-teach tutorials and protocols for common analysis pipelines.
• Automate 'em for testing purposes.
• Encourage their use, inclusion, and adaptation by others.
57. Literate testing
• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and confidence moving forward!
Leigh Sheneman
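The extraction step might look something like this minimal Python sketch. This is not the actual literate-resting code; it assumes, for illustration, a tutorial whose command lines are marked with a leading "$ ".

# Pull shell commands out of a tutorial and emit a runnable script,
# so the tutorial itself can serve as an acceptance test.
import sys

def extract_commands(tutorial_path, script_path):
    with open(tutorial_path) as src, open(script_path, "w") as out:
        out.write("#!/bin/bash\nset -e  # stop at the first failure\n")
        for line in src:
            stripped = line.strip()
            if stripped.startswith("$ "):
                out.write(stripped[2:] + "\n")

if __name__ == "__main__":
    extract_commands(sys.argv[1], sys.argv[2])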
58. Doing things right => #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests
59. What bits should people adopt?
• Version control!
• Literate graphing – IPython Notebook/knitr!
• Automated "build" from data => results!
• Make data available as early in your pipeline as possible.
60. Our approaches –
• We are not doing anything particularly neat on the computational side... No "magic sauce."
• Much of our effort is now driven by sheer utility:
o Automation reduces our maintenance burden.
o Extensibility makes revisions much easier!
o Explicit instructions are great for training.
• Some effort is needed at the beginning, but once practices are established, a "virtuous cycle" takes over.
61. New science vs reproducibility
• Nobody would care that we were doing things reproducibly if our science wasn't decent.
• Make sure students realize that faffing about on infrastructure isn't science.
• Research is about doing science. Reproducibility (like other good practices) is much easier to proselytize if you can link it to progress in science.
62. Is there a reproducibility crisis?
• Mina Bissell: maybe, but science is hard and we should not overly focus on replicating published results vs doing new research. (Bissell, 2013.)
• "But we can't even get the software in the first place!" (Collberg et al., 2014.)
Computational science should be the easiest thing to replicate… but it's not!?
63. "Replication debt"
• Can we borrow the idea of "technical debt" from software engineering?
• Semi-independent replication after an initial exploratory phase, followed by articulation of protocols and independent replication.
• Public acknowledgement of the debt is important.
Image from blog.crisp.se
65. Biology & sequence analysis is in a perfect place for reproducibility
We are lucky! A good opportunity!
• Big Data: laptops are too small;
• Excel doesn't scale any more;
• Few tools in use; most of them are $$ or UNIX;
• Little in the way of entrenched research practice.
66. Thanks!
Talk will soon be on slideshare: slideshare.net/c.titus.brown
E-mail or tweet me: ctb@msu.edu / @ctitusbrown
Talk at ANU, 3:30pm today
Editor's Notes
A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges contributed by the underlying genome plateaus once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of spurious edges scales linearly with the number of sequence reads. As a result, by the time coverage is high enough to clearly distinguish true edges (which come from the underlying genome) from spurious edges (which arise from errors), the spurious edges will usually outnumber the true edges by a substantial factor.