2014 manchester-reproducibility

Slide notes:
• A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges that come from the genome itself plateaus as reads accumulate and every part of the genome becomes covered. Conversely, since errors tend to be random and more or less unique, the number of error edges scales linearly with the number of reads. Once coverage is deep enough to clearly distinguish true edges (which come from the underlying genome), true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
• Slow, but powerful.
• Acceptance testing of other people’s software.
Transcript of "2014 manchester-reproducibility"

1. Six ways to Sunday: approaches to computational reproducibility in non-model system sequence analysis.
C. Titus Brown, ctb@msu.edu, May 21, 2014

2. Hello!
Assistant Professor; Microbiology; Computer Science; etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown

3. The challenges of non-model sequencing
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms:
o Assume low polymorphism (internal variation)
o Assume reference genome
o Assume somewhat reliable functional annotation
o More significant compute infrastructure
…and cannot easily or directly be used on critters of interest.

4. Shotgun sequencing & assembly
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/

5. Shotgun sequencing analysis goals:
• Assembly (what is the text?)
o Produces new genomes & transcriptomes.
o Gene discovery for enzymes, drug targets, etc.
• Counting (how many copies of each book?)
o Measure gene expression levels, protein-DNA interactions.
• Variant calling (how does each edition vary?)
o Discover genetic variation: genotyping, linkage studies…
o Allele-specific expression analysis.

6. Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!

7. Shared low-level fragments may not reach the threshold for assembly.
Lamprey mRNAseq:

8. Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
ACTGGACCGA
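The decomposition on this slide is simple to express in code. Below is a minimal illustrative sketch (not the khmer implementation, which does this in C++ at scale):

```python
def kmers(read, k):
    """Return all overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

read = "CCGATTGCACTGGACCGA"
for kmer in kmers(read, 10):
    print(kmer)
```

A read of length L yields L - k + 1 k-mers; here, 18 - 10 + 1 = 9.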
9. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG

10. K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
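The “implicit alignment” can be seen by intersecting the k-mer sets of the two reads on this slide: shared k-mers anchor the reads to each other without any explicit alignment step. A sketch (the helper is illustrative, not part of khmer):

```python
def kmer_set(seq, k=10):
    """The set of k-mers in a sequence (order discarded)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"

# k-mers present in both reads anchor them to each other
shared = kmer_set(a) & kmer_set(b)
print(len(shared), "shared 10-mers")
```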
11. De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)

12. The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
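The claim that one error yields up to k novel k-mers can be checked directly: a substitution in the middle of a read changes the k consecutive k-mers that overlap it. A sketch with a hypothetical single-base substitution (the sequence and mutation position are chosen for illustration):

```python
def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

k = 10
true_read = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
# one substitution in the middle of the read (A -> T at index 17)
error_read = true_read[:17] + "T" + true_read[18:]

# k-mers present only because of the error
novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel), "novel k-mers from a single error (at most k)")
```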
13. Assembly graphs scale with data size, not information.
Conway T. C., Bromage A. J., Bioinformatics 2011;27:479-486.
© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com

14. Practical memory measurements (soil)
Velvet measurements (Adina Howe)

15. Data set size and cost
• $1000 gets you ~100m “reads”, or about 10-40 GB of data, in ~1 week.
• >1000 labs doing this regularly.
• Each data set analysis is ~custom.
• Analyses are data intensive and memory intensive.

16. Efficient data structures & algorithms
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization

17. Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information?
Analog: JPEG lossy compression.
Raw data (~10-100 GB) → Compression (~2 GB) → “Information” (~1 GB) → Analysis, database & integration

18. Sparse collections of k-mers can be stored efficiently in Bloom filters
Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
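A Bloom filter answers set-membership queries from a fixed-size bit array: inserted items are always found (no false negatives), and the false-positive rate is tuned via the array size and number of hash functions. A minimal sketch (khmer's actual implementation is in C++ and differs in detail):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several salted hashes.
    No false negatives; a small, tunable chance of false positives."""

    def __init__(self, size=10000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, item):
        # derive num_hashes independent positions by salting the hash
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for kmer in ("CCGATTGCAC", "CGATTGCACT", "GATTGCACTG"):
    bf.add(kmer)
print("CCGATTGCAC" in bf)  # True: inserted items are always found
```

The memory cost depends only on the bit-array size, not on k-mer length, which is what makes sparse k-mer collections cheap to store.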
19. Data structures & algorithms papers
• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS, 2012.
• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.

20. Data analysis papers
• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large-scale multi-tissue mRNAseq, Scott et al., in prep.

21. Lab approach – not intentional, but working out.
Novel data structures and algorithms → Implement at scale → Apply to real biological problems

22. This leads to good things. (khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization

23. Current research (khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes
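“Read abundance normalization” here is digital normalization (diginorm, the Brown et al. arXiv 1203.4802 paper above): discard a read when the median count of its k-mers, among reads already kept, has reached a coverage cutoff. A sketch using exact counts (khmer keeps memory low with probabilistic counting instead; the cutoff of 20 is for illustration):

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def diginorm(reads, k=10, cutoff=20):
    """Keep a read only if the median count of its k-mers,
    over the reads kept so far, is below the coverage cutoff."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept

# 50 identical reads: once `cutoff` copies are kept, the rest are redundant
reads = ["CCGATTGCACTGGACCGATGCACGGTACCG"] * 50
print(len(diginorm(reads)))  # 20
```

This is why redundancy can be discarded while retaining information: reads covering already-saturated regions add little beyond their errors.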
24. Testing & version control – the not so secret sauce
• High test coverage – grown over time.
• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
• Pull requests & continuous integration – does your proposed merge break tests?
• Pull requests & code review – does new code meet our minimal coding etc. requirements?
o Note: spellchecking!!!

25. On the “novel research” side:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.

26. Running entirely w/in cloud
Complete data; AWS m1.xlarge; ~40 hours.
(See PyCon 2014 talk; video and blog post.)

27. On the “novel research” side:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.
This last bit? => reproducibility.

28. Reproducibility!
Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)
“There is no such thing as ‘reproducible science’. There is only ‘science’, and ‘not science.’” – someone on Twitter (Fernando Perez?)

29. Disclaimer
Not a researcher of reproducibility! Merely a practitioner.
Please take my points below as an argument and not as research conclusions.
(But I’m right.)
30. My usual intro: We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on lab Web site: http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible.

31. My usual intro: We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on lab Web site: http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible.

32. My lab & the diginorm paper.
• All our code was already on github;
• Much of our data analysis was already in the cloud;
• Our figures were already made in IPython Notebook;
• Our paper was already in LaTeX.

33. IPython Notebook: data + code => IPython Notebook

34. My lab & the diginorm paper.
• All our code was already on github;
• Much of our data analysis was already in the cloud;
• Our figures were already made in IPython Notebook;
• Our paper was already in LaTeX.
…why not push a bit more and make it easily reproducible?
This involved writing a tutorial. And that’s it.

35. To reproduce our paper:
    git clone <khmer> && python setup.py install
    git clone <pipeline>
    cd pipeline
    wget <data> && tar xzf <data>
    make && cd ../notebook && make
    cd ../ && make
36. Now standard in lab --
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’;
• Graphing and data digestion => IPython Notebook (also in github).
(Qingpeng Zhang)

37. Research process
Generate new results; encode in Makefile → Summarize in IPython Notebook → Push to github → Discuss, explore → (repeat)

38. Literate graphing & interactive exploration

39. The process
• We start with pipeline reproducibility.
• Baked into lab culture; default “use git; write scripts”.
Community of practice!
• Use standard open source approaches, so OSS developers learn it easily.
• Enables easy collaboration w/in lab.
• Valuable learning tool!

40. Growing & refining the process
• Now moving to Ubuntu Long-Term Support + install instructions.
• Everything is as automated as is convenient.
• Students expected to communicate with me in IPython Notebooks.
• Trying to avoid building (or even using) new tools.
• Avoid maintenance burden as much as possible.

41. 1. Use a standard OS; provide install instructions
• Providing install and execute instructions for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond.
• Avoid pre-configured virtual machines!
o Locks you into specific cloud homes.
o Challenges remixability and extensibility.

42. 2. Automate
• Literate graphing is now easy with knitr and IPython Notebook.
• Build automation with make, or whatever. To first order, it does not matter what tools you use.
• Explicit is better than implicit. Make it easy to understand what you’re doing and how to extend it.
43. Myths of reproducible research
(Opinions from personal experience.)

44. Myth 1: Partial reproducibility is hard.
“Here’s my script.” => Methods
More generally,
• Many scientists cannot replicate any part of their analysis without a lot of manual work.
• Automating this is a win for reasons that have nothing to do with reproducibility: efficiency!
See: Software Carpentry.

45. Myth 2: Incomplete reproducibility is useless.
Paraphrase: “We can’t possibly reproduce the experimental data exactly, so we shouldn’t bother with anything else, either.”
(An analogous argument comes up re: software testing & code coverage.)
• …I really have a hard time arguing the paraphrase honestly…
• Being able to reanalyze your raw data? Interesting.
• Knowing how you made your figures? Really useful.

46. Myth 3: We need new platforms.
• Techies always want to build something (which is fun!) but don’t want to do science (which is hard!).
• We probably do need new platforms, but stop thinking that building them is, in itself, a service to science.
• Platforms need to be use driven. Seriously.
• If you write good software for scientific inquiry and make it easy to use reproducibly, that will drive virtuosity.

47. Myth 4: Virtual machine reproducibility is an end solution.
• Good start! Better than nothing!
But:
• Limits understanding & reuse.
• Limits remixing: often you cannot install other software!
• “Chinese Room” argument: it could be just a lookup table.

48. Myth 5: We can use GUIs for reproducible research.
(OK, this is partly just to make people think ;)
• Almost all data analysis takes place within a larger pipeline; the GUI must consume the entire pipeline in order to be reproducible.
• IFF the GUI wraps the command line, that’s a decent compromise (e.g. Galaxy), but it handicaps researchers using novel approaches.
• By the time it’s in a GUI, it’s no longer research.
49. Our current efforts?
• Semantic versioning of our own code: stable command-line interface.
• Writing easy-to-teach tutorials and protocols for common analysis pipelines.
• Automate ‘em for testing purposes.
• Encourage their use, inclusion, and adaptation by others.

50. khmer-protocols

51. khmer-protocols:
• Provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers).
• Open, versioned, forkable, citable…
Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression

52. Literate testing
• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and confidence moving forward!
(Leigh Sheneman)
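The extraction step behind this “literate reSTing” idea can be sketched as follows. This is a hypothetical minimal extractor, not the actual literate-resting code; it assumes tutorial commands are marked with a leading `$ `, and the tutorial text itself is invented for illustration:

```python
def extract_commands(tutorial_text):
    """Pull shell commands (lines starting with '$ ') out of a
    prose tutorial, so they can be replayed as a shell script."""
    commands = []
    for line in tutorial_text.splitlines():
        line = line.strip()
        if line.startswith("$ "):
            commands.append(line[2:])
    return commands

tutorial = """
First, download the reads:

   $ wget http://example.com/reads.fa.gz

Then normalize the reads:

   $ normalize-by-median.py reads.fa.gz
"""
print(extract_commands(tutorial))
```

Running the extracted script in CI is what turns a human-readable tutorial into an acceptance test.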
53. Doing things right => #awesomesauce
• Protocols in English for running analyses in the cloud
• Literate reSTing => shell scripts
• Tool competitions
• Benchmarking
• Education
• Acceptance tests

54. Concluding thoughts
• We are not doing anything particularly neat on the computational side... No “magic sauce.”
• Much of our effort is now driven by sheer utility:
o Automation reduces our maintenance burden.
o Extensibility makes revisions much easier!
o Explicit instructions are good for training.
• Some effort is needed at the beginning, but once practices are established, a “virtuous cycle” takes over.

55. What bits should people adopt?
• Version control!
• Literate graphing!
• Automated “build” from data => results!
• Make your data available as early in your pipeline as possible.

56. More concluding thoughts
• Nobody would care that we were doing things reproducibly if our science weren’t decent.
• Make sure students realize that faffing about on infrastructure isn’t science.
• Research is about doing science. Reproducibility (like other good practices) is much easier to proselytize if you can link it to progress in science.

57. Biology & sequence analysis is in a perfect place for reproducibility
We are lucky! A good opportunity!
• Big Data: laptops are too small;
• Excel doesn’t scale any more;
• Few tools in use; most of them are $$ or UNIX;
• Little in the way of entrenched research practice.

58. Thanks!
Talk is on slideshare: slideshare.net/c.titus.brown
E-mail or tweet me: ctb@msu.edu / @ctitusbrown