2014 toronto-torbug

Published in: Science, Technology
  • Slow, but powerful.
  • Acceptance testing other people’s software
  • Update from Jordan
  • More generally….

    1. Building khmer, a platform for research in scalable sequence analysis. C. Titus Brown, ctb@msu.edu
    2. Hello! Assistant Professor; Microbiology; Computer Science; etc. More information at:
       • ged.msu.edu/
       • github.com/ged-lab/
       • ivory.idyll.org/blog/
       • @ctitusbrown
    3. Introducing k-mers. The read CCGATTGCACTGGACCGA breaks down into these overlapping 10-mers:
       CCGATTGCAC
       CGATTGCACT
       GATTGCACTG
       ATTGCACTGG
       TTGCACTGGA
       TGCACTGGAC
       GCACTGGACC
       CACTGGACCG
       ACTGGACCGA
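The decomposition on this slide can be reproduced in a few lines. This is a minimal sketch; the function name `kmers` is chosen here for illustration and is not part of khmer.

```python
def kmers(read, k):
    """Return every overlapping substring of length k, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"  # the read from the slide
ten_mers = kmers(read, 10)

for kmer in ten_mers:
    print(kmer)
```

A read of length L yields L - k + 1 k-mers, so short reads and large k values leave few k-mers to work with.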
    4. K-mers give you an implicit alignment
             CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
       CATGGACCGATTGCACTGGACCGATGCACGGTACCG
    5. K-mers give you an implicit alignment
             CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
       CATGGACCGATTGCACTGGACCGATGCACGGTACCG
       CATGGACCGATTGCACTGGACCGATGCACGGACCG
       (with no accounting for mismatches or indels)
    6. De Bruijn graphs – assemble on overlaps. J.R. Miller et al. / Genomics (2010)
    7. The problem with k-mers
             CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
       CATGGACCGATTGCACTCGACCGATGCACGGTACCG
       Each sequencing error results in k novel k-mers!
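The claim on this slide is easy to check on the slide's own sequence. The sketch below substitutes one base (the G to C change visible in the second read) and counts the k-mers that exist only in the erroneous copy. The helper name `kmer_set` and the choice of k = 10 are assumptions made for illustration.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 10
true_read = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"

# Introduce a single substitution (G -> C) at 0-based position 11,
# turning ...CACTGGACC... into ...CACTCGACC... as on the slide.
error_pos = 11
erroneous_read = true_read[:error_pos] + "C" + true_read[error_pos + 1:]

# k-mers present in the erroneous read but absent from the true read.
novel = kmer_set(erroneous_read, K) - kmer_set(true_read, K)
print(len(novel), "novel k-mers from one error")
```

Every k-mer window that covers the bad base is affected, so an error at least k - 1 bases from either end of the read touches k windows; errors closer to a read end touch fewer.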
    8. Assembly graphs scale with data size, not information. Conway T C, Bromage A J, Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
    9. Practical memory measurements (soil). Velvet measurements (Adina Howe)
    10. Counting k-mers efficiently (RAM)
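One way to count k-mers in fixed memory is a Count-Min Sketch, the kind of probabilistic counter discussed in the Zhang et al. paper cited later in the talk. The code below is a toy sketch under stated assumptions, not khmer's API: the class name, the table sizes, and the use of SHA-1 as the hash are all illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: several fixed-size tables of counters.

    Counts can be overestimated when k-mers collide, but never underestimated.
    """

    def __init__(self, table_sizes):
        # Using different (ideally prime) sizes makes a collision in every
        # table at once unlikely. These sizes are illustrative assumptions.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One 64-bit hash value, reduced modulo each table size.
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # The smallest counter is the one least inflated by collisions.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))


sketch = CountMinSketch([1009, 1013, 1019])
for _ in range(3):
    sketch.add("CCGATTGCAC")
print(sketch.count("CCGATTGCAC"))
```

Memory use is set by the table sizes rather than by the number of distinct k-mers, and counts are available at any point while the data streams in, which is what makes online counting possible.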
    11. This leads to good things.
        • Efficient online counting of k-mers
        • Trimming reads on abundance
        • Efficient De Bruijn graph representations
        • Read abundance normalization
    12. Data structures & algorithms papers
        • “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.
        • “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
        • “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
    13. Data analysis papers
        • “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.
        • Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.
        • A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
    14. Lab approach – not intentional, but working out.
        • Novel data structures and algorithms
        • Implement at scale
        • Apply to real biological problems
    15. This leads to good things. (khmer software)
        • Efficient online counting of k-mers
        • Trimming reads on abundance
        • Efficient De Bruijn graph representations
        • Read abundance normalization
    16. Current research (khmer software)
        • Efficient online counting of k-mers
        • Trimming reads on abundance
        • Efficient De Bruijn graph representations
        • Read abundance normalization
        • Streaming algorithms for assembly, variant calling, and error correction
        • Cloud assembly protocols
        • Efficient graph labeling & exploration
        • Data set partitioning approaches
        • Assembly-free comparison of data sets
        • HMM-guided assembly
        • Efficient search for target genes
    17. How is this feasible?! Representative half-arsed lab software development. (Diagram labels: Version that worked once, for some publication; Grad student 1 research; Grad student 2 research; Incompatible and broken code.)
    18. A not-insane way to do software development. (Diagram labels: Stable version; Grad student 1 research; Grad student 2 research; Stable, tested code; “Run tests”, repeated throughout.)
    19. A not-insane way to do software development. (Diagram labels: Stable version; Grad student 1 research; Grad student 2 research; Stable, tested code; “Run tests”, repeated throughout.)
    20. Testing & version control – the not so secret sauce
        • High test coverage – grown over time.
        • Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
        • Pull requests & continuous integration – does your proposed merge break tests?
        • Pull requests & code review – does new code meet our minimal coding (etc.) requirements?
           o Note: spellchecking!!!
    21. Integration testing
        • khmer is designed to work with other packages.
        • For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.
        • These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…
    22. khmer-protocols
    23. khmer-protocols:
        • Provide standard “cheap” assembly protocols for the cloud.
        • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)
        • Open, versioned, forkable, citable….
        (Pipeline stages: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression.)
    24. Literate testing (Leigh Sheneman)
        • Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
        • See: github.com/ged-lab/literate-resting/.
        • Tremendously improves peace of mind and confidence moving forward!
    25. Doing things right => #awesomesauce. (Diagram labels: Protocols in English for running analyses in the cloud; Literate reSTing => shell scripts; Tool competitions; Benchmarking; Education; Acceptance tests.)
    26. Benchmarking protocols. Data subset; AWS m1.xlarge; ~1 hour. (See PyCon 2014 talk; video and blog post.)
    27. Benchmarking protocols. Complete data; AWS m1.xlarge; ~40 hours. (See PyCon 2014 talk; video and blog post.)
    28. Current research
        • Efficient online counting of k-mers
        • Trimming reads on abundance
        • Efficient De Bruijn graph representations
        • Read abundance normalization
        • Streaming algorithms for assembly, variant calling, and error correction
        • Cloud assembly protocols
        • Efficient graph labeling & exploration
        • Data set partitioning approaches
        • Assembly-free comparison of data sets
        • HMM-guided assembly
        • Efficient search for target genes
    29. Genomic intervals shared between data sets. Qingpeng Zhang. * Assembly free!
    30. Error correction via graph alignment. Jason Pell and Jordan Fish
    31. Error correction on simulated E. coli data: 1% error rate, 100x coverage. Jordan Fish and Jason Pell

                   TP (corrected)      FP (mistakes)   TN (OK)       FN (missed)
        ideal      3,469,834 (99.1%)   8,186           460,655,449   31,731 (0.9%)
        1-pass     2,827,839 (80.8%)   30,254          460,633,381   673,726 (19.2%)
        1.2-pass   3,403,171 (97.2%)   8,764           460,654,871   98,394 (2.8%)
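The percentages on this slide appear to be the share of true errors that were corrected, TP / (TP + FN), and the share that were missed, FN / (TP + FN). That reading is an inference from the numbers, not something the slide states. A short check using the slide's figures (variable names are illustrative):

```python
# (TP, FN) pairs copied from the slide.
results = {
    "ideal": (3_469_834, 31_731),
    "1-pass": (2_827_839, 673_726),
    "1.2-pass": (3_403_171, 98_394),
}

rates = {}
for method, (tp, fn) in results.items():
    total_errors = tp + fn          # every true error is either fixed or missed
    corrected = 100 * tp / total_errors
    missed = 100 * fn / total_errors
    rates[method] = (round(corrected, 1), round(missed, 1))
    print(f"{method}: {corrected:.1f}% corrected, {missed:.1f}% missed")
```

All three rows sum to the same number of true errors, which is consistent with the methods being run on the same simulated data set.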
    32. Single pass, reference free, tunable, streaming online variant calling. Streaming, online variant calling. See NIH BIG DATA grant, http://ged.msu.edu/.
    33. Novelty… to what power?
        • “Novelty” requirements for “high impact publishing”:
           o Must do novel algorithm development
           o …and apply to novel and interesting data sets.
           o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
        • We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
    34. Reproducibility (Qingpeng Zhang). Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.) All our papers now have:
        • Source hosted on github;
        • Data hosted there or on AWS;
        • Long running data analysis => ‘make’
        • Graphing and data digestion => IPython Notebook (also in github)
    35. Concluding thoughts
        • API is destiny – without online counting, diginorm & streaming approaches would not have been possible.
        • Tackle the hard problems – engineering optimization would not have gotten us very far.
        • Testing lets us scale development & process – which means when something works, we can run with it.
    36. Caveats
        • Expense and effort – you can spend an infinite amount of time on infrastructure & process!
           o Advice: choose techniques that address actual pain points.
           o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
        • Funders and reviewers just don’t care – adopt good software practices for yourself, not others.
           o Advice: briefly mention keywords in grants, papers.
        • Advisors just don’t care – see above.
           o These are 90% true statements :>
    37. Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)
        “It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”
        - http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
    38. Thanks!
    39. Prospective: sequencing tumor cells
        • Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
        • 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.
        • Most of this data will be redundant and not useful.
        • Developing diginorm-based algorithms to eliminate data while retaining variant information.
    40. Where are we taking this?
        • Streaming online algorithms only look at data ~once.
        • Diginorm is streaming, online…
        • Conceptually, can move many aspects of sequence analysis into streaming mode.
        => Extraordinary potential for computational efficiency.
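To make "streaming, online" concrete, here is a toy version of the digital normalization idea: look at each read once, estimate its coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a cutoff. This is a sketch under stated assumptions, not khmer's implementation. It uses an exact in-memory `Counter` where khmer uses a memory-efficient probabilistic counter, and the function name, k, and cutoff values are illustrative.

```python
from collections import Counter
from statistics import median


def normalize_by_median(reads, k, cutoff):
    """Yield reads whose median k-mer count so far is below cutoff.

    Each read is examined exactly once, in arrival order.
    """
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# Ten copies of one read (high coverage) followed by one unrelated read.
redundant = "ACGTACGGTCA"
rare = "TTTTGGGGCCCAA"
kept = list(normalize_by_median([redundant] * 10 + [rare], k=5, cutoff=3))
print(len(kept), "of 11 reads kept")
```

Because the decision for each read depends only on what has already been seen, the input never needs to be loaded in full or revisited, which is the property the slide is pointing at.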
