2013 ucar best practices


    1. Best practices for scientific computing
       C. Titus Brown, ctb@msu.edu
       Asst Professor, Michigan State University (Microbiology, Computer Science, and BEACON)
    2. Best practices for scientific computing
       C. Titus Brown, ctb@msu.edu
       Asst Professor, Michigan State University (Microbiology, Computer Science, and BEACON)
    3. Towards better practices for scientific computing
       C. Titus Brown, ctb@msu.edu
       Asst Professor, Michigan State University (Microbiology, Computer Science, and BEACON)
    4. Who are we?
       Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian M. Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, Paul Wilson
       Authors of “Best Practices for Scientific Computing”
       http://arxiv.org/abs/1210.0530
    5. Who am I?
       • “Computational scientist”
       • Worked in:
         – Evolutionary modeling
         – Albedo measurements (Earthshine)
         – Developmental biology & genomics
         – Bioinformatics
       • “Data driven biologist”
         – Data of Unusual Size + bio
    6. Who am I? (Alternative version)
       • Open source / free software
       • Member of the Python Software Foundation
       • Developed a few different pieces of non-scientific software, mostly in the testing world.
       => Open science, reproducibility, better practices.
    7. What is this talk about?
       • Most scientists engage with computation in their science…
       • …but most are never exposed to good software engineering practices.
       • This is not surprising.
         – Computer science generally does not teach “practice”
         – Learning your scientific domain is hard enough.
    8. A non-dogmatic perspective
       • There are few practices that you really need to use.
         – Version control.
         – Testing of some sort
         – Automation of some sort (builds, deployment, pipelines)
       • There are lots of practices that will consume your time and eat your science.
         – …but figuring out which practices are useful is often somewhat domain and project and person specific.
       • There are no silver bullets. (Sorry!)
    9. What do scientists care about?
       1. Correctness
       2. Reproducibility and provenance
       3. Efficiency
    10. What do scientists actually care about?
        1. Efficiency
        2. Correctness
        3. Reproducibility and provenance
    11. Our concern
        • As we become more reliant on computational inference, does more of our science become wrong?
        • “Big Data” increasingly requires sophisticated computational pipelines…
        • We know that simple computational errors have gone undetected for many years
          – a sign error => retraction of 3 Science, 1 Nature, 1 PNAS
          – Rejection of grants, publications!
        http://boscoh.com/protein/a-sign-a-flipped-structure-and-a-scientific-flameout-of-epic-proportions
    12. Our central thesis
        With only a little bit of training and effort,
        • Computational scientists can become more efficient and effective at getting their work done,
        • while considerably improving correctness and reproducibility of their code.
    13. The paper
        • Code for people
        • Automate repetitive tasks
        • Record history
        • Make incremental changes
        • Use version control
        • Don’t repeat yourself
        • Plan for mistakes
        • Avoid premature optimization
        • Document design & purpose of code, not details
        • Collaborate
    14. The subset of these I’ll discuss
        1. Use version control
        2. Plan for mistakes
        3. Automate repetitive tasks
        4. Document design & purpose of code
    15. Use version control!
        1. Any kind of version control is better than none.
        2. Distributed version control (Git, Mercurial) is very different from centralized VCS (CVS, Subversion).
        3. Sites like GitHub and Bitbucket are changing software development in really interesting ways.
           (see: www.wired.com/opinion/2013/03/github/, “The github revolution”)
    16. Use version control
        • Version control enables efficient single-user work by “gating” changes into discrete chunks.
        • Version control is essential to multiperson collaboration on software.
        • Distributed version control enables remixing and reuse without permission, while retaining provenance.
    17. [Diagram of forking. Labels: “Original github repo”; two “Fork” branches; Change A, Change B, Change C, Change D, Change E; Change 1, Change 2, Change 3, Change 4; Change AA, Change BB.]
    18. Plan for mistakes!
        1. Program defensively: use assertions to enforce conditions upon execution

            def calc_gc_content(dna):
                assert 'N' not in dna, "DNA is only A/C/G/T"
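The slide shows only the assertion, so here is a minimal sketch of how the rest of such a function might look. The return convention (a fraction between 0 and 1), the broader check against any non-A/C/G/T character, and the value 0.0 for an empty sequence are assumptions made for illustration, not details from the talk.

```python
def calc_gc_content(dna):
    """Return the fraction of bases in `dna` that are G or C.

    Assumption: the result is a fraction in [0, 1], and an empty
    sequence is defined to have GC content 0.0.
    """
    dna = dna.upper()

    # Defensive check: fail loudly on unexpected characters rather than
    # silently computing a wrong answer. This generalizes the slide's
    # check for 'N' to any character outside A/C/G/T (an assumption).
    unexpected = set(dna) - set("ACGT")
    assert not unexpected, "DNA is only A/C/G/T; found %s" % sorted(unexpected)

    if not dna:
        return 0.0

    gc_count = dna.count("G") + dna.count("C")
    return gc_count / len(dna)
```

An assertion suits conditions that should never occur if upstream code is correct. Note that Python skips assertions when run with `-O`, so input that comes straight from users or files may be better served by raising `ValueError`.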
    19. Plan for mistakes!
        2. Write/run tests

            def test_calc_gc_1():
                gc = calc_gc_content("AT")
                assert gc == 0

            def test_calc_gc_2():
                gc = calc_gc_content("")
                assert gc == 0
    20. Plan for mistakes!
        3. Black box regression tests:
           For fixed input, do we get the same (recorded) output as last day/week/month?
           (Very powerful when combined with version control.)
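One way to sketch a black box regression test is to fingerprint the output for a fixed input and compare it with a fingerprint recorded on an earlier run. Everything named below (`analysis`, `fingerprint`, `check_regression`, the record file) is hypothetical and stands in for a real pipeline step; the talk does not prescribe a mechanism.

```python
import hashlib
import json
import os
import tempfile


def analysis(values):
    """Hypothetical stand-in for a real analysis step: mean and range."""
    return {"mean": sum(values) / len(values), "range": max(values) - min(values)}


def fingerprint(result):
    """Hash a JSON-serializable result; sort_keys keeps the encoding stable."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def check_regression(result, record_path):
    """Compare `result` with the recorded fingerprint, recording it on first run.

    Returns True if the output matches (or was just recorded), False if it changed.
    """
    current = fingerprint(result)
    if not os.path.exists(record_path):
        with open(record_path, "w") as fh:
            fh.write(current)
        return True
    with open(record_path) as fh:
        return fh.read().strip() == current


if __name__ == "__main__":
    fixed_input = [1.0, 2.0, 4.0]
    record = os.path.join(tempfile.mkdtemp(), "analysis.sha256")
    print(check_regression(analysis(fixed_input), record))  # first run records
    print(check_regression(analysis(fixed_input), record))  # second run compares
```

Committing the recorded fingerprint alongside the code ties each expected output to a specific revision. Exact hashing is strict: floating-point results that differ in the last digit across platforms will register as a change, so a tolerance-based comparison may suit numerical code better.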
    21. Plan for mistakes!
        Write/run tests. A few personal maxims:
        - simple tests are already very useful (if they don’t work…)
        - past mistakes are a guide to future mistakes
        - any tests are better than no tests
        - if they’re not easy to run, no one will run them
    22. Automate repetitive tasks!
        Automate your builds, your test running, your analysis pipeline, and your graph production.
        1. Augments reusability/reproducibility.
        2. Encodes expert knowledge into scripts.
        3. Decreases arguments about culpability :)
        4. Excellent training mechanism for new students/collaborators!
        5. Combined with version control => provenance of analysis results!
        6. Improves ability to revise, reuse, remix.
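Point 5 above can be illustrated with a small sketch that runs an analysis step and stores the result together with the Git commit that produced it. The helper names (`current_commit`, `run_with_provenance`, `scale`) and the fields of the record are assumptions chosen for illustration. The only external command used is `git rev-parse HEAD`, with a fallback when Git or a repository is not available.

```python
import json
import subprocess
import sys
import time


def current_commit():
    """Return the current Git commit hash, or "unknown" outside a repository."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or this directory is not a Git repository.
        return "unknown"
    return out.decode("ascii").strip()


def run_with_provenance(step, params):
    """Run `step(**params)` and bundle the result with how it was produced."""
    return {
        "result": step(**params),
        "params": params,
        "commit": current_commit(),
        "python": sys.version.split()[0],
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }


def scale(values, factor):
    """Hypothetical analysis step, used only for illustration."""
    return [v * factor for v in values]


if __name__ == "__main__":
    record = run_with_provenance(scale, {"values": [1, 2, 3], "factor": 2})
    print(json.dumps(record, indent=2, sort_keys=True))
```

Writing such a record next to every output file makes it possible to ask later which version of the code produced a given figure or table. The commit hash is only meaningful if the working tree had no uncommitted changes when the step ran.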
    23. IPython Notebook
    24. Cloud computing/VMs
        • One approach my lab has been using is to make a publication’s data, code, and instructions available for Amazon EC2 instances: ged.msu.edu/papers/2012-diginorm/
        • Reviewers have been known to actually go rerun our pipeline…
        • More to the point, this enables others (including collaborators) to revise, reuse, remix.
    25. Document design & purpose

            x = x + 1 # add 1 to x

        vs

            # increase past possible fencepost boundary error
            range_end = range_end + 1
    26. Document design & purpose
        More generally,
        - describe APIs
        - provide tutorials on use
        - discuss the design for domain experts & programmers, not for novices.
    27. Anecdotes I need to remember to tell
        1) A sizeable fraction of my “single-use” scripts were wrong, upon reuse.
        2) New students in my lab run through at least one old paper’s execution pipeline before starting their work.
        3) Students may develop for a long time on their own branch, while continually merging from main.
    28. DVCS particularly facilitates long term branching.
        [Diagram of a long-lived fork. Labels: “Original github repo”; “Fork”; Change A, Change B, Change C, Change D, Change E; Change AA, Change BB, Change CC.]
    29. There are many, many practices I did not discuss.
        Testing:
        • TDD vs BDD vs SDD?
        • Functional tests vs unit testing vs …
        • Code coverage analysis.
        • Continuous integration!
        My view: be generally aware of what’s out there & focus on what addresses your pain points.
    30. Software Carpentry
        http://software-carpentry.org
        • Invite us to run a workshop!
        • 2 days of training at appropriate/desired level:
          – Beginning/intro
          – Intermediate
          – Advanced (?)
        • Funded by Sloan and operated by Mozilla
    31. Contact info
        Titus Brown, ctb@msu.edu
        http://ivory.idyll.org/blog/
        @ctitusbrown on Twitter
        This talk will be on slideshare shortly; google “titus brown slideshare”
        Best Practices for Scientific Computing: http://arxiv.org/abs/1210.0530
        Git can facilitate greater reproducibility… (K. Ram): http://www.scfbm.org/content/8/1/7/abstract
