# 2013-10-30 SBC361: Reproducible designs and sustainable software

Queen Mary U London SBC361
Experimental Design
Reproducible Research
Sustainable Software

Published in: Education, Technology
1. Programming in R: quick refresher
2. Creating a vector (three synonyms):
   • `myvector <- 5:11`
   • `myvector <- seq(from=5, to=11, by=1)`
   • `myvector <- c(5, 6, 7, 8, 9, 10, 11)`
   • In each case, `myvector` prints `[1] 5 6 7 8 9 10 11`.

   Accessing a subset of a vector:
   • `bigvector <- 150:100` creates the 51 integers from 150 down to 100.
   • `mysubset <- bigvector[myvector]` selects elements 5 to 11, so `mysubset` prints `[1] 146 145 144 143 142 141 140`.
   • `subset(bigvector, bigvector > 120)` returns the values from 150 down to 121.
3. Regular expressions: text search on steroids.
   • `David` finds: David
   • `Dav(e|id)` finds: David, Dave
   • `Dav(e|id|ide|o)` finds: David, Dave, Davide, Davo
   • `At{1,2}enborough` finds: Attenborough, Atenborough
   • `Atte[nm]borough` finds: Attenborough, Attemborough
   • `At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}` finds: Atimbro, attenbrough, etc.

   Easy counting, and easy replacing of all variants with “Sir David Attenborough”.
4. Regular expressions in R:
   • for subsetting/counting: `grep()`
   • for replacing: `gsub()`
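As a minimal sketch of how `grep()` and `gsub()` fit the counting and replacing described above: the example sentences are made up, and the pattern is an assumed combination of the patterns on the previous slide, not one shown in the lecture.

```r
# Hypothetical example text; these sentences are invented for illustration.
mentions <- c("David Attenborough narrated it.",
              "I met Dave Atenborough once.",
              "Davo Attemborough?",
              "Nothing relevant here.")

# Assumed pattern: first-name variants followed by surname variants.
pattern <- "Dav(e|id|ide|o) At{1,2}e[nm]borough"

# grep() returns the indices of the elements that match (subsetting/counting).
matching_indices <- grep(pattern, mentions)
n_matches <- length(matching_indices)

# value = TRUE returns the matching elements themselves.
matching_text <- grep(pattern, mentions, value = TRUE)

# gsub() replaces every match with the replacement string.
cleaned <- gsub(pattern, "Sir David Attenborough", mentions)
print(cleaned)
```

`grep()` is case-sensitive by default; passing `ignore.case = TRUE` would also catch lower-case spellings.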
5. Functions
   • R has many, e.g. `plot()`, `t.test()`.
   • Making your own: `tree_age_estimate <- function(diameter, species) { ... return(age.estimate) }`, where the body does the magic, maybe something like `growth.rate <- growth.rates[ species ]` followed by `age.estimate <- diameter / growth.rate`.
   • Using it: `tree_age_estimate(25, "White Oak")` gives 66; `tree_age_estimate(60, "Carya ovata")` gives 190.
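A runnable sketch of the function outlined above. The `growth.rates` lookup table and its values are hypothetical (the slide does not give them), so the results differ from the 66 and 190 shown on the slide.

```r
# Hypothetical growth rates (diameter units gained per year); not real data.
growth.rates <- c("White Oak" = 0.5, "Carya ovata" = 0.25)

tree_age_estimate <- function(diameter, species) {
  # Fail early with a clear message if the species is not in the table.
  if (!species %in% names(growth.rates)) {
    stop("No growth rate known for species: ", species)
  }
  growth.rate <- growth.rates[[species]]
  age.estimate <- diameter / growth.rate
  return(age.estimate)
}

print(tree_age_estimate(25, "White Oak"))    # 25 / 0.5
print(tree_age_estimate(60, "Carya ovata"))  # 60 / 0.25
```

Using `[[ ]]` rather than `[ ]` for the lookup returns a bare number without the species name attached.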
6. “for” loop
   • Define the values to loop over: `possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')`
   • Loop over them: `for (colour in possible_colours) { print(paste("The sky is oh so, so", colour)) }`
   • This prints one line per colour, starting with `[1] "The sky is oh so, so blue"` and ending with `[1] "The sky is oh so, so electric blue"`.
7. Experimental design, reproducible research & scientific computing.
8. Why consider experimental design?
   • If you’re performing experiments:
      • Cost
      • Time (for experiment, for analysis)
      • Ethics
   • If you’re deciding to fund? to buy? to approve? to compete?
      • Are the results real?
      • Can you trust the data?
9. Main potential problems
   • Insufficient data/power
   • Inappropriate statistics
   • Pseudoreplication
   • Confounding factors

   Slide labels: Inaccurate & Misleading; Wrong.
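To make "insufficient data/power" concrete, here is a minimal sketch using `power.t.test()` from R's built-in `stats` package. The effect size, standard deviation, and target power are assumed values chosen for illustration, not numbers from the lecture.

```r
# Assumed scenario: a two-sample t-test where the true difference between
# group means equals one standard deviation (delta = 1, sd = 1).
result <- power.t.test(delta = 1,         # assumed true difference in means
                       sd = 1,            # assumed standard deviation
                       sig.level = 0.05,  # conventional significance threshold
                       power = 0.8)       # desired chance of detecting the effect

# result$n is the (fractional) sample size per group; round up to whole samples.
n_per_group <- ceiling(result$n)
print(n_per_group)
```

Running this kind of calculation before collecting data indicates whether the planned number of independent samples could plausibly detect the effect of interest.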
10. Example: deer parasites
   • Do red deer that feed in woodland have more parasites than deer that feed on moorland?
   • Find a woodland + a highland; collect faecal samples from 20 deer in each.
   • Conclusion?
   • But:
      • pseudoreplication (n = 1, not 20!): shared environment (deer influence each other); relatedness
      • many confounding factors (e.g. altitude...)
11. Your turn: small & big Pheidole workers.
   • Is there a genetic predisposition for becoming a larger worker?
   • Design an experiment alone.
   • Exchange ideas with your neighbor.
12. e.g.: John.
13. Your turn again: protein production
   • Large amounts of potential superdrug takeItEasyProtein™ required for Phase II trials.
   • 10 cell lines can produce takeItEasyProtein™.
   • You have 5 possible growth media.
   • Optimization question: which combination of temperature, cell line, and growth medium will perform best?
   • Constraints:
      • each assay takes 4 days.
      • access to 2 incubators (each can contain 1-100 growth tubes).
      • large-scale production starts in 2 weeks.
   • Design an experiment alone.
   • Exchange ideas with your neighbor.
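One way to start on the exercise above is to count what the constraints allow. This is a sketch of the arithmetic only, not a model answer; it assumes that each incubator runs at a single temperature at a time and that assays cannot overlap within an incubator. The cell line and medium labels are placeholders.

```r
# Placeholder labels for the 10 cell lines and 5 growth media.
cell_lines <- paste0("line", 1:10)
media <- paste0("medium", 1:5)

# Every cell line x medium combination that could share one incubator.
combos <- expand.grid(cell_line = cell_lines, medium = media,
                      stringsAsFactors = FALSE)
tubes_per_replicate <- nrow(combos)

# Replicates of the full grid that fit in one incubator (100 tubes).
tube_capacity <- 100
replicates_per_run <- tube_capacity %/% tubes_per_replicate

# Sequential 4-day assays that fit into the 2 weeks before production.
days_available <- 14
assay_days <- 4
rounds <- days_available %/% assay_days

# Total incubator runs, i.e. the number of temperature settings (or
# temperature replicates) that can be tried under the stated assumptions.
incubators <- 2
incubator_runs <- rounds * incubators

print(c(tubes = tubes_per_replicate, replicates = replicates_per_run,
        rounds = rounds, runs = incubator_runs))
```

Under these assumptions, tubes in the same incubator share an environment, so replication of a temperature comes from separate incubator runs rather than from tubes within one run.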
14. Reproducible Research & Scientific Computing
15. Why care?
16. Some sources of inspiration
17. Best Practices for Scientific Computing (arXiv:1210.0530v3 [cs.MS], 29 Nov 2012)

   Authors: Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson.

   Affiliations legible on the slide include Software Carpentry (gvwilson@software-carpentry.org), University of Ontario Institute of Technology, Software Sustainability Institute (N.ChueHong@epcc.ed.ac.uk), University of Toronto (guy@cs.utoronto.ca), Monterey Bay Aquarium Research Institute (steve@practicalcomputing.org), University of Wisconsin (khuff@cae.wisc.edu, wilsonp@engr.wisc.edu), University of British Columbia, Queen Mary University of London (mark.plumbley@eecs.qmul.ac.uk), and University College London (b.waugh@ucl.ac.uk).

   Abstract: Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

   Opening paragraph: Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

   The practices highlighted on the slide:
   1. Write programs for people, not computers.
   2. Automate repetitive tasks.
   3. Use the computer to record history.
   4. Make incremental changes.
   5. Use version control.
   6. Don’t repeat yourself (or others).
   7. Plan for mistakes.
   8. Optimize software only after it works correctly.
   9. Document the design and purpose of code rather than its mechanics.
   10. Conduct code reviews.