Masters bioinfo 2013-11-14-15


  1. 1. Research Computing & Bioinformatics @yannick__ QMUL Masters/PhD 11-2013
  2. 2. © Alex Wild & others
  3. 3. Atta leaf-cutter ants © National Geographic
  4. 4. Atta leaf-cutter ants © National Geographic
  5. 5. Atta leaf-cutter ants © National Geographic
  6. 6. Oecophylla Weaver ants ©
  7. 7. Weaver ants ©
  8. 8. Oecophylla Weaver ants ©
  9. 9. © wynnie@flickr ©
  10. 10. Forelius pusillus Tofilski et al 2008
  11. 11. Forelius pusillus hides the nest entrance at night Tofilski et al 2008
  12. 12. Forelius pusillus hides the nest entrance at night Tofilski et al 2008
  13. 13. Forelius pusillus hides the nest entrance at night Tofilski et al 2008
  14. 14. Forelius pusillus hides the nest entrance at night Tofilski et al 2008
  15. 15. Forelius pusillus hides the nest entrance at night (before). Workers staying outside die: "preventive self-sacrifice". Tofilski et al 2008
  16. 16. Dorylus driver ants: ants with no home © BBC
  17. 17. Animal biomass (Brazilian rainforest), by group: soil fauna excluding earthworms, ants & termites; spiders; earthworms; mammals; ants & termites; other insects; birds; reptiles; amphibians. From Fittkau & Klinge 1973
  18. 18. (my background) (my interests)
  19. 19. Big data is invading biology
  20. 20. Any lab can sequence anything! 454, Illumina, SOLiD... This changes everything.
  21. 21. Big data is invading biology
      • Genomics
      • Biodiversity assessments
      • Stool microbiome sequencing
      • Personalized medicine
      • Cancer genomics
      • Sensor networks, e.g. tracking microclimates
      • Aerial surveys (drones), e.g. crop productivity; rainforest cover
      • Camera traps
  22. 22. Plan: 1. Unix/High Performance Computing/Cluster stuff 2. Programming in R 3. TBD
  23. 23. Unix/High Performance Computing/Cluster
  24. 24. Recap/Questions • How do you connect to Apocrita? • Where do you run jobs? • How do you run something on Apocrita?
  25. 25. Examples
  26. 26. Case Study 2
      • Create an ssh private/public key pair
      • Log in to the head node using the ssh key
      • Connect to sm11 via ssh
        – Hint: use agent forwarding: -A
      • Create ssh shortcuts (on your local machine)
      • Connect to sm11 via ssh proxy
      • Start a sample application (top, or xclock)
      • Move it between foreground and background
        – Hint: use bg and fg
      • Start the application in the background
        – Hint: use & (ampersand)
  27. 27. Case Study 2 (2)
      • Monitor processes
        – Hint: use top, ps, pstree
      • Experiment with top parameters (man top)
      • Stop a selected process
        – Hint: use kill
      • Start multiple processes of the same application
      • Check that they are running
        – Hint: use ps and grep, pstree and top
      • Stop all of them
        – Hint: use killall
  28. 28. Case Study 3
      • Log in to sm11 via ssh
      • Start a virtual terminal
        – Hint: use screen
      • Start a process in it (use top)
      • Detach from the screen session
        – Hint: use Ctrl+a d
      • Log out and back in
      • List current screen sessions
        – Hint: screen -ls
  29. 29. Case Study 3 (2)
      • Attach to the screen session
        – Hint: screen -r [sessionID]
      • Destroy the screen session
        – Hint: Ctrl+a k
      • Check that the screen session is destroyed
      • Experiment with nohup and disown
        – Hint: use nohup [application name]
        – Hint: use disown -h [jobID]
      • Check that they are running
  30. 30. Case Study 4
      • Log in to the head node via ssh
      • Obtain the test dataset file from
      • Compress it using gzip
      • Compare the sizes
        – Hint: use ls -la[h], or du [-s]
      • Analyse the contents of the gzipped archive
        – Hint: use zcat
      • Search for a pattern in the gzipped archive
        – Hint: use zgrep
      • Extract the contents of the archive
  31. 31. Case Study 4 (2)
      • Compress multiple text files into a single archive
        – Hint: use tar
      • Compress a directory containing multiple files
        – Hint: use tar or zip
      • List the contents of the tar archive
        – Hint: use tar -tvf
  32. 32. Case Study 5
      • Log in to sm11
      • Download code from
      • Extract the contents
      • Compile
      • Analyse the results of the compilation
      • Add the binaries to the PATH
  33. 33. Case Study 5 (2)
      • Run it on the Si_gnF.scaffold.fasta data
        – Use it to extract the following sequences:
          Si_gnF.scaffold05788
          Si_gnF.scaffold05760
          Si_gnF.scaffold01035
          Si_gnF.scaffold07345
          Si_gnF.scaffold07801
          Si_gnF.scaffold07087
          Si_gnF.scaffold05362
          Si_gnF.scaffold08533
          Si_gnF.scaffold02116
          Si_gnF.scaffold08406
  34. 34. Case Study 5 (3)
      • Can it run on the gzipped version?
      • If so, how?
      • Repeat for the code from
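The extraction task in Case Study 5 can be sketched in a few lines of Ruby. This is not the tool the case study asks you to download and compile; it is only an illustration of the idea, and of why a gzipped file is no obstacle (it can be decompressed as a stream while reading). The method names `open_fasta` and `extract_sequences` are made up for this sketch, and it assumes the sequence identifier is the first word after `>` on each header line.

```ruby
require "zlib"
require "set"

# Open a FASTA file for reading; files ending in .gz are decompressed on the fly.
def open_fasta(path)
  if path.end_with?(".gz")
    Zlib::GzipReader.open(path) { |gz| yield gz }
  else
    File.open(path) { |file| yield file }
  end
end

# Return the FASTA records (header line plus sequence lines) whose identifier
# is listed in wanted_ids. Records come back in file order.
def extract_sequences(path, wanted_ids)
  wanted  = wanted_ids.to_set
  records = []
  keep    = false
  open_fasta(path) do |io|
    io.each_line do |line|
      if line.start_with?(">")
        # Assumption: the identifier is the first word after ">"
        id   = line[1..-1].split.first
        keep = wanted.include?(id)
        records << String.new if keep
      end
      records.last << line if keep
    end
  end
  records
end
```

Under those assumptions, `puts extract_sequences("Si_gnF.scaffold.fasta.gz", ["Si_gnF.scaffold05788", "Si_gnF.scaffold05760"])` would print the two matching records, provided a gzipped copy of the file exists under that name.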
  35. 35. Programming in R • Regular Expressions • Functions • Loops
  36. 36. Programming in R Quick refresher
  37. 37. • creating a vector
        • three synonyms:
          > myvector <- 5:11
          > myvector <- seq(from=5, to=11, by=1)
          > myvector <- c(5, 6, 7, 8, 9, 10, 11)
          > myvector
          [1]  5  6  7  8  9 10 11
      • accessing a subset of a vector
          > bigvector <- 150:100
          > bigvector
           [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
          [20] 131 130 129 128 127 126 125 124 123 122 121 120 119 118 117 116 115 114 113
          [39] 112 111 110 109 108 107 106 105 104 103 102 101 100
          > mysubset <- bigvector[myvector]
          > mysubset
          [1] 146 145 144 143 142 141 140
          > subset(bigvector, bigvector > 120)
           [1] 150 149 148 147 146 145 144 143 142 141 140 139 138 137 136 135 134 133 132
          [20] 131 130 129 128 127 126 125 124 123 122 121
  38. 38. Regular expressions: text search on steroids.
      Regular expression                        Finds
      David                                     David
      Dav(e|id)                                 David, Dave
      Dav(e|id|ide|o)                           David, Dave, Davide, Davo
      At{1,2}enborough                          Attenborough, Atenborough
      Atte[nm]borough                           Attenborough, Attemborough
      At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}        Atimbro, Attenbrough, etc.
      Easy counting, replacing all with "Sir David Attenborough"
  39. 39. Regular expressions
      Pattern             Meaning
      \d  [[:digit:]]     synonymous with [0-9]
      [A-z]               letters, usually meant as [A-Za-z] (in ASCII order [A-z] also matches [ \ ] ^ _ `)
      \s                  whitespace
      .                   any single character
      .+                  one or more of any character
      b*                  zero or more of the letter 'b'
      [^abc]              any character other than a, b or c
      \(                  a literal (
      [[:punct:]]         any of these: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
      • Google "Regular expression cheat sheet"
      • ?regexp
  40. 40. • for subsetting/counting: grep() • for replacing: gsub()
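The two operations named on the slide, subsetting/counting and replacing, can also be tried outside R. Below is a small sketch in Ruby (the other language used later in these slides), where the corresponding methods happen to share the names `grep` and `gsub`. The list of names is invented sample data; the patterns are the ones from the regular-expression slides.

```ruby
# Invented sample data: different spellings of one name, plus one other person.
names = [
  "David Attenborough",
  "Dave Atenborough",
  "Davo Attemborough",
  "Davide Atimbro",
  "Richard Attenborough"
]

first_name = /Dav(e|id|ide|o)/
full_name  = /Dav(e|id|ide|o) At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}/

# Subsetting and counting: keep only the entries that match the pattern.
davids = names.grep(first_name)
puts davids.length   # prints 4

# Replacing: rewrite every matching spelling to one canonical form.
cleaned = names.map { |name| name.gsub(full_name, "Sir David Attenborough") }
puts cleaned
```

In R the equivalent calls would be `grep()` on a character vector and `gsub()` with the pattern, the replacement and the vector, as the slide says.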
  41. 41. Functions
      • R has many, e.g.: plot(), t.test()
      • Making your own:
          tree_age_estimate <- function(diameter, species) {
            # [ the magic... maybe something like: ]
            # growth.rates: a per-species lookup of growth rates, assumed to exist already
            growth.rate  <- growth.rates[ species ]
            age.estimate <- diameter / growth.rate
            return(age.estimate)
          }
          > tree_age_estimate(25, "White Oak")
          66
          > tree_age_estimate(60, "Carya ovata")
          190
  42. 42. "for" Loop
          > possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue', 'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue', 'prussian blue', 'indigo', 'baby blue', 'electric blue')
          > possible_colours
           [1] "blue"          "cyan"          "sky-blue"      "navy blue"
           [5] "steel blue"    "royal blue"    "slate blue"    "light blue"
           [9] "dark blue"     "prussian blue" "indigo"        "baby blue"
          [13] "electric blue"
          > for (colour in possible_colours) {
          +   print(paste("The sky is oh so, so", colour))
          + }
          [1] "The sky is oh so, so blue"
          [1] "The sky is oh so, so cyan"
          [1] "The sky is oh so, so sky-blue"
          [1] "The sky is oh so, so navy blue"
          [1] "The sky is oh so, so steel blue"
          [1] "The sky is oh so, so royal blue"
          [1] "The sky is oh so, so slate blue"
          [1] "The sky is oh so, so light blue"
          [1] "The sky is oh so, so dark blue"
          [1] "The sky is oh so, so prussian blue"
          [1] "The sky is oh so, so indigo"
          [1] "The sky is oh so, so baby blue"
          [1] "The sky is oh so, so electric blue"
  43. 43. Let's do it: regular expressions.
  44. 44. Let's do it, part 2
  45. 45. Reproducible Research & Scientific Computing
  46. 46. Why care?
  47. 47. Some sources of inspiration
  48. 48. Best Practices for Scientific Computing
      Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson
      arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
      Abstract: Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.
      Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.
      1. Write programs for people, not computers.
      2. Automate repetitive tasks.
      3. Use the computer to record history.
      4. Make incremental changes.
      5. Use version control.
      6. Don't repeat yourself (or others).
      7. Plan for mistakes.
      8. Optimize software only after it works correctly.
      9. Document the design and purpose of code rather than its mechanics.
      10. Conduct code reviews.
  49. 49. R style guide •
  50. 50. Education: A Quick Guide to Organizing Computational Biology Projects
      William Stafford Noble. Department of Genome Sciences, School of Medicine, and Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
      Introduction: Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments.
      Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
      This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again.
      Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
      Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. doi:10.1371/journal.pcbi.1000424.g001
      The Lab Notebook: In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments.
      In each results folder:
      • script: getResults.rb or WHATIDID.txt or MyAnalysis.Rnw
      • intermediates
      • output
  51. 51. Take notes in Markdown “compile” to html, pdf,
  52. 52. knitr (sweave): analyzing & reporting in a single file. Also works with Markdown instead of LaTeX!
      MyFile.Rnw:
          \documentclass{article}
          \usepackage[sc]{mathpazo}
          \usepackage[T1]{fontenc}
          \usepackage{url}
          \begin{document}

          <<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
          # this is equivalent to \SweaveOpts{...}
          opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
          options(replace.assign=TRUE,width=90)
          @

          \title{A Minimal Demo of knitr}
          \author{Yihui Xie}
          \maketitle

          You can test if \textbf{knitr} works with this minimal demo. OK, let's
          get started with some boring random numbers:

          <<boring-random,echo=TRUE,cache=TRUE>>=
          set.seed(1121)
          (x=rnorm(20))
          mean(x);var(x)
          @

          The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
          and histograms recorded by the PDF device:

          <<boring-plots,cache=TRUE,echo=TRUE>>=
          ## two plots side by side
          par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
          boxplot(x)
          hist(x,main='')
          @

          Do the above chunks work? You should be able to compile the \TeX{} ...
      ### in R:
          library(knitr)
          knit("MyFile.Rnw")    # --> creates MyFile.tex
      ### in shell:
          pdflatex MyFile.tex   # --> creates MyFile.pdf
      Rendered MyFile.pdf: "A Minimal Demo of knitr", Yihui Xie, February 26, 2012, showing the text, the code chunks and their printed output (for example, mean(x) gives ## [1] 0.3217).
  53. 53. Choosing a programming language
      Language                                        Good for                            Bad
      Excel                                           quick & dirty                       easy to make mistakes
      R                                               numbers, stats, genomics            programming
      Unix command line (i.e., shell, i.e., bash)     Can't escape it. Quick & dirty      programming, complicated things
      Java                                            User interfaces in the 1990s.       overcomplicated.
      Perl                                            1980s.                              Everything.
      Python                                          scripting, text
      Ruby                                            scripting, text
      Javascript                                      web apps
  54. 54. Ruby. "Friends don't let friends do Perl" - reddit user
      Example: reverse the contents of each line in a file
      ### in Perl:
          open INFILE, "my_file.txt";
          while (defined ($line = <INFILE>)) {
              chomp($line);
              @letters = split(//, $line);
              @reverse_letters = reverse(@letters);
              $reverse_string = join("", @reverse_letters);
              print $reverse_string, "\n";
          }
      ### in Ruby:
          File.open("my_file.txt").each do |line|
            puts line.chomp.reverse
          end
  55. 55. More Ruby examples.
          5.times do
            puts "Hello world"
          end

          # Sorting people
          people_sorted_by_age = people.sort_by { |person| person.age }
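The sorting one-liner on the slide assumes that a `people` collection with an `age` attribute already exists. A self-contained sketch, with three invented people and a throwaway `Struct` standing in for whatever class the real data would use, might look like this:

```ruby
# Invented sample data; person_class is a stand-in for a real Person class.
person_class = Struct.new(:name, :age)
people = [
  person_class.new("Ada", 36),
  person_class.new("Charles", 79),
  person_class.new("Rosalind", 37)
]

# Same call as on the slide: youngest first.
people_sorted_by_age = people.sort_by { |person| person.age }
puts people_sorted_by_age.map(&:name).join(", ")

# Two small variations: oldest first, and just the youngest person.
oldest_first = people.sort_by { |person| -person.age }
youngest     = people.min_by(&:age)
puts "#{youngest.name} is the youngest"
```

The block passed to `sort_by` only has to return the value to sort on, which is why the same pattern works for any attribute.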
  56. 56. Getting help.
      • In real life: Make friends with people. Talk to them.
      • Online:
        • Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
        • Programming:
        • Bioinformatics:
        • Sequencing-related:
        • Stats:
  57. 57. Let's do it, part 2