Tips & Tricks for Software Engineering in Bioinformatics
Presented by: Joel Dudley
This is a talk I've given twice at Stanford recently. It's essentially a brain dump of my thoughts on being a Bioinformatician with lots of links to useful tools.

Published in: Technology

Transcript of "Tips And Tricks For Bioinformatics Software Engineering"

  1. Tips & Tricks for Software Engineering in Bioinformatics. Presented by: Joel Dudley
  2. Who is this guy?
  3. [Figure: average time spent programming (hours, 0 to 10.0) plotted against age (years, 5 to 32)]
  4. http://www.megasoftware.net
  5. Kumar S. and Dudley J. “Bioinformatics software for biologists in the genomics era.” Bioinformatics (2007) vol. 23 (14) pp. 1713-7
  6. Bioinformatics Philosophy
  7. Build Your Toolbox
  8. Learn UNIX!
  9. Be a jack of all trades, but master of one. http://oreilly.com/news/graphics/prog_lang_poster.pdf
  10. R, C/C++, PHP, VB, Perl, Python, Ruby, Java, Lisp
  11. Java is not just for Java: http://jruby.codehaus.org and http://www.jython.org
  12. Simplified Wrapper and Interface Generator (SWIG): greasy-fast C library, doughy-soft scripting language. http://www.swig.org/
  13. Frameworks are Friends: BioBike
  14. Stand on the slumped, dandruff-covered shoulders of millions of computer nerds.
  15. Don’t trust yourself (or your hard disk).
  16. Don’t be afraid to use more than three letters to define a variable!
      #!/usr/bin/perl
      # 472-byte qrpff, Keith Winstein and Marc Horowitz <sipb-iap-dvd@mit.edu>
      # MPEG 2 PS VOB file -> descrambled output on stdout.
      # usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
      # where k1..k5 are the title key bytes in least to most-significant order
      s''$/=2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@
      b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9
      |256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)
      ^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"xbntdxbzx14d")[_/16%8]);E
      ^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8
      )+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/$$&/g;s/q/pack+/g;eval
  17. Object-Oriented Software Design Decisions [Figure; labels: Accomplishment, Architecture]
  18. module GraphBuilder
        LINE_TYPES = [:solid,:dashed,:dotted]
        module Nodes
          SHAPE_TYPES = [:rectangle,:roundrectangle,:ellipse,:parallelogram,:hexagon,:octagon,
                         :diamond,:triangle,:trapezoid,:trapezoid2,:rectangle3d]
          class BaseNode
            attr_accessor :label,:geometry,:fill_colors,:outline,:degree,:data
            def initialize(opts={})
              @opts = {
                :form=>:ellipse,
                :height=>50.0,
                :width=>50.0,
                :label=>"GraphNode#{self.object_id}",
                :line_type=>:solid,
                :fill_color => {:R=>255,:G=>204,:B=>0,:A=>255},
                :fill_color2 => nil,
                :data => {},
                :outline_color=>{:R=>0,:G=>0,:B=>0,:A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
              }.merge(opts)
              @data = @opts[:data] # for storing application-specific data
              @label = Labels::NodeLabel.new(@opts[:label])
              @geometry = {:pos_x=>0.0,:pos_y=>0.0,:width=>1.0,:height=>1.0}
              @fill_colors = [@opts[:fill_color],nil]
              @outline = {:line_type=>@opts[:line_type],:color=>@opts[:outline_color]}
              @degree = {:in=>0,:out=>0}
            end
            def clone_params
              {
                :label=>text,
                :fill_color=>@fill_colors.first,
                :form=>@form,
                :height=>@geometry[:height],
                :width=>@geometry[:width]
              }
            end
          end
          class ShapeNode < BaseNode
            attr_accessor :form
            def initialize(opts={})
              super
              @form = @opts[:form]
              @geometry[:height] = @opts[:height]
              @geometry[:width] = @opts[:width]
            end
  19. To Subclass or not to subclass? Use mixins!
      require 'bigdecimal/math' # provides BigMath.log, used as a fallback below

      class Array
        def arithmetic_mean
          self.inject(0.0) { |sum,x| x = x.real if x.is_a?(Complex); sum + x.to_f } / self.length.to_f
        end

        def geometric_mean
          begin
            Math.exp(self.select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
          rescue Errno::ERANGE
            Math.exp(self.select { |x| x > 0.0 }.collect { |x| BigMath.log(x,50) }.arithmetic_mean)
          end
        end

        def median
          sorted = self.sort # the middle element is the median only for sorted data
          if sorted.length.odd?
            sorted[sorted.length / 2]
          else
            upper_median = sorted[sorted.length / 2]
            lower_median = sorted[(sorted.length / 2) - 1]
            [upper_median,lower_median].arithmetic_mean
          end
        end

        def standard_deviation
          mean = self.arithmetic_mean
          deviations = self.map { |x| x - mean }
          sqr_deviations = deviations.map { |x| x**2 }
          sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum,x| sum + x }
          Math.sqrt(sum_sqr_deviations/(self.length - 1).to_f) # sample standard deviation (n - 1)
        end
        alias_method :sd, :standard_deviation

        # Ruby 1.8.7 and later already provide Array#shuffle and Array#shuffle!
        def shuffle
          sort_by { rand }
        end

        def shuffle!
          self.replace shuffle
        end
      end
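Slide 19 reopens `Array` directly. A minimal sketch of the mixin alternative the slide title points to is shown here: the statistics live in a module that is included only where needed. The names `DescriptiveStats` and `ExpressionProfile` are hypothetical and do not come from the talk.

```ruby
# Hypothetical mixin: any object that responds to #to_a can gain these
# statistics without reopening Array or subclassing it.
module DescriptiveStats
  def arithmetic_mean
    values = to_a
    values.sum(0.0) / values.length
  end

  def median
    sorted = to_a.sort
    middle = sorted.length / 2
    sorted.length.odd? ? sorted[middle] : (sorted[middle - 1] + sorted[middle]) / 2.0
  end
end

# Hypothetical domain class: expression values measured for one gene.
class ExpressionProfile
  include Enumerable       # supplies #to_a and #sort on top of #each
  include DescriptiveStats

  def initialize(values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
  end
end

profile = ExpressionProfile.new([4.0, 1.0, 3.0, 2.0])
puts profile.arithmetic_mean
puts profile.median

# The same module can be mixed into a single object instead of a whole class.
scores = [10, 20, 40].extend(DescriptiveStats)
puts scores.arithmetic_mean
```

Because the module is included per class (or extended per object), other arrays in the program keep the stock `Array` interface.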
  20. Documenting code sucks! Automate it.
      • Come up with a convention for your “headers”
      • Use automated documentation generation tools
        • JavaDoc
        • RDoc
        • Pydoc / Epydoc
      • Save code snippets in a searchable repository
  21. A little performance optimization goes a long way
      • General tools
        • DTrace
        • strace
        • gdb
      • Language specific
        • Ruby-prof
        • Psyco/Pyrex
        • JBoss Profiler/JIT
  22. Working with data
  23. # Copyright © 1996-2007 SRI International, Marine Biological Laboratory, DoubleTwist Inc.,
      # The Institute for Genomic Research, J. Craig Venter Institute, University of California at San Diego, and UNAM. All Rights Reserved.
      #
      #
      # Please see the license agreement regarding the use of and distribution of this file.
      # The format of this file is defined at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html .
      #
      # Species: E. coli K-12
      # Database: EcoCyc
      # Version: 11.5
      # File Name: dnabindsites.dat
      # Date and time generated: August 6, 2007, 17:32:33
      #
      # Attributes:
      #    UNIQUE-ID
      #    TYPES
      #    COMMON-NAME
      #    ABS-CENTER-POS
      #    APPEARS-IN-BINDING-REACTIONS
      #    CITATIONS
      #    COMMENT
      #    COMPONENT-OF
      #    COMPONENTS
      #    CREDITS
      #    DATA-SOURCE
      #    DBLINKS
      #    INSTANCE-NAME-TEMPLATE
      #    INVOLVED-IN-REGULATION
      #    LEFT-END-POSITION
      #    REGULATED-PROMOTER
      #    RELATIVE-CENTER-DISTANCE
      #    RIGHT-END-POSITION
      #    SYNONYMS
      #
      UNIQUE-ID - BS86
      TYPES - DNA-Binding-Sites
      ABS-CENTER-POS - 4098761
      CITATIONS - 94018613
      CITATIONS - 94018613:EV-EXP-IDA-BINDING-OF-CELLULAR-EXTRACTS:3310246267:martin
      CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
      COMPONENT-OF - TU00064
      INVOLVED-IN-REGULATION - REG0-5521
      TYPE-OF-EVIDENCE - :BINDING-OF-CELLULAR-EXTRACTS
      //
  24. If you can represent most of your data as key/value pairs, then at the very least use a BerkeleyDB. http://www.oracle.com/technology/products/berkeley-db/index.html
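The flat-file record on slide 23 is already key/value shaped, so one way to picture the advice on slide 24 is to read such records into hashes keyed by `UNIQUE-ID`. The sketch below uses plain Ruby hashes as a stand-in for a key/value store; `parse_flat_records` is a hypothetical helper, and it assumes only the `KEY - VALUE` lines and `//` terminator visible on the slide (continuation lines and other features of the real format are not handled).

```ruby
# Parse attribute-value records of the form "KEY - VALUE", terminated by "//".
# Repeated keys (for example CITATIONS) are collected into arrays.
def parse_flat_records(text)
  records = []
  current = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    line = line.chomp
    next if line.start_with?("#") || line.strip.empty? # skip comments and blanks
    if line.strip == "//"
      records << current unless current.empty?
      current = Hash.new { |hash, key| hash[key] = [] }
    else
      key, value = line.split(" - ", 2)
      current[key.strip] << value.to_s.strip
    end
  end
  records << current unless current.empty? # tolerate a missing final "//"
  records
end

sample = <<~RECORD
  # Species: E. coli K-12
  UNIQUE-ID - BS86
  TYPES - DNA-Binding-Sites
  ABS-CENTER-POS - 4098761
  CITATIONS - 94018613
  CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
  //
RECORD

records = parse_flat_records(sample)
# Index by UNIQUE-ID, the same shape a key/value store would hold.
by_id = records.to_h { |record| [record["UNIQUE-ID"].first, record] }
puts by_id["BS86"]["ABS-CENTER-POS"].first
puts by_id["BS86"]["CITATIONS"].length
```

Once the data is in this shape, lookups become a question of asking for a key rather than re-parsing the file each time.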
  25. In most cases a relational database is an appropriate choice for bioinformatics data
      • Clean and consolidated (vs. a rat’s nest of files and folders)
      • Improved performance (memory usage and file I/O)
      • Data consistency through constraints and transactions
      • Easily portable (SQL92 standard)
      • Querying (asking questions about data) vs. parsing (reading and loading data)
      • Commonly used data processing functions can be implemented as stored procedures
  26. “But I’m a scientist, not a DBA! Harrumph!” http://www.sqlite.org “...SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine...”
  27. But seriously, don’t write any SQL (What?)
      • Relational Database (MySQL, PostgreSQL, Oracle, etc.)
      • Object Relational Mapper (ORM)
      • Model Instance
  28. Beyond the RDBMS: http://strokedb.com/ http://incubator.apache.org/couchdb http://www.hypertable.org
  29. Thinking in Parallel
  30. Loosely Coupled vs. Tightly Coupled
      Loosely Coupled
      • Each task is independent
      • No synchronous inter-task communication
      • Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database
      • Software: OpenPBS, SGE, Xgrid, PlatformLSF
      Tightly Coupled
      • Tasks are interdependent
      • Synchronous inter-task communication via messaging interface
      • Example: Monte Carlo simulation of 3D protein interactions in cytoplasm
      • Software: OpenMPI, MPICH, PVM
  31. Use your idle CPU cores!
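For loosely coupled work of the kind described on slide 30, one way to put idle cores to use from Ruby is to fork a worker process per slice of the input. This is a minimal sketch under stated assumptions: `parallel_map` and `gc_fraction` are hypothetical helpers, `Kernel#fork` is only available on Unix-like systems, and results must be serializable with `Marshal`.

```ruby
# Fork-based parallel map: each worker process handles one slice of the input
# and sends its results back to the parent through a pipe.
def parallel_map(items, workers: 2)
  return [] if items.empty?
  slice_size = (items.size / workers.to_f).ceil
  readers = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    fork do
      reader.close
      writer.write(Marshal.dump(slice.map { |item| yield item }))
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end
    writer.close # the parent only reads
    reader
  end
  results = readers.flat_map do |reader|
    data = reader.read
    reader.close
    Marshal.load(data)
  end
  Process.waitall # reap the worker processes
  results
end

# Hypothetical per-sequence task: GC fraction of a DNA string.
def gc_fraction(sequence)
  sequence.count("GCgc") / sequence.length.to_f
end

sequences = %w[GGCC ATAT GCAT ATGC]
fractions = parallel_map(sequences, workers: 2) { |seq| gc_fraction(seq) }
puts fractions.inspect
```

Processes are used rather than threads because the standard Ruby interpreter does not run CPU-bound Ruby code on several cores at once within a single process. Results come back in input order because slices are read in the order they were created.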
  32. Start thinking in terms of MapReduce (old hat for Lisp programmers!) Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
  33. map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");

      reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
      [1]
  34. map(String key, String value):
        // key: Sequence alignment file name
        // value: multiple alignment
        for each exon w in value:
          EmitIntermediate(w, CpGIndex);

      reduce(String key, Iterator values):
        // key: an exon
        // values: a list of CpG Index Values
        int result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result/length(values)));
      [1]
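The pseudocode on slide 34 can be sketched as a runnable, single-process Ruby version: map emits one (exon, CpG index) pair per exon, pairs are grouped by exon, and reduce averages each group. Everything named here is an assumption for illustration: `map_reduce` is a hypothetical helper rather than a real framework, the input hashes stand in for parsed alignment files, and `cpg_index` uses the observed/expected CpG ratio as the index, which the slide does not specify.

```ruby
# Minimal single-process MapReduce: map emits [key, value] pairs, pairs are
# grouped by key (the "shuffle" step), and reduce folds each group.
def map_reduce(inputs, mapper:, reducer:)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  inputs.each do |key, value|
    mapper.call(key, value).each { |out_key, out_value| grouped[out_key] << out_value }
  end
  grouped.to_h { |key, values| [key, reducer.call(key, values)] }
end

# Assumed metric: observed/expected CpG ratio of one sequence.
def cpg_index(sequence)
  seq = sequence.upcase
  c = seq.count("C")
  g = seq.count("G")
  return 0.0 if c.zero? || g.zero?
  seq.scan("CG").length * seq.length / (c * g).to_f
end

# Hypothetical input: alignment file name => { exon id => sequence }.
alignments = {
  "human.aln" => { "exon1" => "ACGCGT", "exon2" => "ATATAT" },
  "mouse.aln" => { "exon1" => "ACGTTT", "exon2" => "GGCCAT" }
}

mean_cpg = map_reduce(
  alignments,
  mapper:  ->(_file, exons) { exons.map { |exon, seq| [exon, cpg_index(seq)] } },
  reducer: ->(_exon, values) { values.sum / values.length }
)
puts mean_cpg.inspect
```

The mapper and reducer never share state, which is what lets a real implementation run them on different machines.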
  35. http://sourceforge.net/projects/cloudburst-bio/
  36. MapReduce Implementations: http://hadoop.apache.org/core/ http://skynet.rubyforge.org/ http://discoproject.org/ http://labs.trolltech.com/page/Projects/Threads/QtConcurrent
  37. Embracing Hardware
  38. Single Instruction, Multiple Data (SIMD)
  39. Graphics Processing Unit (GPU): Not just fun and games
  40. GPU Programming is Getting Easier
      • Compute Unified Device Architecture (CUDA): http://www.nvidia.com/cuda
      • OpenCL: http://s08.idav.ucdavis.edu/munshi-opencl.pdf
  41. Field Programmable Gate Arrays (FPGA)
  42. Field Programmable Gate Arrays (FPGA)
  43. Playing nice with others
  44. Data Interchange Formats
      • JSON
      • YAML
      • XML
        • Microformats
        • RDF
  45. person = {
        "name": "Joel Dudley",
        "age": 32,
        "height": 1.83,
        "urls": [
          "http://www.joeldudley.com/",
          "http://www.linkedin.com/in/joeldudley"
        ]
      }

      VS.

      <person>
        <name>Joel Dudley</name>
        <age>32</age>
        <height>1.83</height>
        <urls>
          <url>http://www.joeldudley.com/</url>
          <url>http://www.linkedin.com/in/joeldudley</url>
        </urls>
      </person>
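To make the comparison on slide 45 concrete, here is a small sketch of reading and writing the JSON form with Ruby's bundled `json` library. The document text is taken from the slide; the variable names are illustrative.

```ruby
require "json"

json_text = <<~DOC
  {
    "name": "Joel Dudley",
    "age": 32,
    "height": 1.83,
    "urls": [
      "http://www.joeldudley.com/",
      "http://www.linkedin.com/in/joeldudley"
    ]
  }
DOC

# Parsing yields ordinary Ruby hashes, arrays, numbers, and strings.
person = JSON.parse(json_text)
puts person["name"]
puts person["urls"].length

# Serializing and parsing again gives back an equal structure.
round_trip = JSON.parse(JSON.generate(person))
puts round_trip == person
```

Numbers and lists keep their types across the round trip, whereas the XML form would need a schema or convention to say that `age` is an integer.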
  46. Web Services
      • Remote Procedure Call (RPC)
      • Representational State Transfer (ReST)
      • SOAP
      • ActiveResource Pattern
  47. class Video < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"

        ## To search by categories and tags
        def self.search_by_tags (*options)
          from_urls = []
          if options.last.is_a? Hash
            excludes = options.slice!(options.length-1)
            if excludes[:exclude].kind_of? Array
              from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
            else
              from_urls << "-"+excludes[:exclude]
            end
          end
          from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
          from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
          from_urls.delete_if {|x| x.empty?}
          self.find(:all,:from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
        end
      end

      class User < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end

      class Standardfeed < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end

      class Playlist < ActiveYouTube
        self.site = "http://gdata.youtube.com/feeds/api"
      end
  48. search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
      puts search.entry.length

      ## video information of id = ZTUVgYoeN_o
      vid = Video.find("ZTUVgYoeN_o")
      puts vid.group.content[0].url

      ## video comments
      comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
      puts comments.entry[0].link[2].href

      ## searching with category/tags
      results = Video.search_by_tags("Comedy")
      puts results[0].entry[0].title
      # more examples:
      # Video.search_by_tags("Comedy", "dog")
      # Video.search_by_tags("News","Sports","football", :exclude=>"soccer")
  49. Teamwork
  50. Be Agile: Manifesto for Agile Software Development
      We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
      • Individuals and interactions over processes and tools
      • Working software over comprehensive documentation
      • Customer collaboration over contract negotiation
      • Responding to change over following a plan
      That is, while there is value in the items on the right, we value the items on the left more.
      http://agilemanifesto.org/
  51. Be Agile
      As a [role], I want to [goal], so I can [reason].
      • Storyboard
      • Iterate!
      • Feedback
      • Acceptance Testing
      • Unit Testing
  52. Automate Development: http://nant.sourceforge.net/ http://www.scons.org/ http://www.capify.org/ http://nant.sourceforge.net/
  53. Lightweight Tools for Project Management
  54. Closing Remarks
      • Focus on the goal (Biology/Medicine)
      • Don’t be clever (you’ll trick yourself)
      • Value your time
      • Outsource everything but genius
      • Use the tools available to you
      • Have fun!