This document provides tips and tricks for software engineering in bioinformatics. It discusses using object-oriented software design principles like encapsulation and inheritance. It also covers best practices like automating documentation, performance optimization, working with data using databases and file formats, parallel and distributed computing, hardware acceleration, and web services.
17. Don’t be afraid to use more than three letters
to define a variable!
#!/usr/bin/perl
# 472-byte qrpff, Keith Winstein and Marc Horowitz <sipb-iap-dvd@mit.edu>
# MPEG 2 PS VOB file -> descrambled output on stdout.
# usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
# where k1..k5 are the title key bytes in least to most-significant order
s''$/=2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@
b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9
|256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)
^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"xbntdxbzx14d")[_/16%8]);E
^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8
)+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/$$&/g;s/q/pack+/g;eval
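For contrast, a minimal Ruby sketch of the same advice. Both methods compute the same value; only the second one explains itself. The method names and the sample sequence are illustrative and not part of the original slides.

```ruby
# Terse version: correct, but the reader has to reverse-engineer the intent.
def gc(s)
  n = s.count('GCgc')
  n.to_f / s.length
end

# Descriptive version: the same computation with names that say what they hold.
def gc_fraction(dna_sequence)
  gc_base_count = dna_sequence.count('GCgc')
  gc_base_count.to_f / dna_sequence.length
end

puts gc_fraction('ATGCGC')
```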
19. module GraphBuilder
  LINE_TYPES = [:solid, :dashed, :dotted]

  module Nodes
    SHAPE_TYPES = [:rectangle, :roundrectangle, :ellipse, :parallelogram, :hexagon, :octagon,
                   :diamond, :triangle, :trapezoid, :trapezoid2, :rectangle3d]

    class BaseNode
      attr_accessor :label, :geometry, :fill_colors, :outline, :degree, :data

      def initialize(opts={})
        @opts = {
          :form => :ellipse,
          :height => 50.0,
          :width => 50.0,
          :label => "GraphNode#{self.object_id}",
          :line_type => :solid,
          :fill_color => {:R=>255, :G=>204, :B=>0, :A=>255},
          :fill_color2 => nil,
          :data => {},
          :outline_color => {:R=>0, :G=>0, :B=>0, :A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
        }.merge(opts)
        @data = @opts[:data] # for storing application-specific data
        @label = Labels::NodeLabel.new(@opts[:label])
        @geometry = {:pos_x=>0.0, :pos_y=>0.0, :width=>1.0, :height=>1.0}
        @fill_colors = [@opts[:fill_color], @opts[:fill_color2]]
        @outline = {:line_type=>@opts[:line_type], :color=>@opts[:outline_color]}
        @degree = {:in=>0, :out=>0}
      end

      # Options needed to build a copy of this node.
      # Note: `text` is not defined in this excerpt, and @form is only set by subclasses such as ShapeNode.
      def clone_params
        {
          :label => text,
          :fill_color => @fill_colors.first,
          :form => @form,
          :height => @geometry[:height],
          :width => @geometry[:width]
        }
      end
    end

    class ShapeNode < BaseNode
      attr_accessor :form

      def initialize(opts={})
        super
        @form = @opts[:form]
        @geometry[:height] = @opts[:height]
        @geometry[:width] = @opts[:width]
      end
    end
  end
end
20. To Subclass or not to subclass? Use mixins!
require 'bigdecimal/math' # provides BigMath.log for the fallback in geometric_mean

class Array
  def arithmetic_mean
    inject(0.0) { |sum, x| x = x.real if x.is_a?(Complex); sum + x.to_f } / length.to_f
  end

  # Only strictly positive values contribute to the geometric mean.
  def geometric_mean
    begin
      Math.exp(select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
    rescue Errno::ERANGE
      Math.exp(select { |x| x > 0.0 }.collect { |x| BigMath.log(x, 50) }.arithmetic_mean)
    end
  end

  def median
    sorted = sort # the median is only meaningful on ordered values
    if sorted.length.odd?
      sorted[sorted.length / 2]
    else
      upper_median = sorted[sorted.length / 2]
      lower_median = sorted[(sorted.length / 2) - 1]
      [upper_median, lower_median].arithmetic_mean
    end
  end

  # Sample standard deviation (divides by n - 1).
  def standard_deviation
    mean = arithmetic_mean
    deviations = map { |x| x - mean }
    sqr_deviations = deviations.map { |x| x**2 }
    sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum, x| sum + x }
    Math.sqrt(sum_sqr_deviations / (length - 1).to_f)
  end
  alias_method :sd, :standard_deviation

  # Array#shuffle and Array#shuffle! are built in since Ruby 1.8.7;
  # the two definitions below override them.
  def shuffle
    sort_by { rand }
  end

  def shuffle!
    replace shuffle
  end
end
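The slide's code reopens Array directly. A sketch of the same idea packaged as an actual mixin module, assuming only that the host class provides `each`: the names `DescriptiveStatistics` and `ExpressionProfile` are hypothetical and chosen for illustration.

```ruby
# Hypothetical mixin: usable by any class that includes Enumerable.
module DescriptiveStatistics
  def arithmetic_mean
    values = to_a
    values.sum(0.0) / values.length
  end

  def median
    sorted = to_a.sort
    middle = sorted.length / 2
    sorted.length.odd? ? sorted[middle] : (sorted[middle - 1] + sorted[middle]) / 2.0
  end
end

# Hypothetical host class: expression values measured for one gene.
class ExpressionProfile
  include Enumerable
  include DescriptiveStatistics

  def initialize(values)
    @values = values
  end

  def each(&block)
    @values.each(&block)
  end
end

profile = ExpressionProfile.new([4.0, 1.0, 3.0, 2.0])
puts profile.arithmetic_mean
puts profile.median

# The module can also be mixed into one Array instance, leaving the core class untouched.
readings = [5, 1, 3].extend(DescriptiveStatistics)
puts readings.median
```

The behaviour lives in one module and can be shared by unrelated classes without a common superclass and without changing Array for every other library in the process.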
21. Documenting code sucks! Automate it.
• Come up with a convention for your “headers”
• Use automated documentation generation tools
  • JavaDoc
  • RDoc
  • Pydoc / Epydoc
• Save code snippets in a searchable repository
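One possible header convention, sketched with RDoc markup (`====` headings, `+name+::` labeled entries, indented example). The section names and the `gc_content` method are assumptions made for this example, not a convention taken from the slides.

```ruby
# Computes the fraction of G and C bases in a DNA sequence.
#
# ==== Parameters
# +sequence+:: a String of nucleotides (upper or lower case)
#
# ==== Returns
# A Float between 0.0 and 1.0; 0.0 for an empty sequence.
#
# ==== Example
#   gc_content('ATGC')
def gc_content(sequence)
  return 0.0 if sequence.empty?

  sequence.count('GCgc').to_f / sequence.length
end

puts gc_content('ATGC')
```

Running the `rdoc` command over a file written this way turns the comment block into browsable HTML, so the header is written once and the documentation follows from it.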
22. A little performance optimization goes a long way
• General tools
  • DTrace
  • strace
  • gdb
• Language specific
  • ruby-prof
  • Psyco/Pyrex
  • JBoss Profiler/JIT
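Before reaching for a full profiler, it can help to time the candidates. A minimal sketch using Ruby's standard `Benchmark` library; the two complement methods and the test sequence are made up for this example.

```ruby
require 'benchmark'

COMPLEMENT = { 'A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A' }.freeze
SEQUENCE = 'ACGT' * 5_000

# Allocates a new String on every iteration.
def complement_slow(sequence)
  result = ''
  sequence.each_char { |base| result += COMPLEMENT[base] }
  result
end

# Translates all characters in a single call.
def complement_fast(sequence)
  sequence.tr('ACGT', 'TGCA')
end

# Prints user, system and wall-clock time for each labeled block.
Benchmark.bm(14) do |report|
  report.report('char by char:') { complement_slow(SEQUENCE) }
  report.report('String#tr:')    { complement_fast(SEQUENCE) }
end
```

Timings vary by machine and Ruby version, so the point is the comparison on your own data, not any particular number.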
25. If you can represent most of your data as key/value
pairs, then at the very least use Berkeley DB
http://www.oracle.com/technology/products/berkeley-db/index.html
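A sketch of the key/value idea. It uses `PStore` from Ruby's standard library as a stand-in, because Berkeley DB bindings are a separate install; the gene IDs and annotation fields are hypothetical.

```ruby
require 'pstore'
require 'tmpdir'

# Writes two annotation records keyed by gene ID, then reads one back.
def round_trip_annotations(path)
  store = PStore.new(path)

  # Writes happen inside a read-write transaction.
  store.transaction do
    store['gene_0001'] = { 'symbol' => 'geneA', 'length' => 1200 }
    store['gene_0002'] = { 'symbol' => 'geneB', 'length' => 850 }
  end

  # Passing true opens a read-only transaction.
  record = nil
  store.transaction(true) { record = store['gene_0002'] }
  record
end

Dir.mktmpdir do |dir|
  record = round_trip_annotations(File.join(dir, 'annotations.pstore'))
  puts "#{record['symbol']}: #{record['length']} bp"
end
```

Lookups go straight to a key instead of re-parsing a flat file on every run, which is the benefit the slide is pointing at.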
26. In most cases a relational database is an appropriate choice for bioinformatics data
• Clean and consolidated (vs. a rat’s nest of files and folders)
• Improved performance (memory usage and file I/O)
• Data consistency through constraints and transactions
• Easily portable (SQL-92 standard)
• Querying (asking questions about data) vs. parsing (reading and loading data)
• Commonly used data processing functions can be implemented as stored procedures
27. “But I’m a scientist, not a DBA! Harrumph!”
http://www.sqlite.org
“...SQLite is a software library that implements a self-contained, serverless,
zero-configuration, transactional SQL database engine...”
28. But seriously, don’t write any SQL (What?)
[Diagram: Relational Database (MySQL, PostgreSQL, Oracle, etc.) → Object Relational Mapper (ORM) → Model → Instance]
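A toy sketch of the table-to-model-to-instance mapping. It is not a real ORM: an in-memory array stands in for the database table, and the `Gene` class, its columns and its rows are invented for illustration. The finder names `find` and `where` are borrowed from ActiveRecord's naming; real ORMs such as ActiveRecord or Sequel generate the SQL for you.

```ruby
# Stand-in for a database table: an array of rows. A real ORM would issue SQL instead.
GENE_ROWS = [
  { id: 1, symbol: 'geneA', chromosome: '1' },
  { id: 2, symbol: 'geneB', chromosome: 'X' }
].freeze

# The model: one class per table, one instance per row.
class Gene
  attr_reader :id, :symbol, :chromosome

  def initialize(row)
    @id = row[:id]
    @symbol = row[:symbol]
    @chromosome = row[:chromosome]
  end

  # Class-level finders take the place of hand-written SELECT statements.
  def self.find(id)
    row = GENE_ROWS.find { |candidate| candidate[:id] == id }
    row && new(row)
  end

  def self.where(conditions)
    matching = GENE_ROWS.select do |row|
      conditions.all? { |column, value| row[column] == value }
    end
    matching.map { |row| new(row) }
  end
end

gene = Gene.find(2)
puts "#{gene.symbol} on chromosome #{gene.chromosome}"
puts Gene.where(chromosome: '1').map(&:symbol).inspect
```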
31. Loosely Coupled vs. Tightly Coupled
Loosely Coupled
• Each task is independent
• No synchronous inter-task communication
• Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database
• Software: OpenPBS, SGE, Xgrid, PlatformLSF
Tightly Coupled
• Tasks are interdependent
• Synchronous inter-task communication via messaging interface
• Example: Monte Carlo simulation of 3D protein interactions in cytoplasm
• Software: OpenMPI, MPICH, PVM
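A small sketch of the loosely coupled pattern on one machine: every task gets its own input, shares nothing with the others, and results are collected only at the end. The family names, sequences and per-task computation are placeholders; on a cluster, a scheduler such as SGE would dispatch the tasks instead of threads.

```ruby
# Hypothetical per-family task: reads only its own input.
def gc_fraction_for(sequence)
  sequence.count('GC').to_f / sequence.length
end

# Runs one worker per task and gathers the results once all have finished.
def run_independent_tasks(families)
  workers = families.map do |name, sequence|
    Thread.new { [name, gc_fraction_for(sequence)] }
  end
  workers.map(&:value).to_h
end

families = {
  'family_1' => 'ATGCGCGC',
  'family_2' => 'ATATATAT',
  'family_3' => 'GGCCAATT'
}

run_independent_tasks(families).each do |name, fraction|
  puts "#{name}: #{fraction}"
end
```

In the standard Ruby interpreter, threads show the structure rather than deliver a CPU speed-up. Because no task waits on another, the same code shape moves to separate processes or cluster jobs without redesign.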
33. Start thinking in terms of MapReduce
(old hat for Lisp programmers!)
Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
34. map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result)); [1]
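The word-count pseudocode translates almost line for line into plain Ruby. This is a single-process sketch; the grouping step that a MapReduce framework performs between map and reduce is written out by hand, and the method names and sample documents are made up.

```ruby
# Map step: emit a [word, 1] pair for every word in a document.
def map_words(_document_name, contents)
  contents.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Shuffle step: group intermediate values by key (done by the framework on a real cluster).
def group_by_key(pairs)
  pairs.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(key, value), grouped|
    grouped[key] << value
  end
end

# Reduce step: sum the counts collected for one word.
def reduce_counts(_word, counts)
  counts.sum
end

def word_count(documents)
  pairs = documents.flat_map { |name, contents| map_words(name, contents) }
  group_by_key(pairs).map { |word, counts| [word, reduce_counts(word, counts)] }.to_h
end

documents = { 'doc1' => 'the cat sat', 'doc2' => 'the dog sat down' }
puts word_count(documents).inspect
```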
35. map(String key, String value):
    // key: Sequence alignment file name
    // value: multiple alignment
    for each exon w in value:
        EmitIntermediate(w, CpGIndex);

reduce(String key, Iterator values):
    // key: an exon
    // values: a list of CpG Index Values
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result / length(values))); [1]
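A runnable sketch of the same averaging job in Ruby. The slide does not define its CpG index, so this example assumes the observed/expected CpG ratio; the input layout (exon ID mapped to its aligned sequences) and all method names are likewise assumptions.

```ruby
# Observed/expected CpG ratio, assumed here as the "CpG index". Alignment gaps are ignored.
def cpg_index(sequence)
  seq = sequence.upcase.delete('-')
  c_count = seq.count('C')
  g_count = seq.count('G')
  return 0.0 if c_count.zero? || g_count.zero?

  seq.scan('CG').length * seq.length.to_f / (c_count * g_count)
end

# Map step: one [exon_id, cpg_index] pair per aligned sequence.
def map_alignment(alignment)
  alignment.flat_map do |exon_id, sequences|
    sequences.map { |sequence| [exon_id, cpg_index(sequence)] }
  end
end

# Reduce step: average the values collected for one exon.
def reduce_mean(values)
  values.sum(0.0) / values.length
end

def mean_cpg_by_exon(alignment)
  grouped = map_alignment(alignment).group_by(&:first)
  grouped.transform_values { |pairs| reduce_mean(pairs.map(&:last)) }
end

alignment = { 'exon_1' => %w[CGCG ATAT], 'exon_2' => %w[ACGT] }
puts mean_cpg_by_exon(alignment).inspect
```

Only the map and reduce bodies differ from word counting; the grouping by key is unchanged, which is why the pattern carries over to sequence data so easily.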
49. Web Services
• Remote Procedure Call (RPC)
• Representational State Transfer (REST)
• SOAP
• ActiveResource Pattern
50. class Video < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"

  ## To search by categories and tags
  def self.search_by_tags(*options)
    from_urls = []
    if options.last.is_a? Hash
      excludes = options.slice!(options.length-1)
      if excludes[:exclude].kind_of? Array
        from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
      else
        from_urls << "-"+excludes[:exclude]
      end
    end
    from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
    from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
    from_urls.delete_if {|x| x.empty?}
    self.find(:all, :from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
  end
end

class User < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Standardfeed < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Playlist < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end
51. search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
puts search.entry.length

## video information of id = ZTUVgYoeN_o
vid = Video.find("ZTUVgYoeN_o")
puts vid.group.content[0].url

## video comments
comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
puts comments.entry[0].link[2].href

## searching with category/tags
results = Video.search_by_tags("Comedy")
puts results[0].entry[0].title

# more examples:
# Video.search_by_tags("Comedy", "dog")
# Video.search_by_tags("News", "Sports", "football", :exclude=>"soccer")
53. Be Agile
Manifesto for Agile Software Development
We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:
• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan
That is, while there is value in the items on the right, we value the
items on the left more.
http://agilemanifesto.org/
54. Be Agile
As a [role], I want to [goal], so I can [reason].
• Storyboard
• Iterate!
• Feedback
• Acceptance Testing
• Unit Testing
57. Closing Remarks
• Focus on the goal (Biology/Medicine)
• Don’t be clever (you’ll trick yourself)
• Value your time
• Outsource everything but genius
• Use the tools available to you
• Have fun!