Role of Programming in
Computational
Biology
Atreyi Banerjee
Programming languages
• Self-contained language
– Platform-independent
– Used to write O/S
– C (imperative, procedural)
– C++, Java (object-oriented)
– Lisp, Haskell, Prolog (functional)
• Scripting language
– Closely tied to O/S
– Perl, Python, Ruby
• Domain-specific language
– R (statistics)
– MatLab (numerics)
– SQL (databases)
• An O/S typically manages…
– Devices (see above)
– Files & directories
– Users & permissions
– Processes & signals
Role of Programming

Reduces time

Reduces money on R&D

Reduces human effort

Streamlines workflow

A standard algorithm gives consistent results and
assures reproducibility

Reduces human error
Applications of Programming

Data Mining

Genome Annotation

Microarray Analysis

Website Development

Tool Development

Statistical Analyses

Phylogeny

Genome Wide Association Studies (GWAS)

Next Generation Sequencing studies
Bioinformatics “pipelines” often involve
chaining together multiple tools
Perl was the most-used bioinformatics language in a 2008 survey
Figure: Most popular bioinformatics programming languages
(Bioinformatics career survey, 2008, Michael Barton)
PERL

Practical Extraction & Report Language

Interpreted, not compiled
− Fast edit-run-revise cycle
• Procedural & imperative
− Sequence of instructions (“control flow”)
− Variables, subroutines
• Syntax close to C (the de facto standard minimal language)
− Weakly typed (unlike C)
− Redundant, not minimal (“there’s more than one way to do it”)
− “Syntactic sugar”

High-level data structures & algorithms
– Hashes, arrays

Operating System support (files, processes, signals)

String manipulation
Pros and Cons of Perl
• Reasons for Perl’s popularity in bioinformatics (Lincoln Stein)
– Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text
– Perl is forgiving
– Perl is component-oriented
– Perl is easy to write and fast to develop in
– Perl is a good prototyping language
– Perl is a good language for Web CGI scripting
• Problems with Perl
– Hard to read (“there’s more than one way to do it”, cryptic syntax…)
– Too forgiving (no strong typing, allows sloppy code…)
General principles of programming

Make incremental changes

Test everything you do
− the edit-run-revise cycle

Write so that others can read it
− (when possible, write with others)

Think before you write

Use a good text editor

Good debugging style
Regular expressions

Perl provides a pattern-matching engine

Patterns are called regular expressions

They are extremely powerful
− probably Perl's strongest feature, compared to other
languages

Often called "regexps" for short
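As a small illustration of the pattern-matching engine, the sketch below pulls the name out of a FASTA header line and counts occurrences of a motif. The header text and the DNA string are made-up examples, not taken from the slides; GAATTC is the EcoRI recognition site.

```perl
use strict;
use warnings;

# Hypothetical FASTA header and sequence, for illustration only.
my $header = ">SEQ1 example sequence";
my $dna    = "ATGCGAATTCATGGAATTCTT";

# Capture the first run of non-space characters after the ">".
my ($name) = $header =~ /^>(\S+)/;

# Count non-overlapping matches of the motif: the match runs in list
# context, and assigning that list to a scalar yields its size.
my $count = () = $dna =~ /GAATTC/g;

print "$name has $count GAATTC site(s)\n";
```

The parentheses in the first pattern mark the part of the match to keep; the /g flag in the second asks for every match rather than only the first.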
Programming in PERL

Data Types

Scalars ($)

Arrays (@)

Hashes (%)

Logical Operators

AND (&&),

OR (||),

NOT (!)

Arithmetic Operators (+, -, *, /)

CONDITIONS

If else

Elsif ladder

LOOPS

For

While

Foreach

Default Variables

$_ default variable

@_ default array
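The constructs listed above can be seen together in one short script. This is only a sketch: the codon table is deliberately tiny and the variable names are illustrative.

```perl
use strict;
use warnings;

my $seq    = "ATGGCCTAA";                   # scalar
my @codons = ("ATG", "GCC", "TAA");         # array
my %aa     = (ATG => "Met", GCC => "Ala");  # hash: codon -> amino acid

my @protein;
foreach my $codon (@codons) {
    if (exists $aa{$codon}) {
        push @protein, $aa{$codon};
    } elsif ($codon eq "TAA") {
        last;                   # stop codon ends translation
    } else {
        push @protein, "Xaa";   # codon missing from the toy table
    }
}

# $_ is the default loop variable when none is named.
my $total = 0;
$total += length($_) for @codons;

# @_ holds the arguments passed to a subroutine.
sub first_arg { return $_[0]; }

print "Sequence $seq has ", length($seq), " bases\n";
print "Protein: @protein\n";
print "Total bases in codons: $total\n";
```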
Finding all sequence lengths
(Flowchart)
• Start: open file
• Read line; repeat until end of file:
− Remove "\n" newline character at end of line
− If the line starts with ">" (sequence name): unless this is the first sequence, print last sequence length; record the name; reset running total of current sequence length
− Otherwise (sequence data): add length of line to running total
• At end of file: print last sequence length, then stop
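The flowchart translates fairly directly into Perl. The version below is a sketch rather than the author's code: the subroutine name seq_lengths and the two-record FASTA text are assumptions, and the data are read through an in-memory filehandle so that the example runs without an input file.

```perl
use strict;
use warnings;

# Returns a list of [name, length] pairs, one per FASTA record.
sub seq_lengths {
    my ($fh) = @_;
    my (@result, $name, $len);
    while (my $line = <$fh>) {
        chomp $line;                       # remove trailing "\n"
        if ($line =~ /^>(\S*)/) {          # header line: a new sequence starts
            push @result, [$name, $len] if defined $name;   # not the first one
            ($name, $len) = ($1, 0);       # record name, reset running total
        } elsif (defined $name) {
            $len += length $line;          # add line length to running total
        }
    }
    push @result, [$name, $len] if defined $name;           # last sequence
    return @result;
}

# Made-up FASTA text, for illustration only.
my $fasta = ">seqA\nACGT\nAC\n>seqB\nGGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my @lengths = seq_lengths($fh);
close $fh;

print "$_->[0]\t$_->[1]\n" for @lengths;
```

Returning the pairs instead of printing inside the loop keeps the subroutine reusable; to read a real file, open it by name and pass that filehandle instead.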
DNA Microarrays
Normalizing microarray data
• Often microarray data are normalized as a precursor
to further analysis (e.g. clustering)
• This can eliminate systematic bias; e.g.
− if every level for a particular gene is elevated, this might
signal a problem with the probe for that gene
− if every level for a particular experiment is elevated, there
might have been a problem with that experiment, or with
the subsequent image analysis
• Normalization is crude (it can eliminate real signal as
well as noise), but common
Rescaling an array

For each element of the array:
add a, then multiply by b
@array = (1, 3, 5, 7, 9);
print "Array before rescaling: @array\n";
rescale_array (\@array, -1, 2);   # pass a reference so the sub can modify @array
print "Array after rescaling: @array\n";
sub rescale_array {
    my ($arrayRef, $a, $b) = @_;
    foreach my $x (@$arrayRef) {   # $x aliases each element in turn
        $x = ($x + $a) * $b;
    }
}
Array before rescaling: 1 3 5 7 9
Array after rescaling: 0 4 8 12 16
Array is passed by reference
Microarray expression data

A simple format with tab-separated fields

First line contains experiment names

Subsequent lines contain:
− gene name
− expression levels for each experiment
* EmbryoStage1 EmbryoStage2 EmbryoStage3 ...
Cyp12a5 104.556 102.441 55.643 ...
MRG15 4590.15 6691.11 9472.22 ...
Cop 33.12 56.3 66.21 ...
bor 5512.36 3315.12 1044.13 ...
Bx42 1045.1 632.7 200.11 ...
... ... ... ...
Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn…
Reading a file of expression data
sub read_expr {
    my ($filename) = @_;
    open EXPR, "<$filename" or die "Cannot open $filename: $!";
    my $firstLine = <EXPR>;
    chomp $firstLine;
    my @experiment = split /\t/, $firstLine;
    shift @experiment;   # drop the placeholder above the gene-name column
    my %expr;
    while (my $line = <EXPR>) {
        chomp $line;
        my ($gene, @data) = split /\t/, $line;
        if (@data+0 != @experiment+0) {
            warn "Line has wrong number of fields\n";
        }
        $expr{$gene} = \@data;   # store a reference to this gene's data
    }
    close EXPR;
    return (\@experiment, \%expr);
}
Note use of
scalar context
to compare
array sizes
Reference to array of experiment names
Reference to hash of arrays
(hash key is gene name, array
elements are expression data)
Normalizing by gene

A program to normalize expression data from
a set of microarray experiments

Normalizes by gene
($experiment, $expr) = read_expr ("expr.txt");
while (($geneName, $lineRef) = each %$expr) {
normalize_array ($lineRef);
}
sub normalize_array {
my ($data) = @_;
my ($mean, $sd) = mean_sd (@$data);
@$data= map (($_ - $mean) / $sd, @$data);
}
NB $data
is a reference
to an array
Could also use the following:
rescale_array($data,-$mean,1/$sd);
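The slides call mean_sd without showing it. A minimal sketch is given below, assuming it returns the mean and the sample standard deviation (dividing by n - 1); the original helper may well divide by n instead.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (mean, sample standard deviation) of a list.
sub mean_sd {
    my @x = @_;
    my $n = @x;
    die "mean_sd needs at least two values\n" if $n < 2;
    my $mean = 0;
    $mean += $_ / $n for @x;
    my $ss = 0;                              # sum of squared deviations
    $ss += ($_ - $mean) ** 2 for @x;
    return ($mean, sqrt($ss / ($n - 1)));
}

my ($mean, $sd) = mean_sd(2, 4, 4, 4, 5, 5, 7, 9);
printf "mean=%.3f sd=%.3f\n", $mean, $sd;
```

A row in which every value is identical has a standard deviation of zero, so normalize_array would divide by zero on it; real data need a check for that case.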
Normalizing by column

Remaps gene arrays to column arrays
($experiment, $expr)
= read_expr ("expr.txt");
my @genes = sort keys %$expr;
for ($i = 0; $i < @$experiment; ++$i) {
my @col;
foreach $j (0..@genes-1) {
$col[$j] = $expr->{$genes[$j]}->[$i];
}
normalize_array(\@col);
foreach $j (0..@genes-1) {
$expr->{$genes[$j]}->[$i] = $col[$j];
}
}
Puts column
data in @col
Puts @col
back into %expr
Normalizes (note use
of reference)
Genome annotations
GFF annotation format
• Nine-column tab-delimited format for simple annotations:
• Many of these now obsolete, but name/start/end/strand (and sometimes
type) are useful
• Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)
SEQ1 EMBL atg 103 105 . + 0 group1
SEQ1 EMBL exon 103 172 . + 0 group1
SEQ1 EMBL splice5 172 173 . + . group1
SEQ1 netgene splice5 172 173 0.94 + . group1
SEQ1 genie sp5-20 163 182 2.3 + . group1
SEQ1 genie sp5-10 168 177 2.1 + . group1
SEQ2 grail ATG 17 19 2.1 - 0 group2
Columns, left to right:
− Sequence name
− Program
− Feature type
− Start residue (starts at 1)
− End residue (starts at 1)
− Score
− Strand (+ or -)
− Coding frame ("." if not applicable)
− Group
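To make the nine columns easier to work with, one line can be split into a hash keyed by column name. This is a sketch: the key names and the subroutine name parse_gff_line are illustrative choices, not part of the GFF format.

```perl
use strict;
use warnings;

# One key per GFF column, in file order (names chosen for this example).
my @GFF_FIELDS = qw(seqname source feature start end score strand frame group);

sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line, 9;
    my %rec;
    @rec{@GFF_FIELDS} = @values;   # hash slice: pair names with values
    return \%rec;
}

my $rec = parse_gff_line("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0\tgroup1\n");
my $len = $rec->{end} - $rec->{start} + 1;   # coordinates are 1-based, inclusive
print "$rec->{feature} on $rec->{seqname}: $len residues\n";
```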
Artemis
A tool for genome annotation
Reading a GFF file
• This subroutine reads a GFF file
• Each line is made into an array via the split command
• The subroutine returns an array of such arrays
sub read_GFF {
    my ($filename) = @_;
    open GFF, "<$filename" or die "Cannot open $filename: $!";
    my @gff;
    while (my $line = <GFF>) {
        chomp $line;
        my @data = split /\t/, $line, 9;
        push @gff, \@data;
    }
    close GFF;
    return @gff;
}
Splits the line into at most nine
fields, separated by tabs ("\t")
Appends a reference to @data
to the @gff array
Writing a GFF file
• We should be able to write as well as read all datatypes
• Each array is made into a line via the join command
• Arguments: filename & reference to array of arrays
sub write_GFF {
my ($filename, $gffRef) = @_;
open GFF, ">$filename" or die $!;
foreach my $gff (@$gffRef) {
print GFF join ("\t", @$gff), "\n";
}
close GFF or die $!;
}
open evaluates FALSE if
the file failed to open, and
$! contains the error message
close evaluates FALSE if
there was an error with the file
GFF intersect detection
• Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments
• If they don't overlap, there are three possibilities:
− name1 and name2 are different;
− name1 = name2 but start1 > end2;
− name1 = name2 but start2 > end1;
• Checking every possible pair takes time proportional to N^2, where N is
the number of GFF lines (how can this be improved?)
Self-intersection of a GFF file
sub self_intersect_GFF {
my @gff = @_;
my @intersect;
foreach $igff (@gff) {
foreach $jgff (@gff) {
if ($igff ne $jgff) {
if ($$igff[0] eq $$jgff[0]) {
if (!($$igff[3] > $$jgff[4]
|| $$jgff[3] > $$igff[4])) {
push @intersect, $igff;
last;
}
}
}
}
}
return @intersect;
}
Note: this code is slow.
Vast improvements in
speed can be gained if
we sort the @gff array
before checking for
intersection.
Fields 0, 3 and 4 of the
GFF line are the sequence
name, start and end co-
ordinates of the feature
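One way to realise the sorting idea is a single sweep over the features, sorted by sequence name and then by start. The sketch below is an assumption about what such a version could look like, not the author's code. It remembers, for the current sequence, the earlier feature with the largest end; a feature overlaps something earlier exactly when its start does not exceed that end. Sorting costs N log N and the sweep is linear. Unlike the original, it returns features in sorted order.

```perl
use strict;
use warnings;

# Returns the features (array refs) that overlap at least one other feature.
sub self_intersect_sorted {
    my @gff = sort { $a->[0] cmp $b->[0] || $a->[3] <=> $b->[3] } @_;
    my (%hit, $prevName, $widest);   # $widest: earlier feature with largest end
    foreach my $f (@gff) {
        if (!defined $prevName || $f->[0] ne $prevName) {
            ($prevName, $widest) = ($f->[0], $f);   # new sequence: restart sweep
            next;
        }
        if ($f->[3] <= $widest->[4]) {              # start <= largest earlier end
            $hit{$_} = 1 for ($f, $widest);         # keys are the stringified refs
        }
        $widest = $f if $f->[4] > $widest->[4];
    }
    return grep { $hit{$_} } @gff;
}

# Made-up features; only fields 0 (name), 3 (start) and 4 (end) matter here.
my @features = (
    ["SEQ1", "demo", "exon", 20, 30],
    ["SEQ1", "demo", "exon",  1, 10],
    ["SEQ1", "demo", "exon",  2,  5],
    ["SEQ2", "demo", "exon",  1, 10],
);
my @overlapping = self_intersect_sorted(@features);
print join("\t", @$_), "\n" for @overlapping;
```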
Converting GFF to sequence
• Puts together several previously-described subroutines
• Namely: read_FASTA read_GFF revcomp print_seq
($gffFile, $seqFile) = @ARGV;
@gff = read_GFF ($gffFile);
%seq = read_FASTA ($seqFile);
foreach $gffLine (@gff) {
$seqName = $gffLine->[0];
$seqStart = $gffLine->[3];
$seqEnd = $gffLine->[4];
$seqStrand = $gffLine->[6];
$seqLen = $seqEnd + 1 - $seqStart;
$subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);
if ($seqStrand eq "-") { $subseq = revcomp ($subseq); }
print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);
}
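The script above relies on revcomp, which is not shown in this part of the deck. A minimal sketch of such a helper follows, assuming plain A/C/G/T sequences (ambiguity codes such as N are left unchanged).

```perl
use strict;
use warnings;

# Hypothetical version of the revcomp helper named on the slide.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;          # scalar context: reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # complement each base, keeping case
    return $rc;
}

print revcomp("ATGCCG"), "\n";
```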
Phylogenetics

Analysis of relationships between organisms
through phylogenetic programs like PHYLIP
can be automated or run on command line
Packages

Perl allows you to organise your subroutines in
packages each with its own namespace

Perl looks for the packages in a list of directories
specified by the array @INC

Many packages available at
http://www.cpan.org/
use PackageName;
PackageName::doSomething();
This line includes a file called
"PackageName.pm" in your code
print "INC dirs: @INC\n";
INC dirs: Perl/lib Perl/site/lib .
The "." means the
directory that the
script is saved in
This invokes a subroutine called doSomething()
in the package called "PackageName.pm"
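A package can be tried out without creating a separate .pm file. In the sketch below the package name SeqUtils and its gc_content subroutine are hypothetical; in practice the package would normally live in SeqUtils.pm somewhere on @INC and be loaded with use.

```perl
use strict;
use warnings;

# Declared inline so that the example runs as a single script.
package SeqUtils;

# Fraction of G and C bases in a sequence (0 for an empty string).
sub gc_content {
    my ($seq) = @_;
    return 0 unless length $seq;
    my $gc = ($seq =~ tr/GCgc//);   # tr with no replacement just counts
    return $gc / length $seq;
}

package main;

# Call the subroutine through its package name.
printf "GC content: %.2f\n", SeqUtils::gc_content("ATGCGC");
```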
Object-oriented programming

Data structures are often associated with code
− FASTA: read_FASTA print_seq revcomp ...
− GFF: read_GFF write_GFF ...
− Expression data: read_expr mean_sd ...

Object-oriented programming makes this association
explicit.

A type of data structure, with an associated set of
subroutines, is called a class

The subroutines themselves are called methods

A particular instance of the class is an object
OOP concepts
• Abstraction
– represent the essentials, hide the details
• Encapsulation
– storing data and subroutines in a single unit
– hiding private data (sometimes all data, via accessors)
• Inheritance
– abstract base interfaces
– multiple derived classes
• Polymorphism
– different derived classes exhibit different behaviors in response
to the same requests
OOP: Analogy
o Messages (the words in the speech balloons, and also perhaps the coffee itself)
o Overloading (Waiter's response to "A coffee", different response to "A black coffee")
o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
o Encapsulation (Customer doesn't need to know about Kitchen)
o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all
humans can speak basic English and hold cups of coffee, etc.)
o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory
(and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.
OOP: Advantages
• Often more intuitive
– Data has behavior
• Modularity
– Interfaces are well-defined
– Implementation details are hidden
• Maintainability
– Easier to debug, extend
• Framework for code libraries
– Graphics & GUIs
– BioPerl, BioJava…
OOP: Jargon
• Member, method
– A variable/subroutine associated with a particular class
• Overriding
– When a derived class implements a method differently from its parent
class
• Constructor, destructor
– Methods called when an object is created/destroyed
• Accessor
– A method that provides [partial] access to hidden data
• Factory
– An [abstract] object that creates other objects
• Singleton
– A class which is only ever instantiated once (i.e. there’s only ever one
object of this class)
– C.f. static member variables, which occur once per class
Objects in Perl
• An object in Perl is usually a reference to a hash
• The method subroutines for an object are found in a
class-specific package
– Command bless $x, MyPackage associates variable
$x with package MyPackage
• Syntax of method calls
– e.g. $x->save();
– this is equivalent to PackageName::save($x);
– Typical constructor: PackageName->new();
– @EXPORT and @EXPORT_OK arrays used to export
method names to user’s namespace
• Many useful Perl objects available at CPAN
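The points above can be seen in a few lines. This is a sketch with a hypothetical class, GeneFeature, whose fields mirror the GFF start and end columns; it is not taken from BioPerl or any CPAN module.

```perl
use strict;
use warnings;

package GeneFeature;   # hypothetical class name

# Constructor: the object is a blessed hash reference.
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, start => $args{start}, end => $args{end} };
    return bless $self, $class;
}

# Accessor for hidden data, and a method computed from it.
sub name { my ($self) = @_; return $self->{name}; }
sub span { my ($self) = @_; return $self->{end} - $self->{start} + 1; }

package main;

my $exon = GeneFeature->new(name => "exon1", start => 103, end => 172);
print $exon->name, " spans ", $exon->span, " residues\n";
```

Calling $exon->span looks up span in the package the reference was blessed into and passes $exon as the first argument.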
Common Gateway Interface
• CGI (Common Gateway Interface)
– Page-based web programming paradigm
• Can construct static (HTML) as well as
dynamic (CGI) web pages.
• CGI.pm (also by Lincoln Stein)
– Perl CGI interface
– runs on a webserver
– allows you to write a program that runs behind a
webpage
• CGI (static, page-based) is gradually being
supplemented by AJAX
GUI

Graphical User Interfaces (GUIs) are standalone
modules created to make the work of an end
user simpler.

Can be built in Perl with Perl/Tk
BioPerl
• Bioperl is a collection of Perl modules that facilitate the
development of Perl scripts for bioinformatics applications.
• A set of Open Source Bioinformatics packages
– largely object-oriented
• Implements sequence and alignments manipulation, accessing
of sequence databases and parsing of the results of various
molecular biology programs including Blast, clustalw, TCoffee,
genscan, ESTscan and HMMER.
• Bioperl enables developing scripts that can analyze large
quantities of sequence data in ways that are typically difficult
or impossible with web based systems.
• Parses BLAST and other programs
• Basis for Ensembl
– the human genome annotation project
– www.ensembl.org
Basic BioPerl modules

Bio::Perl

Bio::Seq

Bio::SeqIO

Bio::Align

Bio::AlignIO

Bio::Tools::Run::StandAloneBlast
BLAST
CLUSTALW
References

http://www.bioperl.org

http://www.cpan.org

http://www.pasteur.fr

Learning Perl

Beginning Perl for Bioinformatics

Mastering Perl for Bioinformatics
Thank You

Programming in Computational Biology

  • 1. Role of Programming in Computational Biology Atreyi Banerjee
  • 2. Programming languages • Self-contained language – Platform-independent – Used to write O/S – C (imperative, procedural) – C++, Java (object-oriented) – Lisp, Haskell, Prolog (functional) • Scripting language – Closely tied to O/S – Perl, Python, Ruby • Domain-specific language – R (statistics) – MatLab (numerics) – SQL (databases)• An O/S typically manages… – Devices (see above) – Files & directories – Users & permissions – Processes & signals
  • 3. Role of Programming  Reduces time  Reduces money on R&D  Reduces human effort  Streamlines workflow  Standard Algorithm unifies data result and assures reproducability  Reduces human error
  • 4. Applications of Programming  Data Mining  Genome Annotation  Microarray Analysis  Website Development  Tool Development  Statistical Analyses  Phylogeny  Genome Wide Association Studies (GWAS)  Next Generation Sequencing studies
  • 5. Bioinformatics “pipelines” often involve chaining together multiple tools
  • 6. Perl is the most-used bioinformatics language Most popular bioinformatics programming languages Bioinformatics career survey, 2008 Michael Barton
  • 7. PERL  Practical Extraction & Report Language  Interpreted, not compiled − Fast edit-run-revise cycle • Procedural & imperative − Sequence of instructions (“control flow”) − Variables, subroutines • Syntax close to C (the de facto standard minimal language) − Weakly typed (unlike C) − Redundant, not minimal (“there’s more than one way to do it”) − “Syntactic sugar”  High-level data structures & algorithms – Hashes, arrays  Operating System support (files, processes, signals)  String manipulation
  • 8. Pros and Cons of Perl • Reasons for Perl’s popularity in bioinformatics (Lincoln Stein) – Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text – Perl is forgiving – Perl is component-oriented – Perl is easy to write and fast to develop in – Perl is a good prototyping language – Perl is a good language for Web CGI scripting • Problems with Perl – Hard to read (“there’s more than one way to do it”, cryptic syntax…) – Too forgiving (no strong typing, allows sloppy code…)
  • 9. General principles of programming  Make incremental changes  Test everything you do − the edit-run-revise cycle  Write so that others can read it − (when possible, write with others)  Think before you write  Use a good text editor  Good debugging style
  • 10. Regular expressions  Perl provides a pattern-matching engine  Patterns are called regular expressions  They are extremely powerful − probably Perl's strongest feature, compared to other languages  Often called "regexps" for short
  • 11. Programming in PERL  Data Types  Scalars ($)  Arrays (@)  Hashes (%)  Conditional Operators  AND (&&),  OR (||),  NOT (!)  Arithmetic Operators (+, -,*, /)
  • 13. Finding all sequence lengths Open file Read line End of file? Line starts with “>” ? Remove “n” newline character at end of line Sequence name Sequence data Add length of line to running totalRecord the name Reset running total of current sequence length First sequence?Print last sequence length Stop noyes yes yes no no Start Print last sequence length
  • 15. Normalizing microarray data • Often microarray data are normalized as a precursor to further analysis (e.g. clustering) • This can eliminate systematic bias; e.g. − if every level for a particular gene is elevated, this might signal a problem with the probe for that gene − if every level for a particular experiment is elevated, there might have been a problem with that experiment, or with the subsequent image analysis • Normalization is crude (it can eliminate real signal as well as noise), but common
  • 16. Rescaling an array • For each element of the array: add a, then multiply by b
      @array = (1, 3, 5, 7, 9);
      print "Array before rescaling: @array\n";
      rescale_array (\@array, -1, 2);     # array is passed by reference
      print "Array after rescaling: @array\n";
      sub rescale_array {
          my ($arrayRef, $a, $b) = @_;
          foreach my $x (@$arrayRef) {    # $x aliases each element, so the array is changed in place
              $x = ($x + $a) * $b;
          }
      }
    Output:
      Array before rescaling: 1 3 5 7 9
      Array after rescaling: 0 4 8 12 16
  • 17. Microarray expression data • A simple format with tab-separated fields • First line contains experiment names • Subsequent lines contain: − gene name − expression levels for each experiment
      *        EmbryoStage1  EmbryoStage2  EmbryoStage3  ...
      Cyp12a5  104.556       102.441       55.643        ...
      MRG15    4590.15       6691.11       9472.22       ...
      Cop      33.12         56.3          66.21         ...
      bor      5512.36       3315.12       1044.13       ...
      Bx42     1045.1        632.7         200.11        ...
      ...      ...           ...           ...
    Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn…
  • 18. Reading a file of expression data
      sub read_expr {
          my ($filename) = @_;
          open EXPR, "<$filename" or die "Cannot open $filename: $!";
          my $firstLine = <EXPR>;
          chomp $firstLine;
          my @experiment = split /\t/, $firstLine;
          shift @experiment;                       # drop the "*" placeholder in the first column
          my %expr;
          while (my $line = <EXPR>) {
              chomp $line;
              my ($gene, @data) = split /\t/, $line;
              if (@data+0 != @experiment+0) {      # note use of scalar context to compare array sizes
                  warn "Line has wrong number of fields\n";
              }
              $expr{$gene} = \@data;               # hash of arrays: key is gene name, elements are expression data
          }
          close EXPR;
          return (\@experiment, \%expr);           # reference to array of experiment names, reference to hash of arrays
      }
  • 19. Normalizing by gene • A program to normalize expression data from a set of microarray experiments • Normalizes by gene
      ($experiment, $expr) = read_expr ("expr.txt");
      while (($geneName, $lineRef) = each %$expr) {
          normalize_array ($lineRef);
      }
      sub normalize_array {
          my ($data) = @_;                         # NB $data is a reference to an array
          my ($mean, $sd) = mean_sd (@$data);
          @$data = map (($_ - $mean) / $sd, @$data);
      }
    Could also use the following: rescale_array($data,-$mean,1/$sd);
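The helper mean_sd is called above but is not defined in this part of the transcript. One possible version is sketched below; it assumes the sample standard deviation (n - 1 in the denominator), which may differ from what the slide author used.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (mean, standard deviation) of a list of numbers.
# Assumption: sample standard deviation, i.e. n - 1 in the denominator.
sub mean_sd {
    my @x = @_;
    my $n = @x;
    die "need at least two values\n" if $n < 2;
    my $mean = 0;
    $mean += $_ / $n for @x;
    my $ss = 0;                              # sum of squared deviations from the mean
    $ss += ($_ - $mean) ** 2 for @x;
    return ($mean, sqrt($ss / ($n - 1)));
}

my ($mean, $sd) = mean_sd(2, 4, 4, 4, 5, 5, 7, 9);
printf "mean=%.3f sd=%.3f\n", $mean, $sd;
```

If every expression level for a gene is identical, the standard deviation is zero and normalize_array would divide by zero, so a real pipeline would probably need to check for that case.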
  • 20. Normalizing by column • Remaps gene arrays to column arrays
      ($experiment, $expr) = read_expr ("expr.txt");
      my @genes = sort keys %$expr;
      for ($i = 0; $i < @$experiment; ++$i) {
          my @col;
          foreach $j (0..@genes-1) {               # puts column data in @col
              $col[$j] = $expr->{$genes[$j]}->[$i];
          }
          normalize_array (\@col);                 # normalizes (note use of reference)
          foreach $j (0..@genes-1) {               # puts @col back into %expr
              $expr->{$genes[$j]}->[$i] = $col[$j];
          }
      }
  • 22. GFF annotation format • Nine-column tab-delimited format for simple annotations:
      1. Sequence name
      2. Program
      3. Feature type
      4. Start residue (starts at 1)
      5. End residue (starts at 1)
      6. Score
      7. Strand (+ or -)
      8. Coding frame ("." if not applicable)
      9. Group
    • Many of these are now obsolete, but name/start/end/strand (and sometimes type) are useful • Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)
      SEQ1  EMBL     atg      103  105  .     +  0  group1
      SEQ1  EMBL     exon     103  172  .     +  0  group1
      SEQ1  EMBL     splice5  172  173  .     +  .  group1
      SEQ1  netgene  splice5  172  173  0.94  +  .  group1
      SEQ1  genie    sp5-20   163  182  2.3   +  .  group1
      SEQ1  genie    sp5-10   168  177  2.1   +  .  group1
      SEQ2  grail    ATG      17   19   2.1   -  0  group2
  • 23. Artemis A tool for genome annotation
  • 24. Reading a GFF file • This subroutine reads a GFF file • Each line is made into an array via the split command • The subroutine returns an array of references to such arrays
      sub read_GFF {
          my ($filename) = @_;
          open GFF, "<$filename";
          my @gff;
          while (my $line = <GFF>) {
              chomp $line;
              my @data = split /\t/, $line, 9;     # splits the line into at most nine fields, separated by tabs ("\t")
              push @gff, \@data;                   # appends a reference to @data to the @gff array
          }
          close GFF;
          return @gff;
      }
  • 25. Writing a GFF file • We should be able to write as well as read all datatypes • Each array is made into a line via the join command • Arguments: filename & reference to array of arrays
      sub write_GFF {
          my ($filename, $gffRef) = @_;
          open GFF, ">$filename" or die $!;        # open evaluates FALSE if the file failed to open, and $! contains the error message
          foreach my $gff (@$gffRef) {
              print GFF join ("\t", @$gff), "\n";
          }
          close GFF or die $!;                     # close evaluates FALSE if there was an error with the file
      }
  • 26. GFF intersect detection • Let (name1,start1,end1) and (name2,start2,end2) be the coordinates of two segments • If they don't overlap, there are three possibilities: • name1 and name2 are different; • name1 = name2 but start1 > end2; • name1 = name2 but start2 > end1; • If none of these holds, the segments overlap • Checking every possible pair takes time proportional to N^2, where N is the number of GFF lines (how can this be improved?)
  • 27. Self-intersection of a GFF file
      sub self_intersect_GFF {
          my @gff = @_;
          my @intersect;
          foreach $igff (@gff) {
              foreach $jgff (@gff) {
                  if ($igff ne $jgff) {                          # skip comparing a feature with itself
                      if ($$igff[0] eq $$jgff[0]) {              # same sequence name
                          if (!($$igff[3] > $$jgff[4] || $$jgff[3] > $$igff[4])) {
                              push @intersect, $igff;
                              last;
                          }
                      }
                  }
              }
          }
          return @intersect;
      }
    Fields 0, 3 and 4 of the GFF line are the sequence name, start and end coordinates of the feature
    Note: this code is slow. Vast improvements in speed can be gained if we sort the @gff array before checking for intersection.
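One way the sort-based speed-up mentioned in the note might look is sketched below. The subroutine name self_intersect_GFF_sorted is hypothetical, and the approach (sort by sequence name and start, then compare each feature with the largest end seen so far and with its immediate successor) is one possible choice rather than the slide author's solution. It takes the same input as self_intersect_GFF but returns the overlapping features in sorted order, not input order.

```perl
use strict;
use warnings;

# Hypothetical sort-based variant: O(N log N) for the sort plus one linear pass.
sub self_intersect_GFF_sorted {
    # Sort by sequence name (field 0), then by start coordinate (field 3).
    my @gff = sort { $a->[0] cmp $b->[0] || $a->[3] <=> $b->[3] } @_;
    my @intersect;
    my ($seq, $maxEnd);
    for my $i (0 .. $#gff) {
        my $f = $gff[$i];
        if (!defined $seq || $f->[0] ne $seq) {      # first feature of a new sequence
            ($seq, $maxEnd) = ($f->[0], undef);
        }
        # Overlaps an earlier feature if some earlier end reaches this start.
        my $hitsEarlier = defined $maxEnd && $maxEnd >= $f->[3];
        # Overlaps a later feature only if it overlaps the very next one.
        my $next = $i < $#gff ? $gff[$i + 1] : undef;
        my $hitsLater = defined $next
            && $next->[0] eq $f->[0]
            && $next->[3] <= $f->[4];
        push @intersect, $f if $hitsEarlier || $hitsLater;
        $maxEnd = $f->[4] if !defined $maxEnd || $f->[4] > $maxEnd;
    }
    return @intersect;
}

# Made-up features, using only the fields the check needs (0, 3 and 4).
my @features = (
    ["SEQ1", "EMBL", "atg",  103, 105],
    ["SEQ1", "EMBL", "exon", 103, 172],
    ["SEQ1", "EMBL", "exon", 200, 210],
    ["SEQ2", "grail", "ATG",  17,  19],
);
print join("\t", @$_), "\n" for self_intersect_GFF_sorted(@features);
```

Because the features are sorted by start, a feature can overlap a later one only if it overlaps its immediate successor, which is why a single pass is enough.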
  • 28. Converting GFF to sequence • Puts together several previously-described subroutines • Namely: read_FASTA, read_GFF, revcomp, print_seq
      ($gffFile, $seqFile) = @ARGV;
      @gff = read_GFF ($gffFile);
      %seq = read_FASTA ($seqFile);
      foreach $gffLine (@gff) {
          $seqName = $gffLine->[0];
          $seqStart = $gffLine->[3];
          $seqEnd = $gffLine->[4];
          $seqStrand = $gffLine->[6];
          $seqLen = $seqEnd + 1 - $seqStart;
          $subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);   # GFF coordinates start at 1, substr offsets at 0
          if ($seqStrand eq "-") { $subseq = revcomp ($subseq); }
          print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);
      }
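The revcomp subroutine used above is defined on a slide that is not part of this section. A minimal sketch, assuming plain A/C/G/T sequences in upper or lower case and no ambiguity codes, might be:

```perl
use strict;
use warnings;

# Hypothetical revcomp: reverse the string, then complement each base.
# Assumption: only A, C, G, T (either case); other characters are left unchanged.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement base by base
    return $rc;
}

print revcomp("ATGCCG"), "\n";        # prints CGGCAT
```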
  • 29. Phylogenetics • Analysis of relationships between organisms with phylogenetic programs such as PHYLIP can be scripted (automated) or run from the command line
  • 30. Packages • Perl allows you to organise your subroutines in packages, each with its own namespace • Perl looks for the packages in a list of directories specified by the array @INC • Many packages are available at http://www.cpan.org/
      use PackageName;                # includes a file called "PackageName.pm" in your code
      PackageName::doSomething();     # invokes a subroutine called doSomething() in the package "PackageName"
      print "INC dirs: @INC\n";
    Output:
      INC dirs: Perl/lib Perl/site/lib .
    The "." means the current working directory (often, but not necessarily, the directory that the script is saved in)
  • 31. Object-oriented programming • Data structures are often associated with code − FASTA: read_FASTA print_seq revcomp ... − GFF: read_GFF write_GFF ... − Expression data: read_expr mean_sd ... • Object-oriented programming makes this association explicit. • A type of data structure, with an associated set of subroutines, is called a class • The subroutines themselves are called methods • A particular instance of the class is an object
  • 32. OOP concepts • Abstraction – represent the essentials, hide the details • Encapsulation – storing data and subroutines in a single unit – hiding private data (sometimes all data, via accessors) • Inheritance – abstract base interfaces – multiple derived classes • Polymorphism – different derived classes exhibit different behaviors in response to the same requests
  • 34. OOP: Analogy o Messages (the words in the speech balloons, and also perhaps the coffee itself) o Overloading (Waiter's response to "A coffee", different response to "A black coffee") o Polymorphism (Waiter and Kitchen implement "A black coffee" differently) o Encapsulation (Customer doesn't need to know about Kitchen) o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.) o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.
  • 35. OOP: Advantages • Often more intuitive – Data has behavior • Modularity – Interfaces are well-defined – Implementation details are hidden • Maintainability – Easier to debug, extend • Framework for code libraries – Graphics & GUIs – BioPerl, BioJava…
  • 36. OOP: Jargon • Member, method – A variable/subroutine associated with a particular class • Overriding – When a derived class implements a method differently from its parent class • Constructor, destructor – Methods called when an object is created/destroyed • Accessor – A method that provides [partial] access to hidden data • Factory – An [abstract] object that creates other objects • Singleton – A class which is only ever instantiated once (i.e. there’s only ever one object of this class) – C.f. static member variables, which occur once per class
  • 37. Objects in Perl • An object in Perl is usually a reference to a hash • The method subroutines for an object are found in a class-specific package – Command bless $x, MyPackage associates variable $x with package MyPackage • Syntax of method calls – e.g. $x->save(); – this is equivalent to PackageName::save($x); – Typical constructor: PackageName->new(); – @EXPORT and @EXPORT_OK arrays used to export method names to user’s namespace • Many useful Perl objects available at CPAN
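A minimal sketch of these ideas is given below: a blessed hash reference, a constructor, and two accessor methods. The class name MySeq, its fields and its method names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical class: a named sequence stored in a blessed hash.
package MySeq;

sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, seq => $args{seq} };
    return bless $self, $class;       # associate the hash reference with the class
}

# Accessors: methods that give access to the data held in the hash.
sub name       { my ($self) = @_; return $self->{name}; }
sub seq_length { my ($self) = @_; return length $self->{seq}; }

package main;

my $s = MySeq->new(name => "SEQ1", seq => "ACGTAC");
printf "%s has length %d\n", $s->name, $s->seq_length;

# The method call above is resolved to the subroutine in the object's package.
printf "%d\n", MySeq::seq_length($s);
```

Calling MySeq::seq_length($s) directly gives the same result here, but the arrow syntax is what allows derived classes to override a method.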
  • 38. Common Gateway Interface • CGI (Common Gateway Interface) – Page-based web programming paradigm • Can construct static (HTML) as well as dynamic (CGI) web pages. • CGI.pm (also by Lincoln Stein) – Perl CGI interface – runs on a webserver – allows you to write a program that runs behind a webpage • CGI (static, page-based) is gradually being supplemented by AJAX
  • 39. GUI • Graphical user interfaces (GUIs) are standalone front ends created to make the work of an end user simpler. • Can be built in Perl with the Perl/Tk module
  • 40. BioPerl • BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. • A set of Open Source bioinformatics packages – largely object-oriented • Implements sequence and alignment manipulation, access to sequence databases, and parsing of the results of various molecular biology programs including BLAST, ClustalW, T-Coffee, Genscan, ESTScan and HMMER. • BioPerl enables developing scripts that can analyze large quantities of sequence data in ways that are typically difficult or impossible with web-based systems. • Basis for Ensembl – the human genome annotation project – www.ensembl.org
  • 42. BLAST