This document discusses the role of programming in computational biology. It begins by describing different types of programming languages like imperative, object-oriented, and functional languages. It then discusses how programming can reduce time, money, effort and errors in computational biology applications. Some key applications of programming in computational biology mentioned are data mining, genome annotation, microarray analysis, phylogenetics, and next generation sequencing studies. The document also discusses popular bioinformatics programming languages like Perl and describes concepts in programming like objects, modules, and the common gateway interface.
In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
this presentation is about bioinformatics. the contents of bioinformatics are as under:
1.Introduction to bioinformatics.
2.Why bioinformatics is necessary?
3.Goals of bioinformatics
4.Field of bioinformatics
5.Where bioinformatics help?
6.Applications of bioinformatics
7.Software and tools of bioinformatics
8.References
The experimental methods used by biotechnologists to determine the structures of proteins demand sophisticated equipment and time.
A host of computational methods are developed to predict the location of secondary structure elements in proteins for complementing or creating insights into experimental results.
Chou-Fasman algorithm is an empirical algorithm developed for the prediction of protein secondary structure
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
Journal of Computational Systems Biology (JCSB) is an open access online journal which aims to publish peer reviewed research articles and short communications in all aspects of computational biology and bioinformatics. JCSB comprehend the broad spectrum of computational bioscience including biological databases and bioalgorithms.
In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
this presentation is about bioinformatics. the contents of bioinformatics are as under:
1.Introduction to bioinformatics.
2.Why bioinformatics is necessary?
3.Goals of bioinformatics
4.Field of bioinformatics
5.Where bioinformatics help?
6.Applications of bioinformatics
7.Software and tools of bioinformatics
8.References
The experimental methods used by biotechnologists to determine the structures of proteins demand sophisticated equipment and time.
A host of computational methods are developed to predict the location of secondary structure elements in proteins for complementing or creating insights into experimental results.
Chou-Fasman algorithm is an empirical algorithm developed for the prediction of protein secondary structure
INTRODUCTION.
NCBI.
EMBL.
DDBJ.
CONCLUSION.
REFERENSE.
The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored by Senator Claude Pepper.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature.
All these databases are available online through the Entrez search engine.
Journal of Computational Systems Biology (JCSB) is an open access online journal which aims to publish peer reviewed research articles and short communications in all aspects of computational biology and bioinformatics. JCSB comprehend the broad spectrum of computational bioscience including biological databases and bioalgorithms.
Computational Biology and BioinformaticsSharif Shuvo
Computational Biology and Bioinformatics is a rapidly developing multi-disciplinary field. The systematic achievement of data made possible by genomics and proteomics technologies has created a tremendous gap between available data and their biological interpretation.
Neuroscience core lecture given at the Icahn school of medicine at Mount Sinai. This is the version 2 of the same topic. I have made some modifications to give a more gentle introduction and add a new example for ngs.plot.
Computational Approaches to Systems BiologyMike Hucka
Presentation given at the Sydney Computational Biologists meetup on 21 August 2013 (http://australianbioinformatics.net/past-events/2013/8/21/computational-approaches-to-systems-biology.html).
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
An introduction to the commonly used formats for the next-generation sequencing data. ngs.plot is a popular tool for the visualization and data mining of the NGS data.
The quality of data is very important for various downstream analyses, such as sequence assembly, single nucleotide polymorphisms identification this ppt show parameters for
NGS Data quality check and Dataformat of top sequencing machine
BioPerl is an active open source software project supported by the Open Bioinformatics Foundation.
BioPerl is a product of community effort to produce Perl code which is useful in biology.
BioPerl is a collection of Perl modules
It has played an integral role in the Human Genome Project
A description of how technology has changed the face of Biology, specially in the fields of genetics, proteomics, and evolution.
It includes a brief history, examples of usage, and a look into the future.
code4lib 2011 preconference: What's New in Solr (since 1.4.1)Erik Hatcher
code4lib 2011 preconference, presented by Erik Hatcher of Lucid Imagination.
Abstract: The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
- Basics: BEAM Ecosystem and Erlang Programming Language
- Functional Programming: How Make your code Beautiful
- Concurrency: You need to be concurrent to survive in the parallel world
- Fault Tolerance: Keep calm and let it crash
- Soft Real-time: Accept the reality, be real, be yourself
- Software Architecture: How to look nice in a bigger picture
2. Programming languages
• Self-contained language
– Platform-independent
– Used to write O/S
– C (imperative, procedural)
– C++, Java (object-oriented)
– Lisp, Haskell, Prolog (functional)
• Scripting language
– Closely tied to O/S
– Perl, Python, Ruby
• Domain-specific language
– R (statistics)
– MatLab (numerics)
– SQL (databases)• An O/S typically manages…
– Devices (see above)
– Files & directories
– Users & permissions
– Processes & signals
3. Role of Programming
Reduces time
Reduces money on R&D
Reduces human effort
Streamlines workflow
Standard Algorithm unifies data result and
assures reproducability
Reduces human error
4. Applications of Programming
Data Mining
Genome Annotation
Microarray Analysis
Website Development
Tool Development
Statistical Analyses
Phylogeny
Genome Wide Association Studies (GWAS)
Next Generation Sequencing studies
6. Perl is the most-used bioinformatics language
Most popular bioinformatics programming languages
Bioinformatics career survey, 2008
Michael Barton
7. PERL
Practical Extraction & Report Language
Interpreted, not compiled
− Fast edit-run-revise cycle
• Procedural & imperative
− Sequence of instructions (“control flow”)
− Variables, subroutines
• Syntax close to C (the de facto standard minimal language)
− Weakly typed (unlike C)
− Redundant, not minimal (“there’s more than one way to do it”)
− “Syntactic sugar”
High-level data structures & algorithms
– Hashes, arrays
Operating System support (files, processes, signals)
String manipulation
8. Pros and Cons of Perl
• Reasons for Perl’s popularity in bioinformatics (Lincoln Stein)
– Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text
– Perl is forgiving
– Perl is component-oriented
– Perl is easy to write and fast to develop in
– Perl is a good prototyping language
– Perl is a good language for Web CGI scripting
• Problems with Perl
– Hard to read (“there’s more than one way to do it”, cryptic syntax…)
– Too forgiving (no strong typing, allows sloppy code…)
9. General principles of programming
Make incremental changes
Test everything you do
− the edit-run-revise cycle
Write so that others can read it
− (when possible, write with others)
Think before you write
Use a good text editor
Good debugging style
10. Regular expressions
Perl provides a pattern-matching engine
Patterns are called regular expressions
They are extremely powerful
− probably Perl's strongest feature, compared to other
languages
Often called "regexps" for short
11. Programming in PERL
Data Types
Scalars ($)
Arrays (@)
Hashes (%)
Conditional Operators
AND (&&),
OR (||),
NOT (!)
Arithmetic Operators (+, -,*, /)
13. Finding all sequence lengths
Open file
Read line
End of file?
Line starts with “>” ?
Remove “n” newline
character at end of line
Sequence name Sequence data
Add length of line
to running totalRecord the name
Reset running total of
current sequence length
First sequence?Print last
sequence
length
Stop
noyes
yes
yes
no
no
Start
Print last
sequence
length
15. Normalizing microarray data
• Often microarray data are normalized as a precursor
to further analysis (e.g. clustering)
• This can eliminate systematic bias; e.g.
− if every level for a particular gene is elevated, this might
signal a problem with the probe for that gene
− if every level for a particular experiment is elevated, there
might have been a problem with that experiment, or with
the subsequent image analysis
• Normalization is crude (it can eliminate real signal as
well as noise), but common
16. Rescaling an array
For each element of the array:
add a, then multiply by b
@array = (1, 3, 5, 7, 9);
print "Array before rescaling: @arrayn";
rescale_array (@array, -1, 2);
print "Array after rescaling: @arrayn";
sub rescale_array {
my ($arrayRef, $a, $b) = @_;
foreach my $x (@$arrayRef) {
$x = ($x + $a) * $b;
}
}
Array before rescaling: 1 3 5 7 9
Array after rescaling: 0 4 8 12 16
Array is passed by reference
17. Microarray expression data
A simple format with tab-separated fields
First line contains experiment names
Subsequent lines contain:
− gene name
− expression levels for each experiment
* EmbryoStage1 EmbryoStage2 EmbryoStage3 ...
Cyp12a5 104.556 102.441 55.643 ...
MRG15 4590.15 6691.11 9472.22 ...
Cop 33.12 56.3 66.21 ...
bor 5512.36 3315.12 1044.13 ...
Bx42 1045.1 632.7 200.11 ...
... ... ... ...
Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn…
18. Reading a file of expression data
sub read_expr {
my ($filename) = @_;
open EXPR, "<$filename";
my $firstLine = <EXPR>;
chomp $firstLine;
my @experiment = split /t/, $firstLine;
shift @experiment;
my %expr;
while (my $line = <EXPR>) {
chomp $line;
my ($gene, @data) = split /t/, $line;
if (@data+0 != @experiment+0) {
warn "Line has wrong number of fieldsn";
}
$expr{$gene} = @data;
}
close EXPR;
return (@experiment, %expr);
}
Note use of
scalar context
to compare
array sizes
Reference to array of experiment names
Reference to hash of arrays
(hash key is gene name, array
elements are expression data)
19. Normalizing by gene
A program to normalize expression data from
a set of microarray experiments
Normalizes by gene
($experiment, $expr) = read_expr ("expr.txt");
while (($geneName, $lineRef) = each %$expr) {
normalize_array ($lineRef);
}
sub normalize_array {
my ($data) = @_;
my ($mean, $sd) = mean_sd (@$data);
@$data= map (($_ - $mean) / $sd, @$data);
}
NB $data
is a reference
to an array
Could also use the following:
rescale_array($data,-$mean,1/$sd);
20. Normalizing by column
Remaps gene arrays to column arrays
($experiment, $expr)
= read_expr ("expr.txt");
my @genes = sort keys %$expr;
for ($i = 0; $i < @$experiment; ++$i) {
my @col;
foreach $j (0..@genes-1) {
$col[$j] = $expr->{$genes[$j]}->[$i];
}
normalize_array(@col);
foreach $j (0..@genes-1) {
$expr->{$genes[$j]}->[$i] = $col[$j];
}
}
Puts column
data in @col
Puts @col
back into %expr
Normalizes (note use
of reference)
24. Reading a GFF file
• This subroutine reads a GFF file
• Each line is made into an array via the split command
• The subroutine returns an array of such arrays
sub read_GFF {
my ($filename) = @_;
open GFF, "<$filename";
my @gff;
while (my $line = <GFF>) {
chomp $line;
my @data = split /t/, $line, 9;
push @gff, @data;
}
close GFF;
return @gff;
}
Splits the line into at most nine
fields, separated by tabs ("t")
Appends a reference to @data
to the @gff array
25. Writing a GFF file
• We should be able to write as well as read all datatypes
• Each array is made into a line via the join command
• Arguments: filename & reference to array of arrays
sub write_GFF {
my ($filename, $gffRef) = @_;
open GFF, ">$filename" or die $!;
foreach my $gff (@$gffRef) {
print GFF join ("t", @$gff), "n";
}
close GFF or die $!;
}
open evaluates FALSE if
the file failed to open, and
$! contains the error message
close evaluates FALSE if
there was an error with the file
26. GFF intersect detection
• Let (name1,start1,end1) and (name2,start2,end2) be the co-
ordinates of two segments
• If they don't overlap, there are three possibilities:
• name1 and name2 are different;
• name1 = name2 but start1 > end2;
• name1 = name2 but start2 > end1;
• Checking every possible pair takes time N2
to run, where N is
the number of GFF lines (how can this be improved?)
27. Self-intersection of a GFF file
sub self_intersect_GFF {
my @gff = @_;
my @intersect;
foreach $igff (@gff) {
foreach $jgff (@gff) {
if ($igff ne $jgff) {
if ($$igff[0] eq $$jgff[0]) {
if (!($$igff[3] > $$jgff[4]
|| $$jgff[3] > $$igff[4])) {
push @intersect, $igff;
last;
}
}
}
}
}
return @intersect;
}
Note: this code is slow.
Vast improvements in
speed can be gained if
we sort the @gff array
before checking for
intersection.
Fields 0, 3 and 4 of the
GFF line are the sequence
name, start and end co-
ordinates of the feature
30. Packages
Perl allows you to organise your subroutines in
packages each with its own namespace
Perl looks for the packages in a list of directories
specified by the array @INC
Many packages available at
http://www.cpan.org/
use PackageName;
PackageName::doSomething();
This line includes a file called
"PackageName.pm" in your code
print "INC dirs: @INCn";
INC dirs: Perl/lib Perl/site/lib .
The "." means the
directory that the
script is saved in
This invokes a subroutine called doSomething()
in the package called "PackageName.pm"
31. Object-oriented programming
Data structures are often associated with code
− FASTA: read_FASTA print_seq revcomp ...
− GFF: read_GFF write_GFF ...
− Expression data: read_expr mean_sd ...
Object-oriented programming makes this association
explicit.
A type of data structure, with an associated set of
subroutines, is called a class
The subroutines themselves are called methods
A particular instance of the class is an object
32. OOP concepts
• Abstraction
– represent the essentials, hide the details
• Encapsulation
– storing data and subroutines in a single unit
– hiding private data (sometimes all data, via accessors)
• Inheritance
– abstract base interfaces
– multiple derived classes
• Polymorphism
– different derived classes exhibit different behaviors in response
to the same requests
34. OOP: Analogy
o Messages (the words in the speech balloons, and also perhaps the coffee itself)
o Overloading (Waiter's response to "A coffee", different response to "A black coffee")
o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
o Encapsulation (Customer doesn't need to know about Kitchen)
o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all
humans can speak basic English and hold cups of coffee, etc.)
o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory
(and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.
35. OOP: Advantages
• Often more intuitive
– Data has behavior
• Modularity
– Interfaces are well-defined
– Implementation details are hidden
• Maintainability
– Easier to debug, extend
• Framework for code libraries
– Graphics & GUIs
– BioPerl, BioJava…
36. OOP: Jargon
• Member, method
– A variable/subroutine associated with a particular class
• Overriding
– When a derived class implements a method differently from its parent
class
• Constructor, destructor
– Methods called when an object is created/destroyed
• Accessor
– A method that provides [partial] access to hidden data
• Factory
– An [abstract] object that creates other objects
• Singleton
– A class which is only ever instantiated once (i.e. there’s only ever one
object of this class)
– C.f. static member variables, which occur once per class
37. Objects in Perl
• An object in Perl is usually a reference to a hash
• The method subroutines for an object are found in a
class-specific package
– Command bless $x, MyPackage associates variable
$x with package MyPackage
• Syntax of method calls
– e.g. $x->save();
– this is equivalent to PackageName::save($x);
– Typical constructor: PackageName->new();
– @EXPORT and @EXPORT_OK arrays used to export
method names to user’s namespace
• Many useful Perl objects available at CPAN
38. Common Gateway Interface
• CGI (Common Gateway Interface)
– Page-based web programming paradigm
• Can construct static (HTML) as well as
dynamic (CGI) web pages.
• CGI.pm (also by Lincoln Stein)
– Perl CGI interface
– runs on a webserver
– allows you to write a program that runs behind a
webpage
• CGI (static, page-based) is gradually being
supplemented by AJAX
39. GUI
Graphical User Interface (GUI) are standalone
modules created to make the work of an end
user simpler.
Can be achieved through PERL Tk
40. BioPerl
• Bioperl is a collection of Perl modules that facilitate the
development of Perl scripts for bioinformatics applications.
• A set of Open Source Bioinformatics packages
– largely object-oriented
• Implements sequence and alignments manipulation, accessing
of sequence databases and parsing of the results of various
molecular biology programs including Blast, clustalw, TCoffee,
genscan, ESTscan and HMMER.
• Bioperl enables developing scripts that can analyze large
quantities of sequence data in ways that are typically difficult
or impossible with web based systems.
• Parses BLAST and other programs
• Basis for Ensembl
– the human genome annotation project
– www.ensembl.org