Bioinformatics

Seyed Mohammad Motevalli
December 2013

Outline
 Introduction to bioinformatics
 Biological databases
 Sequence alignment and its algorithms
 Structural prediction
 Web-based tools
 Stand-alone software

Introduction to bioinformatics
 What is bioinformatics?
Bioinformatics is an interdisciplinary research area at the interface between
computer science and biological science.
 What are the differences between bioinformatics and informatics?
 What are the differences between bioinformatics and computational biology?
 What is an algorithm?

Biological databases
 Database
A database is a computerized archive used to store and organize data in such a
way that information can be retrieved easily via a variety of search criteria.
 Entry
Each record (entry) contains a number of fields that hold the actual data items.
 Value
A value is a particular piece of information held in a field.
 Making a query
To retrieve a particular record from the database, a user can specify a value to
be found in a particular field and expect the computer to retrieve the whole
data record.
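
The entry, field, value and query ideas above can be sketched in a few lines of Python. This is only an illustration: the field names, accession numbers and sequences are made up and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One database record; each attribute is a field holding a value."""
    accession: str
    organism: str
    description: str
    sequence: str


def query(database, field_name, value):
    """Return every whole record whose field `field_name` holds `value`."""
    return [entry for entry in database if getattr(entry, field_name) == value]


# Hypothetical records, for illustration only.
database = [
    Entry("X0001", "Homo sapiens", "hypothetical protein A", "MKTAYIAKQR"),
    Entry("X0002", "Mus musculus", "hypothetical protein B", "MKTAYLAKQR"),
    Entry("X0003", "Homo sapiens", "hypothetical protein C", "MSLNEQW"),
]

hits = query(database, "organism", "Homo sapiens")
print([entry.accession for entry in hits])
```

Real databases index their fields so that a query does not have to scan every record, but the principle (a value searched in a field returns the whole record) is the same.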

 Primary databases
 GenBank (NCBI): www.ncbi.nlm.nih.gov
 EMBL: www.ebi.ac.uk/embl/index.html
 DDBJ: www.ddbj.nig.ac.jp

 Secondary databases
 ExPASy: http://web.expasy.org
 PIR: http://pir.georgetown.edu/pirwww/pirhome3.shtml
 Swiss-Prot: www.ebi.ac.uk/swissprot/access.html

 Interconnection between Biological Databases

 Pitfalls of biological databases
 The causes of redundancy include repeated submission of identical or
overlapping sequences by the same or different authors, revision of
annotations, and dumping of expressed sequence tag (EST) data
 Redundant sequences
 Non-redundant sequences (RefSeq)
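
A minimal sketch of how identical submissions could be collapsed into a non-redundant set. It only catches exact duplicates (ignoring letter case); overlapping sequences and revised annotations would need an alignment-based comparison. The identifiers are hypothetical.

```python
def remove_redundancy(records):
    """Keep one record per distinct sequence; the first submission wins.

    `records` is a list of (identifier, sequence) pairs.
    """
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical submissions: the second one repeats the first.
submissions = [
    ("sub1", "ATGGCCTAA"),
    ("sub2", "atggcctaa"),
    ("sub3", "ATGGCGTAA"),
]
print(remove_redundancy(submissions))
```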

 Further databases
 NCBI
 UniProt: http://www.uniprot.org
 ExPASy
 PIR: http://pir.georgetown.edu/
 Swiss-Prot: http://swissmodel.expasy.org/
 PDB: http://www.rcsb.org/pdb/home/home.do
 Enzyme structure: http://www.ebi.ac.uk/thornton-srv/databases/enzymes

Sequence alignment and its algorithms
 Pairwise sequence alignment
Pairwise sequence alignment is the process of aligning two sequences and is
the basis of database similarity searching and multiple sequence alignment.

 Sequence similarity versus sequence homology
When two sequences are descended from a common evolutionary origin, they
are said to have a homologous relationship or to share homology. A related but
different term is sequence similarity, which is the percentage of aligned
residues that are similar in physicochemical properties such as size, charge,
and hydrophobicity.

 Sequence similarity versus sequence identity
In a protein sequence alignment, sequence identity refers to the percentage of
matches of the same amino acid residues between two aligned sequences.
Similarity refers to the percentage of aligned residues that have similar
physicochemical characteristics and can be more readily substituted for each
other.
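
The difference between identity and similarity can be illustrated with a short Python sketch. The residue groups below are a simplified, assumed grouping by physicochemical character, not a standard taken from the slides, and percentages are computed over the ungapped alignment columns (other denominators are also in use).

```python
# Assumed, simplified groups of physicochemically similar amino acids.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def is_similar(a, b):
    """Identical residues, or residues from the same group, count as similar."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned1, aligned2):
    """Return (% identity, % similarity) for two aligned sequences with '-' gaps."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned1, aligned2) if a != "-" and b != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    identical = sum(a == b for a, b in columns)
    similar = sum(is_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


print(identity_and_similarity("MKVLDE", "MRIL-E"))
```

In this toy alignment three of the five ungapped columns are identical, while K/R and V/I are counted as similar but not identical, so similarity is higher than identity.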

 Sequence alignment strategies
 Global alignment
In global alignment, the two sequences to be aligned are assumed to be generally
similar over their entire length. Alignment is carried out from beginning to end
of both sequences to find the best possible alignment across the entire length
between the two sequences.
 Local alignment
Local alignment does not assume that the two sequences in question have
similarity over the entire length. It only finds local regions with the highest
level of similarity between the two sequences and aligns these regions without
regard for the alignment of the rest of the sequence regions.

Linear gap penalty: the cost of creating a gap and the cost of extending it are the same

W(I) = g × I, where g is the cost for each gap position and I is the gap length

Affine gap penalty: creating a gap and extending it have different costs
W(I) = g_open + g_ext × (I − 1), where opening a gap is penalized more heavily than extending it
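
The two penalty schemes can be compared numerically with a small sketch. It assumes penalties are positive costs that are subtracted from the alignment score; the numeric values are illustrative, not recommended settings.

```python
def linear_gap_penalty(length, g):
    """W(I) = g * I: every gap position costs the same."""
    return g * length


def affine_gap_penalty(length, g_open, g_ext):
    """W(I) = g_open + g_ext * (I - 1): opening costs more than extending."""
    if length < 1:
        return 0
    return g_open + g_ext * (length - 1)


for gap_length in (1, 2, 5):
    print(gap_length,
          linear_gap_penalty(gap_length, g=4),
          affine_gap_penalty(gap_length, g_open=10, g_ext=1))
```

With these assumed values a single long gap is cheaper under the affine scheme than under the linear one, while many short gaps are more expensive, which favors alignments with few, longer gaps.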

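[Equation: alignment score S, expressed in terms of the substitution scores S(·,·) and the gap penalties W(I)]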

 Alignment algorithms and methods
 The dot matrix method
 The word method
 The dynamic programming method

 The dot matrix method
The most basic sequence alignment method is the dot matrix method, also
known as the dot plot method.
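
A minimal dot plot sketch in Python, assuming the simplest variant: a dot is placed wherever two residues are identical, with no window or threshold filtering.

```python
def dot_matrix(seq1, seq2):
    """matrix[i][j] is True when residue i of seq1 equals residue j of seq2."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq2)]
    for residue, row in zip(seq1, dot_matrix(seq1, seq2)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATACA"))
```

Regions of similarity show up as diagonal runs of dots; a shift from one diagonal to a neighboring one indicates an insertion or deletion.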

 The word method
The word method works by finding short stretches of identical or nearly identical
letters in two sequences. These short strings of characters are called words, which
are similar to the windows used in the dot matrix method.
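
The word idea can be sketched as follows. This simplified version finds only identical words of an assumed length k = 3; tools that also accept nearly identical words score candidate words against a substitution matrix, which is not shown here.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every word of length k to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return positions


def shared_words(query, subject, k=3):
    """Return (word, query_position, subject_position) for each identical word."""
    subject_index = index_words(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in subject_index.get(word, []):
            hits.append((word, i, j))
    return hits


print(shared_words("GATTACA", "TTACAG"))
```

Hits that lie on the same diagonal (constant difference between the two positions) are the candidates that a word-based search would extend into a longer alignment.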


 The dynamic programming method
Dynamic programming is a method that determines the optimal alignment by
matching two sequences for all possible pairs of characters between the
two sequences.
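
A compact sketch of dynamic programming for global alignment, in the style of the Needleman–Wunsch algorithm. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; real applications use a substitution matrix and usually affine gaps.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),  # align the pair
                              score[i - 1][j] + gap,             # gap in seq2
                              score[i][j - 1] + gap)             # gap in seq1

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-"); i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


print(needleman_wunsch("GATTACA", "GATACA"))
```

Every cell of the matrix is filled once, so the work grows with the product of the two sequence lengths; this is why exhaustive dynamic programming is replaced by heuristics for database searches.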

 Global alignment
The classical global pairwise alignment algorithm using dynamic
programming is the Needleman–Wunsch algorithm. In this algorithm, an
optimal alignment is obtained over the entire lengths of the two sequences.
 Local alignment
The first application of dynamic programming in local alignment is the
Smith–Waterman algorithm. In this algorithm, no negative scores are kept in
the scoring matrix: whenever the running score of a cell would fall below zero
it is set to zero, so an alignment can start and end anywhere in the two
sequences.
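
A minimal local alignment sketch in the style of the Smith–Waterman algorithm. The scoring values (match +2, mismatch -1, linear gap -2) are illustrative assumptions; the essential point is that cell scores are floored at zero and the traceback starts from the highest-scoring cell.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_region1, aligned_region2)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,                                   # never negative
                              score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-"); i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In the example only the shared central region is reported; the unrelated flanking residues are left out of the alignment, which is the behavior that distinguishes local from global alignment.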

 Substitution matrices
 PAM matrices (point accepted mutation)
The PAM matrices were derived based on the evolutionary divergence between
sequences of the same cluster. One PAM unit is defined as 1% of the amino acid
positions having changed. Because very closely related homologs were used, the
observed mutations were not expected to significantly change the common
function of the proteins.

 BLOSUM matrices
BLOSUM (blocks amino acid substitution matrix) is a series of matrices, all of
which are derived from direct observation of every possible amino acid
substitution in multiple sequence alignments.

What matrices should be used and when?

Matrix     Best use                                               Similarity (%)
PAM40      Short alignments that are highly similar               70-90
PAM160     Detecting members of a protein family                  50-60
PAM250     Longer alignments of more divergent sequences          approx. 30
BLOSUM90   Short alignments that are highly similar               70-90
BLOSUM80   Detecting members of a protein family                  50-60
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30   Longer alignments of more divergent sequences          <30

Similarity: the range of similarities that the matrix is best able to detect.

Comparison
• PAM is based on an evolutionary model
using phylogenetic trees
• BLOSUM assumes no evolutionary model,
but rather conserved “blocks” of proteins

 Heuristic database searching
Heuristic algorithms perform faster searches because they examine only a
fraction of the possible alignments examined in regular dynamic programming.
 BLAST (basic local alignment search tool)
BLAST uses heuristics to align a query sequence with all sequences in a
database.

[Figure: BLAST search steps. Recoverable labels: negative scores from the scoring matrix; threshold for stopping extension (X); minimum score (S); neighborhood score threshold (T)]

Step 6, finishing: if the extension stopped after crossing the threshold X, the
alignment is called a high-scoring segment pair (HSP).

Suggested BLAST cutoffs
Chance matches are more likely in nucleotide databases than in protein databases.
Identity in proteins is more informative than identity in nucleic acids.
For nucleotide-based searches: hits with E values of 10^-6 or
less and sequence identity of 70% or more.
For protein-based searches: hits with E values of 10^-3 or less and
sequence identity of 25% or more.
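
The suggested cutoffs can be applied as a simple filter. The sketch below assumes each hit is available as an (identifier, E value, percent identity) tuple; the hits themselves are made up for illustration.

```python
# (maximum E value, minimum percent identity) from the suggested cutoffs.
CUTOFFS = {
    "nucleotide": (1e-6, 70.0),
    "protein": (1e-3, 25.0),
}


def passes_cutoff(e_value, percent_identity, search_type):
    """True when a hit meets both the E value and the identity cutoff."""
    max_e_value, min_identity = CUTOFFS[search_type]
    return e_value <= max_e_value and percent_identity >= min_identity


# Hypothetical protein search hits: (identifier, E value, percent identity).
hits = [
    ("hitA", 1e-30, 62.0),
    ("hitB", 5e-4, 28.0),
    ("hitC", 0.2, 40.0),    # E value too high
    ("hitD", 1e-5, 19.0),   # identity too low
]
kept = [name for name, e, ident in hits if passes_cutoff(e, ident, "protein")]
print(kept)
```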

 BLASTN
queries nucleotide sequences against a nucleotide sequence database
 BLASTP
uses protein sequences as queries to search against a protein sequence
database
 BLASTX
uses nucleotide sequences as queries and translates them in all six reading
frames to produce translated protein sequences, which are used to query a
protein sequence database
 TBLASTN
queries protein sequences against a nucleotide sequence database with the
sequences translated in all six reading frames
 TBLASTX
uses nucleotide sequences, which are translated in all six frames, to search
against a nucleotide sequence database that has all the sequences
translated in six frames
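
Translation in all six reading frames, as used by the translated BLAST variants, can be sketched as follows. The sketch assumes the standard genetic code and an input containing only A, C, G and T; any codon it cannot resolve is written as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Complement every base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```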

 PSI-BLAST
Position-specific iterated BLAST (PSI-BLAST) builds profiles and performs
database searches in an iterative fashion. The main feature of PSI-BLAST is
that profiles are constructed automatically and are fine-tuned in each successive
cycle.
 Multiple sequence alignment
 Exhaustive algorithms
The exhaustive alignment method involves examining all possible aligned
positions simultaneously.
 Heuristic algorithms
Because the use of dynamic programming is not feasible for routine multiple
sequence alignment, faster heuristic algorithms have been developed.
A heuristic is a computational strategy to find a near-optimal solution by
using rules of thumb. Essentially, this strategy takes shortcuts by reducing
the search space according to certain criteria.

 Progressive alignment
Progressive alignment depends on the stepwise assembly of a multiple
alignment and is heuristic in nature.
 Clustal
Clustal is a progressive multiple alignment program available either as a
stand-alone or an online program.
 T-Coffee
T-Coffee performs progressive sequence alignments as in Clustal. The main
difference is that, in processing a query, T-Coffee performs both global and
local pairwise alignment for all possible pairs involved. The global pairwise
alignment is performed using the Clustal program.

 Iterative alignment
The iterative approach is based on the idea that an optimal
solution can be found by repeatedly modifying existing
suboptimal solutions.

 Block-based alignment
The strategy identifies a block of ungapped alignment shared by all the
sequences, hence the name block-based local alignment strategy.

Structural prediction
 Structural prediction methods
 Ab-initio prediction

Computational prediction based on first principles or using the most
elementary information
 Threading
Method of predicting the most likely protein structural fold based on secondary
structure similarity with database structures and assessment of energies of the
potential fold. The term has been used interchangeably with fold recognition
 Homology-based modeling
Method for predicting the three-dimensional structure of a protein based on
homology by assigning the structure of an unknown protein using an existing
homologous protein structure as a template

Hidden Markov model
A hidden Markov model (HMM) is a statistical model composed of a number of
interconnected Markov chains, with the capability to generate the probability
value of an event by taking into account the influence of hidden variables.
Mathematically, it calculates probability values of connected states among the
Markov chains to find an optimal path within the network of states. It requires
training to obtain the probability values of state transitions. When a hidden
Markov model is used to represent a multiple sequence alignment, a sequence can
be generated through the model by incorporating probability values of
match, insertion, and deletion states.
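
Finding the optimal path through the states of a hidden Markov model is commonly done with the Viterbi algorithm. The sketch below uses a hypothetical two-state model with made-up probabilities, not a trained profile HMM with match, insertion and deletion states; it works with plain probabilities, which is fine for short inputs but would underflow on long ones (logarithms are used in practice).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for a non-empty input."""
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            column[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical model: state A mostly emits 'x', state B mostly emits 'y'.
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

print(viterbi("xxyy", states, start_p, trans_p, emit_p))
```

In a trained model the transition and emission probabilities would be estimated from known sequences rather than written by hand.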

Neural network algorithm
Machine-learning algorithm for pattern recognition. It is composed of
input, hidden, and output layers. Units of information in each layer are
called nodes. The nodes of different layers are interconnected to form a
network analogous to a biological nervous system. Between the nodes are
mathematical weight parameters that can be trained with known patterns
so they can be used for later predictions. After training, the network is able
to recognize correlation between an input and output
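
The layered structure described above can be sketched as a forward pass through a small network. The weights below are arbitrary placeholders rather than trained values, and the sigmoid activation is an assumed choice; training (adjusting the weights with known patterns) is not shown.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each node sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Placeholder weights: 2 input nodes, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

print(forward([1.0, 0.0], hidden_w, hidden_b, output_w, output_b))
```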

Web-based tools
 Alignment tools
 Sequence-based methods
 T-Coffee: http://tcoffee.crg.cat/apps/tcoffee/do:regular
 NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi
 UniProt
 EMBL: http://coot.embl.de/Alignment
 Structure-based methods
 Dali server: http://ekhidna.biocenter.helsinki.fi/dali_server
 FSSP: http://protein.hbu.cn/fssp
 Signal peptide resource: http://proline.bic.nus.edu.sg/spdb/searchn.html
 Active site prediction: http://www.scfbio-iitd.res.in/dock/ActiveSite.jsp

 Secondary structure prediction
 SOPMA: http://npsa-pbil.ibcp.fr/cgibin/npsa_automat.pl?page=npsa_sopma.html
 Jpred3: http://www.compbio.dundee.ac.uk/www-jpred
 PreSSaPro: http://bioinformatica.isa.cnr.it/PRESSAPRO
 HMM protein structure prediction: http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
 PROF: http://www.aber.ac.uk/~phiwww/prof
 Software package: http://molbiol-tools.ca/Protein_secondary_structure.htm



 Tertiary structure prediction
 Phyre2: http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index

 Biochemical features
 Protein calculator: http://www.scripps.edu/~cdputnam/protcalc.html
 Amino acid calculator: http://proteome.gs.washington.edu/cgibin/aa_calc.pl
 Peptide property calculator:
https://www.genscript.com/ssl-bin/site2/peptide_calculation.cgi
http://www.innovagen.se/custom-peptidesynthesis/peptide-property-calculator/peptide-property-calculator.asp
 Physico-chemical profiles: http://npsa-pbil.ibcp.fr/cgibin/npsa_automat.pl?page=/NPSA/npsa_pcprof.html
 TagIdent tool: http://web.expasy.org/tagident/
 Peptide cutter: http://web.expasy.org/peptide_cutter/
 Kyte-Doolittle hydropathy plot: http://gcat.davidson.edu/DGPB/kd/kytedoolittle.htm
 GRAVY calculator: http://www.gravy-calculator.de/index.php
 ProtScale: http://web.expasy.org/protscale/
 ProtParam: http://web.expasy.org/protparam/
 PROSITE: http://prosite.expasy.org/prosite.html
 InterPro: http://www.ebi.ac.uk/interpro/


Stand-alone software
 MEGA
 CLC Main Workbench
 UGENE
 SPDB viewer
 Pairwise structure alignment
 Cn3D
 BioEdit
 ClustalX
