2. Introduction to Bioinformatics
Biology, Chemistry, Statistics and Computer science combined to solve complex biological problems: Bioinformatics
Algorithms and techniques of computer science being used to solve the problems
faced by molecular biologists
‘Information technology applied to the management and analysis of biological data’
Storage and Analysis are two important functions – bioinformaticians build tools for
each
The bio-IT market has grown significantly in the genomic era
3. Fields of Bioinformatics
The need for bioinformatics has arisen from the recent explosion of publicly
available genomic information, such as resulting from the Human Genome
Project.
Gain a better understanding of gene analysis, taxonomy, & evolution.
Work efficiently on rational drug design and reduce the time taken to develop
drugs manually.
Unravel the wealth of Biological information hidden in mass of sequence,
structure, literature and biological data
Has environmental clean-up benefits
In agriculture, it can be used to produce high productivity crops
Gene Therapy
Forensic Analysis
Understanding biological pathways and networks in Systems Biology
4. Bioinformatics key areas
Organisation of knowledge (sequences, structures, functional data)
e.g. homology searches
5. Applications of Bioinformatics
Provides central, globally accessible databases that enable scientists to submit, search and analyze
information, and offers software for data studies, modelling and interpretation.
Sequence Analysis:-
Sequence analysis uses sequencing information to determine which genes encode regulatory
sequences or peptides. These computational tools also detect DNA mutations in an organism
and identify sequences which are related.
Special software is used to detect overlapping fragments and assemble them.
Prediction of Protein Structure:-
It is easy to determine the primary structure of a protein, i.e. the amino acid sequence
encoded by the DNA molecule, but it is difficult to determine the secondary, tertiary or
quaternary structures of proteins. Tools of bioinformatics can be used to determine these
complex protein structures.
Genome Annotation:-
In genome annotation, genomes are marked to identify the regulatory sequences and
protein-coding regions. It is a very important part of the Human Genome Project as it
determines the regulatory sequences.
6. Comparative Genomics:-
Comparative genomics is the branch of bioinformatics which determines the relation between
genomic structure and function across different biological species, which enables scientists
to trace the processes of evolution that occur in the genomes of different species.
Pharmaceutical Research:-
Tools of bioinformatics are also helpful in drug discovery, diagnosis and disease
management. Complete sequencing of human genes has enabled scientists to make medicines
and drugs which can target more than 500 genes. Accurate prediction in screening.
7. S. No | Unix | Windows | Linux
1. | Open source | Closed source | Open source
2. | Very high security system | Low security system | High security system
3. | Command-line | GUI | Hybrid
4. | File system is arranged in hierarchical manner | File system is arranged in parallel manner | File system is arranged in hierarchical manner
5. | Not user friendly | User friendly | User friendly
6. | Multi tasking | Multi tasking | Multi tasking
8.
9. Biological databanks and databases
Very fast growth of biological data
Diversity of biological data:
o Primary sequences
o 3D structures
o Functional data
Database entry usually required for publication:
o Sequences
o Structures
Major primary databases
Nucleic acid:
o EMBL (Europe)
o GenBank (USA)
o DDBJ (Japan)
Protein:
o PIR - Protein Information Resource
o MIPS
o SWISS-PROT (University of Geneva, now with EBI)
o TrEMBL (a supplement to SWISS-PROT)
o NRL-3D
10. Sequence Databases
The three nucleotide databanks exchange data on a daily basis
Data can be submitted and accessed at any of the three locations
Nucleotide db:
GenBank - https://www.ncbi.nlm.nih.gov/
EMBL - https://www.ebi.ac.uk/
DDBJ - https://www.ddbj.nig.ac.jp/index-e.html
Bibliographic db:
PubMed, Medline
Specialized db:
RDP, IMGT, TRANSFAC, MitBase
Genetic db:
SGD – https://www.yeastgenome.org/
ACeDB, OMIM
11. Composite Databases and Secondary Databases
Composite databases merge several primary sources:
Swiss-Prot, PIR, GenBank, NRL-3D
Secondary databases store structure info or results of searches of the primary databases
Secondary database | Primary source
PROSITE (https://prosite.expasy.org/) | SWISS-PROT
PRINTS (http://130.88.97.239/PRINTS/index.php) | OWL
13. SCOP
Structural Classification of Proteins
http://scop.mrc-lmb.cam.ac.uk/
SCOP database aims to provide a detailed and comprehensive description of
structural and evolutionary relationship between all proteins
Levels of hierarchy
Family : Pairwise residue identities of 30% or greater
Superfamily : Even though sequence identities are low, members should have a common
evolutionary origin
Eg: ATPase domain of HSP and HK
Fold : Major structural similarity
Class : all α, all β, α/β, α+β, Multidomain
14. CATH
https://www.cathdb.info/
Class : 2º structure
Architecture : Gross orientation of 2º structure, independent of connectivities
Topology (fold family) : topological connection of super families
S level : Sequence and structural identities
15. Basis of Sequence Alignment
1. Aligning sequences
2. Finding the relatedness of proteins or genes, i.e. whether or not they have a
common ancestor.
3. Mutations bring changes or divergence in the sequences.
4. Alignment can also reveal the part of the sequence which is crucial for the
functioning of the gene or protein.
Similarity indicates conserved function
Human and mouse genes are more than 80% similar
Comparing sequences helps us understand function
16. Sequence Alignment
After obtaining nucleotide/amino acid sequences, the first thing is to compare them with known sequences.
Comparison is done at the level of constituents (residues). Conserved residues are then found to predict
the nature and function of the protein. This process of mapping is called Sequence Alignment.
Types of sequence alignment:
Pairwise sequence alignment
Multiple sequence alignment
1. Local alignment – Smith & Waterman Algorithm
2. Global alignment – Needleman & Wunsch Algorithm
Gapped Alignment
Ungapped Alignment
Terms to Know - Homolog, Ortholog, Paralog, Xenolog, Similar and Identical
Alignment scoring and substitution matrices
Dot plots
Dynamic programming algorithm
Heuristic methods (In order to reduce time)
FASTA
BLAST
17. Scoring a sequence alignment
Match score: +1
Mismatch score: 0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
Maximum number of matches gives high similarity – Optimum Alignment
Both gapped versions below contain 7 gap positions, but one contiguous gap is preferred over many scattered gaps:
ACGTCTGAT-------ATAGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing more for a new
gap than for extending an existing gap
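The scoring scheme on this slide (+1 match, 0 mismatch, –1 per gap position) can be checked with a short sketch. The function names and the affine gap values (`gap_open=-5`, `gap_extend=-1`) are illustrative assumptions, not values from the slides.

```python
def score_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Score two equal-length aligned strings with a linear gap penalty."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap          # a gap position in either sequence
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def affine_gap_cost(aligned, gap_open=-5, gap_extend=-1):
    """Gap cost of one aligned string: the first position of every run of '-'
    pays gap_open, each further position in the same run pays gap_extend.
    The default penalties are hypothetical."""
    cost = 0
    in_gap = False
    for ch in aligned:
        if ch == "-":
            cost += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return cost


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 18 matches, 2 mismatches, 7 gap positions

# Same number of gap positions, very different affine cost:
print(affine_gap_cost("ACGTCTGAT-------ATAGTCTATCT"))  # one run of 7
print(affine_gap_cost("AC-T-TGA--CG-CGT-TA-TCTATCT"))  # six separate runs
```

With a linear penalty both gapped versions cost the same; the affine cost is what separates the single long gap from the scattered ones.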
18. Scores:
positive for identical or similar
negative for different
negative for insertion in one of the two sequences
Substitution matrices – weights replacement of one residue by another
assumption of evolution by point mutations
amino acid replacement (by base replacement)
amino acid insertion
amino acid deletion
Significance of alignment
Depends critically on gap penalty
Need to adjust to given sequence
19. Derivation of substitution matrices
PAM matrices
First substitution matrix; developed by Dayhoff (1978) based on the Point
Accepted Mutation (PAM) model of evolution
1 PAM is a unit of evolutionary divergence in which 1% of the amino acids
have been changed
Derived from alignment of very similar sequences
PAM1 = mutation events that change 1% of AA
PAM2, PAM3, ... extrapolated by matrix multiplication
e.g.: PAM2 = PAM1 * PAM1; PAM3 = PAM2 * PAM1 etc
Lower distance PAM matrix for closely related proteins, e.g. PAM30
Higher distance PAM matrix for highly diverged sequences, e.g. PAM250
Problems with PAM matrices:
Incorrect modelling of long-time substitutions, since the conservative mutations observed are dominated by
single nucleotide changes
e.g.: L <–> I, L <–> V, Y <–> F
Over long times: any amino acid change
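The extrapolation PAM2 = PAM1 * PAM1, PAM3 = PAM2 * PAM1 is repeated matrix multiplication of a mutation probability matrix. The sketch below uses a toy 3-residue matrix with made-up numbers (the real PAM1 is 20 × 20, and the published PAM scoring matrices are log-odds scores derived from such probabilities), so it only illustrates the multiplication step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Extrapolate PAMn from PAM1 by multiplying PAM1 with itself n times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical 3-residue mutation probability matrix; each row sums to 1 and
# the diagonal (about 99%) is the chance that a residue stays unchanged.
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = pam_n(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

The diagonal shrinks as n grows, which is why high-numbered matrices such as PAM250 suit highly diverged sequences.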
21. BLOSUM matrices
BLOCKS Amino acid Substitution Matrices
Similar to PAM; however, the data were derived from local alignments of distantly related proteins
deposited in the BLOCKS db
Unlike PAM there is no evolutionary basis
BLOSUM series (BLOSUM50, BLOSUM62, ...)
BLOCKS database:
ungapped multiple alignments of protein families at a given identity
E.g.,
BLOSUM 30 better for gapped alignments – for comparing highly diverged seq
BLOSUM 90 better for ungapped alignments – for very close seq
BLOSUM 62 was derived from a set of sequences which are 62% or less similar
22. DOT Plot
Simple comparison without alignment
2D graphical representation method primarily used for finding regions of
local matches between two sequences
DOTTER, PALIGN, DOTLET (https://dotlet.vital-it.ch/)
Distinguish by alignment score
Similarities increase score (positive)
Mismatches decrease score (Negative)
Gaps decrease score
Number of possible dots = (probability of pair) x (length of seq A) x (length of seq B)
Disadvantages – no direct measure of sequence homology; statistically weak
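A dot plot can be sketched in a few lines: put one sequence on each axis and mark every cell where the two agree. The `window` parameter and the helper names below are assumptions added for illustration; tools such as DOTTER or DOTLET use scored sliding windows rather than the exact matching shown here.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window]."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, ch in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


def dots_by_chance(pair_probability, len_a, len_b):
    """The slide's formula: (probability of pair) x (length A) x (length B)."""
    return pair_probability * len_a * len_b


a, b = "GATTACA", "GATCACA"
print(render(a, b, dot_plot(a, b)))
# For DNA with equal base frequencies the pair probability is 0.25.
print(dots_by_chance(0.25, len(a), len(b)))
```

Raising `window` removes most of the isolated background dots and leaves the diagonal runs that indicate local matches.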
23. Dynamic programming algorithm
To build up the optimal alignment, i.e. the one which maximizes similarity, we need a scoring
method
Dynamic programming relies on the principle of optimality.
PROCEDURE
Construct a two-dimensional matrix whose axes are the two sequences to be compared.
The scores are calculated one row at a time. This starts with the first row of one
sequence, which is used to scan through the entire length of the other sequence,
followed by scanning of the second row.
The scanning of the second row takes into account the scores already obtained in the
first round. The best score is put into the bottom right corner of an intermediate
matrix.
This process is iterated until values for all the cells are filled.
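The row-by-row filling described in the procedure can be sketched as below for a global (Needleman & Wunsch style) alignment. The scoring values reuse the +1 / 0 / –1 scheme from the scoring slide; the function name is illustrative, and a linear gap penalty is assumed.

```python
def nw_score_matrix(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Fill the dynamic programming matrix for global alignment.
    Rows correspond to seq_a, columns to seq_b; row 0 and column 0
    represent aligning a prefix against nothing but gaps."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell only needs the three neighbours computed before it.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score


matrix = nw_score_matrix("ACGT", "AGT")
for row in matrix:
    print(row)
print("optimal global score:", matrix[-1][-1])
```

The bottom-right cell holds the score of the best alignment of the two complete sequences.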
24. Depicting the results:
Back tracing
The best matching path is the one that has the maximum total score.
If two or more paths reach the same highest score, one is chosen
arbitrarily to represent the best alignment.
The path can also move horizontally or vertically at a certain
point, which corresponds to introduction of a gap or an insertion
or deletion for one of the two sequences.
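Back tracing can be sketched by walking from the bottom-right cell to the top-left, at each step asking which neighbour produced the current score. This is a minimal sketch under the same assumed scoring (+1 / 0 / –1, linear gaps); when several paths tie it simply prefers the diagonal move, which is one arbitrary choice among the equally good alignments.

```python
def global_align(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Return (aligned_a, aligned_b, score) for one optimal global alignment."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Back tracing: diagonal = aligned pair, vertical/horizontal = gap.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])   # vertical move: gap in seq_b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")            # horizontal move: gap in seq_a
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, best = global_align("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print("score:", best)
```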
25. BLAST
Basic Local Alignment Search Tool
https://blast.ncbi.nlm.nih.gov/Blast.cgi
Multi-step approach to find high-scoring local alignments between
two sequences
List words of fixed length (3 aa or 11 nt) expected to give a score larger
than a threshold (seed alignment)
For every word, search the database and extend the ungapped alignment in
both directions up to a certain length to get HSPs (high-scoring segment pairs)
New versions of BLAST allow gaps
Blastn: nucleotide sequences
Blastp: protein sequences
tBlastn: protein query - translated database
Blastx: nucleotide query - protein database
tBlastx: nucleotide query - translated database
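The seed-and-extend idea behind BLAST can be illustrated with a much simplified sketch. It is not the BLAST implementation: it seeds on exact nucleotide words only (protein BLAST also accepts similar words scoring above a threshold), and the match/mismatch scores and the `x_drop` cut-off are hypothetical values chosen for the example.

```python
def find_seeds(query, subject, word_size=11):
    """Return exact word hits as (query_pos, subject_pos) pairs."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, word_size,
                    match=1, mismatch=-3, x_drop=5):
    """Extend a seed in both directions without gaps. Extension in one
    direction stops once the running score falls x_drop below the best
    score seen. Returns (score, query_start, query_end), end exclusive."""
    score = best = word_size * match
    # Extend to the right of the seed.
    qi, sj = i + word_size, j + word_size
    best_end = qi
    while qi < len(query) and sj < len(subject) and best - score < x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, best_end = score, qi
    # Extend to the left, starting again from the best score so far.
    score = best
    qi, sj = i - 1, j - 1
    best_start = i
    while qi >= 0 and sj >= 0 and best - score < x_drop:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, best_start = score, qi
        qi -= 1
        sj -= 1
    return best, best_start, best_end


query, subject = "ACGTACGT", "GGACGTACGTCC"
for qi, sj in find_seeds(query, subject, word_size=4):
    print((qi, sj), extend_ungapped(query, subject, qi, sj, word_size=4))
```

A word size of 4 is used only to keep the example short; the slide's defaults are 11 for nucleotides and 3 for amino acids.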
26.
27. Interpretation
Rapid and easy way to find homologs by scanning huge databases
Search against specialized db
BLAST programs employ the SEG program to filter low-complexity regions before
executing the db search
Quality of the alignment is represented by the score (to identify hits)
Significance of the alignment is represented by the E-value (Expected value)
E-value decreases exponentially as the score increases
The E-value provides information about the likelihood that a given sequence
match is purely by chance. The lower the E-value, the less likely the
database match is a result of random chance, and therefore the more significant the match is.
If E is between 0.01 and 10, the match is considered not significant.
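The exponential relation between score and E-value can be made concrete with the standard bit-score form, E = m × n × 2^(–S'), where S' is the bit score, m the query length and n the database length. The sketch below assumes raw lengths; BLAST itself uses length-corrected "effective" values, so its reported numbers will differ somewhat.

```python
def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score,
    using E = m * n * 2**(-S') with uncorrected lengths (an assumption)."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Every additional 10 bits lowers the E-value by a factor of 2**10 = 1024.
for bits in (30, 40, 50):
    print(bits, e_value(bits, query_len=300, db_len=10**9))
```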
28. FASTA
More sensitive than BLAST
Uses a lookup table to locate all identically matching words of
length Ktup between two sequences
Blast – Hit extension step
Fasta – Exact word match
A smaller Ktup makes the search more sensitive but slower;
a larger Ktup makes it faster
FASTA also uses E-values and bit scores. The FASTA
output provides one more statistical parameter,
the Z-score.
If Z is in the range of 5 to 15, the sequence pair
can be described as highly probable homologs. If
Z < 5, their relationship is described as less
certain
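The Ktup lookup step of FASTA can be sketched as a hash table of words plus a count of hits per diagonal (offset between the two positions). This covers only the first stage of the method; the later rescoring and joining of diagonals is not shown, and the function name is illustrative.

```python
from collections import Counter, defaultdict


def ktup_diagonals(seq_a, seq_b, ktup=2):
    """Count identical ktup-word matches on each diagonal (i - j)."""
    positions = defaultdict(list)          # word -> start positions in seq_a
    for i in range(len(seq_a) - ktup + 1):
        positions[seq_a[i:i + ktup]].append(i)
    diagonals = Counter()
    for j in range(len(seq_b) - ktup + 1):
        for i in positions.get(seq_b[j:j + ktup], []):
            diagonals[i - j] += 1          # same offset = same diagonal
    return diagonals


hits = ktup_diagonals("TTACGTAC", "ACGTAC", ktup=2)
print(hits.most_common(3))
```

The diagonal with the most word hits marks the offset at which the two sequences are most likely to align; a larger `ktup` produces fewer hits and a faster but less sensitive search.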
29. Phylogenetics
Phylogenetics is the study of evolutionary relatedness among various groups of
organisms (e.g., species, populations).
Monophyletic group – all taxa share one common ancestor
Paraphyletic grouping – taxa share a common ancestor, but not all of its descendants are included
Errors in alignment mislead the tree
Methods of Phylogenetic Analysis:
Phenetic: NJ, UPGMA
Cladistic: MP, ML
30. A phylogenetic tree is a tree showing the evolutionary interrelationships among various
species or other entities that are believed to have a common ancestor. A phylogenetic tree
is a form of a cladogram. In a phylogenetic tree, each node with descendants represents
the most recent common ancestor of the descendants, and edge lengths correspond to
time estimates.
Each node in a phylogenetic tree is called a taxonomic unit. Internal nodes are generally
referred to as Hypothetical Taxonomic Units (HTUs) as they cannot be directly observed
Distances – number of changes
Parts of a phylogenetic tree
Node
Root
Outgroup
Ingroup
Branch
31. Phenetic Method of analysis:
Also known as numerical taxonomy
Involves various measures of overall similarity for ranking species
All the data are first converted to numerical values without any character
weighting. Then the number of similarities / differences is calculated.
Then the closest taxa are clustered or grouped together
Phenetics lacks evolutionary significance
Cladistic method of analysis:
Alternative approach
Diagramming relationship between taxa
Basic assumption – members of the group share a common evolutionary
history
Typically based on morphological data
32. Distance and Character
A tree can be based on
1. quantitative measures like the distance or similarity between species, or
2. based on qualitative aspects like common characters.
Molecular clock assumption – substitutions in the nucleotide / amino acid sequences being compared occur at a constant rate
33. Maximum Parsimony:
Finds the optimum tree by minimizing the number of evolutionary changes
No assumptions on the evolutionary pattern
MSA first, then scoring
Rather time consuming; works well if the sequences have strong similarity
May oversimplify evolution
May produce several equally good trees
Programs: PAUP, MacClade
Maximum Likelihood:
The best tree is found based on assumptions on evolution model
Nucleotide models are more advanced at the moment than amino acid models
Programs require lot of capacity from the system
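For maximum parsimony, the number of evolutionary changes a given tree requires can be counted with Fitch's algorithm, sketched below for rooted binary trees written as nested tuples. This only scores a tree that is already given; a full parsimony program such as PAUP also searches over many candidate trees. The alignment and tree shapes are made-up examples.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one character.
    tree is a leaf name (str) or a pair (left, right);
    states maps each leaf name to its observed state."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree: no extra change needed at this node.
        return common, left_changes + right_changes
    # Children disagree: one change is required on one of the branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[site] for name, seq in alignment.items()})[1]
        for site in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))
print(parsimony_score(tree_2, alignment))
```

The tree with the lower score is the more parsimonious of the two; ties correspond to the "several equally good trees" noted on the slide.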
34. Neighbour Joining:
The sequences to be joined are chosen to give the best least-squares estimates of the
branch lengths, i.e. those that most closely reflect the actual distances between the sequences
The NJ method begins by creating a star topology in which no neighbours are connected
Then the tree is modified by joining pairs of sequences. The pair to be joined is chosen by calculating
the sum of branch lengths
Distance table
No molecular clock assumed
UPGMA
Unweighted Pair Group method with Arithmetic Mean
Works by clustering, starting with the most similar sequences and moving towards the more distant ones
Dot representation
Molecular clock assumed
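UPGMA clustering can be sketched as follows: repeatedly join the two closest clusters and replace their distances by the size-weighted arithmetic mean. The input format (a dict of leaf-pair distances) and the nested-tuple output are assumptions made for the example, and the distance values are invented.

```python
from itertools import combinations


def upgma(distances):
    """distances: {(a, b): d} with every unordered pair of leaves listed once.
    Returns (tree, root_height); tree is a nested tuple of leaf names."""
    size, dist = {}, {}
    for (a, b), d in distances.items():
        dist[frozenset((a, b))] = d
        size[a] = size[b] = 1
    height = {name: 0.0 for name in size}
    while len(size) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        # Molecular clock: the new node sits halfway between the two clusters.
        height[merged] = dist[frozenset((a, b))] / 2
        for c in size:
            if c != a and c != b:
                # Average of the old distances, weighted by cluster size.
                dist[frozenset((merged, c))] = (
                    size[a] * dist[frozenset((a, c))]
                    + size[b] * dist[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    root = next(iter(size))
    return root, height[root]


table = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree, root_height = upgma(table)
print(tree, root_height)
```

Because every leaf ends up at the same distance from the root, the result is only reliable when the molecular clock assumption roughly holds; neighbour joining drops that assumption.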
35. PHYLIP (Phylogeny Inference Package)
Available free for Windows/MacOS/Linux systems
Parsimony, distance matrix and likelihood methods (bootstrapping and
consensus trees)
Data can be molecular sequences, gene frequencies, restriction sites and
fragments, distance matrices and discrete characters