This document provides an outline of basic concepts in bioinformatics including sequence alignment, scoring alignments, inserting gaps, dynamic programming, and database searches. It discusses comparing biological sequences to determine similarity and homology for predicting gene/protein function and constructing phylogenies. Scoring matrices like BLOSUM and PAM are described for quantifying sequence similarity. Dynamic programming algorithms like Needleman-Wunsch and Smith-Waterman are summarized for global and local sequence alignment. Database search tools like FASTA and BLAST are introduced for searching sequence databases.
3. Sequence Alignment
Comparing sequences for
– Similarity
– Homology
Prediction of function of genes and proteins
Construction of phylogeny
Finding motifs
4. Sequence Alignment - HOMOLOGY
Orthologues: any pairwise gene relation
where the ancestor node is a speciation
event. Orthologues often have similar functions.
Paralogues: any pairwise gene relation
where the ancestor node is a duplication
event. Paralogues tend to have different
functions.
9. Scoring Alignments and Substitution
Matrices
The quality of an alignment is measured by
giving it a quantitative score
The simplest way of quantifying similarity
between two sequences is percentage
identity.
– Simply measured by counting the number of
identical bases or amino acids matched
between the aligned sequences.
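Percentage identity as described above can be sketched in a few lines. The function name, the `_` gap symbol, and the choice of alignment length as the denominator are assumptions made for illustration; other conventions (for example, dividing by the shorter sequence length) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "_") -> float:
    """Percentage of alignment columns holding identical residues.

    Assumes both strings come from the same alignment (equal length)
    and that `gap` marks an inserted gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    # Count columns where both residues are the same and neither is a gap.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    print(round(percent_identity("ACGT", "ACGA"), 1))
```

Because every mismatch counts the same, this measure cannot tell a conservative substitution from a drastic one.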
10. Scoring Alignments and Substitution
Matrices
The dot-plot
gives a visual
assessment of
similarity based
on identity.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
11. Scoring Alignments and Substitution
Matrices
Percentage identity is a relatively crude
measure and does not give a complete
picture of the degree of similarity of two
sequences.
Scoring identical matches as 1 and mismatches
as 0 ignores the fact that the type of amino
acids involved is highly significant.
12. Scoring Alignments and Substitution
Matrices
Genuine matches may not be identical:
Seq1: T H I S I S A S E Q U E N C E
Seq2: T H A T _ _ _ S E Q U E N C E
Isoleucine – Alanine: both hydrophobic
Serine – Threonine : both polar
13. Scoring Alignments and Substitution
Matrices
Scoring pairs of amino acids:
– With similar properties: higher scores
– With different properties: lower scores
14. Scoring Alignments and Substitution
Matrices
To assign scores for alignments, use
SUBSTITUTION MATRICES
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
15. Scoring Alignments and Substitution
Matrices
Different types of substitution matrices are
being used based on:
– The number of mutations required for
conversion of one amino acid to the other
– Similarities in physicochemical properties.
16. Scoring Alignments and Substitution
Matrices
PAM substitution matrices:
– Use closely related protein sequences to
derive substitution frequencies
– Accepted point mutations per 100 residues
– PAM 250: 250 mutations per 100 residues
(a position may mutate more than once)
17. Scoring Alignments and Substitution
Matrices
BLOSUM substitution matrices:
– BLOcks of Amino Acid SUbstitution Matrix
– Use mutation data from highly conserved
local regions
– BLOSUM 62: derived from blocks clustered at 62% identity
18. Scoring Alignments and Substitution
Matrices
Which matrix to use ?
– Depends on the properties of the problem
– Distantly related sequences: PAM 250,
BLOSUM 50
– Closely related sequences: PAM 120,
BLOSUM 80
19. Scoring Alignments and Substitution
Matrices
Which matrix to use ?
– Some special purpose matrices (SLIM and
PHAT are designed for membrane proteins)
– The length of the sequence is important
Short sequences: PAM 40 or BLOSUM 80
Long sequences: PAM 250 or BLOSUM 50
20. Scoring Alignments and Substitution
Matrices
BLOSUM 62 and PAM 120
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
21. Inserting Gaps
Gap insertion requires a scoring penalty (gap
penalty).
To achieve correct matches gaps are
required
Alignment programs use gap penalties to
limit the introduction of gaps in the
alignments
22. Inserting Gaps
Insertions tend to be several residues long
rather than just a single residue long
– Fewer insertions and deletions occur in sequences
of structural importance
– Smaller penalty on lengthening an existing gap
(gap extension penalty) than introducing a new
gap
– If the gap penalty is high, the number of gaps will
decrease
– If the gap penalty is low, more and larger gaps will be
inserted.
23. Inserting Gaps
Choosing gap penalties:
– Linear
– Affine
Gap open penalty
Gap extension penalty
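The two gap penalty schemes can be compared with a small sketch. The function names and the default penalty values are hypothetical, and the affine form used here (open cost for the first gap position, extension cost for each further position) is only one of the conventions found in alignment programs.

```python
def linear_gap_penalty(length: int, per_residue: float = 2.0) -> float:
    """Linear scheme: every gap position costs the same."""
    return per_residue * length


def affine_gap_penalty(
    length: int, gap_open: float = 10.0, gap_extend: float = 0.5
) -> float:
    """Affine scheme: opening a gap is expensive, lengthening it is cheap.

    Assumed convention: the first gap position costs `gap_open`,
    each additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    # One gap of 5 residues versus five separate single-residue gaps.
    print(affine_gap_penalty(5), 5 * affine_gap_penalty(1))
```

Under the affine scheme a single long gap is much cheaper than many short ones, which matches the observation that insertions tend to be several residues long.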
24. Dynamic Programming
Global and Local alignments
Pairwise and Multiple alignments
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
25. Dynamic Programming
For a pair of sequences there is a large
number of possible alignments.
2 sequences of length 1000 have
approximately 10^600
different alignments.
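The size of that number can be checked with a short calculation. This sketch assumes the estimate comes from the common binomial count C(2n, n) for two sequences of length n; the slide does not say which counting convention it uses, and the function name is made up for illustration.

```python
import math


def estimate_alignment_count(n: int) -> int:
    """Binomial estimate C(2n, n) of the number of global alignments
    of two sequences of length n (an assumed counting convention)."""
    return math.comb(2 * n, n)


if __name__ == "__main__":
    count = estimate_alignment_count(1000)
    # Report only the order of magnitude; the full integer is enormous.
    print("digits:", len(str(count)))
```

Enumerating that many candidates is out of the question, which is why alignment is solved with dynamic programming instead.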
26. Dynamic Programming
– The problem can be divided into many smaller parts.
– The optimal alignment will not contain parts that are
not themselves optimal.
– Start from sufficiently short sub-sequences.
– Alignment is additive.
27. Dynamic Programming
Needleman and Wunsch were the first to
propose this method.
Finds optimal global alignments.
Align sequences:
– Seq1: x (x1x2x3…xm)
– Seq2: y (y1y2y3…yn)
28. Dynamic Programming
s(a,b) = score of aligning a and b
F(i,j) = optimal similarity of X(1:i) and Y(1:j)
Recurrence relation:
– F(i,0) = Σ s(X(k), gap), 1 <= k <= i
– F(0,j) = Σ s(gap, Y(k)), 1 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap, Y(j)),
F(i-1,j) + s(X(i), gap),
F(i-1,j-1) + s(X(i), Y(j)) ]
– Assume linear gap penalty
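The recurrence can be turned into a short program. This is a minimal sketch under stated assumptions: a simple match/mismatch score stands in for s(a,b), the gap score is a constant (linear gap penalty), and the function name and default values are chosen only for illustration. It returns the optimal score and leaves out the traceback that recovers the alignment itself.

```python
def needleman_wunsch_score(
    x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Optimal global alignment score of x and y (linear gap penalty)."""

    def s(a: str, b: str) -> int:
        # Toy substitution score; a real run would use PAM or BLOSUM.
        return match if a == b else mismatch

    m, n = len(x), len(y)
    # F[i][j] = optimal score of aligning x[:i] with y[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + gap  # x[:i] aligned against gaps only
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + gap  # y[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i][j - 1] + gap,                           # gap in x
                F[i - 1][j] + gap,                           # gap in y
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),     # align residues
            )
    return F[m][n]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

Filling the table takes time and space proportional to the product of the two sequence lengths.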
29. Dynamic Programming
Matrix S of optimal scores of sub-sequence
alignments.
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]
34. Dynamic Programming
Semi – global alignment:
– When we treat terminal gaps differently than
internal gaps
– How to modify dynamic programming to be able
to make semi – global alignment ?
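One common answer to that question, given here as a sketch and not necessarily the modification the lecture had in mind: leave the first row and column at zero so that leading gaps are free, and read the result from the best cell of the last row or last column so that trailing gaps are free too. The scoring values and function name are illustrative assumptions.

```python
def semi_global_score(
    x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Best alignment score of x and y when terminal gaps cost nothing."""
    m, n = len(x), len(y)
    # First row and column stay 0: gaps before either sequence are free.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = F[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Gaps after either sequence are free: take the best cell in the
    # last row or the last column instead of the bottom-right corner.
    return max(max(F[m]), max(row[n] for row in F))


if __name__ == "__main__":
    print(semi_global_score("ACGT", "TTACGTTT"))
```

Internal gaps are still penalised as in the global case; only the ends are treated differently.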
35. Dynamic Programming
Local alignment:
– For example, when we compare a sequence to a whole genome
– Find sub-strings whose optimal global
alignment value is maximum
36. Dynamic Programming
What is the difference between global and
local alignment ?
Can we define the recurrence relation of local
alignment similar to global alignment ?
37. Dynamic Programming
Recurrence relation of GLOBAL ALIGNMENT
(Needleman & Wunsch):
– F(i,0) = Σ s(X(k), gap), 1 <= k <= i
– F(0,j) = Σ s(gap, Y(k)), 1 <= k <= j
– F(i,j) = max [ F(i,j-1) + s(gap, Y(j)),
F(i-1,j) + s(X(i), gap),
F(i-1,j-1) + s(X(i), Y(j)) ]
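For comparison, the Smith-Waterman local alignment recurrence differs from the global one in two places: every cell is floored at 0, so an alignment can start anywhere, and the answer is the largest cell in the table rather than the bottom-right corner. The sketch below assumes a simple match/mismatch score and a linear gap penalty; names and default values are illustrative.

```python
def smith_waterman_score(
    x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Best local alignment score between any substrings of x and y."""
    m, n = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # The extra 0 lets a new local alignment start at this cell.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGGACGTGGG", "TTACGTTT"))
```

The first row and column stay at zero, so no initial gap sums are needed.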
39. Database Searches
FASTA and BLAST
Use some heuristics
Dynamic Programming Complexity
– Time O(n*m)
– Space O(n*m)
40. Database Searches FASTA
Good local alignment should have some
exact match subsequence.
Find all k-tuples (k = 1-2 for proteins, 3-6 for
DNA sequences).
Protein k-tuples: nc, sp, … (k = 2)
Nucleotide k-tuples: TAAA, CTCC, … (k = 4)
41. Database Searches FASTA
If k = 3 for nucleotide sequences:
– There will be 4^3 = 64 possible k-tuples
– Assign a number e( ) to each base:
e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3
Each 3-tuple is represented as x_i x_{i+1} x_{i+2}
Assign a number to each 3-tuple:
– c_i = e(x_i)·4^2 + e(x_{i+1})·4^1 + e(x_{i+2})·4^0
– For example:
AAA: 0·4^2 + 0·4^1 + 0·4^0 = 0
CAA: 1·4^2 + 0·4^1 + 0·4^0 = 16
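The numbering scheme reads each k-tuple as a base-4 number. A minimal sketch, with the dictionary and function names chosen for illustration:

```python
# Base codes as given on the slide.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(ktuple: str) -> int:
    """Read a nucleotide k-tuple as a base-4 number, first base most significant."""
    value = 0
    for base in ktuple:
        value = value * 4 + BASE_CODE[base]
    return value


if __name__ == "__main__":
    for word in ("AAA", "CAA", "AAC", "TTT"):
        print(word, encode_ktuple(word))
```

With k = 3 the codes run from 0 (AAA) to 63 (TTT), so they can index a 64-entry table directly.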
42. Database Searches FASTA
Find each occurrence of the k-tuples in the
sequences.
Chaining look-up tables
Consider TAAAACTCTAAC (if k = 3):
3-tuple (code)   Positions
AAA (0)          2, 3
AAC (1)          4, 10
AAG (2)          0
AAT (3)          0
…                …
(Positions are 1-based; 0 marks a 3-tuple that does not occur.)
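The look-up table for the example sequence can be rebuilt with a short sketch. A Python dictionary of lists stands in for the hashing-and-chaining structure; that substitution, the function name, and the use of an absent key (instead of 0) for k-tuples that never occur are assumptions made here.

```python
from collections import defaultdict


def build_lookup_table(sequence: str, k: int = 3) -> dict:
    """Map every k-tuple in `sequence` to its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        # Store 1-based positions to match the table on the slide.
        table[sequence[start:start + k]].append(start + 1)
    return dict(table)


if __name__ == "__main__":
    table = build_lookup_table("TAAAACTCTAAC", k=3)
    print(table["AAA"], table["AAC"], table.get("AAG", []))
```

Building the same table for a second sequence and intersecting the keys gives the shared exact k-tuples that FASTA starts from.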
43. Database Searches BLAST
Uses short words to search the database
sequence.
Searches for k-mers that will score above a
threshold (T) value when aligned with a query
k-mer (remember, FASTA looks for k-tuples
which are identical).
Uses a scheme based on finite state automata
(remember, FASTA uses hashing and chaining
for rapid identification of k-tuples).
44. Database Searches BLAST
From the query sequence, create query words
(for protein sequences the word size is 3)
45. Database Searches BLAST
BLAST uses a list of high-scoring words created
from words similar to the query words. It considers
only the words with a score greater than a threshold
value.
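The word list can be sketched as follows. Everything here is illustrative: `toy_score` is a made-up stand-in for a real substitution matrix such as BLOSUM 62, the two-letter alphabet keeps the example small, and the comparison uses "at least T", which is an assumption about how the threshold is applied.

```python
from itertools import product


def query_words(query: str, w: int = 3) -> list:
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def toy_score(a: str, b: str) -> int:
    """Hypothetical residue score, NOT a real substitution matrix."""
    return 5 if a == b else -1


def neighborhood(word: str, alphabet: str, threshold: int, score=toy_score) -> list:
    """Words over `alphabet` that score at least `threshold` against `word`."""
    hits = []
    # Brute force over every candidate word of the same length.
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


if __name__ == "__main__":
    print(query_words("PQGEF"))
    print(neighborhood("AC", alphabet="AC", threshold=4))
```

Unlike the FASTA k-tuples, the resulting list contains words that are similar to a query word without being identical to it.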
46. Database Searches BLAST
Scan each database sequence for an exact
match to the list of words.
Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
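The extension step can be sketched as an ungapped walk in both directions from the word hit. The stopping rule used here (give up once the running score falls more than `x_drop` below the best seen so far) is an assumption, as are the function names and the match/mismatch scores; the slide only states that hits are extended in either direction.

```python
def _extend(query, subject, qi, si, step, score, x_drop):
    """Walk one direction without gaps; return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has dropped too far; stop extending
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_start, s_start, word_len, score, x_drop=5):
    """Extend a word hit; return (score, query start, query end exclusive)."""
    seed = sum(
        score(query[q_start + i], subject[s_start + i]) for i in range(word_len)
    )
    right, right_len = _extend(
        query, subject, q_start + word_len, s_start + word_len, 1, score, x_drop
    )
    left, left_len = _extend(
        query, subject, q_start - 1, s_start - 1, -1, score, x_drop
    )
    return seed + left + right, q_start - left_len, q_start + word_len + right_len


if __name__ == "__main__":
    simple = lambda a, b: 1 if a == b else -1  # hypothetical scoring
    q, s = "CCGATTACAGG", "TTGATTACATT"
    total, begin, end = extend_hit(q, s, 4, 4, 3, simple)
    print(total, q[begin:end])
```

An extended hit would then be kept only if its score reaches the cutoff S.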
47. Database Searches BLAST
Keep only the extended matches that have a
score at least S.
Determine statistical significance of each
remaining match.
54. Books and Web References
Books:
1. Introduction to Bioinformatics by T. K. Attwood
2. BioInformatics by Sangita
3. Basic Bioinformatics by S. Ignacimuthu, s.j.
http://en.wikipedia.org/wiki/Sequence_alignment
http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf
http://www.ks.uiuc.edu/Training/Tutorials/science/bioinformatics-tutorial/bioinformatics.pdf
M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008,
Garland Science
Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A
practical guide to the analysis of genes and proteins”, 2001,
Wiley.
55. Image References
1. http://gorbi.irb.hr/files/5712/7497/9729/Slide09.jpg
2. http://www.ensembl.org/info/genome/compara/tree_example1.png
3. http://www.nature.com/nature/journal/v496/n7445/images/nature12027-f1.2.jpg
4. http://upload.wikimedia.org/wikipedia/commons/e/e6/Spombe_Pop2p_protein_structure_rainbow.png
5. & 6. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
7. to 13. Book: Basic Bioinformatics by S.Ignacimuthu, s.j.
14. to 18. http://blast.ncbi.nlm.nih.gov/Blast.cgi