Sequence alignment belgaum

SEQUENCE ALIGNMENT P.S.CHANDRANAND

Objectives: General Terms; What is Alignment?; Basic Concept of Alignment; Rationale Behind Alignment; Types of Alignment; Comparative Analysis; Biological Significance of Gaps

Some Definitions
Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences may be expressed as percent sequence identity and/or conservation.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Conservation: A change at a specific position of an amino acid sequence (or, less commonly, a DNA sequence) that preserves the physico-chemical properties of the original residue.
Optimal Alignment: An alignment of two sequences with the highest possible score.
Query: The input sequence that is compared to all entries in a database.

Homologous: refers to the conclusion, drawn from the data, that two genes or sequences have descended from a common ancestor. Homologous sequences are of two types:
Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation.
Paralogous: Homologous sequences within a single species that arose by gene duplication.

What is Alignment? An explicit mapping between two or more sequences: one sequence is placed over another in such a fashion as to obtain maximum similarity. Alignment may be sequence alignment or structural alignment.

WHY IS ALIGNMENT NECESSARY? We need to be able to compare sequences for similarities and differences. Often what we are looking for are not exact matches, but similarities. Similarity is based on biology.

Conserved regions: Some regions tend to be more conserved than others. Conserved regions (amino acid residues) may suggest which residues are critical for structure or function, BUT the conservation may just be an accident of history.

Similarity vs. homology
SIMILARITY: an observable quantity that can be expressed as % identity or some other suitable measure.
HOMOLOGY: a conclusion drawn from similarity data regarding shared evolutionary history (is it homologous or not?).
E.g. human myoglobin and tuna myoglobin: some similarities can be found.
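As a rough illustration of similarity as an observable quantity, the sketch below computes percent identity over the columns of an alignment that has already been built. The function name `percent_identity` and the choice of dividing by the full alignment length (gap columns included) are assumptions made for this example; other conventions divide by the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over all columns of a pairwise alignment.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# Nine identical columns out of ten.
print(percent_identity("AGGVL-IQVG", "AGGVLIIQVG"))
```

A high value from such a function is only evidence; whether the sequences are homologous remains a separate conclusion.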

Proteins of 100% identity (Human & Xenopus Myoglobin)
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG

Proteins with similarity (Horse P02188 & Xenopus)
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG

Evolutionary Basis: The presumption is that homologous sequences have diverged from a common ancestor. But we do not have the ancestral sequence, only raw sequences from living organisms.

Basic Concept of Alignment: First, the two sequences are matched in an arbitrary way, and the quality of the match is expressed as a score. Then one of the two sequences is moved with respect to the other and the match is scored again. This process is repeated until we find the best-scoring alignment. Sliding alone is cheap, but once gaps are allowed the number of possible alignments of two sequences of length N each (e.g. N = 10,000) grows exponentially with N and cannot be enumerated. Thus we compute the optimal alignment directly through dynamic programming, which takes on the order of N^2 steps.
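The slide-and-score idea can be sketched as follows, without gaps. This is only an illustration under stated assumptions: the function name `best_ungapped_offset` is hypothetical, the score is simply the number of identical residue pairs, and an offset d means that residue j of the second sequence sits under residue j + d of the first.

```python
def best_ungapped_offset(seq1: str, seq2: str) -> tuple[int, int]:
    """Slide seq2 along seq1 and return (best_offset, best_score).

    Assumption: the score is the count of identical pairs; no gaps allowed.
    """
    best_offset, best_score = 0, -1
    # Try every relative position in which the sequences overlap
    # by at least one residue.
    for d in range(-(len(seq2) - 1), len(seq1)):
        score = sum(
            1
            for j, residue in enumerate(seq2)
            if 0 <= j + d < len(seq1) and seq1[j + d] == residue
        )
        if score > best_score:
            best_offset, best_score = d, score
    return best_offset, best_score


print(best_ungapped_offset("FLWRTWS", "SWKTWT"))
```

Only len(seq1) + len(seq2) - 1 relative positions exist, so this search is fast; the difficulty arises once gaps may be placed anywhere in either sequence.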

What is the rationale behind alignment? The resemblance of two DNA sequences taken from different organisms suggests that the sequences have arisen from one common ancestral DNA by mutation and selection, which modify the DNA sequence in specific ways. The basic mutational processes are of 3 types:
Insertion: adding one or several bases (letters) to the sequence.
Deletion: removing one or more bases from the sequence.
Substitution: replacing a base in the sequence by another.

An alignment just reflects the probable evolutionary history of the two genes as it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes

ALIGNMENT: Pairwise alignment; Multiple alignment

Why pairwise alignment? Pairwise alignment is used in database searches. BLAST & FASTA are essentially highly optimized versions of local pairwise alignment. Pairwise alignment is used to compute evolutionary distances, which are used to build phylogenetic trees. Pairwise alignment is used for sequence assembly in shotgun sequencing. Pairwise alignment underlies multiple alignment, which is used to find consensus patterns. Both amino acid sequences and nucleotide sequences are handled in much the same way.

Why multiple sequence alignment? Incorporation: Organize data to reflect sequence homology. Phylogeny: Infer phylogenetic trees from homologous sites. Motif: Highlight conserved sites/regions. Structure Prediction: Highlight variable sites/regions. Extrapolation: Uncover changes in gene structure. Profile: Summarize information. The process of aligning sequences is a game of playing off gaps against mismatches.

PAIRWISE ALIGNMENT: Global alignment; Local alignment
Global alignment: placing both complete sequences over one another to find maximum similarity, i.e. global alignment algorithms start at the beginning of the two sequences and add gaps to each until the end of one is reached.
Local alignment: looking for maximum similarity within subsequences, i.e. local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
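To make the local case concrete, here is a minimal sketch of a local alignment score using the Smith-Waterman recurrence, a dynamic programming method that the slides do not name. The function name and the scores (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gaps).

    Assumption: the scoring values are illustrative, not from the slides.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start anywhere, which is
            # what makes the result local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("FLWRTWS", "SWKTWT"))
```

Because negative running totals are reset to zero, poorly matching flanks do not drag down the score of a well-conserved region.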

Comparative Analysis of Alignment Techniques
Global Alignment: aligns the entire sequence; identifies all conserved residues; requires dynamic programming; computationally intensive, much slower than local alignment; e.g. Needleman & Wunsch method, GAP.
Local Alignment: identifies short conserved sequences; a complete alignment is not done; may miss some important conserved residues; e.g. BLAST, FASTP.

A model for database searching score probabilities: Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955). Using the EVD, the raw alignment scores are converted to a statistical score (E value) that takes into account the database amino acid composition and the scoring scheme (a.a. exchange matrix).

Extreme Value Distribution: the extreme value distribution for parameter values $\mu = 0$ and $\lambda = 1$, for which the probability of a value of at least $x$ is $y = 1 - \exp(-e^{-x})$. Here $\mu$ is the characteristic value and $\lambda$ is the decay constant; in general, $y = 1 - \exp(-e^{-\lambda(x - \mu)})$.

Extreme Value Distribution (EVD): An optimal alignment of two sequences is selected out of many suboptimal alignments, and a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit. [Figure: real data compared with the EVD approximation]

Extreme Value Distribution: Following the EVD, the probability that a score $S$ is at least a given value $x$ (the basis of the E-value) is $P(S \ge x) = 1 - \exp(-e^{-\lambda(x - \mu)})$, where $\mu = (\ln Kmn)/\lambda$, $m$ and $n$ are the lengths of the query and the database, and $K$ is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and $K$ over a set of widely used scoring matrices).

Extreme Value Distribution: Substituting $\mu = (\ln Kmn)/\lambda$, the probability for the raw alignment score $S$ becomes $P(S \ge x) = 1 - \exp(-Kmn\,e^{-\lambda x})$. In practice, $P(S \ge x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of $x$. This simplifies the equation to $P(S \ge x) \approx e^{-\lambda(x - \mu)} = Kmn\,e^{-\lambda x}$. The lower the probability (E value) for a given threshold value $x$, the more significant the score $S$.
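The two forms of the probability can be compared numerically with a short sketch. The function names are hypothetical, and the values used for lambda, K, m and n are illustrative placeholders rather than parameters taken from the slides.

```python
import math


def evd_p_value(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Exact form: P(S >= x) = 1 - exp(-K*m*n*exp(-lam*x))."""
    expected = K * m * n * math.exp(-lam * score)
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-expected)


def evd_approx(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Large-score approximation: P(S >= x) ~ K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * score)


# Illustrative placeholder parameters (assumptions, not from the slides).
lam, K = 0.267, 0.041
m, n = 250, 1_000_000
for x in (40, 70, 100):
    print(x, evd_p_value(x, m, n, lam, K), evd_approx(x, m, n, lam, K))
```

For low scores the approximation exceeds 1 and is no longer a probability, whereas for high scores the two values should agree closely, in line with the statement that the approximation holds for large x.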

Normalised sequence similarity: Statistical significance. Database searching is commonly performed using an E-value threshold between 0.1 and 0.001. Lower E-value thresholds decrease the number of false positives in a database search but increase the number of false negatives, thereby lowering the sensitivity of the search.

FASTP: Local Alignment Tool
The first widely used program, a method based on lookup tables: Lipman & Pearson, Science (1985) vol 227, 1435-41, and onwards.
Sequence 1: F L W R T W S
Sequence 2: S W K T W T

Construction of the Lookup Table

Pos no.      1 2 3 4 5 6 7
Sequence 1   F L W R T W S
Sequence 2   S W K T W T

Residue   Position in Seq 1   Position in Seq 2   Offset (p1 - p2), pairs shown as (p1, p2)
F         1                   -                   -
L         2                   -                   -
W         3, 6                2, 5                1 (3,2); 1 (6,5); 4 (6,2); -2 (3,5)
R         4                   -                   -
T         5                   4, 6                1 (5,4); -1 (5,6)
S         7                   1                   6 (7,1)
K         -                   3                   -

Calculation of Offset Frequency

Offset   Frequency
1        3
4        1
-1       1
-2       1
6        1

Final Local Alignment

Pos no.      1 2 3 4 5 6 7
Sequence 1   F L W R T W S
Sequence 2   - S W K T W T
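The lookup-table and offset-frequency steps can be sketched in a few lines. This is a simplified illustration of the idea only, not the FASTP program itself; the function name `offset_frequencies` is hypothetical, and positions are numbered from 1 as on the slides.

```python
from collections import Counter, defaultdict


def offset_frequencies(seq1: str, seq2: str) -> Counter:
    """Count offsets (p1 - p2) over all pairs of identical residues."""
    # Lookup table: residue -> list of its positions in sequence 1.
    positions = defaultdict(list)
    for p1, residue in enumerate(seq1, start=1):
        positions[residue].append(p1)

    offsets = Counter()
    for p2, residue in enumerate(seq2, start=1):
        for p1 in positions.get(residue, []):
            offsets[p1 - p2] += 1
    return offsets


freqs = offset_frequencies("FLWRTWS", "SWKTWT")
best_offset, count = freqs.most_common(1)[0]
print(dict(freqs))
print("most frequent offset:", best_offset, "seen", count, "times")
```

The most frequent offset indicates how far the second sequence should be shifted, which corresponds to the single leading gap in the final local alignment.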

NEEDLEMAN-WUNSCH Algorithm: Needleman-Wunsch (1970) provided the first automatic method, using dynamic programming to find a global alignment. Global alignment is suited to sequences that are single-domain and to sequences that have not diverged much.
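A compact sketch of global alignment by dynamic programming in the Needleman-Wunsch style is given here for orientation. It uses a single linear gap penalty, and the scores (match +1, mismatch -1, gap -2) and the function name are assumptions for illustration rather than values from the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Assumption: linear gap penalty and illustrative scoring values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("AGGVLIQVG", "AGGVLIIQVG")
print(total)
print(top)
print(bottom)
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.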

Gaps: What is the biological significance of gaps? As explained earlier, changes that occur during evolution fall into 3 classes: insertion, deletion and substitution. Regions where the residues of one sequence correspond to nothing in the other are therefore interpreted as the result of either an insertion in one sequence or a deletion from the other. A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. Gaps in an alignment are represented as dashes (-).

Gaps: How long may gaps be in an optimal alignment, and how should they be scored? Some gaps can be introduced into an alignment to compensate for insertions and deletions, but not too many. To prevent the accumulation of too many gaps, introducing a gap deducts a fixed amount (the gap score) from the alignment score, so gaps occur in an alignment only when really needed. Each gap on its own lowers the alignment score, which is why the gap penalty is always negative. For example:
Unaligned:
AGGVLIQVG
AGGVLIIQVG
Aligned with one gap:
AGGVL-IQVG
AGGVLIIQVG

Gaps: Two types of gap penalties
Linear gap penalty: the gap opening (G) and gap extension (L) penalties are the same.
Affine gap penalty: the gap opening penalty is higher than the gap extension penalty.
Thus, for a gap of length n, total deduction = G + (n-1) L.
BLOSUM 62 matrix: -11 gap opening / -1 gap extension
BLOSUM 50 matrix: -12 gap opening / -1 gap extension
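The deduction formula G + (n-1) L can be checked with a small helper. The function name `gap_deduction` is hypothetical, and the penalties are passed as positive magnitudes that are subtracted from the alignment score, which is an assumption about sign convention made for this sketch.

```python
def gap_deduction(n: int, gap_open: float, gap_extend: float) -> float:
    """Total deduction for one gap of length n: G + (n - 1) * L."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (n - 1) * gap_extend


# Affine penalty with the BLOSUM 62 values quoted above (magnitudes 11 and 1).
print([gap_deduction(n, 11, 1) for n in (1, 2, 3)])

# Linear penalty: opening and extension cost the same (2 is an arbitrary choice).
print([gap_deduction(n, 2, 2) for n in (1, 2, 3)])
```

Under the affine scheme one long gap is cheaper than several short ones of the same total length, which favours alignments with a few contiguous insertions or deletions.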

Summary
An alignment reflects the probable evolutionary history of two genes, as it is presumed that homologous sequences have diverged from a common ancestral sequence through iterative molecular changes.
Changes that occur during evolution fall into 3 classes: insertion, deletion and substitution.
Two types of alignment: global alignment and local alignment.
Two types of gap penalties: linear gap penalty and affine gap penalty.
