SEQUENCE ALIGNMENT P.S.CHANDRANAND
Objectives
- General terms
- What is alignment?
- Basic concept of alignment
- Rationale behind alignment
- Types of alignment
- Comparative analysis
- Biological significance of gaps
Some Definitions
Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be expressed as percent sequence identity and/or conservation.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Conservation: Changes at a specific position of an amino acid sequence (or, less commonly, a DNA sequence) that preserve the physico-chemical properties of the original residue.
Optimal alignment: An alignment of two sequences with the highest possible score.
Query: The input sequence that is compared to all entries in a database.
Homologous: refers to the conclusion, drawn from the data, that two genes or sequences have descended from a common ancestor.
Homologous sequences are of two types:
Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation.
Paralogous: Homologous sequences within a single species that arose by gene duplication.
What is Alignment?
- An explicit mapping between two or more sequences
- Placing one sequence over another so as to obtain maximum similarity
- Sequence alignment / Structural alignment
WHY IS ALIGNMENT NECESSARY?
- We need to be able to compare sequences for similarities and differences
- Often we are looking not for exact matches but for similarities
- Similarity is based on biology
Conserved regions
- Some regions tend to be more conserved than others
- Conserved regions (amino acid residues) may suggest which residues are critical for structure or function
- BUT conservation may just be an accident of history
Similarity vs. homology
SIMILARITY – an observable quantity that can be expressed as % identity or some other suitable measure
HOMOLOGY – a conclusion drawn from similarity data regarding shared evolutionary history (sequences are either homologous or not)
E.g. human myoglobin and tuna myoglobin – some similarities can be found
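As a rough illustration of similarity as an observable quantity, the sketch below computes percent identity for two already-aligned sequences. The function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; other denominators are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumes both strings come from one alignment (equal length, '-' for gaps)
    and uses the alignment length as the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)


# Two short aligned peptides; one column holds a gap.
print(percent_identity("AGGVL-IQVG", "AGGVLIIQVG"))
```

A high value is only evidence of similarity; whether the sequences are homologous remains a separate conclusion.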
Proteins of 100% identity (Human & Xenopus Myoglobin)
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
Proteins with similarity (Horse P02188 & Xenopus)
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
GLSDGEWQ Q VLNVWGKVEADI A GHGQEV LIRLF T GHPETLEKFDKFKHLKTE A EMKA SEDLKKHG TV VLTALGGILKKKGHHEAE L KPLAQSHATKHKIP I KYLEFIS DA II H VL H SKHPGDFGADAQGAM T KALELFR N D I A A K YKELGFQG
Evolutionary Basis
- The presumption is that homologous sequences have diverged from a common ancestor
- But we do not have the ancestral sequence, only raw sequences from living organisms
Basic Concept of Alignment
- First, the two sequences are matched in an arbitrary way, and the quality of the match is expressed as a score.
- Then one of the two sequences is moved with respect to the other and the match is scored again.
- This process is repeated until the best-scoring alignment is found.
- But for two sequences of length N each (e.g. N = 10,000), the number of possible alignments becomes astronomically large once gaps are allowed, so trying them all is computationally infeasible.
- Thus we look for the optimal alignment using dynamic programming, which needs on the order of N^2 steps.
What is the rationale behind alignment?
The resemblance of two DNA sequences taken from different organisms suggests that the sequences have arisen from one common ancestral DNA by the processes of mutation and selection, which modify the DNA sequence in specific ways.
The basic mutational processes are of 3 types:
Insertion: inserting a base (letter) or several bases into the sequence
Deletion: deleting a base (or more) from the sequence
Substitution: replacing a base in the sequence by another
An alignment just reflects the probable evolutionary history of the two genes, as it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes.
ALIGNMENT
- Pairwise alignment
- Multiple alignment
Why pairwise alignment?
- Pairwise alignment is used in database searches. BLAST & FASTA are essentially highly optimized versions of local pairwise alignment.
- Pairwise alignment is used to compute evolutionary distances, which are used to build phylogenetic trees.
- Pairwise alignment is used for sequence assembly in shotgun sequencing.
- Pairwise alignment underlies multiple alignment, which is used to find consensus patterns.
- Both amino acid sequences and nucleotide sequences are handled in much the same way.
Why multiple sequence alignment?
- Incorporation: Organize data to reflect sequence homology
- Phylogeny: Infer phylogenetic trees from homologous sites
- Motif: Highlight conserved sites/regions
- Structure prediction: Highlight variable sites/regions
- Extrapolation: Uncover changes in gene structure
- Profile: Summarize information
The process of aligning sequences is a game of playing off gaps against mismatches.
PAIRWISE ALIGNMENT
- Global alignment
- Local alignment
Global alignment means placing both complete sequences over one another to find maximum similarity, i.e. global alignment algorithms start at the beginning of the two sequences and add gaps to each until the end of one is reached.
Local alignment looks for maximum similarity within subsequences, i.e. local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
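To make the idea of a best-scoring local region concrete, here is a minimal sketch of the Smith-Waterman dynamic programming method. The slides do not name this algorithm, so it is brought in here purely as an illustration. The scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Illustrative scoring only: identical residues score `match`, different
    residues `mismatch`, and each gap position costs `gap` (linear penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, so an alignment can start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_cell
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("FLWRTWS", "SWKTWT"))
```

Unlike a global method, only the best-matching region is reported; residues outside it are left unaligned.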
Comparative Analysis of Alignment Techniques
Global Alignment
- Aligns the entire sequence
- Identifies all conserved residues
- Dynamic programming required
- Computationally intensive, much slower than local alignment
- E.g. Needleman & Wunsch method, GAP
Local Alignment
- Identifies short conserved sequences
- A complete alignment is not done
- May miss some important conserved residues
- E.g. BLAST, FASTP
Global vs. Local Alignment
A model for database searching score probabilities
- Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EDV) (Gumbel, 1955).
- Using the EDV, the raw alignment scores are converted to a statistical score (E value) that takes into account the database amino acid composition and the scoring scheme (a.a. exchange matrix).
Extreme Value Distribution
Probability density function for the extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$, i.e. $y = 1 - \exp(-e^{-x})$, where $\mu$ is the characteristic value and $\lambda$ is the decay constant. In general:
$y = 1 - \exp(-e^{-\lambda(x-\mu)})$
Extreme Value Distribution (EDV)
An optimal alignment of two sequences is selected out of many suboptimal alignments, and a database search is also about selecting the best alignment(s). This fits well with the EDV, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EDV an alignment has to score further away from the expected mean value to become a significant hit.
[Figure: real data vs. EDV approximation]
Extreme Value Distribution
The probability that a score S is at least a given value x can be calculated following the EDV as:
E-value: $P(S \geq x) = 1 - \exp(-e^{-\lambda(x-\mu)})$,
where $\mu = (\ln Kmn)/\lambda$, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).
Extreme Value Distribution
Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes
$P(S \geq x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
In practice, the probability $P(S \geq x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \geq x)$:
$P(S \geq x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$.
The lower the probability (E value) for a given threshold value x, the more significant the score S.
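The two forms of the probability can be compared numerically. The sketch below is only an illustration: the function names are hypothetical, the values chosen for K and lambda are placeholders rather than recommended parameters, and m and n are taken to be the query and database lengths, which the slides do not define.

```python
import math


def evd_tail_probability(x, k, lam, m, n):
    """P(S >= x) = 1 - exp(-e^{-lam (x - mu)}), with mu = ln(K m n) / lam."""
    mu = math.log(k * m * n) / lam
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (x - mu)))


def evd_tail_approximation(x, k, lam, m, n):
    """Large-x approximation: P(S >= x) ~ K m n e^{-lam x}."""
    return k * m * n * math.exp(-lam * x)


# Placeholder parameters, chosen only to exercise the formulas.
K, LAM, M, N = 0.041, 0.267, 250, 1_000_000

for raw_score in (60, 90, 120):
    exact = evd_tail_probability(raw_score, K, LAM, M, N)
    approx = evd_tail_approximation(raw_score, K, LAM, M, N)
    print(raw_score, exact, approx)
```

For small raw scores the approximation can exceed 1, while the exact form stays between 0 and 1; the two agree closely once the score is large.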
Normalised sequence similarity: statistical significance
- Database searching is commonly performed using an E-value threshold between 0.1 and 0.001.
- Lower E-value thresholds decrease the number of false positives in a database search but increase the number of false negatives, thereby lowering the sensitivity of the search.
FASTP: Local Alignment Tool
- The first widely used program (Lipman & Pearson, 1985 and onwards)
- Method based on lookup tables
- Lipman & Pearson, Science (1985) vol 227, 1435-41
Sequence 1   F  L  W  R  T  W  S
Sequence 2   S  W  K  T  W  T
Construction of the Lookup Table
Pos no.      1  2  3  4  5  6  7
Sequence 1   F  L  W  R  T  W  S
Sequence 2   S  W  K  T  W  T

Residue   Position in Seq 1   Position in Seq 2   Offset p1-p2 (p1,p2)
F         1                   -                   -
L         2                   -                   -
W         3,6                 2,5                 1 (3,2)   1 (6,5)   4 (6,2)   -2 (3,5)
R         4                   -                   -
T         5                   4,6                 1 (5,4)   -1 (5,6)
S         7                   1                   6 (7,1)
K         -                   3                   -
Calculation of Offset Frequency
Offset   Frequency
 1       3
 4       1
-1       1
-2       1
 6       1
Final Local Alignment
Pos no.      1  2  3  4  5  6  7
Sequence 1   F  L  W  R  T  W  S
Sequence 2   -  S  W  K  T  W  T
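The lookup-table and offset-counting steps can be sketched in a few lines. This is a simplified illustration of the idea on the slides, using single residues as lookup keys, and not the FASTP program itself; all function names are hypothetical.

```python
from collections import Counter, defaultdict


def positions_by_residue(seq):
    """Lookup table: residue -> list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        table[residue].append(pos)
    return table


def offset_frequencies(seq1, seq2):
    """Count each offset p1 - p2 over all pairs of identical residues."""
    lookup1 = positions_by_residue(seq1)
    lookup2 = positions_by_residue(seq2)
    counts = Counter()
    for residue, positions1 in lookup1.items():
        for p1 in positions1:
            for p2 in lookup2.get(residue, []):
                counts[p1 - p2] += 1
    return counts


def align_at_offset(seq1, seq2, offset):
    """Shift one sequence with leading '-' so matching residues line up."""
    if offset >= 0:
        return seq1, "-" * offset + seq2
    return "-" * (-offset) + seq1, seq2


counts = offset_frequencies("FLWRTWS", "SWKTWT")
best_offset, frequency = counts.most_common(1)[0]
row1, row2 = align_at_offset("FLWRTWS", "SWKTWT", best_offset)
print(dict(counts))
print(row1)
print(row2)
```

The most frequent offset identifies the diagonal with the most identities, which is the shift used in the final local alignment.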
NEEDLEMAN-WUNSCH Algorithm
- Needleman-Wunsch (1970) provided the first automatic method
- Dynamic programming to find a global alignment
Global alignment:
- For sequences that are single-domain
- For sequences that have not diverged
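A minimal sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch, is given below. The scoring scheme (match +1, mismatch -1, a fixed -2 per gap position) and the function name are assumptions made for illustration; practical tools use substitution matrices instead of a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("AGGVLIQVG", "AGGVLIIQVG")
print(total)
print(top)
print(bottom)
```

The table has (len(a) + 1) x (len(b) + 1) cells, each filled once, which is where the roughly N^2 cost of dynamic programming comes from.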
Gaps
What is the biological significance of gaps?
As explained earlier, changes that occur during evolution are categorized into 3 classes:
- Insertion
- Deletion
- Substitution
So regions where the residues of one sequence correspond to nothing in the other are interpreted as resulting from either an insertion in one sequence or a deletion from the other.
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.
Gaps in an alignment are represented as dashes (-).
Gaps
How long must gaps be allowed to be for an optimal alignment, and how should they be scored?
- Some gaps can be introduced into an alignment to compensate for insertions and deletions, but not too many.
- To prevent the accumulation of too many gaps in an alignment, introducing a gap causes the deduction of a fixed amount (the gap score) from the alignment score, so gaps occur in an alignment only when really needed.
- Adding a gap always lowers the alignment score, so the gap penalty is always negative.
For example:
Before introducing a gap:
AGGVLIQVG
AGGVLIIQVG
After introducing a gap:
AGGVL-IQVG
AGGVLIIQVG
Gaps
Two types of gap penalties:
Linear gap penalty: the gap opening (G) and gap extension (L) penalties are the same.
Affine gap penalty: the gap opening penalty is higher than the gap extension penalty.
Thus for a gap of length n, total deduction = G + (n-1) L
BLOSUM 62 matrix: -11 gap opening / -1 gap extension
BLOSUM 50 matrix: -12 gap opening / -1 gap extension
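The deduction formula can be checked with a short sketch. The function names are hypothetical, and penalties are passed as positive magnitudes to be subtracted from the alignment score, which is a convention chosen here for readability.

```python
def gap_deduction(n, opening, extension):
    """Total deduction for one gap of length n: G + (n - 1) * L."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return opening + (n - 1) * extension


def linear_gap_deduction(n, per_position):
    """Linear penalty: opening and extension cost the same, so n * penalty."""
    return gap_deduction(n, per_position, per_position)


# Affine penalty with the values quoted for BLOSUM 62 (opening 11, extension 1).
for length in (1, 2, 5):
    print(length, gap_deduction(length, opening=11, extension=1))
```

With an affine penalty, one long gap costs less than several short gaps of the same total length, which favours a single insertion or deletion event over many.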
Summary
- An alignment just reflects the probable evolutionary history of the two genes, as it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes.
- Changes that occur during evolution are categorized into 3 classes: insertion, deletion, substitution.
- Two types of alignment: global alignment and local alignment.
- Two types of gap penalties: linear gap penalty and affine gap penalty.

Sequence alignment belgaum
