Needleman-Wunsch

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
The Needleman-Wunsch algorithm
• Even for relatively short sequences, there are lots of
  possible alignments
  It will take you (or a computer) a long time to assess each alignment
         one-by-one to find the best alignment
• The problem of finding the best possible alignment
  for 2 sequences is solved by the Needleman-Wunsch
  algorithm
  The N-W algorithm was proposed by Christian Wunsch & Saul
  Needleman, 1970
• The N-W algorithm is mathematically proven to find
  the best alignment of 2 sequences
  By the ‘best’ alignment, we mean the alignment that implies the fewest
  mutations between the 2 sequences
The Needleman-Wunsch algorithm
• The Needleman-Wunsch algorithm saves us the
  trouble of assessing all the many possible alignments
  to find the best one
• The N-W algorithm takes time proportional to $n^2$ to
  find the best alignment of two sequences that are
  both n letters long
  In contrast, assessing all possible alignments one-by-one
        would take time proportional to $\binom{2n}{n}$
  $n^2$ is much smaller than $\binom{2n}{n}$, so N-W is much faster than
        assessing all possible alignments one-by-one
  eg. for n=11, $n^2$=121 and $\binom{2n}{n}$=705432, so N-W is ~5830-fold
  faster (705432/121) than assessing all alignments
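The n=11 comparison above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `speedup` is made up for this example, and it compares the two growth terms directly rather than measuring real running times.

```python
from math import comb


def speedup(n: int) -> float:
    """Ratio of C(2n, n) (one-by-one assessment) to n^2 (N-W) for length n."""
    return comb(2 * n, n) / (n * n)


n = 11
print(n * n)              # 121
print(comb(2 * n, n))     # 705432
print(round(speedup(n)))  # 5830
```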
Problem
• How many times faster is it to find the best
  alignment for sequences “RQQEP” & “QQESP” using
  the Needleman-Wunsch algorithm, compared to
  assessing each possible alignment one-by-one?
Answer
• How many times faster is it to find the best
  alignment for sequences “RQQEP” & “QQESP” using
  the Needleman-Wunsch algorithm, compared to
  assessing each possible alignment one-by-one?
  The sequence length, n, is 5 here
  This means it will take time proportional to $n^2$=25 to find the best
  alignment using N-W
  It will take time proportional to $\binom{2n}{n}$ = $\binom{10}{5}$ = 252 to find the best alignment
         by assessing each possible alignment one-by-one
  This means that we can find the best alignment about 10 times
  (=252/25) faster by using N-W
Explanation of the N-W algorithm
• In the following explanation, we’ll refer to the ith
  letter in sequence S1 as S1(i)
• Similarly, we’ll refer to the jth letter in sequence S2
  as S2(j)
  eg. for sequences ‘VIVADAVIS’ and ‘VIVALASVEGAS’:
  Sequence S1:  i =  1  2  3  4  5  6  7  8  9
                     V  I  V  A  D  A  V  I  S
  Sequence S2:  j =  1  2  3  4  5  6  7  8  9 10 11 12
                     V  I  V  A  L  A  S  V  E  G  A  S
  For example, S1(5) = ‘D’, S2(3) = ‘V’
• To use N-W, we must first define:
1   A scoring function (σ): defines the score to give to a substitution
    mutation, eg. +1 for a match, -1 for a mismatch
2   A gap penalty: defines the score to give to an insertion or
    deletion mutation, eg. -1
3   A recurrence relation: defines what actions we repeat at each iteration
    (step) of the algorithm; for N-W this is (it will be explained later):
                    T(i-1, j-1) + σ(S1(i), S2(j))
    T(i, j) = max   T(i-1, j) + gap penalty
                    T(i, j-1) + gap penalty
• There are 2 parts to computing the best alignment
  using the N-W algorithm:
1 Fill up a matrix (table) T using the recurrence relation
2 The traceback step : use the filled-in matrix T to work out the best
    alignment
Scoring functions
• We define a scoring function σ(S1(i), S2(j)) for pairs of
  amino acids or nucleotides (S1(i), S2(j))
  σ(S1(i), S2(j)) is the cost (score) of aligning symbols S1(i) & S2(j)
  ie. σ(S1(i), S2(j)) is the cost (score) for a substitution mutation from
  S1(i) → S2(j)
• A simple scoring function σ is a score of +1 for
  matches, and -1 for mismatches
  This can be written as (the symbol ∀ means ‘for all’):
  σ(a,b) = +1 ∀ a=b, and σ(a,b) = -1 ∀ a≠b
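The simple scoring function can be written as a one-line Python function. This is a minimal sketch; the function name `sigma` and its `match`/`mismatch` parameters are naming choices for this example, not part of the slides.

```python
def sigma(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score for aligning letter a with letter b: +1 if equal, -1 otherwise."""
    return match if a == b else mismatch


print(sigma("A", "A"))  # 1
print(sigma("A", "G"))  # -1
```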
• A convenient way of representing many scoring
  functions is a substitution matrix
  This shows the cost (score) of aligning one letter (nucleotide or amino
        acid) with another letter
• Substitution matrix for a scoring function that assigns
  +1 to matches, and -1 to mismatches:
  σ(a,b) now refers to an entry in the substitution matrix

  Substitution matrix σ for DNA alignments (rows: letter a, columns: letter b)
          A    C    G    T
    A    +1   -1   -1   -1
    C    -1   +1   -1   -1
    G    -1   -1   +1   -1
    T    -1   -1   -1   +1

  Substitution matrix σ for protein alignments (rows: letter a, columns: letter b)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    R -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    N -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    D -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    C -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    Q -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    E -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    G -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    H -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    I -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    L -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    K -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1
    M -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1
    F -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1 -1
    P -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1 -1
    S -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
    T -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
    W -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1
    Y -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
    V -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1
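A substitution matrix of this kind can be held in code as a lookup table. The sketch below builds the +1/-1 matrices as nested dictionaries; the helper name `identity_matrix` and the variable names are hypothetical, chosen only for this illustration.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # same letter order as the protein matrix
NUCLEOTIDES = "ACGT"


def identity_matrix(alphabet, match=1, mismatch=-1):
    """Build sigma as a nested dict so that sigma[a][b] is the score for a vs b."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


dna_sigma = identity_matrix(NUCLEOTIDES)
protein_sigma = identity_matrix(AMINO_ACIDS)

print(dna_sigma["A"]["A"], dna_sigma["A"]["G"])          # 1 -1
print(protein_sigma["W"]["W"], protein_sigma["W"]["Y"])  # 1 -1
```

A table like this makes it easy to swap in a different scoring scheme later without changing the rest of the algorithm.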
N-W: Initialising table T
• To align 2 sequences S1 & S2 of lengths m & n, N-W starts by building a
  table T with m+1 columns & n+1 rows:
  eg. for S1=‘TGGTG’ & S2=‘ATCGT’, m=5 and n=5
• We number the columns i=0,1,2,...,m
  We number the rows j=0,1,2,...,n
                  i=0  i=1  i=2  i=3  i=4  i=5
                        T    G    G    T    G
             j=0
             j=1  A
  Table T    j=2  T
             j=3  C
             j=4  G
             j=5  T
• T(i, j) is the cell at the intersection of column i and row j
  eg. for S1=‘TGGTG’ & S2=‘ATCGT’, m=5 and n=5:
                  i=0  i=1  i=2  i=3     i=4  i=5
                        T    G    G       T    G
             j=0
             j=1  A
  Table T    j=2  T              T(3,2)
             j=3  C
             j=4  G
             j=5  T
• The N-W algorithm starts by initialising (setting the initial value of) T(0,0) to
  zero:
              T    G    G    T    G
         0
    A
    T
    C
    G
    T
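The initialisation step can be sketched in Python as follows. The storage layout is an assumption made for this example: `T[i][j]` mirrors the slides' T(i, j), so the first index is the column and the second is the row, and unfilled cells are marked with `None`.

```python
S1 = "TGGTG"  # labels the columns, i = 1..m
S2 = "ATCGT"  # labels the rows,    j = 1..n
m, n = len(S1), len(S2)

# T has m+1 columns and n+1 rows; T[i][j] corresponds to T(i, j) in the slides.
T = [[None] * (n + 1) for _ in range(m + 1)]
T[0][0] = 0  # the only cell that is set before the recurrence is applied

print(len(T), len(T[0]))  # 6 6
print(T[0][0], T[3][2])   # 0 None
```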
• The table T is then filled in using the recurrence relation (this will be
  explained in a minute):
                    T(i-1, j-1) + σ(S1(i), S2(j))
    T(i, j) = max   T(i-1, j) + gap penalty
                    T(i, j-1) + gap penalty
• The table is filled in from left to right, and from top to bottom
• The value of T(0,0) is set to zero at the start (initialised to 0)
• We first calculate the values of T(i, j) for row 0 of the table, from left to
  right
• We then calculate the values of T(i, j) for row 1 of the table from left to
  right, then rows 2, 3, 4 .... row n of the table
              T    G    G    T    G
         0    x    x    x    x    x
    A    x    x    x    x    x    x
    T    x    x    x    x    x    x
    C    x    x    x    x    x    x
    G    x    x    x    x    x    x
    T    x    x    x    x    x    x
• The table T is then filled in using the recurrence relation:
                    T(i-1, j-1) + σ(S1(i), S2(j))    (1)
    T(i, j) = max   T(i-1, j) + gap penalty          (2)
                    T(i, j-1) + gap penalty          (3)
• This means that the value in cell T(i, j) is set to be the maximum of the
  three possibilities (1), (2), (3), where:
  T(i-1, j-1) is the value in the previous column & previous row
  T(i-1, j) is the value in the previous column & same row
  T(i, j-1) is the value in the same column & previous row
  eg. for T(i, j) = T(3,2):
                  i=0  i=1  i=2     i=3     i=4  i=5
                        T    G       G       T    G
             j=0
             j=1  A         T(2,1)  T(3,1)
  Table T    j=2  T         T(2,2)  T(3,2)
             j=3  C
             j=4  G
             j=5  T
  T(i-1, j-1) = T(2,1)
  T(i-1, j) = T(2,2)
  T(i, j-1) = T(3,1)
•    The table T is then filled in using the recurrence relation:
                     T(i-1, j-1) + σ(S1(i), S2(j))    (1)
     T(i, j) = max   T(i-1, j) + gap penalty          (2)
                     T(i, j-1) + gap penalty          (3)
•    This means that the value in cell T(i, j) is set to be the maximum of these
     three possibilities (1), (2), (3), where
     gap penalty is the score that we have decided to use for an insertion or
             deletion mutation, for example -1
     σ(S1(i), S2(j)) is the cost (score) that we have decided to use for
     aligning symbols S1(i) & S2(j), in our substitution matrix σ
     eg. using +1 for matches and -1 for mismatches:
     Substitution matrix σ for DNA alignments (rows: letter a, columns: letter b)
             A    C    G    T
       A    +1   -1   -1   -1
       C    -1   +1   -1   -1
       G    -1   -1   +1   -1
       T    -1   -1   -1   +1
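As a rough sketch, the recurrence for a single cell can be written as a small Python function. The names `sigma` and `cell_value` and the dictionary used to hold the table are assumptions for this example only; the tiny table below uses made-up starting values that are consistent with a gap penalty of -1.

```python
def sigma(a, b, match=1, mismatch=-1):
    """Substitution score: +1 for a match, -1 for a mismatch."""
    return match if a == b else mismatch


def cell_value(T, i, j, S1, S2, gap=-1):
    """Value of T(i, j) given its three already-filled neighbours."""
    diagonal = T[(i - 1, j - 1)] + sigma(S1[i - 1], S2[j - 1])  # possibility (1)
    left = T[(i - 1, j)] + gap                                  # possibility (2)
    above = T[(i, j - 1)] + gap                                 # possibility (3)
    return max(diagonal, left, above)


# Illustrative neighbours of T(1, 1), assuming a gap penalty of -1.
T = {(0, 0): 0, (1, 0): -1, (0, 1): -1}
print(cell_value(T, 1, 1, "T", "A"))  # -1  (mismatch on the diagonal)
print(cell_value(T, 1, 1, "A", "A"))  # 1   (match on the diagonal)
```

Note that Python strings are indexed from 0, so the slides' S1(i) becomes `S1[i - 1]`.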
•   For example, say we decide to use +1 for matches, -1 for mismatches, and
    -2 for an insertion/deletion (gap)
•   The N-W algorithm starts by initialising (setting the initial value of) T(0,0)
    to zero
•   We next calculate the value of T(1, 0)
•   The value of T(1,0) is set to the maximum of 3 possibilities:
                     T(i-1, j-1) + σ(S1(i), S2(j))   Not defined here
    T(i, j) = max    T(i-1, j) + gap penalty         = 0 – 2 = -2
                      T(i, j-1) + gap penalty        Not defined here
•   We calculate this to be -2, so set T(1, 0) to -2
•   We record which previous cell was used to set the value of T(1, 0)
    (here T(0, 0), the cell to its left):

              T    G    G    T    G
         0   -2
    A
    T
    C
    G
    T
•   We next calculate the value of T(2, 0)
•   The value of T(2,0) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    Not defined here
    T(i, j) = max   T(i-1, j) + gap penalty          = -2 – 2 = -4
                    T(i, j-1) + gap penalty          Not defined here
•   We calculate this to be -4, so set T(2, 0) to -4
•   We record which previous cell was used to set the value of T(2, 0)
    (here T(1, 0), the cell to its left):

              T    G    G    T    G
         0   -2   -4
    A
    T
    C
    G
    T
•   We next calculate the value of T(3, 0)
•   The value of T(3,0) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    Not defined here
    T(i, j) = max   T(i-1, j) + gap penalty          = -4 – 2 = -6
                    T(i, j-1) + gap penalty          Not defined here
•   We calculate this to be -6, so set T(3, 0) to -6
•   We record which previous cell was used to set the value of T(3, 0)
    (here T(2, 0), the cell to its left):

              T    G    G    T    G
         0   -2   -4   -6
    A
    T
    C
    G
    T
•   We next calculate the value of T(4, 0)
•   The value of T(4,0) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    Not defined here
    T(i, j) = max   T(i-1, j) + gap penalty          = -6 – 2 = -8
                    T(i, j-1) + gap penalty          Not defined here
•   We calculate this to be -8, so set T(4, 0) to -8
•   We record which previous cell was used to set the value of T(4, 0)
    (here T(3, 0), the cell to its left):

              T    G    G    T    G
         0   -2   -4   -6   -8
    A
    T
    C
    G
    T
•   We next calculate the value of T(5, 0)
•   The value of T(5,0) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))      Not defined here
    T(i, j) = max   T(i-1, j) + gap penalty            = -8 – 2 = -10
                    T(i, j-1) + gap penalty            Not defined here
•   We calculate this to be -10, so set T(5, 0) to -10
•   We record which previous cell was used to set the value of T(5, 0)
    (here T(4, 0), the cell to its left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A
    T
    C
    G
    T
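The five steps above all apply the same rule, since only the cell to the left is defined in row 0. A minimal Python sketch of that loop, assuming the gap penalty of -2 used in this example (the variable names are illustrative):

```python
GAP = -2
S1 = "TGGTG"

# In row 0 each cell is the value of its left neighbour plus the gap penalty.
row0 = [0]  # T(0, 0)
for i in range(1, len(S1) + 1):
    row0.append(row0[i - 1] + GAP)

print(row0)  # [0, -2, -4, -6, -8, -10]
```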
•   We next calculate the value of T(0, 1)
•   The value of T(0,1) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    Not defined here
    T(i, j) = max   T(i-1, j) + gap penalty          Not defined here
                    T(i, j-1) + gap penalty          = 0 – 2 = -2
•   We calculate this to be -2, so set T(0, 1) to -2
•   We record which previous cell was used to set the value of T(0, 1)
    (here T(0, 0), the cell above it):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2
    T
    C
    G
    T
•   We next calculate the value of T(1, 1)
•   The value of T(1,1) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    = 0 – 1 = -1
    T(i, j) = max   T(i-1, j) + gap penalty          = -2 -2 = -4
                    T(i, j-1) + gap penalty          = -2 -2 = -4
•   We calculate this to be -1, so set T(1, 1) to -1
•   We record which previous cell was used to set the value of T(1, 1)
    (here T(0, 0), the cell diagonally above and to the left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1
    T
    C
    G
    T
• We next calculate the value of T(2, 1)
• The value of T(2,1) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    = -2 - 1 = -3
    T(i, j) = max   T(i-1, j) + gap penalty          = -1 - 2 = -3
                    T(i, j-1) + gap penalty          = -4 - 2 = -6
• We calculate this to be -3, so set T(2, 1) to -3
• We record which previous cells were used to set the value of T(2, 1) (two
  different cells here: T(1, 0) on the diagonal and T(1, 1) to the left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3
    T
    C
    G
    T
• We next calculate the value of T(3, 1)
• The value of T(3,1) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    = -4 - 1 = -5
    T(i, j) = max   T(i-1, j) + gap penalty          = -3 - 2 = -5
                    T(i, j-1) + gap penalty          = -6 - 2 = -8
• We calculate this to be -5, so set T(3, 1) to -5
• We record which previous cells were used to set the value of T(3, 1) (two
  different cells here: T(2, 0) on the diagonal and T(2, 1) to the left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5
    T
    C
    G
    T
• We next calculate the value of T(4, 1)
• The value of T(4,1) is set to the maximum of 3 possibilities:
                    T(i-1, j-1) + σ(S1(i), S2(j))    = -6 - 1 = -7
    T(i, j) = max   T(i-1, j) + gap penalty          = -5 - 2 = -7
                    T(i, j-1) + gap penalty          = -8 - 2 = -10
• We calculate this to be -7, so set T(4, 1) to -7
• We record which previous cells were used to set the value of T(4, 1) (two
  different cells here: T(3, 0) on the diagonal and T(3, 1) to the left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7
    T
    C
    G
    T
•   We next calculate the value of T(5, 1)
•   The value of T(5,1) is set to the maximum of 3 possibilities:
                     T(i-1, j-1) + σ(S1(i), S2(j))   = -8 -1 = -9
    T(i, j) = max    T(i-1, j) + gap penalty         = -7 -2 = -9
                     T(i, j-1) + gap penalty         = -10 -2 = -12
•   We calculate this to be -9, so set T(5, 1) to -9
•   We record which previous cells were used to set the value of T(5, 1) (two
    different cells here: T(4, 0) on the diagonal and T(4, 1) to the left):

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T
    C
    G
    T
Problem
• Fill in the next row of matrix T
              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T    ?    ?    ?    ?    ?    ?
    C
    G
    T
Answer
• Fill in the next row of matrix T
              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T   -4   -1   -2   -4   -4   -6
    C
    G
    T
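The whole filling procedure can be sketched in Python and used to check the rows above. This is an illustrative sketch rather than a reference implementation: the names `sigma` and `fill_table` are made up, `T[i][j]` mirrors the slides' T(i, j) (column first, row second), and the scores are the +1/-1/-2 used in this example.

```python
def sigma(a, b, match=1, mismatch=-1):
    """Substitution score: +1 for a match, -1 for a mismatch."""
    return match if a == b else mismatch


def fill_table(S1, S2, gap=-2):
    """Fill T row by row, left to right, using the N-W recurrence."""
    m, n = len(S1), len(S2)
    T = [[0] * (n + 1) for _ in range(m + 1)]  # T[0][0] stays 0
    for j in range(n + 1):          # rows, top to bottom
        for i in range(m + 1):      # columns, left to right
            if i == 0 and j == 0:
                continue
            options = []            # only the possibilities that are defined
            if i > 0 and j > 0:
                options.append(T[i - 1][j - 1] + sigma(S1[i - 1], S2[j - 1]))
            if i > 0:
                options.append(T[i - 1][j] + gap)
            if j > 0:
                options.append(T[i][j - 1] + gap)
            T[i][j] = max(options)
    return T


T = fill_table("TGGTG", "ATCGT")


def row(j):
    """Return row j of the table as a list, for printing."""
    return [T[i][j] for i in range(len(T))]


print(row(1))  # [-2, -1, -3, -5, -7, -9]
print(row(2))  # [-4, -1, -2, -4, -4, -6]
```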
N-W: the traceback step
•   When we have filled in the whole of matrix T, it looks like:
              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T   -4   -1   -2   -4   -4   -6
    C   -6   -3   -2   -3   -5   -5
    G   -8   -5   -2   -1   -3   -4
    T  -10   -7   -4   -3    0   -2


•   In the traceback step we use the filled-in matrix T to work out the best
    alignment between the two sequences S1 & S2
•   We start at the bottom right cell of matrix T
•   We then follow the arrow to the previous cell used to calculate the best
    value for that cell
•   From there, follow the arrow to the previous cell... and so on..
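•   The path through matrix T is the traceback (shown in pink in the original
    slides); here it passes through the cells T(0,0), T(0,1), T(1,2), T(2,3),
    T(3,4), T(4,5) and T(5,5), with sequence S1 along the top and sequence S2
    down the side:

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T   -4   -1   -2   -4   -4   -6
    C   -6   -3   -2   -3   -5   -5
    G   -8   -5   -2   -1   -3   -4
    T  -10   -7   -4   -3    0   -2

    The corresponding alignment is:
        - T G G T G
          |   | |
        A T C G T -
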

•   To work out the best alignment, follow the traceback from top left to
    bottom right, & look at the letters aligned in each cell
•   Here the 1st cell doesn’t correspond to any letter
•   The 2nd cell is ‘A’ in sequence S2 but nothing in sequence S1
•   The 3rd cell is ‘T’ in sequence S2 and ‘T’ in sequence S1
•   The 4th cell is ‘C’ in sequence S2 and ‘G’ in sequence S1
•   The 5th cell is ‘G’ in sequence S2 and ‘G’ in sequence S1
•   The 6th cell is ‘T’ in sequence S2 and ‘T’ in sequence S1
•   The 7th cell is nothing in sequence S2 and ‘G’ in sequence S1
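The traceback can be sketched in Python as shown below. Instead of storing arrows while filling, this sketch re-derives at each cell which neighbour could have produced its value; that is a simplification chosen for brevity, and when several neighbours qualify it prefers the diagonal, so it returns only one best alignment. The function names are assumptions for this example.

```python
def sigma(a, b, match=1, mismatch=-1):
    return match if a == b else mismatch


def fill_table(S1, S2, gap=-2):
    """Fill T so that T[i][j] is the slides' T(i, j)."""
    m, n = len(S1), len(S2)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        for i in range(m + 1):
            options = []
            if i > 0 and j > 0:
                options.append(T[i - 1][j - 1] + sigma(S1[i - 1], S2[j - 1]))
            if i > 0:
                options.append(T[i - 1][j] + gap)
            if j > 0:
                options.append(T[i][j - 1] + gap)
            if options:
                T[i][j] = max(options)
    return T


def traceback(T, S1, S2, gap=-2):
    """Walk from the bottom right cell back to T(0, 0), collecting letters."""
    i, j = len(S1), len(S2)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sigma(S1[i - 1], S2[j - 1]):
            top.append(S1[i - 1]); bottom.append(S2[j - 1])  # letters aligned
            i, j = i - 1, j - 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            top.append(S1[i - 1]); bottom.append("-")        # gap in S2
            i -= 1
        else:
            top.append("-"); bottom.append(S2[j - 1])        # gap in S1
            j -= 1
    # The letters were collected backwards, so reverse them.
    return "".join(reversed(top)), "".join(reversed(bottom))


S1, S2 = "TGGTG", "ATCGT"
aligned = traceback(fill_table(S1, S2), S1, S2)
print(aligned[0])  # -TGGTG
print(aligned[1])  # ATCGT-
```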
Problem
• The traceback is marked with * in the matrix T below (it is shown in
  pink in the original slides). What is the best alignment?
              A    C    C    T
         *    *    x    x    x
    C    x    x    *    *    x
    T    x    x    x    x    *
    G    x    x    x    x    *
Answer
• The traceback is marked with * in the matrix T below (it is shown in
  pink in the original slides). What is the best alignment?
              A    C    C    T
         *    *    x    x    x
    C    x    x    *    *    x
    T    x    x    x    x    *
    G    x    x    x    x    *

• It is:            A C C T -
                      |   |
                    - C - T G
• The Needleman-Wunsch algorithm uses an approach
  called dynamic programming (d.p.)
  d.p. algorithms solve problems by breaking a large problem into smaller
        easy problems of a similar type
  The N-W algorithm works by progressively building optimal
  alignments of longer and longer subsequences of S1 & S2

• N-W finds the best alignment between 2 sequences
  by iteratively (repeatedly):
  i. taking the 1st i letters of sequence S1 and the 1st j letters of
  sequence S2, for a particular i and j
  ii. getting the score of the best alignment of the 2 subsequences
  This is what we are doing when we are filling matrix T
  If S1 is m letters long, & S2 is n letters long, we need to do this for all m×n
  possible pairs of subsequences of S1 and S2
  So N-W takes time proportional to m×n to run (or $n^2$, if m=n)
• During the N-W algorithm we assign scores to
  alignments of subsequences of S1 and S2
  We store the score for an alignment of the 1st i letters of S1 to the first j
        letters of S2 in cell T(i, j)
  So, after filling T, the bottom right cell will contain the score for the
        best alignment between S1 and S2
  This is just the sum of the scores for the matches, mismatches and
        gaps in the best alignment:
  eg. the best alignment of ‘TGGTG’ and ‘ATCGT’ using a score of +1 for a
        match, -1 for a mismatch and -2 for a gap:

              T    G    G    T    G
         0   -2   -4   -6   -8  -10
    A   -2   -1   -3   -5   -7   -9
    T   -4   -1   -2   -4   -4   -6
    C   -6   -3   -2   -3   -5   -5
    G   -8   -5   -2   -1   -3   -4
    T  -10   -7   -4   -3    0   -2

        - T G G T G
          |   | |
        A T C G T -

  The best alignment has:
  • 3 matches (score +3)
  • 1 mismatch (score -1)
  • 2 gaps (score -4)
  → Score = 3-1-4 = -2
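The sum of scores can be checked directly from the two aligned strings. A minimal sketch, assuming gaps are written as '-' and using the same +1/-1/-2 scores (the function name `alignment_score` is made up for this example):

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the scores of every column of a finished alignment."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("-TGGTG", "ATCGT-"))  # -2
```

The result agrees with the value in the bottom right cell of matrix T.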
Software for making alignments
• For Needleman-Wunsch pairwise alignment
 pairwiseAlignment() in the “Biostrings” R library
 the EMBOSS (emboss.sourceforge.net/) needle program
Problem
• How many times faster is it to find the best
  alignment for sequences “RQQEPVRSTC” &
  “QQESGPVRST” using the Needleman-Wunsch
  algorithm, compared to assessing each possible
  alignment one-by-one?
Answer
• How many times faster is it to find the best
  alignment for sequences “RQQEPVRSTC” &
  “QQESGPVRST” using the N-W algorithm, compared
  to assessing each possible alignment one-by-one?
  The sequence length, n, is 10 here
  This means it will take time proportional to $n^2$=100 to find the best
  alignment using N-W
  It will take time proportional to $\binom{2n}{n}$ = $\binom{20}{10}$ = 184,756 to find the best
  alignment by assessing each possible alignment one-by-one
  We can find the best alignment about 1848 times (=184756/100)
  faster by using N-W
Problem
• Find the best alignment between the sequences
  “WHAT” and “WHY”, using the Needleman-Wunsch
  algorithm, with +1 for a match, -1 for a mismatch,
  and -2 for a gap.
Answer
• Find the best alignment between “WHAT” & “WHY”
  using N-W with match:+1, mismatch:-1, gap:-2
•   Matrix T looks like this, giving 2 possible tracebacks:
              W    H    A    T
         0   -2   -4   -6   -8
    W   -2    1   -1   -3   -5
    H   -4   -1    2    0   -2
    Y   -6   -3    0    1   -1

•   The two possible tracebacks give two equally good best alignments:
        W H A T                        W H A T
        | |                            | |
        W H - Y                        W H Y -
    (Pink traceback)              (Orange traceback)
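When more than one previous cell gives the maximum, following every such cell yields all equally good alignments. The sketch below does this recursively for ‘WHAT’ and ‘WHY’; the function names are assumptions for this example, and the recursion is only practical for short sequences because the number of tied tracebacks can grow quickly.

```python
def sigma(a, b, match=1, mismatch=-1):
    return match if a == b else mismatch


def fill_table(S1, S2, gap=-2):
    """Fill T so that T[i][j] is the slides' T(i, j)."""
    m, n = len(S1), len(S2)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        for i in range(m + 1):
            options = []
            if i > 0 and j > 0:
                options.append(T[i - 1][j - 1] + sigma(S1[i - 1], S2[j - 1]))
            if i > 0:
                options.append(T[i - 1][j] + gap)
            if j > 0:
                options.append(T[i][j - 1] + gap)
            if options:
                T[i][j] = max(options)
    return T


def all_alignments(S1, S2, gap=-2):
    """Return every best alignment by following all tied previous cells."""
    T = fill_table(S1, S2, gap)

    def walk(i, j):
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sigma(S1[i - 1], S2[j - 1]):
            for a, b in walk(i - 1, j - 1):
                yield a + S1[i - 1], b + S2[j - 1]
        if i > 0 and T[i][j] == T[i - 1][j] + gap:
            for a, b in walk(i - 1, j):
                yield a + S1[i - 1], b + "-"
        if j > 0 and T[i][j] == T[i][j - 1] + gap:
            for a, b in walk(i, j - 1):
                yield a + "-", b + S2[j - 1]

    return list(walk(len(S1), len(S2)))


for top, bottom in all_alignments("WHAT", "WHY"):
    print(top, bottom)
# WHAT WH-Y
# WHAT WHY-
```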
Further Reading
•   Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
•   Chapter 6 in Deonier et al, Computational Genome Analysis
•   Practical on pairwise alignment in R in the Little Book of R for
    Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html


Editor's Notes

  1. factorial(5*2) / (factorial(5) * factorial(5)) = 252
  2. factorial(10*2) / (factorial(10) * factorial(10)) = 184756
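The two expressions in these notes are written in R. As a rough cross-check, the same counts and the speed-up ratios quoted in the slides can be reproduced in Python. The function names `brute_force_count` and `speedup` are hypothetical, and the sketch takes the slides' cost model at face value: time proportional to n² for N-W and to the binomial coefficient (2n choose n) for assessing every alignment.

```python
import math


def brute_force_count(n):
    """Number of alignments to assess one-by-one, per the slides: C(2n, n)."""
    return math.comb(2 * n, n)


def speedup(n):
    """How many times faster N-W is than brute force under the slides' model."""
    return brute_force_count(n) / n ** 2


for n in (5, 10, 11):
    print(n, brute_force_count(n), round(speedup(n), 2))
```

`math.comb` requires Python 3.8 or later; on older versions the factorial form used in the notes works equally well.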