Dynamic Programming :
Sequence Alignment
Rohan Prakash
2K17 / BT / 20
Flow of Presentation
What is Dynamic Programming
Example : Fibonacci Number
TopDown and BottomUp Dynamic Programming
Sequence alignment – best possible cost
Recursive / Dynamic approach ( Needleman – Wunsch )
Sequence alignment optimal Traceback
Smith Waterman ( brief )
What is Dynamic Programming
 Dynamic Programming is an optimization over plain recursion.
 If the recursive solution to a problem has repeating subproblems, then
we can store their answers and avoid recalculating the same subproblems.
Fibonacci Number
 Suppose we have a simple problem:
Q : Given an integer ‘n’, calculate the nth Fibonacci number. The Fibonacci series goes
like 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...
Simple recursive solution :
F(n) = F(n-1) + F(n-2), with base cases F(0) = 0 and F(1) = 1
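The recurrence above can be written directly as a short Python function. This is a minimal sketch; the name `fib_recursive` is only illustrative and does not come from the slides.

```python
def fib_recursive(n: int) -> int:
    """Plain recursion: base case + recursive logic, nothing stored."""
    # Base cases: F(0) = 0 and F(1) = 1
    if n < 2:
        return n
    # Recursive logic: sum of the previous two Fibonacci numbers
    return fib_recursive(n - 1) + fib_recursive(n - 2)


print(fib_recursive(12))  # 144
```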
Where is the problem ? What is the need
for Dynamic Programming ?
 The time complexity of the plain recursive solution is exponential, because we calculate the same
subproblems again and again.
 Here is the recursion tree for the 6th Fibonacci number
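One way to see the repetition without the picture is to count how often each subproblem is solved. The sketch below uses a hypothetical helper, `count_fib_calls`, that walks the same recursion and records every call in a `Counter`.

```python
from collections import Counter


def count_fib_calls(n: int, calls: Counter = None) -> Counter:
    """Walk the plain Fibonacci recursion and count calls per argument."""
    if calls is None:
        calls = Counter()
    calls[n] += 1
    if n >= 2:
        count_fib_calls(n - 1, calls)
        count_fib_calls(n - 2, calls)
    return calls


calls = count_fib_calls(6)
print(sum(calls.values()))  # 25 calls in total for F(6)
print(calls[2])             # F(2) alone is solved 5 times
```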
A better Way to solve : TopDown Approach
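The code for this slide is not part of the text, so here is a minimal sketch of what a top-down (memoized) version could look like. The dictionary `memo` and the function name are assumptions made for illustration.

```python
def fib_top_down(n: int, memo: dict = None) -> int:
    """Same recursion as before, plus a check for stored answers."""
    if memo is None:
        memo = {}
    if n < 2:               # base case
        return n
    if n in memo:           # previously calculated: reuse the answer
        return memo[n]
    # Not known yet: recurse, then store the answer before returning
    memo[n] = fib_top_down(n - 1, memo) + fib_top_down(n - 2, memo)
    return memo[n]


print(fib_top_down(50))  # 12586269025
```

Each value of n is now computed once, so the running time grows linearly with n instead of exponentially.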
BottomUp Dynamic Programming
 Start from the base case and build all the way up to the required solution
[Table: array indexed 0 to 6, with the base-case values 0 and 1 filled in at indices 0 and 1]
Another Better Way : Bottom Up Approach
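A bottom-up version fills that table from the base cases upward, with no recursion at all. The following is a sketch under the same assumptions as before; `fib_bottom_up` and `table` are illustrative names.

```python
def fib_bottom_up(n: int) -> int:
    """Fill a table from the base cases up to index n."""
    if n < 2:
        return n
    table = [0] * (n + 1)
    table[1] = 1            # base cases: table[0] = 0, table[1] = 1
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


print(fib_bottom_up(6))  # 8
```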
Best Possible Cost of Sequence Alignment ?
 Q : Given two sequences, find the best cost (score) of aligning them, if we allow a different cost for
mismatches, matches and gaps
Example : what is the best possible cost for aligning the two sequences
“GATCGGGAC” and “CGTACACACTAG” given the following costs
gapPenalty = 0 ,
mismatch = 0 and
match = 1
Output = “best possible cost is 5”
(with these costs the answer is the length of the longest common subsequence, for example GACAC)
Logic Building for Recursive solution
 Base case : what if one of the sequences is empty ?
=> then return the length of the other sequence * gapPenalty
 If the current characters do not match, what should we do ?
=> we should look at the following costs : cost of adding a gap in the 1st sequence , cost of
adding a gap in the 2nd sequence , and cost of a mismatch.
=> Then go with the one that gives the maximum cost
 If the current characters match, what should the program do ?
=> we should look at the following costs :
 cost of extending the 1st sequence (i.e. adding a gap in the 2nd ),
 cost of extending the 2nd sequence (i.e. adding a gap in the 1st ),
 cost of extending both sequences (i.e. recording the match)
=> Then go with the one that gives the maximum cost
Code your Recursive Solution
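The original code for this slide is not included in the text. The sketch below follows the logic from the previous slide; the function name and the default scores (taken from the first example: match = 1, mismatch = 0, gap = 0) are assumptions.

```python
def align_score_recursive(a: str, b: str,
                          match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Best alignment score of a and b by plain recursion (exponential time)."""
    # Base case: one sequence is empty, the rest of the other aligns to gaps
    if not a:
        return len(b) * gap
    if not b:
        return len(a) * gap
    # Score for pairing the two current characters
    pair = match if a[0] == b[0] else mismatch
    return max(
        # consume a character of a, gap in b
        align_score_recursive(a[1:], b, match, mismatch, gap) + gap,
        # consume a character of b, gap in a
        align_score_recursive(a, b[1:], match, mismatch, gap) + gap,
        # consume one character of each (match or mismatch)
        align_score_recursive(a[1:], b[1:], match, mismatch, gap) + pair,
    )


print(align_score_recursive("GATC", "GTC", match=1, mismatch=-1, gap=-2))  # 1
```

Only short inputs are practical here, because every call branches three ways.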
Again, where is the problem ?
We are solving the same subproblems again and again. This re-computation can be
avoided if we store the answer to each subproblem.
Dynamic Programming : TopDown
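A top-down version can reuse the same recursion and cache each subproblem by its pair of positions. This sketch uses `functools.lru_cache` as the store; the function names and index-based formulation are assumptions, not the presenter's code.

```python
from functools import lru_cache


def align_score_top_down(a: str, b: str,
                         match: int = 1, mismatch: int = 0, gap: int = 0) -> int:
    """Best alignment score with memoization over (i, j) positions."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Base case: one sequence is used up, the rest aligns to gaps
        if i == len(a):
            return (len(b) - j) * gap
        if j == len(b):
            return (len(a) - i) * gap
        pair = match if a[i] == b[j] else mismatch
        return max(
            solve(i + 1, j) + gap,       # gap in b
            solve(i, j + 1) + gap,       # gap in a
            solve(i + 1, j + 1) + pair,  # match or mismatch
        )

    return solve(0, 0)


print(align_score_top_down("GATCGGGAC", "CGTACACACTAG"))  # 5
```

There are only (len(a) + 1) * (len(b) + 1) distinct subproblems, so each is solved once.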
Dynamic Programming : BottomUp
In bottomUp approach we start with base case itself and work all the
way up
What our DP table looks like :
sequenceA = GATCGGGAC , sequenceB = CGTACACACTAG
MatchScore = 1
MisMatchScore = -1
GapScore = -2
 Step 1 : Base case filling (row 0 and column 0 hold index * GapScore)
 Step 2 : Matrix filling
=> If the characters match then
Maximum of ( topCell + gapPenalty,
leftCell + gapPenalty,
UpperLeftDiag + matchScore )
=> If the characters don’t match then
Maximum of ( topCell + gapPenalty,
leftCell + gapPenalty,
UpperLeftDiag + mismatchScore )
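The two steps can be sketched as a single table-filling function. The name `fill_nw_table` is hypothetical; the defaults use the scores from the DP table slide, with sequenceA on the rows and sequenceB on the columns.

```python
def fill_nw_table(a: str, b: str,
                  match: int = 1, mismatch: int = -1, gap: int = -2) -> list:
    """Return the filled Needleman-Wunsch table for a (rows) and b (columns)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Step 1: base case filling (a prefix aligned against nothing but gaps)
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    # Step 2: matrix filling
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j] + gap,       # topCell + gapPenalty
                dp[i][j - 1] + gap,       # leftCell + gapPenalty
                dp[i - 1][j - 1] + pair,  # UpperLeftDiag + match/mismatch
            )
    return dp


table = fill_nw_table("GATCGGGAC", "CGTACACACTAG")
print(table[-1][-1])  # score of the best global alignment
```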
How do we TraceBack ?
 Start from the Nth row , Mth column , i.e. the last cell , and repeat until the first cell DP(0 , 0) is reached
 If the characters match then record the characters in both sequences
 and move to upperLeftDiagonal DP(i-1 , j-1)
 Else check which of left + gapPenalty , top + gapPenalty and upperLeftDiagonal + mismatchScore gives the score of the current cell
If it is the Left cell then add the character of sequence2 ( horizontal one ) and add a
gap in sequence1 ( vertical one )
If it is the Top cell then add the character of sequence1 ( vertical one ) and add a
gap in sequence2 ( horizontal one )
If it is the Diagonal cell then add the corresponding characters in both sequences ;
this is our mismatch
Traceback
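The slide's code is not in the text, so the following is only a sketch of how matrix filling and traceback could fit together. It checks which neighbouring cell, with its score added, produced the current cell, and reverses the collected characters at the end. The function name and return shape are assumptions.

```python
def needleman_wunsch(a: str, b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap,
                           dp[i - 1][j - 1] + pair)

    # Traceback from the last cell to DP(0, 0)
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if dp[i][j] == dp[i - 1][j - 1] + pair:
                # Diagonal: record both characters (match or mismatch)
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            # Top: character of a (vertical), gap in b (horizontal)
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            # Left: character of b (horizontal), gap in a (vertical)
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    # The alignment was collected backwards, so reverse it
    return dp[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("GATC", "GTC"))  # (1, 'GATC', 'G-TC')
```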
More Improvement ...
Scoring Matrix :
Purines ( adenine and guanine ) are chemically similar , and
pyrimidines ( thymine and cytosine ) are also similar , so
a mismatch within the same group should be penalized less than a mismatch between a purine and a pyrimidine.
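As an illustration of this idea, a scoring function can replace the single mismatch score. The numeric values below (-1 within a group, -2 across groups) are hypothetical and chosen only for the example; they do not come from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x: str, y: str,
                       match: int = 1,
                       same_group: int = -1,   # hypothetical value
                       cross_group: int = -2   # hypothetical value
                       ) -> int:
    """Score a pair of bases, penalizing cross-group mismatches more."""
    if x == y:
        return match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return same_group   # purine-purine or pyrimidine-pyrimidine
    return cross_group      # purine-pyrimidine


print(substitution_score("A", "G"))  # -1
print(substitution_score("A", "C"))  # -2
```

In the matrix-filling step, the diagonal term would then use this function in place of the fixed match / mismatch scores.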
Smith-Waterman Algorithm ( Local alignment )
 This algorithm is very similar to the Needleman – Wunsch algorithm.
 In the Needleman – Wunsch algorithm we align the two complete sequences ( global alignment ).
 Whereas in Smith – Waterman we limit the minimum possible score of every
cell to zero , i.e. every negative value that we would get in Needleman – Wunsch is
replaced by zero , and the best local alignment ends at the cell with the highest score.
 Example : “GATCGATCGATC” and “CCGATCGATCCC” , gap = -2 , match = 1 ,
mismatch = -1.
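A minimal sketch of the scoring part, assuming the same matrix-filling rule as before with an extra 0 in the maximum. The first row and column stay at zero, and the answer is the largest cell anywhere in the table. The function name is illustrative.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                0,                        # never let a cell go negative
                dp[i - 1][j] + gap,
                dp[i][j - 1] + gap,
                dp[i - 1][j - 1] + pair,
            )
            best = max(best, dp[i][j])
    return best


print(smith_waterman_score("GATCGATCGATC", "CCGATCGATCCC"))  # 9
```

For the example sequences the score of 9 corresponds to the shared stretch CGATCGATC.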
Quick Recap
 How to Write recursive code
 Base case + Recursion logic
 Fibonacci Example
 What are subProblems
 How to avoid re-computations by storing answers to subProblems
 TopDown dynamic programming
 Check if previously calculated + RECURSION ( Base Case + Recursion Logic )
 BottomUp dynamic programming
 Start with base case itself and work all the way up
 Cost of optimal Sequence alignment Problem
 Recursion
 TopDown Dynamic Programming
 BottomUp Dynamic Programming
 Sequence Alignment Problem
 BottomUp Dynamic Programming
 Traceback
 Improvement – Scoring matrix for purines and pyrimidines
 Smith Waterman ( brief )
Thank You
Rohan Prakash 2k17/BT/20

Editor's Notes

  • #2 Discuss Dynamic programming + How to apply Dynamic Programming for sequence alignment
  • #3 What is DP , subproblems , topDown and BottomUp. Our first main focus will be on what dynamic programming is. The second main focus will be on sequence alignment and how Needleman – Wunsch works.
  • #5 The nth Fibonacci number can be calculated as the sum of the previous 2 Fibonacci numbers. The Fibonacci number for n equal to 0 is 0 , for 1 it is 1 , and all the rest can be calculated by the relation. #Recursion is base case + logic
  • #6 Here you can see what a repetitive subproblem means and where we are recomputing the same subproblem
  • #7 A point to notice : the topDown approach is nothing but the exact recursion with an added condition. If we have previously calculated the same problem then there is no need to recalculate it ; we can simply use the previously stored answer and avoid further recursive calls. Otherwise , if we don’t know the answer , we do the recursion and , before returning , store the calculated answer for future reference.
  • #9 Things to observe in BottomUp : we start from the base case. In TopDown we were performing recursion until we hit the base case , and then we built our solution by storing the answers to all the problems we had solved while traversing up. But in BottomUp we don’t perform recursion ; we start with the base case itself and work all the way up. ( Now EXPLAIN CODE )
  • #10 You know Recursion is Base case + recursive logic, so what should be our base case and what should be our recursive logic ?
  • #12 Where is the problem in this code ? Again the problem is with optimization. We are re-calculating the same subproblems again and again. At every recursion step we make 3 recursive calls : the cost of adding a gap in sequenceA , the cost of adding a gap in sequenceB , and the cost of a match / mismatch of the current characters. Here is how the recursion tree is going to look for 2 sequences of length 3
  • #13 Let us write the TopDown Dynamic Programming solution. To avoid recalculation , we will store the answer to every subproblem.
  • #14 [explain the code] Now we know the recursive algorithm and we have done the TopDown approach ; let us see how BottomUp DP is going to look. Again , recall that in BottomUp DP we start from the base case itself and work all the way up
  • #15 We will look at the code later ; first show them the algorithm
  • #17 (slow) Let me read out the algorithm again
  • #18 This is a beautiful GUI-based representation of the DP table. This is the exact same thing ; we will now see how to trace back the sequence
  • #20 [ show them complete code in IDE ] In the end we need to reverse the result. This completes our traceback , and we are now coming towards the end of the presentation. Before moving further , let us see the whole code again : matrix filling and traceback
  • #21 Some sources of improvement : scoring matrix. A purine-purine mismatch should be given a higher score than a purine-pyrimidine mismatch , and likewise for pyrimidines
  • #23 This finishes my presentation , a Quick recap