Analysis of Dynamic Programming Algorithms in
Bioinformatics – A Big Data Perspective
Vineetha V and Dr. Achuthsankar S. Nair
Department of Computational Biology & Bioinformatics, University of Kerala,
Thiruvananthapuram, Kerala
Contact
Vineetha V
Dept. of Computational Biology and Bioinformatics, University of Kerala
Email: vineevishnu@gmail.com
Phone: 9446175215
DP Algorithms in Bioinformatics
In bioinformatics, DP algorithms are used mainly in sequence alignment problems such as:
- Pairwise Sequence Alignment
- Multiple Sequence Alignment
- Gene Prediction
Approximation with a performance guarantee, in polynomial time.
Approximation algorithms run in polynomial time but do not
necessarily find an optimal solution.
Instead, for a minimization problem, they guarantee:
(objective value of the solution found) ≤ α · (optimal objective value)
where α ≥ 1 is the approximation ratio; the closer α is to 1, the better.
Center Star Method (approximation ratio 2)
Let D(X, Y) be the optimal pairwise alignment distance between X and Y.
Sc ∈ S is said to be the center string of S if it minimizes:
$\sum_{i=1}^{k} D(S_c, S_i)$
- Identify the center string Sc of S.
- Use the alignments of Sc with each Si to create a multiple alignment.
Running time, for k sequences each of length n:
- Computing the optimal distance D(Si, Sj) for all i, j: O(k^2 n^2)
- Choosing the center string Sc: O(k^2)
- Generating the k pairwise alignments with Sc: O(k n^2)
- Inserting spaces into Sc so that all pairwise alignments are satisfied simultaneously: O(k^2 n)
Total running time: O(k^2 n^2)
Near-optimal solution in polynomial time, with a guarantee that
the solution is not too far from the optimal solution.
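The first two steps of the center star method (all pairwise distances, then the string with the smallest distance sum) can be sketched as follows. This is a minimal illustration, assuming unit-cost edit distance as the distance D; the function names `edit_distance` and `center_string` are hypothetical, and the later steps (building the multiple alignment from the pairwise alignments) are not shown.

```python
from itertools import combinations


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, O(len(x) * len(y)) time, two rows of memory."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def center_string(seqs: list[str]) -> tuple[int, int]:
    """Return (index, distance sum) of the sequence minimizing sum_i D(Sc, Si)."""
    k = len(seqs)
    totals = [0] * k
    # Each unordered pair is aligned once: O(k^2) alignments of O(n^2) each.
    for i, j in combinations(range(k), 2):
        d = edit_distance(seqs[i], seqs[j])
        totals[i] += d
        totals[j] += d
    best = min(range(k), key=totals.__getitem__)
    return best, totals[best]
```

For example, `center_string(["ACGT", "AGT", "ACGTT", "CCGT"])` picks the first sequence, which is one edit away from each of the others.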
Big Data Challenges
No performance guarantee, but effective in practice; polynomial time.
Progressive Alignment: compute pairwise distance scores for all
pairs of sequences, build a guide tree from them, and align the
sequences following the guide tree. The root node represents a
complete multiple alignment of the input sequences.
ClustalW, for S = {S1, S2, …, Sk} with each sequence of length n:
- Optimal alignment of every pair Si, Sj: O(k^2 n^2)
- Building the guide tree from the distance matrix: O(k^3)
- Aligning along the guide tree with profile-profile alignments: O(k^2 n + k n^2)
Total time: O(k^2 n^2 + k^3)
Additional heuristics, such as weighting sequences by branch
length and adjusting the guide tree on the fly, are also applied.
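The guide-tree step can be sketched with a naive average-linkage (UPGMA-style) clustering of a distance matrix. This is a simpler stand-in chosen for brevity, not the tree-building method of any particular tool; the name `build_guide_tree` and the input layout (a dict keyed by `frozenset` pairs) are assumptions made for the example. Each of the k − 1 merges scans all remaining pairs, which gives the O(k^3) behaviour quoted above.

```python
from itertools import combinations


def build_guide_tree(names: list[str], dist: dict[frozenset, float]):
    """Return a guide tree as nested 2-tuples of sequence names."""
    sizes = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)                       # working copy, extended with merged clusters
    while len(sizes) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in sizes:
            if c == a or c == b:
                continue
            # Size-weighted average of the distances to the two merged clusters.
            d[frozenset((merged, c))] = (
                sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Toy distances (made up for illustration): S1/S2 are close, S3/S4 are close.
names = ["S1", "S2", "S3", "S4"]
dist = {frozenset(p): 6.0 for p in combinations(names, 2)}
dist[frozenset(("S1", "S2"))] = 1.0
dist[frozenset(("S3", "S4"))] = 2.0
tree = build_guide_tree(names, dist)
```

A progressive aligner would then walk this tree from the leaves to the root, aligning sequences or profiles at each internal node.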
Iterative Alignment: start from an initial multiple alignment (for
example, one produced by progressive alignment) and then
repeatedly apply modifications to improve it.
MUSCLE: uses the current alignment to compute more accurate
pairwise distances, then refines the multiple alignment using
tree-dependent restricted partitioning: delete an edge of the
guide tree, re-align the alignments of the two resulting
disjoint subtrees, and keep the new alignment if it is better.
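The partitioning part of this refinement can be sketched as below: deleting an edge of the guide tree splits the sequences into two groups. The sketch assumes a guide tree stored as nested 2-tuples of sequence names (a hypothetical representation) and only enumerates the splits; re-aligning the two groups and comparing scores is not shown.

```python
def leaves(tree) -> list[str]:
    """Sequence names under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def edge_bipartitions(tree):
    """Yield (below, rest) leaf sets, one pair per parent-child edge.

    In a rooted binary tree the two edges at the root give mirrored
    copies of the same split; a real implementation might skip one.
    """
    all_leaves = set(leaves(tree))

    def walk(node):
        if isinstance(node, str):
            return
        for child in node:
            below = set(leaves(child))
            yield below, all_leaves - below  # effect of deleting this edge
            yield from walk(child)

    yield from walk(tree)


parts = list(edge_bipartitions((("S1", "S2"), ("S3", "S4"))))
```

Each yielded pair identifies the two sub-alignments that a refinement pass would re-align against each other.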
Other iterative methods – PRRP, MAFFT
Iterative algorithms offer improved alignment accuracy at
the expense of computation time.
Heuristics
Alignment quality and total run time
State of the Art
Conclusion & Future Work
• There is a trade-off between solution optimality and
computation time.
• Fixed-parameter approaches appear promising for finding
tractable special cases of these problems.
• Employing more than one program, based on different
alignment techniques, might yield a better result.
• Parallelization is also recommended in practice for
reducing execution (wall-clock) time, although the total
amount of computation remains the same.
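The parallelization point can be illustrated with the all-pairs distance step, where every pair is independent of the others. The sketch below is an assumption-laden toy: `mismatch_count` is a stand-in distance rather than a real alignment score, and a thread pool is used only to keep the example self-contained. In CPython, CPU-bound pure-Python work would need a process pool to run faster; either way the number of pair computations is unchanged.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import combinations


def mismatch_count(pair: tuple[str, str]) -> int:
    """Stand-in distance: position-wise mismatches plus the length difference."""
    x, y = pair
    return sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))


def pairwise_distances(seqs: list[str], executor: Executor) -> dict:
    """Distribute the k*(k-1)/2 independent pair computations over an executor."""
    pairs = list(combinations(range(len(seqs)), 2))
    jobs = [(seqs[i], seqs[j]) for i, j in pairs]
    # executor.map returns results in the same order as the submitted jobs.
    return dict(zip(pairs, executor.map(mismatch_count, jobs)))


with ThreadPoolExecutor(max_workers=4) as pool:
    table = pairwise_distances(["ACGT", "ACGA", "TCGA"], pool)
```

The same function accepts any `concurrent.futures.Executor`, so the choice of threads or processes stays outside the distance code.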
Enhanced sequencing technologies produce sequence data
on an unparalleled scale, and alignment solutions need to
scale to handle huge volumes of input data.
The computational complexity of the DP algorithm
increases exponentially with the dimensionality of the
state, which makes it impractical in large-scale
applications.
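A rough sense of this growth, as a minimal sketch: the function below evaluates n^k · 2^k, the count suggested by the poster's O(n1 · n2 ⋯ nk · 2^k) bound for full k-dimensional DP, assuming all k sequences have the same length n. The name `dp_work` and the choice n = 100 are illustrative assumptions, and constant factors are ignored.

```python
def dp_work(n: int, k: int) -> int:
    """Approximate cell updates for exact DP over k sequences of length n.

    The table has about n**k cells, and each cell examines up to 2**k
    combinations of "advance or gap" across the k sequences.
    """
    return (n ** k) * (2 ** k)


for k in range(2, 7):
    print(f"k={k}: about {dp_work(100, k):.1e} cell updates")
```

Each additional sequence multiplies the work by roughly 2n, which is why exact DP stops being practical after only a handful of sequences.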
• MSA: MSA with the sum-of-pairs score is NP-complete.
• Multiple tree alignment is MAX SNP-hard.
• DP on the full k-dimensional box of volume n1 × n2 × n3
× … × nk takes O(n1 · n2 · n3 ⋯ nk · 2^k) time.
• Running time is very slow even for k = 3, and
totally infeasible for k ≥ 6.
• Pairwise Sequence Alignment: most versions of
pairwise sequence alignment have a time complexity of
O(mn) and, with linear-space techniques, a space
complexity of O(n).
• Specific cases such as LCS are solvable in polynomial
time.
• Gene Prediction: in general, the prediction
problem is NP-hard.
• There exist polynomial-time algorithms for
several special cases.
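The pairwise bounds above can be seen in a small example. The sketch below computes the length of a longest common subsequence of two strings in O(mn) time while keeping only two rows of the DP table, so memory is proportional to the shorter string. It returns the length only; recovering the subsequence itself in linear space needs an extra divide-and-conquer step that is not shown. The function name `lcs_length` is an assumption.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter string along the row
    prev = [0] * (len(y) + 1)  # row for the previous prefix of x
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one character
        prev = curr
    return prev[-1]
```

For instance, `lcs_length("ABCBDAB", "BDCABA")` is 4, with "BCBA" as one longest common subsequence.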
Approximation
Fixed Parameter Complexity
• Fine-grained complexity analysis of NP-hard problems.
• Analyzes how problem- and data-specific parameters
influence the computational complexity of the problem.
Problem difficulty is analyzed not only in terms of the
input size, but also in terms of an additional parameter,
typically an integer p.
Fixed-parameter tractability: a problem is fixed-parameter
tractable if, for each fixed parameter value p, it can be
solved in time O(n^α), where α is a constant independent of p.
Formally, a parameterized problem with parameter p is
fixed-parameter tractable if there is an algorithm that
decides an instance (I, p) in f(p) · |I|^{O(1)} time, where f
is an arbitrary computable function depending only on p,
and I is the unparameterized instance.
Possible parameterizations:
Instance: A set of k strings X1, ..., Xk over an
alphabet Σ, and a positive integer m.
Parameter 1: k
Parameter 2: m
Parameter 3: (k, m)
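To make the f(p) · |I|^{O(1)} shape concrete, here is a hedged sketch. It assumes the instance above poses the common-subsequence question (is there a string of length m that is a subsequence of every Xi?), which the poster does not state explicitly, and it adds the alphabet size |Σ| as a further parameter beyond the three listed. Under those assumptions, trying all |Σ|^m candidates takes O(|Σ|^m · k · n) time: exponential only in the parameters and linear in the input size. The function names are hypothetical.

```python
from itertools import product


def is_subsequence(s, t: str) -> bool:
    """True if the symbols of s occur in t in order (not necessarily adjacent)."""
    it = iter(t)
    return all(ch in it for ch in s)  # each membership test consumes the iterator


def common_subsequence_of_length(strings: list[str], alphabet: str, m: int):
    """Return some common subsequence of length m, or None if there is none."""
    for cand in product(alphabet, repeat=m):  # |alphabet| ** m candidates
        if all(is_subsequence(cand, x) for x in strings):
            return "".join(cand)
    return None
```

For the strings "ACGT", "AGT" and "ACT" over the alphabet "ACGT", a common subsequence of length 2 exists ("AT"), while none of length 3 does.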
Approaches Considered
Approaches to tackle MSA hardness:
- Heuristics
- Approximation
- Fixed Parameter Complexity
Curse of Dimensionality
Complexity
Tool/Algo.   Description
ClustalW     Progressive alignment, medium-large
MUSCLE       Iterative alignment, medium
DIALIGN      Segment-based method
Kalign       Progressive alignment, large
MAFFT        Iterative alignment, medium-large
ProbCons     Probabilistic/consistency
T-Coffee     Consistency-based, small
Data Deluge
