SlideShare a Scribd company logo
1 of 63
Download to read offline
Linear Space Alignment,
Divide & Conquer
Algorithms
Outline
1. MergeSort
2. Finding the middle vertex
3. Linear space sequence alignment
4. Block alignment
5. Four-Russians speedup
6. LCS in sub-quadratic time
Section 1:
MergeSort
Divide and Conquer Algorithms
• Divide problem into sub-problems.
• Conquer by solving sub-problems recursively. If the sub-
problems are small enough, solve them in brute force fashion.
• Combine the solutions of sub-problems into a solution of the
original problem (tricky part).
Sorting Problem Revisited
• Given: An unsorted array.
• Goal: Sort it.
5 2 4 7 1 3 2 6
1 2 2 3 4 5 6 7
Mergesort: Divide Step
5 2 4 7 1 3 2 6
5 2 4 7 1 3 2 6
5 2 4 7 1 3 2 6
• log(n) divisions to split an array of size n into single elements.
• Step 1: DIVIDE
Mergesort: Conquer Step
• Step 2: CONQUER
1 2 2 3 4 5 6 7
2 4 5 7 1 2 3 6
2 5 4 7 1 3 2 6
5 2 4 7 1 3 2 6
O(n)
O(n)
O(n)
O(n)
• log(n) iterations, each iteration takes O(n) time.
• Total Time: O(n log n)
Mergesort: Combine Step
• Step 3: COMBINE
• 2 arrays of size 1 can be easily merged to form a sorted
array of size 2.
• In general, 2 sorted arrays of size n and m can be merged
in O(n+m) time to form a sorted array of size n+m.
5 2 2 5
Mergesort: Combine Step
• Combining 2 arrays of size 4…
2 4 5 7
1 2 3 6
1
2 4 5 7
2 3 6
1 2
4 5 7
2 3 6
1 2 2
4 5 7
3 6
1 2 2 3
4 5 7
6
1 2 2 3 4
Etcetera…
1 2 2 3 4 5 6 7
Merge Algorithm
1. Merge(a,b)
2. n1 size of array a
3. n2 size of array b
4. an1+1
5. an2+1
6. i 1
7. j 1
8. for k 1 to n1 + n2
9. if ai < bj
10. ck ai
11. i i +1
12. else
13. ck bj
14. j j+1
15.return c
Mergesort: Example
20 4 7 6 1 3 9 5
20 4 7 6 1 3 9 5
20 4 7 6 1 3 9 5
20 4 7 6 1 3 9 5
4 20 6 7 1 3 5 9
4 6 7 20 1 3 5 9
1 3 4 5 6 7 9 20
Divide
Conquer
MergeSort Algorithm
1. MergeSort(c)
2. n size of array c
3. if n = 1
4. return c
5. left list of first n/2 elements of c
6. right list of last n-n/2 elements of c
7. sortedLeft MergeSort(left)
8. sortedRight MergeSort(right)
9. sortedList Merge(sortedLeft,sortedRight)
10. return sortedList
MergeSort: Running Time
• The problem is simplified to baby steps:
• For the ith merging iteration, the complexity of the problem
is O(n).
• Number of iterations is O(log n).
• Running Time: O(n logn).
Divide and Conquer Approach to LCS
1. Path(source, sink)
2. if(source & sink are in consecutive columns)
3. output the longest path from source to sink
4. else
5. middle ← middle vertex between source & sink
6. Path(source, middle)
7. Path(middle, sink)
Divide and Conquer Approach to LCS
• The only problem left is how to find this “middle vertex”!
1. Path(source, sink)
2. if(source & sink are in consecutive columns)
3. output the longest path from source to sink
4. else
5. middle ← middle vertex between source & sink
6. Path(source, middle)
7. Path(middle, sink)
Section 2:
Finding the Middle Vertex
Alignment Score Requires Linear Memory
• Space complexity of computing the
alignment score is just O(n).
• We only need the previous column
to calculate the current column, and
we can then throw away that
previous column once we’re done
using it.
2
nn
Computing Alignment Score: Recycling
Columns
Memory for column 1
is re-used to calculate
column 3
Memory for column 2
is re-used to
calculate column 4
• Only two columns of scores are saved at any given time:
Alignment Path Requires Quadratic Memory
• Space complexity for computing an
alignment path for sequences of
length n and m is O(nm).
• The reason is that we need to keep
all backtracking references in
memory to reconstruct the path
(backtracking).
n
m
Crossing the Middle Line
m/2 m
n
• We want to calculate the longest
path from (0,0) to (n,m) that passes
through (i,m/2) where i ranges
from 0 to n and represents the i-th
row.
• Define length(i) as the length of the
longest path from (0,0) to (n,m)
that passes through vertex (i, m/2).
Crossing the Middle Line
• We want to calculate the longest
path from (0,0) to (n,m) that passes
through (i,m/2) where i ranges
from 0 to n and represents the ith
row.
• Define length(i) as the length of the
longest path from (0,0) to (n,m)
that passes through vertex (i, m/2).
m/2 m
n
m/2 m
n
• Define (mid,m/2) as the vertex
where the longest path crosses
the middle column.
• length(mid) = optimal length = max0 i n length(i)
Crossing the Middle Line
prefix(i)
suffix(i)
Crossing the Middle Line
m/2 m
n
• Define (mid,m/2) as the vertex
where the longest path crosses
the middle column.
• length(mid) = optimal length = max0 i n length(i)
Computing prefix(i)
• prefix(i) is the length of the longest path from (0,0) to (i, m/2).
• Compute prefix(i) by dynamic programming in the left half of
the matrix.
0 m/2 m
Store prefix(i) column
Computing suffix(i)
• suffix(i) is the length of the longest path from (i,m/2) to (n,m).
• suffix(i) is the length of the longest path from (n,m) to (i, m/2)
with all edges reversed.
• Compute suffix(i) by dynamic programming in the right half of the
“reversed” matrix.
0 m/2 m
Store suffix(i) column
length(i) = prefix(i) + suffix(i)
• Add prefix(i) and suffix(i) to compute length(i):
• length(i)=prefix(i) + suffix(i)
• You now have a middle vertex of the maximum path (i,m/2) as
maximum of length(i).
Middle point found
0 m/2 m
0
i
Finding the Middle Point
0 m/4 m/2 3m/4 m
Finding the Middle Point Again
0 m/4 m/2 3m/4 m
And Again…
0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m
Time = Area: First Pass
• On first pass, the algorithm covers the entire area.
Area = n m
Time = Area: First Pass
Computing
prefix(i)
Computing
suffix(i)
• On first pass, the algorithm covers the entire area.
Area = n m
Time = Area: Second Pass
• On second pass, the algorithm covers only 1/2 of the area
Area/2
Time = Area: Third Pass
• On third pass, only 1/4th is covered.
Area/4
Geometric Reduction At Each Iteration
• 1 + ½ + ¼ + ... + (½)k ≤ 2
• Runtime: O(Area) = O(nm)
first pass: 1
2nd pass: 1/2
3rd pass: 1/4
5th pass: 1/16
4th pass: 1/8
Can We Align Sequences in Subquadratic Time?
• Dynamic Programming takes O(n2) for global alignment.
• Can we do better?
• Yes, use Four-Russians Speedup.
Section 4:
Block Alignment
Partitioning Sequences into Blocks
• Partition the n x n grid into blocks of size t x t.
• We are comparing two sequences, each of size n, and each
sequence is sectioned off into chunks, each of length t.
• Sequence u = u1…un becomes
|u1…ut| |ut+1…u2t| … |un-t+1…un|
and sequence v = v1…vn becomes
|v1…vt| |vt+1…v2t| … |vn-t+1…vn|
Partitioning Alignment Grid into Blocks
partition
n n/t
n/t
t
tn
Block Alignment
• Block alignment of sequences u and v.
1. An entire block in u is aligned with an entire block in v.
2. An entire block is inserted.
3. An entire block is deleted.
• Block path: a path that traverses every t x t square through its
corners.
Block Alignment: Examples
valid invalid
Block Alignment Problem
• Goal: Find the longest block path through an edit graph.
• Input: Two sequences, u and v partitioned into blocks of size t.
This is equivalent to an n x n edit graph partitioned into t x t
subgrids.
• Output: The block alignment of u and v with the maximum
score (longest block path through the edit graph).
Constructing Alignments within Blocks
• To solve: Compute alignment score BlockScorei,j for each pair
of blocks |u(i-1)*t+1…ui*t| and |v(j-1)*t+1…vj*t|.
• How many blocks are there per sequence?
• (n/t) blocks of size t
• How many pairs of blocks for aligning the two sequences?
• (n/t) x (n/t)
• For each block pair, solve a mini-alignment problem of size
t x t
Constructing Alignments within Blocks
n/t
Block pair represented
by each square
Solve mini-alignment problems
Block Alignment: Dynamic Programming
• Let si,j denote the optimal block alignment score between the
first i blocks of u and first j blocks of v.
• block is the penalty for inserting or deleting an entire block.
• BlockScore(i, j) is the score of the pair of blocks in row i and
column j.
si, j max
si 1, j block
si, j 1 block
si 1, j 1 BlockScore i, j
Block Alignment Runtime
• Indices i, j range from 0 to n/t.
• Running time of algorithm is
O( [n/t]*[n/t]) = O(n2/t2)
if we don’t count the time to compute each BlockScore(i, j).
Block Alignment Runtime
• Computing all BlockScorei,j requires solving (n/t)*(n/t) mini
block alignments, each of size (t*t).
• So computing all i,j takes time
O([n/t]*[n/t]*t*t) = O(n2)
• This is the same as dynamic programming.
• How do we speed this up?
Section 5:
Four Russians Speedup
Four Russians Technique
• Let t = log(n), where t is the block size, n is the sequence size.
• Instead of having (n/t)*(n/t) mini-alignments, construct 4t x 4t
mini-alignments for all pairs of strings of t nucleotides (huge
size), and put in a lookup table.
• However, size of lookup table is not really that huge if t is
small. Let t = (logn)/4. Then 4t x 4t = n.
Look-up Table for Four Russians Technique
Lookup table “Score”
AAAAAA
AAAAAC
AAAAAG
AAAAAT
AAAACA
…
AAAAAA
AAAAAC
AAAAAG
AAAAAT
AAAACA
…
each sequence
has t nucleotides
sizeis only n,
insteadof
(n/t)*(n/t)
New Recurrence
• The new lookup table Score is indexed by a pair of t-
nucleotide strings, so
• Key Difference: The Score function is taken from the hash
table rather than computed by dynamic programming as
before.
si, j max
si 1, j block
si, j 1 block
si 1, j 1 Score ith
block ofv, jth
block ofu
Four Russians Speedup Runtime
• Since computing the lookup table Score of size n takes O(n)
time, the running time is mainly limited by the (n/t)*(n/t)
accesses to the lookup table.
• Each access takes O(logn) time.
• Overall running time: O( [n2/t2]*logn)
• Since t = logn, substitute in:
• O( [n2/{logn}2]*logn) > O( n2/logn )
Section 6:
LCS in Sub-Quadratic
Time
So Far…
• We can divide up the grid into blocks and run dynamic
programming only on the corners of these blocks.
• In order to speed up the mini-alignment calculations to under
n2, we create a lookup table of size n, which consists of all
scores for all t-nucleotide pairs.
• Running time goes from quadratic, O(n2), to subquadratic:
O(n2/logn)
Four Russians Speedup for LCS
• Unlike the block partitioned graph, the LCS path does not have
to pass through the vertices of the blocks.
Block Alignment Longest Common Subsequence
Block Alignment vs. LCS
• In block alignment, we only care about the corners of the
blocks.
• In LCS, we care about all points on the edges of the blocks,
because those are points that the path can traverse.
• Recall, each sequence is of length n, each block is of size t, so
each sequence has (n/t) blocks.
Block Alignment vs. LCS: Points Of Interest
Block alignment has (n/t)*(n/t)
= (n2/t2) points of interest
LCS alignment has O(n2/t)
points of interest
Traversing Blocks for LCS
• Given alignment scores si,* in the first row and scores s*,j in the
first column of a t x t mini square, compute alignment scores in
the last row and column of the minisquare.
• To compute the last row and the last column score, we use these 4
variables:
1. Alignment scores si,* in the first row.
2. Alignment scores s*,j in the first column.
3. Substring of sequence u in this block (4t possibilities).
4. Substring of sequence v in this block (4t possibilities).
Traversing Blocks for LCS
• If we used this to compute the grid, it would take quadratic,
O(n2) time, but we want to do better.
We know
these scores
We can
calculate these
scores
t x t block
Four Russians Speedup
• Build a lookup table for all possible values of the four
variables:
1. All possible scores for the first row s*,j
2. All possible scores for the first column s*,j
3. Substring of sequence u in this block (4t possibilities).
4. Substring of sequence v in this block (4t possibilities).
• For each quadruple we store the value of the score for the last
row and last column.
• This will be a huge table, but we can eliminate alignments
scores that don’t make sense.
Reducing Table Size
• Alignment scores in LCS are monotonically increasing, and
adjacent elements can’t differ by more than 1.
• Example: 0,1,2,2,3,4 is ok; 0,1,2,4,5,8, is not because 2 and 4
differ by more than 1 (and so do 5 and 8).
• Therefore, we only need to store quadruples whose scores are
monotonically increasing and differ by at most 1.
Efficient Encoding of Alignment Scores
• Instead of recording numbers that correspond to the index in
the sequences u and v, we can use binary to encode the
differences between the alignment scores.
0 1 2 2 3 4
1 1 0 0 1 1
original encoding
binary encoding
Reducing Lookup Table Size
• 2t possible scores (t = size of blocks)
• 4t possible strings
• Lookup table size is (2t * 2t)*(4t * 4t) = 26t
• Let t = (logn)/4;
• Table size is: 26((logn)/4) = n(6/4) = n(3/2)
• Time = O( [n2/t2]*logn )
• O( [n2/{logn}2]*logn) > O( n2/logn )
Summary
• We take advantage of the fact that for each block of t = log(n),
we can pre-compute all possible scores and store them in a
lookup table of size n(3/2).
• We used the Four Russian speedup to go from a quadratic
running time for LCS to subquadratic running time:
O(n2/logn).

More Related Content

What's hot

01 knapsack using backtracking
01 knapsack using backtracking01 knapsack using backtracking
01 knapsack using backtracking
mandlapure
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
Traian Rebedea
 
8 queens problem using back tracking
8 queens problem using back tracking8 queens problem using back tracking
8 queens problem using back tracking
Tech_MX
 

What's hot (20)

Branch and bounding : Data structures
Branch and bounding : Data structuresBranch and bounding : Data structures
Branch and bounding : Data structures
 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0
 
FDM Numerical solution of Laplace Equation using MATLAB
FDM Numerical solution of Laplace Equation using MATLABFDM Numerical solution of Laplace Equation using MATLAB
FDM Numerical solution of Laplace Equation using MATLAB
 
Matrices
MatricesMatrices
Matrices
 
15 puzzle problem using branch and bound
15 puzzle problem using branch and bound15 puzzle problem using branch and bound
15 puzzle problem using branch and bound
 
Unit 2
Unit 2Unit 2
Unit 2
 
Branch and bound
Branch and boundBranch and bound
Branch and bound
 
Chapter2 - Linear Time-Invariant System
Chapter2 - Linear Time-Invariant SystemChapter2 - Linear Time-Invariant System
Chapter2 - Linear Time-Invariant System
 
Constant strain triangular
Constant strain triangular Constant strain triangular
Constant strain triangular
 
Backtracking
BacktrackingBacktracking
Backtracking
 
Backtracking
BacktrackingBacktracking
Backtracking
 
linear transfermation.pptx
linear transfermation.pptxlinear transfermation.pptx
linear transfermation.pptx
 
Modern Control - Lec 04 - Analysis and Design of Control Systems using Root L...
Modern Control - Lec 04 - Analysis and Design of Control Systems using Root L...Modern Control - Lec 04 - Analysis and Design of Control Systems using Root L...
Modern Control - Lec 04 - Analysis and Design of Control Systems using Root L...
 
01 knapsack using backtracking
01 knapsack using backtracking01 knapsack using backtracking
01 knapsack using backtracking
 
Backtracking
Backtracking  Backtracking
Backtracking
 
Aplikasi Bilangan Kompleks - Analisis Sinyal [PAPER]
Aplikasi Bilangan Kompleks - Analisis Sinyal [PAPER]Aplikasi Bilangan Kompleks - Analisis Sinyal [PAPER]
Aplikasi Bilangan Kompleks - Analisis Sinyal [PAPER]
 
Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3Algorithm Design and Complexity - Course 3
Algorithm Design and Complexity - Course 3
 
8 queens problem using back tracking
8 queens problem using back tracking8 queens problem using back tracking
8 queens problem using back tracking
 
Lab no.07
Lab no.07Lab no.07
Lab no.07
 
Backtracking
BacktrackingBacktracking
Backtracking
 

Viewers also liked

Dna sequencing methods
Dna sequencing methodsDna sequencing methods
Dna sequencing methods
hephz
 
Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
14cummke
 

Viewers also liked (18)

Longest Common Sequence Algorithm Analysis
Longest Common Sequence Algorithm AnalysisLongest Common Sequence Algorithm Analysis
Longest Common Sequence Algorithm Analysis
 
Annual Report 2012: Westerville Library
Annual Report 2012: Westerville LibraryAnnual Report 2012: Westerville Library
Annual Report 2012: Westerville Library
 
Red blacktrees
Red blacktreesRed blacktrees
Red blacktrees
 
Fast algorithms for large scale genome alignment and comparison
Fast algorithms for large scale genome alignment and comparisonFast algorithms for large scale genome alignment and comparison
Fast algorithms for large scale genome alignment and comparison
 
Disk partition alignment
Disk partition alignmentDisk partition alignment
Disk partition alignment
 
14 - 08 Feb - Dynamic Programming
14 - 08 Feb - Dynamic Programming14 - 08 Feb - Dynamic Programming
14 - 08 Feb - Dynamic Programming
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
 
DP
DPDP
DP
 
12.3 dna, rna, and protein
12.3  dna, rna, and protein12.3  dna, rna, and protein
12.3 dna, rna, and protein
 
Greedy Algorihm
Greedy AlgorihmGreedy Algorihm
Greedy Algorihm
 
Greedy Algorithms with examples' b-18298
Greedy Algorithms with examples'  b-18298Greedy Algorithms with examples'  b-18298
Greedy Algorithms with examples' b-18298
 
Greedy algorithm
Greedy algorithmGreedy algorithm
Greedy algorithm
 
Dna sequencing
Dna sequencingDna sequencing
Dna sequencing
 
Dna sequencing methods
Dna sequencing methodsDna sequencing methods
Dna sequencing methods
 
DNA Sequencing : Maxam Gilbert and Sanger Sequencing
DNA Sequencing : Maxam Gilbert and Sanger SequencingDNA Sequencing : Maxam Gilbert and Sanger Sequencing
DNA Sequencing : Maxam Gilbert and Sanger Sequencing
 
Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
 
Dna Sequencing
Dna SequencingDna Sequencing
Dna Sequencing
 

Similar to Ch07 linearspacealignment

Lca seminar modified
Lca seminar modifiedLca seminar modified
Lca seminar modified
Inbok Lee
 
Data Structure & Algorithms - Mathematical
Data Structure & Algorithms - MathematicalData Structure & Algorithms - Mathematical
Data Structure & Algorithms - Mathematical
babuk110
 
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
SintooChauhan6
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
GopiNathVelivela
 

Similar to Ch07 linearspacealignment (20)

Class13_Quicksort_Algorithm.pdf
Class13_Quicksort_Algorithm.pdfClass13_Quicksort_Algorithm.pdf
Class13_Quicksort_Algorithm.pdf
 
Lca seminar modified
Lca seminar modifiedLca seminar modified
Lca seminar modified
 
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWERUndecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
Undecidable Problems - COPING WITH THE LIMITATIONS OF ALGORITHM POWER
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
 
chapter1.pdf ......................................
chapter1.pdf ......................................chapter1.pdf ......................................
chapter1.pdf ......................................
 
Introduction to Matlab - Basic Functions
Introduction to Matlab - Basic FunctionsIntroduction to Matlab - Basic Functions
Introduction to Matlab - Basic Functions
 
ICPC 2015, Tsukuba : Unofficial Commentary
ICPC 2015, Tsukuba: Unofficial CommentaryICPC 2015, Tsukuba: Unofficial Commentary
ICPC 2015, Tsukuba : Unofficial Commentary
 
Daa chapter 3
Daa chapter 3Daa chapter 3
Daa chapter 3
 
MATLAB - Arrays and Matrices
MATLAB - Arrays and MatricesMATLAB - Arrays and Matrices
MATLAB - Arrays and Matrices
 
3.8 quick sort
3.8 quick sort3.8 quick sort
3.8 quick sort
 
Matrix Multiplication(An example of concurrent programming)
Matrix Multiplication(An example of concurrent programming)Matrix Multiplication(An example of concurrent programming)
Matrix Multiplication(An example of concurrent programming)
 
Data Structure & Algorithms - Mathematical
Data Structure & Algorithms - MathematicalData Structure & Algorithms - Mathematical
Data Structure & Algorithms - Mathematical
 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)
 
CS345-Algorithms-II-Lecture-1-CS345-2016.pdf
CS345-Algorithms-II-Lecture-1-CS345-2016.pdfCS345-Algorithms-II-Lecture-1-CS345-2016.pdf
CS345-Algorithms-II-Lecture-1-CS345-2016.pdf
 
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
Decision Maths 1 Chapter 3 Algorithms on Graphs (including Floyd A2 content)....
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptx
 
testpang
testpangtestpang
testpang
 
Counting trees.pptx
Counting trees.pptxCounting trees.pptx
Counting trees.pptx
 
Lecture -16-merge sort (slides).pptx
Lecture -16-merge sort (slides).pptxLecture -16-merge sort (slides).pptx
Lecture -16-merge sort (slides).pptx
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 

More from BioinformaticsInstitute

Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
BioinformaticsInstitute
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
BioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
BioinformaticsInstitute
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
BioinformaticsInstitute
 

More from BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
Nanopores sequencing
Nanopores sequencingNanopores sequencing
Nanopores sequencing
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Comparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphsComparative Genomics and de Bruijn graphs
Comparative Genomics and de Bruijn graphs
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Плюрипотентность 101
Плюрипотентность 101Плюрипотентность 101
Плюрипотентность 101
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Ch07 linearspacealignment

  • 1. Linear Space Alignment, Divide & Conquer Algorithms
  • 2. Outline 1. MergeSort 2. Finding the middle vertex 3. Linear space sequence alignment 4. Block alignment 5. Four-Russians speedup 6. LCS in sub-quadratic time
  • 4. Divide and Conquer Algorithms • Divide problem into sub-problems. • Conquer by solving sub-problems recursively. If the sub- problems are small enough, solve them in brute force fashion. • Combine the solutions of sub-problems into a solution of the original problem (tricky part).
  • 5. Sorting Problem Revisited • Given: An unsorted array. • Goal: Sort it. 5 2 4 7 1 3 2 6 1 2 2 3 4 5 6 7
  • 6. Mergesort: Divide Step 5 2 4 7 1 3 2 6 5 2 4 7 1 3 2 6 5 2 4 7 1 3 2 6 • log(n) divisions to split an array of size n into single elements. • Step 1: DIVIDE
  • 7. Mergesort: Conquer Step • Step 2: CONQUER 1 2 2 3 4 5 6 7 2 4 5 7 1 2 3 6 2 5 4 7 1 3 2 6 5 2 4 7 1 3 2 6 O(n) O(n) O(n) O(n) • log(n) iterations, each iteration takes O(n) time. • Total Time: O(n log n)
  • 8. Mergesort: Combine Step • Step 3: COMBINE • 2 arrays of size 1 can be easily merged to form a sorted array of size 2. • In general, 2 sorted arrays of size n and m can be merged in O(n+m) time to form a sorted array of size n+m. 5 2 2 5
  • 9. Mergesort: Combine Step • Combining 2 arrays of size 4… 2 4 5 7 1 2 3 6 1 2 4 5 7 2 3 6 1 2 4 5 7 2 3 6 1 2 2 4 5 7 3 6 1 2 2 3 4 5 7 6 1 2 2 3 4 Etcetera… 1 2 2 3 4 5 6 7
  • 10. Merge Algorithm 1. Merge(a,b) 2. n1 size of array a 3. n2 size of array b 4. an1+1 5. an2+1 6. i 1 7. j 1 8. for k 1 to n1 + n2 9. if ai < bj 10. ck ai 11. i i +1 12. else 13. ck bj 14. j j+1 15.return c
  • 11. Mergesort: Example 20 4 7 6 1 3 9 5 20 4 7 6 1 3 9 5 20 4 7 6 1 3 9 5 20 4 7 6 1 3 9 5 4 20 6 7 1 3 5 9 4 6 7 20 1 3 5 9 1 3 4 5 6 7 9 20 Divide Conquer
  • 12. MergeSort Algorithm 1. MergeSort(c) 2. n size of array c 3. if n = 1 4. return c 5. left list of first n/2 elements of c 6. right list of last n-n/2 elements of c 7. sortedLeft MergeSort(left) 8. sortedRight MergeSort(right) 9. sortedList Merge(sortedLeft,sortedRight) 10. return sortedList
  • 13. MergeSort: Running Time • The problem is simplified to baby steps: • For the ith merging iteration, the complexity of the problem is O(n). • Number of iterations is O(log n). • Running Time: O(n logn).
  • 14. Divide and Conquer Approach to LCS 1. Path(source, sink) 2. if(source & sink are in consecutive columns) 3. output the longest path from source to sink 4. else 5. middle ← middle vertex between source & sink 6. Path(source, middle) 7. Path(middle, sink)
  • 15. Divide and Conquer Approach to LCS • The only problem left is how to find this “middle vertex”! 1. Path(source, sink) 2. if(source & sink are in consecutive columns) 3. output the longest path from source to sink 4. else 5. middle ← middle vertex between source & sink 6. Path(source, middle) 7. Path(middle, sink)
  • 16. Section 2: Finding the Middle Vertex
  • 17. Alignment Score Requires Linear Memory • Space complexity of computing the alignment score is just O(n). • We only need the previous column to calculate the current column, and we can then throw away that previous column once we’re done using it. 2 nn
  • 18. Computing Alignment Score: Recycling Columns Memory for column 1 is re-used to calculate column 3 Memory for column 2 is re-used to calculate column 4 • Only two columns of scores are saved at any given time:
  • 19. Alignment Path Requires Quadratic Memory • Space complexity for computing an alignment path for sequences of length n and m is O(nm). • The reason is that we need to keep all backtracking references in memory to reconstruct the path (backtracking). n m
  • 20. Crossing the Middle Line m/2 m n • We want to calculate the longest path from (0,0) to (n,m) that passes through (i,m/2) where i ranges from 0 to n and represents the i-th row. • Define length(i) as the length of the longest path from (0,0) to (n,m) that passes through vertex (i, m/2).
  • 21. Crossing the Middle Line • We want to calculate the longest path from (0,0) to (n,m) that passes through (i,m/2) where i ranges from 0 to n and represents the ith row. • Define length(i) as the length of the longest path from (0,0) to (n,m) that passes through vertex (i, m/2). m/2 m n
  • 22. m/2 m n • Define (mid,m/2) as the vertex where the longest path crosses the middle column. • length(mid) = optimal length = max0 i n length(i) Crossing the Middle Line
  • 23. prefix(i) suffix(i) Crossing the Middle Line m/2 m n • Define (mid,m/2) as the vertex where the longest path crosses the middle column. • length(mid) = optimal length = max0 i n length(i)
  • 24. Computing prefix(i) • prefix(i) is the length of the longest path from (0,0) to (i, m/2). • Compute prefix(i) by dynamic programming in the left half of the matrix. 0 m/2 m Store prefix(i) column
  • 25. Computing suffix(i) • suffix(i) is the length of the longest path from (i,m/2) to (n,m). • suffix(i) is the length of the longest path from (n,m) to (i, m/2) with all edges reversed. • Compute suffix(i) by dynamic programming in the right half of the “reversed” matrix. 0 m/2 m Store suffix(i) column
  • 26. length(i) = prefix(i) + suffix(i) • Add prefix(i) and suffix(i) to compute length(i): • length(i)=prefix(i) + suffix(i) • You now have a middle vertex of the maximum path (i,m/2) as maximum of length(i). Middle point found 0 m/2 m 0 i
  • 27. Finding the Middle Point 0 m/4 m/2 3m/4 m
  • 28. Finding the Middle Point Again 0 m/4 m/2 3m/4 m
  • 29. And Again… 0 m/8 m/4 3m/8 m/2 5m/8 3m/4 7m/8 m
  • 30. Time = Area: First Pass • On first pass, the algorithm covers the entire area. Area = n m
  • 31. Time = Area: First Pass Computing prefix(i) Computing suffix(i) • On first pass, the algorithm covers the entire area. Area = n m
  • 32. Time = Area: Second Pass • On second pass, the algorithm covers only 1/2 of the area Area/2
  • 33. Time = Area: Third Pass • On third pass, only 1/4th is covered. Area/4
  • 34. Geometric Reduction At Each Iteration • 1 + ½ + ¼ + ... + (½)k ≤ 2 • Runtime: O(Area) = O(nm) first pass: 1 2nd pass: 1/2 3rd pass: 1/4 5th pass: 1/16 4th pass: 1/8
  • 35. Can We Align Sequences in Subquadratic Time? • Dynamic Programming takes O(n2) for global alignment. • Can we do better? • Yes, use Four-Russians Speedup.
  • 37. Partitioning Sequences into Blocks • Partition the n x n grid into blocks of size t x t. • We are comparing two sequences, each of size n, and each sequence is sectioned off into chunks, each of length t. • Sequence u = u1…un becomes |u1…ut| |ut+1…u2t| … |un-t+1…un| and sequence v = v1…vn becomes |v1…vt| |vt+1…v2t| … |vn-t+1…vn|
  • 38. Partitioning Alignment Grid into Blocks partition n n/t n/t t tn
  • 39. Block Alignment • Block alignment of sequences u and v. 1. An entire block in u is aligned with an entire block in v. 2. An entire block is inserted. 3. An entire block is deleted. • Block path: a path that traverses every t x t square through its corners.
  • 41. Block Alignment Problem • Goal: Find the longest block path through an edit graph. • Input: Two sequences, u and v partitioned into blocks of size t. This is equivalent to an n x n edit graph partitioned into t x t subgrids. • Output: The block alignment of u and v with the maximum score (longest block path through the edit graph).
  • 42. Constructing Alignments within Blocks • To solve: Compute alignment score BlockScorei,j for each pair of blocks |u(i-1)*t+1…ui*t| and |v(j-1)*t+1…vj*t|. • How many blocks are there per sequence? • (n/t) blocks of size t • How many pairs of blocks for aligning the two sequences? • (n/t) x (n/t) • For each block pair, solve a mini-alignment problem of size t x t
  • 43. Constructing Alignments within Blocks n/t Block pair represented by each square Solve mini-alignment problems
  • 44. Block Alignment: Dynamic Programming • Let si,j denote the optimal block alignment score between the first i blocks of u and first j blocks of v. • block is the penalty for inserting or deleting an entire block. • BlockScore(i, j) is the score of the pair of blocks in row i and column j. si, j max si 1, j block si, j 1 block si 1, j 1 BlockScore i, j
  • 45. Block Alignment Runtime • Indices i, j range from 0 to n/t. • Running time of algorithm is O( [n/t]*[n/t]) = O(n2/t2) if we don’t count the time to compute each BlockScore(i, j).
  • 46. Block Alignment Runtime • Computing all BlockScorei,j requires solving (n/t)*(n/t) mini block alignments, each of size (t*t). • So computing all i,j takes time O([n/t]*[n/t]*t*t) = O(n2) • This is the same as dynamic programming. • How do we speed this up?
  • 48. Four Russians Technique • Let t = log(n), where t is the block size, n is the sequence size. • Instead of having (n/t)*(n/t) mini-alignments, construct 4t x 4t mini-alignments for all pairs of strings of t nucleotides (huge size), and put in a lookup table. • However, size of lookup table is not really that huge if t is small. Let t = (logn)/4. Then 4t x 4t = n.
  • 49. Look-up Table for Four Russians Technique Lookup table “Score” AAAAAA AAAAAC AAAAAG AAAAAT AAAACA … AAAAAA AAAAAC AAAAAG AAAAAT AAAACA … each sequence has t nucleotides sizeis only n, insteadof (n/t)*(n/t)
  • 50. New Recurrence • The new lookup table Score is indexed by a pair of t- nucleotide strings, so • Key Difference: The Score function is taken from the hash table rather than computed by dynamic programming as before. si, j max si 1, j block si, j 1 block si 1, j 1 Score ith block ofv, jth block ofu
  • 51. Four Russians Speedup Runtime • Since computing the lookup table Score of size n takes O(n) time, the running time is mainly limited by the (n/t)*(n/t) accesses to the lookup table. • Each access takes O(logn) time. • Overall running time: O( [n2/t2]*logn) • Since t = logn, substitute in: • O( [n2/{logn}2]*logn) > O( n2/logn )
  • 52. Section 6: LCS in Sub-Quadratic Time
  • 53. So Far… • We can divide up the grid into blocks and run dynamic programming only on the corners of these blocks. • In order to speed up the mini-alignment calculations to under n2, we create a lookup table of size n, which consists of all scores for all t-nucleotide pairs. • Running time goes from quadratic, O(n2), to subquadratic: O(n2/logn)
  • 54. Four Russians Speedup for LCS • Unlike the block partitioned graph, the LCS path does not have to pass through the vertices of the blocks. Block Alignment Longest Common Subsequence
  • 55. Block Alignment vs. LCS • In block alignment, we only care about the corners of the blocks. • In LCS, we care about all points on the edges of the blocks, because those are points that the path can traverse. • Recall, each sequence is of length n, each block is of size t, so each sequence has (n/t) blocks.
  • 56. Block Alignment vs. LCS: Points Of Interest Block alignment has (n/t)*(n/t) = (n2/t2) points of interest LCS alignment has O(n2/t) points of interest
  • 57. Traversing Blocks for LCS • Given alignment scores si,* in the first row and scores s*,j in the first column of a t x t mini square, compute alignment scores in the last row and column of the minisquare. • To compute the last row and the last column score, we use these 4 variables: 1. Alignment scores si,* in the first row. 2. Alignment scores s*,j in the first column. 3. Substring of sequence u in this block (4t possibilities). 4. Substring of sequence v in this block (4t possibilities).
  • 58. Traversing Blocks for LCS • If we used this to compute the grid, it would take quadratic, O(n2) time, but we want to do better. We know these scores We can calculate these scores t x t block
  • 59. Four Russians Speedup • Build a lookup table for all possible values of the four variables: 1. All possible scores for the first row s*,j 2. All possible scores for the first column s*,j 3. Substring of sequence u in this block (4t possibilities). 4. Substring of sequence v in this block (4t possibilities). • For each quadruple we store the value of the score for the last row and last column. • This will be a huge table, but we can eliminate alignments scores that don’t make sense.
  • 60. Reducing Table Size • Alignment scores in LCS are monotonically increasing, and adjacent elements can’t differ by more than 1. • Example: 0,1,2,2,3,4 is ok; 0,1,2,4,5,8, is not because 2 and 4 differ by more than 1 (and so do 5 and 8). • Therefore, we only need to store quadruples whose scores are monotonically increasing and differ by at most 1.
  • 61. Efficient Encoding of Alignment Scores • Instead of recording numbers that correspond to the index in the sequences u and v, we can use binary to encode the differences between the alignment scores. 0 1 2 2 3 4 1 1 0 0 1 1 original encoding binary encoding
  • 62. Reducing Lookup Table Size • 2t possible scores (t = size of blocks) • 4t possible strings • Lookup table size is (2t * 2t)*(4t * 4t) = 26t • Let t = (logn)/4; • Table size is: 26((logn)/4) = n(6/4) = n(3/2) • Time = O( [n2/t2]*logn ) • O( [n2/{logn}2]*logn) > O( n2/logn )
  • 63. Summary • We take advantage of the fact that for each block of t = log(n), we can pre-compute all possible scores and store them in a lookup table of size n(3/2). • We used the Four Russian speedup to go from a quadratic running time for LCS to subquadratic running time: O(n2/logn).