GARY BENSON, YOZEN HERNANDEZ, &
JOSHUA LOVING
BIOINFORMATICS PROGRAM
B O S T O N U N I V E R S I T Y
J L O V I N G @ B U ....
Introduction: Problem Description
Input:
• Sequences X and Y
• Integer weights M; I; G
match; mismatch; indel or gap that ...
Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -...
Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -...
Introduction
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -...
Introduction
- A C T G C A A
- 0 0 0 0 0 0 0 0
A 0 2 -3 -3 -3 -3 2 2
G 0 -3 -1 -6 -1 -6 -3 -1
T 0 -3 -6 1 -4 -4 -8 -6
C 0 ...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mi...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2
G -10
T -15
C -20
A -25
A -30
Match = 2, ...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3
G -10
T -15
C -20
A -25
A -30
Match = ...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Needleman-Wunsch Alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8
G -10
T -15
C -20
A -25
A -30
Match...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Bit-parallel alignment
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5
G -10
T -15
C -20
A -25
A -30
Match = 2, Mismat...
Motivation
 Cheaper
sequencing of
DNA means that
larger datasets
are being
generated
 Sequence
analysis of such
large da...
Earlier Bit-parallel Pattern Matching Algorithms
 Longest Common Subsequence (LCS) (Allison & Dix, 1986;
Crochemore et al...
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4...
Algorithm Foundation
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16...
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4...
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T -15 -8 -6 1 -4...
∆H
-5
∆V -
-5
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -16 -21
T ...
Function Table
∆VNEXT output values given ∆V and ∆H
input values
What is the range of differences?
What is the range of differences?
Match = 2,
Mismatch = -3,
Indel = -5
What is the range of differences?
Match = 2,
Mismatch = -3,
Indel = -5
Minimum Value = Indel = -5
Maximum Value = Match – ...
Generalized Function Table
M = match score
I = mismatch score
G = indel (gap) penalty
Zones in Example Function Table
Bit-parallel Representation
 Bit-vectors: computer words 64 bits long
 A bit-vector for each possible difference, both h...
Example ∆H Bit-vector Storage
- A C T G C A A
C -20 -13 -6 -4 -2 -2 -7 -12
∆H values
7 7 2 2 0 -5 -5
∆H Bit-Vectors
7 1 1 ...
Example Match Vectors
- A C T G C A A
- 0 -5 -10 -15 -20 -25 -30 -35
A -5 2 -3 -8 -13 -18 -23 -28
G -10 -3 -1 -6 -6 -11 -1...
Algorithm
 Start with ∆H values
 Compute ∆V values
 Then compute the new ∆H values
Algorithm: Example
- A C T G C A A
C -20 -13 -6 -4 -2 -2 -7 -12
A -25 -18 -11 -9 -7 -5 0 -5
∆H values
7 7 2 2 0 -5 -5
-5 -...
Time Complexity
𝑂 𝑧𝑛
𝑚
𝑤
where
n = |Sequence Y|
m = |Sequence X|
w = length of computer word
𝑧 =
(𝑀 − 2𝐺 + 1)2
− (𝐼 − 2𝐺)2...
Implementation
 Python script that generates C code based on input
parameters (M; I; G)
 Will eventually have web page f...
Experimental Analysis
 Compared BHL with
 Wu-Manber K-differences algorithm
 Unit cost edit distance bit-parallel algor...
Results: Comparison to NW algorithm
Results: comparison to bit-parallel algorithms
Current and Future Work
 Implementation for sequences longer than one word
 Single Instruction Multiple Data (SIMD) impl...
Acknowledgements
My advisor, Dr. Gary Benson
Lab members
Yevgeniy Gelfand Yozen Hernandez
Funded by the National Science F...
Questions
Upcoming SlideShare
Loading in …5
×

A bit parallel, general integer-scoring

375 views

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
375
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
5
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • In computational biology, sequence alignment forms an important part of biological sequence analysis.To formally state the problem we are trying to solve, given two input sequences we want to find their global alignment score, using integer parameters to weight the alignment. We aren’t computing the alignment, which takes extra steps, we are solely interested in computing the alignment score.
  • The Needleman Wunsch algorithm is the standard method of global alignment.Here is the alignment scoring matrix for an example alignment of two sequences with a particular set of alignment scoring weights.
  • We restrict the scores that we are using to be only integer valued.
  • In this alignment example, we want to align the strings from their beginnings, so we impose a gap penalty for starting from a place other than the beginning in either string.
  • However, if this constraint is not present, we can allow alignment to start at any point by initializing the first row and column to be all zero.Our method allows either case, but we will focus on the case using an initial penalty.
  • As a brief overview, the NW algorithm proceeds by initializing the scoring matrix’s first row and column.
  • To fill in the scoring matrix, we iterate over it, with each cell’s score being determined by the three cells to the left, diagonally, and above.
  • To fill in the entire scoring matrix, each cell must be visited once.
  • Because cells depend on the cells that preceded them, they must be done in order.
  • Bit parallel alignment allows us to represent multiple cells by bits in a computer word.Rather than computing each cell individually, we compute values for entire rows at once.The benefit to bit-parallel algorithms lie in their speed and efficiency.
  • In this figure, we have cost, in dollars, shown logarithmically on the y-axis against time on the x-axis. The white line represents how Moore’s Law shrinks the cost of computation over time. In green, we have the actual cost of sequencing a human genome, so it is clear that the cost of sequencing is dropping much faster than the cost of computing.This makes implementing faster sequence alignment algorithms important, which is what we are doing.
  • Say that Hyyroacheieved 4 bit operations per word for LCS, and 15 bit operations per word for Unit-cost edit distance.Arbitrary weights edit-distance – does it solve the same problem? They gave no implementation and were only able to guess at the time complexity. We have an implementation that uses a different methodology.
  • As we recalled, each cell in the NW alignment matrix is derived from the 3 adjacent cells to its left and above.
  • As an example, we will consider a small block of the scoring matrix.
  • We know that the scores in the scoring matrix have unbounded values.However, the differences between adjacent cells are bounded. In our bit-parallel algorithm and others these differences are used as the problem representation. This is similar to how the 4 russians’ technique uses differences between cells. (possibly refer to 4 russian’s technique if questions are asked)
  • We will call these differences Delta H and Delta V.
  • Of course, it is obvious that DeltaVnext and DeltaHnext can be derived from Delta H and Delta V, but what was not obvious was the regular pattern that emerges if you look at all possible inputs and outputs.
  • In fact, the output values produced by input values of Delta H and Delta V follow a very regular pattern.This is shown here in this function table for Delta Vnext output values.
  • As an example, we can look at the function table shown before. The input Delta V and Delta H values range from -5 to positive 7.
  • These values derive from the input parameters of 2, -3, and -5
  • The minimum value is equal to the indel penalty, and the maximum value is equal to the match score minus the indel penalty. 2 minus (negative 5)
  • Similar relationships determine important boundaries between these zones in this function table.The algorithm needs to compute the values in these zones separately Thus, using a set of scoring parameters we can generate the corresponding function table
  • Here are the same zones shown in the example function table we were looking at.
  • What are bit-vectors? They are computer words, in current machines, generally 64 bits long. We use a bit-vector to represent each possible difference horizontally and vertically. We end up with two sets, one that represents Delta H, and another set representing the same integer values of difference for Delta V.We also need a set of Match vectors. For each character in Sequence Y, we wish to know whether there are matches in sequence X. In the case of DNA, this means that we must store 4 vectors for matches with A, C, G or T.These match vectors are necessary because matches represent a special case in the function table.
  • Here we have a single row in the scoring matrix. The bit-vectors that represent the horizontal changes are shown directly below the values they represent.
  • Similarly, this is an example set of Match vectors used to store the matches corresponding to each base in Sequence X
  • Point out the squared values, note that these tend to be small, and mention that we will be showing an actual runtime demonstration of how the parameters affect efficiency.
  • As the scoring parameters change, the code changes as well. Since it would be very difficult to write a program for each possible input parameter set, I wrote a script that generates the C code for each scoresWe will host it on a website.
  • For all experiments, we used human DNA and did a total of 25 million alignments. All sequences were 63 bases long, to allow the bit-parallel representations to fit in single computer words. (if they ask: We are working on the multiple word implementation, but were unable to finish in time for this paper.)Single core of 3 GHz computer (intel core 2 duo).
  • We implemented our algorithm and compared it to several other pattern matching algorithms.This figure compares our method with Needleman Wunsch. Say what the Y – axis is: minutes to compute 25 million alignmentsSay what the X-axis is: number of bit operations that each of our versions of the algorithm uses.Say what the labels on the BHL algorithms mean.Note that as the parameters increase, the time increases.Note that NW takes over 7 minutes, our 2-3-5 version takes under 2
  • Here, we compare the unit-cost edit distance bit-parallel algorithm, the LCS bit-parallel algorithm, Wu-Manbers K-difference algorithm, and the version of our algorithm equivalent to unit-cost edit-distance.While our algorithm has slightly more operations, 23, than the unit-cost edit distance algorithm with 15, our algorithm is not optimized for any particular parameter set. Even though our algorithm is general purpose, we come quite close to the best known solution.
  • I’d like to thank my advisor, Dr. Gary Benson, and the other members of my lab, YevgeniyGelfand and Yozen Hernandez, as well as the NSF for funding this work.
  • A bit parallel, general integer-scoring

    1. 1. GARY BENSON, YOZEN HERNANDEZ, & JOSHUA LOVING BIOINFORMATICS PROGRAM B O S T O N U N I V E R S I T Y J L O V I N G @ B U . E D U A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm
    2. 2. Introduction: Problem Description Input: • Sequences X and Y • Integer weights M; I; G match; mismatch; indel or gap that define a similarity or distance scoring function S Output: • Calculate the global alignment score for X and Y
    3. 3. Introduction - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Match = 2, Mismatch = -3, Indel = -5 Global Alignment – Needleman-Wunsch Alignment Scoring Matrix Sequence X SequenceY
    4. 4. Introduction - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Match = 2, Mismatch = -3, Indel = -5 Integer Scores
    5. 5. Introduction - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Match = 2, Mismatch = -3, Indel = -5 Penalty from beginning
    6. 6. Introduction - A C T G C A A - 0 0 0 0 0 0 0 0 A 0 2 -3 -3 -3 -3 2 2 G 0 -3 -1 -6 -1 -6 -3 -1 T 0 -3 -6 1 -4 -4 -8 -6 C 0 -3 -1 -4 -2 -2 -7 -11 A 0 2 -3 -9 -7 -5 0 -5 A 0 2 -1 -6 -7 -10 -3 2 Match = 2, Mismatch = -3, Indel = -5 No initial Penalty
    7. 7. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5
    8. 8. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5
    9. 9. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5
    10. 10. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5
    11. 11. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30
    12. 12. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30
    13. 13. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30
    14. 14. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30
    15. 15. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30
    16. 16. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30
    17. 17. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 T -15 C -20 A -25 A -30
    18. 18. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 T -15 C -20 A -25 A -30
    19. 19. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 T -15 C -20 A -25 A -30
    20. 20. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 T -15 C -20 A -25 A -30
    21. 21. Needleman-Wunsch Alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30
    22. 22. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores
    23. 23. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30
    24. 24. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30
    25. 25. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 A -25 A -30
    26. 26. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 A -30
    27. 27. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30
    28. 28. Bit-parallel alignment - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 G -10 T -15 C -20 A -25 A -30 Match = 2, Mismatch = -3, Indel = -5 Integer Scores - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2
    29. 29. Motivation  Cheaper sequencing of DNA means that larger datasets are being generated  Sequence analysis of such large datasets can be accelerated by faster alignment algorithms Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed June 10, 2013.
    30. 30. Earlier Bit-parallel Pattern Matching Algorithms  Longest Common Subsequence (LCS) (Allison & Dix, 1986; Crochemore et al, 2001; Hyyro, 2004)  Unit-cost edit-distance (Myers, 99; Hyyro et al, 2005)  K-differences (agrep; Wu-Manber, 92)  Regular expression search (Navarro, 04)  Arbitrary weights edit-distance (Bergeron&Hamel, 02)
    31. 31. - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Algorithm Foundation Match = 2, Mismatch = -3, Indel = -5
    32. 32. Algorithm Foundation - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Match = 2, Mismatch = -3, Indel = -5 -3 -8 -1 -6
    33. 33. - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 -3 -8 -1 -6 -5 2 2 -5 Algorithm Foundation Match = 2, Mismatch = -3, Indel = -5
    34. 34. - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 -3 -8 -1 -6 ∆H ∆V ∆VNEXT ∆HNEXT -5 2 2 -5 Algorithm Foundation Match = 2, Mismatch = -3, Indel = -5
    35. 35. ∆H -5 ∆V - -5 - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 -3 -8 -1 -6 -5 2 2 -5 ∆H ∆V ∆VNEXT ∆HNEXT Algorithm Foundation Match = 2, Mismatch = -3, Indel = -5 Output Input
    36. 36. Function Table ∆VNEXT output values given ∆V and ∆H input values
    37. 37. What is the range of differences?
    38. 38. What is the range of differences? Match = 2, Mismatch = -3, Indel = -5
    39. 39. What is the range of differences? Match = 2, Mismatch = -3, Indel = -5 Minimum Value = Indel = -5 Maximum Value = Match – Indel = 2 - (- 5) = 7
    40. 40. Generalized Function Table M = match score I = mismatch score G = indel (gap) penalty
    41. 41. Zones in Example Function Table
    42. 42. Bit-parallel Representation  Bit-vectors: computer words 64 bits long  A bit-vector for each possible difference, both horizontally and vertically (∆V and ∆H)  A set of Match vectors (MatchA, MatchC, MatchG, MatchT in the DNA case)  We keep track of match positions because they are a special case in the function table.
    43. 43. Example ∆H Bit-vector Storage - A C T G C A A C -20 -13 -6 -4 -2 -2 -7 -12 ∆H values 7 7 2 2 0 -5 -5 ∆H Bit-Vectors 7 1 1 0 0 0 0 0 6 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 2 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 -1 0 0 0 0 0 0 0 -2 0 0 0 0 0 0 0 -3 0 0 0 0 0 0 0 -4 0 0 0 0 0 0 0 -5 0 0 0 0 0 1 1
    44. 44. Example Match Vectors - A C T G C A A - 0 -5 -10 -15 -20 -25 -30 -35 A -5 2 -3 -8 -13 -18 -23 -28 G -10 -3 -1 -6 -6 -11 -16 -21 T -15 -8 -6 1 -4 -9 -14 -19 C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 A -30 -23 -16 -14 -12 -10 -3 2 Match Vectors MatchesA 1 0 0 0 0 1 1 MatchesC 0 1 0 0 1 0 0 MatchesT 0 0 1 0 0 0 0 MatchesG 0 0 0 1 0 0 0
    45. 45. Algorithm  Start with ∆H values  Compute ∆V values  Then compute the new ∆H values
    46. 46. Algorithm: Example - A C T G C A A C -20 -13 -6 -4 -2 -2 -7 -12 A -25 -18 -11 -9 -7 -5 0 -5 ∆H values 7 7 2 2 0 -5 -5 -5 -5 -5 -5 -3 7 7 ∆V values-5
    47. 47. Time Complexity 𝑂 𝑧𝑛 𝑚 𝑤 where n = |Sequence Y| m = |Sequence X| w = length of computer word 𝑧 = (𝑀 − 2𝐺 + 1)2 − (𝐼 − 2𝐺)2 2
    48. 48. Implementation  Python script that generates C code based on input parameters (M; I; G)  Will eventually have web page for download of code
    49. 49. Experimental Analysis  Compared BHL with  Wu-Manber K-differences algorithm  Unit cost edit distance bit-parallel algorithm  Longest Common Subsequence bit-parallel algorithm  Needleman-Wunsch dynamic programming algorithm  Computed 25 million alignments with each program  Each DNA sequence was 63 bases long  All programs compiled using GCC, optimization level O3  Computation done on a typical desktop computer
    50. 50. Results: Comparison to NW algorithm
    51. 51. Results: comparison to bit-parallel algorithms
    52. 52. Current and Future Work  Implementation for sequences longer than one word  Single Instruction Multiple Data (SIMD) implementation  BLOSUM and PAM type substitution matrix support  General Purpose Graphics Processing Unit (GPGPU) implementation  New bit-parallel representations for greater speed and compactness of data
    53. 53. Acknowledgements My advisor, Dr. Gary Benson Lab members Yevgeniy Gelfand Yozen Hernandez Funded by the National Science Foundation (NSF)
    54. 54. Questions

    ×