Prof. Alba presented parallel biological sequence alignment with the Smith-Waterman algorithm and introduced CUDAlign, our fine-grained multi-GPU strategy. This work is part of a research project at the University of Brasilia.
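The Smith-Waterman recurrence that CUDAlign parallelizes across GPUs can be sketched in a few lines. This is a plain sequential version for illustration only, not CUDAlign itself; the scoring parameters are illustrative assumptions.

```python
# Minimal Smith-Waterman local alignment: fill the scoring matrix H with
# match/mismatch/gap scores, clamp at zero, and return the best local score.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best
```

The fine-grained parallelism in CUDAlign comes from the fact that all cells on one anti-diagonal of `h` depend only on the previous two anti-diagonals, so they can be computed concurrently.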
Many machine learning inference workloads compute predictions with a limited number of models deployed together in the same system. These models often share common structure and state. This scenario leaves considerable room for runtime and memory optimizations, which current systems fail to exploit because they treat ML models and tasks as black boxes and are therefore unaware of optimization and sharing opportunities.
In contrast, Pretzel adopts a white-box description of ML models, which lets the framework optimize across deployed models and running tasks, saving memory and increasing overall system performance. In this talk we present the motivation behind Pretzel, its current design, and possible future developments.
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM - sipij
In this paper, the delay computation method for the Common Subexpression Elimination (CSE) algorithm is implemented on the Cyclotomic Fast Fourier Transform (CFFT). Combined with the delay computing method, the CSE algorithm is known as the Gate-Level Delay Computation with Common Subexpression Elimination (GLDC-CSE) algorithm. Common subexpression elimination is an effective optimization method used to reduce adders in the cyclotomic Fourier transform. The delay computing method is based on a delay matrix and is suitable for computer implementation. The gate-level delay computation method is used to find the critical path delay, and it is analyzed over various finite field elements. The presented algorithm is demonstrated through a case study of the CFFT over a finite field. If the CFFT is implemented directly, the system has high additive complexity. By applying the GLDC-CSE algorithm to the CFFT, the additive complexity is reduced, along with the area and the area-delay product.
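The core idea behind CSE-based adder reduction can be illustrated with a toy greedy pass. This is a hedged sketch of the general technique, not the paper's GLDC-CSE algorithm: each output is modeled as a sum of input signals, and the pair of terms shared by the most outputs is a candidate to factor out as one new intermediate adder.

```python
# Find the pair of input terms that co-occurs in the most output sums;
# factoring that pair into one shared adder saves (uses - 1) adders.
from itertools import combinations
from collections import Counter

def most_common_pair(outputs):
    """outputs: list of frozensets of input names."""
    counts = Counter()
    for out in outputs:
        for pair in combinations(sorted(out), 2):
            counts[pair] += 1
    return counts.most_common(1)[0]

outputs = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
pair, uses = most_common_pair(outputs)
# ('a', 'b') appears in all three outputs; computing a + b once and reusing
# it replaces one adder per use, saving uses - 1 = 2 adders overall.
```

Real CSE for CFFTs works over GF(2) matrices and iterates this extraction until no profitable pair remains; the single greedy step above only shows the counting idea.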
International Journal of Engineering Research and Applications (IJERA) is an open-access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB - Journal For Research
Image compression is used in many applications, for example satellite imaging, medical imaging, and video, where images require substantial storage space. Image compression techniques fall into two types, lossy and lossless. Both are widely used for compressing images, but neither is fast: compression and decompression take considerable time. For fast and efficient image compression, a parallel computing technique in MATLAB is used. In this paper we discuss the regular image compression technique, three alternatives for parallel computing in MATLAB, and a comparison of image compression with and without parallel computing.
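The tile-parallel idea the paper applies in MATLAB can be sketched in Python (an assumption for illustration; the paper uses MATLAB's parallel facilities). The image byte stream is split into tiles that workers compress independently; `zlib` releases the GIL during compression, so a thread pool genuinely parallelizes the lossless case.

```python
# Split data into tiles, compress tiles in parallel, and reassemble losslessly.
import zlib
from multiprocessing.pool import ThreadPool

def compress_tile(tile: bytes) -> bytes:
    return zlib.compress(tile, level=6)

def parallel_compress(data: bytes, tile_size: int = 1 << 16, workers: int = 4):
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    with ThreadPool(workers) as pool:
        return pool.map(compress_tile, tiles)

def decompress(tiles) -> bytes:
    return b"".join(zlib.decompress(t) for t in tiles)
```

Tiling trades a little compression ratio (no cross-tile redundancy) for near-linear speedup in the number of workers, the same trade-off the paper's MATLAB alternatives make.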
Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as black boxes, where the same implementation used for training is run for inference. This solution is convenient but leaves large room to improve performance and resource usage. This talk presents Pretzel, a framework for deploying ML pipelines that is inspired by database systems: Pretzel inspects and optimizes pipelines end-to-end, much like queries, and manages resources common to multiple pipelines, such as operators' state. Pretzel is joint work with the University of Seoul and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results comparing Pretzel against state-of-the-art ML solutions and discusses limitations and extensions.
This paper looks interesting from its title alone. The paper we introduce at today's deep learning paper-reading group is DEAR: Deep Reinforcement Learning for Online Advertising Impression in Recommender Systems, an online recommender system that uses reinforcement learning. Some details are not publicly disclosed, but the ideas alone make it an enjoyable talk. Changyeon Kim of the Fundamentals team prepared the review, covering everything from the basic concepts of reinforcement learning to a detailed, in-depth walkthrough of the paper!
Thank you in advance for your interest!
One more thing: the deep learning paper-reading group also runs an open-chat listeners' room. Due to a recent rise in malicious promotional bot accounts, the room is now password-protected.
Please show the deep learning listeners' room some interest as well!
Listeners' room link: https://open.kakao.com/o/gp6GHMMc
Listeners' room password: 0501
Scalable and Adaptive Graph Querying with MapReduce - Kyong-Ha Lee
We address the problem of processing multiple graph queries over a massive set of data graphs in this letter. As the number of data graphs grows rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm that employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
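The filter-and-verify scheme the letter builds on can be sketched on a single machine. This is a hedged illustration: the feature here (the set of edge labels) and the verification predicate are stand-in assumptions, not the paper's actual feature encoding or its MapReduce implementation.

```python
# Filter-and-verify: cheap feature containment prunes data graphs, and the
# costly verification runs only on the surviving candidates.
def features(graph):
    """graph: set of (u, v, label) edges; feature = set of edge labels."""
    return {label for (_, _, label) in graph}

def filter_and_verify(query, data_graphs, verify):
    qf = features(query)
    candidates = [g for g in data_graphs if qf <= features(g)]  # filter step
    return [g for g in candidates if verify(query, g)]          # verify step
```

The filter is sound (a graph lacking a query label can never contain the query), so pruning never loses answers; in MapReduce, the filter naturally maps over partitions of data graphs and verification runs only on the survivors.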
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar... - Kyong-Ha Lee
Subgraph matching is a fundamental operation for querying graph-structured data. Due to potential errors and noise in real-world graph data, exact subgraph matching is sometimes not appropriate in practice. In this paper we consider an approximate subgraph matching model that allows missing edges. Under this model, approximate subgraph matching finds all occurrences of a given query graph in a database graph, allowing missing edges. A straightforward approach to this problem is to first generate query subgraphs of the query graph by deleting edges and then perform exact subgraph matching for each query subgraph. In this paper we propose a sharing-based approach to approximate subgraph matching, called SASUM. Our method is based on the fact that query subgraphs are highly overlapped. Due to this overlapping nature, the matches of a query subgraph can be computed from the matches of a smaller query subgraph, which reduces the number of query subgraphs that need costly exact subgraph matching. Our method uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. A comprehensive set of experiments shows that our approach outperforms the state-of-the-art approach by orders of magnitude in terms of query execution time.
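The straightforward enumeration step that SASUM optimizes away can be sketched directly; the heavy overlap among the generated query subgraphs is exactly what the lattice-based sharing exploits. Function and variable names here are illustrative assumptions.

```python
# Enumerate the query subgraphs obtained by deleting up to max_missing edges.
from itertools import combinations

def query_subgraphs(edges, max_missing):
    """edges: frozenset of query edges; returns each subgraph (as a frozenset)
    obtained by deleting at most max_missing edges."""
    subgraphs = []
    for k in range(max_missing + 1):
        for removed in combinations(sorted(edges), k):
            subgraphs.append(frozenset(edges) - frozenset(removed))
    return subgraphs

edges = frozenset({("a", "b"), ("b", "c"), ("c", "a")})
subs = query_subgraphs(edges, 1)
# One full graph plus three one-edge-deleted graphs: four query subgraphs,
# any two of which share two of three edges. That shared structure is what
# lets matches of smaller subgraphs be reused for larger ones.
```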
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch - Sunghoon Joo
Paper link: https://arxiv.org/abs/2003.03384
Video presentation link: https://youtu.be/J__uJ79m01Q
To briefly introduce Auto DeepLab first: it is a model for the semantic segmentation task. The authors set out to generate the segmentation network itself through machine learning. Architecture search is a representative AutoML technique, and that is why the paper is titled Auto DeepLab: it applies AutoML methods. On the AutoML side the authors drew on the DARTS paper, and on the segmentation side they drew heavily on DeepLab V3. Sunok Kim of the image processing team provided a detailed review of the paper!
https://youtu.be/2886fuyKo9g
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Uncertainty-quantification tasks are often "many query" in nature, as they require repeated evaluations of a model that often corresponds to a parameterized system of nonlinear equations (e.g., arising from the spatial discretization of a PDE). To make this task tractable for large-scale models, low-fidelity models (e.g., reduced-order models, coarse-mesh solutions) must be employed. However, such approximations introduce additional error, which may be treated as a source of epistemic uncertainty that must be quantified to ensure rigor in the ultimate UQ result. We present a new approach to quantify the error (i.e., epistemic uncertainty) introduced by these low-fidelity model approximations. The approach (1) engineers features that are informative of the error using concepts related to dual-weighted residuals and rigorous error bounds, and (2) applies machine learning regression techniques (e.g., artificial neural networks, random forests, support vector machines) to construct a statistical model of the error from these features. We consider both (signed) errors in quantities of interest, as well as global state-space error norms. We present several examples to demonstrate the effectiveness of the proposed approach compared to more conventional feature and regression choices. In each of the examples, the predicted errors have a coefficient of determination value of at least 0.998.
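The two-step recipe can be sketched under simplifying assumptions: (1) engineer a feature informative of the error (here a stand-in residual-norm feature, not the paper's dual-weighted residuals), and (2) regress the observed error on that feature (here ordinary least squares in place of the paper's neural networks, random forests, or SVMs).

```python
# Step 2 in miniature: fit error ~ a * feature + b by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic training pairs: residual-norm feature vs. observed QoI error
# (purely illustrative data; here the error happens to be 0.5 * feature).
feats = [0.1, 0.2, 0.4, 0.8]
errors = [0.05, 0.1, 0.2, 0.4]
a, b = fit_line(feats, errors)
```

Once fitted, the regression predicts the epistemic error of a new low-fidelity solve from its cheap feature alone, which is what makes the many-query UQ loop affordable.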
Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI - ijtsrd
Matrix multiplication is a concept used in technology applications such as digital image processing, digital signal processing, and graph problem solving. Multiplying huge matrices requires a lot of computing time, as its complexity is O(n³). Because most engineering and science applications require high computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are selected, implemented, and analyzed. A performance analysis is carried out, and recommendations are given for using the OpenMP and MPI methods of parallel computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako, "Comprehensive Performance Evaluation on Multiplication of Matrices using MPI"
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
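The row-partitioned scheme such MPI evaluations typically use can be sketched on one machine, with a thread pool standing in for MPI ranks (an assumption for illustration; with real MPI, each rank would receive its block of A's rows plus all of B and compute its block of C independently).

```python
# Each "rank" computes a contiguous block of C's rows; results are concatenated.
from multiprocessing.pool import ThreadPool

def matmul_rows(args):
    a_block, b = args
    n, p = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(n)) for j in range(p)]
            for row in a_block]

def parallel_matmul(a, b, ranks=2):
    chunk = (len(a) + ranks - 1) // ranks
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPool(ranks) as pool:
        parts = pool.map(matmul_rows, [(blk, b) for blk in blocks])
    return [row for part in parts for row in part]
```

Row partitioning needs no communication between ranks during the compute phase, which is why it is a common baseline in MPI performance evaluations.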
Description: WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs) without needing access to training or even test data. It can be used to: analyze pre-trained PyTorch, Keras, and DNN models (Conv2D and Dense layers); monitor models, and model layers, to see if they are over-trained or over-parameterized; predict test accuracies across different models, with or without training data; and detect potential problems when compressing or fine-tuning pre-trained models. See https://weightwatcher.ai
A New Cross Diamond Search Motion Estimation Algorithm for HEVC - IJERA Editor
In this project, a novel approach to motion estimation is proposed. A few block matching algorithms exist for motion estimation. Here, a new cross diamond search algorithm is implemented; compared to diamond search, it uses fewer search points, which reduces the computational complexity. The performance of the algorithm is compared with other algorithms in terms of search points. The algorithm achieves performance close to that of three-step search and diamond search. Compared to these algorithms, cross diamond search uses fewer logic elements and has lower delay and power dissipation.
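Pattern-based block matching can be sketched with a plain diamond search; the paper's cross diamond variant adds a small cross-shaped pattern first to cut search points further. The SAD cost, the search pattern, and the simplified stopping rule below are illustrative assumptions.

```python
# Diamond-pattern block matching: move the search center to the lowest-SAD
# candidate until the center itself is best, then report the motion vector.
def sad(cur, ref, ax, ay, bx, by, size):
    return sum(abs(cur[ay + j][ax + i] - ref[by + j][bx + i])
               for j in range(size) for i in range(size))

def diamond_search(cur, ref, x, y, size=2):
    h, w = len(ref), len(ref[0])
    cx, cy = x, y
    pattern = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
               (1, 1), (1, -1), (-1, 1), (-1, -1)]
    while True:
        best = min((sad(cur, ref, x, y, cx + dx, cy + dy, size),
                    (cx + dx, cy + dy))
                   for dx, dy in pattern
                   if 0 <= cx + dx <= w - size and 0 <= cy + dy <= h - size)
        if best[1] == (cx, cy):
            break
        cx, cy = best[1]
    return (cx - x, cy - y)  # motion vector

cur = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
ref = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]]
mv = diamond_search(cur, ref, 0, 0)  # the 2x2 block moved 2 pixels right
```

The search-point savings of cross diamond come from probing a cheap small cross before committing to the larger diamond pattern, since most real motion vectors are small and cross-shaped around the origin.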
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING - ijdkp
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases. For this reason, it is important to have efficient computational methods and algorithms that can be applied to large datasets, such that it is still possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper a more accurate, simple process to speed up ML methods. An unsupervised clustering algorithm is combined with the Expectation-Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The idea of the proposed process consists of two steps. In the first step, training instances with similar inputs are clustered, and a weight factor representing the frequency of these instances is assigned to each representative cluster. The Dynamic Time Warping technique is used as a dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in the appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition, and emission probability matrices as those obtained with the classical HMM training algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times are possible when the size is about 100,000 instances. The proposed approach is not limited to training HMMs, but can be employed for a large variety of ML methods.
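The Dynamic Time Warping dissimilarity used in the clustering step follows directly from its standard recurrence: the cost of aligning prefixes `a[:i]` and `b[:j]` is the local distance plus the cheapest of the insert, delete, or match continuations.

```python
# Classic O(n*m) DTW distance between two numeric sequences.
def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: stretch a, stretch b, or advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Unlike Euclidean distance, DTW treats time-stretched copies of the same pattern as similar, which is why it is a good dissimilarity for grouping training sequences into the weighted clusters the paper uses.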
Molecular dynamics (MD) is a very useful tool to understand various phenomena in atomistic detail. In MD, we can overcome the size- and time-scale problems by efficient parallelization. In this lecture, I’ll explain various parallelization methods of MD with some examples of GENESIS MD software optimization on Fugaku.
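Spatial domain decomposition, central to the parallelization methods such lectures cover, can be shown in a minimal sketch: the simulation box is partitioned into cells so that each process owns the atoms in its cell and only needs neighbor-cell halos for short-range forces. The box and cell sizes below are illustrative assumptions, not GENESIS parameters.

```python
# Assign each atom to the spatial cell (and hence process) that owns it.
def assign_to_cells(positions, box=9.0, cells_per_dim=3):
    """positions: list of (x, y, z); returns {atom_index: (cx, cy, cz)}."""
    side = box / cells_per_dim
    owner = {}
    for idx, (x, y, z) in enumerate(positions):
        cx = min(int(x / side), cells_per_dim - 1)
        cy = min(int(y / side), cells_per_dim - 1)
        cz = min(int(z / side), cells_per_dim - 1)
        owner[idx] = (cx, cy, cz)
    return owner

atoms = [(0.5, 0.5, 0.5), (4.0, 4.0, 4.0), (8.9, 0.1, 0.1)]
owner = assign_to_cells(atoms)
```

Because short-range interactions only cross adjacent cell boundaries, each process communicates with a constant number of neighbors regardless of system size, which is what lets MD scale to machines like Fugaku.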
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi... - ijcsa
The computer industry has widely accepted that future performance increases must largely come from increasing the number of processing cores on a die. This has led to NoC processors. Task scheduling is one of the most challenging problems facing parallel programmers today and is known to be NP-complete. A good principle is space-sharing of cores and scheduling multiple DAGs simultaneously on a NoC processor. Hence the need to find the optimal number of cores for a DAG under a particular scheduling method, and further which region of cores on the NoC to allot to a DAG. In this work, a method is proposed to find a near-optimal minimal block of cores for a DAG on a NoC processor. Further, a time-efficient framework and three on-line block allotment policies for submitted DAGs are evaluated experimentally. The objective of the policies is to improve NoC throughput. The policies are evaluated on a simulator and found to deliver better performance than policies found in the literature.
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ... - ijma
Thresholding operators have been used successfully for denoising signals, mostly in the wavelet domain.
These operators transform a noisy coefficient into a denoised coefficient with a mapping that depends on
signal statistics and the value of the noisy coefficient itself. This paper demonstrates that a polynomial
threshold mapping can be used for enhanced denoising of Principal Component Analysis (PCA) transform
coefficients. In particular, two polynomial threshold operators are used here to map the coefficients
obtained with the popular local pixel grouping method (LPG-PCA), which eventually improves the
denoising power of LPG-PCA. The method reduces the computational burden of LPG-PCA, by eliminating
the need for a second iteration in most cases. Quality metrics and visual assessment show the improvement.
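As a toy illustration (not the operators from the paper), a polynomial threshold operator generalizes soft thresholding: coefficients below a noise-derived threshold T are zeroed, and larger ones are shrunk by a polynomial of the excess magnitude. The function name, coefficients, and threshold below are hypothetical:

```python
def poly_threshold(x: float, T: float, coeffs=(1.0,)) -> float:
    """Map a noisy coefficient x to a denoised one (hypothetical sketch).

    T would be derived from noise statistics (e.g., the noise standard
    deviation); coeffs weight powers of the magnitude above T. With
    coeffs=(1.0,) this reduces to classical soft thresholding.
    """
    ax = abs(x)
    if ax <= T:
        return 0.0  # small coefficients are assumed noise-dominated
    # polynomial shrinkage of the excess magnitude |x| - T
    shrunk = sum(c * (ax - T) ** (k + 1) for k, c in enumerate(coeffs))
    return shrunk if x >= 0 else -shrunk

print(poly_threshold(5.0, 2.0))   # soft threshold: 3.0
print(poly_threshold(-1.0, 2.0))  # below threshold: 0.0
```

Fitting the polynomial coefficients to signal statistics is what distinguishes the operators studied in the paper from plain soft thresholding.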
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an Image Semantic Segmentation project targeted on Satellite Imagery. The goal was to detect the pixel-wise segmentation map for various objects in Satellite Imagery including buildings, water bodies, roads etc. The data for this was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a... - Masahito Ohue
Masahito Ohue, Marina Yamasawa, Kazuki Izawa, Yutaka Akiyama: Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN,
In Proceedings of the 19th annual IEEE International Conference on Bioinformatics and Bioengineering (IEEE BIBE 2019), 152-156, 2019. doi: 10.1109/BIBE.2019.00035
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds - Subhajit Sahu
Highlighted notes on Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds.
While doing research work under Prof. Kishore Kothapalli.
Laxman Dhulipala, David Durfee, Janardhan Kulkarni, Richard Peng, Saurabh Sawlani, Xiaorui Sun:
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds. SODA 2020: 1300-1319
In this paper we study the problem of dynamically maintaining graph properties under batches of edge insertions and deletions in the massively parallel model of computation. In this setting, the graph is stored on a number of machines, each having space strongly sublinear with respect to the number of vertices, that is, n^ε for some constant 0 < ε < 1. Our goal is to handle batches of updates and queries where the data for each batch fits onto one machine in constant rounds of parallel computation, as well as to reduce the total communication between the machines. This objective corresponds to the gradual buildup of databases over time, while the goal of obtaining constant rounds of communication for problems in the static setting has been elusive for problems as simple as undirected graph connectivity. We give an algorithm for dynamic graph connectivity in this setting with constant communication rounds and communication cost almost linear in terms of the batch size. Our techniques combine a new graph contraction technique, an independent random sample extractor from correlated samples, as well as distributed data structures supporting parallel updates and queries in batches. We also illustrate the power of dynamic algorithms in the MPC model by showing that the batched version of the adaptive connectivity problem is P-complete in the centralized setting, but sub-linear sized batches can be handled in a constant number of rounds. Due to the wide applicability of our approaches, we believe it represents a practically-motivated workaround to the current difficulties in designing more efficient massively parallel static graph algorithms.
Brief Explanation about the Tau-Leaping Process, Parallel Processing and NVIDIA's CUDA architecture
And the use of cuTauLeaping for the simulation of biological systems
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin... - csandit
A Computational Grid (CG) creates a large heterogeneous and distributed paradigm to manage and execute computationally intensive applications. In grid scheduling, tasks are assigned to the proper processors in the grid system for execution, considering the execution policy and the optimization objectives. In this paper, the makespan and the fault-tolerance of the computational nodes of the grid, two important parameters for task execution, are considered and optimized. As grid scheduling is NP-Hard, meta-heuristic evolutionary techniques are often used to find a solution. We have proposed an NSGA-II for this purpose. The performance of the proposed Fault-tolerance Aware NSGA-II (FTNSGA-II) has been estimated with a program written in Matlab. The simulation results evaluate the performance of the proposed algorithm, and the results of the proposed model are compared with the existing Min-Min and Max-Min algorithms, which demonstrates the effectiveness of the model.
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis - Jason Riedy
Applications in many areas analyze an ever-changing environment. On billion-vertex graphs, providing snapshots imposes a large performance cost. We propose the first formal model for graph analysis running concurrently with streaming data updates. We consider an algorithm valid if its output is correct for the initial graph plus some implicit subset of concurrent changes. We show theoretical properties of the model, demonstrate the model on various algorithms, and extend it to updating results incrementally.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... - Databricks
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, which enable the algorithms to map conveniently into Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, by following a parallel development process involving "Real" and "Symbolic" Cell Libraries, developed by Chips4Makers, we will show how our developers did not need to sign a Foundry NDA but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld its NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
The IT industry is going through two major transformations. One is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The other is the transformation of software architecture through concepts like microservices and cloud-native architecture. These transformations, alongside the aggressive adoption of IoT/mobile and 5G in all our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to adapt to these requirements. These two major transformations push the boundary of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the enterprise-industry-leading POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Join us on Friday, July 16th 2021, for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries, research, and public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER Systems:
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from retina images in a remote area with minimal doctor intervention. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: the idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed in Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
This presentation covers the partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from Health Care to the Automotive markets, without forgetting Financial markets or any type of engineering: everything has stopped being created by an individual or, in the best case, a team, and is now developed and perfected using AI and hundreds of computers. And even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC, the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that we run on them.
Macromolecular crystallography is an experimental technique that allows exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While up to now the development of the technique was limited by the performance of scientific instruments, recently computing performance has become a key limitation. In my presentation I will describe a computing challenge: handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advancement in hardware development will enable better science by users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems - Ganesan Narayanasamy
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data-scientist teams tasked to respond to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as Healthcare and Automotive, the AI ladder and AI life cycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and lifecycle, and AI-at-Scale themes. The iterative nature of the workflow and some of the important components to be aware of in developing AI healthcare solutions are discussed. The different types of algorithms, and when machine learning might be more appropriate than deep learning or the other way around, are also discussed. Example use cases are shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and due to this latest pandemic it has become mandatory to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers all around the world.
Those supercomputers are helping to overcome this outbreak and also future ones.
They have completely different features compared to proposals from other players in this supercomputer market.
We will take a quick look at the differences of those AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only one small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial sites monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari provided a poster about the deep learning algorithms to identify both locations and corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms have been discussed.
Clarisse Hedglin from IBM presented this as part of the 3-day International Summit. She shared the scenarios AI can solve for today using the IBM AI infrastructure.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Smart TV Buyer Insights Survey 2024 by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Accelerate your Kubernetes clusters with Varnish Caching
Parallel Biological Sequence Comparison in GPU Platforms
1. Alba Cristina Magalhaes Alves de Melo
Full Professor at the University of Brasilia (UnB)
CNPq/Brazil Research Fellow level 1C
IEEE Senior Member
Parallel Biological Sequence
Comparison in GPU Platforms:
Research at the University of Brasilia
OpenPower Webinar – June, 19th, 2020
https://www.compoundchem.com/2015/03/24/dna/
2. • Biological Sequences are obtained with sequencing
machines, using chromatography analysis.
DNA
sequence
Introduction
3. • Once a biological sequence is obtained, its
properties/characteristics need to be
established.
• This is mainly done by comparing the new
sequence with sequences that have already
been catalogued.
• The comparison of biological sequences is
one of the most important operations in
Bioinformatics.
– Its goal is to define how similar the sequences
are.
Introduction
4. • There are exact algorithms that
compare two sequences and produce
the optimal result. They have quadratic
time and memory complexity - O(mn),
where m and n are the sizes of the
sequences.
• Heuristic methods are faster and are
used in many genome projects.
– However, they do not guarantee that the
optimal result will be produced.
Introduction
5. • Smith-Waterman (SW) proposed an exact
algorithm based on dynamic programming
to locally align two sequences in quadratic
time and memory.
– It produces the optimal result.
– High execution times and huge memory
requirements.
• To compare the human chromosome 1
with the chimpanzee chromosome 1 (249
Millions of Base Pairs - MBP x 228 MBP),
at least 240 Petabytes of memory are
needed.
This SW comparison was considered
unfeasible in 2008.
Introduction
6. • Present and discuss our MASA-CUDAlign
strategy to compare huge chromosomes in
GPUs
– We use a highly optimized algorithm to exploit
parallelism
– We use speculative techniques to accelerate the
sequential part of the algorithm
– We were able to use up to 384 NVidia M2090 GPUs to
compare the human and chimpanzee homologous
chromosomes 1 in 2016
– We will show preliminary results of the next version of
our tool in the IBM platform with 8 NVidia Volta
• Present and discuss our MASA-OpenMP results in
Power9 for smaller sequences
– We will show comparative results Power vs Intel
– We will present preliminary covid-19 results
Goal of this talk
8. • To compare two sequences, one sequence is
placed above the other and a score is
computed.
S0 = GACGGATTAG    S1 = GATCGGAATAG

Alignment (11 characters - Base Pairs, BP):
S0: G  A  -  C  G  G  A  T  T  A  G
S1: G  A  T  C  G  G  A  A  T  A  G
    +1 +1 -2 +1 +1 +1 +1 -1 +1 +1 +1
    (+1: match, -1: mismatch, -2: gap)
score: +6
Biological Sequence Comparison
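The column-by-column scoring above can be reproduced in a few lines of Python (a sketch using the slide's parameters: +1 match, -1 mismatch, -2 gap):

```python
def alignment_score(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = -2) -> int:
    """Score a gapped alignment column by column."""
    assert len(s0) == len(s1), "aligned strings must have equal length"
    score = 0
    for a, b in zip(s0, s1):
        if a == '-' or b == '-':
            score += g   # gap column
        elif a == b:
            score += ma  # match
        else:
            score += mi  # mismatch
    return score

# The 11-character alignment from the slide:
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```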
9. • Based on dynamic programming with quadratic time and
memory complexity (O(mn)).
• Executes in two steps:
• (1) calculate the DP matrix (similarity score) and
• (2) traceback (alignment)
• Having sequences S0 and S1 as input, with sizes m and n, the
(m+1) x (n+1) matrix H is computed:

H[i,j] = max( H[i-1,j-1] + p(i,j),
              H[i-1,j]   - g,
              H[i,j-1]   - g,
              0 )

p(i,j) = ma, if S0[i] = S1[j]   (match)
         mi, otherwise          (mismatch)
g: gap penalty
Smith-Waterman (SW) Algorithm
10. DP matrix for S1 = ATAGCTA (columns) and S0 = ATACGCTCTT (rows),
with values g=-2, mi=-1, ma=1; the first row and column are
initialized with zeros (cell values as shown in the slide, where
arrows mark the traceback path):

        -  A  T  A  G  C  T  A
    -   0  0  0  0  0  0  0  0
    A   0  1  0  1  0  0  0  1
    T   0  0  2  0  0  0  1  0
    A   0  1  0  3  0  0  0  2
    C   0  0  0  1  2  1  0  0
    G   0  0  0  1  2  1  0  0
    C   0  0  0  0  0  3  0  0
    T   0  0  1  0  0  1  4  2
    C   0  0  0  0  0  1  2  3
    T   0  0  1  0  0  0  2  1
    T   0  0  1  0  0  0  0  1

Each cell takes the maximum of diagonal, up, left and 0,
e.g. max(2, -2, -2, 0) = 2.
The highest score (4) marks the start of the traceback path.

Local Alignment:
A T A - G C T
A T A C G C T
Smith-Waterman (SW) Example
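The recurrence can be sketched as a plain-Python reference implementation (a toy version, not CUDAlign's GPU code); for the example sequences above it also yields a highest score of 4:

```python
def sw_matrix(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = 2):
    """Fill the Smith-Waterman DP matrix (linear gap model, penalty g)."""
    m, n = len(s0), len(s1)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay zero
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = ma if s0[i - 1] == s1[j - 1] else mi
            H[i][j] = max(H[i - 1][j - 1] + p,  # diagonal (match/mismatch)
                          H[i - 1][j] - g,      # up (gap)
                          H[i][j - 1] - g,      # left (gap)
                          0)                    # local-alignment floor
    return H

H = sw_matrix("ATACGCTCTT", "ATAGCTA")
print(max(max(row) for row in H))  # highest score: 4
```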
11. • [Gotoh 1982]: Computes the affine gap model, where the
value assigned to start a sequence of gaps (GapOpen) is
higher than the value assigned to extend it (GapExtend)
• Computes 3 DP matrices and provides a better
biological result
• Time and memory complexity (O(mn))
• [Hirschberg 1977]: Computes the linear gap model in
linear memory, with a divide and conquer recursive
approach
• Time complexity (O(mn)), memory complexity (O(m+n))
• [Myers-Miller 1988]: computes the affine gap model in
linear memory, with a modified version of Hirschberg’s
• Time complexity (O(mn)), memory complexity (O(m+n))
Smith-Waterman (SW) Variants
12. Smith-Waterman (SW) and its Variants
Wavefront method
(i,j) depends on (i-1,j), (i-1,j-1) and (i,j-1)
up
left
diag
[Figure: DP matrix antidiagonals d0, d1, d2, ... computed by the wavefront
method; parallelism is minimum at the first antidiagonals and maximum at
the main antidiagonal]
• Each anti-diagonal can be computed in parallel
• m+n-1 antidiagonals
• Non-uniform parallelism
minimum at the beginning (d0)
maximum at the main antidiagonal (d4)
minimum at the end (d7)
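The wavefront order can be sketched as follows; this sequential toy version visits the cells in the order a GPU kernel would parallelize (all cells of one antidiagonal are mutually independent):

```python
def sw_wavefront(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = 2):
    """Smith-Waterman filled antidiagonal by antidiagonal.

    A cell (i, j) with i + j == d depends only on antidiagonals d-1 and
    d-2, so the inner loop below could run in parallel for each d.
    """
    m, n = len(s0), len(s1)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                    # m+n-1 antidiagonals
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            p = ma if s0[i - 1] == s1[j - 1] else mi
            H[i][j] = max(H[i - 1][j - 1] + p, H[i - 1][j] - g,
                          H[i][j - 1] - g, 0)
    return H

H = sw_wavefront("ATACGCTCTT", "ATAGCTA")
print(max(max(row) for row in H))  # same result as row-major order: 4
```

Note the non-uniform parallelism: the inner loop has few iterations for the first and last values of d and the most iterations at the main antidiagonal, exactly as described above.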
14. • Goal: compare huge DNA sequences in GPU with a
combination of Gotoh and Myers-Miller algorithms
– CUDAlign 1.0: similarity score in 1 GPU
– CUDAlign 2.0: score and alignment in 1 GPU
– CUDAlign 2.1: score and alignment in 1 GPU with pruning
– CUDAlign 3.0: similarity score in several GPUs
– MASA-CUDAlign 4.0: score and alignment in several GPUs
• PhD Thesis - Edans F. O. Sandes (Awarded the Best PhD
Thesis in Computer Science in Brazil - 2016)
• Wilkes Award 2019 – Best paper -
The Computer Journal in 2018
MASA-CUDAlign: Goal and Versions
15. (1) Find the best score (GPU)
(2) Partial traceback (GPU)
(3) Split partitions (GPU)
(4) Align partitions (CPU)
(5) Full alignment (CPU)
crosspoint
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
Stages 2 to 5 (Traceback)
16. •The DP Matrix is divided
into grid blocks and a set
of grid blocks compose
an external diagonal.
•Each external diagonal
is composed of B blocks,
where each block is
calculated by T threads.
Each thread computes α rows.
•Each CUDA kernel is
invoked to calculate one
external diagonal.
B=3; T=3; α=2
Size(S0)=36, Size(S1)=36
B1 B2 B3
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
17. • Computation of each block stops at full parallelism
and remaining cells are delegated to the next
invocation.
[Figure: grid blocks G0,0 to G7,2; the first external diagonals are
processed, then external diagonals in the middle of the matrices, then
external diagonals with non-contiguous cells delegation, causing a
small loss of parallelism]
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Parallelogram execution)
18. • The goal of the block pruning optimization is to
eliminate the calculation of blocks of cells that
surely do not belong to the optimal alignment.
• These blocks have such a small score that it is mathematically impossible for them to lead to a score higher than one that has already been produced.
(Fragment from the CUDAlign 2.1 paper: special rows are saved to disk, similarly to FastLSA (Section 3.1), which also allows resuming the computation in case of interruption; the rows flushed to disk are taken at a certain interval, at rows whose indexes are multiples of the block height.)
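The pruning condition described above can be sketched as a simple predicate (a simplified model with illustrative names; the paper's actual bound accounts for the exact distance to the end of the matrix and the scoring parameters):

```python
def can_prune(h_ij, dist_to_end, best_score, match_score=1):
    """Return True if a block can be safely skipped.

    h_ij is the best score inside the block, dist_to_end is the number
    of cells remaining on the best path to the end of the matrix, and
    match_score is the maximum gain per cell. If even a perfect run of
    matches cannot beat best_score, the block cannot contain the
    optimal alignment.
    """
    max_reachable = h_ij + dist_to_end * match_score
    return max_reachable <= best_score

assert can_prune(100, 50, 230)       # 100 + 50 = 150 <= 230: prune
assert not can_prune(200, 50, 230)   # 200 + 50 = 250 > 230: keep
```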
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
Example: Hi,j = 100, maximum possible gain Di = 50, so Hmax(i,j) = 150 < best_score = 230 ⇒ pruning = true
19. (Page fragment from E. Sandes, G. Teodoro, M. Walter, E. Ayguade, X. Martorell. Figure 14: geometrical representation of the pruning area, with boundary cases f2 (j > i) and f3 (j ≤ i) defined over Hi,j.)
For similar sequences, the pruning area is characterized by four lines (f1, f2, f3, f4), forming two polygons that are connected at the end of the alignment.
Gray area: not processed.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
20. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 1.0)
CUDAlign: Using GPU to Accelerate the
Comparison of Megabase Genomic Sequences
Edans Flavius de O. Sandes Alba Cristina M. A. de Melo
University of Brasilia (UnB), Brazil
{edans,albamm}@cic.unb.br
Abstract
Biological sequence comparison is a very important operation in Bioinformatics. Even though there do exist exact methods to compare biological sequences, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Nevertheless, all of them restrict the size of the smallest sequence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUDAlign, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested on two GPU boards, separately. For real sequences whose sizes range from 1 MBP (Megabase Pairs) to 47 MBP, a close to uniform GCUPS (Giga Cells Updates per Second) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on a GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes have been compared with an exact method.
Categories and Subject Descriptors D.1.3 [Program-
ming Techniques]: Concurrent Programming; J.3 [Life
and Medical Sciences]: Biology and Genetics
General Terms Algorithms
Keywords Biological Sequence Comparison, Smith-
Waterman, GPU
1. Introduction
In the last four years, new DNA sequencing technologies have been developed that allow a hundred-fold increase in throughput over the traditional method. This means that the genomic databases, which already have an exponential growth rate, will experience an unprecedented increase in their sizes. Therefore, a huge amount of new DNA sequences will need to be compared, in order to infer functional/structural characteristics. In this scenario, the time spent in each comparison, as well as the accuracy of the result obtained, will be a fundamental factor to determine the success/failure of the next generation genome projects.
(PPoPP'10, January 9–14, 2010, Bangalore, India. Copyright 2010 ACM 978-1-60558-708-0/10/01.)
Sequence comparison is, thus, a very basic and important operation in Bioinformatics. As a result of this step, one or more sequence alignments can be produced [1]. A sequence alignment has a similarity score associated to it that is obtained by placing one sequence above the other, making clear the correspondence between the characters and possibly introducing gaps into them [2]. The most common types of sequence alignment are global and local. To solve a global alignment problem is to find the best match between the entire sequences. On the other hand, local alignment algorithms must find the best match between parts of the sequences.
One important issue to be considered is how gaps are treated. A simple solution assigns a constant penalty for gaps. However, it has been observed that keeping gaps together represents the biological relationships better. Hence, the most widely used model among biologists is the affine gap model [3], where the penalty for opening a gap is higher than the penalty for extending it.
Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept that uses dynamic programming to find local alignments between two sequences of size m and n in O(mn) space and time. In this algorithm, a similarity matrix of size (m + 1) × (n + 1) is calculated. SW is very accurate but it needs a lot of computational resources.
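The SW variant with affine gaps can be illustrated with a small score-only sketch in the Gotoh formulation (a simplified reference implementation under assumed penalty values, not CUDAlign's CUDA kernel):

```python
def sw_affine(s0, s1, match=1, mismatch=-3, g_open=3, g_ext=2):
    """Smith-Waterman with affine gaps (Gotoh): optimal local score.

    O(mn) time; only O(n) memory is kept for this score-only phase,
    which is what makes Megabase comparisons feasible memory-wise.
    Opening a gap costs g_open + g_ext; each extension costs g_ext.
    """
    NEG = float("-inf")
    n = len(s1)
    H_prev = [0] * (n + 1)   # previous DP row
    F = [NEG] * (n + 1)      # vertical gap state, one per column
    best = 0
    for a in s0:
        H_cur = [0] * (n + 1)
        E = NEG              # horizontal gap state within the row
        for j in range(1, n + 1):
            E = max(E - g_ext, H_cur[j - 1] - g_open - g_ext)
            F[j] = max(F[j] - g_ext, H_prev[j] - g_open - g_ext)
            sub = H_prev[j - 1] + (match if a == s1[j - 1] else mismatch)
            H_cur[j] = max(0, sub, E, F[j])  # local: never below zero
            best = max(best, H_cur[j])
        H_prev = H_cur
    return best

assert sw_affine("GATTACA", "GATTACA") == 7   # seven matches
assert sw_affine("ACGT", "TGCA") == 1         # best local hit: one match
```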
In order to reduce execution time, heuristic methods such as BLAST [5] were proposed. These methods combine exact pattern matching with dynamic programming in order to produce good solutions faster. BLAST can align sequences in a very short time, still producing good results. Nevertheless, there is no guarantee that the best result will be produced.
Therefore, many efforts were made to develop methods and techniques that execute the SW algorithm on high performance architectures, allowing the production of exact results in a shorter time. One recent trend in high performance architectures is the Graphics Processing Unit (GPU). In addition to the usual graphics functions, recent GPU architectures are able to execute general purpose algorithms (GPGPU). These GPUs contain elements that execute massive vector operations in a highly parallel way. Because of their TFlops peak performance and their availability in PC desktops, the utilization of GPUs is rapidly increasing in many scientific areas.
Conference: ACM PPoPP 2010
Table 5. Comparison of the real sequences used in the tests; the best local score and its end position are presented.

Sizes          Cells     Accession 1  Accession 2  Score     End position
543K×536K      2.91E+11  NC_003064.2  NC_000914.1  48        (308558, 455134)
1044K×1073K    1.12E+12  CP000051.1   AE002160.2   88353     (1072950, 722725)
3147K×3283K    1.03E+13  BA000035.2   BX927147.1   4226      (2991493, 2689488)
5227K×5229K    2.73E+13  AE016879.1   AE017225.1   5220960   (5227292, 5228663)
7146K×5227K    3.74E+13  NC_005027.1  NC_003997.3  172       (4655867, 5077642)
23012K×24544K  5.65E+14  NT_033779.4  NT_037436.3  9063      (14651731, 11501313)
32799K×46944K  1.54E+15  BA000046.3   NC_000021.7  27206434  (32718231, 46919080)
Table 6. BLAST results.

Comparison     Time   Score
162K×172K      0.4s   18
543K×536K      0.6s   48
1044K×1073K    2.4s   6973
3147K×3283K    6.7s   3888
5227K×5229K    17.4s  36159
7146K×5227K    7.7s   157
23012K×24544K  110s   7085
32799K×46944K  -      -
For the human × chimpanzee chromosome comparison, BLAST finished its execution with a segmentation fault, due to an out-of-memory error.
Conclusion and Future Work
In this paper, we proposed and evaluated CUDAlign, a GPU-accelerated version of Smith-Waterman (SW) that compares two Megabase genomic sequences. Differently from the previous GPU Smith-Waterman (SW) proposals in the literature, our proposal does not impose severe restrictions on the size of the smallest sequence […]
(Figure 10. Runtimes (seconds) × DP matrix size (cells) in logarithmic scale, for the 8600GT (1,968 MCUPS) and GTX280 (20,375 MCUPS) boards. Results show scalability and an almost constant MCUPS ratio for Megabase sequences (cells ≥ 1e+12).)
[…] in order to exploit the characteristics of the GPU memory hierarchy.
We obtained the optimal score of the human × chimpanzee chromosome 21 comparison (32 MBP × 47 MBP) using the Nvidia GTX 280 GPU in 21 hours.
GCUPS (Giga Cells Updated Per Second): 20.3
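The GCUPS figure quoted above follows from simple arithmetic: GCUPS is the number of DP matrix cells divided by the execution time in seconds, divided by 10^9. Using the approximate sequence sizes from the tables:

```python
# GCUPS = (matrix cells) / (seconds) / 1e9, with the values quoted above.
m, n = 32_799_000, 46_944_000   # ~32 MBP and ~47 MBP sequences
seconds = 21 * 3600             # 21 hours of execution
gcups = (m * n) / seconds / 1e9

# Matches the ~20.3 GCUPS reported on the slide.
assert 20.0 < gcups < 21.0
```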
22. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 2.1)
Retrieving Smith-Waterman Alignments with
Optimizations for Megabase Biological
Sequences Using GPU
Edans Flavius de O. Sandes and Alba Cristina M.A. de Melo, Senior Member, IEEE
Abstract—In Genome Projects, biological sequences are aligned thousands of times on a daily basis. The Smith-Waterman algorithm is able to retrieve the optimal local alignment with quadratic time and space complexity. So far, aligning huge sequences, such as whole chromosomes, with the Smith-Waterman algorithm has been regarded as unfeasible, due to huge computing and memory requirements. However, high-performance computing platforms such as GPUs are making it possible to obtain the optimal result for huge sequences in reasonable time. In this paper, we propose and evaluate CUDAlign 2.1, a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations which are able to significantly reduce the amount of data processed, while enforcing full parallelism most of the time. Using the NVIDIA GTX 560 Ti board and comparing real DNA sequences that range from 162 KBP (Thousand Base Pairs) to 59 MBP (Million Base Pairs), we show that CUDAlign 2.1 is scalable. Also, we show that CUDAlign 2.1 is able to produce the optimal alignment between the chimpanzee chromosome 22 (33 MBP) and the human chromosome 21 (47 MBP) in 8.4 hours and the optimal alignment between the chimpanzee chromosome Y (24 MBP) and the human chromosome Y (59 MBP) in 13.1 hours.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
BIOINFORMATICS is an interdisciplinary field that involves computer science, biology, mathematics, and statistics [1]. One of its main goals is to analyze biological sequence data and genome content in order to obtain the function/structure of the sequences as well as evolutionary information.
Once a new biological sequence is discovered, its functional/structural characteristics must be established. The first step to achieve this goal is to compare the new sequence with the sequences that compose genomic databases, in search of similarities. This comparison is made thousands of times on a daily basis, all over the world. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics. As output, a sequence comparison operation produces similarity scores and alignments. The score is a measure of similarity between the sequences and the alignment highlights the similarities/differences between the sequences. Both are very useful and often are used as building blocks for more complex problems such as multiple sequence alignment and secondary structure prediction.
Smith and Waterman (SW) [2] proposed an exact algorithm that retrieves the optimal score and local alignment between two sequences. It is based on Dynamic Programming (DP) and has time and space complexity O(mn), where m and n are the sizes of the sequences. In SW, a linear gap function was used. Nevertheless, in nature, gaps tend to occur together. For this reason, the affine gap model is often used, where the penalty for opening a gap is higher than the penalty for extending it. Gotoh [3] modified the SW algorithm to include affine gap penalties.
One of the most restrictive characteristics of SW and its variants is the quadratic space needed to store the DP matrices. For instance, in order to compare two 33 MBP (Million Base Pairs) sequences, we would need at least 4.3 PB of memory. This fact was observed by Hirschberg [4], who proposed a linear space algorithm to compute the Longest Common Subsequence (LCS). Hirschberg's algorithm was later modified by Myers and Miller (MM) [5] to compute global alignments in linear space.
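A back-of-the-envelope check of the 4.3 PB figure, assuming one 32-bit score per DP cell, also shows why the Hirschberg/Myers-Miller linear-space approach is essential here:

```python
# Quadratic vs. linear space for a 33 MBP x 33 MBP comparison,
# assuming (as an illustration) a 4-byte score per DP cell.
m = n = 33_000_000
bytes_per_cell = 4

quadratic = m * n * bytes_per_cell   # storing the whole DP matrix
linear = (m + n) * bytes_per_cell    # keeping only O(m + n) cells

assert quadratic > 4.3e15            # more than 4.3 petabytes
assert linear < 300e6                # a few hundred megabytes
```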
Another restrictive characteristic of the SW algorithm is that it is usually slow due to its quadratic time complexity. In order to accelerate the comparison between long sequences, heuristic tools such as LASTZ [6] and MUMMER [7] were created. They use seeds (LASTZ) and suffix trees (MUMMER) to scan the sequences, providing a big picture of the main differences/similarities between them. On the other hand, Smith-Waterman provides the optimal local alignment, where the regions of differences/similarities are much more accurate, as well as the gapped regions that represent inclusion/deletion of bases. Therefore, we claim that both kinds of tools should be used in a complementary way: first, MUMMER or LASTZ would be executed and […]
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 24, NO. 5, MAY 2013 1009
. E.F. de O. Sandes is with the Department of Computer Science, University
of Brasilia, Campus Darcy Ribeiro, PO Box 4466, Asa Norte, Brasilia-DF
CEP 70910-900, Brazil. E-mail: edans@cic.unb.br.
. A.C.M.A. de Melo is with the Department of Computer Science, University
of Brasilia, Campus Darcy Ribeiro, PO Box 4466, Asa Norte, Brasilia-DF
CEP 70910-900, Brazil. E-mail: albamm@cic.unb.br.
Manuscript received 16 Nov. 2011; revised 29 Apr. 2012; accepted 4 June
2012; published online 22 June 2012.
Recommended for acceptance by S. Aluru.
Digital Object Identifier no. 10.1109/TPDS.2012.194.
Journal: IEEE Transactions on Parallel
and Distributed Systems, 2013
[…] stages. The first stage processes the full DP matrix as in [27], but some special rows are saved in an area called the Special Rows Area and some blocks are pruned. The second stage processes the DP matrix in the reverse direction starting from the endpoint of the optimal alignment and also saves special columns on disk. Using an optimization called orthogonal execution, the area calculated in Stage 2 is reduced. Stage 3 increases the number of crosspoints with an execution similar to Stage 2 but in the forward direction. Stage 4 uses the MM algorithm with orthogonal execution to decrease the size of the partitions. As soon as all the partitions are smaller than the maximum partition size, Stage 5 finds the alignment of each partition and concatenates the results into the full alignment. Stage 6 is optional and presents the full alignment in textual or graphical representation.
[…] memory space. Using an SRA of 50 GB, the full alignment of these genomic sequences was obtained in 8 hours and 26 minutes, where 99.1 percent of this time was spent in the GPU stages. CUDAlign 2.1 obtained a maximum speedup of 41.64× when compared to the Z-align cluster solution with 64 cores.
As future work, we intend to further optimize the stages of the algorithm. In Stage 3, the parallelism is currently exploited intensively inside each partition; in future works many partitions may also be processed in parallel, reducing the execution time of Stage 4. We also intend to implement the block pruning optimization in Stages 2 and 3. We will also extend the tests to evaluate more powerful GPUs, including systems with dual cards and from other vendors. Finally, we will investigate the possibility of solving the multiple sequence alignment problem with the optimizations proposed in this paper.
REFERENCES
[1] D.W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2004.
[2] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, no. 1, pp. 195-197, Mar. 1981.
[3] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[4] D.S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Comm. ACM, vol. 18, no. 6, pp. 341-343, 1975.
[5] E.W. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.
[6] R.S. Harris, "Improved Pairwise Alignment of Genomic DNA," PhD thesis, The Pennsylvania State Univ., 2007.
[7] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile and Open Software for Comparing Large Genomes," Genome Biology, vol. 5, no. 2, 2004.
[8] S. Aluru, N. Futamura, and K. Mehrotra, "Parallel Biological Sequence Comparison Using Prefix Computations," J. Parallel and Distributed Computing, vol. 63, no. 3, pp. 264-272, 2003.
[9] S. Rajko and S. Aluru, "Space and Time Optimal Parallel Sequence Alignments," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 12, pp. 1070-1081, Dec. 2004.
[10] R.B. Batista, A. Boukerche, and A.C.M.A. de Melo, "A Parallel Strategy for Biological Sequence Alignment in Restricted Memory Space," J. Parallel and Distributed Computing, vol. 68, no. 4, 2008.
(Fig. 13. Plot of some alignments with pruned blocks in gray.)
We obtained the optimal alignment of the human × chimpanzee chromosome 21 comparison (32 MBP × 47 MBP) using the Nvidia GTX 560 Ti GPU in 8 hours. GCUPS: 52.85
24. (Figures from the TPDS paper: Figure 6 shows the columns distributions for 4 GPUs; Figure 8 shows the multi-GPU buffers between 4 GPUs, where each output-input pair of buffers continually transfers border cells between neighbor GPUs.)
25. (Figure 5. Multi-GPU threads chaining.)
Communication uses sockets and I/O threads.
Overlap between computation and communication: 8M buffer.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stage 1 – Compute the DP matrix
Multi-GPU wavefront
26. • Main challenge: how to parallelize a stage that is inherently sequential?
– Speculation
• Incremental Speculative Traceback (IST): each GPU assumes that its local maximum is also the global maximum.
(Figure: columns distributions for 4 GPUs, with the optimal and the speculated traceback crosspoints marked.)
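The speculation idea can be sketched with a toy model (all names are illustrative; the `traceback` callable stands in for the real per-GPU traceback stage): each GPU runs the traceback from its guessed crosspoint while waiting, and only redoes the work if the true crosspoint that eventually arrives differs from the guess.

```python
def ist(gpus_best_guess, true_crosspoints, traceback):
    """Toy model of Incremental Speculative Traceback.

    For each GPU, a speculative traceback is computed from its guessed
    crosspoint during otherwise idle time. On a speculation hit the
    result is reused; on a miss the traceback is redone from the real
    crosspoint. Returns (results, number_of_redone_tracebacks).
    """
    redone = 0
    results = []
    for guess, real in zip(gpus_best_guess, true_crosspoints):
        speculative = traceback(guess)      # done while waiting
        if guess == real:
            results.append(speculative)     # hit: free parallelism
        else:
            results.append(traceback(real)) # miss: pay the cost again
            redone += 1
    return results, redone

# Two of three GPUs guessed correctly; only one traceback is redone.
results, redone = ist([5, 9, 3], [5, 9, 4], traceback=lambda p: p * 10)
assert results == [50, 90, 40] and redone == 1
```

The papers above report speculation hit ratios as high as 98.2 percent, which is why this simple scheme recovers most of the lost parallelism.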
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stages 2 to 5 - Traceback
27. (Figure 9. Traceback timelines: (a) Pipelined Traceback (PT), without speculation; (b) Incremental Speculative Traceback (IST), with speculation.)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Incremental Speculative Traceback (IST)
28. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 3.0)
Fine-Grain Parallel Megabase Sequence
Comparison with Multiple Heterogeneous GPUs
Edans F. de O. Sandes
University of Brasilia
edans@cic.unb.br
Guillermo Miranda
Barcelona Supercomputing Center
guillermo.miranda@bsc.es
Alba C. M. A. Melo
University of Brasilia
alba@cic.unb.br
Xavier Martorell
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
xavier.martorell@bsc.es
Eduard Ayguadé
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
eduard.ayguade@bsc.es
Abstract
This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (Billion of cells processed per second) with 3 heterogeneous GPUs.
Categories and Subject Descriptors D.1.3 [Programming
Techniques]: Concurrent Programming; J.3 [Life and Med-
ical Sciences]: Biology and Genetics
Keywords GPU; Biological Sequence Comparison; Smith-
Waterman;
1. Introduction
Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept, that uses dynamic programming to find local alignments between two sequences. SW is very accurate but it needs a lot of computational resources. GPUs (Graphics Processing Units) have been considered to accelerate SW, but very few GPU strategies [1, 3] allow the comparison of Megabase sequences longer than 10 Million Base Pairs (MBP). SW# [1] uses 2 GPUs to execute a Myers-Miller [2] linear space variant of SW. CUDAlign [3] uses a single GPU to execute a combined strategy with SW and Myers-Miller. When compared to SW# (1 GPU), CUDAlign (1 GPU) presents better execution times for huge sequences [1].
(PPoPP '14, February 15–19, 2014, Orlando, Florida, USA. http://dx.doi.org/10.1145/2555243.2555280)
In this work, we modified the most computationally intensive stage of CUDAlign, parallelizing the computation of a single huge DP matrix among heterogeneous GPUs in a fine-grained way. In the proposed strategy, GPUs are logically arranged in a linear way so that each GPU calculates a subset of columns of the SW matrix, sending the border column elements to the next GPU. Experimental results collected in 2 different environments show performance of up to 140 GCUPS (Billion of cells processed per second) using 3 heterogeneous GPUs. With this performance, we are able to compare real megabase sequences in reasonable time.
2. Proposed Multi-GPU Strategy
We modified the first stage of CUDAlign [3] to parallelize the computation of a single huge DP matrix among many heterogeneous GPUs. The parallelization is done using a multi-GPU wavefront method, where the GPUs are logically arranged in a linear way, i.e., the first GPU is connected to the second, the second to the third, and so on. Each GPU computes a range of columns of the DP matrix and the GPUs transfer the cells of their last column to the next GPU. In a scenario composed of heterogeneous GPUs, assigning the same number of columns to all GPUs is not a good choice. In this case, the slowest GPU would determine the processing rate of the whole wavefront. To avoid this, we statically distribute the columns proportionally to the computational power of each GPU. This distribution can be obtained from sequence comparison benchmarks that determine each GPU […]
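The proportional static distribution can be sketched as follows (a simplified model; the function name and the benchmark numbers are illustrative):

```python
def distribute_columns(total_cols, gcups_per_gpu):
    """Split DP-matrix columns proportionally to each GPU's power.

    gcups_per_gpu holds per-GPU benchmark results (e.g., GCUPS); a
    faster GPU receives a proportionally wider range of columns, so
    the slowest GPU does not throttle the whole wavefront. The
    rounding remainder is given to the last GPU.
    """
    total_power = sum(gcups_per_gpu)
    cols = [int(total_cols * p / total_power) for p in gcups_per_gpu]
    cols[-1] += total_cols - sum(cols)
    return cols

# e.g., one GTX 580 chained with two faster GTX 680s:
shares = distribute_columns(1_000_000, [98, 110, 110])
assert sum(shares) == 1_000_000
assert shares[0] < shares[1] <= shares[2]
```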
Conference:
ACM PPoPP 2014
GTX580+GTX680+GTX680
Column distribution: 30.71% + 34.64% + 34.63%
(Figure 1. Columns distributions for 4 GPUs.)

Table 1. Sequences used in the tests.

Chr.   Human accession  Size  Chimpanzee accession  Size  Score
chr19  NC_000019.9      59M   NC_006486.3           64M   17297608
chr20  NC_000020.10     63M   NC_006487.3           62M   40050427
chr21  NC_000021.8      48M   NC_006488.2           46M   36006054
chr22  NC_000022.10     51M   NC_006489.3           50M   31510791
We obtained the optimal score of the human × chimpanzee chromosome 21 comparison (46 MBP × 47 MBP) using 3 GPUs (GTX580 + 2×GTX680) in 6 hours and 28 minutes. GCUPS: 139.63
30. Journal: IEEE Transactions on Parallel and Distributed Systems, 2016
Using 384 GPUs, we obtained the optimal alignment of chromosome 21 in a few minutes, and the optimal alignment of the human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) in 53 minutes: 10372.56 GCUPS.
CUDAlign 4.0: Incremental Speculative
Traceback for Exact Chromosome-Wide
Alignment in GPU Clusters
Edans Flavius de Oliveira Sandes, Guillermo Miranda, Xavier Martorell, Eduard Ayguade,
George Teodoro, and Alba Cristina Magalhaes Melo, Senior Member, IEEE
Abstract—This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the application scalability. In order to obtain a highly parallel multi-GPU traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and Chimpanzee homologous chromosomes, whose sizes range from 26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such a comparison was made with the SW exact method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS (Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
IN comparative genomics, biologists compare the sequences that represent organisms in order to infer functional/structural properties. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics [1], usually solved using heuristic methods due to the excessive computation times of the exact methods.
Smith-Waterman (SW) [2] is an exact algorithm to compute pairwise local comparisons. It is based on Dynamic Programming (DP) and has quadratic time and space complexities. The SW algorithm is divided in two phases, where the first phase is responsible for calculating a DP matrix in order to obtain the optimal score and the second phase (traceback) obtains the optimal alignment. SW is usually executed to compare (a) two DNA sequences or (b) a protein sequence (query sequence) to a genomic database. In the first case, a single SW matrix is calculated and all the Processing Elements (PEs) cooperate in this calculation, communicating to exchange border elements (fine-grained computation). For Megabase DNA sequences, a huge DP matrix with several Petabytes is computed. In the second case, multiple small SW matrices are calculated, usually without communication between the PEs (coarse-grained computation). With the current genomic databases, often hundreds of thousands of SW matrices are calculated in a single query × database comparison.
In the last decades, SW approaches for both cases have been parallelized in the literature, using multiprocessors/multicores [3], [4], Cell Broadband Engines (CellBEs) [5], Field Programmable Gate Arrays (FPGAs) [6], Application Specific Integrated Circuits (ASICs) [7], Intel Xeon Phis [8] and Graphics Processing Units (GPUs) [9], [10], [11], [12]. The SW algorithm is widely used by biologists to compare sequences in many practical applications, such as identification of orthologs [13] and virus integration detection [14]. In this last application, an FPGA-based platform [6] was used to compute millions of SW alignments with small query sequences in a short time.
Nowadays, executing SW comparisons with Megabase sequences is still considered unfeasible by most researchers, which currently limits its practical use. We claim that important bioinformatics applications such as whole genome alignment (WGA) [15] could benefit from exact pairwise comparisons of long DNA sequences. WGA applications often construct global genome alignments by using local alignments as building blocks [16], [17]. In [18], the authors state that SW local alignments would be the best choice in this case. However, in order to compare 1 MBP × 1 MBP sequences, the SW tool took more than five days, preventing its use.
E. Sandes, G. Teodoro, and A. Melo are with the Department of Computer Science, University of Brasília, Brasília, DF, Brazil. E-mail: {edans, teodoro, albamm}@cic.unb.br.
G. Miranda, X. Martorell, and E. Ayguade are with the Barcelona Supercomputing Center, Barcelona, Spain. E-mail: {guillermo.miranda, xavier.martorell, eduard.ayguade}@bsc.es.
Manuscript received 29 Dec. 2014; revised 10 Dec. 2015; accepted 1 Jan. 2016. Date of publication 7 Jan. 2016; date of current version 14 Sept. 2016.
Digital Object Identifier no. 10.1109/TPDS.2016.2515597
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016
[…] spent in Stage 1 and the remaining stages (Traceback). As shown, the speedups attained with 128 nodes for chr22 and chr16 were, respectively, 26.9× and 29.7× (21.0 and 23.2 percent of parallel efficiency).
The breakdown of the total execution shows that Stage 1 of CUDAlign has a much better scalability. Stage 1 attained speedups of 84.0× and 97.3× with 128 nodes (65.6 and 76.0 percent of parallel efficiency), resulting in a peak performance of 8.3 and 9.7 TCUPS for chr22 and chr16, respectively. Stage 1 results of chr22 and chr16 are consistent with the ones obtained in CUDAlign 3.0 [12]. The PT traceback phase, on the other hand, was not able to efficiently […]; the time share of this phase increased from about 4 to 71 percent as the number of nodes used was scaled from 1 to 128. This negative impact of the traceback on the whole application performance is highly reduced when IST is used, as shown in Section 6.3.
6.3 Impact of Incremental Speculative Traceback
The experimental evaluation of the impact of IST on the performance was carried out using five pairs of homologous chromosomes: chr22, chr16, chr13, chr8, and chr5. These sequences were selected intending to provide a wide range of variation in the DP matrix size calculated (2.55, 8.13, 13.26, 21.07, 33.04 Peta cells, respectively).
Fig. 10. Alignment plots between human and chimpanzee homologous chromosomes.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 4.0)
34. The programmer chooses the type of block pruning and
the parallelization strategy
The programmer needs to code the recurrence relations
[…] must be implemented in the specific language and linked together to create a new aligner extension. Some aligners were presented in the work that presented MASA, executing on different hardware such as GPUs, multicore CPUs and Intel Phi co-processors. The code of MASA is divided in modules, according to the features: platform-independent functions (like data management and statistical procedures) and platform-dependent functions (like the parallel processing of the DP matrix and the BP module, implemented considering the target platform). The integration of these modules can be observed in Figure 2.
(FIGURE 2. MASA architecture.)
For the parallelization strategy, two approaches are suggested: the diagonal method, allowing the parallel processing of cells in the same diagonal, and the dataflow method, where the propagation is generic among nodes that represent blocks of cells. Similarly, the block pruning can be implemented using diagonal or generic execution approaches, avoiding unnecessary calculations. In order to create a specific […]
MASA Architecture
35. (Figure 5: Columns distributions for 4 GPUs. The pruning area, shown in blue, is not computed, causing heavy load imbalance.)
• Challenge: execute MASA-CUDAlign in a multi-GPU platform with block pruning.
MASA-CUDAlign – Multi-GPU with Pruning
The pruning area is obtained during the execution.
36. • The GPUs exchange their local best
results periodically.
In order to execute the sequence alignment with BP in
multiple GPUs, each one will compute a subset of columns
of the DP matrix, i.e., the sequence placed horizontally (S1
in Fig. 4) is split according to a defined static partition. Thus,
each GPU compares a part of this sequence with the entire
sequence placed vertically (S0 in Fig. 4).
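The static column split described above can be sketched as follows (illustrative Python, not the tool's code); the weights allow an uneven split for heterogeneous GPUs, as discussed later in the paper:

```python
# Sketch of the static partition of S1's columns among GPUs.
# Each GPU g receives a contiguous range of columns [start, end) and
# compares that slice of S1 against the whole vertical sequence S0.
# Weights model heterogeneous GPUs (a faster GPU gets a larger slice).

def partition_columns(n_cols, weights):
    total = sum(weights)
    bounds, start, acc = [], 0, 0
    for w in weights:
        acc += w
        end = round(n_cols * acc / total)
        bounds.append((start, end))
        start = end
    return bounds

# Homogeneous case: 4 equal GPUs over 100 columns.
print(partition_columns(100, [1, 1, 1, 1]))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
# Heterogeneous case: one GPU twice as fast as the other.
print(partition_columns(90, [2, 1]))         # [(0, 60), (60, 90)]
```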
Multi-GPU with Pruning
Score sharing
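The interplay between the periodically shared best score and block pruning can be sketched as follows (illustrative Python; the match reward and the bound are simplified relative to the actual BP formulation). A block is prunable when even a perfect match from its position onward cannot beat the best score known so far, so a higher shared best score enlarges every GPU's pruning area.

```python
# Sketch of block pruning with a shared best score.
# A block at (i, j) with current maximum score `block_max` is prunable
# if block_max plus the best possible score of the remaining cells
# (perfect matches along the shortest remaining dimension) still
# cannot reach the best score already found anywhere in the matrix.

MATCH = 1  # simplified match reward

def prunable(block_max, i, j, rows, cols, shared_best):
    remaining = min(rows - i, cols - j)
    return block_max + remaining * MATCH <= shared_best

# Each GPU keeps a local best and periodically merges it with the
# scores received from the other GPUs:
def merge_best(local_best, received):
    return max(local_best, *received)

best = merge_best(50, [80, 65])  # best == 80
# A block far into the matrix with a low score can now be skipped:
print(prunable(10, 900, 950, 1000, 1000, best))  # True: 10 + 50 <= 80
```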
37. Multi-GPU with Pruning
Score sharing - Publication
[Plot: GCUPS per comparison id (1M, 3M, 5M, 7M, 10M, 23M, 47M, Ch19, Ch20, Ch21, Ch22), for 2*P100 and 4*P100, each with and without BP.]
Fig. 6: Multi-BP results in the Comet environment (2 and 4 P100 GPUs).
As can be observed, the speedup varied from 1.60x to 1.92x
with two GPUs, and 2.70x to 3.72x with four GPUs.
TABLE VII: Execution time (in hours) and speedup of Multi-BP on P100 GPUs (columns: Id, 1*P100, 2*P100, 4*P100; linear speedup reference: 1.00x, 2.00x, 4.00x).
Conference: Euromicro PDP 2020
Parallel Comparison of Huge DNA Sequences in
Multiple GPUs with Block Pruning
Marco Figueiredo Jr.
Univ. of Brasilia
160063027@aluno.unb.br marcoacf@sarah.br
Edans Sandes
Univ. of Brasilia
edans.sandes@gmail.com
George Teodoro
Univ. Fed. de Minas Gerais
george@dcc.ufmg.br
Alba C. M. A. Melo
Univ. of Brasilia
alves@unb.br
Abstract—Sequence comparison is a task performed in several
Bioinformatics applications daily all over the world. Algorithms
that retrieve the optimal result have quadratic time complexity,
requiring a huge amount of computing power when the sequences
compared are long. In order to reduce the execution time, many
parallel solutions have been proposed in the literature. Neverthe-
less, depending on the sizes of the sequences, even those parallel
solutions take hours or days to complete. Pruning techniques can
significantly improve the performance of the parallel solutions
and a few approaches have been proposed to provide pruning
capabilities for sequence comparison applications. This paper
proposes and evaluates a variant of the block pruning approach
that runs in multiple GPUs, in homogeneous or heterogeneous en-
vironments. Experimental results obtained with DNA sequences
in two testbeds show that significant performance gains are
obtained with pruning, compared to its non-pruning counterpart,
achieving the impressive performance of 694.8 GCUPS (Billions
of Cells Updated per Second) for four GPUs.
Index Terms—bioinformatics, DNA alignment, GPU, pruning
I. INTRODUCTION
Bioinformatics produces solutions that are used by various
fields of study, such as medicine and biology [1]. Biological
sequence comparison operations are executed several times
daily all over the world, either in stand-alone mode or in-
corporated into Bioinformatics applications to solve complex
problems such as evolutionary relationship determination and
drug design. Due to their quadratic time complexity, sequence
comparison algorithms that retrieve the optimal result can
take a lot of time. In order to reduce the execution time of
such algorithms, parallel solutions have been proposed in the
literature over the last decades.
The type of parallelism provided by Graphics Processing
Units (GPUs) makes these devices a very good alternative
to run sequence comparisons [2] [3]. CUDAlign 4.0 [3] is
a state-of-the-art tool that compares huge DNA sequences
in multiple GPUs and obtains the optimal result, combining
the Gotoh [4] and the Myers-Miller [5] algorithms. Using
384 GPUs, it was able to compare the homologous human x
chimpanzee chromosomes 5 (180 Million Base Pairs – MBP
– each) in 53 minutes, computing a matrix of 33.04 Petacells
at 10.37 TCUPS (Trillions of Cells Updated per Second). In
an earlier version for one GPU (CUDAlign 2.1 [6]), the block
pruning (BP) strategy was proposed to avoid the computation
of parts of the matrix that surely will not lead to the optimal
solution, with good results for one GPU. Further versions of
CUDAlign present pruning capabilities only for single GPU
executions. SW# [7] implemented the original MM algorithm
and extended the block pruning strategy [6] to be used in two
GPUs, but the performance was just a little better than the
execution of CUDAlign in one GPU [7]. As far as we know,
there is no work in the literature that obtains the optimal result
with pruning using more than two GPUs. Other works use
CPUs [8], FPGAs [9] or hybrid environments [10], but they
are outside the scope of this paper.
This paper proposes and evaluates Multi-BP, an adaptation
of block pruning for multiple GPUs. It is based on static
distribution and dynamic sharing of pruning information, lead-
ing to considerable performance gains in medium-sized GPU
environments. Multi-BP combines the multi-GPU CUDAlign
version [3] and the pruning technique proposed in [6]. The
challenges to design Multi-BP were the following: (a) ensure
that Multi-BP will not affect the performance in single GPU
executions; (b) adapt the calculation of the index of each GPU
block of cells and the evaluation of the pruning window to a
multiple GPU environment; (c) disseminate the pruning infor-
mation obtained by each GPU to all others with low overhead;
and (d) adjust the pruning technique to the heterogeneous GPU
environments, considering that the DP matrix might not be
partitioned evenly among the GPUs.
Experimental results obtained with real DNA sequences
with sizes varying from 1 to 63 MBP in two computing envi-
ronments show that very good gains were attained with Multi-
BP. The execution time of the comparison of chromosome 20
(human x chimpanzee) in a heterogeneous environment
(GTX 980 Ti + GTX 680) was reduced from 8h17min (without
Multi-BP) to 4h55min (with Multi-BP).
The remainder of this paper is organized as follows. In
Section II we present the pairwise sequence alignment problem
and in Section III we discuss pruning approaches and the block
pruning technique. Section IV discusses solutions that execute
biological sequence comparisons in multiple GPUs. Section V
describes the design of Multi-BP and Section VI details the
experiments. Finally, Section VII concludes the paper.
II. PAIRWISE BIOLOGICAL SEQUENCE COMPARISON
The field of Bioinformatics [1] demands continuous pro-
cessing improvements. Due to the huge volume of data and
performance requirements, new parallel algorithms and tools
are proposed regularly, aiming to provide faster executions.
In particular, the alignment of biological sequences (proteins, RNA or DNA) is one of the most relevant operations in this field.
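For reference, the optimal local alignment score underlying all the comparisons in this talk can be computed with the classic Smith-Waterman recurrence. A minimal sketch follows (plain Python with a linear gap penalty, whereas CUDAlign uses Gotoh's affine-gap variant and Myers-Miller linear space):

```python
# Minimal Smith-Waterman: optimal local alignment score in O(m*n)
# time and O(n) space. Linear gap penalty for brevity; CUDAlign
# combines Gotoh (affine gaps) with Myers-Miller (linear space),
# which this sketch omits.

def smith_waterman(s0, s1, match=3, mismatch=-3, gap=-2):
    cols = len(s1) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(s0) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s0[i - 1] == s1[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # 13 (alignment GTTGAC / GTT-AC)
```

The quadratic cell count of this recurrence is exactly why chromosome-wide comparisons reach Peta-cell matrices and need GPUs and pruning.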
We obtained the optimal score of the chromosome 21 human x chimpanzee comparison (46 MBP x 47 MBP) using 4 NVidia Pascal GPUs in 56 minutes (680.81 GCUPS).
38. • Challenge: adapt the workload to a dynamic
pruning scenario.
– Execution is paused at some points: overhead
– Use in cases where the load balancing benefits are
higher than the overhead
– Sequences which are not very similar do not have a
big pruning area.
MASA-CUDAlign: Goal and Versions
Multi-GPU with Pruning
Load Balancing – ongoing work
39. [Figure: cyclic column distribution among GPU1–GPU4, shown without and with breakpoints.]
Multi-GPU with Pruning
Score sharing + cyclic + load balancing
41. Multi-GPU with Pruning
Score sharing + cyclic + load balancing
• Best result in the literature for GPUs:
– 10.3 TCUPS with 384 NVidia M2090 GPUs + Intel CPU
• Result obtained in the platform:
– 2.7 TCUPS with 8 NVidia Volta GPUs + Power9 CPU
• We estimate that we are able to beat the best result
for GPUs (10.3 TCUPS) with 40 NVidia V100 and
the best theoretical result (53 TCUPS) with 256
NVidia V100
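The estimate above can be reproduced with back-of-the-envelope arithmetic (a sketch assuming linear scaling; real multi-GPU scaling is sublinear, which is presumably why the slide quotes more GPUs than the naive projection below):

```python
import math

# Back-of-the-envelope projection from the measured 2.7 TCUPS on 8 V100s.
measured_tcups, measured_gpus = 2.7, 8
per_gpu = measured_tcups / measured_gpus  # ~0.34 TCUPS per V100

def gpus_needed(target_tcups):
    """GPUs required under (optimistic) linear scaling."""
    return math.ceil(target_tcups / per_gpu)

print(gpus_needed(10.3))  # 31 under linear scaling; the slide estimates 40
print(gpus_needed(53.0))  # 158 under linear scaling; the slide estimates 256
```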
43. • We compared MASA-OpenMP (CPU) running
in the IBM Power9, Intel i7 and Intel Xeon
platforms for the 1M x 1M comparison.
• Intel i7 (4 cores - Skylake)
– GCUPS: 1.1, time: 16 minutes (962.9 seconds)
• Intel Xeon (24 cores – Haswell)
– GCUPS: 4.5, time: 4 minutes (247.16 seconds)
• IBM Power (22 cores – Power9)
– GCUPS: 6.1, time: 3 minutes (181.4 seconds)
MASA in CPUs (MASA-OpenMP)
MASA-OpenMP – CPU Comparison
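The GCUPS figures above follow directly from the definition: GCUPS = (|S0| * |S1|) / (time in seconds * 10^9). A quick check, assuming exactly 1M x 1M cells (the computed values land slightly below the quoted 1.1 / 4.5 / 6.1, which suggests the actual sequences are a bit longer than 1M):

```python
# GCUPS = (cells in the DP matrix) / (execution time in seconds) / 1e9.
def gcups(len0, len1, seconds):
    return len0 * len1 / seconds / 1e9

# Assuming exactly 1M x 1M (= 1e12 cells):
print(round(gcups(1_000_000, 1_000_000, 962.9), 2))   # 1.04  (i7, quoted 1.1)
print(round(gcups(1_000_000, 1_000_000, 247.16), 2))  # 4.05  (Xeon, quoted 4.5)
print(round(gcups(1_000_000, 1_000_000, 181.4), 2))   # 5.51  (Power9, quoted 6.1)
```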
44. • We used MASA-OpenMP (CPU) for
these comparisons since the SARS-CoV-2
sequences are short (about 30 thousand
characters)
• We first compared SARS-CoV-2 sequences
from China, Brazil, USA, India and
Japan
– Conclusion: very similar sequences
• We then compared SARS-CoV-2
sequences from Brazil to MERS and SARS:
– Even though the sequences are quite
similar, there are regions of interest
MASA in CPUs (MASA-OpenMP)
Ongoing COVID-19 study – IBM Power9
46. Thanks to…
• My former PhD student, Edans F. O. Sandes, my PhD
student Marco Figueiredo Jr. and my undergrad student
Bernardo Nascimento
• George Teodoro, UFMG, Brazil
• Maria Emilia Walter, University of Brasilia, Brazil
• Eduard Ayguade, Xavier Martorell and Guillermo
Miranda, Universitat Politecnica de Catalunya and
Barcelona Supercomputing Center
• And Azzedine Boukerche, University of Ottawa, Manuel
Ujaldon, University of Malaga, Samuel Thibault, University
of Bordeaux, Genaina Rodrigues, University of Brasilia,
Celia Ralha, University of Brasilia, a couple of MSc students
and many undergrad students
48. The MASA code, including MASA-CUDAlign and
MASA-OpenMP, is available at
https://github.com/edanssandes/MASA-Core
MASA code
The MASA code (GPU, CPU, Intel Phi) was used in the following institutions:
Brazil – University of Brasilia, Fed Univ Rio Grande do Sul, NVidia Brazil, NEC Brazil
Croatia – University of Zagreb
France – University of Bordeaux
India - Manonmaniam Sundaranar University
Japan – NEC Japan
Singapore – Agency for Science Technology and Research
Spain – Polytechnic University of Catalonia and University of Malaga
USA – University of Delaware and IBM USA
We are open to collaborations!