Alba Cristina Magalhaes Alves de Melo
Full Professor at the University of Brasilia (UnB)
CNPq/Brazil Research Fellow level 1C
IEEE Senior Member
Parallel Biological Sequence
Comparison in GPU Platforms:
Research at the University of Brasilia
OpenPower Webinar – June 19th, 2020
Image: https://www.compoundchem.com/2015/03/24/dna/
• Biological Sequences are obtained with sequencing
machines, using chromatography analysis.
(Figure: a DNA sequence)
Introduction
• Once a biological sequence is obtained, its
properties/characteristics need to be
established.
• This is mainly done by comparing the new
sequence with sequences that have already
been catalogued.
• The comparison of biological sequences is
one of the most important operations in
Bioinformatics.
– Its goal is to define how similar the sequences
are.
Introduction
• There are exact algorithms that
compare two sequences and produce
the optimal result. They have quadratic
time and memory complexity - O(mn),
where m and n are the sizes of the
sequences.
• Heuristic methods are faster and are
used in many genome projects.
– However, they do not guarantee that the
optimal result will be produced.
Introduction
• Smith and Waterman proposed an exact
algorithm (SW) based on dynamic programming
to locally align two sequences in quadratic
time and memory.
– It produces the optimal result.
– High execution times and huge memory
requirements.
• To compare the human chromosome 1
with the chimpanzee chromosome 1 (249
Millions of Base Pairs - MBP x 228 MBP),
at least 240 Petabytes of memory are
needed.
➢ This SW comparison was considered
unfeasible in 2008.
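As a quick back-of-the-envelope check of that figure (our arithmetic, assuming on the order of 4 bytes per DP cell):

$249 \times 10^6 \times 228 \times 10^6 \approx 5.7 \times 10^{16}$ cells, and $5.7 \times 10^{16} \times 4\,\mathrm{B} \approx 2.3 \times 10^{17}\,\mathrm{B} \approx 230$ PB,

which is on the order of the 240 Petabytes quoted above.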
Introduction
• Present and discuss our MASA-CUDAlign
strategy to compare huge chromosomes in
GPUs
– We use a highly optimized algorithm to exploit
parallelism
– We use speculative techniques to accelerate the
sequential part of the algorithm
– We were able to use up to 384 NVidia M2090 GPUs to
compare the human and chimpanzee homologous
chromosomes 1 in 2016
– We will show preliminary results of the next version of
our tool on the IBM platform with 8 NVidia Volta GPUs
• Present and discuss our MASA-OpenMP results on
Power9 for smaller sequences
– We will show comparative results: Power vs. Intel
– We will present preliminary COVID-19 results
Goal of this talk
• Biological sequence comparison
• MASA-CUDAlign (one GPU)
• MASA-CUDAlign (multiple GPUs)
• MASA-CUDAlign with pruning
• MASA-OpenMP CPU studies
Agenda
OpenPower Webinar – June 19th, 2020
• To compare two sequences, one sequence is
placed above the other and a score is
computed.
S0: G A C G G A T T A G (10 BP)
S1: G A T C G G A A T A G (11 BP)

alignment (11 characters, Base Pairs - BP):

S0: G  A  -  C  G  G  A  T  T  A  G
S1: G  A  T  C  G  G  A  A  T  A  G
    +1 +1 -2 +1 +1 +1 +1 -1 +1 +1 +1
    (match = +1, mismatch = -1, gap = -2)

score: +6
Biological Sequence Comparison
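As an illustration (ours, not from the talk), a minimal function that scores a given gapped alignment with the scheme above (match +1, mismatch -1, gap -2):

#include <cstdio>
#include <cstring>

// Score a gapped alignment column by column: gap -2, match +1, mismatch -1.
int alignment_score(const char* a, const char* b) {
    int s = 0;
    for (size_t k = 0; k < strlen(a); k++) {
        if (a[k] == '-' || b[k] == '-') s -= 2;   // gap
        else if (a[k] == b[k])          s += 1;   // match
        else                            s -= 1;   // mismatch
    }
    return s;
}

int main() {
    // the 11-column alignment from the slide; prints 6
    printf("%d\n", alignment_score("GA-CGGATTAG", "GATCGGAATAG"));
    return 0;
}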
• Based on dynamic programming, with quadratic time and
memory complexity (O(mn)).
• Executes in two steps:
• (1) calculate the DP matrix (similarity score) and
• (2) traceback (alignment)
• Having sequences S0 and S1 as input, with sizes m and n,
an (m+1) x (n+1) matrix H is computed:
H[i,j] = max( H[i,j-1] - g,
              H[i-1,j] - g,
              H[i-1,j-1] + p(i,j),
              0 )

p(i,j) = ma (match), if S0[i] = S1[j]
       = mi (mismatch), otherwise

g: gap penalty
Smith-Waterman (SW) Algorithm
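A straightforward CPU sketch of this recurrence (linear gap model, score only; our illustration, not CUDAlign code) makes the quadratic time and memory visible:

#include <cstdio>
#include <cstring>
#include <vector>
#include <algorithm>

// Compute the best local (SW) score with the linear gap model above.
// Keeps the full (m+1) x (n+1) matrix, hence the quadratic memory.
int sw_score(const char* s0, const char* s1, int ma, int mi, int g) {
    int m = (int)strlen(s0), n = (int)strlen(s1), best = 0;
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int p = (s0[i - 1] == s1[j - 1]) ? ma : mi;
            H[i][j] = std::max({H[i][j - 1] - g, H[i - 1][j] - g,
                                H[i - 1][j - 1] + p, 0});
            best = std::max(best, H[i][j]);
        }
    return best;
}

int main() {
    // sequences from the example on the next slide; expected best score: 4
    printf("%d\n", sw_score("ATACGCTCTT", "ATAGCTA", 1, -1, 2));
    return 0;
}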
     -  A  T  A  G  C  T  A
 -   0  0  0  0  0  0  0  0
 A   0  1  0  1  0  0  0  1
 T   0  0  2  0  0  0  1  0
 A   0  1  0  3  0  0  0  2
 C   0  0  0  1  2  1  0  0
 G   0  0  0  1  2  1  0  0
 C   0  0  0  0  0  3  0  0
 T   0  0  1  0  0  1  4  2
 C   0  0  0  0  0  1  2  3
 T   0  0  1  0  0  0  2  1
 T   0  0  1  0  0  0  0  1

S0 (vertical): ATACGCTCTT; S1 (horizontal): ATAGCTA
first row and column are initialized with zeros
values: g=-2, mi=-1, ma=1 (example cell computation: max(2,-2,-2,0) = 2)
highest score: 4 (start of the traceback path)

Local Alignment:
A T A - G C T
A T A C G C T
Smith-Waterman (SW) Example
• [Gotoh 1982]: Computes the affine gap model, where the
penalty assigned to start a sequence of gaps (GapOpen) is
higher than the penalty assigned to extend it (GapExtend)
• Computes 3 DP matrices and provides a better
biological result
• Time and memory complexity: O(mn)
• [Hirschberg 1975]: Computes the linear gap model in
linear memory, with a divide and conquer recursive
approach
• Time complexity O(mn), memory complexity O(m+n)
• [Myers-Miller 1988]: Computes the affine gap model in
linear memory, with a modified version of Hirschberg's algorithm
• Time complexity O(mn), memory complexity O(m+n)
Smith-Waterman (SW) Variants
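For reference, one common formulation of Gotoh's three affine-gap recurrences (standard in the literature; not spelled out on the slide) is:

\begin{aligned}
E_{i,j} &= \max(E_{i,j-1} - G_{ext},\ H_{i,j-1} - G_{open})\\
F_{i,j} &= \max(F_{i-1,j} - G_{ext},\ H_{i-1,j} - G_{open})\\
H_{i,j} &= \max(0,\ E_{i,j},\ F_{i,j},\ H_{i-1,j-1} + p(i,j))
\end{aligned}

where E and F track alignments ending in a gap in each sequence, and G_open > G_ext.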
Smith-Waterman (SW) and its Variants
Wavefront method
(i,j) depends on (i-1,j) [up], (i-1,j-1) [diag] and (i,j-1) [left]
(Figure: the wavefront sweeps antidiagonals d0, d1, ..., d6; minimum parallelism at the corners, maximum parallelism at the main antidiagonal)
• Each anti-diagonal can be computed in parallel
• m+n-1 antidiagonals
• Non-uniform parallelism:
minimum at the beginning (d0)
maximum at the main antidiagonal (d3)
minimum at the end (d6)
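A minimal CUDA sketch of this wavefront scheme (our illustration: one kernel launch per antidiagonal of cells; the real CUDAlign kernels work on external diagonals of grid blocks, as shown later):

#include <cstdio>
#include <cstring>
#include <algorithm>

#define MA  1   // match
#define MI -1   // mismatch
#define G   2   // gap penalty (subtracted)

// All cells (i,j) with i+j == d are independent, so one thread updates each.
__global__ void sw_antidiagonal(const char* s0, const char* s1,
                                int m, int n, int d, int* H) {
    int iMin = max(1, d - n);
    int iMax = min(m, d - 1);
    int i = iMin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > iMax) return;
    int j = d - i;
    int p = (s0[i - 1] == s1[j - 1]) ? MA : MI;
    int w = n + 1;   // row stride of the (m+1) x (n+1) matrix
    H[i * w + j] = max(max(H[(i - 1) * w + j - 1] + p, H[(i - 1) * w + j] - G),
                       max(H[i * w + j - 1] - G, 0));
}

int main() {
    const char *a = "ATACGCTCTT", *b = "ATAGCTA";   // example sequences
    int m = (int)strlen(a), n = (int)strlen(b);
    char *dA, *dB; int *dH;
    cudaMalloc(&dA, m); cudaMalloc(&dB, n);
    cudaMalloc(&dH, (m + 1) * (n + 1) * sizeof(int));
    cudaMemcpy(dA, a, m, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, n, cudaMemcpyHostToDevice);
    cudaMemset(dH, 0, (m + 1) * (n + 1) * sizeof(int)); // zero first row/column
    for (int d = 2; d <= m + n; d++) {                  // m+n-1 antidiagonals
        int len = std::min(m, d - 1) - std::max(1, d - n) + 1;
        sw_antidiagonal<<<(len + 255) / 256, 256>>>(dA, dB, m, n, d, dH);
    }
    cudaDeviceSynchronize();  // H now holds the scores; reduce for the best
    cudaFree(dA); cudaFree(dB); cudaFree(dH);
    return 0;
}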
• Biological sequence comparison
• MASA-CUDAlign (one GPU)
• MASA-CUDAlign (multiple GPUs)
• MASA-CUDAlign with pruning
• MASA-OpenMP CPU studies
Agenda
OpenPower Webinar – June 19th, 2020
• Goal: compare huge DNA sequences in GPU with a
combination of Gotoh and Myers-Miller algorithms
– CUDAlign 1.0: similarity score in 1 GPU
– CUDAlign 2.0: score and alignment in 1 GPU
– CUDAlign 2.1: score and alignment in 1 GPU with pruning
– CUDAlign 3.0: similarity score in several GPUs
– MASA-CUDAlign 4.0: score and alignment in several GPUs
• PhD Thesis - Edans F. O. Sandes (Awarded the Best PhD
Thesis in Computer Science in Brazil - 2016)
• Wilkes Award 2019 – best paper published in
The Computer Journal in 2018
MASA-CUDAlign: Goal and Versions
(1) Find the best score (GPU)
(2) Partial traceback (GPU)
(3) Split partitions (GPU)
(4) Align partitions (CPU)
(5) Full alignment (CPU)
(Figure: the five stages and the crosspoints found in each one)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
Stages 2 to 5 (Traceback)
• The DP matrix is divided
into grid blocks; a set
of grid blocks composes
an external diagonal.
• Each external diagonal
is composed of B blocks,
where each block is
calculated by T threads.
Each thread computes
α rows.
• Each CUDA kernel is
invoked to calculate one
external diagonal.
B=3; T=3; α=2
Size(S0)=36, Size(S1)=36
(Figure: grid blocks B1 B2 B3)
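In CUDA terms, the decomposition above maps to one kernel launch per external diagonal, roughly like this sketch (simplified; buffers and the cell computation are omitted):

#define B 3       // grid blocks per external diagonal (slide example)
#define T 3       // threads per block
#define ALPHA 2   // rows computed by each thread

// Each launch processes one external diagonal: B grid blocks, T threads per
// block, each thread in charge of ALPHA rows of its block's cells.
__global__ void stage1_kernel(int extDiag) {
    int gridBlock = blockIdx.x;               // block position on the diagonal
    int firstRow  = threadIdx.x * ALPHA;      // first of this thread's rows
    // ... compute ALPHA rows of SW cells for (extDiag, gridBlock) here ...
    (void)gridBlock; (void)firstRow;
}

void stage1(int numExternalDiagonals) {
    for (int ed = 0; ed < numExternalDiagonals; ed++)
        stage1_kernel<<<B, T>>>(ed);          // one invocation per diagonal
    cudaDeviceSynchronize();
}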
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
• Computation of each block stops while parallelism is
still full, and the remaining cells are delegated to the
next kernel invocation.
(Figure: an 8 x 3 grid of blocks G0,0 ... G7,2, swept by external diagonals)
first external diagonals processed
external diagonals in the middle of the matrices
external diagonals with non-contiguous cells delegation
small loss of parallelism
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Parallelogram execution)
• The goal of the block pruning optimization is to
eliminate the calculation of blocks of cells that
surely do not belong to the optimal alignment.
• These blocks have such a small score that it is
not mathematically possible for them to lead to a score
higher than a score that has already been
produced.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
Hi,j = 100, Di = 50, Hmax(i,j) = 150
best_score = 230
pruning = true (150 < 230, so this block cannot beat the best score already found)
(FIGURE 14. Geometrical representation of the pruning area)
For similar sequences,
the pruning area is
characterized by four lines
(f1, f2, f3, f4), forming
two polygons that are
connected at the end of the
alignment
Gray area: not processed
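In code, the pruning test sketched above amounts to comparing an upper bound for the block against the best score so far (our sketch, with a hypothetical helper name):

#include <algorithm>

// A block is prunable when its border score, plus the best score still
// attainable from that position (all matches ahead), cannot reach best_score.
bool can_prune(long long Hij,              // score at the block's border cell
               long long i, long long j,   // coordinates of that cell
               long long m, long long n,   // sizes of S0 and S1
               long long best_score,       // best score already found
               long long ma) {             // match score (max gain per cell)
    long long Di   = std::min(m - i, n - j) * ma;  // best case from (i,j)
    long long Hmax = Hij + Di;                     // upper bound for the block
    return Hmax <= best_score;  // slide example: 100 + 50 = 150 <= 230 -> prune
}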
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 1.0)
CUDAlign: Using GPU to Accelerate the
Comparison of Megabase Genomic Sequences
Edans Flavius de O. Sandes Alba Cristina M. A. de Melo
University of Brasilia (UnB), Brazil
{edans,albamm}@cic.unb.br
Abstract
Biological sequence comparison is a very important oper-
ation in Bioinformatics. Even though there do exist exact
methods to compare biological sequences, these methods
are often neglected due to their quadratic time and space
complexity. In order to accelerate these methods, many
GPU algorithms were proposed in the literature. Never-
theless, all of them restrict the size of the smallest se-
quence in such a way that Megabase genome comparison
is prevented. In this paper, we propose and evaluate CUD-
Align, a GPU algorithm that is able to compare Megabase
biological sequences with an exact Smith-Waterman affine
gap variant. CUDAlign was implemented in CUDA and
tested in two GPU boards, separately. For real sequences
whose size range from 1MBP (Megabase Pairs) to 47MBP,
a close to uniform GCUPS (Giga Cells Updates per Sec-
ond) was obtained, showing the potential scalability of our
approach. Also, CUDAlign was able to compare the human
chromosome 21 and the chimpanzee chromosome 22. This
operation took 21 hours on GeForce GTX 280, resulting in
a peak performance of 20.375 GCUPS. As far as we know,
this is the first time such huge chromosomes are compared
with an exact method.
Categories and Subject Descriptors D.1.3 [Program-
ming Techniques]: Concurrent Programming; J.3 [Life
and Medical Sciences]: Biology and Genetics
General Terms Algorithms
Keywords Biological Sequence Comparison, Smith-
Waterman, GPU
1. Introduction
In the last four years, new DNA sequencing technologies
have been developed that allow a hundred-fold increase in
the throughput over the traditional method. This means
that the genomic databases, that have already an expo-
nential growth rate, will experience an unprecedented in-
crease in their sizes. Therefore, a huge amount of new
DNA sequences will need to be compared, in order to in-
fer functional/structural characteristics. In this scenario,
the time spent in each comparison, as well as the accuracy
of the result obtained, will be a fundamental factor to de-
termine the success/failure of the next generation genome
projects.
Sequence comparison is, thus, a very basic and impor-
tant operation in Bioinformatics. As a result of this step,
one or more sequence alignments can be produced [1]. A
sequence alignment has a similarity score associated to it
that is obtained by placing one sequence above the other,
making clear the correspondence between the characters
and possibly introducing gaps into them [2]. The most
common types of sequence alignment are global and lo-
cal. To solve a global alignment problem is to find the best
match between the entire sequences. On the other hand,
local alignment algorithms must find the best match be-
tween parts of the sequences.
One important issue to be considered is how gaps are
treated. A simple solution assigns a constant penalty for
gaps. However, it has been observed that keeping gaps
together represents better the biological relationships.
Hence, the most widely used model among biologists is the
affine gap model [3], where the penalty for opening a gap
is higher than the penalty for extending it.
Smith-Waterman (SW) [4] is an exact algorithm based
on the longest common subsequence (LCS) concept that
uses dynamic programming to find local alignments be-
tween two sequences of size m and n in O(mn) space
and time. In this algorithm, a similarity matrix of size
(m + 1) × (n + 1) is calculated. SW is very accurate but
it needs a lot of computational resources.
In order to reduce execution time, heuristic methods
such as BLAST [5] were proposed. These methods com-
bine exact pattern matching with dynamic programming
in order to produce good solutions faster. BLAST can align
sequences in a very short time, still producing good re-
sults. Nevertheless, there is no guarantee that the best
result will be produced.
Therefore, many efforts were made to develop methods
and techniques that execute the SW algorithm in high per-
formance architectures, allowing the production of exact
results in a shorter time. One recent trend in high per-
formance architectures is the Graphics Processing Units
(GPUs). In addition to the usual graphics functions, re-
cent GPU architectures are able to execute general pur-
pose algorithms (GPGPUs). These GPUs contain elements
that execute massive vector operations in a highly parallel
way. Because of its TFlops peak performance and its avail-
ability in PC desktops, the utilization of GPUs is rapidly
increasing in many scientific areas.
137
Conference: ACM PPoPP 2010
Table 5. Comparison data for the real sequences used in the tests (best local score and its end position):

Comparison      Cells     Accession 1   Accession 2   Score      End position
543K×536K       2.91E+11  NC_003064.2   NC_000914.1   48         (308558, 455134)
1044K×1073K     1.12E+12  CP000051.1    AE002160.2    88353      (1072950, 722725)
3147K×3283K     1.03E+13  BA000035.2    BX927147.1    4226       (2991493, 2689488)
5227K×5229K     2.73E+13  AE016879.1    AE017225.1    5220960    (5227292, 5228663)
7146K×5227K     3.74E+13  NC_005027.1   NC_003997.3   172        (4655867, 5077642)
23012K×24544K   5.65E+14  NT_033779.4   NT_037436.3   9063       (14651731, 11501313)
32799K×46944K   1.54E+15  BA000046.3    NC_000021.7   27206434   (32718231, 46919080)
Table 6. BLAST results:

Comparison      Time    Score
162K×172K       0.4s    18
543K×536K       0.6s    48
1044K×1073K     2.4s    6973
3147K×3283K     6.7s    3888
5227K×5229K     17.4s   36159
7146K×5227K     7.7s    157
23012K×24544K   110s    7085
32799K×46944K   -       -
For the human × chimpanzee chromosome comparison, BLAST finished its
execution with a segmentation fault, due to an out-of-memory error.
Conclusion and Future Work
In this paper, we proposed and evaluated CUDAlign, a
GPU-accelerated version of Smith-Waterman (SW) that
compares two Megabase genomic sequences. Differently
from the previous GPU Smith-Waterman (SW) proposals
in the literature, our proposal does not impose severe
restrictions on the size of the smallest sequence (...)
(Figure 10. Runtimes (seconds) × DP matrix size (cells) in logarithmic scale,
for the 8600GT (1,968 MCUPS) and the GTX280 (20,375 MCUPS). Results show
scalability and almost constant MCUPS ratio for Megabase sequences
(cells ≥ 1e+12).)
(...) in order to exploit the characteristics of the GPU memory hierarchy.
We obtained the optimal score of the chromosome 21
human x chimpanzee comparison (32 MBP x 47 MBP)
using the NVidia GTX 280 GPU in 21 hours.
GCUPS (Giga Cells Updated Per Second): 20.3
Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space
Edans Flavius de O. Sandes, Alba Cristina M. A. de Melo
Department of Computer Science
University of Brasilia (UnB)
Brasilia, Brazil
{edans,albamm}@cic.unb.br
Abstract—Cross-species chromosome alignments can reveal
ancestral relationships and may be used to identify the pe-
culiarities of the species. It is thus an important problem
in Bioinformatics. So far, aligning huge sequences, such as
whole chromosomes, with exact methods has been regarded as
unfeasible, due to huge computing and memory requirements.
However, high performance computing platforms such as GPUs
are being able to change this scenario, making it possible
to obtain the exact result for huge sequences in reasonable
time. In this paper, we propose and evaluate a parallel
algorithm that uses GPUs to align huge sequences, executing
the Smith-Waterman algorithm combined with Myers-Miller,
with linear space complexity. In order to achieve that, we
propose optimizations that are able to reduce significantly the
amount of data processed and that enforce full parallelism
most of the time. Using the GTX 285 Board, our algorithm
was able to produce the optimal alignment between sequences
composed of 33 Millions of Base Pairs (MBP) and 47 MBP in
18.5 hours.
I. INTRODUCTION
In the last decade, genome projects have produced a
huge amount of new biological data. In order to better
understand a newly sequenced organism, biologists compare
its sequence against millions of other sequences, in order
to infer properties. Sequence comparison is, thus, one of
the most important mechanisms in Bioinformatics. One of
the first exact methods to globally compare two sequences
is Needleman-Wunsch (NW) [1]. It is based on dynamic
programming (DP) and has time and space complexity
O(mn), where m and n are the sizes of the sequences. The
NW algorithm was modified by Smith-Waterman (SW) [2]
to deal with local alignments. In SW, a linear gap function
was used. Nevertheless, in the nature, gaps tend to occur
together. For this reason, the affine gap model is often
used, where the penalty for opening a gap is higher than
the penalty for extending it. Gotoh [3] modified the SW
algorithm, without changing time and space complexity, to
include affine gap penalties.
One of the most restrictive characteristics of SW and
its variants is the quadratic space needed to store the DP
matrices. For instance, in order to compare two 30 MBP
sequences, we would need at least 3.6 PB of memory. This
fact was observed by Myers-Miller [4], that proposed the use
of Hirschberg’s algorithm [5] to compute global alignments
in linear space. The algorithm uses a divide and conquer
technique that recursively splits the DP matrix to obtain the
optimal alignment.
In the last years, Graphics Processing Units (GPUs) have
received a lot of attention because of their TFlops peak
performance and their availability in PC desktops. In the
Bioinformatics research area, there are some implementa-
tions of SW in GPUs [6, 7, 8, 9, 10, 11, 12, 13], that were
able to obtain the similarity score with very good speedups.
Nevertheless, with the exception of CUDAlign 1.0 [13], all
of them define a maximum size for the query sequence. That
means that two huge sequences cannot be compared in such
implementations.
As far as we know, the only strategies that are able to
retrieve the alignment in GPUs are [6] and [12]. Since
both of them execute in quadratic space, the sizes of the
sequences to be compared is severely restricted.
In this paper, we propose and evaluate CUDAlign 2.0,
a new algorithm using GPU that is able to retrieve the
alignment of huge sequences with the SW algorithm, using
the affine gap model. Our implementation is only bound
to the total available global memory in the GPU and the
available disk space in the desktop. Our algorithm receives
two input sequences and provides the optimal alignment as
output. It runs in 6 stages, where the first three stages are
executed in GPU and the last three stages run in CPU. The
first stage executes SW [2] to retrieve the best score and its
position in the DP matrices, as in CUDAlign 1.0 [13]. Also,
some special rows are saved to disk. The goal of stages 2,
3 and 4 is to retrieve points where the optimal alignment
occurs in special rows/columns, thus creating small sub-
problems. In stage 5, the alignments for each sub-problem
are obtained and concatenated to generate the full alignment.
In stage 6, the alignment can be visualized. The proposed
algorithm was implemented in CUDA and C++ and executed
in the GTX 285 board. With our algorithm, we were able
to retrieve the alignment between the human chromosome
21 and the chimpanzee chromosome 22, with respectively
47 MBP and 33 MBP, in 18.5 hours, using reasonable disk
area and GPU memory.
The rest of this paper is organized as follows. In Section
II, we present the Smith-Waterman and the Myers and Miller
algorithms. In Section III, related work is discussed. Section
Conference:
IEEE IPDPS 2011
V. EXPERIMENTAL RESULTS
CUDAlign 2.0 was implemented in CUDA 3.1 and C++
and executed on an NVIDIA GeForce GTX 285. This board has
(...) memory, 30 multiprocessors and 240 cores. The host was
an Intel Pentium Dual-Core 3GHz, 3GB RAM,
running Ubuntu 10.04, Linux kernel 2.6.32.
The Smith-Waterman parameters were set to: match: (...);
mismatch: 3; first gap: 5; extension gap: 2. The
configurations used for GTX 285 were (...) = 4, (...) = 2^6,
B2 = B3 = 60 and T2 = T3 = 2^7. The number of blocks may be
reduced during runtime in order to meet the minimum size
requirement in each stage. Note that the number of blocks
must be preferably a multiple of the number of multiprocessors
(30 for GTX 285). By doing so, better performance can be
achieved because multiprocessors do not become idle when
they reach the last external diagonal.
Execution Times and GCUPS
We used real DNA sequences retrieved from the NCBI site
(www.ncbi.nlm.nih.gov). The names of the sequences compared,
as well as their sizes, are shown in Table II. The sizes of
these real sequences range from 162 KBP (...), and the
sequences are the same used in [13]. For the pairs of
sequences shown in Table II, Table III lists the best score,
its end and start positions, the length of the (...)
Table IV. Runtimes (in seconds) of Stage 1 of CUDAlign 2.0 flushing special
rows to disk. The overhead of saving special rows depends on the sizes of the
SRA and the sequences.

Comparison      Time(s)   MCUPS   SRA    Time(s)  MCUPS
                (no SRA)          size   (SRA)
162K×172K       1.4       19769   5M     1.5      18678
543K×536K       12.9      22545   50M    13.6     21419
1044K×1073K     48.3      23205   250M   51.6     21706
3147K×3283K     436       23706   1G     448      23035
5227K×5229K     1147      23822   3G     1185     23068
7146K×5227K     1568      23816   3G     1604     23282
23012K×24544K   23620     23911   10G    23750    23780
32799K×46944K   64507     23869   50G    65153    23632
Table V. Runtimes of each stage of CUDAlign 2.0 on GTX 285, varying the
comparison size. The total time includes all the stages and the I/O of
reading the sequences.

Comparison      Stage 1  Stage 2  Stage 3  Stage 4  Stages 5+6  Total
162K×172K       1.5      <0.1     <0.1     <0.1     <0.1        1.8
543K×536K       13.6     <0.1     <0.1     <0.1     <0.1        13.9
1044K×1073K     51.6     3.1      1.0      5.4      0.1         61.6
3147K×3283K     448      0.1      <0.1     0.3      <0.1        449
5227K×5229K     1185     65.9     20.3     47.6     1.9         1321
7146K×5227K     1604     <0.1     <0.1     <0.1     <0.1        1605
23012K×24544K   23750    0.3      <0.1     0.7      <0.1        23755
32799K×46944K   65153    805      236      376      9           66579
(...) of our approach, with almost constant GCUPS for megabase
sequences. Note that, for sequences longer than 3 MBP,
CUDAlign 2.0 is able to achieve a sustained performance (...)
We obtained the optimal alignment of the chromosome 21
human x chimpanzee comparison (32 MBP x 47 MBP)
using the NVidia GTX 285 GPU in 18.5 hours.
GCUPS (Giga Cells Updated Per Second): 23.6
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 2.0)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 2.1)
Retrieving Smith-Waterman Alignments with
Optimizations for Megabase Biological
Sequences Using GPU
Edans Flavius de O. Sandes and Alba Cristina M.A. de Melo, Senior Member, IEEE
Abstract—In Genome Projects, biological sequences are aligned thousands of times, in a daily basis. The Smith-Waterman algorithm
is able to retrieve the optimal local alignment with quadratic time and space complexity. So far, aligning huge sequences, such as
whole chromosomes, with the Smith-Waterman algorithm has been regarded as unfeasible, due to huge computing and memory
requirements. However, high-performance computing platforms such as GPUs are making it possible to obtain the optimal result for
huge sequences in reasonable time. In this paper, we propose and evaluate CUDAlign 2.1, a parallel algorithm that uses GPU to align
huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to
achieve that, we propose optimizations which are able to reduce significantly the amount of data processed, while enforcing full
parallelism most of the time. Using the NVIDIA GTX 560 Ti board and comparing real DNA sequences that range from 162 KBP
(Thousand Base Pairs) to 59 MBP (Million Base Pairs), we show that CUDAlign 2.1 is scalable. Also, we show that CUDAlign 2.1 is
able to produce the optimal alignment between the chimpanzee chromosome 22 (33 MBP) and the human chromosome 21 (47 MBP)
in 8.4 hours and the optimal alignment between the chimpanzee chromosome Y (24 MBP) and the human chromosome Y (59 MBP) in
13.1 hours.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
BIOINFORMATICS is an interdisciplinary field that involves
computer science, biology, mathematics, and statistics
[1]. One of its main goals is to analyze biological sequence
data and genome content in order to obtain the function/
structure of the sequences as well as evolutionary
information.
Once a new biological sequence is discovered, its
functional/structural characteristics must be established.
The first step to achieve this goal is to compare the new
sequence with the sequences that compose genomic
databases, in search of similarities. This comparison is
made thousands of times in a daily basis, all over the world.
Sequence comparison is, therefore, one of the most basic
operations in Bioinformatics. As output, a sequence
comparison operation produces similarity scores and
alignments. The score is a measure of similarity between
the sequences and the alignment highlights the similarities/
differences between the sequences. Both are very useful and
often are used as building blocks for more complex
problems such as multiple sequence alignment and sec-
ondary structure prediction.
Smith and Waterman (SW) [2] proposed an exact
algorithm that retrieves the optimal score and local
alignment between two sequences. It is based on Dynamic
Programming (DP) and has time and space complexity
O(mn), where m and n are the sizes of the sequences. In SW,
a linear gap function was used. Nevertheless, in the nature,
gaps tend to occur together. For this reason, the affine gap
model is often used, where the penalty for opening a gap is
higher than the penalty for extending it. Gotoh [3] modified
the SW algorithm to include affine gap penalties.
One of the most restrictive characteristics of SW and its
variants is the quadratic space needed to store the DP
matrices. For instance, in order to compare two 33 MBP
(Million Base Pairs) sequences, we would need at least
4.3 PB of memory. This fact was observed by Hirschberg [4],
who proposed a linear space algorithm to compute the
Longest Common Subsequence (LCS). Hirschberg’s algo-
rithm was later modified by Myers and Miller (MM) [5] to
compute global alignments in linear space.
Another restrictive characteristic of the SW algorithm is
that it is usually slow due to its quadratic time complexity.
In order to accelerate the comparison between long
sequences, heuristic tools such as LASTZ [6] and MUMMER
[7] were created. They use seeds (LASTZ) and suffix trees
(MUMMER) to scan the sequences, providing a big picture
of the main differences/similarities between them. On the
other hand, Smith-Waterman provides the optimal local
alignment, where the regions of differences/similarities are
much more accurate, as well as the gapped regions that
represent inclusion/deletion of bases. Therefore, we claim
that both kinds of tools should be used in a complementary
way: first, MUMMER or LASTZ would be executed and
Journal: IEEE Transactions on Parallel
and Distributed Systems, 2013
stages. The first stage processes the full DP matrix as in [27],
but some special rows are saved in an area called Special
Rows Area and some blocks are pruned. The second stage
processes the DP matrix in the reverse direction starting
from the endpoint of the optimal alignment and also saves
special columns in disk. Using an optimization called
orthogonal execution, the area calculated in Stage 2 is
reduced. Stage 3 increases the number of crosspoints with
an execution similar to Stage 2 but in the forward direction.
Stage 4 uses the MM algorithm with orthogonal
execution to decrease the size of the partitions. As soon as
all the partitions are smaller than the maximum partition
size, Stage 5 finds the alignment of each partition and
concatenates the results in the full alignment. Stage 6 is
optional and it presents the full alignment in textual or
graphical representation.
(...) memory space. Using an SRA of 50 GB, the full alignment of
these genomic sequences was obtained in 8 hours and
26 minutes, where 99.1 percent of this time was spent in the
GPU stages. CUDAlign 2.1 obtained maximum speedup of
41.64× when compared to the Z-align cluster solution with
64 cores.
As future work, we intend to further optimize the
stages of the algorithm. In Stage 3, the parallelism is
currently exploited intensively inside each partition; in
future works many partitions may also be processed in
parallel, reducing the execution time of Stage 4. Also, we
intend to implement the block pruning optimization in
Stages 2 and 3. We will also extend the tests to evaluate more
powerful GPUs, including systems with dual cards and
from other vendors. Finally, we will investigate the
possibility of solving the multiple sequence alignment
problem with the optimizations proposed in this paper.
REFERENCES
[1] D.W. Mount, Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 2004.
[2] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, pp. 195-197, Mar. 1981.
[3] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[4] D.S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Comm. ACM, vol. 18, pp. 341-343, 1975.
[5] E.W. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, pp. 11-17, 1988.
[6] R.S. Harris, "Improved Pairwise Alignment of Genomic DNA," PhD thesis, The Pennsylvania State Univ., 2007.
[7] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile and Open Software for Comparing Large Genomes," Genome Biology, vol. 5, no. 2, 2004.
[8] S. Aluru, N. Futamura, and K. Mehrotra, "Parallel Biological Sequence Comparison Using Prefix Computations," J. Parallel Distributed Computing, vol. 63, no. 3, pp. 264-272, 2003.
[9] S. Rajko and S. Aluru, "Space and Time Optimal Parallel Sequence Alignments," IEEE Trans. Parallel Distributed Systems, vol. 15, no. 12, pp. 1070-1081, Dec. 2004.
[10] R.B. Batista, A. Boukerche, and A.C.M.A. de Melo, "A Parallel Strategy for Biological Sequence Alignment in Restricted Memory Space," J. Parallel Distributed Computing, vol. 68, no. 4, 2008.

Fig. 13. Plot of some alignments with pruned blocks in gray.
We obtained the optimal alignment of the
chromosome 21 human x chimpanzee
comparison (32 MBP x 47 MBP) using the
NVidia GTX 560 Ti GPU in 8 hours. GCUPS: 52.85
• Biological sequence comparison
• MASA-CUDAlign (one GPU)
• MASA-CUDAlign (multiple GPUs)
• MASA-CUDAlign with pruning
• MASA-OpenMP CPU studies
Agenda
OpenPower Webinar – June 19th, 2020
(Figure 5. Multi-GPU threads chaining; Figure 6. Columns distributions for 4 GPUs; Figure 8. Multi-GPU buffers between 4 GPUs, with output-input pairs of buffers continually transferred between neighbours)
Communication
using sockets
and I/O threads
Overlap between
computation and
communication:
8M buffer
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stage 1 – Compute the DP matrix
Multi-GPU wavefront
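The overlap of computation and communication can be pictured with a bounded buffer between neighbouring GPUs (our simplified, in-process sketch; the tool itself uses sockets and dedicated I/O threads with an 8M buffer):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Bounded buffer decoupling the compute loop from the I/O thread: border
// columns received from the left neighbour are queued here, so the wavefront
// only stalls when the buffer is empty (or, on the sender side, full).
class BorderBuffer {
    std::queue<std::vector<int>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    const size_t cap_;
public:
    explicit BorderBuffer(size_t cap) : cap_(cap) {}
    void push(std::vector<int> col) {               // called by the I/O thread
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(col));
        cv_.notify_all();
    }
    std::vector<int> pop() {                        // called by the compute loop
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        std::vector<int> col = std::move(q_.front());
        q_.pop();
        cv_.notify_all();
        return col;
    }
};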
• Main challenge: how to parallelize a stage
that is inherently sequential?
– Speculation
• Incremental Speculative Traceback (IST):
each GPU will consider that the local
maximum is also the global maximum.
(Figure: columns distributions for 4 GPUs)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stages 2 to 5 - Traceback
(Figure 9. Traceback timelines over time for GPUs 1-10: (a) Pipelined Traceback (PT), without speculation; (b) Incremental Speculative Traceback (IST), with speculation, showing optimal vs. speculated crosspoints)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Incremental Speculative Traceback (IST)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 3.0)
Fine-Grain Parallel Megabase Sequence
Comparison with Multiple Heterogeneous GPUs
Edans F. de O. Sandes
University of Brasilia
edans@cic.unb.br
Guillermo Miranda
Barcelona Supercomputing Center
guillermo.miranda@bsc.es
Alba C. M. A. Melo
University of Brasilia
alba@cic.unb.br
Xavier Martorell
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
xavier.martorell@bsc.es
Eduard Ayguadé
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
eduard.ayguade@bsc.es
Abstract
This paper proposes and evaluates a parallel strategy to
execute the exact Smith-Waterman (SW) algorithm for
megabase DNA sequences in heterogeneous multi-GPU
platforms. In our strategy, the computation of a single huge
SW matrix is spread over multiple GPUs, which commu-
nicate border elements to the neighbour, using a circular
buffer mechanism that hides the communication overhead.
We compared 4 pairs of human-chimpanzee homologous
chromosomes using 2 different GPU environments, obtain-
ing a performance of up to 140.36 GCUPS (Billion of cells
processed per second) with 3 heterogeneous GPUS.
Categories and Subject Descriptors D.1.3 [Programming
Techniques]: Concurrent Programming; J.3 [Life and Med-
ical Sciences]: Biology and Genetics
Keywords GPU; Biological Sequence Comparison; Smith-
Waterman;
1. Introduction
Smith-Waterman (SW) [4] is an exact algorithm based on
the longest common subsequence (LCS) concept, that uses
dynamic programming to find local alignments between two
sequences. SW is very accurate but it needs a lot of compu-
tational resources. GPUs (Graphics Processing Units) have
been considered to accelerate SW, but very few GPU strate-
gies [1, 3] allow the comparison of Megabase sequences
longer than 10 Million Base Pairs (MBP). SW#[1] uses 2
GPUs to execute a Myers-Miller [2] linear space variant of
SW. CUDAlign [3] uses a single GPU to execute a com-
bined strategy with SW and Myers-Miller. When compared
to SW#(1 GPU), CUDAlign (1 GPU) presents better execu-
tion times for huge sequences [1].
In this work, we modified the most computational inten-
sive stage of CUDAlign, parallelizing the computation of
a single huge DP matrix among heterogeneous GPUs in a
fine-grained way. In the proposed strategy, GPUs are logi-
cally arranged in a linear way so that each GPU calculates
a subset of columns of the SW matrix, sending the border
column elements to the next GPU. Experimental results col-
lected in 2 different environments show performance of up
to 140 GCUPS (Billion of cells processed per second) using
3 heterogeneous GPUS. With this performance, we are able
to compare real megabase sequences in reasonable time.
2. Proposed Multi-GPU Strategy
We modified the first stage of CUDAlign [3] to parallelize
computation of a single huge DP matrix among many het-
erogeneous GPUs. The parallelization is done using a multi-
GPU wavefront method, where the GPUs are logically ar-
ranged in a linear way, i.e, the first GPU is connected to the
second, the second to the third and so on. Each GPU com-
putes a range of columns of the DP matrix and the GPUs
transfer the cells of their last column to the next GPU. In
a scenario composed of heterogeneous GPUs, assigning the
same number of columns to all GPUs is not a good choice.
In this case, the slowest GPU would determine the process-
ing rate of the whole wavefront. To avoid this, we statically
distribute the columns proportionally to the computational
power of each GPU. This distribution can be obtained from
sequence comparison benchmarks that determine each GPU
Conference: ACM PPoPP 2014
GTX580+GTX680+GTX680
(column shares: 30.71% + 34.64% + 34.63%)
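The static distribution can be sketched as follows (our illustration): columns are assigned proportionally to each GPU's measured compute power, e.g., the benchmark shares on the slide.

#include <cstdio>
#include <vector>

// Assign DP-matrix columns proportionally to each GPU's compute power.
std::vector<long long> split_columns(long long nCols,
                                     const std::vector<double>& power) {
    double total = 0;
    for (double p : power) total += p;
    std::vector<long long> cols(power.size());
    long long assigned = 0;
    for (size_t g = 0; g + 1 < power.size(); g++) {
        cols[g] = (long long)(nCols * power[g] / total);
        assigned += cols[g];
    }
    cols.back() = nCols - assigned;    // last GPU takes the remainder
    return cols;
}

int main() {
    // shares from the slide: GTX580 30.71%, GTX680 34.64%, GTX680 34.63%
    for (long long c : split_columns(46944000, {30.71, 34.64, 34.63}))
        printf("%lld\n", c);
    return 0;
}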
(Figure 1. Columns distributions for 4 GPUs)

Table 1. Sequences used in the tests:

Chr.    Human                 Chimpanzee            Score
        Accession       Size  Accession       Size
chr19   NC_000019.9     59M   NC_006486.3     64M   17297608
chr20   NC_000020.10    63M   NC_006487.3     62M   40050427
chr21   NC_000021.8     48M   NC_006488.2     46M   36006054
chr22   NC_000022.10    51M   NC_006489.3     50M   31510791
We obtained the optimal score of the chromosome
21 human x chimpanzee comparison (46 MBP x
47 MBP) using 3 GPUs (GTX580 + 2 GTX680) in
6 hours and 28 minutes. GCUPS: 139.63
CUDAlign 3.0: Parallel Biological Sequence Comparison in Large GPU Clusters
Edans F. de O. Sandes∗, Guillermo Miranda†, Alba C. M. A. de Melo∗, Xavier Martorell†‡, Eduard Ayguad醇
∗University of Brasilia (UnB)
{edans, albamm}@cic.unb.br
†Barcelona Supercomputing Center (BSC)
{guillermo.miranda, xavier.martorell, eduard.ayguade}@bsc.es
‡Universitat Politècnica de Catalunya (UPC)
{xavim, eduard}@ac.upc.edu
Abstract—This paper proposes and evaluates a parallel
strategy to execute the exact Smith-Waterman (SW) biological
sequence comparison algorithm for huge DNA sequences in
multi-GPU platforms. In our strategy, the computation of a
single huge SW matrix is spread over multiple GPUs, which
communicate border elements to the neighbour, using a circular
buffer mechanism. We also provide a method to predict the
execution time and speedup of a comparison, given the number
of the GPUs and the sizes of the sequences. The results obtained
with a large multi-GPU environment show that our solution
is scalable when varying the sizes of the sequences and/or the
number of GPUs and that our prediction method is accurate.
With our proposal, we were able to compare the largest human
chromosome with its homologous chimpanzee chromosome
(249 Millions of Base Pairs (MBP) x 228 MBP) using 64 GPUs,
achieving 1.7 TCUPS (Tera Cells Updated per Second). As far
as we know, this is the largest comparison ever done using the
Smith-Waterman algorithm.
I. INTRODUCTION
In comparative genomics, biologists need to compare
their sequences against other organisms in order to infer
functional and structural properties. Sequence comparison is,
therefore, one of the most basic operations in Bioinformatics
[1], usually solved using heuristic methods due to the
excessive execution times of their exact counterparts.
The exact algorithm to execute pairwise comparisons is
the one proposed by Smith-Waterman (SW) [2], which is
based on Dynamic Programming (DP), with quadratic time
and space complexities. The SW algorithm is normally
executed to compare (a) two DNA sequences or (b) a
protein sequence (query sequence) to a genomic database,
which is composed of several protein sequences. Both cases
have been parallelized in the literature. In the first case,
a single SW matrix is calculated and all the Processing
Elements (PEs) participate in this calculation (fine-grained
computation). Since there are data dependencies, neighbour
PEs communicate in order to exchange border elements. For
Megabase DNA sequences, the algorithm calculates a huge
matrix with several Petabytes. In the second case, multiple
small SW matrices are calculated usually without communi-
cation between the PEs (coarse-grained computation). With
the current genomic databases, often hundreds of thousands
SW matrices are calculated in a single query × database
comparison.
GPUs (Graphics Processing Units) are highly parallel
architectures which execute data parallel problems usually
much faster than a general-purpose processor. For this rea-
son, they have been considered to accelerate SW, with many
versions already available, executing on a single GPU [3–
7]. More recently, several approaches have been proposed to
execute SW in multiple GPUs [8–12].
Very few GPU strategies [3, 12] allow the comparison
of Megabase sequences longer than 10 Million Base Pairs
(MBP). SW# [12] is able to use 2 GPUs in a single
Megabase comparison to calculate the Myers-Miller [13]
linear space variant of SW. CUDAlign [3] executes in a
single GPU and obtains the alignment of Megabase se-
quences with a combined SW and Myers-Miller strategy.
When compared to SW# (1 GPU), CUDAlign (1 GPU)
presents better execution times for huge sequences [12].
In this paper, we propose and evaluate CUDAlign 3.0,
an evolution of CUDAlign 2.1 [3] which executes the first
stage of the SW algorithm in a fine-grained parallel way,
comparing Megabase DNA sequences in multiple GPUs. In
CUDAlign 3.0, we faced the challenge of distributing the
computation of a huge DP matrix among several GPUs, with
low impact on the performance. In the proposed strategy,
GPUs are logically arranged in a linear way so that each
GPU calculates a subset of columns of the SW matrix,
sending the border column elements to the next GPU.
Due to the data dependencies of the SW recurrence
relation, a slowdown in the communication between any 2
GPUs will slowdown the whole matrix computation [14].
To tackle this problem, we decided that computation must
be overlapped with communication, so asynchronous CPU
threads will send/receive data to/from neighbor GPUs while
the GPU continues computing.
Sequence comparisons that deal with Megabase sequences
can take hours or even days to complete. In this scenario,
we developed a method to predict the execution time and
speedup of a comparison, given the number of the GPUs
and the sizes of the sequences.
CUDAlign 3.0 was implemented in CUDA, C++ and
(Figure 10. Actual speedup in Minotauro for 1 to 16 GPUs, for the 23M×25M, 10M×10M and 5M×5M comparisons)
(Figure 11. Throughput (GCUPS) obtained with variable sequence sizes on Minotauro with 1, 2, 4, 8 and 16 GPUs: chr21x21 (2.24E+15 cells), chr22x22 (2.55E+15 cells), chr19x19 (3.76E+15 cells), chr20x20 (3.89E+15 cells))
(...) the number of cells calculated in each comparison. We can
note that the GCUPS rate is almost identical considering the
workload size from 2.24 × 10^15 to 3.49 × 10^15 cells.
Figure 12 presents, in logarithmic scale, the execution
times of all the comparisons listed in Table II in Minotauro
(up to 16 GPUs). The dashed diagonal lines represent (...)
This paper proposed CUDAlign 3.0, an algorithm able to
compare Megabase sequences in multiple GPUs in reasonable
time. CUDAlign 3.0 was evaluated in homogeneous (...) GPU
environments used in the tests. In Minotauro, CUDAlign 3.0
achieved (...) GCUPS when (...). With CUDAlign 3.0, the
comparison of the human and chimpanzee chromosome 1
was executed (...). These results (...) CUDAlign 3.0 (...)
inter-process communication (...) other parallel environments (...)
× chimpanzee (...) in SW# [12] and (...).
Additionally, a method to predict the execution time and
speedup of a comparison was proposed, with good accuracy:
the error was below 0.4 (...); with this prediction, it was (...).
This paper also presented an analysis of the human-chimpanzee (...)
Conference:
IEEE CCGrid 2014
We obtained the optimal score of the chromosome 21
human x chimpanzee comparison (32MBP x 47MBP)
using 16 GPUs Nvidia Tesla M2090 in
1 hour and 20 minutes, GCUPS: 488.21
Using 64 GPUs, we obtained the optimal score of the
chromosome 1 human x chimpanzee comparison
(249MBP x 229MBP) in 9.09 hours – 1726.47 GCUPS
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 3.0)
Journal:
IEEE Transactions on Parallel and
Distributed Systems, 2016
Using 384 GPUs, we obtained the optimal alignment of the
chromosome 21 comparison in a few minutes and the optimal
alignment of the chromosome 5 human x chimpanzee comparison
(180 MBP x 183 MBP) in 53 minutes –
10372.56 GCUPS
CUDAlign 4.0: Incremental Speculative
Traceback for Exact Chromosome-Wide
Alignment in GPU Clusters
Edans Flavius de Oliveira Sandes, Guillermo Miranda, Xavier Martorell, Eduard Ayguade,
George Teodoro, and Alba Cristina Magalhaes Melo, Senior Member, IEEE
Abstract—This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA
sequences in multi-GPU platforms, using the exact Smith–Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge
Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right
neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the
traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and
limits the application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we propose and evaluate a new parallel
traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally
over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60
Peta cells, obtaining the optimal local alignments of all Human and Chimpanzee homologous chromosomes, whose sizes range from
26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such comparison was made with the SW exact
method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the
baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS
(Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
IN comparative genomics, biologists compare the sequen-
ces that represent organisms in order to infer functional/
structural properties. Sequence comparison is, therefore,
one of the most basic operations in Bioinformatics [1], usu-
ally solved using heuristic methods due to the excessive
computation times of the exact methods.
Smith–Waterman (SW) [2] is an exact algorithm to com-
pute pairwise local comparisons. It is based on Dynamic Pro-
gramming (DP) and has quadratic time and space
complexities. The SW algorithm is divided in two phases,
where the first phase is responsible to calculate a DP matrix
in order to obtain the optimal score and the second phase
(traceback) obtains the optimal alignment. SW is usually
executed to compare (a) two DNA sequences or (b) a protein
sequence (query sequence) to a genomic database. In the first
case, a single SW matrix is calculated and all the Processing
Elements (PEs) cooperate in this calculation, communicating
to exchange border elements (fine-grained computation).
For Megabase DNA sequences, a huge DP matrix with
several Petabytes is computed. In the second case, multiple
small SW matrices are calculated usually without communi-
cation between the PEs (coarse-grained computation). With
the current genomic databases, often hundreds of thousands
SW matrices are calculated in a single query × database
comparison.
In the last decades, SW approaches for both cases have
been parallelized in the literature, using multiprocessor/
multicores [3], [4], Cell Broadband Engines (CellBEs) [5],
Field Programmable Gate Arrays (FPGAs) [6], Application
Specific Integrated Circuits (ASICs) [7], Intel Xeon Phis [8]
and Graphics Processing Units (GPUs) [9], [10], [11], [12].
The SW algorithm is widely used by biologists to compare
sequences in many practical applications, such as identifica-
tion of orthologs [13], and virus integration detection [14].
In this last application, an FPGA-based platform [6] was
used to compute millions of SW alignments with small
query sequences in short time.
Nowadays, executing SW comparisons with Megabase
sequences is still considered unfeasible by most research-
ers, which currently limits its practical use. We claim
that important bioinformatics applications such as whole
genome alignment (WGA) [15] could benefit from exact
pairwise comparisons of long DNA sequences. WGA
applications often construct global genome alignments by
using local alignments as building blocks [16], [17]. In
[18], the authors state that SW local alignments would be
the best choice in this case. However, in order to compare
1 MBP × 1 MBP sequences, the SW tool took more than
five days, preventing its use.
spent in Stage 1 and the remaining stages (Traceback). As
shown, the speedups attained with 128 nodes for chr22 and
chr16 were, respectively, 26.9× and 29.7× (21.0 and 23.2
percent of parallel efficiency).
The breakdown of the total execution shows that the
Stage 1 of CUDAlign has a much better scalability. Stage 1
attained speedups of 84.0× and 97.3× with 128 nodes (65.6
and 76.0 percent of parallel efficiency), resulting in a peak
performance of 8.3 and 9.7 TCUPS for chr22 and chr16,
respectively. Stage 1 results of chr22 and chr16 are consistent
with the ones obtained in CUDAlign 3.0 [12]. The PT
traceback phase, on the other hand, was not able to efficiently
scale (...): the time spent in the traceback phase increased
from about 4 to 71 percent as the number of nodes used was
scaled from 1 to 128. This negative impact of the traceback
to the whole application performance is highly reduced when
IST is used, as shown in Section 6.3.
6.3 Impact of Incremental Speculative Traceback
The experimental evaluation of the impact of IST to the per-
formance was carried out using five pairs of homologous
chromosomes: chr22, chr16, chr13, chr8, and chr5. These
sequences were selected intending to provide a wide range
of variation in the DP matrix size calculated (2.55, 8.13,
13.26, 21.07, 33.04 Peta cells, respectively).
Fig. 10. Alignment plots between human and chimpanzee homologous chromosomes.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 4.0)
Chromosome 1
Human vs Chimpanzee Comparison
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Chromosome 1 human vs chimpanzee
This comparison took
2 hours with
384 NVidia
M2090 GPUs.
We computed
56.91 PetaCells
in this execution
MASA-CUDAlign: Goal and Versions
Best Performances for Smith-Waterman
HPC Platforms

Name           Year  Max. Size (10^n)  Output            Platform                     TCUPS
SW-Rivyera     2014  1,000             score             128 FPGAs (real execution)   6.02
SW-MVM         2014  100               score, alignment  128 CPUs (real execution)    0.90
MASA-CUDAlign  2016  100,000,000       score, alignment  384 GPUs (real execution)    10.3
Prins          2017  100,000,000       score             ReCAM (custom simulation)    53.0
More than 100 papers in the literature in the last decades
• Biological sequence comparison
• MASA-CUDAlign (one GPU)
• MASA-CUDAlign (multiple GPUs)
• MASA-CUDAlign with pruning
• MASA-OpenMP CPU studies
Agenda
OpenPower Webinar – June 19th, 2020
The programmer chooses the type of block pruning and
the parallelization strategy
The programmer needs to code the recurrence relations
(...) must be implemented in the specific language and linked together to create a new aligner extension. Some aligners were built
in the work that presented MASA, executing in different hardware such as GPUs, multicore CPUs and Intel Phi co-processors.
The code of MASA is divided in modules, according to the features: platform-independent functions (like data management and
statistical procedures) and platform-dependent functions (like the parallel processing of the DP matrix and the BP module,
implemented considering the target platform). The integration of these modules can be observed in Figure 2.
(FIGURE 2. MASA architecture [17])
As for the parallelization strategy, two approaches are suggested: the diagonal method, allowing the parallel processing of cells in the
same antidiagonal, and the dataflow method, where the propagation is generic among nodes that represent blocks of cells. Similarly, the block pruning
is proposed using diagonal or generic execution approaches, avoiding unnecessary calculations. In order to create a specific (...)
MASA Architecture
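To make the split concrete, here is a hypothetical sketch of the division the architecture implies (invented names; this is NOT the real MASA API): a platform-independent core drives a platform-dependent extension that processes, and optionally prunes, blocks of the DP matrix.

#include <cstdint>

struct Block { int64_t row0, col0, rows, cols; };

// Platform-dependent side: each aligner extension (CUDA, OpenMP, Phi, ...)
// implements the DP-matrix processing and may implement block pruning.
class AlignerExtension {
public:
    virtual ~AlignerExtension() = default;
    virtual void processBlock(const Block& b) = 0;               // recurrences
    virtual bool isPrunable(const Block&) const { return false; }
};

// Platform-independent side: data management, statistics, stage control.
// (Block traversal is simplified to row order here; MASA supports diagonal
// and dataflow orders, as described above.)
class MasaCore {
    AlignerExtension& ext_;
public:
    explicit MasaCore(AlignerExtension& ext) : ext_(ext) {}
    void run(int64_t m, int64_t n, int64_t blk) {
        for (int64_t i = 0; i < m; i += blk)
            for (int64_t j = 0; j < n; j += blk) {
                Block b{i, j, blk, blk};
                if (!ext_.isPrunable(b)) ext_.processBlock(b);
            }
    }
};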
(Figure 5: Columns distributions for 4 GPUs)
Pruning area:
the area in blue is not computed
Heavy load imbalance
• Challenge: execute MASA-CUDAlign in a multi-
GPU platform with block pruning.
MASA-CUDAlign –
Multi-GPU with Pruning
The pruning area is obtained during the execution
• The GPUs exchange their local best
results periodically.
In order to execute the sequence alignment with BP in
multiple GPUs, each one will compute a subset of columns
of the DP matrix, i.e., the sequence placed horizontally (S1
in Fig. 4) is split according to a defined static partition. Thus,
each GPU compares a part of this sequence with the entire
sequence placed vertically (S0 in Fig. 4).
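A minimal sketch of the score sharing (ours, simplified to an in-process atomic; in the tool the exchange happens periodically between the processes driving the GPUs):

#include <algorithm>
#include <atomic>

// Global best score shared by all GPU workers.
std::atomic<long long> global_best{0};

// Periodically called by each worker: publish the local best and refresh the
// pruning threshold with the best score found by any GPU so far.
void share_score(long long local_best, long long& pruning_threshold) {
    long long g = global_best.load();
    while (local_best > g && !global_best.compare_exchange_weak(g, local_best)) {
        // g is reloaded by compare_exchange_weak; retry while we are ahead
    }
    pruning_threshold = std::max(local_best, global_best.load());
}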
Multi-GPU with Pruning
Score sharing
Multi-GPU with Pruning
Score sharing - Publication
(Fig. 6: Multi-BP results in the Comet environment (2 and 4 P100 GPUs): GCUPS for comparisons 1M to 47M and Chr19 to Chr22, with and without BP)
As can be observed, the speedup varied from 1.60x to 1.92x
with two GPUs, and 2.70x to 3.72x with four GPUs.
As can be observed, the speedup varied from 1.60x to 1.92x
with two GPUs, and 2.70x to 3.72x with four GPUs.
(Table VII: Execution time (in hours) and speedup of Multi-BP on P100 GPUs; linear reference: 1.00x, 2.00x and 4.00x for 1, 2 and 4 GPUs)
Conference: Euromicro PDP 2020
Parallel Comparison of Huge DNA Sequences in
Multiple GPUs with Block Pruning
Marco Figueiredo Jr.
Univ. of Brasilia
160063027@aluno.unb.br marcoacf@sarah.br
Edans Sandes
Univ. of Brasilia
edans.sandes@gmail.com
George Teodoro
Univ. Fed. de Minas Gerais
george@dcc.ufmg.br
Alba C. M. A. Melo
Univ. of Brasilia
alves@unb.br
Abstract—Sequence comparison is a task performed in several Bioinformatics applications daily all over the world. Algorithms that retrieve the optimal result have quadratic time complexity, requiring a huge amount of computing power when the sequences compared are long. In order to reduce the execution time, many parallel solutions have been proposed in the literature. Nevertheless, depending on the sizes of the sequences, even those parallel solutions take hours or days to complete. Pruning techniques can significantly improve the performance of the parallel solutions and a few approaches have been proposed to provide pruning capabilities for sequence comparison applications. This paper proposes and evaluates a variant of the block pruning approach that runs in multiple GPUs, in homogeneous or heterogeneous environments. Experimental results obtained with DNA sequences in two testbeds show that significant performance gains are obtained with pruning, compared to its non-pruning counterpart, achieving the impressive performance of 694.8 GCUPS (Billions of Cells Updated per Second) for four GPUs.
Index Terms—bioinformatics, DNA alignment, GPU, pruning
I. INTRODUCTION
Bioinformatics produces solutions that are used by various fields of study, such as medicine and biology [1]. Biological sequence comparison operations are executed several times daily all over the world, either in stand-alone mode or incorporated into Bioinformatics applications to solve complex problems such as evolutionary relationship determination and drug design. Due to their quadratic time complexity, sequence comparison algorithms that retrieve the optimal result can take a lot of time. In order to reduce the execution time of such algorithms, parallel solutions have been proposed in the literature over the last decades.
The type of parallelism provided by Graphics Processing Units (GPUs) makes these devices a very good alternative to run sequence comparisons [2] [3]. CUDAlign 4.0 [3] is a state-of-the-art tool that compares huge DNA sequences in multiple GPUs and obtains the optimal result, combining the Gotoh [4] and the Myers-Miller [5] algorithms. Using 384 GPUs, it was able to compare the homologous human x chimpanzee chromosomes 5 (180 Million Base Pairs – MBP – each) in 53 minutes, computing a matrix of 33.04 Petacells at 10.37 TCUPS (Trillions of Cells Updated per Second). In an earlier version for one GPU (CUDAlign 2.1 [6]), the block pruning (BP) strategy was proposed to avoid the computation of parts of the matrix that surely will not lead to the optimal solution, with good results for one GPU. Further versions of CUDAlign present pruning capabilities only for single-GPU executions. SW# [7] implemented the original MM algorithm and extended the block pruning strategy [6] to be used in two GPUs, but the performance was just a little better than the execution of CUDAlign in one GPU [7]. As far as we know, there is no work in the literature that obtains the optimal result with pruning using more than two GPUs. Other works use CPUs [8], FPGAs [9] or hybrid environments [10], but they are outside the scope of the paper.
This paper proposes and evaluates Multi-BP, an adaptation of block pruning for multiple GPUs. It is based on static distribution and dynamic sharing of pruning information, leading to considerable performance gains in medium-sized GPU environments. Multi-BP combines the multi-GPU CUDAlign version [3] and the pruning technique proposed in [6]. The challenges to design Multi-BP were the following: (a) ensure that Multi-BP will not affect the performance in single GPU executions; (b) adapt the calculation of the index of each GPU block of cells and the evaluation of the pruning window to a multiple GPU environment; (c) disseminate the pruning information obtained by each GPU to all others with low overhead; and (d) adjust the pruning technique to the heterogeneous GPU environments, considering that the DP matrix might not be partitioned evenly among the GPUs.
Experimental results obtained with real DNA sequences with sizes varying from 1 to 63 MBP in two computing environments show that very good gains were attained with Multi-BP. The execution time of the comparison of chromosome 20 (human x chimpanzee) in the same heterogeneous environment (GTX 980 Ti + GTX 680) was reduced from 8h17min (without Multi-BP) to 4h55min (with Multi-BP).
The remainder of this paper is organized as follows. In Section II we present the pairwise sequence alignment problem and in Section III we discuss pruning approaches and the block pruning technique. Section IV discusses solutions that execute biological sequence comparisons in multiple GPUs. Section V describes the design of Multi-BP and Section VI details the experiments. Finally, Section VII concludes the paper.
II. PAIRWISE BIOLOGICAL SEQUENCE COMPARISON
The field of Bioinformatics [1] demands continuous processing improvements. Due to the huge volume of data and performance requirements, new parallel algorithms and tools are proposed regularly, aiming to provide faster executions. In particular, the alignment of biological sequences (proteins, ...
We obtained the optimal score of the chromosome 21 human x chimpanzee comparison (46 MBP x 47 MBP) using 4 NVidia Pascal GPUs in 56 minutes (680.81 GCUPS).
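As a sanity check on the GCUPS figure (using the rounded sizes above): a 46 MBP x 47 MBP comparison fills about 46x10^6 x 47x10^6 = 2.16x10^15 cells, and dividing by 56 minutes (3,360 s) gives roughly 643x10^9 cells updated per second, consistent with the reported 680.81 GCUPS once the unrounded sequence sizes and exact time are used.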
• Challenge: adapt the workload to a dynamic pruning scenario.
– Execution is paused at some points (breakpoints), which adds overhead.
– Use it in cases where the load-balancing benefits are higher than the overhead (a rebalancing sketch follows the diagram below).
– Sequences that are not very similar do not have a big pruning area.
Multi-GPU	with	Pruning
Load	Balancing	– ongoing	work
Diagram: column redistribution among GPU1–GPU4, without breakpoint vs. with breakpoint.
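A minimal sketch of the rebalancing idea (assumed semantics for this ongoing work; rebalanceColumns and cellsDone are hypothetical names): at a breakpoint, the columns still to be processed are redistributed proportionally to the work each GPU completed since the last breakpoint, so faster GPUs, or GPUs whose partitions were largely pruned, receive wider slices.

#include <cstdio>
#include <numeric>
#include <vector>

// cellsDone[i]: cells GPU i actually computed since the last breakpoint.
std::vector<long> rebalanceColumns(const std::vector<long>& cellsDone,
                                   long remainingColumns) {
    long total = std::accumulate(cellsDone.begin(), cellsDone.end(), 0L);
    std::vector<long> widths(cellsDone.size());
    long assigned = 0;
    for (size_t i = 0; i + 1 < cellsDone.size(); ++i) {
        widths[i] = remainingColumns * cellsDone[i] / total;  // proportional
        assigned += widths[i];
    }
    widths.back() = remainingColumns - assigned;  // remainder to last GPU
    return widths;
}

int main() {
    // A GPU that computed more cells (faster, or less pruned) gets more work.
    for (long w : rebalanceColumns({900, 500, 400, 200}, 1000000))
        std::printf("%ld ", w);
}

The trade-off named above applies: pausing at a breakpoint costs synchronization time, so rebalancing pays off only when the measured imbalance is large.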
Multi-GPU	with	Pruning
Score	sharing	+	cyclic	+	load	balancing
Chart: 63M x 62M sequence comparison, initial split 1:1:1:1:1:1:1:1:1.
Variants compared: MASA-CUDAlign, MASA-CUDAlign+sscore, MASA-CUDAlign+sscore+1break, MASA-CUDAlign+sscore+1break+LB, MASA-CUDAlign+sscore+2break+LB.
Best throughput: 2.75 TCUPS.
Multi-GPU	with	Pruning
2	Machines	(IBM	Power9	+	8	V100)
Multi-GPU	with	Pruning
Score	sharing	+	cyclic	+	load	balancing
• Best result in the literature for GPUs:
– 10.3 TCUPS with 384 NVidia M2090 GPUs + Intel CPUs
• Result obtained in this platform:
– 2.7 TCUPS with 8 NVidia Volta GPUs + Power9 CPUs
• Assuming near-linear scaling from 2.7 TCUPS on 8 GPUs (about 0.34 TCUPS per V100), we estimate that we are able to beat the best result for GPUs (10.3 TCUPS) with 40 NVidia V100 GPUs, and the best theoretical result (53 TCUPS) with 256 NVidia V100 GPUs.
• Biological sequence comparison
• MASA-CUDAlign (one GPU)
• MASA-CUDAlign (multiple GPUs)
• MASA-CUDAlign with pruning
• MASA-OpenMP CPU studies
Agenda
OpenPower Webinar	– June,	19th,	2020
• We compared MASA-OpenMP (CPU) running on the IBM Power9, Intel i7 and Intel Xeon platforms for the 1M x 1M comparison (a sketch of the anti-diagonal CPU parallelization follows this slide).
• Intel i7 (4 cores - Skylake)
– GCUPS: 1.1, time: 16 minutes (962.9 seconds)
• Intel Xeon (24 cores – Haswell)
– GCUPS: 4.5, time: 4 minutes (247.16 seconds)
• IBM Power (22 cores – Power9)
– GCUPS: 6.1, time: 3 minutes (181.4 seconds)
MASA	in	CPUs	(MASA-OpenMP)
MASA-OpenMP – CPU	Comparison
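For reference, a minimal sketch of the kind of CPU parallelism involved (illustrative only, not the actual MASA-OpenMP code; swScore and its scoring parameters are assumptions): the Smith-Waterman matrix is swept by anti-diagonals, and each anti-diagonal is computed with an OpenMP parallel for, since its cells are mutually independent.

#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Linear-gap SW score, parallelized over anti-diagonals (i + j == d).
long swScore(const std::string& s0, const std::string& s1,
             int ma = 1, int mi = -1, int g = 2) {
    int m = (int)s0.size(), n = (int)s1.size();
    std::vector<std::vector<long>> H(m + 1, std::vector<long>(n + 1, 0));
    long best = 0;
    for (int d = 2; d <= m + n; ++d) {
        int iLo = std::max(1, d - n), iHi = std::min(m, d - 1);
        // Cells of one anti-diagonal only read diagonals d-1 and d-2.
        #pragma omp parallel for reduction(max : best)
        for (int i = iLo; i <= iHi; ++i) {
            int j = d - i;
            long p = (s0[i - 1] == s1[j - 1]) ? ma : mi;  // match/mismatch
            long h = std::max({0L, H[i - 1][j - 1] + p,
                               H[i - 1][j] - g, H[i][j - 1] - g});
            H[i][j] = h;
            best = std::max(best, h);
        }
    }
    return best;
}

int main() { std::printf("%ld\n", swScore("ATAGCTATAC", "GCTATAGCT")); }

The full tool also computes the alignment itself; the sketch covers only the score computation that dominates the execution times above.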
• We used MASA-OpenMP (CPU) for
these comparisons since the sars-cov-2
sequences are short (about 30 thousand
characters)
• We first compared sars-cov-2 sequences
from China, Brazil, USA, India and
Japan
– Conclusion: very similar sequences
• We then compared sars-cov-2
sequences from Brazil to mers and sars:
– Even though the sequences are quite
similar, there are regions of interest
MASA	in	CPUs	(MASA-OpenMP)
Ongoing	covid-19	study	– IBM	Power9
Query: 1 -ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATC 60
||| || || ||| ||| | || | || || | ||| ||||||| ||| | [-42]/[-42]
Sbjct: 1 GATTTAA-GTGAATAGCTT---GGCTATCTCACTTCCC--C--TCGTTCTCTTGCAGAAC 53
Query: 60 TGTTCTCTAAACGAACTT---TAAAATC-----TGTGTGGC---T-GT--CAC---TC-- 101
| | | | ||||||||| ||||| | ||| | || | || ||| || [-50]/[-92]
Sbjct: 53 TTTGATTTTAACGAACTTAAATAAAAGCCCTGTTGTTTAGCGTATCGTTGCACTTGTCTG 113
Query: 101 ----------GGC-------TGCATGCT----TAG----TG--CA----CTC-ACGC--A 127
||| ||| |||| ||| || || ||| || | [-75]/[-167]
Sbjct: 113 GTGGGATTGTGGCATTAATTTGCCTGCTCATCTAGGCAGTGGACATATGCTCAACACTGG 173
Query: 127 GTATAATTAATAACTAATTACTGTCGTTGACAG-GACA-CG-AGTAACTCGTCTATCTTC 184
|||||||| ||| | | |||| | || ||| | | || || ||| | || | || [-39]/[-206]
Sbjct: 173 GTATAATT-CTAATTGAATACTAT--TTTTCAGTTAGAGCGTCGTGTCTCTTGTA-CGTC 229
Query: 184 TGCAGGCTGCTT--ACGGTTTCGTCCGTGTTGC-AGCCGATCATCAGCACATCTAG-GTT 240
| | | | | ||||||||||||| | ||| | | || ||||||| | || [-47]/[-253]
Sbjct: 229 T-CGGTCACAATACACGGTTTCGTCCG-G-TGCGTGGCAATTCGGGGCACATCATGTCTT 286
Query: 240 TCGT-CCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAA 299
|||| | |||||||||| | ||| | | | || || | || | |||| [-49]/[-302]
Sbjct: 286 TCGTGGCTGGTGTGACCG---CGCAAGGT-GCGCGC--GGTAC---GT---ATCGAG--- 331
Query: 299 ACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGAC---GTG-CTCGTAC-GTG 354
|| || |||||| || ||| || ||| ||| ||| || ||| [-56]/[-358]
Sbjct: 331 -CA-GCGCTCAACTC--------TGAAAAACA---TCAAGACCATGTGTCTCTAACTGTG 378
Query: 354 GCTTTGGAGACTCCGTGG---AGGAGGTCTTATCAGAGGCAC-GTCAACAT-CTT-AAAG 408
| |||| |||| |||| || | || || ||| ||| || | | [-64]/[-422]
Sbjct: 378 CC-------ACTCTGTGGTTCAGGAAACCTGGT-TGAAAAACTTTCACCATGGTTCATGG 430
Query: 408 ATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGA---ACAGC 465
Ongoing	covid-19	study	– IBM	Power9
(sars-cov-2	Brazil	vs	mers)
Thanks	to…
• My	former	PhD	student,	Edans F.	O.	Sandes,	my	PhD	
student	Marco	Figueiredo Jr.	and	my	undergrad	student	
Bernardo	Nascimento
• George	Teodoro,	UFMG,	Brazil
• Maria	Emilia	Walter, University	of	Brasilia,	Brazil
• Eduard	Ayguade,	Xavier	Martorell and	Guillermo	
Miranda,	Universitat Politecnica de	Catalunya and	
Barcelona	Supercomputing	Center
• And Azzedine Boukerche, University of Ottawa, Manuel Ujaldon, University of Malaga, Samuel Thibault, University of Bordeaux, Genaina Rodrigues, University of Brasilia, Celia Ralha, University of Brasilia, a couple of MSc students and many undergrad students
Thanks	to…
• Barcelona	Supercomputing	Center,	Spain,	for	providing	access	
to	the	Minotauro GPU	cluster	(M2090)
• XSEDE Platform, USA, for providing access to the Keeneland Fullscale System (M2090), hosted at Georgia Tech, and the Comet cluster (P100), hosted at UC San Diego.
• NVidia Brazil,	for	providing	access	to	their	platform	(P100)
• IBM	USA	and	U.	Oregon	for	providing	access	to	their	Power9	+	
V100	platforms
The MASA code, including MASA-CUDAlign and
MASA-OpenMP, is available at
https://github.com/edanssandes/MASA-Core
MASA	code
The	MASA	code	(GPU,	CPU,	Intel	Phi)	was	used	in	the	following	institutions:
Brazil	– University	of	Brasilia,	Fed	Univ Rio	Grande	do	Sul, NVidia	Brazil,	NEC	Brazil
Croatia	– University	of	Zagreb
France	– University	of	Bordeaux
India	- Manonmaniam Sundaranar University
Japan	– NEC	Japan
Singapore	– Agency	for	Science	Technology	and	Research
Spain – Polytechnic University of Catalonia and University of Malaga
USA	– University	of	Delaware	and	IBM	USA
We	are	open	to	collaborations!
More Related Content

What's hot

Dear - 딥러닝 논문읽기 모임 김창연님
Dear - 딥러닝 논문읽기 모임 김창연님Dear - 딥러닝 논문읽기 모임 김창연님
Dear - 딥러닝 논문읽기 모임 김창연님
taeseon ryu
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
Kyong-Ha Lee
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
Kyong-Ha Lee
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
Sunghoon Joo
 
DNR - Auto deep lab paper review ppt
DNR - Auto deep lab paper review pptDNR - Auto deep lab paper review ppt
DNR - Auto deep lab paper review ppt
taeseon ryu
 
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Robin Srivastava
 
D0325016021
D0325016021D0325016021
D0325016021
theijes
 
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
The Statistical and Applied Mathematical Sciences Institute
 
Recent Progress in RNN and NLP
Recent Progress in RNN and NLPRecent Progress in RNN and NLP
Recent Progress in RNN and NLP
hytae
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
ijtsrd
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
Charles Martin
 
A New Cross Diamond Search Motion Estimation Algorithm for HEVC
A New Cross Diamond Search Motion Estimation Algorithm for HEVCA New Cross Diamond Search Motion Estimation Algorithm for HEVC
A New Cross Diamond Search Motion Estimation Algorithm for HEVC
IJERA Editor
 
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPINGTOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
ijdkp
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)
RCCSRENKEI
 
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
Voica Gavrilut
 
Iaetsd design and implementation of pseudo random number generator
Iaetsd design and implementation of pseudo random number generatorIaetsd design and implementation of pseudo random number generator
Iaetsd design and implementation of pseudo random number generator
Iaetsd Iaetsd
 
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
ijcsa
 
Kassem2009
Kassem2009Kassem2009
Kassem2009
lazchi
 

What's hot (20)

Dear - 딥러닝 논문읽기 모임 김창연님
Dear - 딥러닝 논문읽기 모임 김창연님Dear - 딥러닝 논문읽기 모임 김창연님
Dear - 딥러닝 논문읽기 모임 김창연님
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
 
DNR - Auto deep lab paper review ppt
DNR - Auto deep lab paper review pptDNR - Auto deep lab paper review ppt
DNR - Auto deep lab paper review ppt
 
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;Paper  on experimental setup for verifying  - &quot;Slow Learners are Fast&quot;
Paper on experimental setup for verifying - &quot;Slow Learners are Fast&quot;
 
D0325016021
D0325016021D0325016021
D0325016021
 
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
MUMS Opening Workshop - Machine-Learning Error Models for Quantifying the Epi...
 
Fulltext
FulltextFulltext
Fulltext
 
Recent Progress in RNN and NLP
Recent Progress in RNN and NLPRecent Progress in RNN and NLP
Recent Progress in RNN and NLP
 
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022 Weight watcher Bay Area ACM Feb 28, 2022
Weight watcher Bay Area ACM Feb 28, 2022
 
A New Cross Diamond Search Motion Estimation Algorithm for HEVC
A New Cross Diamond Search Motion Estimation Algorithm for HEVCA New Cross Diamond Search Motion Estimation Algorithm for HEVC
A New Cross Diamond Search Motion Estimation Algorithm for HEVC
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPINGTOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)
 
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
Fault-tolerant topology and routing synthesis for IEEE time-sensitive network...
 
Iaetsd design and implementation of pseudo random number generator
Iaetsd design and implementation of pseudo random number generatorIaetsd design and implementation of pseudo random number generator
Iaetsd design and implementation of pseudo random number generator
 
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
 
Kassem2009
Kassem2009Kassem2009
Kassem2009
 

Similar to Parallel Biological Sequence Comparison in GPU Platforms

Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
Graph Signal Processing for Machine Learning A Review and New Perspectives - ...Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
lauratoni4
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä
 
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
ijma
 
6. Implementation
6. Implementation6. Implementation
Semantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite ImagerySemantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite Imagery
RAHUL BHOJWANI
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGenevsachde
 
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...Filipo Mór
 
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Masahito Ohue
 
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
Mokhtar SELLAMI
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
Amritesh Srivastava
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
csandit
 
Compressing Graphs and Indexes with Recursive Graph Bisection
Compressing Graphs and Indexes with Recursive Graph Bisection Compressing Graphs and Indexes with Recursive Graph Bisection
Compressing Graphs and Indexes with Recursive Graph Bisection
aftab alam
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
Jason Riedy
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 

Similar to Parallel Biological Sequence Comparison in GPU Platforms (20)

Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
Graph Signal Processing for Machine Learning A Review and New Perspectives - ...Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
Graph Signal Processing for Machine Learning A Review and New Perspectives - ...
 
vega
vegavega
vega
 
IMQA Paper
IMQA PaperIMQA Paper
IMQA Paper
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ...
 
6. Implementation
6. Implementation6. Implementation
6. Implementation
 
Semantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite ImagerySemantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite Imagery
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...
Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...
 
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
 
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
CARI2020: A CGM-Based Parallel Algorithm Using the Four-Russians Speedup for ...
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin...
 
Compressing Graphs and Indexes with Recursive Graph Bisection
Compressing Graphs and Indexes with Recursive Graph Bisection Compressing Graphs and Indexes with Recursive Graph Bisection
Compressing Graphs and Indexes with Recursive Graph Bisection
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Todtree
TodtreeTodtree
Todtree
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 

Recently uploaded

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Parallel Biological Sequence Comparison in GPU Platforms

  • 1. Alba Cristina Magalhaes Alves de Melo Full Professor at the University of Brasilia (UnB) CNPq/Brazil Research Fellow level 1C IEEE Senior Member Parallel Biological Sequence Comparison in GPU Platforms: Research at the University of Brasilia OpenPower Webinar – June, 19th, 2020 https://www.compoundchem.com/2015/03/24/dna/> 3 Comments < poundchem.com/2015/03/24/dna/#disqus_thread>
  • 2. • Biological Sequences are obtained with sequencing machines, using chromatography analysis. This image cannot currently be displayed. DNA sequence Introduction
  • 3. • Once a biological sequence is obtained, its properties/characteristics need to be established. • This is mainly done by comparing the new sequence with sequences that have already been catalogued. • The comparison of biological sequences is one of the most important operations in Bioinformatics. – Its goal is to define how similar the sequences are. Introduction
  • 4. • There are exact algorithms that compare two sequences and produce the optimal result. They have quadratic time and memory complexity - O(mn), where m and n are the size of the sequences. • Heuristic methods are faster and are used in many genome projects. – However, they do not guarantee that the optimal result will be produced. Introduction
  • 5. • Smith-Waterman (SW) proposed an exact algorithm based on dynamic programming to locally align two sequences in quadratic time and memory. – It produces the optimal result. – High execution times and huge memory requirements. • To compare the human chromosome 1 with the chimpanzee chromosome 1 (249 Millions of Base Pairs - MBP x 228 MBP), at least 240 Petabytes of memory are needed. ØThis SW comparison was considered unfeasible in 2008. Introduction
  • 6. • Present and discuss our MASA-CUDAlign strategy to compare huge chromosomes in GPUs – We use a highly optimized algorithm to exploit parallelism – We use speculative techniques to accelerate the sequential part of the algorithm – We were able to use up to 384 NVidia M2090 GPUs to compare the human and chimpanzee homologous chromosomes 1 in 2016 – We will show preliminary results of the next version of our tool in the IBM platform with 8 NVidia Volta • Present and discuss our MASA-OpenMP results in Power9 for smaller sequences – We will show comparative results Power vs Intel – We will present preliminary covid-19 results Goal of this talk
  • 7. • Biological sequence comparison • MASA-CUDAlign (one GPU) • MASA-CUDAlign (multiple GPUs) • MASA-CUDAlign with pruning • MASA-OpenMP CPU studies Agenda OpenPower Webinar – June, 19th, 2020
  • 8. • To compare two sequences, one sequence is placed above the other and a score is computed. G A C G G A T T A G G A T C G G A A T A GS0 S1 G G S0 S1 +1 A A +1 match - T -2 gap C C +1 G G +1 G G +1 A A +1 match T A -1 mismatch T T +1 A A +1 G G +1 match +6 score alignment 11 characters (Base Pairs - BP) Biological Sequence Comparison
  • 9. • Based on dynamic programming with quadratic time and memory complexity (O(mn)). • Executes in two steps: • (1) calculate the DP matrix (similarity score) and • (2) traceback (alignment) • Having sequences S0 and S1 as input, with sizes m and n, Hm+1,n+1 is computed: H[i, j] = max H[i, j −1]− g H[i −1, j]− g H[i −1, j −1]+ p(i, j) 0. # $ % % & % % p(i,j) = ma, if s[i] = t[j] mi, otherwise Gap penalty match mismatch Smith-Waterman (SW) Algorithm
  • 10. - A T A G C T A 0 0 0 0 0 0 0 0 A 0 ! 1 0 ! 1 0 0 0 ! 1 T 0 0 ! 2 0 0 0 ! 1 0 A 0 ! 1 0 ! 3 0 0 0 ! 2 C 0 0 0 " 1 ! 2 ! 1 0 0 G 0 0 0 ! 1 ! 2 ! 1 0 0 C 0 0 0 0 0 ! 3 0 0 T 0 0 ! 1 0 0 " 1 ! 4 #2 C 0 0 0 0 0 ! 1 " 2 ! 3 T 0 0 ! 1 0 0 0 ! 2 " 1 T 0 0 ! 1 0 0 0 0 ! 1 highest score (traceback path) Local Alignment: A T A - G C T A T A C G C T values: g=-2, mi=-1, ma=1 S1 S0 first row and column are initialized with zeros max(2,-2,-2,0) Smith-Waterman (SW) Example
  • 11. • [Gotoh 1982]: Computes the affine gap model, where the value assigned to start a sequence of gaps (GapOpen) is higher than the value assigned to extend it (GapExtend) • Computes 3 DP matrices and provides a better biological result • Time and memory complexity (O(mn)) • [Hirschberg 1977]: Computes the linear gap model in linear memory, with a divide and conquer recursive approach • Time complexity (O(mn)), memory complexity (O(m+n)) • [Myers-Miller 1988]: computes the affine gap model in linear memory, with a modified version of Hirschberg’s • Time complexity (O(mn)), memory complexity (O(m+n)) Smith-Waterman (SW) Variants
  • 12. Smith-Waterman (SW) and its Variants Wavefront method (i,j) depends on (i-1,j), (i-1,j-1) and (i,j-1) up left diag Smith-Waterman (SW) and its Variants Wavefront method minimum parallelism maximum parallelism (i,j) depends on (i-1,j), (i-1,j-1) and (i,j-1) up left diag Clicktoenlarge minimum parallelism maximum parallelism (i,j) depends on (i-1,j), (i-1,j-1) and (i,j up left diag Clicktoenlarge minimum parallelism maximum parallelism (i,j) depends on (i-1,j), (i-1,j-1) left d0 d1 d2 d3 d4d5d6 d0 d1 d2 d3 d4d5d6 d0 d1 d2 d3 d4d5d6 d0 d1 d2 d3 d4d5d6 • Each anti-diagonal can be computed in parallel • m+n-1 antidiagonals • Non-uniform parallelism minimum at the beginning (d0) maximum at the main antidiagonal (d4) minimum at the end (d7)
  • 13. • Biological sequence comparison • MASA-CUDAlign (one GPU) • MASA-CUDAlign (multiple GPUs) • MASA-CUDAlign with pruning • MASA-OpenMP CPU studies Agenda OpenPower Webinar – June, 19th, 2020
  • 14. • Goal: compare huge DNA sequences in GPU with a combination of Gotoh and Myers-Miller algorithms – CUDAlign 1.0: similarity score in 1 GPU – CUDAlign 2.0: score and alignment in 1 GPU – CUDAlign 2.1: score and alignment in 1 GPU with pruning – CUDAlign 3.0: similarity score in several GPUs – MASA-CUDAlign 4.0: score and alignment in several GPUs • PhD Thesis - Edans F. O. Sandes (Awarded the Best PhD Thesis in Computer Science in Brazil - 2016) • Wilkes Award 2019 – Best paper - The Computer Journal in 2018 MASA-CUDAlign: Goal and VersionsMASA-CUDAlign: Goal and versions
  • 15. 1 - Find the best score (GPU) 2 - Partial traceback (GPU) (3) Split partitions (GPU) (4) Align partitions (CPU) (5) Full alignment (CPU) crosspoint MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Stage 1 (Compute the DP matrix) Stages 2 to 5 (Traceback)
  • 16. •The DP Matrix is divided into grid blocks and a set of grid blocks compose an external diagonal. •Each external diagonal is composed by B blocks where each block is calculated by T threads. Each thread will compute a rows. •Each CUDA kernel is invoked to calculate one external diagonal. B=3; T=3; α=2 Size(S0)=36, Size(S1)=36 B1 B2 B3 MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Stage 1 (Compute the DP matrix)
  • 17. • Computation of each block stops at full parallelism and remaining cells are delegated to the next invocation. G0,0 G0,1 G0,2 G1,0 G1,1 G1,2 G2,0 G2,1 G2,2 G3,0 G3,1 G3,2 G4,0 G4,1 G4,2 G5,0 G5,1 G5,2 G6,0 G6,1 G6,2 G7,0 G7,1 G7,2 first external diagonals processed external diagonals in the middle of the matrices external diagonals with non-contiguous cells delegation small loss of parallelism MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Stage 1 (Parallelogram execution)
  • 18. • The goal of the block pruning optimization is to eliminate the calculation of blocks of cells that surely do not belong to the optimal alignment. • These blocks have such a small score that it is not mathematically possible to lead to a score higher than a score that has already been produced.calculating the original MM similar to FastLSA (Section 3.1), stead of memory because there pace. Also, saving rows to disk omputation in case of interrup- intervention, among others). flushed to disk are taken from n 4.1) at a certain interval of bus contains the cells of the last that are multiple of the block be considered a special row. ETRIEVING SMITH-WATERMAN ALIGNMENTS WITH OPTIMIZATIONS FOR MEGABASE BIOLOGICAL... 7 MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Stage 1 (Block Pruning) Hi,j = 100, Di = 50, Hmax(i,j) = 150 best_score = 230 pruning = true
  • 19. 14 E. Sandes, G. Teodoro, M. Walter, E. Ayguade, X. Mar FIGURE 14. Geometrical representation of the pruning case f2: j > i and Hi min(i, j)ma.p |i i.ma.p (j i)G + (m j(G case f3: j  i and Hi Hi,j + min( For similar sequences, the pruning area is characterized by four lines (f1, f2, f3, f4), forming two polygons that are connected in the end of the alignment Gray area: not processed MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Stage 1 (Block Pruning)
  • 20. MASA-CUDAlign: Goal and Versions MASA-CUDAlign in one GPU Publications (CUDAlign 1.0) CUDAlign: Using GPU to Accelerate the Comparison of Megabase Genomic Sequences Edans Flavius de O. Sandes Alba Cristina M. A. de Melo University of Brasilia (UnB), Brazil {edans,albamm}@cic.unb.br Abstract Biological sequence comparison is a very important oper- ation in Bioinformatics. Even though there do exist exact methods to compare biological sequences, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Never- theless, all of them restrict the size of the smallest se- quence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUD- Align, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested in two GPU boards, separately. For real sequences whose size range from 1MBP (Megabase Pairs) to 47MBP, a close to uniform GCUPS (Giga Cells Updates per Sec- ond) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes are compared with an exact method. Categories and Subject Descriptors D.1.3 [Program- ming Techniques]: Concurrent Programming; J.3 [Life and Medical Sciences]: Biology and Genetics General Terms Algorithms Keywords Biological Sequence Comparison, Smith- Waterman, GPU 1. Introduction In the last four years, new DNA sequencing technologies have been developed that allow a hundred-fold increase in the throughput over the traditional method. This means that the genomic databases, that have already an expo- nential growth rate, will experience an unprecedented in- crease in their sizes. Therefore, a huge amount of new DNA sequences will need to be compared, in order to in- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPoPP’10, January 9–14, 2010, Bangalore, India. Copyright c⃝ 2010 ACM 978-1-60558-708-0/10/01. . . $10.00 fer functional/structural characteristics. In this scenario, the time spent in each comparison, as well as the accuracy of the result obtained, will be a fundamental factor to de- termine the success/failure of the next generation genome projects. Sequence comparison is, thus, a very basic and impor- tant operation in Bioinformatics. As a result of this step, one or more sequence alignments can be produced [1]. A sequence alignment has a similarity score associated to it that is obtained by placing one sequence above the other, making clear the correspondence between the characters and possibly introducing gaps into them [2]. The most common types of sequence alignment are global and lo- cal. To solve a global alignment problem is to find the best match between the entire sequences. On the other hand, local alignment algorithms must find the best match be- tween parts of the sequences. 
One important issue to be considered is how gaps are treated. A simple solution assigns a constant penalty for gaps. However, it has been observed that keeping gaps together represents better the biological relationships. Hence, the most widely used model among biologists is the affine gap model [3], where the penalty for opening a gap is higher than the penalty for extending it.
Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept that uses dynamic programming to find local alignments between two sequences of size m and n in O(mn) space and time. In this algorithm, a similarity matrix of size (m + 1) × (n + 1) is calculated. SW is very accurate but it needs a lot of computational resources.
In order to reduce execution time, heuristic methods such as BLAST [5] were proposed. These methods combine exact pattern matching with dynamic programming in order to produce good solutions faster. BLAST can align sequences in a very short time, still producing good results. Nevertheless, there is no guarantee that the best result will be produced.
Therefore, many efforts were made to develop methods and techniques that execute the SW algorithm in high performance architectures, allowing the production of exact results in a shorter time. One recent trend in high performance architectures is the Graphics Processing Units (GPUs). In addition to the usual graphics functions, recent GPU architectures are able to execute general purpose algorithms (GPGPUs). These GPUs contain elements that execute massive vector operations in a highly parallel way. Because of their TFlops peak performance and their availability in PC desktops, the utilization of GPUs is rapidly increasing in many scientific areas.
Conference: ACM PPoPP 2010
[Table 5: the real sequences used in the tests (from 543K×536K up to 32799K×46944K cells), with their accession numbers, the best local score and its end position. Table 6: BLAST runtimes and scores for the same comparisons.] For the human × chimpanzee chromosome comparison, BLAST finished its execution with a segmentation fault, due to an out-of-memory error.
Conclusion and Future Work: in this paper, we proposed and evaluated CUDAlign, a GPU-accelerated version of Smith-Waterman (SW) that compares two Megabase genomic sequences. Differently from the previous GPU Smith-Waterman proposals in the literature, our proposal does not impose severe restrictions on the size of the smallest sequence [...] in order to exploit the characteristics of the GPU memory hierarchy.
[Figure 10: runtimes (seconds) × DP matrix size (cells) in logarithmic scale, for the 8600GT (1,968 MCUPS) and GTX 280 (20,375 MCUPS) boards. Results show scalability and an almost constant MCUPS ratio for Megabase sequences (cells ≥ 1e+12).]
We obtained the optimal score of the chromosome 21 human × chimpanzee comparison (32 MBP × 47 MBP) using the NVidia GTX 280 GPU in 21 hours. GCUPS (Giga Cells Updated Per Second): 20.3
  • 21. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in one GPU / Publications (CUDAlign 2.0)
Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space
Edans Flavius de O. Sandes, Alba Cristina M. A. de Melo, Department of Computer Science, University of Brasilia (UnB), Brasilia, Brazil {edans,albamm}@cic.unb.br
Abstract: Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in Bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as unfeasible, due to huge computing and memory requirements. However, high performance computing platforms such as GPUs are being able to change this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses GPUs to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time. Using the GTX 285 board, our algorithm was able to produce the optimal alignment between sequences composed of 33 Millions of Base Pairs (MBP) and 47 MBP in 18.5 hours.
I. Introduction: In the last decade, genome projects have produced a huge amount of new biological data. In order to better understand a newly sequenced organism, biologists compare its sequence against millions of other sequences, in order to infer properties. Sequence comparison is, thus, one of the most important mechanisms in Bioinformatics.
One of the first exact methods to globally compare two sequences is Needleman-Wunsch (NW) [1]. It is based on dynamic programming (DP) and has time and space complexity O(mn), where m and n are the sizes of the sequences. The NW algorithm was modified by Smith-Waterman (SW) [2] to deal with local alignments. In SW, a linear gap function was used. Nevertheless, in nature, gaps tend to occur together. For this reason, the affine gap model is often used, where the penalty for opening a gap is higher than the penalty for extending it. Gotoh [3] modified the SW algorithm, without changing time and space complexity, to include affine gap penalties.
One of the most restrictive characteristics of SW and its variants is the quadratic space needed to store the DP matrices. For instance, in order to compare two 30 MBP sequences, we would need at least 3.6 PB of memory. This fact was observed by Myers-Miller [4], who proposed the use of Hirschberg's algorithm [5] to compute global alignments in linear space. The algorithm uses a divide and conquer technique that recursively splits the DP matrix to obtain the optimal alignment.
In the last years, Graphics Processing Units (GPUs) have received a lot of attention because of their TFlops peak performance and their availability in PC desktops. In the Bioinformatics research area, there are some implementations of SW in GPUs [6, 7, 8, 9, 10, 11, 12, 13] that were able to obtain the similarity score with very good speedups. Nevertheless, with the exception of CUDAlign 1.0 [13], all of them define a maximum size for the query sequence. That means that two huge sequences cannot be compared in such implementations. As far as we know, the only strategies that are able to retrieve the alignment in GPUs are [6] and [12]. Since both of them execute in quadratic space, the sizes of the sequences to be compared are severely restricted.
In this paper, we propose and evaluate CUDAlign 2.0, a new algorithm using GPU that is able to retrieve the alignment of huge sequences with the SW algorithm, using the affine gap model. Our implementation is only bound by the total available global memory in the GPU and the available disk space in the desktop. Our algorithm receives two input sequences and provides the optimal alignment as output. It runs in 6 stages, where the first three stages are executed in GPU and the last three stages run in CPU. The first stage executes SW [2] to retrieve the best score and its position in the DP matrices, as in CUDAlign 1.0 [13]. Also, some special rows are saved to disk. The goal of stages 2, 3 and 4 is to retrieve points where the optimal alignment occurs in special rows/columns, thus creating small sub-problems. In stage 5, the alignments for each sub-problem are obtained and concatenated to generate the full alignment. In stage 6, the alignment can be visualized.
The proposed algorithm was implemented in CUDA and C++ and executed in the GTX 285 board. With our algorithm, we were able to retrieve the alignment between the human chromosome 21 and the chimpanzee chromosome 22, with respectively 47 MBP and 33 MBP, in 18.5 hours, using reasonable disk area and GPU memory.
Conference: IEEE IPDPS 2011
V. Experimental Results: CUDAlign 2.0 was implemented in CUDA 3.1 and C++ and run on an NVIDIA GeForce GTX 285. This board has [...] memory, 30 multiprocessors and 240 cores. It was hosted by an Intel Pentium Dual-Core 3 GHz with 3 GB RAM, running Ubuntu 10.04, Linux kernel 2.6.32. The Smith-Waterman parameters were set to: match: 1; mismatch: -3; first gap: 5; extension gap: 2. The configurations used for GTX 285 were [...] = 4, [...] = 2^6, B2 = B3 = 60 and T2 = T3 = 2^7. The number of blocks may be reduced during runtime in order to meet the minimum size requirement in each stage. Note that the number of blocks must preferably be a multiple of the number of multiprocessors (30 for GTX 285). By doing so, better performance can be achieved because the multiprocessors do not become idle when they reach the last external diagonal. Times and GCUPS were measured with real DNA sequences retrieved from the NCBI site (www.ncbi.nlm.nih.gov). The sizes of these real sequences range from 162 KBP to 47 MBP, and the sequences are the same used in [13].
[Table IV: runtimes (in seconds) of Stage 1 of CUDAlign 2.0 flushing special rows to disk, for comparisons from 162K×172K up to 32799K×46944K. The overhead of saving special rows depends on the sizes of the SRA and the sequences.]
Runtimes of each stage of CUDAlign 2.0 on the GTX 285, varying the comparison size (Table V; times in seconds, including all the stages and the I/O of reading the sequences):

Comparison      Stage 1  Stage 2  Stage 3  Stage 4  Stages 5+6  Total
162K×172K           1.5     <0.1     <0.1     <0.1        <0.1    1.8
543K×536K          13.6     <0.1     <0.1     <0.1        <0.1   13.9
1044K×1073K        51.6      3.1      1.0      5.4         0.1   61.6
3147K×3283K         448      0.1     <0.1      0.3        <0.1    449
5227K×5229K        1185     65.9     20.3     47.6         1.9   1321
7146K×5227K        1604     <0.1     <0.1     <0.1        <0.1   1605
23012K×24544K     23750      0.3     <0.1      0.7        <0.1  23755
32799K×46944K     65153      805      236      376           9  66579

[The runtime plot shows the scalability] of our approach, with almost constant GCUPS for megabase sequences. Note that, for sequences longer than 3 MBP, CUDAlign 2.0 is able to achieve a sustained performance [...]
We obtained the optimal alignment of the chromosome 21 human × chimpanzee comparison (32 MBP × 47 MBP) using the NVidia GTX 285 GPU in 18.5 hours. GCUPS (Giga Cells Updated Per Second): 23.6
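The role of the special rows saved in Stage 1 can be illustrated with a short sketch (hypothetical code, not the CUDAlign source): the forward pass keeps only two rows of the DP matrix in memory and flushes every k-th row to the Special Rows Area (SRA), so the later traceback stages can restart the computation from a nearby saved row instead of from row 0. The linear gap model and the scoring values (+1/-3/-5) are simplifications for brevity; CUDAlign itself uses Gotoh's affine-gap recurrences.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Stage-1 sketch: linear memory in flight, O(m*n/k) cells in the SRA on disk.
void stage1_with_special_rows(const std::string& s0, const std::string& s1,
                              size_t k /* flush every k-th row */, std::FILE* sra) {
    std::vector<int> prev(s1.size() + 1, 0), cur(s1.size() + 1, 0);
    for (size_t i = 1; i <= s0.size(); ++i) {
        for (size_t j = 1; j <= s1.size(); ++j) {
            int diag = prev[j - 1] + (s0[i - 1] == s1[j - 1] ? 1 : -3);
            cur[j] = std::max({0, diag, prev[j] - 5, cur[j - 1] - 5});
        }
        if (i % k == 0)   // special row: append to the Special Rows Area
            std::fwrite(cur.data(), sizeof(int), cur.size(), sra);
        std::swap(prev, cur);
    }
}
```

The traceback stages then only ever recompute the band of at most k rows between two saved special rows, which is what bounds the recomputation cost.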
  • 22. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in one GPU / Publications (CUDAlign 2.1)
Retrieving Smith-Waterman Alignments with Optimizations for Megabase Biological Sequences Using GPU
Edans Flavius de O. Sandes and Alba Cristina M. A. de Melo, Senior Member, IEEE
Abstract: In Genome Projects, biological sequences are aligned thousands of times, on a daily basis. The Smith-Waterman algorithm is able to retrieve the optimal local alignment with quadratic time and space complexity. So far, aligning huge sequences, such as whole chromosomes, with the Smith-Waterman algorithm has been regarded as unfeasible, due to huge computing and memory requirements. However, high-performance computing platforms such as GPUs are making it possible to obtain the optimal result for huge sequences in reasonable time. In this paper, we propose and evaluate CUDAlign 2.1, a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations which are able to reduce significantly the amount of data processed, while enforcing full parallelism most of the time. Using the NVIDIA GTX 560 Ti board and comparing real DNA sequences that range from 162 KBP (Thousand Base Pairs) to 59 MBP (Million Base Pairs), we show that CUDAlign 2.1 is scalable. Also, we show that CUDAlign 2.1 is able to produce the optimal alignment between the chimpanzee chromosome 22 (33 MBP) and the human chromosome 21 (47 MBP) in 8.4 hours and the optimal alignment between the chimpanzee chromosome Y (24 MBP) and the human chromosome Y (59 MBP) in 13.1 hours.
Index Terms: Bioinformatics, sequence alignment, parallel algorithms, GPU
1 Introduction: Bioinformatics is an interdisciplinary field that involves computer science, biology, mathematics, and statistics [1]. One of its main goals is to analyze biological sequence data and genome content in order to obtain the function/structure of the sequences as well as evolutionary information. Once a new biological sequence is discovered, its functional/structural characteristics must be established. The first step to achieve this goal is to compare the new sequence with the sequences that compose genomic databases, in search of similarities. This comparison is made thousands of times on a daily basis, all over the world. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics. As output, a sequence comparison operation produces similarity scores and alignments. The score is a measure of similarity between the sequences and the alignment highlights the similarities/differences between the sequences. Both are very useful and often are used as building blocks for more complex problems such as multiple sequence alignment and secondary structure prediction.
Smith and Waterman (SW) [2] proposed an exact algorithm that retrieves the optimal score and local alignment between two sequences. It is based on Dynamic Programming (DP) and has time and space complexity O(mn), where m and n are the sizes of the sequences. In SW, a linear gap function was used. Nevertheless, in nature, gaps tend to occur together. For this reason, the affine gap model is often used, where the penalty for opening a gap is higher than the penalty for extending it. Gotoh [3] modified the SW algorithm to include affine gap penalties. One of the most restrictive characteristics of SW and its variants is the quadratic space needed to store the DP matrices.
For instance, in order to compare two 33 MBP (Million Base Pairs) sequences, we would need at least 4.3 PB of memory. This fact was observed by Hirschberg [4], who proposed a linear space algorithm to compute the Longest Common Subsequence (LCS). Hirschberg's algorithm was later modified by Myers and Miller (MM) [5] to compute global alignments in linear space.
Another restrictive characteristic of the SW algorithm is that it is usually slow due to its quadratic time complexity. In order to accelerate the comparison between long sequences, heuristic tools such as LASTZ [6] and MUMMER [7] were created. They use seeds (LASTZ) and suffix trees (MUMMER) to scan the sequences, providing a big picture of the main differences/similarities between them. On the other hand, Smith-Waterman provides the optimal local alignment, where the regions of differences/similarities are much more accurate, as well as the gapped regions that represent inclusion/deletion of bases. Therefore, we claim that both kinds of tools should be used in a complementary way: first, MUMMER or LASTZ would be executed and [...]
Journal: IEEE Transactions on Parallel and Distributed Systems, 2013
CUDAlign 2.1 executes in six stages. The first stage processes the full DP matrix as in [27], but some special rows are saved in an area called Special Rows Area and some blocks are pruned. The second stage processes the DP matrix in the reverse direction, starting from the endpoint of the optimal alignment, and also saves special columns in disk. Using an optimization called orthogonal execution, the area calculated in Stage 2 is reduced. Stage 3 increases the number of crosspoints with an execution similar to Stage 2 but in the forward direction. Stage 4 uses the MM algorithm with orthogonal execution to decrease the size of the partitions. As soon as all the partitions are smaller than the maximum partition size, Stage 5 finds the alignment of each partition and concatenates the results in the full alignment. Stage 6 is optional and presents the full alignment in textual or graphical representation.
Using an SRA of 50 GB, the full alignment of these genomic sequences was obtained in 8 hours and 26 minutes, where 99.1 percent of this time was spent in the GPU stages. CUDAlign 2.1 obtained a maximum speedup of 41.64× when compared to the Z-align cluster solution with 64 cores. As future work, we intend to further optimize the stages of the algorithm. In Stage 3, the parallelism is currently exploited intensively inside each partition; in future works many partitions may also be processed in parallel, reducing the execution time of Stage 4. Also, we intend to implement the block pruning optimization in Stages 2 and 3.
We will also extend the tests to evaluate more powerful GPUs, including systems with dual cards and GPUs from other vendors. Finally, we will investigate the possibility of solving the multiple sequence alignment problem with the optimizations proposed in this paper.
References:
[1] D.W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2004.
[2] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, no. 1, pp. 195-197, Mar. 1981.
[3] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[4] D.S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Comm. ACM, vol. 18, no. 6, pp. 341-343, 1975.
[5] E.W. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.
[6] R.S. Harris, "Improved Pairwise Alignment of Genomic DNA," PhD thesis, The Pennsylvania State Univ., 2007.
[7] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile and Open Software for Comparing Large Genomes," Genome Biology, vol. 5, no. 2, 2004.
[8] S. Aluru, N. Futamura, and K. Mehrotra, "Parallel Biological Sequence Comparison Using Prefix Computations," J. Parallel and Distributed Computing, vol. 63, no. 3, pp. 264-272, 2003.
[9] S. Rajko and S. Aluru, "Space and Time Optimal Parallel Sequence Alignments," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 12, pp. 1070-1081, Dec. 2004.
[10] R.B. Batista, A. Boukerche, and A.C.M.A. de Melo, "A Parallel Strategy for Biological Sequence Alignment in Restricted Memory Space," J. Parallel and Distributed Computing, vol. 68, no. 4, 2008.
[Fig. 13: plot of some alignments with pruned blocks in gray.]
We obtained the optimal alignment of the chromosome 21 human × chimpanzee comparison (32 MBP × 47 MBP) using the NVidia GTX 560 Ti GPU in 8 hours. GCUPS: 52.85
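Since CUDAlign 1.0 through 2.1 all rely on Gotoh's affine-gap variant of SW, it may help to state the recurrences this variant adds to the DP (one common formulation; papers differ slightly on whether the opening penalty already includes one extension):

```latex
\begin{aligned}
E_{i,j} &= \max\left(E_{i,j-1} - G_{\text{ext}},\; H_{i,j-1} - G_{\text{first}}\right)\\
F_{i,j} &= \max\left(F_{i-1,j} - G_{\text{ext}},\; H_{i-1,j} - G_{\text{first}}\right)\\
H_{i,j} &= \max\left(0,\; E_{i,j},\; F_{i,j},\; H_{i-1,j-1} + p(i,j)\right)
\end{aligned}
```

Here E (resp. F) carries the best score of an alignment ending with a gap in one sequence (resp. the other), G_first is the penalty of the first gap character and G_ext the penalty of each additional one; with the parameters quoted above, G_first = 5 and G_ext = 2.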
  • 23. • Biological sequence comparison • MASA-CUDAlign (one GPU) • MASA-CUDAlign (multiple GPUs) • MASA-CUDAlign with pruning • MASA-OpenMP CPU studies Agenda OpenPower Webinar – June, 19th, 2020
  • 24. [Figure 6: columns distribution for 4 GPUs (GPU1 to GPU4). Figure 8: the multi-GPU buffers between 4 GPUs, where each output buffer of a GPU forms an output-input pair with the input buffer of the next GPU, so that data between neighbour GPUs is continually transferred.]
  • 25. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs
Stage 1 – Compute the DP matrix: multi-GPU wavefront
[Figure 5: multi-GPU threads chaining.]
Communication using sockets and I/O threads
Overlap between computation and communication: 8M buffer
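The overlap between computation and communication can be sketched as a bounded producer-consumer buffer sitting between the GPU thread and the I/O thread (illustrative code; the class and method names are ours, and the actual socket handling is omitted). The bounded capacity is what the 8M buffer provides in practice: it limits memory use and applies back-pressure when a neighbour falls behind.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Border-column buffer between neighbour GPUs: the compute loop pushes
// chunks of the last column while an I/O thread drains them through a
// socket, so communication overlaps computation.
class BorderBuffer {
    std::queue<std::vector<int>> q_;
    size_t capacity_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
public:
    explicit BorderBuffer(size_t capacity) : capacity_(capacity) {}
    void push(std::vector<int> chunk) {           // called by the GPU thread
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < capacity_; });
        q_.push(std::move(chunk));
        not_empty_.notify_one();
    }
    std::vector<int> pop() {                      // called by the I/O thread
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty(); });
        auto chunk = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return chunk;
    }
};
```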
  • 26. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs
Stages 2 to 5 – Traceback
• Main challenge: how to parallelize a stage that is inherently sequential?
– Speculation
• Incremental Speculative Traceback (IST): each GPU will consider that the local maximum is also the global maximum.
[Figure: columns distribution for 4 GPUs, showing the optimal and the speculated traceback paths.]
  • 27. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs
Incremental Speculative Traceback (IST)
[Figure 9: traceback timelines over time. (a) Pipelined Traceback (PT), without speculation; (b) Incremental Speculative Traceback (IST), with speculation.]
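In pseudocode form, the IST rule at each GPU looks roughly like this (a hypothetical sketch, simplified to a single speculation round, while the real IST speculates incrementally as better values arrive; traceback_partition, receive_from_right_neighbour and forward_to_left_neighbour are placeholder names for the per-partition traceback and the neighbour communication, which the tool performs over sockets):

```cpp
// IST decision at one GPU: start the traceback of this partition from the
// best local crosspoint before the true entry point is known.
struct Crosspoint { long row, col; };

Crosspoint traceback_partition(Crosspoint entry);   // traceback inside this GPU's columns
Crosspoint receive_from_right_neighbour();          // blocking receive of the true entry point
void forward_to_left_neighbour(Crosspoint exit_point);

void ist_worker(Crosspoint local_best) {
    // 1. Speculate: assume the local maximum is also the global maximum.
    Crosspoint speculated_exit = traceback_partition(local_best);

    // 2. Later, the right neighbour sends the crosspoint where the real
    //    optimal alignment enters this GPU's partition.
    Crosspoint actual = receive_from_right_neighbour();

    if (actual.row == local_best.row && actual.col == local_best.col) {
        // Speculation hit: the work done in advance is reused as-is
        // (98.2 percent of the cases in the chr5 comparison cited below).
        forward_to_left_neighbour(speculated_exit);
    } else {
        // Speculation miss: redo the traceback from the true entry point.
        forward_to_left_neighbour(traceback_partition(actual));
    }
}
```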
  • 28. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs / Publications (CUDAlign 3.0)
Fine-Grain Parallel Megabase Sequence Comparison with Multiple Heterogeneous GPUs
Edans F. de O. Sandes (University of Brasilia), Guillermo Miranda (Barcelona Supercomputing Center), Alba C. M. A. Melo (University of Brasilia), Xavier Martorell (Universitat Politècnica de Catalunya / Barcelona Supercomputing Center), Eduard Ayguadé (Universitat Politècnica de Catalunya / Barcelona Supercomputing Center)
Abstract: This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (Billion of cells processed per second) with 3 heterogeneous GPUs.
1. Introduction: Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept, that uses dynamic programming to find local alignments between two sequences. SW is very accurate but it needs a lot of computational resources. GPUs (Graphics Processing Units) have been considered to accelerate SW, but very few GPU strategies [1, 3] allow the comparison of Megabase sequences longer than 10 Million Base Pairs (MBP). SW# [1] uses 2 GPUs to execute a Myers-Miller [2] linear space variant of SW. CUDAlign [3] uses a single GPU to execute a combined strategy with SW and Myers-Miller. When compared to SW# (1 GPU), CUDAlign (1 GPU) presents better execution times for huge sequences [1].
In this work, we modified the most computationally intensive stage of CUDAlign, parallelizing the computation of a single huge DP matrix among heterogeneous GPUs in a fine-grained way. In the proposed strategy, GPUs are logically arranged in a linear way so that each GPU calculates a subset of columns of the SW matrix, sending the border column elements to the next GPU. Experimental results collected in 2 different environments show performance of up to 140 GCUPS (Billion of cells processed per second) using 3 heterogeneous GPUs. With this performance, we are able to compare real megabase sequences in reasonable time.
2. Proposed Multi-GPU Strategy: We modified the first stage of CUDAlign [3] to parallelize the computation of a single huge DP matrix among many heterogeneous GPUs.
The parallelization is done using a multi-GPU wavefront method, where the GPUs are logically arranged in a linear way, i.e., the first GPU is connected to the second, the second to the third and so on. Each GPU computes a range of columns of the DP matrix and the GPUs transfer the cells of their last column to the next GPU. In a scenario composed of heterogeneous GPUs, assigning the same number of columns to all GPUs is not a good choice. In this case, the slowest GPU would determine the processing rate of the whole wavefront. To avoid this, we statically distribute the columns proportionally to the computational power of each GPU. This distribution can be obtained from sequence comparison benchmarks that determine each GPU['s relative performance].
Conference: ACM PPoPP 2014
Column distribution for GTX 580 + GTX 680 + GTX 680: 30.71% + 34.64% + 34.63%. [Figure 1: columns distribution for 4 GPUs.]
Sequences used in the tests (Table 1):

Chr.   Human accession  Size  Chimpanzee accession  Size  Score
chr19  NC_000019.9      59M   NC_006486.3           64M   17297608
chr20  NC_000020.10     63M   NC_006487.3           62M   40050427
chr21  NC_000021.8      48M   NC_006488.2           46M   36006054
chr22  NC_000022.10     51M   NC_006489.3           50M   31510791

We obtained the optimal score of the chromosome 21 human × chimpanzee comparison (46 MBP × 47 MBP) using 3 GPUs (GTX 580 + 2x GTX 680) in 6 hours and 28 minutes. GCUPS: 139.63
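The static, throughput-proportional column split can be sketched in a few lines (illustrative code; the function name and the way throughputs are obtained are ours):

```cpp
#include <numeric>
#include <vector>

// Split the n columns of the DP matrix proportionally to each GPU's
// benchmarked throughput (GCUPS). Returns the column boundaries: GPU g
// owns columns [bounds[g], bounds[g+1]).
std::vector<long> split_columns(long n, const std::vector<double>& gcups) {
    double total = std::accumulate(gcups.begin(), gcups.end(), 0.0);
    std::vector<long> bounds(gcups.size() + 1, 0);
    double acc = 0.0;
    for (size_t g = 0; g < gcups.size(); ++g) {
        acc += gcups[g];
        bounds[g + 1] = static_cast<long>(n * (acc / total));
    }
    bounds.back() = n;  // guard against rounding loss on the last GPU
    return bounds;
}
// Example: split_columns(n, {30.71, 34.64, 34.63}) reproduces the
// 30.71% / 34.64% / 34.63% distribution quoted above for
// GTX 580 + GTX 680 + GTX 680.
```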
  • 29. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs / Publications (CUDAlign 3.0)
CUDAlign 3.0: Parallel Biological Sequence Comparison in Large GPU Clusters
Edans F. de O. Sandes, Guillermo Miranda, Alba C. M. A. de Melo, Xavier Martorell, Eduard Ayguadé
University of Brasilia (UnB) {edans, albamm}@cic.unb.br; Barcelona Supercomputing Center (BSC) {guillermo.miranda, xavier.martorell, eduard.ayguade}@bsc.es; Universitat Politècnica de Catalunya (UPC)
Abstract: This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) biological sequence comparison algorithm for huge DNA sequences in multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism. We also provide a method to predict the execution time and speedup of a comparison, given the number of GPUs and the sizes of the sequences. The results obtained with a large multi-GPU environment show that our solution is scalable when varying the sizes of the sequences and/or the number of GPUs and that our prediction method is accurate. With our proposal, we were able to compare the largest human chromosome with its homologous chimpanzee chromosome (249 Millions of Base Pairs (MBP) x 228 MBP) using 64 GPUs, achieving 1.7 TCUPS (Tera Cells Updated per Second). As far as we know, this is the largest comparison ever done using the Smith-Waterman algorithm.
I. Introduction: In comparative genomics, biologists need to compare their sequences against other organisms in order to infer functional and structural properties. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics [1], usually solved using heuristic methods due to the excessive execution times of their exact counterparts. The exact algorithm to execute pairwise comparisons is the one proposed by Smith-Waterman (SW) [2], which is based on Dynamic Programming (DP), with quadratic time and space complexities. The SW algorithm is normally executed to compare (a) two DNA sequences or (b) a protein sequence (query sequence) to a genomic database, which is composed of several protein sequences. Both cases have been parallelized in the literature. In the first case, a single SW matrix is calculated and all the Processing Elements (PEs) participate in this calculation (fine-grained computation). Since there are data dependencies, neighbour PEs communicate in order to exchange border elements. For Megabase DNA sequences, the algorithm calculates a huge matrix with several Petabytes. In the second case, multiple small SW matrices are calculated, usually without communication between the PEs (coarse-grained computation). With the current genomic databases, often hundreds of thousands of SW matrices are calculated in a single query × database comparison.
GPUs (Graphics Processing Units) are highly parallel architectures which execute data parallel problems usually much faster than a general-purpose processor. For this reason, they have been considered to accelerate SW, with many versions already available, executing on a single GPU [3-7]. More recently, several approaches have been proposed to execute SW in multiple GPUs [8-12]. Very few GPU strategies [3, 12] allow the comparison of Megabase sequences longer than 10 Million Base Pairs (MBP). SW# [12] is able to use 2 GPUs in a single Megabase comparison to calculate the Myers-Miller [13] linear space variant of SW.
CUDAlign [3] executes in a single GPU and obtains the alignment of Megabase sequences with a combined SW and Myers-Miller strategy. When compared to SW# (1 GPU), CUDAlign (1 GPU) presents better execution times for huge sequences [12]. In this paper, we propose and evaluate CUDAlign 3.0, an evolution of CUDAlign 2.1 [3] which executes the first stage of the SW algorithm in a fine-grained parallel way, comparing Megabase DNA sequences in multiple GPUs. In CUDAlign 3.0, we faced the challenge of distributing the computation of a huge DP matrix among several GPUs, with low impact on the performance. In the proposed strategy, GPUs are logically arranged in a linear way so that each GPU calculates a subset of columns of the SW matrix, sending the border column elements to the next GPU. Due to the data dependencies of the SW recurrence relation, a slowdown in the communication between any 2 GPUs will slow down the whole matrix computation [14]. To tackle this problem, we decided that computation must be overlapped with communication, so asynchronous CPU threads send/receive data to/from neighbour GPUs while the GPU continues computing. Sequence comparisons that deal with Megabase sequences can take hours or even days to complete. In this scenario, we developed a method to predict the execution time and speedup of a comparison, given the number of GPUs and the sizes of the sequences. CUDAlign 3.0 was implemented in CUDA, C++ and [...]
Conference: IEEE CCGrid 2014
[Figure 10: actual speedup in Minotauro with 1 to 16 GPUs, for the 23M×25M, 10M×10M and 5M×5M comparisons. Figure 11: throughput (GCUPS) obtained with variable sequence sizes (chr21×21, chr22×22, chr19×19 and chr20×20, from 2.24E+15 to 3.89E+15 cells) on 1 to 16 Minotauro GPUs; the GCUPS rate is almost identical for workload sizes from 2.24E+15 to 3.49E+15 cells.]
We obtained the optimal score of the chromosome 21 human × chimpanzee comparison (32 MBP × 47 MBP) using 16 NVidia Tesla M2090 GPUs in 1 hour and 20 minutes. GCUPS: 488.21
Using 64 GPUs, we obtained the optimal score of the chromosome 1 human × chimpanzee comparison (249 MBP × 229 MBP) in 9.09 hours. GCUPS: 1726.47
  • 30. MASA-CUDAlign: Goal and Versions / MASA-CUDAlign in multiple GPUs / Publications (CUDAlign 4.0)
Journal: IEEE Transactions on Parallel and Distributed Systems, 2016
Using 384 GPUs, we obtained the optimal alignment of chromosome 21 in a few minutes and the optimal alignment of the chromosome 5 human × chimpanzee comparison (180 MBP × 183 MBP) in 53 minutes. GCUPS: 10372.56
CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-Wide Alignment in GPU Clusters
Edans Flavius de Oliveira Sandes, Guillermo Miranda, Xavier Martorell, Eduard Ayguade, George Teodoro, and Alba Cristina Magalhaes Melo, Senior Member, IEEE
Abstract: This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbour in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the application scalability. In order to obtain a multi-GPU highly parallel traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all human and chimpanzee homologous chromosomes, whose sizes range from 26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such comparison was made with the SW exact method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS (Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
Index Terms: Bioinformatics, sequence alignment, parallel algorithms, GPU
1 Introduction: In comparative genomics, biologists compare the sequences that represent organisms in order to infer functional/structural properties. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics [1], usually solved using heuristic methods due to the excessive computation times of the exact methods.
Smith-Waterman (SW) [2] is an exact algorithm to compute pairwise local comparisons. It is based on Dynamic Programming (DP) and has quadratic time and space complexities. The SW algorithm is divided in two phases, where the first phase is responsible for calculating a DP matrix in order to obtain the optimal score and the second phase (traceback) obtains the optimal alignment. SW is usually executed to compare (a) two DNA sequences or (b) a protein sequence (query sequence) to a genomic database. In the first case, a single SW matrix is calculated and all the Processing Elements (PEs) cooperate in this calculation, communicating to exchange border elements (fine-grained computation). For Megabase DNA sequences, a huge DP matrix with several Petabytes is computed. In the second case, multiple small SW matrices are calculated, usually without communication between the PEs (coarse-grained computation).
With the current genomic databases, often hundreds of thousands of SW matrices are calculated in a single query × database comparison. In the last decades, SW approaches for both cases have been parallelized in the literature, using multiprocessors/multicores [3], [4], Cell Broadband Engines (CellBEs) [5], Field Programmable Gate Arrays (FPGAs) [6], Application Specific Integrated Circuits (ASICs) [7], Intel Xeon Phis [8] and Graphics Processing Units (GPUs) [9], [10], [11], [12]. The SW algorithm is widely used by biologists to compare sequences in many practical applications, such as identification of orthologs [13] and virus integration detection [14]. In this last application, an FPGA-based platform [6] was used to compute millions of SW alignments with small query sequences in short time.
Nowadays, executing SW comparisons with Megabase sequences is still considered unfeasible by most researchers, which currently limits its practical use. We claim that important bioinformatics applications such as whole genome alignment (WGA) [15] could benefit from exact pairwise comparisons of long DNA sequences. WGA applications often construct global genome alignments by using local alignments as building blocks [16], [17]. In [18], the authors state that SW local alignments would be the best choice in this case. However, in order to compare 1 MBP × 1 MBP sequences, the SW tool took more than five days, preventing its use.
Experimental results: the total execution time is split between the time spent in Stage 1 and the remaining stages (traceback). As shown, the speedups attained with 128 nodes for chr22 and chr16 were, respectively, 26.9 and 29.7 (21.0 and 23.2 percent of parallel efficiency). The breakdown of the total execution shows that Stage 1 of CUDAlign has a much better scalability. Stage 1 attained speedups of 84.0 and 97.3 with 128 nodes (65.6 and 76.0 percent of parallel efficiency), resulting in a peak performance of 8.3 and 9.7 TCUPS for chr22 and chr16, respectively. Stage 1 results of chr22 and chr16 are consistent with the ones obtained in CUDAlign 3.0 [12]. The PT traceback phase, on the other hand, was not able to [scale efficiently: its share of the total] execution increased from about 4 to 71 percent as the number of nodes used was scaled from 1 to 128. This negative impact of the traceback on the whole application performance is highly reduced when IST is used, as shown in Section 6.3.
6.3 Impact of Incremental Speculative Traceback: The experimental evaluation of the impact of IST on the performance was carried out using five pairs of homologous chromosomes: chr22, chr16, chr13, chr8, and chr5. These sequences were selected intending to provide a wide range of variation in the DP matrix size calculated (2.55, 8.13, 13.26, 21.07 and 33.04 Peta cells, respectively).
[Fig. 10: alignment plots between human and chimpanzee homologous chromosomes.]
  • 32. MASA-CUDAlign: Goal and Versions
Best Performances for Smith-Waterman in HPC Platforms

Name           Year  Max. Size (10^n)  Output            Platform   Execution          TCUPS
SW-Rivyera     2014  1,000             score             128 FPGAs  real execution      6.02
SW-MVM         2014  100               score, alignment  128 CPUs   real execution      0.90
MASA-CUDAlign  2016  100,000,000       score, alignment  384 GPUs   real execution     10.3
Prins          2017  100,000,000       score             ReCAM      custom simulation  53.0

More than 100 papers in the literature in the last decades
  • 33. • Biological sequence comparison • MASA-CUDAlign (one GPU) • MASA-CUDAlign (multiple GPUs) • MASA-CUDAlign with pruning • MASA-OpenMP CPU studies Agenda OpenPower Webinar – June, 19th, 2020
  • 34. MASA Architecture
• The programmer chooses the type of block pruning and the parallelization strategy
• The programmer needs to code the recurrence relations
[These functions m]ust be implemented in the specific language and linked together to create a new aligner extension. Some aligners were [implemented] in the work that presented MASA, executing in different hardware such as GPUs, multicore CPUs and Intel Phi co-processors. The code of MASA is divided in modules, according to the features: platform-independent functions (like data management and statistical procedures) and platform-dependent functions (like the parallel processing of the DP matrix and the BP module, [implement]ed considering the target platform). The integration of these modules can be observed in Figure 2. [Figure 2: MASA architecture [17].]
[For the paral]lelization strategy, two approaches are suggested: the diagonal method, allowing the parallel processing of cells in the [same anti-diagonal, and the da]taflow method, where the propagation is generic among nodes that represent blocks of cells. Similarly, the block pruning is [pro]posed using diagonal or generic execution approaches, avoiding unnecessary calculations. In order to create a specific [...]
  • 35. MASA-CUDAlign – Multi-GPU with Pruning
• Challenge: execute MASA-CUDAlign in a multi-GPU platform with block pruning.
[Figure 5: columns distributions for 4 GPUs; the pruning area, in blue, is not computed.]
The pruning area is obtained during the execution: heavy load imbalance.
  • 36. Multi-GPU with Pruning: Score sharing
• The GPUs exchange their local best results periodically.
• In order to execute the sequence alignment with BP in multiple GPUs, each one will compute a subset of columns of the DP matrix, i.e., the sequence placed horizontally (S1 in Fig. 4) is split according to a defined static partition. Thus, each GPU compares a part of this sequence with the entire sequence placed vertically (S0 in Fig. 4).
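The BP test itself (proposed for one GPU in CUDAlign 2.1 and reused by Multi-BP with a periodically shared global best score) can be sketched as follows. This is our simplified rendering, not the tool's source: a block whose best score, even extended by a perfect run of matches over the remaining cells, cannot exceed the current best score will never contain the optimal alignment endpoint.

```cpp
#include <algorithm>

// Block pruning test. 'block_max_score' is the best H value on the block
// border, 'best_score_so_far' is the global best shared among the GPUs,
// and 'match' is the match award of the scoring scheme (e.g. +1).
bool can_prune(long block_max_score,
               long best_score_so_far,
               long rows_left, long cols_left,
               long match) {
    // Upper bound: every remaining cell on the shortest path to the
    // matrix corner is a match.
    long best_reachable = block_max_score + match * std::min(rows_left, cols_left);
    return best_reachable <= best_score_so_far;
}
```

Sharing the best score periodically is what makes the criterion effective across GPUs: a high score found in one partition enlarges the prunable area in all the others.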
  • 37. Multi-GPU with Pruning: Score sharing – Publication
[Fig. 6: Multi-BP results (GCUPS) in the Comet environment with 2 and 4 P100 GPUs, with and without BP, for comparisons from 1M to 47M and chromosomes 19-22.] As can be observed, the speedup varied from 1.60x to 1.92x with two GPUs, and from 2.70x to 3.72x with four GPUs. [Table VII: execution time (in hours) and speedup of Multi-BP on 1, 2 and 4 P100 GPUs, against the linear speedups of 1.00x, 2.00x and 4.00x.]
Conference: Euromicro PDP 2020
Parallel Comparison of Huge DNA Sequences in Multiple GPUs with Block Pruning
Marco Figueiredo Jr. (Univ. of Brasilia), Edans Sandes (Univ. of Brasilia), George Teodoro (Univ. Fed. de Minas Gerais), Alba C. M. A. Melo (Univ. of Brasilia)
Abstract: Sequence comparison is a task performed in several Bioinformatics applications daily all over the world. Algorithms that retrieve the optimal result have quadratic time complexity, requiring a huge amount of computing power when the sequences compared are long. In order to reduce the execution time, many parallel solutions have been proposed in the literature. Nevertheless, depending on the sizes of the sequences, even those parallel solutions take hours or days to complete. Pruning techniques can significantly improve the performance of the parallel solutions and a few approaches have been proposed to provide pruning capabilities for sequence comparison applications. This paper proposes and evaluates a variant of the block pruning approach that runs in multiple GPUs, in homogeneous or heterogeneous environments. Experimental results obtained with DNA sequences in two testbeds show that significant performance gains are obtained with pruning, compared to its non-pruning counterpart, achieving the impressive performance of 694.8 GCUPS (Billions of Cells Updated per Second) for four GPUs.
Index Terms: bioinformatics, DNA alignment, GPU, pruning
I. Introduction: Bioinformatics produces solutions that are used by various fields of study, such as medicine and biology [1]. Biological sequence comparison operations are executed several times daily all over the world, either in stand-alone mode or incorporated into Bioinformatics applications to solve complex problems such as evolutionary relationship determination and drug design. Due to their quadratic time complexity, sequence comparison algorithms that retrieve the optimal result can take a lot of time. In order to reduce the execution time of such algorithms, parallel solutions have been proposed in the literature over the last decades. The type of parallelism provided by Graphics Processing Units (GPUs) makes these devices a very good alternative to run sequence comparisons [2], [3]. CUDAlign 4.0 [3] is a state-of-the-art tool that compares huge DNA sequences in multiple GPUs and obtains the optimal result, combining the Gotoh [4] and the Myers-Miller [5] algorithms. Using 384 GPUs, it was able to compare the homologous human × chimpanzee chromosomes 5 (180 Million of Base Pairs, MBP, each) in 53 minutes, computing a matrix of 33.04 Peta cells at 10.37 TCUPS (Trillions of Cells Updated per Second).
In an earlier version for one GPU (CUDAlign 2.1 [6]), the block pruning (BP) strategy was proposed to avoid the computation of parts of the matrix that surely will not lead to the optimal solution, with good results for one GPU. Further versions of CUDAlign present pruning capabilities only for single-GPU executions. SW# [7] implemented the original MM algorithm and extended the block pruning strategy [6] to be used in two GPUs, but the performance was just a little better than the execution of CUDAlign in one GPU [7]. As far as we know, there is no work in the literature that obtains the optimal result with pruning using more than two GPUs. Other works use CPUs [8], FPGAs [9] or hybrid environments [10], but they are outside the scope of the paper.
This paper proposes and evaluates Multi-BP, an adaptation of block pruning for multiple GPUs. It is based on static distribution and dynamic sharing of pruning information, leading to considerable performance gains in medium-sized GPU environments. Multi-BP combines the multi-GPU CUDAlign version [3] and the pruning technique proposed in [6]. The challenges in designing Multi-BP were the following: (a) ensure that Multi-BP will not affect the performance in single-GPU executions; (b) adapt the calculation of the index of each GPU block of cells and the evaluation of the pruning window to a multiple-GPU environment; (c) disseminate the pruning information obtained by each GPU to all others with low overhead; and (d) adjust the pruning technique to heterogeneous GPU environments, considering that the DP matrix might not be partitioned evenly among the GPUs.
Experimental results obtained with real DNA sequences with sizes varying from 1 to 63 MBP in two computing environments show that very good gains were attained with Multi-BP. The execution time of the comparison of chromosome 20 (human × chimpanzee) in the same heterogeneous environment (GTX 980 Ti + GTX 680) was reduced from 8h17min (without Multi-BP) to 4h55min (with Multi-BP).
We obtained the optimal score of the chromosome 21 human × chimpanzee comparison (46 MBP × 47 MBP) using 4 NVidia Pascal GPUs in 56 minutes. GCUPS: 680.81
  • 38. MASA-CUDAlign: Goal and Versions / Multi-GPU with Pruning: Load balancing – ongoing work
• Challenge: adapt the workload to a dynamic pruning scenario.
– Execution is paused at some points: overhead
– Use in cases where the load balancing benefits are higher than the overhead
– Sequences which are not very similar do not have a big pruning area.
  • 39. Multi-GPU with Pruning: Score sharing + cyclic + load balancing
[Figure: column partitions among GPU1..GPU4 without a breakpoint and with a breakpoint, as read in the sketch below.]
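One way to read the breakpoint mechanism (our speculative sketch of the ongoing work described on the previous slide, not a published design): at each breakpoint the wavefront is paused, each GPU reports the effective throughput it achieved since the last breakpoint (pruning changes this over time), and the columns that remain are redistributed before resuming. The pause is exactly the overhead that the better balance must amortize.

```cpp
#include <vector>

// From the earlier sketch: proportional column split given throughputs.
std::vector<long> split_columns(long n, const std::vector<double>& gcups);

// Hypothetical breakpoint loop. Measurement collection and the actual
// DP-matrix computation are elided; only the rebalancing skeleton is shown.
void run_with_breakpoints(long total_cols, int num_breakpoints,
                          std::vector<double> gcups /* initial estimates */) {
    long done = 0;
    for (int b = 0; b < num_breakpoints; ++b) {
        long chunk = (total_cols - done) / (num_breakpoints - b);
        std::vector<long> bounds = split_columns(chunk, gcups);
        // ... compute the stripe of columns [done, done + chunk) with these
        // per-GPU ranges, then pause (this pause is the overhead) ...
        done += chunk;
        // gcups = effective throughput measured since the last breakpoint,
        // already discounting pruned blocks (placeholder in this sketch).
    }
}
```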
  • 41. Multi-GPU with Pruning Score sharing + cyclic + load balancing • Best result in the literature for GPUs: – 10.3 TCUPS with 384 NVidia M2090 GPUs + Intel CPU • Result obtained in the platform: – 2.7 TCUPS with 8 NVidia Volta GPUs + Power9 CPU • We estimate that we are able to beat the best result for GPUs (10.3 TCUPS) with 40 NVidia V100 and the best theoretical result (53 TCUPS) with 256 NVidia V100
  • 42. • Biological sequence comparison • MASA-CUDAlign (one GPU) • MASA-CUDAlign (multiple GPUs) • MASA-CUDAlign with pruning • MASA-OpenMP CPU studies Agenda OpenPower Webinar – June, 19th, 2020
  • 43. • We compared MASA-OpenMP (CPU) running in the IBM Power9, Intel i7 and Intel Xeon platforms for the 1M x 1M comparison. • Intel i7 (4 cores - Skylake) – GCUPS: 1.1, time: 16 minutes (962.9 seconds) • Intel Xeon (24 cores – Haswell) – GCUPS: 4.5, time: 4 minutes (247.16 seconds) • IBM Power (22 cores – Power9) – GCUPS: 6.1, time: 3 minutes (181.4 seconds) MASA in CPUs (MASA-OpenMP) MASA-OpenMP – CPU Comparison
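For reference, the diagonal (wavefront) parallelization that a CPU aligner like MASA-OpenMP exploits can be sketched as follows (illustrative code with simplified linear-gap scoring and example parameters +1/-3/-5; the full H matrix is stored only for clarity, so this is practical for small inputs, while a real aligner keeps a few rows and processes blocks of cells to amortize scheduling overhead):

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include <omp.h>

// All cells on one anti-diagonal (i + j = d) are independent, so each
// diagonal becomes one parallel loop.
long sw_score_wavefront(const std::string& s0, const std::string& s1) {
    const long m = s0.size(), n = s1.size();
    long best = 0;
    std::vector<std::vector<long>> H(m + 1, std::vector<long>(n + 1, 0));
    for (long d = 2; d <= m + n; ++d) {
        const long lo = std::max(1L, d - n), hi = std::min(m, d - 1);
        #pragma omp parallel for reduction(max : best)
        for (long i = lo; i <= hi; ++i) {
            const long j = d - i;
            const long diag = H[i - 1][j - 1] + (s0[i - 1] == s1[j - 1] ? 1 : -3);
            H[i][j] = std::max({0L, diag, H[i - 1][j] - 5, H[i][j - 1] - 5});
            best = std::max(best, H[i][j]);
        }
    }
    return best;  // optimal local alignment score
}
```

The short diagonals at the start and end of the matrix offer little parallelism, which is one reason blocked processing and wide machines (the 22-core Power9 above) pay off mainly for longer sequences.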
  • 44. • We used MASA-OpenMP (CPU) for these comparisons since the sars-cov-2 sequences are short (about 30 thousand characters) • We first compared sars-cov-2 sequences from China, Brazil, USA, India and Japan – Conclusion: very similar sequences • We then compared sars-cov-2 sequences from Brazil to mers and sars: – Even though the sequences are quite similar, there are regions of interest MASA in CPUs (MASA-OpenMP) Ongoing covid-19 study – IBM Power9
  • 45. Ongoing covid-19 study – IBM Power9 (sars-cov-2 Brazil vs mers)
[Excerpt of the textual alignment output for the sars-cov-2 (Brazil) × mers comparison, covering roughly the first 465 positions of the query. Each block shows a Query line, a line of match bars, the accumulated score deltas, and the corresponding Sbjct line, e.g.:
Query: 1 -ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATC 60
Sbjct: 1 GATTTAA-GTGAATAGCTT---GGCTATCTCACTTCCC--C--TCGTTCTCTTGCAGAAC 53]
  • 46. Thanks to… • My former PhD student, Edans F. O. Sandes, my PhD student Marco Figueiredo Jr. and my undergrad student Bernardo Nascimento • George Teodoro, UFMG, Brazil • Maria Emilia Walter, University of Brasilia, Brazil • Eduard Ayguade, Xavier Martorell and Guillermo Miranda, Universitat Politecnica de Catalunya and Barcelona Supercomputing Center • And Azzedine Boukerche, University of Ottawa, Manuel Ujaldon, University of Malaga, Samuel Thibault, University of Bordeaux, Genaina Rodrigues, University of Brasilia, Celia Ralha, University of Brasilia, a couple of MSc students and many undergrad students
  • 47. Thanks to… • Barcelona Supercomputing Center, Spain, for providing access to the Minotauro GPU cluster (M2090) • Xsede Platform, USA, for providing access to the Keeneland Fullscale System (M2090), hosted at Georgia Tech, and the comet cluster (P100), hosted at UC San Diego. • NVidia Brazil, for providing access to their platform (P100) • IBM USA and U. Oregon for providing access to their Power9 + V100 platforms
  • 48. The MASA code, including MASA-CUDAlign and MASA-OpenMP, is available at https://github.com/edanssandes/MASA-Core MASA code The MASA code (GPU, CPU, Intel Phi) was used in the following institutions: Brazil – University of Brasilia, Fed Univ Rio Grande do Sul, NVidia Brazil, NEC Brazil Croatia – University of Zagreb France – University of Bordeaux India - Manonmaniam Sundaranar University Japan – NEC Japan Singapore – Agency for Science Technology and Research Spain – Politechnical University of Catalunya and University of Malaga USA – University of Delaware and IBM USA We are open to collaborations!
  • 49. Thank you very much!
Alba Cristina M. A. Melo
alves@unb.br
Image: Chemical Structure of DNA, by Compound Interest (https://www.compoundchem.com/2015/03/24/dna/)
OpenPower Webinar – June, 19th, 2020