SEQUENCE ALIGNMENT SPEED-UP
A PARALLEL APPROACH
University of Salerno
Parallel and Concurrent Computing Course
19 February 2013
Giuliana Carullo
Luca Pepe
Daniele Valenza
• Introduction
• Problem definition
• Simple Search
• Approximate Search
• Parallelization
• Cross-Chunk Matching
• Bigger chunk
• On Demand
• Side to Side Sliding Query
• Approximate Search
• Test plan
Sequence alignment is the process of comparing two
or more DNA or RNA sequences.
It is performed to find similar or identical regions
in the given sequences, or to check whether a sequence
is already known and stored in a database.
DNA STRUCTURE
DNA bases: A C G T
Bonds (base pairs): (A, T) (C, G)
DNA ALIGNMENT
Affinity measures:
• MATCH
• MISMATCH
• GAP
MATCHING TYPE:
• SIMPLE
• REVERSE AND COMPLEMENT
Q: ATGATTACC DNA String
R(Q): CCATTAGTA Reverse
C (R(Q)): GGTAATCAT Complement
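The reverse-and-complement example above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function names are hypothetical and not taken from the slides.

```python
# Complement of each DNA base, following the base pairs (A, T) and (C, G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse(seq: str) -> str:
    """R(Q): the sequence read backwards."""
    return seq[::-1]


def complement(seq: str) -> str:
    """C(Q): every base replaced by its paired base."""
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq: str) -> str:
    """C(R(Q)): reverse first, then complement."""
    return complement(reverse(seq))


q = "ATGATTACC"
print(reverse(q))             # CCATTAGTA
print(reverse_complement(q))  # GGTAATCAT
```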
DNA ALIGNMENT TYPES
• Global Alignment
• Local Alignment
SIMPLE SEARCH
Search for all exact occurrences of a short query string in a longer
DNA string.
INPUT: DNA string, query string
OUTPUT: number of occurrences, starting positions of the occurrences
Notation:
• Number of workers: $n$
• Query length: $l_q$
• DNA length: $l_d$
• Relative position (offset within a chunk): $off_i$
• Absolute position: $s_i$
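As a point of reference, the sequential version of this problem can be sketched as follows. This is a naive scan written for illustration, assuming 0-based positions; it is not the implementation used in the project.

```python
from typing import List


def simple_search(dna: str, query: str) -> List[int]:
    """Return the starting positions of every exact occurrence of query in dna."""
    positions = []
    for start in range(len(dna) - len(query) + 1):
        # Compare character by character and stop at the first mismatch.
        for k, ch in enumerate(query):
            if dna[start + k] != ch:
                break
        else:
            positions.append(start)
    return positions


hits = simple_search("ACGTACGTAC", "GTA")
print(len(hits), hits)  # 2 [2, 6]
```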
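APPROXIMATE SEARCH
Search for the $n$ best alignments of a short query string in a longer
DNA string (here $n$ is the requested number of alignments, a parameter
separate from the number of workers).
INPUT: DNA string, query string
OUTPUT: starting positions of the best alignments
Notation:
• Number of workers: $n$
• Query length: $l_q$
• DNA length: $l_d$
• Relative position (offset within a chunk): $off_i$
• Absolute position: $s_i$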
APPROXIMATE SEARCH – SIMILARITY EVALUATION
Character similarity function:
$$ s_i = \begin{cases} x, & \text{match} \\ y, & \text{mismatch} \\ z, & \text{gap} \end{cases} \qquad x > 0; \quad y, z \le 0 $$
(In this work gaps are not considered.)
Objective function to maximize:
$$ S = \sum_{i=1}^{l_q} s_i $$
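A minimal sketch of the objective function for one gap-free alignment is shown below. The score values x = 1 and y = -1 are assumptions chosen for the example; the slides only require x > 0 and y ≤ 0.

```python
def similarity(dna: str, query: str, start: int, x: int = 1, y: int = -1) -> int:
    """Score the gap-free alignment of query against dna[start:start + len(query)].

    Each matching character contributes x, each mismatch contributes y.
    """
    window = dna[start:start + len(query)]
    if len(window) < len(query):
        raise ValueError("alignment runs past the end of the DNA string")
    return sum(x if d == q else y for d, q in zip(window, query))


# "GTAC" against "GTTC": 3 matches and 1 mismatch give 3 - 1 = 2.
print(similarity("ACGTACGT", "GTTC", 2))  # 2
```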
GENERAL IDEA
All the proposed solutions follow the Map Reduce model:
• The master node splits the DNA string into chunks and scatters them
to the worker nodes.
• The workers perform the computation and send their results back to
the master.
• The master combines the partial solutions and returns the output.
Care must be taken with occurrences that straddle two chunks
(cross-chunk matchings).
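GENERIC SPLIT AND COMPUTATION
[Figure: the DNA string is split into chunks $i-1$, $i$, $i+1$, each of size $l_d/n$. The query string slides over the DNA string at steps $T_0, \dots, T_7$; alignments that lie inside one chunk give complete matchings, while alignments that cross a chunk boundary give only partial matchings.]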
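GENERIC REDUCE PHASE
Workers output (worker ID, offset): ($i$, $off_i$), ($j$, $off_j$)
Final output (absolute positions):
$$ s_i = i \cdot (l_d / n) + off_i, \qquad s_j = j \cdot (l_d / n) + off_j $$
[Figure: two occurrences of the query string, each of size $l_q$, located at positions $s_i$ and $s_j$ of the DNA string.]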
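The position translation performed by the master can be sketched as below. The sketch assumes 0-based worker IDs and a DNA length that is a multiple of the number of workers; both are simplifying assumptions, not details stated in the slides.

```python
def to_absolute(worker_id, offset, dna_length, n_workers):
    """Translate a chunk-relative offset into an absolute position."""
    chunk_size = dna_length // n_workers  # l_d / n, assuming n divides l_d
    return worker_id * chunk_size + offset


def reduce_positions(workers_output, dna_length, n_workers):
    """Combine (worker ID, offset) pairs into a sorted list of absolute positions."""
    return sorted(
        to_absolute(worker_id, offset, dna_length, n_workers)
        for worker_id, offset in workers_output
    )


print(reduce_positions([(0, 3), (2, 1), (1, 7)], dna_length=30, n_workers=3))
# [3, 17, 21]
```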
SOLUTION APPROACHES
Bigger chunk:
The master sends every worker a chunk of size $s = l_d/n + l_q - 1$, so
that cross-chunk matchings can be found.
On Demand:
The master sends chunks of size $s = l_d/n$. If a worker finds a partial
matching at the end of its chunk, it requests the remaining part, of
length $r \le l_q - k$ (with $k$ presumably the number of characters already
matched), so that cross-chunk matchings can be found.
Two possible heuristics: big request and small request.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its complete
matchings and all its partial matchings. The master then combines the
partial matchings to find the cross-chunk matchings.
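BIGGER CHUNK APPROACH
[Figure: chunks $i-1$, $i$, $i+1$, each of size $l_d/n$ plus $l_q - 1$ extra characters. The extra characters of a chunk are the same characters that open the next chunk ("Same Char"), so the query string sliding over steps $T_0, \dots, T_7$ finds complete matchings even across chunk boundaries.]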
BIGGER CHUNK APPROACH
ADVANTAGES:
• it requires no communication between workers;
• it does not produce duplicated occurrences;
• the master has very little sequential work to perform.
DISADVANTAGES:
• each worker (except the last one) receives $l_q - 1$ extra characters.
This produces an extra bandwidth usage $b_e$ of:
$$ b_e = (l_q - 1) \cdot (n - 1) $$
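The following single-process Python sketch illustrates the Bigger Chunk idea: overlapping chunks, an independent search per chunk, and a reduce step that needs no de-duplication. It assumes that $n$ divides the DNA length and uses hypothetical function names; the workers are simulated by a plain loop rather than by MPI processes.

```python
def bigger_chunks(dna, n, lq):
    """Split dna into n chunks of l_d/n characters plus l_q - 1 overlap characters."""
    base = len(dna) // n  # l_d / n, assuming n divides l_d
    # The last chunk has no following characters, so it simply stays shorter.
    return [dna[i * base:(i + 1) * base + lq - 1] for i in range(n)]


def search_chunk(chunk, query):
    """Offsets of every exact occurrence of query inside one chunk."""
    return [off for off in range(len(chunk) - len(query) + 1)
            if chunk.startswith(query, off)]


def bigger_chunk_search(dna, query, n):
    """Simulate the workers in a loop and translate offsets to absolute positions."""
    base = len(dna) // n
    positions = []
    for i, chunk in enumerate(bigger_chunks(dna, n, len(query))):
        positions.extend(i * base + off for off in search_chunk(chunk, query))
    return positions


def extra_bandwidth(lq, n):
    """b_e = (l_q - 1) * (n - 1) extra characters sent by the master."""
    return (lq - 1) * (n - 1)


# Both occurrences below straddle a chunk boundary and are each found exactly once.
print(bigger_chunk_search("ACGTACGTACGT", "GTAC", 3))  # [2, 6]
print(extra_bandwidth(4, 3))                           # 6
```

Each possible starting position belongs to exactly one chunk, which is why no occurrence is reported twice.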
REDUCE PHASE
Bigger Chunk:
analogous to the generic approach.
On Demand:
analogous to the generic approach.
Side to Side Sliding Query (3SQ):
the master node performs additional work to combine the partial
matchings.
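ON DEMAND – BIG REQUEST APPROACH
[Figure: the query string slides over chunk $i$ (size $l_d/n$) at steps $T_0, \dots, T_j, T_{j+1}, T_{j+2}, T_{j+3}$. Character comparisons are marked as matching (v) or mismatching (x). Alignments inside the chunk give complete matchings; an alignment reaching the end of chunk $i$ gives a partial matching that continues into chunk $i+1$.]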
ON DEMAND – BIG REQUEST APPROACH
ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicated occurrences
• each worker performs a single request
DISADVANTAGES:
• extra overhead for the big request
• some of the extra characters received may turn out to be useless
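ON DEMAND – SMALL REQUEST APPROACH
[Figure: the query string slides over chunk $i$ (size $l_d/n$) at steps $T_0, \dots, T_j, T_{j+1}, T_{j+2}, T_{j+3}$. Character comparisons are marked as matching (v) or mismatching (x). Alignments inside the chunk give complete matchings; an alignment reaching the end of chunk $i$ gives a partial matching that continues into chunk $i+1$.]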
ON DEMAND – SMALL REQUEST APPROACH
ADVANTAGES:
• extra data is requested only when needed
• it does not produce duplicated occurrences
• better bandwidth usage than big request
DISADVANTAGES:
• the number of requests grows proportionally to the length of the
query
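The trade-off between the two heuristics can be illustrated with the sketch below. The slides do not spell out the exact request sizes, so the sketch makes two assumptions: a big request asks once for all $l_q - 1$ characters that could possibly be needed, while a small request asks for one character at a time. The `fetch` helper is hypothetical and stands in for a message to the master or to the neighboring worker.

```python
def make_fetch(dna, chunk_end):
    """Hypothetical stand-in for a request to the master or to the right neighbor."""
    def fetch(offset, count):
        begin = chunk_end + offset
        return dna[begin:begin + count]
    return fetch


def on_demand_worker(chunk, query, fetch, big_request):
    """Return (match offsets, number of requests, extra characters received)."""
    lq = len(query)
    data = chunk       # grows whenever extra characters arrive
    requests = 0
    exhausted = False  # True once the end of the DNA string was reached
    matches = []
    for start in range(len(chunk)):
        k = 0
        while k < lq:
            if start + k == len(data):
                # A partial match of k characters reached the end of the known data.
                if exhausted:
                    break
                count = len(chunk) + lq - 1 - len(data) if big_request else 1
                extra = fetch(len(data) - len(chunk), count)
                requests += 1
                exhausted = len(extra) < count
                data += extra
                if start + k == len(data):
                    break  # nothing new arrived
            if data[start + k] != query[k]:
                break
            k += 1
        if k == lq:
            matches.append(start)
    return matches, requests, len(data) - len(chunk)


dna = "ACGTACGTACGT"
chunk, fetch = dna[:4], make_fetch(dna, 4)
print(on_demand_worker(chunk, "GTAC", fetch, big_request=True))   # ([2], 1, 3)
print(on_demand_worker(chunk, "GTAC", fetch, big_request=False))  # ([2], 2, 2)
```

In this example the big request costs one message and three characters, while the small request costs two messages and only the two characters actually needed.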
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Two kinds of communication can be adopted:
Centralized:
the request is made to the master node
Distributed:
the request is made to the adjacent node on the right
ON DEMAND – CENTRALIZED VS DISTRIBUTED
Centralized
• Advantages: master idle time is reduced.
• Disadvantages: a linearization point is added; the master must
access the DNA to serve each request.
Distributed
• Advantages: no extra accesses to the DNA are needed; no linearization
point.
• Disadvantages: extra data requests may be slowed down.
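SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
[Figure: chunks $i-1$, $i$, $i+1$, each of size $l_d/n$. Within a chunk the query string slides at steps $T_0, \dots, T_3, \dots, T_j, \dots, T_{j+3}$ and produces complete matchings. At the right end of a chunk the query overlaps only with its first characters (right-side partial matching); at the left end it overlaps only with its last characters (left-side partial matching).]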
SIDE TO SIDE SLIDING QUERY (3SQ) APPROACH
ADVANTAGES:
• no extra data is required
• it does not produce duplicated occurrences
• no extra communication is needed
• the master does not need to store the DNA string
• it reduces the bandwidth needed to check cross-chunk strings, because
workers return bits instead of integers.
DISADVANTAGES:
• the master must do extra work (combining the partial matchings)
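3SQ REDUCE PHASE
Worker $i$, right-side array: 1 1 0 1
Worker $i+1$, left-side array: 1 0 0 1
Results array (bitwise AND): 1 0 0 1
Final output (positions of the set bits $j$ and $k$):
$$ s_j = i \cdot (l_d / n) - j, \qquad s_k = i \cdot (l_d / n) - k $$
[Figure: the two resulting query matches, each of size $l_q$, straddle the boundary between the chunks of worker $i$ and worker $i+1$ in the DNA string.]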
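One way to build and combine the two bit arrays is sketched below. The indexing convention is an assumption made for the example: bit $j$ (for $j = 1, \dots, l_q - 1$) refers to the alignment that places the first $j$ query characters at the end of the left chunk, and `boundary` is the absolute position where the right chunk starts. Chunks are assumed to be at least $l_q - 1$ characters long.

```python
def right_side(chunk, query):
    """Bit j-1 is 1 if the last j characters of chunk equal the first j of query."""
    return [int(chunk.endswith(query[:j])) for j in range(1, len(query))]


def left_side(chunk, query):
    """Bit j-1 is 1 if chunk starts with the query characters that follow the first j."""
    return [int(chunk.startswith(query[j:])) for j in range(1, len(query))]


def combine(right_bits, left_bits, boundary):
    """AND the two arrays; a set bit j means a match starting at boundary - j."""
    return sorted(
        boundary - j
        for j, (r, l) in enumerate(zip(right_bits, left_bits), start=1)
        if r & l
    )


chunks = ["ACGT", "ACGT", "ACGT"]  # DNA "ACGTACGTACGT" split into 3 chunks
query = "GTAC"
print(right_side(chunks[0], query))  # [0, 1, 0]
print(left_side(chunks[1], query))   # [0, 1, 0]
print(combine(right_side(chunks[0], query), left_side(chunks[1], query), boundary=4))  # [2]
```

Only the master's `combine` step is sequential; the two arrays are computed by the workers independently.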
PARALLELIZATION MODEL
The model is the same as for simple search.
• Splitting phase:
same as in simple search
• Computation phase:
• The similarity function is evaluated for every alignment of the query string
• As in simple search, cross-chunk strings must be considered
• Every worker returns its $n$ best similarity values, with their relative
positions
• Reduce phase:
All similarity values are merged in order and the best $n$ alignments
are returned
REDUCE PHASE
Workers output (offset, similarity):
• Worker 1: (X, 10), (Y, 8), (Z, 3)
• Worker 2: (U, 7), (V, 2), (W, -1)
• Worker 3: (A, 5), (B, -3), (C, -6)
Ordered merge (worker ID, offset, similarity):
(1, X, 10), (1, Y, 8), (2, U, 7), (3, A, 5), (1, Z, 3), (2, V, 2), (2, W, -1), (3, B, -3), (3, C, -6)
Position translation, final output (position, similarity):
(X', 10), (Y', 8), (U', 7)
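The ordered merge followed by position translation can be sketched with Python's `heapq.merge`. The similarity values are the ones from the slide; the numeric offsets, the 0-based worker IDs and the chunk size are hypothetical, since the slide only shows symbolic offsets.

```python
import heapq
from itertools import islice


def reduce_best(workers_output, chunk_size, n_best):
    """Merge per-worker lists (sorted by decreasing similarity) and keep the n_best.

    workers_output maps a 0-based worker ID to a list of (offset, similarity).
    Returns (absolute position, similarity) pairs.
    """
    streams = [
        [(sim, worker_id, off) for off, sim in results]
        for worker_id, results in workers_output.items()
    ]
    # Every stream is already sorted from best to worst, so a k-way merge suffices.
    merged = heapq.merge(*streams, key=lambda entry: entry[0], reverse=True)
    return [
        (worker_id * chunk_size + off, sim)
        for sim, worker_id, off in islice(merged, n_best)
    ]


workers_output = {
    0: [(4, 10), (9, 8), (1, 3)],
    1: [(0, 7), (5, 2), (7, -1)],
    2: [(3, 5), (8, -3), (2, -6)],
}
print(reduce_best(workers_output, chunk_size=10, n_best=3))
# [(4, 10), (9, 8), (10, 7)]
```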
CROSS-CHUNK MATCHING
Bigger chunk:
The master sends every worker a chunk of size $s \le l_d/n + l_q - 1$, so
that cross-chunk similarities can be evaluated.
Side to Side Sliding Query (3SQ):
Every worker receives a chunk of size $s = l_d/n$ and computes its similarity
values and all its partial similarities (left side and right side). The
master then sums the partial similarities to obtain the similarity values
of the cross-chunk strings.
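3SQ PARTIAL SIMILARITY COMBINE PHASE
Worker $i$, right-side array: 4 2 0 1
Worker $i+1$, left-side array: 1 0 3 -4
Results array (element-wise sum): 5 2 3 -3
Output (worker ID, offset, similarity): ($i$, $s_j$, 5), ($i$, $s_k$, 3), …
Offset of entry $j$:
$$ s_j = l_c - (l_q - 1) + j $$
[Figure: query alignments of size $l_q$ that straddle the boundary between chunk $i$ and chunk $i+1$ of the DNA string; $j$ and $k$ mark the entries of the results array reported in the output.]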
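The combine step for approximate search replaces the AND with a sum. The sketch below assumes score values x = 1 and y = -1 and the following convention: entry $j$ (for $j = 1, \dots, l_q - 1$) refers to the alignment whose first $j$ query characters fall at the end of the left chunk. Both choices are illustrative assumptions.

```python
def score(a, b, x=1, y=-1):
    """Similarity of two equally long strings: x per match, y per mismatch."""
    return sum(x if p == q else y for p, q in zip(a, b))


def right_partials(chunk, query):
    """Entry j-1: score of the last j chunk characters against the first j of query."""
    return [score(chunk[-j:], query[:j]) for j in range(1, len(query))]


def left_partials(chunk, query):
    """Entry j-1: score of the chunk's first characters against query[j:]."""
    return [score(chunk[:len(query) - j], query[j:]) for j in range(1, len(query))]


def combine_partials(right, left):
    """Element-wise sum: the full similarity of each cross-chunk alignment."""
    return [r + l for r, l in zip(right, left)]


# The arrays shown on the slide.
print(combine_partials([4, 2, 0, 1], [1, 0, 3, -4]))  # [5, 2, 3, -3]

# Two adjacent chunks of "ACGTACGT": the alignment with j = 2 is a perfect match.
print(combine_partials(right_partials("ACGT", "GTAC"), left_partials("ACGT", "GTAC")))
# [-4, 4, -4]
```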
OVERVIEW
Varying parameters:
• Number of workers
• Query length
We plan to measure the running time of every algorithm presented.
The analysis of these results will validate our proposal and show
which algorithm performs best.
DEVELOPMENT AND BENCHMARKING
• Implementation
• Introduction
• DNA Splitting
• Bandwidth usage
• Communication
• Benchmarking
• Testing environment
• Test plan
• Results
• Conclusions
INTRODUCTION
Every proposed algorithm has been implemented in C using the
Open MPI library.
Advantages:
• High performance
• Scalability
• Portability
DESIGN CHOICES: DNA SPLITTING
A natural approach: load the DNA entirely from file, compute its size ($l_d$),
split it into $n$ chunks and send them to the workers.
Problems:
A genome may be very large ($3.0 \times 10^9$ bp, base pairs), so the
available memory may not be enough.
Design choice:
The whole DNA is never actually needed at once, so it is never loaded
entirely in memory. First the DNA size and the chunk size are computed;
then, step by step, $l_c$ characters are read from file and sent to a worker.
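The streaming split can be sketched as follows. The sketch assumes a seekable binary file, takes $l_c$ as the ceiling of $l_d / n$, and only yields the chunks; in the real program each chunk would be sent to its worker instead of being yielded.

```python
import io
import os


def chunk_stream(dna_file, n_workers):
    """Yield (worker_id, chunk) pairs, reading one chunk of l_c bytes at a time."""
    dna_file.seek(0, os.SEEK_END)
    dna_length = dna_file.tell()              # l_d, obtained without reading the data
    dna_file.seek(0)
    chunk_size = -(-dna_length // n_workers)  # l_c = ceil(l_d / n)
    for worker_id in range(n_workers):
        # Only one chunk is held in memory at any time.
        yield worker_id, dna_file.read(chunk_size)


dna_file = io.BytesIO(b"ACGTACGTAC")  # stands in for a large file opened with open(path, "rb")
for worker_id, chunk in chunk_stream(dna_file, 3):
    print(worker_id, chunk)
```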
DESIGN CHOICES: BANDWIDTH USAGE
The messages exchanged during the simple search computation would
normally consist of:
• Characters (splitting phase)
• Integers (reduce phase)
Bandwidth usage:
• Splitting phase: 1 byte (char size) × $l_c$ × $n$
• Reduce phase: 4 bytes (integer size) × $l_c$ × $n$ (extreme case in which every position is a matching)
Can we do better? Yes!
DESIGN CHOICES: BANDWIDTH USAGE
In the Simple Search algorithm, the results can be compressed to
drastically reduce bandwidth consumption.
Reduce-phase compression: instead of sending the actual positions,
each worker sends a bit array of size $l_c$.
Bit array construction:
for each position, the bit is set to 1 if a matching starts at that
position, and to 0 otherwise.
Compression ratio:
1:32 (e.g., the space taken by 4 integers encodes 128 positions instead of 4)
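A minimal sketch of the bit-array encoding is given below, assuming one bit per chunk position with the least significant bit of each byte used first (the bit order is an arbitrary choice for the example).

```python
def positions_to_bits(positions, chunk_size):
    """Encode match offsets as a bit array with one bit per chunk position."""
    bits = bytearray((chunk_size + 7) // 8)
    for p in positions:
        bits[p // 8] |= 1 << (p % 8)
    return bytes(bits)


def bits_to_positions(bits, chunk_size):
    """Decode the bit array back into the sorted list of match offsets."""
    return [p for p in range(chunk_size) if (bits[p // 8] >> (p % 8)) & 1]


encoded = positions_to_bits([0, 5, 127], chunk_size=128)
print(len(encoded))                     # 16 bytes, the size of four 4-byte integers
print(bits_to_positions(encoded, 128))  # [0, 5, 127]
```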
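COMMUNICATION
For each phase: message size, data type, communication type.
Bigger Chunk
• Master to workers: $N(l_d/n + l_q - 1)$; Char; Async / Sync
• Extra communication: none
• Workers to master: $N(l_d/n)$; Int / Bit; Sync / Sync
On Demand
• Master to workers: $N(l_d/n)$; Char; Async / Sync
• Extra communication: $(N-1)(l_q - 1)$; Char; Sync / Sync
• Workers to master: $N(l_d/n)$; Int / Bit; Sync / Sync
3SQ
• Master to workers: $N(l_d/n)$; Char; Async / Sync
• Extra communication: none
• Workers to master: $N(l_d/n) + 2(l_q - 1)$; Int / Bit; Sync / Sync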
TESTING ENVIRONMENT
Cluster: 8 nodes, 100 Mbps Ethernet connection
Node:
• CPU: Intel Xeon Dual Core 2.8 GHz
• RAM: 4 GB
• Hard drive: 2 × 30 GB SCSI
Software:
• OS: Debian 6.0.4
• Open MPI 1.6.1
TEST PLAN
Benchmarking consisted of measuring and comparing the running
times of each algorithm as a function of the following
parameters:
• Number of processors (number of workers + 1): 2, 4, 8, 16
• DNA length: Small (5 MB), Medium (149 MB), Large (292 MB)
• Query length: Small (8 bytes), Medium (32 bytes), Large (64 bytes)
• Number of best alignments (approximate search only): 10, 50, 100
On the slide, the value at which each parameter is held fixed when it is
not the one being evaluated is shown in grey.
SIMPLE SEARCH: NUMBER OF WORKERS (1/2)
Results:
• Good scalability for every algorithm.
• 3SQ performs worse than the others because it must perform
additional sequential work.
SIMPLE SEARCH: NUMBER OF WORKERS (2/2)
Results:
• Bigger Chunk with bit arrays performs better than the integer solution.
• As the number of processors increases, Bigger Chunk performs better
than the others, because more cross-chunk matchings occur.
• No relevant improvement was observed between 8 and 16 processors.
SIMPLE SEARCH: SPEED UP
[Chart: "Speed Up Simple Search", DNA size: Big, query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 4). Series: BC-bit, OD-cent, BC-int, OD-dist, 3SQ.]
Results:
• The speedup increases for every algorithm (except BC-int).
• The speedup grows proportionally to $n + 1$ (the number of processors).
• BC-int suffers from a network bottleneck due to the size of the
messages.
SIMPLE SEARCH: DNA LENGTH
Results:
• Good scalability for every algorithm.
• 3SQ performs worse than the others because of its additional
sequential work.
• Bigger Chunk with bit arrays performs better than the integer solution.
• Execution time grows linearly with the DNA size.
SIMPLE SEARCH: QUERY LENGTH
Results:
• 3SQ is highly sensitive to variations in query length, because of
the phase that combines partial matchings.
• No significant variation for the other algorithms, since the
matching of a single query alignment stops at the first mismatch.
APPROXIMATE SEARCH: NUMBER OF WORKERS
Results:
• Running times decrease linearly with the number of processors.
• 3SQ is only slightly worse than Bigger Chunk because the
sequential work is almost the same (ordered merge).
APPROXIMATE SEARCH: SPEED UP
[Chart: "Speed Up Approximate Search", DNA size: Medium, query size: Small. X axis: number of processors (2, 4, 8, 16); Y axis: speedup (0 to 16). Series: 3SQ, BC-int.]
Results:
The speedup is better overall than for simple search, and close to the
ideal value.
APPROXIMATE SEARCH: DNA SIZE
Results:
Running times grow linearly with the DNA size.
Motivation:
The main sequential computation is the ordered merge, which has
linear complexity.
APPROXIMATE SEARCH: QUERY SIZE
Results:
Running times are influenced by the query size.
Motivation:
The cost of computing the similarity function depends on the query
length.
APPROXIMATE SEARCH: NUMBER OF BEST ALIGNMENTS
Results:
Running times grow almost linearly.
Motivation:
Each worker returns to the master as many results as the requested number
of best alignments, and this affects the ordered merge.
[Chart: "Approximate Search", DNA size: Big, processors: 16, query size: Small. X axis: number of best alignments (10, 50, 100); Y axis: running time in seconds (0 to 35). Series: BC-int, 3SQ.]
The winner is….
Bigger Chunk
On Demand
3SQ 
IMPROVEMENTS
Further improvements can be applied to the presented algorithms.
Splitting phase: the DNA alphabet consists of only 4 characters, so 2 bits
are enough to represent each character, instead of 8 bits.
Bit mapping:
e.g. A=00, T=01, C=10, G=11
Compression ratio:
1:4 (e.g., the space of one 8-bit character holds 4 bases instead of 1)
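A sketch of the proposed 2-bit encoding, using the mapping given above, is shown below. The placement of the first base in the lowest bits of each byte is an arbitrary assumption, and the original length must be transmitted alongside the packed data so that padding bits can be ignored.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(dna):
    """Pack 4 bases into each byte, 2 bits per base."""
    packed = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from the packed bytes."""
    return "".join(
        BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


dna = "ATCGATCGAT"
packed = pack(dna)
print(len(dna), len(packed))             # 10 3
print(unpack(packed, len(dna)) == dna)   # True
```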
IMPROVEMENTS
3SQ algorithm:
The phase that combines partial matchings can be performed in a
distributed manner.
Each node sends its left or right partial matchings to its left or right
neighbor, which combines them with its own results and sends them to the
master.
In this way the sequential work can be reduced.
Thanks !
