Slightly modified version of slides on BWA-MEM2 that I presented at IPDPS'19 for the paper: Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
1. Intel Labs
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems
Vasimuddin Md., Sanchit Misra, Heng Li, Srinivas Aluru
May 21, 2019
2. Intel Labs
BIGstack: Broad Intel Genomics stack
Optimized Broad Software on Top of Reference Architecture Design
3. Intel Labs
Primer on the Human Genome
3 billion base-pairs over 23 chromosome-pairs
23 sequences over ∑ = {A, C, G, T}
Exactly the same DNA across the cells of a body
Human ~ Human: 99.5% similarity
6. Intel Labs
Obtaining the Genome of an Individual
1 human genome -> get reads (30X coverage): 1.2 billion paired-end reads of length 151 -> map to the reference sequence
Times (from the figure): Illumina HiSeq X 10: 28 min; BWA-MEM*: 164 min; BWA-MEM2*: 64 min
Among the most popular tools (~70K users)
*On a single socket of the Intel® Xeon® Platinum 8180 Processor
9. Intel Labs
Population Genomics, Approaching Worldwide Scale
Source: Frost & Sullivan, “Global Precision Medicine Growth Opportunities, Forecast to 2025”, January 2017
100 million - 2 billion human genomes expected to be sequenced by 2025!
(That’s ~10-200 exabytes!)
Stephens et al. Big Data: Astronomical or Genomical? PLOS Biology (2015)
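To sanity-check the exabyte figure, a quick back-of-the-envelope calculation (assuming roughly 100 GB of raw data per 30X human genome, a commonly used ballpark, not a number from the slide):

```python
# Back-of-the-envelope data volume for 100 million - 2 billion genomes.
# Assumption (not from the slide): ~100 GB of raw data per 30X genome.
BYTES_PER_GENOME = 100e9
EXABYTE = 1e18

low = 100e6 * BYTES_PER_GENOME / EXABYTE   # 100 million genomes
high = 2e9 * BYTES_PER_GENOME / EXABYTE    # 2 billion genomes
print(low, high)  # -> 10.0 200.0 (exabytes), matching the slide's range
```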
12. Intel Labs
Accelerating BWA-MEM has Proven Difficult
3 key kernels (each quite complex) consuming 15-45% of the time
– SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), and BSW (Banded Smith Waterman), with several heuristics
– Different kernels can be the most time consuming depending on the data
– Time not covered by the kernels (Misc) is also significant
The majority of other approaches target 1-2 of the 3 kernels on GPGPUs/FPGAs
– and pipeline the rest on the host CPU
– Performance is bound by the non-optimized kernels running on the CPU
No published work contains a holistic architecture-aware optimization of the BWA-MEM software on multicore systems.
Approach              SMEM         SAL              BSW                    Overall    Strictly same output?
Multiple approaches   - (CPU)      - (CPU)          1.6x-3x (GPGPU/FPGA)   1.45x-2x   No (bypasses some heuristics, different output)
Chang et al. 2016     4x (FPGA)    - (CPU)          - (CPU)                1.26x      No
Ahmed et al. 2015     1.7x (CPU)   2.8x (4 FPGAs)   5.7x (4 FPGAs)         2.6x       No
13. Intel Labs
System Configuration
Intel® Xeon® Platinum 8180 Processor (referred to as SKX in the rest of the presentation)
Sockets x Cores x Threads: 2 x 28 x 2
VPUs/Core x AVX register width: 2 x {512, 256, 128}
Base clock frequency: 2.5 GHz
L1D/L2 cache per core: 32 KB / 1024 KB
L3 cache per socket: 38.5 MB
DRAM size per socket, bandwidth: 96 GB, 114 GB/s
Compiler version: ICC v17.0.2
Performance on multiple sockets can be achieved by simply distributing the reads equally, and load imbalance is usually not an issue. Therefore, our efforts focus on single-socket performance.
14. Intel Labs
Datasets
Reference sequence: half of the human genome (version HG38) - 1.5 billion nucleotides
Read datasets:
Dataset   # Reads        Read Length   Source
D1        5 x 10^5       151           Broad Institute
D2        5 x 10^5       151           Broad Institute
D3        1.25 x 10^6    76            NCBI SRA: SRX020470
D4        1.25 x 10^6    101           NCBI SRA: SRX207170
D5        1.25 x 10^6    101           NCBI SRA: SRX206890
15. Intel Labs
End to End Performance Gains On SKX – Compute Only
Our output is identical to original BWA-MEM
(Charts: Single Thread of SKX; Single socket (56 threads/28 cores) of SKX)
17. Intel Labs
The Problem – Mapping to the Reference Sequence
(Figure: reads S1, S2, S3, S4, ..., Sm mapped against reference R; example query Q: CCCTCCTATTTAAC)
Find the best matches of 𝑄 in 𝑅
19. Intel Labs
FM-Index of the Reference Sequence
FM-index of a sample reference sequence: AGTGGA.
It consists of the Suffix Array, the Burrows-Wheeler Transform (BWT), and the O and D arrays.
Since the BW-matrix is lexicographically sorted, all occurrences of a query appear contiguously in the suffix array (SA). These contiguous locations are called the SA interval.
(Figure annotations, sizes for the human genome: 30 GB and 1.5 GB; DRAM per socket: 96 GB)
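The backward-search mechanics can be shown with a tiny self-contained sketch (a naive textbook FM-index for the sample reference "AGTGGA", not the BWA-MEM2 implementation; bucketing and the D array are omitted):

```python
# Toy FM-index for the slide's sample reference "AGTGGA" ('$' = sentinel).
ref = "AGTGGA$"

# Suffix array: starting positions of suffixes in lexicographic order.
sa = sorted(range(len(ref)), key=lambda i: ref[i:])

# BWT: the character preceding each sorted suffix.
bwt = [ref[i - 1] for i in sa]

# C[c]: number of characters in ref strictly smaller than c.
alphabet = sorted(set(ref))
C = {c: sum(ref.count(d) for d in alphabet if d < c) for c in alphabet}

def occ(c, k):
    """Occurrences of c in bwt[0:k] (an uncompressed O array)."""
    return bwt[:k].count(c)

def sa_interval(query):
    """Backward search: the [lo, hi) rows of the BW-matrix matching query."""
    lo, hi = 0, len(ref)
    for c in reversed(query):
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return None  # query does not occur in ref
    return lo, hi

lo, hi = sa_interval("G")
# The SA interval maps back to all positions of "G" in "AGTGGA":
print(sorted(sa[i] for i in range(lo, hi)))  # -> [1, 3, 4]
```

Because the BW-matrix rows are sorted, every occurrence of "G" lands in one contiguous interval of the suffix array, exactly as the slide describes.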
20. Intel Labs
Compressed FM-Index in BWA-MEM
To reduce the memory footprint, the O array is divided into buckets of size 𝜂
For each bucket
– nucleotide counts are stored for all the previous buckets
– the corresponding BWT string of size 𝜂 is stored in a 2-bit-per-nucleotide format
Example (𝜂 = 128): for a t-th index that falls in the third bucket, O(G, t) = stored count + occurrences of G in the bucket's BWT prefix = 256 + 1 = 257
(Figure: buckets [A:0 C:0 G:0 T:0 | GGAAC.....AGCT] [A:35 C:30 G:31 T:32 | TGAGC.....AGCT] [A:266 C:250 G:256 T:252 | CGCCA.....TGAT])
Fig. based on Jing Zhang et al., CCGrid 2013
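A sketch of the bucketed lookup (the layout details here are illustrative, following the slide's figure; the third bucket's counts come from the figure, the other chunks are fillers):

```python
# Bucketed O-array lookup: O(c, t) = per-bucket stored count (covering all
# previous buckets) + occurrences of c in this bucket's BWT chunk up to t.
ETA = 128  # bucket size in the original compressed FM-index

def occ(buckets, c, t):
    """buckets: list of (counts, bwt_chunk) pairs, each chunk of length ETA."""
    b, i = divmod(t, ETA)
    counts, chunk = buckets[b]
    return counts[c] + chunk[:i].count(c)

# Third bucket's counts are taken from the slide's figure; the chunk
# contents of the first two buckets are fillers for illustration only.
buckets = [
    ({"A": 0, "C": 0, "G": 0, "T": 0}, "G" * ETA),
    ({"A": 0, "C": 0, "G": ETA, "T": 0}, "T" * ETA),
    ({"A": 266, "C": 250, "G": 256, "T": 252}, "CGCCA" + "T" * (ETA - 5)),
]

# An absolute index t whose bucket-local offset lands just past the 'G':
t = 2 * ETA + 2
print(occ(buckets, "G", t))  # 256 stored + 1 in "CG" -> 257, as on the slide
```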
22. Intel Labs
BWA-MEM Algorithm
Seeding – Look for exact matches (regions) in the reference sequence for the substrings (seeds) of the query using the compressed FM-Index
– Super Maximal Exact Match (SMEM)
– Suffix Array Lookup (SAL)
– Chaining
Extension – Extend the matches on either side to get end-to-end matches. Select matches with high similarity
– Banded Smith Waterman (BSW)
SAM-Form – Format the output in the SAM format
– Reorganization
26. Intel Labs
SMEM Algorithm from BWA-MEM - For One Position
1. Find maximal-length query substrings with matches
2. Output the matches
Reference: ATTCTTATGTA
Read: GTTAC
Forward extension phase:
1. GTTAC - find T: <T, 7, 12>
2. GTTAC - find TA: <TA, 7, 8>; <T, 7, 12>
3. GTTAC - find TAC: not found; keep <TA, 7, 8>; <T, 7, 12>
Backward extension phase
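What the two phases compute can be cross-checked with a brute-force sketch (naive substring search instead of FM-index extension; illustrative only):

```python
# Brute-force SMEMs covering one read position: matches of read substrings
# that contain the position and are not contained in a longer match. The
# FM-index forward/backward extension phases compute the same set.
ref = "ATTCTTATGTA"
read = "GTTAC"

def smems_covering(pos):
    matches = set()
    for i in range(pos + 1):                      # substring start <= pos
        for j in range(pos + 1, len(read) + 1):   # substring end > pos
            if read[i:j] in ref:
                matches.add((i, j))
    # Keep only matches not contained in a longer match.
    return sorted((i, j) for (i, j) in matches
                  if not any(a <= i and j <= b and (a, b) != (i, j)
                             for (a, b) in matches))

# For the slide's example position (the second 'T', from which the forward
# phase finds T, then TA, and fails on TAC):
print(smems_covering(2))  # -> [(1, 4)], i.e. the SMEM "TTA"
```

Forward extension finds TA, and backward extension then grows it to TTA, matching the brute-force answer.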
43. Intel Labs
SMEM Algorithm
No spatial locality
Large # of instructions for 𝜂 = 128
New values in the tuple depend on the current values and the current nucleotide
44. Intel Labs
SMEM Algorithm – Key Optimizations
Software Prefetching
– For any tuple that is added to the backward search buffer, we know the memory locations that will be accessed when the corresponding backward search occurs
– So, we software-prefetch them and hide the prefetch latency behind computation
45. Intel Labs
SMEM Algorithm – Key Optimizations
Reducing 𝜂 and vectorization
– Reduced the value of 𝜂 to 32
– Store the BWT string in a 1-byte-per-nucleotide format - 32 bytes total
– Process the 32-byte BWT using byte-level AVX2 intrinsics to get the number of occurrences of a nucleotide
– The four counts consume 4 bytes per letter - 16 bytes total
– Added 16 bytes of padding to make 64 bytes, aligned to a cache-line boundary
– one cache line ensures the whole bucket can be prefetched using one instruction
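The resulting one-cache-line bucket can be sketched as a 64-byte record (the field order below is an assumption for illustration; the real code processes the chunk with AVX2 intrinsics rather than Python):

```python
import struct

ETA = 32  # reduced bucket size

def pack_bucket(counts, bwt_chunk):
    """Pack one bucket into 64 bytes: 4 x uint32 counts (A,C,G,T) = 16 bytes,
    32 BWT bytes (1 byte per nucleotide), and 16 bytes of padding."""
    assert len(bwt_chunk) == ETA
    rec = struct.pack("<4I", *(counts[c] for c in "ACGT"))
    rec += bwt_chunk.encode("ascii") + b"\x00" * 16
    assert len(rec) == 64  # exactly one cache line -> one prefetch
    return rec

def occ_in_bucket(rec, c, i):
    """Occurrence count of c up to the i-th position inside this bucket."""
    counts = dict(zip("ACGT", struct.unpack_from("<4I", rec, 0)))
    chunk = rec[16:16 + ETA].decode("ascii")
    return counts[c] + chunk[:i].count(c)

rec = pack_bucket({"A": 266, "C": 250, "G": 256, "T": 252},
                  "CGCCA" + "T" * (ETA - 5))
print(occ_in_bucket(rec, "G", 2))  # -> 257
print(occ_in_bucket(rec, "C", 4))  # -> 253
```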
47. Intel Labs
Suffix Array Lookup - SAL
SMEM outputs the suffix array interval
Each suffix array index in the interval is looked up to get the reference sequence coordinate (expression shown in the slide figure)
Optimization:
– Original BWA-MEM uses a compressed suffix array to reduce the memory footprint - but there is sufficient memory on current systems
– So, we simply use the uncompressed suffix array and look it up with a single array access
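The trade-off can be sketched on a toy index (the sampling scheme shown, keeping entries at positions divisible by d and recovering the rest via LF-mapping steps, is a common textbook variant, not necessarily BWA-MEM's exact compression):

```python
# Compressed vs. uncompressed suffix-array lookup on a toy FM-index.
ref = "AGTGGA$"
sa = sorted(range(len(ref)), key=lambda i: ref[i:])
bwt = [ref[i - 1] for i in sa]
alphabet = sorted(set(ref))
C = {c: sum(ref.count(d) for d in alphabet if d < c) for c in alphabet}

def lf(row):
    """LF-mapping: row of the suffix starting one position earlier."""
    c = bwt[row]
    return C[c] + bwt[:row].count(c)

D = 4  # sampling rate of the compressed SA (illustrative)
sampled = {row: sa[row] for row in range(len(sa)) if sa[row] % D == 0}

def sal_compressed(row):
    steps = 0
    while row not in sampled:  # walk until a sampled entry is reached
        row = lf(row)
        steps += 1
    return sampled[row] + steps  # up to D-1 extra memory accesses per lookup

def sal_uncompressed(row):
    return sa[row]  # one array access - what BWA-MEM2 does

assert all(sal_compressed(r) == sal_uncompressed(r) for r in range(len(sa)))
```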
48. Intel Labs
SAL - Results
System: SKX, #Threads = 1
Input data created by intercepting the data to the SAL stage from an actual run using 600,000 reads from D2
183x speedup
53. Intel Labs
Banded Smith Waterman - BSW
Only a diagonal band is computed
Size of the band can dynamically change from top to bottom
Various conditions of early exit
Low parallelism within one matrix computation
Scoring uses affine gaps (a gap-open penalty and a gap-extension penalty), with 𝑓(𝑎, 𝑏) = match parameter if a = b, mismatch parameter otherwise
(Figures: regular Smith-Waterman vs. banded Smith-Waterman from BWA-MEM)
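A minimal scalar sketch of banded Smith-Waterman with affine gaps (fixed band width and illustrative penalties; BWA-MEM changes the band dynamically and adds early-exit heuristics on top of this):

```python
# Banded local alignment with affine gaps; only cells with |i - j| <= band
# are computed. Off-band neighbors read as 0, i.e. a local-alignment restart.
def banded_sw(q, t, band=2, match=1, mismatch=-4, gap_open=6, gap_ext=1):
    NEG = float("-inf")
    n, m = len(q), len(t)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in target
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in query
    best = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            E[i][j] = max(E[i][j - 1] - gap_ext, H[i][j - 1] - gap_open - gap_ext)
            F[i][j] = max(F[i - 1][j] - gap_ext, H[i - 1][j] - gap_open - gap_ext)
            s = match if q[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# With a narrow band, the off-diagonal "TTA" hit is missed (score 2);
# widening the band recovers it (score 3).
print(banded_sw("GTTAC", "ATTCTTATGTA"))           # -> 2.0
print(banded_sw("GTTAC", "ATTCTTATGTA", band=11))  # -> 3.0
```

This also shows why the band matters: far fewer cells are computed, at the risk of missing alignments lying far from the diagonal.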
58. Intel Labs
BSW – Optimizations – Inter-task Vectorization
We hand-vectorized using AVX512 SIMD intrinsics
Challenges
– Variable and dynamically changing band size
– Early exits
– Overhead of dynamic band computation
Sort the sequences according to band sizes to make the computation across the pairs being vectorized more uniform
Convert the sequences from AoS to SoA format to avoid gather/scatter cost
SIMD operations used
– cmp, blend, max, mov, add, sub, and mask
Precision
– Lower precision provides more performance
– The precision required depends on the maximum score, which depends on the sequence lengths
– We choose 8-bit or 16-bit precision based on sequence lengths
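The payoff of sorting by band size can be illustrated with a simple cost model (hypothetical task sizes; each SIMD batch is assumed to run as long as its longest lane):

```python
import random

SIMD_WIDTH = 16  # lanes per batch (illustrative; e.g. 512-bit / 32-bit)

def batch_cost(tasks):
    """Each batch of SIMD_WIDTH tasks costs SIMD_WIDTH * max(band in batch),
    since all lanes run until the longest task in the batch finishes."""
    total = 0
    for k in range(0, len(tasks), SIMD_WIDTH):
        total += SIMD_WIDTH * max(tasks[k:k + SIMD_WIDTH])
    return total

random.seed(0)
bands = [random.randint(4, 64) for _ in range(1024)]  # hypothetical band sizes

useful = sum(bands)                     # work actually required
unsorted_util = useful / batch_cost(bands)
sorted_util = useful / batch_cost(sorted(bands))
print(unsorted_util, sorted_util)  # sorting raises lane utilization
```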
62. Intel Labs
BSW - Results
System: SKX, #Threads = 1
Input: 48 million sequence pairs obtained by intercepting the input to this stage from a full application run. Read dataset used for the full run: D3.
Speedups (chart labels): 6.7x and 11.6x
~14x reduction in # of instructions
IPC is reduced because the majority of instructions in the optimized code are SIMD instructions
There are 2 ports for SIMD (VPUs), but 4 for scalar
Why not 512/8 = 64x speedup?
– Only 43% of the time is spent on cell computation using SIMD
– In which ~50% of the lanes are idle - so, effectively ~21.5% for cell computation
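The slide's utilization arithmetic, spelled out:

```python
# Ideal lanes at 8-bit precision in a 512-bit register:
ideal_lanes = 512 // 8                    # -> 64
# But only 43% of runtime is SIMD cell computation, and ~50% of lanes idle:
effective_fraction = 0.43 * 0.50          # -> 0.215, i.e. ~21.5%
print(ideal_lanes, effective_fraction)
```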
63. Intel Labs
Multithread Scaling
Scaling of the three kernels and the entire application from 1 to 28 cores on SKX
We demonstrate nearly equal or better scaling on all kernels
Application scaling is worse due to bad scaling of the “Misc” section
64. Intel Labs
End to End Performance Results – Compute only
All kernels retain their speedup in the end-to-end run
SAL barely contributes to the run time due to its 183x speedup
(Charts: Single Thread of SKX; Single socket (56 threads/28 cores) of SKX)
65. Intel Labs
BWA-MEM2 Open Sourcing
Drop-in replacement
– Supported executions: AVX512, AVX2, SSE4.1, scalar
– Supported functionality: all the functionality of BWA-MEM, including single-end and paired-end alignments
– Output: identical to BWA-MEM
– Command-line interface: exactly the same as BWA-MEM
Future steps
– Algorithmic, implementation-level (Misc), and architectural improvements
https://github.com/bwa-mem2/bwa-mem2