Slightly modified version of slides on BWA-MEM2 that I presented at IPDPS'19 for the paper: Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
1. Intel Labs
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems
Vasimuddin Md., Sanchit Misra, Heng Li, Srinivas Aluru
May 21, 2019
2. Intel Labs
BIGstack: Broad Intel Genomics stack
Optimized Broad Software on Top of Reference Architecture Design
3. Intel Labs
Primer on the Human Genome
3 billion base-pairs over 23 chromosome-pairs
23 sequences over ∑ = {A, C, G, T}
Exactly the same DNA across the cells of a body
Human ~ Human: 99.5% similarity
6. Intel Labs
Obtaining the Genome of an Individual
1 human genome -> get reads (30X coverage): 1.2 billion paired-end reads of length 151 -> map to the reference sequence
Times (from the figure): Illumina HiSeq X 10: 28 min; BWA-MEM*: 164 min; BWA-MEM2*: 64 min
Among the most popular tools (~70K users)
*On a single socket of the Intel® Xeon® Platinum 8180 Processor
9. Intel Labs
Population Genomics, Approaching Worldwide Scale
Source: Frost & Sullivan, “Global Precision Medicine Growth Opportunities, Forecast to 2025”, January 2017
100 million - 2 billion human genomes expected to be sequenced by 2025!
(That’s ~10-200 exabytes!)
Stephens et al. Big Data: Astronomical or Genomical? PLOS Biology (2015)
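To sanity-check the exabyte figure, a quick back-of-the-envelope calculation (assuming roughly 100 GB of raw data per 30X human genome, a commonly used ballpark, not a number from the slide):

```python
# Back-of-the-envelope data volume for 100 million - 2 billion genomes.
# Assumption (not from the slide): ~100 GB of raw data per 30X genome.
BYTES_PER_GENOME = 100e9
EXABYTE = 1e18

low = 100e6 * BYTES_PER_GENOME / EXABYTE   # 100 million genomes
high = 2e9 * BYTES_PER_GENOME / EXABYTE    # 2 billion genomes
print(low, high)  # -> 10.0 200.0 (exabytes), matching the slide's range
```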
12. Intel Labs
Accelerating BWA-MEM has Proven Difficult
3 key kernels (each quite complex) consuming 15-45% of the time
– SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), and BSW (Banded Smith Waterman), with several heuristics
– Different kernels can be the most time consuming depending on the data
– Time not covered by the kernels (Misc) is also significant
The majority of other approaches target 1-2 of the 3 kernels on GPGPUs/FPGAs
– and pipeline the rest on the host CPU
– Performance is bound by the non-optimized kernels running on the CPU
No published work contains a holistic architecture-aware optimization of the BWA-MEM software on multicore systems.
Approach              SMEM         SAL              BSW                    Overall    Strictly same output?
Multiple approaches   - (CPU)      - (CPU)          1.6x-3x (GPGPU/FPGA)   1.45x-2x   No (bypasses some heuristics, different output)
Chang et al. 2016     4x (FPGA)    - (CPU)          - (CPU)                1.26x      No
Ahmed et al. 2015     1.7x (CPU)   2.8x (4 FPGAs)   5.7x (4 FPGAs)         2.6x       No
13. Intel Labs
System Configuration
Intel® Xeon® Platinum 8180 Processor (referred to as SKX in the rest of the presentation)
Sockets x Cores x Threads: 2 x 28 x 2
VPUs/Core x AVX register width: 2 x {512, 256, 128}
Base clock frequency: 2.5 GHz
L1D/L2 cache per core: 32 KB / 1024 KB
L3 cache per socket: 38.5 MB
DRAM size per socket, bandwidth: 96 GB, 114 GB/s
Compiler version: ICC v17.0.2
Performance on multiple sockets can be achieved by simply distributing the reads equally, and load imbalance is usually not an issue. Therefore, our efforts focus on single-socket performance.
14. Intel Labs
Datasets
Reference sequence: half of the human genome (version HG38) - 1.5 billion nucleotides
Read datasets:
Dataset   # Reads        Read Length   Source
D1        5 x 10^5       151           Broad Institute
D2        5 x 10^5       151           Broad Institute
D3        1.25 x 10^6    76            NCBI SRA: SRX020470
D4        1.25 x 10^6    101           NCBI SRA: SRX207170
D5        1.25 x 10^6    101           NCBI SRA: SRX206890
15. Intel Labs
End to End Performance Gains On SKX – Compute Only
Our output is identical to original BWA-MEM
(Charts: Single Thread of SKX; Single socket (56 threads/28 cores) of SKX)
17. Intel Labs
The Problem – Mapping to the Reference Sequence
(Figure: reads S1, S2, S3, S4, ..., Sm mapped against reference R; example query Q: CCCTCCTATTTAAC)
Find the best matches of 𝑄 in 𝑅
19. Intel Labs
FM-Index of the Reference Sequence
FM-index of a sample reference sequence: AGTGGA.
It consists of the Suffix Array, the Burrows-Wheeler Transform (BWT), and the O and D arrays.
Since the BW-matrix is lexicographically sorted, all occurrences of a query appear contiguously in the suffix array (SA). These contiguous locations are called the SA interval.
(Figure annotations, sizes for the human genome: 30 GB and 1.5 GB; DRAM per socket: 96 GB)
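The backward-search mechanics can be shown with a tiny self-contained sketch (a naive textbook FM-index for the sample reference "AGTGGA", not the BWA-MEM2 implementation; bucketing and the D array are omitted):

```python
# Toy FM-index for the slide's sample reference "AGTGGA" ('$' = sentinel).
ref = "AGTGGA$"

# Suffix array: starting positions of suffixes in lexicographic order.
sa = sorted(range(len(ref)), key=lambda i: ref[i:])

# BWT: the character preceding each sorted suffix.
bwt = [ref[i - 1] for i in sa]

# C[c]: number of characters in ref strictly smaller than c.
alphabet = sorted(set(ref))
C = {c: sum(ref.count(d) for d in alphabet if d < c) for c in alphabet}

def occ(c, k):
    """Occurrences of c in bwt[0:k] (an uncompressed O array)."""
    return bwt[:k].count(c)

def sa_interval(query):
    """Backward search: the [lo, hi) rows of the BW-matrix matching query."""
    lo, hi = 0, len(ref)
    for c in reversed(query):
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return None  # query does not occur in ref
    return lo, hi

lo, hi = sa_interval("G")
# The SA interval maps back to all positions of "G" in "AGTGGA":
print(sorted(sa[i] for i in range(lo, hi)))  # -> [1, 3, 4]
```

Because the BW-matrix rows are sorted, every occurrence of "G" lands in one contiguous interval of the suffix array, exactly as the slide describes.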
20. Intel Labs
Compressed FM-Index in BWA-MEM
To reduce the memory footprint, the O array is divided into buckets of size 𝜂
For each bucket
– nucleotide counts are stored for all the previous buckets
– the corresponding BWT string of size 𝜂 is stored in a 2-bit-per-nucleotide format
Example (𝜂 = 128): for a t-th index that falls in the third bucket, O(G, t) = stored count + occurrences of G in the bucket's BWT prefix = 256 + 1 = 257
(Figure: buckets [A:0 C:0 G:0 T:0 | GGAAC.....AGCT] [A:35 C:30 G:31 T:32 | TGAGC.....AGCT] [A:266 C:250 G:256 T:252 | CGCCA.....TGAT])
Fig. based on Jing Zhang et al., CCGrid 2013
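A sketch of the bucketed lookup (the layout details here are illustrative, following the slide's figure; the third bucket's counts come from the figure, the other chunks are fillers):

```python
# Bucketed O-array lookup: O(c, t) = per-bucket stored count (covering all
# previous buckets) + occurrences of c in this bucket's BWT chunk up to t.
ETA = 128  # bucket size in the original compressed FM-index

def occ(buckets, c, t):
    """buckets: list of (counts, bwt_chunk) pairs, each chunk of length ETA."""
    b, i = divmod(t, ETA)
    counts, chunk = buckets[b]
    return counts[c] + chunk[:i].count(c)

# Third bucket's counts are taken from the slide's figure; the chunk
# contents of the first two buckets are fillers for illustration only.
buckets = [
    ({"A": 0, "C": 0, "G": 0, "T": 0}, "G" * ETA),
    ({"A": 0, "C": 0, "G": ETA, "T": 0}, "T" * ETA),
    ({"A": 266, "C": 250, "G": 256, "T": 252}, "CGCCA" + "T" * (ETA - 5)),
]

# An absolute index t whose bucket-local offset lands just past the 'G':
t = 2 * ETA + 2
print(occ(buckets, "G", t))  # 256 stored + 1 in "CG" -> 257, as on the slide
```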
22. Intel Labs
BWA-MEM Algorithm
Seeding – Look for exact matches (regions) in the reference sequence for the substrings (seeds) of the query using the compressed FM-Index
– Super Maximal Exact Match (SMEM)
– Suffix Array Lookup (SAL)
– Chaining
Extension – Extend the matches on either side to get end-to-end matches. Select matches with high similarity
– Banded Smith Waterman (BSW)
SAM-Form – Format the output in the SAM format
– Reorganization
26. Intel Labs
SMEM Algorithm from BWA-MEM - For One Position
1. Find maximal-length query substrings with matches
2. Output the matches
Reference: ATTCTTATGTA
Read: GTTAC
Forward extension phase:
1. GTTAC - find T: <T, 7, 12>
2. GTTAC - find TA: <TA, 7, 8>; <T, 7, 12>
3. GTTAC - find TAC: not found; keep <TA, 7, 8>; <T, 7, 12>
Backward extension phase
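What the two phases compute can be cross-checked with a brute-force sketch (naive substring search instead of FM-index extension; illustrative only):

```python
# Brute-force SMEMs covering one read position: matches of read substrings
# that contain the position and are not contained in a longer match. The
# FM-index forward/backward extension phases compute the same set.
ref = "ATTCTTATGTA"
read = "GTTAC"

def smems_covering(pos):
    matches = set()
    for i in range(pos + 1):                      # substring start <= pos
        for j in range(pos + 1, len(read) + 1):   # substring end > pos
            if read[i:j] in ref:
                matches.add((i, j))
    # Keep only matches not contained in a longer match.
    return sorted((i, j) for (i, j) in matches
                  if not any(a <= i and j <= b and (a, b) != (i, j)
                             for (a, b) in matches))

# For the slide's example position (the second 'T', from which the forward
# phase finds T, then TA, and fails on TAC):
print(smems_covering(2))  # -> [(1, 4)], i.e. the SMEM "TTA"
```

Forward extension finds TA, and backward extension then grows it to TTA, matching the brute-force answer.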
43. Intel Labs
SMEM Algorithm
No spatial locality
Large # of instructions for 𝜂 = 128
New values in the tuple depend on the current values and the current nucleotide
44. Intel Labs
SMEM Algorithm – Key Optimizations
Software Prefetching
– For any tuple that is added to the backward search buffer, we know the memory locations that will be accessed when the corresponding backward search occurs
– So, we software-prefetch them and hide the prefetch latency behind computation
45. Intel Labs
SMEM Algorithm – Key Optimizations
Reducing 𝜂 and vectorization
– Reduced the value of 𝜂 to 32
– Store the BWT string in a 1-byte-per-nucleotide format - 32 bytes total
– Process the 32-byte BWT using byte-level AVX2 intrinsics to get the number of occurrences of a nucleotide
– The four counts consume 4 bytes per letter - 16 bytes total
– Added 16 bytes of padding to make 64 bytes, aligned to a cache-line boundary
– one cache line ensures the whole bucket can be prefetched using one instruction
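The resulting one-cache-line bucket can be sketched as a 64-byte record (the field order below is an assumption for illustration; the real code processes the chunk with AVX2 intrinsics rather than Python):

```python
import struct

ETA = 32  # reduced bucket size

def pack_bucket(counts, bwt_chunk):
    """Pack one bucket into 64 bytes: 4 x uint32 counts (A,C,G,T) = 16 bytes,
    32 BWT bytes (1 byte per nucleotide), and 16 bytes of padding."""
    assert len(bwt_chunk) == ETA
    rec = struct.pack("<4I", *(counts[c] for c in "ACGT"))
    rec += bwt_chunk.encode("ascii") + b"\x00" * 16
    assert len(rec) == 64  # exactly one cache line -> one prefetch
    return rec

def occ_in_bucket(rec, c, i):
    """Occurrence count of c up to the i-th position inside this bucket."""
    counts = dict(zip("ACGT", struct.unpack_from("<4I", rec, 0)))
    chunk = rec[16:16 + ETA].decode("ascii")
    return counts[c] + chunk[:i].count(c)

rec = pack_bucket({"A": 266, "C": 250, "G": 256, "T": 252},
                  "CGCCA" + "T" * (ETA - 5))
print(occ_in_bucket(rec, "G", 2))  # -> 257
print(occ_in_bucket(rec, "C", 4))  # -> 253
```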
47. Intel Labs
Suffix Array Lookup - SAL
SMEM outputs the suffix array interval
Each suffix array index in the interval is looked up to get the reference sequence coordinate (expression shown in the slide figure)
Optimization:
– Original BWA-MEM uses a compressed suffix array to reduce the memory footprint - but there is sufficient memory on current systems
– So, we simply use the uncompressed suffix array and look it up with a single array access
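The trade-off can be sketched on a toy index (the sampling scheme shown, keeping entries at positions divisible by d and recovering the rest via LF-mapping steps, is a common textbook variant, not necessarily BWA-MEM's exact compression):

```python
# Compressed vs. uncompressed suffix-array lookup on a toy FM-index.
ref = "AGTGGA$"
sa = sorted(range(len(ref)), key=lambda i: ref[i:])
bwt = [ref[i - 1] for i in sa]
alphabet = sorted(set(ref))
C = {c: sum(ref.count(d) for d in alphabet if d < c) for c in alphabet}

def lf(row):
    """LF-mapping: row of the suffix starting one position earlier."""
    c = bwt[row]
    return C[c] + bwt[:row].count(c)

D = 4  # sampling rate of the compressed SA (illustrative)
sampled = {row: sa[row] for row in range(len(sa)) if sa[row] % D == 0}

def sal_compressed(row):
    steps = 0
    while row not in sampled:  # walk until a sampled entry is reached
        row = lf(row)
        steps += 1
    return sampled[row] + steps  # up to D-1 extra memory accesses per lookup

def sal_uncompressed(row):
    return sa[row]  # one array access - what BWA-MEM2 does

assert all(sal_compressed(r) == sal_uncompressed(r) for r in range(len(sa)))
```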
48. Intel Labs
SAL - Results
System: SKX, #Threads = 1
Input data created by intercepting the data to the SAL stage from an actual run using 600,000 reads from D2
183x speedup
53. Intel Labs
Banded Smith Waterman - BSW
Only a diagonal band is computed
Size of the band can dynamically change from top to bottom
Various conditions of early exit
Low parallelism within one matrix computation
Scoring uses affine gaps (a gap-open penalty and a gap-extension penalty), with 𝑓(𝑎, 𝑏) = match parameter if a = b, mismatch parameter otherwise
(Figures: regular Smith-Waterman vs. banded Smith-Waterman from BWA-MEM)
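A minimal scalar sketch of banded Smith-Waterman with affine gaps (fixed band width and illustrative penalties; BWA-MEM changes the band dynamically and adds early-exit heuristics on top of this):

```python
# Banded local alignment with affine gaps; only cells with |i - j| <= band
# are computed. Off-band neighbors read as 0, i.e. a local-alignment restart.
def banded_sw(q, t, band=2, match=1, mismatch=-4, gap_open=6, gap_ext=1):
    NEG = float("-inf")
    n, m = len(q), len(t)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in target
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in query
    best = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            E[i][j] = max(E[i][j - 1] - gap_ext, H[i][j - 1] - gap_open - gap_ext)
            F[i][j] = max(F[i - 1][j] - gap_ext, H[i - 1][j] - gap_open - gap_ext)
            s = match if q[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# With a narrow band, the off-diagonal "TTA" hit is missed (score 2);
# widening the band recovers it (score 3).
print(banded_sw("GTTAC", "ATTCTTATGTA"))           # -> 2.0
print(banded_sw("GTTAC", "ATTCTTATGTA", band=11))  # -> 3.0
```

This also shows why the band matters: far fewer cells are computed, at the risk of missing alignments lying far from the diagonal.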
58. Intel Labs
BSW – Optimizations – Inter-task Vectorization
We hand-vectorized using AVX512 SIMD intrinsics
Challenges
– Variable and dynamically changing band size
– Early exits
– Overhead of dynamic band computation
Sort the sequences according to band sizes to make the computation across the pairs being vectorized more uniform
Convert the sequences from AoS to SoA format to avoid gather/scatter cost
SIMD operations used
– cmp, blend, max, mov, add, sub, and mask
Precision
– Lower precision provides more performance
– The precision required depends on the maximum score, which depends on the sequence lengths
– We choose 8-bit or 16-bit precision based on sequence lengths
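The payoff of sorting by band size can be illustrated with a simple cost model (hypothetical task sizes; each SIMD batch is assumed to run as long as its longest lane):

```python
import random

SIMD_WIDTH = 16  # lanes per batch (illustrative; e.g. 512-bit / 32-bit)

def batch_cost(tasks):
    """Each batch of SIMD_WIDTH tasks costs SIMD_WIDTH * max(band in batch),
    since all lanes run until the longest task in the batch finishes."""
    total = 0
    for k in range(0, len(tasks), SIMD_WIDTH):
        total += SIMD_WIDTH * max(tasks[k:k + SIMD_WIDTH])
    return total

random.seed(0)
bands = [random.randint(4, 64) for _ in range(1024)]  # hypothetical band sizes

useful = sum(bands)                     # work actually required
unsorted_util = useful / batch_cost(bands)
sorted_util = useful / batch_cost(sorted(bands))
print(unsorted_util, sorted_util)  # sorting raises lane utilization
```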
62. Intel Labs
BSW - Results
System: SKX, #Threads = 1
Input: 48 million sequence pairs obtained by intercepting the input to this stage from a full application run. Read dataset used for the full run: D3.
Speedups (chart labels): 6.7x and 11.6x
~14x reduction in # of instructions
IPC is reduced because the majority of instructions in the optimized code are SIMD instructions
There are 2 ports for SIMD (VPUs), but 4 for scalar
Why not 512/8 = 64x speedup?
– Only 43% of the time is spent on cell computation using SIMD
– In which ~50% of the lanes are idle - so, effectively ~21.5% for cell computation
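The slide's utilization arithmetic, spelled out:

```python
# Ideal lanes at 8-bit precision in a 512-bit register:
ideal_lanes = 512 // 8                    # -> 64
# But only 43% of runtime is SIMD cell computation, and ~50% of lanes idle:
effective_fraction = 0.43 * 0.50          # -> 0.215, i.e. ~21.5%
print(ideal_lanes, effective_fraction)
```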
63. Intel Labs
Multithread Scaling
Scaling of the three kernels and the entire application from 1 to 28 cores on SKX
We demonstrate nearly equal or better scaling on all kernels
Application scaling is worse due to bad scaling of the “Misc” section
64. Intel Labs
End to End Performance Results – Compute only
All kernels retain their speedup in the end-to-end run
SAL barely contributes to the run time due to its 183x speedup
(Charts: Single Thread of SKX; Single socket (56 threads/28 cores) of SKX)
65. Intel Labs
BWA-MEM2 Open Sourcing
Drop-in replacement
– Supported executions: AVX512, AVX2, SSE4.1, scalar
– Supported functionality: all the functionality of BWA-MEM, including single-end and paired-end alignments
– Output: identical to BWA-MEM
– Command-line interface: exactly the same as BWA-MEM
Future steps
– Algorithmic, implementation-level (Misc), and architectural improvements
https://github.com/bwa-mem2/bwa-mem2