Intel Labs
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems
Vasimuddin Md., Sanchit Misra, Heng Li, Srinivas Aluru
May 21, 2019
BIGstack: Broad Intel Genomics stack
Optimized Broad Software on Top of Reference Architecture Design
Primer on Human Genome
• 3 billion base pairs over 23 chromosome pairs
• 23 sequences over Σ = {A, C, G, T}
• Exactly the same DNA across the cells of a body
• Human ~ Human: 99.5% similarity
Obtaining Genome of an Individual
1 human genome: get reads (30X coverage), giving 1.2 billion paired-end reads of length 151, then map them to the reference sequence
Step                           Tool                 Time
Get reads                      Illumina HiSeq X 10  28 min
Map to the reference sequence  BWA-MEM*             164 min
Map to the reference sequence  BWA-MEM2*            64 min
Among the most popular tools (~70K users)
*On a single-socket Intel® Xeon® Platinum 8180 Processor
Genome Data Will Dwarf Everything Else
Population Genomics, Approaching Worldwide Scale
Source: Frost & Sullivan, “Global Precision Medicine Growth Opportunities, Forecast to 2025”, January 2017
100 million to 2 billion human genomes are expected to be sequenced by 2025!
(That’s ~10-200 exabytes!)
Stephens et al., “Big Data: Astronomical or Genomical?”, PLOS Biology (2015)
Accelerating BWA-MEM has Proven Difficult
3 key kernels (each quite complex) consuming 15-45% of time
– SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), BSW (Banded Smith Waterman) with several heuristics
– Different kernels can be the most time consuming depending on data
– Time not covered by the kernels (Misc) is also significant
Majority of other approaches target 1-2 of the 3 kernels on GPGPU/FPGA
– pipeline the rest on the host CPU
– Performance is bound by the non-optimized kernels running on the CPU
Approach             SMEM        SAL             BSW                   Overall
Multiple approaches  - (CPU)     - (CPU)         1.6x-3x (GPGPU/FPGA)  1.45x-2x
Chang et al. 2016    4x (FPGA)   - (CPU)         - (CPU)               1.26x
Ahmed et al. 2015    1.7x (CPU)  2.8x (4 FPGAs)  5.7x (4 FPGAs)        2.6x
Bypasses some of the heuristics – gets different output – strict no-no
No published work contains a holistic architecture-aware optimization of BWA-MEM software on multicore systems.
System Configuration
Processor: Intel® Xeon® Platinum 8180
Name used in the rest of the presentation: SKX
Sockets x Cores x Threads: 2 x 28 x 2
VPUs/Core x AVX register width: 2 x {512, 256, 128}
Base clock frequency: 2.5 GHz
L1D/L2 cache per core: 32/1024 KB
L3 cache per socket: 38.5 MB
DRAM size per socket, bandwidth: 96 GB, 114 GB/s
Compiler version: ICC v. 17.0.2
Scaling to multiple sockets only requires distributing the reads equally across sockets; load imbalance is usually not an issue.
Therefore, our efforts are focused on single-socket performance.
Datasets
Reference sequence: half of the human genome (version HG38), 1.5 billion nucleotides
Read datasets:
Dataset  # Reads      Read length  Source
D1       5 x 10^5     151          Broad Institute
D2       5 x 10^5     151          Broad Institute
D3       1.25 x 10^6  76           NCBI SRA: SRX020470
D4       1.25 x 10^6  101          NCBI SRA: SRX207170
D5       1.25 x 10^6  101          NCBI SRA: SRX206890
End to End Performance Gains On SKX – Compute Only
Our output is identical to original BWA-MEM
[Figure: performance gains in two panels, "Single thread of SKX" and "Single socket (56 threads/28 cores) of SKX"]
Optimization Details
The Problem – Mapping to the Reference Sequence
[Figure: reference R with candidate match locations S1, S2, S3, S4, ..., Sm, and query Q; example sequence CCCTCCTATTTAAC]
Find the best matches of Q in R
FM-Index of the Reference Sequence
FM-index of a sample reference sequence: AGTGGA.
It consists of the suffix array (SA), the Burrows-Wheeler Transform (BWT), and the O and D arrays.
Since the BW-matrix is lexicographically sorted, all occurrences of a query appear contiguously in the suffix array. This contiguous range of locations is called the SA interval.
Sizes for human genome (figure labels): 30 GB, 1.5 GB, 96 GB
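The structure described above can be sketched for the sample reference AGTGGA. This is a minimal illustration, not BWA-MEM's implementation: it assumes D holds, for each character, the number of smaller characters in the text, and that O(c, i) counts occurrences of c in the first i BWT characters. The function names `build_fm_index` and `sa_interval` are hypothetical.

```python
def build_fm_index(ref):
    """Build a toy FM-index (suffix array, BWT, D and O arrays) for ref."""
    text = ref + "$"  # sentinel, lexicographically smallest character
    # Suffix array: start positions of all suffixes in sorted order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # D[c] (assumed meaning): number of characters in text smaller than c.
    D, total = {}, 0
    for c in alphabet:
        D[c] = total
        total += text.count(c)
    # O[c][i] (assumed meaning): occurrences of c in bwt[0:i].
    O = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            O[c][i + 1] = O[c][i] + (ch == c)
    return sa, bwt, D, O


def sa_interval(query, D, O, n):
    """Backward search: half-open SA interval [lo, hi) of query, or None."""
    lo, hi = 0, n
    for c in reversed(query):
        if c not in D:
            return None
        lo = D[c] + O[c][lo]
        hi = D[c] + O[c][hi]
        if lo >= hi:
            return None
    return lo, hi


sa, bwt, D, O = build_fm_index("AGTGGA")
lo, hi = sa_interval("G", D, O, len(bwt))
# All occurrences of "G" sit in one contiguous block of the suffix array.
print(bwt, sorted(sa[lo:hi]))
```

Every suffix array entry inside the returned interval is a position where the query occurs in the reference, which is why a single interval is enough to describe all matches.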
Compressed FM-Index in BWA-MEM
• To reduce memory footprint, the O array is divided into buckets of size η
• For each bucket
– nucleotide counts are stored for all the previous buckets
– the corresponding BWT string of size η is stored in a 2-bit per nucleotide format
Example buckets for η = 128 (fig. based on Jing Zhang et al., CCGrid 2013):
Counts before bucket (A, C, G, T)  BWT string of the bucket
0, 0, 0, 0                         GGAAC…..AGCT
35, 30, 31, 32                     TGAGC…..AGCT
266, 250, 256, 252                 CGCCA…..TGAT
For the t-th index in the BWT string, falling in the last bucket shown: O(G, t) = 256 + 1 = 257
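The bucketed lookup can be sketched as follows. This is an illustrative assumption, not the BWA-MEM code: each bucket keeps its chunk as a plain Python string instead of a 2-bit packed word, O(c, t) is taken to count occurrences up to and including index t, and the names `build_buckets` and `occ` are hypothetical.

```python
ETA = 128  # bucket size used on the slide


def build_buckets(bwt, eta=ETA):
    """Split bwt into buckets of (counts before the bucket, BWT chunk)."""
    buckets = []
    counts = {c: 0 for c in "ACGT"}
    for start in range(0, len(bwt), eta):
        chunk = bwt[start:start + eta]
        buckets.append((dict(counts), chunk))  # snapshot of earlier counts
        for ch in chunk:
            if ch in counts:
                counts[ch] += 1
    return buckets


def occ(buckets, c, t, eta=ETA):
    """Occurrences of c in bwt[0..t] (inclusive): stored count + in-bucket scan."""
    counts, chunk = buckets[t // eta]
    return counts[c] + chunk[:t % eta + 1].count(c)


bwt = "ACGT" * 100  # toy BWT string, 400 characters
buckets = build_buckets(bwt)
print(occ(buckets, "G", 257))
```

The in-bucket scan is the part whose cost grows with η, which is the trade-off between memory footprint and instruction count that the later optimization slides revisit.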
BWA-MEM Algorithm
Seeding – Look for exact matches (regions) in the reference sequence for the substrings (seeds) of the query using the compressed FM-index
– Super Maximal Exact Match (SMEM)
– Suffix Array Lookup (SAL)
– Chaining
Extension – Extend the matches on either side to get end-to-end matches. Select matches with high similarity
– Banded Smith Waterman (BSW)
SAM-Form – Format the output in the SAM format
– Reorganization
SMEM Algorithm from BWA-MEM - For One Position
1. Find maximal length query substrings with matches
2. Output the matches
Reference: ATTCTTATGTA
Read: GTTAC
Each tuple below is <substring, first index, last index> of its SA interval.
Forward extension phase
1. Find T: <T, 7, 12>
2. Find TA: <TA, 7, 8>; keep <T, 7, 12>
3. Find TAC: not found; keep <TA, 7, 8> and <T, 7, 12>
Backward extension phase
1. <TA, 7, 8>: find TTA, giving <TTA, 11, 11>
   <T, 7, 12>: find TT, giving <TT, 11, 12>
2. <TTA, 11, 11>: find GTTA, not found; add TTA to the list of SMEMs
   <TT, 11, 12>: find GTT, not found
Output SMEMs: <TTA, 11, 11>
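The trace for the read GTTAC can be reproduced with a small sketch. It is not BWA-MEM's implementation: intervals are found by scanning a plain sorted suffix list rather than by FM-index extension, and the names `build_sa`, `sa_interval` and `smems_for_position` are hypothetical. Intervals are reported 1-based and inclusive so they line up with the tuples in the trace.

```python
def build_sa(ref):
    """Suffix array of ref + "$" (start positions in sorted suffix order)."""
    text = ref + "$"
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(ref, sa, pattern):
    """1-based inclusive SA interval of pattern, or None if it does not occur."""
    text = ref + "$"
    hits = [rank for rank, pos in enumerate(sa, start=1)
            if text.startswith(pattern, pos)]
    return (hits[0], hits[-1]) if hits else None


def smems_for_position(read, x, ref):
    """SMEMs covering read[x], returned as (start, end, SA interval)."""
    sa = build_sa(ref)
    # Forward extension: grow read[x:end] while it still matches.
    ends = []
    for end in range(x + 1, len(read) + 1):
        if sa_interval(ref, sa, read[x:end]) is None:
            break
        ends.append(end)
    ends.reverse()  # longest candidate first
    # Backward extension: prepend one character per round.
    smems, start = [], x
    while ends:
        survivors = []
        for end in ends:
            if start > 0 and sa_interval(ref, sa, read[start - 1:end]):
                survivors.append(end)
            elif not survivors and (not smems or start < smems[-1][0]):
                # No longer candidate could be extended and this match is
                # not contained in an SMEM already reported.
                smems.append((start, end, sa_interval(ref, sa, read[start:end])))
        ends = survivors
        start -= 1
    return smems


print(smems_for_position("GTTAC", 2, "ATTCTTATGTA"))
```

The result describes read[1:4], that is TTA, with SA interval 11 to 11, matching the SMEM reported at the end of the trace.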
SMEM Algorithm from BWA-MEM: For One Position
[Figure: the FM-index and the query, with position m marked. Forward extension produces tuples with columns (start, end, p, q): (m, m+1, p1, q1), (m, m+2, p2, q2), …, (m, m+k, pk, qk). Backward extension walks this list longest first and extends each tuple to the left, giving (m-1, m+k, pk', qk'), …, (m-1, m+2, p2', q2'), (m-1, m+1, p1', q1'), and in the next round (m-2, m+k, pk'', qk''), …, (m-2, m+2, p2'', q2'').]
SMEM Algorithm
[Figure: SMEM algorithm listing, annotated with the following observations]
– No spatial locality
– Large # instructions for η = 128
– New values in the tuple depend on current values and the current nucleotide
SMEM Algorithm – Key Optimizations
• Software prefetching
– For any tuple that is added to the backward search buffer, we know the memory locations that will be accessed when the corresponding backward search occurs
– So we software-prefetch those locations and hide the prefetch latency behind computation
• Reducing η and vectorization
– Reduced the value of η to 32
– Store the BWT string using a 1-byte per nucleotide format: 32 bytes total
– Process the 32-byte BWT using byte-level AVX2 intrinsics to get the number of occurrences of a nucleotide
– The four counts consume 4 bytes per letter: 16 bytes total
– Added 16 bytes of padding to reach 64 bytes, aligned to a cache line boundary, so the whole bucket sits in one cache line and can be prefetched using one instruction
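The 64-byte bucket layout for η = 32 can be checked with a short sketch. It only illustrates the byte arithmetic: the field order, the 0-3 nucleotide codes, the 0xFF filler for a partial last bucket, and the names `pack_index` and `occ` are assumptions, and `bytes.count` stands in for the byte-level AVX2 compare and count.

```python
import struct

ETA = 32
# Assumed layout: four 4-byte counts, 32 one-byte nucleotides, 16 pad bytes.
BUCKET = struct.Struct("<4I32s16x")
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_index(bwt):
    """Pack a BWT string into consecutive 64-byte buckets."""
    counts = [0, 0, 0, 0]
    out = bytearray()
    for start in range(0, len(bwt), ETA):
        chunk = bytes(CODE[c] for c in bwt[start:start + ETA])
        chunk = chunk.ljust(ETA, b"\xff")  # filler for a partial last bucket
        out += BUCKET.pack(*counts, chunk)
        for b in chunk:
            if b < 4:
                counts[b] += 1
    return bytes(out)


def occ(index, c, t):
    """Occurrences of c in bwt[0..t] (inclusive), touching a single bucket."""
    fields = BUCKET.unpack_from(index, (t // ETA) * BUCKET.size)
    chunk = fields[4]
    # One pass over at most 32 bytes; the real code would use SIMD here.
    return fields[CODE[c]] + chunk[:t % ETA + 1].count(CODE[c])


bwt = "ACGGT" * 20
index = pack_index(bwt)
print(BUCKET.size, len(index), occ(index, "G", 57))
```

Because every bucket is exactly 64 bytes and starts at a multiple of 64, one lookup touches one cache line, which is what makes a single prefetch per future lookup sufficient.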
SMEM Algorithm – Results
System: SKX, #Threads = 1
Read dataset: 60000 reads from D2
2x speedup
Suffix Array Lookup - SAL
SMEM outputs the suffix array interval
Each suffix array index in the interval is looked up to get the reference sequence coordinate [lookup expression shown as an image on the slide]
Optimization:
– The original BWA-MEM uses a compressed suffix array to reduce memory footprint, but there is sufficient memory on current systems
– So we simply use an uncompressed suffix array and look it up directly with that expression
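The difference between the two lookups can be sketched with a generic sampled suffix array. This is an assumption for illustration: the slides do not give BWA-MEM's compression scheme, so the sampling interval `k`, the LF-mapping walk, and the function names are hypothetical stand-ins for "compressed" versus "uncompressed".

```python
def build(ref, k=4):
    """Full suffix array, BWT, D array, and every k-th SA entry for ref."""
    text = ref + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[p - 1] for p in sa]
    D, total = {}, 0
    for c in sorted(set(text)):
        D[c] = total  # number of characters smaller than c
        total += text.count(c)
    return sa, bwt, D, sa[::k]


def lookup_compressed(i, bwt, D, sampled, k=4):
    """Recover SA[i] from the samples by walking the LF-mapping."""
    steps = 0
    while i % k != 0:
        c = bwt[i]
        i = D[c] + bwt[:i].count(c)  # LF-mapping: move one text position left
        steps += 1
    return (sampled[i // k] + steps) % len(bwt)


def lookup_uncompressed(i, sa):
    """Recover SA[i] with a single memory access."""
    return sa[i]


sa, bwt, D, sampled = build("ATTCTTATGTA")
print([lookup_compressed(i, bwt, D, sampled) for i in range(len(sa))] == sa)
```

Both functions return the same coordinates; the compressed variant pays several dependent memory accesses and occurrence counts per lookup, while the uncompressed one pays a single array read at the cost of storing every entry.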
SAL - Results
System: SKX, #Threads = 1
Input data created by intercepting the input to the SAL stage from an actual run using 600,000 reads from D2
183x speedup
Banded Smith Waterman - BSW
• Only a diagonal band is computed
• The size of the band can change dynamically from top to bottom
• Various conditions of early exit
• Low parallelism within one matrix computation
[Figure: dynamic programming matrices for regular Smith Waterman and for banded Smith Waterman from BWA-MEM. The recurrence, including the symbols for the gap open penalty and the gap extension penalty, was shown as an image.]
f(a, b) = match parameter if a = b; mismatch parameter otherwise
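A fixed-width banded local alignment can be sketched as below. It is a generic textbook-style affine-gap recurrence, not BWA-MEM's BSW: the dynamic band resizing and early exits listed above are left out, the scoring values (match 1, mismatch 4, gap open 6, gap extension 1) are assumed defaults, and the name `banded_sw` is hypothetical.

```python
def banded_sw(query, target, w, match=1, mismatch=-4, gap_open=6, gap_ext=1):
    """Best local alignment score using only cells with |i - j| <= w."""
    neg = float("-inf")
    n, m = len(query), len(target)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in query
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in target
    best = 0
    for i in range(1, n + 1):
        # Only the diagonal band is computed; cells outside stay at 0.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(E[i][j - 1] - gap_ext,
                          H[i][j - 1] - gap_open - gap_ext)
            F[i][j] = max(F[i - 1][j] - gap_ext,
                          H[i - 1][j] - gap_open - gap_ext)
            s = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


q, t = "AAAACCCC", "GGGGAAAACCCC"
print(banded_sw(q, t, w=4), banded_sw(q, t, w=1))
```

With w=4 the shifted diagonal holding the full match lies inside the band, so all eight matches are scored; with w=1 that diagonal is outside the band and the score drops, which shows why the band width matters.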
BSW – Optimizations – Inter-task Vectorization
We hand-vectorized using AVX512 SIMD intrinsics
Challenges
– Variable and dynamically changing band size
– Early exits
– Overhead of dynamic band computation
Sort the sequences according to band sizes to make the computation across the pairs being vectorized more uniform
Convert the sequences from AoS to SoA format to prevent gather/scatter cost
SIMD operations used
– cmp, blend, max, mov, add, sub, and mask
Precision
– Lower precision provides more performance
– The precision required depends on the maximum score, which depends on the sequence lengths
– We choose 8-bit or 16-bit precision based on sequence lengths
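The data preparation for inter-task vectorization can be sketched as follows. It is an illustration under assumptions: each pair carries a precomputed band size, only the query side is transposed (the target would be handled the same way), the 8-bit threshold in `choose_bits` is a guess at a safe bound rather than the rule used in BWA-MEM2, and all names are hypothetical.

```python
PAD = "-"  # filler for lanes whose sequence is shorter than the batch maximum


def choose_bits(max_len, match=1):
    """Pick lane precision from the largest score a sequence could reach."""
    return 8 if max_len * match <= 127 else 16  # assumed threshold


def make_batches(pairs, lanes=4):
    """Group (query, target, band) pairs into SIMD-width batches in SoA form."""
    # Sort by band size so lanes in one batch do similar amounts of work.
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][2])
    batches = []
    for s in range(0, len(order), lanes):
        ids = order[s:s + lanes]
        qlen = max(len(pairs[i][0]) for i in ids)
        # SoA: row p holds character p of every query in the batch, so one
        # contiguous load feeds all lanes without a gather.
        q_soa = ["".join(pairs[i][0][p] if p < len(pairs[i][0]) else PAD
                         for i in ids)
                 for p in range(qlen)]
        batches.append({"ids": ids, "q_soa": q_soa, "bits": choose_bits(qlen)})
    return batches


pairs = [("ACGT", "ACGT", 3), ("AC", "AG", 1), ("ACG", "ACG", 2), ("A", "A", 1)]
for batch in make_batches(pairs, lanes=2):
    print(batch)
```

On AVX512, 8-bit lanes would allow 64 pairs per batch and 16-bit lanes 32, so choosing the lower precision whenever the sequence lengths permit doubles the number of pairs processed per instruction.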
BSW - Results
System: SKX, #Threads = 1
Input: 48 million sequence pairs obtained by intercepting the input to this stage from a full application run. Read dataset used for the full run: D3.
[Figure: speedups of 11.6x and 6.7x]
~14x reduction in # instructions
IPC is reduced because the majority of instructions in the optimized code are SIMD instructions
There are 2 ports for SIMD (VPUs), but 4 for scalar
Why not a 512/8 = 64x speedup?
– Only 43% of the time is spent on cell computation using SIMD
– Within that, ~50% of lanes are idle, so effectively only ~21.5% goes to cell computation
Multithread Scaling
Scaling of the three kernels and the entire application from 1 to 28 cores on SKX
We demonstrate nearly equal or better scaling on all kernels
Application scaling is worse due to poor scaling of the “Misc” section
End to End Performance Results – Compute only
All kernels retain their speedup in the end-to-end run
SAL barely contributes to the run time due to its 183x speedup
[Figure: two panels, "Single thread of SKX" and "Single socket (56 threads/28 cores) of SKX"]
Intel Confidential – Internal Only
BWA-MEM2 Open Sourcing
Drop-in replacement
– Supported executions: AVX512, AVX2, SSE4.1, scalar
– Supported functionality: all the functionality of BWA-MEM, including single-end and paired-end alignments
– Output: identical to BWA-MEM
– Command line interface: exactly the same as BWA-MEM
Future steps
– Algorithmic, implementation-level (Misc) and architectural improvements
https://github.com/bwa-mem2/bwa-mem2
Intel Legal Disclaimers
• Intel, Xeon and Intel Xeon Phi are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. © Intel Corporation
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
• Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.
Thank You!
Vasimuddin Md: vasimuddin.md@intel.com, @wasim_galaxy
Sanchit Misra: sanchit.misra@intel.com, sanchit-misra@github.io, @sanchit_misra
Heng Li: hli@jimmy.harvard.edu, http://www.liheng.org/, @lh3lh3
Srinivas Aluru: aluru@cc.gatech.edu, https://www.cc.gatech.edu/~saluru/

Criminal IP - Threat Hunting Webinar.pdf
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 

BWA-MEM2-IPDPS 2019

  • 1. Intel Labs. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. Vasimuddin Md., Sanchit Misra, Heng Li, Srinivas Aluru. May 21, 2019
  • 2. BIGstack: Broad Intel Genomics stack. Optimized Broad Software on Top of Reference Architecture Design
  • 3. Primer on Human Genome
    – 3 billion base-pairs over 23 chromosome-pairs
    – 23 sequences over Σ = {A, C, G, T}
    – Exactly the same DNA across the cells of a body
    – Human ~ Human: 99.5% similarity
  • 6. Obtaining Genome of an Individual
    – 1 human genome → get reads (30X coverage) → 1.2 billion paired-end reads of length 151 → map to the reference sequence
    – Illumina HiSeq X 10: 28 min; BWA-MEM*: 164 min; BWA-MEM2*: 64 min
    – Among the most popular tools (~70K users)
    – *On single-socket Intel® Xeon® Platinum 8180 Processor
  • 7. Genome Data Will Dwarf Everything Else
  • 9. Population Genomics, Approaching Worldwide Scale
    – 100 million - 2 billion human genomes expected to be sequenced by 2025! (That’s ~10-200 exabytes!) Stephens et al., Big Data: Astronomical or Genomical? PLOS Biology (2015)
    – Source: Frost & Sullivan, “Global Precision Medicine Growth Opportunities, Forecast to 2025”, January 2017
  • 10. Intel Labs 3 key kernels (each quite complex) consuming 15-45% of time – SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), BSW (Banded Smith Waterman) with several heuristics – Different kernels can be the most time consuming depending on data – Time not covered by the kernels (Misc) is also significant Majority of other approaches target 1-2 of the 3 kernels on GPGPU/ FPGA – pipeline the rest on the host CPU – Performance bound by the non-optimized kernels running on CPU Accelerating BWA-MEM has Proven Difficult Approach SMEM SAL BSW Overall Multiple approaches - (CPU) - (CPU) 1.6x-3x (GPGPU/FPGA) 1.45x-2x Chang et. al. 2016 4x (FPGA) - (CPU) - (CPU) 1.26x Ahmed et. al. 2015 1.7x (CPU) 2.8x (4 FPGAs) 5.7x (4 FPGAs) 2.6x 10
  • 11. Intel Labs 3 key kernels (each quite complex) consuming 15-45% of time – SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), BSW (Banded Smith Waterman) with several heuristics – Different kernels can be the most time consuming depending on data – Time not covered by the kernels (Misc) is also significant Majority of other approaches target 1-2 of the 3 kernels on GPGPU/ FPGA – pipeline the rest on the host CPU – Performance bound by the non-optimized kernels running on CPU Accelerating BWA-MEM has Proven Difficult Approach SMEM SAL BSW Overall Multiple approaches - (CPU) - (CPU) 1.6x-3x (GPGPU/FPGA) 1.45x-2x Chang et. al. 2016 4x (FPGA) - (CPU) - (CPU) 1.26x Ahmed et. al. 2015 1.7x (CPU) 2.8x (4 FPGAs) 5.7x (4 FPGAs) 2.6x Bypasses some of the heuristics – Get different output – Strict No No 11
  • 12. Intel Labs 3 key kernels (each quite complex) consuming 15-45% of time – SMEM (Super Maximal Exact Match), SAL (Suffix Array Lookup), BSW (Banded Smith Waterman) with several heuristics – Different kernels can be the most time consuming depending on data – Time not covered by the kernels (Misc) is also significant Majority of other approaches target 1-2 of the 3 kernels on GPGPU/ FPGA – pipeline the rest on the host CPU – Performance bound by the non-optimized kernels running on CPU Accelerating BWA-MEM has Proven Difficult No published work contains a holistic architecture-aware optimization of BWA-MEM software on multicore systems. Approach SMEM SAL BSW Overall Multiple approaches - (CPU) - (CPU) 1.6x-3x (GPGPU/FPGA) 1.45x-2x Chang et. al. 2016 4x (FPGA) - (CPU) - (CPU) 1.26x Ahmed et. al. 2015 1.7x (CPU) 2.8x (4 FPGAs) 5.7x (4 FPGAs) 2.6x Bypasses some of the heuristics – Get different output – Strict No No 12
  • 13. System Configuration
    – Processor: Intel® Xeon® Platinum 8180 Processor
    – Name used in the rest of the presentation: SKX
    – Sockets x Cores x Threads: 2 x 28 x 2
    – VPUs/Core x AVX register width: 2 x {512, 256, 128}
    – Base clock frequency: 2.5 GHz
    – L1D/L2 cache / Core: 32/1024 KB
    – L3 cache / Socket: 38.5 MB
    – DRAM size / Socket, BW: 96 GB, 114 GB/s
    – Compiler version: ICC v. 17.0.2
    Performance on multiple sockets can be achieved by just distributing the reads equally, and load imbalance is usually not an issue. Therefore, our efforts are focused on single-socket performance.
  • 14. Datasets
    Reference sequence: half of the human genome (version HG38), 1.5 billion nucleotides
    Read datasets:
    Dataset | # Reads | Read Length | Dataset Source
    D1 | 5 x 10^5 | 151 | Broad Institute
    D2 | 5 x 10^5 | 151 | Broad Institute
    D3 | 1.25 x 10^6 | 76 | NCBI SRA: SRX020470
    D4 | 1.25 x 10^6 | 101 | NCBI SRA: SRX207170
    D5 | 1.25 x 10^6 | 101 | NCBI SRA: SRX206890
  • 15. End to End Performance Gains on SKX – Compute Only. Our output is identical to original BWA-MEM. Chart panels: single thread of SKX; single socket (56 threads/28 cores) of SKX
  • 17. The Problem – Mapping to the Reference Sequence. Find the best matches of Q in R. Figure labels: S1, S2, S4, S3, Sm; Reference R; CCCTCCTATTTAAC; Query Q
  • 19. FM-Index of the Reference Sequence. FM-index of a sample reference sequence: AGTGGA. It consists of the Suffix Array, the Burrows Wheeler Transform (BWT), and the O and D arrays. Since the BW-Matrix is lexicographically sorted, all the occurrences of a query appear contiguously in the suffix array (SA). These contiguous locations are called the SA interval. Sizes for human genome (figure labels): 30 GB, 1.5 GB, 96 GB
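The FM-index idea for the sample reference AGTGGA can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BWA-MEM data layout: the suffix array is built by naive sorting, `occ` plays the role of the occurrence (O) array, and `count` is a hypothetical name for the table of cumulative character counts. The function names are also made up for this sketch.

```python
def build_fm_index(ref):
    """Build a toy FM-index (suffix array, BWT, count and occ tables)."""
    text = ref + "$"                      # '$' is the end-of-text sentinel
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # count[c] = number of characters in text that are smaller than c.
    count, total = {}, 0
    for c in alphabet:
        count[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, count, occ


def sa_interval(query, sa, count, occ):
    """Backward search: return the half-open SA interval [lo, hi) or None."""
    lo, hi = 0, len(sa)
    for ch in reversed(query):
        if ch not in count:
            return None
        lo = count[ch] + occ[ch][lo]
        hi = count[ch] + occ[ch][hi]
        if lo >= hi:
            return None
    return lo, hi
```

With `sa, bwt, count, occ = build_fm_index("AGTGGA")`, calling `sa_interval("G", sa, count, occ)` yields one contiguous range of suffix-array rows, and `sa[lo:hi]` lists the reference positions where the query occurs.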
  • 20. Compressed FM-Index in BWA-MEM
    – To reduce memory footprint, the O array is divided into buckets of size η
    – For each bucket:
        – nucleotide counts are stored for all the previous buckets
        – the corresponding BWT string of size η is stored in a 2-bit per nucleotide format
    Example buckets for η = 128 (fig. based on Jing Zhang et al., CCGrid 2013):
        A:0 C:0 G:0 T:0, BWT string GGAAC…..AGCT
        A:35 C:30 G:31 T:32, BWT string TGAGC…..AGCT
        A:266 C:250 G:256 T:252, BWT string CGCCA…..TGAT
    For the t-th index in the BWT string: O(G, t) = 256 + 1 = 257
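A minimal sketch of the bucketed occurrence lookup described on this slide, assuming an inclusive convention (occurrences in positions 0..t) and a BWT over A, C, G, T only. The names `build_buckets` and `occ` are hypothetical, and the BWT chunk is kept as a Python string rather than packed at 2 bits per nucleotide.

```python
ETA = 128  # bucket size (η on the slide)


def build_buckets(bwt, eta=ETA):
    """For every bucket store (counts in all previous buckets, BWT chunk)."""
    buckets = []
    running = {c: 0 for c in "ACGT"}
    for start in range(0, len(bwt), eta):
        chunk = bwt[start:start + eta]
        buckets.append((dict(running), chunk))  # snapshot before this chunk
        for ch in chunk:
            running[ch] += 1
    return buckets


def occ(buckets, c, t, eta=ETA):
    """Occurrences of nucleotide c in bwt[0..t], inclusive (assumed convention)."""
    counts, chunk = buckets[t // eta]
    # Stored count for earlier buckets + a scan inside the current bucket.
    return counts[c] + chunk[: t % eta + 1].count(c)
```

The in-bucket scan is the part whose cost grows with η, which is why the bucket size matters for the instruction count of each lookup.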
  • 22. BWA-MEM Algorithm
    – Seeding: look for exact matches (regions) in the reference sequence for the substrings (seeds) of the query using the compressed FM-Index
        – Super Maximal Exact Match (SMEM)
        – Suffix Array Lookup (SAL)
        – Chaining
    – Extension: extend the matches on either side to get end-to-end matches; select matches with high similarity
        – Banded Smith Waterman (BSW)
    – SAM-Form: format the output in the SAM format
        – Reorganization
  • 29. SMEM Algorithm from BWA-MEM - For One Position
    Reference: ATTCTTATGTA. Read: GTTAC.
    Goal: 1. Find maximal length query substrings with matches. 2. Output the matches.
    Forward extension phase
        1. Find T: <T, 7, 12>
        2. Find TA: <TA, 7, 8>, <T, 7, 12>
        3. Find TAC: not found, so the list stays <TA, 7, 8>, <T, 7, 12>
    Backward extension phase
        1. From <TA, 7, 8>, find TTA: <TTA, 11, 11>. From <T, 7, 12>, find TT: <TT, 11, 12>
        2. From <TTA, 11, 11>, find GTTA: not found, so add TTA to the list of SMEMs. From <TT, 11, 12>, find GTT: not found
    Output SMEMs: <TTA, 11, 11>
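The forward and backward phases of this example can be sketched as follows. This is a simplification for illustration, not the BWA-MEM implementation: it uses Python substring search in place of FM-index queries, reports read coordinates instead of SA intervals, and the function name `smems_at` is hypothetical.

```python
def smems_at(ref, read, pos):
    """Return super-maximal exact matches of `read` in `ref` that cover `pos`.

    Each result is (start, end, substring) with read[start:end] == substring.
    """
    # Forward phase: grow the match to the right, remembering every end
    # position for which read[pos:end] still occurs in the reference.
    ends = []
    end = pos + 1
    while end <= len(read) and read[pos:end] in ref:
        ends.append(end)
        end += 1

    # Backward phase: extend each forward match to the left, longest first.
    smems = []
    min_start = pos + 1  # leftmost start reported so far (sentinel value)
    for end in reversed(ends):
        start = pos
        while start > 0 and read[start - 1:end] in ref:
            start -= 1
        # A shorter forward match is kept only if it reaches further left
        # than every match already reported; otherwise it is contained.
        if start < min_start:
            smems.append((start, end, read[start:end]))
            min_start = start
    return smems
```

For the slide's example, `smems_at("ATTCTTATGTA", "GTTAC", 2)` starts at the second T of the read and returns the single match TTA; TT is dropped because it is contained in TTA.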
  • 35. SMEM Algorithm from BWA-MEM: For One Position. Figure: FM-Index queries issued from position m. Forward extension produces the tuples (m, m+1, p1, q1), (m, m+2, p2, q2), ..., (m, m+k, pk, qk). Backward extension processes them from (m, m+k, pk, qk) down to (m, m+1, p1, q1), producing (m-1, m+k, pk’, qk’), ..., (m-1, m+2, p2’, q2’), (m-1, m+1, p1’, q1’), and then (m-2, m+k, pk’’, qk’’), ..., (m-2, m+2, p2’’, q2’’)
  • 43. SMEM Algorithm
    – No spatial locality
    – Large number of instructions for η = 128
    – New values in the tuple depend on the current values and the current nucleotide
  • 44. SMEM Algorithm – Key Optimizations: Software Prefetching
    – For any tuple that is added to the backward search buffer, we know the memory locations that will be accessed when the corresponding backward search occurs
    – So, we software prefetch it and hide the prefetch latency with computation
  • 45. SMEM Algorithm – Key Optimizations: Reducing η and vectorization
    – Reduced the value of η to 32
    – Store the BWT string using a 1-byte per nucleotide format: 32 bytes total
    – Process the 32-byte BWT using byte-level AVX2 intrinsics to get the number of occurrences of a nucleotide
    – The four counts consume 4 bytes per letter: 16 bytes total
    – Added 16 bytes of padding to make 64 bytes, so that each bucket is aligned to a cache line boundary and occupies one cache line; this ensures the whole bucket can be prefetched using one instruction
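The 64-byte bucket described above can be mocked up with Python's `struct` module. This is a sketch under assumptions: the field order (counts, then BWT bytes, then padding), the nucleotide codes in `CODE`, and the 0xFF filler for a short final bucket are all choices made for this illustration, and the byte-wise `count` call only stands in for the AVX2 byte compare used in the real code.

```python
import struct

ETA = 32                                  # nucleotides per bucket
# 4 x 4-byte counts (16 B) + 32 x 1-byte nucleotides (32 B) + 16 B padding.
BUCKET = struct.Struct("<4I32s16x")
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 1-byte encoding


def pack_buckets(bwt):
    """Pack a BWT string over A/C/G/T into consecutive 64-byte buckets."""
    out = bytearray()
    running = [0, 0, 0, 0]
    for start in range(0, len(bwt), ETA):
        chunk = bytes(CODE[ch] for ch in bwt[start:start + ETA])
        # Pad a short final chunk with 0xFF so filler never matches a code.
        out += BUCKET.pack(*running, chunk.ljust(ETA, b"\xff"))
        for code in chunk:
            running[code] += 1
    return bytes(out)


def occ(packed, c, t):
    """Occurrences of nucleotide c in bwt[0..t], inclusive."""
    *counts, chunk = BUCKET.unpack_from(packed, (t // ETA) * BUCKET.size)
    # One bucket read gives both the stored counts and the bytes to scan.
    return counts[CODE[c]] + chunk[: t % ETA + 1].count(CODE[c])
```

`BUCKET.size` is 64, so bucket k starts at byte offset 64·k and a lookup touches a single 64-byte record.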
  • 46. SMEM Algorithm – Results. System: SKX, #Threads = 1. Read dataset: 60000 reads from D2. 2x speedup
  • 47. Suffix Array Lookup - SAL
    – SMEM outputs the suffix array interval
    – Each suffix array index in the interval is looked up to get the reference sequence coordinate [lookup expression not captured in the transcript]
    Optimization:
    – Original BWA-MEM uses a compressed suffix array to reduce memory footprint, but there is sufficient memory on current systems
    – So, we simply use an uncompressed suffix array and look it up using the above expression
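To show why the uncompressed lookup is so much cheaper, here is a toy comparison. It assumes the common sampling scheme for compressed suffix arrays (keep every `interval`-th row and walk backwards with the LF-mapping until a sampled row is reached); the exact scheme and all names here are assumptions of this sketch, and the rank computation is deliberately naive.

```python
def build(ref):
    """Return (suffix array, BWT as a list) for ref + '$'."""
    text = ref + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    return sa, bwt


def lf(bwt, k):
    """LF-mapping: row of the suffix that starts one character earlier."""
    c = bwt[k]
    smaller = sum(1 for x in bwt if x < c)   # characters that sort before c
    rank = bwt[:k].count(c)                  # earlier occurrences of c in BWT
    return smaller + rank


def lookup_compressed(bwt, sampled, k):
    """Recover SA[k] from a sampled suffix array by walking LF steps."""
    steps = 0
    while k not in sampled:
        k = lf(bwt, k)
        steps += 1
    return (sampled[k] + steps) % len(bwt)


def lookup_uncompressed(sa, k):
    """With the full suffix array in memory the lookup is one array read."""
    return sa[k]
```

For example, with `sa, bwt = build("AGTGGA")` and `sampled = {k: sa[k] for k in range(0, len(sa), 4)}`, both lookups return the same coordinate for every row, but the compressed one may need several dependent LF steps per row.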
  • 48. SAL - Results. System: SKX, #Threads = 1. Input data created by intercepting the data to the SAL stage from an actual run using 600,000 reads from D2. 183x speedup
  • 53. Banded Smith Waterman - BSW
    Regular Smith Waterman: the recurrence uses a gap open penalty, a gap extension penalty, and f(a, b) = match parameter if a = b, mismatch parameter otherwise
    Banded Smith Waterman from BWA-MEM:
    – Only a diagonal band is computed
    – Size of the band can dynamically change from top to bottom
    – Various conditions of early exit
    – Low parallelism within one matrix computation
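A scalar sketch of a banded Smith-Waterman score computation with affine gaps. It is deliberately simpler than the kernel on the slide: the band half-width `w` is fixed, there is no dynamic band adjustment and no early exit, and it scores a plain local alignment. The default scoring values and the convention that a gap of length k costs `gap_open + k * gap_ext` are assumptions of this sketch.

```python
NEG = float("-inf")


def banded_sw(a, b, w, match=1, mismatch=-4, gap_open=6, gap_ext=1):
    """Best local alignment score of a and b using only cells with |i - j| <= w."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in a
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap in b
    best = 0
    for i in range(1, n + 1):
        # Only the cells of row i that fall inside the diagonal band.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            E[i][j] = max(H[i][j - 1] - gap_open - gap_ext, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open - gap_ext, F[i - 1][j] - gap_ext)
            f = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + f, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Cells outside the band keep a score of 0, which for a local alignment simply means "start a new alignment here". Each cell depends on its left, upper and upper-left neighbours, which is the source of the low parallelism within one matrix noted on the slide.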
  • 58. BSW – Optimizations – Inter-task Vectorization
    We hand-vectorized using AVX512 SIMD intrinsics
    Challenges
    – Variable and dynamically changing band size
    – Early exits
    – Overhead of dynamic band computation
    Sort the sequences according to band sizes to make the computation across the pairs being vectorized more uniform
    Convert the sequences from AoS to SoA format to prevent gather/scatter cost
    SIMD operations used: cmp, blend, max, mov, add, sub, and mask
    Precision
    – Lower precision provides more performance
    – The precision required depends on the maximum score, which depends on the sequence lengths
    – We choose 8-bit or 16-bit precision based on sequence lengths
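The batching steps listed above (sort by band size, convert AoS to SoA, pick a precision) can be sketched without SIMD. Everything specific here is an assumption for illustration: the tuple layout `(query, reference, band)`, the padding symbol, the function names, and especially the rule that 8-bit lanes are used whenever the largest possible score is below 256.

```python
def pick_lanes(pairs, match=1, register_bits=512):
    """Choose 8-bit or 16-bit lanes from an upper bound on the score."""
    max_score = max(min(len(q), len(r)) for q, r, _ in pairs) * match
    bits = 8 if max_score < 256 else 16   # assumed threshold
    return register_bits // bits


def to_soa(seqs, pad="N"):
    """AoS -> SoA: row k holds the k-th base of every sequence in the batch."""
    width = max(len(s) for s in seqs)
    padded = [s.ljust(width, pad) for s in seqs]
    return ["".join(column) for column in zip(*padded)]


def make_batches(pairs, lanes):
    """Sort pairs by band size, then emit SoA batches of `lanes` pairs."""
    ordered = sorted(pairs, key=lambda p: p[2])
    batches = []
    for i in range(0, len(ordered), lanes):
        batch = ordered[i:i + lanes]
        queries = to_soa([q for q, _, _ in batch])
        refs = to_soa([r for _, r, _ in batch])
        batches.append((queries, refs))
    return batches
```

In the SoA layout, one row corresponds to what a single SIMD load would fetch: the same position from every pair in the batch, with no gather needed. Sorting first keeps pairs with similar band sizes in the same batch, so fewer lanes sit idle.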
  • 62. BSW - Results
    System: SKX, #Threads = 1. Input: 48 million sequence pairs obtained by intercepting the input to this stage from a full application run. Read dataset used for full run: D3.
    – Speedups shown on the chart: 11.6x and 6.7x
    – ~14x reduction in # instructions
    – IPC is reduced because the majority of instructions in the optimized code are SIMD instructions; there are 2 ports for SIMD (VPUs), but 4 for scalar
    – Why not 512/8 = 64x speedup? Only 43% of the time is spent on cell computation using SIMD, in which ~50% of lanes are idle; so, effectively ~21.5% for cell computation
  • 63. Multithread Scaling. Scaling of the three kernels and the entire application from 1 to 28 cores on SKX. We demonstrate nearly equal or better scaling on all kernels. Application scaling is worse due to bad scaling of the “Misc” section
  • 64. End to End Performance Results – Compute only. All kernels retain their speedup in the end-to-end run. SAL barely contributes to the run time due to its 183x speedup. Chart panels: single thread of SKX; single socket (56 threads/28 cores) of SKX
  • 65. Intel Confidential – Internal Only. BWA-MEM2 Open Sourcing
    Drop-in replacement
    – Supported executions: AVX512, AVX2, SSE4.1, scalar
    – Supported functionality: all the functionality of BWA-MEM, including single-end and paired-end alignments
    – Output: identical to BWA-MEM
    – Command line interface: exactly the same as BWA-MEM
    Future steps: algorithmic, implementation-level (Misc) and architectural improvements
    https://github.com/bwa-mem2/bwa-mem2
  • 66. Intel Confidential – Internal Only Intel Legal Disclaimers  Intel, Xeon and Intel Xeon Phi are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. © Intel Corporation  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.  Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system. 66
  • 67. Intel Confidential – Internal Only Thank You! Vasimuddin Md vasimuddin.md@intel.com @wasim_galaxy Sanchit Misra sanchit.misra@intel.com sanchit-misra@github.io @sanchit_misra Heng Li hli@jimmy.harvard.edu http://www.liheng.org/ @lh3lh3 Srinivas Aluru aluru@cc.gatech.edu https://www.cc.gatech.edu/~saluru/ 67