Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs Using OpenCL
1. Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs Using OpenCL
06/07/2017 @ Oracle
Lorenzo Di Tucci
lorenzo.ditucci@polimi.it
NECST Lab, Politecnico di Milano
2. The Problem
The performance requirements of biological algorithms have increased due to:
• Large amounts of data
• Algorithm complexity
• High computational needs
In this scenario, hardware accelerators have proved effective in optimizing the performance/power-consumption ratio, thanks to their high parallelism and low power consumption.
3. Contributions
The contributions of this work are:
• An energy-efficient hardware architecture for a pure Smith-Waterman algorithm
• An implementation with an OpenCL-based design and run-time environment
• An analysis of the algorithm using the Berkeley Roofline Model
• Experimental results for the ADM-PCIE-7V3 and ADM-PCIE-KU3 boards
The results highlight the best performance w.r.t. FPGA solutions and the best performance/power-consumption ratio w.r.t. all competing devices.
4. Background
• Dynamic programming algorithm
• Performs local sequence alignment between two nucleotide or protein sequences
• Guaranteed to find the optimal local alignment with respect to the scoring system used [1]
• Highly compute intensive
• To increase system performance, the state of the art is full of implementations based on heuristics, which trade a speedup in computation for a decrease in algorithm precision
[1] Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.
5. Algorithm
The algorithm proceeds in four steps:
1. Read all inputs (query, database, scoring system)
2. Compute the max score, the similarity matrix, and the traceback matrix
3. Traceback: starting from the max score, follow the highest scores in the traceback matrix
4. Write results
Each element of the similarity matrix depends on the values:
• Over it (north)
• On its left (west)
• On its diagonal position (north-west)
Starting from the maximum value in the similarity matrix, follow the directions stored in the traceback matrix.
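The matrix fill and traceback described above can be sketched in plain Python. This is a minimal illustration assuming a linear gap penalty and illustrative match/mismatch/gap scores, not the accelerated OpenCL kernel:

```python
# Minimal Smith-Waterman sketch: fill the similarity matrix H and the
# traceback matrix T, then walk T back from the cell with the maximum score.
def smith_waterman(query, db, match=2, mismatch=-1, gap=-2):
    n, m = len(query), len(db)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # similarity matrix
    T = [[0] * (m + 1) for _ in range(n + 1)]  # 0=stop, 1=diag, 2=north, 3=west
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # each cell depends on its north, west, and north-west neighbors
            diag = H[i-1][j-1] + (match if query[i-1] == db[j-1] else mismatch)
            north = H[i-1][j] + gap
            west = H[i][j-1] + gap
            H[i][j], T[i][j] = max((diag, 1), (north, 2), (west, 3))
            if H[i][j] <= 0:          # local alignment: scores never go negative
                H[i][j], T[i][j] = 0, 0
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # traceback: follow the stored directions from the maximum score
    i, j = best_pos
    aligned = []
    while T[i][j] != 0:
        if T[i][j] == 1:
            aligned.append((query[i-1], db[j-1])); i, j = i - 1, j - 1
        elif T[i][j] == 2:
            aligned.append((query[i-1], '-')); i -= 1
        else:
            aligned.append(('-', db[j-1])); j -= 1
    return best, list(reversed(aligned))
```

Note that the dependency pattern (north, west, north-west) is exactly what makes the anti-diagonals independent, which the hardware implementation exploits later.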
6. State of the art

Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
Tesla K20                                45.0                  0.200
Nvidia GeForce GTX 295                   30.0                  0.104
Xtreme Data XD1000                       25.6                  0.430
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
Nvidia GeForce GTX 295                   16.1                  0.056
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 280                   9.66                  0.041
Xtreme Data XD2000i                      9.00                  0.150
2X Nvidia GeForce 8800                   3.60                  0.017
7. Implementation work flow
Static Code Analysis → Roofline Model → Implementation → Application Benchmark → does performance satisfy the roofline prediction? If no, iterate on the implementation; if yes, keep it as the final implementation.
8. Static code analysis

Work W [Operations]   Theoretical [N = query, M = database]   Example [Ops], N = 256, M = 65K
Indexing              11N² + 11NM - 6N                        185M
Comparison            6N² + 6NM - 5N                          101M
Arithmetic            15N² + 15NM - 6N + 8M + 2               253M
Total                 32N² + 32NM - 17N + 8M + 2              539M

Memory Traffic DMT [B]                                        [B]
Data in               N + M                                   65K
Data out              64(N + M - 1)                           4.2M
Total                 65N + 65M - 64                          4.3M

Operational Intensity [Ops/B]
(32N² + 32NM - 17N + 8M + 2) / (65N + 65M - 64) = 126

Compute intensive: little read traffic, massive writes.
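The figures in the table can be checked directly by evaluating its formulas. The formulas are taken from the slide; the snippet itself is only an illustrative sanity check for N = 256, M = 65536:

```python
# Evaluate the static-analysis formulas: total work (operations), total
# memory traffic (bytes), and the resulting operational intensity.
def static_analysis(N, M):
    work = 32 * N**2 + 32 * N * M - 17 * N + 8 * M + 2   # total operations
    traffic = 65 * N + 65 * M - 64                        # total bytes moved
    return work, traffic, work / traffic

work, traffic, oi = static_analysis(256, 65536)
print(round(work / 1e6))   # total work, ~539 MOps
print(round(oi))           # operational intensity, ~126 Ops/B
```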
9. Implementation work flow: Roofline Model stage
10. The Roofline Model [2]
A performance model that depicts the relation between attainable performance and operational intensity.
[2] Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
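The roofline bound itself is a one-liner: attainable performance is capped either by peak compute throughput or by memory bandwidth times operational intensity. The peak and bandwidth figures below are illustrative placeholders, not the ADM-PCIE-7V3/KU3 numbers from the paper:

```python
# Roofline bound: min(peak compute, bandwidth * operational intensity).
# A kernel whose OI lies to the right of the ridge point (peak / bandwidth)
# is compute bound; to the left, it is memory bound.
def roofline(oi_ops_per_byte, peak_gops, bandwidth_gbps):
    return min(peak_gops, bandwidth_gbps * oi_ops_per_byte)

# e.g. with the OI = 126 Ops/B found in the static analysis, a hypothetical
# device with 500 GOps/s peak and 10 GB/s bandwidth is compute bound:
print(roofline(126, 500.0, 10.0))   # prints 500.0 (compute roof)
```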
12. Implementation work flow: Implementation stage
13. Implementation choices
• Traceback is sequential: compute it on the host processor
• As seen in the roofline, we are memory bound, so compression of the input/output is essential: directions are expressed with a 2-bit representation
• Parallel computation along the anti-diagonals with a systolic array
• Buffer out corners to simplify corner cases
• No need to buffer the entire database: shift it in as needed given the current compute window (maximum size = size of the query)
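The 2-bit direction representation mentioned above packs four traceback directions per byte, shrinking the traceback output by 4x relative to one byte per cell. This is a hypothetical sketch of such packing (the direction encoding and helper names are assumptions, not the authors' kernel code):

```python
# Pack 2-bit traceback direction codes four-per-byte, and unpack them again.
STOP, DIAG, NORTH, WEST = 0, 1, 2, 3   # assumed 2-bit encoding

def pack_directions(dirs):
    """Pack a list of 2-bit direction codes into bytes, 4 codes per byte."""
    out = bytearray()
    for i in range(0, len(dirs), 4):
        b = 0
        for k, d in enumerate(dirs[i:i + 4]):
            b |= (d & 0x3) << (2 * k)   # code k occupies bits 2k..2k+1
        out.append(b)
    return bytes(out)

def unpack_directions(data, n):
    """Recover n direction codes from the packed byte stream."""
    return [(data[i // 4] >> (2 * (i % 4))) & 0x3 for i in range(n)]

dirs = [DIAG, WEST, NORTH, STOP, DIAG]
assert unpack_directions(pack_directions(dirs), len(dirs)) == dirs
```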
17. Implementation work flow: Application Benchmark stage
18. Experimental settings
• For the experiments, we used two boards developed by Alpha Data: the ADM-PCIE-7V3 and the ADM-PCIE-KU3
• The benchmarks were performed by increasing the sizes of the query and the database
• The host machine is an x64 machine running Red Hat Enterprise Linux 6.6
• Host and FPGA are connected over PCIe
• Execution times are measured using the events of the OpenCL standard
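Performance in the following tables is reported in GCUPS (giga cell updates per second): the number of similarity-matrix cells computed per second. Given an N x M alignment and a measured execution time, the metric is a direct computation (the timings below are illustrative, not the paper's measurements):

```python
# GCUPS = (query length * database length) / execution time / 1e9.
def gcups(query_len, db_len, seconds):
    return (query_len * db_len) / seconds / 1e9

# e.g. a 256 x 65536 alignment completed in 1 ms:
print(gcups(256, 65536, 0.001))   # ~16.8 GCUPS
```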
23. Implementation work flow: performance satisfies the roofline prediction, so we keep the final implementation
24. State of the art

Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
Tesla K20                                45.0                  0.200
ADM-PCIE-KU3                             42.5                  1.699
Nvidia GeForce GTX 295                   30.0                  0.104
Xtreme Data XD1000                       25.6                  0.430
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
Nvidia GeForce GTX 295                   16.1                  0.056
ADM-PCIE-7V3                             14.8                  0.594
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 280                   9.66                  0.041
Xtreme Data XD2000i                      9.00                  0.150
2X Nvidia GeForce 8800                   3.60                  0.017
25. State of the art (sorted by power efficiency)

Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
ADM-PCIE-KU3                             42.5                  1.699
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
ADM-PCIE-7V3                             14.8                  0.594
Xtreme Data XD1000                       25.6                  0.430
Tesla K20                                45.0                  0.200
Xtreme Data XD2000i                      9.00                  0.150
Nvidia GeForce GTX 295                   30.0                  0.104
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 295                   16.1                  0.056
Nvidia GeForce GTX 280                   9.66                  0.041
2X Nvidia GeForce 8800                   3.60                  0.017
26. Conclusions
We presented:
• A pure implementation of the Smith-Waterman algorithm
• An analysis using the Berkeley Roofline Model
The version presented here achieves:
• The best performance/power-consumption ratio across all competing devices
• The fastest performance w.r.t. FPGA implementations
Di Tucci, Lorenzo, Kenneth O'Brien, Michaela Blott, and Marco D. Santambrogio. "Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL." In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 716-721. IEEE, 2017.
27. Future Works
We have started a collaboration with Lawrence Berkeley National Laboratory:
• Implementation of the Smith-Waterman algorithm using Chisel HDL [1]
• Adaptation of the code to run with the merAligner [2]
• Implementation of single- and multi-FPGA architectures for the merAligner
[1] https://chisel.eecs.berkeley.edu/
[2] https://people.eecs.berkeley.edu/~egeor/ipdps_genome.pdf
28. Thanks for your attention!
Questions?
Lorenzo Di Tucci – lorenzo.ditucci@polimi.it
29. Appendix: area usage & resource utilization
• All loops have II = 1
• LUT usage < 10%
• FF usage < 5%
• BRAM usage ~ 1%
30. Comparison with the state of the art

Platform                                 Performance [GCUPS]   Price [$]   GCUPS/$
2X Nvidia GeForce 8800                   3.6                   2x100       0.018
Xtreme Data XD2000i                      9                     ------      ------
Nvidia GeForce GTX 280                   9.66                  50          0.1932
Dual-core Nvidia 9800 GX2                14.5                  70          0.207
ADM-PCIE-7V3                             14.84                 3200        0.0046
Nvidia GeForce GTX 295                   16.087                294         0.055
Altera Stratix V on Nallatech PCIe-385   24.7                  4995        0.005
Xtreme Data XD1000                       25.6                  ------      ------
Nvidia GeForce GTX 295                   30                    295         0.102
ADM-PCIE-KU3                             42.47                 2795        0.015
Tesla K20                                45                    2779        0.016
Editor's Notes
It GUARANTEES finding it!!!
The predicted performance is higher than the state of the art for FPGA implementations, so there is good reason to accelerate this algorithm on our platforms.