Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs using OpenCL
06/07/2017 @ Oracle
Lorenzo Di Tucci
lorenzo.ditucci@polimi.it
NECST Lab, Politecnico di Milano
The Problem
The performance requirements of biological algorithms have increased because of:
• Large amounts of data
• Algorithm complexity
• High computational needs
In this scenario, hardware accelerators have proved effective at optimizing the performance/power consumption ratio, thanks to:
• High parallelism
• Low power consumption
Contributions
The contributions of this work are:
• Energy-efficient hardware architecture for a pure Smith-Waterman algorithm
• Implementation with an OpenCL-based design and run-time environment
• Analysis of this algorithm using the Berkeley Roofline Model
• Experimental results for the ADM-PCIE-7V3 and ADM-PCIE-KU3 boards
The results show the best performance among FPGA solutions and the best performance/power consumption ratio among all competing devices
Background
● Dynamic programming algorithm
● Performs local sequence alignment between two nucleotide or protein sequences
● Guaranteed to find the optimal local alignment with respect to the scoring system used [1]
● Highly compute-intensive
● To increase system performance, the state of the art is full of implementations based on heuristics, which trade a speedup in computation for a decrease in algorithm precision
[1] Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.
Algorithm
1. Read all inputs (query, database, scoring system)
2. Compute the max score, the similarity matrix and the traceback matrix
3. Traceback: starting from the max score, follow the highest scores in the traceback matrix
4. Write results
Similarity Matrix: each element depends on the values
- over it (north)
- on its left (west)
- on its diagonal position (north-west)
Traceback Matrix: starting from the maximum value in the Similarity Matrix, follow the directions stored in the Traceback Matrix
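The dependency pattern (north, west, north-west) and the traceback step can be sketched in software as follows. This is a minimal illustration, not the FPGA design from the talk: the linear gap penalty, the score values (match = 2, mismatch = -1, gap = -1), the direction codes and all function names are assumptions chosen for the example.

```python
# Hypothetical direction codes stored in the traceback matrix.
STOP, NORTH, WEST, NORTH_WEST = 0, 1, 2, 3


def smith_waterman(query, database, match=2, mismatch=-1, gap=-1):
    """Return (max_score, aligned_query, aligned_database) for a local alignment."""
    n, m = len(query), len(database)
    # Row 0 and column 0 stay at zero and act as the matrix boundary.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    trace = [[STOP] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            # Each cell depends only on its north-west, north and west neighbours.
            candidates = (
                (0, STOP),
                (score[i - 1][j - 1] + sub, NORTH_WEST),
                (score[i - 1][j] + gap, NORTH),
                (score[i][j - 1] + gap, WEST),
            )
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: start at the maximum score and follow the stored directions.
    i, j = best_pos
    out_q, out_d = [], []
    while trace[i][j] != STOP:
        direction = trace[i][j]
        if direction == NORTH_WEST:
            out_q.append(query[i - 1])
            out_d.append(database[j - 1])
            i, j = i - 1, j - 1
        elif direction == NORTH:
            out_q.append(query[i - 1])
            out_d.append("-")
            i -= 1
        else:  # WEST
            out_q.append("-")
            out_d.append(database[j - 1])
            j -= 1
    return best, "".join(reversed(out_q)), "".join(reversed(out_d))
```

For example, `smith_waterman("ACGT", "TTACGTTT")` finds the exact four-character match and scores it 4 x 2 = 8 under the assumed scoring values.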
State of the art
Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
Tesla K20                                45.0                  0.200
Nvidia GeForce GTX 295                   30.0                  0.104
Xtreme Data XD1000                       25.6                  0.430
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
Nvidia GeForce GTX 295                   16.1                  0.056
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 280                   9.66                  0.041
Xtreme Data XD2000i                      9.00                  0.150
2x Nvidia GeForce 8800                   3.60                  0.017
Implementation work flow
[Figure: flowchart with the steps Static Code Analysis, Roofline Model, Implementation, Application Benchmark, the decision "Performance satisfies roofline prediction?" (No / Yes), and Final Implementation]
Static code analysis
Work W [Operations]      Theoretical [N = query, M = database]   Example [Ops], N=256, M=65K
Indexing                 11N^2 + 11NM - 6N                       185M
Comparison               6N^2 + 6NM - 5N                         101M
Arithmetic               15N^2 + 15NM - 6N + 8M + 2              253M
Total                    32N^2 + 32NM - 17N + 8M + 2             539M

Memory Traffic DMT [B]   Theoretical                             Example [B]
Data in                  N + M                                   65K
Data out                 64(N + M - 1)                           4.2M
Total                    65N + 65M - 64                          4.3M

Operational Intensity    Theoretical                             Example [Ops/B]
[Ops/B]                  (32N^2 + 32NM - 17N + 8M + 2) /         126
                         (65N + 65M - 64)

Compute intensive: little read, massive writes
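The example column can be reproduced by evaluating the formulas from the static code analysis directly. The sketch below assumes that "65K" on the slide means M = 65,536; the function names are hypothetical.

```python
def work_ops(n, m):
    """Total operation count W: indexing + comparison + arithmetic."""
    indexing = 11 * n**2 + 11 * n * m - 6 * n
    comparison = 6 * n**2 + 6 * n * m - 5 * n
    arithmetic = 15 * n**2 + 15 * n * m - 6 * n + 8 * m + 2
    return indexing + comparison + arithmetic


def memory_traffic_bytes(n, m):
    """Total memory traffic in bytes: data in plus data out."""
    data_in = n + m
    data_out = 64 * (n + m - 1)
    return data_in + data_out


def operational_intensity(n, m):
    """Operations per byte of memory traffic."""
    return work_ops(n, m) / memory_traffic_bytes(n, m)


if __name__ == "__main__":
    n, m = 256, 65536  # assumption: "65K" is read as 65,536
    print(work_ops(n, m))                      # 539488002, about 539M
    print(memory_traffic_bytes(n, m))          # 4276416, about 4.3M
    print(round(operational_intensity(n, m)))  # 126
```

The three per-category formulas sum term by term to the total 32N^2 + 32NM - 17N + 8M + 2, and data in plus data out gives 65N + 65M - 64.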
The roofline model
The roofline model [2] is a performance model that depicts the relation between attainable performance and operational intensity.
[2] Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
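In its basic form, the roofline bound is the minimum of the compute roof and the memory roof (bandwidth times operational intensity). A minimal sketch follows. The peak and bandwidth values in the usage lines are placeholders for illustration, not figures for the boards used in this work, and the function names are hypothetical.

```python
def attainable_performance(operational_intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline bound in ops/s: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * operational_intensity)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity (ops/byte) at which the two roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


if __name__ == "__main__":
    peak = 10**12      # placeholder compute roof, ops/s
    bandwidth = 10**9  # placeholder memory bandwidth, bytes/s
    oi = 126           # operational intensity from the static code analysis
    bound = attainable_performance(oi, peak, bandwidth)
    kind = "memory bound" if oi < ridge_point(peak, bandwidth) else "compute bound"
    print(bound, kind)
```

A kernel whose operational intensity lies left of the ridge point is limited by memory traffic; right of it, by compute throughput.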
Implementation
Implementation choices
• Traceback is sequential
• Compute on host processor
• As seen in the roofline, we are memory bound, therefore compression of input/output essential
• Directions expressed with 2-bit representation
• Parallel computation along the anti-diagonals with a systolic array
• Buffer out corners to simplify corner cases
• No need to buffer entire database
• Shift in as needed given current compute window (maximum size = size of the query)
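As an illustration of the 2-bit direction representation, four traceback directions fit into each output byte. The sketch below is only one possible layout: the direction codes (0 to 3), the little-end-first packing order and the function names are assumptions, since the slides do not specify the encoding.

```python
def pack_directions(directions):
    """Pack 2-bit direction codes (0..3) into bytes, four codes per byte."""
    packed = bytearray((len(directions) + 3) // 4)
    for k, code in enumerate(directions):
        if not 0 <= code <= 3:
            raise ValueError("direction codes must fit in 2 bits")
        # Code k occupies bits 2*(k % 4) and 2*(k % 4) + 1 of byte k // 4.
        packed[k // 4] |= code << (2 * (k % 4))
    return bytes(packed)


def unpack_directions(packed, count):
    """Recover the first `count` 2-bit direction codes from packed bytes."""
    return [(packed[k // 4] >> (2 * (k % 4))) & 0b11 for k in range(count)]
```

Compared with one byte per direction, this layout cuts the traceback output to a quarter of the size, which matters when the design is limited by memory traffic rather than by compute.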
[Figure: example matrix with column labels C, G, T and row labels C, T, G, A, C, G]
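The anti-diagonal parallelism can be illustrated by enumerating the cells of each anti-diagonal. Every cell depends only on its north, west and north-west neighbours, which all lie on the two previous anti-diagonals, so the cells of one anti-diagonal can be computed in parallel. The sketch is a software illustration with hypothetical names, not the systolic array itself.

```python
def anti_diagonals(n, m):
    """Yield the cells (i, j) of an n x m matrix grouped by anti-diagonal d = i + j."""
    for d in range(n + m - 1):
        i_lo = max(0, d - (m - 1))
        i_hi = min(n - 1, d)
        yield [(i, d - i) for i in range(i_lo, i_hi + 1)]


if __name__ == "__main__":
    # 3-character query against a 6-character database.
    for step, cells in enumerate(anti_diagonals(3, 6)):
        print(step, cells)
```

With a query shorter than the database, no anti-diagonal holds more cells than the query length, which is consistent with a compute window whose maximum size is the size of the query.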
Experimental settings
• For the experiments, we used two boards developed by Alpha Data: the ADM-PCIE-7V3 and the ADM-PCIE-KU3
• The benchmarks were performed by increasing the sizes of the query and the database
• The host machine is an x64 machine running Red Hat Enterprise Linux 6.6
• Host and FPGA are connected over PCIe
• The execution times are measured using the events of the OpenCL standard
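OpenCL event profiling reports command start and end timestamps in nanoseconds, so a measured interval can be turned into GCUPS (giga cell updates per second) as sketched below. The sketch uses the common definition of cell updates as query length times database length; the function name and the sample timing are hypothetical and are not measurements from this work.

```python
def gcups(query_len, database_len, start_ns, end_ns):
    """Giga cell updates per second for one query/database comparison."""
    elapsed_s = (end_ns - start_ns) * 1e-9
    if elapsed_s <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    cell_updates = query_len * database_len
    return cell_updates / elapsed_s / 1e9


if __name__ == "__main__":
    # Hypothetical run: N = 256, M = 65,536, kernel time of 1 ms.
    print(gcups(256, 65536, start_ns=0, end_ns=1_000_000))
```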
Results
[Figures: four result plots, annotated with the optimization steps Systolic Array, I/O Compression, Shift Register and Port Mapping on Kintex UltraScale]
State of the art (sorted by performance)
Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
Tesla K20                                45.0                  0.200
ADM-PCIE-KU3                             42.5                  1.699
Nvidia GeForce GTX 295                   30.0                  0.104
Xtreme Data XD1000                       25.6                  0.430
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
Nvidia GeForce GTX 295                   16.1                  0.056
ADM-PCIE-7V3                             14.8                  0.594
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 280                   9.66                  0.041
Xtreme Data XD2000i                      9.00                  0.150
2x Nvidia GeForce 8800                   3.60                  0.017
State of the art (sorted by power efficiency)
Platform                                 Performance [GCUPS]   Power Efficiency [GCUPS/W]
ADM-PCIE-KU3                             42.5                  1.699
Altera Stratix V on Nallatech PCIe-385   24.7                  0.988
ADM-PCIE-7V3                             14.8                  0.594
Xtreme Data XD1000                       25.6                  0.430
Tesla K20                                45.0                  0.200
Xtreme Data XD2000i                      9.00                  0.150
Nvidia GeForce GTX 295                   30.0                  0.104
Dual-core Nvidia 9800 GX2                14.5                  0.074
Nvidia GeForce GTX 295                   16.1                  0.056
Nvidia GeForce GTX 280                   9.66                  0.041
2x Nvidia GeForce 8800                   3.60                  0.017
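Dividing the two table columns gives the power draw each row implies, since GCUPS / (GCUPS/W) = W. The sketch below does this for the five most power-efficient rows. The resulting values are derived from the rounded table entries, not measured, and the function name is hypothetical.

```python
# (platform, performance in GCUPS, power efficiency in GCUPS/W), copied from the table.
ROWS = [
    ("ADM-PCIE-KU3", 42.5, 1.699),
    ("Altera Stratix V on Nallatech PCIe-385", 24.7, 0.988),
    ("ADM-PCIE-7V3", 14.8, 0.594),
    ("Xtreme Data XD1000", 25.6, 0.430),
    ("Tesla K20", 45.0, 0.200),
]


def implied_power_w(gcups, gcups_per_watt):
    """Power in watts implied by a performance and a power-efficiency figure."""
    return gcups / gcups_per_watt


if __name__ == "__main__":
    for name, perf, eff in ROWS:
        print(f"{name}: about {implied_power_w(perf, eff):.0f} W")
```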
Conclusions
We presented
• A pure implementation of the Smith-Waterman algorithm
• Analyzed using the Berkeley Roofline Model
The version presented here has
• The best performance/power consumption ratio
• The best performance among FPGA implementations
Di Tucci, Lorenzo, Kenneth O'Brien, Michaela Blott, and Marco D. Santambrogio. "Architectural optimizations for high
performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL." In 2017 Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 716-721. IEEE, 2017.
Future Work
Started a collaboration with Lawrence Berkeley National Laboratory
• Implementation of the Smith-Waterman algorithm using Chisel HDL [3]
• Adaptation of the code to run with the merAligner [4]
• Implementation of single- and multi-FPGA architectures for the merAligner
[3] https://chisel.eecs.berkeley.edu/
[4] https://people.eecs.berkeley.edu/~egeor/ipdps_genome.pdf
Thanks for your attention!
Questions?
Lorenzo Di Tucci – lorenzo.ditucci@polimi.it
Appendix: area usage & resource utilization
• All loops have II = 1
• LUT usage < 10%
• FF usage < 5%
• BRAM ~ 1%
Comparison with state of the art
Platform                                 Performance [GCUPS]   Price [$]   GCUPS/$
2x Nvidia GeForce 8800                   3.6                   2x100       0.018
Xtreme Data XD2000i                      9                     n/a         n/a
Nvidia GeForce GTX 280                   9.66                  50          0.1932
Dual-core Nvidia 9800 GX2                14.5                  70          0.207
ADM-PCIE-7V3                             14.84                 3200        0.0046
Nvidia GeForce GTX 295                   16.087                294         0.055
Altera Stratix V on Nallatech PCIe-385   24.7                  4995        0.005
Xtreme Data XD1000                       25.6                  n/a         n/a
Nvidia GeForce GTX 295                   30                    295         0.102
ADM-PCIE-KU3                             42.47                 2795        0.015
Tesla K20                                45                    2779        0.016

More Related Content

What's hot

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
Thesis_presentation1
Thesis_presentation1Thesis_presentation1
Thesis_presentation1Bhushan Velis
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...James McCombs
 
Big Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open WorkshopBig Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open WorkshopExtremeEarth
 
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...InfluxData
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Igor Sfiligoi
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemJayjeetChakraborty
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...GISRUK conference
 
Smallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSmallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSafe Software
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstIgor Sfiligoi
 
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Joseph Luchette
 
Federated HPC Clouds Applied to Radiation Therapy
Federated HPC Clouds Applied to Radiation TherapyFederated HPC Clouds Applied to Radiation Therapy
Federated HPC Clouds Applied to Radiation TherapyAndrés Gómez
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Igor Sfiligoi
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsNECST Lab @ Politecnico di Milano
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARMEdge AI and Vision Alliance
 
Thermal modeling and management of cluster storage systems xunfei jiang 2014
Thermal modeling and management of cluster storage systems xunfei jiang 2014Thermal modeling and management of cluster storage systems xunfei jiang 2014
Thermal modeling and management of cluster storage systems xunfei jiang 2014Xiao Qin
 
How to Build a Telegraf Plugin by Noah Crowley
How to Build a Telegraf Plugin by Noah CrowleyHow to Build a Telegraf Plugin by Noah Crowley
How to Build a Telegraf Plugin by Noah CrowleyInfluxData
 
"Update on Khronos Standards for Vision and Machine Learning," a Presentation...
"Update on Khronos Standards for Vision and Machine Learning," a Presentation..."Update on Khronos Standards for Vision and Machine Learning," a Presentation...
"Update on Khronos Standards for Vision and Machine Learning," a Presentation...Edge AI and Vision Alliance
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesNECST Lab @ Politecnico di Milano
 

What's hot (20)

Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
Thesis_presentation1
Thesis_presentation1Thesis_presentation1
Thesis_presentation1
 
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
Performance Benchmarking of the R Programming Environment on the Stampede 1.5...
 
Big Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open WorkshopBig Linked Data Interlinking - ExtremeEarth Open Workshop
Big Linked Data Interlinking - ExtremeEarth Open Workshop
 
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...InfluxDB IOx Tech Talks:  A Rusty Introduction to Apache Arrow and How it App...
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App...
 
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie...
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
 
Smallworld Data Check-Out to Microstation
Smallworld Data Check-Out to MicrostationSmallworld Data Check-Out to Microstation
Smallworld Data Check-Out to Microstation
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
 
Federated HPC Clouds Applied to Radiation Therapy
Federated HPC Clouds Applied to Radiation TherapyFederated HPC Clouds Applied to Radiation Therapy
Federated HPC Clouds Applied to Radiation Therapy
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
Run-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environmentsRun-time power management in cloud and containerized environments
Run-time power management in cloud and containerized environments
 
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
"Using SGEMM and FFTs to Accelerate Deep Learning," a Presentation from ARM
 
Thermal modeling and management of cluster storage systems xunfei jiang 2014
Thermal modeling and management of cluster storage systems xunfei jiang 2014Thermal modeling and management of cluster storage systems xunfei jiang 2014
Thermal modeling and management of cluster storage systems xunfei jiang 2014
 
How to Build a Telegraf Plugin by Noah Crowley
How to Build a Telegraf Plugin by Noah CrowleyHow to Build a Telegraf Plugin by Noah Crowley
How to Build a Telegraf Plugin by Noah Crowley
 
"Update on Khronos Standards for Vision and Machine Learning," a Presentation...
"Update on Khronos Standards for Vision and Machine Learning," a Presentation..."Update on Khronos Standards for Vision and Machine Learning," a Presentation...
"Update on Khronos Standards for Vision and Machine Learning," a Presentation...
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 

Similar to Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs Using OpenCL

MAtrix Multiplication Parallel.ppsx
MAtrix Multiplication Parallel.ppsxMAtrix Multiplication Parallel.ppsx
MAtrix Multiplication Parallel.ppsxBharathiLakshmiAAssi
 
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...NECST Lab @ Politecnico di Milano
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGATO project
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
Get Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsGet Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsScyllaDB
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysisArun Joseph
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsGanesan Narayanasamy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Supermicro X12 Performance Update
Supermicro X12 Performance UpdateSupermicro X12 Performance Update
Supermicro X12 Performance UpdateRebekah Rodriguez
 
RECAP: The Simulation Approach
RECAP: The Simulation ApproachRECAP: The Simulation Approach
RECAP: The Simulation ApproachRECAP Project
 

Similar to Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs Using OpenCL (20)

FPGA-enhanced Bioinformatics @ NECST
FPGA-enhanced Bioinformatics @ NECSTFPGA-enhanced Bioinformatics @ NECST
FPGA-enhanced Bioinformatics @ NECST
 
MAtrix Multiplication Parallel.ppsx
MAtrix Multiplication Parallel.ppsxMAtrix Multiplication Parallel.ppsx
MAtrix Multiplication Parallel.ppsx
 
matrixmultiplicationparallel.ppsx
matrixmultiplicationparallel.ppsxmatrixmultiplicationparallel.ppsx
matrixmultiplicationparallel.ppsx
 
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA...
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
Get Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java ApplicationsGet Lower Latency and Higher Throughput for Java Applications
Get Lower Latency and Higher Throughput for Java Applications
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysis
 
Application Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systemsApplication Optimisation using OpenPOWER and Power 9 systems
Application Optimisation using OpenPOWER and Power 9 systems
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Supermicro X12 Performance Update
Supermicro X12 Performance UpdateSupermicro X12 Performance Update
Supermicro X12 Performance Update
 
RECAP: The Simulation Approach
RECAP: The Simulation ApproachRECAP: The Simulation Approach
RECAP: The Simulation Approach
 

More from NECST Lab @ Politecnico di Milano

Embedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingEmbedding based knowledge graph link prediction for drug repurposing
Embedding based knowledge graph link prediction for drug repurposingNECST Lab @ Politecnico di Milano
 
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...
PLASTER - PYNQ-based abandoned object detection using a map-reduce approach o...NECST Lab @ Politecnico di Milano
 
EMPhASIS - An EMbedded Public Attention Stress Identification System
 EMPhASIS - An EMbedded Public Attention Stress Identification System EMPhASIS - An EMbedded Public Attention Stress Identification System
EMPhASIS - An EMbedded Public Attention Stress Identification SystemNECST Lab @ Politecnico di Milano
 
Maeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingMaeve - Fast genome analysis leveraging exact string matching
Maeve - Fast genome analysis leveraging exact string matchingNECST Lab @ Politecnico di Milano
 

More from NECST Lab @ Politecnico di Milano (20)

Mesticheria Team - WiiReflex
Mesticheria Team - WiiReflexMesticheria Team - WiiReflex
Mesticheria Team - WiiReflex
 
Punto e virgola Team - Stressometro
Punto e virgola Team - StressometroPunto e virgola Team - Stressometro
Punto e virgola Team - Stressometro
 
BitIt Team - Stay.straight
BitIt Team - Stay.straight BitIt Team - Stay.straight
BitIt Team - Stay.straight
 
Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs Using OpenCL

  • 1. Architectural Optimizations for High Performance and Energy Efficient Smith-Waterman Implementation on FPGAs using OpenCL. 06/07/2017 @ Oracle. Lorenzo Di Tucci (lorenzo.ditucci@polimi.it), NECST Lab, Politecnico di Milano
  • 2. The Problem. The performance requirements of biological algorithms have increased because of the large amount of data, the algorithm complexity and the high computational needs. In this scenario, hardware accelerators have proved effective in optimizing the performance/power consumption ratio, thanks to their high parallelism and low power consumption.
  • 3. Contributions. The contributions of this work are:
      • Energy-efficient hardware architecture for a pure Smith-Waterman algorithm
      • Implementation with an OpenCL-based design and run-time environment
      • Analysis of this algorithm using the Berkeley Roofline Model
      • Experimental results for the ADM_PCIE_7V3 and ADM_PCIE_KU3
    The results highlight the best performance w.r.t. FPGA solutions and the best performance/power consumption ratio w.r.t. all competing devices.
  • 4. Background
      • Dynamic programming algorithm
      • Performs local sequence alignment between two nucleotide or protein sequences
      • Guaranteed to find the optimal local alignment with regard to the scoring system used [1]
      • Highly compute intensive
      • To increase system performance, the state of the art is full of implementations based on heuristics, which trade a speedup in computation for a decrease in algorithm precision
    [1] Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195-197.
  • 5. Algorithm
      1. Read all inputs (query, database, scoring system)
      2. Compute the max score, the similarity matrix and the traceback matrix
      3. Traceback: start from the max score and follow the highest score in the traceback matrix
      4. Write results
    Similarity Matrix: each element depends on the values over it (north), on its left (west) and on its diagonal position (north-west).
    Traceback Matrix: starting from the maximum value in the Similarity Matrix, follow the directions stored in the Traceback Matrix.
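The four steps on this slide can be sketched in software as follows. This is a minimal illustration, not the FPGA design from the talk: the scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the linear gap penalty and all function and constant names are assumptions chosen for the example.

```python
# Direction codes for the traceback matrix (hypothetical encoding, 2 bits each).
STOP, NORTH, WEST, NORTH_WEST = 0, 1, 2, 3


def smith_waterman(query, database, match=2, mismatch=-1, gap=-1):
    """Return (max_score, aligned_query, aligned_database) for a local alignment."""
    n, m = len(query), len(database)
    score = [[0] * (m + 1) for _ in range(n + 1)]     # similarity matrix
    trace = [[STOP] * (m + 1) for _ in range(n + 1)]  # traceback matrix
    best, best_pos = 0, (0, 0)

    # Each cell depends on its north, west and north-west neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            candidates = [
                (0, STOP),
                (score[i - 1][j - 1] + sub, NORTH_WEST),
                (score[i - 1][j] + gap, NORTH),
                (score[i][j - 1] + gap, WEST),
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: start at the maximum and follow the stored directions.
    i, j = best_pos
    a, b = [], []
    while trace[i][j] != STOP:
        direction = trace[i][j]
        if direction == NORTH_WEST:
            a.append(query[i - 1])
            b.append(database[j - 1])
            i, j = i - 1, j - 1
        elif direction == NORTH:
            a.append(query[i - 1])
            b.append("-")
            i -= 1
        else:  # WEST
            a.append("-")
            b.append(database[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


print(smith_waterman("ACGT", "TTACGTAA"))
```

The traceback loop is inherently sequential, since each step depends on the direction read in the previous one, while the matrix fill has only local dependencies.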
  • 6. State of the art
      Platform                                  Performance [GCUPS]   Power Efficiency [GCUPS/W]
      Tesla K20                                 45.0                  0.200
      Nvidia GeForce GTX 295                    30.0                  0.104
      Xtreme Data XD1000                        25.6                  0.430
      Altera Stratix V on Nallatech PCIe-385    24.7                  0.988
      Nvidia GeForce GTX 295                    16.1                  0.056
      Dual-core Nvidia 9800 GX2                 14.5                  0.074
      Nvidia GeForce GTX 280                    9.66                  0.041
      Xtreme Data XD2000i                       9.00                  0.150
      2X Nvidia GeForce 8800                    3.60                  0.017
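For readers unfamiliar with the units in this table, the sketch below shows how the two columns relate. It assumes the usual definition of GCUPS (billions of similarity-matrix cell updates per second); the function names are hypothetical, and the power figures it derives are only what the two columns imply, not values stated on the slides.

```python
def gcups(query_len, database_len, seconds):
    """Giga cell updates per second for one query/database alignment."""
    return query_len * database_len / seconds / 1e9


def implied_power_watts(performance_gcups, efficiency_gcups_per_watt):
    """Power draw implied by a performance figure and its power efficiency."""
    return performance_gcups / efficiency_gcups_per_watt


# Rows taken from the table above.
print(implied_power_watts(45.0, 0.200))   # Tesla K20
print(implied_power_watts(24.7, 0.988))   # Altera Stratix V on Nallatech PCIe-385
```

Read this way, the GPU entry reaches its higher throughput at roughly nine times the implied power of the FPGA entry.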
  • 7. Implementation work flow: Static Code Analysis → Roofline Model → Implementation → Application Benchmark → "Performance satisfies roofline prediction?" If yes: Final Implementation. If no: the flow loops back.
  • 8. Static code analysis (N = query length, M = database length; example with N = 256, M = 65K)
      Work W [operations]       Theoretical                      Example
      Indexing                  11N² + 11NM - 6N                 185M
      Comparison                6N² + 6NM - 5N                   101M
      Arithmetic                15N² + 15NM - 6N + 8M + 2        253M
      Total                     32N² + 32NM - 17N + 8M + 2       539M

      Memory traffic DMT [B]    Theoretical                      Example
      Data in                   N + M                            65K
      Data out                  64(N + M - 1)                    4.2M
      Total                     65N + 65M - 64                   4.3M

      Operational intensity [Ops/B] = (32N² + 32NM - 17N + 8M + 2) / (65N + 65M - 64), which is 126 in the example.
    The kernel is compute intensive, with little read traffic and massive writes.
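The example column can be reproduced directly from the formulas on this slide. The sketch below assumes that "65K" means M = 65,536; the function names are illustrative only.

```python
def work_ops(n, m):
    """Total operation count W, as the sum of the three categories on the slide."""
    indexing = 11 * n * n + 11 * n * m - 6 * n
    comparison = 6 * n * n + 6 * n * m - 5 * n
    arithmetic = 15 * n * n + 15 * n * m - 6 * n + 8 * m + 2
    return indexing + comparison + arithmetic


def memory_traffic_bytes(n, m):
    """Total memory traffic: data in (N + M) plus data out 64(N + M - 1)."""
    return (n + m) + 64 * (n + m - 1)


def operational_intensity(n, m):
    """Operations per byte of memory traffic."""
    return work_ops(n, m) / memory_traffic_bytes(n, m)


N, M = 256, 65536  # assumption: "65K" on the slide is 65,536
print(work_ops(N, M))                      # about 539M operations
print(memory_traffic_bytes(N, M))          # about 4.3M bytes
print(round(operational_intensity(N, M)))  # 126 Ops/B
```

The sum of the three categories matches the slide's total, 32N² + 32NM - 17N + 8M + 2, and the traffic matches 65N + 65M - 64.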
  • 9. Implementation work flow, current step: Roofline Model
  • 10. The roofline model [2]: a performance model that depicts the relation between attainable performance and operational intensity.
    [2] Williams, Samuel, Andrew Waterman, and David Patterson. "Roofline: an insightful visual performance model for multicore architectures." Communications of the ACM 52.4 (2009): 65-76.
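The relation the roofline model depicts can be written as attainable performance = min(peak compute performance, peak memory bandwidth × operational intensity). The sketch below illustrates it; the peak and bandwidth numbers are placeholders invented for the example and are not the values of the boards used in this work.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s, intensity_ops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity_ops_per_byte)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity at which the two roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Placeholder platform: 100 GOps/s peak, 10 GB/s memory bandwidth (hypothetical).
PEAK, BANDWIDTH = 100e9, 10e9
print(ridge_point(PEAK, BANDWIDTH))                  # 10.0 Ops/B
print(attainable_performance(PEAK, BANDWIDTH, 2.0))  # memory-bound region
print(attainable_performance(PEAK, BANDWIDTH, 50.0)) # compute-bound region
```

A kernel whose operational intensity lies left of the ridge point is limited by memory bandwidth, and one to the right of it is limited by peak compute.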
  • 12. Implementation work flow, current step: Implementation
  • 13-15. Implementation choices
      • Traceback is sequential
          • Compute it on the host processor
      • As seen in the roofline, we are memory bound, therefore compression of input/output is essential
          • Directions expressed with a 2-bit representation
      • Parallel computation along the anti-diagonals with a systolic array
          • Buffer out corners to simplify corner cases
      • No need to buffer the entire database
          • Shift it in as needed given the current compute window (maximum size = size of the query)
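Two of these choices lend themselves to a small software illustration: packing traceback directions into 2 bits each, and visiting the matrix by anti-diagonals so that all cells in one step are independent. This is a sketch under assumptions, not the OpenCL kernel from the talk; the function names and the bit layout (four codes per byte, lowest bits first) are hypothetical.

```python
def pack_directions(directions):
    """Pack direction codes in the range 0..3 into bytes, four codes per byte."""
    packed = bytearray((len(directions) + 3) // 4)
    for k, code in enumerate(directions):
        packed[k // 4] |= (code & 0b11) << (2 * (k % 4))
    return bytes(packed)


def unpack_directions(packed, count):
    """Recover `count` 2-bit direction codes from packed bytes."""
    return [(packed[k // 4] >> (2 * (k % 4))) & 0b11 for k in range(count)]


def anti_diagonals(n, m):
    """Yield the (i, j) cells of an n x m matrix (1-based) grouped by anti-diagonal.

    Cells with the same i + j depend only on earlier anti-diagonals, so each
    group could be computed in parallel, one cell per processing element.
    """
    for d in range(2, n + m + 1):
        lo, hi = max(1, d - m), min(n, d - 1)
        yield [(i, d - i) for i in range(lo, hi + 1)]


codes = [0, 1, 2, 3, 3, 0]
assert unpack_directions(pack_directions(codes), len(codes)) == codes
print(len(pack_directions([3] * 256)))  # bytes needed for 256 two-bit codes
print(list(anti_diagonals(2, 3)))
```

An n x m matrix has n + m - 1 anti-diagonals, and 256 two-bit codes pack into 64 bytes. That is arithmetically consistent with the 64(N + M - 1) output traffic on the static code analysis slide for N = 256, although the slides do not state this link explicitly.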
  • 17. Implementation work flow, current step: Application Benchmark
  • 18. Experimental settings
      • For the experiments, we used two boards developed by AlphaData: the ADM-PCIE-7V3 and the ADM-PCIE-KU3
      • The benchmarks have been performed by increasing the sizes of the query and the database
      • The host machine is an x64 machine running Red Hat Enterprise Linux 6.6
      • Host and FPGA are connected over PCIe
      • The execution times are measured using the events of the OpenCL standard
  • 23. Implementation work flow, current step: "Performance satisfies roofline prediction?" Yes → Final Implementation
  • 24. State of the art (sorted by performance)
      Platform                                  Performance [GCUPS]   Power Efficiency [GCUPS/W]
      Tesla K20                                 45.0                  0.200
      ADM-PCIE-KU3                              42.5                  1.699
      Nvidia GeForce GTX 295                    30.0                  0.104
      Xtreme Data XD1000                        25.6                  0.430
      Altera Stratix V on Nallatech PCIe-385    24.7                  0.988
      Nvidia GeForce GTX 295                    16.1                  0.056
      ADM-PCIE-7V3                              14.8                  0.594
      Dual-core Nvidia 9800 GX2                 14.5                  0.074
      Nvidia GeForce GTX 280                    9.66                  0.041
      Xtreme Data XD2000i                       9.00                  0.150
      2X Nvidia GeForce 8800                    3.60                  0.017
  • 25. State of the art (sorted by power efficiency)
      Platform                                  Performance [GCUPS]   Power Efficiency [GCUPS/W]
      ADM-PCIE-KU3                              42.5                  1.699
      Altera Stratix V on Nallatech PCIe-385    24.7                  0.988
      ADM-PCIE-7V3                              14.8                  0.594
      Xtreme Data XD1000                        25.6                  0.430
      Tesla K20                                 45.0                  0.200
      Xtreme Data XD2000i                       9.00                  0.150
      Nvidia GeForce GTX 295                    30.0                  0.104
      Dual-core Nvidia 9800 GX2                 14.5                  0.074
      Nvidia GeForce GTX 295                    16.1                  0.056
      Nvidia GeForce GTX 280                    9.66                  0.041
      2X Nvidia GeForce 8800                    3.60                  0.017
  • 26. Conclusions
    We presented:
      • A pure implementation of the Smith-Waterman algorithm
      • Analyzed using the Berkeley Roofline Model
    The version presented here has:
      • The best performance/power consumption ratio
      • The highest performance among FPGA implementations
    Di Tucci, Lorenzo, Kenneth O'Brien, Michaela Blott, and Marco D. Santambrogio. "Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL." In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 716-721. IEEE, 2017.
  • 27. Future Works
    We started a collaboration with Lawrence Berkeley National Laboratory:
      • Implementation of the Smith-Waterman algorithm using Chisel HDL [1]
      • Adaptation of the code to run with the merAligner [2]
      • Implementation of single-FPGA and multi-FPGA architectures for the merAligner
    [1] https://chisel.eecs.berkeley.edu/
    [2] https://people.eecs.berkeley.edu/~egeor/ipdps_genome.pdf
  • 28. Thanks for the attention! Questions? Lorenzo Di Tucci, lorenzo.ditucci@polimi.it
  • 29. Appendix: area usage & resource utilization
      • All loops have II = 1
      • LUT usage < 10%
      • FF usage < 5%
      • BRAM ~ 1%
  • 30. Comparison with state of the art
      Platform                                  Performance [GCUPS]   Price [$]   GCUPS/$
      2X Nvidia GeForce 8800                    3.6                   2x100       0.018
      Xtreme Data XD2000i                       9                     ------      ------
      Nvidia GeForce GTX 280                    9.66                  50          0.1932
      Dual-core Nvidia 9800 GX2                 14.5                  70          0.207
      ADM-PCIE-7V3                              14.84                 3200        0.0046
      Nvidia GeForce GTX 295                    16.087                294         0.055
      Altera Stratix V on Nallatech PCIe-385    24.7                  4995        0.005
      Xtreme Data XD1000                        25.6                  ------      ------
      Nvidia GeForce GTX 295                    30                    295         0.102
      ADM-PCIE-KU3                              42.47                 2795        0.015
      Tesla K20                                 45                    2779        0.016

Editor's Notes

  1. IT IS GUARANTEED TO FIND IT!!!
  2. The predicted performance is higher than that of the state of the art for FPGA implementations, so it makes sense to accelerate this algorithm on our platforms.