The DReAMS research line focuses on the definition of methodologies and software frameworks supporting the development of hardware-software systems for reconfigurable platforms, with applications in personalized medicine, genomics, machine learning, cryptography, cloud infrastructure, and embedded and IoT contexts, across both industrial and consumer fields.
Special attention is devoted to methodologies for developing heterogeneous, distributed, adaptive computing systems, and to modeling, simulating, designing and optimizing those architectures in terms of both performance and power consumption.
During the talk we will focus on two main research projects inside the DReAMS research line: HUGenomics and CAOS.
The HUGenomics framework aims at facilitating the genome assembly process by means of both hardware-accelerated algorithms and scientific data visualization tools. The system raises the level of abstraction, allowing users to easily integrate custom algorithms into the hardware pipeline without any knowledge of the underlying architecture. After HUGenomics we will present CAOS, a framework that helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process: from the identification of the kernel functions to accelerate, through the optimization of those kernels, to the generation of the runtime management and the configuration files needed to program the FPGA.
High Performance Reconfigurable Computing for Genomics Research
1. DReAMS
High Performance Reconfigurable Computing at NECSTLab
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Lorenzo Di Tucci, Marco Rabozzi, Marco D. Santambrogio
05/31/2018
MSR, Mountain View - CA
{lorenzo.ditucci, marco.rabozzi, marco.santambrogio}@polimi.it
2. NECSTLab @ PoliMi
NECSTLab is a laboratory within the DEIB department (Dipartimento di
Elettronica, Informazione e Bioingegneria) of Politecnico di Milano.
5. System Security Research Line
• MaTA
Malware and Threat Analysis
• FraudSec
Frauds Analysis and Detection
• MoSec
Mobile Security
• CyPhy
Security of Cyber-physical systems
Prof. S. Zanero
stefano.zanero@polimi.it
6. Computer Architecture Research Line
DReAMS
• To discover the world of FPGA-based systems
• Design and implementation of reconfigurable computing:
from architectural aspects to CAD development
• How to use CS to “speed up/improve” biomed applications
ORCA
• Unleashed computer architecture and operating systems
• From embedded to HPC computing systems, focusing on
computer architectures, OS and monitoring infrastructures
STeEL
• To make smart environments come true!
• How to make heterogeneous components coexist to
improve quality of life and comfort while minimizing power
and energy consumption
• Emotional and Physical Comfort
• Biometric Human recognition
Prof. D. Sciuto
donatella.sciuto@polimi.it
Prof. M. Santambrogio
marco.santambrogio@polimi.it
10. Genomic research
Recent advances in genomic research make it possible to perform multiple analyses on
DNA, with an impact on different fields
However, extracting biological meaning from secondary genome analysis requires a
complex process
11. Genome sequencing
Given a biological sample, genome sequencing is the process of determining the
precise order of nucleotides within a DNA molecule
This process produces short DNA fragments which need to be assembled to
reconstruct the original sequence
ACGTAGCTCGGACCATAGCA
CCGCCGTAGCTCGGACCATAGCACATG
AGTTTTGGGGGACCATAGCACATGGACACATGC
GGACCATAGCACATGGACACATGC
GGTCAAAAATAGCACATGGACACATGC
ATTGTATCGGACCATATTGCTTAGCATGTATTTGC
CATGGACACATGC
CGTAACCATAGCACATGGACACATGC
TTTTAGGTAATTGCCATAGCACATGGACACAT
12. Genome assembly
Genome assembly: reconstruct a genome from a set of shorter reads
Reference-based assembly
ACGTAGCTCGGACCATAGCA
GGACCATAGCACATGGACACATGC
ACGTAGCTCGGACCATAGCAGGACCATAGCACATGGACATGGACACATGCTTA
CATGGACACATGC
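As a rough illustration of reference-based assembly, the sketch below places reads on a reference by exact string search. The function name `map_reads` is hypothetical, and exact matching is a simplifying assumption: real aligners tolerate mismatches and gaps, which this sketch does not.

```python
def map_reads(reference, reads):
    """Map each read to the index of its first exact occurrence in the
    reference, or -1 if it does not occur verbatim (assumption: exact match only)."""
    return {read: reference.find(read) for read in reads}


# Reference and two of the reads shown on the slide.
reference = "ACGTAGCTCGGACCATAGCAGGACCATAGCACATGGACATGGACACATGCTTA"
reads = ["ACGTAGCTCGGACCATAGCA", "CATGGACACATGC"]

placements = map_reads(reference, reads)
for read, position in placements.items():
    print(f"{read} -> {position}")
```

A read that differs from the reference by even one base is reported as -1 here, which is why practical tools rely on approximate alignment instead.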
13. Genome assembly
Genome assembly: reconstruct a genome from a set of shorter reads
De novo assembly
ACGTAGCTCGGACCATAGCAGGACCATAGCACATGGACATGGACACATGCTTA
Applications are limited to species with available reference genomes
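To make the idea of de novo assembly concrete, here is a minimal greedy overlap-merge sketch. It is a textbook simplification, not the method used in HUGenomics; the names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions, and the sketch ignores sequencing errors, repeats and reads contained inside other reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no such overlap of at least min_len exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ACGTAG", "TAGCTC", "CTCGGA"])
print(contigs)
```

The all-pairs overlap search is quadratic in the number of reads, which hints at why assembly is compute-intensive at genome scale.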
14. De novo assembly
Genomics algorithms are usually:
• compute-intensive
• data-intensive (massive amounts of data)
• fast-changing
15. De novo assembly
Issue:
• General purpose architectures are inefficient
Genomics algorithms are usually:
• compute-intensive
• data-intensive (massive amounts of data)
• fast-changing
16. De novo assembly
Issue:
• General purpose architectures are inefficient
Solution:
• In such a scenario, hardware accelerators have proved effective in optimizing the
ratio of performance to power consumption
Genomics algorithms are usually:
• compute-intensive
• data-intensive (massive amounts of data)
• fast-changing
25. HUG Today
[1] [2] [3]
[1] Lorenzo Di Tucci, Giulia Guidi, Sara Notargiacomo, Luca Cerina, Alberto Scolari, and Marco D. Santambrogio. "HUGenomics: A support to personalized medicine research." In Research and Technologies for Society and Industry (RTSI), 2017 IEEE 3rd International Forum on, pp. 1-5. IEEE, 2017.
[2] Lorenzo Di Tucci, Davide Conficconi, Alessandro Comodi, Steven Hofmeyr, David Donofrio, and Marco Domenico Santambrogio. "A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA using Chisel HDL." In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018 IEEE International, t.b.p. IEEE, 2018.
[3] Davide Conficconi, Alessandro Comodi, Alberto Scolari, and Marco Domenico Santambrogio. "TiReX: a Tiled Regular Expression Matching Architecture." In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018 IEEE International, t.b.p. IEEE, 2018.
26. Smith-Waterman
• Dynamic programming algorithm
• Performs local sequence alignment between two strings (DNA, RNA, ...)
• Guaranteed to find the optimal local alignment with respect to the
scoring system used
• Highly compute-intensive
• To increase performance, the state of the art is full of
implementations based on heuristics
– Speedup in computation
– Decrease in algorithm precision
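The dynamic-programming recurrence behind Smith-Waterman can be sketched in a few lines. This is a plain software version with a linear gap penalty, meant only to show the computation that hardware implementations accelerate; the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are arbitrary assumptions, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the optimal local alignment score between strings a and b
    (linear gap penalty; scoring parameters are illustrative assumptions)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


print(smith_waterman("GGACGTCC", "TTACGTAA"))
```

Every one of the `len(a) * len(b)` cells is updated once, which is the cell count that the GCUPS metric measures per second.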
27. Smith-Waterman
Platform                                Performance [GCUPS]  Power Efficiency [GCUPS/W]
AWS-VU9P (3 queries in parallel)        110.0                4.400
Tesla K20                               45.0                 0.200
ADM-PCIE-KU3                            42.5                 1.699
Nvidia GeForce GTX 295                  30.0                 0.104
Xtreme Data XD1000                      25.6                 0.430
Altera Stratix V on Nallatech PCIe-385  24.7                 0.988
Nvidia GeForce GTX 295                  16.1                 0.056
ADM-PCIE-7V3                            14.8                 0.594
Dual-core Nvidia 9800 GX2               14.5                 0.074
Nvidia GeForce GTX 280                  9.7                  0.041
Xtreme Data XD2000i                     9.0                  0.150
2x Nvidia GeForce 8800                  3.6                  0.017
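For readers unfamiliar with the units: GCUPS stands for giga cell updates per second, where a cell is one entry of the Smith-Waterman matrix. The sketch below shows how the metric is computed and how the power draw implied by a table row can be derived by dividing the two columns. The function names are hypothetical, and the implied power is only as accurate as the rounded values in the table.

```python
def gcups(query_len, db_len, seconds):
    """Giga cell updates per second for aligning a query against a database
    of db_len total characters in the given time."""
    return query_len * db_len / seconds / 1e9


def implied_power_watts(gcups_value, gcups_per_watt):
    """Power draw implied by a performance and a power-efficiency figure."""
    return gcups_value / gcups_per_watt


# Two rows taken from the table above.
print(implied_power_watts(110.0, 4.400))  # AWS-VU9P
print(implied_power_watts(45.0, 0.200))   # Tesla K20
```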
30. HUG Today
RAW READ → CONTIGGING → SCAFFOLDING → RE-SCAFFOLDING → ANNOTATION
Sequence Alignment via Smith-Waterman Algorithm [2]
[1] [2]
31. TiReX
• Flexible Instruction Set Architecture (ISA) representing the regular
expression
• Customized processor for pattern matching
Regular Expression: ACGT(A|C)*TT
Compiler → Instruction Set:
1 & ACGT
2 JIM offset
3 (
4 |)* AC
5 & TT
Data:
ACGTCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGACGTCGGGGCGTGCAAATGC
CCCGTGCGATTTGCGTGACGTCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGACG
TCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGCGTGCGATTTGCGTGACGTCGGG
GCGTGCAAACGTGCGATTTGCGTGACGTCGGGGCGTGCAAAGCTCGATCGATCGATC
GA…
Output: Match results
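As a software point of comparison, the example regular expression ACGT(A|C)*TT can be run with Python's standard `re` module. This only illustrates what a match result looks like; it says nothing about how the hardware ISA executes the pattern, and the helper name `find_matches` and the sample string are assumptions.

```python
import re

# The regular expression shown on the slide.
PATTERN = re.compile(r"ACGT(A|C)*TT")


def find_matches(data):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(data)]


sample = "GGACGTACATTGGACGTTT"  # hypothetical input, not the slide's data
for start, end, text in find_matches(sample):
    print(start, end, text)
```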
32. Pattern Matching via TiReX
Regular Expression               Flex*    16-core† (VC707)  Speedup
ACCGTGGA                         271 µs   2.07 µs           130.90X
(TTT)+CT                         121 µs   4.54 µs           26.65X
(CAGT)|(GGGG)|(TTGG)TGCA(C|G)+   263 µs   3.36 µs           78.27X
* running on an Intel i7 with a peak frequency of 2.8 GHz
† running at 130 MHz
33. HUG Today
RAW READ → CONTIGGING → SCAFFOLDING → RE-SCAFFOLDING → ANNOTATION
Pattern Matching to identify gene motifs during Gene Annotation
[1] [2] [3]
36. HUG Today
RAW READ → CONTIGGING → SCAFFOLDING → RE-SCAFFOLDING → ANNOTATION
PairHMM for gene prediction/finding in the Gene Annotation phase
[1] [2] [3]
41. Roofline Model for FPGAs
• Estimate performance before implementing the kernel
• Compare performance on different FPGA boards
• FPGAs have no fixed architecture, so the Roofline model needs to be generated
- for each kernel
- and for each target FPGA board
Proposed solution: an automatic tool to generate Roofline models
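For reference, the classic Roofline bound is the minimum of the compute peak and the memory bandwidth multiplied by the operational intensity. The sketch below computes that bound; the numbers in the usage lines are made-up placeholders, not measurements of any board, and on an FPGA the peak itself would have to be estimated per kernel as the slide notes.

```python
def attainable_performance(peak_gops, bandwidth_gbs, intensity_ops_per_byte):
    """Classic Roofline bound: min(compute peak, bandwidth * operational intensity)."""
    return min(peak_gops, bandwidth_gbs * intensity_ops_per_byte)


def ridge_point(peak_gops, bandwidth_gbs):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_gops / bandwidth_gbs


# Hypothetical device: 100 GOPS peak, 10 GB/s memory bandwidth.
print(attainable_performance(100.0, 10.0, 2.0))   # memory-bound kernel
print(attainable_performance(100.0, 10.0, 50.0))  # compute-bound kernel
print(ridge_point(100.0, 10.0))
```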
42. Roofline Model for FPGAs
• Based on the Vivado Design Suite
• Given high-level code (now RTL too), it generates the bitstream for the
target board
• Resembles the GPU design flow
Host (OpenCL runtime & APIs) ↔ PCIe ↔ Accelerator (C, C++, OpenCL, RTL)
48. The CAOS framework
● guide the application designer in the implementation of
efficient hardware-software solutions for high performance
FPGA-based systems
● promote open research by allowing external researchers to
easily integrate and benchmark their algorithms
53. CAOS architectural templates
• SST (Single Streaming Timestep)
– Scientific applications with a regular execution flow in which
an N-dimensional data structure is updated over time
– Example: Heat flow simulation, Gauss-Seidel linear systems solver
• Dataflow
– Applications with a static compute graph where the same
computation is repeated for multiple input items
– Example: Asian Option Pricing, Image processing filters
• Master / Slave
– More general applications with a regular execution flow whose
working set can be effectively tiled
– Example: N-Body Physics Simulation
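To give a feel for the kind of code the SST template targets, here is a minimal 1-D heat-flow simulation in which the same stencil update is applied to the whole data structure at every timestep. It is a plain software sketch: the function names, the `alpha` coefficient and the fixed-boundary assumption are illustrative, and nothing here reflects how CAOS actually maps timesteps to hardware.

```python
def heat_step(u, alpha=0.25):
    """One explicit timestep of 1-D heat diffusion; boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new


def simulate(u, steps, alpha=0.25):
    """Apply the same stencil update for a number of timesteps."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u


print(simulate([0.0, 0.0, 100.0, 0.0, 0.0], steps=3))
```

Each timestep reads only the previous state, so consecutive timesteps could in principle be chained as a stream of stages, which is the regular execution flow the template description refers to.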
56. Design Space Exploration
• Problem:
– Identify an optimal implementation of
an SST-based design on a target
FPGA
57. Design Space Exploration
• Problem:
– Identify an optimal implementation of
an SST-based design on a target
FPGA
• Proposed approach [1]: leverage floorplanning to
– Determine the maximum number of SSTs that can be instantiated
– Optimize SST placement to ease timing closure
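A much simpler bound than floorplanning is a pure resource count: the number of modules cannot exceed what the scarcest resource allows. The sketch below computes only that upper bound, not the floorplanning-based approach named on the slide, since it ignores placement, routing and timing. The function name and all resource figures are hypothetical.

```python
def max_instances(available, per_module):
    """Upper bound on module replicas from resource counts alone
    (assumption: ignores placement, routing and timing closure)."""
    return min(
        available[resource] // needed
        for resource, needed in per_module.items()
        if needed > 0
    )


# Hypothetical device capacity and per-module usage.
device = {"LUT": 1000, "DSP": 100, "BRAM": 50}
module = {"LUT": 300, "DSP": 20, "BRAM": 10}
print(max_instances(device, module))
```

A floorplanner can return a smaller number than this bound, because resources are spread across the fabric and each module must fit in a contiguous region.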
71. Master / Slave background
• Support a larger class of source codes, ideally close
to the set of codes supported by High Level
Synthesis tools (e.g. Vivado HLS)
• Tiled computation:
– Application memory is transferred in
blocks from DDR to local FPGA
memory
– Computation is performed on one block of
data at a time
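The tiled pattern described above can be mimicked in software: copy one block at a time into a small local buffer and compute on that block only. In this sketch the slice stands in for the DDR-to-local-memory transfer; the function name, the sum-of-squares kernel and the tile size are assumptions chosen for brevity.

```python
def tiled_sum_of_squares(data, tile_size):
    """Process data one tile at a time, as a tiled accelerator would."""
    total = 0
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]  # models the DDR -> local memory copy
        total += sum(x * x for x in tile)     # computation on the local block only
    return total


print(tiled_sum_of_squares(list(range(10)), tile_size=4))
```

The result does not depend on the tile size, which is what makes the working set "effectively tiled": the tile size can then be chosen to fit the available on-chip memory.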
72. Master / Slave architectural template
• Current implementation:
– Relies on Vivado HLS for resource and performance estimation
– Explores potential optimizations by testing HLS directives
(e.g. #pragma HLS UNROLL, #pragma HLS PIPELINE, …)
– Quite accurate but slow (HLS synthesis time may be high)
• Ongoing work:
– Leverage and extend the classic CPU roofline model to
predict performance and potential optimizations
– Potentially less accurate but faster, allowing exploration of a larger
optimization space
73. Adapting the roofline model for FPGA
PROBLEM:
• Theoretical peak performance depends on the types of
operations performed, their bitwidth and technology mapping
IDEA:
• Estimate peak performance for each combination of:
– Target device
– Operation type
– Operand bitwidth
– Target frequency
• Analyze the source code and derive a unified roofline according to
the operation counts within the code
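The slides do not say how the per-operation peaks are combined into a unified roofline. One plausible way, shown here purely as an assumption, is a weighted harmonic mean: if each operation type ran at its own peak, the combined peak would be the total operation count divided by the total time. The function name and all figures are hypothetical.

```python
def unified_peak(op_counts, peak_per_op):
    """Combine per-operation peaks using the operation mix of a kernel.

    Assumption (not from the slides): weighted harmonic mean, i.e.
    total operations divided by the time each type would take at its own peak.
    """
    total_ops = sum(op_counts.values())
    total_time = sum(count / peak_per_op[op] for op, count in op_counts.items())
    return total_ops / total_time


# Hypothetical operation mix and per-operation peaks (in GOPS).
counts = {"add32": 100, "mul32": 100}
peaks = {"add32": 200.0, "mul32": 100.0}
print(unified_peak(counts, peaks))
```

With this choice the slower operation type dominates the combined peak, and a kernel with a single operation type simply recovers that type's peak.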