High Performance Reconfigurable Computing for Genomics Research

DReAMS
High Performance Reconfigurable Computing at
NECSTLab
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Lorenzo Di Tucci, Marco Rabozzi, Marco D. Santambrogio
05/31/2018
MSR, Mountain View - CA
{lorenzo.ditucci, marco.rabozzi, marco.santambrogio}@polimi.it

NECSTLab @ PoliMi
NECSTLab is a laboratory within the DEIB department (Dipartimento di Elettronica, Informazione e Bioingegneria) of Politecnico di Milano.

System Security Research Line
• MaTA
Malware and Threat Analysis
• FraudSec
Frauds Analysis and Detection
• MoSec
Mobile Security
• CyPhy
Security of Cyber-physical systems
Prof. S. Zanero
stefano.zanero@polimi.it

Computer Architecture Research Line
DReAMS
• To discover the world of FPGA-based systems
• Design and implementation of reconfigurable computing: from architectural aspects to CAD development
• How to use CS to “speed up/improve” biomed applications
ORCA
• Unleashed computer architecture and operating systems
• From embedded to HPC computing systems, focusing on computer architectures, OS and monitoring infrastructures
STeEL
• To make smart environments come true!
• How to make heterogeneous components coexist to improve quality of life and comfort while minimizing power and energy consumption
• Emotional and physical comfort
• Biometric human recognition
Prof. D. Sciuto
donatella.sciuto@polimi.it
Prof. M. Santambrogio
marco.santambrogio@polimi.it

Genomic research
Recent advancements in genomic research make it possible to perform multiple analyses on DNA, with an impact on different fields.
However, extracting biological meaning from secondary genome analysis requires a complex process.

Genome sequencing
Given a biological sample, genome sequencing is the process of determining the
precise order of nucleotides within a DNA molecule
This process produces short DNA fragments which need to be assembled to
reconstruct the original sequence
ACGTAGCTCGGACCATAGCA
CCGCCGTAGCTCGGACCATAGCACATG
AGTTTTGGGGGACCATAGCACATGGACACATGC
GGACCATAGCACATGGACACATGC
GGTCAAAAATAGCACATGGACACATGC
ATTGTATCGGACCATATTGCTTAGCATGTATTTGC
CATGGACACATGC
CGTAACCATAGCACATGGACACATGC
TTTTAGGTAATTGCCATAGCACATGGACACAT

Genome assembly
Genome assembly: reconstruct a genome from a set of shorter reads
Reference-based assembly
ACGTAGCTCGGACCATAGCA
GGACCATAGCACATGGACACATGC
ACGTAGCTCGGACCATAGCAGGACCATAGCACATGGACATGGACACATGCTTA
CATGGACACATGC
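The idea of reference-based assembly can be illustrated with a minimal sketch that places each read on a known reference. It assumes exact matching only; the function name `map_reads` and the short sequences are illustrative, and real aligners use indexes and tolerate mismatches and gaps.

```python
def map_reads(reference, reads):
    """Place each read on the reference by exact substring search.

    Returns a dict read -> 0-based position, or -1 when the read
    does not occur verbatim in the reference.
    """
    return {read: reference.find(read) for read in reads}


# Illustrative data (not taken from a real genome).
reference = "ACGTAGCTCGGACCATAGCA"
reads = ["ACGTAG", "GGACCA", "TTTT"]
placements = map_reads(reference, reads)
for read, pos in placements.items():
    print(read, pos)
```

Reads that return -1 are exactly the ones an exact-match approach cannot place, which is why practical tools score approximate alignments instead.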

Genome assembly
Genome assembly: reconstruct a genome from a set of shorter reads
De novo assembly
ACGTAGCTCGGACCATAGCAGGACCATAGCACATGGACATGGACACATGCTTA
Reference-based assembly is limited to species with available reference genomes; de novo assembly does not require one.
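Without a reference, reads have to be joined using the overlaps among them. The sketch below shows that building block under strong assumptions: overlaps are exact, and the names `overlap`, `merge` and the `min_len` threshold are illustrative. Production assemblers typically rely on graph-based methods and tolerate sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Illustrative reads sharing the 5-base overlap "CTCGG".
contig = merge("ACGTAGCTCGG", "CTCGGACCAT")
print(contig)
```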

De novo assembly
Genomics algorithms are usually:
• compute-intensive
• data-intensive (massive amounts of data)
• fast-changing

De novo assembly
Issue:
• General-purpose architectures are inefficient
Solution:
• In this scenario, hardware accelerators have proved effective at optimizing the performance-to-power-consumption ratio

Hardware architectures
Learning curve for multiple architectures

Objective
Advanced support for genomic research that exploits heterogeneous hardware architectures

Genomics Hardware Pipeline
Pipeline creation → Data upload → Processing → Data visualization
User: “HUG has exactly what I need!”

Scientific Data Visualization

HUG Today [1] [2] [3]
[1] Lorenzo Di Tucci, Giulia Guidi, Sara Notargiacomo, Luca Cerina, Alberto Scolari, and Marco D. Santambrogio. "HUGenomics: A support to personalized medicine research." In Research and Technologies for Society and Industry (RTSI), 2017 IEEE 3rd International Forum on, pp. 1-5. IEEE, 2017.
[2] Lorenzo Di Tucci, Davide Conficconi, Alessandro Comodi, Steven Hofmeyr, David Donofrio, and Marco Domenico Santambrogio. "A Parallel, Energy Efficient Hardware Architecture for the merAligner on FPGA using Chisel HDL." In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018 IEEE International, t.b.p. IEEE, 2018.
[3] Davide Conficconi, Alessandro Comodi, Alberto Scolari, and Marco Domenico Santambrogio. "TiReX: a Tiled Regular Expression Matching Architecture." In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018 IEEE International, t.b.p. IEEE, 2018.

Smith-Waterman
● Dynamic programming algorithm
● Performs local sequence alignment between two strings (DNA, RNA, ...)
● Guaranteed to find the optimal local alignment with respect to the scoring system used
● Highly compute intensive
● To increase performance, many state-of-the-art implementations rely on heuristics, which trade a speedup in computation for a decrease in algorithm precision
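To make the dynamic programming recurrence concrete, here is a minimal software sketch of the Smith-Waterman score computation. It assumes a linear gap penalty and illustrative scoring values (`match=2`, `mismatch=-1`, `gap=-1`); it is not the hardware implementation discussed in the slides, and it returns only the best score, not the traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    Each cell holds the best score of an alignment ending at (i, j);
    clamping at 0 lets a local alignment restart anywhere.
    Only two rows are kept, since row i depends on row i-1 alone.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

The inner loop touches len(a) × len(b) cells, which is why the algorithm is compute intensive and why throughput is usually reported in cell updates per second.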

Smith-Waterman
Platform                                  Performance [GCUPS]   Power Efficiency [GCUPS/W]
AWS-VU9P (3 queries in parallel)          110.0                 4.400
Tesla K20                                 45.0                  0.200
ADM-PCIE-KU3                              42.5                  1.699
Nvidia GeForce GTX 295                    30.0                  0.104
Xtreme Data XD1000                        25.6                  0.430
Altera Stratix V on Nallatech PCIe-385    24.7                  0.988
ADM-PCIE-7V3                              14.8                  0.594
Dual-core Nvidia 9800 GX2                 14.5                  0.074
Xtreme Data XD2000i                       9.0                   0.150
2x Nvidia GeForce 8800                    3.6                   0.017
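The units in the table can be unpacked with a short sketch. GCUPS counts billions of dynamic programming cell updates per second; dividing a GCUPS figure by the matching GCUPS/W figure gives the power draw implied by the two columns. The helper names are illustrative, and the implied power is derived arithmetic under the assumption that efficiency is simply performance divided by power, not a measurement reported in the slides.

```python
def gcups(query_len, db_len, seconds):
    """Giga cell updates per second for one query/database comparison."""
    return query_len * db_len / seconds / 1e9


def implied_power_watts(gcups_value, gcups_per_watt):
    """Power implied by a performance figure and its efficiency figure."""
    return gcups_value / gcups_per_watt


# Filling a 1,000 x 1,000,000 matrix in one second is 1 GCUPS.
print(gcups(1000, 1_000_000, 1.0))

# Rows from the table: (performance, efficiency) -> implied watts.
for name, perf, eff in [("AWS-VU9P", 110.0, 4.400),
                        ("Tesla K20", 45.0, 0.200),
                        ("ADM-PCIE-KU3", 42.5, 1.699)]:
    print(name, round(implied_power_watts(perf, eff), 1), "W")
```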

HUG Today [1]
Raw read → Contigging → Scaffolding → Re-scaffolding → Annotation
Sequence alignment via the Smith-Waterman algorithm [2]

TiReX
• Flexible Instruction Set Architecture (ISA) representing the regular expression
• Customized processor for pattern matching
Flow: Regular Expression → Compiler → Instruction Set; the processor takes the instructions and the data and produces the match results
Example regular expression: ACGT(A|C)*TT
Instruction Set:
1 & ACGT
2 JIM offset
3 (
4 |)* AC
5 & TT
Data:
ACGTCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGACGTCGGGGCGTGCAAATGC
CCCGTGCGATTTGCGTGACGTCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGACG
TCGGGGCGTGCAAATGCCCCGTGCGATTTGCGTGCGTGCGATTTGCGTGACGTCGGG
GCGTGCAAACGTGCGATTTGCGTGACGTCGGGGCGTGCAAAGCTCGATCGATCGATC
GA…
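As a software point of comparison for the example expression ACGT(A|C)*TT, the sketch below searches a sequence with Python's standard `re` module. It only illustrates what a match of this expression looks like; it says nothing about the TiReX instruction semantics, and the helper name `find_matches` and the test sequences are illustrative.

```python
import re

# The example expression: ACGT, then any run of A or C, then TT.
PATTERN = re.compile(r"ACGT(A|C)*TT")


def find_matches(data):
    """Return (start, end) spans of the non-overlapping matches in `data`."""
    return [m.span() for m in PATTERN.finditer(data)]


print(find_matches("GGACGTACCATTGG"))
print(find_matches("ACGTTT"))
print(find_matches("ACGTGTT"))
```

The third sequence does not match because the G after ACGT is neither part of the (A|C)* run nor the start of TT.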

Pattern Matching via TiReX
Regular Expression                 Flex*     16-core† (VC707)   Speedup
ACCGTGGA                           271 µs    2.07 µs            130.90X
(TTT)+CT                           121 µs    4.54 µs            26.65X
(CAGT)|(GGGG)|(TTGG)TGCA(C|G)+     263 µs    3.36 µs            78.27X
* running on an Intel i7 with a peak frequency of 2.8 GHz
† running at 130 MHz

HUG Today [1] [2] [3]
Pattern matching to identify gene motifs during gene annotation

HUG Today [1] [2] [3]
PairHMM for gene prediction/finding in the gene annotation phase

Roofline Model
Performance model that depicts the relation between attainable
performance and operational intensity
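The relation can be written as attainable performance = min(peak compute performance, memory bandwidth × operational intensity). A minimal sketch follows; the parameter names and the numbers used are illustrative and do not describe any specific board.

```python
def attainable_performance(peak_ops_per_s, bandwidth_bytes_per_s,
                           operational_intensity):
    """Classic roofline bound.

    operational_intensity is in operations per byte moved from memory.
    Below the ridge point the kernel is memory bound, above it the
    kernel is compute bound.
    """
    return min(peak_ops_per_s, bandwidth_bytes_per_s * operational_intensity)


# Hypothetical device: 100 Gops/s peak, 10 GB/s memory bandwidth.
PEAK, BW = 100e9, 10e9
ridge_point = PEAK / BW  # ops/byte where the two limits meet

print(attainable_performance(PEAK, BW, 2.0))   # memory-bound region
print(attainable_performance(PEAK, BW, 50.0))  # compute-bound region
```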

Roofline Model for FPGAs
• Estimate performance before implementing the kernel
• Compare performance on different FPGA boards
• FPGAs have no fixed architecture
- The roofline model needs to be generated for each kernel
- and for each target FPGA board
Proposed solution: an automatic tool to generate Roofline models

Roofline Model for FPGAs
• Based on the Vivado Design Suite
• Given high-level code (now RTL too), it generates the bitstream for the target board
• Resembles the GPU design flow
Host (OpenCL runtime & APIs) ↔ PCIe ↔ Accelerator (C, C++, OpenCL, RTL)

Genomics Hardware Pipeline on the AWS cloud

Custom Code Integration
Figure labels: heterogeneous architecture; HUG
User: “I’d like to integrate my own algorithm”

Is the algorithm available on HUG?
• Yes: pipeline creation or integration → data upload → processing → data visualization
• No: fast prototyping of a custom hardware algorithm, then pipeline creation or integration

The CAOS framework
● guide the application designer in the implementation of
efficient hardware-software solutions for high performance
FPGA-based systems
● promote open research by allowing external researchers to
easily integrate and benchmark their algorithms

The proposed CAOS framework

CAOS infrastructure

CAOS architectural templates
• SST (Streaming Stencil Time-step)
– Scientific applications with a regular execution flow in which an N-dimensional data structure is updated over time
– Examples: heat flow simulation, Gauss-Seidel linear system solver
• Dataflow
– Applications with a static compute graph where the same computation is repeated for multiple input items
– Examples: Asian option pricing, image processing filters
• Master / Slave
– More general applications with a regular execution flow whose working set can be effectively tiled
– Example: N-body physics simulation

SST background
• Iterative Stencil Loop (ISL) algorithm
• FPGA-based Streaming Stencil Time-step (SST) architecture [1]
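An Iterative Stencil Loop repeatedly updates every element of a grid from the previous values of its neighbours. The sketch below is a plain software illustration using a 1D heat-diffusion stencil; the coefficient `alpha`, the fixed boundary cells and the function names are assumptions made for the example, and it does not model the FPGA architecture itself.

```python
def heat_step(grid, alpha=0.25):
    """One time step of a 1D heat-diffusion stencil.

    Every interior cell is recomputed from the OLD values of itself
    and its two neighbours; the boundary cells are kept fixed.
    """
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = grid[i] + alpha * (grid[i - 1] - 2 * grid[i] + grid[i + 1])
    return new


def run_isl(grid, steps, alpha=0.25):
    """Iterative stencil loop: apply the same stencil for `steps` time steps."""
    for _ in range(steps):
        grid = heat_step(grid, alpha)
    return grid


print(run_isl([0.0, 4.0, 0.0], 2))
```

Because each time step consumes only the output of the previous one, the time loop is a natural candidate for a chain of streaming stages.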

Design Space Exploration
• Problem:
– Identify an optimal implementation of an SST-based design on a target FPGA
• Proposed approach [1]: leverage floorplanning to:
– Determine the maximum number of SSTs that can be instantiated
– Optimize SST placement to ease timing closure

Supported code & dataflow IR
void foo(type_1* in_1, type_2* in_2, scalar_type_1* v1, type_1* out_1){
    for(int i = offs; i < I_SIZE; i++){
        S1: …statements…
        for(int j = …; j < 15; j++){
        }
        for(int j = … ){
        }
    }
    for(int i = offs_3; … ){
    }
}
Figure label: outer streaming loops (the top-level loops over i)

Dataflow IR elements labelled in the figure for the same code: outer streaming loops, nested loop nodes, loop-carried dependency arc

Design optimizations

Which optimization to apply?

CAOS experimental results
30 15.83 1.0
15 31.22 1.0
10 45.95 0.1
8 56.91 1.0
6 74.35 0.1
5 88.10 0.1
39.06 37.65 21.86 37.15
55.08 42.09 24.47 40.71
29.30 41.50 31.55 52.32
87.11 54.30 29.60 47.73
41.02 59.13 39.51 64.31
46.88 63.96 43.31 70.09

CAOS vs hand-tuned implementation
1.78 s vs 1.25 s
Testbench parameters:
• Averaging points window: 30
• Asian Options in portfolio: 10000
• Market scenarios: 5000

Master / Slave background
• Support a larger class of source codes, ideally close to the set of codes supported by High-Level Synthesis tools (e.g. Vivado HLS)
• Tiled computation:
– The application’s memory is transferred in blocks from DDR to local FPGA memory
– Computation is performed on one block of data at a time
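The tiled computation pattern can be sketched in software as follows. The list slice stands in for a block transfer from DDR to local memory, and `kernel`, `tile_size` and `process_tiled` are illustrative names; the sketch assumes the kernel works on each block independently and ignores overlap between transfer and compute.

```python
def process_tiled(data, tile_size, kernel):
    """Apply `kernel` to `data` one block at a time.

    Each slice models a block copied from external memory (DDR) into a
    small local buffer; results are written back in order.
    """
    out = []
    for start in range(0, len(data), tile_size):
        local = data[start:start + tile_size]  # models the DDR -> local copy
        out.extend(kernel(local))              # compute on one block only
    return out


def double_all(block):
    """Example kernel: element-wise doubling of a block."""
    return [2 * x for x in block]


result = process_tiled(list(range(10)), tile_size=4, kernel=double_all)
print(result)
```

The last block may be shorter than `tile_size`; a hardware design has to handle that case explicitly, for example by padding.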

Master / Slave architectural template
• Current implementation:
– Relies on Vivado HLS for resource and performance estimation
– Explores potential optimizations by testing HLS directives (e.g. #pragma HLS UNROLL, #pragma HLS PIPELINE, …)
– Quite accurate but slow (HLS synthesis time may be high)
• Ongoing work:
– Leverage and extend the classic CPU roofline model to predict performance and potential optimizations
– Potentially less accurate but faster, allowing exploration of a larger optimization space

Adapting the roofline model for FPGA
PROBLEM:
• Theoretical peak performance depends on the types of operations performed, their bitwidth and technology mapping
IDEA:
• Estimate peak performance for each combination of:
– Target device
– Operation type
– Operand bitwidth
– Target frequency
• Analyze the source code and derive a unified roofline according to the operation counts within the code
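One possible way to combine per-operation peaks into a single ceiling is sketched below. It is an assumption made for illustration, not the method used in CAOS: it supposes that device resources can be split linearly among operator types, so that the combined peak is the operation-count-weighted harmonic mean of the individual peaks. The operation names and all numbers are hypothetical.

```python
def unified_peak(op_counts, peak_by_op):
    """Combined peak (ops/s) for a kernel with a given operation mix.

    op_counts  : operations of each type per kernel iteration
    peak_by_op : peak ops/s if the whole device implemented only that type
    Assumes resources are shared linearly among operator types, so the
    time per iteration is the sum of count / peak over all types.
    """
    total_ops = sum(op_counts.values())
    time_per_iteration = sum(n / peak_by_op[op] for op, n in op_counts.items())
    return total_ops / time_per_iteration


# Hypothetical mix: 2 additions and 2 multiplications per iteration.
counts = {"add": 2, "mul": 2}
peaks = {"add": 200.0, "mul": 100.0}  # arbitrary units of ops/s
print(unified_peak(counts, peaks))
```

Under this assumption the combined peak always lies between the slowest and the fastest per-operation peak, and it moves toward the slowest one as that operation dominates the mix.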

Roofline model for FPGA
• Instruction counts from static code analysis
• HW operators for ideal performance (within resource constraints)
• Estimated HW resources breakdown
• Peak performance (Ops / sec) with fully pipelined operators @ target frequency
• Operational intensity derived from static code analysis

We presented DReAMS
Thanks!
Lorenzo Di Tucci, Marco Rabozzi, Marco D. Santambrogio
{lorenzo.ditucci, marco.rabozzi, marco.santambrogio}@polimi.it
