Synthesis of Platform Architectures from OpenCL Programs
Muhsen Owaida, Konstantis Daloukas, Nikolaos Bellas, Christos D. Antonopoulos
Department of Computer and Communication Engineering
University of Thessaly
Volos, Greece
2. 01/03/16 FCCM 2011 2
Introduction
• High-Level Synthesis (HLS) has been at the research forefront in the last few years.
• A variety of programming models have been introduced: C/C++, C-like languages, MATLAB, CUDA.
• Obstacles:
– Expressing parallelism.
– Extensive compiler transformations and optimizations.
Motivation
• Lack of a parallel programming language for reconfigurable platforms.
• A major shift of the computing industry toward many-core computing systems.
• Reconfigurable fabrics bear a strong resemblance to many-core systems.
Contribution
• Silicon-OpenCL ("SOpenCL").
• A tool flow that converts an unmodified OpenCL application into a SoC design with HW/SW components.
• Template-based hardware accelerator generation.
• Decoupling of data movement from computation.
[Figure: Architectural Template. Blocks: Streaming Unit, Datapath. Ports: Input data, Output data.]
OpenCL Programming Language
• Open Computing Language
• OpenCL expresses parallelism at its finest granularity.
• The computation grid is partitioned into work-groups in a 3-dimensional index space.
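As a rough illustration of the index space described above, the sketch below enumerates a small 3-dimensional grid and relates each work-item's global ID to its work-group ID and local ID. The helper name `enumerate_work_items` and the grid sizes are assumptions made for this example; the relation global = group * local_size + local is the one OpenCL uses when no global offset is set.

```python
from itertools import product

def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item in the grid.

    Hypothetical helper: models the OpenCL index space with no global offset.
    """
    # Each dimension of the grid must hold a whole number of work-groups.
    assert all(g % l == 0 for g, l in zip(global_size, local_size))
    num_groups = tuple(g // l for g, l in zip(global_size, local_size))
    for group_id in product(*(range(n) for n in num_groups)):
        for local_id in product(*(range(l) for l in local_size)):
            global_id = tuple(
                g * l + i for g, l, i in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id

# A 4x4x2 grid split into work-groups of 2x2x1 work-items (sizes are illustrative).
items = list(enumerate_work_items((4, 4, 2), (2, 2, 1)))
```

Every global ID appears exactly once, which is what lets a tool treat each work-group as an independent unit of work.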
Data Movement
• Explicit data movement: local buffers and global buffers.
SOpenCL Front-End (I)
Granularity Coarsening
• A work-item represents a light computational load.
• Coarsen the granularity because resources and memory bandwidth are limited.
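One way to picture granularity coarsening is to wrap the body of a kernel in a loop over the local IDs of a work-group, so that a whole work-group becomes a single sequential task. The sketch below does this for a hypothetical 1-dimensional vector-add kernel; the kernel, the function names and the sizes are assumptions for illustration and are not taken from the slides.

```python
def vadd_work_item(gid, a, b, out):
    """Fine-grained kernel body: one work-item handles one element."""
    out[gid] = a[gid] + b[gid]

def vadd_work_group(group_id, local_size, a, b, out):
    """Coarsened task: a loop over local IDs replaces the work-items of one group."""
    for local_id in range(local_size):
        gid = group_id * local_size + local_id
        vadd_work_item(gid, a, b, out)

def run_coarsened(global_size, local_size, a, b):
    """Execute the whole grid as one task per work-group."""
    out = [0] * global_size
    for group_id in range(global_size // local_size):
        vadd_work_group(group_id, local_size, a, b, out)
    return out

a = list(range(8))
b = [10] * 8
result = run_coarsened(8, 4, a, b)
```

After coarsening, the unit of scheduling is the work-group loop nest rather than the individual work-item.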
Hardware Generation
• Perform a series of optimizations and transformations.
– Uses the LLVM compiler infrastructure.
• Generate synthesizable Verilog.
• Generate a test bench and simulation files.
[Figure: Hardware generation flow. Stages: C code (nested loop), LLVM compiler, optimized LLVM-IR, predication, code slicing, SMS modulo scheduling, Verilog generation, simulation, synthesis, final bitstream. Inputs: accelerator template, user performance requirements. Outputs: synthesizable Verilog, test bench.]
If-Conversion
• Predication: if-conversion is necessary before the modulo scheduler can be applied.
Innermost loop body (LLVM assembly), before if-conversion:
    bb0:
      r0 = cmp eq t, 0
      br r0, bb1, bb2
    bb1:
      r1 = load A
      br bb3
    bb2:
      r2 = add a, 1
      br bb3
    bb3:
      r4 = phi r1, bb1, r2, bb2
      br bb4
After if-conversion (predicates shown in parentheses):
    bb0:
      r0 = cmp eq t, 0
      p0 = xor r0, true
      (r0) r1 = load A
      (p0) r2 = add a, 1
      r4 = select r0, r1, r2
      br bb4
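The effect of the if-conversion in the LLVM listing above can be checked with a small model. The sketch below mirrors the two versions in Python; representing `A` as a one-element list and modelling a guarded operation as "produce a value only when the predicate holds" are assumptions made for illustration.

```python
def branchy(t, A, a):
    """Control-flow version: mirrors bb0..bb3 with a branch on t == 0."""
    if t == 0:
        return A[0]        # bb1: r1 = load A
    return a + 1           # bb2: r2 = add a, 1

def predicated(t, A, a):
    """If-converted version: one straight-line block with guarded operations."""
    r0 = (t == 0)
    p0 = not r0                    # p0 = xor r0, true
    r1 = A[0] if r0 else None      # (r0) the load takes effect only when r0 holds
    r2 = a + 1 if p0 else None     # (p0) the add takes effect only when p0 holds
    r4 = r1 if r0 else r2          # r4 = select r0, r1, r2
    return r4

# Compare both versions on a few inputs (values are illustrative).
checks = [(branchy(t, [42], 7), predicated(t, [42], 7)) for t in (0, 1, 5)]
```

With the branch removed, the loop body is a single basic block, which is the form a modulo scheduler operates on.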
Code Slicing
• Decouple data movement and computations.
• Input streaming kernel
• Output streaming kernel
• Computational kernel
[Figure: Predicated LLVM loop (part of the Chroma Interpolation LLVM code). Labeled regions: Termination, Computation.]
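The decoupling listed above can be sketched by splitting one loop into three pieces that communicate only through FIFOs. The two-tap averaging filter, the function names and the use of `deque` as a FIFO are assumptions for illustration; in hardware the three kernels would run concurrently, whereas this model runs them one after another.

```python
from collections import deque

def original_loop(src):
    """Reference loop: loads, arithmetic and stores are interleaved."""
    return [(src[i] + src[i + 1] + 1) // 2 for i in range(len(src) - 1)]

def input_streaming_kernel(src, n, in_fifo):
    """Address generation and loads only: push operands in iteration order."""
    for i in range(n):
        in_fifo.append((src[i], src[i + 1]))

def computational_kernel(n, in_fifo, out_fifo):
    """Arithmetic only: no memory addresses appear here."""
    for _ in range(n):
        x, y = in_fifo.popleft()
        out_fifo.append((x + y + 1) // 2)

def output_streaming_kernel(dst, n, out_fifo):
    """Address generation and stores only: drain results in iteration order."""
    for i in range(n):
        dst[i] = out_fifo.popleft()

def sliced_filter(src):
    """Run the three slices back to back and return the output buffer."""
    n = len(src) - 1
    in_fifo, out_fifo, dst = deque(), deque(), [0] * n
    input_streaming_kernel(src, n, in_fifo)
    computational_kernel(n, in_fifo, out_fifo)
    output_streaming_kernel(dst, n, out_fifo)
    return dst

sliced = sliced_filter([1, 3, 5, 8])
```

Because the computational kernel never sees an address, it can be mapped to a datapath while the two streaming kernels are mapped to the unit that talks to memory.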
Modulo Scheduling
• Software Pipelining:
– II: Initiation Interval.
• Swing Modulo Scheduling (SMS).
• Valid bits are used to implement the prologue and epilogue.
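To make the role of the initiation interval and the valid bits concrete, the sketch below simulates a software-pipelined loop with a shift register of valid bits. It is a simplified model and not the SMS algorithm itself: the schedule length, the II and the function name are assumed inputs, and only the filling (prologue) and draining (epilogue) behaviour is modelled.

```python
def simulate_pipeline(num_iterations, schedule_length, ii):
    """Return (total_cycles, occupancy) for a loop pipelined with the given II.

    valid[k] is True when some iteration is in cycle k of its schedule.
    """
    valid = [False] * schedule_length
    issued = completed = cycle = 0
    occupancy = []                       # iterations in flight at each cycle
    while completed < num_iterations:
        # A new iteration may start every `ii` cycles while any remain.
        start = (cycle % ii == 0) and (issued < num_iterations)
        valid = [start] + valid[:-1]     # shift the valid bits by one cycle
        issued += start
        completed += valid[-1]           # an iteration leaves the last slot
        occupancy.append(sum(valid))
        cycle += 1
    return cycle, occupancy

total, occupancy = simulate_pipeline(num_iterations=4, schedule_length=3, ii=1)
```

The occupancy ramps up and then down again without separate prologue or epilogue code: the same steady-state control runs throughout, and the valid bits simply disable slots that hold no iteration. The cycle count matches the usual estimate (N - 1) * II + schedule length.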
Verilog Generation
[Figure: Verilog generation. Labels: Feed Data in Order; Write Data in Order; FU types, Bitwidths, I/O Bandwidth; Requests/Data FIFO Size.]
Run-Time
• The OpenCL main program is executed as the main thread on the host processor of the platform (e.g. PowerPC).
• Work-tasks are created by the helper thread.
[Figure: Run-time flow with numbered steps 1 to 5. Elements: host main thread, host helper thread, command queue, work queue, work thread (PowerPC), accelerator. Labeled actions: enqueue OpenCL command, enqueue new work tasks, initialize accelerator, finish signal.]
Experimental Evaluation
• We tested the SOpenCL methodology on six OpenCL and C applications.
• We evaluated our designs on a Xilinx Virtex-5 FX70 FPGA.
• We used the Xilinx ISE 11.4 toolset for synthesis, placement and routing.
• Evaluation methodology:
– Three levels of resource availability {Ca, Cb, Cc}.
– Three Requests/Data FIFO sizes.
– Cache usage.