OpenCL for ALTERA FPGAs
Accelerating performance and design productivity
Liad Weinberger – Appilo
May 1st, 2013
Technology trends
• Over the past years
– Technology scaling favors programmability and parallelism
[Diagram: compute spectrum — from single cores and coarse-grained CPUs/DSPs, through multi-cores and coarse-grained massively parallel processor arrays, to fine-grained massively parallel arrays: CPUs, DSPs, Multi-Cores, Array GPGPUs, FPGAs]
Technology trends
[Chart: process node (nm) shrinking steadily, 2000–2022]
• Moore’s law still in effect
– More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that utilize this real-estate become harder to
manage and maintain – this is not so good...
Technology trends
[Google Trends chart: worldwide interest over the years, 2007–2013 — searches for Verilog + VHDL]
• Decreased interest
– Number of Google searches for VHDL or
Verilog in decline
Technology trends
[Google Trends chart: interest over the years, 2007–2013 — Verilog + VHDL vs. Python]
• Software development keeps its momentum
– Number of Google searches for Python (as a
representative language) remains strong
FPGA (hardware) development
• Design (programming) is complex
– Define state machines, data-paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable
– Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Diagram: iterative flow — HDL → Set Constraints → Compilation → Timing Closure, and back again]
Software development
• Programming is straight-forward
– Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with a simple sequential implementation
• Use parallel APIs / language extensions to exploit multi-core
architectures for additional performance
– Compilation times are usually reasonably short
• Simple straight-forward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue
– Possible, but requires pre-planning
[Diagram: C/C++/Python/etc. source code → Compiler & Linker → executable]
Product development point-of-view
• Product producers want:
– Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-in-market, and short time-to-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• No vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Our challenge
• How do we bring the FPGA design process closer to the
software development model?
– Need to make FPGAs more accessible to the software development
community
• Change in mind-set: look at FPGAs as massively multi-core devices that
could be used in order to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling
the design
An ideal programming environment...
• Based on a standard programming model
– Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware
– VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high
– Still need to efficiently use the FPGA’s resources to achieve high throughput / low
area
• Allows for software-like compilation & debug cycles
– Faster compile times
– Profiling & user feedback
Introducing OpenCL
Parallel heterogeneous computing
A case for OpenCL
• What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of
heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• ... or maybe all together
– Maintained by the Khronos Group
• An industry consortium creating open, royalty-free standards
• Comprised of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts:
– Application Programming Interface (API) for device management
– Device programming language based on C99 with
some restrictions and extensions to support explicit parallelism
Benefits of OpenCL
• Cross-vendor software portability
– Functional portability—Same code would normally execute on
different hardware, by different vendors
– Not performance portable—Code still needs to be optimized for a
specific device (or at least a device class)
• Allows for the management of available computational
resources under a single framework
– Views CPUs, GPUs, FPGAs, and other accelerators as devices that
could carry the computational needs of the application
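To make the single-framework point concrete, here is a minimal sketch (my illustration, not from the talk) of OpenCL 1.x host code that enumerates every platform and device visible to the runtime — CPUs, GPUs, and FPGA boards alike appear through the same API. It assumes at most 4 platforms/devices and omits error checks:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[4];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(4, platforms, &num_platforms);
    for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(pname), pname, NULL);
        cl_device_id devices[4];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 4,
                       devices, &num_devices);
        for (cl_uint d = 0; d < num_devices; ++d) {
            char dname[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dname), dname, NULL);
            printf("%s: %s\n", pname, dname);  /* one line per device */
        }
    }
    return 0;
}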
OpenCL program structure
• Separation between managerial and computational code bases
– Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API itself is ANSI C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
OpenCL program structure
[Diagram: Host connected to a Compute Device; the device holds global memory and several accelerators (compute units), each with its own local memory]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Host Program
Kernel Program
OpenCL host application
• Communicates with the Accelerator Device via a set of
library routines
– Abstracts away host processor to HW accelerator communication via
a set of API calls
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Copy data: Host → FPGA
Ask the FPGA to run a particular kernel
Copy data: FPGA → Host
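For reference, a minimal sketch of the boilerplate around those three calls (OpenCL 1.x, single platform and device, no error checking; kernel_src is a placeholder for the OpenCL C source — with Altera's SDK the program would typically be loaded from a precompiled binary via clCreateProgramWithBinary instead of built from source):

#include <CL/cl.h>

extern const char *kernel_src;   /* hypothetical: OpenCL C source of sum() */

void run_sum(const float *a, const float *b, float *y, size_t n) {
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel program (FPGA flows usually load a binary here). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "sum", NULL);

    /* Device buffers backing the kernel's __global pointers. */
    size_t bytes = n * sizeof(float);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clEnqueueWriteBuffer(q, da, CL_TRUE, 0, bytes, a, 0, NULL, NULL); /* Host → FPGA */
    clEnqueueWriteBuffer(q, db, CL_TRUE, 0, bytes, b, 0, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dy);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);   /* run sum() */

    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, bytes, y, 0, NULL, NULL);  /* FPGA → Host */
    clFinish(q);
}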
OpenCL kernels
• Data-parallel function
– Executed by many parallel
threads
• Each thread has an identifier
which can be obtained with
a call to the get_global_id()
built-in function
• Uses qualifiers to define
where memory buffers reside
• Executed by a
compute device
– CPU
– GPU
– FPGA
– Other accelerator
float *a = {0, 1, 2, 3, 4, 5, 6, 7}
float *b = {7, 6, 5, 4, 3, 2, 1, 0}
float *y = {7, 7, 7, 7, 7, 7, 7, 7}
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
__kernel void sum( … );
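To show the memory-space qualifiers doing real work, here is a hedged sketch (mine, not from the talk) of a kernel that stages data in __local memory and reduces it within each work-group; it assumes a power-of-two work-group size and one input element per work-item:

__kernel void partial_sums(__global const float *in,
                           __global float *out,
                           __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];      /* global -> local staging */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* Tree reduction inside the work-group's local memory. */
    for (int s = (int)get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];    /* one partial sum per group */
}

On the host side, a __local argument is sized (not filled) with clSetKernelArg(k, 2, local_bytes, NULL).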
OpenCL on FPGAs
How does it map?
Compiling OpenCL to FPGAs
[Diagram: an OpenCL application (host program + kernels) splits into two flows — the kernels go through the ACL compiler to an FPGA image (SOF), while the host program goes through a standard C compiler to an x86 binary; the two halves communicate over PCIe]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
Kernel Programs Host Program
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Compiling OpenCL to FPGAs
[Diagram: the sum kernel instantiated as replicated Load/Load/Store pipelines in the FPGA fabric, attached to PCIe and DDRx memory interfaces]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
Kernel Programs
Custom Hardware for Your Kernels
FPGA architecture for OpenCL
[Diagram: FPGA architecture for OpenCL — a kernel system of multiple kernel pipelines; a local memory interconnect to on-chip memories; a global memory interconnect to external memory controllers & PHYs (DDR*); and a PCIe link to an x86 / external processor]
Mapping multithreaded kernels to FPGAs
• Simplest way of mapping kernel functions to FPGAs is
to replicate hardware for each thread
– Inefficient and wasteful
• Technique: deep pipeline parallelism
– Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new
thread
– Method of mapping coarse-grained thread parallelism to fine-grained
FPGA parallelism
Example pipeline for vector add
• On each cycle, the portions of
the pipeline are processing
different threads
• While thread 2 is being loaded,
thread 1 is being added, and
thread 0 is being stored
[Animation over five slides: 8 threads (IDs 0–7) enter the Load/Load → + → Store
pipeline one per cycle; each cycle every thread ID advances one stage, so up to
three threads are in flight at once]
Some examples
Using ALTERA’s OpenCL solution
AES encryption
• Counter (CTR) based encryption/decryption
– 256-bit key
• Advantage FPGA
– Integer arithmetic
– Coarse grain bit operations
– Complex decision making
• Results
Platform                   Throughput (GB/s)
E5503 Xeon Processor       0.01 (single core)
AMD Radeon HD 7970         0.33
PCIe385 A7 Accelerator     5.20
– FPGA only 42% utilized (2 kernels)
• Power conservation
• Fill up for even higher performance
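CTR mode is what makes this workload embarrassingly parallel: each block's keystream depends only on the counter value, so every work-item can process one block independently. A hedged OpenCL C sketch — aes256_encrypt_block() is a hypothetical helper standing in for the actual AES-256 rounds, and a real implementation would format a full 128-bit counter block:

/* Hypothetical kernel: one work-item encrypts one 16-byte block. */
__kernel void aes256_ctr(__global const uchar16 *plaintext,
                         __global uchar16 *ciphertext,
                         __constant uchar *round_keys,
                         ulong base_counter)
{
    size_t gid = get_global_id(0);
    uchar16 keystream = aes256_encrypt_block(base_counter + gid, round_keys);
    ciphertext[gid] = plaintext[gid] ^ keystream;  /* XOR keystream with data */
}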
Multi-asset barrier option pricing
• Monte Carlo simulation
– Heston model
– NDRange
• Assets × paths (64 × 1,000,000)
• Advantage FPGA
– Complex control flow
• Heston model dynamics:
dS_t = \mu S_t \, dt + \sqrt{\nu_t} \, S_t \, dW_t^S
d\nu_t = \kappa(\theta - \nu_t) \, dt + \xi \sqrt{\nu_t} \, dW_t^{\nu}
• Results
Platform                 Power (W)   Performance (Msims/s)   Msims/W
W3690 Xeon Processor     130         32                      0.25
nVidia Tesla C2075       225         63                      0.28
PCIe385 D5 Accelerator   23          170                     7.40
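A hedged C sketch of what each simulation path iterates — one Euler–Maruyama step of the two equations above, with full truncation to keep the variance non-negative. Parameter names mirror the equations; z_s and z_v are correlated standard normals supplied by the caller:

#include <math.h>

/* One Euler–Maruyama step of the Heston model (illustrative sketch). */
void heston_step(double *S, double *v,
                 double mu, double kappa, double theta, double xi,
                 double dt, double z_s, double z_v)
{
    double v_pos = fmax(*v, 0.0);    /* full truncation scheme */
    double sq_v  = sqrt(v_pos);
    double sq_dt = sqrt(dt);         /* dW ~ sqrt(dt) * N(0,1) */
    *S += mu * (*S) * dt + sq_v * (*S) * sq_dt * z_s;
    *v += kappa * (theta - v_pos) * dt + xi * sq_v * sq_dt * z_v;
}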
Document filtering
• Unstructured data analytics
– Bloom Filter
• Advantage FPGA
– Integer arithmetic
– Flexible memory configuration
• Results
Platform                         Power (W)   Performance (MTs)   MTs/W
W3690 Xeon Processor             130         2070                15.92
nVidia Tesla C2075               215         3240                15.07
DE4 Stratix IV-530 Accelerator   21          1755                83.57
PCIe385 A7 Accelerator           25          3602                144.08
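For context, a Bloom filter answers "possibly in set / definitely not in set" with a bit array and k hash functions — mostly integer arithmetic and bit operations, which is exactly where the FPGA has the advantage. A minimal C sketch (the hash is a simplistic placeholder; real filters use k independent hashes and a tuned filter size):

#include <stdint.h>

#define BITS (1u << 20)              /* 1 Mbit filter (illustrative size) */
static uint8_t filter[BITS / 8];

/* Seeded FNV-1a hash over a term, reduced to a bit index. */
static uint32_t hash(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % BITS;
}

void bloom_add(const char *term) {
    for (uint32_t k = 0; k < 4; ++k) {       /* k = 4 hash functions */
        uint32_t bit = hash(term, k);
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

int bloom_maybe_contains(const char *term) { /* 0 = definitely absent */
    for (uint32_t k = 0; k < 4; ++k) {
        uint32_t bit = hash(term, k);
        if (!(filter[bit / 8] & (1u << (bit % 8)))) return 0;
    }
    return 1;                                /* possibly present */
}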
Fractal video compression
• Best matching codebook entry
– Correlation with SAD (sum of absolute differences)
• Advantage FPGA
– Integer arithmetic
• Results
Platform                         Power (W)   Performance (FPS)   FPS/W
W3690 Xeon Processor             130         4.6                 0.035
nVidia Tesla C2075               215         53.1                0.247
DE4 Stratix IV-530 Accelerator   21          70.9                3.376
PCIe385 A7 Accelerator           25          74.4                2.976
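SAD is the integer kernel at the heart of the codebook search: for each candidate codebook block, accumulate the absolute pixel differences and keep the minimum. A hedged C sketch for 8×8 blocks (the block size and row-major layout are assumptions):

#include <stdlib.h>

/* Sum of absolute differences between two 8x8 pixel blocks. */
unsigned sad_8x8(const unsigned char *block, const unsigned char *cand,
                 int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += (unsigned)abs(block[y * stride + x] - cand[y * stride + x]);
    return sad;
}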