OpenCL for ALTERA FPGAs
Accelerating performance and design productivity
Liad Weinberger – Appilo
May 1st, 2013
Technology trends
• Over the past years
– Technology scaling favors programmability and parallelism
[Diagram: compute spectrum — from single cores and coarse-grained CPUs/DSPs, through multi-cores and coarse-grained massively parallel processor arrays, to fine-grained massively parallel arrays: CPUs, DSPs, Multi-Cores, Array GPGPUs, FPGAs]
Technology trends
[Chart: process node (nm) shrinking steadily, 2000–2022]
• Moore’s law still in effect
– More FPGA real-estate
• More potential for parallelism – an extremely good thing!
• Designs that utilize this real-estate become harder to
manage and maintain – this is not so good...
Technology trends
[Google Trends chart: worldwide interest over the years, 2007–2013 — searches for Verilog + VHDL]
• Decreased interest
– Number of Google searches for VHDL or
Verilog in decline
Technology trends
[Google Trends chart: interest over the years, 2007–2013 — Verilog + VHDL vs. Python]
• Software development keeps its momentum
– Number of Google searches for Python (as a
representative language) remains strong
FPGA (hardware) development
• Design (programming) is complex
– Define state machines, data-paths, arbitration, IP interfaces, etc.
– Sophisticated iterative compilation process
• Synthesis, technology mapping, clustering, placement and routing, timing closure
• Leads to long compilation times (hours vs. minutes in software)
– Debug process is also very time-consuming
• Code is not portable
– Written in Verilog / VHDL
• Can’t re-target for CPUs, GPUs, DSPs, etc.
• Not scalable
[Diagram: iterative flow — HDL → Set Constraints → Compilation → Timing Closure, and back again]
Software development
• Programming is straight-forward
– Ideas are expressed in languages such as C/C++/Python/etc.
• Typically, start with a simple sequential implementation
• Use parallel APIs / language extensions to exploit multi-core
architectures for additional performance
– Compilation times are usually reasonably short
• Simple straight-forward compilation/linking process
– Immediate feedback when debugging/profiling
• An assortment of tools available for both debugging and profiling
• Portability is still an issue
– Possible, but requires pre-planning
[Diagram: C/C++/Python/etc. source code → Compiler & Linker → executable]
Product development point-of-view
• Product producers want:
– Lower development and maintenance costs
– Competitive edge
• Higher performance
• Short time-in-market, and short time-to-market
– Agile development methods are becoming more and more popular
– Can’t afford long development cycles
– Trained developers with established experience
• Or cost-effective path for training new developers
– Flexibility
• No vendor lock-in is preferred
• Ability to rapidly adapt product to market requirement changes
Our challenge
• How do we bring the FPGA design process closer to the
software development model?
– Need to make FPGAs more accessible to the software development
community
• Change in mind-set: look at FPGAs as massively multi-core devices that
could be used in order to accelerate parallel applications
• A programming model that allows that
• Shorter compilation times and faster feedback for debugging and profiling
the design
An ideal programming environment...
• Based on a standard programming model
– Rather than something which is FPGA-specific
• Abstracts away the underlying details of the hardware
– VHDL / Verilog are similar to “assembly language” programming
– Useful in rare circumstances where the highest possible efficiency is needed
• The price of abstraction is not too high
– Still need to efficiently use the FPGA’s resources to achieve high throughput / low
area
• Allows for software-like compilation & debug cycles
– Faster compile times
– Profiling & user feedback
Introducing OpenCL
Parallel heterogeneous computing
A case for OpenCL
• What is OpenCL?
– An open, royalty-free standard for cross-platform parallel software programming of
heterogeneous systems
• CPU + DSPs
• CPU + GPUs
• CPU + FPGAs
• ... or maybe all together
– Maintained by the Khronos Group
• An industry consortium creating open, royalty-free standards
• Comprised of hardware and software vendors
– Enables software to leverage silicon acceleration
• Consists of two major parts:
– Application Programming Interface (API) for device management
– Device programming language based on C99 with
some restrictions and extensions to support explicit parallelism
Benefits of OpenCL
• Cross-vendor software portability
– Functional portability—Same code would normally execute on
different hardware, by different vendors
– Not performance portable—Code still needs to be optimized for a
specific device (or at least a device class)
• Allows for the management of available computational
resources under a single framework
– Views CPUs, GPUs, FPGAs, and other accelerators as devices that
could carry the computational needs of the application
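To make the single-framework point concrete, here is a minimal sketch (my illustration, not from the talk) of OpenCL 1.x host code that enumerates every platform and device visible to the runtime — CPUs, GPUs, and FPGA boards alike appear through the same API. It assumes at most 4 platforms/devices and omits error checks:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[4];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(4, platforms, &num_platforms);
    for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                          sizeof(pname), pname, NULL);
        cl_device_id devices[4];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 4,
                       devices, &num_devices);
        for (cl_uint d = 0; d < num_devices; ++d) {
            char dname[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(dname), dname, NULL);
            printf("%s: %s\n", pname, dname);  /* one line per device */
        }
    }
    return 0;
}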
OpenCL program structure
• Separation between managerial and computational code bases
– Managerial code executes on a host CPU
• Any type of conventional micro-processor
• Written in any language that has bindings for the OpenCL API
– The API itself is ANSI C
– There is a formal C++ binding
– Other bindings may exist
– Computational code executes on the compute devices (accelerators)
• Written in a language called OpenCL C
– Based on C99
– Adds restrictions and extensions for explicit parallelism
• Can be compiled either offline, or online, depending on implementation
• Will most likely consist only of those portions of the application we want to accelerate
OpenCL program structure
[Diagram: Host connected to a Compute Device; the device holds global memory and several accelerators (compute units), each with its own local memory]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Host Program
Kernel Program
OpenCL host application
• Communicates with the Accelerator Device via a set of
library routines
– Abstracts away host processor to HW accelerator communication via
a set of API calls
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Copy data: Host → FPGA
Ask the FPGA to run a particular kernel
Copy data: FPGA → Host
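For reference, a minimal sketch of the boilerplate around those three calls (OpenCL 1.x, single platform and device, no error checking; kernel_src is a placeholder for the OpenCL C source — with Altera's SDK the program would typically be loaded from a precompiled binary via clCreateProgramWithBinary instead of built from source):

#include <CL/cl.h>

extern const char *kernel_src;   /* hypothetical: OpenCL C source of sum() */

void run_sum(const float *a, const float *b, float *y, size_t n) {
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel program (FPGA flows usually load a binary here). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "sum", NULL);

    /* Device buffers backing the kernel's __global pointers. */
    size_t bytes = n * sizeof(float);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    clEnqueueWriteBuffer(q, da, CL_TRUE, 0, bytes, a, 0, NULL, NULL); /* Host → FPGA */
    clEnqueueWriteBuffer(q, db, CL_TRUE, 0, bytes, b, 0, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dy);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);   /* run sum() */

    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, bytes, y, 0, NULL, NULL);  /* FPGA → Host */
    clFinish(q);
}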
OpenCL kernels
• Data-parallel function
– Executed by many parallel
threads
• Each thread has an identifier
which can be obtained with
a call to the get_global_id()
built-in function
• Uses qualifiers to define
where memory buffers reside
• Executed by a
compute device
– CPU
– GPU
– FPGA
– Other accelerator
float *a = {0, 1, 2, 3, 4, 5, 6, 7}
float *b = {7, 6, 5, 4, 3, 2, 1, 0}
float *y = {7, 7, 7, 7, 7, 7, 7, 7}
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
__kernel void sum( … );
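To show the memory-space qualifiers doing real work, here is a hedged sketch (mine, not from the talk) of a kernel that stages data in __local memory and reduces it within each work-group; it assumes a power-of-two work-group size and one input element per work-item:

__kernel void partial_sums(__global const float *in,
                           __global float *out,
                           __local float *scratch)
{
    int lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];      /* global -> local staging */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* Tree reduction inside the work-group's local memory. */
    for (int s = (int)get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];    /* one partial sum per group */
}

On the host side, a __local argument is sized (not filled) with clSetKernelArg(k, 2, local_bytes, NULL).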
OpenCL on FPGAs
How does it map?
Compiling OpenCL to FPGAs
[Diagram: an OpenCL application (host program + kernels) splits into two flows — the kernels go through the ACL compiler to an FPGA image (SOF), while the host program goes through a standard C compiler to an x86 binary; the two halves communicate over PCIe]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
Kernel Programs Host Program
main() {
read_data( … );
manipulate( … );
clEnqueueWriteBuffer( … );
clEnqueueNDRangeKernel(…,sum,…);
clEnqueueReadBuffer( … );
display_result( … );
}
Compiling OpenCL to FPGAs
[Diagram: the sum kernel instantiated as replicated Load/Load/Store pipelines in the FPGA fabric, attached to PCIe and DDRx memory interfaces]
__kernel void
sum(__global float *a,
__global float *b,
__global float *y)
{
int gid = get_global_id(0);
y[gid] = a[gid] + b[gid];
}
Kernel Programs
Custom Hardware for Your Kernels
FPGA architecture for OpenCL
[Diagram: FPGA architecture for OpenCL — a kernel system of multiple kernel pipelines; a local memory interconnect to on-chip memories; a global memory interconnect to external memory controllers & PHYs (DDR*); and a PCIe link to an x86 / external processor]
Mapping multithreaded kernels to FPGAs
• Simplest way of mapping kernel functions to FPGAs is
to replicate hardware for each thread
– Inefficient and wasteful
• Technique: deep pipeline parallelism
– Attempt to create a deeply pipelined representation of a kernel
– On each clock cycle, we attempt to send in input data for a new
thread
– Method of mapping coarse-grained thread parallelism to fine-grained
FPGA parallelism
Example pipeline for vector add
• On each cycle, the portions of
the pipeline are processing
different threads
• While thread 2 is being loaded,
thread 1 is being added, and
thread 0 is being stored
[Animation over five slides: 8 threads (IDs 0–7) enter the Load/Load → + → Store
pipeline one per cycle; each cycle every thread ID advances one stage, so up to
three threads are in flight at once]
Some examples
Using ALTERA’s OpenCL solution
AES encryption
• Counter (CTR) based encryption/decryption
– 256-bit key
• Advantage FPGA
– Integer arithmetic
– Coarse grain bit operations
– Complex decision making
• Results
Platform                   Throughput (GB/s)
E5503 Xeon Processor       0.01 (single core)
AMD Radeon HD 7970         0.33
PCIe385 A7 Accelerator     5.20
– FPGA only 42% utilized (2 kernels)
• Power conservation
• Fill up for even higher performance
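CTR mode is what makes this workload embarrassingly parallel: each block's keystream depends only on the counter value, so every work-item can process one block independently. A hedged OpenCL C sketch — aes256_encrypt_block() is a hypothetical helper standing in for the actual AES-256 rounds, and a real implementation would format a full 128-bit counter block:

/* Hypothetical kernel: one work-item encrypts one 16-byte block. */
__kernel void aes256_ctr(__global const uchar16 *plaintext,
                         __global uchar16 *ciphertext,
                         __constant uchar *round_keys,
                         ulong base_counter)
{
    size_t gid = get_global_id(0);
    uchar16 keystream = aes256_encrypt_block(base_counter + gid, round_keys);
    ciphertext[gid] = plaintext[gid] ^ keystream;  /* XOR keystream with data */
}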
Multi-asset barrier option pricing
• Monte Carlo simulation
– Heston model
– NDRange
• Assets × paths (64 × 1,000,000)
• Advantage FPGA
– Complex control flow
• Heston model dynamics:
dS_t = \mu S_t \, dt + \sqrt{\nu_t} \, S_t \, dW_t^S
d\nu_t = \kappa(\theta - \nu_t) \, dt + \xi \sqrt{\nu_t} \, dW_t^{\nu}
• Results
Platform                 Power (W)   Performance (Msims/s)   Msims/W
W3690 Xeon Processor     130         32                      0.25
nVidia Tesla C2075       225         63                      0.28
PCIe385 D5 Accelerator   23          170                     7.40
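A hedged C sketch of what each simulation path iterates — one Euler–Maruyama step of the two equations above, with full truncation to keep the variance non-negative. Parameter names mirror the equations; z_s and z_v are correlated standard normals supplied by the caller:

#include <math.h>

/* One Euler–Maruyama step of the Heston model (illustrative sketch). */
void heston_step(double *S, double *v,
                 double mu, double kappa, double theta, double xi,
                 double dt, double z_s, double z_v)
{
    double v_pos = fmax(*v, 0.0);    /* full truncation scheme */
    double sq_v  = sqrt(v_pos);
    double sq_dt = sqrt(dt);         /* dW ~ sqrt(dt) * N(0,1) */
    *S += mu * (*S) * dt + sq_v * (*S) * sq_dt * z_s;
    *v += kappa * (theta - v_pos) * dt + xi * sq_v * sq_dt * z_v;
}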
Document filtering
• Unstructured data analytics
– Bloom Filter
• Advantage FPGA
– Integer arithmetic
– Flexible memory configuration
• Results
Platform                         Power (W)   Performance (MTs)   MTs/W
W3690 Xeon Processor             130         2070                15.92
nVidia Tesla C2075               215         3240                15.07
DE4 Stratix IV-530 Accelerator   21          1755                83.57
PCIe385 A7 Accelerator           25          3602                144.08
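For context, a Bloom filter answers "possibly in set / definitely not in set" with a bit array and k hash functions — mostly integer arithmetic and bit operations, which is exactly where the FPGA has the advantage. A minimal C sketch (the hash is a simplistic placeholder; real filters use k independent hashes and a tuned filter size):

#include <stdint.h>

#define BITS (1u << 20)              /* 1 Mbit filter (illustrative size) */
static uint8_t filter[BITS / 8];

/* Seeded FNV-1a hash over a term, reduced to a bit index. */
static uint32_t hash(const char *s, uint32_t seed) {
    uint32_t h = 2166136261u ^ seed;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h % BITS;
}

void bloom_add(const char *term) {
    for (uint32_t k = 0; k < 4; ++k) {       /* k = 4 hash functions */
        uint32_t bit = hash(term, k);
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

int bloom_maybe_contains(const char *term) { /* 0 = definitely absent */
    for (uint32_t k = 0; k < 4; ++k) {
        uint32_t bit = hash(term, k);
        if (!(filter[bit / 8] & (1u << (bit % 8)))) return 0;
    }
    return 1;                                /* possibly present */
}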
Fractal video compression
• Best matching codebook entry
– Correlation with SAD (sum of absolute differences)
• Advantage FPGA
– Integer arithmetic
• Results
Platform                         Power (W)   Performance (FPS)   FPS/W
W3690 Xeon Processor             130         4.6                 0.035
nVidia Tesla C2075               215         53.1                0.247
DE4 Stratix IV-530 Accelerator   21          70.9                3.376
PCIe385 A7 Accelerator           25          74.4                2.976
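SAD is the integer kernel at the heart of the codebook search: for each candidate codebook block, accumulate the absolute pixel differences and keep the minimum. A hedged C sketch for 8×8 blocks (the block size and row-major layout are assumptions):

#include <stdlib.h>

/* Sum of absolute differences between two 8x8 pixel blocks. */
unsigned sad_8x8(const unsigned char *block, const unsigned char *cand,
                 int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += (unsigned)abs(block[y * stride + x] - cand[y * stride + x]);
    return sad;
}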