Can FPGAs compete with GPUs?
John Romein, Bram Veenboer
GPU Technology Conference (GTC’19)
March 18-21, 2019
2
Outline
● explain FPGA
  – hardware
● FPGA vs. GPU
  – programming models (OpenCL)
  – case studies
    ● matrix multiplication
    ● radio-astronomical imaging
  – lessons learned
● answer the question in the title
  – analyze performance & energy efficiency
3
What is a Field-Programmable Gate Array (FPGA)?
● configurable processor
● collection of
  – multipliers / adders
  – registers
  – memories
  – logic
  – transceivers
  – clocks
  – interconnect
4
What is an FPGA program?
● connect elements
  – performs fixed function
● data-flow engine
● Hardware Description Language (HDL)
  – Verilog, VHDL
  – difficult
5
FPGA vs GPU
• FPGA advantages
– high energy efficiency
• no instruction decoding etc.
• use/move as few bits as possible
– configurable I/O
• high bandwidth
• e.g., multiple 100 GbE links
• GPU advantages
– easier to program
– more flexible
– short compilation times
6
New Intel (Altera) FPGA technologies
1) high-level programming language (OpenCL)
2) hard Floating-Point Units
3) tight integration with CPU cores
➔ simple tasks: less programming effort than HDL
➔ allows complex HPC applications
7
FPGA ≠ GPU
• hardware
– GPU: use hardware
– FPGA: create hardware
• execution model
– GPU: instructions
– FPGA: data flow
8
Common language: OpenCL
• OpenCL
– similar to CUDA
– C + explicit parallelism + synchronization + software-managed cache
– offload to GPU/FPGA
• GPU programs not suitable for FPGAs
– FPGA: data-flow engine
9
FPGA: data-flow pipeline example
• create hardware for complex multiply-add
C.real += A.real * B.real;
C.real += -A.imag * B.imag;
C.imag += A.real * B.imag;
C.imag += A.imag * B.real;
• needs four FPUs
• new input enters every cycle
[diagram: four FPUs in a pipeline; Ar, Ai, Br, Bi and the running Cr, Ci enter at the top, the updated Cr, Ci leave at the bottom]
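As a concrete illustration of the multiply-add above, a minimal OpenCL C helper (the float2 packing and the name cmac are illustrative, not taken from the slides):
// Complex multiply-accumulate, c += a * b, with .x = real and .y = imaginary.
// The four multiplications and four additions map onto hard floating-point
// DSP blocks; the resulting pipeline accepts a new (a, b) pair every cycle.
inline float2 cmac(float2 c, float2 a, float2 b)
{
  c.x += a.x * b.x - a.y * b.y;
  c.y += a.x * b.y + a.y * b.x;
  return c;
}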
10
Local memory
• GPU: use memory, FPGA: create memory
float tmp[128] __attribute__((…));
• properties (set via attributes):
– registers or memory blocks
– #banks
– bank width
– bank address selection bits
– #read ports
– #write ports
– single/double pumped
– with/without arbiter
– replication factor
• well-designed program: few resources, stall-free
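For illustration, a hedged sketch of how such memory properties can be requested with the Intel FPGA OpenCL attributes (the particular attribute values below are assumptions, not the configuration used in this work):
// Ask for a banked on-chip memory: 4 banks of 4-byte words,
// two read ports and one write port per bank, single pumped.
float tmp[128] __attribute__((memory,
                              numbanks(4),
                              bankwidth(4),
                              numreadports(2),
                              numwriteports(1),
                              singlepump));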
11
Running multiple kernels
• GPU: consecutively
• FPGA: concurrently
– channels (OpenCL extension)
channel float2 my_channel __attribute__((depth(256)));
– requires less memory bandwidth
– all kernels resident → must fit
[diagram: on the GPU, kernel_1 writes to device memory and kernel_2 reads it back; on the FPGA, kernel_1 streams directly into kernel_2 through a channel]
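A minimal producer/consumer sketch of the channel mechanism (kernel names and the data layout are illustrative; both kernels are resident and run concurrently on the FPGA):
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float2 my_channel __attribute__((depth(256)));    // FIFO between the kernels

__kernel void kernel_1(__global const float2 *restrict in, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    write_channel_intel(my_channel, in[i]);                // stream into the FIFO
}

__kernel void kernel_2(__global float2 *restrict out, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    out[i] = read_channel_intel(my_channel);               // stream out of the FIFO
}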
12
Parallel constructs in OpenCL
• work groups, work items (CUDA: thread blocks, threads)
– GPU: parallel
– FPGA: default: one work item/cycle; not useful
• FPGA:
– compiler auto-parallelizes whole kernel
– #pragma unroll
• create hardware for concurrent loop iterations
– replicate kernels
#pragma unroll
for (int i = 0; i < 3; i ++)
  a[i] = i;
[diagram: the unrolled loop maps to three parallel assignments a[0], a[1], a[2]; the four-FPU complex multiply-add pipeline shown earlier is repeated as an unrolling example]
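A minimal single-work-item sketch of #pragma unroll (kernel name and unroll factor are illustrative; n is assumed to be a multiple of 8). The compiler pipelines the outer loop and instantiates eight multiply-add units for the unrolled inner loop:
__attribute__((max_global_work_dim(0)))
__kernel void dot_product(__global const float *restrict a,
                          __global const float *restrict b,
                          __global float *restrict result,
                          unsigned n)
{
  float sum = 0.0f;

  for (unsigned i = 0; i < n; i += 8) {
    #pragma unroll                       // eight parallel multiply-add units
    for (unsigned j = 0; j < 8; j++)
      sum += a[i + j] * b[i + j];
  }

  *result = sum;
}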
13
More GPU/FPGA differences
• code outside critical loop:
– GPU: not an issue
– FPGA: can use many resources
• Shift registers
– FPGA: single cycle
– GPU: expensive
init_once();
for (int i = 0; i < 100000; i ++)
  do_many_times();

for (int i = 256; i > 0; i --)
  a[i] = a[i - 1];
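A hedged sketch of the shift-register idiom as a simple moving sum (window length and kernel name are assumptions). With the shift loop fully unrolled, the compiler maps the window onto registers and the whole shift takes a single cycle:
#define TAPS 16

__attribute__((max_global_work_dim(0)))
__kernel void moving_sum(__global const float *restrict in,
                         __global float *restrict out,
                         unsigned n)
{
  float window[TAPS] = {0.0f};

  for (unsigned i = 0; i < n; i++) {
    #pragma unroll                       // shift the whole window in one cycle
    for (int j = TAPS - 1; j > 0; j--)
      window[j] = window[j - 1];
    window[0] = in[i];

    float sum = 0.0f;
    #pragma unroll                       // unrolled into an adder tree
    for (int j = 0; j < TAPS; j++)
      sum += window[j];

    out[i] = sum;
  }
}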
14
Resource optimizations
• compiler feedback
– after minutes, not hours
– HTML-based reports
• indispensable tool
15
Frequency optimization
● Compiler (place & route) determines Fmax
  – unlike GPUs
● Full FPGA: longest path limits Fmax
● HDL: fine-grained control
● OpenCL: one clock for full design
● FPGA: 450 MHz; BSP: 400 MHz; 1 DSP: 350 MHz
● Little compiler feedback
● Recompile with random seeds
● Compile kernels in isolation to find frequency limiters
  – tedious, but useful
16
Applications
● dense matrix multiplication
● radio-astronomical imager
● signal processing (FIR filter, FFT, correlations, beam forming)
17
Dense matrix multiplication
• complex<float>
• horizontal & vertical memory accesses (→ coalesce)
• reuse of data (→ cache)
18
Matrix multiplication on FPGA
• hierarchical approach
  – systolic array of Processing Elements (PEs)
  – each PE computes a 32x32 submatrix
[diagram: "read A matrix" and "read B matrix" feed reorder/repeat stages into a two-dimensional grid of PEs; collect stages gather the results and write the C matrix]
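As an illustration only, one step of a processing element could look like the outer-product accumulation below (the formulation and the full unroll are assumptions; the real design streams complex data between many replicated PEs through the reorder and collect stages):
#define TILE 32

// Accumulate one column of A against one row of B into the TILE x TILE
// C submatrix held by this PE. Fully unrolled, every multiply-add fires
// each cycle while new A/B values stream in.
inline void pe_step(float c[TILE][TILE],
                    const float a_col[TILE],
                    const float b_row[TILE])
{
  #pragma unroll
  for (int i = 0; i < TILE; i++)
    #pragma unroll
    for (int j = 0; j < TILE; j++)
      c[i][j] += a_col[i] * b_row[j];
}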
19
Matrix multiplication performance
• Arria 10:
– uses 89% of the DSPs, 40% on-chip memory
– clock (288 MHz) at 64% of peak (450 MHz)
– nearly stall free
Type  Device                   Performance (TFlop/s)  Power (W)  Efficiency (GFlop/W)
FPGA  Intel Arria 10           0.774                  37         20.9
GPU   NVIDIA Titan X (Pascal)  10.1                   263        38.4
GPU   AMD Vega FE              9.73                   265        36.7
20
Image-Domain Gridding for Radio-Astronomical Imaging
● see talk S9306 (Astronomical Imaging on GPUs)
21
Image-Domain Gridding algorithm
#pragma parallel
for s = 1...S :
    complex<float> subgrid[P][N×N];
    for i = 1...N×N :
        float offset = compute_offset(s, i);
        for t = 1...T :
            float index = compute_index(s, i, t);
            for c = 1...C :
                float scale = scales[c];
                float phase = offset - (index × scale);
                complex<float> phasor = {cos(phase), sin(phase)};
                #pragma unroll
                for p = 1...P : // 4 polarizations
                    complex<float> visibility = visibilities[t][c][p];
                    subgrid[p][i] += cmul(phasor, visibility);
    apply_aterm(subgrid);
    apply_taper(subgrid);
    apply_ifft(subgrid);
    store(subgrid);
22
FPGA gridding design
● Create a dataflow network:
  – use all available DSPs
  – every DSP performs a useful computation every cycle
  – no device memory use (except kernel args)
23
FPGA OpenCL gridder kernel

__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NR_GRIDDERS)))
__kernel void gridder()
{
  int gridder = get_compute_id(0);
  float8 subgrid[NR_PIXELS];

  for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel ++) {
    subgrid[pixel] = 0;
  }

  #pragma ivdep
  for (unsigned short vis_major = 0; vis_major < NR_VISIBILITIES; vis_major += UNROLL_FACTOR) {
    float8 visibilities[UNROLL_FACTOR] __attribute__((register));

    for (unsigned short vis_minor = 0; vis_minor < UNROLL_FACTOR; vis_minor++) {
      visibilities[vis_minor] = read_channel_intel(visibilities_channel[gridder]);
    }

    for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel++) {
      float8 pixel_value = subgrid[pixel];
      float8 phasors = read_channel_intel(phasors_channel[gridder]); // { cos(phase), sin(phase) }

      #pragma unroll
      for (unsigned short vis_minor = 0; vis_minor < UNROLL_FACTOR; vis_minor++) {
        pixel_value.even += phasors[vis_minor] * visibilities[vis_minor].even + -phasors[vis_minor] * visibilities[vis_minor].odd;
        pixel_value.odd  += phasors[vis_minor] * visibilities[vis_minor].odd  +  phasors[vis_minor] * visibilities[vis_minor].even;
      }

      subgrid[pixel] = pixel_value;
    }
  }

  for (unsigned short pixel = 0; pixel < NR_PIXELS; pixel ++) {
    write_channel_intel(pixel_channel[gridder], subgrid[pixel]);
  }
}
24
Sine/cosine optimization
● compiler-generated sin/cos: 8 DSPs
● limited-precision lookup table: 1 DSP
  – more DSPs available → more replicas (Φ: 14 → 20)
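A hedged sketch of the lookup-table replacement for {cos(phase), sin(phase)} (table size, the absence of interpolation, and the function name are assumptions). The table is assumed to hold {cos, sin} of 2π·k/LUT_SIZE in on-chip memory; the phase is reduced to fractional turns and the top bits select an entry, so roughly one DSP multiply remains:
#define LUT_BITS 12
#define LUT_SIZE (1 << LUT_BITS)

inline float2 lookup_phasor(const float2 *lut, float phase)
{
  float turns = phase * (1.0f / (2.0f * M_PI_F));   // phase in turns
  float frac  = turns - floor(turns);               // reduced to [0, 1)
  unsigned index = (unsigned) (frac * LUT_SIZE) & (LUT_SIZE - 1);
  return lut[index];                                // ≈ { cos(phase), sin(phase) }
}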
25
FPGA resource usage + Fmax
Design         ALUTs  FFs  RAMs  DSPs        MLABs  Φ   Fmax (MHz)
gridding-ip    43%    31%  64%   1439 (95%)  71%    14  258
degridding-ip  47%    35%  72%   1441 (95%)  78%    14  254
gridding-lu    27%    32%  61%   1498 (99%)  57%    20  256
degridding-lu  33%    38%  73%   1503 (99%)  69%    20  253
● almost all of the 1518 available DSPs are used
● Fmax < 350 MHz
26
Experimental setup
● compare Intel Arria 10 FPGA to comparable CPU and GPU
● CPU and GPU implementations are both optimized
Type  Device                #FPUs  Peak          Bandwidth  TDP    Process
CPU   Intel Xeon E5-2697v3  224    1.39 TFlop/s  68 GB/s    145 W  28 nm (TSMC)
FPGA  Nallatech 385A        1518   1.37 TFlop/s  34 GB/s    75 W   20 nm (TSMC)
GPU   NVIDIA GTX 750 Ti     640    1.39 TFlop/s  88 GB/s    60 W   28 nm (TSMC)
27
Throughput and energy-efficiency comparison
● throughput: number of visibilities processed per second (MVisibilities/s)
● energy efficiency: number of visibilities processed per Joule (MVisibilities/J)
● similar peak performance: what causes the performance differences?
28
Performance analysis
● CPU: sin/cos in software takes 80% of total runtime
● GPU: sin/cos overlapped with other computations
● FPGA: sin/cos overlapped, but cannot use all DSPs for FMAs
29
FPGAs vs GPUs: lessons learned (1/3)
● dataflow vs. imperative
  – rethink your algorithm
  – different program code
● on FPGAs:
  – frequent use of OpenCL extensions
  – think about resource usage, occupancy, timing
  – less use of off-chip memory
● parallelism:
  – both: kernel replication and vectorization
  – FPGA: pipelining and loop unrolling
30
FPGAs vs GPUs: lessons learned (2/3)
● FPGA ≠ GPU, yet: same optimizations …
  – exploit parallelism
  – maximize FPU utilization
    ● hide latency
  – optimize memory performance
    ● hide latency → prefetch
    ● maximize bandwidth → avoid bank conflicts, unit-stride access (coalescing)
    ● reuse data (caching)
● … but achieved in very different ways!
● + architecture-specific optimizations
31
FPGAs vs GPUs: lessons learned (3/3)
● much simpler than HDL
  – FPGAs accessible to wider audience
● long learning curve
  – small performance penalty
● yet not as easy as GPUs
  – distribute resources in complex dataflow
  – optimize for high clock
● tools still maturing
32
Current/future work
• Stratix 10 versus Volta/Turing
– Stratix 10 imager port not trivial ← different optimizations for new routing architecture
• 100 GbE FPGA support
– send/receive UDP packets in OpenCL
– BSP firmware development
• streaming data applications
– signal processing (filtering, correlations, beam forming, ...)
• OpenCL FFT library
33
Conclusions
• complex applications on FPGAs now possible
– high-level language
– hard FPUs
– tight integration with CPUs
• GPUs vs FPGAs
– one language, but no (performance) portability
– very different architectures
– yet many optimizations similar
34
So can FPGAs compete with GPUs?
• depends on application
– compute → GPUs
– energy efficiency → GPUs
– I/O → FPGA
– flexibility → CPU/GPU
– programming ease → CPU, then GPU, then FPGA
FPGA not far behind; CPU way behind
35
Acknowledgements
• This work was funded by
– Netherlands eScience Center (Triple-A 2)
– EU H2020 FETHPC (DEEP-EST, Grant Agreement nr. 754304)
– NWO Netherlands Foundation for Scientific Research (DAS-5)
• Other people
– Suleyman Demirsoy (Intel), Johan Hiddink (NLeSC), Atze v.d. Ploeg (NLeSC), Daniël v.d. Schuur (ASTRON), Merijn Verstraaten (NLeSC), Ben v. Werkhoven (NLeSC)