Making the most out of
Heterogeneous Chips with CPU,
GPU and FPGA
Rafael Asenjo
Dept. of Computer Architecture
University of Malaga, Spain.
2
Xilinx Zynq UltraScale+ MPSoC
3
4x ARM Cortex-A53 CPUs
ARM Mali-400 GPU
2x ARM Cortex-R5 CPUs
FPGA fabric
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
– Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
– parallel_for
– parallel pipeline
4
Motivation
• A new mantra: Power and Energy saving
• In all domains
5
Motivation
• GPUs came to rescue:
–Massive Data Parallel Code at
a low price in terms of power
–Supercomputers and servers:
NVIDIA
• GREEN500 Top 5:
– June 2017
• Top500.org (Nov 2017):
–87 systems w. NVIDIA
–12 systems w. Xeon Phi
–5 systems w. PEZY
6
(GREEN500 June 2017: all five Top-5 systems pair Xeon CPUs with NVIDIA Tesla P100 GPUs.)
7
Green 500 Nov 2017
(Four of the Top-5 systems pair Xeon CPUs with PEZY accelerators.)
Motivation
• There is (parallel) life beyond supercomputers:
8
Motivation
• Plenty of GPUs elsewhere:
– Integrated GPUs on more than 90% of shipped processors
9
Motivation
• Plenty of GPUs on desktops and laptops:
– Desktops (35 – 130 W) and laptops (15 – 60 W):
10
Intel Kaby Lake | AMD APU Kaveri
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/
http://www.anandtech.com/show/10610/intel-announces-7th-gen-kaby-lake-14nm-plus-six-notebook-skus-desktop-coming-in-january/2
Motivation
• Plenty of integrated GPUs in mobile devices too.
11
Huawei HiSilicon Kirin 960 (2 - 6 W)
http://www.anandtech.com/show/10766/huawei-announces-hisilicon-kirin-960-a73-g71
Huawei Mate 9 (ARM Mali GPU)
Motivation
• And user-programmable DSPs as well:
12
Qualcomm Snapdragon 820/821 (2 - 6 W)
https://www.qualcomm.com/products/snapdragon/processors/820
Samsung Galaxy S7, Google Pixel
• And FPGAs too…
– ARM + iFPGA
• Altera Cyclone V
• Xilinx Zynq-7000
– Intel + Altera’s FPGA
• Heterogeneous Architecture Research Platform (HARP)
– CPU + iGPU + iFPGA: Xilinx Zynq UltraScale+
• 4x Cortex-A53
• 2x Cortex-R5
• GPU Mali 400
• FPGA
Motivation
13 http://www.analyticsengines.com/developer-blog/xilinx-announce-new-zynq-architecture/
Motivation
• Overwhelming heterogeneity:
14
Motivation
• Plenty of room for improvements
– Want to make the most out of the CPU, the GPU and the FPGA
– Taking productivity into account:
Productivity = Performance - Pain
• High level abstractions in order to hide HW details
• Lack of high-level productive programming models
• “Homogeneous programming of heterogeneous architectures”
– Taking energy consumption into account:
Performance ➔ Performance / Watt
Performance ➔ Performance / Joule
Productivity = Performance / Joule - Pain
15
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
– Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
– parallel_for
– parallel pipeline
16
Hardware: iGPUs
17
Intel Skylake (Iris Pro Graphics)
AMD Kaveri (Graphics Core Next)
HiSilicon Kirin 960 (ARM Mali-G71)
Intel Skylake – Kaby Lake
18
• Modular design
– 2 or 4 cores
– GPU
• GT-1: 12 EU
• GT-2: 24 EU
• GT-3: 48 EU
• GT-4: 72 EU
• SKU options
– 2+2 (4.5W, 15W)
– 4+2 (35W -- 91W)
– 2+3 (15W, 28W,...)
– 4+4 (65W)
– ....
http://www.anandtech.com/show/6355/intels-haswell-architecture
Intel Skylake – Kaby Lake
19
https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf
Intel Processor Graphics
20
(Block diagrams of the GT2, GT3 and GT4 slice configurations.)
Intel Processor Graphics
21
• Peak performance:
– 72 EUs
– 2 x 4-wide SIMD units per EU
– 2 flop/ck (fmadd)
– 1 GHz
– 7 in-flight wavefronts per EU (latency hiding)
➔ 1.15 TFLOPS
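As a sanity check, the quoted peak follows directly from those figures:
72 EUs x (2 x 4) SIMD lanes x 2 flop/ck (fmadd) x 1 GHz = 1152 GFLOPS ≈ 1.15 TFLOPS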
Intel Processor Graphics
23
• OpenCL-to-CUDA mapping: NDRange ≈ grid; work-group ≈ block
• EU-thread (SIMD16) ≈ wavefront ≈ warp: 16 work-items ≈ 16 threads
• Each SIMD16 wavefront takes 2 cks
• IMT: fine-grained multi-threading
– Latency hiding mechanism
Intel Virt. Technology for Directed I/O (Intel VT-d)
24
(Diagram: cache-coherent path between the GPU and the CPU.)
AMD Kaveri
• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.
25
http://wccftech.com/
AMD Graphics Core Next (GCN)
• In Kaveri,
– 8 Compute Units (CU)
– Each CU: 4 x 16-wide SIMD
– Total: 512 FPUs
– 866 MHz
• Max GFLOPS = 0.86 GHz x 512 FPUs x 2 (fmad) ≈ 880 GFLOPS
26
OpenCL execution on GCN
Work-group → wavefronts (64 work-items) → pools
27
(Diagram: a work-group’s wavefronts are distributed round-robin over the 4 SIMD units of a CU: SIMD0-0, SIMD1-0, SIMD2-0, SIMD3-0, SIMD0-1, SIMD1-1, …)
• 4 pools: 4 wavefronts in flight per SIMD
• 4 cks to execute each wavefront
HiSilicon Kirin 960
• ARM64 big.LITTLE (Cortex A73+Cortex A53) + Mali G71
28
Mali G71
• Supports OpenCL 2.0
• 4, 6, 8, 16, 32 Cores
• Each core:
– 3 x 4-wide SIMD units
• 32 cores x 12 lanes x 2 (fmad) x 0.85 GHz = 652 GFLOPS
29
HSA (Heterogeneous System Architecture)
• HSA Foundation’s goal: Productivity on heterogeneous HW
– CPU, GPU, DSPs,...
• Founders: (member-company logos)
30
Advantages of iGPUs
• Discrete and integrated GPUs: different goals
– NVIDIA Pascal: 3584 CUDA cores, 250W, 10 TFLOPS
– Intel Iris Pro Graphics 580: 72EU x 8 SIMD, ~15W, 1.15 TFLOPS
– Mali-G71: 32 EU x 3 x 4 SIMD, ~ 2W, 0.652 TFLOPS
• CPU and GPU are both first-class citizens.
– SVM and platform atomics allow for closer collaboration
• Avoids PCI data transfers and the associated POWER dissipation
– The Operating System doesn’t get in the way → less overhead
• CPU and GPU may have similar performance
– It’s more likely that they can collaborate
• Cheaper!
31
Hardware: iFPGAs
32
Altera Cyclone V: 2x Cortex-A9 + FPGA
Xilinx Zynq UltraScale+: 4x A53 + GPU + 2x R5 + FPGA
Intel Xeon + Arria 10: QPI link between Xeon and FPGA
FPGA
FPGA Architecture
33
Courtesy: Javier Navaridas, U. Manchester.
CLB: Configurable Logic Block
34
(Diagram: a CLB contains slices, a left-hand SLICEM and a right-hand SLICEL, each with LUTs and flip-flops; slices connect through a switch matrix and fast connects, with CIN/COUT carry chains and SHIFTIN/SHIFTOUT shift chains.)
Courtesy: Jose Luis Nunez-Yanez, U. Bristol.
FPGA Architecture
35
How do we “mold this liquid silicon”?
• Hardware overlays
– Map a CPU or GPU onto the FPGA
✓ 100s of simple CPUs can fit on large FPGAs
✓ The CPU or GPU overlay can change/adapt
✗ Resulting overlay is not as efficient as a “real” one
✗ Same drawbacks as “general purpose” processors
36
FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
How do we “mold this liquid silicon”?
✓ No FETCH nor DECODE of instructions → already hardwired
✓ Reduced data movement (Exec. Units ⇔ MEM) → power savings
✓ Not constrained by a fixed ISA:
✓ Bit-level parallelism; shift, bit mask, ...; variable-precision arithmetic
✗ Lower frequency (hundreds of MHz)
✗ In-order execution (David Kaeli’s group has proposed OoO: “Hardware thread reordering to boost OpenCL throughput on FPGAs”, ICCD’16)
✗ Programming effort
37
FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
Xilinx Zynq UltraScale+ MPSoC
• Available:
– SW coherence (cache flush) → Slow
– HW coherence (MESI coherence protocol) → Fast. Ports:
• Accelerator Coherency Port, ACP
• Cache-coherent interconnect (CCI) ACE-Lite ports
– Part of ARM® AMBA® AXI and ACE protocol specification
38
Intel Broadwell+Arria10
• Cache Coherent Interconnect (CCI)
– New IP inside the FPGA: Intel Quick Path Interconnect with CCI
– AFU (Accelerated Function Unit) can access cached data
39
Towards CCI everywhere
• CCIX consortium (http://www.ccixconsortium.com)
– New chip-to-chip interconnect operating at 25Gbps
– Protocol built for FPGAs, GPUs, network/storage adapters, ...
– Already in Xilinx Virtex UltraScale+
40
Which architecture suits me best?
41
• All of them are Turing complete, but
• Depending on the problem:
– CPU:
• For control-dominant problems
• Frequent context switching
– GPU:
• Data-parallel problems (vectorized/SIMD)
• Little control flow and little synchronization
– FPGA:
• Stream processing problems
• Large data sets processed in a pipelined fashion
• Low power constraints
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
– Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
– parallel_for
– parallel pipeline
42
http://imgs.xkcd.com/comics/standards.png
43
HOW STANDARDS PROLIFERATE
(see: A/C chargers, character encodings, instant messaging, etc.)
Programming models for heterogeneous: GPUs
• Targeted at exploiting one device at a time
– CUDA (NVIDIA)
– OpenCL (Khronos Group Standard)
– OpenACC (C, C++ or Fortran + directives → OpenMP 4.0)
– C++AMP (Microsoft’s extension of C++; HSA recently announced its own version)
– SYCL (Khronos Group’s specification for “single-source” C++ programming)
– ROCm (Radeon Open Compute: HCC, HIP, for AMD GCN –HSA–)
– RenderScript (Google’s Java API for Android)
– ParallDroid (Java + Directives from ULL, Spain)
– Many more (Numba Python, IBM Java, Matlab, R, JavaScript, …)
• Targeted at exploiting both devices simultaneously (discrete GPUs)
– Qilin (C++ and Qilin API compiled to TBB+CUDA)
– OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler)
– XKaapi
– StarPU
• Targeted at exploiting both devices simultaneously (integrated GPUs)
– Qualcomm Symphony
– Intel Concord
44
Programming models for heterogeneous: GPUs
45
• Targeted at exploiting ONE device at a time: a parallel_for runs either on the CPU or on the GPU
CUDA, OpenCL, OpenMP 4.0, C++AMP, SYCL, ROCm, RenderScript, ParallDroid, Numba Python, JavaScript, Matlab, R
• Targeted at exploiting SEVERAL devices at the same time: compiler + runtime split a parallel_for across CPU and GPU
Qilin, OmpSs, XKaapi, StarPU, Qualcomm Symphony, Concord
Programming models for heterogeneous: FPGAs
• HDL-like languages
– SystemVerilog, Bluespec System Verilog
– Extend RTL languages by adding new features
• Parallel libraries
– CUDA, OpenCL
– Instrument existing parallel libraries to generate RTL code
• C-based languages
– SDSoC, LegUp, ROCCC, Impulse C, Vivado, Calypto Catapult C
– Compile* C into intermediate representation and from there to HDL
– Functions become Finite State Machines and variables mem. blocks
• Higher level languages
– Kiwi (C#), Lime (Java), Chisel (Scala)
– Translate* input language into HDL
46
* Normally a subset thereof
Courtesy: Javier Navaridas, U. Manchester.
OpenCL 2.0
• OpenCL 2.0 supports:
– Coherent SVM (Shared Virtual Memory) & Platform atomics
– Example: allocating an array of atomics:
47
clContext = ...
clCommandQueue = ...
clKernel = ...
cl_svm_mem_flags flags = CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS;
atomic_int *data;
data = (atomic_int *) clSVMAlloc(clContext, flags, NUM * sizeof(atomic_int), 0);
clSetKernelArgSVMPointer(clKernel, 0, data);
clSetKernelArg(clKernel, 1, sizeof(int), &NUM);
clEnqueueNDRangeKernel(clCommandQueue, clKernel, ....);
atomic_store_explicit((atomic_int*)&data[0], 99, memory_order_release);
CPU Host Code
kernel void myKernel(global int *data, int NUM) {
  int i = atomic_load_explicit((global atomic_int *)&data[0], memory_order_acquire);
  ....
}
GPU Kernel Code
C++AMP
• C++ Accelerated Massive Parallelism
• Example: Vector Add (C[i]=A[i]+B[i])
48
(Pictured code, annotated: A, B and C buffers; initialize A and B; run on GPU.)
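The pictured source is not in the text; a minimal C++AMP vector add looks like the following sketch (names and sizes are illustrative):

#include <amp.h>
#include <vector>
using namespace concurrency;

int main() {
  const int N = 1024;
  std::vector<float> A(N, 1.0f), B(N, 2.0f), C(N);   // A, B and C buffers

  array_view<const float, 1> a(N, A), b(N, B);       // wrap host data
  array_view<float, 1> c(N, C);
  c.discard_data();                                  // no need to copy C in

  parallel_for_each(c.extent, [=](index<1> i) restrict(amp) {
    c[i] = a[i] + b[i];                              // runs on the GPU
  });
  c.synchronize();                                   // copy results back
}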
SYCL Parallel STL
• Parallel STL aimed at C++17 standard:
– std::sort(vec.begin(), vec.end()); //Vector sort
– std::sort(seq, vec.begin(), vec.end()); // Explicit SEQUENTIAL
– std::sort(par, vec.begin(), vec.end()); // Explicit PARALLEL
• SYCL exposes the execution policy (user-defined)
50
std::vector<int> v(N);
float ratio = 0.5;
sycl::het_sycl_policy<class ForEach> policy(ratio);  // 50% of the iterations on CPU,
                                                     // 50% of the iterations on GPU
std::for_each(policy, v.begin(), v.end(),
              [=](int& val) { val = val * val; });
Courtesy: Antonio Vilches
https://github.com/KhronosGroup/SyclParallelSTL
Qualcomm Symphony
• Point Kernel: write once, run everywhere
– Pattern tuning extends to heterogeneous load hints
51
SYMPHONY_POINT_KERNEL_3(vadd, int, i, float*, a, int, na,
                        float*, b, int, nb, float*, c, int, nc,
                        { c[i] = a[i] + b[i]; });
...
symphony::buffer_ptr buf_a(1024), buf_b(1024), buf_c(1024);
...
symphony::range<1> r(1024);
symphony::pfor_each(r, vadd_pk, buf_a, buf_b, buf_c,
                    symphony::pattern::tuner().set_cpu_load(10)
                                              .set_gpu_load(50)
                                              .set_dsp_load(40));
Executed across CPU (10%) + GPU (50%) + DSP (40%)
A kernel for each device is automatically generated
Intel Concord
• C++ heterogeneous programming framework
• Papers:
– Rashid Kaleem et al. Adaptive heterogeneous scheduling on
integrated GPUs. PACT 2014.
• Intel Heterogeneous Research Compiler (iHRC) at
https://github.com/IntelLabs/iHRC/
– Naila Farooqui et al. Affinity-aware work-stealing for integrated
CPU-GPU processors. PPoPP ’16. PhD dissertation:
https://smartech.gatech.edu/bitstream/handle/1853/54915/FAROOQUI-DISSERTATION-2016.pdf
52
class Foo {
  int *a, *b, *c;
public:
  void operator()(int i) const { c[i] = a[i] + b[i]; }
};
Foo *f = new Foo();
parallel_for(0, N, *f);  // executed across CPU+GPU, dynamic partition
Xilinx SDSoC
• Without SDSoC:
53
FPGA
CPU
main(){
init(A,B,C);
mmult(A,B,D);
madd(C,D,E);
consume(E);
}
Courtesy: Xilinx
Xilinx SDSoC
• With SDSoC:
– Eclipse-based SDK
54
FPGA
CPU
Refine code:
• Re-factoring code to be more HLS-friendly
• Adding code annotations (i.e., pragmas):
#pragma SDS data copy(Array)
#pragma SDS zero_copy(Array)
#pragma SDS data access_pattern
#pragma SDS data sys_port (...)
#pragma HLS PIPELINE II = 1
#pragma HLS unroll
#pragma HLS inline
Altera OpenCL
55
main() {
read_data( ... );
manipulate( ... );
clEnqueueWriteBuffer( ... );
clEnqueueNDRangeKernel(...,sum,...);
clEnqueueReadBuffer( ... );
display_result( ... );
}
__kernel void
sum(__global float *a,
    __global float *b,
    __global float *c)
{
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}
Host Code → standard C compiler → .EXE/.OUT executable (runs on the CPU)
Kernel Code → OpenCL aoc compiler → Verilog → .AOCX bitstream (configures the FPGA)
Altera OpenCL optimizations
• Locality:
– Shift-Register Pattern (SRP)
– Local Memory (LM)
– Constant Memory (CM)
– Load-Store Units Buffers (LSU)
• Loop Unrolling (LU)
– Deeper pipelines
– Optimized tree-like reductions
• Kernel vectorization (SIMD)
– Wider ALUs
• Compute Unit Replication (CUR)
– Replicate Pipelines
– Downsides: memory divergence ⬆, complexity ⬆, frequency ⬇
56
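As a rough illustration of the last two knobs, the Altera (now Intel) FPGA OpenCL SDK exposes them as kernel attributes; loop unrolling is simply #pragma unroll inside the kernel. The kernels below are toy examples: the attribute names are real, the values arbitrary.

// Kernel vectorization (SIMD): 4 work-items execute in lockstep on a
// 4x wider datapath; num_simd_work_items requires a fixed work-group size.
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void vadd_simd(__global const float *a, __global const float *b,
                        __global float *c) {
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}

// Compute Unit Replication (CUR): the whole pipeline is instantiated twice,
// and work-groups are distributed among the two copies.
__attribute__((num_compute_units(2)))
__kernel void vadd_cur(__global const float *a, __global const float *b,
                       __global float *c) {
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}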
Shift Register Pattern
• Example: 1D stencil computation
57
(Diagrams: an NDRange kernel without SRP vs. a single-task kernel with a shift register, shft_reg; the filter taps are multiplied and summed as the window shifts.)
• SingleTask + SRP: 64x faster than the naive version (17-element filter)
“Tuning Stencil Codes in OpenCL for FPGAs”, ICCD’16
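A minimal sketch of such a single-task SRP kernel, assuming a 17-tap filter (identifier names and loop structure are illustrative, not the exact ICCD’16 code):

#define TAPS 17
__kernel void stencil_1d(__global const float * restrict in,
                         __global float * restrict out,
                         __constant float * restrict coef,
                         const int n) {
  float shift_reg[TAPS];                  // inferred as a shift register
  for (int j = 0; j < TAPS; ++j)          // prime the register
    shift_reg[j] = in[j];
  for (int i = 0; i <= n - TAPS; ++i) {
    float acc = 0.0f;
    #pragma unroll                        // fully unrolled dot product
    for (int j = 0; j < TAPS; ++j)
      acc += shift_reg[j] * coef[j];
    out[i] = acc;
    #pragma unroll                        // shift all taps in one cycle
    for (int j = 0; j < TAPS - 1; ++j)
      shift_reg[j] = shift_reg[j + 1];
    if (i + TAPS < n)
      shift_reg[TAPS - 1] = in[i + TAPS]; // feed the next input element
  }
}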
Agenda
• Motivation
• Hardware: iGPUs and iFPGAs
• Software
– Programming models for chips with iGPUs and iFPGAs
• Our heterogeneous templates based on TBB
– parallel_for
– parallel pipeline
58
Our work: parallel_for
• Heterogeneous parallel for based on TBB
– Dynamic scheduling / Adaptive partitioning for irregular codes
– CPU and GPU chunk size re-computed every time
59
(Gantt chart: Barnes-Hut on an Exynos 5; 4x A15 + 4x A7 cores plus GPU; one A7 feeds the GPU; bar widths show the time to execute each chunk of iterations on the GPU.)
New scheduler and API
60
New API
61
(Code screenshot, annotated: use the GPU; exploit the zero-copy buffer.)
Scheduler class
62
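The API itself appears only as code screenshots in the deck; the sketch below is a hypothetical reconstruction of the idea, with all names illustrative: a shared counter hands out chunks, one host thread feeds the GPU, and TBB worker threads take CPU chunks.

#include <algorithm>
#include <atomic>
#include <thread>
#include <tbb/parallel_for.h>

// Hypothetical sketch, not the actual template API from the talk.
std::atomic<int> next_iter{0};
constexpr int END = 100000;

void gpu_process(int b, int e) { /* enqueue OpenCL kernel over [b, e) on a zero-copy buffer */ }
void cpu_process(int i)        { /* loop body for iteration i */ }

int main() {
  std::thread gpu_feeder([] {
    int gpu_chunk = 320;              // re-fitted after each chunk in the real scheduler
    int b;
    while ((b = next_iter.fetch_add(gpu_chunk)) < END)
      gpu_process(b, std::min(b + gpu_chunk, END));
  });
  const int nworkers = 4, cpu_chunk = 64;
  tbb::parallel_for(0, nworkers, [=](int) {
    int b;
    while ((b = next_iter.fetch_add(cpu_chunk)) < END)
      for (int i = b, e = std::min(b + cpu_chunk, END); i < e; ++i)
        cpu_process(i);
  });
  gpu_feeder.join();
}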
Related work: Qilin
63
• Static approach: Bulk-Oracle, offline search (11 executions)
– Work partitioning between CPU and GPU
– One big chunk with 70% of the iterations: on the GPU
– The remaining 30% processed on the cores
(Plot: “Barnes-Hut: offline search for static partition”; execution time in seconds vs. percentage of the iteration space offloaded to the GPU, 0% to 100%.)
Related work: Intel Concord
64
• Papers:
– Rashid Kaleem et al.
Adaptive heterogeneous
scheduling on integrated
GPUs. PACT 2014.
– Naila Farooqui et al.
Affinity-aware work-stealing
for integrated CPU-GPU
processors. PPoPP ’16
(poster).
(Flowchart: a chunk size is assigned just for the GPU; the GPU computes its chunk while the cores compute on the CPU; when the GPU is done, the rest of the iteration space is partitioned according to the relative speeds.)
Related Work: HDSS
65
• Belviranli et al, A Dynamic Self-Scheduling Scheme for
Heterogeneous Multiprocessor Architectures, TACO 2013
(Figure from the paper: block-size assignments over time for a two-threaded run, 1 GPU and 1 CPU thread, on a discrete Fermi GPU; an adaptive phase determines and stabilizes the computational weights, then a stable phase dispatches larger blocks.)
Debunking big chunks
• Example: irregular Barnes-Hut benchmark
66
Debunking big chunks
• A larger chunk causes more memory traffic
67
(Figure: array of bodies; the updated chunk is written, while reads touch bodies across the whole array.)
Debunking big chunks
• BarnesHut: GPU throughput along the iteration space
– For two different time-steps
– Different GPU chunk-sizes
68
(Plots: “Barnes-Hut: Throughput variation”, time steps 0 and 5; GPU throughput along the iteration space for chunk sizes 320, 640, 1280 and 2560; the best chunk size varies along the iteration space and across time steps.)
LogFit main idea
70
(Gantt chart on an Intel Core i7 Haswell, 4 cores + GPU sharing the L3: an exploration phase fits the GPU chunk size, followed by a stable phase; bar widths show the time to execute each chunk of iterations on the GPU.)
Parallel_for: Performance per joule
• Static: Oracle-Like static partition of the work based on profiling
• Concord: Intel’s approach; GPU chunk size provided by the user once
• HDSS: Belviranli et al.’s approach; GPU chunk size computed once
• LogFit: our work
72
(Bars: GPU alone and GPU + 1/2/3/4 cores, on an Intel Haswell Core i7-4770.)
Antonio Vilches et al., “Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips”, Procedia Computer Science, 2015.
LogFit: better for irregular benchmarks
73
LogFit working on FPGA
• Energy probe: AccelPower Cap (Luis Piñuel, U. Complut.)
74
(Photo: Terasic DE1-SoC board, Altera Cyclone V, with an AccelPower Cap + BeagleBone energy probe measuring power through a shunt resistor.)
LogFit working on FPGA
75
• Terasic DE1: Altera Cyclone V (FPGA + 2x Cortex-A9)
• Energy probe: AccelPower Cap (Luis Piñuel, U. Complutense)
(Plots: LogFit vs. Static, 2 cores + FPGA.)
Our work: pipeline
• ViVid, an object detection application
• Contains three main kernels that form a pipeline
• Would like to answer the following questions:
– Granularity: coarse or fine grained parallelism?
– Mapping of stages: where do we run them (CPU/GPU)?
– Number of cores: how many of them when running on CPU?
– Optimum: what metric do we optimize (time, energy, both)?
76
(Pipeline diagram: input frame → Stage 1, Filter → index mtx. + response mtx. → Stage 2, Histogram → histograms → Stage 3, Classifier → detection response.)
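For reference, the three ViVid stages map naturally onto TBB’s parallel_pipeline. Below is a minimal CPU-only skeleton, with stub stage bodies (the heterogeneous templates additionally decide, per item, whether each stage runs on the CPU or on the GPU):

#include <tbb/pipeline.h>

struct Frame { /* input image + intermediate buffers */ };

static Frame* read_frame() { static int n = 100; return n-- > 0 ? new Frame : nullptr; }
static void filter(Frame*)     { /* Stage 1: Filter kernel (stub) */ }
static void histogram(Frame*)  { /* Stage 2: Histogram kernel (stub) */ }
static void classifier(Frame*) { /* Stage 3: Classifier kernel (stub) */ }

int main() {
  const size_t ntokens = 8;  // max frames in flight
  tbb::parallel_pipeline(ntokens,
    tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,
      [](tbb::flow_control& fc) -> Frame* {
        Frame* f = read_frame();       // input stage
        if (!f) fc.stop();
        return f;
      }) &
    tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
      [](Frame* f) { filter(f); return f; }) &
    tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
      [](Frame* f) { histogram(f); return f; }) &
    tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
      [](Frame* f) { classifier(f); return f; }) &
    tbb::make_filter<Frame*, void>(tbb::filter::serial_in_order,
      [](Frame* f) { /* output stage: emit detection response */ delete f; }));
  return 0;
}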
Granularity
• Coarse grain:
– CG
• Medium grain:
– MG
• Fine grain:
– Fine grain also in the CPU via AVX intrinsics
77
(Diagrams: CG, each item flows through all three stages on a single core; MG, each stage is internally parallelized over several cores; FG, each stage processes its item on the GPU, or on the AVX units when running on the CPU.)
Our work: pipeline
78
(Diagrams: each of the three stages can be mapped to the CPU or to the GPU; a 3-bit code labels each mapping, 000 through 111, giving 8 alternatives for the ViVid pipeline.)
Accounting for all alternatives
• In general: nC CPU cores, 1 GPU and p pipeline stages
# alternatives = 2^p x (nC + 2)
• For Rodinia’s SRAD benchmark (p = 6, nC = 4) → 384 alternatives
79
(Diagram: a p-stage pipeline where every stage can run on the GPU or on the nC CPU cores.)
“Mapping streaming applications on commodity multi-CPU and GPU on-chip processors”, IEEE Trans. on Parallel and Distributed Systems, 2016.
Framework and Model
• Key idea:
1. Run only on GPU
2. Run only on CPU
3. Analytically extrapolate for heterogeneous execution
4. Find out the best configuration → RUN
80
(Diagram: the DP-MG configuration used to collect λ and E, the homogeneous throughput and energy values per stage.)
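The exact model is in the TPDS 2016 paper; as a loose sketch of the idea (hypothetical names and a simplified bottleneck model, not the paper’s equations), the search enumerates the stage-to-device mappings and keeps the best predicted one:

#include <algorithm>
#include <cmath>
#include <vector>

// lambda = items/s and E = Joules/item, profiled per stage on CPU and on
// GPU during the two homogeneous runs (steps 1 and 2 above).
struct StageProfile { double lambda_cpu, lambda_gpu, E_cpu, E_gpu; };

// A pipeline's throughput is bounded by its slowest stage.
double predicted_throughput(const std::vector<StageProfile>& st, unsigned map) {
  double lambda = INFINITY;
  for (size_t s = 0; s < st.size(); ++s) {
    bool on_gpu = (map >> s) & 1u;                  // bit s: stage s on GPU
    lambda = std::min(lambda, on_gpu ? st[s].lambda_gpu : st[s].lambda_cpu);
  }
  return lambda;
}

// Enumerate the 2^p mappings and keep the best predicted one; the extra
// (nC + 2) factor, the number of CPU threads, is omitted here.
unsigned best_mapping(const std::vector<StageProfile>& st) {
  unsigned best = 0;
  double best_lambda = 0.0;
  for (unsigned m = 0; m < (1u << st.size()); ++m) {
    double l = predicted_throughput(st, m);
    if (l > best_lambda) { best_lambda = l; best = m; }
  }
  return best;  // step 4: RUN with this configuration
}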
Environmental Setup: Benchmarks
• Four Benchmarks
– ViVid (Low and High Definition inputs)
– SRAD
– Tracking
– Scene Recognition
81
– ViVid: Input → Filter → Histogram → Classifier → Output (3 stages)
– SRAD: Input → Extraction → Prep. → Reduction → Statistics → Comp. 1 → Comp. 2 → Output (6 stages)
– Tracking: Input → Tracking → Output (1 stage)
– Scene Recognition: Input → Feature → SVM → Output (2 stages)
Does Higher Throughput imply Lower Energy?
82
(Plots on Haswell, HD input: throughput and energy vs. Num-Threads (CG).)
SRAD on Haswell
84
(Plots: SRAD throughput and energy vs. Num-Threads (CG).)
(Diagram: DP-CG mapping for the 6-stage SRAD pipeline; every stage can be served by the CPU cores or by the GPU.)
Results on big.LITTLE architecture
85
• On Odroid XU3: 4x Cortex-A15 + 4x Cortex-A7 + GPU Mali
(Bars compare configurations: 1 thr (GPU); 1 thr (A15); 2 thr (2x A15); 4 thr (4x A15); 5 thr (4x A15 + 1x A7); 9 thr (4x A15 + 4x A7); 5 thr (GPU + 4x A15); 9 thr (GPU + 4x A15 + 4x A7).)
Pipeline on CPU-FPGAs
• Replace the GPU with an FPGA
– Core i7 4770k 3.5 GHz Haswell quad-core CPU
– Altera DE5-Net FPGA from Terasic
– Intel HD6000 embedded GPU
– Nvidia Quadro K600 discrete GPU (192 cuda cores)
• Homogeneous results:
86
Device        | Time (sec.) | Throughput (fps) | Power (watts) | Energy (Joules)
CPU (1 core)  | 167.610     | 0.596            | 37            | 6201
CPU (4 cores) | 50.310      | 1.987            | 92            | 4628
FPGA          | 13.480      | 7.418            | 5             | 67
On-chip GPU   | 7.855       | 12.730           | 16            | 125
Discrete GPU  | 13.941      | 7.173            | 47            | 655
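As a sanity check, Energy ≈ Power x Time here: e.g., 37 W x 167.61 s ≈ 6201 J for the single-core CPU run, and 5 W x 13.48 s ≈ 67 J for the FPGA.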
Pipeline on CPU-FPGA
• Heterogeneous results based on our TBB scheduler:
- Accelerator + CPU
- Mapping 001
87
(Annotation: on-chip GPU + CPU ≈ FPGA + CPU.)
Conclusions
• Plenty of heterogeneous on-chip architectures out there
• Why not use all devices?
– Need to find the best mapping/distribution/scheduling out of the
many possible alternatives.
• Programming models and runtimes aimed at this goal are in their infancy; we are striving to solve this.
• Challenges:
– Hide HW complexity providing a productive programming model
– Consider energy in the partition/scheduling decisions
– Minimize overhead of adaptation policies
88
Future Work
• Exploit CPU-GPU-FPGA chips and coherency
• Predict energy consumption on heterogeneous CPU-GPU-FPGA chips
• Consider energy consumption in the partitioning heuristic
• Rebuild the scheduler engine on top of the new TBB library (OpenCL node)
• Tackle other irregular codes
89
Future Work: TBB OpenCL node
(Flow graph: get_next_image → preprocess → detect_with_A and detect_with_B → make_decision, with a buffer node recycling image buffers; the detector nodes are OpenCL nodes.)
Can express pipelining, task parallelism and data parallelism
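A minimal skeleton of that graph with classic TBB flow-graph nodes is sketched below (all bodies are stubs, and in the real design the two detector nodes would be the new opencl_node):

#include <tbb/flow_graph.h>
using namespace tbb::flow;

struct Image { int id; };

int main() {
  graph g;
  int remaining = 10;  // number of images to process
  source_node<Image> get_next_image(g, [&](Image& img) -> bool {
    if (remaining <= 0) return false;
    img.id = remaining--;
    return true;
  }, false);
  function_node<Image, Image> preprocess(g, unlimited,
      [](Image i) { /* preprocess */ return i; });
  function_node<Image, int> detect_A(g, unlimited,
      [](Image) { /* detect_with_A (opencl_node in the real graph) */ return 0; });
  function_node<Image, int> detect_B(g, unlimited,
      [](Image) { /* detect_with_B */ return 1; });
  join_node<tuple<int, int>> decide_join(g);
  function_node<tuple<int, int>, continue_msg> make_decision(g, serial,
      [](const tuple<int, int>&) { /* make_decision */ return continue_msg(); });

  make_edge(get_next_image, preprocess);
  make_edge(preprocess, detect_A);
  make_edge(preprocess, detect_B);
  make_edge(detect_A, input_port<0>(decide_join));
  make_edge(detect_B, input_port<1>(decide_join));
  make_edge(decide_join, make_decision);

  get_next_image.activate();
  g.wait_for_all();
}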
Future Work: other irregular codes
91
(Diagrams: a BFS tree traversed in three heterogeneous variants: a) Selective, b) Concurrent, c) Asynchronous, combining bottom-up and top-down traversals; each node and each frontier is computed on the CPU, on the GPU, or on both.)
C.H.G. Vázquez, PhD thesis: “Library based solutions for algorithms with complex patterns”, 2015.
Collaborators
• Mª Ángeles González Navarro (UMA, Spain)
• Andrés Rodríguez (UMA, Spain)
• Francisco Corbera (UMA, Spain)
• Jose Luis Nunez-Yanez (University of Bristol)
• Antonio Vilches (UMA, Spain)
• Alejandro Villegas (UMA, Spain)
• Juan Gómez Luna (U. Cordoba, Spain)
• Rubén Gran (U. Zaragoza, Spain)
• Darío Suárez (U. Zaragoza, Spain)
• Maria Jesús Garzarán (Intel and UIUC, USA)
• Mert Dikmen (UIUC, USA)
• Kurt Fellows (UIUC, USA)
• Ehsan Totoni (UIUC, USA)
• Luis Remis (Intel, USA)
92
Questions
asenjo@uma.es
93
More Related Content

What's hot

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologies
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless TechnologiesSoftware Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologies
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologiesspirit conference
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Telecom Infra Project study notes
Telecom Infra Project study notesTelecom Infra Project study notes
Telecom Infra Project study notesRichard Kuo
 
Development, test, and characterization of MEC platforms with Teranium and Dr...
Development, test, and characterization of MEC platforms with Teranium and Dr...Development, test, and characterization of MEC platforms with Teranium and Dr...
Development, test, and characterization of MEC platforms with Teranium and Dr...Michelle Holley
 
Centralized Emergency Traffic Optimizer NEV SDK
Centralized Emergency Traffic Optimizer NEV SDKCentralized Emergency Traffic Optimizer NEV SDK
Centralized Emergency Traffic Optimizer NEV SDKMichelle Holley
 
nokia_netact_brochure
nokia_netact_brochurenokia_netact_brochure
nokia_netact_brochureAna Sousa
 
Enabling MEC as a New Telco Business Opportunity
Enabling MEC as a New Telco Business OpportunityEnabling MEC as a New Telco Business Opportunity
Enabling MEC as a New Telco Business OpportunityMichelle Holley
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsGanesan Narayanasamy
 
Building Next Generation Transport Networks
Building Next Generation Transport NetworksBuilding Next Generation Transport Networks
Building Next Generation Transport NetworksInfinera
 
5G for Reliable Industrial Wireless Networks
5G for Reliable Industrial Wireless Networks5G for Reliable Industrial Wireless Networks
5G for Reliable Industrial Wireless NetworksAUTOWARE
 
Distributed Communication and Control for a Network of Melting Probes in Extr...
Distributed Communication and Control for a Network of Melting Probes in Extr...Distributed Communication and Control for a Network of Melting Probes in Extr...
Distributed Communication and Control for a Network of Melting Probes in Extr...Real-Time Innovations (RTI)
 
Strategies to architecting ultra-efficient data centers
Strategies to architecting ultra-efficient data centersStrategies to architecting ultra-efficient data centers
Strategies to architecting ultra-efficient data centersInfinera
 
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and Beyond
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and BeyondOSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and Beyond
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and BeyondLumina Networks
 
Operationalizing SDN
Operationalizing SDNOperationalizing SDN
Operationalizing SDNADVA
 
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesEnabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesMichelle Holley
 
Transport SDN Overview and Standards Update: Industry Perspectives
Transport SDN Overview and Standards Update: Industry PerspectivesTransport SDN Overview and Standards Update: Industry Perspectives
Transport SDN Overview and Standards Update: Industry PerspectivesInfinera
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmapinside-BigData.com
 

What's hot (20)

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologies
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless TechnologiesSoftware Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologies
Software Defined Networks (SDN) and Cloud Computing in 5G Wireless Technologies
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Telecom Infra Project study notes
Telecom Infra Project study notesTelecom Infra Project study notes
Telecom Infra Project study notes
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Development, test, and characterization of MEC platforms with Teranium and Dr...
Development, test, and characterization of MEC platforms with Teranium and Dr...Development, test, and characterization of MEC platforms with Teranium and Dr...
Development, test, and characterization of MEC platforms with Teranium and Dr...
 
Centralized Emergency Traffic Optimizer NEV SDK
Centralized Emergency Traffic Optimizer NEV SDKCentralized Emergency Traffic Optimizer NEV SDK
Centralized Emergency Traffic Optimizer NEV SDK
 
nokia_netact_brochure
nokia_netact_brochurenokia_netact_brochure
nokia_netact_brochure
 
Enabling MEC as a New Telco Business Opportunity
Enabling MEC as a New Telco Business OpportunityEnabling MEC as a New Telco Business Opportunity
Enabling MEC as a New Telco Business Opportunity
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
 
Building Next Generation Transport Networks
Building Next Generation Transport NetworksBuilding Next Generation Transport Networks
Building Next Generation Transport Networks
 
5G for Reliable Industrial Wireless Networks
5G for Reliable Industrial Wireless Networks5G for Reliable Industrial Wireless Networks
5G for Reliable Industrial Wireless Networks
 
Distributed Communication and Control for a Network of Melting Probes in Extr...
Distributed Communication and Control for a Network of Melting Probes in Extr...Distributed Communication and Control for a Network of Melting Probes in Extr...
Distributed Communication and Control for a Network of Melting Probes in Extr...
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
Strategies to architecting ultra-efficient data centers
Strategies to architecting ultra-efficient data centersStrategies to architecting ultra-efficient data centers
Strategies to architecting ultra-efficient data centers
 
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and Beyond
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and BeyondOSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and Beyond
OSN Bay Area Feb 2019 Meetup: ONAP Edge, 5G and Beyond
 
Operationalizing SDN
Operationalizing SDNOperationalizing SDN
Operationalizing SDN
 
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for EnterprisesEnabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
Enabling Multi-access Edge Computing (MEC) Platform-as-a-Service for Enterprises
 
Transport SDN Overview and Standards Update: Industry Perspectives
Transport SDN Overview and Standards Update: Industry PerspectivesTransport SDN Overview and Standards Update: Industry Perspectives
Transport SDN Overview and Standards Update: Industry Perspectives
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmap
 

Similar to Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)Julien SIMON
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PC Cluster Consortium
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2Yutaka Kawai
 
Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Vsevolod Shabad
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware LandscapeGrigory Sapunov
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsDatabricks
 

Similar to Making the most out of Heterogeneous Chips with CPU, GPU and FPGA (20)

Programming Models for Heterogeneous Chips
Programming Models for  Heterogeneous ChipsProgramming Models for  Heterogeneous Chips
Programming Models for Heterogeneous Chips
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)FPGAs in the cloud? (October 2017)
FPGAs in the cloud? (October 2017)
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
HiPEAC-Keynote.pptx
HiPEAC-Keynote.pptxHiPEAC-Keynote.pptx
HiPEAC-Keynote.pptx
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
00 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver200 opencapi acceleration framework yonglu_ver2
00 opencapi acceleration framework yonglu_ver2
 
Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)
 
CFD on Power
CFD on Power CFD on Power
CFD on Power
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware Landscape
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 

More from Facultad de Informática UCM

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?Facultad de Informática UCM
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...Facultad de Informática UCM
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersFacultad de Informática UCM
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmFacultad de Informática UCM
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingFacultad de Informática UCM
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroFacultad de Informática UCM
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...Facultad de Informática UCM
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Facultad de Informática UCM
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFacultad de Informática UCM
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoFacultad de Informática UCM
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCFacultad de Informática UCM
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Facultad de Informática UCM
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Facultad de Informática UCM
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Facultad de Informática UCM
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windFacultad de Informática UCM
 

More from Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Recently uploaded

Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA

  • 21. Intel Processor Graphics 21 • Peak performance: – 72 EUs – 2 x 4-wide SIMD units per EU – 2 flops/ck (fmadd) – 1 GHz → 72 EUs x 8 lanes x 2 flops x 1 GHz ≈ 1.15 TFLOPS – 7 in-flight wavefronts per EU
  • 22. Intel Processor Graphics 23 • NDRange ≈ grid; work-group ≈ block; EU-thread (SIMD16) ≈ wavefront ≈ warp: 16 work-items ≈ 16 threads • Each SIMD16 wavefront takes 2 cks on the 2 x 4-wide SIMD units • IMT: fine-grained multi-threading – Latency-hiding mechanism
  • 23. Intel Virtualization Technology for Directed I/O (Intel VT-d) 24 – Provides a cache-coherent path between GPU and CPU
  • 24. AMD Kaveri • Steamroller microarch (2 – 4 “Cores”) + 8 GCN Compute Units (CU) 25 http://wccftech.com/
  • 25. AMD Graphics Core Next (GCN) • In Kaveri, – 8 Compute Units (CU) – Each CU: 4 x 16-wide SIMD – Total: 512 FPUs – 866 MHz • Max GFLOPS = 0.86 GHz x 512 FPUs x 2 (fmad) ≈ 880 GFLOPS 26
  • 26. OpenCL execution on GCN • Work-group → wavefronts (64 work-items) → pools 27 – Each CU schedules wavefronts over its 4 SIMD units – 4 pools: 4 wavefronts in flight per SIMD – 4 cks to execute each wavefront (64 work-items over 16 lanes)
  • 27. HiSilicon Kirin 960 • ARM64 big.LITTLE (Cortex A73 + Cortex A53) + Mali G71 28
  • 28. Mali G71 • Supports OpenCL 2.0 • 4, 6, 8, 16, 32 cores • Each core: – 3 x 4-wide SIMD units • 32 cores x 12 lanes x 2 (fmad) x 0.85 GHz ≈ 652 GFLOPS 29
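
  The three peak-GFLOPS figures above follow the same product of compute units, SIMD lanes, flops per lane per cycle and clock. A minimal C++ check of the slides' arithmetic (all figures are the quoted ones, not measurements):

    #include <cstdio>

    // Peak GFLOPS = compute units x SIMD lanes x 2 flops/cycle (fmadd) x GHz
    static double peak_gflops(int units, int lanes, double ghz) {
      return units * lanes * 2.0 * ghz;
    }

    int main() {
      std::printf("Intel GT4 (72 EU x 8 lanes @ 1.0 GHz): %.0f GFLOPS\n",
                  peak_gflops(72, 8, 1.0));    // ~1152 -> 1.15 TFLOPS
      std::printf("AMD GCN (8 CU x 64 lanes @ 0.86 GHz): %.0f GFLOPS\n",
                  peak_gflops(8, 64, 0.86));   // ~880 GFLOPS
      std::printf("Mali G71 (32 cores x 12 lanes @ 0.85 GHz): %.0f GFLOPS\n",
                  peak_gflops(32, 12, 0.85));  // ~652 GFLOPS
      return 0;
    }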
  • 29. HSA (Heterogeneous System Architecture) – CPU, GPU, DSPs, ... • Founders: AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung, Texas Instruments 30 • HSA Foundation's goal: Productivity on heterogeneous HW
  • 30. Advantages of iGPUs • Discrete and integrated GPUs: different goals – NVIDIA Pascal: 3584 CUDA cores, 250 W, 10 TFLOPS – Intel Iris Pro Graphics 580: 72 EU x 8 SIMD, ~15 W, 1.15 TFLOPS – Mali-G71: 32 EU x 3 x 4 SIMD, ~2 W, 0.652 TFLOPS • CPU and GPU are both first-class citizens – SVM and platform atomics allow for closer collaboration • Avoid PCI data transfers and the associated power dissipation – The Operating System doesn't get in the way → less overhead • CPU and GPU may have similar performance – It's more likely that they can collaborate • Cheaper! 31
  • 31. Hardware: iFPGAs 32 – Altera Cyclone V: FPGA + 2 x A9 – Xilinx Zynq UltraScale+: 4 x A53 + GPU + 2 x R5 + FPGA – Intel Xeon + Arria 10, connected via QPI
  • 33. CLB: Configurable Logic Block 34 – Slices (left-hand SLICEM, right-hand SLICEL) built from LUTs and flip-flops – Connected through a switch matrix, with fast connects and carry (CIN/COUT) and shift (SHIFTIN/SHIFTOUT) chains Courtesy: Jose Luis Nunez-Yanez, U. Bristol.
  • 35. How do we “mold this liquid silicon”? • Hardware overlays – Map a CPU or GPU onto the FPGA ✓ 100s of simple CPUs can fit on large FPGAs ✓ The CPU or GPU overlay can change/adapt ✗ The resulting overlay is not as efficient as a “real” one ✗ Same drawbacks as “general purpose” processors 36 FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
  • 36. How do we “mold this liquid silicon”? • Custom pipelines on the FPGA: ✓ No FETCH nor DECODE of instructions → already hardwired ✓ Data movement (Exec. Units ⇔ MEM) reduction → power savings ✓ Not constrained by a fixed ISA: bit-level parallelism; shift, bit mask, ...; variable-precision arithmetic ✗ Lower frequency (hundreds of MHz) ✗ In-order execution (proposal from David Kaeli's group to go OoO: Hardware thread reordering to boost OpenCL throughput on FPGAs, ICCD'16) ✗ Programming effort 37 FPGAs for Software Programmers, Dirk Koch, Daniel Ziener
  • 37. Xilinx Zynq UltraScale+ MPSoC • Available: – SW coherence (cache flush) → Slow – HW coherence (MESI coherence protocol) → Fast. Ports: • Accelerator Coherency Port, ACP • Cache-coherent interconnect (CCI) ACE-Lite ports – Part of the ARM® AMBA® AXI and ACE protocol specification 38
  • 38. Intel Broadwell+Arria10 • Cache Coherent Interconnect (CCI) – New IP inside the FPGA: Intel Quick Path Interconnect with CCI – AFU (Accelerated Function Unit) can access cached data 39
  • 39. Towards CCI everywhere • CCIX consortium (http://www.ccixconsortium.com) – New chip-to-chip interconnect operating at 25Gbps – Protocol built for FPGAs, GPUs, network/storage adapters, ... – Already in Xilinx Virtex UltraScale+ 40
  • 40. Which architecture suits me best? 41 • All of them are Turing complete, but • Depending on the problem: – CPU: • For control-dominant problems • Frequent context switching – GPU: • Data-parallel problems (vectorized/SIMD) • Little control flow and little synchronization – FPGA: • Stream processing problems • Large data sets processed in a pipelined fashion • Low power constraints
  • 41. Agenda • Motivation • Hardware: iGPUs and iFPGAs • Software – Programming models for chips with iGPUs and iFPGAs • Our heterogeneous templates based on TBB – parallel_for – parallel pipeline 42
  • 42. http://imgs.xkcd.com/comics/standards.png 43 HOW STANDARDS PROLIFERATE (see: A/C chargers, character encodings, instant messaging, etc.)
  • 43. Programming models for heterogeneous: GPUs • Targeted at exploiting one device at a time – CUDA (NVIDIA) – OpenCL (Khronos Group standard) – OpenACC (C, C++ or Fortran + directives → OpenMP 4.0) – C++AMP (Windows' extension of C++; HSA recently announced its own version) – SYCL (Khronos Group's specification for single-source C++ programming) – ROCm (Radeon Open Compute: HCC, HIP, for AMD GCN –HSA–) – RenderScript (Google's Java API for Android) – ParallDroid (Java + directives, from ULL, Spain) – Many more (Numba Python, IBM Java, Matlab, R, JavaScript, ...) • Targeted at exploiting both devices simultaneously (discrete GPUs) – Qilin (C++ and Qilin API compiled to TBB+CUDA) – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) – XKaapi – StarPU • Targeted at exploiting both devices simultaneously (integrated GPUs) – Qualcomm Symphony – Intel Concord 44
  • 44. Programming models for heterogeneous: GPUs 45 – Targeted at exploiting ONE device at a time (the parallel_for runs on either CPU or GPU): CUDA, OpenCL, OpenMP 4.0, C++AMP, SYCL, ROCm, RenderScript, ParallDroid, Numba Python, JavaScript, Matlab, R – Targeted at exploiting SEVERAL devices at the same time (compiler + runtime split the parallel_for across CPU and GPU): Qilin, OmpSs, XKaapi, StarPU, Qualcomm Symphony, Concord
  • 45. Programming models for heterogeneous: FPGAs • HDL-like languages – SystemVerilog, Bluespec System Verilog – Extend RTL languages by adding new features • Parallel libraries – CUDA, OpenCL – Instrument existing parallel libraries to generate RTL code • C-based languages – SDSoC, LegUp, ROCCC, Impulse C, Vivado, Calypto Catapult C – Compile* C into intermediate representation and from there to HDL – Functions become Finite State Machines and variables mem. blocks • Higher level languages – Kiwi (C#), Lime (Java), Chisel (Scala) – Translate* input language into HDL 46 * Normally a subset thereof Courtesy: Javier Navaridas, U. Manchester.
  • 46. OpenCL 2.0 • OpenCL 2.0 supports: – Coherent SVM (Shared Virtual Memory) & platform atomics – Example: allocating an array of atomics: 47

  CPU Host Code:

    clContext = ...; clCommandQueue = ...; clKernel = ...;
    cl_svm_mem_flags flags = CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS;
    atomic_int *data =
        (atomic_int *) clSVMAlloc(clContext, flags, NUM * sizeof(atomic_int), 0);
    clSetKernelArgSVMPointer(clKernel, 0, data);
    clSetKernelArg(clKernel, 1, sizeof(int), &NUM);
    clEnqueueNDRangeKernel(clCommandQueue, clKernel, ...);
    // CPU and GPU can synchronize through the same atomics:
    atomic_store_explicit((atomic_int *)&data[0], 99, memory_order_release);

  GPU Kernel Code:

    kernel void myKernel(global int *data, int NUM) {
      int i = atomic_load_explicit((global atomic_int *)&data[0],
                                   memory_order_acquire);
      ...
    }
  • 47. C++AMP • C++ Accelerated Massive Parallelism • Example: Vector Add (C[i] = A[i] + B[i]) 48 – Code on the slide: declare the A, B and C buffers; initialize A and B; run on the GPU (see the sketch below)
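
  The slide's code is shown as an image; a minimal C++AMP sketch of the same vector add (function and variable names are ours) relies on array_view to manage the host/GPU copies:

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    void vecadd(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c) {
      int n = static_cast<int>(a.size());
      array_view<const float, 1> av(n, a), bv(n, b);  // A and B buffers
      array_view<float, 1> cv(n, c);                  // C buffer
      cv.discard_data();            // C need not be copied to the GPU
      parallel_for_each(cv.extent,  // run on the GPU
                        [=](index<1> i) restrict(amp) { cv[i] = av[i] + bv[i]; });
      cv.synchronize();             // copy the result back to the host
    }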
  • 48. SYCL Parallel STL • Parallel STL aimed at the C++17 standard: – std::sort(vec.begin(), vec.end()); // Vector sort – std::sort(seq, vec.begin(), vec.end()); // Explicit SEQUENTIAL – std::sort(par, vec.begin(), vec.end()); // Explicit PARALLEL • SYCL exposes the execution policy (user-defined) 50

    std::vector<int> v(N);
    float ratio = 0.5;  // 50% of the iterations on CPU, 50% on GPU
    sycl::het_sycl_policy<class ForEach> policy(ratio);
    std::for_each(policy, v.begin(), v.end(), [=](int& val) { val = val * val; });

  Courtesy: Antonio Vilches https://github.com/KhronosGroup/SyclParallelSTL
  • 49. Qualcomm Symphony • Point kernel: write once, run everywhere – Pattern tuning extends to heterogeneous load hints 51

    SYMPHONY_POINT_KERNEL_3(vadd, int, i, float*, a, int, na,
                            float*, b, int, nb, float*, c, int, nc,
                            { c[i] = a[i] + b[i]; });
    ...
    symphony::buffer_ptr buf_a(1024), buf_b(1024), buf_c(1024);
    ...
    symphony::range<1> r(1024);
    symphony::pfor_each(r, vadd_pk, buf_a, buf_b, buf_c,
        symphony::pattern::tuner().set_cpu_load(10)
                                  .set_gpu_load(50)
                                  .set_dsp_load(40));

  • Executed across CPU (10%) + GPU (50%) + DSP (40%) • A kernel for each device is automatically generated
  • 50. Intel Concord • C++ heterogeneous programming framework • Papers: – Rashid Kaleem et al. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014. • Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/ – Naila Farooqui et al. Affinity-aware work-stealing for integrated CPU-GPU processors. PPoPP '16. PhD dissertation: https://smartech.gatech.edu/bitstream/handle/1853/54915/FAROOQUI-DISSERTATION-2016.pdf 52

    class Foo {
      int *a, *b, *c;
     public:
      void operator()(int i) { c[i] = a[i] + b[i]; }
    };
    Foo *f = new Foo();
    parallel_for(0, N, *f);  // executed across CPU+GPU, dynamic partition
  • 51. Xilinx SDSoC • Without SDSoC, the CPU/FPGA split is done by hand: 53

    main() {
      init(A, B, C);
      mmult(A, B, D);  // candidate for the FPGA
      madd(C, D, E);   // candidate for the FPGA
      consume(E);
    }

  Courtesy: Xilinx
  • 52. Xilinx SDSoC • With SDSoC: – Eclipse-based SDK 54 • Refine the code: – Re-factoring code to be more HLS-friendly – Adding code annotations (i.e., pragmas): #pragma SDS data copy(Array) #pragma SDS zero_copy(Array) #pragma SDS data access_pattern #pragma SDS data sys_port (...) #pragma HLS PIPELINE II=1 #pragma HLS unroll #pragma HLS inline (see the sketch below)
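
  As a rough illustration (not taken from the slides), the mmult function from the previous example could be marked for the fabric like this; the pragma spellings follow the SDSoC/Vivado HLS documentation, and the matrix size N is a placeholder:

    #define N 64

    #pragma SDS data copy(A[0:N*N], B[0:N*N], C[0:N*N])
    #pragma SDS data access_pattern(A:SEQUENTIAL, B:SEQUENTIAL, C:SEQUENTIAL)
    void mmult(const float A[N*N], const float B[N*N], float C[N*N]) {
      for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
          float acc = 0.0f;
          // Pipelined inner loop; the floating-point accumulation may limit
          // the achievable II unless the reduction is restructured.
          for (int k = 0; k < N; ++k) {
            #pragma HLS PIPELINE II=1
            acc += A[i * N + k] * B[k * N + j];
          }
          C[i * N + j] = acc;
        }
    }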
  • 53. Altera OpenCL 55

  Host code (standard C compiler → .EXE, runs on the CPU):

    main() {
      read_data( ... );
      manipulate( ... );
      clEnqueueWriteBuffer( ... );
      clEnqueueNDRange(..., sum, ...);
      clEnqueueReadBuffer( ... );
      display_result( ... );
    }

  Kernel code (OpenCL aoc compiler → Verilog → .AOCX bitstream, runs on the FPGA):

    __kernel void sum(__global float *a, __global float *b, __global float *y) {
      int i = get_global_id(0);
      y[i] = a[i] + b[i];
    }
  • 54. Altera OpenCL optimizations • Locality: – Shift-Register Pattern (SRP) – Local Memory (LM) – Constant Memory (CM) – Load-Store Units Buffers (LSU) • Loop Unrolling (LU) – Deeper pipelines – Optimized tree-like reductions • Kernel vectorization (SIMD) – Wider ALUs • Compute Unit Replication (CUR) – Replicate Pipelines – Downsides: memory divergence ⬆, complexity ⬆, frequency ⬇ 56
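
  A hedged sketch of how some of these knobs are spelled in the Altera/Intel OpenCL SDK (attribute names from the vendor documentation; the unroll, SIMD and replication factors are arbitrary):

    // Kernel vectorization (SIMD) + compute unit replication (CUR):
    __attribute__((reqd_work_group_size(64, 1, 1)))  // required by num_simd_work_items
    __attribute__((num_simd_work_items(4)))          // SIMD: 4-wide datapath
    __attribute__((num_compute_units(2)))            // CUR: two pipeline copies
    __kernel void sum(__global const float *restrict a,
                      __global const float *restrict b,
                      __global float *restrict y) {
      int i = get_global_id(0);
      y[i] = a[i] + b[i];
    }

    // Loop unrolling (LU): deeper pipeline and a tree-like reduction
    __kernel void reduce16(__global const float *restrict a,
                           __global float *restrict out) {
      int base = get_global_id(0) * 16;
      float acc = 0.0f;
      #pragma unroll
      for (int j = 0; j < 16; ++j)
        acc += a[base + j];
      out[get_global_id(0)] = acc;
    }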
  • 55. Shift Register Pattern • Example: 1D stencil computation 57 – NDRange version (no SRP) vs. single-task version with a shift register (shft_reg) – The SRP single-task version is 64x faster than the naive one (17-element filter) “Tuning Stencil Codes in OpenCL for FPGAs”, ICCD'16
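
  A minimal single-task sketch of the pattern, here as a 17-tap moving average (kernel name, tap count and coefficients are placeholders): the fixed-size array that is shifted on every iteration is inferred by the compiler as a hardware shift register.

    #define TAPS 17

    __kernel void stencil1d(__global const float *restrict in,
                            __global float *restrict out, int n) {
      float shift_reg[TAPS];
      for (int i = 0; i < TAPS; ++i) shift_reg[i] = 0.0f;
      for (int i = 0; i < n; ++i) {
        #pragma unroll
        for (int j = 0; j < TAPS - 1; ++j)  // shift the window one element
          shift_reg[j] = shift_reg[j + 1];
        shift_reg[TAPS - 1] = in[i];        // a single new load per iteration
        float acc = 0.0f;
        #pragma unroll
        for (int j = 0; j < TAPS; ++j)      // fully unrolled, tree-reduced
          acc += shift_reg[j] * (1.0f / TAPS);
        if (i >= TAPS - 1) out[i - (TAPS - 1)] = acc;
      }
    }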
  • 56. Agenda • Motivation • Hardware: iGPUs and iFPGAs • Software – Programming models for chips with iGPUs and iFPGAs • Our heterogeneous templates based on TBB – parallel_for – parallel pipeline 58
  • 57. Our work: parallel_for • Heterogeneous parallel_for based on TBB – Dynamic scheduling / adaptive partitioning for irregular codes – CPU and GPU chunk sizes re-computed every time 59 – Example timeline: Barnes-Hut on an Exynos 5 (4 x A15 + 4 x A7 + GPU); one A7 feeds the GPU, measuring the time to execute each chunk of iterations on the GPU
  • 59. New API 61 – Code screenshot with three callouts: use the GPU; exploit the zero-copy buffer; the scheduler class (a hypothetical sketch of the idea follows)
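
  The API itself is shown as a screenshot; a hypothetical sketch of the underlying idea (run_on_gpu, update_chunk and body are placeholders, not the actual interface):

    #include <algorithm>
    #include <atomic>
    #include <thread>
    #include "tbb/parallel_for.h"

    void   body(size_t i);                        // user's loop body
    double run_on_gpu(size_t begin, size_t end);  // offload chunk, return seconds
    size_t update_chunk(size_t chunk, double t);  // adapt size to observed time

    void hetero_parallel_for(size_t N) {
      std::atomic<size_t> next{0};
      // One host thread keeps the GPU fed with adaptively re-sized chunks...
      std::thread gpu_feeder([&] {
        size_t chunk = 1024;                      // initial guess
        for (size_t b = next.fetch_add(chunk); b < N; b = next.fetch_add(chunk)) {
          double t = run_on_gpu(b, std::min(b + chunk, N));
          chunk = update_chunk(chunk, t);         // re-computed every time
        }
      });
      // ...while the CPU cores consume the same iteration space in small chunks.
      const size_t cpu_chunk = 64;
      for (size_t b = next.fetch_add(cpu_chunk); b < N;
           b = next.fetch_add(cpu_chunk)) {
        tbb::parallel_for(b, std::min(b + cpu_chunk, N),
                          [](size_t i) { body(i); });
      }
      gpu_feeder.join();
    }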
  • 61. Related work: Qilin 63 • Static approach: bulk-oracle offline search (11 executions) – Work partitioning between CPU and GPU – One big chunk with 70% of the iterations: on the GPU – The remaining 30% processed on the cores – Plot: Barnes-Hut offline search for the static partition; execution time in seconds vs. percentage of the iteration space offloaded to the GPU (0% – 100%)
  • 62. Related work: Intel Concord 64 • Papers: – Rashid Kaleem et al. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014. – Naila Farooqui et al. Affinity-aware work-stealing for integrated CPU-GPU processors. PPoPP '16 (poster). • Scheme: assign a chunk size just for the GPU; compute that chunk on the GPU while the CPU computes its own chunks; when the GPU is done, partition the rest of the iteration space according to the relative speeds
  • 63. Related Work: HDSS 65 • Belviranli et al., A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures, TACO 2013 – Figure from the paper (Fermi discrete GPU): block assignments over time for a two-threaded (1 GPU and 1 CPU thread) run; block sizes (log scale) grow during an initial adaptation phase and stabilize once the computational weights are determined
  • 64. Debunking big chunks • Example: irregular Barnes-Hut benchmark 66
  • 65. Debunking big chunks • A larger chunk causes more memory traffic 67 – Figure: the updated chunk within the array of bodies, with written and read bodies highlighted
  • 66. Debunking big chunks • Barnes-Hut: GPU throughput along the iteration space – For two different time steps (0 and 5) – Different GPU chunk sizes (320, 640, 1280, 2560): the best-performing chunk size changes along the iteration space and between time steps 68
  • 67. LogFit main idea 70 – Timeline on an Intel Core i7 Haswell (4 cores + GPU sharing the L3): one core feeds the GPU while measuring the time to execute each chunk of iterations – Exploration phase: sample GPU throughput for growing chunk sizes – Stable phase: keep the fitted near-optimal chunk size
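
  As the name suggests, LogFit fits a logarithmic model of GPU throughput as a function of chunk size; a toy least-squares version of such a fit (ours, not the authors' implementation):

    #include <cmath>
    #include <utility>
    #include <vector>

    // Fit throughput ~ a*log2(chunk) + b from (chunk, throughput) samples
    // gathered during the exploration phase.
    std::pair<double, double>
    fit_log(const std::vector<std::pair<double, double>>& samples) {
      double sx = 0, sy = 0, sxx = 0, sxy = 0, n = (double)samples.size();
      for (const auto& p : samples) {
        double x = std::log2(p.first), y = p.second;
        sx += x; sy += y; sxx += x * x; sxy += x * y;
      }
      double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
      double b = (sy - a * sx) / n;
      return {a, b};  // the stable phase can then pick a chunk size at the knee
    }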
  • 68. parallel_for: performance per joule • Static: oracle-like static partition of the work based on profiling • Concord: Intel approach: GPU size provided by the user once • HDSS: Belviranli et al. approach: GPU size computed once • LogFit: our work 72 – Configurations: GPU, GPU+1 core, GPU+2 cores, GPU+3 cores, GPU+4 cores – On an Intel Haswell Core i7-4770 Antonio Vilches et al. Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips, Procedia Computer Science, 2015.
  • 69. LogFit: better for irregular benchmarks 73
  • 70. LogFit working on FPGA • Energy probe: AccelPower Cap (Luis Piñuel, U. Complutense) 74 – Setup: Terasic DE1-SoC (Altera Cyclone V), AccelPower Cap + BeagleBone, shunt resistor to measure the power flow
  • 71. LogFit working on FPGA 75 • Terasic DE1: Altera Cyclone V (FPGA + 2 Cortex-A9) • Energy probe: AccelPower Cap (Luis Piñuel, U. Complutense) – Chart: LogFit on 2 cores + FPGA vs. the static partition
  • 72. Our work: pipeline • ViVid, an object detection application • Contains three main kernels that form a pipeline: input frame → Stage 1: Filter (response mtx., index mtx.) → Stage 2: Histogram (histograms) → Stage 3: Classifier (detect. response) → output • Would like to answer the following questions: – Granularity: coarse- or fine-grained parallelism? – Mapping of stages: where do we run them (CPU/GPU)? – Number of cores: how many of them when running on the CPU? – Optimum: what metric do we optimize (time, energy, both)? 76 (a TBB pipeline sketch follows)
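
  A CPU-only sketch of this three-stage pipeline with TBB's parallel_pipeline (the frame type and stage bodies are placeholders; the heterogeneous templates additionally let a stage offload its token to the GPU):

    #include <vector>
    #include "tbb/pipeline.h"

    struct frame { std::vector<float> data; };  // placeholder frame type

    void run_vivid(int nframes, int tokens) {
      int produced = 0;
      tbb::parallel_pipeline(tokens,
        tbb::make_filter<void, frame>(tbb::filter::serial_in_order,
          [&](tbb::flow_control& fc) -> frame {
            if (produced++ == nframes) { fc.stop(); return frame{}; }
            return frame{};  // read the next input frame
          }) &
        tbb::make_filter<frame, frame>(tbb::filter::parallel,
          [](frame f) { /* Stage 1: filter -> response/index matrices */ return f; }) &
        tbb::make_filter<frame, frame>(tbb::filter::parallel,
          [](frame f) { /* Stage 2: histogram */ return f; }) &
        tbb::make_filter<frame, void>(tbb::filter::serial_in_order,
          [](frame f) { /* Stage 3: classifier -> detection response */ }));
    }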
  • 73. Granularity • Coarse grain (CG): each stage processes one item on a single CPU core or on the GPU • Medium grain (MG): each CPU stage processes its item using several cores • Fine grain: GPU stages spread each item over all GPU cores; fine grain also on the CPU via AVX intrinsics 77
  • 74. Our work: pipeline 78 – With three stages and two devices, there are 2^3 = 8 stage-to-device mappings (each bit selects the device for one stage): 111, 110, 101, 100, 011, 010, 001, 000 – Example shown: Stage 1 on the GPU, Stages 2 and 3 on CPU cores
  • 75. Accounting for all alternatives • In general: nC CPU cores, 1 GPU and p pipeline stages: # alternatives = 2^p x (nC + 2) • For Rodinia's SRAD benchmark (p = 6, nC = 4) → 2^6 x 6 = 384 alternatives 79 Mapping streaming applications on commodity multi-CPU and GPU on-chip processors, IEEE Trans. Parallel and Distributed Systems, 2016.
  • 76. Framework and Model • Key idea: 1. Run only on GPU 2. Run only on CPU 3. Analytically extrapolate for heterogeneous execution 4. Find out the best configuration → RUN 80 – The two homogeneous runs collect λ (throughput) and E (energy) per stage and device
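
  A plausible first-order version of step 3 (our notation; the TPDS paper's model is more detailed): if stage i is replicated on both devices its throughputs add, the pipeline is limited by its slowest stage, and the energy per item accumulates over the stages:

    \lambda_i^{het} \approx \lambda_i^{CPU} + \lambda_i^{GPU}, \qquad
    \lambda_{pipe} = \min_{1 \le i \le p} \lambda_i^{het}, \qquad
    E_{item} \approx \sum_{i=1}^{p} E_i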
  • 77. Environmental Setup: Benchmarks • Four benchmarks 81 – ViVid (Low- and High-Definition inputs): Input → Filter → Histogram → Classifier → Output (3 stages) – SRAD: Input → Extraction → Preparation → Reduction → Statistics → Computation 1 → Computation 2 → Output (6 stages) – Tracking: Input → Tracking → Output (1 stage) – Scene Recognition: Input → Feature extraction → SVM → Output (2 stages)
  • 78. Does higher throughput imply lower energy? 82 – Plots (Haswell, HD input): throughput and energy metrics vs. number of threads (CG)
  • 79. SRAD on Haswell 84 – Plots: throughput and energy metrics vs. number of threads (CG) – Configuration DP-CG: each of the six stages replicated on CPU and GPU, so every item can be processed on either device per stage
  • 80. Results on big.LITTLE architecture 85 • On Odroid XU3: 4 Cortex-A15 + 4 Cortex-A7 + GPU Mali – Configurations compared: 1 thr (A15), 2 thr (2 A15), 4 thr (4 A15), 5 thr (4 A15 + 1 A7), 9 thr (4 A15 + 4 A7), 1 thr (GPU), 5 thr (GPU + 4 A15), 9 thr (GPU + 4 A15 + 4 A7)
  • 81. Pipeline on CPU-FPGAs • Change the GPU for an FPGA – Core i7 4770K 3.5 GHz Haswell quad-core CPU – Altera DE5-Net FPGA from Terasic – Intel HD6000 embedded GPU – Nvidia Quadro K600 discrete GPU (192 CUDA cores) • Homogeneous results: 86

    Device        | Time (s) | Throughput (fps) | Power (W) | Energy (J)
    CPU (1 core)  |  167.610 |            0.596 |        37 |       6201
    CPU (4 cores) |   50.310 |            1.987 |        92 |       4628
    FPGA          |   13.480 |            7.418 |         5 |         67
    On-chip GPU   |    7.855 |           12.730 |        16 |        125
    Discrete GPU  |   13.941 |            7.173 |        47 |        655
  • 82. Pipeline on CPU-FPGA • Heterogeneous results based on our TBB scheduler: – Accelerator + CPU – Mapping 001 87 – On-chip GPU+CPU ≈ FPGA+CPU
  • 83. Conclusions • Plenty of heterogeneous on-chip architectures out there • Why not use all the devices? – Need to find the best mapping/distribution/scheduling out of the many possible alternatives • Programming models and runtimes aimed at this goal are in their infancy: we are striving to solve this • Challenges: – Hide HW complexity while providing a productive programming model – Consider energy in the partition/scheduling decisions – Minimize the overhead of the adaptation policies 88
  • 84. Future Work • Exploit CPU-GPU-FPGA chips and coherency • Predict energy consumption on heterogeneous CPU- GPU-FPGA chips • Consider energy consumption in the partitioning heuristic • Rebuild the scheduler engine on top of the new TBB library (OpenCL node) • Tackle other irregular codes 89
  • 85. Future Work: TBB OpenCL node – Flow-graph example: a buffer of frames feeds get_next_image → preprocess → detect_with_A / detect_with_B → make_decision, where a detector can be an OpenCL node – Can express pipelining, task parallelism and data parallelism (a CPU-only sketch follows)
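
  A CPU-only sketch of that graph with the tbb::flow API (the image type and detector bodies are placeholders; with the OpenCL node preview feature, detect_with_A/B would become opencl_node instances running kernels):

    #include <algorithm>
    #include <cstdio>
    #include "tbb/flow_graph.h"
    using namespace tbb::flow;

    struct image { int id; };  // placeholder frame type

    int main() {
      graph g;
      function_node<image, image> preprocess(g, unlimited,
          [](image im) { /* resize, normalize, ... */ return im; });
      function_node<image, float> detect_with_A(g, unlimited,
          [](image im) { return 0.5f * im.id; });  // placeholder detector
      function_node<image, float> detect_with_B(g, unlimited,
          [](image im) { return 0.4f * im.id; });  // placeholder detector
      join_node<tuple<float, float>> joiner(g);
      function_node<tuple<float, float>, continue_msg> make_decision(g, serial,
          [](const tuple<float, float>& t) {
            std::printf("decision: %f\n", std::max(get<0>(t), get<1>(t)));
            return continue_msg();
          });
      make_edge(preprocess, detect_with_A);
      make_edge(preprocess, detect_with_B);
      make_edge(detect_with_A, input_port<0>(joiner));
      make_edge(detect_with_B, input_port<1>(joiner));
      make_edge(joiner, make_decision);
      for (int i = 0; i < 4; ++i)  // stands in for buffer + get_next_image
        preprocess.try_put(image{i});
      g.wait_for_all();
      return 0;
    }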
  • 86. Future Work: other irregular codes 91 – Graph traversals with three schemes: a) Selective (bottom-up), b) Concurrent (top-down), c) Asynchronous – Each node and each frontier can be computed on the CPU, on the GPU, or on both C.H.G. Vázquez, PhD thesis, Library based solutions for algorithms with complex patterns, 2015
  • 87. Collaborators • Mª Ángeles González Navarro (UMA, Spain) • Andrés Rodríguez (UMA, Spain) • Francisco Corbera (UMA, Spain) • Jose Luis Nunez-Yanez (University of Bristol) • Antonio Vilches (UMA, Spain) • Alejandro Villegas (UMA, Spain) • Juan Gómez Luna (U. Cordoba, Spain) • Rubén Gran (U. Zaragoza, Spain) • Darío Suárez (U. Zaragoza, Spain) • Maria Jesús Garzarán (Intel and UIUC, USA) • Mert Dikmen (UIUC, USA) • Kurt Fellows (UIUC, USA) • Ehsan Totoni (UIUC, USA) • Luis Remis (Intel, USA) 92