This document is a transcript of Akihiro Hayashi's Ph.D. defense at Waseda University on automatic parallelization for heterogeneous and homogeneous multicore processors. It motivates automatic parallelization by the difficulty of programming multicore processors and proposes the OSCAR heterogeneous multicore compiler, together with the OSCAR API, to parallelize sequential programs across different processor types. The methodology comprises hint directives, coarse-grain task parallelization, compiler-directed power reduction, and executable generation; the approach is evaluated on media applications running on the Renesas/Hitachi RP-X multicore processor.
3. Multicore Processors
Single-core
Pros: Easy programming
Cons: Low performance, high power consumption
Multi-core
Pros: High performance, low power consumption
Cons: Hard programming
(Figure: a single core versus a chip with multiple cores.)
4. Category of Multicore Processors
Homogeneous multicore: integrating identical cores
e.g. Intel Core i7, IBM Power7, Renesas RP-2
Heterogeneous multicore: integrating different cores (e.g. CPU cores plus GPUs)
e.g. IBM/SONY/TOSHIBA CELL BE, Intel processor + NVIDIA GPGPU, Renesas RP-X
5. Three Walls in Multicore Processors
Programming Wall: parallel programming is time-consuming
Power Wall: power is expensive
Portability Wall: portability is important
(Figure: programmers write parallel programs that must run on hardware ranging from mobile phones and PCs to supercomputers.)
7. Related Work
PEPPHER (MICRO'11)
Performance portability
(-) Parallel programming by hand
OmpSs (LCPC'10)
OpenMP extension to support GPGPU
(-) Parallel programming by hand
Qilin (MICRO'09)
Automatic dynamic scheduling for CPUs + GPUs
(-) Parallel programming by hand (CPU: Intel TBB, GPU: CUDA)
8. Related Work
EXOCHI/Merge (PLDI'07, ASPLOS'08)
Automatic dynamic scheduling for CPUs + GPUs
(-) Parallel programming by hand
CellSs (SC'06)
Automatic dynamic scheduling for CELL BE
(+) Programmers can write sequential code
(-) Only supports homogeneous SPEs
An automatic parallelizing compiler for heterogeneous multicores is becoming more important.
9. Solution: OSCAR Heterogeneous Multicore Compiler
Programming Wall: Automatic Parallelizing Compiler
Programmers can write a program in a sequential manner.
Power Wall: Power Control by Compiler
The compiler automatically reduces power consumption.
Portability Wall: Multicore API
The compiler generates source code with the OSCAR API.
10. Proposed Framework: Cross-Platform OSCAR Heterogeneous Multicore Compiler
Goal: fully automatic parallelization of a Parallelizable C or Fortran program for various heterogeneous and homogeneous multicores.
Proposal:
1. Hint directives for the OSCAR compiler
2. OSCAR heterogeneous parallelizing compiler
3. OSCAR heterogeneous API (OSCAR API 2.0)
Previous works:
OSCAR homogeneous compiler
OSCAR homogeneous API 1.0
Coarse-grain task scheduling for heterogeneous multicores
OSCAR: Optimally SCheduled Advanced multiprocessoR
13. Key Idea
(Figure: a Parallelizable C/Fortran program, e.g.

int main()
{
  for () { ... }
  func();
  ...
}

is processed by the OSCAR compiler, which performs automatic parallelization and power reduction; an accelerator compiler/library then performs controller code generation and accelerator binary generation.)
14. The Compiler Framework
(Figure: a sequential program is annotated with hint directives; the OSCAR compiler emits C with the OSCAR API for the CPU and for each accelerator (ACCa ... ACCz); each accelerator's compiler, or a library for the ACC, produces control code and executable objects; an API analyzer plus a sequential compiler and linker build the final executable.)
Step 1: Hint directive insertion
Specify the execution time, input/output variables, and other information about each accelerator-executable program part.
Step 2: Parallelization
1. Coarse-grain task generation
2. Heterogeneous task scheduling
3. Power reduction
Step 3: Accelerator binary generation
(1) Control code generation
(2) Accelerator binary generation
Step 4: Executable generation
15. Reference Heterogeneous Multicore Architecture Supported by OSCAR Compiler and API
DTU: Data Transfer Unit
LPM: Local Program Memory
LDM: Local Data Memory
DSM: Distributed Shared Memory
CSM: Centralized Shared Memory
FVR: Frequency/Voltage Control Register
(Figure: the OSCAR-API-applicable heterogeneous multicore, with accelerator cores such as GPUs and vector units.)
16. OSCAR Heterogeneous API
Parallel Execution API: parallel sections, flush, critical, execution
Memory Mapping API: threadprivate, distributedshared, onchipshared
Synchronization API: groupbarrier
Timer API: get_current_time
Data Transfer API: dma_transfer, dma_contiguous_parameter, dma_stride_parameter, dma_flag_check, dma_flag_send
Power Control API: fvcontrol, get_fvstatus
Heterogeneous API: accelerator_task_entry
Cache Control API: cache_writeback, cache_selfinvalidate, complete_memop
Hint Directives: accelerator_task, oscar_comment
http://www.kasahara.cs.waseda.ac.jp/api/regist_en.html
17. Step 1: Hint Directives for the OSCAR Compiler
Accelerator compilers or programmers specify the execution time and input/output variables for each accelerator-executable program part:

#pragma oscar_hint accelerator_task (ACCa) cycle(1000, ((OSCAR_DMAC())))
for (i = 0; i < 10; i++) {
  x[i]++;
}

#pragma oscar_hint accelerator_task (ACCb) cycle(100) in(var1, x[2:11]) out(x[2:11])
void call_FFT(int var, int *x) {
#pragma oscar_comment XXXXXXXXXX
  FFT(var, x);
}
...
call_FFT(var1, x);

The directive carries the accelerator name (ACCa, ACCb), a clock-cycle estimate, the input/output variables, and data-transfer information (e.g. OSCAR_DMAC()). A further hypothetical example follows below.
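As an illustration of the same directive pattern, a programmer could annotate another accelerator-executable function the same way. This is a minimal sketch: the function call_FIR, the cycle estimate, and the array ranges are made up; only the directive syntax follows the slide's examples.

/* Hypothetical example reusing the hint-directive syntax shown above.
 * call_FIR, cycle(500), and the array ranges are illustrative only. */
#pragma oscar_hint accelerator_task (ACCa) cycle(500) in(coef[0:15], x[0:63]) out(y[0:63])
void call_FIR(int *coef, int *x, int *y) {
  FIR(coef, x, y);   /* kernel compiled for, or hand-tuned on, ACCa */
}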
18. Step 2-1: Parallelization by OSCAR - Coarse-grain Task Generation -
The program is decomposed into macro tasks (MTs):
BB (basic block)
RB (repetition block, or loop)
SB (subroutine block, or function)
Parallelism exploitation:
Macro Flow Graph (MFG): control flow and data dependences
Macro Task Graph (MTG): coarse-grain task parallelism
An earliest executable condition is (condition for determination of MT execution) AND (condition for data access). For example, the earliest executable condition of MT6 is: (MT2 takes a branch that guarantees MT4 will be executed) OR (MT3 completes execution).
(Figure: macro-flow graph and macro-task graph; edges mark data dependency, control flow, and control branches.)
A labeled code example of the decomposition follows below.
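To make the decomposition concrete, here is a small hypothetical fragment (not from the defense) with its macro tasks labeled. MT2 depends on MT1 through the variable s, and MT3 depends on MT2 through the array a; these are exactly the dependences the MFG records.

/* Hypothetical fragment annotated with its macro-task (MT) decomposition. */
static void update(double *a, const double *b, int n)   /* some subroutine */
{
    for (int i = 0; i < n; i++) a[i] += b[i];
}

static void example(double *a, const double *b, int n, int flag)
{
    double s = flag ? 1.0 : 0.5;     /* MT1: BB (basic block) */

    for (int i = 0; i < n; i++)      /* MT2: RB (repetition block, a loop); */
        a[i] = s * b[i];             /*      data-depends on MT1 via s      */

    update(a, b, n);                 /* MT3: SB (subroutine block, a call) */
}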
19. Step 2-2: Parallelization by OSCAR - Coarse-grain Static Task Scheduling -
(Figure: the macro-task graph's tasks MT1-MT13 plus EMT, each marked "for CPU" or "for ACC", are statically scheduled over time onto CPU0, CPU1, and Accelerator0. Note that MT13, although marked for the ACC, is placed on a CPU because the ACC is busy at that point.)
A sketch of this placement rule follows below.
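The editor's notes say the scheduler uses the CP/MISF algorithm and will not assign a task to the accelerator when the accelerator is busy and a CPU would finish it sooner. The block below is a hedged sketch of that placement rule only: a generic earliest-finish-time loop that ignores task dependences and CP/MISF's priority ordering, not the actual OSCAR scheduler.

/* Sketch: earliest-finish-time placement across two CPUs and one ACC.
 * Not the real CP/MISF scheduler; dependences and priorities omitted. */
#include <stdio.h>

typedef struct { const char *name; long cpu_cost, acc_cost; } MT; /* acc_cost < 0: CPU-only */

int main(void)
{
    long free_at[3] = {0, 0, 0};   /* next free time of CPU0, CPU1, ACC0 */
    const char *pe[] = {"CPU0", "CPU1", "ACC0"};
    MT mt[] = {{"MT4", 100, 80}, {"MT5", 100, 80}, {"MT7", 100, 80}};

    for (int i = 0; i < 3; i++) {
        int best = 0;
        long best_fin = free_at[0] + mt[i].cpu_cost;
        for (int p = 1; p < 3; p++) {
            long cost = (p == 2) ? mt[i].acc_cost : mt[i].cpu_cost;
            if (cost < 0) continue;                /* task cannot run here */
            long fin = free_at[p] + cost;
            if (fin < best_fin) { best = p; best_fin = fin; }
        }
        free_at[best] = best_fin;  /* a busy ACC can lose to an idle CPU */
        printf("%s -> %s (finishes at %ld)\n", mt[i].name, pe[best], best_fin);
    }
    return 0;
}

Running this places MT4 on ACC0 but MT5 and MT7 on the CPUs, because by then the accelerator is busy and the CPUs finish sooner, mirroring the MT13 case in the figure.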
20. Step 2-3: Power Reduction
The compiler changes the frequency/voltage: where the schedule has idle time, cores are slowed down or put to sleep while still meeting the deadline (33 ms per cycle here).
Frequency/voltage states: FULL: 648MHz@1.3V; MID: 324MHz@1.1V; LOW: 162MHz@1.0V; SLEEP
(Figure: scheduled timeline for CPU0-CPU3 and the CPU+ACCa..ACCd pairs, mostly running at MID with SLEEP intervals, shown against a timer and the 33 ms deadline.)
A sketch of the inserted power-control calls follows below.
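As a hedged illustration of the compiler-inserted power control: only the name fvcontrol comes from the OSCAR API list on slide 16; the enum states, the prototype, and the core argument below are assumptions, stubbed so the sketch is self-contained.

/* Hypothetical wrapper: only the name "fvcontrol" appears in the OSCAR
 * API list; states, prototype, and core argument are assumed. */
#include <stdio.h>

enum fv_state { FV_FULL, FV_MID, FV_LOW, FV_SLEEP };
static const char *fv_name[] = { "FULL 648MHz@1.3V", "MID 324MHz@1.1V",
                                 "LOW 162MHz@1.0V", "SLEEP" };

static void fvcontrol(int core, enum fv_state s)   /* stub; real one writes the FVR */
{
    printf("core %d -> %s\n", core, fv_name[s]);
}

int main(void)
{
    /* Compiler-inserted pattern: the static schedule shows slack before
     * the 33 ms deadline, so run the macro task at MID, then sleep. */
    fvcontrol(0, FV_MID);    /* slack absorbs the slowdown to 324 MHz */
    /* ... macro task body ... */
    fvcontrol(0, FV_SLEEP);  /* idle until the next scheduled task */
    return 0;
}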
21. Step 3: Role of Accelerator Compilers
Accelerator compilers generate both the controller CPU code and the accelerator binary.
C with OSCAR API:

#pragma oscar accelerator_task_entry controller(1) oscartask_CTRL1_loop2
void oscartask_CTRL1_loop2(int *x)
{
  int i;
  for (i = 0; i <= 9; i += 1) {
    x[i]++;
  }
}

Controller CPU code (runs on CPU1, driving the ACC):

void oscartask_CTRL1_loop2(int *x)
{
  // Data transfer
  // Kernel invocation
  // Data transfer
}

plus the accelerator binary for the loop body. A hedged expansion of this skeleton follows below.
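The following sketch fills in the controller-code skeleton with calls named after the OSCAR data-transfer API from slide 16 (dma_transfer, dma_flag_send, dma_flag_check). The signatures, the flag argument, and the stub bodies are assumptions for illustration, not the actual API.

/* Hedged expansion of the controller-code skeleton above; dma_* names are
 * from the OSCAR API list, but their signatures here are assumed stubs. */
#include <string.h>

static int acc_ldm_x[10];                       /* accelerator-side buffer (LDM) */

static void dma_transfer(void *dst, const void *src, unsigned bytes)
{ memcpy(dst, src, bytes); }                    /* stub; real one programs the DTU */
static void dma_flag_send(int flag)  { (void)flag; }  /* stub: start the kernel   */
static void dma_flag_check(int flag) { (void)flag; }  /* stub: wait for completion */

void oscartask_CTRL1_loop2(int *x)
{
    dma_transfer(acc_ldm_x, x, sizeof acc_ldm_x);   /* data transfer: CPU -> ACC */
    dma_flag_send(1);                               /* kernel invocation          */
    dma_flag_check(1);                              /* wait for the accelerator   */
    dma_transfer(x, acc_ldm_x, sizeof acc_ldm_x);   /* data transfer: ACC -> CPU */
}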
22. Step 4: Generating Executables for CPUs and Accelerators
The API analyzer translates the OSCAR API into runtime library calls or annotations for a sequential compiler.
Example: parallel sections → pthread_create() (sketched below)
The sequential compiler then generates the executable for the target system.
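The one translation the slide names can be made concrete. This is a minimal sketch assuming a hypothetical two-section parallel region; the real API analyzer emits code specific to the target runtime, not this generic pthreads form.

/* Minimal sketch: lowering a two-way "parallel sections" region to
 * pthreads, per the slide's example mapping. Task bodies are made up. */
#include <pthread.h>
#include <stdio.h>

static void *section1(void *arg)        /* second section of the region */
{
    (void)arg;
    puts("macro task on worker thread");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, section1, NULL);  /* "parallel sections" lowered */
    puts("macro task on main thread");         /* first section runs here */
    pthread_join(&t, NULL);                    /* implicit barrier at region end */
    return 0;
}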
24. Performance Evaluation
Performance and power consumption evaluations:
Using a Hitachi accelerator compiler for FE-GA:
Optical flow calculation (from OpenCV)
Utilizing a hand-tuned library:
Optical flow calculation (from Tohoku University *1)
AAC encoding (from Renesas Electronics)
(*1) Hariyama et al., "Evaluation of a Heterogeneous Multi-Core Architecture for Multimedia Applications," ICD2008-139 (written in Japanese)
25. Evaluation Environment: RP-X Processor
8 Renesas SH-4A cores as CPUs and 4 Hitachi FE-GA dynamically reconfigurable processors as accelerators
I$: 32KB/core, D$: 32KB/core
Frequency/voltage states: 648MHz@1.3V, 324MHz@1.16V, 162MHz@1.0V
(Figure: RP-X block diagram: eight SH-4A cores, each with CPU, FPU, DTU, I$, MMU, D$, ILRAM, LDM, and DSM; FE-GA and MX2 accelerators; a media IP; SHwy#0/#1 buses; DDR3 #0/#1 memory.)
Yuyama et al., "A 45nm 37.3GOPS/W heterogeneous multi-core SoC," ISSCC 2010
(Figure: FE-GA block diagram, labels translated from Japanese: crossbar network; I/O ports to the sub-CPU; interrupt/DMA requests; configuration manager; sequence manager; ALU/MLT cell array (24 ALU cells, 8 multiplier cells); LS (load/store) cells and local memory (10 cells, 10 banks); compiled RAM (CRAM, 4-16KB, 2-port); bus interface.)
26. Performance Using OSCAR and FE-GA Compilers and OSCAR API on RP-X (Optical Flow from OpenCV)
Speedup against a single SH processor:
1SH: 1.0, 2SH: 1.9, 4SH: 3.46, 8SH: 5.64, 2SH+1FE: 2.65, 4SH+2FE: 5.48, 8SH+4FE: 12.36
The FE-GA binary is generated by the Hitachi FE-GA compiler.
27. Processing Performance Using OSCAR Compiler with a Hand-tuned Library and OSCAR API on RP-X (AAC Encoding)
Speedup against a single SH processor:
1SH: 1.0, 2SH: 1.98, 4SH: 3.68, 8SH: 6.32, 2SH+1FE: 4.44, 4SH+2FE: 8.39, 8SH+4FE: 16.08
A hand-tuned library for FE-GA is utilized.
28. Performance Using OSCAR Compiler with a Hand-tuned Library and API on RP-X (Optical Flow from Tohoku University)
Speedup against a single SH processor:
1SH: 1.0, 2SH: 2.29, 4SH: 3.09, 8SH: 5.4, 2SH+1FE: 18.85, 4SH+2FE: 26.71, 8SH+4FE: 32.65
A hand-tuned library for FE-GA is utilized.
30. Power Reduction Using OSCAR Compiler with a Hand-tuned Library and OSCAR API on RP-X (Optical Flow from Tohoku University)
Without power reduction: 1.7 W; with power reduction: 0.5 W, a 70% power reduction.
31. Power Reduction Using OSCAR Compiler with a Hand-tuned Library and OSCAR API on RP-X (AAC Encoding from Renesas Electronics)
With power reduction: an 80% power reduction compared to execution without power reduction.
33. Particle Radiotherapy for Cancer Treatment
Cancer is the leading cause of death in Japan; one in three people dies of cancer.
⇒ Particle beam therapy is an effective treatment.
Reference: NIRS HP, http://www.nirs.go.jp/index.shtml
34. Particle Radiotherapy for Cancer Treatment: Details
1. Taking CT pictures
2. Inputting cancer information
3. Physical simulation
4. Planning
The physical simulation is time-consuming, so this treatment is not covered by insurance.
Reference: NIRS HP, http://www.nirs.go.jp/index.shtml
35. Automatic Parallelization by OSCAR Compiler: Overview
Automatic parallelization by the OSCAR compiler and OSCAR API; the program is clinically used.
Enhancing parallelism by code rewriting: the original code can limit the potential parallelism.
⇒ Performance evaluation on various SMP machines: IBM and Intel servers.
(Figure: programmers feed the program to the parallelizing compiler, which produces a parallelized program for the hardware.)
36. Simulation Flow and Profile Result
Flow: Initialization → Dose Calculation → Scatter Calculation → Modify Calculation
Profile result (Hitachi SR16000 system, Power7 4.00GHz): Dose 92%, Scatter 8%, Init/Modify ~0%
Developed by NIRS and Mitsubishi
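A quick Amdahl's-law check (not on the slides, grounded only in the 92%/8% profile above) shows why the scatter phase also had to be parallelized:

S(n) = \frac{1}{(1-p) + p/n}, \qquad S_{\text{dose only}}(64) = \frac{1}{0.08 + 0.92/64} \approx 10.6

That is, parallelizing only the 92% dose phase would cap the 64-CPU speedup near 10.6x, far below the 55x reported on slide 47.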
37. An Overview of Dose and Scatter Calculation
Dose calculation: accumulates the dose value for each pencil beam.
Scatter calculation: calculates the influence on neighboring voxels, considering scatter.
38. Enhancing Parallelism for Dose Calculation
(Figure: pencil beams traversing the X-Y voxel grid.)

for (pencilBeams/n) {
  for (passedVoxels) {
    // dose calc
  }
}

Each processor (CPU0, CPU1, ...) calculates dose values into its own array over its share of the pencil beams; after that, the per-processor arrays are accumulated into one array, removing the conflicting updates that limited parallelism. A sketch follows below.
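This is a hedged sketch of the privatize-then-reduce rewriting described above; the array sizes, worker count, and dose expression are illustrative, not from the clinical code. The same pattern applies to the scatter calculation on the next slide, with the loop split along Z.

/* Sketch: each worker accumulates doses into a private array, then the
 * private arrays are summed into one. Names and sizes are illustrative. */
#include <string.h>

#define NVOX 4096
#define NWORKERS 2

static double dose_priv[NWORKERS][NVOX];   /* one private array per CPU */
static double dose[NVOX];                  /* final accumulated result  */

void dose_worker(int w, int beam_lo, int beam_hi)
{
    for (int b = beam_lo; b < beam_hi; b++)    /* this worker's pencilBeams/n */
        for (int v = 0; v < NVOX; v++)         /* passedVoxels (simplified)   */
            dose_priv[w][v] += 1.0;            /* placeholder dose contribution */
}

void dose_reduce(void)                         /* runs after all workers finish */
{
    memset(dose, 0, sizeof dose);
    for (int w = 0; w < NWORKERS; w++)
        for (int v = 0; v < NVOX; v++)
            dose[v] += dose_priv[w][v];
}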
39. Enhancing Parallelism for Scatter Calculation
(Figure: the voxel volume partitioned along Z.)

for (Z/n) {
  for (Y) {
    for (X) {
      // scatter value addition
    }
  }
}

Each processor (CPU0, CPU1, ...) calculates scatter contributions into its own array over its Z slice; after that, the per-processor arrays are accumulated into one array.
47. Result of Modifying Scheduling Policy on SR16000
Speedup by processor configuration:
1CPU: 1.00, 32CPU: 28.09, 32CPU-rev: 30.27, 64CPU: 49.93, 64CPU-rev: 55.10
48. Conclusion: Programmability
Programming Wall: Automatic Parallelizing Compiler
Programmers can write a program in a sequential manner.
Optical flow (OpenCV): 12x speedup with 8SH+4FE
Optical flow (hand-tuned library): 32x speedup with 8SH+4FE
AAC encoder: 16x speedup with 8SH+4FE
Dose calculation engine for particle therapy: 55x speedup with 64 CPUs
49. Conclusions: Power Consumption
Power Wall: Power Control by Compiler
The compiler automatically reduces power consumption.
Optical flow (hand-tuned library): 70% power reduction with 8SH+4FE
AAC encoder: 80% power reduction with 8SH+4FE
51. Future Work
Fully automatic parallelization of C programs
Automatic detection of parts to accelerate
Power capping control
Local memory management for heterogeneous multicores
52. Acknowledgement
Deep gratitude to:
Dr. Hironori Kasahara (Waseda)
Dr. Keiji Kimura (Waseda)
Dr. Satoshi Goto (Waseda)
Dr. Nozomu Togawa (Waseda)
Part of this research has been supported by:
NEDO "Advanced Heterogeneous Multiprocessor"
NEDO "Heterogeneous Multicore for Consumer Electronics"
MEXT Global COE program "Ambient SoC"
53. Acknowledgement
Special thanks to:
Dr. Hirofumi Nakano (Waseda)
Dr. Jun Shirako (Rice)
Dr. Yasutaka Wada (Waseda)
Dr. Takamichi Miyamoto (NEC)
Dr. Fumiyo Takano (NEC)
Dr. Masayoshi Mase (Hitachi)
Mr. Sekiya Yamashita (Waseda)
Mr. Hiroki Mikami (Waseda)
Mr. Mamoru Shimaoka (Waseda)
Mr. Yoshiya Hirase (Waseda)
Mr. Yasir I. Al-Dosary (Waseda)
Ms. Cecilia Gonzalez (Waseda)
All students and alumni in the Kasahara-Kimura Laboratory
54. Acknowledgement
Special thanks to:
Dr. Takeshi Ikenaga (Waseda)
Mr. Keita Miyamura (IBM)
Mr. Yoichi Matsuyama (Waseda)
Ms. Miki Yajima (NTT Comware)
All faculty, staff, and RAs in the CS department
The great staff in the GCOE office
WIZDOM
Hiroshima Toyo Carp
Friends in Osaka
My parents and brother
and you!
56. Step 3: Utilizing a Hand-tuned Library
Accelerator compilers skip function calls whose names start with "oscarlib".
C with OSCAR API:

#pragma oscar accelerator_task_entry controller(2) oscartask_CTRL2_call_FFT
void oscartask_CTRL2_call_FFT(int var1, int *x)
{
  oscarlib_CTRL2_ACCEL3_FFT(var1, x);
}

Controller CPU code (provided by the library):

void oscarlib_CTRL2_ACCEL3_FFT(int var1, int *x)
{
  // Data transfer
  // Kernel invocation
  // Data transfer
}

plus the accelerator binary.
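Connecting this with Step 1: the hint directive on call_FFT (slide 17) told the scheduler that the FFT can run on ACCb, and because the call inside the generated task starts with oscarlib_, the accelerator compiler leaves it alone; the hand-tuned implementation is presumably resolved against the library when the executable is linked in Step 4.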
Editor's Notes
Our goal is to realize fully automatic parallelization of C or Fortran programs for various heterogeneous multicores.
We have been developing an automatic parallelizing compiler called OSCAR for homogeneous multicores.
In our previous work we proposed the OSCAR homogeneous API, version 1.0, which is available on this website, and a coarse-grain automatic parallelization scheme for heterogeneous multicores.
Today, I will be talking about the details of a general-purpose compilation framework which supports various types of heterogeneous multicores using accelerators.
This framework includes hint directives for the OSCAR compiler, the OSCAR heterogeneous parallelizing compiler, and the OSCAR heterogeneous API.
The OSCAR heterogeneous API is to be released soon.
Let me explain the overview of the OSCAR API.
The input is a sequential Fortran or Parallelizable C program.
Parallelizable C is a programming style for C which places some limits on pointer usage and the like.
The OSCAR compiler works as a source-to-source compiler.
It generates a parallelized Fortran or C program with the OSCAR API.
After that, the backend, consisting of the API analyzer and a sequential compiler, generates the target machine code.
OK, let's talk about the compilation framework.
This framework allows us to utilize an accelerator compiler and existing hand-tuned libraries with the OSCAR compiler.
The first step is hint directive insertion for the OSCAR compiler.
Accelerator compilers or programmers specify the execution time, input/output variables, and other information about each accelerator-executable program part.
The second step is parallelization by the OSCAR compiler.
The OSCAR compiler performs coarse-grain task generation, heterogeneous task scheduling, and power reduction.
After that, the OSCAR compiler generates source code with the OSCAR API for each core.
The third step is accelerator binary generation.
The accelerator compiler performs control code generation and accelerator binary generation.
The final step is generating the executable.
We defined this architecture as the reference architecture supported by the OSCAR API.
The OSCAR compiler and API can generate parallelized programs for this architecture and for subsets of it.
The architecture may contain three kinds of processor elements: general-purpose processor cores, accelerator cores with controllers, and accelerator cores without controllers.
General-purpose processor cores and accelerator cores with controllers may have a data transfer unit (you might call it a DMA controller), local program memory or an instruction cache, local data memory or a data cache, distributed shared memory, and a frequency/voltage control register.
Existing heterogeneous multicores available on the market can be seen as subsets of this architecture, so our methodology is applicable to various heterogeneous multicores from different vendors.
Let me show you the list of OSCAR heterogeneous APIs.
You can see the specification of OSCAR API ver. 1.0, the homogeneous multicore part, on this website.
It is composed of a parallel execution API such as parallel sections, a memory mapping API such as threadprivate, a synchronization API, a timer API, a data transfer API, and a power control API.
To support heterogeneous multicores, just one directive is newly added, and three further directives are added for non-coherent cache control.
OK, let's talk about the compilation flow in detail.
This is an example of the hint directives for the OSCAR compiler.
These directives indicate that the given loop or function can run on the specified accelerator.
Accelerator compilers or programmers specify the execution time and input/output variables for the accelerator-executable program part.
The second step is automatic parallelization by the OSCAR compiler.
First, the program is decomposed into three kinds of blocks called macro tasks: basic blocks, loop blocks, and function blocks.
After that, the OSCAR compiler exploits the parallelism of the program.
First, the compiler builds a task graph called the macro-flow graph, which represents control flow and data dependences.
Then the compiler analyzes the earliest executable conditions, yielding the macro-task graph, which shows the coarse-grain task parallelism.
Then, the compiler schedules these tasks onto CPUs and accelerators using an algorithm called CP/MISF.
Green-colored tasks are those that can execute on an accelerator.
Note that the algorithm will not assign a task to an accelerator when the target accelerator is busy and execution on a processor core would give a shorter execution time.
Then the compiler applies power reduction.
When the compiler finds idle time in the scheduled result, it inserts instructions that change the frequency/voltage, considering the deadline.
Then the accelerator compiler generates the binary.
The role of the accelerator compiler is to generate the data-transfer code between CPUs and accelerators in addition to the accelerator binary.
The API you can see in this figure is called accelerator_task_entry.
This directive indicates that the specified function is called by the controller CPU.
What you see here is a demonstration of the optical flow calculation.
Shown on your left is sequential execution on one CPU; shown on your right is parallel execution on eight CPUs and four accelerators.
As shown in this movie, we achieve a 32-times speedup with the proposed framework.