Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors

WASEDA
UNIVERSITY
Studies on Automatic Parallelization for
Heterogeneous and Homogeneous
Multicore Processors
Akihiro Hayashi
林明宏
2012/01/23 1
ヘテロジニアスマルチコア及び
ホモジニアスマルチコアに対する
自動並列化に関する研究
Ph.D Defense

WASEDA
UNIVERSITY
Motivation &
Problem Definition
2012/01/23 2Ph.D Defense

WASEDA
UNIVERSITY
 Pros.
 Easy Programming
 Cons.
 Low Performance
 High Power Consumption
2012/01/23 3
Core
Core
Core
Core
Core
 Pros.
 High Performance
 Low Power Consumption
 Cons.
 Hard Programming
Ph.D Defense
Single-core Multi-core

WASEDA
UNIVERSITY
Category of
Homogeneous Multicore
Heterogeneous Multicore
2012/01/23 4
Cor
e
Cor
e
Cor
e
Cor
e
Cor
e
Cor
e
GPU
GPU
 Integrating identical cores
 Intel Core i7
 IBM Power7
 Renesas RP-2
 Integrating different cores
 IBM/SONY/TOSHIBA CELL BE
 Intel Processor + NVIDIA GPGPU
 Renesas RP-X
Ph.D Defense

WASEDA
UNIVERSITY
Focused Three Walls
in Multicore Processors
Programming Wall
 Parallel programming is time-consuming
Power Wall
 Power is expensive
Portability Wall
 Portability is important
2012/01/23 5
Programmers
HW
Mobile Phones, PCs, SupercomputersParallel Programs
Ph.D Defense

WASEDA
UNIVERSITY
Difficulty of
Parallel Programming
Programmers
Parallel
Programming
HW
Mobile Phones, PCs, Supercomputers
(1)Program (2)Expoit Tasks (3)Task scheduling
2012/01/23
Automatic parallelization is required
6
Parallel Programs
Long Period
& high costs
Cor
e
Cor
e
GPU
GPU
Ph.D Defense

WASEDA
UNIVERSITY
Related Work
 PEPPHER(MICRO’11)
 Performance Portability
 (-)Parallel programming by hand
 OmpSs(LCPC’10)
 OpenMP extension to support GPGPU
 Qilin(MICRO’09)
 Automatic dynamic scheduling for CPUs+GPUs
 CPU: Intel TBB, GPU:CUDA

WASEDA
UNIVERSITY
Related Work
 EXOCHI/Merge(PLDI’07, ASPLOS’08)
 Automatic dynamic scheduling for CPUs + GPUs
 (-) Parallel Programming by hand
 CellSs(SC’06)
 Automatic dynamic scheduling for CELL BE
 (+) programmers can write a sequential code
 (-) just only supports homogeneous SPEs
2012/01/23
Automatic parallelizing compiler for heterogeneous
multicores is getting more important
8Ph.D Defense

WASEDA
UNIVERSITY
Solution: OSCAR Heterogeneous
Multicore Compiler
Programming Wall
Automatic Parallelizing Compiler
Programmers can write a program in a sequential manner
Power Wall
Power Control by Compiler
The compiler automatically reduce the power consumption
Portability Wall
Multicore API
The compiler generates source codes with OSCAR API

WASEDA
UNIVERSITY
1. Hint directives for OSCAR compiler
2. OSCAR heterogeneous parallelizing compiler
3. OSCAR heterogeneous API (OSCAR API 2.0)
 Goal: fully automatic parallelization of a
Parallelizable C or Fortran program for various
heterogeneous and homogeneous multicores
 Previous works:
 OSCAR homogeneous compiler
 OSCAR homogeneous API 1.0
 Coarse-grain task scheduling for Heterogeneous
multicores
OSCAR Heterogeneous
Multicore Compiler
2012/01/23 10
Proposed Framework: Cross Platform
OSCAR: Optimally SCheduled Advanced multiprocessoR
Ph.D Defense

WASEDA
UNIVERSITY
Methodology

WASEDA
UNIVERSITY
OSCAR homogeneous API
(Previous Work)

WASEDA
UNIVERSITY
Key Idea
int main()
{
for () {…}
func();
…
…
}
Parallelizable C/ Frotran Program Accelerator Compiler/Library
Controller
Code
Generation
Accelerator
Binary
Generation
OSCAR Compiler
Automatic
Parallelization
Power
Reduction

WASEDA
UNIVERSITY
The Compiler Framework
2012/01/23 14
Compiler
for ACCa
OSCAR Compiler
Sequential
Program
Sequential Program
with hint directives
Compiler
for ACCa
Compiler
for ACCz
Control Code
For ACCa
API Analyzer + Sequential Compiler
+ linker
Executable
Objects
For ACCa
Step1：Hint directive insertion
specify execution time and
input/output variables or the other
information about accelerate-able
program part
Step2 : Parallelization
1.Coarse-grain task generation
2.Heterogeneous task scheduling
3.Power reduction
Step3 : Accelerator Binary
Generation
(1)control code generation
(2) accelerator binary generation
Step4 : Executable Generation
Library
For ACC
C with API
for CPU
C with API
For ACCa
C with API
For ACCz
Compiler
for ACCz
Ph.D Defense

WASEDA
UNIVERSITYReference Heterogeneous
Multicore Architecture Supported by
OSCAR Compiler and API
2012/01/23
 DTU
– Data Transfer Unit
 LPM
– Local Program Memory
 LDM
– Local Data Memory
 DSM
– Distributed Shared
Memory
 CSM
– Centralized Shared
Memory
 FVR
– Frequency/Voltage
Control Register
OSCAR-API applicable heterogeneous multicores
15
(e.g. GPU)(e.g. vector)
Ph.D Defense

WASEDA
UNIVERSITY
OSCAR Heterogeneous API
 Parallel Execution API
 parallel sections
 flush
 critical
 execution
 Memory Mapping API
 threadprivate
 distributedshared
 onchipshared
 Synchronization API
 groupbarrier
 Timer API
 get_current_time
2012/01/23 16
 Data transfer API
 dma_transfer
 dma_contiguous_parameter
 dma_stide_parameter
 dma_flag_check
 dma_flag_send
 Power control API
 fvcontrol
 get_fvstatus
 Heterogeneous API
 accelerator_task_entr
y
 Cache control API
 cache_writeback
 cache_selfinvalidate
 complete_memop
http://www.kasahara.cs.waseda.ac.jp/api/regist_en.html
 Hint Directive
 accelerator_task
 oscar_comment
Ph.D Defense

WASEDA
UNIVERSITY
Step1: Hint Directives
for OSCAR Compiler for parallelization
2012/01/23
#pragma oscar_hint accelerator_task (ACCa) cycle(1000, ((OSCAR_DMAC())))
#pragma oscar_hint accelerator_task (ACCb) cycle(100) in(var1, x[2:11]) out(x[2:11])
void call_FFT(int var, int *x) {
#pragma oscar_comment XXXXXXXXXX
FFT(var, x);
} Accelerator compilers or programmers specify execution time and
input/output variables for accelerate-able program part
17
for (i = 0; I < 10; i++) {
x[i]++;
}
call_FFT(var1, x);
Accelerator Name Clock cycle
Input/output variables
Data transfer
Ph.D Defense

WASEDA
UNIVERSITY
Step2-1: Parallelization by OSCAR
- Coarse-grain Task generation -
2012/01/23
 Program is decomposed into Macro Tasks (MTs)
 BB (Basic Block)
 RB (Repetition Block, or loop)
 SB (Subroutine Block, or function)
 Parallelism exploitation
 Macro Flow Graph (MFG): control-flow and data dependency
 Macro Task Graph (MTG): coarse grain task parallelism
Macro-Flow Graph Macro-Task Graph
(Condition for determination of MT Execution)
AND
(Condition for Data access)
Ex. Earliest Executable
Condition of MT6
MT2 takes a branch
that guarantees MT4 will be executed
OR
MT3 completes execution
Earliest Executable Condition
: Data Dependency
: Control flow
: Control Branch
18Ph.D Defense

WASEDA
UNIVERSITY
CPU ACCa
Step2-2: Parallelization by OSCAR
- Coarse-grain Static Task Scheduling -
2012/01/23
MT1
for CPU
MT2
for CPU
MT3
for CPU
MT6
for CPU
MT5
for ACC
MT4
for ACC
MT7
for ACCL
MT10
for ACC
MT9
for ACC
MT8
for CPU
MT13
for ACC
MT12
for ACC
MT11
for CPU
EMT
TIME
Macro-Task Graph
CPU0 CPU1
MT1
MT2 MT3
MT4
MT5
MT6 MT7
MT8
MT9
MT10
MT11
MT12
MT13
MT13
19
Accelerator0
ACC is busy
Ph.D Defense

WASEDA
UNIVERSITY
Step 2-3:Power Reduction
Compiler changes Frequency/Voltage
CPU0 CPU1 CPU2 CPU3 CPU ACCa CPU ACCb CPU ACCc CPU ACCd
SLEEP
Timer
SLEEP SLEEP SLEEP
SLEEP SLEEPSLEEPMID MID MID MID
TIME
cycle
deadline
=33[ms]
SLEEP
MID
2012/01/23 20
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
MID
FULL:648MHz
@1.3V
MID:
324MHz@1.1
V
LOW:
162MHz@1.0
V
SLEEP:
Ph.D Defense

WASEDA
UNIVERSITY
Step3:Role of Accelerator Compilers
 Accelerator compilers generate both controller CPU
code and accelerator binary
2012/01/23
C with OSCAR API
#pragma oscar accelerator_task_entry
controller(1) oscartask_CTRL1_loop2
void oscartask_CTRL1_loop2(int *x)
{
int i;
for (i = 0; i <= 9; i += 1) {
x[i]++;
}
}
21
Controller CPU code
void
oscartask_CTRL1_loop2
(int *x)
{
// Data-transfer
// Kernel Invocation
// Data-transfer
}
Accelerator binary
CPU1
ACC
Ph.D Defense

WASEDA
UNIVERSITY
Step4:Generating Executable
for CPUs and Accelerators
API analyzer translates OSCAR API
into runtime library calls or annotation
for sequential compiler
 Example
 parallel sections → pthread_create()
Sequential compiler generates the
executable for the target system

WASEDA
UNIVERSITY
Practical Application:
Media Applications

WASEDA
UNIVERSITY
Performance Evaluation
Performance and power consumption
Evaluations
 Using a Hitachi accelerator compiler
for “FE-GA”
 Optical Flow calculation (from OpenCV)
 Utilizing a hand-tuned library
 Optical Flow calculation
(from Tohoku university ※1)
 AAC encoding(from Renesas Electronics)
2012/01/23
(※1)Hariyama et Al. “Evaluation of a Heterogeneous Multi-Core Architecture for Multimedia
Applications” ICD2008-139 (Written in Japanese)
24Ph.D Defense

WASEDA
UNIVERSITY
Evaluation Environment:
RP-X processor
 8 Renesas SH-4As as CPUs, 4 Hitachi FE-GAs
dynamically reconfigurable processor as accelerators
 I$:32KB/core, D$:32KB/core
 Frequency/Voltage State:
648MHz@1.3V, 324MHz@1.16V, 162MHz@1.0V
SH-4ASH-4A
SH-4A
SH-4A
SH-4ASH-4A
SH-4A
SH-4A
SHwy#0 SHwy#1
DDR3
#0
Media
IP
FE MX2 DDR3
#1
FEFEFE-GA MX2
SH-4ACPU FPU DTU
I$ MMU D$
ILRAM LDM DSM
2012/01/23
Yuyama et al. “A 45nm 37.3GOPS/W heterogeneous multi-core SoC” ISSCC2010
25
クロスバネットワーク
サブCPUへ
入出力ポート
割込/DMA
要求
コンフィギュレーションマネージャ
ALU/MLTセルアレイ
(24/8セル)
LSセルローカルメモリ
(10バンク)(10セル)
MLT
LS CRAM
バス
I/F
MLT
MLT
MLT
MLT
MLT
MLT
MLT
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
CRAM
CRAM
CRAM
CRAM
CRAM
CRAM
CRAM
CRAM
CRAM
LS
LS
LS
LS
LS
LS
LS
LS
LS
ALU MLT LS CRAMALUセル乗算セルロードストアセル
コンパイルドRAM
(4～16KB, 2-po
シーケンスマネージャ
Ph.D Defense

WASEDA
UNIVERSITYPerformance Using OSCAR and
FE-GA Compilers and OSCAR API on
RP-X(Optical Flow from OpenCV)
2012/01/23 26
1
1.9
3.46
5.64
2.65
5.48
12.36
0
2
4
6
8
10
12
14
1SH 2SH 4SH 8SH 2SH+1FE 4SH+2FE 8SH+4FE
SpeedupagainstasingleSHprocessor
FE-GA binary is generated
by Hitachi FE-GA Compiler
Ph.D Defense

WASEDA
UNIVERSITYProcessing Performance Using OSCAR
Compiler with a hand-tuned library
and OSCAR API on RP-X (AAC Encoding)
2012/01/23 27
1
1.98
3.68
6.32
4.44
8.39
16.08
0
2
4
6
8
10
12
14
16
18
SpeedupsagainstasingleSHprocessor
Hand-tuned library for FE-GA is utilized
Ph.D Defense

WASEDA
UNIVERSITYPerformance Using OSCAR Compiler with a
hand-tuned library and API on RP-X
(Optical Flow from Tohoku university)
2012/01/23 28
1
2.29 3.09
5.4
18.85
26.71
32.65
0
5
10
15
20
25
30
35
SpeedupsagainstasingleSHprocessor
Hand-tuned library for FE-GA is utilized
Ph.D Defense

WASEDA
UNIVERSITY
Demonstration
-Optical Flow Calculation-
2012/01/23
Sequential Execution
1SH
Parallel Execution
8SH+4FE
32x
29Ph.D Defense

WASEDA
UNIVERSITYPower Reduction Using OSCAR Compiler with
a hand-tuned library and OSCAR API on RP-X
(Optical Flow from Tohoku University)
2012/01/23 30
Without Power Reduction With Power Reduction
70% of power reduction1.7W 0.5W
Ph.D Defense

WASEDA
UNIVERSITYPower Reduction Using OSCAR Compiler with
a hand-tuned library and OSCAR API on RP-X
(AAC Encoding from Renesas Electronics)
2012/01/23 31
Without Power Reduction With Power Reduction
80% of power reduction
Ph.D Defense

WASEDA
UNIVERSITY
Highly Practical Application:
Dose calculation Engine
for cancer treatment

WASEDA
UNIVERSITY
Particle Radiotherapy
for Cancer Treatment
Cancer is the primary cause of a
person’s death in Japan
 One in three people died of cancer
⇒ 粒子線による治療が効果的
2012/01/23 33
Reference:NIRS HP
http://www.nirs.go.jp/index.shtml
Ph.D Defense

WASEDA
UNIVERSITY
Particle Radiotherapy
for Cancer Treatment details
1.Taking CT pictures
2.Input cancer information
3. Physical simulation
4.Planning
2012/01/23 34
Reference: NIRS HP
http://www.nirs.go.jp/index.shtml
Physical simulation is time-consuming
⇒this treatment is not covered by insurance
Ph.D Defense

WASEDA
UNIVERSITY
Automatic Parallelization
by OSCAR compiler
Overview
 Automatic Parallelization
by OSCAR compiler and OSCAR API
 The program is clinically used
 Enhancing parallelism by code rewriting
– Original code can limit the potential parallelism
⇒Evaluation on various SMP machines
 Performance Evaluation on IBM and Intel servers
2012/01/23 35
Parallelizing
Compiler
Programmers Parallelized
Program
Hardware
Ph.D Defense

WASEDA
UNIVERSITY
Simulation Flow
and Profile Result
2012/01/23 36
Dose
92%
Scatter
8%
Init,
Modify
0%
Profile Result
Environment:
Hitachi SR16000 System
Power7 4.00GHz
Dose Calculation
Scatter Calculation
Modify Calculation
Initialization
Developed by NIRS and Mitsubishi
Ph.D Defense

WASEDA
UNIVERSITY
An overview of
Dose and Scatter Calculation
 Dose Calculation
 Accumulates dose value for each pencil beam
 Scatter Calculation
 Calculates an influence on neighbor voxels
considering “scatter”

WASEDA
UNIVERSITY
Enhancing parallelism
for dose calculation
2012/01/23 38
X
Y
Beams
for (pencilBeams/n) {
}
for (passedVoxels) {
//Dose calc
}
 Each processor calculates dose values for each array
 After that, each array is accumulated to one array
for (pencilBeams/n) {
}
for (passedVoxels) {
//Dose Calc
}
Each array is
accumulated to
one array
CPU0
CPU1
Each processor calculates dose
valuesfor each array
CPU0
CPU1
Ph.D Defense

WASEDA
UNIVERSITY
Enhancing parallelism
for scatter calculation
2012/01/23 39
Z
Y
for (Z/n) {
}
for (Y) {
for (X) {
//Scatter value addition
}
}
 Each processor calculates dose values for each array
 After that, each array is accumulated to one array
for (Z/n) {
}
Each array is
accumulated to
one array
CPU0
CPU1
Each processor calculates dose
valuesfor each array
CPU0
CPU1
for (Y) {
for (X) {
//Scatter value addition
}
}
Ph.D Defense

WASEDA
UNIVERSITY
Performance Evaluation on
Hitachi HA8000/RS200(Intel Xeon SMP)
Target Machine
 Intel Xeon X5670 2.93GHz
 1CPU, 2CPU, 6CPU, 12CPU
Evaluated cases
 1.OSCAR+Intel Composer XE 12.0
 -fast –openmp
 2.Intel Composer XE 12.0
 -parallel –fast
Data size: 1.5mm

WASEDA
UNIVERSITY
Performance on HA8000
OSCAR+icc
2012/01/23 41
1.22 1.23 1.21 1.22 1.221.22 1.00
1.92
4.84
6.90
0
1
2
3
4
5
6
7
8
original 1CPU 2CPU 6CPU 12CPU
Speedupagainst1CPU
Processor Configuration
icc自動並列化 OSCAR自動並列化
 OSCAR+icc shows 6.90x speedup with 12CPU
 icc does not exploit the parallelism
autopara
Ph.D Defense

WASEDA
UNIVERSITY
Performance
Analysis on HA8000
 OSCAR compiler shows higher scalability than icc
 6.90x with 12CPU
 Scatter calculation have to be accelerated to attain
more performance
2012/01/23 42
1.00
1.95
5.66
10.35
1.00 1.35
1.84
2.76
0.00
2.00
4.00
6.00
8.00
10.00
12.00
1CPU 2CPU 6CPU 12CPU 1CPU 2CPU 6CPU 12CPU
線量計算散乱計算
Speedup
Speedup of dose and scatter calculation
Dose Scatter
Ph.D Defense

WASEDA
UNIVERSITY
Performance Evaluation on
Hitachi SR16000(IBM Power 7 SMP)
Target Machine
 IBM Power 7 4.0GHz
 1CPU, 32CPU, 64CPU
Evaluated Cases
 1.OSCAR+IBM XLC Compiler 11.0
 -O4 -qsmp=omp –qarch=pwr7 –qmaxmem=-1
 2. IBM XLC Compiler 11.0
 -O4 -qsmp=auto –qarch=pwr7 –qmaxmem=-1
Data size: 0.5mm

WASEDA
UNIVERSITY
Performance on SR16000
OSCAR+xlc
2012/01/23 44
1.31 1.29 1.29 1.291.31 1.00
30.00
48.07
0
10
20
30
40
50
60
original 1CPU 32CPU 64CPU
Speedupagainst1CPU
Processor Configurataion
xlc自動並列化 OSCAR自動並列化
 OSCAR+xlc shows 48.07x with 64CPU
 xlc only exploits small loop parallelism
autopara
Ph.D Defense

WASEDA
UNIVERSITY
Pertformance
Analysis on SR16000
 OSCAR compiler shows higher scalability than xlc
 48.07x with 64CPU
 Scatter calculation has to be accelerated to attain
more performance
2012/01/23 45
1.00
29.80
56.45
1.00
17.04
21.31
0.00
10.00
20.00
30.00
40.00
50.00
60.00
1CPU 32CPU 64CPU 1CPU 32CPU 64CPU
線量計算散乱計算
Speedup
Speedup of dose and scatter calculation
Dose Scatter
Ph.D Defense

WASEDA
UNIVERSITY
Workload Balancing
Load balancing issue
2012/01/23 46
0
0.1
0.2
0.3
0.4
1
22
43
64
85
106
127
148
169
190
211
232
CalculationAmount
voxel
Scatter calculation
0
0.0005
0.001
0.0015
0.002
0.0025
1
938
1875
2812
3749
4686
5623
6560
7497
8434
9371
CalculationAmount
beam
Dose calculation
 Scheduling policy has to be modified
Ph.D Defense

WASEDA
UNIVERSITY
Result of modifying scheduling
policy on SR16000
2012/01/23 47
1.00
28.09
30.27
49.93
55.10
0.00
10.00
20.00
30.00
40.00
50.00
60.00
1CPU 32CPU 32CPU-rev 64CPU 64CPU-rev
Speedup
Processor Configuration
Ph.D Defense

WASEDA
UNIVERSITY
Conclusion:
Programmability
Programming Wall
Automatic Parallelizing Compiler
Programmers can write a program in a sequential manner
 Optical flow(OpenCV):
12x speedup with 8SH+4FE
 Optical flow(Hand-tuned library):
 AAC Encoder:
 Dose calculation engine for particle therapy:
55x speedup with 64CPU

WASEDA
UNIVERSITY
Conclusions:
Power Consumption
Power Wall
Power Control by Compiler
The compiler automatically reduce the power consumption
 Optical flow(Hand-tuned library):
Reduce 70% power consumption for 8SH+4FE
 AAC Encoder:
Reduce 80% power consumption for 8SH+4FE

WASEDA
UNIVERSITY
Conclusions:
Portability
Portability Wall
Multicore API
The compiler generates source codes with OSCAR API
 Renesas RP-X Heterogeneous Multicore
 SH4A
 Hitachi HA8000
 Intel Xeon
 Hitachi SR16000
 IBM Power7

WASEDA
UNIVERSITY
Future Work
 Fully automatic parallelization of C program
 Automatic detection of acceleration part
 Power capping control
 Local memory management for heterogeneous
multicore

WASEDA
UNIVERSITY
Acknowledgement
 Deep gratitude for
 Dr.Hironori Kasahara(Waseda)
 Dr.Keiji Kimura(Waseda)
 Dr.Satoshi Goto(Waseda)
 Dr.Nozomu Togawa(Waseda)
 A part of this research has been supported by
 NEDO “Advanced Heterogeneous Multiprocessor”
 NEDO “Heterogeneous Multicore for Consumer
Electronics”
 MEXT project “Global COE Ambient Soc”

WASEDA
UNIVERSITY
Acknowledgement
 Special thanks
 Dr.Hirofumi Nakano(Waseda)
 Dr.Jun Shirako(Rice)
 Dr.Yasutaka Wada(Waseda)
 Dr.Takamichi Miyamoto(NEC)
 Dr.Fumiyo Takano(NEC)
 Dr.Masayoshi Mase(Hitachi)
 Mr.Sekiya Yamashita(Waseda)
 Mr.Hiroki Mikami(Waseda)
 Mr.Mamoru Shimaoka(Waseda)
 Mr.Yoshiya Hirase(Waseda)
 Mr.Yasir I. Al-Dosary (Waseda)
 Ms.Cecilia Gonzalez (Waseda)
 All student and alumni
in Kasahara Kimura Laboratory

WASEDA
UNIVERSITY
Acknowledgement
 Special Thanks
 Dr.Takeshi Ikenaga(Waseda)
 Mr.Keita Miyamura(IBM)
 Mr.Yoichi Matsuyama(Waseda)
 Ms.Miki Yajima(NTT Comware)
 All faculty, stuff, and RA in CS department
 Great Staffs in GCOE office
 WIZDOM
 Hiroshima Toyo Carp
 Friends in Osaka
 My parents and brother
 and you!

WASEDA
UNIVERSITY
Thank you for your attention!

WASEDA
UNIVERSITY
Library
Step3:Utilizing Hand-tuned library
 Accelerator compilers skip functional call whose
name starts with “oscarlib”
2012/01/23
C with OSCAR API
#pragma oscar accelerator_task_entry
controller(2)
oscartask_CTRL2_call_FFT
void oscartask_CTRL2_call_FFT(int var1,
int *x)
{
oscarlib_CTRL2_ACCEL3_FFT(var1, x);
}
56
Controller CPU code
void oscarlib_CTRL2_ACCEL3_FFT(int x,
int *v)
{
// Data-transfer
// Kernel Invocation
// Data-transfer
}
Accelerator binary
Ph.D Defense

Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Similar to Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors

Similar to Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors (20)

More from Akihiro Hayashi

More from Akihiro Hayashi (11)

Recently uploaded

Recently uploaded (20)

Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multicore Processors

Editor's Notes