WASEDA UNIVERSITY
Studies on Automatic Parallelization for
Heterogeneous and Homogeneous Multicore Processors
(ヘテロジニアスマルチコア及びホモジニアスマルチコアに対する自動並列化に関する研究)
Akihiro Hayashi (林 明宏)
Ph.D Defense, 2012/01/23

Motivation & Problem Definition

Multicore Processors
 Single-core
  Pros: Easy Programming
  Cons: Low Performance, High Power Consumption
 Multi-core
  Pros: High Performance, Low Power Consumption
  Cons: Hard Programming

Category of Multicore Processors
 Homogeneous Multicore: integrating identical cores
  Intel Core i7
  IBM Power7
  Renesas RP-2
 Heterogeneous Multicore: integrating different cores (e.g., CPU cores plus GPUs)
  IBM/SONY/TOSHIBA CELL BE
  Intel Processor + NVIDIA GPGPU
  Renesas RP-X

Three Walls in Multicore Processors
 Programming Wall: parallel programming is time-consuming
 Power Wall: power is expensive
 Portability Wall: portability of parallel programs across hardware (mobile phones, PCs, supercomputers) is important

Difficulty of Parallel Programming
To run a parallel program on hardware (mobile phones, PCs, supercomputers),
programmers must (1) write the program, (2) exploit tasks, and (3) schedule tasks,
which takes a long period and incurs high costs.
⇒ Automatic parallelization is required

Related Work
 PEPPHER (MICRO’11)
  Performance portability
  (-) Parallel programming by hand
 OmpSs (LCPC’10)
  OpenMP extension to support GPGPU
  (-) Parallel programming by hand
 Qilin (MICRO’09)
  Automatic dynamic scheduling for CPUs + GPUs
  (-) Parallel programming by hand (CPU: Intel TBB, GPU: CUDA)

Related Work (continued)
 EXOCHI/Merge (PLDI’07, ASPLOS’08)
  Automatic dynamic scheduling for CPUs + GPUs
  (-) Parallel programming by hand
 CellSs (SC’06)
  Automatic dynamic scheduling for CELL BE
  (+) Programmers can write sequential code
  (-) Supports only the homogeneous SPEs
⇒ An automatic parallelizing compiler for heterogeneous multicores is becoming more important

Solution: OSCAR Heterogeneous Multicore Compiler
 Programming Wall → Automatic Parallelizing Compiler
  Programmers can write a program in a sequential manner
 Power Wall → Power Control by Compiler
  The compiler automatically reduces the power consumption
 Portability Wall → Multicore API
  The compiler generates source code with OSCAR API

OSCAR Heterogeneous Multicore Compiler
(OSCAR: Optimally SCheduled Advanced multiprocessoR)
 Goal: fully automatic parallelization of a Parallelizable C or Fortran
  program for various heterogeneous and homogeneous multicores
 Previous works:
  OSCAR homogeneous compiler
  OSCAR homogeneous API 1.0
  Coarse-grain task scheduling for heterogeneous multicores
 Proposed framework (cross-platform):
  1. Hint directives for the OSCAR compiler
  2. OSCAR heterogeneous parallelizing compiler
  3. OSCAR heterogeneous API (OSCAR API 2.0)

Methodology

OSCAR Homogeneous API (Previous Work)

Key Idea
A Parallelizable C or Fortran program, e.g.:
  int main()
  {
    for () {…}
    func();
    …
  }
is first processed by the OSCAR Compiler (automatic parallelization and power
reduction), and then by an accelerator compiler or library (controller code
generation and accelerator binary generation).

The Compiler Framework
 Step 1: Hint directive insertion
  Specify the execution time, input/output variables, and other information
  about each accelerator-executable program part
  (sequential program → sequential program with hint directives)
 Step 2: Parallelization by the OSCAR Compiler
  1. Coarse-grain task generation
  2. Heterogeneous task scheduling
  3. Power reduction
  (output: C with API for the CPUs, and C with API for each accelerator ACCa … ACCz)
 Step 3: Accelerator binary generation by the compiler for each ACC
  (1) Control code generation
  (2) Accelerator binary generation (a library for the ACC may be used instead)
 Step 4: Executable generation
  API analyzer + sequential compiler + linker → executable objects

Reference Heterogeneous Multicore Architecture
Supported by OSCAR Compiler and API
OSCAR-API-applicable heterogeneous multicores combine general-purpose CPU cores
with accelerator cores (e.g., vector units or GPUs), using:
 DTU: Data Transfer Unit
 LPM: Local Program Memory
 LDM: Local Data Memory
 DSM: Distributed Shared Memory
 CSM: Centralized Shared Memory
 FVR: Frequency/Voltage Control Register

OSCAR Heterogeneous API
(http://www.kasahara.cs.waseda.ac.jp/api/regist_en.html)
 Parallel Execution API: parallel sections, flush, critical, execution
 Memory Mapping API: threadprivate, distributedshared, onchipshared
 Synchronization API: groupbarrier
 Timer API: get_current_time
 Data Transfer API: dma_transfer, dma_contiguous_parameter,
  dma_stride_parameter, dma_flag_check, dma_flag_send
 Power Control API: fvcontrol, get_fvstatus
 Heterogeneous API: accelerator_task_entry
 Cache Control API: cache_writeback, cache_selfinvalidate, complete_memop
 Hint Directives: accelerator_task, oscar_comment

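As a rough illustration of how these directives appear in code, here is a
minimal sketch. OSCAR API builds on OpenMP-style pragmas, but any spelling
beyond the directive names listed above is an assumption, not taken from the
slides:

/* Minimal sketch, assuming the OpenMP-style pragma spelling that OSCAR
   API builds on; only the directive names come from the list above. */
#include <stdio.h>

static void macro_tasks_cpu0(void) { puts("tasks on CPU0"); }  /* placeholders */
static void macro_tasks_cpu1(void) { puts("tasks on CPU1"); }

int main(void)
{
    /* One section per core; the API analyzer later maps each section
       onto a thread (see Step 4 below). */
    #pragma omp parallel sections
    {
        #pragma omp section
        macro_tasks_cpu0();
        #pragma omp section
        macro_tasks_cpu1();
    }
    return 0;
}
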
Step 1: Hint Directives for the OSCAR Compiler
Accelerator compilers or programmers specify the execution time and the
input/output variables of each accelerator-executable program part:

#pragma oscar_hint accelerator_task (ACCa) cycle(1000, ((OSCAR_DMAC())))
for (i = 0; i < 10; i++) {
  x[i]++;
}

#pragma oscar_hint accelerator_task (ACCb) cycle(100) in(var1, x[2:11]) out(x[2:11])
void call_FFT(int var, int *x) {
#pragma oscar_comment XXXXXXXXXX
  FFT(var, x);
}
…
call_FFT(var1, x);

Here accelerator_task names the target accelerator, cycle gives the estimated
clock cycles (optionally with data-transfer information such as OSCAR_DMAC),
and in/out list the input/output variables.

Step 2-1: Parallelization by OSCAR - Coarse-grain Task Generation -
 The program is decomposed into Macro Tasks (MTs):
  BB (Basic Block)
  RB (Repetition Block, or loop)
  SB (Subroutine Block, or function)
 Parallelism exploitation:
  Macro Flow Graph (MFG): control flow and data dependency
  Macro Task Graph (MTG): coarse-grain task parallelism, derived from each
   MT's Earliest Executable Condition, i.e., (condition for determination of
   MT execution) AND (condition for data access).
   Example, the Earliest Executable Condition of MT6: (MT2 takes a branch that
   guarantees MT4 will be executed) OR (MT3 completes execution).

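For intuition, a minimal self-contained sketch (illustrative code, not from
the slides) of how a small program decomposes into macro tasks:

/* Illustrative decomposition of a program into macro tasks. */
static int a[100];

static void init(void)              /* its call site below forms an SB */
{
    for (int i = 0; i < 100; i++)
        a[i] = 0;
}

int main(void)
{
    int sum = 0;                    /* BB: basic block */

    init();                         /* SB: subroutine block */

    for (int i = 0; i < 100; i++)   /* RB: repetition block */
        a[i] = i * 2;

    for (int i = 0; i < 100; i++)   /* RB: data-dependent on the RB above,
                                       so the MFG/MTG gets an edge here */
        sum += a[i];

    return sum;                     /* BB */
}
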
Step 2-2: Parallelization by OSCAR - Coarse-grain Static Task Scheduling -
The macro-task graph (MT1-MT13 plus an end task, EMT) is statically scheduled
onto CPU0, CPU1, and Accelerator0 along the time axis. MT4, MT5, MT7, MT9,
MT10, MT12, and MT13 are accelerator-executable ("for ACC"); the rest are for
CPUs. When the accelerator is busy, an accelerator-executable task (MT7 in the
figure) is assigned to a CPU instead if that yields a shorter completion time.
A simplified sketch of such list scheduling follows below.

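The defense notes name CP/MISF (critical path / most immediate successors
first) as the scheduling algorithm. Below is a simplified list-scheduling
sketch in that spirit; the data structures and cost model are assumptions,
not the OSCAR compiler's implementation:

/* Simplified heterogeneous list scheduling in the spirit of CP/MISF;
   an illustration only. Tasks are assumed to be visited in priority
   order (critical-path length, ties broken by immediate-successor count). */
#include <limits.h>

#define NUNIT 3                    /* CPU0, CPU1, Accelerator0 */
#define ACC_UNIT 2

typedef struct {
    int can_use_acc;               /* 1 if an accelerator version exists */
    long cost[NUNIT];              /* estimated cycles on each unit */
} Task;

/* Pick the unit with the earliest finish time; a busy accelerator can
   lose to an idle CPU, as with MT7 in the schedule above. */
int schedule_one(const Task *t, long free_at[NUNIT])
{
    long best_finish = LONG_MAX;
    int best_unit = 0;
    for (int u = 0; u < NUNIT; u++) {
        if (u == ACC_UNIT && !t->can_use_acc)
            continue;
        long finish = free_at[u] + t->cost[u];
        if (finish < best_finish) {
            best_finish = finish;
            best_unit = u;
        }
    }
    free_at[best_unit] = best_finish;
    return best_unit;
}
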
Step 2-3: Power Reduction
The compiler changes the frequency and voltage of each core. Given a deadline
(e.g., 33 ms per cycle), CPU0-CPU3 and the CPU/accelerator pairs ACCa-ACCd run
their tasks at the MID state and are put to SLEEP during idle time.
Frequency/voltage states:
 FULL: 648MHz @ 1.3V
 MID: 324MHz @ 1.1V
 LOW: 162MHz @ 1.0V
 SLEEP

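As a rough illustration of the inserted control, here is a hypothetical
sketch: the slides give only the state names and the fvcontrol/get_fvstatus
API names, so the pragma arguments and helper functions below are assumptions:

/* Hypothetical sketch of compiler-inserted power control; only the name
   fvcontrol and the MID/SLEEP states come from the slides. */
extern void run_assigned_macro_tasks(void);  /* assumed helpers */
extern void wait_for_next_period(void);

void cpu1_control_loop(void)
{
    for (;;) {
        run_assigned_macro_tasks();          /* executes at MID */

        /* Idle until the 33 ms deadline: drop this core to SLEEP. */
        #pragma oscar fvcontrol (SLEEP)
        wait_for_next_period();

        /* Raise the core back to MID for the next period. */
        #pragma oscar fvcontrol (MID)
    }
}
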
Step 3: Role of Accelerator Compilers
 Accelerator compilers generate both the controller CPU code and the accelerator binary.

C with OSCAR API (input):
#pragma oscar accelerator_task_entry controller(1) oscartask_CTRL1_loop2
void oscartask_CTRL1_loop2(int *x)
{
  int i;
  for (i = 0; i <= 9; i += 1) {
    x[i]++;
  }
}

Generated controller CPU code (runs on CPU1, driving the ACC):
void oscartask_CTRL1_loop2(int *x)
{
  // Data transfer to the accelerator
  // Kernel invocation
  // Data transfer back
}
together with the accelerator binary itself.

Step 4: Generating the Executable for CPUs and Accelerators
 The API analyzer translates OSCAR API into runtime library calls or
  annotations for a sequential compiler
  Example: parallel sections → pthread_create()
 The sequential compiler then generates the executable for the target system

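A minimal sketch of that lowering (the section functions are illustrative
placeholders; only the parallel sections → pthread_create() mapping comes
from the slide):

/* Sketch of lowering "parallel sections" to pthreads. */
#include <pthread.h>
#include <stdio.h>

static void section_cpu0(void) { puts("section 0"); }  /* placeholder bodies */
static void section_cpu1(void) { puts("section 1"); }

static void *run_section_cpu1(void *arg)
{
    (void)arg;
    section_cpu1();
    return NULL;
}

int main(void)
{
    pthread_t t;
    /* One new thread per additional section ... */
    pthread_create(&t, NULL, run_section_cpu1, NULL);
    section_cpu0();           /* the first section runs on the main thread */
    pthread_join(t, NULL);    /* implicit barrier at the end of the sections */
    return 0;
}
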
Practical Application: Media Applications

Performance Evaluation
Performance and power consumption are evaluated in two settings:
 Using the Hitachi accelerator compiler for the FE-GA:
  Optical flow calculation (from OpenCV)
 Utilizing a hand-tuned library:
  Optical flow calculation (from Tohoku University *1)
  AAC encoding (from Renesas Electronics)
(*1) Hariyama et al., “Evaluation of a Heterogeneous Multi-Core Architecture
for Multimedia Applications,” ICD2008-139 (written in Japanese)

Evaluation Environment: RP-X Processor
 8 Renesas SH-4A cores as CPUs and 4 Hitachi FE-GA dynamically reconfigurable
  processors as accelerators (plus MX2 accelerators and a media IP), connected
  by two SHwy buses to DDR3 memory
 Each SH-4A core: CPU, FPU, DTU, MMU, I$ 32KB/core, D$ 32KB/core, ILRAM, LDM, DSM
 Frequency/voltage states: 648MHz@1.3V, 324MHz@1.16V, 162MHz@1.0V
 FE-GA internals (from the block diagram): a sequence manager, a configuration
  manager, a crossbar network, an ALU/MLT cell array (24 ALU cells and 8
  multiplier cells), 10 load-store cells, and local memory of 10 banks of
  compiled RAM (4-16KB, 2-port), with a bus interface and interrupt/DMA
  requests to the sub-CPU
Yuyama et al., “A 45nm 37.3GOPS/W heterogeneous multi-core SoC,” ISSCC 2010

Performance Using OSCAR and FE-GA Compilers and OSCAR API on RP-X
(Optical Flow from OpenCV)
Speedup against a single SH processor; the FE-GA binary is generated by the
Hitachi FE-GA compiler:
 1SH: 1.00, 2SH: 1.90, 4SH: 3.46, 8SH: 5.64
 2SH+1FE: 2.65, 4SH+2FE: 5.48, 8SH+4FE: 12.36

Performance Using OSCAR Compiler with a Hand-tuned Library and OSCAR API
on RP-X (AAC Encoding)
Speedups against a single SH processor; the hand-tuned FE-GA library is utilized:
 1SH: 1.00, 2SH: 1.98, 4SH: 3.68, 8SH: 6.32
 2SH+1FE: 4.44, 4SH+2FE: 8.39, 8SH+4FE: 16.08

Performance Using OSCAR Compiler with a Hand-tuned Library and API on RP-X
(Optical Flow from Tohoku University)
Speedups against a single SH processor; the hand-tuned FE-GA library is utilized:
 1SH: 1.00, 2SH: 2.29, 4SH: 3.09, 8SH: 5.40
 2SH+1FE: 18.85, 4SH+2FE: 26.71, 8SH+4FE: 32.65

Demonstration: Optical Flow Calculation
Sequential execution (1SH) versus parallel execution (8SH+4FE): 32x speedup

Power Reduction Using OSCAR Compiler with a Hand-tuned Library and OSCAR API
on RP-X (Optical Flow from Tohoku University)
Without power reduction: 1.7W; with power reduction: 0.5W (70% power reduction)

Power Reduction Using OSCAR Compiler with a Hand-tuned Library and OSCAR API
on RP-X (AAC Encoding from Renesas Electronics)
With power reduction versus without: 80% power reduction

Highly Practical Application: Dose Calculation Engine for Cancer Treatment

Particle Radiotherapy for Cancer Treatment
 Cancer is the leading cause of death in Japan
  One in three people die of cancer
  ⇒ Particle radiotherapy is an effective treatment
(Reference: NIRS, http://www.nirs.go.jp/index.shtml)

Particle Radiotherapy for Cancer Treatment: Details
 1. Taking CT pictures
 2. Inputting cancer information
 3. Physical simulation
 4. Planning
The physical simulation is time-consuming
⇒ the treatment is not covered by insurance
(Reference: NIRS, http://www.nirs.go.jp/index.shtml)

Automatic Parallelization by OSCAR Compiler: Overview
 Automatic parallelization by the OSCAR compiler and OSCAR API
  The program is clinically used
 Enhancing parallelism by code rewriting
  The original code can limit the potential parallelism
 ⇒ Performance evaluation on various SMP machines (IBM and Intel servers)

Simulation Flow and Profile Result
Simulation flow (developed by NIRS and Mitsubishi):
Initialization → Dose Calculation → Scatter Calculation → Modify Calculation
Profile result (Hitachi SR16000 system, Power7 4.00GHz):
 Dose: 92%, Scatter: 8%, Init/Modify: ~0%

An Overview of Dose and Scatter Calculation
 Dose Calculation: accumulates a dose value for each pencil beam
 Scatter Calculation: calculates the influence on neighboring voxels,
  considering "scatter"

Enhancing Parallelism for Dose Calculation
Original loop structure (over beams in the X-Y plane):
for (pencilBeams/n) {
  for (passedVoxels) {
    // Dose calc
  }
}
Rewriting for parallel execution: each processor (CPU0, CPU1, …) calculates
dose values into its own private array; after that, the private arrays are
accumulated into one array. A sketch of this rewrite follows below.

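Below is a minimal, self-contained sketch of this privatization-plus-reduction
rewrite; the array sizes and beam/voxel geometry are synthetic assumptions
(not the clinical code), and a sequential loop stands in for the parallel
threads:

/* Sketch of the privatized-array rewrite; all sizes are illustrative. */
#include <stdio.h>

#define NBEAMS   9371
#define NVOXELS  4096
#define NTHREADS 2

static double dose[NVOXELS];                 /* shared result    */
static double dose_priv[NTHREADS][NVOXELS];  /* one copy per CPU */

/* Each processor accumulates into its private array. */
static void dose_calc_thread(int tid)
{
    for (int b = tid; b < NBEAMS; b += NTHREADS) {
        /* A pencil beam touches a run of voxels; synthetic run here. */
        for (int v = b % NVOXELS; v < NVOXELS; v += 97)
            dose_priv[tid][v] += 1.0;        /* stands in for "Dose calc" */
    }
}

/* The private arrays are then accumulated into one array. */
static void reduce(void)
{
    for (int t = 0; t < NTHREADS; t++)
        for (int v = 0; v < NVOXELS; v++)
            dose[v] += dose_priv[t][v];
}

int main(void)
{
    for (int t = 0; t < NTHREADS; t++)   /* sequential stand-in for threads */
        dose_calc_thread(t);
    reduce();
    printf("dose[0] = %f\n", dose[0]);
    return 0;
}
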
Enhancing Parallelism for Scatter Calculation
Original loop structure (over the Z dimension):
for (Z/n) {
  for (Y) {
    for (X) {
      // Scatter value addition
    }
  }
}
The same rewrite as for the dose calculation (see the sketch above) is
applied: each processor adds scatter values into its own private array, and
the private arrays are then accumulated into one array.

Performance Evaluation on Hitachi HA8000/RS200 (Intel Xeon SMP)
 Target machine: Intel Xeon X5670 2.93GHz, using 1, 2, 6, and 12 CPUs
 Evaluated cases:
  1. OSCAR + Intel Composer XE 12.0 (-fast -openmp)
  2. Intel Composer XE 12.0 (-parallel -fast)
 Data size: 1.5mm

Performance on HA8000 (OSCAR + icc)
Speedup against 1 CPU, icc automatic parallelization versus OSCAR automatic
parallelization:
 OSCAR+icc: 1.00 (1CPU), 1.92 (2CPU), 4.84 (6CPU), 6.90 (12CPU)
 icc automatic parallelization stays around 1.2x in every configuration
 OSCAR+icc shows a 6.90x speedup with 12 CPUs
 icc does not exploit the parallelism

Performance Analysis on HA8000
 The OSCAR compiler shows higher scalability than icc: 6.90x with 12 CPUs
 The scatter calculation has to be accelerated to attain more performance
Speedup of dose and scatter calculation (1/2/6/12 CPUs):
 Dose calculation: 1.00, 1.95, 5.66, 10.35
 Scatter calculation: 1.00, 1.35, 1.84, 2.76

Performance Evaluation on Hitachi SR16000 (IBM Power7 SMP)
 Target machine: IBM Power7 4.0GHz, using 1, 32, and 64 CPUs
 Evaluated cases:
  1. OSCAR + IBM XLC Compiler 11.0 (-O4 -qsmp=omp -qarch=pwr7 -qmaxmem=-1)
  2. IBM XLC Compiler 11.0 (-O4 -qsmp=auto -qarch=pwr7 -qmaxmem=-1)
 Data size: 0.5mm

Performance on SR16000 (OSCAR + xlc)
Speedup against 1 CPU, xlc automatic parallelization versus OSCAR automatic
parallelization:
 OSCAR+xlc: 1.00 (1CPU), 30.00 (32CPU), 48.07 (64CPU)
 xlc automatic parallelization stays around 1.3x in every configuration
 OSCAR+xlc shows 48.07x with 64 CPUs
 xlc only exploits small loop parallelism

Performance Analysis on SR16000
 The OSCAR compiler shows higher scalability than xlc: 48.07x with 64 CPUs
 The scatter calculation has to be accelerated to attain more performance
Speedup of dose and scatter calculation (1/32/64 CPUs):
 Dose calculation: 1.00, 29.80, 56.45
 Scatter calculation: 1.00, 17.04, 21.31

Workload Balancing
Load-balancing issue (profile figures): the calculation amount per beam in the
dose calculation and per voxel in the scatter calculation is highly non-uniform
 ⇒ The scheduling policy has to be modified (an illustrative approach is
  sketched below)

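The slides do not spell out the revised policy; as a purely illustrative
remedy for this kind of skew, a cost-aware greedy assignment
(longest-processing-time-first) looks like this:

/* Illustrative LPT-style balancing; NOT the policy used in the thesis,
   which the slides leave unspecified. */
#include <stdlib.h>

typedef struct { int id; double cost; } Item;

static int by_cost_desc(const void *a, const void *b)
{
    double d = ((const Item *)b)->cost - ((const Item *)a)->cost;
    return (d > 0) - (d < 0);
}

/* Assign each item (beam or voxel block) to the currently least-loaded CPU,
   heaviest items first, so a few expensive items cannot swamp one core. */
void balance(Item *items, int n, int *owner, double *load, int ncpu)
{
    for (int c = 0; c < ncpu; c++) load[c] = 0.0;
    qsort(items, n, sizeof(Item), by_cost_desc);
    for (int i = 0; i < n; i++) {
        int best = 0;
        for (int c = 1; c < ncpu; c++)
            if (load[c] < load[best]) best = c;
        owner[items[i].id] = best;
        load[best] += items[i].cost;
    }
}
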
Result of Modifying the Scheduling Policy on SR16000
Speedup against 1 CPU:
 1CPU: 1.00
 32CPU: 28.09 → 30.27 with the revised policy
 64CPU: 49.93 → 55.10 with the revised policy

Conclusion: Programmability
Programming Wall → Automatic Parallelizing Compiler:
programmers can write a program in a sequential manner.
 Optical flow (OpenCV): 12x speedup with 8SH+4FE
 Optical flow (hand-tuned library): 32x speedup with 8SH+4FE
 AAC Encoder: 16x speedup with 8SH+4FE
 Dose calculation engine for particle therapy: 55x speedup with 64 CPUs

Conclusions: Power Consumption
Power Wall → Power Control by Compiler:
the compiler automatically reduces the power consumption.
 Optical flow (hand-tuned library): 70% power reduction for 8SH+4FE
 AAC Encoder: 80% power reduction for 8SH+4FE

Conclusions: Portability
Portability Wall → Multicore API:
the compiler generates source code with OSCAR API, demonstrated on
 Renesas RP-X heterogeneous multicore (SH-4A)
 Hitachi HA8000 (Intel Xeon)
 Hitachi SR16000 (IBM Power7)

Future Work
 Fully automatic parallelization of C programs
 Automatic detection of accelerator-executable parts
 Power-capping control
 Local memory management for heterogeneous multicores

Acknowledgement
 Deep gratitude to
  Dr. Hironori Kasahara (Waseda)
  Dr. Keiji Kimura (Waseda)
  Dr. Satoshi Goto (Waseda)
  Dr. Nozomu Togawa (Waseda)
 A part of this research has been supported by
  NEDO “Advanced Heterogeneous Multiprocessor”
  NEDO “Heterogeneous Multicore for Consumer Electronics”
  The MEXT project “Global COE Ambient SoC”

Acknowledgement
 Special thanks to
  Dr. Hirofumi Nakano (Waseda)
  Dr. Jun Shirako (Rice)
  Dr. Yasutaka Wada (Waseda)
  Dr. Takamichi Miyamoto (NEC)
  Dr. Fumiyo Takano (NEC)
  Dr. Masayoshi Mase (Hitachi)
  Mr. Sekiya Yamashita (Waseda)
  Mr. Hiroki Mikami (Waseda)
  Mr. Mamoru Shimaoka (Waseda)
  Mr. Yoshiya Hirase (Waseda)
  Mr. Yasir I. Al-Dosary (Waseda)
  Ms. Cecilia Gonzalez (Waseda)
  All students and alumni in the Kasahara-Kimura Laboratory

Acknowledgement
 Special thanks to
  Dr. Takeshi Ikenaga (Waseda)
  Mr. Keita Miyamura (IBM)
  Mr. Yoichi Matsuyama (Waseda)
  Ms. Miki Yajima (NTT Comware)
  All faculty, staff, and RAs in the CS department
  The great staff in the GCOE office
  WIZDOM
  Hiroshima Toyo Carp
  Friends in Osaka
  My parents and brother
  and you!

Thank you for your attention!

Step 3: Utilizing a Hand-tuned Library
 Accelerator compilers skip function calls whose names start with "oscarlib"

C with OSCAR API (input):
#pragma oscar accelerator_task_entry controller(2) oscartask_CTRL2_call_FFT
void oscartask_CTRL2_call_FFT(int var1, int *x)
{
  oscarlib_CTRL2_ACCEL3_FFT(var1, x);
}

Controller CPU code (provided by the library):
void oscarlib_CTRL2_ACCEL3_FFT(int x, int *v)
{
  // Data transfer
  // Kernel invocation
  // Data transfer
}
together with the accelerator binary.

Editor's Notes

1. Our goal is to realize fully automatic parallelization of C or Fortran programs for various heterogeneous multicores. We have been developing an automatic parallelizing compiler called OSCAR for homogeneous multicores. In our previous work we proposed the OSCAR homogeneous API ver. 1.0, which is available on this website, and a coarse-grain automatic parallelization scheme for heterogeneous multicores. Today I will talk about a general-purpose compilation framework that supports various types of heterogeneous multicores using accelerators. The framework includes hint directives for the OSCAR compiler, the OSCAR heterogeneous parallelizing compiler, and the OSCAR heterogeneous API, which is to be released soon.
2. Let me explain the overview of the OSCAR API. The input is a sequential Fortran or Parallelizable C program. Parallelizable C is a programming style for C that places some limits on pointer usage and similar constructs. The OSCAR compiler works as a source-to-source compiler: it generates a parallelized Fortran or C program with OSCAR API. After that, the backend compiler, consisting of the API analyzer and a sequential compiler, generates the target machine code.
3. Now let's talk about the compilation framework. It allows us to utilize an accelerator compiler and existing hand-tuned libraries with the OSCAR compiler. The first step is hint directive insertion for the OSCAR compiler: accelerator compilers or programmers specify the execution time, input/output variables, and other information about the accelerator-executable program parts. The second step is parallelization by the OSCAR compiler, which performs coarse-grain task generation, heterogeneous task scheduling, and power reduction, and then generates source code with OSCAR API for each core. The third step is accelerator binary generation: the accelerator compiler performs control code generation and accelerator binary generation. The final step is generating the executable.
4. We defined this architecture as the reference architecture supported by OSCAR API. The OSCAR compiler and API can generate parallelized programs for this architecture and subsets of it. The architecture may contain three kinds of processor elements: general-purpose processor cores, accelerator cores with controllers, and accelerator cores without controllers. General-purpose processor cores and accelerator cores with controllers may have a data transfer unit (a DMA controller), local program memory or an instruction cache, local data memory or a data cache, distributed shared memory, and a frequency/voltage control register. Existing heterogeneous multicores on the market can be seen as subsets of this architecture, so our methodology is applicable to various heterogeneous multicores from different vendors.
5. Let me show you the list of the OSCAR heterogeneous API. You can see the specification of OSCAR API ver. 1.0, the homogeneous multicore part, on the website. It is composed of a parallel execution API such as parallel sections, a memory mapping API such as threadprivate, a synchronization API, a timer API, a data transfer API, and a power control API. To support heterogeneous multicores, the accelerator_task_entry directive is newly added, and three more directives are added for non-coherent cache control.
6. Now for the details of the compilation flow. This is the example of the hint directives for the OSCAR compiler. These directives indicate that a loop or function can run on the specified accelerator. Accelerator compilers or programmers specify the execution time and input/output variables of each accelerator-executable program part.
7. The second step is automatic parallelization by the OSCAR compiler. First, the program is decomposed into three kinds of blocks called macro tasks: basic blocks, loop blocks, and function blocks. After that, the OSCAR compiler exploits the parallelism of the program: it builds a task graph called the macro-flow graph, which captures control flow and data dependencies, then analyzes the earliest executable conditions to obtain a macro-task graph, which shows the coarse-grain task parallelism.
8. Then the compiler schedules these tasks onto the CPUs and accelerators using the algorithm called CP/MISF. Green-colored tasks are those that can execute on at least one accelerator. Note that the algorithm will not assign a task to an accelerator when the target accelerator is busy and execution on a processor core would give a shorter execution time.
9. Then the compiler applies the power reduction. When the compiler finds idle time in the scheduled result, it inserts the instructions that change the frequency/voltage, considering the deadline.
10. Then the accelerator compiler generates the binary. The role of the accelerator compiler is to generate the data-transfer code between CPUs and accelerators in addition to the accelerator binary. The API shown in this figure, accelerator_task_entry, indicates that the specified function is called by the controller CPU.
11. What you see here is a demonstration of the optical flow calculation. On the left is sequential execution on 1 CPU; on the right is parallel execution with 8 CPUs and 4 accelerators. As the movie shows, we achieve a 32x speedup with the proposed framework.
12. (Mention the methods of the previous studies.)
13. (Note on the vertical axis.)