Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
5. Why OpenCL?
Source code lasts longer than platforms
[Images: NVIDIA GeForce GTX Titan, AMD Radeon, Intel Xeon Phi, AMD APU, Intel Core i7]
AMD Developer Summit 2013, 11/11-13
synergy.cs.vt.edu
6. The Goal
To take advantage of OpenCL's portability...
Without sacrificing man-years of existing code
7. CUDA and OpenCL APIs
CUDA Module → OpenCL Module
Thread → Contexts & Command Queues
Device → Platforms & Devices
Stream → Command Queues
Event → Events
Memory → Memory Objects
10. CUDA and OpenCL Data
CUDA → OpenCL
Vector types (e.g. float4) → host: cl_float4; kernel: float4
dim3 → size_t[3]
cudaStream_t → cl_command_queue
cudaEvent_t → cl_event
Device pointers (e.g. float* created through cudaMalloc) → cl_mem created through clCreateBuffer
cudaChannelFormat → cl_image_format
textureReference → cl_mem created through clCreateImage
cudaDeviceProp → no direct equivalent
17. The Problem
Manual Translation (weeks, months):
CUDA Source Code → OpenCL Source Code
(comic: xkcd.com)

Automatic Translation (seconds):
CUDA Source Code → CU2CL → OpenCL Source Code
22. Translation Is Easy ...
…when there is NO ambiguity in the translation between
languages (i.e., there is a direct mapping)
• High-level language → low-level representation, e.g., C →
LLVM
x*y+z→
%tmp = mul i32 %x, %y
%tmp2 = add i32 %tmp, %z
• Between languages, e.g., CUDA → OpenCL
__powf(x[threadIdx.x], y[threadIdx.y]) →
native_pow(x[get_local_id(0)], y[get_local_id(1)])
26. Translation is more difficult
…when there IS ambiguity (or lack of a direct
mapping) in the translation between languages
• Idiomatic Expressions
– “Putting all your eggs in one basket” → ?? in Spanish
– CUDA __threadfence() → OpenCL ??
• Dialects
– Latin American Spanish vs. Castilian Spanish → English
– CUDA Runtime API vs. CUDA Driver API → OpenCL
38. Kernel Code for Vector Add
CUDA
// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
OpenCL
// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}
56. AST-driven, String-based Rewriting
CUDA
__powf(x[threadIdx.x], y[threadIdx.y])

[Diagram: the AST identifies the call (Func) and its arguments (Arg), with threadIdx.x and threadIdx.y as Struct/Field references. __powf is rewritten to native_pow, the fields to get_local_id(0) and get_local_id(1), and the surrounding text "x[ ]" and "y[ ]" is left in place when the result is written out.]

OpenCL
native_pow(x[get_local_id(0)], y[get_local_id(1)])

Advantage: formatting remains intact → maintainable
62. Complex Semantic Conversions
1. Literal Parameters to Kernels
– CUDA pass-by-value invocations vs. OpenCL pass-by-reference
CUDA Kernel Launch
kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);
Naive OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);
Correct OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
66. Complex Semantic Conversions
2. Device Identification
– CUDA uses int, OpenCL uses the opaque cl_device_id
– To change devices in CUDA, use cudaSetDevice(int id)
– To change devices in OpenCL, use...
// scan all devices
// save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);
// get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
// load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);
– Implement our own handler to emulate and encapsulate
69. Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
– GEM – Molecular Modeling
– IZ PS – Neural Network
– Fen Zi – Molecular Dynamics
• 100k+ SLOC in total
72. Translator Coverage
                          CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
CUDA SDK Samples
  asyncAPI                     135              5                    96.3
  bandwidthTest                891              5                    98.9
  BlackScholes                 347             14                    96.0
  FastWalshTransform           327             30                    90.8
  matrixMul                    351              9                    97.4
  scalarProd                   251             18                    92.8
  vectorAdd                    147              0                   100
Rodinia
  Back Propagation             313             24                    92.3
  Breadth-First Search         306             35                    88.6
  Gaussian                     390             26                    93.3
  Hotspot                      328              2                    99.4
  Needleman-Wunsch             430              3                    99.3
Applications
  Fen Zi                     16787           1687                    89.9
  GEM                          524             15                    97.1
  IZ PS                       8042            166                    98.0
74. Translation Challenges Identified

Profiled Challenge          CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                  54.4                    29.4
Literal Parameters                  19.0                    23.5
Separate Compilation                54.4                    29.4
CUDA Libraries                      10.1                     0
Kernel Templates                    21.5                     0
Texture Memory                      27.8                    23.5
Graphics Interoperability           24.1                     0
Constant Memory                     17.7                    29.4
Shared Memory                       46.8                    70.6

Additional challenges identified:
• Kernel Function Pointer Invocations
• Preprocessor Effects
• Warp-level Synchronization
• Device Intrinsic Functions
• Device Buffer cl_mem Type Propagation
• #defined Function Definitions
• Device Buffers as Struct Members
• Arrays of Device Buffers
• Implicitly-Defined Kernel Functions
• Device-side Classes, Constructors, & Destructors
• Struct Alignment Attributes
• __threadfence()

Sathre, Gardner, Feng: "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation". ICPP Workshops 2012: 89-96.
Gardner, Feng, Sathre, Martinez: "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator". ParCo Special Issue 2013, to appear.
75. Translator Performance
[Two log-log scatter plots: CU2CL translation time (microseconds) and total translation time (seconds) versus source lines, for SDK samples, Rodinia samples, and large applications, each with a linear fit.]
Experimental Setup: AMD Phenom II X6 1090T (six-cores 3.2Ghz), 16 GB RAM, NVIDIA GeForce
GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
76. Translated Application Performance
[Bar chart: runtime in seconds (lower is better) of the original CUDA and translated OpenCL versions of GEM, Needleman-Wunsch, Hotspot, Gaussian, BFS, Back Propagation, vectorAdd, scalarProd, matrixMul, FastWalshTransform, BlackScholes, bandwidthTest, and asyncAPI.]
Note: all runs on same Nvidia GPU for fair comparison purposes
77. CU2CL Reliability
[Stacked bar chart: percentage of CUDA SDK samples and Rodinia samples whose translation failed, was partial, or was complete, before and after upgrades.]
Improvements in the latest round: Clang 3.2, main() method handling, template handling, OpenGL #defined function handling, separately declared and defined function handling, kernel pointer invocation handling.
Increased reliability in translating samples after the latest round of improvements.
79. CU2CL Roadmap & Future Work
CU2CL Alpha (2011)
• Well-designed scaffold

CU2CL Beta (2013)
• Improved robustness, CUDA coverage, and reliability
• Analysis and profiling of difficult-to-translate CUDA structures

CU2CL w/ Functional Portability
• Expand CUDA coverage: shared, const, texture memory; Driver API; OpenGL
• Handling unmapped CUDA structs / behaviors: warp sync

CU2CL w/ Performance Portability
• Automatic de-optimization
• Device-agnostic optimization
• Device-specific optimization

What about CUDA to HSA?
80. Related Work
Swan
– High-level abstraction API, links to either OpenCL or CUDA
implementation
Ocelot & Caracal
– Translate NVIDIA PTX IR to other device IRs
CUDAtoOpenCL
– Source to source translator, based on Cetus
84. CU2CL Conclusions
• Status
– What used to take months by hand takes seconds
• 90%+ successful translation
• Negligible difference in performance
• Challenges
– CUDA functionality missing in OpenCL
• __threadfence()
– Equivalent libraries needed in OpenCL
• cuFFT, MAGMA, cuBLAS
– Implicit semantics
• Implicit synchronization across warps
• What's Next?
– Improved functional portability
– Support for performance portability
85. Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155
via the NSF Center for High-Performance Reconfigurable
Computing (CHREC).