Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
5. Why OpenCL?
Source code lasts longer than platforms
[Images: NVIDIA GeForce GTX Titan, AMD Radeon, Intel Xeon Phi, AMD APU, Intel Core i7]
AMD Developer Summit 2013, 11/11-13
synergy.cs.vt.edu
6. The Goal
To take advantage of OpenCL's portability...
Without sacrificing man-years of existing code
7. CUDA and OpenCL APIs
CUDA Module → OpenCL Module
Thread → Contexts & Command Queues
Device → Platforms & Devices
Stream → Command Queues
Event → Events
Memory → Memory Objects
10. CUDA and OpenCL Data
CUDA → OpenCL
Vector types (e.g. float4) → host: cl_float4; kernel: float4
dim3 → size_t[3]
cudaStream_t → cl_command_queue
cudaEvent_t → cl_event
Device pointers (e.g. float* created through cudaMalloc) → cl_mem created through clCreateBuffer
cudaChannelFormat → cl_image_format
textureReference → cl_mem created through clCreateImage
cudaDeviceProp → no direct equivalent
17. The Problem
Manual Translation (weeks, months):
CUDA Source Code → OpenCL Source Code
(comic: xkcd.com)

Automatic Translation (seconds):
CUDA Source Code → CU2CL → OpenCL Source Code
22. Translation Is Easy ...
…when there is NO ambiguity in the translation between
languages (i.e., there is a direct mapping)
• High-level language → low-level representation, e.g., C →
LLVM
x*y+z→
%tmp = mul i32 %x, %y
%tmp2 = add i32 %tmp, %z
• Between languages, e.g., CUDA → OpenCL
__powf(x[threadIdx.x], y[threadIdx.y]) →
native_pow(x[get_local_id(0)], y[get_local_id(1)])
26. Translation is more difficult
…when there IS ambiguity (or lack of a direct
mapping) in the translation between languages
• Idiomatic Expressions
– “Putting all your eggs in one basket” → ?? in Spanish
– CUDA __threadfence() → OpenCL ??
• Dialects
– Latin American Spanish vs. Castilian Spanish → English
– CUDA Runtime API vs. CUDA Driver API → OpenCL
38. Kernel Code for Vector Add
CUDA
// Device code
__global__ void VecAdd(const float* A, const float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
OpenCL
// Device code
__kernel void VecAdd(const __global float* A, const __global float* B,
                     __global float* C, int N) {
    int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
    if (i < N)
        C[i] = A[i] + B[i];
}
56. AST-driven, String-based Rewriting
CUDA
__powf(x[threadIdx.x], y[threadIdx.y])

[Diagram: the AST identifies the call (Func) and its arguments (Arg), with threadIdx.x and threadIdx.y as Struct/Field references. __powf is rewritten to native_pow, the fields to get_local_id(0) and get_local_id(1), and the surrounding text "x[ ]" and "y[ ]" is left in place when the result is written out.]

OpenCL
native_pow(x[get_local_id(0)], y[get_local_id(1)])

Advantage: formatting remains intact → maintainable
62. Complex Semantic Conversions
1. Literal Parameters to Kernels
– CUDA pass-by-value invocations vs. OpenCL pass-by-reference
CUDA Kernel Launch
kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);
Naive OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);
Correct OpenCL Translation
clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
int __cu2cl_Kernel_kernel_arg_2 = 256;
clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
66. Complex Semantic Conversions
2. Device Identification
– CUDA uses int, OpenCL uses the opaque cl_device_id
– To change devices in CUDA, use cudaSetDevice(int id)
– To change devices in OpenCL, use...
// scan all devices
// save old platform, device, context, queue, program, & kernels
myDevice = allDevices[id];
clGetDeviceInfo(...);
// get new device's platform
myContext = clCreateContext(...);
myQueue = clCreateCommandQueue(...);
// load program source
clBuildProgram(...);
myKernel = clCreateKernel(...);
– Implement our own handler to emulate and encapsulate
69. Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
– GEM – Molecular Modeling
– IZ PS – Neural Network
– Fen Zi – Molecular Dynamics
• 100k+ SLOC in total
72. Translator Coverage
                          CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
CUDA SDK Samples
  asyncAPI                     135              5                    96.3
  bandwidthTest                891              5                    98.9
  BlackScholes                 347             14                    96.0
  FastWalshTransform           327             30                    90.8
  matrixMul                    351              9                    97.4
  scalarProd                   251             18                    92.8
  vectorAdd                    147              0                   100
Rodinia
  Back Propagation             313             24                    92.3
  Breadth-First Search         306             35                    88.6
  Gaussian                     390             26                    93.3
  Hotspot                      328              2                    99.4
  Needleman-Wunsch             430              3                    99.3
Applications
  Fen Zi                     16787           1687                    89.9
  GEM                          524             15                    97.1
  IZ PS                       8042            166                    98.0
74. Translation Challenges Identified

Profiled Challenge          CUDA SDK Frequency (%)   Rodinia Frequency (%)
Device Identifiers                  54.4                    29.4
Literal Parameters                  19.0                    23.5
Separate Compilation                54.4                    29.4
CUDA Libraries                      10.1                     0
Kernel Templates                    21.5                     0
Texture Memory                      27.8                    23.5
Graphics Interoperability           24.1                     0
Constant Memory                     17.7                    29.4
Shared Memory                       46.8                    70.6

Additional challenges identified:
• Kernel Function Pointer Invocations
• Preprocessor Effects
• Warp-level Synchronization
• Device Intrinsic Functions
• Device Buffer cl_mem Type Propagation
• #defined Function Definitions
• Device Buffers as Struct Members
• Arrays of Device Buffers
• Implicitly-Defined Kernel Functions
• Device-side Classes, Constructors, & Destructors
• Struct Alignment Attributes
• __threadfence()

Sathre, Gardner, Feng: "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation". ICPP Workshops 2012: 89-96.
Gardner, Feng, Sathre, Martinez: "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator". ParCo Special Issue 2013, to appear.
75. Translator Performance
[Two log-log scatter plots: CU2CL translation time (microseconds) and total translation time (seconds) versus source lines, for SDK samples, Rodinia samples, and large applications, each with a linear fit.]
Experimental Setup: AMD Phenom II X6 1090T (six-cores 3.2Ghz), 16 GB RAM, NVIDIA GeForce
GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
76. Translated Application Performance
[Bar chart: runtime in seconds (lower is better) of the original CUDA and translated OpenCL versions of GEM, Needleman-Wunsch, Hotspot, Gaussian, BFS, Back Propagation, vectorAdd, scalarProd, matrixMul, FastWalshTransform, BlackScholes, bandwidthTest, and asyncAPI.]
Note: all runs on same Nvidia GPU for fair comparison purposes
77. CU2CL Reliability
[Stacked bar chart: percentage of CUDA SDK samples and Rodinia samples whose translation failed, was partial, or was complete, before and after upgrades.]
Improvements in the latest round: Clang 3.2, main() method handling, template handling, OpenGL #defined function handling, separately declared and defined function handling, kernel pointer invocation handling.
Increased reliability in translating samples after the latest round of improvements.
79. CU2CL Roadmap & Future Work
CU2CL Alpha (2011)
• Well-designed scaffold

CU2CL Beta (2013)
• Improved robustness, CUDA coverage, and reliability
• Analysis and profiling of difficult-to-translate CUDA structures

CU2CL w/ Functional Portability
• Expand CUDA coverage: shared, const, texture memory; Driver API; OpenGL
• Handling unmapped CUDA structs / behaviors: warp sync

CU2CL w/ Performance Portability
• Automatic de-optimization
• Device-agnostic optimization
• Device-specific optimization

What about CUDA to HSA?
80. Related Work
Swan
– High-level abstraction API, links to either OpenCL or CUDA
implementation
Ocelot & Caracal
– Translate NVIDIA PTX IR to other device IRs
CUDAtoOpenCL
– Source to source translator, based on Cetus
84. CU2CL Conclusions
• Status
– What used to take months by hand takes seconds
• 90%+ successful translation
• Negligible difference in performance
• Challenges
– CUDA functionality missing in OpenCL
• __threadfence()
– Equivalent libraries needed in OpenCL
• cuFFT, MAGMA, cuBLAS
– Implicit semantics
• Implicit synchronization across warps
• What's Next?
– Improved functional portability
– Support for performance portability
85. Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155
via the NSF Center for High-Performance Reconfigurable
Computing (CHREC).