Automated CUDA-to-OpenCL
Translation with CU2CL:
What’s Next?

Wu Feng and Mark Gardner
Virginia Tech
2013-11-12
synergy.c...
Why OpenCL?

http://www2.pcmag.com/media/images/375584-nvidia-geforce-gtx-titan.jpg?thumb=y

http://www.amd.com/Publishi...
Source code lasts longer than platforms

The Goal
To take advantage of OpenCL's portability...

http://people.emich.edu/akavetsk/424/scribeatdesk_1.jpg

Without sa...
CUDA and OpenCL APIs
CUDA Module → OpenCL Module
Thread → Contexts & Command Queues
Device → Platforms & Devices
Stream → C...
CUDA and OpenCL Data
CUDA → OpenCL
Vector types (e.g. float4) → Host: cl_float4; Kernel: float4
dim3 → size_t[3]
cudaStream...
CUDA and OpenCL
Execution and Memory Models

synergy.cs.vt.edu
The Problem
Manual Translation (weeks, months): CUDA Source Code → OpenCL Source Code
Automatic Translation...
(Image: xkcd.com)

AMD Developer Summit 2013/11/12
Forecast

http://www.weather.com/weather/5-day/San+Jose+CA+USCA0993:1:US

• Observations about Translating
• Exampl...
Translation Is Easy ...
…when there is NO ambiguity in the translation between
languages (i.e., there is a direct mapping)...
Translation is more difficult
…when there IS ambiguity (or lack of a direct
mapping) in the translation between languages
...
CUDA and OpenCL

http://www.dragon1.com/images/examples.jpg

CUDA Initialization Code

None
(Implicit)

Dialect: CUDA runtime API
OpenCL Initialization Code
Explicit
//get a platform and device, set up a context and command queue
clGetPlatformIDs(1, &_...
CUDA Kernel Invocation
// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(uiWC / threads.x, uiH...
OpenCL Kernel Invocation
// setup execution parameters
size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
size_t grid[3] = {...
Kernel Code for Vector Add
CUDA
// Device code
__global__ void VecAdd(const float* A, const float* B, float*
C, int N) {
i...
CU2CL Architecture

http://dotsconnectedkat.files.wordpress.com/2011/02/agrigento.jpg

Compilation Process
Preprocessor
Source
Code

Lexer

Preprocessed
Code

Semantic
Analyzer

Parser
Tokenized
Code

Parse
Tr...
AST-driven, String-based Rewriting
CUDA

OpenCL

__powf(x[threadIdx.x], y[threadIdx.y])

native_pow(x[get_local_id(0)], y[...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])

Func

OpenCL

native_pow(x[get_local_id(0...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])

Func
Arg
Arg

OpenCL

native_pow(x[get_lo...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])

Func
Arg
Arg

OpenCL

Struct
Struct

nati...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])

Func
Arg

Field

Arg

OpenCL

Struct
Stru...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])

Func
Arg

Field

0

Arg

OpenCL

Struct
S...
AST-driven, String-based Rewriting
CUDA

__powf(x[threadIdx.x], y[threadIdx.y])
get_local_id( )
get_local_id( )

Func
Arg
...
Complex Semantic Conversions
1. Literal Parameters to Kernels
– CUDA pass-by-value invocations vs. OpenCL pass-by-referenc...
Complex Semantic Conversions
2. Device Identification
– CUDA uses int, OpenCL uses opaque cl_device
– To change devices in...
CU2CL Evaluation

Image: http://learn.cvuhs.org/file.php/1427/scales_of_justice2.jpg

Test Code
• 79 CUDA SDK Samples
• 17 Rodinia Samples
• Applications
– GEM – Molecular Modeling
– IZ PS – Neural Network
– ...
Translator Coverage
[Table: columns "OpenCL Lines Changed" and "Percent Automatically Translated"; row values not recoverable]
Translation Challenges
Identified
[Table: columns "Profiled Challenge", "CUDA SDK Frequency (%)", "Rodinia Frequency (%)"; rows not recoverable]
Translator Performance
[Figure: axes "Total Translation Time (s)" and "CU2CL Translation Time (..."; tick values and fit annotations not recoverable]
Translated Application Performance
[Figure: Time (s), CUDA vs. OpenCL, for SDK Samples and Rodinia Samples; "Lower is Bette..."]
CU2CL Reliability
[Figure: percentage axis from 0% to 100%; labels include "Before Upgrades" and "CUDA SDK Sampl..."; values not recoverable]
CU2CL Roadmap & Future Work
CU2CL
Alpha
(2011)
Well-designed
scaffold

CU2CL
Beta
(2013)
Improved Robustness,
CUDA Coverag...
Related Work
Swan
– High-level abstraction API, links to either OpenCL or CUDA
implementation

Ocelot & Caracal
– Translat...
CU2CL Conclusions
• Status
– What used to take months by hand takes seconds
• 90+ successful translation
• Negligible diff...
Acknowledgements
Students: Gabriel Martinez, Paul Sathre
This work was supported in part by NSF I/UCRC IIP-0804155
via...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner

Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.


  1. 1. Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next? Wu Feng and Mark Gardner Virginia Tech 2013-11-12 synergy.cs.vt.edu
  2. 2. Why OpenCL? http://www2.pcmag.com/media/images/375584-nvidia-geforce-gtx-titan.jpg?thumb=y http://www.amd.com/PublishingImages/Public/Photograph_ProductShots/375WPNG/61979.png http://www.hardwarezone.com.sg/files/img/2012/06/Xeon_Phi_PCIe_Card_M.jpg http://www.thinkcomputers.org/articles/ces11_amd/main.jpg http://www.bjorn3d.com/Material/revimages/cpu/Core_I7_965/New_Core_I7.jpg AMD Developer Summit 2013/11/12 synergy.cs.vt.edu
  5. 5. Why OpenCL? Source code lasts longer than platforms (image credits as on slide 2)
  6. 6. The Goal: To take advantage of OpenCL's portability... without sacrificing man-years of existing code (image: http://people.emich.edu/akavetsk/424/scribeatdesk_1.jpg)
  7. 7. CUDA and OpenCL APIs (CUDA Module → OpenCL Module)
      Thread → Contexts & Command Queues
      Device → Platforms & Devices
      Stream → Command Queues
      Event → Events
      Memory → Memory Objects
  10. 10. CUDA and OpenCL Data (CUDA → OpenCL)
      Vector types (e.g. float4) → Host: cl_float4; Kernel: float4
      dim3 → size_t[3]
      cudaStream_t → cl_command_queue
      cudaEvent_t → cl_event
      Device pointers (e.g. float* created through cudaMalloc) → cl_mem created through clCreateBuffer
      cudaChannelFormat → cl_image_format
      textureReference → cl_mem created through clCreateImage
      cudaDeviceProp → No direct equivalent
  14. 14. CUDA and OpenCL Execution and Memory Models synergy.cs.vt.edu
  15. 15. The Problem
  16. 16. The Problem: Manual Translation (weeks, months): CUDA Source Code → OpenCL Source Code (image: xkcd.com)
  17. 17. The Problem: Manual Translation (weeks, months): CUDA Source Code → OpenCL Source Code (image: xkcd.com). Automatic Translation (seconds): CU2CL
  18. 18. Forecast (image: http://www.weather.com/weather/5-day/San+Jose+CA+USCA0993:1:US)
      • Observations about Translating
      • Examples: CUDA and OpenCL constructs
      • CU2CL Architecture
      • Current State of CU2CL: Robustness and Performance
      • Future Directions
  19. 19. Translation Is Easy ...
  20. 20. Translation Is Easy ... …when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
  21. 21. Translation Is Easy ... …when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
      • High-level language → low-level representation, e.g., C → LLVM
        x*y+z → %tmp = mul i32 %x, %y
                %tmp2 = add i32 %tmp, %z
  22. 22. Translation Is Easy ... …when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping)
      • High-level language → low-level representation, e.g., C → LLVM
        x*y+z → %tmp = mul i32 %x, %y
                %tmp2 = add i32 %tmp, %z
      • Between languages, e.g., CUDA → OpenCL
        __powf(x[threadIdx.x], y[threadIdx.y]) → native_pow(x[get_local_id(0)], y[get_local_id(1)])
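A direct mapping of this kind can be pictured as a plain lookup table. The following is a minimal sketch in C, not CU2CL's actual table: the struct, the function name `map_cuda_builtin`, and the choice of entries are assumptions made for illustration. The individual correspondences (for example `threadIdx.x` to `get_local_id(0)`) are the ones the slides use, plus `gridDim.x` to `get_num_groups(0)`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One CUDA device-side name and its OpenCL C counterpart. */
typedef struct {
    const char *cuda;
    const char *opencl;
} BuiltinMapping;

/* Illustrative subset only; this is not CU2CL's real table. */
static const BuiltinMapping kBuiltinMap[] = {
    {"threadIdx.x", "get_local_id(0)"},
    {"threadIdx.y", "get_local_id(1)"},
    {"blockIdx.x",  "get_group_id(0)"},
    {"blockDim.x",  "get_local_size(0)"},
    {"gridDim.x",   "get_num_groups(0)"},
    {"__powf",      "native_pow"},
};

/* Returns the OpenCL spelling for a CUDA name, or NULL when the name has no
 * direct mapping in this table. */
const char *map_cuda_builtin(const char *cuda_name) {
    assert(cuda_name != NULL);
    size_t n = sizeof(kBuiltinMap) / sizeof(kBuiltinMap[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(kBuiltinMap[i].cuda, cuda_name) == 0) {
            return kBuiltinMap[i].opencl;
        }
    }
    return NULL;
}
```

A NULL result is the interesting case: it marks a construct that needs more than a one-to-one substitution.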
  23. 23. Translation is more difficult
  24. 24. Translation is more difficult …when there IS ambiguity (or lack of a direct mapping) in the translation between languages
  25. 25. Translation is more difficult …when there IS ambiguity (or lack of a direct mapping) in the translation between languages
      • Idiomatic Expressions
        – “Putting all your eggs in one basket” → ?? in Spanish
        – CUDA __threadfence() → OpenCL ??
  26. 26. Translation is more difficult …when there IS ambiguity (or lack of a direct mapping) in the translation between languages
      • Idiomatic Expressions
        – “Putting all your eggs in one basket” → ?? in Spanish
        – CUDA __threadfence() → OpenCL ??
      • Dialects
        – Latin American Spanish vs. Castilian Spanish → English
        – CUDA Runtime API vs. CUDA Driver API → OpenCL
  27. 27. CUDA and OpenCL (image: http://www.dragon1.com/images/examples.jpg)
  28. 28. CUDA Initialization Code: None (Implicit). Dialect: CUDA runtime API
  29. 29. OpenCL Initialization Code: Explicit
      //get a platform and device, set up a context and command queue
      clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
      clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
      __cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
      __cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);
      //read kernel source from disk (the slide's undeclared "len" is written as progLen here)
      FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
      fseek(f, 0, SEEK_END);
      size_t progLen = (size_t) ftell(f);
      const char * progSrc = (const char *) malloc(sizeof(char)*progLen);
      rewind(f);
      fread((void *) progSrc, progLen, 1, f);
      fclose(f);
      //build device program and kernel
      __cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
      free((void *) progSrc);
      clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
      __cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);
  32. 32. CUDA Kernel Invocation
      // setup execution parameters
      dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
      dim3 grid(uiWC / threads.x, uiHC / threads.y);
      // execute the kernel
      int nIter = 30;
      for (int j = 0; j < nIter; j++) {
          matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
      }
  35. 35. OpenCL Kernel Invocation
      // setup execution parameters
      size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
      size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};
      // execute the kernel
      int nIter = 30;
      for (int j = 0; j < nIter; j++) {
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
          localWorkSize[0] = threads[0];
          localWorkSize[1] = threads[1];
          localWorkSize[2] = threads[2];
          globalWorkSize[0] = grid[0]*localWorkSize[0];
          globalWorkSize[1] = grid[1]*localWorkSize[1];
          globalWorkSize[2] = grid[2]*localWorkSize[2];
          clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
      }
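The work-size arithmetic in the invocation above is where the two launch models differ: a CUDA grid counts blocks, while an OpenCL global work size counts individual work-items. A small sketch of that conversion, pulled out into a helper, might look as follows. The function name `cuda_launch_to_ndrange` is hypothetical and is not part of CU2CL's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Converts a CUDA-style launch configuration (blocks per grid, threads per
 * block) into OpenCL NDRange sizes. The OpenCL global size counts work-items,
 * not work-groups, so each grid dimension is multiplied by the block size. */
void cuda_launch_to_ndrange(const size_t grid[3], const size_t block[3],
                            size_t global_work_size[3],
                            size_t local_work_size[3]) {
    for (int d = 0; d < 3; ++d) {
        assert(grid[d] > 0 && block[d] > 0);
        local_work_size[d] = block[d];
        global_work_size[d] = grid[d] * block[d];
    }
}
```

For a 4 x 2 grid of 16 x 16 blocks, this yields a global size of 64 x 32 x 1 and a local size of 16 x 16 x 1.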
  38. 38. Kernel Code for Vector Add
      CUDA
      // Device code
      __global__ void VecAdd(const float* A, const float* B, float* C, int N) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < N)
              C[i] = A[i] + B[i];
      }
      OpenCL
      // Device code
      __kernel void VecAdd(const __global float* A, const __global float* B, __global float* C, int N) {
          int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
          if (i < N)
              C[i] = A[i] + B[i];
      }
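Both kernels compute the same flat index from a group number and a position within the group. To make that concrete without a GPU, here is a host-side emulation in plain C. It is only a sketch: the function `vec_add_emulated` and its sequential loops are assumptions for illustration, and a real device runs the work-items in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of the VecAdd kernel: loops over work-groups and
 * work-items sequentially, using the same index formula as the kernels. */
void vec_add_emulated(const float *A, const float *B, float *C, int N,
                      size_t local_size, size_t num_groups) {
    assert(local_size > 0);
    for (size_t group_id = 0; group_id < num_groups; ++group_id) {
        for (size_t local_id = 0; local_id < local_size; ++local_id) {
            /* CUDA:   blockDim.x * blockIdx.x + threadIdx.x
             * OpenCL: get_local_size(0) * get_group_id(0) + get_local_id(0) */
            int i = (int)(local_size * group_id + local_id);
            if (i < N) { /* guard: the last group may run past N */
                C[i] = A[i] + B[i];
            }
        }
    }
}
```

With N = 5, a group size of 4, and 2 groups, eight work-items are emulated and the `i < N` guard discards the last three.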
  42. 42. CU2CL Architecture (image: http://dotsconnectedkat.files.wordpress.com/2011/02/agrigento.jpg)
  43. 43. Compilation Process
  44. 44. Compilation Process: Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary (diagram labels: Clang, LLVM)
  45. 45. Compilation Process: Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary (diagram labels: Clang, LLVM). Martinez, Gardner, and Feng, “CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures,” IEEE ICPADS 2011
  46. 46. AST-driven, String-based Rewriting CUDA OpenCL __powf(x[threadIdx.x], y[threadIdx.y]) native_pow(x[get_local_id(0)], y[get_local_id(1)])
  47. 47. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func OpenCL native_pow(x[get_local_id(0)], y[get_local_id(1)])
  48. 48. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func Arg Arg OpenCL native_pow(x[get_local_id(0)], y[get_local_id(1)])
  49. 49. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func Arg Arg OpenCL Struct Struct native_pow(x[get_local_id(0)], y[get_local_id(1)])
  50. 50. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func Arg Field Arg OpenCL Struct Struct Field native_pow(x[get_local_id(0)], y[get_local_id(1)])
  51. 51. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func Arg Field Arg OpenCL Struct Struct Field native_pow(x[get_local_id(0)], y[get_local_id(1)])
  52. 52. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) Func Arg Field 0 Arg OpenCL Struct Struct Field 1 native_pow(x[get_local_id(0)], y[get_local_id(1)])
  53. 53. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) get_local_id( ) get_local_id( ) Func Arg Field 0 Arg OpenCL Struct Struct Field 1 native_pow(x[get_local_id(0)], y[get_local_id(1)])
  54. 54. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) get_local_id( ) get_local_id( ) Func Arg Struct Field 0 Arg Struct Field 1 x[ OpenCL ] y[ ] native_pow(x[get_local_id(0)], y[get_local_id(1)])
  55. 55. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) get_local_id( ) get_local_id( ) Func Arg OpenCL Field 0 Arg native_pow Struct Struct Field 1 x[ ] y[ ] native_pow(x[get_local_id(0)], y[get_local_id(1)])
  56. 56. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) get_local_id( ) get_local_id( ) Func Arg OpenCL Field 0 Arg native_pow Struct Struct Field 1 x[ Write Out ] y[ ] native_pow(x[get_local_id(0)], y[get_local_id(1)])
  57. 57. AST-driven, String-based Rewriting CUDA __powf(x[threadIdx.x], y[threadIdx.y]) get_local_id( ) get_local_id( ) Func Arg OpenCL Field 0 Arg native_pow Struct Struct Field 1 x[ Write Out ] y[ ] native_pow(x[get_local_id(0)], y[get_local_id(1)]) Advantage: formatting remains intact → maintainable
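The idea behind string-based rewriting is that the AST only supplies source ranges to replace, and everything outside those ranges is copied through byte for byte, which is why the original formatting survives. The sketch below illustrates that step in plain C under stated assumptions: the `Rewrite` struct, `apply_rewrites`, and the hard-coded offsets are hypothetical. The slides say CU2CL is built on Clang, but this code does not use or reproduce Clang's rewriting API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A replacement for the source range [offset, offset + length). */
typedef struct {
    size_t offset;
    size_t length;
    const char *text;
} Rewrite;

/* Applies rewrites (sorted by offset, non-overlapping) to src and writes the
 * result to out. Text outside the rewritten ranges is copied unchanged.
 * Returns 0 on success, -1 if out is too small or a range is invalid. */
int apply_rewrites(const char *src, const Rewrite *rewrites, size_t count,
                   char *out, size_t out_cap) {
    assert(src != NULL && out != NULL);
    size_t src_len = strlen(src);
    size_t pos = 0;  /* next unread byte of src */
    size_t used = 0; /* bytes written to out so far */
    for (size_t i = 0; i <= count; ++i) {
        /* Copy the untouched text before rewrite i (or the tail at the end). */
        size_t stop = (i < count) ? rewrites[i].offset : src_len;
        if (stop < pos || stop > src_len) return -1;
        if (used + (stop - pos) >= out_cap) return -1;
        memcpy(out + used, src + pos, stop - pos);
        used += stop - pos;
        pos = stop;
        if (i == count) break;
        /* Emit the replacement text and skip the original range. */
        size_t text_len = strlen(rewrites[i].text);
        if (pos + rewrites[i].length > src_len) return -1;
        if (used + text_len >= out_cap) return -1;
        memcpy(out + used, rewrites[i].text, text_len);
        used += text_len;
        pos += rewrites[i].length;
    }
    out[used] = '\0';
    return 0;
}
```

For the slide's example, three rewrites suffice: `__powf` at offset 0 (length 6), `threadIdx.x` at offset 9 (length 11), and `threadIdx.y` at offset 25 (length 11). The brackets, comma, and spacing between them are never touched.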
  62. 62. Complex Semantic Conversions
      1. Literal Parameters to Kernels
      – CUDA pass-by-value invocations vs. OpenCL pass-by-reference
      CUDA Kernel Launch
          kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);
      Naive OpenCL Translation (does not compile: takes the address of an expression and of a literal)
          clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
          clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
          clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);
      Correct OpenCL Translation
          clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
          float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
          clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
          int __cu2cl_Kernel_kernel_arg_2 = 256;
          clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
  66. 66. Complex Semantic Conversions
      2. Device Identification
      – CUDA uses int, OpenCL uses opaque cl_device_id
      – To change devices in CUDA, use cudaSetDevice(int id)
      – To change devices in OpenCL, use...
          //scan all devices
          //save old platform, device, context, queue, program, & kernels
          myDevice = allDevices[id];
          clGetDeviceInfo(...); //get new device's platform
          myContext = clCreateContext(...);
          myQueue = clCreateCommandQueue(...);
          //load program source
          clBuildProgram(...);
          myKernel = clCreateKernel(...);
      – Solution: implement our own handler to emulate cudaSetDevice and encapsulate these steps
  67. 67. CU2CL Evaluation
      Image: http://learn.cvuhs.org/file.php/1427/scales_of_justice2.jpg
  69. 69. Test Code
      • 79 CUDA SDK Samples
      • 17 Rodinia Samples
      • Applications
        – GEM: Molecular Modeling
        – IZ PS: Neural Network
        – Fen Zi: Molecular Dynamics
      • 100k+ SLOC in total
  72. 72. Translator Coverage
      Source        Application            CUDA Lines   OpenCL Lines Changed   Percent Automatically Translated
      SDK Samples   asyncAPI                      135                      5                               96.3
                    bandwidthTest                 891                      5                               98.9
                    BlackScholes                  347                     14                               96.0
                    FastWalshTransform            327                     30                               90.8
                    matrixMul                     351                      9                               97.4
                    scalarProd                    251                     18                               92.8
                    vectorAdd                     147                      0                              100
      Rodinia       Back Propagation              313                     24                               92.3
                    Breadth-First Search          306                     35                               88.6
                    Gaussian                      390                     26                               93.3
                    Hotspot                       328                      2                               99.4
                    Needleman-Wunsch              430                      3                               99.3
      Application   Fen Zi                      17768                   1786                               89.9
                    GEM                           524                     15                               97.1
                    IZ PS                        8402                    166                               98.0
  74. 74. Translation Challenges Identified
      Profiled Challenge          CUDA SDK Frequency (%)   Rodinia Frequency (%)
      Device Identifiers                            54.4                    29.4
      Literal Parameters                            19.0                    23.5
      Separate Compilation                          54.4                    29.4
      CUDA Libraries                                10.1                     0
      Kernel Templates                              21.5                     0
      Texture Memory                                27.8                    23.5
      Graphics Interoperability                     24.1                     0
      Constant Memory                               17.7                    29.4
      Shared Memory                                 46.8                    70.6
      Other challenges identified:
      • Kernel Function Pointer Invocations
      • Preprocessor Effects
      • Warp-level Synchronization
      • Device Intrinsic Functions
      • Device Buffer cl_mem Type Propagation
      • #defined Function Definitions
      • Device Buffers as Struct Members
      • Arrays of Device Buffers
      • Implicitly-Defined Kernel Functions
      • Device-side Classes, Constructors, & Destructors
      • Struct Alignment Attributes
      • __threadfence()
      Sathre, Gardner, Feng: “Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation”. ICPP Workshops 2012: 89-96
      Gardner, Feng, Sathre, Martinez: “Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator”. ParCo Special Issue 2013, to appear
  75. 75. Translator Performance
      [Figure: two log-scale plots of translation time versus source lines for SDK Samples, Rodinia Samples, and Large Applications, with linear fits for the SDK and Rodinia samples. Panels: Total Translation Time (s); CU2CL Translation Time (microseconds). X-axis: Source Lines.]
      Experimental Setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04
  76. 76. Translated Application Performance
      [Figure: bar chart of run time (s), CUDA vs. OpenCL; lower is better. SDK Samples: asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd. Rodinia Samples: backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch. Application: GEM.]
      Note: all runs on the same NVIDIA GPU for a fair comparison
  77. 77. CU2CL Reliability
      [Figure: stacked horizontal bars (0% to 100%) for CUDA SDK Samples and Rodinia Samples, before and after upgrades. Legend: Failed, Partial, Complete. Upgrades listed in the legend: Clang 3.2, main() method handling, template handling, OpenGL, #defined function handling, separately declared and defined function handling, kernel pointer invocation handling. Before upgrades, segments in bar order: CUDA SDK Samples 20.3%, 11.4%, 68.3%; Rodinia Samples 52.9%, 11.8%, 35.3%.]
      Increased reliability in translating samples after the latest round of improvements
  79. 79. CU2CL Roadmap & Future Work
      CU2CL Alpha (2011)
        Well-designed scaffold
      CU2CL Beta (2013)
        Improved robustness, CUDA coverage, and reliability
        Analysis and profiling of difficult-to-translate CUDA structures
      CU2CL w/ Functional Portability
        Expand CUDA coverage
          • Shared, const, texture memory
          • Driver API
          • OpenGL
        Handling unmapped CUDA structs / behaviors
          • Warp sync
      CU2CL w/ Performance Portability
        Automatic de-optimization
        Device-agnostic optimization
        Device-specific optimization
      What about CUDA to HSA?
  80. 80. Related Work
      Swan – High-level abstraction API, links to either OpenCL or CUDA implementation
      Ocelot & Caracal – Translate NVIDIA PTX IR to other device IRs
      CUDAtoOpenCL – Source-to-source translator, based on Cetus
  84. 84. CU2CL Conclusions
      • Status
        – What used to take months by hand takes seconds
          • 90+ successful translations
          • Negligible difference in performance
      • Challenges
        – CUDA functionality missing in OpenCL
          • __threadfence()
        – Equivalent libraries needed in OpenCL
          • cuFFT, MAGMA, cuBLAS
        – Implicit semantics
          • Implicit synchronization across warps
      • What's Next?
        – Improved functional portability
        – Support for performance portability
  85. 85. Acknowledgements
      Students: Gabriel Martinez, Paul Sathre
      This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).
