PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner

Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner, at the AMD Developer Summit (APU13), November 11-13, 2013.


Presentation Transcript

    • Automated CUDA-to-OpenCL Translation with CU2CL: What’s Next? Wu Feng and Mark Gardner Virginia Tech 2013-11-12 synergy.cs.vt.edu
    • Why OpenCL? Source code lasts longer than platforms. [Slide images: NVIDIA GeForce GTX Titan, an AMD product shot, Intel Xeon Phi, an AMD APU at CES11, Intel Core i7]
    • The Goal: To take advantage of OpenCL's portability... without sacrificing man-years of existing code.
    • CUDA and OpenCL APIs: CUDA module → OpenCL module. Thread → Contexts & Command Queues; Device → Platforms & Devices; Stream → Command Queues; Event → Events; Memory → Memory Objects.
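
    A minimal host-side sketch of the Stream → Command Queue row above, assuming an already-created context and device (the function name is mine, not CU2CL output):

      #include <CL/cl.h>

      /* CUDA (for reference): cudaStream_t s; cudaStreamCreate(&s);          */
      /* OpenCL: the stream becomes a command queue bound to an explicit      */
      /* context and device, which CUDA's runtime otherwise manages for you.  */
      cl_command_queue make_queue_like_stream(cl_context ctx, cl_device_id dev)
      {
          cl_int err;
          cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
          return (err == CL_SUCCESS) ? q : NULL;
      }
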
    • CUDA and OpenCL Data: Vector types (e.g., float4) → host: cl_float4, kernel: float4; dim3 → size_t[3]; cudaStream_t → cl_command_queue; cudaEvent_t → cl_event; device pointers (e.g., float* created through cudaMalloc) → cl_mem created through clCreateBuffer; cudaChannelFormat → cl_image_format; textureReference → cl_mem created through clCreateImage; cudaDeviceProp → no direct equivalent.
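
    A hedged sketch of two rows from the table above (illustrative only; the names are mine): dim3 becomes a three-element size_t array, and a cudaMalloc'd device pointer becomes a cl_mem buffer.

      #include <CL/cl.h>
      #include <stddef.h>

      /* CUDA:   dim3 grid(64, 64);             float *d_A; cudaMalloc(&d_A, bytes); */
      /* OpenCL: size_t grid[3] = {64, 64, 1};  cl_mem d_A = clCreateBuffer(...);    */
      cl_mem alloc_like_cudaMalloc(cl_context ctx, size_t bytes, cl_int *err)
      {
          return clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, err);
      }
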
    • CUDA and OpenCL Execution and Memory Models
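
    The slide's figure is not reproduced here; as a rough guide (my summary, not from the slide itself), CUDA threads/blocks/grids map to OpenCL work-items/work-groups/NDRanges, and CUDA __shared__ memory maps to OpenCL __local memory. A small illustrative kernel, with the CUDA equivalents noted in comments (assumes a work-group size of 256 and buffers sized to the NDRange):

      __kernel void scale_shared(__global float *out, __global const float *in)
      {
          __local float tile[256];         /* CUDA: __shared__ float tile[256];          */
          size_t lid = get_local_id(0);    /* CUDA: threadIdx.x                          */
          size_t gid = get_global_id(0);   /* CUDA: blockIdx.x*blockDim.x + threadIdx.x  */
          tile[lid] = in[gid];
          barrier(CLK_LOCAL_MEM_FENCE);    /* CUDA: __syncthreads()                      */
          out[gid] = 2.0f * tile[lid];
      }
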
    • The Problem: Manual translation (weeks, months) from CUDA source code to OpenCL source code, vs. automatic translation (seconds) with CU2CL. [Comic: xkcd.com]
    • Forecast (outline): Observations about translating; Examples: CUDA and OpenCL constructs; CU2CL architecture; Current state of CU2CL: robustness and performance; Future directions.
    • Translation Is Easy... when there is NO ambiguity in the translation between languages (i.e., there is a direct mapping).
      – High-level language → low-level representation, e.g., C → LLVM: x*y+z → %tmp = mul i32 %x, %y; %tmp2 = add i32 %tmp, %z
      – Between languages, e.g., CUDA → OpenCL: __powf(x[threadIdx.x], y[threadIdx.y]) → native_pow(x[get_local_id(0)], y[get_local_id(1)])
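
    Another largely mechanical mapping in the same spirit (a sketch with my own function name, not verbatim CU2CL output): cudaMemcpy with cudaMemcpyHostToDevice becomes a blocking clEnqueueWriteBuffer on the translated command queue.

      #include <CL/cl.h>

      /* CUDA:   cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  */
      /* OpenCL: a blocking write into the cl_mem object on a queue.   */
      cl_int copy_host_to_device(cl_command_queue q, cl_mem dst,
                                 const void *src, size_t bytes)
      {
          return clEnqueueWriteBuffer(q, dst, CL_TRUE /* blocking */, 0, bytes,
                                      src, 0, NULL, NULL);
      }
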
    • Translation is more difficult... when there IS ambiguity (or lack of a direct mapping) in the translation between languages.
      – Idiomatic expressions: "Putting all your eggs in one basket" → ?? in Spanish; CUDA __threadfence() → OpenCL ??
      – Dialects: Latin American Spanish vs. Castilian Spanish → English; CUDA Runtime API vs. CUDA Driver API → OpenCL
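
    A sketch of why the __threadfence() item above has no direct OpenCL 1.x equivalent (my example, not from the slide): CUDA's __threadfence() orders one thread's global-memory writes with respect to every thread on the device, while OpenCL's mem_fence(CLK_GLOBAL_MEM_FENCE) only orders them with respect to work-items in the same work-group.

      __kernel void flag_then_data(__global int *data, __global volatile int *flag)
      {
          data[get_global_id(0)] = 42;
          mem_fence(CLK_GLOBAL_MEM_FENCE);   /* work-group scope only              */
          if (get_global_id(0) == 0)
              *flag = 1;                     /* other work-groups may still read   */
                                             /* stale data[] even after flag == 1  */
      }
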
    • CUDA and OpenCL
    • CUDA Initialization Code: None (implicit). Dialect: CUDA Runtime API.
    • OpenCL Initialization Code: Explicit

      //get a platform and device, set up a context and command queue
      clGetPlatformIDs(1, &__cu2cl_Platform, NULL);
      clGetDeviceIDs(__cu2cl_Platform, CL_DEVICE_TYPE_GPU, 1, &__cu2cl_Device, NULL);
      __cu2cl_Context = clCreateContext(NULL, 1, &__cu2cl_Device, NULL, NULL, NULL);
      __cu2cl_CommandQueue = clCreateCommandQueue(__cu2cl_Context, __cu2cl_Device, CL_QUEUE_PROFILING_ENABLE, NULL);

      //read kernel source from disk
      FILE *f = fopen("matrixMul_kernel.cu-cl.cl", "r");
      fseek(f, 0, SEEK_END);
      size_t progLen = (size_t) ftell(f);
      const char * progSrc = (const char *) malloc(sizeof(char)*progLen);
      rewind(f);
      fread((void *) progSrc, progLen, 1, f);
      fclose(f);

      //build device program and kernel
      __cu2cl_Program_matrixMul_kernel_cu = clCreateProgramWithSource(__cu2cl_Context, 1, &progSrc, &progLen, NULL);
      free((void *) progSrc);
      clBuildProgram(__cu2cl_Program_matrixMul_kernel_cu, 1, &__cu2cl_Device, "-I .", NULL, NULL);
      __cu2cl_Kernel_matrixMul = clCreateKernel(__cu2cl_Program_matrixMul_kernel_cu, "matrixMul", NULL);
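
    One hedged addition worth making by hand to setup code like the above (not something the generated code shown here includes): check the clBuildProgram result and dump the build log, since the calls above ignore their error codes.

      #include <CL/cl.h>
      #include <stdio.h>
      #include <stdlib.h>

      void print_build_log(cl_program prog, cl_device_id dev)
      {
          size_t len = 0;
          clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
          char *log = (char *) malloc(len + 1);
          clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
          log[len] = '\0';
          fprintf(stderr, "OpenCL build log:\n%s\n", log);
          free(log);
      }
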
    • CUDA Kernel Invocation

      // setup execution parameters
      dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
      dim3 grid(uiWC / threads.x, uiHC / threads.y);

      // execute the kernel
      int nIter = 30;
      for (int j = 0; j < nIter; j++) {
          matrixMul<<< grid, threads >>>(d_C, d_A, d_B, uiWA, uiWB);
      }
    • OpenCL Kernel Invocation

      // setup execution parameters
      size_t threads[3] = {BLOCK_SIZE, BLOCK_SIZE, 1};
      size_t grid[3] = {uiWC / threads[0], uiHC / threads[1], 1};

      // execute the kernel
      int nIter = 30;
      for (int j = 0; j < nIter; j++) {
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 0, sizeof(cl_mem), &d_C);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 1, sizeof(cl_mem), &d_A);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 2, sizeof(cl_mem), &d_B);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 3, sizeof(int), &uiWA);
          clSetKernelArg(__cu2cl_Kernel_matrixMul, 4, sizeof(int), &uiWB);
          localWorkSize[0] = threads[0];
          localWorkSize[1] = threads[1];
          localWorkSize[2] = threads[2];
          globalWorkSize[0] = grid[0]*localWorkSize[0];
          globalWorkSize[1] = grid[1]*localWorkSize[1];
          globalWorkSize[2] = grid[2]*localWorkSize[2];
          clEnqueueNDRangeKernel(__cu2cl_CommandQueue, __cu2cl_Kernel_matrixMul, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
      }
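
    The multiplications above are the crux of this translation: CUDA's <<<grid, threads>>> specifies the grid in blocks, whereas clEnqueueNDRangeKernel expects the global size in work-items, so each dimension of the global size is the block count times the work-group size. A standalone restatement (inside some host function; gridSize and the example numbers are my own):

      size_t localWorkSize[3]  = {16, 16, 1};                      /* threads per block */
      size_t gridSize[3]       = {48, 32, 1};                      /* blocks            */
      size_t globalWorkSize[3] = {gridSize[0] * localWorkSize[0],
                                  gridSize[1] * localWorkSize[1],
                                  gridSize[2] * localWorkSize[2]}; /* work-items        */
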
    • Kernel Code for Vector Add

      CUDA:
      // Device code
      __global__ void VecAdd(const float* A, const float* B, float* C, int N)
      {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < N)
              C[i] = A[i] + B[i];
      }

      OpenCL:
      // Device code
      __kernel void VecAdd(const __global float* A, const __global float* B, __global float* C, int N)
      {
          int i = get_local_size(0) * get_group_id(0) + get_local_id(0);
          if (i < N)
              C[i] = A[i] + B[i];
      }
    • CU2CL Architecture
    • Compilation Process: Source Code → Preprocessor → Preprocessed Code → Lexer → Tokenized Code → Parser → Parse Tree → Semantic Analyzer → Intermediate Representation → Code Generator → Binary, with Clang providing the front end and LLVM the back end. Martinez, Gardner, and Feng, "CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures," IEEE ICPADS 2011.
    • AST-driven, String-based Rewriting: starting from the CUDA expression __powf(x[threadIdx.x], y[threadIdx.y]), the translator walks the AST (function call → arguments → struct field references), rewrites each CUDA construct in place (threadIdx.x → get_local_id(0), threadIdx.y → get_local_id(1), __powf → native_pow), and writes out the OpenCL expression native_pow(x[get_local_id(0)], y[get_local_id(1)]). Advantage: formatting remains intact → maintainable.
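
    A hypothetical sketch of the mechanism (not the actual CU2CL code, which drives Clang's rewriting machinery): each AST node that needs translating yields a source range plus a replacement string, and splicing the replacements into the original text, working from the end of the buffer backwards so earlier offsets stay valid, leaves every untouched character, including whitespace and comments, exactly as the author wrote it.

      #include <string.h>

      struct edit { size_t offset, length; const char *replacement; };

      /* Apply one edit in place; the caller supplies a buffer with room to grow.
       * e.g. replacing the 11 characters of "threadIdx.x" with "get_local_id(0)"
       * touches nothing outside that range. */
      static void apply_edit(char *src, const struct edit *e)
      {
          size_t new_len = strlen(e->replacement);
          char  *tail    = src + e->offset + e->length;
          memmove(src + e->offset + new_len, tail, strlen(tail) + 1);
          memcpy(src + e->offset, e->replacement, new_len);
      }
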
    • Complex Semantic Conversions
      1. Literal parameters to kernels: CUDA pass-by-value invocations vs. OpenCL pass-by-reference.

      CUDA kernel launch:
      kernel<<<grid, block>>>(foo1, foo2 * 2.0f, 256);

      Naive OpenCL translation:
      clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
      clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &foo2 * 2.0f);
      clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &256);

      Correct OpenCL translation:
      clSetKernelArg(__cu2cl_Kernel_kernel, 0, sizeof(float), &foo1);
      float __cu2cl_Kernel_kernel_arg_1 = foo2 * 2.0f;
      clSetKernelArg(__cu2cl_Kernel_kernel, 1, sizeof(float), &__cu2cl_Kernel_kernel_arg_1);
      int __cu2cl_Kernel_kernel_arg_2 = 256;
      clSetKernelArg(__cu2cl_Kernel_kernel, 2, sizeof(int), &__cu2cl_Kernel_kernel_arg_2);
    • Complex Semantic Conversions
      2. Device identification: CUDA uses an int; OpenCL uses an opaque cl_device_id. To change devices in CUDA, use cudaSetDevice(int id). To change devices in OpenCL:

      //scan all devices
      //save old platform, device, context, queue, program, & kernels
      myDevice = allDevices[id];
      clGetDeviceInfo(...); //get new device's platform
      myContext = clCreateContext(...);
      myQueue = clCreateCommandQueue(...);
      //load program source
      clBuildProgram(...);
      myKernel = clCreateKernel(...);

      CU2CL therefore implements its own handler to emulate and encapsulate this behavior.
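
    A hedged sketch of the kind of handler this implies (my names and simplifications, not CU2CL's actual helper): emulate cudaSetDevice(id) by tearing down the old OpenCL state and rebuilding a context and queue for the newly selected device.

      #include <CL/cl.h>

      extern cl_device_id allDevices[];        /* filled by an earlier device scan */

      static cl_device_id     myDevice;
      static cl_context       myContext;
      static cl_command_queue myQueue;

      void set_device_like_cuda(int id)
      {
          /* release the previous queue/context/program/kernels here */
          myDevice  = allDevices[id];
          myContext = clCreateContext(NULL, 1, &myDevice, NULL, NULL, NULL);
          myQueue   = clCreateCommandQueue(myContext, myDevice, 0, NULL);
          /* then reload the program source, clBuildProgram(), clCreateKernel() */
      }
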
    • CU2CL Evaluation
    • Test Code: 79 CUDA SDK samples; 17 Rodinia samples; applications: GEM (molecular modeling), IZ PS (neural network), Fen Zi (molecular dynamics); 100k+ SLOC in total.
    • Translator Coverage: [Table of OpenCL lines, lines changed, and percent automatically translated for the CUDA SDK samples (asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd), the Rodinia samples (BackPropagation, Breadth-First Search, Gaussian, Hotspot, Needleman-Wunsch), and the applications (Fen Zi, GEM, IZ PS). The automatically translated fractions range from roughly 89% to 100%, with vectorAdd translated fully automatically.]
    • Translation Challenges Identified: [Table of profiled challenges and their frequency (%) in the CUDA SDK and Rodinia codes, including device identifiers, literal kernel parameters, separate compilation, CUDA libraries, kernel templates, texture memory, graphics interoperability, constant memory, shared memory, attributes, and __threadfence().] Further challenges: kernel function pointer invocations, preprocessor effects, warp-level synchronization, device intrinsic functions, device buffer cl_mem type propagation, #defined function definitions, device buffers as struct members, arrays of device buffers, implicitly-defined kernel functions, device-side classes, constructors, & destructors, and struct alignment. Sathre, Gardner, Feng: "Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation," ICPP Workshops 2012: 89-96. Gardner, Feng, Sathre, Martinez: "Characterizing the Challenges and Evaluating the Efficacy of a CUDA-to-OpenCL Translator," ParCo special issue, 2013, to appear.
    • Translator Performance: [Plots of CU2CL translation time (microseconds) and total translation time (s) versus source lines of code, for SDK samples, Rodinia samples, and large applications, each with a linear fit.] Experimental setup: AMD Phenom II X6 1090T (six cores, 3.2 GHz), 16 GB RAM, NVIDIA GeForce GTX 480 (driver version 310.32, CUDA Runtime 5.0), 64-bit Ubuntu 12.04.
    • Translated Application Performance: [Bar chart of runtime in seconds (lower is better), CUDA vs. OpenCL, for the SDK and Rodinia samples and GEM: asyncAPI, bandwidthTest, BlackScholes, FastWalshTransform, matrixMul, scalarProd, vectorAdd, backprop, BFS, Gaussian, Hotspot, Needleman-Wunsch, GEM.] Note: all runs on the same NVIDIA GPU for fair comparison purposes.
    • CU2CL Reliability: [Stacked bar chart of the fraction of CUDA SDK and Rodinia samples that fail, partially translate, or completely translate, before and after the latest round of upgrades.] The improvements include Clang 3.2, main() method handling, template handling, OpenGL handling, #defined function handling, separately declared and defined function handling, and kernel pointer invocation handling, increasing reliability in translating the samples.
    • CU2CL Roadmap & Future Work
      – CU2CL Alpha (2011): well-designed scaffold.
      – CU2CL Beta (2013): improved robustness, CUDA coverage, and reliability; analysis and profiling of difficult-to-translate CUDA structures.
      – CU2CL w/ functional portability: expand CUDA coverage (shared, const, texture memory; Driver API; OpenGL); handle unmapped CUDA structs/behaviors (warp sync).
      – CU2CL w/ performance portability: automatic de-optimization, device-agnostic optimization, device-specific optimization.
      – What about CUDA to HSA?
    • Related Work: Swan (high-level abstraction API, links to either an OpenCL or a CUDA implementation); Ocelot & Caracal (translate NVIDIA PTX IR to other device IRs); CUDAtoOpenCL (source-to-source translator based on Cetus).
    • CU2CL Conclusions
      – Status: what used to take months by hand now takes seconds; 90%+ successful translation; negligible difference in performance.
      – Challenges: CUDA functionality missing in OpenCL (__threadfence()); equivalent libraries needed in OpenCL (cuFFT, MAGMA, cuBLAS); implicit semantics (implicit synchronization across warps).
      – What's next? Improved functional portability; support for performance portability.
    • Acknowledgements: Students: Gabriel Martinez, Paul Sathre. This work was supported in part by NSF I/UCRC IIP-0804155 via the NSF Center for High-Performance Reconfigurable Computing (CHREC).