Adapting Languages for Parallel Processing on
Neil Henning – Technology Lead





Current landscape


What is wrong with the current landscape


How to enable your language on GPUs


Developing tools for GPUs

Introduction – who am I?


Five years in the industry


Spent all of that using SPUs, GPUs, vector units &


Last two years focused on open standards (mostly


Passionate about making compute easy

Introduction – who are we?


GPU Compiler Experts based out of Edinburgh, Scotland


35 employees working on contracts, R&D and internal tech

Current Landscape


Languages – CUDA, RenderScript, C++AMP & OpenCL


Targets – GPU (mobile & desktop), CPU (scalar & vector), DSPs, FPGAs


Concerns – performance, power, precision, parallelism & portability

Current Landscape - CUDA

__global__ void kernel(char * a, char * b)
{
  a[blockIdx.x] = b[blockIdx.x];
}

char in[SIZE], out[SIZE];
char * cIn, * cOut;
cudaMalloc((void **)&cIn, SIZE);
cudaMalloc((void **)&cOut, SIZE);
cudaMemcpy(cIn, in, SIZE, cudaMemcpyHostToDevice);
kernel<<<SIZE, 1>>>(cOut, cIn);
cudaMemcpy(out, cOut, SIZE, cudaMemcpyDeviceToHost);

CUDA incredibly established

First major GPU compute approach to market

Huge bank of tools, libraries and knowledge

Used in banking, medical imaging, game asset creation, and many many more uses!

Really only had uptake in offline processing

Standard isn’t open, little room (or enthusiasm) for other vendors to implement

Using CUDA means abandoning compute on majority of devices
Current Landscape - RenderScript
#pragma version(1)
#pragma rs java_package_name(foo)
rs_allocation gIn; rs_allocation gOut;
rs_script gScript;
void root(const char * in, char * out,
    const void * usr, uint32_t x, uint32_t y) {
  *out = *in;
}
void filter() {
  rsForEach(gScript, gIn, gOut, NULL);
}

Context ctxt = /* … */;
RenderScript rs = RenderScript.create(ctxt);
ScriptC_foo script = new ScriptC_foo(rs, /* … */);
Allocation in = Allocation.createSized(rs,
    Element.I8(rs), SIZE);
Allocation out = Allocation.createSized(rs,
    Element.I8(rs), SIZE);
script.set_gIn(in); script.set_gOut(out);


Intelligent runtime load balances kernels


Only on Android


Creates Java classes to interface with kernels


Limited documentation & shortage of examples


Focused on performance portability


No real idea of feature roadmap
Current Landscape – C++AMP

int in[SIZE], out[SIZE];
array_view<const int, 1> aIn(SIZE, in);
array_view<int, 1> aOut(SIZE, out);
parallel_for_each(aOut.extent,
  [=](index<1> idx) restrict(amp)
  {
    aOut[idx] = aIn[idx];
  });
// can access aOut[…] like normal

Very well thought out single source approach

Lovely use of C++ templates to capture type information, array dimensions

Great use of C++11 lambdas for capturing kernel intent

Part of target community is really C++11 averse, need …

Limited low-level support

Initial interest by community faded fast

Xbox One will support C++AMP – watch this space

Current Landscape - OpenCL

void kernel foo(global int * a, global int * b)
{
  int idx = get_global_id(0);
  a[idx] = b[idx];
}

// device, context, queue, in, out already created
cl_program program =
  clCreateProgramWithSource(context, 1,
    &fooAsStr, NULL, NULL);
clBuildProgram(program, 1, &device,
  NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program,
  "foo", NULL);
// set kernel arguments
clEnqueueNDRangeKernel(queue, kernel, 1,
  NULL, &size, NULL, 0, NULL, NULL);


Open standard with many contributors


API is verbose, very very verbose!


API puts control in developer hands


Steep learning curve for new developers


Support on lots of heterogeneous platforms – not just GPUs!


Have to support diverse range of application types
Current Landscape

Modern systems have many compute-capable devices in them

Not unlike the fictitious system shown above!
Current Landscape

Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use

Mostly a fallback target for compute currently

Vector units are supported if kernel has vector types

Can auto-vectorize user kernels, as vector units harder for ‘normal’ programmers to target

GPUs are the reason we have compute in the first place

GPUs do not forgive poor code like a CPU or even a DSP could, require large arrays of work to utilize

Digital Signal Processors (DSPs) are a future target for the compute market

Can make no assumptions as to what DSPs ‘look’ like
Current Landscape



Have to weigh up many competing concerns for languages

Platform, operating system, device type, battery life, use case

What is wrong with the current landscape


Compute approaches are not on all device and OS combinations


No CUDA on AMD, RenderScript on iOS or C++AMP on Linux


Have to support offline precise compute & time-bound online compute


Very divergent targets/use cases/device types is problematic!

What is wrong with the current landscape

What if loop count is always multiple of four?

void foo(int * a, int * b, int * count)
{
  for(int idx = 0; idx < *(count); ++idx)
    a[idx] = 42 * b[idx];
}

What is wrong with the current landscape

void foo(int * a, int * b, int * count)
{
  for(int idx = 0; idx < *(count); idx += 4)
  {
    a[idx + 0] = 42 * b[idx + 0];
    a[idx + 1] = 42 * b[idx + 1];
    a[idx + 2] = 42 * b[idx + 2];
    a[idx + 3] = 42 * b[idx + 3];
  }
}

What if loop count is always multiple of four?

Can unroll the loop four times!

What is wrong with the current landscape

What if loop count is always multiple of four?

Can unroll the loop four times!

What if pointers a & b are sixteen byte aligned?

Can vectorize the loop body!

void foo(int * a, int * b, int * count)
{
  int vecCount = *(count) / 4;
  int4 * vA = (int4 * )a;
  int4 * vB = (int4 * )b;
  for(int idx = 0; idx < vecCount; ++idx)
    vA[idx] = vB[idx] * (int4 )42;
}

Why does my code look so radically different now?

Current languages force drastic developer interventions
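
The three shapes of foo discussed on these slides can be checked against each other on a plain CPU. The following is a minimal sketch in standard C, not the OpenCL code from the talk: count is passed by value for brevity, the four-wide variant uses an inner lane loop instead of the int4 type, and the names foo_scalar, foo_unrolled and foo_vec4 are hypothetical. The asserts spell out the two assumptions the slides rely on (count a multiple of four, pointers sixteen byte aligned).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original scalar loop: no assumptions about count or alignment. */
static void foo_scalar(int *a, const int *b, int count) {
    for (int idx = 0; idx < count; ++idx)
        a[idx] = 42 * b[idx];
}

/* Unrolled four times: only correct when count is a multiple of four. */
static void foo_unrolled(int *a, const int *b, int count) {
    assert(count % 4 == 0);
    for (int idx = 0; idx < count; idx += 4) {
        a[idx + 0] = 42 * b[idx + 0];
        a[idx + 1] = 42 * b[idx + 1];
        a[idx + 2] = 42 * b[idx + 2];
        a[idx + 3] = 42 * b[idx + 3];
    }
}

/* Four-wide version: each outer iteration handles one group of four ints,
   standing in for a single int4 multiply. Additionally assumes that both
   pointers are sixteen byte aligned, as a real vector load would. */
static void foo_vec4(int *a, const int *b, int count) {
    assert(count % 4 == 0);
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    const int vec_count = count / 4;
    for (int v = 0; v < vec_count; ++v)
        for (int lane = 0; lane < 4; ++lane)
            a[4 * v + lane] = 42 * b[4 * v + lane];
}
```

All three functions compute the same result when the assumptions hold; the point of the slides is that the developer, not the compiler, ends up encoding those assumptions by rewriting the code.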


What is wrong with the current landscape

void foo(int * a, int * b, int * count)
{
  int vecCount = *(count) / 4;
  int4 * vA = (int4 * )a;
  int4 * vB = (int4 * )b;
  for(int idx = 0; idx < vecCount; ++idx)
    vA[idx] = vB[idx] * (int4 )42;
}

Existing languages (mostly) force developers to do coding wizardry that is unnecessary

Also no real feedback to developer as ‘main’ compute target has highly secretive ISAs

Don’t want to force vendors to reveal secrets, but do want ability to influence kernel code generation


What is wrong with the current landscape


Rely on vendors to provide tools to aid development


Debuggers, profilers, static analysis all increasingly required


Libraries can vastly decrease development time


Rely solely on vendors to provide all these complicated pieces

What is wrong with the current landscape


Vendors already have lots of targets to support


Every generation of devices need to test conformance


Need to support compilers, graphics, compute, tools, list goes on!


Why should the vendor be the only one taking the burden?

What is wrong with the current landscape


No one can agree on what is the ‘best’ approach


Personal preference of developer/organization sways opinions


Why not allow Lisp on a GPU? Lua on a DSP?


Vendor doesn’t need extra headache of supporting these niche use cases

What is wrong with the current landscape


My pitch – let community support compute standards


Take the approach of LLVM & Clang


Vendor has to support lower standard on their hardware


But allows community to support & innovate


How to enable your language on GPUs


First step – be able to compile language to a binary


Can’t output real binary though


Vendor doesn’t want to expose ISA


Developer wants portability of compiled kernels

How to enable your language on GPUs


Need to use an Intermediate Representation (IR)


Two approaches in development for this!


HSA Intermediate Language (HSAIL)


OpenCL Standard Portable Intermediate Representation (SPIR)

How to enable your language on GPUs

Language -> LLVM IR -> HSAIL

Low level mapping onto hardware, more of a virtual ISA than an IR

HSAIL heavily in development

Language -> LLVM IR -> SPIR

Then pass SPIR to OpenCL runtime as binary

Execute like normal OpenCL C Language kernel

Provisional specification available!

How to enable your language on GPUs

HSA will provide a low-level runtime to interface between HSA compiled binaries and OS

HSAIL is being standardized and ratified

Existing JIT’ed languages potential targets

OpenCL SPIR will require a SPIR compliant OpenCL implementation as target

Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options
How to enable your language on GPUs


At present, SPIR is only target we can investigate


Intel has OpenCL drivers with provisional SPIR support


Can use Clang -> LLVM -> SPIR, then use Intel’s OpenCL to consume SPIR


Can take code that compiles to LLVM and run it on OpenCL

How to enable your language on GPUs


Various steps to getting your language working on GPUs with SPIR


We’ll use Intel’s OpenCL SDK with provisional SPIR support;

Create a test harness to load a SPIR binary


Create a simple kernel using Intel’s SPIR compiler on host


Create a simple kernel using tip Clang (language OpenCL) targeting SPIR


Try other languages that compile to LLVM with SPIR target

How to enable your language on GPUs

// some SPIR bitcode file
const unsigned char spir_bc[spir_bc_length];
// clCreateProgramWithBinary takes an array of pointers to binaries
const unsigned char * spir_bc_ptr = spir_bc;
// already initialized platform, device & context for a SPIR compliant device
cl_platform_id platform = ... ;
cl_device_id device = ... ;
cl_context context = ... ;
// create our program with our SPIR bitcode file
cl_program program = clCreateProgramWithBinary(
  context, 1, &device, &spir_bc_length, &spir_bc_ptr, NULL, NULL);
// build, passing arguments telling the compiler language is SPIR, and the SPIR standard we are using
clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);
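
The harness above takes the SPIR bitcode as an in-memory buffer, so it has to be read from disk first. A minimal sketch of that step in standard C follows; load_file and looks_like_llvm_bitcode are hypothetical helper names, not part of OpenCL or of the talk's harness. The sanity check relies on raw LLVM bitcode files starting with the bytes 'B', 'C', 0xC0, 0xDE, and is only a cheap guard against pointing the harness at the wrong file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file into a malloc'd buffer.
   Returns NULL on any failure; the caller owns and frees the buffer. */
static unsigned char *load_file(const char *path, size_t *length_out) {
    assert(path != NULL && length_out != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }
    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) { fclose(f); return NULL; }
    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }
    *length_out = (size_t)len;
    return buf;
}

/* Hypothetical helper: checks for the raw LLVM bitcode magic number. */
static int looks_like_llvm_bitcode(const unsigned char *data, size_t length) {
    return length >= 4 &&
           data[0] == 'B' && data[1] == 'C' &&
           data[2] == 0xC0 && data[3] == 0xDE;
}
```

The returned pointer and length would then take the place of the SPIR buffer and its length in the clCreateProgramWithBinary call.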

How to enable your language on GPUs

// already initialized memory buffers for our context
cl_mem in_mem = ... ;
cl_mem out_mem = ... ;
// assume our kernel function from the spir kernel was called foo
cl_kernel kernel = clCreateKernel(program, "foo", NULL);
// assume our kernel has one read buffer as first argument, and one write buffer as second
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void * )&in_mem);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void * )&out_mem);

How to enable your language on GPUs

// already initialized command queue
cl_command_queue queue = ... ;
cl_event write_event, run_event;
clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE,
  &read_payload, 0, NULL, &write_event);
const size_t size = BUFFER_SIZE / sizeof(cl_int);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 1, &write_event, &run_event);
clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE,
  &result_payload, 1, &run_event, NULL);
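
What the write, launch, read sequence above computes can be mimicked on the host, which is handy for checking expected results before involving a device. This is only a sequential sketch in standard C under stated assumptions: foo_work_item and run_ndrange_1d are hypothetical names, the work-item id is passed explicitly instead of coming from get_global_id(0), and the kernel is assumed to be the simple copy kernel used throughout the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the copy kernel for one work-item; global_id stands in for
   the value get_global_id(0) would return on the device. */
static void foo_work_item(const int *in, int *out, size_t global_id) {
    out[global_id] = in[global_id];
}

/* Sequential stand-in for a one-dimensional NDRange launch: the global
   size is the buffer size in bytes divided by the element size, and the
   kernel body runs once per global id. */
static void run_ndrange_1d(const int *in, int *out, size_t buffer_size_bytes) {
    assert(buffer_size_bytes % sizeof(int) == 0);
    const size_t global_size = buffer_size_bytes / sizeof(int);
    for (size_t id = 0; id < global_size; ++id)
        foo_work_item(in, out, id);
}
```

On a real device the work-items run in no particular order and largely in parallel, so this loop only matches kernels whose work-items are independent of each other, as they are here.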

How to enable your language on GPUs

Now, create a simple OpenCL kernel

void kernel foo(global int * in, global int * out)
{
  out[get_global_id(0)] = in[get_global_id(0)];
}

And use Intel’s command line (or GUI!) tool to build

ioc32 -cmd=build -input -spir32=foo.bc

How to enable your language on GPUs

Next we point the buffer for our SPIR kernel at the generated SPIR kernel

And it fails…?

Turns out Intel’s OpenCL runtime doesn’t like us telling them they are building SPIR

Simply remove "-x spir -spir-std=1.2" from the build options and voila!

How to enable your language on GPUs


Next step – use tip Clang to build our kernel

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc -o foo.bc


Compiles ok, but when we run it fails…?

So Clang generated SPIR bitcode file could very well not work


We’ll take a look at the readable IR for the Intel & Clang compiled kernels

How to enable your language on GPUs

Clang Output

; ModuleID = ''
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"
; Function Attrs: nounwind
define void @foo(i32 addrspace(1)* nocapture readonly %a, i32 addrspace(1)* nocapture %b) #0 {
  %0 = load i32 addrspace(1)* %a, align 4, !tbaa !2
  store i32 %0, i32 addrspace(1)* %b, align 4, !tbaa !2
  ret void
}

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
!opencl.kernels = !{!0}
!llvm.ident = !{!1}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo}
!1 = metadata !{metadata !"clang version 3.4 (trunk)"}
!2 = metadata !{metadata !3, metadata !3, i64 0}
!3 = metadata !{metadata !"int", metadata !4, i64 0}
!4 = metadata !{metadata !"omnipotent char", metadata !5, i64 0}
!5 = metadata !{metadata !"Simple C/C++ TBAA"}

How to enable your language on GPUs

IOC Output

; ModuleID = 'ex.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir-unknown-unknown"
define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind {
  %1 = alloca i32 addrspace(1)*, align 4
  %2 = alloca i32 addrspace(1)*, align 4
  store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4
  store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4
  %3 = load i32 addrspace(1)** %1, align 4
  %4 = load i32 addrspace(1)* %3, align 4
  %5 = load i32 addrspace(1)** %2, align 4
  store i32 %4, i32 addrspace(1)* %5, align 4
  ret void
}

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
How to enable your language on GPUs


So the metadata is different!

We could fix Clang to produce the right metadata…?


Or just hack around!


Let’s use Intel’s compiler to generate a stub function


Then we can use an extern function defined in our Clang module!

How to enable your language on GPUs

extern int doSomething(int a);
void kernel foo(global int * in, global int * out)
{
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}

int doSomething(int a)
{
  return a;
}

How to enable your language on GPUs

And it fails…?

Intel’s compiler doesn’t like extern functions!

We’ve already bodged it thus far…

So let’s continue!

int __attribute__((weak)) doSomething(int a) {}
void kernel foo(global int * in, global int * out)
{
  int id = get_global_id(0);
  out[id] = doSomething(in[id]);
}
How to enable your language on GPUs


More than a little nasty…

Relies on Clang extension to declare function weak within OpenCL


Relies on Intel using Clang and allowing extension


But it works!


Can build both the Intel stub code & the Clang actual code


Then use llvm-link to pull them together!

How to enable your language on GPUs


So now we can compile two OpenCL kernels, link them together, and run it


What is next? Want to enable your language!

What about using Clang, but using a different language?


C & C++ come to mind!

How to enable your language on GPUs

Use a simple C file

int doSomething(int a)
{
  return a;
}

And use Clang to compile it

clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc

How to enable your language on GPUs

Or a simple C++ file!

extern "C" int doSomething(int a);
template<typename T> T templatedSomething(const T t)
{
  return t;
}
int doSomething(int a)
{
  return templatedSomething(a);
}
How to enable your language on GPUs

Let’s have some real C++ code


Use features that OpenCL doesn’t provide us

We’ll do a matrix multiplication in C++

Use classes, constructors, templates

How to enable your language on GPUs

typedef float __attribute__((ext_vector_type(4))) float4;
typedef float __attribute__((ext_vector_type(16))) float16;
float __attribute__((overloadable)) dot(float4 a, float4 b);
template<typename T, unsigned int WIDTH, unsigned int HEIGHT> class Matrix
{
  typedef T __attribute__((ext_vector_type(WIDTH))) RowType;
  RowType rows[HEIGHT];
public:
  Matrix() {}
  template<typename U> Matrix(const U & u) { __builtin_memcpy(&rows, &u, sizeof(U)); }
  RowType & operator[](const unsigned int index) { return rows[index]; }
  const RowType & operator[](const unsigned int index) const { return rows[index]; }
};

How to enable your language on GPUs

template<typename T, unsigned int WIDTH, unsigned int HEIGHT>
Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a, const Matrix<T, WIDTH, HEIGHT> & b)
{
  // transpose b so that each of its columns becomes a row we can dot with
  Matrix<T, HEIGHT, WIDTH> bShuffled;
  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      bShuffled[w][h] = b[h][w];
  Matrix<T, WIDTH, HEIGHT> result;
  for(unsigned int h = 0; h < HEIGHT; h++)
    for(unsigned int w = 0; w < WIDTH; w++)
      result[h][w] = dot(a[h], bShuffled[w]);
  return result;
}

How to enable your language on GPUs

extern "C" float16 doSomething(float16 a, float16 b);
float16 doSomething(float16 a, float16 b)
{
  Matrix<float, 4, 4> matA(a);
  Matrix<float, 4, 4> matB(b);
  Matrix<float, 4, 4> mul = matA * matB;
  float16 result = (float16 )0;
  result.s0123 = mul[0];
  result.s4567 = mul[1];
  result.s89ab = mul[2];
  result.scdef = mul[3];
  return result;
}
How to enable your language on GPUs


And when we run it…

ex5.vcxproj -> E:AMDDeveloperSummit2013buildExample5Debugex5.exe
Found 2 platforms!
Choosing vendor 'Intel(R) Corporation'!
Found 1 devices!
SPIR file length '3948' bytes!
[ 0.0, 1.0, 2.0, 3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0, 34.0, 28.0, 22.0]
[ 4.0, 5.0, 6.0, 7.0] * [ 12.0, 11.0, 10.0, 9.0] = [200.0, 178.0, 156.0, 134.0]
[ 8.0, 9.0, 10.0, 11.0] * [ 8.0, 7.0, 6.0, 5.0] = [360.0, 322.0, 284.0, 246.0]
[ 12.0, 13.0, 14.0, 15.0] * [ 4.0, 3.0, 2.0, 1.0] = [520.0, 466.0, 412.0, 358.0]
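
The printed rows can be cross-checked on the host with a plain C reference. This is a sketch, not the code that ran on the device: matmul4x4 is a hypothetical name, the matrices are flat row-major arrays of sixteen floats (mirroring the float16 arguments of doSomething), and the structure follows the same transpose-then-dot approach as the C++ operator shown earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Reference 4x4 matrix multiply on flat row-major arrays of 16 floats.
   b is transposed first so that every output element is the dot product
   of a row of a with a row of the transposed b. */
static void matmul4x4(const float *a, const float *b, float *result) {
    assert(a != NULL && b != NULL && result != NULL);
    float b_transposed[16];
    for (size_t h = 0; h < 4; ++h)
        for (size_t w = 0; w < 4; ++w)
            b_transposed[4 * w + h] = b[4 * h + w];
    for (size_t h = 0; h < 4; ++h) {
        for (size_t w = 0; w < 4; ++w) {
            float sum = 0.0f;
            for (size_t k = 0; k < 4; ++k)
                sum += a[4 * h + k] * b_transposed[4 * w + k];
            result[4 * h + w] = sum;
        }
    }
}
```

Feeding it the same inputs as the run above (0.0 to 15.0 for the first matrix, 16.0 down to 1.0 for the second) gives the same four result rows, starting with 40.0, 34.0, 28.0, 22.0.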

How to enable your language on GPUs

The least you need to target a GPU;

Generate correct LLVM IR with SPIR metadata

Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated code

!opencl.kernels = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!opencl.spir.version = !{!6}
!opencl.ocl.version = !{!7}
!opencl.used.extensions = !{!8}
!opencl.used.optional.core.features = !{!8}
!opencl.compiler.options = !{!8}
!0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5}
!1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1}
!2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"}
!3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"}
!4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""}
!5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"}
!6 = metadata !{i32 1, i32 0}
!7 = metadata !{i32 0, i32 0}
!8 = metadata !{}
How to enable your language on GPUs

Porting C/C++ libraries to SPIR requires a little more work

int foo(int * a)
{
  return *a;
}

The data pointed to by ‘a’ will by default be put in the private address space

But a straight conversion to SPIR needs all data in global address space


Means that any porting of existing code could be quite intrusive
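
One less intrusive option, sketched here as an assumption rather than as the approach taken in the talk, is to leave the library function untouched and have the kernel stage each value through a private variable before calling it. Plain C has no address spaces, so the sketch below only marks in comments where global and private data would live; kernel_body and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Existing library function, unchanged: its pointer parameter would
   default to the private address space when compiled for SPIR. */
static int foo(int *a) {
    return *a;
}

/* Hypothetical kernel body for one work-item. global_in and global_out
   stand for buffers in the global address space. */
static void kernel_body(const int *global_in, int *global_out, size_t id) {
    assert(global_in != NULL && global_out != NULL);
    int private_copy = global_in[id];      /* global -> private copy */
    global_out[id] = foo(&private_copy);   /* library only sees private data */
}
```

The price is an extra copy per call, and it only works when the library function touches a small, known amount of data; code that walks large structures through pointers would still need address space annotations.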

How to enable your language on GPUs


To target your language at GPUs


Need to be able to segregate work into parallel chunks


Have to ban certain features that don’t work with compute



Need to deal with distinct address spaces

Language could also provide an API onto OpenCL SPIR builtins

But with OpenCL SPIR it is now possible to make any language work on a GPU!


Developing tools for GPUs


Tools increasingly required to support development


Even having printf (which OpenCL 1.2 added) is novel!


But with increasingly complex code better tools needed


Main three are debuggers, profilers and compiler-tools

Developing tools for GPUs


Debuggers for compute are difficult for non-vendor to develop


Codeplay has developed such tools on top of compute standards


Problem is bedrock for these tools can change at any time


Hard to beat vendor-owned approach that has lower-level access

Developing tools for GPUs

Codeplay are pushing hard for HSA to have features that aid tool development

Debuggers are much easier with instruction support, debug info, change registers, call stacks

OpenCL SPIR harder to create debugger for without vendor support

Can we standardize a way to debug OpenCL SPIR, or allow debugging via emulation of SPIR?
Developing tools for GPUs


Profilers require superset of debugger feature-set


Need to be able to trap kernels at defined points


Accurate timings only other requirement beyond debugger support


More fun when we go beyond performance, and measure power

Developing tools for GPUs


HSA and OpenCL SPIR both good profiler targets


Could split SPIR kernels into profiling sections


Then use existing timing information in OpenCL


HSA will only require debugger features we are pushing for

Developing tools for GPUs


Compiler tools consist of optimizers and analysis


Both HSA and OpenCL SPIR being based on LLVM enable this!


We as compiler experts can aid existing runtimes


You as developers can add optimizations & analyse your kernels!




With the rise of open standards, compute is increasingly easy


With HSA & OpenCL SPIR hardware is finally open to us!


Just need standards to ratify, mature & be available on hardware!


Next big push into compute is upon us

Can also catch me on twitter @sheredom



SPIR extension on Khronos website



SPIR provisional specification



HSA Foundation


Neil Henning

More Related Content

What's hot

PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
AMD Developer Central
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
AMD Developer Central
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
AMD Developer Central
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
AMD Developer Central
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
AMD Developer Central
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
AMD Developer Central
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
AMD Developer Central
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
AMD Developer Central
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
AMD Developer Central
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
AMD Developer Central
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes

PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning

  • 1. Adapting Languages for Parallel Processing on GPUs Neil Henning – Technology Lead Neil Henning
  • 2. Agenda ● Introduction ● Current landscape ● What is wrong with the current landscape ● How to enable your language on GPUs ● Developing tools for GPUs Neil Henning
  • 4. Introduction – who am I? ● Five years in the industry ● Spent all of that using SPUs, GPUs, vector units & DSPs ● Last two years focused on open standards (mostly OpenCL) ● Passionate about making compute easy Neil Henning
  • 5. Introduction – who are we? ● GPU Compiler Experts based out of Edinburgh, Scotland ● 35 employees working on contracts, R&D and internal tech Neil Henning
  • 7. Current Landscape ● Languages – CUDA, RenderScript, C++AMP & OpenCL ● Targets – GPU (mobile & desktop), CPU (scalar & vector), DSPs, FPGAs ● Concerns – performance, power, precision, parallelism & portability Neil Henning
  • 8. Current Landscape - CUDA __global__ void kernel(char * a, char * b) { a[blockIdx.x] = b[blockIdx.x]; } char in[SIZE], out[SIZE]; char * cIn, * cOut; cudaMalloc((void **)&cIn, SIZE); cudaMalloc((void **)&cOut, SIZE); cudaMemcpy(cIn, in, SIZE, cudaMemcpyHostToDevice); kernel<<<SIZE, 1>>>(cOut, cIn); cudaMemcpy(out, cOut, SIZE, cudaMemcpyDeviceToHost); cudaFree(cIn); cudaFree(cOut); ● CUDA incredibly established ● First major GPU compute approach to market ● Huge bank of tools, libraries and knowledge ● Used in banking, medical imaging, game asset creation, and many many more uses! ● Using CUDA means abandoning compute on majority of devices ● Really only had uptake in offline processing ● Standard isn’t open, little room (or enthusiasm) for other vendors to implement Neil Henning
  • 9. Current Landscape - RenderScript #pragma version(1) #pragma rs java_package_name(foo) rs_allocation gIn; rs_allocation gOut; rs_script gScript; void root(const char * in, char * out, const void * usr, uint32_t x, uint32_t y) { *out = *in; } void filter() { rsForEach(gScript, gIn, gOut, NULL); } Context ctxt = /* … */; RenderScript rs = RenderScript.create(ctxt); ScriptC_foo script = new ScriptC_foo(rs, getResources(),; Allocation in = Allocation.createSized(rs, Element.I8(rs), SIZE); Allocation out = Allocation.createSized(rs, Element.I8(rs), SIZE); script.set_gIn(in); script.set_gOut(out); script.set_gScript(script); script.invoke_filter(); ● Intelligent runtime load balances kernels ● Only on Android ● Creates Java classes to interface with kernels ● Limited documentation & shortage of examples ● Focused on performance portability ● No real idea of feature roadmap Neil Henning
  • 10. Current Landscape – C++AMP int in[SIZE], out[SIZE]; array_view<const int, 1> aIn(SIZE, in); array_view<int, 1> aOut(SIZE, out); aOut.discard_data(); parallel_for_each(aOut.extent, [=](index<1> idx) restrict(amp) { aOut[idx] = aIn[idx]; } ); // can access aOut[…] like normal ● Very well thought out single source approach ● Lovely use of C++ templates to capture type information, array dimensions ● Great use of C++11 lambdas for capturing kernel intent ● Xbox One will support C++AMP – watch this space ● Part of target community is really C++11 averse, need convincing ● Limited low-level support ● Initial interest by community faded fast Neil Henning
  • 11. Current Landscape - OpenCL void kernel foo(global int * a, global int * b) { int idx = get_global_id(0); a[idx] = b[idx]; } // device, context, queue, in, out already created cl_program program = clCreateProgramWithSource(context, 1, fooAsStr, NULL, NULL); clBuildProgram(program, 1, &device, NULL, NULL, NULL); cl_kernel kernel = clCreateKernel(program, "foo", NULL); // set kernel arguments clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 0, NULL, NULL); ● Open standard with many contributors ● API is verbose, very very verbose! ● API puts control in developer hands ● Steep learning curve for new developers ● Support on lots of heterogeneous platforms – not just GPUs! ● Have to support diverse range of application types Neil Henning
  • 12. Current Landscape Modern systems have many compute-capable devices in them Not unlike the fictitious system shown above! Neil Henning
  • 13. Current Landscape Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use Mostly a fallback target for compute currently Neil Henning
  • 14. Current Landscape Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use Mostly a fallback target for compute currently Vector units are supported if kernel has vector types Can auto-vectorize user kernels, as vector units harder for ‘normal’ programmers to target Neil Henning
  • 15. Current Landscape Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use Mostly a fallback target for compute currently Vector units are supported if kernel has vector types Can auto-vectorize user kernels, as vector units harder for ‘normal’ programmers to target Can make no assumptions as to what DSPs ‘look’ like Digital Signal Processors (DSPs) are a future target for the compute market Neil Henning
  • 16. Current Landscape Scalar CPUs are the ‘normal’ target for programmers, easy to target, easy to use Mostly a fallback target for compute currently Vector units are supported if kernel has vector types Can auto-vectorize user kernels, as vector units harder for ‘normal’ programmers to target GPUs are the reason we have compute in the first place GPUs do not forgive poor code like a CPU or even a DSP could, require large arrays of work to utilize Digital Signal Processors (DSPs) are a future target for the compute market Can make no assumptions as to what DSPs ‘look’ like Neil Henning
  • 17. Current Landscape ● Have to weigh up many competing concerns for languages ● Platform, operating system, device type, battery life, use case Neil Henning
  • 18. What is wrong with the current landscape Neil Henning
  • 19. What is wrong with the current landscape ● Compute approaches are not on all device and OS combinations ● No CUDA on AMD, RenderScript on iOS or C++AMP on Linux ● Have to support offline precise compute & time-bound online compute ● Very divergent targets/use cases/device types is problematic! Neil Henning
  • 20. What is wrong with the current landscape ● What if loop count is always multiple of four? void foo(int * a, int * b, int * count) { for(int idx = 0; idx < *(count); ++idx) { a[idx] = 42 * b[idx]; } } Neil Henning
  • 21. What is wrong with the current landscape ● What if loop count is always multiple of four? ● Can unroll the loop four times! void foo(int * a, int * b, int * count) { for(int idx = 0; idx < *(count); idx += 4) { a[idx + 0] = 42 * b[idx + 0]; a[idx + 1] = 42 * b[idx + 1]; a[idx + 2] = 42 * b[idx + 2]; a[idx + 3] = 42 * b[idx + 3]; } } Neil Henning
  • 22. What is wrong with the current landscape ● What if loop count is always multiple of four? ● Can unroll the loop four times! ● What if pointers a & b are sixteen byte aligned? void foo(int * a, int * b, int * count) { for(int idx = 0; idx < *(count); idx += 4) { a[idx + 0] = 42 * b[idx + 0]; a[idx + 1] = 42 * b[idx + 1]; a[idx + 2] = 42 * b[idx + 2]; a[idx + 3] = 42 * b[idx + 3]; } } Neil Henning
  • 23. What is wrong with the current landscape ● What if loop count is always multiple of four? ● Can unroll the loop four times! ● What if pointers a & b are sixteen byte aligned? ● Can vectorize the loop body! void foo(int * a, int * b, int * count) { int vecCount = *(count) / 4; int4 * vA = (int4 * )a; int4 * vB = (int4 * )b; for(int idx = 0; idx < vecCount; ++idx) { vA[idx] = vB[idx] * (int4 )42; } } Neil Henning
  • 24. What is wrong with the current landscape ● What if loop count is always multiple of four? ● Can unroll the loop four times! ● What if pointers a & b are sixteen byte aligned? ● Can vectorize the loop body! ● Why does my code look so radically different now? void foo(int * a, int * b, int * count) { int vecCount = *(count) / 4; int4 * vA = (int4 * )a; int4 * vB = (int4 * )b; for(int idx = 0; idx < vecCount; ++idx) { vA[idx] = vB[idx] * (int4 )42; } } Neil Henning
  • 25. What is wrong with the current landscape ● What if loop count is always multiple of four? ● Can unroll the loop four times! ● What if pointers a & b are sixteen byte aligned? ● Can vectorize the loop body! ● Why does my code look so radically different now? ● Current languages force drastic developer interventions void foo(int * a, int * b, int * count) { int vecCount = *(count) / 4; int4 * vA = (int4 * )a; int4 * vB = (int4 * )b; for(int idx = 0; idx < vecCount; ++idx) { vA[idx] = vB[idx] * (int4 )42; } } Neil Henning
  • 26. What is wrong with the current landscape void foo(int * a, int * b, int * count) { int vecCount = *(count) / 4; int4 * vA = (int4 * )a; int4 * vB = (int4 * )b; for(int idx = 0; idx < vecCount; ++idx) { vA[idx] = vB[idx] * (int4 )42; } } ● Existing languages (mostly) force developers to do coding wizardry that is unnecessary ● Also no real feedback to developer as ‘main’ compute target has highly secretive ISAs ● Don’t want to force vendors to reveal secrets, but do want ability to influence kernel code generation Neil Henning
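As a rough illustration of the hand-vectorization walked through on slides 20 to 26, the sketch below puts the scalar loop and a four-wide version side by side and checks that they produce the same values. It is plain host C++, not kernel code. `Int4`, `fooScalar`, `fooVector` and `versionsAgree` are hypothetical names that do not appear in the talk, and `Int4` is an ordinary 16-byte-aligned struct standing in for the `int4` vector type. The sketch assumes the element count is a multiple of four, which is the same assumption the slides make.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the int4 type on the slides: four ints, 16-byte aligned.
struct alignas(16) Int4 {
    int v[4];
};

// Element-wise multiply by a scalar (the slides write `vB[idx] * (int4)42`).
inline Int4 operator*(const Int4 &x, int s) {
    return Int4{{x.v[0] * s, x.v[1] * s, x.v[2] * s, x.v[3] * s}};
}

// The original scalar loop.
void fooScalar(int *a, const int *b, const int *count) {
    for (int idx = 0; idx < *count; ++idx) {
        a[idx] = 42 * b[idx];
    }
}

// The hand-vectorized loop; assumes *count is a multiple of four.
void fooVector(Int4 *vA, const Int4 *vB, const int *count) {
    const int vecCount = *count / 4;
    for (int idx = 0; idx < vecCount; ++idx) {
        vA[idx] = vB[idx] * 42;
    }
}

// Runs both versions on the same input and reports whether every element matches.
bool versionsAgree(int count) {
    std::vector<int> b(count), a(count);
    std::vector<Int4> vB(count / 4), vA(count / 4);
    for (int i = 0; i < count; ++i) {
        b[i] = i;
        vB[i / 4].v[i % 4] = i;
    }
    fooScalar(a.data(), b.data(), &count);
    fooVector(vA.data(), vB.data(), &count);
    for (int i = 0; i < count; ++i) {
        if (a[i] != vA[i / 4].v[i % 4]) return false;
    }
    return true;
}
```

The two versions use separate storage rather than casting `int *` to `Int4 *`, which sidesteps the aliasing and alignment questions that the cast on the slides relies on the programmer to get right.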
  • 27. What is wrong with the current landscape ● Rely on vendors to provide tools to aid development ● Debuggers, profilers, static analysis all increasingly required ● Libraries can vastly decrease development time ● Rely solely on vendors to provide all these complicated pieces Neil Henning
  • 28. What is wrong with the current landscape ● Vendors already have lots of targets to support ● Every generation of devices need to test conformance ● Need to support compilers, graphics, compute, tools, list goes on! ● Why should the vendor be the only one taking the burden? Neil Henning
  • 29. What is wrong with the current landscape ● No one can agree on what is the ‘best’ approach ● Personal preference of developer/organization sways opinions ● Why not allow Lisp on a GPU? Lua on a DSP? ● Vendor doesn’t need extra headache of supporting these niche use cases Neil Henning
  • 30. What is wrong with the current landscape ● My pitch – let community support compute standards ● Take the approach of LLVM & Clang ● Vendor has to support lower standard on their hardware ● But allows community to support & innovate Neil Henning
  • 31. How to enable your language on GPUs Neil Henning
  • 32. How to enable your language on GPUs ● First step – be able to compile language to a binary ● Can’t output real binary though ● Vendor doesn’t want to expose ISA ● Developer wants portability of compiled kernels Neil Henning
  • 33. How to enable your language on GPUs ● Need to use an Intermediate Representation (IR) ● Two approaches in development for this! ● HSA Intermediate Language (HSAIL) ● OpenCL Standard Portable Intermediate Representation (SPIR) Neil Henning
  • 34. How to enable your language on GPUs ● Language -> LLVM IR -> HSAIL ● Low level mapping onto hardware, more of a virtual ISA than an IR ● HSAIL heavily in development ● Language -> LLVM IR -> SPIR ● Then pass SPIR to OpenCL runtime as binary ● Execute like normal OpenCL C Language kernel ● Provisional specification available! Neil Henning
  • 35. How to enable your language on GPUs ● HSA will provide a low-level runtime to interface between HSA compiled binaries and OS ● HSAIL is being standardized and ratified ● Existing JIT’ed languages potential targets ● OpenCL SPIR will require a SPIR compliant OpenCL implementation as target ● Can compile using LLVM, then use clCreateProgramWithBinary, passing SPIR options Neil Henning
  • 36. How to enable your language on GPUs ● At present, SPIR is only target we can investigate ● Intel has OpenCL drivers with provisional SPIR support ● Can use Clang -> LLVM -> SPIR, then use Intel’s OpenCL to consume SPIR ● Can take code that compiles to LLVM and run it on OpenCL Neil Henning
  • 37. How to enable your language on GPUs ● Various steps to getting your language working on GPUs with SPIR ● We’ll use Intel’s OpenCL SDK with provisional SPIR support; 1. Create a test harness to load a SPIR binary 2. Create a simple kernel using Intel’s SPIR compiler on host 3. Create a simple kernel using tip Clang (language OpenCL) targeting SPIR 4. Try other languages that compile to LLVM with SPIR target Neil Henning
  • 38. How to enable your language on GPUs // some SPIR bitcode file const unsigned char spir_bc[spir_bc_length]; // clCreateProgramWithBinary takes an array of pointers to binaries const unsigned char * spir_bc_ptr = spir_bc; // already initialized platform, device & context for a SPIR compliant device cl_platform_id platform = ... ; cl_device_id device = ... ; cl_context context = … ; // create our program with our SPIR bitcode file cl_program program = clCreateProgramWithBinary( context, 1, &device, &spir_bc_length, &spir_bc_ptr, NULL, NULL); // build, passing arguments telling the compiler language is SPIR, and the SPIR standard we are using clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL); Neil Henning
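The harness above starts from a SPIR bitcode buffer that is already in memory. One possible way to fill that buffer is sketched below: read the `.bc` file from disk and sanity-check that it begins with the raw LLVM bitcode magic bytes (`'B'`, `'C'`, `0xC0`, `0xDE`) before handing it to the runtime. `readBinaryFile` and `looksLikeLlvmBitcode` are hypothetical helper names, not part of OpenCL or of the talk, and the check assumes the file is raw bitcode rather than a wrapped container.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads a whole file into memory; returns an empty vector if it cannot be opened.
std::vector<unsigned char> readBinaryFile(const std::string &path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return {};
    }
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>());
}

// True if the buffer starts with the raw LLVM bitcode magic: 'B' 'C' 0xC0 0xDE.
bool looksLikeLlvmBitcode(const std::vector<unsigned char> &bytes) {
    return bytes.size() >= 4 && bytes[0] == 'B' && bytes[1] == 'C' &&
           bytes[2] == 0xC0 && bytes[3] == 0xDE;
}
```

With a buffer loaded this way, `bytes.data()` and `bytes.size()` would take the place of `spir_bc` and `spir_bc_length` in the call to `clCreateProgramWithBinary`.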
  • 39. How to enable your language on GPUs // already initialized memory buffers for our context cl_mem in_mem = ... ; cl_mem out_mem = ... ; // assume our kernel function from the spir kernel was called foo cl_kernel kernel = clCreateKernel(program, "foo", NULL); // assume our kernel has one read buffer as first argument, and one write buffer as second clSetKernelArg(kernel, 0, sizeof(cl_mem), (void * )&in_mem); clSetKernelArg(kernel, 1, sizeof(cl_mem), (void * )&out_mem); Neil Henning
  • 40. How to enable your language on GPUs // already initialized command queue cl_command_queue queue = … ; cl_event write_event, run_event; clEnqueueWriteBuffer(queue, in_mem, CL_FALSE, 0, BUFFER_SIZE, &read_payload, 0, NULL, &write_event); const size_t size = BUFFER_SIZE / sizeof(cl_int); clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &size, NULL, 1, &write_event, &run_event); clEnqueueReadBuffer(queue, out_mem, CL_TRUE, 0, BUFFER_SIZE, &result_payload, 1, &run_event, NULL); Neil Henning
  • 41. How to enable your language on GPUs ● Now, create a simple OpenCL kernel void kernel foo(global int * in, global int * out) { out[get_global_id(0)] = in[get_global_id(0)]; } ● And use Intel’s command line (or GUI!) tool to build ioc32 -cmd=build -input -spir32=foo.bc Neil Henning
  • 42. How to enable your language on GPUs ● Next we point the buffer for our SPIR kernel at the generated SPIR kernel ● And it fails…? ● Turns out Intel’s OpenCL runtime doesn’t like us telling it that it is building SPIR! ● Simply remove "-x spir -spir-std=1.2" from the build options and voila! Neil Henning
  • 43. How to enable your language on GPUs ● Next step – use tip Clang to build our kernel clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc -o foo.bc ● Compiles ok, but when we run it fails…? ● So Clang generated SPIR bitcode file could very well not work ● We’ll take a look at the readable IR for the Intel & Clang compiled kernels Neil Henning
  • 44. How to enable your language on GPUs ● Clang Output ; ModuleID = '' target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024" target triple = "spir-unknown-unknown" ; Function Attrs: nounwind define void @foo(i32 addrspace(1)* nocapture readonly %a, i32 addrspace(1)* nocapture %b) #0 { entry: %0 = load i32 addrspace(1)* %a, align 4, !tbaa !2 store i32 %0, i32 addrspace(1)* %b, align 4, !tbaa !2 ret void } attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" } !opencl.kernels = !{!0} !llvm.ident = !{!1} !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo} !1 = metadata !{metadata !"clang version 3.4 (trunk)"} !2 = metadata !{metadata !3, metadata !3, i64 0} !3 = metadata !{metadata !"int", metadata !4, i64 0} !4 = metadata !{metadata !"omnipotent char", metadata !5, i64 0} !5 = metadata !{metadata !"Simple C/C++ TBAA"} Neil Henning
  • 45. How to enable your language on GPUs ● IOC Output ; ModuleID = 'ex.bc' target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024" target triple = "spir-unknown-unknown" define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind { %1 = alloca i32 addrspace(1)*, align 4 %2 = alloca i32 addrspace(1)*, align 4 store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4 store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4 %3 = load i32 addrspace(1)** %1, align 4 %4 = load i32 addrspace(1)* %3, align 4 %5 = load i32 addrspace(1)** %2, align 4 store i32 %4, i32 addrspace(1)* %5, align 4 ret void } !opencl.kernels = !{!0} !opencl.enable.FP_CONTRACT = !{} !opencl.spir.version = !{!6} !opencl.ocl.version = !{!7} !opencl.used.extensions = !{!8} !opencl.used.optional.core.features = !{!8} !opencl.compiler.options = !{!8} !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5} !1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1} !2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"} !3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"} !4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""} !5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"} !6 = metadata !{i32 1, i32 0} !7 = metadata !{i32 0, i32 0} !8 = metadata !{} Neil Henning
  • 46. How to enable your language on GPUs ● IOC Output ; ModuleID = '' target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024" target triple = "spir-unknown-unknown" define spir_kernel void @foo(i32 addrspace(1)* %a, i32 addrspace(1)* %b) nounwind { %1 = alloca i32 addrspace(1)*, align 4 %2 = alloca i32 addrspace(1)*, align 4 store i32 addrspace(1)* %a, i32 addrspace(1)** %1, align 4 store i32 addrspace(1)* %b, i32 addrspace(1)** %2, align 4 %3 = load i32 addrspace(1)** %1, align 4 %4 = load i32 addrspace(1)* %3, align 4 %5 = load i32 addrspace(1)** %2, align 4 store i32 %4, i32 addrspace(1)* %5, align 4 ret void } !opencl.kernels = !{!0} !opencl.enable.FP_CONTRACT = !{} !opencl.spir.version = !{!6} !opencl.ocl.version = !{!7} !opencl.used.extensions = !{!8} !opencl.used.optional.core.features = !{!8} !opencl.compiler.options = !{!8} !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5} !1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1} !2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"} !3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"} !4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""} !5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"} !6 = metadata !{i32 1, i32 0} !7 = metadata !{i32 0, i32 0} !8 = metadata !{} Neil Henning
  • 47. How to enable your language on GPUs ● So the metadata is different! ● We could fix Clang to produce the right metadata…? ● Or just hack around! ● Let’s use Intel’s compiler to generate a stub function ● Then we can use an extern function defined in our Clang module! Neil Henning
  • 48. How to enable your language on GPUs extern int doSomething(int a); void kernel foo(global int * in, global int * out) { int id = get_global_id(0); out[id] = doSomething(in[id]); } int doSomething(int a) { return a; } Neil Henning
  • 49. How to enable your language on GPUs ● And it fails…? ● Intel’s compiler doesn’t like extern functions! ● We’ve already bodged it thus far… ● So let’s continue! int __attribute__((weak)) doSomething(int a) {} void kernel foo(global int * in, global int * out) { int id = get_global_id(0); out[id] = doSomething(in[id]); } Neil Henning
  • 50. How to enable your language on GPUs ● More than a little nasty… ● Relies on Clang extension to declare function weak within OpenCL ● Relies on Intel using Clang and allowing extension ● But it works! ● Can build both the Intel stub code & the Clang actual code ● Then use llvm-link to pull them together! Neil Henning
  • 51. How to enable your language on GPUs ● So now we can compile two OpenCL kernels, link them together, and run it ● What is next? Want to enable your language! ● What about using Clang, but using a different language? ● C & C++ come to mind! Neil Henning
  • 52. How to enable your language on GPUs ● Use a simple C file int doSomething(int a) { return a; } ● And use Clang to compile it clang -cc1 -triple spir-unknown-unknown -emit-llvm-bc foo.c -o foo.bc Neil Henning
  • 53. How to enable your language on GPUs ● Or a simple C++ file! extern "C" int doSomething(int a); template<typename T> T templatedSomething(const T t) { return t; } int doSomething(int a) { return templatedSomething(a); } Neil Henning
  • 54. How to enable your language on GPUs ● Let’s have some real C++ code ● Use features that OpenCL doesn’t provide us ● We’ll do a matrix multiplication in C++ ● Use classes, constructors, templates Neil Henning
  • 55. How to enable your language on GPUs typedef float __attribute__((ext_vector_type(4))) float4; typedef float __attribute__((ext_vector_type(16))) float16; float __attribute__((overloadable)) dot(float4 a, float4 b); template<typename T, unsigned int WIDTH, unsigned int HEIGHT> class Matrix { typedef T __attribute__((ext_vector_type(WIDTH))) RowType; RowType rows[HEIGHT]; public: Matrix() {} template<typename U> Matrix(const U & u) { __builtin_memcpy(&rows, &u, sizeof(U)); } RowType & operator[](const unsigned int index) { return rows[index]; } const RowType & operator[](const unsigned int index) const { return rows[index]; } }; Neil Henning
  • 56. How to enable your language on GPUs template<typename T, unsigned int WIDTH, unsigned int HEIGHT> Matrix<T, WIDTH, HEIGHT> operator *(const Matrix<T, WIDTH, HEIGHT> & a, const Matrix<T, WIDTH, HEIGHT> & b) { Matrix<T, HEIGHT, WIDTH> bShuffled; for(unsigned int h = 0; h < HEIGHT; h++) for(unsigned int w = 0; w < WIDTH; w++) bShuffled[w][h] = b[h][w]; Matrix<T, WIDTH, HEIGHT> result; for(unsigned int h = 0; h < HEIGHT; h++) for(unsigned int w = 0; w < WIDTH; w++) result[h][w] = dot(a[h], bShuffled[w]); return result; } Neil Henning
  • 57. How to enable your language on GPUs extern "C" float16 doSomething(float16 a, float16 b); float16 doSomething(float16 a, float16 b) { Matrix<float, 4, 4> matA(a); Matrix<float, 4, 4> matB(b); Matrix<float, 4, 4> mul = matA * matB; float16 result = (float16 )0; result.s0123 = mul[0]; result.s4567 = mul[1]; result.s89ab = mul[2]; result.scdef = mul[3]; return result; } Neil Henning
  • 58. How to enable your language on GPUs ● And when we run it… ex5.vcxproj -> E:AMDDeveloperSummit2013buildExample5Debugex5.exe Found 2 platforms! Choosing vendor 'Intel(R) Corporation'! Found 1 devices! SPIR file length '3948' bytes! [ 0.0, 1.0, 2.0, 3.0] * [ 16.0, 15.0, 14.0, 13.0] = [ 40.0, 34.0, 28.0, 22.0] [ 4.0, 5.0, 6.0, 7.0] * [ 12.0, 11.0, 10.0, 9.0] = [200.0, 178.0, 156.0, 134.0] [ 8.0, 9.0, 10.0, 11.0] * [ 8.0, 7.0, 6.0, 5.0] = [360.0, 322.0, 284.0, 246.0] [ 12.0, 13.0, 14.0, 15.0] * [ 4.0, 3.0, 2.0, 1.0] = [520.0, 466.0, 412.0, 358.0] ● Success! Neil Henning
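To cross-check the numbers printed in the run above, here is a small host-only reference in standard C++ that follows the same scheme as the slide code: transpose `b`, then take row-by-row dot products. `Mat4`, `dot4`, `multiply` and `slideProduct` are hypothetical names used only for this sketch. The inputs are an assumption read off the printed output: the first matrix counts up from 0 and the second counts down from 16, both row-major.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Dot product of two four-element rows (stands in for the OpenCL dot builtin).
float dot4(const std::array<float, 4> &a, const std::array<float, 4> &b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Same scheme as the slide code: transpose b, then dot each row of a with each
// row of the transposed matrix.
Mat4 multiply(const Mat4 &a, const Mat4 &b) {
    Mat4 bShuffled{};
    for (unsigned h = 0; h < 4; ++h)
        for (unsigned w = 0; w < 4; ++w)
            bShuffled[w][h] = b[h][w];

    Mat4 result{};
    for (unsigned h = 0; h < 4; ++h)
        for (unsigned w = 0; w < 4; ++w)
            result[h][w] = dot4(a[h], bShuffled[w]);
    return result;
}

// Builds the assumed inputs (a = 0..15, b = 16..1) and multiplies them.
Mat4 slideProduct() {
    Mat4 a{}, b{};
    for (unsigned i = 0; i < 16; ++i) {
        a[i / 4][i % 4] = static_cast<float>(i);
        b[i / 4][i % 4] = static_cast<float>(16 - i);
    }
    return multiply(a, b);
}
```

Under those assumed inputs the first row of the product is 40, 34, 28, 22 and the last row is 520, 466, 412, 358, matching the output shown on the slide.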
  • 59. How to enable your language on GPUs ● The least you need to target a GPU; ● Generate correct LLVM IR with SPIR metadata ● Or at least generate LLVM IR and use the approach we used to combine Clang and IOC generated kernels !opencl.kernels = !{!0} !opencl.enable.FP_CONTRACT = !{} !opencl.spir.version = !{!6} !opencl.ocl.version = !{!7} !opencl.used.extensions = !{!8} !opencl.used.optional.core.features = !{!8} !opencl.compiler.options = !{!8} !0 = metadata !{void (i32 addrspace(1)*, i32 addrspace(1)*)* @foo, metadata !1, metadata !2, metadata !3, metadata !4, metadata !5} !1 = metadata !{metadata !"kernel_arg_addr_space", i32 1, i32 1} !2 = metadata !{metadata !"kernel_arg_access_qual", metadata !"none", metadata !"none"} !3 = metadata !{metadata !"kernel_arg_type", metadata !"int*", metadata !"int*"} !4 = metadata !{metadata !"kernel_arg_type_qual", metadata !"", metadata !""} !5 = metadata !{metadata !"kernel_arg_name", metadata !"a", metadata !"b"} !6 = metadata !{i32 1, i32 0} !7 = metadata !{i32 0, i32 0} !8 = metadata !{} Neil Henning
  • 60. How to enable your language on GPUs ● Porting C/C++ libraries to SPIR requires a little more work int foo(int * a) { return *a; } ● The data pointed to by ‘a’ will by default be put in the private address space ● But a straight conversion to SPIR needs all data in global address space ● Means that any porting of existing code could be quite intrusive Neil Henning
  • 61. How to enable your language on GPUs ● To target your language at GPUs ● Need to be able to segregate work into parallel chunks ● Have to ban certain features that don’t work with compute ● Need to deal with distinct address spaces ● Language could also provide an API onto OpenCL SPIR builtins ● But with OpenCL SPIR it is now possible to make any language work on a GPU! Neil Henning
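One way to picture "segregating work into parallel chunks" is to express the computation as a body that runs once per index, which is the shape an OpenCL kernel takes with `get_global_id(0)`. The sketch below emulates that on the host with a plain sequential loop. `forEachWorkItem` and `copyWithWorkItems` are hypothetical names, not an OpenCL API, and a real device would run the work-items concurrently and in no guaranteed order, so the body must not depend on other indices.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side stand-in for a one-dimensional kernel launch:
// calls the body once for every work-item index in [0, globalSize).
void forEachWorkItem(std::size_t globalSize,
                     const std::function<void(std::size_t)> &body) {
    for (std::size_t id = 0; id < globalSize; ++id) {
        body(id);
    }
}

// The copy kernel from the earlier slides, written as a per-index body.
std::vector<int> copyWithWorkItems(const std::vector<int> &in) {
    std::vector<int> out(in.size());
    forEachWorkItem(in.size(), [&](std::size_t id) { out[id] = in[id]; });
    return out;
}
```

A language front end that can lower its loops or maps into this per-index form has the main ingredient needed to emit a kernel; address spaces and banned features still have to be handled separately.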
  • 62. Developing tools for GPUs Neil Henning
  • 63. Developing tools for GPUs ● Tools increasingly required to support development ● Even having printf (which OpenCL 1.2 added) is novel! ● But with increasingly complex code better tools needed ● Main three are debuggers, profilers and compiler-tools Neil Henning
  • 64. Developing tools for GPUs ● Debuggers for compute are difficult for non-vendor to develop ● Codeplay has developed such tools on top of compute standards ● Problem is bedrock for these tools can change at any time ● Hard to beat vendor-owned approach that has lower-level access Neil Henning
  • 65. Developing tools for GPUs Our Language ● Codeplay are pushing hard for HSA to have features that aid tool development ● Debuggers are much easier with instruction support, debug info, change registers, call stacks Our Language ● OpenCL SPIR harder to create debugger for without vendor support ● Can we standardize a way to debug OpenCL SPIR, or allow debugging via emulation of SPIR? Neil Henning
  • 66. Developing tools for GPUs ● Profilers require superset of debugger feature-set ● Need to be able to trap kernels at defined points ● Accurate timings only other requirement beyond debugger support ● More fun when we go beyond performance, and measure power Neil Henning
  • 67. Developing tools for GPUs ● HSA and OpenCL SPIR both good profiler targets ● Could split SPIR kernels into profiling sections ● Then use existing timing information in OpenCL ● HSA will only require debugger features we are pushing for Neil Henning
  • 68. Developing tools for GPUs ● Compiler tools consist of optimizers and analysis ● Both HSA and OpenCL SPIR being based on LLVM enable this! ● We as compiler experts can aid existing runtimes ● You as developers can add optimizations & analyse your kernels! Neil Henning
  • 70. Conclusion ● With the rise of open standards, compute is increasingly easy ● With HSA & OpenCL SPIR hardware is finally open to us! ● Just need standards to ratify, mature & be available on hardware! ● Next big push into compute is upon us Neil Henning
  • 71. Questions? Can also catch me on twitter @sheredom Neil Henning
  • 72. Resources ● SPIR extension on Khronos website ● SPIR provisional specification ● HSA Foundation Neil Henning