SlideShare a Scribd company logo
CLWrap
Nonsense free control of your GPU
John Colvin
Why bother?
• CPUs are way more complicated than necessary
for a lot of computational tasks.
• They are optimised for serial workloads, parallelism
has been somewhat tacked on the side.
• GPUs are purpose-built to run massively parallel
computations at maximum speed, low cost and low
power consumption.
• Piggy-backing on the games industry’s economies
of scale
• Everybody already has one (or two, or three…)
GP-GPU programming
• NVIDIA CUDA
• Kronos OpenCL
• Compute shaders (various APIs)
• Or just do compute using
normal shaders
• Open standard
• Designed to work with any heterogeneous system,
not just CPU host + GPU device.
• Drivers are available for all major GPUs and even
CPUs (can target the CPU as the compute device).
• Has been somewhat overshadowed by CUDA,
party due to a confusing model and intimidating C
API.
Primitives of OpenCL
• platform : The driver, e.g. one from Intel for your CPU, one from NVIDIA
for your GPU. Always a shared entry point between them: selecting is
part of the OpenCL API. All the following are create w.r.t. one platform.
• device : E.g. one GPU.
• context : A group of devices.
• kernel : The entry point to some GPU code, compiled seperately for
each device + context pair.
• command_queue : Execution queue for one device, put a kernel on here
to get it run.
• buffer / image : data accessible from kernels. Bound to contexts,
the driver is responsible for making sure it is available on any given
device when needed.
The C API
Let’s just try and report the names
of the GPUs we’ve got…
// assume have already got a platform: platform_id
size_t nGpus;
cl_int err = clGetDeviceIDs(platform_id,
CL_DEVICE_TYPE_GPU, 0, NULL, &nGpus);
if (err != CL_SUCCESS) { /* handle error */ }
cl_device_id *gpus = malloc(nGpus * sizeof(cl_device_id));
if (!gpus) { /* handle error */ }
err = clGetDeviceIDs(platform_id,
CL_DEVICE_TYPE_GPU, nGpus, gpus, NULL);
if (err != CL_SUCCESS) { /* handle error */ }
char name[256] = {‘0’};
for (size_t i = 0; i < nDevices; ++i)
{
err = clGetDeviceInfo(gpus[i],
CL_DEVICE_NAME, 255, name, NULL);
if(err != CL_SUCCESS) { /* handle error */ }
puts(name);
}
Eww
Let’s try again in D
// assume have already got a platform, platform_id
size_t nGpus;
cl.int_ err = cl.getDeviceIDs(platform_id,
cl.DEVICE_TYPE_GPU, 0, null, &nGpus);
if (err != cl.SUCCESS) { /* handle error */ }
cl.device_id *gpus = cast(cl.device_id*)
malloc(nGpus * sizeof(cl.device_id));
if (!gpus) { /* handle error */ }
err = cl.getDeviceIDs(platform_id,
cl.DEVICE_TYPE_GPU, nGpus, gpus, NULL);
if (err != cl.SUCCESS) { /* handle error */ }
char[256] name = ‘0’;
for (size_t i = 0; i < nDevices; ++i)
{
err = cl.getDeviceInfo(gpus[i],
cl.DEVICE_NAME, 255, name, NULL);
if(err != cl.SUCCESS) { /* handle error */ }
puts(name);
}
import clWrap, std.stdio, std.algorithm, std.conv, std.array;
int main(string[] args)
{
if (args.length != 2 || args[1].empty)
{
stderr.writeln("Error: please provide platform name");
return 1;
}
auto platform = getChosenPlatform(args[1]);
auto devices = platform.getDevices(cl.DEVICE_TYPE_GPU);
devices.map!(getInfo!(cl.DEVICE_NAME))
.joiner("n").writeln;
return 0;
}
Tldr;
OpenCL, threads and data:
Work
items
Local
Memory
Global
Memory
Work
Group
Device
NDRanges
(0,0)
(0,0)
(1,0)
(1,0)
(2,0)
(2,0)
(3,0)
(3,0)
(4,0)
(0,0)
(5,0)
(1,0)
(6,0)
(2,0)
(7,0)
(3,0)
(0,1)
(0,1)
(1,1)
(1,1)
(2,1)
(2,1)
(3,1)
(3,1)
(4,1)
(0,1)
(5,1)
(1,1)
(6,1)
(2,1)
(7,1)
(3,1)
(0,2)
(0,0)
(1,2)
(1,0)
(2,2)
(2,0)
(3,2)
(3,0)
(4,2)
(0,0)
(5,2)
(1,0)
(6,2)
(2,0)
(7,2)
(3,0)
(0,3)
(0,1)
(1,3)
(1,1)
(2,3)
(2,1)
(3,3)
(3,1)
(4,3)
(0,1)
(5,3)
(1,1)
(6,3)
(2,1)
(7,3)
(3,1)
(0,4)
(0,0)
(1,4)
(1,0)
(2,4)
(2,0)
(3,4)
(3,0)
(4,4)
(0,0)
(5,4)
(1,0)
(6,4)
(2,0)
(7,4)
(3,0)
(0,5)
(0,1)
(1,5)
(1,1)
(2,5)
(2,1)
(3,5)
(3,1)
(4,5)
(0,1)
(5,5)
(1,1)
(6,5)
(2,1)
(7,5)
(3,1)
(x,y): global id
(x,y): local id
OpenCL C
• The language in which you write your device code.
• A restricted C with some extra builtins and funny type
qualifiers.
• Compiled at runtime for the exact target hardware.
• pointers are tagged with global or local to specify
which memory they point to.
• No pointers to pointers, everything is flat.
• a function annotated with kernel is an entry point, think
like a main.
Some boring kernels:
kernel void myKernel(global float *input, float b)
{
size_t i = get_global_id(0);
input[i] = exp(input[i] * b);
}
kernel void myOtherKernel(global float *input,
global float *output) {
size_t idx = get_global_id(0) * get_global_size(1)
+ get_global_id(1);
output[idx] = 1.0f / input[idx];
}
kernel void myStrangeKernel(global float *input,
global float *output) {
//assuming random_integer defined elsewhere
size_t i_in = random_integer(0, get_global_size(0))
+ get_global_offset(0);
size_t i_out = get_global_id(0);
output[i_out] = input[i_in];
}
clWrap architecture
• Layer 1: Take the OpenCL C API and make it
strictly typed
• Layer 2: Take this strictly typed layer and layer a
more D-appropriate API on top
• Layer 3: Go to town. High level abstractions to
enable people to easily execute code on co-
processors. Based on Layer 2.
Let’s do it live!
Notes:
• The OpenCL API is quite big and it’s only getting bigger
• The lack of assumptions and the implied choice that this
gives you is both OpenCL’s main strength and biggest
weakness
• clWrap doesn’t aim or need to be a comprehensive D-
like wrapper of every possible use-case
• It sits on top of the C API, streamlining common tasks.
• The user can drop down to the C API at any point
without inconvenience or fear of invalidating future use
of clWrap functions.
clMap
1 to 1 mapping
of elements to
work items
Data Work Items
auto myKernel = CLKernelDef!("someKernel",
CLBuffer!float, "input",
float, "b")
(q{
input[gLinId] = exp(cos(input[gLinId] * b));
});
void main()
{
auto output = new float[](1000);
auto input = iota(1000).map!(to!float).array;
auto inputBuff = input
.sliced(100, 10)
.toBuffer(cl.MEM_COPY_HOST_PTR);
inputBuff.clMap!myKernel(3.4f).enqueue;
inputBuff.read(output).writeln;
}
Why not OpenCL 2.x?
• NVIDIA: only recently started supporting OpenCL
1.2, who knows how long until 2.0 (2.1? 2.2?)
• Derelict-CL: only supports 1.2
• Will probably move to support it anyway, the
opportunities are too good to pass up.
What’s next?
• Massive cleanup. A product v.s. an accretion of ideas
• Testing, testing…
• Get information from people about the things that really
annoy them in OpenCL and GPU programming in
general, fix them.
• Make D the obvious choice for cross-platform GPU
control.
• Make general-purpose GPU usage a normal thing to do.
inputBuff.clMap!(q{
*p0 = cos(*p0 * p1 + p2)
})(3.4, 42)
.read(output)
.writeln;
inputBuff.clMap!(
(p0, p1, p2) => cos(p0 * p1 + p2)
)(3.4, 42)
.read(output)
.writeln;
?

More Related Content

What's hot

TensorFlow XLA RPC
TensorFlow XLA RPCTensorFlow XLA RPC
TensorFlow XLA RPC
Mr. Vengineer
 
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2 「エッジAIモダン計測制御の世界」オ...
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2  「エッジAIモダン計測制御の世界」オ...Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2  「エッジAIモダン計測制御の世界」オ...
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2 「エッジAIモダン計測制御の世界」オ...
Mr. Vengineer
 
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Mr. Vengineer
 
DLL Design with Building Blocks
DLL Design with Building BlocksDLL Design with Building Blocks
DLL Design with Building Blocks
Max Kleiner
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
Sergey Platonov
 
Return of c++
Return of c++Return of c++
Return of c++
Yongwei Wu
 
200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience
Andrey Karpov
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
HSA Foundation
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
Yung-Yu Chen
 
Job Queue in Golang
Job Queue in GolangJob Queue in Golang
Job Queue in Golang
Bo-Yi Wu
 
用 Go 語言打造多台機器 Scale 架構
用 Go 語言打造多台機器 Scale 架構用 Go 語言打造多台機器 Scale 架構
用 Go 語言打造多台機器 Scale 架構
Bo-Yi Wu
 
Facebook Glow Compiler のソースコードをグダグダ語る会
Facebook Glow Compiler のソースコードをグダグダ語る会Facebook Glow Compiler のソースコードをグダグダ語る会
Facebook Glow Compiler のソースコードをグダグダ語る会
Mr. Vengineer
 
Optimizing Communicating Event-Loop Languages with Truffle
Optimizing Communicating Event-Loop Languages with TruffleOptimizing Communicating Event-Loop Languages with Truffle
Optimizing Communicating Event-Loop Languages with Truffle
Stefan Marr
 
2018 cosup-delete unused python code safely - english
2018 cosup-delete unused python code safely - english2018 cosup-delete unused python code safely - english
2018 cosup-delete unused python code safely - english
Jen Yee Hong
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012
aleks-f
 
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Gor Nishanov,  C++ Coroutines – a negative overhead abstractionGor Nishanov,  C++ Coroutines – a negative overhead abstraction
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Sergey Platonov
 
Fuzzing: The New Unit Testing
Fuzzing: The New Unit TestingFuzzing: The New Unit Testing
Fuzzing: The New Unit Testing
Dmitry Vyukov
 
PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)
Andrey Karpov
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
Ofer Rosenberg
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageable
corehard_by
 

What's hot (20)

TensorFlow XLA RPC
TensorFlow XLA RPCTensorFlow XLA RPC
TensorFlow XLA RPC
 
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2 「エッジAIモダン計測制御の世界」オ...
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2  「エッジAIモダン計測制御の世界」オ...Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2  「エッジAIモダン計測制御の世界」オ...
Google Edge TPUで TensorFlow Liteを使った時に 何をやっているのかを妄想してみる 2 「エッジAIモダン計測制御の世界」オ...
 
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)Bridge TensorFlow to run on Intel nGraph backends (v0.4)
Bridge TensorFlow to run on Intel nGraph backends (v0.4)
 
DLL Design with Building Blocks
DLL Design with Building BlocksDLL Design with Building Blocks
DLL Design with Building Blocks
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Return of c++
Return of c++Return of c++
Return of c++
 
200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience
 
ISCA Final Presentaiton - Compilations
ISCA Final Presentaiton -  CompilationsISCA Final Presentaiton -  Compilations
ISCA Final Presentaiton - Compilations
 
Start Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New RopeStart Wrap Episode 11: A New Rope
Start Wrap Episode 11: A New Rope
 
Job Queue in Golang
Job Queue in GolangJob Queue in Golang
Job Queue in Golang
 
用 Go 語言打造多台機器 Scale 架構
用 Go 語言打造多台機器 Scale 架構用 Go 語言打造多台機器 Scale 架構
用 Go 語言打造多台機器 Scale 架構
 
Facebook Glow Compiler のソースコードをグダグダ語る会
Facebook Glow Compiler のソースコードをグダグダ語る会Facebook Glow Compiler のソースコードをグダグダ語る会
Facebook Glow Compiler のソースコードをグダグダ語る会
 
Optimizing Communicating Event-Loop Languages with Truffle
Optimizing Communicating Event-Loop Languages with TruffleOptimizing Communicating Event-Loop Languages with Truffle
Optimizing Communicating Event-Loop Languages with Truffle
 
2018 cosup-delete unused python code safely - english
2018 cosup-delete unused python code safely - english2018 cosup-delete unused python code safely - english
2018 cosup-delete unused python code safely - english
 
Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012Dynamic C++ Silicon Valley Code Camp 2012
Dynamic C++ Silicon Valley Code Camp 2012
 
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
Gor Nishanov,  C++ Coroutines – a negative overhead abstractionGor Nishanov,  C++ Coroutines – a negative overhead abstraction
Gor Nishanov, C++ Coroutines – a negative overhead abstraction
 
Fuzzing: The New Unit Testing
Fuzzing: The New Unit TestingFuzzing: The New Unit Testing
Fuzzing: The New Unit Testing
 
PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)PVS-Studio for Linux (CoreHard presentation)
PVS-Studio for Linux (CoreHard presentation)
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 
How to make a large C++-code base manageable
How to make a large C++-code base manageableHow to make a large C++-code base manageable
How to make a large C++-code base manageable
 

Similar to clWrap: Nonsense free control of your GPU

Java gpu computing
Java gpu computingJava gpu computing
Java gpu computing
Arjan Lamers
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - Confoo
SirKetchup
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
JunZhao68
 
Ensemble: A DSL for Concurrency, Adaptability, and Distribution
Ensemble: A DSL for Concurrency, Adaptability, and DistributionEnsemble: A DSL for Concurrency, Adaptability, and Distribution
Ensemble: A DSL for Concurrency, Adaptability, and Distribution
jhebus
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
Unai Lopez-Novoa
 
Boost.Compute GTC 2015
Boost.Compute GTC 2015Boost.Compute GTC 2015
Boost.Compute GTC 2015
kylelutz
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
Fastly
 
Virtual platform
Virtual platformVirtual platform
Virtual platform
sean chen
 
OpenCL Heterogeneous Parallel Computing
OpenCL Heterogeneous Parallel ComputingOpenCL Heterogeneous Parallel Computing
OpenCL Heterogeneous Parallel Computing
João Paulo Leonidas Fernandes Dias da Silva
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
AnastasiaStulova
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
Tigabu Yaya
 
MattsonTutorialSC14.pdf
MattsonTutorialSC14.pdfMattsonTutorialSC14.pdf
MattsonTutorialSC14.pdf
George Papaioannou
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
gopikahari7
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Juan Fumero
 
Gpus graal
Gpus graalGpus graal
Gpus graal
Juan Fumero
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
Steve Severance
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
AMD Developer Central
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
AMD Developer Central
 
Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)
Andrei Varanovich
 
Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)
Andrei Varanovich
 

Similar to clWrap: Nonsense free control of your GPU (20)

Java gpu computing
Java gpu computingJava gpu computing
Java gpu computing
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - Confoo
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
 
Ensemble: A DSL for Concurrency, Adaptability, and Distribution
Ensemble: A DSL for Concurrency, Adaptability, and DistributionEnsemble: A DSL for Concurrency, Adaptability, and Distribution
Ensemble: A DSL for Concurrency, Adaptability, and Distribution
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Boost.Compute GTC 2015
Boost.Compute GTC 2015Boost.Compute GTC 2015
Boost.Compute GTC 2015
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
Virtual platform
Virtual platformVirtual platform
Virtual platform
 
OpenCL Heterogeneous Parallel Computing
OpenCL Heterogeneous Parallel ComputingOpenCL Heterogeneous Parallel Computing
OpenCL Heterogeneous Parallel Computing
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
MattsonTutorialSC14.pdf
MattsonTutorialSC14.pdfMattsonTutorialSC14.pdf
MattsonTutorialSC14.pdf
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
 
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in JavaRuntime Code Generation and Data Management for Heterogeneous Computing in Java
Runtime Code Generation and Data Management for Heterogeneous Computing in Java
 
Gpus graal
Gpus graalGpus graal
Gpus graal
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)Conflux: gpgpu for .net (en)
Conflux: gpgpu for .net (en)
 
Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)Conflux:gpgpu for .net (en)
Conflux:gpgpu for .net (en)
 

Recently uploaded

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Fwdays
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
Fwdays
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 

Recently uploaded (20)

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk"Frontline Battles with DDoS: Best practices and Lessons Learned",  Igor Ivaniuk
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
"Scaling RAG Applications to serve millions of users",  Kevin Goedecke"Scaling RAG Applications to serve millions of users",  Kevin Goedecke
"Scaling RAG Applications to serve millions of users", Kevin Goedecke
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 

clWrap: Nonsense free control of your GPU

  • 1. CLWrap Nonsense free control of your GPU John Colvin
  • 3. • CPUs are way more complicated than necessary for a lot of computational tasks. • They are optimised for serial workloads, parallelism has been somewhat tacked on the side. • GPUs are purpose-built to run massively parallel computations at maximum speed, low cost and low power consumption. • Piggy-backing on the games industry’s economies of scale • Everybody already has one (or two, or three…)
  • 4. GP-GPU programming • NVIDIA CUDA • Kronos OpenCL • Compute shaders (various APIs) • Or just do compute using normal shaders
  • 5. • Open standard • Designed to work with any heterogeneous system, not just CPU host + GPU device. • Drivers are available for all major GPUs and even CPUs (can target the CPU as the compute device). • Has been somewhat overshadowed by CUDA, party due to a confusing model and intimidating C API.
  • 6. Primitives of OpenCL • platform : The driver, e.g. one from Intel for your CPU, one from NVIDIA for your GPU. Always a shared entry point between them: selecting is part of the OpenCL API. All the following are create w.r.t. one platform. • device : E.g. one GPU. • context : A group of devices. • kernel : The entry point to some GPU code, compiled seperately for each device + context pair. • command_queue : Execution queue for one device, put a kernel on here to get it run. • buffer / image : data accessible from kernels. Bound to contexts, the driver is responsible for making sure it is available on any given device when needed.
  • 7. The C API Let’s just try and report the names of the GPUs we’ve got…
  • 8. // assume have already got a platform: platform_id size_t nGpus; cl_int err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 0, NULL, &nGpus); if (err != CL_SUCCESS) { /* handle error */ } cl_device_id *gpus = malloc(nGpus * sizeof(cl_device_id)); if (!gpus) { /* handle error */ } err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, nGpus, gpus, NULL); if (err != CL_SUCCESS) { /* handle error */ } char name[256] = {‘0’}; for (size_t i = 0; i < nDevices; ++i) { err = clGetDeviceInfo(gpus[i], CL_DEVICE_NAME, 255, name, NULL); if(err != CL_SUCCESS) { /* handle error */ } puts(name); }
  • 9. Eww
  • 11. // assume have already got a platform, platform_id size_t nGpus; cl.int_ err = cl.getDeviceIDs(platform_id, cl.DEVICE_TYPE_GPU, 0, null, &nGpus); if (err != cl.SUCCESS) { /* handle error */ } cl.device_id *gpus = cast(cl.device_id*) malloc(nGpus * sizeof(cl.device_id)); if (!gpus) { /* handle error */ } err = cl.getDeviceIDs(platform_id, cl.DEVICE_TYPE_GPU, nGpus, gpus, NULL); if (err != cl.SUCCESS) { /* handle error */ } char[256] name = ‘0’; for (size_t i = 0; i < nDevices; ++i) { err = cl.getDeviceInfo(gpus[i], cl.DEVICE_NAME, 255, name, NULL); if(err != cl.SUCCESS) { /* handle error */ } puts(name); }
  • 12. import clWrap, std.stdio, std.algorithm, std.conv, std.array; int main(string[] args) { if (args.length != 2 || args[1].empty) { stderr.writeln("Error: please provide platform name"); return 1; } auto platform = getChosenPlatform(args[1]); auto devices = platform.getDevices(cl.DEVICE_TYPE_GPU); devices.map!(getInfo!(cl.DEVICE_NAME)) .joiner("n").writeln; return 0; }
  • 16. OpenCL C • The language in which you write your device code. • A restricted C with some extra builtins and funny type qualifiers. • Compiled at runtime for the exact target hardware. • pointers are tagged with global or local to specify which memory they point to. • No pointers to pointers, everything is flat. • a function annotated with kernel is an entry point, think like a main.
  • 17. Some boring kernels: kernel void myKernel(global float *input, float b) { size_t i = get_global_id(0); input[i] = exp(input[i] * b); } kernel void myOtherKernel(global float *input, global float *output) { size_t idx = get_global_id(0) * get_global_size(1) + get_global_id(1); output[idx] = 1.0f / input[idx]; } kernel void myStrangeKernel(global float *input, global float *output) { //assuming random_integer defined elsewhere size_t i_in = random_integer(0, get_global_size(0)) + get_global_offset(0); size_t i_out = get_global_id(0); output[i_out] = input[i_in]; }
  • 18. clWrap architecture • Layer 1: Take the OpenCL C API and make it strictly typed • Layer 2: Take this strictly typed layer and layer a more D-appropriate API on top • Layer 3: Go to town. High level abstractions to enable people to easily execute code on co- processors. Based on Layer 2.
  • 19. Let’s do it live!
  • 20. Notes: • The OpenCL API is quite big and it’s only getting bigger • The lack of assumptions and the implied choice that this gives you is both OpenCL’s main strength and biggest weakness • clWrap doesn’t aim or need to be a comprehensive D- like wrapper of every possible use-case • It sits on top of the C API, streamlining common tasks. • The user can drop down to the C API at any point without inconvenience or fear of invalidating future use of clWrap functions.
  • 21. clMap 1 to 1 mapping of elements to work items Data Work Items
  • 22. auto myKernel = CLKernelDef!("someKernel", CLBuffer!float, "input", float, "b") (q{ input[gLinId] = exp(cos(input[gLinId] * b)); }); void main() { auto output = new float[](1000); auto input = iota(1000).map!(to!float).array; auto inputBuff = input .sliced(100, 10) .toBuffer(cl.MEM_COPY_HOST_PTR); inputBuff.clMap!myKernel(3.4f).enqueue; inputBuff.read(output).writeln; }
  • 23. Why not OpenCL 2.x? • NVIDIA: only recently started supporting OpenCL 1.2, who knows how long until 2.0 (2.1? 2.2?) • Derelict-CL: only supports 1.2 • Will probably move to support it anyway, the opportunities are too good to pass up.
  • 24. What’s next? • Massive cleanup. A product v.s. an accretion of ideas • Testing, testing… • Get information from people about the things that really annoy them in OpenCL and GPU programming in general, fix them. • Make D the obvious choice for cross-platform GPU control. • Make general-purpose GPU usage a normal thing to do.
  • 25. inputBuff.clMap!(q{ *p0 = cos(*p0 * p1 + p2) })(3.4, 42) .read(output) .writeln; inputBuff.clMap!( (p0, p1, p2) => cos(p0 * p1 + p2) )(3.4, 42) .read(output) .writeln;
  • 26. ?