BOLT UPDATE
BEN SANDER
AMD SENIOR FELLOW
OUTLINE

Introduction

Key Strategic Directions:
•Portability
•Programmability
•Performance

Conclusions

2 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013
INTRODUCTION AND MOTIVATION
 What is Bolt?
‒ C++ Template Library for GPU and multi-core CPU programming
‒ Optimized library routines for common GPU operations
‒ CPU optimized as well (high-performance, multi-core CPU routines)

‒ Works with open standards (OpenCL™ and C++ AMP)
‒ Distributed as open source

 Make GPU programming as easy as CPU programming
‒ Resembles familiar C++ Standard Template Library
‒ Customizable via C++ template parameters
‒ Single source base for GPU and CPU
‒ Functional and performance portability

‒ Improves developer productivity

 Well-suited for HSA
‒ Leverage high-performance shared virtual memory
‒ No data copies; can use pointers in data structures
SIMPLE BOLT EXAMPLE
#include <bolt/amp/sort.h>
#include <vector>
#include <algorithm>
#include <cstdlib>   // rand
int main()
{
    // generate random data (on host)
    std::vector<int> a(1000000);
    std::generate(a.begin(), a.end(), rand);
    // sort, run on best device
    bolt::amp::sort(a.begin(), a.end());
    return 0;
}

Interface similar to familiar C++ Standard Template Library
No explicit mention of C++ AMP or OpenCL™ (or GPU!)
‒ More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™

Direct use of host data structures (i.e., std::vector)
bolt::amp::sort implicitly runs on the platform
‒ Runtime automatically selects CPU or GPU (or both)
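
For comparison, here is a minimal standard-library sketch of the same operation. It is CPU-only and does not use Bolt; `sort_on_host` is a hypothetical helper name introduced for illustration. It only shows that the Bolt call on the slide mirrors the iterator-pair interface of `std::sort`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not part of Bolt): fills a vector with pseudo-random
// values and sorts it on the host with std::sort. The slide's version differs
// only in the namespace of the sort call (bolt::amp::sort).
std::vector<int> sort_on_host(std::size_t n)
{
    std::vector<int> a(n);
    std::generate(a.begin(), a.end(), std::rand);  // random data on the host
    std::sort(a.begin(), a.end());                 // same iterator-pair interface
    assert(std::is_sorted(a.begin(), a.end()));    // postcondition check
    return a;
}
```

Calling `sort_on_host(1000000)` would reproduce the data size used on the slide.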
BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR
#include <bolt/amp/transform.h>
#include <vector>
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator() (const float &xx, const float &yy) restrict(cpu,amp)
    {
        return _a * xx + yy;
    }
};
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    return 0;
}

BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
#include <bolt/amp/transform.h>
#include <vector>
int main()
{
    const float a = 100;
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    // saxpy with C++ Lambda
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(),
        [=] (float xx, float yy) restrict(cpu, amp) {
            return a * xx + yy;
        });
    return 0;
}

Functor (“a * xx + yy”) now specified inline at the call site (more natural)
Can capture variables from surrounding scope (“a”) – eliminate boilerplate class
C++11 improves interface for template function libraries
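
The capture-by-value pattern above can be tried without a GPU toolchain. The following is a CPU-only sketch using `std::transform` as a stand-in for `bolt::amp::transform`; `saxpy_host` is a hypothetical name, and the lambda omits `restrict(cpu, amp)` because that qualifier is a C++ AMP extension.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU-only stand-in for the slide's bolt::amp::transform call.
// saxpy_host is a hypothetical helper, not a Bolt function.
std::vector<float> saxpy_host(float a,
                              const std::vector<float>& x,
                              const std::vector<float>& y)
{
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<float> z(x.size());
    // The lambda captures "a" from the enclosing scope by value,
    // so no separate functor class is needed.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

With `a = 100`, inputs `x = {1, 2}` and `y = {3, 4}` give `z = {103, 204}`.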
BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR
#include <bolt/cl/transform.h>
#include <vector>

BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator() (const float &xx, const float &yy)
    {
        return _a * xx + yy;
    }
};
);
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    bolt::cl::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    return 0;
}


Similar syntax to C++AMP

Macros are used to make functors visible to the OpenCL™ compiler.
Functor code is compiled on first Bolt call
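
One common way a macro can make the same code visible both to the host compiler and to a runtime compiler is preprocessor stringification. The sketch below illustrates that idea only; `SKETCH_FUNCTOR` and `functor_source` are hypothetical names, and this is not Bolt's actual `BOLT_FUNCTOR` definition.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro, NOT Bolt's actual BOLT_FUNCTOR definition.
// It (1) emits the code unchanged for the host compiler and
// (2) keeps a stringified copy that could later be handed to a
//     runtime kernel compiler.
#define SKETCH_FUNCTOR(NAME, ...) \
    __VA_ARGS__ \
    static const char* const NAME##_source = #__VA_ARGS__;

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    explicit SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const { return _a * xx + yy; }
};
)

// Returns the source text captured by the macro.
std::string functor_source()
{
    assert(SaxpyFunctor_source != nullptr);
    return std::string(SaxpyFunctor_source);
}
```

The host can call `SaxpyFunctor` as ordinary C++, while `functor_source()` returns the text of the same struct definition.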
BOLT 1.1
 Timeline:
‒ July 2012 (AFDS-2012) : Announced Bolt
‒ July-2013 : Bolt v1.0 General Availability
‒ Nov-2013 (Now!): Bolt v1.1 General Availability
‒ Additional functions and optimizations
‒ Windows®7, Windows®8, Windows® 8.1, and Linux® Support

 Open-source and available here:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-appsdk/bolt-c-template-library/
‒ https://github.com/HSA-Libraries/Bolt

 Contains 30-40 template functions
‒ Includes transform, sort, stable sort, scan, reduce

 OpenCL™ and multi-core CPU paths for all functions
 Linux® and Windows®
 Supports Microsoft® Visual Studio ® 2010, Visual Studio ® 2012, GCC 4.6/4.7/4.8
BOLT 1.1 FUNCTION SUPPORT MATRIX
API                        OpenCL™ (GPU)   C++AMP (GPU)   Multicore TBB (CPU)   Serial (CPU)
constant_iterator          YES             NO             YES                   YES
copy                       YES             NO             YES                   YES
copy_n                     YES             NO             YES                   YES
count                      YES             YES            YES                   YES
count_if                   YES             YES            YES                   YES
counting_iterator          YES             NO             YES                   YES
device_vector              YES             YES            YES                   YES
exclusive_scan             YES             YES            YES                   YES
exclusive_scan_by_key      YES             NO             YES                   YES
fill                       YES             NO             YES                   YES
fill_n                     YES             NO             YES                   YES
generate                   YES             NO             YES                   YES
generate_n                 YES             NO             YES                   YES
inclusive_scan             YES             YES            YES                   YES
inclusive_scan_by_key      YES             NO             YES                   YES
inner_product              YES             NO             YES                   YES
max_element                YES             NO             YES                   YES

API                        OpenCL™ (GPU)   C++AMP (GPU)   Multicore TBB (CPU)   Serial (CPU)
min_element                YES             NO             YES                   YES
reduce                     YES             YES            YES                   YES
reduce_by_key              YES             NO             YES                   YES
sort                       YES             YES            YES                   YES
sort_by_key                YES             NO             YES                   YES
stable_sort                YES             NO             YES                   YES
stable_sort_by_key         YES             NO             YES                   YES
transform                  YES             YES            YES                   YES
transform_exclusive_scan   YES             NO             YES                   YES
transform_inclusive_scan   YES             NO             YES                   YES
transform_reduce           YES             YES            YES                   YES
binary_search              YES             NO             YES                   YES
merge                      YES             NO             YES                   YES
scatter                    YES             NO             YES                   YES
scatter_if                 YES             NO             YES                   YES
gather                     YES             NO             YES                   YES
gather_if                  YES             NO             YES                   YES
BOLT BENEFITS FOR PROGRAMMERS
 Single-source can target GPUs and/or multi-core CPU
 Optimized GPU library implementations for common operations (sort, scan, reduce, etc.)
 Familiar STL-like C++ syntax
‒ Kernels created automatically

 STL-like “device_vector” to simplify memory management
‒ Supports typed memory allocation

 Benefits for OpenCL users:
‒ Sensible selection of defaults (platform, context, device, queue)
‒ With optional overrides for users who want more control
‒ Kernels compiled automatically on first call.
‒ Simplified, C++-style kernel calling convention (replaces clSetKernelArg)

Portability
PORTABILITY
 Current State of Bolt OS and Vendor Portability:
‒ Support C++AMP
‒ Run on any DX11-compliant video card
‒ But: no Linux® solution

‒ Support OpenCL™
‒ Run on Windows® and Linux®
‒ But: use AMD C++ static kernel feature which is only available on AMD OpenCL™

 Future : Improving portability
‒ C++AMP
‒ Provide Linux® port for C++AMP
‒ Increase the number of Bolt APIs which are supported with C++AMP

‒ OpenCL™ :
‒ Provide translator tool for C++ static kernel language

OPEN-SOURCE C++ AMP TOOLCHAIN

[Diagram: C++AMP source -> CLANG front-end -> LLVM-IR or SPIR 1.2 -> LLVM compiler; outputs: HSAIL -> any HSA implementation, SPIR 1.2 -> any OpenCL™+SPIR implementation]

 Open-source!

 Preliminary Version: https://bitbucket.org/multicoreware/cppamp-driver/
 Samples: https://bitbucket.org/multicoreware/cxxamp_sandbox
 More development coming in 1H-2014
C++ STATIC KERNEL LANGUAGE
AND NEW TRANSLATOR TOOL

 AMD C++ Static Kernel Language (aka “OpenCL-C++ Kernel Language”)
‒ Adds C++ features to OpenCL-C kernel language
‒ Templates !
‒ Classes
‒ Also: Namespaces, Inheritance, References, “this” operator, more
‒ Available as an OpenCL™ extension on AMD platforms
‒ http://developer.amd.com/wordpress/media/2012/10/CPP_kernel_language.pdf

 New translator tool is designed to bring these benefits to any OpenCL™ Implementation:

[Diagram: Bolt code -> C++ template instantiation -> OpenCL C++ code -> Translator -> OpenCL C code -> any OpenCL™ implementation]

 Translator expected to be available in Q1-2014
FUTURE BOLT SUPPORT MATRIX

            OpenCL™               C++ AMP
Windows®    Any OpenCL™ vendor    Any DX11 vendor
Linux®      Any OpenCL™ vendor    Any OpenCL™ SPIR or HSAIL vendor

 Green shows additional platforms on which Bolt will run accelerated code paths
 Bolt can also run multi-core CPU paths if GPU acceleration is not available

BOLT DEMO
 OpenMM : Open-source simulation tool for molecular simulation

 See how minor code modifications to a large application enable significant acceleration.
 See Bolt code translated and running on Intel® OpenCL™ platform
Programmability
SHARED VIRTUAL MEMORY - RECAP
 CPU and GPU share virtual memory space
 Can pass pointers between CPU and GPU
 GPU can access terabytes of pageable virtual memory
 High performance from all devices
 Called “Shared Virtual Memory” or “SVM”
 Key feature of HSA
[Diagram: CPU0 and GPU each translate addresses (VA->PA) from the shared VIRTUAL MEMORY to PHYSICAL MEMORY]
PROGRAMMABILITY
HOW SVM MAKES BOLT BETTER

 Bolt today provides
‒ Familiar, programmer-friendly interface for accelerated programming
‒ Single-source portability to GPUs and multi-core CPUs

 SVM + Bolt:
‒ Efficient access to host memory from GPU
‒ No need for “heap management” between host and device memory
‒ No copies, less overhead
‒ Single address space is a natural fit for Bolt single-source interface
‒ Program like a multi-core CPU, benefit from GPU-like acceleration

‒ Ability to use pointer-containing data structures
‒ Dramatically expands the code that can use functors and template libraries – functors can contain pointers!

CONVOLUTION / SOBEL EDGE FILTER
Compute Kernel applied to
each pixel:
Gx =
[ -1  0 +1 ]
[ -2  0 +2 ]
[ -1  0 +1 ]

Gy =
[ -1 -2 -1 ]
[  0  0  0 ]
[ +1 +2 +1 ]

G = sqrt(Gx^2 + Gy^2)

Challenge: Need to
examine surrounding
pixels.
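
As a worked example of the formula, the sketch below applies the two 3x3 kernels to a single neighborhood on the CPU. `sobel_magnitude` is a hypothetical helper name, and the row-major `p[row][col]` layout is an assumption made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Applies the two 3x3 Sobel kernels from the slide to one 3x3 neighborhood
// (p[row][col]) and returns the gradient magnitude G = sqrt(Gx^2 + Gy^2).
// sobel_magnitude is a hypothetical helper, not part of Bolt.
double sobel_magnitude(const int p[3][3])
{
    assert(p != nullptr);
    static const int kx[3][3] = { {-1, 0, +1}, {-2, 0, +2}, {-1, 0, +1} };
    static const int ky[3][3] = { {-1, -2, -1}, { 0, 0, 0}, {+1, +2, +1} };
    int gx = 0, gy = 0;
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            gx += kx[r][c] * p[r][c];  // horizontal gradient
            gy += ky[r][c] * p[r][c];  // vertical gradient
        }
    }
    return std::sqrt(static_cast<double>(gx * gx + gy * gy));
}
```

For a vertical edge such as `{{0,0,10},{0,0,10},{0,0,10}}`, Gx = 10 + 20 + 10 = 40 and Gy = 0, so G = 40. A flat patch gives G = 0.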

SOBEL CODE EXAMPLE
struct SobelFilter {
    // Pointers in the struct
    uchar *inImg;
    uchar *outImg;
    int w, h;

    SobelFilter(uchar *inI, uchar *outI, int xw, int xh) :
        inImg(inI), outImg(outI), w(xw), h(xh) {}

    void operator() (int y, int x)
    {
        int i = y*w + x;
        int gx =
              1 * *(inImg + i - 4 - w)
            + 2 * *(inImg + i - w)
            + 1 * *(inImg + i + 4 - w)
            + -1 * *(inImg + i - 4 + w)
            + -2 * *(inImg + i + w)
            + -1 * *(inImg + i + 4 + w);
        int gy =
              1 * *(inImg + i - 4 - w)
            + -1 * *(inImg + i + 4 - w)
            + 2 * *(inImg + i - 4)
            + -2 * *(inImg + i + 4)
            + 1 * *(inImg + i - 4 + w)
            + -1 * *(inImg + i + 4 + w);
        outImg[i] = (uchar)(sqrt((float)(gx*gx + gy*gy)) / 2.0f);
    }
};

inputImage = (uchar*)malloc(w*h*4);
// Init inputImage not shown
outputImage = (uchar*)malloc(w*h*4);

// Pass the pointers through the functor:
// Construct functor object:
SobelFilter filter(inputImage, outputImage, w*4, h);

// Set border and call the
// filter for each pixel.
// Pass 2D coordinate to filter
bolt::cl::for_each_2d(
    1, h-1, 4, w*4-4,
    filter);

Kernel uses pointers and computes
indices for surrounding pixels
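
To make the 2D iteration concrete, here is a serial CPU stand-in for the `for_each_2d` call. The slide does not spell out the bounds semantics, so this sketch assumes half-open ranges [y0, y1) and [x0, x1) and a functor invoked as `f(y, x)`. `for_each_2d_serial` and `CountCalls` are hypothetical names.

```cpp
#include <cassert>

// Serial stand-in for the slide's bolt::cl::for_each_2d call.
// ASSUMPTION: the four bounds are half-open ranges [y0, y1) and [x0, x1),
// and the functor is called as f(y, x). The slide does not confirm this.
template <typename Functor>
void for_each_2d_serial(int y0, int y1, int x0, int x1, Functor f)
{
    assert(y0 <= y1 && x0 <= x1);
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            f(y, x);
}

// Hypothetical functor that counts invocations through a raw pointer,
// mirroring how SobelFilter carries raw pointers to its images.
struct CountCalls {
    int* count;
    void operator()(int, int) { ++*count; }
};
```

For an image 4 pixels wide (16 bytes per row) and 5 rows high, the bounds `1, 4, 4, 12` skip the border and visit 3 x 8 = 24 positions.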
THE POWER OF SVM AND BOLT
 Functor can store additional parameters
 With SVM, additional parameters can also be pointers !
‒ Powerful way to access host data structures, linked lists, avoid copies

 Initialized in the constructor (run on host), accessed in the body of operator() (run on device)
struct SampleFunctor
{
    SampleFunctor(…) : /* Init Pointers Here */ {}
    // Sample Pointers:
    class MyFancyHostDataStructure *handy;
    class ListNode *head;
    int *myArray;
    float operator() (…) {
        // Access pointers here!
    }
};
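
A functor that carries a pointer to a host linked list can be sketched on the CPU as follows. This runs with `std::transform` only and does not exercise SVM or Bolt. `ListNode`, `ScaleBySum`, and `scale_all` are hypothetical names; on an SVM system the same pointer could, in principle, be dereferenced on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host data structure: a singly linked list of weights.
struct ListNode {
    float weight;
    const ListNode* next;
};

// Functor holding a raw pointer. The constructor runs on the host;
// operator() walks the list through the stored pointer.
struct ScaleBySum {
    const ListNode* head;
    explicit ScaleBySum(const ListNode* h) : head(h) {}
    float operator()(float x) const {
        float sum = 0.0f;
        for (const ListNode* n = head; n != nullptr; n = n->next)
            sum += n->weight;
        return x * sum;
    }
};

// CPU-only stand-in for a Bolt transform call.
std::vector<float> scale_all(const std::vector<float>& in, const ListNode* head)
{
    assert(head != nullptr);  // this sketch expects a non-empty list
    std::vector<float> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), ScaleBySum(head));
    return out;
}
```

With a two-node list of weights 1 and 2, the input `{1, 4}` becomes `{3, 12}`.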
Performance
ALGORITHM PERFORMANCE
[Chart: Sort Performance (32-bit elements), Bolt vs SHOC sort. Series: Bolt(OpenCL), SHOC. X axis: Millions Elements (1 to 32). Y axis: Millions Elements/Sec (0 to 120).]
[Chart: Sort Performance (32-bit elements), Bolt vs std::sort. Series: Bolt(OpenCL), Bolt(MultiCoreCPU), std::sort. X axis: Millions Elements (1 to 32). Y axis: Millions Elements/Sec (0 to 120).]
Data collected on future AMD APU
CONCLUSIONS
Bolt

 C++ Template Library
 Designed For Heterogeneous Computing

Portability

(Linux®, Windows®) X
(Multiple GPU Vendors) X
(OpenCL™, C++ AMP)

Programmability

 SVM + Bolt = Even easier and more flexible heterogeneous programming model
 Handy access to host pointers and extra functor parameters

Performance

 Bolt-GPU > Multi-Core CPU > STL
 High-performance implementations, written by experts

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. OpenCL™ is a trademark of Apple Inc. Microsoft and Visual Studio are trademarks of Microsoft Corp.
Linux is a trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

26 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013

More Related Content

What's hot

MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoAMD Developer Central
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansAMD Developer Central
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...AMD Developer Central
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyAMD Developer Central
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosAMD Developer Central
 
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...AMD Developer Central
 
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsWT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsAMD Developer Central
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasAMD Developer Central
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...AMD Developer Central
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbr Skip
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...AMD Developer Central
 
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent BetbederMM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent BetbederAMD Developer Central
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...AMD Developer Central
 
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...AMD Developer Central
 
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...HSA Foundation
 
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...AMD Developer Central
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...AMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 

What's hot (20)

MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...
PT-4056, Harnessing Heterogeneous Systems Using C++ AMP – How the Story is Ev...
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
 
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...
CE-4030, Optimizing Photo Editing Application with HSA Technology, by Stanley...
 
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon WoodsWT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
WT-4073, ANGLE and cross-platform WebGL support, by Shannon Woods
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tb
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent BetbederMM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
MM-4085, Designing a game audio engine for HSA, by Laurent Betbeder
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
 
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
ISCA 2014 | Heterogeneous System Architecture (HSA): Architecture and Algorit...
 
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
PL-4051, An Introduction to SPIR for OpenCL Application Developers and Compil...
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 

Similar to PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben Sander

Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Automatic generation of platform architectures using open cl and fpga roadmap
Automatic generation of platform architectures using open cl and fpga roadmapAutomatic generation of platform architectures using open cl and fpga roadmap
Automatic generation of platform architectures using open cl and fpga roadmapManolis Vavalis
 
02 ai inference acceleration with components all in open hardware: opencapi a...
02 ai inference acceleration with components all in open hardware: opencapi a...02 ai inference acceleration with components all in open hardware: opencapi a...
02 ai inference acceleration with components all in open hardware: opencapi a...Yutaka Kawai
 
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...
HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael W...AMD Developer Central
 
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning AccelerationclCaffe*: Unleashing the Power of Intel Graphics for Deep Learning Acceleration
clCaffe*: Unleashing the Power of Intel Graphics for Deep Learning AccelerationIntel® Software
 
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...
Developing Applications for Beagle Bone Black, Raspberry Pi and SoC Single Bo...ryancox
 
Serverless Data Architecture at scale on Google Cloud Platform
Serverless Data Architecture at scale on Google Cloud PlatformServerless Data Architecture at scale on Google Cloud Platform
Serverless Data Architecture at scale on Google Cloud PlatformMeetupDataScienceRoma
 
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdf
3 Open-Source-SYCL-Intel-Khronos-EVS-Workshop_May19.pdfJunZhao68
 
LCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLCA14: LCA14-412: GPGPU on ARM SoC session
LCA14: LCA14-412: GPGPU on ARM SoC sessionLinaro
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track fAlona Gradman
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensAlona Gradman
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013 HSA Foundation
 
sigrok: Adventures in Integrating a Power-Measurement Device
sigrok: Adventures in Integrating a Power-Measurement Devicesigrok: Adventures in Integrating a Power-Measurement Device
sigrok: Adventures in Integrating a Power-Measurement DeviceBayLibre
 
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinPGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander Kukushkin
PGConf.ASIA 2019 Bali - PostgreSQL on K8S at Zalando - Alexander KukushkinEqunix Business Solutions
 

PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben Sander

  • 2. OUTLINE
    • Introduction
    • Key Strategic Directions:
      ‒ Portability
      ‒ Programmability
      ‒ Performance
    • Conclusions
    2 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013
  • 3. INTRODUCTION AND MOTIVATION
    • What is Bolt?
      ‒ C++ Template Library for GPU and multi-core CPU programming
      ‒ Optimized library routines for common GPU operations
      ‒ CPU optimized as well (high-performance, multi-core CPU routines)
      ‒ Works with open standards (OpenCL™ and C++ AMP)
      ‒ Distributed as open source
    • Make GPU programming as easy as CPU programming
      ‒ Resembles familiar C++ Standard Template Library
      ‒ Customizable via C++ template parameters
      ‒ Single source base for GPU and CPU
      ‒ Functional and performance portability
      ‒ Improves developer productivity
    • Well-suited for HSA
      ‒ Leverage high-performance shared virtual memory
      ‒ No data copies; can use pointers in data structures
  • 4. SIMPLE BOLT EXAMPLE
        #include <bolt/amp/sort.h>
        #include <vector>
        #include <algorithm>
        #include <cstdlib>
        int main()
        {
            // generate random data (on host)
            std::vector<int> a(1000000);
            std::generate(a.begin(), a.end(), rand);
            // sort, run on best device
            bolt::amp::sort(a.begin(), a.end());
        }
    • Interface similar to the familiar C++ Standard Template Library
    • No explicit mention of C++ AMP or OpenCL™ (or GPU!)
      ‒ More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
    • Direct use of host data structures (e.g. std::vector)
    • bolt::sort implicitly runs on the platform
      ‒ Runtime automatically selects CPU or GPU (or both)
  • 5. BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR
        #include <bolt/amp/transform.h>
        #include <vector>
        struct SaxpyFunctor
        {
            float _a;
            SaxpyFunctor(float a) : _a(a) {}
            float operator() (const float &xx, const float &yy) restrict(cpu,amp)
            {
                return _a * xx + yy;
            }
        };
        int main()
        {
            SaxpyFunctor s(100);
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);
            bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
        }
  • 6. BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
        #include <bolt/amp/transform.h>
        #include <vector>
        int main()
        {
            const float a = 100;
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);
            // saxpy with C++ Lambda
            bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(),
                [=] (float xx, float yy) restrict(cpu, amp) {
                    return a * xx + yy;
                });
        }
    • Functor (“a * xx + yy”) is now specified inline at the call site (more natural)
    • Can capture variables from the surrounding scope (“a”), which eliminates the boilerplate class
    • C++11 improves the interface for template function libraries
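For readers without a Bolt installation, the same SAXPY computation can be sketched on the host with the standard library alone. This is only a reference for what the transform computes, assuming plain `std::transform` in place of `bolt::amp::transform`; the helper name `saxpy_reference` is hypothetical and not part of Bolt.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-only reference for SAXPY: z[i] = a * x[i] + y[i].
// Uses std::transform, not Bolt; saxpy_reference is a hypothetical name.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    // The lambda captures `a` by value, like the [=] capture on the slide.
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

The call shape matches the slide's: two input ranges, one output iterator, and a binary callable. Only the namespace and the `restrict(cpu, amp)` qualifier differ.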
  • 7. BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR
        #include <bolt/cl/transform.h>
        #include <vector>
        BOLT_FUNCTOR(SaxpyFunctor,
        struct SaxpyFunctor
        {
            float _a;
            SaxpyFunctor(float a) : _a(a) {}
            float operator() (const float &xx, const float &yy)
            {
                return _a * xx + yy;
            }
        };
        );
        int main()
        {
            SaxpyFunctor s(100);
            std::vector<float> x(1000000); // initialization not shown
            std::vector<float> y(1000000); // initialization not shown
            std::vector<float> z(1000000);
            bolt::cl::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
        }
    • Similar syntax to C++AMP
    • Macros are used to make functors visible to the OpenCL compiler
    • Functor code is compiled on the first Bolt call
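One way a macro can make a functor visible to a runtime compiler is to emit the code once for the host compiler and once as a string. The sketch below illustrates that general technique only. `SKETCH_FUNCTOR` and the generated `SaxpyFunctor_source` variable are hypothetical names, and nothing here is taken from Bolt's actual macro definition.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro (not Bolt's definition): passes the code through
// unchanged for the host compiler and also stores its text in a string
// named <name>_source that a runtime compiler could consume later.
#define SKETCH_FUNCTOR(name, ...) \
    __VA_ARGS__ \
    static const char* const name##_source = #__VA_ARGS__;

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const
    {
        return _a * xx + yy;
    }
};
)
```

After expansion, `SaxpyFunctor` is an ordinary host type and `SaxpyFunctor_source` holds the same definition as text. That is consistent with the slide's note that functor code is compiled on the first library call.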
  • 8. BOLT 1.1
    • Timeline:
      ‒ July 2012 (AFDS-2012): Announced Bolt
      ‒ July-2013: Bolt v1.0 General Availability
      ‒ Nov-2013 (Now!): Bolt v1.1 General Availability
        ‒ Additional functions and optimizations
        ‒ Windows® 7, Windows® 8, Windows® 8.1, and Linux® support
    • Open-source and available here:
      ‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-appsdk/bolt-c-template-library/
      ‒ https://github.com/HSA-Libraries/Bolt
    • Contains 30-40 template functions
      ‒ Includes transform, sort, stable sort, scan, reduce
    • OpenCL™ and multi-core CPU paths for all functions
    • Linux® and Windows®
    • Supports Microsoft® Visual Studio® 2010, Visual Studio® 2012, GCC 4.6/4.7/4.8
  • 9. BOLT 1.1 FUNCTION SUPPORT MATRIX
    API                        OpenCL™ (GPU)  C++AMP (GPU)  Multicore TBB (CPU)  Serial (CPU)
    constant_iterator          YES            NO            YES                  YES
    copy                       YES            NO            YES                  YES
    copy_n                     YES            NO            YES                  YES
    count                      YES            YES           YES                  YES
    count_if                   YES            YES           YES                  YES
    counting_iterator          YES            NO            YES                  YES
    device_vector              YES            YES           YES                  YES
    exclusive_scan             YES            YES           YES                  YES
    exclusive_scan_by_key      YES            NO            YES                  YES
    fill                       YES            NO            YES                  YES
    fill_n                     YES            NO            YES                  YES
    generate                   YES            NO            YES                  YES
    generate_n                 YES            NO            YES                  YES
    inclusive_scan             YES            YES           YES                  YES
    inclusive_scan_by_key      YES            NO            YES                  YES
    inner_product              YES            NO            YES                  YES
    max_element                YES            NO            YES                  YES
    min_element                YES            NO            YES                  YES
    reduce                     YES            YES           YES                  YES
    reduce_by_key              YES            NO            YES                  YES
    sort                       YES            YES           YES                  YES
    sort_by_key                YES            NO            YES                  YES
    stable_sort                YES            NO            YES                  YES
    stable_sort_by_key         YES            NO            YES                  YES
    transform                  YES            YES           YES                  YES
    transform_exclusive_scan   YES            NO            YES                  YES
    transform_inclusive_scan   YES            NO            YES                  YES
    transform_reduce           YES            YES           YES                  YES
    binary_search              YES            NO            YES                  YES
    merge                      YES            NO            YES                  YES
    scatter                    YES            NO            YES                  YES
    scatter_if                 YES            NO            YES                  YES
    gather                     YES            NO            YES                  YES
    gather_if                  YES            NO            YES                  YES
  • 10. BOLT BENEFITS FOR PROGRAMMERS
    • Single source can target GPUs and/or multi-core CPUs
    • Optimized GPU library implementations for common operations (sort, scan, reduce, etc.)
    • Familiar STL-like C++ syntax
      ‒ Kernels created automatically
    • STL-like “device_vector” to simplify memory management
      ‒ Supports typed memory allocation
    • Benefits for OpenCL users:
      ‒ Sensible selection of defaults (platform, context, device, queue)
      ‒ With optional overrides for users who want more control
      ‒ Kernels compiled automatically on first call
      ‒ Simplified, C++-style kernel calling convention (replaces clSetKernelArg)
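The scan and reduce primitives named above can be pinned down with a small host-only sketch. It assumes the conventional definitions of inclusive scan, exclusive scan, and reduce, and uses only the standard library. The `*_ref` function names are hypothetical and are not Bolt APIs.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] = init + in[0] + ... + in[i-1].
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init)
{
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;      // store the sum of everything before element i
        running += in[i];
    }
    return out;
}

// Reduce: init + in[0] + ... + in[n-1].
int reduce_ref(const std::vector<int>& in, int init)
{
    return std::accumulate(in.begin(), in.end(), init);
}
```

For the input {1, 2, 3, 4}, the inclusive scan is {1, 3, 6, 10}, the exclusive scan with an initial value of 0 is {0, 1, 3, 6}, and the reduction is 10.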
  • 12. PORTABILITY
    • Current state of Bolt OS and vendor portability:
      ‒ Support C++AMP
        ‒ Run on any DX11-compliant video card
        ‒ But: no Linux® solution
      ‒ Support OpenCL™
        ‒ Run on Windows® and Linux®
        ‒ But: uses the AMD C++ static kernel feature, which is only available on AMD OpenCL™
    • Future: improving portability
      ‒ C++AMP
        ‒ Provide Linux® port for C++AMP
        ‒ Increase the number of Bolt APIs which are supported with C++AMP
      ‒ OpenCL™:
        ‒ Provide translator tool for C++ static kernel language
  • 13. OPEN-SOURCE C++ AMP TOOLCHAIN
    [Diagram labels: C++AMP; CLANG Front-end; LLVM-IR or SPIR 1.2; LLVM Compiler; HSAIL -> Any HSA Implementation; SPIR 1.2 -> Any OpenCL™+SPIR Implementation]
    • Open-source!
    • Preliminary Version: https://bitbucket.org/multicoreware/cppamp-driver/
    • Samples: https://bitbucket.org/multicoreware/cxxamp_sandbox
    • More development coming in 1H-2014
  • 14. C++ STATIC KERNEL LANGUAGE AND NEW TRANSLATOR TOOL
    • AMD C++ Static Kernel Language (aka “OpenCL-C++ Kernel Language”)
      ‒ Adds C++ features to OpenCL-C kernel language
        ‒ Templates!
        ‒ Classes
        ‒ Also: namespaces, inheritance, references, “this” operator, more
      ‒ Available as an OpenCL™ extension on AMD platforms
      ‒ http://developer.amd.com/wordpress/media/2012/10/CPP_kernel_language.pdf
    • New translator tool is designed to bring these benefits to any OpenCL™ implementation:
      [Diagram: Bolt Code -> C++ Template Instantiation -> OpenCL C++ Code -> Translator -> OpenCL C Code -> Any OpenCL™ Implementation!]
    • Translator expected to be available in Q1-2014
  • 15. FUTURE BOLT SUPPORT MATRIX
                OpenCL™               C++ AMP
    Windows®    Any OpenCL™ vendor    Any DX11 vendor
    Linux®      Any OpenCL™ vendor    Any OpenCL™ SPIR or HSAIL vendor
    • Green shows additional platforms on which Bolt will run accelerated code paths
    • Bolt can also run multi-core CPU paths if GPU acceleration is not available
  • 16. BOLT DEMO
    • OpenMM: open-source tool for molecular simulation
    • See how minor code modifications to a large application enable significant acceleration
    • See Bolt code translated and running on Intel® OpenCL™ platform
  • 18. SHARED VIRTUAL MEMORY - RECAP
    • CPU and GPU share virtual memory space
    • Can pass pointers between CPU and GPU
    • GPU can access terabytes of pageable virtual memory
    • High performance from all devices
    • Called “Shared Virtual Memory” or “SVM”
    • Key feature of HSA
    [Diagram labels: PHYSICAL MEMORY; VIRTUAL MEMORY; CPU0 VA->PA; GPU VA->PA]
  • 19. PROGRAMMABILITY: HOW SVM MAKES BOLT BETTER
    • Bolt today provides
      ‒ Familiar, programmer-friendly interface for accelerated programming
      ‒ Single-source portability to GPUs and multi-core CPUs
    • SVM + Bolt:
      ‒ Efficient access to host memory from GPU
        ‒ No need for “heap management” between host and device memory
        ‒ No copies, less overhead
      ‒ Single address space is a natural fit for Bolt single-source interface
        ‒ Program like a multi-core CPU, benefit from GPU-like acceleration
      ‒ Ability to use pointer-containing data structures
        ‒ Dramatically expands the code that can use functors and the template library: functors can contain pointers!
  • 20. CONVOLUTION / SOBEL EDGE FILTER
    Compute kernel applied to each pixel:
    $$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$$
    $$G = \sqrt{G_x^2 + G_y^2}$$
    Challenge: need to examine surrounding pixels.
  • 21. SOBEL CODE EXAMPLE
        struct SobelFilter {
            // Pointers in the struct
            uchar *inImg;
            uchar *outImg;
            int w, h;
            SobelFilter(uchar *inI, uchar *outI, int xw, int xh)
                : inImg(inI), outImg(outI), w(xw), h(xh) {}
            void operator() (int y, int x) {
                int i = y*w + x;
                int gx =  1 * *(inImg + i - 4 - w) +  2 * *(inImg + i - w) +  1 * *(inImg + i + 4 - w)
                       + -1 * *(inImg + i - 4 + w) + -2 * *(inImg + i + w) + -1 * *(inImg + i + 4 + w);
                int gy =  1 * *(inImg + i - 4 - w) + -1 * *(inImg + i + 4 - w)
                       +  2 * *(inImg + i - 4)     + -2 * *(inImg + i + 4)
                       +  1 * *(inImg + i - 4 + w) + -1 * *(inImg + i + 4 + w);
                outImg[i] = (uchar)(sqrt((float)(gx*gx + gy*gy)) / 2.0f);
            }
        };
        inputImage = (uchar*)malloc(w*h*4);   // Init inputImage not shown
        outputImage = (uchar*)malloc(w*h*4);
        // Construct functor object; pass the pointers through the functor:
        SobelFilter filter(inputImage, outputImage, w*4, h);
        // Set border and call the filter for each pixel.
        // Pass 2D coordinate to filter
        bolt::cl::for_each_2d(1, h-1, 4, w*4-4, filter);
    • Kernel uses pointers and computes indices for surrounding pixels
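As a host-only cross-check of the Sobel arithmetic, the sketch below applies the Gx and Gy kernels from slide 20 to a single-channel 8-bit image. It is an illustration under stated assumptions, not the slide's implementation. It uses one byte per pixel instead of four, clamps the result to 255 (the slide's code does not), and leaves border pixels at zero. The name `sobel_reference` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only Sobel sketch on a row-major, single-channel 8-bit image.
// sobel_reference is a hypothetical name; it does not use Bolt.
std::vector<std::uint8_t> sobel_reference(const std::vector<std::uint8_t>& in,
                                          int w, int h)
{
    assert(in.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    std::vector<std::uint8_t> out(in.size(), 0);   // border pixels stay 0
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            // p(dy, dx) reads the neighbor at the given row/column offset.
            auto p = [&](int dy, int dx) {
                return static_cast<int>(in[(y + dy) * w + (x + dx)]);
            };
            // Gx and Gy as defined on slide 20.
            int gx = -p(-1, -1) + p(-1, 1) - 2 * p(0, -1) + 2 * p(0, 1)
                     - p(1, -1) + p(1, 1);
            int gy = -p(-1, -1) - 2 * p(-1, 0) - p(-1, 1)
                     + p(1, -1) + 2 * p(1, 0) + p(1, 1);
            float g = std::sqrt(static_cast<float>(gx * gx + gy * gy)) / 2.0f;
            // Clamping is an addition in this sketch to avoid overflow.
            out[y * w + x] = static_cast<std::uint8_t>(std::min(g, 255.0f));
        }
    }
    return out;
}
```

The nested loops play the role that `bolt::cl::for_each_2d` plays on the slide: they visit every interior pixel and hand its 2D coordinate to the per-pixel computation.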
  • 22. THE POWER OF SVM AND BOLT
    • Functor can store additional parameters
    • With SVM, additional parameters can also be pointers!
      ‒ Powerful way to access host data structures and linked lists, and to avoid copies
    • Initialized by the constructor (run on host), then accessed in the body of operator() (run on device)
        struct SampleFunctor
        {
            SampleFunctor(…) : /* Init Pointers Here */ {}
            // Sample Pointers:
            class MyFancyHostDataStructure *handy;
            class ListNode *head;
            int *myArray;
            float operator() (…)
            {
                // Access pointers here!
            };
        };
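The pointer-carrying functor pattern can be tried on the host with the standard library. The sketch below stores a linked-list head pointer in a functor, sets it in the constructor, and dereferences it inside `operator()`. It runs entirely on the CPU through `std::transform`, so it shows the programming pattern only, not SVM behavior. `ListSumFunctor` and `add_list_sum` are hypothetical names, and the `ListNode` layout is assumed for illustration.

```cpp
#include <algorithm>
#include <vector>

// Assumed node layout for illustration.
struct ListNode {
    int value;
    const ListNode* next;
};

// Functor that carries a host pointer as an extra parameter.
struct ListSumFunctor
{
    const ListNode* head;   // set on the host in the constructor

    explicit ListSumFunctor(const ListNode* h) : head(h) {}

    // Walks the list through the stored pointer and adds its sum to x.
    int operator()(int x) const
    {
        int sum = 0;
        for (const ListNode* n = head; n != nullptr; n = n->next) {
            sum += n->value;
        }
        return x + sum;
    }
};

// Applies the functor to every element (host-only, std::transform).
std::vector<int> add_list_sum(const std::vector<int>& in, const ListNode* head)
{
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), ListSumFunctor(head));
    return out;
}
```

Without a shared address space, a device-side version of this functor would need the list copied and its pointers rewritten. That is the overhead the slide says SVM removes.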
  • 24. ALGORITHM PERFORMANCE
    [Two charts: "Sort Performance (32-bit elements), Bolt vs SHOC sort" and "Sort Performance (32-bit elements), Bolt vs std::sort". Legend entries: Bolt(OpenCL), SHOC, Bolt(MultiCoreCPU), std::sort. X axis: Millions Elements (1 to 32). Y axis: Millions Elements/Sec (0 to 120).]
    Data collected on future AMD APU
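    The charts report throughput in millions of elements per second. As a rough illustration of how such a figure can be measured for the std::sort baseline, here is a minimal sketch; the function name, fixed seed and single-run timing are assumptions for this example and say nothing about how the slide's data was actually collected.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sorts n pseudo-random 32-bit values with std::sort and returns the
// throughput in millions of elements per second. If sortedOut is non-null,
// the sorted data is copied there so the caller can inspect it.
inline double sortThroughputMelemsPerSec(
    std::size_t n, std::vector<std::uint32_t>* sortedOut = nullptr)
{
    assert(n > 0);
    std::mt19937 rng(12345);  // fixed seed so runs are repeatable
    std::vector<std::uint32_t> data(n);
    for (auto& v : data) {
        v = static_cast<std::uint32_t>(rng());
    }

    // Time only the sort, not the data generation.
    const auto t0 = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    const auto t1 = std::chrono::steady_clock::now();

    if (sortedOut != nullptr) {
        *sortedOut = data;
    }
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) {
        return 0.0;  // timer too coarse to measure this run
    }
    return (static_cast<double>(n) / 1e6) / seconds;
}
```

    A single timed run is noisy; a real comparison would repeat the measurement and sweep the element count the way the charts do.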
  • 25. CONCLUSIONS
    Bolt
     C++ Template Library
     Designed for heterogeneous computing
    Portability
     (Linux®, Windows®) X (Multiple GPU Vendors) X (OpenCL™, C++ AMP)
    Programmability
     SVM + Bolt = Even easier and more flexible heterogeneous programming model
     Handy access to host pointers and extra functor parameters
    Performance
     Bolt-GPU > Multi-Core CPU > STL
     High-performance implementations, written by experts
  • 26. DISCLAIMER & ATTRIBUTION
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    ATTRIBUTION
    © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL™ is a trademark of Apple Inc. Microsoft and Visual Studio are trademarks of Microsoft Corp. Linux is a trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.