PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben Sander

BOLT UPDATE
BEN SANDER
AMD SENIOR FELLOW

OUTLINE

Introduction

Key Strategic Directions:
•Portability
•Programmability
•Performance

Conclusions

2 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013

INTRODUCTION AND MOTIVATION
 What is Bolt?
‒ C++ Template Library for GPU and multi-core CPU programming
‒ Optimized library routines for common GPU operations
‒ CPU optimized as well (high-performance, multi-core CPU routines)

‒ Works with open standards (OpenCL™ and C++ AMP)
‒ Distributed as open source

 Make GPU programming as easy as CPU programming
‒ Resembles familiar C++ Standard Template Library
‒ Customizable via C++ template parameters
‒ Single source base for GPU and CPU
‒ Functional and performance portability

‒ Improves developer productivity

 Well-suited for HSA
‒ Leverage high-performance shared virtual memory
‒ No data copies; can use pointers in data structures

SIMPLE BOLT EXAMPLE
#include <bolt/amp/sort.h>
#include <vector>
#include <algorithm>
#include <cstdlib>
int main()
{
    // generate random data (on host)
    std::vector<int> a(1000000);
    std::generate(a.begin(), a.end(), std::rand);
    // sort, run on best device
    bolt::amp::sort(a.begin(), a.end());
    return 0;
}

Interface similar to the familiar C++ Standard Template Library
No explicit mention of C++ AMP or OpenCL™ (or GPU!)
‒ More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™

Direct use of host data structures (e.g., std::vector)
bolt::amp::sort implicitly runs on the platform
‒ Runtime automatically selects CPU or GPU (or both)

BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR
#include <bolt/amp/transform.h>
#include <vector>
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    // restrict(cpu,amp): callable from both host code and C++ AMP kernels
    float operator() (const float &xx, const float &yy) restrict(cpu,amp)
    {
        return _a * xx + yy;
    }
};
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    // z[i] = s(x[i], y[i]) for every element
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    return 0;
}


BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
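#include <bolt/amp/transform.h>
#include <vector>
int main()
{
    const float a=100;
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    // saxpy with C++ Lambda
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(),
        [=] (float xx, float yy) restrict(cpu, amp) {
            return a * xx + yy;
        });
    return 0;
}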

Functor (“a * xx + yy”) is now specified inline at the call site (more natural)
Can capture variables from the surrounding scope (“a”), which eliminates the boilerplate class
C++11 improves interface for template function libraries
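
The functor and lambda forms above compute the same result. The sketch below illustrates that equivalence on the host only, using std::transform as a stand-in for bolt::amp::transform and omitting the restrict(cpu, amp) qualifier so that it compiles with any C++11 compiler. The helper names saxpy_functor and saxpy_lambda are illustrative and do not come from Bolt.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Named functor: the constant is stored as a member.
struct SaxpyFunctor {
    float a;
    explicit SaxpyFunctor(float a_) : a(a_) {}
    float operator()(float xx, float yy) const { return a * xx + yy; }
};

// z = a*x + y using the named functor.
std::vector<float> saxpy_functor(float a, const std::vector<float>& x,
                                 const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), SaxpyFunctor(a));
    return z;
}

// Same computation with a lambda that captures `a` by value.
std::vector<float> saxpy_lambda(float a, const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

The lambda version needs no separate class: the capture list [=] plays the role of the functor's member and constructor.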

BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR
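#include <bolt/cl/transform.h>
#include <vector>

// BOLT_FUNCTOR makes the functor source visible to the OpenCL™ compiler
BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator() (const float &xx, const float &yy)
    {
        return _a * xx + yy;
    }
};
);
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    bolt::cl::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    return 0;
}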


Similar syntax to C++AMP

Macros are used to make functors visible to the OpenCL™ compiler.
Functor code is compiled on first Bolt call
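
One common way a macro can make the same code visible to two compilers is to emit it unchanged for the host compiler and also capture its text as a string for later runtime compilation. The sketch below shows that general technique only. SKETCH_FUNCTOR and the generated SaxpyFunctor_source function are hypothetical names, and nothing here should be read as the actual definition of BOLT_FUNCTOR.

```cpp
#include <cassert>
#include <string>

// HYPOTHETICAL macro, not Bolt's BOLT_FUNCTOR: it emits the code unchanged
// for the host compiler and also stores the same text as a string that a
// runtime could later hand to an OpenCL compiler.
#define SKETCH_FUNCTOR(NAME, ...)                                \
    __VA_ARGS__                                                  \
    inline std::string NAME##_source() { return #__VA_ARGS__; }

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor {
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float& xx, const float& yy) const {
        return _a * xx + yy;
    }
};
)
```

After expansion the struct is an ordinary host type, and SaxpyFunctor_source() returns its source text with whitespace collapsed by the preprocessor.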

BOLT 1.1
 Timeline:
‒ July 2012 (AFDS-2012) : Announced Bolt
‒ July-2013 : Bolt v1.0 General Availability
‒ Nov-2013 (Now!): Bolt v1.1 General Availability
‒ Additional functions and optimizations
‒ Windows® 7, Windows® 8, Windows® 8.1, and Linux® support

 Open-source and available here:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-appsdk/bolt-c-template-library/
‒ https://github.com/HSA-Libraries/Bolt

 Contains 30-40 template functions
‒ Includes transform, sort, stable sort, scan, reduce

 OpenCL™ and multi-core CPU paths for all functions
 Linux® and Windows®
 Supports Microsoft® Visual Studio® 2010, Visual Studio® 2012, and GCC 4.6/4.7/4.8

BOLT 1.1 FUNCTION SUPPORT MATRIX
API                        OpenCL™ (GPU)   C++AMP (GPU)   Multicore TBB (CPU)   Serial (CPU)
constant_iterator          YES             NO             YES                   YES
copy                       YES             NO             YES                   YES
copy_n                     YES             NO             YES                   YES
count                      YES             YES            YES                   YES
count_if                   YES             YES            YES                   YES
counting_iterator          YES             NO             YES                   YES
device_vector              YES             YES            YES                   YES
exclusive_scan             YES             YES            YES                   YES
exclusive_scan_by_key      YES             NO             YES                   YES
fill                       YES             NO             YES                   YES
fill_n                     YES             NO             YES                   YES
generate                   YES             NO             YES                   YES
generate_n                 YES             NO             YES                   YES
inclusive_scan             YES             YES            YES                   YES
inclusive_scan_by_key      YES             NO             YES                   YES
inner_product              YES             NO             YES                   YES
max_element                YES             NO             YES                   YES
min_element                YES             NO             YES                   YES
reduce                     YES             YES            YES                   YES
reduce_by_key              YES             NO             YES                   YES
sort                       YES             YES            YES                   YES
sort_by_key                YES             NO             YES                   YES
stable_sort                YES             NO             YES                   YES
stable_sort_by_key         YES             NO             YES                   YES
transform                  YES             YES            YES                   YES
transform_exclusive_scan   YES             NO             YES                   YES
transform_inclusive_scan   YES             NO             YES                   YES
transform_reduce           YES             YES            YES                   YES
binary_search              YES             NO             YES                   YES
merge                      YES             NO             YES                   YES
scatter                    YES             NO             YES                   YES
scatter_if                 YES             NO             YES                   YES
gather                     YES             NO             YES                   YES
gather_if                  YES             NO             YES                   YES
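
For readers unfamiliar with the scan and reduce names in the matrix, the sketch below gives serial reference versions of the conventional definitions. It assumes that the Bolt routines follow these conventional meanings. The function names ending in _ref are illustrative and are not Bolt APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] = init + in[0] + ... + in[i-1].
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // value before element i is added
        running += in[i];
    }
    return out;
}

// Reduce: a single value, init plus the sum of all elements.
int reduce_ref(const std::vector<int>& in, int init) {
    return std::accumulate(in.begin(), in.end(), init);
}
```

For the input {1, 2, 3, 4}, the inclusive scan is {1, 3, 6, 10}, the exclusive scan with init 0 is {0, 1, 3, 6}, and the reduction is 10.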

BOLT BENEFITS FOR PROGRAMMERS
 Single-source can target GPUs and/or multi-core CPU
 Optimized GPU library implementations for common operations such as sort, scan, and reduce
 Familiar STL-like C++ syntax
‒ Kernels created automatically

 STL-like “device_vector” to simplify memory management
‒ Supports typed memory allocation

 Benefits for OpenCL users:
‒ Sensible selection of defaults (platform, context, device, queue)
‒ With optional overrides for users who want more control
‒ Kernels compiled automatically on first call.
‒ Simplified, C++-style kernel calling convention (replaces clSetKernelArg)
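
"Compiled automatically on first call" is typically implemented as a cache keyed by kernel source. The sketch below shows that caching pattern in plain C++. CompiledKernel and KernelCache are hypothetical types, the "compile" step is only a counter, and the code is not thread-safe. It makes no claim about how Bolt stores or builds its kernels.

```cpp
#include <cassert>
#include <map>
#include <string>

// HYPOTHETICAL stand-in for a compiled kernel handle.
struct CompiledKernel {
    std::string source;
};

class KernelCache {
public:
    // Returns the kernel for `source`, "compiling" it only the first time.
    const CompiledKernel& get(const std::string& source) {
        auto it = cache_.find(source);
        if (it == cache_.end()) {
            // A real runtime would invoke the OpenCL compiler here.
            ++compile_count_;
            it = cache_.emplace(source, CompiledKernel{source}).first;
        }
        return it->second;
    }

    // Number of times the (simulated) compiler was invoked.
    int compile_count() const { return compile_count_; }

private:
    std::map<std::string, CompiledKernel> cache_;
    int compile_count_ = 0;
};
```

The first call for a given source pays the compile cost; later calls with the same source reuse the cached entry.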


PORTABILITY
 Current State of Bolt OS and Vendor Portability:
‒ Support C++AMP
‒ Run on any DX11-compliant video card
‒ But: no Linux® solution

‒ Support OpenCL™
‒ Run on Windows® and Linux®
‒ But: use AMD C++ static kernel feature which is only available on AMD OpenCL™

 Future : Improving portability
‒ C++AMP
‒ Provide Linux® port for C++AMP
‒ Increase the number of Bolt APIs which are supported with C++AMP

‒ OpenCL™ :
‒ Provide translator tool for C++ static kernel language


OPEN-SOURCE C++ AMP TOOLCHAIN

[Diagram: toolchain flow]
C++AMP -> CLANG Front-end -> LLVM-IR or SPIR 1.2 -> LLVM Compiler -> HSAIL -> Any HSA Implementation
                                                 -> SPIR 1.2 -> Any OpenCL™+SPIR Implementation

 Open-source!

 Preliminary Version: https://bitbucket.org/multicoreware/cppamp-driver/
 Samples: https://bitbucket.org/multicoreware/cxxamp_sandbox
 More development coming in 1H-2014

C++ STATIC KERNEL LANGUAGE AND NEW TRANSLATOR TOOL

 AMD C++ Static Kernel Language (aka “OpenCL-C++ Kernel Language”)
‒ Adds C++ features to OpenCL-C kernel language
‒ Templates !
‒ Classes
‒ Also: Namespaces, Inheritance, References, “this” operator, more
‒ Available as an OpenCL™ extension on AMD platforms
‒ http://developer.amd.com/wordpress/media/2012/10/CPP_kernel_language.pdf

 New translator tool is designed to bring these benefits to any OpenCL™ Implementation:

[Diagram: translator flow]
Bolt Code -> C++ Template Instantiation -> OpenCL-C++ Code -> Translator -> OpenCL-C Code -> Any OpenCL™ Implementation!

 Translator expected to be available in Q1-2014

FUTURE BOLT SUPPORT MATRIX

            OpenCL™               C++ AMP
Windows®    Any OpenCL™ vendor    Any DX11 vendor
Linux®      Any OpenCL™ vendor    Any OpenCL™ SPIR or HSAIL vendor

 Green shows additional platforms that Bolt will run accelerated code paths
 Bolt can also run multi-core CPU paths if GPU acceleration is not available


BOLT DEMO
 OpenMM : Open-source simulation tool for molecular simulation

 See how minor code modifications to a large application enable significant acceleration.
 See Bolt code translated and running on Intel® OpenCL™ platform

SHARED VIRTUAL MEMORY - RECAP
 CPU and GPU share virtual memory space
 Can pass pointers between CPU and GPU
 GPU can access terabytes of pageable virtual memory
 High performance from all devices
 Called “Shared Virtual Memory” or “SVM”
 Key feature of HSA

[Diagram: CPU0 and GPU each perform VA->PA translation over the same VIRTUAL MEMORY, backed by PHYSICAL MEMORY]

PROGRAMMABILITY
HOW SVM MAKES BOLT BETTER

 Bolt today provides
‒ Familiar, programmer-friendly interface for accelerated programming
‒ Single-source portability to GPUs and multi-core CPUs

 SVM + Bolt:
‒ Efficient access to host memory from GPU
‒ No need for “heap management” between host and device memory
‒ No copies, less overhead
‒ Single address space is a natural fit for Bolt single-source interface
‒ Program like a multi-core CPU, benefit from GPU-like acceleration

‒ Ability to use pointer-containing data structures
‒ Dramatically expands the code that can use functors and the template library: functors can contain pointers!


CONVOLUTION / SOBEL EDGE FILTER
Compute kernel applied to each pixel:

Gx = [ -1  0 +1 ]
     [ -2  0 +2 ]
     [ -1  0 +1 ]

Gy = [ -1 -2 -1 ]
     [  0  0  0 ]
     [ +1 +2 +1 ]

G = sqrt(Gx^2 + Gy^2)

Challenge: Need to examine surrounding pixels.
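
As a worked instance of the formula, the sketch below applies both 3x3 masks to a single patch and returns the gradient magnitude at its center. The function name sobel_magnitude is illustrative, and the patch is a plain host array rather than image memory.

```cpp
#include <cassert>
#include <cmath>

// Gradient magnitude at the center of a 3x3 patch p[row][col].
float sobel_magnitude(const int p[3][3]) {
    static const int kGx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int kGy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    int gx = 0;
    int gy = 0;
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            gx += kGx[r][c] * p[r][c];  // horizontal change (right minus left)
            gy += kGy[r][c] * p[r][c];  // vertical change (bottom minus top)
        }
    }
    return std::sqrt(static_cast<float>(gx * gx + gy * gy));
}
```

A uniform patch gives G = 0. A patch whose right column is 10 and whose other entries are 0 gives Gx = 10 + 20 + 10 = 40, Gy = 0, and so G = 40.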


SOBEL CODE EXAMPLE
struct SobelFilter {
    // Pointers in the struct
    uchar *inImg;
    uchar *outImg;
    int w, h;

    // Pass the pointers through the functor:
    SobelFilter(uchar *inI, uchar *outI, int xw, int xh) :
        inImg(inI), outImg(outI), w(xw), h(xh) {}

    // Kernel uses pointers and computes indices for surrounding pixels.
    // w is the row stride in bytes (4 bytes per pixel), so +/-4 is the
    // left/right neighbor and +/-w is the row above/below.
    void operator() (int y, int x)
    {
        int i = y*w + x;
        // Row above minus row below
        int gx =
              1 * *(inImg + i - 4 - w)
            + 2 * *(inImg + i - w)
            + 1 * *(inImg + i + 4 - w)
            + -1 * *(inImg + i - 4 + w)
            + -2 * *(inImg + i + w)
            + -1 * *(inImg + i + 4 + w);
        // Left column minus right column
        int gy =
              1 * *(inImg + i - 4 - w)
            + -1 * *(inImg + i + 4 - w)
            + 2 * *(inImg + i - 4)
            + -2 * *(inImg + i + 4)
            + 1 * *(inImg + i - 4 + w)
            + -1 * *(inImg + i + 4 + w);
        outImg[i] = (uchar)(sqrt((float)(gx*gx + gy*gy)) / 2.0f);
    }
};

inputImage = (uchar*)malloc(w*h*4);
// Init inputImage not shown
outputImage = (uchar*)malloc(w*h*4);

// Construct functor object:
SobelFilter filter(inputImage, outputImage, w*4, h);

// Set border and call the
// filter for each pixel.
// Pass 2D coordinate to filter
bolt::cl::for_each_2d(
    1, h-1, 4, w*4-4,
    filter);
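
The call pattern above (a 2D range that skips the border, with a pointer-carrying functor invoked per coordinate) can be tried on the host without a GPU. The sketch below is a serial stand-in under stated assumptions: for_each_2d_host is a hypothetical helper, its half-open bounds [y0, y1) and [x0, x1) are an assumption about the slide's call, the image has one byte per pixel (so neighbor offsets are 1 rather than 4), and the clamp to 255 is an addition to avoid overflow on the final cast.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef unsigned char uchar;

// HYPOTHETICAL serial stand-in for a 2D for-each: calls f(y, x) for every
// y in [y0, y1) and x in [x0, x1).
template <typename F>
void for_each_2d_host(int y0, int y1, int x0, int x1, F f) {
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            f(y, x);
}

// Sobel functor for a single-channel image; it carries raw pointers.
struct SobelGray {
    const uchar* in;
    uchar* out;
    int w;
    SobelGray(const uchar* i, uchar* o, int width) : in(i), out(o), w(width) {}

    void operator()(int y, int x) const {
        const uchar* p = in + y * w + x;  // center pixel
        int gx = -p[-w - 1] + p[-w + 1] - 2 * p[-1] + 2 * p[1]
                 - p[w - 1] + p[w + 1];
        int gy = -p[-w - 1] - 2 * p[-w] - p[-w + 1]
                 + p[w - 1] + 2 * p[w] + p[w + 1];
        float g = std::sqrt(static_cast<float>(gx * gx + gy * gy)) / 2.0f;
        out[y * w + x] = static_cast<uchar>(g > 255.0f ? 255.0f : g);
    }
};

// Runs the filter over the interior of a w x h image; border pixels stay 0.
std::vector<uchar> sobel_gray(const std::vector<uchar>& img, int w, int h) {
    assert(static_cast<int>(img.size()) == w * h);
    std::vector<uchar> out(img.size(), 0);
    for_each_2d_host(1, h - 1, 1, w - 1, SobelGray(img.data(), out.data(), w));
    return out;
}
```

Starting the range at 1 and stopping one short of the edge keeps every neighbor access inside the image, which is why the border is excluded from the iteration space.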

THE POWER OF SVM AND BOLT
 Functor can store additional parameters
 With SVM, additional parameters can also be pointers !
‒ Powerful way to access host data structures, linked lists, avoid copies

 Pointers are initialized in the constructor (runs on host) and accessed in operator() (runs on device)
struct SampleFunctor
{
    SampleFunctor(…) : /* Init Pointers Here */ {}
    // Sample Pointers:
    class MyFancyHostDataStructure *handy;
    class ListNode *head;
    int *myArray;
    float operator() (…) {
        // Access pointers here!
    }
};
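
A concrete, host-only illustration of a pointer-carrying functor follows. It assumes nothing about Bolt: std::transform stands in for a Bolt transform call, everything runs on the CPU, and the names ListNode, AddListSum, and add_list_sum are invented for this sketch. With SVM, the point of the slide is that the same pointer stored by the constructor would remain valid when operator() runs on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host data structure that the functor reaches through a pointer.
struct ListNode {
    float value;
    const ListNode* next;
};

// The constructor stores the pointer; operator() follows it.
struct AddListSum {
    const ListNode* head;
    explicit AddListSum(const ListNode* h) : head(h) {}

    float operator()(float x) const {
        float sum = 0.0f;
        for (const ListNode* n = head; n != nullptr; n = n->next)
            sum += n->value;  // walk the host linked list
        return x + sum;
    }
};

// Applies the functor to every element of `in`.
std::vector<float> add_list_sum(const std::vector<float>& in,
                                const ListNode* head) {
    std::vector<float> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), AddListSum(head));
    return out;
}
```

No list data is copied into the functor; only the head pointer travels with it.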

ALGORITHM PERFORMANCE
[Figure: Sort Performance (32-bit elements), two panels. Panel "Bolt vs SHOC sort": Bolt(OpenCL) and SHOC. Panel "Bolt vs std::sort": Bolt(OpenCL), Bolt(MultiCoreCPU), and std::sort. X axis: Millions Elements (1, 2, 4, 8, 16, 32). Y axis: Millions Elements/Sec (0 to 120). Data collected on future AMD APU.]

CONCLUSIONS
Bolt

 C++ Template Library
 Designed For Heterogeneous Computing

Portability

(Linux®, Windows®) × (Multiple GPU Vendors) × (OpenCL™, C++ AMP)

Programmability

 SVM + Bolt = Even easier and more flexible heterogeneous programming model
 Handy access to host pointers and extra functor parameters

Performance

 Bolt-GPU > Multi-Core CPU > STL
 High-performance implementations, written by experts


DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices,
Inc. in the United States and/or other jurisdictions. OpenCL™ is a trademark of Apple Inc. Microsoft and Visual Studio are trademarks of Microsoft Corp.
Linux is a trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

