Presentation PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben Sander, at the AMD Developer Summit (APU13) November 11-13, 2013.
3. INTRODUCTION AND MOTIVATION
What is Bolt?
‒ C++ Template Library for GPU and multi-core CPU programming
‒ Optimized library routines for common GPU operations
‒ CPU optimized as well (high-performance, multi-core CPU routines)
‒ Works with open standards (OpenCL™ and C++ AMP)
‒ Distributed as open source
Make GPU programming as easy as CPU programming
‒ Resembles familiar C++ Standard Template Library
‒ Customizable via C++ template parameters
‒ Single source base for GPU and CPU
‒ Functional and performance portability
‒ Improves developer productivity
Well-suited for HSA
‒ Leverage high-performance shared virtual memory
‒ No data copies; can use pointers in data structures
3 | BOLT:C++ TEMPLATE LIBRARY FOR HETEROGENEOUS COMPUTING | NOVEMBER 19, 2013
4. SIMPLE BOLT EXAMPLE
#include <bolt/amp/sort.h>
#include <vector>
#include <algorithm>
#include <cstdlib>
int main()
{
    // generate random data (on host)
    std::vector<int> a(1000000);
    std::generate(a.begin(), a.end(), rand);
    // sort, run on best device
    bolt::amp::sort(a.begin(), a.end());
}
Interface similar to familiar C++ Standard Template Library
No explicit mention of C++ AMP or OpenCL™ (or GPU!)
‒ More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
Direct use of host data structures (e.g., std::vector)
bolt::amp::sort implicitly runs on the platform
‒ Runtime automatically selects CPU or GPU (or both)
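For comparison, here is a CPU-only sketch of the same program written against the standard library alone. The helper name make_sorted is hypothetical and exists only for illustration. The point is that the slide's Bolt version passes the same iterator pair and differs only in the header and namespace.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// CPU-only reference: fill a vector with pseudo-random ints and sort it
// with std::sort. The slide's Bolt call takes the same (begin, end) pair.
std::vector<int> make_sorted(std::size_t n)
{
    std::vector<int> a(n);
    std::generate(a.begin(), a.end(), std::rand);  // generate data on the host
    std::sort(a.begin(), a.end());                 // standard-library sort
    return a;
}
```

A reference like this is also a simple way to check the output of an accelerated sort on the same input.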
5. BOLT FOR C++ AMP : USER-SPECIFIED FUNCTOR
#include <bolt/amp/transform.h>
#include <vector>
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator() (const float &xx, const float &yy) restrict(cpu,amp)
    {
        return _a * xx + yy;
    }
};
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
}
6. BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
#include <bolt/amp/transform.h>
#include <vector>
int main()
{
    const float a = 100;
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    // saxpy with C++ Lambda
    bolt::amp::transform(x.begin(), x.end(), y.begin(), z.begin(),
        [=] (float xx, float yy) restrict(cpu, amp) {
            return a * xx + yy;
        });
}
Functor (“a * xx + yy”) now specified inline at the call site (more natural)
Can capture variables from surrounding scope (“a”), which eliminates the boilerplate class
C++11 improves interface for template function libraries
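To make the functor-versus-lambda comparison concrete without a GPU toolchain, the sketch below runs the same saxpy both ways with std::transform on the CPU. The function names saxpy_functor and saxpy_lambda are hypothetical, and the restrict(cpu, amp) qualifier from the slides is omitted because it is specific to C++ AMP compilers.

```cpp
#include <algorithm>
#include <vector>

// Boilerplate class: the scalar 'a' has to be carried as a member.
struct SaxpyFunctor {
    float a;
    explicit SaxpyFunctor(float a_) : a(a_) {}
    float operator()(float xx, float yy) const { return a * xx + yy; }
};

// saxpy using the explicit functor class.
std::vector<float> saxpy_functor(float a, const std::vector<float>& x,
                                 const std::vector<float>& y)
{
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), SaxpyFunctor(a));
    return z;
}

// saxpy using a lambda: 'a' is captured by value from the enclosing scope.
std::vector<float> saxpy_lambda(float a, const std::vector<float>& x,
                                const std::vector<float>& y)
{
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   [=](float xx, float yy) { return a * xx + yy; });
    return z;
}
```

Both functions compute the same result; the lambda version simply moves the arithmetic next to the call that uses it.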
7. BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR
#include <bolt/cl/transform.h>
#include <vector>
BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator() (const float &xx, const float &yy)
    {
        return _a * xx + yy;
    }
};
);
int main() {
    SaxpyFunctor s(100);
    std::vector<float> x(1000000); // initialization not shown
    std::vector<float> y(1000000); // initialization not shown
    std::vector<float> z(1000000);
    bolt::cl::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
}
Similar syntax to C++AMP
Macros used to make functors visible to the OpenCL™ compiler
Functor code is compiled on first Bolt call
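One way a macro can make a host-side functor visible to a runtime compiler is to emit the definition unchanged and also capture its text with the preprocessor's stringification operator. The sketch below illustrates that general technique only. SKETCH_FUNCTOR and FunctorSource are hypothetical names, and this is not a description of how BOLT_FUNCTOR itself is implemented.

```cpp
#include <string>

// Hypothetical trait: maps a functor type to the source text of its definition.
template <typename T>
struct FunctorSource;

// Hypothetical macro (an illustration, not Bolt's BOLT_FUNCTOR).
// It emits the definition for the host compiler and also stores the same
// tokens as a string, which a runtime could later hand to a kernel compiler.
#define SKETCH_FUNCTOR(Type, ...)                              \
    __VA_ARGS__                                                \
    template <>                                                \
    struct FunctorSource<Type> {                               \
        static std::string text() { return #__VA_ARGS__; }     \
    };

SKETCH_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
    float _a;
    SaxpyFunctor(float a) : _a(a) {}
    float operator()(const float &xx, const float &yy) const
    {
        return _a * xx + yy;
    }
};
)
```

After expansion the host program can call SaxpyFunctor directly, while FunctorSource<SaxpyFunctor>::text() returns the same definition as a string.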
8. BOLT 1.1
Timeline:
‒ July 2012 (AFDS-2012) : Announced Bolt
‒ July-2013 : Bolt v1.0 General Availability
‒ Nov-2013 (Now!): Bolt v1.1 General Availability
  ‒ Additional functions and optimizations
  ‒ Windows® 7, Windows® 8, Windows® 8.1, and Linux® Support
Open-source and available here:
‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-appsdk/bolt-c-template-library/
‒ https://github.com/HSA-Libraries/Bolt
Contains 30-40 template functions
‒ Includes transform, sort, stable sort, scan, reduce
OpenCL™ and multi-core CPU paths for all functions
Linux® and Windows®
Supports Microsoft® Visual Studio® 2010, Visual Studio® 2012, GCC 4.6/4.7/4.8
9. BOLT 1.1 FUNCTION SUPPORT MATRIX
API                      OpenCL™ (GPU)   C++AMP (GPU)   Multicore TBB (CPU)   Serial (CPU)
constant_iterator        YES             NO             YES                   YES
copy                     YES             NO             YES                   YES
copy_n                   YES             NO             YES                   YES
count                    YES             YES            YES                   YES
count_if                 YES             YES            YES                   YES
counting_iterator        YES             NO             YES                   YES
device_vector            YES             YES            YES                   YES
exclusive_scan           YES             YES            YES                   YES
exclusive_scan_by_key    YES             NO             YES                   YES
fill                     YES             NO             YES                   YES
fill_n                   YES             NO             YES                   YES
generate                 YES             NO             YES                   YES
generate_n               YES             NO             YES                   YES
inclusive_scan           YES             YES            YES                   YES
inclusive_scan_by_key    YES             NO             YES                   YES
inner_product            YES             NO             YES                   YES
max_element              YES             NO             YES                   YES
API                        OpenCL™ (GPU)   C++AMP (GPU)   Multicore TBB (CPU)   Serial (CPU)
min_element                YES             NO             YES                   YES
reduce                     YES             YES            YES                   YES
reduce_by_key              YES             NO             YES                   YES
sort                       YES             YES            YES                   YES
sort_by_key                YES             NO             YES                   YES
stable_sort                YES             NO             YES                   YES
stable_sort_by_key         YES             NO             YES                   YES
transform                  YES             YES            YES                   YES
transform_exclusive_scan   YES             NO             YES                   YES
transform_inclusive_scan   YES             NO             YES                   YES
transform_reduce           YES             YES            YES                   YES
binary_search              YES             NO             YES                   YES
merge                      YES             NO             YES                   YES
scatter                    YES             NO             YES                   YES
scatter_if                 YES             NO             YES                   YES
gather                     YES             NO             YES                   YES
gather_if                  YES             NO             YES                   YES
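The matrix lists both inclusive_scan and exclusive_scan. As a reminder of what distinguishes them, here is a serial reference sketch using only the standard library. The function names are hypothetical, and the initial value passed to the exclusive variant is a parameter chosen for this illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive scan: out[i] is the sum of in[0..i], including in[i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in)
{
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] is init plus the sum of in[0..i-1], excluding in[i].
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init)
{
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;      // value before in[i] is added
        running += in[i];
    }
    return out;
}
```

For the input {1, 2, 3, 4}, the inclusive result is {1, 3, 6, 10} and the exclusive result with init 0 is {0, 1, 3, 6}.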
10. BOLT BENEFITS FOR PROGRAMMERS
Single source can target GPUs and/or multi-core CPUs
Optimized GPU library implementations of common operations (sort, scan, reduce, etc.)
Familiar STL-like C++ syntax
‒ Kernels created automatically
STL-like “device_vector” to simplify memory management
‒ Supports typed memory allocation
Benefits for OpenCL users:
‒ Sensible selection of defaults (platform, context, device, queue)
‒ With optional overrides for users who want more control
‒ Kernels compiled automatically on first call.
‒ Simplified, C++-style kernel calling convention (replaces clSetKernelArg)
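Compiling a kernel automatically on first call usually comes down to a cache keyed by the kernel source. The sketch below shows that idea in plain C++. KernelCache, its Compiler callback, and the integer handle are all hypothetical stand-ins and do not reflect Bolt's internal API or real OpenCL types.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical compile-on-first-use cache. The Compiler callback stands in
// for an expensive build step and returns an opaque integer handle.
class KernelCache {
public:
    using Compiler = std::function<int(const std::string&)>;

    explicit KernelCache(Compiler compile) : compile_(std::move(compile)) {}

    // Returns the handle for 'source', compiling only if it was not seen before.
    int get(const std::string& source)
    {
        auto it = cache_.find(source);
        if (it != cache_.end())
            return it->second;            // already compiled: reuse the handle
        int handle = compile_(source);    // first call: pay the compile cost
        cache_.emplace(source, handle);
        return handle;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Compiler compile_;
    std::unordered_map<std::string, int> cache_;
};
```

With a cache like this, only the first call for a given functor pays the compilation cost, and later calls reuse the stored handle.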
12. PORTABILITY
Current State of Bolt OS and Vendor Portability:
‒ Supports C++AMP
  ‒ Runs on any DX11-compliant video card
  ‒ But: no Linux® solution
‒ Supports OpenCL™
  ‒ Runs on Windows® and Linux®
  ‒ But: uses the AMD C++ static kernel feature, which is only available on AMD OpenCL™
Future: Improving portability
‒ C++AMP:
  ‒ Provide Linux® port for C++AMP
  ‒ Increase the number of Bolt APIs that are supported with C++AMP
‒ OpenCL™:
  ‒ Provide translator tool for C++ static kernel language
13. OPEN-SOURCE C++ AMP TOOLCHAIN
Diagram (flow): C++AMP -> CLANG Front-end -> LLVM-IR or SPIR 1.2, then either
‒ LLVM Compiler -> HSAIL -> Any HSA Implementation, or
‒ SPIR 1.2 -> Any OpenCL™+SPIR Implementation
Open-source!
Preliminary Version: https://bitbucket.org/multicoreware/cppamp-driver/
Samples: https://bitbucket.org/multicoreware/cxxamp_sandbox
More development coming in 1H-2014
14. C++ STATIC KERNEL LANGUAGE AND NEW TRANSLATOR TOOL
AMD C++ Static Kernel Language (aka “OpenCL-C++ Kernel Language”)
‒ Adds C++ features to OpenCL-C kernel language
‒ Templates !
‒ Classes
‒ Also: Namespaces, Inheritance, References, “this” pointer, more
‒ Available as an OpenCL™ extension on AMD platforms
‒ http://developer.amd.com/wordpress/media/2012/10/CPP_kernel_language.pdf
New translator tool is designed to bring these benefits to any OpenCL™ Implementation:
Bolt Code -> C++ Template Instantiation -> OpenCL-C++ Code -> Translator -> OpenCL-C Code -> Any OpenCL™ Implementation!
Translator expected to be available in Q1-2014
15. FUTURE BOLT SUPPORT MATRIX
            OpenCL™               C++ AMP
Windows®    Any OpenCL™ vendor    Any DX11 vendor
Linux®      Any OpenCL™ vendor    Any OpenCL™ SPIR or HSAIL vendor
Green shows additional platforms on which Bolt will run accelerated code paths
Bolt can also run multi-core CPU paths if GPU acceleration is not available
16. BOLT DEMO
OpenMM : Open-source simulation tool for molecular simulation
See how minor code modifications to a large application enable significant acceleration.
See Bolt code translated and running on Intel® OpenCL™ platform
18. SHARED VIRTUAL MEMORY - RECAP
CPU and GPU share virtual memory space
Can pass pointers between CPU and GPU
GPU can access terabytes of pageable virtual memory
High performance from all devices
Called “Shared Virtual Memory” or “SVM”
Key feature of HSA
Diagram: CPU0 and GPU each perform VA->PA translation from one shared VIRTUAL MEMORY space onto the same PHYSICAL MEMORY
19. PROGRAMMABILITY: HOW SVM MAKES BOLT BETTER
Bolt today provides
‒ Familiar, programmer-friendly interface for accelerated programming
‒ Single-source portability to GPUs and multi-core CPUs
SVM + Bolt:
‒ Efficient access to host memory from GPU
‒ No need for “heap management” between host and device memory
‒ No copies, less overhead
‒ Single address space is a natural fit for Bolt single-source interface
‒ Program like a multi-core CPU, benefit from GPU-like acceleration
‒ Ability to use pointer-containing data structures
‒ Dramatically expands the code that can use functors and the template library: functors can contain pointers!
20. CONVOLUTION / SOBEL EDGE FILTER
Compute Kernel applied to each pixel:
Gx =
[ -1  0 +1 ]
[ -2  0 +2 ]
[ -1  0 +1 ]
Gy =
[ -1 -2 -1 ]
[  0  0  0 ]
[ +1 +2 +1 ]
G = sqrt(Gx^2 + Gy^2)
Challenge: Need to examine surrounding pixels.
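As a plain CPU reference for the operator above, the sketch below applies the two 3x3 masks to a single-channel, row-major image. The function name sobel_reference is hypothetical. The single-channel layout and the clamp to 255 are assumptions made for this illustration and differ from the scaling used in the slide code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Apply the 3x3 Sobel operator to a single-channel, row-major image.
// Border pixels are left at 0 because they lack a full 3x3 neighbourhood.
std::vector<unsigned char> sobel_reference(const std::vector<unsigned char>& in,
                                           int w, int h)
{
    static const int kx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int ky[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };

    std::vector<unsigned char> out(in.size(), 0);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            int gx = 0, gy = 0;
            // Accumulate both gradients over the surrounding pixels.
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int p = in[(y + dy) * w + (x + dx)];
                    gx += kx[dy + 1][dx + 1] * p;
                    gy += ky[dy + 1][dx + 1] * p;
                }
            }
            // G = sqrt(Gx^2 + Gy^2), clamped to the 8-bit range (assumption).
            double g = std::sqrt(static_cast<double>(gx * gx + gy * gy));
            out[y * w + x] = static_cast<unsigned char>(std::min(g, 255.0));
        }
    }
    return out;
}
```

Each output pixel depends on its eight neighbours, which is why the filter needs access to the whole input image rather than a single element.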
21. SOBEL CODE EXAMPLE
typedef unsigned char uchar; // assumed; not shown on the slide

struct SobelFilter {
    // Pointers in the struct
    uchar *inImg;
    uchar *outImg;
    int w, h;

    SobelFilter(uchar *inI, uchar *outI, int xw, int xh) :
        inImg(inI), outImg(outI), w(xw), h(xh) {}

    // Kernel uses pointers and computes indices for surrounding pixels
    void operator() (int y, int x)
    {
        int i = y*w + x;
        int gx =
              1 * *(inImg + i - 4 - w)
            + 2 * *(inImg + i - w)
            + 1 * *(inImg + i + 4 - w)
            + -1 * *(inImg + i - 4 + w)
            + -2 * *(inImg + i + w)
            + -1 * *(inImg + i + 4 + w);
        int gy =
              1 * *(inImg + i - 4 - w)
            + -1 * *(inImg + i + 4 - w)
            + 2 * *(inImg + i - 4)
            + -2 * *(inImg + i + 4)
            + 1 * *(inImg + i - 4 + w)
            + -1 * *(inImg + i + 4 + w);
        outImg[i] = (uchar)(sqrt(gx*gx + gy*gy) / 2.0f);
    }
};

inputImage = (uchar*)malloc(w*h*4);
// Init inputImage not shown
outputImage = (uchar*)malloc(w*h*4);

// Pass the pointers through the functor:
// Construct functor object:
SobelFilter filter(inputImage, outputImage, w*4, h);

// Set border and call the filter for each pixel.
// Pass 2D coordinate to filter
bolt::cl::for_each_2d(1, h-1, 4, w*4-4, filter);
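The calling pattern on this slide, a 2D range plus a functor that carries raw pointers, can be mimicked serially on the host. In the sketch below, for_each_2d_serial and HorizontalDiff are hypothetical names. The half-open bounds and the (yBegin, yEnd, xBegin, xEnd, functor) argument order are assumptions inferred from the slide's call, not a documented Bolt signature.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical serial stand-in for a 2D for-each: calls f(y, x) for every
// point in the half-open range [y0, y1) x [x0, x1).
template <typename F>
void for_each_2d_serial(int y0, int y1, int x0, int x1, F f)
{
    for (int y = y0; y < y1; ++y)
        for (int x = x0; x < x1; ++x)
            f(y, x);
}

// Functor that carries raw pointers, in the same style as SobelFilter.
// It writes the absolute difference of the left and right neighbours.
struct HorizontalDiff {
    const unsigned char *in;
    unsigned char *out;
    int w;  // row stride in elements

    HorizontalDiff(const unsigned char *i, unsigned char *o, int stride)
        : in(i), out(o), w(stride) {}

    void operator()(int y, int x) const
    {
        int idx = y * w + x;
        out[idx] = static_cast<unsigned char>(std::abs(in[idx + 1] - in[idx - 1]));
    }
};
```

The functor is copied into the loop by value, but because it holds pointers, every invocation still reads and writes the caller's buffers. Shrinking the x range by one element on each side keeps the neighbour accesses in bounds, which mirrors the border handling in the slide's call.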
22. THE POWER OF SVM AND BOLT
Functor can store additional parameters
With SVM, additional parameters can also be pointers !
‒ Powerful way to access host data structures, linked lists, avoid copies
Pointers are initialized in the constructor (runs on host) and accessed in the operator() body (runs on device)
struct SampleFunctor
{
    SampleFunctor(…) : /* Init Pointers Here */ {}

    // Sample Pointers:
    class MyFancyHostDataStructure *handy;
    class ListNode *head;
    int *myArray;

    float operator() (…) {
        // Access pointers here!
    };
};
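A host-only sketch of the pattern follows: a functor stores a pointer to a linked list in its constructor and walks that list inside operator(). ListNode's fields, AddListSum, and add_list_sum are hypothetical names for illustration. The code runs on the CPU via std::transform and only shows the shape that shared virtual memory would let a device-side functor use as well.

```cpp
#include <algorithm>
#include <vector>

// Minimal singly linked list node (hypothetical layout).
struct ListNode {
    int value;
    ListNode *next;
};

// Functor that stores a host pointer. The pointer is set in the constructor
// and dereferenced later inside operator().
struct AddListSum {
    const ListNode *head;

    explicit AddListSum(const ListNode *h) : head(h) {}

    int operator()(int x) const
    {
        int sum = 0;
        for (const ListNode *n = head; n != nullptr; n = n->next)
            sum += n->value;   // follow pointers stored in host memory
        return x + sum;
    }
};

// Adds the sum of all list values to each element of 'in'.
std::vector<int> add_list_sum(const std::vector<int>& in, const ListNode *head)
{
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), AddListSum(head));
    return out;
}
```

Without a shared address space, the list would have to be flattened and copied before a device could read it, which is the overhead the slide refers to.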
24. ALGORITHM PERFORMANCE
Chart: Sort Performance (32-bit elements), Bolt vs SHOC sort. Series: Bolt(OpenCL), SHOC.
Chart: Sort Performance (32-bit elements), Bolt vs std::sort. Series: Bolt(OpenCL), Bolt(MultiCoreCPU), std::sort.
Axes (both charts): x = Millions Elements (1 to 32), y = Millions Elements/Sec (0 to 120).
Data collected on future AMD APU
25. CONCLUSIONS
Bolt: C++ Template Library Designed For Heterogeneous Computing
Portability
(Linux®, Windows®) x (Multiple GPU Vendors) x (OpenCL™, C++ AMP)
Programmability
SVM + Bolt = Even easier and more flexible heterogeneous programming model
Handy access to host pointers and extra functor parameters
Performance
Bolt-GPU > Multi-Core CPU > STL
High-performance implementations, written by experts