The document introduces Bolt, a C++ template library for heterogeneous system architecture (HSA) that aims to improve developer productivity for GPU programming. Bolt provides optimized library routines for common GPU operations using open standards like OpenCL and C++ AMP. It resembles the familiar C++ Standard Template Library. Bolt allows programming GPUs as easily as CPUs, handles workload distribution across devices, and provides a single source code base for both CPU and GPU. Examples show how Bolt can be used with C++ AMP and OpenCL, including passing user-defined functors. An exemplary video enhancement application demonstrates Bolt's use in a commercial product.
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...AMD Developer Central
Presentation CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distributed Platforms, by Max Grossman at the AMD Developer Summit (APU13) November 11-13, 2013.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...AMD Developer Central
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
Great paper on HSAemu, a full-system simulator built from PQUEMU to do full-system emulation of HSA, from our academic member Yeh-Ching Chung of National Tsing Hua University.
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...AMD Developer Central
Keynote presentation, The Role of Java in Heterogeneous Computing, and How You Can Help, by Nandini Ramani, VP, Java Platform, Oracle Corporation, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...AMD Developer Central
Presentation WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang, John Yoon and Nicolas Lorain at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...AMD Developer Central
Presentation CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
Modern graphics processing units (GPUs) are efficient general-purpose stream processors. Learn how Java can exploit the power of GPUs to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
AMD’s math libraries can support a range of programmers from hobbyists to ninja programmers. Kent Knox from AMD’s library team introduces you to OpenCL libraries for linear algebra, FFT, and BLAS, and shows you how to leverage the speed of OpenCL through the use of these libraries.
Review the material presented in the AMD Math libraries webinar in this deck.
For more:
Visit the AMD Developer Forums:http://devgurus.amd.com/welcome
Watch the replay: www.youtube.com/user/AMDDevCentral
Follow us on Twitter: https://twitter.com/AMDDevCentral
This Webinar explores a variety of new and updated features in Java 8 and discusses how these changes can positively impact your day-to-day programming.
Watch the video replay here: http://bit.ly/1vStxKN
Your Webinar presenter, Marnie Knue, is an instructor for Develop Intelligence and has taught Sun & Oracle certified Java classes, RedHat JBoss administration, Spring, and Hibernate. Marnie also has spoken at JavaOne.
Gives a high level overview of the new memory model introduced in C++11 and C11. Intended to give a useful mental model to aid understanding of more technical descriptions.
With the typical Apple understatement, Craig Federighi defined Swift as "how really everyone should be programming for the next 20 years". Is it true? Is it convenient? Is it safe? Is it fun?
In this talk, we'll see what has happened since Swift was released, and then open-sourced, presenting the pros and cons of using it as a server-side programming language.
We'll see the major web frameworks in the Swift ecosystem, their similarities and differences, and finally we'll tinker with two "real world" Swift on Linux apps.
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]Chris Adamson
Abstract: Chances are you code in a language that's either descended from C, inspired by C, or runs in an interpreter that is itself written in C. Still... do you actually know how to code in C? Despite its long-standing position as a sort of "lingua franca", an agreed-upon common language, more and more developers are putting together successful, satisfying careers without ever learning this seminal language. But what if you have to call into C code from your favorite scripting language, or use APIs like OpenGL that are written to be called from C? Many developers find C very challenging, particularly its manual memory-management and other low-level concerns. In this session, we'll show you why you shouldn't be afraid of C, how you can use the skills you already have from the languages you code in today, and how to master structs, enums, typedefs, malloc(), free(), and the rest of C's sharp edges. Examples will be from the point-of-view of the C-skewing iPhone SDK, but will be designed to be broadly applicable and platform-agnostic.
Despite being a slow interpreter, Python is a key component in high-performance computing (HPC). Python is easy to use. C++ is fast. Together they are a beautiful blend. A new tool, pybind11, makes this approach even more attractive to HPC code. It focuses on the niceties C++11 brings in. Beyond the syntactic sugar around the Python C API, it is interesting to see how pybind11 handles the vast difference between the two languages, and what matters to HPC.
PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben SanderAMD Developer Central
Presentation PT-4059, Bolt: A C++ Template Library for Heterogeneous Computing, by Ben Sander, at the AMD Developer Summit (APU13) November 11-13, 2013.
We describe ocl, a Python library built on top of pyOpenCL and numpy. It allows programming GPU devices using Python. Python functions which are marked up using the provided decorator are converted into C99/OpenCL and compiled using the JIT at runtime. This approach lowers the barrier to entry to programming GPU devices since it requires only Python syntax and no external compilation or linking steps. The resulting Python program runs even if a GPU is not available. As an example of application, we solve the problem of computing the covariance matrix for historical stock prices and determining the optimal portfolio according to Modern Portfolio Theory.
Similar to Bolt C++ Standard Template Library for HSA by Ben Sander, AMD
HSA Runtime Specification Provisional 1.0, which describes the HSA Runtime, covering error handling, runtime initialization and shutdown, system and agent information, signals and synchronization, architected dispatch, and memory management.
HSA Platform System Architecture Specification Provisional ver. 1.0, ratified HSA Foundation
HSA Foundation Provisional 1.0 Platform Systems Architecture Specification
The document identifies from the hardware point of view the system architecture requirements necessary to support the Heterogeneous System Architecture (HSA) programming model and HSA application and system software infrastructure.
It defines a set of functionality and features for HSA hardware product deliverables to meet the minimum specified requirements to qualify for a valid HSA product.
HSA Programmer’s Reference Manual: HSAIL Virtual ISA and Programming Model, C...HSA Foundation
This document describes the Heterogeneous System Architecture Intermediate Language (HSAIL), which is a virtual machine and an intermediate language.
This document serves as the specification for the HSAIL language for HSA implementers. Note that there are a wide variety of methods for implementing these requirements.
If you would like access to the HSAIL Simulator and Assembler, go to github.com/HSAFoundation
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at...HSA Foundation
HSA is a new computing platform architecture being standardized by the HSA Foundation, whose founding members are AMD, ARM, Imagination, TI, Mediatek, Samsung and Qualcomm. HSA is intended to make the use of heterogeneous programming widespread by making purpose-built architectures as easy to program as modern CPUs are. We start off by doing this with the GPU, the most widely deployed companion processor to the CPU and one which especially complements the CPU in low power and performance workloads. This requires some hardware architecture changes, that we have been working on for some time (in particular those that enable user mode scheduling, unified address space, unified shared memory, compute context switching, etc.) and which we have encapsulated into the spec currently under review by the HSA Foundation.
In short, HSA codifies the hardware architecture changes that are needed to enable mainstream programmers to develop heterogeneous applications with the same facility that they do CPU-only applications, by seamlessly integrating the sequential programming capability of the CPU with the parallel compute capability of the GPU. We describe the software stacks that are needed for HSA, the benefits that accrue to both developers as well as end users, and describe our vision of how HSA will help unify the ecosystems of the smartphone and tablet platforms as well as bring it closer to that of the traditional PC market. We will provide analysis of several examples which arise in applications and present data to validate the performance per watt benefit of HSA.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis' slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then agile testing, and finally Testing in DevOps. We closed with a lovely workshop in which participants explored different ways to think about quality and testing across the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated in the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Bolt C++ Standard Template Library for HSA by Ben Sander, AMD
1.
2. BOLT: A C++ TEMPLATE LIBRARY FOR HSA
Ben Sander
AMD
Senior Fellow
3. MOTIVATION
§ Improve developer productivity
– Optimized library routines for common GPU operations
– Works with open standards (OpenCL™ and C++ AMP)
– Distributed as open source
§ Make GPU programming as easy as CPU programming
– Resemble familiar C++ Standard Template Library
– Customizable via C++ template parameters
– Leverage high-performance shared virtual memory
C++ Template Library For HSA
§ Optimize for HSA
– Single source base for GPU and CPU
– Platform Load Balancing
3 | BOLT | June 2012
4. AGENDA
§ Introduction and Motivation
§ Bolt Code Examples for C++ AMP and OpenCL™
§ ISV Proof Point
§ Single source code base for CPU and GPU
§ Platform Load Balancing
§ Summary
5. SIMPLE BOLT EXAMPLE
#include <bolt/sort.h>
#include <vector>
#include <algorithm>
void main()
{
// generate random data (on host)
std::vector<int> a(1000000);
std::generate(a.begin(), a.end(), rand);
// sort, run on best device
bolt::sort(a.begin(), a.end());
}
§ Interface similar to familiar C++ Standard Template Library
§ No explicit mention of C++ AMP or OpenCL™ (or GPU!)
– More advanced use cases allow the programmer to supply a kernel in C++ AMP or OpenCL™
§ Direct use of host data structures (ie std::vector)
§ bolt::sort implicitly runs on the platform
– Runtime automatically selects CPU or GPU (or both)
7. BOLT FOR C++ AMP : LEVERAGING C++11 LAMBDA
#include <bolt/transform.h>
#include <vector>
void main(void)
{
const float a=100;
std::vector<float> x(1000000); // initialization not shown
std::vector<float> y(1000000); // initialization not shown
std::vector<float> z(1000000);
// saxpy with C++ Lambda
bolt::transform(x.begin(), x.end(), y.begin(), z.begin(),
[=] (float xx, float yy) restrict(cpu, amp) {
return a * xx + yy;
});
};
§ Functor (“a * xx + yy”) now specified inline
§ Can capture variables from surrounding scope (“a”) – eliminate boilerplate class
8. BOLT FOR OPENCL™
#include <clbolt/sort.h>
#include <vector>
#include <algorithm>
void main()
{
// generate random data (on host)
std::vector<int> a(1000000);
std::generate(a.begin(), a.end(), rand);
// sort, run on best device
clbolt::sort(a.begin(), a.end());
}
§ Interface similar to familiar C++ Standard Template Library
§ clbolt uses OpenCL™ below the API level
– Host data copied or mapped to the GPU
– First call to clbolt::sort will generate and compile a kernel
§ More advanced use cases allow the programmer to supply a kernel in OpenCL™
9. BOLT FOR OPENCL™ : USER-SPECIFIED FUNCTOR

#include <clbolt/transform.h>
#include <vector>

BOLT_FUNCTOR(SaxpyFunctor,
struct SaxpyFunctor
{
  float _a;
  SaxpyFunctor(float a) : _a(a) {};
  float operator() (const float &xx, const float &yy)
  {
    return _a * xx + yy;
  };
};
);

void main2() {
  SaxpyFunctor s(100);
  std::vector<float> x(1000000); // initialization not shown
  std::vector<float> y(1000000); // initialization not shown
  std::vector<float> z(1000000);
  clbolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
};

§ Challenge: OpenCL™ split-source model
– Host code in C or C++
– OpenCL™ code specified in strings
§ Solution:
– BOLT_FUNCTOR macro creates both host-side and string versions of "SaxpyFunctor" class definition
§ Class name ("SaxpyFunctor") stored in TypeName trait
§ OpenCL™ kernel code (SaxpyFunctor class def) stored in ClCode trait
– Clbolt function implementation:
§ Can retrieve traits from class name
§ Uses TypeName and ClCode to construct a customized transform kernel
§ First call to clbolt::transform compiles the kernel
– Advanced users can directly create ClCode trait
10. BOLT: C++ AMP VS. OPENCL™

BOLT for C++ AMP:
§ C++ template library for HSA
– Developer can customize data types and operations
– Provide library of optimized routines for AMD GPUs
§ C++ Host Language
§ Kernels marked with "restrict(cpu, amp)"
§ Kernels written in C++ AMP kernel language
– Restricted set of C++
§ Kernels compiled at compile-time
§ C++ Lambda Syntax Supported
§ Functors may contain array_view
§ Parameters can use host data structures (ie std::vector)
§ Parameters can be array or array_view types
§ Use "bolt" namespace

BOLT for OpenCL™:
§ C++ template library for HSA
– Developer can customize data types and operations
– Provide library of optimized routines for AMD GPUs
§ C++ Host Language
§ Kernels marked with "BOLT_FUNCTOR" macro
§ Kernels written in OpenCL™ kernel language
– Subset of C99, with extensions (ie vectors, builtins)
§ Kernels compiled at runtime, on first call
– Some compile errors shown on first call
§ C++11 Lambda Syntax NOT supported
§ Functors may not contain pointers
§ Parameters can use host data structures (ie std::vector)
§ Parameters can be cl::Buffer or cl_buffer types
§ Use "clbolt" namespace
11. BOLT : WHAT’S NEW?
§ Optimized template library routines for common GPU functions
– For OpenCL™ and C++ AMP, across multiple platforms
§ Direct interfaces to host memory structures (ie std::vectors)
– Leverage HSA unified address space and zero-copy memory
– C++ AMP array and cl::Buffer also supported if memory already on device
§ Bolt submits to the entire platform rather than a specific device
– Runtime automatically selects the device
– Provides opportunities for load-balancing
– Provides optimal CPU path if no GPU is available.
– Override to specify specific accelerator is supported
– Enables developers to fearlessly move to the GPU
§ Bolt will contain new APIs optimized for HSA Devices
– Multi-device bolt::pipeline, bolt::parallel_filter
12. EXEMPLARY ISV PROOF-POINT

§ "Hessian" kernel from "MotionDSP Ikena"
– Commercially available video enhancement software
– Optimized for CPU and GPU
§ Basic Hessian Algorithm
– Two input images I and W
– Transform, followed by reduce ("transform_reduce")
§ For each pixel in image, compute 14 float coefficients
§ Sum the coefficients for all the pixels – final result is 14 floats
– Complex, computationally intense, real-world algorithm
§ Developed multiple implementations of Hessian kernel
– CPU, GPU, Bolt

Hessian Algorithm Pseudo Code:
// x,y are coordinates of pixel to transform
// Pixel difference:
It = W(y, x) - I(y, x);
// average left/right pixels:
Ix = 0.5f * ( W(y, x+1) - W(y, x-1) );
// average top/bottom pixels:
Iy = 0.5f * ( W(y+1, x) - W(y-1, x) );
X = x dist of this pixel from center
Y = y dist of this pixel from center
…
// Compute for each pixel:
H[ 0] = (Ix*X+Iy*Y) * (Ix*X+Iy*Y)
H[ 1] = (Ix*X-Iy*Y) * (Ix*X+Iy*Y)
H[ 2] = (Ix*X-Iy*Y) * (Ix*X-Iy*Y)
H[ 3] = (Ix) * (Ix*X+Iy*Y)
H[ 4] = (Ix) * (Ix*X-Iy*Y)
H[ 5] = (Ix) * (Ix)
H[ 6] = (Iy) * (Ix*X+Iy*Y)
H[ 7] = (Iy) * (Ix*X-Iy*Y)
H[ 8] = (Iy) * (Ix)
H[ 9] = (Iy) * (Iy)
H[10] = (It) * (Ix*X+Iy*Y)
H[11] = (It) * (Ix*X-Iy*Y)
H[12] = (It) * (Ix)
H[13] = (It) * (Iy)
13. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
(Exemplary ISV “Hessian” Kernel)
[Chart: stacked bars of lines of code (LOC, 0–350) broken into Init, Compile, Copy, Launch, Algorithm, and Copy-back segments, with relative performance (0–35) overlaid, for Serial CPU, TBB, Intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt.]
13 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012
14. PERFORMANCE PORTABILITY - INTRODUCTION
§ For many algorithms, the core operation is the same on CPU and GPU
– See the sort, saxpy, and hessian examples
– Same core operation
– Differences lie in how data is routed to the core operation
§ Bolt hides the device-specific routing details inside the library function implementation
– GPU implementations:
§ GPU-friendly data strides
§ Launch enough threads to hide memory latency
§ Group Memory and work-group communication
– CPU implementations:
§ CPU-friendly data strides
§ Launch enough threads to use all cores
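The CPU/GPU split above can be illustrated with a minimal sketch: one functor holds the core operation, and device-specific wrappers only change the data strides. The "GPU path" here is a simulated strided loop standing in for real device code; the function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// One core operation, two routings. The functor is shared; the
// wrappers differ only in traversal order.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// CPU-friendly routing: contiguous, unit-stride traversal.
void run_cpu(const Saxpy& op, const std::vector<float>& x,
             std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = op(x[i], y[i]);
}

// GPU-friendly routing (simulated): "threads" walk with a wide stride,
// the way a work-group of size `threads` would on a device.
void run_gpu_like(const Saxpy& op, const std::vector<float>& x,
                  std::vector<float>& y, std::size_t threads) {
    for (std::size_t t = 0; t < threads; ++t)
        for (std::size_t i = t; i < x.size(); i += threads)
            y[i] = op(x[i], y[i]);
}
```

Both paths yield identical results; only the memory-access pattern changes, which is exactly the detail Bolt hides inside the library implementation.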
14 | BOLT | June 2012
15. PERFORMANCE PORTABILITY – RESULTS
[Chart: CPU relative performance (0.00–4.50) for the exemplary ISV “Hessian” kernel across Serial CPU, TBB (CPU), OpenCL™ (CPU), and HSA Bolt (CPU).]
15 | BOLT | June 2012
16. PERFORMANCE PORTABILITY – WHAT’S NEW ?
§ New GPU programming models are close to CPU programming models
– C++ AMP : Single-source, (restricted) C++11 kernel language, high-quality debugger/profiler, etc
§ Shared Virtual Memory in HSA
– Removes tedious copies between address spaces
– Will allow use of complex pointer-containing data structures
§ Fewer performance cliffs in modern GPU architectures (e.g., AMD GCN)
– Reduce need for GPU-specific optimizations in core operation
– Example: 14:7:1 Bandwidth Ratio for Group:Cache:Global Memory
§ Autovectorization
– Modern compilers include auto-vectorization support
– Restrictions of GPU programming models facilitate vectorization
§ Finally, Bolt functors can provide device-specific implementations if needed
16 | BOLT | June 2012
17. HSA LOAD BALANCING : KEY FEATURES AND OBSERVATIONS
§ High-performance shared virtual memory
– Developers no longer have to worry about data location (i.e., device vs. host)
§ HSA platforms have tightly integrated CPU and GPU
– GPU better at wide vector parallelism, extracting memory bandwidth, latency hiding
– CPU better at fine-grained vector parallelism, cache-sensitive code, control-flow
§ Bolt Abstractions
– Provides insight into the characteristics of the algorithm
§ Reduce vs Transform vs parallel_filter
– Abstraction above the details of a “kernel launch”
§ Don’t need to specify device, workgroup shape, work-items, number of kernels, etc
§ Runtime may optimize these for the platform
§ Bolt has access to both optimized CPU and GPU implementations, at the same time
– Let’s use both!
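The "let's use both" idea can be sketched as a split of an index range between the two devices. The threshold and ratio are illustrative assumptions; a real runtime would measure and adapt the split dynamically.

```cpp
#include <cstddef>
#include <utility>

// Split n work items between GPU and CPU by a throughput share.
// Returns {GPU portion, CPU portion}.
std::pair<std::size_t, std::size_t>
split_work(std::size_t n, double gpu_share) {
    std::size_t gpu_items = static_cast<std::size_t>(n * gpu_share);
    return { gpu_items, n - gpu_items };
}

// Data-size heuristic: small problems stay on the CPU (launch overhead
// dominates), large ones go to the GPU. The threshold is illustrative.
bool run_on_gpu(std::size_t n, std::size_t threshold = 4096) {
    return n >= threshold;
}
```

Because Bolt sits above the kernel-launch details, it can apply such policies without any change to the caller's code.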
17 | BOLT | June 2012
18. EXAMPLES OF HSA LOAD-BALANCING
Example | Description | Exemplary Use Cases
Data Size | Run large data sizes on the GPU, small on the CPU. | Same call-site used for varying data sizes.
Reduction | Run initial reduction phases on the GPU, run final stages on the CPU. | Any reduction operation.
Border/Edge Optimization | Run wide center regions on the GPU, run border regions on the CPU. | Image processing.
Platform Super-Device | Distribute workgroups to available processing units on the entire platform. | Kernel has similar performance/energy on CPU and GPU.
Heterogeneous Pipeline | Run a pipelined series of user-defined stages. Stages can be CPU-only, GPU-only, or CPU or GPU. | Video processing pipeline.
Parallel_filter | GPU scans all candidates and rejects early mismatches; CPU more deeply evaluates the survivors. | Haar detector, word search, audio search.
18 | BOLT | June 2012
19. HETEROGENEOUS PIPELINE
§ Mimics a traditional manufacturing assembly line
– Developer supplies a series of pipeline stages
– Each stage processes its input token and passes an output token to the next stage
– Stages can be either CPU-only, GPU-only, or CPU/GPU
§ CPU/GPU tasks are dynamically scheduled
– Use queue depth and estimated execution time to drive scheduling decision
– Adapt to variation in target hardware or system utilization
– Data location not an issue in HSA
– Leverage single source code
§ GPU kernels scheduled asynchronously
– Completion invokes next stage of the pipeline
§ Simple Video Pipeline Example:
Video Decode (CPU-only) → Video Processing (CPU/GPU) → Render (GPU-only)
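The token-passing dataflow above can be modeled with a minimal sequential sketch. A real Bolt/HSA pipeline would run stages concurrently and schedule CPU/GPU-capable stages dynamically; this only shows the stage-to-stage handoff, with illustrative names.

```cpp
#include <functional>
#include <string>
#include <vector>

// Each stage consumes its input token and produces an output token
// for the next stage, assembly-line style.
using Stage = std::function<std::string(const std::string&)>;

std::vector<std::string>
run_pipeline(const std::vector<Stage>& stages,
             const std::vector<std::string>& frames) {
    std::vector<std::string> out;
    for (const auto& frame : frames) {
        std::string token = frame;
        for (const auto& stage : stages)   // decode -> process -> render
            token = stage(token);
        out.push_back(token);
    }
    return out;
}
```

In the concurrent version, each stage would pull from a queue fed by its predecessor, which is what makes queue depth a useful scheduling signal.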
19 | BOLT | June 2012
20. CASCADE DEPTH ANALYSIS
[Chart: 3D surface of cascade depth (0–25) across the image, binned into 0-5, 5-10, 10-15, 15-20, and 20-25 ranges.]
20 | The Programmer’s Guide to a Universe of Possibility | June 12, 2012
21. PARALLEL_FILTER
§ Target applications with a “Filter” pattern
– Filter out a small number of results from a large initial pool of candidates
– Initial phases best run on GPU:
§ Large data sets (too big for caches), wide vector, high-bandwidth
– Tail phases best run on CPU
§ Smaller data sets (may fit in cache), divergent control flow, fine-grained vector width
– Examples: Haar detector, word search, acoustic search
§ Developer specifies:
– Execution Grid
– Iteration state type and initial value
– Filter function
§ Accepts a point to process and the current iteration state
§ Returns True to continue processing or False to exit
§ BOLT / HSA Runtime
– Automatically hands off work between CPU and GPU
– Balances work by adjusting the split point between GPU and CPU
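The filter pattern above can be sketched as two passes: a cheap, data-parallel first pass (GPU-style) that rejects most candidates, and an expensive second pass (CPU-style) that evaluates the survivors. Function names are illustrative, not the Bolt parallel_filter API, and both passes run on the CPU here.

```cpp
#include <functional>
#include <vector>

// Two-phase filter: phase 1 is the wide, cheap early-rejection test;
// phase 2 is the deep evaluation applied only to survivors.
std::vector<int>
filter_candidates(const std::vector<int>& candidates,
                  const std::function<bool(int)>& cheap_pass,
                  const std::function<bool(int)>& deep_pass) {
    std::vector<int> survivors;
    for (int c : candidates)              // phase 1: early rejection (GPU-friendly)
        if (cheap_pass(c)) survivors.push_back(c);
    std::vector<int> matches;
    for (int c : survivors)               // phase 2: deep evaluation (CPU-friendly)
        if (deep_pass(c)) matches.push_back(c);
    return matches;
}
```

The runtime's job is then to pick, and continuously adjust, the split point between the two phases across the two devices.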
21 | BOLT | June 2012
22. SUMMARY
§ Bolt: C++ Template Library
– Optimized GPU and HSA Library routines
– Customizable via templates
– For both OpenCL™ and C++ AMP
§ Enjoy the unique advantages of the HSA Platform
– High-performance shared virtual memory
– Tightly integrated CPU and GPU
§ Enable advanced HSA features
– A single source base for CPU and GPU
– Platform load balancing across CPU and GPU
22 | BOLT | June 2012