The document provides a history of GPUs and GPGPU computing. It describes how GPUs evolved from fixed hardware for graphics to programmable hardware. This allowed general purpose computing on GPUs (GPGPU). It discusses the development of GPGPU languages and APIs like CUDA, OpenCL, and DirectCompute. The anatomy of a modern GPU is explained, highlighting its massively parallel architecture. Typical GPGPU execution and memory models are outlined. Usage of GPGPU for applications like graphics, physics, computer vision, and HPC is mentioned. Leading GPU vendors and their products are briefly introduced.
This is a presentation I gave at our most recent GPGPU workshop, in April 2013.
The usage of GPGPU is expanding, creating a continuum from mobile to HPC. At the same time, the question is whether the GPGPU languages are the right ones (well, no), and whether we are wasting resources re-developing the same SW stack instead of converging.
This presentation describes the components of the GPU ecosystem for compute, provides an overview of existing ecosystems, and contains a case study on NVIDIA Nsight.
4. From Shaders to Compute (1)
In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from: “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
5. From Shaders to Compute (2)
• GPUs evolved to become programmable (which made gaming companies very happy…)
Shader: a simple program that runs on a graphics processing unit and describes the traits of either a vertex or a pixel.
6. The birth of GPGPU (1)
• Interest from the academic world
Pixel shader = run the same program for every pixel of every frame (1024 × 768 pixels × 60 frames per second)
= highly efficient SPMD (Single Program, Multiple Data) machine
• Fictitious graphics pipe to solve problems
– Advanced Graphics problems
– General Computational problems
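The SPMD idea above can be sketched in a few lines. This is only an illustrative model, not real shader code: `shade`, `render_frame` and `invocations_per_second` are hypothetical names, the gradient colour is an arbitrary choice, and the loop runs sequentially where a GPU would run the invocations in parallel. The resolution and frame rate are the figures from the slide.

```python
# Hypothetical model of a pixel shader as an SPMD program:
# one function, invoked independently for every pixel of every frame.

WIDTH, HEIGHT, FPS = 1024, 768, 60  # figures taken from the slide


def shade(x, y):
    """Illustrative 'shader': the colour depends only on this pixel's coordinates."""
    r = 255 * x // (WIDTH - 1)
    g = 255 * y // (HEIGHT - 1)
    return (r, g, 128)


def render_frame(width=WIDTH, height=HEIGHT):
    """Apply the same program to every pixel (sequentially here, in parallel on a GPU)."""
    return [[shade(x, y) for x in range(width)] for y in range(height)]


def invocations_per_second(width=WIDTH, height=HEIGHT, fps=FPS):
    """How many times the single program runs each second."""
    return width * height * fps
```

Because no invocation reads another pixel's result, the work can be split across as many execution units as the hardware offers, which is what made the pixel pipeline attractive for general computation.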
7. The birth of GPGPU (2)
• In 2002, Mark Harris from NVIDIA coined the term GPGPU: “General-Purpose computation on Graphics Processing Units”
• Used a graphics language for general computation
• Highly effective, but:
– The developer needs to learn another (not intuitive) language
– The developer was limited by the graphics language
8. From Shaders to Compute (3)
• GPUs needed one more evolutionary step: Unified Shaders
9. Rise of modern GPGPU
• Unified Architecture paved the way for modern GPGPU languages
– GeForce 8800 GTX (G80) was released in Nov. 2006
– CUDA 0.8 (first official Beta) was released in Feb. 2007
– ATI X1900 (R580) was released in Jan. 2006
– CTM was released in Nov. 2006
10. Evolution of Compute APIs (GPGPU)
• CUDA & CTM led to two compute standards: DirectCompute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Runs only on Windows
– Microsoft C++ AMP maps to DirectCompute
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Release timeline:
– CTM SDK: Nov 2006
– CUDA 1.0: June 2007
– CUDA 2.0: Aug 2008
– OpenCL 1.0: Dec 2008
– DirectX 11: Oct 2009
– CUDA 3.0: Mar 2010
– OpenCL 1.1: June 2010
– CUDA 4.0: May 2011
– OpenCL 1.2: Nov 2011
– CUDA 4.1: Jan 2012
– CUDA 4.2: April 2012
– C++ AMP 1.0: Aug 2012
– CUDA 5.0: Oct 2012
– CUDA 5.5: July 2013
– OpenCL 2.0 (Provisional): July 2013
12. GPGPU Evolution
• Nov 2009: first hybrid supercomputer in the Top10, the Chinese Tianhe-1
– 1,024 Intel Xeon E5450 CPUs
– 5,120 Radeon 4870 X2 GPUs
• Nov 2010: first hybrid supercomputer reaches #1 on the Top500 list, Tianhe-1A
– 14,336 Xeon X5670 CPUs
– 7,168 Nvidia Tesla M2050 GPUs
Source: http://www.top500.org/lists/
13. GPGPU Evolution
2013 - OpenCL on : Nexus 4 (Qualcomm Adreno 320)
Nexus 10 (ARM Mali T604)
Android 4.2 adds GPU support for Renderscript
2014 – NVIDIA Tegra 5 will support CUDA
2013 – GPGPU Continuum becomes a reality
17. Parallelism detailed
• Multi (Many) Cores
– NVIDIA K20: 14 SMXs
– AMD HD7970: 32 Compute Units
– Intel Xeon Phi 5110P: 60 Cores
• Wide Vector Unit
– NVIDIA K20: 32 floats = Warp, 6 Warps per SMX
– AMD HD7970: 64 floats = Wavefront, 4 Wavefronts per CU
– Intel Xeon Phi 5110P: 16 floats = VPU, 1 VPU per Core
• Multi-threaded (latency/stalls hiding)
– NVIDIA K20: 64 Warps per SMX
– AMD HD7970: 40 Wavefronts per CU
Figure: NVIDIA GK110 SMX
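The three axes above can be read back from an actual NVIDIA device. The following is a minimal sketch using the CUDA runtime's `cudaGetDeviceProperties`; the struct `ParallelismInfo` and the helper names are made up for this illustration, and the mapping of fields to the three bullets is one reasonable reading rather than the only one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical summary of the three parallelism axes for one device.
struct ParallelismInfo {
    int cores;            // streaming multiprocessors ("Multi (Many) Cores")
    int vectorWidth;      // threads per warp ("Wide Vector Unit")
    int maxWarpsPerCore;  // resident warps per multiprocessor ("Multi-threaded")
};

// Fills `out` for device `dev`. Returns false if the runtime reports an
// error, for example when no CUDA device is present.
bool queryParallelism(int dev, ParallelismInfo* out)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;
    out->cores = prop.multiProcessorCount;
    out->vectorWidth = prop.warpSize;
    out->maxWarpsPerCore = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return true;
}

// Prints the summary for device `dev`.
void printParallelism(int dev)
{
    ParallelismInfo info;
    if (!queryParallelism(dev, &info)) {
        printf("device %d: query failed\n", dev);
        return;
    }
    printf("device %d: %d cores, %d-wide vectors, up to %d warps per core\n",
           dev, info.cores, info.vectorWidth, info.maxWarpsPerCore);
}
```

Calling `printParallelism(0)` from `main` reports the first device; the numbers differ per GPU model.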
18. Typical GPU Caveats
• Wide vectors = SIMD (SIMT) execution
– Conditional code has to be executed “vector wide”
– Mitigation: Predication (execute all code using masks on parts)
– Performance hit on mixed execution, up to 1/N efficiency (where N is
vector width)
• Many Cores & Small caches = High percentage of Stalls
– Mitigation:
• Hold multiple in-flight contexts (aka Warps/Wavefronts) per core
• Stall = fast context switch between in-flight context and active context
• Requires huge register bank (NV & AMD: 256KB per SMX/CU)
– Latency hiding depends on having enough in-flight contexts
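As a rough illustration of the first caveat, the sketch below contrasts a branch that splits the lanes of a single warp with a branch that is uniform across each warp. The kernel and helper names are hypothetical, and the code only shows the two patterns; it does not measure the performance difference.

```cuda
#include <cuda_runtime.h>

// Divergent: even and odd lanes of the same warp take different paths,
// so the warp runs both paths with part of its lanes masked off each time.
__global__ void perLaneBranch(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        out[i] = i * 2.0f;
    else
        out[i] = i + 1.0f;
}

// Uniform: the condition has the same value for every lane of a warp,
// so each warp runs only one of the two paths.
__global__ void perWarpBranch(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / warpSize) % 2 == 0)
        out[i] = i * 2.0f;
    else
        out[i] = i + 1.0f;
}

// Runs both kernels on n elements and copies each result to the host.
// Returns false on a CUDA runtime error.
bool runBranchDemo(float* perLane, float* perWarp, int n)
{
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void**)&d, bytes) != cudaSuccess) return false;
    int threads = 128;  // a multiple of the warp size
    int blocks = (n + threads - 1) / threads;

    perLaneBranch<<<blocks, threads>>>(d, n);
    bool ok = cudaMemcpy(perLane, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    perWarpBranch<<<blocks, threads>>>(d, n);
    ok = ok && cudaMemcpy(perWarp, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d);
    return ok;
}
```

With a two-way branch like this one, the divergent kernel keeps at most half of the lanes busy on each path; the 1/N worst case needs every lane to take its own path.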
A Must Read: (images to the right are taken from this talk)
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, By Kayvon Fatahalian, Stanford University and Mike Houston, Fellow, AMD
19. Typical GPGPU Models
This section describes some general GPGPU models, which apply
to a wide range of languages
20. Simplified System Model
• Host runs the OS, Application, Drivers, etc.
• GPU is connected to the Host through PCIe, Shared
Memory, etc.
Application code contains API calls*,
which use a Runtime environment,
which provides GPU access
The Application code contains “kernels”,
which are short programs/functions,
which are loaded and executed on the GPU
* In some languages the API calls are abstracted through special syntax or directives
Diagram: Host (Application, Runtime) connected to the GPU, which runs several Kernels
21. GPGPU Execution Model (1)
• A “kernel” is executed on a grid (1D/2D/3D)
• Each point in the grid executes one instance of
the kernel, independently of the other instances*
• Per-instance read/write is accomplished by using
the instance’s index
* There are sync primitives on a group/block level (or whole device)
CUDA example:
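#include <cuda_runtime.h>
#define N 1024  // matrix dimension (assumed here; the slide leaves N undefined)

// Kernel definition: each instance adds one matrix element
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    // Per-instance index into the 2D grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // Skip instances that fall outside the matrix
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    // Device buffers (filling the inputs and reading C back are omitted)
    float (*A)[N], (*B)[N], (*C)[N];
    size_t bytes = sizeof(float) * N * N;
    cudaMalloc((void**)&A, bytes);
    cudaMalloc((void**)&B, bytes);
    cudaMalloc((void**)&C, bytes);
    // Kernel invocation: 16 x 16 instances per block, grid rounded up to cover N
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
    cudaDeviceSynchronize();  // wait for the kernel to complete
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}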
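The block-level sync primitives mentioned in the footnote can be sketched with a per-block sum. This is an illustrative example, not taken from the deck: `BLOCK`, `blockSum` and `sumOnGpu` are hypothetical names, and the block size is assumed to be a power of two.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed; must be a power of two)

// Each block sums BLOCK consecutive inputs into one partial result.
__global__ void blockSum(const float* in, float* partial, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every load must land before any thread reads the tile

    // Tree reduction; the barrier separates one step from the next.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];
}

// Sums n floats (n > 0): per-block sums on the GPU, final sum on the host.
// Returns false on a CUDA runtime error.
bool sumOnGpu(const float* host, int n, float* result)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *dIn = nullptr, *dPartial = nullptr;
    if (cudaMalloc((void**)&dIn, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dPartial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, host, n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, BLOCK>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    bool ok = cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    *result = 0.0f;
    for (float p : partial) *result += p;
    cudaFree(dIn);
    cudaFree(dPartial);
    return ok;
}
```

The barrier only covers one block, which is why the partial sums are combined in a separate step on the host.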
22. GPGPU Execution Model (2)
• GPU execution model is asynchronous
– Commands are sent down the stack
– Kernels executed based on GPU load & status (serves a few Apps)
– Application code may wait on completion
• Queuing Model
– Explicit (OpenCL)
– Default is implicit, Advanced usage is explicit (CUDA)
• SPMD → MPMD
– GPUs used to execute only one kernel at a time
– Modern languages support multiple simultaneous kernels
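In CUDA, the explicit queues are called streams. The sketch below, with hypothetical names `addScalar` and `runOnTwoStreams`, enqueues two independent kernels on two streams and waits for completion only at the end. Whether the two kernels actually overlap depends on the hardware and driver, and copies from ordinary (pageable) host memory may not overlap with other work.

```cuda
#include <cuda_runtime.h>

__global__ void addScalar(float* data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

// Adds 1 to hostA and 2 to hostB using one stream (queue) per array.
// Returns false on a CUDA runtime error.
bool runOnTwoStreams(float* hostA, float* hostB, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr;
    if (cudaMalloc((void**)&dA, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dB, bytes) != cudaSuccess) {
        cudaFree(dA);
        return false;
    }
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // Commands in one stream run in order; the two streams are independent.
    cudaMemcpyAsync(dA, hostA, bytes, cudaMemcpyHostToDevice, s1);
    cudaMemcpyAsync(dB, hostB, bytes, cudaMemcpyHostToDevice, s2);
    addScalar<<<blocks, threads, 0, s1>>>(dA, n, 1.0f);
    addScalar<<<blocks, threads, 0, s2>>>(dB, n, 2.0f);
    cudaMemcpyAsync(hostA, dA, bytes, cudaMemcpyDeviceToHost, s1);
    cudaMemcpyAsync(hostB, dB, bytes, cudaMemcpyDeviceToHost, s2);

    // The application waits on completion here.
    bool ok = cudaStreamSynchronize(s1) == cudaSuccess;
    ok = (cudaStreamSynchronize(s2) == cudaSuccess) && ok;

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dA);
    cudaFree(dB);
    return ok;
}
```

Launching a kernel without naming a stream uses the default stream, which is the implicit queuing mode mentioned above.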
23. GPGPU Memory Model
Basically, a distributed memory system:
• Separated Host memory / Device memory
– Create a buffer/image on the host
– Create a buffer/image on the device
• Opaque handle (OpenCL) or device-side pointer (CUDA)
• Sync operations between memories:
– Read / Write
– Map / Unmap (marshalling)
• Pinned memory for faster sync
• GPU can access Host mapped memory (CUDA)
Diagram: Host (Application, Runtime, Buffer) and GPU (Buffer), with Create and Write operations
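The two CUDA paths listed above can be sketched side by side: separate buffers with explicit read/write copies, and pinned host memory that is mapped into the GPU's address space. The helper names are hypothetical, and the second path assumes a device and driver that support mapped host memory; the helper simply returns false otherwise.

```cuda
#include <cstring>
#include <cuda_runtime.h>

__global__ void doubleAll(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Path 1: separate device buffer, explicit write and read-back copies.
bool doubleViaCopy(float* host, int n)
{
    size_t bytes = n * sizeof(float);
    float* dev = nullptr;  // device-side pointer
    if (cudaMalloc((void**)&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // write
    doubleAll<<<(n + 255) / 256, 256>>>(dev, n);
    bool ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost)
              == cudaSuccess;                               // read
    cudaFree(dev);
    return ok;
}

// Path 2: pinned host memory, mapped so the kernel accesses it directly.
bool doubleViaMappedHost(const float* src, float* dst, int n)
{
    size_t bytes = n * sizeof(float);
    float* pinned = nullptr;
    float* devView = nullptr;  // device-side alias of the pinned buffer
    if (cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocMapped) != cudaSuccess)
        return false;
    memcpy(pinned, src, bytes);
    bool ok = cudaHostGetDevicePointer((void**)&devView, pinned, 0) == cudaSuccess;
    if (ok) {
        doubleAll<<<(n + 255) / 256, 256>>>(devView, n);
        ok = cudaDeviceSynchronize() == cudaSuccess;  // wait before reading
    }
    if (ok) memcpy(dst, pinned, bytes);
    cudaFreeHost(pinned);
    return ok;
}
```

In the first path the kernel works on device memory; in the second, every access goes to host memory over the interconnect, so it tends to suit data that is touched only once.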
24. GPU Memory Model
• Few types, GPU architecture driven
• Affects performance – use the right type
• Watch out for coherency issues
– Not your typical MESI architecture…
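As one example of picking a memory type, the sketch below keeps two small, read-only coefficients in CUDA constant memory while the bulk data stays in global memory. The names `coeff`, `affine` and `runAffine` are hypothetical, and which type is the right one for a given access pattern depends on the architecture.

```cuda
#include <cuda_runtime.h>

// Constant memory: small, read-only inside kernels, written from the host.
__constant__ float coeff[2];

// Global memory holds the input and output arrays; x lives in a register.
__global__ void affine(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = coeff[0] * x + coeff[1];
    }
}

// Computes out[i] = a * in[i] + b on the GPU.
// Returns false on a CUDA runtime error.
bool runAffine(const float* hostIn, float* hostOut, int n, float a, float b)
{
    float h[2] = {a, b};
    if (cudaMemcpyToSymbol(coeff, h, sizeof(h)) != cudaSuccess) return false;

    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc((void**)&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dOut, bytes) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    affine<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    bool ok = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost)
              == cudaSuccess;
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}
```

All threads read the same two coefficients, which is the access pattern constant memory is designed for.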
25. Compilation Model
• Most GPGPU languages use dynamic compilation
– A common practice in the world of GPUs
– Different GPU architectures : no common ISA
– ISA varies even between generations of the same vendor
• Front-End converts High-level language to IR
(Intermediate Representation)
– Assembly of a virtual machine
– LLVM is very common in this world
– In some languages, this happens at application compile time
• Back-End(s) converts from IR to Binary
– Some Vendors use additional intermediate-to-intermediate stages
• Most languages enable storing of IR & IL
– Some do it implicitly (CUDA)
Diagram: OpenCL C / C for CUDA / Fortran / OpenACC → LLVM* IR → PTX IL → GK110 Binary / GF104 Binary
* NVIDIA has “NVVM”, which is LLVM with a set of restrictions
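The front-end half of this flow can be tried directly with NVRTC, a runtime compilation library that ships with CUDA toolkits newer than the ones listed in this deck. The sketch below turns kernel source text into PTX at application run time; `compileToPtx` is a hypothetical helper, default compile options are assumed, and the program has to be linked against the NVRTC library.

```cuda
#include <string>
#include <nvrtc.h>

// Compiles CUDA C source text to PTX at run time.
// Returns an empty string if any NVRTC call fails.
std::string compileToPtx(const char* source)
{
    nvrtcProgram prog;
    // "kernel.cu" is only a label used in compiler messages.
    if (nvrtcCreateProgram(&prog, source, "kernel.cu", 0, nullptr, nullptr)
        != NVRTC_SUCCESS)
        return "";

    std::string ptx;
    if (nvrtcCompileProgram(prog, 0, nullptr) == NVRTC_SUCCESS) {
        size_t size = 0;  // reported size includes the trailing NUL
        if (nvrtcGetPTXSize(prog, &size) == NVRTC_SUCCESS && size > 0) {
            ptx.resize(size);
            if (nvrtcGetPTX(prog, &ptx[0]) == NVRTC_SUCCESS)
                ptx.resize(size - 1);  // drop the trailing NUL
            else
                ptx.clear();
        }
    }
    nvrtcDestroyProgram(&prog);
    return ptx;
}
```

The returned PTX is the intermediate form; the back-end step happens when the driver loads it for a specific GPU, for example through the driver API call `cuModuleLoadData`.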
30. Vendor overview: NVIDIA
GeForce:
• GPU for Gaming
• GTX680
Tesla:
• GPU Accelerators
• K10 / K20
Quadro:
• Professional GFX
• K5000
All running the same cores (Kepler GK104 or GK110)
31. Vendor overview: AMD
Radeon:
• GPU for Gaming
• HD7970
FirePro:
• Professional GFX
• W9000
All running the same cores (GCN)
APU:
• CPU+GPU on same die
• A10
32. Vendor overview: Intel
Xeon Phi:
• Accelerator Card
• 5110P
CPU:
• CPU+GPU on same die
• Haswell Core i7-4xxx
33. Leading Mobile GPU Vendors
Vivante CG4000
• Unified Shaders
• 4 Cores, SIMD4 each
• Supports OpenCL 1.2
• 48 GFLOPS
NVIDIA Tegra 4
• 6 X 4-wide Vertex shaders
• 4 X 4-wide Pixel Shaders
• No GPGPU support
• 74 GFLOPS
ARM Mali T604
• 4 Cores
• Multiple “pipes” per core
• Supports OpenCL 1.1
• 68 GFLOPS
Imagination PowerVR 5xx
• Apple, Samsung, Motorola,
Intel
• Unified Shaders
• Supports OpenCL 1.1 EP (543)
• 38 GFLOPS (Apple’s MP4 ver)
Qualcomm Adreno 320
• Part of Snapdragon S4
• Unified Shader
• Supports OpenCL 1.1 EP
• 50 GFLOPS