The document discusses low-level shader optimization techniques for next-generation consoles and DirectX 11 hardware. It revisits lessons from the previous year's talk on writing efficient shader code and examines how modern GPU hardware has evolved over the past 7-8 years. Key points include separating scalar and vector work, using hardware-mapped functions such as reciprocals and trigonometric functions, and being aware of instruction throughput and costs on modern GCN-based architectures.
2. Introduction
GDC 2013
“Low-level Thinking in High-level Shading Languages”
Covered the basic shader feature set
Float ALU ops
New since last year
Next-gen consoles
GCN-based GPUs
DX11 feature set mainstream
70% on Steam have DX11 GPUs [1]
3. Main lessons from last year
You get what you write!
Don't rely on compiler “optimizing” for you
Compiler can't change operation semantics
Write code in MAD-form
Separate scalar and vector work
Also look inside functions
Even built-in functions!
Add parentheses to parallelize work for VLIW
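To make the MAD-form and scalar/vector bullets above concrete, here is a minimal HLSL sketch (function names are illustrative, not from the deck):

// (x + 1.0f) * 0.5f compiles to an ADD followed by a MUL.
// MAD-form lets the compiler emit a single v_mad_f32 instead,
// and 0.5f is even a free inline constant on GCN:
float remap(float x)
{
    return x * 0.5f + 0.5f;
}

// Separate scalar and vector work: parenthesize so the two scalars
// multiply once, instead of scaling all three components twice.
float3 scale(float3 v, float a, float b)
{
    return v * (a * b); // 1 scalar mul + 3 vector muls
    // (v * a) * b would be 6 vector muls
}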
4. More lessons
Put abs() and negation on input, saturate() on output
rcp(), rsqrt(), sqrt(), exp2(), log2(), sin(), cos() map to HW
Watch out for inverse trigonometry!
Low-level and high-level optimizations are not mutually exclusive!
Do both!
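A small sketch of the modifier rules above, assuming GCN's free input modifiers (abs/neg) and output clamp:

// abs() and negation on the inputs and saturate() on the output map
// to instruction modifiers and cost nothing extra:
float apply_modifiers(float a, float b)
{
    return saturate(-abs(a) * b); // a single v_mul_f32 with modifiers
}
// abs() on the result, e.g. abs(a * b), is not free: there is no
// output-abs modifier, so it takes an extra instruction.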
5. A look at modern hardware
7-8 years from last-gen to next-gen
Lots of things have changed
Old assumptions don't necessarily hold anymore
Guess the instruction count!
TextureCube Cube;
SamplerState Samp;
float4 main(float3 tex_coord : TEXCOORD) : SV_Target
{
return Cube.Sample(Samp, tex_coord);
}
sample o0.xyzw, v0.xyzx, t0.xyzw, s0
7. Hardware evolution
Fixed function moving to ALU
Interpolators
Vertex fetch
Export conversion
Projection/Cubemap math
Gradients
Was ALU, became TEX, back to ALU (as swizzle + sub)
8. Hardware evolution
Most of everything is backed by memory
No constant registers
Textures, sampler-states, buffers
Unlimited resources
“Stateless compute”
13. Shader inputs
Shader gets a few freebies from the scheduler
VS – Vertex Index
PS – Barycentric coordinates, SV_Position
CS – Thread and group IDs
Not the same as earlier hardware
Not the same as APIs pretend
Anything else must be fetched or computed
14. Shader inputs
There is no such thing as a VertexDeclaration
Vertex data manually fetched by VS
Driver patches shader when VDecl changes
float4 main(float4 tc: TC) : SV_Position
{
return tc;
}
s_swappc_b64 s[0:1], s[0:1] // sub-routine call (vertex fetch)
v_mov_b32 v0, 1.0
exp pos0, v4, v5, v6, v7 done
exp param0, v0, v0, v0, v0
float4 main(uint id: SV_VertexID) : SV_Position
{
return asfloat(id);
}
v_mov_b32 v1, 1.0
exp pos0, v0, v0, v0, v0 done
exp param0, v1, v1, v1, v1
15. Shader inputs
Up to 16 user SGPRs
The primary communication path from driver to shader
Shader Resource Descriptors take 4-8 SGPRs
Not a lot of resources fit by default
Typically shader needs to load from a table
17. Shader inputs
Interpolation costs two ALU per float
Packing does nothing on GCN
Use nointerpolation on constant values
A single ALU per float
SV_Position
Comes preloaded, no interpolation required
noperspective
Still two ALU, but can save a component
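As an illustration of these costs, a hypothetical pixel shader input struct (semantics are illustrative):

struct PSInput
{
    float4 pos : SV_Position;             // preloaded, no interpolation ALU
    float3 uvw : TEXCOORD0;               // 2 ALU per float to interpolate
    nointerpolation float4 tint : COLOR0; // 1 ALU per float
};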
19. Shader inputs
SV_IsFrontFace comes as 0x0 or 0xFFFFFFFF
return (face? 0xFFFFFFFF : 0x0) is a NOP
Or declare as uint (despite what documentation says)
Typically used to flip normals for backside lighting
float flip = face? 1.0f : -1.0f;
return normal * flip;
v_cmp_ne_i32 vcc, 0, v2
v_cndmask_b32 v0, -1.0, 1.0, vcc
v_mul_f32 v1, v0, v1
v_mul_f32 v2, v0, v2
v_mul_f32 v0, v0, v3
return face? normal : -normal;
v_cmp_ne_i32 vcc, 0, v2
v_cndmask_b32 v0, -v0, v0, vcc
v_cndmask_b32 v1, -v1, v1, vcc
v_cndmask_b32 v2, -v3, v3, vcc
return asfloat(BitFieldInsert(face, asuint(normal), asuint(-normal)));
v_bfi_b32 v0, v2, v0, -v0
v_bfi_b32 v1, v2, v1, -v1
v_bfi_b32 v2, v2, v3, -v3
20. GCN instructions
Instructions limited to 32 or 64bits
Can only read one scalar reg or one literal constant
Special inline constants
0.5f, 1.0f, 2.0f, 4.0f, -0.5f, -1.0f, -2.0f, -4.0f
-64..64
Special output multiplier values
0.5f, 2.0f, 4.0f
Underused by compilers (fxc also needlessly interferes)
21. GCN instructions
GCN is “scalar” (i.e. not VLIW or vector)
Operates on individual floats/ints
Don't confuse this with GCN's scalar/vector instructions!
Wavefront of 64 “threads”
Those 64 “scalars” make a SIMD vector
… which is what vector instructions work on
Additional scalar unit on the side
Independent execution
Loads constants, does control flow etc.
22. GCN instructions
Full rate
Float add/sub/mul/mad/fma
Integer add/sub/mul24/mad24/logic
Type conversion, floor()/ceil()/round()
½ rate
Double add
23. GCN instructions
¼ rate
Transcendentals (rcp(), rsq(), sqrt(), etc.)
Double mul/fma
Integer 32-bit multiply
For “free” (in some sense)
Scalar operations
24. GCN instructions
Super expensive
Integer divides
Unsigned integer somewhat less horrible
Inverse trigonometry
Caution:
Instruction count not indicative of performance anymore
26. GCN instructions
MAD-form still usually beneficial
When none of the instruction limitations apply
When using inline constants (1.0f, 2.0f, 0.5f etc)
When input is a vector
32. ROPs
Not enough ROPs to utilize all BW!
Always for RGBA8
Often for RGBA16F
Bypass ROPs with compute shader
Write straight to a UAV texture or buffer
Done right, you'll be BW bound
We have seen 60-70% BW utilization improvements
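A minimal sketch of the ROP bypass: a compute shader writes its result straight to a UAV. Resource names, bindings and group size are assumptions, not from the deck:

RWTexture2D<float4> OutTex : register(u0);
Texture2D<float4>   SrcTex : register(t0);

[numthreads(8, 8, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // The UAV write goes straight to memory, skipping the color backend.
    OutTex[dtid.xy] = SrcTex[dtid.xy];
}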
33. Branching
Branching managed by scalar unit
Execution is controlled by a 64bit mask in scalar regs
Does not count towards your vector instruction count
Branchy code tends to increase GPRs
x? a : b
Semantically a branch, typically optimized to CndMask
Can use explicit CndMask()
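For example, a ternary like this is semantically a branch but typically compiles to a per-lane select (v_cmp + v_cndmask) with no actual control flow:

float select_example(float a, float b, float x)
{
    return (x > 0.0f) ? a : b; // v_cmp + v_cndmask, no branch
}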
34. Integer
mul24()
Inputs in 24bit, result full 32bit
Get the upper 16 bits of 48bit result with mul24_hi()
4x speed over 32bit mul
Also has a 24-bit mad
No 32bit counterpart
The addition part is full 32bit
36. Integer division
Not natively supported by HW
Compiler does some obvious optimizations
i / 4 => i >> 2
Also some less obvious optimizations [2]
i / 3 => mul_hi(i, 0xAAAAAAB) >> 1
General case emulated with loads of instructions
~40 cycles for unsigned
~48 cycles for signed
37. Integer division
Stick to unsigned if possible
Helps with divide by non-POT constant too
Implement your own mul24-variant
i / 3 ⇒ mul24(i, 0xAAAB) >> 17
Works with i in [0, 32767*3+2]
Consider converting to float
Can do with 8 cycles including conversions
Special case, doesn't always work
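The i / 3 trick above can be written directly in HLSL; within the stated range the product stays below 2^32, so a 24-bit multiply suffices (a sketch, helper name is illustrative):

// i / 3 for i in [0, 32767*3+2]
uint Div3(uint i)
{
    return (i * 0xAAABu) >> 17;
}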
38. Doubles
Do you actually need doubles?
My professional career's entire list of use of doubles:
Mandelbrot
Quick hacks
Debug code to check if precision is the issue
39. Doubles
Use FMA if possible
Same idea as with MAD/FMA on floats
No double equivalent to float MAD
No direct support for division
Also true for floats, but x * rcp(y) done by compiler
0.5 ULP division possible, but far more expensive
Double a / b very expensive
Explicit x * rcp(y) is cheaper (but still not cheap)
40. Packing
Built-in functions for packing
f32tof16()
f16tof32()
Hardware has bit-field manipulation instructions
Fast unpack of arbitrarily packed bits
int r = s & 0x1F; // 1 cycle
int g = (s >> 5) & 0x3F; // 1 cycle
int b = (s >> 11) & 0x1F; // 1 cycle
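For instance, two halfs are commonly packed into one uint with these intrinsics, since f32tof16() returns the half in the low 16 bits (helper names are illustrative):

uint PackHalf2(float a, float b)
{
    return f32tof16(a) | (f32tof16(b) << 16);
}

float2 UnpackHalf2(uint p)
{
    return float2(f16tof32(p), f16tof32(p >> 16)); // f16tof32() reads the low 16 bits
}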
41. Float
Prefer conditional assignment
sign() - Horribly poorly implemented
step() - Confusing code and suboptimal for typical case
Special hardware features
min3(), max3(), med3()
Useful for faster reductions
General clamp: med3(x, min_val, max_val)
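"Prefer conditional assignment" means spelling the select out yourself; a sketch of typical replacements (note the sign() replacement deliberately maps x == 0 to 1.0 rather than 0):

float2 selects(float x, float edge)
{
    float s = (x >= 0.0f) ? 1.0f : -1.0f; // instead of sign(x)
    float t = (x >= edge) ? 1.0f : 0.0f;  // instead of step(edge, x)
    return float2(s, t);
}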
42. Texturing
SamplerStates are data
Must be fetched by shader
Prefer Load() over Sample()
Reuse sampler states
Old-school texture ↔ sampler-state link suboptimal
43. Texturing
Cubemapping
Adds a bunch of ALU operations
Skybox with cubemap vs. six 2D textures
Sample offsets
Load(tc, offset) bad
Consider using Gather()
Sample(tc, offset) fine
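A minimal sketch of the Load() vs. Sample() point: Load() needs no sampler-state, while Sample() also has to fetch the sampler descriptor (resource names are assumptions):

Texture2D<float4> Tex  : register(t0);
SamplerState      Samp : register(s0);

float4 Fetch(int2 pixel, float2 uv)
{
    float4 a = Tex.Load(int3(pixel, 0)); // no sampler-state involved
    float4 b = Tex.Sample(Samp, uv);     // also loads Samp's descriptor
    return a + b;
}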
44. Registers
GPUs hide latency by keeping multiple wavefronts in flight
GCN can support up to 10 simultaneous wavefronts
Fixed pool of 64KB for VGPRs, 2KB for SGPRs
45. Registers
Keep register lifetimes short
for (each){ WorkA(); }
for (each){ WorkB(); }
is better than:
for (each){ WorkA(); WorkB(); }
Don't just sample and output an alpha just because you have one available
46. Registers
Consider using specialized shaders
#ifdef instead of branching
Über-shaders pay for the worst case
Reduce branch nesting
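A sketch of "#ifdef instead of branching"; the USE_FOG define and the shading logic are illustrative:

// Compiled once per variant (e.g. with and without -D USE_FOG), so each
// specialized shader only pays registers and instructions for what it uses.
float3 Shade(float3 color, float fogFactor)
{
#ifdef USE_FOG
    color = lerp(float3(0.5f, 0.6f, 0.7f), color, fogFactor);
#endif
    return color;
}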
50. How to get to the HW asm
GPUShaderAnalyzer doesn’t support GCN yet
CodeXL to the rescue!
Cmdline only, but gets the job done
Detailed AMD blog-post [4]
Provides ASM, GPR stats etc.
CodeXLAnalyzer -c Hawaii -f main -s HLSL -p ps_5_0 -a stats.csv --isa ISA.txt Shader.hlsl
*************** Build Began for 1 Devices***************
Compile for device: Hawaii - Succeeded. Extracting ISA for device: Hawaii - Succeeded.
Writing Analysis data succeeded!
51. Things shader authors should stop doing
pow(color, 2.2f)
You almost certainly did something wrong
This is NOT sRGB!
normal = Normal.Sample(...) * 2.0f - 1.0f;
Use signed texture format instead
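For reference, the real sRGB-to-linear transfer function is piecewise rather than a plain 2.2 power (standard formula, not from the deck), and a signed normal format removes the unpack entirely:

// Exact sRGB -> linear (per channel); pow(c, 2.2f) is only a rough
// approximation and misses the linear toe of the curve.
float SRGBToLinear(float c)
{
    return (c <= 0.04045f) ? c / 12.92f
                           : pow((c + 0.055f) / 1.055f, 2.4f);
}
// For normals: store them in an *_SNORM format and Sample() returns
// values already in [-1, 1], so the * 2.0f - 1.0f unpack disappears.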
52. Things compilers should stop doing
x * 2 => x + x
Makes absolutely no sense, confuses optimizer
saturate(a * a) => min(a * a, 1.0f)
This is a pessimization
x * 4 + x => x * 5
This is a pessimization
(x << 2) + x => x * 5
Dafuq is wrong with you?
53. Things compilers should stop doing
asfloat(0x7FFFFF) => 0
This is a bug. It's a cast. Even if it was a MOV it should still preserve all bits and not flush denorms.
Spend an awful lot of time trying to unroll loops despite the [loop] tag
I don't even understand this one
Treat vectors as anything other than a collection of floats
54. Things compilers should be doing
x * 5 => (x << 2) + x
Use mul24() when possible
The compiler for HD6xxx detects some cases; the one for GCN does not
Expose more hardware features as intrinsics
More and better semantics in the D3D bytecode
Require type conversions to be explicit
55. Potential extensions
Hardware has many unexplored features
Cross-thread communication
“Programmable” branching
Virtual functions
Goto
56. References
[1] Steam HW stats
[2] Division of integers by constants
[3] Open GPU Documentation
[4] CodeXL for game developers: How to analyze your HLSL for GCN
Refer to the slide deck from last year (“Low-level Thinking in High-level Shading Languages”) for details on these optimizations. This presentation assumes that you already know what things like “MAD-form” mean.
Back in the DX9 era a cubemap lookup was still a single sample instruction, and the same was true for projective textures (tex2Dproj). In DX10 direct support for projective textures was removed, with the expectation that shaders needing projective texturing would simply do the division by w manually. This reflected the fact that no hardware did the division by w in the texture unit anymore, so there was no need to pretend it did. The cost of this fixed-function hardware could no longer be justified when we have so many ALU units that are perfectly capable of doing the math. The situation for cubemaps is the same. Obviously we still need fast sampling, so cubemaps are still first-class citizens in the API and will likely remain that way; however, the coordinate normalization is not something we want to spend an awful lot of transistors on when those transistors could instead be used to add more general ALU cores. Consequently, this is handled by the ALUs these days. The D3D bytecode still treats sampling a cubemap as a simple sample instruction; however, it may surprise you what this expands to in native hardware instructions.
This is what the actual shader looks like in the end. There are several different types of instructions here: vector ALU instructions (VALU), which are your typical math instructions and operate on wide SIMD vectors across all threads/pixels/vertices, and scalar instructions (SALU), which operate on things that are common to all threads. More on these instructions later in this presentation. There is also an image instruction (IMG) that does the actual sampling, and finally an export instruction (EXP) that writes out the final output data, which in the case of a pixel shader is what lands in your framebuffer.
The general trend is that more and more fixed-function units move over to the shader cores. This makes a lot of sense from a transistor-budget point of view and is something that has been going on for a long time. Interpolators became ALU instructions with DX11 hardware. Vertex fetch has been done by the shader for a long time; even the Xbox 360 did this. Export conversion has been handled by the ALUs since GCN, and projection/cubemap math since DX10. Gradients have moved back and forth a bit. Since gradients are needed by the texture units anyway, it made sense in the past to let them handle it; however, now that GCN has a generic lane swizzle, the ALUs have all the tools to do the work themselves, so it is done in the ALUs again.
A side effect of this trend is that things that previously were more or less free can now come at a moderate cost in terms of ALU instructions. For instance, for shaders with a sufficiently high instruction-to-interpolator ratio, interpolators used to be free, although short shaders or shaders with many interpolators could easily become interpolator-bound. On DX11 hardware, where interpolation is an ALU instruction, you basically pay for the interpolation cost. Previously interpolator-bound shaders may now become ALU-bound and likely run substantially faster, whereas if you weren't previously interpolator-bound, you will now see a slowdown due to the additional interpolation cost.
In the past the hardware had a lot of global device state. Things sat in registers that the hardware units read. This is not the case anymore. Most things are backed by memory and read by the hardware whenever it needs them. This is not as scary as it may sound; there are obviously caches in-between keeping the bandwidth cost to a minimum, and values may be lying around in local registers for whatever hardware unit needs them after they have been loaded from memory. Constants obviously moved to memory with the introduction of constant buffers; however, these days things like texture descriptors and sampler-states are also just pieces of data and behave just like constants. On the GCN architecture you could technically put a bunch of texture descriptors and sampler-states in your constant buffer, provided of course that we had the proper API and shader infrastructure to do things that way.
For historical reasons APIs have assigned resources to "slots". These do not exist in hardware anymore. Instead drivers assemble the set of active slots, create a list of those in memory, and just pass the pointer to the shader.
The main implication of this is that there is no longer any particular limitation to the number of resources we can access from a shader. Just provide a long enough array and the shader can grab anything from anywhere as it pleases.
The other implication is that access to resources also comes at a cost: the backing data of sampler-states and texture descriptors has to be grabbed. This cost can mostly be hidden on GCN, since those are SALU instructions that run independently of the VALU, but it is worth knowing that resource descriptors are loaded explicitly by the shader using actual shader instructions.
It is worth noting that compute units in the GCN architecture are completely stateless. Once a compute shader is up and running, it has all its data in local registers and no longer depends on any global device state (other than data in memory). This means that it is perfectly possible for compute units to run different shaders, completely asynchronously, and in parallel with the graphics pipeline. However, the graphics pipeline itself still relies on some global state, and thus it is currently not possible to run two different graphics pipelines in parallel. It would not surprise me if this changed in future hardware.
An interesting exercise when trying to understand the basic parameters to a shader is trying to write a "NULL" shader, i.e. a shader that outputs no actual instructions other than, say, a final export. On AMD's DX10 level hardware something like this would do; we are simply returning an interpolator directly. On DX10 level hardware, interpolators came preloaded into registers, so we only need to output that register and we are done. One downside of this approach is that if you have a lot of interpolators, the shader will by necessity also consume a lot of registers, just to provide the shader with its inputs. This can negatively impact latency hiding and ultimately performance, which is another reason to move away from this approach.
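A hedged reconstruction of what such a pass-through shader could look like (the exact slide code is not in this transcript):
// Return an interpolator directly; on DX10-level hardware this compiles
// down to little more than the final export.
float4 main(float4 color : TEXCOORD0) : SV_Target
{
    return color;
}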
On DX11 level hardware the situation is different. On all AMD DX11 hardware the interpolation is done manually by the shader. The left side shows an HD5000-series chip, the right side the results on the GCN architecture. On GCN we also see the addition of export conversion to pack the result into FP16 format.
The simple pass-through shader from DX10 has now become a bunch of work to be done by the shader cores.
To create a NULL shader on DX11 one could simply return the screen position system value instead. These values still come preloaded in registers. On GCN we can see that they were loaded into registers v2-v5. The registers v0 and v1 usually hold the barycentric coordinates used for interpolation (not used here though).
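For reference, a minimal sketch of such a NULL shader (my reconstruction):
// SV_Position comes preloaded into registers (v2-v5 on GCN), so returning
// it compiles to an export with no math at all.
float4 main(float4 pos : SV_Position) : SV_Target
{
    return pos;
}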
A shader only gets a handful of input parameters to work with; the rest of the data it needs must be loaded or computed. Depending on what shader type we are talking about, the input parameters vary. A vertex shader, for instance, gets a vertex index, which is used to manually fetch the vertex data. A pixel shader gets barycentric coordinates for interpolation, as well as the screen position. A compute shader gets thread and group IDs. From the way shaders look in HLSL it may appear as if some things are free from the shader's point of view, such as vertex data; however, there is an increasingly large gap between what the APIs pretend and what the underlying hardware does.
Vertex fetch has not been a fixed-function hardware unit for quite some time, despite what APIs pretend. Even the good old Xbox360 did vertex fetch manually in the shader. The vertex declaration is an API construct that does not map to any underlying hardware object. Instead the shader is patched according to the current vertex declaration. On Xbox360 this could be very expensive, since the vertex fetch was inserted as a bunch of instructions directly into the top of the vertex shader instruction stream. On the GCN architecture it can be done in a simpler way, with a simple sub-routine call. Depending on what the active vertex declaration is, the driver provides a function pointer in two scalar registers that the shader simply calls as the first thing in the vertex shader. Note that in the left shader, where we do not use any vertex streams, the function call is absent.
Other than the hardware-provided data, the driver can also pass a small set of data into the shader. Typically the driver will pass information such as a pointer to the list of resources. If the resource data is small enough to fit within the maximum of 16 scalar registers, the driver will simply pass the raw resource descriptor and avoid an indirection. Given the limited amount of data that can be passed, this only works in a very limited number of cases, such as using only a single texture.
In the case of a single texture, the shader can use a raw resource descriptor. With two textures, the driver has to create a list of texture descriptors in memory and provide a pointer to it. Then the shader must load the texture descriptors from that table before it can access the textures.
In the past, a common optimization was to pack interpolators together to reduce their number, for instance packing two float2 texture coordinates into a single float4 vector. This used to be beneficial on older hardware, but on GCN it accomplishes little in terms of interpolation cost. It will still cost exactly the same, i.e. two ALU instructions per float. However, it may still help with the number of exports from the vertex shader. What will help with ALU, though, is to use the nointerpolation flag on attributes that are constant. This reduces the cost of bringing the data in, since the data can simply be copied into the shader with a single ALU instruction instead of interpolated.
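An illustrative output struct (the names are mine):
struct VSOutput
{
    float4 position : SV_Position;
    float2 uv : TEXCOORD0;                    // interpolated: two ALU per float
    nointerpolation float4 tint : TEXCOORD1;  // constant: one ALU per float
};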
The position system value comes preloaded into shader registers, so from the shader's point of view it is free. However, much like how you could be interpolator-bound in the past, you can also be bound by the hardware's generation of this system value. So for short shaders, using SV_Position could slow things down, whereas for longer shaders it becomes free and can avoid the cost of interpolating screen-coordinate values.
The noperspective attribute doesn't really affect the interpolation cost; it just means different barycentric coordinates are provided. It still costs 2 ALUs per float. However, in cases where in the past you may have passed a w value and done the division by it in the pixel shader, you can instead use the noperspective flag and save one component as well as skip the division, so it can still save up to 9 ALU instructions for an xyz(w) attribute.
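A hypothetical example of the pattern (names are mine):
struct VSOutput
{
    float4 position : SV_Position;
    // Instead of passing float3 uvw and computing uv = uvw.xy / uvw.z per
    // pixel, divide once in the vertex shader and interpolate linearly:
    noperspective float2 screenUV : TEXCOORD0;
};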
This is the effect of using nointerpolation, cutting the instruction count in half compared to interpolated values.
All API documentation considers SV_IsFrontFace to be a bool attribute. The hardware does not really have bools; all registers are 32 bits, and that is the smallest basic unit a GPU works with. So naturally bool attributes come at a full 32 bits under the hood. Despite the documentation, you can actually declare the SV_IsFrontFace variable as a uint and get the raw bits from the hardware, which are either 0x00000000 or 0xFFFFFFFF. This, in combination with GCN's BFI instruction, can be used to implement very fast normal flipping for back-face lighting, which is the typical use case of this attribute.
The left side shows a typical DX9/last-gen-esque implementation, which requires 5 instructions. The middle one is a more DX10-ish implementation, which reduces the instruction count by one. Finally, the right side shows the GCN-specific implementation. The BFI instruction basically does a bitwise blend between two inputs based on a mask. Given that the input SV_IsFrontFace flag is a bitmask of all zeros or all ones, it becomes a select between the inputs. This allows us to drop the comparison, further reducing the instruction count by one.
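A hedged sketch of that GCN-specific variant (the function name is mine):
// With SV_IsFrontFace declared as uint, the raw mask is 0xFFFFFFFF for
// front faces and 0x00000000 for back faces, so a bitwise blend (mapping
// to one v_bfi_b32 per component) selects between n and -n with no compare.
float3 FlipNormal(float3 n, uint face)   // face = raw SV_IsFrontFace bits
{
    return asfloat((face & asuint(n)) | (~face & asuint(-n)));
}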
The GCN architecture is a fair bit more restricted than earlier AMD hardware. The stated goal has been to reduce complexity to allow for more efficient hardware. The downside is that this in some cases leads to longer shaders than what you would see on earlier hardware. Given that instructions are fixed at either 32 or 64 bits in size, there is obviously not much space left for passing literal constants. Passing just one constant takes 32 bits, so there is no possibility at all to pass two, given that we need some bits to encode the actual instruction as well. Additionally, the hardware can only read a single scalar register per instruction, and not at the same time as a literal constant. This can be problematic for taking full advantage of the MAD instruction, which will be discussed later. There is, however, a limited set of common values, a handful of power-of-two float values and the integers from -64 to 64, that have a special encoding and can be provided for free. These can mitigate the problem in some cases. Similarly, there are a few special output multipliers that may also help. Unfortunately compilers do not take advantage of these as much as they could, but that is at least a software problem and fixable.
The GCN architecture is "scalar". This is in contrast with earlier VLIW and vector-based architectures. The terminology can at first be a bit confusing when you hear about vector and scalar instructions, but these terms view the hardware from different angles. From the shader's point of view each instruction operates on a single float or integer; that is what "scalar" means when discussing the architecture. However, the hardware will still run many instances of the shader in lockstep, basically as a very wide SIMD vector, 64 wide to be precise in the case of GCN, and that is what we refer to as vector instructions. So where the shader programmer sees a scalar float, the hardware sees a 64-float-wide SIMD vector. On the side of the wide vector ALU hangs a small scalar unit. This unit runs what are called scalar instructions. It does things that are common to all threads/shaders that run in parallel, such as loading constants, fetching resource descriptors, and flow control. This unit runs independently of the shader vector math, so in typical shaders, where the math or texture fetches dominate the shading work, the scalar unit more or less operates for free.
Different instructions execute at different speeds. Full-rate instructions include your typical floating-point math, as well as basic integer and logic operations. Note, however, that as far as integer multiplies go, there is a special 24-bit multiply that runs at full rate, whereas a normal 32-bit multiply does not. Current hardware is still optimized primarily for floating-point math, and 32-bit integer multiplies are just not common enough in shaders to motivate spending the transistors to make them full rate. 24-bit multiplies, on the other hand, come almost for free out of the floating-point multiplier that the hardware already has, which is why they can be done at full rate.
In earlier AMD hardware, type conversions sat on the special transcendental unit, so you might have expected them to be quarter rate. However, in GCN they are full rate. Clamping or rounding a float to integer is also full rate.
Double additions come at half the rate, although they also operate on twice as much data, so that is actually very good throughput.
Transcendentals, such as square roots, reciprocals, trigonometry, logarithms, exponentials etc., come at a quarter rate. This is also true for double multiplies and the double fused multiply-add. Note that there is no equivalent to MAD for doubles, i.e. a multiply with proper IEEE-754 rounding followed by an add with proper rounding. Only a fused operation exists for doubles, with a single rounding at the end. This is unlikely to cause many problems, but it is worth noting.
Note also that 32-bit integer multiplies are quarter rate, and thus as slow as multiplying doubles.
Scalar operations are handled by the independent scalar unit and thus do not count towards your vector instruction count. In typical shaders, where math and texturing dominate, scalar operations become more or less free, although it is of course possible to make a shader that is dominated by scalar instructions.
Integer division and inverse trigonometry are things you should avoid like the plague. They are not supported natively by the hardware and must be emulated with loads of instructions. The use of inverse trigonometry in particular is typically a sign in itself that you are doing something wrong. There are not a lot of cases where working with angles is the right thing to do, and almost certainly not computing them in the shader. Most likely the problem can be solved much more efficiently and elegantly using linear algebra. When reformulated things often boil down to a dot-product or two, perhaps a cross-product, and maybe a little basic math on top of that.
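A hypothetical spotlight test illustrates the reformulation (names are mine):
// Bad: recover the angle with emulated inverse trigonometry:
//   if (acos(dot(toLight, axis)) < coneAngle) ...
// Good: precompute cos(coneAngle) once on the CPU and compare dot products:
bool InsideCone(float3 toLight, float3 axis, float cosConeAngle)
{
    return dot(toLight, axis) > cosConeAngle;   // assumes unit-length vectors
}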
One thing to note is that unlike on previous AMD hardware, where you could get a pretty good idea of final performance just by looking at the total number of VLIW slots (assuming all ALU), on GCN you cannot just look at the final instruction count and infer anything. You must look at the individual instructions, since their execution speeds vary.
In last year's talk I put a fair amount of emphasis on the fact that all hardware since the dawn of time has been built to execute multiply-adds really fast. It has always been a single instruction on all hardware I have ever worked with, whereas an add followed by a multiply has always been two instructions. The natural implication is that you should write shaders that fit this MAD-form to the greatest extent possible. The GCN architecture complicates the issue somewhat due to its more restricted instruction set. As a general rule of thumb, writing in MAD-form is still the way to go; however, there are cases on GCN where it may not be beneficial, or may even add a couple of scalar instructions. The shaders here illustrate a couple of failure cases. The left side represents a "before" case that consists of an add followed by a multiply, which generates those two instructions as expected. The middle case uses two literal constants. These cannot both be baked into a 64-bit instruction, so the compiler has to expand it to two instructions. The result differs somewhat between platforms here: one compiler generated a single scalar instruction and another used two. Neither is optimal, and it would have been possible to skip scalar instructions altogether, similar to the instruction sequence on the left, but this is what compilers at the time of this writing generated. Finally, the right side represents the case where both constants come from a constant buffer. In this case both values lie in scalar registers, and given the read limitation of a single scalar register per instruction, one value must first be moved to a vector register before the MAD, which eliminates the benefit.
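Since the slide code is not in this transcript, here is an illustrative sketch of the three cases (the constants are mine):
cbuffer Consts { float scale; float bias; };

float Cases(float x)
{
    float a = (x + 3.0f) * 0.4f;   // ADD then MUL: two VALU instructions
    float b = x * 0.4f + 1.2f;     // MAD-form with two literals: they cannot
                                   // share one 64-bit instruction, so the
                                   // compiler emits extra instructions
    float c = x * scale + bias;    // both operands in scalar registers: one
                                   // must be moved to a VGPR before the MAD
    return a + b + c;
}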
While there are cases where MAD-form does not result in a speedup, it is worth noting that in none of these cases does the vector instruction count increase; we are just not improving things. The added scalar instructions are something that future compilers will hopefully address. The important thing to remember is that while things got a little bit worse in GCN, there remain plenty of cases where writing in MAD-form is still beneficial.
This illustrates two cases where writing in MAD-form is still beneficial on GCN. The first case only needs a single immediate constant, so it works. Note that the left and right panels are not equivalent code; they just illustrate the difference between the two forms.
The other example shows how using one of the special inline constants also fixes it. Here the two examples are equivalent, and the MAD-form runs faster. It is worth noting, though, that one compiler was able to optimize the ADD-MUL case using the output multiplier instead, resulting in a single vector instruction even for that case.
Finally, when using a vector, the extra overhead from GCN's limitations can be amortized and still result in a substantial overall improvement. Instead of cutting this float4 add-multiply down from 8 to 4 operations as on previous hardware, we only go from 8 to 5, but that is still a good improvement. While we would not see a benefit for a single float in MAD-form here, even a two-component vector would begin to see an improvement.
In the past people often tried to write vectorized code in order to better take advantage of the underlying vector architectures. People bunched together separate things into vectors and did math on vectors instead of writing things in a more intuitive form. This made sense in 2005 perhaps, when GPUs were vector based and shader optimizers were not particularly sophisticated. However, since DX10 GPUs arrived everything has been scalar or VLIW, making explicit vectorization dubious at best and counter-productive at worst. If explicit vectorization introduces extra math operations, it will only slow things down. Refer to last year’s talk and the topic of separating scalar and vector work to see how that could easily happen.
Fortunately I do not see much explicit vectorization these days; the remaining exception may be trying to use dot-products in the hope of taking advantage of a built-in instruction. This may work on vector-based architectures, but scalar architectures do not even have a dot-product instruction. Instead it is implemented as a series of MULs and MADs. Even on VLIW architectures that do have a dot-product instruction, it is unlikely to speed things up, since you will occupy at least as many lanes and possibly more.
The problem in the example here is that the evaluation order will require that the dot-product is evaluated first, followed by the subtraction. This unnecessarily prevents the compiler from merging the subtraction and multiplication into a single MAD. This is not the case for the expression on the left.
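A hypothetical reconstruction of the two forms (the exact slide code is not in this transcript):
float Bad(float3 n, float3 l, float bias, float scale)
{
    return (dot(n, l) - bias) * scale;   // dot, then SUB, then MUL: the
}                                        // last two ops cannot merge

float Good(float3 n, float3 l, float biasTimesScale, float scale)
{
    // With bias * scale precomputed, the trailing ops become a single MAD:
    return dot(n, l) * scale - biasTimesScale;
}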
As hardware has gotten increasingly more powerful over the years, some parts of it have lagged behind. The number of ROPs (i.e. how many pixels we can output per clock) remains very low. While this reflects typical use cases where the shader is reasonably long, it may limit the performance of short shaders. Unless the output format is wide, we are not even theoretically capable of using the full bandwidth available. For the HD7970 we need a 128-bit format to become bandwidth-bound. For the PS4, 64 bits would suffice.
On the XB1, if we are rendering to ESRAM, 64 bits just about hits the crossover point between being ROP-bound and bandwidth-bound. But even if we render to the relatively slow DDR3 memory, we will still be ROP-bound if the render-target is a typical 32-bit texture.
The solution is to use a compute shader. Writing through a UAV bypasses the ROPs and goes straight to memory. This solution obviously does not apply to all sorts of rendering; for one, we are bypassing the entire graphics pipeline, on which we still depend for most normal rendering. However, in the cases where it applies, it can certainly result in a substantial performance increase. It is useful in cases such as initializing textures to something other than a constant color, simple post-effects, etc.
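A minimal sketch of the idea (resource names and the gradient pattern are mine):
RWTexture2D<float4> gOutput : register(u0);

[numthreads(8, 8, 1)]
void InitTexture(uint3 id : SV_DispatchThreadID)
{
    // Initialize to something other than a constant color, e.g. a gradient;
    // the write goes through the UAV straight to memory, bypassing the ROPs.
    gOutput[id.xy] = float4((id.x & 255) / 255.0f, (id.y & 255) / 255.0f, 0.0f, 1.0f);
}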
Branching on GCN is generally speaking very fast, provided of course that we have good coherency and so on. The overhead from the branching itself is very low. Other than doing the comparisons that we are branching on, it is more or less free given that the actual branching is handled by the scalar unit. The thing to look out for, however, is the register pressure. Very branchy code tends to increase the number of registers the shader requires. This can impact performance much more significantly than the branching itself does.
Most of the time branches are fine. For very tiny branches, a conditional assignment may be preferable. The easiest way to get one is simply to use the ?-operator. Semantically it is a branch, though, rather than a conditional assignment, although simple expressions written that way tend to compile down to a conditional assignment. However, we have observed cases on some platforms where this didn't happen and a full branch was emitted. In such cases it may help to use an explicit CndMask() that maps straight to the underlying conditional mask instruction.
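For instance (a sketch; the explicit intrinsic's name varies per platform):
float SelectSmall(float x, float a, float b)
{
    // Normally compiles to a single v_cndmask_b32. If a full branch is
    // emitted instead, an explicit intrinsic such as CndMask(x >= 0.0f, a, b)
    // guarantees the conditional mask instruction.
    return (x >= 0.0f) ? a : b;
}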
As discussed earlier, 24bit multiplies are full rate whereas 32bit multiplies are quarter rate, so the 24bit is preferable in the multitude of cases where it is sufficient. It is worth noting that while the inputs are 24bit, the result is a full 32bit value. In fact, the hardware can also give you the top 16 bits from the 48bit result, should you have a use for it. There is also a MAD version of the 24bit multiply (unlike the 32bit one) and it is worth pointing out that for the addition part the input is the full 32 bits, not just 24.
As mentioned, 24bit multiply is full rate whereas 32bit is quarter rate. The difference is further expanded when you also need to add as well. The addition can be merged into a 24bit MAD, whereas no such instruction exists for 32bit multiplies. This makes 32bit multiply-add 5 times slower than 24bit.
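For example, addressing math (the mad24 intrinsic name is an assumption; availability varies per platform):
uint Address32(uint y, uint pitch, uint x)
{
    return y * pitch + x;   // 32-bit multiply (4 cycles) + add (1) = 5 cycles
}
// With 24-bit inputs, where the platform exposes the intrinsic:
// uint Address24(uint y, uint pitch, uint x) { return mad24(y, pitch, x); }  // 1 cycle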
The hardware does not support any form of integer division natively. For some obvious cases, like dividing by a power-of-two constant, the compiler does the obvious conversion to a bit shift. For the more generic division by a constant, there are known methods for generating a short sequence of a multiply by a magic constant plus some shifts and adds/subs. The exact sequence depends on your input, so the cost is variable. The expensive part is the multiplication by the magic number, since it is done in full 32 bits, followed by 1-4 basic ALU ops, for a total cost of 5-8 cycles. For signed integer division it is more expensive.
For the general case where the denominator is unknown, the division is emulated with loads of instructions. This should obviously be avoided if at all possible.
The easiest optimization is to simply switch to unsigned if you do not need to deal with negative numbers. This has a large impact even on division by constant.
If you are dividing by a constant, one thing you can do, if your inputs are limited enough in range, is to implement your own mul24-based version. For instance, a division by 3 can be done in 2 cycles instead of the 5 required for a full 32-bit divide-by-constant. This works as long as the input is no larger than 98303.
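A hedged reconstruction (the magic constant is my derivation, not necessarily the slide's): 0xAAAB / 2^17 is just above 1/3, and the low 32 bits of the 24-bit multiply stay exact for x <= 98303. mul24 is assumed exposed as an intrinsic.
uint DivBy3(uint x)
{
    return mul24(x, 0xAAAB) >> 17;   // 2 cycles vs 5 for the 32-bit sequence
}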
For some specific cases it may be possible to do the division in floating point instead. Including the conversions back and forth, it will be 8 cycles. You may have to take special care to handle precision and rounding here, so be careful and do proper testing if you go down this path.
GCN supports doubles. The first question, if you are considering using them, is whether you actually need them. There is a good chance that you do not. In my entire career, doubles have been the answer (aside from quick hacks and tests) pretty much only in Mandelbrot rendering, and in that case they are mostly just the better choice compared to floats rather than the final solution; ideally I would like even more precision. However, there are certainly scientific computations and other applications where the use of doubles is a sensible choice, and in that case there are a few things you can do.
Much like with floats, doubles benefit from MAD-form. Note that there is no IEEE-754 compatible MAD instruction for doubles that rounds between the multiply and add. There is only a fused multiply-add, with a single rounding in the end. I would imagine that this would rarely cause any problems for any typical applications, and FMA is certainly the preferable choice if you are not constrained by strict IEEE-754 compatibility.
There is no direct support for double division in GCN. Actually not for floats either. For floats the compiler simply accepts 1.0 ULP, thus allowing the division to be implemented as an rcp followed by a multiply. However, the hardware has some supporting functions for the case where you really, really need that proper 0.5 ULP result, allowing you to potentially flip the last bit of your float at the cost of lots of cycles.
Note that for doubles the compiler implements a strict 0.5 ULP by default. This means you will go down the expensive path by default, unlike for floats. If this is not desired and you can live with somewhat worse precision in your divisions, you can implement the division as an rcp and multiply manually. This will reduce the cost somewhat, although it is still not cheap by any means.
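A hedged sketch of both points (fma() is the standard HLSL double-precision fused multiply-add; rcp() on doubles is assumed to be supported by the target compiler):
double MadDouble(double a, double b, double c)
{
    return fma(a, b, c);   // fused, a single rounding at the end
}

double ApproxDivDouble(double a, double b)
{
    return a * rcp(b);     // relaxed precision instead of the 0.5 ULP path
}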
With every generation, ALU power keeps increasing at a far greater rate than the available bandwidth. As such, it is of increasing importance to pack your data as tightly as possible. There are a few handy features for fast pack and unpack. First, the standard DX11 functions for converting floats to half and vice versa. The other is the bit-field manipulation instructions on GCN. These are exposed as intrinsics on some platforms and can be used explicitly there. In other cases the compiler will typically spot extraction and packing of bit-fields and generate these instructions for you. The case illustrated here, unpacking a good old R5G6B5 color, will compile into only three instructions, rather than the five you might expect with standard bitwise operations.
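An illustrative version of both (function names are mine):
uint PackHalf2(float2 v)
{
    return f32tof16(v.x) | (f32tof16(v.y) << 16);   // DX11 pack intrinsics
}

float3 UnpackR5G6B5(uint c)
{
    // A GCN-aware compiler turns these shifts-and-masks into bit-field
    // extract instructions: three instructions instead of five.
    float3 rgb = float3((c >> 11) & 31, (c >> 5) & 63, c & 31);
    return rgb * float3(1.0 / 31.0, 1.0 / 63.0, 1.0 / 31.0);
}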
As I mentioned in my talk last year, sign() is poorly implemented under the hood and does too much work for most practical cases. Most use cases do not actually care about handling 0, or would rather flip it in either direction instead of returning 0, but sign() is a generic built-in function and does generic things. Implementing it as a conditional assignment is much faster, like (x >= 0)? 1 : -1. Similarly, the step() function can be implemented as (x >= a)? 1 : 0. The advantage of doing this explicitly, other than improved readability (IMHO), is that the typical use case can also be further optimized. Normally sign() and step() are multiplied by something afterwards rather than used as-is. If you use the functions, not only are you paying for the suboptimal handling under the hood, you are also paying for that multiplication. When written as a conditional assignment, however, you can simply merge the multiplication into the values you select from and get it for free.
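For example (a sketch of the folded forms):
float FlipBySign(float x, float y)
{
    return (x >= 0.0f) ? y : -y;   // instead of y * sign(x)
}

float StepScale(float x, float a, float y)
{
    return (x >= a) ? y : 0.0f;    // instead of y * step(a, x)
}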
GCN has 3-component min, max and med instructions. The obvious usage case is to do faster reductions, e.g. finding the max value in a 3x3 neighborhood would take 4 cycles instead of 8. A somewhat less obvious use case is to use med3() as a general clamp in one cycle instead of two.
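Written portably (a GCN-aware compiler can map these to single v_max3_f32 / v_med3_f32 instructions):
float MaxOf3(float a, float b, float c)
{
    return max(a, max(b, c));    // one cycle via max3 on GCN
}

float Clamp(float x, float lo, float hi)
{
    return min(max(x, lo), hi);  // one cycle via med3(x, lo, hi)
}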
Sampler-states are data that must be fetched. Thus it is preferable to use Load() instead of Sample() with a point filter, provided that it works without introducing conversions between floats and ints (which may cost more, since conversions are vector instructions).
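For instance:
Texture2D<float4> tex : register(t0);

float4 FetchTexel(int2 p)
{
    return tex.Load(int3(p, 0));   // integer coords, mip 0; no sampler-state
}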
If you are coming from a DX9 or last-gen console engine, chances are that sampler-states and textures are linked, i.e. texture 0 uses sampler-state 0, texture 1 uses sampler-state 1, etc. DX10 already stepped away from this model, but it is easy to get stuck with the same conceptual model as long as your engine has last-gen console support or uses shaders with DX9-style syntax. Typically a shader has many more textures than sampler-states, so the result is that the shader has to load many more sampler-states than necessary and will simply keep many copies of the same data in its registers, wasting scalar register space and loads.
As mentioned before, cubemap sampling comes with a bit of overhead in terms of ALU instructions. This typically cannot be avoided, since where you use a cubemap you usually really need one. The exception is things like a skybox. While storing it as a cubemap seems reasonable, it can be faster to simply draw it using a set of 2D textures or an atlas.
Load() and Sample() both accept integer offsets for sampling neighboring texels. In previous generations these were baked into the instruction and came for free. On GCN this is no longer true for Load(). As a result, loading with an offset inserts ALU instructions to add the offsets for each Load(). In many cases like this, it makes sense to use Gather() instead. For Sample(), though, providing an offset is still OK and still free.
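An illustrative neighborhood fetch (names are mine):
Texture2D<float> depthTex : register(t0);
SamplerState pointSamp : register(s0);

float4 FetchQuad(float2 uv, int2 p)
{
    // Each offset Load() costs extra VALU adds on GCN:
    //   depthTex.Load(int3(p, 0), int2(1, 0));
    // One Gather() fetches the whole 2x2 footprint in a single IMG instruction:
    return depthTex.Gather(pointSamp, uv);
}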
GPUs hide latency by keeping loads of threads in flight. GCN can support up to 10 simultaneous wavefronts, but only if you actually have enough resources available to hold all that state. If your shader needs more than 24 VGPRs or more than 48 SGPRs, you will get fewer simultaneous wavefronts and therefore a worse ability to hide latency, potentially leading to worse performance, particularly if you are bound by memory accesses such as texture fetches. These tables are a quick reference for how many wavefronts you can possibly get given the shader's register counts.
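The tables themselves did not survive into this transcript; as a hedged reconstruction from GCN's per-SIMD limits (256 VGPRs and 512 SGPRs per SIMD, so wavefronts = min(10, 256/VGPRs, 512/SGPRs) after allocation granularity), the VGPR side reads: up to 24 VGPRs allow 10 wavefronts, 28 allow 9, 32 allow 8, 36 allow 7, 40 allow 6, 48 allow 5, 64 allow 4, 84 allow 3, 128 allow 2, and anything beyond runs a single wavefront; the SGPR side steps down similarly from 48 in granules of 8.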
For a more detailed summary of this topic, refer to:
http://developer.amd.com/community/blog/2014/05/16/codexl-game-developers-analyze-hlsl-gcn/
Optimizing for registers is unfortunately one of the hardest things to give solid advice on. This is mostly a black box completely under the control of the compiler, and not something that is particularly deterministic. ALU expressions tend to give quite predictable results, but the number of registers can go up and down in non-intuitive ways depending on the code you write. However, there are a few pieces of advice one can give. The overall idea is to keep the shader "thin" in terms of the data it keeps around and the stuff that it does. You will get as many registers as your worst-case path through the shader needs. So if you need to do two sets of work, it may be better to do one thing first and then the other, rather than both together, risking a fat path where data for both types of work must be resident simultaneously. Keep the lifetime of data as short as possible. Fetch data near its use. Keep data packed until you need it. And do not keep data around if you have no need for it. The typical case would be holding on to an alpha value you got from sampling a texture, simply to return it in the alpha channel at the end of the shader, even though you do not really need an alpha. This occupies a register for the entire span from sampling the texture until the end of the shader.
Über-shaders are great for authoring; however, not so much for execution. Branching unfortunately tends to increase register pressure. So while branches are cheap on GCN, and branching for feature selection would be a viable approach from that point of view, it can have a very negative impact on register count, and thus performance. You will always pay for your worst-case path, so even if you end up skipping the vast majority of shader work you may still get a significant slowdown from reduced latency hiding. So it is advisable to create a set of specialized shaders, perhaps with #ifdef in the Über-shader.
Register indexing, a feature in HLSL since DX10, has never been particularly fast. It is still a relatively rarely used feature, but will likely occur more frequently as more advanced algorithms start utilizing it. The default implementation on GCN is shown here. It is a loop that will essentially fetch one value at a time until all threads have had their value fetched. The number of loops depends on the coherency. For very coherent accesses the cost may be relatively low, but once you loop twice or more on average, chances are that a manually unrolled selection will be faster.
Manual selection of array elements can expand into some horrible-looking code, although compared to looping over some unwieldy fetching code this tends to turn out faster, unless your accesses are extremely coherent within wavefronts. If your array is sufficiently short, the driver may actually do this unrolling for you; however, there is still an advantage to doing it manually anyway. The driver adds code for properly handling the out-of-range condition, which likely isn't something you require. At the time of this writing, the driver also adds a bit-shift instruction that serves no particular function and only means it compares against numbers 4x larger than in the code sequence presented here.
Whether to use this code or the default array indexing naturally should be decided based on profiling and seeing what runs faster in your particular context.
Another approach is to do a binary search. This substantially reduces the comparisons required to find the right element. In this case we save 8 ALU in total for an array of 16 elements. On the other hand, this approach also needs more GPRs, which may reduce performance. At the expense of readability, this code can be flattened somewhat like this to reduce the GPR usage:
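// First level: select on the two high index bits; cN == s[(index & 0xC) | N]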
bool lo0 = (index < 8);
bool lo1 = ((index & 0x7) < 4);
float c0 = lo1? (lo0? s[0] : s[8]) : (lo0? s[4] : s[12]);
float c1 = lo1? (lo0? s[1] : s[9]) : (lo0? s[5] : s[13]);
float c2 = lo1? (lo0? s[2] : s[10]) : (lo0? s[6] : s[14]);
float c3 = lo1? (lo0? s[3] : s[11]) : (lo0? s[7] : s[15]);
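// Second level: select on the two low index bits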
lo0 = ((index & 0x3) < 2);
lo1 = ((index & 0x1) < 1);
float b = lo1? (lo0? c0 : c2) : (lo0? c1 : c3);
Consoles provide tools for getting your hands on the HW assembly; look in the respective documentation for those.
For PC we are all hoping to get a version of GPUShaderAnalyzer that supports GCN soon; in the meantime we have to settle for a command-line interface for this purpose, provided by CodeXL. It may be a bit clumsy, but it gets the job done. With a bat file it's reasonably effective for prototyping and comparing different approaches to see which one generates the most favorable code.
The rant part of this talk!
Compilers are sometimes too smart for their own good. In the case of fxc for Windows, the biggest issue is that it has no knowledge of the hardware; it only works with an abstract pseudo-instruction set. From that point of view, merging a shift and an add into an integer multiply by 5 makes sense, except that on all hardware a 32-bit multiply is much slower than shifts and adds.
Obviously, what the compiler should be doing is the reverse: take a multiply by 5 and convert it to a shift and add. This improvement only works for a limited set of integers, but that set includes many common small numbers, as well as some large ones.
For PC the D3D bytecode needs to provide more semantics to allow the driver optimizer to do a better job. Information about input ranges and so on that fxc knows about is lost in the process.
As a general complaint on HLSL, the language is way too permissive in accepting implicit type conversions. Make a function that takes a float4x4 matrix, call it with a bool, and it just compiles, and obviously does nothing like what you intended. Can we please at least require a cast?
There are lots of things on the hardware that would be interesting to explore in the future. Some of these things are just kind of cool in a sense, but without a clear use case. Such as passing data between lanes in a shader. What would you use it for, other than gradients? I don’t know. The data for branching just lies in a scalar register. What can we use it for? Debug visualization of branch coherency perhaps? I am sure we will discover many interesting opportunities in the future.
The GCN architecture is surprisingly well documented. If you enjoyed this talk, I would recommend you take a look at the documentation. A lot of the content was inspired by it.