With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation is targeting seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
SIGGRAPH 2016 - The Devil is in the Details: idTech 666, by Tiago Sousa
A behind-the-scenes look at the latest renderer technology powering the critically acclaimed DOOM. The lecture will cover how the technology was designed to balance visual quality against performance. Numerous topics will be covered, among them details about the lighting solution, techniques for decoupling cost frequency, and GCN-specific approaches.
A technical deep dive into the DX11 rendering in Battlefield 3, the first title to use the new Frostbite 2 Engine. Topics covered include DX11 optimization techniques, efficient deferred shading, high-quality rendering and resource streaming for creating large and highly-detailed dynamic environments on modern PCs.
Secrets of CryENGINE 3 Graphics Technology, by Tiago Sousa
In this talk, the authors give an overview of a different approach to deferred lighting used in CryENGINE 3, along with an in-depth description of the many techniques used. Original file and videos at http://crytek.com/cryengine/presentations
The goal of this session is to demonstrate techniques that improve GPU scalability when rendering complex scenes. This is achieved through a modular design that separates the scene graph representation from the rendering backend. We will explain how the modules in this pipeline are designed and give insights into implementation details, which leverage the GPU's compute capabilities for scene graph processing. Our modules cover topics such as shader generation for improved parameter management, synchronizing updates between scene graph and rendering backend, as well as efficient data structures inside the renderer.
Video here: http://on-demand.gputechconf.com/gtc/2013/video/S3032-Advanced-Scenegraph-Rendering-Pipeline.mp4
Game engines have long been at the forefront of taking advantage of the ever-increasing parallel compute power of both CPUs and GPUs. This talk covers how parallel compute is used in practice on multiple platforms today in the Frostbite game engine, and what we think the parallel programming models, hardware and software in the industry should look like in the next 5 years to help us make the best games possible.
A description of the next-gen rendering technique called the Triangle Visibility Buffer. It handles up to 10x to 20x more geometry than deferred rendering, at much higher resolution. It also generally aligns better with memory access patterns on modern GPUs than deferred lighting approaches such as Clustered Deferred Lighting.
Bindless Deferred Decals in The Surge 2, by Philip Hammer
These are the slides for my talk at Digital Dragons 2019 in Krakow.
Update: The recordings are now online on YouTube:
https://www.youtube.com/watch?v=e2wPMqWETj8
Bill explains some of the ways that the Vertex Shader can be used to improve performance by taking a fast path through the Vertex Shader rather than generating vertices with other parts of the pipeline in this AMD technology presentation from the 2014 Game Developers Conference in San Francisco March 17-21. Check out more technical presentations at http://developer.amd.com/resources/documentation-articles/conference-presentations/
Ever wondered how to use modern OpenGL in a way that radically reduces driver overhead? Then this talk is for you.
John McDonald and Cass Everitt gave this talk at Steam Dev Days in Seattle on Jan 16, 2014.
Course presentation at SIGGRAPH 2014 by Charles de Rousiers and Sébastien Lagarde at Electronic Arts about transitioning the Frostbite game engine to physically-based rendering.
Make sure to check out the 118 page course notes on: http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/
During the last few months, we have revisited the concept of image quality in Frostbite. The core of our approach was to be as close as possible to a cinematic look. We used the concept of reference to evaluate the accuracy of produced images. Physically based rendering (PBR) was the natural way to achieve this. This talk covers all the different steps needed to switch a production engine to PBR, including the small details often bypassed in the literature.
The state of the art of real-time PBR techniques allowed us to achieve good overall results but not without production issues. We present some techniques for improving convolution time for image based reflection, proper ambient occlusion handling, and coherent lighting units which are mandatory for level editing.
Moreover, we have managed to reduce the quality gap, highlighted by our systematic reference comparison, in particular related to rough material handling, glossy screen space reflection, and area lighting.
The technical part of PBR is crucial for achieving good results, but represents only the tip of the iceberg. Frostbite has become the de facto high-end game engine within Electronic Arts and is now used by a large number of game teams. Moving all these game teams from “old-fashioned” lighting to PBR has required a lot of education, which has been done in parallel with the technical development. We have provided editing and validation tools to help the transition of art production. In addition, we have built a flexible material parametrisation framework to adapt to the various authoring tools and game teams’ requirements.
Past, Present and Future Challenges of Global Illumination in Games, by Colin Barré-Brisebois
Global illumination (GI) has been an ongoing quest in games. The perpetual tug-of-war between visual quality and performance often forces developers to take the latest and greatest from academia and tailor it to push the boundaries of what has been realized in a game product. Many elements need to align for success, including image quality, performance, scalability, interactivity, ease of use, as well as game-specific and production challenges.
First we will paint a picture of the current state of global illumination in games, addressing how the state of the union compares to the latest and greatest research. We will then explore various GI challenges that game teams face from the art, engineering, pipelines and production perspective. The games industry lacks an ideal solution, so the goal here is to raise awareness by being transparent about the real problems in the field. Finally, we will talk about the future. This will be a call to arms, with the objective of uniting game developers and researchers on the same quest to evolve global illumination in games from being mostly static, or sometimes perceptually real-time, to fully real-time.
Talk by Yuriy O’Donnell at GDC 2017.
This talk describes how Frostbite handles rendering architecture challenges that come with having to support a wide variety of games on a single engine. Yuriy describes their new rendering abstraction design, which is based on a graph of all render passes and resources. This approach allows implementation of rendering features in a decoupled and modular way, while still maintaining efficiency.
A graph of all rendering operations for the entire frame is a useful abstraction. The industry can move away from “immediate mode” DX11 style APIs to a higher level system that allows simpler code and efficient GPU utilization. Attendees will learn how it worked out for Frostbite.
Graphics Gems from CryENGINE 3 (Siggraph 2013), by Tiago Sousa
This lecture covers rendering topics related to Crytek’s latest engine iteration, the technology which powers titles such as Ryse, Warface, and Crysis 3. Among covered topics, Sousa presented SMAA 1TX: an update featuring a robust and simple temporal antialiasing component; performant and physically-plausible camera-related post-processing techniques such as motion blur and depth of field were also covered.
OpenGL 4.4 provides new features for accelerating scenes with many objects, which are typically found in professional visualization markets. This talk will provide details on the usage of the features and their effect on real-life models. Furthermore we will showcase how more work for rendering a scene can be off-loaded to the GPU, such as efficient occlusion culling or matrix calculations.
Video presentation here: http://on-demand.gputechconf.com/gtc/2014/video/S4379-opengl-44-scene-rendering-techniques.mp4
Talk by Graham Wihlidal (Frostbite Labs) at GDC 2017.
Checkerboard rendering is a relatively new technique, popularized recently by the introduction of the PlayStation 4 Pro. Many modern game engines are adding support for it right now, and in this talk, Graham will present an in-depth look at the new implementation in Frostbite, which is used in shipping titles like 'Battlefield 1' and 'Mass Effect Andromeda'. Despite being conceptually simple, checkerboard rendering requires deep integration into the post-processing chain, in particular temporal anti-aliasing and dynamic resolution scaling, and poses various challenges to existing effects. This presentation will cover the basics of checkerboard rendering, explain the impact on a game engine that powers a wide range of titles, and provide a detailed look at how the current implementation in Frostbite works, including topics like object id, alpha unrolling, gradient adjust, and a highly efficient depth resolve.
Efficient occlusion culling in dynamic scenes is a very important topic to the game and real-time graphics community in order to accelerate rendering. We present a novel algorithm inspired by recent advances in depth culling for graphics hardware, but adapted and optimized for SIMD-capable CPUs. Our algorithm has very low memory overhead and is three times faster than previous work, while culling 98% of all triangles by a full resolution depth buffer approach. It supports interleaving occluder rasterization and occlusion queries without penalty, making it easy
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead, by Tristan Lorach
This presentation introduces a new NVIDIA extension called Command-list.
The purpose of this presentation is to explain the basic concepts of how to use it and show what the benefits are.
The sample I used for the talk is here: https://github.com/nvpro-samples/gl_commandlist_bk3d_models
To try it, use the pre-release driver 347.09:
http://www.nvidia.com/download/driverResults.aspx/80913/en-us
In this deck from ATPESC 2019, Jack Dongarra from UT Knoxville presents: Adaptive Linear Solvers and Eigensolvers.
"Success in large-scale scientific computations often depends on algorithm design. Even the fastest machine may prove to be inadequate if insufficient attention is paid to the way in which the computation is organized. We have used several problems from computational physics to illustrate the importance of good algorithms, and we offer some very general principles for designing algorithms. Two subthemes are, first, the strong connection between the algorithm and the architecture of the target machine; and second, the importance of non-numerical methods in scientific computations."
Watch the video: https://wp.me/p3RLHQ-lq3
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... by Akihiro Hayashi
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP’s high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran languages, without exposing too many details of GPU architectures. However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which could result in lower performance than fully hand-tuned code with low-level programming models.
To study potential performance improvements by compiling and optimizing high-level GPU programs, in this paper, we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparable performance analysis among hand-written CUDA and automatically-generated GPU programs by the IBM XL and clang/LLVM compilers.
“Show Me the Garbage!”, Garbage Collection a Friend or a Foe, by Haim Yadid
“Just leave the garbage outside and we will take care of it for you”. This is the panacea promised by garbage collection mechanisms built into most software stacks available today. So, we don’t need to think about it anymore, right? Wrong! When misused, garbage collectors can fail miserably. When this happens they slow down your application and lead to unacceptable pauses. In this talk we will go over different garbage collector approaches and understand under which conditions they function well.
Rendering Technologies from Crysis 3 (GDC 2013), by Tiago Sousa
This talk covers changes in CryENGINE 3 technology during 2012, with DX11 related topics such as moving to deferred rendering while maintaining backward compatibility on a multiplatform engine, massive vegetation rendering, MSAA support and how to deal with its common visual artifacts, among other topics.
2. Acronyms
Optimizations and algorithms presented are AMD GCN-centric [1][8]
VGT: Vertex Grouper Tessellator
PA: Primitive Assembly
CP: Command Processor
IA: Input Assembly
SE: Shader Engine
CU: Compute Unit
LDS: Local Data Share
HTILE: Hi-Z Depth Compression
GCN: Graphics Core Next
SGPR: Scalar General-Purpose Register
VGPR: Vector General-Purpose Register
ALU: Arithmetic Logic Unit
SPI: Shader Processor Interpolator
7. 12 CU * 64 ALU * 2 FLOPs = 1,536 ALU ops / cy
18 CU * 64 ALU * 2 FLOPs = 2,304 ALU ops / cy
64 CU * 64 ALU * 2 FLOPs = 8,192 ALU ops / cy
8. 1,536 ALU ops / 2 engines = 768 ALU ops per triangle
2,304 ALU ops / 2 engines = 1,152 ALU ops per triangle
8,192 ALU ops / 4 engines = 2,048 ALU ops per triangle
9. 768 ALU ops / 2 ALU per cy = 384 instruction limit
1,152 ALU ops / 2 ALU per cy = 576 instruction limit
2,048 ALU ops / 2 ALU per cy = 1,024 instruction limit
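The budget arithmetic on slides 7 to 9 can be reproduced with a few lines of Python. This is only a sketch of the calculation as it reads on the slides: the function and parameter names are hypothetical, and "engines" is treated as the number of primitives the fixed-function hardware can accept per clock. Slide 10 identifies the 12 CU configuration as Xbox One; the transcript does not name the other two.

```python
def cull_instruction_budget(compute_units, prims_per_clock,
                            alus_per_cu=64, flops_per_alu=2):
    """Return (ALU ops per clock, ALU ops per triangle, instruction limit).

    compute_units and prims_per_clock come from the slides; the function
    itself is an illustrative helper, not part of any real API.
    """
    alu_ops_per_clock = compute_units * alus_per_cu * flops_per_alu
    # Ops that can be spent on one triangle before the compute work becomes
    # slower than the fixed-function primitive rate.
    alu_ops_per_triangle = alu_ops_per_clock // prims_per_clock
    # Each ALU instruction counts as 2 ops, so halve to get instructions.
    instruction_limit = alu_ops_per_triangle // flops_per_alu
    return alu_ops_per_clock, alu_ops_per_triangle, instruction_limit


if __name__ == "__main__":
    for cus, engines in [(12, 2), (18, 2), (64, 4)]:
        print(cus, "CU:", cull_instruction_budget(cus, engines))
```

For the 18 CU configuration this gives 2,304 / 2 = 1,152 ALU ops per triangle and a 576 instruction limit.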
10. Can anyone here cull a triangle in fewer than 384 instructions on Xbox One?
… I sure hope so ☺
11. Motivation – Death By 1000 Draws
DirectX 12 promised millions of draws!
Great CPU performance advancements
Low overhead
Power in the hands of (experienced) developers
Console hardware is a fixed target
GPU still chokes on tiny draws
Common to see 2nd half of base pass barely utilizing the GPU
Lots of tiny details or distant objects – most are Hi-Z culled
Still have to run mostly empty vertex wavefronts
More draws not necessarily a good thing
13. Motivation – Primitive Rate
Wildly optimistic to assume we get close to 2 prims per cy – Getting 0.9 prim / cy
If you are doing anything useful, you will be bound elsewhere in the pipeline
You need good balance and lucky scheduling between the VGTs and PAs
Depth of FIFO between VGT and PA
Need positions of a VS back in < 4096 cy, or reduces primitive rate
Some games hit close to peak perf (95+% range) in shadow passes
Usually slower regions in there due to large triangles
Coarse raster only does 1 super-tile per clock
Triangles with bounding rectangle larger than 32x32?
Multi-cycle on coarse raster, reduces primitive rate
14. Motivation – Primitive Rate
Benchmarks that get 2 prims / cy (around 1.97) have these characteristics:
VS reads nothing
VS writes only SV_Position
VS always outputs 0.0f for position - Trivially cull all primitives
Index buffer is all 0s - Every vertex is a cache hit
Every instance is a multiple of 64 vertices – Less likely to have unfilled VS waves
No PS bound – No parameter cache usage
Requires that nothing after VS causes a stall
Parameter size <= 4 * PosSize
Pixels drain faster than they are generated
No scissoring occurs
PA can receive work faster than VS can possibly generate it
Often see tessellation achieve peak VS primitive throughput; one SE at a time
15. Motivation – Opportunity
Coarse cull on CPU, refine on GPU
Latency between CPU and GPU prevents optimizations
GPGPU Submission!
Depth-aware culling
Tighten shadow bounds (sample distribution shadow maps) [21]
Cull shadow casters without contribution [4]
Cull hidden objects from color pass
VR late-latch culling
CPU submits conservative frustum and GPU refines
Triangle and cluster culling
Covered by this presentation
16. Motivation – Opportunity
Maps directly to graphics pipeline
Offload tessellation hull shader work
Offload entire tessellation pipeline! [16][17]
Procedural vertex animation (wind, cloth, etc.)
Reusing results between multiple passes & frames
Maps indirectly to graphics pipeline
Bounding volume generation
Pre-skinning
Blend shapes
Generating GPU work from the GPU [4] [13]
Scene and visibility determination
Treat your draws as data!
Pre-build
Cache and reuse
Generate on GPU
19. Culling Overview
Batch
Configurable subset of meshes in a scene
Meshes within a batch share the same shader and strides (vertex/index)
Near 1:1 with DirectX 12 PSO (Pipeline State Object)
20. Culling Overview
Mesh Section
Represents an indexed draw call (triangle list)
Has its own:
Vertex buffer(s)
Index buffer
Primitive count
Etc.
21. Culling Overview
Work Item
Optimal number of triangles for processing in a wavefront
AMD GCN has 64 threads per wavefront
Each culling thread processes 1 triangle
Work item processes 256 triangles
22. Culling Overview
[Diagram: Scene -> Batch -> Mesh Section -> Work Item -> Culling -> Draw Args -> Draw Call Compaction (No Zero Size Draws) -> Multi Draw Indirect]
23. Mapping Mesh ID to MultiDraw ID
Indirect draws no longer know the mesh section or instance they came from
Important for loading various constants, etc.
A DirectX 12 trick is to create a custom command signature
Allows for parsing a custom indirect arguments buffer format
We can store the mesh section id along with each draw argument block
PC drivers use compute shader patching
Xbox One has custom command processor microcode support
OpenGL has gl_DrawID which can be used for this
SPI loads StartInstanceLocation into a reserved SGPR and adds it to SV_InstanceID
A fallback approach can be an instancing buffer with a step rate of 1 which maps from instance id to draw id
24. Mapping Mesh ID to MultiDraw ID
Mesh Section Id
Draw Args
Index Count Per Instance
Instance Count
Start Index Location
Base Vertex Location
Start Instance Location
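The argument block on this slide could be laid out as in the sketch below. The struct name is hypothetical; the five draw fields follow the order listed above, which matches the D3D12 indexed draw arguments, and the leading id is the extra value a custom command signature would consume (for example as a root constant, which is an assumption about how the signature is set up).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical indirect argument record: a mesh section id followed by the
// five fields of an indexed draw, in the order listed on the slide.
struct MeshSectionDrawArgs {
    std::uint32_t meshSectionId;          // extra value read by the command signature
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::int32_t  baseVertexLocation;     // signed, as in D3D12 indexed draws
    std::uint32_t startInstanceLocation;
};

// The record must be tightly packed so the GPU can step through the buffer
// with a fixed stride.
static_assert(sizeof(MeshSectionDrawArgs) == 6 * sizeof(std::uint32_t),
              "argument record must be 24 bytes with no padding");
static_assert(offsetof(MeshSectionDrawArgs, indexCountPerInstance) == 4,
              "draw arguments must start right after the mesh section id");
```

Culling zeroes or shrinks `indexCountPerInstance` per record, while `meshSectionId` survives compaction and lets the shader find its constants.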
25. De-Interleaved Vertex Buffers
P0 P1 P2 P3 …
N0 N1 N2 N3 …
TC0 TC1 TC2 TC3 …
Draw Call
P0 N0 TC0 P1 N1 TC1 P2 N2 TC2 …
Draw Call
Do This!
De-Interleaved vertex buffers are optimal on GCN architectures
They also make compute processing easier!
26. De-Interleaved Vertex Buffers
Helpful for minimizing state changes for compute processing
Constant vertex position stride
Cleaner separation of volatile vs. non-volatile data
Lower memory usage overall
More optimal for regular GPU rendering
Evict cache lines as quickly as possible!
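As a rough illustration of the two layouts, the sketch below splits an interleaved vertex array into one tightly packed stream per attribute. All type and function names are hypothetical, and the attribute set (position, normal, texture coordinate) is taken from the diagram on the previous slide.

```cpp
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };
struct Float2 { float u, v; };

// Interleaved layout: P0 N0 TC0 P1 N1 TC1 ...
struct InterleavedVertex {
    Float3 position;
    Float3 normal;
    Float2 texCoord;
};

// De-interleaved layout: one tightly packed stream per attribute.
struct VertexStreams {
    std::vector<Float3> positions;  // constant stride, all a culling pass reads
    std::vector<Float3> normals;
    std::vector<Float2> texCoords;
};

inline VertexStreams deinterleave(const std::vector<InterleavedVertex>& vertices) {
    VertexStreams streams;
    streams.positions.reserve(vertices.size());
    streams.normals.reserve(vertices.size());
    streams.texCoords.reserve(vertices.size());
    for (const InterleavedVertex& v : vertices) {
        streams.positions.push_back(v.position);
        streams.normals.push_back(v.normal);
        streams.texCoords.push_back(v.texCoord);
    }
    return streams;
}
```

A culling kernel then touches only `positions`, so the other attributes never occupy cache lines during the compute pass.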
28. Cluster Culling
Generate triangle clusters using spatially coherent bucketing in spherical coordinates
Optimize each triangle cluster to be cache coherent
Generate optimal bounding cone of each cluster [19]
Project normals on to the unit sphere
Calculate minimum enclosing circle
Diameter is the cone angle
Center is projected back to Cartesian for cone normal
Store cone in 8:8:8:8 SNORM
Cull if dot(cone.Normal, -view) < -sin(cone.angle)
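The cone test on the last line can be written out as a small CPU-side sketch. The names are hypothetical, and it assumes unit-length vectors, a view vector pointing from the camera toward the cluster, and a cone angle in radians already decoded from the packed SNORM value.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot3(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns true when every triangle in the cluster faces away from the viewer.
// coneNormal and viewDir are assumed to be unit length; coneAngle is in radians.
inline bool cullClusterByCone(const Vec3& coneNormal, const Vec3& viewDir,
                              float coneAngle) {
    const Vec3 negView{-viewDir.x, -viewDir.y, -viewDir.z};
    return dot3(coneNormal, negView) < -std::sin(coneAngle);
}
```

A cluster whose cone normal points straight back at the camera gives a dot product of 1 and is kept; one pointing directly away gives -1 and is rejected for any cone angle below 90 degrees.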
29. Cluster Culling
64 is convenient on consoles
Opens up intrinsic optimizations
Not optimal, as the CP bottlenecks on too many draws
Not LDS bound
256 seems to be the sweet spot
More vertex reuse
Fewer atomic operations
Larger than 256?
2x VGTs alternate back and forth (256 triangles)
Vertex re-use does not survive the flip
30. Cluster Culling
Coarse reject clusters of triangles [4]
Cull against:
View (Bounding Cone)
Frustum (Bounding Sphere)
Hi-Z Depth (Screen Space Bounding Box)
Be careful of perspective distortion! [22]
Spheres become ellipsoids under projection
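Of the three cluster tests listed, the frustum test against a bounding sphere is the simplest to sketch. The version below is a generic sphere-versus-planes check, not necessarily the one used in the talk; the names are hypothetical, and it assumes six unit-length plane normals pointing into the frustum.

```cpp
#include <array>

// Plane in the form n.p + d = 0; points inside the frustum satisfy n.p + d >= 0.
struct Plane { float nx, ny, nz, d; };
struct Sphere { float cx, cy, cz, radius; };

// Returns true when the sphere lies entirely outside at least one plane.
// Plane normals are assumed to be unit length.
inline bool cullSphereByFrustum(const Sphere& s, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        const float signedDistance = p.nx * s.cx + p.ny * s.cy + p.nz * s.cz + p.d;
        if (signedDistance < -s.radius) {
            return true;  // fully behind this plane
        }
    }
    return false;  // inside or intersecting: keep the cluster
}
```

The test is conservative: a sphere that only touches the frustum is kept and left to the per-triangle stage.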
36. Compaction
__XB_Ballot64
Produce a 64 bit mask
Each bit is an evaluated predicate per wavefront thread
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
__XB_Ballot64(threadId & 1)
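To make the ballot semantics concrete, here is a CPU emulation of a 64-wide ballot. It is not the Xbox intrinsic itself; `emulateBallot64` is a hypothetical stand-in that evaluates a predicate once per lane and sets the matching bit.

```cpp
#include <cstdint>

// CPU emulation of a 64-lane ballot: bit N of the result is the predicate
// evaluated for lane N. On hardware every lane evaluates in parallel.
template <typename Predicate>
std::uint64_t emulateBallot64(Predicate predicate) {
    std::uint64_t mask = 0;
    for (std::uint32_t lane = 0; lane < 64; ++lane) {
        if (predicate(lane)) {
            mask |= std::uint64_t{1} << lane;
        }
    }
    return mask;
}
```

With the predicate `threadId & 1` from the slide, every odd lane sets its bit, giving the alternating 0 1 0 1 pattern shown above.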
38. Compaction
V_MBCNT_LO_U32_B32 [5]
Masked bit count of the lower 32 threads (0-31)
V_MBCNT_HI_U32_B32 [5]
Masked bit count of the upper 32 threads (32-63)
For each thread, returns the # of active threads which come before it.
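The pair of masked bit count instructions behaves like a population count over the lanes below the current one. The sketch below emulates that on the CPU; the function names are hypothetical and the lane index is assumed to be in the range 0 to 63.

```cpp
#include <cstdint>

// Portable population count (stand-in for a hardware bit count).
inline std::uint32_t countBits64(std::uint64_t value) {
    std::uint32_t count = 0;
    while (value != 0) {
        value &= value - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Emulates the LO + HI masked bit count pair: the number of set bits in
// `mask` that belong to lanes strictly below `lane` (0..63).
inline std::uint32_t emulateMbcnt64(std::uint64_t mask, std::uint32_t lane) {
    const std::uint64_t lowerLanes =
        (lane == 0) ? 0 : (~std::uint64_t{0} >> (64 - lane));
    return countBits64(mask & lowerLanes);
}
```

Feeding it a ballot mask of surviving triangles yields, for each surviving lane, its slot in the compacted output.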
42. Per-Triangle Culling
Each thread in a wavefront processes 1 triangle
Cull masks are balloted and counted to determine compaction index
Maintain vertex reuse across a wavefront
Maintain vertex reuse across all wavefronts - ds_ordered_count [5][15]
+0.1ms for ~3906 work items – use wavefront limits
43. Per-Triangle Culling
For Each Triangle
Unpack Index and Vertex Data (16 bit)
Orientation and Zero Area Culling (2DH)
Scalar Branch (!culled)
Perspective Divide (xyz/w)
Small Primitive Culling (NDC)
Scalar Branch (!culled)
Frustum Culling (NDC)
Scalar Branch (!culled)
Depth Culling – Hi-Z (NDC)
Count Number of Surviving Indices (__XB_BALLOT64)
Compact Index Stream (Preserving Ordering) (__XB_MBCNT64)
Reserve Output Space for Surviving Indices (__XB_GdsOrderedCount, Optional)
Write out Surviving Indices (16 bit)
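The count, reserve, compact, and write steps for one wavefront can be sketched on the CPU as follows. This is an emulation under stated assumptions rather than the shader from the talk: the names are hypothetical, the lane loop stands in for 64 parallel threads, and a plain `resize` stands in for the ordered output-space reservation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable population count (stand-in for a hardware bit count).
inline std::uint32_t popCount64(std::uint64_t value) {
    std::uint32_t count = 0;
    while (value != 0) { value &= value - 1; ++count; }
    return count;
}

// Compacts one wavefront's worth of triangles (at most 64).
// indices holds 3 entries per triangle; culled[i] is true when triangle i
// failed a culling test. Survivors are appended to out in original order.
// Returns the number of surviving triangles.
inline std::uint32_t compactWaveIndices(const std::vector<std::uint16_t>& indices,
                                        const std::vector<bool>& culled,
                                        std::vector<std::uint16_t>& out) {
    const std::size_t triangleCount = culled.size();  // caller guarantees <= 64

    // Ballot: one bit per lane, set when that lane's triangle survives.
    std::uint64_t survivors = 0;
    for (std::size_t lane = 0; lane < triangleCount; ++lane) {
        if (!culled[lane]) survivors |= std::uint64_t{1} << lane;
    }

    // Reserve output space for all survivors in one step.
    const std::uint32_t survivorCount = popCount64(survivors);
    const std::size_t base = out.size();
    out.resize(base + 3u * survivorCount);

    // Each surviving lane writes at the slot given by its masked bit count.
    for (std::size_t lane = 0; lane < triangleCount; ++lane) {
        if (culled[lane]) continue;
        const std::uint64_t lowerLanes = (std::uint64_t{1} << lane) - 1;
        const std::size_t slot = popCount64(survivors & lowerLanes);
        for (std::size_t k = 0; k < 3; ++k) {
            out[base + 3 * slot + k] = indices[3 * lane + k];
        }
    }
    return survivorCount;
}
```

Because each slot is derived from the number of surviving lanes below it, triangle order within the wavefront is preserved, which keeps vertex reuse intact.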
44. Per-Triangle Culling
Without ballot
Compiler generates two tests for most if-statements
1) One or more threads enter the if-statement
2) Optimization where no threads enter the if-statement
With ballot (or high-level any/all/etc.), or when branching on a scalar value (__XB_MakeUniform)
Compiler only generates case #2
Skips extra control flow logic to handle divergence
Use ballot to force uniform branching and avoid divergence
No harm letting all threads execute the full sequence of culling tests
53. Small Primitive Culling (NDC)
This triangle is not culled because it encloses a
pixel center
any(round(min) == round(max))
54. Small Primitive Culling (NDC)
This triangle is culled because it does not
enclose a pixel center
any(round(min) == round(max))
55. Small Primitive Culling (NDC)
This triangle is culled because it does not
enclose a pixel center
any(round(min) == round(max))
56. Small Primitive Culling (NDC)
This triangle is not culled because the bounding
box min and max snap to different coordinates
This triangle should be culled, but accounting
for this case is not worth the cost
any(round(min) == round(max))
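The test shown on these slides can be sketched as below. The names are hypothetical, and the sketch assumes the bounding box has already been scaled from NDC to pixel units with pixel centers at half-integer coordinates, so that rounding snaps each edge to the nearest pixel boundary.

```cpp
#include <cmath>

struct Float2 { float x, y; };

// minPx / maxPx: screen-space bounding box of the triangle in pixel units.
// Returns true when the box cannot enclose a pixel center on at least one axis.
inline bool cullSmallPrimitive(const Float2& minPx, const Float2& maxPx) {
    return std::round(minPx.x) == std::round(maxPx.x) ||
           std::round(minPx.y) == std::round(maxPx.y);
}
```

The test is conservative in the way slide 56 describes: a box that snaps to different coordinates is kept even if the triangle inside it misses every pixel center.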
65. Depth Tile Culling (NDC)
Another available culling approach is to do manual depth testing
Perform an LDS optimized parallel reduction [9], storing out the conservative depth
value for each tile
16x16 Tiles
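A serial CPU version of the per-tile reduction might look like the sketch below; on the GPU the inner loops would be the LDS parallel reduction. The function name is hypothetical, and it assumes depth values in [0, 1] with larger values farther away, so the conservative value per tile is the maximum.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reduces a row-major depth buffer to one conservative (farthest) depth per tile.
// Assumes depth in [0, 1] with larger values farther from the camera.
inline std::vector<float> buildDepthTiles(const std::vector<float>& depth,
                                          std::size_t width, std::size_t height,
                                          std::size_t tileSize = 16) {
    const std::size_t tilesX = (width + tileSize - 1) / tileSize;
    const std::size_t tilesY = (height + tileSize - 1) / tileSize;
    std::vector<float> tiles(tilesX * tilesY, 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            float& tileDepth = tiles[(y / tileSize) * tilesX + (x / tileSize)];
            tileDepth = std::max(tileDepth, depth[y * width + x]);
        }
    }
    return tiles;
}
```

A primitive whose nearest depth is farther than the stored tile value for every tile it touches can then be rejected.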
67. Depth Pyramid Culling (NDC)
Another approach to depth culling is a hierarchical Z pyramid [10][11][23]
Populate the Hi-Z pyramid after depth laydown
Construct a mip-mapped screen resolution texture
Culling can be done by comparing the depth of a bounding volume with the depth stored in the Hi-Z
pyramid
int mipMapLevel = min(ceil(log2(max(longestEdge, 1.0f))), levels - 1);
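The mip selection line above can be wrapped in a standalone function for experimentation. The function name is hypothetical, and `longestEdge` is assumed to be the longest side, in pixels, of the bounding volume's screen-space rectangle.

```cpp
#include <algorithm>
#include <cmath>

// Picks the Hi-Z mip level at which the bounding rectangle spans roughly
// one texel along its longest edge, clamped to the coarsest available level.
inline int selectHiZMipLevel(float longestEdge, int levels) {
    const float level = std::ceil(std::log2(std::max(longestEdge, 1.0f)));
    return std::min(static_cast<int>(level), levels - 1);
}
```

Clamping the edge to at least 1.0 keeps sub-pixel bounds on mip 0, and clamping to `levels - 1` keeps very large bounds on the coarsest mip.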
68. AMD GCN HTILE
Depth acceleration meta data called HTILE [6][7]
Every group of 8x8 pixels has a 32bit meta data block
Can be decoded manually in a shader and used for 1 test -> 64 pixel rejection
Avoids slow hardware decompression or resummarize
Avoids losing Hi-Z on later depth enabled render passes
DEPTH HTILE
71. AMD GCN HTILE
Manually encode; skip the resummarize on half resolution depth!
HTILE encodes both near and far depth for each 8x8 pixel tile.
Stencil Enabled = 14 bit near value, and a 6 bit delta towards far plane
Stencil Disabled = MinMax depth encoded in 2x 14 bit UNORM pairs
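For the stencil-disabled case, the conservative quantization of a tile's depth range to two 14 bit UNORM values can be sketched as below. The names are hypothetical, depth is assumed to lie in [0, 1], and the position of each field inside the 32 bit HTILE word is not given on the slide, so it is deliberately not modelled here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct DepthRange14 {
    std::uint32_t zMin;  // 14-bit UNORM, rounded down so the range stays conservative
    std::uint32_t zMax;  // 14-bit UNORM, rounded up so the range stays conservative
};

// Quantizes a tile's min/max depth (assumed in [0, 1]) to 14-bit UNORM.
// Packing into the actual HTILE bit layout is left out on purpose.
inline DepthRange14 quantizeDepthRange14(float tileMinDepth, float tileMaxDepth) {
    const float scale = 16383.0f;  // 2^14 - 1
    const float lo = std::min(std::max(tileMinDepth, 0.0f), 1.0f);
    const float hi = std::min(std::max(tileMaxDepth, 0.0f), 1.0f);
    DepthRange14 range;
    range.zMin = static_cast<std::uint32_t>(std::floor(lo * scale));
    range.zMax = static_cast<std::uint32_t>(std::ceil(hi * scale));
    return range;
}
```

Rounding the minimum down and the maximum up means the encoded range always contains the true range, so a rejection based on it is never wrong.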
72. Software Z
One problem with using depth for culling is availability
Many engines do not have a full Z pre-pass
Restricts asynchronous compute scheduling
Wait for Z buffer laydown
You can load the Hi-Z pyramid with software Z!
In Frostbite since Battlefield 3 [12]
Done on the CPU for the upcoming GPU frame
No latency
You can prime HTILE!
Full Z pre-pass
Minimal cost
80. Batching
[Timeline: Dispatch #0 to #3 pipelined against Render #0 to #3; Dispatch #0 has nothing to overlap (Startup Cost)]
Overlapping culling and render on the graphics pipe is great
But there is a high startup cost for dispatch #0 (no graphics to overlap)
If only there were something we could use….
81. Batching
Asynchronous compute to the rescue!
We can launch the dispatch work alongside other GPU work in the frame
Water simulation, physics, cloth, virtual texturing, etc.
This can slow down “Other GPU Stuff” a bit, but overall frame is faster!
Just be careful about what you schedule culling with
We wait on lightweight label operations to ensure that dispatch and render are pipelined correctly
[Timeline: Dispatch #0 overlaps Other GPU Stuff; Dispatch #1 to #3 pipelined against Render #0 to #3]
86. Future Work
Reuse results between multiple passes
Once for all shadow cascades
Depth, gbuffer, emissive, forward, reflection
Cube maps – load once, cull each side
Xbox One supports switching PSOs with ExecuteIndirect
Single submitted batch!
Further reduce bottlenecks
Move more and more CPU rendering logic to GPU
Improve asynchronous scheduling
87. Future Work
Instancing optimizations
Each instance (re)loads vertex data
Synchronous dispatch
Near 100% L2$ hit
ALU bound on render - 24 VGPRs, measured occupancy of 8
1.5 bytes bandwidth usage per triangle
Asynchronous dispatch
Low L2$ residency - other render work between culling and render
VMEM bound on render
20 bytes bandwidth usage per triangle
88. Future Work
Maximize bandwidth and throughput
Load data into LDS chunks, bandwidth amplification
Partition data into per-chunk index buffers
Evaluate all instances
More tuning of wavefront limits and CU masking
92. Hardware Tessellation
Mesh Data -> Compute Shader
Structured Work Queue #1 (Patches with factor [1…1]): Tessellation Factors
Structured Work Queue #2 (Patches with factor [2…7]): Tessellation Factors
Structured Work Queue #3 (Patches with factor [8…N]): Tessellation Factors
Patches with factor 0 (culled) are not processed further, and do not get inserted into any work queue.
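The binning step described on this slide can be sketched as a CPU loop. The names are hypothetical, and the sketch assumes one integer tessellation factor per patch; the queue ranges are the ones listed above.

```cpp
#include <cstdint>
#include <vector>

// Patch indices binned by tessellation factor.
struct PatchQueues {
    std::vector<std::uint32_t> noExpansion;    // factor 1
    std::vector<std::uint32_t> lowExpansion;   // factor 2..7
    std::vector<std::uint32_t> highExpansion;  // factor 8..N
};

// factors[i] is the (integer, assumed) tessellation factor of patch i.
// Patches with factor 0 are culled and enter no queue.
inline PatchQueues binPatchesByFactor(const std::vector<std::uint32_t>& factors) {
    PatchQueues queues;
    for (std::uint32_t patch = 0; patch < factors.size(); ++patch) {
        const std::uint32_t factor = factors[patch];
        if (factor == 0) {
            continue;
        } else if (factor == 1) {
            queues.noExpansion.push_back(patch);
        } else if (factor <= 7) {
            queues.lowExpansion.push_back(patch);
        } else {
            queues.highExpansion.push_back(patch);
        }
    }
    return queues;
}
```

On the GPU each push would be an atomic append into a structured buffer; the serial loop only shows the classification.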
93. Hardware Tessellation
Structured Work Queue #1 (Patches with factor [1…1]): Tessellation Factors -> Non-Tessellated Draw
No Expansion Factor – Avoid Tessellator!
Structured Work Queue #2 (Patches with factor [2…7]): Tessellation Factors -> Tessellated Draw
Low Expansion Factor – GCN Friendly
Structured Work Queue #3 (Patches with factor [8…N]): Tessellation Factors -> Compute Shader
Patch SubD 1 -> 4, Tessellation Factor 1/4
High Expansion Factor – GCN Unfriendly
94. Summary
Small and inefficient draws are a problem
Compute and graphics are friends
Use all the available GPU resources
Asynchronous compute is extremely powerful
Lots of cool GCN instructions available
Check out AMD GPUOpen GeometryFX [20]
96. Acknowledgements
Matthäus Chajdas (@NIV_Anteru)
Ivan Nevraev (@Nevraev)
Alex Nankervis
Sébastien Lagarde (@SebLagarde)
Andrew Goossen
James Stanard (@JamesStanard)
Martin Fuller (@MartinJIFuller)
David Cook
Tobias “GPU Psychiatrist” Berghoff (@TobiasBerghoff)
Christina Coffin (@ChristinaCoffin)
Alex “I Hate Polygons” Evans (@mmalex)
Rob Krajcarski
Jaymin “SHUFB 4 LIFE” Kessler (@okonomiyonda)
Tomasz Stachowiak (@h3r2tic)
Andrew Lauritzen (@AndrewLauritzen)
Nicolas Thibieroz (@NThibieroz)
Johan Andersson (@repi)
Alex Fry (@TheFryster)
Jasper Bekkers (@JasperBekkers)
Graham Sellers (@grahamsellers)
Cort Stratton (@postgoodism)
David Simpson
Jason Scanlin
Mike Arnold
Mark Cerny (@cerny)
Pete Lewis
Keith Yerex
Andrew Butcher (@andrewbutcher)
Matt Peters
Sebastian Aaltonen (@SebAaltonen)
Anton Michels
Louis Bavoil (@LouisBavoil)
Yury Uralsky
Sebastien Hillaire (@SebHillaire)
Daniel Collin (@daniel_collin)
97. References
[1] “The AMD GCN Architecture – A Crash Course” – Layla Mah
[2] “Clipping Using Homogeneous Coordinates” – Jim Blinn, Martin Newell
[3] “Triangle Scan Conversion using 2D Homogeneous Coordinates” – Marc Olano, Trey Greer
[4] “GPU-Driven Rendering Pipelines” – Ulrich Haar, Sebastian Aaltonen
[5] “Southern Islands Series Instruction Set Architecture” – AMD
[6] “Radeon Southern Islands Acceleration” – AMD
[7] “Radeon Evergreen / Northern Islands Acceleration” – AMD
[8] “GCN Architecture Whitepaper” – AMD
[9] “Optimizing Parallel Reduction In CUDA” – Mark Harris
[10] “Hierarchical-Z Map Based Occlusion Culling” – Daniel Rákos
[11] “Hierarchical Z-Buffer Occlusion Culling” – Nick Darnell
[12] “Culling the Battlefield: Data Oriented Design in Practice” – Daniel Collin
[13] “The Rendering Pipeline – Challenges & Next Steps” – Johan Andersson
[14] “GCN Performance Tweets” – AMD
[15] “Learning from Failure: … Abandoned Renderers For Dreams PS4 …” – Alex Evans
[16] “Patch Based Occlusion Culling For Hardware Tessellation” – Matthias Nießner, Charles Loop
[17] “Tessellation In Call Of Duty: Ghosts” – Wade Brainerd
[18] “MiniEngine Framework” – Alex Nankervis, James Stanard
[19] “Optimal Bounding Cones of Vectors in Three Dimensions” – Gill Barequet, Gershon Elber
[20] “GPUOpen GeometryFX” – AMD
[21] “Sample Distribution Shadow Maps” – Andrew Lauritzen
[22] “2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere” – Mara and McGuire
[23] “Practical, Dynamic Visibility for Games” – Stephen Hill
98. Thank You!
graham@frostbite.com
Questions?
Twitter - @gwihlidal
“If you’ve been struggling with a tough ol’ programming problem all day, maybe go for a walk. Talk to a tree. Trust me, it helps.”
- Bob Ross, Game Dev
99. Instancing Optimizations
Can do a fast bitonic sort of the instancing buffer for optimal
front-to-back order
Utilize DS_SWIZZLE_B32
Swizzles input thread data based on offset mask
Data sharing within 32 consecutive threads
Only 32 bit, so can efficiently sort 32 elements
You could do clustered sorting
Sort each cluster’s instances (within a thread)
Sort the 32 clusters
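The 32-element sort mentioned above follows the standard bitonic network, sketched here on the CPU. The function name is hypothetical; plain array indexing replaces the lane exchange that DS_SWIZZLE_B32 would perform, and the keys are assumed to be 32 bit sort keys such as quantized view depth.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Sorts 32 keys in ascending order with a bitonic network. Each (k, j) pass
// compares element i with partner i ^ j, which is the data exchange a
// cross-lane swizzle would provide on the GPU.
inline void bitonicSort32(std::array<std::uint32_t, 32>& keys) {
    const std::size_t n = keys.size();
    for (std::size_t k = 2; k <= n; k <<= 1) {          // size of bitonic sequences
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((keys[i] > keys[partner]) == ascending) {
                        std::swap(keys[i], keys[partner]);
                    }
                }
            }
        }
    }
}
```

The network always runs the same 15 compare passes regardless of input, which is what makes it a good fit for lockstep execution across lanes.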