This document discusses using Direct3D 11 per-pixel linked lists to implement order-independent transparency (OIT) and indirect illumination. It describes how linked lists can store transparent fragments that are then sorted and blended for OIT, and how rays can be traced through linked lists to compute indirect shadows for indirect illumination. These techniques open up new rendering algorithms for real-time graphics.
Game engines have long been at the forefront of taking advantage of the ever-increasing parallel compute power of both CPUs and GPUs. This talk covers how parallel compute is used in practice on multiple platforms today in the Frostbite game engine, and how we think the industry's parallel programming models, hardware and software should look in the next 5 years to help us make the best games possible.
OpenGL 4.4 provides new features for accelerating scenes with many objects, which are typically found in professional visualization markets. This talk will provide details on the usage of the features and their effect on real-life models. Furthermore we will showcase how more work for rendering a scene can be off-loaded to the GPU, such as efficient occlusion culling or matrix calculations.
Video presentation here: http://on-demand.gputechconf.com/gtc/2014/video/S4379-opengl-44-scene-rendering-techniques.mp4
A technical deep dive into the DX11 rendering in Battlefield 3, the first title to use the new Frostbite 2 Engine. Topics covered include DX11 optimization techniques, efficient deferred shading, high-quality rendering and resource streaming for creating large and highly-detailed dynamic environments on modern PCs.
Ever wondered how to use modern OpenGL in a way that radically reduces driver overhead? Then this talk is for you.
John McDonald and Cass Everitt gave this talk at Steam Dev Days in Seattle on Jan 16, 2014.
A description of the next-gen rendering technique called the Triangle Visibility Buffer. It supports up to 10x-20x more geometry than deferred rendering, at much higher resolutions. It generally aligns better with the memory access patterns of modern GPUs than deferred lighting variants such as clustered deferred lighting.
Optimizing the Graphics Pipeline with Compute (GDC 2016), by Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation targets seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
Talk by Graham Wihlidal (Frostbite Labs) at GDC 2017.
Checkerboard rendering is a relatively new technique, popularized recently by the introduction of the PlayStation 4 Pro. Many modern game engines are adding support for it right now, and in this talk, Graham will present an in-depth look at the new implementation in Frostbite, which is used in shipping titles like 'Battlefield 1' and 'Mass Effect Andromeda'. Despite being conceptually simple, checkerboard rendering requires deep integration into the post-processing chain (in particular temporal anti-aliasing and dynamic resolution scaling) and poses various challenges to existing effects. This presentation will cover the basics of checkerboard rendering, explain its impact on a game engine that powers a wide range of titles, and provide a detailed look at how the current implementation in Frostbite works, including topics like object id, alpha unrolling, gradient adjust, and a highly efficient depth resolve.
Talk by Fabien Christin from DICE at GDC 2016.
Designing a big city that players can explore by day and by night while improving on the unique visuals of the first Mirror's Edge game isn't an easy task.
In this talk, the tools and technology used to render Mirror's Edge: Catalyst will be discussed. From the physical sky to the reflection tech, the speakers will show how they tamed the new Frostbite 3 PBR engine to deliver realistic images with stylized visuals.
They will talk about the artistic and technical challenges they faced and how they tried to overcome them, from the simple light settings and Enlighten workflow to character shading and color grading.
Takeaway
Attendees will gain insight into the technical and artistic techniques used to create a dynamic time-of-day system with updating radiosity and reflections.
Intended Audience
This session is targeted at game artists, technical artists and graphics programmers who want to know more about Mirror's Edge: Catalyst rendering technology, lighting tools and shading tricks.
Talk by Yuriy O’Donnell at GDC 2017.
This talk describes how Frostbite handles rendering architecture challenges that come with having to support a wide variety of games on a single engine. Yuriy describes their new rendering abstraction design, which is based on a graph of all render passes and resources. This approach allows implementation of rendering features in a decoupled and modular way, while still maintaining efficiency.
A graph of all rendering operations for the entire frame is a useful abstraction. The industry can move away from “immediate mode” DX11 style APIs to a higher level system that allows simpler code and efficient GPU utilization. Attendees will learn how it worked out for Frostbite.
Volumetric Lighting for Many Lights in Lords of the Fallen, by Benjamin Glatzel
In this session I’m going to give you an in-depth insight into the design and the implementation of the volumetric lighting system we’ve developed for ‘Lords of the Fallen’. The system allows the simulation of countless volumetric lighting effects in parallel while still being a feasible solution on next-gen consoles.
This presentation was held at the Digital Dragons 2014 conference.
Videos shown during the talk are available here: http://bglatzel.movingblocks.net/publications
In this AMD technology presentation from the 2014 Game Developers Conference in San Francisco (March 17-21), Bill explains some of the ways the vertex shader can be used to improve performance, by taking a fast path through the vertex shader rather than generating vertices with other parts of the pipeline. Check out more technical presentations at http://developer.amd.com/resources/documentation-articles/conference-presentations/
Secrets of CryENGINE 3 Graphics Technology, by Tiago Sousa
In this talk, the authors give an overview of the deferred lighting approach used in CryENGINE 3, along with an in-depth description of the many techniques used. Original file and videos at http://crytek.com/cryengine/presentations
Next-generation gaming brought high resolutions, very complex environments and large textures to our living rooms. With virtually every asset being inflated, it's hard to use traditional forward rendering and hope for rich, dynamic environments with extensive dynamic lighting. Deferred rendering, on the other hand, has traditionally been described as a nice technique for rendering scenes with many dynamic lights that unfortunately suffers from fill-rate problems and a lack of anti-aliasing, and very few games that used it were published.
In this talk, we will discuss our approach to facing this challenge and how we designed a deferred rendering engine that uses multi-sampled anti-aliasing (MSAA). We will give an in-depth description of each individual stage of our real-time rendering pipeline and the main ingredients of our lighting, post-processing and data management. We'll show how we utilize the PS3's SPUs for fast rendering of a large set of primitives, parallel processing of geometry and computation of indirect lighting. We will also describe our optimizations of the lighting and our parallel-split (cascaded) shadow map algorithm for faster and stable MSAA output.
Graphics Gems from CryENGINE 3 (Siggraph 2013), by Tiago Sousa
This lecture covers rendering topics related to Crytek's latest engine iteration, the technology which powers titles such as Ryse, Warface, and Crysis 3. Among the covered topics, Sousa presented SMAA 1TX, an update featuring a robust and simple temporal anti-aliasing component; performant and physically plausible camera-related post-processing techniques, such as motion blur and depth of field, were also covered.
This session presents a detailed, programmer-oriented overview of our SPU-based shading system implemented in DICE's Frostbite 2 engine, and how it enables more visually rich environments in BATTLEFIELD 3 and better performance than traditional GPU-only renderers. We explain in detail how our SPU tile-based deferred shading system is implemented, and how it supports rich material variety, high dynamic range lighting, and large numbers of light sources of different types through an extensive set of culling, occlusion and optimization techniques.
Course presentation at SIGGRAPH 2014 by Charles de Rousiers and Sébastien Lagarde at Electronic Arts about transitioning the Frostbite game engine to physically-based rendering.
Make sure to check out the 118 page course notes on: http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/
During the last few months, we have revisited the concept of image quality in Frostbite. The core of our approach was to be as close as possible to a cinematic look. We used the concept of a reference to evaluate the accuracy of produced images. Physically based rendering (PBR) was the natural way to achieve this. This talk covers all the different steps needed to switch a production engine to PBR, including the small details often bypassed in the literature.
The state of the art of real-time PBR techniques allowed us to achieve good overall results but not without production issues. We present some techniques for improving convolution time for image based reflection, proper ambient occlusion handling, and coherent lighting units which are mandatory for level editing.
Moreover, we have managed to reduce the quality gap, highlighted by our systematic reference comparison, in particular related to rough material handling, glossy screen space reflection, and area lighting.
The technical part of PBR is crucial for achieving good results, but represents only the tip of the iceberg. Frostbite has become the de facto high-end game engine within Electronic Arts and is now used by a large number of game teams. Moving all these game teams from old-fashioned lighting to PBR has required a lot of education, which has been done in parallel with the technical development. We have provided editing and validation tools to help the transition of art production. In addition, we have built a flexible material parametrisation framework to adapt to the various authoring tools and game teams' requirements.
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead, by Tristan Lorach
This presentation introduces a new NVIDIA extension called Command-list.
The purpose of this presentation is to explain the basic concepts of how to use it and to show its benefits.
The sample I used for the talk is here: https://github.com/nvpro-samples/gl_commandlist_bk3d_models
The driver to use for trying it is the pre-release 347.09:
http://www.nvidia.com/download/driverResults.aspx/80913/en-us
4. Introduction. Direct3D 11 hardware opens the door to many new rendering algorithms. In particular, per-pixel linked lists allow a number of new techniques: OIT, indirect shadows, ray tracing of dynamic scenes, REYES surface dicing, custom AA, irregular Z-buffering, custom blending, advanced depth of field, etc. This talk will walk you through: a DX11 implementation of per-pixel linked lists, and two effects that utilize this technique: OIT and indirect illumination.
5. Per-Pixel Linked Lists with Direct3D 11. Nicolas Thibieroz, European ISV Relations, AMD. [Diagram: a chain of Element/Link nodes.]
6. Why Linked Lists? A data structure useful for programming, but very hard to implement efficiently with previous real-time graphics APIs; DX11 allows efficient creation and parsing of linked lists. Per-pixel linked lists: a collection of linked lists enumerating all fragments belonging to the same screen position. [Diagram: a chain of Element/Link nodes.]
7. Two-step process. 1) Linked list creation: store incoming fragments into linked lists. 2) Rendering from the linked lists: linked list traversal and processing of the stored fragments.
9. PS5.0 and UAVs. Uses a Pixel Shader 5.0 to store fragments into linked lists (not a Compute Shader 5.0!). Uses atomic operations. Two UAV buffers required: a "Fragment & Link" buffer and a "Start Offset" buffer. (UAV = Unordered Access View.)
10. Fragment & Link Buffer. The "Fragment & Link" buffer contains the data and link for all fragments to store. It must be large enough to store all fragments. Created with counter support (D3D11_BUFFER_UAV_FLAG_COUNTER flag in the UAV view). Declaration:

struct FragmentAndLinkBuffer_STRUCT
{
    FragmentData_STRUCT FragmentData; // Fragment data
    uint                uNext;        // Link to next fragment
};
RWStructuredBuffer<FragmentAndLinkBuffer_STRUCT> FLBuffer;
11. Start Offset Buffer. The "Start Offset" buffer contains the offset of the last fragment written at every pixel location. Screen-sized: (width * height * sizeof(UINT32)). Initialized to a magic value (e.g. -1); the magic value indicates that no more fragments are stored (i.e. end of the list). Declaration:

RWByteAddressBuffer StartOffsetBuffer;
12. Linked List Creation (1). No color render target bound! No rendering yet, just storing into the linked lists. Depth buffer bound if needed; OIT will need it in a few slides. UAVs bound as input/output: StartOffsetBuffer (R/W), FragmentAndLinkBuffer (W).
13.-16. Linked List Creation (2a-2d). [Diagrams: as triangles are rasterized over the viewport, each incoming fragment is appended to the Fragment and Link Buffer at the address given by the counter, its uNext link is set to the previous Start Offset Buffer entry (-1 at first), and the Start Offset Buffer entry for that pixel is updated to point at the new fragment.]
17. Linked List Creation - Code

float PS_StoreFragments(PS_INPUT input) : SV_Target
{
    // Calculate fragment data (color, depth, etc.)
    FragmentData_STRUCT FragmentData = ComputeFragment();

    // Retrieve current pixel count and increase counter
    uint uPixelCount = FLBuffer.IncrementCounter();

    // Exchange offsets in StartOffsetBuffer
    uint2 vPos = uint2(input.vPos.xy);
    uint uStartOffsetAddress = 4 * ((SCREEN_WIDTH * vPos.y) + vPos.x);
    uint uOldStartOffset;
    StartOffsetBuffer.InterlockedExchange(uStartOffsetAddress, uPixelCount, uOldStartOffset);

    // Add new fragment entry in Fragment & Link Buffer
    FragmentAndLinkBuffer_STRUCT Element;
    Element.FragmentData = FragmentData;
    Element.uNext = uOldStartOffset;
    FLBuffer[uPixelCount] = Element;

    return 0; // No color RT bound, return value is ignored
}
19. Rendering Pixels (1). The "Start Offset" buffer and "Fragment & Link" buffer are now bound as SRVs:

Buffer<uint> StartOffsetBufferSRV;
StructuredBuffer<FragmentAndLinkBuffer_STRUCT> FLBufferSRV;

Render a fullscreen quad. For each pixel, parse the linked list and retrieve the fragments for this screen position. Process the list of fragments as required; this depends on the algorithm (e.g. sorting, finding the maximum, etc.). (SRV = Shader Resource View.)
20. Rendering from Linked List. [Diagram: for each render target pixel, the Start Offset Buffer entry gives the head of that pixel's list in the Fragment and Link Buffer, which is followed via the uNext links until the end-of-list value (-1) is reached.]
21. Rendering Pixels (2)

float4 PS_RenderFragments(PS_INPUT input) : SV_Target
{
    // Calculate UINT-aligned start offset buffer address
    uint2 vPos = uint2(input.vPos.xy);
    uint uStartOffsetAddress = SCREEN_WIDTH * vPos.y + vPos.x;

    // Fetch offset of first fragment for current pixel
    uint uOffset = StartOffsetBufferSRV.Load(uStartOffsetAddress);

    // Parse linked list for all fragments at this position
    float4 FinalColor = float4(0, 0, 0, 0);
    while (uOffset != 0xFFFFFFFF) // 0xFFFFFFFF is the magic value
    {
        // Retrieve pixel at current offset
        FragmentAndLinkBuffer_STRUCT Element = FLBufferSRV[uOffset];

        // Process pixel as required
        ProcessPixel(Element, FinalColor);

        // Retrieve next offset
        uOffset = Element.uNext;
    }
    return FinalColor;
}
23. Description. A straight application of the linked list algorithm: transparent fragments are stored into the PPLL, and the rendering phase sorts fragments in back-to-front order and blends them manually in a pixel shader. The blend mode can be unique per pixel! A special case is needed for MSAA support.
24. Linked List Structure. Optimize performance by reducing the amount of data to write to/read from the UAV, e.g. a uint instead of float4 for color. Example data structure for OIT:

struct FragmentAndLinkBuffer_STRUCT
{
    uint uPixelColor; // Packed pixel color
    uint uDepth;      // Pixel depth
    uint uNext;       // Address of next link
};

May also get away with packing color and depth into the same uint (if the alpha is constant): 16 bits of color (565) + 16 bits of depth. A performance/memory/quality trade-off.
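As a concrete illustration of the packing above, here is a minimal sketch. PackToUint/UnpackFromUint match the helper names used in the blending code later in this deck, but the exact bit layouts below are assumptions, not the original implementation:

// Sketch only: one plausible packing for the fragment data described above.
uint PackToUint(float4 vColor)
{
    // 8 bits per channel, RGBA packed into a single uint (assumed layout)
    uint4 u = uint4(saturate(vColor) * 255.0f);
    return (u.a << 24) | (u.b << 16) | (u.g << 8) | u.r;
}

float4 UnpackFromUint(uint uColor)
{
    return float4(uColor & 0xFF, (uColor >> 8) & 0xFF,
                  (uColor >> 16) & 0xFF, (uColor >> 24) & 0xFF) / 255.0f;
}

// Aggressive variant mentioned on the slide: 16-bit 565 color + 16-bit depth
uint PackColorAndDepth(float3 vColor, float fDepth)
{
    uint r = uint(saturate(vColor.r) * 31.0f);
    uint g = uint(saturate(vColor.g) * 63.0f);
    uint b = uint(saturate(vColor.b) * 31.0f);
    uint d = uint(saturate(fDepth)   * 65535.0f);
    return (d << 16) | (r << 11) | (g << 5) | b;
}

With uNext, the 565+depth variant gives a compact 64-bit element, which is the performance/memory/quality trade-off the slide refers to.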
25. Visible Fragments Only! Use [earlydepthstencil] in front of the linked list creation pixel shader. This ensures only transparent fragments that pass the depth test are stored, i.e. visible fragments! Allows performance savings and rendering correctness!

[earlydepthstencil]
float PS_StoreFragments(PS_INPUT input) : SV_Target
{
    ...
}
26. Sorting Pixels. Sorting in place requires R/W access to the linked list; sparse memory accesses = slow! A better way is to copy all pixels into an array of temp registers, then do the sorting (a sketch of a suitable sort routine follows the code below). The temp array declaration means a hard limit on the number of fragments per screen coordinate; a required trade-off for performance.
27. Sorting and Blending. Blend fragments back to front in the PS. The blending algorithm is up to the app, e.g. SRCALPHA-INVSRCALPHA, or unique per pixel (stored in the fragment data)! The background is passed as an input texture; actual HW blending is disabled. [Diagram: sorted depths in the temp array (0.98, 0.95, 0.93, 0.87) are blended over the background color into the render target.]
28. Storing Pixels for Sorting

(...)
static uint2 SortedPixels[MAX_SORTED_PIXELS];

// Parse linked list for all pixels at this position
// and store them into temp array for later sorting
int nNumPixels = 0;
while (uOffset != 0xFFFFFFFF)
{
    // Retrieve pixel at current offset
    Element = FLBufferSRV[uOffset];

    // Copy pixel data into temp array
    SortedPixels[nNumPixels++] = uint2(Element.uPixelColor, Element.uDepth);

    // Retrieve next offset
    [flatten] uOffset = (nNumPixels >= MAX_SORTED_PIXELS) ? 0xFFFFFFFF : Element.uNext;
}

// Sort pixels in-place
SortPixelsInPlace(SortedPixels, nNumPixels);
(...)
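The SortPixelsInPlace routine called above is not shown in the deck; the editor's notes say any sort algorithm works (e.g. insertion sort). A minimal sketch under those assumptions, with SortedPixels[].x holding the packed color and .y the depth (per the editor's notes), sorted by descending depth so that index 0 is the furthest fragment for the back-to-front blend (assuming a conventional depth range where larger means further):

void SortPixelsInPlace(inout uint2 SortedPixels[MAX_SORTED_PIXELS], int nNumPixels)
{
    // Simple insertion sort; fine for the small per-pixel fragment counts here
    for (int i = 1; i < nNumPixels; i++)
    {
        uint2 Key = SortedPixels[i];
        int j = i - 1;
        // Shift nearer fragments right until the insertion point is found
        while (j >= 0 && SortedPixels[j].y < Key.y)
        {
            SortedPixels[j + 1] = SortedPixels[j];
            j--;
        }
        SortedPixels[j + 1] = Key;
    }
}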
29. Pixel Blending in PS

(...)
// Retrieve current color from background texture
float4 vCurrentColor = BackgroundTexture.Load(int3(vPos.xy, 0));

// Rendering pixels using SRCALPHA-INVSRCALPHA blending
for (int k = 0; k < nNumPixels; k++)
{
    // Retrieve next unblended furthermost pixel
    float4 vPixColor = UnpackFromUint(SortedPixels[k].x);

    // Manual blending between current fragment and previous one
    vCurrentColor.xyz = lerp(vCurrentColor.xyz, vPixColor.xyz, vPixColor.w);
}

// Return manually-blended color
return vCurrentColor;
(...)
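The editor's notes mention an under-blending alternative that needs no background texture. A hedged sketch of what that loop could look like: fragments are composited front to back (the reverse of the order above), and the remaining transmittance is written to the output alpha so the hardware blend (SrcBlend = ONE, DestBlend = SRCALPHA, as the notes state) composites the backbuffer underneath. This is an illustration, not the deck's code:

(...)
// Composite sorted fragments front to back (nearest first)
float4 vAccum = float4(0, 0, 0, 0); // rgb = premultiplied color, w = opacity
for (int k = 0; k < nNumPixels; k++)
{
    float4 vPixColor = UnpackFromUint(SortedPixels[k].x);
    // Only the light that still passes through the layers above reaches here
    vAccum.xyz += (1.0 - vAccum.w) * vPixColor.w * vPixColor.xyz;
    vAccum.w   += (1.0 - vAccum.w) * vPixColor.w;
}
// Alpha carries the remaining transmittance for the HW blend:
// final = vAccum.rgb + backbuffer.rgb * (1 - vAccum.w)
return float4(vAccum.xyz, 1.0 - vAccum.w);
(...)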
31. Sample Coverage. Storing individual samples into linked lists requires a huge amount of memory... and performance will suffer! The solution is to store transparent pixels into the PPLL as before, but including the sample coverage too. This requires as many bits as the MSAA mode. Declare SV_COVERAGE in the PS input structure:

struct PS_INPUT
{
    float3 vNormal   : NORMAL;
    float2 vTex      : TEXCOORD;
    float4 vPos      : SV_POSITION;
    uint   uCoverage : SV_COVERAGE;
};
32. Linked List Structure. Almost unchanged from before: depth is now packed into 24 bits, and 8 bits are used to store the coverage.

struct FragmentAndLinkBuffer_STRUCT
{
    uint uPixelColor;       // Packed pixel color
    uint uDepthAndCoverage; // Depth + coverage
    uint uNext;             // Address of next link
};
33. Sample Coverage Example. [Diagram: a pixel with four samples around the pixel center; the third sample is covered.] uCoverage = 0x04 (0100 in binary).

Element.uDepthAndCoverage = (uint(In.vPos.z * 16777215.0f) << 8) | In.uCoverage; // 16777215 = 2^24 - 1
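A sketch of the pack/unpack helpers this layout implies. UnpackCoverage is the name used by the sample-frequency shader later in the deck, while the other names and the exact code are assumptions (note that '^' is XOR in HLSL, so 2^24 - 1 is spelled out as a constant):

uint PackDepthAndCoverage(float fDepth, uint uCoverage)
{
    // 24-bit depth in the top bits, 8-bit coverage mask in the bottom bits
    return (uint(saturate(fDepth) * 16777215.0f) << 8) | (uCoverage & 0xFF);
}

uint UnpackCoverage(uint uDepthAndCoverage)
{
    return uDepthAndCoverage & 0xFF;
}

float UnpackDepth(uint uDepthAndCoverage)
{
    return float(uDepthAndCoverage >> 8) / 16777215.0f;
}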
34. Rendering Samples (1). The rendering phase needs to be able to write individual samples, thus the PS is run at sample frequency. This can be done by declaring SV_SAMPLEINDEX in the input structure. Parse the linked list and store pixels into a temp array for later sorting, similar to the non-MSAA case; the difference is to only store a sample if its coverage matches the sample index being rasterized.
35. Rendering Samples (2)

static uint2 SortedPixels[MAX_SORTED_PIXELS];

// Parse linked list for all pixels at this position
// and store them into temp array for later sorting
int nNumPixels = 0;
while (uOffset != 0xFFFFFFFF)
{
    // Retrieve pixel at current offset
    Element = FLBufferSRV[uOffset];

    // Retrieve pixel coverage from linked list element
    uint uCoverage = UnpackCoverage(Element.uDepthAndCoverage);
    if (uCoverage & (1 << In.uSampleIndex))
    {
        // Coverage matches current sample so copy pixel
        SortedPixels[nNumPixels++] = uint2(Element.uPixelColor, Element.uDepthAndCoverage);
    }

    // Retrieve next offset
    [flatten] uOffset = (nNumPixels >= MAX_SORTED_PIXELS) ? 0xFFFFFFFF : Element.uNext;
}
38. Indirect Illumination Introduction 1. Real-time indirect illumination is an active research topic. Numerous approaches exist: Reflective Shadow Maps (RSM) [Dachsbacher/Stamminger 2005], Splatting Indirect Illumination [Dachsbacher/Stamminger 2006], Multi-Res Splatting of Illumination [Wyman 2009], Light Propagation Volumes [Kaplanyan 2009], Approximating Dynamic Global Illumination in Image Space [Ritschel 2009]. Only a few support indirect shadows: Imperfect Shadow Maps [Ritschel/Grosch 2008], Micro-Rendering for Scalable, Parallel Final Gathering (SSDO) [Ritschel 2010], Cascaded Light Propagation Volumes for Real-Time Indirect Illumination [Kaplanyan/Dachsbacher 2010]. Most approaches somehow extend to multi-bounce lighting.
39. Indirect Illumination Introduction 2. This section will cover: an efficient and simple DX9-compliant RSM-based implementation for smooth one-bounce indirect illumination (indirect shadows are ignored here), and a Direct3D 11 technique that traces rays to compute indirect shadows. Part of this technique could generally be used for ray tracing dynamic scenes.
40. Indirect Illumination w/o Indirect Shadows. Draw the scene g-buffer. Draw the Reflective Shadow Map (RSM); the RSM shows the part of the scene that receives direct light from the light source. Draw the indirect light buffer at 1/2 res; RSM texels are used as light sources on g-buffer pixels for indirect lighting. Upsample the indirect light (IL). Draw the final image, adding the IL.
41. Step 1. The g-buffer needs to allow reconstruction of: world/camera space position, world/camera space normal, color/albedo. DXGI_FORMAT_R32G32B32A32_FLOAT positions may be required for precise ray queries for indirect shadows.
42. Step 2. The RSM needs to allow reconstruction of: world/camera space position, world/camera space normal, color/albedo. Only draw emitters of indirect light. DXGI_FORMAT_R32G32B32A32_FLOAT positions may be required for precise ray queries for indirect shadows.
43. Step 3. Render a 1/2 res IL buffer as a deferred op. Transform each g-buffer pixel to RSM space: -> light space -> project to RSM texel space. Use a kernel of RSM texels as light sources; RSM texels are also called Virtual Point Lights (VPLs). The kernel size depends on: the desired speed, the desired look of the effect, and the RSM resolution.
44. Computing IL at a G-buf Pixel 1 Sum up contribution of all VPLs in the kernel
45. Computing IL at a G-buf Pixel 2. [Diagram: an RSM texel/VPL illuminating a g-buffer pixel.] This term is very similar to the terms used in radiosity form factor computations.
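For reference, this is roughly what the per-VPL term looks like in the usual reflective shadow map formulation; the function below is an illustrative sketch with assumed names and falloff clamping, not the deck's code:

// Contribution of one VPL (an RSM texel) to a g-buffer pixel
float3 VPLContribution(float3 vPixelPos,  float3 vPixelNormal,
                       float3 vVPLPos,    float3 vVPLNormal,
                       float3 vVPLFlux)
{
    float3 vDir   = vVPLPos - vPixelPos;
    float  fDist2 = max(dot(vDir, vDir), 1e-4f); // clamp to avoid the singularity
    float3 vL     = vDir * rsqrt(fDist2);
    // Cosine terms at the receiver and at the VPL (back-facing VPLs contribute 0)
    float fCosRecv = saturate(dot(vPixelNormal,  vL));
    float fCosVPL  = saturate(dot(vVPLNormal,  -vL));
    return vVPLFlux * fCosRecv * fCosVPL / fDist2;
}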
46. Computing IL at a G-buf Pixel 3. A naive solution for smooth IL needs to consider four VPL kernels with centers at t0, t1, t2 and t3. stx: sub-RSM-texel x position [0.0, 1.0[. sty: sub-RSM-texel y position [0.0, 1.0[.
47. Computing IL at a g-buf pixel 4.

IndirectLight = (1.0f - sty) * ((1.0f - stx) * K(t0) + stx * K(t1)) +
                (0.0f + sty) * ((1.0f - stx) * K(t2) + stx * K(t3))

where K(ti) is the summed VPL contribution of the kernel centered at ti. Evaluation of 4 big VPL kernels is slow. stx: sub-texel x position [0.0, 1.0[. sty: sub-texel y position [0.0, 1.0[.
48. Computing IL at a g-buf pixel 5.

SmoothIndirectLight =
    (1.0f - sty) * (((1.0f - stx) * (B0 + B3) + stx * (B2 + B5)) + B1) +
    (0.0f + sty) * (((1.0f - stx) * (B6 + B3) + stx * (B8 + B5)) + B7) + B4

The trick: the four overlapping kernels are decomposed into nine disjoint sub-kernel sums B0..B8, each evaluated once and bilinearly weighted, instead of evaluating four full kernels. stx: sub-RSM-texel x position of the g-buf pixel [0.0, 1.0[. sty: sub-RSM-texel y position of the g-buf pixel [0.0, 1.0[. This trick is probably known to some of you already; see the backup for a detailed explanation!
50. Step 4. The indirect light buffer is 1/2 res. Perform a bilateral upsampling step; see Peter-Pike Sloan, Naga K. Govindaraju, Derek Nowrouzezahrai, John Snyder, "Image-Based Proxy Accumulation for Real-Time Soft Global Illumination", Pacific Graphics 2007. The result is a full-resolution IL buffer.
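A minimal sketch of a bilateral upsampling weight of the kind the referenced paper describes: each of the four low-res taps is weighted by its bilinear weight times depth and normal similarity against the full-res pixel, and the weighted sum is renormalized by the total weight. All names and constants are illustrative assumptions:

float BilateralWeight(float fBilinearWeight,
                      float fHiResDepth,   float fLoResDepth,
                      float3 vHiResNormal, float3 vLoResNormal)
{
    // Down-weight low-res samples across depth discontinuities...
    float fDepthWeight  = 1.0f / (1e-3f + abs(fHiResDepth - fLoResDepth));
    // ...and across normal discontinuities
    float fNormalWeight = pow(saturate(dot(vHiResNormal, vLoResNormal)), 8.0f);
    return fBilinearWeight * fDepthWeight * fNormalWeight;
}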
53. Combined Image DEMO. ~280 FPS on an HD5970 @ 1280x1024 for a 15x15 VPL kernel. The deferred IL pass + bilateral upsampling costs ~2.5 ms.
54. How to add Indirect Shadows. Use a CS and the linked lists technique. Insert the blocker geometry of the IL into a 3D grid of lists; let's use the blockers' triangles for now (see the backup for an alternative data structure). Look at a kernel of VPLs again, but only accumulate the light of VPLs that are not occluded by blocker triangles: trace rays through the 3D grid to detect occluded VPLs. Render a low-res buffer only, then subtract the blocked indirect light from the IL buffer. A blurred version of the low-res blocked IL is used; the blur is a combined bilateral blurring/upsampling.
55.-56. Insert tris into 3D grid of triangle lists. Rasterize dynamic blockers to a 3D grid using a CS and atomics. [Diagram: a world-space 3D grid of triangle lists around the IL blockers, laid out in a UAV; eol = end of list (0xffffffff).] (A sketch of the insertion follows below.)
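A hedged sketch of the insertion step, reusing the head/next pattern of the per-pixel linked lists for grid cells. The buffer layout, grid constants and the use of one thread per triangle are assumptions; a full implementation would insert each triangle into every cell its bounding box overlaps, not just the centroid's cell:

cbuffer GridConstants
{
    float3 g_vGridMin;      // world-space grid origin (assumed)
    float3 g_vInvCellSize;  // 1 / cell size (assumed)
    uint3  g_vGridDim;      // cells per axis (assumed)
    uint   g_uNumTriangles;
};

RWByteAddressBuffer GridHeads;         // one uint per cell, initialized to 0xFFFFFFFF
RWStructuredBuffer<uint2> TriNodes;    // .x = triangle index, .y = next node
StructuredBuffer<float3> TriCentroids; // precomputed centroid per triangle (assumed)

uint ComputeCellIndex(float3 vWorldPos)
{
    uint3 c = uint3(clamp((vWorldPos - g_vGridMin) * g_vInvCellSize,
                          0.0f, float3(g_vGridDim) - 1.0f));
    return (c.z * g_vGridDim.y + c.y) * g_vGridDim.x + c.x;
}

void InsertTriIntoCell(uint uCellIndex, uint uTriIndex)
{
    // Same head/next exchange as the per-pixel linked lists
    uint uNode = TriNodes.IncrementCounter();
    uint uOldHead;
    GridHeads.InterlockedExchange(4 * uCellIndex, uNode, uOldHead);
    TriNodes[uNode] = uint2(uTriIndex, uOldHead);
}

[numthreads(64, 1, 1)]
void CS_InsertTriangles(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= g_uNumTriangles)
        return;
    // For brevity this sketch inserts at the centroid's cell only
    InsertTriIntoCell(ComputeCellIndex(TriCentroids[id.x]), id.x);
}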
62. Final Image DEMO. ~70 FPS on an HD5970 @ 1280x1024; ~300 million rays per second for indirect shadows. Ray casting costs ~9 ms.
63. Future directions. Speed up IL rendering: render IL at an even lower res; look into multi-res RSMs. Speed up ray tracing: per-pixel array of lists for depth buckets (see backup); other data structures. Ray trace other primitive types: splats, fuzzy ellipsoids, etc.; proxy geometry or bounding volumes of blockers. Get rid of Interlocked*() ops: just mark grid cells as occupied; lower quality, but could work on earlier hardware.
64. Q&A Holger Gruen holger.gruen@AMD.com Nicolas Thibieroz nicolas.thibieroz@AMD.com Credits for the basic idea of how to implement PPLL under Direct3D 11 go to Jakub Klarowicz (Techland),Holger Gruen and Nicolas Thibieroz (AMD)
66. Computing IL at a g-buf pixel 1. Want to support low-res RSMs and create smooth indirect light. The goal is bilinear filtering of four VPL kernels; otherwise results don't look smooth.
67. Computing IL at a g-buf pixel 2. stx: sub-texel x position [0.0, 1.0[. sty: sub-texel y position [0.0, 1.0[.
68. Computing IL at a g-buf pixel 3. For smooth IL one needs to consider four VPL kernels with centers at t0, t1, t2 and t3. stx: sub-texel x position [0.0, 1.0[. sty: sub-texel y position [0.0, 1.0[.
69.-72. Computing IL at a g-buf pixel 4-6. [Diagrams: the VPL kernels centered at t0, t1, t2 and t3 are shown one by one, overlapping the g-buffer pixel's position by the sub-texel offsets stx/sty.]
73. Computing IL at a g-buf pixel 7.

IndirectLight = (1.0f - sty) * ((1.0f - stx) * K(t0) + stx * K(t1)) +
                (0.0f + sty) * ((1.0f - stx) * K(t2) + stx * K(t3))

where K(ti) is the summed VPL contribution of the kernel centered at ti. Evaluation of 4 big VPL kernels is slow.
74.-75. Computing IL at a g-buf pixel 8-9. [Diagrams: the four overlapping kernels at t0-t3, then their union partitioned into nine disjoint blocks B0-B8.]
76. Computing IL at a g-buf pixel 9.

SmoothIndirectLight =
    (1.0f - sty) * (((1.0f - stx) * (B0 + B3) + stx * (B2 + B5)) + B1) +
    (0.0f + sty) * (((1.0f - stx) * (B6 + B3) + stx * (B8 + B5)) + B7) + B4

Evaluation of 7 small and 1 bigger VPL kernel sums is fast. stx: sub-texel x position [0.0, 1.0[. sty: sub-texel y position [0.0, 1.0[.
77.-78. Insert Tris into 2D Map of Lists of Tris. Rasterize blockers of IL from the view of the light, using a GS and conservative rasterization. [Diagram: a 2D buffer of lists of triangles written to by a scattering PS; eol = end of list (0xffffffff).]
80. Linked List Creation (2). For every pixel: calculate the pixel data (color, depth, etc.). Retrieve the current pixel count from the Fragment & Link UAV and increment the counter:

uint uPixelCount = FragmentAndLinkBuffer.IncrementCounter();

Swap offsets in the Start Offset UAV:

uint uOldStartOffset;
StartOffsetBuffer.InterlockedExchange(
    PixelScreenSpacePositionLinearAddress,
    uPixelCount, uOldStartOffset);

Add a new entry to the Fragment & Link UAV:

FragmentAndLinkBuffer_STRUCT Element;
Element.FragmentData = FragmentData;
Element.uNext = uOldStartOffset;
FragmentAndLinkBuffer[uPixelCount] = Element;
81. Linked List Creation (3d). [Diagram: the Start Offset Buffer and Fragment and Link Buffer after several fragments have been inserted at the same pixel locations.]
82. PPLL Pros and Cons. Pros: allows storing a variable number of elements per pixel; only one atomic operation per pixel (using the built-in HW UAV counter); good performance. Cons: traversal causes a lot of random accesses; a special case is needed for storing MSAA samples.
83. Alternative Technique: Per-Pixel Array. There is a two-pass method that allows the construction of per-pixel arrays, first described in 'Sample Based Visibility for Soft Shadows using Alias-free Shadow Maps' (E. Sintorn, E. Eisemann, U. Assarsson, EG 2008). It uses a prefix sum to allocate multiple entries into a buffer, so elements belonging to the same location are contiguous in memory: faster traversal (less random access). Not as fast as PPLL for scenes with 'high' depth complexity in our testing. (A sketch follows below.)
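A hedged sketch of the two-pass construction under assumed names (FragmentData_STRUCT, ComputeFragment() and SCREEN_WIDTH are reused from the deck's earlier code; the prefix-sum scan itself, a standard compute pass, is elided):

RWByteAddressBuffer FragmentCounts; // pass 1 output: fragment count per pixel
RWByteAddressBuffer WriteCursors;   // initialized to the prefix sum of FragmentCounts
RWStructuredBuffer<FragmentData_STRUCT> FragmentArray; // contiguous storage

// Pass 1: count fragments per pixel location
void PS_CountFragments(float4 vPos : SV_POSITION)
{
    uint uAddr = 4 * (SCREEN_WIDTH * uint(vPos.y) + uint(vPos.x));
    uint uPrev;
    FragmentCounts.InterlockedAdd(uAddr, 1, uPrev);
}

// (A prefix sum over FragmentCounts fills WriteCursors here, so each pixel
//  owns a contiguous range [start, start + count) in FragmentArray.)

// Pass 2: the geometry is drawn again; each fragment bumps its pixel's cursor
void PS_StoreFragments2(float4 vPos : SV_POSITION)
{
    uint uAddr = 4 * (SCREEN_WIDTH * uint(vPos.y) + uint(vPos.x));
    uint uSlot;
    WriteCursors.InterlockedAdd(uAddr, 1, uSlot);
    FragmentArray[uSlot] = ComputeFragment();
}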
84. Rendering Samples - Alternate Method. Writes pixels straight into a non-MSAA RT; only viable if access to samples is no longer required. Samples must be resolved into pixels before writing, which affects the manual blending process: samples with the same coverage are blended back-to-front... then averaged (resolved) with the other samples before being written out to the RT. This method can prove slightly faster than per-sample rendering.
85. Rendering Fragments - Alternate Method (2x MSAA)

// Retrieve current sample colors from background texture
float4 vCurrentColor0 = BackgroundTexture.Load(int3(vPos.xy, 0));
float4 vCurrentColor1 = BackgroundTexture.Load(int3(vPos.xy, 1));

// Rendering pixels using SRCALPHA-INVSRCALPHA blending
for (int k = 0; k < nNumPixels; k++)
{
    // Retrieve next unblended furthermost pixel
    float4 vPixColor = UnpackFromUint(SortedPixels[k].x);

    // Retrieve sample coverage
    uint uCoverage = UnpackCoverage(SortedPixels[k].y);

    // Manual blending between current samples and previous ones
    if (uCoverage & (1 << 0))
        vCurrentColor0.xyz = lerp(vCurrentColor0.xyz, vPixColor.xyz, vPixColor.w);
    if (uCoverage & (1 << 1))
        vCurrentColor1.xyz = lerp(vCurrentColor1.xyz, vPixColor.xyz, vPixColor.w);
}

// Return resolved and manually-blended color
// (averaging the two samples means multiplying by 0.5, not dividing by 0.5)
return (vCurrentColor0 + vCurrentColor1) * 0.5;
Editor's Notes
A node is composed of Element and Link
Uses a Pixel Shader 5.0 to store fragments into linked lists: because we need triangles rasterized as normal through the rendering pipeline. Briefly describe UAVs. Uses atomic operations: only 1 such operation per pixel is used (although it has feedback).
“Fragment & Link” buffer only use Write access actually.FragmentData: typically fragment color, but also transparency, depth, blend mode id, etc.Must be large enough to store all fragments: allocation must take into account maximum number of elements to store. E.g. “average overdraw per pixel” can be used for allocation: width*height*averageoverdraw*sizeof(Element&Link)Fragment data struct should be kept minimal to reduce performance requirements (and improve performance in the process by performing less memory operations)
StartOffsetBuffer is a 1D buffer despite being screen-sized; this is because atomic operations are not supported on 2D UAV textures. Initialized to the magic value: thus by default the linked lists are empty, since they point to -1 (end of list).
“Viewport” is used to illustrate incoming triangle (as no RT is bound)Here fragment data is just the pixel color for simplification.Start Offset Buffer is initialized to “magic value” (-1)
We’re going to exchange the offset in the StartOffsetBuffer with the current value of the counter (before incrementation).This is because the current value of the counter gives us the offset (address) of the fragment we’re storing.
Assuming a rasterization order of left to right for the green triangle
Assuming a rasterization order of left to right for the yellow triangle
The current pixel count is the TOTAL pixel count, not the pixel count per pixel location. Not using Append Buffers because they require the DXGI_FORMAT_R32_UNKNOWN format. Times 4 because the start offset address is in bytes, not UINTs (1 UINT = 4 bytes).
No "4x" to calculate the offset address here, because StartOffsetBufferSRV is declared with the uint type.
No point storing transparent fragments that are hidden by opaque geometry, i.e. that fail the depth test.
May also get away with packing color and depth into the same uint: this would make the element size a nice 64 bits. Deferred rendering engines may wish to store the same properties in the F&L structure as for opaque pixels, but watch out for the memory and performance impact of a large structure! If using different blend modes per pixel, the blend mode would need to be stored as an id in the fragment data. Depth is stored as a uint; we'll explain why later on.
The default DX11 pixel pipeline executes the pixel shader before the depth test. [earlydepthstencil] allows the depth test to be performed *before* the pixel shader. Normally the HW will try to do this for you automatically, but a PS writing to UAVs cannot benefit from this treatment (a bit like alpha testing). Allows performance savings and rendering correctness: fewer fragments to store = good!
Sparse memory accesses = slow!: especially true as original rendering order can mean that fragments belonging to the same screen coordinates can be (very) far apart from each other.
Under-blending can be used instead. No need for the background input then; the actual HW blending mode is enabled. The blending mode to use with under-blending is ONE, SRCALPHA.
SortedPixels[].x stores color, SortedPixels[].y stores depth. Use the (nNumPixels >= MAX_SORTED_PIXELS) test to prevent overflow. The actual sort algorithm used can be anything (e.g. insertion sort, etc.).
This happens after sorting:
vCurrentColor.xyz = vFragColor.w * vFragColor.xyz + (1.0 - vFragColor.w) * vCurrentColor.xyz;
The problem with OIT via PPLL and MSAA is that fragment sorting and blending must happen on a per-sample basis, which means a lot of storage and processing. Storing individual samples into linked lists requires a huge amount of memory: as most pixels are not edge pixels, they would get all their samples covered. Unfortunately, schemes mixing edge and non-edge pixels proved difficult to implement and did not generate satisfactory results. SV_COVERAGE is DX11 only.
Depth is now packed into 24 bits: this is why depth is stored as uint
Blue triangle covers one depth sample
Assuming a rasterization order of left to right for the yellow triangle
Traversal causes a lot of random access: elements belonging to the same pixel location can be far apart in the LL!
Not as fast: probably because multiple passes involved.
Only viable if access to samples is no longer required (e.g. for HDR-correct tone mapping). Sorting is unaffected because stored samples share the same depth. Samples with the same coverage are blended back-to-front: this means blending happens on a per-sample basis, as it normally would.
Code illustration for 2 samplesSortedPixels[] contains pixel color in .x and depth+coverage in .y