This document discusses optimizing post processing and deferred lighting on the PlayStation 3 by offloading work to the Synergistic Processing Units (SPUs). It provides examples of using the SPUs for depth of field pre-processing, screen space ambient occlusion, deferred lighting, volumetric lighting, and shadow mapping. The key advantages of using the SPUs include reducing GPU bottlenecks and improving performance by leveraging the SPUs' flexibility to optimize fragment processing through techniques such as tile-based classification and caching shadow map tiles in local memory. Offloading post-processing effects to the SPUs can significantly reduce frame times compared to performing an entire effect on the GPU.
This session presents a detailed programmer-oriented overview of our SPU-based shading system implemented in DICE's Frostbite 2 engine and how it enables more visually rich environments in BATTLEFIELD 3 and better performance than traditional GPU-only renderers. We explain in detail how our SPU tile-based deferred shading system is implemented, and how it supports rich material variety, High Dynamic Range lighting, and large numbers of light sources of different types through an extensive set of culling, occlusion and optimization techniques.
Taking Killzone Shadow Fall Image Quality Into The Next Generation – Guerrilla
This talk focuses on the technical side of Killzone Shadow Fall, the platform exclusive launch title for PlayStation 4.
We present the details of several new techniques that were developed in the quest for next generation image quality, and the talk uses key locations from the game as examples. We discuss interesting aspects of the new content pipeline, next-gen lighting engine, usage of indirect lighting and various shadow rendering optimizations. We also describe the details of volumetric lighting, the real-time reflections system, and the new anti-aliasing solution, and include some details about the image-quality driven streaming system. A common and very important theme of the talk is temporal coherency and how it was utilized to reduce aliasing and improve the rendering quality and image stability above the baseline 1080p resolution seen in other games.
Rendering Technologies from Crysis 3 (GDC 2013) – Tiago Sousa
This talk covers changes in CryENGINE 3 technology during 2012, with DX11 related topics such as moving to deferred rendering while maintaining backward compatibility on a multiplatform engine, massive vegetation rendering, MSAA support and how to deal with its common visual artifacts, among other topics.
This presentation gives an overview of the rendering techniques used in KILLZONE 2. We put the main focus on the lighting and shadowing techniques of our deferred shading engine and how we made them play nicely with anti-aliasing.
Next generation gaming brought high resolutions, very complex environments and large textures to our living rooms. With virtually every asset being inflated, it's hard to use traditional forward rendering and hope for rich, dynamic environments with extensive dynamic lighting. Deferred rendering, on the other hand, has traditionally been described as a nice technique for rendering scenes with many dynamic lights that unfortunately suffers from fill-rate problems and a lack of anti-aliasing; very few games that use it have been published.
In this talk, we will discuss our approach to this challenge and how we designed a deferred rendering engine that uses multi-sampled anti-aliasing (MSAA). We will give an in-depth description of each individual stage of our real-time rendering pipeline and the main ingredients of our lighting, post-processing and data management. We'll show how we utilize the PS3's SPUs for fast rendering of a large set of primitives, parallel processing of geometry and computation of indirect lighting. We will also describe our optimizations of the lighting and our parallel-split (cascaded) shadow map algorithm for faster and stable MSAA output.
Secrets of CryENGINE 3 Graphics Technology – Tiago Sousa
In this talk, the authors give an overview of the deferred lighting approach used in CryENGINE 3, along with an in-depth description of the many techniques used. Original file and videos at http://crytek.com/cryengine/presentations
Siggraph 2016 – The Devil is in the Details: idTech 666 – Tiago Sousa
A behind-the-scenes look into the latest renderer technology powering the critically acclaimed DOOM. The lecture covers how the technology was designed to balance visual quality against performance. Numerous topics are covered, among them details about the lighting solution, techniques for decoupling shading costs and frequency, and GCN-specific approaches.
Graphics Gems from CryENGINE 3 (Siggraph 2013) – Tiago Sousa
This lecture covers rendering topics related to Crytek’s latest engine iteration, the technology which powers titles such as Ryse, Warface, and Crysis 3. Among the covered topics, Sousa presented SMAA 1TX, an update featuring a robust and simple temporal antialiasing component; performant and physically-plausible camera-related post-processing techniques such as motion blur and depth of field were also covered.
Optimizing the Graphics Pipeline with Compute, GDC 2016 – Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation is targeting seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
Killzone Shadow Fall: Creating Art Tools For A New Generation Of Games – Guerrilla
This talk describes the tool improvements Guerrilla Games implemented to make Killzone Shadow Fall shine on the PlayStation 4. It highlights additions to the Maya pipeline, such as Viewport 2.0, Maya's coupling with in-game updates and in-engine deferred renderer features including real-time shadow-casting, volumetric lighting, hardware instancing, lens flares and color grading.
Bindless Deferred Decals in The Surge 2 – Philip Hammer
These are the slides for my talk at Digital Dragons 2019 in Krakow.
Update: The recordings are online on youtube now:
https://www.youtube.com/watch?v=e2wPMqWETj8
The presentation describes the physically based lighting pipeline of Killzone: Shadow Fall, a PlayStation 4 launch title. The talk covers the studio's transition to a new asset creation pipeline based on physical properties. Moreover, it describes the light rendering systems used in a new 3D engine built from the ground up for the upcoming PlayStation 4 hardware. A novel real-time lighting model simulating physically accurate area lights is introduced, as well as a hybrid ray-traced / image-based reflection system.
We believe that physically based rendering is a viable way to optimize asset creation pipeline efficiency and quality. It also enables the rendering quality to reach a new level that is highly flexible depending on art direction requirements.
Deferred Lighting and Post Processing on PLAYSTATION®3
1. Deferred Lighting and Post Processing on PLAYSTATION®3
Matt Swoboda
PhyreEngine™ Team
Sony Computer Entertainment Europe (SCEE) R&D
2. Where Are We Now?
• PS3 into its 3rd year
• Many developers on their 2nd generation engines
• Solved the basic problems
• SPUs STILL underused
– But it’s improving
3. But..
• GPU now the most common bottleneck
• Usually limited by fragment operations
• Many titles take > 1/3 of their time in post processing
• Most developers want to do even more fragment work
– More / heavier post processing effects
– Better lighting techniques / more lights / softer shadows
– Longer shaders
– Features ported from PC / other console hardware
4. “We fixed the vertex bottleneck..”
• Many possible solutions to improve geometry performance beyond just “optimising the shader”
– LOD
– Occlusion culling & visibility culling
– Move large vertex operations to SPU, e.g. skinning
– SPU triangle culling
5. What About Pixels?
• Fragment operations / post processing rarely optimised like geometry operations
– Throw whole operation at the GPU
– Same operation done for every pixel
– Spatial optimization / branching considered too slow
• SPU not considered: “too slow”, “uses too much bandwidth”
6. SPU pixel processing
• Yes, the SPU is fast enough to process pixels
• Won’t beat the GPU in a brute force race
• GPU specialises in rasterising triangles and sampling textures – has dedicated hardware
• SPU is a general purpose processor
– Use flexibility to your advantage
– Choose different code branches and fast paths
8. What to do on SPU
• Options:
– Offload whole processes from GPU to SPU
– Use SPU and GPU together to do one process
9. Depth Of Field Pre-Process
• High quality depth of field requires a long fragment shader
– Read depth samples and colour samples in a kernel / disc
– Check depths against centre pixel depth
– Weight colours by depth check results
• Wasteful for “most” of the screen
– All depth checks pass (out of focus) or all fail (in focus)
– All fail == pass through original buffer
– All pass == use pre-blurred buffer – separable gaussian blur
• Categorise the screen for these cases on SPU
10. Depth Of Field Classification Results
• Post process depth buffer
• Classify by min/max depth
• Green: fully in focus
• Blue: fully out of focus
• Red: neither fully in nor fully out
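The green/blue/red split above can be sketched as a small classification function. This is an illustrative sketch, not code from the talk: the tile depth bounds and the focal interval `[focus_near, focus_far]` are hypothetical inputs.

```python
def classify_dof_tile(min_depth, max_depth, focus_near, focus_far):
    """Classify one screen tile for depth of field from its depth range.

    'in_focus'  (green): every pixel in the focal range - pass through
    'out_focus' (blue):  every pixel outside it - use pre-blurred buffer
    'mixed'     (red):   tile straddles the boundary - full DOF kernel
    """
    if focus_near <= min_depth and max_depth <= focus_far:
        return 'in_focus'
    if max_depth < focus_near or min_depth > focus_far:
        return 'out_focus'
    return 'mixed'
```

Only 'mixed' tiles would pay for the expensive per-pixel depth-checked kernel; the other two categories take the fast paths the slide describes.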
11. Depth Of Field Pre-process Results
• Pre-process only on SPU, blur operations on GPU
– Goal: minimise overall frame time and latency
• Large blur w.r.t. depth
• 15 ms+ on GPU alone
• 1.5-2 ms on SPU + 3 ms on GPU
12. Screen Tile Classification
• Categorise the screen using the range of depth values within a tile
• Powerful technique with many applications
– Full screen effect optimization - DOF, SSAO..
– Soft particles
– Affecting lights
– Occluder information
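Gathering the per-tile depth range might look like the sketch below (plain Python for clarity; the talk's version would run vectorised on SPU over DMA-fetched tiles). The flat row-major depth list and the 16-pixel tile edge are illustrative assumptions.

```python
def tile_depth_ranges(depth, width, height, tile=16):
    """Compute (min, max) depth per screen tile from a flat depth buffer.

    `depth` is a row-major list of floats; `tile` is the tile edge in
    pixels. Returns a dict keyed by (tile_x, tile_y).
    """
    ranges = {}
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            lo, hi = float('inf'), float('-inf')
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    d = depth[y * width + x]
                    lo, hi = min(lo, d), max(hi, d)
            ranges[(tx // tile, ty // tile)] = (lo, hi)
    return ranges
```

Once computed, the same (min, max) table can drive all of the applications listed above: DOF classification, soft-particle tests, light culling, and occlusion.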
13. Screen Space Ambient Occlusion (SSAO)
• Generate an ambient occlusion approximation using the depth buffer alone
• Perform a large kernel-based series of depth comparisons and sum the results
• Downsample output to ½ size for performance
– Output normals for bilateral upsampling
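A toy version of the depth-comparison idea is shown below; it is only a sketch. The talk's kernel is larger, disc-shaped, depth-weighted and runs at half resolution on SPU, whereas this uses a small square kernel, a scalar bias, and a hypothetical occlusion rule (neighbour closer than centre counts as an occluder).

```python
def ssao_factor(depth, width, height, x, y, radius=2, bias=0.01):
    """Crude screen-space AO for one pixel: count how many neighbours
    in a small square kernel are closer to the camera than the centre
    pixel, and return 1 - occluded_fraction (1.0 = fully unoccluded).
    """
    centre = depth[y * width + x]
    occluded = total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue
            sx, sy = x + dx, y + dy
            if 0 <= sx < width and 0 <= sy < height:
                total += 1
                if depth[sy * width + sx] < centre - bias:
                    occluded += 1
    return 1.0 - occluded / total if total else 1.0
```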
14. SPU Screen Space Ambient Occlusion Results
• GPU version: 10 ms+
• SPU version: 6 ms on 2 SPUs
• Used in “Donkey Trader” PhyreEngine game template
16. Deferred Shading Overview
• Rasterise geometry information to multiple “GBuffers” (geometry buffers)
• Apply lighting and shading in a post process
18. 18
Deferred Lighting on SPU
• The SPU can handle the deferred lighting process
• The GPU renders the geometry to GBuffers
• SPU and GPU execute in parallel
– Total time: max(geometry, lighting)
Deferred Lighting on SPU: Implementation (1)
• Process each pixel once
• Work out which lights affect each pixel
• Apply the N affecting lights in a loop
• Process the screen in tiles
• Use classification techniques per tile to optimise
Deferred Lighting on SPU: Implementation (2)
• Calculate affecting lights per tile
– Build a frustum around the tile using the min and max depth values in that tile
– Perform a frustum check against each light’s bounding volume
– Compare the light direction with the tile’s average normal value
• Choose fast paths based on tile contents
– No lights affect the tile? Use a fast path
– Check material values to see if any pixels are marked as lit
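The per-tile light culling step can be sketched like this. As a simplification, the tile frustum is approximated by a view-space AABB (assumed to have been built already from the tile's screen bounds and min/max depth), and each point light by a bounding sphere:

```python
def sphere_intersects_aabb(centre, radius, box_min, box_max):
    """Standard closest-point sphere vs. axis-aligned box overlap test."""
    d2 = 0.0
    for c, lo, hi in zip(centre, box_min, box_max):
        if c < lo:
            d2 += (lo - c) ** 2
        elif c > hi:
            d2 += (c - hi) ** 2
    return d2 <= radius * radius

def cull_lights_for_tile(lights, box_min, box_max):
    """Return indices of point lights whose bounding sphere touches the tile.

    lights is a list of (centre, radius) pairs in view space; box_min/box_max
    is the tile's view-space AABB built from its screen bounds and min/max depth.
    """
    return [i for i, (centre, radius) in enumerate(lights)
            if sphere_intersects_aabb(centre, radius, box_min, box_max)]
```

A tile whose list comes back empty takes the no-lights fast path mentioned above.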
Deferred Lighting on SPU: Implementation (3)
• Choose whether to process MSAA per tile
– If no sample pair values differ, light only one sample from the pair; otherwise light both samples separately
– Typically only a few tiles need both MSAA samples lit
Tiles requiring MSAA
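The per-tile MSAA decision can be sketched as follows, assuming 2x MSAA with each pixel stored as a (sample0, sample1) pair:

```python
def tile_needs_msaa(sample_pairs):
    """True if any pixel's two MSAA samples differ (an edge crosses the tile)."""
    return any(s0 != s1 for s0, s1 in sample_pairs)

def light_cost(sample_pairs, cost_per_sample=1):
    """Relative lighting cost for the tile under the fast/slow path split."""
    if tile_needs_msaa(sample_pairs):
        return 2 * len(sample_pairs) * cost_per_sample  # light both samples
    return len(sample_pairs) * cost_per_sample          # light one, copy to both
```

Since triangle edges cover only a small fraction of most frames, the majority of tiles take the half-cost path.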
Deferred Lighting on SPU: Results
• 3 shadow-casting lights, 100 point lights
• 2x MSAA, 720p
– Lighting performed per sample
• Apply tone mapping on SPU
– Virtually free
• Performance: > 60 fps, 3 SPUs for 11ms each
– No MSAA: 2 SPUs for 11ms
Deferred Lighting on SPU: Issues
• Potential latency
– Must keep the GPU busy while the SPU process is running
– Render something else or add a frame of latency
• Main memory requirements
• Shadows
– Requires “random” texture access – not ideal for SPU
– Can render shadows on GPU to a full screen buffer and use it on SPU
Flavours of Deferred Lighting on SPU
• Full deferred render on SPU
– Input all GBuffers, output the final composited result
• Light pre-pass render on SPU
– Input normal and depth only; calculate the light result; sample it in a second geometry pass
• Light tile classification data output
– SPU outputs information per tile about affecting lights
– Do the lighting calculations on GPU
Volumetric Lighting
• Also known as “god rays” or “light beams”
• Simulates the effect of light illuminating dust particles in the air
• Numerous fakes exist
– Artist-placed geometry
– Artist-placed particles
• Better: generate using the shadow map
– Works in a “general case”
Volumetric Lighting
• Ray march through the shadow map
– Trace one ray per pixel in screen space
– Sample the depth buffer to determine the end of the ray
• Sample the shadow map at N points along the ray
– N ≈ 50
– Attenuate and sum the number of samples that passed
• Blur and add noise
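The marching loop can be sketched as follows. The shadow-map projection is abstracted behind an `in_light` callback, N is 50 as in the slides, and the linear distance attenuation is an illustrative choice:

```python
N_SAMPLES = 50  # samples per ray, as per the slides

def ray_march_scattering(ray_start, ray_end, in_light):
    """Accumulate attenuated in-light samples along one screen-space ray.

    ray_start/ray_end are 3-tuples (ray_end comes from the depth buffer);
    in_light(point) -> bool stands in for the shadow-map lookup.
    """
    total = 0.0
    for i in range(N_SAMPLES):
        t = (i + 0.5) / N_SAMPLES
        p = tuple(a + (b - a) * t for a, b in zip(ray_start, ray_end))
        if in_light(p):
            total += 1.0 - t  # attenuate samples further from the camera
    return total / N_SAMPLES
```

The result is then blurred and dithered with noise, which is why a heavily downsampled shadow map is accurate enough.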
Volumetric Lighting
• Effect is a bit too slow to be practical on GPU: ~5ms
• Do it on SPU instead
• Parallelises with GPU easily
– Result needed late in the render at compositing stage
– Only needs depth and shadow map inputs
• Problem: must randomly sample from the shadow map
Texture sampling on SPU
• “Random access” texture sampling is bad for SPU
• It’s bad for GPU, too, but sometimes you just have to do it
• GPU:
– Fast access from texture cache; cache miss is slow
– Dedicated hardware handles lookups, filtering and wrapping
• SPU:
– Fast access from “texture cache” (SPU local memory)
– Slow access on cache miss (DMA from main memory)
– Cache lookups slow (no dedicated hardware)
– Must manually handle filtering and wrapping (again, slow)
Texture sampling on SPU
• Either:
– Make the texture entirely fit in SPU local memory
– Problem solved!
– Still inefficient: random accesses reduce register parallelism
• Or
– Write a very good software cache
– Locate potential cache misses early - long before you need the values
– Avoid branches in sampling code
Volumetric Lighting on SPU
• Volumetric light result will be blurred
– Don’t need full shadow map accuracy
– No filtering on texture samples needed
• Downsample shadow map from 1024x1024, 32-bit to 256x256, 16-bit
– 128KB – fits in SPU local memory
• Fast enough to sample on SPU
Shadow Mapping on SPU (1)
• Needs the full-size shadow map
– 1024x1024x32-bit == 4MB: won’t fit in SPU local memory
– We’ll have to write that “very good software cache”, then
• Pre-process the shadow map on SPU
– Calculate min and max depth for each tile
– Store in a low resolution depth hierarchy map
– Output the high resolution shadow map as cache tiles
Shadow Mapping on SPU (2)
• Software cache with 32 entries
– Each entry is a shadow map tile
– Branchless determination of cache entry index from tile index
• Locate cache misses early
– While detiling depth data, work out the required shadow tiles
– Pull in all cache-missed tiles
• Sample the shadow map during lighting calculations
– All required shadow tiles are now definitely in cache – lookup is branchless
• It’s quite slow
– Locate tile in cache per pixel
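The two-phase cache can be sketched like this: a direct-mapped cache where the entry index comes straight from the tile index (so the lookup needs no branching on a miss), with all misses resolved in a prefetch pass before the lighting loop samples anything. The direct-mapped policy and the `fetch_tile` callback (standing in for a DMA from main memory) are assumptions for illustration:

```python
NUM_ENTRIES = 32  # cache entries, as per the slides

class ShadowTileCache:
    def __init__(self, fetch_tile):
        self.fetch_tile = fetch_tile       # stands in for a DMA from main memory
        self.tags = [-1] * NUM_ENTRIES     # which tile index each entry holds
        self.data = [None] * NUM_ENTRIES

    def prefetch(self, tile_indices):
        """Resolve all cache misses up front (the slow, branchy part)."""
        for t in tile_indices:
            entry = t & (NUM_ENTRIES - 1)  # direct-mapped: entry from tile index
            if self.tags[entry] != t:
                self.tags[entry] = t
                self.data[entry] = self.fetch_tile(t)

    def lookup(self, tile_index):
        """Branchless in spirit: after prefetch(), the tile is guaranteed present."""
        return self.data[tile_index & (NUM_ENTRIES - 1)]
```

The prefetch list comes from the depth-detiling pass, so by the time the lighting loop runs, every lookup hits.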
Shadow Mapping on SPU (3)
• Optimise via special cases to win back performance
• Use the low resolution shadow tile map
– Always in SPU local memory
– If pixel shadow z > tile max z: definitely in shadow
– If pixel shadow z < tile min z: definitely not in shadow
• Check the low resolution map before triggering cache fetches
• Classify whole screen tiles as in or out of shadow
– No need to sample the high resolution shadow map at all for those tiles
Tiles requiring high resolution shadow samples
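The min/max early-out reduces to a three-way classification. A standard depth shadow map is assumed here, where a larger stored depth means further from the light:

```python
def classify_shadow(pixel_z, tile_min_z, tile_max_z):
    """Classify a pixel against a low resolution shadow tile's depth range.

    pixel_z is the pixel's depth in shadow-map space; tile_min_z/tile_max_z
    are the min/max occluder depths stored for that shadow tile.
    """
    if pixel_z > tile_max_z:
        return "shadowed"    # further than every occluder in the tile
    if pixel_z < tile_min_z:
        return "lit"         # nearer than every occluder in the tile
    return "uncertain"       # must fetch and sample the high resolution tile
```

Only the "uncertain" pixels trigger cache fetches and high resolution samples, which is what makes SPU shadow sampling practical.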
Conclusion
• New additions to your toolbox:
– Tile-based classification techniques on SPU
– Deferred lighting on SPU
– Texture sampling on SPU
• Rendering is no longer just a GPU problem
– Use general purpose nature of the SPU to your advantage
• Rethink fragment processing optimisation strategies
– Make the GPU work smarter, not harder
Conclusion
• Some titles are already using SPU post processing
– Killzone 2
• PhyreEngine™ is here to help
– (If you’re a registered PS3 developer) it’s on DevNet now
– Not just an engine: also a reference
– Comes with full source
– Download it, learn from it, steal bits of the code
– Check out the PhyreEngine™ SPU Post Processing Library
Editor's Notes
The typical performance limitation on PS3 titles has moved from a CPU bottleneck to a GPU bottleneck – specifically fragment operations.
There’s a large range of techniques that can be applied to optimise vertex performance, often focusing on only drawing what you can see, and spending the most time drawing the most important things.
In addition the SPU has successfully been used to perform vertex processing.
Fragment operations are usually applied in a brute force manner. Techniques for spending the most time only where it makes the most difference – e.g. edge cases – are rarely used because of the need for branching and potentially complex pre-passes. The SPU is rarely used for pixel processing because it’s perceived as being too slow or needing too much bandwidth.
The SPU is fast enough to perform some pixel processing tasks. Bandwidth is rarely an issue – every post process we’ve ever developed so far on SPU has been cycle limited, not bandwidth limited – the PS3 bus really is that fast.
The GPU is very good at specific tasks, such as rasterisation and texture sampling, but it has strict limitations.
The SPU is a general purpose processor – you can read and write data in any order you like, apply branches anywhere you like, and use fast paths to optimise the process.
A brief run-through of some post processes we’ve implemented on SPU, to contrast the differing approaches used.
• Use SPU to optimise GPU operations
– Depth buffer tile classification
– DXT compress render targets on SPU
• Use SPU to perform processes more suitable for the SPU architecture
– Summed area table generation – much easier on SPU than GPU
– Deferred lighting – SPU can pick fast paths per block of pixels
• Use SPU to perform processes to offload from GPU
– Screen space ambient occlusion, volumetric lighting
• Operations on SPU work in parallel with the GPU doing other work – minimising time on the critical path
Depth of field is a very desirable but potentially slow post process.
The process works by performing a blur where each weight in the kernel is scaled by a function of the difference between the kernel sample depth and pixel depth. As such the kernel can’t be separated, and the blur shader is therefore quite long and slow.
This is in fact a waste of time for much of the screen. For most pixels, the depth differences are such that the weights are all 1 or 0. If we can detect areas like this we can run those areas through considerably shorter shaders and greatly reduce execution time.
We can perform this classification process on SPU.
The process reads the depth buffer, processes it in tiles, classifies the results and outputs three lists of point sprites – one for each classification type. The lists are then rendered on the GPU using different shaders to perform the effect.
For more detail about this process please refer to my SCEE DevStation 2008 presentation on SPU post processing.
The result is that using a version of the effect with classification techniques is massively faster than one without.
The performance timings can be scaled down if the original shader used is simpler.
Soft particles: determine which particles need to be handled as “soft” by checking depth min/max tiles. Only those intersecting need to be handled as “soft”. The rest can be handled as regular particles or even avoided completely if they are totally obscured by the depth.
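The soft-particle classification in the note above reuses the same per-tile depth min/max data; a sketch, assuming larger depth values are further from the camera:

```python
def classify_particle(particle_near, particle_far, tile_min_depth, tile_max_depth):
    """Decide how a particle quad overlapping this screen tile must be drawn."""
    if particle_far < tile_min_depth:
        return "regular"  # entirely in front of all geometry: no soft blending needed
    if particle_near > tile_max_depth:
        return "culled"   # entirely behind the depth buffer: totally obscured
    return "soft"         # intersects scene depth: needs the depth-reading soft path
```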
For SSAO we perform the entire process on the SPU. The effect is generated and output to a texture which is sampled during rendering on the GPU.
The process was able to be kicked off after the depth pre pass, and parallelised with the shadow rendering pass on the GPU. The results were then available to be read during the main render pass.
• Rasterise geometry information to multiple “GBuffers” (geometry buffers)
– Colour, normal, depth, specular and material information
• Apply lighting and shading as post processes
– Multiple lights and spatial optimisations are easy
– Fewer shader combinations required
• Has some negative points
– GBuffers consume memory and bandwidth
– MSAA is problematic
– Fixed BRDFs
Demo showing deferred lighting on the SPU.
The SPU handles the deferred lighting process usually done on GPU.
Parallelise the lighting process across multiple SPUs to improve performance.
The GPU handles rasterisation processes - like rendering GBuffers, shadow maps, alpha passes, reflection geometry etc.
SPU may be slower than GPU at light processing, but it’s faster than doing it all in serial on GPU.
By moving such a large body of work off the GPU, we can greatly increase the overall frame rate if the GPU is the bottleneck.
To handle multiple lighting models, calculate all the different models and select based on light type. Branch per tile to optimise the set of light types used.
Comparing the light direction with tile average normal value can avoid lights behind walls and so on.
MSAA - why does this work? If there are no triangle edges or intersections, both samples rendered from the colour output will contain the same value.
The resulting performance is massively increased compared to a GPU-only equivalent.
Tone mapping is virtually free on SPU – accumulation is just a sum of all pixel luminance, easily rolled into lighting calculations. The current frame’s tone mapping is applied using the eye adapted value from the previous frame. This maps much better to SPU than GPU.
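The folded-in tone mapping described above can be sketched as follows. The luminance weights, Reinhard-style key value of 0.5 and the single-frame adaptation are illustrative assumptions, not necessarily what PhyreEngine ships:

```python
def luminance(rgb):
    """Rec. 709 luminance of a linear RGB triple."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def light_and_accumulate(pixels, exposure_from_last_frame):
    """Tone map with last frame's exposure while summing this frame's luminance.

    Returns (tone-mapped pixels, exposure to use next frame). The luminance
    sum is rolled into the per-pixel loop, as on the SPU.
    """
    total = 0.0
    out = []
    for rgb in pixels:
        total += luminance(rgb)  # accumulation folded into the lighting loop
        out.append(tuple(min(1.0, c * exposure_from_last_frame) for c in rgb))
    avg = total / len(pixels)
    return out, 0.5 / max(avg, 1e-6)  # assumed key value of 0.5
```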
Why not roll in colour correction and other post processing operations – e.g. a depth of field pre-process - to the full deferred render solution?
GBuffers must be in main memory to be read by SPU – potentially requiring a lot of main memory.
Find something else for the GPU to do that doesn’t depend on SPU results - alpha geometry, reflections, shadow maps, effects?
Otherwise, add a frame of latency.
Different options exist depending on your limitations.
Potentially very slow. 50 texture samples times 1280x720 pixels? Downsample to ¼ width and height – result is blurred anyway.
There is a demo of this effect running on GPU in the NVIDIA SDK.
Even after considerable down-sampling the effect still takes over 5ms on the GPU – too slow for our situation. So we decided to implement it on SPU instead.
Unfortunately the effect requires random sampling of a shadow map texture – something which is difficult to map to the SPU.
To work out the best way to do SPU texture sampling, consider the GPU. The GPU has a texture cache which stores a small portion of the texture in fast-to-access memory, and the rest in a slower, larger memory buffer elsewhere (main memory or VRAM). On the SPU, that texture cache is the SPU’s local store. Accesses to this are fast, but if the data is not in cache it must be pulled in from main memory by DMA – which is slower. Also there is no dedicated hardware to manage the texture lookup – everything must be emulated in software.
If the texture can fit entirely in the SPU’s local store, we can avoid the whole texture cache issue. If not, we have to write a software cache that can handle it.
This software cache must be branchless for lookups, otherwise performance of the calling code will be destroyed. This implies that cache misses must be caught and resolved early, so there are no DMAs in the main processing loop either.
Fortunately for volumetric lighting, the whole shadow map can be made to fit in SPU local memory by downsampling and reducing it.
The effect is fast and parallel enough to run on an SPU in the background while other work is done on the GPU.
Low resolution format: 64x64 for a 1k x 1k shadow map.
Output the high resolution shadow map in a series of 16x16 tiles – they map to cache pages 1k in size.
The low resolution shadow min/max depth map can be used to greatly optimise the process by skipping high resolution shadow reads where the whole shadow tile is all in or out of shadow, and skipping shadow lookups entirely where the whole screen tile is in or out of shadow. By doing all this we can achieve good performance – good enough to make it practical to sample shadow maps on the SPU.
This screen tile information can be output in a pre-process step similar to the depth of field classification and used for deferred shadowing on the GPU – only sampling the shadow map on GPU for the edge cases where the tile falls on the border of in and out of shadow. The rest of the screen can run through a fast path. This can greatly optimise the performance of deferred shadowing.
Key takeaways:
• Reconsider your approach to fragment processing operations.
• Use tile-based classification on the SPU to optimise heavy fragment processes.
• Move 2D fragment processing operations such as post process effects or deferred lighting to the SPU.
• Texture sampling on the SPU is possible too.
Much of the work you need for SPU post processing has already been done for you – download PhyreEngine and you’ll find a complete engine with full source code which implements the effects in this presentation. It also provides the necessary GPU/SPU sync framework and many useful utilities to aid post processing – such as de-tiling of main memory render targets.