Game engines have long been at the forefront of taking advantage of the ever-increasing parallel compute power of both CPUs and GPUs. This talk is about how parallel compute is utilized in practice on multiple platforms today in the Frostbite game engine, and how we think the parallel programming models, hardware and software in the industry should look in the next 5 years to help us make the best games possible.
Optimizing the Graphics Pipeline with Compute, GDC 2016 - Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation targets seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
A technical deep dive into the DX11 rendering in Battlefield 3, the first title to use the new Frostbite 2 Engine. Topics covered include DX11 optimization techniques, efficient deferred shading, high-quality rendering and resource streaming for creating large and highly-detailed dynamic environments on modern PCs.
Secrets of CryENGINE 3 Graphics Technology - Tiago Sousa
In this talk, the authors give an overview of the deferred lighting approach used in CryENGINE 3, along with an in-depth description of the many techniques used. Original file and videos at http://crytek.com/cryengine/presentations
In this technical presentation Johan Andersson shows how the Frostbite 3 game engine is using the low-level graphics API Mantle to deliver significantly improved performance in Battlefield 4 on PC and future games from Electronic Arts. He will go through the work of bringing over an advanced existing engine to an entirely new graphics API, the benefits and concrete details of doing low-level rendering on PC and how it fits into the architecture and rendering systems of Frostbite. Advanced optimization techniques and topics such as parallel dispatch, GPU memory management, multi-GPU rendering, async compute & async DMA will be covered as well as sharing experiences of working with Mantle in general.
The goal of this session is to demonstrate techniques that improve GPU scalability when rendering complex scenes. This is achieved through a modular design that separates the scene graph representation from the rendering backend. We will explain how the modules in this pipeline are designed and give insights into implementation details, which leverage the GPU's compute capabilities for scene graph processing. Our modules cover topics such as shader generation for improved parameter management, synchronizing updates between scene graph and rendering backend, as well as efficient data structures inside the renderer.
Video here: http://on-demand.gputechconf.com/gtc/2013/video/S3032-Advanced-Scenegraph-Rendering-Pipeline.mp4
Ever wondered how to use modern OpenGL in a way that radically reduces driver overhead? Then this talk is for you.
John McDonald and Cass Everitt gave this talk at Steam Dev Days in Seattle on Jan 16, 2014.
Bindless Deferred Decals in The Surge 2 - Philip Hammer
These are the slides for my talk at Digital Dragons 2019 in Krakow.
Update: The recordings are online on YouTube now:
https://www.youtube.com/watch?v=e2wPMqWETj8
Siggraph 2016 - The Devil is in the Details: idTech 666 - Tiago Sousa
A behind-the-scenes look at the latest renderer technology powering the critically acclaimed DOOM. The lecture will cover how the technology was designed to balance visual quality and performance. Numerous topics will be covered, among them details of the lighting solution, techniques for decoupling costs from frequency, and GCN-specific approaches.
Talk by Yuriy O’Donnell at GDC 2017.
This talk describes how Frostbite handles rendering architecture challenges that come with having to support a wide variety of games on a single engine. Yuriy describes their new rendering abstraction design, which is based on a graph of all render passes and resources. This approach allows implementation of rendering features in a decoupled and modular way, while still maintaining efficiency.
A graph of all rendering operations for the entire frame is a useful abstraction. The industry can move away from “immediate mode” DX11 style APIs to a higher level system that allows simpler code and efficient GPU utilization. Attendees will learn how it worked out for Frostbite.
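The slides themselves aren't reproduced here, but as a rough illustration of the pass/resource-graph idea the abstract describes, a hypothetical miniature might look like the sketch below. All names and the API shape are invented for illustration; this is not Frostbite's FrameGraph interface.

    #include <functional>
    #include <string>
    #include <vector>

    // Passes declare the resources they read and write; the graph is built
    // for the whole frame, then compiled and executed.
    struct RenderPass {
        std::string           name;
        std::vector<int>      reads;    // resource handles consumed
        std::vector<int>      writes;   // resource handles produced
        std::function<void()> execute;  // recorded GPU work
    };

    class FrameGraph {
    public:
        int createResource() { return m_resourceCount++; }
        void addPass(RenderPass pass) { m_passes.push_back(std::move(pass)); }

        void compileAndExecute() {
            // A real implementation would topologically sort passes by their
            // resource dependencies, cull passes whose outputs are never read,
            // and derive barriers/memory aliasing; here we just run in order.
            for (const RenderPass& pass : m_passes)
                pass.execute();
        }
    private:
        int m_resourceCount = 0;
        std::vector<RenderPass> m_passes;
    };

    // Usage idea:
    //   FrameGraph fg;
    //   int gbuffer = fg.createResource();
    //   fg.addPass({"GBuffer", {}, {gbuffer}, [] { /* record draws */ }});
    //   fg.addPass({"Lighting", {gbuffer}, {}, [] { /* record dispatch */ }});
    //   fg.compileAndExecute();

The point of the design, as the abstract notes, is that declaring reads and writes up front gives the engine a whole-frame view, so features can be added as decoupled passes while the compile step keeps execution efficient.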
This session presents a detailed, programmer-oriented overview of our SPU-based shading system implemented in DICE's Frostbite 2 engine and how it enables more visually rich environments in BATTLEFIELD 3 and better performance than traditional GPU-only renderers. We explain in detail how our SPU tile-based deferred shading system is implemented, and how it supports rich material variety, high dynamic range lighting, and large numbers of light sources of different types through an extensive set of culling, occlusion and optimization techniques.
Course presentation at SIGGRAPH 2014 by Charles de Rousiers and Sébastien Lagarde at Electronic Arts about transitioning the Frostbite game engine to physically-based rendering.
Make sure to check out the 118-page course notes at: http://www.frostbite.com/2014/11/moving-frostbite-to-pbr/
During the last few months, we have revisited the concept of image quality in Frostbite. The core of our approach was to get as close as possible to a cinematic look, using reference images to evaluate the accuracy of the images we produced. Physically based rendering (PBR) was the natural way to achieve this. This talk covers all the different steps needed to switch a production engine to PBR, including the small details often bypassed in the literature.
The state of the art in real-time PBR techniques allowed us to achieve good overall results, but not without production issues. We present techniques for improving convolution time for image-based reflections, proper ambient occlusion handling, and coherent lighting units, which are mandatory for level editing.
Moreover, we have managed to reduce the quality gap, highlighted by our systematic reference comparison, in particular related to rough material handling, glossy screen space reflection, and area lighting.
The technical part of PBR is crucial for achieving good results, but represents only the tip of the iceberg. Frostbite has become the de facto high-end game engine within Electronic Arts and is now used by a large number of game teams. Moving all these game teams from "old-fashioned" lighting to PBR has required a lot of education, which has been done in parallel with the technical development. We have provided editing and validation tools to help the transition of art production. In addition, we have built a flexible material parametrisation framework to adapt to the various authoring tools and game teams' requirements.
Graphics Gems from CryENGINE 3 (Siggraph 2013) - Tiago Sousa
This lecture covers rendering topics related to Crytek's latest engine iteration, the technology which powers titles such as Ryse, Warface, and Crysis 3. Among the covered topics, Sousa presented SMAA 1TX, an update featuring a robust and simple temporal antialiasing component, as well as performant and physically-plausible camera-related post-processing techniques such as motion blur and depth of field.
A description of the next-gen rendering technique called the Triangle Visibility Buffer. It handles up to 10x-20x more geometry than deferred rendering at much higher resolutions, and it generally aligns better with the memory access patterns of modern GPUs than deferred lighting variants such as clustered deferred lighting.
Talk from SIGGRAPH 2010 and the Beyond Programmable Shading course.
Also see publications.dice.se for more material and other DICE talks.
Talk by Graham Wihlidal (Frostbite Labs) at GDC 2017.
Checkerboard rendering is a relatively new technique, popularized recently by the introduction of the PlayStation 4 Pro. Many modern game engines are adding support for it right now, and in this talk, Graham will present an in-depth look at the new implementation in Frostbite, which is used in shipping titles like 'Battlefield 1' and 'Mass Effect Andromeda'. Despite being conceptually simple, checkerboard rendering requires a deep integration into the post-processing chain, in particular temporal anti-aliasing, dynamic resolution scaling, and poses various challenges to existing effects. This presentation will cover the basics of checkerboard rendering, explain the impact on a game engine that powers a wide range of titles, and provide a detailed look at how the current implementation in Frostbite works, including topics like object id, alpha unrolling, gradient adjust, and a highly efficient depth resolve.
Talk by Johan Andersson (DICE/EA) in the Beyond Programmable Shading Course at SIGGRAPH 2012.
The other talks in the course can be found here: http://bps12.idav.ucdavis.edu/
Talk by Fabien Christin from DICE at GDC 2016.
Designing a big city that players can explore by day and by night, while improving on the unique visual style of the first Mirror's Edge game, isn't an easy task.
In this talk, the tools and technology used to render Mirror's Edge: Catalyst will be discussed. From the physical sky to the reflection tech, the speakers will show how they tamed the new Frostbite 3 PBR engine to deliver realistic images with stylized visuals.
They will talk about the artistic and technical challenges they faced and how they tried to overcome them, from the simple light settings and Enlighten workflow to character shading and color grading.
Takeaway
Attendees will get insight into the technical and artistic techniques used to create a dynamic time-of-day system with updating radiosity and reflections.
Intended Audience
This session is targeted at game artists, technical artists and graphics programmers who want to know more about Mirror's Edge: Catalyst rendering technology, lighting tools and shading tricks.
This talk presents the approach Frostbite took to add support for HDR displays. It will summarize Frostbite's previous post-processing pipeline and what its issues were. Attendees will learn about the decisions made to fix these issues, improve the color grading workflow and support high-quality HDR and SDR output. This session will detail the display mapping used to implement the "grade once, output many" approach to targeting any display, and why an ad-hoc approach was chosen as opposed to filmic tone mapping. Frostbite retained 3D LUT-based grading flexibility, and the accuracy differences of computing these in decorrelated color spaces will be shown. This session will also cover the main issues found in early adopter games, differences between HDR standards, optimizations to achieve performance parity with the legacy path, and why supporting HDR can also improve the SDR version.
Takeaway
Attendees will learn how and why Frostbite chose to support High Dynamic Range (HDR) displays. They will understand the issues faced and how these were resolved. This talk will be useful for those yet to support HDR and provide discussion points for those who already do.
Intended Audience
The intended audience is primarily rendering engineers, technical artists and artists; specifically those who focus on grading and lighting and those interested in HDR displays. Ideally attendees will be familiar with color grading and tonemapping.
Audio for Multiplayer & Beyond - Mixing Case Studies From Battlefield: Bad Co... - Electronic Arts / DICE
Learnings from creating soundscapes for online multiplayer games, with experiences from the Battlefield series and an emphasis on Battlefield: Bad Company.
Presentation from DICE Coder's Day (2010 November) by Andreas Fredriksson in the Frostbite team.
Goes into detail about Scope Stacks, which are a systems programming tool for memory layout that provides
• Deterministic memory map behavior
• Single-cycle allocation speed
• Regular C++ object life cycle for objects that need it
This makes it very suitable for games.
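As a rough sketch of the idea (an assumption-laden reconstruction, not the code from the talk), a scope stack can be built from a linear bump allocator plus a chain of finalizers, so allocation stays a pointer add while objects that need it still get a regular C++ life cycle:

    #include <cstddef>
    #include <cstdint>
    #include <new>

    // Linear "bump" allocator over a fixed budget: allocation is a pointer
    // add, which is where the single-cycle allocation speed comes from.
    class LinearAllocator {
    public:
        LinearAllocator(void* memory, size_t size)
            : m_ptr(static_cast<char*>(memory)), m_end(m_ptr + size) {}

        void* allocate(size_t size, size_t align = 16) {
            char* p = alignUp(m_ptr, align);
            if (p + size > m_end)
                return nullptr;       // over budget: a real engine fails hard here
            m_ptr = p + size;
            return p;
        }
        char* mark() const { return m_ptr; }
        void rewind(char* mark) { m_ptr = mark; }

    private:
        static char* alignUp(char* p, size_t a) {
            uintptr_t v = reinterpret_cast<uintptr_t>(p);
            return reinterpret_cast<char*>((v + a - 1) & ~(uintptr_t(a) - 1));
        }
        char* m_ptr;
        char* m_end;
    };

    // A scope remembers the rewind point and keeps a chain of finalizers,
    // so objects that need destructors get them; POD allocations skip the
    // finalizer entirely and cost nothing beyond the bump.
    class Scope {
    public:
        explicit Scope(LinearAllocator& a)
            : m_alloc(a), m_mark(a.mark()), m_finalizers(nullptr) {}

        ~Scope() {
            for (Finalizer* f = m_finalizers; f; f = f->next)
                f->destruct(f + 1);   // object lives right after its finalizer
            m_alloc.rewind(m_mark);   // free everything in one deterministic step
        }

        template <typename T> T* newObject() {
            void* mem = m_alloc.allocate(sizeof(Finalizer) + sizeof(T));
            Finalizer* f = static_cast<Finalizer*>(mem);
            T* obj = new (f + 1) T();
            f->destruct = [](void* p) { static_cast<T*>(p)->~T(); };
            f->next = m_finalizers;
            m_finalizers = f;
            return obj;
        }
        void* allocRaw(size_t size) { return m_alloc.allocate(size); }

    private:
        struct Finalizer {
            void (*destruct)(void*);
            Finalizer* next;
        };
        LinearAllocator& m_alloc;
        char* m_mark;
        Finalizer* m_finalizers;
    };

Because scopes nest and unwind in strict stack order, the memory map is deterministic frame to frame, which is exactly the property that makes this attractive for games.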
How High Dynamic Range Audio Makes Battlefield: Bad Company Go BOOM - Anders Clerwall
Slides from my lecture at GDC09.
Right now the second video refuses to play, but I'll try to get that fixed. For now, watch it at: http://www.youtube.com/watch?v=o7TJjlFSYeM
With the highest-quality video options, Battlefield 3 renders its Screen-Space Ambient Occlusion (SSAO) using the Horizon-Based Ambient Occlusion (HBAO) algorithm. For performance reasons, the HBAO is rendered in half resolution using half-resolution input depths. The HBAO is then blurred in full resolution using a depth-aware blur. The main issue with such low-resolution SSAO rendering is that it produces objectionable flickering for thin objects (such as alpha-tested foliage) when the camera and/or the geometry are moving. After a brief recap of the original HBAO pipeline, this talk describes a novel temporal filtering algorithm that fixed the HBAO flickering problem in Battlefield 3 with a 1-2% performance hit in 1920x1200 on PC (DX10 or DX11). The talk includes algorithm and implementation details on the temporal filtering part, as well as generic optimizations for SSAO blur pixel shaders. This is a joint work between Louis Bavoil (NVIDIA) and Johan Andersson (DICE).
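As a rough illustration of the temporal filtering idea described above (not the exact Battlefield 3 shader; the names, constants and depth convention are all illustrative), the core of such a filter is a history blend with depth-based rejection:

    #include <cmath>

    // Blend this frame's (noisy, half-resolution) AO with the reprojected
    // history, but reject the history when the stored depth no longer
    // matches the current depth (disocclusion), which prevents ghosting.
    float temporalFilterAO(float aoCurrent, float depthCurrent,
                           float aoHistory, float depthHistory,
                           bool historyValid)
    {
        const float depthTolerance = 0.05f; // relative depth rejection threshold
        const float historyWeight  = 0.9f;  // how much converged history to keep

        bool depthMatches =
            std::fabs(depthHistory - depthCurrent) < depthTolerance * depthCurrent;
        if (!historyValid || !depthMatches)
            return aoCurrent; // fall back to the raw single-frame result

        // Exponential moving average: a stable AO value over time removes the
        // frame-to-frame flicker on thin geometry such as alpha-tested foliage.
        return historyWeight * aoHistory + (1.0f - historyWeight) * aoCurrent;
    }

The stable history is what suppresses the flicker from half-resolution sampling, while the depth test keeps the result responsive when the camera or geometry moves.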
Presentation by Andrew Hamilton and Ken Brown from DICE at GDC 2016.
Photogrammetry has started to gain steam within the Games Industry in recent years. At DICE, this technique was first used on Battlefield and they fully embraced the technology and workflow for Star Wars: Battlefront. This talk will cover their research and development, planning and production, techniques, key takeaways and plans for the future. The speakers will cover photogrammetry as a technology, but more than that, show that it's not a magic bullet but instead a tool like any other that can be used to help achieve your artistic vision and craft.
Takeaway
Come and learn how (and why) photogrammetry was used to create the world of Star Wars. This talk will cover Battlefront's use of the technology from pre-production to launch, as well as some of their philosophies around photogrammetry as a tool. Many visuals will be included!
Intended Audience
A content-creator-friendly talk intended for pretty much any developer, especially those involved in 3D content creation. It is not a technical talk focused on the code or engineering of photogrammetry. The speakers will quickly cover all the basics, so absolutely no prerequisite knowledge is required.
Scalability for All: Unreal Engine* 4 with Intel - Intel® Software
Unreal Engine* 4 is a high-performance game engine for game developers. Learn how Intel and Epic Games* worked together to improve engine performance both for CPUs and GPUs and how developers can take advantage of it.
Technical talk from the AMD GPU14 Tech Day by Johan Andersson of the Frostbite team at DICE/EA about Battlefield 4 on PC, the first title to use 'Mantle', a very high-performance, low-level graphics API developed in close collaboration between AMD and DICE/EA to get the absolute best performance and experience in Frostbite games on PC.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/luxoft/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Alexey Rybakov, Senior Director at LUXOFT, presents the "Making Computer Vision Software Run Fast on Your Embedded Platform" tutorial at the May 2016 Embedded Vision Summit.
Many computer vision algorithms perform well on desktop class systems, but struggle on resource constrained embedded platforms. This how-to talk provides a comprehensive overview of various optimization methods that make vision software run fast on low power, small footprint hardware that is widely used in automotive, surveillance, and mobile devices. The presentation explores practical aspects of deep algorithm and software optimization such as thinning of input data, using dynamic regions of interest, mastering data pipelines and memory access, overcoming compiler inefficiencies, and more.
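As a generic illustration of the "dynamic regions of interest" method mentioned above (not code from the talk; all names here are invented), the idea is to restrict per-frame work to a rectangle predicted from the previous detection instead of scanning the full frame:

    #include <cstdint>

    struct Roi { int x, y, width, height; };

    // Expand last frame's detection by a margin, clamped to the frame bounds,
    // so the tracked object can move a little and still be found.
    Roi predictRoi(const Roi& lastDetection, int margin, int frameW, int frameH)
    {
        Roi r;
        r.x = lastDetection.x - margin;
        r.y = lastDetection.y - margin;
        r.width  = lastDetection.width  + 2 * margin;
        r.height = lastDetection.height + 2 * margin;
        if (r.x < 0) { r.width  += r.x; r.x = 0; }
        if (r.y < 0) { r.height += r.y; r.y = 0; }
        if (r.x + r.width  > frameW) r.width  = frameW - r.x;
        if (r.y + r.height > frameH) r.height = frameH - r.y;
        return r;
    }

    // Run the vision kernel only over the ROI of a grayscale frame,
    // cutting pixel count and memory traffic on embedded hardware.
    void processFrame(const uint8_t* pixels, int stride, const Roi& roi)
    {
        for (int y = roi.y; y < roi.y + roi.height; ++y)
            for (int x = roi.x; x < roi.x + roi.width; ++x) {
                uint8_t p = pixels[y * stride + x];
                (void)p; // apply the detector/filter to p here
            }
    }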
Kernel Recipes 2014 - The Linux graphics stack and Nouveau driver - Anne Nicolas
The Linux graphics stack is constantly evolving to add support for new hardware. This evolution, along with new software specifications, has forced the X graphical server to be split into several components, one of which now lives in the Linux kernel: the Direct Rendering Manager (DRM). A quick presentation of these components and their roles will be given before looking at a major change in the common code, driven by NVIDIA's Optimus technology.
A laptop equipped with Optimus technology has two graphics processing units (GPUs), one from Intel and one from NVIDIA. The technology combines the low power consumption of the Intel GPU when the machine is lightly used with the performance of the NVIDIA GPU when the user plays. It is, however, a nightmare to manage kernel-side, although the final building blocks necessary for its complete support are being finalized. We will explain this issue further and see how this new software architecture has brought graphics acceleration to embedded SoCs such as Tegra.
The open source NVIDIA driver, called "Nouveau" ("new"), will then be studied. It is a community graphics driver, developed without help from NVIDIA, that has attracted several regular contributors, including myself! We'll go over a quick history of the project before talking about current developments and the issues related to the lack of documentation.
The end of the presentation will be left to the participants so they can ask more general questions about the graphics stack, if they wish.
Martin Peres, Laboratoire Bordelais de Recherche en Informatique
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator... - Akihiro Hayashi
Third Workshop on Accelerator Programming Using Directives (WACCPD2016, co-located with SC16)
While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of software productivity and portability, a more attractive approach would be to facilitate GPU programming by providing high-level abstractions for expressing parallel algorithms.
OpenMP is a directive-based shared memory parallel programming model and has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of GPU architectures (a minimal offload example is sketched below).
However, such high-level parallel programming strategies generally impose additional program optimizations on compilers, which can result in lower performance than fully hand-tuned code written with low-level programming models. To study potential performance improvements from compiling and optimizing high-level GPU programs, in this paper we 1) evaluate a set of OpenMP 4.x benchmarks on an IBM POWER8 and NVIDIA Tesla GPU platform and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU programs automatically generated by the IBM XL and clang/LLVM compilers.
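For readers unfamiliar with the accelerator model the abstract refers to, here is a minimal, generic OpenMP 4.x offload example (illustrative only; not taken from the paper's benchmark suite):

    #include <cstdio>

    // SAXPY offloaded with OpenMP 4.x target directives: the map clauses
    // move the arrays to and from the device, and "teams distribute
    // parallel for" exposes the loop's data parallelism without any
    // CUDA- or OpenCL-level code.
    void saxpy(int n, float a, const float* x, float* y)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float* x = new float[n];
        float* y = new float[n];
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(n, 3.0f, x, y);
        std::printf("y[0] = %f\n", y[0]); // expect 5.0

        delete[] x;
        delete[] y;
        return 0;
    }

How efficiently a compiler maps this loop nest onto GPU threads, compared to a hand-written CUDA kernel, is precisely the gap the paper measures.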
1. Parallel Futures of a Game Engine (v2.0)
Stockholm Game Developer Forum 2010
Johan Andersson, Rendering Architect, DICE
2. Background
• DICE: Stockholm, Sweden; ~250 employees; part of Electronic Arts; Battlefield & Mirror's Edge game series
• Frostbite: proprietary game engine used at DICE & EA, developed by DICE over the last 5 years
7-8. Levels of code in Frostbite
• Offline: Editor (C#), Pipeline (C++)
• Runtime, CPU: Game code (C++), System CPU-jobs (C++), System SPU-jobs (C++/asm)
• Runtime, GPU: Generated shaders (HLSL), Compute kernels (HLSL)
9. General "game code" (1/2)
• The majority of our 1.5 million lines of C++; runs on Win32, Win64, Xbox 360 and PS3
• We do not use any scripting language
• Similar to general application code: a huge amount of code & logic to maintain and continue to develop
• Low compute density, "glue code", scattered in memory (pointer chasing)
• Difficult to efficiently parallelize; out-of-order execution is a big help, but consoles are in-order
• Key to be able to quickly iterate & change: this is the actual game logic & glue that builds the game
• C++ is not ideal, but has the invested infrastructure
10. General "game code" (2/2)
• PS3 is one of the main challenges: standard CPU parallelization doesn't help (much), since CELL only has 2 HW threads on the PPU
• Split the code in 2: game code & system code
• Game logic, policy and glue code only on the CPU: "If it runs well on the PS3 PPU, it runs well everywhere"
• Lower-level systems on PS3 SPUs
• Main goals going forward: simplify & structure the code base, reduce coupling with lower-level systems, increase task parallelism for PC
(Slide image: CELL processor)
11. Levels of code in Frostbite
• Offline: Editor (C#), Pipeline (C++)
• Runtime, CPU: Game code (C++), System CPU-jobs (C++), System SPU-jobs (C++/asm)
• Runtime, GPU: Generated shaders (HLSL), Compute kernels (HLSL)
12. Job-based parallelism
• Essential to utilize the cores on our target platforms: Xbox 360: 6 HW threads; PlayStation 3: 2 HW threads + 6 powerful SPUs; PC: 2-16 HW threads
• Divide up system work into jobs (a.k.a. tasks): 15-200k C++ code each, 25k is common
• Jobs can depend on each other (if needed); dependencies create a job graph
• All HW threads consume jobs, ~200-300 / frame
13. What is a Job for us?
• An asynchronous function call: a function ptr + 4 uintptr_t parameters (see the sketch below)
• Cross-platform scheduler: EA JobManager, uses work stealing
• 2 types of jobs in Frostbite:
• CPU job (good): general code moved into a job instead of a thread
• SPU job (great!): stateless pure functions with no side effects; data-oriented, explicit memory DMA to local store
• Designed to run on the PS3 SPUs = also very fast on in-order CPUs
• Can hot-swap for quick iterations
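The "function ptr + 4 uintptr_t parameters" description maps to a very small job record. A minimal sketch of the concept (illustrative; not the actual EA JobManager API):

    #include <cstdint>

    // A job is just a function pointer plus four pointer-sized parameters,
    // so the same description can run on a CPU worker thread or be kicked
    // to an SPU with its data DMA'd into local store.
    typedef void (*JobFunction)(uintptr_t, uintptr_t, uintptr_t, uintptr_t);

    struct Job {
        JobFunction func;
        uintptr_t   params[4];
        Job*        dependency; // optional: dependencies form the per-frame job graph
    };

    // A real scheduler would wait on dependencies and work-steal from other
    // workers' queues; this just shows the invocation contract.
    void runJob(const Job& job) {
        job.func(job.params[0], job.params[1], job.params[2], job.params[3]);
    }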
25. Rendering jobs
• Rendering systems are heavily divided up into CPU- & SPU-jobs
• Jobs: terrain geometry [3], undergrowth generation [2], decal projection [4], particle simulation, frustum culling, occlusion culling, occlusion rasterization, command buffer generation [6], PS3 triangle culling [6]
• Most will move to the GPU eventually; a few already have!
• PC CPU<->GPU latency wall; mostly one-way data flow
26. Occlusion culling job example
• Problem: buildings & environment occlude large amounts of objects
• Obscured objects still have to update logic & animations and generate command buffers; processing them on CPU & GPU = expensive & wasteful
• Difficult to implement full culling: destructible buildings, dynamic occludees, difficult to precompute
(Example from Battlefield: Bad Company, PS3)
27. Solution: Software occlusion culling
• Rasterize a coarse z-buffer on SPU/CPU: 256x114 float
• Low-poly occluder meshes, 100 m view distance, max 10000 vertices/frame
• Parallel vertex & raster SPU-jobs; cost: a few milliseconds
• Cull all objects against the z-buffer with a screen-space bounding-box test (see the sketch below), before they are passed to all other systems
• Big performance savings!
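A hedged sketch of the screen-space bounding-box test described above, against a coarse software z-buffer (the buffer layout and the smaller-is-closer depth convention are assumptions, not taken from the slides):

    // An object is culled when every coarse z-buffer texel covered by its
    // screen-space bounding box is closer (smaller depth) than the object's
    // nearest point: no part of the box can be visible behind the occluders.
    bool isOccluded(const float* zBuffer, int width,
                    int x0, int y0, int x1, int y1, // box clamped to buffer bounds
                    float nearestObjectDepth)       // min depth of the bounding box
    {
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (zBuffer[y * width + x] >= nearestObjectDepth)
                    return false; // an occluder texel lies at or behind the object: maybe visible
        return true; // fully hidden: skip logic updates and command buffer generation
    }

The test is conservative: a visible answer may be wrong (the object is still drawn), but a culled answer is always safe, which is what lets every downstream system trust it.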
28. GPU occlusion culling
• Ideally we want to use the GPU, but current APIs are limited: occlusion queries introduce overhead & latency, conditional rendering only helps the GPU, and a Compute Shader implementation is possible but hits the same latency wall
• Future 1: a low-latency GPU execution context. Rasterization and testing done on the GPU where they belong; lockstep with the CPU, need to read back within a few ms; should be possible on Larrabee & Fusion, want it on all PCs
• Future 2: move the entire cull & rendering to the "GPU": world, cull, systems, dispatch. The end goal: a very thin GPU graphics driver where the GPU feeds itself
29. Levels of code in Frostbite
• Offline: Editor (C#), Pipeline (C++)
• Runtime, CPU: Game code (C++), System CPU-jobs (C++), System SPU-jobs (C++/asm)
• Runtime, GPU: Generated shaders (HLSL), Compute kernels (HLSL)
30. Shader types
• Generated shaders [1]: graph-based surface shaders, treated as content, not code; artist created; generate HLSL code; used by all meshes and 3D surfaces
• Graphics / compute kernels: hand-coded & optimized HLSL, statically linked in with C++; pixel & compute shaders for lighting, post-processing & special effects
(Slide image: graph-based surface shader in FrostEd 2)
32. Challenges
3 major challenges/goals going forward:
• How do we make it easier to develop, maintain & parallelize general game code?
• What do we need to continue to innovate & scale up real-time computational graphics?
• How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?
Most likely the same solution(s)!
33. Challenge 1
"How do we make it easier to develop, maintain & parallelize general game code?"
• Shared-state concurrency is a killer
• Not a big believer in Software Transactional Memory either, because of performance and its too "optimistic" flow
• Want a more strict & adapted C++ model: support for true immutable & read/write-only memory access; per-thread/task memory access opt-in
• Reduce the possibility for side effects in parallel code, with as much compile-time validation as possible
• Micro-threads / coroutines as first-class citizens
• More? Other languages?
34. Challenge 1 - Task parallelism
• Multiple task libraries:
• EA JobManager: dynamic job graphs, on-demand dependencies & continuations; targeted at cross-platform SPU-jobs, a key requirement for this generation; not geared towards super-simple-to-use CPU parallelism
• MS ConcRT, Apple GCD, Intel TBB: all have some good parts, but none of them works on all of our current platforms
• OpenMP: just say no
• Parallelism can't be tacked on, it needs to be an explicit core part
• Need C++ enhancements to simplify usage: C++0x lambdas / GCD blocks; with glacial C++ development & deployment, and wanting them on all platforms, they're lost on this console generation
35. Challenge 2 - Definition
"Real-time interactive graphics & simulation at Pixar level of quality"
Needed visual features:
• Complete anti-aliasing + natural motion blur & DOF
• Global indirect lighting & reflections
• Sub-pixel geometry
• OIT (order-independent transparency)
• Huge improvements in character animation
These require massively more compute, more BW and an improved model! (Animation can't be solved with just more/better compute, so pretend it doesn't exist for now.)
36. Challenge 2 - Problems
Problems & limitations with the current GPU model:
• Fixed rasterization pipeline; the compute pipeline is not fast enough to replace it
• The GPU is handicapped by being spoon-fed by the CPU
• Irregular workloads are difficult & inefficient
• Can't express nested data-parallelism well
• Current HLSL is a very limited language
37. Challenge 2 - Solutions
• Absolutely a job for a high-throughput-oriented data-parallel processor with a highly flexible programming model
• The CPU as we know it, and the APIs/drivers, are only in the way
• A pure software solution is not practical as the next step after DX11: the PC is a multi-vendor & multi-architecture marketplace; but for future consoles, the more flexible the better (perf. permitting)
• Long-term goal: a multi-vendor & multi-platform standard/virtual DP ISA
• Want a rich programmable compute model as the next step: nested data-parallelism & GPU self-feeding; low-latency CPU<->GPU interaction; efficiently target varying HW architectures
38. "Pipelined Compute Shaders"
• Queues as streaming I/O between compute kernels
• A simple & expressive model supporting irregular workloads
• Keeps data on chip; supports variable-sized caches & cores; can target multiple types of HW & architectures
• Hybrid graphics/compute user-defined pipelines: the language/API defines the fixed stages' inputs & outputs; pipelines can feed other pipelines (similar to DrawIndirect)
(Slide diagram: Reyes-style rendering with ray tracing, with stages Prims, Sub-D, Split, Tess, Shade, Trace, Raster, Frame Buffer)
39. "Pipelined Compute Shaders"
• Wanted for the next major DirectX and OpenCL/OpenGL, as a standard, as soon as possible
• Run on all: discrete GPU, integrated GPU and CPU
• The model is also a good fit for many of our CPU/SPU jobs: parts of the job graph can be seen as queues between stages
• Easier to write kernels/jobs with streaming I/O (see the sketch below) instead of explicit fixed-size buffers and "memory passes", or dynamic memory allocation
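To make the queue-based model concrete, here is a plain C++ sketch of a pipeline stage driven by streaming I/O (illustrative only; a real implementation would keep the queues on-chip and run the stages concurrently):

    #include <queue>
    #include <vector>

    // One pipeline stage: drain the input queue, let the kernel emit 0..N
    // outputs per input (which is how irregular workloads like Split fall
    // out naturally), and feed the next stage's queue.
    template <typename In, typename Out, typename Kernel>
    void runStage(std::queue<In>& input, std::queue<Out>& output, Kernel kernel)
    {
        while (!input.empty()) {
            std::vector<Out> produced = kernel(input.front());
            input.pop();
            for (const Out& item : produced)
                output.push(item);
        }
    }

    // Usage idea: chain stages the way the slide chains them, e.g.
    //   runStage(primQueue, patchQueue, splitKernel);
    //   runStage(patchQueue, shadedQueue, tessAndShadeKernel);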
40. Language?
• The future DP language is a big question, but the concepts & infrastructure are what is important!
• Could be an extended HLSL or a "data-parallel C++": a data-oriented imperative language (i.e. not standard C++)
• Think HLSL would probably be easier & the most explicit; the amount of code is small and written from scratch
• Shader-like implicit SoA vectorization (SIMT), instead of SSE/LRBni-like explicit vectorization
• Needs to be a first-class language citizen, not tacked onto C++; easier to target multiple evolving architectures implicitly
41. Future hardware (1/2)
• Single main memory & address space: critical for sharing resources between graphics, simulation and game in immersive dynamic worlds
• Configurable kernel local stores / caches, similar to Nvidia Fermi & Intel Larrabee
• Local stores = reliability & good for regular loads, together with user-driven async DMA engines
• Caches = essential for irregular data structures
• Cache coherency? Not always important for kernels, but essential for general code; can we partition?
42. Future hardware (2/2)
• 2015 = 40 TFLOPS. We would spend it on: 80% graphics, 15% simulation, 4% misc, 1% game (and wouldn't use all 400 GFLOPS for game logic & glue!)
• OOE CPUs are more efficient for the majority of our game code, but for the vast majority of our FLOPS they are fully irrelevant
• They can evolve to a small dot on a sea of DP cores, or run scalar on general DP cores, wasting some compute
• In other words: no need for separate CPU and GPU -> a single heterogeneous processor
43. Conclusions
• The future is an interleaved mix of task- & data-parallelism, on both the HW and SW level
• Programmable DP is where the massive compute is done, and data-parallelism requires data-oriented design
• Developer productivity can't be limited by the model(s); they should enhance productivity & perf on all levels
• Tools & language constructs play a critical role
• We should welcome our parallel future!
44. Thanks
• DICE, EA and the Frostbite team
• The graphics/gamedev community on Twitter
• Steve McCalla, Mike Burrows, Chas Boyd, Nicolas Thibieroz, Mark Leather, Dan Wexler, Yury Uralsky, Kayvon Fatahalian
45. References - previous Frostbite-related talks:
[1] Johan Andersson. "Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html
[2] Natalya Tatarchuk & Johan Andersson. "Rendering Architecture and Real-time Procedural Shading & Texturing Techniques". GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf
[3] Johan Andersson. "Terrain Rendering in Frostbite using Procedural Shader Splatting". Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf
[4] Daniel Johansson & Johan Andersson. "Shadows & Decals - D3D10 techniques from Frostbite". GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html
[5] Bill Bilodeau & Johan Andersson. "Your Game Needs Direct3D 11, So Get Started Now!". GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html
[6] Johan Andersson. "Parallel Graphics in Frostbite". Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html
48. Game development
• 2 year development cycle; a new IP often takes much longer, 3-5 years
• The engine is continuously in development & use
• AAA teams of 70-90 people: 40% artists, 30% designers, 20% programmers, 10% audio
• Budgets of $20-40 million
• Cross-platform development is a market reality: Xbox 360, PlayStation 3 and PC (DX9, DX10 & DX11 for BC2)
• Current consoles will stay with us for many more years
49. Game engine requirements (1/2)
• Stable real-time performance: frame-driven updates, 30 fps; few threads, instead per-frame jobs/tasks for everything
• Predictable memory usage: fixed budgets for systems & content, fail if over; avoid runtime allocations; love unified memory!
• Cross-platform: the consoles determine our base tech level & focus; PS3 is the design target, most difficult and good potential; scale up for PC, dual core is min spec (slow!)
50. Game engine requirements (2/2)
• Full system profiling/debugging: the engine is a vertical solution and touches everything; PIX, xbtracedump, SN Tuner, ETW, GPUView
• Quick iterations: essential in order to be creative; fast building & fast loading, hot-swapping resources; affects both the tools and the game
• Middleware: use it when it makes sense, cross-platform & optimized; parallelism has to go through our systems
51. Editor & Pipeline
• Editor ("FrostEd 2"): WYSIWYG editor for content; C#, Windows only; basic threading / tasks
• Pipeline: offline/background data processing & conversion; C++, some MC++, Windows only; typically IO-bound
• A few compute-heavy steps use CPU-jobs; texture compression uses CUDA, though we would prefer OpenCL or CS
• CPU parallelism models are generally not a problem here
52. EntityRenderCull job example
• Frustum culling of dynamic entities in a sphere tree
• The struct contains all input data needed; max output data is pre-allocated by the callee
• A single job function, compiled as both a CPU and an SPU job, plus an optional struct validation function

    struct FB_ALIGN(16) EntityRenderCullJobData
    {
        enum { MaxSphereTreeCount = 2, MaxStaticCullTreeCount = 2 };

        uint sphereTreeCount;
        const SphereNode* sphereTrees[MaxSphereTreeCount];
        u8 viewCount;
        u8 frustumCount;
        u8 viewIntersectFlags[32];
        Frustum frustums[32];
        .... (cut out 2/3 of struct for display size)

        // Output data, pre-allocated by callee
        u32 maxOutEntityCount;
        u32 outEntityCount;
        EntityRenderCullInfo* outEntities;
    };

    void entityRenderCullJob(EntityRenderCullJobData* data);
    void validate(const EntityRenderCullJobData& data);
53. EntityRenderCull SPU setup

    // local store variables
    EntityRenderCullJobData g_jobData;
    float g_zBuffer[256*114];
    u16 g_terrainHeightData[64*64];

    int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t)
    {
        // DMA the job data from main memory into local store
        dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData));
        validate(g_jobData);

        if (g_jobData.zBufferTestEnable)
        {
            dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer,
                        g_jobData.zBufferResX*g_jobData.zBufferResY*4);
            g_jobData.zBuffer = g_zBuffer;

            if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData)
            {
                dmaAsyncGet("terrainHeight", g_terrainHeightData,
                            g_jobData.terrainHeightData,
                            g_jobData.terrainHeightDataSize);
                g_jobData.terrainHeightData = g_terrainHeightData;
            }
            dmaWaitAll(); // block on both DMAs
        }

        // run the actual job, will internally do streaming DMAs to the output entity list
        entityRenderCullJob(&g_jobData);

        // put back the data because we changed outEntityCount
        dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData));
        return 0;
    }
54. Timing view
• Example: PC, 4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2)
• Real-time in-game overlay: see timing events & effective parallelism on CPU, SPU & GPU, for all platforms
• Used to reduce sync points & optimize load balancing; GPU timing through DX event queries
• Our main performance tool!
55. Language (cont.)
Requirements:
• Full rich debugging, ideally in Visual Studio
• Asserts
• Internal kernel profiling
• Hot-swapping / edit-and-continue of kernels
• An opportunity for IHVs and platform providers to innovate here!
Goal: a cross-vendor standard; think of the co-development of Nvidia Cg and HLSL
56. Unified development environment
• Want to debug/profile task- & data-parallel code seamlessly, on all processors: CPU, GPU & manycore, from any vendor = requires standard APIs or ISAs
• Visual Studio 2010 looks promising for task-parallel PC code: usable by our offline tools & hopefully the PC runtime; want to integrate our own JobManager
• Nvidia Nexus looks great for data-parallel GPU code: a huge step forward! An eventual must-have for all HW, but how?
(Slide image: VS2010 Parallel Tasks)
Editor's Notes
Xbox 360 version + offline super-high AA and resolution
”Glue code”
Better code structure! Gustafson's Law. Fixed 33 ms/f. We don't use threads for anything that can be a bit computationally heavy; jobs instead
Data-layout
We have a lot of jobs across the engine, too many to go through, so I chose to focus more on some of the rendering. Our intention is to move all of these to the GPU
Conservative
Want to rasterize on GPU, not CPU. CPU <-> GPU job dependencies
Simple keywords such as override & sealed have been a help in other areas
No low-latency GPU contexts. Nested data-parallelism
Would require multi-vendor open ISA standard
With nested data parallelism inside kernels
OpenCL
Isolate systems/areas. We do not use exceptions, nor much error handling