Grass, Fur and all Things Hairy - AMD at GDC14

Learn about new developments in simulating and rendering grass, fur and hair. We’ll show thousands of blades of grass or strands of fur being simulated in real-time, as well as our latest findings in Order-Independent Transparency in this AMD technology presentation from the 2014 Game Developers Conference in San Francisco March 17-21.


Usage Rights

© All Rights Reserved

  • Easy LOD: can easily be done with tessellation

Presentation Transcript

  • Grass, Fur and all things hairy
    ● Nicolas Thibieroz, Gaming Engineering Manager, AMD
    ● Karl Hillesland, Senior Research Engineer, AMD
  • Next-gen Grass, Fur and Hair
    ● The time for next-gen quality is now
    ● Tomb Raider pioneered next-gen hair
      ● Even on PS4/XB1
    ● Users expect this level of quality for next-gen titles
    ● You need to start thinking about this
    ● This talk is about making high-quality fur, grass and hair run at real-time performance
  • TressFX applied to Grass, Fur and Hair
    ● Variations of the same technique can be used for all those applications
    ● In all cases the core principles of next-gen quality are still needed:
      ● Compute simulations
      ● Anti-aliasing
      ● Transparency
      ● Volumetric self-shadowing
      ● A good lighting model
  • Forward Rendering Pipeline – a refresher
    ● Consists of three steps:
      ● Hair simulation
      ● Shade and store fragments into buffers
      ● Fetch shaded fragments, sort and render
  • Per-Pixel Linked Lists
    ● Head UAV
      ● Each pixel location has a “head pointer” to a linked list in the PPLL UAV
    ● PPLL UAV
      ● As new fragments are rendered, they are added to the next open location in the PPLL (using UAV counter)
      ● A link is created to the fragment pointed to by the head pointer
      ● Head pointer then points to the new fragment

        // Retrieve current pixel count and increase counter
        uint uPixelCount = LinkedListUAV.IncrementCounter();
        uint uOldStartOffset;
        // Exchange indices in LinkedListHead texture corresponding to pixel location
        InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
        // Append new element at the end of the Fragment and Link Buffer
        Element.uNext = uOldStartOffset;
        LinkedListUAV[uPixelCount] = Element;
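The insertion scheme on this slide can be modelled on the CPU to see how the head pointers and the node buffer interact. The following is a rough Python sketch, not the TressFX shader: the class name, the `END_OF_LIST` sentinel and the bounds check on the node pool are assumptions made for illustration (the slide code does not show how a full buffer is handled).

```python
END_OF_LIST = 0xFFFFFFFF  # assumed sentinel for "no next fragment"


class PerPixelLinkedLists:
    """CPU model of a head buffer plus a shared node pool."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [END_OF_LIST] * (width * height)  # one head pointer per pixel
        self.nodes = [None] * max_nodes               # shared fragment storage
        self.counter = 0                              # stands in for the UAV counter

    def insert(self, x, y, depth, data):
        """Add a fragment; returns False if the node pool is full."""
        if self.counter >= len(self.nodes):
            return False                              # fragment is dropped (assumed policy)
        index = self.counter                          # IncrementCounter()
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]                 # InterlockedExchange(...)
        self.head[address] = index
        self.nodes[index] = (depth, data, old_head)   # new node links to the old head
        return True

    def fragments(self, x, y):
        """Walk one pixel's list, newest fragment first."""
        out = []
        index = self.head[y * self.width + x]
        while index != END_OF_LIST:
            depth, data, next_index = self.nodes[index]
            out.append((depth, data))
            index = next_index
        return out
```

Walking a pixel's list returns its fragments in reverse arrival order, which is why a later pass still has to sort them by depth.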
  • Forward Rendering Pipeline – a refresher: Hair Simulation [Diagram: input geometry (model space) and simulation parameters feed a chain of three compute shaders (CS), which write post-simulation geometry (world space) to a UAV]
  • Forward Rendering Pipeline – a refresher: Shade and Store fragments into Buffers [Diagram: VS (world space to homogeneous clip space, extrusion from line segments to non-indexed triangles) → PS (coverage, lighting, shadows) with null RT and stencil; depth, color, coverage and next are written to the PPLL UAV and Head UAV]
  • Forward Rendering Pipeline – a refresher: Fetch shaded fragments, sort and render [Diagram: full screen quad, VS → PS with stencil, reading the Head UAV and PPLL UAV; fragment sorting and manual blending into the render target]
  • Forward Rendering Performance
    ● Main cost in forward rendering mode is in the shading part
      ● All fragments are lit and shadowed before being stored
      ● PPLL storing is typically not the bottleneck!
    ● Don’t need maximum quality on all fragments
      ● “Tail” fragments need only “good enough” quality
    ● Solution: Use shader LOD
  • Forward vs Deferred Rendering Pipeline
    ● Forward rendering pipeline
      ● Hair simulation
      ● Full shading and store fragments into buffers
      ● Fetch shaded fragments, sort and render
    ● Deferred rendering pipeline
      ● Hair simulation
      ● Store fragment properties into buffers
      ● Fetch fragment properties, sort, shade and render
        ● Full shading on K-frontmost fragments
        ● “Tail” fragments are shaded with a simpler light equation and shadowing algorithm
  • Deferred Rendering Pipeline: Hair Simulation – unchanged! [Diagram: input geometry (model space) and simulation parameters feed a chain of three compute shaders (CS), which write post-simulation geometry (world space) to a UAV]
  • Deferred Rendering Pipeline: Store Fragment Properties into Buffers [Diagram: index buffer with indexed triangle list → VS (world space to homogeneous clip space) → PS (coverage) with null RT and stencil; depth, tangent, coverage and next are written to the PPLL UAV and Head UAV]
  • Deferred Rendering Pipeline: Fetch fragments, sort, shade and render [Diagram: full screen quad, VS → PS with stencil, reading the Head UAV and PPLL UAV and writing the render target, with lighting and shadows]
    ● K frontmost fragments: full shading, sorting and manual blending
    ● Tail fragments: cheap shading, no sorting, manual blending
  • Deferred Rendering Shading LOD Optimization
    ● Deferred approach allows a reduction in shading cost: “Shader LOD”
      ● Only sort and shade K frontmost fragments at high quality
      ● “Simple” shading and out-of-order rendering on tail fragments
      ● Single-tap shadowing on tail fragments
    ● Very little quality difference compared to full shading
      ● But much better performance!

        Technique                   Cost
        Out of order, no shading    1.31 ms
        Out of order, shading       2.80 ms
        Forward PPLL, shading       3.38 ms
        Deferred PPLL, shading      2.13 ms

    ● Fur model with ~130,000 fur strands, running on AMD Radeon 7970 @ 1080p
    ● Shading cost is ~1.5 ms
    ● PPLL cost is ~0.58 ms (fast!)
  • [Image comparison: full quality shading forced on for all fragments vs. shading LOD]
  • Geometry Optimizations
    ● A great portion of time was spent in the GPU front-end
      ● 920,000 line segments for fur model
    ● Expansion from line segments to triangles was done in GS and then VS with Draw()
      ● Each segment would create a quad (two triangles) with 6 vertices
    ● Draw() method: line segments expanded into quads with no shared vertices
        Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
    ● DrawIndexed() method: line segments expanded into quads that share vertices
        Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
    ● Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
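The indexed triangle list on this slide follows a simple pattern that can be generated offline. A minimal Python sketch, assuming two extruded vertices per strand point and a hypothetical `base_vertex` offset for placing several strands in one buffer (neither detail is stated on the slide):

```python
def build_strand_indices(num_segments, base_vertex=0):
    """Index list for one strand: each segment becomes a quad of two triangles.

    Assumes strand point i owns vertices 2*i and 2*i + 1, so neighbouring
    quads share an edge and its two vertices.
    """
    indices = []
    for segment in range(num_segments):
        v = base_vertex + 2 * segment
        indices += [v, v + 1, v + 2]      # first triangle of the quad
        indices += [v + 2, v + 1, v + 3]  # second triangle of the quad
    return indices


def vertex_counts(num_segments):
    """Vertices processed per strand: (non-indexed Draw, unique indexed)."""
    return 6 * num_segments, 2 * (num_segments + 1)
```

For two segments this yields 0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5, matching the slide, and the unique vertex count drops from 12 to 6.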
  • Distance-based LOD System Optimization
    ● Input line segments have a random order
    ● Just render fewer (but thicker) fragments when far away!
    ● Needs shading adjustments to ensure smooth quality transitions
      ● Increase alpha threshold for fragment inclusion when far away
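One way to picture "fewer but thicker" is a distance-driven strand fraction with a compensating width scale. The Python sketch below is purely illustrative: the linear falloff, the `near`, `far` and `min_fraction` parameters and the 1/fraction thickness rule are all assumptions, not values from the talk. Because the input segments are in random order, drawing only the first N strands gives an evenly spread subset.

```python
def strand_lod(distance, near=1.0, far=10.0, min_fraction=0.1):
    """Return (fraction of strands to draw, thickness multiplier).

    All parameters are hypothetical tuning values.
    """
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)                 # clamp to [0, 1]
    fraction = 1.0 - t * (1.0 - min_fraction)
    thickness_scale = 1.0 / fraction          # keep total coverage roughly constant
    return fraction, thickness_scale


def strands_to_draw(total_strands, distance):
    """Number of strands to submit; the first N of a randomly ordered buffer."""
    fraction, _ = strand_lod(distance)
    return max(1, round(total_strands * fraction))
```

The shading adjustments mentioned on the slide (such as the alpha threshold) are not modelled here.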
  • Other Optimizations
    ● PPLL Head UAV uses a RWTexture2D instead of a Buffer
      ● Results in more efficient caching for UAV accesses
    ● Avoid GPR indexing for sorting
      ● Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
      ● Used an ALU-based indexing approach to improve performance
    ● TO DO: compute shader simulation optimizations
      ● Currently a set of multiple compute shaders
      ● Looking at combining some of these, optimizing shaders and output formats
  • Per-Pixel Linked Lists UAV Memory Considerations
    ● How much memory is needed?
      ● Guesstimate for a given usage model
      ● Max (hair pixels x average overdraw) fragments
    ● What happens when I run out?
      ● Missing fragments
    ● What can be done about it?
  • k-Buffer in Memory
  • Simple Memory Bound [Diagram: PP Linked-List (PPLL) uses a node pool holding all fragments (how big?); the k-Buffer uses a fixed-size array of k entries per pixel]
  • The Front k Approximation to avoid massive sorting
    ● Only sort the front k fragments per-pixel
    ● Blend the rest out-of-order
    ● If deferring for shader LOD … also
      ● Full quality shade on front k
      ● Cheap shade on rest
    ● k is 4, 8, 16
    ● Can’t know front k until all fragments processed
    ● [Image: depth complexity, 20 frags/pixel (ave), red = over 100] [Diagram: k-buffer and tail]
  • k-Buffer: for each fragment in each pixel [Diagram: new fragment, k-buffer with index of furthest, tail fragment blended into the tail color]
  • If new fragment is in k
    1. Swap with furthest
    2. Find new furthest
    3. Blend with tail
  • If not in k
    1. Blend with tail
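  • From PPLL to k-Buffer
    ● PPLL
      ● 1st pass: write frags to mem
      ● 2nd pass, for each pixel: for each fragment in each pixel, read fragment from mem, update k-buffer (reg), blend tail fragment (reg); then sort and blend k-buffer (reg)
    ● k-Buffer in memory
      ● 1st pass, for each fragment in each pixel: update k-buffer (mem), blend tail fragment (mem)
      ● 2nd pass, for each pixel: read k-buffer from mem, sort and blend k-buffer (reg)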
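The per-pixel logic of the front k approximation (swap with furthest, blend the displaced fragment into the tail, then sort and blend the k-buffer) can be sketched for a single pixel in Python. This is a CPU illustration under assumptions, not the shader from the talk: the fragment layout `(depth, rgb, alpha)`, the function name and the transmittance-based tail accumulation are choices made for this example.

```python
def resolve_pixel(fragments, k, background=(0.0, 0.0, 0.0)):
    """Blend one pixel's fragments using the front k approximation.

    fragments: iterable of (depth, (r, g, b), alpha) in arbitrary arrival order.
    Smaller depth means closer to the camera. Requires k >= 1.
    """
    assert k >= 1
    kbuf = []                  # front k candidates (registers or memory on the GPU)
    color = [0.0, 0.0, 0.0]    # accumulated colour of everything blended so far
    transmittance = 1.0        # how much of the background still shows through

    def blend_over(frag):
        nonlocal transmittance
        _, rgb, alpha = frag
        for c in range(3):
            color[c] = color[c] * (1.0 - alpha) + rgb[c] * alpha
        transmittance *= 1.0 - alpha

    for frag in fragments:
        if len(kbuf) < k:      # the first k fragments fill the buffer directly
            kbuf.append(frag)
            continue
        furthest = max(range(k), key=lambda i: kbuf[i][0])  # find furthest
        if frag[0] < kbuf[furthest][0]:
            kbuf[furthest], frag = frag, kbuf[furthest]     # swap with furthest
        blend_over(frag)       # tail fragment: blended out of order

    # Final step: sort the front k back to front and blend them over the tail.
    for frag in sorted(kbuf, key=lambda f: f[0], reverse=True):
        blend_over(frag)
    return tuple(color[c] + transmittance * background[c] for c in range(3))
```

When a pixel has no more than k + 1 fragments the result matches a full back-to-front sort; beyond that, only the tail is blended in arrival order, which is the approximation the slides describe.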
  • k-Buffer
    ● Size: screen width × screen height × k entries
    ● 8 bytes each (depth and data)
    ● PPLL nodes were 12 bytes (depth, data, next)
    ● k = 4, 8, 16
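The two memory bounds can be compared with a few lines of arithmetic. The Python sketch below uses the entry sizes from the slides (12 bytes per PPLL node, 8 bytes per k-buffer entry); the example inputs (300k hair pixels, an average overdraw of 20, a full-screen 1080p k-buffer) are illustrative picks from figures quoted elsewhere in the talk, and the head and mutex buffers are left out.

```python
PPLL_NODE_BYTES = 12      # depth, data, next (from the slide)
KBUFFER_ENTRY_BYTES = 8   # depth and data (from the slide)


def ppll_bytes(hair_pixels, average_overdraw):
    """Guesstimate: (hair pixels x average overdraw) fragments, 12 bytes each."""
    return hair_pixels * average_overdraw * PPLL_NODE_BYTES


def kbuffer_bytes(width, height, k):
    """Fixed bound: k entries for every pixel of a full-screen buffer."""
    return width * height * k * KBUFFER_ENTRY_BYTES


if __name__ == "__main__":
    print(ppll_bytes(300_000, 20))        # 72000000
    for k in (4, 8, 16):
        print(k, kbuffer_bytes(1920, 1080, k))
    # 4 66355200
    # 8 132710400
    # 16 265420800
```

The PPLL figure depends on how well overdraw is estimated, while the k-buffer figure depends only on resolution and k, which is what makes it a simple bound.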
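  • PPLL: 2nd Pass [Diagram: each new fragment is processed against a k-buffer, index of furthest and tail color that are all held in registers]
  • k-Buffer in Memory: 1st Pass [Diagram: each new fragment is processed against a k-buffer held in memory, together with the per-pixel mutex, index, … value; tail fragments go through the blend unit into the tail color]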
  • Mutex/Count/Index Buffer: one 32-bit value per pixel (screen width × screen height) containing, from the high bit down, a mutex bit, an initialized bit, the max index (4 bits) and the count (remainder)
  • Spinlock Mutex

        [allow_uav_condition]
        for(; i<MAX_LOOP_COUNT && !bStop; ++i)   // Paranoia: bounded number of attempts
        {
            uint oldID;
            // Try: atomically mark the pixel as reserved and read the previous value
            InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
            if( (oldID&RESERVED) != RESERVED )
            {
                [[ … Do work ]]
                DeviceMemoryBarrier();
                // Release: write back the new max index with the reserved bit cleared
                tRWMutex[vScreenAddress] = (new_max_id<<28)+INITED;
                bStop = true;
            } // end mutex check
        } // end spinlock loop
  • Find New Max Depth

        // Scan the k-buffer for the entry that is now furthest from the camera
        uint new_max_depth = u_inDepth;
        [unroll]
        for(int t=0; t<KBUFFER_SIZE; t++)
        {
            uint element_depth = DEPTH( vScreenAddress, t );
            if( element_depth > new_max_depth )
            {
                new_max_depth = element_depth;
                new_max_id = t;
            }
        }

    ● Generally more memory traffic than PPLL
  • Initialization: The first k
    ● Options
      ● Clear k-buffer fullscreen (0,1)
      ● Clear k-buffer stenciled, 3rd pass
      ● Clear on first fragment
      ● Count
    ● [Diagram: mutex/count/index value with mutex bit (high bit), initialized bit, max index (4 bits), count (remainder)]
  • The first k

        // Atomically claim the next slot; the first KBUFFER_SIZE fragments fill the k-buffer directly
        InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
        [allow_uav_condition]
        if( oldCount < KBUFFER_SIZE )
        {
            DATA( vScreenAddress, oldCount ) = u_inData;
            DEPTH( vScreenAddress, oldCount ) = u_inDepth;
            return uint2( u_outDepth, u_outData );
        }

    ● [Diagram: mutex/count/index value with mutex bit (high bit), initialized bit, max index (4 bits), count (remainder)]
  • Models
    ● 2k polygons, ~20k hairs, ~130k hairs
    ● Stats: 2-3.5 M fragments, 200-300k pixels
    ● Shading: one point light & shadow, 2 shifted specular lobes
  • Depth Complexity [Image legend: grey = 1, blue = 8, green = 50, red = 100+]
  • Contention: max attempts per pixel, k=4 [Image legend: dark blue = 1, aqua <= 4, bright aqua <= 8]
  • Performance: time ratio to out-of-order blending
    ● Forward PPLL: 1.02 to 1.4
    ● Forward k-Buffer: 1.2 to 1.4
    ● Deferred PPLL: 0.7 to 0.9
    ● Deferred k-Buffer: 0.9 to 1.6
  • K-Buffer in Memory
    ● Simple memory bound
    ● Can be less memory
    ● Usually slower
      ● Increased memory traffic
  • Simulation
  • Hair Simulation
    ● Length Constraint
    ● Local Constraint
    ● Global Constraint
    ● Model Transform
    ● Collision Shapes
    ● External Forces (wind, gravity, etc.)
  • Fur Simulation
    ● Length Constraint
    ● Local Constraint
    ● Global Constraint
    ● Model Transform
    ● Collision Shapes
    ● External Forces (wind, gravity, etc.)
  • Grass Simulation
    ● Length Constraint
    ● Local Constraint (1D)
    ● Global Constraint
    ● Model Transform
    ● Collision Shapes
    ● External Forces (wind, gravity, etc.)
  • Constraint Method (iterative)
    ● Used for length, local and global constraints
    ● Length is most difficult to converge
      ● Particularly under large movement
    ● [Diagram: strand particles p0 … pn-1 joined by constraints C0 … Cn-2]
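To see why the length constraint is slow to converge, it helps to run a generic iterative projection on a stretched strand. The Python sketch below is a textbook position-based scheme, not the TressFX simulation code: the function name, the pinned root and the equal split of the correction between neighbouring particles are assumptions.

```python
import math


def enforce_lengths(points, rest_length, iterations=10):
    """Iteratively pull each segment of a strand back to rest_length.

    points: list of (x, y, z) particle positions; points[0] is the pinned root.
    Returns the corrected positions as a list of tuples.
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            delta = [b[c] - a[c] for c in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue                      # degenerate segment, nothing to fix
            stretch = (dist - rest_length) / dist
            # The root never moves; other pairs share the correction equally.
            wa, wb = (0.0, 1.0) if i == 0 else (0.5, 0.5)
            for c in range(3):
                a[c] += wa * stretch * delta[c]
                b[c] -= wb * stretch * delta[c]
    return [tuple(p) for p in pts]
```

Each pass fixes one segment at the expense of its neighbours, so a strand that has been stretched a lot needs many passes before every segment is back at its rest length. That cost is what the direct solve is meant to avoid.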
  • Tridiagonal Matrix Formulation
    ● Direct solve for length constraint
      ● Almost zero stretch
      ● Limited to smaller time steps (stability)
    ● Still cheap
      ● Leverages matrix structure of strands
      ● Two sweeps of strand
  • Tridiagonal Matrix Formulation “Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation”, VRIPHYS, 2013
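The "two sweeps of strand" cost is characteristic of tridiagonal solvers in general. As a hedged illustration, here is the standard Thomas algorithm in Python; it is a generic solver and makes no claim to reproduce the specific system set up in the cited paper. The argument layout (`lower[i]` multiplies x[i-1], `upper[i]` multiplies x[i+1]) is a convention chosen for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with one forward and one backward sweep.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix
    is assumed to be well conditioned (for example diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n   # modified upper coefficients
    d = [0.0] * n   # modified right-hand side

    # Forward sweep: eliminate the lower diagonal.
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Backward sweep: substitute from the last unknown to the first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The work is linear in the number of unknowns, which for a hair strand means linear in the number of segments.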
  • Demos
  • Summary
    ● Next-gen look is possible now!
    ● Deferred Rendering for shading LOD is fastest
    ● k-buffer in memory is an option for memory-constrained situations
    ● High-quality grass and fur simulation with compute
    ● Upcoming TressFX 2 SDK sample update with fur scenario at http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/
  • Questions?
  • Extras
  • Isoline Tessellation for hair/fur? 1/2
    ● Isoline tessellation has two tess factors
      ● First is line density (lines per invocation)
      ● Second is line detail (segments per line)
    ● In theory provides easy LOD system
      ● Variable line density and detail by increasing both tessellation factors based on distance
    ● [Images: Tess = (1,1), (2,1), (2,2), (2,3), (3,3)]
  • Isoline Tessellation for hair/fur? 2/2
    ● In practice isoline tessellation is not cost effective for this scenario
    ● Lines are always 1-pixel thick
      ● Need GS to extrude them into triangles for smooth edges
        ● Major impact on performance!
      ● Alternative is to enable MSAA
        ● Most engines are deferred so this causes a large performance impact
      ● No extrusion for smoothing edges and no MSAA = poor quality!
    ● Bottom line: a pure Vertex Shader solution is faster
      ● LOD benefit is easily done in VS (more on this later)
      ● Curvature is rarely a problem (dependent on vertices/strands at authoring time)
  • AA, Self-shadowing and Transparency [Images: basic rendering; antialiasing; antialiasing + self shadowing; antialiasing + self shadowing + transparency]