CPU is in Focus Again! Implementing DOF on CPU.

Presented at Russian Game Developers Conference 2011.

Depth of Field (DoF) is an optical focus effect widely used in photography, movies, 3D graphics and games to draw the viewer's attention to a part of the scene. Until recently, this effect was too computationally expensive to do in real time, but with the growing power of graphics processors, DoF is becoming widely used in modern computer games, raising the level of visual experience.

A physically correct DoF effect can be achieved with ray tracing or an accumulation buffer, but both are still too compute-intensive to run in real time. Like many effects in computer graphics, there is no “right” way to do Depth of Field in a real-time application. Depth of Field Explorer offers developers a way to compare and contrast many different methods of calculating DoF and make an informed decision on the right balance between quality and performance on Sandy Bridge processors.

We present multiple DoF techniques along with a set of adjustable parameters which allow the user to explore their performance and quality characteristics. All DoF techniques have traditional implementations for GPU and some of them additionally have novel “CPU Onloaded” implementations, demonstrating advantages of integrated processor graphics on Sandy Bridge. The techniques presented are Poisson disk filter, separable Gaussian filter, Gaussian filter combined with Poisson disk, simple and advanced mipmap interpolation, and summed area tables (SAT) gather and scatter.

DoF Explorer demonstrates innovative “CPU Onloading” approaches to the Gaussian blur and summed area tables based DoF techniques. CPU Onloading moves compute-intensive work from the GPU to the CPU, allowing faster DoF post-processing with better load balancing between graphics and central processor cores. The CPU kernels demonstrate optimizations with SSE vector instructions and multi-threading with TBB, along with asynchronous execution of tasks on GPU and CPU. Using run-time controls, Depth of Field Explorer enables developers to compare the performance of traditional GPU-based implementations with the CPU versions.

Depth of Field Explorer is implemented as a DirectX application based on the DXUT framework and a custom post-processing pipeline infrastructure that makes it easy to run many different Depth of Field techniques. The pipeline infrastructure runs a sequence of stages on either the GPU or the CPU and supports asynchronous execution, which hides data-transfer latency between CPU and GPU. An integrated Oscilloscope performance monitor makes it easy to analyze the performance of DoF techniques by displaying charts of CPU and GPU execution times broken down by stage.

CPU Onloaded implementations of the summed area tables gather and scatter techniques are significantly faster than their traditional GPU implementations, showing 3x and 8x speedups respectively on a mobile system with a Core i7 2720QM.

Transcript

  • 1. CPU is in Focus Again! Implementing DOF on CPU. Advanced Visual Computing, 3D Graphics Team. Presenter: Evgeny Gorodetsky, Graphics Software Engineer (evgeny.gorodetsky@intel.com, twitter: egorodet)
  • 2. Agenda
      – Introduction to depth of field effect & techniques
      – DOF Explorer and post-processing pipeline
      – DOF Techniques on GPU & with CPU Onloading:
          – Traditional: Poisson Disk & Gaussian Blur
          – Advanced: Summed Area Tables Gather & Scatter
      – Performance results on Sandy Bridge processors
  • 3. Introduction to DOF: DEPTH OF FIELD EXPLAINED
  • 4. Depth of Field Explained
      – Common effect in:
          – Photography
          – Cinematography
          – Modern 3D games
      – Used to direct the viewer's attention
      – Optical nature of DoF:
          – Lens settings: Aperture (f-stop), Focal distance
          – Circle of Confusion (CoC)
          – Bokeh effect (not addressed)
      [Figure: CoC (blur radius) vs. distance from camera (depth), marking the Near, Focal and Far distances and the Max Blur Radius; curves: real dependency and linear approximation]
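The linear CoC approximation plotted on slide 4 could be sketched roughly as follows. This is a minimal illustration, not the presenter's code: the function name `coc_radius`, its parameters, and the assumption that the blur radius is zero at the focal distance and reaches the maximum at the near and far distances are all hypothetical readings of the figure labels.

```python
def coc_radius(depth, near, focal, far, max_blur):
    """Piecewise-linear blur radius (assumed shape, see lead-in).

    Returns 0 at the focal distance and ramps linearly up to max_blur
    at the near and far distances; values outside are clamped.
    """
    if depth <= focal:
        t = (focal - depth) / (focal - near)   # in front of the focal plane
    else:
        t = (depth - focal) / (far - focal)    # behind the focal plane
    return max_blur * min(max(t, 0.0), 1.0)
```

A per-pixel blur radius computed this way from the depth buffer is the kind of input the gather and scatter techniques on the later slides consume.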
  • 5. There’s no right DoF technique!
      – Physically correct reference techniques:
          – Ray Tracing
          – Accumulation Buffer
      – Real-time post-processing:
          – Gathering techniques:
              – Poisson Disk
              – Gaussian Blur
              – Summed area table Gather
          – Scattering techniques:
              – Summed area table Scatter
              – Heat diffusion simulation
      – Common challenges:
          – Color bleeding:
              – From sharp objects in front to blurred objects behind
              – From blurred objects behind to sharp objects in front
          – Blurriness discontinuities
          – Performance depending on resolution!
      [Figure: Gathering vs. Scattering, input and output]
  • 6. Depth of Field Explorer
      – Post-processing on GPU and with CPU Onloading
      – Compare DoF techniques:
          – On one of three scenes
          – Performance & quality
          – Runtime settings
      – Deferred rendering with async. CPU-GPU execution
      – Performance analysis
      [Table: Depth of Field technique | GPU | CPU. Rows: Poisson Disk; Gaussian Blur; Gaussian Blur mixed with Poisson Disk; Summed Area Table (SAT) Gather; Summed Area Table (SAT) Scatter; Simple MipMap; Advanced MipMap]
  • 7. Post-Processing Pipeline
      – Infrastructure simplifies CPU Onloading
      – Automatic resource management on GPU and CPU
      – Deferred execution mode in CPU Onloading:
          – Performs computing on CPU while doing work on GPU
          – Hides data transfer latency
      – Preview of intermediate resources
      – Integrated performance analysis tools
      [Pipeline diagram: Stage 1 "Render Scene" → Color [size, format], Depth [size, format] → Stage 2 "Poisson Disk DoF" → Color [size, format]. Defined by developer: Stage 1 render output pins, Stage 2 input pins, Stage 2 render output pin. Created by pipeline infrastructure: Stage 1 Render Target Views, Stage 1-2 Intermediate Resources, Stage 2 Shader Resource Views, Stage 2 Screen Render Target]
  • 8. Depth of Field Explorer: DX and UI Controls
      [Screenshot labels: Common explorer controls; Pipeline Oscilloscopes (F6) for CPU & GPU; Pipeline Preview (F5); Technique-specific controls]
  • 9. Poisson Disk & Gaussian Blur on GPU & CPU: TRADITIONAL DOF TECHNIQUES
  • 10. Poisson Disk DOF Technique
      – Averages color by random Poisson disk samples around each pixel
      – Easy to implement on GPU
      – Not good for CPU, because of random memory access
      – Used for Bokeh simulation in some games
      – Variable number of Poisson taps can be generated in DOF Explorer
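  • 11. Gaussian Blur DOF Technique
      – Convolution of NxN neighbor pixels with pre-computed weights:
        $F_{x,y} = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}$; $P'_{x,y} = \sum_{i=1}^{N} \sum_{j=1}^{N} F_{i,j} \cdot P(x_i, y_j)$
      – Decomposed into 2 passes:
          – Vertical pass
          – Horizontal pass
        $F_{x} = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}$; $P'_{x,y} = \sum_{i=1}^{N} F_{i} \cdot \sum_{j=1}^{N} F_{j} \cdot P(x_i, y_j)$
      – Implementation:
          – Traditional for GPU in pixel shader
          – Novel for CPU, accelerated with TBB & SSE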
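The two-pass decomposition on slide 11 can be illustrated with a small single-channel sketch. It is not the DoF Explorer kernel: the helper names, the clamp-to-edge border handling, and the choice to normalize the weights by their sum (rather than by the analytic Gaussian constant) are assumptions made for the example.

```python
import math

def gaussian_weights(radius, sigma):
    """1-D Gaussian weights for offsets -radius..radius, normalized to sum to 1."""
    w = [math.exp(-(k * k) / (2.0 * sigma * sigma))
         for k in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def blur_rows(img, w):
    """Convolve every row with weights w, clamping reads at the image border."""
    r = len(w) // 2
    width = len(img[0])
    return [[sum(w[k + r] * row[min(max(x + k, 0), width - 1)]
                 for k in range(-r, r + 1))
             for x in range(width)]
            for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, radius, sigma):
    """Separable blur: a horizontal pass followed by a vertical pass."""
    w = gaussian_weights(radius, sigma)
    horiz = blur_rows(img, w)
    return transpose(blur_rows(transpose(horiz), w))
```

Each pass touches 2·radius+1 pixels per output instead of (2·radius+1)² for the full 2-D kernel, which is why the decomposition pays off on both GPU and CPU.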
  • 12. Gaussian Blur Pipeline
      [Pipeline diagram: Render Scene → Color 1280 x 800 and Depth 1280 x 800 → Resize x0.5 → Blurred Color 640 x 400 → Gaussian Horiz. Blur → Blurred Color 640 x 400 → Gaussian Vert. Blur → Blurred Color 640 x 400 → Simple Combine → DoF Color 1280 x 800. Stage groups are labeled GPU, CPU / GPU, GPU]
  • 13. Gaussian Blur on CPU: Multi-threading with TBB
      – 1. Vertical pass
      – 2. Horizontal pass
      [Figure: in each pass the image is split across threads with tbb::parallel_for and pixels are multiplied by the Gaussian weights F0 F1 F2 F3 F4]
  • 14. Gaussian Blur on CPU: Vectorization with SSE 4
      – Vertical pass: (F0 F0 F0 F0) x (R0 G0 B0 A0), (F1 F1 F1 F1) x (R1 G1 B1 A1), (F2 F2 F2 F2) x (R2 G2 B2 A2), … = (R0’ G0’ B0’ A0’)
      – Horizontal pass (cache friendly): weights (F0 F0 F0 F0), (F1 F1 F1 F1), (F2 F2 F2 F2), (F3 …) x pixels (R0 G0 B0 A0), (R1 G1 B1 A1), (R2 G2 B2 A2), (R3 …) = (R0’ G0’ B0’ A0’), (R1’ G1’ B1’ A1’), (R2’ G2’ B2’ A2’), (R3’ …)
  • 15. Gaussian Blur: Performance results
      [Chart: Gaussian Blur speedup with TBB parallel_for, time in milliseconds. 1 Thread: CPU Kernel Time 13.7, GPU Rendering 3.2. 8 Threads: CPU Kernel Time 4.4, GPU Rendering 5.6]
  • 16. Summed Area Tables Gather & Scatter: ADVANCED DOF TECHNIQUES
  • 17. Summed Area Tables
      – Enables averaging values in variable rectangle areas in constant time: just with 4 SAT-texture reads!
      – Source table:
            0  7  2  4
            1  4  1  2
            6  1  2  0
            0  3  5  2
      – Summed Area Table (SAT):
            0   7   9  13
            1  12  15  21
            7  19  24  30
            7  22  32  40
      – Averaging values in an area of the source table by SAT, with UL, UR, LL, LR the SAT values at the area corners:
        $\text{average} = \frac{LR - UR - LL + UL}{\text{width} \times \text{height}}$
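The table on slide 17 can be reproduced with a short sketch. The function names and the inclusive-rectangle convention are assumptions for illustration; the corner reads just outside the rectangle are treated as zero when they fall off the table.

```python
def build_sat(src):
    """Summed area table: sat[y][x] = sum of src over rows 0..y, columns 0..x."""
    h, w = len(src), len(src[0])
    sat = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += src[y][x]
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0)
    return sat

def box_average(sat, x0, y0, x1, y1):
    """Average over columns x0..x1 and rows y0..y1 (inclusive) with 4 SAT reads."""
    def s(x, y):
        return sat[y][x] if x >= 0 and y >= 0 else 0
    total = s(x1, y1) - s(x1, y0 - 1) - s(x0 - 1, y1) + s(x0 - 1, y0 - 1)
    return total / ((x1 - x0 + 1) * (y1 - y0 + 1))

source = [[0, 7, 2, 4],
          [1, 4, 1, 2],
          [6, 1, 2, 0],
          [0, 3, 5, 2]]
sat = build_sat(source)
```

With the slide's source table, `box_average(sat, 1, 1, 2, 2)` reads 24, 9, 7 and 0 from the SAT and averages the inner 2x2 block (4, 1, 1, 2) regardless of how large the block is.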
  • 18. Gathering vs. Scattering
      [Figure: Gathering and Scattering, each shown as input and output]
  • 19. SAT Gather DoF pipeline
      [Pipeline diagram: Render Scene → Color 8 bit/ch. and Depth → Build SAT (with Color Temp) → Color 32 bit/ch. → SAT Gather DoF → Color 8 bit/ch. Stage groups are labeled GPU, CPU / GPU, GPU]
  • 20. Building SAT on GPU in Pixel Shader
        Source: 1     2     3     4     5     6     7     8
        Pass 1: 1     1..2  2..3  3..4  4..5  5..6  6..7  7..8
        Pass 2: 1     1..2  1..3  1..4  2..5  3..6  4..7  5..8
        Pass 3: 1     1..2  1..3  1..4  1..5  1..6  1..7  1..8
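The pass pattern on slide 20 matches a recursive-doubling prefix sum, where each pass adds the element a power-of-two distance to the left. The sketch below shows one row only and is an illustration of that pattern, not the actual pixel shader; the function name is hypothetical.

```python
def prefix_sum_passes(values):
    """Return the row after each recursive-doubling pass (offsets 1, 2, 4, ...)."""
    cur = list(values)
    history = []
    offset = 1
    while offset < len(cur):
        # Every element is computed independently from the previous pass,
        # which is what makes this form suitable for a pixel shader.
        cur = [cur[i] + (cur[i - offset] if i >= offset else 0)
               for i in range(len(cur))]
        history.append(cur)
        offset *= 2
    return history

passes = prefix_sum_passes([1, 2, 3, 4, 5, 6, 7, 8])
```

For 8 elements this takes 3 passes, in line with the slide; a full SAT would repeat the same passes along the other axis.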
  • 21. Building SAT on CPU with SSE & TBB
      – Single pass on CPU
      – Simultaneously process RGBA channels as 4 floats with SSE 4 (128-bit wide vector instructions):
          – Can be easily extended to 256-bit wide AVX on Sandy Bridge
      – Split the texture into tiles and process them in parallel threads:
          – Implemented in TBB Tasks
          – Run tile-processing tasks respecting their dependencies
      – SAT element from the source pixel and its neighbors: $S_{i,j} = P_{i,j} + S_{i,j-1} + S_{i-1,j} - S_{i-1,j-1}$
      – Build SAT for each row j=1..n with a running row sum: $\mathrm{sum} \mathrel{+}= P_{i,j}$; $S_{i,j} = S_{i,j-1} + \mathrm{sum}$
      [Figure: 3 x 3 tile grid T1,1 … T3,3; neighborhood of Si-1,j-1, Si,j-1, Si-1,j and Pi,j]
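The tiled scheme on slide 21 could be sketched as below. The slide does not spell out the tile dependencies, so this example assumes each tile needs its left and upper neighbors to be finished and therefore processes tiles in anti-diagonal "waves"; tiles within one wave are independent and stand in for parallel tasks. The function name and return shape are hypothetical, and the waves run sequentially here.

```python
def build_sat_tiled(src, tile):
    """Build a SAT tile by tile; returns (sat, waves of tile coordinates)."""
    h, w = len(src), len(src[0])
    sat = [[0] * w for _ in range(h)]

    def s(x, y):
        return sat[y][x] if x >= 0 and y >= 0 else 0

    def process_tile(tx, ty):
        for y in range(ty * tile, min((ty + 1) * tile, h)):
            for x in range(tx * tile, min((tx + 1) * tile, w)):
                # Pixel plus left and upper SAT values, minus the doubly
                # counted upper-left SAT value.
                sat[y][x] = src[y][x] + s(x - 1, y) + s(x, y - 1) - s(x - 1, y - 1)

    tiles_x = (w + tile - 1) // tile
    tiles_y = (h + tile - 1) // tile
    waves = []
    for d in range(tiles_x + tiles_y - 1):
        wave = [(tx, d - tx) for tx in range(tiles_x) if 0 <= d - tx < tiles_y]
        for tx, ty in wave:  # independent tiles: candidates for parallel tasks
            process_tile(tx, ty)
        waves.append(wave)
    return sat, waves
```

In a real task-based version each tile would become a task released once its neighbors complete, rather than waiting for a whole wave.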
  • 22. SAT Scatter DoF pipeline
      [Pipeline diagram. Stages: Render Scene; Compute Blur Radius (add 100px margins); SAT Scatter; Build SAT; Resize with Crop (remove margins). Resources: Color 8 bit/ch. 1280 x 800; Depth; Blur Params.; Color 32 bit/ch. 1480 x 1000 (after Scatter and after Build SAT); Color Temp; Color DoF 8 bit/ch. 1280 x 800. Stage groups are labeled GPU, CPU / GPU, GPU]
  • 23. SAT Scatter: rectangle spreading
      – Spread pixels (derive), then build SAT (integrate).
      [Figure: input colors and input blur radius → ongoing rectangle spreading (+ and - values written at rectangle corners, with clearing and padding) → ongoing SAT building → computed output colors]
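A single-channel sketch of the "spread, then integrate" idea on slide 23 follows. It assumes each pixel spreads its value evenly over a square of side 2r+1 by writing signed contributions at the four rectangle corners, and that a padded buffer absorbs rectangles that overhang the image; the function name, argument layout and padding size are illustrative choices, not the DoF Explorer implementation.

```python
def sat_scatter(src, radius, max_radius):
    """Scatter each pixel over its blur rectangle, then integrate and crop."""
    h, w = len(src), len(src[0])
    pad = max_radius
    ph, pw = h + 2 * pad + 1, w + 2 * pad + 1
    acc = [[0.0] * pw for _ in range(ph)]

    # Spreading (derive): +v at upper-left, -v right of the rectangle,
    # -v below it, +v at the diagonal corner past the lower-right.
    for y in range(h):
        for x in range(w):
            r = min(radius[y][x], max_radius)
            v = src[y][x] / float((2 * r + 1) ** 2)
            x0, y0 = x + pad - r, y + pad - r
            x1, y1 = x + pad + r + 1, y + pad + r + 1
            acc[y0][x0] += v
            acc[y0][x1] -= v
            acc[y1][x0] -= v
            acc[y1][x1] += v

    # SAT building (integrate): running row sum plus the row above.
    for y in range(ph):
        run = 0.0
        for x in range(pw):
            run += acc[y][x]
            acc[y][x] = run + (acc[y - 1][x] if y > 0 else 0.0)

    # Crop the padding to return to the source size.
    return [row[pad:pad + w] for row in acc[pad:pad + h]]
```

Each source pixel costs four writes no matter how large its blur radius is, which is the property that makes the scatter formulation attractive.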
  • 24. SAT Scatter: Optimization Notes
      – Rectangle spreading on GPU:
          – Implemented in Geometry Shader
          – Requires a huge number of Draw Calls = width x height
          – Runs slowly even on high-end GPUs
          – Compute Shaders could help, but are not available on Sandy Bridge
      – Rectangle spreading on CPU:
          – Takes advantage of SSE 4 instructions for RGBA float channels
          – Multi-threaded with TBB Tasks (like SAT, but with different dependencies)
          – Much faster than on GPU: 8.3x on SNB GT2, 2.7x on NHM GTX 280
      – The rectangle spreading CPU stage can be fused with zeroing and SAT building to minimize memory footprint
      – Quality can be improved with repeated SAT integration (next slides)
  • 25. SAT Scatter: CPU Optimization Results
      [Figures: Sequential Rendering; Deferred Rendering]
  • 26. Higher Order SAT Scatter (1/4): original image, no filter
  • 27. Higher Order SAT Scatter (2/4): 1st order filter, box filter
  • 28. Higher Order SAT Scatter (3/4): 2nd order filter, triangle filter
  • 29. Higher Order SAT Scatter (4/4): 3rd order filter, parabolic filter
  • 30. PERFORMANCE RESULTS ON 2ND GENERATION CORE PROCESSORS
  • 31. Depth of Field Performance on Sandy Bridge: GPU mode vs. CPU Onloading
      – Significant speedup with CPU Onloading for advanced compute-intensive DoF techniques!
      [Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800, by DoF technique. Series: SNB Huron River 2720QM + HDG 3000: GPU only; SNB Huron River 2720QM + HDG 3000: CPU Onloading. Bar labels: 262, 161, 135, 137, 124, 67, 58, 60, 60, 40, 19, 8. Callouts: 3x and 8x]
  • 32. Depth of Field Performance on Sandy Bridge in GPU mode on HDG 3000 & HDG 2000
      – ~2x: strong dependency on the GPU, since the two GPUs differ twofold in compute power (12 vs. 6 EUs)
      [Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800, by DoF technique. Series: SNB Huron River 2720QM + HDG 3000: GPU only; SNB Sugar Bay 2600 + HDG 2000: GPU only. Bar labels: 262, 161, 135, 137, 125, 91, 70, 64, 60, 58, 35, 31, 19, 17, 8, 3]
  • 33. Depth of Field Performance on Sandy Bridge in CPU Onloading mode on HDG 3000 & HDG 2000
      – ~1.2-1.4x: less dependent on the GPU with extensive CPU Onloading!
      [Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800, by DoF technique. Series: SNB Huron River 2720QM + HDG 3000: CPU Onloading; SNB Sugar Bay 2600 + HDG 2000: CPU Onloading. Bar labels: 124, 90, 67, 60, 60, 53, 50, 40, 40, 34]
  • 34. DoF Techniques Overhead (1/2)
  • 35. Conclusion & Follow-ups
      – Accelerate traditional and advanced post-processing techniques with CPU Onloading on modern processors with integrated processor graphics
      – Optimize compute kernel code with Intel Parallel Studio, TBB, SSE/AVX, MKL, OpenCL and ICC:
          – http://software.intel.com/en-us/articles/intel-parallel-studio-home/
          – http://software.intel.com/en-us/articles/opencl-sdk/
          – http://software.intel.com/en-us/avx/
      – DOF source code article (will be published later):
          – http://software.intel.com/en-us/articles/dofexplorer
      – See other graphics samples:
          – http://software.intel.com/en-us/articles/code/