An Implementation of a FIR Filter on a GPU Alexey Smirnov and Tzi-cker Chiueh ECSL Research Seminar 9/13/05
Outline Introduction GPU Computing Overview Related Work FIR Filter Definition FIR Filter Implementation on GPU Performance Evaluation Conclusion
Introduction Numerical algorithms often perform repeated computations on vectors of elements. Parallel computation improves performance. x86: MMX, SSE, SSE2, SSE3. Video cards are now programmable.
Computation and Bandwidth Rates Video cards offer a higher GFLOPS rate and higher memory bandwidth than CPUs. However, copying data between main memory and video memory can reduce performance.
GPU Computing Background Rendering pipeline: The user program defines vertex and texture coordinates. The vertex processor converts vertex attributes from the world coordinate system into the screen coordinate system. Interpolation defines the coordinates and color for each pixel. The fragment processor computes the color of each output pixel using textures and color. The vertex and fragment processors are programmable, for example in Cg, a C-like language.
Rendering APIs OpenGL (Linux, Windows, MacOS) and DirectX (Windows). OpenGL extensions give access to advanced features of a video card. NV_float_buffer supports floating-point textures. ARB_render_texture allows rendering to a texture instead of the screen.
GPU Program Architecture
1. Create floating-point textures that contain the input data and load them into video memory.
2. Load the fragment program and enable multi-texturing.
3. Define vertex and texture coordinates.
4. Draw the figure to an off-screen buffer.
5. If the results were rendered to an off-screen buffer, copy the image to a texture using glCopyTexSubImage2D(). Go to step 3 if more iterations are needed.
6. Use glGetTexImage() to copy the data from video memory to main memory.
Input Data Representation Matrices map naturally onto textures. Each pixel holds four elements (R, G, B, A). Vectors are wrapped into matrices. Texture dimensions are limited to a maximum size.
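The wrapping of a vector into a texture can be sketched on the CPU as follows. This is a minimal illustration, not the layout used in the talk: the function names `pack_vector_rgba` and `texel_address`, the zero padding, and the `max_width` default are all assumptions made for the example.

```python
import math

def pack_vector_rgba(values, max_width=2048):
    """Wrap a 1-D vector into rows of RGBA texels (4 floats per texel).

    max_width is a hypothetical texture-width limit; real hardware
    reports its own maximum texture dimensions.
    """
    texels_needed = math.ceil(len(values) / 4)
    width = min(max_width, max(1, texels_needed))
    height = max(1, math.ceil(texels_needed / width))
    # Pad with zeros so the last row and the last texel are complete.
    padded = list(values) + [0.0] * (width * height * 4 - len(values))
    texture = []
    for row in range(height):
        start = row * width * 4
        texture.append([tuple(padded[start + 4 * c:start + 4 * c + 4])
                        for c in range(width)])
    return texture

def texel_address(index, width):
    """Map a vector index to (row, column, channel) in the packed texture."""
    texel, channel = divmod(index, 4)
    row, col = divmod(texel, width)
    return row, col, channel
```

For example, a 10-element vector packed with `max_width=2` occupies a 2x2 texture, and element 9 lands in row 1, column 0, channel 1 (the G component).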
Related Work Four papers describe matrix multiplication; others cover linear algebra operations, array sorting, and FFT. Earlier papers concluded that the CPU is more efficient than the GPU. Recent video cards, e.g. GeForce 7800 and ATI X800 XT, do better than the CPU.
FIR Filter Definition A Finite Impulse Response (FIR) filter is used in audio processing. We modified GNU Radio, an open-source implementation of Software Defined Radio.
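For reference, the textbook direct-form FIR filter computes each output as a weighted sum of the most recent inputs, y[n] = sum over k of h[k] * x[n-k]. The sketch below is a plain CPU version of that definition, not the GNU Radio code or the GPU implementation from the talk; the name `fir_filter` and the choice to treat samples before the start of the input as zero are assumptions.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * samples[n - k].

    Samples before the start of the input are treated as zero
    (an assumption of this sketch).
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # skip taps that reach before the first sample
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

With the two-tap averaging filter `[0.5, 0.5]`, the input `[1, 2, 3, 4]` gives `[0.5, 1.5, 2.5, 3.5]`.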
Other Relevant Transformations Hilbert transformation: Frequency translation FIR filter:
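The formulas on this slide did not survive. As a rough guide only, a frequency-translating FIR filter is commonly described as mixing the input down by a center frequency, low-pass filtering it with the FIR taps, and optionally decimating. The sketch below follows that common description; it is not taken from the talk or from GNU Radio, and the name `freq_xlating_fir` and all of its parameters are assumptions.

```python
import cmath

def freq_xlating_fir(samples, taps, center_freq, sample_rate, decimation=1):
    """Shift center_freq down to 0 Hz, apply FIR taps, keep every Nth output.

    A sketch of the usual textbook formulation (assumed, not from the source):
    y[n] = sum_k taps[k] * x[n-k] * exp(-2j*pi*center_freq*(n-k)/sample_rate)
    """
    out = []
    for n in range(0, len(samples), decimation):
        acc = 0j
        for k, h in enumerate(taps):
            m = n - k
            if m >= 0:  # inputs before the first sample count as zero
                phase = -2.0 * cmath.pi * center_freq * m / sample_rate
                acc += h * samples[m] * cmath.exp(1j * phase)
        out.append(acc)
    return out
```

A complex tone at exactly `center_freq` comes out as a constant (DC) signal once the filter has filled, which is a quick sanity check for the mixing step.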
FIR Filter on a GPU
FIR Filter’s Loop Initialization: Loop iteration:
FIR Filter’s Loop O(j+1)=O(j)+MI Final output value is computed as
Fragment Program
Optimizations Break the loop into two to get rid of the conditional expression; Unroll the loop body with and without the conditional expression; Process two rows of input and textures; Use different texture units in unrolled loops; None of the above improved performance.
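The first optimization, splitting a loop so that the conditional disappears from the hot path, can be illustrated on a CPU-side FIR loop. The fragment program itself is not preserved in these slides, so this is only an analogy: it assumes the conditional is a boundary check, and the names `fir_with_branch` and `fir_split` are hypothetical.

```python
def fir_with_branch(samples, taps):
    """FIR loop with a boundary check inside the inner loop."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k in range(len(taps)):
            if n - k >= 0:  # conditional evaluated on every iteration
                acc += taps[k] * samples[n - k]
        out.append(acc)
    return out

def fir_split(samples, taps):
    """Same result, with the loop split into a warm-up part and a steady part."""
    ntaps = len(taps)
    warmup = max(0, min(ntaps - 1, len(samples)))
    out = []
    # Part 1: first outputs, where only taps 0..n overlap the input.
    for n in range(warmup):
        out.append(sum(taps[k] * samples[n - k] for k in range(n + 1)))
    # Part 2: all taps overlap the input, so no check is needed.
    for n in range(warmup, len(samples)):
        out.append(sum(taps[k] * samples[n - k] for k in range(ntaps)))
    return out
```

Both functions produce the same output; the split version simply moves the boundary handling out of the main loop.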
Performance Evaluation: FIR Filter
Performance of FreqXlating FIR Filter
Performance of Hilbert Transformation
Conclusion Not everything benefits from GPU optimization. CPU optimization tricks do not work on the GPU. Texture upload and download take up to 60% of the total time. A GPU computation can take several seconds, compared with the milliseconds needed to render a frame in a game.
Future Work QoS for GPU: can an application specify a maximum latency or a share of GPU resources? Work offload from CPU to GPU: is it possible to build a compiler that automatically decides what is worth optimizing for the GPU? Debugging support: many tools exist for Windows, none for Linux.
