Design and implementation of GPU-based SAR image processor

Najeeb Ahmad
Master Thesis Presentation
May, 2012
Supervisor: Dr. Sun Jinping
School of Electronic Information
Engineering
Beihang University, Beijing, China

Contents
1. Introduction
2. GPU Computing
3. SAR Processing
4. Implementation
5. Conclusion & Future Work

1. Introduction
 Problem
 Motivation
 Objective
 Methodology

PROBLEM
Synthetic Aperture Radar (SAR) data processing is
computationally intensive and time consuming on
conventional CPUs. Given the growing use of GPUs for
scientific computing, this work accelerates the
simplified range Doppler SAR processing algorithm on
a GPU using modern GPGPU technology, aiming for
real-time or near-real-time performance, and
evaluates the suitability of the GPU for SAR
processing.

MOTIVATION
 Computationally intensive and time consuming
nature of SAR processing algorithms.
 Inherent algorithm parallelism in most SAR
processing algorithms.
 Advent of modern GPGPU technology and
availability of commodity GPUs as general
purpose computation engines.
 Architectural parallelism and availability of
sufficient hardware resources in modern GPUs
rendering them especially useful for handling
large data quantities and parallel SAR algorithm
implementation.

OBJECTIVE
 To implement and accelerate the simplified range
Doppler SAR processing algorithm on a modern
NVIDIA TESLA GPU using CUDA and MATLAB-GPU
capabilities.
 The research explores areas such as:
 Algorithm adaptation for parallel implementation.
 Suitability of MATLAB for algorithm implementation.
 Suitability of CUDA for algorithm implementation.
 Comparison of CPU/CUDA/MATLAB-GPU
implementations.
 GPU as SAR processing platform.

METHODOLOGY
 Algorithm implementation and verification on Intel
Xeon CPU using MATLAB.
 Identification of parallelizable portions of
algorithm.
 Algorithm implementation on TESLA C1060 GPU
using MATLAB’s native GPU capabilities.
 Algorithm implementation on TESLA C1060 GPU
using CUDA.
 Analysis of CPU, MATLAB-GPU and CUDA
implementations.

2. GPU Computing
 Introduction to GPU Computing
 GPGPU: Brief History
 NVIDIA CUDA
 Writing efficient code

Introduction to GPU Computing
 Use of Graphics Processing Units (GPUs) for
general purpose computing applications.
 CPU: Single, four or eight cores. Capable of
handling few threads. Suitable for serial code.
 GPU: Hundreds of cores. Capable of handling
hundreds of threads. Suitable for parallel code.

Introduction to GPU Computing
 GPU Computing Model: Heterogeneous
computing model employing both CPU and GPU
with serial computing on CPU, parallel computing
on GPU.

GPGPU: Brief History
 First use of GPU as general purpose computing
device, around 1999-2000 using graphics APIs.
Huge performance boosts observed. Generally
unpopular due to tedious programming.
 Introduction of NVIDIA's “CUDA” and AMD's
“Stream Computing” in 2007. Beginning of the
modern GPGPU era. Other vendors introduced
their own GPGPU systems.
 NVIDIA's CUDA gaining popularity due to its
maturity and performance.

NVIDIA CUDA
 Compute Unified Device Architecture.
 Comprises an Instruction Set Architecture (ISA)
and a parallel compute engine in the GPU,
programmable with high-level languages
extended for GPU computing.
 The CUDA framework comprises two parts:
hardware and software. From the software
perspective, CUDA means C/C++ and FORTRAN
extended to support GPU computing.
 CUDA is a “Single Instruction Multiple Thread”
(SIMT) architecture.

CUDA Hardware
 Streaming multiprocessor (SM): Basic computing unit
of the GPU. Comprises eight streaming processors
(SP) and memory. GPUs differ in the number of
SMs and in SP clock frequency.
[Figure: block diagram of one SM, showing eight SPs, two SFUs, the MT IU and shared memory]

CUDA Memory Architecture
 Understanding of memory architecture critical for
writing efficient CUDA programs.
 All CUDA-enabled hardware has the following types
of memory:
 Global memory
 Shared memory and registers.
 Texture memory and texture cache.
 Constant memory and constant cache.
 Local memory for register spilling.
[Figure: CUDA memory architecture. Each SM (SM 1 to SM n) contains SPs, shared memory, a texture cache and a constant cache. All SMs on the GPU share global memory (RAM), which also holds local memory, texture memory and constant memory.]

NVIDIA TESLA C1060 GPU
 PCI Express 2.0 compliant computing processor
board based on NVIDIA Tesla T10 graphics
processing unit targeted for HPC applications.
 Feature highlights
 30 SMs = 240 SPs.
 SP Clock = 1.296 GHz
 4 GB GDDR3 memory with 102
GB/s bandwidth.
 IEEE 754 single and double
floating point compliant.
 933 GFLOPS single and 78
GFLOPS double precision
performance.
 Compute capability: 1.3
 Supported by MATLAB for
GPU computing
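
The peak figures in the feature list can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope check: the per-clock operation counts (three single-precision operations per SP, one double-precision multiply-add unit per SM) and the 800 MHz double-data-rate memory clock are assumptions that do not appear on the slides.

```python
# Back-of-envelope check of the Tesla C1060 peak figures.
SM_COUNT = 30
SP_PER_SM = 8
SP_CLOCK_GHZ = 1.296

# Assumption: each SP can issue a multiply-add plus a multiply per clock (3 flops).
SP_FLOPS_PER_CLOCK = 3
# Assumption: one double-precision multiply-add unit per SM (2 flops per clock).
DP_FLOPS_PER_SM_PER_CLOCK = 2
# Assumption: 512-bit memory bus, 800 MHz memory clock, double data rate.
BUS_WIDTH_BITS = 512
MEM_CLOCK_GHZ = 0.8

sp_count = SM_COUNT * SP_PER_SM
single_gflops = sp_count * SP_CLOCK_GHZ * SP_FLOPS_PER_CLOCK
double_gflops = SM_COUNT * SP_CLOCK_GHZ * DP_FLOPS_PER_SM_PER_CLOCK
bandwidth_gb_s = BUS_WIDTH_BITS / 8 * MEM_CLOCK_GHZ * 2

print("SPs:", sp_count)
print("Single precision GFLOPS:", round(single_gflops, 2))
print("Double precision GFLOPS:", round(double_gflops, 2))
print("Memory bandwidth GB/s:", round(bandwidth_gb_s, 1))
```

Under these assumptions the results land on the 240 SPs, roughly 933 and 78 GFLOPS, and the roughly 102 GB/s bandwidth listed in the hardware resources table.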

CUDA Programming Model
 At its core are thread groups, shared memory and
barrier synchronization.
 Provides coarse-grained data and task
parallelism and fine-grained data and thread
parallelism providing expressivity and scalability.
 Thread hierarchy: Grid, blocks, threads.
 Kernels: Functions executed on device (GPU) in
parallel threads.
 CUDA provides APIs to run and launch kernels in
parallel threads and to synchronize them.
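
The grid/block/thread hierarchy can be illustrated without a GPU. The following is a serial Python emulation, not CUDA code: the names `launch` and `scale_kernel` are hypothetical, and the loops stand in for threads that a GPU would run in parallel.

```python
# Serial emulation of a 1-D CUDA launch: grid -> blocks -> threads.
def launch(kernel, grid_dim, block_dim, *args):
    """Call the kernel once per (block, thread) pair, as a GPU would in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, factor, out):
    # Global index, the analogue of blockIdx.x * blockDim.x + threadIdx.x.
    i = block_idx * block_dim + thread_idx
    if i < len(data):  # guard: the last block may be only partly used
        out[i] = data[i] * factor


data = list(range(10))
out = [0] * len(data)
block_dim = 4
grid_dim = (len(data) + block_dim - 1) // block_dim  # ceiling division
launch(scale_kernel, grid_dim, block_dim, data, 2, out)
print(grid_dim, out)
```

The bounds check inside the kernel matters whenever the data length is not a multiple of the block size, as here with 10 elements and blocks of 4.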

Processing Flow
 Copy input data from CPU to GPU memory.
 Load GPU program and execute, caching result
on the device.
 Copy results from GPU to CPU.
RAM
CPU
Host
Global memory
Constant
Texture
GPU
Device

Writing Efficient Code
 High priority considerations
 Minimum CPU-GPU transfers.
 Use of coalesced data transfers.
 Use of shared memory instead of global memory
whenever possible.
 Avoiding different execution paths within a warp.
 Medium priority considerations
 Access to shared memory should be planned to
avoid serialization.
 Redundant data transfers from global memory
should be avoided.

Writing Efficient Code
 Threads per block should be multiple of 32.
 Use of fast math library whenever possible.
 Low Priority Considerations
 Use of zero copy operations.
 For kernels with long argument lists, some arguments
should be placed in constant memory.
 Expensive modulo, division operations should be
avoided in favor of shift operations whenever
possible.
 Automatic conversion of double to float should be
avoided.
 Loop unrolling should be used whenever possible.
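
The advice on replacing division and modulo with shift operations applies when the divisor is a power of two. A minimal sketch in Python (the helper names are hypothetical, and the identities are shown for non-negative integers only):

```python
# For non-negative integers and a power-of-two divisor 2**k:
#   x // 2**k == x >> k      and      x % 2**k == x & (2**k - 1)
def div_pow2(x, k):
    """Integer division by 2**k using a right shift."""
    return x >> k


def mod_pow2(x, k):
    """Remainder modulo 2**k using a bit mask."""
    return x & ((1 << k) - 1)


K = 5  # 2**5 = 32, the warp size
for x in (0, 31, 32, 100, 4095):
    # Typical use: split a thread index into a warp number and a lane within the warp.
    print(x, "-> warp", div_pow2(x, K), "lane", mod_pow2(x, K))
```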

3. SAR Processing
 What is Synthetic Aperture Radar
 SAR Processing
 Processing Algorithms
 Basic RDA
 Simplified RDA

What is Synthetic Aperture Radar
 An active microwave remote sensing imaging
system.
 Employs long range propagation characteristics
of radar and complex signal processing
techniques to produce high resolution images.
 High resolution achieved by synthesizing long
antenna aperture through signal processing
techniques.
 Pros (in comparison with optical systems):
 All weather and day and night operation.
 No effects of constituents of atmosphere.
 Sensitivity to dielectric properties (can image ice,
biomass etc.)
 Sensitivity to surface roughness (oceans, wind

What is Synthetic Aperture Radar
 Accurate measurement of distance.
 Sensitivity to man made objects.
 Sensitivity to target structure.
 Subsurface penetration.
 Cons
 Complex interactions (difficult to visualize and
understand)
 Speckle effects (difficult in visual interpretation)
 Topographic effects

SAR Processing
 A set of procedures to obtain an interpretable image
from raw data that is spread in the azimuth and
range directions.
 In range, the data is spread by the duration of the
transmitted FM pulse.
 In azimuth, the data is spread by the duration for
which a point target is illuminated by the radar beam.
 SAR processing compresses this data, taking into
account range cell migration, earth curvature,
earth rotation and air/spacecraft attitude noise, to
produce the final image.
 Given the nature of the SAR system and its signals,
signal processing rather than image processing
provides the appropriate tools for SAR processing.

SAR Processing Algorithms
 Mainstream SAR processing algorithms include:
 Range Doppler algorithm (RDA)
 High resolution images for low squint and for relatively
smaller aperture sizes. Very popular.
 Chirp scaling algorithm (CSA)
 Two-dimensional operations with range independence
followed by range corrections in range Doppler domain.
 Omega-K algorithm (ωKA)
 Efficient and accurate in two-dimensional frequency
domain.
 SPECAN algorithm
 Good for medium to low resolution requirements.

Range Doppler Algorithm
 Versions of range Doppler:
 Basic RDA
 RDA with accurate SRC
 RDA with approximate SRC
 Simplified range Doppler

Basic RDA
Processing flow:
1. Raw data
2. Range compression: range FFT, matched filter multiply, range IFFT
3. Azimuth FFT: data in the range Doppler domain
4. RCMC: interpolation operation in the range Doppler domain
5. Azimuth compression: azimuth matched filter multiply
6. Azimuth IFFT and look summation: brings the signal back into the time domain
7. Final image
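
The range compression step (range FFT, matched filter multiply, range IFFT) can be sketched on a toy signal. This is an illustration under stated assumptions, not the thesis code: the line length, pulse length, FM rate and target delay are hypothetical, and a direct O(N^2) DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(N^2) DFT; adequate for a toy example, an FFT is used in practice."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


N = 64               # samples in one range line (hypothetical)
PULSE_LEN = 16       # samples in the transmitted chirp (hypothetical)
K = 0.5 / PULSE_LEN  # FM rate in cycles/sample^2 (hypothetical)
DELAY = 20           # echo delay of a single point target, in samples (hypothetical)

# Reference chirp, centred so its frequency sweeps symmetrically about zero.
chirp = [cmath.exp(1j * cmath.pi * K * (n - PULSE_LEN / 2) ** 2) for n in range(PULSE_LEN)]
reference = chirp + [0j] * (N - PULSE_LEN)

# Received range line: the chirp delayed by DELAY samples.
received = [0j] * N
for n, value in enumerate(chirp):
    received[DELAY + n] = value

# Range compression: range FFT, matched filter multiply, range IFFT.
matched_filter = [h.conjugate() for h in dft(reference)]
spectrum = dft(received)
compressed = dft([s * h for s, h in zip(spectrum, matched_filter)], inverse=True)

magnitudes = [abs(v) for v in compressed]
peak_index = magnitudes.index(max(magnitudes))
print("peak at sample", peak_index, "with magnitude", round(max(magnitudes), 3))
```

The energy that was spread over the 16-sample pulse collapses into a single peak at the target delay, which is the sense in which the data is "compressed" in range.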

Simplified RDA
 For narrower swath widths and medium resolution
requirements, range cell migration (RCM) can be
assumed independent of range.
Processing flow:
1. Raw data
2. Pre-filtering: removes the Doppler centroid
3. Range compression: range FFT, matched filter multiply (no range IFFT)
4. Azimuth FFT: both range and azimuth in the frequency domain
5. RCMC: RCM phase function multiply with each range line
6. Range IFFT: data in the range Doppler domain
7. Azimuth compression
8. Azimuth IFFT and look summation
9. Final image
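
In the simplified RDA, RCMC becomes a phase multiply because a linear phase ramp in the range frequency domain is equivalent to a shift in range time. The sketch below demonstrates only that shift property on one toy range line. The line length and the migration of 3 samples are hypothetical, the line is transformed here only to reach the frequency domain that the real data is already in at this stage, and a direct DFT stands in for the FFT.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(N^2) DFT; adequate for a toy example, an FFT is used in practice."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


N = 32          # samples in the range line (hypothetical)
TARGET = 10     # range sample where the migrated target currently sits (hypothetical)
MIGRATION = 3   # range cell migration to remove, in samples (hypothetical)

range_line = [0j] * N
range_line[TARGET] = 1 + 0j

spectrum = dft(range_line)
corrected_spectrum = []
for k, value in enumerate(spectrum):
    f = k if k < N // 2 else k - N  # signed range-frequency bin
    # RCM phase function: a linear phase ramp advances the line by MIGRATION samples.
    phase = cmath.exp(2j * cmath.pi * f * MIGRATION / N)
    corrected_spectrum.append(value * phase)

corrected = dft(corrected_spectrum, inverse=True)
peak = max(range(N), key=lambda n: abs(corrected[n]))
print("target moved from sample", TARGET, "to sample", peak)
```

In the actual algorithm the amount of migration varies with azimuth frequency, so each range line would get its own phase function; here a single fixed value is used for clarity.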

4. Implementation
 Hardware resources
 Software resources
 CPU Implementation
 MATLAB GPU Implementation
 CUDA Implementation
 Result Comparison

Hardware resources
GPU
Name                        NVIDIA Tesla C1060
# of cores                  240
SP clock                    1.296 GHz
Memory                      4 GB GDDR3
Maximum memory bandwidth    102 GB/s
Memory interface            512 bit – PCI Express
GFLOPS                      933 single precision, 78 double precision

CPU
Name                        Intel Xeon E5504
CPU clock                   2 GHz
# of cores                  4
System memory               4 GB
DDR3 clock                  800 MHz
Maximum memory bandwidth    19.2 GB/s
Memory type                 DDR3 PC3
PCI slot                    PCI Express

Software resources
CPU GPU
 Windows 7 Ultimate 64-bit
 MATLAB release 2010b
 Visual Studio 2008 SP1
 CUDA Toolkit 4.1
 MATLAB release 2010b
 NVIDIA Parallel Nsight
 Visual Profiler
 CUDA MEMCHECK
 CUFFT library

RADARSAT-1 Data
• CEOS format
• Raw data must be extracted from the CEOS data before the SAR processing algorithm can be applied.
• Extraction flow: CEOS data → CEOS data extraction utility → raw SAR data

Table: RADARSAT-1 data parameters
Parameter                    Value      Units
Sampling rate                32.317     MHz
Range FM rate                0.72135    MHz/µs
Pulse duration               41.74      µs
Radar frequency              5.3        GHz
Radar wavelength             0.05657    m
Pulse repetition frequency   1256.98    Hz
Effective radar velocity     7062       m/s
Azimuth FM rate              1733       Hz/s
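
Several useful quantities follow directly from the table values. The sketch below derives a few of them; the speed of light is the only input not taken from the slides, and the variable names are hypothetical.

```python
# RADARSAT-1 parameters from the table.
SAMPLING_RATE_MHZ = 32.317
RANGE_FM_RATE_MHZ_PER_US = 0.72135
PULSE_DURATION_US = 41.74
RADAR_FREQUENCY_GHZ = 5.3

C_M_PER_S = 299_792_458.0  # speed of light (not from the slides)

# Bandwidth swept by the linear FM pulse: FM rate times pulse duration.
chirp_bandwidth_mhz = RANGE_FM_RATE_MHZ_PER_US * PULSE_DURATION_US
# Number of range samples over which each echo is spread before compression.
samples_per_pulse = SAMPLING_RATE_MHZ * PULSE_DURATION_US
# Ratio of sampling rate to chirp bandwidth.
oversampling_ratio = SAMPLING_RATE_MHZ / chirp_bandwidth_mhz
# Wavelength from the nominal radar frequency.
wavelength_m = C_M_PER_S / (RADAR_FREQUENCY_GHZ * 1e9)
# Slant range spacing between adjacent range samples.
range_sample_spacing_m = C_M_PER_S / (2 * SAMPLING_RATE_MHZ * 1e6)

print("Chirp bandwidth (MHz):", round(chirp_bandwidth_mhz, 3))
print("Samples per pulse:", round(samples_per_pulse, 1))
print("Oversampling ratio:", round(oversampling_ratio, 3))
print("Wavelength (m):", round(wavelength_m, 5))
print("Range sample spacing (m):", round(range_sample_spacing_m, 3))
```

The samples-per-pulse figure indicates how long the range matched filter is, which in turn drives the FFT lengths used for range compression.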

SAR Processing GUI
Functions
• CEOS data extraction.
• MATLAB-CPU SAR processing.
• MATLAB-GPU SAR processing.
• CUDA input/output manipulation.
• CUDA program execution.

CPU Implementation
 Implemented using MATLAB
 FFT/IFFT using standard MATLAB functions

CPU Processed SAR image
A 2048 x 4096 SAR image using CPU based implementation

MATLAB-GPU Implementation
 MATLAB has supported GPU computing since
release 2010b.
 Implemented using native MATLAB-GPU functions only
(no CUDA kernel calls).
 Vectorization strategy employed to implement vector-
matrix multiplications on GPU.
 All FFT/IFFTs performed using MATLAB-GPU FFT/IFFT
support functions.
[Figure: vectorization of the vector-matrix multiplication across data columns (Column 1, Column 2, ..., Column n)]
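
One plausible reading of the vectorization strategy is that the filter vector is replicated into a full matrix so that a single element-wise product replaces a loop over columns. The Python sketch below illustrates that idea with plain lists; it is an assumption about the approach, not the thesis code, and the helper names are hypothetical.

```python
def replicate_columns(vector, n_cols):
    """Build a matrix whose every column equals `vector` (similar to MATLAB's repmat)."""
    return [[v] * n_cols for v in vector]


def elementwise_multiply(a, b):
    """Element-wise product of two equally sized matrices (similar to MATLAB's .*)."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


data = [[1, 2, 3],
        [4, 5, 6]]         # 2 rows x 3 columns
filter_vector = [10, 100]  # one weight per row, applied to every column

# The replicated filter is as large as the data itself, which is the memory cost
# of this strategy.
filter_matrix = replicate_columns(filter_vector, n_cols=3)
result = elementwise_multiply(data, filter_matrix)
print(result)
```

The replicated matrix doubles the storage needed for this step, which is consistent with the GPU memory limit on image size noted for the MATLAB-GPU implementation.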

 Limit on maximum image size that can be
calculated due to GPU memory constraints.

 Speedup as high as 21 achieved compared with
CPU implementation

A 2048 x 4096 SAR image using MATLAB-GPU based implementation

 Advantages
 Quick and easy to implement
 Sufficient speedups obtained with little effort
 Little knowledge of GPU hardware and no
knowledge of optimization techniques required.
 Disadvantages
 Currently, limited number of MATLAB functions
supported on GPU.
 Not all overloads of a function available for GPU.
 Less control of hardware resources and memory.
 Not many optimization options.

CUDA Implementation
 Strategy
 Signal data read as binary file
 Vectors, matched filters calculated on CPU
 Vectors/signal data transferred to GPU
 Following kernels executed in order on GPU
 Pre-filtering kernel
 Range compression kernel
 RCMC kernel
 Azimuth compression kernel
 Image pixel calculation kernel
 Data transferred from GPU to CPU and saved on
disk as image.

Optimization considerations
 Chosen block size = 8 × 8 = 64. Conforms with
memory coalescing requirements.
 Constant variables stored in constant memory
 Local variables and local phase function
calculation used whenever possible to reduce
global memory access.
 CPU-GPU data transfer kept to a minimum: data is
transferred CPU→GPU at the beginning and
GPU→CPU at the end of the algorithm.
 CUFFT's cufftPlanMany() plans used for
FFT/IFFTs along data columns.
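
The launch geometry implied by an 8 × 8 block can be worked out for the 2048 x 4096 image used in the results. The sketch below is plain Python arithmetic, not CUDA code. Which dimension is treated as the width and the row-major pixel layout are assumptions, and the function names are hypothetical.

```python
BLOCK_X, BLOCK_Y = 8, 8
IMAGE_WIDTH, IMAGE_HEIGHT = 4096, 2048  # assumption: 4096 columns, 2048 rows
WARP_SIZE = 32


def ceil_div(a, b):
    """Number of blocks of size b needed to cover a elements."""
    return (a + b - 1) // b


grid_x = ceil_div(IMAGE_WIDTH, BLOCK_X)
grid_y = ceil_div(IMAGE_HEIGHT, BLOCK_Y)
threads_per_block = BLOCK_X * BLOCK_Y
warps_per_block = threads_per_block // WARP_SIZE


def global_index(block_idx, thread_idx):
    """Map (block, thread) coordinates to a linear pixel index (row-major assumed)."""
    bx, by = block_idx
    tx, ty = thread_idx
    col = bx * BLOCK_X + tx
    row = by * BLOCK_Y + ty
    return row * IMAGE_WIDTH + col


print("grid:", grid_x, "x", grid_y)
print("threads per block:", threads_per_block, "=", warps_per_block, "warps")
print("last pixel index:", global_index((grid_x - 1, grid_y - 1), (BLOCK_X - 1, BLOCK_Y - 1)))
```

With 64 threads per block, each block is a whole number of warps, so no warp is left partly filled.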

CUDA Implementation Results
A 2048 x 4096 SAR image using CUDA based implementation

CUDA/MATLAB-GPU/MATLAB-CPU
Computation Time Comparison

MATLAB-GPU/CUDA Computation
Time Comparison

MATLAB-GPU/CUDA speedup
comparison
 Speedups as high as 53 times achieved with
CUDA, compared with a maximum speedup of 21
times with MATLAB-GPU.

Conclusions
 Feasibility of GPU for SAR processing
 The amount of data, the computational effort and the
inherent algorithm parallelism make SAR processing
well suited to the GPU.
 TESLA C1060 GPU offers enough memory to
handle various common SAR image sizes.
 Cooling GPU may be a challenge in some
environments.
 Scalability of CUDA will prove to be an advantage when
porting existing SAR code to newer GPUs.
 GPUs might not be suitable where customizable
hardware is required or military hardware standards
must be adhered to.

Conclusions
 MATLAB-GPU based SAR Processing
 Significant speedups compared with CPU.
 Quick and easy to implement.
 Has some limitations:
 Currently limited function support on the GPU. Expected to
improve with future MATLAB releases.
 The vectorization strategy needs more memory. Future releases
promise to remove the need for vectorization (e.g. bsxfun in
release 2012a).
 Less control over GPU resources (memory etc.).
 CUDA SAR Processing
 CUDA: Flexible and scalable, with a modest learning curve.
 More control over GPU resources.
 Optimization strategies can be applied.
 Faster and more memory efficient than MATLAB
implementation.

Conclusions
 Downsides of GPU
 Significant testing/verification effort might be
required if the GPU hardware has to be upgraded (e.g.
when the old hardware becomes obsolete).
 Proprietary nature of CUDA might be problematic in
case company discontinues CUDA or its support.

Future work
 CUDA kernels can be called in MATLAB code
using MATLAB’s CUDA kernel calling support.
 MATLAB GPU implementation can be improved
as newer and better functions become available.
 C/C++ based CPU implementation can be
developed to better judge MATLAB-CPU/CUDA
performance.
 Other SAR processing algorithms can be
implemented using framework laid out in this
project.
