Introduction to IEEE STANDARDS and its different types.pptx
Design and implementation of GPU-based SAR image processor
1. Najeeb Ahmad
Master Thesis Presentation
May, 2012
Supervisor: Dr. Sun Jinping
Design and Implementation of GPU
based SAR Image Processor
School of Electronic Information
Engineering
Beihang University, Beijing China.
4. PROBLEM
Synthetic Aperture Radar data processing is a
computationally intensive and time consuming task
using conventional CPUs. Given the increasing
popularity and use of GPU for scientific computing,
it is required to accelerate simplified range Doppler
SAR processing algorithm on GPU using modern
GPGPU technology to achieve real/near real-time
performance and to evaluate its suitability for SAR
processing.
5. MOTIVATION
Computationally intensive and time consuming
nature of SAR processing algorithms.
Inherent algorithm parallelism in most SAR
processing algorithms.
Advent of modern GPGPU technology and
availability of commodity GPUs as general
purpose computation engines.
Architectural parallelism and availability of
sufficient hardware resources in modern GPUs
rendering them especially useful for handling
large data quantities and parallel SAR algorithm
implementation.
6. OBJECTIVE
To implement and accelerate simplified range
Doppler SAR processing algorithm on a modern
NVIDIA TESLA GPU using CUDA and MATLAB-
GPU capabilities.
The resulting research will explore the areas like:
Algorithm adaptation for parallel implementation.
Suitability of MATLAB for algorithm implementation.
Suitability of CUDA for algorithm implementation.
Comparison of CPU/CUDA/MATLAB-GPU
implementations.
GPU as SAR processing platform.
7. METHODOLOGY
Algorithm implementation and verification on Intel
Xeon CPU using MATLAB.
Identification of parallelizable portions of
algorithm.
Algorithm implementation on TESLA C1060 GPU
using MATLAB’s native GPU capabilities.
Algorithm implementation on TESLA C1060 GPU
using CUDA.
Analysis of CPU, MATLAB-GPU and CUDA
implementations.
9. Introduction to GPU Computing
Use of Graphics Processing Units (GPUs) for
general purpose computing applications.
CPU: Single, four or eight cores. Capable of
handling few threads. Suitable for serial code.
GPU: Hundreds of cores. Capable of handling
hundreds of threads. Suitable for parallel code.
10. Introduction to GPU Computing
GPU Computing Model: Heterogeneous
computing model employing both CPU and GPU
with serial computing on CPU, parallel computing
on GPU.
11. GPGPU: Brief History
First use of GPU as general purpose computing
device, around 1999-2000 using graphics APIs.
Huge performance boosts observed. Generally
unpopular due to tedious programming.
Introduction of NVIDIAs “CUDA” and AMDs
“Stream Computing” in 2007. Beginning of
modern GPGPU era. Other vendors introduced
their own GPGPU systems.
NVIDIAs CUDA gaining popularity due to its
maturity and performance.
12. NVIDIA CUDA
Compute Unified Device Architecture.
Comprises of Instruction Set Architecture (ISA)
and parallel compute engine in GPU
programmable with high level languages
extended for GPU computing.
CUDA framework comprises of two parts;
hardware and software. From software
perspective, CUDA means extended C/C++,
FORTRAN to support GPU computing.
CUDA is “Single Instruction Multiple Thread”
(SIMT) architecture.
13. CUDA Hardware
Streaming multiprocessor (SM): Basic computing unit
of the GPU. Comprises of eight streaming processors
(SP) and memory. Different GPUs differ in number of
SMs and SP clock frequency.
SP SP
SP SP
SP SP
SP SP
SFU SFU
MT IU
Shared Memory
14. CUDA Memory Architecture
Understanding of memory architecture critical for
writing efficient CUDA programs.
All CUDA-enabled hardware have following types
of memory:
Global memory
Shared memory and registers.
Texture memory and texture cache.
Constant memory and constant cache.
Local memory for register spilling.
SP SP
Shared memory
SP SP SP
Texture cache
Constant cache
SM n
SP SP
Shared memory
SP SP SP
Texture cache
Constant cache
SM 3
SP SP
Shared memory
SP SP SP
Texture cache
Constant cache
SP SP
Shared memory
SP SP SP
Texture cache
Constant cache
SM 1
SM 2
GPU
Global memory (RAM)
Local MemoryTexture memory
Constant
memory
15. NVIDIA TESLA C1060 GPU
PCI Express 2.0 compliant computing processor
board based on NVIDIA Tesla T10 graphics
processing unit targeted for HPC applications.
Feature highlights
30 SMs = 240 SPs.
SP Clock = 1.296 GHz
4 GB DDR3 memory with 120
GB/s bandwidth.
IEEE 754 single and double
floating point compliant.
933 GFLOPS single and 78
GFLOPS double precision
performance.
Compute capability: 1.3
Supported by MATLAB for
GPU computing
16. CUDA Programming Model
At its core are thread groups, shared memory and
barrier synchronization.
Provides coarse-grained data and task
parallelism and fine-grained data and thread
parallelism providing expressivity and scalability.
Thread hierarchy: Grid, blocks, threads.
Kernels: Functions executed on device (GPU) in
parallel threads.
CUDA provides APIs to run and launch kernels in
parallel threads and to synchronize them.
17. Processing Flow
Copy input data from CPU to GPU memory.
Load GPU program and execute, caching result
on the device.
Copy results from GPU to CPU.
RAM
CPU
Host
Global memory
Constant
Texture
GPU
Device
18. Writing Efficient Code
High priority considerations
Minimum CPU-GPU transfers.
Use of coalesced data transfers.
Use of shared memory instead of global memory
whenever possible.
Avoiding different execution paths within a warp.
Medium priority considerations
Access to shared memory should be planned to
avoid serialization.
Redundant data transfers from global memory
should be avoided.
19. Writing Efficient Code
Threads per block should be multiple of 32.
Use of fast math library whenever possible.
Low Priority Considerations
Use of zero copy operations.
For kernels with long argument list, some argument
should be placed in constant memory.
Expensive modulo, division operations should be
avoided in favor of shift operations whenever
possible.
Automatic conversion of double to float should be
avoided.
Loop unrolling should be used whenever possible.
20. 3.SAR Processing
What is Synthetic Aperture Radar
SAR Processing
Processing Algorithms
Basic RDA
Simplified RDA
21. What is Synthetic Aperture Radar
An active microwave remote sensing imaging
system.
Employs long range propagation characteristics
of radar and complex signal processing
techniques to produce high resolution images.
High resolution achieved by synthesizing long
antenna aperture through signal processing
techniques.
Pros (in comparison with optical systems):
All weather and day and night operation.
No effects of constituents of atmosphere.
Sensitivity to dielectric properties (can image ice,
biomass etc.)
Sensitivity to surface roughness (oceans, wind
22. What is Synthetic Aperture Radar
Accurate measurement of distance.
Sensitivity to man made objects.
Sensitivity to target structure.
Subsurface penetration.
Cons
Complex interactions (difficult to visualize and
understand)
Speckle effects (difficult in visual interpretation)
Topographic effects
23. SAR Processing
A set of procedures to obtain interpretable image
from raw scattered in azimuth and range
directions.
In range, data is scattered by duration of
transmitted FM pulse.
In azimuth, data spread by duration point target is
illuminated by the radar beam.
SAR processing compresses this data taking into
account range cell migration, earth curvature,
earth rotation, air/spacecraft attitude noise to
produce the final image.
Given nature of SAR system and signals, signal
processing rather than image processing provide
appropriate tools for SAR processing.
24. SAR Processing Algorithms
Mainstream SAR processing include:
Range Doppler algorithm (RDA)
High resolution images for low squint and for relatively
smaller aperture sizes. Very popular.
Chirp scaling algorithm (CSA)
Two-dimensional operations with range independence
followed by range corrections in range Doppler domain.
Omega-K algorithm (ωKA)
Efficient and accurate in two-dimensional frequency
domain.
SPECAN algorithm
Good for medium to low resolution requirements.
25. Range Doppler Algorithm
Versions of range Doppler:
Basic RDA
RDA with accurate SRC
RDA with approximate SRC
Simplified range Doppler
26. Basic RDA
Raw data
Range
Compression
Azimuth FFT
RCMC
Azimuth
Compression
Azimuth IFFT
and lookup
Summation
Final Image
Range FFT,
matched filter
multiply, range
IFFT
Data in range
Doppler domain
Interpolation
operation in range
Doppler domain
Azimuth matched
filter multiply
To bring back
signal into time
domain.
27. Simplified RDA
For narrower swath width and medium resolution
requirements, RCM can be assumed independent
of range.Raw data Pre-filtering
Range
Compression
Azimuth FFTRCMCRange IFFT
Azimuth
Compression
Azimuth IFFT
and lookup
Summation
Final Image
To remove
Doppler centroid
Range FFT,
matched filter
multiply (No
range IFFT)
Both range and
azimuth in
frequency domain
RCM phase
function multiply
with each range
line
Data in range
Doppler domain
29. Hardware resources
CPU GPU
Name NVIDIA Tesla
C1060
# of cores 240
SP Clock 1.296 GHz
Memory 4 GB GDDR3
Maximum
memory
bandwidth
102 GB/s
Memory
interface
512 bit – PCI
Express
GFLOPS 933 single
precision, 78
double precision
Name Intel Xeon
E5504
CPU Clock 2 GHz
# of cores 4
System Memory 4 GB
DDR3 Clock 800 MHz
Maximum
memory
bandwidth
19.2 GB/s
Memory type DDR3 PC3
PCI Slot PCI Express
30. Software resources
CPU GPU
Windows 7 Ultimate
64-bit
MATLAB release
2010b
Visual Studio 2008
SP1
CUDA Toolkit 4.1
MATLAB release
2010b
NVIDIA Parallel
Nsight
Visual Profiler
CUDA MEMCHECK
CUFFT library
31. RADARSAT – I Data
• CEOS Format
• Raw data is required to be
extracted from CEOS data
before SAR processing
algorithm can be applied.
Parameter Value Units
Sampling rate 32.317 MHz
Range FM rate 0.7213
5
MHz/µs
Pulse duration 41.74 µs
Radar frequency 5.3 GHz
Radar wavelength 0.0565
7
m
Pulse repetition
frequency
1256.9
8
Hz
Effective radar
velocity
7062 m/s
Azimuth FM rate 1733 Hz/s
Table RADARSAT – I data parameters
CEOS data
CEOS data
extraction utility
RAW SAR data
32. SAR Processing GUI
Functions
• CEOS data
extraction.
• MATLAB-
CPU SAR
processing.
• MATLAB-
GPU SAR
processing
• CUDA
input/output
manipulation.
• CUDA
program
execution.
34. CPU Processed SAR image
A 2048 x 4096 SAR image using CPU based implementation
35. MATLAB-GPU Implementation
MATLAB started supporting GPU computing since
MATLAB release 2010b.
Implemented using native MATLAB-GPU functions only
(no CUDA kernel calls).
Vectorization strategy employed to implement vector-
matrix multiplications on GPU.
All FFT/IFFTs performed using MATLAB-GPU FFT/IFFT
support functions.
Column1
Column2
………...
Columnn
Column1
Column2 ………...
Columnn
Column1
Column2
………...
Columnn
39. MATLAB-GPU Implementation
Advantages
Quick and easy to implement
Sufficient speedups obtained with little effort
Little knowledge of GPU hardware and no
knowledge of optimization techniques required.
Disadvantages
Currently, limited number of MATLAB functions
supported on GPU.
Not all overloads of a function available for GPU.
Lesser control of hardware resources and memory.
Not many optimization options.
40. CUDA Implementation
Strategy
Signal data read as binary file
Vectors, matched filters calculated on CPU
Vectors/signal data transferred to GPU
Following kernels executed in order on GPU
Pre-filtering kernel
Range compression kernel
RCMC kernel
Azimuth compression kernel
Image pixel calculation kernel
Data transferred from GPU to CPU and saved on
disk as image.
41. Optimization considerations
Chosen block size = 8 × 8 = 64. Conforms with
memory coalescing requirements.
Constant variables stored in constant memory
Local variable and phase function calculation
whenever possible to reduce global memory
access.
CPU-GPU data transfer kept to minimum by
transferring data from CPUGPU at beginning
and GPUCPU transfers at the end of algorithm.
Using CUFFTs cufftPlanMany() plan for
FFT/IFFTs along data columns.
49. Conclusions
Feasibility of GPU for SAR processing
Amount of data, computational effort and inherent
algorithm parallelism makes SAR processing
suitable on GPU.
TESLA C1060 GPU offers enough memory to
handle various common SAR image sizes.
Cooling GPU may be a challenge in some
environments.
Scalability of CUDA will prove to be an advantage to
port existing SAR code to newer GPUs.
GPUs might not be suitable where customizable
hardware is required or military hardware standards
are to be adhered.
50. Conclusions
MATLAB-GPU based SAR Processing
Significant speedups compared with CPU.
Quick and easy to implement.
Has some limitations:
Currently have lesser function support for GPU. Expected to
improve with future MATLAB releases.
Vectorization strategy needs more memory. Future release
promise to take away need for vectorization (e.g. bsxfun in
release 2012a).
Lesser control over GPU resources (memory etc.).
CUDA SAR Processing
CUDA: Flexible and scalable with least learning curve.
More control over GPU resources.
Optimization strategies can be applied.
Faster and more memory efficient than MATLAB
implementation.
51. Conclusions
Downsides of GPU
Significant testing/verification effort might be
required if GPU hardware have to be upgraded (due
to old one becoming obsolete).
Proprietary nature of CUDA might be problematic in
case company discontinues CUDA or its support.
52. Future work
CUDA kernels can be called in MATLAB code
using MATLAB’s CUDA kernel calling support.
MATLAB GPU implementation can be improved
as newer and better functions become available.
C/C++ based CPU implementation can be
developed to better judge MATLAB-CPU/CUDA
performance.
Other SAR processing algorithms can be
implemented using framework laid out in this
project.