Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

1. Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
2. The main points
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architecture
• Results of sequential optimizations and thread parallelization on the CPU
• Porting SAR image reconstruction to CUDA
• Comparison of CPU and GPU results
• Summary and conclusions
3. Motivation
• On-board space-based processing should be increased
• Future space applications with high performance requirements
  – HRWS SAR: 1 Tera FLOPS, 603.1 Gbit/s throughput
• Heterogeneous (CPU+GPU) architectures might be the solution
• Novel accelerator designs integrate CPUs and graphics processing modules in one chip
4. SAR Image Reconstruction
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation
• Raw data and reconstructed image sizes:

  SCALE   mc     n       m       nx
  10      1600   3290    3808    2474
  20      3200   6460    7616    4926
  30      4800   9630    11422   7380
  60      9600   19140   22844   14738
5. SAR Sensor Processing Profiling

  SSP Processing Step                                               Computation Type   Time (%)   Size & Layout
  1.  Filter the echoed signal                                      1d_Fw_FFT          1.1        [mc x n]
  2.  Transposition is needed                                                          0.3        [n x mc]
  3.  Signal compression along slow-time                            CEXP, MAC          1.1        [n x mc]
  4.  Narrow-bandwidth polar format reconstruction along slow-time  1d_Fw_FFT          0.5        [n x mc]
  5.  Zero-pad the spatial frequency domain's compressed signal                        0.4        [n x mc]
  6.  Transform back the zero-padded spatial spectrum               1d_Bw_FFT          5.2        [n x m]
  7.  Slow-time decompression                                       CEXP, MAC          2.3        [n x m]
  8.  Digitally-spotlighted SAR signal spectrum                     1d_Fw_FFT          5.2        [n x m]
  9.  Generate the Doppler domain representation of the
      reference signal's complex conjugate                          CEXP, MAC          3.4        [n x m]
  10. Circumvent edge processing effects                            2D-FFT_shift       0.4        [n x m]
  11. 2D interpolation from a wedge to a rectangular area:
      input [n x m] -> output [nx x m]                              MAC, Sin, Cos      69         [nx x m]
  12. Transform the Doppler domain image into a spatial domain
      image: IFFT[nx x m] -> Transpose -> FFT[m x nx]               1d_Bw_FFT          10         [m x nx]
  13. Transform into a viewable image                               CABS               1.1        [m x nx]
6. The Benchmarked Architecture
• The dual-socket ccNUMA node:
  – 2 Intel Nehalem CPUs (4 cores each) @ 2.13 GHz
  – 2 x 6 GB = 12 GB shared memory
  – 32 nm, board TDP = 120 W
• 2 accelerators, each with an NVIDIA Tesla C2070 GPU:
  – 14 Streaming Multiprocessors, 448 scalar cores @ 1.15 GHz
  – 6 GB of GDDR5 memory (5.25 GB available if ECC is enabled)
  – 40 nm, board TDP = 238 W
• CPUs and GPUs communicate through the Input/Output Controller over PCI Express 2.0 (up to 36 lanes)
7. CPU Sequential Optimizations
[Chart: elapsed time and speedup of successive sequential optimizations; groups in the chart: GCC 4.6 (O0-O3), ICC 12.0 (O0-O3), Vectorization (SSE4), FFTW (F_t, cexp, MEA)]

  Optimization     Elapsed time (s)   Speedup
  GCC 4.6  -O0     1606.7             1
  GCC 4.6  -O1     1241.03            1.2947
  GCC 4.6  -O2     1201.6             1.3371
  GCC 4.6  -O3     1208.66            1.3293
  ICC 12.0 -O0     1060.8             1.5146
  ICC 12.0 -O1     861.5              1.8650
  ICC 12.0 -O2     773.5              2.0772
  ICC 12.0 -O3     761.3              2.1105
  SSE4             751.9              2.1369
  F_t              582.83             2.7567
  cexp             562.9              2.8543
  MEA              537.41             2.9897
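As a rough illustration of the FFTW-based steps above, a minimal host-side sketch follows. It assumes single-precision complex data and batches the n row transforms of SSP step 1 into a single plan; the function name, layout, and sizes are illustrative, not taken from the paper's code.

    // Hypothetical FFTW sketch: n independent forward 1-D FFTs of length mc,
    // roughly matching step 1 of the SSP profile. Link with -lfftw3f.
    #include <fftw3.h>

    void filter_echoed_signal(fftwf_complex *signal, int mc, int n)
    {
        // One plan for n transforms of length mc, stored contiguously
        // (assuming one transform per row). FFTW_ESTIMATE avoids touching
        // the data during planning, so in-place planning is safe here.
        fftwf_plan plan = fftwf_plan_many_dft(
            1, &mc, n,            // rank 1, length mc, n transforms
            signal, NULL, 1, mc,  // input: unit stride, rows mc apart
            signal, NULL, 1, mc,  // in-place output, same layout
            FFTW_FORWARD, FFTW_ESTIMATE);
        fftwf_execute(plan);
        fftwf_destroy_plan(plan);
    }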
8. CPU Thread Parallelization
[Chart: elapsed time and speedup for the best sequential code, fftw_threads, 8 OpenMP threads, and 16 threads with Hyper-Threading]

  Configuration        Elapsed (s)   Elapsed, vect. (s)   Speedup   Speedup, vect.
  Best sequential      733.5         537.41               1         1
  fftw_threads         183.5         161.97               3.9973    3.3180
  8 threads (OpenMP)   122.5         103.06               5.9878    5.2145
  16 threads (HT)      100.7         84.36                7.2840    6.3704

• The vectorized code is 27% faster in sequential and 16% faster in parallel
• A very well optimized sequential code impacts the scalability of the application (the two threading levels are sketched below)
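A minimal sketch of the two CPU threading levels just listed, assuming FFTW's threaded interface for the FFT steps and a plain OpenMP loop for the interpolation step; the function names and loop body are illustrative placeholders, not the paper's code.

    // Hypothetical sketch of the CPU parallelization: FFTW's internal
    // threads handle the FFT steps, OpenMP handles the interpolation loop.
    // Link with -lfftw3f -lfftw3f_threads and compile with -fopenmp.
    #include <fftw3.h>
    #include <omp.h>

    void init_fft_threads(int nthreads)
    {
        fftwf_init_threads();                // enable FFTW's threading layer
        fftwf_plan_with_nthreads(nthreads);  // subsequent plans use nthreads
    }

    void interpolate(const float *wedge, float *rect, int nx, int m)
    {
        (void)wedge;  // the real kernel reads interpolated wedge samples
        // Step 11 (~69% of runtime): every output pixel is independent,
        // so the loop nest parallelizes directly across cores.
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < nx; i++)
            for (int j = 0; j < m; j++)
                rect[i * m + j] = 0.0f;  // placeholder for the 2-D interpolation
    }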
9. Introduction to CUDA
• CUDA kernels are executed by parallel threads
• A group of threads forms a thread block
• Shared memory among the threads in one block
  – Exploiting the locality of the algorithms ensures performance
• Thread blocks are mapped to SMs in warps (32 threads) that receive the same instruction (SIMD)
  – Branches impact the efficiency of SIMD units
• Limited amount of device memory brings the need for slow PCIe communications
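A minimal sketch of this execution model, not taken from the paper: each thread handles one element, and the threads of a block stage their data through shared memory. All names are illustrative.

    // Illustrative CUDA kernel: one thread per element, with each block
    // staging its elements through on-chip shared memory.
    __global__ void scale_kernel(const float *in, float *out, float s, int n)
    {
        __shared__ float tile[256];          // visible to one thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();                     // all threads in the block sync
        if (i < n) out[i] = s * tile[threadIdx.x];
    }

    // Launch: enough 256-thread blocks to cover all n elements, e.g.
    //   scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);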
10. Porting SAR Application to CUDA
• 2D data tiling for loops
  – Tile elements are computed by a block of threads
  – The tiling technique increases the number of active blocks, and so the level of occupancy
• Thread (tx, ty) in block (bx, by) is to calculate
  – row (by*TILE_DIM+ty) and
  – column (bx*TILE_DIM+tx) of the data set
• On the Tesla C2070 device: max 1024 threads per block
  – TILE_DIM = 32 (32 x 32 = 1024); see the sketch below
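A minimal sketch of the indexing scheme just described, with an illustrative element-wise operation standing in for the real loop body.

    // Illustrative 2-D tiling kernel: thread (tx, ty) of block (bx, by)
    // computes row by*TILE_DIM+ty, column bx*TILE_DIM+tx of the data set.
    #define TILE_DIM 32  // 32 x 32 = 1024 threads, the C2070 per-block maximum

    __global__ void tiled_op(const float *in, float *out, int rows, int cols)
    {
        int row = blockIdx.y * TILE_DIM + threadIdx.y;
        int col = blockIdx.x * TILE_DIM + threadIdx.x;
        if (row < rows && col < cols)
            out[row * cols + col] = 2.0f * in[row * cols + col];  // placeholder op
    }

    // Launch with one thread block per tile:
    //   dim3 block(TILE_DIM, TILE_DIM);
    //   dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
    //   tiled_op<<<grid, block>>>(d_in, d_out, rows, cols);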
11. CUDA Implementation Discussions
• The CUFFT library provides a simple interface for computing parallel FFTs (see the sketch below)
  – Batch execution for multiple 1-dimensional transforms
  – Drawback: the memory needed on the host side increases with
    • the size of the transform
    • the number of transforms configured in the batch
• Operations missing in CUDA:
  – Library functions like cexp() and cabs()
  – Atomic operations on floating-point variables
• Transcendental instructions execute efficiently on Special Function Units (SFUs):
  – sine, cosine, square root
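A minimal sketch of the batched 1-D CUFFT interface named above; the function name and sizes are illustrative, and d_data is assumed to already reside in device memory.

    // Illustrative batched CUFFT call: one plan runs `batch` independent
    // complex-to-complex forward FFTs of length `len`, in place.
    #include <cufft.h>

    void batched_fft(cufftComplex *d_data, int len, int batch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, len, CUFFT_C2C, batch);          // batch 1-D transforms
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place execution
        cufftDestroy(plan);
    }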
12. Performance Results
• CPU vs GPU
  – Better performance on the GPU
  – Better power efficiency on the CPU
• Small scale vs large scale
  – For small-scale images (SCALE < 20), the data set fits completely in the GPU memory
  – For large-scale images (SCALE > 30), the data set does not fit in the GPU memory

  Speedup    CPU_Seq   CPU 8 Threads   CPU 16 Threads   GPU
  Scale=10   1         7.9474          8.8247           11.0488
  Scale=20   1         7.6237          8.1752           10.6159
  Scale=30   1         6.0354          7.0146           10.2855
  Scale=60   1         5.2145          6.3704           10.2364
13. Using both CPU and GPU for processing
• Programming heterogeneous systems is impacted by:
  – Data dependencies
  – Scheduling algorithms
  – System resources
• Frequent transfers between CPU and GPU should be avoided
• Profiling is needed to identify the parts of the code that will benefit from executing on the GPU
• In our case, it was decided to execute on the GPU only the interpolation loop (70% of the total execution time), sketched below, in order to avoid transfers in steps like:
  – FFT_SHIFT
  – Transposition
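A minimal host-side sketch of that split, assuming complex single-precision data: only the interpolation step runs on the GPU, so the data crosses PCIe exactly twice. The kernel body is only a placeholder for the real wedge-to-rectangle interpolation, and all names are illustrative.

    // Illustrative heterogeneous split: offload only the interpolation step.
    #include <cuda_runtime.h>

    __global__ void interp_kernel(const float2 *in, float2 *out,
                                  int n, int m, int nx)
    {
        // Placeholder body: the real kernel interpolates the [n x m] wedge
        // input onto the [nx x m] rectangular output grid.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nx && col < m)
            out[row * m + col] = in[(row % n) * m + col];  // illustrative only
    }

    void interpolate_on_gpu(const float2 *h_in, float2 *h_out,
                            int n, int m, int nx)
    {
        float2 *d_in, *d_out;
        cudaMalloc(&d_in,  sizeof(float2) * n  * m);
        cudaMalloc(&d_out, sizeof(float2) * nx * m);

        cudaMemcpy(d_in, h_in, sizeof(float2) * n * m, cudaMemcpyHostToDevice);
        dim3 block(32, 32);
        dim3 grid((m + 31) / 32, (nx + 31) / 32);
        interp_kernel<<<grid, block>>>(d_in, d_out, n, m, nx);
        cudaMemcpy(h_out, d_out, sizeof(float2) * nx * m, cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
    }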
14. Using Multiple GPU Devices
• OpenMP + CUDA: one OpenMP thread per device
  – Separate GPU context
  – Each thread independently calls
    • memory management functions
    • CUDA kernels
  (the thread-per-device pattern is sketched below)
• Two approaches:
  – The same image is reconstructed by 2 GPUs
    • Bottlenecks in the QPI (remote accesses) and PCIe links
  – Separate images are reconstructed on 2 separate GPUs (pipelined version)
    • Reduced CPU <-> GPU data transfers
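A minimal sketch of the thread-per-device pattern, with illustrative buffer sizes: each OpenMP thread binds to its own GPU and then manages memory and kernel launches independently.

    // Illustrative OpenMP + CUDA pattern: one host thread per GPU device.
    #include <omp.h>
    #include <cuda_runtime.h>

    void reconstruct_on_all_gpus(int num_gpus)
    {
        #pragma omp parallel num_threads(num_gpus)
        {
            int dev = omp_get_thread_num();
            cudaSetDevice(dev);           // bind this thread to its own device

            float *d_buf;
            cudaMalloc(&d_buf, 1 << 20);  // per-device allocation (size illustrative)

            // ... copy this thread's image in, launch kernels, copy results back ...

            cudaFree(d_buf);
        }
    }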
15. Results Updated

  Speedup    CPU_Seq   CPU 8 Threads   CPU 16 Threads   GPU       GPU+CPU   2 GPUs    2 GPUs Pipelined
  Scale=10   1         7.9474          8.8247           11.0488   10.3740   2.3472    5.3086
  Scale=20   1         7.6237          8.1752           10.6159   11.7166   5.7588    11.5306
  Scale=30   1         6.0354          7.0146           10.2855   11.6952   8.8412    13.4404
  Scale=60   1         5.5136          6.5883           10.2364   12.5270   11.3020   17.4938
16. Summary and Conclusions
• Porting the SAR application to CUDA requires knowledge of the underlying hardware and of the CUDA paradigm
• For the SAR application, GPUs offer better performance than CPUs
  – But CPUs are more power efficient
• Heterogeneous computing improves performance, but the Performance/Watt ratio is impacted by the number of CPU <-> GPU transfers
• Static scheduling of CUDA kernels offers no flexibility in heterogeneous computing environments
• When using multiple GPU devices, it is very important to reduce the number of CPU <-> GPU and GPU <-> GPU transfers
17. Thank You! Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
kraja@in.tum.de
