Slides presented at the 2012 IEEE Aerospace Conference


- 1. Parallelization Techniques for the 2D Fourier Matched Filtering and Interpolation SAR Algorithm. Fisnik Kraja, Georg Acher, Arndt Bode, Chair of Computer Architecture, Technische Universität München (kraja@in.tum.de, acher@in.tum.de, bode@in.tum.de). 2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
- 2. The main points will be:
  • The motivation statement
  • Description of the SAR 2DFMFI application
  • Description of the benchmarked architectures
  • Parallelization techniques and results on shared-memory and distributed-memory architectures
  • Specific optimizations for distributed-memory environments
  • Summary and conclusions
  (February 24, 2012)
- 3. Motivation
  • Current and future space applications with onboard high-performance requirements: observation satellites with increased image resolutions, data sets, and computational requirements
  • Novel and interesting research based on many-cores for space (Dependable Multiprocessor and Maestro)
  • The tendency to fly COTS products to space
  • The performance/power ratio depends directly on the scalability of applications.
- 4. SAR 2DFMFI Application
  • Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
  • SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation to the raw data

  | SCALE | mc | n | m | nx |
  |-------|------|-------|-------|-------|
  | 10 | 1600 | 3290 | 3808 | 2474 |
  | 20 | 3200 | 6460 | 7616 | 4926 |
  | 30 | 4800 | 9630 | 11422 | 7380 |
  | 60 | 9600 | 19140 | 22844 | 14738 |
- 5. SAR Sensor Processing Profiling

  | # | SSP Processing Step | Computation Type | Execution Time (%) | Size & Layout |
  |---|---------------------|------------------|--------------------|---------------|
  | 1 | Filter the echoed signal | 1d_Fw_FFT | 1.1 | [mc x n] |
  | 2 | Transposition | | 0.3 | [n x mc] |
  | 3 | Signal compression along slow-time | CEXP, MAC | 1.1 | [n x mc] |
  | 4 | Narrow-bandwidth polar format reconstruction along slow-time | 1d_Fw_FFT | 0.5 | [n x mc] |
  | 5 | Zero-pad the compressed signal in the spatial frequency domain | | 0.4 | [n x mc] |
  | 6 | Transform back the zero-padded spatial spectrum | 1d_Bw_FFT | 5.2 | [n x m] |
  | 7 | Slow-time decompression | CEXP, MAC | 2.3 | [n x m] |
  | 8 | Digitally-spotlighted SAR signal spectrum | 1d_Fw_FFT | 5.2 | [n x m] |
  | 9 | Generate the Doppler domain representation of the reference signal's complex conjugate | CEXP, MAC | 3.4 | [n x m] |
  | 10 | Circumvent edge processing effects | 2D-FFT_shift | 0.4 | [n x m] |
  | 11 | 2D interpolation from a wedge to a rectangular area: input [n x m] -> output [nx x m] | MAC, Sin, Cos | 69 | [nx x m] |
  | 12 | Transform the Doppler domain image into a spatial domain image: IFFT [nx x m] -> transpose -> FFT [m x nx] | 1d_Bw_FFT | 10 | [m x nx] |
  | 13 | Transform into a viewable image | CABS | 1.1 | [m x nx] |
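Step 11, the wedge-to-rectangle interpolation, dominates the profile at 69% of the runtime, which makes it the natural parallelization target. A minimal serial sketch of this kind of MAC/sin/cos resampling kernel, with a hypothetical windowed-sinc weighting and illustrative names (the slides do not give the actual kernel):

```python
import math

def interpolate_column(wedge_samples, wedge_coords, out_coords, half_width=4):
    """Hypothetical sketch: resample one column of non-uniformly spaced
    (wedge) spectral samples onto a uniform (rectangular) grid with a
    windowed-sinc multiply-accumulate, the MAC/sin/cos pattern of step 11."""
    out = []
    for x in out_coords:
        acc = 0.0
        for xk, sk in zip(wedge_coords, wedge_samples):
            d = x - xk
            if abs(d) <= half_width:
                # sinc weight sin(pi*d)/(pi*d), guarded at d == 0
                w = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
                acc += w * sk
        out.append(acc)
    return out
```

Each output column is independent, which is why the loop parallelizes well across threads or processes.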
- 6. The benchmarked ccNUMA (distributed shared memory) machine consists of:
  • 2 Nehalem CPUs: Intel(R) Xeon(R) X5670 at 2.93 GHz, 12 MB L3 Smart Cache, 6 cores/CPU, TDP = 95 W, 6.4 GT/s QPI (25.6 GB/s), DDR3-1066 memory interfacing
  • 36 GB of RAM (18 GB per memory controller)
  (Diagram: two 6-core CPUs, each with three 6 GB memory banks, linked by QPI and attached to an I/O controller.)
- 7. Parallelization techniques on the ccNUMA machine
- 8. Results on the ccNUMA machine
  (Chart: speedup vs. number of cores, 1-12, for Scale=60 and Scale=10.)
- 9. The benchmarked distributed-memory architecture: Nehalem cluster @ HLRS.de
  • Peak performance: 62 TFlops
  • Nodes: 700 dual-socket quad-core Intel Xeon (X5560) Nehalem @ 2.8 GHz, 8 MB cache
  • Memory/node: 12 GB
  • Disk: 80 TB shared scratch (Lustre)
  • Node-node interconnect: InfiniBand, Gigabit Ethernet
- 10. MPI Master-Worker Model
  • In MPI: row-by-row send-and-receive
  • In MPI2: send and receive chunks of rows
  • No more than 4 processes/node (8 cores) because of memory overhead
  (Chart: speedup vs. number of nodes, 8 cores/node, for MPI, MPI2, and their 2- and 4-process-per-node variants.)
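The difference between the MPI (row-by-row) and MPI2 (chunked) distributions can be illustrated without MPI itself. The sketch below simulates only the bookkeeping, with illustrative function names not taken from the slides: chunking cuts the number of master messages from one per row to one per worker.

```python
def rows_per_worker(n_rows, n_workers):
    """Split n_rows as evenly as possible over n_workers (chunked, MPI2-style)."""
    base, rem = divmod(n_rows, n_workers)
    return [base + (1 if w < rem else 0) for w in range(n_workers)]

def message_counts(n_rows, n_workers):
    """Messages the master sends: one per row (MPI) vs. one per worker (MPI2)."""
    return {"row_by_row": n_rows, "chunked": n_workers}
```

For the Scale=10 case (n = 3290 rows) and 16 workers, row-by-row distribution costs 3290 sends against 16 chunked sends, which is the latency saving the MPI2 curve reflects.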
- 11. MPI Memory Overhead
  • This overhead comes from the data replication and reduction needed in the interpolation loop.
  • To improve scalability without increasing memory consumption, a hybrid (MPI+OpenMP) version is implemented.
  (Chart: worker, master, and total memory consumption in GB for 1-8 processes; total consumption grows with the process count, up to 27.6 GB.)
- 12. Hybrid (MPI+OpenMP) Versions
  • Hyb1: 1 process (8 OpenMP threads) per node
  • Hyb2: OpenMP FFTW + HyperThreading
  • Hyb3: non-computationally-intensive work is done only by the master process
  • Hyb4: send and receive chunks of rows
  (Chart: speedup vs. number of nodes, 8 cores/node, for Hyb1-Hyb4, plus Hyb4 with 2 processes/8 threads and 4 processes/4 threads per node.)
- 13. Master-Worker Bottlenecks
  • In some steps of SSP, the data is collected by the master process and then distributed again to the workers after the respective step.
  • Such steps are: the 2D FFT_SHIFT, the transposition operations, and the reduction operation after the interpolation loop.
- 14. Inter-process Communication in the FFT_SHIFT
  Notional depiction of the fftshift operation: the quadrant layout [A B; C D] becomes [D C; B A].
  • New communication pattern: nodes communicate in couples. Nodes that have the data of the first and second quadrants send and receive data only to and from the nodes with the third and fourth quadrants, respectively.
  • With four processes (PID 0: A1 B1, PID 1: A2 B2, PID 2: C1 D1, PID 3: C2 D2), the paired exchange yields PID 0: C1 D1, PID 1: C2 D2, PID 2: A1 B1, PID 3: A2 B2; a local left-right swap then gives the final layout PID 0: D1 C1, PID 1: D2 C2, PID 2: B1 A1, PID 3: B2 A2.
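The couple-wise exchange amounts to a distributed 2D fftshift: the row-half swap is the paired send/receive between processes, and the column-half swap is purely local. A serial sketch on a list-of-lists, assuming even dimensions:

```python
def fftshift2d(mat):
    """2D fftshift by quadrant swap: [A B; C D] -> [D C; B A].
    In the distributed version the row shift is the paired inter-process
    exchange and the column shift is local to each process."""
    r, c = len(mat), len(mat[0])
    h, w = r // 2, c // 2
    out = [[None] * c for _ in range(r)]
    for i in range(r):
        for j in range(c):
            # shift each element by half the extent in both dimensions
            out[(i + h) % r][(j + w) % c] = mat[i][j]
    return out
```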
- 15. Inter-Process Transposition
  • Data partitioning (tiling) and buffering: each process i owns a row block Di, split into tiles Di0, Di1, Di2, Di3.
  • After the transposition, process j holds the tiles D0j, D1j, D2j, D3j: every pair of processes exchanges exactly one tile, and each tile is transposed locally.
  (The resulting communication pattern is an all-to-all exchange of tiles among D0-D3.)
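The tiling scheme can be simulated serially: process i carves its row block into P tiles, tile (i, j) travels to process j (the all-to-all exchange), and each received tile is transposed locally before assembly. A sketch with illustrative names, assuming dimensions divide evenly by the process count:

```python
def tiled_transpose(blocks):
    """blocks[i] is the row block (list of rows) owned by process i.
    Returns the row blocks each process owns after the tiled transpose,
    which together form the transpose of the global matrix."""
    p = len(blocks)
    rows_per_block = len(blocks[0])
    tile_w = len(blocks[0][0]) // p   # columns per tile
    result = []
    for j in range(p):
        # process j assembles the transposed tiles D0j^T, D1j^T, ...
        new_rows = [[] for _ in range(tile_w)]
        for i in range(p):
            # the tile process i sends to process j
            tile = [row[j * tile_w:(j + 1) * tile_w] for row in blocks[i]]
            # local transpose of the received tile, appended column-wise
            for r in range(tile_w):
                new_rows[r].extend(tile[c][r] for c in range(rows_per_block))
        result.append(new_rows)
    return result
```

In an actual MPI implementation this pattern maps naturally onto an all-to-all exchange of the tile buffers.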
- 16. Reduction in the Interpolation Loop
  • To avoid a collective reduction, a local reduction is applied between neighbor processes.
  • This reduces only the overlapped regions.
  • Reduction is scheduled in an ordered way: the first process sends the data to the second process, which accumulates the new values with the old ones and sends the results back to the first process.
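The ordered neighbor reduction can be sketched serially: only the overlapping region is exchanged, the second process accumulates, and the summed values travel back. Names and the 1-D layout are illustrative:

```python
def neighbor_reduce(a, b, overlap):
    """Ordered pairwise reduction of two neighbors' partial results.
    Only the last `overlap` entries of a and the first `overlap` entries
    of b overlap; process 1 sends its overlap, process 2 accumulates,
    and the sums are sent back so both ends hold consistent values."""
    assert overlap <= len(a) and overlap <= len(b)
    summed = [x + y for x, y in zip(a[-overlap:], b[:overlap])]
    a_after = a[:-overlap] + summed   # process 1 after receiving the sums back
    b_after = summed + b[overlap:]    # process 2 after accumulating
    return a_after, b_after
```

Because only the overlap is communicated, the message size stays independent of the full image size, unlike a collective reduction over the whole array.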
- 17. Pipelining the SSP Steps
  • Each node processes a single image: fewer inter-process communications.
  • It takes longer to reconstruct the first image, but less time for the other images.
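The latency/throughput trade-off of the pipelined variant can be captured in a toy model, assuming (hypothetically) equal-cost stages mapped one per node:

```python
def pipeline_times(n_images, n_stages, stage_time):
    """Toy pipeline model: the first image finishes only after passing all
    n_stages stages, then one image completes every stage_time thereafter.
    Returns (first_image_time, total_time)."""
    first = n_stages * stage_time
    total = first + (n_images - 1) * stage_time
    return first, total
```

With 4 images and 3 stages of unit cost, the first image takes 3 time units but all 4 finish by time 6, versus 12 if the images were processed strictly one after another.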
- 18. Speedup and Execution Time
  (Chart: speedup vs. number of cores, 8 cores per node, for Hyb4, Hyb5, and Pipelined.)

  | Number of Cores | 8 | 16 | 32 | 64 | 96 | 128 |
  |-----------------|-------|-------|-------|-------|--------|-------|
  | Hyb4 elapsed time (s) | 92.49 | 62.6 | 44.5 | 34.44 | 34.14 | 34.12 |
  | Hyb5 elapsed time (s) | 92.49 | 50.56 | 28.84 | 18.41 | 15.13 | 13.97 |
  | Pipelined elapsed time (s) | 92.49 | 46.43 | 24.8 | 13.88 | 10.325 | 8.42 |
- 19. Summary and Conclusions
  • In shared-memory systems, the application can be efficiently parallelized, but the performance will always be limited by hardware resources.
  • In distributed-memory systems, hardware resources on non-local nodes become available at the cost of communication overhead.
  • Performance improves with the number of resources, but efficiency does not scale at the same rate.
  • The duty of each designer is to find the right compromise between performance and other factors like power consumption, size, and heat dissipation.
- 20. Thank You! Questions?
  Fisnik Kraja, Chair of Computer Architecture, Technische Universität München, kraja@in.tum.de
