Designing High Performance Computing Architectures for Reliable Space Applications


PhD Defense Talk

Published in: Technology, Education

Transcript of "Designing High Performance Computing Architectures for Reliable Space Applications"

  1. Designing High Performance Computing Architectures for Reliable Space Applications. Fisnik Kraja. PhD Defense, December 6, 2012. Advisors: 1st: Prof. Dr. Arndt Bode; 2nd: Prof. Dr. Xavier Martorell.
  2. Outline
     1. Motivation
     2. The Proposed Computing Architecture
     3. The 2DSSAR Benchmarking Application
     4. Optimizations and Benchmarking Results
        – Shared memory multiprocessor systems
        – Distributed memory multiprocessor systems
        – Heterogeneous CPU/GPU systems
     5. Conclusions
  3. Motivation
     • Future space applications will demand:
       – Increased on-board computing capabilities
       – Preserved system reliability
     • Future missions:
       – Optical (IR Sounder): 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
       – Radar/Microwave (HRWS SAR): 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s
     • Challenges:
       – Costs (ASICs are very expensive)
       – Modularity (component change and reuse)
       – Portability (across various spacecraft platforms)
       – Scalability (hardware and software)
       – Programmability (compatible with various environments)
       – Efficiency (power consumption and size)
  4. The Proposed Architecture
     Legend:
       RHMU: Radiation-Hardened Management Unit
       PPN: Parallel Processing Node
       Control Bus
       Data Bus
  5. The 2DSSAR Application: 2-Dimensional Spotlight Synthetic Aperture Radar
     (Figure: illuminated swath in side-looking spotlight SAR, showing the spacecraft, azimuth flight path, altitude, range swath, and cross-range.)
     • Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
     • SAR Sensor Processing (SSP): read generated data, Image Reconstruction (IR), write reconstructed image
     • The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation algorithm
  6. Profiling SAR Image Reconstruction
     Scale=10: coverage 3.8 x 2.5 km, memory 0.25 GB, 29.54 GFLOP, 23 s
     Scale=30: coverage 11.4 x 7.5 km, memory 2 GB, 115.03 GFLOP, 230 s
     Scale=60: coverage 22.8 x 15 km, memory 8 GB, 1302 GFLOP, 926 s
     IR profile: interpolation loop 69%, FFTs 22%, compression and decompression loops 7%, transposition and FFT-shifting 2%.
     Goal: 30x speedup.
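     The slides show no code, but the stages named in this profile can be lined up as a plain C outline so that the optimization targets of the following slides are easier to place. Everything in the sketch is a placeholder: the function names, signatures, and even the exact stage order are assumptions; only the stage names and their rough share of the runtime are taken from the profile above.

        /* Illustrative outline of sequential SAR image reconstruction (IR).
         * Stubs only; stage names follow the profile, internals are omitted. */
        #include <complex.h>

        typedef float complex sample;

        static void compression_loop(sample *d, int n)       { (void)d; (void)n; }
        static void row_ffts(sample *d, int n)               { (void)d; (void)n; } /* FFTs: ~22% */
        static void transpose_and_fftshift(sample *d, int n) { (void)d; (void)n; } /* ~2% */
        static void interpolation_loop(sample *d, int n)     { (void)d; (void)n; } /* ~69% */
        static void decompression_loop(sample *d, int n)     { (void)d; (void)n; }

        /* Reconstruct one image from n complex samples of synthetic SAR returns. */
        void image_reconstruction(sample *data, int n)
        {
            compression_loop(data, n);        /* compression + decompression: ~7% together */
            row_ffts(data, n);
            transpose_and_fftshift(data, n);
            interpolation_loop(data, n);      /* polar-to-rectangular resampling: the hot spot */
            decompression_loop(data, n);
        }

     The following slides attack exactly these stages: the interpolation loop and FFTs on a shared memory node, the transposition and FFT-shift across distributed nodes, and the FFTs and interpolation on the GPU.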
  7. IR Optimizations for Shared Memory Multiprocessing
     • OpenMP
       – General optimizations:
         • Thread pinning and first-touch policy
         • Static/dynamic scheduling
       – FFT:
         • Manual multithreading of the loops of 1D FFTs (not of the FFT itself)
       – Interpolation loop (polar to rectangular coordinates):
         • Atomic operations
         • Replication and Reduction (R&R)
     • Other programming models:
       – OmpSs, MPI, MPI+OpenMP
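     As an illustration of the two interpolation-loop strategies above, the C sketch below scatters polar-grid samples into a rectangular output grid kept as separate real/imaginary arrays. The data layout, the weights array, and the target_cell() mapping are hypothetical placeholders introduced only for this example; only the two strategies themselves (per-update atomics versus replication and reduction) follow the slide.

        /* Two OpenMP strategies for a scatter-style interpolation loop.
         * Array layout and the target_cell() mapping are placeholders. */
        #include <omp.h>
        #include <stdlib.h>

        /* Placeholder: rectangular-grid cell that polar sample i falls into. */
        static int target_cell(int i) { return i; }

        /* Strategy 1: protect every scalar update with an atomic. */
        void interp_atomic(const float *in_re, const float *in_im, const float *w,
                           float *out_re, float *out_im, int n_in)
        {
            #pragma omp parallel for schedule(dynamic, 1024)
            for (int i = 0; i < n_in; i++) {
                int c = target_cell(i);          /* several samples may hit the same cell */
                #pragma omp atomic
                out_re[c] += w[i] * in_re[i];
                #pragma omp atomic
                out_im[c] += w[i] * in_im[i];
            }
        }

        /* Strategy 2: Replication and Reduction (R&R). Each thread scatters into a
         * private copy of the grid, then the copies are summed into the shared one. */
        void interp_rr(const float *in_re, const float *in_im, const float *w,
                       float *out_re, float *out_im, int n_in, int n_out)
        {
            #pragma omp parallel
            {
                float *priv_re = calloc((size_t)n_out, sizeof(float));
                float *priv_im = calloc((size_t)n_out, sizeof(float));

                #pragma omp for schedule(dynamic, 1024) nowait
                for (int i = 0; i < n_in; i++) {
                    int c = target_cell(i);
                    priv_re[c] += w[i] * in_re[i];   /* contention-free private update */
                    priv_im[c] += w[i] * in_im[i];
                }

                for (int c = 0; c < n_out; c++) {    /* reduction into the shared grid */
                    #pragma omp atomic
                    out_re[c] += priv_re[c];
                    #pragma omp atomic
                    out_im[c] += priv_im[c];
                }
                free(priv_re);
                free(priv_im);
            }
        }

     Atomics keep the memory footprint flat but serialize conflicting writes; R&R spends one private grid per thread to make the scatter contention-free, which is also why R&R is ruled out later on the memory-limited GPU (slide 11).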
  8. IR on a Shared Memory Node
     The ccNUMA node: 2 x Nehalem CPUs (6 cores / 12 threads each), 2.93-3.33 GHz, QPI 25.6 GB/s, IMC 32 GB/s, 32 nm lithography, TDP 95 W; 2 x 3 x 6 GB memory modules, 36 GB DDR3 SDRAM (1066 MHz) in total.
     Speedup (Scale=60) by cores (threads):
                      1      2      4      6      8      10     12     12 (24)
     OpenMP Atomic    1.00   1.55   3.05   4.45   5.81   6.94   7.98   10.54
     OpenMP R&R       1.00   1.78   3.51   5.02   6.36   7.74   9.03   11.06
     OmpSs Atomic     1.00   1.61   3.12   4.62   5.92   7.02   8.13   10.72
     OmpSs R&R        1.00   1.93   3.73   5.52   7.13   8.65   10.37  12.37
     MPI R&R          1.00   1.92   3.65   5.30   6.57   7.94   9.81   11.20
     MPI+OpenMP       1.00   1.89   3.54   4.88   6.40   8.02   9.94   11.69
  9. IR Optimizations for Distributed Memory Multiprocessing
     • Programming paradigms:
       – MPI
         • Data replication
         • Process creation overhead
       – MPI+OpenMP
         • 1 process/node
         • 1 thread/core
     • Communication optimizations:
       – Transposition (new: All-to-All)
       – FFT-shift (new: Peer-to-Peer)
       – Interpolation loop: Replication and Reduction
     • Pipelined IR:
       – Each node reconstructs a separate SAR image
     (Diagram: block-distributed matrix D00..D33 exchanged among PIDs 0-3 for the transposition.)
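     The All-to-All transposition above can be read as a classic corner turn: every process owns a slab of rows, packs one tile per destination, swaps the tiles with a single MPI_Alltoall, and finishes with a local transpose of each received tile. The sketch below is a generic illustration of that pattern under simplifying assumptions (square N x N matrix, N divisible by the process count, complex samples sent as interleaved float pairs); it is not the thesis implementation.

        /* Generic corner turn (distributed matrix transpose) with MPI_Alltoall.
         * Square N x N matrix of complex samples distributed by rows; each
         * complex value is stored as an interleaved re/im pair of floats. */
        #include <mpi.h>
        #include <stdlib.h>

        void corner_turn(const float *loc, float *out, int N, MPI_Comm comm)
        {
            int P, rank;
            MPI_Comm_size(comm, &P);
            MPI_Comm_rank(comm, &rank);

            int rows = N / P;            /* local rows before and after the turn */
            int blk  = rows * rows;      /* complex elements per P x P tile      */
            float *sendbuf = malloc((size_t)blk * P * 2 * sizeof(float));
            float *recvbuf = malloc((size_t)blk * P * 2 * sizeof(float));

            /* Pack: the tile for process p = local rows, columns [p*rows, (p+1)*rows). */
            for (int p = 0; p < P; p++)
                for (int r = 0; r < rows; r++)
                    for (int c = 0; c < rows; c++) {
                        int src = (r * N + p * rows + c) * 2;
                        int dst = (p * blk + r * rows + c) * 2;
                        sendbuf[dst]     = loc[src];
                        sendbuf[dst + 1] = loc[src + 1];
                    }

            /* Every process exchanges one tile with every other process. */
            MPI_Alltoall(sendbuf, blk * 2, MPI_FLOAT,
                         recvbuf, blk * 2, MPI_FLOAT, comm);

            /* Unpack with a local transpose: element (r, c) of the tile received
             * from process p becomes column p*rows + r of local output row c.   */
            for (int p = 0; p < P; p++)
                for (int r = 0; r < rows; r++)
                    for (int c = 0; c < rows; c++) {
                        int src = (p * blk + r * rows + c) * 2;
                        int dst = (c * N + p * rows + r) * 2;
                        out[dst]     = recvbuf[src];
                        out[dst + 1] = recvbuf[src + 1];
                    }

            free(sendbuf);
            free(recvbuf);
        }

     The FFT-shift fits the same picture: shifting halves along the distributed dimension only swaps a node's rows with those of the single node holding the mirrored half of the grid, so a pair-wise exchange between partner nodes suffices, which is presumably why the slide replaces the collective with peer-to-peer messages.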
  10. IR on the Distributed Memory System
     The Nehalem cluster: each node has 2 x 4 cores (16 threads), 2.8-3.2 GHz, 12/24/48 GB RAM, QPI 25.6 GB/s, IMC 32 GB/s, 45 nm lithography, TDP 95 W/CPU. InfiniBand network, fat-tree topology, 6 backbone switches, 24 leaf switches.
     Speedup chart (Scale=60) over 1 (8), 2 (16), 4 (32), 8 (64), 12 (96), and 16 (128) nodes (cores), comparing MPI (4 proc/node), Hybrid (1 proc with 16 threads/node), MPI_new (8 proc/node, 24 GB), Hyb_new (1 proc with 16 threads/node), and Pipelined (1 proc with 16 threads/node). The best variant reaches a speedup of 38.05 on 8 nodes (64 cores) and about 59.8 on 16 nodes (128 cores).
  11. IR Optimizations for Heterogeneous CPU/GPU Computing
     • ccNUMA multi-processor:
       – Sequential optimizations
       – Minor load-balancing improvements
     • Accelerator (GP-GPU):
       – CUDA
       – Tiling technique
       – cuFFT library
       – Transcendental functions, such as sine and cosine
       – CUDA 3.2 lacks:
         • Some complex operations (multiplication and CEXP)
         • Atomic operations for complex/float data
       – Memory limitation:
         • Atomic operations are used in the SFI loop (R&R is not an option)
         • The large-scale IR dataset does not fit into GPU memory
     • Computing on CPU+GPU
     (Diagram: CUDA tiling with blockIndex.x/y, threadIndex.x/y, and tile size tsize; highlighted block (2,1).)
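     To make the cuFFT bullet concrete, the host-side C sketch below shows the generic pattern for offloading the batched 1-D row FFTs: copy the rows to the device, build one batched complex-to-complex plan, execute it in place, and copy the result back. Function and variable names are illustrative and error handling is minimal; this is standard cuFFT usage, not the thesis code.

        /* Offloading batched 1-D FFTs to the GPU with cuFFT (generic sketch).
         * One C2C transform of length n is applied to each of 'batch' rows. */
        #include <cuda_runtime.h>
        #include <cufft.h>
        #include <stdio.h>

        int fft_rows_on_gpu(cufftComplex *host_rows, int n, int batch)
        {
            size_t bytes = (size_t)n * batch * sizeof(cufftComplex);
            cufftComplex *dev = NULL;
            cufftHandle plan;

            if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
                fprintf(stderr, "row data does not fit into GPU memory\n");
                return -1;               /* large scales must be split or kept on the CPU */
            }
            cudaMemcpy(dev, host_rows, bytes, cudaMemcpyHostToDevice);

            /* One plan for all rows: a batched 1-D complex-to-complex transform. */
            if (cufftPlan1d(&plan, n, CUFFT_C2C, batch) != CUFFT_SUCCESS) {
                cudaFree(dev);
                return -1;
            }
            cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);   /* in-place forward FFT per row */
            cudaDeviceSynchronize();

            cudaMemcpy(host_rows, dev, bytes, cudaMemcpyDeviceToHost);
            cufftDestroy(plan);
            cudaFree(dev);
            return 0;
        }

     The memory-limitation bullet shows up directly in this pattern: when the large-scale dataset exceeds the 6 GB of GPU memory, the allocation fails and the rows have to be processed in slices or split between CPU and GPU.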
  12. IR on a Heterogeneous Node
     The machine:
       ccNUMA module: 2 x 4 cores (16 threads), 2.8-3.2 GHz, 12 GB RAM, TDP 95 W/CPU, PCIe 2.0 (8 GB/s)
       Accelerator module: 2 NVIDIA Tesla (Fermi) GPU cards, 1.15 GHz, 6 GB GDDR5, 144 GB/s, TDP 238 W
     Speedup over sequential CPU execution:
                              Scale=10   Scale=30   Scale=60
     CPU sequential             1.00       1.00       1.00
     CPU best sequential        1.82       1.89       1.97
     CPU 8 threads             14.46      11.41      10.27
     CPU 16 threads (SMT)      16.06      13.26      12.55
     GPU                       20.11      19.44      20.17
     CPU + GPU                 18.88      22.10      24.68
     2 GPUs                     4.27      16.71      22.26
     2 GPUs pipelined          15.86      25.40      34.46
  13. Conclusions
     • Shared memory nodes:
       – Performance is limited by hardware resources
       – 1 node (12 cores / 24 threads): speedup = 12.4
     • Distributed memory systems:
       – Low efficiency in terms of performance per power consumption and size
       – 8 nodes (64 cores): speedup = 38.05
     • Heterogeneous CPU/GPU systems:
       – Perfect compromise:
         • Better performance than current shared memory nodes
         • Better efficiency than distributed memory systems
       – 1 CPU + 2 GPUs: speedup = 34.46
     • Final design recommendations:
       – Powerful shared memory PPN
       – PPN with ccNUMA CPUs and GPU accelerators
       – Distributed memory only if multiple PPNs are needed
  14. Thank You. kraja@in.tum.de