SGI HPC Day Kiev 2009 10 UV

Project Ultraviolet Overview

Published in: Education

  1. 1. Project Ultraviolet Overview
  2. 2. Clusters vs. Shared Memory Architecture
     Small-node x86 clusters (commodity interconnect):
       • Each system has its own memory and OS
       • Batch, not interactive, user interface
       • Coding required for parallel code execution
       • Great for capacity workflows
       • SGI® Altix XE x86-64 clusters, Rackable BTO
     SGI® Altix™ 4000 family and UV (SGI® NUMAflex™ interconnect, global shared memory):
       • All nodes operate on one large shared memory space
       • Cache coherency
       • Eliminates data passing between nodes
       • Big data sets fit entirely in memory
       • Less memory per node required
       • Simpler to program
       • High performance, low cost, easy to deploy
     Company Confidential
  3. 3. Infiniband vs. NUMAlink™ Interconnect
     Interconnect type               Bandwidth (each direction)
     Infiniband 4x DDR               2.0 GBytes/s
     Infiniband 4x QDR               4.0 GBytes/s
     NUMAlink 4 (Altix 4700/450)     3.2 GBytes/s
     NUMAlink 5 (UV)                 7.5 GBytes/s
  4. 4. Independent Scaling: CPU, Memory, I/O
  5. 5. SGI® Modularity Evolution
     [Timeline, 1997 to 2006: Modules (Origin 2000), then Bricks (Origin 3000, Altix), then SGI Blades (Altix 4700/450).]
  6. 6. SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect
     [Diagram: two nodes, each with two CPUs and their caches attached to an interface chip and physical memory. The interface chips are joined by the NUMAlink interconnect, and the physical memories together form the shared memory.]
  7. 7. SGI Scalable ccNUMA Architecture: Scaling to Large Node Counts
     [Diagram: four nodes, each with two CPUs and their caches, an interface chip, and (local) physical memory, connected through NUMAlink and routers.]
     • Shared memory (within an SSI): OpenMP
     • Globally Addressable Memory (GAM) within a NUMAlinked system: MPI
  8. 8. Communication vs. Computation: Applications on Altix 3000
     [Stacked chart, 0% to 100%: share of computation vs. communication per application. Applications as labeled: Nastran/4, Pam-Crash/32, Ls-Dyna/48p, Radioss/96, Fluent/64, StarHPC/32, Fire/32, Gamess/32, Amber/8, CASTEP/128, ADF/32, HOMME/1944, MM5/96, HIRLAM/128, CCM3/64, IFS/120, GeoDepth, Eclipse/52, VIP/32. Category labels: CSM, CFD, CCM, BIO, CWO, SPI, RES.]
  9. 9. Why MOE (MPI Offload Engine)?
  10. 10. SGI® Project Ultraviolet: Overview
      Extraordinary Capability in an x86 Architecture
      • Performance and Productivity for Demanding Workloads
        - Highly data-efficient: up to many terabytes of data in memory
        - Scales to 2048 cores and 16TB in a single x86 system
        - Scales IO to >1TB/s
      • Advanced Reliability
        - Hardware-enabled fault detection, prevention, containment
        - Enhanced monitoring and serviceability
      • Low TCO
        - x86-64 and Linux economics
        - Industry-leading rack-level energy efficiency
        - Easiest system to administer and productively use
  11. 11. UV Architectural Scalability
      • 16,384 nodes (scaling supported by NUMAlink 5 node ID)
        - 16,384 UV_HUBs
        - 32,768 sockets / 262,144 cores (with 8 cores per socket)
        - >2 pflop
      • Coherent shared memory
        - Xeon: 16TB (44 bits socket PA)
      • 8PB coherent get/put memory (53 bits PA w/GRU)
      • 16 DIMMs per node (2 DIMMs per channel)
      • Intel coherence scheme within node
      • SGI coherence scheme between nodes
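The scaling figures on this slide can be cross-checked with simple arithmetic. The sketch below is illustrative only: the variable names are invented here, and the value of two sockets per node is an assumption inferred from the slide's own totals rather than something the slide states.

```python
# Cross-check of the scalability figures quoted on the slide.
# All names are illustrative; SOCKETS_PER_NODE is an inferred assumption.
NODES = 16_384            # one UV_HUB per node, per the slide
SOCKETS_PER_NODE = 2      # assumption: implied by the socket and core totals
CORES_PER_SOCKET = 8      # stated on the slide

sockets = NODES * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET

TIB = 2 ** 40
PIB = 2 ** 50


def addressable_bytes(pa_bits: int) -> int:
    """Bytes reachable with a physical address of pa_bits bits."""
    return 2 ** pa_bits


xeon_tib = addressable_bytes(44) // TIB   # socket physical address space
gru_pib = addressable_bytes(53) // PIB    # get/put space reachable via the GRU

print(f"{sockets:,} sockets, {cores:,} cores")
print(f"44-bit PA: {xeon_tib} TiB, 53-bit PA: {gru_pib} PiB")
```

Under these assumptions the totals come out to 32,768 sockets and 262,144 cores, and the 44-bit and 53-bit address widths correspond to the 16TB and 8PB limits quoted above.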
  12. 12. UV Accelerated Performance: For Distributed or Shared Memory Programming
      • MPI Offload Engine (MOE) frees CPU from MPI activity
        - MPI reductions 2-3X faster than competitive clusters/MPPs
        - Barriers up to 80X+ faster
      • NUMAlink advances: industry’s most efficient interconnect
      • Massively memory-mapped I/O
        - Big speedup for I/O-bound apps
      • Hold massive datasets in memory
        - Up to 16TB per OS system image, to petascale across systems
  13. 13. UV Accelerated Performance: For Distributed or Shared Memory Programming
      • MPI Offload Engine (MOE) frees CPU from MPI activity
        - MPI reductions 2-3X faster than competitive clusters/MPPs
        - Barriers up to 80X+ faster
      • NUMAlink advances
        - 2-3X MPI latency improvement
      • Massively memory-mapped I/O
        - Big speedup for I/O-bound apps
      • Hold massive datasets in memory
        - Up to 16TB per OS image, to petascale across systems
        - Up to 10X+ speedup for data-intensive applications
      [Chart: Longest Path MPI Latency (axis 0 to 6) vs. destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV.]
  14. 14. UV Low TCO: Economical to Own and Operate
      • Excellent price/performance
        - x86 economics plus UV performance advantages
        - UV: 3-5X compared to today’s Altix
        - Can take the place of multiple systems
      • Leading rack-level power efficiency
        - UV stretch goal = 80%
      • Most economical system to administer and use
      [Chart: delivered rack-level power efficiency, axis 55% to 80%, for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Ultraviolet, and Carlsbad; one value is labeled 78%.]
  15. 15. Project Ultraviolet Product Design
      • Bladed node package
      • Memory- or compute-dense blades
      • Variety of IO expansion options
      • Mix/match resources
      • Expand or reconfigure when needed
      • Industry-leading scalability
      • Runs standard Linux distros (Red Hat, SLES)
  16. 16. IRU (Chassis) Packaging and Topology
      • 16-blade IRU for 24” rack (18U, 24” EIA)
      • 2-blade IRU for 19” rack (3U)
      • Compute node with IO expansion capability
      • Power labels: N+1 PS; 1+1 PS for blowers
      24” IRU topology:
      • (8) NUMAlink 5 ports per router cabled to network
      • (8) NUMAlink 5 fan-in ports per router
      • Paired nodes (dual NUMAlink 5 cross-linked)
  17. 17. Ultraviolet Rack
      • (64) Intel® Xeon® sockets
      • (512) Intel Xeon cores
      • (512) DDR3 RDIMMs
      • N+1 12VDC power supplies
      • N+1 axial fans
      • (2) 60A 200VAC-240VAC 3-Φ IEC 60309 plugs provide 17.3 kVA each
      • Rack nameplate 34.5 kVA max
      • Optional water-cooling
      • Leverages SGI® Altix® ICE 8200
      • Blade-based packaging
      • Air-cooled electronics
      • 128GB / node (w/ 8GB DIMMs)
      • 4TB / rack (w/ 8GB DIMMs)
      • Integrated BaseIO and boot HDDs
      • Integrated or external IO expansion
      • SGI® NUMAlink™ 5 network
      • (1) System Management Node per up to 4 racks
      • IO expansion for higher-power or larger form factor cards
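The per-rack totals on this slide follow from the per-socket and per-node figures quoted elsewhere in the deck. A minimal sketch of that arithmetic, using illustrative names and taking 8 cores per socket and 16 DIMMs per node from the scalability slide:

```python
# Rack-level totals derived from figures quoted in the deck.
# All names are illustrative, not SGI terminology.
SOCKETS_PER_RACK = 64
CORES_PER_SOCKET = 8     # from the scalability slide
DIMMS_PER_RACK = 512
DIMMS_PER_NODE = 16      # from the scalability slide
DIMM_GB = 8
PLUGS = 2
KVA_PER_PLUG = 17.3

cores = SOCKETS_PER_RACK * CORES_PER_SOCKET   # cores per rack
rack_mem_gb = DIMMS_PER_RACK * DIMM_GB        # memory per rack, in GB
node_mem_gb = DIMMS_PER_NODE * DIMM_GB        # memory per node, in GB
nodes = DIMMS_PER_RACK // DIMMS_PER_NODE      # nodes per rack
plug_kva = PLUGS * KVA_PER_PLUG               # combined plug capacity

print(f"{cores} cores, {rack_mem_gb // 1024} TB per rack, "
      f"{node_mem_gb} GB per node, {nodes} nodes")
print(f"combined plug capacity: {plug_kva:.1f} kVA")
```

The core and memory totals match the slide. The combined plug capacity works out to about 34.6 kVA, marginally above the 34.5 kVA nameplate; the difference is presumably rounding in the per-plug figure.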
  18. 18. UV System Packaging Options
      Segments shown: High Performance, Price-performance, Midrange Capability
      Rack                             Sockets / cores            Memory (8GB DIMM)   Peak
      42U, 24-inch rack                64 skts, 512c per rack     4TB                 Up to 4.65 tflop
      20U, 19-inch rack                24 skts, 192c              3TB                 Up to 1.8 tflop per short rack
      42U, 24-inch rack, routerless    64 skts, 512c              4TB                 Up to 4.65 tflop
      40U, 19-inch rack                Up to 50 skts, 400c        3TB                 Up to 3.5 tflop per rack
      Diagram labels: Quad Router; 19” rack; Short Rack; Admin Node; Storage; IO Expansion; 16 blade chassis; 2 blade chassis; 24/32 core
      • Fat Tree, 7.5GB/s/skt bisection
      • 2D Torus, 1.25GB/s/skt bisection
      • NL scalable to 16K sockets
      • Can be clustered with IB, Gig-E
      • Up to 2048-core SSI supported
  19. 19. Capability Comparisons: UV-Midrange Offers More Headroom
      Metrics compared: system scale (sockets), max memory (TB), max IO (PCIe slots/system)
      • UV-Midrange: 96 sockets SSI, 6 TB max memory, 64+ IO slots
      • Scalable x86 (IBM, Bull, Unisys)
      • 8S glueless
      • IBM P6 570/575, HP Integrity
      [Comparison chart; only the UV-Midrange values survive in the transcript.]
  20. 20. UV Nehalem-EX Node Board: Compute Blade
      [Block diagram labels: optional I/O riser, Boxboro IOH, two Nehalem-EX sockets, QPI links, UV HUB with RLDRAM (snoop acceleration), (2) directory FB-DIMMs, (4) NUMAlink 5.]
      Each blade:
      • 8-16 Xeon cores
      • Up to 145 gflop
      • Up to 128GB
      • (8) DDR3 RDIMMs and (4) Millbrook memory buffers per socket
      Link bandwidths:
      • SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
      • Intel® QuickPath Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)
      • Directory FBD1 = 6.4 GB/s read + 3.2 GB/s write (800MHz DIMMs)
      • Millbrook memory buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s read per socket
      • Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
      Single-socket memory/IO expansion blade also available.
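The aggregate bandwidths quoted for the node board can be reproduced from per-direction and per-channel figures. The sketch below is a rough check rather than a specification: the names are illustrative, the 7.5 GB/s NUMAlink 5 per-direction rate comes from the interconnect comparison slide, and the 2 bytes per transfer per direction for a full-width QPI link is an assumption brought in from outside the deck.

```python
# Reproduce the aggregate bandwidth figures from per-direction values.
# Names are illustrative; QPI_BYTES_PER_TRANSFER is an outside assumption.
NUMALINK5_GBS_PER_DIRECTION = 7.5   # from the interconnect comparison slide
QPI_GT_PER_S = 6.4                  # from this slide
QPI_BYTES_PER_TRANSFER = 2          # assumption: full-width link, per direction
MILLBROOK_GBS_PER_CHANNEL = 8.53    # from this slide
MEMORY_CHANNELS = 4
FBD_READ_GBS = 6.4
FBD_WRITE_GBS = 3.2

numalink5_aggregate = 2 * NUMALINK5_GBS_PER_DIRECTION
qpi_aggregate = 2 * QPI_GT_PER_S * QPI_BYTES_PER_TRANSFER
memory_read_per_socket = MILLBROOK_GBS_PER_CHANNEL * MEMORY_CHANNELS
directory_total = FBD_READ_GBS + FBD_WRITE_GBS

print(f"NUMAlink 5 aggregate: {numalink5_aggregate:.1f} GB/s")
print(f"QPI aggregate: {qpi_aggregate:.1f} GB/s")
print(f"Memory read per socket: {memory_read_per_socket:.1f} GB/s")
print(f"Directory read + write: {directory_total:.1f} GB/s")
```

Under these assumptions, "aggregate" on the slide is consistent with the sum of both link directions for NUMAlink 5 and QPI.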
  21. 21. UV Single Socket or Memory Expansion Blade
      [Block diagram labels: optional I/O riser, Boxboro IOH, one Nehalem-EX socket (Single-Socket), QPI links, UV HUB, (4) NUMAlink 5, Memory Expansion Blade, (16) DDR3 RDIMMs and (4) Millbrook memory buffers.]
      Link bandwidths:
      • SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
      • Intel® QuickPath Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)
      • Millbrook memory buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s read per socket
      • Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
  22. 22. UV Nehalem-EX Node Board
      [Board layout labels: two sockets, RLDRAM, UV_HUB, mezzanine connector, (2) Quick Path links to I/O riser.]
      Board size: 11.2 in W x 19.5 in L (1/2 panel board)
  23. 23. (4) Integrated IO Riser Options: Node Blade or Memory Expansion Blade
      • BaseIO: (2) hot-plug 2.5” boot HDDs
      • Externalized IO: (2) PCIe Gen2 x16 cable connections to IO expansion chassis
      • Integrated PCIe Gen2: (1) x16 low-profile, (1) x8 low-profile
  24. 24. UV IO Expansion Chassis in Development: For Full-height and High-Power Card Support
      • 1U
      • One x16 PCIe Gen2 input connector
      • Each unit supports up to 4 slots, either PCI-X or PCIe
  25. 25. P0 Node, in Engineering Test
  26. 26. P0 BaseIO
  27. 27. 18U 24-in EIA Individual Rack Unit (IRU): Front (Node Blade) View and Rear (Router Blade) View
  28. 28. Power, Cooling and Facilities
  29. 29. SGI Altix ICE 8200 Water-Cooled Coils
      [Diagram labels: (4) individual coils; individual coil; condensate drain pan; branch feed to individual coil; swivel coupling to supply hose.]
      • Target heat rejection: 95% water / 5% air
      • Chilled-water supply: 45°F to 60°F (7.2°C to 15.6°C), 14.4 gpm (3.3 m³/hr) max.
      • 3/4” (1.91 cm) coupling
  30. 30. UV Rack w/ Top-Feed Water-Cooled Coil
      • Target heat rejection: 95% water / 5% air
      • Chilled-water supply: 45°F to 65°F (7.2°C to 18.3°C), 16.0 gpm (3.6 m³/hr) max.
      • 1” (2.54 cm) coupling
      UV enhancements:
      - Reduce water-side pressure drop
      - Increase allowable water supply temp to 65°F (18.3°C)
      - Enable top-feed water
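The chilled-water figures on the two coil slides are given in both US and metric units. A short sketch of the conversions, with illustrative function names and the exact US liquid gallon definition (3.785411784 litres) as the only outside constant:

```python
# Unit conversions behind the dual-unit figures on the coil slides.
# Function names are illustrative.
LITRES_PER_US_GALLON = 3.785411784


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def gpm_to_m3_per_hour(gpm: float) -> float:
    """Convert US gallons per minute to cubic metres per hour."""
    return gpm * LITRES_PER_US_GALLON * 60.0 / 1000.0


for deg_f in (45.0, 60.0, 65.0):
    print(f"{deg_f:.0f} F = {fahrenheit_to_celsius(deg_f):.1f} C")

for gpm in (14.4, 16.0):
    print(f"{gpm:.1f} gpm = {gpm_to_m3_per_hour(gpm):.1f} m3/hr")
```

The rounded results agree with the metric values printed on the slides.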
  31. 31. 80 Plus® Organization: Ultraviolet Power Supplies Planned to be Gold Certified
      Mission: a unique forum that is uniting electric utilities, the computer industry, and consumers in a groundbreaking effort to bring energy-efficient power supplies to desktop computers and servers.
      • N+0 desktop power supply certification available today
      • SGI worked with 80 Plus to draft the N+1 server power supply specification
      80 Plus level     Bronze     Silver     Gold
      CSCI year         Year 1     Year 2     Year 3
      CSCI date         July-07    July-08    July-09
      20% PSU load      81%        85%        88%
      50% PSU load      85%        89%        92%
      100% PSU load     81%        85%        88%
      http://www.80plus.org/
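The efficiency table on this slide can be read as a set of minimum thresholds at three load points. The following sketch encodes the table as printed and classifies a measured power supply against it; the function name, the data layout, and the sample measurements are hypothetical, and this is not the 80 Plus program's own test procedure.

```python
from typing import Dict, Optional

# Minimum efficiency (%) at each PSU load (%), copied from the slide's table.
THRESHOLDS: Dict[str, Dict[int, float]] = {
    "Bronze": {20: 81.0, 50: 85.0, 100: 81.0},
    "Silver": {20: 85.0, 50: 89.0, 100: 85.0},
    "Gold": {20: 88.0, 50: 92.0, 100: 88.0},
}
LEVELS_HIGH_TO_LOW = ["Gold", "Silver", "Bronze"]


def certification_level(measured: Dict[int, float]) -> Optional[str]:
    """Return the highest level whose thresholds are met at every load point.

    `measured` maps PSU load (%) to measured efficiency (%).
    Returns None if even the Bronze thresholds are not met.
    """
    for level in LEVELS_HIGH_TO_LOW:
        required = THRESHOLDS[level]
        if all(measured[load] >= minimum for load, minimum in required.items()):
            return level
    return None


# Hypothetical measurements, for illustration only.
print(certification_level({20: 88.5, 50: 92.1, 100: 88.0}))
print(certification_level({20: 86.0, 50: 90.0, 100: 85.0}))
print(certification_level({20: 80.0, 50: 86.0, 100: 82.0}))
```

A supply has to meet the threshold at all three load points, so a single weak point (for example, at 20% load) drops it to the next level down.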
  32. 32. Energy Efficiency: Rack Level
      [Chart: net (all-in) rack energy efficiency roadmap, axis 55% to 80%, for Origin 2000, Origin 3000, Altix 3000, Altix 4000, Ultraviolet, and Carlsbad; one value is labeled 78% and the 80% level is marked as the rack stretch goal.]
      N.B. even higher efficiency if no water-coil.
  33. 33. UV Rack Power
      • 34.5 kVA: rack nameplate
        - Used for facilities wire-sizing
      • 33.3 kW: power model roll-up
        - 130W TDP sockets, full memory, fans at altitude with water-coil impedance
      • 30.0 kW: estimate running Linpack
        - 90% of power model
        - “Maximum measured”
      • 22.5 kW: estimate running applications
        - ~75% of Linpack power
        - Used for energy consumption planning (kWh)
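The rack power estimates are derived from one another by fixed percentages. A minimal sketch of that chain, ending with the kind of annual kWh figure the last estimate is meant to feed; the names are illustrative, and round-the-clock operation for a full year is an assumption made here, not a claim from the slide.

```python
# Chain of rack power estimates, as described on the slide.
# Names are illustrative; HOURS_PER_YEAR assumes continuous operation.
POWER_MODEL_KW = 33.3          # power model roll-up
LINPACK_FRACTION = 0.90        # Linpack estimate as a share of the model
APPLICATION_FRACTION = 0.75    # application estimate as a share of Linpack
HOURS_PER_YEAR = 24 * 365      # assumption: rack runs all year

linpack_kw = round(POWER_MODEL_KW * LINPACK_FRACTION, 1)
application_kw = APPLICATION_FRACTION * linpack_kw
annual_kwh = application_kw * HOURS_PER_YEAR

print(f"Linpack estimate: {linpack_kw:.1f} kW")
print(f"Application estimate: {application_kw:.1f} kW")
print(f"Annual energy at application load: {annual_kwh:,.0f} kWh")
```

The nameplate value is left out of the chain because it is quoted in kVA and is used for wire-sizing rather than energy planning.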
  34. 34. Projected UV Performance Advances
      [Chart: Longest Path MPI Latency (axis 0 to 6) vs. destination CPU (0 to 2000) for Altix 4700, Altix ICE, and UV.]
      [Chart: Excellent BW/latency MPI profile for large jobs, UV vs. typical cluster systems; bandwidth vs. message size (bytes) for UV-NL5, NL4, and IB.]
      [Chart: HPCC Benchmarks, UV speedups with GRU: Random Access, FFTE, ptrans.]
      [Chart: MPI and HPPC, Barriers; single-element MPI_Reduce, time for MPI_Reduce (us, axis 0 to 25) vs. number of threads, UV with GRU vs. UV no GRU, annotated 3X.]
      Barrier latency <1 usec (4096 threads)
      Source: Qlogic, Inc.
  35. 35. MPI Latency
      [Chart: UV MPI half ping-pong latencies, longest path; latency in ns (axis 0 to 2000) vs. cores (32 to 256K).]
  36. 36. UV_HUB / Node Controller Technologies
      Processor Interface
      • Snoop acceleration
      • Large number of in-flight references
      Active Memory Unit
      • Rich set of atomic operations
      • AMO cache at memory home
      • Multicast
      • Message queues in coherent memory
      • Page initialization
      Globally Addressable Memory
      • Large shared address space
      • Extremely large coherent get/put space
      GRU (Global Reference Unit)
      • High-BW, low-latency socket communication
      • Update cache for many AMOs
      • Scatter/gather operations
      • BCOPY operations
      • External TLB with large page support
      RAS
      • Redundant real-time clock
      • Built-in debug and performance monitors
      • Internal/external datapath protection
      • Alpha-immune flip-flops
      Also listed: AMOs in coherent memory; coherence directory
  37. 37. Thank You! © 2008 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Altix, XFS, the SGI logo, NUMAflex and the Silicon Graphics cube are registered trademarks and NUMAlink, CXFS are trademarks of SGI in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries. Intel, Itanium and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All other trademarks mentioned herein are the property of their respective owners.
  38. 38. Paired UV Nehalem-EX Nodes
      [Block diagram labels: two node boards, each with an optional I/O riser, Boxboro IOH, two Nehalem-EX sockets, QPI links, and a UV HUB with (2) NUMAlink 5; (2) SGI NUMAlink 5 on backplane between the HUBs; (8) DDR3 RDIMMs and (4) Millbrook memory buffers per socket.]
      Link bandwidths:
      • SGI® NUMAlink™ 5 = 15.0 GB/s aggregate
      • Intel® QuickPath Interconnect (QPI) = 25.6 GB/s aggregate (6.4 GT/s)
      • Millbrook memory buffers = 8.53 GB/s (1067MHz DDR3 DIMMs) x 4 channels = 34.1 GB/s read per socket
      • Intel® Scalable Memory Interconnect (SMI) = 30 GB/s/socket
  39. 39. SGI Flagship Platform Evolution
      SGI’s flagship product line has 4 characteristics:
        1. GAM
        2. SSI
        3. x/core, where x = {I/O, memory}
        4. SWAP (and cooling)
      UV, 3 things to know:
        1. Xeons into the flagship product line WITHOUT COMPROMISE
        2. MOE (MPI Offload Engine)
        3. Topology options:
           - Selectable fat-tree sizes
           - Vertices within a torus
           - Paired node, routerless or routed
           - Constellations
  40. 40. UV HUB/Node Controller Features: Extended Capability
      • Enabling enterprise-class scalability and reliability on x86-64
        - Cache coherence across nodes
        - Fault resiliency: mirror through block devices in memory, survive OS crash
        - Extensive fault isolation, datapath protection, monitoring/debug functions
      • Accelerating large-scale workloads
        - Fast message passing (without CPU cache-line delays)
        - Extends CPU capability for load requests
        - System scale to 256+ sockets, 2048+ cores on standard Linux
      • Accelerating data-intensive applications
        - Extended physical memory address to petascale (8PB)
        - Extended “Super” TLB page size (1TB, map up to 4PB); avoids TLB misses for large, random data references
        - Very fast locking mechanism for highly contended data (no cache-line delay)
        - Off-loads add, compare, swap instructions
      • HUB/node controller directly exposed to user for easy utilization
        - No system calls
  41. 41. System Management
      • UV maintains the hierarchical system management approach
        - Origin/Altix: L1/L2/L3
        - ICE/UV: BMC, CMC, Leader Node/SMN
        - Command line interfaces at L2 and CMC are very similar
      • Unified approach to system management wrapped into SGI Cluster Manager
      • SNMP used extensively across product lines, including UV
        - Hardware inventories and sensor values stored in MIB format
        - SNMP data coalesced at SMN, available via SGI-provided RAS software or through SNMP queries by 3rd-party or customer-developed apps
