SGI - HPC-29mai2012

Slide notes
  • Multiple runs and optimizations have yielded different results; focus on the graph, which shows the relative comparison of typical Linpack, idle, and application/benchmark power.
  • The world’s fastest supercomputer just got faster! Largest performance boost ever - up to 5x performance density improvement over the previous industry-leading generation - with the future Intel® Xeon® processor E5 family. Key design innovations and increased flexibility through enhanced R&D investment. The world-renowned SGI quality and performance you love. Entirely built on industry-standard hardware and software components, enabling access to the full spectrum of the Linux ecosystem. Only system in its class that installs production-ready in hours or days, not weeks or months. Flexible to fit your workload: ultimate configuration flexibility in topology/interconnect, power, cooling, CPUs and memory; seamless scalability from tens of teraflops to tens of petaflops; expandability within and across technology generations while maintaining uninterrupted production workflow.
  • First *over 1 PF peak* InfiniBand pure-compute-connected CPU cluster. World's fastest distributed memory system. Top Intel-based overall SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak); top AMD-based SPEC_MPIM2007 and SPEC_MPIL2007 performance (base and peak). World's fastest and most scalable computational fluid dynamics system: SGI ICE 8400 demonstrated unmatched parallel scaling up to 3,072 cores with a rating of 1,333.3 standard benchmark jobs per day, and also proved the ability to run ANSYS FLUENT on all 4,092 cores; to date, no other cluster has reported ANSYS FLUENT benchmark results above 2,048 cores. The ANSYS FLUENT benchmark performance increase was achieved with the help of SGI MPI PerfBoost. First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same. First and only vendor capable of live, large-scale compute capacity integration.
  • Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs to enable highest-watt SKU support. Resides between the pair of node boards in each blade slot ("M-Rack" deployment). Enables highest-watt SKU support (e.g., 130W TDPs). Utilizes a liquid-to-water heat exchanger that provisions the required quantity of flow to the M-Racks for cooling.
  • "Closed-Loop Airflow" environment: integrated hot aisle containment; no air from within the cell is mixed with the data center air in which the cell is installed (versus a hot/cold-aisle arrangement, i.e., open-loop airflow, in which the air is mixed); always water-cooled. Supports warm-water cooling, with a broad range of acceptable temperatures for additional cost savings; contains an air-to-water heat exchanger, plus a liquid-to-water heat exchanger when cold sinks are deployed. Contains large, "unified" cooling racks for efficiency: compute racks do not have their own cooling at the rack level, which decreases power costs associated with cooling, and all cooling elements utilize one water source.
  • Synchronizing the OS overhead tasks on each node so that they begin simultaneously on all nodes in the cluster results in significantly fewer wasted cycles over the duration of parallel workloads. The negative effect of unsynchronized OS noise grows continuously worse as node counts rise.
  • Left: August 1985. Right: August 2010. Iran’s Lake Oroumeih (also spelled Urmia) is the largest lake in the Middle East and the third largest saltwater lake on Earth. But dams on feeder streams, expanded use of ground water, and a decades-long drought have reduced it to 60 percent of the size it was in the 1980s. Light blue tones in the 2010 image represent shallow water and salt deposits. Increased salinity has led to an absence of fish and habitat for migratory waterfowl. At the current rate, the lake will be completely dry by the end of 2013.
  • Customer Name: iVEC and Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia. Challenge: iVEC and the Fluid Dynamics Research Group at Curtin University are working together to solve large-scale CFD problems, like simulating wind flows in the capital city of Perth. SGI Cyclone Solution: the testing included running OpenFOAM on internal systems, SGI Cyclone and the Amazon EC2 cloud; SGI Cyclone proved to scale better (1,024 cores) and was much faster.
  • Transcript

    • 1. HPC milestones. Michal Klimeš
    • 2. Experts @ HPC: Structural Mechanics (Implicit), Structural Mechanics (Explicit), Computational Fluid Dynamics, Electro-Magnetics, Computational Chemistry (Quantum Mechanics), Computational Chemistry (Molecular Dynamics), Computational Biology, Seismic Processing, Reservoir Simulation, Rendering / Ray Tracing, Climate / Weather, Ocean Simulation, Data Analytics.
    • 3. Competency = Real HPC + Big Storage
    • 4. From TOP500
    • 5. There are no small things
    • 6. OpenFOAM® Performance with SGI MPI. Speedup comparison, SGI MPT vs. OpenMPI, on an Automotive Interior Climate model (19M cells) at 64, 128, 192 and 256 cores. [Chart: SGI MPT speedup, OpenMPI speedup and MPT/OpenMPI ratio versus core count.] OpenFOAM with SGI MPI delivers up to 35% better performance. (A note on reading the ratio follows below.)
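A note on reading the ratio (my interpretation; the slide does not spell it out): at a given core count, the plotted MPT/OpenMPI ratio is presumably the SGI MPT speedup divided by the OpenMPI speedup. With the same model and the same baseline, that equals the OpenMPI runtime divided by the MPT runtime, so values above 1.0 mean the MPT run finished faster.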
    • 7. What is the "average" power consumption? Linpack: 30.5 kW*. STREAM, GUPS and Fluent: 22.1 kW* (72.5% of Linpack), 23.3 kW* (76.4%) and 22.4 kW* (73.4%) respectively. Idle: 15.9 kW* (52.1% of Linpack). Average power consumption depends heavily on: the application and its data profile; the level of code optimization (plus libraries and MPI optimization); the ability of the job scheduler to utilize the system; and the bottlenecks in the I/O subsystem and in the OS. (A quick arithmetic check follows below.) * Measured on an ICE 8200 system with 128x 2.66 GHz Quad-Core Intel® Xeon® Processor 5300 series (1 rack).
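A quick arithmetic check (mine, not on the slide): each percentage is that workload's measured power divided by the Linpack power. For example, 22.4 kW / 30.5 kW ≈ 0.734, i.e. about 73.4% of the Linpack figure, and idle power is 15.9 kW / 30.5 kW ≈ 0.521, i.e. about 52.1%.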
    • 8. Where is performance? Accelerated
    • 9. Real Memory Bandwidth Requirements. Measurements at LRZ on SGI Altix 4700. Source: Matthias Brehm (LRZ) in inSiDE, Vol. 4, No. 2. [Chart.]
    • 10. SGI HPC Servers and Supercomputers. Scale-out: Rackable™ (1U, 2U, 3U, 4U & XE; build-to-order architecture leader), CloudRack™ (tray cluster architecture, for Internet data centers), Altix® ICE (blade cluster architecture; scalability leader). Scale-up: Altix® UV (shared-memory architecture; virtualization & many-core leader).
    • 11. SGI UV2: 4th Generation SMP System. The most flexible system!
    • 12. SGI UV Shared Memory Architecture. Commodity clusters (InfiniBand or Gigabit Ethernet): each system has its own memory (~64 GB) and OS; nodes communicate over a commodity interconnect; inefficient cross-node communication creates bottlenecks; coding required for parallel code execution. SGI UV platform (SGI NUMAlink interconnect): all nodes operate on one large shared memory space (global shared memory to 16 TB); eliminates data passing between nodes; big data sets fit entirely in memory; less memory per node required; simpler to program; high performance, low cost, easy to deploy. (A minimal shared-memory sketch follows below.)
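To make the "simpler to program" point concrete, here is a minimal sketch (mine, not from the deck) of the shared-memory style the slide describes: every thread reduces over one large array in a single address space, with no explicit message passing between node-local copies as a cluster MPI code would need. The array size and OpenMP usage are arbitrary illustration choices.

```c
/* A minimal sketch (not from the deck) of shared-memory programming:
 * all threads touch one array in one address space directly. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 1L << 26;                       /* arbitrary illustrative size */
    double *data = malloc((size_t)n * sizeof *data);
    if (!data)
        return 1;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)       /* all threads share 'data' */
    for (long i = 0; i < n; i++) {
        data[i] = (double)i;
        sum += data[i];
    }

    printf("threads=%d sum=%.0f\n", omp_get_max_threads(), sum);
    free(data);
    return 0;
}
```

On a cluster, the same reduction would require distributing pieces of the array across nodes and combining partial sums with explicit messages; on a single shared-memory system the parallel loop above is the whole program.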
    • 13. The UV2 Advantage. Long 15-year heritage: same principles as Altix 4700, but with Intel Sandy Bridge Xeon multi-core processors and a large, scalable shared-memory system: up to 4,096 cores and 64 TB per partition; up to 2,048 cores, 4,096 threads and 32 TB per partition; multi-partition systems with up to 16,384 sockets and 2 PB in multiple partitions; MPI and UPC acceleration by hardware offload; cross-partition communication. In 2012 without competition, with the help of the proven SGI ccNUMA architecture.
    • 14. SGI UV2 Interconnect with Global Addressing. NUMAlink® routers connect nodes into multi-rack UV systems. The HUB snoops the socket QPI and accelerates remote access (high-radix router). The HUB offloads programming models: MPI, UPC (Co-Array not yet). [Diagram: four Altix UV blades, each with a HUB, two CPUs and 64 GB per socket, forming 512 GB of globally addressable memory over NUMAlink.]
    • 15. UV Foundation: GAM + Communications Offload. Partition memory (OS): max. 2K cores / 16 TB. PGAS memory (cross-partition). Communications offload (GRU + AMU): accelerates PGAS codes and MPI codes (MOE vs. TOE). GAM: Globally Addressable Memory, up to 8 PB (53-bit addressing). [Diagram: Intel CPU with QPI, GRU with TLB, AMU, and NUMAlink interfaces to other nodes.]
    • 16. UV1 vs. UV2. Socket: UV1 NHM-EX / WSM-EX with QPI 1.0; UV2 SNB-EX-B & SNB-EP / IVB-EX-B & IVB-EP with QPI 1.1. Hub: UV1 uses three separate chips (hub + hub + router glue) at 90 nm, with a directory DIMM (D) and snoop DRAM (S); UV2 integrates them into one chip at 40 nm, with no directory DIMM, no snoop DRAM, and better AMOs. Interconnect: UV1 NL5 at 6.25 GT/s, 8B/10B encoding, 4 x 12 lanes, copper only, 7 m max; UV2 NL6 with higher payload, 16 x 4 lanes, copper & optical, 20 m max.
    • 17. UV MPI Barrier. [Chart.]
    • 18. Additional Performance Acceleration. Barrier latency <1 µs (4,096 threads). Altix UV offers up to 3x improvement in MPI reduction processes. Barrier latency is dramatically better (80x) than competing platforms. HPCC benchmarks show the substantial improvement possible with the MPI Offload Engine (MOE). [Chart: HPCC benchmarks, UV with MOE vs. UV with MOE disabled.] Source: SGI Engineering projections. (A minimal barrier-timing sketch follows below.)
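For context on how a figure like "barrier latency <1 µs" is typically obtained, here is a minimal, generic MPI microbenchmark sketch (not SGI's benchmark code): it averages many back-to-back MPI_Barrier calls and reports the slowest rank's average.

```c
/* A minimal, generic sketch (not SGI's benchmark) of measuring MPI_Barrier
 * latency: average many back-to-back barriers and report the slowest rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int iters = 10000;
    MPI_Barrier(MPI_COMM_WORLD);                  /* warm up and align ranks */

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double elapsed = (MPI_Wtime() - t0) / iters;  /* average latency on this rank */

    double worst;
    MPI_Reduce(&elapsed, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks: average MPI_Barrier latency %.2f usec\n",
               nranks, worst * 1e6);

    MPI_Finalize();
    return 0;
}
```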
    • 19. UV2000 16-Socket, 8-Blade IRU. Notes: the IRU is 10U high by 19" wide by 27" deep; 8 blades (8 Harps & 16 sockets) per IRU; 1 or 2 CMCs in the rear of the IRU; 3 UV1 12V power supplies; nine 12V cooling fans, N+1; two signal backplanes plus a power backplane; 16 NL channels cabled in the air plenum connect the right and left backplanes.
    • 20. SGI UV2 Node Architecture and NUMAlink 6. Two Sandy Bridge-EP or -EX sockets, each with four DDR3 channels at 2 DIMMs per channel (1600 MHz), a PCIe Gen3 x16 slot and 40 PCIe lanes per socket; the sockets connect over QPI 1.1 at 8 GT/s (32 GB/s). The UV2 HUB drives 16 NL6 x4 channels at 12.5 GT/s across two NUMAlink planes (NL0/NL1); there are no memory buffers as in UV1. NUMAlink 6 provides 6.7 GB/s net bidirectional bandwidth per link, with the same per-socket performance as in a cluster. Aggregate bandwidth out of the blade over the NL6 links: 107.2 GB/s; 12 NL6 internal links to the backplane, aggregate 80.4 GB/s; 4 NL6 external links to routers, 26.8 GB/s. NUMAlink 6 routers and NUMAlink cables connect IRUs. (A link-bandwidth arithmetic check follows below.)
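These per-blade aggregates are consistent with the stated 6.7 GB/s per NL6 link (an arithmetic check I've added, not text from the slide): 12 internal links × 6.7 GB/s ≈ 80.4 GB/s, 4 external links × 6.7 GB/s ≈ 26.8 GB/s, and all 16 links together ≈ 107.2 GB/s out of the blade.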
    • 21. UV2 Topology. System topology: hypercube (IRU / blade); max 2 hops between blades.
    • 22. UV2 Feature Advances (UV1 → UV2). System scale: 2048c/4096t → 4096c/4096t. Memory/SSI: 16 TB → 64 TB. Interconnect: NUMAlink 5 → NL6 (2.5x data rate). NL fabric scale: 32K sockets → 32K+ sockets. Processor: Nehalem-EX → Sandy Bridge. Sockets/rack: 64 (large 24") → 64 (standard 19"). Reliability: enterprise class → enterprise class.
    • 23. MIC Architecture: x86-compatible; 1.3 TF/s double-precision peak; 340 GB/s bandwidth.
    • 24. SGI ICE X: Fifth-Generation ICE System. The world's fastest supercomputer just got faster! Flexible to fit your workload.
    • 25. SGI® ICE: Firsts and Onlies. First *over 1 PF peak* InfiniBand pure-compute-connected CPU cluster. World's fastest distributed memory system. World's fastest and most scalable computational fluid dynamics system. First and only vendor to support multiple fabric-level topologies, plus flexibility at the node, switch and fabric level, plus application benchmarking expertise for same. First and only vendor capable of live, large-scale compute capacity integration.
    • 26. Dialing Up The Density! SGI ICE 8400 → SGI ICE X. SGI ICE 8400: 64 nodes (128 sockets), 30" wide. ICE X D-Rack: 72 nodes (144 sockets), 24" wide. ICE X M-Rack: 72 x 2 = 144 nodes (288 sockets), 28" wide.
    • 27. SGI ICE X Enclosure Design. Building block: increments of two blade enclosures, "one enclosure pair". Features per enclosure pair: 36 blade slots; four fabric switch slots (1U); integrated management; separable 19" rack-mount power shelf. [Diagram: rear view of the 21U "building block"; each enclosure is 16.59" (9.5U) high.]
    • 28. SGI ICE X Compute Blade: IP-113 (Dakota) for D-Rack, with FDR mezzanine card options. Main features: supports single- or dual-plane FDR InfiniBand; supports two future Intel® Xeon® processor E5 family CPUs; supports up to eight DDR3 DIMMs per socket @ 1600 MT/s; houses up to two 2.5" SATA drives for local swap/scratch usage; utilizes traditional heat sinks.
    • 29. SGI ICE X Compute Blade: IP-115 (Gemini Twin) for M-Rack. Main features: supports single-plane FDR InfiniBand; supports four future Intel® Xeon® processor E5 family CPUs (two dual-socket nodes); supports four DDR3 DIMMs per socket @ 1600 MT/s; houses up to two 2.5" SATA drives for local swap/scratch usage (one per node); utilizes traditional heat sinks and cold sinks (liquid).
    • 30. On-Socket Water-Cooling Detail. Used for IP-115 Gemini "twin" blades; replaces the traditional air-cooled heat sinks on the CPUs to enable highest-watt SKU support. Resides between the pair of node boards in each blade slot (M-Rack deployment). Enables highest-watt SKU support (e.g., 130W TDPs). Utilizes a liquid-to-water heat exchanger that provisions the required quantity of flow to the M-Racks for cooling.
    • 31. Notable Features of a "Cell" (D-Cell and M-Cell). One cooling rack plus one compute rack form one complete cell. "Closed-loop airflow" environment. Supports warm-water cooling. Contains large, "unified" cooling racks for efficiency.
    • 32. Common Topologies: Hypercube, Enhanced Hypercube, Fat Tree (CLOS) network, Mesh or Torus (2, 3 or more dimensions), All-to-All. These are supported on SGI ICE 8400 and SGI ICE X, or will be supported when available in OFED.
    • 33. ICE Differentiation: OS Noise Synchronization. OS system noise: CPU cycles stolen from a user application by the OS to do periodic or asynchronous work (monitoring, daemons, garbage collection, etc.). The management interface will allow users to select what gets synchronized. Performance boost on larger-scale systems. [Diagram: with unsynchronized OS noise, system overhead hits each node at a different time, so every step wastes cycles waiting for the barrier to complete; with synchronized OS noise, the overhead coincides across nodes and results arrive faster.] (A toy model follows below.)
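A toy model (my own illustration, not an SGI tool) of why unsynchronized noise gets worse with scale: if each node is occasionally delayed by OS work and a barrier must wait for the slowest node, then with unaligned noise the probability that some node is delayed on a given step approaches 1 as node count grows, whereas aligned noise costs the same penalty regardless of node count. All constants below are arbitrary.

```c
/* A toy model (not an SGI tool): each compute step ends with a barrier, so one
 * delayed node delays everyone. With unsynchronized noise, the chance that
 * *some* node is delayed on a step grows with node count; with synchronized
 * noise the whole machine pays the penalty together. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int steps = 1000, noise_period = 10;    /* 1 noisy step in 10 */
    const double compute = 1.0, noise = 0.2;      /* arbitrary time units */

    for (int nodes = 1; nodes <= 4096; nodes *= 8) {
        double t_unsync = 0.0, t_sync = 0.0;
        for (int s = 0; s < steps; s++) {
            /* Unsynchronized: each node's noisy step lands at a random time. */
            int delayed = 0;
            for (int n = 0; n < nodes && !delayed; n++)
                delayed = (rand() % noise_period == 0);
            t_unsync += compute + (delayed ? noise : 0.0);

            /* Synchronized: every node is noisy on the same steps. */
            t_sync += compute + (s % noise_period == 0 ? noise : 0.0);
        }
        printf("%5d nodes: unsync %.0f  sync %.0f  (%.2fx slower)\n",
               nodes, t_unsync, t_sync, t_unsync / t_sync);
    }
    return 0;
}
```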
    • 34. SGI ICE X: Cool Customers.
    • 35. SGI ICE X: Initial Customers. NASA: increasing their current SGI ICE system, called "Pleiades," by 35% with multiple racks of future Intel® Xeon® processor E5 family CPUs; it will reach 1.7 petaflops. Workloads: facilitating new discoveries for Earth Science research projects; modeling and simulation to support flight regimes and new designs for aircraft; engineering risk assessment of crew risk probabilities to support development of launch and commercial crew vehicles for space exploration missions. NTNU: 13 SGI ICE X racks at >275 teraflops plus 4 SGI InfiniteStorage 16000 racks at 1.2 petabytes, to accelerate numerical weather predictions and develop atmospheric and oceanographic models for improved weather forecasting.
    • 36. UN Chief Calls for Urgent Action on Climate Change. NASA Advanced Supercomputing Division, SGI® ICE. Images taken by the Thematic Mapper sensor aboard Landsat 5. Source: USGS Landsat Missions Gallery, U.S. Department of the Interior / U.S. Geological Survey.
    • 37. Cyclones
    • 38. Cyclone Service Models. SGI delivers technical application expertise. Software (SaaS): SGI delivers commercially available open and 3rd-party software via the Internet. SGI offers a platform for developers. SGI delivers the system infrastructure.
    • 39. SGI OpenFOAM® Ready for Cyclone. Customer: iVEC and Curtin University, Australia. Problem: solving large-scale CFD problems, like simulating wind flows in the capital city of Perth. Solution: OpenFOAM scaled better on SGI Cyclone (1,024 cores) and was 20x faster than on Amazon EC2. [Diagram: the user submits a job through the Technical Applications Portal powered by SGI Cyclone.] Source: Dr Andrew King, Department of Mechanical Engineering, Curtin University of Technology, Australia.
    • 40. Balanced design & architecture. Do you attach a caravan to the F1?
    • 41. ©2011 SGI
