CSC – Tieteen tietotekniikan keskus Oy
CSC – IT Center for Science Ltd.
The Future of Supercomputing
Olli-Pekka Lehto
Systems Specialist
CSC – IT Center for Science
• Center for Scientific Computing
  – Offices located in Keilaniemi, Espoo
  – All shares owned by the Ministry of Education
  – Founded in 1970 as a technical support unit for the Univac 1108
• Provides a variety of services to the Finnish research community
  – High Performance Computing (HPC) resources
  – Consulting services for scientific computing
  – Scientific software development (Chipster, Elmer etc.)
  – IT infrastructure services
  – ISP services (FUNET)
CSC in numbers
• ~180 employees
• 3000 researchers use the computing capacity actively
  – Around 500 projects at any given time
• ~320 000 FUNET end-users in 85 organizations
Louhi.csc.fi
Model: Cray XT4 (single-socket nodes) + Cray XT5 (dual-socket nodes)
Processors: 10864 AMD Opteron 2.3 GHz cores
  (2716 quad-core processors; 1012 XT4 + 852 XT5 nodes)
Theoretical peak performance: >100 TeraFlop/s
  (= 2.3 * 10^9 Hz * 4 Flop/Hz * 10864)
Memory: ~10.3 TeraBytes
Interconnect network: Cray SeaStar2
  (3D torus: 6 * 5.6 GByte/s links per node)
Power consumption: 520.8 kW (high load), ~300 kW (nominal load)
Local filesystem: 67 TB Lustre filesystem
Operating system: SuSE Linux on service nodes, Cray Compute Node Linux on compute nodes
A "capability" system: few large (64-10000 core) jobs
Murska.csc.fi
Model: HP ProLiant blade cluster
Processors: 2176 AMD Opteron 2.6 GHz cores
  (1088 dual-core processors; 544 blade servers)
Theoretical peak performance: ~11.3 TeraFlop/s
  (= 2.6 * 10^9 Hz * 2 Flop/Hz * 2176)
Memory: ~5 TB
Interconnect network: Voltaire 4x DDR InfiniBand
  (16 Gbit/s fat-tree network)
Power consumption: ~75 kW (high load)
Local filesystem: 98 TB Lustre filesystem
Operating system: HP XC Cluster Suite (RHEL-based Linux)
A "capacity" system: many small (1-128 core) jobs
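The peak figures for Louhi and Murska above follow from cores × clock rate × floating-point operations per cycle. A minimal sketch of that arithmetic, assuming 4 flops per cycle for Louhi's quad-core Opterons (as on the slide) and 2 flops per cycle for Murska's older dual-core parts (an inference from its ~11.3 TFlop/s figure):

```python
# Theoretical peak = cores * clock (Hz) * floating-point ops per cycle.
def peak_tflops(cores, ghz, flops_per_cycle):
    return cores * ghz * 1e9 * flops_per_cycle / 1e12

# Louhi: 10864 cores at 2.3 GHz, 4 flops/cycle (quad-core Opteron)
louhi = peak_tflops(10864, 2.3, 4)

# Murska: 2176 cores at 2.6 GHz, 2 flops/cycle (dual-core Opteron, assumed)
murska = peak_tflops(2176, 2.6, 2)

print(round(louhi, 1), round(murska, 1))   # ~99.9 and ~11.3 TFlop/s
```

Note that peak numbers of this kind are an upper bound; real applications typically sustain only a fraction of them.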
Why use supercomputers?
• Constraints
  – Results are needed in a reasonable time
    • Impatient users
    • Time-critical problems (e.g. weather forecasting)
  – Large problem sizes
    • The problem does not fit into the memory of a single system
    • Many problem types require all the processing power close to each other
      – Distributed computing (BOINC etc.) works well only on certain problem types
Who uses HPC?
[Timeline diagram: application areas by sector, 1960s-2010s]
• Military: weapons modelling, signals intelligence, radar image processing, tactical simulation, strategic simulation ("Wargames")
• Scientific: nuclear physics, mathematics, quantum chemistry, fusion energy, nanotechnology, climate change, weather forecasting, genomics, aerodynamics, materials science, drug design, organ modelling
• Commercial: Electronic Design Automation (EDA), crash simulations, movie SFX, feature-length movies, search engines, oil reservoir discovery, stock market prediction, banking & insurance databases
State of HPC 2009
• Move towards commodity components
  – Clusters built from off-the-shelf servers
  – Linux
  – Open source tools (compilers, debuggers, cluster management, applications)
  – Standard x86 processors
• Price-performance efficient components
  – Low-latency, high-bandwidth interconnects
    • Standard PCI cards
    • InfiniBand, 10 Gigabit Ethernet, Myrinet
  – Parallel filesystems
    • Striped RAID (0) with fileservers
    • Lustre, GPFS, PVFS2 etc.
Modern HPC systems
Commodity clusters
• A large number of regular servers connected together
  – Usually a standard Linux OS
  – Possible to even mix and match components from different vendors
• May include some special components
  – High-performance interconnect network
  – Parallel filesystems
• Low-end and midrange systems
• Vendors: IBM, HP, Sun etc.
Proprietary supercomputers
• Designed from the ground up for HPC
  – Custom interconnect network
  – Customized OS & software
  – Vendor-specific components
• High-end supercomputers and special applications
• Examples: Cray XT series, IBM BlueGene
The Three Walls
There are three "walls" which CPU design is hitting now:
• Memory wall
  – Processor clock rates have grown faster than memory clock rates
• Power wall
  – Processors consume an increasing amount of power
  – The increase is non-linear
    • +13% performance = +73% power consumption
• Microarchitecture wall
  – Adding more complexity to the CPUs is not helping that much
    • Pipelining, branch prediction etc.
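One rough way to see where a figure like "+13% performance = +73% power" comes from: dynamic power grows approximately with the cube of clock frequency (P ∝ f·V², with supply voltage scaled alongside frequency), while memory stalls let only part of a frequency bump show up as real performance. A hedged back-of-the-envelope sketch, assuming the cube-law approximation:

```python
# Cube-law approximation: dynamic power P ∝ f^3 when voltage scales with frequency.
freq_gain = 1.20               # push the clock 20% higher
power_gain = freq_gain ** 3    # ≈ 1.73, i.e. +73% power consumption
perf_gain = 1.13               # observed gain: memory stalls eat most of the bump

print(round(power_gain, 2))    # 1.73
```

This is why the industry moved to multicore: two slower cores deliver more throughput per watt than one core clocked into the steep part of the curve.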
A Typical HPC System
• Built from commodity servers
  – 1U or blade form factor
  – 1-10 management nodes
  – 1-10 login nodes
    • Program development, compilation
  – 10s of storage nodes
    • Hosting the parallel filesystem
  – 100s of compute nodes
    • 2-4 CPU sockets per node (4-24 cores), AMD Opteron or Intel Xeon
• Linux OS
• Connected with InfiniBand or Gigabit Ethernet
• Programs are written in C/C++ or Fortran and parallelized using the MPI (Message Passing Interface) API
The Exaflop system
• Target: 2015-2018
  – 10^18 (a million trillion) floating-point operations per second
  – Current system: 0.00165 Exaflops
• Expectations with current technology evolution
  – Power draw 100-300 MW
    • 15-40% of a nuclear reactor (Olkiluoto I)!
    • $1M/MW/year!
    • Need to bring it down to 30-50 MW
  – 500 000 - 5 000 000 processor cores
  – Memory: 30-100 PB
  – Storage: 1 Exabyte
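The power economics above are worth spelling out, using only the figures from the slide:

```python
# At ~$1M per MW per year, an exaflop machine drawing 100-300 MW
# would cost $100M-$300M per year in electricity alone.
cost_per_mw_year = 1_000_000          # USD/MW/year, figure from the slide
low_mw, high_mw = 100, 300            # projected power-draw range, MW

annual_cost_low = low_mw * cost_per_mw_year
annual_cost_high = high_mw * cost_per_mw_year
print(annual_cost_low, annual_cost_high)   # 100000000 300000000
```

Hitting the 30-50 MW target would bring the bill down to $30M-$50M per year, which is why power efficiency, not raw speed, is the central exascale design constraint.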
Programming Languages
• Current trend (C/C++/Fortran + MPI)
  – Difficult to write portable and efficient code
  – MPI is not fault tolerant by default (one task dies and the whole job crashes)
• PGAS languages to the rescue?
  – Partitioned Global Address Space
  – Looks like global shared memory
    • But possible to define task-local regions
    • Compiler generates communication code
  – Current standards
    • UPC - Unified Parallel C
    • CAF - Co-Array Fortran
  – Languages under development
    • Titanium, Fortress, X10, Chapel
What to do with an exaflop?
• Long-term climate-change modelling
• High-resolution weather forecasts
  – Prediction by city block
  – Extreme weather
• Large protein folding
  – Alzheimer's, cancer, Parkinson's etc.
• Simulation of a human brain
• Very realistic virtual environments
• Design of nanostructures
  – Carbon nanotubes, nanobots
• Beat a human pro player at 19x19 Go
Accelerators: GPGPU
• General Purpose Computing on Graphics Processing Units
• Nvidia Tesla/Fermi, ATI FireStream, IBM Cell, Intel Larrabee
• Advantages
  – High-volume production rates, low price
  – High memory bandwidth on the GPU (>100 GB/s vs. 10-30 GB/s for main RAM)
  – High flop rate, for certain applications
• Disadvantages
  – Low performance in double-precision (64-bit) computation
  – Getting data to the GPU memory is a bottleneck (8 GB/s PCI Express)
  – Vendors have different programming languages
    • Now: Nvidia CUDA, ATI Stream, Intel Ct, Cell etc.
    • Future: OpenCL on everything (hopefully!)
  – Does not work for all types of applications
    • Branching, random memory access, huge datasets etc.
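The PCI Express bottleneck in the list above is easy to quantify with the slide's own bandwidth figures (the 4 GB dataset size is a hypothetical example):

```python
# Moving a dataset to the GPU over an 8 GB/s link can dominate the time
# saved by the GPU's >100 GB/s on-board memory bandwidth.
dataset_gb = 4.0        # hypothetical working-set size
pcie_gb_s = 8.0         # PCI Express transfer rate (slide figure)
gpu_mem_gb_s = 100.0    # on-board GPU memory bandwidth (slide figure)

transfer_s = dataset_gb / pcie_gb_s       # time to ship the data over PCIe
one_pass_s = dataset_gb / gpu_mem_gb_s    # time for one pass over it on the GPU

print(transfer_s, one_pass_s)             # 0.5 vs 0.04 seconds
```

The one-way transfer costs as much as ~12 full passes over the data on the GPU, so a kernel must reuse each byte many times for the trip to pay off; streaming applications that touch data once gain little.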
Case: Nvidia Fermi
• Announced last month, available in 2010
• New HPC-oriented features
  – Error-correcting memory
  – High double-precision performance
• 512 compute cores, ~3 billion transistors
  – 750 GFlops (double precision)
  – 1.5 TFlops (single precision)
• 2011: a Fermi-based Cray supercomputer at Oak Ridge National Laboratory
  – "10 times faster than the current state of the art": ~20 Petaflops
Case: Intel Larrabee
• Intel's new GPU architecture, available in 2010
• Based on Pentium x86 processor cores
  – Initially tens of cores per GPU
  – Pentium cores with vector units
  – Compatible with x86 programs
• Cores connected with a ring bus
Accelerators: FPGA
• Field Programmable Gate Arrays
• Vendors: ClearSpeed, Mitrionics, Convey, Nallatech
• Chip with programmable logic units
  – Units connected with a programmable network
• Advantages
  – Very low power consumption
  – Arbitrary precision
  – Very efficient in search algorithms
  – Several in-socket implementations
    • FPGA sits directly in the CPU socket
• Disadvantages
  – Difficult to program
  – Limited number of logic blocks
Performance
[Bar chart: single- and double-precision GFlop/s, scale 0-4500, for Nvidia GeForce GTX 280, Nvidia Tesla C1060, Nvidia Tesla S1070, ATI Radeon 4870, ATI Radeon X2 4870, ATI FireStream 9250, ClearSpeed e710, ClearSpeed CATS 700, IBM PowerXCell 8i and AMD Opteron Barcelona]
Power Efficiency
[Bar chart: single- and double-precision GFlop/s per Watt, scale 0-9, for the same devices as the performance chart]
3D Integrated Circuits
• Wafers stacked on top of each other
• Layers connected with through-silicon vias
• Many benefits
  – High bandwidth and low latency
  – Saves space and power
  – Added freedom in circuit design
  – The stack may consist of different types of wafers
• Several challenges
  – Heat dissipation
  – Complex design and manufacturing
• HPC killer app: memory stacked on top of a CPU
Other Technologies To Watch
• SSD (Solid State Disk)
  – Fast transactions, low power, improving reliability
  – Fast checkpointing and restarting of programs
• Optics on silicon
  – Light paths both on a chip and on the PCB
• New memory technologies
  – Phase-change memory etc.
  – Low power, low latency, high bandwidth
• Green datacenter technologies
• DNA computing
• Quantum computing
Conclusions
• The difference between clusters and proprietary supercomputers is diminishing
• Accelerator technology is promising
  – Simple, vendor-independent programming models are needed
• Lots of programming challenges in parallelisation
  – Similar challenges in mainstream computing today
• Getting to an Exaflop will be very tough
  – Innovation needed in both software and hardware
Questions

From the Archives: Future of Supercomputing at Altparty 2009
