CSC – Tieteen tietotekniikan keskus Oy
CSC – IT Center for Science Ltd.
The Future of Supercomputing
Olli-Pekka Lehto
Systems Specialist
CSC – IT Center for Science
• Center for Scientific Computing
  – Offices located in Keilaniemi, Espoo
  – All shares owned by the Ministry of Education
  – Founded in 1970 as a technical support unit for the Univac 1108
• Provides a variety of services to the Finnish research community
  – High Performance Computing (HPC) resources
  – Consulting services for scientific computing
  – Scientific software development (Chipster, Elmer etc.)
  – IT infrastructure services
  – ISP services (FUNET)
CSC in numbers
• ~180 employees
• 3000 researchers use the computing capacity actively
  – Around 500 projects at any given time
• ~320 000 FUNET end-users in 85 organizations
Louhi.csc.fi
Model: Cray XT4 (single-socket nodes) + Cray XT5 (dual-socket nodes)
Processors: 10864 AMD Opteron 2.3 GHz cores
  (2716 quad-core processors; 1012 XT4 + 852 XT5 nodes)
Theoretical peak performance: >100 TeraFlop/s
  (= 2.3 * 10^9 Hz * 4 Flop/Hz * 10864)
Memory: ~10.3 TeraBytes
Interconnect network: Cray SeaStar2
  (3D torus: 6 * 5.6 GByte/s links per node)
Power consumption: 520.8 kW (high load), ~300 kW (nominal load)
Local filesystem: 67 TB Lustre filesystem
Operating system: SuSE Linux on service nodes, Cray Compute Node Linux on compute nodes
A "capability" system: few large (64-10000 core) jobs
Murska.csc.fi
Model: HP ProLiant blade cluster
Processors: 2176 AMD Opteron 2.6 GHz cores
  (1088 dual-core processors; 544 blade servers)
Theoretical peak performance: ~11.3 TeraFlop/s
  (= 2.6 * 10^9 Hz * 2 Flop/Hz * 2176)
Memory: ~5 TB
Interconnect network: Voltaire 4x DDR InfiniBand
  (16 Gbit/s fat-tree network)
Power consumption: ~75 kW (high load)
Local filesystem: 98 TB Lustre filesystem
Operating system: HP XC Cluster Suite (RHEL-based Linux)
A "capacity" system: many small (1-128 core) jobs
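The peak figures for Louhi and Murska above follow from cores × clock rate × floating-point operations per cycle. A minimal sketch of that arithmetic, assuming 4 flops per cycle for Louhi's quad-core Opterons (as on the slide) and 2 flops per cycle for Murska's older dual-core parts (an inference from its ~11.3 TFlop/s figure):

```python
# Theoretical peak = cores * clock (Hz) * floating-point ops per cycle.
def peak_tflops(cores, ghz, flops_per_cycle):
    return cores * ghz * 1e9 * flops_per_cycle / 1e12

# Louhi: 10864 cores at 2.3 GHz, 4 flops/cycle (quad-core Opteron)
louhi = peak_tflops(10864, 2.3, 4)

# Murska: 2176 cores at 2.6 GHz, 2 flops/cycle (dual-core Opteron, assumed)
murska = peak_tflops(2176, 2.6, 2)

print(round(louhi, 1), round(murska, 1))   # ~99.9 and ~11.3 TFlop/s
```

Note that peak numbers of this kind are an upper bound; real applications typically sustain only a fraction of them.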
Why use supercomputers?
• Constraints
  – Results are needed in a reasonable time
    • Impatient users
    • Time-critical problems (e.g. weather forecasting)
  – Large problem sizes
    • The problem does not fit into the memory of a single system
    • Many problem types require all the processing power close to each other
      – Distributed computing (BOINC etc.) works well only on certain problem types
Who uses HPC?
[Timeline diagram: application areas by sector, 1960s-2010s]
• Military: weapons modelling, signals intelligence, radar image processing, tactical simulation, strategic simulation ("Wargames")
• Scientific: nuclear physics, mathematics, quantum chemistry, fusion energy, nanotechnology, climate change, weather forecasting, genomics, aerodynamics, materials science, drug design, organ modelling
• Commercial: Electronic Design Automation (EDA), crash simulations, movie SFX, feature-length movies, search engines, oil reservoir discovery, stock market prediction, banking & insurance databases
State of HPC 2009
• Move towards commodity components
  – Clusters built from off-the-shelf servers
  – Linux
  – Open source tools (compilers, debuggers, cluster management, applications)
  – Standard x86 processors
• Price-performance efficient components
  – Low-latency, high-bandwidth interconnects
    • Standard PCI cards
    • InfiniBand, 10 Gigabit Ethernet, Myrinet
  – Parallel filesystems
    • Striped RAID (0) with fileservers
    • Lustre, GPFS, PVFS2 etc.
Modern HPC systems
Commodity clusters
• A large number of regular servers connected together
  – Usually a standard Linux OS
  – Possible to even mix and match components from different vendors
• May include some special components
  – High-performance interconnect network
  – Parallel filesystems
• Low-end and midrange systems
• Vendors: IBM, HP, Sun etc.
Proprietary supercomputers
• Designed from the ground up for HPC
  – Custom interconnect network
  – Customized OS & software
  – Vendor-specific components
• High-end supercomputers and special applications
• Examples: Cray XT series, IBM BlueGene
The Three Walls
There are three "walls" which CPU design is hitting now:
• Memory wall
  – Processor clock rates have grown faster than memory clock rates
• Power wall
  – Processors consume an increasing amount of power
  – The increase is non-linear
    • +13% performance = +73% power consumption
• Microarchitecture wall
  – Adding more complexity to the CPUs is not helping that much
    • Pipelining, branch prediction etc.
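One rough way to see where a figure like "+13% performance = +73% power" comes from: dynamic power grows approximately with the cube of clock frequency (P ∝ f·V², with supply voltage scaled alongside frequency), while memory stalls let only part of a frequency bump show up as real performance. A hedged back-of-the-envelope sketch, assuming the cube-law approximation:

```python
# Cube-law approximation: dynamic power P ∝ f^3 when voltage scales with frequency.
freq_gain = 1.20               # push the clock 20% higher
power_gain = freq_gain ** 3    # ≈ 1.73, i.e. +73% power consumption
perf_gain = 1.13               # observed gain: memory stalls eat most of the bump

print(round(power_gain, 2))    # 1.73
```

This is why the industry moved to multicore: two slower cores deliver more throughput per watt than one core clocked into the steep part of the curve.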
A Typical HPC System
• Built from commodity servers
  – 1U or blade form factor
  – 1-10 management nodes
  – 1-10 login nodes
    • Program development, compilation
  – 10s of storage nodes
    • Hosting the parallel filesystem
  – 100s of compute nodes
    • 2-4 CPU sockets per node (4-24 cores), AMD Opteron or Intel Xeon
• Linux OS
• Connected with InfiniBand or Gigabit Ethernet
• Programs are written in C/C++ or Fortran and parallelized using the MPI (Message Passing Interface) API
The Exaflop system
• Target: 2015-2018
  – 10^18 (a million trillion) floating-point operations per second
  – Current system: 0.00165 Exaflops
• Expectations with current technology evolution
  – Power draw 100-300 MW
    • 15-40% of a nuclear reactor (Olkiluoto I)!
    • $1M/MW/year!
    • Need to bring it down to 30-50 MW
  – 500 000 - 5 000 000 processor cores
  – Memory: 30-100 PB
  – Storage: 1 Exabyte
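The power economics above are worth spelling out, using only the figures from the slide:

```python
# At ~$1M per MW per year, an exaflop machine drawing 100-300 MW
# would cost $100M-$300M per year in electricity alone.
cost_per_mw_year = 1_000_000          # USD/MW/year, figure from the slide
low_mw, high_mw = 100, 300            # projected power-draw range, MW

annual_cost_low = low_mw * cost_per_mw_year
annual_cost_high = high_mw * cost_per_mw_year
print(annual_cost_low, annual_cost_high)   # 100000000 300000000
```

Hitting the 30-50 MW target would bring the bill down to $30M-$50M per year, which is why power efficiency, not raw speed, is the central exascale design constraint.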
Programming Languages
• Current trend (C/C++/Fortran + MPI)
  – Difficult to write portable and efficient code
  – MPI is not fault tolerant by default (one task dies and the whole job crashes)
• PGAS languages to the rescue?
  – Partitioned Global Address Space
  – Looks like global shared memory
    • But possible to define task-local regions
    • Compiler generates communication code
  – Current standards
    • UPC - Unified Parallel C
    • CAF - Co-Array Fortran
  – Languages under development
    • Titanium, Fortress, X10, Chapel
What to do with an exaflop?
• Long-term climate-change modelling
• High-resolution weather forecasts
  – Prediction by city block
  – Extreme weather
• Large protein folding
  – Alzheimer's, cancer, Parkinson's etc.
• Simulation of a human brain
• Very realistic virtual environments
• Design of nanostructures
  – Carbon nanotubes, nanobots
• Beat a human pro player at 19x19 Go
Accelerators: GPGPU
• General Purpose Computing on Graphics Processing Units
• Nvidia Tesla/Fermi, ATI FireStream, IBM Cell, Intel Larrabee
• Advantages
  – High-volume production rates, low price
  – High memory bandwidth on the GPU (>100 GB/s vs. 10-30 GB/s for main RAM)
  – High flop rate, for certain applications
• Disadvantages
  – Low performance in double-precision (64-bit) computation
  – Getting data to the GPU memory is a bottleneck (8 GB/s PCI Express)
  – Vendors have different programming languages
    • Now: Nvidia CUDA, ATI Stream, Intel Ct, Cell etc.
    • Future: OpenCL on everything (hopefully!)
  – Does not work for all types of applications
    • Branching, random memory access, huge datasets etc.
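The PCI Express bottleneck in the list above is easy to quantify with the slide's own bandwidth figures (the 4 GB dataset size is a hypothetical example):

```python
# Moving a dataset to the GPU over an 8 GB/s link can dominate the time
# saved by the GPU's >100 GB/s on-board memory bandwidth.
dataset_gb = 4.0        # hypothetical working-set size
pcie_gb_s = 8.0         # PCI Express transfer rate (slide figure)
gpu_mem_gb_s = 100.0    # on-board GPU memory bandwidth (slide figure)

transfer_s = dataset_gb / pcie_gb_s       # time to ship the data over PCIe
one_pass_s = dataset_gb / gpu_mem_gb_s    # time for one pass over it on the GPU

print(transfer_s, one_pass_s)             # 0.5 vs 0.04 seconds
```

The one-way transfer costs as much as ~12 full passes over the data on the GPU, so a kernel must reuse each byte many times for the trip to pay off; streaming applications that touch data once gain little.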
Case: Nvidia Fermi
• Announced last month, available in 2010
• New HPC-oriented features
  – Error-correcting memory
  – High double-precision performance
• 512 compute cores, ~3 billion transistors
  – 750 GFlops (double precision)
  – 1.5 TFlops (single precision)
• 2011: a Fermi-based Cray supercomputer at Oak Ridge National Laboratory
  – "10 times faster than the current state of the art": ~20 Petaflops
Case: Intel Larrabee
• Intel's new GPU architecture, available in 2010
• Based on Pentium x86 processor cores
  – Initially tens of cores per GPU
  – Pentium cores with vector units
  – Compatible with x86 programs
• Cores connected with a ring bus
Accelerators: FPGA
• Field Programmable Gate Arrays
• Vendors: ClearSpeed, Mitrionics, Convey, Nallatech
• Chip with programmable logic units
  – Units connected with a programmable network
• Advantages
  – Very low power consumption
  – Arbitrary precision
  – Very efficient in search algorithms
  – Several in-socket implementations
    • FPGA sits directly in the CPU socket
• Disadvantages
  – Difficult to program
  – Limited number of logic blocks
Performance
[Bar chart: single- and double-precision GFlop/s, scale 0-4500, for Nvidia GeForce GTX 280, Nvidia Tesla C1060, Nvidia Tesla S1070, ATI Radeon 4870, ATI Radeon X2 4870, ATI FireStream 9250, ClearSpeed e710, ClearSpeed CATS 700, IBM PowerXCell 8i and AMD Opteron Barcelona]
Power Efficiency
[Bar chart: single- and double-precision GFlop/s per Watt, scale 0-9, for the same devices as the performance chart]
3D Integrated Circuits
• Wafers stacked on top of each other
• Layers connected with through-silicon vias
• Many benefits
  – High bandwidth and low latency
  – Saves space and power
  – Added freedom in circuit design
  – The stack may consist of different types of wafers
• Several challenges
  – Heat dissipation
  – Complex design and manufacturing
• HPC killer app: memory stacked on top of a CPU
Other Technologies To Watch
• SSD (Solid State Disk)
  – Fast transactions, low power, improving reliability
  – Fast checkpointing and restarting of programs
• Optics on silicon
  – Light paths both on a chip and on the PCB
• New memory technologies
  – Phase-change memory etc.
  – Low power, low latency, high bandwidth
• Green datacenter technologies
• DNA computing
• Quantum computing
Conclusions
• The difference between clusters and proprietary supercomputers is diminishing
• Accelerator technology is promising
  – Simple, vendor-independent programming models are needed
• Lots of programming challenges in parallelisation
  – Similar challenges in mainstream computing today
• Getting to an Exaflop will be very tough
  – Innovation needed in both software and hardware
Questions

From the Archives: Future of Supercomputing at Altparty 2009
