High performance computing - building blocks, production & perspective
 

    Presentation Transcript

    • High Performance Computing - Building blocks, Production & Perspective Jason Shih Feb, 2012
    • What is HPC? HPC Definition: 14.9K hits from Google. "Uses supercomputers and computer clusters to solve advanced computation problems. Today, computer…" (Wikipedia). "Use of parallel processing for running advanced application programs efficiently, reliably and quickly. The term applies especially to systems that function above a teraflop or 10^12 floating-point operations per second. The term HPC is occasionally used as a synonym for supercomputing, although…" (Techtarget). "A branch of computer science that concentrates on developing supercomputers and software to run on supercomputers. A main area of this discipline is developing parallel processing algorithms and software." (Webopedia). And another "14.9K - 3" definitions…
    • So What is HPC Really? My understanding: no clear definition! At least O(2) times as powerful as a PC. Solving advanced computation problems? Online games!? HPC ~ Supercomputer? & Supercomputer ~ Σ Cluster(s). Possible components: CPU1, CPU2, CPU3… CPU"N"; ~O(1) tons of memory DIMMs; ~O(2) kW power consumption; O(1) - ~O(3) K-cores; ~1 system admin. Remember: "640K ought to be enough for anybody." Bill Gates, 1981
    • Why HPC? Possible scenario:header: budget_Unit = MUSDif{budget[Gov.] >= O(2) && budget[Com.] >= O(1)} else {Show_Off == “true”} else if{Possible_Run_on_PC == “false”} {Exec “HPC Implementation”}} Got to Wait another 6M ~ 1Yr…….  ruth is: T  Time consuming operations  Huge memory demanded tasks  Mission critical e.g. limit time duration  Large quantities of run (cores/CPUs etc.)  non-optimized programs…. 4
    • Why HPC? Story Cont'd: Grand Challenge Application Requirements. Compare with the '97 Top500 max & '05 Top500 min breaking 1 TFlops; LHC est. CPU/DISK ~143 MSI2k / 56.3 PB; ~100K cores. [Chart scale: TFlops, 10 PB, PFlops]
    • Who Needs HPC? HPC Domain Applications: Fluid Dynamics & Heat Transfer; Physics & Astrophysics; Nanoscience; Chemistry & Biochemistry; Biophysics & Bioinformatics; Geophysics & Earth Imaging; Medical Physics & Drug Discovery; Databases & Data Mining; Financial Modeling; Signal & Image Processing; and more…
    • HPC – Speed vs. Size. Size: can't fit on a PC – usually because they need more than a few GB of RAM, or more than a few 100 GB of disk. Speed: take a very, very long time to run on a PC: months or even years. But a problem that would take a month on a PC might take only a few hours on a supercomputer.
    • HPC ~ Supercomputer ~ Σ Cluster(s). What is a cluster? Again, 1.4K hits from Google… "A computer cluster is a group of linked computers, working together closely thus in many respects forming a single computer…" (Wikipedia). "Single logical unit consisting of multiple computers that are linked through a LAN. The networked computers essentially act as a single, much more powerful machine." (Techopedia). And… But the cluster is: CPU1, CPU2, CPU3… CPU"N"; ~O() kg of memory DIMMs; < O(1) kW power consumption; ~O(1) K-cores; still ~1 system admin.
    • HPC – Trend in Growth: Potential & Ease of Use. [Chart: GFlops per processor vs. number of processors (10 to 100K), for 1995, 2000, 2005 and 2010]
    • HPC Energy Projection – Strawman Project
    • HPC – Numbers Before 2000s  11
    • HPC – Annual Performance Distribution. Top500 projection in 2012: Cray Titan: est. 20 PF, transformed from Jaguar @ORNL; 1st phase: replace Cray XT5 w/ Cray XK6, Opteron CPUs & Tesla GPUs; 2nd phase: 18K additional Tesla GPUs. IBM Sequoia: est. 20 PF based on Blue Gene/Q @LLNL. ExaFlops of "world computing power" in 2016? [Chart: performance over time – 1 GFlops, 1 TFlops (8 years), 1 PFlops (< 6 years); 10.5 PFlops K computer, SPARC64 VIIIfx 2.0 GHz; 35.8 TF NEC Earth Simulator, Japan; PC: ~109 GFlops, Intel Core i7 980 XE]
    • HPC – Performance Trend “Microprocessor” 13
    • Trend – Transistors per Processor Chip 14
    • HPC History (I). [Chart, 1950–2010: performance from 1 KFlop/s (scalar) through 1 MFlop/s, 1 GFlop/s (superscalar, GFlops in 1987) and 1 TFlop/s (vector) to 1 PFlop/s (parallel, PFlops in 2008: IBM RoadRunner, Cray Jaguar); eras of bit-level, instruction-level and thread-level parallelism]
    • HPC History (II) – CPU: year, clock rate & instructions per sec.
    • HPC History (III) – Four Decades of Computing: Time Sharing Era
    • HPC – Cost of Computing (1960s ~ 2011). About 17 million IBM 1620 units costing $64,000 each; the 1620's multiplication operation takes 17.7 ms. Cray X-MP. Two 16-processor Beowulf clusters with Pentium Pro microprocessors. Bunyip Beowulf cluster: first sub-US$1/MFLOPS computing technology; it won the Gordon Bell Prize in 2000. KLAT2: first computing technology which scaled to large applications while staying under US$1/MFLOPS. KASY0: first sub-US$100/GFLOPS computing technology. Microwulf: as of August 2007, this 26.25 GFLOPS "personal" Beowulf cluster can be built for $1256. HPU4Science: $30,000 cluster built using only commercially available "gamer" grade hardware. [Chart: cost (USD) per unit of performance vs. year] Ref: http://en.wikipedia.org/wiki/FLOPS
    • HPC – Interconnect, Proc Type, Speed & Threads. Ref: "Introduction to the HPC Challenge Benchmark Suite" by Piotr Luszczek et al.
    • Power Processor Roadmap 20
    • IBM mainframe Intel mainframe 21
    • HPC – Computing System Evolution (I). 1940s (Beginning): ENIAC (Eckert & Mauchly, U. Penn); Von Neumann machine; Sperry Rand Corp.; IBM Corp.; vacuum tubes; thousands of instructions per second (0.002 MIPS). 1950s (Early Days): IBM 704, 709x; CDC 1604; transistor (Bell Labs, 1948); memory: drum/magnetic core (32K words); performance: 1 MIPS; separate I/O processor. [Photos: ENIAC, IBM 704]
    • HPC – Computing System Evolution (II). 1960s (System Concept): IBM Stretch machine (1st pipelined machine); IBM System/360 (Model 64, Model 91); CDC 6600; GE, UNIVAC, RCA, Honeywell & Burroughs etc.; integrated circuits / multi-layer printed circuit boards; memory: semiconductor (3 MB); cache (IBM 360 Model 85); performance: 10 MIPS (~1 MFLOPS). 1970s (Vector, Mini-Computer): IBM System 370/M195, 308x; CDC 7600, Cyber systems; DEC minicomputers; FPS (Floating Point Systems); Cray 1, X-MP; large-scale integrated circuits; performance: 100 MIPS (~10 MFLOPS); multiprogramming, time sharing; vector: pipelined data stream. [Photos: IBM S/360 Model 85, IBM S/370 Model 168]
    • HPC – Computing System Evolution (III). 1980s (RISC, Microprocessor): CDC Cyber 205; Cray 2, Y-MP; IBM 3090 VF; Japan Inc. (Fujitsu's VP, NEC's SX); Thinking Machines CM-2 (1st large-scale parallel); RISC systems (Apollo, Sun, SGI, etc.); Convex vector machine (mini Cray); microprocessor PCs (Apple, IBM); memory: 100 MB. RISC systems: pipelined instruction stream; multiple execution units per core. Vector: multiple vector pipelines. Thinking Machines: kernel-level parallelism. Performance: 100 MFLOPS. [Photos: Connection Machine CM-2, IBM 3090 Processor Complex]
    • HPC – Computing System Evolution (IV). 1990s (Cluster, Parallel Computing): IBM Power series (1, 2, 3); SGI NUMA systems; Cray CMP, T3E, Cray 3; CDC ETA; DEC's Alpha; Sun's internet machines; Intel Paragon; clusters of PCs; memory: 512 MB per processor; performance: 1 Teraflops; SMP nodes in cluster systems. 2000s (Large-Scale Parallel Systems): IBM Power series (4, 5), Blue Gene; HP's Superdome; Cray SV systems; Intel's Itanium, Xeon, Woodcrest, Westmere processors; memory: 1-8 GB per processor; performance: reaching 10 Teraflops. [Photos: IBM Power5 family, Power3, IBM Blue Gene]
    • HPC – Programming Language (I). Microcode, machine language. Assembly language (1950s): mnemonics based on the machine instruction set. Fortran (Formula Translation) (John Backus, 1956): IBM Fortran Mark I – IV (1950s, 1960s); IBM Fortran G, H, HX (1970), VS Fortran; CDC, DEC, Cray, etc. Fortran; industrial standardized Fortran 77 (1978); industrial standardized Fortran (88), 90, 95 (1991, 1996); HPF (High Performance Fortran) (late 1980s). Algol (Algorithm Language) (1958) (1960, Dijkstra et al.): based on the Backus-Naur Form method; considered the 1st block-structured language. COBOL (Common Business Oriented Language) (1960s). IBM PL/1, PL/2 (Programming Language) (mid 60s-70s): combined Fortran, COBOL & Algol; pointer functions; exception handling.
    • HPC – Programming Language (II). Applicative languages: IBM APL (A Programming Language) (1970s); LISP (List Processing Language) (1960s, MIT). BASIC (Beginner's All-Purpose Symbolic Instruction Code) (mid 1960s): 1st interactive language via an interpreter. Pascal (1975, Niklaus Wirth): derived from Wirth's Algol-W; well-designed programming language; call argument list by value. C & C++ (mid 1970s, Bell Labs): procedural languages. Ada (late 1980, U.S. DoD). Prolog (Programming Logic) (mid 1970s).
    • HPC – Computing Environment. Batch processing (before 1970). Multiprogramming, time sharing (1970). Remote Job Entry (RJE) (mid 1970s). Network computing: ARPAnet (mother of the INTERNET); IBM's VNET (mid 1970s). Establishment of community computing centers: 1st center: NCAR (1967); U.S. national supercomputer centers (1980). Parallel computing. Distributed computing: emergence of microprocessors; Grid computing (2000s); volunteer computing (@Home technology).
    • HPC – Computational Platform Pro & Con 29
    • HPC – Parallel Computing (I) Characteristics:  Asynchronous Operation (Hardware)  Multiple Execution Units/Pipelines (Hardware)  Instruction Level  Data Parallel  Kernel Level  Loop Parallel  Domain Decomposition  Functional Decomposition 30
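    As a concrete illustration of the loop-level data parallelism and simple 1-D domain decomposition listed above (an added sketch, not from the original slides; the build command is only an assumption):

        /* Minimal OpenMP sketch: iterations of each loop are split across
         * threads (a 1-D domain decomposition of the arrays).
         * Build with something like: cc -fopenmp dot.c */
        #include <stdio.h>
        #include <omp.h>

        #define N 1000000

        int main(void) {
            static double a[N], b[N];
            double sum = 0.0;

            /* data-parallel initialization */
            #pragma omp parallel for
            for (int i = 0; i < N; i++) {
                a[i] = i * 0.5;
                b[i] = i * 2.0;
            }

            /* loop parallelism with a reduction to combine partial sums */
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += a[i] * b[i];

            printf("dot product = %g (max threads: %d)\n", sum, omp_get_max_threads());
            return 0;
        }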
    • HPC – Parallel Computing (II). 1st attempt - ILLIAC, 64-way monster (mid 1970s). U.S. Navy's parallel weather forecast program (1970s). Early programming method - UNIX threads (late 1970s). 1st viable parallel processing - Cray's micro-tasking (80s). Many, many proposed methods in the 1980s: e.g. HPF. SGI's NUMA system - a very successful one (1990s). Oak Ridge NL's PVM and Europe's PARMACS (early 90s): programming models for distributed-memory systems. Adoption of MPI and OpenMP for parallel programming. MPI - the mainstream of parallel computing (late 1990s): well and clearly defined programming model; success of cluster computing systems; network/switch hardware performance; scalability; data decomposition allows running large programs. Mixed MPI/OpenMP parallel programming model for SMP-node cluster systems (2000); a sketch of that hybrid model follows below.
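    A minimal sketch of the mixed MPI/OpenMP model mentioned above (added illustration; file name and build line are assumptions): one MPI process per SMP node, OpenMP threads inside each process, message passing only between nodes.

        /* Hybrid MPI + OpenMP sketch. Build with e.g.: mpicc -fopenmp hybrid.c */
        #include <stdio.h>
        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char **argv) {
            int rank, size, provided;
            /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* OpenMP threads share the work assigned to this rank */
            double local = 0.0, global = 0.0;
            #pragma omp parallel for reduction(+:local)
            for (int i = rank; i < 1000000; i += size)
                local += 1.0 / (1.0 + i);        /* arbitrary per-rank work */

            /* message passing combines the per-node partial results */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("ranks=%d threads/rank=%d sum=%f\n",
                       size, omp_get_max_threads(), global);

            MPI_Finalize();
            return 0;
        }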
    • HPC Application Areas: Environmental Science / Disaster Mitigation; Defense; Engineering; Finance & Business; Science Research
    • HPC – Applications (I). Early days: ballistic tables; signal processing; cryptography; Von Neumann's weather simulation. 1950s-60s: operational weather forecasting; computational fluid dynamics (CFD, 2D problems); seismic processing and oil reservoir simulation; particle tracing; molecular dynamics simulation; CAD/CAE, circuit analysis. 1970s (emergence of "package" programs): structural analysis (FEM applications); spectral weather forecasting models; ab initio chemistry computation (material modeling at the quantum level); 3D computational fluid dynamics.
    • HPC – Applications (II). 1980s (wide spread of commercial/industrial usage): petroleum industry: Western Geo, etc.; computational chemistry: CHARMM, Amber, Gaussian, GAMESS, MOPAC, Crystal etc.; computational fluid dynamics: Fluent, ARC3D etc.; structural analysis: NASTRAN, Ansys, Abaqus, DYNA3D; physics: QCD; emergence of multi-discipline application programs. 1990s & 2000s: grand challenge problems; life science; large-scale parallel programs; coupling of computational models; data-intensive analysis/computation.
    • High Performance Computing – Cluster Inside & Insight Types of Cluster Architectures Multicores & Heterogeneous Architecture Cluster Overview & Bottleneck/Latency Global & Parallel Filesystem Application Development Step 35
    • Computer Architecture – Flynn's Taxonomy. SISD (single instruction & single data). SIMD (single instruction & multiple data). MISD (multiple instruction & single data). MIMD (multiple instruction & multiple data) > message passing; > shared memory: UMA/NUMA/COMA
    • Constraints on Computing Solutions – "Distributed Computing". Opposing forces: commodity budgets push toward lower-cost computing solutions, at the expense of operation cost; limitations on power & cooling are difficult to change on short time scales. Challenges: data distribution & data management; distributed computing model; fault tolerance, scalability & availability. [Spectrum: SMP – centralized – distributed]
    • HPC Cluster Architectures. [Diagrams: shared memory, distributed memory, hybrid and vector architectures]
    • HPC – Multi-core Architectural Spectrum: Heterogeneous Multi-core Platforms
    • NVIDIA Heterogeneous Architecture (GeForce)
    • Cluster – Commercial x86 Architecture Intel Core2 Quad, 2006 41
    • Cluster – Commercial x86 Architecture Intel Dunnington 7400-series  last CPU of the Penryn generation and Intels first multi- core die & features a single-die six- (or hexa-) core design with three unified 3 MB L2 caches 42
    • Cluster – Commercial x86 Architecture Intel Nehalem  Core i7 2009 Q1 Quadcores 43
    • Cluster – Commercial x86 Architecture Intel: ”Nehalem-Ex” (i7) 44
    • Cluster – Commercial x86 Architecture AMD Shanghai, 2007 45
    • Cluster Overview (I). System: security & account policy; system performance optimization; parallel computer arch.; mission: HT vs. HP abstraction layers; benchmarking: serial vs. parallel — NPB, HPL, BioPerf, HPCC & SPEC (2000 & 2006) etc.; memory/cache: STREAM, CacheBench & BYTEmark etc.; data: iozone, iometer, xdd, dd & bonnie++ etc.; network: NetPIPE, Netperf, Nettest, Netspec & iperf etc.; load generators: cpuburn, dbench, stress & contest etc.; resource mgmt: scheduling; account policy & mgmt. Hardware: regular maintenance: spare parts replacement; facility relocation & cabling.
    • Cluster Overview (II). Software: compilers, e.g. Intel, PGI, xl* (IBM); compilation, porting & debugging; addressing: 32 vs. 64-bit; various system architectures (IA64, RISC, SPARC etc.); scientific/numerical libraries: NetCDF, PETSc, GSL, CERNLIB (ROOT/PAW), GEANT etc.; LAPACK, BLAS, GotoBLAS, ScaLAPACK, FFTW, Linpack, HPC-Netlib etc.; end-user applications: VASP, Gaussian, Wien, Abinit, PWSCF, WRF, Comcot, Truchas, VORPAL etc. Others: documentation; functions: UG & AG; system design arch., account policy & mgmt. etc.; training.
    • Cluster I/O – Latency & Bottleneck. Modern CPUs achieve ~5 GFlops/core, at ~2 8-byte words per operation — on the order of 80 GB/sec of memory traffic demanded per core, against ~O(1) GB/core/sec of available bandwidth. Case: IBM P7 (755) (STREAM): Copy: 105418.0 MB/s; Scale: 104865.0 MB/s; Add: 121341.0 MB/s; Triad: 121360.0 MB/s. Latency: ~52.8 ns (@2.5 GHz, DDR3/1666). Even worse: initial fetching of data. Cf. cache: L1 @2.5 GHz: 3 cycles; L2 @2.5 GHz: 20 cycles.
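    The STREAM-style numbers above can be reproduced in spirit with a very small microbenchmark (added sketch, not the official STREAM code; array size and timing method are assumptions):

        /* Rough triad-style bandwidth probe: a[i] = b[i] + s*c[i],
         * reported in MB/s. Arrays are sized to overflow the caches. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (20 * 1000 * 1000)

        int main(void) {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            if (!a || !b || !c) return 1;
            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];        /* 3 words moved per iteration */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            double mbytes = 3.0 * N * sizeof(double) / 1e6;
            printf("triad: %.1f MB/s (check: %.1f)\n", mbytes / sec, a[N / 2]);
            free(a); free(b); free(c);
            return 0;
        }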
    • Memory Access vs. Clock Cycles. [Chart: data rate and performance, memory vs. CPU]
    • Cluster – Message Time Breakdown: source overhead + network time + destination overhead
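    A simple cost model consistent with that breakdown (a common textbook form, added here for clarity rather than taken from the slide):

        T_{\mathrm{msg}} \;\approx\; o_{\mathrm{src}} \;+\; L \;+\; \frac{n}{B} \;+\; o_{\mathrm{dst}}

    where o_src and o_dst are the sender/receiver software overheads, L the network latency, n the message size and B the link bandwidth; small messages are dominated by the overhead and latency terms, large messages by n/B.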
    • Cluster – MPI & Resource Mgr. MPI Processes Mgmt. w/o Resource Mgr. MPI Processes Mgmt. w/ Resource Mgr. 52Ref: HPC BAS4 UF.
    • Network Performance Throughput vs. Latency (I) Peak 10G 9.1Gbps ~ 877 usec (Msg Size: 1MB) IB QDR reach 31.1Gbps with same msg size  Only 29% of 10G Latency (~256 usec)  Peak IB QDR 34.8Gbps ~ 57 usec (Msg Size: 262KB)
    • Network Performance Throughput vs. Latency (II) GbE, 10G (FC), IB and IBoIP (DDR vs. QDR)
    • Network Performance Throughput vs. Latency (III). Interconnection: GbE, 10G (FC), IB and IPoIB (DDR vs. QDR). Max throughput does not reach 80% of IB DDR (~46%). Peak of DDR IPoIB ~76% of the IB peak (9.1 Gbps); over IP, QDR gives only 54%, while its max throughput reaches 85% (34.8 Gbps). No significant performance gain for IPoIB using RDMA (by preloading SDP). Possible performance degradation: existing activity over the IB edge switch at the chassis; midplane performance limitation. Reaching 85% on a clean IB QDR interconnection: redo the performance measurement on IB QDR.
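    Throughput/latency curves like these are typically produced with a ping-pong test; a minimal MPI version is sketched here (added illustration in the spirit of NetPIPE/OSU-style benchmarks, not their actual code):

        /* Ping-pong between two ranks: run with  mpirun -np 2 ./pingpong */
        #include <stdio.h>
        #include <stdlib.h>
        #include <mpi.h>

        int main(int argc, char **argv) {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int reps = 100;
            for (int n = 1; n <= (1 << 20); n *= 4) {      /* 1 B .. 1 MB */
                char *buf = malloc(n);
                MPI_Barrier(MPI_COMM_WORLD);
                double t0 = MPI_Wtime();
                for (int r = 0; r < reps; r++) {
                    if (rank == 0) {
                        MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                        MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    } else if (rank == 1) {
                        MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                        MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                    }
                }
                double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
                if (rank == 0)
                    printf("%8d bytes  %10.2f usec  %8.2f MB/s\n",
                           n, t * 1e6, n / t / 1e6);
                free(buf);
            }
            MPI_Finalize();
            return 0;
        }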
    • Cluster – File Server Performance Preload SDP provided by OFED Sockets Direct Protocol (SDP)  Note: Network protocol which provides an RDMA accelerated alternative to TCP over InfiniBand
    • Cluster – File Server IO Performance (I) Re-Write Performance Write Performance
    • Cluster – File Server IO Performance (II) Re-Read Performance Read Performance
    • Cluster I/O – Cluster filesystem options? (I) OCFS2 (Oracle Cluster File System)  Once proprietary, now GPL  Available in Linux vanilla kernel  not widely used outside the database world PVFS (Parallel Virtual File System)  Open source & easy to install  Userspace-only server  kernel module required only on clients  Optimized for MPI-IO  POSIX compatibility layer performance is sub-optimal pNFS (Parallel NFS)  Extension of NFSv4  Proprietary solutions available: “Panasas”  Put together benefits of parallel IO using standard solution (NFS) 59
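    PVFS's "optimized for MPI-IO" point above is worth a concrete picture; the MPI-IO write pattern that parallel filesystems of this class are built for looks roughly like this (added sketch; the file name is illustrative):

        /* Each rank writes its own contiguous block of one shared file.
         * The collective call lets the MPI-IO layer aggregate requests. */
        #include <stdlib.h>
        #include <mpi.h>

        int main(int argc, char **argv) {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int N = 1 << 20;                      /* doubles per rank */
            double *buf = malloc(N * sizeof *buf);
            for (int i = 0; i < N; i++) buf[i] = rank;

            MPI_File fh;
            MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
            MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
            MPI_File_write_at_all(fh, off, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
            MPI_File_close(&fh);

            free(buf);
            MPI_Finalize();
            return 0;
        }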
    • Cluster I/O – Cluster filesystem options? (II). GPFS (General Parallel File System): rock-solid w/ a 10-year history; available for AIX, Linux & Windows Server 2003; proprietary license; tightly integrated with IBM cluster management tools. Lustre: HA & LB implementation; highly scalable parallel filesystem: ~100K clients; performance: client ~1 GB/s & 1K metadata op/s, MDS 3K ~ 15K metadata op/s, OSS 500 MB/s ~ 2.5 GB/s; POSIX compatibility; components: single or dual Metadata Server (MDS) w/ attached Metadata Target (MDT) (dual if considering scalability & load balancing); multiple (up to ~O(3)) Object Storage Servers (OSS) w/ attached Object Storage Targets (OST).
    • Cluster I/O – Lustre Cluster Breakdown. [Diagram: Lustre cluster on an InfiniBand interconnect — load-balanced OSS nodes (OSS1, OSS2, …), high-availability MDS pair (master/standby), admin and login nodes, quad-CPU compute nodes, all with connectivity to all nodes; GigE Ethernet for boot and system control traffic; 10/100 Ethernet for out-of-band management (power on/off, etc.)]
    • Cluster I/O – Parallel Filesystem using Lustre. Typical setup: MDS: ~O(1) servers with good CPU and RAM, high seek rate; OSS: ~O(3) servers requiring good bus bandwidth and storage.
    • Cluster I/O – Lustre Performance (I). Interconnection: IPoIB, IB & Quadrics
    • Cluster I/O – Lustre Performance (II) Scalability  Throughput/Transactions vs. Num of OSS 64
    • Cluster I/O – Parallel Filesystem in HPC 65
    • Cluster – Consolidation & Pursuing High Density 66
    • Typical Blade System Connectivity Breakdown Fibre Channel Expansion Card (CFFv) Optical Pass-Through Module and MPO Cables BNT 1/10 Gb Uplink Ethernet Switch Module BladeServer Chassis – BCE
    • Hardware & system software features affecting scalability of parallel systems – a totally scalable architecture. [Diagram: reliability (hardware); scalable tools (software); machine size (processor performance, number of processors); input/output (bandwidth); memory size (virtual, physical); interconnect network (latency, bandwidth); memory type (distributed, shared); program env. (familiar programming paradigm, familiar interface); libraries; roles: user/developer, manager]
    • HPC – Demanded Features from Different Roles. Def. roles: Users, Developers, System Administrators.
      Features / Users / Developers / Managers:
      Familiar User Interface ✔ ✔ ✔
      Familiar Programming Paradigm ✔ ✔
      Commercially Supported Applications ✔ ✔
      Standards ✔ ✔
      Scalable Libraries ✔ ✔
      Development Tools ✔
      Management Tools ✔
      Total System Costs ✔
    • HPC – Application Development Steps. [Flow diagram: prepare serial application (SA, SPEC) → run SA (SPEC) → code optimization → parallel modification (Par Mod) → prepare and run parallel version]
    • HPC – Service Scopes. System architecture design: various interconnects, e.g. GbE, IB, FC etc.; mission specific, e.g. high performance or high throughput; computational or data intensive; OMP vs. MPI; parallel/global filesystem. Cluster implementation: objectives: high availability & fault tolerance, load balancing design & validation, distributed & parallel computing; deployment, configuration, cluster mgmt. & monitoring; service automation and event mgmt.; KB & helpdesk. Service level: helpdesk & onsite inspection; system reliability & availability; 1st/2nd line tech. support; automation & alarm handling; architecture & outreach?
    • High Performance Computing – Performance Tuning & Optimization Tuning Strategy Best Practices Profile & Bottleneck drilldown System Optimization Filesystem Improvement & (re-)Design 72
    • High Performance Computing Cluster Management & Administration Tools  Categories:  System: OS, network, backup, filesystem & virtualization.  Clustering: deployment, monitoring, management alarm/logging, dashboard & automation  Administration: UID, security, scheduling, accounting  Application: library, compiler, message-passing & domain-specific.  Development: debug, profile, toolkits, VC & PM.  Services: helpdesk, event, KB & FAQ. 73
    • Cluster Implementation (I). Operating system: candidates: CentOS, Scientific Linux, RedHat, Fedora etc. Cluster management: tools: OSCAR, Rocks, uBuntu, xCAT etc. Deployment & configuration: tools: cobbler, kickstart, puppet, cfengine, quattor (CERN), DRBL. Alarm, probes & automation: tools: nagios, IPMI, lm_sensors. System & service monitoring: tools: ganglia, openQRM. Network monitoring: tools: MRTG, RRD, smokeping, awstats, weathermap.
    • Cluster Implementation (II) Filesystem  Candidates: NFS, Lustre, openAFS, pNFS, GPFS etc. Performance Analysis & Profile:  Tools: gprof, pgroup(PGI), VTune(intel), tprof(IBM), TotalView etc. Compilers  Packages: Intel, PGI, MPI, Pathscale, Absoft, NAG, GNU, Cuda etc. Message Passing Libraries (parallel):  Packages: Intel MPI, OpenMPI, MPICH, MVAPICH, PVM(old), openMP(POSIX threads) etc. Memory Profile & Debug (Threads)  Tool: Valgrind, IDB, GNU(gdb) etc. Distributed computing  Toolkits: Condor, Globus, gLite(LCG) etc. 75
    • Cluster Implementation (III). Resource mgmt. & scheduling: tools: Torque, Maui, Moab, Condor, Slurm, SGE (Sun Grid Engine), NQS (old), LoadLeveler (IBM), LSF (Platform/IBM) etc. Dashboard: tools: openQRM, OpenNMS, Ahatsup, OpenView, Big Brother etc. Helpdesk & trouble tracking: tools: phpFAQ, OTRS, Request Tracker, osTicket, simpleTicket, eTicket etc. Logging & events: tools: elog, syslog-ng etc. Knowledge base: tools: vimwiki, MediaWiki, TWiki, phpFAQ, MoinMoin etc.
    • Cluster Implementation (IV) Security  Functionality: scanning, intrusion detection, & vulnerability  Tools: honeypot, snort, saint, snmp, nessus, rootkithunter & chkrootkit etc. Revision Services  Tools: git, cvs, svn etc. Collaborative Project Mgmt.  Tools: bugzilla, OTRS, projectHQ, Accounting:  Tools: SACCT, PACCT etc.  Visualization: RRD G/W, Google Chart Tool etc. 77
    • Cluster Implementation (IV) Backup Services  Tools: Tivoli(IBM), Bacula, rsync, VERITAS, TSM, Netvault, Amanda, etc. Remote Console  Tools: openNX (no machine), rdp compatible, Hummingbird (XDMCP), VNC, Xwin32, Cygwin, IPMI v2 etc. Cloud & Virtualization  Packages: openstack, opennebula, eucalyptus, CERNVM, Vmware, Xen, Citrix, VirtualBox etc. 78
    • High Performance Computing - How We Get to Today? Moore’s Law, Heat/Energy/Power Density Hardware Evolution Datacenter & Green HPC History Reminder: 1980s - 1st Gflops in single vector processor 1994 - 1st TFlop via thousands of microprocessors 2009 - 1st Pflop via several hundred thousand cores 79
    • Moore's Law & Power Density. Dynamic Pwr ∝ V²·f·C. 2X transistors/chip every 1.5 yr — Gordon Moore (co-founder of Intel) predicted it in 1965. Cubic effect if increasing frequency & supply voltage; Eff ∝ capacitance ∝ cores (linear). High-performance serial processors waste power — spend the extra transistors on more cores rather than on serial processors. [Chart: transistor count 1971–2011, up to 1 billion transistors, and MIPS (0.1, 1.0, 25, 7.5K–11K, 33K–38K MIPS) vs. date of production] Ref: http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
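    A short worked illustration of the cubic relation above (my numbers, added for clarity; it assumes supply voltage scales roughly with frequency):

        P_{\mathrm{dyn}} \propto C\,V^{2}f,\qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3}

        f \to 0.8f:\quad P_{\mathrm{dyn}} \to 0.8^{3}\,P_{\mathrm{dyn}} \approx 0.51\,P_{\mathrm{dyn}}

    So two cores clocked at 0.8f draw roughly the same dynamic power as one core at f (2 × 0.51 ≈ 1.02) while offering about 1.6× the throughput, which is the argument for spending transistors on cores rather than clock rate.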
    • Moore's Law – What do we learn? Transistors ∝ MIPS ∝ Watts ∝ BTUs. Rule of thumb: 1 watt of power consumed requires 3.413 BTU/hr of cooling to remove the associated heat. Inter-chip vs. intra-chip parallelism. Challenge: millions of concurrent threads. HP: data center power density went from 2.1 kW/rack in 1992 to 14 kW/rack in 2006. IDC: 3-year costs of power and cooling are roughly equal to the initial capital equipment cost of a data center. NETWORKWORLD: 63% of 369 IT professionals said that running out of space or power in their data centers had already occurred.
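    Applying that rule of thumb to the rack figure above (my arithmetic, added for illustration):

        14\,\mathrm{kW} \times 3.413\ \tfrac{\mathrm{BTU/hr}}{\mathrm{W}} \times 1000\ \tfrac{\mathrm{W}}{\mathrm{kW}} \approx 47{,}800\ \mathrm{BTU/hr} \approx 4\ \text{tons of refrigeration}

    (using 1 ton of refrigeration = 12,000 BTU/hr), i.e. a single 2006-era rack already needs several tons of dedicated cooling.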
    • HPC – Feature Size, Clock & Die Shrink. [Charts: historical data and ITRS roadmap — main ITRS node / feature size (nm) vs. year, and max clock rate vs. year]
    • Trend: Cores per Socket. Top500 Nov 2011: 45.8% & 32% of systems running 6-core & quad-core processors; 15.8% of systems >= 8 cores (2.4% with 16 cores), more than a 2-fold increase vs. Nov 2010 (6.8%). Trend: from quad-core (73% in '10) to 6-core (46% in '11).
    • HPC – Evolution of Processors. Transistors: Moore's Law. Clock rate is no longer a proxy for Moore's Law; cores may double instead. Power literally under control. [Chart: transistors and physical gate length] Ref: "Scaling to Petascale and Beyond: Performance Analysis and Optimization of Applications", NERSC.
    • HPC – Comprehensive Approach CPU Chips  Clock Frequency & Voltage Scaling  75% power savings at idle and 40-70% power savings for utilization in the 20-80% range Server  Chassis: 20-50% Pwr reduction.  Modular switches & routers  Server consolidation & virtualization Storage Devices  Max. TB/Watt & Disk Capacity  Large Scale Tiered Storage  Max. Pwr Eff by Min. Storage over-provisioning Cabling & Networking  Stackable & backplane capacity (inc. Pwr Eff)  Scaling & Density 85
    • HPC – Datacenter Power Projection. Case: ORNL/UTK, incl. DOE & NSF systems: deploying 2 large petascale systems in the next 5 years; current power consumption 4 MW; expected to grow to 15 MW before year end (2011) and 50 MW by 2012; cost estimates based on $0.07 per kWh.
    • HPC – Data Center Best Practices. Traditional approach: hot/cold aisles; minimize leakage; efficiency improvement (cooling & power); DC input (UPS opt.), cabling & containers; liquid cooling; free cooling; leveraging hydroelectric power. Ref: http://www.google.com/about/datacenters/ and http://www.google.com/about/datacenters/inside/efficiency/power-usage.html
    • HPC – Data Center Growing Power Density. Total system efficiency comprises three main elements: the grid, the data centre and the IT components. Each element has its own efficiency factor; multiplied together, for 100 watts of power generated the CPU receives only 12 watts. [Chart: heat load product footprint (W/ft²)] Ref: Internet2 P&C Nov 2011, "Managing Data Center Power & Cooling" by Force10
    • HPC - Performance Benchmarking CPU Arch., Scalability, SMT & Perf/Watt Case study: Intel vs. AMD 89
    • HPC – Performance Strategy: Amdahl's Law. Fixed-size model: Speedup = 1 / (s + p/N), where s is the serial fraction and p the parallel fraction (s + p = 1). Scaled-size model: Speedup = 1 / ((1-P) + P/N) → 1/(1-P) as N → ∞. Parallel & vector portions scale with problem size. s: Σ (I/O + serial bottleneck + vector startup + program loading). [Chart: speedup vs. number of processors]
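    A quick worked example of the fixed-size formula (my numbers, added for illustration): with a 95% parallel fraction on N = 64 processors,

        \mathrm{Speedup} = \frac{1}{(1-0.95) + 0.95/64} \approx \frac{1}{0.0648} \approx 15.4,
        \qquad \lim_{N\to\infty}\mathrm{Speedup} = \frac{1}{1-0.95} = 20

    so even a modest 5% serial part caps the achievable speedup at 20 no matter how many processors are added.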
    • Price-Performance for Transaction Processing. OLTP – one of the largest server markets is online transaction processing. TPC-C is the std. industry benchmark for OLTP; queries and updates rely on a database system. Significant factors of performance in TPC-C: reasonable approximation to a real OLTP app.; predictive of real system performance: total system performance, incl. the hardware, the operating system, the I/O system and the database system; complete instruction and timing info for benchmarking; TPM (measured transactions per minute) & price-performance in dollars per TPM.
    •  20 SPEC benchmarks  1.9 GHZ IBM Power5 processor vs. 3.8 GHz Intel Pentium 4  10 Integer @LHS & 10 floating point @RHS  Fallacy:  Processors with lower CPIs will always be faster.  Processors with faster clock rates will always be faster. 92
    •  Characteristics of 10 OLTP systems & TPC-C as the benchmark 93
    •  Cost of purchase split between processor, memory, storage, and software 94
    • Pentium 4 Microarchitecture & important characteristics of the recent Pentium 4 640 implementation in 90 nm technology (code named Prescott)
    • HPC – Performance Measurement (I). Objective: baseline performance; performance optimization; confident & verifiable measurement. Open standards: math kernels & applications; MIPS (million instructions per second) (MIPS Tech. Inc.); MFLOPS (million floating-point operations per second). Characteristics: peak vs. sustained; speed-up & computing efficiency (mainly for parallel); CPU time vs. elapsed time; program performance (HP) vs. system throughput (HT); performance per watt. Ref: http://www-03.ibm.com/systems/power/hardware/benchmarks/hpc.html and http://icl.cs.utk.edu/hpcc/
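    A bare-bones illustration of a sustained-MFLOPS measurement, and of why sustained differs from peak (added sketch; it is not one of the cited benchmark suites, and the matrix size is arbitrary):

        /* Times a dense matrix-vector multiply and reports sustained MFLOPS. */
        #include <stdio.h>
        #include <time.h>

        #define N 2000

        int main(void) {
            static double A[N][N], x[N], y[N];
            for (int i = 0; i < N; i++) {
                x[i] = 1.0;
                for (int j = 0; j < N; j++) A[i][j] = 0.001 * (i + j);
            }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++) s += A[i][j] * x[j];   /* 2 flops */
                y[i] = s;
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            printf("matvec: %.1f MFLOPS (y[0]=%.3f)\n", 2.0 * N * N / sec / 1e6, y[0]);
            return 0;
        }

    The measured (sustained) rate normally falls well below the processor's theoretical peak because the loop is memory-bound, which is exactly the peak-vs-sustained distinction listed above.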
    • HPC – Performance Measurement (II). Public benchmark utilities: LINPACK (Jack Dongarra, Oak Ridge N.L.): single precision/double precision; n=100 TPP, n=1000 ("paper & pencil" benchmark); HPL, n undefined (mainly for parallel systems). Synthetic: Dhrystone, Whetstone, Khornerstone. SPEC (Standard Performance Evaluation Corp.): SPECint (CINT2006), SPECfp (CFP2006), SPEComp; source code modification not allowed. Livermore Loops (introduction of MFLOPS). Los Alamos suite (vector computing). STREAM (memory performance). NPB (NASA Ames): NPB 1 and NPB 2 (classes A, B, C). Applications (weather/material/MD/statistics etc.): MM5, NAMD, ANSYS, WRF, VASP etc.
    • Target Processors (I) - AMD vs. Intel. AMD Magny-Cours Opteron (45 nm, rel. Mar 2010): Socket G34 multi-chip module; 2 x 4-core or 6-core dies connected with HT 3.1. 6172 (12 cores), 2.1 GHz: L2: 8 x 512K, L3: 2 x 6M; HT: 3.2 GHz; ACP/TDP: 80W/115W; streaming SIMD extensions: SSE, SSE2, SSE3 and SSE4a. 6128HE (8 cores), 2.0 GHz: L2: 8 x 512K, L3: 2 x 6M; HT: 3.2 GHz; ACP/TDP: 80W/115W; streaming SIMD extensions: SSE, SSE2, SSE3 and SSE4a.
    • Target Processors (II) - AMD vs. Intel. Intel Woodcrest, Harpertown and Westmere. Woodcrest (rel. Jun 2006): Xeon 5150, 2.66 GHz, LGA-771; L2: 4M; TDP: 65W; streaming SIMD extensions: SSE, SSE2, SSE3 and SSSE3. Harpertown, quad-core, 45 nm (rel. Nov 2007): E5430 2.66 GHz; L2: 2 x 6M; TDP: 80W; streaming SIMD extensions: SSE, SSE2, SSE3, SSSE3 and SSE4.1. Westmere-EP, 6 cores, 32 nm (rel. Mar 2010): X5650 2.67 GHz, LGA-1366; L2/L3: 6x256K/12MB; I/O bus: 2 x 6.4 GT/s QPI; streaming SIMD extensions: SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2.
    • SPEC2006 Performance Comparison - SMT Off, Turbo On. 8-core Nehalem-EP vs. 12-core Westmere-EP: 32% performance gain from increasing CPU cores by 50%; scalability 12% below ideal. SMT advantage: Nehalem-EP, 8 cores to 16 threads: +24.4%; Westmere-EP, 12 cores to 24 threads: +23.7%. Ref: CERN openlab Intel WEP evaluation report (2010)
    • Efficiency of Westmere-EP - Performance per Watt. Extrapolated from 12 GB to 24 GB (2 watts per additional GB of memory). Dual PSU (upper) vs. single PSU (lower). SMT offers a 21% boost in terms of efficiency; approx. 3% is consumed by SMT compared with the absolute performance gain (23.7%). Ref: CERN openlab Intel WEP evaluation report (2010)
    • Efficiency of Nehalem-EP Microarchitecture, with SMT off. Nehalem-EP L5520 vs. X5670: Westmere adds 10%; efficiency +9.75% using dual PSU, +23.4% using single PSU. Nehalem L5520 vs. Harpertown (E5410): +35% performance boost. Ref: CERN openlab Intel WEP evaluation report (2010)
    • Multi-Cores Performance Scaling - AMD Magny-Cours vs. Intel Westmere (I)
    • Multi-Cores Performance Scaling - AMD Magny-Cours vs. Intel Westmere (II)
    • Single Server Linpack Performance - Intel X5650, 2.67GHz 12G DDR3 (6 cores) HPL Optimal Performance ~108.7 GFlops per Node
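    For context on that per-node figure (my arithmetic, assuming a dual-socket X5650 node and 4 double-precision flops per cycle per core):

        R_{\mathrm{peak}} = 2\ \text{sockets} \times 6\ \text{cores} \times 2.67\ \mathrm{GHz} \times 4\ \tfrac{\mathrm{flops}}{\mathrm{cycle}} \approx 128\ \mathrm{GFlops}

        \frac{R_{\max}}{R_{\mathrm{peak}}} \approx \frac{108.7}{128} \approx 85\%

    i.e. the reported HPL result corresponds to roughly 85% of theoretical peak, which is typical of a well-tuned single node.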
    • Lesson from Top500 Statistics, Analysis & Future Trend Processor Tech. & Cores/socket Cluster Interconnect power consumption & Efficiency Regional performance & Trend 106
    • Top 500 – 2011 Nov.Rmax(GFlops) Cores
    • HPC – Performance of Countries Nov 2011 Top500 Performance of Countries 108
    • Top500 Analysis – Power Consumption & Efficiency. Top 4 in power efficiency: Blue Gene/Q (Nov 2011): Rochester > Thomas J. Watson > DOE/NNSA/LLNL, eff. 2026 GF/kW (BlueGene/Q, Power BQC 16C 1.60 GHz, custom interconnect). 11.87 MW: RIKEN Advanced Institute for Computational Science (AICS), K computer - SPARC64 VIIIfx 2.0 GHz, Tofu interconnect. 3.6 MW: Tianhe-1A, National Supercomputing Center in Tianjin. [Chart: 2008, 2009, 2010, 2011]
    • Top500 Analysis - Performance & Efficiency 20% of Top-performed clusters contribute 60% of Total Computing Power (27.98PF) 5 Clusters Eff. < 30
    • Top500 Analysis - HPC Cluster Performance 272 (52%) of world fastest clusters have efficiency lower than 80% (Rmax/Rpeak) Only 115 (18%) could drive over 90% of theoretical peak  Sampling from Top500 HPC cluster Trend of Cluster Efficiency 2005-2009
    • Top500 Analysis – HPC Cluster Interconnection SDR, DDR and QDR in Top500  Promising efficiency >= 80%  Majority of IB ready cluster adopt DDR (87%) (2009 Nov)  Contribute 44% of total computing power  ~28 Pflops  Avg efficiency ~78%
    • Impact Factor: Interconnectivity - Capacity & Cluster Efficiency. Over 52% of clusters are based on GbE, with efficiency around 50% only; InfiniBand is adopted by ~36% of HPC clusters.
    • Common Semantics. Programmer productivity; ease of deployment. HPC filesystems are more mature, with a wider feature set: high concurrent read and write; in the comfort zone of programmers (vs. cloud FS). Wide support, adoption and acceptance possible: pNFS working to be equivalent; reuse of standard data management tools: backup, disaster recovery and tiering.
    • IB Roadmap – Trend in HPC. [Chart: 50.9 TF → 10.5 PF → 74.2 PF]
    • Observation & Perspectives (I). Performance: pursuing another 1000X will be tough: ~20 PF Titan and Jaguar delivered in 2012; ExaFlops project ~2016 (PF in 2008); still, IB & GbE are the most used interconnect solutions; multi-cores continue Moore's Law: high-level parallelism & software readiness, reduced bus traffic & data locality. Storage is the fastest-growing product sector: storage consolidation intensifies competition; Lustre roadmap stabilized for HPC. Computing paradigm: complicated systems vs. sophisticated computing tools; hybrid computing model. Major concern: power efficiency: energy in memory & interconnect, incl. data-search applications; exploit memory power efficiency: large caches? Scalability and reliability. Performance key factor: data communication; consider layout, management & reuse.
    • Observation & Perspectives (II). Vendor support & user readiness: no Moore's Law for software, algorithms & applications? Service orientation: standardization & KB; automation & expert systems. Emerging new possibilities: cloud infrastructure & platform: currently 3% of spending (mostly private cloud); technology push & market/demand pull; growing opportunity of "Big Data": datacenter, SMB & HPC solution providers. Rapid growth of accelerators: tested by ~67% of users (20% in '10); NVIDIA possesses 90% of current usage ('11). "I think there is a world market for maybe five computers." Thomas Watson, chairman of IBM, 1943. "Computers in the future may weigh no more than 1.5 tons." Popular Mechanics, 1949
    • References Top500: http://top500.org Green Top500: http://www.green500.org HPC Advisory Council  http://www.hpcadvisorycouncil.com/subgroups.php HPC Inside  http://insidehpc.com/ HPC Wiki  http://en.wikipedia.org/wiki/High-performance_computing Supercomputing Conferences Series  http://www.supercomp.org/ Beowulf Cluster  http://www.beowulf.org/ MPI Forum:  http://www.mpi-forum.org/docs/docs.html 118
    • Reference - Mathematical & Numerical Lib. (I). Open source: LINPACK - numerical linear algebra intended for use on supercomputers; LAPACK - the successor to LINPACK (Netlib); PLAPACK - Parallel Linear Algebra Package; BLAS - basic linear algebra subprograms; GotoBLAS - optimal BLAS performance with new algorithms & memory techniques; ScaLAPACK - high-performance linear algebra routines for distributed-memory message-passing MIMD computers; FFTW - Fastest Fourier Transform in the West; HPC-Netlib - the high performance branch of Netlib; PETSc - portable, extensible toolkit for scientific computation; Numerical Recipes; GNU Scientific Library.
    • Reference - Mathematical & Numerical Lib. (II) Commercial  ESSL & pESSL (IBM/AIX) - Engineering & Scientific Subroutine Library  MASS (IBM/AIX) - Mathematical Acceleration Subsystem  Intel Math Kernel - vector, linear algebra, special tuned math kernels  NAG Numerical Libraries - Numerical Algorithms Group  IMSL - International Mathematical and Statistical Libraries  PV-WAVE - Workstation Analysis & Visualization Env.  JAMA - Java matrix package, developed by the MathWorks & NIST.  WSSMP - Watson Symmetric Sparse Matrix Package 120
    • Reference - Message Passing PVM (Parallel Virtual Machine, ORNL/CSM) OpenMPI MVAPICH & MVAPICH2 MPICH & MPICH2  v1 channels:  ch_p4 - based on older p4 project (Portable Programs for Parallel Processors), tcp/ip  ch_p4mpd - p4 with mpd daemons to starting and managing processes  ch_shmem - shared memory only channel  globus2 – Globus2  v2 channels:  Nemesis – Universal  inter-node modules: elan, GM, IB (infiniband), MX (myrinet express), NewMadeleine, tcp intra-node variants of shared memory for large messages (LMT interface).  ssm - Sockets and Shared Memory  shm - SHared memory  sock - tcp/ip sockets  sctp - experimental channel over SCTP sockets 121
    • Reference - Performance, Benchmark & Tools High performance tools & technologies:  https://computing.llnl.gov/tutorials/performance_tools/ HighPerformanceToolsTechnologiesLC.pdf Linux Benchmarking Suite:  http://lbs.sourceforge.net Linux Test Tools Matrix:  http://ltp.sourceforge.net/tooltable.php Network Performance  http://compnetworking.about.com/od/networkperformance/ TCPIP_Network_Performance_Benchmarks_and_Tools.htm  http://tldp.org/HOWTO/Benchmarking-HOWTO-3.html  http://bulk.fefe.de/scalability/  http://linuxperf.sourceforge.net 122
    • Reference - Network Security Network Security  Tools: http://sectools.org/ , http://www.yolinux.com/TUTORIALS/ LinuxSecurityTools.html & http://www.lids.org/ etc.  packet sniffer, wrapper, firewall, scanner, services (MTA/BIND) etc. Online Org.:  CERT http://www.us-cert.gov  SANS http://www.sans.org Linux Network Security  basic config/utility/profile, encryption & routing.  (obsolete: http://www.drolez.com/secu/) Network Security Toolkit Audit, Intrusion Detection & Prevention  Event Types:  DDoS, Scanning, Worms, Policy violation & unexpected app. services  Honeypots, Tripwire, Snort, Tiger, Nessus, Ethereal, nmap, tcpdump, portscan, portsentry, chkrootkit, rootkithunter, AIDE(HIDE), LIDS etc.  Ref: NIST “Guide to Intrusion Detection and Prevention Systems” 123
    • Reference - Book Computer Architecture: A Quantitative Approach  2nd Ed., by David A. Patterson, John L. Hennessy, David Goldberg Parallel Computer Architecture: A Hardware/Software Approach  by David Culler and J.P. Singh with Anoop Gupta High-performance Computer Architecture  3rd Ed., by Harold Stone High Performance Compilers for Parallel Computing  by Michael Wolfe (Addison Wesley, 1996) Advanced Computer Architectures: A Design Space Approach  by Terence Fountain, Peter Kacsuk, Dezso Sima Introduction to Parallel Computing: Design and Analysis of Parallel Algorithms   by Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis Parallel Computing Works!   by Geoffrey C. Fox, Roy D. Williams, Paul C. Messina The Interaction of Compilation Technology and Computer Architecture   by David J. Lilja, Peter L. Bird (Editor) 124
    • National Laboratory Computing Facilities (I). ANL, Argonne National Laboratory: http://www.lcrc.anl.gov/ . ASC, Alabama Supercomputer Center: http://www.asc.edu/supercomputing/ . BNL, Brookhaven National Laboratory, Computational Science Center: http://www.bnl.gov/csc/ . CACR, Center for Advanced Computing Research: http://www.cacr.caltech.edu/main/ . CAPP, Center for Applied Parallel Processing: http://www.ceng.metu.edu.tr/courses/ceng577/announces/supercomputingfacilities.htm . CHPC, Center for High Performance Computing, University of Utah: http://www.chpc.utah.edu/
    • National Laboratory Computing Facilities (II) CRPC, Center For Research on Parallel Computation  http://www.crpc.rice.edu/ LANL, Los Alamos National Lab  http://www.lanl.gov/roadrunner/ LBL, Lawrence Berkeley National Lab  http://crd.lbl.gov/ LLNL, Lawrence Livermore National Lab  https://computing.llnl.gov/ MHPCC, Maui High Performance Computing Center  http://www.mhpcc.edu/ NCAR, National Center for Atmospheric Research  http://ncar.ucar.edu/ NCCS, National Center for Computational Science  http://www.nccs.gov/computing-resources/systems-status 126
    • National Laboratory Computing Facilities (III) NCSA, National Center for Supercomputing Application  http://www.ncsa.illinois.edu/ NERSC, National Energy Research Scientific Computing Center  http://www.nersc.gov/home-2/ NSCEE, National Supercomputing Center for Energy and the Environment  http://www.nscee.edu/ NWSC, NCAR-Wyoming Supercomputing Center  http://nwsc.ucar.edu/ ORNL, Oak Ridge National Lab  http://www.ornl.gov/ornlhome/high_performance_computing.shtml OSC, Ohio Supercomputer Center  http://www.osc.edu/ 127
    • National Laboratory Computing Facilities (IV) PSC, Pittsburgh Supercomputing Center  http://www.psc.edu/ SANDIA, Sandia National Laboratories  http://www.cs.sandia.gov/ SCRI, Supercomputer Computations Research Institute  http://www.sc.fsu.edu/ SDSC, San Diego Supercomputing Center  http://www.sdsc.edu/services/hpc.html ARSC, Arctic Region Supercomputing Center  http://nwsc.ucar.edu/ NASA, National Aeronautics and Space Admin  http://www.nas.nasa.gov/ 128