NWU and HPC
Presentations on NWU's HPC strategy

  • In summary, we determined that the following would need to be addressed for any HPC to be successful.
  • In the beginning there was only one big shark: the Mainframe. The next era of supercomputing came with the introduction of the vector supercomputer, the likes of Cray. The next step was compacting into the Mini Computer. All the previous approaches were based on SMP, closely coupled in one box. And then came the modest Personal Computer: not very strong on its own, but connect a lot of them together and you make one Big Fish. So we suited up, got our best fishing rods and decided to go fishing... for one of these new Big Fish, becoming part of the New World Order.
  • At the previous HPC conference we came, we saw, and we determined that, as Institutional IT, the time was right. The University wanted to become a major player in the New World Order. This would not be the first try at this: in 1991 we implemented the SP, but the time was not right (see the earlier part of the presentation). In the meantime the University also ventured into clustering, with three departmental clusters (FSK, Chemistry, BWI). So what do we want to do technically that would be different? We want to implement the H in HPC: a >1 TFlop configuration, using the Beowulf approach: open source, commodity off-the-shelf hardware.
  • So what is a Beowulf cluster?
  • What did the first Beowulf cluster look like? Note the time it took to assemble the cluster: 8 months. Taking Moore's law into account, this materially influenced the effective production life of the cluster.
  • The light dotted lines show the originator of each piece of software. The issue for us is the choice of cluster software so as to allow integration into grids. The major issue is at the scheduler level: making the HPC appear as a CE (Computing Element) in the grid.
  • Concept framework (source: Cluster Resources, Inc.). This shows what we decided, representing the previous slides in a layered approach similar to the ISO layers. We started with the hardware, then the OS, the resource manager, the cluster scheduler, and finally the grid workload manager.
  • Based on the Barcelona picture we put in a requisition for a new building to house the new NWU HPC... but we are still waiting. OK, the real reason for showing #13: when the slides were set up, Barcelona was #5; it dropped to #13 in less than 6 months. We need a strategy that is sustainable, with a fast upgrade path.
  • We started looking around to determine what the major issues with HPC are, and found that reliability and availability are major factors.
  • In summary, we determined that the following would need to be addressed for any HPC to be successful.
  • The first strategy that we will use to extend the capacity and lifecycle of HPC technology will be to: utilize the differing characteristics of the data center vs. the HPC; implement new high-performance CPUs in the HPC and migrate the older technology to the data center; as a first phase, do manual hardware load management by swapping blades between the HPC and the data center to match peak demands; and in the long run extend the concept to do this dynamically at the Resource Manager level (also referred to as utility computing). We needed a strategy to make the HPC cost effective.
  • So we looked at the technologies that we were already using in the data center. Why start here? Cost effectiveness: training people on new technology that is only used in the HPC would reduce cost effectiveness. Take note: modular, fast extension with less work.
  • This shows what the NWU HPC configuration looks like.
  • What are the specs? 256 cores.
  • Addressing the Reliability and Availability
  • An institutional facility: how do we link this? The limitation is still speed. Bringing on SANReN.
  • Monday, 31 March 2008: "The four sites are the main campuses of Wits, UJ, and two of UJ's satellite campuses, Bunting and Doornfontein," says Christiaan Kuun, SANReN Project Manager at the Meraka Institute.
  • How will SANReN be used for the National Grid? But what about the International Grid? -> SEACOM.
  • SEACOM PROJECT UPDATE - 14 Aug 2008: Construction on schedule, with major ground and sea-based activities proceeding over the next eight months.

    14 August 2008 - The construction of SEACOM's 15,000 km fibre optic undersea cable, linking southern and east Africa, Europe and south Asia, is on schedule and set to go live as planned in June 2009. Some 10,000 km of cable has been manufactured to date at locations in the USA and Japan, and Tyco Telecommunications (US) Inc., the project contractors, will begin shipping terrestrial equipment this month, with the cable expected to be loaded on the first ship in September 2008.

    Laying of shore-end cables for each landing station will also proceed from September. This process will comprise the cable portions at shallow depths ranging from 15 to 50 m where large vessels are not able to operate. From October 2008, the first of three Reliance Class vessels will start laying the actual cable. The final splicing, which involves connecting all cable sections together, will happen in April 2009, allowing enough time for testing of the system before the commercial launch in June 2009.

    The final steps of the Environmental Social Impact Assessment (ESIA) process are well advanced, and all small archeological, marine and ecological studies, which required scuba diving analysis, have been completed, as well as social consultations with the affected parties.

    The cable, including the repeaters necessary to amplify the signal, will be stored in large tanks onboard the ships. The branching units necessary to divert the cable to the planned landing stations will be connected into the cable path on the ship just prior to deployment into the sea. The cable will then be buried under the ocean bed with the help of a plow along the best possible route demarcated through the marine survey. The connectivity from Egypt to Marseille, France, will be provided through Telecom Egypt's TE-North fibre pairs that SEACOM has purchased on the system. TE-North is a new cable currently being laid across the Mediterranean Sea.

    Brian Herlihy, SEACOM President, said: "We are very happy with the progress made over the past five months. Our manufacturing and deployment schedule is on target and we are confident that we will meet our delivery promises in what is today an incredibly tight market underpinned by sky-rocketing demand for new cables resulting in worldwide delivery delays. The recently announced executive appointments, combined with the project management capabilities already existent within SEACOM, position us as a fully fledged telecoms player. We are able to meet the African market's urgent requirements for cheap and readily available bandwidth within less than a year."

    The cable will go into service long before the 2010 FIFA World Cup kicks off in South Africa, and SEACOM has already been working with key broadcasters to meet their broadband requirements. The team is also trying to expedite the construction in an attempt to assist with the broadcasting requirements of the FIFA Confederations Cup scheduled for June 2009.

    SEACOM, which is privately funded and over three-quarters African owned, will assist communication carriers in south and east Africa through the sale of wholesale international capacity to global networks via India and Europe. The undersea fibre optic cable system will provide African retail carriers with equal and open access to inexpensive bandwidth, removing the international infrastructure bottleneck and supporting east and southern African economic growth. SEACOM will be the first cable to provide broadband to countries in east Africa which, at the moment, rely entirely on expensive satellite connections.
  • The result of SEACOM and SANREN…
  • The timeline vision, in terms of a production-quality National & International Grid.
  • In summary: NWU's HPC will consist of ...

NWU and HPC: Presentation Transcript

  • Attie Juyn & Wilhelm van Belkum: High Performance Computing & GRID Computing
  • Agenda
    • The birth of a HPC…
    • Part A: management perspective
    • Part B: technical perspective
  • Background
    • Various departmental compute clusters
    • A flagship project at the CHPC
    • Fragmented resources and effort
    • At last year’s conference, our vision was ….
  • To establish an Institutional HPC
    • Level 1: (Entry Level) Personal workstation
    • Level 2: Departmental Compute Cluster
    • Level 3: Institutional HPC
    • Level 4: Nat./Int. HPC
  • University Strategy
    • Increased focus on research
    • To develop into a balanced teaching-learning & research university
    • As a result of merger, a central IT department
  • The Challenge: to innovate
    • Sustainability
    • HPC must be a service, not a project or experiment
    • Funding model must enable constant renewal
    • Support model with clear responsibilities
    • Reliability
        • Redundant design principles (DR capability)
        • 24x7x365 (not 99.99%)
    • Availability
        • Standardised user interface (not root)
        • Equally accessible on all campuses
    • Efficiency
        • Power, cooling, etc.
  • HPC (IT) success criteria
    • Sustainability
    • Efficiency
    • Reliability
    • Availability
    • = key issues of this decade
    & Performance
    (diagram: NWU HPC management strategy; NWU HPC design)
  • Enabling factors
    • A spirit of co-operation
    • Key researchers & IT agreeing on what should be done
    • A professional, experienced IT team
    • Supporting ±200 servers in 4 distributed data centers
    • A well managed, state-of-the-art infrastructure
      • Resulting from the merger period
    • Management trust & commitment
    • International support & connections
      • Networks, grids, robust & open software
  • Project milestones
    • March 2007: first discussions & documentation of vision
    • April 2007: budget compilation and submission
    • 27 November 2007: Project and budget approved
    • December 2007: CHPC Conference, tested our vision
    • 17 March 2008: Dr Bruce Becker visits Potchefstroom
    • (first discussions of gLite, international & SA grids)
    • 18 March 2008: Grid concept presented to IT Directors
    • May 2008: established POC cluster, testing software
    • June-October: recruitment & training of staff
    • July 2008: Grid Conference at UCT & SA Grid Initiation
    • August – September 2008: detailed planning & testing
    • October 2008: tenders & ordered equipment
    • Nov. 2008 - Jan. 2009: implementation
  • Management principles
    • A dedicated research facility
    • (not for general computing)
    • To serve researchers in approved research programmes of all three campuses
    • Implemented, maintained and supported by Institutional IT
    • (IT should do the IT)
    • Configured to international standards & best practice
    • (to be shown later)
    • Parallel applications only
    • Usage governed by an institutional and representative governance body
    • Sustainability subject to acceptable ROI
    • (to justify future budgets)
  • The New World Order (Source: 2006 UC Regents): Mainframe -> Vector Supercomputer -> Mini Computer -> PC -> Cluster & Grids
  • Technical goals: build an Institutional High Performance Computing facility, based on Beowulf cluster principles, coexisting with and linking the existing departmental clusters and the National and International computational Grids.
  • Beowulf cluster
    • The term "Beowulf cluster" refers to a cluster of workstations (usually Intel architecture, but not necessarily) running some flavor of Linux that is utilized as a parallel computation resource.
    • The main idea is to use commodity, off-the-shelf computing components with Open Source software to create a networked cluster of workstations (a minimal example follows below).
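    A minimal sketch (not from the slides) of the kind of parallel program such a cluster exists to run, written against the open-source MPI libraries that appear later in the stack slide (MPICH, LAM); the hostname-reporting logic is purely illustrative:

        #include <mpi.h>    /* message-passing library used on Beowulf clusters */
        #include <stdio.h>

        int main(int argc, char *argv[])
        {
            int rank, size, len;
            char node[MPI_MAX_PROCESSOR_NAME];

            MPI_Init(&argc, &argv);                  /* start the MPI runtime */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id */
            MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
            MPI_Get_processor_name(node, &len);      /* which cluster node we landed on */

            printf("Process %d of %d running on %s\n", rank, size, node);

            MPI_Finalize();                          /* shut down the MPI runtime */
            return 0;
        }

    On a Beowulf-style cluster this would typically be compiled with mpicc and launched with something like mpirun -np 16 ./hello, with the resource manager (e.g. Torque, named later in the deck) deciding which commodity nodes the processes run on.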
  • History of Clusters - The first Beowulf
    • 07/2002 – Design system
    • 08/2002 to 11/2002 – Build system
    • 03/2003 – System in Production
    • 7-8 Months for Concept to Production
    • Moore's Law: 18 months
      • -> Half-life of performance and cost
      • -> Useful life 3-4 years (rough working below)
    Source: 2006 UC Regents
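    A rough working of that point, assuming only the 18-month doubling period quoted above (an illustration, not a figure from the slides). If the relative performance value of fixed hardware is modelled as

        \[ V(t) = 2^{-t/18\,\text{months}} \]

    then \( V(8\,\text{months}) \approx 0.73 \): an 8-month concept-to-production phase already consumes almost half of a performance half-life, and after the 3-4 year useful life \( V \) has fallen to roughly 0.16-0.25 of the original value.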
  • The Evolved Cluster (diagram, source: Cluster Resources, Inc.): Compute Nodes, User and Admin roles, Job Queue, Resource Manager, Scheduler, License Manager, Identity Manager, Allocation Manager, Myrinet interconnect, and a linked Departmental Cluster with its own Resource Manager and Scheduler.
  • Cluster and Grid software landscape
  • Grid/Cluster Stack or Framework (layered diagram):
    • Hardware (Cluster or SMP)
    • Operating System: CentOS, Scientific Linux, RedHat, Solaris, AIX, UNICOS, Windows, Mac OS X, HP UX, Other
    • Resource Manager (with parallel/serial applications via MPI, PVM, LAM, MPICH): Torque, Rocks, Oscar
    • Cluster Workload Manager (Scheduler, Policy Manager, Integration Platform): Load Leveler, PBSpro, PBS, SGE, Condor(G), LSF, SLURM, Nimrod, MOAB, MAUI
    • Grid Workload Manager (Scheduler, Policy Manager, Integration Platform): GLOBUS, CROWNGrid, gLite, UNICORE; grids: EGEE, Chinese, USA, EU
    • Security; Portal / CLI / GUI access for application Users and Admins
  • Departmental Computer Cluster
  • CHPC (May 2007): "iQudu" (isiXhosa name for Kudu), "Tshepe" (Sesotho name for Springbok) and Impala
    • 160-node Linux cluster
    • Each node with 2 x dual-core AMD Opteron 2.6GHz Rev F processors and 16GB of random access memory
    • InfiniBand 10 Gb/s cluster interconnect
    • 50TB SAN
    • 640 processing cores (2.5 TFlops)
    • 2x IBM p690 with 32 x 1.9GHz Power4+ CPUs
    • 32GB of RAM each
  • The #1 and #13 in the world (2007); the #4 and #40 in the world (2008):
    • BlueGene/L - eServer Blue Gene Solution (IBM, 212,992 Power cores), DOE/NNSA/LLNL, USA: 478.2 trillion floating-point operations per second (teraFLOPS) on LINPACK
    • MareNostrum - BladeCenter JS21 Cluster, PPC 970 2.3 GHz, Myrinet (IBM, 10,240 Power cores), Barcelona Supercomputer Centre, Spain: 63.83 teraFLOPS
  • As of November 2008, #1: Roadrunner - BladeCenter QS22/LS21 Cluster, 12,240 x PowerXCell 8i 3.2 GHz + 6,562 dual-core Opteron 1.8 GHz, DOE/NNSA/LANL, United States: 1.105 PetaFlops
  • Reliability & Availability of HPC
  • HPC (IT) success criteria
    • Sustainability
    • Efficiency
    • Reliability
    • Availability
    • = key issues of this decade
    & Performance
    (diagram: NWU HPC management strategy; NWU HPC design)
  • Introducing Utility Computing (diagram): swapping & migration of hardware between the Data Center RM and the HPC RM (first phase); dynamic load shifting at Resource Manager (RM) level (second phase); coordinated by a Grid Workload Manager (Condor, MOAB).
  • Grid/Cluster Stack or Framework (the same layered diagram shown earlier): Hardware -> Operating System -> Resource Manager -> Cluster Workload Manager -> Grid Workload Manager, with Security and Portal/CLI/GUI access for application Users and Admins.
  • HP blade building blocks:
    • HP BL460c: 8 x 3GHz Xeon, 12MB L2, 1333MHz FSB, 10GB memory (96 GFlops)
    • HP BL2x220c: 16 x 3GHz Xeon (192 GFlops)
    • HP C7000 enclosure: up to 16 BL460c (1.536 TFlops) or up to 16 BL2x220c (3.072 TFlops)
    • HP Modular Cooling System G2: up to 4 HP C7000; 512 CPU cores / 5.12 TFlops with BL460c, or 1024 CPU cores / 12.288 TFlops with BL2x220c
    • HP BLc Virtual Connect Ethernet
    • D-Link xStack DSN-3200: 10.5TB RAID5, 80,000 I/O per second
  • HP ProLiant BL460c
    • Processor: up to two Dual- or Quad-Core Intel Xeon processors
    • Memory: FBDIMM 667MHz, 8 DIMM slots, 32GB max
    • Internal Storage: 2 hot-plug SFF SAS HDDs, standard RAID 0/1 controller with optional BBWC
    • Networking: 2 integrated Multifunction Gigabit NICs
    • Mezzanine Slots: 2 mezzanine expansion slots
    • Management: Integrated Lights-Out 2 Standard Blade Edition
  • BL460c Internal View
    • Embedded Smart Array controller integrated on the drive backplane
    • 8 Fully Buffered DIMM slots, DDR2 667MHz
    • Two mezzanine slots: one x4, one x8
    • Two hot-plug SAS/SATA drive bays
    • Mezzanine options: QLogic QMH2462 2-port 4Gb FC HBA; NC512m 2-port 10GbE-KX4 (NetXen); Mellanox 2-port 4X DDR (20Gb) InfiniBand
  • HP ProLiant BL2x220c G5
    • Density: 32 server blades in a 10U enclosure, 16 server blades in a 6U enclosure (2 blades per half-height enclosure bay)
    • Processor: up to two Dual- or Quad-Core Intel® Xeon® processors per board
    • Memory: Registered DDR2 (533/667 MHz), 4 DIMM sockets per board, 16GB max (with 4GB DIMMs)
    • Internal Storage: 1 non hot-plug SFF SATA HDD per board
    • Networking: 2 integrated Gigabit NICs per board
    • Mezzanine Slots: 1 PCIe mezzanine expansion slot (x8, Type I) per board
    • Management: Integrated Lights-Out 2 Standard Blade Edition
  • HP ProLiant BL2x220c G5 Internal View
    • Two mezzanine slots, both x8 (both reside on the bottom board)
    • 2 x optional SATA HDDs
    • Top and bottom PCA, side by side: 2 x 2 CPUs, 2 x 4 DIMM slots (DDR2 533/667MHz)
    • 2 x embedded 1Gb Ethernet dual-port NICs, server board connectors
  • HP C7000 enclosure (10U): up to 16 half-height blade servers per enclosure
  • Max. capacity with the HP Modular Cooling System G2 (servers and other racked equipment): up to 4 HP C7000 = 1024 CPU cores, 12.288 TFlops (working shown below)
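    For reference, the peak figures quoted for these building blocks follow from the same arithmetic used elsewhere in the deck (assuming 4 floating-point operations per clock cycle per core, which is consistent with the quoted numbers):

        \[ 8 \times 3\,\text{GHz} \times 4 = 96\ \text{GFlops (BL460c)}, \qquad 16 \times 3\,\text{GHz} \times 4 = 192\ \text{GFlops (BL2x220c)}, \]
        \[ 1024 \times 3\,\text{GHz} \times 4 = 12.288\ \text{TFlops (4 fully populated C7000 enclosures)} \]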
  • NWU HPC Hardware Spec.
    • 16 Dual Quad-Core Intel Xeon E5450
      • 3GHz CPU, 12MB L2, 1333MHz FSB, 80W power
      • 16 x HP BL460c
      • 10GB memory
      • HP c7000 enclosure
      • HP Modular Cooling System G2 (MCS G2)
      • D-Link iSCSI DSN-3200 (20TB disk)
    • 16 Dual Quad-Core Intel Xeon E5450
      • 3GHz CPU, 12MB L2, 1333MHz FSB, 80W power
      • 8 x HP BL2x220c
      • 10GB memory
      • HP c7000
      • HP Modular Cooling System G2 (MCS G2)
      • D-Link iSCSI DSN-3200 (20TB disk)
    • 32 * 8 * 3GHz * 4 = 3.072 TFlops (256 cores)
    • 32 * 10 GByte = 320 GB memory
    • 2 * 10 TByte storage
    • Gigabit Ethernet interconnect: 42.23 microseconds latency (InfiniBand = 4 microseconds) (see the check below)
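    As a check on the figures above (same assumption of 4 floating-point operations per clock cycle per core):

        \[ 32 \times 8 \times 3\,\text{GHz} \times 4 = 3.072\ \text{TFlops}, \qquad 32 \times 10\,\text{GB} = 320\,\text{GB memory} \]

    The quoted Gigabit Ethernet latency of 42.23 microseconds is roughly ten times the 4 microsecond InfiniBand figure; accepting that latency penalty in exchange for commodity pricing is the trade-off implied by the Beowulf, off-the-shelf approach described earlier.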
  • NWU HPC/Grid Campus GRID
  • University Wide Area Network / Internet: total of 45 Mbps (34.2 Mbps international); see the illustrative transfer-time arithmetic below
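    An illustrative calculation (not from the slides) of why this link speed is the limitation the speaker notes refer to: transferring 1 TB of research data over the 34.2 Mbps international link takes roughly

        \[ \frac{1\,\text{TB} \times 8\,\text{bits/byte}}{34.2\,\text{Mbps}} \approx 2.3 \times 10^{5}\ \text{s} \approx 2.7\ \text{days}, \]

    ignoring protocol overhead and any competing campus traffic, which is why SANReN and SEACOM feature in the following slides.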
  • SANReN Vision and the Players: InfraCo, SEACOM
  • SA-Grid: CHPC, NWU, C4, UOVS
  • SEACOM
    • TE-North is a new cable currently being laid across the Mediterranean Sea
    • Cable laying to start Oct. 08
    • Final splicing April 09
    • Service launch June 09
  • International Grid
  • High Performance Computing @ NWU 12/15/2008
  • High Performance Computing & GRID Computing @ North-West University: Sustainable, Efficient, Reliable, High Availability & Performance, >3 TFlops, Scientific Linux