Sponge v2


  1. The Monash Campus Grid Programme: Enhancing Research with High-Performance/High-Throughput Computing
  2. What is HPC? High-performance computing is about leveraging the best and most cost-effective technologies, from processors, memory chips and disks to networks, to provide aggregated computational capability beyond what is typically available to the end user. High-performance means running a single program as quickly as possible; high-throughput means running as many programs as possible within a unit of time. HPC/HTC are enabling technologies for larger experiments, more complex data analyses, and higher accuracy in computational models.
  3. The Monash Campus Grid [architecture diagram]: web and Nimrod front ends sit on top of grid-enabled middleware (Secure Shell/Secure Copy, GT2/GT4, GridFTP), which connects the Monash Sun Grid HPC cluster, the Monash SPONGE Condor pool and LaRDS (peta-scale storage) over the Monash gigabit network. https://confluence-vre.its.monash.edu.au/display/mcgwiki/Monash+MCG
  4. Monash Sun Grid: the central high-performance compute cluster (HPC and HTC capable), run by the Monash eResearch Centre and the Information Technology Services Division. Key features: a dedicated Linux cluster of ~205 computers providing ~1,650 CPU cores; processor configurations from 2 to 48 CPU cores per computer; primary memory configurations from 4 GB to 1,024 GB per machine; a broad range of applications and development environments; flexibility in addressing customer requirements. https://confluence-vre.its.monash.edu.au/display/mcgwiki/Monash+Sun+Grid+Overvie
  5. Monash Sun Grid node types [photos of the node generations acquired between 2005 and 2010].
  6. Monash Sun Grid 2010: very-large RAM nodes (2010). Dell R910, four eight-core Intel Xeon (Nehalem) CPUs per node; two nodes (64 cores total); 1,024 GB RAM per node; 16 x 600 GB 10k RPM SAS disk drives; over 640 Gflop/s; redundant 1.1 kW PSU on each; ~300 Mflop/W. http://www.dell.com/us/business/p/poweredge-r910/pd
  7. Monash Sun Grid 2011: partnership nodes with Engineering (2010-11). Dell R815, four 12-core AMD Opteron CPUs per node; five nodes (240 cores); 128 GB RAM per node; 10G Ethernet; > 2,400 Gflop/s; redundant 1.1 kW PSUs; ~400 Mflop/W. http://www.dell.com/us/business/p/poweredge-r815/pd
  8. Monash Sun Grid summary:
     Name     | Vintage | Node      | Core Count | Gflop/s        | Power Req't | Mflop/W
     MSG-I    | 2005    | V20z      | 70         | 336            | ~17 kW      | 20
     MSG-II   | 2006    | X2100     | 64         | 332            | ~11 kW      | 42
     MSG-IIIe | 2007    | X6220     | 120        | 624            | ~7.2 kW     | 65
     MSG-IV   | 2008    | X4600     | 96         | 885            | ~3.6 kW     | 250
     MSG-III  | 2009    | X6250     | 720        | 7200           | ~23 kW      | 330
     MSG-III  | 2010    | X6250     | 240        | 2400           | ~7 kW       | 330
     MSG-gpu  | 2010    | Dell      | 80         | > 800 + 18,660 | ??          | ??
     MSG-vlm  | 2010    | Dell R910 | 64         | > 640          | ~2.2 kW     | 290
     MSG-pn   | 2011    | Dell R815 | 240        | > 2400         | ~5.5 kW     | 436
     The Monash Sun Grid HPC cluster has 1,694 cores, clocks at over 12.5 (+ 18.6) Tflop/s, and holds more than 5.7 TB of RAM.
  9. Software Stack. S/W development, environments, libraries: gcc, Intel C/Fortran, Intel MKL, IMSL numerical library, Ox, python, openmpi, mpich2, NetCDF, java, PETSc, FFTW, BLAS, LAPACK, gsl, mathematica, octave, matlab (limited). Statistics, econometrics: R, Gauss, Stata. Computational chemistry: Gaussian 09, GaussView 5, Molden, GAMESS, Siesta, Materials Studio (Discovery), AMBER 10. Molecular dynamics: NAMD, LAMMPS, Gromacs. Other applications: Underworld; CFD codes (OpenFOAM, ANSYS Fluent, CFX, viper (user-installed)); CUDA toolkit, Qt, VirtualGL, itksnap, drishti, Paraview; CrystalSpace; FSL; Meep; CircuitScape; Structure and Beast; XMDS; ENViMET (via wine); ENVI/IDL; and a growing list.
  10. Specialist Support and Advice [engagement cycle diagram]: initial engagement, general advice, requirements analysis, account creation, startup tutorial, customised solutions, follow-up and maintenance.
  11. Specialist Support and Advice: cluster queue configuration and management; compute job preparation; custom scripting; software installation and tuning; job performance and/or error diagnosis; etc.
  12. Growth of CPU usage (CPU hours): 2008: 859K; 2009: 3,300K; 2010: 6,863K.
  13. Growth of CPU usage (CPU hours): 2008: 859K; 2009: 3,300K; 2010: 6,863K. The 2010 total alone is roughly 783 CPU years (6,863,000 hours / ~8,766 hours per year).
  14. Active users (actual and projected): 2008: 71; 2009: 145; 2010: 169 as of 24 August.
  15. What to expect in the future? Continued refresh of hardware and software, decommissioning older machines; more grid nodes (CPU cores) to meet growing demand; a scalable and high-performance storage architecture without sacrificing data availability; custom grid nodes and/or subclusters with special configurations to meet user requirements; better integration with grid tools and middleware.
  16. Green IT Strategy
  17. Monash Sun Grid beginnings: MSG-I (2005). Sun V20z, AMD Opteron (dual core); initially 32 nodes (64 cores), with 3 new nodes added in 2007 for a total of 70 cores; 4 GB RAM per node; 336 Gflop/s; ~17 kW; 20 Mflop/W. http://www.sun.com/servers/entry/v20z/index.js
  18. Monash Sun Grid: MSG-II (2006). Sun X2100, AMD Opteron (dual core); initially 24 nodes (48 cores), with 8 nodes added in 2007 for 64 cores at present; 4 GB RAM per node; 332 Gflop/s; ~11 kW; 42 Mflop/W. http://www.sun.com/servers/entry/x2100/ The picture on the right is from Jason Callaway's Flickr page: http://www.flickr.com/photos/29925031@N07/
  19. Monash Sun Grid big-memory boxes: MSG-III (now named MSG-IIIe), 2008. Sun X6220 blades, two dual-core AMD Opterons per node; currently 20 nodes (80 cores), with 10 nodes to be added in 2010 for 120 cores; 40 GB RAM per node; 624 Gflop/s; ~7.2 kW; 330 Mflop/W. http://www.sun.com/servers/blades/x6220/ http://www.sun.com/servers/blades/x6220/datasheet.pdf
  20. Monash Sun Grid 2010: MSG-III expansion and GPU nodes (2010). Sun X6250, two quad-core Intel Xeon CPUs per node; 240 cores; 24 GB RAM per node. Dell nodes connected to two Tesla C1060 GPU cards each; ten nodes (20 GPU cards); 48 GB and 96 GB RAM configurations. http://www.sun.com/servers/blades/x62520/ http://www.nvidia.com/object/product_tesla_c1060_us.html
  21. Monash Sun Grid 2009: MSG-III (2009). Sun X6250, two quad-core Intel Xeon CPUs per node; as of 2009, 720 cores; 16 GB RAM per node; > 7 Tflop/s; ~23 kW; ~330 Mflop/W. http://www.sun.com/servers/blades/x62520/
  22. Monash Sun Grid big SMP boxes: MSG-IV (2009). Sun X4600, eight quad-core AMD Opteron CPUs per node; currently three nodes (96 cores); 96 GB RAM per node; 885 Gflop/s; ~3.6 kW; 250 Mflop/W. http://www.sun.com/servers/blades/x4600/
  23. Benefits of using a cluster. Match the job characteristic to the way you use the cluster: a parallel shared-memory job uses 2, 4, 8 or 32 cores on a single node; a parallel distributed-memory job uses multiple nodes; a sequential job with multiple scenarios or cases uses multiple cores via tools like Nimrod. A sketch of what such a request can look like follows.
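     As a concrete illustration of the first case (a shared-memory job on a single node), here is a minimal sketch of a batch script, assuming the Monash Sun Grid front end accepts standard Sun Grid Engine directives; the parallel-environment name, resource values and program name are illustrative assumptions, not the cluster's actual configuration.

        #!/bin/bash
        # Minimal sketch of a single-node, multi-core (shared-memory) job request.
        # Assumes standard Sun Grid Engine directives; "smp", the core count and
        # the memory request are hypothetical values, not Monash Sun Grid settings.
        #$ -S /bin/bash
        #$ -cwd                  # run in the directory the job was submitted from
        #$ -N shared_mem_demo    # job name
        #$ -pe smp 8             # ask for 8 cores on one node (hypothetical PE name)
        #$ -l h_vmem=4G          # per-slot memory request (illustrative)

        export OMP_NUM_THREADS=$NSLOTS   # let an OpenMP program use the granted cores
        ./my_model input.dat > output.log

     Such a script would be submitted with qsub. A distributed-memory job would instead request an MPI parallel environment and launch the program with mpirun, while a sweep of independent scenarios is usually better expressed as a Nimrod experiment (see the plan file on slide 37).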
  24. SPONGE
  25. Introduction. Serendipitous Processing on Operating Nodes in Grid Environment (SPONGE). Core idea and motivation: resource harnessing, accessibility and utilization. How SPONGE achieves this; what SPONGE can do at the moment; what SPONGE cannot do at the moment; infrastructure and usage statistics (pretty pictures); acknowledgements.
  26. Core Idea and Motivation. The core idea is to harness a tremendous amount of un- or under-utilized computational power to perform high-throughput computing. The motivation is large (giga, tera, peta?) scale computational problems that need high throughput, generally embarrassingly parallel applications, e.g. PSAs. Examples: Latin squares (mathematics), Dr. Ian Wanless and Judith Egan, Department of Mathematics; molecular replacement (biology, chemistry), Jason Schmidberger and Dr. Ashley Buckle, Department of Biochemistry and Molecular Biology; Bayesian estimation of bandwidth in multivariate kernel regression with an unknown error density (business, economics), Han Shang, Dr. Xibin Zhang and Dr. Maxwell King, Department of Business and Economics; an HPC solution for optimization of transit priority in transportation networks, Dr. Mahmoud Mesbah, Department of Civil Engineering. SPONGE targets short-running applications that do not require specialized software or hardware and can be easily parallelized, with a single point of submission, monitoring and control.
  27. Core Idea and Motivation, continued: key focus areas. Resource harnessing: tapping existing infrastructure (no new hardware) that can contribute to solving the computational problem, e.g. student labs in different faculties, ITS, EWS, etc., and staff computers (personal contributions included). Accessibility: how to access these facilities (middleware) and when to access them (access and usage policies). Utilization: how to properly utilize these facilities, through implementation abstraction, a single system image, and job submission, monitoring and control.
  28. How are we achieving this: using Condor. The goal of the Condor Project is to develop, implement, deploy and evaluate mechanisms and policies that support high-throughput computing on large collections of distributively owned computing resources. [Architecture diagram]: users submit jobs directly to the Condor submission node, or via Nimrod or Globus; submission and execution nodes constantly update the Condor head node (central manager); Condor execute nodes are spread across the Caulfield, Clayton and Peninsula campuses.
  29. How are we achieving this, continued: Sponge Works, the configuration layer. The default Condor configuration can be modified centrally, down to the level of individual nodes, covering queue management and resource reservation. [Same architecture diagram]: users submit jobs directly to the Condor submission node or via Globus to the Condor head node; execute nodes span the Caulfield, Clayton and Peninsula campuses. A sketch of the kind of per-node policy this layer controls follows.
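     The following is a minimal sketch of the kind of opportunistic per-machine policy that standard HTCondor configuration macros make possible on a lab or staff PC; the thresholds and expressions are illustrative assumptions, not the settings actually distributed by Sponge Works.

        # Sketch of an opportunistic policy for a non-dedicated machine.
        # Uses standard HTCondor policy macros; all thresholds are illustrative.
        MINUTE        = 60
        NonCondorLoad = (LoadAvg - CondorLoadAvg)
        MachineBusy   = (KeyboardIdle < $(MINUTE)) || ($(NonCondorLoad) > 0.3)

        START    = (KeyboardIdle > 15 * $(MINUTE)) && ($(NonCondorLoad) <= 0.3)
        SUSPEND  = $(MachineBusy)                   # pause jobs when the owner returns
        CONTINUE = (KeyboardIdle > 5 * $(MINUTE)) && ($(NonCondorLoad) <= 0.3)
        PREEMPT  = (TotalSuspensions > 3)           # give up after repeated interruptions
        KILL     = FALSE

     Because these expressions live in the centrally managed configuration, owners keep priority on their own machines while idle cycles still flow into the pool, which is the trade-off the next two slides describe.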
  30. What SPONGE can do: execute large numbers of short-running, embarrassingly parallel jobs by leveraging un- or under-utilized existing computational resources. Sounds simple. Advantages: leveraging idle CPU time that would otherwise remain unused; a single point of job submission, monitoring, control and collation of results (a sketch follows); remote job submission using Nimrod/G or Globus.
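     To illustrate the single-point-of-submission idea, a bag of independent short tasks can be described in one HTCondor submit file and queued with a single command. This is a generic sketch; the executable, file layout and requirements are hypothetical, not SPONGE's actual setup.

        # sweep.sub: a hypothetical HTCondor submit description for 1000
        # independent short tasks, one queued job per input case.
        universe     = vanilla
        executable   = analyse.sh
        arguments    = $(Process)                  # each job gets its own index
        input        = cases/case_$(Process).in
        output       = results/case_$(Process).out
        error        = logs/case_$(Process).err
        log          = sweep.log
        requirements = (OpSys == "LINUX") && (Arch == "X86_64")
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        queue 1000

     Submitting with condor_submit sweep.sub, then watching and steering the run with condor_q and condor_rm, gives the single point of monitoring and control mentioned above; Nimrod/G or Globus can generate equivalent submissions remotely.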
  31. What SPONGE cannot do at the moment. The SPONGE pool consists mostly of non-dedicated computers, with distributed ownership and limited availability. This restricts execution of jobs that: require specialized software or hardware (high memory, large storage space, additional software); take a long time to execute (several days or weeks); or perform inter-process communication.
  32. Some statistics: CPU hours used per user. shikha 2012437.67; jegan 1534528.43; kylee 1166358.76; pxuser 414972.76; iwanless 371833.24; zatsepin 257631.86; hanshang 77930.72; llopes 66747.09; iwanless 30930.82; jvivian 29611.87; jirving 13258.09; nice-user.pcha13 13205.38; wojtek 7095.26; nice-user.wojtek 6890.78; mmesbah 5562.53; transport 5379; philipc 3733.35; shahaan 3251.94; zatsepin 3069.35; kylee 2988.84; jegan 1937.55; transport 1308.44. Total: 688+ CPU years to date.
  33. Statistics contd… [charts]
  34. Acknowledgements: Wojtek Goscinski, Philip Chan, Jefferson Tan.
  35. Nimrod Tools for e-Research. Monash e-Science & Grid Engineering Laboratory, Faculty of Information Technology.
  36. Overview: supporting a software lifecycle; software lifecycle tools.
  37. Plan file [diagram]: a plan file is handed to the Nimrod Portal and the Nimrod/O, Nimrod/E and Nimrod/G tools, whose actuators drive the grid middleware. The plan file on the slide reads:

     parameter pressure float range from 5000 to 6000 points 4
     parameter concent float range from 0.002 to 0.005 points 2
     parameter material text select anyof "Fe" "Al"

     task main
       copy compModel node:compModel
       copy inputFile.skel node:inputFile.skel
       node:substitute inputFile.skel inputFile
       node:execute ./compModel < inputFile > results
       copy node:results results.$jobname
     endtask
  38. From one workstation ..
  39. 39. 39.. Scaled Up
  40. Why is this challenging? Develop, deploy, test…
  41. Why is this challenging? Build, schedule & execute a virtual application.
  42. Approaches to grid programming. General-purpose workflows: a generic solution, with a workflow editor and a scheduler. Special-purpose workflows: solve one class of problem, with a specification language and a scheduler.
  43. Nimrod development cycle [diagram]: jobs are prepared using the portal, scheduled dynamically, sent to available machines and executed; results are displayed and interpreted.
  44. Acknowledgements. Message Lab: Colin Enticott, Slavisa Garic, Blair Bethwaite, Tom Peachy, Jeff Tan. MeRC: Shahaan Ayyub, Philip Chan. Funding & support: CRC for Enterprise Distributed Systems (DSTC), Australian Research Council, GrangeNet (DCITA), Australian Research Collaboration Service (ARCS), Microsoft, Sun Microsystems, IBM, Hewlett Packard, Axceleon. Message Lab wiki: https://messagelab.monash.edu.au/nimrod
