Hpc compass transtec_2012

2,114 views

Published on

Our new HPC compass is available for download!!

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,114
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
37
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Hpc compass transtec_2012

  1. 1. AutomotiveSimulation Risk Analysis High Throughput Computing Price Modelling EngineeringHIGH CAE AerospacePERFORMANCECOMPUTING 2012/13TECHNOLOGYCOMPASS CAD Big Data Analytics Life Sciences
  2. 2. TECHNOLOGY COMPASS INTEL CLUSTER READY ............................................................................62 A Quality Standard for HPC Clusters...................................................... 64TABLE OF CONTENTS AND INTRODUCTION Intel Cluster Ready builds HPC Momentum ..................................... 69 The transtec Benchmarking Center ....................................................... 73HIGH PERFORMANCE COMPUTING .................................................... 4 WINDOWS HPC SERVER 2008 R2 ........................................................74Performance Turns Into Productivity ......................................................6 Elements of the Microsoft HPC Solution ............................................ 76Flexible deployment with xCAT ...................................................................8 Deployment, system management, and monitoring ................. 78 Job scheduling..................................................................................................... 80CLUSTER MANAGEMENT MADE EASY ..............................................12 Service-oriented architecture ................................................................... 82Bright Cluster Manager ................................................................................. 14 Networking and MPI ........................................................................................ 85 Microsoft Office Excel support ................................................................. 88INTELLIGENT HPC WORKLOAD MANAGEMENT .........................28Moab HPC Suite – Enterprise Edition.................................................... 30 PARALLEL NFS ...............................................................................................90New in Moab 7.0 ................................................................................................. 34 The New Standard for HPC Storage ....................................................... 92Moab HPC Suite – Basic Edition................................................................ 37 Whats´s new in NFS 4.1? ............................................................................... 94Moab HPC Suite - Grid Option .................................................................... 43 Panasas HPC Storage ...................................................................................... 99NICE ENGINE FRAME .................................................................................50 NVIDIA GPU COMPUTING ....................................................................110A technical portal for remote visualization ...................................... 52 The CUDA Architecture ............................................................................... 112Application highlights.................................................................................... 54 Codename “Fermi” ......................................................................................... 116Desktop Cloud Virtualization .................................................................... 57 Introducing NVIDIA Parallel Nsight ..................................................... 122Remote Visualization...................................................................................... 58 QLogic TrueScale InfiniBand and GPUs ............................................ 
126 INFINIBAND .................................................................................................130 High-speed interconnects ........................................................................ 132 Top 10 Reasons to Use QLogic TrueScale InfiniBand ................ 136 Intel MPI Library 4.0 Performance ........................................................ 139 InfiniBand Fabric Suite (IFS) – What’s New in Version 6.0 ...... 141 PARSTREAM .................................................................................................144 Big Data Analytics .......................................................................................... 146 GLOSSARY .....................................................................................................156 2
  3. 3. MORE THAN 30 YEARS OF EXPERIENCE IN SCIENTIFIC COMPUTING environment is of a highly heterogeneous nature. Even the1980 marked the beginning of a decade where numerous startups dynamical provisioning of HPC resources as needed does notwere created, some of which later transformed into big players in constitute any problem, thus further leading to maximal utiliza-the IT market. Technical innovations brought dramatic changes tion of the cluster.to the nascent computer market. In Tübingen, close to one of Ger-many’s prime and oldest universities, transtec was founded. transtec HPC solutions use the latest and most innovative technology. Their superior performance goes hand in hand withIn the early days, transtec focused on reselling DEC computers energy efficiency, as you would expect from any leading edge ITand peripherals, delivering high-performance workstations to solution. We regard these basic characteristics.university institutes and research facilities. In 1987, SUN/Sparcand storage solutions broadened the portfolio, enhanced by This brochure focusses on where transtec HPC solutions excel.IBM/RS6000 products in 1991. These were the typical worksta- To name a few: Bright Cluster Manager as the technology leadertions and server systems for high performance computing then, for unified HPC cluster management, leading-edge Moab HPCused by the majority of researchers worldwide. Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our sys-In the late 90s, transtec was one of the first companies to offer tems, Panasas HPC storage systems for highest performancehighly customized HPC cluster solutions based on standard and best scalability required of an HPC storage system. Again,Intel architecture servers, some of which entered the TOP500 with these components, usability and ease of managementlist of the world’s fastest computing systems. are central issues that are addressed. Also, being NVIDIA Tesla Preferred Provider, transtec is able to provide customers withThus, given this background and history, it is fair to say that well-designed, extremely powerful solutions for Tesla GPUtranstec looks back upon a more than 30 years’ experience in computing. QLogic’s InfiniBand Fabric Suite makes managing ascientific computing; our track record shows nearly 500 HPC large InfiniBand fabric easier than ever before – transtec mas-installations. With this experience, we know exactly what cus- terly combines excellent and well-chosen components that aretomers’ demands are and how to meet them. High performance already there to a fine-tuned, customer-specific, and thoroughlyand ease of management – this is what customers require to- designed HPC solution.day. HPC systems are for sure required to peak-perform, as theirname indicates, but that is not enough: they must also be easy Last but not least, your decision for a transtec HPC solutionto handle. Unwieldy design and operational complexity must be means you opt for most intensive customer care and best ser-avoided or at least hidden from administrators and particularly vice in HPC. Our experts will be glad to bring in their expertiseusers of HPC computer systems. and support to assist you at any stage, from HPC design to daily cluster operations, to HPC Cloud Services.transtec HPC solutions deliver ease of management, both in theLinux and Windows worlds, and even where the customer´s Have fun reading the transtec HPC Compass 2012/13! 3
  4. 4. HIGH PERFORMANCECOMPUTINGPERFORMANCETURNS INTOPRODUCTIVITY
  5. 5. High Performance Computing (HPC) has been with us from the verybeginning of the computer era. High-performance computers werebuilt to solve numerous problems which the “human computers” couldnot handle. The term HPC just hadn’t been coined yet. More important,some of the early principles have changed fundamentally.HPC systems in the early days were much different from those we seetoday. First, we saw enormous mainframes from large computer manu-facturers, including a proprietary operating system and job managementsystem. Second, at universities and research institutes, workstationsmade inroads and scientists carried out calculations on their dedicatedUnix or VMS workstations. In either case, if you needed more computingpower, you scaled up, i.e. you bought a bigger machine.Today the term High-Performance Computing has gained a fundamen-tally new meaning. HPC is now perceived as a way to tackle complexmathematical, scientific or engineering problems. The integration ofindustry standard, “off-the-shelf” server hardware into HPC clusters fa-cilitates the construction of computer networks of such power that onesingle system could never achieve. The new paradigm for parallelizationis scaling out. 5
  6. 6. HIGH PERFORMANCE COMPUTING Computer-supported simulations of realistic processes (so- called Computer Aided Engineering – CAE) has established itselfPERFORMANCE TURNS INTO PRODUCTIVITY as a third key pillar in the field of science and research along- side theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate without using simulation software. And scien- tific calculations, such as in the fields of astrophysics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manu- facturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly. The main advantages of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be charged with more power whenever the computational capacity of the system is not sufficient any more, simply by adding additional nodes of the same kind. A “transtec HPC solutions are meant to provide cumbersome switch to a different technology can be avoided customers with unparalleled ease-of-manage- in most cases. ment and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for The primary rationale in using HPC clusters is to grow, to scale the most intensive customer care and the best out computing capacity as far as necessary. To reach that goal, service imaginable” an HPC cluster returns most of the invest when it is continu- ously fed with computing problems. Dr. Oliver Tennert Director Technology Management & HPC Solutions The secondary reason for building scale-out supercomputers is to maximize the utilization of the system. 6
  7. 7. If the individual processes engage in a large amount of com- munication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typi- cally around 10 µs. High-speed interconnects such as InfiniBand, reduce latency by a factor of 10 down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing. The other frequently used variant is called SMP applications.VARIATIONS ON THE THEME: MPP AND SMP SMP, in this HPC context, stands for Shared Memory Processing.Parallel computations exist in two major variants today. Ap- It involves the use of shared memory areas, the specific imple-plications running in parallel on multiple compute nodes are mentation of which is dependent on the choice of the underlyingfrequently so-called Massively Parallel Processing (MPP) applica- operating system. Consequently, SMP jobs generally only run ontions. MPP indicates that the individual processes can each a single node, where they can in turn be multi-threaded and thusutilize exclusive memory areas. This means that such jobs are be parallelized across the number of CPUs per node. For many HPCpredestined to be computed in parallel, distributed across the applications, both the MPP and SMP variant can be chosen.nodes in a cluster. The individual processes can thus utilize theseparate units of the respective node – especially the RAM, the Many applications are not inherently suitable for parallel execu-CPU power and the disk I/O. tion. In such a case, there is no communication between the in- dividual compute nodes, and therefore no need for a high-speedCommunication between the individual processes is imple- network between them; nevertheless, multiple computing jobsmented in a standardized way through the MPI software can be run simultaneously and sequentially on each individualinterface (Message Passing Interface), which abstracts the node, depending on the number of CPUs.underlying network connections between the nodes fromthe processes. However, the MPI standard (current version In order to ensure optimum computing performance for these2.0) merely requires source code compatibility, not binary applications, it must be examined how many CPUs and corescompatibility, so an off-the-shelf application usually needs deliver the optimum performance.specific versions of MPI libraries in order to run. Examples ofMPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel We find applications of this sequential type of work typically inMPI or – for Windows clusters – MS-MPI. the fields of data analysis or Monte-Carlo simulations. 7
  8. 8. HIGH PERFORMANCE COMPUTINGFLEXIBLE DEPLOYMENT WITH XCAT xCAT as a Powerful and Flexible Deployment Tool xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones. xCAT provides simple commands for hardware control, node dis- covery, the collection of MAC addresses, and the node deploy- ment with (diskful) or without local (diskless) installation. The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time. xCAT Provides the Following Low-Level Administrative Features  Remote console support  Parallel remote shell and remote copy commands  Plugins for various monitoring tools like Ganglia or Nagios  Hardware control commands for node discovery, collect- ing MAC addresses, remote power switching and resetting of nodes 8
  9. 9.  Automatic configuration of syslog, remote shell, DNS, DHCP, when the code is self-developed, developers often prefer one and ntp within the cluster MPI implementation over another. Extensive documentation and man pages According to the customer’s wishes, we install various compil-For cluster monitoring, we install and configure the open ers, MPI middleware, as well as job management systems likesource tool Ganglia or the even more powerful open source Parastation, Grid Engine, Torque/Maui, or the very powerfulsolution Nagios, according to the customer’s preferences and Moab HPC Suite for the high-level cluster management.requirements.Local Installation or Diskless InstallationWe offer a diskful or a diskless installation of the cluster nodes.A diskless installation means the operating system is hostedpartially within the main memory, larger parts may or maynot be included via NFS or other means. This approach allowsfor deploying large amounts of nodes very efficiently, and thecluster is up and running within a very small timescale. Also,updating the cluster can be done in a very efficient way. Forthis, only the boot image has to be updated, and the nodes haveto be rebooted. After this, the nodes run either a new kernel oreven a new operating system. Moreover, with this approach,partitioning the cluster can also be very efficiently done, eitherfor testing purposes, or for allocating different cluster parti-tions for different users or applications.Development Tools, Middleware, and ApplicationsAccording to the application, optimization strategy, or underlyingarchitecture, different compilers lead to code results of verydifferent performance. Moreover, different, mainly commercial,applications, require different MPI implementations. And even 9
  10. 10. HPC solution benchmarking of applicationHIGH PERFORMANCE COMPUTING different systems installationPERFORMANCE TURNS INTO PRODUCTIVITY continual improvement maintenance, integration onsite customer into support & hardware training customer’s managed services assembly environmentSERVICES AND CUSTOMER CARE FROM A TO Z application-, burn-in tests software individual Presales customer-, of systems & OS consulting site-specific installation sizing of HPC solution benchmarking of application different systems installation continual improvement maintenance, integration onsite customer into support & hardware training customer’s managed services assembly environment 10
  11. 11. to important middleware components like cluster management or developer tools and the customer’s production applications. Onsite delivery means onsite integration into the customer’s production environment, be it establishing network connectivity to the corporate network, or setting up software and configura- tion parts. transtec HPC clusters are ready-to-run systems – we deliver, youHPC @ TRANSTEC: SERVICES AND CUSTOMER CARE FROM A TO Z turn the key, the system delivers high performance. Every HPCtranstec AG has over 30 years of experience in scientific comput- project entails transfer to production: IT operation processes anding and is one of the earliest manufacturers of HPC clusters. policies apply to the new HPC system. Effectively, IT personnel isFor nearly a decade, transtec has delivered highly customized trained hands-on, introduced to hardware components and soft-High Performance clusters based on standard components to ware, with all operational aspects of configuration management.academic and industry customers across Europe with all thehigh quality standards and the customer-centric approach that transtec services do not stop when the implementation projectstranstec is well known for. ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to theEvery transtec HPC solution is more than just a rack full of hard- customer’s needs. When you are in need of a new installation, aware – it is a comprehensive solution with everything the HPC major reconfiguration or an update of your solution – transtec isuser, owner, and operator need. able to support your staff and, if you lack the resources for main- taining the cluster yourself, maintain the HPC solution for you.In the early stages of any customer’s HPC project, transtec ex- From Professional Services to Managed Services for daily opera-perts provide extensive and detailed consulting to the customer tions and required service levels, transtec will be your complete– they benefit from expertise and experience. Consulting is fol- HPC service and solution provider. transtec’s high standards oflowed by benchmarking of different systems with either specifi- performance, reliability and dependability assure your productiv-cally crafted customer code or generally accepted benchmarking ity and complete satisfaction.routines; this aids customers in sizing and devising the optimaland detailed HPC configuration. transtec’s offerings of HPC Managed Services offer customers the possibility of having the complete management and administra-Each and every piece of HPC hardware that leaves our factory tion of the HPC cluster managed by transtec service specialists,undergoes a burn-in procedure of 24 hours or more if necessary. in an ITIL compliant way. Moreover, transtec’s HPC on DemandWe make sure that any hardware shipped meets our and our services help provide access to HPC resources whenever theycustomers’ quality requirements. transtec HPC solutions are turn- need them, for example, because they do not have the possibilitykey solutions. By default, a transtec HPC cluster has everything of owning and running an HPC cluster themselves, due to lackinginstalled and configured – from hardware and operating system infrastructure, know-how, or admin staff. 11
  12. 12. CLUSTERMANAGEMENTMADE EASY
  13. 13. Bright Cluster Manager removes the complexity from theinstallation, management and use of HPC clusters, withoutcompromizing performance or capability. With Bright ClusterManager, an administrator can easily install, use and managemultiple clusters simultaneously, without the need for expertknowledge of Linux or HPC. 13
  14. 14. CLUSTER MANAGEMENT MADE EASY A UNIFIED APPROACH Other cluster management offerings take a “toolkit” approachBRIGHT CLUSTER MANAGER in which a Linux distribution is combined with many third-partyTHE CLUSTER INSTALLER TAKES THE ADMINISTRATOR THROUGH THE tools for provisioning, monitoring, alerting, etc.INSTALLATION PROCESS AND OFFERS ADVANCED OPTIONS SUCH AS“EXPRESS” AND “REMOTE”. This approach has critical limitations because those separate tools were not designed to work together, were not designed for HPC, and were not designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemons and databases. Countless hours of scripting and testing from highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented. Bright Cluster Manager takes a much more fundamental, inte- grated and unified approach. It was designed and written from the ground up for straightforward, efficient, comprehensive clus- ter management. It has a single lightweight daemon, a central database for all monitoring and configuration data, and a singleBY SELECTING A CLUSTER NODE IN THE TREE ON THE LEFT AND THE TASKS CLI and GUI for all cluster management functionality.TAB ON THE RIGHT, THE ADMINISTRATOR CAN EXECUTE A NUMBER OFPOWERFUL TASKS ON THAT NODE WITH JUST A SINGLE MOUSE CLICK.. This approach makes Bright Cluster Manager extremely easy to use, scalable, secure and reliable, complete, flexible, and easy to maintain and support. EASE OF INSTALLATION Bright Cluster Manager is easy to install. Typically, system admin- istrators can install and test a fully functional cluster from “bare metal” in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving speed and reliability of installation, as well as subsequent maintenance. 14
  15. 15. EASE OF USE are performed through one intuitive, visual interface.Bright Cluster Manager is easy to use. System administrators Multiple clusters can be managed simultaneously. The CMGUIhave two options: the intuitive Cluster Management Graphical runs on Linux, Windows and MacOS (coming soon) and can beUser Interface (CMGUI) and the powerful Cluster Management extended using plugins. The CMSH provides practically the sameShell (CMSH). The CMGUI is a standalone desktop application functionality as the Bright CMGUI, but via a command-line inter-that provides a single system view for managing all hardware face. The CMSH can be used both interactively and in batch modeand software aspects of the cluster through a single point of via scripts. Either way, system administrators now have unprec-control. Administrative functions are streamlined as all tasks edented flexibility and control over their clusters.CLUSTER METRICS, SUCH AS GPU AND CPU TEMPERATURES, FAN SPEEDS AND NETWORKS STATISTICS CAN BE VISUALIZED BY SIMPLY DRAGGING AND DROPPING THEM FROMTHE LIST ON THE LEFT INTO A GRAPHING WINDOW ON THE RIGHT. MULTIPLE METRICS CAN BE COMBINED IN ONE GRAPH AND GRAPHS CAN BE ZOOMED INTO. GRAPH LAYOUTAND COLORS CAN BE TAILORED TO YOUR REQUIREMENTS. 15
  16. 16. CLUSTER MANAGEMENT MADE EASY SUPPORT FOR LINUX AND WINDOWS Bright Cluster Manager is based on Linux and is availableBRIGHT CLUSTER MANAGER with a choice of pre-integrated, pre-configured and opti- mized Linux distributions, including SUSE Linux Enterprise THE STATUS OF CLUSTER NODES, SWITCHES, OTHER HARDWARE, AS WELL AS UP TO SIX METRICS CAN BE VISUALIZED IN THE RACKVIEW. A ZOOM-OUT OPTION IS AVAIL- ABLE FOR CLUSTERS WITH MANY RACKS.THE OVERVIEW TAB PROVIDES INSTANT, HIGH-LEVEL INSIGHT INTOTHE STATUS OF THE CLUSTER. Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installations with Windows HPC Server are supported as well, allowing nodes to either boot from the Bright-managed Linux head node, or the Windows-managed head node. EXTENSIVE DEVELOPMENT ENVIRONMENT Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some optional): 16
  17. 17.  Compilers, including full suites from GNU, Intel, AMD and THE PARALLEL SHELL ALLOWS FOR SIMULTANEOUS EXECUTION OF COMMANDS OR SCRIPTS ACROSS NODE GROUPS OR ACROSS THE ENTIRE CLUSTER. Portland Group Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea OPT GPU libraries, including CUDA and OpenCL MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH- MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and Myrinet Mathematical libraries, including ACML, FFTW, GMP, GotoBLAS, MKL and ScaLAPACK Other libraries, including Global Arrays, HDF5, IIPP, TBB, Net- CDF and PETScBright Cluster Manager also provides Environment Modules to Linux kernels can be assigned to individual images. Incremen-make it easy to maintain multiple versions of compilers, librar- tal changes to images can be deployed to live nodes withouties and applications for different users on the cluster, without rebooting or re-installation.creating compatibility conflicts. Each Environment Module file The provisioning system propagates only changes to thecontains the information needed to configure the shell for an images, minimizing time and impact on system performanceapplication, and automatically sets these variables correctly and availability. Provisioning capability can be assigned tofor the particular application when it is loaded. Bright Cluster any number of nodes on-the-fly, for maximum flexibility andManager includes many preconfigured module files for many scalability. Bright Cluster Manager can also provision overscenarios, such as combinations of compliers, mathematical InfiniBand and to RAM disk.and MPI libraries. COMPREHENSIVE MONITORINGPOWERFUL IMAGE MANAGEMENT AND PROVISIONING With Bright Cluster Manager, system administrators can collect,Bright Cluster Manager features sophisticated software image monitor, visualize and analyze a comprehensive set of metrics.management and provisioning capability. A virtually unlimited Practically all software and hardware metrics available to thenumber of images can be created and assigned to as many Linux kernel, and all hardware management interface metricsdifferent categories of nodes as required. Default or custom (IPMI, iLO, etc.) are sampled. 17
  18. 18. CLUSTER MANAGEMENT MADE EASYBRIGHT CLUSTER MANAGER HIGH PERFORMANCE MEETS EFFICIENCY Initially, massively parallel systems constitute a challenge to both administrators and users. They are complex beasts. Any- one building HPC clusters will need to tame the beast, master the complexity and present users and administrators with an easy-to-use, easy-to-manage system landscape. Leading HPC solution providers such as transtec achieve this goal. They hide the complexity of HPC under the hood and match high performance with efficiency and ease-of-use for both users and administrators. The “P” in “HPC” gains a double meaning: “Performance” plus “Productivity”. Cluster and workload management software like Moab HPC Suite, Bright Cluster Manager or QLogic IFS provide the means to master and hide the inherent complexity of HPC systems. For administrators and users, HPC clusters are presented as single, large machines, with many different tuning parameters. The software also provides a unified view of existing clusters when- ever unified management is added as a requirement by the customer at any point in time after the first installation. Thus, daily routine tasks such as job management, user management, queue partitioning and management, can be performed easily with either graphical or web-based tools, without any advanced scripting skills or technical expertise required from the adminis- trator or user. 18
  19. 19.  Powerful cluster automation functionality allows preemptive actions based on monitoring thresholds  Comprehensive cluster monitoring and health checking framework, including automatic sidelining of unhealthy nodes to prevent job failure Scalability from Deskside to TOP500  Off-loadable provisioning for maximum scalabilityTHE BRIGHT ADVANTAGE  Proven on some of the world’s largest clustersBright Cluster Manager offers many advantages that lead toimproved productivity, uptime, scalability, performance and Minimum Overhead/Maximum Performancesecurity, while reducing total cost of ownership.  Single lightweight daemon drives all functionality  Daemon heavily optimized to minimize effect on operatingRapid Productivity Gains system and applications Easy to learn and use, with an intuitive GUI  Single database stores all metric and configuration data Quick installation: from bare metal to a cluster ready to use, in less than an hour Top Security Fast, flexible provisioning: incremental, live, disk-full, disk-  Automated security and other updates from key-signed less, provisioning over InfiniBand, auto node discovery repositories Comprehensive monitoring: on-the-fly graphs, rackview,  Encrypted external and internal communications (optional) multiple clusters, custom metrics  X.509v3 certificate-based public-key authentication Powerful automation: thresholds, alerts, actions  Role-based access control and complete audit trail Complete GPU support: NVIDIA, AMD ATI, CUDA, OpenCL  Firewalls and secure LDAP On-demand SMP: instant ScaleMP virtual SMP deployment Powerful cluster management shell and SOAP API for auto- mating tasks and creating custom capabilities Seamless integration with leading workload managers: PBS Pro, Moab, Maui, SLURM, Grid Engine, Torque, LSF Integrated (parallel) application development environment. Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories Web-based user portal Bright ComputingMaximum Uptime Unattended, robust head node failover to spare head node 19
  20. 20. CLUSTER MANAGEMENT MADE EASY Examples include CPU and GPU temperatures, fan speeds, switches, hard disk SMART information, system load, memoryBRIGHT CLUSTER MANAGER utilization, network statistics, storage metrics, power systems statistics, and workload management statistics. Custom metrics can also easily be defined. Metric sampling is done very efficiently – in one process, or out-of-band where possible. System administrators have full flexibility over how and when metrics are sampled, and historic data can be consolidated over time to save disk space. THE AUTOMATION CONFIGURATION WIZARD GUIDES THE SYSTEM ADMINISTRATOR THROUGH THE STEPS OF DEFINING A RULE: SELECTING METRICS, DEFINING THRESH- OLDS AND SPECIFYING ACTIONS. CLUSTER MANAGEMENT AUTOMATION Cluster management automation takes preemptive actions when predetermined system thresholds are exceeded, sav- ing time and preventing hardware damage. System thresh- olds can be configured on any of the available metrics. The built-in configuration wizard guides the system administra- 20
  21. 21. tor through the steps of defining a rule: selecting metrics, EXAMPLE GRAPHS THAT VISUALIZE METRICS ON A GPU CLUSTER.defining thresholds and specifying actions. For example,a temperature threshold for GPUs can be established thatresults in the system automatically shutting down an over-heated GPU unit and sending an SMS message to the systemadministrator’s mobile phone. Several predefined actions areavailable, but any Linux command or script can be config-ured as an action.COMPREHENSIVE GPU MANAGEMENTBright Cluster Manager radically reduces the time and ef-fort of managing GPUs, and fully integrates these devicesinto the single view of the overall system. Bright includespowerful GPU management and monitoring capability thatleverages functionality in NVIDIA Tesla GPUs. System admin-istrators can easily assume maximum control of the GPUsand gain instant and time-based status insight. In additionto the standard cluster management capabilities, BrightCluster Manager monitors the full range of GPU metrics,including: MULTI-TASKING VIA PARALLEL SHELL GPU temperature, fan speed, utilization The parallel shell allows simultaneous execution of multiple GPU exclusivity, compute, display, persistance mode commands and scripts across the cluster as a whole, or across GPU memory utilization, ECC statistics easily definable groups of nodes. Output from the executed Unit fan speed, serial number, temperature, power commands is displayed in a convenient way with variable levels usage, voltages and currents, LED status, firmware of verbosity. Running commands and scripts can be killed easily Board serial, driver version, PCI info if necessary. The parallel shell is available through both the CMGUI and the CMSH.Beyond metrics, Bright Cluster Manager features built-insupport for GPU computing with CUDA and OpenCL libraries. INTEGRATED WORKLOAD MANAGEMENTSwitching between current and previous versions of CUDA and Bright Cluster Manager is integrated with a wide selection ofOpenCL has also been made easy. free and commercial workload managers. This integration 21
  22. 22. CLUSTER MANAGEMENT MADE EASY provides a number of benefits:  The selected workload manager gets automatically installedBRIGHT CLUSTER MANAGER and configured  Many workload manager metrics are monitored  The GUI provides a user-friendly interface for configuring, monitoring and managing the selected workload manager  The CMSH and the SOAP API provide direct and powerful access to a number of workload manager commands and metricsWORKLOAD MANAGEMENT QUEUES CAN BE VIEWED AND CON- CREATING AND DISMANTLING A VIRTUAL SMP NODE CAN BE ACHIEVED WITH JUSTFIGURED FROM THE GUI, WITHOUT THE NEED FOR WORKLOAD A FEW CLICKS WITHIN THE GUI OR A SINGLE COMMAND IN THE CLUSTER MANAGE-MANAGEMENT EXPERTISE. MENT SHELL. 22
  23. 23.  Reliable workload manager failover is properly configured MAXIMUM UPTIME WITH HEALTH CHECKING The workload manager is continuously made aware of the Bright Cluster Manager – Advanced Edition includes a powerful health state of nodes (see section on Health Checking) cluster health checking framework that maximizes system uptime. It continually checks multiple health indicators for all hardwareThe following user-selectable workload managers are tightly and software components and proactively initiates correctiveintegrated with Bright Cluster Manager: actions. It can also automatically perform a series of standard PBS Pro, Moab, Maui, LSF and user-defined tests just before starting a new job, to ensure SLURM, Grid Engine, Torque a successful execution. Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing toAlternatively, Lava, LoadLeveler or other workload managers can avoid queue flushing, and process “jailing” to allocate, track, tracebe installed on top of Bright Cluster Manager. and flush completed user processes. The health checking frame- work ensures the highest job throughput, the best overall clusterINTEGRATED SMP SUPPORT efficiency and the lowest administration overhead.Bright Cluster Manager – Advanced Edition dynamically ag-gregates multiple cluster nodes into a single virtual SMP node, WEB-BASED USER PORTALusing ScaleMP’s Versatile SMP™ (vSMP) architecture. Creating The web-based user portal provides read-only access to essentialand dismantling a virtual SMP node can be achieved with just cluster information, including a general overview of the clustera few clicks within the CMGUI. Virtual SMP nodes can also be status, node hardware and software properties, workload managerlaunched and dismantled automatically using the scripting statistics and user-customizable graphs. The User Portal can easilycapabilities of the CMSH. In Bright Cluster Manager a virtual be customized and expanded using PHP and the SOAP API.SMP node behaves like any other node, enabling transparent,on-the-fly provisioning, configuration, monitoring and man- USER AND GROUP MANAGEMENTagement of virtual SMP nodes as part of the overall system Users can be added to the cluster through the CMGUI or themanagement. CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or alternativeMAXIMUM UPTIME WITH HEAD NODE FAILOVER authentication system, can be used instead.Bright Cluster Manager – Advanced Edition allows two headnodes to be configured in active-active failover mode. Both ROLE-BASED ACCESS CONTROL AND AUDITINGhead nodes are on active duty, but if one fails, the other takes Bright Cluster Manager’s role-based access control mechanismover all tasks, seamlessly. allows administrator privileges to be defined on a per-role basis. 23
  24. 24. CLUSTER MANAGEMENT MADE EASY Administrator actions can be audited using an audit file which stores all their write action.BRIGHT CLUSTER MANAGER TOP CLUSTER SECURITY Bright Cluster Manager offers an unprecedented level of secu- rity that can easily be tailored to local requirements. Security features include:  Automated security and other updates from key-signed Linux and Bright Computing repositories  Encrypted internal and external communications  X.509v3 certificate based public-key authentication to the cluster management infrastructure THE WEB-BASED USER PORTAL PROVIDES READ-ONLY ACCESS TO ESSENTIAL CLUSTER INFORMATION, INCLUDING A GENERAL OVERVIEW OF THE CLUSTER STATUS, NODE HARDWARE AND SOFTWARE PROPERTIES, WORKLOAD MANAGER STATISTICS AND USER-CUSTOMIZABLE GRAPHS. “The building blocks for transtec HPC solu- tions must be chosen according to our goals ease-of-management and ease-of-use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that.” Armin Jäger HPC Solution Engineer 24
  25. 25.  Role-based access control and complete audit trail STANDARD AND ADVANCED EDITIONS Firewalls and secure LDAP Bright Cluster Manager is available in two editions: Standard Secure shell access and Advanced. The table on this page lists the differences. You can easily upgrade from the Standard to the Advanced EditionMULTI-CLUSTER CAPABILITY as your cluster grows in size or complexity.Bright Cluster Manager is ideal for organizations that need tomanage multiple clusters, either in one or in multiple locations. DOCUMENTATION AND SERVICESCapabilities include: A comprehensive system administrator manual and user manu- All cluster management and monitoring functionality availa- al are included in PDF format. Customized training and profes- ble for all clusters through one GUI sional services are available. Services include various levels of Selecting any set of configurations in one cluster and support, installation services and consultancy. export them to any or all other clusters with a few mouse clicks Making node images available to other clusters.BRIGHT CLUSTER MANAGER CAN MANAGE MULTIPLE CLUSTERS SIMULTANEOUSLY. CLUSTER HEALTH CHECKS CAN BE VISUALIZED IN THE RACKVIEW. THIS SCREENSHOTTHIS OVERVIEW SHOWS CLUSTERS IN OSLO, ABU DHABI AND HOUSTON, ALL MAN- SHOWS THAT GPU UNIT 41 FAILS A HEALTH CHECK CALLED “ALLFANSRUNNING”.AGED THROUGH ONE GUI. 25
  26. 26. CLUSTER MANAGEMENT MADE EASYBRIGHT CLUSTER MANAGERFEATURE STANDARD ADVANCEDChoice of Linux distributions x xIntel Cluster Ready x xCluster Management GUI x xCluster Management Shell x xWeb-Based User Portal x xSOAP API x xNode Provisioning x xNode Identification x xCluster Monitoring x xCluster Automation x xUser Management x xParallel Shell x xWorkload Manager Integration x xCluster Security x xCompilers x xDebuggers & Profilers x xMPI Libraries x xMathematical Libraries x xEnvironment Modules x xNVIDIA CUDA & OpenCL x xGPU Management & Monitoring x xScaleMP Management & Monitoring - xRedundant Failover Head Nodes - xCluster Health Checking - xOff-loadable Provisioning - xSuggested Number of Nodes 4–128 129–10,000+Multi-Cluster Management - xStandard Support x xPremium Support Optional Optional 26
  27. 27. 27
  28. 28. INTELLIGENTHPC WORKLOADMANAGEMENT
  29. 29. While all HPC systems face challenges in workload demand,resource complexity, and scale, enterprise HPC systems facemore stringent challenges and expectations. Enterprise HPCsystems must meet mission-critical and priority HPC workloaddemands for commercial businesses and business-orientedresearch and academic organizations. They have complex SLAsand priorities to balance. Their HPC workloads directly impactthe revenue, product delivery, and organizational objectivesof their organizations. 29
  30. 30. INTELLIGENT MOAB HPC SUITE Moab is the most powerful intelligence engine for policy-based,HPC WORKLOAD MANAGEMENT predictive scheduling across workloads and resources. MoabMOAB HPC SUITE – ENTERPRISE EDITION HPC Suite accelerates results delivery and maximize utiliza- tion while simplifying workload management across complex, heterogeneous cluster environments. The Moab HPC Suite products leverage the multi-dimensional policies in Moab to continually model and monitor workloads, resources, SLAs, and priorities to optimize workload output. And these policies utilize the unique Moab management abstraction layer that integrates data across heterogeneous resources and resource managers to maximize control as you automate workload man- agement actions. Managing the World’s Top Systems, Ready to Manage Yours Moab manages the world’s largest, most scale-intensive and complex HPC environments in the world including 40% of the top 10 supercomputing systems, nearly 40% of the top 25 and 36% of the compute cores in the top 100 systems based on rankings from www.Top500.org. So you know it is battle-tested and ready “With Moab HPC Suite, we can meet very de- to efficiently and intelligently manage the complexities of your manding customers’ requirements as regards environment. unified management of heterogeneous cluster environments, grid management, and provide MOAB HPC SUITE – ENTERPRISE EDITION them with flexible and powerful configuration Moab HPC Suite - Enterprise Edition provides enterprise-ready and reporting options. Our customers value HPC workload management that self-optimizes the productivity, that highly.” workload uptime and meeting of SLAs and business priorities for HPC systems and HPC cloud. It uses the battle-tested and patented Moab intelligence engine to automate the mission- Thomas Gebert HPC Solution Architect critical workload priorities of enterprise HPC systems. Enterprise customers benefit from a single integrated product that brings 30
  31. 31. together key enterprise HPC capabilities, implementation, train- achievement of business objectives and outcomes that depending, and 24x7 support services to speed the realization of benefits on the results the enterprise HPC systems deliver. Moab HPCfrom their HPC system for their business. Moab HPC Suite – En- Suite Enterprise Edition delivers:terprise Edition delivers: Productivity acceleration Productivity acceleration to get more results faster and at a Uptime automation lower cost Auto-SLA enforcement Moab HPC Suite – Enterprise Edition gets more results delivered Grid- and cloud-ready HPC management faster from HPC resources to lower costs while accelerating overall system, user and administrator productivity. MoabDesigned to Solve Enterprise HPC Challenges provides the unmatched scalability, 90-99 percent utilization,While all HPC systems face challenges in workload and resource and fast and simple job submission that is required to maximizecomplexity, scale and demand, enterprise HPC systems face productivity in enterprise HPC organizations. The Moab intel-more stringent challenges and expectations. Enterprise HPC ligence engine optimizes workload scheduling and orchestratessystems must meet mission-critical and priority HPC workload resource provisioning and management to maximize workloaddemands for commercial businesses and business-oriented speed and quantity. It also unifies workload managementresearch and academic organizations. These organizations have across heterogeneous resources, resource managers and evencomplex SLA and priorities to balance. And their HPC workloads multiple clusters to reduce management complexity and costs.directly impact the revenue, product delivery, and organization-al objectives of their organizations. Uptime automation to ensure workload completes successfullyEnterprise HPC organizations must eliminate job delays and HPC job and resource failures in enterprise HPC systems lead tofailures. They are also seeking to improve resource utilization delayed results and missed organizational opportunities andand workload management efficiency across multiple heteroge- objectives. Moab HPC Suite – Enterprise Edition intelligentlyneous systems. To maximize user productivity, they are required automates workload and resource uptime in HPC systems to en-to make it easier to access and use HPC resources for users and sure that workload completes reliably and avoids these failures.even expand to other clusters or HPC cloud to better handleworkload demand and surges. Auto-SLA enforcement to consistently meet service guaran- tees and business prioritiesBENEFITS Moab HPC Suite – Enterprise Edition uses the powerful MoabMoab HPC Suite - Enterprise Edition offers key benefits to intelligence engine to optimally schedule and dynamicallyreduce costs, improve service performance, and accelerate the adjust workload to consistently meet service level agreementsproductivity of enterprise HPC systems. These benefits drive the (SLAs), guarantees, and business priorities. This automatically 31
  32. 32. INTELLIGENT ensures that the right workloads are completed at the optimal times, taking into account the complex number of departments,HPC WORKLOAD MANAGEMENT priorities and SLAs to be balanced.MOAB HPC SUITE – ENTERPRISE EDITION Grid- and Cloud-ready HPC management to more efficiently manage and meet workload demand The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload and resource demand by sharing workload across multiple clusters through grid management and the HPC cloud management capabilities provided in Moab HPC Suite – Enterprise Edition. CAPABILITIES Moab HPC Suite – Enterprise Edition brings together key en- terprise HPC capabilities into a single integrated product that self-optimizes the productivity, workload uptime, and meeting of SLA’s and priorities for HPC systems and HPC Cloud. Productivity acceleration capabilities deliver more results faster, lower costs, and increase resource, user and administra- tor productivityARCHITECTURE  Massive scalability accelerates job response and through- put, including support for high throughput computing  Workload-optimized allocation policies and provisioning gets more results out of existing heterogeneous resources to reduce costs  Workload unification across heterogeneous clusters maxi- mizes resource availability for workloads and administration efficiency by managing workload as one cluster  Simplified HPC submission and control for both users and ad- ministrators with job arrays, templates, self-service submission 32
  33. 33. portal and administrator dashboard (i.e. usage limits, usage reports, etc.) Optimized intelligent scheduling that packs workloads and  SLA and priority polices ensure the highest priority workloads backfills around priority jobs and reservations while balancing are processed first (i.e. quality of service, hierarchical priority SLAs to efficiently use all available resources weighting, dynamic fairshare policies, etc.) Advanced scheduling and management of GPGPUs for jobs to  Continuous plus future scheduling ensures priorities and gua- maximize their utilization including auto-detection, policy-based rantees are proactively met as conditions and workload levels GPGPU scheduling and GPGPU metrics reporting change (i.e. future reservations, priorities, and pre-emption) Workload-aware auto-power management reduces energy use and costs by 30-40 percent with intelligent workload consolidati- Grid- and cloud-ready HPC management extends the benefits of on and auto-power management your traditional HPC environment to more efficiently manage workload and better meet workload demandUptime automation capabilities ensure workload completes suc-  Pay-for-use showback and chargeback capabilities trackcessfully and reliably, avoiding failures and missed organizational actual resource usage with flexible chargeback options andopportunities and objectives reporting by user or department Intelligent resource placement prevents job failures with gra-  Manage and share workload across multiple remote nular resource modeling that ensures all workload requirements clusters to meet growing workload demand or surges with are met while avoiding at-risk resources the single self-service portal and intelligence engine with Auto-response to incidents and events maximizes job and sys- purchase of Moab HPC Suite - Grid Option tem uptime with configurable actions to pre-failure conditions, amber alerts, or other metrics and monitors ARCHITECTURE Workload-aware maintenance scheduling helps maintain a Moab HPC Suite - Enterprise Edition is architected to integrate stable HPC system without disrupting workload productivity on top of your existing job resource managers and other types Real-world services expertise ensures fast time to value and of resource managers in your environment. It provides policy- system uptime with included package of implementation, trai- based scheduling and management of workloads as well as ning, and 24x7 remote support services resource allocation and provisioning orchestration. The Moab intelligence engine makes complex scheduling and manage-Auto-SLA enforcement schedules and adjusts workload to con- ment decisions based on all of the data it integrates from thesistently meet service guarantees and business priorities so the various resource managers and then orchestrates the job andright workloads are completed at the optimal times management actions through those resource managers. It Department budget enforcement schedules resources in does this without requiring any additional agents. This makes line with resource sharing agreements and budgets it the ideal choice to integrate with existing and new systems 33
  34. 34. INTELLIGENTHPC WORKLOAD MANAGEMENTNEW IN MOAB 7.0 NEW MOAB HPC SUITE 7.0 The new Moab HPC Suite 7.0 releases deliver continued break- through advancements in scalability, reliability, and job array management to accelerate system productivity as well as ex- tended database support. Here is a look at the new capabilities and the value they offer customers: TORQUE Resource Manager Scalability and Reliability Ad- vancements for Petaflop and Beyond As part of the Moab HPC Suite 7.0 releases, the TORQUE 4.0 resource manager features scalability and reliability advance- ments to fully exploit Moab scalability. These advancements maximize your use of increasing hardware capabilities and enable you to meet growing HPC user needs. Key advancements in TORQUE 4.0 for Moab HPC Suite 7.0 include:  The new Job Radix enables you to efficiently run jobs that span tens of thousands or even hundreds of thousands of nodes. Each MOM daemon now cascades job communication with multiple other MOM daemons simultaneously to reduce the job start-up process time to a small fraction of what it would normally take across a large number of nodes. The Job Radix eliminates lost jobs and job start-up bottlenecks caused by having all nodes MOM daemons communicating with only one head MOM node. This saves critical minutes on job start-up process time and allows for higher job throughput. 34
  35. 35.  New MOM daemon communication hierarchy increases gration with existing user portals, plug-ins of resource manag- the number of nodes supported and reduces the overhead ers for rich data integration, and script integration. Customers of cluster status updates by distributing communication now have a standard interface to Moab with REST APIs. across multiple nodes instead of a single TORQUE head node. This makes status updates more efficient faster sched- Simplified Self-Service and Admin Dashboard Portal Experience uling and responsiveness. Moab HPC 7.0 features an enhanced self-service and admin New multi-threading improves response and reliability, dashboard portal with simplified “click-based” job submission allowing for instant feedback to user requests as well as the for end users as well as new visual cluster dashboard views of ability to continue work even if some processes linger. nodes, jobs, and reservations for more efficient management. The Improved network communications with all UDP-based new Visual Cluster dashboard provides administrators and users communication replaced with TCP to make data transfers views of their cluster resources that are easily filtered by almost from node to node more reliable. any factors including id, name, IP address, state, power, pending actions, reservations, load, memory, processors, etc. Users canJob Array Auto-Cancellation Policies Improve System Productivity also quickly filter and view their jobs by name, state, user, group,Moab HPC Suite 7.0 improves system productivity with new job ar- account, wall clock requested, memory requested, start date/ray auto-cancellation policies that cancel remaining sub-jobs in an time, submit date/time, etc. One-click drill-downs provide addi-array once the solution is found in the array results. This frees up tional details and options for management actions.resources, which would otherwise be running irrelevant jobs, to runother jobs in the queue jobs quicker. The job array auto-cancellation Resource Usage Accounting Flexibilitypolicies allow you to set auto-cancellations of sub-jobs based on Moab HPC Suite 7.0 includes more flexible resource usage ac-first, any instance of results success or failure, or specific exit codes. counting options that enable administrators to easily duplicate custom organizational hierarchies such as organization, groups,Extended Database Support Now Includes PostgreSQL and projects, business units, cost centers etc. in the Moab Account-Oracle in Addition to MySQL ing Manager usage budgets and charging structure. This ensuresThe extended database support in Moab HPC Suite 7.0 enables resource usage is budgeted , tracked, and reported or chargedcustomers to use ODBC-compliant PostgreSQL and Oracle back for in the most useful way to admins and their customerdatabases in addition to MySQL. This provides customers the groups and users.flexibility to use the database that best meets their needs or isthe standard for their system.New Moab Web Services Provide Easier Standard Integrationand CustomizationNew Moab Web Services provide easier standard integrationand customization for a customer’s environment such as inte- 35
  36. 36. INTELLIGENT as well as to manage your HPC system as it grows and expands in the future.HPC WORKLOAD MANAGEMENTMOAB HPC SUITE – BASIC EDITION Moab HPC Suite – Enterprise Edition includes the patented Moab intelligence engine that enables it to integrate with and automate management across existing heterogeneous environ- ments to optimize management and workload efficiency. This unique intelligence engine includes:  Industry leading multi-dimensional policies that automate the complex real-time decisions and actions for scheduling workload and allocating and adapting resources. These mul- ti-dimensional policies can model and consider the workload requirements, resource attributes and affinities, SLAs and priorities to enable more complex and efficient decisions to be automated.  Real-time and predictive future environment scheduling that drives more accurate and efficient decisions and service guarantees as it can proactively adjust scheduling and re- source allocations as it projects the impact of workload and resource condition changes.  Open & flexible management abstraction layer lets you integrate the data and orchestrate workload actions across the chaos of complex heterogeneous cluster environments and management middleware to maximize workload control, automation, and optimization. COMPONENTS Moab HPC Suite – Enterprise Edition includes the following inte- grated products and technologies for a complete HPC workload management solution:  Moab Workload Manager: Patented multi-dimensional 36
intelligence engine that automates the complex decisions and orchestrates policy-based workload placement and scheduling as well as resource allocation, provisioning, and energy management
• Moab Cluster Manager: Graphical desktop administrator application for managing, configuring, monitoring, and reporting on Moab-managed clusters
• Moab Viewpoint: Web-based user self-service job submission and management portal and administrator dashboard portal
• Moab Accounting Manager: HPC resource usage budgeting and accounting tool that enforces resource sharing agreements and limits based on departmental budgets and provides showback and chargeback reporting for resource usage
• Moab Services Manager: Integration interfaces to resource managers and third-party tools

Moab HPC Suite – Enterprise Edition is also integrated with TORQUE, which is available as a free download on AdaptiveComputing.com. TORQUE is an open-source job/resource manager that provides continually updated information regarding the state of nodes and workload status. Adaptive Computing is the custodian of the TORQUE project and actively develops the code base in cooperation with the TORQUE community to provide state-of-the-art resource management. Each Moab HPC Suite product subscription includes support for the Moab HPC Suite as well as for TORQUE, if you choose to use TORQUE as the job/resource manager for your cluster.

MOAB HPC SUITE – BASIC EDITION
Moab HPC Suite – Basic Edition is a multi-dimensional, policy-based workload management system that accelerates and automates the scheduling, managing, monitoring, and reporting of HPC workloads on massive-scale, multi-technology installations. Its patented multi-dimensional decision engine accelerates both the decisions and the orchestration of workload across the ideal combination of diverse resources, including specialized resources such as GPGPUs. The speed and accuracy of the decisions and scheduling automation optimize workload throughput and resource utilization, so more work is accomplished in less time with existing resources, controlling costs and increasing the value of HPC investments.

Moab HPC Suite – Basic Edition enables you to address pressing HPC challenges, including:
• Delays to workload start and end times that slow down results
• Inconsistent delivery on service guarantees and SLA commitments
• Under-utilization of resources
• Efficiently managing workload across heterogeneous and hybrid systems of GPGPUs, hardware, and middleware
• Simplifying job submission and management for users and administrators to maximize productivity

Moab HPC Suite – Basic Edition acts as the "brain" of an HPC system, accelerating and automating complex decision-making processes. The patented decision engine is capable of making the complex multi-dimensional, policy-based decisions needed to schedule workload for optimal job speed, job success, and resource utilization. Moab HPC Suite – Basic Edition integrates decision-making data from, and automates actions through, your system's existing mix of resource managers.
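In practice, jobs enter such a system through the resource manager's submission interface, and Moab decides when and where they run. The following Python sketch submits a small batch job to TORQUE; qsub reads the job script from standard input when no script file is given. The job name, script contents, and resource amounts are placeholders, but the -N and -l options use standard qsub syntax.

    # Submit a batch job to TORQUE from Python; Moab then schedules it.
    import subprocess

    script = "#!/bin/sh\ncd $PBS_O_WORKDIR\n./my_solver input.dat\n"  # placeholder workload

    result = subprocess.run(
        ["qsub", "-N", "solver_run",                # job name
         "-l", "nodes=2:ppn=8,walltime=02:00:00"],  # resource request
        input=script, capture_output=True, text=True, check=True)

    print("submitted as", result.stdout.strip())  # qsub prints the new job id

From that point on, the job is visible to Moab's policies, reservations, and reporting through whichever of the suite's interfaces you use to watch it.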
This enables all the dimensions of real-time, granular resource attributes and state, as well as the timing of current and future resource commitments, to be factored into more efficient and accurate scheduling and allocation decisions. It also dramatically simplifies the management tasks and processes across these complex, heterogeneous environments. Moab works with many of the major resource management and industry-standard resource monitoring tools, covering mixed hardware, network, storage, and licenses.

Moab HPC Suite – Basic Edition policies are also able to factor in organizational priorities and complexities when scheduling workload and allocating resources. Moab ensures workload is processed according to organizational priorities and commitments, and that resources are shared fairly across users, groups, and even multiple organizations. This enables organizations to automatically enforce service guarantees and effectively manage organizational complexities with simple policy-based settings.
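These policy-based settings live in Moab's plain-text configuration file, moab.cfg. The excerpt below is a hedged illustration of a fairshare and quality-of-service setup: the parameter names follow Moab's configuration reference, while the credential names and values are invented for the example and would differ on a real system.

    # moab.cfg excerpt (illustrative values only)
    FSPOLICY        DEDICATEDPS      # measure fairshare usage in dedicated processor-seconds
    FSDEPTH         7                # number of past fairshare windows to consider
    FSINTERVAL      24:00:00         # length of each fairshare window
    USERCFG[alice]  FSTARGET=25.0    # user alice targets ~25% of usage
    GROUPCFG[chem]  FSTARGET=40.0    # the chem group targets ~40% of usage
    QOSCFG[premium] PRIORITY=1000    # premium QoS jobs sort ahead of standard work

Because fairshare, priority, and QoS are all expressed this way, adjusting how resources are shared is an edit to a few configuration lines rather than a change to user workflows.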
BENEFITS
Moab HPC Suite – Basic Edition drives more ROI and results from your HPC environment, including:
• Improved job response times and job throughput, with a workload decision engine that accelerates complex workload scheduling decisions to enable faster job start times and high-throughput computing
• Optimized resource utilization, up to 90-99 percent, with multi-dimensional and predictive workload scheduling to accomplish more with your existing resources
• Automated enforcement of service guarantees, priorities, and resource sharing agreements across users, groups, and projects
• Increased productivity, by simplifying HPC use, access, and control for both users and administrators with job arrays, job templates, an optional user portal, and a GUI administrator management and monitoring tool
• Streamlined job turnaround and reduced administrative burden, by unifying and automating workload tasks and resource processes across diverse resources and mixed-system environments, including GPGPUs
• A scalable workload management architecture that can manage peta-scale systems and beyond, is grid-ready, is compatible with existing infrastructure, and is extensible to manage your environment as it grows and evolves

CAPABILITIES
Moab HPC Suite – Basic Edition accelerates workload processing with a patented multi-dimensional decision engine that self-optimizes workload placement, resource utilization, and results output while ensuring organizational priorities are met across the users and groups leveraging the HPC environment.

Policy-driven scheduling intelligently places workload on an optimal set of diverse resources to maximize job throughput and success as well as utilization and the meeting of workload and group priorities:
• Priority, SLA, and resource sharing policies ensure the highest-priority workloads are processed first and resources are shared fairly across users and groups; these include quality-of-service, hierarchical priority weighting, and fairshare target, limit, and weight policies
• Allocation policies optimize resource utilization and prevent job failures with granular resource modeling and scheduling, plus affinity- and node-topology-based placement
• Backfill job scheduling speeds job throughput and maximizes utilization by scheduling smaller or less demanding jobs to fit around priority jobs and reservations, using all available resources
• Security policies control which users and groups can access which resources
• Checkpointing

Real-time and predictive scheduling ensures job priorities and guarantees are proactively met as conditions and workload levels change:
• Advance reservations guarantee that jobs run when required
• Maintenance reservations reserve resources for planned future maintenance to avoid disruption to business workloads
• Predictive scheduling enables the future workload schedule to be continually forecast and adjusted, along with resource allocations, to adapt to changing conditions and to new job and reservation requests

Advanced scheduling and management of GPGPUs maximizes their utilization by jobs:
• Automatic detection and management of GPGPUs in the environment eliminates manual configuration and makes them immediately available for scheduling
• GPGPUs can be exclusively allocated and scheduled on a per-job basis
• Policy-based management and scheduling can use GPGPU metrics
• Quick access to statistics on GPGPU utilization and key metrics, such as error counts, temperature, fan speed, and memory, supports optimal management and issue diagnosis
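As a hedged sketch of what per-job GPGPU allocation looks like from the user side, the submission below asks TORQUE for two GPUs alongside four cores on one node using the gpus= token of the node specification; Moab then schedules and tracks those GPUs like any other consumable resource. Exact syntax support depends on your TORQUE and Moab versions, and the application name and resource amounts are placeholders.

    # Request GPGPUs per job through the TORQUE node specification.
    import subprocess

    script = "#!/bin/sh\n./cuda_app input.dat\n"  # placeholder GPU workload

    subprocess.run(
        ["qsub", "-N", "gpu_job",
         "-l", "nodes=1:ppn=4:gpus=2,walltime=01:00:00"],  # 4 cores + 2 GPUs on one node
        input=script, text=True, check=True)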
Easier submission, management, and control of job arrays improve user productivity and job throughput efficiency:
• Users can easily submit thousands of sub-jobs with a single job submission, with an array index differentiating each array sub-job
• Job array usage limit policies enforce maximum job counts by credential or class
• Simplified reporting and management of job arrays lets end users filter jobs to summarize, track, and manage them at the master job level
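The sketch below shows what such a single-submission array looks like as a TORQUE job script, here with a Python payload: the #PBS -t directive creates the sub-jobs, and each one reads its own index from the environment. The job name, index range, and resource amounts are placeholders, and PBS_ARRAYID is the TORQUE variable name; some scheduler versions expose PBS_ARRAY_INDEX instead, so check your release.

    #!/usr/bin/env python
    #PBS -N param_sweep
    #PBS -t 1-100
    #PBS -l nodes=1:ppn=1,walltime=00:30:00
    # A single "qsub sweep.py" of this script creates 100 sub-jobs
    # (the -t directive above); each sub-job learns which array
    # member it is from an environment variable set by TORQUE.
    import os

    idx = int(os.environ["PBS_ARRAYID"])  # this sub-job's array index
    print("processing parameter set", idx)

Submitted once, this appears to users and administrators as a single master job whose sub-jobs can be tracked and managed together, which is exactly the reporting simplification described above.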
Scalable job performance for large-scale, extreme-scale, and high-throughput computing environments:
• Efficiently manages the submission and scheduling of hundreds of thousands of queued job submissions to support high-throughput computing
• Fast scheduler response to user commands while scheduling, so users and administrators get the real-time job information they need
• Fast job throughput rates get results started and delivered faster and keep resource utilization up

An open and flexible management abstraction layer easily integrates with and automates management across existing heterogeneous resources and middleware to improve management efficiency:
• Rich data integration and aggregation enables you to set powerful, multi-dimensional policies based on the existing real-time resource data monitored, without adding any new agents
• Heterogeneous resource allocation and management for workloads across mixed hardware, specialty resources such as